找工作神器,提取各大网站有效的招聘信息(前程无忧、智联招聘、猎聘网)
找工作神器的主要原理是,根据查询条件去访问相应的网站,通过程序拿到相应网站的HTML代码,再通过相应的正则表达式取相应的信息,再去截取相应的重要信息,再将相应的信息显示在表格里。這里有使用到线程池异步的方式,同时会去三个网站抓取数据,并且会抓取一条解析一条就会在表格里显示出来,这样就避免等待太长时间还看不到结果的尴尬,程序发布后各位园友可以下载程序看看效果如何,还请您能提出宝贵的意见。
整个程序显示的界面效果图:
图片上面显示的是查询条件,输入查询条件后点击查询,下面显示的查询出的数据,分别有三个页签(猎聘网、智联招聘和前程无忧),表格分别显示职位名称、公司名称、公司性质、公司规模、月薪/年薪、工作地点、工作经验、最低学历和发布时间等等信息,日后根据需要还可以继续扩充想要看到的信息,实现看到信息的基本上跟网站上的信息差不多。
查询条件
解析:现在输入的条件有工作地点、薪水范围(上限、下限),关键词、必须包含的关键词,现在暂时只支持以上几种条件,日后可能会继续加入更多的的查询条件(公司名称、公司性质、工作经验、学历要求等等条件,日后再扩展),使查询更方便。
启动查询的代码如下:
通过启动线程池异步的方式同时启动三个网站数据的加载,这样增强用户体验的效果,并且会拿到一条数据解析一条数据,并且及时显示在表格了,这样用户不需要等待太长的时间而看不到结果。
程序启动首先会加载城市对应的ID的一个字典,数据加载如下:
1.前程无忧
前程无忧我相信应该是很多园友找工作的首选,博主就是在这上面注册了简历,并且每次换工作都是在这上面取得了成功,感觉还挺不错,祝愿各位园友都能找到自己称心如意的工作,只要我们大家都一起努力应该都没有问题的。
下面介绍实现逻辑:
1 #region * 前程无忧 2 /// <summary> 3 ///线程池启动调用的方法4 /// </summary> 5 /// <param name="obj"></param> 6 private void Get51JobData(objectobj)7 {8 string workAddress = this.txtAddress.Text.Trim();//工作地点 9 string workAddressId = string.Empty;//工作地点ID 10 string keyWord = this.txtKeyWord.Text.Trim();//关键词 11 string upperSalary = this.txtSalary1.Text.Trim();//薪水范围 12 string lowerSalary = this.txtSalary2.Text.Trim();//薪水范围 13 string mustKey = string.Empty;//是否包含关键词 14 15 jobInfoList2.Clear();16 curJobInfo2 = null;17 dt2.Rows.Clear();18 this.Invoke((MethodInvoker)delegate 19 {20 this.gcJob2.DataSource =dt2;21 });22 Thread th = null;//搜索线程 23 if (th != null)24 {25 th.Abort();26 th = null;27 }28 29 //根据输入的城市找出城市ID 30 KeyValuePair<string, string> kv = dic2.FirstOrDefault(t =>t.Value.Contains(workAddress));31 if (kv.Key == null)32 {33 XtraMessageBox.Show("无法搜索该工作地点", "警告", MessageBoxButtons.OK, MessageBoxIcon.Warning);34 return;35 }36 workAddressId =kv.Key;37 //勾选包含关键词 38 if (this.chkMustKey.Checked)39 {40 mustKey = this.txtMustKey.Text.Trim();41 }42 43 //调用接口 44 JobFactory tws = new JobFactory("51Job", workAddress, workAddressId, keyWord, upperSalary, lowerSalary, mustKey);45 IJob job =tws.GetJob();46 if (job != null)47 {48 job.GetJobEnd -= newGetJobEndEventHandler(job_GetJob2End);49 job.GetJobEnd += newGetJobEndEventHandler(job_GetJob2End);50 th = new Thread(newThreadStart(job.GetJobInfoList));51 th.IsBackground = true;52 th.Start();53 }54 }55 56 /// <summary> 57 ///表格增加一行数据58 /// </summary> 59 /// <param name="o"></param> 60 /// <param name="e"></param> 61 private void job_GetJob2End(objecto, JobInfo e)62 {63 this.Invoke((MethodInvoker)delegate 64 {65 if (e != null)66 {67 jobInfoList2.Add(e);68 curJobInfo2 =e;69 this.gvJob2.AddNewRow();70 }71 else 72 {73 this.layoutControlGroup2.Enabled = true;74 }75 });76 }77 78 /// <summary> 79 ///表格增加行80 /// </summary> 81 /// <param name="sender"></param> 82 /// <param name="e"></param> 83 private void gvJob2_InitNewRow(objectsender, DevExpress.XtraGrid.Views.Grid.InitNewRowEventArgs e)84 {85 try 86 {87 DataRowView dr = this.gvJob2.GetRow(e.RowHandle) asDataRowView;88 dr["Url"] = curJobInfo2.Url;//网站链接 89 dr["Position"] = curJobInfo2.Position;//职位名称 90 dr["Company"] = curJobInfo2.Company;//公司名称 91 dr["Nature"] = curJobInfo2.Nature;//公司性质 92 dr["Scale"] = curJobInfo2.Scale;//公司规模 93 dr["Salary"] = curJobInfo2.Salary;//月薪/年薪 94 dr["Address"] = curJobInfo2.Address;//工作地点 95 dr["Experience"] = curJobInfo2.Experience;//工作经验 96 dr["Education"] = curJobInfo2.Education;//最低学历 97 dr["Time"] = curJobInfo2.Time;//发布时间 98 99 this.gvJob2.UpdateCurrentRow();100 this.gvJob2.RefreshData();101 this.gvJob2.MoveLast();102 }103 catch 104 {105 XtraMessageBox.Show("添加行失败");106 }107 }108 109 /// <summary> 110 ///双击行打开当前行链接111 /// </summary> 112 /// <param name="sender"></param> 113 /// <param name="e"></param> 114 private void gcJob2_DoubleClick(objectsender, EventArgs e)115 {116 string uri = this.gvJob2.GetFocusedDataRow()["Url"].ToString();117 System.Diagnostics.Process.Start(uri);118 }119 #endregion
以上三个函数的作用分别是线程池启动调用的方法、表格增加一行数据、表格增加行和双击行打开当前行链接四个方法,实现这四个方法即可获取前程无忧的数据,那么获取HTML内容和解析HTML需用另外一个类实现,实现这个类如下:
1 public classJobFrom51Job : IJob2 {3 #region * 私有字段 4 private string url = @"http://search.51job.com/jobsearch/search_result.php?";5 6 /// <summary> 7 ///工作地点8 /// </summary> 9 private stringworkAddress;10 /// <summary> 11 ///工作地点ID12 /// </summary> 13 private stringworkAddressId;14 /// <summary> 15 ///关键词16 /// </summary> 17 private stringkeyWord;18 /// <summary> 19 ///包含词20 /// </summary> 21 private stringmustKey;22 #endregion 23 24 public JobFrom51Job(string workAddress, string workAddressId, string keyWord, stringmustKey)25 {26 this.workAddress =workAddress;27 this.workAddressId =workAddressId;28 this.keyWord =keyWord;29 this.mustKey =mustKey;30 }31 32 public eventGetJobEndEventHandler GetJobEnd;33 public voidGetJobInfoList()34 {35 try 36 {37 StringBuilder condition = newStringBuilder();38 condition.Append("jobarea=" +workAddressId);39 if (!string.IsNullOrEmpty(keyWord))40 {41 keyWord = System.Web.HttpUtility.UrlEncode(keyWord, Encoding.GetEncoding("gb2312"));42 condition.Append("&keyword=" +keyWord);43 }44 condition.Append("&keywordtype=2");45 46 url = url +condition.ToString();47 string html = GetHtmlCode.GetByget(url, "gb2312");48 GetJobInfoFromPage(html);49 50 int pageCount = 0;51 //页面数量 52 string pageCountRegexStr = "(?<=name=\"jobid_count\"\\s*?value=\")\\d+(?=\">)";53 Regex pageCountRegex = newRegex(pageCountRegexStr);54 pageCount = (int.Parse(pageCountRegex.Match(html).Value) + 29) / 30;55 56 for (int i = 2; i <= pageCount; i++)57 {58 string url0 = url + string.Format("&curr_page={0}", i);59 html = GetHtmlCode.GetByget(url0, "gb2312");60 GetJobInfoFromPage(html);61 }62 if (GetJobEnd != null)63 {64 GetJobEnd(null, null);65 }66 }67 catch(Exception exMsg)68 {69 throw newException(exMsg.Message);70 }71 }72 73 private void GetJobInfoFromPage(stringpageStr)74 {75 try 76 {77 pageStr = Regex.Replace(pageStr, "\\s", "");78 //职位所有信息 79 string jobInfoRegexStr = "(?<=<trclass=\"tr0\").+?(?=</tr>)";80 Regex jobInfoRegex = newRegex(jobInfoRegexStr);81 MatchCollection jobInfoMC =jobInfoRegex.Matches(pageStr);82 //-- 83 foreach (Match m injobInfoMC)84 {85 if(m.Value.Contains(workAddress))86 {87 //职位URL 88 string urlRegexStr = "(?<=<aadid=\"\"href=\").+?(?=\")";89 string url0 =Regex.Match(m.Value, urlRegexStr).Value;90 GetJobInfoFromUrl(url0);91 }92 }93 }94 catch(Exception exMsg)95 {96 throw newException(exMsg.Message);97 }98 }99 100 //正则表达式过滤:正则表达式,要替换成的文本 101 private static readonly string[][] Filters = 102 {103 new[] { @"(?is)<script.*?>.*?</script>", ""},104 new[] { @"(?is)<style.*?>.*?</style>", ""},105 new[] { @"(?is)<!--.*?-->", "" }, //过滤Html代码中的注释 106 new[] { @"(?is)<footer.*?>.*?</footer>",""},107 new[] { "(?is) <div style=\"width:470px; padding-left:5px;\">.*?</div>",""},108 new[] { "(?is)<div id=\"top\">.*?</iframe> </div></div>",""},109 new[] { "(?is)<div class=\"grayline\" id=\"announcementbody\">.*?</li></ul> </div>",""}110 };111 112 private void GetJobInfoFromUrl(stringurl)113 {114 try 115 {116 JobInfo info = newJobInfo();117 //-- 118 string pageStr = GetHtmlCode.GetByget(url, "gb2312");119 if (string.IsNullOrEmpty(pageStr))120 {121 return;122 }123 //-- 124 pageStr = pageStr.Replace("\r\n", "");//替换换行符125 //获取html,body标签内容 126 string body = string.Empty;127 string bodyFilter = @"(?is)<body.*?</body>";128 Match m =Regex.Match(pageStr, bodyFilter);129 if(m.Success)130 {131 body = m.ToString().Replace("<tr >", "<tr>").Replace("\r\n", "");132 }133 //过滤样式,脚本等不相干标签 134 foreach (var filter inFilters)135 {136 body = Regex.Replace(body, filter[0], filter[1]);137 }138 //-- 139 if (!string.IsNullOrEmpty(mustKey) && !body.Contains(mustKey))140 {141 return;142 }143 body = Regex.Replace(body, "\\s", "");144 145 info.Url =url;146 string basicInfoRegexStr0 = "<tdclass=\"sr_bt\"colspan=\"2\">(.*?)</td>"; //职位名称 147 string position =Regex.Match(body, basicInfoRegexStr0).Value;148 if (string.IsNullOrEmpty(position))149 {150 basicInfoRegexStr0 = "<tdclass=\"sr_bt\"colspan=\"3\">(.*?)</td>";151 position =Regex.Match(body, basicInfoRegexStr0).Value;152 }153 info.Position = string.IsNullOrEmpty(position) ? "" : position.Substring(position.IndexOf(">") + 1, position.IndexOf("</") - position.IndexOf(">") - 1);154 155 string basicInfoRegexStr1 = ".html\">(.*?)</a>";//公司名称 156 string company =Regex.Match(body, basicInfoRegexStr1).Value;157 info.Company = string.IsNullOrEmpty(company) ? "" : company.Substring(company.IndexOf(">") + 1, company.IndexOf("</a>") - company.IndexOf(">") - 1);158 159 string basicInfoRegexStr2 = "工作地点:</td><tdclass=\"txt_2\">(.*?)</td>";//工作地点 160 string address =Regex.Match(body, basicInfoRegexStr2).Value;161 info.Address = string.IsNullOrEmpty(address) ? "" : address.Substring(address.IndexOf("\">") + 2, address.LastIndexOf("</td>") - address.IndexOf("\">") - 2);162 163 string basicInfoRegexStr3 = "公司性质:</strong> (.*?)<br><br><strong>";//公司性质 164 string nature =Regex.Match(body, basicInfoRegexStr3).Value;165 if (string.IsNullOrEmpty(nature))166 {167 basicInfoRegexStr3 = "公司行业:</strong> (.*?)<br><br><strong>";168 nature =Regex.Match(body, basicInfoRegexStr3).Value;169 }170 info.Nature = string.IsNullOrEmpty(nature) ? "" : nature.Substring(26, nature.IndexOf("<br>") - 26);//公司性质 171 172 string basicInfoRegexStr4 = "公司规模:</strong> (.*?)</td>";//公司规模 173 string scale =Regex.Match(body, basicInfoRegexStr4).Value;174 info.Scale = string.IsNullOrEmpty(scale) ? "" : scale.Substring(26, scale.IndexOf("</td>") - 26);175 176 string basicInfoRegexStr5 = "工作年限:</td><tdclass=\"txt_2\">(.*?)</td>";//工作经验 177 string experience =Regex.Match(body, basicInfoRegexStr5).Value;178 info.Experience = string.IsNullOrEmpty(experience) ? "" : experience.Substring(experience.IndexOf("\">") + 2, experience.LastIndexOf("</td>") - experience.IndexOf("\">") - 2);179 180 string basicInfoRegexStr6 = "学 历:</td><tdclass=\"txt_2\">(.*?)</td>";//学历 181 string education =Regex.Match(body, basicInfoRegexStr6).Value;182 info.Education = string.IsNullOrEmpty(education) ? "" : education.Substring(education.IndexOf("\">") + 2, education.LastIndexOf("</td>") - education.IndexOf("\">") - 2);183 184 string basicInfoRegexStr7 = "薪水范围:</td><tdclass=\"txt_2\">(.*?)</td>";//月薪 185 string salary =Regex.Match(body, basicInfoRegexStr7).Value;186 info.Salary = string.IsNullOrEmpty(salary) ? "" : salary.Substring(salary.IndexOf("\">") + 2, salary.LastIndexOf("</td>") - salary.IndexOf("\">") - 2);187 188 string basicInfoRegexStr8 = "发布日期:</td><tdclass=\"txt_2\">(.*?)</td>";//发布时间 189 string time =Regex.Match(body, basicInfoRegexStr8).Value;190 info.Time = string.IsNullOrEmpty(time) ? "" : time.Substring(time.IndexOf("\">") + 2, time.LastIndexOf("</td>") - time.IndexOf("\">") - 2); ;191 192 if (GetJobEnd != null)193 {194 GetJobEnd(pageStr, info);195 }196 }197 catch(Exception exMsg)198 {199 throw newException(exMsg.Message);200 }201 }202 }
以上这个类的作用是分别根据网址获取HTML内容,再根据正则表达式获取招聘相关信息,再通过函数截取相关字段的信息,再组装到前台界面,实现数据的显示,这个里面有一个逻辑就是动态每一条招聘信息的连接,再根据连接去获取HTML信息,相当于这中间有两层解析XML的过程。
2.智联招聘
智联招聘是我自己每次找工作的备选项,每次把前程无忧上的所有招聘信息全部看完后,就会在智联招聘上浏览下,感觉还挺不错的,不知各位园友有没有试下,不过会有很多与前程无忧是重复的招聘信息,所以还得靠自己去区分。
下面介绍实现逻辑:
1 public classJobFromZhiLian : IJob2 {3 #region 私有字段 4 private string url = @"http://sou.zhaopin.com/Jobs/SearchResult.ashx?";5 /// <summary> 6 ///工作地点7 /// </summary> 8 private stringworkAddress;9 /// <summary> 10 ///关键词11 /// </summary> 12 private stringkeyWord;13 /// <summary> 14 ///工资范围15 /// </summary> 16 private stringupperSalary;17 /// <summary> 18 ///工资范围19 /// </summary> 20 private stringlowerSalary;21 /// <summary> 22 ///包含词23 /// </summary> 24 private stringmustKey;25 #endregion 26 27 public JobFromZhiLian(string workAddress, string keyWord, string upperSalary, string lowerSalary, stringmustKey)28 {29 this.workAddress =workAddress;30 this.keyWord =keyWord;31 this.upperSalary =upperSalary;32 this.lowerSalary =lowerSalary;33 this.mustKey =mustKey;34 }35 36 public eventGetJobEndEventHandler GetJobEnd;37 public voidGetJobInfoList()38 {39 try 40 {41 StringBuilder condition = newStringBuilder();42 workAddress = HttpUtility.UrlEncode(workAddress, Encoding.GetEncoding("utf-8"));43 condition.Append("jl=" +workAddress);44 if (!string.IsNullOrEmpty(keyWord))45 {46 keyWord = HttpUtility.UrlEncode(keyWord, Encoding.GetEncoding("utf-8"));47 condition.Append("&kw=" +keyWord);48 }49 condition.Append("&sm=1");50 if (!string.IsNullOrEmpty(upperSalary))51 {52 condition.Append("&sf=" +upperSalary);53 }54 if (!string.IsNullOrEmpty(lowerSalary))55 {56 condition.Append("&st=" +lowerSalary);57 }58 59 url = url +condition.ToString();60 string html = GetHtmlCode.GetByget(url, "utf-8");61 GetJobInfoFromPage(html);62 63 //页面数量 64 string pageCountRegexStr = "(?<=οnkeypress=\"zlapply.searchjob.enter2Page\\(this,event,)\\d+";65 Regex pageCountRegex = newRegex(pageCountRegexStr);66 string pageCountStr = pageCountRegex.Match(html).Groups[0].Value;67 int pageCount = 0;68 int.TryParse(pageCountStr, outpageCount);69 70 for (int i = 2; i <= pageCount; i++)71 {72 string url0 = url + string.Format("&p={0}", i);73 html = GetHtmlCode.GetByget(url0, "utf-8");74 GetJobInfoFromPage(html);75 }76 if (GetJobEnd != null)77 {78 GetJobEnd(null, null);79 }80 }81 catch(Exception exMsg)82 {83 throw newException(exMsg.Message);84 }85 }86 87 88 //正则表达式过滤:正则表达式,要替换成的文本 89 private static readonly string[][] Filters = 90 {91 new[] { @"(?is)<script.*?>.*?</script>", ""},92 new[] { @"(?is)<style.*?>.*?</style>", ""},93 new[] { @"(?is)<!--.*?-->", "" } //过滤Html代码中的注释 94 };95 96 private void GetJobInfoFromPage( stringpageStr)97 {98 try 99 {100 JobInfo info = newJobInfo();101 //-- 102 if (string.IsNullOrEmpty(pageStr))103 {104 return;105 }106 //-- 107 pageStr = pageStr.Replace("\r\n", "");//替换换行符108 //获取html,body标签内容 109 string body = string.Empty;110 string bodyFilter = @"(?is)<body.*?</body>";111 Match m =Regex.Match(pageStr, bodyFilter);112 if(m.Success)113 {114 body = m.ToString().Replace("<tr >", "<tr>").Replace("\r\n", "");115 }116 //过滤样式,脚本等不相干标签 117 foreach (var filter inFilters)118 {119 body = Regex.Replace(body, filter[0], filter[1]);120 }121 ////-- 122 //if (!string.IsNullOrEmpty(mustKey) && !body.Contains(mustKey))123 //{124 //return;125 //} 126 body = Regex.Replace(body, "\\s", "");127 bodyFilter = "(?is)<divclass=\"newlist_list_content\"id=\"newlist_list_content_table\">.*?</dd></dl></div></div></div>";128 Match m1 =Regex.Match(body, bodyFilter);129 if(m1.Success)130 {131 body =m1.ToString();132 }133 134 135 136 137 //info.Url = xurl; 138 139 if (GetJobEnd != null)140 {141 GetJobEnd(pageStr, info);142 }143 144 //pageStr = Regex.Replace(pageStr, "\\s| |<br>|<strong>|</strong>|<b>|</b>", ""); 145 ////职位所有信息 146 //string jobInfoRegexStr = "(?<=<tableclass=\"search-result-tab\">)[\\S\\s]+?(?=</table>)";147 //Regex jobInfoRegex = new Regex(jobInfoRegexStr);148 //MatchCollection jobInfoMC = jobInfoRegex.Matches(pageStr);149 //foreach (Match m in jobInfoMC)150 //{151 //if (!string.IsNullOrEmpty(mustKey) && !m.Value.Contains(mustKey))152 //{153 //return;154 //}155 156 //JobInfo info = new JobInfo();157 158 // //职位名称,url和公司名称159 //string basicInfoRegexStr = "(?<=<ahref=\")([\\w.:+?()/%=#&]+)\"target=\"_blank\".*?>([\\s\\S]+?)(?=</a>)";160 // //地点、公司性质、公司规模、经验、学历、职位月薪161 //string basicInfoRegexStr0 = "(?<=地点:)[-/\\w]+(?=</span>)";162 //string basicInfoRegexStr1 = "(?<=公司性质:)[-/\\w]+(?=</span>)";163 //string basicInfoRegexStr2 = "(?<=公司规模:)[-/\\w]+(?=</span>)";164 //string basicInfoRegexStr3 = "(?<=经验:)[-/\\w]+(?=</span>)";165 //string basicInfoRegexStr4 = "(?<=学历:)[-/\\w]+(?=</span>)";166 //string basicInfoRegexStr5 = "(?<=职位月薪:)[-/\\w]+(?=</span>)";167 // //发布时间168 //string timeInfoRegexStr = "(?<=releasetime\">)\\d{1,2}-\\d{1,2}-\\d{1,2}";169 170 //Regex basicInfoRegex = new Regex(basicInfoRegexStr);171 //MatchCollection basicInfoMC = basicInfoRegex.Matches(m.Value);172 //info.Url = basicInfoMC[0].Groups[1].Value;173 //info.Position = basicInfoMC[0].Groups[2].Value;174 //info.Company = basicInfoMC[1].Groups[2].Value;175 //Regex basicInfoRegex0 = new Regex(basicInfoRegexStr0);176 //info.Address = new Regex(basicInfoRegexStr0).Match(m.Value).Value;177 //info.Nature = new Regex(basicInfoRegexStr1).Match(m.Value).Value;178 //info.Scale = new Regex(basicInfoRegexStr2).Match(m.Value).Value;179 //info.Experience = new Regex(basicInfoRegexStr3).Match(m.Value).Value;180 //info.Education = new Regex(basicInfoRegexStr4).Match(m.Value).Value;181 //info.Salary = new Regex(basicInfoRegexStr5).Match(m.Value).Value;182 //Regex timeInfoRegex = new Regex(timeInfoRegexStr);183 //info.Time = timeInfoRegex.Match(m.Value).Value;184 185 186 //if (GetJobEnd != null)187 //{188 //GetJobEnd(pageStr, info);189 //}190 //} 191 }192 catch(Exception exMsg)193 {194 throw newException(exMsg.Message);195 }196 }197 }
以上为智联招聘解析HTML相关类,以上逻辑中正则表达式还在完善中,还未完全实现成功,正则表达式还有问题。
3.猎聘网
猎聘网也是最近一两年才兴起的,這个网站上基本上都是很多猎头发布的信息,开的工资大多是都是十多二十万年薪的岗位,只要你具备這个实力可以去這个网站看看,应该会有所收获的,不过這个网站也有部分企业自己发布的招聘信息,如果前面两个网站都没有看到自己满意的求职信息,那么這个网站也可以是自己求职的一个补充,不知各位博友是不是支持我这种观点。
下面介绍实现逻辑:
1 public classJobFromLiePin : IJob2 {3 #region * 私有字段 4 private string url = @"http://www.liepin.com/zhaopin/?";5 6 //基本信息 7 private string basicInfoRegexStr = "<a title=[\\s\\S]+?</a>";8 9 /// <summary> 10 ///工作地点11 /// </summary> 12 private stringworkAddress;13 /// <summary> 14 ///工作地点ID15 /// </summary> 16 private stringworkAddressId;17 /// <summary> 18 ///关键词19 /// </summary> 20 private stringkeyWord;21 /// <summary> 22 ///包含词23 /// </summary> 24 private stringmustKey;25 #endregion 26 27 public JobFromLiePin(string workAddress, string workAddressId, string keyWord, stringmustKey)28 {29 this.workAddress =workAddress;30 this.workAddressId =workAddressId;31 this.keyWord =keyWord;32 this.mustKey =mustKey;33 }34 35 public eventGetJobEndEventHandler GetJobEnd;36 public voidGetJobInfoList()37 {38 try 39 {40 StringBuilder condition = newStringBuilder();41 condition.AppendFormat("dqs={0}", workAddressId);42 condition.Append("&searchField=3");43 if (!string.IsNullOrEmpty(keyWord))44 {45 keyWord = HttpUtility.UrlEncode(keyWord, Encoding.GetEncoding("utf-8"));46 condition.Append("&key=" +keyWord);47 }48 condition.Append("&pubTime=30");49 string xurl = string.Empty;50 for (int i = 0; i < 100; i++)51 {52 if (i > 0)53 {54 xurl = url + condition.ToString() + "&curPage=" +i;55 }56 else 57 {58 xurl = url +condition.ToString();59 }60 string html = GetHtmlCode.GetByget(xurl, "utf-8");61 if (string.IsNullOrEmpty(html))62 {63 break;64 }65 GetJobInfoFromPage(html);66 }67 }68 catch(Exception exMsg)69 {70 throw newException(exMsg.Message);71 }72 }73 74 private void GetJobInfoFromPage(stringpageStr)75 {76 try 77 {78 MatchCollection ms =Regex.Matches(pageStr, basicInfoRegexStr);79 //--url 80 string urlRegex = "(?<=href=\")([\\w.:+?()/%=#&]+)";81 //-- 82 foreach (Match m inms)83 {84 if(m.Value.Contains(workAddress))85 {86 string url0 =Regex.Match(m.Value, urlRegex).Value;87 GetJobInfoFromUrl(url0);88 }89 }90 if (GetJobEnd != null)91 {92 GetJobEnd(null, null);93 }94 }95 catch(Exception exMsg)96 {97 throw newException(exMsg.Message);98 }99 }100 101 //正则表达式过滤:正则表达式,要替换成的文本 102 private static readonly string[][] Filters = 103 {104 new[] { @"(?is)<script.*?>.*?</script>", ""},105 new[] { @"(?is)<style.*?>.*?</style>", ""},106 new[] { @"(?is)<!--.*?-->", "" }, //过滤Html代码中的注释 107 new[] { @"(?is)<footer.*?>.*?</footer>",""},108 //new[] { "(?is)<div class=\"job-require bottom-job-require\">.*?</div></div>",""} 109 new[] { @"(?is)<h3>常用链接:.*?</ul>",""}110 };111 112 private void GetJobInfoFromUrl(stringurl)113 {114 try 115 {116 JobInfo info = newJobInfo();117 //-- 118 string pageStr = GetHtmlCode.GetByget(url, "utf-8");119 if (string.IsNullOrEmpty(pageStr))120 {121 return;122 }123 //-- 124 pageStr = pageStr.Replace("\r\n", "");//替换换行符125 //获取html,body标签内容 126 string body = string.Empty;127 string bodyFilter = @"(?is)<body.*?</body>";128 Match m =Regex.Match(pageStr, bodyFilter);129 if(m.Success)130 {131 body = m.ToString().Replace("<tr >", "<tr>").Replace("\r\n", "");132 }133 //过滤样式,脚本等不相干标签 134 foreach (var filter inFilters)135 {136 body = Regex.Replace(body, filter[0], filter[1]);137 }138 //-- 139 if (!string.IsNullOrEmpty(mustKey) && !body.Contains(mustKey))140 {141 return;142 }143 body = Regex.Replace(body, "\\s", "");144 145 info.Url =url;146 147 string basicInfoRegexStr0 = "<h1title=([\\s\\S]+?)>(.*?)</h1>"; //职位名称 148 string position =Regex.Match(body, basicInfoRegexStr0).Value;149 info.Position = string.IsNullOrEmpty(position) ? "" : position.Substring(position.IndexOf(">") + 1, position.IndexOf("</") - position.IndexOf(">") - 1);//职位名称 150 151 string basicInfoRegexStr1 = "</h1><h3>(.*?)</h3>";//公司名称 152 string company =Regex.Match(body, basicInfoRegexStr1).Value;153 info.Company = string.IsNullOrEmpty(company) ? "" : company.Substring(company.IndexOf("<h3>") + 4, company.IndexOf("</h3>") - company.IndexOf("<h3>") - 4);//公司名称 154 155 string basicInfoRegexStr2 = "<divclass=\"resumeclearfix\"><span>(.*?)</span>";//工作地点 156 string address =Regex.Match(body, basicInfoRegexStr2).Value;157 info.Address = string.IsNullOrEmpty(address) ? "" : address.Substring(address.IndexOf("<span>") + 6, address.IndexOf("</") - address.IndexOf("<span>") - 6);//工作地点 158 159 string basicInfoRegexStr3 = "<li><span>企业性质:</span>(.*?)</li>";//公司性质 160 string nature =Regex.Match(body, basicInfoRegexStr3).Value;161 info.Nature = string.IsNullOrEmpty(nature) ? "" : nature.Substring(nature.IndexOf("</span>") + 7, nature.IndexOf("</li>") - nature.IndexOf("</span>") - 7);//公司性质 162 163 if (string.IsNullOrEmpty(info.Nature))164 {165 string basicInfoRegexStr3_1 = "<br><span>性质:</span>(.*?)<br>";166 string nature_1 =Regex.Match(body, basicInfoRegexStr3_1).Value;167 info.Nature = string.IsNullOrEmpty(nature_1) ? "" : nature_1.Substring(nature_1.IndexOf("</span>") + 7, nature_1.LastIndexOf("<br>") - nature_1.IndexOf("</span>") - 7);//公司性质 168 }169 170 string basicInfoRegexStr4 = "<li><span>企业规模:</span>(.*?)</li>";//公司规模 171 string scale =Regex.Match(body, basicInfoRegexStr4).Value;172 info.Scale = string.IsNullOrEmpty(scale) ? "" : scale.Substring(scale.IndexOf("</span>") + 7, scale.IndexOf("</li>") - scale.IndexOf("</span>") - 7);//公司规模 173 174 if (string.IsNullOrEmpty(info.Scale))175 {176 string basicInfoRegexStr4_1 = "<br><span>规模:</span>(.*?)<br>";177 string scale_1 =Regex.Match(body, basicInfoRegexStr4_1).Value;178 info.Scale = info.Nature = string.IsNullOrEmpty(scale_1) ? "" : scale_1.Substring(scale_1.IndexOf("</span>") + 7, scale_1.LastIndexOf("<br>") - scale_1.IndexOf("</span>") - 7);//公司规模 179 }180 181 string basicInfoRegexStr5 = "<spanclass=\"noborder\">(.*?)</span>";//工作经验 182 string experience =Regex.Match(body, basicInfoRegexStr5).Value;183 info.Experience = string.IsNullOrEmpty(experience) ? "" : experience.Substring(experience.IndexOf(">") + 1, experience.IndexOf("</") - experience.IndexOf(">") - 1);//工作经验 184 185 string basicInfoRegexStr6 = "</span><span>(.*?)</span><spanclass=\"noborder\">";//最低学历 186 string education =Regex.Match(body, basicInfoRegexStr6).Value;187 info.Education = string.IsNullOrEmpty(education) ? "" : education.Substring(education.IndexOf("<span>") + 6, education.IndexOf("</span><spanclass=") - education.IndexOf("<span>") - 6);//最低学历 188 189 string basicInfoRegexStr7 = "<pclass=\"job-main-title\">(.*?)<";//月薪 190 string salary =Regex.Match(body, basicInfoRegexStr7).Value;191 info.Salary = string.IsNullOrEmpty(salary) ? "" : salary.Substring(salary.IndexOf(">") + 1, salary.LastIndexOf("<") - salary.IndexOf(">") - 1);//月薪 192 193 string timeInfoRegexStr = "<pclass=\"release-time\">发布时间:<em>(.*?)</em></p>";//发布时间 194 string time =Regex.Match(body, timeInfoRegexStr).Value;195 info.Time = string.IsNullOrEmpty(time) ? "" : time.Substring(time.IndexOf("<em>") + 4, time.IndexOf("</em>") - time.IndexOf("<em>") - 4);//发布时间 196 197 if (GetJobEnd != null)198 {199 GetJobEnd(pageStr, info);200 }201 }202 catch(Exception exMsg)203 {204 throw newException(exMsg.Message);205 }206 }207 }
以上为解析猎聘网招聘信息的类。以下为猎聘网解析出的数据:
找工作神器,提取各大网站有效的招聘信息(前程无忧、智联招聘、猎聘网)相关推荐
- python工作招聘-python爬虫 智联招聘 工作地点
需求:智联上找工作的时候,工作地点在搜索页面只能看到城市-区.看不到具体的地址.(离家近的工作肯定优先考虑) 思路:爬取搜索页面(页面一)然后进去其中一个内页,再爬工作地点(页面二),[废话] 利用的 ...
- python爬虫使用selenium爬取动态网页信息——以智联招聘网站为例
python版本3.6 #导入两个模块 from selenium import webdriver import time from openpyxl import Workbook import ...
- 开课吧python怎么样-找工作得有个大杀招,你看AI换脸这个技能怎么样?
原标题:找工作得有个大杀招,你看AI换脸这个技能怎么样? 你ZAO吗,当你还沉浸在idol爆恋情的悲伤中,别人已经"和偶像同台对戏了"! 相信大部分小伙伴都已经知道"别人 ...
- python nodejs开发web_用nodejs和python实现一个爬虫来爬网站(智联招聘)的信息
最近研究了一下网站爬虫,觉得python和nodejs都有优点,所以我决定实现一个爬虫,用python来抓取网页的源代码,用nodejs的cheerio模块来获取源代码内的数据.正好我有明年换工作的打 ...
- 基于分布式的智联招聘数据的大屏可视化分析与预测
项目需求分析及体系架构 1.1项目介绍 互联网成了海量信息的载体,目前是分析市场趋势.监视竞争对手或者获取销售线索的最佳场所,数据采集以及分析能力已成为驱动业务决策的关键技能.<计算机行业岗位招 ...
- 克服反爬虫机制爬取智联招聘网站
一.实验内容 1.爬取网站: 智联招聘网站(https://www.zhaopin.com/) 2.网站的反爬虫机制: 在我频繁爬取智联招聘网站之后,它会出现以下文字(尽管我已经控制了爬虫的爬 ...
- 5分钟掌握智联招聘网站爬取并保存到MongoDB数据库
前言 本次主题分两篇文章来介绍: 一.数据采集 二.数据分析 第一篇先来介绍数据采集,即用python爬取网站数据. 1 运行环境和python库 先说下运行环境: python3.5 windows ...
- python 爬虫学习:抓取智联招聘网站职位信息(二)
在第一篇文章(python 爬虫学习:抓取智联招聘网站职位信息(一))中,我们介绍了爬取智联招聘网站上基于岗位关键字,及地区进行搜索的岗位信息,并对爬取到的岗位工资数据进行统计并生成直方图展示:同时进 ...
- python 爬虫学习:抓取智联招聘网站职位信息(一)
近期智联招聘的网站风格变化较快,这对于想爬取数据的人来说有些难受.因此,在前人基础上,我整理了针对智联招聘网站的最新结构进行数据抓取的代码,目前支持抓取职位搜索列表页面的列表项,并将职位列表以exlc ...
- ChinaHR(智联招聘)网站的一个小bug
测试空间旗下大头针出品 这几天有个学生面试,笔试的时候让他测试一下ChinaHR(智联招聘)网站,找网站的bug. 找了半天都是找了些界面方面. 像网站的测试,一般主要关注的网站的性能和功能以及内容. ...
最新文章
- ffmpeg linux安装_ffmpeg命令中文手册
- Template Method (C++实现)
- oracle表空间大小规划,关于oracle表空间的规划方法
- 希望PAT耗子尾汁:1014 福尔摩斯的约会 (20分)——22行代码AC
- java多线程实现医院叫号_Java多线程经典题目(医院挂号)
- Q新闻丨吃鸡外挂被开源;Dubbo 3.0来了;工信部约谈百度、支付宝、今日头条;内地iCloud服务将转由云上贵州运营...
- 华为2019实习生机试题1
- Flutter type ‘List<dynamic>‘ is not a subtype of type ‘Map<String, dynamic>‘
- linux学习第八周总结
- 结构建模设计——Solidworks软件之草图绘制中借助新建基准面实现在曲面表面绘制特征的实现步骤总结
- Oracle 查询一个小时之前表的数据
- unix常用操作命令
- 编程 - 变量的命名方法
- 计算机组成原理 主存储器1
- 拨开国产 COS 系统的重重迷雾
- 51单片机实例学习二 按键中断识别、定时器、利用定时器产生乐曲、数摸转换 ADC0804和DAC0832
- 一个计算机高手的成长(转载)
- python做excel多表按列合并_python如何实现excel多表合并(附代码)_后端开发
- 计算机维护工具盘,HB520WZ计算机维护工具盘(FBA文件,2013.10.18)
- 大白话之辩论DDD,阿里面试中台化理解
热门文章
- 工作如何避免情绪内耗
- java大鱼吃小鱼_大鱼吃小鱼Java课程设计
- c 语言图片转字符画,图片转化为字符画(C#版)
- 记录每天背的单词,准备考研。(4月11日)
- 使用UE4开发VR项目_性能优化(二)_思路和方法
- 论文阅读Reasoning with Latent Structure Refinement for Document-Level Relation Extraction
- 软件环境 硬件环境java,软件环境和硬件环境都指什么?
- 云中马在A股上市:总市值约为40亿元,叶福忠为实际控制人
- bzoj 4987 Tree - dp
- VMware网络问题排查思路