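Challenges of scraping LinkedIn pages:

Getting into the site: log in through the web page (with a personal account); go through the API (this requires a series of complicated application steps that I did not understand, and the amount of data per crawl is limited); or post to the login URL directly and keep the login state with a requests Session (no longer workable, https://www.linkedin.com/uas/login-submit).
Scraping difficulty: the items on the page have no clear #css id to target, so you have to find the matching class. The page elements also keep changing, so rules that worked this time may not work next time.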

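Getting into the site

Driving Chrome with selenium: the RSelenium package
The first page I read. It got me started, but some of its methods no longer work and the Ajax rendering part is unnecessary. It only helps with getting into the site, not with scraping, and it is written for Python.
Driving the browser automatically with the RSelenium package in R. It explains in great detail how to set up chromedriver for R, which requires the Java executable.
A more detailed and lively walkthrough of Python selenium. selenium and RSelenium work on the same principles, so the techniques carry over.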

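# Run in cmd first: java -Dwebdriver.chrome.driver=D:\Chromedriver.exe -jar D:\selenium-server-standalone-3.141.59.jar
library(RSelenium)  # the package name is case-sensitive
remDr = remoteDriver(browserName = 'chrome')
# Open a browser session and go to the LinkedIn login page
remDr$open()
loginurl = 'https://www.linkedin.com/login?fromSignIn=true&trk=guest_homepage-basic_nav-header-signin'
remDr$navigate(loginurl)
# Username and password. They are typed in by code here, but you can also type them yourself in the controlled browser window; as long as you do not close Chrome, the cookie is kept.
t0 = remDr$findElement(using = 'css', '#username')
t0$sendKeysToElement(list('enter your username here'))
t0 = remDr$findElement(using = 'css', '#password')
t0$sendKeysToElement(list('enter your password here'))
# Click the sign-in button, then give the page time to load
t0 = remDr$findElement(using = 'css', '.from__button--floating')
t0$clickElement()
Sys.sleep(5)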

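Scraping LinkedIn pages

A reference for the general approach. It is in Python and its methods are outdated, but it is still worth borrowing from: selenium to enter the page + get the page elements + turn off automated-testing mode + Python scraping

The tasks we need to complete:

Enter a person's name and find that person's profile URL.
Open the profile URL and download the resume with save-to-pdf.
A typical LinkedIn profile consists of a series of sections such as top-cv-card, experience, education, and licenses and certificates. Scrape these sections and record the information entry by entry.

Key points for scraping LinkedIn:
Find elements with xpath + class:
xpath='//*[contains(@class,"class-name")]'
Press F12, find the item on the page on the left, right-click it and choose Inspect; the developer tools then jump to the matching block.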
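
As a small illustration of the xpath + class rule, the sketch below applies it with rvest to a made-up HTML snippet. The HTML and the class names (`demo-card`, `demo-card--wide`) are invented for the example and are not real LinkedIn classes.

```r
library(rvest)

# Made-up HTML; the class names are hypothetical, not LinkedIn's
html <- '<div>
  <p class="demo-card ember-view">first</p>
  <p class="demo-card--wide ember-view">second</p>
  <p class="other">third</p>
</div>'

page <- read_html(html)

# contains() is a substring match on the whole class attribute,
# so "demo-card" matches both "demo-card" and "demo-card--wide"
hits <- html_nodes(page, xpath = '//*[contains(@class,"demo-card")]')
hit_text <- html_text(hits)
print(hit_text)
```

Because the match is on a substring, the rule tolerates extra classes being added around the one you target, but it can also pick up more elements than intended, so it is worth checking how many nodes come back.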

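Enter a person's name and find the profile URL

There are two methods:
First, search bing.com for "name + linkedin.com/in". The advantage is that Bing responds quickly and most of the time the first search result is the one we want. The drawback is that when several ids share the same name, we still have to open the LinkedIn search page and screen them a second time.
Second, search for the name on LinkedIn's own search page. The advantage is that multiple ids are easier to tell apart, and you can pick one by hand based on what you know about the person. The drawback is that LinkedIn pages respond slowly and you have to keep refreshing.
The names here had to match specific companies, so only method two with manual screening was usable. The automated part is opening the search page for each name in turn, so that no search has to be typed by hand.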

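# We use method two and go straight to the LinkedIn search page. truename is the person's name; this runs inside a loop over names.
url = paste0('https://www.linkedin.com/search/results/people/?keywords=', truename, '&origin=SWITCH_SEARCH_VERTICAL')
# Open the search page, which lists the matching people
remDr$navigate(url)
# Screen the results by hand and answer in the R console: if the person you want is listed, open that profile manually and type 1; if not, type 0.
# Manual screening was the best option here because the names were too specific for a machine to pick reliably. An ordinary task would not need it.
w = readline()
# w is 0 or 1 and decides the next step: move on to the next name, or collect this person's information.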
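
The loop over names mentioned above could be prepared roughly as follows. This is a minimal sketch: `build_search_url` is a hypothetical helper, the names are made up, and URL-encoding the name with `utils::URLencode` is an addition of mine so that spaces and non-ASCII characters survive in the query string.

```r
# Hypothetical helper: build the people-search URL for one name
build_search_url <- function(truename) {
  paste0('https://www.linkedin.com/search/results/people/?keywords=',
         utils::URLencode(truename, reserved = TRUE),
         '&origin=SWITCH_SEARCH_VERTICAL')
}

names_to_check <- c('Jane Doe', 'Li Wei')  # made-up names
urls <- vapply(names_to_check, build_search_url, character(1), USE.NAMES = FALSE)
print(urls)

# With a live RSelenium session the loop would look roughly like this (not run here):
# for (u in urls) {
#   remDr$navigate(u)
#   w <- readline()   # type 1 if the person is found, 0 to skip
#   if (w == '1') {
#     # collect this person's profile
#   }
# }
```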

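Download the resume

Once on the person's profile page, download the resume: first, click the More button; second, click the save-to-pdf button; third, wait until the download has finished.
The two clicks are the most important part. The way to find an element that works across all LinkedIn profile pages is to locate it by class.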

profileurl = remDr$getCurrentUrl()
# save resume pdf
t4 = remDr$findElement(using = 'xpath', '//*[contains(@class,"overflow-toggle")]')
t4$clickElement()
t5 = remDr$findElement(using = 'xpath', '//*[contains(@class, "profile-actions--save-to-pdf")]')
t5$clickElement()
Sys.sleep(5)
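
A fixed Sys.sleep(5) does not guarantee that the download has finished. One possible alternative, sketched below, is to poll the download folder. `wait_for_pdf` is a hypothetical helper, and the sketch assumes Chrome's habit of keeping a `.crdownload` file while a download is still in progress; the demo uses a temporary folder in place of the real download folder.

```r
# Hypothetical helper: wait until a finished PDF shows up in download_dir.
# Chrome keeps a ".crdownload" file while a download is still in progress.
wait_for_pdf <- function(download_dir, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    partial <- list.files(download_dir, pattern = '\\.crdownload$')
    pdfs <- list.files(download_dir, pattern = '\\.pdf$', full.names = TRUE)
    if (length(partial) == 0 && length(pdfs) > 0) {
      # Return the most recently modified PDF
      return(pdfs[which.max(file.mtime(pdfs))])
    }
    if (Sys.time() >= deadline) return(NA_character_)
    Sys.sleep(interval)
  }
}

# Demo with a temporary folder standing in for the download folder
demo_dir <- tempfile('downloads')
dir.create(demo_dir)
file.create(file.path(demo_dir, 'Profile.pdf'))
found <- wait_for_pdf(demo_dir, timeout = 2)
print(basename(found))

# An empty folder times out and yields NA
empty_dir <- tempfile('empty')
dir.create(empty_dir)
missing <- wait_for_pdf(empty_dir, timeout = 1, interval = 0.2)
```

In a real run the folder may already contain older PDFs, so it would be safer to compare the listing before and after the click; the sketch leaves that out.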

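Scraping experience, education and other sections from the profile

Before reading the page elements, every "show more roles" link has to be expanded. There are usually several of them, so lapply is used to click them one after another.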

# expand "show more roles"
t4 = remDr$findElements(using = 'css', '.link-without-hover-state')
lapply(t4, function(e) e$clickElement())
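
If one of these links is not clickable, `clickElement()` raises an error and the whole lapply stops. A defensive variant might wrap each click in tryCatch. The sketch below is an assumption of mine rather than part of the original workflow; `safe_click` is a hypothetical helper, and plain lists stand in for RSelenium elements so that it runs without a browser.

```r
# Hypothetical helper: click one element and report TRUE/FALSE instead of stopping
safe_click <- function(e) {
  tryCatch({
    e$clickElement()
    TRUE
  }, error = function(err) FALSE)
}

# Stand-ins for RSelenium elements so the sketch runs without a browser
ok_element  <- list(clickElement = function() invisible(NULL))
bad_element <- list(clickElement = function() stop('element not clickable'))

clicked <- vapply(list(ok_element, bad_element), safe_click, logical(1))
print(clicked)
```

With a live session, `vapply(t4, safe_click, logical(1))` would play the same role as the lapply call and also show which clicks failed.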

Get the page elements section by section

library(rvest)  # read_html(), html_nodes() and the %>% pipe
# Get the page source
a = remDr$getPageSource()[[1]]
# Parse the source as HTML and keep only the body
b = read_html(a) %>% html_nodes('body')
# Extract each section by its class
exp1 = b %>% html_nodes(xpath = '//*[contains(@class,"pv-entity__position-group-pager pv-profile-section__list-item ember-view")]')
edu1 = b %>% html_nodes(xpath = '//*[contains(@class,"pv-profile-section__list-item pv-education-entity pv-profile-section__card-item ember-view")]')
cert1 = b %>% html_nodes(xpath = '//*[contains(@class,"pv-profile-section__sortable-item pv-certification-entity ember-view")]')
conn1 = b %>% html_nodes(xpath = '//*[contains(@class,"pv-top-card--list pv-top-card--list-bullet mt1")]')

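Next, the details of each entry are recorded section by section (experience, education and so on). Education is shown here; the text processing uses the rvest and stringr packages. Writing it this way is admittedly tedious, but it produces the spreadsheet automatically in the Excel layout that was required.
One difficulty is that the information in the sections is not standardized. Some people give only the year and no month or day; some fill in only the company and nothing else. You therefore have to anticipate many cases and write tolerant code.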

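library(stringr)  # str_split()
education = plyr::llply(as.list(edu1), function(data) {
  # Flatten the node to text and split it into non-empty lines
  text = data %>% html_text()
  text = gsub('  ', '', text)
  text1 = str_split(text, '\n')[[1]]
  text1 = text1[which(text1 != '')]
  # The school name comes first; every other value sits on the line after its label
  schoolname = text1[1]
  schooldate = text1[grep("Dates attended or expected graduation", text1) + 1]
  degreename = text1[grep("Degree Name", text1) + 1]
  major = text1[grep("Field Of Study", text1) + 1]
  # Missing fields become NA so that data.frame() still gets one value per column
  if (length(schoolname) == 0) {schoolname = NA}
  if (length(degreename) == 0) {degreename = NA}
  if (length(major) == 0) {major = NA}
  if (length(schooldate) != 0) {
    # Dates look like "2010 – 2014"; a missing end year becomes NA
    years = str_split(schooldate, ' – ')[[1]]
    begindate = as.numeric(years[1])
    enddate = as.numeric(years[2])
  } else {
    begindate = NA; enddate = NA
  }
  # One row per education entry
  data.frame(Staff_name = truename, School_Name = schoolname, Begin_Year = begindate, End_Year = enddate, Degree = degreename, Major = major)
})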
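
To show what tolerant code means on a concrete case, the sketch below parses a made-up education entry that has no "Field Of Study" line. `field_after` is a hypothetical helper and the sample text is invented; only the labels are taken from the code above. It uses base R only, and the en dash is written as `\u2013` to avoid encoding problems.

```r
# Hypothetical helper: return the line after `label`, or NA if the label is absent
field_after <- function(lines, label) {
  i <- grep(label, lines, fixed = TRUE)
  if (length(i) == 0 || i[1] + 1 > length(lines)) return(NA_character_)
  lines[i[1] + 1]
}

# Made-up text in the shape of one education entry, without "Field Of Study"
text <- "Example University\nDegree Name\nBachelor of Science\nDates attended or expected graduation\n2010 \u2013 2014"
lines <- strsplit(text, '\n', fixed = TRUE)[[1]]
lines <- lines[lines != '']

degree <- field_after(lines, 'Degree Name')
major  <- field_after(lines, 'Field Of Study')   # missing, so NA
dates  <- field_after(lines, 'Dates attended or expected graduation')
years  <- as.numeric(strsplit(dates, ' \u2013 ', fixed = TRUE)[[1]])
print(list(degree = degree, major = major, begin = years[1], end = years[2]))
```

Putting the missing-label check in one helper avoids repeating the `length(...) == 0` test for every field.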
