Learning roadmap
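requests -> Beautiful Soup -> re (regular expressions) -> Scrapy
Choosing a tool by crawl scale:
- Small scale, small data volume, crawl speed not a concern: the Requests library
- Medium scale, larger data volume, crawl speed matters: the Scrapy library
- Large scale (search engines), crawl speed is critical: custom development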
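I. The requests library
First, understand the HTTP protocol.
| Method | Description |
| --- | --- |
| GET | Request the resource at the URL |
| HEAD | Request the response headers for the resource at the URL, i.e. only its header information |
| POST | Request that new data be appended to the resource at the URL |
| PUT | Request that a resource be stored at the URL, overwriting the resource already there |
| PATCH | Request a partial update of the resource at the URL, i.e. change part of its content |
| DELETE | Request deletion of the resource stored at the URL |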
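1. The 7 main methods of the requests library
| Method | Description |
| --- | --- |
| requests.request() | Builds a request; the base method that underlies all the methods below |
| requests.get() | The main method for fetching an HTML page; corresponds to HTTP GET |
| requests.head() | Fetches the header information of an HTML page; corresponds to HTTP HEAD |
| requests.post() | Submits a POST request to an HTML page; corresponds to HTTP POST |
| requests.put() | Submits a PUT request to an HTML page; corresponds to HTTP PUT |
| requests.patch() | Submits a partial-modification request to an HTML page; corresponds to HTTP PATCH |
| requests.delete() | Submits a delete request to an HTML page; corresponds to HTTP DELETE |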
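2. Response object attributes
| Attribute | Description |
| --- | --- |
| r.status_code | Status code returned for the HTTP request; 200 means success, other codes such as 404 mean failure |
| r.text | The HTTP response body as a string, i.e. the content of the page at the URL |
| r.encoding | Encoding of the response body as guessed from the HTTP headers |
| r.apparent_encoding | Encoding of the response body as inferred from the content itself (a fallback encoding) |
| r.content | The HTTP response body in binary form |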
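For example:
import requests
r = requests.get("http://www.baidu.com")
print(r.status_code)  # status code of the request; 200 means success
print(r.headers)  # response headers
print(r.encoding)  # encoding guessed from the response headers
print(r.content)  # response body in binary form
print(r.apparent_encoding)  # encoding inferred from the body
r.encoding = 'utf-8'
print(r.text)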
Explanation:
r = requests.get("http://www.baidu.com")
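In r = requests.get(...), the call on the right builds a Request object that asks the server for the resource, and the r on the left is the Response object that comes back.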
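To see why the example sets r.encoding before reading r.text, here is a small sketch that uses only the standard library and makes no network request. It assumes a page whose bytes are UTF-8 but which gets decoded as ISO-8859-1, the kind of mismatch that can happen when the headers do not name a charset; the sample string is made up for illustration.

```python
# Raw bytes as a server might send them (this plays the role of r.content).
content = "百度一下".encode("utf-8")

# Decoding with the wrong encoding yields garbled text,
# much like r.text when r.encoding is a bad guess.
wrong_text = content.decode("iso-8859-1")

# Decoding with the right encoding recovers the page text.
right_text = content.decode("utf-8")

print(repr(wrong_text))
print(right_text)
```

The bytes are the same in both cases; only the encoding used to turn them into a string changes, which is what assigning r.encoding controls.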
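3. Exceptions
| Exception | Description |
| --- | --- |
| requests.ConnectionError | Network connection error, such as a failed DNS lookup or a refused connection |
| requests.HTTPError | HTTP error |
| requests.URLRequired | A URL is missing |
| requests.TooManyRedirects | The maximum number of redirects was exceeded |
| requests.ConnectTimeout | Timed out while connecting to the remote server |
| requests.Timeout | The request to the URL timed out |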
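For example:
import requests

def getHTMLTEXT(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError if the status code is 4xx or 5xx
        r.encoding = r.apparent_encoding  # use the encoding inferred from the body
        return r.text  # content of the page
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLTEXT(url))
This runs successfully.
Replacing "http://www.baidu.com" with "www.baidu.com" triggers the exception branch.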
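The second URL fails because it has no scheme (the "http://" part), and requests refuses to send a request to a URL without one. A minimal sketch of that check using the standard library follows; has_http_scheme is a hypothetical helper written for this illustration, not something requests provides.

```python
from urllib.parse import urlsplit


def has_http_scheme(url):
    """Return True if the URL starts with an http:// or https:// scheme."""
    return urlsplit(url).scheme in ("http", "https")


print(has_http_scheme("http://www.baidu.com"))  # True
print(has_http_scheme("www.baidu.com"))         # False
```

A check like this could be run before calling getHTMLTEXT to give a clearer message than the generic one returned from the except branch.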
4. requests.request(method, url, **kwargs)
- method: the request method, one of 7 kinds such as GET/PUT/POST
- url: the URL of the page to fetch
- **kwargs: parameters that control the request, 13 in total
method: the request method
r = requests.request('GET', url, **kwargs)
r = requests.request('HEAD', url, **kwargs)
r = requests.request('POST', url, **kwargs)
r = requests.request('PUT', url, **kwargs)
r = requests.request('PATCH', url, **kwargs)
r = requests.request('DELETE', url, **kwargs)
r = requests.request('OPTIONS', url, **kwargs)
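Of the keyword arguments, params is probably the easiest to picture: it adds key-value pairs to the URL as a query string. The sketch below only imitates that effect with the standard library. build_url is a hypothetical helper, not part of requests, and the URL and parameter names are made up.

```python
from urllib.parse import urlencode


def build_url(url, params):
    """Append params to url as a query string (hypothetical helper)."""
    return url + "?" + urlencode(params)


full_url = build_url("http://example.com/search", {"q": "python", "page": 2})
print(full_url)  # http://example.com/search?q=python&page=2
```

With requests itself, the same dictionary would be passed as params=... and the library would build the final URL.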
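II. Using the BeautifulSoup library
It is used to parse HTML documents and similar markup.
1. The five basic elements
It is usually imported like this:
from bs4 import BeautifulSoup
or
import bs4
| Basic element | Description |
| --- | --- |
| Tag | A tag, the most basic unit of information, opened with <> and closed with </> |
| Name | The tag's name; for example, the name of <p>…</p> is "p". Access: <tag>.name |
| Attributes | The tag's attributes, organized as a dictionary. Access: <tag>.attrs |
| NavigableString | The non-attribute string inside a tag, i.e. the text between <> and </>. Access: <tag>.string |
| Comment | The comment portion of the string inside a tag; a special string type (Comment) |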
Example 1: the prettify() method
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())  # print the document with one tag per line, indented
Example 2:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.a.name)  # name of the first <a> tag
print(soup.a.parent.name)  # name of its parent tag
print(soup.a.parent.parent.name)  # name of its grandparent tag
Example 3:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# get the attributes
tag = soup.a
print(tag.attrs)
print(tag.attrs['class'])
print(type(tag.attrs))
print(type(tag))
2. Traversal in the bs4 library
- contents
- children
- descendants
- parent
- parents
- next_sibling
- previous_sibling
- next_siblings
- previous_siblings
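3. Information markup
Marked-up information is easier to understand and use.
(1) XML
The earliest general-purpose information markup language; highly extensible but verbose.
(2) JSON: typed key-value pairs
The information carries types, which suits processing by programs (e.g. JavaScript); more concise than XML.
(3) YAML: untyped key-value pairs
The information is untyped, the share of actual text is the highest, and readability is good.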
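The difference between XML and JSON can be seen by parsing the same record in both formats. The sketch below uses only the standard library; the record (a name and an age) is made up for illustration, and YAML is left out because parsing it needs a third-party package.

```python
import json
import xml.etree.ElementTree as ET

xml_text = "<person><name>Alice</name><age>30</age></person>"
json_text = '{"name": "Alice", "age": 30}'

# XML: every value comes back as text, so the age is the string "30".
root = ET.fromstring(xml_text)
xml_record = {child.tag: child.text for child in root}

# JSON: values carry types, so the age is already an int.
json_record = json.loads(json_text)

print(xml_record)   # {'name': 'Alice', 'age': '30'}
print(json_record)  # {'name': 'Alice', 'age': 30}
```

The XML version also needs an opening and closing tag around every value, which is the verbosity mentioned above.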
4. Mastering <>.find_all('a')
For example:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
print(soup.find_all('a'))  # all <a> tags
print(soup.find_all(['a', 'b']))  # all <a> and <b> tags
for tag in soup.find_all(True):  # True matches every tag
    print(tag.name)

import re
import requests
from bs4 import BeautifulSoup

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
for tag in soup.find_all(re.compile('b')):  # tags whose name contains "b"
    print(tag.name)
print(soup.find_all(id='link1'))  # tags whose id attribute is "link1"
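Summary example: scraping university rankings
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    # Fetch the page; return "" on any network or HTTP error
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    # Extract the key fields from each table row and append them to ulist
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find('tbody')
    if tbody is None:  # empty page or changed layout
        return
    for tr in tbody.children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[2].string])

def printList(ulist, num):
    # Center each column; chr(12288) is the full-width space, used to pad Chinese names
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("Rank", "University", "Score", chr(12288)))
    for u in ulist[:num]:  # slicing avoids an IndexError when fewer rows were found
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.com/zuihaodaxuepaiming-zongbang-2020.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printList(uinfo, 20)

main()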
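The format template in printList is the least obvious part of this example, so here it is in isolation. In `{1:{3}^10}` the fill character is itself taken from argument 3, which lets the name column be padded with the full-width space `chr(12288)` instead of an ASCII space. The helper name `center_row` and the sample score are made up for this sketch.

```python
FULL_WIDTH_SPACE = chr(12288)  # U+3000, as wide as one Chinese character

def center_row(rank, name, score, fill=FULL_WIDTH_SPACE):
    """Center three columns in width 10; only the name column uses `fill`."""
    template = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    return template.format(rank, name, score, fill)

# Hypothetical row; the score is not real data.
row = center_row("1", "清华大学", "852.5")
name_cell = row.split("\t")[1]

print(repr(name_cell))
print(len(name_cell))  # 10: four characters plus six fill characters
```

Padding Chinese text with ASCII spaces makes columns drift, because an ASCII space is only about half as wide as a Chinese character in most fonts.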
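III. Regular expressions
1. Commonly used for text processing
Express the characteristics of a type of text
Find or replace a whole group of strings at once
Match all or part of a string
2. Syntax
Common operators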
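| Operator | Description | Example |
| --- | --- | --- |
| . | Any single character | |
| [ ] | Character set: the allowed values for a single character | [abc] matches a, b, or c; [a-z] matches a single character from a to z |
| [^ ] | Negated character set: the excluded values for a single character | [^abc] matches any single character other than a, b, or c |
| * | Zero or more repetitions of the preceding character | abc* matches ab, abc, abcc, abccc, etc. |
| + | One or more repetitions of the preceding character | abc+ matches abc, abcc, abccc, etc. |
| ? | Zero or one repetition of the preceding character | abc? matches ab, abc |
| \| | Either the left or the right expression | abc\|def matches abc or def |
| {m} | Exactly m repetitions of the preceding character | ab{2}c matches abbc |
| {m,n} | Between m and n repetitions of the preceding character (n included) | ab{1,2}c matches abc, abbc |
| ^ | Matches the start of the string | ^abc matches abc at the start of a string |
| $ | Matches the end of the string | abc$ matches abc at the end of a string |
| ( ) | Grouping; the \| operator is typically used inside | (abc) matches abc; (abc\|def) matches abc or def |
| \d | Digit; equivalent to [0-9] for ASCII text | |
| \w | Word character; equivalent to [A-Za-z0-9_] for ASCII text | |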
For example, the integers 0-99: [1-9]?\d
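A quick way to check a pattern such as the 0-99 example is to test whole strings against it with `re.fullmatch`. The helper name `fully_matches` and the sample strings are made up for this sketch.

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

zero_to_99 = r"[1-9]?\d"   # optional leading non-zero digit, then one digit
candidates = ["0", "7", "42", "99", "100", "007"]
accepted = [s for s in candidates if fully_matches(zero_to_99, s)]

print(accepted)                                # ['0', '7', '42', '99']
print(fully_matches(r"ab{1,2}c", "abbc"))      # True
print(fully_matches(r"ab{1,2}c", "abbbc"))     # False
print(fully_matches(r"abc|def", "def"))        # True
```

`re.search` would accept "100" here because it only needs to find the pattern somewhere in the string; `re.fullmatch` is what enforces the 0-99 range.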
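(1) re.search()
Searches a string for the first position that matches the regular expression; returns a match object (or None if nothing matches)
(2) re.match()
Matches the regular expression starting from the beginning of the string; returns a match object (or None if nothing matches)
(3) re.findall()
Searches the string and returns all matching substrings as a list
(4) re.split()
Splits a string at each match of the regular expression; returns a list
(5) re.finditer()
Searches the string and returns an iterator over the matches; each element is a match object
(6) re.sub()
Replaces every substring that matches the regular expression; returns the resulting string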
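The difference between re.search() and re.match() is easy to miss, so here is a minimal sketch using the same six-digit postcode pattern as the other examples in this section.

```python
import re

pattern = r"[1-9]\d{5}"   # a six-digit number that does not start with 0
text = "BIT 100081"

from_start = re.match(pattern, text)    # None: the text starts with "B"
anywhere = re.search(pattern, text)     # finds the number later in the text

print(from_start)
print(anywhere.group(0))

# match() succeeds once the number is at the very beginning.
print(re.match(pattern, "100081 BIT").group(0))
```

Since both functions return None on failure, it is safest to test the result before calling `.group()` on it.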
Mastering the Match object
Worked examples
import re
m = re.search(r'[1-9]\d{5}','BIT100081 TSU100084')
print(m.string)
print(m.re)
print(m.pos)
print(m.endpos)
print(m.group(0))
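Besides the attributes printed above, a match object also reports where the match sits in the string. A short sketch on the same input:

```python
import re

text = "BIT100081 TSU100084"
m = re.search(r"[1-9]\d{5}", text)

first_span = m.span()                      # (start, end) of the first match
first_text = text[m.start():m.end()]       # same substring as m.group(0)

# finditer() yields one match object per occurrence.
all_spans = [hit.span() for hit in re.finditer(r"[1-9]\d{5}", text)]

print(first_span, first_text)
print(all_spans)
```

The end index is exclusive, so slicing the original string with `start()` and `end()` gives back exactly the matched text.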
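import re
# search(): first match anywhere in the string
match = re.search(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))
# findall(): every match, as a list of strings
ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
if ls:
    print(ls)
# split(): cut the string at every match, or at most maxsplit times
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084'))
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1))
# finditer(): one match object per match
for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    if m:
        print(m.group(0))
# sub(): replace every match
print(re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084'))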
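Scraping Taobao
import re
import requests

def getHTMLText(url):
    # Note: Taobao search now requires a logged-in session. Log in with your own
    # account, open the browser's developer tools (Inspect), go to the Network tab,
    # refresh the page, click the first request, and copy the user-agent and cookie
    # values from its headers. Look it up online if you are unsure how.
    header = {
        'user-agent': '',  # paste your own value here
        'cookie': '',      # paste your own value here
    }
    try:
        r = requests.get(url, headers=header, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def parsePage(ilt, html):
    # Collect the "view_price" and "raw_title" fields embedded in the page
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            # Strip the quotes directly instead of calling eval() on scraped text
            price = plt[i].split(':', 1)[1].strip('"')
            title = tlt[i].split(':', 1)[1].strip('"')
            ilt.append([price, title])
    except IndexError:
        print("")

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Title"))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))

def main():
    goods = '书包'  # search keyword (schoolbag)
    depth = 3       # number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)  # each page is offset by 44 items
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)

main()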
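The parsing step can be tried without logging in to Taobao. The sketch below runs on a made-up fragment that imitates the `raw_title` and `view_price` fields the crawler looks for. It uses capture groups instead of the split-based extraction in the crawler, which is a swapped-in variation rather than the original method; the function name `parse_items` and the sample data are hypothetical.

```python
import re

# Made-up fragment imitating the data embedded in a search results page.
sample_html = (
    '{"raw_title":"Canvas backpack: large","view_price":"129.00"},'
    '{"raw_title":"School bag","view_price":"59.90"}'
)

def parse_items(html):
    """Return (price, title) pairs; the parentheses capture only the values."""
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    return list(zip(prices, titles))

items = parse_items(sample_html)
for price, title in items:
    print(price, title)
```

With a capture group, `re.findall` returns just the captured part, so a title that contains a colon needs no special handling.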
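IV. Scrapy
1. Installation
Install it the same way as the jieba library.
Steps to install Scrapy
Several other libraries have to be installed first.
(1) Check whether lxml is installed
I already had this one.
(2) Check whether zope.interface is installed; if not, install it with pip.
(3) Install the Twisted library
I have installed several versions; the one currently in use is the first.
If the file is not placed in the right location, the installation may fail.
First download Twisted from https://www.lfd.uci.edu/~gohlke/pythonlibs/
The page lists download links for different Python versions and for 32-bit and 64-bit systems; press Ctrl+F and type a keyword to find the exact one.
Download the matching version; py37 is used here.
After downloading, install it from the corresponding directory, which can be checked in cmd. It is convenient to create a folder there to hold the file; here, for example, it is stored at D:\QLDownload\python\ku\Twisted-20.3.0-cp37-cp37m-win_amd64.whl
(4) Install the pyOpenSSL library
pip install … is enough.
(5) The pywin32 library; this one could not be installed directly with pip either.
As in step (3), download the matching version first. Downloads from the site above are extremely slow, so it can be downloaded from https://sourceforge.net/projects/pywin32/files/pywin32/ instead.
Put it in the corresponding location.
Check pip list to see whether it is there.
1.1 The method using the first site
1.2 Placing the file in directly
Mine came preinstalled, so it is not demonstrated here.
(6) Once everything above is installed, Scrapy itself installs quickly.
Enter pip install scrapy; once it succeeds, Scrapy is ready to use.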
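Instead of scanning pip list by eye, the prerequisites named in the steps above can be checked from Python. This is a minimal sketch using the standard-library `importlib.metadata` module (Python 3.8 or later). The distribution names in `REQUIRED` are my assumption of how these packages are registered with pip, and the helper names are made up.

```python
from importlib import metadata

# Assumed pip distribution names for the libraries mentioned in the steps above.
# pywin32 only exists on Windows, so "missing" is expected elsewhere.
REQUIRED = ["lxml", "zope.interface", "Twisted", "pyOpenSSL", "pywin32", "Scrapy"]

def installed_version(dist_name):
    """Return the installed version string, or None when the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

def report(names=REQUIRED):
    """Map each distribution name to its version (or None)."""
    return {name: installed_version(name) for name in names}

for name, version in report().items():
    print(f"{name:15} {version or 'missing'}")
```

Anything reported as missing can then be installed before running pip install scrapy.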
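2. Usage
A simple example demonstrates basic use.
(1)
(2) Create a directory first, then work inside it.
(3) Create a project named xiaozhu (typically with scrapy startproject xiaozhu).
(4) The project directory now contains the generated files. Open spiders and create a Python file there, either with a tool or from the command line; the command line is used here.
(5) Create xiaozhuspider (for example with scrapy genspider xiaozhuspider python123.io).
(6) Edit the file created in the previous step.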
import scrapy

class XiaozhuspiderSpider(scrapy.Spider):
    name = 'xiaozhuspider'
    # allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        # Save the response body under the last segment of the URL (demo.html)
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Save file %s.' % fname)
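(7) When the file is finished, run the spider from the command line (typically with scrapy crawl xiaozhuspider).
On success an HTML file is generated, and the extracted content is inside it.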