Advanced requests & BS4
Table of contents:
1. Advanced requests usage
2. Using the BeautifulSoup library
Yesterday's review:
requests:
get(url, headers, params, proxies)
post(url, headers, data, proxies)
xpath:
/
//
nodename
nodename[@attribute="…"]
text()
@attribute
PyQuery
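The XPath expressions reviewed above can be exercised quickly with the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset (a sketch; lxml supports the full syntax, including `text()` and bare `@attribute`):

```python
import xml.etree.ElementTree as ET

html = '<div><ul><li class="item">one</li><li>two</li></ul></div>'
root = ET.fromstring(html)

# // : search anywhere below the current node (ElementTree spells it ".//")
all_li = root.findall('.//li')
print([li.text for li in all_li])   # ['one', 'two']

# nodename[@attribute="…"] : filter by attribute value
first = root.find('.//li[@class="item"]')
print(first.text)                   # one

# @attribute : read an attribute (ElementTree uses .get())
print(first.get('class'))           # item
```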
1. Advanced requests usage

1. Uploading files with requests
2. Session persistence: the Session object (important)
3. Setting a timeout: `timeout`; if the request gets no response within 5 seconds, an exception is raised
4. Prepared Request: build a Request object that can be placed in a queue for crawl scheduling
```python
# 1. Uploading a file with requests
files = {'file': open('filename', 'rb')}
res = requests.post(url=url, files=files)

# 2. Session persistence: the Session object
from requests import Session
session = Session()
res = session.get(url=url, headers=headers)

# 3. Setting a timeout: raise an exception if no response arrives within 5 seconds
res = requests.get(url=url, headers=headers, timeout=5)

# 4. Prepared Request: build a Request object that can be queued for crawl scheduling
from requests import Request, Session

url = '....'
data = {'wd': 'spiderman'}
headers = {'User-Agent': '...'}

# 1. Instantiate a Session object
session = Session()
# 2. Build the Request object with the required arguments
#    (for a GET request: Request('GET', url, params=data, headers=headers))
req = Request('POST', url, data=data, headers=headers)
# 3. Convert it into a PreparedRequest with session.prepare_request()
prepared = session.prepare_request(req)
# 4. Send the request with session.send()
res = session.send(prepared)
```
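The PreparedRequest flow can be checked without touching the network, since preparing a request already builds the final method and encoded URL (a minimal sketch; the URL is a placeholder):

```python
from requests import Request, Session

session = Session()
req = Request('GET', 'http://example.com/search',
              params={'wd': 'spiderman'},
              headers={'User-Agent': 'test'})
prepared = session.prepare_request(req)

# The PreparedRequest carries the final method and encoded URL
print(prepared.method)  # GET
print(prepared.url)     # http://example.com/search?wd=spiderman

# session.send(prepared) would actually dispatch it
```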
2. Using the BeautifulSoup library

Introduction: BeautifulSoup is also a parsing library. BS relies on an underlying parser; supported parsers include html.parser, lxml, xml, html5lib, and others. The lxml parser is fast and fault-tolerant, so most BS code today uses lxml.
BeautifulSoup usage steps:

```python
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'lxml')
tag = soup.select("CSS selector expression")  # returns a list
```
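These steps work on any HTML string, not just a response body. A minimal sketch (using the built-in 'html.parser' so it runs without lxml installed):

```python
from bs4 import BeautifulSoup

html = '<div class="panel"><ul id="list"><li>one</li><li>two</li></ul></div>'
soup = BeautifulSoup(html, 'html.parser')

# select() always returns a list, even for a single match
tags = soup.select('#list li')
print([t.get_text() for t in tags])  # ['one', 'two']
```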
CSS selectors:

```python
# 1. Locate tags by node name and hierarchy: tag selectors & hierarchy selectors
soup.select('title')
soup.select('div > ul > li')   # single-level (direct child) selector
soup.select('div li')          # multi-level (descendant) selector

# 2. Locate tags by class attribute: class selector
soup.select('.panel')

# 3. Locate tags by id attribute: id selector
soup.select('#item')

# 4. Nested selection
ul_list = soup.select('ul')
for ul in ul_list:
    print(ul.select('li'))

# Getting a node's text or attributes:
# tag_obj.string: direct child text only; returns None if the node has
#                 child tags alongside its direct text
# tag_obj.get_text(): all text from the node's descendants
# tag_obj['attribute']: the node's attribute value
```
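The difference between `tag.string` and `tag.get_text()` is easy to demonstrate. A small sketch (using the built-in 'html.parser' so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = '<div>direct text<span id="s">nested text</span></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.select('div')[0]

# .string is None: the div has a child tag alongside its direct text
print(div.string)       # None

# .get_text() gathers text from all descendants
print(div.get_text())   # direct textnested text

# attributes are read by subscripting the tag
span = soup.select('span')[0]
print(span['id'])       # s
```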
Practice example:

```python
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <div class="panel-heading">
        <h4>BeautifulSoup practice</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">first li tag</li>
            <li class="element">second li tag</li>
            <li class="element">third li tag</li>
        </ul>
        <ul class="list list-small">
            <li class="element">one</li>
            <li class="element">two</li>
        </ul>
        <li class="element">testing the descendant selector</li>
    </div>
</div>
'''

soup = BeautifulSoup(html, 'lxml')

# 1. Locate a node by tag name and get its text
h4 = soup.select('h4')            # tag selector
print(h4[0].get_text())

# 2. Locate a node by class attribute
panel = soup.select('.panel-heading')
print(panel)

# 3. Locate a node by id attribute
ul = soup.select('#list-1')
print(ul)

# 4. Nested selection
ul_list = soup.select('ul')
for ul in ul_list:
    li = ul.select('li')
    print(li)

# 5. Direct-child selector vs. descendant selector
li_list_single = soup.select(".panel-body > ul > li")
li_list_multi = soup.select(".panel-body li")
```
Homework: crawl the full novel "Romance of the Three Kingdoms" and write it to txt files: 'http://www.shicimingju.com/book/sanguoyanyi.html'

```python
import requests
from bs4 import BeautifulSoup

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
res = requests.get(url=url, headers=headers)
soup = BeautifulSoup(res.text, 'lxml')

# Each chapter link lives in the book's table-of-contents menu
a_list = soup.select(".book-mulu ul li a")
for item in a_list:
    name = item.string
    href = item["href"]
    full_url = 'http://www.shicimingju.com' + href
    detail_page = requests.get(url=full_url, headers=headers).text
    soup_detail = BeautifulSoup(detail_page, 'lxml')
    div = soup_detail.select(".chapter_content")[0]
    # Write each chapter's text to its own txt file
    with open('%s.txt' % name, 'w', encoding="utf-8") as f:
        f.write(div.get_text())
```

Dictation:

Session persistence: the Session object

```python
from requests import Session
session = Session()
res = session.get(url=url, headers=headers)
```

BeautifulSoup usage steps:

```python
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'lxml')
tag = soup.select("CSS selector expression")  # returns a list
```