Scraping 王祖贤 (Joey Wong) Posters
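Common tools
Parsing tools commonly used in crawlers, with a short note on each:
lxml: uses XPath expressions to locate elements within a page; well suited to extracting single items.
BeautifulSoup: jokingly nicknamed "chicken soup for the soul"; handles list-like page structures well, so it suits batch scraping of lists.
re: very powerful for string processing; values can be pulled out directly with regular-expression patterns.
selenium: for dynamic pages; its webdriver tool can drive pagination and wait for content to load.
requests: sends requests to the server; the urllib library serves a similar purpose.
json: extremely effective when the page data is served as JSON; the information can be read quickly without any of the tools above.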
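To make the note on json concrete, here is a minimal sketch that reads image entries out of a JSON string using only the standard library. The sample payload and the helper name `extract_images` are made up for illustration; the payload only imitates the `images`, `src` and `id` keys that the crawl code in this post reads, and a real response may contain other fields.

```python
import json

# Hypothetical payload, shaped like the JSON the crawl loop in this post reads
sample = (
    '{"images": ['
    '{"id": 1, "src": "https://example.com/a.jpg"}, '
    '{"id": 2, "src": "https://example.com/b.jpg"}'
    ']}'
)

def extract_images(text):
    """Return (id, src) pairs from a JSON string that holds an 'images' list."""
    data = json.loads(text)
    # .get() keeps the helper from failing when the key is absent
    return [(item["id"], item["src"]) for item in data.get("images", [])]

pairs = extract_images(sample)
print(pairs)
```

No HTML parsing is involved here, which is why a JSON endpoint is usually the quickest route when a site offers one.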
Importing the packages
import requests
from urllib import request
import json
from lxml import etree

query = '王祖贤'
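Downloading images
import os

def download(src, image_id):
    """Download one image from src and save it as 王祖贤/<image_id>.jpg."""
    os.makedirs('王祖贤', exist_ok=True)  # create the target folder if it is missing
    path = os.path.join('王祖贤', str(image_id) + '.jpg')
    try:
        pic = requests.get(src, timeout=10)
        # the with-block closes the file even if the write fails
        with open(path, 'wb') as fp:
            fp.write(pic.content)
    except requests.exceptions.RequestException:
        # covers connection errors as well as timeouts
        print('Image could not be downloaded')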
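The download function turns its second argument straight into a file name, and the XPath loop later in this post passes a movie title there. Titles can contain slashes or other characters that are awkward in paths, so one possible precaution is to clean the name first. The helper below is a sketch only; `safe_filename` is a hypothetical name and is not part of the original script.

```python
import re

def safe_filename(name, replacement="_"):
    """Replace runs of path-unsafe characters and whitespace with `replacement`."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', replacement, str(name))
    # drop leading/trailing replacement characters left over from the substitution
    cleaned = cleaned.strip(replacement)
    return cleaned or "untitled"

print(safe_filename("倩女幽魂 / A Chinese Ghost Story"))
```

It could be applied at the call site, for example by wrapping the title before handing it to `download`.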
Batch crawling and saving
# Loop over the result pages (20 images per page) and request every URL
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
for i in range(0, 1000, 20):
    url = 'https://www.douban.com/j/search_photo?q=' + '%E7%8E%8B%E7%A5%96%E8%B4%A4' + '&limit=20&start=' + str(i)
    html = requests.get(url, headers=headers).text
    # json.loads no longer accepts an encoding argument (removed in Python 3.9)
    response = json.loads(html)
    for image in response['images']:
        download(image['src'], image['id'])
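The percent-encoded string in the URL above is simply the UTF-8 encoding of 王祖贤. As a sketch of how the same page URLs could be generated from the search term instead of being hard-coded, the snippet below uses `urllib.parse.urlencode` from the standard library. The function name `page_urls` and its `total` and `page_size` parameters are assumptions added for illustration; only the endpoint and the `q`, `limit` and `start` parameters come from the code above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.douban.com/j/search_photo'

def page_urls(query, total=1000, page_size=20):
    """Yield one search URL per page, advancing `start` by `page_size`."""
    for start in range(0, total, page_size):
        params = {'q': query, 'limit': page_size, 'start': start}
        # urlencode percent-encodes non-ASCII text as UTF-8 by default
        yield BASE_URL + '?' + urlencode(params)

urls = list(page_urls('王祖贤', total=60))
print(urls[0])
```

Building the query string this way makes it easier to reuse the loop for a different search term.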
Fetching with XPath
for s in range(0, 76, 15):
    # use the loop variable as the start offset so each pass requests a different page
    url = "https://search.douban.com/movie/subject_search?search_text=%E7%8E%8B%E7%A5%96%E8%B4%A4&cat=1002&start=" + str(s)
    html = request.urlopen(url).read().decode("utf-8")
    selector = etree.HTML(html)
    # poster image URLs
    srcs = selector.xpath("//div[@class='item-root']/a[@class='cover-link']/img[@class='cover']/@src")
    # title link elements; the title string is read from .text below
    titles = selector.xpath("//div[@class='item-root']/div[@class='detail']/div[@class='title']/a[@class='title-text']")
    for src, title in zip(srcs, titles):
        download(src, title.text)
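To see what the two XPath expressions select without touching the network, they can be tried on a small inline document. The HTML below is a made-up fragment that only imitates the class names used above; the real page markup may differ. The sketch also reads the title with `/text()` rather than through `.text`, which is a small variation on the code above, not the original approach.

```python
from lxml import etree

# Hypothetical fragment imitating the class names targeted by the XPath above
html = """
<div class="item-root">
  <a class="cover-link" href="#"><img class="cover" src="https://example.com/1.jpg"></a>
  <div class="detail"><div class="title"><a class="title-text">Sample Movie (1987)</a></div></div>
</div>
<div class="item-root">
  <a class="cover-link" href="#"><img class="cover" src="https://example.com/2.jpg"></a>
  <div class="detail"><div class="title"><a class="title-text">Another Movie (1990)</a></div></div>
</div>
"""

selector = etree.HTML(html)
# /@src returns the attribute values as strings
srcs = selector.xpath("//div[@class='item-root']/a[@class='cover-link']/img[@class='cover']/@src")
# /text() returns the link text directly instead of element objects
titles = selector.xpath("//div[@class='item-root']/div[@class='detail']/div[@class='title']/a[@class='title-text']/text()")

pairs = list(zip(srcs, titles))
print(pairs)
```

If either list comes back empty on the live page, `zip` yields nothing and no download is attempted, so printing the lengths of `srcs` and `titles` is a quick first check.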
Crawl results