22 - Web Crawling: Distributed Crawling with the Scrapy Framework (09)

tech2024-08-12  58

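Distributed crawling

How to implement it: scrapy + redis (Scrapy combined with the scrapy-redis component). Native Scrapy cannot run as a distributed crawler on its own.

What is distributed crawling? You set up a cluster of machines and have every machine in the cluster run the same program, so that together they crawl the same set of resources, with the work split among them.

Native Scrapy cannot do this because its scheduler and pipeline cannot be shared across the machines in the cluster. The scrapy-redis component gives native Scrapy a shared scheduler and a shared pipeline, which makes distributed crawling possible.

Install the component:

    pip install scrapy-redis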
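To make the idea of a shared scheduler concrete, here is a toy, in-memory sketch. It is not how scrapy-redis is implemented; the class SharedScheduler, the function run_cluster, the machine names, and the example.com link graph are all hypothetical stand-ins for the Redis-backed queue and deduplication set. It only illustrates why sharing one queue and one "seen" set lets several workers split a crawl without fetching any page twice.

```python
from collections import deque


class SharedScheduler:
    """In-memory stand-in for a queue and dedup set shared by all machines."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        # Only enqueue URLs that no worker has scheduled before.
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_url(self):
        return self.queue.popleft() if self.queue else None


def run_cluster(scheduler, worker_names, discover):
    """Let workers take turns pulling from the shared queue until it is empty."""
    crawled = {name: [] for name in worker_names}
    turn = 0
    while scheduler.queue:
        name = worker_names[turn % len(worker_names)]
        url = scheduler.next_url()
        crawled[name].append(url)
        # Links found by one worker become available to every worker.
        for link in discover(url):
            scheduler.enqueue(link)
        turn += 1
    return crawled


def fake_discover(url):
    # Hypothetical link graph: page n links to pages n+1 and n+2, up to page 5.
    n = int(url.rsplit("=", 1)[1])
    return [f"http://example.com/list?page={m}" for m in (n + 1, n + 2) if m <= 5]


scheduler = SharedScheduler()
scheduler.enqueue("http://example.com/list?page=1")
result = run_cluster(scheduler, ["machine_a", "machine_b"], fake_discover)
```

In the real setup the queue and the dedup set live in Redis, so the workers can be separate processes on separate machines rather than entries in one loop.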

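Implementation steps

Create the project

Create a crawler project:

    scrapy startproject proName

Enter the project directory and create a CrawlSpider-based spider file:

    scrapy genspider -t crawl spiderName www.xxx.com

Run the project:

    scrapy crawl spiderName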

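1. Modify the spider file

1.1 Import the base class: from scrapy_redis.spiders import RedisCrawlSpider
1.2 Change the parent class of the spider to RedisCrawlSpider.
1.3 Replace start_urls with a redis_key attribute whose value is any string, for example redis_key = 'xxxx'. This is the name of the shared scheduler queue; later we manually push a start URL into the queue it names.
1.4 Fill in the data-parsing logic.

fbs.py (spider source file)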

    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy_redis.spiders import RedisCrawlSpider
    from fbsPro.items import FbsproItem


    class FbsSpider(RedisCrawlSpider):
        name = 'fbs'
        # allowed_domains = ['www.xxx.com']
        # start_urls = ['http://www.xxx.com/']

        # Name of the shared scheduler queue.
        # Later we manually push a start URL into the queue named by redis_key.
        redis_key = 'sunQueue'

        rules = (
            Rule(LinkExtractor(allow=r'id=1&page=\d+'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # Collect the titles from every page of the site
            li_list = response.xpath('/html/body/div[2]/div[3]/ul[2]/li')
            for li in li_list:
                title = li.xpath('./span[3]/a/text()').extract_first()
                item = FbsproItem()
                item['title'] = title
                yield item
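The allow pattern in the Rule above decides which links are followed. As a rough illustration, assuming the pattern is applied as a regular-expression search against each absolute URL, the following sketch checks a few URLs with the standard re module. The list of URLs is made up for the example (only the first path comes from this tutorial), and this is a simplification, not LinkExtractor's actual code.

```python
import re

# Same pattern as in the Rule above.
allow = re.compile(r'id=1&page=\d+')

base = 'http://wz.sun0769.com/political/index/politicsNewest'
urls = [
    base + '?id=1&page=2',   # listing page 2
    base + '?id=1&page=15',  # listing page 15
    base + '?id=2&page=3',   # different id (hypothetical URL)
    base + '?id=1&page=',    # no page number
]

# Keep only the URLs the pattern would accept.
followed = [u for u in urls if allow.search(u)]
```

Note that a URL ending in "page=" without a number does not match the pattern; that is fine for a start URL, since start URLs are fetched regardless of the rules.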

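2. Configure settings.py

Specify the scheduler

    # Use the deduplication filter provided by scrapy-redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scheduler provided by scrapy-redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Whether the crawl may be paused: keep the queue and dedup records
    # in Redis when the spider stops, so the crawl can be resumed
    SCHEDULER_PERSIST = True

Specify the pipeline

    # Enable the pipeline that ships with scrapy-redis
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }
    # This pipeline can only write items into Redis

Specify Redis

    # Connection settings the crawler uses to reach Redis
    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 6379
    # REDIS_ENCODING = 'utf-8'
    # REDIS_PARAMS = {'password':'123456'}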

Complete code

    # Scrapy settings for fbsPro project
    #
    # For simplicity, this file contains only settings considered important or
    # commonly used. You can find more settings consulting the documentation:
    #
    #     https://docs.scrapy.org/en/latest/topics/settings.html
    #     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

    BOT_NAME = 'fbsPro'

    SPIDER_MODULES = ['fbsPro.spiders']
    NEWSPIDER_MODULE = 'fbsPro.spiders'

    # Restrict log output to a given level (errors only)
    #LOG_LEVEL = 'ERROR'

    # Spoof the User-Agent
    USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'

    # Obey robots.txt rules
    # Set to False so that robots.txt is not obeyed
    ROBOTSTXT_OBEY = False

    # Use the deduplication filter provided by scrapy-redis
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
    # Use the scheduler provided by scrapy-redis
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"
    # Whether the crawl may be paused (keep queue and dedup records in Redis)
    SCHEDULER_PERSIST = True

    # Enable the pipeline that ships with scrapy-redis
    ITEM_PIPELINES = {
        'scrapy_redis.pipelines.RedisPipeline': 400
    }

    # Connection settings the crawler uses to reach Redis
    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 6379
    # REDIS_ENCODING = 'utf-8'
    # REDIS_PARAMS = {'password':'123456'}

    # Configure maximum concurrent requests performed by Scrapy (default: 16)
    CONCURRENT_REQUESTS = 5

    # Configure a delay for requests for the same website (default: 0)
    # See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
    # See also autothrottle settings and docs
    #DOWNLOAD_DELAY = 3
    # The download delay setting will honor only one of:
    #CONCURRENT_REQUESTS_PER_DOMAIN = 16
    #CONCURRENT_REQUESTS_PER_IP = 16

    # Disable cookies (enabled by default)
    #COOKIES_ENABLED = False

    # Disable Telnet Console (enabled by default)
    #TELNETCONSOLE_ENABLED = False

    # Override the default request headers:
    #DEFAULT_REQUEST_HEADERS = {
    #   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    #   'Accept-Language': 'en',
    #}

    # Enable or disable spider middlewares
    # See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
    #SPIDER_MIDDLEWARES = {
    #    'fbsPro.middlewares.FbsproSpiderMiddleware': 543,
    #}

    # Enable or disable downloader middlewares
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
    #DOWNLOADER_MIDDLEWARES = {
    #    'fbsPro.middlewares.FbsproDownloaderMiddleware': 543,
    #}

    # Enable or disable extensions
    # See https://docs.scrapy.org/en/latest/topics/extensions.html
    #EXTENSIONS = {
    #    'scrapy.extensions.telnet.TelnetConsole': None,
    #}

    # Configure item pipelines
    # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
    # ITEM_PIPELINES = {
    #    'fbsPro.pipelines.FbsproPipeline': 300,
    # }

    # Enable and configure the AutoThrottle extension (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/autothrottle.html
    #AUTOTHROTTLE_ENABLED = True
    # The initial download delay
    #AUTOTHROTTLE_START_DELAY = 5
    # The maximum download delay to be set in case of high latencies
    #AUTOTHROTTLE_MAX_DELAY = 60
    # The average number of requests Scrapy should be sending in parallel to
    # each remote server
    #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
    # Enable showing throttling stats for every response received:
    #AUTOTHROTTLE_DEBUG = False

    # Enable and configure HTTP caching (disabled by default)
    # See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
    #HTTPCACHE_ENABLED = True
    #HTTPCACHE_EXPIRATION_SECS = 0
    #HTTPCACHE_DIR = 'httpcache'
    #HTTPCACHE_IGNORE_HTTP_CODES = []
    #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

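3. Edit the Redis configuration file redis.windows.conf

Remove the default binding

Comment out this line (line 56):

    #bind 127.0.0.1

Turn off protected mode

Change yes to no (line 75):

    protected-mode no

Error when starting Redis:

    Creating Server TCP listening socket 127.0.0.1:6379: bind: No error

Fix: enter the following commands in order

    redis-cli.exe
    shutdown
    exit
    redis-server redis.windows.conf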

4. Start the Redis server and client

redis-server redis.windows.conf

redis-cli

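5. Run the Scrapy project

Do not enable LOG_LEVEL = 'ERROR' in the settings file, so that the startup log stays visible. After the project starts, the program stops at the listening stage and waits for a start URL to be added.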

6. Add a start URL to the queue named by redis_key

Run the following command in the Redis client (the scheduler queue lives in Redis):

    lpush sunQueue http://wz.sun0769.com/political/index/politicsNewest?id=1&page=
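The same push can be done from Python instead of redis-cli. The sketch below assumes the redis-py package is installed (pip install redis) and that Redis is listening on the host and port from settings.py; the helper names seed_start_url and connect are hypothetical and are not part of scrapy-redis.

```python
def seed_start_url(client, redis_key, url):
    """Push a start URL onto the list named redis_key.

    `client` is any object with a Redis-style lpush(key, value) method,
    for example a redis.Redis instance. Returns the new list length.
    """
    return client.lpush(redis_key, url)


def connect(host="127.0.0.1", port=6379):
    """Create a redis-py client (assumes `pip install redis`)."""
    import redis  # imported here so the helper above works without redis-py
    return redis.Redis(host=host, port=port)


# Example call, left commented out because it needs a running Redis server:
# seed_start_url(connect(), "sunQueue", "http://www.xxx.com/")
```

Passing the client in as a parameter keeps the push logic easy to test without a live server. If you run the lpush command directly from an operating-system shell rather than inside the redis-cli prompt, quote the URL, since most shells treat & as a special character.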

Looking in the database, we can see the data that was crawled.
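For reading the results back in Python, here is a minimal sketch. It assumes the scrapy-redis defaults are in effect, namely that RedisPipeline stores each item as a JSON string in a Redis list named after the spider (here that would be fbs:items); check the keys in your own database, since both the key name and the serializer can be changed in settings. The function decode_items and the sample entries are made up for the illustration.

```python
import json


def decode_items(raw_entries):
    """Turn raw list entries (bytes or str holding JSON) into dicts."""
    items = []
    for raw in raw_entries:
        text = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        items.append(json.loads(text))
    return items


# Stand-in for what client.lrange("fbs:items", 0, -1) might return.
sample = [b'{"title": "first post"}', '{"title": "second post"}']
titles = [item["title"] for item in decode_items(sample)]
```

With a live connection, the entries would come from something like client.lrange("fbs:items", 0, -1) on a redis-py client instead of the hard-coded sample list.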
