Configuring, Starting, and Stopping a WebMagic Spider


Contents: Spider; Creating a Spider object; Adding request URLs; Setting the thread count and starting; Summary

Spider

Spider is the entry point for starting a crawler. Before starting, we create a Spider object from a PageProcessor and then call run() to launch it. The Spider's other components (Downloader, Scheduler, Pipeline) can all be configured through setter methods.
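For instance, a minimal sketch of wiring these components together through the setter methods might look like the following. MyPageProcessor is a hypothetical processor standing in for your own implementation; HttpClientDownloader, QueueScheduler, and ConsolePipeline are stock WebMagic components.

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.scheduler.QueueScheduler;

public class SpiderConfigExample {
    public static void main(String[] args) {
        Spider.create(new MyPageProcessor())            // hypothetical PageProcessor implementation
                .setDownloader(new HttpClientDownloader())  // component that fetches pages
                .setScheduler(new QueueScheduler())         // component that manages the URL queue
                .addPipeline(new ConsolePipeline())         // component that handles extracted results
                .addUrl("https://example.com/")             // placeholder seed URL
                .thread(2)
                .run();                                     // blocks until the spider finishes
    }
}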

Let's walk through the relevant source code:

Creating a Spider Object

/**
 * create a spider with pageProcessor.
 *
 * @param pageProcessor pageProcessor
 * @return new spider
 * @see PageProcessor
 */
public static Spider create(PageProcessor pageProcessor) {
    // create the Spider object; a PageProcessor must be supplied here
    return new Spider(pageProcessor);
}

/**
 * create a spider with pageProcessor.
 *
 * @param pageProcessor pageProcessor
 */
public Spider(PageProcessor pageProcessor) {
    this.pageProcessor = pageProcessor;
    // take the Site configuration from the PageProcessor
    this.site = pageProcessor.getSite();
}
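Because the constructor reads the Site configuration from pageProcessor.getSite(), every PageProcessor implementation has to supply one. A minimal sketch of such a processor (the class name and the extraction logic are purely illustrative) might be:

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class MyPageProcessor implements PageProcessor {

    // Site holds crawler-level settings such as retry count and politeness delay
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // illustrative extraction: store the page title as a result field
        page.putField("title", page.getHtml().xpath("//title/text()").toString());
    }

    @Override
    public Site getSite() {
        // this is the Site object the Spider constructor reads
        return site;
    }
}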

Adding Request URLs

/**
 * Add urls to crawl. <br>
 *
 * @param urls urls
 * @return this
 */
public Spider addUrl(String... urls) {
    // multiple URLs can be added at once
    for (String url : urls) {
        addRequest(new Request(url));
    }
    signalNewUrl();
    return this;
}
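Since addUrl is a varargs method, several seed URLs can be passed in a single call. If you need to attach extra context to a seed, you can also build a Request yourself; recent WebMagic versions expose a public addRequest(Request...) overload for this, though you should verify it against the version you use. A hedged sketch with placeholder URLs:

import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Spider;

public class AddUrlExample {
    public static void main(String[] args) {
        Spider spider = Spider.create(new MyPageProcessor());  // hypothetical processor from the sketch above

        // several seed URLs in one varargs call
        spider.addUrl("https://example.com/page1", "https://example.com/page2");

        // or build a Request directly, e.g. to carry extra data for the processor
        Request request = new Request("https://example.com/page3");
        request.putExtra("source", "seed-list");                // putExtra stores arbitrary key/value context
        spider.addRequest(request);                             // assumes the public addRequest(Request...) overload

        spider.thread(2).run();
    }
}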

Setting the Thread Count and Starting

/**
 * start with more than one threads
 *
 * @param threadNum threadNum
 * @return this
 */
public Spider thread(int threadNum) {
    // make sure the spider has not been started yet
    checkIfRunning();
    this.threadNum = threadNum;
    if (threadNum <= 0) {
        throw new IllegalArgumentException("threadNum should be more than one!");
    }
    return this;
}

@Override
public void run() {
    checkRunningStat();
    initComponent();
    logger.info("Spider {} started!", getUUID());
    while (!Thread.currentThread().isInterrupted() && stat.get() == STAT_RUNNING) {
        final Request request = scheduler.poll(this);
        if (request == null) {
            if (threadPool.getThreadAlive() == 0 && exitWhenComplete) {
                break;
            }
            // wait until new url added
            waitNewUrl();
        } else {
            threadPool.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        processRequest(request);
                        onSuccess(request);
                    } catch (Exception e) {
                        onError(request);
                        logger.error("process request " + request + " error", e);
                    } finally {
                        pageCount.incrementAndGet();
                        signalNewUrl();
                    }
                }
            });
        }
    }
    stat.set(STAT_STOPPED);
    // release some resources
    if (destroyWhenExit) {
        close();
    }
    logger.info("Spider {} closed! {} pages downloaded.", getUUID(), pageCount.get());
}
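Note that run() blocks the calling thread until the loop above exits, which happens once the queue is drained with exitWhenComplete set, or after stop() is called. Spider also offers start()/runAsync() for non-blocking startup and stop() for termination, which flips the internal state to STAT_STOPPED so the while loop ends. A small lifecycle sketch, again assuming the hypothetical MyPageProcessor from earlier:

import us.codecraft.webmagic.Spider;

public class SpiderLifecycleExample {
    public static void main(String[] args) throws InterruptedException {
        Spider spider = Spider.create(new MyPageProcessor())   // hypothetical processor from the earlier sketch
                .addUrl("https://example.com/")                // placeholder seed URL
                .thread(5);                                    // 5 worker threads; must be set before starting

        spider.start();         // non-blocking: runs the loop above in a separate thread

        Thread.sleep(30_000);   // let it crawl for a while (illustrative only)

        spider.stop();          // sets stat to STAT_STOPPED so the while loop in run() exits
    }
}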

Summary

For example, the code to start a spider looks like this:

public static void main(String[] args) {
    Spider.create(new CsdnPageProcessor())
            .addUrl("https://blog.csdn.net/xye1230/article/details/108348669")
            .thread(5)
            .run();
}