Crawler Configuration, Startup, and Termination
Spider
Spider is the entry point for starting a crawler. Before starting, we create a Spider object from a PageProcessor and then call run() to launch it. Spider's other components (Downloader, Scheduler, Pipeline) can all be configured through the corresponding set methods.
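For instance, a sketch of wiring in specific components might look like the following. This assumes the built-in HttpClientDownloader, QueueScheduler, and ConsolePipeline implementations that ship with WebMagic; CsdnPageProcessor is the user-defined processor used in the summary example at the end of this article, not part of the framework.

import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.downloader.HttpClientDownloader;
import us.codecraft.webmagic.pipeline.ConsolePipeline;
import us.codecraft.webmagic.scheduler.QueueScheduler;

public class SpiderSetupDemo {
    public static void main(String[] args) {
        Spider.create(new CsdnPageProcessor())              // user-defined PageProcessor
                .setDownloader(new HttpClientDownloader())  // component: how pages are fetched
                .setScheduler(new QueueScheduler())         // component: how URLs are queued
                .addPipeline(new ConsolePipeline())         // component: where extracted results go
                .addUrl("https://blog.csdn.net/xye1230/article/details/108348669")
                .thread(2)
                .run();
    }
}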
Let's walk through the relevant source code below.
Creating the Spider object
public static Spider create(PageProcessor pageProcessor) {
    return new Spider(pageProcessor);
}

public Spider(PageProcessor pageProcessor) {
    this.pageProcessor = pageProcessor;
    this.site = pageProcessor.getSite();
}
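Note that the constructor also pulls the crawler's Site configuration from the PageProcessor via getSite(). A minimal sketch of what such a PageProcessor could look like is shown below; the CsdnPageProcessor name comes from the summary example at the end of this article, and the retry/sleep settings and the XPath expression are purely illustrative assumptions.

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class CsdnPageProcessor implements PageProcessor {

    // illustrative Site configuration; adjust to your own needs
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // extract whatever you need from the downloaded page
        page.putField("title", page.getHtml().xpath("//title/text()").toString());
    }

    @Override
    public Site getSite() {
        return site;   // this is what Spider's constructor reads via pageProcessor.getSite()
    }
}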
Adding request URLs
public Spider addUrl(String... urls) {
    for (String url : urls) {
        addRequest(new Request(url));
    }
    signalNewUrl();
    return this;
}
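As the snippet shows, addUrl() simply wraps each URL in a Request before queuing it. If you need more control over a request, for example to attach extra context for the PageProcessor, you can build the Request yourself; the sketch below assumes that Spider.addRequest() and Request.putExtra() are publicly available (as they are in current WebMagic versions), and again uses the user-defined CsdnPageProcessor.

import us.codecraft.webmagic.Request;
import us.codecraft.webmagic.Spider;

public class AddRequestDemo {
    public static void main(String[] args) {
        Request request = new Request("https://blog.csdn.net/xye1230/article/details/108348669");
        request.putExtra("source", "manual");   // custom context, readable later from the request extras

        Spider.create(new CsdnPageProcessor())  // user-defined PageProcessor
                .addRequest(request)            // same effect as addUrl(), but with a prepared Request
                .thread(2)
                .run();
    }
}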
Setting the thread count and starting
public Spider thread(int threadNum) {
    checkIfRunning();
    this.threadNum = threadNum;
    if (threadNum <= 0) {
        throw new IllegalArgumentException("threadNum should be more than one!");
    }
    return this;
}
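thread() must be called before the spider is running (which is what checkIfRunning() guards) and it rejects non-positive values. How many threads to use is up to you; one common, purely illustrative heuristic is to size the pool to the number of available CPU cores, as in the sketch below.

import us.codecraft.webmagic.Spider;

public class ThreadSizingDemo {
    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();  // illustrative sizing heuristic only
        Spider.create(new CsdnPageProcessor())                     // user-defined PageProcessor
                .addUrl("https://blog.csdn.net/xye1230/article/details/108348669")
                .thread(threads)
                .run();
    }
}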
@Override
public void run() {
    checkRunningStat();
    initComponent();
    logger.info("Spider {} started!", getUUID());
    while (!Thread.currentThread().isInterrupted() && stat.get() == STAT_RUNNING) {
        final Request request = scheduler.poll(this);
        if (request == null) {
            // no request available: exit if all worker threads are idle and
            // exitWhenComplete is set, otherwise wait for new URLs to arrive
            if (threadPool.getThreadAlive() == 0 && exitWhenComplete) {
                break;
            }
            waitNewUrl();
        } else {
            // hand the request off to the thread pool for download and processing
            threadPool.execute(new Runnable() {
                @Override
                public void run() {
                    try {
                        processRequest(request);
                        onSuccess(request);
                    } catch (Exception e) {
                        onError(request);
                        logger.error("process request " + request + " error", e);
                    } finally {
                        pageCount.incrementAndGet();
                        signalNewUrl();
                    }
                }
            });
        }
    }
    stat.set(STAT_STOPPED);
    if (destroyWhenExit) {
        close();
    }
    logger.info("Spider {} closed! {} pages downloaded.", getUUID(), pageCount.get());
}
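run() blocks the calling thread until the loop above exits. If you would rather have the spider run in the background and terminate it explicitly (the "termination" part of this article's title), Spider also exposes start() and stop(). A sketch is shown below; the 10-second sleep is an arbitrary duration chosen purely for illustration.

import us.codecraft.webmagic.Spider;

public class StartStopDemo {
    public static void main(String[] args) throws InterruptedException {
        Spider spider = Spider.create(new CsdnPageProcessor())   // user-defined PageProcessor
                .addUrl("https://blog.csdn.net/xye1230/article/details/108348669")
                .thread(5);

        spider.start();          // runs the spider asynchronously in a background thread

        Thread.sleep(10_000);    // let it crawl for a while (arbitrary duration)

        spider.stop();           // moves stat to STAT_STOPPED, so the main loop above exits
    }
}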
Summary
For example, the code to start a crawler looks like this:
public static void main(String[] args) {
    Spider.create(new CsdnPageProcessor())
            .addUrl("https://blog.csdn.net/xye1230/article/details/108348669")
            .thread(5)
            .run();
}