Crawling Tools

Have you ever wondered how social networks do URL previews so well when you share links? How do they know which images to grab, whom to cite as an author, or which tags to attach to the preview? Is it all crawling with complex regexes over source code? Actually, more often than not, it isn’t. Meta information defined in the source can be unreliable, and sites with a less than stellar reputation often use it as a keyword carrier, attempting to get search engines to rank them higher. Isn’t what we, the humans, see in front of us what matters anyway?

If you want to build a URL preview snippet or a news aggregator, there are many automatic crawlers available online, both proprietary and open source, but you seldom find something as niche as visual machine learning. This is exactly what Diffbot is – a “visual learning robot” which renders a URL you request in full and then visually extracts data, helping itself with some metadata from the page source as needed.

After covering some theory, in this post we’ll make a demo API call against one of SitePoint’s posts.

PHP Library

The PHP library for Diffbot is somewhat out of date, and as such we won’t be using it in this demo. We’ll be performing raw API calls, and in some future posts we’ll build our own library for API interaction.

If you’d like to take a look at the PHP library nonetheless, see here, and if you’re interested in libraries for other languages, Diffbot has a directory.

Update, July 2015: A PHP library has been developed since this article was published. See its entire development process here, or the source code here.

JavaScript Content

We said in the introductory section that Diffbot renders the request in full and then analyzes it. But, what about JavaScript content? Nowadays, websites often render some HTML above the fold, and then finish the CSS, JS, and dynamic content loading afterwards. Can the Diffbot API see that?

As a matter of fact, yes. Diffbot literally renders the page in full, and then inspects it visually, as explained in my StackOverflow Q&A here. There are some caveats, though, so make sure you read the answer carefully.

Pricing and API Health

Diffbot has several usage tiers. There’s a free trial tier which kills your API token after 7 days or 10,000 calls, whichever comes first. Commercial tokens can be purchased at various prices and never expire, but do have limitations. Open source and/or educational projects are handled case by case and can get an older model of the free token: 10,000 calls per month, at most one call per second, but it never expires. You need to contact Diffbot directly if you think you qualify.
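
To stay under a limit such as one call per second, requests can be spaced out on the client side. The following is only a minimal sketch, assuming PHP 7 or later; secondsToWait() is a hypothetical helper invented for this illustration and is not part of Diffbot or Guzzle. It works out how long to pause before the next request is allowed.

```php
<?php

/**
 * Hypothetical helper: how many seconds to wait before the next call
 * so that consecutive calls are at least $minInterval seconds apart.
 */
function secondsToWait(float $lastCallAt, float $now, float $minInterval = 1.0): float
{
    $elapsed = $now - $lastCallAt;

    // Never return a negative wait.
    return max(0.0, $minInterval - $elapsed);
}

// A quarter of a second after the previous call, 0.75 seconds remain.
echo secondsToWait(10.0, 10.25), PHP_EOL;

// Two seconds after the previous call, no wait is needed.
echo secondsToWait(10.0, 12.0), PHP_EOL;
```

In a real script, the timestamps would come from microtime(true), and the returned value could be passed to usleep() after converting it to microseconds.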

Diffbot guarantees a high uptime, but failures sometimes do happen, especially in the most resource-intensive API of the bunch: Crawlbot. Crawlbot is used to crawl entire domains, not just individual pages, and as such has a lower reliability rate than the other APIs. Not by a lot, but enough to be noticeable on the API Health screen. If your calls run into issues or return error 500, that is the screen to check to see whether an API is up and running or currently unavailable.
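
Since an occasional error 500 is possible, client code may want to retry a failed call a few times before giving up. This is a hedged sketch only: callWithRetry() and the ['status' => ...] array shape are assumptions made for this illustration (they are not Guzzle or Diffbot structures), and the API call is replaced by a stand-in closure so the example runs without a network connection.

```php
<?php

/**
 * Hypothetical helper: run $request, retrying while it reports a 5xx status.
 * $request receives the attempt number and must return an array with a
 * 'status' key (an assumption made for this sketch).
 */
function callWithRetry(callable $request, int $maxAttempts = 3): array
{
    $result = ['status' => 0];

    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $request($attempt);

        // Anything below 500 is not a server-side failure, so stop retrying.
        if ($result['status'] < 500) {
            return $result;
        }
    }

    // Every attempt failed; return the last result so the caller can inspect it.
    return $result;
}

// Stand-in for a real API call: fails twice with a 500, then succeeds.
$fakeRequest = function (int $attempt): array {
    return ['status' => $attempt < 3 ? 500 : 200, 'attempt' => $attempt];
};

$result = callWithRetry($fakeRequest);
echo $result['status'], ' after ', $result['attempt'], ' attempts', PHP_EOL;
```

A real implementation would probably also pause between attempts; that is left out here to keep the sketch short.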

Demo

To prepare your environment, please boot up a Homestead Improved instance.

Create Project

Create a starter Laravel project: SSH into the VM with vagrant ssh, go into the Code folder, and execute composer create-project laravel/laravel Laravel --prefer-dist. Once that finishes, you can access the Laravel greeting page via http://homestead.app:8000 from the host’s browser.

Add a Route and Action

In app/routes.php add the following route:

Route::get('/diffbot', 'HomeController@diffbotDemo');

In app/controllers/HomeController.php, add the following action:

public function diffbotDemo()
{
    die("hi");
}

If http://homestead.app:8000/diffbot now outputs “hi” on the screen, we’re ready to start playing with the API.

Get a Token

To interact with the Diffbot API, you need a token. Sign up for one on their pricing page. For the sake of this demo, let’s call our token $TOKEN, and we’ll refer to it as such in URLs. Replace $TOKEN with your own value where appropriate.

Install Guzzle

We’ll be using Guzzle as our HTTP client. It’s not required, but I do recommend you get familiar with it through a past article of ours.

Add "guzzlehttp/guzzle": "4.1.*@dev" to your composer.json so the require block looks like this:

"require": {
    "laravel/framework": "4.2.*",
    "guzzlehttp/guzzle": "4.1.*@dev"
},

In the project root, run composer update.

Fetch Article Data

In the first example, we’ll crawl a SitePoint post with the default Article API from Diffbot. To do this, we refer to the docs, which do an excellent job of explaining the workflow. Change the body of the diffbotDemo action to the following code:

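public function diffbotDemo()
{
    // Single quotes stop PHP from treating $TOKEN as a variable; use your own token here.
    $token = '$TOKEN';
    $version = 'v3';

    // Set the base URL once so later requests only need a relative path.
    $client = new GuzzleHttp\Client(['base_url' => 'http://api.diffbot.com/']);

    // Send a GET request with the token and target URL as query parameters.
    $response = $client->get($version.'/article', ['query' => [
        'token' => $token,
        'url' => 'https://www.sitepoint.com/7-mistakes-commonly-made-php-developers/'
    ]]);

    // Decode the JSON response into an array and dump it.
    die(var_dump($response->json()));
}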

First, we set our token. Then, we define a variable that’ll hold the API version. Next, it’s up to us to create a new Guzzle client, and we also give it a base URL so we don’t have to type it in every time we make another request.

Next up, we create a response object by sending a GET request to the API’s URL, and we add in an array of query parameters in key => value format. In this case, we only pass in the token and the URL, the most basic of parameters.
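
To make the role of the query array more concrete, here is a rough sketch of the URL such a request ends up targeting, built with plain PHP instead of Guzzle. It is an illustration under assumptions: 'TOKEN' is a placeholder value, and Guzzle’s own encoding of the query string may differ in small details from what http_build_query() produces.

```php
<?php

// Placeholder token; replace with your own value.
$token = 'TOKEN';
$version = 'v3';

$query = [
    'token' => $token,
    'url'   => 'https://www.sitepoint.com/7-mistakes-commonly-made-php-developers/',
];

// http_build_query() URL-encodes each value and joins the pairs with "&".
$requestUrl = 'http://api.diffbot.com/' . $version . '/article?'
    . http_build_query($query, '', '&');

echo $requestUrl, PHP_EOL;
```

Note how the article URL is percent-encoded so that it can travel safely as a single query parameter.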

Finally, since the Diffbot API returns JSON data, we use Guzzle’s json() method to automatically decode it into an array. We then pretty-print this data:

As you can see, we got some information back rather quickly. There’s the icon that was used, a preview of the text, and the title; even the language, date and HTML have been returned. You’ll notice there’s no author, however. Let’s change this and request some more values.
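
Because a field such as the author can be missing from a response, it helps to read values defensively. The sketch below works on a hypothetical, heavily trimmed response body written by hand for this example (a real response contains many more fields), and uses json_decode() directly in place of Guzzle’s json() method.

```php
<?php

// Hypothetical, heavily trimmed response body; not real API output.
$body = '{"objects":[{"title":"Example title","text":"Example text"}]}';

// Decode into nested associative arrays.
$data = json_decode($body, true);

// Take the first extracted object, or an empty array if none came back.
$article = isset($data['objects'][0]) ? $data['objects'][0] : [];

// Fall back to null when a field was not extracted.
$title  = isset($article['title']) ? $article['title'] : null;
$author = isset($article['author']) ? $article['author'] : null;

echo $title, PHP_EOL;
echo $author === null ? 'no author extracted' : $author, PHP_EOL;
```

Checking with isset() avoids undefined index notices when a field is absent.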

If we add the “fields” parameter to the query params list and give it a value of “tags”, Diffbot will attempt to extract tags/categories from the URL provided. Add this line to the query array:

'fields' => 'tags'

and then change the die part to this:

$data = $response->json();
die(var_dump($data['objects'][0]['tags']));

Refreshing the screen now gives us this:

But, the source code of the article notes several other tags:

Why is the result so very different? It’s precisely due to the reason we mentioned at the end of the very first paragraph of this post: what we humans see takes precedence. Diffbot is a visual learning robot, and as such its AI deduces the tags from the actual rendered content (what it can see) rather than from the source code, which is far too easily spiced up for SEO purposes.

Is there a way to get the tags from the source code, though, if one really needs them? Furthermore, can we make Diffbot recognize the author on SitePoint articles? Yes. With the Custom API.

Meta Tags and Author with Custom API

The Custom API is a feature which not only lets you tweak the existing Diffbot APIs to your liking by adding new fields and rules for content extraction, but also lets you create completely new APIs (accessed via a dedicated URL, too) for custom content processing.

Go to the dev dashboard and log in with your token. Then, go into “Custom API”. Activate the “Create a Rule” tab at the bottom, and input the URL of the article we’re crawling into the URL box, then click Test. Your screen should look something like this:

You’ll immediately notice the Author field is empty. You can tweak the author-searching rule by clicking Edit next to it, finding the Author element in the live preview window that opens, and clicking on it to get the desired result. However, due to some, well, less than perfect CSS on SitePoint’s end, it’s very difficult to provide Diffbot’s API with a consistent path to the author name, especially by clicking on elements. Instead, add the following rule manually: .contributor--large .contributor_name a and click Save.

You’ll notice the Preview window now correctly populates the Author field:

In fact, this new rule is automatically applied to all SitePoint links for your token. If you try to preview another SitePoint article, like this one, you’ll notice Peter Nijssen is successfully extracted:

Ok, let’s modify the API further. We need the article:tag values that are visible in source code. Doing this requires a two-step process.

Step 1: Define a Collection

A collection is exactly what it sounds like: a collection of values grabbed via a specific ruleset. We’ll call our collection “MetaTags”, and give it the following selector: meta[property=article:tag]. This means “find all meta elements in the HTML that have the property attribute with the value article:tag”.
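
To see what such a selector matches, here is a small local sketch using PHP’s DOM extension. It does not reflect how Diffbot evaluates rules internally; the HTML snippet is hypothetical, the XPath expression is simply an equivalent of the CSS selector above, and reading the content attribute is an assumption about where the tag text lives.

```php
<?php

// Hypothetical page source: two article:tag entries and one unrelated meta element.
$html = <<<'HTML'
<html><head>
<meta property="article:tag" content="php">
<meta property="article:tag" content="mistakes">
<meta name="description" content="Not a tag">
</head><body></body></html>
HTML;

$dom = new DOMDocument();
// Suppress warnings about imperfect markup while parsing.
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

// XPath equivalent of the CSS selector meta[property=article:tag].
$nodes = $xpath->query('//meta[@property="article:tag"]');

$tags = [];
foreach ($nodes as $node) {
    // Collect the value of each matching element's content attribute.
    $tags[] = $node->getAttribute('content');
}

print_r($tags);
```

Only the two elements whose property attribute equals article:tag are collected; the description element is ignored.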

Step 2: Define Collection Fields

Collection fields are individual entries in a collection – in our case, the various tags. Click on “Add a custom field to this collection”, and add the following values:

Click Save. You’ll immediately have access to the list of Tags in the result window:

Change the final output of the diffbotDemo() action to this:

die(var_dump($data['objects'][0]['metaTags']));

If you now refresh the URL we tested with (http://homestead.app:8000/diffbot), you’ll notice the author and meta tags values are there. Here’s the output the above line of code produces:

We have our tags!

Conclusion

Diffbot is a powerful data extractor for the web. Whether you need to consolidate many sites into a single search index without combining their back-ends, want to build a news aggregator, have an idea for a URL preview web component, or want to regularly harvest the contents of competitors’ public pricing lists, Diffbot can help. With dead simple API calls and highly structured responses, you’ll be up and running in next to no time. In a later article, we’ll build a brand new library for using Diffbot with PHP, and redo the calls above with it. We’ll also host the library on Packagist, so you can easily install it with Composer. Stay tuned!

Translated from: https://www.sitepoint.com/diffbot-crawling-visual-machine-learning/
