In the previous post on Analyzing SitePoint Authors’ Profiles with Diffbot we built a Custom API that automatically paginates an author’s list of work and extracts his name, bio and a list of posts with basic data (URL, title and date stamp). In this post, we’ll extract the links to the author’s social networks.
If you look at the social network icons inside an author’s bio frame on their profile page, you’ll notice they vary. There can be none, or there can be eight, or anything in between. What’s worse, the links aren’t classed in any semantically meaningful way – they’re just links with an icon and a href attribute.
This makes turning them into an extractable pattern difficult, and yet that’s exactly what we’ll be doing here because hey, who doesn’t love a challenge?
To get set up, please read and go through the first part. When you’re done, re-enter the dev dashboard.
The logical approach would be to define a new collection just like for posts, but one that targets the social network links. Then, just target the href attribute on each and we’re set, right? Nope.
Observe below:
As you can see, we get all the social links. But we get them all X times, where X is the number of pages in an author’s profile. This happens because the Diffbot API concatenates the HTML of all the pages into a single big one, and our collection rule finds several sets of these social network icon-links.
Intuition might lead you to use a :first-child pseudo-class on the collection's parent on the first page, but the API doesn't work like that. The HTML contents of the individual pages are concatenated, yes, but the rules are executed on each page first; in reality, only the results are concatenated. This is why it isn't possible to use main:first-child to target the first page only. Likewise, the Diffbot API doesn't currently offer a custom :first-page pseudo-selector, though one appearing at a later stage isn't out of the question. How, then, do we do this?
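To make that distinction concrete, here's a minimal sketch in Python with BeautifulSoup of the behavior described above. It's an assumed model of the pipeline, not Diffbot's actual implementation: rules run against each page's HTML separately, and only their results are joined.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for three paginated profile pages; the social
# icon-links appear in the bio frame at the top of every page.
page_html = """
<ul class="contributor_social">
  <li><a href="https://twitter.com/example_author">Twitter</a></li>
</ul>
"""
pages = [page_html, page_html, page_html]

results = []
for html in pages:  # rules are applied per page...
    soup = BeautifulSoup(html, "html.parser")
    # rough equivalent of a collection rule ".contributor_social li"
    # with a field selecting "a" and filtering its href attribute
    for li in soup.select(".contributor_social li"):
        a = li.find("a", href=True)
        if a:
            results.append(a["href"])

print(results)  # ...and only the results are concatenated: 3 copies of 1 link
```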
Diffbot allows you to define several custom rulesets for the same API endpoint, each with a different domain regex. When the API endpoint is called, all the rulesets whose regex matches the URL are executed, the results are concatenated, and you get a single merged set back, as if it had all been defined in one API. This is what we're going to do, too.
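As a mental model (an assumption on my part, not Diffbot's documented internals), you can think of it like this:

```python
import re

# Assumed model of one endpoint with two rulesets: the first stands in for
# part one's ruleset (illustrative regex; the real one is defined in part
# one) and extracts posts, while the new ruleset below will extract social
# links from page one only.
rulesets = [
    (r"(http(s)?://)?(.*\.)?sitepoint.com/author/.*",
     lambda url: {"author": "...", "posts": ["..."]}),   # part-one rules
    (r"(http(s)?://)?(.*\.)?sitepoint.com/author/[^/]+/",
     lambda url: {"social": [{"link": "..."}]}),         # new ruleset
]

def call_endpoint(url):
    merged = {}
    for pattern, run_rules in rulesets:
        if re.fullmatch(pattern, url):     # every matching ruleset runs
            merged.update(run_rules(url))  # results merge into one response
    return merged

print(call_endpoint("https://www.sitepoint.com/author/bskvorc/"))
```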
Start off by going to “Create a rule” and selecting a Custom API, so you get asked for a name. Enter the same name as the one in the first part (in my case, AuthorFolio). Enter the typical test URL (https://www.sitepoint.com/author/bskvorc/) and run the Test. Then, change the domain regex to this:
```
(http(s)?://)?(.*\.)?sitepoint.com/author/[^/]+/
```

This tells the API to only target the first page of any author profile – it ignores pagination completely.
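Assuming the domain regex is applied as a full match against each page URL (which would explain the behavior described above – this is an illustration, not Diffbot's documented matching logic), you can check which URLs it accepts:

```python
import re

pattern = re.compile(r"(http(s)?://)?(.*\.)?sitepoint.com/author/[^/]+/")

# The profile's first page matches...
print(bool(pattern.fullmatch("https://www.sitepoint.com/author/bskvorc/")))        # True
# ...but paginated pages don't, so this ruleset never sees them.
print(bool(pattern.fullmatch("https://www.sitepoint.com/author/bskvorc/page/2/"))) # False
```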
Next, define a new collection. Call it “social” and give it the selector .contributor_social li. Then add a custom field to it: name the field “link”, give it a selector of a, and add an attribute filter of href. Save, wait for the reload, and notice that you now have the four links extracted:
But having just the links there kind of sucks, doesn’t it? It would be nice if we had a social network name, too. SitePoint’s design, however, doesn’t class them in any semantically meaningful way, so there’s no easy way to get the network name. How can we tackle this?
Regex Rewrite Filters to the rescue!
Custom fields have three available filters:
- attribute: extracts an HTML element’s attribute
- ignore: ignores certain HTML elements based on a CSS selector
- replace: replaces the content of the output with the given content if a regex pattern matches

We’ll be using the third one – read more about them here.
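As I read the docs, the replace filter behaves roughly like this sketch (an assumed model; the function name is mine):

```python
import re

def replace_filter(output: str, pattern: str, replacement: str) -> str:
    """If the regex matches the extracted output, the whole output is
    swapped for the given replacement; otherwise it passes through."""
    return replacement if re.search(pattern, output) else output

print(replace_filter("https://github.com/someone", r"^.*github.*$", "GitHub"))  # GitHub
print(replace_filter("https://example.com/blog", r"^.*github.*$", "GitHub"))    # unchanged
```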
Add a new field to our “social” collection. Give it the name “network”, the selector a, and an attribute filter of href so it extracts the link just like the “link” field. Then, add a new “replace” filter.
SitePoint author profiles can have the following social networks attached: Google+, Twitter, Facebook, Reddit, YouTube, Flickr, GitHub and LinkedIn. Luckily, each of those has a pretty straightforward URL with a full domain name, so regexing the names out is a piece of cake. The correct regex is ^.*KEYWORD.*$:
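Put together, the eight replace rules behave like this sketch (the keyword choices are my assumptions based on each network's domain name):

```python
import re

# One replace rule per network: pattern ^.*KEYWORD.*$, with the network
# name as the replacement text.
rules = [
    (r"^.*plus\.google.*$", "Google+"),
    (r"^.*twitter.*$",      "Twitter"),
    (r"^.*facebook.*$",     "Facebook"),
    (r"^.*reddit.*$",       "Reddit"),
    (r"^.*youtube.*$",      "YouTube"),
    (r"^.*flickr.*$",       "Flickr"),
    (r"^.*github.*$",       "GitHub"),
    (r"^.*linkedin.*$",     "LinkedIn"),
]

def network_name(href: str) -> str:
    for pattern, name in rules:
        if re.match(pattern, href):  # first matching rule wins
            return name
    return href  # unmatched links pass through unchanged

print(network_name("https://twitter.com/example_author"))  # Twitter
print(network_name("https://github.com/example_author"))   # GitHub
```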
Save, wait for the reload, and notice that you now have a well-formed collection of an author’s social links.
Finally, let’s fetch all this data at once. According to what we said above, executing a call to an author page with the AuthorFolio API should now give us a single JSON response containing the sum of everything we’ve defined so far, including the fields from the first post. Let’s see if that’s true. Visit the following link in your browser:
```
http://diffbot.com/api/AuthorFolio?token=xxxxxxxxx&url=https://www.sitepoint.com/author/bskvorc/
```

This is the result I get:
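If you'd rather consume the endpoint from code than the browser, here's a minimal sketch (the token is a placeholder, the field names follow what we defined in part one and above, and the exact shape of Diffbot's response wrapper may differ):

```python
import requests

params = {
    "token": "xxxxxxxxx",  # your Diffbot developer token
    "url": "https://www.sitepoint.com/author/bskvorc/",
}
data = requests.get("http://diffbot.com/api/AuthorFolio", params=params).json()

# Print each social link alongside the network name extracted above.
for entry in data.get("social", []):
    print(entry.get("network"), entry.get("link"))
```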
As you can see, we successfully merged the two APIs and got back a single result containing everything we wanted. We can now consume this API URL at will from any third party application, and pull in the portfolio of an author, easily grouping by date, detecting changes in the bio, registering newly added social networks, and much more.
In this post we looked at some trickier aspects of visual crawling with Diffbot like repeated collections and duplicate APIs on custom domain regexes. We built an endpoint that allows us to extract valuable information from an author’s profile, and we learned how to apply this knowledge to any similar situation.
Did you crawl something interesting using these techniques? Did you run into any trouble? Let us know in the comments below!
Original article: https://www.sitepoint.com/diffbot-repeated-collections-merged-apis/