Python for Beginners --- BeautifulSoup

2024-01-23

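Using BeautifulSoup

BeautifulSoup is a flexible, convenient library for parsing web pages. It is efficient and supports several parsers. With it you can extract information from a page without writing regular expressions.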

Installation

Run pip install beautifulsoup4, or search for beautifulsoup4 on PyCharm's third-party package installation page and install it from there.
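A quick way to confirm that the installation worked is to import the package and print its version. This is only a minimal sketch; the one detail worth noticing is that the distribution is called beautifulsoup4 while the module you import is bs4.

```python
import bs4

# The distribution is named beautifulsoup4, but the importable module is bs4.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```

If the import fails with ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.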

Usage

Parsers

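| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, 'html.parser') | Built into Python; moderate speed; tolerant of malformed documents | Less tolerant of malformed documents on old Python versions |
| lxml HTML parser | BeautifulSoup(markup, 'lxml') | Fast; tolerant of malformed documents | Requires the lxml package, which is a C extension |
| lxml XML parser | BeautifulSoup(markup, 'xml') | Fast; the only parser that supports XML | Requires the lxml package, which is a C extension |
| html5lib | BeautifulSoup(markup, 'html5lib') | Most tolerant; parses the page the way a browser does; produces valid HTML5 | Slow; requires the separate html5lib package (pure Python, no C extension) |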
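Because lxml and html5lib are optional installs, a script can try the preferred parser first and fall back to the built-in one. The sketch below assumes that approach; make_soup is a hypothetical helper written for this example, not part of Beautiful Soup, while FeatureNotFound is the exception Beautiful Soup raises when the requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    make_soup is a hypothetical helper, not part of Beautiful Soup.
    """
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name), parser_name
        except FeatureNotFound:
            # Raised when the requested parser library is missing.
            continue
    raise RuntimeError("none of the requested parsers is installed")


soup, used_parser = make_soup("<p>Hello")
print(used_parser)    # whichever parser was found first
print(soup.p.string)  # Hello
```

Since html.parser ships with Python, the loop always succeeds as long as it is kept at the end of the preference list.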

Basic usage

Create a parser object with BeautifulSoup(html_text, parser_name):

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1"></a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    soup = BeautifulSoup(html, 'lxml')  # parse with the lxml HTML parser
    print(soup.prettify())              # indented markup, with missing tags filled in
    print(soup.title.string)            # The Dormouse's story

Note: the parser completes missing tags automatically; here it adds the closing </body> and </html>.
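The tag completion is easy to see on a tiny fragment. The sketch below uses the built-in html.parser so that it runs without extra installs; the fragment itself is made up for illustration. Unlike lxml, html.parser closes the open tags but does not wrap the result in html and body tags.

```python
from bs4 import BeautifulSoup

# Neither <p> nor <b> is closed in this fragment.
broken = "<p>Hello <b>world"
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)
print(repaired)  # <p>Hello <b>world</b></p>
```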

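Tag selectors

parser_object.tag_name returns the first tag with that name:

    # Get the title tag
    print(soup.title)
    print(type(soup.title))  # <class 'bs4.element.Tag'>
    # Get the head tag
    print(soup.head)
    # Get the first p tag
    print(soup.p)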
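Two behaviours of tag selectors are worth seeing side by side: they return only the first match, and they return None when the tag does not exist. The snippet below is a small sketch on a made-up fragment, parsed with html.parser to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

html = """
<p><a href="http://example.com/lacie" id="link2">Lacie</a> and
<a href="http://example.com/tillie" id="link3">Tillie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a   # only the first <a>, even though there are two
missing = soup.table  # no <table> in the document, so this is None

print(first_link["id"])  # link2
print(missing)           # None

# Guard before chaining, otherwise None.string raises AttributeError.
table_text = missing.string if missing is not None else ""
```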

Getting the name

tag.name returns the tag's name:

    print(soup.title.name)  # title

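Getting attributes

tag.attrs returns a dictionary of all the tag's attributes and their values; tag.attrs[attribute_name] returns the value of one attribute:

    print(soup.a.attrs['href'])  # http://example.com/elsie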
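The following sketch pulls the name and attribute lookups together on a single made-up link. It also shows two details that tend to surprise newcomers: the class attribute comes back as a list, and tag.get() returns None for a missing attribute where tag[...] would raise KeyError.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
link = BeautifulSoup(html, "html.parser").a

print(link.name)           # a
print(link.attrs["href"])  # http://example.com/elsie
print(link["href"])        # same value; tag[...] is shorthand for tag.attrs[...]
print(link["class"])       # ['sister', 'big'] (class is multi-valued, so a list)
print(link.get("title"))   # None; .get() avoids KeyError for a missing attribute
```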

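Getting content

tag.string returns the text inside the tag. If the tag's only child is another tag, it returns that child's text; if the tag contains both text and child tags, it returns None.

tag.get_text() returns all the text inside the tag, including the text of every nested tag, joined into one string.

tag.contents returns the tag's direct children as a list whose elements are text nodes and child tags.

    print(soup.p.string)  # The Dormouse's story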
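The difference between the three accessors shows up as soon as a tag mixes text with child tags. The sketch below uses a shortened, made-up version of the story markup and the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = """<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a href="http://example.com/lacie">Lacie</a> lived in a well.</p>"""
soup = BeautifulSoup(html, "html.parser")
title, story = soup.find_all("p")

print(title.string)         # The Dormouse's story (taken from the single <b> child)
print(story.string)         # None (text and a child tag are mixed)
print(story.get_text())     # Once upon a time Lacie lived in a well.
print(len(story.contents))  # 3: text, <a> tag, text
```

In practice get_text() is the safer default when a tag may contain nested markup, because .string silently yields None in that case.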

Nested selection

Tag selectors can be chained, as in parser_object.tag1.tag2:

    print(soup.head.title.string)  # The Dormouse's story

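Children and descendants

Children: tag.children. Descendants: tag.descendants. Both are iterators, while tag.contents holds the children as a list.

    print(soup.p.contents)
    # The sample document above has no div tag, so walk the body tag instead
    for x in soup.body.children:
        print('x:', x)
    for x in soup.body.descendants:
        print('x:', x)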
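To make the distinction concrete, the sketch below counts both on a small made-up fragment. Note that text nodes count as nodes too, which is why the example filters on Tag when it only wants elements.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><p>one <b>bold</b></p><p>two</p></div>"
div = BeautifulSoup(html, "html.parser").div

children = list(div.children)        # direct children only
descendants = list(div.descendants)  # every nested node, depth first

print(len(children))     # 2
print(len(descendants))  # 6

# Text nodes are included, so filter on Tag to keep only elements.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]
print(tag_names)         # ['p', 'b', 'p']
```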

Parents and ancestors

Parent: tag.parent. Ancestors: tag.parents.

    # The sample document above has no span tag, so start from the first a tag
    print(soup.a.parent)
    for x in soup.a.parents:
        print('x:', x)
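Printing whole ancestor tags produces a lot of output, so it can be clearer to print only their names. The sketch below does that on a made-up one-link document; the last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

html = '<html><body><p><a href="http://example.com/elsie">Elsie</a></p></body></html>'
soup = BeautifulSoup(html, "html.parser")

link = soup.a
parent_name = link.parent.name
ancestor_names = [ancestor.name for ancestor in link.parents]

print(parent_name)     # p
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```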

Siblings

tag.next_siblings yields the nodes that follow the tag under the same parent; tag.previous_siblings yields the ones that precede it.

    print(list(enumerate(soup.a.next_siblings)))
    print(list(enumerate(soup.a.previous_siblings)))
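Sibling iteration includes the text between tags, which is easy to overlook. The sketch below, on a made-up fragment, shows the raw sibling list and one way to skip the text nodes with find_next_sibling().

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
first_link = BeautifulSoup(html, "html.parser").a

following = list(first_link.next_siblings)
print(len(following))  # 4: ', ', <a id="link2">, ' and ', <a id="link3">

# find_next_sibling() can filter by tag name and so skips the text nodes.
next_link = first_link.find_next_sibling("a")
print(next_link["id"])  # link2

preceding = list(first_link.previous_siblings)
print(preceding)  # [] (nothing precedes the first link)
```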

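Standard selectors

- Find tags by name: parser_object_or_tag.find_all(tag_name)
- Find tags by attribute value: parser_object_or_tag.find_all(attrs={attribute_name: attribute_value})
- Find by text content: parser_object_or_tag.find_all(text=content) (rarely useful; newer versions of Beautiful Soup name this argument string)

find_all returns every match as a list. find takes the same arguments but returns only the first match, or None if nothing matches.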

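    # These calls assume a page that contains ul lists such as <ul id="list-1">
    # with <li class="element"> items; the sample document above has none.
    print(soup.find_all('ul'))
    print(type(soup.find_all('ul')[0]))
    for ul in soup.find_all('ul'):
        print(ul.find_all('li'))
    print(soup.find_all(attrs={'id': 'list-1'}))
    print(soup.find_all(attrs={'name': 'elements'}))
    print(soup.find_all(id='list-1'))
    print(soup.find_all(class_='element'))
    print(soup.find_all(text='Foo'))

- find_parents() returns all matching ancestors; find_parent() returns the nearest one (the direct parent when no filter is given).
- find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.
- find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.
- find_all_next() returns all matching nodes after the tag; find_next() returns the first one.
- find_all_previous() returns all matching nodes before the tag; find_previous() returns the first one.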
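The calls above need a page with lists in it. The sketch below supplies a hypothetical page, made up for illustration and not taken from the original post, and runs the main lookups against it with html.parser. The name attribute has to go through attrs because name is also the first positional parameter of find_all, and string= is the newer spelling of text=.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for illustration.
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <ul class="list" id="list-1" name="elements">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list" id="list-2">
    <li class="element">Foo</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

all_lists = soup.find_all("ul")                            # every <ul>
first_list = soup.find("ul")                               # first <ul> only
by_name_attr = soup.find_all(attrs={"name": "elements"})   # attribute called "name"
items = soup.find_all(class_="element")                    # class_ avoids the keyword
foo_strings = soup.find_all(string="Foo")                  # matches text nodes, not tags
no_match = soup.find("table")                              # nothing found

print(len(all_lists))     # 2
print(first_list["id"])   # list-1
print(len(by_name_attr))  # 1
print(len(items))         # 3
print(len(foo_strings))   # 2
print(no_match)           # None

# Relative searches start from an existing tag.
second_item = items[0].find_next_sibling("li")
owner_id = items[0].find_parent("ul")["id"]
print(second_item.string)  # Bar
print(owner_id)            # list-1
```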

CSS selectors

tag.select(css_selector) returns a list of every tag that matches the selector:

    print(soup.select('.panel .panel-heading'))
    print(soup.select('ul li'))
    print(soup.select('#list-2 .element'))
    print(type(soup.select('ul')[0]))
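The selectors above refer to panel and list markup, so the sketch below runs them against a hypothetical page made up for illustration. It assumes a reasonably recent beautifulsoup4, where CSS selection is provided by the soupsieve package that pip normally installs alongside it. select_one() is the single-result counterpart of select().

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for illustration.
html = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <ul class="list" id="list-1">
    <li class="element">Foo</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list" id="list-2">
    <li class="element">Jay</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

headings = soup.select(".panel .panel-heading")        # descendant by class
items = soup.select("ul li")                           # every <li> inside a <ul>
second_list_items = soup.select("#list-2 .element")    # scoped by id
first_item = soup.select_one("#list-1 li")             # first match or None

print(len(headings))                     # 1
print(len(items))                        # 3
print(second_list_items[0].get_text())   # Jay
print(first_item.get_text())             # Foo
```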

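Summary

- Prefer the lxml parser; fall back to html.parser when necessary.
- Tag selectors (soup.tag_name) are fast but offer little filtering.
- Use find() to match a single result and find_all() to match multiple results.
- If you are comfortable with CSS selectors, use select().
- Memorize the common ways of getting attribute values and text.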
