XPath Syntax and the lxml Module (Part 2)

tech, 2025-10-19

1. Parsing a string:

# Parse an HTML string
from lxml import etree

text = """
<ul>
    <li><a href="#">a</a></li>
    <li><a href="#">a</a></li>
    <li><a href="#">a</a></li>
    <li><a href="#">a</a></li>
    <li><a href="#">a</a></li>
</ul>
"""

htmlElement = etree.HTML(text)  # etree.HTML(text) parses the string as HTML
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))
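One detail worth knowing here: `etree.HTML` uses a lenient HTML parser, so a bare fragment such as the `<ul>` above comes back wrapped in `<html>` and `<body>` elements. The sketch below is a minimal illustration, assuming lxml is installed; the `fragment` string and variable names are illustrative and not from the original post.

```python
from lxml import etree

# Illustrative fragment (not from the original post): no <html> or <body> wrapper.
fragment = '<ul><li><a href="#">a</a></li><li><a href="#">b</a></li></ul>'

root = etree.HTML(fragment)

# The HTML parser supplies the missing <html>/<body> wrapper elements.
root_tag = root.tag
body_children = [child.tag for child in root.find("body")]

# XPath queries work on the repaired tree as usual.
link_texts = root.xpath("//li/a/text()")

print(root_tag, body_children, link_texts)
```

This is why the serialized output of the example above contains more tags than the input string did.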

2. Parsing an HTML file:

# Parse an HTML file
from lxml import etree

# htmlElement = etree.parse('tencent.html')
# print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))

# The Lagou HTML is not well-formed, so etree.parse fails with its default parser.
# In that case we create an HTML parser ourselves and pass it in.
parser = etree.HTMLParser(encoding='utf-8')
htmlElement = etree.parse('lagou.html', parser=parser)
print(etree.tostring(htmlElement, encoding='utf-8').decode('utf-8'))
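The difference between the two parsers can be reproduced without any file on disk. The following is a small sketch, assuming lxml is installed; the `broken` markup is an illustrative stand-in, not the actual lagou.html, and it is fed to `etree.parse` through an in-memory `BytesIO` object.

```python
from io import BytesIO
from lxml import etree

# Illustrative markup (not the lagou.html file): <br> and <li> are never closed.
broken = b"<ul><li>one<br><li>two</ul>"

# By default etree.parse uses an XML parser, which rejects this input.
try:
    etree.parse(BytesIO(broken))
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# An explicit HTMLParser recovers from the missing closing tags.
html_parser = etree.HTMLParser(encoding="utf-8")
tree = etree.parse(BytesIO(broken), parser=html_parser)
items = tree.xpath("//li")

print(xml_ok, len(items))
```

Passing an `HTMLParser` for every HTML page, well-formed or not, is a reasonable default when scraping, since real-world pages rarely satisfy XML rules.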

3. Getting tags and related information:

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse('tencent.html', parser=parser)
# print(html)

# Get all tr tags
# trs = html.xpath("//tr")  # xpath() returns a list
#
# for tr in trs:
#     print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))  # serialize to a string

# Get the second tr tag
# trs = html.xpath("//tr[2]")  # xpath() returns a list
tr = html.xpath("//tr[2]")[0]  # xpath() returns a list, so take its first element
print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))

# Get all tr tags whose class is "even"
# trs = html.xpath("//a[@target='_blank']")
trs = html.xpath("//tr[@class='even']")
for tr in trs:
    print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))  # serialize to a string

# Get the href attribute of every a tag
aList = html.xpath("//a/@href")
for a in aList:
    print("http://hr.tencent.com/" + a)

# Get all job postings
positions = []  # holds the final result
trs = html.xpath("//tr[position()>1]")  # skip the header row
for tr in trs:
    # A leading "." makes each path relative to the current tr
    href = tr.xpath(".//a/@href")[0]
    full_url = 'http://hr.tencent.com/' + href
    title = tr.xpath("./td[1]//text()")[0]
    category = tr.xpath("./td[2]//text()")[0]
    nums = tr.xpath("./td[3]//text()")[0]
    base = tr.xpath("./td[4]//text()")[0]
    pub_time = tr.xpath("./td[5]//text()")[0]

    position = {
        'title': title,
        'category': category,
        'nums': nums,
        'base': base,
        'pub_time': pub_time,
        'url': full_url,
    }
    positions.append(position)

print(positions)
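The extraction loop above indexes `[0]` on every XPath result, which raises `IndexError` as soon as a cell is empty, and it builds URLs by string concatenation. One possible hardening is sketched below, assuming lxml is installed. The `first_text` helper, the `BASE_URL` constant, and the inline `sample` table are all hypothetical and introduced only for illustration; `urljoin` comes from the standard library.

```python
from urllib.parse import urljoin
from lxml import etree

BASE_URL = "http://hr.tencent.com/"

# Cut-down stand-in for tencent.html (illustrative data): a header row plus one posting
# whose second cell is deliberately empty.
sample = """
<table>
  <tr class="h"><td>title</td><td>category</td></tr>
  <tr class="even">
    <td><a href="position_detail.php?id=1&amp;keywords=python">Engineer</a></td>
    <td></td>
  </tr>
</table>
"""

def first_text(node, path, default=""):
    """Hypothetical helper: first string matched by an XPath, stripped, or a default."""
    matches = node.xpath(path)
    return matches[0].strip() if matches else default

root = etree.HTML(sample)
positions = []
for tr in root.xpath("//tr[position()>1]"):  # skip the header row
    href = first_text(tr, ".//a/@href")
    positions.append({
        "title": first_text(tr, "./td[1]//text()"),
        "category": first_text(tr, "./td[2]//text()", default="n/a"),
        "url": urljoin(BASE_URL, href),  # resolves the relative href against the base
    })

print(positions)
```

Note that the `&amp;` entity in the `href` is already decoded to `&` by the parser, so the joined URL can be requested as-is.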

The tencent.html file used above contains the following:

<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">职位名称</td>
            <td>职位类别</td>
            <td>人数</td>
            <td>地点</td>
            <td>发布时间</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&amp;keywords=python&amp;tid=87&amp;lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&amp;keywords=python&amp;tid=87&amp;lid=2218">22989-金融云高级后台开发</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&amp;keywords=python&amp;tid=87&amp;lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&amp;keywords=python&amp;tid=87&amp;lid=2218">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&amp;keywords=python&amp;tid=87&amp;lid=2218">TEG03-高级研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&amp;keywords=python&amp;tid=87&amp;lid=2218">TEG03-高级图像算法研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&amp;keywords=python&amp;tid=87&amp;lid=2218">TEG11-高级AI开发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>4</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&amp;keywords=python&amp;tid=87&amp;lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&amp;keywords=python&amp;tid=87&amp;lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34511&amp;keywords=python&amp;tid=87&amp;lid=2218">SNG11-高级业务运维工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>

You can simply create this file in PyCharm.
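One thing to watch for when querying tencent.html by class: the first cell of each posting row carries `class="l square"`, and an XPath test such as `@class='l'` compares the whole attribute string, not individual class names. The sketch below shows the difference, assuming lxml is installed; the `snippet` markup is a reduced, illustrative version of the file, and the token-matching expression is a common XPath 1.0 idiom rather than something from the original post.

```python
from lxml import etree

# Reduced, illustrative version of the table: one header cell and one posting cell.
snippet = (
    '<table>'
    '<tr><td class="l">Header</td></tr>'
    '<tr><td class="l square"><a href="#">Job</a></td></tr>'
    '</table>'
)
root = etree.HTML(snippet)

# Exact comparison: only the cell whose class attribute is exactly "l".
exact = root.xpath("//td[@class='l']")

# Substring match: any cell whose class attribute contains "square".
partial = root.xpath("//td[contains(@class, 'square')]")

# Token match: cells that have "l" as one of their space-separated class names.
token = root.xpath(
    "//td[contains(concat(' ', normalize-space(@class), ' '), ' l ')]"
)

print(len(exact), len(partial), len(token))
```

The token form is more verbose, but it avoids false positives such as matching `class="large"` when looking for `l`.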
