lxml

Python crawler -- data parsing

Submitted by 浪尽此生 on 2019-12-17 03:26:24
Data parsing

What data parsing is and what it is for:
Concept: extracting a local subset of data from a larger set of data.
Purpose: it is what makes focused (targeted) crawling possible.

The general workflow of data parsing:
1. Locate the tag.
2. Extract its text or attributes.

Regex parsing -- a quick regex review

Single characters:
. : any character except newline
[] : e.g. [aoe], [a-w] -- matches any one character in the set
\d : a digit, same as [0-9]
\D : a non-digit
\w : a digit, letter, underscore, or other word character (including Chinese)
\W : the complement of \w
\s : any whitespace character, including space, tab, form feed, etc.; equivalent to [ \f\n\r\t\v]
\S : a non-whitespace character

Quantifiers:
* : any number of times (>= 0)
+ : at least once (>= 1)
? : optional, 0 or 1 time
{m} : exactly m times
{m,} : at least m times, e.g. hello{3,}
{m,n} : between m and n times

Boundaries:
$ : match at the end
^ : match at the start

Grouping: (ab)

Greedy mode: .*
Non-greedy (lazy) mode: .*?

Flags and helpers:
re.I : ignore case
re.M : multi-line matching
re.S : single-line mode (. also matches newlines)
re.sub(pattern, replacement, string)

Regex practice:

import re
# extract 'python'
key = "javapythonc++php"
res = re.findall('python', key)[0]  # re.findall returns a list
print(res)
# extract 'hello world'
key="<html>
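
The excerpt above is cut off mid-example. As a supplement, here is a short, self-contained sketch (not from the original post; the sample strings are invented) contrasting greedy and non-greedy matching and the re.S flag:

import re

text = "<h1>hello</h1><h1>world</h1>"

# Greedy: .* consumes as much as possible, so one match spans both tags.
print(re.findall('<h1>(.*)</h1>', text))    # ['hello</h1><h1>world']

# Non-greedy: .*? stops at the first closing tag.
print(re.findall('<h1>(.*?)</h1>', text))   # ['hello', 'world']

# re.S lets . match newlines too, so patterns can cross line breaks.
multiline = "<h1>hello\nworld</h1>"
print(re.findall('<h1>(.*?)</h1>', multiline))         # []
print(re.findall('<h1>(.*?)</h1>', multiline, re.S))   # ['hello\nworld']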

How can this function be rewritten to implement OrderedDict?

Submitted by 谁说胖子不能爱 on 2019-12-17 02:38:36
Question: I have the following function which does a crude job of parsing an XML file into a dictionary. Unfortunately, since Python dictionaries are not ordered, I am unable to cycle through the nodes as I would like. How do I change this so it outputs an ordered dictionary which reflects the original order of the nodes when looped over with for?

def simplexml_load_file(file):
    import collections
    from lxml import etree
    tree = etree.parse(file)
    root = tree.getroot()
    def xml_to_item(el):
        item = None
        if el
</gr-replace>
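
The excerpt cuts off mid-function. One possible direction (a sketch, not the thread's accepted answer): build the mapping with collections.OrderedDict instead of a plain dict, so iteration preserves document order. The helper below is a hypothetical, simplified version of the function in the question, and 'example.xml' is a placeholder file name:

import collections
from lxml import etree

def xml_to_ordered_dict(el):
    # Leaf node: just return its text.
    if len(el) == 0:
        return el.text
    # Container node: keep children in document order.
    item = collections.OrderedDict()
    for child in el:
        item[child.tag] = xml_to_ordered_dict(child)
    return item

tree = etree.parse('example.xml')
root = tree.getroot()
data = xml_to_ordered_dict(root)
for tag, value in data.items():   # iterates in original node order
    print(tag, value)

Note that in this naive sketch, repeated sibling tags overwrite one another; handling duplicates would require collecting values into lists.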

Parsing libraries: the beautifulsoup module

Submitted by 删除回忆录丶 on 2019-12-15 12:34:22
Introduction: Beautiful Soup is a Python library for extracting data from HTML or XML files. It lets you navigate, search, and modify a document in idiomatic ways through your parser of choice, and it can save you hours or even days of work. Beautiful Soup 3 is no longer developed; the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4.

# Install Beautiful Soup
pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, lxml can be installed with any of:

$ apt-get install python-lxml
$ easy_install lxml
$ pip install lxml

Another available parser is html5lib, a pure-Python implementation that parses pages the same way a browser does. It can be installed with any of:

$ apt-get install python-html5lib
$ easy_install html5lib
$ pip install html5lib
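
A minimal usage sketch (not from the original post; the HTML snippet and variable names are invented for illustration):

from bs4 import BeautifulSoup

html_doc = """
<html><body>
  <p class="title">The Dormouse's story</p>
  <a href="http://example.com/one" id="link1">One</a>
  <a href="http://example.com/two" id="link2">Two</a>
</body></html>
"""

# 'lxml' is the recommended parser; 'html.parser' works with no extra installs.
soup = BeautifulSoup(html_doc, 'lxml')

print(soup.p.get_text())        # text of the first <p>
for a in soup.find_all('a'):    # every <a> tag, in document order
    print(a['href'], a.get_text())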

XPath taking text with hyperlinks (Python)

Submitted by 自闭症网瘾萝莉.ら on 2019-12-14 03:03:44
Question: I'm new at using XPath (and I'm a relative beginner at Python in general). I'm trying to take the text out of the first paragraph of a Wikipedia page through it. Take for instance the Python page (https://en.wikipedia.org/wiki/Python_(programming_language)). If I get it into a variable:

page = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")
tree = html.fromstring(page.content)

Then I know the desired paragraph is on XPath /html/body/div[3]/div[3]/div[4]/div/p[1]. So I
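
A possible approach (a sketch, not the thread's answer): XPath's text() only returns an element's own text nodes, so text inside hyperlinks is lost; text_content() flattens the whole paragraph, links included. The class-based XPath below is an assumption about Wikipedia's markup and may need adjusting:

import requests
from lxml import html

page = requests.get("https://en.wikipedia.org/wiki/Python_(programming_language)")
tree = html.fromstring(page.content)

# text_content() concatenates the paragraph's text with the text of any
# nested <a> tags, which a plain text() selection would drop.
paragraphs = tree.xpath('//div[contains(@class, "mw-parser-output")]/p')
for p in paragraphs:
    text = p.text_content().strip()
    if text:   # skip empty placeholder paragraphs
        print(text)
        break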

xpath for img src within element

Submitted by 天涯浪子 on 2019-12-14 03:02:43
Question: How would I modify the below code so it picks out the source of any images found within the description element, which contains HTML? At the moment it just gets the full text from inside the element, and I'm not sure how to modify this to get the sources of any img tags.

>>> from lxml import etree
>>> tree = etree.parse('temp.xml')
>>> for guide in tree.xpath('guide'):
...     '---', guide.xpath('id')[0].text
...     for pages in guide.xpath('.//pages'):
...         for page in pages:
...             '------', page.xpath
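
A possible approach (a sketch; the element names and the assumption that description holds HTML as text are taken from the question): re-parse the description's text as an HTML fragment, then pull every src with the //img/@src expression:

from lxml import etree, html

tree = etree.parse('temp.xml')
for desc in tree.xpath('//description'):
    if desc.text:
        # The description holds HTML as text, so parse it as an HTML fragment,
        # then select the src attribute of every <img> inside it.
        fragment = html.fromstring(desc.text)
        for src in fragment.xpath('//img/@src'):
            print(src)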

comparing two xml files irrespective of their order

Submitted by 谁都会走 on 2019-12-14 02:32:12
Question: I am currently working on a Python project and am stuck on one little problem related to comparing two XML files with Python. Assume, for instance, that we have two XML files. File A:

<m1:time timeinterval="5">
  <m1:vehicle distance="40" speed="5"/>
  <m1:location hours="1" path="1">
    <m1:feature color="2" type="a">564</m1:feature>
    <m1:feature color="3" type="b">570</m1:feature>
    <m1:feature color="4" type="c">570</m1:feature>
  </m1:location>
  <m1:location hours="5" path="1">
    <m1
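
One common approach (a sketch, not the thread's answer): compare the trees recursively after sorting each element's children by a canonical key, so sibling order stops mattering. The file names are placeholders:

from lxml import etree

def canonical_key(el):
    # A stable key built from tag, attributes, and text, used only for sorting.
    return (el.tag, sorted(el.attrib.items()), (el.text or '').strip())

def elements_equal(a, b):
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    if (a.text or '').strip() != (b.text or '').strip():
        return False
    if len(a) != len(b):
        return False
    # Sort both child lists into canonical order before pairing them up.
    for ca, cb in zip(sorted(a, key=canonical_key), sorted(b, key=canonical_key)):
        if not elements_equal(ca, cb):
            return False
    return True

tree_a = etree.parse('a.xml')
tree_b = etree.parse('b.xml')
print(elements_equal(tree_a.getroot(), tree_b.getroot()))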

python -- lxml

Submitted by 耗尽温柔 on 2019-12-14 02:17:38
The python3 parsing library lxml

Contents:
1. Installing the lxml Python library
2. Common XPath rules
(1) Parsing nodes from a text string
(2) Reading and parsing an HTML file
(3) Getting all nodes
(4) Getting child nodes
(5) Getting parent nodes
(6) Attribute matching
(7) Getting text
(8) Getting attributes
(9) Matching multi-valued attributes
(10) Matching on multiple attributes
(11) Operators in XPath
(12) Selecting by order
(13) Node-axis selection
(14) Case study: scraping the top 20 programming languages of the TIOBE index

lxml is a Python parsing library. It supports parsing both HTML and XML, supports XPath queries, and parses very efficiently.

XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works just as well for searching HTML documents.

XPath's selection facilities are very powerful. It provides extremely concise path-selection expressions, plus more than 100 built-in functions for matching strings, numbers, and dates and for processing nodes and sequences; almost any node we might want to locate can be selected with XPath.

XPath became a W3C standard on November 16, 1999. It was designed for use by XSLT, XPointer, and other XML processing software; more documentation is available on its official site: https://www.w3.org/TR/xpath/

1. Installing the lxml Python library
Installation on Windows:
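
A short sketch (not part of the original excerpt) illustrating a few of the rules listed above, namely parsing text, attribute matching, and reading text and attributes; the HTML snippet is invented:

from lxml import etree

text = '''
<div>
  <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
  </ul>
</div>
'''

# (1) Parse nodes from a text string; etree.HTML also repairs incomplete markup.
html = etree.HTML(text)

# (6) Attribute matching combined with (7) text extraction:
print(html.xpath('//li[@class="item-0"]/a/text()'))   # ['first item']

# (8) Attribute extraction:
print(html.xpath('//li/a/@href'))   # ['link1.html', 'link2.html']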

HTML elements in lxml get incorrectly encoded like Най

Submitted by 依然范特西╮ on 2019-12-14 02:17:00
Question: I need to print the RSS link from a web page, but this link is decoded incorrectly. Here is my code:

import urllib2
from lxml import html, etree
import chardet

data = urllib2.urlopen('http://facts-and-joy.ru/')
S = data.read()
encoding = chardet.detect(S)['encoding']
#S = S.decode(encoding)
#encoding = 'utf-8'
print encoding
parser = html.HTMLParser(encoding=encoding)
content = html.document_fromstring(S, parser)
loLinks = content.xpath('//link[@type="application/rss+xml"]')
for oLink in loLinks:
    print
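
A possible fix (a sketch, not necessarily the thread's accepted answer): decode the raw bytes yourself with the detected encoding and hand lxml a unicode string, so the parser never applies a wrong charset of its own. Written for Python 2 to match the question:

# -*- coding: utf-8 -*-
import urllib2
import chardet
from lxml import html

data = urllib2.urlopen('http://facts-and-joy.ru/').read()
encoding = chardet.detect(data)['encoding']

# Decode to unicode up front; lxml then has no charset guessing to do.
text = data.decode(encoding)
content = html.document_fromstring(text)

for link in content.xpath('//link[@type="application/rss+xml"]'):
    print link.get('href')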

Beautiful Soup fetch dynamic table data

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-14 01:55:01
Question: I have the following code:

url = 'https://www.basketball-reference.com/leagues/NBA_2017_standings.html#all_expanded_standings'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
print(len(soup.findAll('table')))
print(soup.findAll('table'))

There are 6 tables on the webpage, but it only returns 4 tables. I tried using 'html.parser' and 'html5lib' as parsers, but that did not work either. Any idea how I can get the table "expanded standings" from the webpage? Thanks!

Answer 1: requests can't fetch
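
A sketch of one known workaround (not necessarily the continuation of the truncated answer): on basketball-reference.com, some tables are shipped inside HTML comments and only injected by JavaScript, so a static parse misses them; re-parsing the comment contents recovers the hidden tables:

from urllib.request import urlopen
from bs4 import BeautifulSoup, Comment

url = 'https://www.basketball-reference.com/leagues/NBA_2017_standings.html'
soup = BeautifulSoup(urlopen(url), 'lxml')

tables = soup.find_all('table')
# Tables hidden from the static HTML live inside comment nodes; parse those too.
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if '<table' in comment:
        tables.extend(BeautifulSoup(comment, 'lxml').find_all('table'))

print(len(tables))   # should now include the commented-out tables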