lxml

How to add/use libraries in Python (3.5.1)

Submitted by 老子叫甜甜 on 2019-12-11 12:43:57

Question: I've recently been playing around with Python and have now expanded into things like scraping websites and other cool stuff, and I need to import new libraries for this, like lxml, pandas, urllib2 and such. I have Python 3.5.1 installed and am also using Wing IDE. I (think I) also managed to install pip using this tutorial, but then got lost after the "Run python get-pip.py" part. So how would I go about installing those libraries to try new projects? Thanks!

Answer 1: python 3.5
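The answer breaks off after its first line, but the usual route on Python 3.5 is pip's module interface; a minimal sketch (note that urllib2 is Python 2 only; on Python 3 its replacement, urllib.request, ships with the standard library and needs no install):

```shell
# Run these in a terminal, not inside the Python interpreter.
python -m pip install --upgrade pip   # make sure pip itself is current
python -m pip install lxml pandas     # then install third-party libraries
```

Using `python -m pip` rather than a bare `pip` ensures the packages land in the same interpreter that Wing IDE is configured to run.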

lxml findall SyntaxError: invalid predicate

Submitted by 五迷三道 on 2019-12-11 12:37:45

Question: I'm trying to find elements in XML using XPath. This is my code:

```python
utf8_parser = etree.XMLParser(encoding='utf-8')
root = etree.fromstring(someString.encode('utf-8'), parser=utf8_parser)
somelist = root.findall("model/class[*/attributes/attribute/@name='var']/@name")
```

The XML in someString looks like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<model>
  <class name="B" kind="abstract">
    <inheritance>
      <from name="A" privacy="private" />
    </inheritance>
    <private>
      <methods>
        <method name="f" type="int" scope=
```
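The sample is cut off, but the error itself has a known cause: findall() speaks ElementPath, a small subset of XPath that supports neither a nested predicate like this one nor an attribute as the result, hence the "invalid predicate" SyntaxError. lxml's xpath() method accepts the full expression. A sketch against a cut-down document (the `<private>`/`<attributes>` structure is assumed, since the original sample is truncated):

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<model>
  <class name="B" kind="abstract">
    <private>
      <attributes>
        <attribute name="var"/>
      </attributes>
    </private>
  </class>
</model>"""

root = etree.fromstring(xml)
# root already is <model>, so the path starts at class; xpath() happily
# evaluates the predicate that findall() rejects.
names = root.xpath("class[*/attributes/attribute/@name='var']/@name")
print(names)
```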

Parsing dtd file with lxml library (python)

Submitted by 帅比萌擦擦* on 2019-12-11 12:17:12

Question: I need your help. I use the lxml library to parse a DTD file. How can I get the c subexpression in this example?

```python
dtd = etree.DTD(StringIO('<!ELEMENT a (b,c,d)>'))
```

I tried this:

```python
content = dtd.elements()[0].content
left, right = content.left, content.right
```

but that gives the left or the right subexpression. http://lxml.de/validation.html#id1

Answer 1: I'm completely guessing (I've never touched this before) but:

```python
from io import StringIO
from lxml import etree

dtd.elements()[0].content.right.left
#>>> <lxml.etree.
```
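The answer is truncated, but its one-liner follows from how libxml2 stores a content model: the sequence (b,c,d) is kept as a right-leaning binary tree of sequence nodes, seq(b, seq(c, d)), so c sits at content.right.left. A runnable sketch:

```python
from io import StringIO
from lxml import etree

dtd = etree.DTD(StringIO('<!ELEMENT a (b,c,d)>'))
content = dtd.elements()[0].content

# (b,c,d) is stored as seq(b, seq(c, d)), so:
print(content.left.name)         # the b subexpression
print(content.right.left.name)   # the c subexpression
print(content.right.right.name)  # the d subexpression
```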

python lxml write to file in predefined order

Submitted by 自作多情 on 2019-12-11 11:56:15

Question: I want to write the following lxml etree subelements:

<Element Protocol at 0x3803048>, <Element StudyEventDef at 0x3803108>, <Element FormDef at 0x3803248>, <Element ItemGroupDef at 0x38032c8>, <Element ClinicalData at 0x3803408>, <Element ItemGroupData at 0x38035c8>, <Element FormDef at 0x38036c8>

to my ODM XML file in a predefined order, i.e.

<Element Protocol at 0x3803048>, <Element StudyEventDef at 0x3803108>, <Element FormDef at 0x3803248>, <Element FormDef at 0x38036c8>, <Element ItemGroupDef at 0x38032c8>,
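The desired list is cut off, but it appears to group elements by tag (both FormDef elements now sit together). One way to sketch that, assuming a flat parent element and a hypothetical tag-priority list, relies on Python's stable sort:

```python
from lxml import etree

# Hypothetical priority list reflecting the desired ODM tag order.
ORDER = ['Protocol', 'StudyEventDef', 'FormDef', 'ItemGroupDef',
         'ClinicalData', 'ItemGroupData']

root = etree.Element('ODM')
for tag in ('Protocol', 'StudyEventDef', 'FormDef', 'ItemGroupDef',
            'ClinicalData', 'ItemGroupData', 'FormDef'):
    etree.SubElement(root, tag)

# Slice assignment reorders the children in place; the sort is stable,
# so elements sharing a tag keep their original relative order.
root[:] = sorted(root, key=lambda el: ORDER.index(el.tag))
print([el.tag for el in root])
```

After reordering, etree.ElementTree(root).write(...) serializes the file with the children in the new order.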

Python web-page parsing: BeautifulSoup vs lxml.html

Submitted by 耗尽温柔 on 2019-12-11 11:42:25

The web-page parsing libraries most commonly used in Python are BeautifulSoup and lxml.html, of which the former is probably the better known. 熊猫 also started out with BeautifulSoup, but found a few problems with it that could not be worked around, and so ended up using lxml after all:

1. BeautifulSoup is too slow. The original program had to extract the main text from arbitrary web pages and therefore did a great deal of DOM parsing; testing showed BS to be roughly 10x slower than lxml on average. The reason is presumably that the native C code of libxml2 + libxslt is simply faster than Python.

2. BS depends on Python's bundled sgmllib, and sgmllib has at least two problems. First, it mishandles strings like class=我的CSS类 (an unquoted non-ASCII attribute value), as the following Python 2 / BeautifulSoup 3 code shows:

```python
from BeautifulSoup import BeautifulSoup
html = u'<div class=我的CSS类>hello</div>'
print BeautifulSoup(html).find('div')['class']
```

This prints a zero-length string rather than 我的CSS类. The problem can be worked around externally by rewriting sgmllib's attrfind, the regex used to find element attributes; it can be changed to sgmllib.attrfind = re
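For comparison, a small sketch suggesting that lxml.html (backed by libxml2's lenient HTML parser) recovers the same unquoted non-ASCII attribute without any patching:

```python
from lxml import html

# The same markup that defeats sgmllib: an unquoted, non-ASCII
# attribute value.
doc = html.fromstring('<div class=我的CSS类>hello</div>')
print(doc.get('class'))
print(doc.text)
```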

scrapy: Remove elements from an xpath selector

Submitted by 老子叫甜甜 on 2019-12-11 11:36:17

Question: I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning and a few at the end. Here's the gist:

```
<div id="easy-id">
<stuff I don't want>
text I don't want
<div id="another-easy-id" more stuff I don't want>
text I want
<stuff I want>
...
<more stuff I want>
text I want
...
<div id="one-more-easy-id" more stuff I *don't* want>
<more stuff I *don't* want>
```

NB: The indenting
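The excerpt stops mid-sentence, but one way to sketch "everything between two marker elements" is a sibling-axis XPath: take the first marker's following siblings that still precede the closing marker. The ids come from the question; the content below is made up for illustration (scrapy selectors accept the same XPath, shown here with plain lxml.html):

```python
from lxml import html

doc = html.fromstring("""
<div id="easy-id">
  <span>unwanted head</span>
  <div id="another-easy-id">skip me</div>
  <p>text I want</p>
  <p>more text I want</p>
  <div id="one-more-easy-id">unwanted tail</div>
</div>
""")

# Siblings after #another-easy-id that still have #one-more-easy-id
# somewhere after them, i.e. the middle slice of the container.
wanted = doc.xpath(
    "//div[@id='another-easy-id']/following-sibling::*"
    "[following-sibling::div[@id='one-more-easy-id']]"
)
print([el.text for el in wanted])
```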

lxml.etree iterparse() and parsing element completely

Submitted by 落爺英雄遲暮 on 2019-12-11 10:36:52

Question: I have an XML file with nodes that look like this:

```xml
<trkpt lat="-37.7944415" lon="144.9616159">
  <ele>41.3681107</ele>
  <time>2015-04-11T03:52:33.000Z</time>
  <speed>3.9598</speed>
</trkpt>
```

I am using lxml.etree.iterparse() to iteratively parse the tree. I loop over each trkpt element's children and want to print the text value of the child nodes, e.g.:

```python
for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt':
        for child in list(element):
```
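The snippet is truncated, but the classic pitfall with this pattern is that on a "start" event the element's children may not have been parsed yet; processing only "end" events guarantees the subtree is complete. A self-contained sketch (the GPX-like wrapper is reconstructed from the question, without the namespace for brevity):

```python
import io
from lxml import etree

xml = b"""<gpx>
  <trkpt lat="-37.7944415" lon="144.9616159">
    <ele>41.3681107</ele>
    <time>2015-04-11T03:52:33.000Z</time>
    <speed>3.9598</speed>
  </trkpt>
</gpx>"""

texts = []
# Only "end" events: the trkpt subtree is fully built at that point.
for event, element in etree.iterparse(io.BytesIO(xml), events=("end",)):
    if element.tag == 'trkpt':
        for child in element:
            texts.append(child.text)
        element.clear()  # free memory for already-processed points

print(texts)
```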

Getting all href from a code

Submitted by 人盡茶涼 on 2019-12-11 10:33:52

Question: I'm making a web crawler. To find the links in a page I was using XPath in Selenium:

```python
driver = webdriver.Firefox()
driver.get(side)
Listlinker = driver.find_elements_by_xpath("//a")
```

This worked fine. Testing the crawler, however, I found that not all links come under the a tag; href is sometimes used in area or div tags as well. Right now I'm stuck with:

```python
driver = webdriver.Firefox()
driver.get(side)
Listlinkera = driver.find_elements_by_xpath("//a")
Listlinkerdiv = driver.find_elements_by
```
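Rather than one query per tag, a single XPath can match every element that carries an href, whatever its tag: `//*[@href]`. Demonstrated below with lxml.html (the markup is made up); in Selenium the same expression works as `driver.find_elements_by_xpath("//*[@href]")`:

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    "<a href='/a'>a link</a>"
    "<map name='m'><area href='/map'/></map>"
    "<div href='/odd'>non-standard but matched anyway</div>"
    "</body></html>")

# //*[@href] selects any element with an href attribute, regardless of tag.
hrefs = [el.get('href') for el in doc.xpath('//*[@href]')]
print(hrefs)
```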

force xpath to return a string lxml

Submitted by 不羁岁月 on 2019-12-11 09:59:35

Question: I am using lxml and I have a scraped page from Google Scholar. Following is a minimal working example and the things I have tried:

```python
In [56]: seed = "https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:machine_learning"
In [60]: page = urllib2.urlopen(seed).read()
In [63]: tree = html.fromstring(page)
In [64]: xpath = '(/html/body/div[1]/div[4]/div[2]/div/span/button[2]/@onclick)[1]'
In [65]: tree.xpath(xpath)  # first element returns as list
Out[65]: ["window.location='
```
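An attribute selection always comes back as a list from xpath(); wrapping the expression in the XPath string() function makes the call return a single Python string instead. A sketch with a static snippet standing in for the scraped Scholar page:

```python
from lxml import html

tree = html.fromstring(
    "<html><body>"
    "<button onclick=\"window.location='/citations'\">More</button>"
    "</body></html>")

as_list = tree.xpath('//button/@onclick')          # list of strings
as_str = tree.xpath('string(//button/@onclick)')   # one plain string
print(as_list)
print(as_str)
```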

How to update a KML to include a new Element with a value

Submitted by 孤者浪人 on 2019-12-11 09:57:39

Question: I have a KML that does not have the name element under the Placemark element. As such, when opening the KML in Google Earth, no name appears next to each polygon in the tree. In the original KML below there are 2 Placemark elements; each has a SimpleData element with name="ID", and the two values associated with them are FM2 and FM3 respectively.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www
```
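The KML sample is cut off, so the sketch below reconstructs a minimal version of it and then copies each Placemark's SimpleData[@name="ID"] value into a new, namespaced <name> child; Google Earth reads that element for the label in the tree:

```python
from lxml import etree

KML_NS = 'http://www.opengis.net/kml/2.2'
NSMAP = {'k': KML_NS}

# Minimal stand-in for the truncated KML: two Placemark elements whose
# SimpleData name="ID" values are FM2 and FM3.
kml = etree.fromstring(
    '<kml xmlns="{0}"><Document>'
    '<Placemark><ExtendedData><SchemaData>'
    '<SimpleData name="ID">FM2</SimpleData>'
    '</SchemaData></ExtendedData></Placemark>'
    '<Placemark><ExtendedData><SchemaData>'
    '<SimpleData name="ID">FM3</SimpleData>'
    '</SchemaData></ExtendedData></Placemark>'
    '</Document></kml>'.format(KML_NS).encode())

for pm in kml.xpath('//k:Placemark', namespaces=NSMAP):
    label = pm.xpath('string(.//k:SimpleData[@name="ID"])', namespaces=NSMAP)
    name = etree.SubElement(pm, '{%s}name' % KML_NS)
    name.text = label
    pm.insert(0, name)  # put <name> first among the Placemark's children

print([n.text for n in kml.xpath('//k:name', namespaces=NSMAP)])
```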