lxml

How to add/use libraries in Python (3.5.1)

Submitted by 老子叫甜甜 on 2019-12-11 12:43:57

Question: I've recently been playing around with Python and have now expanded into things like scraping websites and other cool stuff, and I need to import new libraries for this, like lxml, pandas, urllib2 and such. I have Python 3.5.1 installed and am also using Wing IDE. I (think I) also managed to install pip using this tutorial, but then got lost after the "Run python get-pip.py" part. So how would I go about installing those libraries to try new projects? Thanks!

Answer 1: python 3.5
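The answer breaks off after its first line, but the usual route on Python 3.5 is pip's module interface; a minimal sketch (note that urllib2 is Python 2 only; on Python 3 its replacement, urllib.request, ships with the standard library and needs no install):

```shell
# Run these in a terminal, not inside the Python interpreter.
python -m pip install --upgrade pip   # make sure pip itself is current
python -m pip install lxml pandas     # then install third-party libraries
```

Using `python -m pip` rather than a bare `pip` ensures the packages land in the same interpreter that Wing IDE is configured to run.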

lxml findall SyntaxError: invalid predicate

Submitted by 五迷三道 on 2019-12-11 12:37:45

Question: I'm trying to find elements in XML using XPath. This is my code:

```python
utf8_parser = etree.XMLParser(encoding='utf-8')
root = etree.fromstring(someString.encode('utf-8'), parser=utf8_parser)
somelist = root.findall("model/class[*/attributes/attribute/@name='var']/@name")
```

The XML in someString looks like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<model>
  <class name="B" kind="abstract">
    <inheritance>
      <from name="A" privacy="private" />
    </inheritance>
    <private>
      <methods>
        <method name="f" type="int" scope=
```
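The sample is cut off, but the error itself has a known cause: findall() speaks ElementPath, a small subset of XPath that supports neither a nested predicate like this one nor an attribute as the result, hence the "invalid predicate" SyntaxError. lxml's xpath() method accepts the full expression. A sketch against a cut-down document (the `<private>`/`<attributes>` structure is assumed, since the original sample is truncated):

```python
from lxml import etree

xml = b"""<?xml version="1.0" encoding="UTF-8"?>
<model>
  <class name="B" kind="abstract">
    <private>
      <attributes>
        <attribute name="var"/>
      </attributes>
    </private>
  </class>
</model>"""

root = etree.fromstring(xml)
# root already is <model>, so the path starts at class; xpath() happily
# evaluates the predicate that findall() rejects.
names = root.xpath("class[*/attributes/attribute/@name='var']/@name")
print(names)
```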

Parsing dtd file with lxml library (python)

Submitted by 帅比萌擦擦* on 2019-12-11 12:17:12

Question: I need your help. I use the lxml library to parse a DTD file. How can I get the c subexpression in this example?

```python
dtd = etree.DTD(StringIO('<!ELEMENT a (b,c,d)>'))
```

I tried this:

```python
content = dtd.elements()[0].content
left, right = content.left, content.right
```

but that gives the left or the right subexpression. http://lxml.de/validation.html#id1

Answer 1: I'm completely guessing (I've never touched this before) but:

```python
from io import StringIO
from lxml import etree

dtd.elements()[0].content.right.left
#>>> <lxml.etree.
```
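The answer is truncated, but its one-liner follows from how libxml2 stores a content model: the sequence (b,c,d) is kept as a right-leaning binary tree of sequence nodes, seq(b, seq(c, d)), so c sits at content.right.left. A runnable sketch:

```python
from io import StringIO
from lxml import etree

dtd = etree.DTD(StringIO('<!ELEMENT a (b,c,d)>'))
content = dtd.elements()[0].content

# (b,c,d) is stored as seq(b, seq(c, d)), so:
print(content.left.name)         # the b subexpression
print(content.right.left.name)   # the c subexpression
print(content.right.right.name)  # the d subexpression
```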

python lxml write to file in predefined order

Submitted by 自作多情 on 2019-12-11 11:56:15

Question: I want to write the following lxml etree subelements:

<Element Protocol at 0x3803048>, <Element StudyEventDef at 0x3803108>, <Element FormDef at 0x3803248>, <Element ItemGroupDef at 0x38032c8>, <Element ClinicalData at 0x3803408>, <Element ItemGroupData at 0x38035c8>, <Element FormDef at 0x38036c8>

to my ODM XML file in a predefined order, i.e.

<Element Protocol at 0x3803048>, <Element StudyEventDef at 0x3803108>, <Element FormDef at 0x3803248>, <Element FormDef at 0x38036c8>, <Element ItemGroupDef at 0x38032c8>,
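The desired list is cut off, but it appears to group elements by tag (both FormDef elements now sit together). One way to sketch that, assuming a flat parent element and a hypothetical tag-priority list, relies on Python's stable sort:

```python
from lxml import etree

# Hypothetical priority list reflecting the desired ODM tag order.
ORDER = ['Protocol', 'StudyEventDef', 'FormDef', 'ItemGroupDef',
         'ClinicalData', 'ItemGroupData']

root = etree.Element('ODM')
for tag in ('Protocol', 'StudyEventDef', 'FormDef', 'ItemGroupDef',
            'ClinicalData', 'ItemGroupData', 'FormDef'):
    etree.SubElement(root, tag)

# Slice assignment reorders the children in place; the sort is stable,
# so elements sharing a tag keep their original relative order.
root[:] = sorted(root, key=lambda el: ORDER.index(el.tag))
print([el.tag for el in root])
```

After reordering, etree.ElementTree(root).write(...) serializes the file with the children in the new order.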

Python web-page parsing: BeautifulSoup vs lxml.html

Submitted by 耗尽温柔 on 2019-12-11 11:42:25

The web-page parsing libraries most commonly used in Python are BeautifulSoup and lxml.html, of which the former is probably the better known. 熊猫 also started out with BeautifulSoup, but found a few problems with it that could not be worked around, and so ended up using lxml after all:

1. BeautifulSoup is too slow. The original program had to extract the main text from arbitrary web pages and therefore did a great deal of DOM parsing; testing showed BS to be roughly 10x slower than lxml on average. The reason is presumably that the native C code of libxml2 + libxslt is simply faster than Python.

2. BS depends on Python's bundled sgmllib, and sgmllib has at least two problems. First, it mishandles strings like class=我的CSS类 (an unquoted non-ASCII attribute value), as the following Python 2 / BeautifulSoup 3 code shows:

```python
from BeautifulSoup import BeautifulSoup
html = u'<div class=我的CSS类>hello</div>'
print BeautifulSoup(html).find('div')['class']
```

This prints a zero-length string rather than 我的CSS类. The problem can be worked around externally by rewriting sgmllib's attrfind, the regex used to find element attributes; it can be changed to sgmllib.attrfind = re
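For comparison, a small sketch suggesting that lxml.html (backed by libxml2's lenient HTML parser) recovers the same unquoted non-ASCII attribute without any patching:

```python
from lxml import html

# The same markup that defeats sgmllib: an unquoted, non-ASCII
# attribute value.
doc = html.fromstring('<div class=我的CSS类>hello</div>')
print(doc.get('class'))
print(doc.text)
```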

scrapy: Remove elements from an xpath selector

Submitted by 老子叫甜甜 on 2019-12-11 11:36:17

Question: I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning and a few at the end. Here's the gist:

```
<div id="easy-id">
<stuff I don't want>
text I don't want
<div id="another-easy-id" more stuff I don't want>
text I want
<stuff I want>
...
<more stuff I want>
text I want
...
<div id="one-more-easy-id" more stuff I *don't* want>
<more stuff I *don't* want>
```

NB: The indenting
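The excerpt stops mid-sentence, but one way to sketch "everything between two marker elements" is a sibling-axis XPath: take the first marker's following siblings that still precede the closing marker. The ids come from the question; the content below is made up for illustration (scrapy selectors accept the same XPath, shown here with plain lxml.html):

```python
from lxml import html

doc = html.fromstring("""
<div id="easy-id">
  <span>unwanted head</span>
  <div id="another-easy-id">skip me</div>
  <p>text I want</p>
  <p>more text I want</p>
  <div id="one-more-easy-id">unwanted tail</div>
</div>
""")

# Siblings after #another-easy-id that still have #one-more-easy-id
# somewhere after them, i.e. the middle slice of the container.
wanted = doc.xpath(
    "//div[@id='another-easy-id']/following-sibling::*"
    "[following-sibling::div[@id='one-more-easy-id']]"
)
print([el.text for el in wanted])
```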

lxml.etree iterparse() and parsing element completely

Submitted by 落爺英雄遲暮 on 2019-12-11 10:36:52

Question: I have an XML file with nodes that look like this:

```xml
<trkpt lat="-37.7944415" lon="144.9616159">
  <ele>41.3681107</ele>
  <time>2015-04-11T03:52:33.000Z</time>
  <speed>3.9598</speed>
</trkpt>
```

I am using lxml.etree.iterparse() to iteratively parse the tree. I loop over each trkpt element's children and want to print the text value of the child nodes, e.g.:

```python
for event, element in etree.iterparse(infile, events=("start", "end")):
    if element.tag == NAMESPACE + 'trkpt':
        for child in list(element):
```
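The snippet is truncated, but the classic pitfall with this pattern is that on a "start" event the element's children may not have been parsed yet; processing only "end" events guarantees the subtree is complete. A self-contained sketch (the GPX-like wrapper is reconstructed from the question, without the namespace for brevity):

```python
import io
from lxml import etree

xml = b"""<gpx>
  <trkpt lat="-37.7944415" lon="144.9616159">
    <ele>41.3681107</ele>
    <time>2015-04-11T03:52:33.000Z</time>
    <speed>3.9598</speed>
  </trkpt>
</gpx>"""

texts = []
# Only "end" events: the trkpt subtree is fully built at that point.
for event, element in etree.iterparse(io.BytesIO(xml), events=("end",)):
    if element.tag == 'trkpt':
        for child in element:
            texts.append(child.text)
        element.clear()  # free memory for already-processed points

print(texts)
```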

Getting all href from a code

Submitted by 人盡茶涼 on 2019-12-11 10:33:52

Question: I'm making a web crawler. To find the links in a page I was using XPath in Selenium:

```python
driver = webdriver.Firefox()
driver.get(side)
Listlinker = driver.find_elements_by_xpath("//a")
```

This worked fine. Testing the crawler, however, I found that not all links come under the a tag; href is sometimes used in area or div tags as well. Right now I'm stuck with:

```python
driver = webdriver.Firefox()
driver.get(side)
Listlinkera = driver.find_elements_by_xpath("//a")
Listlinkerdiv = driver.find_elements_by
```
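Rather than one query per tag, a single XPath can match every element that carries an href, whatever its tag: `//*[@href]`. Demonstrated below with lxml.html (the markup is made up); in Selenium the same expression works as `driver.find_elements_by_xpath("//*[@href]")`:

```python
from lxml import html

doc = html.fromstring(
    "<html><body>"
    "<a href='/a'>a link</a>"
    "<map name='m'><area href='/map'/></map>"
    "<div href='/odd'>non-standard but matched anyway</div>"
    "</body></html>")

# //*[@href] selects any element with an href attribute, regardless of tag.
hrefs = [el.get('href') for el in doc.xpath('//*[@href]')]
print(hrefs)
```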

force xpath to return a string lxml

Submitted by 不羁岁月 on 2019-12-11 09:59:35

Question: I am using lxml and I have a scraped page from Google Scholar. Following is a minimal working example and the things I have tried:

```python
In [56]: seed = "https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:machine_learning"
In [60]: page = urllib2.urlopen(seed).read()
In [63]: tree = html.fromstring(page)
In [64]: xpath = '(/html/body/div[1]/div[4]/div[2]/div/span/button[2]/@onclick)[1]'
In [65]: tree.xpath(xpath)  # first element returns as list
Out[65]: ["window.location='
```
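An attribute selection always comes back as a list from xpath(); wrapping the expression in the XPath string() function makes the call return a single Python string instead. A sketch with a static snippet standing in for the scraped Scholar page:

```python
from lxml import html

tree = html.fromstring(
    "<html><body>"
    "<button onclick=\"window.location='/citations'\">More</button>"
    "</body></html>")

as_list = tree.xpath('//button/@onclick')          # list of strings
as_str = tree.xpath('string(//button/@onclick)')   # one plain string
print(as_list)
print(as_str)
```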

How to update a KML to include a new Element with a value

Submitted by 孤者浪人 on 2019-12-11 09:57:39

Question: I have a KML that does not have the name element under the Placemark element. As such, when opening the KML in Google Earth, no name appears next to each polygon in the tree. In the original KML below there are 2 Placemark elements; each has a SimpleData element with name="ID", and the two values associated with them are FM2 and FM3 respectively.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2" xmlns:kml="http://www
```
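The KML sample is cut off, so the sketch below reconstructs a minimal version of it and then copies each Placemark's SimpleData[@name="ID"] value into a new, namespaced <name> child; Google Earth reads that element for the label in the tree:

```python
from lxml import etree

KML_NS = 'http://www.opengis.net/kml/2.2'
NSMAP = {'k': KML_NS}

# Minimal stand-in for the truncated KML: two Placemark elements whose
# SimpleData name="ID" values are FM2 and FM3.
kml = etree.fromstring(
    '<kml xmlns="{0}"><Document>'
    '<Placemark><ExtendedData><SchemaData>'
    '<SimpleData name="ID">FM2</SimpleData>'
    '</SchemaData></ExtendedData></Placemark>'
    '<Placemark><ExtendedData><SchemaData>'
    '<SimpleData name="ID">FM3</SimpleData>'
    '</SchemaData></ExtendedData></Placemark>'
    '</Document></kml>'.format(KML_NS).encode())

for pm in kml.xpath('//k:Placemark', namespaces=NSMAP):
    label = pm.xpath('string(.//k:SimpleData[@name="ID"])', namespaces=NSMAP)
    name = etree.SubElement(pm, '{%s}name' % KML_NS)
    name.text = label
    pm.insert(0, name)  # put <name> first among the Placemark's children

print([n.text for n in kml.xpath('//k:name', namespaces=NSMAP)])
```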