lxml

How do I scrape an https page? [duplicate]

送分小仙女, submitted on 2019-12-07 18:48:59
Question: This question already has answers here: Python Requests throwing SSLError (22 answers). Closed 5 years ago. I'm using a Python script with lxml and requests to scrape a web page. My goal is to grab an element from the page and download it, but the content is on an HTTPS page and I'm getting an error when trying to access it. I'm sure there is some kind of certificate or authentication I have to include, but I'm struggling to find the right resources. I'm using: page = …
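A minimal sketch of the usual fix: requests verifies server certificates by default, and a custom CA bundle can be passed when the site uses a private certificate. The function name and URL handling here are illustrative, not from the original question.

```python
# Sketch: fetch an HTTPS page with requests and parse it with lxml.
# Assumes requests and lxml are installed; the function name is made up.
import requests
from lxml import html

def fetch_tree(url, cafile=None):
    # verify=True (the default) checks the server certificate against the
    # bundled CA store; pass a path to a custom CA bundle for private CAs.
    resp = requests.get(url, verify=cafile if cafile else True, timeout=10)
    resp.raise_for_status()
    return html.fromstring(resp.content)
```

Disabling verification with `verify=False` silences the SSLError too, but it removes the protection TLS is supposed to give, so a proper CA bundle is the better route.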

lxml xpath in python, how to handle missing tags?

半城伤御伤魂, submitted on 2019-12-07 17:17:05
Question: Suppose I want to parse the following XML with an lxml XPath expression: <pack xmlns="http://ns.qubic.tv/2010/item"> <packitem> <duration>520</duration> <max_count>14</max_count> </packitem> <packitem> <duration>12</duration> </packitem> </pack> This is a variation of what can be found at http://python-thoughts.blogspot.fr/2012/01/default-value-for-text-function-using.html. How can I parse the different elements so that, once zipped (with the zip or izip Python function …
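One common approach, sketched below: instead of running separate XPath queries for durations and counts (which get out of step when a tag is missing), iterate per <packitem> and use findtext with a default, so each item yields an aligned pair.

```python
# Sketch: per-item extraction so a missing <max_count> becomes a default
# value instead of silently shifting the zipped lists.
from lxml import etree

xml = b"""<pack xmlns="http://ns.qubic.tv/2010/item">
  <packitem><duration>520</duration><max_count>14</max_count></packitem>
  <packitem><duration>12</duration></packitem>
</pack>"""

ns = {"i": "http://ns.qubic.tv/2010/item"}
root = etree.fromstring(xml)

rows = []
for item in root.xpath("i:packitem", namespaces=ns):
    duration = item.findtext("i:duration", namespaces=ns)
    # default=None marks the absent tag explicitly
    max_count = item.findtext("i:max_count", default=None, namespaces=ns)
    rows.append((duration, max_count))

print(rows)  # [('520', '14'), ('12', None)]
```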

lxml.etree fromstring() and tostring() are not returning the same data

家住魔仙堡, submitted on 2019-12-07 14:26:37
Question: I'm learning lxml (after using ElementTree) and I'm baffled as to why .fromstring and .tostring do not appear to be reversible. Here's my example: import lxml.etree as ET f = open('somefile.xml','r') data = f.read() tree_in = ET.fromstring(data) tree_out = ET.tostring(tree_in) f2 = open('samefile.xml','w') f2.write(tree_out) f2.close 'somefile.xml' was 132 KB. 'samefile.xml', the output, was 113 KB, and it is missing the end of the file from some arbitrary point. The closing tags of the overall …
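A sketch of a lossless round trip. Two details in the quoted snippet likely explain the truncation: tostring() returns bytes (so the output file should be opened in binary mode), and `f2.close` without parentheses never actually closes the file, so the tail of the write buffer is never flushed to disk.

```python
# Sketch: write tostring() output safely. A context manager guarantees
# the file is flushed and closed; binary mode matches the bytes output.
import os
import tempfile
from lxml import etree

data = b"<root><child>some text</child><child>more</child></root>"
tree = etree.fromstring(data)
out_bytes = etree.tostring(tree)

path = os.path.join(tempfile.mkdtemp(), "samefile.xml")
with open(path, "wb") as f:   # binary mode; closed (and flushed) on exit
    f.write(out_bytes)

with open(path, "rb") as f:
    assert f.read() == out_bytes  # nothing lost in the round trip
```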

Installing lxml in virtualenv via pip install error: command 'x86_64-linux-gnu-gcc' failed

心不动则不痛, submitted on 2019-12-07 13:30:31
Question: When I activate a virtualenv and type 'pip install lxml', the installation process crashes with the message: /usr/bin/ld: cannot find -lz collect2: error: ld returned 1 exit status error: command 'x86_64-linux-gnu-gcc' failed with exit status 1 Answer 1: The error you have to pay attention to is the first one, "/usr/bin/ld: cannot find -lz": that means you don't have zlib-dev installed. Depending on your Linux distribution it could be named zlib-dev, or zlib1g-dev on Ubuntu; I don't know about other distros. Answer 2: …
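Following the answer above, a typical Debian/Ubuntu fix looks like this; the package names are assumed for Debian-family distros and differ elsewhere (e.g. zlib-devel, libxml2-devel, libxslt-devel on Fedora/CentOS).

```shell
# Assumed Debian/Ubuntu package names; adjust for your distribution.
# lxml compiles against zlib, libxml2 and libxslt, so the dev headers
# for all three are usually needed.
sudo apt-get install zlib1g-dev libxml2-dev libxslt1-dev
pip install lxml
```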

Error when executing "from lxml import etree" in the Python command line after successfully installing lxml via pip

…衆ロ難τιáo~, submitted on 2019-12-07 12:18:15
Question: bash-3.2$ pip install lxml-2.3.5.tgz Unpacking ./lxml-2.3.5.tgz Running setup.py egg_info for package from file:///Users/apple/workspace/pythonhome/misc/lxml-2.3.5.tgz Building lxml version 2.3.5. Building with Cython 0.17. Using build configuration of libxslt 1.1.27 Building against libxml2/libxslt in the following directory: /usr/local/lib warning: no previously-included files found matching '*.py' Installing collected packages: lxml Running setup.py install for lxml Building lxml version 2…
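A hedged diagnostic for the "installed but won't import" situation: print where the module is actually loaded from and which libxml2/libxslt versions it was compiled against versus what it is running with. A version mismatch, or a path pointing into the unpacked source tree instead of site-packages, is a common cause of this failure.

```python
# Diagnostic sketch: compare compile-time and runtime library versions.
# These version attributes are part of lxml.etree's public API.
import lxml.etree as etree

print("loaded from:", getattr(etree, "__file__", "?"))
print("lxml version:", etree.LXML_VERSION)
print("libxml2 compiled:", etree.LIBXML_COMPILED_VERSION,
      "running:", etree.LIBXML_VERSION)
print("libxslt compiled:", etree.LIBXSLT_COMPILED_VERSION,
      "running:", etree.LIBXSLT_VERSION)
```

Also make sure the import is not run from inside the lxml source directory, where the local package shadows the installed one.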

Parsing lxml.etree._Element contents

喜夏-厌秋, submitted on 2019-12-07 11:41:53
Question: I have the following element that I parsed out of a <table>: <td align="center" valign="top"> <a href="ConfigGroups.aspx?cfgID=451161&prjID=11778&grpID=DTST" target="_blank"> 5548U </a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/> </td> I am trying to extract "5548U Power La Vaca (M8025K) Linux 4.2.x.x" from this element (including the spaces). import lxml.etree as ET td_html = """ <td align="center" valign="top"> <a href="ConfigGroups.aspx?cfgID=451161&prjID=11778&grpID=DTST" target=" …
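A sketch of one way to do this: itertext() walks both the .text and the .tail of every node, so the fragments that follow each <br/> (which live in the tails, not the text) are all collected. Note the `&` in the href must be escaped as `&amp;` for etree's XML parser.

```python
# Sketch: collect all text fragments of the <td>, including tail text
# after each <br/>, then normalize the whitespace.
import lxml.etree as ET

td_html = """<td align="center" valign="top">
<a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST" target="_blank">
5548U
</a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>"""

td = ET.fromstring(td_html)
# join all fragments, then split/join to collapse the stray whitespace
text = " ".join(" ".join(td.itertext()).split())
print(text)  # 5548U Power La Vaca (M8025K) Linux 4.2.x.x
```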

Iterate through all the rows in a table using python lxml xpath

假如想象, submitted on 2019-12-07 10:27:53
Question: This is the source code of the HTML page I want to extract data from. Webpage: http://gbgfotboll.se/information/?scr=table&ftid=51168 The table is at the bottom of the page. <html> <table class="clCommonGrid" cellspacing="0"> <thead> <tr> <td colspan="3">Kommande matcher</td> </tr> <tr> <th style="width:1%;">Tid</th> <th style="width:69%;">Match</th> <th style="width:30%;">Arena</th> </tr> </thead> <tbody class="clGrid"> <tr class="clTrOdd"> <td nowrap="nowrap" class="no-line-through"> <span …
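A sketch of row-by-row iteration over such a table: select the <tr> elements inside the tbody with one XPath, then pull each row's cells relative to that row. The two match rows below are invented sample data standing in for the live page's content.

```python
# Sketch: iterate rows of a table by class, then cells per row.
# The row contents here are made-up placeholders, not the real fixtures.
from lxml import html

snippet = """<table class="clCommonGrid" cellspacing="0">
  <thead>
    <tr><td colspan="3">Kommande matcher</td></tr>
    <tr><th>Tid</th><th>Match</th><th>Arena</th></tr>
  </thead>
  <tbody class="clGrid">
    <tr class="clTrOdd"><td>2014-09-26 19:30</td><td>Home - Away</td><td>Arena A</td></tr>
    <tr class="clTrEven"><td>2014-09-27 17:30</td><td>Team X - Team Y</td><td>Arena B</td></tr>
  </tbody>
</table>"""

doc = html.fromstring(snippet)
rows = []
for tr in doc.xpath('//table[@class="clCommonGrid"]/tbody[@class="clGrid"]/tr'):
    # './td' keeps the query relative to the current row
    rows.append([td.text_content().strip() for td in tr.xpath('./td')])

print(rows)
```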

lxml/Python : get previous-sibling

半城伤御伤魂, submitted on 2019-12-07 07:56:36
Question: I have the following HTML: <div id = "big"> <span>header 1</span> <ul id = "outer"> <li id = "inner">aaa</li> <li id = "inner">bbb</li> </ul> <span>header 2</span> <ul id = "outer"> <li id = "inner">ccc</li> <li id = "inner">ddd</li> </ul> </div> I want to loop over it in the order: header 1, aaa, bbb, header 2, ccc, ddd. I have tried looping through each ul and then printing the header and the li values. However, I don't know how to get the span header associated with a given ul. sets = tree.xpath …
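A sketch of the preceding-sibling approach the title asks about: for each <ul>, `preceding-sibling::span[1]` selects the nearest <span> before it in document order, which is exactly its header.

```python
# Sketch: pair each <ul> with the <span> header immediately before it.
from lxml import html

doc = html.fromstring("""<div id="big">
<span>header 1</span>
<ul id="outer"><li>aaa</li><li>bbb</li></ul>
<span>header 2</span>
<ul id="outer"><li>ccc</li><li>ddd</li></ul>
</div>""")

out = []
for ul in doc.xpath('//ul[@id="outer"]'):
    # [1] on a preceding-sibling axis means "closest preceding", not "first in document"
    header = ul.xpath('preceding-sibling::span[1]/text()')[0]
    out.append(header)
    out.extend(li.text for li in ul.xpath('./li'))

print(out)  # ['header 1', 'aaa', 'bbb', 'header 2', 'ccc', 'ddd']
```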

Python xml etree DTD from a StringIO source?

左心房为你撑大大i, submitted on 2019-12-07 07:08:44
Question: I'm adapting the following code (created via advice in this question), which took an XML file and its DTD and converted them to a different format. For this problem only the loading section is important: xmldoc = open(filename) parser = etree.XMLParser(dtd_validation=True, load_dtd=True) tree = etree.parse(xmldoc, parser) This worked fine while using the file system, but I'm converting it to run via a web framework, where the two files are loaded via a form. Loading the XML file works fine: …
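One workable sketch when both files arrive as in-memory strings: etree.DTD accepts a file-like object directly, so the DTD can be built from a StringIO and applied to the parsed document with an explicit validate() call instead of parser-level DTD validation.

```python
# Sketch: build a DTD from an in-memory source and validate explicitly.
from io import StringIO
from lxml import etree

dtd = etree.DTD(StringIO("<!ELEMENT root (child*)> <!ELEMENT child EMPTY>"))
good = etree.XML("<root><child/><child/></root>")
bad = etree.XML("<root><oops/></root>")

print(dtd.validate(good))  # True
print(dtd.validate(bad))   # False
print(dtd.error_log.filter_from_errors())  # explains why 'bad' failed
```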

Python, XPath: Find all links to images

南楼画角, submitted on 2019-12-07 06:56:17
Question: I'm using lxml in Python to parse some HTML and I want to extract all links to images. The way I do it right now is: //a[contains(@href,'.jpg') or contains(@href,'.jpeg') or ... (etc)] There are a couple of problems with this approach: you have to list all possible image extensions in all cases (both "jpg" and "JPG"), which is not elegant; in weird situations, the href may contain .jpg somewhere in the middle, not at the end of the string. I wanted to use a regexp, but I failed: //a[regx:match( …
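lxml's XPath supports EXSLT regular expressions, which address both complaints: `re:test` can anchor the extension at the end of the href and match case-insensitively with the "i" flag. A sketch:

```python
# Sketch: match image links with an EXSLT regex instead of contains().
from lxml import html

doc = html.fromstring(
    '<div>'
    '<a href="photo.JPG">a</a>'        # matches despite uppercase ("i" flag)
    '<a href="img.png">b</a>'          # matches
    '<a href="page.jpg.html">c</a>'    # rejected: extension not at the end
    '</div>'
)

ns = {"re": "http://exslt.org/regular-expressions"}
links = doc.xpath(r'//a[re:test(@href, "\.(jpe?g|png|gif)$", "i")]/@href',
                  namespaces=ns)
print(links)  # ['photo.JPG', 'img.png']
```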