lxml

How do I scrape an https page? [duplicate]

送分小仙女, submitted on 2019-12-07 18:48:59
Question: This question already has answers here: Python Requests throwing SSLError (22 answers). Closed 5 years ago. I'm using a Python script with lxml and requests to scrape a web page. My goal is to grab an element from the page and download it, but the content is on an HTTPS page and I'm getting an error when trying to access it. I'm sure there is some kind of certificate or authentication I have to include, but I'm struggling to find the right resources. I'm using: page = …
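A minimal sketch of the usual fix: requests verifies server certificates by default, and a custom CA bundle can be passed when the site uses a private certificate. The function name and URL handling here are illustrative, not from the original question.

```python
# Sketch: fetch an HTTPS page with requests and parse it with lxml.
# Assumes requests and lxml are installed; the function name is made up.
import requests
from lxml import html

def fetch_tree(url, cafile=None):
    # verify=True (the default) checks the server certificate against the
    # bundled CA store; pass a path to a custom CA bundle for private CAs.
    resp = requests.get(url, verify=cafile if cafile else True, timeout=10)
    resp.raise_for_status()
    return html.fromstring(resp.content)
```

Disabling verification with `verify=False` silences the SSLError too, but it removes the protection TLS is supposed to give, so a proper CA bundle is the better route.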

lxml xpath in python, how to handle missing tags?

半城伤御伤魂, submitted on 2019-12-07 17:17:05
Question: Suppose I want to parse the following XML with an lxml XPath expression: <pack xmlns="http://ns.qubic.tv/2010/item"> <packitem> <duration>520</duration> <max_count>14</max_count> </packitem> <packitem> <duration>12</duration> </packitem> </pack> This is a variation of what can be found at http://python-thoughts.blogspot.fr/2012/01/default-value-for-text-function-using.html. How can I parse the different elements so that, once zipped (with the zip or izip Python function …
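One common approach, sketched below: instead of running separate XPath queries for durations and counts (which get out of step when a tag is missing), iterate per <packitem> and use findtext with a default, so each item yields an aligned pair.

```python
# Sketch: per-item extraction so a missing <max_count> becomes a default
# value instead of silently shifting the zipped lists.
from lxml import etree

xml = b"""<pack xmlns="http://ns.qubic.tv/2010/item">
  <packitem><duration>520</duration><max_count>14</max_count></packitem>
  <packitem><duration>12</duration></packitem>
</pack>"""

ns = {"i": "http://ns.qubic.tv/2010/item"}
root = etree.fromstring(xml)

rows = []
for item in root.xpath("i:packitem", namespaces=ns):
    duration = item.findtext("i:duration", namespaces=ns)
    # default=None marks the absent tag explicitly
    max_count = item.findtext("i:max_count", default=None, namespaces=ns)
    rows.append((duration, max_count))

print(rows)  # [('520', '14'), ('12', None)]
```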

lxml.etree fromstring() and tostring() are not returning the same data

家住魔仙堡, submitted on 2019-12-07 14:26:37
Question: I'm learning lxml (after using ElementTree) and I'm baffled as to why .fromstring and .tostring do not appear to be reversible. Here's my example: import lxml.etree as ET f = open('somefile.xml','r') data = f.read() tree_in = ET.fromstring(data) tree_out = ET.tostring(tree_in) f2 = open('samefile.xml','w') f2.write(tree_out) f2.close 'somefile.xml' was 132 KB. 'samefile.xml', the output, was 113 KB, and it is missing the end of the file from some arbitrary point. The closing tags of the overall …
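A sketch of a lossless round trip. Two details in the quoted snippet likely explain the truncation: tostring() returns bytes (so the output file should be opened in binary mode), and `f2.close` without parentheses never actually closes the file, so the tail of the write buffer is never flushed to disk.

```python
# Sketch: write tostring() output safely. A context manager guarantees
# the file is flushed and closed; binary mode matches the bytes output.
import os
import tempfile
from lxml import etree

data = b"<root><child>some text</child><child>more</child></root>"
tree = etree.fromstring(data)
out_bytes = etree.tostring(tree)

path = os.path.join(tempfile.mkdtemp(), "samefile.xml")
with open(path, "wb") as f:   # binary mode; closed (and flushed) on exit
    f.write(out_bytes)

with open(path, "rb") as f:
    assert f.read() == out_bytes  # nothing lost in the round trip
```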

Installing lxml in virtualenv via pip install error: command 'x86_64-linux-gnu-gcc' failed

心不动则不痛, submitted on 2019-12-07 13:30:31
Question: When I activate a virtualenv and type 'pip install lxml', the installation process crashes with the message: /usr/bin/ld: cannot find -lz collect2: error: ld returned 1 exit status error: command 'x86_64-linux-gnu-gcc' failed with exit status 1 Answer 1: The error you have to pay attention to is the first one, "/usr/bin/ld: cannot find -lz": that means you don't have zlib-dev installed. Depending on your Linux distribution it could be named zlib-dev, or zlib1g-dev on Ubuntu; I don't know about other distros. Answer 2: …
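Following the answer above, a typical Debian/Ubuntu fix looks like this; the package names are assumed for Debian-family distros and differ elsewhere (e.g. zlib-devel, libxml2-devel, libxslt-devel on Fedora/CentOS).

```shell
# Assumed Debian/Ubuntu package names; adjust for your distribution.
# lxml compiles against zlib, libxml2 and libxslt, so the dev headers
# for all three are usually needed.
sudo apt-get install zlib1g-dev libxml2-dev libxslt1-dev
pip install lxml
```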

Error when executing "from lxml import etree" in the Python command line after successfully installing lxml via pip

…衆ロ難τιáo~, submitted on 2019-12-07 12:18:15
Question: bash-3.2$ pip install lxml-2.3.5.tgz Unpacking ./lxml-2.3.5.tgz Running setup.py egg_info for package from file:///Users/apple/workspace/pythonhome/misc/lxml-2.3.5.tgz Building lxml version 2.3.5. Building with Cython 0.17. Using build configuration of libxslt 1.1.27 Building against libxml2/libxslt in the following directory: /usr/local/lib warning: no previously-included files found matching '*.py' Installing collected packages: lxml Running setup.py install for lxml Building lxml version 2…
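A hedged diagnostic for the "installed but won't import" situation: print where the module is actually loaded from and which libxml2/libxslt versions it was compiled against versus what it is running with. A version mismatch, or a path pointing into the unpacked source tree instead of site-packages, is a common cause of this failure.

```python
# Diagnostic sketch: compare compile-time and runtime library versions.
# These version attributes are part of lxml.etree's public API.
import lxml.etree as etree

print("loaded from:", getattr(etree, "__file__", "?"))
print("lxml version:", etree.LXML_VERSION)
print("libxml2 compiled:", etree.LIBXML_COMPILED_VERSION,
      "running:", etree.LIBXML_VERSION)
print("libxslt compiled:", etree.LIBXSLT_COMPILED_VERSION,
      "running:", etree.LIBXSLT_VERSION)
```

Also make sure the import is not run from inside the lxml source directory, where the local package shadows the installed one.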

Parsing lxml.etree._Element contents

喜夏-厌秋, submitted on 2019-12-07 11:41:53
Question: I have the following element that I parsed out of a <table>: <td align="center" valign="top"> <a href="ConfigGroups.aspx?cfgID=451161&prjID=11778&grpID=DTST" target="_blank"> 5548U </a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/> </td> I am trying to extract "5548U Power La Vaca (M8025K) Linux 4.2.x.x" from this element (including the spaces). import lxml.etree as ET td_html = """ <td align="center" valign="top"> <a href="ConfigGroups.aspx?cfgID=451161&prjID=11778&grpID=DTST" target=" …
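A sketch of one way to do this: itertext() walks both the .text and the .tail of every node, so the fragments that follow each <br/> (which live in the tails, not the text) are all collected. Note the `&` in the href must be escaped as `&amp;` for etree's XML parser.

```python
# Sketch: collect all text fragments of the <td>, including tail text
# after each <br/>, then normalize the whitespace.
import lxml.etree as ET

td_html = """<td align="center" valign="top">
<a href="ConfigGroups.aspx?cfgID=451161&amp;prjID=11778&amp;grpID=DTST" target="_blank">
5548U
</a><br/>Power La Vaca<br/>(M8025K)<br/>Linux 4.2.x.x<br/>
</td>"""

td = ET.fromstring(td_html)
# join all fragments, then split/join to collapse the stray whitespace
text = " ".join(" ".join(td.itertext()).split())
print(text)  # 5548U Power La Vaca (M8025K) Linux 4.2.x.x
```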

Iterate through all the rows in a table using python lxml xpath

假如想象, submitted on 2019-12-07 10:27:53
Question: This is the source code of the HTML page I want to extract data from. Webpage: http://gbgfotboll.se/information/?scr=table&ftid=51168 The table is at the bottom of the page. <html> <table class="clCommonGrid" cellspacing="0"> <thead> <tr> <td colspan="3">Kommande matcher</td> </tr> <tr> <th style="width:1%;">Tid</th> <th style="width:69%;">Match</th> <th style="width:30%;">Arena</th> </tr> </thead> <tbody class="clGrid"> <tr class="clTrOdd"> <td nowrap="nowrap" class="no-line-through"> <span …
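A sketch of row-by-row iteration over such a table: select the <tr> elements inside the tbody with one XPath, then pull each row's cells relative to that row. The two match rows below are invented sample data standing in for the live page's content.

```python
# Sketch: iterate rows of a table by class, then cells per row.
# The row contents here are made-up placeholders, not the real fixtures.
from lxml import html

snippet = """<table class="clCommonGrid" cellspacing="0">
  <thead>
    <tr><td colspan="3">Kommande matcher</td></tr>
    <tr><th>Tid</th><th>Match</th><th>Arena</th></tr>
  </thead>
  <tbody class="clGrid">
    <tr class="clTrOdd"><td>2014-09-26 19:30</td><td>Home - Away</td><td>Arena A</td></tr>
    <tr class="clTrEven"><td>2014-09-27 17:30</td><td>Team X - Team Y</td><td>Arena B</td></tr>
  </tbody>
</table>"""

doc = html.fromstring(snippet)
rows = []
for tr in doc.xpath('//table[@class="clCommonGrid"]/tbody[@class="clGrid"]/tr'):
    # './td' keeps the query relative to the current row
    rows.append([td.text_content().strip() for td in tr.xpath('./td')])

print(rows)
```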

lxml/Python : get previous-sibling

半城伤御伤魂, submitted on 2019-12-07 07:56:36
Question: I have the following HTML: <div id = "big"> <span>header 1</span> <ul id = "outer"> <li id = "inner">aaa</li> <li id = "inner">bbb</li> </ul> <span>header 2</span> <ul id = "outer"> <li id = "inner">ccc</li> <li id = "inner">ddd</li> </ul> </div> I want to loop over it in the order: header 1, aaa, bbb, header 2, ccc, ddd. I have tried looping through each ul and then printing the header and the li values. However, I don't know how to get the span header associated with a given ul. sets = tree.xpath …
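A sketch of the preceding-sibling approach the title asks about: for each <ul>, `preceding-sibling::span[1]` selects the nearest <span> before it in document order, which is exactly its header.

```python
# Sketch: pair each <ul> with the <span> header immediately before it.
from lxml import html

doc = html.fromstring("""<div id="big">
<span>header 1</span>
<ul id="outer"><li>aaa</li><li>bbb</li></ul>
<span>header 2</span>
<ul id="outer"><li>ccc</li><li>ddd</li></ul>
</div>""")

out = []
for ul in doc.xpath('//ul[@id="outer"]'):
    # [1] on a preceding-sibling axis means "closest preceding", not "first in document"
    header = ul.xpath('preceding-sibling::span[1]/text()')[0]
    out.append(header)
    out.extend(li.text for li in ul.xpath('./li'))

print(out)  # ['header 1', 'aaa', 'bbb', 'header 2', 'ccc', 'ddd']
```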

Python xml etree DTD from a StringIO source?

左心房为你撑大大i, submitted on 2019-12-07 07:08:44
Question: I'm adapting the following code (created via advice in this question), which took an XML file and its DTD and converted them to a different format. For this problem only the loading section is important: xmldoc = open(filename) parser = etree.XMLParser(dtd_validation=True, load_dtd=True) tree = etree.parse(xmldoc, parser) This worked fine while using the file system, but I'm converting it to run via a web framework, where the two files are loaded via a form. Loading the XML file works fine: …
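One workable sketch when both files arrive as in-memory strings: etree.DTD accepts a file-like object directly, so the DTD can be built from a StringIO and applied to the parsed document with an explicit validate() call instead of parser-level DTD validation.

```python
# Sketch: build a DTD from an in-memory source and validate explicitly.
from io import StringIO
from lxml import etree

dtd = etree.DTD(StringIO("<!ELEMENT root (child*)> <!ELEMENT child EMPTY>"))
good = etree.XML("<root><child/><child/></root>")
bad = etree.XML("<root><oops/></root>")

print(dtd.validate(good))  # True
print(dtd.validate(bad))   # False
print(dtd.error_log.filter_from_errors())  # explains why 'bad' failed
```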

Python, XPath: Find all links to images

南楼画角, submitted on 2019-12-07 06:56:17
Question: I'm using lxml in Python to parse some HTML and I want to extract all links to images. The way I do it right now is: //a[contains(@href,'.jpg') or contains(@href,'.jpeg') or ... (etc)] There are a couple of problems with this approach: you have to list all possible image extensions in all cases (both "jpg" and "JPG"), which is not elegant; in weird situations, the href may contain .jpg somewhere in the middle, not at the end of the string. I wanted to use a regexp, but I failed: //a[regx:match( …
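lxml's XPath supports EXSLT regular expressions, which address both complaints: `re:test` can anchor the extension at the end of the href and match case-insensitively with the "i" flag. A sketch:

```python
# Sketch: match image links with an EXSLT regex instead of contains().
from lxml import html

doc = html.fromstring(
    '<div>'
    '<a href="photo.JPG">a</a>'        # matches despite uppercase ("i" flag)
    '<a href="img.png">b</a>'          # matches
    '<a href="page.jpg.html">c</a>'    # rejected: extension not at the end
    '</div>'
)

ns = {"re": "http://exslt.org/regular-expressions"}
links = doc.xpath(r'//a[re:test(@href, "\.(jpe?g|png|gif)$", "i")]/@href',
                  namespaces=ns)
print(links)  # ['photo.JPG', 'img.png']
```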