lxml

Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element

Submitted by 这一生的挚爱 on 2019-12-24 07:01:34
Question: I'd like to detect the header of an HTML table when that table does not have <thead> elements. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I'd like to do this in Python with both BeautifulSoup and lxml. Let's say I already have a table object and I'd like to get out of it a thead object, a tbody object, and a tfoot object. Currently, parse_thead does the following when the <thead> tag is present: in BeautifulSoup, I get table objects with doc.find_all('table') and …
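A minimal sketch of one way to approach this with BeautifulSoup, assuming the header can be found heuristically as the first row whose cells are all <th> elements (the sample HTML and the all-<th> heuristic are illustrative assumptions, not part of the original question):

```python
from bs4 import BeautifulSoup

# Illustrative table without a <thead> element.
html = """<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Paris</td><td>2,100,000</td></tr>
</table>"""

soup = BeautifulSoup(html, "lxml")
table = soup.find("table")

# Heuristic: the first row made up entirely of <th> cells is the header;
# every remaining row belongs to the body.
header_row, body_rows = None, []
for tr in table.find_all("tr"):
    cells = tr.find_all(["th", "td"])
    if header_row is None and cells and all(c.name == "th" for c in cells):
        header_row = tr
    else:
        body_rows.append(tr)

print([th.get_text(strip=True) for th in header_row.find_all("th")])  # ['City', 'Population']
print(len(body_rows))                                                 # 1
```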

lxml pretty_print — write file problem

Submitted by 你离开我真会死。 on 2019-12-24 03:51:12
Question: I am writing a Python program that writes raw data to an XML file. In my design, we get the raw data line by line and then write it into the XML file like: `<root>\n <a> value </a>\n <b> value </b>\n </root>`. The first time I write the XML file with pretty_print=True I get what I want, but the second time, when I read the file, get the root element, add new elements, and then save it back with pretty_print=True, I do not get what I want; it just looks like: `...\n <c> value </c></root>`. What's wrong with lxml? Or my …
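The excerpt is cut off before the write-back code, but this behaviour usually comes from lxml's whitespace handling: pretty_print will not re-indent elements that already carry whitespace-only text nodes from the first save. A small sketch of the usual workaround, parsing with remove_blank_text=True before adding new elements (the file name and element names are assumptions):

```python
from lxml import etree

# First write: the tree is built in memory, so pretty_print indents everything.
root = etree.Element("root")
etree.SubElement(root, "a").text = "value"
etree.SubElement(root, "b").text = "value"
etree.ElementTree(root).write("data.xml", pretty_print=True)

# Second write: strip the whitespace-only text left over from the first save,
# otherwise lxml keeps the old layout and the new <c> ends up unindented.
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse("data.xml", parser)
etree.SubElement(tree.getroot(), "c").text = "value"
tree.write("data.xml", pretty_print=True)
```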

Python lxml changes tag hierarchy?

Submitted by 荒凉一梦 on 2019-12-24 03:29:36
Question: I'm having a small issue with lxml. I'm converting an XML doc into an HTML doc. The original XML looks like this (it looks like HTML, but it's in the XML doc): <p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p> When I do this (item is the string above): lxml.html.tostring(lxml.html.fromstring(item)) I get this: <div><p>Localization - Eiffel tower? Paris or Vegas </p><p>Bayes theorem p(A|B)</p></div> I don't have any problem with the <div>s, but the fact that the …
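This looks less like lxml changing the hierarchy and more like the HTML parser applying HTML's own rules, where a <p> element cannot contain another <p>, so the first paragraph is closed as soon as the second one opens. A small sketch of the difference, assuming the input really can be treated as XML:

```python
from lxml import etree, html

item = "<p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>"

# HTML parser: <p> may not nest, so the outer paragraph is closed early.
print(html.tostring(html.fromstring(item)).decode())

# XML parser: no such rule, the nesting from the source document is kept.
print(etree.tostring(etree.fromstring(item)).decode())
```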

lxml.objectify and leading zeros

Submitted by 此生再无相见时 on 2019-12-24 02:09:11
Question: When the objectify element is printed on the console, the leading zero is lost, but it is preserved in the .text: `>>> from lxml import objectify >>> xml = "<a><b>01</b></a>" >>> a = objectify.fromstring(xml) >>> print(a.b) 1 >>> print(a.b.text) 01` From what I understand, objectify automatically makes the b element an IntElement class instance. But it also does that even if I try to explicitly set the type with an XSD schema: from io import StringIO from lxml import etree, objectify f = …
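A small sketch of what is going on and of the simplest workaround when the goal is just to keep the raw string: objectify infers an integer type for <b>, the underlying text is always still available, and plain etree never does type guessing at all:

```python
from lxml import etree, objectify

xml = "<a><b>01</b></a>"

a = objectify.fromstring(xml)
print(type(a.b).__name__)  # IntElement -- objectify inferred an integer
print(a.b)                 # 1          -- printed as that integer
print(a.b.text)            # 01         -- the raw text is still there

# If the leading zero matters, plain etree keeps values as strings.
root = etree.fromstring(xml)
print(root.findtext("b"))  # 01
```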

Failing to get duration of youtube video using xpath

Submitted by 自闭症网瘾萝莉.ら on 2019-12-24 01:55:29
Question: I wanted to write something that would return the video duration of a YouTube link, so I found requests and lxml and started out following this guide. Here's the setup: import requests from lxml import html url = 'https://www.youtube.com/watch?v=EN8fNb6uhns' page = requests.get(url) tree = html.fromstring(page.content) Then I try to use XPath to get the duration, but it doesn't work. Trying to get the duration: tree.xpath('//span[@class="ytp-time-duration"]/text()') returns an empty list.
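The XPath itself is fine; the catch is that the ytp-time-duration span is created by the YouTube player's JavaScript after the page loads, so it never appears in the static HTML that requests downloads. A quick way to confirm that diagnosis with the same URL:

```python
import requests
from lxml import html

url = 'https://www.youtube.com/watch?v=EN8fNb6uhns'
page = requests.get(url)

# The player controls are rendered client-side, so the class name is absent
# from the raw HTML and the XPath can only return an empty list.
print(b'ytp-time-duration' in page.content)  # typically False

tree = html.fromstring(page.content)
print(tree.xpath('//span[@class="ytp-time-duration"]/text()'))  # []
```

Getting the duration therefore needs either a JavaScript-capable browser driver or an API/metadata source that exposes it directly.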

How do I capture all of the element names of an XML file using lxml in Python?

Submitted by 自古美人都是妖i on 2019-12-24 01:25:16
Question: I am able to use lxml to accomplish most of what I would like to do, although it was a struggle to get through the obfuscating examples and tutorials. In short, I am able to read an external XML file and import it via lxml into the proper tree-like format. To demonstrate this, if I were to type: print(etree.tostring(myXmlTree, pretty_print=True, method="xml")) I get the following output: <net xmlns="http://www.arin.net/whoisrws/core/v1" xmlns:ns2="http://www.arin.net/whoisrws/rdns/v1" xmlns …
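A short sketch of one way to collect every element name with lxml, assuming the goal is the local names without the namespace URI that lxml otherwise shows as a {http://...}tag prefix (the sample XML below is a trimmed, illustrative stand-in for the real ARIN response):

```python
from lxml import etree

xml = b"""<net xmlns="http://www.arin.net/whoisrws/core/v1">
  <registrationDate>2011-01-01</registrationDate>
  <netBlocks>
    <netBlock><cidrLength>24</cidrLength></netBlock>
  </netBlocks>
</net>"""

root = etree.fromstring(xml)

# iter() walks the whole tree; the isinstance check skips comments and
# processing instructions; QName.localname drops the namespace part.
names = [etree.QName(el).localname for el in root.iter() if isinstance(el.tag, str)]
print(names)
# ['net', 'registrationDate', 'netBlocks', 'netBlock', 'cidrLength']
```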

CDATA getting stripped in lxml even after using strip_cdata=False

Submitted by ↘锁芯ラ on 2019-12-23 20:41:37
Question: I have a requirement in which I need to read an XML file and replace a string with a certain value. The XML contains a CDATA element and I need to preserve it. I have tried using a parser and setting strip_cdata to False. This is not working and I need help to figure out a way to achieve it. import lxml.etree as ET parser1 = ET.XMLParser(strip_cdata=False) with open('testxml.xml', encoding="utf8") as f: tree = ET.parse(f, parser=parser1) root = tree.getroot() for elem in root.getiterator(): try: elem …
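strip_cdata=False preserves CDATA sections as long as the tree is left untouched, but assigning a plain string back to elem.text serializes as ordinary escaped text. A small sketch of the usual fix, wrapping the replacement in etree.CDATA() (the sample XML and the string being replaced are assumptions):

```python
from lxml import etree

xml = b"<root><item><![CDATA[old & value]]></item></root>"

parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(xml, parser)

for item in root.iter("item"):
    # .text holds the unescaped CDATA content; wrap the edited string in
    # etree.CDATA() so it is written back as a CDATA section, not plain text.
    item.text = etree.CDATA(item.text.replace("old", "new"))

print(etree.tostring(root).decode())
# <root><item><![CDATA[new & value]]></item></root>
```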

converting scrapy to lxml

Submitted by 家住魔仙堡 on 2019-12-23 20:12:20
Question: I have Scrapy code that looks like this: for row in response.css("div#flexBox_flex_calendar_mainCal table tr.calendar_row"): print "================" print row.xpath(".//td[@class='time']/text()").extract() print row.xpath(".//td[@class='currency']/text()").extract() print row.xpath(".//td[@class='impact']/span/@title").extract() print row.xpath(".//td[@class='event']/span/text()").extract() print row.xpath(".//td[@class='actual']/text()").extract() print row.xpath(".//td[@class='forecast'] …
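A rough translation of that loop to requests + lxml, keeping the row-level XPath expressions as they are; the outer CSS selector is approximated with XPath, and the URL is a placeholder since the excerpt does not show which page the spider crawls:

```python
import requests
from lxml import html

page = requests.get("https://example.com/calendar")  # placeholder URL
doc = html.fromstring(page.content)

# Approximate equivalent of the CSS selector
# "div#flexBox_flex_calendar_mainCal table tr.calendar_row".
rows = doc.xpath("//div[@id='flexBox_flex_calendar_mainCal']"
                 "//table//tr[contains(@class, 'calendar_row')]")
for row in rows:
    print("================")
    print(row.xpath(".//td[@class='time']/text()"))
    print(row.xpath(".//td[@class='currency']/text()"))
    print(row.xpath(".//td[@class='impact']/span/@title"))
    print(row.xpath(".//td[@class='event']/span/text()"))
    print(row.xpath(".//td[@class='actual']/text()"))
```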

How to install LXML Python 3.3 Windows 8 64 Bit

Submitted by 旧巷老猫 on 2019-12-23 10:57:26
Question: I think I'm too stupid to install the lxml library on my system. Can anyone please help me with instructions for stupid people? I found a lot of instructions, but they did not help me much. I looked at the lxml homepage. For installation I need pip 1.4.1? I downloaded it... but how can I install it? I unzipped pip-1.4.1.tar.gz, then opened setup.py with my Python shell and ran the module: Traceback (most recent call last): File "C:\................\dist\pip-1.4.1\setup.py", line 5, in <module> from …
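On current setups the usual route is a prebuilt binary wheel (pip install lxml) rather than building from source. Once something is installed, a quick check from the Python shell confirms that the import works and shows which library versions the build is linked against:

```python
# Verify the installation from a Python shell.
from lxml import etree

print(etree.LXML_VERSION)     # e.g. (4, 9, 3, 0)
print(etree.LIBXML_VERSION)   # libxml2 version lxml was built against
print(etree.LIBXSLT_VERSION)  # libxslt version lxml was built against
```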

Replacing elements with lxml.html

Submitted by 爱⌒轻易说出口 on 2019-12-23 08:35:18
Question: I'm fairly new to lxml and HTML parsers as a whole. I was wondering if there is a way to replace an element within a tree with another element... For example I have: body = """<code> def function(arg): print arg </code> Blah blah blah <code> int main() { return 0; } </code> """ doc = lxml.html.fromstring(body) codeblocks = doc.cssselect('code') for block in codeblocks: lexer = guess_lexer(block.text_content()) hilited = highlight(block.text_content(), lexer, HtmlFormatter()) doc.replace(block …
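The truncated doc.replace(block, ...) call needs an element, not the HTML string that highlight() returns, so the string has to be parsed first. A sketch of one way to finish the loop, assuming Pygments and cssselect are installed; the replacement goes through the block's parent, and the tail text after each <code> is carried over explicitly so "Blah blah blah" survives:

```python
import lxml.html
from pygments import highlight
from pygments.lexers import guess_lexer
from pygments.formatters import HtmlFormatter

body = """<code>
def function(arg):
    print(arg)
</code>
Blah blah blah
<code>
int main() {
   return 0;
}
</code>
"""

doc = lxml.html.fromstring(body)
for block in doc.cssselect('code'):
    lexer = guess_lexer(block.text_content())
    hilited = highlight(block.text_content(), lexer, HtmlFormatter())
    # highlight() returns an HTML string; parse it into an element first.
    new_elem = lxml.html.fromstring(hilited)
    # Keep the text that followed the original <code> block.
    new_elem.tail = block.tail
    block.getparent().replace(block, new_elem)

print(lxml.html.tostring(doc, pretty_print=True).decode())
```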