lxml

Detecting header in HTML tables using beautifulsoup / lxml when table lacks thead element

Submitted by 这一生的挚爱 on 2019-12-24 07:01:34
Question: I'd like to detect the header of an HTML table when that table does not have <thead> elements. (MediaWiki, which drives Wikipedia, does not support <thead> elements.) I'd like to do this in Python with both BeautifulSoup and lxml. Let's say I already have a table object and I'd like to get out of it a thead object, a tbody object, and a tfoot object. Currently, parse_thead does the following when the <thead> tag is present: in BeautifulSoup, I get table objects with doc.find_all('table') and …
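A minimal sketch of one way to approach this with BeautifulSoup, assuming the header can be found heuristically as the first row whose cells are all <th> elements (the sample HTML and the all-<th> heuristic are illustrative assumptions, not part of the original question):

```python
from bs4 import BeautifulSoup

# Illustrative table without a <thead> element.
html = """<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Paris</td><td>2,100,000</td></tr>
</table>"""

soup = BeautifulSoup(html, "lxml")
table = soup.find("table")

# Heuristic: the first row made up entirely of <th> cells is the header;
# every remaining row belongs to the body.
header_row, body_rows = None, []
for tr in table.find_all("tr"):
    cells = tr.find_all(["th", "td"])
    if header_row is None and cells and all(c.name == "th" for c in cells):
        header_row = tr
    else:
        body_rows.append(tr)

print([th.get_text(strip=True) for th in header_row.find_all("th")])  # ['City', 'Population']
print(len(body_rows))                                                 # 1
```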

lxml pretty_print — write file problem

Submitted by 你离开我真会死。 on 2019-12-24 03:51:12
Question: I am writing a Python program that writes raw data to an XML file. In my design, we get the raw data line by line and then write it into the XML file like: `<root>\n <a> value </a>\n <b> value </b>\n </root>`. The first time I write the XML file with pretty_print=True I get what I want, but the second time, when I read the file, get the root element, add new elements, and then save it back with pretty_print=True, I do not get what I want; it just looks like: `...\n <c> value </c></root>`. What's wrong with lxml? Or my …
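The excerpt is cut off before the write-back code, but this behaviour usually comes from lxml's whitespace handling: pretty_print will not re-indent elements that already carry whitespace-only text nodes from the first save. A small sketch of the usual workaround, parsing with remove_blank_text=True before adding new elements (the file name and element names are assumptions):

```python
from lxml import etree

# First write: the tree is built in memory, so pretty_print indents everything.
root = etree.Element("root")
etree.SubElement(root, "a").text = "value"
etree.SubElement(root, "b").text = "value"
etree.ElementTree(root).write("data.xml", pretty_print=True)

# Second write: strip the whitespace-only text left over from the first save,
# otherwise lxml keeps the old layout and the new <c> ends up unindented.
parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse("data.xml", parser)
etree.SubElement(tree.getroot(), "c").text = "value"
tree.write("data.xml", pretty_print=True)
```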

Python lxml changes tag hierarchy?

Submitted by 荒凉一梦 on 2019-12-24 03:29:36
Question: I'm having a small issue with lxml. I'm converting an XML doc into an HTML doc. The original XML looks like this (it looks like HTML, but it's in the XML doc): <p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p> When I do this (item is the string above): lxml.html.tostring(lxml.html.fromstring(item)) I get this: <div><p>Localization - Eiffel tower? Paris or Vegas </p><p>Bayes theorem p(A|B)</p></div> I don't have any problem with the <div>s, but the fact that the …
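This looks less like lxml changing the hierarchy and more like the HTML parser applying HTML's own rules, where a <p> element cannot contain another <p>, so the first paragraph is closed as soon as the second one opens. A small sketch of the difference, assuming the input really can be treated as XML:

```python
from lxml import etree, html

item = "<p>Localization - Eiffel tower? Paris or Vegas <p>Bayes theorem p(A|B)</p></p>"

# HTML parser: <p> may not nest, so the outer paragraph is closed early.
print(html.tostring(html.fromstring(item)).decode())

# XML parser: no such rule, the nesting from the source document is kept.
print(etree.tostring(etree.fromstring(item)).decode())
```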

lxml.objectify and leading zeros

Submitted by 此生再无相见时 on 2019-12-24 02:09:11
Question: When the objectify element is printed on the console, the leading zero is lost, but it is preserved in the .text: `>>> from lxml import objectify >>> xml = "<a><b>01</b></a>" >>> a = objectify.fromstring(xml) >>> print(a.b) 1 >>> print(a.b.text) 01` From what I understand, objectify automatically makes the b element an IntElement class instance. But it also does that even if I try to explicitly set the type with an XSD schema: from io import StringIO from lxml import etree, objectify f = …
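A small sketch of what is going on and of the simplest workaround when the goal is just to keep the raw string: objectify infers an integer type for <b>, the underlying text is always still available, and plain etree never does type guessing at all:

```python
from lxml import etree, objectify

xml = "<a><b>01</b></a>"

a = objectify.fromstring(xml)
print(type(a.b).__name__)  # IntElement -- objectify inferred an integer
print(a.b)                 # 1          -- printed as that integer
print(a.b.text)            # 01         -- the raw text is still there

# If the leading zero matters, plain etree keeps values as strings.
root = etree.fromstring(xml)
print(root.findtext("b"))  # 01
```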

Failing to get duration of youtube video using xpath

Submitted by 自闭症网瘾萝莉.ら on 2019-12-24 01:55:29
Question: I wanted to write something that would return the video duration of a YouTube link, so I found requests and lxml and started out following this guide. Here's the setup: import requests from lxml import html url = 'https://www.youtube.com/watch?v=EN8fNb6uhns' page = requests.get(url) tree = html.fromstring(page.content) Then I try to use XPath to get the duration, but it doesn't work. Trying to get the duration: tree.xpath('//span[@class="ytp-time-duration"]/text()') returns an empty list.
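The XPath itself is fine; the catch is that the ytp-time-duration span is created by the YouTube player's JavaScript after the page loads, so it never appears in the static HTML that requests downloads. A quick way to confirm that diagnosis with the same URL:

```python
import requests
from lxml import html

url = 'https://www.youtube.com/watch?v=EN8fNb6uhns'
page = requests.get(url)

# The player controls are rendered client-side, so the class name is absent
# from the raw HTML and the XPath can only return an empty list.
print(b'ytp-time-duration' in page.content)  # typically False

tree = html.fromstring(page.content)
print(tree.xpath('//span[@class="ytp-time-duration"]/text()'))  # []
```

Getting the duration therefore needs either a JavaScript-capable browser driver or an API/metadata source that exposes it directly.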

How do I capture all of the element names of an XML file using lxml in Python?

Submitted by 自古美人都是妖i on 2019-12-24 01:25:16
Question: I am able to use lxml to accomplish most of what I would like to do, although it was a struggle to get through the obfuscating examples and tutorials. In short, I am able to read an external XML file and import it via lxml into the proper tree-like format. To demonstrate this, if I were to type: print(etree.tostring(myXmlTree, pretty_print=True, method="xml")) I get the following output: <net xmlns="http://www.arin.net/whoisrws/core/v1" xmlns:ns2="http://www.arin.net/whoisrws/rdns/v1" xmlns …
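A short sketch of one way to collect every element name with lxml, assuming the goal is the local names without the namespace URI that lxml otherwise shows as a {http://...}tag prefix (the sample XML below is a trimmed, illustrative stand-in for the real ARIN response):

```python
from lxml import etree

xml = b"""<net xmlns="http://www.arin.net/whoisrws/core/v1">
  <registrationDate>2011-01-01</registrationDate>
  <netBlocks>
    <netBlock><cidrLength>24</cidrLength></netBlock>
  </netBlocks>
</net>"""

root = etree.fromstring(xml)

# iter() walks the whole tree; the isinstance check skips comments and
# processing instructions; QName.localname drops the namespace part.
names = [etree.QName(el).localname for el in root.iter() if isinstance(el.tag, str)]
print(names)
# ['net', 'registrationDate', 'netBlocks', 'netBlock', 'cidrLength']
```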

CDATA getting stripped in lxml even after using strip_cdata=False

Submitted by ↘锁芯ラ on 2019-12-23 20:41:37
Question: I have a requirement in which I need to read an XML file and replace a string with a certain value. The XML contains a CDATA element and I need to preserve it. I have tried using a parser and setting strip_cdata to False. This is not working and I need help to figure out a way to achieve it. import lxml.etree as ET parser1 = ET.XMLParser(strip_cdata=False) with open('testxml.xml', encoding="utf8") as f: tree = ET.parse(f, parser=parser1) root = tree.getroot() for elem in root.getiterator(): try: elem …
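strip_cdata=False preserves CDATA sections as long as the tree is left untouched, but assigning a plain string back to elem.text serializes as ordinary escaped text. A small sketch of the usual fix, wrapping the replacement in etree.CDATA() (the sample XML and the string being replaced are assumptions):

```python
from lxml import etree

xml = b"<root><item><![CDATA[old & value]]></item></root>"

parser = etree.XMLParser(strip_cdata=False)
root = etree.fromstring(xml, parser)

for item in root.iter("item"):
    # .text holds the unescaped CDATA content; wrap the edited string in
    # etree.CDATA() so it is written back as a CDATA section, not plain text.
    item.text = etree.CDATA(item.text.replace("old", "new"))

print(etree.tostring(root).decode())
# <root><item><![CDATA[new & value]]></item></root>
```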

converting scrapy to lxml

Submitted by 家住魔仙堡 on 2019-12-23 20:12:20
Question: I have Scrapy code that looks like this: for row in response.css("div#flexBox_flex_calendar_mainCal table tr.calendar_row"): print "================" print row.xpath(".//td[@class='time']/text()").extract() print row.xpath(".//td[@class='currency']/text()").extract() print row.xpath(".//td[@class='impact']/span/@title").extract() print row.xpath(".//td[@class='event']/span/text()").extract() print row.xpath(".//td[@class='actual']/text()").extract() print row.xpath(".//td[@class='forecast'] …
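A rough translation of that loop to requests + lxml, keeping the row-level XPath expressions as they are; the outer CSS selector is approximated with XPath, and the URL is a placeholder since the excerpt does not show which page the spider crawls:

```python
import requests
from lxml import html

page = requests.get("https://example.com/calendar")  # placeholder URL
doc = html.fromstring(page.content)

# Approximate equivalent of the CSS selector
# "div#flexBox_flex_calendar_mainCal table tr.calendar_row".
rows = doc.xpath("//div[@id='flexBox_flex_calendar_mainCal']"
                 "//table//tr[contains(@class, 'calendar_row')]")
for row in rows:
    print("================")
    print(row.xpath(".//td[@class='time']/text()"))
    print(row.xpath(".//td[@class='currency']/text()"))
    print(row.xpath(".//td[@class='impact']/span/@title"))
    print(row.xpath(".//td[@class='event']/span/text()"))
    print(row.xpath(".//td[@class='actual']/text()"))
```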

How to install LXML Python 3.3 Windows 8 64 Bit

Submitted by 旧巷老猫 on 2019-12-23 10:57:26
Question: I think I'm too stupid to install the lxml library on my system. Can anyone please help me with instructions for stupid people? I found a lot of instructions, but they did not help me much. I looked at the lxml homepage. For installation I need pip 1.4.1? I downloaded it... but how can I install it? I unzipped pip-1.4.1.tar.gz, then opened setup.py with my Python shell and ran the module: Traceback (most recent call last): File "C:\................\dist\pip-1.4.1\setup.py", line 5, in <module> from …
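On current setups the usual route is a prebuilt binary wheel (pip install lxml) rather than building from source. Once something is installed, a quick check from the Python shell confirms that the import works and shows which library versions the build is linked against:

```python
# Verify the installation from a Python shell.
from lxml import etree

print(etree.LXML_VERSION)     # e.g. (4, 9, 3, 0)
print(etree.LIBXML_VERSION)   # libxml2 version lxml was built against
print(etree.LIBXSLT_VERSION)  # libxslt version lxml was built against
```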

Replacing elements with lxml.html

Submitted by 爱⌒轻易说出口 on 2019-12-23 08:35:18
Question: I'm fairly new to lxml and HTML parsers as a whole. I was wondering if there is a way to replace an element within a tree with another element... For example I have: body = """<code> def function(arg): print arg </code> Blah blah blah <code> int main() { return 0; } </code> """ doc = lxml.html.fromstring(body) codeblocks = doc.cssselect('code') for block in codeblocks: lexer = guess_lexer(block.text_content()) hilited = highlight(block.text_content(), lexer, HtmlFormatter()) doc.replace(block …
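The truncated doc.replace(block, ...) call needs an element, not the HTML string that highlight() returns, so the string has to be parsed first. A sketch of one way to finish the loop, assuming Pygments and cssselect are installed; the replacement goes through the block's parent, and the tail text after each <code> is carried over explicitly so "Blah blah blah" survives:

```python
import lxml.html
from pygments import highlight
from pygments.lexers import guess_lexer
from pygments.formatters import HtmlFormatter

body = """<code>
def function(arg):
    print(arg)
</code>
Blah blah blah
<code>
int main() {
   return 0;
}
</code>
"""

doc = lxml.html.fromstring(body)
for block in doc.cssselect('code'):
    lexer = guess_lexer(block.text_content())
    hilited = highlight(block.text_content(), lexer, HtmlFormatter())
    # highlight() returns an HTML string; parse it into an element first.
    new_elem = lxml.html.fromstring(hilited)
    # Keep the text that followed the original <code> block.
    new_elem.tail = block.tail
    block.getparent().replace(block, new_elem)

print(lxml.html.tostring(doc, pretty_print=True).decode())
```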