lxml

Replacing elements with lxml.html

魔方 西西 · Submitted 2019-12-23 08:34:24

Question: I'm fairly new to lxml and to HTML parsers as a whole. I was wondering if there is a way to replace an element within a tree with another element. For example I have:

    body = """<code> def function(arg): print arg </code> Blah blah blah <code> int main() { return 0; } </code> """
    doc = lxml.html.fromstring(body)
    codeblocks = doc.cssselect('code')
    for block in codeblocks:
        lexer = guess_lexer(block.text_content())
        hilited = highlight(block.text_content(), lexer, HtmlFormatter())
        doc.replace(block…
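The excerpt is cut off, but the visible `doc.replace(block, …)` call will fail whenever `block` is not a direct child of `doc`, because `replace()` must be called on the element's own parent. A minimal sketch of the swap (using a plain `<pre>` as the replacement instead of the Pygments output, which would first need parsing with `fromstring`):

```python
from lxml import html

body = "<div><code>def f(x): return x</code> blah <code>int main() {}</code></div>"
doc = html.fromstring(body)

for block in doc.xpath('//code'):
    # build the replacement element -- here a plain <pre> holding the old text;
    # with Pygments you would parse the highlight() output the same way
    new = html.fromstring('<pre>%s</pre>' % block.text_content())
    # replace() is a method of the *parent*, so reach it via getparent()
    block.getparent().replace(block, new)

print(html.tostring(doc))
```

The key point is `block.getparent().replace(...)` rather than `doc.replace(...)`: it works at any nesting depth.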

extracting paragraph in python using lxml

岁酱吖の · Submitted 2019-12-23 02:45:11

Question: I would like to extract the paragraphs from an HTML page in Python. I used the lxml module, but it doesn't do exactly what I am looking for.

    print html.parse(url).xpath('//p')[1].text_content()

The page contains:

    <span id="midArticle_1"></span><p>Here is the First Paragraph.</p><span id="midArticle_2"></span><p>Here is the second Paragraph.</p><span id="midArticle_3"></span><p>Paragraph Three."</p>

I should add that different pages have different numbers of paragraphs, so I would like to build a list and put each paragraph into it…
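Since the number of paragraphs varies per page, a list comprehension over the full `//p` node-set (rather than indexing a single element as in the excerpt) collects them all. A small sketch against an inline stand-in for the page:

```python
from lxml import html

snippet = ('<div><span id="midArticle_1"></span><p>Here is the First Paragraph.</p>'
           '<span id="midArticle_2"></span><p>Here is the second Paragraph.</p></div>')
page = html.fromstring(snippet)

# one list entry per <p>, however many the page happens to contain
paragraphs = [p.text_content() for p in page.xpath('//p')]
print(paragraphs)
```

For a live URL, `html.parse(url).xpath('//p')` yields the same kind of node-set to iterate over.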

How to prevent lxml.etree.HTML( data ) from crashing on certain type of data?

心已入冬 · Submitted 2019-12-23 02:40:06

Question: I'm running etree.HTML(data) as below for lots of different data contents. With one specific data content, however, lxml.etree.HTML will not parse it; instead it goes into an infinite loop and consumes 100% CPU. Does anyone know exactly what in the data below is causing this? And more importantly, how can I prevent this from happening on arbitrary broken data? Edit: This turns out to be a bug in lxml version 2.7.8 and below (at least). After updating to lxml 2.9.0, the bug is…
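The fix reported in the edit was simply upgrading the library. As a defensive measure against future parser hangs on unknown, broken input, one option (a sketch, not an lxml feature) is to run the parse in a child process that can be killed after a timeout:

```python
import multiprocessing

from lxml import etree


def _parse(data, queue):
    queue.put(etree.tostring(etree.HTML(data)))


def parse_with_timeout(data, seconds=5):
    """Run the parse in a child process so a hung parser can be killed."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_parse, args=(data, queue))
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()   # reclaim the runaway parser
        proc.join()
        raise RuntimeError('parse timed out')
    return queue.get()


if __name__ == '__main__':
    print(parse_with_timeout('<p>hello</p>'))
```

This costs a process spawn per document, so it is only worth it when inputs are untrusted; the real cure remains keeping lxml/libxml2 up to date.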

Accessing values in xml file with namespaces in python 2.7 lxml

那年仲夏 · Submitted 2019-12-23 01:45:17

Question: I'm following this link to try to get the values of several tags: "Parsing XML with namespace in Python via ElementTree". With that approach there is no problem accessing the root tag, like this:

    import sys
    from lxml import etree as ET

    doc = ET.parse('file.xml')
    namespaces_rdf = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}  # add more as needed
    namespaces_dcat = {'dcat': 'http://www.w3.org/ns/dcat#'}  # add more as needed
    namespaces_dct = {'dct': 'http://purl.org/dc/terms/'}
    print doc.findall(…
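The `findall` call is cut off, but one common stumbling block with the pattern above is keeping three separate namespace dicts: `findall()` accepts a single `namespaces` mapping, so all prefixes can go in one dict. A sketch with a small inline document standing in for `file.xml`:

```python
from lxml import etree

xml = (b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
       b' xmlns:dcat="http://www.w3.org/ns/dcat#"'
       b' xmlns:dct="http://purl.org/dc/terms/">'
       b'<dcat:Dataset><dct:title>Example</dct:title></dcat:Dataset>'
       b'</rdf:RDF>')
doc = etree.fromstring(xml)

# one mapping holding every prefix works for findall() and xpath() alike
ns = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/',
}
titles = doc.findall('.//dct:title', namespaces=ns)
print([t.text for t in titles])
```

The prefixes in the dict only have to match the ones used in the path expression, not the ones declared in the document.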

lxml xpath unable to display html items

人盡茶涼 · Submitted 2019-12-23 01:43:11

Question: I'm trying to use lxml to parse the webpage below, but something seems to be wrong with my xpath and I'm not sure what I'm doing wrong.

    web_content = requests.get(r"https://www.quandl.com/data/TSE").content
    dataset_count = html.fromstring(web_content)
    print(dataset_count.xpath(r'//*[@id="ember667"]/div[2]/main/section/section/section[2]/div[3]/div[2]/span[2]'))

I'm trying to get it to return the dataset count of 3908, but this xpath doesn't seem to work for me. Any thoughts? Also, I'm hoping…
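Two likely problems with that xpath: ids like `ember667` are generated by the page's JavaScript framework and change on every load, and the count itself may only exist after client-side rendering, so the raw `requests` response might not contain it at all. When the value is server-rendered, anchoring on stable markup is more robust than a long positional path. A sketch against an inline stand-in (the `dataset-count` class name is purely illustrative, not Quandl's real markup):

```python
from lxml import html

# stand-in for server-rendered markup; auto-generated ids such as
# "ember667" change on every page load, so target a stable attribute
snippet = '<section><span class="dataset-count">3,908</span> datasets</section>'
page = html.fromstring(snippet)

count = page.xpath('//span[@class="dataset-count"]/text()')
print(count)
```

If the value genuinely only appears after JavaScript runs, no xpath over the raw HTML will find it; a browser-driven tool or the site's JSON API would be needed instead.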

How do I map to a dictionary rather than a list?

依然范特西╮ · Submitted 2019-12-23 00:52:46

Question: I have the following function, which does a basic job of mapping an lxml object to a dictionary:

    from lxml import etree

    tree = etree.parse('file.xml')
    root = tree.getroot()

    def xml_to_dict(el):
        d = {}
        if el.text:
            print '***write tag as string'
            d[el.tag] = el.text
        else:
            d[el.tag] = {}
        children = el.getchildren()
        if children:
            d[el.tag] = map(xml_to_dict, children)
        return d

    v = xml_to_dict(root)

At the moment it gives me:

    >>> print v
    {'root': [{'a': '1'}, {'a': [{'b': '2'}, {'b': '2'}]}, {'aa':…
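The lists appear because `map(xml_to_dict, children)` produces a list of one-key dicts. To nest dicts instead, merge each child's mapping into a single dict per node. A sketch of that variant (note the caveat it inherits: sibling children that share a tag name overwrite one another, which the list form did not suffer from):

```python
from lxml import etree


def xml_to_dict(el):
    # leaves map tag -> text; interior nodes map tag -> dict of children
    # NOTE: sibling children sharing a tag name overwrite one another here
    if len(el) == 0:
        return {el.tag: el.text}
    merged = {}
    for child in el:
        merged.update(xml_to_dict(child))
    return {el.tag: merged}


root = etree.fromstring('<root><a>1</a><b><c>2</c></b></root>')
print(xml_to_dict(root))
# -> {'root': {'a': '1', 'b': {'c': '2'}}}
```

If repeated tags must be preserved, a hybrid that collects same-named siblings into a list under one key is the usual compromise.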

How to fix lxml assertion error

狂风中的少年 · Submitted 2019-12-22 18:06:38

Question: I have an Ubuntu machine running Python 2.7.6. When I try using lxml, which was installed with pip, I get the following error:

    Traceback (most recent call last):
      File "./export.py", line 44, in fetch_item
        root.append(elem)
      File "lxml.etree.pyx", line 742, in lxml.etree._Element.append (src/lxml/lxml.etree.c:44339)
      File "apihelpers.pxi", line 24, in lxml.etree._assertValidNode (src/lxml/lxml.etree.c:14127)
    AssertionError: invalid Element proxy at 140443984439416

What does this mean…
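"Invalid Element proxy" typically means the element being appended belongs to a tree that has been freed or is being mutated concurrently (lxml elements are proxies over C-level nodes). A common defensive pattern, sketched under the assumption that `elem` comes from another document, is to deep-copy the element before appending so the destination tree holds its own node:

```python
import copy

from lxml import etree

src = etree.fromstring('<src><item>1</item></src>')
dest = etree.Element('dest')

# append() *moves* a live element out of its original tree; if the source
# tree has been freed (or is mutated by another thread) the element proxy
# can turn invalid -- deep-copying first keeps the two trees independent
dest.append(copy.deepcopy(src[0]))
print(etree.tostring(dest))
```

If the error appears under threading, serializing access to the shared tree is the other half of the fix, since lxml trees are not thread-safe for concurrent mutation.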

Missing lxml module in python?

↘锁芯ラ · Submitted 2019-12-22 17:58:36

Question: I want to use the python-docx library to process Word files. docx.py references lxml, as I assume from:

    from lxml import etree

When I start the script, I get the error: No module named lxml. Is lxml a standard library? If so, why isn't it referenced properly? I'm on IronPython 2.7 RC1.

Answer 1: You need to install lxml, which is not part of the stdlib. I don't know if it will work with IronPython, though. Update: It seems it might be non-trivial to get lxml working with IronPython. See this question: How to get…
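On CPython the missing-module part is a one-line install; the IronPython part is the real obstacle, since lxml is a CPython C extension that IronPython cannot load directly. A sketch of the CPython route:

```shell
# lxml is a third-party C extension, not part of the standard library,
# so it must be installed explicitly (this works on CPython; IronPython
# cannot load CPython C extensions, so lxml will not import there as-is)
pip install lxml

# quick smoke test of the install
python -c "from lxml import etree; print(etree.LXML_VERSION)"
```

For IronPython specifically, running the docx-processing step under CPython, or a bridging layer, tends to be the practical workaround.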

How to use BeautifulSoup to parse google search results in Python

人走茶凉 · Submitted 2019-12-22 16:34:07

Question: I am trying to parse the first page of Google search results, specifically the title and the small summary that is provided. Here is what I have so far:

    from urllib.request import urlretrieve
    import urllib.parse
    from urllib.parse import urlencode, urlparse, parse_qs
    import webbrowser
    from bs4 import BeautifulSoup
    import requests

    address = 'https://google.com/#q='  # Default Google search address start
    file = open("OCR.txt", "rt")  # Open text document that contains the question
    word = file…
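Independent of how the page is fetched (and note that scraping Google's live results page is fragile, since its markup changes often and automated queries are restricted), the BeautifulSoup side of the task looks like the sketch below. The `g`/`st` class names are purely illustrative stand-ins, not Google's current markup:

```python
from bs4 import BeautifulSoup

# a stand-in for one downloaded result page -- the class names here are
# illustrative only; real search-result markup changes frequently
html_doc = '''
<div class="g">
  <h3>Example result title</h3>
  <span class="st">Example summary text.</span>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
results = [(g.h3.get_text(strip=True),
            g.find('span', class_='st').get_text(strip=True))
           for g in soup.find_all('div', class_='g')]
print(results)
```

For production use, a supported search API avoids both the markup churn and the terms-of-service issue.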

random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

☆樱花仙子☆ · Submitted 2019-12-22 14:41:32

Question: For the sake of testing my web app, I am pasting some random characters from /dev/random into my web frontend. This line throws the error:

    print repr(comment)
    import html5lib
    print html5lib.parse(comment, treebuilder="lxml")

The repr of the comment is:

    'a\xef\xbf\xbd\xef\xbf\xbd\xc9\xb6E\xef\xbf\xbd\xef\xbf\xbd`\xef\xbf\xbd]\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd2 \x14\xef\xbf\xbd\xc7\xbe\xef\xbf\xbdy\xcb\x9c\xef\xbf\xbdi1O\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbdZ\xef\xbf\xbd.\xef\xbf\xbd\x17^C'
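The repr shows raw bytes containing control characters such as `\x14` and `\x17`; XML 1.0 forbids NUL and most other control characters, which is why an lxml-backed tree builder rejects the string. A sanitizing sketch that decodes the bytes and strips the characters an XML-backed parser cannot hold:

```python
import re

# XML 1.0 forbids NUL and most other control characters, so decode the
# raw bytes first and strip anything an XML-backed tree builder rejects
_ILLEGAL = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def sanitize(raw_bytes):
    text = raw_bytes.decode('utf-8', errors='replace')
    return _ILLEGAL.sub(u'', text)


print(sanitize(b'a\x00b\x01c'))
# -> abc
```

Running user input through a filter like this before handing it to html5lib keeps random /dev/random pastes from crashing the parser while leaving ordinary text untouched.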