lxml

Replacing elements with lxml.html

魔方 西西 · Submitted 2019-12-23 08:34:24

Question: I'm fairly new to lxml and to HTML parsers as a whole. I was wondering if there is a way to replace an element within a tree with another element. For example I have:

    body = """<code> def function(arg): print arg </code> Blah blah blah <code> int main() { return 0; } </code> """
    doc = lxml.html.fromstring(body)
    codeblocks = doc.cssselect('code')
    for block in codeblocks:
        lexer = guess_lexer(block.text_content())
        hilited = highlight(block.text_content(), lexer, HtmlFormatter())
        doc.replace(block…
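The excerpt is cut off, but the visible `doc.replace(block, …)` call will fail whenever `block` is not a direct child of `doc`, because `replace()` must be called on the element's own parent. A minimal sketch of the swap (using a plain `<pre>` as the replacement instead of the Pygments output, which would first need parsing with `fromstring`):

```python
from lxml import html

body = "<div><code>def f(x): return x</code> blah <code>int main() {}</code></div>"
doc = html.fromstring(body)

for block in doc.xpath('//code'):
    # build the replacement element -- here a plain <pre> holding the old text;
    # with Pygments you would parse the highlight() output the same way
    new = html.fromstring('<pre>%s</pre>' % block.text_content())
    # replace() is a method of the *parent*, so reach it via getparent()
    block.getparent().replace(block, new)

print(html.tostring(doc))
```

The key point is `block.getparent().replace(...)` rather than `doc.replace(...)`: it works at any nesting depth.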

extracting paragraph in python using lxml

岁酱吖の · Submitted 2019-12-23 02:45:11

Question: I would like to extract the paragraphs from an HTML page in Python. I used the lxml module, but it doesn't do exactly what I am looking for.

    print html.parse(url).xpath('//p')[1].text_content()

The page contains:

    <span id="midArticle_1"></span><p>Here is the First Paragraph.</p><span id="midArticle_2"></span><p>Here is the second Paragraph.</p><span id="midArticle_3"></span><p>Paragraph Three."</p>

I should add that different pages have different numbers of paragraphs, so I would like to build a list and put each paragraph into it…
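Since the number of paragraphs varies per page, a list comprehension over the full `//p` node-set (rather than indexing a single element as in the excerpt) collects them all. A small sketch against an inline stand-in for the page:

```python
from lxml import html

snippet = ('<div><span id="midArticle_1"></span><p>Here is the First Paragraph.</p>'
           '<span id="midArticle_2"></span><p>Here is the second Paragraph.</p></div>')
page = html.fromstring(snippet)

# one list entry per <p>, however many the page happens to contain
paragraphs = [p.text_content() for p in page.xpath('//p')]
print(paragraphs)
```

For a live URL, `html.parse(url).xpath('//p')` yields the same kind of node-set to iterate over.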

How to prevent lxml.etree.HTML( data ) from crashing on certain type of data?

心已入冬 · Submitted 2019-12-23 02:40:06

Question: I'm running etree.HTML(data) as below for lots of different data contents. With one specific data content, however, lxml.etree.HTML will not parse it; instead it goes into an infinite loop and consumes 100% CPU. Does anyone know exactly what in the data below is causing this? And more importantly, how can I prevent this from happening on arbitrary broken data? Edit: This turns out to be a bug in lxml version 2.7.8 and below (at least). After updating to lxml 2.9.0, the bug is…
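The fix reported in the edit was simply upgrading the library. As a defensive measure against future parser hangs on unknown, broken input, one option (a sketch, not an lxml feature) is to run the parse in a child process that can be killed after a timeout:

```python
import multiprocessing

from lxml import etree


def _parse(data, queue):
    queue.put(etree.tostring(etree.HTML(data)))


def parse_with_timeout(data, seconds=5):
    """Run the parse in a child process so a hung parser can be killed."""
    queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_parse, args=(data, queue))
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()   # reclaim the runaway parser
        proc.join()
        raise RuntimeError('parse timed out')
    return queue.get()


if __name__ == '__main__':
    print(parse_with_timeout('<p>hello</p>'))
```

This costs a process spawn per document, so it is only worth it when inputs are untrusted; the real cure remains keeping lxml/libxml2 up to date.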

Accessing values in xml file with namespaces in python 2.7 lxml

那年仲夏 · Submitted 2019-12-23 01:45:17

Question: I'm following this link to try to get the values of several tags: "Parsing XML with namespace in Python via ElementTree". With that approach there is no problem accessing the root tag, like this:

    import sys
    from lxml import etree as ET

    doc = ET.parse('file.xml')
    namespaces_rdf = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}  # add more as needed
    namespaces_dcat = {'dcat': 'http://www.w3.org/ns/dcat#'}  # add more as needed
    namespaces_dct = {'dct': 'http://purl.org/dc/terms/'}
    print doc.findall(…
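The `findall` call is cut off, but one common stumbling block with the pattern above is keeping three separate namespace dicts: `findall()` accepts a single `namespaces` mapping, so all prefixes can go in one dict. A sketch with a small inline document standing in for `file.xml`:

```python
from lxml import etree

xml = (b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
       b' xmlns:dcat="http://www.w3.org/ns/dcat#"'
       b' xmlns:dct="http://purl.org/dc/terms/">'
       b'<dcat:Dataset><dct:title>Example</dct:title></dcat:Dataset>'
       b'</rdf:RDF>')
doc = etree.fromstring(xml)

# one mapping holding every prefix works for findall() and xpath() alike
ns = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dcat': 'http://www.w3.org/ns/dcat#',
    'dct': 'http://purl.org/dc/terms/',
}
titles = doc.findall('.//dct:title', namespaces=ns)
print([t.text for t in titles])
```

The prefixes in the dict only have to match the ones used in the path expression, not the ones declared in the document.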

lxml xpath unable to display html items

人盡茶涼 · Submitted 2019-12-23 01:43:11

Question: I'm trying to use lxml to parse the webpage below, but something seems to be wrong with my xpath and I'm not sure what I'm doing wrong.

    web_content = requests.get(r"https://www.quandl.com/data/TSE").content
    dataset_count = html.fromstring(web_content)
    print(dataset_count.xpath(r'//*[@id="ember667"]/div[2]/main/section/section/section[2]/div[3]/div[2]/span[2]'))

I'm trying to get it to return the dataset count of 3908, but this xpath doesn't seem to work for me. Any thoughts? Also, I'm hoping…
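Two likely problems with that xpath: ids like `ember667` are generated by the page's JavaScript framework and change on every load, and the count itself may only exist after client-side rendering, so the raw `requests` response might not contain it at all. When the value is server-rendered, anchoring on stable markup is more robust than a long positional path. A sketch against an inline stand-in (the `dataset-count` class name is purely illustrative, not Quandl's real markup):

```python
from lxml import html

# stand-in for server-rendered markup; auto-generated ids such as
# "ember667" change on every page load, so target a stable attribute
snippet = '<section><span class="dataset-count">3,908</span> datasets</section>'
page = html.fromstring(snippet)

count = page.xpath('//span[@class="dataset-count"]/text()')
print(count)
```

If the value genuinely only appears after JavaScript runs, no xpath over the raw HTML will find it; a browser-driven tool or the site's JSON API would be needed instead.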

How do I map to a dictionary rather than a list?

依然范特西╮ · Submitted 2019-12-23 00:52:46

Question: I have the following function, which does a basic job of mapping an lxml object to a dictionary:

    from lxml import etree

    tree = etree.parse('file.xml')
    root = tree.getroot()

    def xml_to_dict(el):
        d = {}
        if el.text:
            print '***write tag as string'
            d[el.tag] = el.text
        else:
            d[el.tag] = {}
        children = el.getchildren()
        if children:
            d[el.tag] = map(xml_to_dict, children)
        return d

    v = xml_to_dict(root)

At the moment it gives me:

    >>> print v
    {'root': [{'a': '1'}, {'a': [{'b': '2'}, {'b': '2'}]}, {'aa':…
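The lists appear because `map(xml_to_dict, children)` produces a list of one-key dicts. To nest dicts instead, merge each child's mapping into a single dict per node. A sketch of that variant (note the caveat it inherits: sibling children that share a tag name overwrite one another, which the list form did not suffer from):

```python
from lxml import etree


def xml_to_dict(el):
    # leaves map tag -> text; interior nodes map tag -> dict of children
    # NOTE: sibling children sharing a tag name overwrite one another here
    if len(el) == 0:
        return {el.tag: el.text}
    merged = {}
    for child in el:
        merged.update(xml_to_dict(child))
    return {el.tag: merged}


root = etree.fromstring('<root><a>1</a><b><c>2</c></b></root>')
print(xml_to_dict(root))
# -> {'root': {'a': '1', 'b': {'c': '2'}}}
```

If repeated tags must be preserved, a hybrid that collects same-named siblings into a list under one key is the usual compromise.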

How to fix lxml assertion error

狂风中的少年 · Submitted 2019-12-22 18:06:38

Question: I have an Ubuntu machine running Python 2.7.6. When I try using lxml, which was installed with pip, I get the following error:

    Traceback (most recent call last):
      File "./export.py", line 44, in fetch_item
        root.append(elem)
      File "lxml.etree.pyx", line 742, in lxml.etree._Element.append (src/lxml/lxml.etree.c:44339)
      File "apihelpers.pxi", line 24, in lxml.etree._assertValidNode (src/lxml/lxml.etree.c:14127)
    AssertionError: invalid Element proxy at 140443984439416

What does this mean…
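"Invalid Element proxy" typically means the element being appended belongs to a tree that has been freed or is being mutated concurrently (lxml elements are proxies over C-level nodes). A common defensive pattern, sketched under the assumption that `elem` comes from another document, is to deep-copy the element before appending so the destination tree holds its own node:

```python
import copy

from lxml import etree

src = etree.fromstring('<src><item>1</item></src>')
dest = etree.Element('dest')

# append() *moves* a live element out of its original tree; if the source
# tree has been freed (or is mutated by another thread) the element proxy
# can turn invalid -- deep-copying first keeps the two trees independent
dest.append(copy.deepcopy(src[0]))
print(etree.tostring(dest))
```

If the error appears under threading, serializing access to the shared tree is the other half of the fix, since lxml trees are not thread-safe for concurrent mutation.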

Missing lxml module in python?

↘锁芯ラ · Submitted 2019-12-22 17:58:36

Question: I want to use the python-docx library to process Word files. docx.py references lxml, as I assume from:

    from lxml import etree

When I start the script, I get the error: No module named lxml. Is lxml a standard library? If so, why isn't it referenced properly? I'm on IronPython 2.7 RC1.

Answer 1: You need to install lxml, which is not part of the stdlib. I don't know if it will work with IronPython, though. Update: It seems it might be non-trivial to get lxml working with IronPython. See this question: How to get…
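On CPython the missing-module part is a one-line install; the IronPython part is the real obstacle, since lxml is a CPython C extension that IronPython cannot load directly. A sketch of the CPython route:

```shell
# lxml is a third-party C extension, not part of the standard library,
# so it must be installed explicitly (this works on CPython; IronPython
# cannot load CPython C extensions, so lxml will not import there as-is)
pip install lxml

# quick smoke test of the install
python -c "from lxml import etree; print(etree.LXML_VERSION)"
```

For IronPython specifically, running the docx-processing step under CPython, or a bridging layer, tends to be the practical workaround.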

How to use BeautifulSoup to parse google search results in Python

人走茶凉 · Submitted 2019-12-22 16:34:07

Question: I am trying to parse the first page of Google search results, specifically the title and the small summary that is provided. Here is what I have so far:

    from urllib.request import urlretrieve
    import urllib.parse
    from urllib.parse import urlencode, urlparse, parse_qs
    import webbrowser
    from bs4 import BeautifulSoup
    import requests

    address = 'https://google.com/#q='  # Default Google search address start
    file = open("OCR.txt", "rt")  # Open text document that contains the question
    word = file…
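Independent of how the page is fetched (and note that scraping Google's live results page is fragile, since its markup changes often and automated queries are restricted), the BeautifulSoup side of the task looks like the sketch below. The `g`/`st` class names are purely illustrative stand-ins, not Google's current markup:

```python
from bs4 import BeautifulSoup

# a stand-in for one downloaded result page -- the class names here are
# illustrative only; real search-result markup changes frequently
html_doc = '''
<div class="g">
  <h3>Example result title</h3>
  <span class="st">Example summary text.</span>
</div>
'''
soup = BeautifulSoup(html_doc, 'html.parser')
results = [(g.h3.get_text(strip=True),
            g.find('span', class_='st').get_text(strip=True))
           for g in soup.find_all('div', class_='g')]
print(results)
```

For production use, a supported search API avoids both the markup churn and the terms-of-service issue.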

random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

☆樱花仙子☆ · Submitted 2019-12-22 14:41:32

Question: For the sake of testing my web app, I am pasting some random characters from /dev/random into my web frontend. This line throws the error:

    print repr(comment)
    import html5lib
    print html5lib.parse(comment, treebuilder="lxml")

The repr of the comment is:

    'a\xef\xbf\xbd\xef\xbf\xbd\xc9\xb6E\xef\xbf\xbd\xef\xbf\xbd`\xef\xbf\xbd]\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd2 \x14\xef\xbf\xbd\xc7\xbe\xef\xbf\xbdy\xcb\x9c\xef\xbf\xbdi1O\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbdZ\xef\xbf\xbd.\xef\xbf\xbd\x17^C'
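The repr shows raw bytes containing control characters such as `\x14` and `\x17`; XML 1.0 forbids NUL and most other control characters, which is why an lxml-backed tree builder rejects the string. A sanitizing sketch that decodes the bytes and strips the characters an XML-backed parser cannot hold:

```python
import re

# XML 1.0 forbids NUL and most other control characters, so decode the
# raw bytes first and strip anything an XML-backed tree builder rejects
_ILLEGAL = re.compile(u'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def sanitize(raw_bytes):
    text = raw_bytes.decode('utf-8', errors='replace')
    return _ILLEGAL.sub(u'', text)


print(sanitize(b'a\x00b\x01c'))
# -> abc
```

Running user input through a filter like this before handing it to html5lib keeps random /dev/random pastes from crashing the parser while leaving ordinary text untouched.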