lxml | 易学教程

lxml.html extract a string by searching for a keyword

Question: I have a fragment of HTML like the one below: <li><label>The Keyword:</label><a href="../../..">The text</a></li> I want to get the string "The keyword: The text". I know that I can get the XPath of this HTML using Chrome's inspector or Firebug in Firefox, call select(xpath).extract(), and then strip the HTML tags to get the string. However, that approach is not generic enough, because the XPath is not consistent across pages. Hence, I'm thinking of the following approach: first, search for "The Keyword:" using
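The excerpt is cut off before the asker's own approach, so the following is only a minimal sketch of one way to do a keyword-based lookup with lxml.html. The sample markup, the function name find_by_keyword, and the choice to join text nodes with single spaces are assumptions made for illustration, not part of the original question.

```python
from lxml import html

# Hypothetical page fragment modeled on the snippet in the question.
PAGE = """
<ul>
  <li><label>Other:</label><a href="/x">Ignore me</a></li>
  <li><label>The Keyword:</label><a href="../../..">The text</a></li>
</ul>
"""


def find_by_keyword(page_source, keyword):
    """Return the joined text of the first <li> whose <label> contains keyword.

    Returns None when no such <li> exists.
    """
    tree = html.fromstring(page_source)
    # $kw is an XPath variable, filled in from the keyword argument below,
    # which avoids quoting problems when the keyword contains quotes.
    matches = tree.xpath(
        "//li[label[contains(normalize-space(.), $kw)]]", kw=keyword
    )
    if not matches:
        return None
    # itertext() yields every text node under the <li>, so no tag stripping
    # is needed; empty and whitespace-only nodes are skipped.
    parts = (text.strip() for text in matches[0].itertext())
    return " ".join(part for part in parts if part)


result = find_by_keyword(PAGE, "The Keyword:")
print(result)
```

Because the lookup is anchored on the label text rather than on a position in the tree, it does not depend on the element sitting at the same XPath on every page. Note that contains() is case-sensitive.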

Using LXML with Html, Requests, and ETree, it gives links, but won't let me search links for specific text

Question: I am trying to pull specific data out of the link provided below. When I run the code, it gives me all of the href links as expected, but when I test further for the same string using the contains syntax, it comes back empty. I've checked and read the docs, as well as DevHints, and everywhere I look the "contains" syntax is the recommended way to capture what I'm looking for when all I know is that the string will be included, but not where or how. I'm trying to build a scraper to
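The asker's code and target page are not included in the excerpt, so the cause of the empty result cannot be determined from it. As a point of comparison, here is a small sketch of the XPath contains() syntax applied to links with lxml.html. The sample page and the substrings searched for are assumptions made for illustration.

```python
from lxml import html

# Hypothetical page; the real target page is not shown in the excerpt.
PAGE = """
<html><body>
  <a href="/teams/roster.html">Roster</a>
  <a href="/teams/schedule.html">Schedule</a>
  <a href="/about.html">About</a>
</body></html>
"""

tree = html.fromstring(PAGE)

# Every href on the page.
all_links = tree.xpath("//a/@href")

# contains() takes the string to search in first, then the substring.
team_links = tree.xpath('//a[contains(@href, "teams")]/@href')

# The same test applied to the visible link text instead of the URL.
schedule_links = tree.xpath('//a[contains(text(), "Schedule")]/@href')

print(all_links)
print(team_links)
print(schedule_links)
```

Two things worth checking when such a query returns an empty list: the function name and the substring are both case-sensitive in XPath 1.0, and the test has to target the node that actually holds the string (the @href attribute versus the link text).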

French and lxml text

Question: I'm trying to assign a valid French text string to an element's text using lxml:
    el = etree.Element("someelement")
    el.text = 'Disponible à partir du 1er Octobre'
I get the error:
    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I've also tried:
    el.text = etree.CDATA('Disponible à partir du 1er Octobre')
However, I get the same error. How do I handle French in XML, in particular ISO-8859-1? There are ways to specify encoding within the tostring()
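One common cause of this ValueError is assigning a byte string that contains non-ASCII bytes instead of a Unicode string. The sketch below assumes that is what happened here: it decodes ISO-8859-1 bytes into text before assigning them, and picks the output encoding only at serialization time. The raw byte string is an assumed input, not taken from the question.

```python
from lxml import etree

# Assumed input: the French sentence as ISO-8859-1 encoded bytes,
# where the byte 0xE0 is the letter "à".
raw = b"Disponible \xe0 partir du 1er Octobre"

# Decode to a Unicode string first; lxml accepts Unicode text directly.
text = raw.decode("iso-8859-1")

el = etree.Element("someelement")
el.text = text

# The encoding is chosen when serializing, not when assigning the text.
xml_bytes = etree.tostring(el, encoding="ISO-8859-1", xml_declaration=True)
print(xml_bytes)
```

The element tree itself holds Unicode text; ISO-8859-1 only comes into play when the tree is read from or written to bytes.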

Why can't lxml.html parse all div elements in target.html?

Question: Please download the file from Dropbox and save it as /tmp/target.html. Open it in Firefox with Firebug to inspect the HTML structure. It is clear that there are at least 10 div elements in target.html. Now parse all the div elements in target.html with lxml.html:
    python3
    >>> import lxml.html
    >>> doc = lxml.html.parse("/tmp/target.html")
    >>> divs = doc.xpath("//div")
    >>> len(divs)
    4
The result is 4. Why can't so many of the divs be parsed with the code above? There are at least 10 divs in target.html. Same

how to install lxml with pypy in virtualenv

Question: I am trying to use PyPy in a virtualenv for better performance when running my Python program. I was able to install all the required modules except lxml. So far, I have tried:
    pip install lxml
I also tried:
    pip install --upgrade lxml
It shows the following message at the end:
    Successfully installed lxml-3.4.4
However, when I start the pypy prompt and try to import lxml, I get the error:
    (venv)➜ pypy pypy
    Python 2.7.3 (2.2.1+dfsg-1ubuntu0.2, Dec 02 2014, 23:00:55)
    [PyPy 2.2.1 with GCC 4.8.2] on linux2

Not able to install lxml version 3.3.5 in ubuntu

Question: I am using the openpyxl Python package in my application, and I get the following message when using it:
    /usr/local/lib/python2.7/dist-packages/openpyxl/__init__.py:31: UserWarning: The installed version of lxml is too old to be used with openpyxl
      warnings.warn("The installed version of lxml is too old to be used with openpyxl")
openpyxl requires lxml version 3.2.5 or above, and the version on my machine is 3.2.0. When I try to upgrade lxml to the latest version, i.e. 3.3.5, it is getting

Invalid requirements.txt on deploying to AWS. Pip couldn't install lxml

Question: I have a problem deploying a Flask application to an AWS EC2 instance. The problem is probably with pip installing lxml, but I don't know how to solve it.
AWS EC2 platform: 64bit Amazon Linux 2015.03 v1.4.6 running Python 2.7
From the CLI on eb create:
    CalledProcessError: Command '/opt/python/run/venv/bin/pip install -r /opt/python/ondeck/app/requirements.txt' returned non-zero exit status 1.
From the logs:
    src/lxml/lxml.etree.c:200873: error: ‘XML_XPATH_INVALID_ARITY’ undeclared (first use in this function)

lxml xml parsing with html tags inside xml tags

Question:
    <xml> <maintag> <content> lorem ipsum dolor sit and so on </content> </maintag> </xml>
The XML file that I regularly parse may have HTML tags inside the content tag, as shown above. Here is how I parse the file:
    parser = etree.XMLParser(remove_blank_text=False)
    tree = etree.parse(StringIO(xmlFile), parser)
    for item in tree.iter('maintag'):
        my_content = item.find('content').text
        #print my_content
        #output: lorem
As a result, I get my_content = ' lorem ' instead of

What's needed to get BeautifulSoup4+lxml to work with cx_freeze?

Question: Summary: I have a wxPython/bs4 app that I'm building into an exe with cx_freeze. The build succeeds with no errors, but trying to run the EXE results in a FeatureNotFound error from BeautifulSoup4. It's complaining that I don't have my lxml library installed. I've since stripped the program down to its minimal state and still get the error. Has anyone else had success building a bs4 app with cx_freeze? Please take a look at the details below and let me know of any ideas you may have. Thanks,
