lxml

lxml.html extract a string by searching for a keyword

坚强是说给别人听的谎言 submitted on 2020-01-06 19:43:22
Question: I have a portion of HTML like the one below:
    <li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>
I want to get the string "The Keyword: The text". I know that I can get the XPath of the above HTML using Chrome's inspector or Firefox's Firebug, call select(xpath).extract(), and then strip the HTML tags to get the string. However, that approach is not generic enough, since the XPath is not consistent across different pages. Hence, I'm thinking of the following approach: first, search for "The Keyword:" using
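A minimal sketch of the keyword-driven lookup the question describes, using lxml.html and an XPath `contains()` test on the label. The helper name `text_for_keyword` and the wrapping `<ul>` are assumptions made for illustration; they are not part of the original question.

```python
import lxml.html

# The <li> from the question, wrapped in a <ul> (the wrapper is an assumption).
snippet = (
    '<ul><li><label>The Keyword:</label>'
    '<span><a href="../../..">The text</a></span></li></ul>'
)


def text_for_keyword(html, keyword):
    """Return the text of the first <li> whose <label> contains keyword."""
    root = lxml.html.fromstring(html)
    # $kw is an XPath variable, which avoids quoting problems in the keyword.
    items = root.xpath(
        '//li[label[contains(normalize-space(.), $kw)]]', kw=keyword
    )
    if not items:
        return None
    # Join the non-empty text fragments of the <li> with single spaces.
    parts = (t.strip() for t in items[0].itertext())
    return ' '.join(p for p in parts if p)


result = text_for_keyword(snippet, 'The Keyword:')
print(result)
```

Because the expression matches on the label text rather than on a fixed path, it should keep working when the surrounding markup differs from page to page, as long as the `<li><label>` pairing holds.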

Using lxml with html, Requests, and etree: it gives links, but won't let me search links for specific text

瘦欲@ submitted on 2020-01-06 12:42:30
Question: I am trying to pull specific data out of the link provided below. When I run the code, it gives me all of the href links as expected, but when I test further for the same string using the contains syntax, it comes back empty. I've read the docs, as well as DevHints, and everywhere I look the "contains" syntax is the recommended way to capture what I'm looking for when all I know is that the string will be included, but not where or how. I'm trying to build a scraper to
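A short sketch of how XPath `contains()` can filter links with lxml. The page content below is hypothetical, since the page and code from the question are not shown; only the `contains()` usage is the point.

```python
import lxml.html

# Hypothetical page content; the real page from the question is not shown.
page = """
<html><body>
  <a href="/team/roster?id=1">Roster</a>
  <a href="/news/today">News</a>
  <a href="/team/schedule">Schedule</a>
</body></html>
"""

doc = lxml.html.fromstring(page)

# Every href on the page.
all_links = doc.xpath('//a/@href')
# contains() is lowercase and takes the attribute as its first argument.
team_links = doc.xpath('//a[contains(@href, "team")]/@href')
# Matching on the visible link text instead of the URL.
news_links = doc.xpath('//a[contains(text(), "News")]/@href')

print(all_links)
print(team_links)
print(news_links)
```

Note that `contains()` is case-sensitive, so a search string whose case differs from the page returns an empty list. Whether that explains the empty result in the question is only a guess.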

French and lxml text

不打扰是莪最后的温柔 submitted on 2020-01-06 06:07:06
Question: I'm trying to assign a valid French text string to an element's text using lxml:
    el = etree.Element("someelement")
    el.text = 'Disponible à partir du 1er Octobre'
I get the error:
    ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
I've also tried:
    el.text = etree.CDATA('Disponible à partir du 1er Octobre')
However, I get the same error. How do I handle French in XML, in particular ISO-8859-1? There are ways to specify encoding within the tostring()
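A minimal sketch of one way this usually works: give lxml a Unicode string and choose ISO-8859-1 only when serializing. That the asker was on Python 2 with a byte-string literal is an assumption based on the error message, not something the excerpt states.

```python
from lxml import etree

el = etree.Element("someelement")
# On Python 3 every str is Unicode. On Python 2 (an assumption about the
# asker's setup) the literal needs the u'' prefix or an explicit .decode().
el.text = u'Disponible à partir du 1er Octobre'

# The output encoding is chosen at serialization time, not at assignment.
latin1_bytes = etree.tostring(el, encoding='ISO-8859-1')
print(latin1_bytes)
```

The element tree itself always holds Unicode text; ISO-8859-1 only matters for the bytes written out or read in.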

Why can't lxml.html parse all div elements in target.html?

不羁的心 submitted on 2020-01-06 02:46:13
Question: Please download the file from Dropbox and save it as /tmp/target.html. Open it in Firefox with Firebug to inspect the HTML structure. It is clear that there are at least 10 div elements in target.html. Now parse all div elements in target.html with lxml.html:
    python3
    >>> import lxml.html
    >>> doc = lxml.html.parse("/tmp/target.html")
    >>> divs = doc.xpath("//div")
    >>> len(divs)
    4
The result is 4. Why can't the remaining divs be parsed with the code above? There are at least 10 divs in target.html. Same

How to install lxml with PyPy in a virtualenv

混江龙づ霸主 submitted on 2020-01-06 02:45:07
Question: I am trying to use PyPy in a virtualenv for better performance when running my Python program. I was able to install all the required modules except lxml. So far I have tried pip install lxml and also pip install --upgrade lxml. It shows the following message at the end: Successfully installed lxml-3.4.4. However, when I start the pypy prompt and try to import lxml, I get the error:
    (venv)➜ pypy pypy
    Python 2.7.3 (2.2.1+dfsg-1ubuntu0.2, Dec 02 2014, 23:00:55) [PyPy 2.2.1 with GCC 4.8.2] on linux2

Not able to install lxml version 3.3.5 in Ubuntu

自古美人都是妖i submitted on 2020-01-06 02:20:08
Question: I am using the openpyxl Python package in my application, and I get the following message when using it:
    /usr/local/lib/python2.7/dist-packages/openpyxl/__init__.py:31: UserWarning: The installed version of lxml is too old to be used with openpyxl
    warnings.warn("The installed version of lxml is too old to be used with openpyxl")
openpyxl requires lxml version 3.2.5 or above, and the version on my machine is 3.2.0. When I try to upgrade lxml to the latest version, i.e. 3.3.5, it is getting
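A small sketch for checking which lxml version the running interpreter actually imports, which helps confirm whether an upgrade took effect. The constant name `REQUIRED` is hypothetical; the 3.2.5 threshold is the one quoted in the question.

```python
from lxml import etree

# Minimum version quoted in the question (3.2.5); REQUIRED is a hypothetical name.
REQUIRED = (3, 2, 5)

installed = etree.LXML_VERSION  # a tuple such as (3, 2, 0, 0)
new_enough = installed[:3] >= REQUIRED

print("lxml", ".".join(map(str, installed)), "new enough:", new_enough)
# The libxml2 version lxml was compiled against, useful when a build fails.
print("libxml2 compiled against:", etree.LIBXML_COMPILED_VERSION)
```

Running this with the same interpreter that emits the warning shows whether the old copy is still the one on the import path.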

Invalid requirements.txt on deploying to AWS. Pip couldn't install lxml

杀马特。学长 韩版系。学妹 submitted on 2020-01-06 01:33:08
Question: I have a problem deploying a Flask application to an AWS EC2 instance, probably with pip installing lxml, but I don't know how to solve the issue. AWS EC2 platform: 64bit Amazon Linux 2015.03 v1.4.6 running Python 2.7. From the CLI on eb create:
    CalledProcessError: Command '/opt/python/run/venv/bin/pip install -r /opt/python/ondeck/app/requirements.txt' returned non-zero exit status 1.
From the logs:
    src/lxml/lxml.etree.c:200873: error: ‘XML_XPATH_INVALID_ARITY’ undeclared (first use in this function)

lxml xml parsing with html tags inside xml tags

本秂侑毒 submitted on 2020-01-05 13:14:13
Question:
    <xml> <maintag> <content> lorem <br>ipsum</br> <strong> dolor sit </strong> and so on </content> </maintag> </xml>
The XML file that I regularly parse may have HTML tags inside the content tag, as shown above. Here is how I parse the file:
    parser = etree.XMLParser(remove_blank_text=False)
    tree = etree.parse(StringIO(xmlFile), parser)
    for item in tree.iter('maintag'):
        my_content = item.find('content').text
        #print my_content
        #output: lorem
As a result, my_content = ' lorem ' instead of

What's needed to get BeautifulSoup4+lxml to work with cx_freeze?

怎甘沉沦 submitted on 2020-01-05 08:51:50
Question: Summary: I have a wxPython/bs4 app that I'm building into an exe with cx_freeze. The build succeeds with no errors, but trying to run the EXE results in a FeatureNotFound error from BeautifulSoup4. It's complaining that I don't have the lxml library installed. I've since stripped the program down to its minimal state and still get the error. Has anyone else had success building a bs4 app with cx_freeze? Please take a look at the details below and let me know of any ideas you may have. Thanks,

What's needed to get BeautifulSoup4+lxml to work with cx_freeze?

强颜欢笑 submitted on 2020-01-05 08:51:09
Question: Summary: I have a wxPython/bs4 app that I'm building into an exe with cx_freeze. The build succeeds with no errors, but trying to run the EXE results in a FeatureNotFound error from BeautifulSoup4. It's complaining that I don't have the lxml library installed. I've since stripped the program down to its minimal state and still get the error. Has anyone else had success building a bs4 app with cx_freeze? Please take a look at the details below and let me know of any ideas you may have. Thanks,