lxml

lxml etree.parse MemoryAllocation Error

不问归期 submitted on 2021-02-07 19:52:25
Question: I'm using lxml etree.parse to parse a rather large XML file (around 65 MB to 300 MB). When I run my standalone Python script containing the function below, I get a memory allocation failure:

    Error: Memory allocation failed : xmlSAX2Characters, line 5350155, column 16

Partial function code:

    def getID():
        try:
            from lxml import etree
            xml = etree.parse(<xml_file>)  # here is where the failure occurs
            for element in xml.iter():
                ...
            result = <formed by concatenating element texts>
            return result
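
No answer survives in this excerpt. One direction often tried for this kind of failure is to stream the file with iterparse instead of building the whole tree with etree.parse, clearing each element once its text has been read. The sketch below is an assumption, not the thread's resolution: concat_texts and the sample document are hypothetical, and the fallback import is there only so the snippet still runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # preferred; iterparse is used the same way here
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def concat_texts(source):
    """Stream through an XML source and join the text of every element.

    `source` may be a file path or a binary file-like object.
    """
    parts = []
    # "end" events fire once an element has been read completely.
    for _event, element in etree.iterparse(source, events=("end",)):
        if element.text and element.text.strip():
            parts.append(element.text.strip())
        # Drop the element's text and children so they can be freed.
        element.clear()
    return "".join(parts)


# Hypothetical in-memory document standing in for the large file.
sample = io.BytesIO(b"<root><id>A1</id><id>B2</id></root>")
print(concat_texts(sample))
```

Whether this avoids the allocation failure depends on the file; a single very large text node would still have to fit in memory.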

How to preserve namespace information when parsing HTML with lxml?

此生再无相见时 submitted on 2021-02-07 12:18:55
Question:

    >>> from lxml.etree import HTML, tostring
    >>> tostring(HTML('<fb:like>'))
    '<html><body><like/></body></html>'

Note how the tag turns from <fb:like> to simply <like>. This makes processing pages that incorporate XFBML with lxml much harder. (The same thing happens to <g:plusone></g:plusone>.) Any help is appreciated.

Answer 1: Try adding the namespace prefix definitions that are missing. Otherwise lxml drops the namespace prefixes, supposedly to make things easier for you. Most likely the sites you try to
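
The answer is cut off here, so the following is only a sketch of what "adding the namespace prefix definitions" can look like. It uses the XML parser rather than HTML(), because that is the case where the outcome is certain: once the prefix is declared, the element keeps its namespace. The namespace URI and the markup are assumptions for illustration, and the fallback import only keeps the snippet runnable without lxml.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Assumed namespace URI for the fb prefix; the page itself may not declare it.
FB_NS = "http://www.facebook.com/2008/fbml"

# The xmlns:fb declaration is the "missing prefix definition" added by hand.
markup = f'<html xmlns:fb="{FB_NS}"><body><fb:like/></body></html>'
root = etree.fromstring(markup)

# Namespaced elements are addressed as {namespace-uri}localname.
like = root.find(f".//{{{FB_NS}}}like")
print(like.tag)
```

How the HTML parser treats the same declaration is not shown in the excerpt and is not verified here.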

ImportError: No module named lxml - Even though LXML Is installed

拜拜、爱过 submitted on 2021-02-07 11:42:26
Question: I'm getting the error "ImportError: No module named lxml" even though lxml is definitely installed. Specifically, it's installed within the Python virtualenv for the project, and ultimately I'm working with the Python/Amazon Product API. I get the error after trying to run one of the example scripts for that project from the terminal (Mac). How can I fix this, or track down the issue further? Google searching led me to: reinstall lxml; ensure the Xcode license was agreed to: sudo xcodebuild
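
A frequent cause of this symptom is that the script runs under a different interpreter than the one the package was installed into. That is an assumption about this case, not something the excerpt confirms. The hypothetical helper below, run with the same command that launches the failing script, reports which interpreter is active and where (if anywhere) it finds lxml.

```python
import importlib.util
import sys


def describe_environment(module_name="lxml"):
    """Report the running interpreter and where it would import a module from."""
    spec = importlib.util.find_spec(module_name)
    return {
        # Path of the Python binary actually executing this script.
        "executable": sys.executable,
        # True when running inside a venv-style environment.
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
        # File the module would be loaded from, or None if it is not importable.
        "module_location": spec.origin if spec else None,
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If module_location is None while pip reports lxml as installed, pip and the script are most likely bound to different interpreters.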

How to get path of all elements in lxml with attribute

我们两清 submitted on 2021-02-07 10:37:22
Question: I have the following code:

    tree = etree.ElementTree(new_xml)
    for e in new_xml.iter():
        print(tree.getpath(e), e.text)

This will give me something like the following:

    /Item/Purchases
    /Item/Purchases/Purchase[1]
    /Item/Purchases/Purchase[1]/URL http://tvgo.xfinity.com/watch/x/6091165185315991112/movies
    /Item/Purchases/Purchase[1]/Rating R
    /Item/Purchases/Purchase[2]
    /Item/Purchases/Purchase[2]/URL http://tvgo.xfinity.com/watch/x/6091165185315991112/movies
    /Item/Purchases/Purchase[2]/Rating R
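
The rest of the question is cut off, but the title suggests the goal is to list each element's path together with its attributes. As a guess at that intent, the sketch below walks the tree by hand instead of calling getpath, so that the attributes can be emitted next to each path. iter_paths, the id attribute and the sample document are all hypothetical, and the fallback import only keeps the snippet runnable without lxml.

```python
from collections import Counter

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def iter_paths(element, path=None):
    """Yield (path, attributes, text) for an element and all its descendants.

    Assumes the tree contains only elements (no comments or
    processing instructions).
    """
    path = path or f"/{element.tag}"
    yield path, dict(element.attrib), element.text
    totals = Counter(child.tag for child in element)
    seen = Counter()
    for child in element:
        seen[child.tag] += 1
        step = child.tag
        # Add a positional index only when siblings share the same tag.
        if totals[child.tag] > 1:
            step += f"[{seen[child.tag]}]"
        yield from iter_paths(child, f"{path}/{step}")


# Hypothetical document; the "id" attribute is invented for illustration.
root = etree.fromstring(
    "<Item><Purchases>"
    '<Purchase id="1"><Rating>R</Rating></Purchase>'
    '<Purchase id="2"><Rating>R</Rating></Purchase>'
    "</Purchases></Item>"
)
for path, attrs, text in iter_paths(root):
    print(path, attrs, text)
```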

Anyway to scrape a link that redirects?

≯℡__Kan透↙ submitted on 2021-02-07 09:47:58
Question: Is there any way to make Python follow a link such as a bit.ly link and then scrape the resulting link? On the page I am scraping, the only link I can get is one that redirects, and the information I need is located at the redirect target.

Answer 1: There are 3 types of redirection:

- HTTP: as information in the response headers (with code 301, 302, 3xx)
- HTML: as a <meta> tag in the HTML (Wikipedia: Meta refresh)
- JavaScript: as code like window.location = new_url

requests execute
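
The answer breaks off here. To illustrate the second type in the list above, the sketch below pulls the target URL out of a meta refresh tag using only the standard library. MetaRefreshFinder, find_meta_refresh and the example.com page are hypothetical, and the parsing assumes the common content="<delay>; url=<target>" form.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Remember the target of the first <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0; url=https://example.com/final".
        content = attributes.get("content") or ""
        _delay, _sep, rest = content.partition(";")
        key, _eq, url = rest.strip().partition("=")
        if key.strip().lower() == "url" and url:
            self.target = url.strip().strip("'\"")


def find_meta_refresh(html_text):
    """Return the meta refresh target URL, or None if the page has none."""
    finder = MetaRefreshFinder()
    finder.feed(html_text)
    return finder.target


page = '<html><head><meta http-equiv="refresh" content="0; url=https://example.com/final"></head></html>'
print(find_meta_refresh(page))
```

For the HTTP type, requests follows redirects by default, and the final address is available as response.url with the intermediate responses in response.history. JavaScript redirects need the script to be interpreted, which neither approach does.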

How to extract img src from web page via lxml in beautifulsoup using python?

空扰寡人 submitted on 2021-02-05 06:44:45
Question: I am new to Python and am working on a web scraping project on Amazon. I have a problem extracting the product img src from a product page via lxml using BeautifulSoup. I tried the following code, but it doesn't show the URL of the img. Here is my code:

    import requests
    from bs4 import BeautifulSoup
    import re
    url = 'https://www.amazon.com/crocs-Unisex-Classic-Black-Women/dp/B0014C0LSY/ref=sr_1_2?_encoding=UTF8&qid=1560091629&s=fashion-womens-intl-ship&sr=1-2&th=1&psc=1'
    r