lxml | 易学教程

Unknown encoding of files in a resulting Beautiful Soup txt file

Question: I downloaded 13,000 files (10-K reports from different companies) and I need to extract a specific part of each file (Section 1A, Risk Factors). I can open these files in Word easily and they look perfect, but when I open them in a plain text editor, each document appears to be HTML with tons of seemingly encrypted strings at the end (EDIT: I suspect this is due to the XBRL format of these files). The same happens when I use BeautifulSoup. I've tried using an online decoder, because

text extraction using python lxml looping issue

Question: Here is part of my XML file: - <a:p> - <a:pPr lvl="2"> - <a:spcBef> <a:spcPts val="200" /> </a:spcBef> </a:pPr> - <a:r> <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" /> <a:t>The</a:t> </a:r> - <a:r> <a:rPr lang="en-US" sz="1400" dirty="0" /> <a:t>world</a:t> </a:r> - <a:r> <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" /> <a:t>is small</a:t> </a:r> </a:p> - <a:p> - <a:pPr lvl="2"> - <a:spcBef> <a:spcPts val="200" /> </a:spcBef> </a:pPr> - <a:r> <a:rPr lang="en-US" sz="1400"
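
The excerpt is cut off before the actual looping problem is stated, so the following is only a minimal sketch of one common way to collect the text of each `<a:p>` paragraph separately: loop over the paragraphs first, then query the runs relative to each paragraph. It assumes the `a` prefix is bound to the DrawingML main namespace (the excerpt does not show the declaration), and the wrapper element and second paragraph in `SAMPLE` are made up for illustration. It uses the standard library's `xml.etree.ElementTree`, whose basic API `lxml.etree` mirrors.

```python
import xml.etree.ElementTree as ET

# Assumption: "a" is the DrawingML main namespace; the excerpt does not show it.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS = {"a": A_NS}

# Hypothetical sample shaped like the excerpt (wrapper element is invented).
SAMPLE = f"""<root xmlns:a="{A_NS}">
  <a:p>
    <a:r><a:t>The</a:t></a:r>
    <a:r><a:t>world</a:t></a:r>
    <a:r><a:t>is small</a:t></a:r>
  </a:p>
  <a:p>
    <a:r><a:t>Second</a:t></a:r>
    <a:r><a:t>paragraph</a:t></a:r>
  </a:p>
</root>"""


def paragraph_texts(xml_text):
    """Return one joined string per <a:p>, built from its <a:r>/<a:t> runs."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.iter(f"{{{A_NS}}}p"):
        # The path is relative to this paragraph, so runs from other
        # paragraphs are never mixed in.
        runs = [t.text or "" for t in p.findall("./a:r/a:t", NS)]
        paragraphs.append(" ".join(runs))
    return paragraphs


paragraphs = paragraph_texts(SAMPLE)
```

The key design choice is the relative path `./a:r/a:t`: a document-wide query inside the loop would return every run in the file on each iteration.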

Handling pagination in lxml

Question: I am trying to mirror a Ruby scraper that I wrote, but for a Python-only environment. I've decided to use lxml and requests to get this done. My problem is pagination: base_url = "http://example.com/something/?page=%s" for url in [base_url % i for i in xrange(10)]: r = requests.get(url) I'm new to Python and this library, so I'm not sure of the best way to perform the equivalent of this Ruby code: last_pg = (page.xpath("//div[contains(@class, 'b-tabs-utility')]").text.split('of ')[-1].split(' Results')[0]
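
A rough Python counterpart of the Ruby string handling might look like the sketch below. It covers only the text parsing and URL building, with no network access. The sample text `"Page 1 of 12 Results"` is an assumption about what the `b-tabs-utility` element contains, and `last_page_from_text` and `page_urls` are hypothetical helper names. In lxml, `xpath()` returns a list, so the text would first have to be taken from the first matching element.

```python
def last_page_from_text(text):
    """Pull the number between 'of ' and ' Results', as the Ruby code does."""
    raw = text.split("of ")[-1].split(" Results")[0]
    return int(raw.replace(",", ""))


def page_urls(base_url, last_page, first_page=1):
    """Build one URL per page, inclusive of the last page."""
    return [base_url % page for page in range(first_page, last_page + 1)]


# Assumed example of the element text; the real page may differ.
sample_text = "Page 1 of 12 Results"
last_pg = last_page_from_text(sample_text)
urls = page_urls("http://example.com/something/?page=%s", last_pg)
```

Each URL in `urls` could then be passed to `requests.get` in a loop, replacing the fixed `xrange(10)` from the question.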

Populating Python list using data obtained from lxml xpath command

Question: I'm reading instrument data from a specialty server that delivers the info in XML format. The code I've written is: from lxml import etree as ET xmlDoc = ET.parse('http://192.168.1.198/Bench_read.xml') print ET.tostring(xmlDoc, pretty_print=True) dmtCount = xmlDoc.xpath('//dmt') print(len(dmtCount)) dmtVal = [] for i in range(1, len(dmtCount)): dmtVal[i:0] = xmlDoc.xpath('./address/text()') dmtVal[i:1] = xmlDoc.xpath('./status/text()') dmtVal[i:2] = xmlDoc.xpath('./flow/text()') dmtVal[i:3] =
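
One way to build such a list is to iterate over the `dmt` elements themselves and read each field relative to the current element, appending one row per instrument. The sketch below assumes each `dmt` has `address`, `status`, and `flow` children (the names visible in the excerpt); the `bench` wrapper and the sample values are invented. It uses the standard library's `xml.etree.ElementTree` so it runs without a server or extra packages; the same loop shape applies to `lxml.etree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload; the real Bench_read.xml layout is not shown in full.
SAMPLE = """<bench>
  <dmt><address>1</address><status>OK</status><flow>12.5</flow></dmt>
  <dmt><address>2</address><status>FAULT</status><flow>0.0</flow></dmt>
</bench>"""

# Field names taken from the excerpt; the list is cut off there.
FIELDS = ("address", "status", "flow")


def read_dmt_rows(xml_text):
    """Return one [address, status, flow] list per <dmt> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for dmt in root.iter("dmt"):
        # findtext is evaluated against this <dmt>, not the whole document.
        rows.append([dmt.findtext(name) for name in FIELDS])
    return rows


dmt_val = read_dmt_rows(SAMPLE)
```

Iterating over the elements directly avoids index arithmetic such as `range(1, len(...))`, which would skip the first element, and `append` avoids slice assignment into an empty list.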

Website is up and running but parsing it results in HTTP Error 503

Question: I want to crawl a webpage using the urllib2 library and extract some information according to my needs. I am able to freely navigate the site (going from one link to another and so on), but when I try to parse it I get the error HTTP Error 503: Service Temporarily Unavailable. I searched about it on the net and found out that this error occurs when "web site's server is not available at that time". I am confused after reading this: if the website's server is down, then how come it's up and running (since I

Return result from arbitrarily nested xml tree sum

Question: I have the following code that recurses(?) over an XML tree, which represents a simple equation: root = etree.XML(request.data['expression']) def addleafnodes(root): numbers = [] for child in root: if root.tag != "root" and root.tag != "expression": print(root.tag, child.text) if child.tag != "add" and child.tag != "multiply": numbers.append(int(child.text)) print("NUMBERS", numbers) elif child.tag == "add": numbers.append(np.sum(addleafnodes(child))) print("NUMBERS", numbers) elif child.tag
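
A recursive evaluator for this kind of tree can return a single number from every call instead of accumulating a list. The sketch below is one possible shape, not the asker's code: it assumes `add` and `multiply` are the only operators, that leaves hold integer text (the leaf tag `number` is invented), and that wrapper elements such as `root` and `expression` have exactly one child. It uses the standard library's `xml.etree.ElementTree` and `math.prod` (Python 3.8+) rather than lxml and NumPy.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical expression: 1 + (2 * 3) + (4 + 5)
SAMPLE = """<root><expression>
  <add>
    <number>1</number>
    <multiply><number>2</number><number>3</number></multiply>
    <add><number>4</number><number>5</number></add>
  </add>
</expression></root>"""


def evaluate(node):
    """Return the integer value of the subtree rooted at node."""
    if node.tag == "add":
        return sum(evaluate(child) for child in node)
    if node.tag == "multiply":
        return math.prod(evaluate(child) for child in node)
    if len(node) == 0:
        # Leaf: any childless element is treated as an integer literal.
        return int(node.text)
    # Wrapper (assumed: root/expression) with exactly one child.
    children = list(node)
    if len(children) != 1:
        raise ValueError(f"unexpected element <{node.tag}>")
    return evaluate(children[0])


result = evaluate(ET.fromstring(SAMPLE))
```

Because every call returns one value, nesting depth does not matter: each operator only combines the results of its direct children.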

RuntimeWarning: compiletime version 2.6 of module 'lxml.etree' does not match runtime version 2.7

Question: I am using Python 2.7 and I am trying to use lxml, but when I try using lxml.etree, I get this error: RuntimeWarning: compiletime version 2.6 of module 'lxml.etree' does not match runtime version 2.7 And then this error: File "lxml.etree.pyx", line 123, in init lxml.etree (src/lxml/lxml.etree.c:160385) TypeError: encode() argument 1 must be string without null bytes, not unicode I have tried installing with both easy_install and pip. After installing, I see this message: Installed /usr/lib

Save troublesome webpage and import back into Python

Question: I am trying to extract some information from a variety of pages and am struggling a bit. This shows my challenge: import requests from lxml import html url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary" response = requests.get(url) print(response.content) If you copy the output into Notepad, you cannot find the value "9.20" anywhere in it (the Team A odds in the bottom right of the webpage). However, if you open the webpage, do a Save-As and then import it back into Python

installing lxml on 64 bit windows

Question: So I'm trying to install lxml on my machine, and I can't seem to get it to work. I've got Windows 8.1 64-bit and Python 3.5. I've used both pip install lxml and easy_install lxml. I keep getting this error message: C:\Users\jgarber\Downloads>pip install readability-lxml --upgrade Requirement already up-to-date: readability-lxml in c:\python\lib\site-packages\ readability_lxml-0.6.2-py3.5.egg Requirement already up-to-date: chardet in c:\python\lib\site-packages (from readability-lxml)

Lost in XML and Python

Question: Hi, I have started learning Python and want to use it to work with an XML file. I have been looking for information on the best course to follow, but frankly I got a little lost. There are so many ways of manipulating XML files, like ElementTree, lxml, minidom, etc. Could someone point me in the right direction, or point me to some code I can wrap my head around? I have started experimenting with lxml but haven't gotten any further than printing all elements yet. Here is