lxml

Unknown encoding of files in a resulting Beautiful Soup txt file

ⅰ亾dé卋堺 posted on 2019-12-11 06:35:33
Question: I downloaded 13,000 files (10-K reports from different companies) and I need to extract a specific part of each one (Section 1A, Risk Factors). The problem is that I can open these files in Word easily and they look perfect, but when I open them in a plain text editor, each document appears to be HTML with a large block of seemingly encrypted strings at the end (EDIT: I suspect this is due to the XBRL format of these files). The same thing happens in the output of BeautifulSoup. I've tried using an online decoder, because
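One possible explanation, not confirmed by the excerpt: if these are EDGAR full-submission text files, each file bundles several `<DOCUMENT>` blocks, and the unreadable tail would be encoded attachments (graphics, archives) rather than part of the report. Under that assumption, a minimal sketch for keeping only the 10-K block before handing it to BeautifulSoup or lxml might look like this. The function name `main_document` and the sample text are hypothetical.

```python
import re

# Assumed layout: <DOCUMENT> ... <TYPE>10-K ... </DOCUMENT> blocks in one file.
DOC_RE = re.compile(r"<DOCUMENT>(.*?)</DOCUMENT>", re.DOTALL | re.IGNORECASE)
TYPE_RE = re.compile(r"<TYPE>([^\s<]+)", re.IGNORECASE)


def main_document(submission_text, wanted_type="10-K"):
    """Return the body of the first block whose <TYPE> equals wanted_type, else None."""
    for match in DOC_RE.finditer(submission_text):
        block = match.group(1)
        type_match = TYPE_RE.search(block)
        if type_match and type_match.group(1).upper() == wanted_type:
            return block
    return None


# Hypothetical two-block file: the report itself plus an encoded graphic.
sample = (
    "<DOCUMENT>\n<TYPE>10-K\n<TEXT>\n<html>Item 1A. Risk Factors</html>\n</TEXT>\n</DOCUMENT>\n"
    "<DOCUMENT>\n<TYPE>GRAPHIC\n<TEXT>\nbegin 644 logo.jpg\nM_]C_X\nend\n</TEXT>\n</DOCUMENT>\n"
)
report = main_document(sample)
```

If the files turn out to have a different layout, the regular expressions above would need adjusting; nothing here decodes the attachments, it only skips them.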

text extraction using python lxml looping issue

只谈情不闲聊 posted on 2019-12-11 06:27:32
Question: Here is part of my XML file: <a:p> <a:pPr lvl="2"> <a:spcBef> <a:spcPts val="200" /> </a:spcBef> </a:pPr> <a:r> <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" /> <a:t>The</a:t> </a:r> <a:r> <a:rPr lang="en-US" sz="1400" dirty="0" /> <a:t>world</a:t> </a:r> <a:r> <a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0" /> <a:t>is small</a:t> </a:r> </a:p> <a:p> <a:pPr lvl="2"> <a:spcBef> <a:spcPts val="200" /> </a:spcBef> </a:pPr> <a:r> <a:rPr lang="en-US" sz="1400"
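The excerpt cuts off before the actual looping problem is described, so the following is only a sketch of one way to collect the `a:t` text per `a:p` paragraph. It assumes the `a:` prefix is bound to the DrawingML namespace, and it uses the ElementTree-compatible `iter`/`findall` calls so the same code runs with lxml or with the standard library fallback. The wrapper element `root`, the second sample paragraph, and the function name `paragraph_texts` are hypothetical. A frequent pitfall in this kind of loop is searching from the document root inside the loop instead of from the current paragraph.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library's compatible API
    import xml.etree.ElementTree as etree

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
NSMAP = {"a": A_NS}

SAMPLE = f"""<root xmlns:a="{A_NS}">
  <a:p>
    <a:pPr lvl="2"><a:spcBef><a:spcPts val="200"/></a:spcBef></a:pPr>
    <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/><a:t>The</a:t></a:r>
    <a:r><a:rPr lang="en-US" sz="1400" dirty="0"/><a:t>world</a:t></a:r>
    <a:r><a:rPr lang="en-US" sz="1400" dirty="0" smtClean="0"/><a:t>is small</a:t></a:r>
  </a:p>
  <a:p>
    <a:r><a:rPr lang="en-US" sz="1400"/><a:t>Second paragraph</a:t></a:r>
  </a:p>
</root>"""


def paragraph_texts(xml_text):
    """Return one joined string per a:p element."""
    root = etree.fromstring(xml_text)
    paragraphs = []
    for p in root.iter(f"{{{A_NS}}}p"):
        # Search relative to this paragraph only, not the whole document.
        runs = [t.text or "" for t in p.findall("a:r/a:t", NSMAP)]
        paragraphs.append(" ".join(runs))
    return paragraphs


texts = paragraph_texts(SAMPLE)
```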

Handling pagination in lxml

此生再无相见时 posted on 2019-12-11 06:15:32
Question: I am trying to mirror a Ruby scraper that I wrote, but for a Python-only environment. I've decided to use lxml and requests to get this done. My problem is pagination: base_url = "http://example.com/something/?page=%s" for url in [base_url % i for i in xrange(10)]: r = requests.get(url) I'm new to Python and this library, so I'm not sure of the best way to write the equivalent of this Ruby code: last_pg = (page.xpath("//div[contains(@class, 'b-tabs-utility')]").text.split('of ')[-1].split(' Results')[0]
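The string handling in the Ruby line translates almost directly to Python. The sketch below covers only that part and the URL generation; it assumes the extracted number is the last page index (the truncated Ruby line does not confirm this) and that pages start at 1 rather than 0. The helper names and the sample text are hypothetical.

```python
BASE_URL = "http://example.com/something/?page=%s"  # from the question


def last_page_from_text(utility_text):
    """Mirror the Ruby chain: keep what follows 'of ' and precedes ' Results'."""
    return int(utility_text.split("of ")[-1].split(" Results")[0])


def page_urls(base_url, last_pg, first_pg=1):
    """Build one URL per page, from first_pg through last_pg inclusive."""
    return [base_url % page for page in range(first_pg, last_pg + 1)]


# With lxml, the text would come from something like:
#   tree.xpath("//div[contains(@class, 'b-tabs-utility')]")[0].text_content()
sample_text = "Showing 1 of 12 Results"  # hypothetical wording
urls = page_urls(BASE_URL, last_page_from_text(sample_text))
```

Each URL in `urls` can then be passed to `requests.get` in a loop. Note that `xrange` exists only in Python 2; `range` is the Python 3 equivalent.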

Populating Python list using data obtained from lxml xpath command

本小妞迷上赌 posted on 2019-12-11 05:41:36
Question: I'm reading instrument data from a specialty server that delivers the data in XML format. The code I've written is: from lxml import etree as ET xmlDoc = ET.parse('http://192.168.1.198/Bench_read.xml') print ET.tostring(xmlDoc, pretty_print=True) dmtCount = xmlDoc.xpath('//dmt') print(len(dmtCount)) dmtVal = [] for i in range(1, len(dmtCount)): dmtVal[i:0] = xmlDoc.xpath('./address/text()') dmtVal[i:1] = xmlDoc.xpath('./status/text()') dmtVal[i:2] = xmlDoc.xpath('./flow/text()') dmtVal[i:3] =
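A sketch of one way to build the list, assuming `address`, `status`, and `flow` are direct children of each `dmt` element (the excerpt does not show the XML, so the sample document and the names `read_dmt_rows` and `FIELDS` are hypothetical). It loops over the `dmt` elements themselves and reads each field relative to the current element, instead of querying the whole document and assigning into slices. With lxml, `dmt.xpath("./address/text()")` on each element would be a comparable relative lookup.

```python
try:
    from lxml import etree as ET
except ImportError:  # fall back to the standard library's compatible API
    import xml.etree.ElementTree as ET

# Hypothetical stand-in for the server's response.
SAMPLE = """<bench>
  <dmt><address>1</address><status>OK</status><flow>12.5</flow></dmt>
  <dmt><address>2</address><status>FAULT</status><flow>0.0</flow></dmt>
</bench>"""

FIELDS = ("address", "status", "flow")


def read_dmt_rows(root):
    """Return one [address, status, flow] list per dmt element."""
    rows = []
    for dmt in root.iter("dmt"):
        # findtext looks only at this dmt's children.
        rows.append([dmt.findtext(field) for field in FIELDS])
    return rows


rows = read_dmt_rows(ET.fromstring(SAMPLE))
```

Iterating over the elements directly also avoids the `range(1, len(...))` loop in the question, which would skip the first `dmt` entry.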

Website is up and running but parsing it results in HTTP Error 503

狂风中的少年 posted on 2019-12-11 05:38:06
Question: I want to crawl a webpage using the urllib2 library and extract some information from it. I am able to freely navigate the site (going from one link to another and so on), but when I try to parse it I get an error: HTTP Error 503: Service Temporarily Unavailable. I searched online and found that this error occurs when the "web site's server is not available at that time". This confuses me: if the website's server is down, then how come it is up and running (since I
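One common cause of this symptom, though nothing in the excerpt confirms it for this site, is a server that rejects the default Python client identification while serving browsers normally. A minimal sketch of sending an explicit `User-Agent` header follows. It uses `urllib.request`, the Python 3 successor to `urllib2`; the URL, the header value, and the function names are hypothetical placeholders.

```python
import urllib.error
import urllib.request

URL = "http://example.com/"  # hypothetical placeholder
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"}  # hypothetical value


def build_request(url, headers=HEADERS):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers=headers)


def fetch(url):
    """Return the response body, or None if the server answers with an HTTP error."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        print("HTTP error", err.code, err.reason)
        return None
```

If the 503 persists with a browser-like header, the cause is likely something else, such as rate limiting or a JavaScript challenge page.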

Return result from arbitrarily nested xml tree sum

可紊 posted on 2019-12-11 05:23:18
Question: I have the following code that recurses(?) over an XML tree, which represents a simple equation: root = etree.XML(request.data['expression']) def addleafnodes(root): numbers = [] for child in root: if root.tag != "root" and root.tag != "expression": print(root.tag, child.text) if child.tag != "add" and child.tag != "multiply": numbers.append(int(child.text)) print("NUMBERS", numbers) elif child.tag == "add": numbers.append(np.sum(addleafnodes(child))) print("NUMBERS", numbers) elif child.tag
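A sketch of a recursive evaluator that returns a single value instead of accumulating a list. It assumes `add` and `multiply` are the only operators, that leaf elements hold integers, and that wrapper elements such as `root` and `expression` contain exactly one child. The leaf tag `number`, the sample expression, and the function name `evaluate` are hypothetical. It relies on `math.prod`, which requires Python 3.8 or later.

```python
import math

try:
    from lxml import etree
except ImportError:  # fall back to the standard library's compatible API
    import xml.etree.ElementTree as etree


def evaluate(node):
    """Recursively reduce an expression tree to one integer."""
    if node.tag == "add":
        return sum(evaluate(child) for child in node)
    if node.tag == "multiply":
        return math.prod(evaluate(child) for child in node)
    if len(node) == 0:
        # Leaf element: its text is the number itself.
        return int(node.text)
    # Wrapper element (assumed to hold a single operation or number).
    return evaluate(node[0])


# Hypothetical input: 1 + (2 * 3) + 4
SAMPLE = (
    "<root><expression><add>"
    "<number>1</number>"
    "<multiply><number>2</number><number>3</number></multiply>"
    "<number>4</number>"
    "</add></expression></root>"
)
result = evaluate(etree.XML(SAMPLE))
```

Returning a value from every branch keeps each level of recursion independent, so nesting depth does not matter.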

RuntimeWarning: compiletime version 2.6 of module 'lxml.etree' does not match runtime version 2.7

让人想犯罪 __ posted on 2019-12-11 04:49:40
Question: I am using Python 2.7 and I am trying to use lxml, but when I try to use lxml.etree, I get this warning: RuntimeWarning: compiletime version 2.6 of module 'lxml.etree' does not match runtime version 2.7 And then this error: File "lxml.etree.pyx", line 123, in init lxml.etree (src/lxml/lxml.etree.c:160385) TypeError: encode() argument 1 must be string without null bytes, not unicode I have tried installing with both easy_install and pip. After installing, I see this message: Installed /usr/lib

Save troublesome webpage and import back into Python

强颜欢笑 posted on 2019-12-11 04:47:49
Question: I am trying to extract some information from a variety of pages and am struggling a bit. This shows my challenge: import requests from lxml import html url = "https://www.soccer24.com/match/C4RB2hO0/#match-summary" response = requests.get(url) print(response.content) If you copy the output into Notepad, you cannot find the value "9.20" anywhere in it (the Team A odds in the bottom right of the webpage). However, if you open the webpage, do a Save As, and then import it back into Python

installing lxml on 64 bit windows

纵然是瞬间 posted on 2019-12-11 04:44:31
Question: I'm trying to install lxml on my machine, and I can't seem to get it to work. I have Windows 8.1 64-bit and Python 3.5. I've tried both pip install lxml and easy_install lxml. I keep getting this error message: C:\Users\jgarber\Downloads>pip install readability-lxml --upgrade Requirement already up-to-date: readability-lxml in c:\python\lib\site-packages\readability_lxml-0.6.2-py3.5.egg Requirement already up-to-date: chardet in c:\python\lib\site-packages (from readability-lxml)

Lost in XML and Python

泄露秘密 posted on 2019-12-11 04:44:23
Question: Hi, I have started learning Python and want to use it to work with an XML file. I have been looking for information on the best approach to follow, but frankly I got a little lost. There are so many ways of manipulating XML files, such as ElementTree, lxml, minidom, and so on. Could someone point me in the right direction, or point me to some code I can wrap my head around? I have started experimenting with lxml but haven't gotten any further than printing all elements yet. Here is
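As a possible starting point, here is a small sketch of the three basic steps (walk the tree, change something, serialize the result). It sticks to calls that lxml shares with the standard library's ElementTree, so it runs with either. The sample document and the `checked` attribute are hypothetical.

```python
try:
    from lxml import etree as ET
except ImportError:  # fall back to the standard library's compatible API
    import xml.etree.ElementTree as ET

# Hypothetical input document.
SAMPLE = (
    "<library>"
    "<book id='1'><title>Dune</title></book>"
    "<book id='2'><title>Emma</title></book>"
    "</library>"
)

root = ET.fromstring(SAMPLE)

# 1. Walk every element in document order.
tags = [element.tag for element in root.iter()]

# 2. Read and modify: collect titles, then mark each book with a new attribute.
titles = [book.findtext("title") for book in root.findall("book")]
for book in root.findall("book"):
    book.set("checked", "yes")

# 3. Serialize the modified tree back to bytes.
serialized = ET.tostring(root)
```

Because lxml implements the ElementTree API and adds features such as full XPath on top, code written this way can move between the two libraries with only the import changed.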