lxml

Python, lxml and XPath - HTML table parsing

Submitted by 坚强是说给别人听的谎言 on 2019-12-03 13:38:53
Question: I'm new to lxml, quite new to Python, and could not find a solution to the following: I need to import a few tables with 3 columns and an undefined number of rows, starting at row 3. When the second column of any row is empty, that row is discarded and processing of the table is aborted. The following code prints the table's data fine (but I'm unable to reuse the data afterwards):

    from lxml.html import parse

    def process_row(row):
        for cell in row.xpath('./td'):
            print cell.text_content()
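A minimal sketch (Python 3, so print() rather than the Python 2 print statement above) of one way to collect the data for reuse instead of printing it, stopping as soon as a row's second column is empty. The file name and the table/row XPath are assumptions; the real page may nest rows inside tbody:

    from lxml.html import parse

    doc = parse('table.html')  # hypothetical input file
    rows = []
    # Start at row 3, as the question requires
    for row in doc.xpath('//table[1]/tr[position() >= 3]'):
        cells = [td.text_content().strip() for td in row.xpath('./td')]
        # Abort this table as soon as the second column is empty
        if len(cells) < 2 or not cells[1]:
            break
        rows.append(cells)
    # rows now holds [col1, col2, col3] lists that can be reused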

Python BeautifulSoup equivalent to lxml make_links_absolute

Submitted by 纵然是瞬间 on 2019-12-03 13:01:24
So lxml has a very handy feature, make_links_absolute:

    doc = lxml.html.fromstring(some_html_page)
    doc.make_links_absolute(url_for_some_html_page)

and all the links in doc are now absolute. Is there an easy equivalent in BeautifulSoup, or do I simply need to pass each href through urlparse and normalize it myself:

    soup = BeautifulSoup(some_html_page)
    for tag in soup.findAll('a', href=True):
        url_data = urlparse(tag['href'])
        if url_data[0] == "":
            full_url = url_for_some_html_page + tag['href']

Answer (Chris Morgan): In my answer to "What is a simple way to extract the list of URLs on a webpage using python?" I covered that
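A rough BeautifulSoup equivalent, sketched with urljoin, which handles relative paths, fragments, and already-absolute URLs more robustly than string concatenation; some_html_page and url_for_some_html_page are the question's own placeholders:

    from urllib.parse import urljoin  # Python 3; urlparse.urljoin in Python 2
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(some_html_page, 'html.parser')
    for tag in soup.find_all('a', href=True):
        # Resolve each href against the page's base URL
        tag['href'] = urljoin(url_for_some_html_page, tag['href'])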

Is there a switch to ignore undefined namespace prefixes in LXML?

Submitted by ☆樱花仙子☆ on 2019-12-03 12:45:32
Question: I'm parsing a non-compliant XML file (Sphinx's xmlpipe2 format) and would like the lxml parser to ignore the fact that there are unresolved namespace prefixes. An example of the Sphinx XML:

    <sphinx:schema>
        <sphinx:field name="subject"/>
        <sphinx:field name="content"/>
        <sphinx:attr name="published" type="timestamp"/>
        <sphinx:attr name="author_id" type="int" bits="16" default="1"/>
    </sphinx:schema>

I'm aware of passing a parser keyword option to try and recover broken XML, e.g. parser = etree
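A sketch of the recovering-parser option the question alludes to: with recover=True, lxml keeps going past the undefined sphinx: prefix instead of raising an error (it is worth checking how the prefix ends up represented in the resulting tags). The variable holding the raw XML is an assumption:

    from lxml import etree

    parser = etree.XMLParser(recover=True)
    tree = etree.fromstring(sphinx_xml_bytes, parser=parser)  # sphinx_xml_bytes: raw bytes of the file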

Multiple XML Namespaces in tag with LXML

Submitted by 非 Y 不嫁゛ on 2019-12-03 11:41:55
I am trying to use Python's lxml library to create a GPX file that can be read by Garmin's MapSource product. The header on their GPX files looks like this:

    <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
    <gpx xmlns="http://www.topografix.com/GPX/1/1" creator="MapSource 6.15.5" version="1.1"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.topografix.com/GPX/1/1 http://www.topografix.com/GPX/1/1/gpx.xsd">

When I use the following code:

    xmlns = "http://www.topografix.com/GPX/1/1"
    xsi = "http://www.w3.org/2001/XMLSchema-instance"
    schemaLocation = "http:/
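A sketch of how this header is typically produced with lxml: declare both namespaces through nsmap when creating the root element, and set xsi:schemaLocation as a namespace-qualified attribute:

    from lxml import etree

    xmlns = "http://www.topografix.com/GPX/1/1"
    xsi = "http://www.w3.org/2001/XMLSchema-instance"
    schema_location = xmlns + " http://www.topografix.com/GPX/1/1/gpx.xsd"

    root = etree.Element(
        "{%s}gpx" % xmlns,
        attrib={"{%s}schemaLocation" % xsi: schema_location},
        nsmap={None: xmlns, "xsi": xsi},  # default namespace plus the xsi prefix
    )
    root.set("creator", "MapSource 6.15.5")
    root.set("version", "1.1")
    print(etree.tostring(root, xml_declaration=True, encoding="UTF-8",
                         standalone=False, pretty_print=True).decode("utf-8"))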

Parsing HTML table using Python - HTMLParser or lxml

Submitted by 此生再无相见时 on 2019-12-03 11:32:23
I have an HTML page that consists of a table, and I want to fetch all the values in the td and tr elements of that table. I have tried working with BeautifulSoup, but now I want to work with lxml or HTMLParser in Python. I have attached an example. I want to fetch the values as lists of tuples, like:

    [ [ (value of 2050 Jan, value of main subject - part1 - sub part1 - subject1),
        (value of 2050 Feb, value of main subject - part1 - sub part1 - subject1), ... ],
      [ (value of 2050 Jan, value of main subject - part1 - sub part1 - subject2),
        (value of 2050 Feb, value of main subject - part1 - sub part1 - subject2), ... ] ]

and so on. Can anyone let
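Without the exact markup it is hard to be precise, but a rough sketch of the usual lxml pattern (read every row, then pair each data cell with its column header) looks like this; the file name is hypothetical, and the real page's nested multi-level headers would need extra handling:

    import lxml.html

    doc = lxml.html.parse('page.html')  # hypothetical input file
    table = doc.xpath('//table')[0]
    # One list of cell texts per row; header rows may use th instead of td
    rows = [[cell.text_content().strip() for cell in tr.xpath('./td|./th')]
            for tr in table.xpath('.//tr')]
    headers, data = rows[0], rows[1:]
    # Pair each value with its column header: one list of tuples per data row
    result = [list(zip(headers, row)) for row in data]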

Parsing UTF-8/unicode strings with lxml HTML

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-03 11:00:44
Question: I have been trying to parse text encoded as UTF-8 with etree.HTML(), without success.

    → python
    Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05)
    [GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from lxml import etree
    >>> import requests
    >>> headers = {'User-Agent': "Opera/9.80 (Macintosh; Intel Mac OS X 10.8.0) Presto/2.12.363 Version/12.50"}
    >>> r = requests.get("http://www.rakuten.co.jp/
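A sketch of the usual fix: feed lxml the raw bytes (r.content) rather than an already-decoded unicode string, or state the encoding explicitly on the parser. A placeholder URL is used because the original snippet is truncated:

    from lxml import etree
    import requests

    r = requests.get('http://example.com/')  # hypothetical URL
    # Tell the parser the encoding explicitly and hand it bytes, not text
    parser = etree.HTMLParser(encoding='utf-8')
    tree = etree.HTML(r.content, parser=parser)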

Set lxml as default BeautifulSoup parser

Submitted by 百般思念 on 2019-12-03 10:55:56
I'm working on a web scraping project and have run into problems with speed. To try to fix it, I want to use lxml instead of html.parser as BeautifulSoup's parser. I've been able to do this:

    soup = bs4.BeautifulSoup(html, 'lxml')

but I don't want to have to repeatedly type 'lxml' every time I call BeautifulSoup. Is there a way I can set which parser to use once at the beginning of my program? According to the "Specifying the parser to use" documentation page: "The first argument to the BeautifulSoup constructor is a string or an open filehandle: the markup you want parsed. The second argument is
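bs4 itself has no global default-parser switch, so a common workaround is a small wrapper that bakes the parser choice in, for example with functools.partial:

    import functools
    import bs4

    # Every call through this name now defaults to the lxml parser
    BeautifulSoup = functools.partial(bs4.BeautifulSoup, features='lxml')

    soup = BeautifulSoup(html)  # equivalent to bs4.BeautifulSoup(html, 'lxml')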

lxml: cssselect(): AttributeError: 'lxml.etree._Element' object has no attribute 'cssselect'

Submitted by Anonymous (unverified) on 2019-12-03 10:09:14
Question: Can someone explain why the first call to root.cssselect() works, while the second fails?

    from lxml.html import fromstring
    from lxml import etree

    html = '<html><a href="http://example.com">example</a></html'

    root = fromstring(html)
    print 'via fromstring', repr(root)  # via fromstring <Element html at 0x...>
    print root.cssselect("a")

    root2 = etree.HTML(html)
    print 'via etree.HTML()', repr(root2)  # via etree.HTML() <Element html at 0x...>
    root2.cssselect("a")  # --> Exception

I get:

    Traceback (most recent call last):
      File "/home/foo_eins_d/src/foo
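The short explanation: lxml.html.fromstring() builds HtmlElement objects, which carry the cssselect() convenience method, while etree.HTML() builds plain _Element objects, which do not. A sketch of running a CSS selector against a plain element anyway, via lxml.cssselect:

    from lxml import etree
    from lxml.cssselect import CSSSelector

    root2 = etree.HTML('<html><a href="http://example.com">example</a></html>')
    sel = CSSSelector('a')
    print(sel(root2))  # CSSSelector works on plain _Element objects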

Multithreading for faster downloading

Submitted by 与世无争的帅哥 on 2019-12-03 09:09:54
How can I download multiple links simultaneously? My script below works, but it only downloads one file at a time and is extremely slow. I can't figure out how to incorporate multithreading into my script. The Python script:

    from BeautifulSoup import BeautifulSoup
    import lxml.html as html
    import urlparse
    import os, sys
    import urllib2
    import re

    print ("downloading and parsing Bibles...")
    root = html.parse(open('links.html'))
    for link in root.findall('//a'):
        url = link.get('href')
        name = urlparse.urlparse(url).path.split('/')[-1]
        dirname = urlparse.urlparse(url).path.split('.')[-1]
        f = urllib2.urlopen
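A minimal sketch (Python 3, so urllib2/urlparse become urllib.request/urllib.parse) of parallelizing the downloads with concurrent.futures; download_one is a hypothetical helper standing in for the script's truncated per-link logic:

    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urlparse
    from urllib.request import urlopen
    import lxml.html as html

    def download_one(url):
        # Save each file under the last component of the URL path
        name = urlparse(url).path.split('/')[-1]
        with urlopen(url) as resp, open(name, 'wb') as out:
            out.write(resp.read())

    root = html.parse('links.html')
    urls = [link.get('href') for link in root.xpath('//a[@href]')]
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Threads overlap the network waits, so downloads run concurrently
        list(pool.map(download_one, urls))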

Scraping new ESPN site using xpath [Python]

Submitted by 依然范特西╮ on 2019-12-03 09:08:01
I am trying to scrape the new ESPN NBA scoreboard. Here is a simple script that should return the start times for all games on 4/5/15:

    import requests
    import lxml.html
    from lxml.cssselect import CSSSelector

    doc = lxml.html.fromstring(requests.get('http://scores.espn.go.com/nba/scoreboard?date=20150405').text)

    # XPath
    print doc.xpath("//title/text()")  # print page title
    print doc.xpath("//span/@time")
    print doc.xpath("//span[@class='time']")
    print doc.xpath("//span[@class='time']/text()")

    # CSS Selector
    sel = CSSSelector('span.time')
    for i in sel(doc):
        print i.text

It doesn't return anything, but
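The likely cause: the scoreboard is rendered client-side by JavaScript, so the span.time elements never exist in the raw HTML that requests fetches. One common workaround is to pull the JSON blob the page embeds in a script tag; the variable name matched below is an assumption about the page's internals at the time and may have changed:

    import json
    import re
    import requests

    page = requests.get('http://scores.espn.go.com/nba/scoreboard?date=20150405').text
    # Look for an embedded JSON assignment such as window.espn.scoreboardData = {...};
    match = re.search(r'window\.espn\.scoreboardData\s*=\s*(\{.*?\});', page, re.DOTALL)
    if match:
        data = json.loads(match.group(1))
        # Inspect the parsed structure for the game start times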