lxml

Extracting information from a table on a website using python, LXML & XPATH

爱⌒轻易说出口 submitted on 2019-12-13 03:30:48
Question: After a lot of hard work I managed to extract some of the information I needed from a table on this website: http://gbgfotboll.se/serier/?scr=table&ftid=57108 From the table "Kommande Matcher" (the second table) I managed to extract the dates and the team names. But now I am totally stuck trying to extract the following from the first table: the first column ("Lag"), the second column ("S"), the 6th column ("GM-IM"), and the last column ("P"). Any ideas? Thanks.
Answer 1: I've just done it:
    from io import BytesIO
    import urllib2 as net
    from
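
The column picks the question lists (first, second, 6th, last) map onto positional path predicates. Below is a minimal sketch, not the truncated answer above: it assumes a small, well-formed stand-in table with invented rows and uses the standard library's `xml.etree.ElementTree`, whose limited path syntax supports `td[1]` and `td[last()]`. The real page would need an HTML parser such as `lxml.html`, where the same predicates are valid XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the league table; every row is invented.
TABLE = """
<table>
  <tr><td>Team A</td><td>10</td><td>7</td><td>2</td><td>1</td><td>20-8</td><td>12</td><td>23</td></tr>
  <tr><td>Team B</td><td>10</td><td>5</td><td>3</td><td>2</td><td>15-11</td><td>4</td><td>18</td></tr>
</table>
"""


def extract_columns(markup):
    """Return (first, second, sixth, last) cell text for every row."""
    table = ET.fromstring(markup)
    rows = []
    for tr in table.findall("tr"):
        rows.append((
            tr.find("td[1]").text,       # first column ("Lag" in the question)
            tr.find("td[2]").text,       # second column ("S")
            tr.find("td[6]").text,       # 6th column ("GM-IM")
            tr.find("td[last()]").text,  # last column ("P")
        ))
    return rows


rows = extract_columns(TABLE)
for row in rows:
    print(row)
```

Using `last()` rather than a fixed index keeps the final pick correct if the table gains or loses a column.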

Unable to get the full content using selector

做~自己de王妃 submitted on 2019-12-13 03:29:49
Question: I've written some selectors in Python to get some items and their values. I wish to scrape the items, not to style them. However, when I run my script, it only gets the item names and can't reach the values of those items, which are separated from them by a "br" tag. How can I grab them? I do not wish to use XPath in this particular case. Thanks in advance. Here are the elements:
    html = '''
    <div class="elems"><br>
    <ul>
    <li><b>Item Name:</b><br> titan </li>
    <li><b>Item No:</b><br> 23003400 </li>
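
One way to think about the markup above: the value is not inside any tag, it is the text that trails the `<br>`. In ElementTree-style APIs (both the standard library and lxml) that trailing text is exposed as the element's `.tail`. The sketch below assumes a closed, well-formed copy of the fragment (self-closing `<br/>`, invented closing tags) so that the standard library parser accepts it, and it uses plain tag lookups rather than XPath expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed version of the fragment; the closing tags are assumed.
FRAGMENT = """
<div class="elems"><br/>
<ul>
<li><b>Item Name:</b><br/> titan </li>
<li><b>Item No:</b><br/> 23003400 </li>
</ul>
</div>
"""


def items_with_values(markup):
    """Map each bold label to the text that follows the <br/> in the same <li>."""
    root = ET.fromstring(markup)
    result = {}
    for li in root.iter("li"):
        label = li.find("b").text.strip().rstrip(":")
        # .tail holds the text between the end of <br/> and the next tag.
        value = (li.find("br").tail or "").strip()
        result[label] = value
    return result


items = items_with_values(FRAGMENT)
print(items)
```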

Parse large python xml using xmltree

时间秒杀一切 submitted on 2019-12-13 03:20:37
Question: I have a Python script that parses huge XML files (the largest one is 446 MB):
    try:
        parser = etree.XMLParser(encoding='utf-8')
        tree = etree.parse(os.path.join(srcDir, fileName), parser)
        root = tree.getroot()
    except Exception, e:
        print "Error parsing file "+str(fileName) + " Reason "+str(e.message)
    for child in root:
        if "PersonName" in child.tag:
            personName = child.text
This is what my XML looks like:
    <?xml version="1.0" encoding="utf-8"?>
    <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema
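
`etree.parse` builds the whole tree before the loop starts, which is costly for a 446 MB file. A common alternative is incremental parsing with `iterparse`. The sketch below is an assumption-laden illustration, not the accepted answer: it uses the standard library's `xml.etree.ElementTree.iterparse` on a tiny invented document (`lxml.etree` offers a function of the same name), and keeps the question's `"PersonName" in tag` test.

```python
import io
import xml.etree.ElementTree as ET

# Small invented document standing in for the 446 MB file.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<MyRoot>
  <PersonName>Alice</PersonName>
  <Other>ignored</Other>
  <PersonName>Bob</PersonName>
</MyRoot>"""


def iter_person_names(source):
    """Yield the text of each element whose tag contains 'PersonName'."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if "PersonName" in elem.tag:
            yield elem.text
        # Release the element's text and children once it has been seen.
        # The emptied element itself still stays attached to its parent.
        elem.clear()


names = list(iter_person_names(io.BytesIO(SAMPLE)))
print(names)
```

For a file on disk, a path or an open binary file object can be passed in place of the `io.BytesIO` wrapper.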

Trouble Installing Correct Version of lxml (with updated context)

耗尽温柔 submitted on 2019-12-13 02:59:05
Question: I posted this question earlier, but I now have more information that may help resolve the issue. I checked which Python platform I have using:
    import platform
    platform.architecture()
It reads:
    ('64bit', 'WindowsPE')
I still haven't found a way to get the correct version of lxml, since I'm still getting an error when I import the module, which reads:
    File "L:\Code\Scripts\YelpScraper.py", line 1, in <module>
        from lxml import html
    File "L:\Code\Scripts\lxml\html\__init__.py",
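
A compiled package such as lxml has to match both the interpreter's bitness and its Python version, so it can help to collect those facts in one place, along with where a given module would be imported from. The helpers below are hypothetical names written for this illustration and use only the standard library; they diagnose the environment and do not install anything.

```python
import importlib.util
import platform
import struct
import sys


def interpreter_summary():
    """Collect the facts a binary package build has to match."""
    return {
        "bits": struct.calcsize("P") * 8,             # pointer size of this interpreter
        "architecture": platform.architecture()[0],   # e.g. '64bit'
        "version": "%d.%d" % sys.version_info[:2],    # e.g. '3.11'
    }


def module_location(name):
    """Return the file a top-level module would be loaded from, or None if not found."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


summary = interpreter_summary()
print(summary)
print(module_location("lxml"))
```

In the traceback above, lxml is being loaded from a folder next to the script (`L:\Code\Scripts\lxml`), which is the kind of detail `module_location` would surface.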

Getting data from broken xml in Python

陌路散爱 submitted on 2019-12-13 02:55:38
Question: I would like to get data from XML, but its structure seems to be broken. I have this example URL: https://b2b.snapoutdoor.pl/rest/V1/extendvariantstocart/73478 It returns XML with data about the product.
    import requests
    import json
    from xml.etree import ElementTree
    from pprint import pprint
    response = requests.get(
        "https://b2b.snapoutdoor.pl/rest/V1/extendvariantstocart/86559",
        headers={"Accept": "application/xml"},
    )
    node = ElementTree.fromstring(response.content)
    data = json.loads(node.text)
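
The last two lines of the code above suggest the response is an XML wrapper whose text is itself a JSON document, which would explain why the XML looks "broken". A minimal offline sketch of that two-step decode follows; the payload and the `response` tag name are invented for the example, since the real endpoint's output is not shown here.

```python
import json
import xml.etree.ElementTree as ET

# Invented payload: a single XML element whose only text is a JSON document.
PAYLOAD = b'<response>{"sku": "ABC-1", "variants": [{"size": "M", "qty": 3}]}</response>'


def decode_wrapped_json(xml_bytes):
    """Parse the XML wrapper, then parse its text content as JSON."""
    node = ET.fromstring(xml_bytes)
    return json.loads(node.text)


data = decode_wrapped_json(PAYLOAD)
print(data["sku"], data["variants"][0]["qty"])
```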

Scraping web content using xpath won't work

痞子三分冷 submitted on 2019-12-13 02:11:30
Question: I'm using XPath to scrape a particular Amazon web page, but it doesn't work. Can anyone give me some advice? Here's the link to that page: a link. I want to scrape this text: "Fun, credit card-sized prints". The code I'm using is here:
    from lxml import html
    import requests
    url = 'http://www.amazon.co.uk/dp/B009CX5VN2'
    page = requests.get(url)
    tree = html.fromstring(page.text)
    feature_bullets = tree.xpath('//*[@id="feature-bullets"]/ul/li[1]/span/text()')
But feature_bullets is always empty.

Extract data from XML file if arguments are of certain values

有些话、适合烂在心里 submitted on 2019-12-13 01:14:59
Question: I want to loop through a Wikipedia dump in XML format and, for each revision, save the timestamp and the comment if the revision was made by a certain username. Is this possible? I'm trying to get familiar with lxml.
    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
    <siteinfo>
    <sitename
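
Because the dump declares a default namespace, every element name has to be qualified with the `export-0.10` URI when matching. The sketch below streams `<revision>` elements and keeps the timestamp and comment for one username. It is an illustration under assumptions: the sample document and the user names are invented, and it uses the standard library's `iterparse` (`lxml.etree.iterparse` can be used in a similar way).

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Tiny invented dump; real dumps carry many more elements per page and revision.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10" xml:lang="en">
  <page>
    <title>Example</title>
    <revision>
      <timestamp>2015-01-01T00:00:00Z</timestamp>
      <contributor><username>ExampleUser</username></contributor>
      <comment>first edit</comment>
    </revision>
    <revision>
      <timestamp>2015-01-02T00:00:00Z</timestamp>
      <contributor><username>SomeoneElse</username></contributor>
      <comment>second edit</comment>
    </revision>
  </page>
</mediawiki>"""


def revisions_by(source, username):
    """Yield (timestamp, comment) for each revision made by `username`."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "revision":
            continue
        if elem.findtext(NS + "contributor/" + NS + "username") == username:
            # findtext returns None when a revision has no <comment>.
            yield elem.findtext(NS + "timestamp"), elem.findtext(NS + "comment")
        # Free the revision's children once it has been inspected.
        elem.clear()


matches = list(revisions_by(io.BytesIO(SAMPLE), "ExampleUser"))
print(matches)
```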

python lxml how i use tag in items name?

纵然是瞬间 submitted on 2019-12-13 00:57:17
Question: I need to build an XML file using special item names. This is my current code:
    from lxml import etree
    import lxml
    from lxml.builder import E
    wp = E.wp
    tmp = wp("title")
    print(etree.tostring(tmp))
The current output is:
    b'<wp>title</wp>'
I want it to be:
    b'<wp:title>title</title:wp>'
How can I create items with a name like wp:title?
Answer 1: You confused the namespace prefix wp with the tag name. The namespace prefix is a document-local name for a namespace URI. wp:title requires a parser
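
To make the prefix/URI distinction concrete, here is a small sketch using the standard library rather than `lxml.builder`. The namespace URI is a placeholder chosen for the example, not the one the asker's target format requires. In lxml, `lxml.builder.ElementMaker` accepts `namespace` and `nsmap` arguments for the same purpose.

```python
import xml.etree.ElementTree as ET

# Placeholder URI (an assumption); substitute the URI the target format defines.
WP_NS = "http://example.com/wp"

# Ask the serializer to use the prefix "wp" for this URI.
ET.register_namespace("wp", WP_NS)

# The element name is the pair (URI, local name), written as "{uri}title".
title = ET.Element("{%s}title" % WP_NS)
title.text = "title"

serialized = ET.tostring(title)
print(serialized)
# b'<wp:title xmlns:wp="http://example.com/wp">title</wp:title>'
```

The `xmlns:wp` declaration in the output is what binds the prefix to the URI; without it, `wp:title` would not be well-formed namespaced XML.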

Extract information from website using Xpath, Python

余生长醉 submitted on 2019-12-13 00:53:01
Question: I'm trying to extract some useful information from a website. I've made some progress, but now I'm stuck and need your help! I need the information from this table: http://gbgfotboll.se/serier/?scr=scorers&ftid=57700 I wrote this code and got the information I wanted:
    import lxml.html
    from lxml.etree import XPath
    url = ("http://gbgfotboll.se/serier/?scr=scorers&ftid=57700")
    rows_xpath = XPath("//*[@id='content-primary']/div[1]/table/tbody/tr")
    name_xpath = XPath("td[1]//text()")
    team_xpath = XPath("td

Converting my python script from lxml to xml.etree

元气小坏坏 submitted on 2019-12-13 00:39:01
Question: I am trying to convert my script (https://github.com/fletchermoore/n2c2) to use the standard-library package xml.etree instead of lxml. Depending on lxml was an oversight on my part, and I now realize it would be impossible to get my target audience to set up lxml on their Macs. I thought most of the code would just work after switching out the import, but when I tried it I found that xml.etree handles namespaces differently (in a way I do not understand). Specifically, what would be the easiest way to
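
The question is cut off, so the following is only a guess at the kind of difference involved. Standard-library elements have no `nsmap` attribute and no `.xpath()` method (both are lxml additions). Instead, `find`, `findall`, and `iterfind` take an explicit prefix-to-URI dictionary, and tags are stored in `{uri}local` form. The document and namespace below are invented for the sketch.

```python
import xml.etree.ElementTree as ET

# Invented document and namespace URI, used only for illustration.
DOC = """
<root xmlns:n="http://example.com/notes">
  <n:note id="1">first</n:note>
  <n:note id="2">second</n:note>
</root>
"""

# The prefix map is supplied by the caller; the prefixes need not match the document's.
NAMESPACES = {"n": "http://example.com/notes"}

root = ET.fromstring(DOC)
notes = [el.text for el in root.findall("n:note", NAMESPACES)]
first_tag = root[0].tag  # stored in Clark notation: "{uri}local"

print(notes)
print(first_tag)
```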