lxml | 易学教程

Extracting information from a table on a website using python, LXML & XPATH

Question: After a lot of hard work I managed to extract some of the information I needed from a table on this website: http://gbgfotboll.se/serier/?scr=table&ftid=57108 From the table "Kommande Matcher" (the second table) I managed to extract the date and the team names. Now I am completely stuck trying to extract the following from the first table: the first column "Lag", the second column "S", the 6th column "GM-IM", and the last column "P". Any ideas? Thanks. Answer 1: I've just done it: from io import BytesIO import urllib2 as net from
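
The answer above is cut off, so the following is only a minimal sketch of the general idea, not the answerer's code. It assumes a well-formed table in which "GM-IM" is the 6th column and "P" the last; the sample markup, the team names, and the helper name extract_columns are all hypothetical. It uses the standard-library xml.etree.ElementTree instead of lxml so that it runs without extra packages; the same row-by-row indexing should carry over to elements parsed with lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the league table; the real page's markup may differ.
SAMPLE = """
<table>
  <thead>
    <tr><th>Lag</th><th>S</th><th>V</th><th>O</th><th>F</th><th>GM-IM</th><th>P</th></tr>
  </thead>
  <tbody>
    <tr><td>Team A</td><td>10</td><td>7</td><td>2</td><td>1</td><td>20-8</td><td>23</td></tr>
    <tr><td>Team B</td><td>10</td><td>5</td><td>3</td><td>2</td><td>15-11</td><td>18</td></tr>
  </tbody>
</table>
"""


def extract_columns(table_markup):
    """Return the Lag, S, GM-IM and P cells of every body row as dicts."""
    table = ET.fromstring(table_markup.strip())
    rows = []
    for tr in table.findall("./tbody/tr"):
        # itertext() also collects text nested inside links or spans in a cell.
        cells = ["".join(td.itertext()).strip() for td in tr.findall("td")]
        rows.append({
            "Lag": cells[0],     # first column
            "S": cells[1],       # second column
            "GM-IM": cells[5],   # 6th column (index 5)
            "P": cells[-1],      # last column
        })
    return rows


standings = extract_columns(SAMPLE)
for row in standings:
    print(row)
```

Indexing the cell list per row keeps the column choice in one place, which is easier to adjust than a separate path expression per column if the real table turns out to have a different layout.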

Unable to get the full content using selector

Question: I've written a selector in Python to get some items and their values. I want to use it to scrape the items, not to style them. However, when I run my script, it only gets the items and cannot reach their values, which are separated by a "br" tag. How can I grab them? I do not wish to use XPath in this particular case. Thanks in advance. Here are the elements: html = ''' <div class="elems"> <ul> <li>Item Name: titan </li> <li>Item No: 23003400 </li>
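
The HTML in the question is truncated, so the markup below is an assumed layout, not the asker's. It is a small sketch of one likely cause: in the ElementTree data model (shared by lxml), text that follows a "br" tag is not a child of the parent but is stored in the .tail of that "br" element, so reading only .text misses it. The standard-library parser is used here so the sketch runs without lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: labels and values separated by <br/> inside one element.
SAMPLE = """
<div class="elems">
  <p>Item Name:<br/>titan<br/>Item No:<br/>23003400</p>
</div>
"""

root = ET.fromstring(SAMPLE.strip())
para = root.find("p")

# Text before the first child is in .text; text after each <br/> is in that
# <br/> element's .tail.
parts = [para.text] + [br.tail for br in para.findall("br")]
values = [part.strip() for part in parts if part and part.strip()]
print(values)
```

When the separate pieces are not needed, "".join(para.itertext()) returns all of the text in document order in one string.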

Parse large python xml using xmltree

Question: I have a Python script that parses huge XML files (the largest one is 446 MB): try: parser = etree.XMLParser(encoding='utf-8') tree = etree.parse(os.path.join(srcDir, fileName), parser) root = tree.getroot() except Exception, e: print "Error parsing file "+str(fileName) + " Reason "+str(e.message) for child in root: if "PersonName" in child.tag: personName = child.text This is what my XML looks like: <?xml version="1.0" encoding="utf-8"?> <MyRoot xmlns:xsi="http://www.w3.org/2001/XMLSchema
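
For files of this size, a common alternative to building the whole tree with etree.parse is incremental parsing. The following is a minimal sketch using the standard library's iterparse (lxml.etree provides an iterparse as well); the Person wrapper element, the sample data, and the function name iter_person_names are assumptions, since the question's XML is truncated.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the real file's structure below MyRoot is not shown.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<MyRoot>
  <Person><PersonName>Ada</PersonName></Person>
  <Person><PersonName>Grace</PersonName></Person>
</MyRoot>
"""


def iter_person_names(source):
    """Yield the text of each PersonName element without keeping records in memory."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if "PersonName" in elem.tag:
            yield elem.text
        elif elem.tag.endswith("Person"):
            # Drop the finished record's children and text. The emptied
            # element itself stays attached to the root.
            elem.clear()


# A file path or an open binary file works in place of BytesIO.
names = list(iter_person_names(io.BytesIO(SAMPLE)))
print(names)
```

The "end" event fires once an element has been read completely, which is the point at which its text is safe to use and its subtree safe to discard.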

Trouble Installing Correct Version of lxml (with updated context)

Question: I posted this question earlier, but I now have more information that may help resolve the issue. After checking which Python platform I have using: import platform platform.architecture() it reads: ('64bit', 'WindowsPE') I still haven't found a way to get the correct version of lxml, since I still get an error when I import the module, which reads: File "L:\Code\Scripts\YelpScraper.py", line 1, in <module> from lxml import html File "L:\Code\Scripts\lxml\html\__init__.py",

Getting data from broken xml in Python

Question: I would like to get data from XML, but its structure seems to be broken. I have this example URL: https://b2b.snapoutdoor.pl/rest/V1/extendvariantstocart/73478 which returns XML with data about the product. import requests import json from xml.etree import ElementTree from pprint import pprint response = requests.get( "https://b2b.snapoutdoor.pl/rest/V1/extendvariantstocart/86559", headers={"Accept": "application/xml"}, ) node = ElementTree.fromstring(response.content) data = json.loads(node.text)

Scraping web content using xpath won't work

Question: I'm using XPath to scrape a particular Amazon web page, but it doesn't work. Can anyone give me some advice? Here's the link to that page: a link I want to scrape these: "Fun, credit card-sized prints" The code I'm using is here: from lxml import html import requests url = 'http://www.amazon.co.uk/dp/B009CX5VN2' page = requests.get(url) tree = html.fromstring(page.text) feature_bullets = tree.xpath('//*[@id="feature-bullets"]/ul/li[1]/span/text()') But feature_bullets is always empty.

Extract data from XML file if arguments are of certain values

Question: I want to loop through a Wikipedia dump in XML format and, for each revision, save the timestamp and the comment if the revision was made by a certain username. Is this possible? I'm trying to get familiar with lxml. <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en"> <siteinfo> <sitename
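
One way this could be approached is sketched below, under stated assumptions: the dump excerpt above is truncated, so the page/revision/contributor layout in the sample is assumed to follow the usual MediaWiki export format, and the usernames, comments, and the function name revisions_by are made up for illustration. The sketch uses the standard library's iterparse, whose interface lxml.etree.iterparse largely mirrors, so that it runs without lxml. Because the root element declares a default namespace, every tag has to be matched with the namespace URI in braces.

```python
import io
import xml.etree.ElementTree as ET

# Default namespace declared on the <mediawiki> root in the question.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

# Hypothetical miniature dump; real dumps carry many more fields.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" version="0.10" xml:lang="en">
  <page>
    <title>Example</title>
    <revision>
      <timestamp>2015-01-01T00:00:00Z</timestamp>
      <contributor><username>Alice</username></contributor>
      <comment>first edit</comment>
    </revision>
    <revision>
      <timestamp>2015-01-02T00:00:00Z</timestamp>
      <contributor><username>Bob</username></contributor>
      <comment>second edit</comment>
    </revision>
  </page>
</mediawiki>
"""


def revisions_by(source, username):
    """Yield (timestamp, comment) for each revision made by `username`."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != NS + "revision":
            continue
        author = elem.findtext(f"{NS}contributor/{NS}username")
        if author == username:
            yield elem.findtext(NS + "timestamp"), elem.findtext(NS + "comment")
        # Free the revision's children once it has been inspected.
        elem.clear()


matches = list(revisions_by(io.BytesIO(SAMPLE), "Alice"))
print(matches)
```

Revisions without a username (for example anonymous edits) simply fail the comparison, because findtext returns None when the element is missing.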

python lxml how i use tag in items name?

Question: I need to build an XML file using special item names. This is my current code: from lxml import etree import lxml from lxml.builder import E wp = E.wp tmp = wp("title") print(etree.tostring(tmp)) The current output is: b'<wp>title</wp>' I want it to be: b'<wp:title>title</title:wp>' How can I create items with a name like wp:title? Answer 1: You confused the namespace prefix wp with the tag name. The namespace prefix is a document-local name for a namespace URI. wp:title requires a parser
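
The answer is cut off here, so what follows is a sketch of the point it makes rather than its actual code. A prefixed name such as wp:title stands for a tag in a namespace, and the tree API names such tags as "{namespace-uri}title". The URI below is a placeholder invented for the example, and the standard-library ElementTree is used so the sketch runs without lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI; substitute the one your target format defines.
WP_NS = "http://example.com/wp"

# Ask the serializer to use the prefix "wp" for this URI.
ET.register_namespace("wp", WP_NS)

# The tag is the namespace URI in braces followed by the local name.
title = ET.Element("{%s}title" % WP_NS)
title.text = "title"

serialized = ET.tostring(title)
print(serialized)
```

lxml.etree accepts the same "{uri}title" notation; there the prefix can be bound per element with the nsmap argument, for example etree.Element("{%s}title" % WP_NS, nsmap={"wp": WP_NS}).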

Extract information from website using Xpath, Python

Question: I'm trying to extract some useful information from a website. I've made some progress, but now I'm stuck and need your help! I need the information from this table: http://gbgfotboll.se/serier/?scr=scorers&ftid=57700 I wrote this code and got the information I wanted: import lxml.html from lxml.etree import XPath url = ("http://gbgfotboll.se/serier/?scr=scorers&ftid=57700") rows_xpath = XPath("//*[@id='content-primary']/div[1]/table/tbody/tr") name_xpath = XPath("td[1]//text()") team_xpath = XPath("td

Converting my python script from lxml to xml.etree

Question: I am trying to convert my script (https://github.com/fletchermoore/n2c2) to use the standard-library package xml.etree instead of lxml. This was an oversight on my part, but I now realize it would be impossible to get my target audience to set up lxml on their Macs. I think most of the code should just work after switching out the import, but when I tried it I found that xml.etree handles namespaces differently (which I do not understand). Specifically, what would be the easiest way to
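
The question is truncated, so it is not clear which namespace call caused trouble. As a hedged illustration of how xml.etree.ElementTree deals with namespaces in general: it has no .xpath() method, but find, findall and findtext accept a prefix-to-URI mapping as their second argument, and parsed tags are stored with the URI in braces. The document, the prefix n, and the URI below are invented for the example and are not taken from the n2c2 script.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaced document.
SAMPLE = """
<root xmlns:n="http://example.com/notes">
  <n:note><n:title>First</n:title></n:note>
  <n:note><n:title>Second</n:title></n:note>
</root>
"""

# Prefixes in this mapping are local to the query; they do not have to match
# the prefixes used in the document, only the URIs must match.
NSMAP = {"n": "http://example.com/notes"}

root = ET.fromstring(SAMPLE.strip())
titles = [t.text for t in root.findall("n:note/n:title", NSMAP)]
first_tag = root[0].tag  # stored as "{uri}localname", not "n:note"

print(titles)
print(first_tag)
```

Code that called lxml's .xpath(..., namespaces=...) can often be ported by switching to findall with the same mapping, as long as the expression stays within the limited XPath subset that ElementTree supports.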