lxml

Parsing XML and HTML pages with lxml and the requests package in Python

久未见 submitted on 2019-12-13 14:13:45
Question: I have been trying to parse XML and HTML pages using the lxml and requests packages in Python. I am using the following code for this purpose:

    import requests
    from lxml import html

    url = ""
    req = requests.get(url)
    tree = html.fromstring(req.content)
    root = tree.xpath('')
    for item in root:
        print(item.text)

This code works fine, but some web pages do not display their contents properly and need the encoding set to UTF-8. I don't know how to set the encoding in this code.

Answer 1: requests
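One way to address the encoding problem described in the question is to hand lxml the raw bytes together with a parser that has an explicit encoding. The sketch below is only an illustration under stated assumptions: the helper name `parse_html_bytes` and the sample document are hypothetical, and in real use the bytes would come from `req.content`.

```python
from lxml import html


def parse_html_bytes(raw, encoding="utf-8"):
    """Parse raw HTML bytes, forcing the given encoding (hypothetical helper)."""
    parser = html.HTMLParser(encoding=encoding)
    return html.fromstring(raw, parser=parser)


# Stand-in for req.content: UTF-8 bytes containing a non-ASCII character.
sample = "<html><body><p>caf\u00e9</p></body></html>".encode("utf-8")

tree = parse_html_bytes(sample)
texts = [p.text for p in tree.xpath("//p")]
print(texts)
```

Passing bytes rather than an already decoded string leaves the decoding to lxml, so the encoding is applied in exactly one place.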

Is it possible to use Python lxml on Google App Engine?

我与影子孤独终老i submitted on 2019-12-13 11:59:48
Question: Can I use Python lxml on Google App Engine, or do I have to use Beautiful Soup? I have started using Beautiful Soup, but it seems slow. I am just starting to play with the idea of "screen scraping" data from other websites to create some sort of "mash-up".

Answer 1: EDIT: The lxml library is now supported. Original short answer: you can't. From App Engine's docs: "Application code written for the Python environment must be written exclusively in Python. Extensions written in the C language are not

I want to be able to “walk” a nested XSD using lxml from Python 3

只愿长相守 submitted on 2019-12-13 06:47:59
Question: I know how to walk a basic XSD with lxml in Python 3:

    #!/usr/bin/python
    #
    # Go through an XSD, listing attributes and entities
    #
    import argparse
    from lxml import etree

    def do_element(elmnt):
        nam = elmnt.get('name')
        if len(elmnt) != 0:  # Entity
            print("Entity: ", nam, " Type: ", elmnt.get('type', 'None'))
        else:  # Attribute
            if nam is not None:
                print("Attrib: ", nam, " Type: ", elmnt.get('type', 'None'))

    def main():
        parser = argparse.ArgumentParser(prog='test')
        parser.add_argument('-d',action='store
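For the nested case asked about in the title, one possible approach is a recursive walk that records the depth of each `xs:element`. This is a minimal sketch, not the asker's code: the function name `walk` and the sample schema are hypothetical, and it assumes the schema uses the standard XML Schema namespace.

```python
from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical nested schema used only for illustration.
XSD = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="id" type="xs:int"/>
        <xs:element name="customer">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def walk(node, depth=0, out=None):
    """Collect (depth, name, type) for every xs:element below `node`."""
    if out is None:
        out = []
    for child in node:
        if not isinstance(child.tag, str):
            continue  # skip comments and processing instructions
        if child.tag == "{%s}element" % XS:
            out.append((depth, child.get("name"), child.get("type", "None")))
            walk(child, depth + 1, out)
        else:
            # complexType, sequence, etc. do not add a nesting level
            walk(child, depth, out)
    return out


entries = walk(etree.fromstring(XSD))
for depth, name, type_ in entries:
    print("  " * depth + "%s: %s" % (name, type_))
```

Wrapper nodes such as `xs:complexType` and `xs:sequence` are descended into without increasing the depth, so the depth reflects element nesting only.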

Python - How to append the same XML element multiple times with lxml.objectify

冷暖自知 submitted on 2019-12-13 06:07:04
Question: I have the following XML that I am trying to recreate with the lxml.objectify package:

    <file>
      <customers>
        <customer>
          <phone>
            <type>home</type>
            <number>555-555-5555</number>
          </phone>
          <phone>
            <type>cell</type>
            <number>999-999-9999</number>
          </phone>
          <phone>
            <type>home</type>
            <number>111-111-1111</number>
          </phone>
        </customer>
      </customers>
    </file>

I can't figure out how to create the phone element multiple times. Basically, I have the following non-working code:

    # create phone element 1 root
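Assigning `customer.phone = ...` repeatedly replaces the existing child rather than adding a sibling, so one way to get repeated elements is to create each one with `objectify.SubElement`. The following is a sketch under that assumption; the `phones` list is illustrative data taken from the XML in the question.

```python
from lxml import etree, objectify

# Illustrative data matching the XML in the question.
phones = [
    ("home", "555-555-5555"),
    ("cell", "999-999-9999"),
    ("home", "111-111-1111"),
]

root = objectify.Element("file")
customers = objectify.SubElement(root, "customers")
customer = objectify.SubElement(customers, "customer")

for kind, number in phones:
    # SubElement always appends a new child, even if the tag already exists.
    phone = objectify.SubElement(customer, "phone")
    phone.type = kind
    phone.number = number

# Strip objectify's type annotations before serializing.
objectify.deannotate(root, cleanup_namespaces=True)
xml_text = etree.tostring(root, pretty_print=True).decode()
numbers = [p.findtext("number") for p in root.findall(".//phone")]
print(xml_text)
```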

Parse HTML/XML and find locations of elements in original document

雨燕双飞 submitted on 2019-12-13 05:19:18
Question: Is there a way to get the original location of an element in a document, i.e. the start and end character index, when parsing HTML/XML in Python? I've looked through the lxml documentation and couldn't find anything. For example:

    <a>1</a><b>2</b>
    ...
    print tree.find('b').original_position
    # result: (9, 16)

Answer 1: Google found this, the gist of which is: it's hard for malformed documents because parsing requires synthesizing valid tokens that don't have any corresponding input. It's possible for valid
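lxml does not provide the character offsets the question asks for, but parsed elements do carry a `sourceline` attribute with the line number of their start tag. The sketch below shows only that partial substitute; the sample document is hypothetical.

```python
from lxml import etree

# Hypothetical document with one element per line.
doc = b"<root>\n  <a>1</a>\n  <b>2</b>\n</root>"

root = etree.fromstring(doc)

# sourceline is the 1-based line on which each element's start tag appears.
lines = {el.tag: el.sourceline for el in root.iter()}
print(lines)
```

This gives line granularity only; start and end character indexes would still need a separate tokenizer or a search within the reported line.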

Installing lxml on OS X Mavericks 10.9.2

末鹿安然 submitted on 2019-12-13 05:15:27
Question: I am trying to install lxml on OS X 10.9.2 Mavericks. I have tried all the solutions mentioned here before, but I get a different error: the argument '-mno-fused-madd' is unknown. I believe it only triggered a warning back then, but now it throws an error. Here's the log:

    cc -fno-strict-aliasing -fno-common -dynamic -arch x86_64 -arch i386 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -mno-fused-madd -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG

Fetching data using Python & lxml

末鹿安然 submitted on 2019-12-13 04:42:22
Question: I have HTML that looks like the example below. I would like to get the text inside each <span class="zzAggregateRatingStat">. For the example below, I would get 3 and 5. I am using Python 2.7 and lxml for this.

    <div class="pp-meta-review">
    <span class="zrvwidget" style="">
    <span g:inline="true" g:type="NumUsersFoundThisHelpful" g:hideonnoratings="true" g:entity.annotation.groups="maps" g:entity.annotation.id="http://maps.google.com/?q=Central+Kia+of+Irving++(972)+659-2204+loc:+1600
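Selecting spans by class is a direct fit for XPath. Because the HTML in the question is cut off, the sketch below uses a small hypothetical fragment that keeps only the class names mentioned in the question and the expected values 3 and 5.

```python
from lxml import html

# Hypothetical, shortened stand-in for the truncated HTML in the question.
sample = """
<div class="pp-meta-review">
  <span class="zrvwidget" style="">
    <span class="zzAggregateRatingStat">3</span> out of
    <span class="zzAggregateRatingStat">5</span>
  </span>
</div>
"""

tree = html.fromstring(sample)

# Match spans whose class attribute is exactly "zzAggregateRatingStat".
spans = tree.xpath('//span[@class="zzAggregateRatingStat"]')
values = [span.text_content().strip() for span in spans]
print(values)
```

If the real spans carry several classes, the predicate would need `contains(@class, "zzAggregateRatingStat")` instead of an exact match.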

Removing spaces and non-printable characters in Python

风流意气都作罢 submitted on 2019-12-13 04:41:36
Question: I am working with an XML file using the lxml etree xpath method. My code is:

    from lxml import etree

    # Raw string, so that "\f" is not read as a form-feed escape
    File = r"c:\file.xml"
    doc = etree.parse(File)
    alltext = doc.xpath('descendant-or-self::text()')
    clump = "".join(alltext)
    clump

I got the following output:

    "'\n\t\n\t\t\n\t\t\n\t\t\n\t\t\n\t\n\t\n\t\t\t\n\t\n\t\t\n\t\t\t\n\t\t\t\tIntroduction\n\t\t\t\n\t\t\t\n\t\t\n\t\t\n\t\t\t\n\t\t\t\tAccessibility\n\t\t\t\n\t\t\t\n\t\t\n\t\t\n\t\t\t\n\t\t\t\tOpening eBooks\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\

I want to remove
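Runs of newlines and tabs like the ones in that output can be collapsed with plain string methods once the text nodes are joined. This is a minimal sketch, assuming the goal is single spaces between words; `clump` here is a shortened, hypothetical version of the output shown in the question.

```python
# Shortened stand-in for the joined text nodes from the question.
clump = (
    "\n\t\n\t\t\n\t\t\t\tIntroduction\n\t\t\t\n\t\t\n"
    "\t\t\t\tAccessibility\n\t\t\t\n\t\t\t\tOpening eBooks\n\t\t\t\n"
)

# str.split() with no arguments splits on any whitespace run and drops
# empty pieces, so joining with a single space normalizes the text.
collapsed = " ".join(clump.split())

# Drop any remaining non-printable characters (control codes and the like).
printable = "".join(ch for ch in collapsed if ch.isprintable())
print(printable)
```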

Python lxml: Adding and deleting tags

懵懂的女人 submitted on 2019-12-13 04:33:18
Question: I am attempting to add and remove tags in an XML tree (snippet below). I have a dict of boolean values that I use to determine whether to add or remove a tag. If the value is true and the element does not exist, the code creates the tag (and its parent, if the parent doesn't exist). If the value is false, it deletes the tag. However, it doesn't seem to work, and I can't find out why.

    <Assets>
      <asset name="Adham">
        <pos>
          <x>27913.769923</x>
          <y>5174.627773</y>
        </pos>
        <GFX>
          <space>P03.png</space>
          <exterior>snow.png<
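The asker's code is not shown, so the cause of the failure cannot be determined from the snippet. As a reference point, the add-or-remove logic described can be sketched as follows; the function name `apply_flags` and the tag names `services`, `bar`, and `refuel` are hypothetical. One pitfall worth ruling out in code like this is testing an element with `if element:`, which in lxml reflects whether the element has children, so the sketch compares against `None` explicitly.

```python
from lxml import etree


def apply_flags(asset, parent_tag, flags):
    """Add or remove children of <parent_tag> according to a dict of booleans."""
    parent = asset.find(parent_tag)
    for tag, wanted in flags.items():
        child = parent.find(tag) if parent is not None else None
        if wanted and child is None:
            if parent is None:
                # Create the parent on demand, as the question describes.
                parent = etree.SubElement(asset, parent_tag)
            etree.SubElement(parent, tag)
        elif not wanted and child is not None:
            parent.remove(child)


# Hypothetical asset; the tag names are illustrative only.
asset = etree.fromstring('<asset name="Adham"><services><bar/></services></asset>')
apply_flags(asset, "services", {"bar": False, "refuel": True})

result = [child.tag for child in asset.find("services")]
print(etree.tostring(asset).decode())
```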

Open and read multiple XML files from a folder in Python

穿精又带淫゛_ submitted on 2019-12-13 04:23:00
Question: I have stored more than 150 XML files in one folder. I want to open and read all of them and then run further analysis. What do I need to change in the code below to open and read multiple XML files from that folder?

    from bs4 import BeautifulSoup
    import lxml
    import pandas as pd

    infile = open("F:\\itprocess\\xmltest.xml", "r")
    contents = infile.read()

Answer 1: The os module's listdir() function is a good way to read multiple files.

    from bs4
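The answer's suggestion of iterating over the folder can be sketched with the standard library plus lxml. This version uses `glob` rather than `os.listdir()` so that only `*.xml` files are picked up, which is a swapped-in choice and not the answer's own code; the helper name `parse_folder` and the temporary demo folder are hypothetical.

```python
import glob
import os
import tempfile

from lxml import etree


def parse_folder(folder):
    """Parse every *.xml file in `folder`; return {file name: root element}."""
    roots = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.xml"))):
        roots[os.path.basename(path)] = etree.parse(path).getroot()
    return roots


# Demo on a throwaway folder so the sketch runs without the asker's files.
with tempfile.TemporaryDirectory() as folder:
    for i in range(3):
        file_path = os.path.join(folder, "doc%d.xml" % i)
        with open(file_path, "w", encoding="utf-8") as fh:
            fh.write("<doc><id>%d</id></doc>" % i)

    parsed = parse_folder(folder)
    ids = [root.findtext("id") for root in parsed.values()]

print(ids)
```

In the asker's setting, `parse_folder` would be called with the folder that holds the XML files, and the per-file analysis would run on each returned root.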