lxml | 易学教程

When using lxml, can the XML be rendered without namespace attributes?

Question: I am generating some XML with lxml and getting nodes like this:
    <QBXML xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" py:pytype="TREE">
and:
    <MaxReturned py:pytype="int">
These custom attributes are breaking QuickBooks' parser. Can I get lxml to render the XML without them?

Answer 1: It looks like the following takes care of it:
    objectify.deannotate(root, xsi_nil=True)
    etree.cleanup
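The call named in the answer can be sketched as follows. This is a minimal illustration, not the original poster's code: the element names come from the question, the value 10 is made up, and the use of etree.cleanup_namespaces is an assumption about what the truncated answer goes on to call.

```python
from lxml import etree, objectify

# Build a tiny objectify tree; objectify adds py:pytype annotations by default.
root = objectify.Element("QBXML")
root.MaxReturned = 10  # illustrative value, not from the question

annotated = etree.tostring(root)

# Remove the py:pytype / xsi:type / xsi:nil annotation attributes.
objectify.deannotate(root, xsi_nil=True)
# Assumption: drop the namespace declarations that are now unused.
etree.cleanup_namespaces(root)

cleaned = etree.tostring(root)
print(annotated)
print(cleaned)
```

Comparing the two printed strings shows whether the annotation attributes and their namespace declarations are gone.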

How to open this XML file to create dataframe in Python?

Question: Does anyone have a suggestion for the best way to open the XML data on the site below and load it into a dataframe in Python (I prefer working with pandas)? The file is behind the "Data - XML (sdmx/zip)" link on this site: http://www.federalreserve.gov/pubs/feds/2006/200628/200628abs.html

I've tried the following, copied from http://timhomelab.blogspot.com/2014/01/how-to-read-xml-file-into-dataframe.html, and it seems I'm getting close:
    from lxml import objectify
    import pandas as pd
    path =

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

...
    soup = BeautifulSoup(html, "lxml")
  File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 152, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

The above is the output in my Terminal. I am on Mac OS 10.7.x with Python 2.7.1, and I followed this tutorial to get Beautiful Soup and lxml, which both installed successfully and work with a separate test file located here. In the Python script that causes this error, I have included this line: from pageCrawler import comparePages
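One common cause of this error is that lxml is not installed for the interpreter that actually runs the failing script. A small check using only the standard library can confirm that and choose a parser name accordingly. The helper name pick_parser is hypothetical; "html.parser" is the parser bundled with Python, which Beautiful Soup also accepts.

```python
import importlib.util
import sys


def pick_parser():
    """Return "lxml" if it can be imported here, else the built-in "html.parser"."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


# Show which interpreter is running and which parser name it can use.
print(sys.executable)
print(pick_parser())
```

The result can then be passed as the second argument, for example BeautifulSoup(html, pick_parser()).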

builtins.TypeError: must be str, not bytes

I've converted my scripts from Python 2.7 to 3.2, and I have a bug.
    # -*- coding: utf-8 -*-
    import time
    from datetime import date
    from lxml import etree
    from collections import OrderedDict
    # Create the root element
    page = etree.Element('results')
    # Make a new document tree
    doc = etree.ElementTree(page)
    # Add the subelements
    pageElement = etree.SubElement(page, 'Country', Tim='Now', name='Germany', AnotherParameter='Bye', Code='DE', Storage='Basic')
    pageElement = etree.SubElement(page, 'City', name='Germany', Code='PZ', Storage='Basic', AnotherParameter='Hello')
    # For multiple multiple
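The excerpt stops before the line that fails, so the following is only a guess at the cause: under Python 3, etree.tostring returns bytes by default, and writing bytes to a file opened in text mode raises a TypeError of this kind. The sketch below shows two usual ways around that, with in-memory buffers standing in for real files.

```python
import io

from lxml import etree

page = etree.Element("results")
etree.SubElement(page, "Country", name="Germany", Code="DE")

# Default serialization returns bytes in Python 3.
raw = etree.tostring(page, pretty_print=True)
# Asking for encoding="unicode" returns a str instead.
text = etree.tostring(page, pretty_print=True, encoding="unicode")

# Option 1: write str to a text-mode target (stands in for open(path, "w")).
text_file = io.StringIO()
text_file.write(text)

# Option 2: write bytes to a binary-mode target (stands in for open(path, "wb")).
binary_file = io.BytesIO()
binary_file.write(raw)
```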

How can I strip namespaces out of an lxml tree?

Question: Following on from "Removing child elements in XML using python" ... Thanks to @Tichodroma, I have this code:

If you can use lxml, try this:
    import lxml.etree
    tree = lxml.etree.parse("leg.xml")
    for dog in tree.xpath("//Leg1:Dog", namespaces={"Leg1": "http://what.not"}):
        parent = dog.xpath("..")[0]
        parent.remove(dog)
        parent.text = None
    tree.write("leg.out.xml")

Now leg.out.xml looks like this:
    <?xml version="1.0"?>
    <Leg1:MOR xmlns:Leg1="http://what.not" oCount="7">
    <Leg1:Order>
    <Leg1:CTemp id="FO
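One way to drop the namespaces, sketched under the assumption that only element names (not attributes) are namespaced: rename each element to its local name, then remove the declarations that are no longer used. The function name strip_namespaces and the shortened sample document are illustrative and do not come from the question.

```python
from lxml import etree


def strip_namespaces(root):
    """Rename every element to its local name and drop unused xmlns declarations."""
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them.
        if isinstance(el.tag, str):
            el.tag = etree.QName(el).localname
    etree.cleanup_namespaces(root)


# Shortened stand-in for the document in the question.
root = etree.fromstring(
    b'<Leg1:MOR xmlns:Leg1="http://what.not" oCount="7"><Leg1:Order/></Leg1:MOR>'
)
strip_namespaces(root)
print(etree.tostring(root))
```

Namespaced attributes are left untouched by this sketch and would need their own handling.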

parsing xml containing default namespace to get an element value using lxml

I have an XML string like this:
    str1 = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
    <loc> http://www.example.org/sitemap_1.xml.gz </loc>
    <lastmod>2015-07-01</lastmod>
    </sitemap>
    </sitemapindex> """

I want to extract all the URLs inside the <loc> nodes, i.e. http://www.example.org/sitemap_1.xml.gz

I tried this code, but it didn't work:
    from lxml import etree
    root = etree.fromstring(str1)
    urls = root.xpath("//loc/text()")
    print urls
    []

I tried to check whether my root node is formed correctly. I tried this and got back the same string as str1:
    etree.tostring(root)
    '
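The empty result is consistent with how XPath treats a default namespace: an unprefixed name such as loc only matches elements that are in no namespace. A sketch of the usual workaround is to bind the sitemap namespace to a prefix and use it in the expression. The prefix sm is an arbitrary choice, and the code is written for Python 3.

```python
from lxml import etree

str1 = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc> http://www.example.org/sitemap_1.xml.gz </loc>
<lastmod>2015-07-01</lastmod>
</sitemap>
</sitemapindex>"""

root = etree.fromstring(str1)

# Bind the document's default namespace to a prefix for use in XPath.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

unprefixed = root.xpath("//loc/text()")  # matches nothing
urls = [u.strip() for u in root.xpath("//sm:loc/text()", namespaces=ns)]
print(urls)
```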

HTML encoding and lxml parsing

Question: I'm trying to finally solve some encoding issues that come up when scraping HTML with lxml. Here are three sample HTML documents that I've encountered:

1.
    <!DOCTYPE html>
    <html lang='en'>
    <head> <title>Unicode Chars: 은 —’</title> <meta charset='utf-8'> </head>
    <body></body>
    </html>

2.
    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
    <head> <title>Unicode Chars: 은 —’</title> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> <

Parsing library: Beautiful Soup

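1. Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document, and it can save you hours or even days of work. You may be looking for the Beautiful Soup 3 documentation; Beautiful Soup 3 is no longer being developed, and the official site recommends Beautiful Soup 4 for current projects (see the notes on porting to BS4).

# Install Beautiful Soup
    pip install beautifulsoup4

# Install a parser
Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. One of these is lxml. Depending on your operating system, you can install lxml with one of the following commands:
    $ apt-get install python-lxml
    $ easy_install lxml
    $ pip install lxml

Another option is html5lib, a pure-Python parser that parses pages the same way a browser does. You can install html5lib with one of the following commands:
    $ apt-get install python-html5lib
    $ easy_install html5lib
    $ pip install html5lib

2. Basic usage
    html_doc = """ <html>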

Py2exe lxml woes

Question: I have a wxPython application that depends on lxml and works well when run through the Python interpreter. However, when I create an exe with py2exe, I get this error:
    ImportError: No module named _elementpath
I then used python setup.py py2exe -p lxml, which got rid of that error but produced another one:
    ImportError: No module named gzip
Could anyone tell me what the problem is and how I can fix it? Also, should I put any DLL files such as libxml2, libxslt, etc. in my dist folder? I

lxml etree xmlparser remove unwanted namespace

I have an XML doc that I am trying to parse using lxml.etree:
    <Envelope xmlns="http://www.example.com/zzz/yyy">
    <Header> <Version>1</Version> </Header>
    <Body> some stuff </Body>
    </Envelope>

My code is:
    path = "path to xml file"
    from lxml import etree as ET
    parser = ET.XMLParser(ns_clean=True)
    dom = ET.parse(path, parser)
    dom.getroot()

When I call dom.getroot() I get:
    <Element {http://www.example.com/zzz/yyy}Envelope at 28adacac>
However, I only want:
    <Element Envelope at 28adacac>

When I call dom.getroot().find("Body"), nothing is returned. However, when I call dom.getroot().find("{http://www
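lxml keeps the namespace as part of each tag name, so a search has to name it too. Below is a sketch of two equivalent lookups on a well-formed version of the document from the question. The prefix e is arbitrary, and the document is parsed from a string here rather than from the file path in the question.

```python
from lxml import etree

xml = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>"""

root = etree.fromstring(xml)
ns = {"e": "http://www.example.com/zzz/yyy"}

# Fully qualified tag name in {namespace}localname form.
body_a = root.find("{http://www.example.com/zzz/yyy}Body")
# Same lookup with a prefix mapping.
body_b = root.find("e:Body", namespaces=ns)
version = root.findtext("e:Header/e:Version", namespaces=ns)
# An unqualified name does not match the namespaced element.
missing = root.find("Body")

print(body_a.text, body_b.text, version, missing)
```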