lxml

When using lxml, can the XML be rendered without namespace attributes?

感情迁移 submitted on 2019-11-27 03:26:05
Question: I am generating some XML with lxml and getting nodes generated like this: <QBXML xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" py:pytype="TREE"> and: <MaxReturned py:pytype="int"> These custom attributes are killing QuickBooks' parser. Can I get lxml to render without the custom stuff? Answer 1: Looks like the following takes care of it: objectify.deannotate(root, xsi_nil=True) etree.cleanup
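The answer excerpt is cut off, so the following is only a minimal sketch of the approach it appears to describe. It assumes the truncated call is `etree.cleanup_namespaces`, and the tree built here is a stand-in for the asker's document, reusing the element names from the question.

```python
from lxml import etree, objectify

# Build a small objectify tree; objectify annotates it with py:pytype attributes.
root = objectify.Element("QBXML")
root.MaxReturned = 5  # stand-in value; the child is annotated as an int

# Strip the py:pytype / xsi annotations, then drop the namespace
# declarations that nothing in the tree uses any more.
objectify.deannotate(root, xsi_nil=True)
etree.cleanup_namespaces(root)

xml_text = etree.tostring(root, pretty_print=True).decode("ascii")
print(xml_text)
```

`deannotate` removes the attributes but leaves the `xmlns:py`, `xmlns:xsd`, and `xmlns:xsi` declarations in place, which is why the second call is needed.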

How to open this XML file to create dataframe in Python?

半城伤御伤魂 submitted on 2019-11-27 02:55:43
Question: Does anyone have a suggestion for the best way to open the XML data on the site below and put it in a dataframe (I prefer working with pandas) in Python? The file is on the "Data - XML (sdmx/zip)" link on this site: http://www.federalreserve.gov/pubs/feds/2006/200628/200628abs.html I've tried using the following by copying from http://timhomelab.blogspot.com/2014/01/how-to-read-xml-file-into-dataframe.html, and it seems I'm getting close: from lxml import objectify import pandas as pd path =

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

白昼怎懂夜的黑 submitted on 2019-11-27 02:50:02
... soup = BeautifulSoup(html, "lxml") File "/Library/Python/2.7/site-packages/bs4/__init__.py", line 152, in __init__ % ",".join(features)) bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library? The above outputs on my Terminal. I am on Mac OS 10.7.x. I have Python 2.7.1, and followed this tutorial to get Beautiful Soup and lxml, which both installed successfully and work with a separate test file located here . In the Python script that causes this error, I have included this line: from pageCrawler import comparePages
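The excerpt does not show how the problem was resolved. This error means that lxml is not importable by the interpreter running the script, which may differ from the interpreter it was installed for. As a hedged sketch, a script can fall back to the standard-library parser when lxml is missing; the helper name `make_soup` is hypothetical.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when it is importable, else use the stdlib parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed for this interpreter; html.parser ships with Python.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<p>Hello</p>")
print(soup.p.get_text())
```

The fallback keeps the script running, but the two parsers can build slightly different trees for malformed HTML, so installing lxml for the right interpreter remains the cleaner fix.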

builtins.TypeError: must be str, not bytes

 ̄綄美尐妖づ submitted on 2019-11-27 02:40:50
I've converted my scripts from Python 2.7 to 3.2, and I have a bug. # -*- coding: utf-8 -*- import time from datetime import date from lxml import etree from collections import OrderedDict # Create the root element page = etree.Element('results') # Make a new document tree doc = etree.ElementTree(page) # Add the subelements pageElement = etree.SubElement(page, 'Country',Tim = 'Now', name='Germany', AnotherParameter = 'Bye', Code='DE', Storage='Basic') pageElement = etree.SubElement(page, 'City', name='Germany', Code='PZ', Storage='Basic',AnotherParameter = 'Hello') # For multiple multiple
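The excerpt ends before the failing line, so the cause is not visible here. One common source of this error after moving to Python 3 is writing the result of `etree.tostring()` to a file opened in text mode. The sketch below assumes that is the case; the output path is hypothetical.

```python
import os
import tempfile

from lxml import etree

page = etree.Element("results")
etree.SubElement(page, "Country", name="Germany", Code="DE", Storage="Basic")

# With an explicit encoding, tostring() returns bytes (not str) on Python 3.
data = etree.tostring(page, pretty_print=True, xml_declaration=True, encoding="utf-8")

out_path = os.path.join(tempfile.mkdtemp(), "results.xml")  # hypothetical location
with open(out_path, "wb") as fh:  # "wb" rather than "w": the payload is bytes
    fh.write(data)
```

An alternative is to call `write()` on the `ElementTree` object with a file name, which lets lxml handle the file itself.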

How can I strip namespaces out of an lxml tree?

Deadly submitted on 2019-11-27 02:31:25
Question: Following on from Removing child elements in XML using python ... Thanks to @Tichodroma, I have this code: If you can use lxml, try this: import lxml.etree tree = lxml.etree.parse("leg.xml") for dog in tree.xpath("//Leg1:Dog", namespaces={"Leg1": "http://what.not"}): parent = dog.xpath("..")[0] parent.remove(dog) parent.text = None tree.write("leg.out.xml") Now leg.out.xml looks like this: <?xml version="1.0"?> <Leg1:MOR xmlns:Leg1="http://what.not" oCount="7"> <Leg1:Order> <Leg1:CTemp id="FO
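No answer survives in this excerpt. One widely used way to strip namespaces from element names is to rewrite each tag to its local name and then clean up the unused declarations. The sketch below uses a shortened, made-up version of the document in the question.

```python
from lxml import etree

# Shortened stand-in for leg.out.xml; only the names come from the question.
xml = b'<Leg1:MOR xmlns:Leg1="http://what.not" oCount="7"><Leg1:Order/></Leg1:MOR>'
root = etree.fromstring(xml)

for el in root.iter():
    # Comments and processing instructions have a non-string tag; skip them.
    if isinstance(el.tag, str):
        el.tag = etree.QName(el).localname

# The prefix is no longer used by any element, so its declaration can go.
etree.cleanup_namespaces(root)

result = etree.tostring(root).decode("ascii")
print(result)
```

This only renames elements. Namespaced attributes, if the real document has any, would need the same treatment on `el.attrib`.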

parsing xml containing default namespace to get an element value using lxml

拈花ヽ惹草 submitted on 2019-11-27 02:11:32
I have an XML string like this: str1 = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> <sitemap> <loc> http://www.example.org/sitemap_1.xml.gz </loc> <lastmod>2015-07-01</lastmod> </sitemap> </sitemapindex> """ I want to extract all the URLs inside the <loc> nodes, i.e. http://www.example.org/sitemap_1.xml.gz I tried this code but it didn't work: from lxml import etree root = etree.fromstring(str1) urls = root.xpath("//loc/text()") print urls [] I tried to check whether my root node is formed correctly. I tried this and got back the same string as str1: etree.tostring(root) '
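The XPath `//loc` matches only elements in no namespace, while every element here inherits the default namespace declared on the root, which is why the result is empty. A minimal sketch of the usual fix binds that namespace URI to a prefix for the query; the prefix `sm` is an arbitrary choice.

```python
from lxml import etree

str1 = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc> http://www.example.org/sitemap_1.xml.gz </loc>
    <lastmod>2015-07-01</lastmod>
  </sitemap>
</sitemapindex>"""

root = etree.fromstring(str1)

# XPath has no notion of a default namespace, so map the URI to a prefix.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [text.strip() for text in root.xpath("//sm:loc/text()", namespaces=ns)]
print(urls)
```

The `strip()` call removes the whitespace that surrounds the URL inside `<loc>` in the sample.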

HTML encoding and lxml parsing

你。 submitted on 2019-11-27 01:11:25
Question: I'm trying to finally solve some encoding issues that pop up from trying to scrape HTML with lxml. Here are three sample HTML documents that I've encountered: 1. <!DOCTYPE html> <html lang='en'> <head> <title>Unicode Chars: 은 —’</title> <meta charset='utf-8'> </head> <body></body> </html> 2. <!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR"> <head> <title>Unicode Chars: 은 —’</title> <meta http-equiv="content-type" content="text/html; charset=utf-8" /> <
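The question is cut off before it states the exact problem, so the following is one common approach, not necessarily the accepted one. It hands lxml the raw bytes and names the encoding explicitly instead of relying on detection from the `<meta>` tag. The document is the first sample from the question, and the encoding is assumed to be known already (for example from the HTTP headers).

```python
from lxml import etree

# First sample document from the question, as the raw bytes a server would send.
raw = (
    "<!DOCTYPE html><html lang='en'><head>"
    "<title>Unicode Chars: 은 —’</title>"
    "<meta charset='utf-8'></head><body></body></html>"
).encode("utf-8")

# Tell the parser which encoding to use instead of letting it guess.
parser = etree.HTMLParser(encoding="utf-8")
doc = etree.fromstring(raw, parser)

title = doc.findtext(".//title")
print(title)
```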

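The Beautiful Soup parsing library

回眸只為那壹抹淺笑 submitted on 2019-11-27 01:04:00
1. Introduction Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with your preferred parser to provide idiomatic ways of navigating, searching, and modifying the document. Beautiful Soup can save you hours or even days of work. You may be looking for the Beautiful Soup 3 documentation; Beautiful Soup 3 is no longer under development, and the official site recommends using Beautiful Soup 4 in current projects and porting existing code to BS4. # Install Beautiful Soup pip install beautifulsoup4 # Install a parser Beautiful Soup supports the HTML parser in the Python standard library as well as several third-party parsers, one of which is lxml. Depending on your operating system, you can install lxml with one of the following: $ apt-get install python-lxml $ easy_install lxml $ pip install lxml Another option is html5lib, a pure-Python parser that parses HTML the same way a browser does. You can install html5lib with one of the following: $ apt-get install python-html5lib $ easy_install html5lib $ pip install html5lib 2. Basic usage html_doc = """ <html>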
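The basic-usage section is truncated right after it begins, so the sample below is an illustrative sketch, not the article's own example. The HTML document is made up, and the standard-library `html.parser` is used so that no third-party parser is required.

```python
from bs4 import BeautifulSoup

# Made-up document for illustration only.
html_doc = """<html><head><title>Sample page</title></head>
<body>
<p class="intro">First paragraph</p>
<a href="http://example.com/a">A</a>
<a href="http://example.com/b">B</a>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

page_title = soup.title.string                        # navigate by tag name
links = [a["href"] for a in soup.find_all("a")]       # search for all <a> tags
intro = soup.find("p", class_="intro").get_text()     # filter by CSS class

print(page_title, links, intro)
```

Swapping `"html.parser"` for `"lxml"` or `"html5lib"` selects one of the third-party parsers described above, provided it is installed.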

Py2exe lxml woes

半城伤御伤魂 submitted on 2019-11-27 00:54:58
Question: I have a wxPython application that depends on lxml and works well when running it through the Python interpreter. However, when creating an exe with py2exe, I got this error: ImportError: No module named _elementpath I then used python setup.py py2exe -p lxml and I did not get the above error but another one saying: ImportError: No module named gzip Could anyone let me know what the problem is and how I can fix it? Also, should I put any DLL files like libxml2, libxslt etc. in my dist folder? I

lxml etree xmlparser remove unwanted namespace

丶灬走出姿态 submitted on 2019-11-27 00:20:57
I have an XML document that I am trying to parse using lxml.etree: <Envelope xmlns="http://www.example.com/zzz/yyy"> <Header> <Version>1</Version> </Header> <Body> some stuff </Body> </Envelope> My code is: path = "path to xml file" from lxml import etree as ET parser = ET.XMLParser(ns_clean=True) dom = ET.parse(path, parser) dom.getroot() When I try to get dom.getroot() I get: <Element {http://www.example.com/zzz/yyy}Envelope at 28adacac> However, I only want: <Element Envelope at 28adacac> When I do dom.getroot().find("Body") I get nothing returned. However, when I dom.getroot().find("{http://www
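The `ns_clean=True` option only removes redundant namespace declarations. It does not take elements out of their namespace, so the root stays `{http://www.example.com/zzz/yyy}Envelope`. Instead of stripping the namespace, one option is to pass a prefix map to `find()`, as in this minimal sketch. The prefix `e` is arbitrary, and the document is parsed from a string with properly closed tags in place of the asker's file.

```python
from lxml import etree as ET

xml = b"""<Envelope xmlns="http://www.example.com/zzz/yyy">
  <Header><Version>1</Version></Header>
  <Body>some stuff</Body>
</Envelope>"""

root = ET.fromstring(xml)

# Map the default namespace URI to a prefix that find() can resolve.
ns = {"e": "http://www.example.com/zzz/yyy"}
body = root.find("e:Body", namespaces=ns)
version = root.findtext("e:Header/e:Version", namespaces=ns)

# The bare element name is still available without touching the tree.
local_name = ET.QName(root).localname
print(body.text, version, local_name)
```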