lxml

Modify large file containing multiple XML files to create small file depending on condition

天大地大妈咪最大 posted on 2020-01-05 08:33:48
Question: I have a large file that contains multiple XML documents, one per line. I want to create a new file containing only the lines (XML documents) in which several tags match the columns of a spreadsheet. For example, the large file looks like this: <?xml version="1.0" encoding="UTF-8"?><data><student><result><grade>A</grade></result><details><name>John</name><house>Red</house><id>100</id><age>16</age><email>john@mail.com</email></details></student></data> <?xml version="1.0" encoding="UTF-8"?><data><student>
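The filtering described in this question could be sketched roughly as follows. It assumes one complete XML document per line and that the spreadsheet has already been reduced to a set of (name, id) pairs; the `matching_lines` helper, the `wanted` set, and the second sample record are hypothetical. The import falls back to the standard library's ElementTree, which offers the same calls used here, so the sketch also runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library the question is about
except ImportError:  # fallback: same API for the calls used below
    import xml.etree.ElementTree as etree


def matching_lines(lines, wanted):
    """Yield each line whose <name> and <id> appear together in `wanted`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between documents
        # Parse bytes: lxml rejects str input that carries an
        # encoding declaration.
        root = etree.fromstring(line.encode("utf-8"))
        for student in root.iter("student"):
            key = (
                student.findtext("details/name"),
                student.findtext("details/id"),
            )
            if key in wanted:
                yield line
                break  # one matching student is enough to keep the line


# Hypothetical spreadsheet rows, reduced to (name, id) pairs.
wanted = {("John", "100")}

# Two sample lines; the second record is made up for illustration.
lines = [
    '<?xml version="1.0" encoding="UTF-8"?><data><student>'
    "<result><grade>A</grade></result><details><name>John</name>"
    "<house>Red</house><id>100</id></details></student></data>",
    '<?xml version="1.0" encoding="UTF-8"?><data><student>'
    "<result><grade>B</grade></result><details><name>Mary</name>"
    "<house>Blue</house><id>200</id></details></student></data>",
]

kept = list(matching_lines(lines, wanted))
```

For a real file, the same generator can be fed an open file object and its output written line by line, which avoids loading the whole file into memory.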

Failing to load packages with pycharm

有些话、适合烂在心里 posted on 2020-01-05 07:34:07
Question: I am trying to do some web scraping using Python with PyCharm on a Windows 10 machine. Some sites suggest using the lxml library, and it sounds good. I am trying to load the package but am having trouble. What should I do? OK great. I go to add lxml 3.6.4 in the package installer and it fails with the message(s): ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n" and Could not find function xmlCheckVersion in library libxml2. Is

list python-dev as install_requires in setup.py

99封情书 posted on 2020-01-05 04:18:32
Question: Is there a way to tell Python in the setup.py file that "python-dev" (which cannot be installed with pip because it is an OS package) is necessary and therefore should be installed? How can it be installed automatically?
Answer 1: No. For one thing, the python-dev package is specific to Debian-like distributions; there is no guarantee that other distributions will have a package with the same name that fulfills the desired role. For another, the user installing your Python package may have permission to

Crawling tables from webpage

我怕爱的太早我们不能终老 posted on 2020-01-05 03:19:05
Question: I'm trying to extract CSU employee salary data from this webpage (http://www.sacbee.com/statepay/#req=employee%2Fsearch%2Fname%3D%2Fyear%3D2013%2Fdepartment%3DCSU%20Sacramento). I've tried using the urllib2 and requests libraries, but neither of them returned the actual table from the webpage. My guess is that the table is generated dynamically by JavaScript. Below is my code using requests.
from lxml import html
import requests
page = requests.get("http://www.sacbee.com/statepay/

openshift: can't install lxml for python app

烈酒焚心 posted on 2020-01-04 06:17:28
Question: I am trying OpenShift but I can't deploy a Python app with lxml. Below are my steps; I'm only adding an lxml requirement. The error happens when I push. I am able to ssh, so I don't think it's a connectivity problem. If I don't add the lxml requirement but add some other libraries, it works. The problem is with lxml only. I think it's because it has some system dependencies (on an Ubuntu machine I have to run this first: sudo apt-get install -y libxml2-dev libxslt-dev python-dev ) However I

Merge multiple <br /> tags to a single one with python lxml

夙愿已清 posted on 2020-01-04 04:43:51
Question: I have a Python script to clean scraped HTML content. It uses BeautifulSoup4 and works pretty well. Recently I decided to learn lxml, but I find the tutorials harder (for me) to follow. For example, I use the following code to merge multiple <br /> tags into one, i.e., if there is more than one <br /> tag in a row, remove all but one:
from bs4 import BeautifulSoup, Tag
data = 'foo<br /><br>bar. <p>foo<br/><br id="1"><br/>bar'
soup = BeautifulSoup(data)
for br in soup.find_all("br"):
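One way the same clean-up might look on an lxml tree is sketched below. This is not the asker's code or an answer from the thread: `merge_br_runs` is a hypothetical helper, and the sample input is a well-formed variant of the question's data so that it can be parsed as XML. The one lxml-specific point is that text following an element is stored in that element's `tail`, so removing a `<br>` would also drop the text after it unless the tail is moved first. The import falls back to the standard library's ElementTree, which supports the calls used here.

```python
try:
    from lxml import etree
except ImportError:  # fallback: same API for the calls used below
    import xml.etree.ElementTree as etree


def merge_br_runs(root):
    """Collapse each run of adjacent <br> elements into the first one."""
    # Snapshot the elements first so removals do not disturb iteration.
    for parent in list(root.iter()):
        prev = None
        for child in list(parent):
            is_repeat = (
                child.tag == "br"
                and prev is not None
                and prev.tag == "br"
                and not (prev.tail or "").strip()  # only whitespace between
            )
            if is_repeat:
                # Removing an element also removes its tail text, so move
                # the tail onto the <br> that is kept.
                prev.tail = (prev.tail or "") + (child.tail or "")
                parent.remove(child)
            else:
                prev = child
    return root


# Well-formed variant of the question's sample data (an assumption).
data = '<div>foo<br/><br/>bar. <p>foo<br/><br id="1"/><br/>bar</p></div>'
root = merge_br_runs(etree.fromstring(data))
remaining = len(list(root.iter("br")))
text = "".join(root.itertext())
```

Real scraped HTML with unclosed `<br>` tags would need an HTML parser such as `lxml.html.fromstring` instead of `etree.fromstring`; the helper itself only relies on `tag`, `tail`, and `remove`, so it should apply to such a tree as well.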

How to convert &lt; into < in lxml, Python?

隐身守侯 posted on 2020-01-03 21:01:23
Question: There's an XML file:
<body> <entry> I go to <hw>to</hw> to school. </entry> </body>
For some reason, I changed <hw> to &lt;hw&gt; and </hw> to &lt;/hw&gt; before parsing it with the lxml parser:
<body> <entry> I go to &lt;hw&gt;to&lt;/hw&gt; to school. </entry> </body>
But after modifying the parsed XML data, I want to get a <hw> element, not &lt;hw&gt;. How can I do that?
Answer 1: First, find an unescape function:
from xml.sax.saxutils import unescape
entry=body[0]
Unescape it and replace the original:
body.replace(entry, e

lxml attributes require full namespace

こ雲淡風輕ζ posted on 2020-01-03 16:49:01
Question: The code below reads a table from an Excel 2003 XML workbook using lxml (Python 3.3). The code works fine; however, in order to access the Type attribute of the Data element via the get() method, I need to use the key '{urn:schemas-microsoft-com:office:spreadsheet}Type'. Why is this? I have already declared this namespace with the ss prefix. All I can think of is that this namespace appears twice in the document, once with a namespace prefix and once without, i.e. <Workbook xmlns="urn:schemas-microsoft
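The snippet is cut off before any answer, so the following is only a likely explanation with a small sketch. In XML, a default namespace applies to elements but not to attributes, so `ss:Type` is a namespace-qualified attribute, and lxml identifies qualified names in `{uri}local` form rather than by prefix. The miniature workbook and the `ss` helper below are made up for illustration. The import falls back to the standard library's ElementTree, which uses the same notation.

```python
try:
    from lxml import etree
except ImportError:  # fallback: same API for the calls used below
    import xml.etree.ElementTree as etree

SS_NS = "urn:schemas-microsoft-com:office:spreadsheet"


def ss(local_name):
    """Build the '{namespace}name' key for a name in the ss namespace."""
    return "{%s}%s" % (SS_NS, local_name)


# Made-up miniature workbook: the same URI is declared both as the
# default namespace and under the ss prefix.
doc = (
    '<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" '
    'xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">'
    '<Data ss:Type="String">hello</Data>'
    "</Workbook>"
)
root = etree.fromstring(doc)

data = root.find(ss("Data"))      # elements pick up the default namespace
cell_type = data.get(ss("Type"))  # the prefixed attribute needs the full key
plain = data.get("Type")          # no un-namespaced Type attribute exists
```

A small helper like `ss` keeps the long URI in one place, which is a common way to make namespaced lookups less noisy.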

How to get lxml working under IronPython?

百般思念 posted on 2020-01-03 10:55:19
Question: I need to port some code that relies heavily on lxml from a CPython application to IronPython. lxml is very Pythonic and I would like to keep using it under IronPython, but it depends on libxslt and libxml2, which are C extensions. Does anyone know of a workaround to allow lxml under IronPython, or a version of lxml that doesn't have those C-extension dependencies?
Answer 1: You might check out IronClad, which is an open source project intended to make C Extensions for Python available in

Copy a node from one xml file to another using lxml

这一生的挚爱 posted on 2020-01-03 05:29:04
Question: I'm trying to find the simplest way of copying one node to another XML file. Both files will contain the same node; only the contents of that node will differ. In the past I've done some crazy copying of each element and subelement, but there has to be a better way.
#Master XML
parser = etree.XMLParser(strip_cdata=False)
tree = etree.parse('file1.xml', parser)
# Find the //input node - which has a lot of subelems
inputMaster= tree.xpath('//input')[0]
#Dest XML -
parser2 = etree
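A whole-subtree copy of the kind asked about might be sketched as below. This is not the thread's answer: the two in-memory documents and the `replace_first` helper are hypothetical stand-ins for the files in the question. `copy.deepcopy` duplicates the node with all of its children, so the master tree is left untouched, and the copy is assigned over the old node in its parent. The import falls back to the standard library's ElementTree so the sketch runs without lxml.

```python
import copy

try:
    from lxml import etree
except ImportError:  # fallback: same API for the calls used below
    import xml.etree.ElementTree as etree


def replace_first(dest_root, source_root, tag):
    """Replace the first `tag` in dest_root with a copy from source_root."""
    # Deep-copy so the source tree keeps its own node.
    new_node = copy.deepcopy(source_root.find(".//" + tag))
    old_node = dest_root.find(".//" + tag)
    # "/.." selects the parent of the first matching element.
    parent = dest_root.find(".//" + tag + "/..")
    parent[list(parent).index(old_node)] = new_node
    return dest_root


# Hypothetical stand-ins for the master and destination files.
master = etree.fromstring("<root><input><a>1</a><b>2</b></input></root>")
dest = etree.fromstring("<root><other/><input><a>old</a></input></root>")

replace_first(dest, master, "input")
```

With lxml specifically, `old_node.getparent().replace(old_node, new_node)` is a shorter way to do the parent lookup and swap; the path-based lookup is used here only so the sketch also works with ElementTree.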