lxml

How to use lxml to find an element by text?

本秂侑毒 submitted on 2019-12-02 18:57:04
Assume we have the following HTML:

    <html>
      <body>
        <a href="/1234.html">TEXT A</a>
        <a href="/3243.html">TEXT B</a>
        <a href="/7445.html">TEXT C</a>
      </body>
    </html>

How do I find the "a" element that contains "TEXT A"? So far I've got:

    root = lxml.html.document_fromstring(the_html_above)
    e = root.find('.//a')

I've tried:

    e = root.find('.//a[@text="TEXT A"]')

but that didn't work, as the "a" tags have no attribute "text". Is there any way I can solve this in a similar fashion to what I've tried?

You are very close. Use text()= rather than @text (which indicates an attribute).

    e = root
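A minimal sketch of the text()= approach described in the answer, assuming the lookup is done with lxml's xpath() method (which returns a list of matches, whereas find() returns a single element or None). The variable names are illustrative.

```python
import lxml.html

the_html = """
<html>
  <body>
    <a href="/1234.html">TEXT A</a>
    <a href="/3243.html">TEXT B</a>
    <a href="/7445.html">TEXT C</a>
  </body>
</html>
"""

root = lxml.html.document_fromstring(the_html)

# text() selects the element's text node; @text would look for an attribute.
matches = root.xpath('.//a[text()="TEXT A"]')

# xpath() returns a list for a node-set expression, so guard against no match.
link = matches[0] if matches else None
if link is not None:
    print(link.get("href"), link.text)
```

Note that text()="TEXT A" is an exact comparison; surrounding whitespace in the markup would prevent a match.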

Writing a custom XML file for the Wordpress Importer using lxml

心已入冬 submitted on 2019-12-02 18:41:11
Question: Okay, so here is my current situation: my knowledge of XML and lxml isn't very good yet, since I have rarely used XML files until now. So please tell me if something in my approach is really stupid. ;-) I want to feed my WordPress installation a custom XML file, using the WordPress importer. The default format can be seen here: XML File. Now, there are some tags that look like this: <wp:author>. I am not a hundred percent sure, but as far as I learned today, the wp: part of the tag is the
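In XML, a prefix such as wp: in <wp:author> is a namespace prefix bound to a namespace URI. The following is a rough sketch of how lxml can write such tags. It assumes the prefix is bound to http://wordpress.org/export/1.2/ (an assumption here; the URI should be copied from the header of a real export file), and the element names other than wp:author are only illustrative.

```python
from lxml import etree

# Assumed namespace URI; verify it against the header of an actual export file.
WP_NS = "http://wordpress.org/export/1.2/"
NSMAP = {"wp": WP_NS}

# Declaring the prefix on the root makes every descendant serialize as wp:...
rss = etree.Element("rss", nsmap=NSMAP)
channel = etree.SubElement(rss, "channel")

# lxml addresses namespaced tags in Clark notation: "{namespace-uri}localname".
author = etree.SubElement(channel, "{%s}author" % WP_NS)
login = etree.SubElement(author, "{%s}author_login" % WP_NS)  # illustrative child
login.text = "admin"

xml_text = etree.tostring(rss, pretty_print=True, encoding="unicode")
print(xml_text)
```

The prefix itself never appears in the Python code; lxml picks it from the nsmap when serializing.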

How to read all new files of a directory with python?

我的未来我决定 submitted on 2019-12-02 18:37:02
Question: I'm a beginner in Python and I'd like to know how I can add a condition to this code so that it reads only the new files in the .../data/ directory (for example, files from the last 24 hours, or files added since the last execution). I parse my .xml files every day, and the script parses all the old files again, which takes time.

    from lxml import etree as ET
    import glob
    import sys
    import os

    path = '/home/sky/data/'
    for filename in glob.glob(os.path.join(path, '*.xml')):
        try:
            tree = ET.parse(filename)
            root = tree.getroot()

Installing easy_install… to get to installing lxml

给你一囗甜甜゛ submitted on 2019-12-02 18:09:35
I've come to grips with the fact that ElementTree isn't going to do what I want it to do. I've checked out the documentation for lxml, and it appears that it will serve my purposes. To get lxml, I need to get easy_install. So I downloaded it from here, and put it in /Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/. Then I went to that folder, and ran sh setuptools-0.6c11-py2.6.egg. That installed successfully. Then I got excited because I thought the whole point of easy_install was that I could then just install via easy_install lxml, and lxml would magically

finding elements by attribute with lxml

柔情痞子 submitted on 2019-12-02 17:37:20
I need to parse an XML file to extract some data. I only need some elements with certain attributes. Here's an example document:

    <root>
      <articles>
        <article type="news">
          <content>some text</content>
        </article>
        <article type="info">
          <content>some text</content>
        </article>
        <article type="news">
          <content>some text</content>
        </article>
      </articles>
    </root>

Here I would like to get only the articles with the type "news". What's the most efficient and elegant way to do it with lxml? I tried with the find method, but it's not very nice:

    from lxml import etree
    f = etree.parse("myfile")
    root = f.getroot(
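One possible way to select by attribute is a predicate of the form [@type="news"], which both findall() (ElementPath) and xpath() accept. The sketch below parses the sample document from a string rather than from "myfile"; the variable names are illustrative.

```python
from lxml import etree

xml = b"""
<root>
  <articles>
    <article type="news"><content>some text</content></article>
    <article type="info"><content>some text</content></article>
    <article type="news"><content>some text</content></article>
  </articles>
</root>
"""

root = etree.fromstring(xml)

# ElementPath predicate: every <article> descendant whose type attribute is "news".
news = root.findall('.//article[@type="news"]')

# The same selection written as a full XPath expression.
news_xpath = root.xpath('//article[@type="news"]')

# Collect the text of each matching article's <content> child.
contents = [article.findtext("content") for article in news]
print(len(news), contents)
```

findall() is usually enough for simple attribute tests; xpath() is the option when the condition needs XPath functions or more complex logic.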

Flask example with POST

▼魔方 西西 submitted on 2019-12-02 17:26:50
Suppose the following route, which accesses an XML file to replace the text of a specific tag identified by a given XPath (?key=):

    @app.route('/resource', methods = ['POST'])
    def update_text():
        # CODE

Then, I would use cURL like this:

    curl -X POST http://ip:5000/resource?key=listOfUsers/user1 -d "John"

The XPath expression listOfUsers/user1 should access the tag <user1> and change its current text to "John". I have no idea how to achieve this because I'm just starting to learn Flask and REST, and I can't find any good example for this specific case. Also, I'd like to use lxml to manipulate the xml file
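A rough sketch of how such a route might look, not a definitive implementation. It assumes the key is a path relative to the root element, that the new text is the raw request body, and that the document lives at a hypothetical XML_PATH (a temporary file with made-up content is created here so the example is self-contained).

```python
import os
import tempfile

from flask import Flask, request
from lxml import etree

app = Flask(__name__)

# Hypothetical XML file; a temporary file keeps the sketch self-contained.
XML_PATH = os.path.join(tempfile.mkdtemp(), "users.xml")
with open(XML_PATH, "wb") as fh:
    fh.write(b"<root><listOfUsers><user1>Old name</user1></listOfUsers></root>")


@app.route("/resource", methods=["POST"])
def update_text():
    key = request.args.get("key", "")
    new_text = request.get_data(as_text=True)  # raw request body, e.g. "John"
    if not key:
        return "missing ?key=\n", 400

    tree = etree.parse(XML_PATH)
    try:
        # The key is treated as a path relative to the root element.
        target = tree.getroot().find(key)
    except SyntaxError:
        return "invalid path\n", 400
    if target is None:
        return "no element matches the path\n", 404

    # Replace the text and write the document back to disk.
    target.text = new_text
    tree.write(XML_PATH, encoding="utf-8", xml_declaration=True)
    return "updated\n", 200
```

When calling this with curl, sending the body with --data-binary and an explicit -H "Content-Type: text/plain" avoids any ambiguity about form encoding. Accepting arbitrary paths from clients is a simplification; a real service would probably restrict which elements can be changed.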

selecting attribute values from lxml

断了今生、忘了曾经 submitted on 2019-12-02 17:00:39
I want to use an XPath expression to get the value of an attribute. I expected the following to work:

    from lxml import etree
    for customer in etree.parse('file.xml').getroot().findall('BOB'):
        print customer.find('./@NAME')

but this gives an error:

    Traceback (most recent call last):
      File "bob.py", line 22, in <module>
        print customer.find('./@ID')
      File "lxml.etree.pyx", line 1409, in lxml.etree._Element.find (src/lxml/lxml.etree.c:39972)
      File "/usr/local/lib/python2.7/dist-packages/lxml/_elementpath.py", line 272, in find
        it = iterfind(elem, path, namespaces)
      File "/usr/local/lib/python2.7/dist
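find() and findall() use the ElementPath subset, which selects elements only, so a path ending in @NAME is rejected. A minimal sketch of two alternatives follows: reading the attribute with get(), or using xpath(), which supports attribute steps and returns the values as strings. The sample document is made up for illustration.

```python
from lxml import etree

# Hypothetical stand-in for file.xml.
xml = b'<ROOT><BOB NAME="alice" ID="1"/><BOB NAME="carol" ID="2"/></ROOT>'
root = etree.fromstring(xml)

# Option 1: select the elements, then read the attribute with get().
names_via_get = [customer.get("NAME") for customer in root.findall("BOB")]

# Option 2: full XPath with an attribute step; the result is a list of strings.
names_via_xpath = root.xpath("BOB/@NAME")

# Per-element variant of option 2; xpath() still returns a list.
first_id = root.find("BOB").xpath("./@ID")

print(names_via_get, names_via_xpath, first_id)
```

get() returns None when the attribute is absent, while xpath() simply returns an empty list.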

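Python 3 Web Scraping Notes -- The Beautiful Soup Parsing Library

 ̄綄美尐妖づ submitted on 2019-12-02 15:13:57
1. Introduction

Beautiful Soup is a Python library for parsing HTML and XML. It relies on the structure and attributes of a web page to parse it. With it, we no longer need to write complicated regular expressions; a few simple statements are enough to extract an element from a page. Beautiful Soup depends on an underlying parser when parsing. The lxml parser is recommended here: when initializing Beautiful Soup, pass lxml as the second argument:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup('<p>Hello</p>', 'lxml')
    print(soup.p.string)

2. Basic usage

When BeautifulSoup is initialized, it completes incomplete HTML code.
prettify(): outputs the parsed string with standard indentation.
soup.title.string: outputs the text content of the title node in the HTML; soup.title outputs the whole title node.

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title" name="dromouse"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a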
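A short sketch of the calls described above (prettify(), soup.title and soup.title.string), assuming both bs4 and lxml are installed. The HTML here is a deliberately shortened, unclosed fragment chosen for illustration; it is not the full sample from the notes.

```python
from bs4 import BeautifulSoup

# Shortened sample; <body> and <html> are intentionally left unclosed.
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
"""

soup = BeautifulSoup(html, "lxml")

# prettify() returns the repaired document as an indented string.
pretty = soup.prettify()
print(pretty)

# soup.title is the whole <title> node; .string is its text content.
title_node = soup.title
title_text = soup.title.string
print(title_node)
print(title_text)
```

The missing closing tags appear in the prettify() output because the parser repairs the document while building the tree.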

How to read all new files of a directory with python?

喜你入骨 submitted on 2019-12-02 11:35:01
I'm a beginner in Python and I'd like to know how I can add a condition to this code so that it reads only the new files in the .../data/ directory (for example, files from the last 24 hours, or files added since the last execution). I parse my .xml files every day, and the script parses all the old files again, which takes time.

    from lxml import etree as ET
    import glob
    import sys
    import os

    path = '/home/sky/data/'
    for filename in glob.glob(os.path.join(path, '*.xml')):
        try:
            tree = ET.parse(filename)
            root = tree.getroot()
            #other codes here
        except Exception:
            pass

Thanks!

    for filename in glob.glob(os.path.join(path, '*.xml')):
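One possible way to skip old files is to compare each file's modification time with a cutoff. This is a sketch under that assumption, not necessarily the approach the truncated answer goes on to give. The helper name recent_xml_files and the temporary demo directory are made up for illustration.

```python
import glob
import os
import tempfile
import time


def recent_xml_files(path, max_age_seconds=24 * 3600, now=None):
    """Return the .xml files in path modified within the last max_age_seconds."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    return sorted(
        name
        for name in glob.glob(os.path.join(path, "*.xml"))
        if os.path.getmtime(name) >= cutoff
    )


# Demo: one fresh file and one whose modification time is set two days back.
demo_dir = tempfile.mkdtemp()
old_file = os.path.join(demo_dir, "old.xml")
new_file = os.path.join(demo_dir, "new.xml")
for name in (old_file, new_file):
    with open(name, "w") as fh:
        fh.write("<root/>")
two_days_ago = time.time() - 2 * 24 * 3600
os.utime(old_file, (two_days_ago, two_days_ago))

# Only new.xml is inside the 24-hour window.
print(recent_xml_files(demo_dir))
```

In the original loop, glob.glob(...) would be replaced by recent_xml_files(path), with ET.parse(filename) called as before. For "since the last execution", the script could store the time of its previous run in a small file and use the elapsed time as max_age_seconds.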

Web Scraping - lxml Usage

≯℡__Kan透↙ submitted on 2019-12-02 11:15:59
Installation

    pip install lxml

Usage

    # coding=utf-8
    from lxml import etree

    text = '''
    <div>
      <ul>
        <li class="item-1"><a>first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
      </ul>
    </div>
    '''
    html = etree.HTML(text)
    print(html)
    # Inspect the string contained in the element object
    # print(etree.tostring(html).decode())
    # Get the href of each a under li elements whose class is item-1
    ret1 = html.xpath("//li[@class='item-1']/a/@href")
    print(ret1)
    # Get the text of each a under li elements whose class contains item-1
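The last comment above has no code after it. One way it might be written is with the XPath contains() function, sketched below next to the exact-match query from the notes. The variable names hrefs and texts are illustrative.

```python
from lxml import etree

text = """
<div>
  <ul>
    <li class="item-1"><a>first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a>
  </ul>
</div>
"""

# etree.HTML() repairs the fragment (the last <li> is unclosed) and returns the root.
html = etree.HTML(text)

# Exact match on class: the first <a> has no href, so it contributes nothing.
hrefs = html.xpath("//li[@class='item-1']/a/@href")
print(hrefs)

# Substring match on class, then the text node of each <a>.
texts = html.xpath("//li[contains(@class, 'item-1')]/a/text()")
print(texts)
```

contains() is a plain substring test, so it would also match a class such as item-10; an exact comparison is safer when class names share prefixes.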