lxml | 易学教程

Installing lxml in virtualenv for windows


Question: I've recently started using virtualenv and would like to install lxml in this isolated environment. Normally I would use the Windows binary installer, but I want lxml in this virtualenv only, not globally. pip install does not work for lxml, so I'm at a loss for what to do. I've read that creating symlinks may work, although I am unfamiliar with how symlinks work and which files I should create them for. Does anyone know of a method to install lxml in a virtualenv on Windows?

Adding attributes to existing elements, removing elements, etc with lxml


Question: I parse the XML using from lxml import etree; tree = etree.parse('test.xml', etree.XMLParser()). Now I want to work on the parsed XML. I'm having trouble removing elements with namespaces, or just elements in general, such as <rdf:description><dc:title>Example</dc:title></rdf:description>, where I want to remove the entire element as well as everything within its tags. I also want to add attributes to existing elements. The methods I need are in the Element class but I have no idea how
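For the question above, one possible approach is sketched here. The namespace URIs, the sample document, the helper name remove_descriptions, and the "status" attribute are all assumptions made for illustration. The sketch sticks to the ElementTree-compatible subset of the API, so it falls back to the standard library when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # the calls used below also exist in the standard library
    import xml.etree.ElementTree as etree

# Assumed namespace URIs; the question does not show the real declarations.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

XML = (
    '<root xmlns:rdf="%s" xmlns:dc="%s">'
    "<rdf:description><dc:title>Example</dc:title></rdf:description>"
    "<item>keep me</item>"
    "</root>" % (RDF_NS, DC_NS)
)


def remove_descriptions(root):
    """Remove every rdf:description element together with its children."""
    tag = "{%s}description" % RDF_NS  # namespaced tags use {uri}localname
    for parent in list(root.iter()):  # snapshot, because the tree is modified
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return root


root = etree.fromstring(XML)
remove_descriptions(root)

# Add an attribute to an existing element.
root.find("item").set("status", "reviewed")
```

Removing an element always goes through its parent, which is why the loop walks parents and inspects their direct children.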

Lxml html xpath context


Question: I'm using lxml to parse an HTML file and I'd like to know how I can set the context of an XPath search. What I mean is that I have a node element and want the XPath search to run only inside this node, as if it were the root. For example, I have a form node, and the XPath search //input should return only the inputs of the given form, as opposed to all inputs of all forms on the page. How can I do that? I've found some XPath context docs here, but they don't seem to be quite what I want. Answer 1: XPath expression /
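A minimal sketch of scoping a search to one node: a path that starts with a dot is evaluated relative to the element it is called on. The sample page and the form ids are invented for illustration, and the markup is kept well-formed so the same code also runs on the standard-library fallback. With lxml, the same relative expression can also be passed to the element's xpath() method, whereas an expression starting with // searches the whole document.

```python
try:
    from lxml import etree
except ImportError:  # findall() with relative paths also exists in the stdlib
    import xml.etree.ElementTree as etree

# Hypothetical page with two forms.
PAGE = (
    "<html><body>"
    "<form id='login'><input name='user'/><input name='password'/></form>"
    "<form id='search'><input name='q'/></form>"
    "</body></html>"
)

root = etree.fromstring(PAGE)

# Pick one form node to act as the search context.
search_form = next(f for f in root.iter("form") if f.get("id") == "search")

# The leading "." restricts the search to descendants of search_form.
scoped_names = [i.get("name") for i in search_form.findall(".//input")]

# The same path evaluated on the root element sees every input on the page.
all_names = [i.get("name") for i in root.findall(".//input")]
```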

Is there an elegant way to count tag elements in a xml file using lxml in python?


Question: I could read the content of the XML file into a string and use string operations to achieve this, but I guess there is a more elegant way. Since I did not find a clue in the docs, I am asking here: given an XML file (see below), what is the most elegant way to count XML tags, such as the number of author tags in the example below? We assume that each author appears exactly once. <root> <author>Tim</author> <author>Eva</author> <author>Martin</author> etc. </root> This xml file is trivial,
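As a minimal sketch, the author elements from the example can be counted by collecting them with findall() or by iterating over every matching element. The variable names are illustrative only. With lxml, an XPath count() expression passed to xpath() is another option; it is not used here so that the sketch also runs on the standard-library fallback.

```python
try:
    from lxml import etree
except ImportError:  # findall() and iter() also exist in the stdlib
    import xml.etree.ElementTree as etree

XML = "<root><author>Tim</author><author>Eva</author><author>Martin</author></root>"
root = etree.fromstring(XML)

# Direct children of the root that are named "author".
direct_count = len(root.findall("author"))

# Every "author" element at any depth, counted without building a list.
deep_count = sum(1 for _ in root.iter("author"))
```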

Python web scraping with BeautifulSoup


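1. The BeautifulSoup library, also known as beautifulsoup4 or bs4. Purpose: parsing HTML/XML documents.
2. HTML format: made up of paired angle-bracket tags.
3. Importing the library:
#bs4 is the short package name; BeautifulSoup is one of its classes
from bs4 import BeautifulSoup
#import the package directly
import bs4
3.1. The BeautifulSoup class
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<html>data</html>", "html.parser")
>>> soups = BeautifulSoup(open("D://demo.html"), "html.parser")
It can work directly on markup source or on a file.
3.1. Parsers: html.parser is the HTML parser available as soon as bs4 is installed (it comes from Python's standard library); lxml is lxml's HTML parser and requires installing lxml; xml is lxml's XML parser and requires installing lxml; html5lib is the html5lib parser and requires installing html5lib.
3.2. Basic elements
3.2.1. Tag: a tag, the most basic unit of information, opened with <> and closed with </>.
3.2.2. Name: the tag's name, e.g. the name of <p>...</p> is 'p'; format: <tag>.name
3.2.3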

Parsing CDATA in xml with python


Question: I need to parse an XML file with a number of blocks of CDATA that I need to retain for later plotting: <process id="process1"> <log name="name1" device="device1"><![CDATA[timestamp value]]]></log> <log name="name2" device="device2"><![CDATA[timestamp value, timestamp value, timestamp]]]></log> </process> I will need to do this repeatedly and quickly, and I am looking for the best way to do it. I've read that ElementTree is the fastest of the methods, but I am open to other suggestions. Answer 1:
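A minimal sketch using the standard-library ElementTree that the question mentions: the parser delivers CDATA content as the ordinary text of the element, so no special handling is needed to read it. The data layout is an assumption (comma-separated pairs, each a timestamp and a value separated by whitespace), and the numbers and the helper name parse_logs are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data in an assumed layout: "timestamp value" pairs separated by commas.
XML = """<process id="process1">
<log name="name1" device="device1"><![CDATA[1.0 10.5]]></log>
<log name="name2" device="device2"><![CDATA[1.0 20.0, 2.0 21.5, 3.0]]></log>
</process>"""


def parse_logs(xml_text):
    """Return {log name: [(timestamp, value), ...]} for every log element."""
    root = ET.fromstring(xml_text)
    series = {}
    for log in root.iter("log"):
        pairs = []
        # CDATA content arrives as plain element text.
        for chunk in (log.text or "").split(","):
            fields = chunk.split()
            if len(fields) == 2:  # skip incomplete trailing entries
                pairs.append((float(fields[0]), float(fields[1])))
        series[log.get("name")] = pairs
    return series


series = parse_logs(XML)
```

The resulting lists of pairs can be passed to a plotting library later; an entry without a value, like the trailing timestamp in the second log, is dropped.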

Need python lxml syntax help for parsing html


Question: I am brand new to Python, and I need some help with the syntax for finding and iterating through HTML tags using lxml. Here are the use cases I am dealing with: The HTML file is fairly well formed (but not perfect). It has multiple tables on screen: one containing a set of search results, and one each for a header and a footer. Each result row contains a link to the search result detail. I need to find the middle table with the search result rows (this one I was able to figure out): self
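The use cases listed above could be sketched roughly as follows. The sample page, the table ids, and the link targets are invented, and the markup is kept well-formed so the sketch also runs on the standard-library fallback. Real pages that are "not perfect" would more likely be parsed with lxml.html.fromstring, which tolerates broken markup.

```python
try:
    from lxml import etree
except ImportError:  # the calls used below also exist in the stdlib
    import xml.etree.ElementTree as etree

# Hypothetical page: header table, results table, footer table.
PAGE = """<html><body>
<table id="header"><tr><td>Header</td></tr></table>
<table id="results">
<tr><td><a href="/detail/1">First</a></td></tr>
<tr><td><a href="/detail/2">Second</a></td></tr>
</table>
<table id="footer"><tr><td>Footer</td></tr></table>
</body></html>"""

root = etree.fromstring(PAGE)

# All tables in document order; the middle one holds the search results.
tables = root.findall(".//table")
results_table = tables[1]

# Walk the rows and collect the text and target of each detail link.
links = []
for row in results_table.iter("tr"):
    anchor = row.find(".//a")
    if anchor is not None:
        links.append((anchor.text, anchor.get("href")))
```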

how to remove attribute of a etree Element?


Question: I have an etree Element with some attributes. How can I delete an attribute of a particular Element? Answer 1: The .attrib member of the element object contains the dict of attributes. You can use .pop("key") or del, as you would on any other dict, to remove a key-value pair. Answer 2: Example: >>> from lxml import etree >>> from lxml.builder import E >>> otree = E.div() >>> otree.set("id","123") >>> otree.set("data","321") >>> etree.tostring(otree) '<div id="123" data="321"/>' >>> del otree
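Both removal styles mentioned in Answer 1 can be sketched as follows. The element and attribute names mirror the example in Answer 2, while the variable names are illustrative. The attrib mapping behaves like a dict in lxml and is a plain dict in the standard library, so the sketch runs with either.

```python
try:
    from lxml import etree
except ImportError:  # attrib is dict-like in both libraries
    import xml.etree.ElementTree as etree

div = etree.Element("div")
div.set("id", "123")
div.set("data", "321")

# Style 1: del raises KeyError if the attribute is missing.
del div.attrib["data"]

# Style 2: pop returns the removed value, or the default if it is missing.
removed = div.attrib.pop("id", None)
missing = div.attrib.pop("no-such-attribute", None)
```

Using pop with a default is convenient when the attribute may not be present on every element.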

Extracting p within h1 with Python/Scrapy


Question: I am using Scrapy to extract some data about musical concerts from websites. At least one website I'm working with uses a p element within an h1 element (incorrectly, according to the W3C; see "Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?"). I need to extract the text within the p element nevertheless, and cannot figure out how. I have read the documentation and looked around for example uses, but am relatively new to Scrapy. I understand the solution has

Python libraries: bs4 (BeautifulSoup) and Requests


Beautiful Soup
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/ (Beautiful Soup 4.2.0 documentation)
http://www.imooc.com/learn/712 (video course: Python Meets Data Collection)
https://segmentfault.com/a/1190000005182997 (how to use PyQuery)
import bs4
print(bs4.__version__) #the current version is 4.5.3 (2017-4-6)
Installing third-party libraries:
C:\Python3\scripts\> python pip.exe install bs4 (installs the third-party library bs4, i.e. BeautifulSoup)
C:\Python3\scripts\> python pip.exe install html5lib (installs the third-party library html5lib, an HTML5 parser that BeautifulSoup uses)
Open the local file zzzzz.html and parse it with BeautifulSoup:
from urllib import request
from bs4 import BeautifulSoup
import html5lib #HTML5 parser
url='file:///C:/Python3/zz/zzzzz