lxml

Installing lxml in virtualenv for windows

天大地大妈咪最大 submitted on 2019-12-21 05:05:15
Question: I've recently started using virtualenv, and would like to install lxml in this isolated environment. Normally I would use the Windows binary installer, but I want to use lxml in this virtualenv (not globally). pip install does not work for lxml, so I'm at a loss for what I can do. I've read that creating symlinks may work, although I am unfamiliar with how symlinks work and which files I should be creating them for. Does anyone know of any methods to install lxml in a virtualenv on Windows?

Adding attributes to existing elements, removing elements, etc with lxml

你说的曾经没有我的故事 submitted on 2019-12-21 04:56:37
Question: I parse the XML using: from lxml import etree; tree = etree.parse('test.xml', etree.XMLParser()). Now I want to work on the parsed XML. I'm having trouble removing elements with namespaces, or just elements in general, such as <rdf:description><dc:title>Example</dc:title></rdf:description>, and I want to remove that entire element as well as everything within the tags. I also want to add attributes to existing elements. The methods I need are in the Element class but I have no idea how
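A minimal sketch of both operations from the question, assuming a small inline document instead of test.xml. The <item/> element and the "status" attribute are hypothetical names added for illustration. The import falls back to the standard library's ElementTree, whose API matches lxml's for the calls used here.

```python
try:
    from lxml import etree
except ImportError:  # the calls below also exist in the standard library
    import xml.etree.ElementTree as etree

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML = (
    '<root xmlns:rdf="%s" xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<rdf:description><dc:title>Example</dc:title></rdf:description>'
    '<item/>'  # hypothetical element that will receive a new attribute
    '</root>' % RDF
)
root = etree.fromstring(XML)

# Namespaced tags are matched in "{namespace-uri}localname" form.
target = "{%s}description" % RDF

# An element is removed through its parent, which drops its whole subtree.
for parent in list(root.iter()):
    for child in list(parent):
        if child.tag == target:
            parent.remove(child)

# Add an attribute to an existing element.
item = root.find("item")
item.set("status", "checked")

remaining = [el.tag for el in root.iter()]
```

With lxml specifically, element.getparent().remove(element) is a shorter route to the same removal once the element has been located.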

Lxml html xpath context

青春壹個敷衍的年華 submitted on 2019-12-21 04:15:21
Question: I'm using lxml to parse an HTML file and I'd like to know how I can set the context of an XPath search. What I mean is that I have a node element and want to run the XPath search only inside this node, as if it were the root. For example, I have a form node and want the XPath search //input to return only the inputs of that form, as opposed to all inputs of all forms on the page. How can I do that? I've found some XPath context docs here, but they don't seem to be quite what I want. Answer 1: XPath expression /
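One common way to scope a search to a node is a relative path that starts with a dot. The sketch below assumes a small well-formed page with two hypothetical forms ("login" and "search"), and uses findall so that it also runs on the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # the calls below also exist in the standard library
    import xml.etree.ElementTree as etree

PAGE = (
    '<html><body>'
    '<form id="login"><input name="user"/><input name="password"/></form>'
    '<form id="search"><div><input name="q"/></div></form>'
    '</body></html>'
)
root = etree.fromstring(PAGE)

# Pick one form to act as the search context.
search_form = root.find(".//form[@id='search']")

# ".//input" is relative: it only looks at descendants of the node it is called on.
all_inputs = [el.get("name") for el in root.findall(".//input")]
form_inputs = [el.get("name") for el in search_form.findall(".//input")]

# With lxml, search_form.xpath(".//input") is the equivalent call, whereas
# search_form.xpath("//input") starts from the document root and matches every form.
```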

Is there an elegant way to count tag elements in an XML file using lxml in Python?

佐手、 submitted on 2019-12-21 03:25:13
Question: I could read the content of the XML file into a string and use string operations to achieve this, but I guess there is a more elegant way. Since I did not find a clue in the docs, I am asking here: given an XML file (see below), what is the most elegant way to count XML tags, such as the number of author tags in the example below? We assume that each author appears exactly once. <root> <author>Tim</author> <author>Eva</author> <author>Martin</author> etc. </root> This XML file is trivial,
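A short sketch of two ways to count the author tags, assuming the sample document from the question without the "etc." placeholder. The import falls back to the standard library's ElementTree, which supports the same calls.

```python
try:
    from lxml import etree
except ImportError:  # the calls below also exist in the standard library
    import xml.etree.ElementTree as etree

XML = "<root><author>Tim</author><author>Eva</author><author>Martin</author></root>"
root = etree.fromstring(XML)

# Count only direct children of the root named "author".
direct_count = len(root.findall("author"))

# Count "author" elements at any depth without building a list.
any_depth_count = sum(1 for _ in root.iter("author"))
```

With lxml, root.xpath("count(//author)") is another option; it returns the count as a float.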

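Python web scraping with BeautifulSoup

回眸只為那壹抹淺笑 submitted on 2019-12-21 01:40:25
1. The BeautifulSoup library, also known as beautifulsoup4 or bs4
   Purpose: parsing HTML/XML documents
2. HTML format
   Made up of paired angle-bracket tags
3. Importing the library
   # bs4 is the short package name; BeautifulSoup is one of its classes
   from bs4 import BeautifulSoup
   # import the library directly
   import bs4
3.1 The BeautifulSoup class
   >>> from bs4 import BeautifulSoup
   >>> soup = BeautifulSoup("<html>data</html>", "html.parser")
   >>> soups = BeautifulSoup(open("D://demo.html"), "html.parser")
   It can work on markup source directly or on a file
3.1 Parsers
   html.parser: the HTML parser from Python's standard library, usable as soon as bs4 is installed
   lxml: lxml's HTML parser, requires installing lxml
   xml: lxml's XML parser, requires installing lxml
   html5lib: html5lib's parser, requires installing html5lib
3.2 Basic elements
   3.2.1 Tag: a tag, the most basic unit for organizing information, opened with <> and closed with </>
   3.2.2 Name: the tag's name, e.g. <p>...</p>; format: <tag>.name
   3.2.3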

Parsing CDATA in xml with python

允我心安 submitted on 2019-12-20 12:33:25
Question: I need to parse an XML file with a number of blocks of CDATA that I need to retain for later plotting: <process id="process1"> <log name="name1" device="device1"><![CDATA[timestamp value]]]></log> <log name="name2" device="device2"><![CDATA[timestamp value, timestamp value, timestamp]]]></log> </process> I will need to do this repeatedly and quickly, and I am looking for the best way to do this. I've read that ElementTree is the faster of the methods, but I am open to other suggestions. Answer 1:
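As a rough sketch of one approach: with their default parsers, both lxml and the standard library's ElementTree expose the content of a CDATA block as the element's ordinary .text. The timestamp and value numbers below are made-up placeholders, not data from the question.

```python
try:
    from lxml import etree
except ImportError:  # the calls below also exist in the standard library
    import xml.etree.ElementTree as etree

DOC = (
    '<process id="process1">'
    '<log name="name1" device="device1"><![CDATA[0 1.5]]></log>'
    '<log name="name2" device="device2"><![CDATA[0 1.5, 1 2.5]]></log>'
    '</process>'
)
root = etree.fromstring(DOC)

# Map each log name to the raw CDATA payload; the parser has already
# unwrapped the CDATA section into plain text.
logs = {log.get("name"): log.text for log in root.iter("log")}
```

The raw strings can then be split into timestamp/value pairs for plotting in whatever format the real data uses.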

Need python lxml syntax help for parsing html

允我心安 submitted on 2019-12-20 11:54:07
Question: I am brand new to Python, and I need some help with the syntax for finding and iterating through HTML tags using lxml. Here are the use cases I am dealing with: The HTML file is fairly well formed (but not perfect). It has multiple tables on screen, one containing a set of search results, and one each for a header and footer. Each result row contains a link for the search result detail. I need to find the middle table with the search result rows (this one I was able to figure out): self

How to remove an attribute of an etree Element?

|▌冷眼眸甩不掉的悲伤 submitted on 2019-12-20 10:17:38
Question: I have an etree Element with some attributes. How can I delete an attribute of a particular etree Element?
Answer 1: The .attrib member of the element object contains the dict of attributes. You can use .pop("key") or del, as you would on any other dict, to remove a key-value pair.
Answer 2: Example:
>>> from lxml import etree
>>> from lxml.builder import E
>>> otree = E.div()
>>> otree.set("id","123")
>>> otree.set("data","321")
>>> etree.tostring(otree)
'<div id="123" data="321"/>'
>>> del otree
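The two removal styles named in the first answer can be sketched as follows, assuming a div element built with etree.Element rather than lxml.builder. The "missing" key is a hypothetical name used to show the default argument of pop.

```python
try:
    from lxml import etree
except ImportError:  # the calls below also exist in the standard library
    import xml.etree.ElementTree as etree

div = etree.Element("div")
div.set("id", "123")
div.set("data", "321")

# del raises KeyError if the attribute is absent.
del div.attrib["id"]
after_del = dict(div.attrib)

# pop returns the removed value; a default avoids KeyError for absent keys.
removed = div.attrib.pop("data", None)
not_there = div.attrib.pop("missing", None)
after_pop = dict(div.attrib)
```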

Extracting p within h1 with Python/Scrapy

ⅰ亾dé卋堺 submitted on 2019-12-20 07:26:31
Question: I am using Scrapy to extract some data about musical concerts from websites. At least one website I'm working with uses (incorrectly, according to W3C - Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?) a p element within an h1 element. I need to extract the text within the p element nevertheless, and cannot figure out how. I have read the documentation and looked around for example uses, but am relatively new to Scrapy. I understand the solution has

Python libraries: bs4 (BeautifulSoup) and Requests

折月煮酒 submitted on 2019-12-20 07:21:49
Beautiful Soup
https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/  Beautiful Soup 4.2.0 documentation
http://www.imooc.com/learn/712  Video course: Python meets data collection
https://segmentfault.com/a/1190000005182997  How to use PyQuery
import bs4
print(bs4.__version__)  # current version is 4.5.3 (2017-4-6)
Installing third-party libraries
C:\Python3\scripts\> python pip.exe install bs4 (installs the third-party library bs4, i.e. BeautifulSoup)
C:\Python3\scripts\> python pip.exe install html5lib (installs the third-party library html5lib, the HTML5 parser that BeautifulSoup uses)
Open the local file zzzzz.html and parse it with BeautifulSoup
from urllib import request
from bs4 import BeautifulSoup
import html5lib  # HTML5 parser
url='file:///C:/Python3/zz/zzzzz