lxml | 易学教程

8. Regular Expressions and XPath

1. Scraping Neihan Duanzi jokes with regular expressions

    import requests
    import re

    def loadPage(page):
        url = "http://www.neihan8.com/article/list_5_" + page + ".html"
        # User-Agent header
        user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
        headers = {'User-Agent': user_agent}
        response = requests.get(url, headers=headers)
        response.encoding = 'gbk'
        html = response.text
        return html

    if __name__ == "__main__":
        page = input('Enter the page number to scrape: ')
        html = loadPage(page)
        # with open('a.html','w') as f:
        #     f.write(html)
        # Find every joke body: <div class="f18 mb20"></div>
        # Without re.S, "." does not match a newline, so a match cannot span
        # several lines; the search simply carries on from the next line.
        # With re.S, the whole string is matched as a single unit, finding (.*?
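The effect of re.S described in the comments can be checked in isolation. The sketch below is a minimal illustration, not the original script: the sample HTML string is made up, and the pattern (in particular its closing `</div>` part, which the excerpt cuts off) is an assumption made for the demonstration.

```python
import re

# Hypothetical sample markup; the real page content is not shown in the excerpt.
html = '<div class="f18 mb20">\n  first line\n  second line\n</div>'

# Assumed pattern: the excerpt is cut off after "(.*?", so the rest is a guess.
pattern = r'<div class="f18 mb20">(.*?)</div>'

# Without re.S, "." stops at newlines, so a div spanning several lines is missed.
without_flag = re.findall(pattern, html)

# With re.S (alias re.DOTALL), "." also matches newlines and the body is captured.
with_flag = re.findall(pattern, html, re.S)

print(without_flag)
print(with_flag)
```

The non-greedy `(.*?)` matters as much as the flag: with a greedy `(.*)` and re.S, a page containing several such divs would be captured as one large match.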

Python Lxml - Append a existing xml with new data

Question: I am new to Python and lxml. After reading the lxml site and diving into Python, I could not find a solution to my n00b troubles. I have the XML sample below:

<addressbook>
  <person>
    <name>Eric Idle</name>
    <phone type='fix'>999-999-999</phone>
    <phone type='mobile'>555-555-555</phone>
    <address>
      <street>12, spam road</street>
      <city>London</city>
      <zip>H4B 1X3</zip>
    </address>
  </person>
</addressbook>

I am trying to append one child to the root element and
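Since the excerpt stops before the asker's own attempt, the following is only a sketch of one common way to append a child to the root with the ElementTree-style API. The new person's name and phone number are hypothetical data, and the import falls back to the standard library's compatible module when lxml is not installed.

```python
try:
    from lxml import etree  # preferred when lxml is installed
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same basic API

xml_text = """<addressbook>
  <person>
    <name>Eric Idle</name>
  </person>
</addressbook>"""

root = etree.fromstring(xml_text)

# Build a second <person> and attach it directly to the root element.
person = etree.SubElement(root, "person")
name = etree.SubElement(person, "name")
name.text = "John Cleese"  # hypothetical data
phone = etree.SubElement(person, "phone", type="mobile")
phone.text = "111-111-111"  # hypothetical data

names = [p.findtext("name") for p in root.findall("person")]
print(names)
print(etree.tostring(root).decode())
```

To update a file on disk, the same steps would start from `etree.parse(path)` and finish by writing the tree back out; that part is omitted here because the question's file handling is not shown.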

Parsing a partial XML with python lxml

Question: I'm trying to parse a large XML file in Python while it is still being received from the network. To do that, I take the data and pass it to lxml.etree.iterparse. However, if the XML has not been fully sent yet, like so:

<MyXML>
<MyNode foo="bar">
<MyNode foo="ba

and I run etree.iterparse(f, tag='MyNode').next(), I get an XMLSyntaxError at wherever it was cut off. Is there any way I can receive the first tag (i.e. the first MyNode) and only get an exception when I reach that part
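One approach that may fit this situation is a pull parser, which is fed data as it arrives and hands back only the elements that are already complete. The sketch below is not taken from the question's answers: the chunk contents are made up to imitate a stream cut off mid-tag, and the code uses `XMLPullParser`, which exists in both lxml.etree and the standard library's ElementTree.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Simulated network chunks (hypothetical data); the second is cut off mid-tag.
chunks = [b'<MyXML><MyNode foo="bar"/>', b'<MyNode foo="ba']

parser = etree.XMLPullParser(events=("end",))
seen = []
for chunk in chunks:
    parser.feed(chunk)
    # read_events() yields only events for elements parsed so far.
    for event, elem in parser.read_events():
        if elem.tag == "MyNode":
            seen.append(elem.get("foo"))

print(seen)
```

The incomplete trailing tag is simply held back until more data arrives. A syntax error would be expected only if the parser is told the input has ended (by calling `close()`) while the document is still truncated.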

AWS Lambda not importing LXML

Question: I am trying to use the lxml module within AWS Lambda and am having no luck. I installed lxml into my Lambda function deployment package using the following command:

pip install lxml -t folder

I zipped up the contents of my Lambda function as I have done with all my other Lambda functions, and uploaded the archive to AWS Lambda. However, no matter what I try, I get this error when I run the function:

Unable to import module 'handler': /var/task/lxml/etree.so: undefined symbol: PyFPE_jbuf

When I

python [lxml] - cleaning out html tags

Question:

    import sys
    from lxml.html.clean import clean_html, Cleaner

    def clean(text):
        try:
            cleaner = Cleaner(scripts=True, embedded=True, meta=True,
                              page_structure=True, links=True, style=True,
                              remove_tags=['a', 'li', 'td'])
            cleaned = cleaner.clean_html(text)
            # Report how much the length changed after cleaning
            print(len(cleaned) - len(text))
            return cleaned
        except Exception:
            print('Error in clean_html')
            print(sys.exc_info())
            return text

I put together the above (ugly) code as my initial foray into Python land. I'm trying to use the lxml Cleaner to clean out a couple of html

Split long XML tags in multiple lines with lxml

Question: My Python (2.7) script outputs the following XML using the lxml library:

<Button android:id="@+id/button1" android:layout_width="wrap_content" android:layout_height="wrap_content" android:layout_marginLeft="17dp" android:layout_marginTop="16dp" android:text="Button"/>

I would like to output it on multiple lines, one per attribute:

<Button android:id="@+id/button1"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_marginLeft="17dp"
        android:layout

Is it a xpath (lxml) bug?

Question: I have this XPath:

//*[namespace-uri() = 'http://foundation.org/UA/2011/03/NodeSet.xsd'][local-name() = 'Reference'][@ReferenceType = 'HasNotifier']/../../Description[@Locale="en"]

but it doesn't work with this XML file. Maybe it is my mistake, or maybe it is an lxml bug; I don't know. I have been trying for a few days to write a correct XPath expression but, unfortunately, I can't get it right :( Is it an lxml bug or my mistake? What I want: if the reference type is "HasNotifier", print "002CC-ESSO01.(WAAA05.01?1)". My XML File
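The XML file is not shown in the excerpt, so the cause cannot be confirmed. One frequent source of this symptom is that an unprefixed step such as `Description` only matches elements in no namespace, while elements under a default `xmlns` declaration all belong to that namespace. The sketch below illustrates that behavior on a made-up document; the element layout and the `ua` prefix are assumptions, and only the namespace URI comes from the question.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

NS = "http://foundation.org/UA/2011/03/NodeSet.xsd"

# Hypothetical document: the default namespace applies to every element.
xml_text = """<Root xmlns="http://foundation.org/UA/2011/03/NodeSet.xsd">
  <Node>
    <Description Locale="en">wanted text</Description>
    <References>
      <Reference ReferenceType="HasNotifier">x</Reference>
    </References>
  </Node>
</Root>"""

root = etree.fromstring(xml_text)

# Unprefixed name: looks for Description in no namespace, so nothing matches.
unqualified = root.findall(".//Description[@Locale='en']")

# Prefixed name bound to the namespace URI: the element is found.
qualified = root.findall(".//ua:Description[@Locale='en']", namespaces={"ua": NS})

print(len(unqualified), [d.text for d in qualified])
```

If this is what happens in the question's file, the final `Description` step would need the same namespace treatment as the `Reference` step, either through a prefix mapping or another `local-name()` test.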

HTML Table to List Parsing - <TBODY> monkey wrench for both xml and lxml

Question: I read the answers to "Parse HTML table to Python list?" and tried to use the ideas to read and process local HTML files downloaded from a web site (each file contains one table and starts with the <table class="table"> tag). I ran into problems due to the presence of two HTML tags. With the <thead> tag, the parser doesn't pick up the header, and the <tbody> tag causes both xml and lxml to fail completely. I tried googling for a solution but the answer most likely is embedded in some documentation
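A possible way around the wrapper elements is to iterate over every `tr` descendant instead of only the table's direct children, so that `<thead>` and `<tbody>` no longer matter. The sketch below assumes a small, well-formed sample table (made up here); real downloaded HTML is often not well-formed XML and may need an HTML parser such as `lxml.html` instead.

```python
try:
    from lxml import etree
except ImportError:
    import xml.etree.ElementTree as etree

# Hypothetical, well-formed sample; the asker's files are not shown.
table_html = """<table class="table">
  <thead><tr><th>Name</th><th>Qty</th></tr></thead>
  <tbody>
    <tr><td>spam</td><td>2</td></tr>
    <tr><td>eggs</td><td>12</td></tr>
  </tbody>
</table>"""

table = etree.fromstring(table_html)

# iter("tr") walks all descendants, so rows inside thead/tbody are included.
rows = [
    [cell.text for cell in tr if cell.tag in ("th", "td")]
    for tr in table.iter("tr")
]
print(rows)
```

The first inner list holds the header cells, so it can be split off as `rows[0]` when the header and the data need to be handled separately.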

Facing problem regarding installation of lxml in venv,

Question: I am trying to set up evalai-cli using pip, but I run into problems during setup when I run pip install evalai:

Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
ERROR: Command errored out with exit status 1:
command: 'c:\users\amana\evalai-cli\venv\scripts\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\amana\AppData\Local\Temp\pip-install-iwb_ci9r\lxml\setup.py'"'"'; file ='"'"'C:\Users\amana\AppData\Local\Temp\pip