lxml

8. Regular Expressions and XPath

徘徊边缘 submitted on 2020-01-24 04:56:36
1. Scraping Neihan Duanzi jokes with a regular expression

import requests
import re

def loadPage(page):
    url = "http://www.neihan8.com/article/list_5_" + page + ".html"
    # User-Agent header
    user_agent = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)'
    headers = {'User-Agent': user_agent}
    response = requests.get(url, headers=headers)
    response.encoding = 'gbk'
    html = response.text
    return html

if __name__ == "__main__":
    page = input('Enter the page number to scrape: ')
    html = loadPage(page)
    # with open('a.html', 'w') as f:
    #     f.write(html)
    # Find all joke bodies: <div class="f18 mb20"></div>
    # Without re.S, '.' does not match a newline, so a match must fit on one line; if a line has no match, matching starts over on the next line
    # With re.S, the whole string is matched as a single unit; find (.*?
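The effect of re.S described in the comments can be checked without any network access. The following is a minimal sketch: the pattern and the sample HTML are assumptions made for illustration (the original snippet is cut off before its own pattern is shown), and only the `f18 mb20` class name comes from the post.

```python
import re

# Hypothetical sample page: the first joke spans several lines, the second does not.
SAMPLE_HTML = '''<div class="f18 mb20">
first joke, line one
line two
</div>
<div class="f18 mb20">second joke</div>'''

# Assumed pattern: a non-greedy capture between the opening and closing div tags.
PATTERN = r'<div class="f18 mb20">(.*?)</div>'

# With re.S (re.DOTALL), '.' also matches newlines, so multi-line bodies are captured.
with_flag = [body.strip() for body in re.findall(PATTERN, SAMPLE_HTML, re.S)]

# Without re.S, '.' stops at a newline, so only the single-line div can match.
without_flag = re.findall(PATTERN, SAMPLE_HTML)

print(with_flag)
print(without_flag)
```

A regular expression is enough for a page with a fixed layout like this one; for nested or irregular markup an HTML parser is usually the more robust choice.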

Python lxml - Append new data to an existing XML

巧了我就是萌 submitted on 2020-01-23 05:42:11
Question: I am new to Python/lxml. After reading the lxml site and diving into Python, I could not find the solution to my n00b troubles. I have the XML sample below:

<addressbook>
  <person>
    <name>Eric Idle</name>
    <phone type='fix'>999-999-999</phone>
    <phone type='mobile'>555-555-555</phone>
    <address>
      <street>12, spam road</street>
      <city>London</city>
      <zip>H4B 1X3</zip>
    </address>
  </person>
</addressbook>

I am trying to append one child to the root element and
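One common way to add a child under the root is `etree.SubElement`, sketched below under a few assumptions: the new person's name and phone number are invented for illustration, and the import falls back to the standard library's ElementTree (which offers the same calls used here) so the sketch still runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

ADDRESSBOOK = """<addressbook>
  <person>
    <name>Eric Idle</name>
    <phone type='fix'>999-999-999</phone>
    <phone type='mobile'>555-555-555</phone>
  </person>
</addressbook>"""

root = etree.fromstring(ADDRESSBOOK)

# SubElement creates the element and appends it to the given parent in one step.
person = etree.SubElement(root, "person")
etree.SubElement(person, "name").text = "John Cleese"  # hypothetical data
phone = etree.SubElement(person, "phone", {"type": "mobile"})
phone.text = "111-111-111"  # hypothetical data

names = [p.findtext("name") for p in root.findall("person")]
print(names)
print(etree.tostring(root).decode())
```

To persist the result, the tree can be written back with `etree.ElementTree(root).write(path)`, where `path` is whatever file the address book was loaded from.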

Parsing a partial XML with python lxml

穿精又带淫゛_ submitted on 2020-01-21 05:25:07
Question: I'm trying to parse a large XML file that is being received from the network in Python. To do that, I get the data and pass it to lxml.etree.iterparse. However, if the XML has not yet been fully sent, like so: <MyXML> <MyNode foo="bar"> <MyNode foo="ba and I run etree.iterparse(f, tag='MyNode').next(), I get an XMLSyntaxError at wherever it was cut off. Is there any way to receive the first tag (i.e. the first MyNode) and only get an exception when I reach that part
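One possible approach is a push-style parser that is fed data as it arrives, so that elements are reported before the document is complete. The sketch below uses `XMLPullParser` with `start` events; the chunk boundaries are an assumption for illustration, and the import falls back to the standard library's ElementTree, which provides the same class, when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical network chunks; the second one is cut off mid-attribute.
chunks = [b'<MyXML> <MyNode foo="bar">', b' <MyNode foo="ba']

parser = etree.XMLPullParser(events=("start",))
seen = []
for chunk in chunks:
    parser.feed(chunk)  # incomplete trailing markup is buffered, not rejected
    for event, elem in parser.read_events():
        if elem.tag == "MyNode":
            seen.append(elem.get("foo"))

# The error only surfaces when the parser is told that no more data will come.
truncated = False
try:
    parser.close()
except etree.ParseError:
    truncated = True

print(seen, truncated)
```

With `start` events the element's attributes are available but its children and text may not be; if the full subtree is needed, `end` events are the safer choice, at the cost of waiting for the closing tag.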

AWS Lambda not importing LXML

别来无恙 submitted on 2020-01-21 03:11:35
Question: I am trying to use the lxml module within AWS Lambda and having no luck. I installed lxml into my Lambda function deployment package with the following command: pip install lxml -t folder. I zipped up the contents of my Lambda function as I have done with all my other Lambda functions and uploaded it to AWS Lambda. However, no matter what I try, I get this error when I run the function: Unable to import module 'handler': /var/task/lxml/etree.so: undefined symbol: PyFPE_jbuf When I

python [lxml] - cleaning out html tags

大城市里の小女人 submitted on 2020-01-19 05:42:53
Question:

import sys
from lxml.html.clean import clean_html, Cleaner

def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True, links=True, style=True, remove_tags=['a', 'li', 'td'])
        # Report how much the text shrank (negative when markup was removed)
        print(len(cleaner.clean_html(text)) - len(text))
        return cleaner.clean_html(text)
    except Exception:
        print('Error in clean_html')
        print(sys.exc_info())
        return text

I put together the above (ugly) code as my initial foray into Python land. I'm trying to use the lxml cleaner to clean out a couple of html

python [lxml] - cleaning out html tags

岁酱吖の submitted on 2020-01-19 05:42:34
Question:

import sys
from lxml.html.clean import clean_html, Cleaner

def clean(text):
    try:
        cleaner = Cleaner(scripts=True, embedded=True, meta=True, page_structure=True, links=True, style=True, remove_tags=['a', 'li', 'td'])
        # Report how much the text shrank (negative when markup was removed)
        print(len(cleaner.clean_html(text)) - len(text))
        return cleaner.clean_html(text)
    except Exception:
        print('Error in clean_html')
        print(sys.exc_info())
        return text

I put together the above (ugly) code as my initial foray into Python land. I'm trying to use the lxml cleaner to clean out a couple of html

Split long XML tags in multiple lines with lxml

时光毁灭记忆、已成空白 submitted on 2020-01-17 07:51:39
Question: My Python (2.7) script outputs the following XML using the lxml library: <Button android:id="@+id/button1" android:layout_width="wrap_content" android:layout_height="wrap_content" android:layout_marginLeft="17dp" android:layout_marginTop="16dp" android:text="Button"/> I would like to output it on multiple lines, one per attribute: <Button android:id="@+id/button1" android:layout_width="wrap_content" android:layout_height="wrap_content" android:layout_marginLeft="17dp" android:layout
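The requested layout can be produced by formatting the start tag by hand. The sketch below is only that: the helper names `prefixed` and `empty_tag_multiline`, the `PREFIXES` mapping, and the enclosing `LinearLayout` element are hypothetical, and the helper handles only empty (self-closing) elements whose namespace prefixes are declared on an ancestor. The import falls back to the standard library's ElementTree when lxml is not installed.

```python
from xml.sax.saxutils import quoteattr

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical mapping from namespace URI to the prefix to print.
PREFIXES = {"http://schemas.android.com/apk/res/android": "android"}

def prefixed(name, prefixes):
    """Turn '{uri}local' into 'prefix:local'; leave plain names unchanged."""
    if name.startswith("{"):
        uri, local = name[1:].split("}", 1)
        return prefixes[uri] + ":" + local
    return name

def empty_tag_multiline(elem, prefixes, indent="    "):
    """Format an empty element with one attribute per line."""
    lines = ["<" + prefixed(elem.tag, prefixes)]
    for name, value in elem.attrib.items():
        lines.append(indent + prefixed(name, prefixes) + "=" + quoteattr(value))
    return "\n".join(lines) + "/>"

LAYOUT = (
    '<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android">'
    '<Button android:id="@+id/button1" android:layout_width="wrap_content" '
    'android:text="Button"/>'
    '</LinearLayout>'
)

button = etree.fromstring(LAYOUT).find("Button")
formatted = empty_tag_multiline(button, PREFIXES)
print(formatted)
```

Elements with children or text would need a recursive version of the helper; this one deliberately stops at the single-tag case shown in the question.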

Is it a xpath (lxml) bug?

我的未来我决定 submitted on 2020-01-17 06:07:06
Question: I have this XPath: //*[namespace-uri() = 'http://foundation.org/UA/2011/03/NodeSet.xsd'][local-name() = 'Reference'][@ReferenceType = 'HasNotifier']/../../Description[@Locale="en"] but it doesn't work with this XML file. Maybe it is my mistake, or maybe it is an lxml bug; I don't know. I have spent a few days trying to write a correct XPath expression, but unfortunately I can't get it right :( Is it an lxml bug or my mistake? What I want: if "HasNotifier" is present, print "002CC-ESSO01.(WAAA05.01?1)". My XML File
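One possible cause, offered as a guess because the XML file is not shown: in XPath 1.0 an unprefixed name such as `Description` selects only elements in no namespace, so it cannot match an element that inherits a default namespace from the document. The sketch below illustrates this with a hypothetical document (the element names `UANodeSet`, `UAObject`, and `References` and all text values are invented) and uses `findall` with a prefix mapping. The import falls back to the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# The prefix "ua" is arbitrary; only the URI has to match the document.
NS = {"ua": "http://foundation.org/UA/2011/03/NodeSet.xsd"}

# Hypothetical document that puts every element in a default namespace.
SAMPLE = """<UANodeSet xmlns="http://foundation.org/UA/2011/03/NodeSet.xsd">
  <UAObject>
    <Description Locale="en">example description</Description>
    <Description Locale="de">Beispiel</Description>
    <References>
      <Reference ReferenceType="HasNotifier">target</Reference>
    </References>
  </UAObject>
</UANodeSet>"""

root = etree.fromstring(SAMPLE)

# Unprefixed name: looks for Description in no namespace and finds nothing.
unprefixed = root.findall(".//Description")

# Prefixed names bound to the document's namespace do match.
path = (".//ua:Reference[@ReferenceType='HasNotifier']"
        "/../../ua:Description[@Locale='en']")
texts = [elem.text for elem in root.findall(path, NS)]

print(unprefixed, texts)
```

With lxml the same prefix mapping can be passed to `root.xpath(expr, namespaces=NS)` for full XPath 1.0 expressions.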

HTML Table to List Parsing - <TBODY> monkey wrench for both xml and lxml

一世执手 submitted on 2020-01-16 18:44:06
Question: I read the answers to "Parse HTML table to Python list?" and tried to use the ideas to read and process local HTML files that I downloaded from a web site (each file contains one table and starts with the <table class="table"> tag). I ran into problems because of two HTML tags. With the <thead> tag the parser doesn't pick up the header, and the <tbody> tag causes both xml and lxml to fail completely. I tried googling for a solution, but the answer is most likely embedded in some documentation
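One way to avoid depending on <thead> and <tbody> is to iterate over every `tr` descendant of the table, whatever wrapper it sits in. The sketch below assumes a small, well-formed table invented for illustration; real-world HTML is often not well-formed XML, in which case `lxml.html.fromstring` is the more tolerant entry point. The import falls back to the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical table with the same wrapper tags the question mentions.
TABLE = """<table class="table">
<thead><tr><th>Name</th><th>Qty</th></tr></thead>
<tbody>
<tr><td>apple</td><td>3</td></tr>
<tr><td>pear</td><td>5</td></tr>
</tbody>
</table>"""

table = etree.fromstring(TABLE)

# iter("tr") walks all descendants, so rows inside thead and tbody are both found.
rows = [
    ["".join(cell.itertext()).strip() for cell in tr if cell.tag in ("th", "td")]
    for tr in table.iter("tr")
]

header, body = rows[0], rows[1:]
print(header)
print(body)
```

Joining `itertext()` rather than reading `cell.text` keeps the text of cells that contain inline markup such as links or bold tags.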

Facing a problem installing lxml in a venv

允我心安 submitted on 2020-01-16 08:46:13
Question: I am trying to set up evalai-cli using pip, but I run into problems during setup when I run pip install evalai:
Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed?
ERROR: Command errored out with exit status 1: command: 'c:\users\amana\evalai-cli\venv\scripts\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\amana\AppData\Local\Temp\pip-install-iwb_ci9r\lxml\setup.py'"'"'; file ='"'"'C:\Users\amana\AppData\Local\Temp\pip