lxml

Learning Python crawlers: scraping photos of beautiful women

佐手、 submitted on 2020-07-28 01:58:15
When learning Python, a crawler is probably the first thing you pick up, and with some spare time I looked around for something worth scraping. It turns out scraping photos of beautiful women is all the rage lately, so without further ado, let's get to it. First take a look at what the site looks like, then at its HTML structure. Once the HTML structure is clear, let's get to work: create a .py file and import the third-party packages urllib.request, BeautifulSoup and os. 1. A method to save the files 2. Define the request headers 3. Analyze the page 4. Main function 5. Results 6. Full source code import urllib.request from bs4 import BeautifulSoup import os def Download(url, picAlt, name): path = ' D:\\tupian\\ ' + picAlt + ' \\ ' # create the directory if it does not already exist if not os.path.exists(path): os.makedirs(path) # download the image and save it locally urllib.request.urlretrieve(url, ' {0}{1}.jpg ' .format(path, name)) # define the request headers header = { " User-Agent " : ' Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like
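A cleaned-up, runnable sketch of the file-saving helper and request-header setup from the source code above (the D:\tupian\ directory and the picAlt/name arguments follow the post; the full User-Agent string is a generic placeholder, since the excerpt is cut off mid-value):

    import os
    import urllib.request

    # A browser-like User-Agent (generic placeholder value) installed for all
    # urllib requests, so the site serves the images to our script.
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent',
                          'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)')]
    urllib.request.install_opener(opener)

    def Download(url, picAlt, name):
        # Save directory as in the post; create it if it does not exist yet.
        path = 'D:\\tupian\\' + picAlt + '\\'
        if not os.path.exists(path):
            os.makedirs(path)
        # Fetch the image and store it as <path><name>.jpg.
        urllib.request.urlretrieve(url, '{0}{1}.jpg'.format(path, name))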

Keep lxml from creating self-closing tags

白昼怎懂夜的黑 submitted on 2020-07-20 10:38:11
Question I have an (old) tool which does not understand self-closing tags like <STATUS/>. So we need to serialize our XML files with opened/closed tags like this: <STATUS></STATUS>. Currently I have: >>> from lxml import etree >>> para = """<ERROR>The status is <STATUS></STATUS>.</ERROR>""" >>> tree = etree.XML(para) >>> etree.tostring(tree) '<ERROR>The status is <STATUS/>.</ERROR>' How can I serialize with opened/closed tags? <ERROR>The status is <STATUS></STATUS>.</ERROR> Solution Given by
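The excerpt stops before the answer; one well-known approach is to give every empty element an empty-string text, since lxml only writes the self-closing form when .text is None. A minimal sketch:

    from lxml import etree

    para = "<ERROR>The status is <STATUS></STATUS>.</ERROR>"
    tree = etree.XML(para)

    # Elements whose text is None are serialized as <STATUS/>;
    # an empty string forces the <STATUS></STATUS> form instead.
    for el in tree.iter():
        if el.text is None:
            el.text = ''

    print(etree.tostring(tree))
    # b'<ERROR>The status is <STATUS></STATUS>.</ERROR>'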

Remove ns0 from XML

放肆的年华 submitted on 2020-07-20 07:28:51
Question I have an XML file where I would like to edit certain attributes. I am able to edit the attributes properly, but when I write the changes to the file, the tags have a strange "ns0" added onto them. How can I get rid of this? This is what I have tried, without success. I am working in Python and using lxml. import xml.etree.ElementTree as ET from xml.etree import ElementTree as etree from lxml import etree, objectify frag_xml_tree = ET.parse(xml_name) frag_root = frag_xml_tree.getroot
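The excerpt cuts off before the attempted fix; the ns0 prefix appears because ElementTree invents a prefix for a namespace it has no name for, and registering that namespace with an empty prefix before writing usually removes it. A sketch using only the standard-library ElementTree the question starts from, with a made-up namespace URI and file names:

    import xml.etree.ElementTree as ET

    # Hypothetical default namespace of the file being edited.
    NS = 'http://example.com/schema'

    # Map the namespace to the empty prefix so elements are written
    # as <tag> instead of <ns0:tag>.
    ET.register_namespace('', NS)

    tree = ET.parse('input.xml')
    # ... edit attributes here ...
    tree.write('output.xml', xml_declaration=True, encoding='utf-8')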

Python lxml - How to remove empty repeated tags

浪子不回头ぞ submitted on 2020-07-17 11:24:49
Question I have some XML that is generated by a script that may or may not have empty elements. I was told that we can no longer have empty elements in the XML. Here is an example: <customer> <govId> <id>@</id> <idType>SSN</idType> <issueDate/> <expireDate/> <dob/> <state/> <county/> <country/> </govId> <govId> <id/> <idType/> <issueDate/> <expireDate/> <dob/> <state/> <county/> <country/> </govId> </customer> The output should look like this: <customer> <govId> <id>@</id> <idType>SSN</idType> </govId> <
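The excerpt is truncated, but one common way to get from the first snippet to the second is to delete every element that has no children, no attributes and no text, repeating until parents that become empty (like the second govId) are gone too. A sketch with lxml:

    from lxml import etree

    tree = etree.parse('customer.xml')   # placeholder input file
    root = tree.getroot()

    # Repeatedly remove elements with no child elements, no attributes and
    # no non-whitespace text, so parents emptied by a pass are caught next.
    removed = True
    while removed:
        removed = False
        for el in root.xpath('//*[not(*) and not(@*) and not(normalize-space())]'):
            parent = el.getparent()
            if parent is not None:
                parent.remove(el)
                removed = True

    print(etree.tostring(root, pretty_print=True).decode())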

lxml / BeautifulSoup parser warning

好久不见. submitted on 2020-07-17 10:26:49
Question Using Python 3, I'm trying to parse ugly HTML (which is not under my control) by using lxml with BeautifulSoup as explained here: http://lxml.de/elementsoup.html Specifically, I want to use lxml, but I'd like to use BeautifulSoup because, like I said, it's ugly HTML and lxml will reject it on its own. The link above says: "All you need to do is pass it to the fromstring() function:" from lxml.html.soupparser import fromstring root = fromstring(tag_soup) So that's what I'm doing: URL = 'http:/
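The excerpt ends before the code completes; a minimal sketch of the intended flow, assuming the warning is BeautifulSoup's "no parser was explicitly specified" notice: fetch the page yourself, hand the HTML text (not the URL) to fromstring, and name a parser via the keyword arguments that soupparser forwards to BeautifulSoup (the URL below is a placeholder):

    import requests
    from lxml.html.soupparser import fromstring

    URL = 'http://example.com/ugly.html'   # placeholder for the real page

    # Download the page and pass its HTML text to the soup-based parser;
    # naming the parser explicitly avoids the bs4 parser warning.
    html_text = requests.get(URL).text
    root = fromstring(html_text, features='html.parser')

    print(root.tag)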

lxml: write an incremental pretty print xml

狂风中的少年 submitted on 2020-06-16 05:18:51
Question I'm working with very large XML files (>1GB) and require a method to write them incrementally. There's a single top-level element and thousands of large 2nd-level elements (each has its own multi-level hierarchy). I tried this: from lxml import etree with etree.xmlfile(out_file_name, encoding = 'UTF-8') as xf: xf.write_declaration() with xf.element('top'): xf.write('\n') # parse individual input files and write the 2nd level element to the output for file_name in file_list: context = etree
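The excerpt shows the start of an etree.xmlfile block; a self-contained sketch of the same incremental-writing pattern, with placeholder input files, that streams each 2nd-level element to the output as soon as it has been parsed:

    from lxml import etree

    file_list = ['part1.xml', 'part2.xml']   # placeholder input files
    out_file_name = 'combined.xml'

    with etree.xmlfile(out_file_name, encoding='UTF-8') as xf:
        xf.write_declaration()
        with xf.element('top'):
            for file_name in file_list:
                # Load one input document and stream its root element out as
                # the next 2nd-level element; only one tree is in memory at a time.
                elem = etree.parse(file_name).getroot()
                xf.write(elem)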

Can't install lxml on python 3.5

不打扰是莪最后的温柔 submitted on 2020-05-13 06:13:30
Question I searched for a solution to my problem on this and other sites. Unfortunately, none of the given solutions helped me. My problem is the following: I'd like to install the Python lxml library, but every time I get errors. 1) When I try to install lxml in PyCharm, I get the following error: ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n" 2) When I try to install lxml on the command line with pip install lxml, I get the following error: C:

Parsing json var inside script tag

北慕城南 submitted on 2020-05-10 03:26:56
Question I'm currently trying to scrape the JSON output of the following page 'https://sports.bovada.lv/soccer/premier-league'. Its source contains <script type="text/javascript">var swc_market_lists = {"items":[{"description":"Game Lines","id":"23", ... </script> I'm trying to get the contents of the swc_market_lists var. Now the issue I have is that when I use the following code: import requests from lxml import html url = 'https://sports.bovada.lv/soccer/premier-league' r = requests.get(url)
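The excerpt stops mid-code; a sketch of one way to finish the job, assuming the page still embeds the data as a plain var swc_market_lists = {...}; assignment in an inline script: locate the script with XPath, slice out the object literal, and parse it with json.loads:

    import json

    import requests
    from lxml import html

    url = 'https://sports.bovada.lv/soccer/premier-league'
    r = requests.get(url)
    doc = html.fromstring(r.content)

    # Grab the inline <script> that assigns swc_market_lists ...
    script_text = doc.xpath('//script[contains(text(), "swc_market_lists")]/text()')[0]

    # ... and slice out the object literal between the first '{' and the
    # last '}', which json.loads can then parse.
    start = script_text.index('{')
    end = script_text.rindex('}') + 1
    market_lists = json.loads(script_text[start:end])

    print(market_lists['items'][0]['description'])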

Python crawler 07 | With BeautifulSoup, Mom no longer has to worry about my regular expressions

非 Y 不嫁゛ submitted on 2020-05-08 05:07:00
Last time we built your very first crawler and scraped Dangdang's Top 500 five-star-rated books. Some readers felt that using regular expressions to extract the information was far too painful and asked whether there is an easier way to filter out the content we want. Well, as it happens, there is: an efficient HTML parsing library called BeautifulSoup. It is a Python library that can extract data from HTML or XML files. So how do we use it? ... Here comes the right way to learn Python. First we need to install the library: pip install beautifulsoup4. BeautifulSoup supports different parsers, for example for parsing HTML, XML and HTML5; in most cases the one we use is the lxml parser. Let's start with an example so you can get a feel for some of BeautifulSoup's common methods. Say we have a piece of HTML like this: html_doc = """ <html> <head> <title>学习python的正确姿势</title> </head> <body> <p class="title"> <b>小帅b的故事</b> </p> <p class="story"> 有一天,小帅b想给大家讲两个笑话 <a href="http:/
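A small runnable sketch of the kind of BeautifulSoup usage the post goes on to show, using the lxml parser mentioned above; the html_doc string is a completed guess at the excerpt's example (the closing tags and the link URL are placeholders):

    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>学习python的正确姿势</title></head>
    <body>
    <p class="title"><b>小帅b的故事</b></p>
    <p class="story">有一天,小帅b想给大家讲两个笑话
    <a href="http://example.com/1" id="link1">link</a></p>
    </body></html>
    """

    # Parse with the lxml parser (requires: pip install beautifulsoup4 lxml).
    soup = BeautifulSoup(html_doc, 'lxml')

    print(soup.title.string)    # text of the <title> tag
    print(soup.p['class'])      # attribute lookup on the first <p>
    print(soup.find_all('a'))   # every <a> element
    print(soup.get_text())      # all of the document's text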