lxml

Learning Python crawlers: scraping photos of beautiful women

佐手、 submitted on 2020-07-28 01:58:15
When learning Python, a crawler is probably the first thing you pick up, and with some spare time I looked around for something worth scraping. It turns out scraping photos of beautiful women is all the rage lately, so without further ado, let's get to it. First take a look at what the site looks like, then at its HTML structure. Once the HTML structure is clear, let's get to work: create a .py file and import the third-party packages urllib.request, BeautifulSoup and os. 1. A method to save the files 2. Define the request headers 3. Analyze the page 4. Main function 5. Results 6. Full source code import urllib.request from bs4 import BeautifulSoup import os def Download(url, picAlt, name): path = ' D:\\tupian\\ ' + picAlt + ' \\ ' # create the directory if it does not already exist if not os.path.exists(path): os.makedirs(path) # download the image and save it locally urllib.request.urlretrieve(url, ' {0}{1}.jpg ' .format(path, name)) # define the request headers header = { " User-Agent " : ' Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like
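A cleaned-up, runnable sketch of the file-saving helper and request-header setup from the source code above (the D:\tupian\ directory and the picAlt/name arguments follow the post; the full User-Agent string is a generic placeholder, since the excerpt is cut off mid-value):

    import os
    import urllib.request

    # A browser-like User-Agent (generic placeholder value) installed for all
    # urllib requests, so the site serves the images to our script.
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent',
                          'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)')]
    urllib.request.install_opener(opener)

    def Download(url, picAlt, name):
        # Save directory as in the post; create it if it does not exist yet.
        path = 'D:\\tupian\\' + picAlt + '\\'
        if not os.path.exists(path):
            os.makedirs(path)
        # Fetch the image and store it as <path><name>.jpg.
        urllib.request.urlretrieve(url, '{0}{1}.jpg'.format(path, name))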

Keep lxml from creating self-closing tags

白昼怎懂夜的黑 submitted on 2020-07-20 10:38:11
Question I have an (old) tool which does not understand self-closing tags like <STATUS/>. So we need to serialize our XML files with opened/closed tags like this: <STATUS></STATUS>. Currently I have: >>> from lxml import etree >>> para = """<ERROR>The status is <STATUS></STATUS>.</ERROR>""" >>> tree = etree.XML(para) >>> etree.tostring(tree) '<ERROR>The status is <STATUS/>.</ERROR>' How can I serialize with opened/closed tags? <ERROR>The status is <STATUS></STATUS>.</ERROR> Solution Given by
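The excerpt stops before the answer; one well-known approach is to give every empty element an empty-string text, since lxml only writes the self-closing form when .text is None. A minimal sketch:

    from lxml import etree

    para = "<ERROR>The status is <STATUS></STATUS>.</ERROR>"
    tree = etree.XML(para)

    # Elements whose text is None are serialized as <STATUS/>;
    # an empty string forces the <STATUS></STATUS> form instead.
    for el in tree.iter():
        if el.text is None:
            el.text = ''

    print(etree.tostring(tree))
    # b'<ERROR>The status is <STATUS></STATUS>.</ERROR>'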

Remove ns0 from XML

放肆的年华 submitted on 2020-07-20 07:28:51
Question I have an XML file where I would like to edit certain attributes. I am able to edit the attributes properly, but when I write the changes to the file, the tags have a strange "ns0" added onto them. How can I get rid of this? This is what I have tried, without success. I am working in Python and using lxml. import xml.etree.ElementTree as ET from xml.etree import ElementTree as etree from lxml import etree, objectify frag_xml_tree = ET.parse(xml_name) frag_root = frag_xml_tree.getroot
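The excerpt cuts off before the attempted fix; the ns0 prefix appears because ElementTree invents a prefix for a namespace it has no name for, and registering that namespace with an empty prefix before writing usually removes it. A sketch using only the standard-library ElementTree the question starts from, with a made-up namespace URI and file names:

    import xml.etree.ElementTree as ET

    # Hypothetical default namespace of the file being edited.
    NS = 'http://example.com/schema'

    # Map the namespace to the empty prefix so elements are written
    # as <tag> instead of <ns0:tag>.
    ET.register_namespace('', NS)

    tree = ET.parse('input.xml')
    # ... edit attributes here ...
    tree.write('output.xml', xml_declaration=True, encoding='utf-8')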

Python lxml - How to remove empty repeated tags

浪子不回头ぞ submitted on 2020-07-17 11:24:49
Question I have some XML that is generated by a script that may or may not have empty elements. I was told that we can no longer have empty elements in the XML. Here is an example: <customer> <govId> <id>@</id> <idType>SSN</idType> <issueDate/> <expireDate/> <dob/> <state/> <county/> <country/> </govId> <govId> <id/> <idType/> <issueDate/> <expireDate/> <dob/> <state/> <county/> <country/> </govId> </customer> The output should look like this: <customer> <govId> <id>@</id> <idType>SSN</idType> </govId> <
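The excerpt is truncated, but one common way to get from the first snippet to the second is to delete every element that has no children, no attributes and no text, repeating until parents that become empty (like the second govId) are gone too. A sketch with lxml:

    from lxml import etree

    tree = etree.parse('customer.xml')   # placeholder input file
    root = tree.getroot()

    # Repeatedly remove elements with no child elements, no attributes and
    # no non-whitespace text, so parents emptied by a pass are caught next.
    removed = True
    while removed:
        removed = False
        for el in root.xpath('//*[not(*) and not(@*) and not(normalize-space())]'):
            parent = el.getparent()
            if parent is not None:
                parent.remove(el)
                removed = True

    print(etree.tostring(root, pretty_print=True).decode())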

lxml / BeautifulSoup parser warning

好久不见. submitted on 2020-07-17 10:26:49
Question Using Python 3, I'm trying to parse ugly HTML (which is not under my control) by using lxml with BeautifulSoup as explained here: http://lxml.de/elementsoup.html Specifically, I want to use lxml, but I'd like to use BeautifulSoup because, like I said, it's ugly HTML and lxml will reject it on its own. The link above says: "All you need to do is pass it to the fromstring() function:" from lxml.html.soupparser import fromstring root = fromstring(tag_soup) So that's what I'm doing: URL = 'http:/
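The excerpt ends before the code completes; a minimal sketch of the intended flow, assuming the warning is BeautifulSoup's "no parser was explicitly specified" notice: fetch the page yourself, hand the HTML text (not the URL) to fromstring, and name a parser via the keyword arguments that soupparser forwards to BeautifulSoup (the URL below is a placeholder):

    import requests
    from lxml.html.soupparser import fromstring

    URL = 'http://example.com/ugly.html'   # placeholder for the real page

    # Download the page and pass its HTML text to the soup-based parser;
    # naming the parser explicitly avoids the bs4 parser warning.
    html_text = requests.get(URL).text
    root = fromstring(html_text, features='html.parser')

    print(root.tag)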

lxml: write an incremental pretty print xml

狂风中的少年 submitted on 2020-06-16 05:18:51
Question I'm working with very large XML files (>1GB) and require a method to write them incrementally. There's a single top-level element and thousands of large 2nd-level elements (each has its own multi-level hierarchy). I tried this: from lxml import etree with etree.xmlfile(out_file_name, encoding = 'UTF-8') as xf: xf.write_declaration() with xf.element('top'): xf.write('\n') # parse individual input files and write the 2nd level element to the output for file_name in file_list: context = etree
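The excerpt shows the start of an etree.xmlfile block; a self-contained sketch of the same incremental-writing pattern, with placeholder input files, that streams each 2nd-level element to the output as soon as it has been parsed:

    from lxml import etree

    file_list = ['part1.xml', 'part2.xml']   # placeholder input files
    out_file_name = 'combined.xml'

    with etree.xmlfile(out_file_name, encoding='UTF-8') as xf:
        xf.write_declaration()
        with xf.element('top'):
            for file_name in file_list:
                # Load one input document and stream its root element out as
                # the next 2nd-level element; only one tree is in memory at a time.
                elem = etree.parse(file_name).getroot()
                xf.write(elem)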

Can't install lxml on python 3.5

不打扰是莪最后的温柔 submitted on 2020-05-13 06:13:30
Question I searched for a solution to my problem on this and other sites. Unfortunately, none of the given solutions helped me. My problem is the following: I'd like to install the Python lxml library, but every time I get errors. 1) When I try to install lxml in PyCharm, I get the following error: ERROR: b"'xslt-config' is not recognized as an internal or external command,\r\noperable program or batch file.\r\n" 2) When I try to install lxml on the command line with pip install lxml, I get the following error: C:

Parsing json var inside script tag

北慕城南 submitted on 2020-05-10 03:26:56
Question I'm currently trying to scrape the JSON output of the following page 'https://sports.bovada.lv/soccer/premier-league'. Its source contains <script type="text/javascript">var swc_market_lists = {"items":[{"description":"Game Lines","id":"23", ... </script> I'm trying to get the contents of the swc_market_lists var. Now the issue I have is that when I use the following code: import requests from lxml import html url = 'https://sports.bovada.lv/soccer/premier-league' r = requests.get(url)
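The excerpt stops mid-code; a sketch of one way to finish the job, assuming the page still embeds the data as a plain var swc_market_lists = {...}; assignment in an inline script: locate the script with XPath, slice out the object literal, and parse it with json.loads:

    import json

    import requests
    from lxml import html

    url = 'https://sports.bovada.lv/soccer/premier-league'
    r = requests.get(url)
    doc = html.fromstring(r.content)

    # Grab the inline <script> that assigns swc_market_lists ...
    script_text = doc.xpath('//script[contains(text(), "swc_market_lists")]/text()')[0]

    # ... and slice out the object literal between the first '{' and the
    # last '}', which json.loads can then parse.
    start = script_text.index('{')
    end = script_text.rindex('}') + 1
    market_lists = json.loads(script_text[start:end])

    print(market_lists['items'][0]['description'])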

Python crawler 07 | With BeautifulSoup, Mom no longer has to worry about my regular expressions

非 Y 不嫁゛ submitted on 2020-05-08 05:07:00
Last time we built your very first crawler and scraped Dangdang's Top 500 five-star-rated books. Some readers felt that using regular expressions to extract the information was far too painful and asked whether there is an easier way to filter out the content we want. Well, as it happens, there is: an efficient HTML parsing library called BeautifulSoup. It is a Python library that can extract data from HTML or XML files. So how do we use it? ... Here comes the right way to learn Python. First we need to install the library: pip install beautifulsoup4. BeautifulSoup supports different parsers, for example for parsing HTML, XML and HTML5; in most cases the one we use is the lxml parser. Let's start with an example so you can get a feel for some of BeautifulSoup's common methods. Say we have a piece of HTML like this: html_doc = """ <html> <head> <title>学习python的正确姿势</title> </head> <body> <p class="title"> <b>小帅b的故事</b> </p> <p class="story"> 有一天,小帅b想给大家讲两个笑话 <a href="http:/
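A small runnable sketch of the kind of BeautifulSoup usage the post goes on to show, using the lxml parser mentioned above; the html_doc string is a completed guess at the excerpt's example (the closing tags and the link URL are placeholders):

    from bs4 import BeautifulSoup

    html_doc = """
    <html><head><title>学习python的正确姿势</title></head>
    <body>
    <p class="title"><b>小帅b的故事</b></p>
    <p class="story">有一天,小帅b想给大家讲两个笑话
    <a href="http://example.com/1" id="link1">link</a></p>
    </body></html>
    """

    # Parse with the lxml parser (requires: pip install beautifulsoup4 lxml).
    soup = BeautifulSoup(html_doc, 'lxml')

    print(soup.title.string)    # text of the <title> tag
    print(soup.p['class'])      # attribute lookup on the first <p>
    print(soup.find_all('a'))   # every <a> element
    print(soup.get_text())      # all of the document's text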