lxml

How do I scrape an https page? [duplicate]

时光总嘲笑我的痴心妄想 submitted on 2019-12-06 09:05:43
This question already has answers here: Python Requests throwing SSLError (22 answers). Closed 5 years ago. I'm using a Python script with 'lxml' and 'requests' to scrape a web page. My goal is to grab an element from a page and download it, but the content is on an HTTPS page and I'm getting an error when trying to access it. I'm sure there is some kind of certificate or authentication I have to include, but I'm struggling to find the right resources. I'm using: page = requests.get("https://[example-page.com]", auth=('[username]','[password]')) and the error is: requests
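The linked duplicate points at SSL certificate verification. A minimal sketch of the usual fixes, assuming the URL and credentials above are placeholders: point requests at a CA bundle that can verify the site's certificate chain, or, for a quick test only, disable verification.

import requests

url = "https://example-page.com"      # placeholder
auth = ("username", "password")       # placeholder

# Preferred: verify the server certificate against a CA bundle you trust.
page = requests.get(url, auth=auth, verify="/path/to/ca-bundle.pem")

# Quick test only (insecure): skip certificate verification entirely.
# page = requests.get(url, auth=auth, verify=False)

print(page.status_code)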

Iteratively parse a large XML file without using the DOM approach

核能气质少年 submitted on 2019-12-06 08:51:58
Question: I have an XML file: <temp> <email id="1" Body="abc"/> <email id="2" Body="fre"/> . . <email id="998349883487454359203" Body="hi"/> </temp> I want to read the XML file one email tag at a time: read email id=1 and extract its Body, then read email id=2 and extract its Body, and so on. I tried to do this with the DOM model for XML parsing, but since my file size is 100 GB that approach does not work. I then tried using: from xml.etree import ElementTree as ET tree=ET.parse(
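ET.parse() builds the whole tree in memory, which is exactly what fails on a 100 GB file. A minimal sketch of the usual streaming alternative, assuming the file is named emails.xml (the filename is an assumption): iterparse yields each element as its closing tag is read, and clearing the element keeps memory use flat.

from xml.etree import ElementTree as ET

# Stream the file element by element instead of loading it all at once.
for event, elem in ET.iterparse("emails.xml", events=("end",)):
    if elem.tag == "email":
        print(elem.get("id"), elem.get("Body"))
        elem.clear()  # discard the processed element so memory stays bounded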

Crawler practice project: grab the top-rated movies on Douban and download them

混江龙づ霸主 submitted on 2019-12-06 08:44:28
Recap: In the previous post we covered the four main Python crawling libraries, urllib, requests, BeautifulSoup, and selenium (an introduction to the commonly used crawler libraries). We went over the common usage of urllib and requests, learned to parse pages with BeautifulSoup, and learned to drive a browser with selenium. # import the web driver module from selenium import webdriver # create a Chrome driver driver = webdriver.Chrome() # open Baidu with the get method driver.get("https://www.baidu.com") # grab the input box and type in what we want to search for input = driver.find_element_by_css_selector('#kw') input.send_keys("波多野结衣照片") # grab the search button and click it button = driver.find_element_by_css_selector('#su') button.click() That was last time's code for looking up the pictures; the result looked like this. Scraping Douban movies and saving them locally: let's scrape the top 250 movies on Douban. import requests from bs4 import BeautifulSoup import
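The excerpt is cut off, but a minimal sketch of the Top 250 scrape it leads into could look like the following. It assumes the listing lives at https://movie.douban.com/top250, is paginated 25 films per page, and puts each title in a <span class="title"> inside a div.item; those are assumptions about Douban's markup, which can change, and the site may require a User-Agent header.

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # Douban tends to reject requests without a UA

titles = []
for start in range(0, 250, 25):          # assumed pagination: 25 films per page
    url = f"https://movie.douban.com/top250?start={start}"
    resp = requests.get(url, headers=headers, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    for span in soup.select("div.item span.title"):  # assumed selector for film titles
        titles.append(span.get_text())

with open("douban_top250.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(titles))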

lxml and proxy IPs

非 Y 不嫁゛ submitted on 2019-12-06 06:54:46
pip install lxml. Import the package: from lxml import etree
1. Note: for a local HTML file you can use etree.parse directly.
2. html_etree = etree.parse("test.html"); print(html_etree)
3. html_etree.xpath("//li") returns all the li elements.
4. Get the class value of every li: html_etree.xpath("//li/@class")
5. Get every span under li: html_etree.xpath("//li//span"). / only selects child elements, and here span is not a direct child of li, so // is needed.
6. Get every class attribute of every a under li: html_etree.xpath("//li/a//@class")
7. html.xpath('//li[last()]/a/@href'). The predicate [last()] selects the last element, so this gets the href attribute of the a in the last li.
8. The second-to-last uses [last()-1], e.g. //li[last()-1]/a gets the content of the second-to-last li's a.
When there is no local file, you can parse the data read from a response directly: html_etree = etree.HTML(html). / selects from the root node; // selects matching nodes anywhere in the document, regardless of their position.
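A self-contained sketch of those XPath calls against an inline HTML snippet (the snippet itself is made up just to exercise the expressions):

from lxml import etree

html = """
<ul>
  <li class="first"><a class="link" href="/a">one <span>1</span></a></li>
  <li class="second"><a class="link" href="/b">two <span>2</span></a></li>
  <li class="third"><a class="link" href="/c">three <span>3</span></a></li>
</ul>
"""

html_etree = etree.HTML(html)                       # parse a string instead of a local file
print(html_etree.xpath("//li"))                     # all li elements
print(html_etree.xpath("//li/@class"))              # ['first', 'second', 'third']
print(html_etree.xpath("//li//span"))               # every span anywhere under a li
print(html_etree.xpath("//li/a//@class"))           # class attributes under each li's a
print(html_etree.xpath("//li[last()]/a/@href"))     # ['/c'], href of the last li's a
print(html_etree.xpath("//li[last()-1]/a/text()"))  # text of the second-to-last li's a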

random text from /dev/random raising an error in lxml: All strings must be XML compatible: Unicode or ASCII, no NULL bytes

时光怂恿深爱的人放手 submitted on 2019-12-06 06:51:40
I am, for the sake of testing my web app, pasting some random characters from /dev/random into my web frontend. This line throws an error: print repr(comment) import html5lib print html5lib.parse(comment, treebuilder="lxml") 'a\xef\xbf\xbd\xef\xbf\xbd\xc9\xb6E\xef\xbf\xbd\xef\xbf\xbd`\xef\xbf\xbd]\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd2 \x14\xef\xbf\xbd\xc7\xbe\xef\xbf\xbdy\xcb\x9c\xef\xbf\xbdi1O\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbdZ\xef\xbf\xbd.\xef\xbf\xbd\x17^C' Unhandled Error Traceback (most recent call last): File "/usr/lib/python2.6/dist-packages/twisted/internet
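The error comes from lxml refusing strings that contain NULL bytes or other characters that are illegal in XML. A minimal Python 3 sketch of one common workaround (the original question is Python 2, and the byte string below is just a stand-in for the pasted comment): decode leniently, strip the forbidden control characters, then parse.

import re
import html5lib

raw = b"a\xef\xbf\xbd\x00\x17 random bytes from /dev/random"  # stand-in input

# Decode leniently, then drop NULLs and other control characters XML cannot hold.
text = raw.decode("utf-8", "replace")
text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)

doc = html5lib.parse(text, treebuilder="lxml")
print(doc)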

How to use BeautifulSoup to parse google search results in Python

独自空忆成欢 submitted on 2019-12-06 05:44:13
I am trying to parse the first page of Google search results, specifically the title and the short summary provided for each result. Here is what I have so far: from urllib.request import urlretrieve import urllib.parse from urllib.parse import urlencode, urlparse, parse_qs import webbrowser from bs4 import BeautifulSoup import requests address = 'https://google.com/#q=' # Default Google search address start file = open( "OCR.txt", "rt" ) # Open text document that contains the question word = file.read() file.close() myList = [item for item in word.split('\n')] newString = ' '.join(myList) # The
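One thing worth noting before parsing anything: 'https://google.com/#q=' puts the query in the URL fragment, which is never sent to the server, so a page fetched that way will not contain the results. A minimal sketch that sends the query to /search instead, assuming result titles are rendered in <h3> tags (Google's markup changes often and automated requests may be blocked, so the selector is an assumption):

from urllib.parse import urlencode
import requests
from bs4 import BeautifulSoup

query = "what is the capital of france"   # stand-in for the text read from OCR.txt
url = "https://www.google.com/search?" + urlencode({"q": query})

resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

# Assumed selector: result titles usually sit inside <h3> tags.
for h3 in soup.find_all("h3"):
    print(h3.get_text())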

How can lxml validate some XML against an XSD file while also loading an inline schema?

你说的曾经没有我的故事 submitted on 2019-12-06 05:08:48
I'm having problems getting lxml to successfully validate some XML. The XSD schema and XML file are both from Amazon documentation, so they should be compatible. But the XML itself refers to another schema that's not being loaded. Here is my code, which is based on the lxml validation tutorial: xsd_doc = etree.parse('ProductImage.xsd') xsd = etree.XMLSchema(xsd_doc) xml = etree.parse('ProductImage_sample.xml') xsd.validate(xml) print xsd.error_log "ProductImage_sample.xml:2:0:ERROR:SCHEMASV:SCHEMAV_CVC_ELT_1: Element 'AmazonEnvelope': No matching global declaration available for the validation root
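The error says the loaded schema (ProductImage.xsd) has no declaration for the document's root element, AmazonEnvelope, which is declared in a separate Amazon schema. One common fix is to validate against the envelope schema itself, since it declares the root and pulls in the per-message schemas via xs:include. A minimal sketch, assuming that file is named amzn-envelope.xsd and sits next to ProductImage.xsd (the filename is an assumption):

from lxml import etree

# Assumption: amzn-envelope.xsd declares AmazonEnvelope and includes
# the per-message schemas such as ProductImage.xsd.
xsd_doc = etree.parse("amzn-envelope.xsd")
xsd = etree.XMLSchema(xsd_doc)

xml = etree.parse("ProductImage_sample.xml")
if not xsd.validate(xml):
    for error in xsd.error_log:
        print(error)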

Python lxml - get index of tag's text

ε祈祈猫儿з submitted on 2019-12-06 04:59:25
Question: I have an XML file with a format similar to docx, i.e.: <w:r> <w:rPr> <w:sz w:val="36"/> <w:szCs w:val="36"/> </w:rPr> <w:t>BIG_TEXT</w:t> </w:r> EDIT: I need to get the index of "BIG_TEXT" in the source XML, like: from lxml import etree text = open('/devel/tmp/doc2/word/document.xml', 'r').read() root = etree.XML(text) start = 0 for e in root.iter("*"): if e.text: offset = text.index(e.text, start) l = len(e.text) print 'Text "%s" at offset %s and len=%s' % (e.text, offset, l) start = offset + l
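A self-contained variant of that loop with a Python 3 print and an inline sample in place of the document.xml path (the namespace URI in the sample is made up so the snippet parses on its own):

from lxml import etree

text = """<w:r xmlns:w="http://example.com/w">
  <w:rPr><w:sz w:val="36"/><w:szCs w:val="36"/></w:rPr>
  <w:t>BIG_TEXT</w:t>
</w:r>"""

root = etree.XML(text)
start = 0
for e in root.iter("*"):
    if e.text and e.text.strip():
        offset = text.index(e.text, start)   # position of this text in the raw source
        length = len(e.text)
        print('Text "%s" at offset %s and len=%s' % (e.text, offset, length))
        start = offset + length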

lxml use namespace instead of ns0, ns1,

巧了我就是萌 submitted on 2019-12-06 04:53:17
Question: I have just started with lxml basics and I am stuck with namespaces. I need to generate XML like this: <CityModel xmlns:bldg="http://www.opengis.net/citygml/building/2.0"> <cityObjectMember> <bldg:Building> <bldg:function>1000</bldg:function> </bldg:Building> </cityObjectMember> </CityModel> By using the following code: from lxml import etree cityModel = etree.Element("cityModel") cityObject = etree.SubElement(cityModel, "cityObjectMember") bldg = etree.SubElement(cityObject, "{http:/
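lxml falls back to ns0/ns1 prefixes only when it has to invent them; passing an nsmap when the root element is created makes it serialize with the prefix you want. A minimal sketch using the building namespace from the question:

from lxml import etree

BLDG_NS = "http://www.opengis.net/citygml/building/2.0"
NSMAP = {"bldg": BLDG_NS}

city_model = etree.Element("CityModel", nsmap=NSMAP)
city_object = etree.SubElement(city_model, "cityObjectMember")
building = etree.SubElement(city_object, "{%s}Building" % BLDG_NS)
function = etree.SubElement(building, "{%s}function" % BLDG_NS)
function.text = "1000"

# Elements created with the namespace in Clark notation come out as bldg:Building
# and bldg:function, because the root's nsmap maps that URI to the bldg prefix.
print(etree.tostring(city_model, pretty_print=True).decode())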

XML walking in python [closed]

余生颓废 submitted on 2019-12-06 04:13:08
Question: It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center. Closed 7 years ago. I am new to Python and would like to understand parsing XML. I have not been able to find any great examples or explanations of how to create a generic program to walk an XML nodeset. I want to be able to
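The excerpt cuts off, but the "walk an XML nodeset" part has a standard answer: every Element is iterable over its children, so a small recursive function (or Element.iter()) visits the whole tree. A sketch with a made-up sample document:

import xml.etree.ElementTree as ET

sample = """<library>
  <book id="1"><title>A</title><author>X</author></book>
  <book id="2"><title>B</title><author>Y</author></book>
</library>"""

def walk(element, depth=0):
    """Print every element's tag, attributes, and text, indented by depth."""
    text = (element.text or "").strip()
    print("  " * depth + "%s %s %s" % (element.tag, element.attrib, text))
    for child in element:              # an Element iterates over its children
        walk(child, depth + 1)

root = ET.fromstring(sample)
walk(root)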