lxml

lxml include relative path

[亡魂溺海] submitted on 2021-02-17 02:07:32
Question: Using Python's lxml library, I'm trying to load an .xsd file as a schema. The Python script is in one directory and the schemas are in another:
/root
    my_script.py
    /data
        /xsd
            schema_1.xsd
            schema_2.xsd
The problem is that schema_1.xsd includes schema_2.xsd like this: <xsd:include schemaLocation="schema_2.xsd"/>
Because schema_2.xsd is given as a relative path (the two schemas are in the same directory), lxml doesn't find it and raises an error: schema_root = etree.fromstring(open('data/xsd/schema_1.xsd').read()
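A minimal sketch of one common way around this, assuming the directory layout from the question. When the schema is parsed from its file path with etree.parse(), lxml keeps the file location as the document's base URL, so the relative schemaLocation can be resolved against it; a string handed to etree.fromstring() carries no location. The helper name load_schema is hypothetical.

```python
from lxml import etree


def load_schema(xsd_path):
    """Parse an XSD from its file path so relative includes can be resolved.

    etree.parse() records the file location as the base URL of the document,
    which is what a relative <xsd:include schemaLocation="..."/> is resolved
    against when the schema is compiled.
    """
    schema_doc = etree.parse(xsd_path)
    return etree.XMLSchema(schema_doc)
```

With the layout above, this would be called from my_script.py as load_schema("data/xsd/schema_1.xsd"). If the schema text has to come from a string, etree.fromstring() also accepts a base_url argument that serves the same purpose.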

pretty_print option in tostring not working in lxml

江枫思渺然 submitted on 2021-02-15 11:35:17
Question: I'm trying to use lxml's tostring method to get a "pretty" version of my XML as a string. The lxml site shows this example:
>>> import lxml.etree as etree
>>> root = etree.Element("root")
>>> print(root.tag)
root
>>> root.append(etree.Element("child1"))
>>> child2 = etree.SubElement(root, "child2")
>>> child3 = etree.SubElement(root, "child3")
>>> print(etree.tostring(root, pretty_print=True))
<root>
  <child1/>
  <child2/>
  <child3/>
</root>
However my output, running those
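The question is cut off here, so the actual cause is not shown. One frequent cause, assuming Python 3, is that etree.tostring() returns bytes by default, so print() shows the b'...' representation with literal \n escapes instead of line breaks. A minimal sketch under that assumption:

```python
from lxml import etree

root = etree.Element("root")
root.append(etree.Element("child1"))
etree.SubElement(root, "child2")
etree.SubElement(root, "child3")

# encoding="unicode" makes tostring() return str instead of bytes,
# so print() renders the line breaks and indentation.
pretty = etree.tostring(root, pretty_print=True, encoding="unicode")
print(pretty)

# Alternative with the same result: decode the default bytes output.
pretty_from_bytes = etree.tostring(root, pretty_print=True).decode()
```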

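Scraping Python job listings from Lagou (拉勾网) with requests

痴心易碎 submitted on 2021-02-14 08:00:34
Purpose: This post scrapes Python-related job listings from Lagou as a simple walkthrough of how to use Requests and XPath. The code is not wrapped into reusable functions and the requests are fairly simple, so the project is only meant for getting familiar with the basics of scraping with requests; it is not suitable for a stable crawler.
Tools: Requests sends the HTTP requests, lxml.etree parses the HTML document, and XPath extracts the job information.
About Requests: Requests is a very popular HTTP library written in Python that makes fetching web pages easy. The official site describes it as: "Requests is an elegant and simple HTTP library for Python, built for human beings." In short, Requests is simple and pleasant to use. It can be installed with pip or conda; this post uses Python 3.6.
Try requesting the Baidu home page:
# import the requests module
import requests
# send the HTTP request
re = requests.get("https://www.baidu.com/")
# check the response status
print(re.status_code)
# check the URL
print(re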
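As a hedged sketch of the parsing step described above (lxml.etree plus XPath), the example below works on an inline HTML string rather than a live response. The markup, class names, and the helper extract_jobs are hypothetical and are not taken from Lagou's real pages.

```python
from lxml import etree

# Hypothetical markup standing in for a downloaded page (not Lagou's real HTML).
SAMPLE_HTML = """
<html><body>
  <ul>
    <li class="job"><h3>Python Developer</h3><span class="city">Beijing</span></li>
    <li class="job"><h3>Data Engineer</h3><span class="city">Shanghai</span></li>
  </ul>
</body></html>
"""


def extract_jobs(html_text):
    """Return a list of (title, city) pairs found in the HTML text."""
    tree = etree.HTML(html_text)  # lenient HTML parser, returns the root element
    jobs = []
    for item in tree.xpath('//li[@class="job"]'):
        # Relative XPath expressions are evaluated against each <li> element.
        title = item.xpath('./h3/text()')[0]
        city = item.xpath('./span[@class="city"]/text()')[0]
        jobs.append((title, city))
    return jobs


print(extract_jobs(SAMPLE_HTML))
```

With a real page, the text of the response returned by requests.get() would be passed in place of SAMPLE_HTML, and the XPath expressions would need to match that page's actual structure.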

Python web scraping: a first try with Douban Movies

橙三吉。 submitted on 2021-02-13 16:41:06
I. Background
1. Tool used: PyCharm
2. Third-party libraries installed: requests, BeautifulSoup
2.1 How to install third-party libraries: File => Settings => Project Interpreter => click + and search for the package you need
3. What you can learn:
1. Fetch a page's HTML content from its URL
2. Parse the HTML and pick out the content you need
II. Code example
The goal is to get the names of the movies on the chart page.
import requests
from bs4 import BeautifulSoup


def getHtml():
    url = 'https://movie.douban.com/chart'
    # GET the content of the page
    html = requests.get(url)
    # Parse the page content with the lxml parser
    soup = BeautifulSoup(html.content, "lxml")
    getFilmName(soup)
    # print(soup)


def getFilmName(html):
    for i in html.find_all('a', class_="nbg"):
        img = i.find('img')
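The loop in the excerpt is cut off, so what it does with each link is not shown. As a hedged sketch of the same find_all technique on an inline HTML string: the markup below and the assumption that the film name sits in the link's title attribute are hypothetical, and the helper get_film_names is not from the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real Douban chart page may be structured differently.
SAMPLE_HTML = """
<div>
  <a class="nbg" href="/subject/1/" title="Film One"><img src="1.jpg" alt="Film One"/></a>
  <a class="nbg" href="/subject/2/" title="Film Two"><img src="2.jpg" alt="Film Two"/></a>
  <a class="other" href="/x/" title="Not a film">x</a>
</div>
"""


def get_film_names(html_text):
    """Collect the title attribute of every <a class="nbg"> link."""
    soup = BeautifulSoup(html_text, "lxml")  # requires the lxml package
    names = []
    for link in soup.find_all("a", class_="nbg"):
        title = link.get("title")  # returns None when the attribute is absent
        if title:
            names.append(title)
    return names


print(get_film_names(SAMPLE_HTML))
```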

lxml web-scraping is returning empty values

两盒软妹~` submitted on 2021-02-11 15:21:10
Question: I am trying to get all the food categories from this site: https://www.walmart.com/cp/976759. Here is a snapshot of the category container: <div id="cp-center-module-5" class="cp-center-module"><span style="font-size: 0px;"></span><div data-module="FeaturedCategoriesCollapsible" data-module-id="e05783ed-f2bb-44f3-956f-9d7d5286d25b" class="TempoTileCollapsible FeaturedCategoriesCollapsible" data-tl-id="categorypage-FeaturedCategoriesCollapsible"><div class="TempoTileCollapsible-header"><div class=

Parsing Xml files >3gb using lxml etree iterparse [duplicate]

拜拜、爱过 submitted on 2021-02-11 13:49:22
Question: I am not able to parse a very large XML file using lxml's etree. From my research, I understand that lxml iterparse loads the XML file until it reaches the tag it is looking for. This is a snippet of my code:
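The asker's snippet is not included in this excerpt. A minimal sketch of the usual memory-saving pattern with iterparse, assuming the records of interest share one tag name: each element is cleared after use and already-processed siblings are dropped from the parent, so the partial tree does not keep growing. The function name iter_record_text and its parameters are hypothetical.

```python
from lxml import etree


def iter_record_text(source, tag):
    """Yield the text of every <tag> element while keeping memory use flat.

    `source` is a filename or a binary file-like object.
    """
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem.text
        # Free the element's own content once it has been consumed.
        elem.clear()
        # Drop earlier siblings that are still attached to the parent.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
```

For a multi-gigabyte file the generator would be consumed lazily, for example in a for loop, rather than collected into a list.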

Registering namespaces with lxml before parsing

 ̄綄美尐妖づ submitted on 2021-02-11 05:05:25
Question: I am using lxml to parse XML from an external service that uses namespace prefixes but doesn't declare them with xmlns. I am trying to register the namespace by hand with register_namespace, but that doesn't seem to work.
from lxml import etree
xml = """
<Foo xsi:type="xsd:string">bar</Foo>
"""
etree.register_namespace('xsi', 'http://www.w3.org/2001/XMLSchema-instance')
el = etree.fromstring(xml)
# lxml.etree.XMLSyntaxError: Namespace prefix xsi for type on Foo is not defined
What am I missing? Oddly enough,
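For context, etree.register_namespace() only influences which prefix lxml uses when serializing; the parser still requires the prefix to be declared inside the document. A minimal sketch of one workaround, assuming the root tag name (Foo) is known in advance: declare the missing prefix on the root tag before parsing. The plain string replacement is a naive illustration, not a general-purpose fix.

```python
from lxml import etree

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

xml = '<Foo xsi:type="xsd:string">bar</Foo>'

# Naive patch: add the xmlns:xsi declaration to the first <Foo tag only.
patched = xml.replace("<Foo", '<Foo xmlns:xsi="%s"' % XSI_NS, 1)

el = etree.fromstring(patched)

# Namespaced attributes are addressed as {namespace-uri}localname.
xsi_type = el.get("{%s}type" % XSI_NS)
print(xsi_type)
```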