lxml

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?

Submitted by 邮差的信 on 2019-11-28 21:25:20
I want to preserve <br> tags as \n when extracting the text content from lxml elements. Example code:

    fragment = '<div>This is a text node.<br/>This is another text node.<br/><br/><span>And a child element.</span><span>Another child,<br> with two text nodes</span></div>'
    h = lxml.html.fromstring(fragment)

Output:

    >>> h.text_content()
    'This is a text node.This is another text node.And a child element.Another child, with two text nodes'

Prepending a \n character to the tail of each <br/> element should give the result you're expecting:

    >>> import lxml.html as html
    >>> fragment = '<div>This is a
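The excerpt is cut off, but the full fix looks like the sketch below (the helper name text_with_newlines is my own):

    import lxml.html

    def text_with_newlines(el):
        # prepend '\n' to the tail of every <br> so text_content() keeps the breaks
        for br in el.xpath('.//br'):
            br.tail = '\n' + (br.tail or '')
        return el.text_content()

    fragment = ('<div>This is a text node.<br/>This is another text node.'
                '<br/><br/><span>And a child element.</span>'
                '<span>Another child,<br> with two text nodes</span></div>')
    h = lxml.html.fromstring(fragment)
    print(text_with_newlines(h))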

Write xml file using lxml library in Python

Submitted by ≯℡__Kan透↙ on 2019-11-28 20:49:22
Question: I'm using lxml to create an XML file from scratch, with code like this:

    from lxml import etree
    root = etree.Element("root")
    root.set("interesting", "somewhat")
    child1 = etree.SubElement(root, "test")

How do I write the root Element object to an XML file using the write() method of the ElementTree class?

Answer 1: You can get a string from the element and then write that (from the lxml tutorial):

    str = etree.tostring(root, pretty_print=True)

or convert it to an element tree:

    et = etree.ElementTree(root)
    et.write(sys.stdout)
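A complete sketch of the second approach; the filename output.xml and the keyword arguments are my own additions:

    from lxml import etree

    root = etree.Element("root")
    root.set("interesting", "somewhat")
    etree.SubElement(root, "test")

    # wrap the Element in an ElementTree to get access to write()
    et = etree.ElementTree(root)
    et.write("output.xml", pretty_print=True, xml_declaration=True, encoding="utf-8")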

Python Study Notes (7)

Submitted by 放肆的年华 on 2019-11-28 19:47:21
Python crawler frameworks

Installing the Scrapy framework:
1. Command line: conda install scrapy
2. PyCharm: Settings -> Project Interpreter -> "+", search for scrapy, install

The basic principle of how crawlers work: data collection. Data collection for big data comes in two kinds:
1. Fetching from the web: crawling
2. Collecting from local sources: scraping

Workflow:
1. Simulate a browser sending a request: urllib, requests, scrapy (framework)
2. Get the server's response
3. Parse the response content: lxml, beautifulsoup
4. Store the required data: CSV, JSON, RDBMS (pymysql), MongoDB

The requests API:

1. Send a request:

    import requests as req
    # send the request and get the response
    reply = req.get("https://beijing.8684.cn/x_35b1e697")

2. Get the response:

    print(reply.content)
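A minimal sketch tying the four workflow steps together (the URL, XPath, and filename are placeholders):

    import csv
    import requests
    from lxml import etree

    # steps 1-2: send the request and get the response
    resp = requests.get("https://example.com")
    # step 3: parse the response content with lxml
    tree = etree.HTML(resp.text)
    titles = tree.xpath('//h2/text()')
    # step 4: store the required data as CSV
    with open("titles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for title in titles:
            writer.writerow([title])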

Parsing a table with rowspan and colspan

Submitted by 拟墨画扇 on 2019-11-28 19:40:51
I have a table that I need to parse; specifically, it is a school schedule with 4 blocks of time and 5 blocks of days for every week. I've attempted to parse it, but honestly I have not gotten very far, because I am stuck on how to deal with the rowspan and colspan attributes: they essentially mean that cells are missing from the rows, and I need to fill those gaps before I can continue. As an example of what I want to do, here's a table:

    <tr>
      <td colspan="2" rowspan="4">#1</td>
      <td rowspan="4">#2</td>
      <td rowspan="2">#3</td>
      <td rowspan="2">#4</td>
    </tr>
    <tr>
    </tr>
    <tr>
      <td rowspan="2">#5</td>
      <td rowspan="2">#6</td>
    </tr>
    <tr>
      <
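A common approach is to expand the table into a rectangular grid, copying each cell's value into every position its rowspan/colspan covers. A minimal sketch with lxml; the sample HTML is reconstructed from the (truncated) table above:

    from lxml import html

    def table_to_grid(table):
        """Expand rowspan/colspan into a dense 2D list of cell texts."""
        grid = {}  # (row, col) -> text
        for r, tr in enumerate(table.xpath('.//tr')):
            c = 0
            for td in tr.xpath('./td | ./th'):
                # skip positions already claimed by an earlier rowspan/colspan
                while (r, c) in grid:
                    c += 1
                rs = int(td.get('rowspan', 1))
                cs = int(td.get('colspan', 1))
                text = td.text_content().strip()
                for dr in range(rs):
                    for dc in range(cs):
                        grid[(r + dr, c + dc)] = text
                c += cs
        n_rows = max(r for r, _ in grid) + 1
        n_cols = max(c for _, c in grid) + 1
        return [[grid.get((r, c), '') for c in range(n_cols)]
                for r in range(n_rows)]

    doc = html.fromstring(
        '<table>'
        '<tr><td colspan="2" rowspan="4">#1</td><td rowspan="4">#2</td>'
        '<td rowspan="2">#3</td><td rowspan="2">#4</td></tr>'
        '<tr></tr>'
        '<tr><td rowspan="2">#5</td><td rowspan="2">#6</td></tr>'
        '<tr></tr></table>')
    print(table_to_grid(doc))

For this input the result is a 4x5 grid, with #1 filling the first two columns of all four rows and #5/#6 filling the last two columns of the bottom two rows.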

How to find XML Elements via XPath in Python in a namespace-agnostic way?

Submitted by 送分小仙女□ on 2019-11-28 18:48:54
Since I had this annoying issue for the 2nd time, I thought that asking would help. Sometimes I have to get elements from XML documents, but the ways to do this are awkward. I'd like to know: a Python library that does what I want, an elegant way to formulate my XPaths, a way to register the namespaces in prefixes automatically, or a hidden preference in the built-in XML implementations or in lxml to strip namespaces completely. Clarification follows unless you already know what I want :) Example doc:

    <root xmlns="http://really-long-namespace.uri" xmlns:other="http://with-ambivalent.end/#">
      <other
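One namespace-agnostic option lxml supports is matching on local-name() instead of the qualified tag name. A minimal sketch against a document shaped like the one above (the child element name is invented for illustration):

    from lxml import etree

    doc = etree.fromstring(
        '<root xmlns="http://really-long-namespace.uri">'
        '<child>text</child></root>')

    # local-name() compares only the part of the tag after the namespace
    nodes = doc.xpath('//*[local-name()="child"]')
    print(nodes[0].text)  # -> text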

Encoding in python with lxml - complex solution

Submitted by 限于喜欢 on 2019-11-28 18:27:42
I need to download and parse a webpage with lxml and build UTF-8 XML output. I think a schema in pseudocode is more illustrative:

    from lxml import etree
    webfile = urllib2.urlopen(url)
    root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True))
    txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8))
    output = etree.Element("out")
    output.text = txt
    outputfile.write(etree.tostring(output, encoding=utf8))

So webfile can be in any encoding (lxml should handle this). The output file has to be in UTF-8. I'm not sure where to use encoding/coding. Is this schema OK? (I can't
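A corrected sketch of that pipeline, using Python 3 names; the URL is a placeholder and my_process_text is the question's own helper. Feeding lxml the raw bytes lets it detect the source encoding, xpath() returns a list rather than a single element, and the encoding argument of tostring() controls whether you get str or bytes:

    from urllib.request import urlopen
    from lxml import etree

    url = "https://example.com"  # placeholder
    raw = urlopen(url).read()    # bytes, in whatever encoding the site uses
    root = etree.fromstring(raw, parser=etree.HTMLParser(recover=True))

    body = root.xpath('/html/body')[0]              # xpath returns a list
    txt = etree.tostring(body, encoding='unicode')  # str; feed my_process_text here

    output = etree.Element("out")
    output.text = txt
    with open("out.xml", "wb") as f:
        # encoding='utf-8' makes tostring() return UTF-8 bytes
        f.write(etree.tostring(output, encoding='utf-8', xml_declaration=True))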

Using XPath

Submitted by 别来无恙 on 2019-11-28 17:42:35
Install the lxml library:

    pip --default-timeout=100 install lxml -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Using requests together with XPath:

    from lxml import etree
    import requests

    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'}
    response = requests.get("http://www.baidu.com", headers=headers)
    text = response.text
    # turn the text into an element tree so we can use its xpath method
    selector = etree.HTML(text)
    # get the text content of all divs
    t = selector.xpath('//div//text()')
    print(t)

XPath selector symbols. Basic symbols and methods (an empty list [] is returned when no node matches):

    //   # matches across levels (relative to the node)
    selector
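The list of selectors is cut off above; a few of the most common XPath patterns with lxml, against a made-up snippet, look like this:

    from lxml import etree

    tree = etree.HTML('<div id="a"><p class="x">one</p><p>two</p></div>')

    print(tree.xpath('//p/text()'))              # text of all <p>, any depth
    print(tree.xpath('//div[@id="a"]/p'))        # filter by attribute
    print(tree.xpath('//p[@class="x"]/text()'))  # -> ['one']
    print(tree.xpath('//p[1]/text()'))           # first <p>; XPath counts from 1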

Scraping a website that requires login with Python

Submitted by [亡魂溺海] on 2019-11-28 16:09:42
Recently I had to scrape some pages from a website that requires login. It was not as simple as I had imagined, so I decided to write a companion tutorial for it. In this tutorial, we will scrape a list of projects from our Bitbucket account. The code for the tutorial can be found on my GitHub. We will follow these steps:

- Extract the details needed for login
- Perform the site login
- Scrape the required data

In this tutorial I used the following packages (listed in requirements.txt): requests, lxml.

Step 1: Study the website

Open the login page: go to "bitbucket.org/account/signin". You will see the login form (log out first, in case you are already logged in). Look carefully at the details we need to extract in order to log in. In this part, we create a dictionary to hold the login details:

1. Right-click the "Username or email" field and choose "Inspect element". We will use the value of the input box whose "name" attribute is "username". "username" will be the key, and our username/email will be the value (on other sites this key might be "email", "user_name", "login", and so on).
2. Right-click the "Password" field and choose "Inspect element". In the script, we will use the value of the input box whose "name" attribute is "password".
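A sketch of the login flow these steps lead to, using requests.Session plus lxml; the field names and the hidden csrfmiddlewaretoken token match the Bitbucket form described here and will differ on other sites:

    import requests
    from lxml import html

    LOGIN_URL = "https://bitbucket.org/account/signin/"

    session = requests.Session()
    # load the login page first to collect cookies and the hidden CSRF token
    page = session.get(LOGIN_URL)
    tree = html.fromstring(page.text)
    token = tree.xpath('//input[@name="csrfmiddlewaretoken"]/@value')[0]

    payload = {
        "username": "my_user",      # value for the "username" input found above
        "password": "my_password",
        "csrfmiddlewaretoken": token,
    }
    # the session keeps cookies, so requests after this one are authenticated
    session.post(LOGIN_URL, data=payload, headers={"referer": LOGIN_URL})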

Out of memory issue when installing packages on Ubuntu server

Submitted by 二次信任 on 2019-11-28 15:41:03
Question: I am using an Ubuntu cloud server with a limited 512 MB of RAM and a 20 GB HDD. More than 450 MB of the RAM is already used by other processes. I need to install a new package called lxml, which gets compiled with Cython during installation. That is a very heavy process, so it always exits with the error gcc: internal compiler error: Killed (program cc1), which means there is no RAM available for it to run. Upgrading the machine is an option, but it has its own issues, and a few of my services/websites run live from this server. But

Parse large XML with lxml

Submitted by 蓝咒 on 2019-11-28 13:03:07
I am trying to get my script working. So far it has not managed to output anything. This is my test.xml:

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd"
               version="0.8" xml:lang="it">
      <page>
        <title>MediaWiki:Category</title>
        <ns>0</ns>
        <id>2</id>
        <revision>
          <id>11248</id>
          <timestamp>2003-12-31T13:47:54Z</timestamp>
          <contributor>
            <username>Frieda</username>
            <id>0</id>
          </contributor>
          <minor />
          <text xml:space="preserve">categoria
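The script itself is not shown, but the usual culprit with this file is the default namespace: a bare tag name like 'page' will never match. A minimal iterparse sketch that handles the namespace and keeps memory flat on large dumps (the printed field is just an example):

    from lxml import etree

    NS = "{http://www.mediawiki.org/xml/export-0.8/}"

    # stream the file instead of loading it all at once
    for event, elem in etree.iterparse("test.xml", tag=NS + "page"):
        print(elem.findtext(NS + "title"))
        # discard the processed subtree so memory stays flat
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]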