lxml

How can I preserve <br> as newlines with lxml.html text_content() or equivalent?

Submitted by 邮差的信 on 2019-11-28 21:25:20
I want to preserve <br> tags as \n when extracting the text content from lxml elements. Example code:

    fragment = '<div>This is a text node.<br/>This is another text node.<br/><br/><span>And a child element.</span><span>Another child,<br> with two text nodes</span></div>'
    h = lxml.html.fromstring(fragment)

Output:

    >>> h.text_content()
    'This is a text node.This is another text node.And a child element.Another child, with two text nodes'

Prepending a \n character to the tail of each <br/> element should give the result you're expecting:

    >>> import lxml.html as html
    >>> fragment = '<div>This is a
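The excerpt is cut off, but the full fix looks like the sketch below (the helper name text_with_newlines is my own):

    import lxml.html

    def text_with_newlines(el):
        # prepend '\n' to the tail of every <br> so text_content() keeps the breaks
        for br in el.xpath('.//br'):
            br.tail = '\n' + (br.tail or '')
        return el.text_content()

    fragment = ('<div>This is a text node.<br/>This is another text node.'
                '<br/><br/><span>And a child element.</span>'
                '<span>Another child,<br> with two text nodes</span></div>')
    h = lxml.html.fromstring(fragment)
    print(text_with_newlines(h))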

Write xml file using lxml library in Python

Submitted by ≯℡__Kan透↙ on 2019-11-28 20:49:22
Question: I'm using lxml to create an XML file from scratch, with code like this:

    from lxml import etree
    root = etree.Element("root")
    root.set("interesting", "somewhat")
    child1 = etree.SubElement(root, "test")

How do I write the root Element object to an XML file using the write() method of the ElementTree class?

Answer 1: You can get a string from the element and then write that (from the lxml tutorial):

    str = etree.tostring(root, pretty_print=True)

or convert it to an element tree:

    et = etree.ElementTree(root)
    et.write(sys.stdout)
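A complete sketch of the second approach; the filename output.xml and the keyword arguments are my own additions:

    from lxml import etree

    root = etree.Element("root")
    root.set("interesting", "somewhat")
    etree.SubElement(root, "test")

    # wrap the Element in an ElementTree to get access to write()
    et = etree.ElementTree(root)
    et.write("output.xml", pretty_print=True, xml_declaration=True, encoding="utf-8")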

Python Study Notes (7)

Submitted by 放肆的年华 on 2019-11-28 19:47:21
Python crawler frameworks

Installing the Scrapy framework:
1. Command line: conda install scrapy
2. PyCharm: Settings -> Project Interpreter -> "+", search for scrapy, install

The basic principle of how crawlers work: data collection. Data collection for big data comes in two kinds:
1. Fetching from the web: crawling
2. Collecting from local sources: scraping

Workflow:
1. Simulate a browser sending a request: urllib, requests, scrapy (framework)
2. Get the server's response
3. Parse the response content: lxml, beautifulsoup
4. Store the required data: CSV, JSON, RDBMS (pymysql), MongoDB

The requests API:

1. Send a request:

    import requests as req
    # send the request and get the response
    reply = req.get("https://beijing.8684.cn/x_35b1e697")

2. Get the response:

    print(reply.content)
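A minimal sketch tying the four workflow steps together (the URL, XPath, and filename are placeholders):

    import csv
    import requests
    from lxml import etree

    # steps 1-2: send the request and get the response
    resp = requests.get("https://example.com")
    # step 3: parse the response content with lxml
    tree = etree.HTML(resp.text)
    titles = tree.xpath('//h2/text()')
    # step 4: store the required data as CSV
    with open("titles.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for title in titles:
            writer.writerow([title])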

Parsing a table with rowspan and colspan

Submitted by 拟墨画扇 on 2019-11-28 19:40:51
I have a table that I need to parse; specifically, it is a school schedule with 4 blocks of time and 5 blocks of days for every week. I've attempted to parse it, but honestly I have not gotten very far, because I am stuck on how to deal with the rowspan and colspan attributes: they essentially mean that cells are missing from the rows, and I need to fill those gaps before I can continue. As an example of what I want to do, here's a table:

    <tr>
      <td colspan="2" rowspan="4">#1</td>
      <td rowspan="4">#2</td>
      <td rowspan="2">#3</td>
      <td rowspan="2">#4</td>
    </tr>
    <tr>
    </tr>
    <tr>
      <td rowspan="2">#5</td>
      <td rowspan="2">#6</td>
    </tr>
    <tr>
      <
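A common approach is to expand the table into a rectangular grid, copying each cell's value into every position its rowspan/colspan covers. A minimal sketch with lxml; the sample HTML is reconstructed from the (truncated) table above:

    from lxml import html

    def table_to_grid(table):
        """Expand rowspan/colspan into a dense 2D list of cell texts."""
        grid = {}  # (row, col) -> text
        for r, tr in enumerate(table.xpath('.//tr')):
            c = 0
            for td in tr.xpath('./td | ./th'):
                # skip positions already claimed by an earlier rowspan/colspan
                while (r, c) in grid:
                    c += 1
                rs = int(td.get('rowspan', 1))
                cs = int(td.get('colspan', 1))
                text = td.text_content().strip()
                for dr in range(rs):
                    for dc in range(cs):
                        grid[(r + dr, c + dc)] = text
                c += cs
        n_rows = max(r for r, _ in grid) + 1
        n_cols = max(c for _, c in grid) + 1
        return [[grid.get((r, c), '') for c in range(n_cols)]
                for r in range(n_rows)]

    doc = html.fromstring(
        '<table>'
        '<tr><td colspan="2" rowspan="4">#1</td><td rowspan="4">#2</td>'
        '<td rowspan="2">#3</td><td rowspan="2">#4</td></tr>'
        '<tr></tr>'
        '<tr><td rowspan="2">#5</td><td rowspan="2">#6</td></tr>'
        '<tr></tr></table>')
    print(table_to_grid(doc))

For this input the result is a 4x5 grid, with #1 filling the first two columns of all four rows and #5/#6 filling the last two columns of the bottom two rows.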

How to find XML Elements via XPath in Python in a namespace-agnostic way?

Submitted by 送分小仙女□ on 2019-11-28 18:48:54
Since I had this annoying issue for the 2nd time, I thought that asking would help. Sometimes I have to get elements from XML documents, but the ways to do this are awkward. I'd like to know: a Python library that does what I want, an elegant way to formulate my XPaths, a way to register the namespaces in prefixes automatically, or a hidden preference in the built-in XML implementations or in lxml to strip namespaces completely. Clarification follows unless you already know what I want :) Example doc:

    <root xmlns="http://really-long-namespace.uri" xmlns:other="http://with-ambivalent.end/#">
      <other
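One namespace-agnostic option lxml supports is matching on local-name() instead of the qualified tag name. A minimal sketch against a document shaped like the one above (the child element name is invented for illustration):

    from lxml import etree

    doc = etree.fromstring(
        '<root xmlns="http://really-long-namespace.uri">'
        '<child>text</child></root>')

    # local-name() compares only the part of the tag after the namespace
    nodes = doc.xpath('//*[local-name()="child"]')
    print(nodes[0].text)  # -> text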

Encoding in python with lxml - complex solution

Submitted by 限于喜欢 on 2019-11-28 18:27:42
I need to download and parse a webpage with lxml and build UTF-8 XML output. I think a schema in pseudocode is more illustrative:

    from lxml import etree
    webfile = urllib2.urlopen(url)
    root = etree.parse(webfile.read(), parser=etree.HTMLParser(recover=True))
    txt = my_process_text(etree.tostring(root.xpath('/html/body'), encoding=utf8))
    output = etree.Element("out")
    output.text = txt
    outputfile.write(etree.tostring(output, encoding=utf8))

So webfile can be in any encoding (lxml should handle this). The output file has to be in UTF-8. I'm not sure where to use encoding/coding. Is this schema OK? (I can't
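A corrected sketch of that pipeline, using Python 3 names; the URL is a placeholder and my_process_text is the question's own helper. Feeding lxml the raw bytes lets it detect the source encoding, xpath() returns a list rather than a single element, and the encoding argument of tostring() controls whether you get str or bytes:

    from urllib.request import urlopen
    from lxml import etree

    url = "https://example.com"  # placeholder
    raw = urlopen(url).read()    # bytes, in whatever encoding the site uses
    root = etree.fromstring(raw, parser=etree.HTMLParser(recover=True))

    body = root.xpath('/html/body')[0]              # xpath returns a list
    txt = etree.tostring(body, encoding='unicode')  # str; feed my_process_text here

    output = etree.Element("out")
    output.text = txt
    with open("out.xml", "wb") as f:
        # encoding='utf-8' makes tostring() return UTF-8 bytes
        f.write(etree.tostring(output, encoding='utf-8', xml_declaration=True))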

Using XPath

Submitted by 别来无恙 on 2019-11-28 17:42:35
Install the lxml library:

    pip --default-timeout=100 install lxml -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Using requests together with XPath:

    from lxml import etree
    import requests

    headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'}
    response = requests.get("http://www.baidu.com", headers=headers)
    text = response.text
    # turn the text into an element tree so we can use its xpath method
    selector = etree.HTML(text)
    # get the text content of all divs
    t = selector.xpath('//div//text()')
    print(t)

XPath selector symbols. Basic symbols and methods (an empty list [] is returned when no node matches):

    //   # matches across levels (relative to the node)
    selector
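The list of selectors is cut off above; a few of the most common XPath patterns with lxml, against a made-up snippet, look like this:

    from lxml import etree

    tree = etree.HTML('<div id="a"><p class="x">one</p><p>two</p></div>')

    print(tree.xpath('//p/text()'))              # text of all <p>, any depth
    print(tree.xpath('//div[@id="a"]/p'))        # filter by attribute
    print(tree.xpath('//p[@class="x"]/text()'))  # -> ['one']
    print(tree.xpath('//p[1]/text()'))           # first <p>; XPath counts from 1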

Scraping a website that requires login with Python

Submitted by [亡魂溺海] on 2019-11-28 16:09:42
Recently I had to scrape some pages from a website that requires login. It was not as simple as I had imagined, so I decided to write a companion tutorial for it. In this tutorial, we will scrape a list of projects from our Bitbucket account. The code for the tutorial can be found on my GitHub. We will follow these steps:

- Extract the details needed for login
- Perform the site login
- Scrape the required data

In this tutorial I used the following packages (listed in requirements.txt): requests, lxml.

Step 1: Study the website

Open the login page: go to "bitbucket.org/account/signin". You will see the login form (log out first, in case you are already logged in). Look carefully at the details we need to extract in order to log in. In this part, we create a dictionary to hold the login details:

1. Right-click the "Username or email" field and choose "Inspect element". We will use the value of the input box whose "name" attribute is "username". "username" will be the key, and our username/email will be the value (on other sites this key might be "email", "user_name", "login", and so on).
2. Right-click the "Password" field and choose "Inspect element". In the script, we will use the value of the input box whose "name" attribute is "password".
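A sketch of the login flow these steps lead to, using requests.Session plus lxml; the field names and the hidden csrfmiddlewaretoken token match the Bitbucket form described here and will differ on other sites:

    import requests
    from lxml import html

    LOGIN_URL = "https://bitbucket.org/account/signin/"

    session = requests.Session()
    # load the login page first to collect cookies and the hidden CSRF token
    page = session.get(LOGIN_URL)
    tree = html.fromstring(page.text)
    token = tree.xpath('//input[@name="csrfmiddlewaretoken"]/@value')[0]

    payload = {
        "username": "my_user",      # value for the "username" input found above
        "password": "my_password",
        "csrfmiddlewaretoken": token,
    }
    # the session keeps cookies, so requests after this one are authenticated
    session.post(LOGIN_URL, data=payload, headers={"referer": LOGIN_URL})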

Out of memory issue when installing packages on Ubuntu server

Submitted by 二次信任 on 2019-11-28 15:41:03
Question: I am using an Ubuntu cloud server with a limited 512 MB of RAM and a 20 GB HDD. More than 450 MB of the RAM is already used by other processes. I need to install a new package called lxml, which gets compiled with Cython during installation. That is a very heavy process, so it always exits with the error gcc: internal compiler error: Killed (program cc1), which means there is no RAM available for it to run. Upgrading the machine is an option, but it has its own issues, and a few of my services/websites run live from this server. But

Parse large XML with lxml

Submitted by 蓝咒 on 2019-11-28 13:03:07
I am trying to get my script working. So far it has not managed to output anything. This is my test.xml:

    <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd"
               version="0.8" xml:lang="it">
      <page>
        <title>MediaWiki:Category</title>
        <ns>0</ns>
        <id>2</id>
        <revision>
          <id>11248</id>
          <timestamp>2003-12-31T13:47:54Z</timestamp>
          <contributor>
            <username>Frieda</username>
            <id>0</id>
          </contributor>
          <minor />
          <text xml:space="preserve">categoria
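The script itself is not shown, but the usual culprit with this file is the default namespace: a bare tag name like 'page' will never match. A minimal iterparse sketch that handles the namespace and keeps memory flat on large dumps (the printed field is just an example):

    from lxml import etree

    NS = "{http://www.mediawiki.org/xml/export-0.8/}"

    # stream the file instead of loading it all at once
    for event, elem in etree.iterparse("test.xml", tag=NS + "page"):
        print(elem.findtext(NS + "title"))
        # discard the processed subtree so memory stays flat
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]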