lxml

Web scraping with BeautifulSoup or lxml.html

Submitted by 谁都会走 on 2019-11-27 17:18:41
Question: I have seen some webcasts and need help trying to do this. I have been using lxml.html, but Yahoo recently changed its web structure. Target page: http://finance.yahoo.com/quote/IBM/options?date=1469750400&straddle=true. In Chrome, using the inspector, I see the data in //*[@id="main-0-Quote-Proxy"]/section/section/div[2]/section/section/table, then some more code. How do I get this data out into a list? I also want to change to another stock, from "LLY" to "Msft". How do I switch between dates, and get all…
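The question trails off, but a hedged sketch of the general approach follows. The table XPath and the ticker/date query parameters come from the question itself; Yahoo's current markup almost certainly differs, so treat the XPath as a placeholder and the helper names as hypothetical:

```python
import lxml.html

def extract_table_rows(html_text, table_xpath):
    """Parse an HTML document and return the first table matched by
    table_xpath as a list of rows, each row a list of cell texts."""
    root = lxml.html.fromstring(html_text)
    tables = root.xpath(table_xpath)
    if not tables:
        return []
    return [
        [cell.text_content().strip() for cell in row.xpath("./td | ./th")]
        for row in tables[0].xpath(".//tr")
    ]

def options_url(symbol, epoch_date):
    # The stock symbol and expiration date are plain query parameters,
    # so switching ticker ("LLY" -> "Msft") or date is string substitution.
    return ("http://finance.yahoo.com/quote/%s/options?date=%d&straddle=true"
            % (symbol, epoch_date))
```

To switch dates, substitute a different epoch timestamp into `options_url`; the valid timestamps are themselves listed in the page's date dropdown.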

lxml installation error on Ubuntu 14.04 (internal compiler error)

Submitted by ╄→尐↘猪︶ㄣ on 2019-11-27 16:59:28
Question: I am having problems installing lxml. I have tried the solutions from the related questions on this site and other sites, but could not fix the problem. I need some suggestions or a solution. This is the full log after executing pip install lxml: Downloading/unpacking lxml Downloading lxml-3.3.5.tar.gz (3.5MB): 3.5MB downloaded Running setup.py (path:/tmp/pip_build_root/lxml/setup.py) egg_info for package lxml /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url' warnings.warn(msg) Building lxml version 3.3.5. Building without Cython. …
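A sketch of the standard advice for this symptom on Ubuntu 14.04 (not taken from the question's log): lxml compiles C extensions against libxml2/libxslt, so the development headers must be present, and an "internal compiler error" during a pip build frequently means gcc was killed for lack of RAM on a small VPS, which a temporary swap file works around:

```shell
# Install the headers lxml's C extensions build against
sudo apt-get install libxml2-dev libxslt1-dev python-dev zlib1g-dev

# If the compiler still dies, add temporary swap so gcc has enough memory
sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
sudo mkswap /swapfile
sudo swapon /swapfile

pip install lxml

# Remove the temporary swap afterwards
sudo swapoff /swapfile
sudo rm /swapfile
```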

XPath vs DOM vs BeautifulSoup vs lxml vs others: which is the fastest approach to parsing a webpage?

Submitted by 本秂侑毒 on 2019-11-27 16:44:04
Question: I know how to parse a page using Python. My question is: of all the parsing techniques, which is the fastest, and by how much? The techniques I know are XPath, DOM, BeautifulSoup, and Python's find method. Answer 1: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ Answer 2: lxml is written in C, and if you are on x86 it is the best choice. As for techniques, there is no big difference between XPath and DOM; both are very fast. But if you…
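One way to ground the speed question is to measure it rather than argue it. The sketch below times lxml's C-based parser with an XPath query against the standard library's pure-Python html.parser on a synthetic document; the document size and iteration count are arbitrary choices for illustration:

```python
import timeit
import lxml.html
from html.parser import HTMLParser

html = "<html><body>" + "<p class='x'>hello</p>" * 2000 + "</body></html>"

def with_lxml():
    # C parser plus XPath evaluation
    root = lxml.html.fromstring(html)
    return [p.text for p in root.xpath("//p[@class='x']")]

class TextGrabber(HTMLParser):
    # Event-driven pure-Python parser from the standard library
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.texts = []
    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "x") in attrs:
            self.in_p = True
    def handle_data(self, data):
        if self.in_p:
            self.texts.append(data)
            self.in_p = False

def with_stdlib():
    parser = TextGrabber()
    parser.feed(html)
    return parser.texts

assert with_lxml() == with_stdlib()
print("lxml:   %.3fs" % timeit.timeit(with_lxml, number=20))
print("stdlib: %.3fs" % timeit.timeit(with_stdlib, number=20))
```

The absolute numbers depend on the machine; what matters is running both paths on the same document.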

lxml parser eats all memory

Submitted by 纵饮孤独 on 2019-11-27 16:10:46
Question: I'm writing a spider in Python, using the lxml library for parsing HTML and the gevent library for async I/O. I found that after some time of work the lxml parser starts eating memory, up to 8 GB (all the server's memory), yet I have only 100 async threads, each of them parsing documents of at most 300 KB. I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it. The problem is in this line of code: HTML = lxml.html.fromstring(htmltext). Maybe someone knows what it can be, or how to fix…
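The question is cut off, but a common mitigation sketch for this symptom: make sure no Element references outlive the extraction, since any live reference pins the whole parsed document in libxml2's memory. The href extraction below is a hypothetical stand-in for whatever the spider actually pulls out:

```python
import lxml.html

def parse_and_extract(htmltext):
    """Parse one document, copy out plain-Python results, and drop every
    reference to the tree before returning so libxml2 can free it."""
    root = lxml.html.fromstring(htmltext)
    try:
        # Return plain strings, never Element objects, to the caller
        return [a.get("href") for a in root.xpath("//a[@href]")]
    finally:
        # Release element contents explicitly rather than waiting for GC
        root.clear()
```

If memory still grows after this, running the parsing in worker subprocesses that are recycled periodically bounds the damage regardless of the cause.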

python lxml append element after another element

Submitted by 谁都会走 on 2019-11-27 15:58:37
Question: I have the following HTML markup: <div id="contents"> <div id="content_nav"> something goes here </div> <p> some contents </p> </div> To fix a CSS issue, I want to append a div tag <div style="clear:both"></div> after the content_nav div, like this: <div id="contents"> <div id="content_nav"> something goes here </div> <div style="clear:both"></div> <p> some contents </p> </div> I am doing it this way: import lxml.etree tree = lxml.etree.fromstring(inputString, parser=lxml.etree.HTMLParser())
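Completing the truncated snippet: lxml's addnext() inserts a new element as the following sibling, which is exactly the "append after" the question asks for. (Note the method is fromstring(), all lowercase, not fromString().) A minimal runnable sketch:

```python
import lxml.etree

inputString = '''<div id="contents">
  <div id="content_nav">something goes here</div>
  <p>some contents</p>
</div>'''

tree = lxml.etree.fromstring(inputString, parser=lxml.etree.HTMLParser())
content_nav = tree.xpath('//div[@id="content_nav"]')[0]

# Build the new element and attach it as content_nav's next sibling
clear_div = lxml.etree.Element("div", style="clear:both")
content_nav.addnext(clear_div)

print(lxml.etree.tostring(tree, pretty_print=True).decode())
```

The HTMLParser wraps the fragment in html/body elements on parsing, so serialize just the contents div if you need the fragment back on its own.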

Python Web Crawling: HTML Parsing with BeautifulSoup

Submitted by 巧了我就是萌 on 2019-11-27 15:46:14
BeautifulSoup is a Python library for extracting data from HTML and XML files. It provides simple functions for navigating, searching, and modifying the parse tree. Its search and extraction features are very powerful and very convenient, and can save a programmer hours or even days of work. BeautifulSoup automatically converts input documents to Unicode and output documents to UTF-8, so the user does not need to worry about encodings, unless the document does not declare one, in which case BeautifulSoup cannot detect the encoding automatically. Installing BeautifulSoup: download the BeautifulSoup source from https://www.crummy.com/software/BeautifulSoup/bs4/download/, open cmd, change into the beautifulsoup4-4.8.0 directory, and run python setup.py install. Using BeautifulSoup: with BeautifulSoup installed, we can start using it. (1) Import the bs4 library, then create a string simulating HTML code, as follows: from bs4 import BeautifulSoup # import the BeautifulSoup library # create a string simulating HTML code html_doc = """ <html><head><title>The Dormouse's story<
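Continuing the truncated example: html_doc is the "Dormouse's story" snippet from the BeautifulSoup documentation. A shortened, runnable version using the standard library parser (pass "lxml" instead of "html.parser" if lxml is installed):

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute-style access to the first matching tag
print(soup.title.string)

# find() / find_all() search by tag name and attributes
print(soup.find("p", class_="title").get_text())
print(len(soup.find_all("p")))
```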

python lxml - modify attributes

Submitted by 可紊 on 2019-11-27 15:43:10
Question: from lxml import objectify, etree root = etree.fromstring('''<?xml version="1.0" encoding="ISO-8859-1" ?> <scenario> <init> <send channel="channel-Gy"> <command name="CER"> <avp name="Origin-Host" value="router1dev"></avp> <avp name="Origin-Realm" value="realm.dev"></avp> <avp name="Host-IP-Address" value="0x00010a248921"></avp> <avp name="Vendor-Id" value="11"></avp> <avp name="Product-Name" value="HP Ro Interface"></avp> <avp name="Origin-State-Id" value="1094807040"></avp> <avp name=
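The question text is cut off, but given the title ("modify attributes"), the usual answer is lxml's dict-like attrib interface with .get() and .set(). A sketch against a trimmed version of the scenario document above (note that fromstring needs bytes when the XML declaration names an encoding):

```python
from lxml import etree

root = etree.fromstring('''<?xml version="1.0" encoding="ISO-8859-1"?>
<scenario>
  <init>
    <send channel="channel-Gy">
      <command name="CER">
        <avp name="Origin-Host" value="router1dev"/>
        <avp name="Vendor-Id" value="11"/>
      </command>
    </send>
  </init>
</scenario>'''.encode("ISO-8859-1"))

# Locate the element by attribute and rewrite its value in place
avp = root.xpath('//avp[@name="Origin-Host"]')[0]
avp.set("value", "router2dev")

print(etree.tostring(root, pretty_print=True).decode())
```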

lxml: add namespace to input file

Submitted by 微笑、不失礼 on 2019-11-27 15:30:58
I am parsing an XML file generated by an external program. I would then like to add custom annotations to this file, using my own namespace. My input looks like this: <sbml xmlns="http://www.sbml.org/sbml/level2/version4" xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner" level="2" version="4"> <model metaid="untitled" id="untitled"> <annotation>...</annotation> <listOfUnitDefinitions>...</listOfUnitDefinitions> <listOfCompartments>...</listOfCompartments> <listOfSpecies> <species metaid="s1" id="s1" name="GenA" compartment="default" initialAmount="0"> <annotation> <celldesigner
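The excerpt stops mid-document, but the general lxml recipe for adding elements in your own namespace is to use fully qualified tag names plus an nsmap that binds the prefix. The namespace URI and tag name below are hypothetical; the SBML default namespace is from the input above:

```python
from lxml import etree

SBML_NS = "http://www.sbml.org/sbml/level2/version4"
MY_NS = "http://example.org/my-annotations"   # hypothetical custom namespace

doc = etree.fromstring(
    '<sbml xmlns="%s" level="2" version="4"><model id="untitled"/></sbml>'
    % SBML_NS)

# Elements of the input live in the default SBML namespace, so lookups
# must use the Clark-notation qualified name
model = doc.find("{%s}model" % SBML_NS)

# Qualified tag + nsmap makes lxml serialize the element as <my:note ...>
note = etree.SubElement(model, "{%s}note" % MY_NS, nsmap={"my": MY_NS})
note.text = "custom annotation"

print(etree.tostring(doc, pretty_print=True).decode())
```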

Python web scraping involving HTML tags with attributes

Submitted by 為{幸葍}努か on 2019-11-27 15:13:10
Question: I'm trying to make a web scraper that will parse a web page of publications and extract the authors. The skeletal structure of the web page is the following: <html> <body> <div id="container"> <div id="contents"> <table> <tbody> <tr> <td class="author">####I want whatever is located here ###</td> </tr> </tbody> </table> </div> </div> </body> </html> I've been trying to use BeautifulSoup and lxml to accomplish this task so far, but I'm not sure how to handle the two div tags and the td tag…
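A sketch of one answer: the intermediate div and tbody levels need not be walked explicitly; an XPath keyed on the class attribute reaches the author cells directly. The author names below are placeholders for the page content:

```python
import lxml.html

page = """<html><body><div id="container"><div id="contents">
<table><tbody>
<tr><td class="author">Jane Doe</td></tr>
<tr><td class="author">John Smith</td></tr>
</tbody></table>
</div></div></body></html>"""

root = lxml.html.fromstring(page)
# "//" skips any number of intermediate levels, so the nested divs,
# table, and tbody never need to be named
authors = [td.text_content().strip()
           for td in root.xpath('//div[@id="contents"]//td[@class="author"]')]
print(authors)
```

The BeautifulSoup equivalent would be soup.find_all("td", class_="author"); either way the nesting depth is irrelevant.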

"Web Scraping with Python" Reading Notes 1 (Chapter 1: Your First Web Scraper)

Submitted by 限于喜欢 on 2019-11-27 15:04:22
Preface: All of the code samples in this book are on GitHub (https://github.com/REMitchell/python-scraping), where they can be viewed and downloaded. For a more comprehensive introduction to Python, Bill Lubanovic's "Introducing Python" is a very good textbook. (I have not read it yet; the introductory book I chose was "Python Crash Course".) Supplemental material (code samples, exercises, etc.) can be downloaded from https://github.com/REMitchell/python-scraping. Chapter 1: Your First Web Scraper. 1.1 Connecting: from urllib.request import urlopen html = urlopen('http://pythonscraping.com/pages/page1.html') print(html.read()) Save the code above as scrapetest.py, then run the following command in a terminal: python scrapetest.py This outputs the complete HTML code of the page at http://pythonscraping.com/pages/page1.html. More precisely, it outputs the source of the HTML file page1.html in the /pages folder of the <web application root> on the server at the domain http://pythonscraping.com. b'<html>\n