lxml

Web scraping with BeautifulSoup or lxml.html

Submitted by 谁都会走 on 2019-11-27 17:18:41
Question: I have seen some webcasts and need help trying to do this. I have been using lxml.html, but Yahoo recently changed its web structure. Target page: http://finance.yahoo.com/quote/IBM/options?date=1469750400&straddle=true. In Chrome, using the inspector, I see the data in //*[@id="main-0-Quote-Proxy"]/section/section/div[2]/section/section/table, then some more code. How do I get this data out into a list? I also want to change to another stock, from "LLY" to "Msft". How do I switch between dates, and get all…
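The question trails off, but a hedged sketch of the general approach follows. The table XPath and the ticker/date query parameters come from the question itself; Yahoo's current markup almost certainly differs, so treat the XPath as a placeholder and the helper names as hypothetical:

```python
import lxml.html

def extract_table_rows(html_text, table_xpath):
    """Parse an HTML document and return the first table matched by
    table_xpath as a list of rows, each row a list of cell texts."""
    root = lxml.html.fromstring(html_text)
    tables = root.xpath(table_xpath)
    if not tables:
        return []
    return [
        [cell.text_content().strip() for cell in row.xpath("./td | ./th")]
        for row in tables[0].xpath(".//tr")
    ]

def options_url(symbol, epoch_date):
    # The stock symbol and expiration date are plain query parameters,
    # so switching ticker ("LLY" -> "Msft") or date is string substitution.
    return ("http://finance.yahoo.com/quote/%s/options?date=%d&straddle=true"
            % (symbol, epoch_date))
```

To switch dates, substitute a different epoch timestamp into `options_url`; the valid timestamps are themselves listed in the page's date dropdown.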

lxml installation error on Ubuntu 14.04 (internal compiler error)

Submitted by ╄→尐↘猪︶ㄣ on 2019-11-27 16:59:28
Question: I am having problems installing lxml. I have tried the solutions from the related questions on this site and other sites, but could not fix the problem. I need some suggestions or a solution. This is the full log after executing pip install lxml: Downloading/unpacking lxml Downloading lxml-3.3.5.tar.gz (3.5MB): 3.5MB downloaded Running setup.py (path:/tmp/pip_build_root/lxml/setup.py) egg_info for package lxml /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'bugtrack_url' warnings.warn(msg) Building lxml version 3.3.5. Building without Cython. …
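A sketch of the standard advice for this symptom on Ubuntu 14.04 (not taken from the question's log): lxml compiles C extensions against libxml2/libxslt, so the development headers must be present, and an "internal compiler error" during a pip build frequently means gcc was killed for lack of RAM on a small VPS, which a temporary swap file works around:

```shell
# Install the headers lxml's C extensions build against
sudo apt-get install libxml2-dev libxslt1-dev python-dev zlib1g-dev

# If the compiler still dies, add temporary swap so gcc has enough memory
sudo dd if=/dev/zero of=/swapfile bs=1M count=1024
sudo mkswap /swapfile
sudo swapon /swapfile

pip install lxml

# Remove the temporary swap afterwards
sudo swapoff /swapfile
sudo rm /swapfile
```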

XPath vs DOM vs BeautifulSoup vs lxml vs others: which is the fastest approach to parsing a webpage?

Submitted by 本秂侑毒 on 2019-11-27 16:44:04
Question: I know how to parse a page using Python. My question is: of all the parsing techniques, which is the fastest, and by how much? The techniques I know are XPath, DOM, BeautifulSoup, and Python's find method. Answer 1: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ Answer 2: lxml is written in C, and if you are on x86 it is the best choice. As for techniques, there is no big difference between XPath and DOM; both are very fast. But if you…
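One way to ground the speed question is to measure it rather than argue it. The sketch below times lxml's C-based parser with an XPath query against the standard library's pure-Python html.parser on a synthetic document; the document size and iteration count are arbitrary choices for illustration:

```python
import timeit
import lxml.html
from html.parser import HTMLParser

html = "<html><body>" + "<p class='x'>hello</p>" * 2000 + "</body></html>"

def with_lxml():
    # C parser plus XPath evaluation
    root = lxml.html.fromstring(html)
    return [p.text for p in root.xpath("//p[@class='x']")]

class TextGrabber(HTMLParser):
    # Event-driven pure-Python parser from the standard library
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.texts = []
    def handle_starttag(self, tag, attrs):
        if tag == "p" and ("class", "x") in attrs:
            self.in_p = True
    def handle_data(self, data):
        if self.in_p:
            self.texts.append(data)
            self.in_p = False

def with_stdlib():
    parser = TextGrabber()
    parser.feed(html)
    return parser.texts

assert with_lxml() == with_stdlib()
print("lxml:   %.3fs" % timeit.timeit(with_lxml, number=20))
print("stdlib: %.3fs" % timeit.timeit(with_stdlib, number=20))
```

The absolute numbers depend on the machine; what matters is running both paths on the same document.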

lxml parser eats all memory

Submitted by 纵饮孤独 on 2019-11-27 16:10:46
Question: I'm writing a spider in Python, using the lxml library for parsing HTML and the gevent library for async I/O. I found that after some time of work the lxml parser starts eating memory, up to 8 GB (all the server's memory), yet I have only 100 async threads, each of them parsing documents of at most 300 KB. I've tested and found that the problem starts in lxml.html.fromstring, but I can't reproduce it. The problem is in this line of code: HTML = lxml.html.fromstring(htmltext). Maybe someone knows what it can be, or how to fix…
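The question is cut off, but a common mitigation sketch for this symptom: make sure no Element references outlive the extraction, since any live reference pins the whole parsed document in libxml2's memory. The href extraction below is a hypothetical stand-in for whatever the spider actually pulls out:

```python
import lxml.html

def parse_and_extract(htmltext):
    """Parse one document, copy out plain-Python results, and drop every
    reference to the tree before returning so libxml2 can free it."""
    root = lxml.html.fromstring(htmltext)
    try:
        # Return plain strings, never Element objects, to the caller
        return [a.get("href") for a in root.xpath("//a[@href]")]
    finally:
        # Release element contents explicitly rather than waiting for GC
        root.clear()
```

If memory still grows after this, running the parsing in worker subprocesses that are recycled periodically bounds the damage regardless of the cause.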

python lxml append element after another element

Submitted by 谁都会走 on 2019-11-27 15:58:37
Question: I have the following HTML markup: <div id="contents"> <div id="content_nav"> something goes here </div> <p> some contents </p> </div> To fix a CSS issue, I want to append a div tag <div style="clear:both"></div> after the content_nav div, like this: <div id="contents"> <div id="content_nav"> something goes here </div> <div style="clear:both"></div> <p> some contents </p> </div> I am doing it this way: import lxml.etree tree = lxml.etree.fromstring(inputString, parser=lxml.etree.HTMLParser())
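Completing the truncated snippet: lxml's addnext() inserts a new element as the following sibling, which is exactly the "append after" the question asks for. (Note the method is fromstring(), all lowercase, not fromString().) A minimal runnable sketch:

```python
import lxml.etree

inputString = '''<div id="contents">
  <div id="content_nav">something goes here</div>
  <p>some contents</p>
</div>'''

tree = lxml.etree.fromstring(inputString, parser=lxml.etree.HTMLParser())
content_nav = tree.xpath('//div[@id="content_nav"]')[0]

# Build the new element and attach it as content_nav's next sibling
clear_div = lxml.etree.Element("div", style="clear:both")
content_nav.addnext(clear_div)

print(lxml.etree.tostring(tree, pretty_print=True).decode())
```

The HTMLParser wraps the fragment in html/body elements on parsing, so serialize just the contents div if you need the fragment back on its own.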

Python Web Crawling: HTML Parsing with BeautifulSoup

Submitted by 巧了我就是萌 on 2019-11-27 15:46:14
BeautifulSoup is a Python library for extracting data from HTML and XML files. It provides simple functions for navigating, searching, and modifying the parse tree. Its search and extraction features are very powerful and very convenient, and can save a programmer hours or even days of work. BeautifulSoup automatically converts input documents to Unicode and output documents to UTF-8, so the user does not need to worry about encodings, unless the document does not declare one, in which case BeautifulSoup cannot detect the encoding automatically. Installing BeautifulSoup: download the BeautifulSoup source from https://www.crummy.com/software/BeautifulSoup/bs4/download/, open cmd, change into the beautifulsoup4-4.8.0 directory, and run python setup.py install. Using BeautifulSoup: with BeautifulSoup installed, we can start using it. (1) Import the bs4 library, then create a string simulating HTML code, as follows: from bs4 import BeautifulSoup # import the BeautifulSoup library # create a string simulating HTML code html_doc = """ <html><head><title>The Dormouse's story<
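Continuing the truncated example: html_doc is the "Dormouse's story" snippet from the BeautifulSoup documentation. A shortened, runnable version using the standard library parser (pass "lxml" instead of "html.parser" if lxml is installed):

```python
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute-style access to the first matching tag
print(soup.title.string)

# find() / find_all() search by tag name and attributes
print(soup.find("p", class_="title").get_text())
print(len(soup.find_all("p")))
```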

python lxml - modify attributes

Submitted by 可紊 on 2019-11-27 15:43:10
Question: from lxml import objectify, etree root = etree.fromstring('''<?xml version="1.0" encoding="ISO-8859-1" ?> <scenario> <init> <send channel="channel-Gy"> <command name="CER"> <avp name="Origin-Host" value="router1dev"></avp> <avp name="Origin-Realm" value="realm.dev"></avp> <avp name="Host-IP-Address" value="0x00010a248921"></avp> <avp name="Vendor-Id" value="11"></avp> <avp name="Product-Name" value="HP Ro Interface"></avp> <avp name="Origin-State-Id" value="1094807040"></avp> <avp name=
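The question text is cut off, but given the title ("modify attributes"), the usual answer is lxml's dict-like attrib interface with .get() and .set(). A sketch against a trimmed version of the scenario document above (note that fromstring needs bytes when the XML declaration names an encoding):

```python
from lxml import etree

root = etree.fromstring('''<?xml version="1.0" encoding="ISO-8859-1"?>
<scenario>
  <init>
    <send channel="channel-Gy">
      <command name="CER">
        <avp name="Origin-Host" value="router1dev"/>
        <avp name="Vendor-Id" value="11"/>
      </command>
    </send>
  </init>
</scenario>'''.encode("ISO-8859-1"))

# Locate the element by attribute and rewrite its value in place
avp = root.xpath('//avp[@name="Origin-Host"]')[0]
avp.set("value", "router2dev")

print(etree.tostring(root, pretty_print=True).decode())
```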

lxml: add namespace to input file

Submitted by 微笑、不失礼 on 2019-11-27 15:30:58
I am parsing an XML file generated by an external program. I would then like to add custom annotations to this file, using my own namespace. My input looks like this: <sbml xmlns="http://www.sbml.org/sbml/level2/version4" xmlns:celldesigner="http://www.sbml.org/2001/ns/celldesigner" level="2" version="4"> <model metaid="untitled" id="untitled"> <annotation>...</annotation> <listOfUnitDefinitions>...</listOfUnitDefinitions> <listOfCompartments>...</listOfCompartments> <listOfSpecies> <species metaid="s1" id="s1" name="GenA" compartment="default" initialAmount="0"> <annotation> <celldesigner
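The excerpt stops mid-document, but the general lxml recipe for adding elements in your own namespace is to use fully qualified tag names plus an nsmap that binds the prefix. The namespace URI and tag name below are hypothetical; the SBML default namespace is from the input above:

```python
from lxml import etree

SBML_NS = "http://www.sbml.org/sbml/level2/version4"
MY_NS = "http://example.org/my-annotations"   # hypothetical custom namespace

doc = etree.fromstring(
    '<sbml xmlns="%s" level="2" version="4"><model id="untitled"/></sbml>'
    % SBML_NS)

# Elements of the input live in the default SBML namespace, so lookups
# must use the Clark-notation qualified name
model = doc.find("{%s}model" % SBML_NS)

# Qualified tag + nsmap makes lxml serialize the element as <my:note ...>
note = etree.SubElement(model, "{%s}note" % MY_NS, nsmap={"my": MY_NS})
note.text = "custom annotation"

print(etree.tostring(doc, pretty_print=True).decode())
```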

Python web scraping involving HTML tags with attributes

Submitted by 為{幸葍}努か on 2019-11-27 15:13:10
Question: I'm trying to make a web scraper that will parse a web page of publications and extract the authors. The skeletal structure of the web page is the following: <html> <body> <div id="container"> <div id="contents"> <table> <tbody> <tr> <td class="author">####I want whatever is located here ###</td> </tr> </tbody> </table> </div> </div> </body> </html> I've been trying to use BeautifulSoup and lxml to accomplish this task so far, but I'm not sure how to handle the two div tags and the td tag…
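A sketch of one answer: the intermediate div and tbody levels need not be walked explicitly; an XPath keyed on the class attribute reaches the author cells directly. The author names below are placeholders for the page content:

```python
import lxml.html

page = """<html><body><div id="container"><div id="contents">
<table><tbody>
<tr><td class="author">Jane Doe</td></tr>
<tr><td class="author">John Smith</td></tr>
</tbody></table>
</div></div></body></html>"""

root = lxml.html.fromstring(page)
# "//" skips any number of intermediate levels, so the nested divs,
# table, and tbody never need to be named
authors = [td.text_content().strip()
           for td in root.xpath('//div[@id="contents"]//td[@class="author"]')]
print(authors)
```

The BeautifulSoup equivalent would be soup.find_all("td", class_="author"); either way the nesting depth is irrelevant.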

"Web Scraping with Python" Reading Notes 1 (Chapter 1: Your First Web Scraper)

Submitted by 限于喜欢 on 2019-11-27 15:04:22
Preface: All of the code samples in this book are on GitHub (https://github.com/REMitchell/python-scraping), where they can be viewed and downloaded. For a more comprehensive introduction to Python, Bill Lubanovic's "Introducing Python" is a very good textbook. (I have not read it yet; the introductory book I chose was "Python Crash Course".) Supplemental material (code samples, exercises, etc.) can be downloaded from https://github.com/REMitchell/python-scraping. Chapter 1: Your First Web Scraper. 1.1 Connecting: from urllib.request import urlopen html = urlopen('http://pythonscraping.com/pages/page1.html') print(html.read()) Save the code above as scrapetest.py, then run the following command in a terminal: python scrapetest.py This outputs the complete HTML code of the page at http://pythonscraping.com/pages/page1.html. More precisely, it outputs the source of the HTML file page1.html in the /pages folder of the <web application root> on the server at the domain http://pythonscraping.com. b'<html>\n