lxml

How to get data from a web page in Selenium WebDriver

非 Y 不嫁゛ posted on 2019-11-28 12:58:49
Question: I want to fetch the company name, email, and phone number from this link and put them into an Excel file. I want to do the same for all pages of the website. I have the logic to fetch the links in the browser and switch between them, but I'm unable to fetch the data from the website. Can anybody suggest an improvement to the code I have written? Below is the code:

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from
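Since the original link and page structure are not shown, here is a minimal sketch of the kind of extraction the question asks about; the URL and all CSS selectors are hypothetical placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("http://www.example.com/companies")  # hypothetical listing URL

# One container element per company; the selectors below are assumptions.
for row in driver.find_elements(By.CSS_SELECTOR, "div.listing"):
    name = row.find_element(By.CSS_SELECTOR, ".company-name").text
    email = row.find_element(By.CSS_SELECTOR, ".email").text
    phone = row.find_element(By.CSS_SELECTOR, ".phone").text
    print(name, email, phone)

driver.quit()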

Get all text from an XML document?

孤人 posted on 2019-11-28 12:45:29
How can I get all the text content of an XML document as a single string, like this Ruby/hpricot example, but using Python? I'd like to replace XML tags with a single whitespace. EDIT: This is an answer posted when I thought one-space indentation was normal, and as the comments mention it's not a good answer. Check out the others for some better solutions. This is left here solely for archival reasons; do not follow it! You asked for lxml:

reslist = list(root.iter())
result = ' '.join([element.text for element in reslist])

Or:

result = ''
for element in root.iter():
    result += element.text + '
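The idiomatic lxml (and ElementTree) way to do this is Element.itertext(), which also avoids the crash the snippet above hits when an element's .text is None. A minimal sketch:

from lxml import etree

root = etree.fromstring("<doc><a>Hello <b>nested</b> world</a><c>again</c></doc>")
# itertext() yields every text node in document order, including tails,
# so joining with a space effectively replaces each tag with whitespace.
result = " ".join(t.strip() for t in root.itertext() if t.strip())
print(result)  # Hello nested world again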

Fixing the lxml pip installation error on Python 3.5

不打扰是莪最后的温柔 posted on 2019-11-28 12:40:57
Installing lxml with pip on Python 3.4 fails with "Unable to find vcvarsall.bat". The usual fix is to install Visual Studio 2015, but installing several gigabytes of VS just to build one library is clearly not worth it. First, download the MSVC build tools matching your Python version:

- 3.5 and later: Visual C++ Build Tools 2015 or Visual Studio 2015
- 3.3 and 3.4: Windows SDK for Windows 7 and .NET 4.0 (alternatively, Visual Studio 2010 if you have access to it)
- 2.6 to 3.2: Microsoft Visual C++ Compiler for Python 2.7

Then go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml, search for lxml, and find the .whl file matching your Python version. In the download directory, run pip install <whl filename>.

Source: oschina Link: https://my.oschina.net/u/130566/blog/714630
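For example, if the downloaded wheel were lxml-3.6.4-cp35-cp35m-win_amd64.whl (a hypothetical filename; use whatever file you actually downloaded for your Python version and architecture), the install command would be:

cmd >> pip install lxml-3.6.4-cp35-cp35m-win_amd64.whl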

Parse several XML declarations in a single file by means of lxml.etree.iterparse

岁酱吖の posted on 2019-11-28 12:39:05
I need to parse a file that contains several XML documents, i.e., <xml></xml> <xml></xml> and so forth. While using etree.iterparse, I get the following (correct) error: lxml.etree.XMLSyntaxError: XML declaration allowed only at the start of the document. Now, I could preprocess the input file and produce a separate file for each contained XML document. This might be the easiest solution, but I wonder if a proper solution for this 'problem' exists. Thanks! The sample data you've provided suggests one problem, while the question and the exception you've provided suggest another. Do you have multiple
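One common workaround, sketched here under the assumption that each embedded document starts with its own <?xml ...?> declaration, is to split the stream just before each declaration and parse every chunk separately (unlike iterparse, this buffers one whole document at a time; "combined.xml" is a hypothetical input file):

import re
from lxml import etree

with open("combined.xml", "rb") as f:
    data = f.read()

# A lookahead split keeps the declaration attached to its document.
for chunk in re.split(rb"(?=<\?xml)", data):
    if not chunk.strip():
        continue
    root = etree.fromstring(chunk)
    print(root.tag)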

Remove all HTML in Python?

删除回忆录丶 posted on 2019-11-28 11:49:10
Is there a way to remove/escape HTML tags using lxml.html, and not BeautifulSoup, which has some XSS issues? I tried using Cleaner, but I want to remove all HTML. Try the .text_content() method on an element, probably best after using lxml.html.clean to get rid of unwanted content (script tags etc.). For example:

from lxml import html
from lxml.html.clean import clean_html

tree = html.parse('http://www.example.com')
tree = clean_html(tree)
text = tree.getroot().text_content()

I believe this code can help you:

from lxml.html.clean import Cleaner
html_text = "<html><head><title>Hello<
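A complete sketch combining the two suggestions above: strip active content with a Cleaner first, then flatten what remains to plain text. (Note that in lxml 5.2 and later the clean module has moved to the separate lxml_html_clean package.)

from lxml import html
from lxml.html.clean import Cleaner

html_text = ("<html><head><script>alert('x')</script></head>"
             "<body><p>Hello <b>world</b></p></body></html>")

# Remove scripts, inline JavaScript, and styles before extracting text.
cleaner = Cleaner(scripts=True, javascript=True, style=True)
tree = cleaner.clean_html(html.fromstring(html_text))
print(tree.text_content())  # Hello world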

Iterparse big XML with a low memory footprint and get all, even nested, sequence elements

左心房为你撑大大i posted on 2019-11-28 11:37:51
Question: I have written a small Python script to parse XML data, based on Liza Daly's blog post. However, my code does not parse all the nodes: for example, when a person has multiple addresses, it takes only the first available address. The XML tree looks like this:

lgs
  entities
    entity
      id
      name
      addressess
        address
          address1
        address
          address1
    entity
      id
(...)

and this is the Python script:

import os
import time
from datetime import datetime
import lxml.etree
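The usual fix for this pattern, sketched for the tree above ("entities.xml" is a hypothetical file name), is to fire on the end event of the repeating wrapper element (entity) and read all nested occurrences with findall() before clearing the element:

from lxml import etree

context = etree.iterparse("entities.xml", events=("end",), tag="entity")
for _, entity in context:
    name = entity.findtext("name")
    # findall() with a descendant path returns every address, not just the first.
    addresses = [a.findtext("address1") for a in entity.findall(".//address")]
    print(name, addresses)
    # Keep memory flat: clear the element and drop earlier siblings from the root.
    entity.clear()
    while entity.getprevious() is not None:
        del entity.getparent()[0]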

Scraping website data with a Python crawler

半世苍凉 posted on 2019-11-28 11:32:27
1. Preparation:
1.1 Install requests: cmd >> pip install requests
1.2 Install lxml: cmd >> pip install lxml
1.3 Install wheel: cmd >> pip install wheel
1.4 Install xlwt: cmd >> pip install xlwt
2. Writing the code
2.1 Use requests.get to fetch the page (result shown in the original post's screenshot)
2.2 Use lxml to turn the response into an XPath-queryable structure
2.3 Extract exactly the data you need
2.4 Use a for...in loop to print the data. Note: in print(tr.xpath(".//td/text()")), if the leading dot is missing, the loop prints the same content on every iteration; the original screenshot shows the output without the dot, followed by the correct version.
2.5 Keep only the data you need
3. Use xlwt to create an Excel workbook and store the data
3.1 Create the Excel workbook (run result shown in the original post's screenshot)
3.2 Add the data to the workbook
3.3 Add the data in batches (by incrementing j)
3.4 Add data from multiple pages
The final code:

import requests
from lxml import etree
import xlwt

# Set a browser User-Agent header to tell the server the request comes
# from a browser; this helps avoid basic anti-scraping blocks.
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36
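A condensed, runnable sketch of the steps above; the URL and the row/cell XPath expressions are hypothetical placeholders, and the User-Agent string is abbreviated:

import requests
from lxml import etree
import xlwt

headers = {"User-Agent": "Mozilla/5.0"}  # abbreviated UA string
resp = requests.get("http://www.example.com/list", headers=headers)

tree = etree.HTML(resp.text)        # step 2.2: build an XPath-queryable tree
rows = tree.xpath("//table//tr")    # step 2.3: hypothetical row selector

book = xlwt.Workbook()              # step 3.1: create the workbook
sheet = book.add_sheet("data")
for i, tr in enumerate(rows):
    # The leading dot keeps the query relative to this row (see the note above).
    cells = tr.xpath(".//td/text()")
    for j, value in enumerate(cells):  # step 3.3: j increments per column
        sheet.write(i, j, value)
book.save("data.xls")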

Beautiful Soup and Table Scraping - lxml vs html parser

我们两清 posted on 2019-11-28 11:29:51
I'm trying to extract the HTML code of a table from a webpage using BeautifulSoup:

<table class="facts_label" id="facts_table">...</table>

I would like to know why the code below works with "html.parser" but prints None if I change "html.parser" to "lxml".

#! /usr/bin/python
from bs4 import BeautifulSoup
from urllib import urlopen

webpage = urlopen('http://www.thewebpage.com')
soup = BeautifulSoup(webpage, "html.parser")
table = soup.find('table', {'class' : 'facts_label'})
print table

There is a special paragraph in the BeautifulSoup documentation called Differences between parsers; it
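A small sketch showing how the parser choice changes what BeautifulSoup recovers from malformed markup; the example markup is hypothetical, and the exact output depends on the installed lxml and Python versions, since each parser repairs broken HTML differently:

from bs4 import BeautifulSoup

# Malformed markup: an unclosed <table> nested inside a <p>.
markup = "<p><table class='facts_label'><tr><td>42</td></p>"

for parser in ("html.parser", "lxml"):
    soup = BeautifulSoup(markup, parser)
    # Each parser may place (or drop) the table differently in the repaired tree.
    print(parser, "->", soup.find("table", {"class": "facts_label"}))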

Use lxml to parse text file with bad header in Python

♀尐吖头ヾ posted on 2019-11-28 11:28:28
I would like to parse text files (stored locally) with lxml's etree. But all of my files (thousands of them) have headers such as:

-----BEGIN PRIVACY-ENHANCED MESSAGE-----
Proc-Type: 2001,MIC-CLEAR
Originator-Name: webmaster@www.sec.gov
Originator-Key-Asymmetric:
 MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen
 TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB
MIC-Info: RSA-MD5,RSA,
 AHxm/u6lqdt8X6gebNqy9afC2kLXg+GVIOlG/Vrrw/dTCPGwM15+hT6AZMfDSvFZ
 YVPEaPjyiqB4rV/GS2lj6A==

<SEC-DOCUMENT>0001193125-07-200376.txt : 20070913
<SEC-HEADER>0001193125-07-200376.hdr.sgml : 20070913
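One common approach, sketched here, is to slice off everything before the first tag of the real payload (<SEC-DOCUMENT> in the sample above) and hand the remainder to a forgiving parser; SEC filings are SGML-like rather than well-formed XML, so lxml.html is a safer choice than etree.fromstring. The file name is a hypothetical placeholder:

from lxml import html

with open("filing.txt") as f:
    raw = f.read()

# Drop the privacy-enhanced message header that precedes the document.
start = raw.find("<SEC-DOCUMENT>")
if start == -1:
    raise ValueError("no <SEC-DOCUMENT> marker found")

doc = html.document_fromstring(raw[start:])
print(doc.text_content()[:200])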

Is it possible for lxml to work in a case-insensitive manner?

泪湿孤枕 posted on 2019-11-28 11:16:24
I'm trying to scrape META keywords and description tags from arbitrary websites. I obviously have no control over those websites, so I have to take what I'm given. They use a variety of casings for the tag and its attributes, which means I need to work case-insensitively. I can't believe that the lxml authors are so stubborn as to insist on fully forced standards compliance when it excludes much of the use of their library. I'd like to be able to say doc.cssselect('meta[name=description]') (or some XPath equivalent), but this will not catch <meta name="Description" Content="..."> tags, due to the capital D
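XPath 1.0 (which lxml implements) has no lower-case() function, but the standard workaround is translate(). A sketch that matches the description meta tag regardless of casing, using a minimal hypothetical document:

from lxml import html

doc = html.fromstring(
    '<html><head><meta name="Description" Content="An example page">'
    '</head><body></body></html>'
)
# translate() maps A-Z to a-z before comparing, making the match
# case-insensitive against attribute values like "Description".
tags = doc.xpath(
    '//meta[translate(@name, "ABCDEFGHIJKLMNOPQRSTUVWXYZ",'
    ' "abcdefghijklmnopqrstuvwxyz") = "description"]'
)
# lxml's HTML parser lowercases attribute *names* (but not values),
# so the Content attribute is read back as "content".
print([t.get("content") for t in tags])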