html-parsing | 易学教程

How to find all text inside <p> elements in an HTML page using BeautifulSoup

Question: I need to find all the visible text inside paragraph elements in an HTML file using BeautifulSoup in Python. For example, <p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p> should return: Many hundreds of cultivars exist. P.S. Some files contain Unicode characters (Hindi) which need to be extracted. Any ideas how to do that? Answer 1: Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the
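The quoted answer is cut off, so its VALID_TAGS approach is not reproduced here. As a minimal sketch of one other way to collect paragraph text, the snippet below uses BeautifulSoup's find_all and get_text with the built-in "html.parser" backend. The helper name paragraph_texts and the Hindi sample sentence are made up for illustration. For the markup exactly as quoted in the question, get_text returns the whole sentence, including "named mango", because that text sits directly inside the <p>.

```python
from bs4 import BeautifulSoup

# Sample markup: the first paragraph is the one quoted in the question,
# the second is an invented Hindi sentence to exercise Unicode handling.
html = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
    '<p>आम भारत का राष्ट्रीय फल है।</p>'
)


def paragraph_texts(markup):
    """Return the text of every <p> element with nested tags stripped."""
    soup = BeautifulSoup(markup, "html.parser")
    # get_text() joins all text nodes under the tag, dropping the tags themselves.
    return [p.get_text() for p in soup.find_all("p")]


texts = paragraph_texts(html)
for text in texts:
    print(text)
```

In Python 3 the strings are already Unicode, so the Hindi text needs no special treatment as long as the file is read with the right encoding (for example open(path, encoding="utf-8")).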

Loop through <div> elements using PHP

Question: I have a block of HTML in a string that is basically a list of divs. Each div has HTML inside that I want to parse separately. I am having trouble figuring out exactly how to loop over the initial divs. Can anyone help? An example of the HTML: <div></div> <div></div> <div></div> <div></div> In this example I would expect the final code to loop four times and provide me with the contents of each div

Parsing HTML: lxml error in Python

Question: I am writing a simple script to fetch the big grey table from here. The code I have is the following: import urllib2 from lxml import etree html = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx").read() root = etree.XML(html) But I am getting an error on the last statement. Traceback (most recent call last): File "D:\Workspace\afi100\afi100.py", line 13, in <module> root = etree.XML(html) File "lxml.etree.pyx", line 2720, in lxml.etree.XML (src/lxml/lxml.etree.c:52577) File
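The traceback above is cut off, so the actual error message is not visible. One common cause of a failure at etree.XML is that real-world HTML is rarely well-formed XML. The sketch below illustrates that assumption with a small invented page (the table contents are placeholders, not data from the AFI site): etree.XML rejects it, while lxml.html.fromstring, which uses a forgiving HTML parser, accepts it.

```python
from lxml import etree, html as lxml_html

# Invented sample: <td> and <tr> are left unclosed, which is legal in
# HTML but not well-formed XML.
page = (
    "<html><body><table><tr>"
    "<td>Example Film<td>2000"
    "</table></body></html>"
)

# The strict XML parser refuses the markup.
try:
    etree.XML(page)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The HTML parser repairs the unclosed cells and builds a tree.
root = lxml_html.fromstring(page)
cells = [td.text_content().strip() for td in root.iter("td")]

print(xml_ok)
print(cells)
```

If this is indeed the cause, the same change applies to the downloaded page: pass the fetched bytes to lxml.html.fromstring (or etree.HTML) instead of etree.XML.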

Delphi: Some tip to parse this html table?

Question: For some time I have been trying to get data from this HTML table. I have tried both paid and free components, and I have tried writing my own code, with no results. I have a class that loads HTML tables directly into a ClientDataSet, but it does not work with this table. Does anyone have tips on how to get the data out of this HTML table, or a way to convert it to TXT, XLS, CSV, or XML? The code for the table follows: WebBrowser1.Navigate('http://site2.aesa.pb.gov.br/aesa/monitoramentoPluviometria.do?metodo

Using documentFragment to parse HTML without sending HTTP requests

Question: I'd like to parse a string and make a DOM tree out of it. I decided to use the documentFragment API, and this is what I have so far: var htmlString ="Some really really complicated html string that only can be parsed by a real browser!"; var fragment = document.createDocumentFragment('div'); var tempDiv = document.createElement('div'); fragment.appendChild(tempDiv); tempDiv.innerHTML = htmlString; console.log(tempDiv); But the problem is that this script causes my browser (Chrome specifically) to send actual

Nokogiri vs Hpricot?

Question: Which one would you choose? My important attributes are (not in order): Support and future enhancements. Community and general knowledge base (on the Internet). Comprehensive (i.e., proven to parse a wide range of *.*ml pages). Performance. Memory footprint (runtime, not the code-base). Answer 1: Pick Nokogiri, for all points and especially point one: Hpricot is no longer maintained. Meta answer: See ruby-toolbox to get an idea of the popularity of different tools in a given area. Answer 2: Only pick

How to get HTML from a beautiful soup object

Question: I have the following bs4 object listing: >>> listing <div class="listingHeader"> <h2> .... >>> type(listing) <class 'bs4.element.Tag'> I want to extract the raw HTML as a string. I've tried: >>> a = listing.contents >>> type(a) <type 'list'> So this does not work. How can I do this? Answer 1: Just get the string representation: html_content = str(listing) This is a non-prettified version. If you want a prettified one, use the prettify() method: html_content = listing.prettify()
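As a small runnable sketch of the answer above, the snippet below builds a stand-in for the listing tag from invented markup (the h2 text "Title" is a placeholder) and shows str() and prettify(). It also shows decode_contents(), which is not part of the quoted answer: it returns only the inner HTML, without the enclosing tag.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for the page the question refers to.
soup = BeautifulSoup(
    '<div class="listingHeader"><h2>Title</h2></div>', "html.parser"
)
listing = soup.find("div", class_="listingHeader")

outer_html = str(listing)                # the tag itself plus its children
inner_html = listing.decode_contents()   # children only, as one string
pretty_html = listing.prettify()         # indented, one node per line

print(outer_html)
print(inner_html)
print(pretty_html)
```

By contrast, listing.contents is a list of child nodes rather than a string, which is why the attempt in the question did not give the raw HTML.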

Error 410 (“resource no longer available”) while getting html code of an url in Python

Question: I am trying to get the HTML of the following link: http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html To do so, I proceeded as follows: import requests try: from BeautifulSoup import BeautifulSoup except ImportError: from bs4 import BeautifulSoup url='http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html' html=requests.get(url) And the HTML I get ( print(html.text) ) is the following: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head>