html-parsing

How to find all text inside <p> elements in an HTML page using BeautifulSoup

感情迁移 submitted on 2020-01-01 19:38:06
Question: I need to find all the visible text inside paragraph elements in an HTML file using BeautifulSoup in Python. For example, <p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p> should return: Many hundreds of cultivars exist. P.S. Some files contain Unicode characters (Hindi) which need to be extracted. Any ideas how to do that? Answer 1: Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the
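
The answer above is cut off, so its VALID_TAGS code is not reproduced here. As a rough stand-in, the sketch below uses Python's standard-library html.parser instead of BeautifulSoup to collect the text of each <p> element while dropping nested tags such as <a>. The names ParagraphText and paragraph_texts are hypothetical, and the sketch assumes every <p> is explicitly closed. It works on str input, so Unicode text such as Hindi passes through unchanged.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of each <p> element, ignoring nested tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0        # greater than 0 while inside a <p>
        self.current = []     # text chunks of the paragraph being read
        self.paragraphs = []  # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Outermost </p> reached: store the joined text.
                self.paragraphs.append("".join(self.current))
                self.current = []

    def handle_data(self, data):
        # Keep text only while inside a paragraph.
        if self.depth:
            self.current.append(data)


def paragraph_texts(html):
    """Return a list with the text of every closed <p> element in `html`."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = ('<p>Many hundreds of named mango '
          '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>')
print(paragraph_texts(sample))
```

With BeautifulSoup itself, calling get_text() on each element returned by find_all("p") should give a comparable result.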

Loop through <div> elements using PHP

烈酒焚心 submitted on 2020-01-01 19:27:09
Question: I have a block of HTML in a string that is basically a list of divs. Each div has HTML inside that I want to parse separately. I am having trouble figuring out exactly how to loop over the initial divs. Can anyone help? An example of the HTML: <div><!-- stuff in here --></div> <div><!-- stuff in here --></div> <div><!-- stuff in here --></div> <div><!-- stuff in here --></div> In this example I would expect the final code to loop 4 times and provide me with the contents of each div

Parsing HTML: lxml error in Python

拜拜、爱过 submitted on 2020-01-01 16:39:33
Question: I am writing a simple script to fetch the big grey table from here. The code I have is the following: import urllib2 from lxml import etree html = urllib2.urlopen("http://www.afi.com/100years/movies10.aspx").read() root = etree.XML(html) But I am getting an error on the last statement. Traceback (most recent call last): File "D:\Workspace\afi100\afi100.py", line 13, in <module> root = etree.XML(html) File "lxml.etree.pyx", line 2720, in lxml.etree.XML (src/lxml/lxml.etree.c:52577) File

Delphi: Some tip to parse this html table?

时间秒杀一切 submitted on 2020-01-01 15:35:13
Question: For some time I have been trying to get data from this HTML table. I tried both paid and free components, and I tried writing some code myself, but got no results. I have a class that loads HTML tables directly into a ClientDataSet, but it does not work with this table. Does anyone have any tips on how to get the data in this HTML table? Or a way to convert it to txt / xls / csv or xml? The code for the table follows: WebBrowser1.Navigate('http://site2.aesa.pb.gov.br/aesa/monitoramentoPluviometria.do?metodo

Using documentFragment to parse HTML without sending HTTP requests

点点圈 submitted on 2020-01-01 15:03:33
Question: I'd like to parse a string and build a DOM tree out of it. I decided to use the documentFragment API, and this is what I have so far: var htmlString ="Some really really complicated html string that only can be parsed by a real browser!"; var fragment = document.createDocumentFragment('div'); var tempDiv = document.createElement('div'); fragment.appendChild(tempDiv); tempDiv.innerHTML = htmlString; console.log(tempDiv); But the problem is that this script causes my browser (Chrome specifically) to send actual

Nokogiri vs Hpricot?

断了今生、忘了曾经 submitted on 2019-12-31 08:58:05
Question: Which one would you choose? My important attributes are (not in order): Support and future enhancements. Community and general knowledge base (on the Internet). Comprehensive (i.e., proven to parse a wide range of *.*ml pages). Performance. Memory footprint (runtime, not the code-base). Answer 1: Pick Nokogiri, for all points and especially point one: Hpricot is no longer maintained. Meta answer: See ruby-toolbox to get an idea of the popularity of different tools in a given area. Answer 2: Only pick

How to get HTML from a beautiful soup object

∥☆過路亽.° submitted on 2019-12-31 08:26:46
Question: I have the following bs4 object listing: >>> listing <div class="listingHeader"> <h2> .... >>> type(listing) <class 'bs4.element.Tag'> I want to extract the raw HTML as a string. I've tried: >>> a = listing.contents >>> type(a) <type 'list'> So this does not work. How can I do this? Answer 1: Just get the string representation: html_content = str(listing) This is a non-prettified version. If you want a prettified one, use the prettify() method: html_content = listing.prettify() Source: https:/
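
The two calls from the answer above can be tried on a small stand-in element. This is only a sketch: it assumes the bs4 package is installed, and the markup and the "Title" text are made up for illustration, since the asker's real listing is elided.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the asker's `listing` element.
soup = BeautifulSoup('<div class="listingHeader"><h2>Title</h2></div>', "html.parser")
listing = soup.find("div", class_="listingHeader")

html_content = str(listing)        # compact markup as a plain string
pretty_html = listing.prettify()   # same markup, indented one node per line

print(html_content)
print(pretty_html)
```

Both calls return str, so the result can be written to a file or passed to another parser directly.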

Error 410 (“resource no longer available”) while getting html code of an url in Python

江枫思渺然 submitted on 2019-12-31 07:24:29
Question: I am trying to get the HTML of the following link: http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html To do so, I proceeded as follows: import requests try: from BeautifulSoup import BeautifulSoup except ImportError: from bs4 import BeautifulSoup url='http://www8.austlii.edu.au/cgi-bin/viewdoc/au/cases/cth/FCA/2006/3.html' html=requests.get(url) And the HTML I get ( print(html.text) ) is the following: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> <html><head>