html-content-extraction

php : parse html : extract script tags from body and inject before </body>?

给你一囗甜甜゛ submitted on 2019-12-05 10:34:13
Question: I don't care what the library is, but I need a way to extract <script> elements from the <body> of a page (as a string). I then want to insert the extracted <script>s just before </body>. Ideally, I'd like to sort the extracted <script>s into two types: 1) external (those that have the src attribute) and 2) embedded (those with code between <script></script>). So far I've tried phpDOM, Simple HTML DOM and Ganon. I've had no luck with any of them (I can find links and remove/print them -

Beautifulsoup get value in table

北城以北 submitted on 2019-12-04 19:17:24
I am trying to scrape http://www.co.jefferson.co.us/ats/displaygeneral.do?sch=000104 and get the "owner Name(s)" What I have works but is really ugly and not the best I am sure, so I am looking for a better way. Here is what I have: soup = BeautifulSoup(url_opener.open(url)) x = soup('table', text = re.compile("Owner Name")) print 'And the owner is', x[0].parent.parent.parent.tr.nextSibling.nextSibling.next.next.next The relevant HTML is <td valign="top"> <table border="1" cellpadding="1" cellspacing="0" align="right"> <tbody><tr class="tableheaders"> <td>Owner Name(s)</td> </tr> <tr> <td

PHP - how to get main HTML content like Reader Mode in Firefox

给你一囗甜甜゛ submitted on 2019-12-04 14:40:55
Question: In the Firefox app on Android and in Safari on iPad, "Reader Mode" lets us read only the main content of a page. How can I recognize only the main content of an HTML page with PHP? I need to detect the main news item the way Firefox or Safari does. For example, I fetch news from bbcsite.com/news/123 with this code: <?php $html = file_get_contents('http://bbcsite.com/news/123'); ?> and then want to show only the main news, without ads and so on, like Firefox and Safari. I found fivefilters.org, a site that can extract the content. Thank you. Answer 1: A new PHP library

How do you parse a poorly formatted HTML file?

*爱你&永不变心* submitted on 2019-12-04 08:04:42
I have to parse a series of web pages in order to import data into an application. Each type of web page provides the same kind of data. The problem is that the HTML of each page is different, so the location of the data varies. Another problem is that the HTML code is poorly formatted, making it impossible to use an XML-like parser. So far, the best strategy I can think of is to define a template for each kind of page, like: Template A: <html> ... <tr><td>Table column that is missing a td <td> Another table column</td></tr> <tr><td>$data_item_1$</td> ... </html> Template B: <html> ... <ul><li

php : parse html : extract script tags from body and inject before </body>?

倖福魔咒の submitted on 2019-12-03 23:09:57
I don't care what the library is, but I need a way to extract <script> elements from the <body> of a page (as a string). I then want to insert the extracted <script>s just before </body>. Ideally, I'd like to sort the extracted <script>s into two types: 1) external (those that have the src attribute) and 2) embedded (those with code between <script></script>). So far I've tried phpDOM, Simple HTML DOM and Ganon. I've had no luck with any of them (I can find links and remove/print them - but fail with scripts every time!). Alternative to https://stackoverflow.com/questions/23414887/php
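The question asks for PHP, but the idea is language-neutral, so here is a minimal sketch in Python using only the standard library. It assumes reasonably regular markup: the regular expressions will misfire if a script contains the literal text "</script>" inside a string, or if an attribute value contains ">". The function name move_scripts_to_body_end is hypothetical, chosen for illustration. For messy real-world pages a proper DOM parser is likely the safer route.

```python
import re

# Assumption: scripts are written as plain <script ...>...</script> pairs.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# A script counts as external when its opening tag carries a src attribute.
SRC_RE = re.compile(r"<script\b[^>]*\ssrc\s*=", re.IGNORECASE)
BODY_OPEN_RE = re.compile(r"<body\b[^>]*>", re.IGNORECASE)
BODY_CLOSE_RE = re.compile(r"</body\s*>", re.IGNORECASE)


def move_scripts_to_body_end(html):
    """Return (new_html, external_scripts, embedded_scripts).

    Scripts found inside <body> are removed from where they are and
    re-inserted, in their original order, just before the last </body>.
    Scripts in <head> are left alone.
    """
    open_match = BODY_OPEN_RE.search(html)
    close_matches = list(BODY_CLOSE_RE.finditer(html))
    if not open_match or not close_matches:
        return html, [], []  # no recognizable body: leave the page as-is

    start = open_match.end()
    end = close_matches[-1].start()
    if end < start:
        return html, [], []

    body = html[start:end]
    scripts = SCRIPT_RE.findall(body)
    external = [s for s in scripts if SRC_RE.match(s)]
    embedded = [s for s in scripts if not SRC_RE.match(s)]

    # Strip the scripts, then append them in document order so that
    # scripts depending on earlier ones still load after them.
    new_body = SCRIPT_RE.sub("", body) + "".join(scripts)
    return html[:start] + new_body + html[end:], external, embedded


sample = (
    "<html><head><script src='head.js'></script></head>"
    '<body><script src="a.js"></script><p>Hi</p>'
    "<script>var x = 1;</script></body></html>"
)
result, external, embedded = move_scripts_to_body_end(sample)
print(result)
```

Returning the two lists separately keeps the external/embedded split available to the caller, while the page itself keeps the scripts in their original relative order.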

Python HTML scraping

元气小坏坏 submitted on 2019-12-03 21:14:07
It's not really scraping, I'm just trying to find the URLs in a web page where the class has a specific value. For example: <a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e"> I want to get the href value. Any ideas on how to do this? Maybe regex? Could you post some example code? I'm guessing HTML scraping libs, such as BeautifulSoup, are a bit of overkill just for this... Huge thanks! Regex is usually a bad idea, try using BeautifulSoup. Quick example (class names are matched case-sensitively, so the value must be 'myClass' as in the markup): html = #get html soup = BeautifulSoup(html) links = soup.findAll('a', attrs={'class': 'myClass'}) for link in links: #process link
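If pulling in BeautifulSoup feels like overkill, the same lookup can be sketched with the standard library's html.parser module, which still avoids regex. This is a minimal sketch, not the answerer's code: the names ClassLinkCollector and find_links_by_class are hypothetical, and the sample markup only mirrors the anchor tag from the question.

```python
from html.parser import HTMLParser


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated names.
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.wanted_class in classes and href:
            self.hrefs.append(href)


def find_links_by_class(html, wanted_class):
    parser = ClassLinkCollector(wanted_class)
    parser.feed(html)
    parser.close()
    return parser.hrefs


sample = (
    '<p><a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">x</a> '
    '<a class="other" href="/skip">y</a></p>'
)
print(find_links_by_class(sample, "myClass"))
```

The class comparison is case-sensitive, so searching for "myclass" would not match the "myClass" anchor above.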

What algorithms could I use to identify content on a web page

江枫思渺然 submitted on 2019-12-03 12:40:32
Question: I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements) that likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such. Answer 1: This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm. Answer 2: First, if you need to parse a web page, I would use HTMLAgilityPack to transform

PHP - how to get main HTML content like Reader Mode in Firefox

妖精的绣舞 submitted on 2019-12-03 09:10:39
In the Firefox app on Android and in Safari on iPad, "Reader Mode" lets us read only the main content of a page. How can I recognize only the main content of an HTML page with PHP? I need to detect the main news item the way Firefox or Safari does. For example, I fetch news from bbcsite.com/news/123 with this code: <?php $html = file_get_contents('http://bbcsite.com/news/123'); ?> and then want to show only the main news, without ads and so on, like Firefox and Safari. I found fivefilters.org, a site that can extract the content. Thank you. A new PHP library named PHP Goose seems to do a very good job at this too. It's pretty easy to use and is Composer friendly.

python method to extract content (excluding navigation) from an HTML page

好久不见. submitted on 2019-12-03 05:23:39
Question: Of course an HTML page can be parsed using any number of Python parsers, but I'm surprised that there don't seem to be any public parsing scripts to extract meaningful content (excluding sidebars, navigation, etc.) from a given HTML doc. I'm guessing it's something like collecting DIV and P elements and then checking them for a minimum amount of text content, but I'm sure a solid implementation would include plenty of things that I haven't thought of. Answer 1: Try the Beautiful Soup library for
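The heuristic the question guesses at (collect DIV and P elements, then keep those with a minimum amount of text) can be sketched with the standard library alone. This is only an illustration under stated assumptions: the tag sets, the 40-character threshold, and the names TextBlockCollector and extract_main_text are all hypothetical choices, and a solid implementation would need more signals (link density, nesting, class names) than this shows.

```python
from html.parser import HTMLParser

# Assumption: text inside these elements is never main content.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Assumption: these elements delimit candidate text blocks.
BLOCK_TAGS = {"p", "div"}


class TextBlockCollector(HTMLParser):
    """Split a page into text blocks at <p>/<div> boundaries."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks, in document order
        self._buffer = []     # text pieces of the block being read
        self._skip_depth = 0  # > 0 while inside a skipped element

    def _flush(self):
        # Join the pieces and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._flush()
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text after the last tag


def extract_main_text(html, min_chars=40):
    """Return the text blocks that are at least min_chars long."""
    parser = TextBlockCollector()
    parser.feed(html)
    parser.close()
    return [block for block in parser.blocks if len(block) >= min_chars]


sample = """
<html><body>
<nav><p>Home | About | Contact and a few more navigation links here</p></nav>
<div id="sidebar"><p>Login</p></div>
<div><p>This paragraph is long enough to count as real article content.</p>
<p>Short.</p></div>
<footer>Copyright notice that is also fairly long but sits in a footer.</footer>
</body></html>
"""
print(extract_main_text(sample))
```

One known weakness of this sketch: an unclosed element from SKIP_TAGS in badly formed HTML would cause everything after it to be skipped.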

What is the state of the art in HTML content extraction?

半腔热情 submitted on 2019-12-03 00:30:02
Question: There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice? Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for. Postscript the first :