html-content-extraction

What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005), Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice? Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for. Postscript the first: To be precise, the kind of survey I'm after would be a paper (published, unpublished, whatever)

RegEx for extracting HTML Image properties

I need a RegEx pattern for extracting all the properties of an image tag. As we all know, there is a lot of malformed HTML out there, so the pattern has to cover those possibilities. I was looking at this solution, https://stackoverflow.com/questions/138313/how-to-extract-img-src-title-and-alt-from-html-using-php, but it didn't quite get it all. I came up with something like: (alt|title|src|height|width)\s*=\s*["'][\W\w]+?["'] Are there any cases I'm missing, or is there a simpler, more efficient pattern? EDIT: Sorry, I will be more specific: I'm doing this using .NET, so it's on the server side. I've
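The pattern in the question misses unquoted values (width=100) and attributes outside its fixed list. As a point of comparison, here is a minimal sketch that collects image attributes with a tolerant HTML parser instead of a regex. It is written in Python with the standard-library html.parser module, not .NET, and the names ImgAttrCollector and extract_img_attrs are hypothetical, chosen for illustration only.

```python
from html.parser import HTMLParser


class ImgAttrCollector(HTMLParser):
    """Collects the attributes of every <img> tag as a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names, and also routes
        # self-closing tags (<img ... />) through this method by default.
        if tag == "img":
            self.images.append(dict(attrs))


def extract_img_attrs(html):
    """Return one attribute dict per <img> tag found in the given HTML string."""
    collector = ImgAttrCollector()
    collector.feed(html)
    collector.close()
    return collector.images


if __name__ == "__main__":
    sample = "<IMG SRC='a.png' alt=\"A cat\" width=100><img src=\"b.jpg\" title='B' />"
    for attrs in extract_img_attrs(sample):
        print(attrs)
```

The parser copes with mixed quoting, unquoted values, and upper-case tag names, which are the cases a hand-written pattern tends to miss. The same idea should carry over to an HTML parsing library on the .NET side, though that is outside what this sketch shows.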

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

I am crawling news websites and want to extract the news title, news abstract (first paragraph), etc. I plugged into the WebKit parser code to easily navigate a web page as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm comparing the text of various articles from the same website, which results in similar text being eliminated. This gives me the content minus the common navigation content, etc. Despite the above approach, I am still getting quite a lot of junk in my final text. This
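The cross-article comparison described above can be approximated without a full diff. The sketch below is a simplified stand-in, not the asker's implementation: it drops any line that appears in more than a chosen share of pages from the same site. The function name strip_common_lines and the max_share parameter are assumptions made for this example.

```python
from collections import Counter


def strip_common_lines(pages, max_share=0.5):
    """Remove lines that occur in more than max_share of the given page texts.

    pages is a list of plain-text strings (tags already stripped), all taken
    from the same website. Returns the cleaned texts in the same order.
    """
    # Count in how many pages each non-empty, whitespace-trimmed line occurs.
    page_counts = Counter()
    for page in pages:
        unique_lines = {line.strip() for line in page.splitlines() if line.strip()}
        page_counts.update(unique_lines)

    threshold = max_share * len(pages)
    cleaned = []
    for page in pages:
        kept = [
            line.strip()
            for line in page.splitlines()
            if line.strip() and page_counts[line.strip()] <= threshold
        ]
        cleaned.append("\n".join(kept))
    return cleaned


if __name__ == "__main__":
    pages = [
        "Home | News\nStory A text\nCopyright",
        "Home | News\nStory B text\nCopyright",
        "Home | News\nStory C text\nCopyright",
    ]
    print(strip_common_lines(pages))
```

Line-level matching only removes boilerplate that is textually identical across pages, so per-page junk such as related-article teasers or comment counts would survive it, which may be one source of the leftover noise mentioned in the question.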

How do I save a web page, programmatically?

I would like to save a web page programmatically. I don't mean merely save the HTML. I would also like to automatically store all associated files (images, CSS files, maybe embedded SWF, etc.), and hopefully rewrite the links for local browsing. The intended usage is a personal bookmarks application, in which link content is cached in case the original copy is taken down.

Take a look at wget, specifically the -p flag:

-p, --page-requisites: This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds,

Using Beautiful Soup Python module to replace tags with plain text

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's how I got started with it. I was able to successfully get most of the content but I am running into some challenges with tags that are part of the content. (I am starting off with a basic strategy of: if there are more than x-chars in a node then it is content). Let's take the html code below as an example: <div id="abc"> some long text goes <a href="/"> here </a> and hopefully it will get picked up by the parser as content <
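One way to keep inline tags such as the link from splitting a content node is to fold their text into the enclosing block before applying the character-count rule. The sketch below illustrates that idea with Python's standard-library html.parser rather than Beautiful Soup. The tag sets, the names BlockTextCollector and content_blocks, and the min_chars threshold are all assumptions, and the sample input is a completed version of the truncated snippet above.

```python
from html.parser import HTMLParser

# Assumed, non-exhaustive tag lists for this sketch.
INLINE_TAGS = {"a", "b", "i", "em", "strong", "span"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockTextCollector(HTMLParser):
    """Gathers text per block element, treating inline tags as plain text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one text buffer per open block-level element
        self.blocks = []  # normalized text of each closed block

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS and tag not in VOID_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS or tag in VOID_TAGS or not self.stack:
            return
        # Collapse runs of whitespace left behind by the removed inline tags.
        text = " ".join("".join(self.stack.pop()).split())
        if text:
            self.blocks.append(text)

    def handle_data(self, data):
        # Text inside an inline tag lands in the enclosing block's buffer.
        if self.stack:
            self.stack[-1].append(data)


def content_blocks(html, min_chars=40):
    """Return block texts with at least min_chars characters."""
    collector = BlockTextCollector()
    collector.feed(html)
    collector.close()
    return [block for block in collector.blocks if len(block) >= min_chars]


if __name__ == "__main__":
    sample = (
        '<div id="abc"> some long text goes <a href="/"> here </a> and hopefully '
        "it will get picked up by the parser as content </div><div>nav</div>"
    )
    print(content_blocks(sample))
```

The sketch assumes block elements are properly closed. Real pages often are not, which is where a more forgiving parser such as Beautiful Soup would still be the better foundation.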

Possible to parse an HTML document and build a DOM tree (Java)

Is it possible, and what tools could be used, to parse an HTML document from a string or a file and then construct a DOM tree, so that a developer can walk the tree through some API? For example: DomRoot = parse("myhtml.html"); for (tags : DomRoot) { } Note: this is an HTML document, not XHTML.

You can use TagSoup. It is a SAX-compliant parser that can clean malformed content, such as HTML from generic web pages, into well-formed XML. This is <B>bold, <I>bold italic, </b>italic, </i>normal text gets correctly rewritten as: This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.

Create Great Parser - Extract Relevant Text From HTML/Blogs

I'm trying to create a generalized HTML parser that works well on blog posts. I want to point my parser at a specific entry's URL and get back clean text of the post itself. My basic approach (from Python) has been to use a combination of BeautifulSoup and urllib2, which is okay, but it assumes you know the proper tags for the blog entry. Does anyone have any better ideas? Here are some thoughts that maybe someone could expand upon, since I don't have enough knowledge or know-how yet to implement them. The Unix program 'lynx' seems to parse blog posts especially well - what parser do they use, or how could