html-parsing | 易学教程

HtmlAgility - Save parsing to a string

Question: I just tried using the HtmlAgility Pack for the first time and have a problem. First I load the HTML from a string variable: string NewsText = dr["Message"].ToString(); HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument(); htmlDoc.LoadHtml(NewsText); //doing my stuff... Then I want to save my changes back into the string NewsText. How do I do that? htmlDoc.toString() didn't work. Thanks!
Answer 1: You're looking for htmlDoc.DocumentNode.OuterHtml.
Source: https://stackoverflow.com/questions

Fast and effective way to parse broken HTML?

Question: I'm working on large projects that require fast HTML parsing, including recovery from broken HTML pages. Currently lxml is my choice. I know it also provides an interface to libxml2's recovery mode, but I'm not really happy with the results. For some specific HTML pages I found that BeautifulSoup produces much better results (example: http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/, which has a broken <header> tag that lxml/libxml2 couldn't correct). However, the

Best way to parse an HTML table into a CSV

Question: I've got to grab some product data off an existing website to put into a database. The data is all in HTML table format. The model numbers are unique, but each product can have any number of different attributes (so the tables I need to parse all have different columns and headings). <table> <tr> <td>Model No.</td> <td>Weight</td> <td>Colour</td> <td>Etc..</td> </tr> <tr> <td>8572</td> <td>12 Kg</td> <td>Red</td> <td>Blah..</td> </tr> <tr> <td>7463</td> <td>7 Kg</td> <td>Blue</td> <td>Blah..<
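
The excerpt ends before any answer, so the following is only a minimal sketch of one possible approach, not the accepted solution. It assumes Python and uses only the standard library (html.parser and csv). The names TableRowCollector and table_to_csv are hypothetical, and the sample input is limited to the two complete rows visible in the question. Each <tr> becomes one CSV row, so tables with different columns and headings need no special handling.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Hypothetical helper: gathers <td>/<th> text, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read (None outside <tr>)
        self._cell = None  # text chunks of the cell being read (None outside a cell)

    def _finish_cell(self):
        # Store the current cell, if any, in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _finish_row(self):
        # Store the current row, if any, including a cell left unclosed.
        self._finish_cell()
        if self._row is not None:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._finish_row()  # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._finish_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._finish_cell()
        elif tag in ("tr", "table"):
            self._finish_row()


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Sample input: the two complete rows shown in the question.
SAMPLE = (
    "<table> <tr> <td>Model No.</td> <td>Weight</td> <td>Colour</td> <td>Etc..</td> </tr> "
    "<tr> <td>8572</td> <td>12 Kg</td> <td>Red</td> <td>Blah..</td> </tr> </table>"
)

csv_text = table_to_csv(SAMPLE)
print(csv_text)
```

This sketch does not handle nested tables or colspan/rowspan. For real pages a dedicated HTML parsing library may be more robust.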

Getting non-contiguous text with lxml / ElementTree

Question: Suppose I have this sort of HTML from which I need to select "text2" using lxml / ElementTree: <div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div> If I already have the div element as mydiv, then mydiv.text returns just "text1". Using itertext() seems problematic or cumbersome at best, since it walks the entire tree under the div. Is there any simple/elegant way to extract a non-first text chunk from an element?
Answer 1: Well, lxml.etree provides full XPath support, which
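
The answer excerpt is cut off before its XPath example, so the sketch below does not reproduce it. Instead it shows a different route that relies on the ElementTree data model: the text following a child element's closing tag is stored in that child's tail attribute. The example uses the standard library's xml.etree.ElementTree on the question's own fragment. lxml.etree elements expose the same text and tail attributes.

```python
import xml.etree.ElementTree as ET

# The fragment from the question; it is well-formed, so the XML parser accepts it.
mydiv = ET.fromstring(
    "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
)

# .text holds the text before the first child element.
first_chunk = mydiv.text

# Each child's .tail holds the text between its closing tag and the next tag.
second_chunk = mydiv[0].tail
third_chunk = mydiv[1].tail

# All text chunks that belong directly to the div, without descending
# into the children (unlike itertext()).
direct_text = [mydiv.text] + [child.tail for child in mydiv]

print(second_chunk)
print(direct_text)
```

Note that text and tail are None when there is no text at that position, so real code may need to filter those values out.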

how to extract main text from html using Tika

Question: I just want to know how I can extract the main text and plain text from HTML using Tika. Maybe one possible solution is to use BoilerPipeContentHandler, but do you have some sample/demo code to show it? Thanks very much in advance.
Answer 1: Here is a sample: public String[] tika_autoParser() { String[] result = new String[3]; try { InputStream input = new FileInputStream(new File("/Users/nazanin/Books/Web crawler.pdf")); ContentHandler textHandler = new BodyContentHandler(); Metadata metadata =

How to get a table from an html page using JAVA

Question: I am working on a project where I am trying to fetch financial statements from the internet and use them in a Java application to automatically create ratios and charts. The site I am using requires a login and password to get to the tables. The tag is TBODY, but there are 2 other TBODYs in the HTML. How can I use Java to print my table to a txt file that I can then use in my application? What would be the best way to go about this, and what should I read up on?
Answer 1: If this were my project, I'd

Bulletproofing SimpleXMLElement

Question: Everyone knows that we should always use DOM techniques instead of regexes to extract content from HTML, but I get the feeling that I can never trust the SimpleXML extension or similar ones. I'm coding an OpenID implementation right now, and I tried using SimpleXML to do the HTML discovery, but my very first test (with alixaxel.myopenid.com) yielded a lot of errors: Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 27: parser error : Opening and ending tag

Need python lxml syntax help for parsing html

Question: I am brand new to Python, and I need some help with the syntax for finding and iterating through HTML tags using lxml. Here are the use cases I am dealing with: The HTML file is fairly well formed (but not perfect). It has multiple tables on screen, one containing a set of search results, and one each for a header and footer. Each result row contains a link for the search result detail. I need to find the middle table with the search result rows (this one I was able to figure out): self

extracting element and insert a space

Question: I'm parsing HTML using BeautifulSoup in Python, and I don't know how to insert a space when extracting the text of an element. This is the code: import BeautifulSoup soup=BeautifulSoup.BeautifulSoup('<html>this<b>is</b>example</html>') print soup.text The output is thisisexample, but I want spaces inserted so that it reads this is example. How do I insert a space?
Answer 1: Use getText instead: import BeautifulSoup soup=BeautifulSoup.BeautifulSoup('<html>this<b>is</b>example</html>') print soup.getText(separator=u' ')
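
The answer targets the old BeautifulSoup 3 module on Python 2. In the current bs4 package the corresponding call is get_text(separator=' '). To illustrate what the separator does without any third-party dependency, here is a rough standard-library sketch of the same idea. It is not BeautifulSoup's implementation, and the names TextChunks and text_with_separator are hypothetical.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Hypothetical helper: collects each run of text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:  # skip whitespace-only runs
            self.chunks.append(stripped)


def text_with_separator(html, separator=" "):
    """Join the text runs of `html` with `separator` between them."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


result = text_with_separator("<html>this<b>is</b>example</html>")
print(result)
```

Without a separator the three text runs are concatenated directly, which is why the original soup.text call printed thisisexample.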

Extract data from website via PHP

Question: I am trying to create a simple alert app for some friends. Basically I want to be able to extract the "price" and "stock availability" data from a webpage like the following two: http://www.sparkfun.com/commerce/product_info.php?products_id=5 http://www.sparkfun.com/commerce/product_info.php?products_id=9279 I have built the e-mail and SMS alert part, but now I want to be able to get the quantity and price out of the webpages (those 2 or any other ones) so that I can compare the price and