html-parsing

HtmlAgility - Save parsing to a string

懵懂的女人 submitted on 2019-12-21 06:49:02
Question: I just tried the HtmlAgility Pack for the first time and have a problem. First I load the HTML from a string variable:
    string NewsText = dr["Message"].ToString();
    HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
    htmlDoc.LoadHtml(NewsText);
    // doing my stuff...
Then I want to save my changes back into the string NewsText. How do I do that? htmlDoc.toString() didn't work. Thanks!
Answer 1: You're looking for htmlDoc.DocumentNode.OuterHtml.
Source: https://stackoverflow.com/questions

Fast and effective way to parse broken HTML?

空扰寡人 submitted on 2019-12-21 06:08:08
Question: I'm working on large projects that require fast HTML parsing, including recovery for broken HTML pages. Currently lxml is my choice. I know it also provides an interface to libxml2's recovery mode, but I'm not really happy with the results. For some specific HTML pages I found that BeautifulSoup produces much better results (example: http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/, which has a broken <header> tag that lxml/libxml2 couldn't correct). However, the

Best way to parse an HTML table into a CSV

情到浓时终转凉″ submitted on 2019-12-21 05:50:08
Question: I have to grab some product data from an existing website and put it into a database. The data is all in HTML table format. The model numbers are unique, but each product can have any number of different attributes, so the tables I need to parse all have different columns and headings.
    <table>
      <tr> <td>Model No.</td> <td>Weight</td> <td>Colour</td> <td>Etc..</td> </tr>
      <tr> <td>8572</td> <td>12 Kg</td> <td>Red</td> <td>Blah..</td> </tr>
      <tr> <td>7463</td> <td>7 Kg</td> <td>Blue</td> <td>Blah..<
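One possible approach to the table-to-CSV question, sketched here in Python with only the standard library. This is an illustration, not the answer given in the original thread: the class `TableRows`, the function `table_to_csv`, and the shortened `sample` table are all hypothetical names and data introduced for this example. Because each row is collected as a plain list of cell strings, tables with different columns and headings need no special handling.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def _close_cell(self):
        # Flush the open cell (if any) into the open row (if any).
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td> before the next cell
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.rows.append(self._row)
            self._row = None


def table_to_csv(html):
    """Return the table rows found in `html` as CSV text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# Hypothetical sample, shortened from the table in the question.
sample = (
    "<table>"
    "<tr><td>Model No.</td><td>Weight</td><td>Colour</td></tr>"
    "<tr><td>8572</td><td>12 Kg</td><td>Red</td></tr>"
    "</table>"
)
csv_text = table_to_csv(sample)
print(csv_text)
```

This sketch treats the first row as an ordinary row. Merging tables whose headings differ into one database schema would be a separate step.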

Getting non-contiguous text with lxml / ElementTree

邮差的信 submitted on 2019-12-21 05:36:22
Question: Suppose I have this sort of HTML, from which I need to select "text2" using lxml / ElementTree:
    <div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>
If I already have the div element as mydiv, then mydiv.text returns just "text1". Using itertext() seems problematic, or cumbersome at best, since it walks the entire tree under the div. Is there a simple, elegant way to extract a non-first text chunk from an element?
Answer 1: Well, lxml.etree provides full XPath support, which
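A minimal sketch of another route, separate from the XPath answer quoted above (which is cut off in this excerpt): in the ElementTree data model, text that follows a child element is stored on that child as its `tail` attribute. The example uses the standard library's `xml.etree.ElementTree`; lxml elements expose `text` and `tail` in the same way. The variable names `first_span` and `direct_chunks` are introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# The markup from the question, parsed as a well-formed fragment.
mydiv = ET.fromstring(
    "<div>text1<span>childtext1</span>text2<span>childtext2</span>text3</div>"
)

# Text that follows a child element lives on that child as .tail,
# so "text2" is the tail of the first <span>.
first_span = mydiv[0]
text2 = first_span.tail

# All direct text chunks of the div, without descending into the children.
direct_chunks = [mydiv.text] + [child.tail for child in mydiv]

print(text2)
print(direct_chunks)
```

Unlike `itertext()`, this reads only the div's own text and the tails of its direct children, so the spans' inner text is never visited.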

how to extract main text from html using Tika

假装没事ソ submitted on 2019-12-21 05:11:26
Question: How can I extract the main text and the plain text from HTML using Tika? One possible solution might be to use BoilerPipeContentHandler, but do you have some sample or demo code that shows it? Thanks very much in advance.
Answer 1: Here is a sample:
    public String[] tika_autoParser() {
        String[] result = new String[3];
        try {
            InputStream input = new FileInputStream(new File("/Users/nazanin/Books/Web crawler.pdf"));
            ContentHandler textHandler = new BodyContentHandler();
            Metadata metadata =

How to get a table from an html page using JAVA

烂漫一生 submitted on 2019-12-21 04:42:13
Question: I am working on a project where I am trying to fetch financial statements from the internet and use them in a Java application to automatically create ratios and charts. The site I am using requires a login and password to get to the tables. The tag I need is TBODY, but there are two other TBODY elements in the HTML. How can I use Java to print my table to a text file that I can then use in my application? What would be the best way to go about this, and what should I read up on?
Answer 1: If this were my project, I'd

Bulletproofing SimpleXMLElement

徘徊边缘 submitted on 2019-12-21 01:59:31
Question: Everyone knows that we should always use DOM techniques instead of regexes to extract content from HTML, but I get the feeling that I can never trust the SimpleXML extension or similar ones. I'm coding an OpenID implementation right now, and I tried using SimpleXML to do the HTML discovery, but my very first test (with alixaxel.myopenid.com) yielded a lot of errors:
    Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 27: parser error : Opening and ending tag

Need python lxml syntax help for parsing html

允我心安 submitted on 2019-12-20 11:54:07
Question: I am brand new to Python, and I need some help with the syntax for finding and iterating through HTML tags using lxml. Here are the use cases I am dealing with. The HTML file is fairly well formed (but not perfect). It has multiple tables on screen: one containing a set of search results, and one each for a header and a footer. Each result row contains a link to the search result detail. I need to find the middle table with the search result rows (this one I was able to figure out): self

extracting element and insert a space

為{幸葍}努か submitted on 2019-12-20 10:31:36
Question: I'm parsing HTML using BeautifulSoup in Python, and I don't know how to insert a space when extracting the text of an element. This is the code:
    import BeautifulSoup
    soup=BeautifulSoup.BeautifulSoup('<html>this<b>is</b>example</html>')
    print soup.text
The output is thisisexample, but I want a space inserted between the pieces, like this is example. How do I insert a space?
Answer 1: Use getText instead:
    import BeautifulSoup
    soup=BeautifulSoup.BeautifulSoup('<html>this<b>is</b>example</html>')
    print soup.getText(separator=u' ')
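The idea behind the separator argument, joining each text node with a chosen string, can also be sketched without BeautifulSoup. The example below swaps in Python 3's standard-library `html.parser` rather than the library used in the thread. `TextChunks` and `text_with_separator` are hypothetical names for this illustration, and unlike `getText` this sketch strips whitespace from each chunk and drops empty ones.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect each non-empty run of text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Assumption for this sketch: surrounding whitespace is not wanted.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def text_with_separator(html, separator=" "):
    """Return the text of `html` with `separator` between text nodes."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


joined = text_with_separator("<html>this<b>is</b>example</html>")
print(joined)  # this is example
```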

Extract data from website via PHP

非 Y 不嫁゛ submitted on 2019-12-20 08:50:53
Question: I am trying to create a simple alert app for some friends. Basically, I want to be able to extract the "price" and "stock availability" data from web pages like the following two: http://www.sparkfun.com/commerce/product_info.php?products_id=5 http://www.sparkfun.com/commerce/product_info.php?products_id=9279 I have built the e-mail and SMS alert part, but now I want to be able to get the quantity and price out of the web pages (those two or any others) so that I can compare the price and