html-parsing | 易学教程

Pin down exact content location in html for web scraping urllib2 Beautiful Soup


Question: I'm new to web scraping, have little exposure to HTML files, and want to know whether there is a better, more efficient way to find specific content in the HTML version of a web page. Currently, I want to scrape reviews for a product here: http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem For this, I have the following code: url = http://www.walmart.com/ip/29701960? wmlspartner=wlpa

BeautifulSoup Tag Removal


Question: I am looking to parse an HTML table with Python/BeautifulSoup... This is my first attempt at coding anything in Python, so it's probably not the most efficient. I grabbed a function from another post here (it works great for the most part), but I am running into a couple of problems. The code I am running is here: def strip_tags(html, invalid_tags): bs2 = BeautifulSoup(str(html)) for tag in bs2.findAll(True): if tag.name in invalid_tags: s = "" for c in tag.contents: if not isinstance(c,
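The snippet above is cut off, so the following is only a sketch of the general idea it appears to implement: drop selected tags while keeping the text and markup inside them. It is not the poster's function. It uses Python 3's standard-library html.parser instead of BeautifulSoup so that it runs without extra packages, and the class name TagStripper is hypothetical.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: re-emits the markup minus the unwanted tags."""

    def __init__(self, invalid_tags):
        # convert_charrefs=False keeps entities such as &amp; intact,
        # so they can be written back out unchanged.
        super().__init__(convert_charrefs=False)
        self.invalid = {t.lower() for t in invalid_tags}
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.invalid:
            # get_starttag_text() returns the tag exactly as written.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> arrive here.
        if tag not in self.invalid:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.invalid:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def strip_tags(html, invalid_tags):
    """Return html with invalid_tags removed but their contents kept.

    Comments and doctype declarations are dropped in this sketch.
    """
    parser = TagStripper(invalid_tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(strip_tags("<td><b>Total</b> <i>5 &amp; up</i></td>", ["b", "i"]))
```

With BeautifulSoup 4 itself, the unwrap() method on a tag does the same job for a single element.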

Help with Java Swing HTML parsing


Question: I am parsing a collection of HTML documents with the Java Swing HTML parsing libraries, and I am trying to isolate the text between <title> tags so that I can use it to identify the documents. I am having a hard time doing that, since the handleStartTag method doesn't have access to the text inside the tags. Answer 1: You can use XPath to pull out data from HTML: String html = //... //read the HTML into a DOM StreamSource source = new StreamSource(new StringReader(html)); DOMResult
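The difficulty described in the question is common to callback-style parsers: the start-tag callback fires before the text arrives, so the handler has to remember that it is inside <title> and collect the text in a separate text callback. The sketch below shows that pattern in Python's standard-library html.parser rather than in Java Swing, purely as an illustration; the names TitleExtractor and extract_title are hypothetical.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Hypothetical helper: remembers the text of the first <title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False   # True between <title> and </title>
        self._parts = []         # text chunks seen inside the title
        self.title = None        # final result, None if no title found

    def handle_starttag(self, tag, attrs):
        # The start-tag callback only flips a flag; no text is available yet.
        if tag == "title" and self.title is None:
            self._in_title = True

    def handle_data(self, data):
        # The text callback does the actual collecting.
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._parts).strip()


def extract_title(html):
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title


doc = "<html><head><title> Quarterly Report </title></head><body><p>x</p></body></html>"
print(extract_title(doc))
```

In Swing, the same flag would be set in handleStartTag and read in the handleText callback of HTMLEditorKit.ParserCallback.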

Scraping a website with clickable content in Python


Question: I would like to scrape the content of the following website: http://financials.morningstar.com/ratios/r.html?t=AMD There, under Key Ratios, I would like to click on the "Growth" button and then scrape the data in Python. How can I do that? Answer 1: You can solve it with requests + BeautifulSoup. There is an asynchronous GET request sent to the http://financials.morningstar.com/financials/getKeyStatPart.html endpoint which you need to simulate. The Growth table is located inside the div with id="tab

Parsing table content in php/regex and getting result by td


Question: I have a table like this, which I spent a full day trying to get the data from: <table class="table table-condensed"> <tr> <td>Monthely rent</td> <td><strong>Fr. 1'950. </strong></td> </tr> <tr> <td>Rooms(s)</td> <td><strong>3</strong></td> </tr> <tr> <td>Surface</td> <td><strong>93m2</strong></td> </tr> <tr> <td>Date of Contract</td> <td><strong>01.04.17</strong></td> </tr> </table> As you can see, the data is well organized, and I am trying to get this result: monthly rent => Fr. 1'950. Rooms
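The question asks for PHP and regex, but the label => value mapping it describes is easier to get from an HTML parser than from a pattern. As a rough sketch of that approach, here is a version in Python using only the standard-library html.parser; the names TwoColumnTableParser and table_to_dict are hypothetical, and the sample markup is copied from the question with its original spelling.

```python
from html.parser import HTMLParser

# Table markup as quoted in the question (spelling left untouched).
TABLE_HTML = """
<table class="table table-condensed">
<tr> <td>Monthely rent</td> <td><strong>Fr. 1'950. </strong></td> </tr>
<tr> <td>Rooms(s)</td> <td><strong>3</strong></td> </tr>
<tr> <td>Surface</td> <td><strong>93m2</strong></td> </tr>
<tr> <td>Date of Contract</td> <td><strong>01.04.17</strong></td> </tr>
</table>
"""


class TwoColumnTableParser(HTMLParser):
    """Hypothetical helper: collects the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of cell strings per <tr>
        self._cell = None    # text chunks of the current <td>, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []

    def handle_data(self, data):
        # Text outside a <td> (such as whitespace between cells) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            if not self.rows:            # tolerate a <td> with no <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def table_to_dict(html):
    """Map the first cell of every two-cell row to the second cell."""
    parser = TwoColumnTableParser()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}


for label, value in table_to_dict(TABLE_HTML).items():
    print(label, "=>", value)
```

Nested tags such as <strong> need no special handling here, because their text still reaches handle_data while the cell is open.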

beautifulsoup 4 + python: string returns 'None'


Question: I'm trying to parse some HTML with BeautifulSoup4 and Python 2.7.6, but the string is returning "None". The HTML I'm trying to parse is: <div class="booker-booking"> 2 rooms · USD 0  </div> The snippet from Python I have is: data = soup.find('div', class_='booker-booking').string I've also tried the following two: data = soup.find('div', class_='booker-booking').text data = soup.find('div', class_='booker-booking').contents[0] Which both return: u'\n\t\t2\xa0rooms \n\t
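In BeautifulSoup, .string is None whenever a tag has more than one child, which would explain the result if the real <div> contains nested markup. The excerpt does not show that markup, so the sample below assumes a <span> around the separator; that is a guess for illustration only. The sketch uses Python 3's standard-library html.parser to gather all text inside the div and then collapse the tabs, newlines, and non-breaking spaces (\xa0) seen in the quoted output. DivTextCollector and div_text are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed markup: the nested <span> is NOT shown in the question.
SAMPLE = (
    '<div class="booker-booking">\n\t\t2&nbsp;rooms '
    '<span>&middot;</span> USD&nbsp;0\n\t</div>'
)


class DivTextCollector(HTMLParser):
    """Hypothetical helper: gathers every text node inside a target <div>."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # > 0 while inside the target <div>
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1             # nested <div> inside the target
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1              # entering the target <div>

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.pieces.append(data)


def div_text(html, css_class):
    collector = DivTextCollector(css_class)
    collector.feed(html)
    collector.close()
    # str.split() with no argument splits on any Unicode whitespace,
    # including \xa0, so this also normalizes non-breaking spaces.
    return " ".join("".join(collector.pieces).split())


print(div_text(SAMPLE, "booker-booking"))
```

The same normalization, " ".join(text.split()), can be applied to the value returned by .text or get_text() in BeautifulSoup.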

Parsing a Table from the following website


Question: I want to collect the past weather details of a particular city in India for each day of the year 2016. The following website has this data: "https://www.timeanddate.com/weather/india/kanpur/historic?month=1&year=2016" This link has the data for January 2016. There is a nice table there that I want to extract. I have tried a lot, but I could only extract another table, which is this one. I do not want this one, because it does not serve my purpose. I want the other big table with data

How to programmatically load a HTML document in order to add to the document's <head>?


Question: We are supplied with HTML 'wrapper' files from the client, which we need to insert our content into, and then render the HTML. Before we render the HTML with our content inserted, I need to add a few tags to the <head> section of the client's wrapper, such as references to our script files, CSS, and some meta tags. So what I'm doing is string html = File.ReadAllText(wrapperLocation, Encoding.GetEncoding("iso-8859-1")); and now I have the complete HTML. I then search for a pre-defined content

Get specific data from a webpage


Question: I have a page, and for that page I need to get a value from another page. I just want to retrieve the 6 numbers in the "Números Sorteados" box. So far I have only succeeded in getting the whole web page with this: WebRequest request = WebRequest.Create("http://www1.caixa.gov.br/loterias/loterias/ultimos_resultados.asp"); WebResponse response = request.GetResponse(); Stream data = response.GetResponseStream(); string html = String.Empty; using (StreamReader sr = new StreamReader(data)) {

XML parser vs regex


Question: What should I use? I am going to fetch links, images, text, etc. and use them to build SEO statistics and an analysis of the page. What do you recommend: an XML parser or regex? I have been using regex and have never had any problems with it. However, I have been hearing from people that it cannot do some things, and so on. To be honest, I don't know why, but I am afraid to use an XML parser and prefer regex (it works and serves the purpose pretty well). So, if
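As a small illustration of the parser side of this trade-off, the sketch below collects link targets and image sources with Python's standard-library html.parser. The class name LinkImageCollector and the sample markup are made up for the example. The sample deliberately includes things that tend to trip simple patterns: upper-case tag names, single-quoted attributes, attributes in varying order, and a link inside an HTML comment.

```python
from html.parser import HTMLParser

# Made-up sample page fragment, for illustration only.
SAMPLE = """
<A class="nav" HREF='/about'>About</A>
<img alt="logo" src="/logo.png"/>
<a name="top">anchor without href</a>
<!-- <a href="/hidden">commented out</a> -->
"""


class LinkImageCollector(HTMLParser):
    """Hypothetical helper: gathers href values of <a> and src values of <img>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; quotes are already removed.
        # Self-closing tags such as <img .../> are routed here as well.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


collector = LinkImageCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.links)
print(collector.images)
```

The commented-out link is not reported, because the parser hands comments to a separate callback instead of treating their contents as markup.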