Raw HTML vs. DOM scraping in Python using mechanize and Beautiful Soup

Question

I am attempting to write a program that, as an example, will scrape the top price off of this web page:

http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults

First, I am easily able to retrieve the HTML by doing the following:

from BeautifulSoup import BeautifulSoup
import mechanize

webpage = 'http://www.kayak.com/#/flights/JFK-PAR/2012-06-01/2012-07-01/1adults'

# Fetch the page with mechanize; this returns the raw HTML sent by the server
br = mechanize.Browser()
data = br.open(webpage).get_data()

# Parse the raw HTML and print it
soup = BeautifulSoup(data)
print soup

However, the raw HTML does not contain the price. The browser does its thing (clarification here would also help me) and retrieves the price from elsewhere while it constructs the DOM tree.

I was led to believe that mechanize would act just like my browser and return the DOM tree, which I also believe is what I see in, for example, Chrome's Developer Tools view of the page. (If I'm wrong about this, how do I get at whatever the price information is stored in?) Is there something I need to tell mechanize to do in order to see the DOM tree?

Once I can get the DOM tree into Python, everything else I need to do should be a snap. Thanks!

Answer 1:

Mechanize and Beautiful Soup are excellent tools for web scraping in Python.

But you need to understand what each one is for:

Mechanize: mimics a browser's behavior on a web page.

BeautifulSoup: an HTML parser that works well even when the HTML is not well-formed.

Your problem seems to be JavaScript. The price is populated by an Ajax call made from JavaScript after the page loads. Mechanize, however, does not execute JavaScript, so any content produced by JavaScript remains invisible to it.
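To illustrate the point, here is a minimal sketch using only the standard library's html.parser. The page below is made up for the example (it is not Kayak's markup): the price element is empty in the served HTML, and a script fills it in only when a browser runs it. A parser that does not execute JavaScript, like the one behind mechanize and Beautiful Soup, sees the empty element.

```python
from html.parser import HTMLParser

# Hypothetical page: the price element is empty in the served HTML and is
# only filled in by the script once a browser executes it.
RAW_HTML = """
<html><body>
  <span id="price"></span>
  <script>
    document.getElementById("price").textContent = "$999";
  </script>
</body></html>
"""


class PriceFinder(HTMLParser):
    """Collects the text inside the element whose id is 'price'."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.price_text = ""

    def handle_starttag(self, tag, attrs):
        # Start collecting when the price element opens.
        if dict(attrs).get("id") == "price":
            self.in_price = True

    def handle_endtag(self, tag):
        # Stop collecting when the span closes.
        if tag == "span":
            self.in_price = False

    def handle_data(self, data):
        if self.in_price:
            self.price_text += data


finder = PriceFinder()
finder.feed(RAW_HTML)

# The script is treated as plain text and never run, so the price is empty.
print(repr(finder.price_text))
```

The "$999" string is present in the raw HTML, but only as script source; it never ends up inside the price element unless something actually executes the script.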

Take a look at this: http://github.com/davisp/python-spidermonkey/tree/master

It is a Python bridge to Mozilla's SpiderMonkey JavaScript engine, so it can be used to execute JavaScript alongside mechanize and Beautiful Soup.

Answer 2:

Answering my own question because in the years since asking this I have learned a lot. Today I would use Selenium WebDriver to do this job. Selenium is exactly the tool I was looking for back in 2012 for this type of web scraping project.

https://www.seleniumhq.org/download/

http://chromedriver.chromium.org/
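A rough sketch of what that could look like with Selenium's Python bindings, assuming Chrome and a matching ChromeDriver are installed. The function names fetch_rendered_html and extract_prices are made up for this example, and the CSS selector you pass in is something you would have to find yourself in Developer Tools; none of this reflects Kayak's actual markup.

```python
import re


def fetch_rendered_html(url, css_selector, timeout=30):
    """Load url in Chrome, wait for css_selector to appear, return the page HTML."""
    # Imported here so the parsing helper below works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Block until the JavaScript-populated element exists (or time out).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        # page_source is read after the wait, so scripts have had time to run.
        return driver.page_source
    finally:
        driver.quit()


def extract_prices(html):
    """Return every dollar amount such as $1,234 found in html, as integers."""
    return [int(m.replace(",", "")) for m in re.findall(r"\$(\d[\d,]*)", html)]
```

Usage would be along the lines of calling fetch_rendered_html(webpage, "span.price") with whatever selector matches the price element (the selector here is hypothetical), then passing the result to extract_prices or to Beautiful Soup. The explicit wait matters because the price arrives via Ajax some time after the initial page load.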

Source: https://stackoverflow.com/questions/9552773/raw-html-vs-dom-scraping-in-python-using-mechanize-and-beautiful-soup

Tags

python

dom

screen-scraping

mechanize