How to scrape websites such as Hype Machine?

Submitted by 那年仲夏 on 2019-12-10 18:47:01

Question


I'm curious about website scraping (i.e. how it's done, etc.), and specifically I'd like to write a script to perform the task for the site Hype Machine. I'm actually a fourth-year Software Engineering undergraduate, but we don't really cover any web programming, so my understanding of JavaScript, RESTful APIs, and all things web is pretty limited; we're mainly focused on theory and client-side applications. Any help or direction is greatly appreciated.


Answer 1:


The first thing to look for is whether the site already offers some sort of structured data, or if you need to parse through the HTML yourself. It looks like there is an RSS feed of the latest songs. If that's what you're looking for, that would be a good place to start.

You can use a scripting language to download the feed and parse it. I use Python, but you could pick a different scripting language if you like. There are docs on how to download a URL in Python and how to parse XML in Python.
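
For illustration, here is a minimal sketch of downloading a feed and parsing the XML with nothing but Python's standard library. The feed URL below is an assumption, not a documented Hype Machine endpoint, so substitute whatever feed you actually find on the site.

```python
# Minimal sketch: download an RSS feed over HTTP and pull out item titles/links.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://hypem.com/feed/latest/1/feed.xml"  # assumed URL, replace with the real feed


def fetch_feed(url):
    """Download the raw feed bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def parse_items(xml_bytes):
    """Parse RSS 2.0 and yield (title, link) pairs for each <item>."""
    root = ET.fromstring(xml_bytes)
    for item in root.iter("item"):
        yield item.findtext("title", default=""), item.findtext("link", default="")


if __name__ == "__main__":
    for title, link in parse_items(fetch_feed(FEED_URL)):
        print(title, "->", link)
```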

Another thing to be conscious of when you write a program that downloads a site or RSS feed is how often your scraping script runs. If you have it run constantly so that you'll get the new data the second it becomes available, you'll put a lot of load on the site, and there's a good chance they'll block you. Try not to run your script more often than you need to.
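
As a sketch of what "not too often" might look like in practice, the loop below re-fetches the feed on a fixed interval; the interval is an arbitrary choice, and the helper functions come from the sketch above.

```python
# Polite polling loop: fetch the feed at a fixed interval instead of hammering the site.
import time

POLL_INTERVAL_SECONDS = 30 * 60  # every 30 minutes -- an arbitrary, conservative choice


def poll_forever():
    while True:
        try:
            items = list(parse_items(fetch_feed(FEED_URL)))  # helpers from the sketch above
            print(f"fetched {len(items)} items")
        except OSError as exc:
            print("fetch failed:", exc)  # log and retry on the next cycle rather than crashing
        time.sleep(POLL_INTERVAL_SECONDS)
```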




Answer 2:


You may want to check the following books:

"Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/CURL" http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593271204

"HTTP Programming Recipes for C# Bots" http://www.amazon.com/HTTP-Programming-Recipes-C-Bots/dp/0977320677

"HTTP Programming Recipes for Java Bots" http://www.amazon.com/HTTP-Programming-Recipes-Java-Bots/dp/0977320669




Answer 3:


I believe the most important thing to analyze is what kind of information you want to extract. If you want to crawl entire websites, as Google does, your best option is probably to look at tools like Nutch from Apache.org (nutch.apache.org) or Flaptor's solution at http://ww.hounder.org. If you need to extract particular areas from unstructured documents - websites, docs, PDFs - you can probably extend Nutch plugins to fit your particular needs.

On the other hand, if you need to extract particular text or clipped areas of a website by setting rules against the page's DOM, what you need is probably closer to tools like mozenda.com. With those tools you can set up extraction rules to scrape particular pieces of information from a website. Keep in mind that any change to the webpage will break your robot.
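
As an open-source illustration of the same idea (not how Mozenda itself works), a library such as BeautifulSoup lets you express an "extraction rule" as a CSS selector against the page's DOM; the URL and selector below are purely hypothetical, and the rule breaks the moment the markup changes, exactly as noted above.

```python
# DOM-rule extraction sketch using BeautifulSoup (pip install beautifulsoup4).
import urllib.request

from bs4 import BeautifulSoup


def extract_text(url, selector):
    """Return the text of every element matching a CSS selector on the page."""
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    return [node.get_text(strip=True) for node in soup.select(selector)]


# Hypothetical rule: grab track titles, assuming they live in <h3 class="track_name"><a>...</a></h3>.
titles = extract_text("http://hypem.com/popular", "h3.track_name a")
```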

Finally, if you are planning to build a website on top of these information sources, you could purchase the data from companies such as spinn3r.com, which sell particular niches of information ready to be consumed. You will be able to save a lot of money on infrastructure. Hope it helps! Sebastian.




Answer 4:


Python has the feedparser module, located at feedparser.org, which handles RSS and Atom in their various flavours. No reason to reinvent the wheel.
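
A minimal usage sketch, again assuming the same guessed feed URL as above:

```python
# feedparser normalizes RSS and Atom variants into one entries list (pip install feedparser).
import feedparser

feed = feedparser.parse("http://hypem.com/feed/latest/1/feed.xml")  # assumed URL
for entry in feed.entries:
    print(entry.title, entry.link)
```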



Source: https://stackoverflow.com/questions/3380230/how-to-scrape-websites-such-as-hype-machine
