Retrieve information from a URL

ⅰ亾dé卋堺 · submitted on 2019-12-06 12:52:13

You could use an HTML parser like Jsoup. It allows you to select the HTML elements of interest using simple CSS selectors:

E.g.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Note: Jsoup.connect(...).get() throws IOException, so declare or handle it.
Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");

for (Element tag : tags) {
    System.out.println(tag.text());
}

which prints

Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer

Please note that you should read the website's robots.txt (if any) and its terms of service (if any), or your server might get IP-banned sooner or later.
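To illustrate what honoring robots.txt means in practice, here is a minimal sketch of checking a path against "User-agent: *" Disallow rules. The class name and the simplified prefix-matching logic are my own assumptions, not part of the robots.txt standard's full matching rules (wildcards, Allow precedence, etc. are ignored); a real crawler should use a proper robots.txt library.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Simplified sketch: returns true if the path is disallowed for all
    // user agents ("*"). Only honors "User-agent: *" groups and plain
    // prefix-matching Disallow rules; wildcards and Allow are ignored.
    static boolean isDisallowed(String robotsTxt, String path) {
        boolean inStarGroup = false;
        List<String> disallowed = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("user-agent:")) {
                String agent = trimmed.substring("user-agent:".length()).trim();
                inStarGroup = agent.equals("*");
            } else if (inStarGroup && trimmed.toLowerCase().startsWith("disallow:")) {
                String rule = trimmed.substring("disallow:".length()).trim();
                if (!rule.isEmpty()) {
                    disallowed.add(rule);
                }
            }
        }
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isDisallowed(robots, "/private/data")); // true
        System.out.println(isDisallowed(robots, "/work/9767358")); // false
    }
}
```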

I've done this before in PHP by scraping the page and then parsing the HTML as a string with regular expressions.

Example here

I imagine there's something equivalent in Java and other languages. The concept is the same:

  1. Load page data.
  2. Parse the data (e.g. with a regex, or via the DOM model using CSS selectors or XPath selectors).
  3. Do what you want with the data :)
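The steps above can be sketched in Java with a plain regex. The class name, helper method, and inline HTML snippet are hypothetical, chosen so the sketch runs offline; in a real scraper step 1 would fetch the page body over HTTP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrape {
    // Step 2: pull the text of every <a>...</a> out of an HTML string.
    // A regex works for simple, predictable markup; an HTML parser
    // (like Jsoup above) is more robust against markup changes.
    static List<String> extractLinkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = Pattern.compile("<a[^>]*>([^<]+)</a>").matcher(html);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        // Step 1 (load page data) is replaced by an inline snippet so the
        // sketch runs without network access; in practice fetch the page,
        // e.g. with java.net.http.HttpClient.
        String html = "<span class=\"tag\"><a href=\"#\">fantasy</a></span>"
                + "<span class=\"tag\"><a href=\"#\">Warhammer</a></span>";

        // Step 3: do what you want with the data.
        for (String text : extractLinkTexts(html)) {
            System.out.println(text);
        }
    }
}
```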

It's worth remembering that some people might not appreciate you data mining their site and profiting from / redistributing it on a large scale.
