Retrieve information from a URL

ⅰ亾dé卋堺 · submitted on 2019-12-06 12:52:13

You could use an HTML parser like Jsoup. It allows you to select the HTML elements of interest using simple CSS selectors:

E.g.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Note: Jsoup.connect(...).get() throws IOException, so declare or handle it.
Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");

for (Element tag : tags) {
    System.out.println(tag.text());
}

which prints

Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer

Please note that you should read the website's robots.txt (if any) and its terms of service (if any), or your server might get IP-banned sooner or later.
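To illustrate what honoring robots.txt means in practice, here is a minimal sketch of checking a path against "User-agent: *" Disallow rules. The class name and the simplified prefix-matching logic are my own assumptions, not part of the robots.txt standard's full matching rules (wildcards, Allow precedence, etc. are ignored); a real crawler should use a proper robots.txt library.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {
    // Simplified sketch: returns true if the path is disallowed for all
    // user agents ("*"). Only honors "User-agent: *" groups and plain
    // prefix-matching Disallow rules; wildcards and Allow are ignored.
    static boolean isDisallowed(String robotsTxt, String path) {
        boolean inStarGroup = false;
        List<String> disallowed = new ArrayList<>();
        for (String line : robotsTxt.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.toLowerCase().startsWith("user-agent:")) {
                String agent = trimmed.substring("user-agent:".length()).trim();
                inStarGroup = agent.equals("*");
            } else if (inStarGroup && trimmed.toLowerCase().startsWith("disallow:")) {
                String rule = trimmed.substring("disallow:".length()).trim();
                if (!rule.isEmpty()) {
                    disallowed.add(rule);
                }
            }
        }
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isDisallowed(robots, "/private/data")); // true
        System.out.println(isDisallowed(robots, "/work/9767358")); // false
    }
}
```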

I've done this before in PHP by scraping the page and then parsing the HTML as a string with regular expressions.

Example here

I imagine there's something equivalent in Java and other languages. The concept is the same:

  1. Load page data.
  2. Parse the data (e.g. with a regex, or via the DOM model using CSS selectors or XPath selectors).
  3. Do what you want with the data :)
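The steps above can be sketched in Java with a plain regex. The class name, helper method, and inline HTML snippet are hypothetical, chosen so the sketch runs offline; in a real scraper step 1 would fetch the page body over HTTP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexScrape {
    // Step 2: pull the text of every <a>...</a> out of an HTML string.
    // A regex works for simple, predictable markup; an HTML parser
    // (like Jsoup above) is more robust against markup changes.
    static List<String> extractLinkTexts(String html) {
        List<String> texts = new ArrayList<>();
        Matcher m = Pattern.compile("<a[^>]*>([^<]+)</a>").matcher(html);
        while (m.find()) {
            texts.add(m.group(1));
        }
        return texts;
    }

    public static void main(String[] args) {
        // Step 1 (load page data) is replaced by an inline snippet so the
        // sketch runs without network access; in practice fetch the page,
        // e.g. with java.net.http.HttpClient.
        String html = "<span class=\"tag\"><a href=\"#\">fantasy</a></span>"
                + "<span class=\"tag\"><a href=\"#\">Warhammer</a></span>";

        // Step 3: do what you want with the data.
        for (String text : extractLinkTexts(html)) {
            System.out.println(text);
        }
    }
}
```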

It's worth remembering that some people might not appreciate you data mining their site and profiting from / redistributing it on a large scale.
