Screen scraping in Clojure

Question

I searched, but I couldn't find a satisfactory answer. This SO question is related, but it is fairly old and asks for the exact opposite of what I want, which is a way to do screen scraping with XPath rather than CSS selectors.

I've used enlive for some basic screen-scraping but sometimes one needs the power of XPath selectors. So here it is:

Is there any equivalent to Nokogiri or lxml for Clojure (Java)? What is the state of the "pure Java Nokogiri"? Is there any way to use that library from Clojure? Are there any better alternatives than this hack?

Answer 1:

There are a couple of possibilities here.

Several of these require XML that is at least roughly well-formed. If you don't have that, I would pair clj-tagsoup with hiccup to produce it: parse the page with clj-tagsoup, which yields a form that hiccup can write back out as XML, and work with that.

First, just use the native JDK capabilities. Assuming the document is well-formed enough, try clj-xpath, which provides a wrapper around the native JDK parsing.
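
To give a feel for what the native JDK route involves, here is a minimal sketch that calls javax.xml.xpath directly through Java interop. It does not use clj-xpath's own API, only the JDK classes that clj-xpath is described as wrapping. The sample markup and the helper names parse-xml and xpath-strings are made up for illustration.

```clojure
(import '(java.io StringReader)
        '(javax.xml.parsers DocumentBuilderFactory)
        '(javax.xml.xpath XPath XPathConstants XPathFactory)
        '(org.w3c.dom Document NodeList)
        '(org.xml.sax InputSource))

;; Hypothetical sample page; it has to be well-formed XML for the JDK parser.
(def sample-page
  (str "<html><body>"
       "<div class=\"post\"><a href=\"/a\">First</a></div>"
       "<div class=\"post\"><a href=\"/b\">Second</a></div>"
       "</body></html>"))

(defn parse-xml
  "Parse a well-formed XML string into an org.w3c.dom.Document."
  [^String s]
  (-> (DocumentBuilderFactory/newInstance)
      (.newDocumentBuilder)
      (.parse (InputSource. (StringReader. s)))))

(defn xpath-strings
  "Evaluate an XPath expression against doc and return the text content
  of every matched node as a vector of strings."
  [^Document doc ^String expr]
  (let [^XPath xpath (.newXPath (XPathFactory/newInstance))
        ^NodeList nodes (.evaluate xpath expr doc XPathConstants/NODESET)]
    (mapv (fn [i] (.getTextContent (.item nodes i)))
          (range (.getLength nodes)))))

(def doc (parse-xml sample-page))

(xpath-strings doc "//div[@class='post']/a/@href") ;; => ["/a" "/b"]
(xpath-strings doc "//div[@class='post']/a")       ;; => ["First" "Second"]
```

Real pages are rarely this clean, so in practice the input would first go through the TagSoup step described above.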

If that doesn't suffice, consider a route based on plain Clojure data structures. The simplest version just takes the output of TagSoup and picks it apart with a combination of maps, filters, and nths.
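
As a rough sketch of the maps, filters, and nths approach, the snippet below works on a hand-written vector that stands in for parser output. The [tag attrs & children] shape is an assumption about what the parser returns, and children and attrs are hypothetical helpers defined here, not library functions.

```clojure
;; Hand-written stand-in for parsed HTML. The [tag attrs & children]
;; shape is assumed, so adjust the helpers if your parser differs.
(def page
  [:html {}
   [:body {}
    [:div {:class "post"} [:a {:href "/a"} "First"]]
    [:div {:class "ad"}   [:a {:href "/x"} "Sponsored"]]
    [:div {:class "post"} [:a {:href "/b"} "Second"]]]])

(defn attrs
  "Attribute map of a node (the second item of the vector)."
  [node]
  (nth node 1))

(defn children
  "Child elements of a node, skipping the tag, the attribute map, and text."
  [node]
  (filter vector? (drop 2 node)))

(def body (first (children page)))

;; Roughly what //div[@class='post']/a/@href would select.
(def post-links
  (->> (children body)
       (filter #(= "post" (:class (attrs %))))
       (map #(first (children %)))   ; the <a> inside each matching div
       (map #(:href (attrs %)))))

;; Positional access with nth: the link text of the third div.
(def third-div-title
  (-> (children body) (nth 2) children first (nth 2)))

post-links      ;; => ("/a" "/b")
third-div-title ;; => "Second"
```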

If you need something more advanced, consider using zippers to provide structure around the data, making it easier to manipulate. Use clojure.xml/parse and clojure.zip/xml-zip to produce the zipper, and go from there. An example can be found at http://techbehindtech.com/2010/06/25/parsing-xml-in-clojure/.
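
A minimal sketch of the zipper route, using only clojure.xml and clojure.zip from the standard distribution. The sample document and the helper names xml-zipper and locs are made up for illustration.

```clojure
(require '[clojure.xml :as xml]
         '[clojure.zip :as zip])
(import '(java.io ByteArrayInputStream))

;; Hypothetical sample document.
(def sample-xml
  (str "<feed>"
       "<entry id=\"1\"><title>First</title></entry>"
       "<entry id=\"2\"><title>Second</title></entry>"
       "</feed>"))

(defn xml-zipper
  "Parse an XML string with clojure.xml/parse and wrap it in a zipper."
  [^String s]
  (-> (ByteArrayInputStream. (.getBytes s "UTF-8"))
      xml/parse
      zip/xml-zip))

(def root (xml-zipper sample-xml))

;; Manual navigation: feed -> first entry -> its title -> the text node.
(def first-title
  (-> root zip/down zip/down zip/down zip/node))

(defn locs
  "Every location in the zipper, in depth-first order."
  [loc]
  (take-while (complement zip/end?) (iterate zip/next loc)))

;; Walk the whole tree and keep the text of each <title> element.
(def titles
  (for [loc (locs root)
        :let [node (zip/node loc)]
        :when (= :title (:tag node))]
    (first (:content node))))

first-title ;; => "First"
titles      ;; => ("First" "Second")
```

Because each location remembers its parent and siblings, XPath-style steps such as "the entry that contains this title" become a zip/up call rather than a second search.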

Using the native structures is my preferred route for anything complicated, as you can bring the full power of the language to bear.

If you provide a sample of why you need XPath, I can provide some sample code.

Source: https://stackoverflow.com/questions/13693615/screen-scraping-in-clojure

Tags

ruby

clojure

screen-scraping

nokogiri