Screen scraping in Clojure

给你一囗甜甜゛ submitted on 2019-12-05 19:38:54

There are a couple of possibilities here.

Several of these require reasonably well-formed XML to work. If you don't have that, I would pair clj-tagsoup with hiccup to produce the XML: parse with clj-tagsoup, which produces a form that hiccup can write out as XML, and work with that.

First, just use the native JDK capabilities. Assuming the document is well formed enough, try clj-xpath, which provides a wrapper around the native JDK parsing.
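To give a feel for the JDK machinery involved, here is a minimal sketch that calls javax.xml.xpath directly through Java interop, with no extra dependencies. It is not clj-xpath's own API; the sample markup and the helper names parse-doc and xpath-texts are made up for illustration.

```clojure
(ns scrape.xpath-sketch
  (:import (java.io ByteArrayInputStream)
           (javax.xml.parsers DocumentBuilderFactory)
           (javax.xml.xpath XPathConstants XPathFactory)
           (org.w3c.dom Document NodeList)))

;; Hypothetical sample, standing in for page markup that is already well formed.
(def sample-xml
  "<html><body><a href=\"/one\">One</a><a href=\"/two\">Two</a></body></html>")

(defn parse-doc
  "Parse a well-formed XML string into an org.w3c.dom.Document."
  ^Document [^String s]
  (let [builder (.newDocumentBuilder (DocumentBuilderFactory/newInstance))]
    (.parse builder (ByteArrayInputStream. (.getBytes s "UTF-8")))))

(defn xpath-texts
  "Evaluate an XPath expression against doc and return the text
  content of every matching node, in document order."
  [^Document doc ^String expr]
  (let [xpath (.newXPath (XPathFactory/newInstance))
        ^NodeList nodes (.evaluate xpath expr doc XPathConstants/NODESET)]
    (mapv (fn [i] (.getTextContent (.item nodes (int i))))
          (range (.getLength nodes)))))

(xpath-texts (parse-doc sample-xml) "//a/@href")
;; => ["/one" "/two"]
```

A wrapper library mostly saves you this boilerplate; the XPath expressions themselves stay the same.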

If that doesn't suffice, consider a route based more on Clojure data structures. A simpler path could just take the output of TagSoup and apply a combination of map, filter, and nth.
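As a rough sketch of that simpler path, the snippet below works on a hand-written literal rather than real parser output. It assumes the parsed page is a hiccup-style vector of the form [tag attribute-map & children]; the page data and the helpers children and tag? are hypothetical.

```clojure
;; Assumed shape: [tag attribute-map & children], written out by hand here
;; so the example runs without any parsing library.
(def page
  [:html {}
   [:head {} [:title {} "Demo"]]
   [:body {}
    [:h1 {} "Links"]
    [:a {:href "/one"} "One"]
    [:p {} "filler"]
    [:a {:href "/two"} "Two"]]])

(defn children
  "Everything after the tag and the attribute map."
  [node]
  (drop 2 node))

(defn tag?
  "True when node is an element vector with tag t (text nodes are strings)."
  [t node]
  (and (vector? node) (= t (first node))))

;; Index 0 is the tag, 1 the attributes, 2 is :head, 3 is :body.
(def body (nth page 3))

(def hrefs
  (->> (children body)
       (filter #(tag? :a %))
       (map #(:href (nth % 1)))))
;; hrefs => ("/one" "/two")
```

Positional access with nth is brittle if the page layout changes, which is part of why the zipper approach is worth considering for anything larger.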

If you need something more advanced, consider using zippers to provide structure around the data, making it easier to manipulate. Use clojure.xml/parse and clojure.zip/xml-zip to produce the zipper, and go from there. An example can be found at http://techbehindtech.com/2010/06/25/parsing-xml-in-clojure/.
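A minimal sketch of the zipper route, using only clojure.xml and clojure.zip from the standard distribution. The sample markup and the helper names xml-zipper, locs, and hrefs are assumptions for illustration, not taken from the linked article.

```clojure
(ns scrape.zipper-sketch
  (:require [clojure.xml :as xml]
            [clojure.zip :as zip])
  (:import (java.io ByteArrayInputStream)))

;; Hypothetical sample, standing in for page markup that is already well formed.
(def sample-xml
  "<html><body><a href=\"/one\">One</a><p>filler</p><a href=\"/two\">Two</a></body></html>")

(defn xml-zipper
  "Parse an XML string and wrap the resulting tree in a zipper."
  [^String s]
  (zip/xml-zip (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8")))))

(defn locs
  "Every location in the zipper, in depth-first order."
  [z]
  (take-while (complement zip/end?) (iterate zip/next z)))

(defn hrefs
  "The href attribute of every <a> element reachable from z."
  [z]
  (->> (locs z)
       (map zip/node)
       (filter #(= :a (:tag %)))       ; text nodes are strings, so :tag is nil
       (map #(get-in % [:attrs :href]))))

(hrefs (xml-zipper sample-xml))
;; => ("/one" "/two")
```

Because each location also knows its parent and siblings, the same walk can be extended with zip/up, zip/left, or zip/edit when the data you want depends on surrounding context.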

Using the native structures is my preferred route for anything complicated, as you can bring the full power of the language to bear.

If you provide a sample of why you need XPath, I can provide some sample code.
