How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked to in answers to questions
Language: Clojure
Library: Enlive (a selector-based (à la CSS) templating and transformation system for Clojure)
Selector expression:
(def test-select
  (html/select (html/html-resource (java.io.StringReader. test-html)) [:a]))
Now we can do the following at the REPL (I've added line breaks to the test-select output):
user> test-select
({:tag :a, :attrs {:href "http://foo.com/"}, :content ["foo"]}
 {:tag :a, :attrs {:href "http://bar.com/"}, :content ["bar"]}
 {:tag :a, :attrs {:href "http://baz.com/"}, :content ["baz"]})
user> (map #(get-in % [:attrs :href]) test-select)
("http://foo.com/" "http://bar.com/" "http://baz.com/")
You'll need the following to try it out:
Preamble:
(require '[net.cgrand.enlive-html :as html])
Test HTML:
(def test-html
  (apply str (concat ["<html><body>"]
                     (for [link ["foo" "bar" "baz"]]
                       (str "<a href=\"http://" link ".com/\">" link "</a>"))
                     ["</body></html>"])))
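As a small extension of the selector above, here is a minimal sketch that pairs each link's text with its href. It assumes Enlive is on the classpath; the names sample-html and links and the inline HTML string are illustrative and not part of the answer above. It narrows the selector with html/attr? so that only anchors carrying an href are matched, and uses html/text to read each node's text content.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Illustrative input (hypothetical), built inline so the sketch stands on its own.
(def sample-html
  (str "<html><body>"
       "<a href=\"http://foo.com/\">foo</a>"
       "<a href=\"http://bar.com/\">bar</a>"
       "<a name=\"no-href\">skipped</a>"
       "</body></html>"))

;; Parse the string, keep only <a> elements that have an href attribute,
;; and turn each matching node into a small map of its text and target.
(def links
  (doall
    (for [node (html/select
                 (html/html-resource (java.io.StringReader. sample-html))
                 [[:a (html/attr? :href)]])]
      {:text (html/text node)
       :href (get-in node [:attrs :href])})))
```

Evaluating links at the REPL should show one map per anchor that has an href. The nested vector in the selector means "an a element that also satisfies the attr? predicate", as opposed to a descendant relationship.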