How do you parse HTML with a variety of languages and parsing libraries?
Language: Racket
Library: (planet ashinn/html-parser:1) and (planet clements/sxml2:1)
(require net/url
         (planet ashinn/html-parser:1)
         (planet clements/sxml2:1))

;; Fetch the page and parse the response body into SXML.
(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->sxml))

;; Collect the href value of every <a> element.
(define links ((sxpath "//a/@href/text()") doc))
The same example using the html-parsing and sxml packages from the new package system:
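(require net/url
         html-parsing
         sxml)

;; Fetch the page and parse the response body into an SXML-style x-expression.
(define the-url (string->url "http://stackoverflow.com/"))
(define doc (call/input-url the-url get-pure-port html->xexp))

;; Collect the href value of every <a> element.
(define links ((sxpath "//a/@href/text()") doc))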
Note: Install the required packages with raco from the command line:
raco pkg install html-parsing sxml
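To try the query without a network request, here is a minimal sketch that parses an inline string instead of a live page. It assumes the html-parsing and sxml packages are installed; the sample-html markup is made up for illustration and is not part of the original answer.

```racket
#lang racket/base
;; Assumes the html-parsing and sxml packages are installed.
(require html-parsing
         sxml)

;; Hypothetical sample markup, used here instead of fetching a URL.
(define sample-html
  "<html><body><a href=\"/a\">A</a><a href=\"http://example.com/b\">B</a></body></html>")

;; html->xexp accepts a string as well as an input port.
(define doc (html->xexp sample-html))

;; Same XPath query as in the answer: the text of every href attribute on an <a>.
(define links ((sxpath "//a/@href/text()") doc))

;; Print each extracted href on its own line.
(for-each displayln links)
```

Parsing from a string keeps the example reproducible; swapping in call/input-url with get-pure-port, as in the code above, applies the same query to a fetched page.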