How do you parse HTML with a variety of languages and parsing libraries?
When answering:
Individual comments will be linked to in answers to questions
Language: Coldfusion 9.0.1+
Library: jSoup
function parseURL(required string url){
var res = [];
var javaLoader = createObject("javaloader.JavaLoader").init([expandPath("./jsoup-1.7.3.jar")]);
var jSoupClass = javaLoader.create("org.jsoup.Jsoup");
//var dom = jSoupClass.parse(html); // if you already have some html to parse.
var dom = jSoupClass.connect( arguments.url ).get();
var links = dom.select("a");
for(var a=1;a LT arrayLen(links);a++){
var s={};s.href= links[a].attr('href'); s.text= links[a].text();
if(s.href contains "http://" || s.href contains "https://") arrayAppend(res,s);
}
return res;
}
//writeoutput(writedump(parseURL(url)));
Returns an array of structures, each struct contains an HREF and TEXT objects.