F#.Data HTML Parser Extracting Strings From Nodes

Posted by 我的梦境 on 2021-01-01 04:43:15

Question


I am trying to use FSharp.Data's HTML parser to extract a list of strings containing the links from href attributes.

I can print the links to the console, but I'm struggling to get them into a list.

Here is a working snippet that prints the links I want:

let results = HtmlDocument.Load(myUrl)
let links = 
    results.Descendants("td")
    |> Seq.filter (fun x -> x.HasClass("pagenav"))
    |> Seq.map (fun x -> x.Elements("a"))
    |> Seq.iter (fun x -> x |> Seq.iter (fun y -> y.AttributeValue("href") |> printf "%A"))

How do I store those strings in the variable links instead of printing them out?

Cheers,


Answer 1:


On the very last line, you end up with a sequence of sequences: for each td.pagenav you have a bunch of <a> elements, each of which has an href. That's why you need two nested Seq.iter calls: first you iterate over the outer sequence, and on each iteration you iterate over the inner sequence. Note also that Seq.iter returns unit, so links is bound to () rather than to the strings.
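To see that shape without any HTML involved, here is a minimal sketch that uses plain lists as stand-ins for the td elements and their href values. The nested data and the names printed and stillNested are made up for illustration.

```fsharp
// Made-up stand-in data: each inner list plays the role of the hrefs under one td.pagenav.
let nested = [ ["/page/1"; "/page/2"]; ["/page/3"] ]

// Seq.iter runs a side effect per element and returns unit,
// so this binding holds (), not the strings.
let printed : unit =
    nested |> Seq.iter (fun inner -> inner |> Seq.iter (printfn "%s"))

// Seq.map keeps the values, but the result is still a sequence of sequences.
let stillNested : seq<seq<string>> =
    nested |> Seq.map (fun inner -> inner |> Seq.map (fun href -> href.Trim()))

printfn "%d outer items" (Seq.length stillNested)
```

Swapping Seq.iter for Seq.map is therefore only half of the fix; the nesting still has to be removed.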

To flatten a sequence of sequences, use Seq.collect. Further, to convert a sequence to a list, use Seq.toList or List.ofSeq (they're equivalent):

let a = [ [1;2;3];  [4;5;6]  ]
let b = a |> Seq.collect id |> Seq.toList
// val b : int list = [1; 2; 3; 4; 5; 6]
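As a small illustration of what Seq.collect does when it is given a real projection instead of id, the sketch below compares it with Seq.map followed by Seq.concat. The names viaCollect and viaConcat and the "times ten" projection are made up for this example.

```fsharp
let a = [ [1; 2; 3]; [4; 5; 6] ]

// Seq.collect f behaves like Seq.map f followed by Seq.concat:
// apply f to each element, then flatten the results.
let viaCollect =
    a
    |> Seq.collect (fun xs -> xs |> List.map (fun x -> x * 10))
    |> Seq.toList

let viaConcat =
    a
    |> Seq.map (fun xs -> xs |> List.map (fun x -> x * 10))
    |> Seq.concat
    |> List.ofSeq   // same result as Seq.toList

printfn "%A" viaCollect                 // [10; 20; 30; 40; 50; 60]
printfn "%b" (viaCollect = viaConcat)   // true
```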

Applying this to your code:

let links = 
    results.Descendants("td")
    |> Seq.filter (fun x -> x.HasClass("pagenav"))
    |> Seq.map (fun x -> x.Elements("a"))
    |> Seq.collect (fun x -> x |> Seq.map (fun y -> y.AttributeValue("href")))
    |> Seq.toList

Or you could make it a bit cleaner by applying Seq.collect at the point where you first encounter a nested sequence:

let links = 
    results.Descendants("td")
    |> Seq.filter (fun x -> x.HasClass("pagenav"))
    |> Seq.collect (fun x -> x.Elements("a"))
    |> Seq.map (fun y -> y.AttributeValue("href"))
    |> Seq.toList

That said, I would rather rewrite this as a list comprehension, which looks even cleaner:

let links = [ for td in results.Descendants "td" do
                if td.HasClass "pagenav" then
                  for a in td.Elements "a" ->
                    a.AttributeValue "href"
            ]
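For anyone who wants to try both variants end to end, here is a sketch of a script that parses an inline HTML string instead of loading myUrl. It assumes it is run with dotnet fsi, where the #r "nuget: ..." directive is available; the HTML content and the names html, viaPipeline and viaComprehension are made up for illustration.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Made-up HTML standing in for the page at myUrl.
let html = """
<html><body><table><tr>
  <td class="pagenav"><a href="/page/1">1</a><a href="/page/2">2</a></td>
  <td class="other"><a href="/ignored">x</a></td>
  <td class="pagenav"><a href="/page/3">3</a></td>
</tr></table></body></html>"""

let results = HtmlDocument.Parse(html)

// Pipeline version: flatten with Seq.collect, then project to the href strings.
let viaPipeline =
    results.Descendants("td")
    |> Seq.filter (fun td -> td.HasClass("pagenav"))
    |> Seq.collect (fun td -> td.Elements("a"))
    |> Seq.map (fun a -> a.AttributeValue("href"))
    |> Seq.toList

// List comprehension version, written with explicit yield.
let viaComprehension =
    [ for td in results.Descendants("td") do
        if td.HasClass("pagenav") then
          for a in td.Elements("a") do
            yield a.AttributeValue("href") ]

printfn "%A" viaPipeline
printfn "%b" (viaPipeline = viaComprehension)
```

Both bindings have type string list, so either can replace the printing version from the question.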


Source: https://stackoverflow.com/questions/44294409/f-data-html-parser-extracting-strings-from-nodes
