Extract links from HTML table

情书的邮戳 2020-12-16 21:42

I'm trying to extract the links from the following webpage, http://ipt.humboldt.org.co/, that are of type "Specimen". I can get the table from the webpage using the followi

2 answers
  •  一生所求
    2020-12-16 22:40

    It ended up being an intricate XPath expression:

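    library(XML)

    # Parse the page into an HTML document tree
    sitePage <- htmlParse("http://ipt.humboldt.org.co/")

    # Select the href of the Name-column link in every row whose Type is 'Specimen'
    hyperlinksYouNeed <- getNodeSet(sitePage, "//table[@id='resourcestable']
                                                //td[5][.='Specimen']
                                                /preceding-sibling
                                                ::td[3]
                                                /a
                                                /@href")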
    

    Here is the XPath expression explained piece by piece:

    • //table[@id='resourcestable'] -> Selects the main table on the page, the one whose id is 'resourcestable'.

    • //td[5][.='Specimen'] -> Within that table, keeps only the fifth cell of each row (the Type column) when its text is exactly 'Specimen'.

    • /preceding-sibling -> From that cell, we start looking backwards along the row, at the cells that come before it.

    • ::td[3] -> Three steps back, to be precise. Be careful: preceding-sibling counts backwards from the current cell and does not include it, so td[1] is the cell immediately to the left of the Type cell, td[2] is the Organisation column and td[3] is the Name column we want.

    • /a -> Now take the a (link) node inside that cell.

    • /@href -> Finally, take the value of its href attribute.
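
To check the steps above without depending on the live site, here is a minimal sketch that runs the same XPath against a small inline table. The HTML snippet, its link paths, and the placeholder fourth column are made up for illustration; only the table id and the position of the Type column follow the answer. It uses xpathSApply instead of getNodeSet so the result comes back as a plain character vector.

```r
library(XML)

# Hypothetical stand-in for the real page: same table id, Type in the fifth column.
# The link targets and the fourth column ("x") are invented for this example.
html <- "
<table id='resourcestable'>
  <tr><td>logo</td><td><a href='/resource?r=birds'>Birds</a></td><td>Org A</td><td>x</td><td>Specimen</td></tr>
  <tr><td>logo</td><td><a href='/resource?r=list'>List</a></td><td>Org B</td><td>x</td><td>Checklist</td></tr>
  <tr><td>logo</td><td><a href='/resource?r=frogs'>Frogs</a></td><td>Org C</td><td>x</td><td>Specimen</td></tr>
</table>"

# asText = TRUE tells htmlParse the argument is HTML content, not a file name or URL
doc <- htmlParse(html, asText = TRUE)

# Same expression as in the answer, written on one line
xp <- "//table[@id='resourcestable']//td[5][.='Specimen']/preceding-sibling::td[3]/a/@href"

# Attribute matches come back as a named character vector; drop the names
links <- as.character(xpathSApply(doc, xp))
print(links)

# preceding-sibling::td[1] is the cell just left of the Type cell, not the Type cell itself
leftOfType <- xpathSApply(doc, "//td[5][.='Specimen']/preceding-sibling::td[1]", xmlValue)
print(leftOfType)
```

Only the two rows whose fifth cell is 'Specimen' contribute a link; the 'Checklist' row is skipped. If the hrefs on the real page are relative, they would still need to be joined with the site's base URL before use.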
