Scraping a wiki page for the “Periodic table” and all the links

迷失自我 2020-12-25 08:55

I wish to scrape the following wiki article: http://en.wikipedia.org/wiki/Periodic_table

So that the output of my R code will be a table with the following columns:

3 answers
  •  一向 (original poster)
    2020-12-25 09:36

    Tal -- I thought this was going to be easy. I was going to point you to readHTMLTable(), my favorite function in the XML package. Heck, its help page even shows an example of scraping a Wikipedia page!

    But alas, this is not what you want:

    library(XML)
    url = 'http://en.wikipedia.org/wiki/Periodic_table'
    tables = readHTMLTable(url)   # pass the URL defined above; 'html' was never defined
    
    # ... look through the list to find the one you want...
    
    table = tables[3]
    table
    $`NULL`
             Group #    1    2    3     4     5     6     7     8     9    10    11    12     13     14     15     16     17     18
    1         Period                                           
    2              1   1H       2He                                    
    3              2  3Li  4Be         5B    6C    7N    8O    9F  10Ne                        
    4              3 11Na 12Mg       13Al  14Si   15P   16S  17Cl  18Ar                        
    5              4  19K 20Ca 21Sc  22Ti   23V  24Cr  25Mn  26Fe  27Co  28Ni  29Cu  30Zn   31Ga   32Ge   33As   34Se   35Br   36Kr
    6              5 37Rb 38Sr  39Y  40Zr  41Nb  42Mo  43Tc  44Ru  45Rh  46Pd  47Ag  48Cd   49In   50Sn   51Sb   52Te    53I   54Xe
    7              6 55Cs 56Ba    *  72Hf  73Ta   74W  75Re  76Os  77Ir  78Pt  79Au  80Hg   81Tl   82Pb   83Bi   84Po   85At   86Rn
    8              7 87Fr 88Ra   ** 104Rf 105Db 106Sg 107Bh 108Hs 109Mt 110Ds 111Rg 112Cn 113Uut 114Uuq 115Uup 116Uuh 117Uus 118Uuo
    9                                                      
    10 * Lanthanoids 57La 58Ce 59Pr  60Nd  61Pm  62Sm  63Eu  64Gd  65Tb  66Dy  67Ho  68Er   69Tm   70Yb   71Lu             
    11  ** Actinoids 89Ac 90Th 91Pa   92U  93Np  94Pu  95Am  96Cm  97Bk  98Cf  99Es 100Fm  101Md  102No  103Lr             
    

    The names are gone and the atomic number runs into the symbol.
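
    For what it's worth, the fused cells can be pulled apart with a regular expression in base R. This is only a minimal sketch, assuming every non-empty cell is digits followed by a symbol; the `cells` vector is typed in by hand from the output above, not scraped:

    ```r
    # cells copied by hand from the printed table above (illustrative input only)
    cells <- c("1H", "2He", "3Li", "26Fe", "113Uut")

    # the leading digits are the atomic number, whatever follows is the symbol
    number <- as.integer(sub("^([0-9]+).*$", "\\1", cells))
    symbol <- sub("^[0-9]+", "", cells)

    data.frame(number, symbol)
    ```

    That still leaves the element names and the links missing, though.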

    So back to the drawing board...

    My DOM walking-fu is not very strong, so this isn't pretty. It gets every link in a table cell, keeps only those with a "title" attribute (for the element links, the title is the element's name and the link text is its symbol), and sticks what you want in a data.frame. It also picks up every other such link on the page, but we're lucky: the elements are the first 118 such links:

    library(XML)
    library(plyr) 
    
    url = 'http://en.wikipedia.org/wiki/Periodic_table'
    
    # don't forget to parse the HTML, doh!
    
    doc = htmlParse(url)
    
    # get every link in a table cell:
    
    links = getNodeSet(doc, '//table/tr/td/a')
    
    # make a data.frame for each node with non-blank text, link, and 'title' attribute:
    
    df = ldply(links, function(x) {
                symbol = xmlValue(x)              # link text, e.g. "H"
                if (symbol == '') symbol = NULL
    
                name = xmlGetAttr(x, 'title')     # title attribute, e.g. "Hydrogen"
                link = xmlGetAttr(x, 'href')
                # skip links that are missing any of the three pieces
                if (!is.null(symbol) && !is.null(name) && !is.null(link))
                    data.frame(name, symbol, link)
            } )
    
    # only keep the actual elements -- we're lucky they're first!
    
    df = head(df, 118)
    
    head(df)
           name symbol            link
    1  Hydrogen      H  /wiki/Hydrogen
    2    Helium     He    /wiki/Helium
    3   Lithium     Li   /wiki/Lithium
    4 Beryllium     Be /wiki/Beryllium
    5     Boron      B     /wiki/Boron
    6    Carbon      C    /wiki/Carbon
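
    The `link` column holds site-relative paths, so you will probably want full URLs. A small sketch, assuming `http://en.wikipedia.org` is the right base; the `elements` data frame is a hand-typed stand-in for the first rows of `df`, and the `url` column is my own addition:

    ```r
    # hand-typed stand-in for the first rows of df (illustrative input only)
    elements <- data.frame(
      link = c("/wiki/Hydrogen", "/wiki/Helium", "/wiki/Lithium"),
      stringsAsFactors = FALSE
    )

    # assumed base URL, taken from the page address used above
    base <- "http://en.wikipedia.org"

    # prepend the base to each relative path
    elements$url <- paste0(base, elements$link)

    elements
    ```

    Pasting strings is enough here because every `href` in the output above starts with `/wiki/`; links that are already absolute would need a check first.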
    
