How can I use R (RCurl/XML packages?!) to scrape this webpage?

清歌不尽 2020-12-07 17:00

I have a (somewhat complex) web scraping challenge that I wish to accomplish and would love some direction (to whatever level you feel like sharing). Here goes:

I want to collect the tRNA data for each genome listed on http://gtrnadb.ucsc.edu/ and pull it into an R data frame.

3 Answers
  • 2020-12-07 17:34

    Tal,

    You could use R and the XML package to do this, but (damn) that is some poorly formed HTML you are trying to parse. In fact, in most cases you would want to be using the readHTMLTable() function, which is covered in this previous thread.
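
    For reference, here is what that usual route looks like (a minimal sketch; the URL is just a placeholder for a page with well-formed <table> markup):

    library(XML)
    tables<-readHTMLTable("http://example.com/some-table-page.html")
    str(tables)   # a list with one data.frame per <table> on the page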

    Given this ugly HTML, however, we will have to use the RCurl package to pull the raw HTML and create some custom functions to parse it. This problem has two components:

    1. Get all of the genome URLs from the base webpage (http://gtrnadb.ucsc.edu/) using the getURLContent() function in the RCurl package and some regex magic :-)
    2. Then take that list of URLs, scrape the data you are looking for, and stick it into a data.frame.

    So, here goes...

    library(RCurl)
    
    ### 1) First task is to get all of the web links we will need ###
    base_url<-"http://gtrnadb.ucsc.edu/"
    base_html<-getURLContent(base_url)[[1]]
    links<-strsplit(base_html,"a href=")[[1]]
    
    get_data_url<-function(s) {
        u_split1<-strsplit(s,"/")[[1]][1]
        u_split2<-strsplit(u_split1,'\\"')[[1]][2]
        # Keep only links that contain an upper-case letter (the genome
        # abbreviations) and no "#" anchor; everything else becomes NA
        if (length(grep("[[:upper:]]",u_split2))==1 && length(strsplit(u_split2,"#")[[1]])<2) {
            return(u_split2)
        }
        return(NA)
    }
    
    # Extract only those elements that are relevant
    genomes<-unlist(lapply(links,get_data_url))
    genomes<-genomes[!is.na(genomes)]   # Drop the non-genome links
    
    ### 2) Now, scrape the genome data from all of those URLs ###
    
    # This requires two complementary functions that are designed specifically
    # for the UCSC website. The first parses the data from a -structs.html page
    # and the second collects that data into a multi-dimensional list
    parse_genomes<-function(g) {
        g_split1<-strsplit(g,"\n")[[1]]
        g_split1<-g_split1[2:5]
        # Pull all of the data and stick it in a list
        g_split2<-strsplit(g_split1[1],"\t")[[1]]
        ID<-g_split2[1]                             # Sequence ID
        LEN<-strsplit(g_split2[2],": ")[[1]][2]     # Length
        g_split3<-strsplit(g_split1[2],"\t")[[1]]
        TYPE<-strsplit(g_split3[1],": ")[[1]][2]    # Type
        AC<-strsplit(g_split3[2],": ")[[1]][2]      # Anticodon
        SEQ<-strsplit(g_split1[3],": ")[[1]][2]     # Sequence
        STR<-strsplit(g_split1[4],": ")[[1]][2]     # Secondary structure string
        return(c(ID,LEN,TYPE,AC,SEQ,STR))
    }
    
    # This builds a high-dimensional list with all of the data, which you can then manipulate as you like
    get_structs<-function(u) {
        struct_url<-paste(base_url,u,"/",u,"-structs.html",sep="")
        raw_data<-getURLContent(struct_url)
        s_split1<-strsplit(raw_data,"<PRE>")[[1]]
        all_data<-s_split1[seq(3,length(s_split1))]
        data_list<-lapply(all_data,parse_genomes)
        for (d in 1:length(data_list)) {data_list[[d]]<-append(data_list[[d]],u)}
        return(data_list)
    }
    
    # Collect data, manipulate, and create data frame (with slight cleaning)
    genomes_list<-lapply(genomes[1:2],get_structs) # Limit to the first two genomes (Bdist & Spurp); a full scrape will take a LONG time
    genomes_rows<-unlist(genomes_list,recursive=FALSE) # The recursive=FALSE saves a lot of work; now we can do a straightforward manipulation
    genome_data<-t(sapply(genomes_rows,rbind))
    colnames(genome_data)<-c("ID","LEN","TYPE","AC","SEQ","STR","NAME")
    genome_data<-as.data.frame(genome_data)
    genome_data<-subset(genome_data,ID!="</PRE>")   # Some malformed web pages produce bad rows, but we can remove them
    
    head(genome_data)
    

    The resulting data frame contains seven columns related to each genome entry: ID, length, type, anticodon, sequence, structure string, and name. The name column contains the base genome, which was my best guess at how to organize the data. Here is what it looks like:

    head(genome_data)
                                       ID   LEN TYPE                           AC                                                                       SEQ
    1     Scaffold17302.trna1 (1426-1498) 73 bp  Ala     AGC at 34-36 (1459-1461) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTTTCCA
    2   Scaffold20851.trna5 (43038-43110) 73 bp  Ala   AGC at 34-36 (43071-43073) AGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
    3   Scaffold20851.trna8 (45975-46047) 73 bp  Ala   AGC at 34-36 (46008-46010) TGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTTCTCCA
    4     Scaffold17302.trna2 (2514-2586) 73 bp  Ala     AGC at 34-36 (2547-2549) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACAGGGATCGATGCCCGGGTTCTCCA
    5 Scaffold51754.trna5 (253637-253565) 73 bp  Ala AGC at 34-36 (253604-253602) CGGGGGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGGGTCCTCCA
    6     Scaffold17302.trna4 (6027-6099) 73 bp  Ala     AGC at 34-36 (6060-6062) GGGGAGCTAGCTCAGATGGTAGAGCGCTCGCTTAGCATGCGAGAGGtACCGGGATCGATGCCCGAGTTCTCCA
                                                                            STR  NAME
    1 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
    2 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
    3 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
    4 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>.>>>.......<<<.<<<<<<<<. Spurp
    5 .>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<.. Spurp
    6 >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<......>>>>.......<<<<.<<<<<<<. Spurp
    
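    Once the data frame is built, quick sanity checks are straightforward (illustrative only):

    table(genome_data$TYPE)   # counts per tRNA isotype
    table(genome_data$NAME)   # rows recovered per genome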

    I hope this helps, and thanks for the fun little Sunday afternoon R challenge!

  • 2020-12-07 17:52

    Just tried it using Mozenda (http://www.mozenda.com). After roughly 10 minutes I had an agent that could scrape the data as you describe. You may be able to get all of this data just using their free trial. Coding is fun, if you have time, but it looks like you may already have a solution coded for you. Nice job Drew.

  • 2020-12-07 17:59

    Interesting problem, and I agree that R is cool, but somehow I find R a bit cumbersome in this respect. I prefer to get the data into an intermediate plain-text form first so that I can verify it is correct at every step. If the data is ready in its final form, or for uploading your data somewhere, RCurl is very useful.

    Simplest, in my opinion, would be to (on linux/unix/mac or in cygwin) just mirror the entire http://gtrnadb.ucsc.edu/ site (using wget), take the files named *-structs.html, extract the data you want with sed or awk, and format it for reading into R.
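
    If you go that route, reading the mirrored files back into R is then trivial. A minimal sketch, assuming the site has already been mirrored into the working directory:

    struct_files<-list.files("gtrnadb.ucsc.edu",pattern="-structs\\.html$",recursive=TRUE,full.names=TRUE)
    raw_pages<-lapply(struct_files,readLines)   # inspect the plain text at each step before parsing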

    I'm sure there would be lots of other ways also.
