Scraping JavaScript-generated data

你的背包 2020-12-11 13:34

I'm working on a project with the World Bank analyzing their procurement processes.

The WB maintains websites for each of their projects, containing links and data

2 Answers
  • 2020-12-11 14:12

    The main page calls a JavaScript function:

    javascript:callTabContent('p','P090644','','en','procurement','procurementId');
    

    The key argument here is the project ID, P090644. Together with the required language code, en, it is passed as a parameter to a form at http://www.worldbank.org/p2e/procurement.html.

    This form call can be replicated with the URL http://www.worldbank.org/p2e/procurement.html?lang=en&projId=P090644.

    Code to extract the relevant project description URLs follows:
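
    As a rough illustration of how the link arguments map onto that URL, here is a minimal base R sketch. The helper names parseTabCall and buildTabUrl are hypothetical, and the argument labels are borrowed from the signature of the site's callTabContent function; it assumes every argument in the link is single-quoted.

    ```r
    # Hypothetical helper: pull the quoted arguments out of a callTabContent(...) link.
    parseTabCall <- function(href) {
      # Match every single-quoted argument, including empty ones ('')
      m <- regmatches(href, gregexpr("'[^']*'", href))[[1]]
      args <- gsub("'", "", m, fixed = TRUE)
      names(args) <- c("tabparam", "projIdParam", "contextPath",
                       "langCd", "htmlId", "anchorTagId")
      args
    }

    # Hypothetical helper: build the query-string URL from the parsed arguments.
    # The host is an assumption taken from the URL quoted in the answer.
    buildTabUrl <- function(args, host = "http://www.worldbank.org") {
      paste0(host, "/p2e/", args[["htmlId"]], ".html",
             "?lang=", args[["langCd"]], "&projId=", args[["projIdParam"]])
    }

    href <- "javascript:callTabContent('p','P090644','','en','procurement','procurementId');"
    args <- parseTabCall(href)
    url <- buildTabUrl(args)
    print(url)
    ```

    The same two helpers should work for any other tab link on the page, provided it follows the same six-argument pattern.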

    library(XML)

    projID <- "P090644"
    projDetails <- paste0("http://www.worldbank.org/p2e/procurement.html?lang=en&projId=", projID)

    # Parse the procurement page and collect every href inside the contract awards table
    pdData <- htmlParse(projDetails)
    pdDescriptions <- xpathSApply(pdData, '//*/table[@id="contractawards"]//*/@href')

    #> pdDescriptions
    #                                                                 href
    # "http://search.worldbank.org/wcontractawards/procdetails/OP00005718"
    #                                                                 href
    # "http://search.worldbank.org/wcontractawards/procdetails/OP00005702"
    #                                                                 href
    # "http://search.worldbank.org/wcontractawards/procdetails/OP00005709"
    #                                                                 href
    # "http://search.worldbank.org/wcontractawards/procdetails/OP00005715"
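
    The result is a named character vector in which every name is href, which is awkward to work with. A small sketch of tidying it into a data frame follows, using the four links printed above as literal input. The column name detailId is only an illustrative label for the trailing identifier of each URL; on a live result you would first drop the names with unname().

    ```r
    # Links copied from the printed output; on a live result, use unname() first
    links <- c(
      "http://search.worldbank.org/wcontractawards/procdetails/OP00005718",
      "http://search.worldbank.org/wcontractawards/procdetails/OP00005702",
      "http://search.worldbank.org/wcontractawards/procdetails/OP00005709",
      "http://search.worldbank.org/wcontractawards/procdetails/OP00005715"
    )

    # basename() returns the last path component of each URL
    awards <- data.frame(
      detailId = basename(links),
      url = links,
      stringsAsFactors = FALSE
    )
    print(awards$detailId)
    ```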
    

    Note that Excel download links are also provided, which may be of use to you. They may contain the data you intend to scrape from the description links:

    library(gdata)

    # Excel exports for procurement notices, contract awards and contract data
    procNotice <- paste0("http://search.worldbank.org/wprocnotices/projectdetails/", projID, ".xls")
    conAward <- paste0("http://search.worldbank.org/wcontractawards/projectdetails/", projID, ".xls")
    conData <- paste0("http://search.worldbank.org/wcontractdata/projectdetails/", projID, ".xls")

    # read.xls() accepts a URL and returns a data frame
    pnData <- read.xls(procNotice)
    caData <- read.xls(conAward)
    cdData <- read.xls(conData)
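
    If you need these files for several projects, the three URL patterns can be wrapped in one function. This is only a sketch: buildXlsUrls is a hypothetical helper, and it assumes the URL patterns shown here hold for every project ID.

    ```r
    # Hypothetical helper: the three Excel export URLs for one project ID
    buildXlsUrls <- function(projID, host = "http://search.worldbank.org") {
      sections <- c(procNotice = "wprocnotices",
                    conAward = "wcontractawards",
                    conData = "wcontractdata")
      urls <- paste0(host, "/", sections, "/projectdetails/", projID, ".xls")
      names(urls) <- names(sections)  # paste0() drops names, so restore them
      urls
    }

    xlsUrls <- buildXlsUrls("P090644")
    print(xlsUrls)
    ```

    Each element can then be passed to read.xls() in a loop or with lapply().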
    

    UPDATE:

    To find out what is being posted, we can examine what happens when the JavaScript function is called. Using Firebug or a similar tool, we can intercept the request, whose header starts:

    POST /p2e/procurement.html HTTP/1.1
    Host: www.worldbank.org
    

    and has parameters:

    lang=en
    projId=P090644
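
    For reference, these two parameters travel in the request body as a single key=value string joined by &. A minimal base R sketch of assembling such a body follows; buildFormBody is a hypothetical helper, and URLencode() from the utils package is used to escape any reserved characters in the values.

    ```r
    # Hypothetical helper: turn a named character vector into a form-style body
    buildFormBody <- function(params) {
      # Percent-encode each value, then join name=value pairs with "&"
      encoded <- vapply(params, URLencode, character(1), reserved = TRUE)
      paste(names(params), encoded, sep = "=", collapse = "&")
    }

    body <- buildFormBody(c(lang = "en", projId = "P090644"))
    print(body)
    ```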
    

    Alternatively, we can examine the JavaScript at http://siteresources.worldbank.org/cached/extapps/cver116/p2e/js/script.js and look at the function callTabContent:

    function callTabContent(tabparam, projIdParam, contextPath, langCd, htmlId, anchorTagId) {
        if (tabparam == 'n' || tabparam == 'h') {
            $.ajax( {
                type : "POST",
                url : contextPath + "/p2e/"+htmlId+".html",
                data : "projId=" + projIdParam + "&lang=" + langCd,
                success : function(msg) {
                    if(tabparam=="n"){
                        $("#newsfeed").replaceWith(msg);
                    } else{
                        $("#cycle").replaceWith(msg);
                    }
                    stickNotes();
                }
            });
        } else {
            $.ajax( {
                type : "POST",
                url : contextPath + "/p2e/"+htmlId+".html",
                data : "projId=" + projIdParam + "&lang=" + langCd,
                success : function(msg) {
                    $("#tabContent").replaceWith(msg);
                    $('#map_container').hide();
                    changeAlternateColors();
                    $("#tab_menu a").removeClass("selected");
                    $('#'+anchorTagId).addClass("selected");                
                    stickNotes();
                }
            });
        }
    }
    

    Examining the function, we can see that it simply POSTs the projId and lang parameters to contextPath + "/p2e/" + htmlId + ".html" and then replaces part of the page with the response.
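
    The same POST could be issued directly from R rather than through the query-string URL. The sketch below assumes the httr package is installed and that the endpoint still accepts the request; tabUrl and postTab are hypothetical helper names, and the host is taken from the request header shown earlier. The network call itself is left commented out.

    ```r
    # Hypothetical helper: the endpoint the site's function posts to for a given htmlId
    tabUrl <- function(htmlId, host = "http://www.worldbank.org") {
      paste0(host, "/p2e/", htmlId, ".html")
    }

    # Hypothetical helper: send the same two form fields the JavaScript sends.
    # Requires httr; encode = "form" sends the body as URL-encoded form data.
    postTab <- function(projId, lang = "en", htmlId = "procurement") {
      httr::POST(tabUrl(htmlId),
                 body = list(projId = projId, lang = lang),
                 encode = "form")
    }

    print(tabUrl("procurement"))

    # Usage (performs a network request, so not run here):
    # resp <- postTab("P090644")
    # page <- XML::htmlParse(httr::content(resp, as = "text"), asText = TRUE)
    ```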

  • 2020-12-11 14:27

    I am not sure I have understood every detail of your problem, but I do know that CasperJS works well for JavaScript-generated content.

    You can have a look at it here: http://casperjs.org/

    It's written in JavaScript and has a number of useful functions, which are well documented at the link above.

    I used it recently for a personal project, and it can be set up easily with a few lines of code.

    Give it a go! Hope that helps.
