Scraping large pdf tables which span across multiple pages

Backend · Open · 7 answers · 1969 views
野的像风 · 2021-02-04 07:14

I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is t

7 Answers
  •  甜味超标 · 2021-02-04 07:41

    Efforts to construct an index for this (presumably the variation in formats relates to the different sub-reports; these all seem to be for Catalunya):

    heads <- grep("                                                                .+2012", txt)
    notheads <- grep("                                                                .+Anuari de", txt)
    # header lines are the ".+2012" matches that are not footer ("Anuari de") lines
    headtxt <- unique(trimws(txt[1:length(txt) %in% heads & !1:length(txt) %in% notheads]))
    
     [1] "TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012"                            
     [2] "TEMPERATURA MÀXIMA MITJANA MENSUAL ( ºC ) - 2012"                     
     [3] "TEMPERATURA MÍNIMA MITJANA MENSUAL ( ºC ) - 2012"                     
     [4] "TEMPERATURA MÀXIMA ABSOLUTA MENSUAL ( ºC ) - 2012"                    
     [5] "TEMPERATURA MÍNIMA ABSOLUTA MENSUAL ( ºC ) - 2012"                    
     [6] "AMPLITUD TÈRMICA MITJANA MENSUAL ( ºC ) - 2012"                       
     [7] "AMPLITUD TÈRMICA MÀXIMA MENSUAL ( ºC ) - 2012"                        
     [8] "NOMBRE DE DIES DE GLAÇADA ( TN ≤ 0 ºC ) - 2012"                       
     [9] "PRECIPITACIÓ MENSUAL ( mm ) - 2012"                                   
    [10] "PRECIPITACIÓ MENSUAL MÀXIMA EN 24 HORES ( mm ) - 2012"                
    [11] "PRECIPITACIÓ MENSUAL MÀXIMA EN 1 HORA ( mm ) - 2012"                  
    [12] "PRECIPITACIÓ MENSUAL MÀXIMA EN 30 MINUTS ( mm ) - 2012"               
    [13] "PRECIPITACIÓ MENSUAL MÀXIMA EN UN 1 MINUT ( mm ) - 2012"              
    [14] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT ≥ 0,1 mm) - 2012"                 
    [15] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT > 0,2 mm) - 2012"                 
    [16] "VELOCITAT MITJANA DEL VENT MENSUAL ( m/s ) - 2012"                    
    [17] "DIRECCIÓ DOMINANT DEL VENT - 2012"                                    
    [18] "MITJANA MENSUAL DE LA RATXA MÀXIMA DIÀRIA DEL VENT ( m/s ) - 2012"    
    [19] "RATXA MÀXIMA ABSOLUTA DEL VENT MENSUAL ( m/s ) - 2012"                
    [20] "HUMITAT RELATIVA MITJANA MENSUAL ( % ) - 2012"                        
    [21] "MITJANA MENSUAL DE LA HUMITAT RELATIVA MÀXIMA DIÀRIA ( % ) - 2012"    
    [22] "MITJANA MENSUAL DE LA HUMITAT RELATIVA MÍNIMA DIÀRIA ( % ) - 2012"    
    [23] "MITJANA MENSUAL DE LA IRRADIACIÓ SOLAR GLOBAL DIÀRIA ( MJ/m2 ) - 2012"
    [24] "PRESSIÓ ATMOSFÈRICA MITJANA MENSUAL, A NIVELL DE L'EMA ( hPa ) - 2012"
    [25] "PRESSIÓ ATMOSFÈRICA MÀXIMA ABSOLUTA MENSUAL ( hPa ) - 2012"           
    [26] "PRESSIÓ ATMOSFÈRICA MÍNIMA ABSOLUTA MENSUAL ( hPa ) - 2012"           
    [27] "GRUIX MÀXIM MENSUAL DE NEU AL TERRA ( cm ) - 2012"  
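
    The header-index step above can be sketched in Python as well (the R above is the original; this is only an illustration on toy lines, with the deep indentation shortened to 16 spaces):

    ```python
    import re

    # Toy page text: header lines are deeply indented and end in "2012";
    # footer lines also mention "2012" but contain "Anuari de".
    txt = [
        "                TEMPERATURA MITJANA MENSUAL ( \u00baC ) - 2012",
        "data row 1",
        "                Anuari de dades meteorol\u00f2giques 2012",
        "                PRECIPITACI\u00d3 MENSUAL ( mm ) - 2012",
    ]

    # candidate header lines: indented, containing "2012"
    heads = [i for i, line in enumerate(txt) if re.search(r" {16}.+2012", line)]
    # footer lines that would otherwise be caught by the header pattern
    notheads = [i for i, line in enumerate(txt) if re.search(r" {16}.+Anuari de", line)]
    # unique, whitespace-trimmed header texts, footers excluded
    headtxt = sorted({txt[i].strip() for i in heads if i not in notheads})

    print(headtxt)
    ```
    
    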
    

    The parens and dashes interfere with grepping, so I tried to get the values into a form where they can be used to locate page headers via grep(val, txt). Removing the "\\(.+$" matches works, with a single exception, which I decided to fix "by hand":

     headtxt[14:15]
    #[14] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT ≥ 0,1 mm) - 2012"                 
    #[15] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT > 0,2 mm) - 2012"  
    
    headtxt <- gsub("\\(.+$", "", headtxt)
    
    pagedivs <- lapply(headtxt, grep, txt)
    # Seemed reasonable that the first 5 (of 10) should be the first section
    pagedivs[[14]] <- pagedivs[[14]][1:5]
    pagedivs[[15]] <- pagedivs[[15]][6:10]
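
    Why the hand-fix is needed can be shown with a small Python check (again just an illustration of the R step): stripping everything from the first "(" makes items 14 and 15 identical, so a search for either label finds the pages of both tables.

    ```python
    import re

    headtxt = [
        "NOMBRE DE DIES DE PRECIPITACI\u00d3 (PPT \u2265 0,1 mm) - 2012",
        "NOMBRE DE DIES DE PRECIPITACI\u00d3 (PPT > 0,2 mm) - 2012",
    ]

    # drop everything from the first "(" to the end of the label,
    # mirroring gsub("\\(.+$", "", headtxt) in R
    stripped = [re.sub(r"\(.+$", "", h) for h in headtxt]

    # both labels collapse to the same prefix, hence the manual 1:5 / 6:10 split
    print(stripped[0] == stripped[1])
    ```
    
    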
    

    Looking for a marker for the end of a page, a run of four empty lines appears to be reliable:

    > length(notheads)
    [1] 113
    > rl.lens <- rle( nchar(txt) )
    > table(rl.lens$lengths[rl.lens$values==0])
    #  1   4 
    #226 113 
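
    The rle(nchar(txt)) trick has a direct Python analogue with itertools.groupby (toy data here; in the real file the count of 4-blank runs matched length(notheads) at 113):

    ```python
    from itertools import groupby
    from collections import Counter

    txt = ["a", "", "b", "", "", "", "", "c"]  # one 1-blank run, one 4-blank run

    # run-length encode "is this line empty?", mirroring rle(nchar(txt)) in R
    runs = [(empty, sum(1 for _ in grp))
            for empty, grp in groupby(txt, key=lambda s: len(s) == 0)]
    # tabulate the lengths of the empty-line runs only
    blank_run_lengths = Counter(n for empty, n in runs if empty)

    print(blank_run_lengths)
    ```
    
    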
    

    Removed all the "Ã" characters because they were creating non-fixed-width columns:

    txt <- gsub("Ã", "", txt)
    write(txt, "txt_noAs.txt")
    

    Interestingly, my text editor now shows "à"s where the "Ã"s used to appear. At this point one can loop over the pages within each page type, starting at pagedivs + 4 and running to the location of the four empty rows, and use read.fwf from the 'utils' package. What remains to support this is a layout definition, which you say you already have a handle on, but which could also be inferred using pkg:gsubfn's strapply or a regex solution.
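    That per-page loop can be sketched in Python (the column widths and station names here are made up; the real layout would come from the PDF, and in R this is what read.fwf would do):

    ```python
    def read_page(txt, header_idx, widths):
        """Collect one page's data rows: start 4 lines below the header and stop
        at the 4-empty-line divider; cut each row at fixed column widths."""
        rows, blanks = [], 0
        for line in txt[header_idx + 4:]:
            blanks = blanks + 1 if not line.strip() else 0
            if blanks == 4:                      # end-of-page marker
                break
            if line.strip():
                pos, fields = 0, []
                for w in widths:                 # slice at cumulative boundaries
                    fields.append(line[pos:pos + w].strip())
                    pos += w
                rows.append(fields)
        return rows

    # Toy page: a 10-char station column, then a 7- and a 6-char numeric column.
    txt = ["HEADER - 2012", "", "units", "",
           "Estacio A    12.3   4.5",
           "Estacio B    10.1   3.2",
           "", "", "", "",
           "NEXT HEADER - 2012"]
    page = read_page(txt, 0, widths=[10, 7, 6])
    print(page)
    ```
    
    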

    Looking for an approach to develop a regex solution:

    > numfields <- gregexpr("[-[:digit:].]+ ", txt)
    > table( sapply( numfields,  length))
    
       1    2    3    5    6    7    8   11   12   13   14   15 
    1201  193    8    1   13   15    2    4 1162  869  308   32 
      16   17   19   20   21   23   24   25   26   27   28   30 
       1    3    1    1    1    7   10  688  481  168   13    1 
    

    So clearly the pages fall into two classes: those where the number of numeric columns is 12-14 and those where it is 23-28. I would have expected this to be a bit different, but I guess the "ANY" (Catalan for "year") columns threw off my expectations.
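
    The field-counting step translates to Python like this (toy lines; the R pattern "[-[:digit:].]+ " becomes the equivalent r"[-\d.]+ "):

    ```python
    import re
    from collections import Counter

    txt = [
        "Estacio A   12.3  4.5  0.1  9.9 11.0  8.2  7.7  6.1  5.0  4.4  3.3  2.2",
        "Estacio B   -1.0  2.0  3.0",
        "no numbers here",
    ]

    # count whitespace-terminated numeric tokens per line, mirroring
    # gregexpr("[-[:digit:].]+ ", txt); a trailing space is appended so
    # the last token on a line is counted too
    counts = Counter(len(re.findall(r"[-\d.]+ ", line + " ")) for line in txt)

    print(counts)
    ```

    (One difference: gregexpr reports length 1 for a no-match line, which is why the R table's first column is 1 rather than 0.)
    
    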
