I am trying to scrape PDF tables which span multiple pages. I tried many things, but the best so far seems to be pdftotext -layout, as advised here. The problem is that the page formats vary. Efforts to construct an index for this (presumably the variation in formats relates to the different sub-reports) produced the headers below, which all seem to be for Catalunya:
heads <- grep(" .+2012", txt)           # candidate header lines mentioning the year
notheads <- grep(" .+Anuari de", txt)   # recurring 'Anuari de' footer lines to exclude
headtxt <- unique(trimws(txt[seq_along(txt) %in% heads &
                             !seq_along(txt) %in% notheads]))
[1] "TEMPERATURA MITJANA MENSUAL ( ºC ) - 2012"
[2] "TEMPERATURA MÀXIMA MITJANA MENSUAL ( ºC ) - 2012"
[3] "TEMPERATURA MÍNIMA MITJANA MENSUAL ( ºC ) - 2012"
[4] "TEMPERATURA MÀXIMA ABSOLUTA MENSUAL ( ºC ) - 2012"
[5] "TEMPERATURA MÍNIMA ABSOLUTA MENSUAL ( ºC ) - 2012"
[6] "AMPLITUD TÈRMICA MITJANA MENSUAL ( ºC ) - 2012"
[7] "AMPLITUD TÈRMICA MÀXIMA MENSUAL ( ºC ) - 2012"
[8] "NOMBRE DE DIES DE GLAÇADA ( TN ≤ 0 ºC ) - 2012"
[9] "PRECIPITACIÓ MENSUAL ( mm ) - 2012"
[10] "PRECIPITACIÓ MENSUAL MÀXIMA EN 24 HORES ( mm ) - 2012"
[11] "PRECIPITACIÓ MENSUAL MÀXIMA EN 1 HORA ( mm ) - 2012"
[12] "PRECIPITACIÓ MENSUAL MÀXIMA EN 30 MINUTS ( mm ) - 2012"
[13] "PRECIPITACIÓ MENSUAL MÀXIMA EN UN 1 MINUT ( mm ) - 2012"
[14] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT ≥ 0,1 mm) - 2012"
[15] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT > 0,2 mm) - 2012"
[16] "VELOCITAT MITJANA DEL VENT MENSUAL ( m/s ) - 2012"
[17] "DIRECCIÓ DOMINANT DEL VENT - 2012"
[18] "MITJANA MENSUAL DE LA RATXA MÀXIMA DIÀRIA DEL VENT ( m/s ) - 2012"
[19] "RATXA MÀXIMA ABSOLUTA DEL VENT MENSUAL ( m/s ) - 2012"
[20] "HUMITAT RELATIVA MITJANA MENSUAL ( % ) - 2012"
[21] "MITJANA MENSUAL DE LA HUMITAT RELATIVA MÀXIMA DIÀRIA ( % ) - 2012"
[22] "MITJANA MENSUAL DE LA HUMITAT RELATIVA MÍNIMA DIÀRIA ( % ) - 2012"
[23] "MITJANA MENSUAL DE LA IRRADIACIÓ SOLAR GLOBAL DIÀRIA ( MJ/m2 ) - 2012"
[24] "PRESSIÓ ATMOSFÈRICA MITJANA MENSUAL, A NIVELL DE L'EMA ( hPa ) - 2012"
[25] "PRESSIÓ ATMOSFÈRICA MÀXIMA ABSOLUTA MENSUAL ( hPa ) - 2012"
[26] "PRESSIÓ ATMOSFÈRICA MÍNIMA ABSOLUTA MENSUAL ( hPa ) - 2012"
[27] "GRUIX MÀXIM MENSUAL DE NEU AL TERRA ( cm ) - 2012"
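For reference, the indexing step above can be sketched on a tiny synthetic txt vector standing in for the pdftotext -layout output (ASCII stand-ins for the real accented titles); setdiff() is equivalent to the %in%/!%in% construction used above:

```r
# Synthetic stand-in for the pdftotext -layout output.
txt <- c("   TEMPERATURA MITJANA MENSUAL ( C ) - 2012",
         "   Anuari de dades meteorologiques 2012",   # footer: must be excluded
         "EMA            GEN   FEB",
         "   PRECIPITACIO MENSUAL ( mm ) - 2012")
heads    <- grep(" .+2012", txt)        # any line mentioning the year
notheads <- grep(" .+Anuari de", txt)   # the recurring 'Anuari de' footer
headtxt  <- unique(trimws(txt[setdiff(heads, notheads)]))
headtxt
# [1] "TEMPERATURA MITJANA MENSUAL ( C ) - 2012"
# [2] "PRECIPITACIO MENSUAL ( mm ) - 2012"
```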
The parens and dashes in those titles interfere with grepping, since they are regex metacharacters. Getting them into a form where each value can be used to identify its page-header locations via grep(val, txt) succeeds after removing the "\\(.+$" portion, with a single exception, which I decided to fix "by hand":
headtxt[14:15]
#[14] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT ≥ 0,1 mm) - 2012"
#[15] "NOMBRE DE DIES DE PRECIPITACIÓ (PPT > 0,2 mm) - 2012"
headtxt <- gsub("\\(.+$", "", headtxt)
pagedivs <- lapply(headtxt, grep, txt)
# Seemed reasonable that the first 5 (of 10) should be the first section
pagedivs[[14]] <- pagedivs[[14]][1:5]
pagedivs[[15]] <- pagedivs[[15]][6:10]
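As an aside, stripping the parenthesized portion is one way around the metacharacter problem; grep(..., fixed = TRUE) on the full title is another, and since the full titles of headers 14 and 15 differ, it would avoid the hand-fix above. A sketch on synthetic data (ASCII >= standing in for the real ≥):

```r
# Alternative to gsub("\\(.+$", ...): match titles literally so "(", ")"
# and other metacharacters need no escaping or stripping.
headtxt <- c("NOMBRE DE DIES DE PRECIPITACIO (PPT >= 0,1 mm) - 2012",
             "NOMBRE DE DIES DE PRECIPITACIO (PPT > 0,2 mm) - 2012")
txt <- c("  NOMBRE DE DIES DE PRECIPITACIO (PPT >= 0,1 mm) - 2012",
         "EMA  GEN",
         "  NOMBRE DE DIES DE PRECIPITACIO (PPT > 0,2 mm) - 2012")
pagedivs <- lapply(headtxt, grep, txt, fixed = TRUE)
# pagedivs[[1]] is 1 and pagedivs[[2]] is 3: no overlap between the
# two precipitation-day headers, so no manual splitting is required.
```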
Looking for a marker for the end of each page, a run of 4 empty lines appears to be reliable: there are 113 footer lines and, matching that, 113 runs of exactly four empty lines:
> length(notheads)
[1] 113
> rl.lens <- rle( nchar(txt) )
> table(rl.lens$lengths[rl.lens$values==0])
# 1 4
#226 113
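The run-length step can be turned into actual line positions for the page breaks; a minimal sketch, using cumsum of the run lengths to recover where each run starts:

```r
# Locate runs of >= 4 empty lines, which mark page ends.
txt <- c("HEADER", "1  2  3", "", "", "", "", "NEXT PAGE")
rl     <- rle(nchar(txt))
starts <- cumsum(c(1, head(rl$lengths, -1)))   # line where each run begins
page_ends <- starts[rl$values == 0 & rl$lengths >= 4]
page_ends
# [1] 3   # the blank block begins at line 3
```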
Removed all the "Ã" characters because they were creating non-fixed-width columns:
txt <- gsub("Ã", "", txt)
write(txt, "txt_noAs.txt")
Interestingly, my text editor now shows "à"s where the "Ã"s used to appear (presumably an encoding artifact). At this point one can loop over the pages within each page type, starting at pagedivs+4 and continuing to the location of the 4 empty rows, and use read.fwf from the 'utils' package. What remains to support this is a layout definition, which you say you already have a handle on, but which could also be inferred using pkg:gsubfn's strapply or a regex solution.
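A hedged sketch of that loop, with a purely hypothetical widths vector standing in for the real layout definition:

```r
# Read one page's body with read.fwf. 'start' is the header line and
# 'end' the line before the 4-empty-line divider; the body is assumed
# to begin 4 lines after the header (title + column-name rows skipped).
# The default 'widths' is a placeholder, NOT the real layout.
read_page <- function(txt, start, end, widths = c(25, rep(7, 12))) {
  body <- txt[(start + 4):end]
  read.fwf(textConnection(body), widths = widths,
           strip.white = TRUE, stringsAsFactors = FALSE)
}

# Toy usage with a 2-column layout:
pg <- c("TITLE - 2012", "", "EMA  ANY", "", "AAA 12", "BBB 34")
read_page(pg, start = 1, end = 6, widths = c(4, 2))
#    V1 V2
# 1 AAA 12
# 2 BBB 34
```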
Looking for an approach to develop a regex solution:
> numfields <- gregexpr("[-[:digit:].]+ ", txt)
> table( sapply( numfields, length))
1 2 3 5 6 7 8 11 12 13 14 15
1201 193 8 1 13 15 2 4 1162 869 308 32
16 17 19 20 21 23 24 25 26 27 28 30
1 3 1 1 1 7 10 688 481 168 13 1
So clearly the pages fall into two classes: those where the number of numeric columns is 12-14 and those where they number 23-28. I would have expected this to be a bit different, but I guess the "ANY" (Catalan for "year") columns threw off my expectations.
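The field-count idea above can be turned into a per-line classifier; a sketch, where the 20-field threshold is an assumption based on the 12-14 vs 23-28 split:

```r
# Count numeric tokens per line and bucket lines as narrow vs wide.
# Note: the pattern requires a trailing space after each number, as in
# the original gregexpr call; lines with no match still count as 1
# because gregexpr returns -1 for them.
txt <- c("EMA    1.2   3.4   5.6 ",
         "EMA    1 2 3 4 5 6 7 8 9 10 11 12 ")
nf   <- sapply(gregexpr("[-[:digit:].]+ ", txt), length)
nf
# [1]  3 12
wide <- nf > 20   # TRUE for lines from the 23-28 column page class
```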