Can anyone please tell me how to read only the first 6 months (7 columns) for each year of the data below, for example by using read.table()?
Say the data are in a file named data.txt. You can use the colClasses argument of read.table() to skip columns. Here the data in the first 7 columns are "integer", and we set the remaining 6 columns to "NULL" to indicate that they should be skipped:
> read.table("data.txt", colClasses = c(rep("integer", 7), rep("NULL", 6)),
+ header = TRUE)
Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27 -2 -6 -10 -32
Change "integer" to one of the accepted types as detailed in ?read.table depending on the real type of data.
data.txt looks like this:
$ cat data.txt
"Year" "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
2009 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2010 -41 -27 -25 -31 -31 -39 -25 -15 -30 -27 -21 -25
2011 -21 -27 -2 -6 -10 -32 -13 -12 -27 -30 -38 -29
and was created by using
write.table(dat, file = "data.txt", row.names = FALSE)
where dat is
dat <- structure(list(Year = 2009:2011, Jan = c(-41L, -41L, -21L), Feb = c(-27L,
-27L, -27L), Mar = c(-25L, -25L, -2L), Apr = c(-31L, -31L, -6L
), May = c(-31L, -31L, -10L), Jun = c(-39L, -39L, -32L), Jul = c(-25L,
-25L, -13L), Aug = c(-15L, -15L, -12L), Sep = c(-30L, -30L, -27L
), Oct = c(-27L, -27L, -30L), Nov = c(-21L, -21L, -38L), Dec = c(-25L,
-25L, -29L)), .Names = c("Year", "Jan", "Feb", "Mar", "Apr",
"May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), class = "data.frame",
row.names = c(NA, -3L))
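If it is easier to name the columns to drop than to count them, colClasses can also be given as a named vector. The following is a minimal sketch, not part of the original answer: it assumes a small four-column stand-in for data.txt written to a temporary file, and the names tmp, skip, classes and kept are made up for illustration. It relies on the documented behaviour of read.table() that a named colClasses shorter than the number of columns is matched by name, with the unnamed columns left to be guessed.

```r
# Small stand-in for data.txt (assumed layout: header row, space-delimited).
tmp <- tempfile(fileext = ".txt")
dat <- data.frame(Year = 2009:2011, Jan = c(-41L, -41L, -21L),
                  Jul = c(-25L, -25L, -13L), Dec = c(-25L, -25L, -29L))
write.table(dat, file = tmp, row.names = FALSE)

# Name only the columns to skip; all other classes are guessed by read.table().
skip <- c("Jul", "Dec")
classes <- setNames(rep("NULL", length(skip)), skip)

kept <- read.table(tmp, header = TRUE, colClasses = classes)
names(kept)
```

Name matching only applies when the vector is shorter than the number of columns in the file; a full-length vector is used positionally.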
If the number of columns is not known beforehand, the utility function count.fields will read through the file and count the number of fields in each line.
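## returns an integer vector with one element per line of the file,
## giving the number of fields on that line; sep must match the file's
## delimiter (a tab here; omit sep for a whitespace-delimited file such as data.txt above)
count.fields("data.txt", sep = "\t")
## the maximum is the number of columns that colClasses has to cover
max(count.fields("data.txt", sep = "\t"))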
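Putting the two ideas together, the field count can be used to build colClasses when only the number of leading columns to keep is known. This is a rough sketch under stated assumptions: the file is a small whitespace-delimited stand-in written to a temporary path, and tmp, n_cols, keep, classes and first_cols are illustrative names rather than anything from the answers above.

```r
# Small stand-in for data.txt (assumed layout: header row, space-delimited).
tmp <- tempfile(fileext = ".txt")
dat <- data.frame(Year = 2009:2011, Jan = c(-41L, -41L, -21L),
                  Feb = c(-27L, -27L, -27L), Mar = c(-25L, -25L, -2L))
write.table(dat, file = tmp, row.names = FALSE)

# With the default sep, any run of whitespace separates fields.
n_cols <- max(count.fields(tmp))

# Keep the first `keep` columns as integer and skip everything after them.
keep <- 2
classes <- c(rep("integer", keep), rep("NULL", n_cols - keep))

first_cols <- read.table(tmp, header = TRUE, colClasses = classes)
names(first_cols)
```

Note that count.fields() reads through the whole file once, so on very large files this adds an extra pass before the actual read.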
You could also use JDBC to achieve this. Let's create a sample csv file.
write.table(x=mtcars, file="mtcars.csv", sep=",", row.names=FALSE, col.names=TRUE) # create example csv file
Download and save the CSV JDBC driver from this link: http://sourceforge.net/projects/csvjdbc/files/latest/download
> library(RJDBC)
> path.to.jdbc.driver <- "jdbc//csvjdbc-1.0-18.jar"
> drv <- JDBC("org.relique.jdbc.csv.CsvDriver", path.to.jdbc.driver)
> conn <- dbConnect(drv, sprintf("jdbc:relique:csv:%s", getwd()))
> head(dbGetQuery(conn, "select * from mtcars"), 3)
mpg cyl disp hp drat wt qsec vs am gear carb
1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
2 21 6 160 110 3.9 2.875 17.02 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1
> head(dbGetQuery(conn, "select mpg, gear from mtcars"), 3)
MPG GEAR
1 21 4
2 21 4
3 22.8 4
To read a specific set of columns from a dataset, there are several other options:
1) With fread from the data.table package:
The select parameter of fread specifies the columns to read, either as a vector of column names or as a vector of column numbers.
For the example dataset:
library(data.table)
dat <- fread("data.txt", select = c("Year","Jan","Feb","Mar","Apr","May","Jun"))
dat <- fread("data.txt", select = c(1:7))
Alternatively, you can use the drop parameter to indicate which columns should not be read:
dat <- fread("data.txt", drop = c("Jul","Aug","Sep","Oct","Nov","Dec"))
dat <- fread("data.txt", drop = c(8:13))
All result in:
> dat
Year Jan Feb Mar Apr May Jun
1 2009 -41 -27 -25 -31 -31 -39
2 2010 -41 -27 -25 -31 -31 -39
3 2011 -21 -27 -2 -6 -10 -32
UPDATE: If you don't want fread to return a data.table, set the parameter data.table = FALSE, e.g.: fread("data.txt", select = c(1:7), data.table = FALSE)
2) With read.csv.sql from the sqldf package:
The read.csv.sql function selects the columns with an SQL statement:
library(sqldf)
dat <- read.csv.sql("data.txt",
sql = "select Year,Jan,Feb,Mar,Apr,May,Jun from file",
sep = "\t")
3) With the read_* functions from the readr package:
library(readr)
dat <- read_table("data.txt",
col_types = cols_only(Year = 'i', Jan = 'i', Feb = 'i', Mar = 'i',
Apr = 'i', May = 'i', Jun = 'i'))
dat <- read_table("data.txt",
col_types = list(Jul = col_skip(), Aug = col_skip(), Sep = col_skip(),
Oct = col_skip(), Nov = col_skip(), Dec = col_skip()))
dat <- read_table("data.txt", col_types = 'iiiiiii______')
The documentation explains the characters used in a compact col_types string:
each character represents one column: c = character, i = integer, n = number, d = double, l = logical, D = date, T = date time, t = time, ? = guess, or _/- to skip the column
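You can do it like this: read a single row to detect the column classes, set the classes of the unwanted columns to NULL, and then read the whole file.
df = read.table("file.txt", nrows=1, header=TRUE, sep="\t", stringsAsFactors=FALSE)
# sapply() keeps each column's own class; apply() would first coerce the data frame to a matrix
colClasses = as.list(sapply(df, class))
needCols = c("Year", "Jan", "Feb", "Mar", "Apr", "May", "Jun")
# columns that are not needed get NULL, so read.table() skips them
colClasses[!names(colClasses) %in% needCols] = list(NULL)
df = read.table("file.txt", header=TRUE, colClasses=colClasses, sep="\t", stringsAsFactors=FALSE)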