Plot a histogram of a subset of data

Question

[Image: screenshot of the .txt file of the data.] The data consists of 2,075,259 rows and 9 columns.

Measurements of electric power consumption in one household with a one-minute sampling rate over a period of almost 4 years. Different electrical quantities and some sub-metering values are available.

Only data from the dates 2007-02-01 and 2007-02-02 is needed. I was trying to plot a histogram of "Global_active_power" for those two dates.

Note that in this dataset missing values are coded as "?".

This is the code I was using to plot the histogram:

{

data <- read.table("household_power_consumption.txt", header=TRUE)
my_data <- data[data$Date %in% as.Date(c('01/02/2007', '02/02/2007'))]

my_data <- gsub(";", " ", my_data)    # replace ";" with " "  
my_data <- gsub("?", "NA", my_data)   # convert "?" to "NA"
my_data <- as.numeric(my_data)      # turn into numbers

hist(my_data["Global_active_power"])

}

After running the code, it shows this error:

Error in hist.default(my_data["Global_active_power"]) : invalid number of 'breaks'

Can you please help me spot the mistake in the code?

Link to the data file: https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2Fhousehold_power_consumption.zip

Answer 1:

You need to provide the separator (";") explicitly, and your types aren't what you think they are. Observe:

data <- read.table("household_power_consumption.txt", header=TRUE, sep=';', na.strings='?')
data$Date <- as.Date(data$Date, format='%d/%m/%Y')
bottom.date <- as.Date('01/02/2007', format='%d/%m/%Y')
top.date <- as.Date('02/02/2007', format='%d/%m/%Y')
# Inclusive bounds keep both requested days; column 3 is Global_active_power
my_data <- data[data$Date >= bottom.date & data$Date <= top.date, 3]
hist(my_data)

This gives the histogram of Global_active_power as the plot (image not preserved). Hope that helps.
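To check the points above without loading the full file, here is a minimal sketch on a made-up five-row sample in the same layout (";" separator, "?" for missing values). The sample values and the names sample_txt, df, keep and gap are assumptions for illustration only; they are not taken from the real data set.

```r
# Made-up sample mimicking the file layout; not real measurements.
sample_txt <- "Date;Time;Global_active_power
31/1/2007;23:59:00;1.234
1/2/2007;00:00:00;0.326
1/2/2007;00:01:00;?
2/2/2007;23:59:00;2.580
3/2/2007;00:00:00;3.100"

# sep and na.strings do the work that the gsub() calls were attempting.
df <- read.table(text = sample_txt, header = TRUE, sep = ";",
                 na.strings = "?", stringsAsFactors = FALSE)
df$Date <- as.Date(df$Date, format = "%d/%m/%Y")

# Inclusive comparison keeps both 2007-02-01 and 2007-02-02.
keep <- df$Date >= as.Date("2007-02-01") & df$Date <= as.Date("2007-02-02")
gap <- df$Global_active_power[keep]

str(df)                       # Date is a Date, Global_active_power is numeric
h <- hist(gap, plot = FALSE)  # hist() ignores the NA value on its own
```

Passing a plain numeric vector to hist(), rather than a one-column data frame such as my_data["Global_active_power"], is what the function expects.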

Answer 2:

Given that you have 2 million rows (though not too many columns), you're firmly in fread territory.

Here's how I would do what you want:

library(data.table)
data<-fread("household_power_consumption.txt",sep=";", #1
            na.strings=c("?","NA"),colClasses="character" #2
            )[,Date:=as.Date(Date,format="%d/%m/%Y")
              ][Date %in% seq(from=as.Date("2007-02-01"), #3
                              to=as.Date("2007-02-02"),by="day")]

numerics<-setdiff(names(data),c("Date","Time")) #4
data[,(numerics):=lapply(.SD,as.numeric),.SDcols=numerics]

data[,hist(Global_active_power)] #5

A brief explanation of what's going on (the numbers refer to the #-comments in the code):

1: See the data.table vignettes for great introductions to the package. Here, given the structure of your data, we tell fread up front that ; is what separates fields (which is nonstandard)

2: We can tell fread up front that it can expect ? in some of the columns and should treat them as NA--e.g., here's data[8640] before setting na.strings:

         Date     Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00                   ?                     ?       ?                ?              ?              ?             NA

Once we set na.strings, we sidestep having to replace ? as NA later:

         Date     Time Global_active_power Global_reactive_power Voltage Global_intensity Sub_metering_1 Sub_metering_2 Sub_metering_3
1: 21/12/2006 11:23:00                  NA                    NA      NA               NA             NA             NA             NA

On the other hand, we also have to read those fields as characters, even though they're numeric. This is something I'm hoping fread will be able to handle automatically in the future.

3: data.table commands can be chained (from left to right); I'm using this to subset the data before it's assigned. It's up to you whether you find that more or less readable, as there are only marginal performance differences.

4: Since we had to read the numeric fields as strings, we now recast them as numeric; this is the standard data.table syntax for doing so.

5: Once we've got our data subset as we like and of the right type, we can pass hist as an argument in j and get what we want.
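The chaining and recasting steps can be tried on a small hand-built table. This is only a sketch: the three rows below are invented, and dt is a hypothetical stand-in for the result of fread with colClasses="character".

```r
library(data.table)

# Invented rows; every column starts as character, as after colClasses="character".
dt <- data.table(Date = c("1/2/2007", "2/2/2007", "3/2/2007"),
                 Global_active_power = c("0.326", NA, "3.100"),
                 Voltage = c("240.1", "239.5", NA))

# Chain: convert Date by reference, then keep only the two wanted days.
dt <- dt[, Date := as.Date(Date, format = "%d/%m/%Y")
         ][Date %in% as.Date(c("2007-02-01", "2007-02-02"))]

# Recast every non-date column to numeric in place.
numerics <- setdiff(names(dt), "Date")
dt[, (numerics) := lapply(.SD, as.numeric), .SDcols = numerics]

sapply(dt, class)  # inspect the resulting column types
```

The parentheses around numerics on the left of := tell data.table to use the names stored in that variable rather than create a column literally called "numerics".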

Note that if all you wanted from this data set was the histogram, you could have condensed the code a bit:

ok_dates<-seq(from=as.Date("2007-02-01"),
              to=as.Date("2007-02-02"),by="day")
fread("household_power_consumption.txt",sep=";",
      select=c("Date","Global_active_power"),
      na.strings=c("?","NA"),colClasses="character"
      )[,Date:=as.Date(Date,format="%d/%m/%Y")
        ][Date %in% ok_dates,hist(as.numeric(Global_active_power))]

Source: https://stackoverflow.com/questions/31284591/plot-a-histogram-of-subset-of-a-data

Tags

histogram