How to create one dataframe from multiple csv files in a folder

Submitted by 匆匆过客 on 2019-12-11 18:53:39

Question


I have a set of CSV files (A1.csv, A2.csv, ..., D10.csv) in a folder. Each file contains two columns and several rows. I want to extract the value in the last row of the 2nd column from every CSV file and create a data frame that contains the file name in the 1st column and the extracted value (C) in the second column.

Right now I do this by writing a second set of CSV files and concatenating them into one data frame afterwards.

Is it possible to store the data frame produced from each CSV file in a list and then concatenate them (what rbind does in R)? I tried the code below in R and it works, but I would like to learn a more efficient way in R or Python (Python is preferable, as I am trying to learn it).

# Read each CSV file and select the last row of the 2nd column
f = list.files(path = getwd(), pattern = "\\.csv$")
for (g in f) {
  aa = read.csv(g)
  m = tail(aa, 1)
  q = m[, 2]
  yy = data.frame(ID = g, Final = q)
  # file.path() avoids the space that paste() inserts after "Filename/";
  # the "Filename" folder must already exist
  write.csv(yy, file = file.path("Filename", g), row.names = F)
}
# Concatenate the per-file results into one file
readFile = list.files(path = "Filename", pattern = "\\.csv$", full.names = TRUE)
Alldata = lapply(readFile, function(filename) {
  dummy = read.csv(filename)
  return(dummy)
})
FinalFIle = do.call(rbind, Alldata)
write.csv(FinalFIle, file = "FinalFIle.csv", row.names = F)

Answer 1:


Here is an option in R.

Step 1: Prepare a vector of file names. If the folder contains many files, the list.files function can build this vector for you. Here, I just created it manually. I also assume that all the files are stored in the working directory. Otherwise, you will need to construct the full file paths.

file_vec <- c("A1.csv", "A2.csv", "A3.csv")

Step 2: Read all the CSV files listed in file_vec. The key is to use the lapply function to apply read.csv to every element of file_vec.

dt_list <- lapply(file_vec, read.csv, stringsAsFactors = FALSE)

Step 3: Prepare a vector showing file names without .csv

name_vec <- sub("\\.csv$", "", file_vec)

Step 4: Create the data frame. x[nrow(x), 2] is a way to access the last value of the second column.

dt_final <- data.frame(File = name_vec,
                       Value = sapply(dt_list, function(x) x[nrow(x), 2]),
                       stringsAsFactors = FALSE)

dt_final is the final output.
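Since the question mentions a preference for Python, the four steps above can be sketched with pandas roughly as follows. This is a minimal sketch rather than a translation endorsed by the answer: the function name last_values is hypothetical, and it assumes every file has a header row, at least one data row, and at least two columns.

```python
from pathlib import Path

import pandas as pd


def last_values(file_vec):
    """Return one row per file: the file name (no extension) and the
    value in the last row of the second column."""
    # Step 2: read every CSV file named in file_vec
    dt_list = [pd.read_csv(f) for f in file_vec]
    # Step 3: file names without the .csv extension
    name_vec = [Path(f).stem for f in file_vec]
    # Step 4: .iat[-1, 1] is the last row of the second column
    values = [dt.iat[-1, 1] for dt in dt_list]
    return pd.DataFrame({"File": name_vec, "Value": values})
```

Calling last_values(["A1.csv", "A2.csv", "A3.csv"]) would then play the role of dt_final, provided those files exist in the working directory.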




Answer 2:


Here's another option using the tidyverse in R:

library(tidyverse)

# In my example, I'm using a folder with 4 Chicago Crime Datasets
setwd("INSERT/PATH/HERE")

files <- list.files()

tibble(filename = files) %>%
  mutate(file_contents = map(filename, ~ read_csv(.x, n_max = 10))) %>% 
  unnest(file_contents) %>%
  group_by(filename) %>%
  slice(n()) %>% 
  select(1:2)

Which returns:

# A tibble: 4 x 2
# Groups:   filename [4]
                         filename    X1
                            <chr> <int>
1 Chicago_Crimes_2001_to_2004.csv  4904
2 Chicago_Crimes_2005_to_2007.csv    10
3 Chicago_Crimes_2008_to_2011.csv  5867
4 Chicago_Crimes_2012_to_2017.csv  1891

Note that the n_max = 10 argument isn't needed. I only included this because the files I was working with are pretty large.

For anyone interested, the dataset can be found here.

Also, you may want to avoid setting the working directory with setwd(). If so, you can pass the additional argument full.names = TRUE to list.files():

path <- "INSERT/PATH/HERE"
files <- list.files(path, full.names = TRUE)

I'd recommend this approach, because scripts that call setwd() aren't portable: paths change from user to user.




Answer 3:


Python Solution

>>> import pandas as pd
>>> files = ['A1.csv', 'A2.csv', ... , 'D10.csv']
>>> df_final = pd.DataFrame([(fname, pd.read_csv(fname).iat[-1, 1]) for fname in files], columns=['ID', 'Final'])
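
The question also asks whether the data frame from each file can be kept in a list and concatenated afterwards, as rbind does in R. A minimal pandas sketch of that pattern is shown below. The function name collect_last_rows and the ID/Final column names (borrowed from the R code in the question) are assumptions, as is the requirement that every file has a header row, at least one data row, and at least two columns.

```python
from pathlib import Path

import pandas as pd


def collect_last_rows(folder):
    """Build one DataFrame holding, for every CSV file in `folder`,
    the file name and the value in the last row of the second column."""
    frames = []
    for path in sorted(Path(folder).glob("*.csv")):
        data = pd.read_csv(path)
        # One-row frame per file: file name plus the last value of column 2
        frames.append(
            pd.DataFrame({"ID": [path.name], "Final": [data.iat[-1, 1]]})
        )
    if not frames:
        # pd.concat raises on an empty list, so return an empty frame instead
        return pd.DataFrame(columns=["ID", "Final"])
    # pd.concat stacks the frames row-wise, much like do.call(rbind, ...) in R
    return pd.concat(frames, ignore_index=True)
```

Because the files are discovered with glob, no hand-written file list is needed, and no intermediate CSV files are written to disk.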



Answer 4:


This is an easy case for bash and friends. This one-liner

for i in A*.csv B*.csv C*.csv D*.csv; do awk -F , 'END{ print $NF }' "$i"; done

extracts the bottom-right field, no matter how many rows or columns, from any number of files that follow the pattern you have given. If all the files were in one folder, and they were the only .csv files in that folder, and you wanted to save the outcome in a new file, this would do the job:

for i in *.csv; do awk -F , 'END{ print $NF }' "$i"; done > extract.txt
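
For comparison, roughly the same extraction can be sketched in Python with only the standard library, keeping the file name next to each value. The function name bottom_right_fields is hypothetical, and unlike the awk one-liner this sketch skips blank lines and returns the values as strings in a dictionary instead of printing them.

```python
import csv
from pathlib import Path


def bottom_right_fields(folder):
    """Map each CSV file name in `folder` to the last field of its
    last non-empty row (returned as a string)."""
    result = {}
    for path in sorted(Path(folder).glob("*.csv")):
        last_row = None
        with open(path, newline="") as handle:
            for row in csv.reader(handle):
                if row:  # skip blank lines
                    last_row = row
        if last_row is not None:
            result[path.name] = last_row[-1]
    return result
```

Files that contain no rows at all are simply left out of the result.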


Source: https://stackoverflow.com/questions/47492690/how-to-create-one-dataframe-from-multiple-csv-files-in-a-folder