dataframe

Efficient way to read 15 M lines csv files in python

泄露秘密 submitted on 2021-02-05 18:54:07
Question: For my application, I need to read multiple files of 15 million lines each, store them in a DataFrame, and save the DataFrame in HDF5 format. I have already tried several approaches, notably pandas.read_csv with chunksize and dtype specifications, and dask.dataframe. Both take around 90 seconds to process one file, so I'd like to know whether there is a more efficient way to process these files as described. Below is some code from the tests I ran. import pandas as pd import
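
For reference, the read_csv approach the question mentions (chunksize plus explicit dtypes) can be sketched as follows. This is a minimal illustration, not the asker's code, which is cut off above: the column names `id` and `value`, the in-memory CSV text, and the chunk size are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical stand-in for one large CSV file; a real file path works the same way.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 0.5}" for i in range(10))

# Explicit dtypes spare pandas from inferring a type for every column.
dtypes = {"id": "int32", "value": "float32"}

# With chunksize set, read_csv returns an iterator of DataFrames.
chunks = pd.read_csv(io.StringIO(csv_text), dtype=dtypes, chunksize=4)

# Concatenate once at the end rather than growing a DataFrame chunk by chunk.
df = pd.concat(chunks, ignore_index=True)
```

Writing the result with `DataFrame.to_hdf` additionally needs the optional PyTables dependency, so that step is left out of the sketch.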

Efficient way to read 15 M lines csv files in python

别等时光非礼了梦想. submitted on 2021-02-05 18:52:03
Question: For my application, I need to read multiple files of 15 million lines each, store them in a DataFrame, and save the DataFrame in HDF5 format. I have already tried several approaches, notably pandas.read_csv with chunksize and dtype specifications, and dask.dataframe. Both take around 90 seconds to process one file, so I'd like to know whether there is a more efficient way to process these files as described. Below is some code from the tests I ran. import pandas as pd import

Create date column from datetime in R

为君一笑 submitted on 2021-02-05 12:32:33
Question: I am new to R; I am an avid SAS programmer and am having a difficult time wrapping my head around R. Within a data frame I have a date-time column stored as POSIXct, with values appearing as "2013-01-01 00:53:00". I would like to create a date column using a function that extracts the date, and another column that extracts the hour. In an ideal world I would like to be able to extract the date, year, day, month, time and hour all within the data frame to create these

Reassign index of a dataframe

陌路散爱 submitted on 2021-02-05 12:30:19
Question: I have the following dataframe: Month 1 -0.075844 2 -0.089111 3 0.042705 4 0.002147 5 -0.010528 6 0.109443 7 0.198334 8 0.209830 9 0.075139 10 -0.062405 11 -0.211774 12 -0.109167 1 -0.075844 2 -0.089111 3 0.042705 4 0.002147 5 -0.010528 6 0.109443 7 0.198334 8 0.209830 9 0.075139 10 -0.062405 11 -0.211774 12 -0.109167 Name: Passengers, dtype: float64 As you can see, the index runs from 1 to 12 twice; instead, I would like to change the index to 1-24. The problem is that when plotting
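
One way to renumber a repeated index like the one above is to assign a fresh 1-based range. The sketch below uses a shortened, made-up version of the Passengers series (three months repeated twice instead of twelve); it is an illustration under that assumption, not the accepted answer to the question.

```python
import pandas as pd

# Shortened stand-in for the series in the question: months 1-3 listed twice.
values = [-0.075844, -0.089111, 0.042705]
s = pd.Series(values * 2, index=[1, 2, 3] * 2, name="Passengers")
s.index.name = "Month"

# Replace the repeated index with a continuous 1..len(s) range.
renumbered = s.copy()
renumbered.index = pd.RangeIndex(start=1, stop=len(s) + 1, name="Month")
```

Working on a copy keeps the original series untouched, which helps when the repeated month labels are still needed elsewhere.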

How to subset all rows in a dataframe that have a particular value

你说的曾经没有我的故事 submitted on 2021-02-05 12:25:22
Question: I have a large dataset in which each row contains a different combination of "NA", "1" and "2". I would like to subset all rows that contain only "2" and "NA". So in the sample below, I'd like to automatically name and subset Row1 and Row4: df <- data.frame(Col1=c(NA,NA,2,NA), Col2=c(NA,NA,1,2), Col3=c(NA,1,NA,NA), Col4=c(2,NA,NA,NA), row.names=c("Row1","Row2","Row3","Row4"), stringsAsFactors = FALSE) Answer 1: Try this: target <- 2 #print row names names(which(apply(df, 1, function(x)

Reassign index of a dataframe

空扰寡人 submitted on 2021-02-05 12:25:05
Question: I have the following dataframe: Month 1 -0.075844 2 -0.089111 3 0.042705 4 0.002147 5 -0.010528 6 0.109443 7 0.198334 8 0.209830 9 0.075139 10 -0.062405 11 -0.211774 12 -0.109167 1 -0.075844 2 -0.089111 3 0.042705 4 0.002147 5 -0.010528 6 0.109443 7 0.198334 8 0.209830 9 0.075139 10 -0.062405 11 -0.211774 12 -0.109167 Name: Passengers, dtype: float64 As you can see, the index runs from 1 to 12 twice; instead, I would like to change the index to 1-24. The problem is that when plotting

How to subset all rows in a dataframe that have a particular value

旧时模样 submitted on 2021-02-05 12:23:50
Question: I have a large dataset in which each row contains a different combination of "NA", "1" and "2". I would like to subset all rows that contain only "2" and "NA". So in the sample below, I'd like to automatically name and subset Row1 and Row4: df <- data.frame(Col1=c(NA,NA,2,NA), Col2=c(NA,NA,1,2), Col3=c(NA,1,NA,NA), Col4=c(2,NA,NA,NA), row.names=c("Row1","Row2","Row3","Row4"), stringsAsFactors = FALSE) Answer 1: Try this: target <- 2 #print row names names(which(apply(df, 1, function(x)

Pandas concat flips all my values in the DataFrame

断了今生、忘了曾经 submitted on 2021-02-05 12:23:41
Question: I have a dataframe called 'running_tally': list jan_to jan_from 0 LA True False 1 NY False True I am trying to append new data to it in the form of a single-column dataframe called 'new_data': list 0 HOU 1 LA I concat these two dataframes on their 'list' column for further processing, but immediately after I do that, all the boolean values unexpectedly flip. running_tally = pd.concat([running_tally,new_data]).groupby('list',as_index=False).first() The above statement will produce: list jan_to jan
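
The excerpt is cut off before the output, so the sketch below does not diagnose the flipped values. It only shows one alternative to the concat-then-groupby step, assuming the goal is to keep the existing rows of `running_tally` and add any city from `new_data` that is not yet listed; that goal is an assumption, not something the excerpt states.

```python
import pandas as pd

# The two frames as described in the question.
running_tally = pd.DataFrame(
    {"list": ["LA", "NY"], "jan_to": [True, False], "jan_from": [False, True]}
)
new_data = pd.DataFrame({"list": ["HOU", "LA"]})

# Stack the frames; rows from new_data get missing values in the boolean columns.
combined = pd.concat([running_tally, new_data], ignore_index=True)

# Keep the first row per city, so existing running_tally rows take precedence.
result = combined.drop_duplicates(subset="list", keep="first").reset_index(drop=True)
```

Because `drop_duplicates` only discards rows and never aggregates, the boolean values of existing cities pass through unchanged, while new cities such as HOU keep missing values until they are filled in.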

Pandas concat flips all my values in the DataFrame

半城伤御伤魂 submitted on 2021-02-05 12:22:00
Question: I have a dataframe called 'running_tally': list jan_to jan_from 0 LA True False 1 NY False True I am trying to append new data to it in the form of a single-column dataframe called 'new_data': list 0 HOU 1 LA I concat these two dataframes on their 'list' column for further processing, but immediately after I do that, all the boolean values unexpectedly flip. running_tally = pd.concat([running_tally,new_data]).groupby('list',as_index=False).first() The above statement will produce: list jan_to jan

How do I create a new column in r that is a binomial variable based on a string variable? [duplicate]

泪湿孤枕 submitted on 2021-02-05 12:21:08
Question: This question already has answers here: Vectorized IF statement in R? (6 answers); Convert dataframe column to 1 or 0 for "true"/"false" values and assign to dataframe (5 answers). Closed 2 years ago. I'm currently trying to create a new column in my data frame based on another column using mutate(). I want to make the new column a binomial variable (1 or 0) based on whether the column it's based on says "Active" or not. I'm currently trying to do it by saying: violations$outcome = if