pandas

Memory leaks when using pandas_udf and Parquet serialization?

一曲冷凌霜 posted on 2021-02-06 10:15:47
Question: I am currently developing my first whole system using PySpark and I am running into some strange, memory-related issues. In one of the stages, I would like to follow a Split-Apply-Combine strategy to modify a DataFrame: apply a function to each of the groups defined by a given column, and finally combine them all. The problem is that the function I want to apply is a prediction method for a fitted model that "speaks" the Pandas idiom, i.e., it is vectorized and …
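A minimal sketch of the split-apply-combine pattern the question describes, using a grouped pandas UDF via applyInPandas (Spark 3.x); the column names and the doubling stand-in for the fitted model's predict call are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group_col", "x"])

def predict_group(pdf):
    # pdf is a plain pandas DataFrame holding one group; replace the doubling
    # with the fitted model's vectorized call, e.g. model.predict(pdf[["x"]])
    pdf["prediction"] = pdf["x"] * 2.0
    return pdf

result = sdf.groupBy("group_col").applyInPandas(
    predict_group, schema="group_col string, x double, prediction double")
result.show()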

Pandas Rolling Window - datetime64[ns] are not implemented

北城以北 posted on 2021-02-06 10:14:30
Question: I'm attempting to use Python/Pandas to build some charts. I have data that is sampled every second. Here is a sample:

Index, Time, Value
31362, 1975-05-07 07:59:18, 36.151612
31363, 1975-05-07 07:59:19, 36.181368
31364, 1975-05-07 07:59:20, 36.197195
31365, 1975-05-07 07:59:21, 36.151413
31366, 1975-05-07 07:59:22, 36.138009
31367, 1975-05-07 07:59:23, 36.142962
31368, 1975-05-07 07:59:24, 36.122680

I need to create a variety of windows to look at the data: 10, 100, 1000, etc. Unfortunately …
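A sketch of the usual workaround: the error typically means a rolling aggregation was applied to the datetime column itself, so move Time into the index and roll over the numeric column only. The values below come from the sample; the window sizes are illustrative:

import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime(["1975-05-07 07:59:18", "1975-05-07 07:59:19",
                            "1975-05-07 07:59:20", "1975-05-07 07:59:21",
                            "1975-05-07 07:59:22"]),
    "Value": [36.151612, 36.181368, 36.197195, 36.151413, 36.138009],
}).set_index("Time")

print(df["Value"].rolling(3).mean())      # count-based window
print(df["Value"].rolling("10s").mean())  # time-based window on a DatetimeIndex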

Forward fill all except last value in python pandas dataframe

柔情痞子 posted on 2021-02-06 09:20:46
Question: I have a dataframe in pandas with several columns whose values I want to forward fill. At the moment I'm doing:

columns = ['a', 'b', 'c']
for column in columns:
    df[column].fillna(method='ffill', inplace=True)

...but because the series in the columns are different lengths, that leaves long tails of filled values on the ends of some of them. Because the gaps in some of the series are quite large, I can't use fillna's limit parameter without also leaving long tails of filled values on …
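A sketch of one common fix: forward fill, then mask out everything past each column's last valid value using a backfill, so only interior gaps are filled. Column names follow the snippet above; the sample values are made up:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0, np.nan],
    "c": [1.0, np.nan, np.nan, np.nan, 5.0],
})

columns = ["a", "b", "c"]
# ffill() fills interior gaps but also extends past the last valid value;
# bfill().notna() is False exactly on those trailing positions, so where()
# puts the NaN tails back. On pandas >= 2.2, ffill(limit_area="inside")
# does the same in one call.
df[columns] = df[columns].ffill().where(df[columns].bfill().notna())
print(df)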

Pandas: Conditionally replace values based on other columns values

随声附和 posted on 2021-02-06 09:19:27
Question: I have a dataframe (df) that looks like this:

time                 environment  event
2017-04-28 13:08:22  NaN          add_rd
2017-04-28 08:58:40  NaN          add_rd
2017-05-03 07:59:35  test         add_env
2017-05-03 08:05:14  prod         add_env
...

Now my goal is that for each add_rd in the event column, the associated NaN value in the environment column should be replaced with the string RD:

time                 environment  event
2017-04-28 13:08:22  RD           add_rd
2017-04-28 08:58:40  RD           add_rd
2017-05-03 07:59:35  test         add_env
2017-05-03 08:05:14  prod         add_env
...
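A minimal sketch of the usual boolean-indexing answer, rebuilt from the frame shown above:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"environment": [np.nan, np.nan, "test", "prod"],
     "event": ["add_rd", "add_rd", "add_env", "add_env"]},
    index=pd.to_datetime(["2017-04-28 13:08:22", "2017-04-28 08:58:40",
                          "2017-05-03 07:59:35", "2017-05-03 08:05:14"]),
)
df.index.name = "time"

# Select the rows where event is add_rd and overwrite only their environment
df.loc[df["event"] == "add_rd", "environment"] = "RD"
print(df)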

Pandas as fast data storage for Flask application

女生的网名这么多〃 posted on 2021-02-06 09:06:35
Question: I'm impressed by the speed of running transformations and loading data in Pandas, and by its ease of use, and I want to leverage all these nice properties (amongst others) to model some large-ish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask. I'm currently using a Postgres database to store the data, but the import of the data (coming from CSV files) is slow, tedious, and error-prone, and …
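One possible direction, sketched under the assumption that each data set fits in memory and can live as a Parquet file on disk (needs pyarrow or fastparquet installed; the file name and route are hypothetical):

import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)

# One-off import: parse the CSV once, persist as Parquet for fast reloads.
# pd.read_csv("data.csv").to_parquet("data.parquet")

@app.route("/data")
def data():
    # Columnar load; much faster than re-parsing the CSV on every request
    df = pd.read_parquet("data.parquet")
    return jsonify(df.head(100).to_dict(orient="records"))

if __name__ == "__main__":
    app.run()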

Pandas rolling gives NaN

时间秒杀一切 posted on 2021-02-06 08:41:23
Question: I'm looking at the tutorials on window functions, but I don't quite understand why the following code produces NaNs. If I understand correctly, the code creates a rolling window of size 2. Why do the first, fourth, and fifth rows have NaN? At first, I thought it's because adding NaN to another number would produce NaN, but then I'm not sure why the second row wouldn't be NaN.

dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                   index=pd.date_range('20130101 09:00:00', periods=5, freq='s'))

…
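A short sketch reproducing the example with the usual explanation: rolling aggregations skip NaN rather than propagating it, and a window yields NaN whenever it holds fewer than min_periods non-NaN values, where min_periods defaults to the window size:

import numpy as np
import pandas as pd

dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                   index=pd.date_range('20130101 09:00:00', periods=5, freq='s'))

# Row 1's window holds a single value; the windows of rows 4 and 5 each hold
# only one non-NaN value because of the NaN at row 4, below min_periods=2.
print(dft.rolling(2).sum())

# With min_periods=1 every window has at least one valid value, so no NaNs
print(dft.rolling(2, min_periods=1).sum())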

Pandas: control new column names when merging two dataframes?

一世执手 posted on 2021-02-06 07:55:40
Question: I would like to merge two Pandas dataframes together and control the names of the new column values. I originally created the dataframes from CSV files. The original CSV files looked like this:

# presents.csv
org,name,items,spend...
12A,Clerkenwell,151,435,...
12B,Liverpool Street,37,212,...
...

# trees.csv
org,name,items,spend...
12A,Clerkenwell,0,0,...
12B,Liverpool Street,2,92,...
...

Now I have two data frames:

df_presents = pd.read_csv(StringIO(presents_txt))
df_trees = pd.read_csv…
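A minimal sketch of the standard answer, the suffixes argument of pd.merge; the inline CSV strings below stand in for the files above:

from io import StringIO
import pandas as pd

presents_txt = "org,name,items,spend\n12A,Clerkenwell,151,435\n12B,Liverpool Street,37,212"
trees_txt = "org,name,items,spend\n12A,Clerkenwell,0,0\n12B,Liverpool Street,2,92"

df_presents = pd.read_csv(StringIO(presents_txt))
df_trees = pd.read_csv(StringIO(trees_txt))

# Shared key columns join the frames; overlapping value columns (items, spend)
# get explicit suffixes instead of the default _x/_y
merged = pd.merge(df_presents, df_trees, on=["org", "name"],
                  suffixes=("_presents", "_trees"))
print(merged)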