pandas

Memory leaks when using pandas_udf and Parquet serialization?

一曲冷凌霜 posted on 2021-02-06 10:15:47
Question: I am currently developing my first whole system using PySpark and I am running into some strange, memory-related issues. In one of the stages, I would like to follow a Split-Apply-Combine strategy to modify a DataFrame: apply a function to each of the groups defined by a given column, and finally combine them all. The problem is that the function I want to apply is a prediction method for a fitted model that "speaks" the Pandas idiom, i.e., it is vectorized and …
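A minimal sketch of the split-apply-combine pattern the question describes, using a grouped pandas UDF via applyInPandas (Spark 3.x); the column names and the doubling stand-in for the fitted model's predict call are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group_col", "x"])

def predict_group(pdf):
    # pdf is a plain pandas DataFrame holding one group; replace the doubling
    # with the fitted model's vectorized call, e.g. model.predict(pdf[["x"]])
    pdf["prediction"] = pdf["x"] * 2.0
    return pdf

result = sdf.groupBy("group_col").applyInPandas(
    predict_group, schema="group_col string, x double, prediction double")
result.show()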

Pandas Rolling Window - datetime64[ns] are not implemented

北城以北 posted on 2021-02-06 10:14:30
Question: I'm attempting to use Python/Pandas to build some charts. I have data that is sampled every second. Here is a sample:

Index, Time, Value
31362, 1975-05-07 07:59:18, 36.151612
31363, 1975-05-07 07:59:19, 36.181368
31364, 1975-05-07 07:59:20, 36.197195
31365, 1975-05-07 07:59:21, 36.151413
31366, 1975-05-07 07:59:22, 36.138009
31367, 1975-05-07 07:59:23, 36.142962
31368, 1975-05-07 07:59:24, 36.122680

I need to create a variety of windows to look at the data: 10, 100, 1000, etc. Unfortunately …
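A sketch of the usual workaround: the error typically means a rolling aggregation was applied to the datetime column itself, so move Time into the index and roll over the numeric column only. The values below come from the sample; the window sizes are illustrative:

import pandas as pd

df = pd.DataFrame({
    "Time": pd.to_datetime(["1975-05-07 07:59:18", "1975-05-07 07:59:19",
                            "1975-05-07 07:59:20", "1975-05-07 07:59:21",
                            "1975-05-07 07:59:22"]),
    "Value": [36.151612, 36.181368, 36.197195, 36.151413, 36.138009],
}).set_index("Time")

print(df["Value"].rolling(3).mean())      # count-based window
print(df["Value"].rolling("10s").mean())  # time-based window on a DatetimeIndex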

Forward fill all except last value in python pandas dataframe

柔情痞子 posted on 2021-02-06 09:20:46
Question: I have a dataframe in pandas with several columns whose values I want to forward fill. At the moment I'm doing:

columns = ['a', 'b', 'c']
for column in columns:
    df[column].fillna(method='ffill', inplace=True)

...but because the series in the columns are different lengths, that leaves long tails of filled values on the ends of some of them. Because the gaps in some of the series are quite large, I can't use fillna's limit parameter without also leaving long tails of filled values on …
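A sketch of one common fix: forward fill, then mask out everything past each column's last valid value using a backfill, so only interior gaps are filled. Column names follow the snippet above; the sample values are made up:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0, np.nan],
    "c": [1.0, np.nan, np.nan, np.nan, 5.0],
})

columns = ["a", "b", "c"]
# ffill() fills interior gaps but also extends past the last valid value;
# bfill().notna() is False exactly on those trailing positions, so where()
# puts the NaN tails back. On pandas >= 2.2, ffill(limit_area="inside")
# does the same in one call.
df[columns] = df[columns].ffill().where(df[columns].bfill().notna())
print(df)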

Pandas: Conditionally replace values based on other columns values

随声附和 posted on 2021-02-06 09:19:27
Question: I have a dataframe (df) that looks like this:

time                 environment  event
2017-04-28 13:08:22  NaN          add_rd
2017-04-28 08:58:40  NaN          add_rd
2017-05-03 07:59:35  test         add_env
2017-05-03 08:05:14  prod         add_env
...

Now my goal is that for each add_rd in the event column, the associated NaN value in the environment column should be replaced with the string RD:

time                 environment  event
2017-04-28 13:08:22  RD           add_rd
2017-04-28 08:58:40  RD           add_rd
2017-05-03 07:59:35  test         add_env
2017-05-03 08:05:14  prod         add_env
...
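A minimal sketch of the usual boolean-indexing answer, rebuilt from the frame shown above:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"environment": [np.nan, np.nan, "test", "prod"],
     "event": ["add_rd", "add_rd", "add_env", "add_env"]},
    index=pd.to_datetime(["2017-04-28 13:08:22", "2017-04-28 08:58:40",
                          "2017-05-03 07:59:35", "2017-05-03 08:05:14"]),
)
df.index.name = "time"

# Select the rows where event is add_rd and overwrite only their environment
df.loc[df["event"] == "add_rd", "environment"] = "RD"
print(df)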

Pandas as fast data storage for Flask application

女生的网名这么多〃 posted on 2021-02-06 09:06:35
Question: I'm impressed by the speed of running transformations and loading data in Pandas, and by its ease of use, and I want to leverage all these nice properties (amongst others) to model some large-ish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask. I'm currently using a Postgres database to store the data, but the import of the data (coming from CSV files) is slow, tedious, and error-prone, and …
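One possible direction, sketched under the assumption that each data set fits in memory and can live as a Parquet file on disk (needs pyarrow or fastparquet installed; the file name and route are hypothetical):

import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)

# One-off import: parse the CSV once, persist as Parquet for fast reloads.
# pd.read_csv("data.csv").to_parquet("data.parquet")

@app.route("/data")
def data():
    # Columnar load; much faster than re-parsing the CSV on every request
    df = pd.read_parquet("data.parquet")
    return jsonify(df.head(100).to_dict(orient="records"))

if __name__ == "__main__":
    app.run()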

Pandas rolling gives NaN

时间秒杀一切 posted on 2021-02-06 08:41:23
Question: I'm looking at the tutorials on window functions, but I don't quite understand why the following code produces NaNs. If I understand correctly, the code creates a rolling window of size 2. Why do the first, fourth, and fifth rows have NaN? At first, I thought it's because adding NaN to another number would produce NaN, but then I'm not sure why the second row wouldn't be NaN.

dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                   index=pd.date_range('20130101 09:00:00', periods=5, freq='s'))

…
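A short sketch reproducing the example with the usual explanation: rolling aggregations skip NaN rather than propagating it, and a window yields NaN whenever it holds fewer than min_periods non-NaN values, where min_periods defaults to the window size:

import numpy as np
import pandas as pd

dft = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
                   index=pd.date_range('20130101 09:00:00', periods=5, freq='s'))

# Row 1's window holds a single value; the windows of rows 4 and 5 each hold
# only one non-NaN value because of the NaN at row 4, below min_periods=2.
print(dft.rolling(2).sum())

# With min_periods=1 every window has at least one valid value, so no NaNs
print(dft.rolling(2, min_periods=1).sum())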

Pandas: control new column names when merging two dataframes?

一世执手 posted on 2021-02-06 07:55:40
Question: I would like to merge two Pandas dataframes together and control the names of the new column values. I originally created the dataframes from CSV files. The original CSV files looked like this:

# presents.csv
org,name,items,spend...
12A,Clerkenwell,151,435,...
12B,Liverpool Street,37,212,...
...

# trees.csv
org,name,items,spend...
12A,Clerkenwell,0,0,...
12B,Liverpool Street,2,92,...
...

Now I have two data frames:

df_presents = pd.read_csv(StringIO(presents_txt))
df_trees = pd.read_csv…
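A minimal sketch of the standard answer, the suffixes argument of pd.merge; the inline CSV strings below stand in for the files above:

from io import StringIO
import pandas as pd

presents_txt = "org,name,items,spend\n12A,Clerkenwell,151,435\n12B,Liverpool Street,37,212"
trees_txt = "org,name,items,spend\n12A,Clerkenwell,0,0\n12B,Liverpool Street,2,92"

df_presents = pd.read_csv(StringIO(presents_txt))
df_trees = pd.read_csv(StringIO(trees_txt))

# Shared key columns join the frames; overlapping value columns (items, spend)
# get explicit suffixes instead of the default _x/_y
merged = pd.merge(df_presents, df_trees, on=["org", "name"],
                  suffixes=("_presents", "_trees"))
print(merged)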