Question
- This must use vectorized methods, nothing iterative
I would like to create a numpy array from a pandas dataframe.
My code:
import pandas as pd
_df = pd.DataFrame({'item': ['book', 'book' , 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})
item color val
book green -22.70
book blue -109.60
car red -57.19
car green -11.20
bike blue -25.60
bike red -33.61
There are about 12k million rows.
I need to create a numpy array like :
item green blue red
book -22.70 -109.60 null
car -11.20 null -57.19
bike null -25.60 -33.61
Each row is an item name and each column is a color name. The order of the items and colors is not important. However, a numpy array has no row or column names, so I need to keep the item and color name for each value in order to know what each value in the array represents.
For example, how do I know that -57.19 belongs to "car" and "red" in the numpy array?
So, I need to create a dictionary to keep the mapping between :
item <--> row index in the numpy array
color <--> col index in the numpy array
I do not want to use iteritems or itertuples because they are not efficient for large dataframes; see "How to iterate over rows in a DataFrame in Pandas", "Python Pandas iterate over rows and access column names", and "Does pandas iterrows have performance issues?"
I would prefer a vectorized numpy solution for this.
How can I efficiently convert the pandas dataframe to a numpy array? The array will also be converted to a torch.tensor.
thanks
Answer 1:
- Do a quick search for a val by its "item" and "color" with one of the following options:
  - Use pandas Boolean indexing
  - Convert the dataframe into a numpy.recarray using pandas.DataFrame.to_records, and also use Boolean indexing
- .item is a method for both pandas and numpy, so don't use 'item' as a column name. It has been changed to '_item'.
- As an FYI, numpy is a pandas dependency, and much of the vectorized functionality in pandas directly corresponds to numpy.
import pandas as pd
import numpy as np
# test data
df = pd.DataFrame({'_item': ['book', 'book' , 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})
# Use pandas Boolean indexing to select the matching row
selected = df[(df._item == 'book') & (df.color == 'blue')]
# print(selected)
_item color val
book blue -109.6
# Alternatively, create a recarray
v = df.to_records(index=False)
# display(v)
rec.array([('book', 'green', -22.7 ), ('book', 'blue', -109.6 ),
('car', 'red', -57.19), ('car', 'green', -11.2 ),
('bike', 'blue', -25.6 ), ('bike', 'red', -33.61)],
dtype=[('_item', 'O'), ('color', 'O'), ('val', '<f8')])
# search the recarray
selected = v[(v._item == 'book') & (v.color == 'blue')]
# print(selected)
[('book', 'blue', -109.6)]
Update in response to OP edit
- You must first reshape the dataframe using pandas.DataFrame.pivot, and then use the previously mentioned methods.
dfp = df.pivot(index='_item', columns='color', values='val')
# display(dfp)
color blue green red
_item
bike -25.6 NaN -33.61
book -109.6 -22.7 NaN
car NaN -11.2 -57.19
# create a numpy recarray
v = dfp.to_records(index=True)
# display(v)
rec.array([('bike', -25.6, nan, -33.61),
('book', -109.6, -22.7, nan),
('car', nan, -11.2, -57.19)],
dtype=[('_item', 'O'), ('blue', '<f8'), ('green', '<f8'), ('red', '<f8')])
# select data
selected = v.blue[(v._item == 'book')]
# print(selected)
array([-109.6])
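If a plain 2-D array is needed (for example, before handing it to torch), the pivoted dataframe can also be converted directly, with the labels kept in two dictionaries as the question describes. The following is a minimal sketch under that assumption, not part of the original answer; the names arr, row_of, and col_of are made up for illustration. The dictionary comprehensions only loop over the unique labels, not over the rows of the data. Note that pivot raises an error if the same item and color pair occurs more than once.

```python
import numpy as np
import pandas as pd

# same test data as above
df = pd.DataFrame({'_item': ['book', 'book', 'car', 'car', 'bike', 'bike'],
                   'color': ['green', 'blue', 'red', 'green', 'blue', 'red'],
                   'val': [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})

# reshape so that each row is an item and each column is a color
dfp = df.pivot(index='_item', columns='color', values='val')

# plain 2-D float array; missing combinations are NaN
arr = dfp.to_numpy()

# hypothetical helper dictionaries: label -> position in the array
row_of = {name: i for i, name in enumerate(dfp.index)}
col_of = {name: j for j, name in enumerate(dfp.columns)}

# look up the value for item 'car' and color 'red'
print(arr[row_of['car'], col_of['red']])  # -57.19
```

The reverse lookup (position to label) is available without extra dictionaries through dfp.index[i] and dfp.columns[j].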
Source: https://stackoverflow.com/questions/64839600/how-to-convert-a-pandas-dataframe-into-a-numpy-array-with-the-column-names