Question
I'm reading from a database that has many array-type columns. For these, pd.read_sql gives me a dataframe whose columns are dtype=object and contain lists.
I'd like an efficient way to find which rows have arrays containing some element:
s = pd.Series(
[[1,2,3], [1,2], [99], None, [88,2]]
)
print(s)
0 [1, 2, 3]
1 [1, 2]
2 [99]
3 None
4 [88, 2]
I'm building 1-hot-encoded feature tables for an ML application, and I'd like to end up with tables like:
   contains_1  contains_2  contains_3  contains_88
0           1  ...
1           1
2           0
3         nan
4           0
...
I can unroll a series of arrays like so:
s2 = s.apply(pd.Series).stack()
0  0     1.0
   1     2.0
   2     3.0
1  0     1.0
   1     2.0
2  0    99.0
4  0    88.0
   1     2.0
which lets me find the elements meeting some test:
>>> print(s2[s2 == 2].index.get_level_values(0))
Int64Index([0, 1, 4], dtype='int64')
Woot! This step:
s.apply(pd.Series).stack()
produces a great intermediate data structure (s2) that's fast to iterate over for each category. However, the apply step is jaw-droppingly slow (tens of seconds for a single column of 500k rows, each holding a list of tens of items), and I have many columns.
Update: It seems likely that having the data in a series of lists to begin with is quite slow. Performing the unroll on the SQL side seems tricky (I have many columns that I want to unroll). Is there a way to pull array data into a better structure?
Answer 1:
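import numpy as np
import pandas as pd
import cytoolz

s0 = s.dropna()                  # skip rows that hold None
v = s0.values.tolist()
i = s0.index.values
l = [len(x) for x in v]          # length of each list
c = list(cytoolz.concat(v))      # materialize: concat returns a lazy iterator, so len(c) would fail
n = np.append(0, np.array(l[:-1])).cumsum().repeat(l)  # start offset of the list each element came from
k = np.arange(len(c)) - n        # position of each element within its own list
s1 = pd.Series(c, [i.repeat(l), k])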
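To make the index arithmetic above easier to follow, here is a small walk-through on the example series from the question. It is only a sketch: itertools.chain.from_iterable is used as a standard-library stand-in for cytoolz.concat, and the names lens, starts, outer and flat are my own, not part of the original answer.

```python
import itertools

import numpy as np
import pandas as pd

s = pd.Series([[1, 2, 3], [1, 2], [99], None, [88, 2]])

s0 = s.dropna()
v = s0.tolist()
lens = np.array([len(x) for x in v])            # [3, 2, 1, 2]

# Offset at which each list starts in the flattened data: [0, 3, 5, 6]
starts = np.append(0, lens[:-1]).cumsum()

# Position of every element inside its own list: [0, 1, 2, 0, 1, 0, 0, 1]
k = np.arange(lens.sum()) - starts.repeat(lens)

# Original row label of every element: [0, 0, 0, 1, 1, 2, 4, 4]
outer = s0.index.values.repeat(lens)

# Flatten the lists (stdlib stand-in for cytoolz.concat)
flat = list(itertools.chain.from_iterable(v))

# A list of two arrays as index yields a two-level MultiIndex
s1 = pd.Series(flat, index=[outer, k])
```

The result has the same (row, position) layout as s.apply(pd.Series).stack(), so the membership test from the question, s1[s1 == 2].index.get_level_values(0), applies to it unchanged.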
UPDATE: What worked for me...
def unroll(s):
    s = s.dropna()
    v = s.values.tolist()
    c = pd.Series(x for x in cytoolz.concat(v))  # 16 seconds!
    i = s.index
    lens = np.array([len(x) for x in v])  # s.apply(len) is slower
    n = np.append(0, lens[:-1]).cumsum().repeat(lens)
    k = np.arange(sum(lens)) - n
    s = pd.Series(c)
    s.index = [i.repeat(lens), k]
    s = s.dropna()
    return s
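Once a series has been unrolled, the contains_N table from the question could be assembled roughly as follows. This is a sketch under assumptions, not part of the original answer: Series.explode (available from pandas 0.25) stands in for the unrolling step to keep the example short, pd.crosstab does the counting, and the names flat and onehot are hypothetical.

```python
import pandas as pd

s = pd.Series([[1, 2, 3], [1, 2], [99], None, [88, 2]])

# One row per (original row, element); explode is a stand-in for unroll()
flat = s.dropna().explode().astype(int)

# Count occurrences per (row, value), then cap the counts at 1 to get indicators
onehot = pd.crosstab(flat.index.to_numpy(), flat.to_numpy()).clip(upper=1)
onehot.columns = ["contains_%d" % value for value in onehot.columns]

# Bring back the rows that held None; they come out as NaN, as in the question
onehot = onehot.reindex(s.index)
```

Capping the counts matters when a list repeats an element, since crosstab would otherwise report 2 or more instead of a 0/1 indicator.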
It should be possible to replace:
s = pd.Series(c)
s.index = [i.repeat(lens), k]
with:
s = pd.Series(c, index=[i.repeat(lens), k])
But this doesn't work for me, even though it is reportedly OK.
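A possible explanation, offered as an assumption and not confirmed by the source: in unroll(), c is already a Series with its own 0..n-1 index. When the Series constructor is given a Series together with index=, it aligns the data to the new labels instead of assigning them by position. Passing a plain array avoids that alignment. The sketch below uses made-up values and hypothetical names (values, outer, inner) to compare the two forms.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 1, 2])      # already a Series, like c in unroll()
outer = np.array([0, 0, 0, 1, 1])        # original row labels
inner = np.array([0, 1, 2, 0, 1])        # positions within each list

# Two-step form from the answer: build, then overwrite the index by position
two_step = pd.Series(values)
two_step.index = [outer, inner]

# One-step form: a plain NumPy array carries no index to align against
one_step = pd.Series(values.to_numpy(), index=[outer, inner])
```

In this sketch both forms produce the same two-level series, so the one-step constructor may be usable once the data is no longer wrapped in a Series.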
Source: https://stackoverflow.com/questions/44813885/pandas-faster-series-of-lists-unrolling-for-one-hot-encoding