Handling a very big dataframe

Submitted by 半世苍凉 on 2019-12-13 20:05:30

Question


Right now I'm having trouble working out how to process my data and transform it into a DataFrame. Basically, what I'm trying to do is read the data first

import pandas as pd

# header=None alone would label the columns 0, 1, ...; the names are needed
# for the groupby calls below
data = pd.read_csv(querylog, sep=" ", header=None, names=["IP", "Query"])

then group it

query_group = data.groupby('Query')
ip_group = data.groupby('IP')

and lastly create a blank dataframe to map their values

df = pd.DataFrame(columns=query_group.groups, index=range(len(ip_group.groups)))

# fill an IP column with one row per distinct IP, then make it the index
index = 0
for name, group in ip_group:
    df.at[index, 'IP'] = name  # .at replaces the removed DataFrame.set_value
    index += 1
df = df.set_index('IP')

# mark every (IP, Query) pair that occurs in the log with a 1
for index, row in data.iterrows():
    df.at[row['IP'], row['Query']] = 1
    print(index)
df = df.fillna(0)

So my problem is that ip_group can grow to about 6,000 groups and query_group to about 400,000, which would make the blank DataFrame roughly 6,000 × 400,000 cells, more than my memory can handle. Can anyone help me solve this issue? Any help is appreciated.

A sample DataFrame of the data would look like this

data = pd.DataFrame({"Query": ["google.com", "youtube.com", "facebook.com"],
                     "IP": ["192.168.0.104", "192.168.0.103", "192.168.0.104"]})

and my expected output would look like this

               google.com  youtube.com  facebook.com
IP
192.168.0.104           1            0             1
192.168.0.103           0            1             0

Answer 1:


IIUC you can use get_dummies, though without the real data it is hard to tell whether this is the best solution:

df = pd.get_dummies(data.set_index('IP')['Query'])
print(df.groupby(df.index).sum())
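
For instance, on the question's sample data those two lines should already produce the expected table, with columns in alphabetical order:

               facebook.com  google.com  youtube.com
IP
192.168.0.103             0           0            1
192.168.0.104             1           1            0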

Sample:

import pandas as pd

data = pd.DataFrame({"Query": ["a", "b", "c", "d", "a", "b"],
                     "IP": [1, 5, 4, 8, 3, 4]})
print(data)
   IP Query
0   1     a
1   5     b
2   4     c
3   8     d
4   3     a
5   4     b
# set the IP column as the index
data = data.set_index('IP')

# one-hot encode the Query column
df = pd.get_dummies(data['Query'])
print(df)
    a  b  c  d
IP            
1   1  0  0  0
5   0  1  0  0
4   0  0  1  0
8   0  0  0  1
3   1  0  0  0
4   0  1  0  0

# group by the index and sum the columns
print(df.groupby(df.index).sum())
    a  b  c  d
IP            
1   1  0  0  0
3   1  0  0  0
4   0  1  1  0
5   0  1  0  0
8   0  0  0  1
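
Note that sum() returns counts, so an IP that issued the same query twice would get a 2 rather than a 1; clip the result if you only want 0/1 flags. As an alternative sketch, pd.crosstab builds the same contingency table in one step (this assumes data is still indexed by IP, as above):

# counts per (IP, Query) pair; clip(upper=1) collapses repeated pairs to 1
print(pd.crosstab(data.index, data['Query']).clip(upper=1))

On the sample above this matches the groupby result exactly, since no (IP, Query) pair repeats.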

Try converting the dummies to int8 with astype; here that takes roughly a third of the memory:

pd.get_dummies(data['Query']).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null float64
google.com      3 non-null float64
youtube.com     3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes

import numpy as np

pd.get_dummies(data['Query']).astype(np.int8).info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null int8
google.com      3 non-null int8
youtube.com     3 non-null int8
dtypes: int8(3)
memory usage: 33.0+ bytes
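
For scale: the question's 6,000 IPs × 400,000 queries come to 2.4 billion cells, roughly 19 GB as float64 but roughly 2.4 GB as int8 (1 byte per cell instead of 8), so the dtype alone can decide whether the dense table fits in memory.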

pd.get_dummies(data['Query'], sparse=True).info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Index: 3 entries, 192.168.0.104 to 192.168.0.104
Data columns (total 3 columns):
facebook.com    3 non-null float64
google.com      3 non-null float64
youtube.com     3 non-null float64
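
If even the int8 dense table is too big, a genuinely sparse matrix may be needed for the full problem. A minimal sketch using scipy.sparse and categorical codes (assuming scipy is available and that data still has plain IP and Query columns):

import numpy as np
import pandas as pd
from scipy import sparse

# if IP was already moved into the index, bring it back as a column
data = data.reset_index()

ip = data['IP'].astype('category')
query = data['Query'].astype('category')

# one entry per log row; COO sums duplicate (row, col) entries into counts
mat = sparse.coo_matrix(
    (np.ones(len(data), dtype=np.int64),
     (np.asarray(ip.cat.codes), np.asarray(query.cat.codes))),
    shape=(ip.cat.categories.size, query.cat.categories.size))

# threshold the counts to 0/1 presence flags
mat = (mat.tocsr() > 0).astype(np.int8)

The row labels are ip.cat.categories and the column labels are query.cat.categories, so any slice of mat can be wrapped back into a small dense DataFrame when needed.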


Source: https://stackoverflow.com/questions/35829481/handling-a-very-big-dataframe
