Question
This has been driving me nuts all weekend. I am trying to merge data for different assets around a common timestamp. Each asset's data is a value in a dictionary. The data of interest is stored as lists in one column, so it needs to be split out first. Here is a sample of the unprocessed df:
data['Dogecoin'].head()
market_cap_by_available_supply price_btc price_usd volume_usd
0 [1387118554000, 3488670] [1387118554000, 6.58771e-07] [1387118554000, 0.000558776] [1387118554000, 0.0]
1 [1387243928000, 1619159] [1387243928000, 3.18752e-07] [1387243928000, 0.000218176] [1387243928000, 0.0]
2 [1387336027000, 2191987] [1387336027000, 4.10802e-07] [1387336027000, 0.000267749] [1387336027000, 0.0]
Then I apply this function to split market_cap_by_available_supply, saving its components into a new dataframe for that asset:
import pandas as pd
from pandas import DataFrame

data2 = {}
# sorting function
for coin in data:
    # separates timestamp and market cap from their respective list inside each element
    TS = data[coin].market_cap_by_available_supply.map(lambda r: r[0])
    cap = data[coin].market_cap_by_available_supply.map(lambda r: r[1])
    # creates a DataFrame and stores the timestamp and market cap data in the dictionary
    df = DataFrame(columns=['timestamp', 'cap'])
    df.timestamp = TS
    df.cap = cap
    df.columns = ['timestamp', str(coin) + '_cap']
    data2[coin] = df
    # converts timestamp into datetime 'yyyy-mm-dd'
    data2[coin]['timestamp'] = pd.to_datetime(data2[coin]['timestamp'], unit='ms').dt.date
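A self-contained way to sanity-check the split on toy data (the values below are copied from the Dogecoin sample above; the variable names are made up for illustration). Passing the list column through `tolist()` splits the pairs in one step instead of two `.map` passes:

```python
import pandas as pd

# Toy stand-in for one coin's raw frame: each cell is a [timestamp_ms, value] pair.
raw = pd.DataFrame({
    'market_cap_by_available_supply': [
        [1387118554000, 3488670],
        [1387243928000, 1619159],
    ]
})

# Expand the [timestamp, cap] pairs into two real columns in one step.
split = pd.DataFrame(
    raw['market_cap_by_available_supply'].tolist(),
    columns=['timestamp', 'cap'],
)
split['timestamp'] = pd.to_datetime(split['timestamp'], unit='ms').dt.date
print(split)
```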
It seemed to work perfectly, producing the correct data. Sample:
data2['Namecoin'].head()
timestamp Namecoin_cap
0 2013-04-28 5969081
1 2013-04-29 7006114
2 2013-04-30 7049003
3 2013-05-01 6366350
4 2013-05-02 5848626
However, when I attempted to merge all the dataframes I got a memory error. I've spent hours trying to find the root cause, and it seems the 'sorting function' above is increasing the size of the dataframe from 12 MB to 131 MB! It should do the opposite. Any ideas?
On a side note here is the data https://www.mediafire.com/?9pcwroe1x35nnwl
I open it with this pickle function:
import pickle

with open("CMC_no_tuple_data.pickle", "rb") as myFile:
    data = pickle.load(myFile)
EDIT: Sorry for the typo in the pickle file name. @Goyo: to compute the sizes, I simply saved data and data2 via pickle.dump and looked at their respective file sizes. @Padraic Cunningham: you used the sorting function I provided and it produced a smaller file? That is not my case, and I get a memory error when trying to merge the dataframes.
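As an aside, pickle file size is an indirect proxy; pandas can report in-memory size directly. A minimal sketch (toy frame, made-up values) using `DataFrame.memory_usage`:

```python
import pandas as pd

# Toy frame shaped like one entry of data2: a date column plus a cap column.
df = pd.DataFrame({
    'timestamp': pd.to_datetime([1387118554000, 1387243928000], unit='ms').date,
    'cap': [3488670, 1619159],
})

# deep=True counts the Python objects behind object-dtype columns
# (e.g. datetime.date objects), which plain memory_usage() would miss.
size_bytes = df.memory_usage(deep=True).sum()
print(size_bytes)
```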
Answer 1:
When you merge your dataframes, you are doing a join on values that are not unique. When you are joining all these dataframes together, you are getting many matches. As you add more and more currencies you are getting something similar to a Cartesian product rather than a join. In the snippet below, I added code to sort the values and then remove duplicates.
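A minimal illustration of the effect (toy frames with made-up values): when the join key is duplicated on both sides, every left match pairs with every right match, so row counts multiply instead of staying flat.

```python
import pandas as pd

# Two tiny frames sharing a duplicated key: each 'a' on the left
# matches each 'a' on the right, so the merge multiplies rows.
left = pd.DataFrame({'timestamp': ['a', 'a'], 'x': [1, 2]})
right = pd.DataFrame({'timestamp': ['a', 'a'], 'y': [3, 4]})

merged = pd.merge(left, right, on='timestamp', how='left')
print(len(merged))  # 2 * 2 = 4 rows, not 2

# Dropping duplicate keys on the right first keeps the join one-to-one.
dedup = right.drop_duplicates(subset='timestamp', keep='last')
merged2 = pd.merge(left, dedup, on='timestamp', how='left')
print(len(merged2))  # back to 2 rows
```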
from pandas import Series, DataFrame
import pandas as pd

coins = '''
Bitcoin
Ripple
Ethereum
Litecoin
Dogecoin
Dash
Peercoin
MaidSafeCoin
Stellar
Factom
Nxt
BitShares
'''
coins = coins.split('\n')

API = 'https://api.coinmarketcap.com/v1/datapoints/'
data = {}
for coin in coins:
    print(coin)
    try:
        data[coin] = pd.read_json(API + coin)
    except Exception:
        pass

data2 = {}
for coin in data:
    TS = data[coin].market_cap_by_available_supply.map(lambda r: r[0])
    TS = pd.to_datetime(TS, unit='ms').dt.date
    cap = data[coin].market_cap_by_available_supply.map(lambda r: r[1])
    df = DataFrame(columns=['timestamp', 'cap'])
    df.timestamp = TS
    df.cap = cap
    df.columns = ['timestamp', coin + '_cap']
    # sort_values returns a new frame, so the result must be assigned back
    df = df.sort_values(by=['timestamp', coin + '_cap'])
    df = df.drop_duplicates(subset='timestamp', keep='last')
    data2[coin] = df

df = data2['Bitcoin']
keys = list(data2.keys())  # list() so that .remove() works in Python 3
keys.remove('Bitcoin')
for coin in keys:
    df = pd.merge(left=df, right=data2[coin], on='timestamp', how='left')
    print(len(df), len(df.columns))
df.to_csv('caps.csv')
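The merge loop over the remaining currencies could equivalently be written with `functools.reduce`. A sketch on toy frames (the coin names and values below are placeholders standing in for the real data2 entries):

```python
from functools import reduce
import pandas as pd

# Toy stand-ins for two deduplicated per-coin frames keyed on timestamp.
frames = {
    'Bitcoin':  pd.DataFrame({'timestamp': [1, 2], 'Bitcoin_cap':  [10, 20]}),
    'Dogecoin': pd.DataFrame({'timestamp': [1, 2], 'Dogecoin_cap': [5, 6]}),
}

# Fold the left-merge across all frames; the first frame seeds the result.
merged = reduce(
    lambda l, r: pd.merge(l, r, on='timestamp', how='left'),
    frames.values(),
)
print(merged.shape)
```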
EDIT: I have added a table below showing how the size of the table grows as you do the join operation. It shows the number of rows after joining 5, 10, 15, 20, 25, and 30 currencies.
Rows,Columns
1015,5
1255,10
5095,15
132071,20
4195303,25
16778215,30
This table shows how removing duplicates makes your joins only match a single row.
Rows,Columns
1000,5
1000,10
1000,15
1000,20
1000,25
1000,30
Source: https://stackoverflow.com/questions/35255712/function-should-clean-data-to-half-the-size-instead-it-enlarges-it-by-an-order