Question
I have some large files with several category columns. "Category" is a generous word, too, because these are basically descriptions/partial sentences.
Here are the unique values per category:
Category 1 = 15
Category 2 = 94
Category 3 = 294
Category 4 = 401
Location 1 = 30
Location 2 = 60
There is also recurring user data (first name, last name, IDs, etc.).
I was thinking of the following solutions to make the file size smaller:
1) Create a file which matches each category with a unique integer
2) Create a map (is there a way to do this from reading another file? Like I would create a .csv and load it as another dataframe and then match it? Or do I literally have to type it out initially?)
OR
3) Basically do a join (VLOOKUP) and then del the old column with the long object names
df1 = pd.merge(df1, categories, on='Category1', how='left')
del df1['Category1']
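To make option 3 concrete, here is a minimal runnable sketch with made-up labels and a hypothetical `Category1_code` column (note that `pd.merge` returns a new frame rather than modifying `df1` in place):

```python
import pandas as pd

# Made-up labels standing in for the long category descriptions.
df1 = pd.DataFrame({'Category1': ['long label A', 'long label B', 'long label A']})
# Lookup table matching each unique label to a small integer code.
categories = pd.DataFrame({'Category1': ['long label A', 'long label B'],
                           'Category1_code': [0, 1]})

# pd.merge returns a new frame, so the result must be assigned back.
df1 = pd.merge(df1, categories, on='Category1', how='left')
del df1['Category1']
print(df1['Category1_code'].tolist())  # [0, 1, 0]
```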
What do people normally do in this case? These files are pretty huge: 60 columns, and most of the data are long, repeating categories and timestamps, with literally no numerical data at all. It's fine for me, but sharing the files is almost impossible due to shared drive space allocations for more than a few months.
Answer 1:
To benefit from the Categorical dtype when saving to csv, you might want to follow this process:
- Extract your Category definitions into separate dataframes / files
- Convert your Categorical data to int codes
- Save converted DataFrame to csv along with definitions dataframes
When you need to use them again:
- Restore dataframes from csv files
- Map dataframe with int codes to category definitions
- Convert mapped columns to Categorical
To illustrate the process:
Make a sample dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame(index=np.arange(0, 100000))
df.index.name = 'index'
df['categories'] = 'Category'
df['locations'] = 'Location'
n1 = np.tile(np.arange(1, 5), df.shape[0] // 4)
n2 = np.tile(np.arange(1, 3), df.shape[0] // 2)
df['categories'] = df['categories'] + pd.Series(n1).astype(str)
df['locations'] = df['locations'] + pd.Series(n2).astype(str)
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null object
locations 100000 non-null object
dtypes: object(2)
memory usage: 2.3+ MB
None
Note the size: 2.3+ MB - this would be roughly the size of your csv file.
Now convert these data to Categorical:
df['categories'] = df['categories'].astype('category')
df['locations'] = df['locations'].astype('category')
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null category
locations 100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
Note the drop in memory usage down to 976.6 KB
But if you would save it to csv now:
df.to_csv('test1.csv')
...you would see this inside the file:
index,categories,locations
0,Category1,Location1
1,Category2,Location2
2,Category3,Location1
3,Category4,Location2
This means the Categorical data has been converted back to strings for saving in csv. So let's get rid of the labels in the Categorical data after we save the definitions:
categories_details = pd.DataFrame(df.categories.drop_duplicates(), columns=['categories'])
print(categories_details)
categories
index
0 Category1
1 Category2
2 Category3
3 Category4
locations_details = pd.DataFrame(df.locations.drop_duplicates(), columns=['locations'])
print(locations_details)
locations
index
0 Location1
1 Location2
Now convert the Categorical columns to their int codes:
for col in df.select_dtypes(include=['category']).columns:
    df[col] = df[col].cat.codes
print(df.head())
categories locations
index
0 0 0
1 1 1
2 2 0
3 3 1
4 0 0
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null int8
locations 100000 non-null int8
dtypes: int8(2)
memory usage: 976.6 KB
None
Save the converted data to csv and note that the file now contains only numbers, without labels. The file size will also reflect this change.
df.to_csv('test2.csv')
index,categories,locations
0,0,0
1,1,1
2,2,0
3,3,1
Save the definitions as well:
categories_details.to_csv('categories_details.csv')
locations_details.to_csv('locations_details.csv')
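Since the goal is smaller files for sharing, it may be worth noting (this is orthogonal to the integer-coding trick) that to_csv can also compress on write; a small sketch with a hypothetical file name:

```python
import pandas as pd

df = pd.DataFrame({'categories': [0, 1, 2, 3] * 25000})
df.index.name = 'index'
# to_csv infers gzip from the .gz extension; compression='gzip' makes it explicit.
df.to_csv('test2.csv.gz', compression='gzip')
# read_csv likewise detects the compression from the extension.
back = pd.read_csv('test2.csv.gz', index_col='index')
print(back.shape)  # (100000, 1)
```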
When you need to restore the files, load them from the csv files:
df2 = pd.read_csv('test2.csv', index_col='index')
print(df2.head())
categories locations
index
0 0 0
1 1 1
2 2 0
3 3 1
4 0 0
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null int64
locations 100000 non-null int64
dtypes: int64(2)
memory usage: 2.3 MB
None
categories_details2 = pd.read_csv('categories_details.csv', index_col='index')
print(categories_details2.head())
categories
index
0 Category1
1 Category2
2 Category3
3 Category4
print(categories_details2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
categories 4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes
None
locations_details2 = pd.read_csv('locations_details.csv', index_col='index')
print(locations_details2.head())
locations
index
0 Location1
1 Location2
print(locations_details2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
locations 2 non-null object
dtypes: object(1)
memory usage: 32.0+ bytes
None
Now use map to replace the int-coded data with the category descriptions and convert the columns back to Categorical:
df2['categories'] = df2.categories.map(categories_details2.to_dict()['categories']).astype('category')
df2['locations'] = df2.locations.map(locations_details2.to_dict()['locations']).astype('category')
print(df2.head())
categories locations
index
0 Category1 Location1
1 Category2 Location2
2 Category3 Location1
3 Category4 Location2
4 Category1 Location1
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null category
locations 100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
Note the memory usage is back to what it was when we first converted the data to Categorical.
It should not be hard to automate this process if you need to repeat it many times.
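As a side note, the restore step can also be done without map, via pd.Categorical.from_codes, assuming the saved codes are exactly 0..n-1 in definition order (which is what .cat.codes produces):

```python
import pandas as pd

codes = [0, 1, 2, 3, 0]
labels = ['Category1', 'Category2', 'Category3', 'Category4']
# from_codes builds the Categorical directly from int codes plus the label list.
restored = pd.Series(pd.Categorical.from_codes(codes, categories=labels))
print(restored.tolist())  # ['Category1', 'Category2', 'Category3', 'Category4', 'Category1']
```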
Answer 2:
Pandas has a Categorical data type that does just that. It basically maps the categories to integers behind the scenes.
Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.
Documentation is here.
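A quick way to see those two internal arrays (note that pandas sorts the categories when they are inferred from strings):

```python
import pandas as pd

s = pd.Series(['BC', 'Alberta', 'BC', 'Alberta'], dtype='category')
print(list(s.cat.categories))  # the unique labels: ['Alberta', 'BC']
print(s.cat.codes.tolist())    # int pointers into the labels: [1, 0, 1, 0]
```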
Answer 3:
Here's a way to save a dataframe with Categorical columns in a single .csv:
Example:
------      -------
Fatcol      Thincol: unique strings once, then numbers
------      -------
"Alberta"   "Alberta"
"BC"        "BC"
"BC"        2    -- string 2
"Alberta"   1    -- string 1
"BC"        2
...
The "Thincol" on the right can be saved as is in a .csv file,
and expanded to the "Fatcol" on the left after reading it in;
this can halve the size of big .csv files with repeated strings.
Functions
---------
fatcol( col: Thincol ) -> Fatcol, list[ unique str ]
thincol( col: Fatcol ) -> Thincol, dict( unique str -> int ), list[ unique str ]
Here "Fatcol" and "Thincol" are type names for iterators, e.g. lists:
Fatcol: list of strings
Thincol: list of strings, ints, or NaNs
If a `col` is a `pandas.Series`, its `.values` are used.
This cut a 700M .csv to 248M -- but write_csv runs at ~ 1 MB/sec on my iMac.
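The answer describes the functions without their implementation; here is a minimal sketch of what thincol / fatcol might look like, ignoring NaN handling and using 1-based codes as in the example above:

```python
def thincol(col):
    """First occurrence keeps the string; repeats become 1-based int codes."""
    index = {}      # unique string -> 1-based position
    uniques = []
    out = []
    for s in col:
        if s in index:
            out.append(index[s])
        else:
            uniques.append(s)
            index[s] = len(uniques)
            out.append(s)
    return out, index, uniques

def fatcol(col):
    """Expand a thin column back to the full strings."""
    uniques = []
    out = []
    for v in col:
        if isinstance(v, str):
            uniques.append(v)   # a new unique string introduces itself
            out.append(v)
        else:
            out.append(uniques[int(v) - 1])  # a code points at an earlier string
    return out, uniques

thin, _, _ = thincol(['Alberta', 'BC', 'BC', 'Alberta', 'BC'])
print(thin)        # ['Alberta', 'BC', 2, 1, 2]
fat, _ = fatcol(thin)
print(fat)         # ['Alberta', 'BC', 'BC', 'Alberta', 'BC']
```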
Source: https://stackoverflow.com/questions/30173092/replacing-category-data-pandas