Question
I have some large files with several category columns. "Category" is a generous word, too, because these are basically descriptions/partial sentences.
Here are the unique values per category:
Category 1 = 15
Category 2 = 94
Category 3 = 294
Category 4 = 401
Location 1 = 30
Location 2 = 60
There is also recurring user data (first name, last name, IDs, etc.).
I was thinking of the following solutions to make the file size smaller:
1) Create a file which matches each category with a unique integer
2) Create a map (is there a way to do this from reading another file? Like I would create a .csv and load it as another dataframe and then match it? Or do I literally have to type it out initially?)
OR
3) Basically do a join (VLOOKUP) and then del the old column with the long object names
df1 = pd.merge(df1, categories, on='Category1', how='left')
del df1['Category1']
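To make option 3 concrete, here is a minimal runnable sketch with made-up labels and a hypothetical `Category1_code` column (note that `pd.merge` returns a new frame rather than modifying `df1` in place):

```python
import pandas as pd

# Made-up labels standing in for the long category descriptions.
df1 = pd.DataFrame({'Category1': ['long label A', 'long label B', 'long label A']})
# Lookup table matching each unique label to a small integer code.
categories = pd.DataFrame({'Category1': ['long label A', 'long label B'],
                           'Category1_code': [0, 1]})

# pd.merge returns a new frame, so the result must be assigned back.
df1 = pd.merge(df1, categories, on='Category1', how='left')
del df1['Category1']
print(df1['Category1_code'].tolist())  # [0, 1, 0]
```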
What do people normally do in this case? These files are pretty huge: 60 columns, and most of the data are long, repeating categories and timestamps, with literally no numerical data at all. It's fine for me, but sharing the files is almost impossible due to shared drive space allocations for more than a few months.
Answer 1:
To benefit from the Categorical dtype when saving to csv, you might want to follow this process:
- Extract your Category definitions into separate dataframes / files
- Convert your Categorical data to int codes
- Save converted DataFrame to csv along with definitions dataframes
When you need to use them again:
- Restore dataframes from csv files
- Map dataframe with int codes to category definitions
- Convert mapped columns to Categorical
To illustrate the process:
Make a sample dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame(index=np.arange(0, 100000))
df.index.name = 'index'
df['categories'] = 'Category'
df['locations'] = 'Location'
n1 = np.tile(np.arange(1, 5), df.shape[0] // 4)
n2 = np.tile(np.arange(1, 3), df.shape[0] // 2)
df['categories'] = df['categories'] + pd.Series(n1).astype(str)
df['locations'] = df['locations'] + pd.Series(n2).astype(str)
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null object
locations 100000 non-null object
dtypes: object(2)
memory usage: 2.3+ MB
None
Note the size: 2.3+ MB - this would be roughly the size of your csv file.
Now convert these data to Categorical:
df['categories'] = df['categories'].astype('category')
df['locations'] = df['locations'].astype('category')
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null category
locations 100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
Note the drop in memory usage down to 976.6 KB
But if you would save it to csv now:
df.to_csv('test1.csv')
...you would see this inside the file:
index,categories,locations
0,Category1,Location1
1,Category2,Location2
2,Category3,Location1
3,Category4,Location2
This means the Categorical data has been converted back to strings for saving in csv. So let's get rid of the labels in the Categorical data after we save the definitions:
categories_details = pd.DataFrame(df.categories.drop_duplicates(), columns=['categories'])
print(categories_details)
categories
index
0 Category1
1 Category2
2 Category3
3 Category4
locations_details = pd.DataFrame(df.locations.drop_duplicates(), columns=['locations'])
print(locations_details)
locations
index
0 Location1
1 Location2
Now convert the Categorical columns to their int codes:
for col in df.select_dtypes(include=['category']).columns:
    df[col] = df[col].cat.codes
print(df.head())
categories locations
index
0 0 0
1 1 1
2 2 0
3 3 1
4 0 0
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null int8
locations 100000 non-null int8
dtypes: int8(2)
memory usage: 976.6 KB
None
Save the converted data to csv and note that the file now contains only numbers, without labels. The file size will also reflect this change.
df.to_csv('test2.csv')
index,categories,locations
0,0,0
1,1,1
2,2,0
3,3,1
Save the definitions as well:
categories_details.to_csv('categories_details.csv')
locations_details.to_csv('locations_details.csv')
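Since the goal is smaller files for sharing, it may be worth noting (this is orthogonal to the integer-coding trick) that to_csv can also compress on write; a small sketch with a hypothetical file name:

```python
import pandas as pd

df = pd.DataFrame({'categories': [0, 1, 2, 3] * 25000})
df.index.name = 'index'
# to_csv infers gzip from the .gz extension; compression='gzip' makes it explicit.
df.to_csv('test2.csv.gz', compression='gzip')
# read_csv likewise detects the compression from the extension.
back = pd.read_csv('test2.csv.gz', index_col='index')
print(back.shape)  # (100000, 1)
```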
When you need to restore the files, load them from the csv files:
df2 = pd.read_csv('test2.csv', index_col='index')
print(df2.head())
categories locations
index
0 0 0
1 1 1
2 2 0
3 3 1
4 0 0
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null int64
locations 100000 non-null int64
dtypes: int64(2)
memory usage: 2.3 MB
None
categories_details2 = pd.read_csv('categories_details.csv', index_col='index')
print(categories_details2.head())
categories
index
0 Category1
1 Category2
2 Category3
3 Category4
print(categories_details2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 0 to 3
Data columns (total 1 columns):
categories 4 non-null object
dtypes: object(1)
memory usage: 64.0+ bytes
None
locations_details2 = pd.read_csv('locations_details.csv', index_col='index')
print(locations_details2.head())
locations
index
0 Location1
1 Location2
print(locations_details2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 1 columns):
locations 2 non-null object
dtypes: object(1)
memory usage: 32.0+ bytes
None
Now use map to replace the int-coded data with the category descriptions and convert the columns back to Categorical:
df2['categories'] = df2.categories.map(categories_details2.to_dict()['categories']).astype('category')
df2['locations'] = df2.locations.map(locations_details2.to_dict()['locations']).astype('category')
print(df2.head())
categories locations
index
0 Category1 Location1
1 Category2 Location2
2 Category3 Location1
3 Category4 Location2
4 Category1 Location1
print(df2.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 0 to 99999
Data columns (total 2 columns):
categories 100000 non-null category
locations 100000 non-null category
dtypes: category(2)
memory usage: 976.6 KB
None
Note the memory usage is back to what it was when we first converted the data to Categorical.
It should not be hard to automate this process if you need to repeat it many times.
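As a side note, the restore step can also be done without map, via pd.Categorical.from_codes, assuming the saved codes are exactly 0..n-1 in definition order (which is what .cat.codes produces):

```python
import pandas as pd

codes = [0, 1, 2, 3, 0]
labels = ['Category1', 'Category2', 'Category3', 'Category4']
# from_codes builds the Categorical directly from int codes plus the label list.
restored = pd.Series(pd.Categorical.from_codes(codes, categories=labels))
print(restored.tolist())  # ['Category1', 'Category2', 'Category3', 'Category4', 'Category1']
```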
Answer 2:
Pandas has a Categorical data type that does just that. It basically maps the categories to integers behind the scenes.
Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.
Documentation is here.
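A quick way to see those two internal arrays (note that pandas sorts the categories when they are inferred from strings):

```python
import pandas as pd

s = pd.Series(['BC', 'Alberta', 'BC', 'Alberta'], dtype='category')
print(list(s.cat.categories))  # the unique labels: ['Alberta', 'BC']
print(s.cat.codes.tolist())    # int pointers into the labels: [1, 0, 1, 0]
```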
Answer 3:
Here's a way to save a dataframe with Categorical columns in a single .csv:
Example:
------      -------
Fatcol      Thincol: unique strings once, then numbers
------      -------
"Alberta"   "Alberta"
"BC"        "BC"
"BC"        2    -- string 2
"Alberta"   1    -- string 1
"BC"        2
...
The "Thincol" on the right can be saved as is in a .csv file,
and expanded to the "Fatcol" on the left after reading it in;
this can halve the size of big .csv files with repeated strings.
Functions
---------
fatcol( col: Thincol ) -> Fatcol, list[ unique str ]
thincol( col: Fatcol ) -> Thincol, dict( unique str -> int ), list[ unique str ]
Here "Fatcol" and "Thincol" are type names for iterators, e.g. lists:
Fatcol: list of strings
Thincol: list of strings, ints, or NaNs
If a `col` is a `pandas.Series`, its `.values` are used.
This cut a 700M .csv to 248M -- but write_csv runs at ~ 1 MB/sec on my iMac.
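The answer describes the functions without their implementation; here is a minimal sketch of what thincol / fatcol might look like, ignoring NaN handling and using 1-based codes as in the example above:

```python
def thincol(col):
    """First occurrence keeps the string; repeats become 1-based int codes."""
    index = {}      # unique string -> 1-based position
    uniques = []
    out = []
    for s in col:
        if s in index:
            out.append(index[s])
        else:
            uniques.append(s)
            index[s] = len(uniques)
            out.append(s)
    return out, index, uniques

def fatcol(col):
    """Expand a thin column back to the full strings."""
    uniques = []
    out = []
    for v in col:
        if isinstance(v, str):
            uniques.append(v)   # a new unique string introduces itself
            out.append(v)
        else:
            out.append(uniques[int(v) - 1])  # a code points at an earlier string
    return out, uniques

thin, _, _ = thincol(['Alberta', 'BC', 'BC', 'Alberta', 'BC'])
print(thin)        # ['Alberta', 'BC', 2, 1, 2]
fat, _ = fatcol(thin)
print(fat)         # ['Alberta', 'BC', 'BC', 'Alberta', 'BC']
```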
Source: https://stackoverflow.com/questions/30173092/replacing-category-data-pandas