Question
I have a DataFrame in which each row contains a sentence followed by a list of part-of-speech tags, created with spaCy:
df.head()
question POS_tags
0 A title for my ... [DT, NN, IN,...]
1 If one of the ... [IN, CD, IN,...]
When I write the DataFrame to a CSV file (encoding='utf-8') and re-open it, the data format appears to have changed: the POS tags now show up between quotes ' ' like this:
df.head()
question POS_tags
0 A title for my ... ['DT', 'NN', 'IN',...]
1 If one of the ... ['IN', 'CD', 'IN',...]
When I now try to use the POS tags for some operations, it turns out they are no longer lists but have become strings that even include the quotation marks. They still look like lists but are not. This is clear when doing:
q = df['POS_tags']
q = list(q)
print(q)
Which results in:
["['DT', 'NN', 'IN']"]
What is going on here?
I want either the column 'POS_tags' to still contain lists after saving to CSV and re-opening, or an operation on the column 'POS_tags' that restores the lists spaCy originally created. Any advice on how to do this?
Answer 1:
To preserve the exact structure of the DataFrame, an easy solution is to serialize it in pickle format with df.to_pickle instead of using CSV. CSV stores every value as plain text, so Python objects such as lists are written out as their string representation and have to be reconstructed manually after re-import. One drawback of pickle is that it's not human-readable.
# Save to pickle
df.to_pickle('pickle-file.pkl')
# Save with compression
df.to_pickle('pickle-file.pkl.gz', compression='gzip')
# Load pickle from disk
df = pd.read_pickle('pickle-file.pkl') # or...
df = pd.read_pickle('pickle-file.pkl.gz', compression='gzip')
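To see the difference between the two formats side by side, here is a minimal sketch. The toy DataFrame, the file names, and the temporary directory are assumptions made up for illustration; they are not from the question.

```python
import os
import tempfile

import pandas as pd

# Toy data standing in for the spaCy output (values are illustrative only).
df = pd.DataFrame({
    "question": ["A title for my question", "If one of the answers"],
    "POS_tags": [["DT", "NN", "IN"], ["IN", "CD", "IN"]],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")  # hypothetical file name
    pkl_path = os.path.join(tmp, "demo.pkl")  # hypothetical file name

    # Write the same DataFrame in both formats.
    df.to_csv(csv_path, index=False, encoding="utf-8")
    df.to_pickle(pkl_path)

    # Read both back while the temporary files still exist.
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

# CSV hands back the text form of the list; pickle hands back the list itself.
csv_type = type(from_csv["POS_tags"].iloc[0])
pkl_type = type(from_pkl["POS_tags"].iloc[0])
print(csv_type.__name__)
print(pkl_type.__name__)
```

The first print should report a string type and the second a list, which matches the behaviour described in the question.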
Fixing lists after importing from CSV
If you've already imported from CSV, this should convert the POS_tags column from strings to Python lists:
from ast import literal_eval
df['POS_tags'] = df['POS_tags'].apply(literal_eval)
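As a possible variation, the conversion can also happen while the file is being read, by passing literal_eval through the converters argument of pd.read_csv. This is a sketch only: the inline CSV text is made-up sample data, and it assumes every cell in the column holds a valid Python literal (empty or missing cells would need separate handling).

```python
import io
from ast import literal_eval

import pandas as pd

# Made-up CSV content shaped like the file in the question.
csv_text = """question,POS_tags
A title,"['DT', 'NN', 'IN']"
If one,"['IN', 'CD', 'IN']"
"""

# converters applies literal_eval to each raw string in the POS_tags column.
df = pd.read_csv(io.StringIO(csv_text), converters={"POS_tags": literal_eval})

tags = df["POS_tags"].iloc[0]
print(type(tags).__name__, tags)
```

literal_eval only parses Python literals (strings, numbers, lists, dicts and so on), which makes it a safer choice than eval for text read from a file.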
Source: https://stackoverflow.com/questions/49580996/why-do-my-lists-become-strings-after-saving-to-csv-and-re-opening-python