问题
I am importing study data into a Pandas data frame using read_csv
.
My subject codes are 6 numbers coding, among others, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped of and the column is formatted as int64
.
Is there a way to import this column unchanged maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
回答1:
As indicated in this question/answer by Lev Landau, there could be a simple solution to use converters
option for a certain column in read_csv
function.
converters={'column_name': lambda x: str(x)}
You can refer to more options of read_csv
funtion in pandas.io.parsers.read_csv documentation.
Lets say I have csv file projects.csv
like below:
project_name,project_id
Some Project,000245
Another Project,000478
As for example below code is triming leading zeros:
import csv
from pandas import read_csv
dataframe = read_csv('projects.csv')
print dataframe
Result:
me@ubuntu:~$ python test_dataframe.py
project_name project_id
0 Some Project 245
1 Another Project 478
me@ubuntu:~$
Solution code example:
import csv
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': lambda x: str(x)})
print dataframe
Required result:
me@ubuntu:~$ python test_dataframe.py
project_name project_id
0 Some Project 000245
1 Another Project 000478
me@ubuntu:~$
回答2:
here is a shorter, robust and fully working solution:
simply define a mapping (dictionary) between variable names and desired data type:
dtype_dic= {'subject_id': str,
'subject_number' : 'float'}
use that mapping with pd.read_csv()
:
df = pd.read_csv(yourdata, dtype = dtype_dic)
et voila!
回答3:
If you have a lot of columns and you don't know which ones contain leading zeros that might be missed, or you might just need to automate your code. You can do the following:
df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the compete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
回答4:
I don't think you can specify a column type the way you want (if there haven't been changes reciently and if the 6 digit number is not a date that you can convert to datetime). You could try using np.genfromtxt()
and create the DataFrame
from there.
EDIT: Take a look at Wes Mckinney's blog, there might be something for you. It seems to be that there is a new parser from pandas 0.10
coming in November.
回答5:
You can use converters
to convert number to fixed width if you know the width.
For example, if the width is 5, then
data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{x:05}"})
This will do the trick. It works for pandas==0.23.0 and also read_excel
.
Python3.6 or higher required.
回答6:
You Can do This , Works On all Versions of Pandas
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
来源:https://stackoverflow.com/questions/13250046/how-to-keep-leading-zeros-in-a-column-when-reading-csv-with-pandas