Python/pandas n00b. I have code that processes event data stored in CSV files. Data from df["CONTACT PHONE NUMBER"]
is outputting the phone number as `5555551
I think phone numbers should be stored as a string.
When reading the csv you can ensure this column is read as a string:
pd.read_csv(filename, dtype={"CONTACT PHONE NUMBER": str})
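To see why the dtype matters, here is a small sketch that reads the same CSV text twice. The sample rows and the `NAME` column are made up for illustration; only the `CONTACT PHONE NUMBER` column name comes from the question.

```python
import io

import pandas as pd

# Hypothetical sample data: one leading zero and one missing number.
csv_text = (
    "NAME,CONTACT PHONE NUMBER\n"
    "Ann,5554443333\n"
    "Bob,0114445555\n"
    "Cy,\n"
)

# Default inference: the missing value forces a float column,
# and the leading zero is lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# Forcing str keeps every digit exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"CONTACT PHONE NUMBER": str})

print(inferred["CONTACT PHONE NUMBER"].tolist())
print(as_text["CONTACT PHONE NUMBER"].tolist())
```

Missing values are still read as NaN with `dtype=str`, so the string methods used later need to cope with them.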
You can then format the numbers with the vectorized string methods, naively concatenating slices:
In [11]: s = pd.Series(['5554443333', '1114445555', np.nan, '123']) # df["CONTACT PHONE NUMBER"]
# phone_nos = '(' + s.str[:3] + ')' + s.str[3:6] + '-' + s.str[6:10]
Edit: as Noah answers in a related question, you can do this more directly/efficiently using str.replace:
In [12]: phone_nos = s.str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'(\1)\2-\3', regex=True) # regex=True is required in pandas >= 2.0
In [13]: phone_nos
Out[13]:
0    (555)444-3333
1    (111)444-5555
2              NaN
3              123
dtype: object
There is a problem here, though: a malformed number (anything that is not exactly 10 digits) passes through unchanged, so you may want to replace those with NaN:
In [14]: s.str.contains(r'^\d{10}$') # note: NaN is truthy
Out[14]:
0     True
1     True
2      NaN
3    False
dtype: object
In [15]: phone_nos.where(s.str.contains(r'^\d{10}$', na=False)) # na=False gives a plain boolean mask
Out[15]:
0    (555)444-3333
1    (111)444-5555
2              NaN
3              NaN
dtype: object
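Putting the two steps together, a minimal sketch of a reusable helper might look like this. The function name `format_phone_numbers` is hypothetical, and it assumes the column has already been read as strings.

```python
import numpy as np
import pandas as pd


def format_phone_numbers(s):
    """Format 10-digit strings as (AAA)BBB-CCCC; everything else becomes NaN."""
    # na=False maps missing values to False, so the mask is plain boolean.
    is_valid = s.str.contains(r'^\d{10}$', na=False)
    formatted = s.str.replace(
        r'^(\d{3})(\d{3})(\d{4})$', r'(\1)\2-\3', regex=True
    )
    # Keep the formatted value only where the input was a valid number.
    return formatted.where(is_valid)


s = pd.Series(['5554443333', '1114445555', np.nan, '123'])
result = format_phone_numbers(s)
print(result)
```

Building the mask from the original series rather than the formatted one keeps the validity check independent of the output format.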
Now, you might like to inspect the bad formats you have (maybe you have to change your output to encompass them, e.g. if they included a country code):
In [16]: s[~s.str.contains(r'^\d{10}$').astype(bool)]
Out[16]:
3    123
dtype: object
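If the bad rows turn out to be mostly punctuation or country-code variants, one option is to normalize before validating. This is only a sketch: the sample values are invented, and treating an 11-digit number starting with 1 as a US country code is an assumption that may not hold for your data.

```python
import numpy as np
import pandas as pd

# Hypothetical messy inputs, not taken from the original data.
raw = pd.Series(
    ['(555) 444-3333', '+1 111 444 5555', '555.444.3333', np.nan, '123']
)

# Drop every non-digit character.
digits = raw.str.replace(r'\D', '', regex=True)

# Assumption: 11 digits with a leading 1 means a US country code; strip it.
digits = digits.str.replace(r'^1(\d{10})$', r'\1', regex=True)

# na=True treats missing values as "not bad", so only real strings are listed.
still_bad = digits[~digits.str.contains(r'^\d{10}$', na=True)]
print(still_bad)
```

Whatever remains in `still_bad` is worth inspecting by hand before deciding whether to drop it or extend the normalization.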