I have the following data in a csv file:
from StringIO import StringIO
import pandas as pd
the_data = """
ABC,2016-6-9 0:00,95,{'//PurpleCar': [115L],
This should do the trick:
# sep='|' doesn't occur in the data, so each whole line lands in one column,
# which squeeze=True turns into a Series.
s = pd.read_csv(StringIO(the_data), sep='|', header=None, squeeze=True)

# The first three comma-separated fields are the fixed columns.
left = s.str.split(',').str[:3].apply(pd.Series)
left.columns = ['Company', 'Date', 'Volume']

# The rest is the car dict: drop brackets/braces/quotes, strip the Python 2
# long suffix ("115L" -> "115"), then split into one column per car.
right = (s.str.split(',').str[3:].str.join(',')
          .str.replace(r'[\[\]\{\}\']', '')
          .str.replace(r'(:\s+\d+)L', r'\1')
          .str.split(',', expand=True))
right.columns = ['Car{}'.format(i) for i in range(1, 5)]

pd.concat([left, right], axis=1)
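As a side note, on a recent pandas this needs two tweaks: squeeze=True was removed from read_csv in 2.0, so select the single column yourself, and Series.str.replace no longer treats the pattern as a regex by default, so pass regex=True. A rough equivalent of the affected lines (a sketch, not tested against the full file):

from io import StringIO  # Python 3 location of StringIO
import pandas as pd

s = pd.read_csv(StringIO(the_data), sep='|', header=None)[0]  # column 0 as a Series

right = (s.str.split(',').str[3:].str.join(',')
          .str.replace(r'[\[\]\{\}\']', '', regex=True)   # regex=True is now explicit
          .str.replace(r'(:\s+\d+)L', r'\1', regex=True)
          .str.split(',', expand=True))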
Edit: the file actually seems to be an escaped CSV, so we don't need custom parsing for this part (see the read_csv alternative below).
As @Blckknght points out in the comments, the file is not a valid CSV. I'll make some assumptions in my answer. They are:
1. the fields are not quoted or escaped, so no real CSV parsing is needed;
2. the first three fields never contain commas;
3. the last field is a Python dict literal.
First, some imports
import ast
import pandas as pd
We'll just split each row on its first three commas, since we don't need to deal with any sort of CSV escaping (assumptions #1 and #2).
rows = (line.split(",", 3) for line in the_data.splitlines() if line.strip() != "")
fixed_columns = pd.DataFrame.from_records(rows, columns=["Company", "Date", "Value", "Cars_str"])
Or, since the file turns out to actually be an escaped CSV (see the edit above), you can let read_csv do the splitting instead:
fixed_columns = pd.read_csv(..., names=["Company", "Date", "Value", "Cars_str"])
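For completeness, the full call might look roughly like this (the filename is made up, and I'm assuming the dict column is quoted as a single field so the default comma separator works):

fixed_columns = pd.read_csv(
    "the_data.csv",  # hypothetical filename
    header=None,
    names=["Company", "Date", "Value", "Cars_str"],
)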
The first three columns are fixed and we leave them as they are. The last column is a dict (assumption #3), so we can parse it with ast.literal_eval. In my opinion this is more readable than a regex and more flexible if the format changes; you'll also notice a format change earlier.
cars = fixed_columns["Cars_str"].apply(ast.literal_eval)
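One caveat: the sample data contains Python 2 long literals like 115L, which ast.literal_eval only accepts on Python 2. On Python 3 you'd have to strip the trailing L first, something along these lines (a sketch, assuming the suffix only ever appears after the numbers):

cars = (fixed_columns["Cars_str"]
        .str.replace(r"(\d+)L\b", r"\1", regex=True)  # "115L" -> "115"
        .apply(ast.literal_eval))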
del fixed_columns["Cars_str"]
This part is more of an answer to your other question.
We prepare helper functions to process the keys and values of the dict; they're written so that they fail loudly if our assumptions about the dict's contents don't hold.
def get_single_item(list_that_always_has_single_item):
    # Unpacking fails with a ValueError if the list doesn't hold exactly one item.
    v, = list_that_always_has_single_item
    return v

def extract_car_name(car_str):
    # Every key is expected to start with "//"; strip that prefix.
    assert car_str.startswith("//"), car_str
    return car_str[2:]
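A quick sanity check, with values taken from the sample data:

get_single_item([115])             # -> 115
get_single_item([115, 403])        # ValueError: the assumption is violated
extract_car_name("//PurpleCar")    # -> 'PurpleCar'
extract_car_name("PurpleCar")      # AssertionError: the assumption is violated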
We apply the functions and construct a pd.Series for each row, which allows us to...
dynamic_columns = cars.apply(
    lambda x: pd.Series({
        extract_car_name(k): get_single_item(v)
        for k, v in x.items()
    }))
...add the columns to the dataframe
result = pd.concat([fixed_columns, dynamic_columns], axis=1)
result
Finally, we get the table:
Company Date Value BlackCar BlueCar NPO-GreenCar PinkCar \
0 ABC 2016-6-9 0:00 95 NaN 16.0 NaN NaN
1 ABC 2016-6-10 0:00 0 NaN 90.0 NaN NaN
2 ABC 2016-6-11 0:00 0 NaN 31.0 NaN NaN
3 ABC 2016-6-12 0:00 0 NaN 8888.0 NaN NaN
4 ABC 2016-6-13 0:00 0 NaN 4.0 NaN NaN
5 DEF 2016-6-16 0:00 0 15.0 NaN 0.0 4.0
6 DEF 2016-6-17 0:00 0 15.0 NaN 0.0 4.0
7 DEF 2016-6-18 0:00 0 15.0 NaN 0.0 4.0
8 DEF 2016-6-19 0:00 0 15.0 NaN 0.0 4.0
9 DEF 2016-6-20 0:00 0 15.0 NaN 0.0 4.0
PurpleCar WhiteCar-XYZ YellowCar
0 115.0 0.0 403.0
1 219.0 0.0 381.0
2 817.0 0.0 21.0
3 80.0 0.0 2011.0
4 32.0 0.0 15.0
5 32.0 NaN NaN
6 32.0 NaN NaN
7 32.0 NaN NaN
8 32.0 NaN NaN
9 32.0 NaN NaN
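One more note: with the string-splitting approach the fixed columns are still text. If you need real numbers and timestamps, a small follow-up (a sketch):

result["Value"] = pd.to_numeric(result["Value"])
result["Date"] = pd.to_datetime(result["Date"])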
I think it's better to convert each car string into two columns (color and value):
from io import StringIO
import pandas as pd
df = pd.read_csv(StringIO(the_data), sep=',', header=None)
df.columns = ['Company','Date','Volume','Car1','Car2','Car3','Car4']
cars = ["Car1", "Car2", "Car3", "Car4"]
pattern = r"//(?P<color>.+?)':.*?(?P<value>\d+)"
df2 = pd.concat([df[col].str.extract(pattern)
                        .assign(value=lambda self: pd.to_numeric(self["value"]))
                 for col in cars],
                axis=1, keys=cars)
The result:
Car1 Car2 Car3 Car4
color value color value color value color value
0 PurpleCar 115 YellowCar 403 BlueCar 16 WhiteCar-XYZ 0
1 PurpleCar 219 YellowCar 381 BlueCar 90 WhiteCar-XYZ 0
2 PurpleCar 817 YellowCar 21 BlueCar 31 WhiteCar-XYZ 0
3 PurpleCar 80 YellowCar 2011 BlueCar 8888 WhiteCar-XYZ 0
4 PurpleCar 32 YellowCar 15 BlueCar 4 WhiteCar-XYZ 0
5 PurpleCar 32 BlackCar 15 PinkCar 4 NPO-GreenCar 0
6 PurpleCar 32 BlackCar 15 PinkCar 4 NPO-GreenCar 0
7 PurpleCar 32 BlackCar 15 PinkCar 4 NPO-GreenCar 0
8 PurpleCar 32 BlackCar 15 PinkCar 4 NPO-GreenCar 0
9 PurpleCar 32 BlackCar 15 PinkCar 4 NPO-GreenCar 0
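If you'd rather end up with one column per car color (closer to the previous answer's output), one possible follow-up, assuming each color shows up at most once per row:

wide = (pd.concat([df[col].str.extract(pattern) for col in cars])
          .assign(value=lambda d: pd.to_numeric(d["value"]))
          .set_index("color", append=True)["value"]
          .unstack("color"))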