I currently have a massive set of datasets. I have a set for each year in the 2000\'s. I take a combination of three years and run a code on that to clean. The problem is that d
One way to achieve this would be like as follows:
import pandas as pd
# generating random DF
num_rows = 100
locs = list('abcdefghijklmno')
df = pd.DataFrame(
{'id': np.random.randint(1, 100, num_rows),
'location': np.random.choice(locs, num_rows),
'year': np.random.randint(2005, 2007, num_rows)})
df.sort_values('id', inplace=True)
print('**** sorted DF (first 10 rows) ****')
print(df.head(10))
# chopping DF into chunks ...
chunk_size = 5
chunks = [i for i in df.id.unique()[::chunk_size]]
chunk_margins = [(chunks[i-1],chunks[i]) for i in range(1, len(chunks))]
df_chunks = [df.ix[(df.id >= x[0]) & (df.id < x[1])] for x in chunk_margins]
print('**** first chunk ****')
print(df_chunks[0])
Output:
**** sorted DF (first 10 rows) ****
id location year
31 2 c 2005
85 2 e 2006
89 2 l 2006
70 2 i 2005
60 4 n 2005
68 7 g 2005
22 7 e 2006
73 10 i 2005
23 10 j 2006
47 16 n 2005
**** first chunk ****
id location year
31 2 c 2005
85 2 e 2006
89 2 l 2006
70 2 i 2005
60 4 n 2005
68 7 g 2005
22 7 e 2006
73 10 i 2005
23 10 j 2006
47 16 n 2005
6 16 k 2006
82 16 g 2005