问题
I am trying to assign
unique
values in pandas
df
to specific individuals.
For the df
below, [Area]
and [Place]
will together make up unique
values that are various jobs. These values will be assigned to individuals with the overall aim to use the least amount of individuals possible.
The trick is these values are constantly starting and finishing and go for different lengths of time. The most unique
values assigned to an individual any one time is 3. [On]
displays how many current unique values for [Place] and [Area] are occurring.
So this provides a concrete guide on how many individuals I need. e.g. 3 unique
values one = 1 person, 6 unique values on = 2 persons
I can't do a groupby
statement where I assign
the first 3 unique values
to individual 1
and the next 3 unique
values to individual 2
etc.
What I envisage is, when unique
values are greater than 3 I want to group values in [Area]
first and then combine the leftovers. So look to assign
same values in [Area]
to an individual (up to 3). Then, if there are _leftover_
values (<3), they should be combined to make a group of 3, where possible.
The way I envisage this working is: see into the future by an hour
. For each new row
of values the script
should see how many values will be [On]
(this provides an indication of how many total individuals are required). Where unique
values are >3, they should be assigned
by grouping
the same value in [Area]
. If there are leftover values they should be combined anyhow to make up to a group of 3.
For the df
below, the number of unique
values occurring for [Place]
and [Area]
varies between 1-6. So we should never have more than 2 individuals assigned
. When unique
values are >3 it should be assigned by [Area]
first. The leftover values should be combined with other individuals that have less than 3 unique
values.
Apologies for the large df. It's the only way i can replicate the problem!
import pandas as pd
import numpy as np
from collections import Counter
d = ({
'Time' : ['8:03:00','8:17:00','8:20:00','8:33:00','8:47:00','8:48:00','9:03:00','9:15:00','9:18:00','9:33:00','9:45:00','9:48:00','10:03:00','10:15:00','10:15:00','10:15:00','10:18:00','10:32:00','10:33:00','10:39:00','10:43:00','10:48:00','10:50:00','11:03:00','11:03:00','11:07:00','11:25:00','11:27:00','11:42:00','11:48:00','11:51:00','11:57:00','12:00:00','12:08:00','12:15:00','12:17:00','12:25:00','12:30:00','12:35:00','12:39:00','12:47:00','12:52:00','12:55:00','13:00:00','13:03:00','13:07:00','13:12:00','13:15:00','13:22:00','13:27:00','13:27:00'],
'Area' : ['A','A','A','A','A','A','A','A','A','A','A','A','A','A','A','B','A','B','A','A','A','A','B','A','A','B','B','A','B','C','A','B','C','C','A','B','C','C','B','A','C','B','C','C','A','C','B','C','C','A','C'],
'Place' : ['House 1','House 2','House 3','House 1','House 3','House 2','House 1','House 3','House 2','House 1','House 3','House 2','House 1','House 3','House 4','House 1','House 2','House 1','House 1','House 4','House 3','House 2','House 1','House 1','House 4','House 1','House 1','House 4','House 1','House 1','House 4','House 1','House 2','House 1','House 4','House 1','House 1','House 2','House 1','House 4','House 1','House 1','House 3','House 2','House 4','House 1','House 2','House 4','House 1','House 4','House 2'],
'On' : ['1','2','3','3','3','3','3','3','3','3','3','3','3','3','4','5','5','5','5','5','5','4','3','3','3','2','2','2','2','3','3','3','4','4','4','4','4','4','4','4','4','4','4','4','4','4','5','6','6','6','6'],
'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 2','Person 3','Person 1','Person 3','Person 1','Person 2','Person 1','Person 1','Person 3','Person 1','Person 2','Person 3','Person 3','Person 2','Person 3','Person 4','Person 2','Person 3','Person 4','Person 4','Person 2','Person 3','Person 4','Person 4','Person 3','Person 2','Person 4','Person 3','Person 4','Person 4','Person 2','Person 4','Person 3','Person 5','Person 4','Person 2','Person 4'],
})
df = pd.DataFrame(data=d)
def getAssignedPeople(df, areasPerPerson):
areas = df['Area'].values
places = df['Place'].values
times = pd.to_datetime(df['Time']).values
maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
assignmentCount = Counter()
assignedPeople = []
assignedPlaces = {}
heldPeople = {}
heldAreas = {}
holdAvailable = True
person = 0
# search for repeated areas. Mark them if the next repeat occurs within an hour
ixrep = np.argmax(np.triu(areas.reshape(-1, 1)==areas, k=1), axis=1)
holds = np.zeros(areas.size, dtype=bool)
holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')
for area,place,hold in zip(areas, places, holds):
if (area, place) in assignedPlaces:
# this unique (area, place) has already been assigned to someone
assignedPeople.append(assignedPlaces[(area, place)])
continue
if assignmentCount[person] >= areasPerPerson:
# the current person is already assigned to enough areas, move on to the next
a = heldPeople.pop(person, None)
heldAreas.pop(a, None)
person += 1
if area in heldAreas:
# assign to the person held in this area
p = heldAreas.pop(area)
heldPeople.pop(p)
else:
# get the first non-held person. If we need to hold in this area,
# also make sure the person has at least 2 free assignment slots,
# though if it's the last person assign to them anyway
p = person
while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
p += 1
assignmentCount.update([p])
assignedPlaces[(area, place)] = p
assignedPeople.append(p)
if hold:
if p==maxPerson:
# mark that there are no more people available to perform holds
holdAvailable = False
# this area recurrs in an hour, mark that the person should be held here
heldPeople[p] = area
heldAreas[area] = p
return assignedPeople
def allocatePeople(df, areasPerPerson=3):
assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
df = df.copy()
df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
return df
print(allocatePeople(df))
Output:
Time Area Place On Person
0 8:03:00 A House 1 1 Person 1
1 8:17:00 A House 2 2 Person 1
2 8:20:00 A House 3 3 Person 1
3 8:33:00 A House 1 3 Person 1
4 8:47:00 A House 3 3 Person 1
5 8:48:00 A House 2 3 Person 1
6 9:03:00 A House 1 3 Person 1
7 9:15:00 A House 3 3 Person 1
8 9:18:00 A House 2 3 Person 1
9 9:33:00 A House 1 3 Person 1
10 9:45:00 A House 3 3 Person 1
11 9:48:00 A House 2 3 Person 1
12 10:03:00 A House 1 3 Person 1
13 10:15:00 A House 3 3 Person 1
14 10:15:00 A House 4 4 Person 2
15 10:15:00 B House 1 5 Person 2
16 10:18:00 A House 2 5 Person 1
17 10:32:00 B House 1 5 Person 2
18 10:33:00 A House 1 5 Person 1
19 10:39:00 A House 4 5 Person 2
20 10:43:00 A House 3 5 Person 1
21 10:48:00 A House 2 4 Person 1
22 10:50:00 B House 1 3 Person 2
23 11:03:00 A House 1 3 Person 1
24 11:03:00 A House 4 3 Person 2
25 11:07:00 B House 1 2 Person 2
26 11:25:00 B House 1 2 Person 2
27 11:27:00 A House 4 2 Person 2
28 11:42:00 B House 1 2 Person 2
29 11:48:00 C House 1 3 Person 2
30 11:51:00 A House 4 3 Person 2
31 11:57:00 B House 1 3 Person 2
32 12:00:00 C House 2 4 Person 3
33 12:08:00 C House 1 4 Person 2
34 12:15:00 A House 4 4 Person 2
35 12:17:00 B House 1 4 Person 2
36 12:25:00 C House 1 4 Person 2
37 12:30:00 C House 2 4 Person 3
38 12:35:00 B House 1 4 Person 2
39 12:39:00 A House 4 4 Person 2
40 12:47:00 C House 1 4 Person 2
41 12:52:00 B House 1 4 Person 2
42 12:55:00 C House 3 4 Person 3
43 13:00:00 C House 2 4 Person 3
44 13:03:00 A House 4 4 Person 2
45 13:07:00 C House 1 4 Person 2
46 13:12:00 B House 2 5 Person 3
47 13:15:00 C House 4 6 Person 4
48 13:22:00 C House 1 6 Person 2
49 13:27:00 A House 4 6 Person 2
50 13:27:00 C House 2 6 Person 3
Intended Output and Comments on why I think it should be assigned:
回答1:
There's a live version of this answer online that you can try for yourself.
The problem
The bug you're seeing is due to (yet another) interesting edge case of your problem. During the 6th
job, the code assigns person 2
to (A, House 4)
. It then then sees that area A
repeats within an hour, so it holds person 2
in that area. This makes person 2
unavailable for the next job, which is in area B
.
However, there's no reason to hold person 2
in area A
for the sake of a job that occurs in (A, House 1)
, since the unique combination of area and place (A, House 1)
has already been assigned to person 1
.
The solution
The problem can be fixed by considering only unique combinations of area and place when deciding when to hold a person in an area. Only a couple of lines of code have to change.
First, we construct a list of areas that correspond to the unique (area, place) pairs:
unqareas = df[['Area', 'Place']].drop_duplicates()['Area'].values
Then we just substitute unqareas
for areas
in the first line of the code that identifies holds:
ixrep = np.argmax(np.triu(unqareas.reshape(-1, 1)==unqareas, k=1), axis=1)
Complete listing/test
import pandas as pd
import numpy as np
from collections import Counter
d = ({
'Time' : ['8:03:00','8:07:00','8:10:00','8:23:00','8:27:00','8:30:00','8:37:00','8:40:00','8:48:00'],
'Place' : ['House 1','House 2','House 3','House 1','House 2','House 3','House 4','House 1','House 1'],
'Area' : ['A','A','A','A','A','A','A','B','A'],
'Person' : ['Person 1','Person 1','Person 1','Person 1','Person 1','Person 1','Person 2','Person 3','Person 1'],
'On' : ['1','2','3','3','3','3','4','5','5']
})
df = pd.DataFrame(data=d)
def getAssignedPeople(df, areasPerPerson):
areas = df['Area'].values
unqareas = df[['Area', 'Place']].drop_duplicates()['Area'].values
places = df['Place'].values
times = pd.to_datetime(df['Time']).values
maxPerson = np.ceil(areas.size / float(areasPerPerson)) - 1
assignmentCount = Counter()
assignedPeople = []
assignedPlaces = {}
heldPeople = {}
heldAreas = {}
holdAvailable = True
person = 0
# search for repeated areas. Mark them if the next repeat occurs within an hour
ixrep = np.argmax(np.triu(unqareas.reshape(-1, 1)==unqareas, k=1), axis=1)
holds = np.zeros(areas.size, dtype=bool)
holds[ixrep.nonzero()] = (times[ixrep[ixrep.nonzero()]] - times[ixrep.nonzero()]) < np.timedelta64(1, 'h')
for area,place,hold in zip(areas, places, holds):
if (area, place) in assignedPlaces:
# this unique (area, place) has already been assigned to someone
assignedPeople.append(assignedPlaces[(area, place)])
continue
if assignmentCount[person] >= areasPerPerson:
# the current person is already assigned to enough areas, move on to the next
a = heldPeople.pop(person, None)
heldAreas.pop(a, None)
person += 1
if area in heldAreas:
# assign to the person held in this area
p = heldAreas.pop(area)
heldPeople.pop(p)
else:
# get the first non-held person. If we need to hold in this area,
# also make sure the person has at least 2 free assignment slots,
# though if it's the last person assign to them anyway
p = person
while p in heldPeople or (hold and holdAvailable and (areasPerPerson - assignmentCount[p] < 2)) and not p==maxPerson:
p += 1
assignmentCount.update([p])
assignedPlaces[(area, place)] = p
assignedPeople.append(p)
if hold:
if p==maxPerson:
# mark that there are no more people available to perform holds
holdAvailable = False
# this area recurrs in an hour, mark that the person should be held here
heldPeople[p] = area
heldAreas[area] = p
return assignedPeople
def allocatePeople(df, areasPerPerson=3):
assignedPeople = getAssignedPeople(df, areasPerPerson=areasPerPerson)
df = df.copy()
df.loc[:,'Person'] = df['Person'].unique()[assignedPeople]
return df
print(allocatePeople(df))
Output:
Time Place Area Person On
0 8:03:00 House 1 A Person 1 1
1 8:07:00 House 2 A Person 1 2
2 8:10:00 House 3 A Person 1 3
3 8:23:00 House 1 A Person 1 3
4 8:27:00 House 2 A Person 1 3
5 8:30:00 House 3 A Person 1 3
6 8:37:00 House 4 A Person 2 4
7 8:40:00 House 1 B Person 2 5
8 8:48:00 House 1 A Person 1 5
来源:https://stackoverflow.com/questions/54101445/re-assign-unique-values-pandas-dataframe