Summarising features with multiple values in Python for Machine Learning model

Question

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:

PregnancyID MotherID    gestationalAgeInWeeks  abdomCirc
0           0           14                     150
0           0           21                     200
1           1           20                     294
1           1           25                     315
1           1           30                     350
2           2           8                      170
2           2           9                      180
2           2           18                     NaN

As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).

I want to summarise the ultrasound measurements so that I can replace the multiple measurements with a fixed number of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy, each holding the maximum measurement recorded during that trimester:

abdomCirc1st: the maximum of all abdominal circumference measurements taken between weeks 0 and 13
abdomCirc2nd: the maximum of all abdominal circumference measurements taken between weeks 14 and 26
abdomCirc3rd: the maximum of all abdominal circumference measurements taken between weeks 27 and 40

So my final dataset would look like this:

PregnancyID     MotherID    abdomCirc1st  abdomCirc2nd   abdomCirc3rd
0               0           NaN           200            NaN
1               1           NaN           315            350
2               2           180           NaN            NaN

The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.

But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.

What I want to do is the following:

Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.

I think I have to do something along the lines of:

df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')

But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.

Answer 1:

You can try this. It is a somewhat complicated chain, but it seems to work:

(df.groupby(['MotherID', 'PregnancyID'])
    .apply(lambda d: d.assign(tm = (d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                      .groupby('tm')['abdomCirc']
                      .apply(max))
    .unstack()
)

produces


tm                           1      2      3
MotherID    PregnancyID
0           0              NaN  200.0    NaN
1           1              NaN  315.0  350.0
2           2            180.0    NaN    NaN

Let's unpick this a bit. First we group by MotherID and PregnancyID. Then we apply a function to each grouped dataframe (d).

For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we group by 'tm' and take the max. For each sub-dataframe d we thus obtain a Series of the form tm: max(abdomCirc).

Then unstack() moves tm into the column names.

You may want to rename these columns later, but I did not bother.
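
Since the answer invites a check of the trimester arithmetic, here is a minimal sketch that evaluates the formula at the boundary weeks. The helper name trimester and the clipping step are my own additions, not part of the answer. The formula agrees with the asker's boundaries for weeks 1 to 39, but it maps week 0 to 0 and week 40 to 4, so clipping to the range 1..3 is one possible way to handle those edge cases.

```python
import pandas as pd

def trimester(weeks):
    """Formula from the answer: integer ceiling of weeks / 13 (hypothetical helper name)."""
    return (weeks + 13 - 1) // 13

# Boundary weeks taken from the trimester definitions in the question
weeks = pd.Series([0, 1, 13, 14, 26, 27, 39, 40])
tm = trimester(weeks)

# Assumption: week 0 should count as trimester 1 and week 40 as trimester 3
tm_clipped = tm.clip(lower=1, upper=3)

print(pd.DataFrame({"weeks": weeks, "tm": tm, "tm_clipped": tm_clipped}))
```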

Solution 2

Come to think of it, you can simplify the above a bit:

(df.assign(tm = (df['gestationalAgeInWeeks'] + 13 - 1) // 13)
    .drop(columns = 'gestationalAgeInWeeks')
    .groupby(['MotherID', 'PregnancyID','tm'])
    .agg('max')
    .unstack()
)

Similar idea, same values (the columns now carry an extra abdomCirc level on top of tm).
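
As a variation on Solution 2, the sketch below rebuilds the sample data from the question and produces the table the asker wanted, with named columns. It swaps the integer-division formula for pd.cut with explicit bin edges (0-13, 14-26, 27-40), which is my substitution rather than the answerer's method. The variable names labels, trimester and wide are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "PregnancyID": [0, 0, 1, 1, 1, 2, 2, 2],
    "MotherID": [0, 0, 1, 1, 1, 2, 2, 2],
    "gestationalAgeInWeeks": [14, 21, 20, 25, 30, 8, 9, 18],
    "abdomCirc": [150, 200, 294, 315, 350, 170, 180, np.nan],
})

# Bins are [0, 13], (13, 26], (26, 40], matching the question's trimesters
labels = ["abdomCirc1st", "abdomCirc2nd", "abdomCirc3rd"]
trimester = pd.cut(df["gestationalAgeInWeeks"], bins=[0, 13, 26, 40],
                   labels=labels, include_lowest=True)

wide = (
    # Plain strings avoid categorical-specific groupby/reset_index behaviour
    df.assign(trimester=trimester.astype(str))
      .groupby(["PregnancyID", "MotherID", "trimester"])["abdomCirc"]
      .max()
      .unstack("trimester")
      .reindex(columns=labels)      # guarantee all three feature columns exist
      .rename_axis(columns=None)    # drop the leftover "trimester" axis label
      .reset_index()
)
print(wide)
```

Reindexing on labels keeps the output at a fixed width even if no pregnancy in the data has a measurement in some trimester, which matters when the table feeds a model that expects a fixed set of features.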

Answer 2:

pandas has a handy method called query. This should do the job for now:

abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()

abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()

abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()

If you want something more automatic, without manually changing the MotherID and PregnancyID values for each group of rows, you have to combine it with groupby (as you did yourself).
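
One possible way to combine query with groupby is sketched below, using the sample data from the question. The bounds dictionary, the loop and the names features, all_ids and result are my own illustration and are not part of the answer. Each trimester is filtered with query, reduced with a grouped max, and the three results are aligned on the full list of pregnancies.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "PregnancyID": [0, 0, 1, 1, 1, 2, 2, 2],
    "MotherID": [0, 0, 1, 1, 1, 2, 2, 2],
    "gestationalAgeInWeeks": [14, 21, 20, 25, 30, 8, 9, 18],
    "abdomCirc": [150, 200, 294, 315, 350, 170, 180, np.nan],
})

# Week ranges (inclusive) for each feature, as defined in the question
bounds = {"abdomCirc1st": (0, 13), "abdomCirc2nd": (14, 26), "abdomCirc3rd": (27, 40)}
keys = ["MotherID", "PregnancyID"]

features = {}
for name, (lo, hi) in bounds.items():
    # Same filter as the hand-written queries, minus the hard-coded IDs
    in_range = df.query(f"gestationalAgeInWeeks >= {lo} and gestationalAgeInWeeks <= {hi}")
    features[name] = in_range.groupby(keys)["abdomCirc"].max()

# Align on every pregnancy so trimesters without measurements become NaN
all_ids = pd.MultiIndex.from_frame(df[keys].drop_duplicates())
result = pd.DataFrame(features).reindex(all_ids).reset_index()
print(result)
```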

Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

Source: https://stackoverflow.com/questions/65106011/summarising-features-with-multiple-values-in-python-for-machine-learning-model

Tags

python

pandas