I am trying to make a frequency table based on a dataframe with pandas and Python. In fact it's exactly the same as a previous question of mine which used R.
Let's say that I have a dataframe in pandas that looks like this (in fact the dataframe is much larger, but for illustrative purposes I limited the rows):
    node     |  precedingWord
    -------------------------
    A-bom       de
    A-bom       die
    A-bom       de
    A-bom       een
    A-bom       n
    A-bom       de
    acroniem    het
    acroniem    t
    acroniem    het
    acroniem    n
    acroniem    een
    act         de
    act         het
    act         die
    act         dat
    act         t
    act         n

I'd like to use these values to count the precedingWords per node, split into subcategories: one column titled neuter, another titled non-neuter, and a last one titled rest. neuter would count all rows for which precedingWord is one of t, het, dat. non-neuter would count de and die, and rest would count everything that doesn't belong in neuter or non-neuter. (It would be nice if this could be dynamic, in other words if rest were derived from the same variables that define neuter and non-neuter, either by negating them or by simply subtracting the neuter and non-neuter counts from the number of rows with that node.)
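To make the categorisation above concrete, here is a minimal sketch of labelling each row as neuter, nonNeuter, or rest. The names `categories`, `word_to_category`, and the `category` column are hypothetical, introduced only for illustration; the sample rows are copied from the table above. Because rest is whatever the lookup does not cover, it stays dynamic when the word lists change.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "node": ["A-bom"] * 6 + ["acroniem"] * 5 + ["act"] * 6,
    "precedingWord": [
        "de", "die", "de", "een", "n", "de",
        "het", "t", "het", "n", "een",
        "de", "het", "die", "dat", "t", "n",
    ],
})

# Hypothetical lookup: category name -> words that belong to it.
categories = {
    "neuter": ["t", "het", "dat"],
    "nonNeuter": ["de", "die"],
}

# Invert the lookup so each word points at its category.
word_to_category = {
    word: name for name, words in categories.items() for word in words
}

# Words missing from the lookup map to NaN, which is then labelled "rest".
df["category"] = df["precedingWord"].map(word_to_category).fillna("rest")
```

Adding a word to one of the lists in `categories` is then enough to move it out of rest.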
Example output (in a new dataframe, let's say freqDf) would look like this:
    node     |  neuter  |  nonNeuter  |  rest
    -----------------------------------------
    A-bom       0          4             2
    acroniem    3          0             2
    act         3          2             1

I found an answer to a similar question, but the use case isn't exactly the same. It seems to me that in that question all variables are independent. In my case, however, there are multiple rows with the same node, which should all be reduced to a single row of frequencies, as shown in the expected output above.
I thought of something like this (untested):
    def specificFreq(d):
        for uniqueWord in d['node']
            return pd.Series({'node': uniqueWord ,
                'neuter': sum(d['node' == uniqueWord] & d['precedingWord'] == 't|het|dat'),
                'nonNeuter':  sum(d['node' == uniqueWord] & d['precedingWord'] == 'de|die'),
                'rest': len(uniqueWord) - neuter - nonNeuter}) # Number of rows with the specific word, minus the neuter and nonNeuter values above

    df.groupby('node').apply(specificFreq)

But I highly doubt this is the correct way of doing something like this.
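For comparison, here is a minimal working sketch of the counting the attempt above is aiming for. It is one possible approach, not necessarily the best one. It assumes the sample rows from the question, and the helper names `neuter_words`, `non_neuter_words`, `is_neuter`, and `is_non_neuter` are hypothetical. It uses `Series.isin` for the membership tests (a comparison such as `== 't|het|dat'` only matches that literal string) and derives rest by subtracting the two counts from the group size.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "node": ["A-bom"] * 6 + ["acroniem"] * 5 + ["act"] * 6,
    "precedingWord": [
        "de", "die", "de", "een", "n", "de",
        "het", "t", "het", "n", "een",
        "de", "het", "die", "dat", "t", "n",
    ],
})

# Hypothetical word lists defining the two named categories.
neuter_words = ["t", "het", "dat"]
non_neuter_words = ["de", "die"]

# Boolean masks: True where the preceding word belongs to the category.
is_neuter = df["precedingWord"].isin(neuter_words)
is_non_neuter = df["precedingWord"].isin(non_neuter_words)

# Summing booleans per node counts the True values in each group.
freqDf = pd.DataFrame({
    "neuter": is_neuter.groupby(df["node"]).sum(),
    "nonNeuter": is_non_neuter.groupby(df["node"]).sum(),
})

# rest = number of rows for the node minus the two counts above.
freqDf["rest"] = df.groupby("node").size() - freqDf["neuter"] - freqDf["nonNeuter"]

# Turn the node index back into a regular column.
freqDf = freqDf.reset_index()
print(freqDf)
```

Since rest is computed by subtraction, changing either word list updates all three columns without further edits.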