Question
I have a dataframe that looks like this after I group it:
gb = new_user_log.groupby(['msno', 'date', 'num_25', 'num_50',
                           'num_75', 'num_985', 'num_100',
                           'num_unq', 'total_secs', 'days']).days.count()
Part of the dataframe is shown here:
msno date num_25 num_50 num_75 num_985 num_100 num_unq total_secs days
++orpnUqSevh2M5A97pRRiONA58g5m9DwaNrhD44HY0= 2016-08-14 78 13 3 3 77 84 18987.862 2 1
+0krxpTkQT7hciN95OEp7i2lyKvbXft887VNQGF6xN4= 2016-12-22 27 23 5 2 11 65 5946.577 35 1
98 1
+1eAmfPiXsMG0J+U/j7YTeXO+u0/ZhZqmoXfXKOY57I= 2016-12-17 8 2 2 2 126 131 32629.128 46 1
+1zyLKrstPYDOnoyooqqb0+VmmovONd9N3QZynT7rq0= 2016-07-14 18 2 1 3 46 66 10253.571 196 1
2016-09-17 11 3 1 6 81 45 12970.416 261 1
2016-11-17 5 1 3 2 8 19 2614.571 322 1
+28j7X4BShC9g2C4No4xAmzXuhLlmMbFDYWZcbblCEs= 2015-11-21 1 2 1 2 88 94 23580.548 45 1
2016-03-28 12 6 5 4 36 61 11596.887 173 1
+2GtgQNpMuIG0Vy1ycAbfla0zRni6NKtRLlyyuHHB8I= 2015-07-18 15 1 2 2 73 75 19179.451 102 1
2016-07-29 7 2 1 1 8 17 2515.486 479 1
2016-11-09 45 6 3 1 71 93 19865.317 582 1
So the first small problem I hope to solve is to give each 'msno' (user ID) a number instead of keeping it as a string type...
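A minimal sketch of that step, assuming the raw dataframe is named new_user_log as in the groupby above (the column name user_id is mine):
import pandas as pd

# pd.factorize assigns a stable integer code to each distinct msno string
new_user_log['user_id'], unique_msno = pd.factorize(new_user_log['msno'])

# equivalent alternative via a categorical dtype:
# new_user_log['user_id'] = new_user_log['msno'].astype('category').cat.codes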
All of these features (num_25, num_50, num_75, num_985, num_100, num_unq, total_secs) count events that happened on each day.
The last feature, days, is calculated as days = date - transaction_date and means the number of days a user has owned the product. The first transaction_date for each user is the first time they bought the product. It comes from the dataframe below (a sketch of this computation follows the table):
     msno                                          payment_method_id  transaction_date  membership_expire_date
0    QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=                 39        2016-10-31              2016-12-21
20   GBy8qSz16X5iYWD+3CMxv/Hm6OPSrXBYtmbnlRtknW0=                 33        2016-03-31              2016-04-30
44   T0FF6lumjKcqEO0O+tUH2ytc+Kb9EkeaLzcVUiTr1aE=                 40        2015-03-28              2015-04-27
72   9iW/UpqRoviya9CQh64RoYLPjiws8+biwscN+g5qYm4=                 37        2015-05-04              2015-06-03
86   LUCi7i5FeNuZz4DB0zKu8J80rgr2uCO8YHCSW9PZfAk=                 38        2016-03-13              2016-08-24
98   qCagaTL3UbMn0zvdMDDUoonbei70q1eASKrXa1cZGCs=                 38        2015-03-20              2015-04-19
120  nsx5IGkCueevv0vFHB4jkG0HdRl6m6ltB8U9Guo5nS0=                 40        2016-04-07              2016-05-06
123  nsx5IGkCueevv0vFHB4jkG0HdRl6m6ltB8U9Guo5nS0=                 40        2017-01-01              2017-02-06
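As a sketch of how days could be derived under these definitions (the name transactions for the second dataframe is an assumption, as is first_buy):
import pandas as pd

# work with real datetimes so subtraction yields day counts
transactions['transaction_date'] = pd.to_datetime(transactions['transaction_date'])
new_user_log['date'] = pd.to_datetime(new_user_log['date'])

# earliest transaction_date per user = the first time they bought the product
first_buy = (transactions.groupby('msno', as_index=False)['transaction_date'].min()
             .rename(columns={'transaction_date': 'first_transaction_date'}))

log = new_user_log.merge(first_buy, on='msno', how='left')
# days = date - first transaction_date, per the definition above
log['days'] = (log['date'] - log['first_transaction_date']).dt.days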
You can see that days takes only a few values for each user. So is it possible to fill all the other days with 0 to expand the dataframe into a very sparse one? The rule I have to follow is: find the maximum value in the days column (582 in this sample), then for every user fill in all 582 days, putting 0 in the event columns on days when no events happened.
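A sketch of that zero-filling, assuming the grouped Series gb is flattened first; df, event_cols and expanded are my names, and each user is assumed to have at most one row per days value:
import pandas as pd

df = gb.reset_index(name='count')   # flatten the grouped Series to a dataframe

event_cols = ['num_25', 'num_50', 'num_75', 'num_985',
              'num_100', 'num_unq', 'total_secs']
max_days = df['days'].max()         # 582 in this sample

# one row per (user, day) for every day from 0 to max_days
full_index = pd.MultiIndex.from_product(
    [df['msno'].unique(), range(max_days + 1)],
    names=['msno', 'days'])

expanded = (df.set_index(['msno', 'days'])[event_cols]
            .reindex(full_index, fill_value=0))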
From the title I said fill NaN because some users have multiple transaction_date and membership_expire_date records (see the last two rows). For example, I want to compute the difference between the second transaction_date and the first membership_expire_date of a user (similar to how days is computed) and fill NaN over that range of days, and likewise between where the third membership starts and the second ends (some users can have many such gaps, even about 30). So two types of values should be filled: 0 and NaN.
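A sketch of the NaN part, continuing from the expanded frame above: each gap runs from one membership_expire_date to the next transaction_date, converted into the same days units (offsets from the user's first transaction_date). All intermediate names are mine:
import numpy as np
import pandas as pd

expanded = expanded.astype('float64')   # integer counts cannot hold NaN

# transaction_date was already converted to datetime in the sketch above
t = transactions.sort_values(['msno', 'transaction_date']).copy()
t['membership_expire_date'] = pd.to_datetime(t['membership_expire_date'])
t['first_buy'] = t.groupby('msno')['transaction_date'].transform('min')

# previous membership's expire date, per user; NaT on each user's first row,
# and the NaT comparison below is False, so first rows drop out automatically
t['prev_expire'] = t.groupby('msno')['membership_expire_date'].shift()
gaps = t[t['transaction_date'] > t['prev_expire']].copy()
gaps['gap_start'] = (gaps['prev_expire'] - gaps['first_buy']).dt.days + 1
gaps['gap_end'] = (gaps['transaction_date'] - gaps['first_buy']).dt.days - 1

# overwrite the zero-filled rows that fall inside a gap with NaN
for row in gaps.itertuples():
    in_gap = [(row.msno, d) for d in range(row.gap_start, row.gap_end + 1)]
    expanded.loc[expanded.index.isin(in_gap), :] = np.nan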
I don't know if I have stated these questions clearly, and indeed that was a lot... If anyone could help me out or even give me some hints, you would be my lifesaver!
Source: https://stackoverflow.com/questions/46988238/how-to-fill-0-and-nan-for-a-dataframe-after-groupby-it-in-python