group-by

ElasticSearch count multiple fields grouped by

 ̄綄美尐妖づ submitted on 2021-01-01 06:31:00
Question: I have documents like

```json
{"domain":"US", "zipcode":"11111", "eventType":"click", "id":"1", "time":100}
{"domain":"US", "zipcode":"22222", "eventType":"sell", "id":"2", "time":200}
{"domain":"US", "zipcode":"22222", "eventType":"click", "id":"3", "time":150}
{"domain":"US", "zipcode":"11111", "eventType":"sell", "id":"4", "time":350}
{"domain":"US", "zipcode":"33333", "eventType":"sell", "id":"5", "time":225}
{"domain":"EU", "zipcode":"44444", "eventType":"click", "id":"5", "time":120}
```

I want to …
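
The excerpt cuts off before stating the desired output, but a common way to count one field per combination of other fields in Elasticsearch is a nested terms aggregation. The sketch below is only a guess at the goal (counts of each eventType per zipcode); the index name `events`, the `localhost` address, and the `.keyword` suffixes are assumptions that depend on the actual mapping.

```python
# Hypothetical nested terms aggregation, shown with the official Python client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed cluster address

body = {
    "size": 0,  # only aggregation buckets are wanted, not the hits themselves
    "aggs": {
        "by_zipcode": {
            "terms": {"field": "zipcode.keyword"},
            "aggs": {
                "by_event_type": {
                    "terms": {"field": "eventType.keyword"}
                }
            }
        }
    }
}

resp = es.search(index="events", body=body)  # index name assumed
for zip_bucket in resp["aggregations"]["by_zipcode"]["buckets"]:
    for event_bucket in zip_bucket["by_event_type"]["buckets"]:
        print(zip_bucket["key"], event_bucket["key"], event_bucket["doc_count"])
```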

How to process an extremely large dataset in chunks in Python (Pandas), while considering the full dataset for application of the function?

跟風遠走 submitted on 2021-01-01 06:27:44
Question: I have read numerous threads on similar topics on the forum; however, I believe what I am asking here is not a duplicate. I am reading a very large CSV dataset (22 GB, 350 million rows). I am trying to read the dataset in chunks, based on the solution provided by that link. My current code is as follows:

```python
import pandas as pd

def Group_ID_Company(chunk_of_dataset):
    return chunk_of_dataset.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum()
```
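
Because sums can be combined across chunks, one workable pattern is to aggregate each chunk separately and then aggregate the partial results again; this reproduces the totals a single groupby over the whole file would give. A minimal sketch, with the file name and chunk size made up here:

```python
import pandas as pd

CSV_PATH = "transactions.csv"   # assumed file name
CHUNK_ROWS = 5_000_000          # assumed chunk size; tune to available memory

partials = []
for chunk in pd.read_csv(CSV_PATH, chunksize=CHUNK_ROWS):
    # Partial sums per (id, company) within this chunk only.
    partials.append(
        chunk.groupby(['id', 'company'])[['purchasequantity', 'purchaseamount']].sum())

# Summing the per-chunk partial sums is equivalent to a single groupby-sum
# over the full 22 GB file, because addition is associative across chunks.
result = pd.concat(partials).groupby(level=['id', 'company']).sum()
```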

Use cumcount on pandas dataframe with a conditional increment

主宰稳场 submitted on 2020-12-31 14:19:24
Question: Consider the dataframe

```python
df = pd.DataFrame(
    [['A', 1], ['A', 1], ['B', 1], ['B', 0], ['A', 0], ['A', 1], ['B', 1]],
    columns=['key', 'cond'])
```

I want to find a cumulative (running) count (starting at 1) for each key, where we only increment if the previous value in the group had cond == 1. Appended to the above dataframe, this would give

```python
df_result = pd.DataFrame(
    [['A', 1, 1], ['A', 1, 2], ['B', 1, 1], ['B', 0, 2], ['A', 0, 3], ['A', 1, 3], ['B', 1, 2]],
    columns=['key', 'cond', 'count'])  # third column name assumed; the excerpt cuts off here
```
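
One way to get this in pandas is to notice that each row's count equals one plus the number of earlier rows in its group whose cond was 1, which a shifted cumulative sum captures. A minimal sketch (the result column name `count` is my choice):

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 1], ['A', 1], ['B', 1], ['B', 0], ['A', 0], ['A', 1], ['B', 1]],
    columns=['key', 'cond'])

# Within each key: shift cond down by one row (the first row of a group sees 0),
# take a running sum of those shifted flags, and start the count at 1.
df['count'] = (
    df.groupby('key')['cond']
      .transform(lambda s: s.shift(fill_value=0).cumsum() + 1))

print(df['count'].tolist())  # [1, 2, 1, 2, 3, 3, 2]
```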

group data and filter groups by two columns (dplyr)

☆樱花仙子☆ submitted on 2020-12-30 03:42:30
Question: I have a question regarding using dplyr to filter a dataset. I want to group the data by RestaurantID and then filter() to keep all groups where the wage >= 5 in Year == 1992. For example, I have:

```
RestaurantID Year Wage
1            92   6
1            93   4
2            92   3
2            93   4
3            92   5
3            93   5
```

The dataset I want (keeping every group that had a wage value >= 5 in 1992):

```
RestaurantID Year Wage
1            92   6
1            93   4
3            92   5
3            93   5
```

I tried:

```r
data %>% group_by("RestaurantID") %>% filter(any(Wage>= '5', Year =='92'))
```

But this gives me all …
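
The quoted column name in group_by(), the quoted numbers, and passing the two tests as separate arguments to any() are likely why every group survives; the usual fix in dplyr is to combine the conditions with & inside a single any() and compare numbers. Since the code sketches in this digest are in Python, here is the same group-then-filter idea in pandas, using the column names and values from the example above:

```python
import pandas as pd

df = pd.DataFrame(
    {'RestaurantID': [1, 1, 2, 2, 3, 3],
     'Year':         [92, 93, 92, 93, 92, 93],
     'Wage':         [6, 4, 3, 4, 5, 5]})

# Keep every restaurant whose 1992 row has Wage >= 5: both tests are combined
# row-wise with & and then reduced with .any() per group.
kept = df.groupby('RestaurantID').filter(
    lambda g: ((g['Year'] == 92) & (g['Wage'] >= 5)).any())

print(kept)  # rows for RestaurantID 1 and 3 only
```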

spark: How does salting work in dealing with skewed data

拜拜、爱过 submitted on 2020-12-29 07:52:25
Question: I have skewed data in a table which is then joined with another table that is small. I understand that salting works in the case of joins: a random number from some range is appended to the keys of the big, skewed table, and the rows of the small, unskewed table are duplicated once for each value in that same range. The matching then still happens because, for any particular salted key of the skewed table, there will be a hit among the duplicated rows. I also read that salting …
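
The excerpt cuts off there, but the mechanism it describes can be sketched in a few lines of PySpark. The frame names, the join column `key`, and the salt range `N` below are illustrative assumptions, not from the original post:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
N = 10  # number of salt buckets (assumed)

# Tiny stand-ins for the real tables, sharing a join column "key".
big_df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "val"])
small_df = spark.createDataFrame([("a", "x"), ("b", "y")], ["key", "name"])

# 1. Append a random salt in [0, N) to every row of the large, skewed table.
big_salted = big_df.withColumn("salt", (F.rand() * N).cast("int"))

# 2. Duplicate every row of the small table once per possible salt value,
#    so each (key, salt) pair on the big side has exactly one candidate match.
small_exploded = small_df.withColumn(
    "salt", F.explode(F.array(*[F.lit(i) for i in range(N)])))

# 3. Join on the composite key; the hot key is now spread over N partitions.
joined = big_salted.join(small_exploded, on=["key", "salt"]).drop("salt")
```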

SQL Max(date) without group by

旧时模样 submitted on 2020-12-26 07:54:56
Question: I have the following table...

```
MemberID ServDate
001      12-12-2015
001      12-13-2015
001      12-15-2015
002      11-30-2015
002      12-04-2015
```

And I want to make it look like this...

```
MemberID ServDate    LastServDate
001      12-12-2015  12-15-2015
001      12-13-2015  12-15-2015
001      12-15-2015  12-15-2015
002      11-30-2015  12-04-2015
002      12-04-2015  12-04-2015
```

Is there a way I can do this without having to use a GROUP BY or nested query? (I'm dealing with a very large database and the GROUP BY slows things down considerably.)

Answer 1: …
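
The answer itself is cut off in this excerpt. A standard way to avoid GROUP BY for this shape of result in SQL is a window function such as `MAX(ServDate) OVER (PARTITION BY MemberID)`, assuming the database supports window functions. Since the sketches in this digest use Python, here is the pandas analogue of that windowed max, built on the sample rows above:

```python
import pandas as pd

df = pd.DataFrame(
    {'MemberID': ['001', '001', '001', '002', '002'],
     'ServDate': pd.to_datetime(
         ['2015-12-12', '2015-12-13', '2015-12-15', '2015-11-30', '2015-12-04'])})

# transform('max') broadcasts each member's latest date back onto every row,
# the same per-row result a windowed MAX would produce in SQL.
df['LastServDate'] = df.groupby('MemberID')['ServDate'].transform('max')
print(df)
```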