aggregate

R summarize unique values across columns based on values from one column

馋奶兔 submitted on 2020-01-30 05:30:28
Question: I want to know the total number of unique values for each column based on the values of var_1. For example:
    Test <- data.frame(var_1 = c("a","a","a", "b", "b", "c", "c", "c", "c", "c"),
                       var_2 = c("bl","bf","bl", "bl","bf","bl","bl","bf","bc", "bg"),
                       var_3 = c("cf","cf","eg", "cf","cf","eg","cf","dr","eg","fg"))
The results I am looking for would be based on the values in var_1 and should be:
    var_1 var_2 var_3
    a     2     2
    b     2     1
    c     3     4
However, after trying various methods (including apply and table)
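The excerpt stops before any answer, so the following is only a minimal sketch of the counting the question describes, written in plain Python rather than R. The helper name distinct_counts is hypothetical. Note that with the data exactly as posted, group c contains four distinct var_2 values (bl, bf, bc, bg), so this sketch reports 4 there rather than the 3 shown in the question's expected table.

```python
from collections import defaultdict

# Sample data copied from the question's Test data frame.
rows = list(zip(
    ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"],            # var_1
    ["bl", "bf", "bl", "bl", "bf", "bl", "bl", "bf", "bc", "bg"],  # var_2
    ["cf", "cf", "eg", "cf", "cf", "eg", "cf", "dr", "eg", "fg"],  # var_3
))


def distinct_counts(rows):
    """Return {var_1: (distinct var_2 count, distinct var_3 count)}."""
    seen = defaultdict(lambda: (set(), set()))
    for var_1, var_2, var_3 in rows:
        seen[var_1][0].add(var_2)
        seen[var_1][1].add(var_3)
    return {key: (len(s2), len(s3)) for key, (s2, s3) in seen.items()}


counts = distinct_counts(rows)
for key, (n2, n3) in counts.items():
    print(key, n2, n3)
```

Collecting values into one set per group and column keeps the logic to a single pass over the rows.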

MongoDB Aggregation Pipeline

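半腔热情 submitted on 2020-01-28 04:29:10
Reposted from: https://www.cnblogs.com/shanyou/p/3494854.html The pipeline concept: among the ways of using POSIX threads, one important pattern is the pipeline, in which a stream of "data elements" is processed serially by a group of threads in a fixed order. The original post illustrates this architecture with a figure. From an object-oriented point of view, the whole pipeline can be understood as a conduit for data; each worker thread in it can be understood as one stage of the pipeline, and the worker threads cooperate as links in a chain. The closer a worker thread sits to the input, the earlier its stage runs, and its output affects the result of the next stage. In other words, each stage depends on the output of the previous stage, and the previous stage's output becomes the current stage's input. This is a property shared by all pipelines. To meet users' need for simple data access, MongoDB 2.2 introduced a new feature, the Aggregation Framework, a framework for data aggregation whose concept resembles a data-processing pipeline. Each document passes through a pipeline made up of several nodes, each node has its own specific function (grouping, filtering, and so on), and once the documents have passed through the pipeline the corresponding result is output. The pipeline has two basic functions: the first is to "filter" documents, that is, to select the documents that meet a condition; the second is to "transform" documents, that is, to change the form in which they are output. Other functions include grouping and sorting by a specified field, and so on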
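The idea that each stage consumes the previous stage's output can be sketched in plain Python. This is an illustration of the concept only, not MongoDB's API: the functions match, project and group_sum are hypothetical helpers whose names loosely echo MongoDB's $match, $project and $group stages, and the sample documents are invented.

```python
from collections import defaultdict

# Hypothetical documents; the field names are illustrative only.
docs = [
    {"author": "li", "pages": 120, "lang": "en"},
    {"author": "li", "pages": 80, "lang": "zh"},
    {"author": "wu", "pages": 200, "lang": "en"},
]


def match(stream, predicate):
    """Filtering stage: pass on only documents that satisfy predicate."""
    return (doc for doc in stream if predicate(doc))


def project(stream, fields):
    """Transforming stage: keep only the named fields of each document."""
    return ({f: doc[f] for f in fields} for doc in stream)


def group_sum(stream, key, value):
    """Grouping stage: total `value` for each distinct `key`."""
    totals = defaultdict(int)
    for doc in stream:
        totals[doc[key]] += doc[value]
    return ({"_id": k, "total": v} for k, v in totals.items())


def run_pipeline(docs, stages):
    """Feed each stage's output into the next stage, in order."""
    stream = iter(docs)
    for stage in stages:
        stream = stage(stream)
    return list(stream)


result = run_pipeline(docs, [
    lambda s: match(s, lambda d: d["lang"] == "en"),
    lambda s: project(s, ["author", "pages"]),
    lambda s: group_sum(s, "author", "pages"),
])
print(result)
```

Because every stage takes a stream and returns a stream, stages can be reordered or dropped without touching the others, which is the property the text attributes to pipelines in general.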

R: Aggregate character strings [duplicate]

心已入冬 submitted on 2020-01-25 18:43:46
Question: This question already has answers here: How to sum a variable by group (13 answers). Closed last month. I have a data frame ModelDF having columns with numeric as well as character values like:
    Quantity Type  Mode  Company
    1        Shoe  hello Nike
    1        Shoe  hello Nike
    2        Jeans hello Levis
    3        Shoe  hello Nike
    1        Jeans hello Levis
    1        Shoe  hello Adidas
    2        Jeans hello Spykar
    1        Shoe  ahola Nike
    1        Jeans ahola Levis
I have to aggregate it in this form:
    Quantity Type  Mode  Company
    5        Shoe  hello Nike
    3        jeans hello Levis
    1
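A minimal sketch of the requested aggregation, in plain Python rather than R: Quantity is summed for each distinct (Type, Mode, Company) combination. The helper name sum_quantity is hypothetical, and the rows are those shown in the question.

```python
from collections import defaultdict

# Rows copied from the question: (Quantity, Type, Mode, Company).
model_rows = [
    (1, "Shoe", "hello", "Nike"),
    (1, "Shoe", "hello", "Nike"),
    (2, "Jeans", "hello", "Levis"),
    (3, "Shoe", "hello", "Nike"),
    (1, "Jeans", "hello", "Levis"),
    (1, "Shoe", "hello", "Adidas"),
    (2, "Jeans", "hello", "Spykar"),
    (1, "Shoe", "ahola", "Nike"),
    (1, "Jeans", "ahola", "Levis"),
]


def sum_quantity(rows):
    """Sum Quantity per (Type, Mode, Company) group."""
    totals = defaultdict(int)
    for quantity, *group_key in rows:
        totals[tuple(group_key)] += quantity
    return dict(totals)


totals = sum_quantity(model_rows)
for (type_, mode, company), quantity in totals.items():
    print(quantity, type_, mode, company)
```

The character columns act purely as the grouping key, so only the numeric column is ever added up.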

Remove duplicates from MongoDB 4.2 data base

若如初见. submitted on 2020-01-25 12:06:23
Question: I am trying to remove duplicates from MongoDB, but all the solutions I find fail. My JSON structure: { "_id" : ObjectId("5d94ad15667591cf569e6aa4"), "a" : "aaa", "b" : "bbb", "c" : "ccc", "d" : "ddd", "key" : "057cea2fc37aabd4a59462d3fd28c93b" } The key value is md5(a+b+c+d). I already have a database with over 1 billion records, and I want to remove all the duplicates according to key and afterwards use a unique index, so that if the key is already in the database the record won't be inserted again. I already tried db.data
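The deduplication rule itself (keep the first record seen for each key, drop the rest) can be sketched in plain Python. This is only an in-memory illustration, not a MongoDB command, and it would not scale as written to the billion-record collection described. The helpers make_record and dedupe_by_key are hypothetical, and hashing the UTF-8 bytes of the concatenated fields is an assumption about how the question's md5(a+b+c+d) was computed.

```python
import hashlib


def make_record(a, b, c, d):
    """Build a record whose key is the MD5 of the concatenated fields (assumed UTF-8)."""
    key = hashlib.md5((a + b + c + d).encode("utf-8")).hexdigest()
    return {"a": a, "b": b, "c": c, "d": d, "key": key}


def dedupe_by_key(records, key_field="key"):
    """Keep the first record for each key value and drop later repeats."""
    seen = set()
    kept = []
    for record in records:
        k = record[key_field]
        if k in seen:
            continue
        seen.add(k)
        kept.append(record)
    return kept


# Hypothetical sample data: the second record repeats the first.
records = [
    make_record("aaa", "bbb", "ccc", "ddd"),
    make_record("aaa", "bbb", "ccc", "ddd"),
    make_record("aaa", "bbb", "ccc", "eee"),
]
unique_records = dedupe_by_key(records)
print(len(records), "->", len(unique_records))
```

Once no key repeats, a unique index on key, as the question proposes, can reject later duplicate inserts.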

Remove duplicates from MongoDB 4.2 data base

我与影子孤独终老i submitted on 2020-01-25 12:03:46
Question: I am trying to remove duplicates from MongoDB, but all the solutions I find fail. My JSON structure: { "_id" : ObjectId("5d94ad15667591cf569e6aa4"), "a" : "aaa", "b" : "bbb", "c" : "ccc", "d" : "ddd", "key" : "057cea2fc37aabd4a59462d3fd28c93b" } The key value is md5(a+b+c+d). I already have a database with over 1 billion records, and I want to remove all the duplicates according to key and afterwards use a unique index, so that if the key is already in the database the record won't be inserted again. I already tried db.data

PowerQuery COUNTIF Previous Dates

冷暖自知 submitted on 2020-01-25 11:27:25
Question: I'm a little rusty on PowerQuery. I need to count "previous" entries in the same table. For example, let's say we have a table of car sales. For the purposes of PowerQuery, this table will be named tblCarSales. I need to add two aggregate columns. The first aggregate column is the count of previous sales. The Excel formula would be =COUNTIF([Sale Date],"<"&[@[Sale Date]]) The second aggregate column is the count of previous sales by make. The Excel formula would be =COUNTIFS([Sale Date],"<"&[
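The two counts can be sketched in plain Python to make the intended logic explicit; this is not PowerQuery M code. The sample rows are hypothetical because the question's table is not shown, and since the second Excel formula is cut off, the extra condition "same make" is an assumption taken from the question's wording. The helper name add_previous_counts is also hypothetical.

```python
from datetime import date

# Hypothetical sample of tblCarSales; the question's actual rows are not shown.
sales = [
    {"make": "Ford", "sale_date": date(2020, 1, 1)},
    {"make": "Audi", "sale_date": date(2020, 1, 2)},
    {"make": "Ford", "sale_date": date(2020, 1, 2)},
    {"make": "Ford", "sale_date": date(2020, 1, 5)},
]


def add_previous_counts(sales):
    """Add counts of strictly earlier sales, overall and for the same make."""
    out = []
    for row in sales:
        earlier = [s for s in sales if s["sale_date"] < row["sale_date"]]
        same_make = [s for s in earlier if s["make"] == row["make"]]
        out.append({
            **row,
            "previous_sales": len(earlier),
            "previous_sales_by_make": len(same_make),
        })
    return out


with_counts = add_previous_counts(sales)
for row in with_counts:
    print(row["make"], row["sale_date"], row["previous_sales"], row["previous_sales_by_make"])
```

Like COUNTIF with "<", the comparison is strict, so sales on the same date do not count as previous. The nested scan is quadratic in the number of rows, which is fine for a sketch but may be slow on large tables.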

pandas aggregate dataframe returns only one column

南楼画角 submitted on 2020-01-24 13:54:28
Question: Hi there. I have a pandas DataFrame (df) like this:
       foo  id1   bar  id2
    0  8.0    1  NULL    1
    1  5.0    1  NULL    1
    2  3.0    1  NULL    1
    3  4.0    1     1    2
    4  7.0    1     3    2
    5  9.0    1     4    3
    6  5.0    1     2    3
    7  7.0    1     3    1
    ...
I want to group by id1 and id2 and try to get the mean of foo and bar. My code:
    res = df.groupby(["id1","id2"])["foo","bar"].mean()
What I get is almost what I expect:
                  foo
    id1 id2
    1   1    5.750000
        2    7.000000
    2   1    3.500000
        2    1.500000
    3   1    6.000000
        2    5.333333
The values in column "foo" are exactly the average values (means)
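The excerpt is cut off before the cause is given. One plausible explanation, stated here as an assumption, is that bar holds the text "NULL" and is therefore not a numeric column, so it drops out of the mean. The plain-Python sketch below shows the intended computation once "NULL" is treated as a missing value; in pandas the comparable conversion would be pd.to_numeric(df["bar"], errors="coerce") before grouping. The helpers to_number and group_means are hypothetical, and only the eight visible rows are used.

```python
from collections import defaultdict
from statistics import mean

# The eight rows visible in the question: (foo, id1, bar, id2).
# Assumption: bar is stored as text, with "NULL" marking a missing value.
rows = [
    (8.0, 1, "NULL", 1), (5.0, 1, "NULL", 1), (3.0, 1, "NULL", 1),
    (4.0, 1, "1", 2), (7.0, 1, "3", 2), (9.0, 1, "4", 3),
    (5.0, 1, "2", 3), (7.0, 1, "3", 1),
]


def to_number(value):
    """Convert a cell to float; treat the placeholder "NULL" as missing."""
    return None if value == "NULL" else float(value)


def group_means(rows):
    """Mean of foo and bar per (id1, id2), skipping missing bar values."""
    groups = defaultdict(lambda: {"foo": [], "bar": []})
    for foo, id1, bar, id2 in rows:
        groups[(id1, id2)]["foo"].append(foo)
        bar_num = to_number(bar)
        if bar_num is not None:
            groups[(id1, id2)]["bar"].append(bar_num)
    return {
        key: {col: (mean(vals) if vals else None) for col, vals in cols.items()}
        for key, cols in groups.items()
    }


means = group_means(rows)
for key, cols in sorted(means.items()):
    print(key, cols)
```

For the group (1, 1) this gives a foo mean of 5.75, matching the first value in the question's output, and a bar mean based only on the single non-missing entry.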

pandas aggregate dataframe returns only one column

北战南征 submitted on 2020-01-24 13:54:26
Question: Hi there. I have a pandas DataFrame (df) like this:
       foo  id1   bar  id2
    0  8.0    1  NULL    1
    1  5.0    1  NULL    1
    2  3.0    1  NULL    1
    3  4.0    1     1    2
    4  7.0    1     3    2
    5  9.0    1     4    3
    6  5.0    1     2    3
    7  7.0    1     3    1
    ...
I want to group by id1 and id2 and try to get the mean of foo and bar. My code:
    res = df.groupby(["id1","id2"])["foo","bar"].mean()
What I get is almost what I expect:
                  foo
    id1 id2
    1   1    5.750000
        2    7.000000
    2   1    3.500000
        2    1.500000
    3   1    6.000000
        2    5.333333
The values in column "foo" are exactly the average values (means)

django: calculate percentage based on object count

巧了我就是萌 submitted on 2020-01-23 12:53:47
Question: I have the following models:
    class Question(models.Model):
        question = models.CharField(max_length=100)

    class Option(models.Model):
        question = models.ForeignKey(Question)
        value = models.CharField(max_length=200)

    class Answer(models.Model):
        option = models.ForeignKey(Option)
Each Question has Options defined by the user. For example: Question - What is the best fruit? Options - Apple, Orange, Grapes. Now other users can answer the question, with their responses restricted to the Options. I have
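The question is cut off, so the exact goal is inferred from the title: the share of answers each option received. The arithmetic can be sketched in plain Python without Django; the answer list is hypothetical and option_percentages is an illustrative helper, not part of the asker's code.

```python
from collections import Counter

# Options from the question's example; the answers are hypothetical.
options = ["Apple", "Orange", "Grapes"]
answers = ["Apple", "Apple", "Orange", "Apple"]


def option_percentages(options, answers):
    """Percentage of answers per option; options with no answers get 0.0."""
    counts = Counter(answers)
    total = sum(counts[o] for o in options)
    return {o: (100.0 * counts[o] / total if total else 0.0) for o in options}


percentages = option_percentages(options, answers)
for option, pct in percentages.items():
    print(f"{option}: {pct:.1f}%")
```

Guarding against a zero total avoids a division error for a question that has no answers yet.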

[spark]RewriteDistinctAggregates

女生的网名这么多〃 submitted on 2020-01-22 05:08:55
If an Aggregate operation contains both Distinct and non-Distinct operations, the optimizer can rewrite it into two Aggregates that contain no Distinct. Suppose the schema is as follows:
    create table animal (gkey varchar(128), cat varchar(128), dog varchar(128), price double);
The data in the animal table is as follows:
    gkey cat dog price
    a    ca1 cb1 10
    a    ca1 cb2 5
    b    ca1 cb1 13
The test statement is as follows:
    SELECT gkey, SUM(price), COUNT(DISTINCT cat), COUNT(DISTINCT dog) FROM animal GROUP BY gkey
This test statement has 3 aggregates, two of which contain distinct. The optimization strategy is as follows. First, expand each row of the animal table into 3 rows and add a new column grid of integer type; call the new table animal2:
    gkey  cat  dog  price  grid
    $gkey null null $price 0
    $gkey $cat null null   1
    $gkey null $dog null   2
The data in table animal2 is as follows:
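The expansion described above, followed by two Distinct-free aggregations, can be sketched in plain Python. The expand step follows the excerpt directly. The two aggregation steps are an assumption about how the rewrite continues, since the excerpt is cut off: the first groups by (gkey, cat, dog, grid), the second groups by gkey. The function names are hypothetical, and the sketch assumes the source rows contain no null cat or dog values.

```python
# Rows of the animal table from the excerpt.
animal = [
    {"gkey": "a", "cat": "ca1", "dog": "cb1", "price": 10.0},
    {"gkey": "a", "cat": "ca1", "dog": "cb2", "price": 5.0},
    {"gkey": "b", "cat": "ca1", "dog": "cb1", "price": 13.0},
]


def expand(rows):
    """Turn each input row into three rows tagged with grid 0, 1 and 2."""
    for r in rows:
        yield {"gkey": r["gkey"], "cat": None, "dog": None, "price": r["price"], "grid": 0}
        yield {"gkey": r["gkey"], "cat": r["cat"], "dog": None, "price": None, "grid": 1}
        yield {"gkey": r["gkey"], "cat": None, "dog": r["dog"], "price": None, "grid": 2}


def first_aggregate(rows):
    """Group by (gkey, cat, dog, grid): repeats collapse and prices are summed."""
    partial = {}
    for r in rows:
        key = (r["gkey"], r["cat"], r["dog"], r["grid"])
        partial[key] = partial.get(key, 0.0) + (r["price"] or 0.0)
    return partial


def second_aggregate(partial):
    """Group by gkey: plain sum for grid 0, plain counts for grid 1 and grid 2."""
    result = {}
    for (gkey, cat, dog, grid), price_sum in partial.items():
        out = result.setdefault(
            gkey, {"sum_price": 0.0, "distinct_cat": 0, "distinct_dog": 0}
        )
        if grid == 0:
            out["sum_price"] += price_sum
        elif grid == 1:
            out["distinct_cat"] += 1
        else:
            out["distinct_dog"] += 1
    return result


animal2 = list(expand(animal))
result = second_aggregate(first_aggregate(animal2))
for gkey, row in sorted(result.items()):
    print(gkey, row)
```

Because the first aggregation already collapses repeated (gkey, cat) and (gkey, dog) combinations, the second one only needs ordinary counts, which is why neither step requires Distinct.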