aggregate

Crosstab with a large or undefined number of categories

Submitted by ぃ、小莉子 on 2019-12-17 16:11:46
Question: My real problem has to do with recording which of a very large number of anti-virus products agree that a given sample is a member of a given malware family. The database has millions of samples, with tens of anti-virus products voting on each sample. I want to ask a query like "For the malware containing the name 'XYZ', which sample had the most votes, and which vendors voted for it?" and get results like:

    "BadBadVirus"        V1 V2 V3 V4 V5 V6 V7
    Sample 1 - 4 votes    1  0  1  0  0  1  1
    Sample 2 - 5…
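The post is truncated above, but the shape of the problem, pivoting per-vendor votes into one column per vendor, is a classic crosstab. A minimal illustrative sketch in Python/pandas (the records and column names are hypothetical, since the actual schema isn't shown):

    import pandas as pd

    # Hypothetical vote records: one row per (sample, vendor) detection.
    votes = pd.DataFrame({
        "sample": ["s1", "s1", "s1", "s1", "s2", "s2"],
        "vendor": ["V1", "V3", "V6", "V7", "V1", "V2"],
    })

    # Crosstab: one row per sample, one 0/1 column per vendor; columns
    # appear only for vendors actually seen, so the category count can grow.
    table = pd.crosstab(votes["sample"], votes["vendor"])
    table["total_votes"] = table.sum(axis=1)

    # The sample with the most votes, and which vendors voted for it.
    top = table["total_votes"].idxmax()
    print(top, dict(table.loc[top]))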

How to output duplicated rows

Submitted by 廉价感情. on 2019-12-17 14:58:15
Question: I have the following data:

    x1 x2 x3 x4
    34 14 45 53
     2  8 18 17
    34 14 45 20
    19 78 21 48
     2  8 18  5

In rows 1 and 3, and in rows 2 and 5, the values of columns x1, x2, and x3 are equal. How can I output only those four rows? The output should be in the following format:

    x1 x2 x3 x4
    34 14 45 53
    34 14 45 20
     2  8 18 17
     2  8 18  5

Please ask if anything is unclear. Additional question: in the output above, find the sum of the values in the last column.
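The original post doesn't state a language. As an illustrative sketch (one way to do it, not the accepted answer), the same selection is compact in Python/pandas, where duplicated(keep=False) marks every member of a duplicated group rather than only the repeats:

    import pandas as pd

    df = pd.DataFrame({
        "x1": [34, 2, 34, 19, 2],
        "x2": [14, 8, 14, 78, 8],
        "x3": [45, 18, 45, 21, 18],
        "x4": [53, 17, 20, 48, 5],
    })

    key = ["x1", "x2", "x3"]
    # keep=False flags all rows whose (x1, x2, x3) combination repeats.
    dupes = df[df.duplicated(subset=key, keep=False)].sort_values(key, ascending=False)
    print(dupes)

    # Additional question: sum of the last column over those rows.
    print(dupes["x4"].sum())   # 53 + 20 + 17 + 5 = 95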

R count NA by group

Submitted by 不打扰是莪最后的温柔 on 2019-12-17 13:57:31
Question: Could someone please explain why I get different answers when using the aggregate function to count missing values by group? Also, is there a better way to count missing values by group using a native R function?

    DF <- data.frame(YEAR = c(2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002),
                     X = c(1, NA, 3, NA, NA, NA, 7, 8, 9, 10))
    DF
    aggregate(X ~ YEAR, data = DF, function(x) { sum(is.na(x)) })
    with(DF, aggregate(X, list(YEAR), function(x) { sum(is.na(x)) }))
    aggregate(X ~ YEAR, data = DF, function(x) { sum(!is…
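The snippet is cut off, but the discrepancy itself is well known: the formula interface applies na.action = na.omit before the function runs, so the NAs are already gone when sum(is.na(x)) is evaluated, while the list interface (the second call) keeps them; passing na.action = NULL to the formula version reconciles the two. For comparison, here is the same grouped count in Python/pandas, purely as an illustrative analogue, not the original answer:

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({
        "YEAR": [2000, 2000, 2000, 2001, 2001, 2001, 2001, 2002, 2002, 2002],
        "X": [1, np.nan, 3, np.nan, np.nan, np.nan, 7, 8, 9, 10],
    })

    # Count missing X per YEAR; the NaN rows are kept because we aggregate
    # over isna() instead of dropping them first.
    print(df["X"].isna().groupby(df["YEAR"]).sum())
    # YEAR 2000 -> 1, 2001 -> 3, 2002 -> 0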

SQL Server: How to use an aggregate function like MAX in a WHERE clause

Submitted by 淺唱寂寞╮ on 2019-12-17 10:59:00
Question: I want to get the maximum value for this record. Please help me:

    SELECT rest.field1
    FROM mastertable AS m
    INNER JOIN (
        SELECT t1.field1 AS field1, t2.field2
        FROM table1 AS t1
        INNER JOIN table2 AS t2 ON t2.field = t1.field
        WHERE t1.field3 = MAX(t1.field3)
        --    ^^^^^^^^^^^^^^^^^^^^^^^^^ Help me here.
    ) AS rest ON rest.field1 = m.field

Answer 1: You could use a subquery:

    WHERE t1.field3 = (SELECT MAX(st1.field3) FROM table1 AS st1)

But I would actually move this out of the WHERE clause and into the join statement, as…
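The answer is cut off mid-sentence, but the subquery fix it shows is easy to verify. A runnable sketch using Python's built-in sqlite3 (SQLite rather than SQL Server, and a made-up table, but the principle is the same: an aggregate can't sit directly in WHERE, while a scalar subquery can):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE table1 (field1 TEXT, field3 INTEGER)")
    con.executemany("INSERT INTO table1 VALUES (?, ?)",
                    [("a", 1), ("b", 5), ("c", 3)])

    # The scalar subquery computes MAX once; the outer WHERE compares to it.
    rows = con.execute("""
        SELECT field1, field3
        FROM table1 AS t1
        WHERE t1.field3 = (SELECT MAX(st1.field3) FROM table1 AS st1)
    """).fetchall()
    print(rows)  # [('b', 5)]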

Aggregate by week in R

Submitted by 孤人 on 2019-12-17 10:49:17
Question: In R I frequently aggregate daily data (in a zoo series) by month, using something like this:

    result <- aggregate(x, as.yearmon, "mean", na.rm = TRUE)

Is there a way that I can do this by week?

Answer 1: The easiest thing to do is to use the apply.weekly function from xts.

    > apply.weekly(zoo(1:10, as.Date("2010-01-01") + 1:10), mean)
    2010-01-03 2010-01-10 2010-01-11
           1.5        6.0       10.0

Source: https://stackoverflow.com/questions/4309248/aggregate-by-week-in-r
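For comparison, a sketch of the same weekly aggregation in Python/pandas (an illustrative analogue, not from the original thread; note pandas labels each bin by its week-ending date rather than by the last observation, so the final label differs from apply.weekly's):

    import pandas as pd

    # Values 1..10 on 2010-01-02 .. 2010-01-11, like the zoo example.
    s = pd.Series(range(1, 11),
                  index=pd.date_range("2010-01-02", periods=10, freq="D"))

    # Weekly means, weeks ending Sunday (matching apply.weekly's grouping).
    print(s.resample("W-SUN").mean())
    # 2010-01-03     1.5
    # 2010-01-10     6.0
    # 2010-01-17    10.0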

Explain the aggregate functionality in Spark

Submitted by 怎甘沉沦 on 2019-12-17 10:15:51
Question: I am looking for a better explanation of the aggregate functionality that is available via Spark in Python. The example I have is as follows (using pyspark from Spark 1.2.0):

    sc.parallelize([1, 2, 3, 4]).aggregate(
        (0, 0),
        (lambda acc, value: (acc[0] + value, acc[1] + 1)),
        (lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])))

    Output: (10, 4)

I get the expected result (10, 4), which is the sum of 1+2+3+4 and the count of four elements. If I change the initial value passed to the aggregate function to…
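The question is cut off, but the usual surprise with aggregate is that the zero value is used once per partition and once more in the final combine, so a non-neutral zero gets added multiple times. A plain-Python sketch of those semantics (no Spark needed; partitioning is made explicit for illustration):

    from functools import reduce

    def aggregate(partitions, zero, seq_op, comb_op):
        # Each partition folds from its own copy of `zero`...
        partials = [reduce(seq_op, part, zero) for part in partitions]
        # ...and the driver folds the partials starting from `zero` again.
        return reduce(comb_op, partials, zero)

    seq = lambda acc, v: (acc[0] + v, acc[1] + 1)
    comb = lambda a, b: (a[0] + b[0], a[1] + b[1])

    print(aggregate([[1, 2], [3, 4]], (0, 0), seq, comb))  # (10, 4)
    # A non-neutral zero is added once per partition plus once in the
    # combine: (1, 0) over 2 partitions -> (10 + 3, 4).
    print(aggregate([[1, 2], [3, 4]], (1, 0), seq, comb))  # (13, 4)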

R use ddply or aggregate

Submitted by 那年仲夏 on 2019-12-17 09:53:55
Question: I have a data frame with 3 columns: custId, saleDate, DelivDateTime.

    > head(events22)
          custId            saleDate      DelivDate
    1  280356593 2012-11-14 14:04:59 11/14/12 17:29
    2  280367076 2012-11-14 17:04:44 11/14/12 20:48
    3  280380097 2012-11-14 17:38:34 11/14/12 20:45
    4  280380095 2012-11-14 20:45:44 11/14/12 23:59
    5  280380095 2012-11-14 20:31:39 11/14/12 23:49
    6  280380095 2012-11-14 19:58:32 11/15/12 00:10

Here's the dput:

    > dput(events22)
    structure(list(custId = c(280356593L, 280367076L, 280380097L,…
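The post is truncated before it states the actual task. Purely as a hypothetical stand-in consistent with the title (a per-group aggregate over this data), here is one such computation in Python/pandas, the latest delivery per customer; the real question may have asked for something else:

    import pandas as pd

    events = pd.DataFrame({
        "custId": [280356593, 280367076, 280380097,
                   280380095, 280380095, 280380095],
        "saleDate": pd.to_datetime([
            "2012-11-14 14:04:59", "2012-11-14 17:04:44",
            "2012-11-14 17:38:34", "2012-11-14 20:45:44",
            "2012-11-14 20:31:39", "2012-11-14 19:58:32"]),
        "DelivDate": pd.to_datetime([
            "11/14/12 17:29", "11/14/12 20:48", "11/14/12 20:45",
            "11/14/12 23:59", "11/14/12 23:49", "11/15/12 00:10"],
            format="%m/%d/%y %H:%M"),
    })

    # One plausible per-group aggregate: latest delivery per customer.
    print(events.groupby("custId")["DelivDate"].max())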

Compute mean and standard deviation by group for multiple variables in a data.frame

Submitted by ≡放荡痞女 on 2019-12-17 08:48:36
Question: Edit: this question was originally titled "Long to wide data reshaping in R". I'm just learning R and trying to find ways to apply it to help out others in my life. As a test case, I'm working on reshaping some data, and I'm having trouble following the examples I've found online. What I'm starting with looks like this:

    ID  Obs 1  Obs 2  Obs 3
     1     43     48     37
     1     27     29     22
     1     36     32     40
     2     33     38     36
     2     29     32     27
     2     32     31     35
     2     25     28     24
     3     45     47     42
     3     38     40     36

And what I want to end up with will look like this…
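Per the updated title, the goal is the mean and standard deviation of each Obs column by ID. A sketch of the equivalent computation in Python/pandas (illustrative only; the original question is about R):

    import pandas as pd

    df = pd.DataFrame({
        "ID":   [1, 1, 1, 2, 2, 2, 2, 3, 3],
        "Obs1": [43, 27, 36, 33, 29, 32, 25, 45, 38],
        "Obs2": [48, 29, 32, 38, 32, 31, 28, 47, 40],
        "Obs3": [37, 22, 40, 36, 27, 35, 24, 42, 36],
    })

    # Mean and (sample) standard deviation of every Obs column, per ID.
    print(df.groupby("ID").agg(["mean", "std"]))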

Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

Submitted by ≯℡__Kan透↙ on 2019-12-17 07:05:09
Question: I want to share this particular Apache Spark with Python solution because documentation for it is quite poor. I wanted to calculate the average value of K/V pairs (stored in a pairwise RDD), by KEY. Here is what the sample data looks like:

    >>> rdd1.take(10)  # Show a small sample.
    [(u'2013-10-09', 7.60117302052786),
     (u'2013-10-10', 9.322709163346612),
     (u'2013-10-10', 28.264462809917358),
     (u'2013-10-07', 9.664429530201343),
     (u'2013-10-07', 12.461538461538463),
     (u'2013-10-09', 20.76923076923077)…
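The scraped post ends before the solution. The standard approach (a sketch, not necessarily the poster's exact code) carries a (sum, count) pair per key with aggregateByKey and divides at the end:

    # Per-key averages with aggregateByKey: accumulate (sum, count) per
    # key, then divide. Assumes `sc` is an existing SparkContext.
    rdd1 = sc.parallelize([
        (u'2013-10-09', 7.60117302052786),
        (u'2013-10-10', 9.322709163346612),
        (u'2013-10-10', 28.264462809917358),
        (u'2013-10-07', 9.664429530201343),
    ])

    sum_count = rdd1.aggregateByKey(
        (0.0, 0),                                 # zero value: (sum, count)
        lambda acc, v: (acc[0] + v, acc[1] + 1),  # fold a value into a partition
        lambda a, b: (a[0] + b[0], a[1] + b[1]))  # merge partition results

    averages = sum_count.mapValues(lambda p: p[0] / p[1])
    print(averages.collect())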

LINQ to Objects: return pairs of numbers from a list of numbers

Submitted by 爷，独闯天下 on 2019-12-17 06:54:50
Question:

    var nums = new[] { 1, 2, 3, 4, 5, 6, 7 };
    var pairs = /* some LINQ magic here */;
    // => pairs = { {1, 2}, {3, 4}, {5, 6}, {7, 0} }

The elements of pairs should be either two-element lists or instances of some anonymous class with two fields, something like new { First = 1, Second = 2 }.

Answer 1: None of the default LINQ methods can do this lazily and with a single scan. Zipping the sequence with itself does two scans, and grouping is not entirely lazy. Your best bet is to implement it directly:

    public…
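The C# implementation is cut off right after its access modifier. Purely to illustrate the technique the answer describes (a single lazy scan over one shared iterator), here is the same idea in Python; each step of the zip pulls two elements from the same iterator, and zip_longest pads the odd tail with 0 as in the example:

    from itertools import zip_longest

    def pairs(iterable, fill=0):
        it = iter(iterable)  # one shared iterator, consumed twice per step
        return zip_longest(it, it, fillvalue=fill)

    print(list(pairs([1, 2, 3, 4, 5, 6, 7])))
    # [(1, 2), (3, 4), (5, 6), (7, 0)]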