aggregation

Aggregation with Group By date in Spark SQL

Submitted by 心已入冬 on 2019-11-30 15:40:41

I have an RDD containing a timestamp named `time` of type long:

```
root
 |-- id: string (nullable = true)
 |-- value1: string (nullable = true)
 |-- value2: string (nullable = true)
 |-- time: long (nullable = true)
 |-- type: string (nullable = true)
```

I am trying to group by value1, value2, and time as YYYY-MM-DD. I tried to group by `cast(time as Date)`, but then I got the following error:

```
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at …
```
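`cast(time as Date)` fails because `time` is a raw long, not a date type. A hedged sketch of the usual conversion, assuming `time` holds epoch seconds (divide by 1000 first if it is milliseconds) and a registered table hypothetically named `events`:

```python
from datetime import datetime, timezone

# Spark SQL grouping expression; "events" is a hypothetical table name,
# the column names come from the question's schema.
query = """
    SELECT value1, value2, from_unixtime(time, 'yyyy-MM-dd') AS day, COUNT(*) AS n
    FROM events
    GROUP BY value1, value2, from_unixtime(time, 'yyyy-MM-dd')
"""

# The same long -> 'YYYY-MM-DD' conversion in plain Python, for reference
# (from_unixtime uses the Spark session time zone; UTC is assumed here):
def epoch_to_day(ts: int) -> str:
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
```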

Convert strings to floats at aggregation time?

Submitted by 江枫思渺然 on 2019-11-30 14:41:37

Is there any way to convert strings to floats when specifying a histogram aggregation? I have documents with fields that are floats but are not parsed by Elasticsearch as such, and when I attempt to do a sum using a string field it throws the following error:

```
ClassCastException[org.elasticsearch.index.fielddata.plain.PagedBytesIndexFieldData cannot be cast to org.elasticsearch.index.fielddata.IndexNumericFieldData]}]
```

I know I could change the mapping, but for my use case it would be handier if I could specify something like `"script": "_value.tofloat()"` when writing the …
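The question's `script: _value.tofloat()` idea corresponds to a value script on the aggregation. A hedged sketch of such a request body, written as a Python dict (the `price` field name and the script syntax are hypothetical; value-script support of this form exists only in older Elasticsearch versions, and correcting the mapping remains the robust fix):

```python
# Hypothetical histogram aggregation that parses the string field at
# aggregation time via a value script. Exact script language/syntax
# varies by Elasticsearch version.
histogram_body = {
    "size": 0,
    "aggs": {
        "price_hist": {
            "histogram": {
                "field": "price",                     # indexed as a string
                "interval": 10,
                "script": "Float.parseFloat(_value)",  # hypothetical value script
            }
        }
    },
}
```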

Add RawContact so it aggregates to an existing contact

Submitted by 耗尽温柔 on 2019-11-30 12:04:07

I am trying to add a new RawContact to an existing Contact so that my custom data field shows up inside the original Contact. I tried adding a StructuredName data row to my new RawContact with a DisplayName that matches the DisplayName of the original RawContact. I thought matching DisplayNames would be enough to aggregate both RawContacts, but the Contacts app seems to display the two RawContacts as different Contacts. Here is my code:

```
public static void addContact(Context context, Account account, String number, String displayname) {
    Log.e(Global.TAG, "adding contact: " + number + " / " + displayname);
    …
```

Pandas - possible to aggregate two columns using two different aggregations?

Submitted by 时间秒杀一切 on 2019-11-30 10:56:47

I'm loading a CSV file which has the following columns: date, textA, textB, numberA, numberB. I want to group by the columns date, textA, and textB, but apply "sum" to numberA and "min" to numberB.

```
data = pd.read_table("file.csv", sep=",", thousands=',')
grouped = data.groupby(["date", "textA", "textB"], as_index=False)
```

…but I cannot see how to then apply two different aggregate functions to two different columns, i.e. sum(numberA), min(numberB).

The agg method can accept a dict, in which case the keys indicate the column to which the function is applied:

```
grouped.agg({'numberA': …
```
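The truncated answer can be completed along these lines; a self-contained sketch with made-up sample data standing in for `file.csv`:

```python
import pandas as pd

# Small stand-in for the CSV (column names from the question).
data = pd.DataFrame({
    "date":    ["2019-01-01", "2019-01-01", "2019-01-02"],
    "textA":   ["x", "x", "y"],
    "textB":   ["p", "p", "q"],
    "numberA": [1, 2, 3],
    "numberB": [10, 5, 7],
})

grouped = data.groupby(["date", "textA", "textB"], as_index=False)

# One dict entry per column: sum numberA, min numberB.
result = grouped.agg({"numberA": "sum", "numberB": "min"})
```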

Pandas: aggregate when column contains numpy arrays

Submitted by 蹲街弑〆低调 on 2019-11-30 09:15:21

I'm using a pandas DataFrame in which one column contains numpy arrays. When trying to sum that column via aggregation I get an error stating 'Must produce aggregated value'. E.g.:

```
import pandas as pd
import numpy as np

DF = pd.DataFrame([[1, np.array([10,20,30])],
                   [1, np.array([40,50,60])],
                   [2, np.array([20,30,40])]],
                  columns=['category', 'arraydata'])
```

This works the way I would expect it to:

```
DF.groupby('category').agg(sum)
```

output:

```
           arraydata
category
1         [50 70 90]
2         [20 30 40]
```

However, since my real data frame has multiple numeric columns, arraydata is not chosen as the default column to aggregate …
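One way to sidestep the 'Must produce aggregated value' check is to skip `agg` entirely and reduce each group directly; a sketch where iterating the groupby keeps full control over which column gets summed:

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame(
    [[1, np.array([10, 20, 30])],
     [1, np.array([40, 50, 60])],
     [2, np.array([20, 30, 40])]],
    columns=["category", "arraydata"],
)

# Stack each group's arrays into a 2-D array and sum along axis 0;
# iterating over the groupby avoids agg's scalar-result requirement.
summed = {
    cat: np.sum(np.stack(group["arraydata"].to_list()), axis=0)
    for cat, group in DF.groupby("category")
}
```

The result is a plain dict of category to summed array, which can be wrapped back into a Series with `pd.Series(summed)` if needed.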

Sorting after aggregation in Elasticsearch

Submitted by 为君一笑 on 2019-11-30 05:18:55

I have docs with this structure:

```
{
  FIELD1: string,
  FIELD2: [ {SUBFIELD: number}, {SUBFIELD: number}, ... ]
}
```

I want to sort on the result of the sum of the numbers in FIELD2.SUBFIELD:

```
GET myindex/_search
{
  "size": 0,
  "aggs": {
    "a1": {
      "terms": { "field": "FIELD1", "size": 0 },
      "aggs": {
        "a2": { "sum": { "field": "FIELD2.SUBFIELD" } }
      }
    }
  }
}
```

If I do this I obtain unsorted buckets, but I want the buckets sorted by the "a2" value. How can I do this? Thank you!

You almost had it. You just need to add an order property to your a1 terms aggregation, like this:

```
GET myindex/_search
{
  "size": 0,
  "aggs": {
    "a1": {
      "terms": …
```
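The completed request from the answer, written out as a Python dict for the request body (note that `"size": 0` inside a terms aggregation meant "return all buckets" in older Elasticsearch versions; newer versions reject it):

```python
# Terms aggregation whose buckets are ordered by the sibling sum agg "a2".
search_body = {
    "size": 0,
    "aggs": {
        "a1": {
            "terms": {
                "field": "FIELD1",
                "size": 0,
                "order": {"a2": "desc"},  # sort buckets by the a2 value
            },
            "aggs": {
                "a2": {"sum": {"field": "FIELD2.SUBFIELD"}}
            },
        }
    },
}
```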

Can't aggregate arrays

Submitted by 蹲街弑〆低调 on 2019-11-30 05:05:12

Question: I can create an array of arrays:

```
select array[array[1, 2], array[3, 4]];

     array
---------------
 {{1,2},{3,4}}
```

But I can't aggregate arrays:

```
select array_agg(array[c1, c2])
from (values (1, 2), (3, 4)) s(c1, c2);

ERROR:  could not find array type for data type integer[]
```

What am I missing?

Answer 1: I use:

```
CREATE AGGREGATE array_agg_mult(anyarray) (
    SFUNC = array_cat,
    STYPE = anyarray,
    INITCOND = '{}'
);
```

and queries like:

```
SELECT array_agg_mult( ARRAY[[x,x]] )
FROM generate_series(1,10) x;
```

Note that …

Django Aggregation: Sum return value only?

Submitted by 廉价感情. on 2019-11-30 04:54:29

I have a list of paid values and want to display the total paid. I have used aggregation with Sum to add the values together. The problem is, I just want the total value printed out, but the aggregation prints out:

```
{'amount__sum': 480.0}
```

(480.0 being the total value.) In my view, I have:

```
from django.db.models import Sum
total_paid = Payment.objects.all.aggregate(Sum('amount'))
```

And to show the value on the page, I have a Mako template with the following:

```
<p><strong>Total Paid:</strong> ${total_paid}</p>
```

How would I get it to show 480.0 instead of {'amount__sum': 480.0}? I don't believe …
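Django's `aggregate()` returns a plain dict keyed `<field>__<function>`, so the fix is to index into it. (The question's `Payment.objects.all.aggregate(...)` is also missing the parentheses on `all()`, and `.all()` can be dropped entirely before `aggregate()`.) A minimal sketch of the dict-access pattern, with the ORM call shown only in comments since it needs a live database:

```python
# In the view the call would be:
#   total_paid = Payment.objects.aggregate(Sum('amount'))['amount__sum']
# Below, a literal dict stands in for what aggregate() returns.
aggregate_result = {"amount__sum": 480.0}

# Pull out just the number before handing it to the template:
total_paid = aggregate_result["amount__sum"]
```

The template then renders `${total_paid}` as 480.0 instead of the whole dict.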

Fast melted data.table operations

Submitted by 微笑、不失礼 on 2019-11-30 03:32:41

Question: I am looking for patterns for manipulating data.table objects whose structure resembles that of data frames created with melt from the reshape2 package. I am dealing with data tables with millions of rows, so performance is critical.

The generalized form of the question is whether there is a way to perform grouping based on a subset of the values in a column, and have the result of the grouping operation create one or more new columns. A specific form of the question could be how to use data.table to …
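The question is about R's data.table, but the generalized pattern it describes (group on a subset of a melted column's values, then spread the results into new columns) can be sketched in pandas for comparison; all column names here are hypothetical:

```python
import pandas as pd

# A melted frame: one row per (id, variable) pair.
melted = pd.DataFrame({
    "id":       [1, 1, 1, 2, 2, 2],
    "variable": ["a", "b", "c", "a", "b", "c"],
    "value":    [10, 20, 99, 30, 40, 99],
})

# Group on only a subset of the variable values ("a" and "b"),
# then spread the aggregated results into one new column per variable.
subset = melted[melted["variable"].isin(["a", "b"])]
wide = subset.pivot_table(index="id", columns="variable",
                          values="value", aggfunc="sum")
```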

Git diff on topic branch, excluding merge commits that happened in the meantime?

Submitted by 邮差的信 on 2019-11-29 17:27:03

Question: Let's say I have the following situation:

```
    B---D---F---G    topic
   /       /
--A---C---E          master
```

For code-review purposes, I would like to pull out a diff from commit A to commit G, but not including commits E and C, which happened on the master branch, and also not including commit F, which is a merge commit. In other words, I would like to generate a diff that contains the changes from F to G and aggregate those with the changes from A to D. In other-other words, I want the review diff to contain only my …
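One common way to get a review view with only the topic branch's own work is `git log -p --no-merges master..topic`, which prints the patches of B, D, and G and skips the merge commit F along with everything that came in from master (as separate per-commit patches rather than one combined diff). A reproducible sketch in a throwaway repo mirroring the diagram, using `--format=%s` instead of `-p` to keep the output short:

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email reviewer@example.com && git config user.name Reviewer
main=$(git symbolic-ref --short HEAD)   # 'master' or 'main' depending on git version

echo base > f && git add f && git commit -qm A
git checkout -qb topic
echo b >> f && git commit -aqm B
echo d >> f && git commit -aqm D
git checkout -q "$main"
echo c > m.txt && git add m.txt && git commit -qm C
echo e >> m.txt && git commit -aqm E
git checkout -q topic
git merge -q --no-edit "$main" > /dev/null   # merge commit F
echo g >> f && git commit -aqm G

# Only the topic branch's own (non-merge) commits survive the filter.
subjects=$(git log --no-merges --format=%s "$main"..topic | xargs)
echo "$subjects"
```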