How to compute summary statistics on a Cassandra table with a Spark DataFrame?

Posted by 强颜欢笑 on 2020-01-22 03:58:12

Question


I'm trying to get the min, max, and mean of some Cassandra/Spark data, but I need to do it in Java.

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;

DataFrame df = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("table",  "someTable")
        .option("keyspace", "someKeyspace")
        .load();

df.groupBy(col("keyColumn"))
        .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
        .show();

EDITED to show the working version: make sure to put double quotes (") around someTable and someKeyspace.


Answer 1:


Just import your data as a DataFrame and apply required aggregations:

import org.apache.spark.sql.DataFrame;
import static org.apache.spark.sql.functions.*;

DataFrame df = sqlContext.read()
        .format("org.apache.spark.sql.cassandra")
        .option("table", someTable)
        .option("keyspace", someKeyspace)
        .load();

df.groupBy(col("keyColumn"))
        .agg(min("valueColumn"), max("valueColumn"), avg("valueColumn"))
        .show();

where someTable and someKeyspace are String variables holding the table name and the keyspace name, respectively.
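
If it helps to see what the groupBy/agg call above computes, here is a minimal plain-Java sketch of the same per-key min, max, and average over an in-memory list. It does not use Spark or the Cassandra connector at all; the names GroupStats, Reading, and summarize are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java illustration only; none of these names come from Spark or the connector.
public class GroupStats {

    // Hypothetical row type standing in for (keyColumn, valueColumn).
    static final class Reading {
        final String key;
        final double value;

        Reading(String key, double value) {
            this.key = key;
            this.value = value;
        }
    }

    // Groups rows by key and collects min, max, and average per group in one pass.
    static Map<String, DoubleSummaryStatistics> summarize(List<Reading> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                (Reading r) -> r.key,
                TreeMap::new,
                Collectors.summarizingDouble((Reading r) -> r.value)));
    }

    public static void main(String[] args) {
        // Made-up sample data, just to exercise the method.
        List<Reading> rows = Arrays.asList(
                new Reading("a", 1.0),
                new Reading("a", 3.0),
                new Reading("b", 5.0));

        // One line per key, analogous to one row per group in show().
        summarize(rows).forEach((key, s) -> System.out.printf(
                "%s min=%.1f max=%.1f avg=%.1f%n",
                key, s.getMin(), s.getMax(), s.getAverage()));
    }
}
```

The Spark version does the same grouping, but distributed across the cluster and reading straight from the Cassandra table, so for real data the DataFrame code above is the one to use.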




Answer 2:


I suggest checking out https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector-demos which contains demos in both Scala and the equivalent Java.

You can also check out http://spark.apache.org/documentation.html which has many examples that you can flip between Scala, Java, and Python versions.

I'm almost 100% certain that between those two links, you'll find exactly what you're looking for.

If there's anything you're having trouble with after that, feel free to update your question with a more specific error/problem.




Answer 3:


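In general,

compile the Scala file: $ scalac Main.scala

inspect the resulting Main.class file from the Java side: $ javap Main

Note that javap prints the Java-visible declarations of the class, not full Java source; recovering method bodies requires a decompiler.

More information is available at the following URL: http://alvinalexander.com/scala/scala-class-to-decompiled-java-source-code-classes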



Source: https://stackoverflow.com/questions/35273798/how-to-compute-summary-statistic-on-cassandra-table-with-spark-dataframe
