Comparison between different methods of executing SQL queries on Cassandra Column Families using Spark


Question


As part of my project, I have to create a SQL query interface for a very large Cassandra dataset. I have therefore been looking at different ways of executing SQL queries on Cassandra column families using Spark, and I have come up with three different methods:

  1. using Spark SQLContext with a statically defined schema

    // statically defined in the application
    public static class TableTuple implements Serializable {
        private int id;
        private String line;
    
        TableTuple (int i, String l) {
            id = i;
            line = l;
        }
    
        // getters and setters
        ...
    }
    

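    For completeness: since createDataFrame(rdd, TableTuple.class) infers the schema through JavaBean reflection, the elided accessors have to be public getters (and setters), along these lines:

        // JavaBean accessors required by Spark's bean-based schema inference
        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public String getLine() { return line; }
        public void setLine(String line) { this.line = line; }
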
    I then consume this definition as follows:

    SparkConf conf = new SparkConf(true)
            .set("spark.cassandra.connection.host", CASSANDRA_HOST)
            .setJars(jars);
    
    SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
    SQLContext sqlContext = new SQLContext(sc);
    
    // javaFunctions is statically imported from com.datastax.spark.connector.japi.CassandraJavaUtil
    JavaRDD<CassandraRow> rowrdd = javaFunctions(sc).cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
    JavaRDD<TableTuple> rdd = rowrdd.map(row -> new TableTuple(row.getInt(0), row.getString(1)));
    
    DataFrame dataFrame = sqlContext.createDataFrame(rdd, TableTuple.class);
    dataFrame.registerTempTable("lines");
    
    DataFrame resultsFrame = sqlContext.sql("Select line from lines where id=1");
    
    System.out.println(Arrays.asList(resultsFrame.collect()));
    
  2. using Spark SQLContext with a dynamically defined schema

    SparkConf conf = new SparkConf(true)
            .set("spark.cassandra.connection.host", CASSANDRA_HOST)
            .setJars(jars);
    
    SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
    SQLContext sqlContext = new SQLContext(sc);
    
    JavaRDD<CassandraRow> cassandraRdd = javaFunctions(sc).cassandraTable(CASSANDRA_KEYSPACE, CASSANDRA_COLUMN_FAMILY);
    JavaRDD<Row> rdd = cassandraRdd.map(row -> RowFactory.create(row.getInt(0), row.getString(1)));
    
    List<StructField> fields = new ArrayList<>();
    fields.add(DataTypes.createStructField("id", DataTypes.IntegerType, true));
    fields.add(DataTypes.createStructField("line", DataTypes.StringType, true));
    StructType schema = DataTypes.createStructType(fields);
    
    DataFrame dataFrame = sqlContext.createDataFrame(rdd, schema);
    dataFrame.registerTempTable("lines");
    
    DataFrame resultDataFrame = sqlContext.sql("select line from lines where id = 1");
    
    System.out.println(Arrays.asList(resultDataFrame.collect()));
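
    The name/type pairs above are hardcoded only to keep the example short; the point of this variant is that the StructType could equally be assembled from metadata known only at runtime. A sketch of what I mean (the columns map stands in for whatever metadata source is available and is hypothetical):

    // build the schema from runtime metadata instead of hardcoding the fields
    Map<String, DataType> columns = new LinkedHashMap<>();   // java.util
    columns.put("id", DataTypes.IntegerType);
    columns.put("line", DataTypes.StringType);

    List<StructField> fields = columns.entrySet().stream()
            .map(e -> DataTypes.createStructField(e.getKey(), e.getValue(), true))
            .collect(Collectors.toList());                   // java.util.stream.Collectors
    StructType schema = DataTypes.createStructType(fields);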
    
  3. using CassandraSQLContext from the spark-cassandra-connector

    SparkConf conf = new SparkConf(true)
            .set("spark.cassandra.connection.host", CASSANDRA_HOST)
            .setJars(jars);
    
    SparkContext sc = new SparkContext(HOST, APP_NAME, conf);
    
    // CassandraSQLContext (from the connector) resolves keyspace.table names directly against Cassandra
    CassandraSQLContext sqlContext = new CassandraSQLContext(sc);
    DataFrame resultsFrame = sqlContext.sql("Select line from " + CASSANDRA_KEYSPACE + "." + CASSANDRA_COLUMN_FAMILY + " where id = 1");
    
    System.out.println(Arrays.asList(resultsFrame.collect()));
    

I would like to know the advantages and disadvantages of each method over the others. Also, for the CassandraSQLContext method, are queries limited to CQL, or is it fully compatible with Spark SQL? I would also like an analysis for my specific use case: I have a Cassandra column family with ~17.6 million tuples and 62 columns. For querying such a large dataset, which method is the most adequate?
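For example, would CassandraSQLContext accept a construct that Spark SQL supports but plain CQL (as of Cassandra 2.x) does not, such as the aggregation below? (This is a hypothetical test query of mine, not something taken from the connector's documentation.)

    // GROUP BY exists in Spark SQL but not in CQL, so this should reveal
    // whether queries go through Spark's SQL engine or are restricted to CQL
    DataFrame grouped = sqlContext.sql(
            "SELECT id, count(*) AS cnt FROM " + CASSANDRA_KEYSPACE + "." + CASSANDRA_COLUMN_FAMILY
            + " GROUP BY id");
    System.out.println(Arrays.asList(grouped.collect()));
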

Source: https://stackoverflow.com/questions/30978125/comparison-between-different-methods-of-executing-sql-queries-on-cassandra-colum
