Spark Datastax Java API Select statements

Submitted anonymously (unverified) on 2019-12-03 01:18:02

Question:

I'm following the tutorial in this GitHub repo to run Spark on Cassandra from a Java Maven project: https://github.com/datastax/spark-cassandra-connector.

I've figured out how to use direct CQL statements, as I previously asked about here: Querying Data in Cassandra via Spark in a Java Maven Project

Now, however, I'm trying to use the DataStax Java API, out of concern that the code from my original question won't work with DataStax's version of Spark and Cassandra. For some reason it won't let me use .where, even though the documentation outlines that exact usage. Here is my code:

import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import java.io.Serializable;

import static com.datastax.spark.connector.CassandraJavaUtil.*;

public class App implements Serializable {

    // Firstly, we define a bean class mapped to the Cassandra table
    public static class Person implements Serializable {
        private Integer id;
        private String fname;
        private String lname;
        private String role;

        // Remember to declare a no-args constructor
        public Person() { }

        public Integer getId() { return id; }
        public void setId(Integer id) { this.id = id; }

        public String getfname() { return fname; }
        public void setfname(String fname) { this.fname = fname; }

        public String getlname() { return lname; }
        public void setlname(String lname) { this.lname = lname; }

        public String getrole() { return role; }
        public void setrole(String role) { this.role = role; }

        // other methods, constructors, etc.
    }

    private transient SparkConf conf;

    private App(SparkConf conf) {
        this.conf = conf;
    }

    private void run() {
        JavaSparkContext sc = new JavaSparkContext(conf);
        createSchema(sc);
        sc.stop();
    }

    private void createSchema(JavaSparkContext sc) {
        JavaRDD<String> rdd = javaFunctions(sc)
                .cassandraTable("tester", "empbyrole", Person.class)
                .where("role=?", "IT Engineer")
                .map(new Function<Person, String>() {
                    @Override
                    public String call(Person person) throws Exception {
                        return person.toString();
                    }
                });
        System.out.println("Data as Person beans: \n" + StringUtils.join(rdd.toArray(), "\n"));
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Syntax: com.datastax.spark.demo.JavaDemo <Spark Master URL> <Cassandra contact point>");
            System.exit(1);
        }

        SparkConf conf = new SparkConf();
        conf.setAppName("Java API demo");
        conf.setMaster(args[0]);
        conf.set("spark.cassandra.connection.host", args[1]);

        App app = new App(conf);
        app.run();
    }
}

Here is the error:

14/09/23 13:46:53 ERROR executor.Executor: Exception in task ID 0
java.io.IOException: Exception during preparation of SELECT "role", "id", "fname", "lname" FROM "tester"."empbyrole" WHERE token("role") > -5709068081826432029 AND token("role") <= ...

I know that my error is specifically at this section:

JavaRDD<String> rdd = javaFunctions(sc)
        .cassandraTable("tester", "empbyrole", Person.class)
        .where("role=?", "IT Engineer")
        .map(new Function<Person, String>() {
            @Override
            public String call(Person person) throws Exception {
                return person.toString();
            }
        });

When I remove the .where(), it works. But the GitHub documentation specifically says you should be able to chain the .where and .map functions. Does anyone know the reason for this, or a solution? Thanks.

Edit: the error goes away when I use this statement instead:

JavaRDD<String> rdd = javaFunctions(sc)
        .cassandraTable("tester", "empbyrole", Person.class)
        .where("id=?", "1")
        .map(new Function<Person, String>() {
            @Override
            public String call(Person person) throws Exception {
                return person.toString();
            }
        });

I have no idea why this variation works but the others don't. Here are the CQL statements I ran, so you know what my keyspace looks like:

    session.execute("DROP KEYSPACE IF EXISTS tester");     session.execute("CREATE KEYSPACE tester WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");     session.execute("CREATE TABLE tester.emp (id INT PRIMARY KEY, fname TEXT, lname TEXT, role TEXT)");     session.execute("CREATE TABLE tester.empByRole (id INT, fname TEXT, lname TEXT, role TEXT, PRIMARY KEY (role,id))");     session.execute("CREATE TABLE tester.dept (id INT PRIMARY KEY, dname TEXT)");             session.execute(               "INSERT INTO tester.emp (id, fname, lname, role) " +               "VALUES (" +                   "0001," +                   "'Angel'," +                   "'Pay'," +                   "'IT Engineer'" +                   ");");     session.execute(               "INSERT INTO tester.emp (id, fname, lname, role) " +               "VALUES (" +                   "0002," +                   "'John'," +                   "'Doe'," +                   "'IT Engineer'" +                   ");");     session.execute(               "INSERT INTO tester.emp (id, fname, lname, role) " +               "VALUES (" +                   "0003," +                   "'Jane'," +                   "'Doe'," +                   "'IT Analyst'" +                   ");");     session.execute(           "INSERT INTO tester.empByRole (id, fname, lname, role) " +           "VALUES (" +               "0001," +               "'Angel'," +               "'Pay'," +               "'IT Engineer'" +               ");");     session.execute(               "INSERT INTO tester.empByRole (id, fname, lname, role) " +               "VALUES (" +                   "0002," +                   "'John'," +                   "'Doe'," +                   "'IT Engineer'" +                   ");");     session.execute(               "INSERT INTO tester.empByRole (id, fname, lname, role) " +               "VALUES (" +                   "0003," +                   "'Jane'," +                   "'Doe'," +                   "'IT Analyst'" +                   ");");         session.execute(               "INSERT INTO tester.dept (id, dname) " +               "VALUES (" +                   "1553," +                   "'Commerce'" +                   ");"); 

Answer 1:

The where method adds ALLOW FILTERING to your query under the covers. This is not a magic bullet, as it still doesn't support arbitrary fields as query predicates. In general, the field must either be indexed or a clustering column. If this isn't practical for your data model, you can simply use the filter method on the RDD. The downside is that the filter takes place in Spark and not in Cassandra.
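Here is a minimal sketch of that filter fallback, reusing the Person bean and the Function import from the question (an illustration, not code from the original post). The predicate runs inside Spark after the rows are read, so the whole table is still fetched from Cassandra:

JavaRDD<Person> engineers = javaFunctions(sc)
        .cassandraTable("tester", "empbyrole", Person.class)
        // no .where(): the full table is scanned, and the predicate
        // is evaluated in Spark instead of being pushed down to Cassandra
        .filter(new Function<Person, Boolean>() {
            @Override
            public Boolean call(Person person) throws Exception {
                return "IT Engineer".equals(person.getrole());
            }
        });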

So the id field works because, in empByRole, it's a clustering column and therefore usable in a CQL WHERE clause, whereas role is the partition key rather than an indexed or clustering column. Please note that I am NOT suggesting you index the field or change it to a clustering column, as I don't know your data model.



Answer 2:

There is a limitation in the Spark Cassandra Connector: the where method does not work on partition keys. In your table empByRole, role is the partition key, hence the error. where works correctly on clustering columns and on columns with a secondary index.
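The error in the question shows the mechanics: for each Spark task, the connector brackets the scan with token() bounds on the partition key, so an extra equality predicate on role collides with those bounds. A rough reconstruction of the generated CQL (the token values are placeholders):

SELECT "role", "id", "fname", "lname"
FROM "tester"."empbyrole"
WHERE token("role") > ?      -- token-range bound added by the connector
  AND token("role") <= ?     --   (one range per Spark task)
  AND role = ?               -- the .where() predicate: a second, conflicting
                             --   restriction on the same partition key
ALLOW FILTERING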

This limitation is tracked as issue 37 in the project's GitHub repository, and work on it is ongoing.

On the Java API documentation page, the examples use .where("name=?", "Anna"). Presumably name is not a partition key in that schema, but the example could be clearer about that.


