apache-spark

ValueError: as_list() is not defined on an unknown TensorShape

纵然是瞬间 submitted on 2021-01-29 05:20:20
Question: I am working from the example on this web page, and here is what I got after these steps: jobs_train, jobs_test = jobs_df.randomSplit([0.6, 0.4]) >>> zuckerberg_train, zuckerberg_test = zuckerberg_df.randomSplit([0.6, 0.4]) >>> train_df = jobs_train.unionAll(zuckerberg_train) >>> test_df = jobs_test.unionAll(zuckerberg_test) >>> from pyspark.ml.classification import LogisticRegression >>> from pyspark.ml import Pipeline >>> from sparkdl import DeepImageFeaturizer >>> featurizer = DeepImageFeaturizer(inputCol= …
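
The excerpt cuts off inside the DeepImageFeaturizer call, so the exact arguments are unknown. Below is a minimal sketch of how this featurizer-plus-logistic-regression pipeline typically looks in the tutorial being followed; the "image" column, the InceptionV3 model name, and the regression parameters are assumptions, not the asker's actual values.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml import Pipeline
    from sparkdl import DeepImageFeaturizer

    # Assumed arguments: the original question is truncated at DeepImageFeaturizer(inputCol=
    featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                     modelName="InceptionV3")
    lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3,
                            labelCol="label")
    pipeline = Pipeline(stages=[featurizer, lr])

    model = pipeline.fit(train_df)          # train_df built by the unionAll calls above
    predictions = model.transform(test_df)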

How can I get Spark on emr-5.2.1 to write to DynamoDB?

浪子不回头ぞ submitted on 2021-01-29 03:16:29
Question: According to this article here, when I create an AWS EMR cluster that will use Spark to pipe data to DynamoDB, I need to preface with the line: spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar This line appears in numerous references, including from the Amazon devs themselves. However, when I run create-cluster with an added --jars flag, I get this error: Exception in thread "main" java.io.FileNotFoundException: File file:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar does not …
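
For context, that jar is expected to be present on the EMR nodes themselves, so one hedged alternative to passing it on the spark-shell or create-cluster command line is to attach it when the session is built in code, via the spark.jars configuration key. A minimal sketch, assuming the path quoted in the question actually exists on the nodes:

    from pyspark.sql import SparkSession

    # Assumption: this is the path quoted in the question; it must exist on the EMR nodes.
    ddb_jar = "/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar"

    spark = (SparkSession.builder
             .appName("dynamodb-export")          # hypothetical app name
             .config("spark.jars", ddb_jar)       # code-level equivalent of --jars
             .getOrCreate())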

How can I get an external table's JDBC URL in SnappyData?

爱⌒轻易说出口 submitted on 2021-01-29 01:37:37
Question: Previously I created an external table in SnappyData like this: create external table EXT_DIM_CITY using jdbc options(url 'jdbc:mysql://***:5002/***?user=***&password=***', driver 'com.mysql.jdbc.Driver', dbtable 'dim_city'); but now I have forgotten the MySQL JDBC URL that EXT_DIM_CITY referred to. How can I get the JDBC URL from SnappyData? Answer 1: With the latest SnappyData release 1.0.2.1, all table properties can be seen with extended describe: describe extended EXT_DIM_CITY The properties will be …
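
As a small illustration, the same extended describe can also be issued through the SQL API of a Snappy/Spark session. This is a sketch under the assumption that a session object is already connected to the cluster; the table name is the one from the question.

    # Assumes `spark` is an existing SnappySession/SparkSession connected to the cluster.
    props = spark.sql("DESCRIBE EXTENDED EXT_DIM_CITY")
    props.show(truncate=False)   # the JDBC URL should appear among the listed table properties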

Add parent column name as prefix to avoid ambiguity

≯℡__Kan透↙ submitted on 2021-01-28 21:59:16
Question: Check the code below. It generates a DataFrame with ambiguous column names if duplicate keys are present. How should we modify the code to add the parent column name as a prefix? Another column with JSON data has been added. scala> val df = Seq( (77, "email1", """{"key1":38,"key3":39}""","""{"name":"aaa","age":10}"""), (78, "email2", """{"key1":38,"key4":39}""","""{"name":"bbb","age":20}"""), (178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }""","""{"name":"ccc","age":30}"""), (179, …
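
The Scala listing is truncated, but the core idea the question is after, prefixing each parsed JSON field with its parent column name, can be sketched independently. The following PySpark sketch uses made-up schemas and only the first sample row; it illustrates the aliasing technique and is not the asker's code.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("prefix-demo").getOrCreate()

    df = spark.createDataFrame(
        [(77, "email1", '{"key1":38,"key3":39}', '{"name":"aaa","age":10}')],
        ["id", "email", "json1", "json2"])

    # Assumed schemas for the two JSON columns (the real ones are not shown in the excerpt).
    schema1 = StructType([StructField("key1", IntegerType()), StructField("key3", IntegerType())])
    schema2 = StructType([StructField("name", StringType()), StructField("age", IntegerType())])

    parsed = (df.withColumn("j1", F.from_json("json1", schema1))
                .withColumn("j2", F.from_json("json2", schema2)))

    # Alias every nested field as <parent>_<field> so duplicate keys stay unambiguous.
    prefixed = parsed.select(
        "id", "email",
        *[F.col("j1." + f.name).alias("json1_" + f.name) for f in schema1.fields],
        *[F.col("j2." + f.name).alias("json2_" + f.name) for f in schema2.fields])
    prefixed.show()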

Count of all elements less than the value in a row

余生颓废 submitted on 2021-01-28 21:14:33
Question: Given a dataframe with a single column value containing the rows 0.3, 0.2, 0.7, 0.5, is there a way to build a column that contains, for each row, the count of the values in that column that are less than or equal to the row's value? Specifically, the expected output (value → count_less_equal) is 0.3 → 2, 0.2 → 1, 0.7 → 4, 0.5 → 3. I could groupBy the value column, but I don't know how to filter all values that are less than that value. I was thinking, maybe it's possible to duplicate the first column, then create a filter so that for each …
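
One way to get that count, sketched here as a possible approach rather than the asker's eventual solution, is a self-join that keeps, for each row, all rows whose value is less than or equal to it, and then counts per value:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("count-less-equal").getOrCreate()
    df = spark.createDataFrame([(0.3,), (0.2,), (0.7,), (0.5,)], ["value"])

    # Self-join: for each row a, keep every row b with b.value <= a.value, then count.
    counts = (df.alias("a")
                .join(df.alias("b"), F.col("b.value") <= F.col("a.value"))
                .groupBy(F.col("a.value").alias("value"))
                .agg(F.count("*").alias("count_less_equal")))
    counts.show()   # 0.3 -> 2, 0.2 -> 1, 0.7 -> 4, 0.5 -> 3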

Airflow ModuleNotFoundError: No module named 'pyspark'

╄→гoц情女王★ submitted on 2021-01-28 21:12:13
Question: I installed Airflow on my machine, which works well, and I also have a local Spark installation (which is operational too). I want to use Airflow to orchestrate two Spark tasks: task_spark_datatransform >> task_spark_model_reco. The two PySpark modules associated with these two tasks are tested and work well under Spark. I also created a very simple Airflow DAG using a BashOperator to run each Spark task. For example, for the task task_spark_datatransform I have: task_spark_datatransform = BashOperator(task …
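
The excerpt stops inside the BashOperator call. Below is a minimal sketch of what such a task often looks like when the script is launched through spark-submit, which lets the Spark installation supply pyspark instead of the Airflow worker's Python environment; the DAG id and script path are hypothetical placeholders, not taken from the question.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator   # Airflow 1.x-style import

    with DAG(dag_id="spark_reco_pipeline",              # hypothetical DAG id
             start_date=datetime(2021, 1, 1),
             schedule_interval=None) as dag:

        task_spark_datatransform = BashOperator(
            task_id="task_spark_datatransform",
            bash_command="spark-submit /path/to/datatransform.py",   # hypothetical path
        )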

What is the difference between partitioning and bucketing in Spark?

半腔热情 submitted on 2021-01-28 20:14:16
Question: I am trying to optimize a join query between two Spark DataFrames, let's call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M), so I broadcast it among the nodes of the Spark cluster. df2 is very large (200M rows), so I tried to bucket/repartition it by "SaleId". In Spark, what is the difference between partitioning the data by column and bucketing the data by column? For example: partition: df2 = df2.repartition(10, "SaleId") bucket: df2.write.format('parquet').bucketBy(10, …
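
For reference, here is a short sketch of the two operations being compared. The sort column, table name, and saveAsTable call are additions made to complete the bucketed write, since bucketBy only takes effect when writing to a table; they are not part of the original question.

    # In-memory partitioning: reshuffles df2 into 10 partitions keyed by SaleId,
    # which only lasts for the lifetime of this job.
    df2 = df2.repartition(10, "SaleId")

    # Bucketing: persists df2 into 10 buckets on SaleId so that later joins on
    # SaleId can avoid a full shuffle. The table name is a hypothetical placeholder.
    (df2.write
        .format("parquet")
        .bucketBy(10, "SaleId")
        .sortBy("SaleId")
        .saveAsTable("df2_bucketed"))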

Spark Streaming - java.lang.NoSuchMethodError

我的梦境 submitted on 2021-01-28 20:00:30
Question: I am trying to access streaming tweets with Spark Streaming. This is the software configuration: Ubuntu 14.04.2 LTS; scala -version: Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL; spark-submit --version: Spark version 1.6.0. The following is the code: object PrintTweets { def main(args: Array[String]) { // Configure Twitter credentials using twitter.txt setupTwitter() // Set up a Spark streaming context named "PrintTweets" that runs locally using // all CPU cores and one …