apache-spark

ValueError: as_list() is not defined on an unknown TensorShape

纵然是瞬间 submitted on 2021-01-29 05:20:20
Question: I am working from the example on this web page, and here is what I got after these steps: jobs_train, jobs_test = jobs_df.randomSplit([0.6, 0.4]) >>> zuckerberg_train, zuckerberg_test = zuckerberg_df.randomSplit([0.6, 0.4]) >>> train_df = jobs_train.unionAll(zuckerberg_train) >>> test_df = jobs_test.unionAll(zuckerberg_test) >>> from pyspark.ml.classification import LogisticRegression >>> from pyspark.ml import Pipeline >>> from sparkdl import DeepImageFeaturizer >>> featurizer = DeepImageFeaturizer(inputCol= …
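
The excerpt cuts off inside the DeepImageFeaturizer call, so the exact arguments are unknown. Below is a minimal sketch of how this featurizer-plus-logistic-regression pipeline typically looks in the tutorial being followed; the "image" column, the InceptionV3 model name, and the regression parameters are assumptions, not the asker's actual values.

    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml import Pipeline
    from sparkdl import DeepImageFeaturizer

    # Assumed arguments: the original question is truncated at DeepImageFeaturizer(inputCol=
    featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                     modelName="InceptionV3")
    lr = LogisticRegression(maxIter=20, regParam=0.05, elasticNetParam=0.3,
                            labelCol="label")
    pipeline = Pipeline(stages=[featurizer, lr])

    model = pipeline.fit(train_df)          # train_df built by the unionAll calls above
    predictions = model.transform(test_df)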

How can I get Spark on emr-5.2.1 to write to DynamoDB?

浪子不回头ぞ submitted on 2021-01-29 03:16:29
Question: According to this article here, when I create an AWS EMR cluster that will use Spark to pipe data to DynamoDB, I need to preface with the line: spark-shell --jars /usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar This line appears in numerous references, including from the Amazon devs themselves. However, when I run create-cluster with an added --jars flag, I get this error: Exception in thread "main" java.io.FileNotFoundException: File file:/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar does not …
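
For context, that jar is expected to be present on the EMR nodes themselves, so one hedged alternative to passing it on the spark-shell or create-cluster command line is to attach it when the session is built in code, via the spark.jars configuration key. A minimal sketch, assuming the path quoted in the question actually exists on the nodes:

    from pyspark.sql import SparkSession

    # Assumption: this is the path quoted in the question; it must exist on the EMR nodes.
    ddb_jar = "/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar"

    spark = (SparkSession.builder
             .appName("dynamodb-export")          # hypothetical app name
             .config("spark.jars", ddb_jar)       # code-level equivalent of --jars
             .getOrCreate())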

How can I get an external table's JDBC URL in SnappyData?

爱⌒轻易说出口 submitted on 2021-01-29 01:37:37
Question: Previously I created an external table in SnappyData like this: create external table EXT_DIM_CITY using jdbc options(url 'jdbc:mysql://***:5002/***?user=***&password=***', driver 'com.mysql.jdbc.Driver', dbtable 'dim_city'); but now I have forgotten the MySQL JDBC URL that EXT_DIM_CITY referred to. How can I get the JDBC URL from SnappyData? Answer 1: With the latest SnappyData release 1.0.2.1, all table properties can be seen with extended describe: describe extended EXT_DIM_CITY The properties will be …
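
As a small illustration, the same extended describe can also be issued through the SQL API of a Snappy/Spark session. This is a sketch under the assumption that a session object is already connected to the cluster; the table name is the one from the question.

    # Assumes `spark` is an existing SnappySession/SparkSession connected to the cluster.
    props = spark.sql("DESCRIBE EXTENDED EXT_DIM_CITY")
    props.show(truncate=False)   # the JDBC URL should appear among the listed table properties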

Add parent column name as prefix to avoid ambiguity

≯℡__Kan透↙ submitted on 2021-01-28 21:59:16
Question: Check the code below. It generates a DataFrame with ambiguous column names if duplicate keys are present. How should we modify the code to add the parent column name as a prefix? Another column with JSON data has been added. scala> val df = Seq( (77, "email1", """{"key1":38,"key3":39}""","""{"name":"aaa","age":10}"""), (78, "email2", """{"key1":38,"key4":39}""","""{"name":"bbb","age":20}"""), (178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }""","""{"name":"ccc","age":30}"""), (179, …
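
The Scala listing is truncated, but the core idea the question is after, prefixing each parsed JSON field with its parent column name, can be sketched independently. The following PySpark sketch uses made-up schemas and only the first sample row; it illustrates the aliasing technique and is not the asker's code.

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.appName("prefix-demo").getOrCreate()

    df = spark.createDataFrame(
        [(77, "email1", '{"key1":38,"key3":39}', '{"name":"aaa","age":10}')],
        ["id", "email", "json1", "json2"])

    # Assumed schemas for the two JSON columns (the real ones are not shown in the excerpt).
    schema1 = StructType([StructField("key1", IntegerType()), StructField("key3", IntegerType())])
    schema2 = StructType([StructField("name", StringType()), StructField("age", IntegerType())])

    parsed = (df.withColumn("j1", F.from_json("json1", schema1))
                .withColumn("j2", F.from_json("json2", schema2)))

    # Alias every nested field as <parent>_<field> so duplicate keys stay unambiguous.
    prefixed = parsed.select(
        "id", "email",
        *[F.col("j1." + f.name).alias("json1_" + f.name) for f in schema1.fields],
        *[F.col("j2." + f.name).alias("json2_" + f.name) for f in schema2.fields])
    prefixed.show()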

Count of all elements less than the value in a row

余生颓废 submitted on 2021-01-28 21:14:33
Question: Given a dataframe with a single column value containing the rows 0.3, 0.2, 0.7, 0.5, is there a way to build a column that contains, for each row, the count of the values in that column that are less than or equal to the row's value? Specifically, the expected output (value → count_less_equal) is 0.3 → 2, 0.2 → 1, 0.7 → 4, 0.5 → 3. I could groupBy the value column, but I don't know how to filter all values that are less than that value. I was thinking, maybe it's possible to duplicate the first column, then create a filter so that for each …
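
One way to get that count, sketched here as a possible approach rather than the asker's eventual solution, is a self-join that keeps, for each row, all rows whose value is less than or equal to it, and then counts per value:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("count-less-equal").getOrCreate()
    df = spark.createDataFrame([(0.3,), (0.2,), (0.7,), (0.5,)], ["value"])

    # Self-join: for each row a, keep every row b with b.value <= a.value, then count.
    counts = (df.alias("a")
                .join(df.alias("b"), F.col("b.value") <= F.col("a.value"))
                .groupBy(F.col("a.value").alias("value"))
                .agg(F.count("*").alias("count_less_equal")))
    counts.show()   # 0.3 -> 2, 0.2 -> 1, 0.7 -> 4, 0.5 -> 3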

Airflow ModuleNotFoundError: No module named 'pyspark'

╄→гoц情女王★ submitted on 2021-01-28 21:12:13
Question: I installed Airflow on my machine, which works well, and I also have a local Spark installation (which is operational too). I want to use Airflow to orchestrate two Spark tasks: task_spark_datatransform >> task_spark_model_reco. The two PySpark modules associated with these two tasks are tested and work well under Spark. I also created a very simple Airflow DAG using a BashOperator to run each Spark task. For example, for the task task_spark_datatransform I have: task_spark_datatransform = BashOperator(task …
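
The excerpt stops inside the BashOperator call. Below is a minimal sketch of what such a task often looks like when the script is launched through spark-submit, which lets the Spark installation supply pyspark instead of the Airflow worker's Python environment; the DAG id and script path are hypothetical placeholders, not taken from the question.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator   # Airflow 1.x-style import

    with DAG(dag_id="spark_reco_pipeline",              # hypothetical DAG id
             start_date=datetime(2021, 1, 1),
             schedule_interval=None) as dag:

        task_spark_datatransform = BashOperator(
            task_id="task_spark_datatransform",
            bash_command="spark-submit /path/to/datatransform.py",   # hypothetical path
        )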

What is the difference between partitioning and bucketing in Spark?

半腔热情 submitted on 2021-01-28 20:14:16
Question: I am trying to optimize a join query between two Spark DataFrames, let's call them df1 and df2 (joined on the common column "SaleId"). df1 is very small (5M), so I broadcast it among the nodes of the Spark cluster. df2 is very large (200M rows), so I tried to bucket/repartition it by "SaleId". In Spark, what is the difference between partitioning the data by column and bucketing the data by column? For example: partition: df2 = df2.repartition(10, "SaleId") bucket: df2.write.format('parquet').bucketBy(10, …
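
For reference, here is a short sketch of the two operations being compared. The sort column, table name, and saveAsTable call are additions made to complete the bucketed write, since bucketBy only takes effect when writing to a table; they are not part of the original question.

    # In-memory partitioning: reshuffles df2 into 10 partitions keyed by SaleId,
    # which only lasts for the lifetime of this job.
    df2 = df2.repartition(10, "SaleId")

    # Bucketing: persists df2 into 10 buckets on SaleId so that later joins on
    # SaleId can avoid a full shuffle. The table name is a hypothetical placeholder.
    (df2.write
        .format("parquet")
        .bucketBy(10, "SaleId")
        .sortBy("SaleId")
        .saveAsTable("df2_bucketed"))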

Spark Streaming - java.lang.NoSuchMethodError

我的梦境 submitted on 2021-01-28 20:00:30
Question: I am trying to access streaming tweets with Spark Streaming. This is the software configuration: Ubuntu 14.04.2 LTS; scala -version: Scala code runner version 2.11.7 -- Copyright 2002-2013, LAMP/EPFL; spark-submit --version: Spark version 1.6.0. The following is the code: object PrintTweets { def main(args: Array[String]) { // Configure Twitter credentials using twitter.txt setupTwitter() // Set up a Spark streaming context named "PrintTweets" that runs locally using // all CPU cores and one …