apache-spark-sql

Spark & Scala: saveAsTextFile() exception

Submitted by 二次信任 on 2021-02-07 03:31:41
Question: I'm new to Spark and Scala, and I get an exception after calling saveAsTextFile(). I hope someone can help. Here is my input.txt:

Hello World, I'm a programmer
Hello World, I'm a programmer

This is the output after running "spark-shell" from CMD:

C:\Users\Nhan Tran>spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DLap:4040
Spark context available as 'sc' (master = local[

Spark dataframe add new column with random data

Submitted by 淺唱寂寞╮ on 2021-02-06 16:07:05
Question: I want to add a new column to the dataframe whose values are either 0 or 1. I used the 'randint' function:

from random import randint
df1 = df.withColumn('isVal', randint(0, 1))

But I get the following error:

/spark/python/pyspark/sql/dataframe.py", line 1313, in withColumn
    assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column

How can I use a custom function, or the randint function, to generate a random value for the column?

Answer 1: You are using the Python builtin

Limiting maximum size of dataframe partition

Submitted by 梦想的初衷 on 2021-02-06 15:47:46
Question: When I write a dataframe out to, say, CSV, one .csv file is created per partition. Suppose I want to limit the maximum size of each file to, say, 1 MB. I could perform the write multiple times, increasing the argument to repartition each time. Is there a way to calculate ahead of time what argument to pass to repartition so that the maximum size of each file stays below some specified size? I imagine there might be pathological cases where all the data ends up on one partition. So make the
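One pre-computation heuristic, sketched below, is to divide an estimated total output size by the per-file cap. This is an assumption-laden sketch, not a guarantee: as the question anticipates, skewed partitioning can still push one partition past the cap, and compressed on-disk size differs from the estimate.

```python
import math

def repartition_count(total_output_bytes, max_file_bytes, safety_factor=1.2):
    """Return a repartition() argument aiming to keep each output file under
    max_file_bytes. safety_factor pads the estimate for mild skew; severe
    skew can still push a single partition past the cap."""
    padded = total_output_bytes * safety_factor
    return max(1, math.ceil(padded / max_file_bytes))

# e.g. an estimated 10 MB of CSV output with a 1 MB per-file cap:
parts = repartition_count(10 * 1024 ** 2, 1024 ** 2)  # 12 partitions
```

On Spark 2.2 and later, df.write.option("maxRecordsPerFile", n) caps file size directly at write time, though it bounds the record count per file rather than bytes.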

convert dataframe to libsvm format

Submitted by 心不动则不痛 on 2021-02-06 11:11:59
Question: I have a dataframe resulting from a SQL query:

df1 = sqlContext.sql("select * from table_test")

I need to convert this dataframe to libsvm format so that it can be provided as input for pyspark.ml.classification.LogisticRegression. I tried the following, but it resulted in the error below because I'm using Spark 1.5.2:

df1.write.format("libsvm").save("data/foo")

Failed to load class for data source: libsvm

I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and
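Since the "libsvm" data source is not available in Spark 1.5.2, one workaround (a sketch under assumptions; the answer to this question is truncated above) is to format each row as a libsvm text line yourself and write the lines out. The libsvm convention is a label followed by space-separated 1-based index:value pairs, with zero-valued features omitted:

```python
def to_libsvm_line(label, features):
    """Format one (label, dense feature vector) row as a libsvm line:
    '<label> <idx>:<val> ...' with 1-based ascending indices, zeros dropped."""
    pairs = [f"{i + 1}:{v:g}" for i, v in enumerate(features) if v != 0]
    return " ".join([f"{label:g}"] + pairs)

line = to_libsvm_line(1.0, [0.5, 0.0, 2.0])  # "1 1:0.5 3:2"
```

Each formatted line could then be written with rdd.map(...).saveAsTextFile(...); on the mllib side, MLUtils.saveAsLibSVMFile over an RDD of LabeledPoint rows produces the same format.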
