apache-spark-sql

Spark & Scala: saveAsTextFile() exception

Submitted by 二次信任 on 2021-02-07 03:31:41
Question: I'm new to Spark and Scala, and I get an exception after calling saveAsTextFile(). I hope someone can help. Here is my input.txt:

Hello World, I'm a programmer
Hello World, I'm a programmer

This is the output after running "spark-shell" from CMD:

C:\Users\Nhan Tran>spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://DLap:4040
Spark context available as 'sc' (master = local[

Spark dataframe add new column with random data

Submitted by 淺唱寂寞╮ on 2021-02-06 16:07:05
Question: I want to add a new column to the dataframe whose values are either 0 or 1. I used the 'randint' function:

from random import randint
df1 = df.withColumn('isVal', randint(0, 1))

But I get the following error:

/spark/python/pyspark/sql/dataframe.py", line 1313, in withColumn
    assert isinstance(col, Column), "col should be Column"
AssertionError: col should be Column

How can I use a custom function, or the randint function, to generate a random value for the column?

Answer 1: You are using the Python builtin

Limiting maximum size of dataframe partition

Submitted by 梦想的初衷 on 2021-02-06 15:47:46
Question: When I write a dataframe out to, say, CSV, one .csv file is created per partition. Suppose I want to limit the maximum size of each file to, say, 1 MB. I could perform the write multiple times, increasing the argument to repartition each time. Is there a way to calculate ahead of time what argument to pass to repartition so that the maximum size of each file stays below some specified size? I imagine there might be pathological cases where all the data ends up on one partition. So make the
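One pre-computation heuristic, sketched below, is to divide an estimated total output size by the per-file cap. This is an assumption-laden sketch, not a guarantee: as the question anticipates, skewed partitioning can still push one partition past the cap, and compressed on-disk size differs from the estimate.

```python
import math

def repartition_count(total_output_bytes, max_file_bytes, safety_factor=1.2):
    """Return a repartition() argument aiming to keep each output file under
    max_file_bytes. safety_factor pads the estimate for mild skew; severe
    skew can still push a single partition past the cap."""
    padded = total_output_bytes * safety_factor
    return max(1, math.ceil(padded / max_file_bytes))

# e.g. an estimated 10 MB of CSV output with a 1 MB per-file cap:
parts = repartition_count(10 * 1024 ** 2, 1024 ** 2)  # 12 partitions
```

On Spark 2.2 and later, df.write.option("maxRecordsPerFile", n) caps file size directly at write time, though it bounds the record count per file rather than bytes.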

convert dataframe to libsvm format

Submitted by 心不动则不痛 on 2021-02-06 11:11:59
Question: I have a dataframe resulting from a SQL query:

df1 = sqlContext.sql("select * from table_test")

I need to convert this dataframe to libsvm format so that it can be provided as input for pyspark.ml.classification.LogisticRegression. I tried the following, but it resulted in the error below because I'm using Spark 1.5.2:

df1.write.format("libsvm").save("data/foo")

Failed to load class for data source: libsvm

I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and
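Since the "libsvm" data source is not available in Spark 1.5.2, one workaround (a sketch under assumptions; the answer to this question is truncated above) is to format each row as a libsvm text line yourself and write the lines out. The libsvm convention is a label followed by space-separated 1-based index:value pairs, with zero-valued features omitted:

```python
def to_libsvm_line(label, features):
    """Format one (label, dense feature vector) row as a libsvm line:
    '<label> <idx>:<val> ...' with 1-based ascending indices, zeros dropped."""
    pairs = [f"{i + 1}:{v:g}" for i, v in enumerate(features) if v != 0]
    return " ".join([f"{label:g}"] + pairs)

line = to_libsvm_line(1.0, [0.5, 0.0, 2.0])  # "1 1:0.5 3:2"
```

Each formatted line could then be written with rdd.map(...).saveAsTextFile(...); on the mllib side, MLUtils.saveAsLibSVMFile over an RDD of LabeledPoint rows produces the same format.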
