spark-dataframe

When to use Spark DataFrame/Dataset API and when to use plain RDD?

百般思念 submitted on 2019-12-19 17:13:27
Question: The Spark SQL DataFrame/Dataset execution engine has several extremely efficient time and space optimizations (e.g. InternalRow and expression codegen). According to much of the documentation, it seems to be a better option than RDD for most distributed algorithms. However, I did some source-code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a large amount of memory. But executing an algorithm may not be any faster, aside from evaluating predefined expressions.
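
A minimal sketch of the trade-off being asked about (the toy data and names are hypothetical): the same aggregation written against the DataFrame API, where Catalyst and Tungsten can optimize it, and against the RDD API, where the lambda is opaque to the optimizer but arbitrary logic is easy to express.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object DfVsRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-vs-rdd").master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq(("a", 1), ("a", 3), ("b", 2))

    // DataFrame version: the aggregation is expressed as Catalyst expressions,
    // runs on Tungsten's compact InternalRow format, and benefits from codegen.
    val dfSums = data.toDF("key", "value").groupBy("key").agg(sum("value"))

    // RDD version: plain JVM objects and an opaque lambda. No expression
    // optimization, but arbitrary algorithmic logic is easy to write.
    val rddSums = spark.sparkContext.parallelize(data).reduceByKey(_ + _)

    dfSums.show()
    rddSums.collect().foreach(println)
    spark.stop()
  }
}
```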

How to write into PostgreSQL hstore using Spark Dataset

折月煮酒 submitted on 2019-12-19 08:31:31
Question: I'm trying to write a Spark Dataset into an existing PostgreSQL table (I can't change the table metadata, such as column types). One of the columns of this table is of type HStore, and it's causing trouble. I see the following exception when I launch the write (here the original map is empty, which when escaped gives an empty string): Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO part_d3da09549b713bbdcd95eb6095f929c8 (.., "my_hstore_column", ..) VALUES (..,'',..) was aborted.
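
One workaround that is often suggested for this situation (a sketch, not necessarily the asker's final fix): serialize the Map column into PostgreSQL's hstore literal syntax yourself and add stringtype=unspecified to the JDBC URL so the server casts the text value to hstore. The table name, credentials, and toy data below are hypothetical.

```scala
import java.util.Properties

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object HstoreWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hstore-write").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical data: one Map column destined for the hstore column.
    val df = Seq((1, Map("k" -> "v")), (2, Map.empty[String, String]))
      .toDF("id", "my_hstore_column")

    // Serialize the Map into PostgreSQL's hstore literal syntax:
    // Map("a" -> "b") becomes "a"=>"b"; an empty map becomes an empty string.
    val toHstore = udf { m: Map[String, String] =>
      m.map { case (k, v) => s""""$k"=>"$v"""" }.mkString(",")
    }

    val props = new Properties()
    props.setProperty("user", "postgres")
    props.setProperty("password", "secret")

    df.withColumn("my_hstore_column", toHstore(col("my_hstore_column")))
      .write
      .mode(SaveMode.Append)
      // stringtype=unspecified makes the PostgreSQL driver send the value as an
      // untyped literal, so the server can cast it to hstore itself.
      .jdbc("jdbc:postgresql://host:5432/db?stringtype=unspecified", "my_table", props)

    spark.stop()
  }
}
```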

Converting RDD[org.apache.spark.sql.Row] to RDD[org.apache.spark.mllib.linalg.Vector]

こ雲淡風輕ζ submitted on 2019-12-19 05:47:38
Question: I am relatively new to Spark and Scala. I am starting with the following DataFrame (a single column made of a dense Vector of Doubles):

scala> val scaledDataOnly_pruned = scaledDataOnly.select("features")
scaledDataOnly_pruned: org.apache.spark.sql.DataFrame = [features: vector]

scala> scaledDataOnly_pruned.show(5)
+--------------------+
|            features|
+--------------------+
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
|[-0.0948337274182...|
+--------------------+
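
A minimal sketch of the conversion, assuming the features column really holds org.apache.spark.mllib.linalg.Vector values (on Spark 2.x ML pipelines the runtime class would be org.apache.spark.ml.linalg.Vector instead):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Pull the Vector back out of each Row by pattern matching on the Row...
def toVectorRDD(df: DataFrame): RDD[Vector] =
  df.select("features").rdd.map {
    case Row(v: Vector) => v
  }

// ...or equivalently with the typed getter.
def toVectorRDD2(df: DataFrame): RDD[Vector] =
  df.rdd.map(_.getAs[Vector]("features"))
```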

Spark job restarted after showing all jobs completed and then fails (TimeoutException: Futures timed out after [300 seconds])

≯℡__Kan透↙ submitted on 2019-12-18 17:29:21
Question: I'm running a Spark job. It shows that all of the jobs were completed; however, after a couple of minutes the entire job restarts. This time it again shows that all jobs and tasks were completed, but after a couple of minutes it fails. I found this exception in the logs: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds] This happens when I'm trying to join 2 pretty big tables: one of 3B rows, and the second of 200M rows, when I run show(100) on the resulting
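
The 300-second figure matches the default spark.sql.broadcastTimeout, so one common mitigation (a hedged sketch, not a diagnosis of this particular job) is to raise that timeout or to disable automatic broadcast joins entirely for tables of this size. The table and column names below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object JoinTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("big-join").getOrCreate()

    // Give broadcast exchanges more than the default 300 seconds...
    spark.conf.set("spark.sql.broadcastTimeout", "1200")

    // ...or stop Spark from trying to broadcast one side of the join at all,
    // which is usually safer when both tables have hundreds of millions of rows;
    // the join then falls back to a sort-merge join.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    val big   = spark.table("table_3b_rows")
    val small = spark.table("table_200m_rows")
    big.join(small, Seq("id")).show(100)
  }
}
```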

Spark cosine distance between rows using Dataframe

◇◆丶佛笑我妖孽 submitted on 2019-12-18 17:25:49
Question: I have to compute the cosine distance between each pair of rows, but I have no idea how to do it elegantly using the Spark DataFrame API. The idea is to compute similarities for each row (item) and take the top 10 by comparing the similarities between rows; this is needed for an item-item recommender system. Everything I've read about it refers to computing similarity over columns (Apache Spark Python Cosine Similarity over DataFrames). Can someone say whether it is possible to compute a cosine distance
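
A hedged sketch of one DataFrame-only approach: L2-normalize the item vectors, self cross-join, and compute the dot product, which equals cosine similarity after normalization. The toy data and column names are hypothetical, and the O(n^2) cross join is only practical for modest item counts.

```scala
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object RowCosineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("row-cosine").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical item vectors.
    val items = Seq(
      (1L, Vectors.dense(1.0, 0.0, 2.0)),
      (2L, Vectors.dense(0.5, 1.0, 0.0)),
      (3L, Vectors.dense(1.0, 0.1, 1.9))
    ).toDF("id", "features")

    // After L2 normalization, cosine similarity reduces to a dot product.
    val normed = new Normalizer()
      .setInputCol("features").setOutputCol("norm").setP(2.0)
      .transform(items)

    val dot = udf((a: Vector, b: Vector) =>
      a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum)

    val left  = normed.select($"id".as("i"), $"norm".as("va"))
    val right = normed.select($"id".as("j"), $"norm".as("vb"))

    val sims = left.crossJoin(right)
      .where($"i" =!= $"j")
      .withColumn("cosine", dot($"va", $"vb"))

    // Keep the 10 most similar items per row.
    val top10 = sims
      .withColumn("rank", row_number().over(Window.partitionBy($"i").orderBy($"cosine".desc)))
      .where($"rank" <= 10)

    top10.select("i", "j", "cosine").show()
    spark.stop()
  }
}
```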

Run Spark as a Java web application

烂漫一生 submitted on 2019-12-18 17:08:33
Question: I have used Spark ML and was able to get reasonable prediction accuracy for my business problem. The data is not huge, and I was able to transform the input (basically a CSV file) using Stanford NLP and run Naive Bayes for prediction on my local machine. I want to run this prediction service as a simple Java main program or within a simple MVC web application. Currently I run my prediction using the spark-submit command. Instead, can I create a Spark context and data frames from my
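
A minimal sketch of running prediction without spark-submit: build a local-mode SparkSession inside an ordinary JVM main (or a web controller) and load a previously saved pipeline, with the Spark jars simply on the application classpath. The model path and input column below are hypothetical.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession

object EmbeddedPredictionService {
  def main(args: Array[String]): Unit = {
    // A local-mode SparkSession created inside a plain JVM process,
    // so no spark-submit is needed.
    val spark = SparkSession.builder()
      .appName("prediction-service")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical path to a pipeline saved earlier with model.save(...).
    val model = PipelineModel.load("models/naive-bayes-pipeline")

    // In a web application this would be driven by a controller handling requests;
    // here a single hard-coded record stands in for an incoming payload.
    val request = Seq("some input text to classify").toDF("text")
    model.transform(request).select("prediction").show()

    spark.stop()
  }
}
```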

Scala: Spark SQL to_date(unix_timestamp) returning NULL

别说谁变了你拦得住时间么 submitted on 2019-12-18 17:02:46
Question: Spark version: spark-2.0.1-bin-hadoop2.7, Scala: 2.11.8. I am loading a raw CSV into a DataFrame. In the CSV, although the column is supposed to be in date format, the values are written as 20161025 instead of 2016-10-25. The parameter date_format contains a string of column names that need to be converted to yyyy-mm-dd format. In the following code, I first load the date column of the CSV as StringType via the schema, and then I check whether date_format is not empty, that is, whether there are columns that need to
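
The usual cause of NULLs in this situation is a pattern string that does not match the raw value: 20161025 needs "yyyyMMdd" (capital MM is month; lowercase mm means minutes). A small sketch of the conversion, with a hypothetical column name:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DateParseSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("to-date").master("local[*]").getOrCreate()
    import spark.implicits._

    // Strings as they would arrive from the CSV, loaded as StringType.
    val raw = Seq("20161025", "20161101").toDF("order_date")

    val parsed = raw.withColumn(
      "order_date",
      // unix_timestamp needs the pattern of the *input* string ("yyyyMMdd");
      // a mismatched pattern silently yields NULL.
      to_date(unix_timestamp($"order_date", "yyyyMMdd").cast("timestamp"))
    )

    parsed.show()          // 2016-10-25, 2016-11-01
    parsed.printSchema()   // order_date: date
    spark.stop()
  }
}
```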

How can I write a parquet file using Spark (pyspark)?

蹲街弑〆低调 submitted on 2019-12-18 12:57:09
Question: I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'

from pyspark import SparkContext
sc = SparkContext("local", "Protob Conversion to Parquet ")
# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")
# Displays the content
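
The AttributeError comes from sc.textFile returning an RDD of strings rather than a DataFrame. A hedged PySpark sketch (the CSV path is taken from the question; the read options and output path are assumptions) that goes through SparkSession so that .write.parquet is available:

```python
from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession instead of a bare SparkContext;
# spark.read returns a DataFrame, which does have a .write attribute.
spark = SparkSession.builder \
    .appName("Protob Conversion to Parquet") \
    .getOrCreate()

# spark.read.csv yields a DataFrame; sc.textFile would yield an RDD of strings.
df = spark.read.csv("/temp/proto_temp.csv", header=True, inferSchema=True)

# Write the DataFrame out as parquet (hypothetical output path).
df.write.parquet("/temp/proto_temp.parquet")
```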

Spark-SQL : How to read a TSV or CSV file into dataframe and apply a custom schema?

我只是一个虾纸丫 submitted on 2019-12-18 12:08:28
Question: I'm using Spark 2.0 while working with tab-separated value (TSV) and comma-separated value (CSV) files. I want to load the data into Spark SQL DataFrames, where I would like to control the schema completely when the files are read; I don't want Spark to guess the schema from the data in the file. How would I load TSV or CSV files into Spark SQL DataFrames and apply a schema to them? Answer 1: Below is a complete Spark 2.0 example of loading a tab-separated value (TSV) file and applying a schema. I
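
The answer is cut off above; a minimal Spark 2.0 Scala sketch of the approach it describes (the file path, column names, and types below are hypothetical) would look like:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object TsvWithSchema {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tsv-custom-schema").master("local[*]").getOrCreate()

    // Declare the schema up front instead of letting Spark infer it.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("score", DoubleType, nullable = true)
    ))

    val df = spark.read
      .option("sep", "\t")        // "," for CSV, "\t" for TSV
      .option("header", "true")
      .schema(schema)             // custom schema, no inference pass over the data
      .csv("/data/input.tsv")

    df.printSchema()
    df.show(5)
    spark.stop()
  }
}
```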