spark-dataframe

How to convert DataFrame to RDD in Scala?

Submitted by Deadly on 2019-12-20 09:27:04

Question: Can someone please share how one can convert a dataframe to an RDD?

Answer 1: Simply: val rows: RDD[Row] = df.rdd

Answer 2: Use df.map(row => ...) to convert the dataframe to an RDD if you want to map a row to a different RDD element. For example, df.map(row => (row(1), row(2))) gives you a paired RDD where the first column of the df is the key and the second column of the df is the value.

Answer 3: I was just looking for my answer and found this post. Jean's answer is absolutely correct; adding on that "df…
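A minimal sketch consolidating the two answers above, assuming an existing DataFrame named df (note that Row indices are 0-based, so row(0) is the first column; the (row(1), row(2)) pairing is kept purely as an illustration):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Answer 1: the underlying RDD of Rows.
val rows: RDD[Row] = df.rdd

// Answer 2: reshape each Row into something else, here a key/value pair
// built from two of its columns (0-based indices, i.e. columns 2 and 3).
val pairs: RDD[(Any, Any)] = df.rdd.map(row => (row(1), row(2)))
```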

Compute size of Spark dataframe - SizeEstimator gives unexpected results

Submitted by 走远了吗. on 2019-12-20 09:20:43

Question: I am trying to find a reliable way to compute the size (in bytes) of a Spark dataframe programmatically. The reason is that I would like to have a method to compute an "optimal" number of partitions ("optimal" could mean different things here: it could mean having an optimal partition size, or resulting in an optimal file size when writing to Parquet tables - but both can be assumed to be some linear function of the dataframe size). In other words, I would like to call coalesce(n) or…
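One commonly suggested alternative to SizeEstimator (which measures the driver-side object graph rather than the data) is to read the size estimate that the Catalyst optimizer keeps for the plan. A sketch, assuming Spark 2.3+ (the accessor is stats(conf) on older 2.x versions) and an arbitrary 128 MB target partition size:

```scala
import org.apache.spark.sql.DataFrame

// The optimizer's estimate of the plan's output size, in bytes.
def estimatedSizeInBytes(df: DataFrame): BigInt =
  df.queryExecution.optimizedPlan.stats.sizeInBytes

// Derive a partition count from a target partition size (hypothetical helper).
def suggestedPartitions(df: DataFrame, targetPartitionBytes: Long = 128L * 1024 * 1024): Int =
  math.max(1, (estimatedSizeInBytes(df) / targetPartitionBytes).toInt)

// Usage: df.coalesce(suggestedPartitions(df)).write.parquet("/tmp/out")
```

Keep in mind that sizeInBytes is itself only an estimate (for sources without statistics it can fall back to a very large default), so treat the result as a heuristic rather than an exact byte count.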

Spark dataframes convert nested JSON to separate columns

Submitted by 谁都会走 on 2019-12-20 07:16:24

Question: I have a stream of JSONs with the following structure that gets converted to a dataframe:

{ "a": 3936, "b": 123, "c": "34", "attributes": { "d": "146", "e": "12", "f": "23" } }

The dataframe show function results in the following output:

sqlContext.read.json(jsonRDD).show

+----+-----------+---+---+
|   a| attributes|  b|  c|
+----+-----------+---+---+
|3936|[146,12,23]|123| 34|
+----+-----------+---+---+

How can I split the attributes column (nested JSON structure) into attributes.d, attributes.e and attributes.f…
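Nested struct fields can be pulled out by selecting them with dotted paths. A short sketch in the spirit of the question (sqlContext and jsonRDD are the question's own names; the underscore aliases are hypothetical):

```scala
val df = sqlContext.read.json(jsonRDD)

// Select the top-level columns plus each field of the `attributes` struct.
val flattened = df.select(
  df("a"), df("b"), df("c"),
  df("attributes.d").as("attributes_d"),
  df("attributes.e").as("attributes_e"),
  df("attributes.f").as("attributes_f")
)

flattened.show()

// Alternatively, df.select("a", "b", "c", "attributes.*") expands every
// field of the struct in one go.
```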

How to view Random Forest statistics in Spark (scala)

Submitted by 浪子不回头ぞ on 2019-12-20 06:06:37

Question: I have a RandomForestClassifierModel in Spark. Using .toDebugString() outputs the following:

Tree 0 (weight 1.0):
  If (feature 0 in {1.0,2.0,3.0})
    If (feature 3 in {2.0,3.0})
      If (feature 8 <= 55.3)
      . .
  Else (feature 0 not in {1.0,2.0,3.0})
  . .
Tree 1 (weight 1.0):
  . .
...etc

I'd like to view the actual data as it goes through the model, something like:

Tree 0 (weight 1.0):
  If (feature 0 in {1.0,2.0,3.0}) 60%
    If (feature 3 in {2.0,3.0}) 57%
      If (feature 8 <= 55.3) 22%
      . .
  Else (feature 0 not in {1…
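Spark's toDebugString does not report per-node data fractions, so the sketch below is only a possible workaround, not a built-in API: it walks each tree's public node structure (ml, DataFrame-based API assumed) and counts how many rows of a locally collected sample of feature vectors reach each node. The names model and sample are assumptions.

```scala
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tree.{CategoricalSplit, ContinuousSplit, InternalNode, LeafNode, Node, Split}

// Evaluate a split condition against a single feature vector.
def goesLeft(split: Split, features: Vector): Boolean = split match {
  case c: ContinuousSplit  => features(c.featureIndex) <= c.threshold
  case c: CategoricalSplit => c.leftCategories.contains(features(c.featureIndex))
}

// Human-readable description of a split, roughly matching toDebugString.
def describe(split: Split): String = split match {
  case c: ContinuousSplit  => s"feature ${c.featureIndex} <= ${c.threshold}"
  case c: CategoricalSplit => s"feature ${c.featureIndex} in ${c.leftCategories.mkString("{", ",", "}")}"
}

// Print each node with the percentage of the sample that reaches it.
def printFractions(node: Node, rows: Array[Vector], total: Long, indent: String = ""): Unit = {
  val pct = if (total == 0) 0.0 else 100.0 * rows.length / total
  node match {
    case n: InternalNode =>
      println(f"${indent}If (${describe(n.split)}) $pct%.1f%%")
      val (left, right) = rows.partition(v => goesLeft(n.split, v))
      printFractions(n.leftChild, left, total, indent + "  ")
      printFractions(n.rightChild, right, total, indent + "  ")
    case l: LeafNode =>
      println(f"${indent}Predict ${l.prediction} $pct%.1f%%")
  }
}

// Usage (hypothetical variable names):
// val sample: Array[Vector] = trainingDF.select("features").collect().map(_.getAs[Vector](0))
// model.trees.zipWithIndex.foreach { case (tree, i) =>
//   println(s"Tree $i:")
//   printFractions(tree.rootNode, sample, sample.length)
// }
```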

Spark Dataframe of WrappedArray to Dataframe[Vector]

Submitted by 橙三吉。 on 2019-12-20 05:18:22

Question: I have a Spark Dataframe df with the following schema:

root
 |-- features: array (nullable = true)
 |    |-- element: double (containsNull = false)

I would like to create a new Dataframe where each row will be a Vector of Doubles, and I expect to get the following schema:

root
 |-- features: vector (nullable = true)

So far I have the following piece of code (influenced by this post: Converting Spark Dataframe(with WrappedArray) to RDD[labelPoint] in scala), but I fear something is wrong with it…
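A common way to do this without going through an RDD is a small UDF that wraps the array in an ml Vector. A sketch, assuming the array column is named features as in the schema above:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}

// Wrap the array<double> column in a dense ml Vector.
val toVector = udf((xs: Seq[Double]) => Vectors.dense(xs.toArray))

val dfVec = df.withColumn("features", toVector(col("features")))
dfVec.printSchema()
// root
//  |-- features: vector (nullable = true)
```

On Spark 3.1+ the helper org.apache.spark.ml.functions.array_to_vector does the same thing without a hand-written UDF.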

Spark Exception Complex types not supported while loading parquet

Submitted by 霸气de小男生 on 2019-12-20 04:41:16

Question: I am trying to load a Parquet file in Spark as a dataframe:

val df = spark.read.parquet(path)

I am getting:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 12, 10.250.2.32): java.lang.UnsupportedOperationException: Complex types not supported.

While going through the code, I realized there is a check in Spark's VectorizedParquetRecordReader.java (initializeInternal):

Type t = requestedSchema…
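A workaround that is often suggested for this error: the check that throws lives in the vectorized Parquet reader, so disabling that reader makes Spark fall back to the row-based Parquet path, which handles nested types (at some cost in scan performance). A sketch, with path taken from the question:

```scala
// Disable the vectorized Parquet reader for this session, then retry the read.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

val df = spark.read.parquet(path)
df.printSchema()
```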

Pyspark - ValueError: could not convert string to float / invalid literal for float()

Submitted by 一曲冷凌霜 on 2019-12-20 04:38:19

Question: I am trying to use data from a Spark dataframe as the input for my k-means model. However, I keep getting errors (see the section after the code). My Spark dataframe looks like this (and has around 1M rows):

ID  col1  col2  Latitude  Longitude
13  ...   ...   22.2      13.5
62  ...   ...   21.4      13.8
24  ...   ...   21.8      14.1
71  ...   ...   28.9      18.0
..  ...   ...   ....      ....

Here is my code:

from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
df = spark.read.csv("file.csv")
spark_rdd = df.rdd.map…
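Errors like this usually come from spark.read.csv reading every column as a string unless a schema (or inferSchema) is supplied, so the latitude/longitude values are still strings when they reach the model. A sketch of the usual fix, written in Scala to match the other examples in this section (in PySpark the equivalent is col("Latitude").cast("double")); the file name, column names, header option and k value are assumptions:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col

// Cast the numeric columns before assembling them into a feature vector.
val df = spark.read.option("header", "true").csv("file.csv")
  .withColumn("Latitude", col("Latitude").cast("double"))
  .withColumn("Longitude", col("Longitude").cast("double"))

val assembler = new VectorAssembler()
  .setInputCols(Array("Latitude", "Longitude"))
  .setOutputCol("features")

val kmeans = new KMeans().setK(4).setSeed(1L)  // k = 4 is an arbitrary choice
val model = kmeans.fit(assembler.transform(df))
```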

How to pass multiple statements into Spark SQL HiveContext

Submitted by 蓝咒 on 2019-12-19 19:11:43

Question: For example, I have a few Hive HQL statements which I want to pass into Spark SQL:

set parquet.compression=SNAPPY;
create table MY_TABLE stored as parquet as select * from ANOTHER_TABLE;
select * from MY_TABLE limit 5;

The following doesn't work:

hiveContext.sql("set parquet.compression=SNAPPY; create table MY_TABLE stored as parquet as select * from ANOTHER_TABLE; select * from MY_TABLE limit 5;")

How can I pass the statements into Spark SQL?

Answer 1: Thank you to @SamsonScharfrichter for the answer. This…
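hiveContext.sql (like spark.sql) executes exactly one statement per call, so the usual approach is to split the script and issue the statements one at a time. A minimal sketch reusing the question's hiveContext (note that a naive split on ';' breaks if a statement contains a literal semicolon):

```scala
val script =
  """set parquet.compression=SNAPPY;
    |create table MY_TABLE stored as parquet as select * from ANOTHER_TABLE;
    |select * from MY_TABLE limit 5""".stripMargin

// Run each non-empty statement in order; the last one returns the rows we want.
val results = script.split(";").map(_.trim).filter(_.nonEmpty).map(hiveContext.sql)
results.last.show()
```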

问题 For example I have few Hive HQL statements which I want to pass into Spark SQL: set parquet.compression=SNAPPY; create table MY_TABLE stored as parquet as select * from ANOTHER_TABLE; select * from MY_TABLE limit 5; Following doesn't work: hiveContext.sql("set parquet.compression=SNAPPY; create table MY_TABLE stored as parquet as select * from ANOTHER_TABLE; select * from MY_TABLE limit 5;") How to pass the statements into Spark SQL? 回答1: Thank you to @SamsonScharfrichter for the answer. This