spark-dataframe

How to select an exact number of random rows from DataFrame

狂风中的少年 posted on 2019-12-25 09:18:07
Question: How can I efficiently select an exact number of random rows from a DataFrame? The data contains an index column that can be used. If I have to obtain the maximum size, which is more efficient: count(), or max() on the index column? Answer 1: A possible approach is to calculate the number of rows using .count(), then use sample() from Python's random library to generate a random sequence of the desired length from this range. Lastly, use the resulting list of numbers, vals, to subset your index column. import …
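The answer's code is cut off above; the sketch below is a rough Scala rendering of the same idea (scala.util.Random stands in for Python's random.sample, and a DataFrame df with a 0-based index column named "index" is assumed purely for illustration):

import scala.util.Random
import org.apache.spark.sql.functions.col

val n = 10                                                   // exact number of rows wanted
val total = df.count().toInt                                 // total row count drives the index range
val picked = Random.shuffle((0 until total).toList).take(n)  // n distinct random indices (built on the driver)
val sampled = df.filter(col("index").isin(picked: _*))       // keep only the chosen rows
sampled.show()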

DataFrame Object is not showing any data

坚强是说给别人听的谎言 posted on 2019-12-25 09:17:05
Question: I was trying to create a DataFrame object on an HDFS file using the spark-csv library, as shown in this tutorial. But when I tried to get the count of the DataFrame object, it showed 0. Here is what my file looks like:

employee.csv:
empid,empname
1000,Tom
2000,Jerry

I loaded the above file using:

val empDf = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",",").load("hdfs:///user/.../employee.csv");

When I queried the empDf object, printSchema() gave …
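For reference, a sketch of the same load with a few immediate checks (the HDFS path is the elided one from the question, and sqlContext is assumed to be the usual SQLContext):

val empDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // treat the first line (empid,empname) as column names
  .option("delimiter", ",")
  .load("hdfs:///user/.../employee.csv")

empDf.printSchema()              // should list empid and empname
empDf.show()                     // shows which rows were actually parsed
println(empDf.count())           // 2 is expected for the two data rows above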

Spark Dataframe is saved to MongoDB in wrong format

China☆狼群 posted on 2019-12-25 09:13:14
Question: I am using Spark-MongoDB and I am trying to save a DataFrame into MongoDB:

val event = """{"Dev":[{"a":3},{"b":3}],"hr":[{"a":6}]}"""
val events = sc.parallelize(event :: Nil)
val df = sqlc.read.json(events)
val saveConfig = MongodbConfigBuilder(Map(Host -> List("localhost:27017"), Database -> "test", Collection -> "test", SamplingRatio -> 1.0, WriteConcern -> "normal", SplitSize -> 8, SplitKey -> "_id"))
df.saveToMongodb(saveConfig.build)

I'm expecting the data to be saved as the input …
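Since the rest of the question is cut off, here is only a small, hedged sketch of how the DataFrame can be inspected before the save (reusing the sc and sqlc handles from the question); the connector writes documents based on the rows and schema shown here, not on the original JSON text:

val event = """{"Dev":[{"a":3},{"b":3}],"hr":[{"a":6}]}"""
val df = sqlc.read.json(sc.parallelize(event :: Nil))

df.printSchema()            // shows how Dev and hr were inferred (arrays of structs)
df.show(truncate = false)   // shows the row roughly as it will be handed to the connector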

SparkR 2.0 Classification: how to get performance metrics?

僤鯓⒐⒋嵵緔 posted on 2019-12-25 09:02:06
Question: How can I get performance metrics for SparkR classification, e.g. F1 score, precision, recall, and the confusion matrix?

# Load training data
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
training <- df
testing <- df
# Fit a random forest classification model with spark.randomForest
model <- spark.randomForest(training, label ~ features, "classification", numTrees = 10)
# Model summary
summary(model)
# Prediction
predictions <- predict(model, testing)
head(predictions)
# …
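The rest of the snippet is cut off; as one hedged illustration (written in Scala rather than SparkR, and assuming the predictions can be obtained as numeric prediction/label columns), Spark's MulticlassMetrics exposes all of the requested measures:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Assumed: `predictions` has numeric "prediction" and "label" columns.
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(r => (r.getDouble(0), r.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.confusionMatrix)     // confusion matrix
println(metrics.weightedPrecision)   // precision, weighted across classes
println(metrics.weightedRecall)      // recall
println(metrics.weightedFMeasure)    // F1 score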

perform RDD operations on DataFrames

断了今生、忘了曾经 posted on 2019-12-25 04:23:09
Question: I have a dataset with 10 fields. I need to perform RDD operations on this DataFrame. Is it possible to perform RDD operations like map, flatMap, etc.? Here is my sample code:

df.select("COUNTY","VEHICLES").show();

This is my DataFrame, and I need to convert it to an RDD and apply some RDD operations on the new RDD. Here is how I converted the DataFrame to an RDD:

RDD<Row> java = df.select("COUNTY","VEHICLES").rdd();

After converting to an RDD, I am not able to see the RDD results; I …
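The question uses the Java API; below is a minimal sketch of the same steps in Scala (COUNTY and VEHICLES come from the question, the formatting logic is made up for illustration):

val rows = df.select("COUNTY", "VEHICLES").rdd   // RDD[Row] backing the selection

// Any ordinary RDD operation can now be applied, e.g. a simple map.
val formatted = rows.map(r => s"${r.get(0)} -> ${r.get(1)}")

// RDD transformations are lazy: an action such as collect() or foreach(println)
// is needed before any results become visible.
formatted.collect().foreach(println)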

How to compute the diff of one column in a Spark DataFrame?

廉价感情. posted on 2019-12-25 03:54:36
Question:

+-------------------+
|           Dev_time|
+-------------------+
|2015-09-18 05:00:20|
|2015-09-18 05:00:21|
|2015-09-18 05:00:22|
|2015-09-18 05:00:23|
|2015-09-18 05:00:24|
|2015-09-18 05:00:25|
|2015-09-18 05:00:26|
|2015-09-18 05:00:27|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:38|
|2015-09-18 05:00:39|
+-------------------+

For Spark's DataFrame, I want to compute the diff of this datetime column, just like numpy.diff(array). Answer 1: Generally speaking there is no …
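The answer is cut off above; a common way to express this kind of per-row difference (a sketch only, assuming the DataFrame is called df and that ordering by Dev_time itself is acceptable) is a lag window function:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, unix_timestamp}

// Window ordered by the timestamp column; note that a window with no
// partition column pulls all rows into a single partition.
val w = Window.orderBy("Dev_time")

val withDiff = df.withColumn(
  "diff_seconds",
  unix_timestamp(col("Dev_time")) - unix_timestamp(lag("Dev_time", 1).over(w))
)

withDiff.show()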

Homemade DataFrame aggregation/dropDuplicates Spark

时光总嘲笑我的痴心妄想 posted on 2019-12-25 01:46:19
Question: I want to perform a transformation on my DataFrame df so that each key appears once and only once in the final DataFrame. For machine learning purposes, I don't want a bias in my dataset. This should never occur, but the data I get from my data source contains this "weirdness". So if I have lines with the same key, I want to be able to choose either a combination of the two (like the mean value), or a string concatenation (for labels, for example), or a random set of values. Say my DataFrame df …
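The example DataFrame is cut off above, so the sketch below only illustrates the general shape of such an aggregation (the column names key, value, and label are placeholders):

import org.apache.spark.sql.functions.{avg, collect_list, concat_ws}

// One output row per key: numeric columns are combined by mean,
// string columns by concatenating the labels of the duplicate rows.
val deduplicated = df
  .groupBy("key")
  .agg(
    avg("value").as("value"),                          // combination, e.g. mean value
    concat_ws(",", collect_list("label")).as("label")  // string concatenation
  )

deduplicated.show()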

Spark error: Exception in thread “main” java.lang.UnsupportedOperationException

 ̄綄美尐妖づ posted on 2019-12-24 20:17:07
Question: I am writing a Scala/Spark program that finds the maximum employee salary. The employee data is in a CSV file, and the salary column uses a comma as a thousands separator and is prefixed with a $, e.g. $74,628.00. To handle the comma and dollar sign, I have written a parser function in Scala that splits each line on "," and then maps each column to individual variables to be assigned to a case class. My parser function looks like the one below; in it, to eliminate the …
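The parser itself is cut off above; purely as a sketch of the cleaning it describes (the helper name and the case class are hypothetical), the dollar sign and thousands separators can be stripped before converting the salary to a number:

// Hypothetical helper: "$74,628.00" -> 74628.0
def parseSalary(raw: String): Double =
  raw.replace("$", "").replace(",", "").trim.toDouble

// Illustrative case class a parsed employee line might be mapped to.
case class Employee(name: String, salary: Double)

parseSalary("$74,628.00")   // 74628.0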

Extracting a value from a DataFrame throws an error because of the . in the column name in Spark

依然范特西╮ posted on 2019-12-24 20:11:41
Question: This is my existing DataFrame:

[wide DataFrame show() output truncated] …
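The rest of the question and the answer are cut off, but the usual issue with a dot in a Spark column name is that an unquoted reference such as col("some.name") is parsed as struct field access; a short sketch of the common workaround (the column name some.name is a placeholder):

import org.apache.spark.sql.functions.col

// Wrap the name in backticks so Spark treats the dot literally
// instead of as struct field access.
val value = df.select(col("`some.name`")).first().get(0)

// Renaming the column once avoids the quoting everywhere else.
val renamed = df.withColumnRenamed("some.name", "some_name")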

Would a forced Spark DataFrame materialization work as a checkpoint?

橙三吉。 posted on 2019-12-24 19:40:30
Question: I have a large and complex DataFrame with nested structures in Spark 2.1.0 (PySpark) and I want to add an ID column to it. The way I did it was to add the column like this:

df = df.selectExpr('*', 'row_number() OVER (PARTITION BY File ORDER BY NULL) AS ID')

So it goes, for example, from this:

File   A      B
a.txt  valA1  [valB11,valB12]
a.txt  valA2  [valB21,valB22]

to this:

File   A      B                ID
a.txt  valA1  [valB11,valB12]  1
a.txt  valA2  [valB21,valB22]  2

After I add this column, I don't immediately trigger a materialization …
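The question text ends here; for comparison, the explicit checkpoint mechanism the title alludes to looks roughly like this (a sketch in Scala rather than PySpark, with an illustrative checkpoint directory):

// An explicit checkpoint truncates the lineage and writes the data to the
// checkpoint directory, which a cache()/count() materialization does not do.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  // illustrative path

val checkpointed = df.checkpoint()   // eager by default in Spark 2.1
checkpointed.count()                 // later actions reuse the checkpointed data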