spark-dataframe

How to select an exact number of random rows from DataFrame

狂风中的少年 posted on 2019-12-25 09:18:07
Question: How can I efficiently select an exact number of random rows from a DataFrame? The data contains an index column that can be used. If I have to obtain the maximum size, which is more efficient: count(), or max() on the index column? Answer 1: A possible approach is to calculate the number of rows using .count(), then use sample() from Python's random library to generate a random sequence of the desired length from this range. Lastly, use the resulting list of numbers, vals, to subset your index column. import …
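The answer's code is cut off above; the sketch below is a rough Scala rendering of the same idea (scala.util.Random stands in for Python's random.sample, and a DataFrame df with a 0-based index column named "index" is assumed purely for illustration):

import scala.util.Random
import org.apache.spark.sql.functions.col

val n = 10                                                   // exact number of rows wanted
val total = df.count().toInt                                 // total row count drives the index range
val picked = Random.shuffle((0 until total).toList).take(n)  // n distinct random indices (built on the driver)
val sampled = df.filter(col("index").isin(picked: _*))       // keep only the chosen rows
sampled.show()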

DataFrame Object is not showing any data

坚强是说给别人听的谎言 posted on 2019-12-25 09:17:05
Question: I was trying to create a DataFrame object on an HDFS file using the spark-csv library, as shown in this tutorial. But when I tried to get the count of the DataFrame object, it showed 0. Here is what my file looks like:

employee.csv:
empid,empname
1000,Tom
2000,Jerry

I loaded the above file using:

val empDf = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("delimiter",",").load("hdfs:///user/.../employee.csv");

When I queried the empDf object, printSchema() gave …
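For reference, a sketch of the same load with a few immediate checks (the HDFS path is the elided one from the question, and sqlContext is assumed to be the usual SQLContext):

val empDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // treat the first line (empid,empname) as column names
  .option("delimiter", ",")
  .load("hdfs:///user/.../employee.csv")

empDf.printSchema()              // should list empid and empname
empDf.show()                     // shows which rows were actually parsed
println(empDf.count())           // 2 is expected for the two data rows above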

Spark Dataframe is saved to MongoDB in wrong format

China☆狼群 posted on 2019-12-25 09:13:14
Question: I am using Spark-MongoDB and I am trying to save a DataFrame into MongoDB:

val event = """{"Dev":[{"a":3},{"b":3}],"hr":[{"a":6}]}"""
val events = sc.parallelize(event :: Nil)
val df = sqlc.read.json(events)
val saveConfig = MongodbConfigBuilder(Map(Host -> List("localhost:27017"), Database -> "test", Collection -> "test", SamplingRatio -> 1.0, WriteConcern -> "normal", SplitSize -> 8, SplitKey -> "_id"))
df.saveToMongodb(saveConfig.build)

I'm expecting the data to be saved as the input …
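Since the rest of the question is cut off, here is only a small, hedged sketch of how the DataFrame can be inspected before the save (reusing the sc and sqlc handles from the question); the connector writes documents based on the rows and schema shown here, not on the original JSON text:

val event = """{"Dev":[{"a":3},{"b":3}],"hr":[{"a":6}]}"""
val df = sqlc.read.json(sc.parallelize(event :: Nil))

df.printSchema()            // shows how Dev and hr were inferred (arrays of structs)
df.show(truncate = false)   // shows the row roughly as it will be handed to the connector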

SparkR 2.0 Classification: how to get performance metrics?

僤鯓⒐⒋嵵緔 posted on 2019-12-25 09:02:06
Question: How can I get performance metrics for SparkR classification, e.g. F1 score, precision, recall, and the confusion matrix?

# Load training data
df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
training <- df
testing <- df
# Fit a random forest classification model with spark.randomForest
model <- spark.randomForest(training, label ~ features, "classification", numTrees = 10)
# Model summary
summary(model)
# Prediction
predictions <- predict(model, testing)
head(predictions)
# …
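The rest of the snippet is cut off; as one hedged illustration (written in Scala rather than SparkR, and assuming the predictions can be obtained as numeric prediction/label columns), Spark's MulticlassMetrics exposes all of the requested measures:

import org.apache.spark.mllib.evaluation.MulticlassMetrics

// Assumed: `predictions` has numeric "prediction" and "label" columns.
val predictionAndLabels = predictions
  .select("prediction", "label")
  .rdd
  .map(r => (r.getDouble(0), r.getDouble(1)))

val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.confusionMatrix)     // confusion matrix
println(metrics.weightedPrecision)   // precision, weighted across classes
println(metrics.weightedRecall)      // recall
println(metrics.weightedFMeasure)    // F1 score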

perform RDD operations on DataFrames

断了今生、忘了曾经 posted on 2019-12-25 04:23:09
Question: I have a dataset with 10 fields. I need to perform RDD operations on this DataFrame. Is it possible to perform RDD operations like map, flatMap, etc.? Here is my sample code:

df.select("COUNTY","VEHICLES").show();

This is my DataFrame, and I need to convert it to an RDD and apply some RDD operations on the new RDD. Here is how I converted the DataFrame to an RDD:

RDD<Row> java = df.select("COUNTY","VEHICLES").rdd();

After converting to an RDD, I am not able to see the RDD results; I …
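The question uses the Java API; below is a minimal sketch of the same steps in Scala (COUNTY and VEHICLES come from the question, the formatting logic is made up for illustration):

val rows = df.select("COUNTY", "VEHICLES").rdd   // RDD[Row] backing the selection

// Any ordinary RDD operation can now be applied, e.g. a simple map.
val formatted = rows.map(r => s"${r.get(0)} -> ${r.get(1)}")

// RDD transformations are lazy: an action such as collect() or foreach(println)
// is needed before any results become visible.
formatted.collect().foreach(println)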

How to compute the diff of one column in a Spark DataFrame?

廉价感情. posted on 2019-12-25 03:54:36
Question:

+-------------------+
|           Dev_time|
+-------------------+
|2015-09-18 05:00:20|
|2015-09-18 05:00:21|
|2015-09-18 05:00:22|
|2015-09-18 05:00:23|
|2015-09-18 05:00:24|
|2015-09-18 05:00:25|
|2015-09-18 05:00:26|
|2015-09-18 05:00:27|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:37|
|2015-09-18 05:00:38|
|2015-09-18 05:00:39|
+-------------------+

For Spark's DataFrame, I want to compute the diff of this datetime column, just like numpy.diff(array). Answer 1: Generally speaking there is no …
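The answer is cut off above; a common way to express this kind of per-row difference (a sketch only, assuming the DataFrame is called df and that ordering by Dev_time itself is acceptable) is a lag window function:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag, unix_timestamp}

// Window ordered by the timestamp column; note that a window with no
// partition column pulls all rows into a single partition.
val w = Window.orderBy("Dev_time")

val withDiff = df.withColumn(
  "diff_seconds",
  unix_timestamp(col("Dev_time")) - unix_timestamp(lag("Dev_time", 1).over(w))
)

withDiff.show()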

Homemade DataFrame aggregation/dropDuplicates Spark

时光总嘲笑我的痴心妄想 posted on 2019-12-25 01:46:19
Question: I want to perform a transformation on my DataFrame df so that each key appears once and only once in the final DataFrame. For machine learning purposes, I don't want a bias in my dataset. This should never occur, but the data I get from my data source contains this "weirdness". So if I have lines with the same key, I want to be able to choose either a combination of the two (like the mean value), or a string concatenation (for labels, for example), or a random set of values. Say my DataFrame df …
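The example DataFrame is cut off above, so the sketch below only illustrates the general shape of such an aggregation (the column names key, value, and label are placeholders):

import org.apache.spark.sql.functions.{avg, collect_list, concat_ws}

// One output row per key: numeric columns are combined by mean,
// string columns by concatenating the labels of the duplicate rows.
val deduplicated = df
  .groupBy("key")
  .agg(
    avg("value").as("value"),                          // combination, e.g. mean value
    concat_ws(",", collect_list("label")).as("label")  // string concatenation
  )

deduplicated.show()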

Spark error: Exception in thread “main” java.lang.UnsupportedOperationException

 ̄綄美尐妖づ posted on 2019-12-24 20:17:07
Question: I am writing a Scala/Spark program that finds the maximum employee salary. The employee data is in a CSV file, and the salary column uses a comma as a thousands separator and is prefixed with a $, e.g. $74,628.00. To handle the comma and dollar sign, I have written a parser function in Scala that splits each line on "," and then maps each column to individual variables to be assigned to a case class. My parser function looks like the one below; in it, to eliminate the …
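The parser itself is cut off above; purely as a sketch of the cleaning it describes (the helper name and the case class are hypothetical), the dollar sign and thousands separators can be stripped before converting the salary to a number:

// Hypothetical helper: "$74,628.00" -> 74628.0
def parseSalary(raw: String): Double =
  raw.replace("$", "").replace(",", "").trim.toDouble

// Illustrative case class a parsed employee line might be mapped to.
case class Employee(name: String, salary: Double)

parseSalary("$74,628.00")   // 74628.0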

Extracting a value from a DataFrame throws an error because of the . in the column name in Spark

依然范特西╮ posted on 2019-12-24 20:11:41
Question: This is my existing DataFrame:

[wide DataFrame show() output truncated] …
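The rest of the question and the answer are cut off, but the usual issue with a dot in a Spark column name is that an unquoted reference such as col("some.name") is parsed as struct field access; a short sketch of the common workaround (the column name some.name is a placeholder):

import org.apache.spark.sql.functions.col

// Wrap the name in backticks so Spark treats the dot literally
// instead of as struct field access.
val value = df.select(col("`some.name`")).first().get(0)

// Renaming the column once avoids the quoting everywhere else.
val renamed = df.withColumnRenamed("some.name", "some_name")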

Would a forced Spark DataFrame materialization work as a checkpoint?

橙三吉。 posted on 2019-12-24 19:40:30
Question: I have a large and complex DataFrame with nested structures in Spark 2.1.0 (PySpark) and I want to add an ID column to it. The way I did it was to add the column like this:

df = df.selectExpr('*', 'row_number() OVER (PARTITION BY File ORDER BY NULL) AS ID')

So it goes, for example, from this:

File   A      B
a.txt  valA1  [valB11,valB12]
a.txt  valA2  [valB21,valB22]

to this:

File   A      B                ID
a.txt  valA1  [valB11,valB12]  1
a.txt  valA2  [valB21,valB22]  2

After I add this column, I don't immediately trigger a materialization …
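The question text ends here; for comparison, the explicit checkpoint mechanism the title alludes to looks roughly like this (a sketch in Scala rather than PySpark, with an illustrative checkpoint directory):

// An explicit checkpoint truncates the lineage and writes the data to the
// checkpoint directory, which a cache()/count() materialization does not do.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")  // illustrative path

val checkpointed = df.checkpoint()   // eager by default in Spark 2.1
checkpointed.count()                 // later actions reuse the checkpointed data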