spark-dataframe

PySpark: How to read a CSV into a DataFrame and manipulate it

Submitted by 北战南征 on 2019-12-03 14:33:36
Question: I'm quite new to pyspark and am trying to use it to process a large dataset saved as a CSV file. I'd like to read the CSV file into a Spark dataframe, drop some columns, and add new columns. How should I do that? I am having trouble getting this data into a dataframe. This is a stripped-down version of what I have so far:

def make_dataframe(data_portion, schema, sql):
    fields = data_portion.split(",")
    return sql.createDataFrame([(fields[0], fields[1])], schema=schema)

if __name__ == "__main
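A minimal PySpark sketch of the asked-for workflow (read, drop, add), assuming Spark 2.x with a SparkSession; the input path and column names below are hypothetical placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# read the CSV with a header row and let Spark infer the column types
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# drop columns that are not needed
df = df.drop("col_to_drop1", "col_to_drop2")

# add a derived column
df = df.withColumn("total", col("price") * col("quantity"))

df.show(5)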

Take n rows from a spark dataframe and pass to toPandas()

Submitted by 蹲街弑〆低调 on 2019-12-03 14:26:02
Question: I have this code:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).toPandas()

It works fine and does what it needs to. Suppose, though, that I only want to display the first n rows and then call toPandas() to return a pandas dataframe. How do I do it? I can't call take(n), because that doesn't return a dataframe, so I can't pass it to toPandas(). To put it another way, how can I take the top n rows from a dataframe and
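A sketch of the usual answer: limit(n) returns a DataFrame (unlike take(n)), so it can be chained straight into toPandas(). Reusing df from the question:

# keep only the first n rows as a Spark DataFrame, then convert to pandas
n = 2
pandas_df = df.withColumn('age2', df.age + 2).limit(n).toPandas()
print(pandas_df)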

Flatten Nested Spark Dataframe

Submitted by 北战南征 on 2019-12-03 13:04:15
Question: Is there a way to flatten an arbitrarily nested Spark Dataframe? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to generically flatten a Dataframe with different nested types (e.g. StructType, ArrayType, MapType, etc.). Say I have a schema like:

StructType(List(StructField(field1,...), StructField(field2,...), ArrayType(StructType(List(StructField(nested_field1,...), StructField(nested_field2,...)),nested_array,...)))

I'm looking to adapt this into a flat table
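A rough PySpark sketch of a generic flattener, assuming Spark 2.2+ (for explode_outer), that nested struct fields should become parent_child top-level columns, and that arrays should be exploded into one row per element; MapType is left untouched here:

from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import StructType, ArrayType

def flatten(df):
    # repeat until no struct or array columns remain
    while True:
        struct_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StructType)]
        array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]
        if not struct_cols and not array_cols:
            return df
        for name in struct_cols:
            # promote each struct field to a top-level column named parent_child
            expanded = [col(name + "." + sub.name).alias(name + "_" + sub.name)
                        for sub in df.schema[name].dataType.fields]
            df = df.select([c for c in df.columns if c != name] + expanded)
        for name in array_cols:
            # one output row per array element; arrays of structs are handled next pass
            df = df.withColumn(name, explode_outer(col(name)))

Usage would simply be flat_df = flatten(nested_df); exploding arrays multiplies the row count, which may or may not be what the flat table should look like.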

Inferring Spark DataType from string literals

Submitted by 时光怂恿深爱的人放手 on 2019-12-03 12:57:43
Question: I am trying to write a Scala function that can infer Spark DataTypes based on a provided input string:

/**
 * Example:
 * ========
 * toSparkType("string")  => StringType
 * toSparkType("boolean") => BooleanType
 * toSparkType("date")    => DateType
 * etc.
 */
def toSparkType(inputType: String): DataType = {
  var dt: DataType = null
  if (matchesStringRegex(inputType)) {
    dt = StringType
  } else if (matchesBooleanRegex(inputType)) {
    dt = BooleanType
  } else if (matchesDateRegex(inputType)) {
    dt = DateType
  }
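The question is about Scala, but as a sketch of the same idea in PySpark (the rest of this digest's examples are Python): a lookup table keyed on the lowercased type name avoids the chain of regex branches. The set of supported names below is an assumption.

from pyspark.sql.types import (StringType, BooleanType, DateType,
                               IntegerType, DoubleType, DataType)

# hypothetical mapping from type names to Spark types
_TYPE_MAP = {
    "string": StringType(),
    "boolean": BooleanType(),
    "date": DateType(),
    "int": IntegerType(),
    "integer": IntegerType(),
    "double": DoubleType(),
}

def to_spark_type(input_type: str) -> DataType:
    try:
        return _TYPE_MAP[input_type.strip().lower()]
    except KeyError:
        raise ValueError("Unsupported type name: " + input_type)

print(to_spark_type("Boolean"))   # BooleanType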

Why does df.limit keep changing in Pyspark?

Submitted by ∥☆過路亽.° on 2019-12-03 12:17:27
I'm creating a data sample from some dataframe df with

rdd = df.limit(10000).rdd

This operation takes quite some time (why, actually? can it not short-cut after 10000 rows?), so I assume I have a new RDD now. However, when I now work on rdd, it contains different rows every time I access it, as if it resamples over and over again. Caching the RDD helps a bit, but surely that's not safe? What is the reason behind it?

Update: Here is a reproduction on Spark 1.5.2

from operator import add
from pyspark.sql import Row

rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 100)
rdd1 = rdd.toDF().limit(1000).rdd
for _
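A sketch of two ways to pin the sample down, assuming the goal is a reproducible 1000-row sample rather than any particular 1000 rows: impose an ordering before the limit so the result is deterministic, or materialise the limited data once so later actions should reuse the same rows.

from pyspark.sql import Row

df = sc.parallelize([Row(i=i) for i in range(1000000)], 100).toDF()

# option 1: make limit deterministic by ordering first
stable_rdd = df.orderBy("i").limit(1000).rdd

# option 2: materialise the sample once so later actions reuse the cached rows
sample = df.limit(1000).cache()
sample.count()          # forces evaluation and caching
rdd1 = sample.rdd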

Exploding nested Struct in Spark dataframe

Submitted by 末鹿安然 on 2019-12-03 10:47:56
I'm working through the Databricks example. The schema for the dataframe looks like:

> parquetDF.printSchema
root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: integer (nullable = true)

In the example, they show how to explode the employees column into 4 additional columns:

val explodeDF = parquetDF
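The Databricks example is Scala; a PySpark sketch of the same explode, assuming the schema printed above: explode turns each array element into its own row, and the struct fields can then be selected out as plain columns.

from pyspark.sql.functions import explode

# one row per employee, then pull the struct fields out as columns
explode_df = parquetDF.select(explode("employees").alias("e")) \
                      .select("e.firstName", "e.lastName", "e.email", "e.salary")
explode_df.show()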

Apply a function to a single column of a csv in Spark

Submitted by 二次信任 on 2019-12-03 10:39:00
Using Spark, I'm reading a CSV and want to apply a function to one of its columns. I have some code that works, but it's very hacky. What is the proper way to do this? My code:

SparkContext().addPyFile("myfile.py")
spark = SparkSession\
    .builder\
    .appName("myApp")\
    .getOrCreate()
from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED")
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2],
                                message=myFunction(line[3]))).toDF()

I would like to be able to just call the function on the column name instead of mapping each row to line
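A sketch of the usual column-wise approach: wrap the imported function in a UDF so it can be applied by column name with withColumn. This reuses spark, sys.argv and myFunction from the question and assumes the function returns a string.

import sys
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED")

# wrap the plain Python function as a UDF (assumed string return type)
my_udf = udf(myFunction, StringType())

# apply it to the 'message' column only, leaving the other columns untouched
df = df.withColumn("message", my_udf(col("message")))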

Spark Streaming: How can I add more partitions to my DStream?

Submitted by 情到浓时终转凉″ on 2019-12-03 08:58:06
I have a spark-streaming app which looks like this:

val message = KafkaUtils.createStream(...).map(_._2)

message.foreachRDD( rdd => {
  if (!rdd.isEmpty) {
    val kafkaDF = sqlContext.read.json(rdd)
    kafkaDF.foreachPartition( i => {
      createConnection()
      i.foreach( row => {
        connection.sendToTable()
      })
      closeConnection()
    })
  }
})

And I run it on a YARN cluster using

spark-submit --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 5....

When I try to log kafkaDF.rdd.partitions.size, the result mostly turns out to be '1' or '5'. I am confused, is it possible to control
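One common workaround, shown here as a PySpark analogue of the Scala loop above: repartition each batch RDD before building the DataFrame, so the per-partition work is spread over more tasks. The partition count 10 and the connection helpers are hypothetical stand-ins for createConnection()/sendToTable()/closeConnection() from the question.

def send_partition(rows):
    # hypothetical connection helpers, mirroring the original code
    conn = create_connection()
    for row in rows:
        conn.send_to_table(row)
    conn.close()

def process(rdd):
    if not rdd.isEmpty():
        # more partitions in the batch -> more parallel foreachPartition tasks
        kafka_df = spark.read.json(rdd.repartition(10))
        kafka_df.foreachPartition(send_partition)

message.foreachRDD(process)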

How To Push a Spark Dataframe to Elastic Search (Pyspark)

Submitted by 我怕爱的太早我们不能终老 on 2019-12-03 08:42:21
Beginner ES question here. What is the workflow or steps for pushing a Spark Dataframe to Elastic Search? From research, I believe I need to use the spark.newAPIHadoopFile() method. However, digging through the Elastic Search documentation and other Stack Q&As, I am still a little confused about what format the arguments need to be in and why.

NOTE that I am using pyspark, this is a new table to ES (no index already exists), and the df is 5 columns (2 string types, 2 long types, and 1 list of ints) with ~3.5M rows.

This worked for me - I had my data in df.

df = df.drop('_id')
df.write.format(
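A sketch of the DataFrame-writer route (rather than newAPIHadoopFile), assuming the elasticsearch-hadoop connector is on the classpath (e.g. passed via --jars to spark-submit); the host, port and index/type names below are placeholders:

(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost")           # placeholder ES host
   .option("es.port", "9200")                 # placeholder ES port
   .option("es.resource", "myindex/mytype")   # index/type to write into
   .mode("append")
   .save())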

Spark DataSet filter performance

Submitted by 我是研究僧i on 2019-12-03 08:22:14
I have been experimenting with different ways to filter a typed Dataset. It turns out the performance can be quite different. The Dataset was created from 1.6 GB of data with 33 columns and 4,226,047 rows, by loading CSV data and mapping it to a case class.

val df = spark.read.csv(csvFile).as[FireIncident]

A filter on UnitId = 'B02' should return 47,980 rows. I tested three ways, as below:

1) Use a typed column (~500 ms on local host)

df.where($"UnitID" === "B02").count()

2) Use a temp table and SQL query (~ same as option 1)

df.createOrReplaceTempView("FireIncidentsSF")
spark
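The typed Dataset API is Scala-only, but as a sketch the first two (untyped) routes look like this in PySpark; both compile to the same Catalyst plan, which is why their timings are similar. The file path is a placeholder and the column/view names are taken from the question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("fire_incidents.csv", header=True)   # placeholder path

# 1) column expression - stays inside the Catalyst optimizer
df.where(col("UnitID") == "B02").count()

# 2) SQL over a temp view - produces the same plan, so roughly the same speed
df.createOrReplaceTempView("FireIncidentsSF")
spark.sql("SELECT count(*) FROM FireIncidentsSF WHERE UnitID = 'B02'").show()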