spark-dataframe

PySpark: How to read a CSV into a DataFrame and manipulate it

Submitted by 北战南征 on 2019-12-03 14:33:36
Question: I'm quite new to pyspark and am trying to use it to process a large dataset saved as a CSV file. I'd like to read the CSV file into a Spark dataframe, drop some columns, and add new columns. How should I do that? I am having trouble getting this data into a dataframe. This is a stripped-down version of what I have so far:

def make_dataframe(data_portion, schema, sql):
    fields = data_portion.split(",")
    return sql.createDataFrame([(fields[0], fields[1])], schema=schema)

if __name__ == "__main
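A minimal PySpark sketch of the asked-for workflow (read, drop, add), assuming Spark 2.x with a SparkSession; the input path and column names below are hypothetical placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("csv-example").getOrCreate()

# read the CSV with a header row and let Spark infer the column types
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# drop columns that are not needed
df = df.drop("col_to_drop1", "col_to_drop2")

# add a derived column
df = df.withColumn("total", col("price") * col("quantity"))

df.show(5)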

Take n rows from a spark dataframe and pass to toPandas()

Submitted by 蹲街弑〆低调 on 2019-12-03 14:26:02
Question: I have this code:

l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).toPandas()

It works fine and does what it needs to. Suppose, though, that I only want to display the first n rows and then call toPandas() to return a pandas dataframe. How do I do it? I can't call take(n), because that doesn't return a dataframe, so I can't pass it to toPandas(). To put it another way, how can I take the top n rows from a dataframe and
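A sketch of the usual answer: limit(n) returns a DataFrame (unlike take(n)), so it can be chained straight into toPandas(). Reusing df from the question:

# keep only the first n rows as a Spark DataFrame, then convert to pandas
n = 2
pandas_df = df.withColumn('age2', df.age + 2).limit(n).toPandas()
print(pandas_df)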

Flatten Nested Spark Dataframe

Submitted by 北战南征 on 2019-12-03 13:04:15
Question: Is there a way to flatten an arbitrarily nested Spark Dataframe? Most of the work I'm seeing is written for a specific schema, and I'd like to be able to generically flatten a Dataframe with different nested types (e.g. StructType, ArrayType, MapType, etc.). Say I have a schema like:

StructType(List(StructField(field1,...), StructField(field2,...), ArrayType(StructType(List(StructField(nested_field1,...), StructField(nested_field2,...)),nested_array,...)))

I'm looking to adapt this into a flat table
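A rough PySpark sketch of a generic flattener, assuming Spark 2.2+ (for explode_outer), that nested struct fields should become parent_child top-level columns, and that arrays should be exploded into one row per element; MapType is left untouched here:

from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import StructType, ArrayType

def flatten(df):
    # repeat until no struct or array columns remain
    while True:
        struct_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StructType)]
        array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]
        if not struct_cols and not array_cols:
            return df
        for name in struct_cols:
            # promote each struct field to a top-level column named parent_child
            expanded = [col(name + "." + sub.name).alias(name + "_" + sub.name)
                        for sub in df.schema[name].dataType.fields]
            df = df.select([c for c in df.columns if c != name] + expanded)
        for name in array_cols:
            # one output row per array element; arrays of structs are handled next pass
            df = df.withColumn(name, explode_outer(col(name)))

Usage would simply be flat_df = flatten(nested_df); exploding arrays multiplies the row count, which may or may not be what the flat table should look like.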

Inferring Spark DataType from string literals

Submitted by 时光怂恿深爱的人放手 on 2019-12-03 12:57:43
Question: I am trying to write a Scala function that can infer Spark DataTypes based on a provided input string:

/**
 * Example:
 * ========
 * toSparkType("string")  => StringType
 * toSparkType("boolean") => BooleanType
 * toSparkType("date")    => DateType
 * etc.
 */
def toSparkType(inputType: String): DataType = {
  var dt: DataType = null
  if (matchesStringRegex(inputType)) {
    dt = StringType
  } else if (matchesBooleanRegex(inputType)) {
    dt = BooleanType
  } else if (matchesDateRegex(inputType)) {
    dt = DateType
  }
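The question is about Scala, but as a sketch of the same idea in PySpark (the rest of this digest's examples are Python): a lookup table keyed on the lowercased type name avoids the chain of regex branches. The set of supported names below is an assumption.

from pyspark.sql.types import (StringType, BooleanType, DateType,
                               IntegerType, DoubleType, DataType)

# hypothetical mapping from type names to Spark types
_TYPE_MAP = {
    "string": StringType(),
    "boolean": BooleanType(),
    "date": DateType(),
    "int": IntegerType(),
    "integer": IntegerType(),
    "double": DoubleType(),
}

def to_spark_type(input_type: str) -> DataType:
    try:
        return _TYPE_MAP[input_type.strip().lower()]
    except KeyError:
        raise ValueError("Unsupported type name: " + input_type)

print(to_spark_type("Boolean"))   # BooleanType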

Why does df.limit keep changing in Pyspark?

Submitted by ∥☆過路亽.° on 2019-12-03 12:17:27
I'm creating a data sample from some dataframe df with

rdd = df.limit(10000).rdd

This operation takes quite some time (why, actually? can it not short-cut after 10000 rows?), so I assume I have a new RDD now. However, when I now work on rdd, it contains different rows every time I access it, as if it resamples over and over again. Caching the RDD helps a bit, but surely that's not safe? What is the reason behind it?

Update: Here is a reproduction on Spark 1.5.2

from operator import add
from pyspark.sql import Row

rdd = sc.parallelize([Row(i=i) for i in range(1000000)], 100)
rdd1 = rdd.toDF().limit(1000).rdd
for _
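A sketch of two ways to pin the sample down, assuming the goal is a reproducible 1000-row sample rather than any particular 1000 rows: impose an ordering before the limit so the result is deterministic, or materialise the limited data once so later actions should reuse the same rows.

from pyspark.sql import Row

df = sc.parallelize([Row(i=i) for i in range(1000000)], 100).toDF()

# option 1: make limit deterministic by ordering first
stable_rdd = df.orderBy("i").limit(1000).rdd

# option 2: materialise the sample once so later actions reuse the cached rows
sample = df.limit(1000).cache()
sample.count()          # forces evaluation and caching
rdd1 = sample.rdd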

Exploding nested Struct in Spark dataframe

Submitted by 末鹿安然 on 2019-12-03 10:47:56
I'm working through the Databricks example. The schema for the dataframe looks like:

> parquetDF.printSchema
root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: integer (nullable = true)

In the example, they show how to explode the employees column into 4 additional columns:

val explodeDF = parquetDF
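The Databricks example is Scala; a PySpark sketch of the same explode, assuming the schema printed above: explode turns each array element into its own row, and the struct fields can then be selected out as plain columns.

from pyspark.sql.functions import explode

# one row per employee, then pull the struct fields out as columns
explode_df = parquetDF.select(explode("employees").alias("e")) \
                      .select("e.firstName", "e.lastName", "e.email", "e.salary")
explode_df.show()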

Apply a function to a single column of a csv in Spark

Submitted by 二次信任 on 2019-12-03 10:39:00
Using Spark, I'm reading a CSV and want to apply a function to one of its columns. I have some code that works, but it's very hacky. What is the proper way to do this? My code:

SparkContext().addPyFile("myfile.py")
spark = SparkSession\
    .builder\
    .appName("myApp")\
    .getOrCreate()
from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED")
a = df.rdd.map(lambda line: Row(id=line[0], user_id=line[1], message_id=line[2],
                                message=myFunction(line[3]))).toDF()

I would like to be able to just call the function on the column name instead of mapping each row to line
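A sketch of the usual column-wise approach: wrap the imported function in a UDF so it can be applied by column name with withColumn. This reuses spark, sys.argv and myFunction from the question and assumes the function returns a string.

import sys
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
from myfile import myFunction

df = spark.read.csv(sys.argv[1], header=True, mode="DROPMALFORMED")

# wrap the plain Python function as a UDF (assumed string return type)
my_udf = udf(myFunction, StringType())

# apply it to the 'message' column only, leaving the other columns untouched
df = df.withColumn("message", my_udf(col("message")))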

Spark Streaming: How can I add more partitions to my DStream?

Submitted by 情到浓时终转凉″ on 2019-12-03 08:58:06
I have a spark-streaming app which looks like this:

val message = KafkaUtils.createStream(...).map(_._2)

message.foreachRDD( rdd => {
  if (!rdd.isEmpty) {
    val kafkaDF = sqlContext.read.json(rdd)
    kafkaDF.foreachPartition( i => {
      createConnection()
      i.foreach( row => {
        connection.sendToTable()
      })
      closeConnection()
    })
  }
})

And I run it on a YARN cluster using

spark-submit --master yarn-cluster --num-executors 3 --driver-memory 2g --executor-memory 2g --executor-cores 5....

When I try to log kafkaDF.rdd.partitions.size, the result mostly turns out to be '1' or '5'. I am confused, is it possible to control
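One common workaround, shown here as a PySpark analogue of the Scala loop above: repartition each batch RDD before building the DataFrame, so the per-partition work is spread over more tasks. The partition count 10 and the connection helpers are hypothetical stand-ins for createConnection()/sendToTable()/closeConnection() from the question.

def send_partition(rows):
    # hypothetical connection helpers, mirroring the original code
    conn = create_connection()
    for row in rows:
        conn.send_to_table(row)
    conn.close()

def process(rdd):
    if not rdd.isEmpty():
        # more partitions in the batch -> more parallel foreachPartition tasks
        kafka_df = spark.read.json(rdd.repartition(10))
        kafka_df.foreachPartition(send_partition)

message.foreachRDD(process)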

How To Push a Spark Dataframe to Elastic Search (Pyspark)

Submitted by 我怕爱的太早我们不能终老 on 2019-12-03 08:42:21
Beginner ES question here. What is the workflow or steps for pushing a Spark Dataframe to Elastic Search? From research, I believe I need to use the spark.newAPIHadoopFile() method. However, digging through the Elastic Search documentation and other Stack Q&As, I am still a little confused about what format the arguments need to be in and why.

NOTE that I am using pyspark, this is a new table to ES (no index already exists), and the df is 5 columns (2 string types, 2 long types, and 1 list of ints) with ~3.5M rows.

This worked for me - I had my data in df.

df = df.drop('_id')
df.write.format(
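A sketch of the DataFrame-writer route (rather than newAPIHadoopFile), assuming the elasticsearch-hadoop connector is on the classpath (e.g. passed via --jars to spark-submit); the host, port and index/type names below are placeholders:

(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost")           # placeholder ES host
   .option("es.port", "9200")                 # placeholder ES port
   .option("es.resource", "myindex/mytype")   # index/type to write into
   .mode("append")
   .save())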

Spark DataSet filter performance

Submitted by 我是研究僧i on 2019-12-03 08:22:14
I have been experimenting with different ways to filter a typed Dataset. It turns out the performance can be quite different. The Dataset was created from 1.6 GB of data with 33 columns and 4,226,047 rows, by loading CSV data and mapping it to a case class.

val df = spark.read.csv(csvFile).as[FireIncident]

A filter on UnitId = 'B02' should return 47,980 rows. I tested three ways, as below:

1) Use a typed column (~500 ms on local host)

df.where($"UnitID" === "B02").count()

2) Use a temp table and SQL query (~ same as option 1)

df.createOrReplaceTempView("FireIncidentsSF")
spark
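The typed Dataset API is Scala-only, but as a sketch the first two (untyped) routes look like this in PySpark; both compile to the same Catalyst plan, which is why their timings are similar. The file path is a placeholder and the column/view names are taken from the question.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("fire_incidents.csv", header=True)   # placeholder path

# 1) column expression - stays inside the Catalyst optimizer
df.where(col("UnitID") == "B02").count()

# 2) SQL over a temp view - produces the same plan, so roughly the same speed
df.createOrReplaceTempView("FireIncidentsSF")
spark.sql("SELECT count(*) FROM FireIncidentsSF WHERE UnitID = 'B02'").show()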