spark-dataframe

'RDD' object has no attribute '_jdf' pyspark RDD

Submitted by 我与影子孤独终老i on 2019-12-10 10:45:08
Question: I'm new to pyspark. I would like to perform some machine learning on a text file. from pyspark import Row from pyspark.context import SparkContext from pyspark.sql.session import SparkSession from pyspark import SparkConf sc = SparkContext spark = SparkSession.builder.appName("ML").getOrCreate() train_data = spark.read.text("20ng-train-all-terms.txt") td = train_data.rdd # transform df to rdd tr_data = td.map(lambda line: line.split()).map(lambda words: Row(label=words[0], words=words[1:]))
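The '_jdf' error usually means a DataFrame-based API (for example a spark.ml transformer) was handed a plain RDD. A minimal sketch, assuming the file name and column names from the question, that converts the RDD of Row objects back into a DataFrame before any spark.ml stage sees it:

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("ML").getOrCreate()

    train_data = spark.read.text("20ng-train-all-terms.txt")
    tr_rdd = (train_data.rdd
              .map(lambda r: r.value.split())        # Row(value=...) -> list of tokens
              .map(lambda w: Row(label=w[0], words=w[1:])))
    tr_df = spark.createDataFrame(tr_rdd)            # back to a DataFrame for spark.ml
    tr_df.printSchema()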

Spark Scala: retrieve the schema and store it

Submitted by 守給你的承諾、 on 2019-12-10 09:27:18
Question: Is it possible to retrieve the schema of an RDD and store it in a variable? I want to create a new data frame from another RDD using the same schema. For example, below is what I am hoping to have: val schema = oldDF.getSchema() val newDF = sqlContext.createDataFrame(rowRDD, schema) Assuming I already have rowRDD in the format of RDD[org.apache.spark.sql.Row], is this possible? Answer 1: Just use the schema attribute: val oldDF = sqlContext.createDataFrame(sc.parallelize(Seq(("a", 1)
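The question and answer are Scala, where the fix is simply oldDF.schema. For reference, a hedged PySpark sketch of the same idea, using toy data rather than the asker's:

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()
    oldDF = spark.createDataFrame([("a", 1), ("b", 2)], ["letter", "number"])

    schema = oldDF.schema                          # a StructType, reusable as-is
    rowRDD = spark.sparkContext.parallelize([Row(letter="c", number=3)])
    newDF = spark.createDataFrame(rowRDD, schema)  # new DataFrame, same schema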

Spark: equivalent of zipWithIndex in DataFrame

Submitted by 為{幸葍}努か on 2019-12-10 05:08:22
Question: Assume I have the following dataframe: dummy_data = [('a',1),('b',25),('c',3),('d',8),('e',1)] df = sc.parallelize(dummy_data).toDF(['letter','number']) and I want to create this dataframe: [('a',0),('b',2),('c',1),('d',3),('e',0)] What I do is convert it to an rdd, use the zipWithIndex function, and then join the results: convertDF = (df.select('number') .distinct() .rdd .zipWithIndex() .map(lambda x:(x[0].number,x[1])) .toDF(['old','new'])) finalDF = (df .join(convertDF,df
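A hedged alternative that produces the same dense, 0-based mapping without the RDD round-trip and join is a dense_rank window. Note that an un-partitioned window pulls all rows into a single partition, which may matter for large data:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('a', 1), ('b', 25), ('c', 3), ('d', 8), ('e', 1)],
                               ['letter', 'number'])

    # dense_rank over the frame ordered by number gives consecutive ids;
    # subtract 1 so the index starts at 0, like zipWithIndex
    w = Window.orderBy('number')
    result = df.withColumn('new', F.dense_rank().over(w) - 1)
    result.show()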

Spark SQL: TypeError("StructType can not accept object in type %s" % type(obj))

Submitted by 折月煮酒 on 2019-12-10 03:03:54
Question: I am currently pulling data from SQL Server using pyodbc and trying to insert it into a Hive table in a near-real-time (NRT) manner. I fetch a single row from the source, convert it into a List[String], and create the schema programmatically, but while creating the DataFrame Spark throws a StructType error. >>> cnxn = pyodbc.connect(con_string) >>> aj = cnxn.cursor() >>> >>> aj.execute("select * from tjob") <pyodbc.Cursor object at 0x257b2d0> >>> row = aj.fetchone() >>> row (1127, u'', u'8196660', u'
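That StructType error typically appears when the records handed to createDataFrame are not tuples/lists/Rows whose shape and types match the schema. A minimal sketch, with hypothetical column names standing in for the real tjob columns:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # stand-in for the single record fetched with aj.fetchone()
    row = (1127, u'', u'8196660')

    # build the schema programmatically, one StringType field per column
    cols = ['jobid', 'jobname', 'jobnumber']          # hypothetical names
    schema = StructType([StructField(c, StringType(), True) for c in cols])

    # StructType expects one tuple/list/Row per record, so wrap the row in a
    # list and cast every value to str so it matches the StringType fields
    data = [tuple(str(v) for v in row)]
    df = spark.createDataFrame(data, schema)
    df.show()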

How to retrieve Metrics like Output Size and Records Written from Spark UI?

Submitted by 老子叫甜甜 on 2019-12-10 02:13:45
Question: How do I collect these metrics on the console (from the Spark shell or a spark-submit job) right after the task or job is done? We are using Spark to load data from MySQL into Cassandra, and the volume is quite large (e.g. ~200 GB and 600M rows). When the task is done, we want to verify exactly how many rows Spark processed. We can get the number from the Spark UI, but how can we retrieve that number ("Output Records Written") from the Spark shell or a spark-submit job? Sample command to load from MySQL into Cassandra: val
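Besides registering a SparkListener and reading taskMetrics.outputMetrics.recordsWritten on task end, the same numbers the UI shows are exposed by Spark's monitoring REST API, which can be queried from any language. A sketch assuming the driver UI is reachable on the default port 4040 while the application is still running (or through the history server afterwards):

    import requests  # plain HTTP client; Spark's monitoring REST API returns JSON

    ui = "http://localhost:4040"                      # adjust host/port as needed
    app_id = requests.get(ui + "/api/v1/applications").json()[0]["id"]
    stages = requests.get(ui + "/api/v1/applications/%s/stages" % app_id).json()

    # each stage entry carries its output metrics, summed here across stages
    written = sum(s.get("outputRecords", 0) for s in stages)
    print("records written across all stages:", written)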

Bulk data migration through Spark SQL

Submitted by 一世执手 on 2019-12-09 17:18:06
Question: I'm currently trying to bulk-migrate the contents of a very large MySQL table into a Parquet file via Spark SQL, but when doing so I quickly run out of memory, even when setting the driver's memory limit higher (I'm using Spark in local mode). Example code: Dataset<Row> ds = spark.read() .format("jdbc") .option("url", url) .option("driver", "com.mysql.jdbc.Driver") .option("dbtable", "bigdatatable") .option("user", "root") .option("password", "foobar") .load(); ds.write().mode(SaveMode.Append
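A common way to avoid pulling the whole table through a single JDBC connection is to partition the read and fetch rows in batches. The question's code is Java; the sketch below shows the same standard JDBC options in PySpark, with a hypothetical numeric primary key "id" as the partition column and placeholder bounds and paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    ds = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost/db")     # placeholder URL
          .option("driver", "com.mysql.jdbc.Driver")
          .option("dbtable", "bigdatatable")
          .option("user", "root")
          .option("password", "foobar")
          # split the read into many smaller JDBC queries instead of one huge one
          .option("partitionColumn", "id")                # hypothetical numeric PK
          .option("lowerBound", "1")
          .option("upperBound", "600000000")
          .option("numPartitions", "200")
          .option("fetchsize", "10000")                   # fetch rows in batches
          .load())

    ds.write.mode("append").parquet("/data/bigdatatable_parquet")  # placeholder path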

Issue with VectorUDT when using Spark ML

Submitted by £可爱£侵袭症+ on 2019-12-09 17:02:56
Question: I am writing a UDAF to be applied to a Spark DataFrame column of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package so that I do not have to go back and forth between DataFrames and RDDs. Inside the UDAF, I have to specify data types for the input, buffer, and output schemas: def inputSchema = new StructType().add("features", new VectorUDT()) def bufferSchema: StructType = StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)
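The question is about a Scala UDAF, but as a cross-check of the type itself: VectorUDT is an ordinary DataType and can sit inside StructType/ArrayType in PySpark as well. Since PySpark (before 3.0) has no UDAF API, a hedged workaround on the Python side is collect_list plus an ordinary UDF, sketched here on toy data:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, ArrayType
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, Vectors.dense([1.0, 0.0])), (1, Vectors.dense([0.0, 1.0]))],
        ["id", "features"])

    # VectorUDT() is a regular DataType, so it can appear inside a schema
    # (shown only to illustrate the declaration; not used further below)
    buffer_schema = StructType([
        StructField("list_of_similarities", ArrayType(VectorUDT(), True), True)])

    # collect_list + a plain UDF as a stand-in for a vector-typed aggregation
    def mean_vector(vs):
        n = len(vs)
        return Vectors.dense([sum(v[i] for v in vs) / n for i in range(len(vs[0]))])

    mean_udf = F.udf(mean_vector, VectorUDT())
    df.groupBy("id").agg(mean_udf(F.collect_list("features")).alias("mean")).show()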

StackOverflowError when operating with a large number of columns in Spark

Submitted by 柔情痞子 on 2019-12-09 00:01:37
Question: I have a wide dataframe (130000 rows x 8700 columns) and when I try to sum all the columns I'm getting the following error: Exception in thread "main" java.lang.StackOverflowError at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray
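Two mitigations commonly help here (stated as assumptions, not a diagnosis of this exact trace): raising the JVM thread stack size (e.g. passing -Xss through spark.driver.extraJavaOptions), or keeping the sum expression shallow by adding the columns in chunks instead of building one 8700-term nested expression. A PySpark sketch of the chunked sum on a tiny stand-in dataframe:

    from functools import reduce
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    # tiny stand-in for the 8700-column dataframe from the question
    df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ["c1", "c2", "c3"])

    def sum_all_columns(df, chunk=500):
        # Summing thousands of columns with one reduce() builds a very deep
        # expression tree; summing in chunks keeps each subtree shallow.
        cols = [F.col(c) for c in df.columns]
        partials = [reduce(lambda a, b: a + b, cols[i:i + chunk])
                    for i in range(0, len(cols), chunk)]
        return df.withColumn("total", reduce(lambda a, b: a + b, partials))

    sum_all_columns(df).show()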

Spark toDF() / createDataFrame() type inference doesn't work as expected

Submitted by 社会主义新天地 on 2019-12-08 14:00:06
Question: df = sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str']) df.printSchema() produces root |-- should_be_int: string (nullable = true) |-- should_be_str: string (nullable = true) Notice that should_be_int has a string datatype, even though, according to the documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection), Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by
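toDF infers types from the Python objects themselves, so '1' (a str) comes out as a string column. A sketch of the two usual fixes: pass real ints, or keep the string data and declare/cast the schema explicitly:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # inference follows the Python types, so pass real ints instead of '1', '2', ...
    df1 = sc.parallelize([(1, 'a'), (2, 'b'), (3, 'c')]).toDF(
        ['should_be_int', 'should_be_str'])
    df1.printSchema()   # should_be_int comes out as a long

    # or keep the string input, state the schema, and cast the column afterwards
    schema = StructType([StructField('should_be_int', StringType(), True),
                         StructField('should_be_str', StringType(), True)])
    df2 = (spark.createDataFrame([('1', 'a'), ('2', 'b'), ('3', 'c')], schema)
           .withColumn('should_be_int', F.col('should_be_int').cast(IntegerType())))
    df2.printSchema()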

Spark: access an RDD inside another RDD

Submitted by 耗尽温柔 on 2019-12-08 12:50:39
Question: I have a lookup RDD of size 6000, lookup_rdd: RDD[String]: a1 a2 a3 a4 a5 ..... and another RDD, data_rdd: RDD[(String, Iterable[(String, Int)])]: (id, (item, count)), which has unique ids: (id1,List((a1,2), (a3,4))) (id2,List((a2,1), (a4,2), (a1,1))) (id3,List((a5,1))) For each element in lookup_rdd I want to check whether each id has that element or not; if it is there I put the count, and if not I put 0, and store the result in a file. What is an efficient way to achieve this? Is hashing possible? e.g.
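An RDD cannot be referenced inside another RDD's transformations, but the lookup side here is tiny (6000 strings), so the usual pattern is to collect it once and broadcast it. The question is Scala; a hedged PySpark sketch with the question's sample data and a hypothetical output path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    lookup_rdd = sc.parallelize(['a1', 'a2', 'a3', 'a4', 'a5'])
    data_rdd = sc.parallelize([('id1', [('a1', 2), ('a3', 4)]),
                               ('id2', [('a2', 1), ('a4', 2), ('a1', 1)]),
                               ('id3', [('a5', 1)])])

    # collect the small lookup side and broadcast it to every executor
    lookup = sc.broadcast(sorted(lookup_rdd.collect()))

    def counts_for(items):
        d = dict(items)                              # item -> count for this id
        return [d.get(k, 0) for k in lookup.value]   # 0 when the id lacks the item

    result = data_rdd.map(lambda kv: (kv[0], counts_for(kv[1])))
    result.saveAsTextFile("counts_out")              # hypothetical output path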