spark-dataframe

'RDD' object has no attribute '_jdf' pyspark RDD

Submitted by 我与影子孤独终老i on 2019-12-10 10:45:08
Question: I'm new to pyspark. I would like to perform some machine learning on a text file. from pyspark import Row from pyspark.context import SparkContext from pyspark.sql.session import SparkSession from pyspark import SparkConf sc = SparkContext spark = SparkSession.builder.appName("ML").getOrCreate() train_data = spark.read.text("20ng-train-all-terms.txt") td = train_data.rdd # transform df to rdd tr_data = td.map(lambda line: line.split()).map(lambda words: Row(label=words[0], words=words[1:]))
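The '_jdf' error usually means a DataFrame-based API (for example a spark.ml transformer) was handed a plain RDD. A minimal sketch, assuming the file name and column names from the question, that converts the RDD of Row objects back into a DataFrame before any spark.ml stage sees it:

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("ML").getOrCreate()

    train_data = spark.read.text("20ng-train-all-terms.txt")
    tr_rdd = (train_data.rdd
              .map(lambda r: r.value.split())        # Row(value=...) -> list of tokens
              .map(lambda w: Row(label=w[0], words=w[1:])))
    tr_df = spark.createDataFrame(tr_rdd)            # back to a DataFrame for spark.ml
    tr_df.printSchema()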

Spark Scala: retrieve the schema and store it

Submitted by 守給你的承諾、 on 2019-12-10 09:27:18
Question: Is it possible to retrieve the schema of an RDD and store it in a variable? I want to create a new data frame from another RDD using the same schema. For example, below is what I am hoping to have: val schema = oldDF.getSchema() val newDF = sqlContext.createDataFrame(rowRDD, schema) Assuming I already have rowRDD in the format of RDD[org.apache.spark.sql.Row], is this possible? Answer 1: Just use the schema attribute: val oldDF = sqlContext.createDataFrame(sc.parallelize(Seq(("a", 1)
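The question and answer are Scala, where the fix is simply oldDF.schema. For reference, a hedged PySpark sketch of the same idea, using toy data rather than the asker's:

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()
    oldDF = spark.createDataFrame([("a", 1), ("b", 2)], ["letter", "number"])

    schema = oldDF.schema                          # a StructType, reusable as-is
    rowRDD = spark.sparkContext.parallelize([Row(letter="c", number=3)])
    newDF = spark.createDataFrame(rowRDD, schema)  # new DataFrame, same schema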

Spark: equivalent of zipWithIndex in DataFrame

Submitted by 為{幸葍}努か on 2019-12-10 05:08:22
Question: Assume I have the following dataframe: dummy_data = [('a',1),('b',25),('c',3),('d',8),('e',1)] df = sc.parallelize(dummy_data).toDF(['letter','number']) and I want to create this dataframe: [('a',0),('b',2),('c',1),('d',3),('e',0)] What I do is convert it to an rdd, use the zipWithIndex function, and then join the results: convertDF = (df.select('number') .distinct() .rdd .zipWithIndex() .map(lambda x:(x[0].number,x[1])) .toDF(['old','new'])) finalDF = (df .join(convertDF,df
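A hedged alternative that produces the same dense, 0-based mapping without the RDD round-trip and join is a dense_rank window. Note that an un-partitioned window pulls all rows into a single partition, which may matter for large data:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('a', 1), ('b', 25), ('c', 3), ('d', 8), ('e', 1)],
                               ['letter', 'number'])

    # dense_rank over the frame ordered by number gives consecutive ids;
    # subtract 1 so the index starts at 0, like zipWithIndex
    w = Window.orderBy('number')
    result = df.withColumn('new', F.dense_rank().over(w) - 1)
    result.show()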

Spark SQL: TypeError("StructType can not accept object in type %s" % type(obj))

Submitted by 折月煮酒 on 2019-12-10 03:03:54
Question: I am currently pulling data from SQL Server using pyodbc and trying to insert it into a Hive table in a near-real-time (NRT) manner. I fetch a single row from the source, convert it into a List[String], and create the schema programmatically, but while creating the DataFrame Spark throws a StructType error. >>> cnxn = pyodbc.connect(con_string) >>> aj = cnxn.cursor() >>> >>> aj.execute("select * from tjob") <pyodbc.Cursor object at 0x257b2d0> >>> row = aj.fetchone() >>> row (1127, u'', u'8196660', u'
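That StructType error typically appears when the records handed to createDataFrame are not tuples/lists/Rows whose shape and types match the schema. A minimal sketch, with hypothetical column names standing in for the real tjob columns:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType

    spark = SparkSession.builder.getOrCreate()

    # stand-in for the single record fetched with aj.fetchone()
    row = (1127, u'', u'8196660')

    # build the schema programmatically, one StringType field per column
    cols = ['jobid', 'jobname', 'jobnumber']          # hypothetical names
    schema = StructType([StructField(c, StringType(), True) for c in cols])

    # StructType expects one tuple/list/Row per record, so wrap the row in a
    # list and cast every value to str so it matches the StringType fields
    data = [tuple(str(v) for v in row)]
    df = spark.createDataFrame(data, schema)
    df.show()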

How to retrieve Metrics like Output Size and Records Written from Spark UI?

Submitted by 老子叫甜甜 on 2019-12-10 02:13:45
Question: How do I collect these metrics on the console (from the Spark shell or a spark-submit job) right after the task or job is done? We are using Spark to load data from MySQL into Cassandra, and the volume is quite large (e.g. ~200 GB and 600M rows). When the task is done, we want to verify exactly how many rows Spark processed. We can get the number from the Spark UI, but how can we retrieve that number ("Output Records Written") from the Spark shell or a spark-submit job? Sample command to load from MySQL into Cassandra: val
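Besides registering a SparkListener and reading taskMetrics.outputMetrics.recordsWritten on task end, the same numbers the UI shows are exposed by Spark's monitoring REST API, which can be queried from any language. A sketch assuming the driver UI is reachable on the default port 4040 while the application is still running (or through the history server afterwards):

    import requests  # plain HTTP client; Spark's monitoring REST API returns JSON

    ui = "http://localhost:4040"                      # adjust host/port as needed
    app_id = requests.get(ui + "/api/v1/applications").json()[0]["id"]
    stages = requests.get(ui + "/api/v1/applications/%s/stages" % app_id).json()

    # each stage entry carries its output metrics, summed here across stages
    written = sum(s.get("outputRecords", 0) for s in stages)
    print("records written across all stages:", written)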

Bulk data migration through Spark SQL

Submitted by 一世执手 on 2019-12-09 17:18:06
Question: I'm currently trying to bulk-migrate the contents of a very large MySQL table into a Parquet file via Spark SQL, but when doing so I quickly run out of memory, even when setting the driver's memory limit higher (I'm using Spark in local mode). Example code: Dataset<Row> ds = spark.read() .format("jdbc") .option("url", url) .option("driver", "com.mysql.jdbc.Driver") .option("dbtable", "bigdatatable") .option("user", "root") .option("password", "foobar") .load(); ds.write().mode(SaveMode.Append
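A common way to avoid pulling the whole table through a single JDBC connection is to partition the read and fetch rows in batches. The question's code is Java; the sketch below shows the same standard JDBC options in PySpark, with a hypothetical numeric primary key "id" as the partition column and placeholder bounds and paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    ds = (spark.read.format("jdbc")
          .option("url", "jdbc:mysql://localhost/db")     # placeholder URL
          .option("driver", "com.mysql.jdbc.Driver")
          .option("dbtable", "bigdatatable")
          .option("user", "root")
          .option("password", "foobar")
          # split the read into many smaller JDBC queries instead of one huge one
          .option("partitionColumn", "id")                # hypothetical numeric PK
          .option("lowerBound", "1")
          .option("upperBound", "600000000")
          .option("numPartitions", "200")
          .option("fetchsize", "10000")                   # fetch rows in batches
          .load())

    ds.write.mode("append").parquet("/data/bigdatatable_parquet")  # placeholder path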

Issue with VectorUDT when using Spark ML

Submitted by £可爱£侵袭症+ on 2019-12-09 17:02:56
Question: I am writing a UDAF to be applied to a Spark DataFrame column of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package so that I do not have to go back and forth between DataFrames and RDDs. Inside the UDAF, I have to specify data types for the input, buffer, and output schemas: def inputSchema = new StructType().add("features", new VectorUDT()) def bufferSchema: StructType = StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)
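The question is about a Scala UDAF, but as a cross-check of the type itself: VectorUDT is an ordinary DataType and can sit inside StructType/ArrayType in PySpark as well. Since PySpark (before 3.0) has no UDAF API, a hedged workaround on the Python side is collect_list plus an ordinary UDF, sketched here on toy data:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, ArrayType
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, Vectors.dense([1.0, 0.0])), (1, Vectors.dense([0.0, 1.0]))],
        ["id", "features"])

    # VectorUDT() is a regular DataType, so it can appear inside a schema
    # (shown only to illustrate the declaration; not used further below)
    buffer_schema = StructType([
        StructField("list_of_similarities", ArrayType(VectorUDT(), True), True)])

    # collect_list + a plain UDF as a stand-in for a vector-typed aggregation
    def mean_vector(vs):
        n = len(vs)
        return Vectors.dense([sum(v[i] for v in vs) / n for i in range(len(vs[0]))])

    mean_udf = F.udf(mean_vector, VectorUDT())
    df.groupBy("id").agg(mean_udf(F.collect_list("features")).alias("mean")).show()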

StackOverflowError when operating with a large number of columns in Spark

Submitted by 柔情痞子 on 2019-12-09 00:01:37
Question: I have a wide dataframe (130000 rows x 8700 columns) and when I try to sum all the columns I'm getting the following error: Exception in thread "main" java.lang.StackOverflowError at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) at scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:59) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray
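Two mitigations commonly help here (stated as assumptions, not a diagnosis of this exact trace): raising the JVM thread stack size (e.g. passing -Xss through spark.driver.extraJavaOptions), or keeping the sum expression shallow by adding the columns in chunks instead of building one 8700-term nested expression. A PySpark sketch of the chunked sum on a tiny stand-in dataframe:

    from functools import reduce
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    # tiny stand-in for the 8700-column dataframe from the question
    df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ["c1", "c2", "c3"])

    def sum_all_columns(df, chunk=500):
        # Summing thousands of columns with one reduce() builds a very deep
        # expression tree; summing in chunks keeps each subtree shallow.
        cols = [F.col(c) for c in df.columns]
        partials = [reduce(lambda a, b: a + b, cols[i:i + chunk])
                    for i in range(0, len(cols), chunk)]
        return df.withColumn("total", reduce(lambda a, b: a + b, partials))

    sum_all_columns(df).show()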

Spark toDF() / createDataFrame() type inference doesn't work as expected

Submitted by 社会主义新天地 on 2019-12-08 14:00:06
Question: df = sc.parallelize([('1','a'),('2','b'),('3','c')]).toDF(['should_be_int','should_be_str']) df.printSchema() produces root |-- should_be_int: string (nullable = true) |-- should_be_str: string (nullable = true) Notice that should_be_int has a string datatype, even though, according to the documentation (https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection), Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the datatypes. Rows are constructed by
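toDF infers types from the Python objects themselves, so '1' (a str) comes out as a string column. A sketch of the two usual fixes: pass real ints, or keep the string data and declare/cast the schema explicitly:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # inference follows the Python types, so pass real ints instead of '1', '2', ...
    df1 = sc.parallelize([(1, 'a'), (2, 'b'), (3, 'c')]).toDF(
        ['should_be_int', 'should_be_str'])
    df1.printSchema()   # should_be_int comes out as a long

    # or keep the string input, state the schema, and cast the column afterwards
    schema = StructType([StructField('should_be_int', StringType(), True),
                         StructField('should_be_str', StringType(), True)])
    df2 = (spark.createDataFrame([('1', 'a'), ('2', 'b'), ('3', 'c')], schema)
           .withColumn('should_be_int', F.col('should_be_int').cast(IntegerType())))
    df2.printSchema()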

Spark: access an RDD inside another RDD

Submitted by 耗尽温柔 on 2019-12-08 12:50:39
Question: I have a lookup RDD of size 6000, lookup_rdd: RDD[String]: a1 a2 a3 a4 a5 ..... and another RDD, data_rdd: RDD[(String, Iterable[(String, Int)])]: (id, (item, count)), which has unique ids: (id1,List((a1,2), (a3,4))) (id2,List((a2,1), (a4,2), (a1,1))) (id3,List((a5,1))) For each element in lookup_rdd I want to check whether each id has that element or not; if it is there I put the count, and if not I put 0, and store the result in a file. What is an efficient way to achieve this? Is hashing possible? e.g.
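An RDD cannot be referenced inside another RDD's transformations, but the lookup side here is tiny (6000 strings), so the usual pattern is to collect it once and broadcast it. The question is Scala; a hedged PySpark sketch with the question's sample data and a hypothetical output path:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    lookup_rdd = sc.parallelize(['a1', 'a2', 'a3', 'a4', 'a5'])
    data_rdd = sc.parallelize([('id1', [('a1', 2), ('a3', 4)]),
                               ('id2', [('a2', 1), ('a4', 2), ('a1', 1)]),
                               ('id3', [('a5', 1)])])

    # collect the small lookup side and broadcast it to every executor
    lookup = sc.broadcast(sorted(lookup_rdd.collect()))

    def counts_for(items):
        d = dict(items)                              # item -> count for this id
        return [d.get(k, 0) for k in lookup.value]   # 0 when the id lacks the item

    result = data_rdd.map(lambda kv: (kv[0], counts_for(kv[1])))
    result.saveAsTextFile("counts_out")              # hypothetical output path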