spark-dataframe

Rename columns with special characters in a Python or PySpark dataframe

Submitted by 孤人 on 2019-12-24 02:17:13
Question: I have a data frame in Python/PySpark. The columns have special characters such as dots (.), spaces, parentheses (), and braces {} in their names. I want to rename the columns so that dots and spaces are replaced with underscores, and () and {} are removed from the names. I have done this:

df1 = df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))

With this I was able to replace the dots and spaces with underscores, but was unable to do the …
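
A minimal Scala sketch of the same idea (the question itself uses PySpark, where the DataFrame API is analogous): apply both substitutions to every column name and rebuild the DataFrame with toDF. The exact regular expressions are assumptions about which characters to handle.

// Replace runs of dots/whitespace with "_" and strip (), {} from each column name.
val renamed = df.toDF(df.columns.map { c =>
  c.replaceAll("[.\\s]+", "_").replaceAll("[(){}]", "")
}: _*)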

PySpark - Ranking columns keeping ties

Submitted by 耗尽温柔 on 2019-12-24 00:45:19
Question: I'm looking for a way to rank columns of a dataframe while preserving ties. Specifically for this example, I have a PySpark dataframe as follows, where I want to generate ranks for colA and colB (though I want to support ranking any number of columns):

+------+----------+----+----+
|Entity|        id|colA|colB|
+------+----------+----+----+
|     a|8589934652|  21|  50|
|     b|       112|   9|  23|
|     c|8589934629|   9|  23|
|     d|8589934702|   8|  21|
|     e|        20|   2|  21|
|     f|8589934657|   2|   5|
|     g|8589934601|   1|   5|
|     h …
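
A minimal Scala sketch of one way to rank while preserving ties (the PySpark Window API is analogous): dense_rank gives equal values the same rank. The descending order and the generated column names are assumptions, and a window with no partitioning pulls all rows into a single partition, so this only illustrates the ranking semantics.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

// Rank colA and colB independently; ties share a rank, with no gaps after them.
val ranked = df
  .withColumn("rank_colA", dense_rank().over(Window.orderBy(col("colA").desc)))
  .withColumn("rank_colB", dense_rank().over(Window.orderBy(col("colB").desc)))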

Error while reading very large files with the spark-csv package

Submitted by 佐手、 on 2019-12-23 18:36:42
Question: We are trying to read a 3 GB file that has multiple newline characters in one of its columns, using spark-csv and the univocity 1.5.0 parser, but in some rows the data is getting split into multiple columns on the basis of the newline character. This only happens with large files. We are using Spark 1.6.1 and Scala 2.10. The following code is what I'm using to read the file:

sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", …
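
Not a fix for the Spark 1.6.1 / spark-csv setup above, but a minimal sketch of how this is commonly handled on newer versions: the built-in CSV reader in Spark 2.2+ has a multiLine option that keeps quoted fields containing newlines inside a single record instead of splitting at every line break. The file path and the SparkSession named spark are placeholders.

// Assumes a SparkSession called `spark` and Spark 2.2 or later.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")      // parse quoted, embedded newlines as part of one record
  .csv("/path/to/large_file.csv")   // placeholder path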

Select specific columns in Spark DataFrames from an array of structs

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-23 17:15:32
Question: I have a Spark DataFrame df with the following schema:

root
 |-- k: integer (nullable = false)
 |-- v: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: integer (nullable = false)
 |    |    |-- b: double (nullable = false)
 |    |    |-- c: string (nullable = true)

Is it possible to just select a, c in v from df without doing a map? In particular, df is loaded from a Parquet file and I don't want the values for c to even be loaded/read.

Answer 1: It depends on exactly what you …
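
A minimal sketch of projecting struct fields out of the array column without a map (assuming the schema above): dot notation on an array-of-struct column yields, per row, the array of that field's values. Whether Parquet then skips the unselected fields depends on the Spark version's nested-column pruning, so that part is not guaranteed by this snippet alone.

import org.apache.spark.sql.functions.col

// Produces k plus array<int> and array<string> columns built from the a and c fields of v.
val projected = df.select(col("k"), col("v.a").as("a"), col("v.c").as("c"))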

Handle database connections inside Spark Streaming

Submitted by ⅰ亾dé卋堺 on 2019-12-23 16:24:20
Question: I am not sure whether I understand correctly how Spark handles database connections, and how to reliably run a large number of database update operations inside Spark without potentially breaking the Spark job. This is a code snippet I have been using (for easy illustration):

val driver = new MongoDriver
val hostList: List[String] = conf.getString("mongo.hosts").split(",").toList
val connection = driver.connection(hostList)
val mongodb = connection(conf.getString("mongo.db"))
val dailyInventoryCol = …
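
A minimal sketch of the usual pattern for external connections in Spark Streaming: open the client inside foreachPartition so it is created on the executor, reused for all records of the partition, and closed afterwards, rather than being captured in the closure on the driver. Here dstream, createConnection and writeRecord are hypothetical placeholders, not the asker's code.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = createConnection()                          // hypothetical: open a client on the executor
    records.foreach(record => writeRecord(connection, record))   // hypothetical per-record write
    connection.close()                                           // release the client when the partition is done
  }
}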

Should the DataFrame function groupBy be avoided?

Submitted by 风流意气都作罢 on 2019-12-23 16:17:05
Question: This link and others tell me that Spark's groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well, or is this something different? I'm asking because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this …
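
A minimal Scala sketch of the distinction (the PySpark call is analogous; the column names are placeholders): a DataFrame groupBy followed by an aggregate is planned with partial, map-side aggregation, so only the partially aggregated results are shuffled, unlike RDD groupByKey, which shuffles every record.

import org.apache.spark.sql.functions.sum

// groupBy + agg aggregates within each partition first, then shuffles only the partial sums.
val totals = df.groupBy("key").agg(sum("value").as("total"))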

Spark on Hive SQL query error NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT

Submitted by 爷,独闯天下 on 2019-12-23 10:16:08
Question: I get the following error while submitting a Spark 1.6.0 SQL application against Hive 2.1.0:

Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
    at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:512)
    at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:252)
    at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:239)
    at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:443)
    at org.apache.spark.sql…
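
For context, a minimal sketch of the setup in which this surfaces (an assumption about the cause, not a confirmed fix): Spark 1.6 is built against the Hive 1.2.1 client, where HiveConf still has the HIVE_STATS_JDBC_TIMEOUT field; the field was removed in Hive 2.x, so this error typically means Hive 2.x jars are shadowing Spark's bundled Hive client on the driver classpath. With only Spark's own Hive client jars on the classpath, creating the HiveContext looks like this.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Uses the Hive client Spark 1.6 ships with (1.2.1); fails with NoSuchFieldError
// if newer Hive jars are placed ahead of it on the classpath.
val sc = new SparkContext(new SparkConf().setAppName("hive-check"))
val hiveContext = new HiveContext(sc)
hiveContext.sql("SHOW DATABASES").show()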

Getting exception: java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;) while using DataFrames

Submitted by 本小妞迷上赌 on 2019-12-23 09:16:50
Question: I am receiving the error "java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)" while using DataFrames in a Scala app and running it with Spark. However, if I work using only RDDs and not DataFrames, no such error comes up with the same pom and settings. Also, while going through other posts with the same error, it is mentioned that the Scala version has to be 2.10 as Spark is not compatible with Scala 2.11, and I am using Scala 2.10 with Spark 2.0.0. Below …
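
A minimal build sketch of what usually resolves this (shown in sbt for brevity; the asker uses Maven, where the equivalent is matching the _2.11 artifact suffix): the NoSuchMethodError on JavaUniverse.runtimeMirror almost always indicates that the app is compiled with a different Scala version than the Spark artifacts were built for, and the default Spark 2.0.0 artifacts target Scala 2.11.

// build.sbt (illustrative)
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",  // %% picks the _2.11 artifacts
  "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided"
)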

Spark - Scala: not a member of org.apache.spark.sql.Row

Submitted by 筅森魡賤 on 2019-12-23 08:25:33
Question: I am trying to convert a data frame to an RDD, then perform the operation below to return tuples:

df.rdd.map { t => (t._2 + "_" + t._3, t) }.take(5)

Then I got the error below. Anyone have any ideas? Thanks!

<console>:37: error: value _2 is not a member of org.apache.spark.sql.Row
       (t._2 + "_" + t._3 , t)
          ^

Answer 1: When you convert a DataFrame to an RDD, you get an RDD[Row], so when you use map, your function receives a Row as a parameter. Therefore, you must use the Row methods to access its members …
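
A minimal sketch of the Row-based version (assuming the two fields are strings; note that Row accessors are zero-based, unlike tuple accessors):

// Access fields through Row's getters rather than tuple-style _2/_3.
df.rdd.map { row =>
  (row.getAs[String](1) + "_" + row.getAs[String](2), row)
}.take(5)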

Spark UDF error - Schema for type Any is not supported

Submitted by 百般思念 on 2019-12-23 07:49:27
Question: I'm trying to create a UDF that will replace negative values in a column with 0. My dataframe is called df and contains one column called avg_x. This is my code for creating the UDF:

val noNegative = udf {(avg_acc_x: Double) => if(avg_acc_x < 0) 0 else "avg_acc_x"}

I get this error:

java.lang.UnsupportedOperationException: Schema for type Any is not supported

df.printSchema returns

|-- avg_acc_x: double (nullable = false)

so I don't understand why this error is occurring.

Answer 1: It's because of …
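
A minimal sketch of the usual fix (an assumption based on the snippet above): the two branches of the if return an Int and a String, so Scala infers the return type Any, which Spark cannot map to a column type. Returning a Double from both branches gives the UDF a concrete schema.

import org.apache.spark.sql.functions.udf

// Both branches now return Double, so the UDF's return type is DoubleType rather than Any.
val noNegative = udf { (avg_acc_x: Double) => if (avg_acc_x < 0) 0.0 else avg_acc_x }

val cleaned = df.withColumn("avg_acc_x", noNegative(df("avg_acc_x")))  // illustrative usage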