spark-dataframe

Rename columns with special characters in a Python or PySpark dataframe

Submitted by 孤人 on 2019-12-24 02:17:13
Question: I have a data frame in Python/PySpark. The columns have special characters such as dots (.), spaces, parentheses (), and braces {} in their names. I want to rename the columns so that dots and spaces are replaced with underscores, and () and {} are removed from the names. I have done this:

df1 = df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))

With this I was able to replace the dots and spaces with underscores, but was unable to do the …
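
A minimal Scala sketch of the same idea (the question itself uses PySpark, where the DataFrame API is analogous): apply both substitutions to every column name and rebuild the DataFrame with toDF. The exact regular expressions are assumptions about which characters to handle.

// Replace runs of dots/whitespace with "_" and strip (), {} from each column name.
val renamed = df.toDF(df.columns.map { c =>
  c.replaceAll("[.\\s]+", "_").replaceAll("[(){}]", "")
}: _*)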

PySpark - Ranking columns keeping ties

Submitted by 耗尽温柔 on 2019-12-24 00:45:19
Question: I'm looking for a way to rank columns of a dataframe while preserving ties. Specifically for this example, I have a PySpark dataframe as follows, where I want to generate ranks for colA and colB (though I want to support ranking any number of columns):

+------+----------+----+----+
|Entity|        id|colA|colB|
+------+----------+----+----+
|     a|8589934652|  21|  50|
|     b|       112|   9|  23|
|     c|8589934629|   9|  23|
|     d|8589934702|   8|  21|
|     e|        20|   2|  21|
|     f|8589934657|   2|   5|
|     g|8589934601|   1|   5|
|     h …
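
A minimal Scala sketch of one way to rank while preserving ties (the PySpark Window API is analogous): dense_rank gives equal values the same rank. The descending order and the generated column names are assumptions, and a window with no partitioning pulls all rows into a single partition, so this only illustrates the ranking semantics.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, dense_rank}

// Rank colA and colB independently; ties share a rank, with no gaps after them.
val ranked = df
  .withColumn("rank_colA", dense_rank().over(Window.orderBy(col("colA").desc)))
  .withColumn("rank_colB", dense_rank().over(Window.orderBy(col("colB").desc)))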

Error while reading very large files with the spark-csv package

Submitted by 佐手、 on 2019-12-23 18:36:42
Question: We are trying to read a 3 GB file that has multiple newline characters in one of its columns, using spark-csv and the univocity 1.5.0 parser, but in some rows the data is getting split into multiple columns on the basis of the newline character. This only happens with large files. We are using Spark 1.6.1 and Scala 2.10. The following code is what I'm using to read the file:

sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode", …
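
Not a fix for the Spark 1.6.1 / spark-csv setup above, but a minimal sketch of how this is commonly handled on newer versions: the built-in CSV reader in Spark 2.2+ has a multiLine option that keeps quoted fields containing newlines inside a single record instead of splitting at every line break. The file path and the SparkSession named spark are placeholders.

// Assumes a SparkSession called `spark` and Spark 2.2 or later.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")      // parse quoted, embedded newlines as part of one record
  .csv("/path/to/large_file.csv")   // placeholder path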

Select specific columns in Spark DataFrames from an array of structs

Submitted by 倾然丶 夕夏残阳落幕 on 2019-12-23 17:15:32
Question: I have a Spark DataFrame df with the following schema:

root
 |-- k: integer (nullable = false)
 |-- v: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: integer (nullable = false)
 |    |    |-- b: double (nullable = false)
 |    |    |-- c: string (nullable = true)

Is it possible to just select a, c in v from df without doing a map? In particular, df is loaded from a Parquet file and I don't want the values for c to even be loaded/read.

Answer 1: It depends on exactly what you …
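
A minimal sketch of projecting struct fields out of the array column without a map (assuming the schema above): dot notation on an array-of-struct column yields, per row, the array of that field's values. Whether Parquet then skips the unselected fields depends on the Spark version's nested-column pruning, so that part is not guaranteed by this snippet alone.

import org.apache.spark.sql.functions.col

// Produces k plus array<int> and array<string> columns built from the a and c fields of v.
val projected = df.select(col("k"), col("v.a").as("a"), col("v.c").as("c"))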

Handle database connections inside Spark Streaming

Submitted by ⅰ亾dé卋堺 on 2019-12-23 16:24:20
Question: I am not sure whether I understand correctly how Spark handles database connections, and how to reliably run a large number of database update operations inside Spark without potentially breaking the Spark job. This is a code snippet I have been using (for easy illustration):

val driver = new MongoDriver
val hostList: List[String] = conf.getString("mongo.hosts").split(",").toList
val connection = driver.connection(hostList)
val mongodb = connection(conf.getString("mongo.db"))
val dailyInventoryCol = …
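
A minimal sketch of the usual pattern for external connections in Spark Streaming: open the client inside foreachPartition so it is created on the executor, reused for all records of the partition, and closed afterwards, rather than being captured in the closure on the driver. Here dstream, createConnection and writeRecord are hypothetical placeholders, not the asker's code.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val connection = createConnection()                          // hypothetical: open a client on the executor
    records.foreach(record => writeRecord(connection, record))   // hypothetical per-record write
    connection.close()                                           // release the client when the partition is done
  }
}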

Should the DataFrame function groupBy be avoided?

Submitted by 风流意气都作罢 on 2019-12-23 16:17:05
Question: This link and others tell me that Spark's groupByKey is not to be used if there is a large number of keys, since Spark shuffles all the keys around. Does the same apply to the groupBy function as well, or is this something different? I'm asking because I want to do what this question tries to do, but I have a very large number of keys. It should be possible to do this without shuffling all the data around by reducing on each node locally, but I can't find the PySpark way to do this …
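
A minimal Scala sketch of the distinction (the PySpark call is analogous; the column names are placeholders): a DataFrame groupBy followed by an aggregate is planned with partial, map-side aggregation, so only the partially aggregated results are shuffled, unlike RDD groupByKey, which shuffles every record.

import org.apache.spark.sql.functions.sum

// groupBy + agg aggregates within each partition first, then shuffles only the partial sums.
val totals = df.groupBy("key").agg(sum("value").as("total"))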

Spark on Hive SQL query error NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT

Submitted by 爷,独闯天下 on 2019-12-23 10:16:08
Question: I get the following error while submitting a Spark 1.6.0 SQL application against Hive 2.1.0:

Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
    at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:512)
    at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:252)
    at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:239)
    at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:443)
    at org.apache.spark.sql…
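
For context, a minimal sketch of the setup in which this surfaces (an assumption about the cause, not a confirmed fix): Spark 1.6 is built against the Hive 1.2.1 client, where HiveConf still has the HIVE_STATS_JDBC_TIMEOUT field; the field was removed in Hive 2.x, so this error typically means Hive 2.x jars are shadowing Spark's bundled Hive client on the driver classpath. With only Spark's own Hive client jars on the classpath, creating the HiveContext looks like this.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Uses the Hive client Spark 1.6 ships with (1.2.1); fails with NoSuchFieldError
// if newer Hive jars are placed ahead of it on the classpath.
val sc = new SparkContext(new SparkConf().setAppName("hive-check"))
val hiveContext = new HiveContext(sc)
hiveContext.sql("SHOW DATABASES").show()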

Getting exception: java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;) while using DataFrames

Submitted by 本小妞迷上赌 on 2019-12-23 09:16:50
Question: I am receiving the error "java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)" while using DataFrames in a Scala app and running it with Spark. However, if I work using only RDDs and not DataFrames, no such error comes up with the same pom and settings. Also, while going through other posts with the same error, it is mentioned that the Scala version has to be 2.10 as Spark is not compatible with Scala 2.11, and I am using Scala 2.10 with Spark 2.0.0. Below …
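
A minimal build sketch of what usually resolves this (shown in sbt for brevity; the asker uses Maven, where the equivalent is matching the _2.11 artifact suffix): the NoSuchMethodError on JavaUniverse.runtimeMirror almost always indicates that the app is compiled with a different Scala version than the Spark artifacts were built for, and the default Spark 2.0.0 artifacts target Scala 2.11.

// build.sbt (illustrative)
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",  // %% picks the _2.11 artifacts
  "org.apache.spark" %% "spark-sql"  % "2.0.0" % "provided"
)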

Spark - Scala: not a member of org.apache.spark.sql.Row

Submitted by 筅森魡賤 on 2019-12-23 08:25:33
Question: I am trying to convert a data frame to an RDD, then perform the operation below to return tuples:

df.rdd.map { t => (t._2 + "_" + t._3, t) }.take(5)

Then I got the error below. Anyone have any ideas? Thanks!

<console>:37: error: value _2 is not a member of org.apache.spark.sql.Row
       (t._2 + "_" + t._3 , t)
          ^

Answer 1: When you convert a DataFrame to an RDD, you get an RDD[Row], so when you use map, your function receives a Row as a parameter. Therefore, you must use the Row methods to access its members …
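
A minimal sketch of the Row-based version (assuming the two fields are strings; note that Row accessors are zero-based, unlike tuple accessors):

// Access fields through Row's getters rather than tuple-style _2/_3.
df.rdd.map { row =>
  (row.getAs[String](1) + "_" + row.getAs[String](2), row)
}.take(5)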

Spark UDF error - Schema for type Any is not supported

Submitted by 百般思念 on 2019-12-23 07:49:27
Question: I'm trying to create a UDF that will replace negative values in a column with 0. My dataframe is called df and contains one column called avg_x. This is my code for creating the UDF:

val noNegative = udf {(avg_acc_x: Double) => if(avg_acc_x < 0) 0 else "avg_acc_x"}

I get this error:

java.lang.UnsupportedOperationException: Schema for type Any is not supported

df.printSchema returns

|-- avg_acc_x: double (nullable = false)

so I don't understand why this error is occurring.

Answer 1: It's because of …
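
A minimal sketch of the usual fix (an assumption based on the snippet above): the two branches of the if return an Int and a String, so Scala infers the return type Any, which Spark cannot map to a column type. Returning a Double from both branches gives the UDF a concrete schema.

import org.apache.spark.sql.functions.udf

// Both branches now return Double, so the UDF's return type is DoubleType rather than Any.
val noNegative = udf { (avg_acc_x: Double) => if (avg_acc_x < 0) 0.0 else avg_acc_x }

val cleaned = df.withColumn("avg_acc_x", noNegative(df("avg_acc_x")))  // illustrative usage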