spark-dataframe

Extracting tag attributes from XML using spark-xml

不羁岁月 submitted on 2019-12-02 08:27:57
I am loading an XML file with com.databricks.spark.xml and I want to read a tag attribute using the SQL context.

XML:

    <Receipt>
      <Sale>
        <DepartmentID>PR</DepartmentID>
        <Tax TaxExempt="false" TaxRate="10.25"/>
      </Sale>
    </Receipt>

I loaded the file with:

    val df = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "Receipt").load("/home/user/sale.xml")
    df.registerTempTable("SPtable")

Printing the schema:

    root
     |-- Sale: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- DepartmentID: long (nullable = true)
     |    |    |-- Tax: string (nullable = true)

Now I want to
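One way to get at the attribute, sketched under the assumption that spark-xml exposes XML attributes with its default "_" prefix once Tax is parsed as a struct (in the schema above Tax came back as a plain string, so the exact field layout depends on the spark-xml version and options in use):

    // Hypothetical sketch, not the asker's code: explode the Sale array and read the
    // TaxRate attribute. The "_TaxRate" field name assumes spark-xml's default
    // attributePrefix of "_".
    import org.apache.spark.sql.functions.explode

    val sales = df.select(explode(df("Sale")).as("sale"))
    sales.select("sale.DepartmentID", "sale.Tax._TaxRate").show()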

How to efficiently read multiple small Parquet files with Spark? Is there a CombineParquetInputFormat?

别等时光非礼了梦想. submitted on 2019-12-02 08:19:20
Spark generated multiple small Parquet files. How can one efficiently handle small Parquet files, both in the producer and in the consumer Spark jobs?

The most straightforward approach, IMHO, is to use repartition/coalesce (prefer coalesce unless the data is skewed and you want to create same-sized outputs) before writing the Parquet files, so that you do not create small files to begin with:

    df
      .map(<some transformation>)
      .filter(<some filter>)
      // ...
      .coalesce(<number of partitions>)
      .write
      .parquet(<path>)

The number of partitions could be calculated as the count of total rows in the DataFrame divided by some
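A sketch of that sizing idea (the rows-per-file target below is an assumed placeholder; tune it to your row width and file-size goal):

    // Hypothetical sizing: aim for roughly rowsPerFile rows per output file.
    val rowsPerFile   = 1000000L                        // assumed target, not from the question
    val totalRows     = df.count()
    val numPartitions = math.max(1, (totalRows / rowsPerFile).toInt)

    df.coalesce(numPartitions)
      .write
      .parquet("/path/to/output")                       // placeholder path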

Spark Window Functions require HiveContext?

删除回忆录丶 submitted on 2019-12-02 07:41:34
I am trying one of the window function examples for Spark from this blog: http://xinhstechblog.blogspot.in/2016/04/spark-window-functions-for-dataframes.html. I am getting the following error while running the program. My question: do we need a HiveContext to execute window functions in Spark?

    Exception in thread "main" org.apache.spark.sql.AnalysisException: Could not resolve window function 'avg'. Note that, using window functions currently requires a HiveContext;
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
        at org.apache.spark.sql.catalyst.analysis
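As the error message states, in Spark 1.x window functions are only resolved by a HiveContext; in Spark 2.x a plain SparkSession is enough. A minimal sketch of switching the example over (the column names are placeholders, not necessarily the blog's schema, and the spark-hive module must be on the classpath):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.avg
    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.read.json("/path/to/input.json")      // placeholder input

    val w = Window.partitionBy("customer")                     // placeholder column
    df.withColumn("avg_amount", avg("amount").over(w)).show()  // placeholder column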

How to enable Cartesian join in Spark 2.0? [duplicate]

给你一囗甜甜゛ submitted on 2019-12-02 07:22:46
This question already has an answer here: spark.sql.crossJoin.enabled for Spark 2.x (3 answers)

I have to cross join two DataFrames in Spark 2.0 and I am encountering the error below:

    User class threw exception: org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively expensive and are disabled by default. To explicitly enable them, please set spark.sql.crossJoin.enabled = true;

Please help me with where to set this configuration; I am coding in Eclipse.

As the error message clearly states, you need to set spark.sql.crossJoin.enabled = true in your Spark configuration. You can set the same
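A sketch of the two usual places to put that flag (the app name is a placeholder, and df1/df2 stand in for the two frames being joined):

    import org.apache.spark.sql.SparkSession

    // At session-build time...
    val spark = SparkSession.builder()
      .appName("cross-join-example")                      // placeholder name
      .config("spark.sql.crossJoin.enabled", "true")
      .getOrCreate()

    // ...or later at runtime.
    spark.conf.set("spark.sql.crossJoin.enabled", "true")

    val crossed = df1.join(df2)                           // a join without a condition is now allowed

From Spark 2.1 onward there is also an explicit df1.crossJoin(df2), which does not require the flag.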

dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions

 ̄綄美尐妖づ submitted on 2019-12-02 06:54:22
Question: I would like to join two Spark Scala DataFrames on multiple columns dynamically. I want to avoid hard-coding the column-name comparisons as in the following statement:

    val joinRes = df1.join(df2, df1("col1") === df2("col1") && df1("col2") === df2("col2"))

A solution for this already exists for the PySpark version, provided in the following link: PySpark DataFrame - Join on multiple columns dynamically. I would like to write the same code using Spark Scala.

Answer 1: In Scala you do it in similar
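A sketch of how the join condition can be built from a list of column names (the names in joinCols are placeholders assumed to exist in both frames):

    // Derive the join condition dynamically instead of hard-coding it.
    val joinCols = Seq("col1", "col2")

    // Option A: the Seq[String] overload, which also de-duplicates the join columns in the result.
    val joined  = df1.join(df2, joinCols)

    // Option B: keep both sides' columns by folding an explicit condition.
    val cond    = joinCols.map(c => df1(c) === df2(c)).reduce(_ && _)
    val joined2 = df1.join(df2, cond)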

How to Remove header and footer from Dataframe?

∥☆過路亽.° submitted on 2019-12-02 06:49:28
I am reading a text (not CSV) file that has a header, content and a footer, using spark.read.format("text").option("delimiter","|")...load(file). I can access the header with df.first(). Is there something close to df.last() or df.reverse().first()?

Sample data:

    col1|col2|col3
    100|hello|asdf
    300|hi|abc
    200|bye|xyz
    800|ciao|qwerty
    This is the footer line

Processing logic:

    # load text file
    txt = sc.textFile("path_to_above_sample_data_text_file.txt")

    # remove header
    header = txt.first()
    txt = txt.filter(lambda line: line != header)

    # remove footer
    txt = txt.map(lambda line: line.split("|"))\
        .filter

Transforming a column and updating the DataFrame

被刻印的时光 ゝ submitted on 2019-12-02 04:46:18
So, what I'm doing below is dropping a column A from a DataFrame because I want to apply a transformation (here I just json.loads a JSON string) and replace the old column with the transformed one. After the transformation I join the two resulting data frames:

    df = df_data.drop('A').join(
        df_data[['ID', 'A']].rdd
            .map(lambda x: (x.ID, json.loads(x.A)) if x.A is not None else (x.ID, None))
            .toDF()
            .withColumnRenamed('_1', 'ID')
            .withColumnRenamed('_2', 'A'),
        ['ID']
    )

The thing I dislike about this is, of course, the overhead I face because I had to do the withColumnRenamed operations.

Apache Spark Dataframe - Load data from nth line of a CSV file

丶灬走出姿态 submitted on 2019-12-02 04:04:51
I would like to process a huge order CSV file (5 GB) that has some metadata rows at the start of the file. The header columns are in row 4 (starting with "h,"), followed by another metadata row describing optionality. Data rows start with "d,":

    m,Version,v1.0
    m,Type,xx
    m,<OtherMetaData>,<...>
    h,Col1,Col2,Col3,Col4,Col5,.............,Col100
    m,Mandatory,Optional,Optional,...........,Mandatory
    d,Val1,Val2,Val3,Val4,Val5,.............,Val100

Is it possible to skip a specified number of rows when loading the file and still use the 'inferSchema' option for the Dataset?

    Dataset<Row> df = spark.read()
        .format("csv")
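Spark's CSV reader has no "skip N rows" option in Spark 2.x, so one common workaround is to filter on the row-type prefix before parsing. A Scala sketch of that idea (the question's own code is Java; the path is a placeholder, and csv(Dataset[String]) needs Spark 2.2+):

    // Keep only the header ("h,") and data ("d,") lines, strip the prefix, then parse as CSV.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("order-csv").getOrCreate()
    import spark.implicits._

    val lines = spark.read.textFile("/path/to/orders.csv")          // placeholder path
      .filter(l => l.startsWith("h,") || l.startsWith("d,"))
      .map(_.drop(2))                                               // drop the "h,"/"d," prefix

    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(lines)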

Spark exception "Complex types not supported" while loading Parquet

假装没事ソ submitted on 2019-12-02 03:46:23
I am trying to load a Parquet file in Spark as a DataFrame:

    val df = spark.read.parquet(path)

I am getting:

    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 12, 10.250.2.32): java.lang.UnsupportedOperationException: Complex types not supported.

While going through the code, I realized there is a check in Spark's VectorizedParquetRecordReader.java (initializeInternal):

    Type t = requestedSchema.getFields().get(i);
    if (!t.isPrimitive() || t.isRepetition(Type.Repetition.REPEATED)) {
      throw new
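A commonly suggested workaround, under the assumption that the file's nested/complex columns are what trip this check, is to turn off the vectorized Parquet reader so Spark falls back to the row-based read path. A minimal sketch:

    // Disable the vectorized Parquet reader for this session and retry the load.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    val df = spark.read.parquet(path)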

Pyspark - ValueError: could not convert string to float / invalid literal for float()

风格不统一 submitted on 2019-12-02 03:05:00
I am trying to use data from a Spark DataFrame as the input to my k-means model, but I keep getting errors (see the section after the code). My Spark DataFrame looks like this (and has around 1M rows):

    ID   col1  col2  Latitude  Longitude
    13   ...   ...   22.2      13.5
    62   ...   ...   21.4      13.8
    24   ...   ...   21.8      14.1
    71   ...   ...   28.9      18.0
    ...  ...   ...   ....      ....

Here is my code:

    from pyspark.ml.clustering import KMeans
    from pyspark.ml.linalg import Vectors

    df = spark.read.csv("file.csv")
    spark_rdd = df.rdd.map(lambda row: (row["ID"], Vectors.dense(row["Latitude"], row["Longitude"])))
    feature_df = spark_rdd.toDF(["ID