spark-dataframe

Spark CSV 2.1 File Names

假装没事ソ · submitted on 2021-02-07 08:32:28
Question: I'm trying to save a DataFrame to CSV using the new Spark 2.1 csv writer:

    df.select(myColumns: _*).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .csv(absolutePath)

Everything works fine, and I don't mind having the part-000XX prefix, but now it seems some UUID is added as a suffix, i.e. part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz. Does anyone know how I can remove this extra suffix and keep only part-00032.csv.gz?
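
Spark 2.1 does not appear to expose a writer option that drops the UUID from the part-file names, so the usual workaround is to rename the output files after the write completes. Below is a minimal PySpark sketch of that idea using Hadoop's FileSystem API through py4j; out_dir is a stand-in for the absolutePath above, and an active SparkSession named spark is assumed.

    # Rename part-00032-<uuid>.csv.gz to part-00032.csv.gz after the write.
    out_dir = "/tmp/csv_out"  # stand-in for absolutePath
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    Path = sc._jvm.org.apache.hadoop.fs.Path

    for status in fs.listStatus(Path(out_dir)):
        name = status.getPath().getName()
        if name.startswith("part-") and name.endswith(".csv.gz"):
            parts = name.split("-")
            # keep only "part" and the zero-padded partition index
            fs.rename(status.getPath(), Path(out_dir, parts[0] + "-" + parts[1] + ".csv.gz"))

Note this goes through private attributes (sc._jvm, sc._jsc), so it is a sketch rather than a supported API.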

Does Spark DataFrame have an equivalent option to Pandas' merge indicator?

假装没事ソ · submitted on 2021-02-07 08:17:51
Question: The Python pandas library contains the following function:

    DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None,
                    left_index=False, right_index=False, sort=False,
                    suffixes=('_x', '_y'), copy=True, indicator=False)

The indicator field, combined with pandas' value_counts() function, can be used to quickly determine how well a join performed. Example:

    In [48]: df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
    In [49]: df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': …
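
Spark has no built-in indicator option, but the same diagnostic can be reconstructed: tag each side with a literal column before a full outer join, then derive a _merge-style column and count it. A minimal sketch, with made-up col_right values since the example above is truncated:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(0, "a"), (1, "b")], ["col1", "col_left"])
    df2 = spark.createDataFrame([(1, "x"), (2, "y"), (2, "z")], ["col1", "col_right"])

    # Marker columns survive the outer join as NULL on the missing side.
    joined = (df1.withColumn("_left", F.lit(1))
                 .join(df2.withColumn("_right", F.lit(1)), on="col1", how="outer")
                 .withColumn("_merge",
                             F.when(F.col("_left").isNull(), "right_only")
                              .when(F.col("_right").isNull(), "left_only")
                              .otherwise("both"))
                 .drop("_left", "_right"))

    # The value_counts() equivalent:
    joined.groupBy("_merge").count().show()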

Apply custom function to cells of selected columns of a data frame in PySpark

和自甴很熟 · submitted on 2021-02-07 03:32:39
Question: Let's say I have a data frame which looks like this:

    +---+-----------+-----------+
    | id|   address1|   address2|
    +---+-----------+-----------+
    |  1|address 1.1|address 1.2|
    |  2|address 2.1|address 2.2|
    +---+-----------+-----------+

I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:

    def example(string1, string2):
        name_1 = string1.lower().split(' ')
        name_2 = string2.lower().split(' ')
        intersection_count = len(set(name_1) & set(name_2))
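
One way to do this is to wrap the function in a UDF and apply it with withColumn. The snippet above is cut off after computing intersection_count, so returning that count as an integer, and the result column name overlap, are my assumptions:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    def example(string1, string2):
        # Count the words the two address strings have in common.
        name_1 = string1.lower().split(' ')
        name_2 = string2.lower().split(' ')
        return len(set(name_1) & set(name_2))

    example_udf = udf(example, IntegerType())
    df = df.withColumn("overlap", example_udf(df["address1"], df["address2"]))
    df.show()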

convert dataframe to libsvm format

心不动则不痛 · submitted on 2021-02-06 11:11:59
Question: I have a dataframe resulting from a SQL query:

    df1 = sqlContext.sql("select * from table_test")

I need to convert this dataframe to libsvm format so that it can be provided as input for pyspark.ml.classification.LogisticRegression. I tried the following, but because I'm using Spark 1.5.2 it failed with this error:

    df1.write.format("libsvm").save("data/foo")
    Failed to load class for data source: libsvm

I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and …
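
Since the libsvm data source does not exist in Spark 1.5.2, one workaround is to build the file yourself: map each row to an MLlib LabeledPoint and save it with MLUtils.saveAsLibSVMFile, both of which are available in that version. The sketch below assumes the first column of table_test is a numeric label and the remaining columns are numeric features; adapt the indexing to the real schema.

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.util import MLUtils

    # Row -> LabeledPoint(label, features); assumes an all-numeric schema.
    labeled = df1.rdd.map(
        lambda row: LabeledPoint(float(row[0]), [float(x) for x in row[1:]]))

    MLUtils.saveAsLibSVMFile(labeled, "data/foo")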

Load XML string from Column in PySpark

点点圈 · submitted on 2021-02-05 07:20:25
Question: I have a JSON file in which one of the columns is an XML string. I tried extracting this field and writing it to a file in the first step, then reading that file back in the next step, but each row has an XML header tag, so the resulting file is not a valid XML file. How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values? The following doesn't work:

    tr = spark.read.json("my-file-path")
    trans_xml = sqlContext.read.format('com.databricks.spark.xml') …
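
An alternative that avoids the invalid-file problem entirely is to parse the XML strings in place with a UDF, using Python's standard-library ElementTree instead of spark-xml (which, in this Spark version, reads files rather than columns). The column name xml_col and the tag value are hypothetical; adjust them to the real schema.

    import xml.etree.ElementTree as ET
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def extract_value(xml_string):
        # ET.fromstring() tolerates a leading <?xml ...?> declaration per row.
        try:
            node = ET.fromstring(xml_string).find(".//value")  # hypothetical tag
            return node.text if node is not None else None
        except ET.ParseError:
            return None

    extract_value_udf = udf(extract_value, StringType())
    tr = spark.read.json("my-file-path")
    tr = tr.withColumn("parsed_value", extract_value_udf(tr["xml_col"]))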