spark-dataframe

Spark CSV 2.1 File Names

假装没事ソ · submitted on 2021-02-07 08:32:28
Question: I'm trying to save a DataFrame to CSV using the new Spark 2.1 csv writer:

    df.select(myColumns: _*).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .csv(absolutePath)

Everything works fine, and I don't mind having the part-000XX prefix, but now it seems some UUID is added as a suffix, i.e. part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz. Does anyone know how I can remove this extra suffix and keep only part-00032.csv.gz?
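
Spark 2.1 does not appear to expose a writer option that drops the UUID from the part-file names, so the usual workaround is to rename the output files after the write completes. Below is a minimal PySpark sketch of that idea using Hadoop's FileSystem API through py4j; out_dir is a stand-in for the absolutePath above, and an active SparkSession named spark is assumed.

    # Rename part-00032-<uuid>.csv.gz to part-00032.csv.gz after the write.
    out_dir = "/tmp/csv_out"  # stand-in for absolutePath
    sc = spark.sparkContext
    fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    Path = sc._jvm.org.apache.hadoop.fs.Path

    for status in fs.listStatus(Path(out_dir)):
        name = status.getPath().getName()
        if name.startswith("part-") and name.endswith(".csv.gz"):
            parts = name.split("-")
            # keep only "part" and the zero-padded partition index
            fs.rename(status.getPath(), Path(out_dir, parts[0] + "-" + parts[1] + ".csv.gz"))

Note this goes through private attributes (sc._jvm, sc._jsc), so it is a sketch rather than a supported API.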

Does Spark DataFrame have an equivalent option to Pandas' merge indicator?

假装没事ソ · submitted on 2021-02-07 08:17:51
Question: The Python pandas library contains the following function:

    DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None,
                    left_index=False, right_index=False, sort=False,
                    suffixes=('_x', '_y'), copy=True, indicator=False)

The indicator field, combined with pandas' value_counts() function, can be used to quickly determine how well a join performed. Example:

    In [48]: df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
    In [49]: df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': …
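
Spark has no built-in indicator option, but the same diagnostic can be reconstructed: tag each side with a literal column before a full outer join, then derive a _merge-style column and count it. A minimal sketch, with made-up col_right values since the example above is truncated:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df1 = spark.createDataFrame([(0, "a"), (1, "b")], ["col1", "col_left"])
    df2 = spark.createDataFrame([(1, "x"), (2, "y"), (2, "z")], ["col1", "col_right"])

    # Marker columns survive the outer join as NULL on the missing side.
    joined = (df1.withColumn("_left", F.lit(1))
                 .join(df2.withColumn("_right", F.lit(1)), on="col1", how="outer")
                 .withColumn("_merge",
                             F.when(F.col("_left").isNull(), "right_only")
                              .when(F.col("_right").isNull(), "left_only")
                              .otherwise("both"))
                 .drop("_left", "_right"))

    # The value_counts() equivalent:
    joined.groupBy("_merge").count().show()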

Apply custom function to cells of selected columns of a data frame in PySpark

和自甴很熟 · submitted on 2021-02-07 03:32:39
Question: Let's say I have a data frame which looks like this:

    +---+-----------+-----------+
    | id|   address1|   address2|
    +---+-----------+-----------+
    |  1|address 1.1|address 1.2|
    |  2|address 2.1|address 2.2|
    +---+-----------+-----------+

I would like to apply a custom function directly to the strings in the address1 and address2 columns, for example:

    def example(string1, string2):
        name_1 = string1.lower().split(' ')
        name_2 = string2.lower().split(' ')
        intersection_count = len(set(name_1) & set(name_2))
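
One way to do this is to wrap the function in a UDF and apply it with withColumn. The snippet above is cut off after computing intersection_count, so returning that count as an integer, and the result column name overlap, are my assumptions:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    def example(string1, string2):
        # Count the words the two address strings have in common.
        name_1 = string1.lower().split(' ')
        name_2 = string2.lower().split(' ')
        return len(set(name_1) & set(name_2))

    example_udf = udf(example, IntegerType())
    df = df.withColumn("overlap", example_udf(df["address1"], df["address2"]))
    df.show()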

convert dataframe to libsvm format

心不动则不痛 · submitted on 2021-02-06 11:11:59
Question: I have a dataframe resulting from a SQL query:

    df1 = sqlContext.sql("select * from table_test")

I need to convert this dataframe to libsvm format so that it can be provided as input for pyspark.ml.classification.LogisticRegression. I tried the following, but because I'm using Spark 1.5.2 it failed with this error:

    df1.write.format("libsvm").save("data/foo")
    Failed to load class for data source: libsvm

I wanted to use MLUtils.loadLibSVMFile instead. I'm behind a firewall and …
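
Since the libsvm data source does not exist in Spark 1.5.2, one workaround is to build the file yourself: map each row to an MLlib LabeledPoint and save it with MLUtils.saveAsLibSVMFile, both of which are available in that version. The sketch below assumes the first column of table_test is a numeric label and the remaining columns are numeric features; adapt the indexing to the real schema.

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.util import MLUtils

    # Row -> LabeledPoint(label, features); assumes an all-numeric schema.
    labeled = df1.rdd.map(
        lambda row: LabeledPoint(float(row[0]), [float(x) for x in row[1:]]))

    MLUtils.saveAsLibSVMFile(labeled, "data/foo")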

Load XML string from Column in PySpark

点点圈 · submitted on 2021-02-05 07:20:25
Question: I have a JSON file in which one of the columns is an XML string. I tried extracting this field and writing it to a file in the first step, then reading that file back in the next step, but each row has an XML header tag, so the resulting file is not a valid XML file. How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values? The following doesn't work:

    tr = spark.read.json("my-file-path")
    trans_xml = sqlContext.read.format('com.databricks.spark.xml') …
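
An alternative that avoids the invalid-file problem entirely is to parse the XML strings in place with a UDF, using Python's standard-library ElementTree instead of spark-xml (which, in this Spark version, reads files rather than columns). The column name xml_col and the tag value are hypothetical; adjust them to the real schema.

    import xml.etree.ElementTree as ET
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    def extract_value(xml_string):
        # ET.fromstring() tolerates a leading <?xml ...?> declaration per row.
        try:
            node = ET.fromstring(xml_string).find(".//value")  # hypothetical tag
            return node.text if node is not None else None
        except ET.ParseError:
            return None

    extract_value_udf = udf(extract_value, StringType())
    tr = spark.read.json("my-file-path")
    tr = tr.withColumn("parsed_value", extract_value_udf(tr["xml_col"]))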