spark-csv

Spark CSV 2.1 File Names

Submitted by 假装没事ソ on 2021-02-07 08:32:28

Question: I'm trying to save a DataFrame as CSV using the new Spark 2.1 CSV writer:

    df.select(myColumns: _*).write
      .mode(SaveMode.Overwrite)
      .option("header", "true")
      .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
      .csv(absolutePath)

Everything works fine, and I don't mind having the part-000XX prefix, but now it seems a UUID is added as a suffix, e.g. part-00032-10309cf5-a373-4233-8b28-9e10ed279d2b.csv.gz, which I would like to become part-00032.csv.gz. Does anyone know how I can remove this UUID segment and keep only the part-00032.csv.gz form?
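Spark 2.1 has no writer option to drop that UUID, so a common workaround (a sketch of my own, not from the question) is to rename the part files after the write finishes. This assumes the output lands on a Hadoop-compatible filesystem and that outputDir stands in for the absolutePath above:

    import org.apache.hadoop.fs.{FileSystem, Path}

    // Rename each part file so the UUID segment disappears, e.g.
    // part-00032-10309cf5-<uuid>.csv.gz -> part-00032.csv.gz
    val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
    fs.listStatus(new Path(outputDir))
      .map(_.getPath)
      .filter(_.getName.startsWith("part-"))
      .foreach { p =>
        val trimmed = p.getName.replaceAll("^(part-\\d+).*(\\.csv\\.gz)$", "$1$2")
        fs.rename(p, new Path(p.getParent, trimmed))
      }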

Spark 2.1 cannot write Vector field on CSV

Submitted by 此生再无相见时 on 2020-06-27 21:55:37

Question: I was migrating my code from Spark 2.0 to 2.1 when I stumbled on a problem related to DataFrame saving. Here's the code:

    import org.apache.spark.sql.types._
    import org.apache.spark.ml.linalg.VectorUDT

    val df = spark.createDataFrame(Seq(Tuple1(1))).toDF("values")
    val toSave = new org.apache.spark.ml.feature.VectorAssembler()
      .setInputCols(Array("values"))
      .transform(df)
    toSave.write.csv(path)

This code succeeds with Spark 2.0.0. With Spark 2.1.0.cloudera1 I get the following error: java…
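The error text is cut off above, but the CSV data source cannot store Spark ML vector (UDT) columns, so the usual workaround is to flatten the vector before writing. A minimal sketch of my own (the explicit features output column is an assumption, since the question relies on the assembler's default name):

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.sql.functions.{col, udf}

    val assembled = new VectorAssembler()
      .setInputCols(Array("values"))
      .setOutputCol("features")
      .transform(df)

    // CSV only supports atomic column types, so serialize the vector
    // to a delimited string before handing the frame to the CSV writer.
    val vecToString = udf((v: Vector) => v.toArray.mkString(";"))
    assembled
      .withColumn("features", vecToString(col("features")))
      .write.csv(path)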

inferSchema in spark csv package

Submitted by 那年仲夏 on 2020-03-06 09:25:10

Question: I am trying to read a CSV file as a Spark DataFrame with inferSchema enabled, but afterwards I am unable to get fv_df.columns. Below is the error message:

    >>> fv_df = spark.read.option("header", "true").option("delimiter", "\t").csv('/home/h212957/FacilityView/datapoints_FV.csv', inferSchema=True)
    >>> fv_df.columns
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/h212957/spark/python/pyspark/sql/dataframe.py", line 687, in columns
        return [f.name for f in self.schema…
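The traceback is truncated, but failures like this surface while Spark materializes the inferred schema, and one way to sidestep inference entirely is to pass an explicit schema. A sketch using the Scala API, to stay consistent with the other snippets on this page (the column names are hypothetical, since the file's header is not shown):

    import org.apache.spark.sql.types._

    // Hypothetical columns; replace with the real header of datapoints_FV.csv.
    val fvSchema = StructType(Seq(
      StructField("facility_id", StringType, nullable = true),
      StructField("datapoint", DoubleType, nullable = true)
    ))

    val fvDf = spark.read
      .option("header", "true")
      .option("delimiter", "\t")
      .schema(fvSchema)
      .csv("/home/h212957/FacilityView/datapoints_FV.csv")
    fvDf.columns.foreach(println)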

inferSchema in spark-csv package

Submitted by 爱⌒轻易说出口 on 2020-01-01 09:22:55

Question: When a CSV is read as a DataFrame in Spark, all the columns are read as string. Is there any way to get the actual type of each column? I have the following CSV file:

    Name,Department,years_of_experience,DOB
    Sam,Software,5,1990-10-10
    Alex,Data Analytics,3,1992-10-10

I've read the CSV using the code below:

    val df = sqlContext.read.
      format("com.databricks.spark.csv").
      option("header", "true").
      option("inferSchema", "true").
      load(sampleAdDataS3Location)
    df.schema

All the columns are read as string. I…
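If inference is not taking effect, an alternative (my sketch, not from the question) is to hand spark-csv an explicit schema; the types below simply mirror the sample rows, and dateFormat is the spark-csv option for parsing date columns:

    import org.apache.spark.sql.types._

    val adSchema = StructType(Seq(
      StructField("Name", StringType, nullable = true),
      StructField("Department", StringType, nullable = true),
      StructField("years_of_experience", IntegerType, nullable = true),
      StructField("DOB", DateType, nullable = true)
    ))

    val typedDf = sqlContext.read.
      format("com.databricks.spark.csv").
      option("header", "true").
      option("dateFormat", "yyyy-MM-dd").
      schema(adSchema).
      load(sampleAdDataS3Location)
    typedDf.printSchema()   // shows the declared types instead of all-string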

Getting NullPointerException using spark-csv with DataFrames

Submitted by 十年热恋 on 2020-01-01 05:24:26

Question: Running through the spark-csv README, there's sample Java code like this:

    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.types.*;

    SQLContext sqlContext = new SQLContext(sc);
    StructType customSchema = new StructType(
        new StructField("year", IntegerType, true),
        new StructField("make", StringType, true),
        new StructField("model", StringType, true),
        new StructField("comment", StringType, true),
        new StructField("blank", StringType, true));
    DataFrame df = sqlContext.read()
        .format…
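For comparison, here is the same schema and read expressed with the Scala API, staying with the language of the other snippets on this page (a sketch; cars.csv and the header option follow the README's example). In Java, schemas are normally built through the DataTypes factory methods rather than the raw constructors shown above, which is worth checking when the snippet throws:

    import org.apache.spark.sql.types._

    val customSchema = StructType(Seq(
      StructField("year", IntegerType, nullable = true),
      StructField("make", StringType, nullable = true),
      StructField("model", StringType, nullable = true),
      StructField("comment", StringType, nullable = true),
      StructField("blank", StringType, nullable = true)
    ))

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .schema(customSchema)
      .load("cars.csv")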

How to change the header of a DataFrame with another DataFrame's header?

Submitted by ぐ巨炮叔叔 on 2019-12-25 11:35:39

Question: I have a data set which looks like this:

    LineItem.organizationId|^|LineItem.lineItemId|^|StatementTypeCode|^|LineItemName|^|LocalLanguageLabel|^|FinancialConceptLocal|^|FinancialConceptGlobal|^|IsDimensional|^|InstrumentId|^|LineItemSequence|^|PhysicalMeasureId|^|FinancialConceptCodeGlobalSecondary|^|IsRangeAllowed|^|IsSegmentedByOrigin|^|SegmentGroupDescription|^|SegmentChildDescription|^|SegmentChildLocalLanguageLabel|^|LocalLanguageLabel.languageId|^|LineItemName.languageId|^…
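The body is cut off, but the title asks how to replace one DataFrame's column names with those of another. A minimal sketch (dataDf and headerDf are hypothetical names), assuming both frames have the same number of columns in the same order:

    // dataDf: the frame whose header should be replaced.
    // headerDf: the frame whose column names should be copied over.
    val renamed = dataDf.toDF(headerDf.columns: _*)
    renamed.printSchema()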

DataFrame Object is not showing any data

Submitted by 坚强是说给别人听的谎言 on 2019-12-25 09:17:05

Question: I was trying to create a DataFrame object on an HDFS file using the spark-csv lib, as shown in this tutorial. But when I tried to get the count of the DataFrame object, it showed 0. Here is what my file looks like:

employee.csv:

    empid,empname
    1000,Tom
    2000,Jerry

I loaded the above file using:

    val empDf = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", ",")
      .load("hdfs:///user/.../employee.csv")

When I query the empDf object, printSchema() gives…
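The output is cut off above. One quick diagnostic (my addition, not from the question) is to read the same path as plain text first; if that also comes back empty, the problem is the HDFS path or permissions rather than the CSV parsing. The elided path is copied from the question:

    // Sanity check: bypass the CSV parser entirely.
    val rawLines = sqlContext.read.text("hdfs:///user/.../employee.csv")
    println(s"raw line count = ${rawLines.count()}")   // expect 3: header + 2 rows

    // Then compare with the parsed load.
    val empDf = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///user/.../employee.csv")
    println(s"parsed row count = ${empDf.count()}")    // expect 2
    empDf.show()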