spark-csv

Error while reading very large files with the spark-csv package

佐手、 Submitted on 2019-12-23 18:36:42
Question: We are trying to read a 3 GB file that has multiple newline characters in one of its columns, using spark-csv with the univocity 1.5.0 parser, but some rows are being split into extra columns at those newline characters. This happens with large files. We are using Spark 1.6.1 and Scala 2.10. The following code is used to read the file:

sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("mode",
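
spark-csv splits the input into records on newlines before the parser ever sees the quoting, so quoted fields containing newlines break records; Spark 2.2+ added a multiLine option to the built-in CSV reader that handles this. A minimal sketch, assuming an upgrade to Spark 2.2+ is possible and using a hypothetical file path:

import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.2+; "multiLine" keeps embedded newlines inside
// quoted fields in a single row instead of splitting the record.
val spark = SparkSession.builder.appName("multiline-csv").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("multiLine", "true")
  .option("escape", "\"")        // adjust to how the file escapes quotes
  .csv("/path/to/large_file.csv")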

Spark DataFrame handling empty String in OneHotEncoder

社会主义新天地 Submitted on 2019-12-19 17:45:31
Question: I am importing a CSV file (using spark-csv) into a DataFrame that has empty String values. When the OneHotEncoder is applied, the application crashes with the error: requirement failed: Cannot have an empty string for name. Is there a way I can get around this? I could reproduce the error with the example provided on the Spark ML page:

val df = sqlContext.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, ""), // <- the original example has "a" here
  (4, "a"),
  (5, "c")
)).toDF("id", "category")
val
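
The error comes from the ML attribute machinery, which cannot use "" as a name, so one workaround is to rewrite empty strings to a placeholder label before indexing. A minimal sketch; the placeholder "EMPTY" is an arbitrary choice:

import org.apache.spark.sql.functions.when
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}
import sqlContext.implicits._

// Map "" to a real label so OneHotEncoder never sees an empty name.
val cleaned = df.withColumn("category",
  when($"category" === "", "EMPTY").otherwise($"category"))

val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(cleaned)
  .transform(cleaned)

val encoded = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .transform(indexed)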

Scala: Spark SQL to_date(unix_timestamp) returning NULL

别说谁变了你拦得住时间么 Submitted on 2019-12-18 17:02:46
Question: Spark version: spark-2.0.1-bin-hadoop2.7, Scala: 2.11.8. I am loading a raw CSV into a DataFrame. In the CSV, although the column is supposed to be in date format, the values are written as 20161025 instead of 2016-10-25. The parameter date_format holds the names of the columns that need to be converted to yyyy-mm-dd format. In the following code, I first load the Date column as StringType via the schema, and then I check whether date_format is non-empty, i.e. whether there are columns that need to
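
The usual culprit behind NULL results here is a pattern mismatch: unix_timestamp must be given the format the string is actually written in, and pattern letters are case-sensitive (MM is month, mm is minute), so a pattern like "yyyy-mm-dd" silently yields NULL for "20161025". A minimal sketch, with a hypothetical column name date_col:

import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

// Parse "20161025"-style strings with the pattern they are written in.
val withDate = df.withColumn("date_col",
  to_date(unix_timestamp(col("date_col"), "yyyyMMdd").cast("timestamp")))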

How to force inferSchema for CSV to consider integers as dates (with “dateFormat” option)?

冷暖自知 Submitted on 2019-12-18 05:08:28
Question: I use Spark 2.2.0 and am reading a CSV file as follows:

val dataFrame = spark.read
  .option("inferSchema", "true")
  .option("header", true)
  .option("dateFormat", "yyyyMMdd")
  .csv(pathToCSVFile)

There is one date column in this file, and every record has the value 20171001 in that column. The issue is that Spark infers the type of this column as integer rather than date. When I remove the "inferSchema" option, the type of that column is string. There is no null
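
Schema inference in Spark 2.2's CSV reader never produces DateType at all (it only knows numerics, timestamps, booleans, and strings), so an all-digit column such as 20171001 comes out as integer; dateFormat only takes effect once the column's type is fixed to DateType. The straightforward way around it is to skip inference and declare the schema explicitly. A sketch with hypothetical column names:

import org.apache.spark.sql.types.{StructType, StructField, StringType, DateType}

// Declaring event_date as DateType makes the reader apply "dateFormat".
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("event_date", DateType, nullable = true)))

val dataFrame = spark.read
  .option("header", true)
  .option("dateFormat", "yyyyMMdd")
  .schema(schema)
  .csv(pathToCSVFile)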

Custom schema in spark-csv throwing error in spark 1.4.1

只愿长相守 Submitted on 2019-12-13 01:27:30
Question: I am trying to process a CSV file using the spark-csv package in spark-shell on Spark 1.4.1.

scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext

scala> import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.hive.orc._

scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

scala> val hiveContext = new org.apache.spark.sql
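
For reference, a custom schema is normally handed to spark-csv through .schema(...) on the reader, which also skips inference. A minimal sketch for Spark 1.4 with spark-csv, using hypothetical field names and path:

import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val customSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

val df = hiveContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)
  .load("/path/to/file.csv")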

Programmatically generate the schema AND the data for a dataframe in Apache Spark

穿精又带淫゛_ Submitted on 2019-12-11 00:19:38
Question: I would like to dynamically generate a DataFrame containing a header record for a report, i.e. create a DataFrame from the value of the string below:

val headerDescs: String = "Name,Age,Location"
val headerSchema = StructType(headerDescs.split(",").map(fieldName => StructField(fieldName, StringType, true)))

However, now I want to do the same for the data (which is in effect the same data, i.e. the metadata). I create an RDD:

val headerRDD = sc.parallelize(headerDescs.split(","))

I then
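
The missing step is that sc.parallelize(headerDescs.split(",")) yields one record per field, whereas the schema expects each record to be a single Row holding all three fields. Wrapping the split values in one Row before calling createDataFrame fixes that. A minimal sketch:

import org.apache.spark.sql.Row

// One Row containing all fields, not one record per field.
val headerRow = Row.fromSeq(headerDescs.split(",").toSeq)
val rowRDD = sc.parallelize(Seq(headerRow))
val headerDf = sqlContext.createDataFrame(rowRDD, headerSchema)

headerDf.show()  // a single row repeating Name, Age, Location under the header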

Parse Micro/Nano Seconds timestamp in spark-csv DataFrame reader: Inconsistent results

此生再无相见时 Submitted on 2019-12-10 23:09:31
Question: I'm trying to read a CSV file whose timestamps go down to nanoseconds (Spark 2.4.0, Scala 2.11.11). Sample content of TestTimestamp.csv:

101,2019-SEP-23 11.42.35.456789123 AM

I tried to read it using timestampFormat = "yyyy-MMM-dd hh.mm.ss.SSSSSSSSS aaa":

val dataSchema = StructType(Array(
  StructField("ID", DoubleType, true),
  StructField("Created_TS", TimestampType, true)))
val data = spark.read.format("csv")
  .option("header", "false")
  .option("inferSchema",
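
Behind the inconsistency: Spark's TimestampType stores at most microseconds, and in 2.4 the timestamp pattern is interpreted SimpleDateFormat-style, where S means milliseconds, so nine fractional digits are read as 456789123 milliseconds and shift the value. One hedged workaround is to read the column as a string, truncate the fraction to the three digits the parser handles faithfully, and convert afterwards; the regex here is a hypothetical illustration:

import org.apache.spark.sql.functions.{col, regexp_replace, to_timestamp}
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}

// Keep the raw text, then cut the fraction down to 3 digits before parsing.
val rawSchema = StructType(Array(
  StructField("ID", DoubleType, true),
  StructField("Created_TS", StringType, true)))

val parsed = spark.read.format("csv")
  .option("header", "false")
  .schema(rawSchema)
  .load("TestTimestamp.csv")
  .withColumn("Created_TS",
    to_timestamp(
      regexp_replace(col("Created_TS"), "(\\.\\d{3})\\d+", "$1"),
      "yyyy-MMM-dd hh.mm.ss.SSS a"))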

add header and column to dataframe spark

狂风中的少年 Submitted on 2019-12-10 18:22:10
Question: Hi guys, I've got a DataFrame to which I want to add a header and a first column manually. Here is the DataFrame:

import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header", true).option("inferSchema", true).csv("C:\\gg.csv").cache()

The content of the DataFrame:

12,13,14
11,10,5
3,2,45

The expected output is:

define,col1,col2,col3
c1,12,13,14
c2,11,10,5
c3,3,2,45

Any help would be appreciated.
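
One way to get there: rename the existing columns with toDF, then prepend a label column built from each row's position; zipWithIndex keeps the original order, whereas monotonically_increasing_id would not produce consecutive c1, c2, c3. A sketch under those assumptions:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

val renamed = df.toDF("col1", "col2", "col3")

// Prepend "cN" (N = 1-based row position) as the "define" column.
val withLabel = spark.createDataFrame(
  renamed.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq(("c" + (idx + 1)) +: row.toSeq)
  },
  StructType(StructField("define", StringType, nullable = false) +: renamed.schema.fields))

withLabel.show()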

About how to create a custom org.apache.spark.sql.types.StructType schema object starting from a json file programmatically

会有一股神秘感。 Submitted on 2019-12-07 16:57:49
Question: I have to create a custom org.apache.spark.sql.types.StructType schema object with the info from a JSON file. The JSON file can be anything, so I have parameterized it within a property file. This is how the property file looks:

// path to the schema for the output file (by default the schema of the target Parquet is inferred).
// If present, the schema will be in JSON format, applicable to a DataFrame (see StructType.fromJson)
schema.parquet=/Users/XXXX/Desktop/generated_schema.json
writing.mode=overwrite
separator=;
header=false

The file generated_schema.json looks like:

{"type" : "struct","fields" : [ {
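
For the reading side, the schema JSON that StructType serializes to can be rebuilt with DataType.fromJson plus a downcast. A minimal sketch, assuming the path comes from the schema.parquet property:

import scala.io.Source
import org.apache.spark.sql.types.{DataType, StructType}

// Load the JSON text and turn it back into a StructType.
val source = Source.fromFile("/Users/XXXX/Desktop/generated_schema.json")
val schemaJson = try source.mkString finally source.close()
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

// The schema can then be applied to a reader, e.g. sqlContext.read.schema(schema)...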
