spark-csv

inferSchema in spark-csv package

隐身守侯 submitted on 2019-12-04 05:57:24
When a CSV is read as a DataFrame in Spark, all the columns are read as string. Is there any way to get the actual type of each column? I have the following CSV file:

    Name,Department,years_of_experience,DOB
    Sam,Software,5,1990-10-10
    Alex,Data Analytics,3,1992-10-10

I've read the CSV using the code below:

    val df = sqlContext.
      read.
      format("com.databricks.spark.csv").
      option("header", "true").
      option("inferSchema", "true").
      load(sampleAdDataS3Location)
    df.schema

All the columns are read as string. I expect the column years_of_experience to be read as int and DOB to be read as date. Please note that I've set
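
If inference still leaves you with strings, one fallback (a minimal sketch, assuming the column names from the sample file above) is to cast the relevant columns explicitly after the read:

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{DateType, IntegerType}

    // Cast the columns whose inferred type is not what we want;
    // DOB is already in yyyy-MM-dd form, so a plain cast to DateType works.
    val typed = df
      .withColumn("years_of_experience", col("years_of_experience").cast(IntegerType))
      .withColumn("DOB", col("DOB").cast(DateType))

    typed.printSchema()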

Adding a custom delimiter adds double quotes in the final Spark DataFrame CSV output

放肆的年华 submitted on 2019-12-02 16:56:23
Question: I have a data frame where I am replacing the default delimiter , with |^| . It works fine and I get the expected result, except where , appears inside a record. For example, I have one such record, shown below:

    4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense),net|^||^||^|IIII|^|False|^||^||^||^||^|False|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|True|^||^|3014960|^||^|I|!|

So there is a , in the 4th field. Now I am doing it like this
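
Since the question is cut off above, the exact write call isn't shown, but the double quotes typically come from Spark's CSV writer quoting any field that contains the active delimiter or quote character. One way around it (a sketch, with a hypothetical output path) is to build each line yourself and write plain text, so no CSV quoting logic runs:

    import org.apache.spark.sql.functions.{coalesce, col, concat_ws, lit}

    // Build each output line manually and write it as plain text, so the CSV
    // writer never wraps comma-containing fields in double quotes.
    // coalesce(..., "") keeps empty fields in place, since concat_ws drops nulls.
    val fields = df.columns.map(c => coalesce(col(c).cast("string"), lit("")))
    df.select(concat_ws("|^|", fields: _*).as("value"))
      .write
      .mode("overwrite")
      .text("/tmp/output_with_custom_delimiter")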

Scala: Spark SQL to_date(unix_timestamp) returning NULL

独自空忆成欢 submitted on 2019-11-30 15:19:58
Spark version: spark-2.0.1-bin-hadoop2.7, Scala 2.11.8. I am loading a raw CSV into a DataFrame. In the CSV, although the columns are supposed to be in date format, they are written as 20161025 instead of 2016-10-25. The parameter date_format holds the names of the columns that need to be converted to yyyy-MM-dd format. In the following code, I first load the date columns as StringType via the schema, then check whether date_format is non-empty (i.e. there are columns that need to be converted from String to Date), and then cast each such column using unix_timestamp and to_date. However, in
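
A common cause of NULLs here is calling unix_timestamp without a pattern, in which case it expects yyyy-MM-dd HH:mm:ss and silently returns null for 20161025-style values. A minimal sketch of the parse with an explicit pattern (the column name "Date" is an assumption):

    import org.apache.spark.sql.functions.{col, to_date, unix_timestamp}

    // Parse the raw yyyyMMdd strings with an explicit pattern, then convert
    // the resulting epoch seconds to a timestamp and finally to a date.
    val parsed = df.withColumn(
      "Date",
      to_date(unix_timestamp(col("Date"), "yyyyMMdd").cast("timestamp"))
    )

    parsed.select("Date").show(5)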

How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?

匆匆过客 submitted on 2019-11-30 09:17:54
I'm terribly new to Spark, Hive, big data, Scala, and all of it. I'm trying to write a simple function that takes an sqlContext, loads a CSV file from S3 and returns a DataFrame. The problem is that this particular CSV uses the ^A (i.e. \001) character as the delimiter, and the dataset is huge, so I can't just do a "s/\001/,/g" on it. Besides, the fields might contain commas or other characters I might use as a delimiter. I know that the spark-csv package that I'm using has a delimiter option, but I don't know how to set it so that it will read \001 as one character and not something like an
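
For what it's worth, the delimiter option does accept the raw control character if you pass it as a Scala unicode escape. A minimal sketch of such a function (the signature and header option are assumptions):

    import org.apache.spark.sql.{DataFrame, SQLContext}

    // "\u0001" is the single ^A character, so spark-csv splits on it directly.
    def loadCsv(sqlContext: SQLContext, path: String): DataFrame =
      sqlContext.read
        .format("com.databricks.spark.csv")
        .option("delimiter", "\u0001")
        .option("header", "false")
        .load(path)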

How to show full column content in a Spark Dataframe?

旧巷老猫 submitted on 2019-11-30 06:09:44
Question: I am using spark-csv to load data into a DataFrame. I want to do a simple query and display the content:

    val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("my.csv")
    df.registerTempTable("tasks")
    val results = sqlContext.sql("select col from tasks")
    results.show()

The col seems truncated:

    scala> results.show();
    +--------------------+
    |                 col|
    +--------------------+
    |2015-11-16 07:15:...|
    |2015-11-16 07:15:...|
    |2015-11-16 07:15:...|
    |2015-11-16 07:15:...|
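
The truncation comes from show() itself, which cuts strings at 20 characters by default; passing truncate = false prints the full values:

    // Print full column content instead of the default 20-character preview.
    results.show(false)

    // Or control the number of rows as well:
    results.show(100, false)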

How to force inferSchema for CSV to consider integers as dates (with “dateFormat” option)?

陌路散爱 submitted on 2019-11-29 07:36:16
I use Spark 2.2.0. I am reading a CSV file as follows:

    val dataFrame = spark.read.option("inferSchema", "true")
      .option("header", true)
      .option("dateFormat", "yyyyMMdd")
      .csv(pathToCSVFile)

There is one date column in this file, and every record has the value 20171001 in this particular column. The issue is that Spark infers the type of this column as integer rather than date. When I remove the "inferSchema" option, the type of that column is string. There are no null values, nor any wrongly formatted lines in this file. What is the reason/solution for this issue? If my
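
Schema inference in the CSV reader does not promote integer-looking values to DateType, so one workaround is to skip inference and declare the schema explicitly; dateFormat is then applied while parsing the declared DateType column. A sketch (the column names are assumptions, since the file isn't shown):

    import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

    // Declare the date column up front so dateFormat is used during parsing.
    val schema = StructType(Seq(
      StructField("id", StringType),
      StructField("event_date", DateType)
    ))

    val dataFrame = spark.read
      .schema(schema)
      .option("header", true)
      .option("dateFormat", "yyyyMMdd")
      .csv(pathToCSVFile)

    dataFrame.printSchema()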

How to estimate dataframe real size in pyspark?

南楼画角 submitted on 2019-11-28 07:00:47
How do I determine a DataFrame's size? Right now I estimate the real size of a DataFrame as follows:

    headers_size = sum(len(key) for key in df.first().asDict())
    rows_size = df.map(lambda row: sum(len(str(value)) for key, value in row.asDict().items())).sum()
    total_size = headers_size + rows_size

It is too slow and I'm looking for a better way.

Nice post from Tamas Szuromi: http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/

    from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

    def _to_java_object_rdd(rdd):
        """ Return a JavaRDD of Object by unpickling
        It will convert each Python object
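
One rough alternative, sketched here in Scala rather than the py4j approach from the linked post: cache the DataFrame, force it to materialize, and read the size Spark itself reports for the cached data (df and spark are assumed to already exist):

    import org.apache.spark.storage.StorageLevel

    // Cache and materialize, then ask Spark how big the cached data actually is.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()  // force evaluation so the cache is populated

    val cachedBytes = spark.sparkContext.getRDDStorageInfo
      .map(info => info.memSize + info.diskSize)
      .sum
    println(s"Approximate materialized size: $cachedBytes bytes")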

Can I read a CSV represented as a string into Apache Spark using spark-csv

本秂侑毒 submitted on 2019-11-27 20:56:09
I know how to read a CSV file into Spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the CSV file represented as a string and would like to convert this string directly to a DataFrame. Is this possible?

Update: Starting from Spark 2.2.x there is finally a proper way to do it, using a Dataset:

    import org.apache.spark.sql.{Dataset, SparkSession}

    val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
    import spark.implicits._

    val csvData: Dataset[String] = spark.sparkContext.parallelize(
      """
        |id, date, timedump
        |1, "2014/01/01 23:00
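
Since the snippet above is cut off, here is a minimal sketch of how that Dataset[String] approach typically continues (the sample rows and reader options are assumptions):

    import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

    val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
    import spark.implicits._

    // Wrap the in-memory CSV text in a Dataset[String]...
    val csvData: Dataset[String] = spark.sparkContext.parallelize(
      """id,date,timedump
        |1,"2014/01/01 23:00",1499959917383
        |2,"2014/01/01 23:30",1499959917384""".stripMargin.lines.toList
    ).toDS()

    // ...and hand it straight to the CSV reader (available since Spark 2.2).
    val df: DataFrame = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv(csvData)

    df.show()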