spark-csv

Provide schema while reading a CSV file as a DataFrame

≡放荡痞女 submitted on 2019-11-27 06:58:59
I am trying to read a CSV file into a DataFrame. I know what the schema of my DataFrame should be, since I know my CSV file. I am also using the spark-csv package to read the file, and I am trying to specify the schema like this:

    val pagecount = sqlContext.read.format("csv")
      .option("delimiter", " ").option("quote", "")
      .option("schema", "project: string ,article: string ,requests: integer ,bytes_served: long")
      .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")

But when I check the schema of the DataFrame I created, it seems to have inferred its own schema instead.
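The excerpt above is cut off, but the likely issue is that "schema" is not a reader option, so spark-csv ignores the string and infers types on its own. The supported route is to build a StructType and pass it through the reader's schema(...) method; a minimal sketch using the question's columns:

    import org.apache.spark.sql.types._

    // Declare the schema explicitly instead of encoding it in an option string.
    val pageSchema = StructType(Seq(
      StructField("project", StringType, nullable = true),
      StructField("article", StringType, nullable = true),
      StructField("requests", IntegerType, nullable = true),
      StructField("bytes_served", LongType, nullable = true)
    ))

    val pagecount = sqlContext.read.format("csv")
      .option("delimiter", " ")
      .option("quote", "")
      .schema(pageSchema) // schema(...) is the supported way to fix column types
      .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000")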

How to estimate a DataFrame's real size in PySpark?

删除回忆录丶 submitted on 2019-11-27 01:10:28
Question: How do I determine a DataFrame's size? Right now I estimate the real size of a DataFrame as follows:

    headers_size = sum(len(key) for key in df.first().asDict())
    rows_size = df.rdd.map(lambda row: sum(len(str(value)) for value in row.asDict().values())).sum()
    total_size = headers_size + rows_size

It is too slow and I'm looking for a better way.

Answer 1: Nice post from Tamas Szuromi: http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/

    from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
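The answer's snippet is cut off after the import line; the linked post goes on to unpickle the RDD into JVM objects and hand them to Spark's SizeEstimator. Since those Python internals are truncated here, a rough Scala sketch of the same SizeEstimator idea (a swapped-in approximation, not the post's exact code): measure a small sample and extrapolate by row count. The sample size is illustrative, and the result is deserialized on-heap size, not on-disk size.

    import org.apache.spark.util.SizeEstimator

    // Assumes df is non-empty: measure a small sample on the driver and
    // scale the estimate linearly by the total row count.
    val sample = df.limit(1000).collect()
    val totalRows = df.count()
    val approxBytes = SizeEstimator.estimate(sample) * (totalRows.toDouble / sample.length)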

Can I read a CSV represented as a string into Apache Spark using spark-csv?

牧云@^-^@ submitted on 2019-11-26 20:21:17
Question: I know how to read a CSV file into Spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the CSV file represented as a string and would like to convert this string directly to a DataFrame. Is this possible?

Answer 1: Update: Starting from Spark 2.2.x there is finally a proper way to do it, using Dataset:

    import org.apache.spark.sql.{Dataset, SparkSession}

    val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
    import spark.implicits._
    val
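The answer cuts off mid-declaration. A plausible completion of the Spark 2.2+ Dataset approach it names (the sample string and column names are my own):

    import org.apache.spark.sql.{Dataset, SparkSession}

    val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
    import spark.implicits._

    // An in-memory CSV string; the contents are illustrative.
    val csvString = "name,age\nalice,29\nbob,31"

    // Lift the lines into a Dataset[String]; since Spark 2.2,
    // DataFrameReader.csv accepts a Dataset[String] directly.
    val csvData: Dataset[String] = csvString.split("\n").toSeq.toDS()
    val df = spark.read.option("header", "true").csv(csvData)
    df.show()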

Write a single CSV file using spark-csv

落爺英雄遲暮 submitted on 2019-11-25 23:46:30
Question: I am using https://github.com/databricks/spark-csv and am trying to write a single CSV file, but I am not able to: it creates a folder. I need a Scala function that takes parameters like a path and a file name and writes that CSV file.

Answer 1: It is creating a folder with multiple files because each partition is saved individually. If you need a single output file (still inside a folder) you can repartition (preferred if upstream data is large, but it requires a shuffle):

    df
      .repartition(1)
      .write.format("com.databricks.spark.csv")
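The answer is cut off mid-call; a fuller sketch of the repartition approach, plus the usual coalesce alternative (output path and header option are illustrative). Note that Spark still writes a directory; the single part-00000 file inside it must be moved or renamed separately:

    // Funnel all data through one partition so a single part file is produced.
    df.repartition(1)
      .write.format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/tmp/mydata.csv") // a folder; the CSV is the part-00000 file inside

    // coalesce(1) avoids a full shuffle, but the entire computation then runs
    // in one task, which can be slow or run out of memory for large inputs.
    df.coalesce(1)
      .write.format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/tmp/mydata.csv")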