Question
I'm new to Big Data. I need to convert a csv/txt file to Parquet format. I searched a lot but couldn't find any direct way to do so. Is there any way to achieve that?
Answer 1:
Here is a sample piece of code which does it both ways.
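For reference, here is a minimal pandas-based sketch that converts in both directions (CSV to Parquet and back); it is not the original sample, it assumes pandas and pyarrow are installed, and the file paths are only illustrative:
# csv_parquet_roundtrip.py -- illustrative sketch only
import pandas as pd

csv_path = 'data.csv'          # illustrative input path
parquet_path = 'data.parquet'  # illustrative output path

# CSV -> Parquet
df = pd.read_csv(csv_path)
df.to_parquet(parquet_path, engine='pyarrow')

# Parquet -> CSV
df2 = pd.read_parquet(parquet_path, engine='pyarrow')
df2.to_csv('data_roundtrip.csv', index=False)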
Answer 2:
You can use Apache Drill, as described in Convert a CSV File to Apache Parquet With Drill.
In brief:
Start Apache Drill:
$ cd /opt/drill/bin
$ sqlline -u jdbc:drill:zk=local
Create the Parquet file:
-- Set default table format to parquet
ALTER SESSION SET `store.format`='parquet';

-- Create a parquet table containing all data from the CSV table
CREATE TABLE dfs.tmp.`/stats/airport_data/` AS
SELECT
  CAST(SUBSTR(columns[0],1,4) AS INT) `YEAR`,
  CAST(SUBSTR(columns[0],5,2) AS INT) `MONTH`,
  columns[1] as `AIRLINE`,
  columns[2] as `IATA_CODE`,
  columns[3] as `AIRLINE_2`,
  columns[4] as `IATA_CODE_2`,
  columns[5] as `GEO_SUMMARY`,
  columns[6] as `GEO_REGION`,
  columns[7] as `ACTIVITY_CODE`,
  columns[8] as `PRICE_CODE`,
  columns[9] as `TERMINAL`,
  columns[10] as `BOARDING_AREA`,
  CAST(columns[11] AS DOUBLE) as `PASSENGER_COUNT`
FROM dfs.`/opendata/Passenger/SFO_Passenger_Data/*.csv`;
Try selecting data from the new Parquet file:
-- Select data from parquet table
SELECT * FROM dfs.tmp.`/stats/airport_data/*`
You can change the dfs.tmp location by going to http://localhost:8047/storage/dfs (source: CSV and Parquet).
Answer 3:
I already posted an answer on how to do this using Apache Drill. However, if you are familiar with Python, you can now do this using Pandas and PyArrow!
Install dependencies
Using pip:
pip install pandas pyarrow
or using conda:
conda install pandas pyarrow -c conda-forge
Convert CSV to Parquet in chunks
# csv_to_parquet.py
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
csv_file = '/path/to/my.tsv'
parquet_file = '/path/to/my.parquet'
chunksize = 100_000
csv_stream = pd.read_csv(csv_file, sep='\t', chunksize=chunksize, low_memory=False)
for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Guess the schema of the CSV file from the first chunk
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        # Open a Parquet file for writing
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')
    # Write CSV chunk to the parquet file
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()
I haven't benchmarked this code against the Apache Drill version, but in my experience it's plenty fast, converting tens of thousands of rows per second (this depends on the CSV file of course!).
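To sanity-check the result, the Parquet file can be read back with pandas (assuming pandas 0.21+ with pyarrow available); parquet_file is the path defined in the script above:
# Read the converted file back and inspect the first rows
df = pd.read_parquet(parquet_file, engine='pyarrow')
print(df.head())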
Answer 4:
The following code is an example using Spark 2.0. Reading with an explicit schema is much faster than using the inferSchema option, which needs an extra pass over the data to determine the column types. Spark 2.0 also converts to Parquet much more efficiently than Spark 1.6.
import org.apache.spark.sql.types._
val schema = StructType(Array(
  StructField("timestamp", StringType, true),
  StructField("site", StringType, true),
  StructField("requests", LongType, true)))

val df = spark.read
  .schema(schema)
  .option("header", "true")
  .option("delimiter", "\t")
  .csv("/user/hduser/wikipedia/pageviews-by-second-tsv")
df.write.parquet("/user/hduser/wikipedia/pageviews-by-second-parquet")
Answer 5:
1) You can create an external Hive table over the CSV file:
create external table emp(name string,job_title string,department string,salary_per_year int)
row format delimited
fields terminated by ','
location '.. hdfs location of csv file '
2) Create another Hive table that will store the Parquet file:
create external table emp_par(name string,job_title string,department string,salary_per_year int)
row format delimited
stored as PARQUET
location 'hdfs location where you want to save the parquet file'
Insert the data from table one into table two:
insert overwrite table emp_par select * from emp
Answer 6:
Read the CSV files as a DataFrame in Apache Spark with the spark-csv package. After loading the data into a DataFrame, save the DataFrame to a Parquet file.
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("mode", "DROPMALFORMED")
.load("/home/myuser/data/log/*.csv")
df.saveAsParquetFile("/home/myuser/data.parquet")
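Note that saveAsParquetFile belongs to the Spark 1.x DataFrame API; on Spark 2.x the equivalent is df.write.parquet("/home/myuser/data.parquet"), as in the Spark 2.0 answer above.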
Answer 7:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import sys
sc = SparkContext(appName="CSV2Parquet")
sqlContext = SQLContext(sc)
schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", StringType(), True),
    StructField("col4", StringType(), True),
    StructField("col5", StringType(), True)])
rdd = sc.textFile('/input.csv').map(lambda line: line.split(","))
df = sqlContext.createDataFrame(rdd, schema)
df.write.parquet('/output.parquet')
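Assuming the script above is saved as, say, csv2parquet.py (the file name is only illustrative), it can be run with spark-submit; the input and output paths (/input.csv, /output.parquet) are hard-coded in the script:
$ spark-submit csv2parquet.py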
Source: https://stackoverflow.com/questions/26124417/how-to-convert-a-csv-file-to-parquet