Saving Spark DataFrames as parquet files - no errors, but data is not being saved

只愿长相守 submitted on 2019-12-19 03:10:37

Question


I want to save a dataframe as a parquet file in Python, but I am only able to save the schema, not the data itself.

I have reduced my problem down to a very simple Python test case, which I copied below from IPYNB.

Any advice on what might be going on?

In [2]:

import math
import string
import datetime
import numpy as np
import matplotlib.pyplot
from pyspark.sql import *
import pylab
import random
import time

In [3]:

sqlContext = SQLContext(sc)
# create a simple 1-column dataframe with a single row of data
df = sqlContext.createDataFrame(sc.parallelize(xrange(1)).flatMap(lambda x: [Row(col1="Test row")]))
df.show()
df.count()

Out[3]:
col1    
Test row

1L

In [4]:
# Persist the dataframe as a parquet file
df.saveAsParquetFile("test.parquet")

In [5]: 
ls

TrapezoidRule.ipynb         metastore_db/
WeatherPrecipitation.ipynb  derby.log                  test.parquet/

In [6]: 
ls -l test.parquet
total 4
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users   0 Oct  4 14:13 _SUCCESS
-rw-r--r-- 1 s26e-5a5fbda111ac17-5edfd8a0d95d users 188 Oct  4 14:13 _common_metadata

In [7]: 
# The directory listing shows that the test parquet was created, but there are no data files.
# load the parquet file into another df and show that no data was saved or loaded... only the schema
newDF = sqlContext.parquetFile("test.parquet")
newDF.show()
newDF.count()

Out[7]: 
col1

0L

Source: https://stackoverflow.com/questions/32938349/saving-spark-dataframes-as-parquet-files-no-errors-but-data-is-not-being-save
