I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:
I was able to accomplish writing parquet files into S3 using fastparquet. It's a little tricky but my breakthrough came when I realized that to put together all the dependencies, I had to use the same exact Linux that Lambda is using.
Here's how I did it:
Source: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
Linux image: https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2
Note: you might need to install many packages and change python version to 3.6 as this Linux is not meant for development. Here's how I looked for packages:
sudo yum list | grep python3
I installed:
python36.x86_64
python36-devel.x86_64
python36-libs.x86_64
python36-pip.noarch
python36-setuptools.noarch
python36-tools.x86_64
mkdir parquet
cd parquet
pip install -t . fastparquet
pip install -t . (any other dependencies)
copy my python file in this folder
zip and upload into Lambda
Note: there are some constraints I had to work around: Lambda doesn't let you upload zip larger 50M and unzipped > 260M. If anyone knows a better way to get dependencies into Lambda, please do share.
Source: Write parquet from AWS Kinesis firehose to AWS S3