Read Parquet file stored in S3 with AWS Lambda (Python 3)


I was able to accomplish writing parquet files into S3 using fastparquet. It's a little tricky, but my breakthrough came when I realized that to put together all of the dependencies, I had to use the exact same Linux that Lambda uses.

Here's how I did it:

1. Spin up an EC2 instance using the Amazon Linux image that Lambda runs on

Source: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

Linux image: https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2

Note: you may need to install quite a few packages and switch Python to version 3.6, since this image is not meant for development. Here's how I searched for packages:

sudo yum list | grep python3

I installed:

python36.x86_64
python36-devel.x86_64
python36-libs.x86_64
python36-pip.noarch
python36-setuptools.noarch
python36-tools.x86_64
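
For reference, a sketch of the corresponding install command (exact package names can vary between image versions):

sudo yum install -y python36 python36-devel python36-libs python36-pip python36-setuptools python36-tools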

2. Used the instructions from here to build a zip file with all of the dependencies my script would use, dumping them all into a folder and then zipping them with these commands:

mkdir parquet
cd parquet
pip install -t . fastparquet
pip install -t . <any other dependencies>
cp /path/to/your_handler.py .   # copy your Python file into this folder
zip -r ../parquet.zip .         # zip the contents, then upload the zip to Lambda

Note: there are some constraints I had to work around: Lambda doesn't let you upload a zip larger than 50 MB, and the unzipped contents can't exceed 250 MB. If anyone knows a better way to get dependencies into Lambda, please do share.

Source: Write parquet from AWS Kinesis firehose to AWS S3
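
With the dependencies zipped, the handler itself can stay short. Below is a minimal sketch of a handler that writes a DataFrame to S3 with fastparquet; the bucket name, key, and data are placeholders, and it assumes s3fs and pandas were bundled in the zip alongside fastparquet:

import pandas as pd
import s3fs
from fastparquet import write

def lambda_handler(event, context):
    # Placeholder data; in practice this would come from the event payload.
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

    # s3fs provides the file-opening hook that lets fastparquet write straight to S3.
    fs = s3fs.S3FileSystem()
    write("my-bucket/output/data.parquet", df, open_with=fs.open)  # hypothetical bucket/key

    return {"status": "ok"}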

This turned out to be an environment issue (the Lambda function was in a VPC and couldn't reach the S3 bucket); pyarrow is now working. Hopefully the question itself gives a good enough overview of how to make all of that work.
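
For completeness, a minimal sketch of reading a parquet file from S3 with pyarrow inside a Lambda handler; the bucket and key are placeholders, boto3 is available in the Lambda runtime by default, and pandas is assumed to be bundled for the to_pandas step:

import io
import boto3
import pyarrow.parquet as pq

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Fetch the object into memory and hand it to pyarrow as a file-like buffer.
    obj = s3.get_object(Bucket="my-bucket", Key="input/data.parquet")  # hypothetical bucket/key
    table = pq.read_table(io.BytesIO(obj["Body"].read()))
    df = table.to_pandas()
    return {"rows": len(df)}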
