Read Parquet file stored in S3 with AWS Lambda (Python 3)

Asked by 星月不相逢 on 2021-01-02 03:23

I am trying to load, process and write Parquet files in S3 with AWS Lambda. My testing / deployment process is:

  • https://github.com/lambci/docker-lambda as a container to replicate the Lambda environment locally
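
For illustration, lambci/docker-lambda can run a handler locally by mounting the code directory into the container; the file name handler.py and the entry point lambda_handler below are assumptions, not from the post:

  docker run --rm -v "$PWD":/var/task lambci/lambda:python3.6 handler.lambda_handler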
4 Answers

  • 野趣味 (OP)
    2021-01-02 04:15

    I was able to write Parquet files into S3 using fastparquet. It's a little tricky, but my breakthrough came when I realized that to put together all of the dependencies, I had to use the exact same Linux that Lambda runs on.

    Here's how I did it:

    1. Spin up an EC2 instance using the Amazon Linux image that Lambda runs on

    Source: https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

    Linux image: https://console.aws.amazon.com/ec2/v2/home#Images:visibility=public-images;search=amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2

    Note: you might need to install many packages and change the Python version to 3.6, as this Linux image is not meant for development. Here's how I searched for packages:

    sudo yum list | grep python3
    

    I installed:

    python36.x86_64
    python36-devel.x86_64
    python36-libs.x86_64
    python36-pip.noarch
    python36-setuptools.noarch
    python36-tools.x86_64
    
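    For convenience, the packages listed above can be installed in one command; this is a sketch of the yum invocation, adjust the list to your needs:

    sudo yum install -y python36 python36-devel python36-libs \
        python36-pip python36-setuptools python36-tools
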

    2. Use the instructions from here to build a zip file with all of the dependencies that my script would use: dump them all into one folder and then zip it, roughly like this:

    mkdir parquet
    cd parquet
    pip install -t . fastparquet
    pip install -t . (any other dependencies)
    # copy your Python script into this folder (a sketch follows below)
    # zip the folder contents and upload the archive to Lambda
    
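    To make those last two comments concrete, here is a minimal sketch of the kind of handler that folder might contain. It assumes pandas and s3fs are bundled alongside fastparquet, and the bucket/key names are placeholders:

    import pandas as pd
    import s3fs
    from fastparquet import ParquetFile, write

    # s3fs gives fastparquet file-like access to S3 objects
    s3 = s3fs.S3FileSystem()

    def lambda_handler(event, context):
        # read an existing Parquet file from S3 (bucket/key are placeholders)
        pf = ParquetFile("my-bucket/input/data.parquet", open_with=s3.open)
        df = pf.to_pandas()

        # ... process the DataFrame here ...

        # write the result back to S3 via fastparquet
        write("my-bucket/output/data.parquet", df, open_with=s3.open)
        return {"rows": len(df)}
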

    Note: there are some constraints I had to work around: Lambda doesn't let you upload a zip larger than 50 MB, and the unzipped contents can't exceed 250 MB. If anyone knows a better way to get dependencies into Lambda, please do share.

    Source: Write parquet from AWS Kinesis firehose to AWS S3
