Avoid creation of _$folder$ keys in S3 with hadoop (EMR)

吃可爱长大的小学妹 提交于 2019-11-30 22:35:24

EMR doesn't seem to provide a way to avoid this.

Because S3 uses a key-value pair storage system, the Hadoop file system implements directory support in S3 by creating empty files with the "_$folder$" suffix.

You can safely delete any empty files with the <directoryname>_$folder$ suffix that appear in your S3 buckets. These empty files are created by the Hadoop framework at runtime, but Hadoop is designed to process data even if these empty files are removed.

https://aws.amazon.com/premiumsupport/knowledge-center/emr-s3-empty-files/

It's in the Hadoop source code, so it could be fixed, but apparently it's not fixed in EMR.

If you are feeling clever, you could create an S3 event notification that matches the _$folder$ suffix, and have it fire off a Lambda function to delete the objects after they're created.

There's no way in S3 to actually create an empty folder. S3 is an object store so everything is an object in there. When Hadoop uses it as a filesystem, it requires to organize those objects so that it appears as a file system tree, so it creates some special objects to mark an object as a directory. You just store data files, but you can choose to organize those data files into paths, which creates a concept similar to folders for traversing.

If you just don't create a folder, but place files in the path you want - that should work for you. You don't have to create a folder before writing files to it in S3.

Also this may help: https://qubole.zendesk.com/hc/en-us/articles/213496246-How-To-Remove-Dir-marker-folders-in-S3-NativeFS-

Use below script in EMR bootstrap action to solve this issue. Patch provided by AWS

#!/bin/bash

# NOTE: This script replaces the s3-dist-cp RPM on EMR versions 4.6.0+ with s3-dist-cp-2.2.0.
# This is intended to remove the _$folder$ markers when creating the destination prefixes in S3.

set -ex

RPM=bootstrap-actions/s3-dist-cp-2.2.0/s3-dist-cp-2.2.0-1.amzn1.noarch.rpm

LOCAL_DIR=/var/aws/emr/packages/bigtop/s3-dist-cp/noarch

# Get the region from metadata
REGION=$(curl http://169.254.169.254/latest/meta-data/placement/availability-zone/ 2>/dev/null | head -c -1)

# Choose correct bucket for region
if [ $REGION = "us-east-1" ]
then
    BUCKET=awssupportdatasvcs.com
else
    BUCKET=$REGION.awssupportdatasvcs.com
fi

# Download new RPM
sudo rm $LOCAL_DIR/s3-dist-cp*.rpm
aws s3 cp s3://$BUCKET/$RPM /tmp/
sudo cp /tmp/s3-dist-cp-2.2.0-1.amzn1.noarch.rpm $LOCAL_DIR/

echo Rebuilding Repo
sudo yum install -y createrepo
sudo createrepo --update -o /var/aws/emr/packages/bigtop /var/aws/emr/packages/bigtop
sudo yum clean all
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!