How to convert JSON files stored in S3 to CSV using Glue?

Submitted on 2020-01-06 04:31:39

Question


I have some JSON files stored in S3, and I need to convert them, in the folder where they are stored, to CSV format.

Currently I'm using Glue to map them into Athena, but, as I said, now I need to convert them to CSV.

Is it possible to use a Glue job to do that?

I'm trying to understand whether a Glue job can crawl through my S3 folder directories and convert every JSON file it finds to CSV (as new files).

If that's not possible, is there any AWS service that could help me do that?

EDIT1:

Here's the current code I'm trying to run:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Read all JSON files directly under the dealer-data prefix
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data"]},
    format = "json")

# Write the data back out as CSV under the same prefix
outputGDF = glueContext.write_dynamic_frame.from_options(
    frame = inputGDF,
    connection_type = "s3",
    connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data"},
    format = "csv")

The job runs with no errors, but nothing seems to happen in the S3 folder. I assumed the code would pick up the JSON files from /dealer-data and write them back to the same folder as CSV. I'm probably wrong.
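
One thing worth noting here: Glue writes Spark-style part files under the path given in connection_options, so writing back into the same prefix you read from means a rerun would pick up the job's own CSV output as input. Below is a minimal sketch that reads from the original prefix and writes to a separate one; the dealer-data-csv output prefix is a hypothetical name, not something from the original question.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Read the JSON files under the source prefix
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data"]},
    format = "json")

# Write CSV part files under a separate (hypothetical) output prefix,
# so the job does not re-read its own output on the next run
outputGDF = glueContext.write_dynamic_frame.from_options(
    frame = inputGDF,
    connection_type = "s3",
    connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data-csv"},
    format = "csv")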

EDIT2:

OK, I almost made it work the way I needed.

The thing is, create_dynamic_frame only picks up folders that directly contain files, not folders whose files sit inside subfolders.

import sys
import logging
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)

# Read the JSON files directly under this one prefix
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2"]},
    format = "json")

# Note: Glue treats the write path as a directory prefix and emits
# part files under it, so "bla.csv" becomes a folder, not a single file
outputGDF = glueContext.write_dynamic_frame.from_options(
    frame = inputGDF,
    connection_type = "s3",
    connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2/bla.csv"},
    format = "csv")

The above works, but only for that one directory (../2). Is there a way to read all files under a folder and its subfolders?


Answer 1:


You should set the recurse option to True in the S3 connection options:

inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3", 
    connection_options = {
        "paths": ["s3://agco-sa-dfs-dv/dealer-data/installations/3555/2019/2"],
        "recurse" : True
    }, 
    format = "json
)
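
Putting the recurse read together with the CSV write from the earlier edits, a complete job might look like the sketch below. The output prefix is an assumption (not from the question), and Glue will emit part files under it rather than a single named CSV.

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Recurse through all subfolders under the installations prefix
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type = "s3",
    connection_options = {
        "paths": ["s3://agco-sa-dfs-dv/dealer-data/installations"],
        "recurse": True
    },
    format = "json")

# Write the combined data as CSV part files under a separate
# (hypothetical) output prefix
outputGDF = glueContext.write_dynamic_frame.from_options(
    frame = inputGDF,
    connection_type = "s3",
    connection_options = {"path": "s3://agco-sa-dfs-dv/dealer-data-csv/installations"},
    format = "csv")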


Source: https://stackoverflow.com/questions/56244413/how-to-convert-json-files-stored-in-s3-to-csv-using-glue
