Write the results of the Google Api to a data lake with Databricks

Submitted by a 夏天 on 2019-12-24 10:55:14

Question


I am getting back user usage data from the Google Admin Reports User Usage API via the Python SDK on Databricks. The data size is around 100,000 records per day, which I retrieve nightly via a batch process. The API returns a maximum page size of 1000, so I call it roughly 100 times to page through the data I need for the day. This is working fine.
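
For illustration, the nightly paging loop looks roughly like this (a sketch, assuming a google-api-python-client service object built for the Admin SDK Reports API; the function name and variable names are mine, not part of the real code):

    def fetch_usage_pages(service, report_date):
        """Yield one API response (up to 1000 usage records) per page."""
        # 'service' is assumed to be built with
        # googleapiclient.discovery.build('admin', 'reports_v1', credentials=creds)
        page_token = None
        while True:
            response = service.userUsageReport().get(
                userKey='all',        # all users in the domain
                date=report_date,     # e.g. '2019-04-10'
                maxResults=1000,      # the API's maximum page size
                pageToken=page_token
            ).execute()
            yield response
            page_token = response.get('nextPageToken')
            if not page_token:
                break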

My ultimate aim is to store the data in its raw format in a data lake (Azure Gen2, but irrelevant to this question). Later on, I will transform the data using Databricks into an aggregated reporting model and put PowerBI on top of it to track Google App usage over time.

As a C# programmer, I am new to Python and Spark: my current approach is to request the first page of 1000 records from the API, write it directly to the data lake as a JSON file, then get the next page and write that too. The folder structure would be something like "\raw\googleuser\YYYY\MM\DD\data1.json".
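
The path for each page would be built from the run date, something like this (a minimal sketch; the mount point and file naming are assumptions):

    from datetime import date

    def raw_json_path(run_date, page_number):
        # e.g. raw_json_path(date(2019, 4, 10), 1)
        #   -> '/mnt/mydatalake/raw/googleuser/2019/04/10/data1.json'
        return "/mnt/mydatalake/raw/googleuser/{:%Y/%m/%d}/data{}.json".format(
            run_date, page_number)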

I would like to keep data in its rawest form possible in the raw zone and not apply too many transformations. A second process can extract the fields I need, tag them with metadata and write the result back as Parquet, ready for consumption by a function. This is why I am thinking of writing it as JSON.

This means that the second process needs to read the JSON into a dataframe where I can transform it and write it as Parquet (this part is also straightforward).
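
In other words, the second process would look something like this (a sketch, assuming the raw files are whole API responses; the selected fields and paths are placeholders):

    from pyspark.sql.functions import col, explode

    # 'spark' is the SparkSession provided by the Databricks notebook.
    raw_df = spark.read.json("/mnt/mydatalake/raw/googleuser/2019/04/10/*.json")

    # Each file is a full API response; the per-user rows sit in the
    # 'usageReports' array (an assumption about the response shape).
    curated_df = (raw_df
                  .select(explode("usageReports").alias("report"))
                  .select(col("report.date").alias("usage_date"),
                          col("report.entity.userEmail").alias("user_email")))

    curated_df.write.mode("overwrite").parquet(
        "/mnt/mydatalake/curated/googleuser/2019/04/10")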

Because I am using the Google API I am not working with JSON files - it returns dict objects (with complex nesting). I can serialise a page to a JSON string using json.dumps(), but I cannot figure out how to write a string directly to my data lake. Once I get it into a dataframe I can easily write it in any format; however, it seems like a performance overhead to convert it from JSON into a dataframe and then essentially back to JSON just to write it.
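
For reference, Databricks does offer dbutils.fs.put, which writes a plain string to a DBFS or mounted path; a minimal sketch, assuming the page fits comfortably in memory as a string and using a placeholder path ('response' is one API page, as in the code below):

    import json

    page_json = json.dumps(response)   # 'response' is one API page (a dict)
    dbutils.fs.put(
        "/mnt/mydatalake/raw/googleuser/2019/04/10/data1.json",
        page_json,
        True)                          # overwrite if the file already exists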

Here are the things I have tried and the results:

  1. Build up a list of pyspark.sql.Row objects and, at the end of all the paging (100k rows), use spark.createDataFrame(rows) to turn them into a dataframe. Once it is a dataframe I can save it as a JSON file (sketched after this list). This works, but seems inefficient.
  2. Use json.dumps(response) to get a string of 1000 records as JSON. I am able to write it to the Databricks File System using this code:

    with open("/dbfs/tmp/googleuserusagejsonoutput-{0}.json" .format(keyDateFilter), 'w') as f: f.write(json.dumps(response))

    However, I then have to move it to my Azure data lake with:

    dbutils.fs.cp("/tmp/test_dbfs1.txt", datalake_path + dbfs_path + "xyz.json")

    Then I get the next 1000 records and keep doing this. I cannot seem to use the open() method directly against the data lake store (the Azure abfss driver), or this would be a decent solution. It seems fragile and strange to dump the data locally first and then move it.

  3. Same as option 1, but dump the dataframe to the data lake every 1000 records and overwrite it (so that memory does not hold more than 1000 records at a time).

  4. Ignore the rule of dumping raw JSON. Massage the data into the simplest format I want and get rid of all the extra data I don't need. This would result in a much smaller footprint, and then Option 1 or 3 above would be followed. (This is the second question: the principle of saving all data from the API in its raw format means that, as requirements change over time, I always have the historical data in the data lake and can simply change the transformation routines to extract different metrics from it. Hence I am reluctant to drop any data at this stage.)
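
To make option 1 concrete, this is roughly what I mean (a sketch; it reuses the paging helper sketched earlier, and the field names and path are assumptions):

    from pyspark.sql import Row

    rows = []
    for response in fetch_usage_pages(service, "2019-04-10"):  # paging helper sketched above
        for report in response.get("usageReports", []):
            rows.append(Row(
                usage_date=report.get("date"),
                user_email=report.get("entity", {}).get("userEmail")))

    df = spark.createDataFrame(rows)
    df.write.mode("overwrite").json("/mnt/mydatalake/raw/googleuser/2019/04/10")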

Any advice appreciated please...


Answer 1:


Mount the lake to your Databricks environment so you can just save to it as if it were a normal folder:

import codecs
import json

# keyDateFilter is the date key used in the question's paging loop
with open('/dbfs/mnt/mydatalake/googleuserusagejsonoutput-{0}.json'.format(keyDateFilter), 'wb') as f:
    json.dump(data, codecs.getwriter('utf-8')(f), sort_keys=True, indent=4, ensure_ascii=False)

You only need to mount the lake once:

https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake-gen2.html#mount-the-azure-data-lake-storage-gen2-filesystem-with-dbfs
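
For reference, a one-time mount for ADLS Gen2 with an OAuth service principal looks roughly like this (a sketch; the storage account, container, secret scope, and tenant values are placeholders):

    configs = {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": dbutils.secrets.get("myscope", "sp-client-id"),
        "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("myscope", "sp-client-secret"),
        "fs.azure.account.oauth2.client.endpoint":
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token"}

    dbutils.fs.mount(
        source="abfss://mycontainer@mystorageaccount.dfs.core.windows.net/",
        mount_point="/mnt/mydatalake",
        extra_configs=configs)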

That being said,

Storing big data in JSON format is not optimal; for each and every value (cell) you are storing the key (column name), so your data will be much larger than it needs to be. Also, you should probably have a de-duplication function to ensure that (1) there are no gaps in the data, and (2) you aren't storing the same data in multiple files. Databricks Delta takes care of that.

https://docs.databricks.com/delta/delta-intro.html
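
As a sketch of the Delta idea (the table path and key columns are assumptions, and the target Delta table is assumed to already exist), a merge that only inserts rows it has not seen keeps reruns from duplicating data:

    from delta.tables import DeltaTable

    # 'new_df' is assumed to be the day's batch of curated rows with
    # 'usage_date' and 'user_email' columns; the table path is a placeholder.
    target = DeltaTable.forPath(spark, "/mnt/mydatalake/delta/googleuser")

    (target.alias("t")
     .merge(new_df.alias("s"),
            "t.usage_date = s.usage_date AND t.user_email = s.user_email")
     .whenNotMatchedInsertAll()
     .execute())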



Source: https://stackoverflow.com/questions/55628005/write-the-results-of-the-google-api-to-a-data-lake-with-databricks
