Question
I am trying to import some data from a public GitHub repo so that I can use it from my Databricks notebooks.
So far I have tried to connect my Databricks account with my GitHub account as described here, but without success, since it seems that GitHub integration requires a non-Community license. I get the following message when I try to set the GitHub token, which is required for the GitHub integration:
The same question has been asked before on the official Databricks forum.
What is the best way to import and store a GitHub repo on Databricks Community Edition?
Answer 1:
I managed to solve this using shell commands from the notebook itself. To retrieve the repository for the first time, I ran git clone via HTTPS:
%sh git clone https://github.com/SomeDataRepo/TheData.git --depth 1 --branch=master /dbfs/FileStore/TheData/
Why not SSH? SSH requires setting up SSH keys, which was not necessary in my case.
Finally, every time I need a fresh version of the data, I run git pull before executing my program:
%sh git -C /dbfs/FileStore/TheData/ pull
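The two commands above can be folded into one step that clones on the first run and pulls on later runs. This is only a minimal sketch: the function name sync_repo and its parameters are hypothetical, and it assumes git is available wherever the cell runs. In a notebook, the definition and the call would need to sit in the same %sh cell.

```shell
# Hypothetical helper: clone the repo on the first run, pull on later runs.
sync_repo() {
  repo_url="$1"          # e.g. https://github.com/SomeDataRepo/TheData.git
  target_dir="$2"        # e.g. /dbfs/FileStore/TheData/
  branch="${3:-master}"  # branch to check out; defaults to master

  if [ -d "$target_dir/.git" ]; then
    # A clone already exists, so just fetch the latest changes.
    git -C "$target_dir" pull
  else
    # First run: shallow clone of the requested branch only.
    git clone "$repo_url" --depth 1 --branch="$branch" "$target_dir"
  fi
}

# Example call (same repo and path as above):
# sync_repo https://github.com/SomeDataRepo/TheData.git /dbfs/FileStore/TheData/
```

Checking for the .git directory rather than the target directory itself avoids treating an empty or unrelated folder as an existing clone.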
Answer 2:
Assuming you have Python installed on your desktop, install the Databricks CLI, clone the Git repo to your local machine, then use the workspace CLI to import the entire repo as a directory.
https://docs.databricks.com/dev-tools/cli/workspace-cli.html
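Those steps might look roughly like the sketch below, run on the local machine rather than in a notebook. It assumes the legacy databricks-cli package and an already configured access token. The function name import_repo_to_workspace, the local directory, and the workspace path are all hypothetical placeholders.

```shell
# One-time setup, run by hand beforehand (legacy Databricks CLI):
#   pip install databricks-cli
#   databricks configure --token

# Hypothetical helper: clone a repo locally, then import it into the workspace.
import_repo_to_workspace() {
  repo_url="$1"        # e.g. https://github.com/SomeDataRepo/TheData.git
  local_dir="$2"       # local checkout directory (placeholder)
  workspace_path="$3"  # target folder in the Databricks workspace (placeholder)

  # Only attempt the import if the clone succeeded.
  git clone --depth 1 "$repo_url" "$local_dir" &&
    databricks workspace import_dir "$local_dir" "$workspace_path"
}

# Example call (paths are placeholders):
# import_repo_to_workspace https://github.com/SomeDataRepo/TheData.git ./TheData /Users/someone@example.com/TheData
```

As far as I know, import_dir is aimed at notebooks and source files, so it may skip other file types such as raw data files. It is worth checking the linked documentation before relying on it for data.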
Source: https://stackoverflow.com/questions/61078444/import-a-github-repo-into-databricks-community-edition