Generate database schema diagram for Databricks


Question


I'm creating a Databricks application and the database schema is getting to be non-trivial. Is there a way I can generate a schema diagram for a Databricks database (something similar to the schema diagrams that can be generated from mysql)?


Answer 1:


There are two possible approaches:

  1. using Spark SQL with show databases, show tables in <database>, describe table ...
  2. using spark.catalog.listDatabases, spark.catalog.listTables, spark.catalog.listColumns.

The second variant is slightly easier to use programmatically, but it isn't very performant when there are a lot of tables in the database/namespace. In both cases the implementation is just three nested loops: iterate over the list of databases, then the list of tables in each database, then the list of columns in each table. That data can then be used to generate a diagram with your favorite diagramming tool.
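The three nested loops can be sketched without a Spark session by operating on a plain dict; the `schema_to_plantuml` helper and the sample schema below are hypothetical, for illustration only:

```python
# Minimal sketch of the nested-loop idea on plain data structures:
# given a mapping of database -> table -> list of (column, type) pairs,
# emit PlantUML class-diagram source. The sample schema is made up.

def schema_to_plantuml(schema):
    lines = ["@startuml", "skinparam packageStyle rectangle",
             "hide circle", "hide empty methods", ""]
    for db_name, tables in schema.items():          # loop 1: databases
        lines.append(f'package "{db_name}" {{')
        for table_name, columns in tables.items():  # loop 2: tables
            lines.append(f"class {table_name} {{")
            for col_name, data_type in columns:     # loop 3: columns
                lines.append(f"{{field}} {col_name} : {data_type}")
            lines.append("}")
        lines.append("}")
    lines.append("@enduml")
    return "\n".join(lines)

print(schema_to_plantuml(
    {"airbnb": {"listings": [("id", "bigint"), ("price", "double")]}}))
```

In the full script below, the same three loops are driven by `show databases`, `show tables`, and `describe table` instead of an in-memory dict.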

Here is the code for generating the source for PlantUML:

# This script generates PlantUML diagram for tables visible to Spark.
# The diagram is stored in the db_schema.puml file, so just run
# 'java -jar plantuml.jar db_schema.puml' to get PNG file

from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

# Variables

# list of databases/namespaces to analyze.  Could be empty, then all existing 
# databases/namespaces will be processed
databases = ["a", "airbnb"] # put databases/namespace to handle
# change this if you want to include temporary tables as well
include_temp = False

# implementation
spark = SparkSession.builder.appName("Database Schema Generator").getOrCreate()

# if databases aren't specified, then fetch list from the Spark
if len(databases) == 0:
    databases = [db["namespace"] for db in spark.sql("show databases").collect()]

with open("db_schema.puml", "w") as f:
    f.write("\n".join(
        ["@startuml", "skinparam packageStyle rectangle", "hide circle",
         "hide empty methods", "", ""]))

    for database_name in databases:
        f.write(f'package "{database_name}" {{\n')
        tables = spark.sql(f"show tables in `{database_name}`")
        for tbl in tables.collect():
            table_name = tbl["tableName"]
            db = tbl["database"]
            if include_temp or not tbl["isTemporary"]:
                lines = []
                try:
                    lines.append(f'class {table_name} {{')
                    cols = spark.sql(f"describe table `{db}`.`{table_name}`")
                    for cl in cols.collect():
                        col_name = cl["col_name"]
                        data_type = cl["data_type"]
                        lines.append(f'{{field}} {col_name} : {data_type}')

                    lines.append('}\n')
                    f.write("\n".join(lines))
                except AnalysisException as ex:
                    print(f"Error when trying to describe {db}.{table_name}: {ex}")

        f.write("}\n\n")

    f.write("@enduml\n")

The resulting db_schema.puml file can then be rendered into a picture, e.g. with java -jar plantuml.jar db_schema.puml.



来源:https://stackoverflow.com/questions/63783824/generate-database-schema-diagram-for-databricks
