Question:
What ways do we have available to connect to a Google Cloud SQL (MySQL) instance from the newly introduced Google Cloud Composer? The intention is to get data from a Cloud SQL instance into BigQuery (perhaps with an intermediary step through Cloud Storage).
Can the Cloud SQL Proxy be exposed in some way on pods that are part of the Kubernetes cluster hosting Composer?
If not, can the Cloud SQL Proxy be brought in using the Kubernetes Service Broker? -> https://cloud.google.com/kubernetes-engine/docs/concepts/add-on/service-broker
Should Airflow be used to schedule and call GCP API commands, e.g. 1) export the MySQL table to Cloud Storage, 2) load the MySQL export into BigQuery?
Perhaps there are other methods that I am missing to get this done.
Answer 1:
We had the same problem, but with a Postgres instance. This is what we did to get it to work:
Create a sqlproxy deployment in the Kubernetes cluster where Airflow runs, as a copy of the existing airflow-sqlproxy deployment used by the default airflow_db connection, with the following changes to the deployment file:
- replace all instances of airflow-sqlproxy with the new proxy name
- under 'spec: template: spec: containers: command: -instances', replace the existing instance name with the name of the new instance we want to connect to
Create a Kubernetes service, again as a copy of the existing airflow-sqlproxy-service, with the following changes:
- replace all instances of airflow-sqlproxy with the new proxy name
- under 'spec: ports', change to the appropriate port (we used 5432 for a Postgres instance)
In the Airflow UI, add a connection of type Postgres with the host set to the newly created service name.
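
For illustration, here is a rough sketch of those steps using the official kubernetes Python client instead of hand-editing the YAML. The proxy name my-postgres-sqlproxy, the instance connection string, and the default namespace are assumptions for the example, not values from the original answer.

    import copy
    from kubernetes import client, config

    NEW_NAME = "my-postgres-sqlproxy"  # hypothetical name for the new proxy
    # Hypothetical Cloud SQL connection string (project:region:instance=tcp:port).
    NEW_INSTANCE = "my-project:us-central1:my-postgres=tcp:0.0.0.0:5432"
    NAMESPACE = "default"  # assumed namespace of the Composer workloads

    config.load_kube_config()  # assumes kubectl already points at the Composer cluster
    apps = client.AppsV1Api()
    core = client.CoreV1Api()

    # Copy the existing airflow-sqlproxy deployment under the new name.
    dep = apps.read_namespaced_deployment("airflow-sqlproxy", NAMESPACE)
    new_dep = copy.deepcopy(dep)
    new_dep.metadata = client.V1ObjectMeta(name=NEW_NAME, labels={"app": NEW_NAME})
    new_dep.spec.selector.match_labels = {"app": NEW_NAME}
    new_dep.spec.template.metadata.labels = {"app": NEW_NAME}
    new_dep.status = None  # status is server-managed; drop it before creating

    # Swap the -instances flag so the proxy dials the new Cloud SQL instance.
    container = new_dep.spec.template.spec.containers[0]
    container.command = [
        arg if not arg.startswith("-instances=") else "-instances=" + NEW_INSTANCE
        for arg in container.command
    ]
    apps.create_namespaced_deployment(NAMESPACE, new_dep)

    # Create a matching service exposing the Postgres port (5432).
    new_svc = client.V1Service(
        metadata=client.V1ObjectMeta(name=NEW_NAME + "-service"),
        spec=client.V1ServiceSpec(
            selector={"app": NEW_NAME},
            ports=[client.V1ServicePort(port=5432, target_port=5432)],
        ),
    )
    core.create_namespaced_service(NAMESPACE, new_svc)

The Airflow connection from the last step would then use my-postgres-sqlproxy-service as its host and 5432 as its port.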
Answer 2:
You can follow these instructions to launch a new Cloud SQL proxy instance in the cluster.
Re #3: That sounds like a good plan. There isn't a Cloud SQL to BigQuery operator to my knowledge, so you'd have to do it in two phases as you described.
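
For illustration, a minimal sketch of that two-phase DAG using the contrib operators bundled with Composer's Airflow; the connection ID my_mysql, the bucket my-staging-bucket, and the table names are assumed placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.mysql_to_gcs import (
        MySqlToGoogleCloudStorageOperator)
    from airflow.contrib.operators.gcs_to_bq import (
        GoogleCloudStorageToBigQueryOperator)

    with DAG(dag_id="mysql_to_bigquery",
             start_date=datetime(2018, 1, 1),
             schedule_interval="@daily") as dag:

        # Phase 1: dump the MySQL table to newline-delimited JSON in GCS,
        # writing a BigQuery schema file next to it.
        export = MySqlToGoogleCloudStorageOperator(
            task_id="mysql_to_gcs",
            mysql_conn_id="my_mysql",
            sql="SELECT * FROM my_table",
            bucket="my-staging-bucket",
            filename="exports/my_table/{{ ds }}/part-{}.json",
            schema_filename="exports/my_table/{{ ds }}/schema.json",
        )

        # Phase 2: load the export into BigQuery using the generated schema.
        load = GoogleCloudStorageToBigQueryOperator(
            task_id="gcs_to_bq",
            bucket="my-staging-bucket",
            source_objects=["exports/my_table/{{ ds }}/part-*.json"],
            schema_object="exports/my_table/{{ ds }}/schema.json",
            destination_project_dataset_table="my_dataset.my_table",
            source_format="NEWLINE_DELIMITED_JSON",
            write_disposition="WRITE_TRUNCATE",
        )

        export >> load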
Answer 3:
Here, in Hoffa's answer to a similar question, you can find a reference on how WePay keeps it synchronized every 15 minutes using an Airflow operator.
From said answer:
Take a look at how WePay does this:
The MySQL to GCS operator executes a SELECT query against a MySQL table. The SELECT pulls all data greater than (or equal to) the last high watermark. The high watermark is either the primary key of the table (if the table is append-only), or a modification timestamp column (if the table receives updates). Again, the SELECT statement also goes back a bit in time (or rows) to catch potentially dropped rows from the last query (due to the issues mentioned above).
With Airflow they manage to keep BigQuery synchronized to their MySQL database every 15 minutes.
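
As a sketch of that high-watermark pattern, assuming a modification timestamp column named modified_at and dropped into a DAG like the one in Answer 2; the one-hour overlap stands in for the "goes back a bit in time" safety margin:

    from airflow.contrib.operators.mysql_to_gcs import (
        MySqlToGoogleCloudStorageOperator)

    # Incremental pull: everything at or above the last high watermark,
    # re-reading a one-hour overlap to catch rows missed by the last run.
    extract = MySqlToGoogleCloudStorageOperator(
        task_id="incremental_mysql_to_gcs",
        mysql_conn_id="my_mysql",  # hypothetical connection ID
        sql="""
            SELECT * FROM my_table
            WHERE modified_at >=
              TIMESTAMP('{{ prev_execution_date.strftime("%Y-%m-%d %H:%M:%S") }}')
              - INTERVAL 1 HOUR
        """,
        bucket="my-staging-bucket",
        filename="incremental/my_table/{{ ts_nodash }}/part-{}.json",
    )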