Question
I want to store data from SQL in a Pandas DataFrame, do some data transformations, and then load the result into another table using Airflow.
The issue I am facing is that the connection strings to the tables are accessible only through Airflow, so I need to use Airflow as the medium to read and write the data.
How can this be done?
My code
Task1 = PostgresOperator(
    task_id='Task1',
    postgres_conn_id='REDSHIFT_CONN',
    sql="SELECT * FROM Western.trip LIMIT 5",
    params={'limit': '50'},
    dag=dag
)
The output of the task needs to be stored in a DataFrame (df) and, after transformations, loaded back into another table.
How can this be done?
Answer 1:
I doubt there's a built-in operator for this. You can easily write a custom operator:
- Extend PostgresOperator, or just BaseOperator / any other operator of your choice. All custom code goes into the overridden execute() method.
- Then use PostgresHook to obtain a pandas DataFrame by invoking its get_pandas_df() function.
- Perform whatever transformations you have to do on your pandas df.
- Finally, use the insert_rows() function to insert the data into the destination table (a minimal sketch of this flow follows; the full operator is under UPDATE-1 below).
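At its core the flow reduces to the hook calls alone. Here is a minimal sketch, assuming the REDSHIFT_CONN connection from the question and a hypothetical destination table western.trip_copy:

from airflow.hooks.postgres_hook import PostgresHook

def transfer_trips():
    # the connection details stay in Airflow; only the conn_id appears in code
    hook = PostgresHook(postgres_conn_id='REDSHIFT_CONN')
    # read the query result into a pandas DataFrame
    df = hook.get_pandas_df(sql="SELECT * FROM Western.trip LIMIT 5")
    # ... transform df here ...
    # write the transformed rows back to a destination table
    hook.insert_rows(table='western.trip_copy',
                     rows=list(df.itertuples(index=False, name=None)))

Wrapped in a PythonOperator, this already works end-to-end; the custom operator below packages the same steps more reusably.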
UPDATE-1
As requested, here is the code for the operator:
from typing import Dict, Any, List, Tuple

from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.utils.decorators import apply_defaults
from pandas import DataFrame


class MyCustomOperator(PostgresOperator):

    @apply_defaults
    def __init__(self, destination_table: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.destination_table: str = destination_table

    def execute(self, context: Dict[str, Any]):
        # create PostgresHook
        self.hook: PostgresHook = PostgresHook(postgres_conn_id=self.postgres_conn_id,
                                               schema=self.database)
        # read data from Postgres-SQL query into pandas DataFrame
        df: DataFrame = self.hook.get_pandas_df(sql=self.sql, parameters=self.parameters)
        # perform transformations on df here
        df['column_to_be_doubled'] = df['column_to_be_doubled'].multiply(2)
        # ...
        # convert pandas DataFrame into list of tuples
        rows: List[Tuple[Any, ...]] = list(df.itertuples(index=False, name=None))
        # insert list of tuples into destination Postgres table
        self.hook.insert_rows(table=self.destination_table, rows=rows)
Note: The snippet is for reference only; it has NOT been tested
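For completeness, here is a minimal sketch of how the operator might be wired into a DAG; the DAG id, schedule, and destination table name are illustrative assumptions:

from datetime import datetime
from airflow import DAG

with DAG(dag_id='postgres_to_pandas_demo',        # hypothetical DAG id
         start_date=datetime(2020, 5, 1),
         schedule_interval=None) as dag:
    transfer_task = MyCustomOperator(
        task_id='transfer_task',
        postgres_conn_id='REDSHIFT_CONN',         # the connection from the question
        sql="SELECT * FROM Western.trip LIMIT 5",
        destination_table='western.trip_copy',    # hypothetical destination table
    )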
References
- Pandas convert DataFrame into Array of tuples
Further modifications / improvements
- The destination_table param can be read from an Airflow Variable.
- If the destination table doesn't necessarily reside in the same Postgres schema, we can take another param like destination_postgres_conn_id in __init__ and use that to create a destination_hook, on which we can invoke the insert_rows() method (see the sketch after this list).
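A hedged sketch combining both ideas, assuming it lives in the same module as MyCustomOperator above; the Variable key dest_table_name, the class name, and the connection IDs are illustrative assumptions, not part of the original answer:

from typing import Dict, Any

from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import Variable
from airflow.utils.decorators import apply_defaults


class MyCrossDbOperator(MyCustomOperator):  # hypothetical variant of the operator above

    @apply_defaults
    def __init__(self, destination_postgres_conn_id: str, *args, **kwargs):
        # improvement 1: fall back to an Airflow Variable when no destination_table is passed
        if 'destination_table' not in kwargs:
            kwargs['destination_table'] = Variable.get('dest_table_name')  # assumed Variable key
        super().__init__(*args, **kwargs)
        # improvement 2: a separate connection for the destination
        self.destination_postgres_conn_id: str = destination_postgres_conn_id

    def execute(self, context: Dict[str, Any]):
        # read via the source connection
        source_hook = PostgresHook(postgres_conn_id=self.postgres_conn_id)
        df = source_hook.get_pandas_df(sql=self.sql, parameters=self.parameters)
        # ... transform df here ...
        # write via a hook on the destination connection
        destination_hook = PostgresHook(postgres_conn_id=self.destination_postgres_conn_id)
        destination_hook.insert_rows(table=self.destination_table,
                                     rows=list(df.itertuples(index=False, name=None)))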
Source: https://stackoverflow.com/questions/61555430/how-to-do-store-sql-output-to-pandas-dataframe-using-airflow