How to store SQL output to a pandas DataFrame using Airflow?


Question


I want to store data from SQL into a pandas DataFrame, do some data transformations, and then load it into another table using Airflow.

The issue I am facing is that the connection strings to the tables are accessible only through Airflow, so I need to use Airflow as the medium to read and write data.

How can this be done?

My code:

Task1 = PostgresOperator(
    task_id='Task1',
    postgres_conn_id='REDSHIFT_CONN',
    sql="SELECT * FROM Western.trip limit 5 ",
    params={'limit': '50'},
    dag=dag
)

The output of the task needs to be stored in a DataFrame (df) and, after transformations, loaded back into another table.


Answer 1:


I doubt there's a built-in operator for this, but you can easily write a custom one:

  • Extend PostgresOperator, or just BaseOperator / any other operator of your choice. All custom code goes into the overridden execute() method.
  • Then use PostgresHook to obtain a pandas DataFrame by invoking its get_pandas_df() function.
  • Perform whatever transformations you need on the pandas df.
  • Finally, use the insert_rows() function to insert the data into the destination table (a minimal sketch of these hook calls follows this list).
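
For orientation, here is a minimal, untested sketch of just these two hook calls, e.g. inside a PythonOperator callable. It reuses the REDSHIFT_CONN connection and Western.trip table from the question; Western.trip_copy is a hypothetical destination table.

from airflow.hooks.postgres_hook import PostgresHook

def transfer_fn():
    hook = PostgresHook(postgres_conn_id='REDSHIFT_CONN')
    # read the query result straight into a pandas DataFrame
    df = hook.get_pandas_df(sql="SELECT * FROM Western.trip LIMIT 5")
    # ... transform df here ...
    # write the (transformed) rows into the destination table
    hook.insert_rows(table='Western.trip_copy',
                     rows=list(df.itertuples(index=False, name=None)))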

UPDATE-1

As requested, I'm hereby adding the code for the operator

from typing import Dict, Any, List, Tuple

from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.postgres_operator import PostgresOperator
from airflow.utils.decorators import apply_defaults
from pandas import DataFrame


class MyCustomOperator(PostgresOperator):

    @apply_defaults
    def __init__(self, destination_table: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.destination_table: str = destination_table

    def execute(self, context: Dict[str, Any]):
        # create PostgresHook
        self.hook: PostgresHook = PostgresHook(postgres_conn_id=self.postgres_conn_id,
                                               schema=self.database)
        # read data from Postgres-SQL query into pandas DataFrame
        df: DataFrame = self.hook.get_pandas_df(sql=self.sql, parameters=self.parameters)
        # perform transformations on df here; 'column_to_be_doubled' is just an example column
        df['column_to_be_doubled'] = df['column_to_be_doubled'].multiply(2)
        # ... any further transformations ...
        # convert pandas DataFrame into list of tuples
        rows: List[Tuple[Any, ...]] = list(df.itertuples(index=False, name=None))
        # insert list of tuples in destination Postgres table
        self.hook.insert_rows(table=self.destination_table, rows=rows)

Note: The snippet is for reference only; it has NOT been tested
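
For completeness, a hypothetical instantiation of this operator inside the DAG might look as follows; Western.trip_transformed is a made-up destination table, the other values come from the question.

Task1 = MyCustomOperator(
    task_id='Task1',
    postgres_conn_id='REDSHIFT_CONN',
    sql="SELECT * FROM Western.trip limit 5 ",
    destination_table='Western.trip_transformed',
    dag=dag
)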

References

  • Pandas convert DataFrame into Array of tuples

Further modifications / improvements

  • The destination_table param can be read from an Airflow Variable
  • If the destination table doesn't necessarily reside in the same Postgres database, we can take another param like destination_postgres_conn_id in __init__ and use it to create a destination_hook, on which we then invoke the insert_rows method (see the sketch after this list)
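
A possible (untested) sketch of that second improvement, building on the imports and the MyCustomOperator class above; destination_postgres_conn_id is a made-up parameter name:

class MyCrossDbOperator(MyCustomOperator):

    @apply_defaults
    def __init__(self, destination_postgres_conn_id: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.destination_postgres_conn_id: str = destination_postgres_conn_id

    def execute(self, context: Dict[str, Any]):
        # read from the source connection, as in MyCustomOperator
        source_hook = PostgresHook(postgres_conn_id=self.postgres_conn_id,
                                   schema=self.database)
        df: DataFrame = source_hook.get_pandas_df(sql=self.sql, parameters=self.parameters)
        # ... transformations go here ...
        # write through a separate hook pointing at the destination connection
        destination_hook = PostgresHook(postgres_conn_id=self.destination_postgres_conn_id)
        destination_hook.insert_rows(table=self.destination_table,
                                     rows=list(df.itertuples(index=False, name=None)))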


Source: https://stackoverflow.com/questions/61555430/how-to-do-store-sql-output-to-pandas-dataframe-using-airflow
