Best ETL Packages In Python

Submitted by 孤者浪人 on 2019-12-24 08:00:03

Question


I have 2 use cases:

  • Extract, Transform and Load from Oracle / PostgreSQL / Redshift / S3 / CSV to my own Redshift cluster
  • Schedule the job so it runs daily/weekly (INSERT + TABLE or INSERT + NONE options preferable).

I am currently using:

  1. SQLAlchemy for extracts (works well generally).
  2. PETL for transforms and loads (works well on smaller data sets, but for ~50m+ rows it is slow and the connections to the database(s) time out).
  3. An internal tool for the scheduling component (it stores the transform in XML and then loads from the XML, which seems rather long and complicated).

I have been looking through this link but would welcome additional suggestions. Exporting to Spark or similar is also welcome if there is an "easier" process where I can just do everything through Python (I'm only using Redshift because it seems like the best option).


Answer 1:


How about

  • Python
  • Pandas

This is what we use for our ETL processing.
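
As a rough illustration of a Pandas-based ETL pass, here is a minimal sketch that streams a query result in chunks instead of loading everything at once, which may help with the slowness and timeouts mentioned in the question for large tables. Everything in it is an assumption made for the example: in-memory SQLite stands in for the real source and target databases, and the function name `run_etl`, the table names, and the transform step are hypothetical.

```python
import sqlite3

import pandas as pd


def run_etl(source_conn, target_conn, query, target_table, chunksize=50_000):
    """Copy the result of `query` into `target_table` chunk by chunk.

    Returns the number of rows loaded.
    """
    total = 0
    # With chunksize set, read_sql_query yields DataFrames instead of one big frame.
    for chunk in pd.read_sql_query(query, source_conn, chunksize=chunksize):
        # Transform step (hypothetical): lower-case column names, drop empty rows.
        chunk.columns = [c.lower() for c in chunk.columns]
        chunk = chunk.dropna(how="all")
        # Load step: append this chunk to the target table.
        chunk.to_sql(target_table, target_conn, if_exists="append", index=False)
        total += len(chunk)
    return total


# Demo with in-memory SQLite standing in for the real databases (assumption).
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (ID INTEGER, AMOUNT REAL)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(i, i * 1.5) for i in range(10)]
)
source.commit()

loaded = run_etl(source, target, "SELECT * FROM orders", "orders_copy", chunksize=4)
print(loaded)
```

With a real Redshift target, appending rows through inserts like this is likely to be slow for tens of millions of rows; staging files in S3 and loading them with Redshift's COPY command is the commonly recommended bulk path, so treat the `to_sql` call above as a placeholder for the load step.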




Answer 2:


I'm using Pandas to access my ETL files. Try something like this:

  • Create a class with all your queries there.
  • Create another class that processes the actual data warehouse, using Pandas and Matplotlib for the graphs.
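
The two-class layout described above could look roughly like the sketch below. It is only one possible reading of the answer: the class names `Queries` and `Warehouse`, the `sales` table, and the query itself are hypothetical, and in-memory SQLite stands in for the warehouse. The Matplotlib part is left out so the example has no plotting dependency.

```python
import sqlite3

import pandas as pd


class Queries:
    """Keeps all SQL text in one place (query and table names are hypothetical)."""

    DAILY_SALES = (
        "SELECT sale_date, SUM(amount) AS total "
        "FROM sales GROUP BY sale_date ORDER BY sale_date"
    )


class Warehouse:
    """Runs the stored queries against a connection and returns DataFrames."""

    def __init__(self, conn):
        self.conn = conn

    def daily_sales(self):
        return pd.read_sql_query(Queries.DAILY_SALES, self.conn)


# Demo data in an in-memory SQLite database (assumption, not a real warehouse).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("2019-12-01", 10.0), ("2019-12-01", 5.0), ("2019-12-02", 7.5)],
)
conn.commit()

report = Warehouse(conn).daily_sales()
print(report)
```

If Matplotlib is installed, something like `report.plot(x="sale_date", y="total")` would cover the graph step.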


Source: https://stackoverflow.com/questions/46039850/best-etl-packages-in-python
