What are the open source tools and techniques to build a complete data warehouse platform? [closed]

Submitted by 为君一笑 on 2019-12-04 07:39:45

Question


I'm looking for open source tools, preferably free or with a free trial version, to set up a complete data warehouse stack.

I know of a few, such as Pentaho's open source Mondrian server, but I couldn't find anything on Google about setting up a complete platform. I'm also not sure whether these components are compatible with each other. Could someone please list them along with their position in the chain?


Answer 1:


The Open Source Data Warehousing presentation does a great job of identifying OSS components that could be used to build a data warehouse stack: Infrastructure (servers, OS, databases), Integration Management (ETL, EAI, etc.), Information Management (DW/Mart/ODS, OLAP servers, etc.), and Information Delivery (portal, dashboard, analytics/OLAP client, etc.). Here is a summary:

Open Source BI/DW Projects

BI and Analytics

  • BEE - http://bee.insightstrategy.cz/en/index.html
  • BIRT - http://www.eclipse.org/birt
  • JasperSoft – http://www.jaspersoft.com
  • MarvelIT - http://www.marvelit.com/dash.html
  • OpenI – http://openi.sourceforge.net
  • OpenReports – http://oreports.com
  • Orange - http://www.ailab.si/orange
  • Palo – http://www.palo.net
  • Pentaho - http://www.pentaho.com
  • R - http://www.r-project.org
  • SpagoBI – http://spagobi.eng.it
  • Weka - http://www.cs.waikato.ac.nz/~ml/index.html
  • VitalSigns - http://vitalsigns.sourceforge.net/

Databases

  • http://greenplum.org (bizgres)
  • http://www.ingres.com
  • http://www.mysql.com
  • http://www.postgresql.org
  • http://www.enterprisedb.com

Integration

  • Apatar - http://www.apatar.com
  • CloverETL - http://cloveretl.berlios.de/
  • JitterBit - http://www.jitterbit.com/
  • KETL - http://www.ketl.org
  • Octopus - http://www.enhydra.org/tech/octopus/index.html
  • OSDQ - http://sourceforge.net/projects/dataquality
  • Pentaho - http://www.pentaho.com
  • Red Hat – http://www.redhat.com
  • Saga.M31 Galaxy - http://galaxy.sagadc.com
  • Talend - http://www.talend.com
  • SnapLogic – http://www.snaplogic.com

I recommend browsing the presentation. Good stuff.




Answer 2:


A data warehouse stack (or suite) usually consists of three layers, commonly referred to as ETL (loading), Database, and Reporting (interface). In addition, there are more advanced tools for performance and expert needs: cubes and statistical analysis tools.

As far as interoperability goes, the ETL tools and the reporting tools need to support whatever database you are using. However, since there are only two big open source databases, there is usually no problem mixing different solutions.

As for specifics -

1 - ETL

Data loading can be done with open-source tools such as Pentaho Data Integration or Talend (which is Eclipse-based). I would suggest googling "open source etl" to tailor the solution to your specific needs.

2 - DB

You'll need a relational database (RDBMS). The two most prominent open-source players are PostgreSQL and MySQL. While MySQL has a larger user base, Postgres has been gaining more and more popularity since implementing several crucial features that were missing in earlier versions.
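As a rough illustration of how the ETL and DB layers fit together, here is a minimal sketch in Python. It is not how Pentaho Data Integration or Talend work internally; it only shows the extract, transform, and load steps in miniature. The sample data, the `fact_orders` table, and the function names are all hypothetical, and SQLite from the standard library stands in for PostgreSQL or MySQL so the example runs anywhere.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (not from the original post).
RAW_CSV = """order_id,customer,amount,order_date
1, alice ,10.50,2019-01-03
2,BOB,20.00,2019-01-04
3,alice,,2019-01-05
"""


def extract(text):
    """Extract: read rows from CSV text as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise customer names and drop rows with no amount."""
    cleaned = []
    for row in rows:
        if not row["amount"].strip():
            continue  # skip incomplete records
        cleaned.append((
            int(row["order_id"]),
            row["customer"].strip().lower(),
            float(row["amount"]),
            row["order_date"].strip(),
        ))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, "
        "amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# SQLite in memory is an assumption made for portability; a real stack
# would connect to PostgreSQL or MySQL through the matching driver.
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT customer, amount FROM fact_orders ORDER BY order_id"
).fetchall()
print(loaded)
```

A dedicated ETL tool adds scheduling, connectors, and error handling on top of this basic pattern, which is why it is usually preferable to hand-written scripts once the number of sources grows.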

3 - Reporting

Pentaho offers a reporting platform, and so does BIRT (an Eclipse-based project). Again, Google is your friend for specific comparisons. Note that if you choose Pentaho for both the ETL and reporting tools, you are likely to enjoy better integration. You also mentioned Mondrian, which is an OLAP server that answers MDX queries by translating them into SQL against an RDBMS. MDX is the de facto standard language for querying cubes.

At this point, assuming you are starting from scratch, I would recommend setting up the first two layers of the data warehouse: ETL and DB. You can later add any number of reporting tools on top.
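To give a feel for what querying a cube over a relational database amounts to, the sketch below aggregates a tiny star schema by two dimensions using plain SQL. This is not the SQL that Mondrian actually generates, and it does not use MDX; the tables (`dim_product`, `fact_sales`), the data, and the `rollup` function are hypothetical, and SQLite is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one dimension table and one fact table.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (product_id INTEGER, year INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO fact_sales VALUES
    (1, 2018, 10.0), (1, 2019, 15.0), (2, 2019, 30.0), (2, 2019, 5.0);
""")


def rollup(conn):
    """Aggregate the sales measure by category and year.

    Each returned row corresponds to one cell of a (category x year) cube.
    """
    sql = """
        SELECT d.category, f.year, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS d ON d.product_id = f.product_id
        GROUP BY d.category, f.year
        ORDER BY d.category, f.year
    """
    return conn.execute(sql).fetchall()


cells = rollup(conn)
for category, year, total in cells:
    print(category, year, total)
```

An OLAP server lets analysts express such slices in terms of dimensions and measures instead of writing the joins by hand, and it can cache or pre-aggregate the results.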




Answer 3:


There is another, similar question: "20 Billion Rows/Month - Hbase / Hive / Greenplum / What?"

The most relevant part:

I cannot stress this enough: Get something that plays nicely with off-the-shelf reporting tools.


Hive or HBase put you in the business of building a custom front-end, which you really don't want unless you're happy to spend the next 5 years writing custom report formatters in Python.




Answer 4:


Expanding on what Pascal wrote:

OLAP server: Mondrian

AJAX pivot tables: Saiku

OLAP schema designer: Pentaho Schema Workbench

OLAP aggregate designer: Pentaho Aggregation Designer

ETL: Pentaho Kettle

Report designer: Pentaho Report Designer

Data Quality: DataCleaner

Columnar Data Warehouse: MonetDB

Data Mining: RapidMiner




Answer 5:


Data Quality and Profiling - http://sourceforge.net/projects/dataquality/

It also has a Hive connection and a data workbench for creating real-life data.
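For readers unfamiliar with data profiling, the sketch below shows the kind of per-column statistics such a tool typically reports. It is a generic illustration and not the API of the project linked above; the `profile` function and the sample rows are hypothetical.

```python
def profile(rows):
    """Return row, missing-value, and distinct-value counts per column.

    `rows` is a list of dictionaries sharing the same keys. Both None and
    the empty string are treated as missing, which is an assumption made
    for this example.
    """
    columns = list(rows[0].keys()) if rows else []
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in (None, "")]
        report[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


# Hypothetical sample data.
SAMPLE = [
    {"customer": "alice", "country": "CZ"},
    {"customer": "bob", "country": ""},
    {"customer": "alice", "country": None},
]

report = profile(SAMPLE)
for column, stats in report.items():
    print(column, stats)
```

Running checks like these before loading data into the warehouse helps catch incomplete or duplicated records early.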



Source: https://stackoverflow.com/questions/3308238/what-are-the-open-source-tools-and-techniques-to-build-a-complete-data-warehouse
