What are the open source tools and techniques to build a complete data warehouse platform? [closed]

Question

I'm looking for open source tools, ideally free or with a free trial version, to set up a complete data warehouse stack.

I know about a few, such as Pentaho's open source Mondrian server, but I couldn't find any Google results on how to set up a complete platform, and I'm not sure whether these components are compatible with each other. Could someone please list them along with their position in the chain?

Answer 1:

The presentation "Open Source Data Warehousing" does a great job of identifying OSS components that could be used to build a data warehouse stack: Infrastructure (servers, OS, databases), Integration Management (ETL, EAI, etc.), Information Management (DW/Mart/ODS, OLAP servers, etc.), and Information Delivery (portal, dashboard, analytics/OLAP client, etc.). Here is a summary:

Open Source BI/DW Projects

BI and Analytics

BEE - http://bee.insightstrategy.cz/en/index.html

BIRT - http://www.eclipse.org/birt

JasperSoft – http://www.jaspersoft.com

MarvelIT - http://www.marvelit.com/dash.html

OpenI – http://openi.sourceforge.net

OpenReports – http://oreports.com

Orange - http://www.ailab.si/orange

Palo – http://www.palo.net

Pentaho - http://www.pentaho.com

R - http://www.r-project.org

SpagoBI – http://spagobi.eng.it

Weka - http://www.cs.waikato.ac.nz/~ml/index.html

VitalSigns - http://vitalsigns.sourceforge.net/

Databases

http://greenplum.org (bizgres)

http://www.ingres.com

http://www.mysql.com

http://www.postgresql.org

http://www.enterprisedb.com

Integration

Apatar - http://www.apatar.com

CloverETL - http://cloveretl.berlios.de/

JitterBit - http://www.jitterbit.com/

KETL - http://www.ketl.org

Octopus - http://www.enhydra.org/tech/octopus/index.html

OSDQ - http://sourceforge.net/projects/dataquality

Pentaho - http://www.pentaho.com

Red Hat – http://www.redhat.com

Saga.M31 Galaxy - http://galaxy.sagadc.com

Talend - http://www.talend.com

SnapLogic – http://www.snaplogic.com

I recommend browsing the presentation. Good stuff.

Answer 2:

A data warehouse stack (or suite) usually consists of three layers, commonly referred to as ETL (loading), database, and reporting (interface). On top of these there are more advanced tools for performance and expert needs: cubes and statistical analysis tools.

As far as interoperability goes, the ETL tools and the reporting tools need to support whatever database you are using. However, since there are only two big open source databases, there is usually no problem mixing different solutions.

As for specifics -

1 - ETL

Data loading can be achieved with open-source tools such as Pentaho Data Integration or Talend (which is built on Eclipse). I would suggest searching for "open source etl" to tailor the solution to your specific needs.
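To make the ETL layer concrete, here is a minimal sketch of the extract, transform, and load steps in plain Python. It is not how Pentaho Data Integration or Talend work internally; it only illustrates what such a tool does. The CSV content, the `orders` table, and the function names are hypothetical, and SQLite (from the standard library) stands in for PostgreSQL or MySQL so the example is self-contained.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would be a file or an
# operational database rather than an inline string.
RAW_CSV = """order_id,order_date,customer,amount
1,2010-07-01, alice ,10.50
2,2010-07-01,BOB,20.00
3,2010-07-02,alice,
4,2010-07-02,Carol,5.25
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise customer names, cast types, drop rows with no amount."""
    clean = []
    for row in rows:
        if not row["amount"].strip():
            continue  # reject incomplete records
        clean.append((
            int(row["order_id"]),
            row["order_date"],
            row["customer"].strip().lower(),
            float(row["amount"]),
        ))
    return clean


def load(conn, rows):
    """Load: create the target table if needed and insert the cleaned rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, order_date TEXT, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


# SQLite stands in for the warehouse database (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(loaded)
```

A dedicated ETL tool adds scheduling, logging, connectors, and a visual designer on top of this basic pattern.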

2 - DB

You'll need a relational database (RDBMS). The two most prominent open-source players are PostgreSQL and MySQL. While MySQL has a larger user base, Postgres has been gaining more and more popularity ever since it implemented several crucial features that were missing in earlier versions.

3 - Reporting

Pentaho offers a reporting platform, as does BIRT (an Eclipse project). Again, Google is your friend for specific comparisons. Note that if you choose Pentaho for both the ETL and reporting tools, you are likely to enjoy better integration. You've also mentioned Mondrian, which is an OLAP server that answers MDX queries by translating them into SQL against an RDBMS. MDX is the standard language for querying cubes.

At this point, assuming you are starting from scratch, I would recommend setting up the first two layers of the data warehouse, ETL and DB. You can later add any number of reporting tools on top.
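As a rough illustration of what a cube query boils down to at the database level, the sketch below builds a tiny star schema and runs the kind of aggregate SQL an OLAP layer might issue against it. This is not Mondrian's schema format or MDX; the table names, data, and the `sales_by_category_and_year` helper are hypothetical, and SQLite stands in for the warehouse database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A minimal star schema: one fact table keyed to two dimension tables.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'books'), (2, 'music');
INSERT INTO dim_date VALUES (1, 2009), (2, 2010);
INSERT INTO fact_sales VALUES (1, 1, 10.0), (1, 2, 15.0), (2, 2, 7.5), (2, 2, 2.5);
""")


def sales_by_category_and_year(conn):
    """Aggregate the sales measure along the product and date dimensions."""
    sql = """
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """
    return conn.execute(sql).fetchall()


cube = sales_by_category_and_year(conn)
for category, year, total in cube:
    print(category, year, total)
```

An OLAP server lets report users express the same slice in terms of dimensions and measures instead of hand-written joins.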

Answer 3:

Here is another similar question: "20 Billion Rows/Month - Hbase / Hive / Greenplum / What?"

The most relevant part:

I cannot stress this enough: Get something that plays nicely with off-the-shelf reporting tools.

Hive or HBase put you in the business of building a custom front-end, which you really don't want unless you're happy to spend the next 5 years writing custom report formatters in Python.

Answer 4:

Expanding on what Pascal wrote:

OLAP server: Mondrian

AJAX pivot tables: Saiku

OLAP schema designer: Pentaho Schema Workbench

OLAP aggregate designer: Pentaho Aggregation Designer

ETL: Pentaho Kettle

Report designer: Pentaho Report Designer

Data Quality: DataCleaner

Columnar Data Warehouse: MonetDB

Data Mining: RapidMiner

Answer 5:

Data Quality and Profiling - http://sourceforge.net/projects/dataquality/

It also has a Hive connection and a data workbench for creating real-life data.

Source: https://stackoverflow.com/questions/3308238/what-are-the-open-source-tools-and-techniques-to-build-a-complete-data-warehouse

Tags

open-source

data-warehouse