I found this post the most helpful (though it's a year old): http://yahoohadoop.tumblr.com/post/98256601751/pig-and-hive-at-yahoo
It talks specifically about Pig vs. Hive and when and where each is employed at Yahoo. Some interesting notes:
On incremental changes/updates to data sets:
Instead, joining against the new incremental data and using the
results together with the results from the previous full join is the
correct approach. This will take only a few minutes. Standard database
operations can be implemented in this incremental way in Pig Latin,
making Pig a good tool for this use case.
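The incremental pattern the quote describes can be sketched in plain Python (all names here are illustrative, not from the post; in Pig Latin the same steps would be a JOIN of the new data followed by a UNION with the previous result, running over HDFS-scale files):

```python
# Incremental-join sketch: instead of re-joining the entire fact data
# on every run, join only today's new records against the dimension
# table, then combine that with yesterday's full join result.
# "users", "yesterdays_join", and "new_events" are hypothetical.

users = {1: "alice", 2: "bob"}              # small dimension table

yesterdays_join = [                         # result of the previous full join
    (1, "login", "alice"),
]

new_events = [(2, "click"), (1, "logout")]  # today's incremental data

# Join only the increment against the dimension table...
incremental_join = [
    (uid, action, users[uid]) for uid, action in new_events if uid in users
]

# ...and union it with the previous result to get today's full join.
todays_join = yesterdays_join + incremental_join
```

The saving is that the expensive step touches only the increment; the previously computed join is reused as-is, which is why the quote says it takes only minutes.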
On using other tools via streaming:
Pig integration with streaming also makes it easy for researchers to
take a Perl or Python script they have already debugged on a small
data set and run it against a huge data set.
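A script in that style might look like the following (the cleanup logic and field layout are hypothetical): it reads tab-separated records on stdin and writes results to stdout, so the same file can be debugged locally on a sample and then invoked at scale through Pig's STREAM operator:

```python
#!/usr/bin/env python3
# Stand-alone filter script: reads tab-separated records on stdin,
# normalizes the second field, and writes the result to stdout.
# Debug it locally (cat sample.tsv | python3 clean.py), then run it
# at scale from Pig, e.g.:  cleaned = STREAM raw THROUGH `clean.py`;
import sys

def clean(line):
    # Split off the first field; lowercase and trim the rest.
    user, query = line.rstrip("\n").split("\t", 1)
    return f"{user}\t{query.strip().lower()}"

if __name__ == "__main__":
    for line in sys.stdin:
        print(clean(line))
```

Because the script only speaks stdin/stdout, nothing about it is Hadoop-specific, which is the point the post is making.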
On using Hive for data warehousing:
In both cases, the relational model and SQL are the best fit. Indeed,
data warehousing has been one of the core use cases for SQL through
much of its history. It has the right constructs to support the types
of queries and tools that analysts want to use. And it is already in
use by both the tools and users in the field.
The Hadoop subproject Hive provides a SQL interface and relational
model for Hadoop. The Hive team has begun work to integrate with BI
tools via interfaces such as ODBC.