Reading a large graph from Titan (on HBase) into Spark

依然范特西╮ submitted on 2019-12-03 17:02:37

About a year ago, I ran into the same challenge you describe: we had a very large Titan instance, but we could not run any OLAP processes on it.

I researched the subject in depth, but every solution I found (SparkGraphComputer, TitanHBaseInputFormat) was either very slow (days or weeks at our scale) or buggy and missed data. The main cause of the slowness was that all of them went through the main HBase API, which turned out to be the speed bottleneck.

So I implemented Mizo, a Spark RDD for Titan on HBase that bypasses the main HBase API and instead parses HBase's internal data files (called HFiles) directly.

I have tested it at a fairly large scale: a Titan graph with hundreds of billions of elements, about 25 TB in size.

Because it does not rely on the Scan API that HBase exposes, it is much faster. For example, counting the edges in the graph I mentioned takes about 10 hours.
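To put that figure in perspective, here is a rough back-of-the-envelope sketch in Python of the aggregate read throughput it implies. It assumes that the edge count reads the full 25 TB once and that 1 TB = 10^12 bytes; neither is something I measured separately, and the function name is only for illustration.

```python
def aggregate_throughput_mb_per_s(total_bytes: float, hours: float) -> float:
    """Average aggregate throughput in MB/s (1 MB = 10**6 bytes).

    Assumes the whole dataset is read exactly once during the job.
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    seconds = hours * 3600
    return total_bytes / seconds / 1e6


if __name__ == "__main__":
    graph_size_bytes = 25e12   # about 25 TB, assuming decimal terabytes
    job_duration_hours = 10    # the edge count mentioned above
    rate = aggregate_throughput_mb_per_s(graph_size_bytes, job_duration_hours)
    print(f"{rate:.1f} MB/s across the whole cluster")
```

Plugging in a duration of several days for the same data size shows how much lower the effective rate of the slower approaches would be under the same assumptions.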
