Is it possible to store graphs in HBase? If so, how do you model the database to support a graph structure?

痞子三分冷 submitted on 2019-12-03 15:40:21

You can store an adjacency list in HBase/Accumulo in a column-oriented fashion. I'm more familiar with Accumulo (HBase terminology might be slightly different), so you might use a schema similar to:

SrcNode(RowKey) EdgeType(CF):DestNode(CFQ) Edge/Node Properties(Value)

Where CF=ColumnFamily and CFQ=ColumnFamilyQualifier
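
As a rough illustration of that edge layout, here is a minimal sketch that models the table as a plain Python dictionary. It is an in-memory stand-in, not a real HBase or Accumulo client, and the helper names (`put_edge`, `neighbors`) as well as the sample nodes and edge types are hypothetical.

```python
from collections import defaultdict

# In-memory stand-in for a wide-column table (NOT a real HBase/Accumulo client):
# row key -> {(column family, column qualifier): value}
table = defaultdict(dict)


def put_edge(table, src, edge_type, dest, props=""):
    """Store one edge as a cell: row=src, CF=edge_type, CFQ=dest, value=props."""
    table[src][(edge_type, dest)] = props


def neighbors(table, src, edge_type=None):
    """Return destination nodes of src, optionally restricted to one edge type."""
    return sorted(
        dest
        for (cf, dest) in table.get(src, {})
        if edge_type is None or cf == edge_type
    )


# Hypothetical sample data: three outgoing edges of two different types.
put_edge(table, "alice", "follows", "bob", "since=2011")
put_edge(table, "alice", "follows", "carol")
put_edge(table, "alice", "blocks", "dave")

print(neighbors(table, "alice", "follows"))
```

Because all outgoing edges of a node share one row key, reading a node's neighbours is a single row lookup in this model.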

You might also store node/vertex properties as separate rows using something like:

Node(RowKey) PropertyType(CF):PropertyValue(CFQ) PropertyValue(Value)

The PropertyValue could be either in the CFQ or the Value
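
The two placements of the property value can be compared with a small sketch. Again, this uses a plain dictionary as a stand-in for the table, and the function names and sample properties are hypothetical.

```python
# In-memory stand-in only: row key -> {(column family, column qualifier): value}
nodes = {}


def put_property_in_qualifier(nodes, node, prop_type, prop_value):
    """Variant A: the property value is the qualifier; the cell value stays empty."""
    nodes.setdefault(node, {})[(prop_type, prop_value)] = ""


def put_property_in_value(nodes, node, prop_type, prop_value):
    """Variant B: an empty qualifier, with the property stored as the cell value."""
    nodes.setdefault(node, {})[(prop_type, "")] = prop_value


# Variant A keeps both tags, because each value is its own column.
put_property_in_qualifier(nodes, "alice", "tag", "admin")
put_property_in_qualifier(nodes, "alice", "tag", "editor")

# Variant B keeps one cell per property type; the second write replaces the first.
put_property_in_value(nodes, "alice", "name", "Alice")
put_property_in_value(nodes, "alice", "name", "Alice B.")

tags = sorted(q for (cf, q) in nodes["alice"] if cf == "tag")
name = nodes["alice"][("name", "")]
```

In this model, putting the value in the qualifier suits multi-valued properties, while putting it in the cell value suits properties that have exactly one current value.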

From a graph processing perspective, as mentioned by @Arnon Rotem-Gal-Oz, you could look at Apache Giraph, which is an implementation of Google Pregel. Pregel is the method Google uses for large graph processing.
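
To give a feel for the Pregel model, here is a minimal single-process sketch of vertex-centric supersteps that propagates the maximum value through a graph. It does not use Giraph's API; the function name `pregel_max` is hypothetical, and the sketch assumes every edge target also appears as a key in `graph`.

```python
def pregel_max(graph, values):
    """Propagate the maximum value through the graph in synchronous supersteps.

    graph: vertex -> list of out-neighbours (every neighbour must be a key).
    values: vertex -> initial number.
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value along its outgoing edges.
    inbox = {v: [] for v in graph}
    for v, outs in graph.items():
        for dst in outs:
            inbox[dst].append(values[v])
    supersteps = 1
    # Later supersteps: only vertices that received messages do any work.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                # The vertex learned a larger value: adopt it and tell neighbours.
                values[v] = max(msgs)
                for dst in graph[v]:
                    outbox[dst].append(values[v])
        inbox = outbox
        supersteps += 1
    return values, supersteps


# Hypothetical three-vertex example.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result, steps = pregel_max(graph, {"a": 3, "b": 6, "c": 2})
```

The computation stops once a superstep produces no messages, which mirrors the way vertices vote to halt in the Pregel model.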

Support for using HBase/Accumulo as input to Giraph was recently (7 Mar 2012) submitted as a new feature request: HBase/Accumulo Input and Output formats (GIRAPH-153).

You can store the graph in HBase as an adjacency list. For example, each row would have columns for general properties (name, pagerank, etc.) and a list of keys of adjacent nodes (if it is a directed graph, then just the nodes you can reach from this node, or an additional column with the direction of each edge).
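
A minimal sketch of that row layout, using a dictionary in place of an HBase table, might look like the following. The column names (`name`, `pagerank`, `out`) and the sample rows are hypothetical; the traversal shows that each hop is one row lookup by key.

```python
from collections import deque

# In-memory stand-in for the table: row key -> columns of that row.
# "out" holds the keys of the nodes reachable from this node (directed graph).
rows = {
    "a": {"name": "A", "pagerank": 0.4, "out": ["b", "c"]},
    "b": {"name": "B", "pagerank": 0.3, "out": ["c"]},
    "c": {"name": "C", "pagerank": 0.3, "out": []},
    "d": {"name": "D", "pagerank": 0.0, "out": ["a"]},
}


def reachable(rows, start):
    """Breadth-first traversal over outgoing edges, returning keys in visit order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        key = queue.popleft()
        order.append(key)
        # One row lookup per visited node; unknown keys have no neighbours.
        for nxt in rows.get(key, {}).get("out", []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


from_a = reachable(rows, "a")
from_d = reachable(rows, "d")
```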

Take a look at Apache Giraph (you can also read a little more about it here). While this isn't about HBase, it is about handling graphs in Hadoop. You may also want to look at Hadoop 0.23 (and up), as the YARN engine (aka MapReduce 2) is more open to non-MapReduce algorithms.

big data nerd

I would not use HBase in the way "Binary Nerd" recommended it as HBase does not perform very well when handling multiple column families.

Best performance is achieved with a single column family. A second one should only be used if you very often access only the content of one column family and the data stored in the other column family is very large.
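
One way to keep typed edges while staying within a single column family is to move the edge type into the column qualifier. The sketch below is an assumption about how that could be done, not something prescribed in this thread; it models the table as a dictionary, and the family name `e`, the separator, and the helper names are all hypothetical.

```python
FAMILY = "e"  # hypothetical name of the single column family
SEP = ":"     # assumed separator; edge type names must not contain it


def edge_qualifier(edge_type, dest):
    """Build a qualifier of the form '<edge_type>:<dest>'."""
    return f"{edge_type}{SEP}{dest}"


def put_edge(table, src, edge_type, dest, props=""):
    """Store the edge under the single family, with the type inside the qualifier."""
    table.setdefault(src, {})[(FAMILY, edge_qualifier(edge_type, dest))] = props


def neighbors(table, src, edge_type):
    """Select edges of one type by matching the qualifier prefix."""
    prefix = edge_type + SEP
    return sorted(
        qualifier[len(prefix):]
        for (cf, qualifier) in table.get(src, {})
        if cf == FAMILY and qualifier.startswith(prefix)
    )


table = {}
put_edge(table, "alice", "follows", "bob")
put_edge(table, "alice", "follows", "carol")
put_edge(table, "alice", "blocks", "dave")

follows = neighbors(table, "alice", "follows")
blocks = neighbors(table, "alice", "blocks")
```

The trade-off in this sketch is that filtering by edge type becomes a prefix match on qualifiers within the row instead of a column family selection.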

There are graph databases built on top of HBase that you could try and/or study.

Apache S2Graph provides a REST API for storing and querying graph data represented by edges and vertices. There you can find a presentation that explains how the row/column keys are constructed. It also gives an analysis of the performance of operations that influenced, or are influenced by, the design.

Titan can use other storage backends besides HBase, and has integration with analytics frameworks. It is also designed with big data sets in mind.
