b-tree

How to build B-tree index using Apache Spark?

元气小坏坏 submitted on 2019-12-04 19:22:30
Now I have a set of numbers, such as 1, 4, 10, 23, ..., and I would like to build a B-tree index for them using Apache Spark. The format is one record per line (separated by '\n'). I also have no idea what format the output file should take; I just want a recommended one. The regular way of building a B-tree index is shown at https://en.wikipedia.org/wiki/B-tree , but I would now like a distributed, parallel version in Apache Spark. In addition, the B-tree Wikipedia page describes a way to build a B-tree to represent a large existing collection of data (see https://en.wikipedia.org/wiki/B-tree ). It seems
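
There is no single canonical recipe for this, but one way to parallelize it is to bulk-load the tree bottom-up, much like the bulk-loading approach the Wikipedia article mentions: sort the keys with Spark, cut the sorted sequence into leaves, and derive each upper level from the children's smallest keys. Below is a minimal PySpark sketch of that idea; the fan-out ORDER, the input path, and building the upper levels on the driver are my own assumptions, not an established API.

```python
from pyspark import SparkContext

ORDER = 64                          # assumed fan-out: max keys per node

sc = SparkContext(appName="btree-bulk-load")

# 1. Parse one integer key per line and sort globally (the distributed part).
keys = (sc.textFile("hdfs:///path/to/numbers.txt")   # assumed input path
          .map(lambda line: int(line.strip()))
          .sortBy(lambda k: k))

# 2. Group consecutive sorted keys into leaf nodes of at most ORDER keys each.
leaves = (keys.zipWithIndex()
              .map(lambda kv: (kv[1] // ORDER, kv[0]))   # (leaf index, key)
              .groupByKey()
              .sortByKey()
              .map(lambda kv: sorted(kv[1]))
              .collect())

# 3. Build internal levels on the driver from each child's smallest key.
#    Child links are implicit by position: node j at one level covers
#    children j*ORDER .. j*ORDER+ORDER-1 of the level below.
def build_level(children):
    """Group children ORDER at a time; a parent stores each child's smallest key as a separator."""
    return [[child[0] for child in children[i:i + ORDER]]
            for i in range(0, len(children), ORDER)]

level, tree_levels = leaves, [leaves]
while len(level) > 1:
    level = build_level(level)
    tree_levels.append(level)

print("tree height:", len(tree_levels), "root keys:", tree_levels[-1][0])
sc.stop()
```

For a real index you would write each level back to storage (for example as a paged binary file) instead of collecting the leaves to the driver; the collect here only keeps the sketch short.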

B-Tree vs Hash Table

一曲冷凌霜 submitted on 2019-12-04 07:26:04
Question: In MySQL, an index type is a B-tree, and accessing an element in a B-tree takes logarithmic amortized time, O(log(n)). On the other hand, accessing an element in a hash table takes O(1). Why is a hash table not used instead of a B-tree to access data inside a database? Answer 1: In a hash table you can only access elements by their primary key. This is faster than with a tree algorithm (O(1) instead of O(log(n))), but you cannot select ranges (everything between x and y). Tree algorithms
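
The range-query point is easy to see with a small Python illustration (my own, not from the answer): a hash map gives O(1) exact-key lookups, but answering "everything between x and y" needs an ordered structure, which is what a B-tree provides.

```python
import bisect

rows = {17: "alice", 42: "bob", 99: "carol", 3: "dave"}   # hash table: key -> row

print(rows[42])                       # O(1) point lookup by exact key

# For a range query we need ordering; a sorted key list stands in for the B-tree here.
sorted_keys = sorted(rows)
lo = bisect.bisect_left(sorted_keys, 10)
hi = bisect.bisect_right(sorted_keys, 50)
print([rows[k] for k in sorted_keys[lo:hi]])   # all rows with 10 <= key <= 50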

MySQL Hash Indexes for Optimization

和自甴很熟 submitted on 2019-12-04 05:12:04
So maybe this is a noob question, but I'm messing with a couple of tables. I have TABLE A, roughly 45,000 records, and TABLE B, roughly 1.5 million records. I have a query: update schema1.tablea a inner join ( SELECT DISTINCT ID, Lookup, IDpart1, IDpart2 FROM schema1.tableb WHERE IDpart1 is not NULL AND Lookup is not NULL ORDER BY ID, Lookup ) b Using(ID, Lookup) set a.Elg_IDpart1 = b.IDpart1, a.Elg_IDpart2 = b.IDpart2 where a.ID is NOT NULL AND a.Elg_IDpart1 is NULL So I am forcing the index on ID, Lookup. Each table does have an index on those columns as well, but because of the sub-query I forced it. It is

MySQL Index Principles and Slow Query Optimization

大憨熊 submitted on 2019-12-03 23:55:47
This article takes the MySQL database as its subject and discusses several topics related to database indexes. In particular, MySQL supports many storage engines, and each storage engine supports indexes differently, so MySQL offers multiple index types, such as BTree indexes, hash indexes, and full-text indexes. To avoid confusion, this article focuses only on BTree indexes, since these are the indexes you mainly deal with in everyday MySQL use; hash indexes and full-text indexes are not covered here. The main content is divided into the following parts: Part 1 discusses MySQL indexes from the perspective of data structures and of main-memory and disk access. Part 2 compares the B-Tree indexes of different MySQL engines (mainly MyISAM and InnoDB), including topics such as clustered and non-clustered indexes. Part 3 builds on that theory to discuss strategies for using indexes in MySQL with high performance. Part 4 analyzes slow queries and how to optimize them, based on real cases. Part 5 lists the referenced articles. 1. Data structures and algorithmic foundations of indexes. 1.1 The essence of an index. MySQL's official definition of an index is: an index is a data structure that helps MySQL retrieve data efficiently. Reducing that sentence to its core gives the essence of an index: an index is a data structure. Querying is one of a database's main functions. The most basic query algorithm is sequential scanning (linear search), whose time complexity is O(n); this is clearly inefficient when the data volume is large. Optimized search algorithms such as binary search, binary tree search, etc.
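
As a quick illustration of the jump from O(n) sequential scanning to O(log n) searching described above, here is a small Python sketch (my own, not from the article) contrasting the two over a sorted key list:

```python
import bisect

keys = list(range(0, 1_000_000, 7))      # a large, already-sorted key set

def linear_search(ks, target):
    """O(n): examine elements one by one until the target is found."""
    for i, k in enumerate(ks):
        if k == target:
            return i
    return -1

def binary_search(ks, target):
    """O(log n): halve the candidate range each step; requires sorted input."""
    i = bisect.bisect_left(ks, target)
    return i if i < len(ks) and ks[i] == target else -1

assert linear_search(keys, 700_007) == binary_search(keys, 700_007)
```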

Exploring the Underlying Principles of Database Indexes

穿精又带淫゛_ submitted on 2019-12-03 20:16:26
We have all borrowed books from a library. In such a huge library, why can we find the book we want so quickly? If the books were piled up at random, or shelved without any labels, could we still find them that fast? This scenario is very close to how we use databases in software development: the library's books are like the data in our tables, and the floor directory, shelf category labels, and call numbers are like the indexes we use to look up data. So what does the underlying data structure of a typical database index look like? To understand the underlying principles of database indexes, we first need to understand a data structure called a tree, and one classic tree structure is the binary tree. Below we will go step by step from binary trees to balanced binary trees, then to B-trees, and finally to B+ trees, to understand the underlying principles of database indexes. 1. Binary trees (binary search trees). A binary tree is a tree structure in which each node has at most two subtrees, usually called the left subtree and the right subtree. Binary trees are often used to implement binary search trees and binary heaps. A binary tree has the following properties: 1. Each node contains one element and n subtrees, where 0 ≤ n ≤ 2. 2. The left and right subtrees are ordered and may not be swapped arbitrarily: values in the left subtree are smaller than the parent node, and values in the right subtree are larger than the parent node. Concepts alone are a bit dry, so suppose we have the numbers [35 27 48 12 29 38 55] and insert them, one by one, into such a tree structure; the steps are as follows. There, that is a binary tree! We can see that, after a series of insertions, an originally unordered set of numbers has become an ordered structure
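
The insertion steps the article walks through can be sketched in a few lines of Python (my own illustration, not the article's code): inserting [35, 27, 48, 12, 29, 38, 55] one by one and then traversing the tree in order recovers the sorted sequence.

```python
class Node:
    def __init__(self, key):
        self.key, self.left, self.right = key, None, None

def insert(root, key):
    """Insert key into a binary search tree: smaller keys go left, larger go right."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def in_order(node):
    """In-order traversal yields the keys in sorted order."""
    return in_order(node.left) + [node.key] + in_order(node.right) if node else []

root = None
for k in [35, 27, 48, 12, 29, 38, 55]:
    root = insert(root, k)
print(in_order(root))   # [12, 27, 29, 35, 38, 48, 55]
```

If the keys arrived already sorted, this tree would degenerate into a linked list, which is why the article then moves on to balanced trees, B-trees, and B+ trees.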

Why does CouchDB use an append-only B+ tree and not a HAMT

老子叫甜甜 submitted on 2019-12-03 14:46:39
I'm reading up on data structures, especially immutable ones like the append-only B+ tree used in CouchDB and the hash array mapped trie (HAMT) used in Clojure and some other functional programming languages. The main reason data structures that work well in memory might not work well on disk appears to be the time spent on disk seeks due to fragmentation, as with a normal binary tree. However, a HAMT is also very shallow, so it doesn't require any more seeks than a B-tree. Another suggested reason is that deletions from an array mapped trie are more expensive than from a B-tree. This is based on the assumption

Berkeleydb - B-Tree versus Hash Table

你说的曾经没有我的故事 submitted on 2019-12-03 13:45:18
I am trying to understand what should drive the choice of access method when using BerkeleyDB: B-Tree versus HashTable. A hash table provides O(1) lookup, but inserts are expensive (using linear/extensible hashing we get amortized O(1) inserts). B-Trees, on the other hand, provide O(log N) (base B) lookup and insert times. A B-Tree can also support range queries and allows access in sorted order. Apart from these considerations, what else should be factored in? If I don't need to support range queries, can I just use the Hashtable access method? When your data sets get very large, B-trees are still better
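
To make the trade-off concrete, here is a small sketch using the bsddb3 Python binding for Berkeley DB (the binding, file name, and keys are my own assumptions; the question itself is language-agnostic). Swapping db.DB_BTREE for db.DB_HASH keeps the point operations working but gives up the ordered cursor scan.

```python
from bsddb3 import db   # third-party Python binding for Oracle Berkeley DB

d = db.DB()
d.open('example.db', None, db.DB_BTREE, db.DB_CREATE)   # access method chosen here; DB_HASH also allows point lookups

for k in (b'0010', b'0004', b'0023', b'0001', b'0090'):
    d.put(k, b'value-' + k)          # point inserts work for both access methods

print(d.get(b'0010'))                 # point lookup: works for B-tree and hash

# Range scan: only meaningful with DB_BTREE, which keeps keys in sorted order.
cur = d.cursor()
rec = cur.set_range(b'0004')          # position at the first key >= b'0004'
while rec is not None and rec[0] <= b'0023':
    print(rec)
    rec = cur.next()
cur.close()
d.close()
```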

How btree is stored on disc?

大兔子大兔子 submitted on 2019-12-03 07:14:16
Question: I know how to implement a B-tree in memory, but I am not clear about how to store a B-tree on disk. I think there are two major differences: conversion between memory pointers and disk addresses (see this post), and how to split a page when inserting a new k/v item. It is very easy to implement in memory. Thanks. Answer 1: It all depends on the DBMS you use. If you want to know how it is implemented in MS SQL Server, the things to read about are: pages (I guess they exist in almost all modern DBMSs) - in SQL Server they are 8 KB.
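
One common answer to the pointer-conversion part is to give every node a fixed-size page and store child "pointers" as page numbers; multiplying a page number by the page size yields the disk offset. Here is a minimal Python sketch of that layout (page size, record format, and field widths are my own assumptions, and a real B-tree internal node would hold n keys plus n+1 child pointers):

```python
import struct

PAGE_SIZE = 4096                    # assumed page size; SQL Server, for instance, uses 8 KB pages
HEADER = struct.Struct('<B H')      # is_leaf flag, number of entries
ENTRY = struct.Struct('<q I')       # 64-bit key, 32-bit child page number (0 = no child)

def write_node(f, page_no, is_leaf, entries):
    """Serialize one node into its fixed-size page; 'pointers' are page numbers, not addresses."""
    buf = bytearray(PAGE_SIZE)
    HEADER.pack_into(buf, 0, int(is_leaf), len(entries))
    for i, (key, child_page) in enumerate(entries):
        ENTRY.pack_into(buf, HEADER.size + i * ENTRY.size, key, child_page)
    f.seek(page_no * PAGE_SIZE)     # page number -> byte offset is the whole address conversion
    f.write(buf)

def read_node(f, page_no):
    """Read a page back and decode the node stored in it."""
    f.seek(page_no * PAGE_SIZE)
    buf = f.read(PAGE_SIZE)
    is_leaf, n = HEADER.unpack_from(buf, 0)
    return bool(is_leaf), [ENTRY.unpack_from(buf, HEADER.size + i * ENTRY.size) for i in range(n)]

with open('btree.pages', 'w+b') as f:
    # Simplified: each entry pairs a key with one child page number.
    write_node(f, 0, False, [(50, 1), (100, 2)])    # root on page 0, children on pages 1 and 2
    write_node(f, 1, True, [(10, 0), (20, 0), (35, 0)])
    print(read_node(f, 0))
```

Page splits then amount to allocating a fresh page number for the new sibling and rewriting the parent's entry list, which is why databases keep nodes at exactly one page each.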

C/C++: How to store data in a file in B tree

六月ゝ 毕业季﹏ submitted on 2019-12-03 04:06:42
It appears to me that one way of storing a B-tree's data in a file can be done efficiently in C by using a binary file containing a sequence (array) of structs, with each struct representing a node. One can thus connect the individual nodes with an approach similar to building linked lists out of arrays. But then the problem that crops up is deletion of a node, as erasing only a few bytes in the middle of a huge file is not possible. One way of deleting could be to keep track of 'empty' nodes until a threshold cutoff is reached and then make another file that discards the empty
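
The 'empty node' bookkeeping described in the question is essentially a free list: deleted slots are chained together and handed back to later inserts, so no bytes ever need to be erased from the middle of the file. Below is a small sketch of the idea, written in Python rather than C for brevity; the record layout and file name are my own assumptions, and a real implementation would also persist the free-list head in a file header.

```python
import struct

REC = struct.Struct('<q i')     # payload key, next-free index (-1 = end of free list)
REC_SIZE = REC.size

class NodeFile:
    """Fixed-size records in a flat file; deletions push slots onto a free list for reuse."""
    def __init__(self, path):
        self.f = open(path, 'w+b')
        self.free_head = -1                     # index of the first free slot, -1 if none
        self.count = 0

    def _write(self, idx, key, nxt):
        self.f.seek(idx * REC_SIZE)
        self.f.write(REC.pack(key, nxt))

    def insert(self, key):
        if self.free_head != -1:                # reuse a previously deleted slot
            idx = self.free_head
            self.f.seek(idx * REC_SIZE)
            _, self.free_head = REC.unpack(self.f.read(REC_SIZE))
        else:                                   # otherwise append at the end of the file
            idx, self.count = self.count, self.count + 1
        self._write(idx, key, -1)
        return idx

    def delete(self, idx):
        self._write(idx, 0, self.free_head)     # chain the slot into the free list
        self.free_head = idx

nf = NodeFile('nodes.bin')
a = nf.insert(35)
b = nf.insert(27)
nf.delete(a)
print(nf.insert(48) == a)                       # True: the freed slot is reused, not erased
nf.f.close()
```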