Storing the result of Minhash

徘徊边缘 提交于 2019-12-11 11:54:19

问题


The result is a fixed number of arrays, let's say lists (all of the same length) in python.

One could see it as a matrix too, so in c I would use an array, where every cell would point to another array. How to do it in Python?

A list where every item is a list or something else?

I thought of a dictionary, but the keys are trivial, 1, 2, ..., M, so I am not sure if that is the pythonic way to go here.

I am not interested in the implementation, I am interested in which approach I should follow, in which choice I should made!


回答1:


Whatever container you choose, it should contain hash-itemID pairs, and should be indexed or sorted by the hash. Unsorted arrays will not be remotely efficient.

Assuming you're using a decent sized hash and your various hash algorithms are well-implemented, you should be able to just as effectively store all minhashes in a single container, since the chance of collision between a minhash from one algorithm and a minhash from another is negligible, and if any such collision occurs it won't substantially alter the similarity measure.

Using a single container as opposed to multiple reduces the memory overhead for indexing, though it also slightly increases the amount of processing required. As memory is usually the limiting factor for minhash, a single container may be preferable.




回答2:


You can store anything you want in a python list: ints, strings, more lists lists, dicts, objects, functions - you name it.

anything_goes_in_here = [1, 'one', lambda one: one / 1, {1: 'one'}, [1, 1]]

So storing a list of lists is pretty straight forward:

>>> list_1 = [1, 2, 3, 4]
>>> list_2 = [5, 6, 7, 8]
>>> list_3 = [9, 10, 11, 12]
>>> list_4 = [13, 14, 15, 16]
>>> main_list = [list_1, list_2, list_3, list_4]
>>> for list in main_list:
...     for num in list:
...             print num
...
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16

If you are looking to store a list of lists where the index is meaningful (meaning the index gives you some information about the data stored there), then this is basically reimplementing a hashmap (dictionary), and while you say it's trivial - using a dictionary sounds like it fits the problem well here.



来源:https://stackoverflow.com/questions/37062078/storing-the-result-of-minhash

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!