Java - Custom Hash Map/Table Some Points

Submitted by 荒凉一梦 on 2019-12-01 04:16:36

Question


In some previous posts I asked several questions about coding a custom hash map/table in Java. Since I could not solve the problem, and may not have stated clearly what I really want, I am summarizing all of it here to make it clear and precise.

What I am going to do:

I am trying to code for our server in which I have to find users access type by URL.

Now, I have roughly 1110 million URLs.

So, what we did:

1) Divided the database into 10 parts, each of 110 million URLs.

2) Built a HashMap using parallel arrays, whose keys are one part of the URL (represented as a long) and whose values are the other part (represented as an int); a key can have multiple values.

3) Then, each day when the system starts, search the HashMap for other URLs (millions of URLs saved in one day).
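The parallel-array scheme in steps 1-3 can be sketched as an open-addressed table. Class and method names below are illustrative, not from the post; duplicate keys simply occupy extra probe slots, which is one way a single key can map to several values:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a parallel-array hash table: long keys, int values,
// open addressing with linear probing. A key may appear in several
// slots, so get() collects every matching slot along the probe chain.
public class LongIntMultiMap {
    private static final long EMPTY = 0L;   // assumes 0 is never a real key
    private final long[] keys;
    private final int[] values;
    private final int mask;

    public LongIntMultiMap(int capacityPow2) { // capacity must be a power of two
        keys = new long[capacityPow2];
        values = new int[capacityPow2];
        mask = capacityPow2 - 1;
    }

    private int index(long key) {
        long h = key * 0x9E3779B97F4A7C15L; // cheap 64-bit mixing
        return (int) (h ^ (h >>> 32)) & mask;
    }

    public void put(long key, int value) {
        int i = index(key);
        while (keys[i] != EMPTY) {          // probe until a free slot
            i = (i + 1) & mask;
        }
        keys[i] = key;
        values[i] = value;
    }

    public List<Integer> get(long key) {
        List<Integer> out = new ArrayList<>();
        int i = index(key);
        while (keys[i] != EMPTY) {          // scan the whole probe chain
            if (keys[i] == key) out.add(values[i]);
            i = (i + 1) & mask;
        }
        return out;
    }

    public static void main(String[] args) {
        LongIntMultiMap map = new LongIntMultiMap(1 << 20);
        map.put(123456789L, 7);
        map.put(123456789L, 8);             // same key, second value
        map.put(987654321L, 9);
        System.out.println(map.get(123456789L)); // [7, 8]
        System.out.println(map.get(555L));       // []
    }
}
```

Two flat primitive arrays like this avoid per-entry object headers, which is why they fit far more entries in RAM than a boxed `HashMap<Long, Integer>`.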

What I have tried:

1) I have tried several NoSQL databases, but we found them not well suited to our purpose.

2) I have built our own custom hash map (using two parallel arrays) for that purpose.

What the issue is:

When the system starts, we have to load the hash table for each database and search it for millions of URLs.

Now, the issue is:

1) Though the hash table's search performance is quite good, the code takes a long time to load it (we use a FileChannel and a memory-mapped buffer, and loading a 220-million-entry hash table takes 20 seconds; with a load factor of 0.5, we found this the fastest approach).

So we are spending: (hash table load + hash table search) × no. of DBs = (20 + 5) × 10 = 250 seconds. That is quite expensive for us, and most of the time (200 of the 250 seconds) goes to loading the hash tables.
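The load path described above (FileChannel plus memory-mapped buffer) can be sketched roughly as follows; the raw little-endian file layout and the method names are assumptions for illustration. The point of the bulk `get` is to avoid a per-entry read loop:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of the bulk-load step: map the key file once and copy it
// into an on-heap long[] with a single bulk get.
public class MappedLoader {
    public static long[] loadKeys(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            buf.order(ByteOrder.LITTLE_ENDIAN);
            long[] keys = new long[(int) (ch.size() / Long.BYTES)];
            buf.asLongBuffer().get(keys);   // one bulk copy, no per-entry loop
            return keys;
        }
    }
}
```

A 220-million-entry key array (about 1.76 GB) stays under the 2 GB single-mapping limit of `FileChannel.map`, so one mapping per file suffices here.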

Have I thought of any other way:

One way can be:

Stop worrying about loading and storing altogether, and leave caching to the operating system by using a memory-mapped buffer. But since I have to search for millions of keys, this gives worse performance than the above.

As we found that the hash table's search performance is good but its loading time is high, we thought of cutting the load time another way:

1) Create an array of linked lists of size Integer.MAX_VALUE (my own custom linked list).

2) Insert the values (ints) into the linked list whose index is the key (we reduce the key size to an int).

3) Then we only have to store the linked lists to disk.

Now, the issue is that creating that many linked lists takes a lot of time, and creating so many linked lists is pointless if the data is not well distributed.
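Most of that creation time comes from allocating every bucket eagerly. One possible variant (a sketch with illustrative names; growable int arrays stand in for the custom linked lists) allocates a bucket only on first insert, so unused indices cost a single null reference:

```java
import java.util.Arrays;

// Sketch: lazily-allocated buckets indexed directly by the int key.
// A bucket is created the first time its key is inserted; empty
// indices cost only a null slot in the outer array.
public class LazyBuckets {
    private final int[][] buckets;  // null until a key's first insert
    private final int[] sizes;      // number of values stored per bucket

    public LazyBuckets(int capacity) {
        buckets = new int[capacity][]; // no per-bucket allocation yet
        sizes = new int[capacity];
    }

    public void add(int key, int value) {
        if (buckets[key] == null) {
            buckets[key] = new int[2];         // keys hold 2-3 values at most
        } else if (sizes[key] == buckets[key].length) {
            buckets[key] = Arrays.copyOf(buckets[key], sizes[key] * 2);
        }
        buckets[key][sizes[key]++] = value;
    }

    public int[] get(int key) {
        return buckets[key] == null
                ? new int[0]
                : Arrays.copyOf(buckets[key], sizes[key]);
    }
}
```

This does not remove the cost of the outer array itself (an `Integer.MAX_VALUE`-sized reference array alone needs about 16 GB), so the index range would still have to be split across the 10 database parts.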

What my requirements are:

Simply, my requirements:

1) Insertion and lookup of keys with multiple values, with good search performance.

2) A fast way to load the table into memory (especially).

(Keys are 64-bit ints and values are 32-bit ints; one key can have at most 2-3 values. We could also make our keys 32-bit, which would produce more collisions, but that is acceptable to us if it makes things better overall.)

Can anyone help me solve this, or offer any comment on how to approach the issue?

Thanks.

NB:

1) As per previous suggestions on Stack Overflow, pre-reading the data to warm the disk cache is not possible, because our application starts working as soon as the system starts, and the system is restarted every day.

2) We have not found that NoSQL databases scale well for us, even though our requirements are simple (just inserting hash table key-value pairs, then loading and searching/retrieving values).

3) As our application is part of a small project to be deployed on a small campus, I don't think anybody will buy me an SSD for it. That is my limitation.

4) We also tried Guava and Trove, but they cannot hold this much data even in 16 GB (we are using a 32 GB Ubuntu server).


Answer 1:


If you need quick access to 1110 million data items, then hashing is the way to go. But don't reinvent the wheel; use something like:

  • memcacheDB: http://memcachedb.org
  • MongoDB: http://www.mongodb.org
  • Cassandra: http://cassandra.apache.org



Answer 2:


It seems to me (if I understand your problem correctly) that you are approaching the problem in a convoluted manner.
I mean, the data you are trying to pre-load is huge to begin with (say 220 million × 64 ≈ 14 GB), and you are trying to memory-map it, etc.
I think this is a typical problem that is solved by distributing the load across different machines. That is, instead of trying to locate a linked-list index, you should figure out which machine holds the relevant part of the map and get the value from that machine (each machine loads part of the database map, and each time you fetch the data from the machine holding the appropriate part).
Maybe I am way off here, but I also suspect you are using a 32-bit machine.
So if you have to stay with a single-machine architecture and it is not economically possible to improve your hardware (a 64-bit machine with more RAM, or an SSD, as you point out), I don't think you can make any dramatic improvement.




Answer 3:


I don't really understand in what form you are storing the data on disk. If what you are storing consists of URLs and some numbers, you may be able to speed up loading from disk quite a bit by compressing the data (unless you are already doing that).

Creating a multithreaded loader that decompresses while loading might give you quite a big boost.
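This suggestion can be sketched as follows, with in-memory `byte[]` chunks standing in for the on-disk table segments; the segment-per-chunk scheme and the class name are assumptions, not from the answer:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: store each table segment gzip-compressed and decompress
// the segments on a thread pool, overlapping I/O and inflation.
public class ParallelDecompressLoader {
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    // Decompress every chunk on a worker thread; result order is preserved.
    public static List<byte[]> loadAll(List<byte[]> chunks, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (byte[] c : chunks) {
                futures.add(pool.submit(() ->
                    new GZIPInputStream(new ByteArrayInputStream(c))
                        .readAllBytes()));
            }
            List<byte[]> out = new ArrayList<>();
            for (Future<byte[]> f : futures) out.add(f.get());
            return out;
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this helps depends on whether the load is disk-bound (compression shrinks the bytes read) or CPU-bound (extra threads hide the inflation cost); for URL-derived data, both effects usually point the same way.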



Source: https://stackoverflow.com/questions/11765517/java-custom-hash-map-table-some-points
