The best way to store and access 120,000 words in Java

Submitted by 天涯浪子 on 2019-12-13 12:27:10

Question


I'm programming a Java application that reads strictly text files (.txt). These files can contain upwards of 120,000 words.

The application needs to store all 120,000+ words. It needs to name them word_1, word_2, etc., and it also needs to access these words to perform various methods on them.

The methods all have to do with Strings. For instance, a method will be called to say how many letters are in word_80. Another method will be called to say what specific letters are in word_2200.

In addition, some methods will compare two words. For instance, a method will be called to compare word_80 with word_2200 and needs to return which has more letters. Another method will be called to compare word_80 with word_2200 and needs to return what specific letters both words share.

My question is: Since I'm working almost exclusively with Strings, is it best to store these words in one large ArrayList? Several small ArrayLists? Or should I be using one of the many other storage possibilities, like Vectors, HashSets, LinkedLists?

My two primary concerns are 1.) access speed, and 2.) having the greatest possible number of pre-built methods at my disposal.

Thank you for your help in advance!!


Wow! Thanks everybody for providing such a quick response to my question. All your suggestions have helped me immensely. I’m thinking through and considering all the options provided in your feedback.

Please forgive me for any fuzziness; and let me address your questions:

  1. Q) English?
    A) The text files are actually books written in English. The occurrence of a word in a second language would be rare – but not impossible. I’d put the percentage of non-English words in the text files at .0001%

  2. Q) Homework?
    A) I’m smilingly looking at my question’s wording now. Yes, it does resemble a school assignment. But no, it’s not homework.

  3. Q) Duplicates?
    A) Yes. And probably every five or so words, considering conjunctions, articles, etc.

  4. Q) Access?
    A) Both random and sequential. It’s certainly possible a method will locate a word at random. It’s equally possible a method will want to look for a matching word between word_1 and word_120000 sequentially. Which leads to the last question…

  5. Q) Iterate over the whole list?
    A) Yes.

Also, I plan on growing this program to perform many other methods on the words. I apologize again for my fuzziness. (Details do make a world of difference, do they not?)

Cheers!


Answer 1:


I would store them in one large ArrayList and worry about (possibly unnecessary) optimisations later on.

Being inherently lazy, I don't think it's a good idea to optimise unless there's a demonstrated need. Otherwise, you're just wasting effort that could be better spent elsewhere.

In fact, if you can set an upper bound on your word count and you don't need any of the fancy List operations, I'd opt for a normal (native) array of String objects with an integer holding the number of words actually stored. This is likely to be faster than a class-based approach.

This gives you the greatest speed in accessing the individual elements whilst still retaining the ability to do all that wonderful string manipulation.

Note that I haven't benchmarked native arrays against ArrayLists. ArrayLists may be just as fast as native arrays, so you should check this yourself if you have less blind faith in my abilities than I do :-).

If they do turn out to be just as fast (or even close), the added benefits (expandability, for one) may be enough to justify their use.
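To make the single-ArrayList suggestion concrete, here is a minimal sketch of how the four operations from the question could sit on top of one list. The class name WordStore and its method names are made up for illustration, and it assumes that words are whitespace-separated tokens and that every character (including punctuation) counts as a "letter".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not from the original thread.
class WordStore {
    private final List<String> words = new ArrayList<>();

    /** Splits text on whitespace and appends each token in reading order. */
    void addText(String text) {
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
    }

    /** Returns word_n, using the question's 1-based numbering. */
    String word(int n) {
        return words.get(n - 1);
    }

    /** How many characters word_n has. */
    int letterCount(int n) {
        return word(n).length();
    }

    /** The distinct characters of word_n, in sorted order. */
    Set<Character> letters(int n) {
        Set<Character> result = new TreeSet<>();
        for (char c : word(n).toCharArray()) {
            result.add(c);
        }
        return result;
    }

    /** The number of the longer word; a tie goes to the first argument. */
    int longer(int a, int b) {
        return letterCount(a) >= letterCount(b) ? a : b;
    }

    /** The characters that word_a and word_b have in common. */
    Set<Character> sharedLetters(int a, int b) {
        Set<Character> shared = letters(a);
        shared.retainAll(letters(b));
        return shared;
    }
}
```

With the text "letter better fox" loaded, word(2) would be "better" and sharedLetters(1, 2) would contain e, r and t. Swapping the list for a String[] plus a counter would only change addText and word.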




Answer 2:


Just confirming pax's assumptions with a very naive benchmark:

import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Random;

public class AccessBenchmark
{
    public static void main(String[] args)
    {
        int size = 120000;
        String[] arr = new String[size];
        List<String> al = new ArrayList<String>(size);

        // Fill both containers with the same 120,000 strings.
        for (int i = 0; i < size; i++)
        {
            String put = Integer.toHexString(i);
            al.add(put);
            arr[i] = put;
        }

        // Naive timing: no warm-up, and the JIT is free to optimise
        // away reads whose result is never used.
        Random rand = new Random();
        Date start = new Date();
        for (int i = 0; i < 10000000; i++)
        {
            int get = rand.nextInt(size);
            String fetch = arr[get];
        }
        Date end = new Date();
        long diff = end.getTime() - start.getTime();
        System.out.println("array access took " + diff + " ms");

        start = new Date();
        for (int i = 0; i < 10000000; i++)
        {
            int get = rand.nextInt(size);
            String fetch = al.get(get);
        }
        end = new Date();
        diff = end.getTime() - start.getTime();
        System.out.println("array list access took " + diff + " ms");
    }
}

and the output:
array access took 578 ms
array list access took 907 ms

Running it a few times, the actual times vary somewhat, but array access is generally between 200 and 400 ms faster over 10,000,000 iterations.




Answer 3:


If you will access these Strings sequentially, the LinkedList would be the best choice.

For random access, ArrayLists offer a nice trade-off between memory usage and access speed.




Answer 4:


My take:

For a non-threaded program, an ArrayList is always fastest and simplest.

For a threaded program, a java.util.concurrent.ConcurrentHashMap<Integer,String> or java.util.concurrent.ConcurrentSkipListMap<Integer,String> is awesome. Perhaps you would later like to allow threads so as to make multiple queries against this huge thing simultaneously.
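As a rough sketch of the threaded case, the snippet below loads the words into a ConcurrentHashMap<Integer,String> keyed by word number and answers several length queries from a small thread pool. The class ConcurrentWords, its method names and the pool size of 4 are assumptions made for this example, not anything from the original post.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not from the original thread.
class ConcurrentWords {
    /** Maps word numbers (1-based, as in word_1, word_2, ...) to words. */
    static ConcurrentMap<Integer, String> index(List<String> words) {
        ConcurrentMap<Integer, String> map = new ConcurrentHashMap<>();
        for (int i = 0; i < words.size(); i++) {
            map.put(i + 1, words.get(i));
        }
        return map;
    }

    /** Looks up the length of each requested word on a worker thread. */
    static List<Integer> lengths(ConcurrentMap<Integer, String> words,
                                 List<Integer> numbers) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // arbitrary size
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (Integer n : numbers) {
                Callable<Integer> task = () -> words.get(n).length();
                pending.add(pool.submit(task));
            }
            List<Integer> result = new ArrayList<>();
            for (Future<Integer> future : pending) {
                result.add(future.get()); // results come back in request order
            }
            return result;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

If the words are never modified after loading, the concurrent map mainly pays off once other threads start adding or replacing entries while queries are running.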




Answer 5:


If you're going for fast traversal as well as compact size, use a DAWG (Directed Acyclic Word Graph). This data structure takes the idea of a trie and improves upon it by finding and factoring out common suffixes as well as common prefixes.

http://en.wikipedia.org/wiki/Directed_acyclic_word_graph




Answer 6:


Use a Hashtable? This will give you your best lookup speed.
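One way a hash-based structure could help here, sketched under the assumption that the lookup you care about is "where does this word occur?": map each distinct word to the list of positions at which it appears. The sketch uses HashMap rather than Hashtable, and the class name WordPositions is hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not from the original thread.
class WordPositions {
    /** Maps each distinct word to the 1-based positions where it occurs. */
    static Map<String, List<Integer>> index(List<String> words) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            positions.computeIfAbsent(words.get(i), key -> new ArrayList<>())
                     .add(i + 1);
        }
        return positions;
    }
}
```

For the words "the", "cat", "the", looking up "the" would return the positions 1 and 3 without scanning the whole list. Access by position (word_2200) still needs a list or array alongside this map.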




Answer 7:


ArrayList/Vector if order matters (it appears to, since you are calling the words "word_xxx"), or Hashtable/HashMap if it doesn't.

I'll leave the exercise of figuring out why you would want to use an ArrayList vs. a Vector or a Hashtable vs. a HashMap up to you, since I have a sneaking suspicion this is your homework. Check the Javadocs.

You're not going to get any methods that help you as you've asked for in the examples above from your Collections Framework class, since none of them do String comparison operations. Unless you just want to order them alphabetically or something, in which case you'd use one of the Tree implementations in the Collections framework.
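For the alphabetical-ordering case, a TreeSet is probably the simplest of those Tree implementations to try. The sketch below is only an illustration: the class name SortedWords is made up, and ignoring case when comparing words is an assumption you may not want.

```java
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper, not from the original thread.
class SortedWords {
    /** Returns the distinct words in alphabetical order, ignoring case. */
    static TreeSet<String> sortedDistinct(List<String> words) {
        TreeSet<String> sorted = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        sorted.addAll(words); // duplicates (by the comparator) are dropped
        return sorted;
    }
}
```

Note that this loses both the duplicates and the original word numbering, so it complements the ordered list rather than replacing it.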




Answer 8:


How about a radix tree or Patricia trie?

http://en.wikipedia.org/wiki/Radix_tree




Answer 9:


The only advantage of a linked list over an array or array list would be if there are insertions and deletions at arbitrary places. I don't think this is the case here: You read in the document and build the list in order.

I THINK that when the original poster talked about finding "word_2200", he meant simply the 2200th word in the document, and not that there are arbitrary labels associated with each word. If so, then all he needs is indexed access to all the words. Hence, an array or array list. If there really is something more complex, if one word might be labeled "word_2200" and the next word is labeled "foobar_42" or some such, then yes, he'd need a more complex structure.

Hey, do you want to give us a clue WHY you want to do any of this? I'm hard pressed to remember the last time I said to myself, "Hey, I wonder if the 1,237th word in this document I'm reading is longer or shorter than the 842nd word?"




Answer 10:


Depends on what the problem is - speed or memory.

If it's memory, the minimum solution is to write a function getWord(n) which scans the whole file each time it runs, and extracts word n.

Now - that's not a very good solution. A better solution is to decide how much memory you want to use: let's say 1000 items. Scan the file for words once when the app starts, and store a series of bookmarks containing the word number and the position in the file where it is located - do this in such a way that the bookmarks are more-or-less evenly spaced through the file.

Then, open the file for random access. The function getWord(n) now looks at the bookmarks to find the biggest word # <= n (please use a binary search), does a seek to get to the indicated location, and scans the file, counting the words, to find the requested word.
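The bookmark idea could look roughly like the sketch below. The class name BookmarkedWords and the interval parameter are made up for illustration. It assumes words are separated by ASCII whitespace in a UTF-8 (or ASCII) file, and it places a bookmark at every interval-th word. Because that spacing is fixed, the right bookmark can be found by integer division, so this version does not need the binary search; unevenly spaced bookmarks would.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original thread.
class BookmarkedWords {
    private final String path;
    private final int interval;
    // bookmarks.get(k) is the byte offset where word number k * interval + 1 starts
    private final List<Long> bookmarks = new ArrayList<>();
    private int wordCount = 0;

    /** Scans the file once, recording the start offset of every interval-th word. */
    BookmarkedWords(String path, int interval) throws IOException {
        this.path = path;
        this.interval = interval;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
            long pos = 0;
            boolean inWord = false;
            int b;
            while ((b = in.read()) != -1) {
                boolean space = isSpace(b);
                if (!space && !inWord) {      // a new word starts at pos
                    if (wordCount % interval == 0) {
                        bookmarks.add(pos);
                    }
                    wordCount++;
                }
                inWord = !space;
                pos++;
            }
        }
    }

    /** Returns word n (1-based): seek to the nearest bookmark, then scan forward. */
    String getWord(int n) throws IOException {
        if (n < 1 || n > wordCount) {
            throw new IndexOutOfBoundsException("word_" + n);
        }
        int index = n - 1;
        int toSkip = index % interval;        // words between the bookmark and word n
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            file.seek(bookmarks.get(index / interval));
            ByteArrayOutputStream word = new ByteArrayOutputStream();
            boolean inWord = false;
            int b;
            while ((b = file.read()) != -1) {
                boolean space = isSpace(b);
                if (space && inWord) {        // a word has just ended
                    if (toSkip == 0) {
                        break;                // it was the requested word
                    }
                    toSkip--;
                }
                if (!space && toSkip == 0) {
                    word.write(b);
                }
                inWord = !space;
            }
            return new String(word.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    private static boolean isSpace(int b) {
        return b == ' ' || b == '\n' || b == '\r' || b == '\t';
    }
}
```

Memory use is one long per bookmark, so a larger interval trades memory for longer forward scans. Reading byte by byte from a RandomAccessFile is slow; a real version would read in blocks and keep the file open between calls.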

An even quicker solution, using rather more memory, is to build some sort of cache for the blocks - on the basis that getWord() requests usually come through in clusters. You can rig things up so that if someone asks for word # X, and it's not in the bookmarks, then you seek for it and put it in the bookmarks, saving memory by consolidating whichever bookmark was least recently used.

And so on. It depends, really, on what the problem is - on what kind of patterns of retrieval are likely.




Answer 11:


I don't understand why so many people are suggesting ArrayList, or the like, since you don't mention ever having to iterate over the whole list. Further, it seems you want to access them as key/value pairs ("word_348"="pedantic").

For the fastest access, I would use a TreeMap, which finds your keys by binary search over a balanced tree (O(log n) per lookup). Its only downside is that it's unsynchronized, but that's not a problem for your application.

http://java.sun.com/javase/6/docs/api/java/util/TreeMap.html



Source: https://stackoverflow.com/questions/518936/the-best-way-to-store-and-access-120-000-words-in-java
