Comparing 2 very large arraylists in java

六月ゝ 毕业季﹏ 提交于 2019-12-11 06:27:30

问题


What would be the correct approach when you need to compare 2 very large arraylists with each other?

These arraylist are both 100,000 items in size and will definitely crash when simply comparing item per item.

for (CItem c : cItems) {
        for (CItem r : rItems) {
            if (c.getID().equals(r.getID())) {
                Mismatch m = compareItems(c, r);
                if (m != null) {
                    mismatches.add(m);
                }
            }
        }
    }

Now I'm not 100% sure how the garbage collection works in this situation but the errors we get are:

java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Arrays.java:3664) ~[na:1.8.0_73]
at java.lang.String.<init>(String.java:207) ~[na:1.8.0_73]
at java.lang.StringBuilder.toString(StringBuilder.java:407) ~[na:1.8.0_73]

and

java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOf(Arrays.java:3181) ~[na:1.8.0_73]
at java.util.ArrayList.grow(ArrayList.java:261) ~[na:1.8.0_73]
at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235) ~[na:1.8.0_73]
at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227) ~[na:1.8.0_73]
at java.util.ArrayList.add(ArrayList.java:458) ~[na:1.8.0_73]

So far the possible solutions are

  • Split each list into a maximum of x items and compare these multiple lists (kind of elaborate)
  • Create a new database and query every item (which would be very slow and not feasable right now)
  • Buy 200 gb of ram

Any input on this matter would be appreciated.


回答1:


If the IDs in any item-list are unique, you can use a Map for your rItems with the ID as key.

Map<Long, CItem> rItemMap = new HashMap<>(rItems.size());
for (CItem r : rItems) {
    rItemMap.put(r.getID(), r);
}

Now you can check directly for rItems with same ID:

for (CItem c : cItems) {
    CItem r = rItemMap.get(c.getID());
    if (r != null) {
        Mismatch m = compareItems(c, r);
        if (m != null) {
            mismatches.add(m);
        }
    }
}

Even if the IDs are not unique, you could still work with a Map, you just would have a List of all items with that ID as the value of one Map.Entry and you'd only have to iterate over those few items instead of iterating over the whole list.

Edit regarding OutOfMemory

I just saw from your Exception, that you're using ArrayList. Using LinkedList instead might help, because the ArrayList is based on a (fixed size) array and when that array is filled up, a new - larger - array is allocated and the data from the old array is copied to the new array and then freed.

So if you have an array of size 1000 and it is full, a new array of e.g. size 2000 is allocated. At that moment, memory for 3000 items is required (although the 1000 are freed shortly after).

A LinkedList just allocates memory for every item you add to it (plus memory to point to the next and previous element).




回答2:


It looks you want to see if 2 objects with the same ID are the same when compared an other way.

Probable Problem here is that you check 100.000 x 100.000 objects against each other. To make it worse, you just add the ones to a new list...

Option 1) You did not tell how you created the ArrayList()s. If you get the objects from a Database you might just query that. (those are good at that, even if you aren't)

Option 2) Add the 2 ArrayList()s together, they seem to be the same kind of objects. Make the objects sortable (maybe by ID), sort the single list. (creates an other problem) Then use a loop to compare the now sorted Objects to their neighbour.




回答3:


You can use the method removeAll in the collection interface :)

rItems.removeAll(cItems);

if you look inside of the implementtion, the method compares using equals aswell...

This approach would let you obtain the items from each list which do not match to the other one.




回答4:


Sort the 2 lists and then compare them in order. Sorting costs O(n log n) and compare costs O(n).

Comparator<CItem> idComparator = new Comparator<CItem>() {
    @Override
    public int compare(CItem i1, CItem i2) {
        // Implementation depends on the type of CItem ID:
        // if ID is an integer or double, maybe you need
        // return i1.getID() - i2.getID();
        return i1.getID().compareTo(i2.getID());
    }
});

Collections.sort(cItems, idComparator);
Collections.sort(rItems, idComparator);

int minLen = Math.min(cItems.size(), rItems.size());
for (int i = 0, j = 0; i < minLen && j < minLen; ) {

    CItem c = cItems.get(i);
    CItem r = rItems.get(j);

    // c.getID().equals(r.getID())
    if (idComparator.compare(c, r) == 0) {
        Mismatch m = compareItems(c, r);
        if (m != null) {
            mismatches.add(m);
        }
        i++;
        j++;

    // item c's ID does not exist in list rItems
    } else if (idComparator.compare(c, r) < 0) {
        i++;

    // item r's ID does not exist in list cItems
    } else {
        j++;
    }
}



回答5:


I have the same problem. So I tried to use LinkedList. So I have 2 Linkedlist that can contain up to 3.5millions of string records. And then I am running

LinkedList diflist= (LinkedList) ListUtils.subtract(sourceList, targetList);

to get the difference,but my application stacks on this.

So are there any good algorithms to compare lists?



来源:https://stackoverflow.com/questions/41608074/comparing-2-very-large-arraylists-in-java

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!