I've currently got a spreadsheet-type program that keeps its data in an ArrayList of HashMaps. You'll no doubt be shocked when I tell you that this hasn't proven ideal.
"Some columns will have a lot of repeated values"
immediately suggests to me the possible use of the Flyweight pattern, regardless of the solution you choose for your collections.
Assuming all your rows have most of the same columns, you can just use an array for each row, and a single shared Map<ColumnKey, Integer> to look up which array index each column refers to. This way you have only 4-8 bytes of overhead per cell.
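A minimal sketch of that layout, assuming String column keys in place of ColumnKey. The class name RowStore and its methods are hypothetical, made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: one Object[] per row plus a single shared column index.
class RowStore {
    // Shared by all rows: column name -> position in each row's array.
    private final Map<String, Integer> columnIndex = new HashMap<>();
    private final List<Object[]> rows = new ArrayList<>();

    RowStore(String... columnNames) {
        for (String name : columnNames) {
            // putIfAbsent keeps the first index if a name is repeated.
            columnIndex.putIfAbsent(name, columnIndex.size());
        }
    }

    // Appends an empty row and returns its row number.
    int addRow() {
        rows.add(new Object[columnIndex.size()]);
        return rows.size() - 1;
    }

    void set(int row, String column, Object value) {
        rows.get(row)[indexOf(column)] = value;
    }

    Object get(int row, String column) {
        return rows.get(row)[indexOf(column)];
    }

    private int indexOf(String column) {
        Integer i = columnIndex.get(column);
        if (i == null) {
            throw new IllegalArgumentException("Unknown column: " + column);
        }
        return i;
    }
}
```

The per-row cost is one array of references instead of a whole HashMap with its entry objects, and the map lookup is paid once per access against a single shared map.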
If Strings are often repeated, you could use a String pool to reduce duplication of strings. Object pools for other immutable types may be useful in reducing memory consumed.
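As a rough illustration, a pool can be as simple as a map from each value to its first-seen instance. ValuePool is a hypothetical name, and this sketch holds strong references, so pooled values are never released:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an object pool for immutable values (Strings, boxed numbers, ...).
class ValuePool<T> {
    private final Map<T, T> canonical = new HashMap<>();

    // Returns the first instance seen that equals() the given value.
    T intern(T value) {
        if (value == null) {
            return null;
        }
        T existing = canonical.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    // Number of distinct values currently pooled.
    int size() {
        return canonical.size();
    }
}
```

Passing each cell value through intern() before storing it means equal values share one object. For Strings specifically, the built-in String.intern() does something similar using the JVM's own pool.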
EDIT: You can structure your data as either row-based or column-based. If it's row-based (one array of cells per row), adding or removing a row is just a matter of adding or removing that row's array. If it's column-based, you can have one array per column. This can make handling primitive types much more efficient, i.e. you can have one column which is int[] and another which is double[]. It's much more common for an entire column to have the same data type than for a whole row to have the same data type.
However, whichever way you structure the data, it will be optimised for either row or column modification, and performing an add/remove of the other type will result in a rebuild of the entire dataset.
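To make the column-based option concrete, here is a sketch of a single int column backed by a primitive array. IntColumn is a hypothetical name; a real table would hold one such object per column, with a double[]-backed variant for floating-point columns:

```java
import java.util.Arrays;

// Hypothetical sketch: a growable column of primitive ints (no boxing per cell).
class IntColumn {
    private int[] values = new int[16];
    private int size;

    // Appends a value for a new row, doubling the backing array when full.
    void add(int v) {
        if (size == values.length) {
            values = Arrays.copyOf(values, size * 2);
        }
        values[size++] = v;
    }

    int get(int row) {
        checkRow(row);
        return values[row];
    }

    int size() {
        return size;
    }

    // Removing a row shifts the tail of the array; in a column-based
    // table every column has to do this, which is the cost described above.
    void removeRow(int row) {
        checkRow(row);
        System.arraycopy(values, row + 1, values, row, size - row - 1);
        size--;
    }

    private void checkRow(int row) {
        if (row < 0 || row >= size) {
            throw new IndexOutOfBoundsException("row " + row);
        }
    }
}
```

Dropping a whole column is cheap here (discard one object), while removing a row touches every column.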
(Something I do is have row-based data and add columns to the end. If a row isn't long enough, the column is assumed to have its default value, which avoids a rebuild when adding a column. Rather than removing a column, I have a means of ignoring it.)
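One way this trick might look, as a sketch rather than my actual code. LazyRows and its method names are hypothetical, and columns are addressed by position to keep it short:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: rows may be shorter than the column count;
// missing trailing cells read as the column's default value.
class LazyRows {
    private final List<Object[]> rows = new ArrayList<>();
    private final List<Object> defaults = new ArrayList<>(); // one default per column
    private final BitSet ignored = new BitSet();             // columns "removed" without a rebuild

    // Adds a column at the end; existing rows are not touched.
    int addColumn(Object defaultValue) {
        defaults.add(defaultValue);
        return defaults.size() - 1;
    }

    // Marks a column as ignored instead of physically removing it.
    void ignoreColumn(int col) {
        ignored.set(col);
    }

    boolean isVisible(int col) {
        return col >= 0 && col < defaults.size() && !ignored.get(col);
    }

    int addRow(Object... cells) {
        rows.add(cells.clone());
        return rows.size() - 1;
    }

    Object get(int row, int col) {
        if (col < 0 || col >= defaults.size()) {
            throw new IndexOutOfBoundsException("column " + col);
        }
        Object[] cells = rows.get(row);
        // A row created before this column existed falls back to the default.
        return col < cells.length ? cells[col] : defaults.get(col);
    }
}
```

Code that iterates over columns would check isVisible() and skip ignored ones; the stale cells stay in memory until the data is next rebuilt.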
Why don't you try using a cache implementation like Ehcache?
This turned out to be very effective for me, when I hit the same situation.
You can simply store your collection within the Ehcache implementation.
There are configuration options such as the maximum number of bytes to be used from the local heap.
Once the bytes used by your application exceed the limit configured for the cache, the cache implementation takes care of writing the data to disk. You can also configure the amount of time after which objects are written to disk, using a Least Recently Used algorithm.
You can be sure of avoiding out-of-memory errors by using this type of cache implementation.
It only increases the IO operations of your application by a small degree.
This is just a bird's-eye view of the configuration. There are a lot of configuration options you can tune to your requirements.
Guava does include a Table interface and a hash-based implementation. It seems like a natural fit for your problem. Note that this is still marked as beta.