Deduplicate string instances

只愿长相守 提交于 2021-01-28 22:08:28

问题


I have array of nearly 1,000,000 records, each record has a field "filename".

There are many records with exactly the same filename.

My goal is to improve memory footprint by deduplicating string instances (filename instances, not records).

.NET Framework 2.0 is a constraint. no LINQ here.

I wrote a generic (and thread-safe) class for the deduplication:

public class Deduplication<T>
    where T : class
{
    private static Deduplication<T> _global = new Deduplication<T>();

    public static Deduplication<T> Global
    {
        get { return _global; }
    }

    private Dictionary<T, T> _dic;// = new Dictionary<T, T>();
    private object _dicLocker = new object();

    public T GetInstance(T instance)
    {
        lock (_dicLocker)
        {
            if (_dic == null)
            {
                _dic = new Dictionary<T, T>();
            }

            T savedInstance;
            if (_dic.TryGetValue(instance, out savedInstance))
            {
                return savedInstance;
            }
            else
            {
                _dic.Add(instance, instance);
                return instance;
            }
        }
    }

    public void Clear()
    {
        lock (_dicLocker)
        {
            _dic = null;
        }
    }
}

The problem with this class is that it adds a lot of more memory usage, and it stays there until the next GC.

I searching for a way to reduce the memory footprint without adding a lot of more memory usage and without waiting for the next GC. Also i do not want to use GC.Collect() because it freezes the GUI for a couple of seconds.


回答1:


If you do not want to intern your strings. You could take a similar approach to Java 8's string deduplication (which is done by the GC on the heap).

  1. Get the hash values of the strings as they are added.
  2. If the hash does not exist, associate it with the string.
  3. If the hash does exist, compare the strings with the same hash character by character.
  4. If your comparison matches, store a reference to the original string instead of a new copy.

This would reduce your memory footprint assuming you have a lot of duplicates, but interning would probably perform a lot better as it is done at a lower level right on the heap.




回答2:


You could stick all strings in a prefix tree. Depending on how different you path names are this should automatically deduplicate common substrings. A quick search on google yielded in this C# implementation.




回答3:


I'd recommend you double-check that your memory footprint isn't already optimized. .NET automatically interns duplicate strings on the heap, which means you can have multiple identical String objects point to the same memory address. Just call String.Intern(targetString). That's why Strings are immutable, and StringBuilder exists.

More immediately, if you're having trouble with the leftover strings on the heap, you can kick off a garbage collection right after you finish (or even periodically during the run):

GC.Collect();



来源:https://stackoverflow.com/questions/18775932/deduplicate-string-instances

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!