Memory Efficiency and Performance of String.Replace .NET Framework

Posted by …衆ロ難τιáo~ on 2019-12-17 05:01:17

Question


 string str1 = "12345ABC...\\...ABC100000"; 
 // Hypothetically huge string of 100000 + Unicode Chars
 str1 = str1.Replace("1", string.Empty);
 str1 = str1.Replace("22", string.Empty);
 str1 = str1.Replace("656", string.Empty);
 str1 = str1.Replace("77ABC", string.Empty);

 // ... this replace anti-pattern might continue for up to 50 consecutive lines of code.

 str1 = str1.Replace("ABCDEFGHIJD", string.Empty);

I have inherited some code that does the same as the snippet above. It takes a huge string and replaces (removes) constant smaller strings from the large string.

I believe this is a very memory intensive process given that new large immutable strings are being allocated in memory for each replace, awaiting death via the GC.

1. What is the fastest way of replacing these values, ignoring memory concerns?

2. What is the most memory efficient way of achieving the same result?

I am hoping that these are the same answer!

Practical solutions that fit somewhere in between these goals are also appreciated.

Assumptions:

  • All replacements are constant and known in advance
  • Underlying characters do contain some Unicode [non-ASCII] chars

Answer 1:


All characters in a .NET string are "unicode chars". Do you mean they're non-ascii? That shouldn't make any odds - unless you run into composition issues, e.g. an "e + acute accent" not being replaced when you try to replace an "e acute".
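To make the composition issue concrete, here is a minimal sketch. It assumes that normalizing both strings to form C before replacing is acceptable for your data. The class and method names (NormalizationDemo, RemoveNormalized) are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: brings source and target to the same
    // normalization form (NFC) before doing an ordinal replace.
    public static string RemoveNormalized(string source, string target)
    {
        string s = source.Normalize(NormalizationForm.FormC);
        string t = target.Normalize(NormalizationForm.FormC);
        return s.Replace(t, string.Empty);
    }

    public static void Main()
    {
        string decomposed = "cafe\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
        string precomposed = "\u00E9";    // single code point U+00E9 (e acute)

        // string.Replace compares code units ordinally, so nothing is removed here.
        Console.WriteLine(decomposed.Replace(precomposed, string.Empty).Length); // 5

        // After normalization both sides use U+00E9 and the replace succeeds.
        Console.WriteLine(RemoveNormalized(decomposed, precomposed)); // caf
    }
}
```

Normalizing a 100000-character string allocates another copy, so this is only worth doing if the input can really contain decomposed sequences.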

You could try using a regular expression with Regex.Replace, or StringBuilder.Replace. Here's sample code doing the same thing with both:

using System;
using System.Text;
using System.Text.RegularExpressions;

class Test
{
    static void Main(string[] args)
    {
        string original = "abcdefghijkl";

        Regex regex = new Regex("a|c|e|g|i|k", RegexOptions.Compiled);

        string removedByRegex = regex.Replace(original, "");
        string removedByStringBuilder = new StringBuilder(original)
            .Replace("a", "")
            .Replace("c", "")
            .Replace("e", "")
            .Replace("g", "")
            .Replace("i", "")
            .Replace("k", "")
            .ToString();

        Console.WriteLine(removedByRegex);
        Console.WriteLine(removedByStringBuilder);
    }
}

I wouldn't like to guess which is more efficient - you'd have to benchmark with your specific application. The regex way may be able to do it all in one pass, but that pass will be relatively CPU-intensive compared with each of the many replaces in StringBuilder.
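For a longer list of constants, the alternation can be built from the list instead of typed by hand. The following is a sketch under assumptions, not a tuned implementation: RegexRemover and Build are hypothetical names, and the tokens are taken from the question. Regex.Escape keeps characters such as "." or "$" literal. Note that a single pass is not always equivalent to a chain of Replace calls, because a chained removal can create a new match that a later Replace then removes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexRemover
{
    // Hypothetical helper: one alternation pattern from literal tokens.
    public static Regex Build(params string[] tokens)
    {
        // Longest tokens first, so a longer token wins over a shorter prefix of it.
        string pattern = string.Join("|",
            tokens.OrderByDescending(t => t.Length).Select(Regex.Escape));
        return new Regex(pattern, RegexOptions.Compiled | RegexOptions.CultureInvariant);
    }

    public static void Main()
    {
        Regex remover = Build("1", "22", "656", "77ABC");
        Console.WriteLine(remover.Replace("12345ABC", string.Empty)); // 2345ABC
    }
}
```

Build the Regex once and reuse it; constructing a compiled regex is far more expensive than running it.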




Answer 2:


If you want to be really fast, and I mean really fast, you'll have to look beyond StringBuilder and write well-optimized code yourself.

One thing your computer doesn't like to do is branching. If you can write a replace method that operates on a fixed array (char *) and doesn't branch, you get great performance.

The replace operation searches for a sequence of characters and, if it finds such a substring, replaces it. In effect you copy the string and perform the find and replace while copying.

You'll rely on functions that pick which buffer index to read from or write to. The goal is to perform the replace such that, when nothing has to change, you write junk instead of branching.

You should be able to complete this without a single if statement. Remember to use unsafe code; otherwise you'll pay for bounds checking on every element access.

unsafe
{
    fixed( char * p = myStringBuffer )
    {
        // Do fancy string manipulation here
    }
}

I've written code like this in C# for fun and seen significant performance improvements, almost a 300% speed-up for find and replace. While the .NET BCL (base class library) performs quite well, it is riddled with branching constructs and exception handling, which will slow down your code if you use the built-in methods. Also, these optimizations, while perfectly sound, are not performed by the JIT compiler, and you'll have to run the code as a release build without any debugger attached to observe the massive performance gain.

I could provide you with more complete code but it is a substantial amount of work. However, I can guarantee you that it will be faster than anything else suggested so far.
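The "write junk instead of branching" idea can be sketched for the simplest case, removing single characters. This is an illustration under assumptions and not the answerer's code: BranchlessFilter and RemoveChars are hypothetical names, it handles only single-character removals (not substrings), and it uses safe code, so the runtime's array bounds checks remain where the answer would use unsafe and fixed.

```csharp
using System;

public static class BranchlessFilter
{
    // Hypothetical helper: removes single characters with no if
    // statement inside the copy loop.
    public static string RemoveChars(string source, params char[] toRemove)
    {
        // keep[c] is 1 for characters that stay and 0 for characters that go.
        var keep = new byte[char.MaxValue + 1];
        for (int i = 0; i < keep.Length; i++) keep[i] = 1;
        foreach (char c in toRemove) keep[c] = 0;

        var buffer = new char[source.Length];
        int written = 0;
        foreach (char c in source)
        {
            buffer[written] = c;  // always write; a dropped char is junk that gets overwritten
            written += keep[c];   // advance only when the char is kept
        }
        return new string(buffer, 0, written);
    }

    public static void Main()
    {
        Console.WriteLine(RemoveChars("abcdefghijkl", 'a', 'c', 'e', 'g', 'i', 'k')); // bdfhjl
    }
}
```

The 64K lookup table is rebuilt on every call here to keep the sketch short; with constant removals it would be built once and cached. Whether this beats the built-in methods is something to measure rather than assume.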




Answer 3:


StringBuilder: http://msdn.microsoft.com/en-us/library/2839d5h5.aspx

The performance of the Replace operation itself should be roughly the same as string.Replace, and according to Microsoft no garbage should be produced.




Answer 4:


Here's a quick benchmark...

        Stopwatch s = new Stopwatch();
        s.Start();
        string replace = source;
        replace = replace.Replace("$TS$", tsValue);
        replace = replace.Replace("$DOC$", docValue);
        s.Stop();

        Console.WriteLine("String.Replace:\t\t" + s.ElapsedMilliseconds);

        s.Reset();

        s.Start();
        StringBuilder sb = new StringBuilder(source);
        sb = sb.Replace("$TS$", tsValue);
        sb = sb.Replace("$DOC$", docValue);
        string output = sb.ToString();
        s.Stop();

        Console.WriteLine("StringBuilder.Replace:\t\t" + s.ElapsedMilliseconds);

I didn't see much difference on my machine (string.Replace took 85 ms and StringBuilder.Replace took 80 ms), and that was against about 8 MB of text in "source".




Answer 5:


1. What is the fastest way of replacing these values, ignoring memory concerns?

The fastest way is to build a custom component that's specific to your use case. As of .NET 4.6, there's no class in the BCL designed for multiple string replacements.

If you NEED something fast out of the BCL, StringBuilder is the fastest BCL component for simple string replacement. Its source code can be found in the .NET reference source, and it's pretty efficient for replacing a single string. Only use Regex if you really need the pattern-matching power of regular expressions. It's slower and a little more cumbersome, even when compiled.

2. What is the most memory efficient way of achieving the same result?

The most memory-efficient way is to perform a filtered stream copy from the source to the destination (explained below). Memory consumption will be limited to your buffer; however, this will be more CPU intensive. As a rule of thumb, you're going to trade CPU performance for memory consumption.

Technical Details

String replacements are tricky. Even when performing a string replacement in a mutable memory space (such as with StringBuilder), it's expensive. If the replacement string has a different length than the original string, you're going to be relocating every character following the replacement to keep the whole string contiguous. This results in a LOT of memory writes and, even in the case of StringBuilder, causes you to rewrite most of the string in memory on every call to Replace.

So what is the fastest way to do string replacements? Write the new string using a single-pass: Don't let your code go back and have to re-write anything. Writes are more expensive than reads. You're going to have to code this yourself for best results.
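As a rough illustration of the single-pass idea for the removals in the question, the sketch below scans the source once and appends only the characters that survive. SinglePassRemover and Remove are hypothetical names, not part of the BCL. At each position the first listed token that matches wins, so the result can differ from a chain of Replace calls when one removal would create a new match.

```csharp
using System;
using System.Text;

public static class SinglePassRemover
{
    // Hypothetical helper: removes every listed token in one pass.
    public static string Remove(string source, params string[] tokens)
    {
        // The output is never longer than the input, so the builder never grows.
        var result = new StringBuilder(source.Length);
        int i = 0;
        while (i < source.Length)
        {
            int skip = 0;
            foreach (string t in tokens)
            {
                if (t.Length > 0 && i + t.Length <= source.Length &&
                    string.CompareOrdinal(source, i, t, 0, t.Length) == 0)
                {
                    skip = t.Length; // token found at position i
                    break;
                }
            }
            if (skip > 0) i += skip;          // drop the token
            else result.Append(source[i++]);  // keep one character
        }
        return result.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Remove("12345ABC", "1", "22", "656", "77ABC")); // 2345ABC
    }
}
```

This checks every token at every position, which is fine for a handful of tokens. With 50 tokens, a trie or an Aho-Corasick automaton would be the natural next step.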

High-Memory Solution

The class I've written generates strings based on templates. I place tokens ($ReplaceMe$) in a template which marks places where I want to insert a string later. I use it in cases where XmlWriter is too onerous for XML that's largely static and repetitive, and I need to produce large XML (or JSON) data streams.

The class works by slicing the template up into parts and placing each part into a numbered dictionary. Parameters are also enumerated. The order in which the parts and parameters are inserted into a new string is stored in an integer array. When a new string is generated, the parts and parameters are picked from the dictionary and used to create the new string.

It's neither fully-optimized nor is it bulletproof, but it works great for generating very large data streams from templates.

Low-Memory Solution

You'll need to read small chunks from the source string into a buffer, search the buffer using an optimized search algorithm, and then write the new string to the destination stream / string. There are a lot of potential caveats here, but it would be memory efficient and a better solution for source data that's dynamic and can't be cached, such as whole-page translations or source data that's too large to reasonably cache. I don't have a sample solution for this handy.
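A minimal sketch of such a filtered copy follows, assuming the replacements are pure removals as in the question. StreamingRemover and Copy are hypothetical names. It keeps only a look-ahead window as long as the longest token, and it uses a plain linear check rather than an optimized search algorithm.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamingRemover
{
    // Hypothetical helper: copies reader to writer, dropping every token.
    // Memory use is bounded by the longest token, not by the input size.
    public static void Copy(TextReader reader, TextWriter writer, params string[] tokens)
    {
        int maxLen = 1;
        foreach (string t in tokens)
            if (!string.IsNullOrEmpty(t)) maxLen = Math.Max(maxLen, t.Length);

        var window = new StringBuilder(maxLen); // pending look-ahead characters
        bool eof = false;
        while (true)
        {
            // Refill the window so the longest token can be tested.
            while (!eof && window.Length < maxLen)
            {
                int c = reader.Read();
                if (c < 0) eof = true; else window.Append((char)c);
            }
            if (window.Length == 0) break;

            int matched = MatchAtStart(window, tokens);
            if (matched > 0)
            {
                window.Remove(0, matched); // drop the token
            }
            else
            {
                writer.Write(window[0]);   // emit one character
                window.Remove(0, 1);
            }
        }
    }

    // Returns the length of the first token found at the start of the window, or 0.
    private static int MatchAtStart(StringBuilder window, string[] tokens)
    {
        foreach (string t in tokens)
        {
            if (string.IsNullOrEmpty(t) || t.Length > window.Length) continue;
            int i = 0;
            while (i < t.Length && window[i] == t[i]) i++;
            if (i == t.Length) return t.Length;
        }
        return 0;
    }

    public static void Main()
    {
        using (var reader = new StringReader("12345ABC"))
        using (var writer = new StringWriter())
        {
            Copy(reader, writer, "1", "22", "ABC");
            Console.WriteLine(writer.ToString()); // 2345
        }
    }
}
```

In real use the reader would be a StreamReader over a file or network stream and the writer a StreamWriter, so neither the source nor the result ever has to exist as one large string.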

Sample Code

Desired Results

<DataTable source='Users'>
  <Rows>
    <Row id='25' name='Administrator' />
    <Row id='29' name='Robert' />
    <Row id='55' name='Amanda' />
  </Rows>
</DataTable>

Template

<DataTable source='$TableName$'>
  <Rows>
    <Row id='$0$' name='$1$'/>
  </Rows>
</DataTable>

Test Case

class Program
{
  static string[,] _users =
  {
    { "25", "Administrator" },
    { "29", "Robert" },
    { "55", "Amanda" },
  };

  static StringTemplate _documentTemplate = new StringTemplate(@"<DataTable source='$TableName$'><Rows>$Rows$</Rows></DataTable>");
  static StringTemplate _rowTemplate = new StringTemplate(@"<Row id='$0$' name='$1$' />");
  static void Main(string[] args)
  {
    _documentTemplate.SetParameter("TableName", "Users");
    _documentTemplate.SetParameter("Rows", GenerateRows);

    Console.WriteLine(_documentTemplate.GenerateString(4096));
    Console.ReadLine();
  }

  private static void GenerateRows(StreamWriter writer)
  {
    for (int i = 0; i <= _users.GetUpperBound(0); i++)
      _rowTemplate.GenerateString(writer, _users[i, 0], _users[i, 1]);
  }
}

StringTemplate Source

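using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public class StringTemplate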
{
  private string _template;
  private string[] _parts;
  private int[] _tokens;
  private string[] _parameters;
  private Dictionary<string, int> _parameterIndices;
  private string[] _replaceGraph;
  private Action<StreamWriter>[] _callbackGraph;
  private bool[] _graphTypeIsReplace;

  public string[] Parameters
  {
    get { return _parameters; }
  }

  public StringTemplate(string template)
  {
    _template = template;
    Prepare();
  }

  public void SetParameter(string name, string replacement)
  {
    int index = _parameterIndices[name] + _parts.Length;
    _replaceGraph[index] = replacement;
    _graphTypeIsReplace[index] = true;
  }

  public void SetParameter(string name, Action<StreamWriter> callback)
  {
    int index = _parameterIndices[name] + _parts.Length;
    _callbackGraph[index] = callback;
    _graphTypeIsReplace[index] = false;
  }

  private static Regex _parser = new Regex(@"\$(\w{1,64})\$", RegexOptions.Compiled);
  private void Prepare()
  {
    _parameterIndices = new Dictionary<string, int>(64);
    List<string> parts = new List<string>(64);
    List<object> tokens = new List<object>(64);
    int param_index = 0;
    int part_start = 0;

    foreach (Match match in _parser.Matches(_template))
    {
      if (match.Index > part_start)
      {
        //Add Part
        tokens.Add(parts.Count);
        parts.Add(_template.Substring(part_start, match.Index - part_start));
      }


      //Add Parameter
      var param = _template.Substring(match.Index + 1, match.Length - 2);
      if (!_parameterIndices.TryGetValue(param, out param_index))
        _parameterIndices[param] = param_index = _parameterIndices.Count;
      tokens.Add(param);

      part_start = match.Index + match.Length;
    }

    //Add last part, if it exists.
    if (part_start < _template.Length)
    {
      tokens.Add(parts.Count);
      parts.Add(_template.Substring(part_start, _template.Length - part_start));
    }

    //Set State
    _parts = parts.ToArray();
    _tokens = new int[tokens.Count];

    int index = 0;
    foreach (var token in tokens)
    {
      var parameter = token as string;
      if (parameter == null)
        _tokens[index++] = (int)token;
      else
        _tokens[index++] = _parameterIndices[parameter] + _parts.Length;
    }

    _parameters = _parameterIndices.Keys.ToArray();
    int graphlen = _parts.Length + _parameters.Length;
    _callbackGraph = new Action<StreamWriter>[graphlen];
    _replaceGraph = new string[graphlen];
    _graphTypeIsReplace = new bool[graphlen];

    for (int i = 0; i < _parts.Length; i++)
    {
      _graphTypeIsReplace[i] = true;
      _replaceGraph[i] = _parts[i];
    }
  }

  public void GenerateString(Stream output)
  {
    var writer = new StreamWriter(output);
    GenerateString(writer);
    writer.Flush();
  }

  public void GenerateString(StreamWriter writer)
  {
    //Resolve graph
    foreach(var token in _tokens)
    {
      if (_graphTypeIsReplace[token])
        writer.Write(_replaceGraph[token]);
      else
        _callbackGraph[token](writer);
    }
  }

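  public void SetReplacements(params string[] parameters)
  {
    // Numeric parameter names ($0$, $1$, ...) select the positional
    // argument with that index; named parameters are skipped here.
    int index;
    for (int i = 0; i < _parameters.Length; i++)
    {
      if (!Int32.TryParse(_parameters[i], out index))
        continue;
      else
        SetParameter(_parameters[i], parameters[index]);
    }
  }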

  public string GenerateString(int bufferSize = 1024)
  {
    using (var ms = new MemoryStream(bufferSize))
    {
      GenerateString(ms);
      ms.Position = 0;
      using (var reader = new StreamReader(ms))
        return reader.ReadToEnd();
    }
  }

  public string GenerateString(params string[] parameters)
  {
    SetReplacements(parameters);
    return GenerateString();
  }

  public void GenerateString(StreamWriter writer, params string[] parameters)
  {
    SetReplacements(parameters);
    GenerateString(writer);
  }
}



Answer 6:


StringBuilder sb = new StringBuilder("Hello string");
sb.Replace("string", String.Empty);
Console.WriteLine(sb);  

StringBuilder, a mutable string.




Answer 7:


Here is my benchmark:

using System;
using System.Diagnostics;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

internal static class MeasureTime
{
    internal static TimeSpan Run(Action func, uint count = 1)
    {
        if (count <= 0)
        {
            throw new ArgumentOutOfRangeException("count", "Must be greater than zero");
        }

        long[] arr_time = new long[count];
        Stopwatch sw = new Stopwatch();
        for (uint i = 0; i < count; i++)
        {
            sw.Start();
            func();
            sw.Stop();
            arr_time[i] = sw.ElapsedTicks;
            sw.Reset();
        }

        return new TimeSpan(count == 1 ? arr_time.Sum() : Convert.ToInt64(Math.Round(arr_time.Sum() / (double)count)));
    }
}

public class Program
{
    public static string RandomString(int length)
    {
        Random random = new Random();
        const string chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
        return new String(Enumerable.Range(1, length).Select(_ => chars[random.Next(chars.Length)]).ToArray());
    }

    public static void Main()
    {
        string rnd_str = RandomString(500000);
        Regex regex = new Regex("a|c|e|g|i|k", RegexOptions.Compiled);
        TimeSpan ts1 = MeasureTime.Run(() => regex.Replace(rnd_str, "!!!"), 10);
        Console.WriteLine("Regex time: {0:hh\\:mm\\:ss\\:fff}", ts1);

        StringBuilder sb_str = new StringBuilder(rnd_str);
        TimeSpan ts2 = MeasureTime.Run(() => sb_str.Replace("a", "").Replace("c", "").Replace("e", "").Replace("g", "").Replace("i", "").Replace("k", ""), 10);
        Console.WriteLine("StringBuilder time: {0:hh\\:mm\\:ss\\:fff}", ts2);

        TimeSpan ts3 = MeasureTime.Run(() => rnd_str.Replace("a", "").Replace("c", "").Replace("e", "").Replace("g", "").Replace("i", "").Replace("k", ""), 10);
        Console.WriteLine("String time: {0:hh\\:mm\\:ss\\:fff}", ts3);

        char[] ch_arr = {'a', 'c', 'e', 'g', 'i', 'k'};
        TimeSpan ts4 = MeasureTime.Run(() => new String((from c in rnd_str where !ch_arr.Contains(c) select c).ToArray()), 10);
        Console.WriteLine("LINQ time: {0:hh\\:mm\\:ss\\:fff}", ts4);
    }

}

Regex time: 00:00:00:008

StringBuilder time: 00:00:00:015

String time: 00:00:00:005

LINQ can't process rnd_str (Fatal Error: Memory usage limit was exceeded)

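String.Replace is the fastest in this run. Bear in mind that RandomString produces only uppercase letters and digits, so none of the lowercase patterns ever match and the timings measure scanning rather than replacing.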




Answer 8:


If you want a built-in class in .NET, I think StringBuilder is the best. To do it manually, you can use unsafe code with char* and iterate through your string, replacing based on your criteria.




Answer 9:


Since you have multiple replaces on one string, I would recommend using Regex over StringBuilder.



Source: https://stackoverflow.com/questions/399798/memory-efficiency-and-performance-of-string-replace-net-framework
