Removing hidden characters from within strings

Submitted by 不羁的心 on 2019-11-27 13:57:18
Yannick Blondeau

You can remove all control characters from your input string with something like this:

// requires: using System.Linq;
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

Here is the documentation for the IsControl() method.

Or, if you want to keep letters and digits only, you can use the IsLetter and IsDigit methods instead:

string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());
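To compare the two filters side by side, here is a minimal sketch that wraps each one in a helper method. The class and method names (ControlCharFilter, RemoveControlChars, KeepLettersAndDigits) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Linq;

public static class ControlCharFilter
{
    // Drops every char for which char.IsControl returns true
    // (tab, line feed and carriage return are control characters too).
    public static string RemoveControlChars(string input) =>
        new string(input.Where(c => !char.IsControl(c)).ToArray());

    // Keeps only letters and digits; spaces and punctuation are dropped as well.
    public static string KeepLettersAndDigits(string input) =>
        new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());
}
```

Note that the second helper is much more aggressive: it also removes spaces and punctuation, so it only fits inputs such as identifiers or codes.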

Mubashar Ahmad

I usually use this regular expression to replace all non-printable characters.

By the way, most people consider tab, line feed, and carriage return to be non-printable characters, but I do not.

So here is the expression:

string output = Regex.Replace(input, @"[^\u0009\u000A\u000D\u0020-\u007E]", "*");
  • ^ at the start of the character class negates it: match any character that is not one of the following
  • \u0009 is tab
  • \u000A is line feed
  • \u000D is carriage return
  • \u0020-\u007E means everything from space to ~ -- that is, every printable ASCII character.

See an ASCII table if you want to make changes. Remember that this also replaces every non-ASCII character.
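As a rough sketch of how the expression behaves, the helper below applies the same pattern with a configurable replacement. The names NonPrintableMasker and Mask are hypothetical; pass an empty string as the replacement to remove the characters instead of masking them.

```csharp
using System.Text.RegularExpressions;

public static class NonPrintableMasker
{
    // Same character class as above: anything that is not tab, LF, CR
    // or printable ASCII (space through ~).
    private static readonly Regex NonPrintable =
        new Regex(@"[^\u0009\u000A\u000D\u0020-\u007E]");

    // Replaces each matching character with the given replacement text.
    public static string Mask(string input, string replacement = "*") =>
        NonPrintable.Replace(input, replacement);
}
```

For example, Mask("a\tb\u0001c\u00E9") keeps the tab but masks both the control character and the non-ASCII é.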

To test the above, you can build a string yourself like this:

    string input = string.Empty;

    for (int i = 0; i < 255; i++)
    {
        input += (char)(i);
    }

new string(input.Where(c => !char.IsControl(c)).ToArray());

IsControl misses some invisible characters such as the left-to-right mark (LRM, U+200E), which commonly hides in a string after copy and paste. It is a format character (Unicode category Cf), not a control character. If you are sure that your string contains only letters and digits, you can use IsLetterOrDigit:

new string(input.Where(c => char.IsLetterOrDigit(c)).ToArray())

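If your string also contains special characters (punctuation, for example), keep only the ASCII range instead. Note that this still keeps the ASCII control characters (0-31 and 127):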

new string(input.Where(c => c < 128).ToArray())
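If you need to keep non-ASCII text but still drop marks like LRM, one alternative (not from the answer above, just a sketch) is to filter on the Unicode category instead. The names InvisibleCharFilter and RemoveControlAndFormat are hypothetical.

```csharp
using System.Globalization;
using System.Linq;

public static class InvisibleCharFilter
{
    // Drops control characters (category Cc) and format characters
    // (category Cf, which includes the left-to-right mark U+200E).
    public static string RemoveControlAndFormat(string input) =>
        new string(input.Where(c =>
        {
            UnicodeCategory category = char.GetUnicodeCategory(c);
            return category != UnicodeCategory.Control
                && category != UnicodeCategory.Format;
        }).ToArray());
}
```

Be aware that the Format category also contains characters that some text relies on, such as the zero width joiner (U+200D) used in emoji sequences, so this is only suitable when losing those is acceptable.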

You can do this:

var hChars = new char[] {...};
var result = new string(yourString.Where(c => !hChars.Contains(c)).ToArray());
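A self-contained variant of the same idea might look like the sketch below. The list of characters is purely an illustrative assumption (the original leaves it elided), and the names ListedCharFilter and Remove are made up; a HashSet is used so that lookups stay cheap if the list grows.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ListedCharFilter
{
    // Hypothetical example list; replace it with the characters
    // you actually need to drop.
    private static readonly HashSet<char> Hidden = new HashSet<char>
    {
        '\u200B', // zero width space
        '\u200E', // left-to-right mark
        '\u200F', // right-to-left mark
        '\uFEFF', // zero width no-break space / byte order mark
    };

    // Keeps every character that is not in the list above.
    public static string Remove(string input) =>
        new string(input.Where(c => !Hidden.Contains(c)).ToArray());
}
```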

What worked best for me is:

string result = new string(value.Where(c => char.IsLetterOrDigit(c) || (c >= ' ' && c <= byte.MaxValue)).ToArray());

This keeps any letter or digit, so non-English letters are preserved. For everything else, it keeps only characters from space (U+0020) through byte.MaxValue (U+00FF), which drops the ASCII control characters below space while keeping punctuation. Note that this range still lets through DEL (U+007F) and the C1 control characters (U+0080 to U+009F).

Some suggest using IsControl to check whether a character is non-printable, but that misses the left-to-right mark, for example.

If you know what these characters are you can use string.Replace:

newString = oldString.Replace("?", "");

where "?" represents the character you want to strip out.

The drawback with this approach is that you need to make this call repeatedly if there are multiple characters that you want to remove.
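One way to soften that drawback is to loop over the unwanted strings, as in this minimal sketch. The names ReplaceStripper and Strip are hypothetical, and each pass still scans the whole string, so for many characters a single Where filter is likely cheaper.

```csharp
public static class ReplaceStripper
{
    // Calls string.Replace once per unwanted string.
    public static string Strip(string input, params string[] unwanted)
    {
        string result = input;
        foreach (string s in unwanted)
        {
            // string.Replace throws on an empty search string, so skip those.
            if (string.IsNullOrEmpty(s)) continue;
            result = result.Replace(s, "");
        }
        return result;
    }
}
```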

It has been a while, but this hasn't been answered yet.

How do you include the HTML content in the sending code? If you are reading it from a file, check the file encoding. If the file is saved as UTF-8 with signature (that is, with a byte order mark; the name varies slightly between editors), this may cause the weird character at the beginning of the mail.
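If the byte order mark has already ended up in the decoded string, it shows up as U+FEFF at the very start. A minimal sketch for stripping it, assuming the content is already held in a string (BomStripper and StripLeadingBom are made-up names):

```csharp
public static class BomStripper
{
    // Removes any U+FEFF characters from the start of the text;
    // the rest of the string is left untouched.
    public static string StripLeadingBom(string text) =>
        text.TrimStart('\uFEFF');
}
```

Fixing the file encoding, or the way the file is read, is usually the cleaner solution; this only patches the symptom.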

string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

This solved the problem for me. I had a non-printable substitute character (ASCII 26) in a string that was causing my app to break, and this line of code removed it.
