Question
I have a program that generates a large database of theoretical strings (each 104 characters long), and its output is measured in petabytes. I don't have that much computing power, so I would like to filter the low-complexity strings out of the database.
My grammar is a modified form of the English alphabet with no numerical characters. I read about Kolmogorov complexity and how it is theoretically impossible to calculate, but I just need something basic in C# using compression.
Using these two links
- How to measure complexity of a string?
- How to determine size of string, and compress it
I came up with this:
MemoryStream ms = new MemoryStream();
GZipStream gzip2 = new GZipStream(ms, CompressionMode.Compress, true);
byte[] raw = Encoding.UTF8.GetBytes(element);
gzip2.Write(raw, 0, raw.Length);
gzip2.Close();
byte[] zipped = ms.ToArray(); // as a BLOB
string smallstring = Convert.ToString(zipped); // as a string
// store zipped or base64
byte[] raw2 = Encoding.UTF8.GetBytes(smallstring);
int startsize = raw.Length;
int finishsize = raw2.Length;
double percent = Convert.ToDouble(finishsize) / Convert.ToDouble(startsize);
if (percent > .75)
{
    // output
}
My first element is:
HHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFH
and it compresses to a finishsize of 13, but this other character set
mlcllltlgvalvcgvpamdipqtkqdlelpklagtwhsmamatnnislmatlkaplrvhitsllptpednleivlhrwennscvekkvlgektenpkkfkinytvaneatlldtdydnflflclqdtttpiqsmmcqylarvlveddeimqgfirafrplprhlwylldlkqmeepcrf
also evaluates to 13. There is a bug, but I don't know how to fix it.
Answer 1:
Your bug is in the following part, where you convert the array into a string:
byte[] zipped = ms.ToArray(); // as a BLOB
string smallstring = Convert.ToString(zipped); // as a string
// store zipped or base64
byte[] raw2 = Encoding.UTF8.GetBytes(smallstring);
Calling Convert.ToString() on an array returns debugging output, in this case the string System.Byte[]. That string is exactly 13 characters long, which is why both of your inputs end up with a finishsize of 13.
You should compare the lengths of the uncompressed and compressed byte arrays directly:
int startsize = raw.Length;
int finishsize = zipped.Length;
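With that fix in place, the original snippet becomes something like this (a minimal sketch; element and the 0.75 threshold are taken from the question):
byte[] raw = Encoding.UTF8.GetBytes(element);
byte[] zipped;
using (MemoryStream ms = new MemoryStream())
{
    using (var gzip = new GZipStream(ms, CompressionMode.Compress, true))
    {
        gzip.Write(raw, 0, raw.Length);
    }
    zipped = ms.ToArray(); // compressed bytes; store as BLOB or base64
}
int startsize = raw.Length;     // uncompressed size in bytes
int finishsize = zipped.Length; // compressed size in bytes
double percent = (double)finishsize / startsize;
if (percent > .75)
{
    // the string barely compressed, so treat it as high complexity
}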
Answer 2:
Here is some code that I used:
/// <summary>
/// Defines an interface for calculating the relative
/// complexity of an input string
/// </summary>
public interface IStringComplexity
{
double GetCompressionRatio(string input);
double GetRelevantComplexity(double min, double max, double current);
}
And the class that implements it:
public class GZipStringComplexity : IStringComplexity
{
public double GetCompressionRatio(string input)
{
if (string.IsNullOrEmpty(input))
throw new ArgumentNullException();
byte[] inputBytes = Encoding.UTF8.GetBytes(input);
byte[] compressed;
using (MemoryStream outStream = new MemoryStream())
{
using (var zipStream = new GZipStream(
outStream, CompressionMode.Compress))
{
using (var memoryStream = new MemoryStream(inputBytes))
{
memoryStream.CopyTo(zipStream);
}
}
compressed = outStream.ToArray();
}
return (double)inputBytes.Length / compressed.Length;
}
/// <summary>
/// Returns relevant complexity of a string on a scale [0..1],
/// where <value>0</value> has very low complexity
/// and <value>1</value> has maximum complexity
/// </summary>
/// <param name="min">minimum compression ratio observed</param>
/// <param name="max">maximum compression ratio observed</param>
/// <param name="current">the value of compression ration
/// for which complexity is being calculated</param>
/// <returns>A relative complexity of a string</returns>
public double GetRelevantComplexity(double min, double max, double current)
{
return 1 - current / (max - min);
}
}
Here is how you can use it:
class Program
{
static void Main(string[] args)
{
IStringComplexity c = new GZipStringComplexity();
string input1 = "HHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFHAAAAHHHFHHFFHHFHHFFHHFHHHFH";
string input2 = "mlcllltlgvalvcgvpamdipqtkqdlelpklagtwhsmamatnnislmatlkaplrvhitsllptpednleivlhrwennscvekkvlgektenpkkfkinytvaneatlldtdydnflflclqdtttpiqsmmcqylarvlveddeimqgfirafrplprhlwylldlkqmeepcrf";
string inputMax = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
double ratio1 = c.GetCompressionRatio(input1); //2.9714285714285715
double ratio2 = c.GetCompressionRatio(input2); //1.3138686131386861
double ratioMax = c.GetCompressionRatio(inputMax); //7.5
double complexity1 = c.GetRelevantComplexity(1, ratioMax, ratio1); // ~ 0.54
double complexity2 = c.GetRelevantComplexity(1, ratioMax, ratio2); // ~ 0.80
}
}
Some additional info that I found helpful:
You can try LZMA, LZMA2 or PPMD from the 7zip library. They are relatively easy to set up, and provided you have an interface, you can implement several compression algorithms behind it (see the sketch below). I found that those algorithms compress much better than GZip, but once you put the compression ratios on a relative scale this doesn't really matter.
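As an illustration of swapping algorithms behind the same interface, here is a sketch of a variant that uses the built-in DeflateStream instead of GZip (the 7zip algorithms would need an external package, which is not shown here):
public class DeflateStringComplexity : IStringComplexity
{
    public double GetCompressionRatio(string input)
    {
        if (string.IsNullOrEmpty(input))
            throw new ArgumentNullException();
        byte[] inputBytes = Encoding.UTF8.GetBytes(input);
        byte[] compressed;
        using (MemoryStream outStream = new MemoryStream())
        {
            // Same pattern as the GZip version; only the stream type changes
            using (var deflateStream = new DeflateStream(outStream, CompressionMode.Compress))
            {
                deflateStream.Write(inputBytes, 0, inputBytes.Length);
            }
            compressed = outStream.ToArray();
        }
        return (double)inputBytes.Length / compressed.Length;
    }

    public double GetRelevantComplexity(double min, double max, double current)
    {
        return 1 - current / (max - min);
    }
}
Deflate omits the GZip header, so the raw ratios shift slightly, but as noted above, putting them on a relative scale absorbs that.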
If you need a normalised value, for example from 0 to 1, you need to calculate the compression ratio for all the sequences first, because you can't know in advance what the maximum possible compression ratio is.
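A sketch of that two-pass approach (GetSequences() is a hypothetical stand-in for the string generator; requires System.Linq):
IStringComplexity c = new GZipStringComplexity();
IEnumerable<string> sequences = GetSequences(); // hypothetical source of generated strings

// Pass 1: compute the compression ratio of every sequence
List<double> ratios = sequences.Select(s => c.GetCompressionRatio(s)).ToList();
double min = ratios.Min();
double max = ratios.Max();

// Pass 2: normalise each ratio against the observed range
List<double> complexities = ratios
    .Select(r => c.GetRelevantComplexity(min, max, r))
    .ToList();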
Answer 3:
Sure, that will work. As long as you're just comparing sizes, it doesn't really matter which compression algorithm you use. Your main concern is keeping an eye on the amount of processing power you spend compressing the strings.
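For instance, a quick Stopwatch check over a sample batch (a hypothetical harness; sampleBatch is not from the answer) shows whether the compression cost is acceptable before running the full database:
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
IStringComplexity c = new GZipStringComplexity();
foreach (string s in sampleBatch) // sampleBatch: hypothetical List<string> of generated strings
{
    c.GetCompressionRatio(s);
}
stopwatch.Stop();
Console.WriteLine($"Compressed {sampleBatch.Count} strings in {stopwatch.ElapsedMilliseconds} ms");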
Source: https://stackoverflow.com/questions/12115131/complexity-compression-string