Write a program that takes text as input and produces a program that reproduces that text

Question

I recently came across a nice problem that turned out to be as simple to state as it is hard to solve. The problem is:

Write a program that reads a text from input and prints another program as output. If we compile and run the printed program, it must output the original text.

The input text is supposed to be rather large (more than 10000 characters).

The only (and very strong) requirement is that the size of the archive (i.e. the program printed) must be strictly less than the size of the original text. This rules out obvious solutions like

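std::string s;
/* read the text into s */
// The emitted program is always longer than s; s would also need '"', '\\' and newlines escaped.
std::cout << "#include <iostream>\nint main() { std::cout << \"" << s << "\"; }";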

I believe some archiving techniques are to be used here.

Answer 1:

Unfortunately, such a program does not exist.

To see why this is so, we need to do a bit of math. First, let's count up how many binary strings there are of length n. Each of the bits can be either a 0 or 1, which gives us one of two choices for each of those bits. Since there are two choices per bit and n bits, there are thus a total of 2ⁿ binary strings of length n.

Now, let's suppose that we want to build a compression algorithm that always compresses a bitstring of length n into a bitstring of length less than n. In order for this to work, we need to count up how many different strings of length less than n there are. Well, this is given by the number of bitstrings of length 0, plus the number of bitstrings of length 1, plus the number of bitstrings of length 2, etc., all the way up to n - 1. This total is

2⁰ + 2¹ + 2² + ... + 2ⁿ⁻¹

Using a bit of math (this is a geometric series), we can get that this number is equal to 2ⁿ - 1. In other words, the total number of bitstrings of length less than n is one smaller than the number of bitstrings of length n.
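If you would rather check that sum than take it on faith, here is a minimal C++ sketch that adds the terms up one by one. The function names countOfLength and countShorterThan are made up for this illustration, and n is limited to values below 64 so that the counts fit in a 64-bit integer.

```cpp
#include <cassert>
#include <cstdint>

// Number of bitstrings of length exactly n: two choices for each of the n bits.
std::uint64_t countOfLength(unsigned n) {
    assert(n < 64);  // keep the shift inside 64 bits
    return std::uint64_t{1} << n;
}

// Number of bitstrings of length 0, 1, ..., n - 1, summed term by term.
std::uint64_t countShorterThan(unsigned n) {
    std::uint64_t total = 0;
    for (unsigned len = 0; len < n; ++len) {
        total += countOfLength(len);
    }
    return total;
}
```

For every n in range, countShorterThan(n) + 1 should come out equal to countOfLength(n). For n = 3, that is 7 shorter strings against 8 strings of length 3.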

But this is a problem. In order for us to have a lossless compression algorithm that always maps a string of length n to a string of length at most n - 1, we would have to have some way of associating every bitstring of length n with some shorter bitstring such that no two bitstrings of length n are associated with the same shorter bitstring. This way, we can compress the string by just mapping it to the associated shorter string, and we can decompress it by reversing the mapping. The restriction that no two bitstrings of length n map to the same shorter string is what makes this lossless - if two length-n bitstrings were to map to the same shorter bitstring, then when it came time to decompress the string, there wouldn't be a way to know which of the two original bitstrings we had compressed.

This is where we reach a problem. Since there are 2ⁿ different bitstrings of length n and only 2ⁿ - 1 shorter bitstrings, there is no possible way we can pair up each bitstring of length n with some shorter bitstring without assigning at least two length-n bitstrings to the same shorter string (this is the pigeonhole principle). This means that no matter how hard we try, no matter how clever we are, and no matter how creative we get with our compression algorithm, there is a hard mathematical limit that says that we can't always make the text shorter.
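To make the counting argument concrete, here is a small brute-force sketch in C++17. It runs a candidate compressor over every bitstring of length n (for small n) and reports the first pair of inputs that end up with the same output. The names toBits, findCollision and dropLastBit are invented for this example, and dropLastBit is only a toy stand-in for a real compressor.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Render the low n bits of value as a string of '0' and '1' characters.
std::string toBits(unsigned value, unsigned n) {
    std::string bits(n, '0');
    for (unsigned i = 0; i < n; ++i) {
        if (value & (1u << i)) bits[n - 1 - i] = '1';
    }
    return bits;
}

// Try a candidate compressor on every bitstring of length n.
// Returns two distinct inputs that produce the same output, or std::nullopt
// if all outputs are distinct (which cannot happen when every output is
// shorter than n bits).
std::optional<std::pair<std::string, std::string>>
findCollision(unsigned n,
              const std::function<std::string(const std::string&)>& compress) {
    assert(n < 20);  // all 2^n inputs are enumerated, so keep n small
    std::map<std::string, std::string> seen;  // output -> first input producing it
    for (unsigned value = 0; value < (1u << n); ++value) {
        const std::string input = toBits(value, n);
        const std::string output = compress(input);
        auto [it, inserted] = seen.emplace(output, input);
        if (!inserted) return std::make_pair(it->second, input);
    }
    return std::nullopt;
}

// Toy "compressor" (hypothetical): always shortens by dropping the last bit.
std::string dropLastBit(const std::string& bits) {
    return bits.substr(0, bits.size() - 1);
}
```

Calling findCollision(3, dropLastBit) finds a collision straight away, since "000" and "001" both shrink to "00". Any other compressor that always produces shorter output will hit a collision somewhere too; only the pair it trips on changes.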

So how does this map to your original problem? Well, if we get a string of text of length at least 10000 and need to output a shorter program that prints it, then we would have to have some way of mapping each of the 2¹⁰⁰⁰⁰ strings of length 10000 onto the 2¹⁰⁰⁰⁰ - 1 strings of length less than 10000. That mapping has some other properties, namely that we always have to produce a valid program, but that's irrelevant here - there simply aren't enough shorter strings to go around. As a result, the problem you want to solve is impossible.

That said, we might be able to get a program that can compress all but one of the strings of length 10000 to a shorter string. In fact, we might find a compression algorithm that does this, meaning that a string of length 10000 picked uniformly at random could be compressed with probability 1 - 2⁻¹⁰⁰⁰⁰. This is such a high probability that if we kept picking strings for the lifetime of the universe, we'd almost certainly never guess the One Bad String.

For further reading, there is a concept from information theory called Kolmogorov complexity, which is the length of the smallest program necessary to produce a given string. Some strings are easily compressed (for example, abababababababab), while others are not (for example, sdkjhdbvljkhwqe0235089). There exist strings that are called incompressible strings, for which the string cannot possibly be compressed into any smaller space. This means that any program that would print that string would have to be at least as long as the given string. For a good introduction to Kolmogorov Complexity, you may want to look at Chapter 6 of "Introduction to the Theory of Computation, Second Edition" by Michael Sipser, which has an excellent overview of some of the cooler results. For a more rigorous and in-depth look, consider reading "Elements of Information Theory," chapter 14.

Hope this helps!

Answer 2:

If we are talking about ASCII text...

I think this actually could be done, and I think the restriction that the text will be larger than 10000 chars is there for a reason (to give you coding room).

People here are saying that the string cannot be compressed, yet it can.

Why?

Requirement: OUTPUT THE ORIGINAL TEXT

Text is not data. When you read input text you read ASCII chars (bytes), and the byte range contains both printable and non-printable values.

Take this for example:

ASCII values    characters
0x00 .. 0x08    NUL, (other control codes)
0x09 .. 0x0D    (white-space control codes: '\t','\n','\v','\f','\r')
0x0E .. 0x1F    (other control codes)
... rest of printable characters

Since you have to print text as output, you are not interested in the ranges 0x00-0x08 and 0x0E-0x1F. You can compress the input bytes by using a different storing and retrieving mechanism (binary patterns), since you don't have to give back the original data but the original text. You can recalculate what the stored values mean and readjust them to bytes to print. You would effectively lose only data that was not text data anyway, and is therefore neither printable nor inputtable. If WinZip did that it would be a big failure, but for your stated requirements it simply does not matter.

Since the requirement states that the text is 10000 chars and you can drop 27 of the 256 byte values, a lossless packing saves space: around 10% by a crude count of the unused values, although the exact figure is lower, since the remaining 229 values still need about 7.84 bits each. That means that if you can code the 'decompression' within the space saved (1000 characters by the crude count), you can achieve that. You would have to treat groups of 10 bytes as 11 chars, and from there extrapolate the 11th by some extrapolation method for your range of 229. If that can be done then the problem is solvable.
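To put a number on how much room this really gives, here is a rough C++ sketch. It assumes that "text" bytes are 0x09..0x0D plus everything from 0x20 upward, which is the reading that yields 229 usable values. The helper names isTextByte, alphabetSize and packedSizeBytes are my own. The size it computes is only a lower bound: it ignores the decoder and the cost of writing the packed data as program source.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Bytes treated as "text" in this sketch (assumption): 0x09..0x0D and 0x20..0xFF.
bool isTextByte(unsigned char b) {
    return (b >= 0x09 && b <= 0x0D) || b >= 0x20;
}

// Count how many of the 256 byte values are text bytes.
int alphabetSize() {
    int count = 0;
    for (int b = 0; b < 256; ++b) {
        if (isTextByte(static_cast<unsigned char>(b))) ++count;
    }
    return count;
}

// Lower bound on the bytes needed to store `length` symbols drawn from an
// alphabet of `symbols` values: length * log2(symbols) bits, rounded up.
std::size_t packedSizeBytes(std::size_t length, int symbols) {
    assert(symbols >= 1 && symbols <= 256);
    const double bits =
        static_cast<double>(length) * std::log2(static_cast<double>(symbols));
    return static_cast<std::size_t>(std::ceil(bits / 8.0));
}
```

With these definitions alphabetSize() is 229, and packedSizeBytes(10000, 229) comes out at roughly 9800 bytes. So the decoder would have to fit in about 200 bytes rather than 1000.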

Nevertheless it requires clever thinking, and coding skills that can actually do that in 1 kilobyte.

Of course this is just a conceptual answer, not a functional one. I don't know if I could ever achieve this.

But I had the urge to give my 2 cents on this, since everybody seemed so sure that it cannot be done.

The real problem in your problem is understanding the problem and the requirements.

Answer 3:

What you are describing is essentially a program for creating self-extracting zip archives, with the small difference that a regular self-extracting zip archive would write the original data to a file rather than to stdout. If you want to make such a program yourself, there are plenty of implementations of compression algorithms, or you could implement e.g. DEFLATE (the algorithm used by gzip) yourself. The "outer" program must compress the input data and output the code for the decompression, and embed the compressed data into that code.

Pseudocode:

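// compress() and the body of the emitted decompress() are placeholders
string originalData((istreambuf_iterator<char>(cin)), istreambuf_iterator<char>()); // read all of stdin; cin >> s would stop at the first whitespace
vector<unsigned char> compressedData = compress(originalData);
cout << "#include<...>\nstring decompress(const unsigned char * data, size_t size) { ... }" << endl;
cout << "int main() { const unsigned char compressedData[] = {";
for (unsigned char byte : compressedData) cout << int(byte) << ','; // output the int values of the elements
cout << "}; cout << decompress(compressedData, sizeof compressedData); return 0; }" << endl;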
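The outline above can be turned into something compilable. The following is only a sketch under loud assumptions: it swaps DEFLATE for a trivial run-length encoding so that it stays self-contained, and the names rleCompress, rleDecompress and generateProgram are invented here. Because every compressed byte is written as decimal text, the generated program is only shorter than the input when the input is highly repetitive.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Stand-in compressor (assumption: simple run-length encoding, not DEFLATE).
// Output is a sequence of (count, byte) pairs with 1 <= count <= 255.
std::vector<unsigned char> rleCompress(const std::string& text) {
    std::vector<unsigned char> out;
    for (std::size_t i = 0; i < text.size();) {
        std::size_t run = 1;
        while (i + run < text.size() && text[i + run] == text[i] && run < 255) ++run;
        out.push_back(static_cast<unsigned char>(run));
        out.push_back(static_cast<unsigned char>(text[i]));
        i += run;
    }
    return out;
}

// Inverse of rleCompress; the generated program embeds the same loop.
std::string rleDecompress(const std::vector<unsigned char>& data) {
    assert(data.size() % 2 == 0);  // data must consist of whole (count, byte) pairs
    std::string text;
    for (std::size_t i = 0; i + 1 < data.size(); i += 2) {
        text.append(data[i], static_cast<char>(data[i + 1]));
    }
    return text;
}

// Build the source of a C++ program that prints `text` when compiled and run.
std::string generateProgram(const std::string& text) {
    std::ostringstream src;
    src << "#include <cstdio>\n"
        << "int main() {\n"
        << "  static const unsigned char d[] = {0,0";  // dummy pair keeps the array non-empty
    for (unsigned char byte : rleCompress(text)) src << ',' << static_cast<int>(byte);
    src << "};\n"
        << "  for (unsigned i = 0; i + 1 < sizeof d; i += 2)\n"
        << "    for (unsigned k = 0; k < d[i]; ++k) std::putchar(d[i + 1]);\n"
        << "  return 0;\n"
        << "}\n";
    return src.str();
}
```

A driver would read all of stdin into a string and print generateProgram(input). Replacing the run-length coder with a real DEFLATE implementation, and the decimal array with a denser encoding of the bytes, is where the actual savings would have to come from.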

Answer 4:

Assuming "character" means "byte" and assuming the input text may contain at least as many valid characters as the programming language, it's impossible to do this for all inputs, since, as templatetypedef explained, for any given length of input text all "strictly smaller" programs are themselves possible inputs with smaller length, which means there are more possible inputs than there can ever be outputs. (It's possible to arrange for the output to be at most one bit longer than the input by using an encoding scheme that starts with an "if this is 1, the following is just the unencoded input because it couldn't be compressed further" bit.)

Assuming it's sufficient to have this work for most inputs (e.g. inputs that consist mainly of ASCII characters and not the full range of possible byte values), then the answer readily exists: use gzip. That's what it's good at. Nothing is going to be much better. You can either create self-extracting archives, or treat the gzip format as the "language" output. In some circumstances you may be more efficient by having a complete programming language or executable as your output, but often, reducing the overhead by having a format designed for this problem, i.e. gzip, will be more efficient.
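The escape-flag scheme mentioned in parentheses above could look roughly like this in C++. It is a sketch, not a real archive format: it spends a whole flag byte instead of a single bit to keep the code simple, and the Codec alias and both function names are made up for the example. Any compressor and matching decompressor can be plugged in.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Any string-to-string transform; used for both compressor and decompressor.
using Codec = std::function<std::string(const std::string&)>;

// Prefix the payload with a one-byte flag: '1' = raw copy, '0' = compressed.
// The result is never more than one byte longer than the input.
std::string encodeWithFlag(const std::string& input, const Codec& compress) {
    const std::string packed = compress(input);
    if (packed.size() < input.size()) return '0' + packed;
    return '1' + input;  // compression did not help: store the input unchanged
}

// Undo encodeWithFlag: strip the flag and decompress only if it says so.
std::string decodeWithFlag(const std::string& encoded, const Codec& decompress) {
    assert(!encoded.empty());  // the flag byte is always present
    const std::string payload = encoded.substr(1);
    return encoded[0] == '1' ? payload : decompress(payload);
}
```

The point of the design is that incompressible inputs cost a fixed, tiny overhead instead of growing without bound, while compressible inputs still benefit.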

Answer 5:

It's called a file archiver producing self-extracting archives.

Source: https://stackoverflow.com/questions/6526181/write-a-program-that-takes-text-as-input-and-produces-a-program-that-reproduces

Tags

c++

algorithm

compression

data-compression

lossless-compression