Specifically what does a compiler do to aggressively optimize generated bytecode?

Submitted by 泪湿孤枕 on 2019-12-10 16:48:19

Question


I have been reading up on the functionality of various compilers and I've come across the term "aggressive optimization" that many compilers are reported to perform. LLVM, for example, cites the following compile-time optimization features:

  • Memory/pointer specific
  • Loop transforms
  • Data flow
  • Arithmetic
  • Dead code elimination
  • Inlining

What does this mean, specifically? Say you had the following code snippet: how could you optimize the generated bytecode to run faster than what the compiler produced? I'm specifically interested in optimizing the bytecode of JIT-powered runtimes such as C#, Java, and Flash. This is tricky because the JIT only supports a subset of the opcodes that the processor does, which limits how much optimization you can do. Still, I'm interested to see what's possible and exactly which transformations could push the limits of the VM.

Fictitious block of code:

for (i = 0; i < 100; i++){
    in = dataIn[i];
    if ((in % 5) == 0){
        out = ((in / 2) >> 16) - 10;
    }else{
        out = ((in << 5) / 2) * 50 + 10;
    }
    dataOut[i] = out;
}

Approximate pseudocode generated by the compiler, for a stack-based JIT VM such as Flash Player (forgive me for any mistakes, this is entirely handwritten!):

// i = 0
label: "forInit"
   push 0
   writeTo "i"

// while i < 100
label: "forStart"
   push "i"
   push 100
   jumpIfGreaterOrEqual "forEnd"

       // in = dataIn[i];
       push "i"
       push "dataIn"
       readProp
       writeTo "in"

       // if ((in % 5) == 0)
       push "in"
       push 5
       mod
       push 0
       jumpIfNotEquals "ifPart2"
       label: ifPart1

           // out = ((in / 2) >> 16) - 10;
           push "in"
           push 2
           divide
           push 16
           rightshift
           push 10
           minus
           writeTo "out"
           goto "ifEnd"

       // else
       label: ifPart2

           // out = ((in << 5) / 2) * 50 + 10;
           push "in"
           push 5
           leftshift
           push 2
           divide
           push 50
           multiply
           push 10
           add
           writeTo "out"

       // dataOut[i] = out;
       label: ifEnd
           push "out"
           push "i"
           push "dataOut"
           writeProp

       // i++
       push "i"
       increment
       writeTo "i"

   // while i < 100
   goto "forStart"
label: "forEnd"

Answer 1:


Here are two simple optimizations a compiler could make (both assume in is non-negative, since integer division truncates toward zero while >> rounds toward negative infinity):

out = ((in / 2) >> 16) - 10;

can be reduced to

out = (in >> 17) - 10;

and

out = ((in << 5) / 2) * 50 + 10;

can be reduced to

out = (in << 4) * 50 + 10;

To answer your question, "how could you optimize the generated bytecode to run faster than what the compiler generated?", here is another version of the bytecode with some optimizations applied.

// i = 0
label: "forInit"
   push 0
   writeTo "i"

// while i < 100
label: "forStart"
   push "i"
   push 100
   jumpIfGreaterOrEqual "forEnd"

       // in = dataIn[i];
       push "i"
       push "dataIn"
       readProp
       writeTo "in"

       // if ((in % 5) == 0)
       push "in"
       push 5
       mod
       push 0
       jumpIfNotEquals "ifPart2"
       label: ifPart1
            // optimization: fold the /2 into the shift: (in / 2) >> 16 == in >> 17
           // out = ((in / 2) >> 16) - 10;
           push "in"
           push 17
           rightshift
           push 10
           minus
           // optimization: don't need out var since value on stack
           // dataOut[i] = out;
           push "i"
           push "dataOut"
           writeProp
           // optimization: avoid branch to common loop end 
           // i++
           push "i"
           increment
           writeTo "i"
           goto "forStart"

       // else
       label: ifPart2
            // optimization: fold the /2 into the shift: (in << 5) / 2 == in << 4
           // out = ((in << 5) / 2) * 50 + 10;
           push "in"
           push 4
           leftshift
           push 50
           multiply
           push 10
           add
           // optimization: don't need out var since value on stack
           // dataOut[i] = out;
           push "i"
           push "dataOut"
           writeProp
           // optimization: avoid branch to common loop end 
           // i++
           push "i"
           increment
           writeTo "i"
           goto "forStart"
label: "forEnd"



Answer 2:


I've also been looking into this. Here is the full list of transformations that LLVM performs, organized under headings:

  • Dead code removal
    • Aggressive Dead Code Elimination
    • Dead Code Elimination
    • Dead Argument Elimination
    • Dead Type Elimination
    • Dead Instruction Elimination
    • Dead Store Elimination
    • Dead Global Elimination
    • Delete dead loops
  • Unwanted data removal
    • Strip all symbols from a module
    • Strip debug info for unused symbols
    • Strip Unused Function Prototypes
    • Strip all llvm.dbg.declare intrinsics
    • Strip all symbols, except dbg symbols, from a module
    • Merge Duplicate Global Constants
    • Remove unused exception handling info
  • Inlining functions
    • Merge Functions
    • Partial Inliner
    • Function Integration/Inlining
  • Loop optimization
    • Loop-Closed SSA Form Pass
    • Loop Invariant Code Motion
    • Extract loops into new functions
    • Extract at most one loop into a new function
    • Loop Strength Reduction
    • Rotate Loops
    • Canonicalize natural loops
    • Unroll loops
    • Unswitch loops
  • Misc
    • Promote 'by reference' arguments to scalars
    • Combine instructions to form vector instructions within basic blocks
    • Profile Guided Basic Block Placement
    • Break critical edges in CFG
    • Optimize for code generation
    • Simple constant propagation
    • Deduce function attributes
    • Global Variable Optimizer
    • Global Value Numbering
    • Canonicalize Induction Variables
    • Insert instrumentation for edge profiling
    • Insert optimal instrumentation for edge profiling
    • Combine redundant instructions
    • Internalize Global Symbols
    • Interprocedural constant propagation
    • Interprocedural Sparse Conditional Constant Propagation
    • Jump Threading
    • Lower atomic intrinsics to non-atomic form
    • Lower invoke and unwind, for unwindless code generators
    • Lower SwitchInst's to branches
    • Promote Memory to Register
    • MemCpy Optimization
    • Unify function exit nodes
    • Reassociate expressions
    • Demote all values to stack slots
    • Scalar Replacement of Aggregates (DT)
    • Sparse Conditional Constant Propagation
    • Simplify well-known library calls
    • Simplify the CFG
    • Code sinking
    • Promote sret arguments to multiple ret values
    • Tail Call Elimination
    • Tail Duplication



Answer 3:


Although this does not answer your question, I came across the following transformations that a C++ compiler performs to optimize the generated machine code:

  • Strength Reduction --- iteration variables used as data indices are incremented at a rate matched to the size of the data unit
  • Hidden Parameters --- a function which returns a structure actually writes it to an area pointed to by a hidden parameter
  • Integer Division --- certain formulas can be used to evaluate integer division more efficiently in the case of a known divisor
  • Floating Conditions --- a floating point condition is turned into a complex sequence of instructions setting and querying the floating point status
  • Complex Math --- a complex multiplication or division is turned into a library call
  • Native routines --- a memcpy(), memset(), strcmp() or strlen() operation is transformed into rep movs, rep stos, repz cmps, or repz scas
  • Short Circuiting --- a complex condition is evaluated in a tree of basic blocks
  • Union Ambiguation --- information is lost regarding which member of a union is intended
  • Copy Fragmentation --- large double or aggregate values are copied word by word
  • Test Fragmentation --- a condition on a long integer value is composed of separate tests on the individual words of that value
  • Switch Fragmentation --- a switch statement is replaced by a nest of conditions on a value
  • Loop Header Copy --- a loop is augmented with a condition which decides whether to enter the loop
  • Loop Unrolling --- a loop is replaced by replicated copies of the loop body
  • Function Inlining --- a function call is replaced by a copy of the body of the function


Source: https://stackoverflow.com/questions/11072580/specifically-what-does-a-compiler-do-to-aggressively-optimize-generated-bytecode
