Automatic code deduplication of assembly language?

Submitted by 一曲冷凌霜 on 2019-12-03 13:29:54

What you want is a clone detector tool.

These exist in a variety of implementations that vary depending on the granularity of elements of the document being processed, and how much structure is available.

Detectors that match raw lines won't work for you: you want to parameterize your subroutines by differing constants (both data and index offsets) and/or by named locations or other named subroutines. Token-based detectors might work, in that they will identify single-point places (e.g., constants or identifiers) that vary. But what you really want is a structural matcher, which can pick out variant addressing modes or even variants in code in the middle of the block (see AST-based clone detectors, which I happen to build).
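To make the token-based idea concrete, here is a minimal sketch, not any real tool's algorithm. It assumes a generic ARM-like syntax with `;` comments, and the names `normalize` and `find_clones` are made up for illustration. Numeric literals are replaced by a placeholder, so two blocks that differ only in constants compare equal.

```python
import re
from collections import defaultdict

# Assumed literal syntax: optional '#', optional '-', decimal or 0x hex.
# The lookbehind keeps digits inside names such as r1 from matching.
NUMBER = re.compile(r"(?<!\w)#?-?(?:0x[0-9a-fA-F]+|\d+)\b")

def normalize(line):
    """Strip a trailing ';' comment and replace numeric literals."""
    code = line.split(";", 1)[0].strip()
    return NUMBER.sub("<NUM>", code)

def find_clones(lines, window=3):
    """Map each normalized window seen more than once to its start indices."""
    norm = [normalize(line) for line in lines]
    seen = defaultdict(list)
    for i in range(len(norm) - window + 1):
        key = tuple(norm[i:i + window])
        if all(key):  # skip windows that contain blank lines
            seen[key].append(i)
    return {k: v for k, v in seen.items() if len(v) > 1}

lines = [
    "mov r0, #4",
    "add r0, r0, r1",
    "str r0, [sp, #8]   ; first copy",
    "bl foo",
    "mov r0, #12",
    "add r0, r0, r1",
    "str r0, [sp, #16]  ; second copy, different constants",
]
clones = find_clones(lines)
```

Here `clones` reports one window starting at lines 0 and 4. A detector like this only tolerates single-token differences; it cannot see a changed addressing mode or an extra instruction in the middle of a block, which is where a structural matcher comes in.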

To detect with structure, you have to have structure. Fortunately, even assembly language code has structure in the form of a grammar, and blocks of code delimited by subroutine entries and exits (the latter are a bit more problematic to detect in assembly, because there may be more than one of each).

When you detect using structure, you have at least the potential to use that structure to modify the code. If you have the program source represented as a tree, you have structure (subtrees and sequences of subtrees) over which to detect clones, and you can abstract clone matches by modifying the trees at the match points. (Early versions of my clone detector for COBOL abstracted clones into COPY libs. We stopped doing that, mostly because you don't want to abstract every clone that way.)
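As a rough illustration of abstracting a clone at its match points, the sketch below computes the common shape of two instruction trees and turns each differing subtree into a parameter. This is plain anti-unification over nested tuples, not a description of how my tool works. The tuple encoding `(node_type, child, ...)` and the name `anti_unify` are assumptions made for the example.

```python
def anti_unify(a, b, holes):
    """Return the shared shape of trees a and b.

    Subtrees that differ are recorded in `holes` as (a_part, b_part)
    and replaced by a parameter name ?p0, ?p1, ...
    """
    if a == b:
        return a
    same_node = (
        isinstance(a, tuple) and isinstance(b, tuple)
        and len(a) == len(b) and a[0] == b[0]
    )
    if same_node:
        # Same node type and arity: descend into the children.
        kids = tuple(anti_unify(x, y, holes) for x, y in zip(a[1:], b[1:]))
        return (a[0],) + kids
    holes.append((a, b))
    return f"?p{len(holes) - 1}"

# Hypothetical trees for  mov eax, [ebx+8]  and  mov eax, [ebx+16]
t1 = ("mov", ("reg", "eax"), ("mem", ("reg", "ebx"), ("imm", 8)))
t2 = ("mov", ("reg", "eax"), ("mem", ("reg", "ebx"), ("imm", 16)))
holes = []
pattern = anti_unify(t1, t2, holes)
```

The resulting `pattern` has a single parameter where the offsets differ, and `holes` holds the two actual values, which become the arguments at each call site. Because the comparison walks the tree, a differing addressing mode would surface as one differing `mem` subtree rather than as a run of unrelated tokens.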

What you are proposing is called procedural abstraction, and it has been implemented by more than one group as a research project. Here is one. Here's another. And another.

Clone detection is normally used in the context of source code, though its function is similar. Since procedural abstraction occurs at a lower level, it can accomplish more. For example, suppose there are two calls to different functions, but with exactly the same complicated argument computations. A procedural abstractor can pull the argument calculation into a procedure, but a clone detector would have a hard time doing so.
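The example above (two calls with identical argument computations) can be sketched as a toy outliner over a list of instruction strings. This is only an illustration under strong assumptions: the repeated sequence contains no branches, does not touch the stack pointer, and a plain `call`/`ret` pair is safe to insert. The function name `outline` and the x86-like syntax are made up for the example.

```python
def outline(instrs, length, name="shared_0"):
    """Outline the first window of `length` instructions that repeats.

    Returns (rewritten_code, new_subroutine). Assumes the window is
    straight-line code that is safe to move behind a call/ret pair.
    """
    for start in range(len(instrs) - length + 1):
        body = instrs[start:start + length]
        out, i, hits = [], 0, 0
        while i < len(instrs):
            if instrs[i:i + length] == body:
                out.append(f"call {name}")  # replace one occurrence
                i += length                 # skip it, so matches never overlap
                hits += 1
            else:
                out.append(instrs[i])
                i += 1
        if hits > 1:
            return out, [f"{name}:"] + body + ["ret"]
    return list(instrs), []

code = [
    "mov eax, [x]", "imul eax, 3", "add eax, 7", "call f",
    "mov eax, [x]", "imul eax, 3", "add eax, 7", "call g",
]
rewritten, sub = outline(code, 3)
```

After the call, `rewritten` is `call shared_0`, `call f`, `call shared_0`, `call g`, and `sub` holds the three shared instructions followed by `ret`. A real procedural abstractor also has to weigh the call overhead against the bytes saved and prove that the moved code has no hidden dependencies.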

I don't believe either gcc or llvm currently has a supported implementation of procedural abstraction (PA). I searched the documentation for both and didn't find one. In at least two of the cases above, the optimizer runs on assembly code produced by gcc rather than as a gcc internal optimization. This probably explains why these techniques were not built into the compiler. You might try contacting the authors to see where their implementations stand.
