What is the actual relation between assembly, machine code, bytecode, and opcode?



I have read most of the SO questions about assembly and machine code, such as this,

6 Answers
  • 2020-12-29 12:11

    Is there some sort of standard reference that lists out all of those numbers, and what they mean, for whatever architecture you are on, and how each set of numbers maps to each assembly instruction?

    Yes, though they can be very complex. Also, due to the prevalence of assemblers and compilers, they're also sort of hard to find, because pretty much nobody uses them.

    Relation Between Assembly and Bytecode

    • Machine code - One or a series of values read into a CPU. Each number is an "instruction" or "opcode", and may be followed by one or more parameters to act on. In the linked code, 13 tells the processor to push a string onto the stack.
    • OpCode - The value for a command: In the sample, the opcode for pushing a string is 13.
    • Assembly - human readable instructions for a CPU's internal machine code. Pretty much always one assembly instruction per machine code instruction. In my code that you linked to, the "assembly" instruction PushString maps to machine instruction 13.
    • Byte Code - Since each processor uses a different machine code, sometimes programs compile to a machine code for an imaginary "virtual machine", and then have a program that reads this fake machine code and executes it (either via emulation or JIT). Java and C# and VB all do this. This "fake" machine code is called "byte code", though the terms are often used interchangeably.
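    As a rough illustration of the definitions above, here is a minimal sketch of a virtual machine that reads "fake" machine code. Only the value 13 for pushing a string comes from this answer; the other opcode numbers, the string table, and the `run` function are hypothetical names invented for the example, not the proprietary bytecode mentioned here.

```python
# Toy stack-based virtual machine. Only PUSH_STRING = 13 comes from the
# answer above; every other opcode number is made up for illustration.
PUSH_STRING = 13   # operand: index into the string table
CONCAT = 14        # hypothetical: pop two strings, push their concatenation
PRINT = 15         # hypothetical: pop one string and emit it
HALT = 0           # hypothetical: stop execution


def run(bytecode, strings):
    """Interpret a list of integers as bytecode; return the emitted lines."""
    stack, output, pc = [], [], 0
    while pc < len(bytecode):
        opcode = bytecode[pc]
        pc += 1
        if opcode == PUSH_STRING:
            stack.append(strings[bytecode[pc]])  # operand follows the opcode
            pc += 1
        elif opcode == CONCAT:
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == PRINT:
            output.append(stack.pop())
        elif opcode == HALT:
            break
        else:
            raise ValueError(f"unknown opcode {opcode} at offset {pc - 1}")
    return output


# "Machine code" for the toy VM: opcodes interleaved with their operands.
program = [13, 0, 13, 1, 14, 15, 0]
result = run(program, ["Hello, ", "world!"])
print(result)
```

    The list `program` plays the role of machine code, each number that selects a branch is an opcode, and the names `PUSH_STRING`, `CONCAT`, and `PRINT` are what an assembly language for this VM would expose.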

    I should note that the bytecode instructions used in this post and in my other post that you linked to are simplified extracts from a proprietary byte code I work with at my company. We have a proprietary programming language that compiles to this bytecode which is interpreted by our product, and some of the values I mentioned are real bytecodes we actually use. 13 is actually pushAnything with complex parameters, but I kept things simple for the answer.

  • 2020-12-29 12:13

    You have clearly done some homework of your own on this, and I say good stuff (and voted you up one).

    As you are experiencing, the more you read, the more you say, "huh ?"

    Okay, first off, when you encounter the word "bytecode" just close the window and stop reading, because you are on the wrong path; probably a tangent at best and at worst you could be reading someone trying to sound smarter than he really is by tossing techy sounding buzzwords into his writing.

    Now, as for the word "opcode", yes those really do exist, but do understand that those numbers are actually symbolic, for humans to grasp conceptually. In real life, they are super-ultra-tiny switches.

    If you really like history, and technology before the internet (or color TV for that matter) look up phrases like butterfly switches, vacuum tubes, butterfly girls, and I forget the other words. This was back before transistors existed. The original huge computers actually used vacuum tubes and generated enough heat to warm an entire floor (or two or three) of an office building in the dead of Winter. The electrical current draws were astounding.

    The thing to keep in your mind about all this is that those computers were "programmed" by individually flipping butterfly switches ("bat handles" were another term sometimes used) which connected and disconnected individual lines from individual tubes, and I forget what else.

    The facts were: You programmed a computer by flipping the bat handles that were connected to the lines that were connected to various tubes.

    Fast Forward To Today...

    When you write an opcode of 90h (the one-byte NOP in x86), you are doing (with today's hi-tech wowee-zowee) the same thing that the butterfly girls did back in the stone age of computers.

    Specifically, you are "throwing" these "butterfly switches"...

    • 7 - ON
    • 6 - OFF
    • 5 - OFF
    • 4 - ON
    • 3 - OFF
    • 2 - OFF
    • 1 - OFF
    • 0 - OFF
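    The switch positions listed above are just the binary digits of 90h. A small sketch (the function name `switch_positions` is made up for this example) that derives them:

```python
def switch_positions(byte):
    """Return (bit_number, 'ON'/'OFF') pairs from bit 7 down to bit 0."""
    return [(bit, "ON" if (byte >> bit) & 1 else "OFF")
            for bit in range(7, -1, -1)]


positions = switch_positions(0x90)  # 0x90 is 1001 0000 in binary
for bit, state in positions:
    print(f"{bit} - {state}")
```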

    Here's the big difference (and part of today's hi-tech wowee-zowee)...

    They had to throw exactly those switches at exactly one place on the floor. You will be flipping them anywhere you want. Three other programs will cooperate and make those decisions for you.

    Those three programs are:

    • The Assembler
    • The Linker
    • The Loader

    So then (I hope) that this has helped lay the foundation for you to understand that the OPCODE is a mental representation of a bunch of little switches that will be "opened" or "closed".

    (Actually, the hi-tech wowee-zowee has taken it a step further, but it's the same effect as the butterfly switches of previous generations.)

    Anyway, it works like this.

    Humans decided that there would be an instruction to do nothing; called a NOP

    So, you type the letters NOP in your text editor like this

      NOP           ;This is a No operation instruction
    

    You then save the file.

    You then ask the assembler to assemble that file

    When the assembler sees the NOP he creates the 90 (in hex) in the Object file which he is creating for the linker.

    The Linker uses the object file and creates an executable file

    The Loader places that executable file wherever it wants. (Note, in olden days of microcomputers, the software writer had to decide where to place that executable file; that was conflict bait like you wouldn't believe.)

    Anyway, the NOP became 90 in some place in the EXE file and the loader stuck it in a good area for you, based on 179 rules you don't have to worry about any longer.
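    The assembler's part of this pipeline can be sketched as a table lookup. This is a toy under loud assumptions: it handles only a few one-byte x86 instructions without operands, and it emits raw bytes rather than a real object file, so the linker and loader steps are not modelled at all. The names `ONE_BYTE_OPCODES` and `assemble` are invented for the example.

```python
# A few x86 instructions that encode as a single byte with no operands.
ONE_BYTE_OPCODES = {"NOP": 0x90, "RET": 0xC3, "HLT": 0xF4, "INT3": 0xCC}


def assemble(source):
    """Translate one mnemonic per line into raw machine-code bytes."""
    code = bytearray()
    for line in source.splitlines():
        text = line.split(";", 1)[0].strip()  # drop the comment, if any
        if not text:
            continue                          # blank or comment-only line
        mnemonic = text.upper()
        if mnemonic not in ONE_BYTE_OPCODES:
            raise ValueError(f"unsupported instruction: {text!r}")
        code.append(ONE_BYTE_OPCODES[mnemonic])
    return bytes(code)


machine_code = assemble(
    "  NOP           ;This is a No operation instruction\n"
    "  RET\n"
)
print(machine_code.hex())
```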

    The loader then gets out of the picture and lets your program have the CPU.

    The CPU fetches your first instruction and starts obeying.

    When the CPU gets to the byte containing 90 it will be the same thing as the butterfly switches from generations past.

    While the current will not be traveling a bunch of long wires on the floor, it will be doing highly similar (and functionally equivalent) things inside the chip.

    Now with all that written (thanks if you're still actually reading) you can understand this boiled down one line explanation of what an opcode actually is...

    The opcode is a paradigmatic representation of butterfly switches of olden days.

    Now for your second question about what is machine code.

    Machine code is a bunch of opcodes, each followed by whatever operands it needs.

    If any of this is unclear, ask in the comments section and I'll try to edit this answer.

  • 2020-12-29 12:18

    Yes, each architecture has an instruction set reference that gives how instructions are encoded. For x86, it's the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z

    Most assemblers, including nasm, can produce a listing file for you. Feeding your sample code to nasm -l, we get:

     1                                  global main
     2                                  section .text
     3
     4                                  main:
     5 00000000 E800000000                call write
     6
     7                                  write:
     8 00000005 B804000002                mov rax, 0x2000004
     9 0000000A BF01000000                mov rdi, 1
    10 0000000F 48BE-                     mov rsi, message
    11 00000011 [0000000000000000]
    12 00000019 BA0E000000                mov rdx, length
    13 0000001E 0F05                      syscall
    14
    15                                  section .data
    16 00000000 48656C6C6F2C20776F-     message: db 'Hello, world!', 0xa
    17 00000009 726C64210A
    18                                  length: equ $ - message
    

    You can see the generated machine code in the third column (first is line number, second is address).

    Note that the output of the assembler is an object file, and the output of the linker is an executable. Both of those have a complex structure and contain more than just the machine code. This is why your hexdump differs from the above listing.

    Opcode is generally considered to be the part of the machine code instruction that specifies the operation to perform. For example, in the above code you have B804000002 mov rax, 0x2000004. There B8 is the opcode, 04000002 is the immediate operand.
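    To make the opcode/operand split concrete, here is a small sketch that takes the five bytes from the listing and pulls them apart. It assumes the simple "one opcode byte followed by a 32-bit little-endian immediate" shape of this particular instruction; real x86 decoding has many more cases.

```python
import struct

raw = bytes.fromhex("B804000002")  # bytes from listing line 8

opcode = raw[0]                                # first byte selects the operation
(immediate,) = struct.unpack("<I", raw[1:5])   # "<I" = little-endian 32-bit unsigned

print(hex(opcode), hex(immediate))
```

    The bytes 04 00 00 02 appear reversed relative to the written constant because x86 stores the least significant byte first.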

    Bytecode is not typically used in the assembly context; it can be thought of as the machine code for a virtual machine.


    For a walkthrough, x86 is a very complicated architecture. But your sample code happens to have a simple instruction, the syscall. So let's see how to turn that into machine code. Open the above mentioned reference pdf, and go to the section about syscall in chapter 4. You will immediately see it listed as opcode 0F 05. Since it doesn't take any operands, we are done: those 2 bytes are the machine code. How do we turn it back? Go to Appendix A: Opcode map. Section A.1 tells us: "For 2-byte opcodes beginning with 0FH (Table A-3), skip any instruction prefixes, the 0FH byte (0FH may be preceded by 66H, F2H, or F3H) and use the upper and lower 4-bit values of the next opcode byte to index table rows and columns." So we skip the 0F, split the 05 into 0 and 5, and look that up in Table A-3 at row 0, column 5. We find it is the syscall instruction.
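    The row/column lookup described above is just a split of the second opcode byte into its upper and lower four bits. A minimal sketch, where `opcode_map_position` is a name made up for this example and the table itself still has to be read from the manual:

```python
def opcode_map_position(opcode_byte):
    """Split a byte into (row, column) = (upper 4 bits, lower 4 bits)."""
    return opcode_byte >> 4, opcode_byte & 0x0F


machine_code = bytes.fromhex("0F05")   # the syscall bytes from the listing
assert machine_code[0] == 0x0F         # two-byte opcode escape: skip it
row, column = opcode_map_position(machine_code[1])
print(f"row {row:X}, column {column:X}")
```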

  • 2020-12-29 12:27

    Briefly:

    "Assembly" is what you feed through an "assembler". An assembler is a program which reads in several decks of punched cards and "assembles" them into a single program.

    Or at least that used to be. Now the cards are replaced with disk files. But the data on the "cards" is a "machine language" which is the numeric values for the machine instructions.

    But modern assemblers are SAPs -- Symbolic Assembler Programs -- so you can replace the numeric values with symbols -- say "LOD" for a Load instruction, "R1" for register 1, and "label5" for the instruction address 26734.

    "Machine language" is the way that individual instructions (or "orders", if you're a Brit) to the CPU are represented. For a symbolic assembler you might have "LOD R1, LOOPCOUNT" to represent the instruction to load the value at the word labeled LOOPCOUNT into register 1. "LOD", by the way, is the "opcode" -- the (symbolic version of the) numeric value that tells the computer what to do next. (And note that every different computer design uses a different machine language, possibly with different symbols for the opcodes. Most of what you will find on the web is one version or another of the Intel machine language, but you would find, say, the IBM 370 to be radically different.)

    "Bytecode" is a different sort of "machine language" which operates on a "virtual machine" instead of real hardware. The best known case of this is the Java Virtual Machine. "Bytecode" is a notation similar to regular "machine language" but idealized to an extent, since running on a virtual machine relieves it from some of the realities of a real hardware environment.
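    The Java Virtual Machine is the example named above; as a stand-in that is easy to try, CPython also compiles source to bytecode for its own virtual machine, and its standard `dis` module can list it. This sketch uses CPython, which is a substitution of mine and not something the answer mentions, and the exact instructions printed differ between Python versions.

```python
import dis


def add(a, b):
    return a + b


# Each Instruction carries the numeric opcode and its symbolic name,
# mirroring the opcode/mnemonic pairing of a real machine language.
instructions = list(dis.get_instructions(add))
for ins in instructions:
    print(ins.offset, ins.opcode, ins.opname, ins.argrepr)
```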

  • 2020-12-29 12:29

    The relationship is:

    Assembler instruction (readable) ->  machine code (binary) 
    
    machine code = opcode + operands
    

    The assembler instruction is human readable code, such as: mov rax, 0x2000004

    The opcode is the part of the machine code that relates to the instruction, but from the CPU point of view (so it's not just MOV, but MOV constant to register). For example, see here for i386 MOV opcodes:

    • MOV reg32, immediate value is coded as B8 + register code (EAX is the first one, so its code is 0),
    • the opcode is followed by the operand 0x2000004, which is encoded in little-endian order as: 04 00 00 02
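    The "B8 + register code" rule can be sketched as a tiny encoder. The register order below is the standard x86 numbering for 32-bit registers; the function name `encode_mov_reg32_imm32` is invented for the example, and only this one instruction form is covered.

```python
import struct

# Standard x86 register numbers 0..7 for 32-bit general-purpose registers.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def encode_mov_reg32_imm32(register, value):
    """Encode 'mov reg32, imm32' as opcode B8+reg followed by 4 LE bytes."""
    opcode = 0xB8 + REG32.index(register)
    return bytes([opcode]) + struct.pack("<I", value)


encoded = encode_mov_reg32_imm32("eax", 0x2000004)
print(encoded.hex())
```

    Running the same function for `edi` with the value 1 gives a first byte of BF, because `edi` is register number 7.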

    Bytecode is the equivalent of machine code but for virtual machines such as the JVM. The term bytecode comes from the first environments that used this technology (p-code from the UCSD Pascal compiler), which used a byte to encode the virtual instruction. You can find for example the small p-code instruction set here, and the more recent and extensive JVM bytecode here

    To be noted: LLVM uses an intermediate representation (IR) that can be stored in a compact binary form, known as bitcode (called bytecode in early LLVM releases). This allows machine-neutral code analysis and optimization before native code is generated.

  • 2020-12-29 12:33

    Assembly: Human readable instructions to the assembler + data bytes + operators

    Machine code: The actual bit sequences that the CPU understands.

    It contains:

    • the opcode,
    • which registers to use,
    • offset from the PC register,
    • and similar info

    Bytecode: This is the code read by an interpreter (most implementations of Java are actually an interpreter that reads bytecode and uses that bytecode to select a sequence of machine code to have the CPU actually execute). Bytecode is often used to make the same source code work on several different CPUs.

    Opcode: The first one (or two) bytes of the machine code. It acts like a selector that tells the CPU which microcode sequence to perform (something like a switch statement in C).

    Microcode: The hardwired instruction sequences within the CPU that are used to execute the machine code.
    There are lots of microcode sequences, at least one sequence for each opcode. In general, the rest of the machine code is just parameters to the microcode sequence that is selected by the opcode. Each microcode sequence contains instructions to open/close gates, clock data, pass info to/from the accumulator, etc.
