What is the meaning of align at the start of a section?
For example:
align 4
a: dw 0
How does it save memory access?
Memories are a fixed width, today either 32 bits or, more typically, 64 bits wide (even on a 32 bit system). Let's assume a 32 bit data bus for now. Every time you do a read, be it 8, 16, or 32 bits, it goes over a 32 bit bus, so all of those data lines will have something on them. It makes sense to just put the whole 32 bit word at the aligned address on the bus.
So if at address 0x100 you had the 32 bit value 0x12345678 and you were to perform a 32 bit read, all of those bits would be on the bus. If you were to perform an 8 bit read at address 0x101, the memory controller would do a read of address 0x100 and get 0x12345678. From those 32 bits it would isolate the proper "byte lane", the 8 bits related to address 0x101. On some processors the memory controller may never see anything but 32 bit reads, and the processor handles isolating the byte lane.
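The byte lane selection can be sketched in C. This is only an illustration, not how any particular memory controller is built: the function name `read_byte` is made up, and it assumes little-endian byte order (lane 0 holds the least significant byte), which the text above does not specify.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: given the aligned 32 bit word that was fetched from
   word_addr, return the byte an 8 bit read at addr would see.
   ASSUMPTION: little-endian lanes, so lane 0 is bits 7..0 of the word. */
uint8_t read_byte(uint32_t word, uint32_t word_addr, uint32_t addr)
{
    assert((word_addr & 3u) == 0);   /* the bus only fetches aligned words */
    assert(addr - word_addr < 4u);   /* the byte must live inside that word */

    unsigned lane = addr & 3u;       /* low two address bits pick the lane */
    return (uint8_t)(word >> (8u * lane));
}
```

With the values from the example, `read_byte(0x12345678, 0x100, 0x101)` gives 0x56 under the little-endian assumption. A big-endian system would pick 0x34 instead.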
What about processors that allow unaligned accesses, like the x86? Say you had 0x12345678 at address 0x100 and 0xAABBCCDD at address 0x104. If you were to do a 32 bit read at address 0x102 on this 32 bit data bus, two memory cycles are required: one at address 0x100, where 16 bits of the desired value live, and another at 0x104, where the other two bytes are found. After those two reads happen, the pieces are put together into the 32 bit value and passed deeper into the processor where it was requested. The same thing happens if you want to do a 16 bit read at, say, address 0x103: it costs you twice as many memory cycles and takes twice as long.
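Counting those memory cycles is simple arithmetic, sketched here in C. The function name `bus_cycles` and its parameters are invented for illustration, and the model is the same simplified one as above: one cycle per bus-wide aligned word touched, no cache.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: number of bus-wide memory cycles needed to read
   `size` bytes starting at `addr` on a bus that is `bus_bytes` wide. */
unsigned bus_cycles(uint32_t addr, uint32_t size, uint32_t bus_bytes)
{
    assert(size > 0);
    /* bus width must be a power of two, e.g. 4 for a 32 bit bus */
    assert(bus_bytes != 0 && (bus_bytes & (bus_bytes - 1)) == 0);

    uint32_t first_word = addr / bus_bytes;              /* word holding the first byte */
    uint32_t last_word  = (addr + size - 1) / bus_bytes; /* word holding the last byte */
    return (unsigned)(last_word - first_word + 1);
}
```

On a 4 byte bus, `bus_cycles(0x100, 4, 4)` is 1, while `bus_cycles(0x102, 4, 4)` and `bus_cycles(0x103, 2, 4)` are both 2, matching the two cases described above.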
What the .align directive normally does in assembly language (of course you have to specify the exact assembler and processor as this is a directive and each assembler can define whatever it wants to define for directives) is pad the output such that the thing that immediately follows the .align is, well, aligned on that boundary. If I had this code:
b: .db 0
c: .dw 0
And it turns out that when I assemble and link, the address for c is 0x102, but I know I will be accessing it very often as a 32 bit value, then I can align it by doing something like this:
b: .db 0
.align 4
c: .dw 0
Assuming nothing else before this changes as a result, b will still be at address 0x101, but the assembler will put two more bytes in the binary between b and c so that c moves to address 0x104, aligned on a 4 byte boundary.
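The padding arithmetic behind that can be sketched in C. This is not what any specific assembler does internally, and the names `is_aligned` and `align_up` are made up here. It assumes the alignment is a power of two, which lets a bit mask stand in for the modulo.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true if addr is a multiple of `alignment`. */
int is_aligned(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0); /* power of two */
    return (addr & (alignment - 1)) == 0;  /* low bits all zero */
}

/* Hypothetical helper: round addr up to the next multiple of `alignment`
   (addr itself if it is already aligned). */
uint32_t align_up(uint32_t addr, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0); /* power of two */
    return (addr + alignment - 1) & ~(alignment - 1);
}
```

For the example, `align_up(0x102, 4)` is 0x104, so the padding is 0x104 - 0x102 = 2 bytes. An address that is already aligned, such as 0x104, is left where it is.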
"Aligned on a 4 byte boundary" simply means that the address modulo 4 is zero: 0x0, 0x4, 0x8, 0xC, 0x10, 0x14, 0x18, 0x1C and so on (the lower two bits of the address are zero). Aligned on 8 means 0x0, 0x8, 0x10, 0x18 and so on, or the lower 3 bits of the address are zero. And so on.
Writes are worse than reads, because you have to do a read-modify-write for data smaller than the bus. If we wanted to change the byte at address 0x101, we would read the 32 bit value at address 0x100, change the one byte, then write that 32 bit value back to 0x100. So when you are writing a program and you think you are making things faster by using smaller values, you are not. Any write that is not both aligned and the full width of the memory costs you a read-modify-write. An unaligned write costs you twice as much, just as it did with reads: it becomes two read-modify-writes.
Writes do have a performance advantage over reads, though. When a program needs to read something from memory and use that value right away, the next instruction has to wait for the memory cycle to complete (which these days can be hundreds of clock cycles; dram has been stuck at 133MHz for about a decade, your 1333MHz DDR3 memory is not 1333MHz, the bus is 1333MHz/2, and you can put requests in at that speed but the answer doesn't come back for a long while). Basically, with a read you have an address but you have to wait for the data for as long as it takes. For a write you have both items, the address and the data, so you can "fire and forget": you give the memory controller the address and data and your program keeps running. Granted, if the next instruction or set of instructions needs to access memory, read or write, then everyone has to wait for the first write to finish before moving on to the next access.
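To make the read-modify-write concrete, here is a rough C model of a memory that can only be read and written one whole 32 bit word at a time. Everything in it is an assumption for illustration: the name `write_byte_rmw`, the `words` array standing in for memory that starts at address `base`, and the little-endian lane order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: change one byte at `addr` in a memory that only
   supports whole-word accesses. `words` models memory starting at `base`.
   ASSUMPTION: little-endian lanes, so lane 0 is bits 7..0 of the word. */
void write_byte_rmw(uint32_t *words, uint32_t base, uint32_t addr, uint8_t value)
{
    assert((base & 3u) == 0 && addr >= base);

    uint32_t index = (addr - base) / 4u;     /* which aligned word */
    unsigned shift = 8u * (addr & 3u);       /* which byte lane in it */

    uint32_t word = words[index];                       /* read */
    word &= ~((uint32_t)0xFFu << shift);                /* modify: clear the lane */
    word |= (uint32_t)value << shift;                   /* modify: insert new byte */
    words[index] = word;                                /* write */
}
```

Starting from 0x12345678 at 0x100, writing the byte 0xEE to address 0x101 leaves 0x1234EE78 in that word under the little-endian assumption. It takes one full word read and one full word write to change 8 bits.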
All of the above is very simplistic, yet it is what you would see between the processor and the cache. On the other side of the cache, the memory is accessed in "cache lines", which are generally multiples of the bus width (the fixed width of the sram in the cache and the fixed width of the dram on the far side do not have to match). This both helps and hurts with alignment. Say, for example, that 0x100 is a cache line boundary: the bytes at 0xFE and 0xFF are the tail end of one cache line and 0x100 is the beginning of the next. If you were to perform a 32 bit read at address 0xFE, not only do two 32 bit memory cycles have to happen, but also two cache line fetches. The worst case would be having to evict two cache lines to memory to make room for the two new cache lines you are fetching. Had you used an aligned address, it would still be bad, but only half as bad.
Your question did not specify the processor, but the nature of your question implies x86, which is well known for this problem. Other processor families do not allow unaligned accesses, or you have to specifically disable the fault. And sometimes the unaligned access does not behave like it does on x86. For example, on at least one processor, if you had 0x12345678 at address 0x100 and 0xAABBCCDD at address 0x104, disabled the fault, and performed a 32 bit read at address 0x102, you would get 0x56781234: a single 32 bit read with the byte lanes rotated to put the lower byte in the right place. No, I am not talking about an x86 system but some other processor.
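That rotated result can be reproduced with a few lines of C. This is only a model of the behavior described above, not documentation for any specific processor. The name `rotated_read` is invented, and the model assumes the hardware fetches the single aligned word containing the address and rotates it right by 8 bits per byte of misalignment.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a "rotating" unaligned 32 bit read: only the
   aligned word containing `addr` is fetched, then rotated right so the
   addressed byte ends up in the low 8 bits. */
uint32_t rotated_read(uint32_t aligned_word, uint32_t addr)
{
    unsigned rot = 8u * (addr & 3u);   /* 0, 8, 16 or 24 bits */
    if (rot == 0)
        return aligned_word;           /* aligned: nothing to rotate
                                          (also avoids a shift by 32) */
    return (aligned_word >> rot) | (aligned_word << (32u - rot));
}
```

`rotated_read(0x12345678, 0x102)` gives 0x56781234, the value from the example. Note that 0xAABBCCDD at 0x104 never enters into it, which is exactly why this is not what an x86 would return.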