x86-64 | 易学教程

How does Linux support more than 512GB of virtual address range in x86-64?

The user virtual address space for x86-64 with Linux is 47 bits long, which means that Linux can map a process with a virtual address range of about 128 TB. However, what confuses me is that the x86-64 architecture supports an ISA-defined 4-level hierarchical page table (arranged as a radix tree) for each process. The root of the page table can only map up to 512 GB of contiguous virtual address space. So how can Linux support more than 512 GB of virtual address range? Does it use multiple page tables for each process? If yes, then for a process what should the CR3 (x86-64's register to contain ...
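A short sketch may help with the arithmetic behind this question. With 4-level paging and 4 KiB pages, a virtual address is split into four 9-bit table indexes plus a 12-bit page offset, so a single entry of the root table (PML4) covers 512 GiB, while the root table as a whole has 512 entries and covers 256 TiB. CR3 points at that one root table; the 47-bit user range corresponds to its lower 256 entries. The function names below are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 4-level x86-64 paging with 4 KiB pages: 12 offset bits, 9 index bits per level. */
enum { PAGE_SHIFT = 12, INDEX_BITS = 9, LEVELS = 4 };

/* Table index used at a given level (3 = PML4 root, 0 = last-level page table). */
unsigned table_index(uint64_t vaddr, int level)
{
    assert(level >= 0 && level < LEVELS);
    return (unsigned)((vaddr >> (PAGE_SHIFT + INDEX_BITS * level)) & 0x1ffu);
}

/* Bytes of virtual address space covered by ONE entry at a given level. */
uint64_t bytes_per_entry(int level)
{
    assert(level >= 0 && level < LEVELS);
    return UINT64_C(1) << (PAGE_SHIFT + INDEX_BITS * level);
}

/* Bytes covered by a whole table (512 entries) at a given level. */
uint64_t bytes_per_table(int level)
{
    return bytes_per_entry(level) * 512;
}
```

Under these assumptions, `bytes_per_entry(3)` is 2^39 (512 GiB) and `bytes_per_table(3)` is 2^48 (256 TiB), which suggests that the 512 GB limit applies to one root entry rather than to the root table itself.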

Where does the SSE instructions outperform normal instructions

Question: Where do x86-64's SSE instructions (vector instructions) outperform normal instructions? What I'm seeing is that the frequent loads and stores required for executing SSE instructions nullify any gain from the vector calculation. Could someone give me an example of SSE code that performs better than the normal code? Maybe it's because I am passing each parameter separately, like this... __m128i a = _mm_set_epi32(pa[0], pa[1], pa[2], pa[3]); __m128i b = ...
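As an illustration of the alternative to building vectors element by element, here is a minimal sketch that loads four contiguous 32-bit integers with a single `_mm_loadu_si128` per operand. The function name `add_i32_sse2` and the requirement that `n` be a multiple of 4 are assumptions made for this example. Note also that `_mm_set_epi32` takes its arguments from the highest lane to the lowest, so `_mm_set_epi32(pa[0], pa[1], pa[2], pa[3])` does not produce the same lane order as loading `pa` from memory.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Element-wise out[i] = a[i] + b[i], four 32-bit lanes per iteration.
 * Sketch only: n must be a multiple of 4; no scalar tail loop is provided. */
void add_i32_sse2(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* One unaligned 16-byte load per operand instead of four scalar inserts. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
}
```

Whether this beats scalar code in practice depends on the workload and on what the compiler already auto-vectorizes, so it is worth measuring rather than assuming.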

Cost of a page fault trap

I have an application which periodically (every 1 or 2 seconds) takes checkpoints by forking itself. A checkpoint is therefore a fork of the original process which just stays idle until it is asked to start when some error occurs in the original process. My question is: how costly is the copy-on-write mechanism of fork? How much does the page fault trap cost that occurs whenever the original process writes to a memory page (the first time after taking a checkpoint, that is), given that the copy-on-write mechanism will make sure that it gives the original process a different physical page than the ...
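One way to get a feel for this cost is to count the faults directly. The following is a rough sketch, assuming Linux and an anonymous private mapping: it forks an idle child (standing in for the checkpoint), writes one byte to each page in the parent, and reports how many minor faults the parent took according to `getrusage`. The function name `cow_minor_faults` is hypothetical, and the count may differ from the page count on systems that use transparent huge pages or that do not report `ru_minflt`.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* expose MAP_ANONYMOUS and kill() under strict -std= modes */
#endif
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the number of minor faults the parent takes while dirtying
 * npages pages that are shared copy-on-write with an idle child, or -1 on error. */
long cow_minor_faults(size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = npages * (size_t)page;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset(buf, 1, len); /* make every page resident before the fork */

    pid_t child = fork();
    if (child < 0) {
        munmap(buf, len);
        return -1;
    }
    if (child == 0) { /* the idle "checkpoint" process */
        pause();
        _exit(0);
    }

    struct rusage before, after;
    getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < npages; i++)
        buf[i * (size_t)page] = 2; /* first write after fork triggers copy-on-write */
    getrusage(RUSAGE_SELF, &after);

    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    munmap(buf, len);
    return after.ru_minflt - before.ru_minflt;
}
```

Wrapping the write loop with `clock_gettime` in the same way would give an approximate time per fault, which is probably closer to what the question is after.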

Problem of loading mod_wsgi module into apache on Windows 64-bit

Question: I'm trying to install the mod_wsgi module by following this instruction. I've downloaded mod_wsgi.so from this source. It seems that Apache cannot restart its services properly, and the page cannot be loaded, after I added the following line to httpd.conf: LoadModule wsgi_module modules/mod_wsgi.so. I've checked some issues from several sources, as follows: The file name is correct (mod_wsgi.so, not mod_wsgi.so.so). Permissions on the file were set the same as on other modules that load properly. Python installed for ...

SSE: unaligned load and store that crosses page boundary

Question: I read somewhere that before performing an unaligned load or store next to a page boundary (e.g. using the _mm_loadu_si128 / _mm_storeu_si128 intrinsics), code should first check whether the whole vector (in this case 16 bytes) belongs to the same page, and switch to non-vector instructions if not. I understand that this is needed to prevent a core dump if the next page does not belong to the process. But what if both pages belong to the process (e.g. they are part of one buffer, and I know the size of that buffer)? I wrote ...
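The check described in the question can be sketched as follows. This assumes 4 KiB pages (hard-coded here as `PAGE_SIZE_ASSUMED`; portable code would query `sysconf(_SC_PAGESIZE)`), and the helper name `crosses_page` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u /* assumption: 4 KiB pages */

/* Returns 1 if the byte range [p, p + len) spans more than one page, else 0. */
int crosses_page(const void *p, size_t len)
{
    assert(len > 0 && len <= PAGE_SIZE_ASSUMED);
    uintptr_t offset_in_page = (uintptr_t)p & (PAGE_SIZE_ASSUMED - 1);
    return offset_in_page + len > PAGE_SIZE_ASSUMED;
}
```

For a 16-byte vector, `crosses_page(ptr, 16)` is true only when the pointer lies in the last 15 bytes of a page.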

The difference between cmpl and cmp

I am trying to understand assembly to be able to solve a puzzle. However, I encountered the following instructions:

    0x0000000000401136 <+44>: cmpl $0x7,0x14(%rsp)
    0x000000000040113b <+49>: ja 0x401230 <phase_3+294>

What I think it's doing is this: the value of 0x14(%rsp) is -7380. According to my understanding, cmpl compares unsigned. Also, the jump is performed. So can it be that (unsigned)-7380 > 7 (unsigned)7380 > 7 --> jump? I actually don't want it to jump. But is this the correct explanation or not? Am I flipping arguments? Also, if you have any advice about how to manipulate this jump! According to ...
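The behaviour in question can be modelled in a few lines of C. In AT&T syntax the `l` suffix on `cmpl` only selects a 32-bit operand size; the comparison itself just sets flags, and it is `ja` ("jump if above") that interprets those flags as an unsigned comparison. The function names below are hypothetical, and the sketch assumes the memory operand is the value being tested against 7.

```c
#include <assert.h>
#include <stdint.h>

/* Models `cmpl $0x7, 0x14(%rsp)` followed by `ja`:
 * the jump is taken when the 32-bit operand, read as unsigned, is above 7. */
int ja_taken(int32_t value)
{
    return (uint32_t)value > 7u;
}

/* Worked examples, written as assertions. */
void ja_examples(void)
{
    assert(ja_taken(-7380)); /* 0xFFFFE32C, i.e. 4294959916 when read as unsigned */
    assert(ja_taken(8));
    assert(!ja_taken(7));    /* equal is not "above" */
    assert(!ja_taken(0));
}
```

Under this model, only values 0 through 7 avoid the jump; any negative input is treated as a very large unsigned number.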

Linux: Large int array: mmap vs seek file?

Question: Suppose I have a dataset that is an array of 1e12 32-bit ints (4 TB) stored in a file on a 4 TB HDD with an ext4 filesystem. Consider that the data is most likely random (or at least seems random).

    // pseudo-code
    for (long long i = 0; i < (1LL << 40); i++)
        SetFileIntAt(i) = GetRandInt();

Further, consider that I wish to read individual int elements in an unpredictable order and that the algorithm runs indefinitely (it is ongoing).

    // pseudo-code
    while (true)
        UseInt(GetFileInt(GetRand(1<<40)));

We ...
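For the "seek" side of the comparison, a minimal sketch of what the pseudo-code's `GetFileInt` might look like uses `pread`, which combines the seek and the read in a single system call. The names `open_dataset` and `read_int_at` are hypothetical, and the sketch assumes a 64-bit `off_t` (as on x86-64 Linux) so that offsets of several terabytes are representable.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* expose pread() under strict -std= modes */
#endif
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Open the dataset file read-only; returns a file descriptor or -1. */
int open_dataset(const char *path)
{
    return open(path, O_RDONLY);
}

/* Read the 32-bit int stored at element `index` of the file behind `fd`.
 * Returns 0 on success, -1 on error or on a short read (e.g. past end of file). */
int read_int_at(int fd, uint64_t index, int32_t *out)
{
    off_t offset = (off_t)(index * sizeof(int32_t));
    ssize_t n = pread(fd, out, sizeof *out, offset);
    return n == (ssize_t)sizeof *out ? 0 : -1;
}
```

An `mmap`-based version would replace the `pread` call with a plain array access into the mapping; for uniformly random access to data much larger than RAM, either approach is likely to be dominated by disk seek time.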

Why does gcc force PIC for x64 shared libs?

Trying to compile non-PIC code into a shared library on x64 with gcc results in an error, something like:

    /usr/bin/ld: /tmp/ccQ2ttcT.o: relocation R_X86_64_32 against `a local symbol' can not be used when making a shared object; recompile with -fPIC

This question is about why this is so. I know that x64 has RIP-relative addressing, which was designed to make PIC code more efficient. However, this doesn't mean load-time relocation can't be (in theory) applied to such code. Some online sources, including this one (which is widely quoted on this issue), claim that there's some inherent limitation ...

Handling calls to (potentially) far away ahead-of-time compiled functions from JITed code

This question was put on hold as too broad, presumably because of the research I included in an effort to "show my work" instead of asking a low-effort question. To remedy this, allow me to summarize the entire question in a single sentence (credit to @PeterCordes for this phrase): How do I efficiently call (x86-64) ahead-of-time compiled functions (that I control, may be further than 2GB away) from JITed code (that I am generating)? This alone, I suspect, would be put on hold as "too broad." In particular, it lacks a "what have you tried." So, I felt the need to add additional information ...
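To make the 2 GB constraint concrete, here is a minimal sketch of one possible emitter, not necessarily the approach the question settles on. It emits a 5-byte `call rel32` (opcode `E8`) when the target is within signed 32-bit reach of the end of the call instruction, and otherwise falls back to `mov rax, imm64` followed by `call rax`. The function name `emit_call` is hypothetical, the fallback assumes RAX may be clobbered at the call site, and the `memcpy` of immediates assumes a little-endian host.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes a call to `target` into `buf`, which will execute at address `call_site`.
 * Returns the number of bytes emitted (5 or 12); `buf` must have room for 12. */
size_t emit_call(uint8_t *buf, uint64_t call_site, uint64_t target)
{
    /* rel32 is relative to the address of the instruction after the 5-byte call. */
    int64_t disp = (int64_t)(target - (call_site + 5));
    if (disp >= INT32_MIN && disp <= INT32_MAX) {
        int32_t d = (int32_t)disp;
        buf[0] = 0xE8;               /* call rel32 */
        memcpy(buf + 1, &d, sizeof d);
        return 5;
    }
    buf[0] = 0x48; buf[1] = 0xB8;    /* mov rax, imm64 */
    memcpy(buf + 2, &target, sizeof target);
    buf[10] = 0xFF; buf[11] = 0xD0;  /* call rax */
    return 12;
}
```

The fallback costs extra bytes and an indirect branch, which is one reason JITs often try to allocate their code buffers within 2 GB of the functions they call.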

What is the difference between retq and ret?

Let's consider the following program, which computes an unsigned square of the argument:

    .global foo
    .text
    foo:
        mov %rdi, %rax
        mul %rdi
        ret

This is properly assembled by as, but disassembles to:

    0000000000000000 <foo>:
       0: 48 89 f8    mov %rdi,%rax
       3: 48 f7 e7    mul %rdi
       6: c3          retq

Is there any difference between ret and retq?

In long (64-bit) mode, you return (ret) by popping a quadword address from the stack to %rip. In 32-bit mode, you return (ret) by popping a dword address from the stack to %eip. Some tools like objdump -d call the first one retq. It's just a name; the instruction ...