computer-architecture | 易学教程

Why use SIMD if we have GPGPU? [closed]

Question (closed as opinion-based 5 years ago): Now that we have GPGPUs with languages like CUDA and OpenCL, do the multimedia SIMD extensions (SSE/AVX/NEON) still serve a purpose? I read an article recently about how SSE instructions could be used to accelerate sorting networks. I thought this was pretty neat but when I

How cache memory works?

Question: Today in my computer organization class, the teacher talked about something that interested me. While explaining why cache memory works, he said that, for this code:

    for (i=0; i<M; i++)
        for (j=0; j<N; j++)
            X[i][j] = X[i][j] + K; // X is double (8 bytes)

it is not good to swap the first line with the second. What is your opinion on this? And why is that?

Answer 1: Locality of reference. Because the data is stored by rows, for each row the j columns are in adjacent memory addresses. The OS
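The loop-order point can be sketched in C++ as follows. This is a minimal illustration, not the teacher's code: the helper names `add_row_major` and `add_column_major` and the flat `std::vector` storage are assumptions made for the example. Both functions compute the same result, but the first visits adjacent addresses while the second jumps `cols * sizeof(double)` bytes per access.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds k to every element, walking each row left to right.
// With row-major storage, consecutive accesses touch adjacent addresses.
void add_row_major(std::vector<double>& x, std::size_t rows, std::size_t cols, double k) {
    assert(x.size() == rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            x[i * cols + j] += k;
}

// Hypothetical helper: same arithmetic with the loops interchanged.
// Each inner-loop step now strides cols * sizeof(double) bytes through memory.
void add_column_major(std::vector<double>& x, std::size_t rows, std::size_t cols, double k) {
    assert(x.size() == rows * cols);
    for (std::size_t j = 0; j < cols; ++j)
        for (std::size_t i = 0; i < rows; ++i)
            x[i * cols + j] += k;
}
```

Timing the two functions on an array much larger than the last-level cache would be one way to observe the difference; the size of the gap depends on the machine and the compiler.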

Seeing how Instructions get Translated (Computer Architecture)

Question: This is a slightly confusing question, but I'm really looking to learn some low-level programming. The thing is, dev boards like Arduino really hide a lot of what's going on. I have spent some time learning about computer architecture, logic gates, sequential logic, etc. (I even went as far as learning the physics of semiconductors and the related electronics, just to know exactly what is going on, as well as how gates are made using CMOS transistors and such). But that's about where it

Writing a Register File in VHDL

I am trying to write a register file in VHDL. The file contains 16 64-bit registers. Each cycle, two registers are read and one register is written (given that writing is enabled). There should be a data bypass (forwarding) so that the value just written is forwarded directly to the output if we are reading and writing to/from the same register in a single cycle. My idea was to write on the rising edge and read on the falling edge of the clock in order to complete this in one cycle. However, my design isn't working (not that I expected it to since I don't believe that checking for a falling
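The intended behaviour (two reads and one optional write per cycle, with the written value forwarded to a read of the same register) can be modelled in software. The following is a behavioural C++ sketch only, not VHDL and not the asker's design; the class name `RegisterFileModel` and its `cycle` method are hypothetical names introduced for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical behavioural model of a 16 x 64-bit register file.
class RegisterFileModel {
public:
    // Simulates one clock cycle: returns the two read-port values,
    // then commits the write if write_enable is set.
    std::pair<std::uint64_t, std::uint64_t> cycle(unsigned read_a, unsigned read_b,
                                                  bool write_enable, unsigned write_addr,
                                                  std::uint64_t write_data) {
        assert(read_a < 16 && read_b < 16 && write_addr < 16);
        // Bypass: a read of the register being written sees the new data.
        auto read = [&](unsigned addr) -> std::uint64_t {
            return (write_enable && addr == write_addr) ? write_data : regs_[addr];
        };
        std::pair<std::uint64_t, std::uint64_t> out{read(read_a), read(read_b)};
        if (write_enable) {
            regs_[write_addr] = write_data;
        }
        return out;
    }

private:
    std::array<std::uint64_t, 16> regs_{};  // all registers start at zero
};
```

In this model the forwarding is a comparison of the read and write addresses rather than a second clock edge, which is one common way to describe a bypass; whether that suits the original design is not something the excerpt settles.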

Why are conditionally executed instructions not present in later ARM instruction sets?

Question: Naively, conditionally executed instructions seem like a great idea to me. As I read more about ARM (and ARM-like) instruction sets (Thumb2, Unicore, AArch64) I find that they all lack the bits for conditional execution. Why is conditional execution missing from each of these? Was conditional execution a mistake at the time, or have subsequent changes made it an expensive waste of instruction bits?

Answer 1: The general claim is that modern systems have better branch predictors and compilers are much more

Android CPU register names?

This code fragment is extracted from an Android crash report on a Samsung Tab S:

    Build fingerprint: 'samsung/chagallwifixx/chagallwifi:5.0.2/LRX22G/T800XXU1BOCC:user/release-keys'
    Revision: '7'
    ABI: 'arm'
    r0 a0d840bc  r1 a0dcb880  r2 00000001  r3 a0d840bc
    r4 a0dc3c4c  r5 00000000  r6 a066d200  r7 00000000
    r8 32d68f40  r9 a0c359a8  sl 00000014  fp bef3ba84
    ip a0dc3fb8  sp bef3ba10  lr a0c35a0c  pc a0c34bc8  cpsr 400d0010

r0 through r9 are pretty clearly general purpose registers, sp (r13) is the stack pointer, and pc (r15) is the program counter (instruction pointer). Referring to the Wikipedia's ARM
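For reference, the named registers in the dump are the conventional 32-bit ARM procedure-call aliases: sl is r10, fp is r11, ip is r12, sp is r13, lr is r14 and pc is r15. A small C++ lookup, with the hypothetical function name `arm_register_number` introduced for this example, makes the mapping explicit.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: maps a 32-bit ARM register name from a crash dump
// to its register number (0..15), or returns -1 for an unknown name.
int arm_register_number(const std::string& name) {
    if (name == "sl") return 10;  // stack limit
    if (name == "fp") return 11;  // frame pointer
    if (name == "ip") return 12;  // intra-procedure-call scratch register
    if (name == "sp") return 13;  // stack pointer
    if (name == "lr") return 14;  // link register
    if (name == "pc") return 15;  // program counter

    // Plain numbered form: "r0" .. "r15".
    if (name.size() < 2 || name.size() > 3 || name[0] != 'r') return -1;
    int n = 0;
    for (std::size_t i = 1; i < name.size(); ++i) {
        if (name[i] < '0' || name[i] > '9') return -1;
        n = n * 10 + (name[i] - '0');
    }
    return n <= 15 ? n : -1;
}
```

cpsr is deliberately not handled: it is the current program status register, not one of the sixteen core registers.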

Write a program to get CPU cache sizes and levels

Question: I want to write a program to get my cache sizes (L1, L2, L3). I know the general idea: allocate a big array, then access a part of it of a different size each time. So I wrote a little program. Here's my code:

    #include <cstdio>
    #include <time.h>
    #include <sys/mman.h>

    const int KB = 1024;
    const int MB = 1024 * KB;
    const int data_size = 32 * MB;
    const int repeats = 64 * MB;
    const int steps = 8 * MB;
    const int times = 8;

    long long clock_time() {
        struct timespec tp;
        clock_gettime(CLOCK_REALTIME, &tp);

Can we have a computer with just registers as memory?

Registers are the fastest memories in a computer. So if we want to build a computer with just registers and not even caches, is it possible? I am even thinking of replacing the magnetic disks with registers, although registers are naturally volatile memories. Do we have nonvolatile registers for that use? It would be so fast! I'm just wondering whether that could happen or not.

David Johnstone: The very short answer is yes, you could in theory, but it doesn't really work in real life. Let me explain... The reason the memory hierarchy exists is because those small and fast memory stores are very

Which standard C++ features can be used for querying machine/OS architecture?

Question: What are the standard C++ features and utilities for querying the properties of the hardware or operating system capabilities, on which the program is running? For instance, std::thread::hardware_concurrency() gives you the number of threads the machine supports. But how do you detect how much RAM the computer has, or how much RAM the process is using, or how much disk space is available to write to in a certain directory, or how much L2 cache is available? I would prefer answers by means of
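As a small illustration of the one facility the question names, here is a hedged C++ sketch around `std::thread::hardware_concurrency()`. The wrapper name `worker_count` is hypothetical. The standard allows the call to return 0 when the value is not computable, so the sketch falls back to 1 in that case.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: number of worker threads to start.
// std::thread::hardware_concurrency() is only a hint and may return 0,
// so fall back to a single worker when no value is available.
unsigned worker_count() {
    const unsigned hint = std::thread::hardware_concurrency();
    const unsigned workers = (hint == 0) ? 1u : hint;
    assert(workers >= 1);
    return workers;
}
```

The returned value describes hardware thread contexts as seen by the implementation; it says nothing about RAM or cache sizes, which is the gap the question is asking about.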

Is there a code that results in 50% branch prediction miss?

Question: The problem: I'm trying to figure out how to write code (C preferred, ASM only if there is no other solution) that would make the branch prediction miss in 50% of the cases. So it has to be a piece of code that is "immune" to compiler optimizations related to branching, and the hardware branch prediction should also do no better than 50% (tossing a coin). An even greater challenge is being able to run the code on multiple CPU architectures and get the same 50% miss ratio. I managed to write a
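One common starting point, shown here as a sketch rather than as the asker's solution, is to branch on a pseudo-random bit so that the outcome has no pattern for a predictor to learn. The function name `count_taken` and the choice of `std::mt19937` are assumptions made for this example. It is written in C++ rather than the C the question prefers, and an optimizing compiler is free to turn this particular branch into branchless code, so the generated assembly would need checking.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>

// Hypothetical helper: runs a loop whose branch depends on one
// pseudo-random bit per iteration and counts how often it is taken.
std::size_t count_taken(std::size_t iterations, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::size_t taken = 0;
    for (std::size_t i = 0; i < iterations; ++i) {
        if (rng() & 1u) {  // roughly half of the iterations take this branch
            ++taken;
        }
    }
    assert(taken <= iterations);
    return taken;
}
```

This only makes the branch direction close to a coin toss; confirming the actual misprediction rate requires hardware performance counters, which the sketch does not read.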