
How Computers Work: Hardware and Software


So far, when referring to computing performance, we have focused on classical computing based on the CPU. Classical computing is essentially digital computing; almost every computing device on the market today is a classical computer. Classical computers operate in serial: as mentioned in Computing Origins, they execute instructions one after another extremely fast, in order, but to the average user they appear to be running in parallel. This is due to many hardware and software optimizations that allow for asynchronous operation. As a disclaimer: the concepts we will discuss are an overgeneralization of computing architecture, but they will serve well for building an abstracted understanding of how classical computers function. First, let's bring in the central processing unit (CPU), the brains of the computer. Let's also usher in memory, the RAM, which is where the CPU accesses stored information it needs.


Now, the CPU also has built-in memory, often called the cache. The cache is considerably smaller than the RAM, with sizes on the order of 32 kilobytes to 8 megabytes. The purpose of the cache is to give the CPU the information it needs immediately. The CPU and RAM are separate components, so when the CPU needs information, it takes time, albeit a very small amount of time, to read the data from memory. This time can add considerable delay to machine operation. Because the cache sits right on the CPU, it reduces this time to almost nothing. The reason you don't need much cache storage is that it only needs to hold small pieces of important information that the CPU will need soon or has been using a lot recently. Various methods are implemented to determine what goes into the cache, what should be kept there, and when it should be written back to the RAM. A typical CPU has several levels of cache, each with different read/write times and sizes, but for the sake of simplicity we'll assume a single cache for our CPU. Now, with the basic components out of the way, let's get into how the computer operates.
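To make "what should be kept on the cache" concrete, here is a minimal sketch of one common replacement policy, least recently used (LRU). The class name, the `load_from_ram` callback, and the tiny capacity are illustrative assumptions, not how any real CPU cache is implemented; hardware caches do this in fixed-function logic, not Python.

```python
from collections import OrderedDict

class LRUCache:
    """Toy least-recently-used cache: on overflow, evict the
    entry that has gone longest without being accessed."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # address -> data, oldest first

    def read(self, address, load_from_ram):
        if address in self.entries:            # cache hit: fast path
            self.entries.move_to_end(address)  # mark as recently used
            return self.entries[address]
        data = load_from_ram(address)          # cache miss: slow RAM access
        self.entries[address] = data
        if len(self.entries) > self.capacity:  # over capacity: evict LRU entry
            self.entries.popitem(last=False)
        return data

# Example: a two-entry cache in front of a dict standing in for RAM.
ram = {0: 'a', 1: 'b', 2: 'c'}
cache = LRUCache(2)
cache.read(0, ram.get)  # miss, loads from "RAM"
cache.read(1, ram.get)  # miss
cache.read(0, ram.get)  # hit, refreshes address 0
cache.read(2, ram.get)  # miss, evicts address 1 (least recently used)
```

The payoff is exactly the one described above: repeated reads of a "hot" address are served from the small fast structure instead of the slow one.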

When a CPU executes an instruction there are five basic steps that need to be completed:

1) Fetch: Get the instruction from memory (and, in some cases, store it in the cache).

2) Decode: Determine the operation and the operands needed for the execution of the instruction.

3) Execute: Compute the result of the instruction.

4) Memory: For instructions that need one, perform a memory read/write operation.

5) Write Back: Write the results of the instruction back to memory.

Nearly every instruction goes through the first three and final steps; only certain instructions, such as loads and stores, go through the memory step, but for the sake of simplicity we'll assume every instruction requires all five steps. If each step takes one clock cycle, this translates to a CPI, or clock cycles per instruction, of five. As a note, most modern processors run at billions of clock cycles per second; for example, a 3.4 gigahertz processor executes 3.4 billion clock cycles per second. Now, a CPI of 5 is very inefficient, meaning the resources of the CPU are being wasted. This is why pipelining was introduced, bringing parallel operation into computing. Pipelining makes it so that each step can be executed in a different clock cycle, translating to 5 instructions per 5 clock cycles, or in other words, one instruction per clock cycle: a CPI of 1. Essentially, what pipelining does is take the segmented steps of an instruction and execute one step per clock cycle; since each step is smaller and less complex than a full instruction, the CPU can work on steps of other instructions in the same clock cycle. For example, while one instruction is being fetched, another can be decoded, another executed, and so on, since the hardware involved in those steps isn't being blocked. Superscalar pipelines add to this performance further.
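The cycle counts above can be checked with a little arithmetic. Without pipelining, n instructions at 5 cycles each take 5n cycles; with an ideal 5-stage pipeline, the first instruction takes 5 cycles to fill the pipeline, and one instruction completes every cycle after that. This sketch models only the idealized case, ignoring the hazards discussed later:

```python
STAGES = 5  # fetch, decode, execute, memory, write back

def cycles_unpipelined(n_instructions):
    # each instruction occupies the whole CPU for all five stages
    return STAGES * n_instructions

def cycles_pipelined(n_instructions):
    # fill the pipeline once (STAGES cycles), then one
    # instruction completes per clock cycle after that
    return STAGES + (n_instructions - 1)

n = 1_000_000
print(cycles_unpipelined(n) / n)  # CPI = 5 exactly
print(cycles_pipelined(n) / n)    # CPI approaches 1 for large n
```

The pipeline-fill cost is why the CPI only *approaches* 1: for a million instructions it works out to 1.000004, which is effectively one instruction per clock cycle.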

Think of a pipeline as a highway: a typical lane within the highway can complete one instruction per clock cycle. With superscalar processors, you add more lanes to the highway; for example, a 2-wide superscalar processor, also referred to as a dual-issue machine, has a theoretical CPI of 1/2, or two instructions per clock cycle. There are various other methods implemented to make the processor's CPI more efficient, such as: loop unrolling; very long instruction words (VLIWs), which are essentially multiple instructions wrapped into one larger instruction; compiler scheduling and optimization, allowing for out-of-order execution; and more. There are also many issues that come along with pipelining and worsen CPI, such as data hazards, control hazards, structural hazards, and more. So at this point, we know about the basic design of a CPU, how it communicates with memory, the stages in which it executes instructions, as well as pipelining and superscalar design. Now, instead of imagining all of this as a single CPU, let's take it further: all of this technology can be embedded in a single core of a processor. With multiple cores, you take the performance of a single core and multiply it by the core count; in a quad-core, for example, by four. As a side note, multiple cores also share a cache. The use of superscalar pipelines as well as multiple cores is considered hardware-level parallelism. After years of stagnation, the computer industry is now beginning to divert more focus to hardware-level parallelism by adding more cores to processors. This can be demonstrated by consumer processors like AMD's Threadripper line and Intel's i9 line, with core counts ranging from 8 to 16 and 10 to 18 respectively. While these may be their higher-end consumer processors, even the low- and mid-end i3, i5, and i7 processors are getting buffs, with core counts ranging from quad- to hexa- to octa-core.
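The highway analogy extends the earlier pipeline arithmetic naturally: an ideal w-wide superscalar pipeline still pays the fill latency once, then retires w instructions per cycle. This is again an idealized sketch that assumes the issue slots are always full, which hazards prevent in practice:

```python
import math

def cycles_superscalar(n_instructions, width, stages=5):
    # ideal w-wide pipeline: `stages` cycles to fill,
    # then `width` instructions complete per cycle
    return stages + math.ceil(n_instructions / width) - 1

n = 1_000_000
for width in (1, 2, 4):
    cpi = cycles_superscalar(n, width) / n
    print(f"{width}-wide: CPI ~ {cpi:.4f}")
```

With width 1 this reduces to the plain pipeline (CPI approaching 1); with width 2, the dual-issue machine from above, CPI approaches 1/2.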
As a side note, supercomputers are the best examples of utilizing hardware parallelism. For example, Intel's Xeon Phi and AMD's Epyc processors have core counts ranging from 24 to 72, and supercomputers pack tens of thousands of such processors. Now, there's one key component required in tandem with hardware parallelism to truly use all these resources efficiently: software parallelism. This leads us to the final topic in classical computing we'll cover, multithreading (Intel's hardware-assisted form of it is marketed as hyper-threading). Rather than being implemented as hardware parallelism, this is higher-level software parallelism. Think of a thread as a sequence of instructions; with single-threading, that one sequence of instructions simply flows through the pipeline as normal.

However, with multithreading, you can segment your application into many threads and specifically choose how you want to execute them. Multithreading can significantly increase computing performance by explicitly stating what CPU resources you want to utilize and when. For example, an application's user interface (GUI) can be executed on one thread while its logic is executed on another. This is just one of many instances where multithreading can be used. That said, multithreading can't be used for every application. Since classical computing isn't intrinsically parallel, there can be a lot of issues with concurrency, that is, when multiple threads execute at the same time while depending on each other's results. Thus, some applications end up being only single-threaded. However, many individuals and groups are working on ways to best utilize hardware parallelism through new software practices and by rewriting old software; for example, recent Firefox updates are bringing in multithreading. Also, some of the most computationally intensive tasks excel at multithreading by default, such as video editing, rendering, and data processing, to list a few. And as exemplified by the gaming industry, many games are now moving to multithreaded designs. Instructions are still executed in serial, but hardware- and software-level parallelism maximize the utilization of the computer's resources, making them execute extremely fast and giving the illusion of parallel operation.
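The GUI/logic split described above can be sketched with Python's standard `threading` module. This is a toy illustration of the structure only; the `worker` function and the squared-sum workload are invented for the example, and note that for truly CPU-bound Python code the interpreter's global interpreter lock means real parallel speedups usually require `multiprocessing` instead.

```python
import threading
import queue

results = queue.Queue()  # thread-safe channel back to the main thread

def worker(numbers):
    # the "logic" thread: do the heavy computation off the main thread
    results.put(sum(n * n for n in numbers))

# hand the logic to a second thread of execution
t = threading.Thread(target=worker, args=(range(1000),))
t.start()
# ...the main thread is free here, e.g. to keep a GUI responsive...
t.join()           # wait for the logic thread to finish
print(results.get())
```

The `queue.Queue` is the important design choice: threads that share results through a thread-safe structure avoid exactly the concurrency hazards (one thread depending on another's half-finished result) that the paragraph above warns about.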
