How Computers Work: Hardware and Software
When referring to computing performance, we have so far focused on classical computing based on the CPU. Classical computing is essentially digital computing; almost every computing device on the market today is a classical computer. Classical computers operate in serial - in other words, as mentioned in Computing Origins, they execute instructions extremely fast and in order, though to the average user they appear to be running in parallel. This is due to many hardware and software optimizations that allow for asynchronous operation. As a disclaimer: the concepts we will be discussing are an overgeneralization of computing architecture, but for the sake of getting an abstracted understanding of classical computing, they will serve well. Alright, so first let's bring in the central processing unit: the CPU is the brain of the computer. Let's also usher in memory, the RAM, which is where the CPU accesses stored information it needs.
Now the CPU also has built-in memory; this is often
called the cache. The cache is considerably smaller than the RAM, with sizes
ranging in the order of 32 kilobytes to 8 megabytes. The purpose of the cache
is to give the CPU the information it needs immediately. The CPU and RAM are
separate objects, so when the CPU needs information, it takes time, albeit a
very small amount of time to read the data from the memory. This time can add
considerable delay to machine operation. Having the cache right on the CPU
reduces this time to almost nothing. The reason you don't need much cache storage
is that it only needs to hold small pieces of important information that the
CPU will need to use soon or has been using a lot recently. There are
various methods implemented to determine what goes onto the cache, what should
be kept on the cache, and when it should be written back to the RAM. In a
typical CPU, there are various levels of cache, each with different read and
write times and sizes, for the sake of simplicity we'll assume a single cache
for our CPU. So now with the basic components out of the way, let's get into
how the computer operates.
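As an aside, one common way of deciding what stays in the cache - matching the "used a lot recently" idea above - is least-recently-used (LRU) eviction. A minimal sketch in Python (a software illustration with made-up names; real cache replacement is done in hardware):

```python
from collections import OrderedDict

class LRUCache:
    """Toy LRU cache: keeps recently used entries, evicting the
    least recently used one when capacity is exceeded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # address -> data, oldest first

    def read(self, address, ram):
        if address in self.entries:
            self.entries.move_to_end(address)  # mark as recently used
            return self.entries[address]       # cache hit: fast
        data = ram[address]                    # cache miss: slow RAM read
        self.entries[address] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return data

ram = {addr: addr * 2 for addr in range(100)}
cache = LRUCache(capacity=3)
for addr in [1, 2, 3, 1, 4]:  # address 2 is evicted when 4 arrives
    cache.read(addr, ram)
print(list(cache.entries))  # [3, 1, 4]
```

Real CPUs use variations on this idea (and others), but the principle is the same: keep what was used recently close to the CPU.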
When a CPU executes an
instruction there are five basic steps that need to be completed:
1) Fetch: Get the
instruction from memory (and, in some cases, store it in the cache).
2) Decode: Determine the operation and gather the
operands needed for the execution of the instruction.
3) Execute: Compute the
result of the instruction.
4) Memory: For
instructions that need a memory read/write operation to be done.
5) Write Back: Write
the result of the instruction back to a register. Nearly every instruction goes
through the first three and final steps; only certain instructions, such as
loads and stores, go through the memory step, but for the sake of simplicity, we'll
assume every instruction requires all five steps. Now each step takes one clock
cycle, which translates to a CPI, clock cycles per instruction, of five. As a
note, most modern processors can execute billions of clock cycles per second,
for example, a 3.4 gigahertz processor can execute 3.4 billion clock cycles per
second. Now a CPI of 5 is very inefficient, meaning the resources of the CPU
are wasted. This is why pipelining was introduced, bringing parallel operation into
computing. Pipelining essentially makes it so each step can be executed in a
different clock cycle, translating to 5 instructions per 5 clock cycles, or in
other words, one instruction per clock cycle: a CPI of 1. Essentially, what
pipelining does is take the segmented steps of an instruction and execute them
in each clock cycle; since the segmented steps are smaller in size and
less complex than a full instruction, you can execute the steps of other
instructions in the same clock cycle. For example, if a step for one
instruction is fetching the data, you could begin decoding another, executing
another, etc - since the hardware involved for those steps isn't being blocked.
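The CPI arithmetic above can be checked with a quick sketch (hypothetical cycle counts, assuming a perfect five-stage pipeline with no hazards):

```python
STAGES = 5  # fetch, decode, execute, memory, write back

def cycles_unpipelined(n_instructions):
    # Each instruction runs all five stages before the next one starts.
    return n_instructions * STAGES

def cycles_pipelined(n_instructions):
    # After the first instruction fills the pipeline (5 cycles),
    # one instruction completes every cycle.
    return STAGES + (n_instructions - 1)

n = 1_000_000
print(cycles_unpipelined(n) / n)  # CPI of 5.0
print(cycles_pipelined(n) / n)    # CPI approaches 1.0
```

The fill cost of the pipeline (the first 4 extra cycles) becomes negligible as the instruction count grows, which is why the CPI approaches 1.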
Superscalar pipelines add to this performance further.
Think of the pipeline as a
highway: a typical lane within the highway can execute one instruction per
clock cycle. With superscalar processors you add more lanes to the highway; for
example, a 2-wide superscalar, also referred to as a dual-issue machine, has a
theoretical CPI of 1/2, or two instructions per clock cycle. There are various
other methods implemented to make the processor's CPI more efficient, such as:
unrolling loops; very long instruction words [VLIWs] - which are essentially
multiple instructions wrapped into one larger instruction; compiler scheduling
and optimization - allowing for out-of-order execution; and more. There are also
many issues that come along with pipelining that increase CPI, such as data
hazards, memory hazards, structural hazards, and more. So at this point, we now
know about the basic design of a CPU, how it communicates with memory, the stages
that it executes instructions in, as well as pipelining and superscalar design.
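Extending the highway analogy, an idealized k-wide superscalar issues k instructions per cycle. A rough sketch (again ignoring hazards and assuming the pipeline stays full):

```python
import math

STAGES = 5  # five-stage pipeline as before

def cycles_superscalar(n_instructions, width):
    # 'width' instructions enter the pipeline each cycle, so the last
    # group issues at cycle ceil(n / width) and the pipeline drains
    # for STAGES - 1 more cycles.
    return math.ceil(n_instructions / width) + STAGES - 1

n = 1_000_000
for width in (1, 2, 4):
    print(width, cycles_superscalar(n, width) / n)  # CPI ~ 1, 0.5, 0.25
```

In practice, hazards and limited instruction-level parallelism keep real CPIs well above these ideal figures.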
Now instead of imagining all of this as a single CPU, let's take it further,
all this technology can be embedded on a single core of a processor. With
multiple cores, you ideally take the performance of a single core and multiply it by
the core count - for example, in a quad-core, by four. As a side note, multiple
cores also have a shared cache. The use of superscalar pipelines as well as
multiple cores is considered hardware-level parallelism. The computer industry,
after years of stagnation, is now beginning to divert more focus to hardware-level
parallelism by adding more cores to processors. This can be demonstrated
by consumer processors like AMD's Threadripper line and Intel's i9 processor
line, with core counts ranging from 8 to 16 and 10 to 18 respectively. While
these may be their higher-end consumer processors, even the low- and mid-range
i3, i5, and i7 processors are getting buffs, with quad-, hex-, and
octa-core counts becoming common. As a side note, supercomputers are the best examples of utilizing
hardware parallelism. For example, Intel's Xeon Phi and AMD's Epyc processors
have core counts ranging from 24 to 72, and supercomputers can contain tens of
thousands of such processors. Now there's one key component that is
required in tandem with hardware parallelism to truly use all the resources
efficiently: software parallelism. This leads us to the final topic in
classical computing we'll cover, hyperthreading, also referred to as
multithreading. Instead of being implemented as hardware parallelism, this is
used as higher-level software parallelism. Think of a thread as a sequence of
instructions, now with single-threading, that sequence of instructions just
flows through the pipeline as normal.
However, with
multi-threading, you can segment your application into many threads and
specifically choose how you want to execute them. Multi-threading can
significantly increase computing performance, by explicitly stating what CPU
resources you want to utilize and when. For example, for an application, the
user interface, GUI, can be executed on one thread while the logic is executed
on another. This is just one example of many instances where multi-threading
can be used. However, multi-threading can't be used for every application.
Since classical computing isn't intrinsically parallel, there can be a lot of
issues with concurrency - in other words, when multiple threads execute
at the same time but depend on each other's results. Thus, some
applications end up being only single-threaded. However, many individuals and
groups are working on ways to best utilize hardware parallelism through new
software practices and rewriting old software. For example, the latest Firefox
updates are now bringing in multi-threading. Also, some of the most
computationally intensive tasks excel at multi-threading by default, such as
video editing, rendering, and data processing - to list a few. And, as
exemplified by the gaming industry, a lot of games are now moving toward
multi-threaded performance. Instructions are still executed in serial, but
through the use of hardware- and software-level parallelism, the computer
maximizes the utilization of its resources, making instructions execute extremely
fast and giving the illusion of parallel operation.
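The GUI-versus-logic split described above can be sketched with Python's standard threading module (the worker functions here are hypothetical, purely for illustration):

```python
import threading
import queue

results = queue.Queue()  # thread-safe channel for the threads' output

def logic_worker(n):
    # "Application logic" thread: a computation running in the background.
    results.put(sum(i * i for i in range(n)))

def interface_worker(messages):
    # "UI" thread: keeps handling events while the logic thread computes.
    for msg in messages:
        results.put(msg)

logic = threading.Thread(target=logic_worker, args=(1000,))
ui = threading.Thread(target=interface_worker,
                      args=(["drew frame", "handled click"],))
logic.start()
ui.start()
logic.join()  # wait for both threads to finish before reading results
ui.join()

collected = [results.get() for _ in range(3)]
print(sorted(collected, key=str))
```

One caveat: in CPython, the global interpreter lock prevents threads from running Python bytecode truly in parallel, so this sketch shows the structure of a multi-threaded program rather than a raw speedup.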