To set a foundation for our examination of parallel processing, we need to understand just what kinds of processing alternatives have already been identified, and where they fit into the "parallel picture", if you will. One of the longest-lived and still very reasonable classification schemes was proposed by Flynn in 1966. It distinguishes computer architectures according to how they can be classified along two independent, binary-valued dimensions: independent simply means that neither dimension has any effect on the other, and binary-valued means that each dimension has only two possible states, as a coin has only two distinct flat sides. For computer architectures, Flynn proposed that the two dimensions be termed Instruction and Data, and that, for each of them, the two values they could take be Single or Multiple. The two dimensions can then be drawn as a matrix with two rows and two columns, and each of the four cells thus defined characterizes a unique type of computer architecture.
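The two-by-two matrix can be sketched as a trivial lookup; this is a minimal illustration whose function name is our own, but the dimension and category names are exactly those defined above:

```python
# A sketch of Flynn's 2x2 classification matrix: the two binary-valued
# dimensions (Instruction and Data) each take the value Single or
# Multiple, yielding the four architecture categories.
def flynn_category(multiple_instruction, multiple_data):
    """Map the two binary-valued dimensions to one of the four cells."""
    instr = "M" if multiple_instruction else "S"
    data = "M" if multiple_data else "S"
    return f"{instr}I{data}D"   # "SISD", "SIMD", "MISD", or "MIMD"
```

For example, `flynn_category(False, False)` gives "SISD", the single-instruction/single-data cell discussed next.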
This is the oldest style of computer architecture, and still one of the most important: all personal computers fit within this category, as did most computers ever designed and built until fairly recently. Single instruction refers to the fact that there is only one instruction stream being acted on by the CPU during any one clock tick; single data means, analogously, that one and only one data stream is being employed as input during any one clock tick. These factors lead to two very important characteristics of SISD style computers:
Instructions are executed one after the other, each completing before the next begins; this type of sequential execution is commonly called serial, as opposed to parallel, in which multiple instructions may be processed simultaneously.
Because each instruction has a unique place in the execution stream, and thus a unique time during which it and it alone is being processed, the entire execution is said to be deterministic, meaning that you can, at least in principle, know exactly what is happening at all times and, ideally, exactly recreate the process, step by step, at any later time.
Most computers commonly available today are of the SISD variety: all personal computers, all single-instruction-unit-CPU workstations, minicomputers, and mainframes.
Few actual examples of computers in this class exist; this category was included more for the sake of completeness than to identify a working group of actual computer systems. However, special-purpose machines are certainly conceivable that would fit into this niche: multiple frequency filters operating on a single signal stream, or multiple cryptography algorithms attempting to crack a single coded message. Both are examples of this type of processing, in which multiple, independent instruction streams are applied simultaneously to a single data stream.
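The multiple-filters-on-one-stream idea can be illustrated in a few lines of plain Python; the two filter functions here are hypothetical stand-ins for the independent instruction streams:

```python
# MISD sketch: several independent "instruction streams" (modeled as
# ordinary functions) are all applied to the same single data stream.
def lowpass(samples):    # hypothetical filter: clip values above 10
    return [min(s, 10) for s in samples]

def highpass(samples):   # hypothetical filter: clip values below 0
    return [max(s, 0) for s in samples]

signal = [-5, 3, 12, 7]                       # the single data stream
results = [f(signal) for f in (lowpass, highpass)]
```

Each "stream" produces its own independent result from the one shared input.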
A very important class of architectures in the history of computation, single-instruction/multiple-data machines are capable of applying the exact same instruction stream to multiple streams of data simultaneously. For certain classes of problems, e.g., those known as data-parallel problems, this type of architecture is perfectly suited to achieving very high processing rates, as the data can be split into many different independent pieces, and the multiple processing units can all operate on them at the same time.
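The data-parallel pattern can be sketched as follows; on real SIMD hardware each piece of the data would sit in its own processing unit, so the chunks below simply stand in for the units' slices:

```python
# SIMD-style data parallelism sketch: one instruction ("multiply by 2")
# is applied, conceptually in lock-step, to every element of several
# independent pieces of the data at once.
data = list(range(8))
chunks = [data[0:4], data[4:8]]            # split among two "units"
results = [[x * 2 for x in chunk] for chunk in chunks]
```

The key point is that the operation is identical everywhere; only the data differs from unit to unit.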
These types of systems are generally considered to be synchronous, meaning that they are built in such a way as to guarantee that all instruction units will receive the same instruction at the same time, and thus all will potentially be able to execute the same operation simultaneously.
SIMD architectures are deterministic because, at any one point in time, there is only one instruction being executed, even though multiple units may be executing it. So, every time the same program is run on the same data, using the same number of execution units, exactly the same result is guaranteed at every step in the process.
The "single" in single-instruction doesn't mean that there's only one instruction unit, as it does in SISD, but rather that there's only one instruction stream, and this instruction stream is executed by multiple processing units on different pieces of data, all at the same time, thus achieving parallelism.
SIMD machines come in two major flavors, very easily distinguished:
This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity processing units, each capable of operating on only a single bit or nybble (4-bit, 1/2 byte) data element at a time. Characteristically, many of these units are ganged together in order to handle commonly encountered data types; for example:
This single-bit system will associate 32 of its processing units together, and then do single-bit operations in parallel on all bits of a single 32-bit data element...
...while this one uses 4-bit units, but essentially handles larger data elements in the same, "gang-up on 'em" fashion.
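The "gang-up" idea can be sketched by modeling 32 one-bit units that each own one bit position of a 32-bit word; the loop index stands in for the units, which on the real hardware would all act at once:

```python
# Sketch of 32 single-bit units ganged together on one 32-bit element:
# "unit" k sees only bit k of each operand and performs the same
# one-bit operation (here, AND) in lock-step with all the others.
def ganged_and(x, y, width=32):
    result = 0
    for k in range(width):            # one iteration per one-bit unit
        bit = ((x >> k) & 1) & ((y >> k) & 1)
        result |= bit << k
    return result
```

The ganged result is, of course, just the ordinary 32-bit AND; the point is that each unit contributed exactly one bit of it.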
This is also a single-bit processor system, where the processors are grouped together two-dimensionally, and each processor is directly connected to its four nearest neighbors. As Hockney and Jesshope describe (Parallel Computers 2, Hockney, R. W. and C. R. Jesshope; Adam Hilger, Bristol and Philadelphia, 1988),
An important feature of the engineering design of the ICL DAP is that the PE [processing element] logic is mounted on the same printed circuit board as the memory to which it belongs. (pg. 28)

The DAP is also capable of storing data either vectorially, with the data word spread horizontally across a row of processing elements, or in matrix mode, where the data are stored vertically within a single PE.
The other major category of SIMD system is at the completely opposite end of both the "number of units" and "complexity of units" spectrums:
This type of machine has only a fairly small number, typically between 1 and 32, of very powerful execution units, called vector units because they are specially designed to handle long strings ("vectors") of floating-point numbers effectively. The main CPU handles dispatching jobs and associated data to the vector units, and takes care of coordinating whatever has to be done with the results from each, while the vector units themselves concentrate on applying the program they've been loaded with to their own unique slice of the overall data.
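The dispatch-and-combine pattern just described can be sketched as follows; the function and slice bookkeeping are illustrative only, with each call standing in for one vector unit's work:

```python
# Vector-machine sketch: the "main CPU" hands one slice of the data to
# each of a small number of vector units; every unit runs the same
# program (a partial dot product here) on its own unique slice, and the
# main CPU then combines the partial results.
def vector_unit(a_slice, b_slice):
    # one unit's work on its slice of the overall data
    return sum(x * y for x, y in zip(a_slice, b_slice))

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
n_units = 2
step = len(a) // n_units
partials = [vector_unit(a[i * step:(i + 1) * step],
                        b[i * step:(i + 1) * step])
            for i in range(n_units)]
dot = sum(partials)       # coordination done by the "main CPU"
```

On a real vector machine the inner multiply-and-sum would run at full pipeline speed inside each unit; only the final combination involves the main CPU.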
Many believe that the next major advances in computational capabilities will be enabled by this approach to parallelism which provides for multiple instruction streams simultaneously applied to multiple data streams. The most general of all of the major categories, a MIMD machine is capable of being programmed to operate as if it were in fact any of the four.
MIMD instruction streams can potentially be executed either synchronously or asynchronously, i.e., either in tightly controlled lock-step or in a more loosely bound "do your own thing" mode. Some kinds of algorithms require one or the other, and different kinds of MIMD systems are better suited to one or the other; optimum efficiency depends on making sure that the system you run your code on reflects the style of synchronicity required by your code.
MIMD systems are potentially capable of deterministic behavior, that is, of reproducing the exact same set of processing steps every time a program is run on the same data. A number of other factors, for example, how multiple messages at a receiver are handled, go into the actual determination of this characteristic, but, if it is important for the system to be deterministic, there is nothing in the nature of MIMD parallelism that fundamentally precludes it.
The more code each processor in a MIMD assembly is given domain over, the more efficiently the entire system will generally operate. This is largely due to communications requirements, particularly synchronization, which are characteristically less stringent at levels above instruction-oriented parallelism.
MIMD-style systems are capable of running in true "multiple-instruction" mode, with every processor doing something different, or every processor can be given the same code; this latter case is called SPMD, "Single Program Multiple Data", and is a generalization of SIMD-style parallelism, with much less strict synchronization requirements.
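The SPMD style can be sketched in a few lines; the rank parameter and helper below are hypothetical, standing in for whatever identifier a real system gives each processor:

```python
# SPMD sketch: every "processor" runs the identical program, and its
# rank selects which slice of the data it works on.  No instruction-
# level lock-step is required; ranks may proceed at their own pace.
def spmd_program(rank, n_ranks, data):
    chunk = len(data) // n_ranks
    my_slice = data[rank * chunk:(rank + 1) * chunk]
    return sum(my_slice)                 # each rank's local result

data = list(range(8))
partials = [spmd_program(r, 4, data) for r in range(4)]
total = sum(partials)                    # a final combining step
```

Every rank executes the same program text, which is exactly what distinguishes SPMD from fully general MIMD, where each processor could be running something entirely different.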
The following are representative of the many different ways that MIMD parallelism can be realized:
All of these systems implement MIMD parallelism in terms of more or less loosely coupled instruction streams:
The nature of message-passing libraries, especially general ones applicable to many different kinds of processors, makes the resulting parallel system much more suited to coarser-grained, loosely-coupled tasks.
Vector-parallel systems, such as the ES/9000-900 (the "900" indicates that there are 9 vector units associated with the ES/9000 mainframe) are very efficient at data parallel tasks, where each vector unit is given responsibility for computations involving one unique segment of the overall data.
The KSR coupled its processors via a very fast, high-bandwidth communications "ring," and implemented a novel form of shared memory across the main memory associated with each processor.
The hypercube is one of a whole family of network architectures that provide multiple connection points among the processors, typically allowing each processor to be directly connected to 8 or more processors. Meshes do much the same kind of thing, and come in 2- or 3-dimensional configurations, the former looking like a simple flat grid, the latter like a box.
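The hypercube's wiring rule is simple enough to state in one line of code: in a d-dimensional hypercube of 2**d processors, two nodes are directly connected exactly when their binary node numbers differ in a single bit. A minimal sketch:

```python
# Hypercube sketch: node i's direct neighbors are the d nodes whose
# binary addresses differ from i's in exactly one bit position.
def hypercube_neighbors(node, d):
    return sorted(node ^ (1 << k) for k in range(d))
```

So in a 3-dimensional (8-node) hypercube, node 0 connects directly to nodes 1, 2, and 4, and every node has exactly d = 3 neighbors.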
Processors are now capable of executing multiple instructions in every clock cycle, although not every cycle can actually be so filled. This capability allows these processors to execute a number of related instructions simultaneously: for example, the multiplication of two floating-point values already loaded into registers, the integer addition that computes the memory address of the next floating-point value to be added, and the fetch of that value. This type of parallel processing, however, requires a sophisticated compiler able to order instructions so as to both preserve necessary instruction precedence and recognize instruction independence.
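The independence test such a compiler applies can be sketched abstractly; the representation of an instruction as (reads, writes) register sets, and the register names themselves, are hypothetical simplifications:

```python
# Sketch of the independence check behind instruction scheduling: two
# instructions may issue in the same cycle only if neither one writes
# a location that the other reads or writes.
def independent(instr_a, instr_b):
    reads_a, writes_a = instr_a
    reads_b, writes_b = instr_b
    return not (writes_a & (reads_b | writes_b) or
                writes_b & (reads_a | writes_a))

fmul = ({"f1", "f2"}, {"f3"})   # floating-point multiply
iadd = ({"r1", "r2"}, {"r1"})   # integer address computation
fadd = ({"f3", "f4"}, {"f5"})   # consumes the multiply's result
```

Here `fmul` and `iadd` touch disjoint registers and could issue together, while `fadd` reads the multiply's result and so must wait; real scheduling also weighs memory dependences and functional-unit availability, which this sketch ignores.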
A few systems are exploring hybrid designs, mixing SIMD and MIMD capabilities in order to take advantage of whichever the algorithm at hand benefits from most. The CM-5 is an example of such a system:
In order to provide maximum efficiency for SIMD-type applications, a separate communications network is maintained for the synchronous distribution of execution instructions to all participating processors.
As the characteristics of the SIMD control network would greatly hinder MIMD-style communications, a totally separate asynchronous network is maintained for just this purpose.