CDC 6000 series Hardware Architecture
The Central Processor (CPU)
The 6000 series CPU was a Reduced Instruction Set Computer (RISC) long before reduced instruction sets became popular. The CPU was usually said to have some 74 instructions (the exact number depends on how you count them), but by modern standards the effective number was even smaller than the counts quoted for machines marketed as “RISC”. The rough number 74 counts each of eight addressing modes three times (once each for the A-, B-, and X-register “set” instructions), whereas you could reasonably argue that an addressing mode shouldn’t be counted as a separate instruction at all. Despite the lean instruction set, there were only a few complaints about missing instructions.
The system was designed around packages of discrete components and transistors, which were by then regarded as having become “reliable” (see the transistor reliability graph in [Thornton70]). As the CDC 6600 required 400,000 transistors, it was estimated that the MTBF of the system (based upon transistor reliability) would be over 2,000 hours. The logic technique used was Direct-Coupled Transistor Logic (DCTL).
Central Memory
Central memory (CM) was organised as 60-bit words. In the early days (the 6000 series proper) the memory had no parity and was built up from core memory blocks (6.75 by 6.75 by 3.625 inches), each containing 4,096 addressable lines of 12 bits. Five blocks in a row formed a “bank” of 4K words of 60 bits. The access time was 1.0 microsecond. Interleaving of the memory banks (a factor of 8 per 32K words) allowed a high effective memory access rate, matching the 100-nanosecond CPU clock cycle.
A single memory block provided the 4K × 12-bit memory of a single Peripheral Processor (PP).
There was no byte addressability. If you wanted to store multiple characters in a 60-bit word, you had to shift and mask. Typically a six-bit character set was used, which meant no lower-case characters. These systems were meant to be (super)computing engines, not text processors! To signal the end of a text string, e.g. a line, two different coding conventions were invented. The so-called 64-character set was the CDC default: a line end consisted of two (or more) null “bytes” at the end of a word, followed by a full zero word. The 63-character set, quite popular in the Netherlands and at the University of Texas at Austin, signalled line termination with just two (or more) null “bytes” at the end of a 60-bit word.
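As a minimal sketch of the shift-and-mask idiom (COMPASS-style assembly; the register choices are arbitrary), extracting the leftmost 6-bit character of X5 might look like this:

MX0 6           FORM A MASK OF 6 ONE BITS AT THE TOP OF X0
BX1 X0*X5       LOGICAL AND: ISOLATE THE TOP CHARACTER OF X5 IN X1
LX1 6           LEFT CIRCULAR SHIFT BY 6: CHARACTER NOW RIGHT-JUSTIFIED IN X1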
Michigan State University (MSU) invented a 12-bit character set, which was essentially 7-bit ASCII with five wasted bits per character. Other sites used special shift/unshift characters within a 6-bit character set to achieve upper/lower case.
Some systems of the short-lived Cyber 70 Series, which followed the 6000 Series, had a Compare and Move Unit (CMU) which did complex character handling in hardware. The CMU was not used much, probably due to compatibility concerns. The CMU was such a departure from the 6000’s lean and mean instruction set that the CDC engineers must have been relieved to be able to omit it from the next line of computers, the Cyber 170 Series.
Central Memory (CM) addresses were 18 bits wide in the later series, but in the original 6000 line the sign bit had to be zero, limiting addresses to a 17-bit range. Even without the sign-bit restriction, though, the amount of addressable central memory was extremely limited by modern standards. A maxed-out 170-series system from around 1980 was limited to 256K words, which in total bits is slightly less than two megabytes (using 8-bit bytes purely as a means of comparison with modern machines). In the early days 256K words was more than anyone could afford, but eventually this limited addressability became a real problem for the NOS and NOS/BE fixed-memory operating systems.
A workaround was the Extended Core Storage (ECS) unit. This was auxiliary memory made from magnetic cores different from those used for CM. Later versions of ECS were named ESM: Extended Semiconductor Memory. ECS was accessible only by block moves to or from CM. The initial ECS had:
- a read and store cycle time of 3.2 microseconds;
- 480-bit (8-word block) storage “words”;
- a bank capacity of 125,000 CM words, with up to 16 banks;
- a CM to/from ECS transfer rate of 10 central memory (60-bit) words per microsecond.
ECS could be shared by four CYBER systems. Operating systems used this to share the job load, exchange information and so on.
The address width of ECS was 24 bits. However, since programs could not run from ECS nor directly access data in it, ECS was used mostly to store operating system tables or to swap programs. In the 180 series of systems, one could emulate ECS in the upper part of memory (above 256K words).
The “swap” of programs was used because the machine had no virtual memory hardware. Memory management was primitive but effective. Each user program had to be allocated a single area of contiguous memory. This region started at the address in the RA (Reference Address) register and extended for a certain number of words, as dictated by the contents of the FL (Field Length) register. The CPU hardware always added the contents of the RA register to every address reference before the memory access was made; as far as the program was concerned, its first address was always 0. Any attempt to access memory at or beyond FL resulted in a fatal job error. For example, with RA = 20000 octal and FL = 1000 octal, a program reference to address 777 octal went to physical address 20777 octal, while a reference to address 1000 octal aborted the job.
As programs came and went from CM, holes opened up between regions of memory. To place programs optimally in memory, an operating system had to suspend the execution of a program, copy its field length to close up a gap, adjust the RA register to point to the program’s new location, and resume execution.
On some systems that had ECS, it was faster to do a block move to ECS and then a block move from ECS than it was to move memory in a tight loop coded with the obvious load and store instructions.
Incidentally, the CPU enforced access to ECS in much the same way as it did to CM. There were two registers specifying the start address and the number of words of the single region of ECS to which the CPU had access at any time. Depending on system parameters, user programs could be forced to have an ECS field length of zero.
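As a hedged sketch of such a block move (using the RE/WE block-copy instructions and the A0/X0 convention described in the registers section below; BUF and ECSADR are hypothetical symbols):

SA0 BUF         A0 <- CM ADDRESS OF THE BUFFER (NO MEMORY ACCESS ON A0)
SX0 ECSADR      X0 <- ECS ADDRESS (HYPOTHETICAL SYMBOL)
RE B0+100B      COPY 100 OCTAL WORDS FROM ECS (AT X0) TO CM (AT A0)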
The 6000 CPU had a load/store architecture: data in memory could be referenced only by load and store operations. To increment a memory location, then, you had to execute at least three instructions: load from memory, do an add, and store back to memory.
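A minimal sketch of that sequence (register choices arbitrary; COUNT is a hypothetical symbol):

SX1 1           X1 <- THE CONSTANT 1
SA2 COUNT       A2 <- ADDRESS OF COUNT; THE HARDWARE LOADS X2 FROM (COUNT)
IX6 X1+X2       X6 <- X2 + 1 (60-BIT INTEGER ADD)
SA6 A2          A6 <- THE SAME ADDRESS; THE HARDWARE STORES X6 AT (COUNT)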
Memory access on 6000 series systems was interleaved across eight independent banks of core memory, so the CPU usually did not have to wait for a memory cycle to complete before starting a new one.
The design of one of the basic CDC 6000 series circuits [Thornton70]
CPU registers
In addition to the obvious program counter (P register), the 6000 Series computers had 24 user-accessible CPU registers. There were three types of registers, eight of each type: A, B, and X. Registers of each type were numbered 0-7.
- X-registers were 60 bits wide and were general-purpose data registers. Most instructions operated only on X registers.
- A-registers were 18-bit address registers with a peculiar relationship to the X registers: loading a value (let’s call it m) into any of registers A1 – A5 would cause the CPU to load the correspondingly numbered X register from memory location m. Loading A6 or A7 with m would cause the contents of the correspondingly numbered X register to be stored at that location. This was the only way that data could be moved between any register and memory.
A0 was a pretty worthless register, not connected to any X register on the base system. By convention, code generated by FORTRAN kept a pointer to the beginning of the current subroutine in A0, to aid in subroutine traceback in case an error occurred. Similarly, X0 was not too useful, as it could neither be loaded from memory nor stored to memory directly. However, it was moderately useful for holding intermediate results.
For systems with ECS (later ESM), however, A0 contained the CM address to read or write, and X0 held the corresponding ECS source or destination address.
- B-registers were used as an indexing offset in combination with an A-register, as a shift count for shifting an X-register’s contents left or right, and as the result register receiving the number of shifts performed during a normalise instruction. The instructions allowed for some light-duty arithmetic, but B registers tended not to get used much for arithmetic because:
- They were only 18 bits wide.
- The arithmetic you could do on them was limited to addition and subtraction.
- You couldn’t load or store B registers directly to or from memory. Instead, you had to go through an X register and move the contents to or from a B register.
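A hedged sketch of that detour (register choices arbitrary): moving the low bits of X1 into B2, doing arithmetic there, and moving the result back into an X register:

SB2 X1          B2 <- LOW 18 BITS OF X1 (B REGISTERS CANNOT TOUCH MEMORY)
SB2 B2+1        B2 <- B2 + 1 (LIGHT-DUTY 18-BIT ARITHMETIC)
SX6 B2          X6 <- B2, SIGN-EXTENDED; X6 CAN THEN BE STORED VIA A6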
B0 was hardwired to 0. Any attempt to set B0 was ignored by the CPU. In fact, on some CPUs, it was faster to execute a 30-bit instruction to load B0 with a constant than it was to execute two consecutive no-ops (which were 15-bit instructions). Therefore, if you had to “force upper” by thirty or more bits, it made sense to use a 30-bit load into B0. Fortunately, the assembler did force uppers automatically when necessary, so programmers were generally isolated from those details.
Many programmers felt that Control Data should also have hardwired B1 to 1 since there was no CPU register increment or decrement instruction. Since there was no register hardwired to 1, most assembly language programs started with “SB1 1”, the instruction to load a 1 into B1.
Instruction Set
Instructions in the CPU were either 15 or 30 bits long. The 30-bit instructions contained an 18-bit constant; usually this was an address, but the value could also be used as an arbitrary 18-bit integer. From the point of view of the instruction decoder, each 60-bit word was divided into four 15-bit instruction parcels. Up to four instructions could be packed into a single 60-bit word, but instructions could not be broken across word boundaries. If you needed to execute a 30-bit instruction and the current position was 45 bits into a word, you had to fill out the word with a no-op and start the 30-bit instruction at the beginning of the next word. This probably caused the 6000 Series to make heavier use of its no-op instruction (46000 octal) than nearly any other machine. The better programmers shuffled instructions around to squeeze out as many no-ops as possible. Another reason to do so was loop optimisation: a set of instructions spanning up to seven words could be kept in internal hardware registers of the CPU, and looping within such a set required no memory accesses to read the next instruction word. Later on, this technique was extended and became widely known as the “instruction cache”.
No-ops were also necessary to pad out a word if the next instruction was to be the target of a branch, since jumps could only target whole-word boundaries. The act of inserting no-ops to word-align the next instruction was called doing a “force upper”.
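A minimal sketch of such padding (arbitrary instructions, COMPASS-style; ANSWER is a hypothetical symbol):

BX6 X1          PARCEL 0 (15 BITS)
BX7 X2          PARCEL 1 (15 BITS)
IX5 X3+X4       PARCEL 2 (15 BITS): 45 BITS OF THE WORD NOW USED
NO              PARCEL 3: PAD; A 30-BIT INSTRUCTION CANNOT START HERE
SA6 ANSWER      NEXT WORD: A 30-BIT INSTRUCTION (18-BIT ADDRESS FIELD)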
The 6000 series did not have a condition code register. Conditional branch instructions performed the test themselves and branched on the result. This, of course, is in contrast to many architectures such as the Intel x86 and its successors, which use a condition code register that records the sign (and other properties) of the last arithmetic operation.
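For instance (a hedged sketch; DONE and NEGCASE are hypothetical labels), a test and branch took a single instruction:

ZR X1,DONE      BRANCH TO DONE IF X1 IS ZERO
NG X2,NEGCASE   BRANCH TO NEGCASE IF X2 IS NEGATIVE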
Mark Riordan: When I learned about condition code registers years after first learning the 6000 architecture, I was shocked. Having a single condition code register seemed to me to be a significant potential bottleneck. It would make execution of multiple instructions simultaneously very difficult. I still think that having a single condition code register is stupid, but I must admit that the Intel Pentium Pro, for instance, is pretty darned fast anyway.
The instruction set included integer (I), logical (B), and floating-point (F) instructions. The assembler syntax was different from that of most assemblers: there were very few distinct mnemonics, and differentiation amongst instructions was done largely by operators. Arithmetic instructions were mostly three-address; that is, an operation was performed on two registers, with the result going to a third register. (Remember that the 6000’s load/store architecture precluded working with memory-based operands.) For instance, to add two integers in X1 and X5 and place the result in X6, you performed:
IX6 X1+X5
A floating-point multiplication of X3 and X7, with the result going to X0, would be:
FX0 X3*X7
An Exclusive Or of X6 and X1, with the result going to X6, would be:
BX6 X6-X1
Initially, there was no integer multiply instruction. Integer multiply was added to the instruction set pretty early in the game, though, when CDC engineers figured out a way of using the existing floating-point hardware to implement the integer multiply. The downside of this clever move was that the integer multiply could multiply only numbers that could fit into the 48-bit mantissa field of a 60-bit register. If your integers were bigger than 48 bits, you’d get unexpected results.
You’d think that 60-bit floating-point numbers (1 sign bit, 11-bit biased exponent, 48-bit bit-normalized mantissa) would be large enough to satisfy anyone. Nope: the 6000 instruction set, lean as it was, did include double precision instructions for addition, subtraction, and multiplication. They operated on 60-bit quantities, just like single precision numbers; the only difference was that the double precision instructions returned a floating-point number containing the 48 least significant bits of the result, rather than the 48 most significant bits. So double precision operations, especially multiplication and division, required several instructions to produce the final 120-bit result. Double precision numbers were just two single precision numbers back-to-back, with the second exponent being essentially redundant. It was a waste of 12 bits, but you still got 96 bits of precision.
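As a minimal sketch (hedged; register choices arbitrary), a full double precision sum of two single precision numbers in X1 and X2 could be assembled from the two halves:

FX6 X1+X2       UPPER SUM: THE 48 MOST SIGNIFICANT BITS OF THE RESULT
DX7 X1+X2       LOWER SUM: THE 48 LEAST SIGNIFICANT BITS OF THE RESULT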
You can tell that floating point was important to Control Data when you consider that there were separate rounding versions of the single precision operations. These were rarely used, for some reason; the non-rounding versions needed to be in the instruction set anyway because they were required for double precision work. The mnemonic letter for double precision operations was D (as in DX7 X2*X3) and for rounded operations R. By the way, the CPUs of some 170-series systems rounded oddly: by 1/3 rather than 1/2 of the least significant bit.
Another instruction that is surprising to find in such a lean instruction set is the Population Count instruction, which counted the number of 1 bits in a word. CX6 X2, for instance, would count the number of one bits in X2 and place the result in X6. This was the slowest instruction on most 6000 machines. Rumour always had it that the instruction was implemented at the request of the National Security Agency (NSA) for use in cryptanalysis. Several people have pointed out (see the notes on the instruction set page) uses of the extremely fast popcount instruction (11 cycles on the 6600). Its existence may also relate to the article “Set Comparison Using Hashing Techniques”, Malcolm C. Harrison, Courant Institute (NYU), June 1970.
Moreover, the PLATO educational system used the popcount instruction to create a hash of a student’s answer, using an XOR to compare the result with a hash of the set of right answers.
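A hedged sketch of that compare-by-hash idea (hypothetical registers and label; the stored hash of the correct answer is assumed to be in X2):

CX6 X1          X6 <- NUMBER OF ONE BITS IN X1 (PART OF THE ANSWER HASH)
BX7 X6-X2       XOR THE COMPUTED HASH WITH THE STORED HASH
ZR X7,RIGHT     BRANCH TO RIGHT IF THE HASHES MATCH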
Functional Units
The design of the 6000 series systems and their successors was based on parallel operating functional units. Some models (e.g. the 6600 and Cyber 74) even had two multiply units. This allowed multiple operations to take place in parallel when the use of the A, B and X registers was scheduled optimally. Whether an operation (or even the next one) could be issued to a functional unit, or the CPU had to hold until the results being calculated became available, was determined by the hardware “scoreboard” (the so-called “stunt box” managed central memory references). Several analysis articles appeared that used Petri nets (some references at the bottom) to model how to make optimal use of the parallelism available in the CPU. Conflicts of several “types” could occur.
In the era when CPUs regularly broke down, duplicated functional units were handy: one could trace the flow of the bits by alternately probing the oscilloscope points of the first and the second unit. In that way, bits that failed to propagate could be found.
The 6600/CYBER 74 functional units were: Branch unit (instruction groups 00-07), Boolean unit (10-17), Shift unit (20-27, 43), FP addition (30-35), Long addition (36-37), FP multiply (40-42), FP divide (44, 45,47), Increment (50-77).
The model 76 types organised this a little differently: Boolean unit (10-17, 25, 27), Shift unit (20-23, 26, 43), Normalize (24, 25), FP addition (30-35), Long add (36-37), FP multiply (40-42), FP divide (44, 45), Population count (47), Increment (50-77).
The 6400, however, had a unified CPU: a sequential instruction processor without parallel functional units.
System variants
System | # of processors | Type of CPU |
---|---|---|
6400 | 1 | Unified CPU |
6500 | 2 | Unified CPU |
6600 | 1 | CPU with functional units |
6700 | 2 | A 6600 – 6400 combination |
References
[1] “A New Approach to Optimizing of Sequencing Decisions”, R. M. Shapiro and H. Saint, Meta Information Applications, New York
[2] “A Petri net model of the CDC 6400”, Jerre D. Noe, University of Washington
[3] “Modular, Asynchronous Control Procedures for a High Performance Processor”, Jack B. Dennis, Project MAC, MIT, Massachusetts; published by the ACM in “Concurrent Systems and Applications”, pp. 55-80, 1970
[Thornton70] “Design of a Computer: The Control Data 6600” (title in 6000 console display lettering!), J. E. Thornton; Scott, Foresman and Company, 1970; Library of Congress Catalog No. 74-96462