
CDC 6000 series Hardware Architecture

The Central Processor (CPU)

The 6000 series CPU was a Reduced Instruction Set Computer (RISC) long before reduced instruction sets became popular. The CPU was usually said to have some 74 instructions (the exact number depends on how you count them), but even that overstates the size of the set: the rough figure of 74 counts each of eight addressing modes three times, whereas you could reasonably argue that an addressing mode shouldn’t be counted as a separate instruction at all. Despite the lean instruction set, there were few complaints about missing instructions.

The system was designed around packages of discrete components and transistors, which by then were regarded as having become reliable (see the transistor reliability graph from [Thornton70]). As the CDC 6600 required 400,000 transistors, it was estimated that the MTBF of the system (based upon transistor reliability) would be over 2000 hours. The logic technique used was Direct-Coupled Transistor Logic (DCTL).

Transistor reliability graph in the early 60’s: why the 6000-series designers chose transistors [Thornton70]

Central Memory

6000 series memory block (4096 words, 12 bits)

Central memory (CM) was organized as 60-bit words. In the early days (6000 series) the memory had no parity and was built up from core memory blocks (6.75 by 6.75 by 3.625 inches tall), each containing 4096 lines of 12 bits. Five blocks in a row comprised central memory. One block held the memory of a single Peripheral Processor (PP).

There was no byte addressability. If you wanted to store multiple characters in a 60-bit word, you had to shift and mask. Typically, a six-bit character set was used, which meant no lower case. These systems were meant to be (super)computing engines, not text processors! To signal the end of a text string, e.g. a sentence, two different coding techniques were invented. The so-called 64 character set was the CDC default: a line end consisted of two (or more) null “bytes” at the end of a word followed by a full zero word. The 63 character set, quite popular in the Netherlands and at the University of Texas at Austin, signalled line termination by two (or more) null “bytes” at the end of a 60-bit word.
Michigan State University (MSU) invented a 12-bit character set, which was basically 7-bit ASCII with five wasted bits per character. Other sites used special shift/unshift characters in a 6-bit character set to achieve upper/lower case.
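Shift-and-mask packing of this kind is easy to sketch. The Python model below packs ten 6-bit character codes into a 60-bit word and unpacks them again; the code values used are placeholders, not the real CDC display-code assignments.

```python
# Pack ten 6-bit character codes into one 60-bit word by shift and
# mask, as 6000 software had to do. The code values below are
# placeholders, not the real CDC display-code assignments.

def pack_word(codes):
    """Pack up to ten 6-bit codes into a 60-bit integer, char 0 on top."""
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0o77) << (54 - 6 * i)
    return word

def unpack_word(word):
    """Extract the ten 6-bit characters of a 60-bit word."""
    return [(word >> (54 - 6 * i)) & 0o77 for i in range(10)]

codes = [0o01, 0o05, 0o14, 0o14, 0o17, 0, 0, 0, 0, 0]  # placeholder codes
assert unpack_word(pack_word(codes)) == codes
```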

CYBER core memory: 64*64 bits. Twelve of these boards were inside a memory block.
close-up of the core memory

Some systems of the short-lived Cyber 70 Series, which followed the 6000 Series, had a Compare and Move Unit (CMU) which did complex character handling in hardware. The CMU was not used much, probably due to compatibility concerns. The CMU was such a departure from the 6000’s lean and mean instruction set that the CDC engineers must have been relieved to be able to omit it from the next line of computers, the Cyber 170 Series.

Central Memory (CM) addresses were 18 bits wide in the later series, but in the original 6000 line the sign bit had to be zero, limiting addresses to 17 bits. Even without the sign-bit problem, though, the amount of addressable central memory was extremely limited by modern standards. A maxed-out 170 Series system from around 1980 was limited to 256K words, which in total bits is slightly less than two megabytes (using 8-bit bytes purely as a means of comparison with modern machines). In the early days, 256K words was more than anyone could afford, but eventually this addressability limit became a real problem in the NOS and NOS/BE fixed-memory operating systems.

A workaround was the Extended Core Storage (ECS) unit. This was auxiliary memory made from the same magnetic cores of which CM was fabricated. Later versions of ECS were named ESM: Extended Semiconductor Memory. ECS was accessible only by block moves to or from CM. The initial ECS had a read and store cycle time of 3.2 microseconds, 480-bit (8-word block) storage “words”, a bank capacity of 125,000 CM words, up to 16 banks, and a CM to/from ECS transfer rate of 10 central memory (60-bit) words per microsecond.
ECS could be shared by four CYBER systems. Operating systems used this to share the job load, exchange information and so on.
The address width of ECS was 24 bits. But since programs could not be run from ECS, nor data directly accessed in it, ECS was used mostly to store operating system tables or to swap programs. In the 180 series of systems, one could emulate ECS in the upper part of memory (above 256K words).

Programs were “swapped” because there was no virtual memory (hardware) on the machine. Memory management was primitive but effective. Each user program had to be allocated a single area of contiguous memory. This region started at the address in the RA (Reference Address) register and ran for a certain number of words, as dictated by the contents of the FL (Field Length) register. The CPU hardware always added the contents of the RA register to every address reference before a memory access was made; as far as the program was concerned, its first address was always 0. Any attempt to access memory at or beyond FL resulted in a fatal job error.
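The RA/FL mechanism amounts to simple base-and-bounds address translation, which can be sketched as follows (function and exception names are invented for illustration):

```python
# Base-and-bounds translation in the style of the RA/FL registers:
# the program sees addresses starting at 0; the hardware adds RA and
# rejects any reference at or beyond FL.

class FieldLengthError(Exception):
    """Models the fatal job error on an out-of-bounds reference."""

def translate(ra, fl, program_address):
    """Map a program-relative address to an absolute CM address."""
    if not (0 <= program_address < fl):
        raise FieldLengthError("reference outside the field length")
    return ra + program_address

# Example: a job loaded at RA = 40000B with FL = 1000B words.
assert translate(0o40000, 0o1000, 0o777) == 0o40777
```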

As programs came and went from CM, holes opened up between regions of memory. To place programs optimally in memory, an operating system had to suspend the execution of a program, copy its field length to close up a gap, adjust the RA register to point to the program’s new location, and resume execution.
On some systems that had ECS, it was actually faster to do a block move to ECS and then a block move from ECS than it was to move memory in a tight loop coded with the obvious load and store instructions.
Incidentally, the CPU enforced access to ECS in much the same way as it did to CM. There were two registers specifying the start address and the number of words of the single region of ECS to which the CPU had access at any time. Depending on system parameters, user programs could be forced to have an ECS field length of zero.

The 6000 CPU had a load/store architecture: data in memory could be referenced only by load and store instructions. To increment a memory location, then, you had to execute at least three instructions: load from memory, do an add and store to memory.
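The three-step sequence can be modelled directly; this toy Python fragment stands in for the load, add, and store instructions:

```python
# Incrementing a memory cell on a load/store machine takes three steps:
# nothing but loads and stores may touch memory.
memory = {0o100: 41}

x6 = memory[0o100]      # 1. load the word into a register
x6 = x6 + 1             # 2. add in the register
memory[0o100] = x6      # 3. store the result back

assert memory[0o100] == 42
```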

Memory access on 6000 series systems was interleaved across eight independent banks of core memory, so the CPU usually did not have to wait for a memory cycle to complete before starting a new one.

CDC 6600 basics: a flip-flop circuit

The design of one of the basic CDC 6000 series circuits [Thornton70]

CPU registers

In addition to the obvious program counter (P register), the 6000 Series computers had 24 user-accessible CPU registers. There were three types of registers, eight of each type: A, B, and X. Registers of each type were numbered 0-7.

  • X-registers were 60 bits wide and were general-purpose data registers. Most instructions operated only on X registers.
  • A-registers were 18-bit address registers with a strange relationship to X registers: loading a value (let’s call it m) into any register A1-A5 would cause the CPU to load the correspondingly-numbered X register from memory location m. Loading A6 or A7 with m would cause the content of the correspondingly-numbered X register to be stored at that location. This was the only way that data could be moved between any register and memory.
    A0 was a pretty worthless register, not connected to any X-register. By convention, code generated by FORTRAN kept a pointer to the beginning of the current subroutine in A0, to aid in subroutine traceback in case an error occurred. Similarly, X0 was not too useful, as it could neither be loaded from nor stored to memory directly. However, it was moderately useful for holding intermediate results.
  • B-registers were index registers that could also be used for light-duty arithmetic. B registers tended not to get used much because
    • They were only 18 bits wide.
    • The arithmetic you could do on them was limited to addition and subtraction.
    • You couldn’t load or store B registers directly to or from memory. Instead, you had to go through an X register and move the contents to or from a B register.

    B0 was hardwired to 0. Any attempt to set B0 was ignored by the CPU. In fact, on some CPUs, it was faster to execute a 30-bit instruction to load B0 with a constant than it was to execute two consecutive no-ops (which were 15-bit instructions). Therefore, if you had to “force upper” by thirty or more bits, it made sense to use a 30-bit load into B0. Fortunately, the assembler did force uppers automatically when necessary, so programmers were generally isolated from those details.

    Many programmers felt that CDC should also have hardwired B1 to 1, since there was no CPU register increment or decrement instruction. Since there was no register hardwired to 1, most assembly language programs started with “SB1 1”, the instruction to load a 1 into B1.
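The A/X coupling described above can be modelled in a few lines; the class below is an illustrative sketch (names invented), showing the load on setting A1-A5 and the store on setting A6 or A7:

```python
# Sketch of the 6000's A/X register coupling: setting A1-A5 loads the
# matching X register from CM; setting A6 or A7 stores the matching X
# register to CM. A0 and X0 are not connected to memory.

class Cpu6000:
    def __init__(self, memory):
        self.memory = memory          # central memory, address -> 60-bit word
        self.a = [0] * 8              # 18-bit address registers
        self.x = [0] * 8              # 60-bit data registers

    def set_a(self, n, address):
        self.a[n] = address
        if 1 <= n <= 5:               # A1-A5: load Xn from CM[address]
            self.x[n] = self.memory[address]
        elif n in (6, 7):             # A6-A7: store Xn to CM[address]
            self.memory[address] = self.x[n]

# An increment sequence expressed through the A/X registers:
cpu = Cpu6000({0o100: 7})
cpu.set_a(1, 0o100)                   # like SA1: X1 = CM[100B]
cpu.x[6] = cpu.x[1] + 1
cpu.set_a(6, 0o100)                   # like SA6: CM[100B] = X6
assert cpu.memory[0o100] == 8
```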

Instruction Set

Instructions in the CPU were either 15 or 30 bits long. The 30-bit instructions contained an 18-bit constant; usually this was an address, but the value could also be used as an arbitrary 18-bit integer. From the point of view of the instruction decoder, each 60-bit word was divided into four 15-bit instruction parcels. While up to four instructions could be packed into a single 60-bit word, instructions could not be broken across word boundaries. If you needed to execute a 30-bit instruction and the current position was 45 bits into a word, you had to fill out the word with a no-op and start the 30-bit instruction at the beginning of the next word. This probably made the 6000 Series a heavier user of its no-op instruction (46000 octal) than nearly any other machine. The better programmers shuffled instructions around to squeeze out as many no-ops as possible. Another reason to do so was to optimise loops: a set of seven instructions (or four words) was kept in internal hardware registers of the CPU, so looping through such a set did not require a memory access to read the next word of instructions. Later on, this technique was extended and became widely known as the “instruction cache”.
No-ops were also necessary to pad out a word if the next instruction was to be the target of a branch. Jumping to an address for the next instructions could be done only to whole-word boundaries. The act of inserting no-ops to word-align the next instruction was called doing a “force-upper”.
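The parcel-packing rules, including force-upper padding, can be sketched as follows; this is a simplified model (the real COMPASS assembler handled all of this automatically):

```python
# Pack 15- and 30-bit instruction parcels into 60-bit words, inserting
# 15-bit no-ops (46000 octal) whenever a 30-bit instruction would cross
# a word boundary, and padding out the final word: a "force upper".

NOOP = 0o46000

def pack_parcels(instructions):
    """instructions: list of (bits, width) pairs, width 15 or 30.
    Returns the list of 60-bit instruction words."""
    words, current, used = [], 0, 0

    def pad_and_flush():
        nonlocal current, used
        while used < 60:              # fill the rest of the word with no-ops
            current = (current << 15) | NOOP
            used += 15
        words.append(current)
        current, used = 0, 0

    for bits, width in instructions:
        if used + width > 60:         # would cross the word boundary
            pad_and_flush()           # force upper
        current = (current << width) | bits
        used += width
    if used:
        pad_and_flush()
    return words
```

For example, three 15-bit parcels followed by a 30-bit one yield two words, the first ending in a single no-op.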

There was no condition code register in the 6000 Series. Instructions that did conditional branches actually did the test and then branched on the result. This, of course, is in contrast to many architectures, such as the Intel x86 and its successors, which use a condition code register that records the result of the last arithmetic operation.

Mark Riordan: When I learned about condition code registers years after first learning the 6000 architecture, I was shocked. Having a single condition code register seemed to me to be a significant potential bottleneck. It would make execution of multiple instructions simultaneously very difficult. I still think that having a single condition code register is stupid, but I must admit that the Intel Pentium Pro, for instance, is pretty darned fast anyway.

The instruction set included integer (I), logical (B), and floating-point (F) instructions. The assembler syntax differed from that of most assemblers: there were very few distinct mnemonics, and differentiation amongst instructions was done largely by operators. Arithmetic instructions were mostly three-address; that is, an operation was performed on two registers, with the result going to a third register. (Remember that the 6000’s load/store architecture precluded working with memory-based operands.) For instance, to add two integers in X1 and X5 and place the result in X6, you performed:

IX6  X1+X5

A floating-point multiplication of X3 and X7, with the result going to X0, would be:

FX0  X3*X7

An Exclusive Or of X6 and X1, with the result going to X6, would be:

BX6  X6-X1

Initially, there was no integer multiply instruction. Integer multiply was added to the instruction set pretty early in the game, though, when CDC engineers figured out a way of using the existing floating-point hardware to implement it. The downside of this clever move was that the integer multiply could handle only numbers that fit into the 48-bit mantissa field of a 60-bit register. If your integers were bigger than 48 bits, you’d get unexpected results.
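The 48-bit limitation can be expressed as a guard. The sketch below only checks the limits; it does not try to reproduce what the real hardware returned for oversized operands:

```python
# Integer multiply via the floating-point hardware: only reliable when
# the operands fit the 48-bit mantissa field. This guard is
# illustrative; the real hardware's behaviour past 48 bits is not
# modelled here.

MANTISSA_BITS = 48

def fits_mantissa(n):
    return abs(n) < (1 << MANTISSA_BITS)

def integer_multiply(a, b):
    if not (fits_mantissa(a) and fits_mantissa(b)):
        raise OverflowError("operand exceeds the 48-bit mantissa field")
    return a * b
```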

You’d think that 60-bit floating-point numbers (1 sign bit, 11-bit exponent including bias, 48-bit bit-normalized mantissa) would be large enough to satisfy anyone. Nope: the 6000 instruction set, lean as it was, did include double precision instructions for addition, subtraction, and multiplication. They operated on 60-bit quantities, just as the single precision instructions did; the only difference was that the double precision instructions returned a floating-point number containing the 48 least-significant bits of the result, rather than the 48 most-significant bits. So double precision operations, especially multiplication and division, required several instructions to produce the final 120-bit result. Double precision numbers were just two single precision numbers back-to-back, with the second exponent being essentially redundant. It was a waste of 12 bits, but you still got 96 bits of precision.
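Why both halves mattered is plain integer arithmetic: the exact product of two 48-bit mantissas is up to 96 bits wide, and the single and double precision multiplies returned the upper and lower halves respectively. The sketch below models only the mantissa arithmetic (no exponents or normalization):

```python
# The exact product of two 48-bit mantissas is up to 96 bits wide.
# Single precision (FX) kept the most-significant 48 bits; the DX form
# returned the least-significant 48 bits. Exponents are ignored here.

MASK48 = (1 << 48) - 1

def multiply_halves(a, b):
    assert 0 <= a <= MASK48 and 0 <= b <= MASK48
    product = a * b
    upper = product >> 48             # the FX (single precision) result
    lower = product & MASK48          # the DX (double precision) result
    return upper, lower

hi, lo = multiply_halves(0o7777777777, 0o1234567012)
assert (hi << 48) | lo == 0o7777777777 * 0o1234567012
```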

You can tell that floating point was important to Control Data when you consider that there were separate rounding versions of the single precision operations. These were rarely used, for some reason. The non-rounding versions needed to be in the instruction set because they were required for double-precision work. The mnemonic for double precision operations was D (as in DX7 X2*X3) and for rounded operations was R. By the way, some 170-series systems CPUs rounded in a funny way: 1/3 rather than 1/2 of the least significant bit.

Another instruction that is surprising to find in such a lean instruction set was Population Count, which counted the number of 1 bits in a word. CX6 X2, for instance, would count the number of 1 bits in X2 and place the result in X6. This was the slowest instruction on most 6000 machines. Rumour always had it that the instruction was implemented at the request of the National Security Agency (NSA) for use in cryptanalysis. Several people have pointed out uses for the extremely fast popcount instruction (11 cycles on the 6600); see the notes on the instruction set page. The instruction’s existence relates to the article “Set Comparison Using Hashing Techniques”, Malcolm C. Harrison, Courant Institute (NYU), June 1970.
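The instruction’s effect is simple to state in code:

```python
# Model of CX: count the 1 bits in a 60-bit word.
def popcount60(word):
    return bin(word & ((1 << 60) - 1)).count("1")

assert popcount60(0b101101) == 4    # CX6 X2 with X2 = 55B would set X6 = 4
```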

CDC 6000 bay: memory blocks on top, logic modules below

Functional Units

The 6000 series systems and their successors had multiple functional units. Some models (e.g. the 6600 and Cyber 74) even had two multiply units. This allowed multiple operations to take place in parallel when the A, B and X registers were scheduled optimally. Whether an operation (or even the next one) could be issued to a functional unit, or the CPU really had to hold until results being calculated became available, was determined by the hardware “scoreboard” (a separate mechanism, the so-called “stunt box”, managed central memory requests). Several analysis articles appeared that used Petri nets (some references at the bottom) to make optimal use of the parallelism available in the CPU. Conflicts of several “types” could occur.
In an era when CPUs regularly broke down, multiple floating point units were handy: one could trace the flow of the bits by alternately probing the oscilloscope test points of the first and second functional unit, and in that way find bits that were not being propagated.

The 6600/CYBER 74 functional units were: Branch unit (instruction groups 00-07), Boolean unit (10-17), Shift unit (20-27, 43), FP addition (30-35), Long addition (36-37), FP multiply (40-42), FP divide (44, 45, 47), Increment (50-77).
The model 76 types organised this a little differently: Boolean unit (10-17, 25, 27), Shift unit (20-23, 26, 43), Normalize (24, 25), FP addition (30-35), Long add (36-37), FP multiply (40-42), FP divide (44, 45), Population count (47), Increment (50-77).

References

[1] “A New Approach to Optimizing of Sequencing Decisions”, R.M. Shapiro and H. Saint, Meta Information Applications, New York.
[2] “A Petri Net Model of the CDC 6400”, Jerre D. Noe, University of Washington.
[3] “Modular, Asynchronous Control Procedures for a High Performance Processor”, Jack B. Dennis, Project MAC, MIT, Massachusetts; published by the ACM in “Concurrent Systems and Applications”, pp. 55-80, 1970.
[Thornton70] “Design of a Computer: The Control Data 6600” (title in 6000 console display lettering!), J.E. Thornton; Scott, Foresman and Company, 1970; Library of Congress Catalog No. 74-96462.