The Control Data CYBER 74 system
At the beginning of 1978, it became clear that the CDC 6400 was so overloaded that expansion of the CPU capacity, disk storage capacity and memory had to take place. The replacement of the CDC 6400 by a CDC CYBER 74 was scheduled. The largest applications at that time were simulations for tanks and the “Wargame”-application which continuously expanded as additional models of military equipment were developed.
The CDC CYBER 74 was a modern version of the Control Data 6600 system. The system had a clock cycle of 100 nanoseconds (10 MHz), a magnetic core memory of 131 K words of 60 bits (no parity) and ten (later 14) peripheral processors (1 µs major cycle). The water and air cooling system had to remove 63 kW of heat, and the system required three hours of preventive maintenance every week. The computer was built with discrete components: an estimated one million transistors and three million resistors and capacitors, plus 9 million ferrite memory cores and kilometres of twisted cable.
A number of limitations were built into the system to ensure fair use by all users. A maximum of ten seconds of calculation time could be used per interactive process step. Batch jobs taking longer than 7.5 minutes of CPU time were only processed during the night and at weekends. Batch jobs were not allowed to use more than 140000B words of memory during the day and 200000B words at night (octal word counts; about 369 KB and 492 KB respectively). For interactive sessions, the memory limit was 70000B words (215 KB).
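The conversion from these octal word counts to kilobytes can be checked with a few lines of Python. This is only a sketch: it assumes 60-bit words, 8-bit bytes and decimal kilobytes (1 KB = 1000 bytes), and the function name is made up for this illustration.

```python
WORD_BITS = 60  # central memory word size of the CYBER 74


def octal_words_to_kb(octal_words: str) -> float:
    """Convert a word count written in octal (trailing 'B') to decimal kilobytes."""
    words = int(octal_words.rstrip("Bb"), 8)  # the 'B' suffix marks an octal number
    total_bytes = words * WORD_BITS / 8       # 7.5 eight-bit bytes per 60-bit word
    return total_bytes / 1000                 # assumption: 1 KB = 1000 bytes


for limit in ("140000B", "200000B", "70000B"):
    print(limit, "=", round(octal_words_to_kb(limit)), "KB")
```

With these assumptions the three limits come out at 369, 492 and 215 KB.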
The installation of the CYBER 74 in August 1978 took three weeks. First, the Control Data 6400 was removed. The computer room was expanded with an extra room (roughly 5 × 4 metres). During this replacement period, the Laboratory had no computing facilities at all. Yes, we could live with that!
Each of the four ‘legs’ (bays) of the system weighed about 2000 kilos. After installation, the weight would increase further because of the cooling water running through the bays. This was reason to ask the architect of the building to calculate whether the floors would have sufficient bearing strength. No problems were foreseen. However, special precautions had to be taken to pull the system parts up the ramp of the raised floor: thick wooden beams were placed against the outside of the laboratory wall to anchor the hoists. In order not to disrupt the daily traffic of coffee carts and the like within the Laboratory, it was decided to transport the computer parts via the elevator on a Sunday. Beforehand, it was verified that the elevator could carry the weight. One thing was forgotten during all those preparations, namely ordering an elevator mechanic to remove a motor protection for heavy transport. In order to get the mainframe parts to the 3rd floor, the cable drum of the elevator had to be rotated many times by hand.
The new tape controller needed to be loaded with ‘controlware’. Therefore, the so-called cold-start control program had to be read by the card reader and sent to the tape controller via a special deadstart program (set with toggle switches). The problem was that TNO did not have a card punch and had not prepared a card deck. Unfortunately, the cold start control program was only available on magnetic tape. A chicken-egg problem. With a lot of creativity, a cold start deck was created (“ABC”-program) with which the CYBER 74 could be brought to life.
Most CYBER 74 equipment was installed according to a changed layout of the computer room. Among other things, an air circulation partition was installed between the ‘dusty’ printing and card reading area of the computer room and the (dust-sensitive) magnetic tape and hard disk area with the mainframe. As a result, the number of disk problems fell spectacularly. Hardware problems were, incidentally, monitored daily through a system report (Cerfile, HPA).
In the first month after installation, we observed a huge number of read errors on the magnetic tape units every few days. Analysis showed that the problems only occurred with users’ magnetic tapes; the backup tapes that were written in the evening were trouble-free. The strange thing was that the problems appeared on all tape units, and that frequently used magnetic tapes showed only a few normal errors during one job and several thousand the next time. Even deeper digging showed that the errors, when they occurred, did so with a frequency of 0.1 Hz. The technical staff checked the electricity grid but could not find anything. Meanwhile, the operator console display program was modified to show the logging on the screen during magnetic tape processing.
For several weeks, the 10-second problem did not occur, until on a Friday afternoon a read error occurred every ten seconds. In no time it was ‘all hands on deck’. At least ten people were watching the phenomenon on the console. When a second magnetic tape unit was started, two messages were logged synchronously every ten seconds. One of those present looked outside and discovered that at NATO’s SHAPE Technical Centre (STC) a radar antenna was rotating with a period of exactly ten seconds. A phone call to STC asking to turn off the radar resulted in an angry researcher: his radar did not bother computers! We asked him to come and view the phenomenon himself in our computer room. He determined for himself that the side lobe of his radar beamed exactly onto the heads of the recently installed magnetic tape units. Fitting a piece of earthed chicken wire to the windows solved the problem for good.
Around that time, the same type of problem occurred at the LEOK location after some radar equipment had been moved on the laboratory’s roof. According to the radar experts, a line of trees reflected the radiated radar energy into the computer room. After permission for felling had been received from the municipality, a complete line of tall trees was cut down. It did not solve the problem! Later, it turned out that the radar radiation was transmitted directly through the roof into the computer room! A simple Faraday cage solved the computer problems.
Programming languages and disks
In early 1980, Fortran’77 (FTN5) was introduced at the Physics Laboratory. The new compiler operated alongside the old Fortran’66 (FTN4) compiler. Thanks to better optimisation, the FTN5 compiler sometimes generated much faster code than the FTN4 compiler. However, the compiler was sometimes too smart. In one of the new releases (or “levels”), a single call to the system routine SINCOS delivered both the sine and the cosine value, in registers X6 and X7 respectively. Great, as most calculations require both values in the same piece of code! Unfortunately, the optimiser overlooked the fact that the argument a(i) in sin(a(i)) might differ from the argument a(j) in cos(a(j)). The result was that the simulated anti-aircraft equipment of the LEOK showed torpedo-like behaviour: the missiles went straight into the ground!
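The optimiser error can be illustrated with a small Python sketch. It is not the code FTN5 generated; the function names are invented for this illustration, and `sincos` merely stands in for the combined system routine.

```python
import math


def sincos(x: float) -> tuple[float, float]:
    """Stand-in for a combined routine that returns (sin x, cos x) in one call."""
    return math.sin(x), math.cos(x)


def components_correct(a: list[float], i: int, j: int) -> tuple[float, float]:
    # What the source program asked for: sin(a(i)) and cos(a(j)).
    return math.sin(a[i]), math.cos(a[j])


def components_fused_wrongly(a: list[float], i: int, j: int) -> tuple[float, float]:
    # Faulty fusion: one combined call on a[i]; the cosine is reused
    # even though the cosine argument was a[j].
    sine, cosine = sincos(a[i])
    return sine, cosine


angles = [0.0, math.pi]
print(components_correct(angles, 0, 1))        # cosine taken from angles[1]
print(components_fused_wrongly(angles, 0, 1))  # cosine wrongly taken from angles[0]
```

The fusion is only valid when both calls have the same argument; with different arguments the cosine here even changes sign, which is the kind of error that sends a simulated missile the wrong way.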
In the same period, we replaced the Pascal 2 compiler with the new Pascal 3 compiler. Pascal was developed by Niklaus Wirth at the Eidgenössische Technische Hochschule Zürich (ETH). The compiler delivered highly optimised code for a CDC 6400. Additional system routines and functions were added both by the computer centre of the universities in Amsterdam (SARA) and by us. Around 1983, we tried to get the Pascal compiler to compile an Ada compiler that was written in Pascal and used an intermediate language (DIANA). This required very large compiler tables and full ASCII rather than 6-bit characters. Unfortunately, this project was overtaken by more urgent projects.
The disks were used close to their capacity limits, despite the replacement of three CDC 844-41 drives by two CDC 885 disk units in February 1983, which doubled the disk capacity.
Therefore, it was decided to remove inactive files from the system on a weekly basis and to archive them on magnetic tape. The users could use the ‘INDEX-system’ program to indicate whether they wanted to archive, clean up (delete) or reload files. The INDEX-system made heavy use of the new Cyber Control Language-features (CCL; 1982).
Cyber Control Language (CCL) was, in UNIX terms, a shell language. The TNO Physics Laboratory added many features to CCL in order to make optimal use of the operational and user tools. Features such as ‘empty’ parameters, substrings, and invisible (non-printing) ‘security’ parameters such as passwords were added. Most of these features were developed in 1980. They allowed the CCL procedures for the Micro Development Station (MDS) to be halved in length, and an efficiency gain was achieved on the CYBER during the expansion and processing of the CCL procedures.
Manchester Trace Package: post-mortem dump
The standard CDC system informed Fortran users of a programming error by generating error messages such as CPU Error Mode 1, 2 or 4. This meant that the program had tried to address memory outside its allocated memory, had tried to divide by zero, or had tried to execute a non-existent instruction. The only way to figure out where the problem occurred was to look at the last printed line and at the address where the offending action took place. The octal dump could be of some help.
This was hardly user-friendly, especially as one had to wait until the next night before the job could be run again. This was reason to request the public domain (already at that time!) post-mortem dump analysis package PMD from the University of Manchester. Unfortunately, the University had just got rid of its low-end Cyber systems and only had a CYBER 205 supercomputer left. For that reason, support of the package had been transferred to the University of Leicester, a NOS site. We could get a copy from them, but we had to make the adjustments needed to port the package to NOS/BE ourselves. This was relatively easy, with some adjustments in the Fortran translator and the Loader.
Just after we had taken the package into production, including adjustments for the segmentation loader used by the Physics Laboratory (SEGLOAD), Control Data (CDC) came up with a new software release that, surprisingly, contained a major new loader feature: ‘dynamic load’ (DLL) of modules. This new feature saved a lot of the expensive fixed (non-virtual) memory. In particular, the Fortran library was made fully dynamic (including the I/O modules). Adapting PMD/LTP for another operating system was one thing. Now we also had to figure out many system calls (which used a different type of system call signalling) and change them. Adapting the package to the new dynamic, largely undocumented features, however, required full insight into and understanding of the workings of the package itself.
After a lot of work, PMD/LTP was able to recognise the dynamically loaded routines in memory, and it was brought into production again. This had major consequences for most users, as we had changed the behaviour of the loader in such a way that the use of uninitialised data resulted in an abort of the program. Before that, the value zero was used. In the end, this resulted in a large quality improvement and less ‘strange’ behaviour of the user programs during execution.
Cybersecurity in 1976!
In 1976, the working group on operating systems (ECCOS) of the European Control Data User Association (ECODU) decided to conduct a security study. The working group had members from RRZN (Hannover), LUCS (London), EPFL (Lausanne), RUS (Stuttgart) and the Physics Laboratory TNO. This effort was in line with the planned tightening of both physical and system security at the Laboratory. Around 45 large security holes in ten main categories, buffer overflows in particular, were discovered in the NOS/BE system; test software was written for them and solutions were developed.
The next phase was to extend the standard NOS/BE operating system with extra security code. Operators could no longer see sensitive parts of the memory on their console. A much stricter separation of duties was introduced between Operations and Systems Programming. The console was released for full functionality only after a password had been entered. We developed a very small PP overlay of 100B words that was coded completely position-independent (fully relative), so that it could be read and executed at any place by another PP program. This PP overlay was catalogued (stored) as a system permanent file with read, modify, append and write passwords that the system generated at the moment of cataloguing and that included information from the system microsecond clock. Every ‘rootfile’ system password was therefore different and impossible for anyone to retrieve, not even for the system programmers. The replacement of system code was only permitted if the special password “Os0lem1o” (O-solemio) was known.
Only one special system PP-program could attach the file with the security overlay and could trick the system by setting the read permit in system memory. This system-unlock overlay code could be loaded as an overlay of the display PP (DSD). The overlay code contained a kind of pin code to unlock the secured system. When the PP overlay could not be loaded, the system could not be UNLOCKed at all. As a result, the system was optimally secured because the real security code could not be read in any other way.
Testing newly developed security code required a trick. Debugging required an ‘open’, unlocked system, while on the other hand one wanted to run ‘sensitive’ jobs (confidential and secret), which followed different processing flows. It is very cumbersome to test changes in a system that is optimally secured against users and even against the system programmers themselves. The trick was to develop and compile the security code first. Then, from behind the console, the binary ‘executable’ code was read into memory using the O26 console line editor (itself a PP program, which we had greatly extended). The octal memory contents were displayed on the other (left) console screen, making it easy to find the security tests and to shift them by patching the memory words, and thus the PP code, in binary. O26 was then reactivated and the modified code was saved as a temporary test file. This slightly adapted PP code was added to the system using EDITLIB(SYSTEM), after which all the required tests could be run without problems. When the code worked as expected, the original binary code could be added to the system without another compilation. That saved a lot of time, as a compilation could take an hour or more when extensive changes had been made to, for instance, the console display driver PP DSD.
The O26 console editor had the following keyboard functions according to a recently found printout. Note that the PP of 4096 instruction words executed all these line editor functions on a file in the central memory.
As an example of the safeguards that were built into PP programs, the ABS program was used to dump the memory of the system or job onto paper. The standard code of ABS allowed each user to dump a major part of the system main memory and the user’s own occupied memory space. ABS was changed in such a way that the memory contents reported a content of zeroes unless:
- ABS was called by the operating system itself (the ‘control point zero’ ABS dump).
- ABS was called from a PP program that could run only in the display unlock mode. This mode was to please the system programmers.
- ABS was asked to dump an inherently ‘safe memory area’: the job’s own user space memory and safe address ranges (tables) in the absolute memory of the system.
- ABS was called by a system application.
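The decision rule in the list above can be summarised in a short Python sketch. The real check lived in PP code; the class, field and function names below are hypothetical and only restate the four conditions from the text.

```python
from dataclasses import dataclass


@dataclass
class DumpRequest:
    by_operating_system: bool = False    # the 'control point zero' ABS dump
    by_unlocked_pp: bool = False         # caller runs only in display unlock mode
    in_safe_area: bool = False           # job's own memory or safe system tables
    by_system_application: bool = False  # caller is a system application


def abs_dump(request: DumpRequest, memory_words: list[int]) -> list[int]:
    """Return the real memory contents only if one of the four conditions holds."""
    allowed = (
        request.by_operating_system
        or request.by_unlocked_pp
        or request.in_safe_area
        or request.by_system_application
    )
    # Anything else is reported as zeroes instead of being refused outright.
    return list(memory_words) if allowed else [0] * len(memory_words)


print(abs_dump(DumpRequest(), [0o7777, 0o1234]))
print(abs_dump(DumpRequest(in_safe_area=True), [0o7777, 0o1234]))
```

Returning zeroes rather than an error means that an ordinary user still gets a complete-looking dump, while nothing sensitive is exposed.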
In 1981, we introduced the Tape Security System (TSS), which had been developed by the University of Bologna, Italy. We had to convert the package to the latest NOS/BE release, solved a number of errors and extended the package with improved logging. TSS made it impossible for users to read or overwrite tapes of other users unless the password of the tape (hidden in an extended tape label area) was known.
Just after the Data Encryption Standard (DES) was made public, we received a fast implementation of the DES utilities in Compass (assembler) from a friendly American contact. The deck of punch cards arrived a couple of weeks before new US export rules made the export of DES code illegal.
The export ban did not stop us from using DES to encrypt sensitive data, especially in the operating system. In early 1984, we started to enforce a new password management policy. The INTERCOM (interactive users) password system had to be adjusted. In addition to the standard password file, there was an extra file that linked a ‘physical record unit’ (PRU) of 64 words to each account name. In that PRU, the ten most recently used one-way encrypted passwords were stored to prevent the reuse of passwords. To defeat PCs that tried to cycle quickly through a series of passwords (an a-b-c-a scheme) in order to keep using the old password for a long time, we developed a countermeasure: the list of previously used passwords was not moved down until the most recently changed password in the list had been used at least twice. The PRU also recorded the time and date of the last five logins, and counters that forced a user to change his password after 250 log-ins or three months of use. In that case, the user could still log in several times (grace logins) before the account was definitively blocked.
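The password history rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the original Compass or PP code: SHA-256 stands in for the one-way encryption, the class and method names are invented, and the three-month limit, the login timestamps and the grace logins are left out.

```python
import hashlib
from collections import deque

HISTORY_SIZE = 10          # from the text: the last ten passwords are remembered
MIN_USES_BEFORE_SHIFT = 2  # from the text: the list moves only after two uses
MAX_LOGINS = 250           # from the text: forced change after 250 log-ins


def one_way(password: str) -> str:
    # Assumption: SHA-256 stands in for the original one-way encryption.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


class PasswordRecord:
    """Hypothetical model of the per-account record; the newest entry is last."""

    def __init__(self, initial: str) -> None:
        self.history = deque([one_way(initial)], maxlen=HISTORY_SIZE)
        self.uses_of_current = 0

    def login(self, password: str) -> bool:
        if one_way(password) != self.history[-1]:
            return False
        self.uses_of_current += 1
        return True

    def must_change(self) -> bool:
        return self.uses_of_current >= MAX_LOGINS

    def change(self, new_password: str) -> bool:
        digest = one_way(new_password)
        if digest in self.history:
            return False  # reuse of a remembered password is refused
        if self.uses_of_current < MIN_USES_BEFORE_SHIFT:
            # The current password was hardly used: overwrite it, so that
            # rapid cycling cannot push older passwords out of the list.
            self.history[-1] = digest
        else:
            self.history.append(digest)  # the oldest entry drops off when full
        self.uses_of_current = 0
        return True
```

A user who has used password "a" for a while and then changes quickly to "b", to "c" and back to "a" is refused at the last step, because the unused "b" was overwritten instead of pushing "a" down the list.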
In total, some 3,000 lines of security code (one and a half boxes of punched cards) were developed, involving around 40 system programs and PPs, as well as ten overlays of these PP programs.
Security measures also extended to physical security. In 1978, we developed a method to wipe defective disk packs. Using sand grit, we really wiped the disks in order to comply with NATO security regulations. At the worldwide VIM/ECODU users conference in Minneapolis, Control Data proudly introduced the new 885 disk units, which had sealed head-disk assemblies (HDAs). The standard maintenance contract required CDC’s maintenance engineers to take the HDA (media and heads) with them in case of errors, which conflicted with operations in a secure environment. The security regulations required us to keep the disks unless we could guarantee that all data had really been wiped. Based upon our suggestions, Control Data changed its standard maintenance contract worldwide, allowing sites to keep the HDAs for a small ‘insurance’ increase in the maintenance costs.
After following a LOADER class that made clear all internals of the Loader, linking, dynamic loading and segmentation loading, we had enough knowledge to resolve another annoying problem for our interactive users. The Loader is the system program that loads another executable program and may start its execution.
The execution of system commands under the ‘interactive’ NOS operating system differed slightly from NOS/BE. Under NOS, each next command was taken from the terminal buffer. If no command or input was available, the system waited for the next user action. Under INTERCOM, the interactive subsystem of NOS/BE, only a single command could be typed ahead, as the buffer had room for just one command. After that, the Loader received an end-of-file signal. The Loader sometimes required a number of subsequent commands, which for this reason was impossible under INTERCOM. The new knowledge allowed us to activate the NOS code and to make the small number of changes necessary to get it working. We provided this code for free to many other computer centres around the world. In the end, CDC implemented our code in the standard NOS/BE system as well.
System performance: “Polish smartness”
During one of the ECODU conferences, a system programmer of the University of Krakow gave a lecture that ended after only fifteen minutes; the normal session time was one hour. Behind the then Iron Curtain, one had to give a presentation in order to be allowed to attend a conference. This time, however, the ideas presented made the system programmers of the Physics Laboratory think (and work).
The standard NOS/BE operating system kept a table in the system’s main memory with pointers to the location of each PP program, either on disk or in the fixed part of central memory. The PP loader made use of this table. In each word of the table, several bytes were reserved for a reference to PP code that had been preloaded in so-called extended memory. Extended memory was an expensive machine extension, something that neither the Poles nor we could afford. The Polish programmer made clever use of that unused field to count the number of times each PP program was called. By regularly reading the complete table with a simple Fortran program and comparing it with the stored counters of a previous run, you could check which PP programs were loaded often, loaded regularly, or loaded in ‘bursts’. Within the very limited central memory, a part was reserved to store essential PP code (e.g., disk error handling) and PP programs that were invoked many times per second. Loading such PP code meant directly reading a number of memory words; it required no disk I/O and was super fast. The counts showed that the standard selection of PPs kept in this expensive memory was far from optimal for our environment. A manual adjustment of the system resulted in a 10% speed improvement of the machine.
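The counter comparison can be sketched in Python (the original was a simple Fortran program). Everything here is illustrative: the PP names are placeholders, and the width of the counter field is an assumption made only to show how a counter that wraps around between two snapshots can be handled.

```python
def load_rates(
    previous: dict[str, int],
    current: dict[str, int],
    interval_seconds: float,
    field_bits: int = 12,  # assumption: width of the reused counter field
) -> list[tuple[str, float]]:
    """Return (PP name, loads per second) pairs, most frequently loaded first."""
    modulus = 1 << field_bits
    rates = {}
    for name, count in current.items():
        # Modular subtraction copes with a counter that overflowed once.
        delta = (count - previous.get(name, 0)) % modulus
        rates[name] = delta / interval_seconds
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Placeholder snapshots taken ten seconds apart.
before = {"PPA": 4090, "PPB": 10}
after = {"PPA": 4, "PPB": 40, "PPC": 1}
for name, rate in load_rates(before, after, interval_seconds=10):
    print(f"{name}: {rate:.1f} loads/s")
```

Repeating such a comparison over many intervals shows which programs are loaded constantly and which only in bursts, which is the information needed to decide what belongs in resident memory.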
Yet the system programmers were still not satisfied. With about ten PP programs per second being loaded from the (slow) disk, some extra system modifications were developed. After the initial installation of the system on disk, we could use the aforementioned PP load table and other system tables to locate where the approximately 3.5 cylinders (14 disk platters above each other) holding the PP program library started on disk. The library was now rearranged in such a way that first a number of the least used PP programs were written to disk until the (already partially filled) disk cylinder overflowed. Then as many as possible of the most invoked PP programs that were not stored in central memory were crammed into the next full cylinder. Then the other PP programs were put on disk. Because the most used PP programs were present in one disk cylinder, the number of head movements of the disk unit could be reduced considerably. It became likely that a PP program could be loaded within half a disk revolution without any arm movement of the read head. Depending on the type of work, a 5-10% performance gain was achieved.
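The rearrangement can be sketched as a simple ordering problem. This is a simplified model, not the procedure actually used: sizes are in arbitrary units, a cylinder is treated as one contiguous run of space, and the function and parameter names are invented for this illustration.

```python
def arrange_library(
    programs: list[tuple[str, int, float]],  # (name, size, loads per second)
    free_in_first_cylinder: int,
    cylinder_size: int,
) -> list[str]:
    """Return PP program names in the order in which they are written to disk."""
    coldest_first = sorted(programs, key=lambda p: p[2])
    order: list[str] = []
    written = 0

    # Step 1: use up the partially filled cylinder with the least used programs
    # until it overflows into the next cylinder.
    while coldest_first and written < free_in_first_cylinder:
        name, size, _ = coldest_first.pop(0)
        order.append(name)
        written += size

    # Step 2: pack the most used remaining programs into what is left
    # of the next cylinder, as long as they fit.
    room = cylinder_size - max(0, written - free_in_first_cylinder)
    leftovers: list[str] = []
    for name, size, _ in sorted(coldest_first, key=lambda p: p[2], reverse=True):
        if size <= room:
            order.append(name)
            room -= size
        else:
            leftovers.append(name)

    # Step 3: everything else follows in the cylinders after that.
    return order + leftovers


library = [("C", 6, 9.0), ("A", 4, 0.1), ("E", 6, 1.0), ("D", 6, 5.0), ("B", 4, 0.2)]
print(arrange_library(library, free_in_first_cylinder=6, cylinder_size=14))
```

In this toy example the two rarely used programs absorb the cylinder boundary, so the two most frequently loaded programs end up together in a single cylinder and can be read without moving the disk arm.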
It could be even more optimal! Apart from the primary system PP-library, the system placed temporary system changes (EDITLIB(SYSTEM)) in a second library. By changing some code in the operating system (technically: moving the SYS-bit), we moved this library from the original SYStem disk to a second SYStem disk. Playing the same cylinder tricks, we were able to replace often used PP programs by a copy that was loaded on this second disk. In this way, we had a balanced, very optimal operating system that was much faster than the original NOS/BE system. The users gained a lot of performance. We were able to load a PP-program directly from one of the two system disks without requiring the dozens of milliseconds repositioning those disks in more than 95% of the cases. The interactive users gained the most as the system performance became more consistent. It also meant a higher CPU-performance of jobs because the system required less waiting time for I/O.
More performance problems
In 1979, the CPU utilisation during working hours was 50%. In 1980, the CPU utilisation increased to 60%. The overall utilisation in 1979 was 23%. Via 32% in 1980, this figure increased to 43% in 1981.
Despite the Polish and our own smart code, in mid-1981 the users complained about the interactive response times on the CYBER 74. A program written by Edo Roos Lindgreen at SARA, running on an Apple IIe microcomputer, issued a number of simple interactive commands to the CYBER every five minutes. The PC measured the time between the carriage return (Enter) and the first character of the response to that key action. These time values were sent to the CYBER for post-processing. The Apple displayed the last twenty response times on the PC monitor, which was intended for the operators. They got almost immediate feedback when they admitted too many batch jobs to the system, or when other system performance problems occurred.
Using this tool, we measured the behaviour of a temporarily extended configuration of the CYBER with 14 PPs (instead of 10) and an additional disk controller. It showed that either the CYBER configuration had to be extended or the CYBER had to be replaced by a more powerful system in order to remove a number of bottlenecks. Apart from the Apple IIe measurements, we measured in the system itself as well. Every ten seconds, a PP program made a snapshot of the system tables, PP occupation and I/O activities. This PP program and the accompanying analysis program were called User Performance Measurement (UPM) and originated from the European Centre for Medium-Range Weather Forecasts (ECMWF). The four additional PPs and the additional disk controller decreased the average interactive response time (reaction time to input in an application) from 1.5 to 0.4 seconds. The average execution time of 95% of the interactive commands decreased from ten to 0.5 seconds. The system got some ‘air’ again.
The execution of batch jobs depended strongly on priority. When a user submitted, for instance, twenty jobs in quick succession, operators considered that unfair and set most of those jobs to priority 0, ‘not eligible for execution’. As it required manual intervention, making the jobs eligible for execution again was often forgotten until the system became almost empty. This resulted in wasted CPU time, either through idle time or rerun time, and in jobs being activated too late, resulting in annoyed users. Apart from this, long jobs were often moved to the end of the queue. The selection was often based on the personal preferences and experiences of the operators. Sometimes this turned out to be a very good decision, but often the decision was counterproductive. Many complaints were received. In May 1982, a new priority algorithm was introduced for batch jobs. The following interesting formula was used to determine the start priority in the queue:
2200B – log2(CM × (T + ß × IO)) + log2(waiting time in seconds)
The parameter ß was two for normal batch jobs and 0.5 for tape jobs. Because the waiting time is part of the formula, the queue aged in a smart way: each batch job was started fairly, without many operator interventions. Users who estimated their execution and I/O requirements well received a higher start priority as a bonus, so that their job reached the top of the wait queue sooner.
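A minimal Python sketch of this start priority follows. The text does not define the units of CM, T and IO, so the parameter names and units used here (memory in words, CPU and I/O estimates in seconds) are assumptions, as is the guard against taking the logarithm of values below one.

```python
import math

BASE_PRIORITY = 0o2200  # 2200B, an octal constant


def start_priority(
    cm_words: float,         # assumption: requested central memory in words
    cpu_seconds: float,      # assumption: estimated CPU time T
    io_seconds: float,       # assumption: estimated I/O requirement IO
    waiting_seconds: float,  # time already spent in the queue
    tape_job: bool = False,
) -> float:
    beta = 0.5 if tape_job else 2.0
    cost = cm_words * (cpu_seconds + beta * io_seconds)
    # max(..., 1) avoids log2 of zero; this guard is not part of the original formula.
    return BASE_PRIORITY - math.log2(max(cost, 1)) + math.log2(max(waiting_seconds, 1))


small = start_priority(cm_words=20000, cpu_seconds=10, io_seconds=5, waiting_seconds=60)
large = start_priority(cm_words=60000, cpu_seconds=400, io_seconds=50, waiting_seconds=60)
print(f"small job: {small:.1f}, large job: {large:.1f}")
```

Because both terms are logarithmic, a job that asks for twice the resources loses one priority point, and a job gains one point every time its waiting time doubles, so large jobs are delayed but never starved.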
Stress-testing new operators
The operator console of the CYBER system consisted of two large, round 15-inch cathode ray tubes. The characters were generated by means of R-C circuits, which caused the circles in the digits zero, six, eight and nine to be heavily squashed.
New operators who were left alone behind such a console for the first time, after a number of weeks of introduction and training, were stress-tested by their colleagues. Using a terminal, the system programmers started the PP program EYE, the first example of a “Trojan horse”. The PP program took over both console displays for a short period, cleared the screens and displayed two ovals with a pupil in each oval. The eyes looked to the left, then to the right, blinked and disappeared, leaving an untouched console screen. The colleague operator who returned a minute later kept a straight face: he had never seen such spooky system behaviour before… until the new operator was introduced to the wonderful set of game PPs that made use of the so-called L-display. These Fortran/PP programs included Chess, Life, Wurm, Blob and a space mission program with dangers such as ‘coffee pot blown by meteorite’.
As said before, the CYBER 74 had a water cooling system that was supposed to carry away the heat taken up by the freon cooling circuits in the bays. Undetected water leakage meant that the system regularly had to be topped up with fresh water. At a certain moment, one of the CYBER 74 bays indicated temperature problems. The built-in temperature meter showed very strange behaviour: a very fast switching between too hot and too cool. Switching from the main cooling circuit to a secondary circuit did not solve the problem. It was decided to replace the three-way valve of the cooling circuit of the bay. When the pipes were detached, the root cause of the problem became clear. As the Laboratory had iron pipes in its primary cooling circuit, the regular supply of oxygen-rich water had over time caused a lot of rust on the inside of the pipes. Chips of rust came loose once in a while and were transported to the smaller cooling pipes in the CYBER system, where they became trapped. In this way, our computer was the first system with a cardiac arrest. Cleaning the internal cooling system of the CYBER required a couple of days’ work. Obviously, the entire water cooling circuit needed replacement, something that required more time and planning. In order to continue computer services, it was decided to couple the CYBER to an external ‘blood circulation’. Emergency cooling was supplied by connecting a fire hose to the cooling input side of the CYBER, and a fire hose at the output that discharged the heated water through the window onto one of the courtyards of the Laboratory. It took three weeks to complete all the replacement fitting. It took only a short while before the whole area around the Laboratory became part of the extended water production area for The Hague (Haagse Duinwaterleiding).
More cooling problems
The cooling system of the air conditioning was not always as reliable as it should have been. There was a period when the heat exchanger on the roof had frozen pipes, which blocked the water flow. When there was no immediate reaction to an air conditioning failure, especially during the night when nobody was around, the air temperature in the computer room rose very quickly to above 27 °C. The control system then sounded an audible alarm for thirty seconds and subsequently cut the power to the whole system. The effect was often (the well-known Murphy was most often around during the night!) that the power failed during a write action on a disk while the disk heads were being retracted. The result was damaged disk contents. The restore action took four to six hours using the backup tapes of the previous evening. After the n-th restore, a smart solution came to mind during a restless night. As the system went down automatically anyway, there was no reason not to press the system deadstart button (boot button) five seconds earlier. The boot action is a read action on the disks and thus non-destructive. The timer needed for this cost a mere 50 Dutch guilders and took 1.5 hours to install. The long reload actions due to this cause disappeared and many hours of downtime were saved.
In order to speed up the ‘restore’ process after emergencies such as disk crashes, a Fortran program was developed that optimised the use of the magnetic tape units for RESTORE. Both magnetic tape units could run in parallel. Also, the PPs required during the restore were temporarily loaded into main memory, which greatly decreased the recovery time. Permanent files that could not be retrieved from the most recent backup tapes were restored from the most recent weekly backup tapes.