Museum logo

CDC CYBER 74: Performance problems

System performance: "Polish smartness"

During one of the ECODU conferences, a system programmer from the University of Kraków gave a very short, 15-minute presentation. To be allowed to attend the conference, he had to present a paper. This time, the ideas were great, and our system programmers took them up (the work included).

The standard NOS/BE operating system kept a table in system memory in which the PP loader could find, for each PP program, either a "pointer" to its location on disk or a pointer to its location in central memory. The table also contained a number of "bytes" intended to point to the PP code when that code was preloaded in extended memory. The Polish CDC 6400 had no extended memory, and neither did our system. The Polish colleague added a small number of instructions to the PP loader code: whenever a PP program (overlay) was loaded, the loader incremented a counter stored in these unused bytes.
Using a very simple Fortran program that called a PP to copy the table into user space, we were able to subtract the counters saved during a previous run from the current ones. In this way it became clear which PP programs were loaded very often and which were loaded in "bursts". By default, part of the very limited central memory (96 KWords of 60 bits) was used to store PP overlays that either were used many times per second or had to be preloaded because they were needed to handle error situations in which, for instance, the disks could not be reached. Such PP code could then be loaded by a read directly from memory, without any disk I/O. The counters made it clear that the standard NOS/BE preload set of PP code was far from optimal. Manual optimisation led to a 10% performance increase.
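The counter comparison can be illustrated with a minimal Python sketch. This is not the original Fortran program or PP code; the representation of a snapshot as a name-to-counter mapping and the program names PPA, PPB and PPC are assumptions made purely for illustration.

```python
def load_deltas(previous, current):
    """Loads per PP program between two snapshots of the counter table.

    Both arguments map a PP program name to its cumulative load counter.
    A program missing from the previous snapshot is assumed to start at 0.
    """
    return {name: count - previous.get(name, 0) for name, count in current.items()}


def rank_by_loads(deltas):
    """Return (name, loads) pairs, most frequently loaded first."""
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical snapshots taken at the start and end of a measurement run.
previous_run = {"PPA": 1200, "PPB": 40, "PPC": 7}
current_run = {"PPA": 5200, "PPB": 45, "PPC": 907}

ranking = rank_by_loads(load_deltas(previous_run, current_run))
print(ranking)  # [('PPA', 4000), ('PPC', 900), ('PPB', 5)]
```

Comparing several short intervals, rather than one long run, would help to separate steadily used programs from those loaded in bursts.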

Even so, the system programmers were not yet satisfied. Loading dozens of PP programs per second from the relatively slow disk could be optimised further. After the initial load of a complete new version of the operating system from a deadstart tape, we could determine the location of the PP code on disk from the disk addresses in the PP loader table mentioned above. In this way we were able to determine where the roughly 3.5 cylinders (14 disk platters above each other) that comprised the PP program library started on disk. By reordering the PP code on the magnetic deadstart tape, we organised the library so that the less-used PP code occupied the disk area up to the end of the initial cylinder. The next cylinder held all the most frequently used PP code (excluding the PP code preloaded in memory). The following cylinders held the rest of the PP library. This greatly increased the chance that the disk read heads were already positioned over the right cylinder. In those cases no cylinder-to-cylinder movement was required, and on average only half a rotation of the disk was needed before loading of the PP code could start. This improved system performance by another 5-10%, depending on the type of load.
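The reordering of the library can be sketched in Python as follows. This is only an illustration of the idea, not the procedure actually used: the greedy selection, the size unit, the capacities and the program names are all assumptions.

```python
def plan_tape_order(programs, room_in_first_cylinder, cylinder_capacity):
    """Split PP programs into three groups, in the order they go on the tape.

    programs: (name, size, loads) tuples; preloaded programs are left out.
    Sizes and capacities share one unit (for example sectors).
    Returns (filler, hot, rest, padding).
    """
    by_use = sorted(programs, key=lambda p: p[2], reverse=True)

    # Hot group: the most-used programs that fit together in one cylinder.
    hot, leftover, used = [], [], 0
    for name, size, loads in by_use:
        if used + size <= cylinder_capacity:
            hot.append(name)
            used += size
        else:
            leftover.append((name, size, loads))

    # Filler group: least-used programs occupy the partial first cylinder.
    filler, rest, used = [], [], 0
    for name, size, _ in reversed(leftover):
        if used + size <= room_in_first_cylinder:
            filler.append(name)
            used += size
        else:
            rest.append(name)

    # Unused space to pad so the hot group starts on a cylinder boundary.
    padding = room_in_first_cylinder - used
    return filler, hot, rest, padding


# Hypothetical library: (name, size, loads during the measurement run).
library = [("PPA", 4, 100), ("PPB", 4, 90), ("PPC", 4, 5),
           ("PPD", 2, 1), ("PPE", 3, 2)]
plan = plan_tape_order(library, room_in_first_cylinder=5, cylinder_capacity=8)
print(plan)  # (['PPD', 'PPE'], ['PPA', 'PPB'], ['PPC'], 0)
```

The point of the filler group is alignment: whatever precedes the hot group must end exactly at the cylinder boundary, so that all frequently used code sits under one head position.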

Apart from the primary system PP library, the system placed "temporary" system changes (EDITLIB(SYSTEM)) in a second library. By changing some code in the operating system (technically: moving the SYS bit), we moved this library from the original system disk to a second disk. Playing the same cylinder tricks, we were able to replace frequently used PP programs by copies that were loaded from this second disk. In this way we obtained a well-balanced, highly optimised operating system that was much faster than the original NOS/BE system. The users gained a lot of performance. In more than 95% of the cases we were able to load a PP program directly from one of the two system disks without the dozens of milliseconds needed to reposition the disk heads. The interactive users gained the most, as performance also became more consistent. It also meant higher CPU performance for jobs, because the system spent less time waiting for input/output.

More performance problems

In 1979, CPU utilisation during working hours was 50%; in 1980 it increased to 60%. Overall utilisation was 23% in 1979, 32% in 1980 and 43% in 1981.

Despite the "Polish and our own smart code", in mid-1981 the users complained about the interactive response times on the CYBER 74. Using a program written by Edo Roos Lindgreen at SARA on an Apple IIe microcomputer, a number of simple interactive commands were issued to the CYBER every 5 minutes. The Apple measured the time between the carriage return (Enter) and the first character of the response to that key action. These time values were sent to the CYBER for post-processing. The Apple displayed the last twenty response times on a monitor intended for the operators. They got almost immediate feedback when they had admitted too many batch jobs to the system, or when other system performance problems occurred.
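The measurement principle can be sketched in a few lines of Python. This is not the Apple IIe program; the class name, the two callables standing in for the terminal line and the injectable clock are assumptions made for illustration.

```python
import time
from collections import deque


class ResponseTimeMonitor:
    """Times probe commands and keeps only the most recent results."""

    def __init__(self, history=20, clock=time.monotonic):
        self.recent = deque(maxlen=history)  # oldest entries drop off
        self.clock = clock

    def probe(self, send_command, wait_for_first_char):
        """Send one command and time the wait for the first response character.

        send_command: sends the command text including the carriage return.
        wait_for_first_char: blocks until the first response character arrives.
        """
        send_command()
        started = self.clock()
        wait_for_first_char()
        elapsed = self.clock() - started
        self.recent.append(elapsed)
        return elapsed


# Dry run with dummy callables and a fake clock that ticks 1.5 s per probe.
ticks = iter([0.0, 1.5, 300.0, 301.5])
monitor = ResponseTimeMonitor(clock=lambda: next(ticks))
monitor.probe(lambda: None, lambda: None)
monitor.probe(lambda: None, lambda: None)
print(list(monitor.recent))  # [1.5, 1.5]
```

A bounded deque gives the "last twenty" display for free: once it is full, each new measurement pushes out the oldest one.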

Using this tool, we measured the behaviour of a temporarily extended configuration of the CYBER with 14 PPs (instead of 10) and an additional disk controller. The measurements showed that either the CYBER configuration had to be extended or the CYBER had to be replaced in order to remove a number of bottlenecks. Apart from the Apple IIe measurements, we also measured inside the system itself. Every ten seconds, a PP program took a snapshot of the system tables, PP occupation and I/O activities. This PP program and the analysis code originated from the European Centre for Medium-Range Weather Forecasts (ECMWF) and was called User Performance Measurement (UPM).
The four additional PPs and the additional disk controller decreased the average interactive response time (reaction time to input in an application) from 1.5 to 0.4 seconds. The average execution time of 95% of the interactive commands decreased from ten to 0.5 seconds. The system got "air" again.

The execution of batch jobs depended heavily on "priority". When a user submitted, for instance, twenty jobs in one go, operators usually set most of these jobs to priority 0 (not eligible for execution). As it required manual intervention, making the jobs eligible for execution again was often forgotten until the system was almost empty. This resulted in wasted CPU seconds, through either idle time or rerun time, and in late activation of jobs, resulting in annoyed users. Apart from this, "long" jobs were often moved to the end of the queue, often based on the personal preferences and experience of the operators. Sometimes this turned out to be a very good decision, but often it was counterproductive. Many complaints were received. In May 1982, a new priority algorithm was introduced for batch jobs. The following formula was used to determine the start priority in the queue:

2200B - log2(CM * (T + β * IO)) + log2(waiting time in seconds)

The parameter β was two for normal batch jobs and 0.5 for tape jobs. The queue was aged in a smart way. In this way, each batch job was started fairly, without much operator intervention. As a bonus, users who estimated their execution and I/O requirements well saw their jobs reach the top of the waiting queue sooner.
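As a worked illustration, the formula can be written as a small Python function. This is a sketch under assumptions, not the NOS/BE implementation: CM, T and IO are taken to be the job's requested central memory, execution time and I/O (the text does not give their units), "2200B" is read as octal 2200, and the clamp on the waiting time is added only to keep the logarithm defined.

```python
import math

BASE_PRIORITY = 0o2200  # "2200B" read as octal 2200, which is 1152 decimal


def start_priority(cm, t, io, waiting_seconds, tape_job=False):
    """Start priority following the formula in the text.

    cm, t and io are the job's requested central memory, execution time and
    I/O (assumed meanings; units are not given in the text).
    """
    beta = 0.5 if tape_job else 2.0
    size_penalty = math.log2(cm * (t + beta * io))
    # Assumption: clamp to 1 second so a freshly queued job gets no bonus
    # and log2(0) is avoided.
    waiting_bonus = math.log2(max(waiting_seconds, 1))
    return BASE_PRIORITY - size_penalty + waiting_bonus


print(start_priority(cm=64, t=6, io=1, waiting_seconds=8))     # 1146.0
print(start_priority(cm=64, t=6, io=1, waiting_seconds=1024))  # 1153.0
```

Because both terms are logarithmic, doubling a job's resource request costs one priority point, and doubling its waiting time earns one back. A job therefore rises quickly at first and ever more slowly the longer it waits.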

Stress-testing new operators

The operator console of the CYBER system consisted of two large, round 15-inch cathode ray tubes. The characters were generated by means of R-C circuits, which caused the circles in zero, six, eight and nine to be squeezed considerably.
New operators who were left alone at such a console for the first time, after a number of weeks of introduction and training, were stress-tested by their colleagues. From a terminal, the system programmers started the PP program EYE, an early example of a "Trojan horse". The PP program took over both console displays for a short period, cleared the screens and displayed two ovals with a "pupil" in each oval. The eyes looked to the left, then to the right, blinked and disappeared, leaving an "untouched" console screen. The fellow operator who returned a minute later kept a straight face and had never seen such spooky system behaviour before... until the new operator was introduced to the wonderful set of game PPs that made use of the so-called L-display. These Fortran/PP programs included Chess, Life, Wurm and Blob.

Emergency cooling

As said before, the CYBER 74 had a water cooling system that was supposed to carry off the heat exchanged by the freon cooling circuits in the bays. Water leakage meant that the system regularly had to be topped up with additional water. At a certain moment, one of the CYBER 74 bays indicated temperature problems. The built-in temperature meter showed very strange behaviour: very fast switching between too hot and too cool. Switching from the main cooling circuit to the secondary circuit did not solve the problem. It was decided to replace the three-way valve in the cooling circuit of the bay. When the pipes were detached, the cause of the problem became clear. As the Laboratory had iron pipes in its primary cooling circuit, the addition of oxygen-rich water had over time caused a lot of rust on the inside of the pipes. Chips of rust came loose once in a while and were transported to the smaller cooling pipes in the CYBER system, where they became trapped. In this way, our computer was the first system with a "cardiac arrest". Cleaning the internal cooling system of the CYBER required a couple of days' work. Obviously, the entire primary circuit required replacement, which needed more time and planning. In order to continue computer services, it was decided to connect the CYBER to an external "blood circulation". The "emergency cooling" was supplied by connecting a fire hose to the cooling input side of the CYBER and, at the output, a fire hose that discharged the heated water through the window onto the courtyard of the Laboratory. It took three weeks to complete all the replacement fitting. It took only a short while before the whole area around the Laboratory became part of the extended water production area for The Hague (Haagse Duinwaterleiding).

Photo of a CDC 6600, which was a look-alike of the CYBER 74. The cooling
controls can be seen at the end of the bay. The CPU interconnection wires
are clearly visible.
(Photo courtesy of http://ed-thelen.org/comp-hist/).

More cooling problems

The cooling system of the air conditioning was not always as reliable as it should have been. There was a period in which the heat exchanger on the roof had frozen pipes, which blocked the water flow. When the sounding alarm was not attended to immediately, that is, during the night when nobody was around, the air temperature in the computer room rose very quickly to above 27 °C. The control system then issued an audible alarm for thirty seconds and cut the power to the entire system. The effect was often (the well-known "Murphy" was mostly around during the night!) an interrupted write action on the disks while the disk heads were being retracted. The result was damaged disk contents. The restore action took four to six hours using the backup tapes of the previous evening. After the n-th "restore", a smart solution came to mind during a restless night. As the system went down automatically anyway, there was no reason not to press the system deadstart button (boot button) 5 seconds earlier. The boot action is a read action on the disks, and thus non-destructive. The timer needed for this cost a mere 50 Dutch guilders and required 1.5 hours of installation time. The long reload actions due to this cause disappeared and many hours of downtime were saved.

In order to optimise the "restore" process after emergencies such as disk crashes, a Fortran program was developed that optimised the use of the magnetic tape units. Both magnetic tape units could run in parallel. Also, the PPs required during the restore were loaded into memory, which greatly reduced the downtime. Permanent files that could not be retrieved from the most recent backup tapes were restored from the most recent weekly backup tapes.
