5
INPUT/OUTPUT
In addition to providing abstractions such as processes, address spaces, and files, an operating system also
controls all the computer’s I/O (Input/Output) devices. It must issue commands to the devices, catch
interrupts, and handle errors. It should also provide an interface between the devices and the rest of the
system that is simple and easy to use. To the extent possible, the interface should be the same for all
devices (device independence). The I/O code represents a significant fraction of the total operating
system. How the operating system manages I/O is the subject of this chapter.
This chapter is organized as follows. We will look first at some of the principles of I/O hardware and
then at I/O software in general. I/O software can be structured in layers, with each having a well-defined
task. We will look at these layers to see what they do and how they fit together.
Next, we will look at several I/O devices in detail: disks, clocks, keyboards, and displays. For each device
we will look at its hardware and software. Finally, we will consider power management.
5.1 PRINCIPLES OF I/O HARDWARE
Different people look at I/O hardware in different ways. Electrical engineers look at it in terms of chips,
wires, power supplies, motors, and all the other physical components that comprise the hardware.
Programmers look at the interface
presented to the software—the commands the hardware accepts, the functions it carries out, and the errors
that can be reported back. In this book we are concerned with programming I/O devices, not designing,
building, or maintaining them, so our interest is in how the hardware is programmed, not how it works
inside. Nevertheless, the programming of many I/O devices is often intimately connected with their
internal operation. In the next three sections we will provide a little general background on I/O hardware
as it relates to programming. It may be regarded as a review and expansion of the introductory material in
Sec. 1.3.
5.1.1 I/O Devices
I/O devices can be roughly divided into two categories: block devices and character devices. A block
device is one that stores information in fixed-size blocks, each one with its own address. Common block
sizes range from 512 to 65,536 bytes. All transfers are in units of one or more entire (consecutive) blocks.
The essential property of a block device is that it is possible to read or write each block independently of
all the other ones. Hard disks, Blu-ray discs, and USB sticks are common block devices.
If you look very closely, the boundary between devices that are block addressable and those that are not
is not well defined. Everyone agrees that a disk is a block addressable device because no matter where the
arm currently is, it is always possible to seek to another cylinder and then wait for the required block to
rotate under the head. Now consider an old-fashioned tape drive still used, sometimes, for making disk
backups (because tapes are cheap). Tapes contain a sequence of blocks. If the tape drive is given a
command to read block N, it can always rewind the tape and go forward until it comes to block N. This
operation is analogous to a disk doing a seek, except that it takes much longer. Also, it may or may not be
possible to rewrite one block in the middle of a tape. Even if it were possible to use tapes as random
access block devices, that is stretching the point somewhat: they are normally not used that way.
The other type of I/O device is the character device. A character device delivers or accepts a stream of
characters, without regard to any block structure. It is not addressable and does not have any seek
operation. Printers, network interfaces, mice (for pointing), rats (for psychology lab experiments), and
most other devices that are not disk-like can be seen as character devices.
This classification scheme is not perfect. Some devices do not fit in. Clocks, for example, are not block
addressable. Nor do they generate or accept character streams. All they do is cause interrupts at well-defined intervals. Memory-mapped screens do not fit the model well either. Nor do touch screens, for that
matter. Still, the model of block and character devices is general enough that it can be used as a basis for
making some of the operating system software dealing with I/O device independent. The file system, for
example, deals just with abstract block devices and leaves the device-dependent part to lower-level
software.
I/O devices cover a huge range in speeds, which puts considerable pressure on the software to perform
well over many orders of magnitude in data rates. Figure 5-1 shows the data rates of some common
devices. Most of these devices tend to get faster as time goes on.
Device                      Data rate
Keyboard                    10 bytes/sec
Mouse                       100 bytes/sec
56K modem                   7 KB/sec
Scanner at 300 dpi          1 MB/sec
Digital camcorder           3.5 MB/sec
4x Blu-ray disc             18 MB/sec
802.11n Wireless            37.5 MB/sec
USB 2.0                     60 MB/sec
FireWire 800                100 MB/sec
Gigabit Ethernet            125 MB/sec
SATA 3 disk drive           600 MB/sec
USB 3.0                     625 MB/sec
SCSI Ultra 5 bus            640 MB/sec
Single-lane PCIe 3.0 bus    985 MB/sec
Thunderbolt 2 bus           2.5 GB/sec
SONET OC-768 network        5 GB/sec
Figure 5-1. Some typical device, network, and bus data rates.
5.1.2 Device Controllers
I/O units often consist of a mechanical component and an electronic component. It is possible to
separate the two portions to provide a more modular and general design. The electronic component is
called the device controller or adapter. On personal computers, it often takes the form of a chip on the
parentboard or a printed circuit card that can be inserted into a (PCIe) expansion slot. The mechanical
component is the device itself. This arrangement is shown in Fig. 1-6.
The controller card usually has a connector on it, into which a cable leading to the device itself can be
plugged. Many controllers can handle two, four, or even eight identical devices. If the interface between
the controller and device is a standard interface, either an official ANSI, IEEE, or ISO standard or a de
facto one, then companies can make controllers or devices that fit that interface. Many companies, for
example, make disk drives that match the SATA, SCSI, USB, Thunderbolt, or FireWire (IEEE 1394)
interfaces.
The interface between the controller and the device is often a very low-level one. A disk, for example,
might be formatted with 2,000,000 sectors of 512 bytes per track. What actually comes off the drive,
however, is a serial bit stream, starting with a preamble, then the 4096 bits in a sector, and finally a
checksum, or ECC (Error-Correcting Code). The preamble is written when the disk is formatted and
contains the cylinder and sector number, the sector size, and similar data, as well as synchronization
information.
The controller’s job is to convert the serial bit stream into a block of bytes and perform any error
correction necessary. The block of bytes is typically first assembled, bit by bit, in a buffer inside the
controller. After its checksum has been verified and the block has been declared to be error free, it can
then be copied to main memory.
The controller for an LCD display monitor also works as a bit serial device at an equally low level. It
reads bytes containing the characters to be displayed from memory and generates the signals to modify
the polarization of the backlight for the corresponding pixels in order to write them on screen. If it were
not for the display controller, the operating system programmer would have to explicitly program the
electric fields of all pixels. With the controller, the operating system initializes the controller with a few
parameters, such as the number of characters or pixels per line and number of lines per screen, and lets
the controller take care of actually driving the electric fields.
In a very short time, LCD screens have completely replaced the old CRT (Cathode Ray Tube) monitors.
CRT monitors fire a beam of electrons onto a fluorescent screen. Using magnetic fields, the system is
able to bend the beam and draw pixels on the screen. Compared to LCD screens, CRT monitors were
bulky, power hungry, and fragile. Moreover, the resolution on today ́s (Retina) LCD screens is so good
that the human eye is unable to distinguish individual pixels. It is hard to imagine today that laptops in the
past came with a small CRT screen that made them more than 20 cm deep with a nice work-out weight of
around 12 kilos.
5.1.3 Memory-Mapped I/O
Each controller has a few registers that are used for communicating with the CPU. By writing into these
registers, the operating system can command the device to deliver data, accept data, switch itself on or
off, or otherwise perform some action. By reading from these registers, the operating system can learn
what the device’s state is, whether it is prepared to accept a new command, and so on.
In addition to the control registers, many devices have a data buffer that the operating system can read
and write. For example, a common way for computers to display pixels on the screen is to have a video
RAM, which is basically just a data buffer, available for programs or the operating system to write into.
The issue thus arises of how the CPU communicates with the control registers and also with the device
data buffers. Two alternatives exist. In the first approach,
each control register is assigned an I/O port number, an 8- or 16-bit integer. The set of all the I/O ports
forms the I/O port space, which is protected so that ordinary user programs cannot access it (only the
operating system can). Using a special I/O instruction such as

IN REG,PORT

the CPU can read in control register PORT and store the result in CPU register REG. Similarly, using

OUT PORT,REG

the CPU can write the contents of REG to a control register. Most early computers, including nearly all
mainframes, such as the IBM 360 and all of its successors, worked this way.
In this scheme, the address spaces for memory and I/O are different, as shown in Fig. 5-2(a). The
instructions
IN R0,4
and
MOV R0,4
are completely different in this design. The former reads the contents of I/O port 4 and puts it in R0
whereas the latter reads the contents of memory word 4 and puts it in R0. The 4s in these examples refer
to different and unrelated address spaces.
Figure 5-2. (a) Separate I/O and memory space. (b) Memory-mapped I/O. (c) Hybrid.
The second approach, introduced with the PDP-11, is to map all the control registers into the memory
space, as shown in Fig. 5-2(b). Each control register is assigned a unique memory address to which no
memory is assigned. This system is called memory-mapped I/O. In most systems, the assigned addresses
are at or near the top of the address space. A hybrid scheme, with memory-mapped I/O data buffers and
separate I/O ports for the control registers, is shown in Fig. 5-2(c).
The x86 uses this architecture, with addresses 640K to 1M − 1 being reserved for device data buffers in
IBM PC compatibles, in addition to I/O ports 0 to 64K − 1.
How do these schemes actually work in practice? In all cases, when the CPU wants to read a word, either
from memory or from an I/O port, it puts the address it needs on the bus’ address lines and then asserts a
READ signal on a bus’ control line. A second signal line is used to tell whether I/O space or memory space
is needed. If it is memory space, the memory responds to the request. If it is I/O space, the I/O device
responds to the request. If there is only memory space [as in Fig. 5-2(b)], every memory module and
every I/O device compares the address lines to the range of addresses that it services. If the address falls
in its range, it responds to the request. Since no address is ever assigned to both memory and an I/O
device, there is no ambiguity and no conflict.
These two schemes for addressing the controllers have different strengths and weaknesses. Let us start
with the advantages of memory-mapped I/O. First of all, if special I/O instructions are needed to read and
write the device control registers, access to them requires the use of assembly code since there is no way
to execute an IN or OUT instruction in C or C++. Calling such a procedure adds overhead to controlling
I/O. In contrast, with memory-mapped I/O, device control registers are just variables in memory and can
be addressed in C the same way as any other variables. Thus with memory-mapped I/O, an I/O device
driver can be written entirely in C. Without memory-mapped I/O, some assembly code is needed.
Second, with memory-mapped I/O, no special protection mechanism is needed to keep user processes
from performing I/O. All the operating system has to do is refrain from putting that portion of the address
space containing the control registers in any user’s virtual address space. Better yet, if each device has
its control registers on a different page of the address space, the operating system can give a user control
over specific devices but not others by simply including the desired pages in its page table. Such a scheme
can allow different device drivers to be placed in different address spaces, not only reducing kernel size
but also keeping one driver from interfering with others.
Third, with memory-mapped I/O, every instruction that can reference memory can also reference control
registers. For example, if there is an instruction, TEST, that tests a memory word for 0, it can also be used
to test a control register for 0, which might be the signal that the device is idle and can accept a new
command. The assembly language code might look like this:
LOOP: TEST PORT 4      // check if port 4 is 0
      BEQ READY        // if it is 0, go to ready
      BRANCH LOOP      // otherwise, continue testing
READY:
If memory-mapped I/O is not present, the control register must first be read into the CPU, then tested,
requiring two instructions instead of just one. In the case of
the loop given above, a fourth instruction has to be added, slightly slowing down the responsiveness of
detecting an idle device.
In computer design, practically everything involves trade-offs, and that is the case here, too. Memory-mapped I/O also has its disadvantages. First, most computers nowadays have some form of caching of
memory words. Caching a device control register would be disastrous. Consider the assembly-code loop
given above in the presence of caching. The first reference to PORT 4 would cause it to be cached.
Subsequent references would just take the value from the cache and not even ask the device. Then when
the device finally became ready, the software would have no way of finding out. Instead, the loop would
go on forever.
To prevent this situation with memory-mapped I/O, the hardware has to be able to selectively disable
caching, for example, on a per-page basis. This feature adds extra complexity to both the hardware and
the operating system, which has to manage the selective caching.
Second, if there is only one address space, then all memory modules and all I/O devices must examine all
memory references to see which ones to respond to. If the computer has a single bus, as in Fig. 5-3(a),
having everyone look at every address is straightforward.
Figure 5-3. (a) A single-bus architecture. (b) A dual-bus memory architecture.
However, the trend in modern personal computers is to have a dedicated high-speed memory bus, as
shown in Fig. 5-3(b). The bus is tailored to optimize memory performance, with no compromises for the
sake of slow I/O devices. x86 systems can have multiple buses (memory, PCIe, SCSI, and USB), as
shown in Fig. 1-12.
The trouble with having a separate memory bus on memory-mapped machines is that the I/O devices have
no way of seeing memory addresses as they go by on the memory bus, so they have no way of responding
to them. Again, special measures have to be taken to make memory-mapped I/O work on a system with
multiple
buses. One possibility is to first send all memory references to the memory. If the memory fails to
respond, then the CPU tries the other buses. This design can be made to work but requires additional
hardware complexity.
A second possible design is to put a snooping device on the memory bus to pass all addresses presented to
potentially interested I/O devices. The problem here is that I/O devices may not be able to process
requests at the speed the memory can.
A third possible design, and one that would well match the design sketched in Fig. 1-12, is to filter
addresses in the memory controller. In that case, the memory controller chip contains range registers that
are preloaded at boot time. For example, 640K to 1M − 1 could be marked as a nonmemory range.
Addresses that fall within one of the ranges marked as nonmemory are forwarded to devices instead of
to memory. The disadvantage of this scheme is the need for figuring out at boot time which memory
addresses are not really memory addresses. Thus each scheme has arguments for and against it, so
compromises and trade-offs are inevitable.
5.1.4 Direct Memory Access
No matter whether a CPU does or does not have memory-mapped I/O, it needs to address the device
controllers to exchange data with them. The CPU can request data from an I/O controller one byte at a
time, but doing so wastes the CPU’s time, so a different scheme, called DMA (Direct Memory Access)
is often used. To simplify the explanation, we assume that the CPU accesses all devices and memory via a
single system bus that connects the CPU, the memory, and the I/O devices, as shown in Fig. 5-4. We
already know that the real organization in modern systems is more complicated, but all the principles are
the same. The operating system can use only DMA if the hardware has a DMA controller, which most
systems do. Sometimes this controller is integrated into disk controllers and other controllers, but such a
design requires a separate DMA controller for each device. More commonly, a single DMA controller is
available (e.g., on the parentboard) for regulating transfers to multiple devices, often concurrently.
No matter where it is physically located, the DMA controller has access to the system bus independent of
the CPU, as shown in Fig. 5-4. It contains several registers that can be written and read by the CPU.
These include a memory address register, a byte count register, and one or more control registers. The
control registers specify the I/O port to use, the direction of the transfer (reading from the I/O device or
writing to the I/O device), the transfer unit (byte at a time or word at a time), and the number of bytes to
transfer in one burst.
To explain how DMA works, let us first look at how disk reads occur when DMA is not used. First the
disk controller reads the block (one or more sectors) from the drive serially, bit by bit, until the entire
block is in the controller’s internal buffer. Next, it computes the checksum to verify that no read errors
have occurred.
Figure 5-4. Operation of a DMA transfer.
Then the controller causes an interrupt. When the operating system starts running, it can read the disk
block from the controller’s buffer a byte or a word at a time by executing a loop, with each iteration
reading one byte or word from a controller device register and storing it in main memory.
When DMA is used, the procedure is different. First the CPU programs the DMA controller by setting its
registers so it knows what to transfer where (step 1 in Fig. 5-4). It also issues a command to the disk
controller telling it to read data from the disk into its internal buffer and verify the checksum. When valid
data are in the disk controller’s buffer, DMA can begin.
The DMA controller initiates the transfer by issuing a read request over the bus to the disk controller (step
2). This read request looks like any other read request, and the disk controller does not know (or care)
whether it came from the CPU or from a DMA controller. Typically, the memory address to write to is on
the bus’ address lines, so when the disk controller fetches the next word from its internal buffer, it knows
where to write it. The write to memory is another standard bus cycle (step 3). When the write is complete,
the disk controller sends an acknowledgement signal to the DMA controller, also over the bus (step 4).
The DMA controller then increments the memory address to use and decrements the byte count. If the
byte count is still greater than 0, steps 2 through 4 are repeated until the count reaches 0. At that time, the
DMA controller interrupts the CPU to let it know that the transfer is now complete. When the operating
system starts up, it does not have to copy the disk block to memory; it is already there.
DMA controllers vary considerably in their sophistication. The simplest ones handle one transfer at a
time, as described above. More complex ones can be programmed to handle multiple transfers at the
same time. Such controllers have multiple sets of registers internally, one for each channel. The CPU
starts by loading each set of registers with the relevant parameters for its transfer. Each transfer must
use a different device controller. After each word is transferred (steps 2 through 4) in Fig. 5-4, the DMA
controller decides which device to service next. It may be set up to use a round-robin algorithm, or it may
have a priority scheme designed to favor some devices over others. Multiple requests to different device
controllers may be pending at the same time, provided that there is an unambiguous way to tell the acknowledgements apart. Often a different acknowledgement line on the bus is used for each DMA channel
for this reason.
Many buses can operate in two modes: word-at-a-time mode and block mode. Some DMA controllers can
also operate in either mode. In the former mode, the operation is as described above: the DMA controller
requests the transfer of one word and gets it. If the CPU also wants the bus, it has to wait. The mechanism
is called cycle stealing because the device controller sneaks in and steals an occasional bus cycle from
the CPU once in a while, delaying it slightly. In block mode, the DMA controller tells the device to
acquire the bus, issue a series of transfers, then release the bus. This form of operation is called burst
mode. It is more efficient than cycle stealing because acquiring the bus takes time and multiple words
can be transferred for the price of one bus acquisition. The down side to burst mode is that it can block the
CPU and other devices for a substantial period if a long burst is being transferred.
In the model we have been discussing, sometimes called fly-by mode, the DMA controller tells the
device controller to transfer the data directly to main memory. An alternative mode that some DMA
controllers use is to have the device controller send the word to the DMA controller, which then issues a
 
second bus request to write the word to wherever it is supposed to go. This scheme requires an extra bus
cycle per word transferred, but is more flexible in that it can also perform device-to-device copies and
even memory-to-memory copies (by first issuing a read to memory and then issuing a write to memory at
a different address).
Most DMA controllers use physical memory addresses for their transfers. Using physical addresses
requires the operating system to convert the virtual address of the intended memory buffer into a
physical address and write this physical address into the DMA controller’s address register. An alternative
scheme used in a few DMA controllers is to write virtual addresses into the DMA controller instead.
Then the DMA controller must use the MMU to have the virtual-to-physical translation done. Only in the
case that the MMU is part of the memory (possible, but rare), rather than part of the CPU, can virtual
addresses be put on the bus.
We mentioned earlier that the disk first reads data into its internal buffer before DMA can start. You may
be wondering why the controller does not just store the bytes in main memory as soon as it gets them
from the disk. In other words, why does it need an internal buffer? There are two reasons. First, by doing
internal buffering, the disk controller can verify the checksum before starting a transfer. If the checksum
is incorrect, an error is signaled and no transfer is done.
The second reason is that once a disk transfer has started, the bits keep arriving from the disk at a constant
rate, whether the controller is ready for them or not. If
the controller tried to write data directly to memory, it would have to go over the system bus for each
word transferred. If the bus were busy due to some other device using it (e.g., in burst mode), the
controller would have to wait. If the next disk word arrived before the previous one had been stored, the
controller would have to store it somewhere. If the bus were very busy, the controller might end up
storing quite a few words and having a lot of administration to do as well. When the block is buffered
internally, the bus is not needed until the DMA begins, so the design of the controller is much simpler
because the DMA transfer to memory is not time critical. (Some older controllers did, in fact, go directly
to memory with only a small amount of internal buffering, but when the bus was very busy, a transfer
might have had to be terminated with an overrun error.)
Not all computers use DMA. The argument against it is that the main CPU is often far faster than the
DMA controller and can do the job much faster (when the limiting factor is not the speed of the I/O
device). If there is no other work for it to do, having the (fast) CPU wait for the (slow) DMA controller to
finish is pointless. Also, getting rid of the DMA controller and having the CPU do all the work in
software saves money, important on low-end (embedded) computers.
5.1.5 Interrupts Revisited
We briefly introduced interrupts in Sec. 1.3.4, but there is more to be said. In a typical personal computer
system, the interrupt structure is as shown in Fig. 5-5. At the hardware level, interrupts work as follows.
When an I/O device has finished the work given to it, it causes an interrupt (assuming that interrupts have
been enabled by the operating system). It does this by asserting a signal on a bus line that it has been
assigned. This signal is detected by the interrupt controller chip on the parentboard, which then decides
what to do.
Figure 5-5. How an interrupt happens. The connections between the devices and the controller actually use interrupt lines on the
bus rather than dedicated wires.
If no other interrupts are pending, the interrupt controller handles the interrupt immediately. However, if
another interrupt is in progress, or another device has made a simultaneous request on a higher-priority
interrupt request line on the bus,
the device is just ignored for the moment. In this case it continues to assert an interrupt signal on the bus
until it is serviced by the CPU.
To handle the interrupt, the controller puts a number on the address lines specifying which device wants
attention and asserts a signal to interrupt the CPU.
The interrupt signal causes the CPU to stop what it is doing and start doing something else. The number
on the address lines is used as an index into a table called the interrupt vector to fetch a new program
counter. This program counter points to the start of the corresponding interrupt-service procedure.
Typically traps and interrupts use the same mechanism from this point on, often sharing the same
interrupt vector. The location of the interrupt vector can be hardwired into the machine or it can be
anywhere in memory, with a CPU register (loaded by the operating system) pointing to its origin.
Shortly after it starts running, the interrupt-service procedure acknowledges the interrupt by writing a
certain value to one of the interrupt controller’s I/O ports. This acknowledgement tells the controller that
it is free to issue another interrupt. By having the CPU delay this acknowledgement until it is ready to
handle the next interrupt, race conditions involving multiple (almost simultaneous) interrupts can be
avoided. As an aside, some (older) computers do not have a centralized interrupt controller, so each
device controller requests its own interrupts.
The hardware always saves certain information before starting the service procedure. Which information
is saved and where it is saved varies greatly from CPU to CPU. As a bare minimum, the program counter
must be saved, so the interrupted process can be restarted. At the other extreme, all the visible registers
and a large number of internal registers may be saved as well.
One issue is where to save this information. One option is to put it in internal registers that the operating
system can read out as needed. A problem with this approach is that then the interrupt controller cannot
be acknowledged until all potentially relevant information has been read out, lest a second interrupt
overwrite the internal registers saving the state. This strategy leads to long dead times when interrupts
are disabled and possibly to lost interrupts and lost data.
Consequently, most CPUs save the information on the stack. However, this approach, too, has problems.
To start with: whose stack? If the current stack is used, it may well be a user process stack. The stack
pointer may not even be legal, which would cause a fatal error when the hardware tried to write some
words at the address pointed to. Also, it might point to the end of a page. After several memory writes,
the page boundary might be exceeded and a page fault generated. Having a page fault occur during the
hardware interrupt processing creates a bigger problem: where to save the state to handle the page fault?
If the kernel stack is used, there is a much better chance of the stack pointer being legal and pointing to a
pinned page. However, switching into kernel mode may require changing MMU contexts and will
probably invalidate most or all of the cache and TLB. Reloading all of these, statically or dynamically,
will increase the time to process an interrupt and thus waste CPU time.
Precise and Imprecise Interrupts
Another problem is caused by the fact that most modern CPUs are heavily pipelined and often superscalar
(internally parallel). In older systems, after each instruction was finished executing, the microprogram or
hardware checked to see if there was an interrupt pending. If so, the program counter and PSW were
pushed onto the stack and the interrupt sequence begun. After the interrupt handler ran, the reverse
process took place, with the old PSW and program counter popped from the stack and the previous
process continued.
This model makes the implicit assumption that if an interrupt occurs just after some instruction, all the
instructions up to and including that instruction have been executed completely, and no instructions after
it have executed at all. On older machines, this assumption was always valid. On modern ones it may not be.
For starters, consider the pipeline model of Fig. 1-7(a). What happens if an interrupt occurs while the pipeline is full (the usual case)? Many instructions are in various stages of execution. When the interrupt occurs, the value of the program counter may not reflect the correct boundary between executed instructions and nonexecuted instructions. In fact, many instructions may have been partially executed, with different instructions being more or less complete. In this situation, the program counter most likely reflects the address of the next instruction to be fetched and pushed into the pipeline rather than the address of the instruction that was just processed by the execution unit.
On a superscalar machine, such as that of Fig. 1-7(b), things are even worse. Instructions may be decomposed into micro-operations and the micro-operations may execute out of order, depending on the availability of internal resources such as functional units and registers. At the time of an interrupt, some instructions started long ago may not have finished and others started more recently may be almost done. At the point when an interrupt is signaled, there may be many instructions in various states of completeness, with little relation between them and the program counter.
An interrupt that leaves the machine in a well-defined state is called a precise interrupt (Walker and
Cragon, 1995). Such an interrupt has four properties:
1. The PC (Program Counter) is saved in a known place.

2. All instructions before the one pointed to by the PC have completed.

3. No instruction beyond the one pointed to by the PC has finished.

4. The execution state of the instruction pointed to by the PC is known.
Note that there is no prohibition on instructions beyond the one pointed to by the PC from starting. It is
just that any changes they make to registers or memory must be undone before the interrupt happens. It is
permitted that the instruction pointed to has been executed. It is also permitted that it has not been
executed.
350 INPUT/OUTPUT CHAP. 5
However, it must be clear which case applies. Often, if the interrupt is an I/O interrupt, the instruction will not yet have started. However, if the interrupt is really a trap or page fault, then the PC generally points to the instruction that caused the fault so it can be restarted later. The situation of Fig. 5-6(a) illustrates a precise interrupt. All instructions up to the program counter (316) have completed and none of those beyond it have started (or have been rolled back to undo their effects).
Figure 5-6. (a) A precise interrupt: the saved PC points to 316; the instructions at addresses 300 through 312 are fully executed and those at 316 through 332 are not executed. (b) An imprecise interrupt: instructions near the saved PC are in varying states of completion (from the bottom up: fully executed, 80% executed, 60%, 20%, 35%, 40%, 10% executed, and not executed).
An interrupt that does not meet these requirements is called an imprecise interrupt and makes life most unpleasant for the operating system writer, who now has to figure out what has happened and what still has to happen. Fig. 5-6(b) illustrates an imprecise interrupt, where different instructions near the program counter are in different stages of completion, with older ones not necessarily more complete than younger ones. Machines with imprecise interrupts usually vomit a large amount of internal state onto the stack to give the operating system the possibility of figuring out what was going on. The code necessary to restart the machine is typically exceedingly complicated. Also, saving a large amount of information to memory on every interrupt makes interrupts slow and recovery even worse. This leads to the ironic situation of having very fast superscalar CPUs sometimes being unsuitable for real-time work due to slow interrupts.
Some computers are designed so that some kinds of interrupts and traps are precise and others are not. For example, having I/O interrupts be precise but traps due to fatal programming errors be imprecise is not so bad, since no attempt need be made to restart a running process after it has divided by zero. Some machines have a bit that can be set to force all interrupts to be precise. The downside of setting this bit is that it forces the CPU to carefully log everything it is doing and maintain shadow copies of registers so it can generate a precise interrupt at any instant. All this overhead has a major impact on performance. Some superscalar machines, such as the x86 family, have precise interrupts to allow old software to work correctly. The price paid for backward compatibility with precise interrupts is extremely complex interrupt logic within the CPU to make sure that when the interrupt controller signals that it wants to cause an interrupt, all instructions up to some point are allowed to finish and none beyond that
point are allowed to have any noticeable effect on the machine state. Here the price is paid not in time, but in chip area and in complexity of the design. If precise interrupts were not required for backward compatibility purposes, this chip area would be available for larger on-chip caches, making the CPU faster. On the other hand, imprecise interrupts make the operating system far more complicated and slower, so it is hard to tell which approach is really better.
5.2 PRINCIPLES OF I/O SOFTWARE
Let us now turn away from the I/O hardware and look at the I/O software. First we will look at its goals
and then at the different ways I/O can be done from the point of view of the operating system.
5.2.1 Goals of the I/O Software
A key concept in the design of I/O software is known as device independence. What it means is that we
should be able to write programs that can access any I/O device without having to specify the device in
advance. For example, a program that reads a file as input should be able to read a file on a hard disk, a
DVD, or on a USB stick without having to be modified for each different device. Similarly, one should be
able to type a command such as
sort <input >output
and have it work with input coming from any kind of disk or the keyboard and the output going to any
kind of disk or the screen. It is up to the operating system to take care of the problems caused by the fact
that these devices really are different and require very different command sequences to read or write.
Closely related to device independence is the goal of uniform naming. The name of a file or a device
should simply be a string or an integer and not depend on the device in any way. In UNIX, all disks can be integrated in the file-system hierarchy in arbitrary ways, so the user need not be aware of which name corresponds to which device. For example, a USB stick can be mounted on top of the directory
/usr/ast/backup so that copying a file to /usr/ast/backup/monday copies the file to the USB stick. In this
way, all files and devices are addressed the same way: by a path name.
Another important issue for I/O software is error handling. In general, errors should be handled as close
to the hardware as possible. If the controller discovers a read error, it should try to correct the error itself
if it can. If it cannot, then the device driver should handle it, perhaps by just trying to read the block
again. Many errors are transient, such as read errors caused by specks of dust on the read head, and will
frequently go away if the operation is repeated. Only if the lower layers
are not able to deal with the problem should the upper layers be told about it. In many cases, error
recovery can be done transparently at a low level without the upper levels even knowing about the error.
Still another important issue is that of synchronous (blocking) vs. asynchronous (interrupt-driven) transfers. Most physical I/O is asynchronous—the CPU starts the transfer and goes off to do something else until the interrupt arrives. User programs are much easier to write if the I/O operations are blocking—after a read system call the program is automatically suspended until the data are available in the buffer. It is up to the operating system to make operations that are actually interrupt-driven look blocking to the user programs. However, some very high-performance applications need to control all the details of the I/O, so some operating systems make asynchronous I/O available to them.
Another issue for the I/O software is buffering. Often data that come off a device cannot be stored directly in their final destination. For example, when a packet comes in off the network, the operating system does not know where to put it until it has stored the packet somewhere and examined it. Also, some devices have severe real-time constraints (for example, digital audio devices), so the data must be put into an output buffer in advance to decouple the rate at which the buffer is filled from the rate at which it is emptied, in order to avoid buffer underruns. Buffering involves considerable copying and often has a major impact on I/O performance.
The final concept that we will mention here is sharable vs. dedicated devices. Some I/O devices, such as
disks, can be used by many users at the same time. No problems are caused by multiple users having open
files on the same disk at the same time. Other devices, such as printers, have to be dedicated to a single
user until that user is finished. Then another user can have the printer. Having two or more users writing
characters intermixed at random to the same page will definitely not work. Introducing dedicated (unshared) devices also introduces a variety of problems, such as deadlocks. Again, the operating system must be able to handle both shared and dedicated devices in a way that avoids problems.
5.2.2 Programmed I/O
There are three fundamentally different ways that I/O can be performed. In this section we will look at the first one (programmed I/O). In the next two sections we will examine the others (interrupt-driven I/O and I/O using DMA). The simplest form of I/O is to have the CPU do all the work. This method is called programmed I/O.
It is simplest to illustrate how programmed I/O works by means of an example. Consider a user process that wants to print the eight-character string ‘‘ABCDEFGH’’ on the printer via a serial interface. Displays on small embedded systems sometimes work this way. The software first assembles the string in a buffer in user space, as shown in Fig. 5-7(a).
Figure 5-7. Steps in printing a string. (a) The string ‘‘ABCDEFGH’’ assembled in a buffer in user space and copied to kernel space. (b) The first character printed, with ‘‘B’’ marked as the next character. (c) The first two characters printed.
The user process then acquires the printer for writing by making a system call to open it. If the printer is
currently in use by another process, this call will fail and return an error code or will block until the
printer is available, depending on the operating system and the parameters of the call. Once it has the
printer, the user process makes a system call telling the operating system to print the string on the printer.
The operating system then (usually) copies the buffer with the string to an array, say, p, in kernel space, where it is more easily accessed (because the kernel may have to change the memory map to get at user space). It then checks to see if the printer is currently available. If not, it waits until it is. As soon as the printer is available, the operating system copies the first character to the printer’s data register, in this example using memory-mapped I/O. This action activates the printer. The character may not appear yet because some printers buffer a line or a page before printing anything. In Fig. 5-7(b), however, we see that the first character has been printed and that the system has marked the ‘‘B’’ as the next character to be printed.
As soon as it has copied the first character to the printer, the operating system checks to see if the printer
is ready to accept another one. Generally, the printer has a second register, which gives its status. The act
of writing to the data register causes the status to become not ready. When the printer controller has
processed the current character, it indicates its availability by setting some bit in its status register or putting some value in it.
At this point the operating system waits for the printer to become ready again. When that happens, it
prints the next character, as shown in Fig. 5-7(c). This loop continues until the entire string has been
printed. Then control returns to the user process.
The actions followed by the operating system are briefly summarized in Fig. 5-8. First the data are copied
to the kernel. Then the operating system enters a
tight loop, outputting the characters one at a time. The essential aspect of programmed I/O, clearly illustrated in this figure, is that after outputting a character, the CPU continuously polls the device to see if it is ready to accept another one. This behavior is often called polling or busy waiting.
copy_from_user(buffer, p, count);          /* p is the kernel buffer */
for (i = 0; i < count; i++) {              /* loop on every character */
    while (*printer_status_reg != READY);  /* loop until ready */
    *printer_data_register = p[i];         /* output one character */
}
return_to_user();

Figure 5-8. Writing a string to the printer using programmed I/O.
Programmed I/O is simple but has the disadvantage of tying up the CPU full time until all the I/O is done.
If the time to ‘‘print’’ a character is very short (because all the printer is doing is copying the new
character to an internal buffer), then busy waiting is fine. Also, in an embedded system, where the CPU
has nothing else to do, busy waiting is fine. However, in more complex systems, where the CPU has other
work to do, busy waiting is inefficient. A better I/O method is needed.
5.2.3 Interrupt-Driven I/O
Now let us consider the case of printing on a printer that does not buffer characters but prints each one as it arrives. If the printer can print, say, 100 characters/sec, each character takes 10 msec to print. This means that after every character is written to the printer’s data register, the CPU will sit in an idle loop for 10 msec waiting to be allowed to output the next character. This is more than enough time to do a context switch and run some other process for the 10 msec that would otherwise be wasted.
The way to allow the CPU to do something else while waiting for the printer to become ready is to use
interrupts. When the system call to print the string is made, the buffer is copied to kernel space, as we
showed earlier, and the first character is copied to the printer as soon as it is willing to accept a character.
At that point the CPU calls the scheduler and some other process is run. The process that asked for the
string to be printed is blocked until the entire string has printed. The work done on the system call is
shown in Fig. 5-9(a).
When the printer has printed the character and is prepared to accept the next one, it generates an interrupt.
This interrupt stops the current process and saves its state. Then the printer interrupt-service procedure is
run. A crude version of this code is shown in Fig. 5-9(b). If there are no more characters to print, the
interrupt handler takes some action to unblock the user. Otherwise, it outputs the next character, acknowledges the interrupt, and returns to the process that was running just before the interrupt, which continues from where it left off.
copy_from_user(buffer, p, count);
enable_interrupts();
while (*printer_status_reg != READY);
*printer_data_register = p[0];
scheduler();

(a)

if (count == 0) {
    unblock_user();
} else {
    *printer_data_register = p[i];
    count = count - 1;
    i = i + 1;
}
acknowledge_interrupt();
return_from_interrupt();

(b)

Figure 5-9. Writing a string to the printer using interrupt-driven I/O. (a) Code executed at the time the print system call is made. (b) Interrupt-service procedure for the printer.
5.2.4 I/O Using DMA
An obvious disadvantage of interrupt-driven I/O is that an interrupt occurs on every character. Interrupts
take time, so this scheme wastes a certain amount of CPU time. A solution is to use DMA. Here the idea
is to let the DMA controller feed the characters to the printer one at a time, without the CPU being
bothered. In essence, DMA is programmed I/O, only with the DMA controller doing all the work, instead
of the main CPU. This strategy requires special hardware (the DMA controller) but frees up the CPU
during the I/O to do other work. An outline of the code is given in Fig. 5-10.
copy_from_user(buffer, p, count);
set_up_DMA_controller();
scheduler();

(a)

acknowledge_interrupt();
unblock_user();
return_from_interrupt();

(b)

Figure 5-10. Printing a string using DMA. (a) Code executed when the print system call is made. (b) Interrupt-service procedure.
The big win with DMA is reducing the number of interrupts from one per character to one per buffer
printed. If there are many characters and interrupts are slow, this can be a major improvement. On the
other hand, the DMA controller is usually much slower than the main CPU. If the DMA controller is not
capable of driving the device at full speed, or the CPU usually has nothing to do anyway while waiting for
the DMA interrupt, then interrupt-driven I/O or even programmed I/O may be better. Most of the time, though, DMA is worth it.
5.3 I/O SOFTWARE LAYERS
I/O software is typically organized in four layers, as shown in Fig. 5-11. Each layer has a well-defined function to perform and a well-defined interface to the adjacent layers. The functionality and interfaces differ from system to system, so the discussion that follows, which examines all the layers starting at the bottom, is not specific to one machine.
User-level I/O software
Device-independent operating system software
Device drivers
Interrupt handlers
Hardware
Figure 5-11. Layers of the I/O software system.
5.3.1 Interrupt Handlers
While programmed I/O is occasionally useful, for most I/O, interrupts are an unpleasant fact of life and cannot be avoided. They should be hidden away, deep in the bowels of the operating system, so that as little of the operating system as possible knows about them. The best way to hide them is to have the driver starting an I/O operation block until the I/O has completed and the interrupt occurs. The driver can block itself, for example, by doing a down on a semaphore, a wait on a condition variable, a receive on a message, or something similar.
When the interrupt happens, the interrupt procedure does whatever it has to in order to handle the interrupt. Then it can unblock the driver that was waiting for it. In some cases it will just do an up on a semaphore. In others it will do a signal on a condition variable in a monitor. In still others, it will send a message to the blocked driver. In all cases the net effect of the interrupt will be that a driver that was previously blocked will now be able to run. This model works best if drivers are structured as kernel processes, with their own states, stacks, and program counters.
Of course, reality is not quite so simple. Processing an interrupt is not just a matter of taking the interrupt,
doing an up on some semaphore, and then executing an IRET instruction to return from the interrupt to the
previous process. There is a great deal more work involved for the operating system. We will now give an outline of this work as a series of steps that must be performed in software after the hardware interrupt has completed. It should be noted that the details are highly
system dependent, so some of the steps listed below may not be needed on a particular machine, and steps not listed may be required. Also, the steps that do occur may be in a different order on some machines.

1. Save any registers (including the PSW) that have not already been saved by the interrupt hardware.

2. Set up a context for the interrupt-service procedure. Doing this may involve setting up the TLB, MMU, and a page table.

3. Set up a stack for the interrupt-service procedure.

4. Acknowledge the interrupt controller. If there is no centralized interrupt controller, reenable interrupts.

5. Copy the registers from where they were saved (possibly some stack) to the process table.

6. Run the interrupt-service procedure. It will extract information from the interrupting device controller’s registers.

7. Choose which process to run next. If the interrupt has caused some high-priority process that was blocked to become ready, it may be chosen to run now.

8. Set up the MMU context for the process to run next. Some TLB setup may also be needed.

9. Load the new process’ registers, including its PSW.

10. Start running the new process.
As can be seen, interrupt processing is far from trivial. It also takes a considerable number of CPU
instructions, especially on machines in which virtual memory is present and page tables have to be set up
or the state of the MMU stored (e.g., the R and M bits). On some machines the TLB and CPU cache may
also have to be managed when switching between user and kernel modes, which takes additional machine
cycles.
5.3.2 Device Drivers
Earlier in this chapter we looked at what device controllers do. We saw that each controller has some device registers used to give it commands, to read out its status, or both. The
number of device registers and the nature of the commands vary radically from device to device. For
example, a mouse driver has to accept information from the mouse telling it how far it has moved and
which buttons are currently depressed. In contrast, a disk driver may
have to know all about sectors, tracks, cylinders, heads, arm motion, motor drives, head settling times,
and all the other mechanics of making the disk work properly. Obviously, these drivers will be very
different.
Consequently, each I/O device attached to a computer needs some device-specific code for controlling it. This code, called the device driver, is generally written by the device’s manufacturer and delivered along with the device. Since each operating system needs its own drivers, device manufacturers commonly supply drivers for several popular operating systems.
Each device driver normally handles one device type, or at most, one class of closely related devices. For example, a SCSI disk driver can usually handle multiple SCSI disks of different sizes and different speeds, and perhaps a SCSI Blu-ray disk as well. On the other hand, a mouse and joystick are so different that different drivers are usually required. However, there is no technical restriction on having one device driver control multiple unrelated devices. It is just not a good idea in most cases.
Sometimes, though, wildly different devices are based on the same underlying technology. The best-known example is probably USB, a serial bus technology that is not called ‘‘universal’’ for nothing. USB devices include disks, memory sticks, cameras, mice, keyboards, mini-fans, wireless network cards, robots, credit card readers, rechargeable shavers, paper shredders, bar code scanners, disco balls, and portable thermometers. They all use USB and yet they all do very different things. The trick is that USB drivers are typically stacked, like a TCP/IP stack in networks. At the bottom, typically in hardware, we find the USB link layer (serial I/O) that handles hardware stuff like signaling and decoding a stream of signals to USB packets. It is used by higher layers that deal with the data packets and the common functionality for USB that is shared by most devices. On top of that, finally, we find the higher-layer APIs such as the interfaces for mass storage, cameras, etc. Thus, we still have separate device drivers, even though they share part of the protocol stack.
In order to access the device’s hardware, actually meaning the controller’s registers, the device driver normally has to be part of the operating system kernel, at least with current architectures. Actually, it is possible to construct drivers that run in user space, with system calls for reading and writing the device registers. This design isolates the kernel from the drivers and the drivers from each other, eliminating a major source of system crashes—buggy drivers that interfere with the kernel in one way or another. For building highly reliable systems, this is definitely the way to go. An example of a system in which the device drivers run as user processes is MINIX 3 (www.minix3.org). However, since most other desktop operating systems expect drivers to run in the kernel, that is the model we will consider here.
Since the designers of every operating system know that pieces of code (drivers) written by outsiders will be installed in it, it needs to have an architecture that allows such installation. This means having a well-defined model of what a driver
does and how it interacts with the rest of the operating system. Device drivers are normally positioned
below the rest of the operating system, as is illustrated in Fig. 5-12.
Figure 5-12. Logical positioning of device drivers. User processes run in user space above the rest of the operating system; the printer, camcorder, and CD-ROM drivers sit in kernel space below it, each talking to its own controller, which in turn connects to the device. In reality all communication between drivers and device controllers goes over the bus.
Operating systems usually classify drivers into one of a small number of categories. The most common categories are the block devices, such as disks, which contain multiple data blocks that can be addressed independently, and the character devices, such as keyboards and printers, which generate or accept a stream of characters.
Most operating systems define a standard interface that all block drivers must support and a second
standard interface that all character drivers must support. These interfaces consist of a number of
procedures that the rest of the operating system can call to get the driver to do work for it. Typical
procedures are those to read a block (block device) or write a character string (character device).
In some systems, the operating system is a single binary program that contains all of the drivers it will
need compiled into it. This scheme was the norm for years
with UNIX systems because they were run by computer centers and I/O devices rarely changed. If a new device was added, the system administrator simply recompiled the kernel with the new driver to build a new binary.
With the advent of personal computers, with their myriad I/O devices, this model no longer worked. Few
users are capable of recompiling or relinking the kernel, even if they have the source code or object
modules, which is not always the case. Instead, operating systems, starting with MS-DOS, went over to a
model in which drivers were dynamically loaded into the system during execution. Different systems handle loading drivers in different ways.
A device driver has several functions. The most obvious one is to accept abstract read and write requests
from the device-independent software above it and see that they are carried out. But there are also a few other functions they must perform. For example, the driver must initialize the device, if needed. It may also need to manage its power requirements and log events.
Many device drivers have a similar general structure. A typical driver starts out by checking the input parameters to see if they are valid. If not, an error is returned. If they are valid, a translation from abstract to concrete terms may be needed. For a disk driver, this may mean converting a linear block number into the head, track, sector, and cylinder numbers for the disk’s geometry.
Next the driver may check if the device is currently in use. If it is, the request will be queued for later processing. If the device is idle, the hardware status will be examined to see if the request can be handled now. It may be necessary to switch the device on or start a motor before transfers can be begun. Once the device is on and ready to go, the actual control can begin.
Controlling the device means issuing a sequence of commands to it. The driver is the place where the command sequence is determined, depending on what has to be done. After the driver knows which commands it is going to issue, it starts writing them into the controller’s device registers. After each command is written to the controller, it may be necessary to check to see if the controller accepted the command and is prepared to accept the next one. This sequence continues until all the commands have been issued. Some controllers can be given a linked list of commands (in memory) and told to read and process them all by itself without further help from the operating system.
After the commands have been issued, one of two situations will apply. In many cases the device driver must wait until the controller does some work for it, so it blocks itself until the interrupt comes in to unblock it. In other cases, however, the operation finishes without delay, so the driver need not block. As an example of the latter situation, scrolling the screen requires just writing a few bytes into the controller’s registers. No mechanical motion is needed, so the entire operation can be completed in nanoseconds.
In the former case, the blocked driver will be awakened by the interrupt. In the latter case, it will never go to sleep. Either way, after the operation has been completed, the driver must check for errors. If everything is all right, the driver may have some data to pass to the device-independent software (e.g., a block just read). Finally, it returns some status information for error reporting back to its caller. If any other requests are queued, one of them can now be selected and started. If nothing is queued, the driver blocks waiting for the next request.
This simple model is only a rough approximation to reality. Many factors make the code much more complicated. For one thing, an I/O device may complete while a driver is running, interrupting the driver. The interrupt may cause a device driver to run. In fact, it may cause the current driver to run. For example, while the network driver is processing an incoming packet, another packet may arrive. Consequently, drivers have to be reentrant, meaning that a running driver has to expect that it will be called a second time before the first call has completed.
In a hot-pluggable system, devices can be added or removed while the computer is running. As a result, while a driver is busy reading from some device, the system may inform it that the user has suddenly removed that device from the system. Not only must the current I/O transfer be aborted without damaging any kernel data structures, but any pending requests for the now-vanished device must also be gracefully removed from the system and their callers given the bad news. Furthermore, the unexpected addition of new devices may cause the kernel to juggle resources (e.g., interrupt request lines), taking old ones away from the driver and giving it new ones in their place.
Drivers are not allowed to make system calls, but they often need to interact with the rest of the kernel.
Usually, calls to certain kernel procedures are permitted. For example, there are usually calls to allocate
and deallocate hardwired pages of memory for use as buffers. Other useful calls are needed to manage the
MMU, timers, the DMA controller, the interrupt controller, and so on.
5.3.3 Device-Independent I/O Software
Although some of the I/O software is device specific, other parts of it are device independent. The exact boundary between the drivers and the device-independent software is system (and device) dependent, because some functions that could be done in a device-independent way may actually be done in the drivers, for efficiency or other reasons. The functions shown in Fig. 5-13 are typically done in the device-independent software.
Uniform interfacing for device drivers
Buffering
Error reporting
Allocating and releasing dedicated devices
Providing a device-independent block size
Figure 5-13. Functions of the device-independent I/O software.
The basic function of the device-independent software is to perform the I/O functions that are common to
all devices and to provide a uniform interface to the user-level software. We will now look at the above
issues in more detail.
Uniform Interfacing for Device Drivers
A major issue in an operating system is how to make all I/O devices and drivers look more or less the same. If disks, printers, keyboards, and so on, are all interfaced in different ways, every time a new device comes along, the operating system must be modified for the new device. Having to hack on the operating system for each new device is not a good idea.
One aspect of this issue is the interface between the device drivers and the rest of the operating system. In Fig. 5-14(a) we illustrate a situation in which each device driver has a different interface to the operating system. What this means is that the driver functions available for the system to call differ from driver to driver. It might also mean that the kernel functions that the driver needs also differ from driver to driver.
Taken together, it means that interfacing each new driver requires a lot of new programming effort.
Figure 5-14. (a) Without a standard driver interface. (b) With a standard driver interface.
In contrast, in Fig. 5-14(b), we show a different design in which all drivers have the same interface. Now it becomes much easier to plug in a new driver, provided it conforms to the driver interface. It also means that driver writers know what is expected of them. In practice, not all devices are absolutely identical, but usually there are only a small number of device types and even these are generally almost the same.
The way this works is as follows. For each class of devices, such as disks or printers, the operating system
defines a set of functions that the driver must supply. For a disk these would naturally include read and
write, but also turning the power
on and off, formatting, and other disky things. Often the driver holds a table with pointers into itself for
these functions. When the driver is loaded, the operating system records the address of this table of
function pointers, so when it needs to call one of the functions, it can make an indirect call via this table.
This table of function pointers defines the interface between the driver and the rest of the operat- ing
system. All devices of a given class (disks, printers, etc.) must obey it.
Another aspect of having a uniform interface is how I/O devices are named. The device-independent
software takes care of mapping symbolic device names onto the proper driver. For example, in UNIX a
device name, such as /dev/disk0, uniquely specifies the i-node for a special file, and this i-node contains
the major device number, which is used to locate the appropriate driver. The i-node also contains the
minor device number, which is passed as a parameter to the driver in order to specify the unit to be read
or written. All devices have major and minor numbers, and all drivers are accessed by using the major
device number to select the driver.
Closely related to naming is protection. How does the system prevent users from accessing devices that
they are not entitled to access? In both UNIX and Windows, devices appear in the file system as named
objects, which means that the usual protection rules for files also apply to I/O devices. The system
administrator can then set the proper permissions for each device.
Buffering
Buffering is also an issue, both for block and character devices, for a variety of reasons. To see one of them, consider a process that wants to read data from an ADSL (Asymmetric Digital Subscriber Line) modem, something many people use at home to connect to the Internet. One possible strategy for dealing with the incoming characters is to have the user process do a read system call and block waiting for one character. Each arriving character causes an interrupt. The interrupt-service procedure hands the character to the user process and unblocks it. After putting the character somewhere, the process reads another character and blocks again. This model is indicated in Fig. 5-15(a).
The trouble with this way of doing business is that the user process has to be started up for every
incoming character. Allowing a process to run many times for short runs is inefficient, so this design is
not a good one.
An improvement is shown in Fig. 5-15(b). Here the user process provides an n-character buffer in user space and does a read of n characters. The interrupt-service procedure puts incoming characters in this buffer until it is completely full. Only then does it wake up the user process. This scheme is far more efficient than the previous one, but it has a drawback: what happens if the buffer is paged out when a character arrives? The buffer could be locked in memory, but if many processes start locking pages in memory willy-nilly, the pool of available pages will shrink and performance will degrade.
Figure 5-15. (a) Unbuffered input. (b) Buffering in user space. (c) Buffering in the kernel followed by copying to user space. (d)
Double buffering in the kernel.
Yet another approach is to create a buffer inside the kernel and have the interrupt handler put the characters there, as shown in Fig. 5-15(c). When this buffer is full, the page with the user buffer is brought in, if needed, and the buffer copied there in one operation. This scheme is far more efficient.
However, even this improved scheme suffers from a problem: What happens to characters that arrive
while the page with the user buffer is being brought in from the disk? Since the buffer is full, there is no
place to put them. A way out is to have a second kernel buffer. After the first buffer fills up, but before it
has been emptied, the second one is used, as shown in Fig. 5-15(d). When the second buffer fills up, it is
available to be copied to the user (assuming the user has asked for it). While the second buffer is being
copied to user space, the first one can be used for new characters. In this way, the two buffers take turns:
while one is being copied to user space, the other is accumulating new input. A buffering scheme like this
is called double buffering.
Another common form of buffering is the circular buffer. It consists of a region of memory and two pointers. One pointer points to the next free word, where new data can be placed. The other pointer points
to the first word of data in the buffer that has not been removed yet. In many situations, the hardware
advances the first pointer as it adds new data (e.g., just arriving from the network) and the operating
system advances the second pointer as it removes and processes data. Both pointers wrap around, going
back to the bottom when they hit the top.
Buffering is also important on output. Consider, for example, how output is done to the modem without buffering using the model of Fig. 5-15(b). The user process executes a write system call to output n characters. The system has two choices at this point. It can block the user until all the characters have been written, but this could take a very long time over a slow telephone line. It could also release the user immediately and do the I/O while the user computes some more,
but this leads to an even worse problem: how does the user process know that the output has been
completed and it can reuse the buffer? The system could generate a signal or software interrupt, but that
style of programming is difficult and prone to race conditions. A much better solution is for the kernel to
copy the data to a kernel buffer, analogous to Fig. 5-15(c) (but the other way), and unblock the caller
immediately. Now it does not matter when the actual I/O has been completed. The user is free to reuse the
buffer the instant it is unblocked.
Buffering is a widely used technique, but it has a downside as well. If data get buffered too many times,
performance suffers. Consider, for example, the network of Fig. 5-16. Here a user does a system call to
write to the network. The kernel copies the packet to a kernel buffer to allow the user to proceed
immediately (step 1). At this point the user program can reuse the buffer.
Figure 5-16. Networking may involve many copies of a packet.
When the driver is called, it copies the packet to the controller for output (step 2). The reason it does not
output to the wire directly from kernel memory is that once a packet transmission has been started, it must
continue at a uniform speed. The driver cannot guarantee that it can get to memory at a uniform speed
because DMA channels and other I/O devices may be stealing many cycles. Failing to get a word on time
would ruin the packet. By buffering the packet inside the controller, this problem is avoided.
After the packet has been copied to the controller’s internal buffer, it is copied out onto the network (step
3). Bits arrive at the receiver shortly after being sent, so just after the last bit has been sent, that bit arrives
at the receiver, where the packet has been buffered in the controller. Next the packet is copied to the receiver's kernel buffer (step 4). Finally, it is copied to the receiving process' buffer (step 5). Usually,
the receiver then sends back an acknowledgement. When the sender gets the acknowledgement, it is free
to send the next packet. However, it should be clear that all this copying is going to slow down the
transmission rate considerably because all the steps must happen sequentially.
Error Reporting
Errors are far more common in the context of I/O than in other contexts. When they occur, the operating
system must handle them as best it can. Many errors are device specific and must be handled by the
appropriate driver, but the framework for error handling is device independent.
One class of I/O errors is programming errors. These occur when a process asks for something impossible, such as writing to an input device (keyboard, scanner, mouse, etc.) or reading from an output device (printer, plotter, etc.). Other errors include providing an invalid buffer address or other parameter, specifying an invalid device (e.g., disk 3 when the system has only two disks), and so on.
The action to take on these errors is straightforward: just report back an error code to the caller.
Another class of errors is the class of actual I/O errors, for example, trying to write a disk block that has
been damaged or trying to read from a camcorder that has been switched off. In these circumstances, it is
up to the driver to determine what to do. If the driver does not know what to do, it may pass the problem
back up to device-independent software.
What this software does depends on the environment and the nature of the error. If it is a simple read error
and there is an interactive user available, it may display a dialog box asking the user what to do. The
options may include retrying a certain number of times, ignoring the error, or killing the calling process.
If there is no user available, probably the only real option is to have the system call fail with an error
code.
However, some errors cannot be handled this way. For example, a critical data structure, such as the root
directory or free block list, may have been destroyed. In this case, the system may have to display an error
message and terminate. There is not much else it can do.
Allocating and Releasing Dedicated Devices
Some devices, such as printers, can be used only by a single process at any given moment. It is up to the
operating system to examine requests for device usage and accept or reject them, depending on whether
the requested device is available or not. A simple way to handle these requests is to require processes to
perform opens on the special files for devices directly. If the device is unavailable, the open fails. Closing
such a dedicated device then releases it.
An alternative approach is to have special mechanisms for requesting and releasing dedicated devices. An
attempt to acquire a device that is not available blocks the caller instead of failing. Blocked processes are
put on a queue. Sooner or later, the requested device becomes available and the first process on the queue
is allowed to acquire it and continue execution.
Device-Independent Block Size
Different disks may have different sector sizes. It is up to the device-independent software to hide this fact and provide a uniform block size to higher layers, for example, by treating several sectors as a single logical block. In this way, the higher layers deal only with abstract devices that all use the same logical block size, independent of the physical sector size. Similarly, some character devices deliver their data one byte at a time (e.g., mice), while others deliver theirs in larger units (e.g., Ethernet interfaces). These differences may also be hidden.
5.3.4 User-Space I/O Software
Although most of the I/O software is within the operating system, a small portion of it consists of libraries linked together with user programs, and even whole programs running outside the kernel. System
calls, including the I/O system calls, are normally made by library procedures. When a C program
contains the call
count = write(fd, buffer, nbytes);
the library procedure write might be linked with the program and contained in the binary program present
in memory at run time. In other systems, libraries can be loaded during program execution. Either way,
the collection of all these library procedures is clearly part of the I/O system.
While these procedures do little more than put their parameters in the appropriate place for the system call, other I/O procedures actually do real work. In particular, formatting of input and output is done by library procedures. One example from C is printf, which takes a format string and possibly some variables as input, builds an ASCII string, and then calls write to output the string. As an example of printf, consider the statement

printf("The square of %3d is %6d\n", i, i*i);

It formats a string consisting of the 14-character string ‘‘The square of ’’ followed by the value of i as a 3-character string, then the 4-character string ‘‘ is ’’, then i*i as 6 characters, and finally a line feed.
An example of a similar procedure for input is scanf, which reads input and stores it into variables
described in a format string using the same syntax as printf. The standard I/O library contains a number of
procedures that involve I/O and all run as part of user programs.
Not all user-level I/O software consists of library procedures. Another important category is the spooling system. Spooling is a way of dealing with dedicated I/O devices in a multiprogramming system. Consider a typical spooled device: a printer. Although it would be technically easy to let any user process open the character special file for the printer, suppose a process opened it and then did nothing for hours. No other process could print anything.
Instead what is done is to create a special process, called a daemon, and a special directory, called a spooling directory. To print a file, a process first generates the entire file to be printed and puts it in the spooling directory. It is up to the daemon, which is the only process having permission to use the printer’s special file, to print the files in the directory. By protecting the special file against direct use by users, the problem of having someone keeping it open unnecessarily long is eliminated.
Spooling is used not only for printers. It is also used in other I/O situations. For example, file transfer over a network often uses a network daemon. To send a file somewhere, a user puts it in a network spooling directory. Later on, the network daemon takes it out and transmits it. One particular use of spooled file transmission is the USENET News system (now part of Google Groups). This network consists of millions of machines around the world communicating using the Internet. Thousands of news groups exist on many topics. To post a news message, the user invokes a news program, which accepts the message to be posted and then deposits it in a spooling directory for transmission to other machines later. The entire news system runs outside the operating system.
Figure 5-17 summarizes the I/O system, showing all the layers and the principal functions of each layer. Starting at the bottom, the layers are the hardware, interrupt handlers, device drivers, device-independent software, and finally the user processes.
Layer                          I/O functions
User processes                 Make I/O call; format I/O; spooling
Device-independent software    Naming, protection, blocking, buffering, allocation
Device drivers                 Set up device registers; check status
Interrupt handlers             Wake up driver when I/O completed
Hardware                       Perform I/O operation

(An I/O request flows down through the layers; the I/O reply flows back up.)
Figure 5-17. Layers of the I/O system and the main functions of each layer.
The arrows in Fig. 5-17 show the flow of control. When a user program tries to read a block from a file,
for example, the operating system is invoked to carry out the call. The device-independent software looks
for it, say, in the buffer cache. If the needed block is not there, it calls the device driver to issue the
request to the hardware to go get it from the disk. The process is then blocked until the disk operation has been completed and the data are safely available in the caller’s buffer.
When the disk is finished, the hardware generates an interrupt. The interrupt handler is run to discover
what has happened, that is, which device wants attention right now. It then extracts the status from the
device and wakes up the sleeping process to finish off the I/O request and let the user process continue.
5.4 DISKS
Now we will begin studying some real I/O devices. We will begin with disks, which are conceptually
simple, yet very important. After that we will examine clocks, keyboards, and displays.
5.4.1 Disk Hardware
Disks come in a variety of types. The most common ones are the magnetic hard disks. They are
characterized by the fact that reads and writes are equally fast, which makes them suitable as secondary
memory (paging, file systems, etc.). Arrays of these disks are sometimes used to provide highly reliable
storage. For distribution of programs, data, and movies, optical disks (DVDs and Blu-ray) are also
important. Finally, solid-state disks are increasingly popular as they are fast and do not contain moving
parts. In the following sections we will discuss magnetic disks as an example of the hardware and then describe the software for disk devices in general.
Magnetic Disks
Magnetic disks are organized into cylinders, each one containing as many tracks as there are heads
stacked vertically. The tracks are divided into sectors, with the number of sectors around the
circumference typically being 8 to 32 on floppy disks, and up to several hundred on hard disks. The
number of heads varies from 1 to about 16.
Older disks have little electronics and just deliver a simple serial bit stream. On these disks, the controller
does most of the work. On other disks, in particular, IDE (Integrated Drive Electronics) and SATA
(Serial ATA) disks, the disk drive itself contains a microcontroller that does considerable work and
allows the real controller to issue a set of higher-level commands. The controller often does track caching,
bad-block remapping, and much more.
A device feature that has important implications for the disk driver is the possibility of a controller doing seeks on two or more drives at the same time. These are known as overlapped seeks. While the controller
and software are waiting for a seek to complete on one drive, the controller can initiate a seek on another
drive. Many controllers can also read or write on one drive while seeking on one or more other drives, but
a floppy disk controller cannot read or write on two drives at the
same time. (Reading or writing requires the controller to move bits on a microsecond time scale, so one transfer uses up most of its computing power.) The situation is different for hard disks with integrated controllers, and in a system with more than one of these hard drives they can operate simultaneously, at least to the extent of transferring between the disk and the controller’s buffer memory. Only one transfer between the controller and the main memory is possible at once, however. The ability to perform two or more operations at the same time can reduce the average access time considerably.
Figure 5-18 compares parameters of the standard storage medium for the original IBM PC with parameters of a disk made three decades later to show how much disks changed in that time. It is interesting to note that not all parameters have improved as much. Average seek time is almost 9 times better than it was, transfer rate is 16,000 times better, while capacity is up by a factor of 800,000. This pattern has to do with relatively gradual improvements in the moving parts, but much higher bit densities on the recording surfaces.
Parameter                        IBM 360-KB floppy disk    WD 3000 HLFS hard disk
Number of cylinders              40                        36,481
Tracks per cylinder              2                         255
Sectors per track                9                         63 (avg)
Sectors per disk                 720                       586,072,368
Bytes per sector                 512                       512
Disk capacity                    360 KB                    300 GB
Seek time (adjacent cylinders)   6 msec                    0.7 msec
Seek time (average case)         77 msec                   4.2 msec
Rotation time                    200 msec                  6 msec
Time to transfer 1 sector        22 msec                   1.4 μsec

Figure 5-18. Disk parameters for the original IBM PC 360-KB floppy disk and a Western Digital WD 3000 HLFS (‘‘Velociraptor’’) hard disk.
One thing to be aware of in looking at the specifications of modern hard disks is that the geometry
specified, and used by the driver software, is almost always different from the physical format. On old
disks, the number of sectors per track was the same for all cylinders. Modern disks are divided into zones
with more sectors on the outer zones than the inner ones. Fig. 5-19(a) illustrates a tiny disk with two
zones. The outer zone has 32 sectors per track; the inner one has 16 sectors per track. A real disk, such as
the WD 3000 HLFS, typically has 16 or more zones, with the number of sectors increasing by about 4%
per zone as one goes out from the innermost to the outermost zone.
To hide the details of how many sectors each track has, most modern disks have a virtual geometry that is
presented to the operating system. The software is instructed to act as though there are x cylinders, y
heads, and z sectors per track.
Figure 5-19. (a) Physical geometry of a disk with two zones. (b) A possible vir- tual geometry for this disk.
The controller then remaps a request for (x, y, z) onto the real cylinder, head, and sector. A possible virtual geometry for the physical disk of Fig. 5-19(a) is shown in Fig. 5-19(b). In both cases the disk has 192 sectors; only the published arrangement is different from the real one.
For PCs, the maximum values for these three parameters are often (65535, 16, and 63), due to the need to be backward compatible with the limitations of the original IBM PC. On this machine, 16-, 4-, and 6-bit fields were used to specify these numbers, with cylinders and sectors numbered starting at 1 and heads numbered starting at 0. With these parameters and 512 bytes per sector, the largest possible disk is 31.5 GB. To get around this limit, all modern disks now support a system called logical block addressing, in which disk sectors are just numbered consecutively starting at 0, without regard to the disk geometry.
RAID
CPU performance has been increasing exponentially over the past decade, roughly doubling every 18
months. Not so with disk performance. In the 1970s, average seek times on minicomputer disks were 50
to 100 msec. Now seek times are still a few msec. In most technical industries (say, automobiles or
aviation), a factor of 5 to 10 performance improvement in two decades would be major news (imagine
300-MPG cars), but in the computer industry it is an embarrassment. Thus the gap between CPU
performance and (hard) disk performance has become much larger over time. Can anything be done to
help?
Yes! As we have seen, parallel processing is increasingly being used to speed up CPU performance. It has occurred to various people over the years that parallel I/O might be a good idea, too. In their 1988 paper, Patterson et al. suggested six specific disk organizations that could be used to improve disk performance, reliability, or both (Patterson et al., 1988). These ideas were quickly adopted by industry and have led to a new class of I/O device called a RAID. Patterson et al. defined RAID as Redundant Array of Inexpensive Disks, but industry redefined the I to be ‘‘Independent’’ rather than ‘‘Inexpensive’’ (maybe so they could charge more?). Since a villain was also needed (as in RISC vs. CISC, also due to Patterson), the bad guy here was the SLED (Single Large Expensive Disk).
The fundamental idea behind a RAID is to install a box full of disks next to the computer, typically a large server, replace the disk controller card with a RAID controller, copy the data over to the RAID, and then continue normal operation. In other words, a RAID should look like a SLED to the operating system but have better performance and better reliability. In the past, RAIDs consisted almost exclusively of a RAID SCSI controller plus a box of SCSI disks, because the performance was good and modern SCSI supports up to 15 disks on a single controller. Nowadays, many manufacturers also offer (less expensive) RAIDs based on SATA. In this way, no software changes are required to use the RAID, a big selling point for many system administrators.
In addition to appearing like a single disk to the software, all RAIDs have the property that the data are distributed over the drives, to allow parallel operation. Several different schemes for doing this were defined by Patterson et al. Nowadays, most manufacturers refer to the seven standard configurations as RAID level 0 through RAID level 6. In addition, there are a few other minor levels that we will not discuss. The term ‘‘level’’ is something of a misnomer since no hierarchy is involved; there are simply seven different organizations possible.
RAID level 0 is illustrated in Fig. 5-20(a). It consists of viewing the virtual single disk simulated by the RAID as being divided up into strips of k sectors each, with sectors 0 to k − 1 being strip 0, sectors k to 2k − 1 strip 1, and so on. For k = 1, each strip is a sector; for k = 2 a strip is two sectors, etc. The RAID level 0 organization writes consecutive strips over the drives in round-robin fashion, as depicted in Fig. 5-20(a) for a RAID with four disk drives.
Distributing data over multiple drives like this is called striping. For example, if the software issues a command to read a data block consisting of four consecutive strips starting at a strip boundary, the RAID controller will break this command up into four separate commands, one for each of the four disks, and have them operate in parallel. Thus we have parallel I/O without the software knowing about it.
RAID level 0 works best with large requests, the bigger the better. If a request is larger than the number of drives times the strip size, some drives will get multiple requests, so that when they finish the first request they start the second one. It is up to the controller to split the request up and feed the proper commands to the proper disks in the right sequence and then assemble the results in memory correctly. Performance is excellent and the implementation is straightforward.
RAID level 0 works worst with operating systems that habitually ask for data one sector at a time. The
results will be correct, but there is no parallelism and hence no performance gain. Another disadvantage
of this organization is that the reliability is potentially worse than having a SLED. If a RAID consists of
four disks, each with a mean time to failure of 20,000 hours, about once every 5000 hours a drive will fail
and all the data will be completely lost. A SLED with a mean time to failure of 20,000 hours would be
four times more reliable. Because no redundancy is present in this design, it is not really a true RAID.
The next option, RAID level 1, shown in Fig. 5-20(b), is a true RAID. It duplicates all the disks, so there are four primary disks and four backup disks. On a write, every strip is written twice. On a read, either copy can be used, distributing the load over more drives. Consequently, write performance is no better than for a single drive, but read performance can be up to twice as good. Fault tolerance is excellent: if a drive crashes, the copy is simply used instead. Recovery consists of simply installing a new drive and copying the entire backup drive to it.
Unlike levels 0 and 1, which work with strips of sectors, RAID level 2 works on a word basis, possibly even a byte basis. Imagine splitting each byte of the single virtual disk into a pair of 4-bit nibbles, then adding a Hamming code to each one to form a 7-bit word, of which bits 1, 2, and 4 were parity bits.
Further imagine that the seven drives of Fig. 5-20(c) were synchronized in terms of arm position and
rotational position. Then it would be possible to write the 7-bit Hamming coded word over the seven
drives, one bit per drive.
The Thinking Machines CM-2 computer used this scheme, taking 32-bit data words and adding 6 parity
bits to form a 38-bit Hamming word, plus an extra bit for word parity, and spread each word over 39 disk
drives. The total throughput was immense, because in one sector time it could write 32 sectors worth of
data. Also, losing one drive did not cause problems, because loss of a drive amounted to losing 1 bit in
each 39-bit word read, something the Hamming code could handle on the fly.
On the down side, this scheme requires all the drives to be rotationally synchronized, and it only makes sense with a substantial number of drives (even with 32 data drives and 6 parity drives, the overhead is 19%). It also asks a lot of the controller, since it must do a Hamming checksum every bit time.
RAID level 3 is a simplified version of RAID level 2. It is illustrated in Fig. 5-20(d). Here a single parity
bit is computed for each data word and written to a parity drive. As in RAID level 2, the drives must be
exactly synchronized, since individual data words are spread over multiple drives.
At first thought, it might appear that a single parity bit gives only error detection, not error correction. For the case of random undetected errors, this observation is true. However, for the case of a drive crashing, it provides full 1-bit error correction, since the position of the bad bit is known. In the event that a drive crashes, the controller just pretends that all its bits are 0s. If a word then has a parity error, the bit from the dead drive must have been a 1, so it is corrected. Although both RAID levels 2 and 3 offer very high data rates, the number of separate I/O requests per second they can handle is no better than for a single drive.

Figure 5-20. RAID levels 0 through 6. Backup and parity drives are shown shaded.
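The erasure argument above is easy to demonstrate: when the failed drive's position is known, a single parity bit per word pins down the missing bit exactly. The drive count here is an arbitrary choice for illustration.

```python
# Sketch: with one parity bit per word, the bit from a known-dead drive
# is recoverable. Four hypothetical data drives plus one parity drive,
# one bit per drive.

def make_parity(bits):
    """XOR of a list of bits."""
    p = 0
    for b in bits:
        p ^= b
    return p

data = [1, 0, 1, 1]            # one bit per data drive
parity = make_parity(data)     # stored on the parity drive

dead = 2                       # drive 2 has crashed; its bit is unknown
survivors = [b for i, b in enumerate(data) if i != dead]

# The missing bit is whatever makes the word's parity come out right again.
recovered = parity ^ make_parity(survivors)
assert recovered == data[dead]
```

With random errors this does not work, because then nothing tells the controller *which* bit is wrong, only that one is.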
RAID levels 4 and 5 work with strips again, not individual words with parity, and do not require
synchronized drives. RAID level 4 [see Fig. 5-20(e)] is like RAID level 0, with a strip-for-strip parity
written onto an extra drive. For example, if each strip is k bytes long, all the strips are EXCLUSIVE ORed together, resulting in a parity strip k bytes long. If a drive crashes, the lost bytes can be
recomputed from the parity drive by reading the entire set of drives.
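The strip-for-strip parity just described can be sketched directly; the strip contents and the three-drive count below are arbitrary illustrations.

```python
# Sketch of RAID level 4 strip parity: the parity strip is the byte-wise
# XOR of all the data strips, and any one lost strip can be rebuilt from
# the remaining strips plus the parity strip.
from functools import reduce

def xor_strips(strips):
    """Byte-wise XOR of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

strips = [b"AAAA", b"BBBB", b"CCCC"]   # k-byte strips on three data drives
parity = xor_strips(strips)            # written to the parity drive

# Drive 1 crashes: XOR the survivors with the parity strip to rebuild it.
rebuilt = xor_strips([strips[0], strips[2], parity])
assert rebuilt == strips[1]
```

Rebuilding works because XOR is its own inverse: parity XOR all surviving strips cancels everything except the lost strip.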
This design protects against the loss of a drive but performs poorly for small updates. If one sector is
changed, it is necessary to read all the drives in order to recalculate the parity, which must then be
rewritten. Alternatively, it can read the old user data and the old parity data and recompute the new parity
from them. Even with this optimization, a small update requires two reads and two writes.
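The small-update optimization follows from the same XOR algebra: the new parity is the old parity XOR the old data XOR the new data, so only the target drive and the parity drive are touched. This is a sketch under the assumption of byte-sized strip elements; the helper names are illustrative.

```python
# Sketch of the small-update rule: new_parity = old_parity ^ old_data ^
# new_data, requiring two reads (old data, old parity) and two writes.

def update_parity(old_parity, old_data, new_data):
    """Recompute the parity strip after one data strip changes."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

def full_parity(strips):
    """Full recomputation over a three-drive stripe, for comparison."""
    return bytes(a ^ b ^ c for a, b, c in zip(*strips))

strips = [b"AAAA", b"BBBB", b"CCCC"]
old_parity = full_parity(strips)
new_strip = b"DDDD"

# The shortcut agrees with re-reading the whole stripe.
assert update_parity(old_parity, strips[1], new_strip) == \
       full_parity([strips[0], new_strip, strips[2]])
```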
As a consequence of the heavy load on the parity drive, it may become a bottleneck. This bottleneck is eliminated in RAID level 5 by distributing the parity bits uniformly over all the drives, round-robin
fashion, as shown in Fig. 5-20(f). However, in the event of a drive crash, reconstructing the contents of
the failed drive is a complex process.
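The round-robin placement can be made concrete with a small formula. The left-symmetric rotation below is one common convention (used, for example, by software RAID implementations), not the only possible layout.

```python
# Sketch of RAID level 5 parity rotation: for stripe i on n drives, the
# parity block moves one drive to the left each stripe, so no single
# drive carries all the parity traffic. Left-symmetric layout assumed.

def parity_drive(stripe, n_drives):
    """Drive index holding the parity block for a given stripe."""
    return (n_drives - 1 - stripe % n_drives) % n_drives

# With 5 drives, parity lands on drives 4, 3, 2, 1, 0, then wraps to 4.
assert [parity_drive(i, 5) for i in range(6)] == [4, 3, 2, 1, 0, 4]
```

During reconstruction, the controller must apply this mapping per stripe to decide which surviving blocks are data and which are parity, which is part of why rebuilding a RAID 5 array is more involved than rebuilding a RAID 4 array.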
RAID level 6 is similar to RAID level 5, except that an additional parity block is used. In other words, the data is striped across the disks with two parity blocks instead of one. As a result, writes are a bit more expensive because of the parity calculations, but reads incur no performance penalty. It does offer more reliability (imagine what happens if RAID level 5 encounters a bad block just when it is rebuilding its array).
5.4.2 Disk Formatting
A hard disk consists of a stack of aluminum, alloy, or glass platters typically 3.5 inches in diameter (or 2.5 inches on notebook computers). On each platter is deposited a thin magnetizable metal oxide. After manufacturing, there is no information whatsoever on the disk.
Before the disk can be used, each platter must receive a low-level format done by software. The format consists of a series of concentric tracks, each containing some number of sectors, with short gaps between the sectors. The format of a sector is shown in Fig. 5-21.
Figure 5-21. A disk sector (preamble, data, ECC).
The preamble starts with a certain bit pattern that allows the hardware to recognize the start of the sector. It also contains the...