Download Inside the machine: an illustrated introduction to microprocessors and computer architecture PDF

Title: Inside the Machine: An Illustrated Introduction to Microprocessors and Computer Architecture
Author: Jon Stokes
File Size: 9.8 MB
Total Pages: 320
Table of Contents
Preface
Acknowledgments
Introduction
Basic Computing Concepts
	The Calculator Model of Computing
	The File-Clerk Model of Computing
		The Stored-Program Computer
		Refining the File-Clerk Model
	The Register File
	RAM: When Registers Alone Won’t Cut It
		The File-Clerk Model Revisited and Expanded
		An Example: Adding Two Numbers
	A Closer Look at the Code Stream: The Program
		General Instruction Types
		The DLW-1’s Basic Architecture and Arithmetic Instruction Format
			The DLW-1’s Arithmetic Instruction Format
			The DLW-1’s Memory Instruction Format
			An Example DLW-1 Program
	A Closer Look at Memory Accesses: Register vs. Immediate
		Immediate Values
		Register-Relative Addressing
The Mechanics of Program Execution
	Opcodes and Machine Language
		Machine Language on the DLW-1
		Binary Encoding of Arithmetic Instructions
		Binary Encoding of Memory Access Instructions
			The load Instruction
			The store Instruction
		Translating an Example Program into Machine Language
	The Programming Model and the ISA
		The Programming Model
		The Instruction Register and Program Counter
		The Instruction Fetch: Loading the Instruction Register
		Running a Simple Program: The Fetch-Execute Loop
	The Clock
	Branch Instructions
		Unconditional Branch
		Conditional Branch
			Branch Instructions and the Fetch-Execute Loop
			The Branch Instruction as a Special Type of Load
			Branch Instructions and Labels
	Excursus: Booting Up
Pipelined Execution
	The Lifecycle of an Instruction
	Basic Instruction Flow
	Pipelining Explained
	Applying the Analogy
		A Non-Pipelined Processor
		A Pipelined Processor
			Shrinking the Clock
			Shrinking Program Execution Time
		The Speedup from Pipelining
		Program Execution Time and Completion Rate
		The Relationship Between Completion Rate and Program Execution Time
		Instruction Throughput and Pipeline Stalls
			Instruction Throughput
			Pipeline Stalls
		Instruction Latency and Pipeline Stalls
		Limits to Pipelining
			Clock Period and Completion Rate
			The Cost of Pipelining
Superscalar Execution
	Superscalar Computing and IPC
	Expanding Superscalar Processing with Execution Units
		Basic Number Formats and Computer Arithmetic
		Arithmetic Logic Units
		Memory-Access Units
	Microarchitecture and the ISA
		A Brief History of the ISA
		Moving Complexity from Hardware to Software
	Challenges to Pipelining and Superscalar Design
		Data Hazards
		Structural Hazards
		The Register File
		Control Hazards
The Intel Pentium and Pentium Pro
	The Original Pentium
		Caches
		The Pentium’s Pipeline
		The Branch Unit and Branch Prediction
		The Pentium’s Back End
			The Integer ALUs
			The Floating-Point ALU
		x86 Overhead on the Pentium
		Summary: The Pentium in Historical Context
	The Intel P6 Microarchitecture: The Pentium Pro
		Decoupling the Front End from the Back End
			The Issue Phase
			The Completion Phase
			The P6’s Issue Phase: The Reservation Station
			The P6’s Completion Phase: The Reorder Buffer
			The Instruction Window
		The P6 Pipeline
		Branch Prediction on the P6
		The P6 Back End
		CISC, RISC, and Instruction Set Translation
		The P6 Microarchitecture’s Instruction Decoding Unit
		The Cost of x86 Legacy Support on the P6
		Summary: The P6 Microarchitecture in Historical Context
			The Pentium Pro
			The Pentium II
			The Pentium III
	Conclusion
PowerPC Processors: 600 Series, 700 Series, and 7400
	A Brief History of PowerPC
	The PowerPC 601
		The 601’s Pipeline and Front End
			The PowerPC Instruction Queue
			Instruction Scheduling on the 601
		The 601’s Back End
			The Integer Unit
			The Floating-Point Unit
			The Branch Execution Unit
			The Sequencer Unit
		Latency and Throughput Revisited
		Summary: The 601 in Historical Context
	The PowerPC 603 and 603e
		The 603e’s Back End
		The 603e’s Front End, Instruction Window, and Branch Prediction
		Summary: The 603 and 603e in Historical Context
	The PowerPC 604
		The 604’s Pipeline and Back End
		The 604’s Front End and Instruction Window
			The Issue Phase: The 604’s Reservation Stations
			The Four Rules of Instruction Dispatch
			The Completion Phase: The 604’s Reorder Buffer
		Summary: The 604 in Historical Context
	The PowerPC 604e
	The PowerPC 750 (aka the G3)
		The 750’s Front End, Instruction Window, and Branch Instruction
		Summary: The PowerPC 750 in Historical Context
	The PowerPC 7400 (aka the G4)
		The G4’s Vector Unit
		Summary: The PowerPC G4 in Historical Context
	Conclusion
Intel’s Pentium 4 vs. Motorola’s G4e: Approaches and Design Philosophies
	The Pentium 4’s Speed Addiction
	The General Approaches and Design Philosophies of the Pentium 4 and G4e
	An Overview of the G4e’s Architecture and Pipeline
		Stages 1 and 2: Instruction Fetch
		Stage 3: Decode/Dispatch
		Stage 4: Issue
		Stage 5: Execute
		Stages 6 and 7: Complete and Write-Back
	Branch Prediction on the G4e and Pentium 4
	An Overview of the Pentium 4’s Architecture
		Expanding the Instruction Window
		The Trace Cache
			Shortening Instruction Execution Time
			The Trace Cache’s Operation
	An Overview of the Pentium 4’s Pipeline
		Stages 1 and 2: Trace Cache Next Instruction Pointer
		Stages 3 and 4: Trace Cache Fetch
		Stage 5: Drive
		Stages 6 Through 8: Allocate and Rename (ROB)
		Stage 9: Queue
		Stages 10 Through 12: Schedule
		Stages 13 and 14: Issue
		Stages 15 and 16: Register Files
		Stage 17: Execute
		Stage 18: Flags
		Stage 19: Branch Check
		Stage 20: Drive
		Stages 21 and Onward: Complete and Commit
	The Pentium 4’s Instruction Window
Intel’s Pentium 4 vs. Motorola’s G4e: The Back End
	Some Remarks About Operand Formats
	The Integer Execution Units
		The G4e’s IUs: Making the Common Case Fast
		The Pentium 4’s IUs: Make the Common Case Twice as Fast
	The Floating-Point Units (FPUs)
		The G4e’s FPU
		The Pentium 4’s FPU
		Concluding Remarks on the G4e’s and Pentium 4’s FPUs
	The Vector Execution Units
		A Brief Overview of Vector Computing
		Vectors Revisited: The AltiVec Instruction Set
		AltiVec Vector Operations
			Intra-Element Arithmetic and Non-Arithmetic Instructions
			Inter-Element Arithmetic and Non-Arithmetic Instructions
		The G4e’s VU: SIMD Done Right
		Intel’s MMX
		SSE and SSE2
		The Pentium 4’s Vector Unit: Alphabet Soup Done Quickly
		Increasing Floating-Point Performance with SSE2
	Conclusions
64-Bit Computing and x86-64
	Intel’s IA-64 and AMD’s x86-64
	Why 64 Bits?
	What Is 64-Bit Computing?
	Current 64-Bit Applications
		Dynamic Range
		The Benefits of Increased Dynamic Range, or, How the Existing 64-Bit Computing Market Uses 64-Bit Integers
		Virtual Address Space vs. Physical Address Space
		The Benefits of a 64-Bit Address
	The 64-Bit Alternative: x86-64
		Extended Registers
		More Registers
		Switching Modes
		Out with the Old
	Conclusion
The G5: IBM’s PowerPC 970
	Overview: Design Philosophy
	Caches and Front End
	Branch Prediction
	The Trade-Off: Decode, Cracking, and Group Formation
		Dispatching and Issuing Instructions on the PowerPC 970
		The 970’s Dispatch Rules
		Predecoding and Group Dispatch
		Some Preliminary Conclusions on the 970’s Group Dispatch Scheme
	The PowerPC 970’s Back End
		Integer Unit, Condition Register Unit, and Branch Unit
		The Integer Units Are Not Fully Symmetric
		Integer Unit Latencies and Throughput
		The CRU
			The PowerPC Condition Register
		Preliminary Conclusions About the 970’s Integer Performance
	Load-Store Units
	Front-Side Bus
	The Floating-Point Units
	Vector Computing on the PowerPC 970
	Floating-Point Issue Queues
		Integer and Load-Store Issue Queues
		BU and CRU Issue Queues
		Vector Issue Queues
	The Performance Implications of the 970’s Group Dispatch Scheme
	Conclusions
Understanding Caching and Performance
	Caching Basics
		The Level 1 Cache
		The Level 2 Cache
		Example: A Byte’s Brief Journey Through the Memory Hierarchy
		Cache Misses
	Locality of Reference
		Spatial Locality of Data
		Spatial Locality of Code
		Temporal Locality of Code and Data
		Locality: Conclusions
	Cache Organization: Blocks and Block Frames
	Tag RAM
	Fully Associative Mapping
	Direct Mapping
	N-Way Set Associative Mapping
		Four-Way Set Associative Mapping
		Two-Way Set Associative Mapping
		Two-Way vs. Direct-Mapped
		Two-Way vs. Four-Way
		Associativity: Conclusions
	Temporal and Spatial Locality Revisited: Replacement/Eviction Policies and Block Sizes
		Types of Replacement/Eviction Policies
		Block Sizes
	Write Policies: Write-Through vs. Write-Back
	Conclusions
Intel’s Pentium M, Core Duo, and Core 2 Duo
	Code Names and Brand Names
	The Rise of Power-Efficient Computing
	Power Density
		Dynamic Power Density
		Static Power Density
	The Pentium M
		The Fetch Phase
			The Hardware Loop Buffer
		The Decode Phase: Micro-ops Fusion
			Fused Stores
			Fused Loads
			The Impact of Micro-ops Fusion
		Branch Prediction
			The Loop Detector
			The Indirect Predictor
		The Stack Execution Unit
		Pipeline and Back End
		Summary: The Pentium M in Historical Context
	Core Duo/Solo
		Intel’s Line Goes Multi-Core
			Processor Organization and Core Microarchitecture
			Multiprocessing and Chip Multiprocessing
		Core Duo’s Improvements
			Micro-ops Fusion of SSE and SSE2 store and load-op Instructions
			Micro-ops Fusion and Lamination of SSE and SSE2 Arithmetic Instructions
			Micro-ops Fusion of Miscellaneous Non-SSE Instructions
			Improved Loop Detector
			SSE3
			Floating-Point Improvement
			Integer Divide Improvement
			Virtualization Technology
		Summary: Core Duo in Historical Context
	Core 2 Duo
		The Fetch Phase
			Macro-Fusion
		The Decode Phase
		Core’s Pipeline
	Core’s Back End
		Integer Units
		Floating-Point Units
		Vector Processing Improvements
			128-bit Vector Execution on the P6 Through Core Duo
			128-bit Vector Execution on Core
		Memory Disambiguation: The Results Stream Version of Speculative Execution
			The Lifecycle of a Memory Access Instruction
			The Memory Reorder Buffer
			Memory Aliasing
			Memory Reordering Rules
			False Aliasing
			Memory Disambiguation
		Summary: Core 2 Duo in Historical Context
Bibliography and Suggested Reading
	General
	PowerPC ISA and Extensions
	PowerPC 600 Series Processors
	PowerPC G3 and G4 Series Processors
	IBM PowerPC 970 and POWER
	x86 ISA and Extensions
	Pentium and P6 Family
	Pentium 4
	Pentium M, Core, and Core 2
	Online Resources
Index
Updates
                        
Document Text Contents
Page 1

Inside the Machine

Computers perform countless tasks ranging
from the business critical to the recreational,
but regardless of how differently they may look
and behave, they’re all amazingly similar in
basic function. Once you understand how the
microprocessor—or central processing unit (CPU)—
works, you’ll have a firm grasp of the fundamental
concepts at the heart of all modern computing.

Inside the Machine, from the co-founder of the highly
respected Ars Technica website, explains how
microprocessors operate—what they do and how
they do it. The book uses analogies, full-color
diagrams, and clear language to convey the ideas
that form the basis of modern computing. After
discussing computers in the abstract, the book
examines specific microprocessors from Intel,
IBM, and Motorola, from the original models up
through today’s leading processors. It contains the
most comprehensive and up-to-date information
available (online or in print) on Intel’s latest
processors: the Pentium M, Core, and Core 2 Duo.
Inside the Machine also explains technology terms
and concepts that readers often hear but may not
fully understand, such as “pipelining,” “L1 cache,”
“main memory,” “superscalar processing,” and
“out-of-order execution.”

Stokes

Jon “Hannibal” Stokes is co-founder and Senior CPU Editor of Ars Technica. He has written for a variety
of publications on microprocessor architecture and the technical aspects of personal computing. Stokes
holds a degree in computer engineering from Louisiana State University and two advanced degrees in the
humanities from Harvard University. He is currently pursuing a Ph.D. at the University of Chicago.

Includes discussion of:

• Parts of the computer and microprocessor
• Programming fundamentals (arithmetic
instructions, memory accesses, control
flow instructions, and data types)

• Intermediate and advanced microprocessor
concepts (branch prediction and speculative
execution)

• Intermediate and advanced computing
concepts (instruction set architectures,
RISC and CISC, the memory hierarchy, and
encoding and decoding machine language
instructions)

• 64-bit computing vs. 32-bit computing
• Caching and performance

Inside the Machine is perfect for students of
science and engineering, IT and business
professionals, and the growing community
of hardware tinkerers who like to dig into the
guts of their machines.

An Illustrated Introduction to Microprocessors and Computer Architecture

ISBN: 978-1-59327-104-6

$49.95 ($61.95 cdn) shelve in: Computer Hardware

A Look Inside the Silicon Heart of Modern Computing

THE FINEST IN GEEK ENTERTAINMENT™
www.nostarch.com

An Illustrated Introduction to
Microprocessors and Computer Architecture

Jon Stokes

“This is, by far, the most well written text that I have seen on the subject
of computer architecture.”

—John Stroman, Technical Account Manager, Intel

Page 160

the very last detail of the processor’s design. As this chapter will show, the
successor to the most successful x86 microarchitecture of all time was a
machine built from the ground up for stratospheric clock speed.

NOTE Willamette was Intel’s code name for the Pentium 4 while the project was in development.
Intel’s projects are usually code-named after rivers in Oregon. Many companies
use code names that follow a certain convention, like Apple’s use of the names of large
cats for versions of OS X.

Motorola introduced the MPC7450 in January 2001, and Apple quickly
adopted it under the G4 moniker. Because the 7450 represented a significant
departure from the 7400, the 7450 was often referred to as the G4e or the
G4+, so throughout this chapter we’ll call it the G4e. The new processor had
a slightly deeper pipeline, which allowed it to scale to higher clock speeds, and
both its front end and back end boasted a whole host of improvements that
set it apart from the original G4. It also continued the excellent performance/
power consumption ratio of its predecessors. These features combined to
make it an excellent chip for portables, and Apple has exploited derivatives
of this basic architecture under the G4 name in a series of innovative desktop
enclosure designs and portables. The G4e also brought enhanced vector
computing performance to the table, which made it a great platform for
DSP and media applications.

This chapter will examine the trade-offs and design decisions that the
Pentium 4’s architects made in their effort to build a MHz monster, paying
special attention to the innovative features that the Pentium 4 sported and
the ways that those features fit with the processor’s overall design philosophy
and target application domain. We’ll cover the Pentium 4’s ultradeep pipeline,
its trace cache, its double-pumped ALUs, and a host of other aspects
of its design, all with an eye to their impact on performance. As a point of
comparison, we’ll also look at the microarchitecture of Motorola’s G4e. By
examining two microprocessor designs side by side, you’ll gain a deeper
understanding of how the concepts outlined in the previous chapters play
out in a pair of popular, real-world designs.

The Pentium 4’s Speed Addiction

Table 7-1 lists the features of the Pentium 4.

Table 7-1: Features of the Pentium 4

Introduction Date              April 23, 2001
Process                        0.18 micron
Transistor Count               42 million
Clock Speed at Introduction    1.7 GHz
Cache Sizes                    L1: Approximately 16KB instruction, 16KB data
Features                       Simultaneous Multithreading (SMT, aka “hyperthreading”)
                               added in 2003. 64-bit support (EM64T) and SSE3 added in
                               2004. Virtualization Technology (VT) added in 2005.

Page 161

While some processors still have the classic four-stage pipeline described
in Chapter 3, most modern CPUs are more complicated. You’ve already
seen how the original Pentium had a second decode stage, and the P6 core
tripled the standard four-stage pipeline to 12 stages. The Pentium 4, with a
whopping 20 stages in its basic pipeline, takes this tactic to the extreme. Take
a look at Figure 7-1. The chart shows the relative clock frequencies of Intel’s
last six x86 designs. (This picture assumes the same manufacturing process
for all six cores.) The vertical axis shows the relative clock frequency, and the
horizontal axis shows the various processors relative to each other.

Figure 7-1: The relative frequencies of Intel’s processors

Intel’s explanation of this diagram and the history it illustrates is
enlightening, as it shows where their design priorities were:

Figure [3.2] shows that the 286, Intel386™, Intel486™, and Pentium®
(P5) processors had similar pipeline depths—they would run at
similar clock rates if they were all implemented on the same silicon
process technology. They all have a similar number of gates of
logic per clock cycle. The P6 microarchitecture lengthened the
processor pipelines, allowing fewer gates of logic per pipeline
stage, which delivered significantly higher frequency and performance.
The P6 microarchitecture approximately doubled the
number of pipeline stages compared to the earlier processors and
was able to achieve about a 1.5 times higher frequency on the same
process technology. The NetBurst microarchitecture was designed
to have an even deeper pipeline (about two times the P6 microarchitecture)
with even fewer gates of logic per clock cycle to allow
an industry-leading clock rate.

—The Microarchitecture of the Pentium 4 Processor, p. 3.
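
The trade-off between gates of logic per stage and clock rate can be sketched with a toy timing model. This is only an illustration under stated assumptions, not Intel's method: it assumes a fixed amount of logic split evenly across the stages, plus a fixed latch overhead added to every stage. The function name, the five-stage baseline, and the delay values are all hypothetical.

```python
def relative_frequency(stages, base_stages=5, logic_delay=10.0, latch_overhead=1.0):
    """Clock frequency of a `stages`-deep pipeline relative to a `base_stages`-deep one.

    Toy model: clock period = (total logic delay / stages) + per-stage latch overhead.
    All parameter values are illustrative assumptions, not measured data.
    """
    if stages <= 0 or base_stages <= 0:
        raise ValueError("stage counts must be positive")
    period = logic_delay / stages + latch_overhead
    base_period = logic_delay / base_stages + latch_overhead
    return base_period / period


# Compare the assumed baseline against pipelines two and four times as deep.
for depth in (5, 10, 20):
    print(f"{depth:2d} stages: {relative_frequency(depth):.2f}x")
```

With these assumed values, doubling the depth from 5 to 10 stages gives 1.5 times the baseline frequency, and doubling again to 20 stages gives 2.0 times. Each doubling buys less because the latch overhead does not shrink along with the logic.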

As you learned in Chapter 3, there are limits to how deeply you can
pipeline an architecture before you begin to reach a point of diminishing
returns. Deeper pipelining results in an increase in instruction execution
time; this increase can be quite damaging to instruction completion rates if
the pipeline has to be flushed and refilled often. Furthermore, in order to
realize the throughput gains that deep pipelining promises, the processor’s
clock speed must increase in proportion to its pipeline depth. But in the real

[Figure 7-1 chart: vertical axis, relative frequency (0 to 3); horizontal axis, processor. Values: 286 = 1, 386 = 1, 486 = 1, P5 = 1, P6 = 1.5, P4P = 2.5]
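
The cost of flushing a deep pipeline, described in the paragraph above, can be put into rough numbers. The sketch below is a simplification under stated assumptions, not a model taken from the book: it assumes one instruction completes per cycle while the pipeline is full, and that every flush costs as many cycles as the pipeline has stages. The function name and the flush rates are hypothetical; the relative frequencies come from Figure 7-1 and the stage counts from the text.

```python
def relative_completion_rate(rel_frequency, depth, flush_rate):
    """Instructions completed per unit time, in arbitrary relative units.

    Assumptions (illustrative only): one instruction completes per cycle when
    the pipeline is full, and each flush costs `depth` cycles to refill.
    `flush_rate` is the fraction of instructions that trigger a flush.
    """
    if not 0.0 <= flush_rate <= 1.0:
        raise ValueError("flush_rate must be between 0 and 1")
    cycles_per_instruction = 1.0 + flush_rate * depth
    return rel_frequency / cycles_per_instruction


# Relative frequencies from Figure 7-1; stage counts from the text above.
P6 = {"rel_frequency": 1.5, "depth": 12}
P4 = {"rel_frequency": 2.5, "depth": 20}

for flush_rate in (0.0, 0.01, 0.05, 0.10):
    p6 = relative_completion_rate(flush_rate=flush_rate, **P6)
    p4 = relative_completion_rate(flush_rate=flush_rate, **P4)
    print(f"flush rate {flush_rate:.2f}: P4/P6 advantage = {p4 / p6:.2f}x")
```

Under these assumptions, the deeper pipeline's advantage shrinks as flushes become more frequent, because each flush wastes more cycles on the 20-stage design than on the 12-stage one.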

Page 319

UPDATES

Visit www.nostarch.com/insidemachine.htm for updates, errata, and other
information.
