About cpu

Introduction

I’ll never forget when Intel first announced that the name for the successor to the 486 would be “Pentium.” I and most of my fellow computer nerds thought the name was silly and not suitably geeky. Everyone knew that computer components were supposed to have names with numbers in them; after all, Star Wars droids, Star Trek ships, software versions, Compuserve e-mail addresses, and every other kind of computer-related thing you could think of had a moniker consisting of some mix of numbers and letters. So what’s with a name that vaguely suggests the concept of “fiveness,” but would be more appropriate for an element or a compound?

To this day, I still have no idea who or what was responsible for the name “Pentium,” but I suppose it no longer matters. A question that’s still worth asking, though, is why the Pentium name has stuck around as the brand name for Intel’s main processor product line through no less than four major architectural changes. In a nutshell, the answer is that the Pentium brand name, having somehow made the transition from the original Pentium architecture to the radically different Pentium Pro (or P6) architecture, became synonymous with the most successful desktop microprocessor architecture of all time — in fact, in its heyday “Pentium” became virtually synonymous with “PC.”

This series of articles takes a look at the consumer desktop processors that have borne the Pentium name, beginning with the original Pentium up through today’s Pentium 4 (Prescott) and the Pentium M. The overview is general enough that for the most part it should be accessible to the nonspecialist, and it should give you a sense of the major differences between each generation of Pentiums. In keeping with the Ars tag line, the article does not attempt to tell you everything about every iteration of the Pentium; instead, it covers only what you need to know.

The original Pentium

Pentium Vitals Summary Table

Introduction date: March 22, 1993

Process: 0.8 micron

Transistor Count: 3.1 million

Clock speed at introduction: 60 and 66 MHz

Cache sizes: L1: 8K instruction, 8K data

Features: MMX added in 1997

The original Pentium is an extremely modest design by today’s standards, and when it was introduced in 1993 it wasn’t exactly a blockbuster by the standards of its RISC contemporaries, either. While its superscalar design (Intel’s first) certainly improved on the performance of its predecessor, the 486, the main thing that the Pentium had going for it was x86 compatibility. In fact, Intel’s decision to make enormous sacrifices of performance, power consumption, and cost for the sake of maintaining the Pentium’s backwards compatibility with legacy x86 code was probably the most strategically important decision that the company has ever made.

The choice to continue along the x86 path inflicted some serious short- and medium-term pain on Intel, and a certain amount of long-term pain on the industry as a whole (how much pain depends on who you talk to), but as we’ll see the negative impact of this critical move has gradually lessened over time.

The Pentium’s two-issue superscalar architecture was fairly straightforward. It had two five-stage integer pipelines, which Intel designated U and V, and one six-stage floating-point pipeline. The chip’s front-end could do dynamic branch prediction, but as we’ll learn in a moment most of its front-end resources were spent on maintaining backwards compatibility with the x86 architecture.

Figure 1: Pentium architecture

The Pentium’s U and V integer pipes were not fully symmetric. U, as the default pipe, was slightly more capable and contained a shifter, which V lacked. The two pipelines weren’t fully independent, either; there was a set of restrictions, which I won’t waste anyone’s time outlining, that placed limits on which combinations of integer instructions could be issued in parallel. All told, though, the Pentium’s two integer pipes provided solid enough integer performance to be competitive, especially for integer-intensive office apps.

Floating-point performance, however, went from awful on the 486 to merely mediocre on the Pentium: an improvement, to be sure, but not enough to make it even remotely competitive with comparable RISC chips on the market at that time. First off, a floating-point and an integer operation could be issued simultaneously only under extremely restrictive circumstances. This wasn’t too bad, because floating-point and integer code are rarely mixed. The killer, though, was the unfortunate design of the x87 stack-based floating-point architecture.

I’ve covered the problems related to x87 in detail before, so I won’t repeat that here. Modern x86 architectures have workarounds, like rename registers and a “free” FXCH instruction, for alleviating (but not eliminating) the performance disadvantages of x87’s register-starved (only eight architectural registers) and stack-based architecture. The Pentium, however, had none of these, so it suffered mightily compared to its RISC competitors. In the days before the rise of PC gaming, though, when most Pentium purchasers just wanted to run DOS spreadsheet and word-processing applications, this didn’t really matter too much. It simply kept the Pentium out of the scientific/workstation market and relegated it to the growing home and business markets.
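To make the stack-machine penalty concrete, here is a toy Python model of the difference; the StackFPU class and its instruction sequence are invented for illustration and only loosely mimic x87’s FADD/FXCH semantics. The point is simply that a binary operation on a stack machine must involve the top of stack, so values buried deeper have to be rotated up with explicit exchanges that a flat register file never needs.

```python
# Toy model of a stack-based FP unit, x87-style: binary ops always
# involve the top of stack (ST0), so reaching a buried value costs an
# explicit exchange (x87's FXCH). This class is a teaching sketch, not
# a faithful x87 implementation.

class StackFPU:
    def __init__(self, values):
        self.stack = list(values)   # last element plays the role of ST0
        self.exchanges = 0

    def fxch(self, i):
        """Swap ST0 with ST(i): pure bookkeeping, yet it occupies the pipeline."""
        self.stack[-1], self.stack[-1 - i] = self.stack[-1 - i], self.stack[-1]
        self.exchanges += 1

    def fadd(self, i):
        """ST0 = ST0 + ST(i): one operand is always the top of stack."""
        self.stack[-1] += self.stack[-1 - i]

# Compute a+b and c+d with a=1, b=2, c=4, d=8 held in the stack
# (top of stack on the right):
fpu = StackFPU([8.0, 4.0, 2.0, 1.0])
fpu.fadd(1)   # a + b: both operands sit near the top, no shuffling needed
fpu.fxch(2)   # but c must be rotated up to ST0 before it can be used...
fpu.fadd(3)   # ...then c + d
print(fpu.stack[-1], fpu.exchanges)   # 12.0 1 -> one exchange a register file avoids
```

With direct register addressing the same two additions would need zero exchange operations, which is exactly the overhead that rename registers and a free FXCH later papered over.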

The Pentium’s pipeline

The Pentium’s basic integer pipeline is five stages long, with the stages broken down as follows:

1. Prefetch/Fetch: Instructions are fetched from the instruction cache and aligned in prefetch buffers for decoding.

2. Decode1: Instructions are decoded into the Pentium’s internal instruction format. Branch prediction also takes place at this stage.

3. Decode2: Same as above, and microcode ROM kicks in here, if necessary. Also, address computations take place at this stage.

4. Execute: The integer hardware executes the instruction.

5. Write-back: The results of the computation are written back to the register file.
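The five stages above can be sketched as a simple in-order pipeline timeline; this is an idealized, stall-free model (no U/V pairing restrictions, no branch mispredicts), intended only to show how one instruction enters per cycle and writes back four cycles later.

```python
# Idealized model of the Pentium's five-stage in-order integer pipeline:
# one instruction enters Prefetch/Fetch per cycle and advances one stage
# per cycle. Stalls and pairing restrictions are deliberately ignored.

STAGES = ["Prefetch/Fetch", "Decode1", "Decode2", "Execute", "Write-back"]

def pipeline_timeline(instructions):
    """Return {instruction: write-back cycle} for an ideal, stall-free run."""
    done = {}
    for i, instr in enumerate(instructions):
        # Instruction i enters the pipe on cycle i and completes
        # len(STAGES) - 1 cycles later.
        done[instr] = i + len(STAGES) - 1
    return done

timeline = pipeline_timeline(["add", "mov", "shl"])
print(timeline)   # {'add': 4, 'mov': 5, 'shl': 6}
```

Once the pipe is full, one result retires per cycle per pipeline, which is where the Pentium’s peak throughput of two integer instructions per cycle comes from.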

The main difference between the Pentium’s five-stage pipeline and the four-stage pipelines prevalent at the time lies in the second decode stage. RISC ISAs support only simple addressing modes, but x86’s multiple complex addressing modes, which were originally designed to make assembly language programmers’ lives easier but ended up making everyone’s lives more difficult, require extra address computations. These computations are relegated to the second decode stage, where dedicated address computation hardware handles them before the instruction is dispatched to the execution units. The take-home message here is that if it weren’t for the vagaries of x86, the second decode stage would not be necessary and the pipeline length would be reduced by a fifth.
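The extra work Decode2 performs is effective-address arithmetic. A minimal sketch of the general x86 form, base + index*scale + displacement, looks like this (the function and its example operands are illustrative, not a decoder):

```python
# x86's general addressing mode computes an effective address of the form
#   base + index * scale + displacement
# where scale is 1, 2, 4, or 8. This arithmetic is what the Pentium's
# dedicated address-generation hardware performed in the Decode2 stage.

def effective_address(base=0, index=0, scale=1, disp=0):
    """Model of x86 effective-address arithmetic."""
    assert scale in (1, 2, 4, 8)
    return base + index * scale + disp

# e.g. the operand [ebx + esi*4 + 8] with ebx = 0x1000 and esi = 3:
print(hex(effective_address(base=0x1000, index=3, scale=4, disp=8)))  # 0x1014
```

A RISC load, by contrast, typically allows only register + displacement, which needs one adder and no extra pipeline stage.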

x86 overhead on the Pentium

The second decode stage isn’t the only place where legacy x86 support added significant overhead to the Pentium design. According to an MPR article published at the time (see bibliography), Intel estimated that a whopping 30% of the Pentium’s transistors were dedicated solely to providing x86 legacy support. When you consider the fact that the Pentium’s RISC competitors with comparable transistor counts could spend those transistors on performance-enhancing hardware like execution units and cache, it’s no wonder that the Pentium lagged behind.

A large chunk of the Pentium’s legacy-supporting transistors were eaten up by the Pentium’s microcode ROM. If you read my old RISC vs. CISC article, then you know that one of the big benefits of RISC processors was that they didn’t need the microcode ROMs that CISC designs required for decoding large, complex instructions.

The front-end of the Pentium also suffered from x86-related bloat, in that its prefetch logic had to account for the fact that x86 instructions are not a uniform size and hence could straddle cache lines. The Pentium’s decode logic also had to support x86’s segmented memory model, which meant checking for and enforcing code segment limits; such checking required its own dedicated address calculation hardware, in addition to the Pentium’s other dedicated address hardware. Furthermore, all of the Pentium’s dedicated address hardware needed four input ports, which again meant more transistors spent.
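The cache-line-straddling problem is easy to state in code. A quick sketch of the check the prefetcher must perform (the 32-byte line size matches the Pentium’s cache; the instruction lengths are illustrative):

```python
# Variable-length x86 instructions (1 to 15 bytes) can start in one cache
# line and end in the next, so the prefetcher must detect the split and
# merge bytes from two lines before decode can proceed.

LINE = 32  # bytes: the Pentium's cache line size

def straddles(addr, length):
    """Does an instruction starting at addr with this byte length cross a line?"""
    return addr // LINE != (addr + length - 1) // LINE

print(straddles(30, 6))   # True: starts in one 32-byte line, ends in the next
print(straddles(0, 4))    # False: fits entirely within the first line
```

A fixed-width RISC ISA can simply forbid this case by aligning every instruction, which is one reason its fetch logic was so much cheaper.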

So to summarize, the Pentium’s entire front-end was bloated and distended with hardware that was there solely to support x86 (mis)features which were rapidly falling out of use. With transistor budgets as tight as they were, each of those extra address adders and prefetch buffers — not to mention the microcode ROM — represented a painful expenditure of scarce resources that did nothing to enhance performance.

Fortunately for Intel, this wasn’t the end of the story. There were a few facts and trends working in the favor of Intel and the x86 ISA. If we momentarily forget about ISA extensions like MMX, SSE, etc. and the odd handful of special-purpose instructions like CPUID that get added to the x86 ISA every so often, the core legacy x86 ISA is fixed in size and has not grown over the years; similarly, with one exception (the P6, covered below), the amount of hardware that it takes to support such instructions has not tended to grow either.

Transistors, on the other hand, have shrunk rapidly since the Pentium was introduced. When you put these two facts together, this means that the relative cost (in transistors) of x86 support, a cost that is mostly concentrated in an x86 CPU’s front-end, has dropped as CPU transistor counts have increased.

Today, x86 support accounts for well under 10% of the transistors on the Pentium 4 — a drastic improvement over the original Pentium, and one that has contributed significantly to the ability of x86 hardware to catch up to and even surpass its RISC competitors in both integer and floating-point performance. In other words, Moore’s Curves have been extremely kind to the x86 ISA.
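The fixed-cost effect is easy to see in rough numbers. The sketch below takes the 30%-of-3.1-million estimate quoted earlier, assumes (as the argument does) that this legacy-support budget stays roughly constant, and divides it into the transistor counts from this article’s vitals tables:

```python
# Back-of-the-envelope illustration: a roughly fixed legacy-support
# transistor budget shrinks as a *share* of a growing total budget.
# The 30% / 3.1M figures come from the article; later totals are from
# the vitals tables; the "roughly fixed" assumption is the article's.

legacy_transistors = 0.30 * 3.1e6   # ~930K transistors, assumed constant

for name, total in [("Pentium", 3.1e6),
                    ("Pentium Pro", 5.5e6),
                    ("Pentium III", 9.5e6)]:
    share = legacy_transistors / total
    print(f"{name}: {share:.0%} of the budget on x86 legacy support")
```

Run the numbers and the share falls from 30% on the Pentium to about 17% on the Pentium Pro and about 10% on the Pentium III, exactly the trend described above.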

The P6 architecture

Pentium Pro vitals

Introduction date: November 1, 1995

Process: 0.60/0.35 micron

Transistor count: 5.5 million

Clock speed at introduction: 150, 166, 180, and 200 MHz

Cache sizes: L1: 8K instruction, 8K data; L2: 256K or 512K (on-package)

Features: no MMX

Pentium II vitals

Introduction date: May 7, 1997

Process: 0.35 micron

Transistor count: 7.5 million

Clock speed at introduction: 233, 266, and 300 MHz

Cache sizes: L1: 16K instruction, 16K data; L2: 512K (off-die)

Features: MMX

Pentium III vitals

Introduction date: February 26, 1999

Process: 0.25 micron

Transistor count: 9.5 million

Clock speed at introduction: 450 and 500 MHz

Cache sizes: L1: 16K instruction, 16K data; L2: 512K (off-die)

Features: MMX, SSE, processor serial number

Intel’s P6 architecture, first instantiated in the Pentium Pro, was by any reasonable metric a resounding success. Its performance was significantly better than that of the Pentium, and the market rewarded Intel handsomely for it. The architecture also proved extremely scalable, furnishing Intel with a good half-decade of desktop dominance and paving the way for x86 systems to compete with RISC in the workstation and server markets.

Figure 2: Pentium Pro architecture

What was the P6’s secret, and how did it offer such a quantum leap in performance? The answer is complex and involves the contribution of numerous technologies and techniques, the most important of which had already been introduced into the x86 world by Intel’s smaller x86 competitors (most notably, AMD’s K5): the decoupling of the front-end’s fetching and decoding functions from the back-end’s execution function, by means of an instruction window.

Decoupling the front end from the back end

In the Pentium and its predecessors, instructions traveled directly from the decoding hardware to the execution hardware. As noted above, the Pentium had some hardwired rules for dictating which instructions could go to which execution units and in what combinations, so once the instructions were decoded then the rules took over and the dispatch logic shuffled them off to the proper execution unit. In fact, you probably noticed the box marked “Control Unit” in my original Pentium diagram. The control unit is responsible for implementing and executing the rules that decide which instructions go where, and in what combinations.

This static, rules-based approach is rigid and simplistic, and it has two major drawbacks, both stemming from the fact that though the code stream is inherently sequential, a superscalar processor attempts to execute parts of it in parallel:

1. It adapts poorly to the dynamic and ever-changing code stream, and

2. It would make poor use of wider superscalar hardware.

Since the Pentium is a two-issue machine (i.e., it can issue at most two operations simultaneously from its decode hardware to its execution hardware on each clock cycle), its dispatch rules need only look at two instructions at a time to see whether they can be dispatched simultaneously. If more execution hardware were added and the issue width were increased to three instructions per cycle (as it is in the P6), then the rules determining which instructions go where would need to account for the various possible combinations of two and three instructions at a time, in order to get those instructions to the right execution unit at the right time. Furthermore, such rules would inevitably be difficult for coders to optimize for, and if they weren’t to be overly complex, then there would necessarily be many common instruction sequences that performed suboptimally under the default rule set. Or, in plain English: the makeup of the code stream changes from application to application and from moment to moment, but the rules responsible for scheduling the code stream’s execution would be forever fixed.
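A toy model makes the scaling problem visible. The pairing predicate below is invented for illustration (the Pentium’s real U/V rules were considerably more involved), but it captures the shape of a static rule table, and counting cases shows how the table a three-wide machine would need grows combinatorially:

```python
# Toy model of static, rules-based dispatch: a fixed predicate decides
# which instruction pairs may issue together. The rules here are
# invented stand-ins (e.g. "only U has a shifter"), not the Pentium's
# actual pairing rules.

from itertools import product

KINDS = ["alu", "shift", "load", "branch"]

def can_pair(u, v):
    """Invented rule set for a two-wide U/V machine."""
    if v == "shift":                      # toy: the V pipe lacks a shifter
        return False
    if u == "load" and v == "load":       # toy: one memory op per cycle
        return False
    if u == "branch" and v == "branch":   # toy: one branch per cycle
        return False
    return True

two_wide = sum(can_pair(u, v) for u, v in product(KINDS, repeat=2))
three_wide = len(list(product(KINDS, repeat=3)))  # cases a 3-issue table must cover

print(two_wide, "legal pairs out of", len(KINDS) ** 2)        # 10 legal pairs out of 16
print(three_wide, "combinations for a three-wide rule table")  # 64 combinations
```

However the rule set is tuned, it is frozen at design time; the instruction window of the P6, discussed next, is precisely a way of replacing this fixed table with dynamic scheduling decisions.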
