RMG and Associates

Insightful, timely, and accurate

Semiconductor Technology Consulting

Semiconductor & Patent Expert Consulting

Ron@Maltiel-consulting.com

(408) 446-3040

_____________________________________________________________________________________________________________________________________________

ISSCC 2011 Highlights

 

1. ISSCC panel sees challenges at 20-nm / EETimes

2. ISSCC: China eyes petaflops, IBM hits 5 GHz / EETimes

3. Intel details Sandy Bridge at ISSCC / EETimes

4. AMD's Bulldozer at ISSCC 2011 / PC Perspective

5. Samsung takes DRAM 512-pins wide / EETimes


 

_____________________________________________________________________________________________________________________________________________

1. ISSCC panel sees challenges at 20-nm

Mark LaPedus 2/23/2011 9:02 PM EST

SAN FRANCISCO – After some debate, there is finally some consensus at the 22-/20-nm logic node, at least among leading-edge foundries.
 
During a panel session at the 2011 International Solid-State Circuits Conference (ISSCC) here, IBM, Globalfoundries and TSMC all agreed that they would extend planar bulk CMOS to the 22-/20-nm node. In other words, don’t expect foundries to embrace FinFETs, fully depleted SOI, multi-gate transistors or other newfangled structures at 22-/20-nm.

Intel Corp., another member of the panel, is still keeping its cards close to the vest and did not reveal its transistor plans at 22-/20-nm.

Still, there were few surprises during the panel. All vendors agreed that the 22-/20-nm node would also make use of copper interconnects, high-k/metal gates and ultra low k. Leading-edge chip makers are stuck using today’s 193-nm immersion lithography with double patterning.

This is because extreme ultraviolet (EUV) lithography won’t be ready in time for this logic node. ASML Holding NV has recently shipped one pre-production EUV tool, reportedly to Samsung Electronics Co. Ltd. Throughput remains low and tool cost is astronomical.

"EUV is making progress," said Mark Bohr, an Intel senior fellow and director of process architecture and integration, in an interview. "It's not ready for prime time."

Needless to say, the 22-/20-nm logic node will be challenging. Besides lithography, chip makers must still wrestle with high-k integration, power consumption, variability and cost at the node.

Bohr believes design and manufacturing teams must work more closely together, a concept he called co-optimization. "Co-optimization needs to start earlier at the research phase," he said.

In any case, Globalfoundries, IBM, Intel, TSMC and Samsung have announced some details about their upcoming 22-/20-nm processes. At present, leading-edge chip makers are using conventional bulk CMOS and planar transistor structures for the 32-/28-nm nodes and above. For years, chip makers have used bulk CMOS. It is a well understood, cheap and safe technology.

But going forward at 22- and 16-nm, there are some who believe that bulk CMOS will run out of gas. At those nodes, there are a number of transistor candidates on the table: III-V, bulk CMOS, FinFET, FD-SOI, multi-gate, among others.

Some are pushing hard for one technology. Jockeying for position in the next-generation transistor race, the SOI Industry Consortium claims to have made further progress in bringing fully-depleted silicon-on-insulator (FD-SOI) technology to next-generation mobile products.

The consortium's members (ARM, Globalfoundries, IBM, STMicroelectronics, Soitec and CEA-Leti) have announced results of an assessment and characterization of FD-SOI, saying that the technology is viable for mobile and consumer devices at the 20-nm node and perhaps beyond.

But even IBM Corp. admits that bulk CMOS will be the key technology for 22-/20-nm. FD-SOI "will not be ready in time" for that node, said Ghavam Shahidi, IBM Fellow and director of silicon technology for the T.J. Watson Research Center, during the panel.

Samsung Electronics Co. Ltd., one of the members of IBM's "fab club" or technology alliance, recently rolled out its 20-nm process, which is based on bulk CMOS.

At the ISSCC panel, Bill Liu, vice president of technology solutions at Globalfoundries Inc., also listed the technologies expected to be adopted at 22-/20-nm: high-k/metal gate, 193-nm immersion lithography with double patterning and strain engineering, among others.

After pushing the gate-first high-k approach for the 32- and 28-nm nodes, IBM's technology partners will move to rival gate-last technology at the 20-nm node. IBM's partners include AMD, Globalfoundries, Samsung and others, all of which insisted that the gate-first approach was better, until now.

Intel, TSMC and others have embraced the gate-last approach.

Meanwhile, Liu said the key challenges for the 22-/20-nm node will include power and lithography. "At 22-nm, lithography will become more of a gating item," Liu said. Using 193-nm with double patterning will force chip makers to use "more restrictive design rules," he said.

Min Cao, director of 20-nm development at Taiwan Semiconductor Manufacturing Co. Ltd. (TSMC), believes power and variability will remain challenging. "Power constrained designs drive Vcc scaling but (the problem is) that variability goes up," he said.


____________________________________________________________________________________________________________________________________________

2. ISSCC: China eyes petaflops, IBM hits 5 GHz

Rick Merritt

2/21/2011 7:30 PM EST

ISSCC papers described a China-designed processor that aims to power a petaflops supercomputer and an IBM mainframe chip that hits 5.2 GHz.

SAN FRANCISCO – China gave a look at a national microprocessor design that aims to power a petaflops supercomputer in a paper at the International Solid-State Circuits Conference here. At the same session, IBM described a 5.2 GHz mainframe processor, continuing to push the limits of CPU frequency.

Separately, Advanced Micro Devices gave more details of its new Bulldozer core and Intel described two server chips.

Weiwu Hu, lead designer of China's Godson processor family, described the Godson-3B, which will emerge in a high-performance system from Dawning (Shenzhen) this summer. The Dawning system will use 3,000 Godson-3B chips to deliver about 300 teraflops, Hu said. A future system aims to crack the petaflops barrier using the chip.

The eight-core 65nm processor delivers 128 GFlops at 1.05 GHz and is already in production at STMicroelectronics. The next-generation Godson-3C will be a 16-core version targeting 512 GFlops at 2 GHz operation in 28nm technology and is still two years away, said Hu.
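
A quick sanity check of those figures, as illustrative arithmetic only (the per-core breakdown is inferred from the quoted totals, not taken from the paper):

cores, freq_ghz, peak_gflops = 8, 1.05, 128.0
# implied per-core throughput: ~15.2 flops/cycle, i.e. wide vector units
print(peak_gflops / (cores * freq_ghz))
# 3,000 chips x 128 GFlops = 384 TFlops of aggregate peak, in line with the
# roughly 300 TFlops Hu quoted for the delivered Dawning system
print(3000 * peak_gflops / 1000)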

The Godson-3B is based on a 64-bit MIPS core with 200 instructions added for x86 compatibility and integrates vector processing units. The chip was previously described at Hot Chips in August.

The chip and a related operating system make up one of 16 projects funded by China's national science and technology initiative. As such it will receive an estimated $5-10 billion over the 2006-2020 period, Hu said. Other projects include next-generation VLSI process technology, 4G networking, a high-resolution satellite system and China's space exploration program, he said.

 



IBM hits 5.2 GHz

Separately, IBM pushed its z196 CPU to 5.2 GHz, about 18 percent above the previous 4.4 GHz z10 chip, while maintaining a similar thermal envelope. Frequency is key to the top single-thread performance mainframe-class systems require.
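
The 18 percent figure checks out directly from the two clock rates (illustrative arithmetic):

z10_ghz, z196_ghz = 4.4, 5.2
print(f"{(z196_ghz - z10_ghz) / z10_ghz:.1%}")  # 18.2%, matching the quoted 'about 18 percent'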

"We think there is still room for future improvements, but frequency increases won't go on forever," said Jim Warnock, a senior IBM engineer.

The multichip modules used in IBM's z-Series servers include six CPUs and two L4 cache chips and consume a whopping 1,800 Watts. The processors themselves have a 260 Watt power budget, but could get more headroom if new MCM materials can be found.

"We are the last of the high-end processors still pushing higher frequency," said Warnock.

To get the speed boost, IBM moved the design from 65 to 45nm SOI process technology and made extensive use of embedded DRAM, both for memory and as decoupling capacitors. IBM also added out-of-order execution to the chip; all the new techniques together yielded a 40 percent total performance improvement.

IBM conducted extensive power analysis of the design, calculating dynamic and leakage power under various workloads. The resulting power and thermal budgets were applied throughout the effort, from planning to the physical implementation of the part.

Designers used an extensive set of tools to optimize frequency tuning, including master clock delay adjustments and local clock pulse-width and timing controls.

In a separate paper, IBM described 14-bit cache-hit logic tied to the SRAM cache blocks, used to enable the high frequencies on eight critical paths.



AMD Bulldozer, Intel server CPUs

Advanced Micro Devices provided more details on its new Bulldozer core, first described at Hot Chips in August. AMD senior engineer Hugh MacIntyre said the core enables 3.5 GHz operation in the same power and thermal envelope as AMD's prior core design.

The core delivers linear performance across the range of frequencies and 0.8-1.3V operating voltages it will need. It uses 213 million transistors in a 30.9mm2 block with 11 metal layers in a 32nm SOI process, he said.

A separate paper described Bulldozer's 40-entry out-of-order instruction scheduler and execution unit, which can issue up to four instructions per cycle. The unit helps the core meet its target of delivering 90 percent of the performance of past AMD cores with a significant reduction in area and power, said Michael Golden, another AMD engineer.

In another paper, Intel described some of the techniques used to rein in power on its 10-core Westmere-EX server processor. The 32nm chip includes two memory controllers and four Intel QuickPath Interconnect (QPI) CPU interfaces handling up to 6.4 GTransfers/second.

Intel shaved half a Watt from the chip's power consumption by using a 32-byte wide ring interconnect to link the cores. It was designed using latch- and flop-based sequentials, said Shankar Sawant, senior engineer at Intel Bangalore.

He described a handful of power management features, including the use of multiple voltage domains and new lower-power states and sub-states in the cores. In addition, Intel applied temperature compensation and receive equalization techniques on the QPI interconnect to enable its 6.4 GTransfers/second maximum throughput.

Intel also made the first technical disclosures of Poulson, the next member of its Itanium family and the first to use eight cores and a 12-instruction-wide data path.



_____________________________________________________________________________________________________________________________________________

3. Intel details Sandy Bridge at ISSCC

Dylan McGrath 2/23/2011 1:04 AM EST

SAN FRANCISCO—Intel Corp. disclosed more technical details of its 32-nm Sandy Bridge processor at the International Solid-State Circuits Conference here Tuesday (Feb. 22), including further description of its modular ring interconnect, the design techniques used to minimize the cache's operating voltage and the inclusion of a debug bus for monitoring traffic on the interconnect.

The 32-nm Sandy Bridge processor integrates up to four x86 cores, a power/performance optimized graphics processing unit (GPU) and DDR3 memory and PCI Express controllers on the same die, according to the paper presented at ISSCC Tuesday by Ernest Knoll, a designer at Intel's design center in Haifa, Israel. Sandy Bridge features 1.16 billion transistors and a die size of 216 square millimeters, Knoll said.

The Sandy Bridge IA core implements several improvements that boost performance without increasing power consumption, including an improved branch prediction algorithm, a micro-operation cache and floating-point advanced vector extensions, according to the paper. Also, the device's CPU cores and GPU share the same 8MB level-3 cache memory, according to the paper.

Although the L3 cache is organized in four slices alongside the x86 cores, 2MB per core, it is fully shared with the GPU, Knoll said.



Sandy Bridge's ring interconnect fabric connects all the elements of the chip, including the CPUs, the GPU, the L3 cache and the system agent. Because the ring interconnect is modular, the four-core die can easily be converted into a two-core die by "chopping" out two cores and two L3 cache modules, according to Knoll's presentation. The initial versions of Sandy Bridge are available in two- and four-core variants.

"By simply 'chopping' two slices, we get to another level of die," Knoll said.

Intel provided the first details about the Sandy Bridge family of heterogeneous processors at the Intel Developer Forum here last September. Intel introduced the first Sandy Bridge products, the second generation of the company's Core processor family, at the Consumer Electronics Show in January. Some of the devices have been shipping since early January and Intel expects them to be incorporated into more than 500 laptop and desktop PC designs this year.

Minimize power consumption
Because Sandy Bridge's x86 cores and L3 cache share the same power plane, Intel faced the risk that the minimum voltage needed to retain the L3 cache data would limit the minimum operating voltage of the cores, increasing the power consumption of the system, according to the paper. Intel got around this by developing several circuit and logic design techniques that bring the minimum operating voltage of the L3 cache and the chip's register files below that of the core logic, according to the paper.

"One of the design targets was to minimize as much as possible power consumption," Knoll said.

One of the techniques used to skirt the issue was a shared p-channel MOSFET that weakens the effective strength of the memory cell pull-up device, solving the problem of register file (RF) write-ability degradation at low voltages that can be created by manufacturing process variations, Knoll said.

"With such techniques we are able to improve the minimum operating voltage for a vast majority" of the chip, Knoll said.

Thanks to the use of these design techniques, Sandy Bridge's power dissipation ranges from 95W for a four-core device operating in a high-end desktop to 17W for a two-core Sandy Bridge running in an optimized mobile product, according to the paper.

Sandy Bridge also introduces a debug bus that allows monitoring of the traffic between the x86 cores, GPU, caches and system agent on the processor's internal ring, according to the paper. The bus, dubbed the Generic Debug eXternal Connection (GDXC), allows chip, system or software debuggers to sample ring data traffic as well as ring protocol control signals and drive them to an external logic analyzer, where they can be recovered and analyzed, according to the paper.

"The GDXC is a valuable tool for system and software debuggers," Knoll said.

Sandy Bridge also includes two different types of thermal sensors to monitor the temperature of the die, according to the paper. One is a diode-based thermal sensor on each core that converts the diode voltage into a temperature reading, providing information for throttling, catastrophic shutdown and fan regulation. The second is a much smaller CMOS-based thermal sensor with a more limited temperature range that can be placed at several locations inside the core to provide an accurate picture of core hot spots, according to the paper.
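
For context, diode-based sensing works because a forward-biased silicon junction's voltage falls nearly linearly with temperature, at roughly -2 mV per degree C. A sketch of the conversion (textbook constants and an assumed calibration point, not Intel's circuit):

V_CAL, T_CAL = 0.70, 25.0   # assumed diode voltage (V) at an assumed 25 C calibration point
SLOPE = -0.002              # ~-2 mV/C, a typical silicon-junction temperature coefficient

def diode_temp_c(v_diode):
    # a lower diode voltage means a hotter junction
    return T_CAL + (v_diode - V_CAL) / SLOPE

print(diode_temp_c(0.60))   # 75.0: a 100 mV drop reads as ~75 C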

Earlier this year, Intel discovered a design flaw in one of the support chips for the first quad-core version of Sandy Bridge that began shipping Jan. 9. The company came up with a quick fix for the issue and temporarily halted shipments of the support chip. Intel later resumed shipments of the flawed chip to PC suppliers that were implementing it in systems where the flaw would not be an issue.



____________________________________________________________________________________________________________________________________________

4. AMD's Bulldozer at ISSCC 2011

 

The more I read about Bulldozer, the more impressed I am becoming.  AMD has barely been able to keep up with Intel for the past 6 years, which is disappointing considering the success they found with the original Athlon and then Athlon 64.  Since Intel introduced the Core 2 series, AMD has only been able to design and sell processors that would sometimes achieve around 90% of the overall performance of Intel's top end processors.  AMD also suffered from die size and performance per watt disparities as compared to Intel's very successful Core 2 and Core i7/i5/i3 processors.  The latest generation of Sandy Bridge based units again exposed how far behind AMD's Phenom family of chips was in overall design and performance.

Before we all go off the deep end and claim that AMD will surpass Intel in overall performance, we need to calm down.  We are getting perilously close to the very limits of IPC (instructions per clock) with current technology and designs.  It appears to me that with the Bulldozer architecture, AMD should reach parity with Intel and their latest generation of CPUs when it comes to IPC per core.  Where AMD could have an advantage over Intel are in several specific categories.



Fetch, decode, L2 cache, and a beefy floating point/SIMD unit are all shared.  Since the majority of CPU work is integer based, AMD has implemented two complete integer units per CPU module.

At this year's ISSCC, AMD presented several papers on the Bulldozer architecture.  These cover in depth a handful of features that the new core will bring to AMD's lineup.  The first portion is essentially a refresh of the information we were given last fall about the overall architecture of the Bulldozer core.  The second looks at the changes in the schedulers which feed the integer execution units.  The final paper covers the power saving techniques.

A Clean Sheet Design

Bulldozer brings very little from the previous generation of CPUs, except perhaps the experience of the engineers working on those designs.  Since the original Athlon, the basic floor plan of AMD's CPU architecture has remained relatively unchanged.  Certainly there were significant changes throughout the years to keep up in performance, but the 10,000 foot view of the decode, integer, and floating point units stayed very similar.  TLBs increased in size, more instructions were kept in flight, and so on.  Aspects such as larger L2 caches, integrated memory controllers, and the addition of a shared L3 cache have all brought improvements to the architecture.  But the overall data flow is very similar to that of the original Athlon introduced some 12 years ago.

As covered in our previous article about Bulldozer, it is a modular design which will come in several flavors depending on the market it is addressing.  The basic building block of the Bulldozer design is a 213 million transistor module which features 2 MB of L2 cache.  This block contains the fetch and decode unit, two integer execution units, a shared 2 x 128 bit floating point/SIMD unit, L1 data and instruction caches, and a large shared L2 unit.  All of this is manufactured on GLOBALFOUNDRIES' 32nm, 11 metal layer SOI process.  The entire module, including its 2 MB of L2 cache, occupies approximately 30.9 mm2 of die space.

It is well known that Bulldozer embraces the idea of “CMT”, or chip multi-threading.  While Intel supports SMT on its processors, that is not the most efficient way of doing things.  SMT sends two threads to the same execution unit in an attempt to maximize the work being done by that unit; essentially, fewer cycles are wasted waiting for new instructions or resultant data.  AMD instead chose to implement multi-threading in a different way.  For example, a Bulldozer processor comprised of four modules will have eight integer execution units and four shared 2 x 128 bit floating point/SIMD units.  This allows the OS to see the chip as an eight core unit.

CMT balances die space and threading performance seemingly much better than SMT (it scales at around 1.8x a single core, as compared to 1.3x using SMT) and CMP (chip multi-processing; each core may not be entirely utilized, and the die cost of replicating entire cores is much higher than in CMT).  This balance of performance and die savings is the hallmark of the Bulldozer architecture.  AMD has gone through and determined what structures can be shared, and what structures need to be replicated in each module.  CMT apparently only increases overall die space by around 5% in a four module unit.
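
Those scaling figures are easiest to compare as two-thread throughput per unit of die area. A rough illustration using the numbers above (the ~5% SMT area premium is my assumed rule of thumb, not an AMD figure):

# two-thread throughput per unit of die area, normalized to one core = 1.0
schemes = {
    "SMT": (1.3, 1.05),   # ~1.3x throughput for ~5% extra area (assumed premium)
    "CMT": (1.8, 1.05),   # ~1.8x throughput for ~5% extra area per module (per the article)
    "CMP": (2.0, 2.00),   # a full duplicate core: ~2x throughput for ~2x area
}
for name, (throughput, area) in schemes.items():
    print(name, round(throughput / area, 2))   # SMT 1.24, CMT 1.71, CMP 1.0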



A closer look at the units reveals some nice details.  Note the dual MMX (SIMD-Integer) units in the FP/SIMD block.  A lot of work has been done on the front end to adequately feed the three execution units.


Gone is the three-pipeline integer unit of the Athlon.  Bulldozer uses a new four-pipeline design which further divides the workloads being asked of it: a multiply pipe, a divide pipe, and two address generation units.  Each integer unit is fed by its own integer scheduler.  The decode unit which feeds the integer units and the float unit has also been significantly beefed up.  And it had to be: it is now feeding far more execution resources than ever before.  The original Athlon had a decode unit comprised of three complex decoders.  The new design features four decoders, but we are so far unsure how the workload is divided among them.  For example, the Core 2 had four decoders, three of which were simple and the fourth complex.  My gut feeling here is that we are probably looking at three decoders which can handle 80 to 90% of the standard instructions, while the fourth handles the more complex instructions which need to be converted into more than one macro-op.  While this sounds similar to the Core 2 architecture, it does not necessarily mean the same thing.  It all depends on the complexity of the macro-ops being sent to the execution units, and how those are handled.

The floating point unit is also much more robust than it used to be.  The Phenom had a single 128 bit unit per core; Bulldozer now has 2 x 128 bit units per module.  It can combine those units to act as a single 256 bit unit when running AVX.  There are some performance limitations there as compared to the Intel CPUs which support AVX, and in those cases Intel should be faster.  However, AVX is still very new and largely unsupported.  AMD will have an advantage over Intel when running SSE based code: the unit can perform 2 x 128 bit operations, or up to 4 x 64 bit operations, per clock.  Intel on the other hand looks to support only 1 x 128 bit operation or 2 x 64 bit operations.  The unit officially supports SSE3, SSE 4.1, SSE 4.2, AVX, and AES.  It also supports advanced multiply-add/accumulate operations, something that has not been present in previous generations of CPUs.
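
In raw per-clock terms, the SSE claim above tallies as follows (an illustrative count of 64-bit FP operations taken from the figures in this paragraph):

ELEMENTS_PER_128BIT_OP = 2                  # one 128-bit op covers two 64-bit elements
bulldozer = 2 * ELEMENTS_PER_128BIT_OP      # 2 x 128-bit pipes -> up to 4 x 64-bit ops
intel = max(1 * ELEMENTS_PER_128BIT_OP, 2)  # either 1 x 128-bit op or 2 x 64-bit ops
print(bulldozer, intel)                     # 4 vs 2: the SSE throughput edge claimed above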

In terms of overall performance, a Bulldozer based core should be able to outperform a similarly clocked Intel processor featuring the same number of threads when being fully utilized.  Unfortunately for AMD, very few workloads will max out a modern multi-core processor.  Intel should have a slight advantage in single threaded/lightly threaded applications.  AMD does look to offset that advantage by offering higher clocked processors positioned against the slower clocked Intel units.  This could mean that a quad core i7 running at 3.2 GHz would be the price basis for a 4 module Bulldozer running at 3.5 GHz.

Exact specifications have not been released for the individual parts, but we can infer a few things here.  First off, it appears as though each module will have 2 MB of L2 cache.  This is quite a bit of cache, especially considering that the current Phenom II processors feature 512 KB of L2 cache per core.  Something that has allowed this to happen is buried in GLOBALFOUNDRIES' 32 nm SOI process: they were apparently able to shrink the SRAM cell size significantly from that of the previous 45 nm process, and allow it to clock quite a bit higher.  This should allow more headroom for the individual cores.  With the shrink, we should also expect to see at least 8 MB of shared L3 cache, with the ability to potentially clock higher than the 2 GHz the current L3 caches run at.

Integer Scheduler and Execution Unit

The second topic covered at ISSCC was the “40-Entry Unified Out-of-Order Scheduler and Integer Execution Unit for the AMD Bulldozer x86-64 Core”.  Single thread performance is still of great importance for modern processors, and this has been an area where AMD has lagged the competition.  The first work toward better single thread performance went into the fetch/prefetch, branch prediction, and decode units.  AMD has still not covered those portions in depth, other than to say that a lot of work has been done on each.

Each integer unit has its own scheduler and is comprised of two execution units and two address generation units; the execution units are divided so that one handles multiply and the other divide.  These are again newly designed units which have very little in common with previous processor architectures.

The schedulers have some very interesting wrinkles to them.  First off is the support for 40-entry, out-of-order scheduling.  It also supports up to 4 x 64 bit instructions in flight.  Michael Golden presented the paper, and his quote about the clock characteristics of these tightly knit units is as follows:

The out-of-order scheduler must efficiently pick up to four ready instructions for execution and wake up dependent instructions so that they may be picked in the next cycle. The execution units must compute results in a single cycle and forward them to dependent operations in the following cycle. All of this is required so that the module gives high architectural performance, measured in the number of instructions completed per cycle (IPC).
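
That description maps onto the classic wakeup/select loop of an out-of-order scheduler. Here is a toy Python model of the loop (purely illustrative; the real unit does this in parallel hardware, not software):

PICK_WIDTH = 4   # up to four ready instructions picked per cycle, per the paper

def scheduler_cycle(entries):
    # select: pick up to PICK_WIDTH instructions whose source operands are all ready
    picked = [e for e in entries if not e["waiting_on"]][:PICK_WIDTH]
    for e in picked:
        entries.remove(e)
    # wakeup: results forward next cycle, so dependents become pickable then
    produced = {e["dest"] for e in picked}
    for e in entries:
        e["waiting_on"] -= produced
    return picked

window = [                                   # a 2-entry slice of the 40-entry window
    {"dest": "r1", "waiting_on": set()},
    {"dest": "r2", "waiting_on": {"r1"}},    # wakes when r1's result forwards
]
print([e["dest"] for e in scheduler_cycle(window)])  # ['r1']
print([e["dest"] for e in scheduler_cycle(window)])  # ['r2'] one cycle later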

What is perhaps the most interesting aspect of these new designs is the use of standard cells vs. fully custom cells.  Place and route of standard cells can be automated, and it is relatively easy to create complex designs fairly quickly.  Custom cell layout is very complex and time consuming, but it has the advantages of being very efficient in terms of power consumption and of switching faster than standard cell designs.  Somehow AMD has taken a standard cell design on GLOBALFOUNDRIES' 32 nm SOI process and made it perform at custom cell levels.  The integer execution units and the scheduler run at the same 3.5 GHz+ speed as the rest of the chip, even though portions of the design are built from standard cells.

This apparently has allowed AMD to rapidly prototype these designs.  That carries the advantage of being able to deliver to market faster than going with a fully custom part, and it also allows AMD to further test the performance and attributes of the standard cell design and possibly change it without the time and manpower constraints of custom cells.  How AMD has achieved this is beyond me.  Being able to follow standard cell design rules and achieve custom cell performance has been the holy grail of CPU/GPU design.  Obviously this has limitations, as the entire processor is not comprised of standard cells.  I believe that Intel also utilizes some standard cell features in its latest series of processors, so AMD is not exactly alone here.

Power

Previous AMD processors were not designed from the ground up to implement complex and efficient power saving schemes.  Since Bulldozer is an entirely new design, the engineers were able to build power saving into the processor far more effectively.  Throughout the years we have seen small jumps forward from AMD in power saving techniques, but Bulldozer will be the first desktop/server product with a fully comprehensive suite of power saving technologies.



The CPU, in typical workloads (obviously not including "Furmark" in SLI/Crossfire situations), takes up the majority of power in a system.  Cutting a significant percentage of the power draw of that one component decreases the overall system draw to a great degree.

AMD now has fully gated power to the individual cores, which allows them to be completely turned off when not in use.  The sharing of functional units (such as fetch and decode) between the cores in a module also cuts down on the complexity, and thereby the power draw, of the overall processor relative to how many logical cores it presents.  The clock grid (which distributes the clock signals throughout the processor) has also been radically redesigned to be less of a power sink while still keeping the processor clicking along efficiently.

Clock gating, which turns off individual components such as execution units, has been much more thoroughly implemented.  There are something like 30,000 clock enables throughout the design, and they should allow an unprecedented amount of power savings (and heat reduction) even when the CPU is at high usage rates.  Even though a processor might be at 100% utilization, not all functional units are being used or need to be clocked.  By having highly granular control over which units can be gated, overall TDP and heat production can be reduced dramatically even at high utilization rates.
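
The payoff of that granularity is easy to see in a simple power model (made-up unit wattages and activity factors, purely for illustration):

# toy model: gating a unit's clock removes its dynamic power; leakage remains
units = {
    # name: (dynamic watts when clocked, fraction of cycles actually busy)
    "int_pipes": (20.0, 0.90),
    "fp_simd":   (25.0, 0.30),  # an integer-heavy workload barely touches FP
    "decode":    (10.0, 0.70),
    "l2_arrays": (15.0, 0.50),
}
LEAKAGE_W = 15.0
ungated = LEAKAGE_W + sum(dyn for dyn, _ in units.values())
gated = LEAKAGE_W + sum(dyn * busy for dyn, busy in units.values())
print(f"{ungated:.0f} W without gating vs {gated:.0f} W with per-unit gating")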

AMD Turbo Core will also receive a great amount of attention.  The current Turbo Core we see in the X6 processors is somewhat underwhelming considering how simple AMD's implementation is.  For example, when three or fewer cores are being utilized on the X6 1090T, those cores will clock up to 3.6 GHz while the other three drop to 800 MHz.  There is no real fine tuning of performance or TDP here, just an “on/off” switch that clocks half the cores 400 MHz higher while downclocking the rest.  This is fairly basic compared to Intel's system.  Now it seems that AMD is implementing a scheme much like Intel's: we should see Turbo frequencies that vary with the number of active cores, much more similar to what Intel offers with Sandy Bridge.
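
The difference between the two schemes boils down to the granularity of the frequency table. A sketch of both policies as described above (the X6 1090T clocks are the ones quoted; the finer-grained clocks are hypothetical illustrations, not announced AMD specs):

def x6_1090t_turbo(active_cores):
    # all-or-nothing: three or fewer active cores -> 3.6 GHz, else the 3.2 GHz base
    return 3.6 if active_cores <= 3 else 3.2

def fine_grained_turbo(active_cores):
    # Intel-style per-count bins; these clocks are hypothetical
    table = {1: 3.9, 2: 3.8, 3: 3.7, 4: 3.6}
    return table.get(active_cores, 3.2)

for n in (1, 3, 6):
    print(n, x6_1090t_turbo(n), fine_grained_turbo(n))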



Due to the ground-up design of Bulldozer, and its focus on decreasing power draw and heat production, we will see a nice reduction in power use across the entire processor.

In Closing

Bulldozer is a comprehensive blank sheet design, very similar in scope to the jump the company took going from the K5/K6 to the original Athlon.  AMD certainly hopes that it will be able to compete more adequately with Intel in terms of overall performance per watt, as well as die size and transistor count.  When the Phenom was originally detailed, many thought that it would prove to be the counter to the Core 2 that AMD needed, but unfortunately that design was not forward thinking enough to adequately compete.  Up through the current generation of parts, Intel was able to use fewer transistors and a smaller die size to create products that were significantly faster than what AMD was able to provide.

All indications so far point to Bulldozer being at the very least a competitive design.  I believe that Intel will have an advantage in instructions per clock when handling single threaded and lightly threaded workloads.  But AMD certainly looks to counter that by providing processors which will clock higher than the Intel counterparts, yet still remain in the same thermal envelope as the competition.  AMD has also made a big push to cut down the transistor count yet still retain the necessary performance to compete.  We should see leaner, meaner die sizes from AMD when compared to Intel products in the same performance range.  Consider that the Phenom II X4 processors had almost the same die size as the Core i7 9x0 series from Intel, but simply could not compete at the same level.  Bulldozer looks to change that.

If AMD has designed the front end of each core as we hope they have, then in heavily threaded applications Bulldozer should have a distinct performance advantage as compared to the SMT based Intel parts.  This is not a given though.  Processor design is hard.  This much is obvious, as there are few CPU companies out there.  AMD has an aggressive approach with the Bulldozer design, and I can foresee much more work being done with the fetch and decode units in further generations of products to more adequately feed the integer and floating point execution units.  That being said, it still looks to be a very fast part across a variety of workloads.

Until AVX hits primetime, AMD should again have a performance advantage with their FPU/SIMD design.  Being able to do 4 x 64 bit or 2 x 128 bit FP/SSE instructions per clock will give much higher throughput than the competing Intel unit.  Only when AVX instructions are run will we see Intel take the lead with their current designs.

Bulldozer has a lot of heavy expectations being laid upon it.  And so far, the people at AMD have seemed very excited about it, probably much more excited about the potential of this part than they were about the first Bobcat based Fusion processors (which have already proven to be a hit with OEMs and consumers alike).  We again must temper our expectations though, as we have been let down multiple times in the past by AMD and their new wonderchips.  Then again, the original Athlon and the follow-up Athlon 64 proved to be quite successful and gave Intel a serious run for its money.  For the sake of competition, I hope Bulldozer can deliver.


____________________________________________________________________________________________________________________________________________

5. Samsung takes DRAM 512-pins wide

Peter Clarke

2/21/2011 7:24 AM EST


LONDON – Samsung Electronics Co. Ltd. has announced the development of a 1-Gbit DRAM with a 512-pin wide I/O interface intended for mobile applications such as smartphones and tablet computers.

The chip is implemented in a manufacturing process technology somewhere between 50- and 59-nm, and Samsung is due to present a paper related to wide I/O DRAM technology at the 2011 International Solid-State Circuits Conference, being held February 20 to 24 in San Francisco.

To boost data transmission, the wide-I/O DRAM uses 512 pins for data input and output, compared with a maximum of 32 pins on the previous generation of mobile DRAMs. Including pins for commands, power supply and regulation, the WIO DRAM is designed to have up to 1,200 pins.

Samsung did not indicate whether it intends to offer the 1-Gbit WIO DRAM as a packaged part or to use it as a bare die in multi-chip packages. Nor did Samsung state when engineering samples of the 1-Gbit WIO DRAM would be available or when it would be in volume production.

Nonetheless, as a result of the extreme I/O width, the 1-Gbit WIO DRAM can transmit data at 12.8-Gbytes per second, increasing the bandwidth of mobile DDR DRAM eightfold while reducing power consumption by approximately 87 percent. The bandwidth is also four times that of LPDDR2 DRAM, which runs at approximately 3.2-Gbytes per second, Samsung said.
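
The bandwidth figure follows directly from the interface width. A quick check (the per-pin data rate is inferred from the quoted totals, not stated by Samsung):

# 512 data pins at an implied 200 Mbit/s per pin (inferred: 12.8 GB/s / 512 pins)
data_pins, per_pin_mbps = 512, 200
print(data_pins * per_pin_mbps / 8 / 1000)  # 12.8 GB/s, 4x LPDDR2's ~3.2 GB/s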

To follow on from the WIO DRAM launch, Samsung is planning for a 20-nm class 4-Gbit WIO mobile DRAM to become available in 2013.

"Following the development of 4-Gbit LPDDR2 DRAM last year, our new mobile DRAM solution with a wide I/O interface represents a significant contribution to the advancement of high-performance mobile products," said Byungse So, senior vice president, memory product planning and application engineering at Samsung Electronics, in a statement.


____________________________________________________________________________________________________________________________________________