Usenet.com


Comp Thread Archive from Usenet.com

Re: 1teraflops cell processor possible?



"Robert Myers" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> On 26 Nov 2003 22:46:33 -0800, [EMAIL PROTECTED] (George William
> Herbert) wrote:

...

> >Bandwidth x distance is expensive.  Lots of bandwidth over
> >short distance is not very expensive.  Lots of bandwidth across
> >small distances on a modern chip is very very very cheap.
> >
> That's why streaming architectures are attractive, and the point that
> everyone seems to be missing, or at least that no one bothers to
> acknowledge, is that it is movement of data *on the chip* that is
> expensive--in terms of power consumption.  I summarized the argument
> in another post here, and I'm not going to summarize it again.  It's
> in the sc2003 merrimac paper
>
> http://www.sc-conference.org/sc2003/paperpdfs/pap246.pdf
>
> >For highly partitionable problems, the cost-effective
> >optimum partition size can be analyzed and modeled by
> >looking at the costs of transmitting partition cell
> >edge state to neighbors versus storing/calculating it
> >locally.  For given problems and chip technologies there
> >are different optimizations.
> >
> >By putting a large number of tiny but moderately powerful
> >CPU/Custom FP units per ASIC chip, they are getting very
> >good average neighbor to neighbor bandwidth.  Or at least
> >can do so and presumably did.  Going off-chip to the
> >neighbors on a circuit board hurts, but again is subject
> >to cost / technology optimization, along with the
> >CPU capacity per unit and bandwidth internally...
> >
> You really do need to read the paper.

All right, I took you at your word, and have the following observations:

1.  Existing local per-core (L1, and sometimes L2) caches certainly go a
long way toward addressing the paper's observation that communication
(between storage and computational unit) can take far more energy than
computation.  In their example, sending 3 64-bit data elements about
15 mm (from a large chip-wide cache like Itanic's L3 or POWER4+'s L2) took
20x as much power as performing a single computation on them.  A local L1
cache with a mere 95% hit rate (hardly unrealistic for the kind of tuned
code being considered in the paper) would pay that 20x cost on only 5% of
accesses, reducing the ratio to much more like 1:1 on average (or perhaps
2:1 if the L1 communication distance contributes a non-negligible amount).
A local L2 such as Itanic's should help ensure decent local hit rates and
energy consumption in architectures that employ very small L1s, and such
local caches scale down with smaller feature sizes to preserve this
relationship.

2.  The paper makes the far-too-common error of comparing current
general-purpose technology with future special-purpose vaporware.
Multi-core designs like Sun's 8-core Niagara (expected in 2005-6) and
Intel's reportedly 8-core Tanglewood (expected in 2006-7) are a major step
toward large per-chip numbers of computational units - especially to the
degree that each core has multiple such units and some form of
multi-threading to exploit them when significant workload parallelism is
present.  These chips will almost certainly retain the local per-core caches
described above and hence offer potentially good
computation-vs.-communication power (and performance) characteristics; if
continued decreases in feature sizes start to make the local cache hit-rate
insufficient to mask chip-wide-cache latency (and communication power
consumption), the obvious solution is to break up the chip-wide cache into
separate, closer shared caches for subsets of the cores, and communicate
among these subsets just as multi-core chips communicate off-chip.  The Sun
and Intel designs will also presumably employ sequential prefetch
optimizations at least between chip-wide cache and main memory, so the kind
of streaming access to main memory described in the paper should be possible
for the kind of specialized code it describes (and at least for
architectures that support prefetch hints this presumably would, or at least
potentially could, carry the data right through to the core-local caches).
I'm not going to spend enough time studying this to understand the detailed
nature of the specialized support in the proposed chip for pipelining
computations from one core to the next (rather than within a single core, as
is currently common) without the latency inherent in having the data pass
through a chip-wide cache:  I do see some hand-waving about
better-than-chip-wide SRF locality, but exactly how this would be exploited
in such a computation (which at a minimum would appear to require knowledge
of which cores were near neighbors and which were not) is not clear at first
(quick) glance.

3.  And of course the most important point is that the Sun and Intel designs
address not only specialized HPTC needs but significant commercial needs as
well - so they will be both relatively inexpensive due to highish volume and
well-maintained over time, neither of which is likely with a far more
specialized chip such as the one the paper proposes.  The paper's estimate
of a $200 cost is laughable to the point of being absurd:  just blindly
migrating the design to each new process generation as it came along, with
no enhancements to take advantage of the additional opportunities that opened
up, would likely make it far more expensive than that for the volumes that
it would likely generate, leaving aside both initial development costs and
marginal production costs.  No company in their right mind would be likely
to touch this without the expectation of a market price approaching 5
figures to justify the effort and development risk.
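
The arithmetic behind point 1 can be sketched in a few lines.  The 20x
figure and the 95% hit rate are the ones quoted above; the function name,
the choice of "one computation" as the energy unit, and the local_cost
parameter (what an L1 hit itself costs) are assumptions made purely for
illustration, not numbers from the paper.

```python
def average_comm_ratio(hit_rate, far_cost=20.0, local_cost=0.0):
    """Average communication energy per operand fetch, expressed in
    units of the energy of one computation.

    far_cost:   energy of a trip to the chip-wide cache (20x, as quoted above)
    local_cost: energy of a hit in the local L1 (hypothetical; 0 means
                the short L1 distance is treated as negligible)
    """
    miss_rate = 1.0 - hit_rate
    # Hits pay only the local cost; misses pay the long trip across the chip.
    return hit_rate * local_cost + miss_rate * far_cost


# L1 distance treated as negligible: 5% of accesses pay 20x
print(round(average_comm_ratio(0.95), 2))

# Assumed case where an L1 hit costs about as much as the computation itself
print(round(average_comm_ratio(0.95, local_cost=1.0), 2))
```

With the local cost set to zero this comes out at roughly 1:1, and with an
L1 hit assumed to cost about one computation it comes out at roughly 2:1,
which is the range given in point 1.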

IOW, I'd suggest that for the problem-space described people might well be
better off figuring out how to use the commercial processors already in
the development pipeline to best effect - and that these designs might well
at least approach the capabilities of the special-purpose hardware that was
described (don't forget, they'll be clocked many times as fast as the target
described for that hardware).  Unless I greatly over-estimate the expense of
creating large, complex, high-performance chips (and the recent experiences
with Itanic would seem to suggest otherwise), building one for a small,
special-purpose market such as the one described seems to make no sense at
all.

- bill





