
www.Usenet.com
"Robert Myers" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > On 26 Nov 2003 22:46:33 -0800, [EMAIL PROTECTED] (George William > Herbert) wrote: ... > >Bandwidth x distance is expensive. Lots of bandwidth over > >short distance is not very expensive. Lots of bandwidth across > >small distances on a modern chip is very very very cheap. > > > That's why streaming architectures are attractive, and the point that > everyone seems to be missing, or at least that no one bothers to > acknowledge, is that it is movement of data *on the chip* that is > expensive--in terms of power consumption. I summarized the argument > in another post here, and I'm not going to summarize it again. It's > in the sc2003 merrimac paper > > http://www.sc-conference.org/sc2003/paperpdfs/pap246.pdf > > >For highly partitionable problems, the cost-effective > >optimum partition size can be analyzed and modeled by > >looking at the costs of transmitting partition cell > >edge state to neighbors versus storing/calculating it > >locally. For given problems and chip technologies there > >are different optimizations. > > > >By putting a large number of tiny but moderately powerful > >CPU/Custom FP units per ASIC chip, they are getting very > >good average neighbor to neighbor bandwidth. Or at least > >can do so and presumably did. Going off-chip to the > >neighbors on a circuit board hurts, but again is subject > >to cost / technology optimization, along with the > >CPU capacity per unit and bandwidth internally... > > > You really do need to read the paper. All right, I took you at your word, and have the following observations: 1. Existing local per-core (L1, and sometimes L2) caches certainly go a long way toward addressing the paper's observation that communication (between storage and computational unit) can take far more energy than computation. For their example where sending 3 64-bit data elements about 15 mm. 
(from a large chip-wide cache like Itanic's L3 or POWER4+'s L2) took 20x
as much power as performing a single computation on them, a local L1
cache with a mere 95% hit rate (hardly unrealistic for the kind of tuned
code being considered in the paper) would reduce that ratio to something
much more like 1:1 on average, since only 5% of accesses would pay the
20x cost (or perhaps 2:1 if the L1 communication distance contributes a
non-negligible amount). A local L2 such as Itanic's should help ensure
decent local hit rates and energy consumption in architectures that
employ very small L1s, and such local caches scale down with smaller
feature sizes to preserve this relationship.

2. The paper makes the far-too-common error of comparing current
general-purpose technology with future special-purpose vaporware.
Multi-core designs like Sun's 8-core Niagara (expected in 2005-6) and
Intel's reportedly 8-core Tanglewood (expected in 2006-7) are a major
step toward large per-chip numbers of computational units - especially
to the degree that each core has multiple such units and some form of
multi-threading to exploit them when significant workload parallelism
is present. These chips will almost certainly retain the local per-core
caches described above and hence offer potentially good
computation-vs.-communication power (and performance) characteristics.
If continued decreases in feature sizes start to make the local cache
hit rate insufficient to mask chip-wide-cache latency (and communication
power consumption), the obvious solution is to break up the chip-wide
cache into separate, closer shared caches for subsets of the cores, and
communicate among these subsets just as multi-core chips communicate
off-chip.

The Sun and Intel designs will also presumably employ sequential
prefetch optimizations at least between chip-wide cache and main memory,
so the kind of streaming access to main memory described in the paper
should be possible for the kind of specialized code it describes. At
least for architectures that support prefetch hints, this presumably
would, or at least potentially could, carry the data right through to
the core-local caches.

I'm not going to spend enough time studying this to understand the
detailed nature of the specialized support in the proposed chip for
pipelining computations from one core to the next (rather than within a
single core, as is currently common) without the latency inherent in
having the data pass through a chip-wide cache. I do see some
hand-waving about better-than-chip-wide SRF locality, but exactly how
this would be exploited in such a computation (which at a minimum would
appear to require knowledge of which cores were near neighbors and which
were not) is not clear at first (quick) glance.

3. And of course the most important point is that the Sun and Intel
designs address not only specialized HPTC needs but significant
commercial needs as well - so they will be both relatively inexpensive
due to highish volume and well-maintained over time, neither of which is
likely with a far more specialized chip such as the one the paper
proposes. The paper's estimate of a $200 cost is laughable to the point
of being absurd: just blindly migrating the design to each new process
generation as it came along, with no enhancements to take advantage of
the additional opportunities that opened up, would likely make it far
more expensive than that for the volumes that it would likely generate,
leaving aside both initial development costs and marginal production
costs. No company in its right mind would be likely to touch this
without the expectation of a market price approaching 5 figures to
justify the effort and development risk.
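
The arithmetic behind the 1:1 and 2:1 figures in point 1 can be checked
with a minimal Python sketch. The function name and the local_cost
parameter are made up for illustration, and the cost model (every access
pays the local cost, misses additionally pay the remote cost, everything
measured in units of one computation) is an assumption rather than
anything taken from the paper:

```python
def avg_comm_to_compute_ratio(hit_rate, remote_cost, local_cost=0.0):
    """Average communication energy per access, in units of one computation.

    hit_rate    -- fraction of accesses satisfied by the core-local cache
    remote_cost -- energy of a fetch from the chip-wide cache (20 in the
                   example discussed above)
    local_cost  -- energy of a core-local access (hypothetical parameter;
                   0 means it is treated as negligible)
    """
    miss_rate = 1.0 - hit_rate
    # Assumed model: every access pays the local cost, and only misses
    # pay the additional cost of going to the distant chip-wide cache.
    return local_cost + miss_rate * remote_cost


if __name__ == "__main__":
    # 95% local hit rate, remote fetch costing 20x a computation.
    negligible_l1 = avg_comm_to_compute_ratio(0.95, 20.0)
    # Same, but assuming a local access costs as much as one computation.
    costly_l1 = avg_comm_to_compute_ratio(0.95, 20.0, local_cost=1.0)
    print(f"L1 access negligible: about {negligible_l1:.2f}:1")
    print(f"L1 access ~ 1 compute: about {costly_l1:.2f}:1")
```

With a negligible local access cost this comes out at roughly 1:1, and
at roughly 2:1 if a local access is assumed to cost about as much as the
computation itself.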

IOW, I'd suggest that for the problem-space described people might well
be better off figuring out how best to use the commercial processors
already in the development pipeline - and that these designs might well
at least approach the capabilities of the special-purpose hardware that
was described (don't forget, they'll be clocked many times as fast as
the target described for that hardware). Unless I greatly over-estimate
the expense of creating large, complex, high-performance chips (and the
recent experiences with Itanic would seem to suggest that I don't),
creating one for a small, special-purpose market such as the one
described seems to make no sense at all.

- bill