
On Wed, 03 Dec 2003 00:20:53 GMT, "News sbc" <[EMAIL PROTECTED]> wrote:

<snip>

>
>Most computation involves more than one data item.
>I think this implies that the optimum place to perform
>a computation will be at something resembling the
>centroid of the data items involved - weighted by
>access frequency. With some allowance for treating
>instructions as data. During the life of a computation
>this "computational centroid" would shift over time.
>
>Parallel computations would be performed at something
>resembling a Viterbi constellation diagram.
>

The ADAM architecture described in Andrew "Bunnie" Huang's MIT Ph.D. thesis attempts to do something like that. I've heard him give a talk on it and read the thesis more than once, but I have a ways to go before I understand it well enough to give a summary better than what I think is in the title; viz., you try to keep data and the threads using them close to each other by migrating both.

>---
>
>Re streams: would you build a stream instruction set, or a
>vector instruction set?
>
>Obviously, vectors can be built on top of streams.
>Vice versa is a bit harder.
>

I'm not entirely certain I know the difference, in part because I'm no longer certain I know what a vector processor is. On the old Crays, a "vector processor" was really just a pipeline processor. On Itanium, at least, you can explicitly define a software pipeline, and I'm not certain what to call the resulting process; is it a vector process, a streaming process, or neither? SIMD is more like what a lot of people who didn't really understand the old Crays thought they were doing. You can stream SIMD instructions, too. Is that a streaming process built on top of a vector architecture?

From my point of view, the real issue got exposed in my exchange with Bill Todd: single CPUs already understand the wisdom of the Merrimac paper and can already do tricks that look an awful lot like streaming or vector processing or both.
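To make the vector-versus-stream question a little more concrete, here is a rough sketch in Python. It is only an analogy, not how any real vector or stream instruction set works, and the function names and the three arithmetic stages are made up for illustration. The "vector" version runs each stage over the whole array and stores every intermediate result; the "stream" version passes each element through all the stages before the next element is read, so no intermediate array is ever stored.

```python
def vector_style(xs):
    """Hypothetical 'vector' formulation: one whole-array pass per stage."""
    scaled = [2.0 * x for x in xs]        # stage 1 result stored in full
    shifted = [s + 1.0 for s in scaled]   # stage 2 result stored in full
    return [t * t for t in shifted]       # stage 3


def stream_style(xs):
    """Hypothetical 'stream' formulation: chained lazy stages."""
    scaled = (2.0 * x for x in xs)        # nothing is computed yet
    shifted = (s + 1.0 for s in scaled)   # each stage pulls from the previous one
    squared = (t * t for t in shifted)
    # Draining the last stage pulls one element at a time through all stages.
    return list(squared)


data = [1.0, 2.0, 3.0]
print(vector_style(data))
print(stream_style(data))
```

Both functions return the same values; the only difference is where the intermediate data lives, which is the point at issue in the discussion above.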
It's when you put multiple processors on a single die that the difference becomes apparent. The key step seems to be to have the kind of bypass path connecting CPUs that now apparently connects functional units on a single CPU. On a single CPU, the bypass path allows you to bypass the register file. With multiple CPUs, it would allow you to bypass shared cache. How to make such a path available and how to tell the processors to use it is well beyond my capacity even to hand-wave.

And, of course, if you have a large die, possibly so large that you no longer expect to be able to reach all of it in a single clock, you have to start worrying about the global movement of data among processors. If you can arrange the calculation as a streaming process, then organizing the movement of data is a problem that solves itself: data might eventually migrate across the entire die in many clock ticks, after having gone through multiple neighboring CPUs to do so, possibly without ever having languished in cache, and certainly without ever having required access to global bandwidth or global cache.

RM