Usenet.com

Comp Thread Archive from Usenet.com

Re: Processors in memory... NOT



On Wed, 03 Dec 2003 00:20:53 GMT, "News sbc"
<[EMAIL PROTECTED]> wrote:

<snip>
>
>Most computation involves more than one data item.
>I think this implies that the optimum place to perform
>a computation will be at something resembling the
>centroid of the data items involved - weighted by
>access frequency.  With some allowance for treating
>instructions as data. During the life of a computation
>this "computational centroid" would shift over time.
>
>Parallel computations would be performed at something
>resembling a Viterbi constellation diagram.
>

The ADAM architecture described in Andrew "Bunnie" Huang's MIT Ph.D.
thesis attempts to do something like that.  I've heard him give a talk
on it and read it more than once, but I have a ways to go before I
understand it well enough to give a summary better than what I think
is in the title; viz, you try to keep data and the threads using them
close to each other by migrating both.
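
As a rough illustration of the quoted "computational centroid" idea
(not of ADAM itself, which I have not tried to model), here is a
minimal Python sketch.  It assumes each data item has a 2-D position
on the die and an access count; the function name, the coordinates,
and the weights are all hypothetical.

```python
def computational_centroid(items):
    """Return the access-frequency-weighted centroid of data items.

    items: list of ((x, y), access_count) pairs.  The 2-D positions and
    the counts are illustrative assumptions, not taken from any real
    architecture.
    """
    total = sum(weight for _, weight in items)
    if total == 0:
        raise ValueError("need at least one access to define a centroid")
    # Weighted mean of each coordinate.
    x = sum(pos[0] * weight for pos, weight in items) / total
    y = sum(pos[1] * weight for pos, weight in items) / total
    return (x, y)


# One item at (0, 0) touched once, another at (4, 0) touched three times:
# the centroid sits three quarters of the way toward the busier item.
centroid = computational_centroid([((0.0, 0.0), 1), ((4.0, 0.0), 3)])
```

Recomputing this as the access counts change is one way to picture the
centroid shifting over the life of a computation.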

>---
>
>Re streams:  would you build a stream instruction set, or a
>vector instruction set?
>
>Obviously, vectors can be built on top of streams.
>Vice versa is a bit harder.
>
I'm not entirely certain I know the difference, in part because I'm no
longer certain I know what a vector processor is.  On the old Crays,
a "vector processor" was really just a pipeline processor.  On
Itanium, at least, you can explicitly define a software pipeline, and
I'm not certain what to call the resulting process; is it a vector
process, a streaming process, or neither?

SIMD is more like what a lot of people who didn't really understand the
old Crays thought they were doing.  You can stream SIMD instructions,
too.  Is that a streaming process built on top of a vector
architecture?

From my point of view, the real issue got exposed in my exchange with
Bill Todd: single CPUs already understand the wisdom of the Merrimac
paper and can already do tricks that look an awful lot like streaming
or vector processing or both.  It's when you put multiple processors
on a single die that the difference becomes apparent.  The key step
seems to be to have the kind of bypass path connecting CPUs that now
apparently connects functional units on a single CPU.  On a single
CPU, the bypass path allows you to bypass the register file.  With
multiple CPUs, it would allow you to bypass shared cache.  How to
make such a path available and how to tell the processors to use it is
well beyond my capacity even to hand-wave.

And, of course, if you have a large die, possibly so large that you no
longer expect to be able to reach all of it in a single clock, you
have to start worrying about the global movement of data among
processors.  If you can arrange the calculation as a streaming
process, then organizing the movement of data is a problem that solves
itself: data might eventually migrate across the entire die in many
clock ticks, after having gone through multiple neighboring CPUs to do
so, possibly without ever having languished in cache, and certainly
without ever having required access to global bandwidth or global
cache.
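
To make the streaming picture a little more concrete, here is a toy
Python simulation, offered only as a sketch.  It assumes a linear chain
of CPUs, each with a single forwarding latch to its right-hand
neighbor, and it uses None to mean "latch empty" (so None cannot be a
data value).  The function name, the latch model, and the one-hop-per-
tick timing are all my own assumptions, not anything from the Merrimac
paper.

```python
def run_pipeline(stages, inputs):
    """Push inputs through a chain of neighbouring stages, one hop per tick.

    stages: non-empty list of one-argument functions, one per (hypothetical) CPU.
    Returns (results, ticks).  No shared store is modelled: a value only
    ever sits in the latch between one stage and the next.
    """
    latches = [None] * len(stages)  # latches[i] holds stage i's latest output
    results = []
    source = list(inputs)
    pos = 0
    ticks = 0
    while pos < len(source) or any(v is not None for v in latches):
        # The value leaving the last stage falls off the end of the chain.
        if latches[-1] is not None:
            results.append(latches[-1])
        # Update from the far end backwards so each value moves exactly one hop.
        for i in range(len(stages) - 1, 0, -1):
            prev = latches[i - 1]
            latches[i] = stages[i](prev) if prev is not None else None
        # Feed the first stage from the input stream, if anything is left.
        if pos < len(source):
            latches[0] = stages[0](source[pos])
            pos += 1
        else:
            latches[0] = None
        ticks += 1
    return results, ticks


# Three stages computing (x + 1) * 2 - 3 on a short input stream.
stages = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
results, ticks = run_pipeline(stages, [1, 2, 3])
```

Each value crosses the whole chain over several ticks but only ever
touches the latch of the next neighbor, which is the sense in which the
data movement organizes itself.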

RM


