Usenet.com


Comp Thread Archive from Usenet.com


Re: Spatially distributed computation



On Wed, 03 Dec 2003 00:20:53 GMT, "News sbc"
<[EMAIL PROTECTED]> wrote:

>> The Merrimac authors were not referring to bandwidth to memory or even
>> to cache.  They were talking about bandwidth within the CPU itself.
>> In a streaming architecture there is no getting and putting; the
>> output of one functional unit feeds right into the input of another.
>
>Unfortunately, this is not true for streaming architectures,
>and it is not true for reconfigurable logic.
>
>Rather, the output of one functional unit feeds into a switching network
>that eventually feeds into the input of the other.  Where said switching
>network, if you are going to have more than 4 integer ALUs on a chip[*]
>probably involves at least 1, more likely 2, right hand turns - i.e. where
>the interconnect length is probably 8-16 times longer than that involved
>in a single integer ALU.  Which, if you are being aggressive in clocking,
>and which if you are limited by wire speed, could very easily translate
>to 1 clock in the ALU, 4 cycles to any other ALU that is not directly
>stacked on top of the first.
>
>If talking about FMACs rather than integer ALUs, the ratio of compute
>to communicate even in such a "FU to FU" microarchitecture is closer to
>1 to 1.  Although my understanding is that some of the bio-informatics
>and genomics codes are integer; although protein folding should be FP,
>if based on real physical modelling.
>
I'm going to play along gamely and expose my ignorance in the hope of
jollying my own understanding along, and perhaps that of a few others.
It seems to me that you are making a distinction without a difference.
If you are streaming, you stream to whatever you can reach in the next
clock.  If it's another functional unit, great.  If it's an element of
a switching network, so what?  From my (naive) POV, the switching
network is just as transparent as the repeaters that are needed
to move data any real distance.

The point is, you never put anything anywhere--cache or register--just
to sit and wait.  It's always on its way somewhere, and you can
pipeline movement through the switching network just like you can
pipeline movement through functional units.  What you are talking
about affects latency (how long it takes for the first result to
pop out of the end of the stream), but it does not necessarily affect
throughput.  Once the first result is out, there is no reason
whatsoever that a properly designed streaming architecture cannot
deliver a new result with every clock, no matter how many functional
units or network switches or repeaters the data has had to go through
to get there.
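
To make the latency-versus-throughput point concrete, here is a toy
cycle-by-cycle sketch. It is only an illustration under my own
assumptions: every stage (functional unit, switch hop, or repeater)
holds one item for exactly one clock, nothing ever stalls, and the
name simulate_pipeline and the stage counts are made up for the
example. It is not a model of Merrimac or of any real chip.

```python
def simulate_pipeline(num_items, depth):
    """Push num_items through a linear pipeline of `depth` one-clock stages.

    A stage may be an ALU, a switch hop, or a repeater (an assumption of
    this toy model). Returns the clock cycle at which each item leaves
    the final stage.
    """
    stages = [None] * depth   # stages[0] is the entry, stages[-1] the exit
    exit_cycles = []
    next_item = 0
    cycle = 0
    while len(exit_cycles) < num_items:
        cycle += 1
        # Inject one new item per clock while any remain, else a bubble.
        incoming = next_item if next_item < num_items else None
        if incoming is not None:
            next_item += 1
        # Everything advances exactly one stage per clock.
        stages = [incoming] + stages[:-1]
        # Whatever sits in the final stage completes on this clock.
        if stages[-1] is not None:
            exit_cycles.append(cycle)
    return exit_cycles


# Two ALUs stacked directly on top of each other: 2 stages.
adjacent = simulate_pipeline(4, 2)
# Same two ALUs with 4 pipelined network cycles between them: 6 stages.
routed = simulate_pipeline(4, 6)
print(adjacent)  # [2, 3, 4, 5]
print(routed)    # [6, 7, 8, 9]
```

The first result shows up four clocks later in the routed case, but in
both cases the results then arrive one per clock. That holds only as
long as each network hop is itself pipelined and there is no
contention in the switch, which is exactly the assumption in question.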

As to communication cost, you can't do any better than the cost of
moving through the hardware you *have* to move through to get from one
end of the pipe to the other, and you don't *have* to sit in register
files or cache.

RM



