
www.Usenet.com
On Wed, 03 Dec 2003 00:20:53 GMT, "News sbc" <[EMAIL PROTECTED]> wrote:

>> The Merrimac authors were not referring to bandwidth to memory or even
>> to cache. They were talking about bandwidth within the CPU itself.
>> In a streaming architecture there is no getting and putting; the
>> output of one functional unit feeds right into the input of another.
>
>Unfortunately, this is not true for streaming architectures,
>and it is not true for reconfigurable logic.
>
>Rather, the output of one functional unit feeds into a switching network
>that eventually feeds into the input of the other. Where said switching
>network, if you are going to have more than 4 integer ALUs on a chip[*]
>probably involves at least 1, more likely 2, right hand turns - i.e. where
>the interconnect length is probably 8-16 times longer than that involved
>in a single integer ALU. Which, if you are being aggressive in clocking,
>and which if you are limited by wire speed, could very easily translate
>to 1 clock in the ALU, 4 cycles to any other ALU that is not directly
>stacked on top of the first.
>
>If talking about FMACs rather than integer ALUs, the ratio of compute
>to communicate even in such a "FU to FU" microarchitecture is closer to
>1 to 1. Although my understanding is that some of the bio-informatics
>and genomics codes are integer; although protein folding should be FP,
>if based on real physical modelling.
>

I'm going to play along gamely and expose my ignorance in the hope of
jollying my own understanding along, and perhaps that of a few others.

It seems to me that you are making a distinction without a difference.
If you are streaming, you stream to whatever you can reach in the next
clock. If it's another functional unit, great. If it's an element of
a switching network, so what? From my (naive) POV, the switching
network is just as transparent as the repeaters that are necessary
actually to get data to move any real distance.
The point is, you never put anything anywhere--cache or register--just
to sit and wait. It's always on its way somewhere, and you can
pipeline movement through the switching network just like you can
pipeline movement through functional units.

What you are talking about affects latency: how long it takes to get
the first result to pop out of the end of the stream, but it does not
necessarily affect throughput. Once you get the first result out,
there is no reason whatsoever that a properly designed streaming
architecture cannot get out a new result with every clock, no matter
how many functional units or network switches or repeaters you've had
to go through to get there.

As to communication cost, you can't do any better than the cost of
moving through the hardware you *have* to move through to get from one
end of the pipe to the other, and you don't *have* to sit in register
files or cache.

RM
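
To make the latency-versus-throughput point concrete, here is a toy cycle-by-cycle model in Python. It is only a sketch under stated assumptions, not a description of Merrimac or any real chip: every stage (functional unit, switch hop, or repeater) is assumed to be fully pipelined and to accept one new item per clock, and the function name run_pipeline and the stage latencies are made up for illustration.

```python
from collections import deque


def run_pipeline(inputs, stage_latencies):
    """Simulate an idealized, fully pipelined chain of stages.

    Assumption: each stage accepts one new item every clock, so the whole
    chain behaves like a shift register whose depth is the sum of the
    per-stage latencies. Inputs must not be None (None marks an empty slot).
    Returns (cycle, value) pairs; the first input enters at cycle 0.
    """
    depth = sum(stage_latencies)
    pipe = deque([None] * depth)    # one slot per clock of total latency
    pending = deque(inputs)
    results = []
    cycle = 0
    while pending or any(slot is not None for slot in pipe):
        # Feed one new item per clock (or a bubble once inputs run out).
        pipe.appendleft(pending.popleft() if pending else None)
        out = pipe.pop()            # whatever reaches the end this clock
        if out is not None:
            results.append((cycle, out))
        cycle += 1
    return results


if __name__ == "__main__":
    # Hypothetical chain: 1-clock ALU, 4 clocks of switching network, 1-clock ALU.
    for cycle, value in run_pipeline(range(5), [1, 4, 1]):
        print(f"cycle {cycle}: result {value}")
```

With the hypothetical latencies [1, 4, 1], the first result appears at cycle 6 and the rest follow at cycles 7, 8, 9 and 10. Shrinking the chain to [1, 1] moves the first result up to cycle 2 but still delivers exactly one result per clock after that: the extra switch hops cost latency, not throughput, as long as nothing in the chain stalls.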