
"Wilco" <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> "Vijay" <[EMAIL PROTECTED]> wrote in message
> news:[EMAIL PROTECTED]
> > Hi,
> > This is regarding branch prediction in the ARMv5 integer
> > cores (ARM10TDMI and ARM10E). I was wondering if these
>
> Don't call it ARMv5 branch prediction - there are ARMv5
> cores (ARM9E, 926) that don't have branch prediction, and
> there are v5 cores like XScale that use a very different
> scheme, or the ARM1026, which uses a related but improved
> prediction scheme including a return stack.
>
> > integer cores support more than one branch prediction at a time.
> > More specifically, can there be more than one predicted/folded
> > branch in the pipeline or prefetch buffer at any time?
>
> Not in the prefetch buffer, as the TRM says. There can be
> several branches in the pipeline at any one time, whether
> predicted/predictable or not (but this is true for any ARM).
> Branches only stay a short time in the 3-entry
> prefetch buffer, so this limitation isn't very important.
>
> > The only thing about this that the TRMs say is "The branch
> > prediction logic optimizes one branch at a time". If indeed,
> > the prediction logic allows only one predicted branch in the
> > pipeline at any instant, there is an increased danger of
> > branches going into the pipeline unpredicted which would cause
> > flushes, if taken.
>
> In general on non-superscalar CPUs you don't see that many
> branches per cycle, so letting a few through into the core is not
> a major problem. The ARM10 uses a simple prediction scheme:
> it predicts backwards conditional branches as taken (i.e. loop
> branches); however, this is where most of the benefit comes
> from. In general it can only predict one branch in a block
> of 2 ARM instructions; if there are 2, the other is dealt with
> by the core.
>
> To get fast branching on ARM10 you need to:
>
> 1) avoid branching by using conditional execution instead
> 2) avoid placing many branches/returns/calls close together
> 3) avoid placing 2 predictable branches in the same 64-bit
> fetch slot
> 4) swap direction of conditional branches so that the most likely
> path is either fall-through or branching backwards
> 5) avoid loops consisting of fewer than 3-4 instructions
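
To make tip 4 concrete, here is a minimal Python sketch of the static rule described above (a backward conditional branch is predicted taken). Treating forward branches as not taken is my reading of tip 4, not something stated in the TRM, and the function names and addresses are made up for illustration.

```python
def predicted_taken(branch_addr, target_addr):
    """Static rule: a conditional branch to a lower address is predicted taken.

    Assumption: forward branches are treated as not taken (fall-through).
    """
    return target_addr < branch_addr


def mispredictions(branch_addr, target_addr, outcomes):
    """Count outcomes (True = taken) that disagree with the static guess."""
    guess = predicted_taken(branch_addr, target_addr)
    return sum(1 for taken in outcomes if taken != guess)


# Hypothetical addresses: a loop branch jumping back 8 bytes,
# and a forward branch skipping ahead 16 bytes.
loop_outcomes = [True] * 9 + [False]       # taken 9 times, then loop exit
backward = mispredictions(0x9008, 0x9000, loop_outcomes)

# A forward branch that is usually taken is guessed wrong most of the time...
forward_likely = mispredictions(0x9000, 0x9010, [True] * 9 + [False])
# ...but inverting the condition makes the common path the fall-through.
forward_swapped = mispredictions(0x9000, 0x9010, [False] * 9 + [True])
```

Under these assumptions the loop branch is only guessed wrong on exit, and swapping the sense of the forward branch moves its cost from the common case to the rare one.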

Surely, a simple static branch predictor can make a big difference
for a non-superscalar CPU. But the fact that performance can suffer
in small loops (3-4 instructions) is a bit bothersome, if not
annoying, especially when an application can have a number of them.
For example, consider this snippet from the start-up code:

00008088 <_zero_loop>:
    8088: e2555004  subs  r5, r5, #4    ; 0x4
    808c: 24847004  strcs r7, [r4], #4
    8090: 8afffffc  bhi   8088 <_zero_loop>
    8094: eafffff2  b     8064 <_zero_region>

Assuming that the branch predictor allows only one outstanding
prediction in the pipeline at any time, a flush will result on every
second iteration of <_zero_loop> when the bhi at address 0x8090 is
executed. This penalty eats into the performance that could be gained
if multiple predictions were allowed in the pipeline.
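
To show where the "every second iteration" figure comes from, here is a toy Python model of that assumption. Everything in it is hypothetical: one instruction fetched per cycle, a single outstanding prediction that stays busy for `resolve_latency` cycles, and no account taken of the cycles lost to the flush itself. The numbers are not ARM10 TRM figures.

```python
def count_flushes(iterations, loop_len=3, resolve_latency=4):
    """Count flushes caused by the loop-closing backward branch.

    Hypothetical model (not taken from the TRM):
      - the loop branch is fetched once every `loop_len` cycles;
      - only one prediction may be outstanding at a time;
      - a prediction stays outstanding for `resolve_latency` cycles;
      - a branch fetched while the predictor is busy goes in unpredicted
        and causes a flush if it turns out to be taken.
    """
    flushes = 0
    busy_until = -1  # cycle at which the outstanding prediction resolves
    for i in range(iterations):
        fetch_cycle = i * loop_len
        taken = i < iterations - 1  # the last iteration falls through
        if fetch_cycle >= busy_until:
            # Predictor is free: backward branch is predicted taken.
            busy_until = fetch_cycle + resolve_latency
            if not taken:
                flushes += 1  # mispredicted loop exit
        elif taken:
            flushes += 1  # unpredicted and taken
    return flushes


one_at_a_time = count_flushes(8)                    # 3-instruction loop
never_busy = count_flushes(8, resolve_latency=3)    # prediction resolves in time
```

With the default numbers the branch is predicted on even iterations and slips through unpredicted on odd ones; if the prediction resolved within one trip around the loop, only the loop exit would flush.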

I know the ARM11 has a deeper pipeline and both static and dynamic
branch prediction, but does it allow multiple predictions in the pipeline?

Vijay
--
>
> > On the other hand, if the branch predictor
> > can allow multiple predictions, it needs to keep around state
> > information (e.g., address in case of misprediction) for each
> > predicted branch, which means extra silicon cost.
>
> Yes, this is why embedded CPUs use simple prediction
> schemes that take little area/power. Although prediction
> isn't nearly as good as on desktop CPUs, even a simple
> scheme makes a huge difference. The ARM11 uses a more
> elaborate dynamic branch predictor that improves accuracy
> over ARM10, and there is no doubt that future cores will
> use even better predictors.
>
> Wilco