
"Vijay" <[EMAIL PROTECTED]> wrote in message
news:[EMAIL PROTECTED]
> Hi,
> This is regarding the branch prediction in ARMv5 integer
> cores (ARM10TDMI and ARM10E). I was wondering if these
Don't call it ARMv5 branch prediction - there are ARMv5
cores (ARM9E, 926) that don't have branch prediction, and
there are v5 cores like XScale that use a very different
scheme, or ARM1026, which uses a related but improved
prediction scheme including a return stack.
> integer cores support more than one branch prediction at a time.
> More specifically, can there be more than one predicted/folded
> branch in the pipeline or prefetch buffer at any time?
Not in the prefetch buffer, as the TRM says. There can be
several branches in the pipeline at any one time, whether
predicted/predictable or not (but this is true for any ARM).
Branches only stay in the 3-entry prefetch buffer for a
short time, so this limitation isn't very important.
> The only thing about this that the TRMs say is "The branch
> prediction logic optimizes one branch at a time". If indeed,
> the prediction logic allows only one predicted branch in the
> pipeline at any instant, there is an increased danger of
> branches going into the pipeline unpredicted which would cause
> flushes, if taken.
In general on non-superscalar CPUs you don't see that many
branches per cycle, so letting a few through into the core is not
a major problem. The ARM10 uses a simple prediction scheme:
it predicts backward conditional branches (i.e. loop branches)
as taken. Simple as it is, this is where most of the benefit
comes from. In general it can only predict one branch in a block
of 2 ARM instructions; if there are 2, the other is dealt with
by the core.
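
As a rough illustration of why backward-taken prediction pays off on loops, here is a minimal Python sketch. It is a toy model, not the ARM10 logic itself: the function names and addresses are made up, and treating forward conditional branches as predicted not taken is an assumption on my part, since only the backward case is described above.

```python
def predict_taken(branch_addr: int, target_addr: int) -> bool:
    """Static guess for a conditional branch.

    Backward branches (target at or before the branch) are guessed taken.
    Forward branches are guessed not taken; that half is an assumption.
    """
    return target_addr <= branch_addr


def count_mispredictions(branch_addr: int, target_addr: int, outcomes) -> int:
    """Count how often the static guess disagrees with the real outcomes."""
    guess = predict_taken(branch_addr, target_addr)
    return sum(1 for taken in outcomes if taken != guess)


# A 10-iteration loop: the closing branch is taken 9 times, then falls through.
outcomes = [True] * 9 + [False]

# Loop branch at 0x8010 jumping back to 0x8000 (hypothetical addresses).
loop_misses = count_mispredictions(0x8010, 0x8000, outcomes)

# The same taken/not-taken pattern on a forward branch, for comparison.
forward_misses = count_mispredictions(0x8000, 0x8010, outcomes)
```

In this model the loop branch is only guessed wrong on the final exit, while the same pattern on a forward branch is guessed wrong on every taken iteration.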
To get fast branching on ARM10 you need to:
1) avoid branching by using conditional execution instead
2) avoid placing many branches/returns/calls close together
3) avoid placing 2 predictable branches in the same 64-bit
fetch slot
4) swap direction of conditional branches so that the most likely
path is either fall-through or branching backwards
5) avoid loops consisting of fewer than 3-4 instructions
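
Point 3 in the list above can be checked mechanically if you have the branch addresses from a disassembly. The Python sketch below groups branch addresses by 64-bit fetch slot and reports any slot holding more than one. It assumes slots are aligned to 8-byte boundaries and that all code is 32-bit ARM instructions; the helper names and addresses are hypothetical.

```python
FETCH_SLOT_BYTES = 8  # 64-bit fetch slot = two 32-bit ARM instructions


def fetch_slot(addr: int) -> int:
    """Index of the fetch slot containing addr (assumes 8-byte alignment)."""
    return addr // FETCH_SLOT_BYTES


def slot_conflicts(branch_addrs):
    """Return tuples of branch addresses that share one fetch slot."""
    by_slot = {}
    for addr in sorted(branch_addrs):
        by_slot.setdefault(fetch_slot(addr), []).append(addr)
    return [tuple(addrs) for addrs in by_slot.values() if len(addrs) > 1]


# Branches at 0x8000 and 0x8004 share a slot; 0x8010 and 0x801C do not.
conflicts = slot_conflicts([0x8000, 0x8004, 0x8010, 0x801C])
```

Any pair reported here is a candidate for moving one of the branches, or padding, so that each predictable branch sits in its own slot.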
> On the other hand, if the branch predictor
> can allow multiple predictions, it needs to keep around state
> information (e.g., address in case of misprediction) for each
> predicted branch, which means extra silicon cost.
Yes, this is why embedded CPUs use simple prediction
schemes that take little area/power. Although prediction
isn't nearly as good as on desktop CPUs, even a simple
scheme makes a huge difference. The ARM11 uses a more
elaborate dynamic branch predictor that improves accuracy
over ARM10, and there is no doubt that future cores will
use even better predictors.
Wilco