
Faster pcurves reductions for P-256 and P-384 #4147

Merged: randombit merged 2 commits into master from jack/faster-nist-redc on Jul 10, 2024

Conversation

@randombit (Owner)

No description provided.

@coveralls

Coverage Status

coverage: 91.743% (+0.005%) from 91.738%
when pulling dd0b2b1 on jack/faster-nist-redc
into d24c2c3 on master.

@reneme (Collaborator) left a comment

Is that some standard approach that warrants a reference, perhaps?

With some templates the code could be quite a bit more compact, and IMHO also easier to understand; i.e. self-contained functions vs. classes; arrays vs. long lists of variables. See here: 18dc508

And here's some Godbolt to play with this implementation: https://godbolt.org/z/4WGGzzYn8

sum.accum(S7);
const auto S = sum.final_carry(S8);

CT::unpoison(S);
@reneme (Collaborator)

This seems to be a stray. Didn't see any corresponding CT::poison(). Same for secp384r1. Or am I missing something?

@randombit (Owner, Author)

The input is (when running under valgrind) already poisoned. As a result S is also poisoned, being itself derived from a value that valgrind is treating as undefined. This makes the assert trigger an error, since it branches directly on the value. However, the code here is quite resilient should the offset be wrong, so I replaced this with a debug assert.
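A minimal sketch of the mechanism being discussed, using the raw valgrind memcheck client requests that Botan's CT::poison/CT::unpoison wrap when built with valgrind support (the free functions below are illustrative only, not Botan's actual API):

```cpp
#include <valgrind/memcheck.h>

#include <cstddef>
#include <cstdint>

// Mark a secret buffer as "undefined"; memcheck will then report any branch
// or memory index that depends on it, which is how non-constant-time code
// paths are detected when the tests run under valgrind.
inline void poison_bytes(const std::uint8_t* secret, std::size_t len) {
   (void)VALGRIND_MAKE_MEM_UNDEFINED(secret, len);
}

// Once a derived value is genuinely safe to branch on (e.g. a final carry
// that reveals nothing secret), mark it defined again so that asserts or
// conditionals on it no longer trigger spurious reports.
inline void unpoison_bytes(const std::uint8_t* value, std::size_t len) {
   (void)VALGRIND_MAKE_MEM_DEFINED(value, len);
}
```

When the binary is not running under valgrind these client requests are effectively no-ops, so the annotations cost essentially nothing in normal use.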

Comment on lines +24 to +25
static constexpr auto P256_4 =
hex_to_words<uint32_t>("0x3fffffffc00000004000000000000000000000003fffffffffffffffffffffffc");
@reneme (Collaborator) commented on Jun 24, 2024

If you decide to adopt my suggestion, it would be worthwhile to replace this by a constexpr multiplication (bigmul(4, Params::P)) to save a magic string/number, if we have that somewhere already.
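To illustrate the suggestion, here is a minimal constexpr sketch of a multiply-by-small-constant over little-endian 32-bit words; small_mul and P256_WORDS are hypothetical names, not existing Botan helpers:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: multiply a little-endian array of 32-bit words by a
// small constant at compile time, producing one extra word for the carry out.
// Something like this would let P256_4 be derived from the prime itself
// rather than from a magic hex string.
template <std::size_t N>
constexpr std::array<std::uint32_t, N + 1> small_mul(const std::array<std::uint32_t, N>& p,
                                                     std::uint32_t k) {
   std::array<std::uint32_t, N + 1> r{};
   std::uint64_t carry = 0;
   for(std::size_t i = 0; i != N; ++i) {
      const std::uint64_t z = static_cast<std::uint64_t>(p[i]) * k + carry;
      r[i] = static_cast<std::uint32_t>(z);  // low 32 bits of this word
      carry = z >> 32;                       // propagate the high bits
   }
   r[N] = static_cast<std::uint32_t>(carry);
   return r;
}

// e.g. static constexpr auto P256_4 = small_mul(P256_WORDS, 4);  // P256_WORDS hypothetical
```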

For 32-bit x86, this reduction results in point arithmetic operations that are 25-35% faster than when using Montgomery. Sadly, for 64-bit x86 it is at best about even with Montgomery, and for Clang on 64-bit it is even somewhat slower.

This is about 20-30% faster on both 32- and 64-bit systems.
@randombit (Owner, Author)

> Is that some standard approach that warrants a reference, perhaps?

Yes, this is the standard Solinas reduction (SP 800-186), just done in columns instead of forming each integer directly. I added some references.
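A brief note on why the column-wise formulation works (background only, not the exact code in this PR): since p256 = 2^256 - 2^224 + 2^192 + 2^96 - 1, we have

2^256 ≡ 2^224 - 2^192 - 2^96 + 1 (mod p256)

so each 32-bit word in the upper half of a 512-bit product folds into a small signed combination of the eight low output columns. Accumulating those signed per-column contributions and resolving the carries at the end is the "in columns" variant, as opposed to materializing each Solinas term as a full integer and then adding or subtracting whole integers.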

> With some templates the code could be quite a bit more compact, and IMHO also easier to understand; i.e. self-contained functions vs. classes; arrays vs. long lists of variables. See here: 18dc508

I'll admit this is cleaner but it's also slower than the original Montgomery code, at least on my machine. :)

@coveralls

Coverage Status

coverage: 91.745% (+0.007%) from 91.738%
when pulling fa71e70 on jack/faster-nist-redc
into d24c2c3 on master.

@reneme (Collaborator) commented on Jun 25, 2024

> ... but it's also slower than the original Montgomery code, at least on my machine. :)

Interesting! On my machine (M2 MacBook Air, Xcode 15.3) there's no difference compared to your implementation.

Setup

./configure.py \
   --build-tool=ninja \
   --build-targets=static,cli \
   --compiler-cache=ccache \
   --enable-experimental-features \
   --disable-modules=pcurves_secp256r1,pcurves_secp521r1,pcurves_brainpool256r1,pcurves_brainpool384r1,pcurves_brainpool512r1
ninja cli
./botan speed --msec=1000 pcurves

Results

| Exercise | René | Jack | master |
| --- | --- | --- | --- |
| secp384r1 base mul | 5950/sec | 5997/sec | 4828/sec |
| secp384r1 var mul | 1517/sec | 1514/sec | 1227/sec |
| secp384r1 mul2 setup | 2034/sec | 2029/sec | 1594/sec |
| secp384r1 mul2 | 2217/sec | 2250/sec | 1815/sec |
| secp384r1 proj->affine | 26603/sec | 25764/sec | 20005/sec |
| secp384r1 scalar invert | 19650/sec | 19973/sec | 20544/sec |

@randombit (Owner, Author)

GCC 14.1.1, Linux, i5-2520M

| Exercise | René | Jack | master |
| --- | --- | --- | --- |
| secp384r1 base mul | 3350/sec | 5149/sec | 4159/sec |
| secp384r1 var mul | 879/sec | 1366/sec | 1117/sec |
| secp384r1 mul2 setup | 1136/sec | 1880/sec | 1441/sec |
| secp384r1 mul2 | 1307/sec | 2053/sec | 1636/sec |
| secp384r1 proj->affine | 15319/sec | 26978/sec | 17389/sec |
| secp384r1 scalar invert | 21183/sec | 21183/sec | 21078/sec |

@reneme (Collaborator) commented on Jun 25, 2024

It's what it is, then. The world doesn't run on Clang alone. :(

@randombit (Owner, Author)

I just checked with LLVM Clang on my machine and there I see similar results between the two approaches. Kind of depressing that GCC regresses so badly.

@reneme (Collaborator) commented on Jun 25, 2024

> I just checked with LLVM Clang on my machine and there I see similar results between the two approaches. Kind of depressing that GCC regresses so badly.

Indeed. I think I should have another look at the performance of #4024 w/ gcc. 😨

@reneme (Collaborator) commented on Jun 25, 2024

> I think I should have another look at the performance of #4024 w/ gcc

FTR: Seems to be fine.

randombit mentioned this pull request on Jul 9, 2024
randombit merged commit 7fad1d2 into master on Jul 10, 2024
42 checks passed
randombit deleted the jack/faster-nist-redc branch on July 10, 2024 at 07:40