Stream: compiler development

Topic: dec trig accuracy


view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 16:52):

So a normal double has about 16 decimal places of accuracy when ignoring the exponent. glibc does calculations in a way such that it has approximately 16.5 decimal places of accuracy. Thus, it should always be as correct as possible when using the nearest rounding mode.

Dec has 18 decimal places. Do we want the same accuracy as floats, or should we attempt to implement trig functions such that they are accurate to all 18 decimal places (probably at the cost of some speed, but I am not sure how much at the moment)?

view this post on Zulip Richard Feldman (Sep 15 2023 at 18:31):

whoa, fascinating!

view this post on Zulip Richard Feldman (Sep 15 2023 at 18:31):

my intuition is to go for speed, because it's going to be an approximation anyway

view this post on Zulip Richard Feldman (Sep 15 2023 at 18:32):

like it's going to lose some precision, so might as well make use of floating-point ops to make it go faster

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 18:37):

Oh, I wasn't thinking of using floating point ops. I was still thinking of implementing it natively with dec.

That said, if we are fine with inaccuracies, it may be fine to do dec -> f64, run the trig function, and finally f64 -> dec.
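That round trip can be sketched roughly like this (assuming Dec is an i128 fixed-point value scaled by 10^18, which "18 decimal places" and "128bit" suggest; the names here are illustrative, not the actual implementation):

```rust
// Sketch: run trig on Dec by round-tripping through f64.
// Assumes Dec is an i128 fixed-point value scaled by 10^18.
const DEC_ONE: i128 = 1_000_000_000_000_000_000; // 10^18

fn dec_to_f64(d: i128) -> f64 {
    d as f64 / DEC_ONE as f64
}

fn f64_to_dec(x: f64) -> i128 {
    (x * DEC_ONE as f64) as i128
}

fn dec_sin(d: i128) -> i128 {
    // Accuracy is capped at f64's ~16 digits, not Dec's 18.
    f64_to_dec(dec_to_f64(d).sin())
}
```

The tradeoff is exactly the one discussed above: the conversion is cheap, but the result can never be more accurate than the f64 trig function in the middle.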

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 18:38):

That may actually be faster due to multiplication being faster with floats than with dec.

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 18:38):

Though, we may be able to do some tricks here given that trig functions only have to handle values from 0 to 2pi.

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 18:40):

And 2pi is guaranteed to fit into the lower half of a dec. So we may be able to use a fast path where all the math is 64-bit instead of 128-bit. That could actually be a huge speed benefit... hmmm.

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 18:42):

Caveat: it would still involve 64-bit values promoting to 128-bit, and maybe some 128-by-128-bit operations, but no need for 256-bit operations.

Sin calculations for f64 actually use 2 f64s as well to get more precision.
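That "2 f64s" trick (double-double arithmetic) hinges on error-free transforms like Knuth's TwoSum, which recovers the bits that a single f64 addition rounds away. A minimal sketch:

```rust
// Knuth's TwoSum: returns (s, e) with s = fl(a + b) and a + b = s + e
// exactly, so e carries the low bits that the rounded sum dropped.
fn two_sum(a: f64, b: f64) -> (f64, f64) {
    let s = a + b;
    let b_virtual = s - a;
    let a_virtual = s - b_virtual;
    let e = (a - a_virtual) + (b - b_virtual);
    (s, e)
}
```

Chaining pairs like `(s, e)` through a computation is what lets an f64-based sin keep more than 16 digits internally.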

view this post on Zulip Richard Feldman (Sep 15 2023 at 18:44):

gotcha - yeah I don't have strong feelings about either way :big_smile:

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 18:44):

If my theory is correct, and we can stay in the land of 2x u64, there is a chance that dec based trig functions would actually be faster than the floating point ones......Totally just wild theory at this point though.

view this post on Zulip Richard Feldman (Sep 15 2023 at 18:44):

mainly I think it's reasonable to choose either approach!

view this post on Zulip Richard Feldman (Sep 15 2023 at 18:44):

that would be really sweet if true! :smiley:

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 21:04):

Did some quick benchmarking of u64 math. u64s are a lot faster than floats at addition and subtraction, but they are about the same speed for multiplication and actually slower for division.

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 21:05):

So even if we implemented all of this with u64 math, I would expect it to be at best the same speed as the float version, because the runtime core of these algorithms is multiplication.

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 21:06):

So I think I may opt for converting to floats just to keep this all simple for now.

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 21:06):

Long term, it is an interesting area to investigate (especially if we later decide we want all 18 decimal places of accuracy)

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 21:07):

So these functions will really just be a modulus by 2 pi, followed by a call to the float versions of the functions. I may need to double check that the modulus is accurate enough and that I don't need fancier estimation, but otherwise that is the rough plan.
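A minimal version of that plan might look like the following (same illustrative i128-scaled-by-10^18 Dec assumption as before; the naive `rem_euclid` here is exactly the step that may need the fancier high-precision treatment mentioned below):

```rust
// Sketch: reduce mod 2π in Dec space, then defer to the float sin.
const DEC_ONE: i128 = 1_000_000_000_000_000_000; // 10^18
// 2π rounded to 18 decimal places.
const TWO_PI_DEC: i128 = 6_283_185_307_179_586_477;

fn dec_sin(d: i128) -> i128 {
    // rem_euclid keeps the result in [0, 2π) for negative inputs too.
    let reduced = d.rem_euclid(TWO_PI_DEC);
    let x = reduced as f64 / DEC_ONE as f64;
    (x.sin() * DEC_ONE as f64) as i128
}
```

The weakness of this sketch is that `TWO_PI_DEC` is itself only an 18-digit approximation of 2π, so the reduction error grows with the magnitude of the input.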

view this post on Zulip Richard Feldman (Sep 15 2023 at 21:38):

sounds great!

view this post on Zulip Brendan Hansknecht (Sep 15 2023 at 22:28):

Yeah, definitely will need to do some sort of high precision modulo for this to be accurate. The max Dec modulo 2*pi, computed as a Dec, is quite a bit off. (To be clear, it initially looked like totally the wrong number, but that was a bug; it is still very off, just not totally wrong.)

view this post on Zulip Brendan Hansknecht (Sep 16 2023 at 03:36):

Took a bit of fiddling, but figured out how to do high precision modulus on dec: https://godbolt.org/z/jx51eKbr7

view this post on Zulip Brendan Hansknecht (Sep 16 2023 at 03:41):

This uses a constant that effectively has 192 bits of precision instead of 128 for doing the modulus, i.e., an extra 64 bits below Dec's precision.
So instead of modulo 6.283185307179586477, it is modulo roughly 6.2831853071795864769252867665590057684
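The idea scales down nicely for illustration: here is the same trick at half width, reducing a 64-bit "mini Dec" with 9 decimal places modulo a 2π constant carrying 18 decimal places, so the constant has extra digits below the value's own precision. (Widths and names are illustrative, not the actual godbolt code.)

```rust
// Sketch at half width: a u64 fixed-point value with 9 decimal places,
// reduced modulo 2π using a constant that carries 18 decimal places.
// The wide remainder is taken in u128, then truncated back down.
const SCALE: u128 = 1_000_000_000;                    // 10^9
const TWO_PI_WIDE: u128 = 6_283_185_307_179_586_477; // 2π at 10^18

fn mod_two_pi(x: u64) -> u64 {
    let wide = x as u128 * SCALE;   // promote to the 10^18 scale
    let r = wide % TWO_PI_WIDE;     // exact remainder against the wide 2π
    (r / SCALE) as u64              // back to the 10^9 scale
}
```

Because the remainder is computed against the wide constant before truncating, the error from the constant's rounding stays below the value's own last digit.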

view this post on Zulip Brendan Hansknecht (Sep 16 2023 at 17:27):

Ok. Now I think I have a full implementation. This uses the fancy division-by-constant trick so that it only uses multiplication. It is high accuracy and should be reasonably fast. Also got it working with both negative and positive numbers.

https://godbolt.org/z/aebeTTfaj
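The division-by-constant trick replaces the divide with a multiply by a precomputed reciprocal and a shift. A small standalone illustration (dividing by 7, not the actual 2π constant from the godbolt link):

```rust
// Sketch of division by a constant via multiplication: q = x / 7 computed
// as a 128-bit multiply by ceil(2^64 / 7) followed by a 64-bit shift.
// With this magic number the result is exact whenever x * 5 < 2^64,
// where 5 = M*7 - 2^64 is the rounding error of the reciprocal.
fn div7(x: u64) -> u64 {
    const M: u128 = ((1u128 << 64) + 6) / 7; // ceil(2^64 / 7)
    ((x as u128 * M) >> 64) as u64
}
```

This matters here because, per the earlier benchmarks, integer division is the one operation where u64 math is much slower than floats, while multiplication is competitive.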

view this post on Zulip Brendan Hansknecht (Sep 16 2023 at 18:01):

Not terrible:

Dec/F64:
sin:            3.00
asin:           8.94

view this post on Zulip Richard Feldman (Sep 16 2023 at 18:07):

wow, amazing! I expected it to be a lot slower!

view this post on Zulip Brendan Hansknecht (Sep 16 2023 at 18:08):

I mean, this is with it converting to f64 and then running the float impl. It just has to do some setup first to ensure the modulus is accurate.

Also, I guess asin is a really fast function, since it doesn't have to do the modulus; it is just the cost of converting to float and back. Yet it has way more overhead than sin, which is also doing the modulus.

view this post on Zulip Brendan Hansknecht (Sep 17 2023 at 05:47):

So one point of tradeoff.

High accuracy Dec sin is 3x slower than f64. Low accuracy Dec sin is only 2.3x slower than f64. But low accuracy Dec sin is basically garbage at the far end of the Dec range. That said, sin for floats on these giant numbers is also basically garbage in the same range. So we have enough information to be much more accurate than floats at the large ranges, but it costs some runtime perf. Given that we have the information, I am leaning towards keeping Dec more accurate at the cost of some perf, but was wondering what the general opinion is.

view this post on Zulip Luke Boswell (Sep 17 2023 at 05:50):

I guess the general design is for the default to be accurate at the expense of perf, and users looking for perf can always choose F64, etc.?

view this post on Zulip Brendan Hansknecht (Sep 17 2023 at 05:59):

Also, I think I have all of dec trig working now. To catch up with floats, dec is missing:

view this post on Zulip Richard Feldman (Sep 17 2023 at 10:15):

yeah I agree with keeping Dec accurate, especially when the performance factor is that close

view this post on Zulip Richard Feldman (Sep 17 2023 at 10:16):

maybe if there's a massive perf gap but a relatively small accuracy gap I could see an argument for lowering accuracy though :big_smile:

view this post on Zulip Brendan Hansknecht (Sep 17 2023 at 15:00):

SG

view this post on Zulip Brendan Hansknecht (Sep 18 2023 at 04:30):

Hmm. So my perf numbers may not be correct. Compilers are too smart and are optimizing things in unexpected ways.

view this post on Zulip Brendan Hansknecht (Sep 18 2023 at 04:31):

Need to dig into the assembly more to figure out if this is generating what I want, but it isn't clear at the moment. I think the perf results may be different from what I currently reported.

view this post on Zulip Brendan Hansknecht (Sep 18 2023 at 04:44):

These are probably more accurate numbers, but I'll actually do some assembly diving on x86 to double check tomorrow:

Dec/F64:
addition:       5.37
subtraction:    3.37
multiplication: 14.25
division:       45.52
sin:            3.72
cos:            3.75
tan:            2.47
asin:           9.10
acos:           5.55
atan:           1.67

view this post on Zulip Richard Feldman (Sep 18 2023 at 15:33):

interesting!

view this post on Zulip Brendan Hansknecht (Sep 18 2023 at 18:00):

Ok. I definitely have the assembly I want now. That said, the branch predictor (especially on m1 for some reason) is annoying.

view this post on Zulip Brendan Hansknecht (Sep 18 2023 at 18:01):

Will have new benchmarks soon that I actually believe are accurate (equivalent to running in a long hot loop)

view this post on Zulip Brendan Hansknecht (Sep 18 2023 at 18:14):

Ok. So these are benchmark numbers that I am confident in and have double checked the assembly of (on m1):

Dec/F64:
addition:       0.55
subtraction:    0.53
multiplication: 15.09
division:       53.74
sin:            3.90
cos:            3.62
tan:            2.34
asin:           1.83
acos:           1.72
atan:           1.73

The caveats are:

  1. When looking at smaller benchmarks, addition/subtraction perf can be greatly hurt. I think the issue is that addition and subtraction require a potential jump to roc_panic (which could cause a branch misprediction). It is also simply more instructions overall. Floats are consistent; Dec is not. So with smaller benchmarks, Dec has a lot of variance here and can be around 2x slower than floats. But in a hot loop with warmup, Dec addition and subtraction are definitely faster.
  2. Currently the trig functions don't have any quick exits. They always convert to float and then run the function on the converted float. If the trig functions hit a quick-exit condition, they are much faster for floats; that is where asin being 9-10x faster comes from. If we ensure that the function is not early exiting, it is only ~2x slower, as shown in the numbers above.
  3. The compiler can reason about floats better. It will do fast-math optimizations on them and does a better job at dead code removal when dealing with floats.
  4. Decs take 2x the memory, which of course will hurt cache locality if stored.
  5. In my benchmark, I tested a chain of dependent operations rather than independent ones. (This is because the optimizer kept optimizing away all the independent float ops unless I added a lot of unwanted data movement.)
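The "chain of dependent operations" pattern from caveat 5 can be sketched like so (`std::hint::black_box` is the standard way to keep the optimizer from deleting the loop; the function and values here are illustrative):

```rust
// Benchmark skeleton: each iteration depends on the previous result, so
// the compiler can neither vectorize the loop nor discard the work, and
// black_box keeps the input and output opaque to the optimizer.
fn bench_dependent_add(iters: u64) -> f64 {
    let mut acc = std::hint::black_box(1.0_f64);
    for _ in 0..iters {
        acc += 1.0e-9; // serial dependency chain
    }
    std::hint::black_box(acc)
}
```

The downside, as noted, is that a dependent chain measures latency rather than throughput, so it can understate how well independent float ops pipeline.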

view this post on Zulip Richard Feldman (Sep 18 2023 at 19:28):

wow, this is really interesting!

view this post on Zulip Brendan Hansknecht (Sep 18 2023 at 20:18):

Edited the above comment to capture a few more caveats.


Last updated: Jul 06 2025 at 12:14 UTC