Date: Mon, 10 Jun 2013 12:32:05 +1000 (EST)
From: Bruce Evans
To: Steve Kargl
Subject: Re: Implementation for coshl.
In-Reply-To: <20130610003645.GA16444@troutmask.apl.washington.edu>
Message-ID: <20130610110740.V24058@besplex.bde.org>
References: <20130610003645.GA16444@troutmask.apl.washington.edu>
Cc: freebsd-numerics@FreeBSD.org
On Sun, 9 Jun 2013, Steve Kargl wrote:
> I suspect that there will be some nits with the implementation.
Quite a few nats :-).
> Anyway, testing gives
>
> Arch | Interval | #calls | Time (s) | Max ULP | Compiler | Value
> -----------+---------------------+--------+----------+---------+----------+-------
> i386 [1] | [ 0.00: 0.35] | 100M | 15.0198 | 0.58583 | gcc | 1
> i386 [1] | [ 0.35: 24.00] | 100M | 15.1858 | 1.01504 | gcc | 2
> i386 [1] | [ 24.00:11356.52] | 100M | 12.9591 | 0.51198 | gcc | 3
> i386 [1] | [11356.52:11357.22] | 100M | 13.3328 | 1.90988 | gcc | 4
> -----------+---------------------+--------+----------+---------+----------+-------
Quite large errors, unfortunately much the same as in double precision.
> amd64 [2]| [ 0.00: 0.35] | 100M | 11.7811 | 0.63075 | clang | 5
> amd64 [2]| [ 0.35: 24.00] | 100M | 11.0662 | 1.01535 | clang | 6
> amd64 [2]| [ 24.00:11356.52] | 100M | 9.97704 | 0.50852 | clang | 7
> amd64 [2]| [11356.52:11357.22] | 100M | 10.8221 | 1.90931 | clang | 8
> -----------+---------------------+--------+----------+---------+----------+-------
Likely a bug for the ULPs for the first domain to be so different. I suspect
even much smaller differences are due to bugs.
> ...
> 1. The max ulp for the intervals [0.35:24] and [0.35:41] of 1.xxx is
> due to the division in the expression half*exp(x) + half/exp(x).
That's the one with the large MD difference.
> Bruce and I exchanged emails a long time ago about possible ways
> to reduce the ulp in this range by either computing exp(x) with
> extra precision or using a table with cosh(x) = cosh(x_i) * cosh(d)
> + sinh(x_i) * sinh(d) with d = x - x_i. I tried the latter with
> disappointing results.
The latter may be good for trig functions, but it is bad for hyperbolic
functions. It is technically difficult to splice the functions.
> The former would require a refactoring of
> s_expl.c into a kernel __kernel_expl(x, hi, lo). I have no plans on
> pursuing this at this time.
But you need this soon for __ldexp_exp() and __ldexp_cexp(), which are
needed for hyperbolic functions (the large args already fixed for float
and double precision) and for cexp() and trig and hyperbolic complex
functions. It is much easier to implement these using a kernel. I do
this only for float precision.
The hyperbolic functions are also much easier with a kernel. Most of
the thresholds become unnecessary. Above about |x| = 0.5, all cases
are handled uniformly using code like:
	k_expl(x, &hi1, &lo1, &k1);
	k_expl(-x, &hi2, &lo2, &k2);	/* KISS slow initially */
	/*
	 * Bah, that's too uniform. I don't want to deal with k. So
	 * use a threshold for large |x| (case handled by __ldexp_expl()).
	 * Now for 0.5 <= |x| < thresh:
	 */
	k_expl(x, &hi1, &lo1);
	k_expl(-x, &hi2, &lo2);		/* different API includes 2**k */
	_2sumF(hi1, lo1);		/* a bit sloppy */
	return (0.5 * (lo2 + lo1 + hi1 + hi2));
Error analysis: since |x| >= 0.5, the ratio exp(x)/exp(-x) is >= exp(1).
Already if we add exp(x) + exp(-x), the error is at most ~1 ulp
(certainly less than 2). But our hi_lo approximations give 6-10 extra
bits, so the ~1 ulp error is scaled by 2**-6 or better until the final
addition.
Note that this doesn't need the complexities of expm1l(x) or a kernel
for that. Adding exp(-x) to exp(x) without increasing the error
significantly is similar to adding -1 to exp(x) (subtracting exp(-x)
for sinh() is even more similar). The additional complications in
expm1l() are because:
- it wants to give 6-10 bits and not lose any relative to expl()
- it must handle small |x|, where large cancelations occur; here
  |x| >= 0.5 so large cancelations cannot occur.
For |x| <= 0.5, use a poly approx. 0.5 can be reduced significantly
if necessary to get fewer terms in the poly. This is less needed than
for expm1l() since the power series about 0 converges much faster
for coshl().
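A rough double-precision illustration of the small-|x| case follows. It assumes plain Taylor coefficients 1/(2n)! rather than the minimax coefficients a real libm would use, and cosh_poly_sketch() is a hypothetical name. With terms through x**14 the truncation error on |x| <= 0.5 is below 1e-18, which shows how fast the series converges.

```c
#include <assert.h>
#include <math.h>

/*
 * Sketch only: even Taylor polynomial for cosh on |x| <= 0.5,
 * evaluated by Horner's rule in z = x*x.
 */
double
cosh_poly_sketch(double x)
{
	static const double
	    C2 = 1.0 / 2.0,
	    C4 = 1.0 / 24.0,
	    C6 = 1.0 / 720.0,
	    C8 = 1.0 / 40320.0,
	    C10 = 1.0 / 3628800.0,
	    C12 = 1.0 / 479001600.0,
	    C14 = 1.0 / 87178291200.0;
	double z;

	assert(fabs(x) <= 0.5);
	z = x * x;
	return (1 + z * (C2 + z * (C4 + z * (C6 + z * (C8 +
	    z * (C10 + z * (C12 + z * C14)))))));
}
```

Lowering the 0.5 cutoff would let the highest-order terms be dropped.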
Similarly for sinhl(). I don't know of a similar method for tanhl().
A division seems to be necessary, and hi+lo decompositions only work
well for additions.
> /*
> * ====================================================
> * Copyright (C) 1993 by Sun Microsystems, Inc. All rights reserved.
> *
> * Developed at SunSoft, a Sun Microsystems, Inc. business.
> * Permission to use, copy, modify, and distribute this
> * software is freely granted, provided that this notice
> * is preserved.
> * ====================================================
> *
> * Converted to long double by Steven G. Kargl
> */
It changes the style completely, so diffs with the double version are
unreadable.
> #if LDBL_MANT_DIG == 64
> ...
> #else
> #error "Unsupported long double format"
> #endif
The complications for ld80/128 and i386 seem reasonable.
> long double
> coshl(long double x)
> {
> long double t, w;
> uint16_t hx, ix;
>
> ENTERI();
>
> GET_LDBL_EXPSIGN(hx, x);
> ix = hx & 0x7fff;
> SET_LDBL_EXPSIGN(x, ix);
This sign frobbing is pessimal, and is not done by the double version.
Signs are better cleared using fabs*() unless you are sure that clearing
them in bits is more optimal. But don't optimize before getting it right.
Any optimizations should be made to the double version first.
> /* x is +-Inf or NaN. */
> if (ix == BIAS + LDBL_MAX_EXP)
> RETURNI(x * x);
The sign frobbing also clobbers the result here.
>
> if (x < log2o2) {
The threshold comparisons are painful and probably inefficient to do in
bits.
However, you get most of the pain of using bits by using long doubles
and LD80C() to declare precise thresholds. Most or all of the thresholds
are fuzzy and don't need more than float precision (or maybe just the
exponent). The double version uses a fuzzy threshold here. It only
tests the upper 21 bits of the mantissa, so it uses less than float
precision.
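For illustration, a fuzzy threshold test of this kind can be sketched in portable C for double precision. The assumptions are an IEEE 754 binary64 double and the constant 0x3fd62e43, which is the high word of log(2)/2 rounded up; abs_high_word() and below_half_ln2() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* High 32 bits of |x|: 11 exponent bits and the top 20 mantissa bits. */
static uint32_t
abs_high_word(double x)
{
	uint64_t bits;

	assert(sizeof(x) == sizeof(bits));
	memcpy(&bits, &x, sizeof(bits));
	return ((uint32_t)(bits >> 32) & 0x7fffffff);
}

/* Fuzzy test for |x| < about log(2)/2, using only the top bits. */
int
below_half_ln2(double x)
{
	return (abs_high_word(x) < 0x3fd62e43);
}
```

Masking with 0x7fffffff drops the sign, so no separate fabs() is needed for the comparison.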
> if (ix < BIAS + EXP_TINY) { /* |x| < 0x1pEXP_TINY */
> /* cosh(x) = 1 exactly iff x = +-0. */
> if ((int)x == 0)
> RETURNI(1.0L);
> }
Unnecessary algorithm change and micro-optimization. The double version
uses the general case doing 1+t to set inexact here.
> t = expm1l(x);
> w = 1 + t;
> RETURNI(1 + t * t / (w + w));
> }
> ...
> if (x < o_threshold2) {
> t = expl(half * x);
> RETURNI(half * t * t);
> }
This is missing use of __ldexp_expl().
Going back to the painful threshold declarations:
> #if LDBL_MANT_DIG == 64
> static const union IEEEl2bits
> #define EXP_TINY -32
Strange placement of macro in the middle of a declaration.
This should be simply LDBL_MANT_DIG / 2.
> #define s_threshold 24
I don't understand the magic for this now. It is not quite LDBL_MANT_DIG / 3.
> /* log(2) / 2 */
> log2o2u = LD80C(0xb17217f7d1cf79ac, -2, 0.346573590279972654714L),
> #define log2o2 (log2o2u.e)
> /* x = log(LDBL_MAX - 0.5) */
> o_threshold1u = LD80C(0xb17217f7d1cf79ac, 13, 11356.5234062941439497L),
> #define o_threshold1 (o_threshold1u.e)
> /* log(LDBL_MAX - 0.5) + log(2) */
> o_threshold2u = LD80C(0xb174ddc031aec0ea, 13, 11357.2165534747038951L);
> #define o_threshold2 (o_threshold2u.e)
> #elif LDBL_MANT_DIG == 113
> #define EXP_TINY -56
> #define s_threshold 41
> static long double
> log2o2 = 0.346573590279972654708616060729088288L,
> o_threshold1 = 11356.5234062941439494919310779707650L,
> o_threshold2 = 11357.2165534747038948013483100922230L;
> #else
> #error "Unsupported long double format"
> #endif
No need for any long doubles or LD80C()'s. The double version uses only
21 bits for all thresholds. We had to be more careful about the thresholds
for technical reasons in expl(). IIRC, we needed precise thresholds to
do a final filtering step after a fuzzy initial classification. This
isn't needed here, since we use calculations that can't give spurious
overflow.
Going back a bit more:
- 'huge' is missing a volatile qualifier to work around clang bugs.
expl() is already out of date relative to exp2l() since it is missing
the recent fix to add this qualifier. This fix is missing in the
double and float versions of all the hyperbolic functions too.
- 'half' doesn't need to be long double
- the double version has a variable 'one'. You changed that to '1', but
didn't change 'half' to 0.5.
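The volatile qualifier mentioned in the first item can be sketched as follows in double precision. The point is that huge * huge must be evaluated at run time so that the overflow actually happens there instead of being folded to a constant. cosh_overflow_sketch() and the 711.0 cutoff are assumptions for illustration only.

```c
#include <assert.h>
#include <math.h>

/* volatile forces a load, so the product cannot be folded at compile time. */
static const volatile double huge = 1.0e300;

/* Result for arguments already classified as overflowing. */
double
cosh_overflow_sketch(double x)
{
	assert(fabs(x) > 711.0);	/* past the double overflow threshold */
	return (huge * huge);		/* overflows to +Inf at run time */
}
```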
Bruce