Floating in C++
One part of C++ that I almost never use is the whole floating point environment.
During my studies at University I’ve done some numerical approximation in C, but mostly in MATLAB (Octave at home), which offers a lot of tools ready to use.
Some professors did prefer to write the code in C, and for some machines, we did not have the appropriate license for running MATLAB. Therefore for some courses, we simply could not use a more advanced environment. After 5 years, I’ve taken a look again at those programs. I am ashamed to say that the code is completely unreadable, most variable and functions have meaningless names. If you are lucky enough to understand what it was supposed to do, you would notice that it is impossible to reuse it, unless you consider copy and paste as a form of reuse. Those points are actually also true for my MATLAB functions, but at least the code is much shorter since I did not need to reimplement most functionality myself. Probably it just would be better to rewrite most algorithms from scratch since the original code is not very mature, it leaks memory, it overwrites it where it shouldn’t, and surely invokes UB in a lot of other places.
Apparently, most Math professors aren’t very good at programming or at least, at explaining to their student how to write clean code.
Since I never used C++ for mathematical computations, I decided to take some time to reimplement some algorithms and see what the standard library already offers us.
Of course, there are a lot of C++ libraries that can be used, like Armadillo, Blitz++ or boost uBLAS. There are also a lot of C libraries that can be used directly in C++ and a more comprehensive list of numerical libraries can be found on Wikipedia and cppreference.
Since I wanted to reimplement some algorithms myself, just for fun, I didn’t use any of those libraries, just the C++ standard library. If you want to solve some numerical problems, you should investigate which library might suit your needs.
As everyone that has taken any numerical approximation course, it should be clear that the floating point model is only an approximation of the domain of real numbers. There are, for example, numbers that do not have any inverse, and of course, I’m meaning other numbers apart from 0
. Most numbers are also approximated, for example, the sum of 0.1
and 0.2
should be exactly 0.3
. Unfortunately it isn’t so. Most operations on the floating point environment are not even commutative. a+(b+c)
and (a+b)+c
may yield different results.
And there are a lot of other subtleties about identities, precision, performance, and error handling.
Other and more detailed information can be found at "What Every Computer Programmer Should Know About Floating Point", renamed to "Floating Point Demystified" since its title was too similar to "What Every Computer Scientist Should Know About FloatingPoint Arithmetic".
Even with all those defects, it is still possible to solve real problems with floating point numbers. The result will not be exact, but it will be good enough if we are able to control the error between the approximated and real solution. Since the data that is normally gathered through experiments is also only an approximation, we are normally not even interested in an exact result. Unfortunately writing generic and stable algorithms is a lot more difficult and unintuitive as someone might think.
Some basic functions
As mentioned before, floating point numbers are not very precise. Therefore most times it makes little sense to see if two values have exactly the same number. Most of the time we will need to verify that two values are approximately the same
/// Compares if two values a and b are equals with an error of err
inline bool approx_equal(double a, double b, double err){
return aerr <= b && a+err >= b;
}
Another thing that has happened more than once, is the need to convert from some integral type to floating point type. Unfortunately, some operations give some completely different results when using integral types, like int
or short
, or floating point types, like float
or double
. For example double z = x/y
with x=1
and y=2
yields z=0
as result if x
and y
are of integral type, and z=0.5
as result if x
or y
is of floating point type. Writing static_cast<double>
or using the Cstyle cast (double)
is a lot of noise in a complex formula. A simple conversion formula, named d
, would be a much more readable solution:
template<class T>
double d(T v){
return static_cast<double>(v);
}
So, what function am I going to implement?
I’ve taken out some notes of my old courses and made following roadmap

solving the root problems for nonlinear equations (bisection, Newton, and some of its variations)

algorithms and structures for managing polynomials and approximate functions through polynomials (interpolation)

managing matrices and solving linear equations

some basic algorithms for solving Cauchy problems, like Euler or RungeKutta methods

other algorithms for solving ODE problems, like Adams method
Please be aware that my goal is not making production code or some sort of library. There are already a lot of libraries for solving all problems that appear on my roadmap.
The first reason for reimplementing most algorithms is just for having some fun. I had a great time implementing most algorithms my first time, verifying the approximation error tending to zero, verifying that predicted properties hold, and so on. I want to see if I’ll have the same pleasure again, even if I’ll skip most of the theoretical part.
It would, of course, be awesome if my code could also help someone during their studies. I can remember having a lot of difficulties when writing my first algorithm in C. The worst part was not implementing the algorithm itself, but managing memory and debugging. The part of managing memory is easily solved. I’ll not manage it. Since I am going to write in C++, there is no reason to allocate and deallocate memory manually. Classes like std::vector or std::unique_ptr cover most use cases, and it would be out of scope trying to reimplement something similar.
As for debugging, the problem was that the only tool I knew during my courses was the compiler from the command line, a little bit of Make, and a text editor (with syntax highlighting) for writing code. Structuring the project with CMake, enabling most compiler warnings, and using an IDE, like Qt Creator, helps a lot, since it provides an easyto use GUI for debugging. Inspecting the variables during runtime, eases the process of debugging a lot. We can still printf
everything, but it is not always the most efficient method.
Also putting all code under revision control generally helps to organize it better. It happened more than once that a function got fixed, it was then compared with another project, and no one knew what was the most recent version any longer. Git is surely a handy tool since it allows you to put your code under version control without needing a server to store it.
I plan to save all my code under GitHub, and release it under a Boost License.