The C++ logo, by Jeremy Kratz, licensed under CC0 1.0 Universal

Compile-time string operations


14 - 18 minutes read, 3515 words
Categories: c++
Keywords: c++ compile-time string

While in C++20 std::string can be used in a constexpr context (but the std::string created at compile-time cannot be use at runtime, maybe in a future revision of C++), it is still possible to write a relative simple string class that is constexpr, even in C++11.

This should not be a surprise as at CppCon17 Jason Turner and Ben Deane gave a talk titled "Constexpr ALL the things" and presented a compile-time JSON parser 🤯.

Even without making everything constexpr, being able to manipulate strings (and some other types) at compile time is very useful.

For example, it makes it easier to avoid magic constants in your code or helps to avoid using the preprocessor for generating them.

Overview of a c_string class

The code presented is C++20 (at least because of std::make_array), but with minor changes can be backported to C++11

First of all, we need something for manipulating an array at compile-time, as a string is, after all, represented by a sequence of characters.

The standard library does not provide anything, but implementing it is not that difficult (except for the non-intuitive part on how to initialize an array). Note that this implementation, for brevity, will always copy the content. Overloads for temporaries can be provided, but as the functions are mainly there for initializing compile-time data, it’s probably not that important.

#include <array>

template<class T, std::size_t N1, std::size_t N2, std::size_t... PACK1, std::size_t... PACK2>
constexpr std::array<T,N1+N2> concat_impl(const std::array<T,N1>& arr1, const std::array<T,N2>& arr2, std::index_sequence<PACK1...>, std::index_sequence<PACK2...>){
	return std::array<T,N1+N2>({arr1[PACK1]..., arr2[PACK2]...});
}
template<class T, std::size_t N1, std::size_t N2>
constexpr std::array<T,N1+N2> concat(const std::array<T,N1>& arr1, const std::array<T,N2>& arr2) {
	return concat_impl(arr1, arr2, std::make_index_sequence<N1>{}, std::make_index_sequence<N2>{});
}

static_assert(concat(std::array{1,2}, std::array{3,4}).size()==4,"");
static_assert(concat(concat(std::array{1,2}, std::array{1}), std::array{3,4}).size()==5,"");

The string class has the following layout:

template <std::size_t N>
class c_string
{
    char mdata[N + 1]; // or std::array<char, N+1> mdata;

    template <std::size_t... I>
    constexpr c_string(const char (&arr)[N+1], std::index_sequence<I...>) noexcept
        : mdata{ arr[I]..., '\0' }{}

public:
    constexpr c_string(const char (&arr)[N+1]) noexcept : c_string(arr, std::make_index_sequence<N>{}) {
    }

    constexpr std::size_t size() const noexcept{ return N; }
    constexpr char operator[](int i ) const { return this->mdata[i]; }
    // other functions providing a string like interface (begin, end, size, c_str, op==, conversion to string_view, ...)
};

// factory function
template <int N_PLUS_1>
constexpr auto literal(const char (&lit)[N_PLUS_1]) noexcept -> c_string<N_PLUS_1 - 1> {
    return c_string<N_PLUS_1 - 1>(lit);
}


static_assert(literal("1234").size() == 4, "");

The array has a place for one character more at the end because I want it to be \0-terminated like std::string (more on that later). As a side-effect, concatenating two c_string is not equivalent to concatenating two arrays, because the terminating \0 is not part of the content of the string!

To have such functionality, we need another couple of helper functions

template<std::size_t O, class T, std::size_t N, std::size_t... PACK>
constexpr std::array<T,N-O> trim_right_impl(const std::array<T,N>& arr, std::index_sequence<PACK...>){
    return std::array<T,N-O>({arr[PACK]...});
}
template<std::size_t O, class T, std::size_t N>
constexpr std::array<T,N-O> trim_right(const std::array<T,N>& arr) {
    return trim_right_impl<O>(arr, std::make_index_sequence<N-O>{});
}

static_assert(trim_right<1>(std::to_array({1,2,3})).size() == 2);
static_assert(trim_right<1>(std::to_array({1,2,3}))[0] == 1);
static_assert(trim_right<1>(std::to_array({1,2,3}))[1] == 2);

With these functions, it is possible to write an operator+ for c_string:

template <std::size_t N>
class c_string
{
    template<std::size_t N2>
    friend class c_string;

    char mdata[N + 1]; // or std::array<char, N+1> mdata;

    // using std::array because of operator+
    template <std::size_t... I>
    constexpr c_string(const std::array<char, N+1>& arr, std::index_sequence<I...>) noexcept
        : mdata{ arr[I]..., '\0' }{}
    constexpr c_string(const std::array<char, N+1>& arr) noexcept : c_string(arr, std::make_index_sequence<N>{}) {
    }

public:
    constexpr c_string(const char (&arr)[N+1]) noexcept : c_string(std::to_array(arr), std::make_index_sequence<N>{}) {
    }

    template <std::size_t N2>
    constexpr auto operator+(const c_string<N2>& s2) const noexcept -> c_string<N + N2>{
        return c_string<N + N2>(concat(trim_right<1>(std::to_array(this->mdata)), s2.mdata));
    }
    template < std::size_t N2>
    constexpr friend auto operator+(const c_string<N>& s1, const char (&s2)[N2]) noexcept -> c_string<N + N2-1>{
        return s1 + c_string<N2-1>(s2);
    }
    template <std::size_t N2>
    constexpr friend auto operator+(const char (&s1)[N2], const c_string<N>& s2) noexcept -> c_string<N2-1 + N>{
        return c_string<N2-1>(s1) + s2;
    }
    constexpr std::size_t size() const noexcept{ return N; }
    constexpr char operator[](int i ) const { return this->mdata[i]; }

    // other functions providing a string-like interface (begin, end, size, c_str, op==, conversion to string_view, ...)
};

// factory function
template <int N_PLUS_1>
constexpr auto literal(const char (&lit)[N_PLUS_1]) noexcept -> c_string<N_PLUS_1 - 1> {
    return c_string<N_PLUS_1 - 1>(lit);
}

static_assert((literal("1")+literal("23")).size() == 3, "");
static_assert((literal("")+literal("23")).size() == 2, "");
static_assert((literal("")+literal("")).size() == 0, "");
static_assert((literal("1")+literal("23"))[0] == '1', "");
static_assert((literal("1")+literal("23"))[1] == '2', "");
static_assert((literal("1")+literal("23"))[2] == '3', "");
static_assert((literal("1")+"23").size() == 3, "");
static_assert((literal("1")+"").size() == 1, "");
static_assert(("1" + literal("23")).size() == 3, "");
static_assert(("" + literal("23")).size() == 2, "");

This string class, even if minimal, already provides some useful functionality. In particular, it has easy-to-use interface for concatenating strings.

Why '\0'-termination?

Contrary to std::string_view, the c_string class is owning.

As some APIs require a '\0'-terminated string, it is a useful property, even they won’t work correctly in the case of an embedded '\0'. Requiring every instance to have one byte more does not seem a big price to pay. In contrast, it can avoid an unnecessary conversion to std::string if a leading '\0' is required by some underlying API.

If there is no .c_str() method, then the leading '\0' can be removed without issues, and operator+ is just concatenating two strings.

hex encoding/decoding

Contrary to a string generated by macros, c_string is a real class, and writing constexpr functions is much easier.

Before doing that, supposing we are decoding from the left to the right, we need for a recursive implementation (required for C++11, discouraged otherwise) a helper function for splitting c_string.

As we already have trim_right for std::array, lets add trim_left for both std::array and c_string

template <std::size_t O, std::size_t ... Is>
constexpr auto add_offset(std::index_sequence<Is...>) noexcept -> std::index_sequence<(O + Is)...> {
    return {};
}
template <std::size_t O, std::size_t N>
constexpr auto make_index_sequence_with_offset() noexcept -> decltype(add_offset<O>(std::make_index_sequence<N>{})) {
    return add_offset<O>(std::make_index_sequence<N>{});
}

template<std::size_t O, class T, std::size_t N, std::size_t... PACK>
constexpr std::array<T,N-O> trim_left_impl(const std::array<T,N>& arr, std::index_sequence<PACK...>){
    return std::array<T,N-O>({arr[PACK]...});
}
template<std::size_t O, class T, std::size_t N>
constexpr std::array<T,N-O> trim_left(const std::array<T,N>& arr) {
    return trim_left_impl<O>(arr, make_index_sequence_with_offset<O, N-O>());
}

static_assert(trim_left<1>(std::array{1,2,3}).size() == 2);
static_assert(trim_left<1>(std::array{1,2,3})[0] == 2);
static_assert(trim_left<1>(std::array{1,2,3})[1] == 3);


template<std::size_t O, std::size_t N, std::size_t... PACK>
constexpr c_string<N-O> trim_left_impl(const c_string<N>& arr, std::index_sequence<PACK...>){
    return c_string<N-O>({arr[PACK]..., '\0'});
}
template<std::size_t O, std::size_t N>
constexpr c_string<N-O> trim_left(const c_string<N>& arr) {
    return trim_left_impl<O>(arr, make_index_sequence_with_offset<O, N-O>());
}

static_assert(trim_left<1>(literal("123")).size() == 2);

Given those functions, a routine for converting a string containing hexadecimal characters to its binary representation could look like

constexpr char validate_hexchar(char c){
    return (c >= '0' && c <= '9') or (c >= 'A' && c <= 'F') or (c >= 'a' && c <= 'f') ? c : stop_comp();
}

constexpr std::array<unsigned char, 1> hex2bin(const c_string<2>& in){
    return {static_cast<unsigned char>(
        (((validate_hexchar(in[0]) & '@') ? in[0] + 9 : in[0]) << 4) |
        (((validate_hexchar(in[1]) & '@') ? in[1] + 9 : in[1]) & 0xF)
    )};
}
template <std::size_t N>
constexpr std::array<unsigned char, N/2> hex2bin(const c_string<N>& in){
    return concat<unsigned char>(
        hex2bin(literal({in[0], in[1], '\0'})),
        hex2bin(literal(trim_left<2>(in))));
}

If you have a long c_string, with MSVC you will exhaust your memory. In general, such recursive implementation increases the compile-times by a significant factor. For a string with N characters, it will instantiate N/2 functions. From C++14 onward writing a recursive hex2bin does not make much sense. By using a loop, the code is generally more efficient, requires less time to compile, and the function is also usable at runtime without causing stack overflows. But for those sticking with C++11 it’s possible to reduce compile times (and stack usage but you should still not use it at runtime), for example by converting from two to eight hexadecimal values at once, which means reducing the number of instantiations to one eighth.

constexpr char validate_hexchar(char c){
    return (c >= '0' && c <= '9') or (c >= 'A' && c <= 'F') or (c >= 'a' && c <= 'f') ? c : stop_comp();
}

constexpr unsigned char hex2bin_i(char in0, char in1){
    return static_cast<unsigned char>(
        (((validate_hexchar(in0) & '@') ? in0 + 9 : in0) << 4) |
        (((validate_hexchar(in1) & '@') ? in1 + 9 : in1) & 0xF)
    );
}
constexpr std::array<unsigned char, 1> hex2bin(const c_string<2>& in){
    return std::array<unsigned char, 1>({hex2bin_i(in[0], in[1])});
}
constexpr std::array<unsigned char, 2> hex2bin(const c_string<4>& in){
    return std::array<unsigned char, 2>({
      hex2bin_i(in[0], in[1]),
      hex2bin_i(in[2], in[3])
    });
}
constexpr std::array<unsigned char, 3> hex2bin(const c_string<6>& in){
    return std::array<unsigned char, 3>({
      hex2bin_i(in[0], in[1]),
      hex2bin_i(in[2], in[3]),
      hex2bin_i(in[4], in[5])
    });
}
constexpr std::array<unsigned char, 4> hex2bin(const c_string<8>& in){
    return std::array<unsigned char, 4>({
      hex2bin_i(in[0], in[1]),
      hex2bin_i(in[2], in[3]),
      hex2bin_i(in[4], in[5]),
      hex2bin_i(in[6], in[7])
    });
}
// eventually add other overloads to reduce compile-times further
template <std::size_t N>
constexpr std::array<unsigned char, N/2> hex2bin(const c_string<N>& in){
    return concat<unsigned char>(
      hex2bin(literal({in[0], in[1], in[2], in[3], in[4], in[5], in[6], in[7], '\0'})),
      hex2bin(literal(trim_left<8>(in)))
    );
}

The opposite conversion (binary to hex) is left as an exercise to the reader.

Converting a number at compile-time

The main difference between c_string and std::string, when doing this type of conversion, is that the size is part of the type.

Thus different numbers will create different c_string types. For example, the number 42 will create a c_string with length two, which is different from the type generated from the number 3, a c_string with length one.

This means that:

  • it is not possible to make a function usable both at runtime and compile-time, like hex2bin, because functions have a fixed return type

  • implementing such function requires one pass for determining the length of the string, and another for doing the conversion with the appropriate types

Calculating the required length is simple (ignoring std::numeric_limits<int>::min() for simplicity):

//! calculate number of digits needed, including minus sign
constexpr int num_digits(int x) {
    return x < 0  ? 1 + num_digits(-x) :
           x > 10 ? 1 + num_digits(x/10) :
                    1;
}

static_assert(num_digits(-1) == 2, "");
static_assert(num_digits(-0) == 1, "");
static_assert(num_digits( 0) == 1, "");
static_assert(num_digits(10) == 0, "");

Once we know how much space we need, and thus the return type of our function, we can write the implementation.

First, we need to determine the size of the final string, then we can do the conversion.

constexpr int abs_val(int x) {
     return x < 0 ? -x : x;
}
constexpr char digit_to_char(int c){
    return c + '0';
}

// called if single base b digit >=0
template <int value>
constexpr auto to_string_1(std::true_type) -> c_string<1> {
    return c_string<1>({digit_to_char(value), '\0'});
}
// called if single base b digit <0
template <int value>
constexpr auto to_string_1(std::false_type) -> c_string<2> {
    return c_string<2>({'-', digit_to_char(0-value), '\0'});
}

// called if int is single digit
template <int value>
constexpr auto to_string_i(std::true_type) -> c_string<num_digits(value)> {
    return to_string_1<value, b>(std::integral_constant<bool, value >= 0>());
}

template <int value>
constexpr auto to_string() -> c_string<num_digits(value)>;

// called if not single-digit
template <int value>
constexpr auto to_string_i(std::false_type) -> c_string<num_digits(value)> {
    return to_string<value/10>() +  to_string_1<abs_val(value)%10>(std::true_type{});
}

template <int value>
constexpr auto to_string() -> c_string<num_digits(value)> {
    return to_string_i<value>(std::integral_constant<bool, num_digits(value)==1 || (num_digits(value)==2 and not (value > 0) )>());
}

// negative value
static_assert(to_string<-9999999>() == "-9999999"sv, "");
// with enum
enum val { a = 12 };
static_assert(to_string<a>() == "12"sv, "");

One might argue that using a macro, in this case, is much simpler.

In fact

#define STR_HELPER(x) #x
#define STR(x) STR_HELPER(x)
static_assert(STR(12) == "12"sv, "");

is much shorter to write, but it will not work, for example, for enum, as it will stringify the enum name, and not its value:

#define STR_HELPER(x) #x
#define STR(x) STR_HELPER(x)
enum val { a = 12 };
static_assert(STR(a) == "s"sv, "");

The only advantage of the macro, is that it creates a string literal, which sometimes cannot be replaced by something else.

A function also makes it easier, for example, to convert a number in base 2 or 16, and not only in base 10.

The presented code can be adapted by adding a second parameter and replacing 10 with the new parameter. Just digit_to_char needs some additional logic.

constexpr int abs_val (int x) {
    return x < 0 ? -x : x;
}

enum class base : int {bin = 2, dec = 10, hex = 16};

constexpr char digit_to_char(int c, base b){
    return
         b == base::hex ? (c > 9 ? (c - 10 + 'A') : (c + '0') ) :
         b == base::bin ? (c == 0 ? '0' : '1') :
         /* base::dec */  (c + '0');
}

//! calculate number of digits needed, including minus sign
constexpr int num_digits(int x) {
    return x < 0      ? 1 + num_digits(-x, b) :
           x > int(b) ? 1 + num_digits(x/int(b), b) :
                        1;
}

static_assert(num_digits(0xA, base::dec) == 2, "");
static_assert(num_digits(0xA, base::hex) == 1, "");
static_assert(num_digits(0xA, base::bin) == 4, "");

// called if single base b digit >=0
template <int value, base b>
constexpr auto to_string_1(std::true_type) -> c_string<1> {
    return c_string<1>({digit_to_char(value, b), '\0'});
}
// called if single base b digit <0
template <int value, base b>
constexpr auto to_string_1(std::false_type) -> c_string<2> {
    return c_string<2>({'-', digit_to_char(0-value, b), '\0'});
}

// called if int is single digit
template <int value, base b>
constexpr auto to_string_i(std::true_type) -> c_string<num_digits(value, b)> {
    return to_string_1<value, b>(std::integral_constant<bool, value >= 0>());
}

template <int value, base b>
constexpr auto to_string() -> c_string<num_digits(value, b)>;

// called if not single-digit
template <int value, base b>
constexpr auto to_string_i(std::false_type) -> c_string<num_digits(value, b)> {
    return to_string<value/int(b), b>() +  to_string_1<abs_val(value)%int(b), b>(std::true_type{});
}

template <int value, base b>
constexpr auto to_string() -> c_string<num_digits(value, b)> {
    return to_string_i<value, b>(std::integral_constant<bool, num_digits(value, b)==1 or (num_digits(value, b)==2 and not (value >= 0)>()));
}



// negative value
static_assert(to_string<-9999999, base::dec>() == "-9999999"sv, "");


// with enum
enum val { a = 12 };
static_assert(to_string<a, base::dec>() == "12"sv, "");

// base 16
static_assert(to_string<0xF, base::hex>() == "F"sv, "");
static_assert(to_string<0xF, base::dec>() == "15"sv, "");

// base 2
static_assert(to_string<0, base::bin>() == "0"sv, "");
static_assert(to_string<1, base::bin>() == "1"sv, "");
static_assert(to_string<2, base::bin>() == "10"sv, "");
static_assert(to_string<3, base::bin>() == "11"sv, "");

Compiler limits

Sometimes, the algorithm that does the computation at runtime is much more efficient than its corresponding constexpr version. Especially in C++11, where loops are not possible in constexpr functions, and thus the implementation needs to be recursive.

Depending on the compiler, and the size of the string, some functions might also fail, for example, because there might be not enough memory while compiling. Also, compilers do set some hard limits on how many functions can be evaluated at compile-time.

In some cases, it is possible to "take some shortcuts", just like shown with hex2bin and make the code compile with fewer resources, those solutions are generally compiler-specific. If using at least C++14, avoiding recursive functions and preferring loops (or algorithms), as a regular function would do at runtime, generally reduces the build time and the need for working around compiler limitations.

Conclusion

Even if the places where such classes are needed are rare, they are still a useful tool for

Thanks to std::string_view, std::span and eventually other similar self-written non-owning types, classes like c_string can be used without glue code or code duplication in regular functions, not designed with the presented highly specialized string class in mind.


Do you want to share your opinion? Or is there an error, same parts that are not clear enough?

You can contact me here.