The C++ logo, by Jeremy Kratz, licensed under CC0 1.0 Universal Public Domain Dedication

Self-hashing code in C++

Notes published the
5 - 6 minutes to read, 1295 words

Recently I had to upgrade an application that executed a setup procedure during its first run.

In its current implementation, the procedure verified that it was already executed by saving the information on a file.

Since I had to update the procedure, the setup algorithm had to be executed again to ensure a consistent state.

The most straightforward way to accomplish such a task is to add additional information, like a version number.

With the updated procedure, the executable queries a file from the disk. If it exists and if the version number does not match, then the setup procedure is executed, otherwise it is skipped. If the setup is executed successfully, the version number is saved to the disk.

This approach works relatively well but has one drawback: every time the setup algorithm changes, the programmer needs to remember to manually update the hardcoded version number.

One could tie this version number to the version of the executable, but it would mean that the setup algorithm is executed every time a new program is deployed.

While it is not an error, the setup process might be computationally expensive or might be interactive, thus it would be good to minimize how often it is executed. Thus tying it to a version number that does not indicate if the setup procedure changed or not would not be the best approach.

On the other hand, we should not rely on the programmer to update manually the version number.

Hash instead of version number

A hash instead of a monotonically increasing number would be good enough, as we only need to compare the number for equality. Hashes might introduce collisions, but the chances of forgetting to increase a number are much higher.

How to hash some C++ code directly in C++

With the preprocessor, it is possible to get a textual representation of a piece of code

#include <utility>

#define HASH_FUNC(...) \
	std::pair(#__VA_ARGS__, __VA_ARGS__)

void foo(int alg_id, std::string_view data){
	const auto [str, fun] = HASH_FUNC([&](){
		// expensive algorithm
		return data[0];
	});
	// str contains a textual representation of fun
	// fun is a lambda with the algorithm we want to conditionally execute
}

A possible enhancement for the macro HASH_FUNC is to verify that VA_ARGS is something that can be used as a function

#define HASH_FUNC(...) \
	std::pair( \
		([&]{static_assert(std::is_invocable_v<decltype(__VA_ARGS__)>);}(),#__VA_ARGS__), \
		__VA_ARGS__\
	)

Once we have a textual representation, it is possible to convert it to a number, for example with crc32

#include <string_view>
#include <utility>
#include <cstdint>

// crc32, crc64, md5, or something else
struct hash{
	constexpr explicit hash(int v) noexcept : value(v){}
	int value;
	constexpr bool operator==(hash o) noexcept { return value == o.value;}
};

consteval hash hash_data(std::string_view s){
	// begin crc32
	uint32_t crc32 = 0xFFFFFFFF;
	for(const auto& v : s) {
		auto ch = static_cast<unsigned char>(v);
		for(size_t j = 0; j != 8; ++j) {
			uint32_t b = (ch ^ crc32) & 1;
			crc32 >>= 1;
			if(b) {
				crc32 = crc32 ^ 0xEDB88320;
			}
			ch >>= 1;
		}
	}
	crc32 = ~crc32;
	// end crc32
	return hash(crc32);
};
static_assert(hash_data("123456789")    == hash(0xCBF43926));
static_assert(hash_data("hello world!") == hash(0x03B4C26D));

#define HASH_FUNC(...) \
	std::pair(([&]{static_assert(std::is_invocable_v<decltype(__VA_ARGS__)>);}(),hash_data(#__VA_ARGS__)), __VA_ARGS__)

int foo(int alg_id, std::string_view data){
	const auto [h,fun] = HASH_FUNC([&](){
		// expensive algorithm
		return data[0];
	});
	//assert(h == hash(0));
	if(hash(alg_id) != h){
		fun();
	}
	return h.value;
}
Note 📝
comments are stripped before the preprocessor is executed, thus the textual representation will not change if only the comments are changed.

The function hash_data is marked as consteval to ensure that the algorithm is executed at compile time and that we do not blow the binary size with literals.

Note that by using this pattern, the algorithm itself cannot verify it it should execute or not, this check needs to be done by the caller.

The main reason is that inside the lambda the type of h has not been deduced yet.

An alternative implementation, which does not use a lambda, could look like

// multi-statement macro
#define HASH_EXEC(...) \
	hash_data(#__VA_ARGS__); {__VA_ARGS__}

int foo(int alg_id, std::string_view data){
	const auto h = HASH_EXEC({
		if(hash(alg_id)==h){
			return alg_id;
		}
		// expensive algorithm
		// ...
		return h.value;
	});
}

In this case, h can be used safely from the code that has been versioned.

HASH_EXEC has the main issue to be a multi-statement macro

There are multiple ways to avoid having a multiple-statement macro, all of them have a different set of tradeoffs.

For example, wrapping it in a lambda would change the meaning of return, while wrapping it in a do-while loop would break break and continue.

Possible misuses

The presented implementation is not free from misuse.

The main disadvantage is that the code inside HASH_EXEC or HASH_FUNC calls another function, and that function is changed, for example

void bar(std::string_view data);

int foo(int alg_id, std::string_view data){
	const auto h = HASH_EXEC({
		if(hash(alg_id)==h){
			return alg_id;
		}
		// expensive algorithm
		bar(data);
		return h.value;
	});
}

or

void bar(std::string_view data);

int foo(int alg_id, std::string_view data){
	const auto [h,fun] = HASH_FUNC([&]{bar(data);});
	//assert(h == hash(0));
	if(hash(alg_id) != h){
		fun();
	}
	return h.value;
}

In both cases, if the behavior of bar changes, the calculated hash will still be unchanged. And there is no way to prevent such error, except by using an external linter.

But even if there were a linter, it would mean that the whole implementation needs to be inline.

This is doable for code that is simple enough but after a certain point, it might be easier to split the implementation, ensure that all functions are implemented in the same file, hash the whole source file with the build system (or maybe use embed 🗄️ if/when it gets standardized), and generate a source file containing the computed constant.

The other issue is that HASH_FUNC and HASH_EXEC might not receive the whole algorithm as parameters. Using the build system makes this error less obvious, but it is still possible that some implementation details are moved from one file to another.

In both cases, one relevant piece of the algorithm might change, the hash value does not change, and the code is not executed when deployed.

Conclusion

There is unfortunately no silver bullet.

Tracking by hand changes is error-prone, but the macros HASH_EXEC and HASH_FUNC have their set of issues too.

It might be easier to spot the errors when using the macros as if someone does not know what they are doing, it might ask some more questions.

Without any type of annotation, one would not think that it might need to change a constant in another place to ensure that some code is executed correctly when upgrading from an older version of the software.

Since it does not seem to be possible to do something better in the language or with the build system, an automated regression test can be used to ensure the correct functionality.

It could detect if the hash value and output of the setup procedure changed or not, and fail appropriately.

But depending on the functionality (what if the setup procedure changes a setting of the operating system?), it might not be an approach that scales well, or that is easy to implement.


Do you want to share your opinion? Or is there an error, some parts that are not clear enough?

You can contact me anytime.