Design a short string class
Over the last few weeks I have been thinking about a "short string" class: whether it makes sense to try to standardize it, and how to take advantage of other invariants.
My notes are mostly about immutable strings; strings that are constructed and never changed, possibly reassigned.
Class representing short strings with maximum capacity
A minimal example of such a class would be the following:
template <int capacity>
class short_string{
    static_assert(capacity < 32, "too big");
    char data[capacity+1]{};
public:
    constexpr short_string() noexcept = default;
    template<int N>
    constexpr short_string( const char (&)[N] ) noexcept; // static_assert(N <= capacity+1)
    constexpr static short_string rcrop(std::string_view); // store up to capacity characters, drop the rest
    constexpr std::size_t size() const noexcept;
    constexpr bool empty() const noexcept;
    constexpr bool operator==(std::string_view) const noexcept;
    // operator[], operator string_view, ...
};
It has the following features:

- no allocation
- data is inline
- supports up to a maximum length
There are two possible implementation strategies:

1) `\0`-terminated: no need to store the size separately
2) store the size separately

Since the first approach does not work well for strings with an embedded `\0`, either declare that this class does not support such strings, or store the size separately.
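As an illustration, here is a minimal sketch of the first strategy (the terminating `\0` makes a stored size unnecessary); it fleshes out the interface above, omits the literal constructor, and is not meant to be the exact implementation I used:
#include <cstddef>
#include <string_view>

template <int capacity>
class short_string {
    static_assert(capacity < 32, "too big");
    char data[capacity + 1]{}; // always \0-terminated
public:
    constexpr short_string() noexcept = default;

    // store up to capacity characters, silently drop the rest
    constexpr static short_string rcrop(std::string_view sv) noexcept {
        short_string result;
        const std::size_t n = sv.size() < static_cast<std::size_t>(capacity)
                                  ? sv.size()
                                  : static_cast<std::size_t>(capacity);
        for (std::size_t i = 0; i != n; ++i) {
            result.data[i] = sv[i];
        }
        return result;
    }

    constexpr std::size_t size() const noexcept {
        std::size_t i = 0;
        while (data[i] != '\0') ++i;
        return i;
    }
    constexpr bool empty() const noexcept { return data[0] == '\0'; }
    constexpr operator std::string_view() const noexcept { return {data, size()}; }
    constexpr bool operator==(std::string_view sv) const noexcept {
        return std::string_view(*this) == sv;
    }
};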
Note 📝 | short_string isn’t a novel class; on the contrary. Different dialects of the programming language Pascal also have a ShortString type 🗄️, which has a capacity of 255 characters, and is implemented similarly since … 1982 I guess? Although at the time the type was not named ShortString . |
Class representing short strings with fixed length
There is a second form of short_string
class, which is very similar:
template <int size>
class short_string{
    static_assert(size < 32, "too big");
    char data[size]{};
public:
    constexpr short_string( const char (&)[size+1] ) noexcept; // literal with exactly `size` characters
    constexpr static short_string rcrop(std::string_view);
    constexpr bool empty() const noexcept;
    constexpr bool operator==(std::string_view) const noexcept;
    // operator[], operator string_view, ...
};
It is very similar to the first one, but all strings of a given type have the same length. In this case, there is no need to store how long a string is, neither in a member variable nor through a terminating \0, thus the class requires one byte less.
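For illustration, a hypothetical use of this fixed-length variant (assuming the constructor sketched above, which takes a literal with exactly `size` characters):
// hypothetical: ISO country codes always have exactly two characters
using country_code = short_string<2>; // the fixed-length variant above

static_assert(sizeof(country_code) == 2); // no \0, no stored length

country_code de("DE"); // "DE" has type const char[3]: two characters plus the terminating \0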
Why?
Why did I want to use such classes?
Mainly to reduce the amount of used memory, without heavily changing existing code.
To give some perspective, the value of sizeof(char*)
is 8 bytes on most 64-bit systems, and 4 bytes on most 32-bit systems. Thus, when allocating space for a string, aside from the bytes required for storing the data itself, you also need to count sizeof(char*)
and sizeof(size_t)
(I’m choosing size_t
because this is what the most used standard libraries are using).
This means that if you want to store strings up to 7 characters, you need 7+sizeof(char*)+sizeof(size_t)
bytes, which are
- 15 on most 32-bit systems
- 23 on most 64-bit systems
Thus you are more than doubling the amount of required space on many 32-bit systems, and more than tripling it on 64-bit systems (and this is before counting any per-allocation overhead of the heap allocator).
The difference gets even bigger when compared to std::string
, which also needs to store the capacity, but more on std::string
later.
Note 📝 | With \0-terminated strings you can avoid storing the length in a variable. Still, on a 32-bit system you are roughly doubling the amount of required memory, and more than doubling it on a 64-bit system. |
In most cases, the required space on the stack for a single instance is not that important, but what if you want to store a lot of such strings, for example inside a vector?
If you have thousands of strings of up to 7 characters, the difference between std::vector<std::string>
and std::vector<short_string<7>>
becomes noticeable.
Also note that std::vector<short_string<7>>
has another nice property: it has nearly the same layout as a double-null-terminated string 🗄️, with the strings lying next to each other in memory, but with a much nicer interface: that of std::vector
and a string class, instead of that of a raw pointer. And the elements do not depend on each other, which makes it much simpler to use; try to sort the strings inside a double-null-terminated string!
std::vector<std::string>
, or even std::vector<const char*>
, has additional indirections and allocations, and thus more work to do when iterating and managing memory; for example, every element needs its own allocation.
Even if it sounds counterintuitive: short_string<7>
is even more lightweight than std::string_view
.
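To make the claim concrete, here is a small comparison. The exact numbers are implementation details (for example, on common 64-bit platforms sizeof(std::string) is 32 with libstdc++ and MSVC, 24 with libc++), so treat the values in the comments as typical, not guaranteed:
#include <cstdio>
#include <string>
#include <string_view>

// assuming the short_string<capacity> sketch from above is available
int main() {
    // typically 24 or 32 on a 64-bit implementation
    std::printf("std::string      : %zu\n", sizeof(std::string));
    // typically 16 on a 64-bit implementation: a pointer plus a size
    std::printf("std::string_view : %zu\n", sizeof(std::string_view));
    // 8: seven characters plus the terminating \0
    std::printf("short_string<7>  : %zu\n", sizeof(short_string<7>));
}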
Granted, those benefits are hard to appreciate on a machine with dozens of gigabytes of RAM.
Note 📝 | A 64-bit architecture has "big" pointers, which makes short strings even more compact relative to pointers than on 32-bit architectures. Considering that most programs do not need to handle that much memory, I find the X32 ABI (it can be summarized as a 64-bit architecture with 32-bit pointers) interesting; even though it is supported on systems like Debian, it does not seem to be used that much, and it has already been discussed twice whether Linux should continue to support it 🗄️. |
std::string already supports SSO
True, most used std::string
implementations support SSO (small string optimization).
This means that if a string is short enough, then std::string
might not allocate memory.
There are two issues.
The first is that the user of std::string
does not control if and when std::string
uses SSO. The second is that sizeof(std::string)
might be bigger than anticipated, because the internal buffer is sized to support longer strings without allocating, and it can still fail to avoid the allocation for your particular strings.
With short_string
the end-user is in control of this particular feature.
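One way to see whether a particular std::string instance currently uses SSO is to check whether data() points inside the object itself. This is only a heuristic that relies on implementation behavior, not something the standard guarantees:
#include <string>

// heuristic: with SSO the character buffer lives inside the std::string object itself;
// comparing unrelated pointers like this is not sanctioned by the standard, but it
// works on common implementations and is useful for experiments
bool probably_uses_sso(const std::string& s) {
    const char* buffer = s.data();
    const char* object = reinterpret_cast<const char*>(&s);
    return buffer >= object && buffer < object + sizeof(std::string);
}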
If you want to read more about SSO, I would recommend the following resources:
Why the static_assert?
For a long string, the saved space is proportionally small. If the string is 100 bytes long, does saving 2 bytes, thus 2% of the size, really matter? On the other hand, 2 bytes for a 4 byte long string is already 50% of the size! For this reason, short_string
for "big" sizes and capacities is not that useful.
But it gets worse; short_string
might contain strings shorter than the capacity. If a short_string<100>
contains a string with 5 bytes of content it will still consume the whole capacity. Thus instead of using only 5 bytes, it will still require 100 bytes. Quite the opposite of what we want to achieve. Thus in general a short_string
with a big size or capacity increases the memory usage.
std::string
works very well as a general string container; it is easier to use than raw pointers, and thanks to SSO it is often even more performant.
short_string
shines when the known maximum capacity is smaller than sizeof(std::string)
, but it can also be useful for slightly bigger values.
Personally, I’ve used this technique for reducing memory usage for strings with a maximum capacity of 22 bytes.
Does it make sense to standardize short_string?
I believe it does not.
It is not a good public API; you generally do not want to convert between std::string
, std::string_view
, literals, char*
, and the whole family of short_string<N>
classes (of which I have presented two different variants).
For my use cases, the string class had no mutating operations except for assignment.
The class performed better than std::string
because it took advantage of an invariant of the content: the maximum length is bounded by a relatively small, known value.
What we normally store in strings has other invariants too, thus if it makes sense to standardize short_string
, then it might make sense to standardize other implementations that take advantage of other invariants, leading to a combinatorial explosion of interactions between the classes.
We do not want that.
Example of strings with invariants
As an example, IBANs are strings with multiple invariants.
An IBAN has a maximum of 34 characters, which is only slightly larger than some implementations' sizeof(std::string)
.
But, there are other invariants one can take advantage of.
For example, it is not necessary to store the check digits if you only want to store valid IBANs. It is possible to validate the check digits on construction, drop them, and recalculate them on the fly if they are needed later. This already reduces the required characters to 32.
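A sketch of that idea: the check digits can be validated with the usual ISO 7064 MOD-97-10 scheme (move the first four characters to the end, map letters to two-digit values, and the resulting number modulo 97 must be 1), so storing them adds no information. This is an illustrative helper, not the code from my project; validation of length and allowed characters is left out:
#include <cctype>
#include <cstddef>
#include <string_view>

// returns true if the IBAN's check digits are consistent (ISO 7064 MOD-97-10)
bool valid_iban_checksum(std::string_view iban) {
    if (iban.size() < 4) return false;
    unsigned remainder = 0;
    // feed one character into the running remainder
    auto process = [&](char c) {
        if (std::isdigit(static_cast<unsigned char>(c))) {
            remainder = (remainder * 10 + static_cast<unsigned>(c - '0')) % 97;
        } else { // letters map to two digits: A=10, ..., Z=35
            const unsigned v = static_cast<unsigned>(
                std::toupper(static_cast<unsigned char>(c)) - 'A') + 10;
            remainder = (remainder * 100 + v) % 97;
        }
    };
    // process the rearranged string: characters 4..end, then the first four
    for (std::size_t i = 4; i != iban.size(); ++i) process(iban[i]);
    for (std::size_t i = 0; i != 4; ++i) process(iban[i]);
    return remainder == 1;
}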
The country code can be represented with a single byte (assuming one byte is 8 bits) instead of two characters, as there are fewer than 256 country codes.
But there are other invariants, although harder to exploit. All characters are

- ASCII
- alphanumeric
- case-insensitive
Thus it is possible to compress them to binary blobs.
If you do not even need to be able to reverse the compression, you could use a hashing algorithm. If the search space is small enough, one could hash all possible inputs and verify that there are no collisions.
Thus, there are the following options:

- use std::string, which requires up to 32+34 bytes for every single element (some IBANs might be shorter), and an indirection (assuming no SSO)
- use a generic short_string, which requires 34 bytes for every single element
- use an int32_t (assuming crc32 is enough), which is 4 bytes
enum/index/pointer - the flyweight pattern
Strings, even long strings, can obviously be mapped to an integral type.
The most straightforward approach would be to enumerate them and assign a unique number to every string; this could even be an enum value.
This only makes sense if the set of valid values is small enough. It would make little sense to enumerate all possible IBAN values, but it is a viable approach for country codes.
If you want to minimize space usage, you also want your index type to be smaller than sizeof(void*)
, otherwise you might just use a pointer to the data if it requires the same amount of space.
In the case of countries, since there are fewer than 256 of them, a char
as index is sufficient. Considering that sizeof(char)
is 1
, and sizeof(char*)
is 4 on a 32-bit system (8 on a 64-bit one), if you have a lot of such strings in memory you reduce the usage to a fourth (or an eighth).
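A minimal sketch of this idea; the table of codes is shortened and purely illustrative:
#include <cassert>
#include <cstdint>
#include <iterator>
#include <string_view>

// illustrative subset; a real table would list all assigned codes
inline constexpr std::string_view country_codes[] = {"AT", "CH", "DE", "IT"};

// one byte instead of a pointer: the value is an index into country_codes
enum class country : std::uint8_t {};

constexpr std::string_view to_string_view(country c) noexcept {
    return country_codes[static_cast<std::uint8_t>(c)];
}

constexpr country to_country(std::string_view code) {
    for (std::uint8_t i = 0; i != std::size(country_codes); ++i) {
        if (country_codes[i] == code) return country{i};
    }
    assert(false && "unknown country code");
    return country{0};
}

static_assert(sizeof(country) == 1);
static_assert(to_string_view(to_country("DE")) == "DE");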
Granted, you need a function call to access the value instead of dereferencing a pointer, but a pointer is, after all, just a number that acts as an index into memory.
Also, since country codes are so short, the difference in memory usage might be hard to measure; you really need a lot of them before one technique makes a relevant difference. If you have longer strings that appear in memory more than once, using this pattern can help to reduce the amount of used memory.
A more general approach to the described techniques is the Flyweight pattern.
If it is not possible to store all strings in the binary, storing them in a global buffer/pool is a relatively simple alternative approach. The main disadvantage is keeping this buffer from growing indefinitely, i.e. deciding on a strategy to keep it bounded.
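A sketch of such a pool, interning each distinct string once and handing out small indices; note that it never frees anything, which is exactly the growth problem mentioned above:
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

class string_pool {
    std::unordered_map<std::string, std::uint32_t> ids; // string -> index
    std::vector<const std::string*> strings;            // index -> string (map nodes are stable)
public:
    std::uint32_t intern(std::string_view s) {
        auto [it, inserted] = ids.try_emplace(std::string(s),
                                              static_cast<std::uint32_t>(strings.size()));
        if (inserted) strings.push_back(&it->first);
        return it->second;
    }
    std::string_view view(std::uint32_t id) const { return *strings[id]; }
};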
Those approaches do not work very well with mutable objects, while the short_string
classes can work with mutable objects too.
zlib and other generic algorithms
Generic compression algorithms cannot guarantee that the output is smaller than the input.
Some algorithms work better with some types of data, for example text, while others work better with other types, like music or video files. It should not be a surprise that there are many different codecs, and codecs with parameters. The more you know about your data, the more efficiently you can compress it.
Most algorithms rely on repeating patterns to represent them in a more space-efficient way.
For short "random" strings it is more unlikely to have a repeating pattern, and the header alone of the compressed input might be bigger than the data you wanted to compress.
Having written that, using an existing compression algorithm is generally still a good idea.
It might be possible to merge multiple strings and compress them all together, instead of having a lot of individually compressed strings.
This increases the chances of finding repeating patterns, thus the generic algorithm should be able to produce something that requires less space. Also, the overhead of a header or signature will be negligible compared to the saved space, as there will be only one, not one per string. The fact that with std::span<short_string<N>> the data is stored contiguously in memory makes such an operation even simpler: just access the data as raw bytes, and compress everything in one go. The operation of decompressing the data and getting the desired `short_string<N>`
back is slightly more complicated, but still relatively simple.
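For example, with zlib's one-shot API the whole contiguous buffer can be compressed in a single call. This is just a sketch of the idea (zlib is one possible choice among many generic algorithms), with error handling kept to a minimum:
#include <zlib.h>

#include <cassert>
#include <cstddef>
#include <vector>

// compress one contiguous buffer of raw bytes (for example the storage behind a
// std::vector<short_string<N>>) in a single call; the uncompressed size has to be
// stored somewhere else, since uncompress() needs it later
std::vector<unsigned char> compress_all(const unsigned char* data, std::size_t size) {
    uLongf dest_len = compressBound(static_cast<uLong>(size));
    std::vector<unsigned char> out(dest_len);
    const int res = compress(out.data(), &dest_len, data, static_cast<uLong>(size));
    assert(res == Z_OK);
    out.resize(dest_len); // compressBound() is only an upper bound, shrink to the real size
    return out;
}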
The main disadvantage of such an approach is that, in general, you will not be able to access/decompress the elements individually, but will have to decompress them all.
Unused bits
If the strings contain only ASCII characters and your system has CHAR_BIT == 8
, then there is one wasted bit per character.
If you could reuse this bit, you would be able to store 8 characters in an array of 7 elements, which means that it is possible to reduce the required space by 12.5%!
But we can do even better.
If your strings use only a subset of ASCII, for example only letters and digits, for a total of 62 symbols, then 6 bits are sufficient to represent them all. This means that 8 characters can be stored in an array of 6 elements, with 25% of the space saved!
While it sounds like a win, there are other things to consider.
How do those strings interact with each other? It might be problematic if you need to convert from the "compressed" string to a normal one every time, instead of being able to simply read, or change, the content directly.
How to remove unused bits
Warning ⚠️ | I did not use this method in production, use with caution. |
All of what I’ve written assumes that

- CHAR_BIT == 8
- little-endian byte order

The general idea can be applied to other systems too, with some changes.
The following example shows how to compress the string literal Hello!!!
by hand:
unsigned char input[] { 'H', 'e', 'l', 'l', 'o', '!', '!', '!' };
data in memory (little endian):
H | e | l | l | o | ! | ! | !
01001000 | 01100101 | 01101100 | 01101100 | 01101111 | 00100001 | 00100001 | 00100001
The first bit is always 0; "remove" it
_1001000 _1100101 _1101100 _1101100 _1101111 _0100001 _0100001 _0100001
resulting data in memory
10010001 | 10010111 | 01100110 | 11001101 | 11101000 | 01010000 | 10100001 |
| | f | | | P | |
unsigned char output[] { 145, 151, 'f', 205, 232, 'P', 161 };
For converting the data back, insert a 0 before each group of 7 bits (at the start and after every 7th bit):
1001000 1100101 1101100 1101100 1101111 0100001 0100001 0100001
^       ^       ^       ^       ^       ^       ^       ^
resulting data in memory
01001000 | 01100101 | 01101100 | 01101100 | 01101111 | 00100001 | 00100001 | 00100001
H | e | l | l | o | ! | ! | !
From a "high level" perspective, the algorithm looks pretty obvious, how can it be implemented in C or C++?
Unfortunately in both languages, bits are second-class citizens. It is not possible to pass bits as parameters to functions, the standard library does not provide any facility, and the only way to modify them is through symbols (<<
, ~
, >>
, &
) that do not always make the code easy to read (or write).
The main operation needed for implementing the algorithm is the shift (<<
). After removing the first 0
on the left, every other bit is moved one position to the left.
Unfortunately, both languages define shifting only for single integral values, not for arrays.
Thus we need some functions for shifting the values of an array.
#include <cassert>
#include <cstddef>
#include <cstring>
#include <span>

void left_shift_with_carry(std::span<unsigned char> arr, int bits) {
    assert(bits >= 0);
    // first, move complete bytes
    const std::size_t chunks = static_cast<std::size_t>(bits) / 8;
    if (chunks >= arr.size()) {
        std::memset(arr.data(), 0, arr.size());
        return;
    }
    if (chunks > 0) {
        std::memmove(arr.data(), arr.data() + chunks, arr.size() - chunks);
        std::memset(arr.data() + arr.size() - chunks, 0, chunks);
    }
    // then move the remaining bits (if any)
    const int left = bits % 8;
    if (left == 0) return;
    const int right = 8 - left;
    const std::size_t l = arr.size() - 1 - chunks;
    for (std::size_t i = 0; i != l; ++i) {
        const unsigned char carry = arr[i + 1];
        arr[i] = static_cast<unsigned char>((arr[i] << left) | (carry >> right));
    }
    arr[l] = static_cast<unsigned char>(arr[l] << left);
}

void left_shift_with_carry_and_drop(std::span<unsigned char> arr, int bits) {
    // shift the tail of the array first, so that every element drops its
    // `bits` most significant bits and the remaining bits are packed together
    for (std::size_t i = arr.size(); i != 0; --i) {
        left_shift_with_carry(arr.subspan(i - 1), bits);
    }
}
With these functions, we can shift subsets of the array and drop the bits we are not interested in:
#include <algorithm>

int main() {
    unsigned char buffer[] { 'H', 'e', 'l', 'l', 'o', '!', '!', '!' };
    left_shift_with_carry_and_drop(buffer, 1);
    const unsigned char expected_output[] { 145, 151, 'f', 205, 232, 'P', 161 };
    assert(std::equal(buffer, buffer + 7, expected_output));
}
Similarly, we need to shift the elements to the right to restore the original content:
void right_shift_with_carry(std::span<unsigned char> arr, int bits) {
    assert(bits >= 0);
    // first, move complete bytes
    const std::size_t chunks = static_cast<std::size_t>(bits) / 8;
    if (chunks >= arr.size()) {
        std::memset(arr.data(), 0, arr.size());
        return;
    }
    if (chunks > 0) {
        std::memmove(arr.data() + chunks, arr.data(), arr.size() - chunks);
        std::memset(arr.data(), 0, chunks);
    }
    // then move the remaining bits (if any)
    const int left = bits % 8;
    if (left == 0) return;
    const int right = 8 - left;
    unsigned char carry = 0;
    for (std::size_t i = 0; i != arr.size(); ++i) {
        // the bits shifted out of arr[i] become the high bits of arr[i+1]
        const unsigned char carry_for_next_iter = arr[i] & ((1 << left) - 1);
        arr[i] = static_cast<unsigned char>((arr[i] >> left) | (carry << right));
        carry = carry_for_next_iter;
    }
}

void right_shift_with_carry_and_insert(std::span<unsigned char> arr, int bits) {
    // shift the head of the array first, so that every element gains
    // `bits` zero bits at its most significant positions
    for (std::size_t i = 0; i != arr.size(); ++i) {
        right_shift_with_carry(arr.subspan(i), bits);
    }
}
int main() {
    unsigned char buffer[] { 'H', 'e', 'l', 'l', 'o', '!', '!', '!' };
    left_shift_with_carry_and_drop(buffer, 1);
    const unsigned char expected_output[] { 145, 151, 'f', 205, 232, 'P', 161 };
    assert(std::equal(buffer, buffer + 7, expected_output));

    right_shift_with_carry_and_insert(buffer, 1);
    const unsigned char expected_input[] { 'H', 'e', 'l', 'l', 'o', '!', '!', '!' };
    assert(std::equal(buffer, buffer + 8, expected_input));
}
Note that the current implementation has quadratic complexity, thus it is not particularly performant. But since the main topic of this note is short strings, and this code is only a proof of concept, the current implementation is good enough.
We can also calculate immediately how big the resulting array should be; we just need to count how many bits have been dropped in total, convert it to bytes (rounding down), and subtract that value from the original size: data.size() - data.size()*(8-requiredbits)/8
.
Define supported characters
So far, the algorithm simply removes the unused bits. But how do we know how many bits we can drop?
By defining an alphabet of supported characters, and by mapping those characters to the values in the interval [0, alphabet.size()-1]
of unsigned char
.
Note 📝 | I’m writing for simplicity only about ASCII characters, but the same mapping can be done for any Unicode codepoint. On the other hand, chances are that if you have Unicode-encoded strings, then the alphabet will have more than 127 elements, thus there will be no unused bits to drop. |
The size of the alphabet determines how many bits can be dropped. In the example below, the alphabet consists of 26 characters, which can be represented with 5 bits; thus we only need 5 bits out of 8, and 3 can be dropped.
If you have case-insensitive, non-locale-dependent ASCII strings, my recommendation would be to convert the string to lowercase (or uppercase) and work with the reduced alphabet.
Note that all uppercase ASCII letters plus digits amount to 36 symbols, and if you also take lowercase letters into account you get 62 symbols; both need at least 6 bits, so you can even avoid converting the string to uppercase or lowercase beforehand and compress it as-is.
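The number of required bits per character follows from the alphabet size; since the mapped values range from 0 to alphabet.size()-1, std::bit_width of the largest value gives the answer:
#include <bit>

// bits needed to distinguish N symbols, i.e. to represent the values 0 .. N-1
static_assert(std::bit_width(26u - 1)  == 5); // lowercase letters only
static_assert(std::bit_width(36u - 1)  == 6); // letters (one case) and digits
static_assert(std::bit_width(62u - 1)  == 6); // letters (both cases) and digits
static_assert(std::bit_width(128u - 1) == 7); // full ASCII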
The functions for mapping the supported symbols to the first values of unsigned char
and back are the following:
#include <cassert>
#include <span>
#include <string_view>

// map every character to its position in the alphabet: [0, dictionary.size()-1]
void reduce(std::span<unsigned char> data, std::string_view dictionary) {
    for (auto& v : data) {
        const auto pos = dictionary.find(static_cast<char>(v));
        assert(pos != std::string_view::npos);
        v = static_cast<unsigned char>(pos);
    }
}

// map every position back to the original character
void expand(std::span<unsigned char> data, std::string_view dictionary) {
    for (auto& v : data) {
        assert(v < dictionary.size());
        v = static_cast<unsigned char>(dictionary[v]);
    }
}
At this point, we have defined all functions for creating a compressed string:
#include <bit>
#include <vector>

// define the supported alphabet
const std::string_view alphabet = "abcdefghijklmnopqrstuvwxyz";

std::vector<unsigned char> data = { /* data coming from somewhere else */ };

// map all characters of the alphabet to [0, alphabet.size()-1]
reduce(data, alphabet);

// calculate how many bits are needed per character
// (the mapped values range from 0 to alphabet.size()-1)
const int requiredbits = static_cast<int>(std::bit_width(alphabet.size() - 1));

// drop the unused bits
left_shift_with_carry_and_drop(data, 8 - requiredbits);

// copy the compressed string into a shorter, appropriately sized buffer
// of length data.size() - (data.size()*(8-requiredbits))/8

// -- now restore the original content

// add the padding bits back
right_shift_with_carry_and_insert(data, 8 - requiredbits);

// map [0, alphabet.size()-1] back to the original symbols
expand(data, alphabet);
A class compressed_string would

- use reduce and left_shift_with_carry_and_drop in the constructor
- use right_shift_with_carry_and_insert and expand in a conversion operator (if such an operator is necessary)
- store the alphabet as a non-type template parameter or as an index to a constant (see the flyweight pattern; storing the alphabet as a member variable would defeat the purpose of compressed_string)
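Putting the pieces together, the interface of such a class could look roughly as follows. This is only a sketch under the assumptions above (CHAR_BIT == 8, an alphabet with static storage duration, the helper functions defined earlier), not a finished implementation:
#include <bit>
#include <cstddef>
#include <string>
#include <string_view>

inline constexpr std::string_view lowercase_alphabet = "abcdefghijklmnopqrstuvwxyz";

// Capacity is the maximum number of characters before compression; the internal
// buffer only needs Capacity*bits_per_char bits, rounded up to whole bytes
template <int Capacity, const std::string_view& Alphabet = lowercase_alphabet>
class compressed_string {
    static_assert(Capacity > 0 && Capacity < 256, "length must fit in one byte");
    static constexpr int bits_per_char = std::bit_width(Alphabet.size() - 1);
    static constexpr int buffer_size   = (Capacity * bits_per_char + 7) / 8;
    unsigned char data[buffer_size]{};
    unsigned char length = 0; // number of characters before compression
public:
    constexpr compressed_string() noexcept = default;
    explicit compressed_string(std::string_view);  // reduce() + left_shift_with_carry_and_drop()
    explicit operator std::string() const;         // right_shift_with_carry_and_insert() + expand()
    constexpr std::size_t size() const noexcept { return length; }
    constexpr bool empty() const noexcept { return length == 0; }
};

// 8 characters * 5 bits = 40 bits = 5 bytes, plus one byte for the length
static_assert(sizeof(compressed_string<8>) == 6);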
Conclusion
Up until now, short_string
was good enough for my use cases.
With "good enough" I mean that I was able to replace std::string
with short_string in some places, minimally change the places where those strings are passed around (similarly to replacing const std::string&
parameters with std::string_view
), and significantly reduce the amount of used memory.
Feature-wise there was nothing wrong with the original code; it was just not prepared to deal with a lot of strings in a constrained environment.
Replacing std::string
with short_string
was enough to deal with the new workload, but if it increases again, more drastic measures will be necessary, such as major changes in the algorithms dealing with the data structures.
Using something like compressed_string
instead of short_string
would not have been as straightforward.
The main reason: you need to convert the string back and forth, and this takes some resources.
While it is true that an instance occupies even less memory, converting back and forth is additional work. The application might still use less memory overall, but without further analysis it is not clear how performance is affected.
Performance considerations are not really an issue for short_string
; there is generally no additional work, and the data is stored as-is, just in a less dynamic data structure.
Just like short_string
I think that something like compressed_string
(or the functions used for implementing it) is not a good candidate for the standard library.
On the other hand, it might be nice to have functions for shifting data in an array. Most of the time a cast to a corresponding unsigned integral type is the simplest and most efficient way to deal with shifts, but since an array can be longer than all supported integral types, it is not always a viable solution.
Do you want to share your opinion? Or is there an error, some parts that are not clear enough?
You can contact me anytime.