A minimal DOM parser
Normally I am the user of a parser API.
I tend to use an existing library/API to query the content of a JSON message, XML file, Windows registry, and so on.
If I parse something "by hand", then it is normally something very simple, like a configuration file in .ini format, or just searching a sequence of symbols in a bigger buffer.
Probably because of my lack of experience in parsing data, I found many APIs difficult to use or not really efficient. I see unnecessary copies, the feeling that I still need to parse the output of the parser and that I cannot express efficiently what I want to do with some API or file format. It feels like I’m doing something wrong.
For this and other reasons, I decided to write some parsers from scratch, with different techniques.
To keep things simple, I wrote down how to parse a subset of the JSON format. The main motivation for this choice is that I wanted to be able to show the code in a 5-minute talk with other people and discuss parsing techniques on a concrete example.
The JSON format has the advantage that it is simple enough, yet defines different data types, and has a recursive definition, which makes it more interesting.
The grammar of a JSON file can be summarized 🗄️ as
wsc = ' ' | '\t' | '\r' | '\n'
ws0 = wsc(ws0)?
ws = (ws0)?
value = 'null' | 'true' | 'false' | number | string | array | object
string = '"' characters-or-escape '"'
number = '-'? digits ('.' digits)? (('e'|'E') ('+'|'-')? digits)?
array = ws '[' ws element-list? ws ']' ws
element-list = value ( ws ',' ws element-list)?
object = ws '{' ws member-list? ws '}' ws
member-list = member (',' member-list)?
member = string ws ':' ws value To simplify the implementation even further, here is what I’m going to parse
wsc = ' ' | '\t' | '\r' | '\n'
ws0 = wsc(ws0)?
ws = (ws0)?
value = 'null' | 'true' | 'false' | number | string | array | object
string = '"' characters '"'
number = '-'? digits
array = ws '[' ws element-list? ws ']' ws
element-list = value ( ws ',' ws element-list)?
object = ws '{' ws member-list? ws '}' ws
member-list = member (',' member-list)?
member = string ws ':' ws value On top of that, I’m assuming that a number is small enough to be represented by an int, and that characters does not include the quotation mark character (").
| Warning ⚠️ | Do not use the presented parser in production. It does not support valid all JSON constructs and is buggy. There are better libraries in the wild. |
Write a DOM parser
DOM stands for Document Object Model. It is normally used for XML and HTML documents, but the same concept can be also used for other document types.
The main idea is simple; while parsing the content, a structure with the content of the file is built and returned to the caller.
The first thing to do is map the JSON data types we want to parse to some C++ types; a possible mapping would look like
-
'null'can be mapped tostd::nullptr_t; -
'true'and'false'can be mapped tobool; -
numbercan be mapped toint -
stringcan be mapped tostd::string -
arraycan be mapped tostd::vector -
objectcan be mapped tomap<std::string, value> -
valuewould be a variant of the already listed types
The implementation is also not that complicated; for every data type, it is possible to define a corresponding routine that is responsible for the parsing:
struct value_t{
std::variant<bool, int, std::string, std::nullptr_t, std::vector<value_t>, std::map<std::string, value_t>> data;
};
auto parse_null() -> std::nullptr_t;
auto parse_bool() -> bool;
auto parse_string() -> std::string;
auto parse_number() -> int;
auto parse_value() -> value_t;
auto parse_array() -> std::vector<value_t>;
auto parse_object() -> std::map<std::string, value_t>;
// entry point
auto parse_json() -> std::map<std::string, value_t>; For simplicity, I’m going to assume that the whole content to parse is stored in something convertible to a std::string_view; this leads to the following minimal implementation:
#include <map>
#include <string>
#include <string_view>
#include <stdexcept>
#include <cassert>
#include <charconv>
#include <variant>
#include <vector>
struct value_t{
std::variant<bool, int, std::string, std::nullptr_t, std::vector<value_t>, std::map<std::string, value_t>> data;
};
[[noreturn]] void parse_error(){throw std::runtime_error("Error while parsing");}
// verifies if the next characters of sv match with v
// if it does, advance and return true, otherwise return false
bool advance_lit_name(std::string_view& sv, std::string_view v){
if(sv.starts_with(v)) {
sv = sv.substr(v.size());
return true;
}
return false;
}
// verifies if the next character (ignoring leading whitespaces) is c
// if it does, advance (ignoring trailing whitespaces) and return true, otherwise return false
bool advance_struct_char(std::string_view& sv, char c){
auto pos_leading_space = sv.find_first_not_of(" \t\r\n");
pos_leading_space = pos_leading_space==std::string_view::npos ? 0 : pos_leading_space;
if(sv.empty() or sv[pos_leading_space] != c) {
return false;
}
sv = sv.substr(pos_leading_space+1);
pos_leading_space = sv.find_first_not_of(" \t\r\n");
if(pos_leading_space!=std::string_view::npos) sv = sv.substr(pos_leading_space);
return true;
}
bool isdigit(char c){ return c >= '0' && c <= '9'; }
std::string parse_string(std::string_view& sv){
assert(sv[0] == '"');
auto pos = sv.find('"', 1); // NOTE: real JSON might contain escaped double quotes
if(pos == std::string_view::npos)parse_error();
auto v = std::string(sv.substr(1, pos-1));
sv = sv.substr(pos+1);
return v;
}
int parse_number(std::string_view& sv){
assert(sv[0] == '-' || isdigit(sv[0]));
int number;
auto [ptr, ec] = std::from_chars(sv.data(), sv.data()+sv.size(), number);
if(ec != std::errc{}) parse_error();
sv = sv.substr(ptr - sv.data());
return number;
}
value_t parse_value(std::string_view& sv);
std::vector<value_t> parse_array(std::string_view& sv){
auto check = advance_struct_char(sv, '[');
assert(check);
std::vector<value_t> arr;
if(advance_struct_char(sv, ']')) return arr;
do{
arr.push_back(parse_value(sv));
}while(advance_struct_char(sv, ','));
if(not advance_struct_char(sv, ']'))parse_error();
return arr;
}
std::map<std::string, value_t> parse_object(std::string_view& sv);
value_t parse_value(std::string_view& sv) {
if(sv.empty()) parse_error();
/**/ if(advance_lit_name(sv, "null") ) return {nullptr};
else if(advance_lit_name(sv, "true") ) return {true};
else if(advance_lit_name(sv, "false") ) return {false};
else if(sv[0] == '-' || isdigit(sv[0])) return {parse_number(sv)};
else if(sv[0] == '"') return {parse_string(sv)};
else if(sv[0] == '[') return {parse_array(sv)};
else if(sv[0] == '{') return {parse_object(sv)};
else parse_error();
}
std::map<std::string, value_t> parse_object(std::string_view& sv) {
if(not advance_struct_char(sv, '{')) parse_error();
std::map<std::string, value_t> obj;
if(advance_struct_char(sv, '}')) {
if(sv.size() > 1) parse_error();
return obj;
}
do{
if(not sv.starts_with('"')) parse_error();
auto key = parse_string(sv);
if(not advance_struct_char(sv, ':')) parse_error();
auto value = parse_value(sv);
obj.emplace(std::move(key), std::move(value));
}while(advance_struct_char(sv, ','));
if(not advance_struct_char(sv, '}'))parse_error();
if(sv.size() > 1) parse_error();
return obj;
}
std::map<std::string, value_t> parse_json(std::string_view sv) {
return parse_object(sv);
} Note that I did not implement parse_null and parse_bool as separate functions; the implementation is "inline" in parse_value.
The usage is relatively trivial, although for accessing the elements one needs to use std::visit:
int main(){
const auto json = R"({"name":{"first":"John", "last":"Smith"}, "age": 30, "locations" : [1,2,3, "4"]})";
const auto parsed = parse_json(json);
assert(parsed.size() == 3);
std::visit([](auto&& arg) {
using T = std::decay_t<decltype(arg)>;
if constexpr (std::is_same_v<T, std::map<std::string, value_t>>) {
assert(arg.size() == 2);
} else {
assert(false);
}
}, parsed.at("name").data);
} A common property of all parse_ functions is that they do not change any state (the position in string_view) if they cannot parse the data, this should make it easier to generate more useful error messages; for example by pointing at the current position that could not be parsed.
parse_json is the only function that should interest the end-user, all other functions are to be considered as implementation details.
Helper function for printing
For testing purposes, it can be useful to have a helper function for showing the parsed content.:
void mprint(const std::string_view s){
std::cout << "\"" << s << "\"";
}
void mprint(bool b){
std::cout << (b ? "true" : "false");
}
void mprint(int i){
std::cout << i;
}
void mprint(std::nullptr_t){
std::cout << "null";
}
void mprint(const value_t&);
void mprint(const std::map<std::string, value_t>& m){
std::cout << "{";
bool begin = true;
for( const auto& [key, v] : m){
if(not begin){std::cout << ", ";}
begin = false;
mprint(key);
std::cout << ":";
mprint(v);
}
std::cout << "}";
}
void mprint(const std::vector<value_t>& v){
std::cout << "[";
bool begin = true;
for(const auto& vv : v){
if(not begin){std::cout << ", ";}
begin = false;
mprint(vv);
}
std::cout << "]";
}
void mprint(const value_t& v){
std::visit([&](auto&& arg) {
mprint(arg);
}, v.data);
} How to handle errors
The current implementation is "optimistic"; it assumes that std::string_view represents a valid JSON object, and throws an error if it cannot parse the data correctly.
If exceptions are problematic, or if invalid data is expected, or at least not uncommon, then it makes more sense to return an error object; something like
struct parse_error{ /* ... */ };
std::expected<std::map<std::string, value_t>, parse_error> parse_json(std::string_view str); If you do not have std::expected in your toolbox, std::variant or even std::optional can act as replacements.
But there is also something else to consider; even if there was a parsing error, maybe the data the end-user is interested in has been already correctly parsed.
Consider for example the following string: {"a":"b"} hello. There is a valid JSON structure, and then something else.
The presented parser triggers an error because the string as a whole is not a valid JSON structure. For this particular case scenario, something similar to std::from_chars, which returns the parsed object and the last position parsed successfully would be more useful. The user can decide what to do with the parsed data and leftovers.
It does not help that much with something like {"a":{"c":"d"}, hello}, {"a":{"c":"d"}, or {"a":{"c":"d"}, "hello. This is not a valid JSON structure, but one might still want to access the object {"c":"d"} pointed by a.
With the current approach, the valid parsed data is thrown away.
Thus it might make sense to return a more complex object, that can represent the parsed data and a parsing error.
I did not explore how to change the API, but the caller needs somehow to be informed what is the last element that was parsed successfully. For example, if the parsing error happens inside an array, the array in the DOM might have the wrong size, since probably at least one element could not be parsed successfully.
Invalid, but partially parsed strings are probably best discarded, although even in that case, it is possible to get some useful information from incomplete data; for example a minimum length.
Supporting invalid objects might be a niche use case, and for most use cases, it does probably not make sense to make the DOM parser more complex than it needs to be. HTML is probably the biggest exception, where malformed documents are created and consumed daily.
For most applications, reporting the offending position might be enough.
Further parsing
For most of my use cases, the DOM structure is an intermediate state.
Normally I already have some data structures in my code used in the business logic, and I need to convert the DOM to those structures; this has two implications.
The first is that parse_json, might be parsing a gigantic JSON files, but I only need a small subset of it.
With the current API, there is no way around it.
You must create all those objects (and allocate memory for them), only to discard them shortly afterward.
One exterme example would be parsing the document for validating and pretty-printing it. In those case, all elements are copied in the DOM structure and discarded afterwards.
But even if you need to copy most objects over, there might be unnecessary conversions.
For example, the parser might use a std::map, but the user might want to use std::unordered_map, or a std::set instead of std::vector, something that is not std::string, a different allocator, and so on. This can be partially fixed by adding template parameters to parse_json, but that approach has some drawbacks too. For example, the implementation needs to be inline, and one might need to write adapters for custom containers that do not have the same API (like emplace for maps and push_back for vectors).
std::string_view versus std::string
parse_string returns a std::string, but an implementation could also return std::string_view.
While this would avoid some allocations, it works only as long as the original string with the whole JSON structure is alive.
This means that the validity of a parsed object is tied to the lifetime of another object, and still needs to allocate memory for std::vector and std::map.
It is a valid optimization, but I doubt that performance-wise it makes a relevant difference, and introduces new potential issues. It could also increase the amount of required memory as one needs to keep both the original document and the parsed document in memory at the same time.
Tagged union versus class hierarchies
In the implementation, I used std::variant to express that a value can be of different types, which I find easier to use than a class hierarchy for getting the actual data, and requires less code.
Other languages might not have the appropriate constructs, in particular, some languages do not make it possible to create such types by hand.
In Java, the preferred approach would probably be to use a base class and one subclass for every data type.
Using class hierarchies works, but adds a further indirection, and requires an additional allocation. Granted in a language that creates every object on the heap this will not make a big difference.
One could use sealed classes 🗄️ to limit the amount of subclasses, but even in that case, it might be less clear what data types one is going to handle.
Last but not least, in Java int and String do not derive from a common base, thus one might need to introduce wrapper classes for most or all types, while with variant types, no further wrappers are required.
Considering that there is only a closed subset of value types described in the grammar, the disadvantages of class hierarchies outweigh the advantages.
Streaming API
The current API requires that all data can be accessed through std::string_view.
If the data comes from a file, it means you need to read the whole file in memory, and then parse it. The drawback is that this approach requires more memory than necessary, as the parsing routine does not need to access the whole content at once.
A platform-specific solution might be to use mmap 🗄️ on POSIX systems and CreateFileMapping 🗄️ and MapViewOfFile 🗄️ on Windows systems. This might leave some systems out for parsing efficiently files, but covers many situations.
What is also missing, is support for data coming from somewhere else, or data that is not available as plain text on the disk. The data might be coming from stdin, is being downloaded from the internet, or is encoded in some format like Base64.
Currently, the user needs to download (or convert) the whole data, and then parse it. It would be better if the user could parse the data while it is arriving or while it is being converted.
This does not change how the parsing algorithms work but requires to use of something different than a string-base API as input. The parsing logic needs to pull the data, and eventually store it in an internal buffer until there is enough data to parse the next element.
Conclusion
As already mentioned, parsing some data to a DOM structure has multiple disadvantages:
-
all objects will be parsed and created; even if you do not need them
-
it is not clear how to handle partially invalid data, even without considering that recoveries might be possible
-
unnecessary conversions between types
-
since it builds a tree representation of the entire document, the usage of memory increases with the size of the document
-
you need to navigate the DOM structure to copy the data you are interested in
But it also does have multiple advantages; in case of success:
-
it is possible to easily navigate the DOM; iterate over elements and access sub-objects
-
the parsed object is easy to modify and eventually convert back to JSON
-
elements can be accessed in any order
-
elements can be accessed multiple times
If you have questions, comments, or found typos, the notes are not clear, or there are some errors; then just contact me.