A minimal pull parser
After implementing a DOM and a SAX parser, its time to look at a third one: a pull parser.
The biggest disadvantage of the SAX / event-based parser is it’s usability; the fact that one needs to implement a state machine to know what is currently being parsed.
The main idea of a pull parser is to give the control back to the user. I already tried out this approach for numerical algorithms, and it helped to reduce the overall complexity and code duplication.
Similar to numerical algorithms, there is an increased complexity for the caller (compared to the DOM parser).
Note 📝 | StAX (Streaming API for XML) is specific pull parser for XML. |
While the DOM and event based parser (or push parser) have a similar structure, the pull parser has many more differences:
#include <string_view>
#include <stdexcept>
#include <cassert>
#include <charconv>
#include <optional>
void parse_error(){throw std::runtime_error("Error while parsing");}
bool advance_lit_name(std::string_view& sv, std::string_view v){
if(sv.starts_with(v)) {
sv = sv.substr(v.size());
return true;
}
return false;
}
bool advance_struct_char(std::string_view& sv, char c){
auto pos_leading_space = sv.find_first_not_of(" \t\r\n");
pos_leading_space = pos_leading_space==std::string_view::npos ? 0 : pos_leading_space;
if (sv.empty() or sv[pos_leading_space] != c) {
return false;
}
sv = sv.substr(pos_leading_space+1);
pos_leading_space = sv.find_first_not_of(" \t\r\n");
if(pos_leading_space!=std::string_view::npos) sv = sv.substr(pos_leading_space);
return true;
}
std::string_view parse_string(std::string_view& sv){
assert(sv[0] == '"');
auto pos = sv.find('"', 1); // NOTE: real json might contain escaped double quotes
if(pos == std::string_view::npos) return std::nullopt;
auto v = sv.substr(1, pos-1);
sv = sv.substr(pos+1);
return v;
}
int parse_number(std::string_view& sv){
assert(sv[0] == '-' || isdigit(sv[0]));
int number;
auto [ptr, ec] = std::from_chars(sv.data(), sv.data()+sv.size(), number);
if(ec != std::errc{}) parse_error();
sv = sv.substr(ptr - sv.data());
return number;
}
struct parser{
std::string_view sv;
bool null(){ // std::optional<void> <--> bool
return advance_lit_name(sv, "null");
}
std::optional<bool> boolean(){
/**/ if(advance_lit_name(sv, "true" )) return true;
else if(advance_lit_name(sv, "false")) return false;
return std::nullopt;
}
std::optional<std::string_view> string(){
if(sv.starts_with('"')) return parse_string(sv);
return std::nullopt;
}
std::optional<int> number(){
if(not sv.empty() and (sv[0] == '-' || isdigit(sv[0]))) return parse_number(sv);
return std::nullopt;
}
bool array(){ // skip [
return advance_struct_char(sv, '[');
}
bool element(){ // skip, and ]
if(advance_struct_char(sv, ']')) return false;
advance_struct_char(sv, ','); // FIXME: sees wrong, could be an error
return true;
}
bool object(){ // skip {
return advance_struct_char(sv, '{');
}
std::optional<std::string_view> key(){ // , } : and key
advance_struct_char(sv, ',');
if(sv.starts_with('"')) {
auto res = parse_string(sv);
advance_struct_char(sv, ':');
return res;
}
return std::nullopt;
}
};
The API is much bigger, which makes it more complex to use compared to a dom parser. On the other hand, the values are not evaluated out of context for the caller, thus it makes traversing the JSON document easier than using the SAX parser.
void check(bool b){if(not b) throw std::runtime_error("Error while parsing");}
int main(){
std::string name;
int age = -1;
std::vector<std::string> documents;
auto str = R"( {"name":"John", "age": 30, "documents":["1","2","3"] })";
std::string_view sv = str;
auto p = parser(sv);
check(p.object());
while(auto key = p.key()) {
if(*key == "name"){
auto value = p.string();
check(value);
name = *value;
}else if(*key == "age"){
auto value = p.number();
check(value);
age = *value;
}else if(*key == "documents"){
check(p.array());
while(p.element()){
auto value = p.string();
check(value);
documents.emplace_back(*value);
}
}
}
check(name == "John");
check(age == 30);
check(documents.size() == 3);
check(documents[0] == "1");
}
The suggested parser is too broad
If you try the proposed parser
class out, you’ll note a couple of deficiencies. The pull parser ignores some errors; for example objects and arrays that are not closed, correctly, missing commas, and so on.
Thus instead of having array
, one needs an array_start
and array_finish
(in a previous revision i named the array_begin
and array_end
, but then it might look like they are used for iterating over elements…)
With an additional member function, it is possible to verify if an array or object have been parsed successfully.
Plus, there needs to be a leftovers
member function for verifying if there are any unparsed elements, even if everything has been parsed successfully.
Last but not least, for detecting some missing or additional commas an additional state variable, expect_comma
, has been added. This is only used when parsing keys and element in an array, so I am not particularly happy with this variable that is needed only in a small subset of the API, but it gets the job done.
Last but not least, I’ve removed parse_error
completely. The previous implementation was inconsistent as it mostly did not throw any exceptions.
The final implementation should looks like the following:
#include <string_view>
#include <stdexcept>
#include <cassert>
#include <charconv>
#include <optional>
bool advance_lit_name(std::string_view& sv, std::string_view v){
if(sv.starts_with(v)) {
sv = sv.substr(v.size());
return true;
}
return false;
}
bool advance_struct_char(std::string_view& sv, char c){
auto pos_leading_space = sv.find_first_not_of(" \t\r\n");
pos_leading_space = pos_leading_space==std::string_view::npos ? 0 : pos_leading_space;
if(sv.empty() or sv[pos_leading_space] != c) {
return false;
}
sv = sv.substr(pos_leading_space+1);
pos_leading_space = sv.find_first_not_of(" \t\r\n");
if(pos_leading_space!=std::string_view::npos) sv = sv.substr(pos_leading_space);
return true;
}
std::optional<std::string_view> parse_string(std::string_view& sv){
assert(sv[0] == '"');
auto pos = sv.find('"', 1); // NOTE: real json might contain escaped double quotes
if(pos == std::string_view::npos) return std::nullopt;
auto v = sv.substr(1, pos-1);
sv = sv.substr(pos+1);
return v;
}
std::optional<int> parse_number(std::string_view& sv){
assert(sv[0] == '-' || isdigit(sv[0]));
int number;
auto [ptr, ec] = std::from_chars(sv.data(), sv.data()+sv.size(), number);
if(ec != std::errc{}) return std::nullopt;
sv = sv.substr(ptr - sv.data());
return number;
}
struct parser{
std::string_view sv;
bool expect_comma = false;
bool null(){ // std::optional<void>
return advance_lit_name(sv, "null");
}
std::optional<bool> boolean(){
/**/ if(advance_lit_name(sv, "true" )) return true;
else if(advance_lit_name(sv, "false")) return false;
return std::nullopt;
}
std::optional<std::string_view> string(){
if(sv.starts_with('"')) return parse_string(sv);
return std::nullopt;
}
std::optional<int> number(){
if (not sv.empty() and (sv[0] == '-' || isdigit(sv[0]))) return parse_number(sv);
return std::nullopt;
}
bool array_start(){ // skip [
auto tmp = sv;
bool b = advance_struct_char(tmp, '[');
if(advance_struct_char(tmp, ',')) return false; // broken array, cannot begin with ,
sv = tmp;
expect_comma = false;
return b;
}
bool element(){ // skip , but there should be no , for the first element
auto tmp = sv;
if(advance_struct_char(tmp, ']')) return false;
if(expect_comma and not advance_struct_char(sv, ',')){
return false;
}
expect_comma = true;
return true;
}
bool array_finish(){ // skip ]
return advance_struct_char(sv, ']');
}
bool object_start(){ // skip {
auto tmp = sv;
bool b = advance_struct_char(tmp, '{');
if(advance_struct_char(tmp, ',')) return false; // broken object, cannot begin with ,
expect_comma = false;
sv = tmp;
return b;
}
std::optional<std::string_view> key(){ // , } : and key
auto tmp = sv;
if(advance_struct_char(tmp, '}')) return std::nullopt;
if(expect_comma and not advance_struct_char(tmp, ',')){
return std::nullopt;
}
if(not tmp.starts_with('"')) {
return std::nullopt;
}
auto res = parse_string(tmp);
if(not advance_struct_char(tmp, ':')) return std::nullopt;
// if followed by :, verify it does not end with }, otherwise not valid key-value pair
sv = tmp;
expect_comma = true;
return res;
}
bool object_finish(){ // skip {
return advance_struct_char(sv, '}');
}
bool leftovers(){
return sv.size() >1;
}
};
The usage is mostly unchanged:
void check(bool b){if(not b) throw std::runtime_error("Error while parsing");}
int main(){
std::string name;
int age = -1;
std::vector<std::string> documents;
auto str = R"( {"name":"John", "age": 30, "documents":["1","2","3"] })";
std::string_view sv = str;
auto p = parser(sv);
check(p.object_start());
while(auto key = p.key()) {
if(*key == "name"){
auto value = p.string();
check(value);
name = *value;
}else if(*key == "age"){
auto value = p.number();
check(value);
age = *value;
}else if(*key == "documents"){
check(p.array_start());
while(p.element()){
auto value = p.string();
check(value);
documents.emplace_back(*value);
}
check(p.array_finish());
}
}
check(p.object_finish());
check(not p.leftovers());
check(name == "John");
check(age == 30);
check(documents.size() == 3);
check(documents[0] == "1");
}
A SAX parser on top of a pull parser
Just as one can implement a DOM parser through a SAX parser, one can also implement a SAX parser with a pull parser:
#include <string_view>
#include <stdexcept>
struct json_callbacks{
virtual void on_null(){}
virtual void on_bool(bool){}
virtual void on_string(std::string_view){}
virtual void on_number(int){}
virtual void on_key(std::string_view){}
virtual void on_array_begin(){}
virtual void on_array_end(){}
virtual void on_object_begin(){}
virtual void on_object_end(){}
[[noreturn]] virtual void on_error(){throw std::runtime_error("Error while parsing");}
};
void parse_value(parser& p, json_callbacks& cb);
void parse_array(parser& p, json_callbacks& cb){
cb.on_array_begin();
while(p.element()){
parse_value(p, cb);
}
if(not p.array_finish()) cb.on_error();
cb.on_array_end();
}
void parse_object(parser& p, json_callbacks& cb) {
cb.on_object_begin();
while(auto key = p.key()) {
cb.on_key(*key);
parse_value(p, cb);
}
if(not p.object_finish()) cb.on_error();
cb.on_object_end();
}
void parse_value(parser& p, json_callbacks& cb){
/**/ if(/* */ p.null() ) cb.on_null();
else if(auto value = p.boolean()) cb.on_bool(*value);
else if(auto value = p.string() ) cb.on_string(*value);
else if(auto value = p.number() ) cb.on_number(*value);
else if(p.array_start()) parse_array(p, cb);
else if(p.object_start()) parse_object(p, cb);
else cb.on_error();
}
void parse_json(std::string_view sv, json_callbacks& cb) {
auto p = parser(sv);
parse_value(p, cb);
}
The implementation is comparable to the one presented in those notes; instead of operating on a string_view
takes a parser&
.
As long as parser
is implemented decently, this eliminates from the SAX parser all out-of-bounds errors.
A DOM parser on top of a pull parser
While the DOM parser on top of a SAX parser had some additional overhead, it is not the case when writing a DOM parser on top of a pull parser:
#include <map>
#include <string>
#include <string_view>
#include <stdexcept>
#include <cassert>
#include <charconv>
#include <variant>
#include <vector>
struct value_t{
std::variant<bool, int, std::string, std::nullptr_t, std::vector<value_t>, std::map<std::string, value_t>> data;
};
void parse_error(){throw std::runtime_error("Error while parsing");}
value_t parse_value(parser& p);
std::vector<value_t> parse_array(parser& p){
std::vector<value_t> arr;
while(p.element()){
arr.push_back(parse_value(p));
}
if(not p.array_finish()) parse_error();
return arr;
}
std::map<std::string, value_t> parse_object(parser& p);
value_t parse_value(parser& p) {
/**/ if(/* */ p.null() ) return {nullptr};
else if(auto value = p.boolean(); value) return {*value};
else if(auto value = p.string(); value) return {std::string(*value)};
else if(auto value = p.number(); value) return {*value};
else if(/* */ p.array_start()) return parse_array(p);
else if(/* */ p.object_start()) return parse_object(p);
else parse_error();
}
std::map<std::string, value_t> parse_object(parser& p) {
std::map<std::string, value_t> obj;
while(auto key = p.key(); key) {
auto value = parse_value(sv);
obj.emplace(std::move(*key), std::move(value));
}
if(not p.object_finish()) parse_error();
return obj;
}
std::map<std::string, value_t> parse_json(std::string_view sv) {
auto p = parser(sv);
return parse_value(p);
}
Similarly to the SAX parser on top of a pull parser, the implementation of the DOM parser is comparable to the implementation of the DOM parsing operating directly on a string_view
, and without the need of additional resources.
API is cumbersome compared to a DOM parser
There are some pain point with the pull parser API, for example
struct parser{
// ...
bool object_start();
std::optional<std::string_view> key();
};
std::string foo(parser& p){
if(auto key = p.key(); key) return *key;
return "default";
}
someone forgot to call p.object_start()
, which not only tells the caller if there is an object, but also changes the internal status of the parser for parsing a key.
A more type-safe API could split the parser
and look similar to the following
struct parser_object{
// ...
std::optional<std::string_view> key();
};
struct parser{
// ...
std::optional<parser_object> object_start();
};
std::string foo(parser& p){
if(auto p_o = p.object_start(); p_o) {
if(auto key = p_o.key(); key) return *key;
}
return "default";
}
Which means that for calling the member function key
, you must call object_start
, and keep a reference to it.
The disadvantage is that the API is broken in sub-objects, which might make it harder to get an overview how it works. Also parser_object
would keep a reference to the internal status of parser
, which might create lifetime issues, as both need to be are alive when parsing a key.
Conclusion
The pull parser provides lower-level building blocks compared to a DOM and event-base parser, and building those other two parser on top of a pull parser does not require additional resources.
Alone for this reason, it would make sense to always begin with a pull parser, and eventually build the others on top of it.
The function leftovers
can be improved by a lot. Instead of returning a bool
, it could return the last successful parsed position, or std::string_view::npos
if everything was parsed til the end.
With that information, the user of the parsing class can create more useful error messages when the parser is unable to parse the next element, by printing for example the offending position, or even a subset of the string.
The pull parser is also easier to use on malformed data. It poses little restriction on the content, as the user is responsible to react accordingly, thus it is easier to be used for parsing a malformed JSON structure, or even begin in the middle of a JSON document; for example 1,2,3,4] }, "a":"b", "c":"d"
. Apparently 1,2,3,4
are part of an array in a subobject. if the user knows it is in the middle of an array in a subobject, it just needs to call the appropriate function of the parser. The presented DOM parser cannot handle this situation, as it would error out when the array or object finishes, as there was no begin of the array and object. Also the presented event-based parser cannot handle this situation, as it would call on_error
that stops the parsing routine. I can see how it could be possible to make it less restrictive about the input, but it needs to be instructed that the first element is part of an array. I suppose that supporting such use-case might introduce a lot of special cases in the handling logic.
Since it is possible to use the push parser from a middle of a document, it is possible to start parsing from the beginning until the parser cannot parse other elements. At that point, the user of the parser can decide to "ignore" the malformed data, skip some characters, or something else, and the apply a new push parser from the middle of the document.
Another advantage of the pull parser, is that it is also trivial to pause and resume the parsing routine. Just do not call any member function. parser_object
is even trivial to copy, and can be stored anywhere for later use, as long as the data it references remains valid.
Granted, for the JSON format it might not be that interesting as the order of elements is not defined, thus it is hard to say if one needs to store or parse an element depending on other properties of the document.
If you have questions, comments, or found typos, the notes are not clear, or there are some errors; then just contact me.