Low-level XML parsers
Orcus provides three low-level SAX-style XML parsers that form a layered hierarchy. Each is a class template parameterized on a user-supplied handler type, and each fires event callbacks into that handler as it walks the document:
- sax_parser is the foundation. It reports elements, attributes, text content, declarations and other low-level events without any namespace awareness.
- sax_ns_parser builds on top of it, adding namespace resolution so that each element and attribute carries a resolved namespace identifier in addition to its prefix.
- sax_token_parser builds on the namespace parser, further translating element and attribute names into integer tokens against a predefined vocabulary so that downstream code can dispatch on integers rather than strings.
In all three cases the handler does not need to be derived from any particular
base class; the parser simply calls the expected member functions on whatever
type it is given. The library does, however, provide base handler classes
(sax_handler, sax_ns_handler and
sax_token_handler) that supply empty implementations for
every callback, so that by deriving from one of them you only need to implement
the callbacks you actually care about.
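The pattern described above can be illustrated without the library itself. The following is a minimal sketch, not the orcus implementation: mini_parser, default_handler, counting_handler and standalone_handler are all hypothetical names, and the "parser" merely walks a pre-split list of element names. It shows that a handler works either by deriving from a base class with empty defaults or by providing the expected member functions on its own.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for a SAX-style parser template (NOT the orcus code).
// It walks a pre-split list of names and calls into whatever handler it gets.
template<typename Handler>
class mini_parser
{
    Handler& m_handler;

public:
    explicit mini_parser(Handler& hdl) : m_handler(hdl) {}

    void parse(const std::vector<std::string_view>& names)
    {
        for (std::string_view name : names)
        {
            m_handler.start_element(name);
            m_handler.end_element(name);
        }
    }
};

// Base class with empty defaults, playing the role a base handler class plays.
struct default_handler
{
    void start_element(std::string_view) {}
    void end_element(std::string_view) {}
};

// Derives from the base and redefines only the callback it cares about;
// end_element() is inherited as a no-op.
struct counting_handler : default_handler
{
    int started = 0;

    void start_element(std::string_view) { ++started; }
};

// No base class at all: accepted because both member functions exist.
struct standalone_handler
{
    std::string log;

    void start_element(std::string_view name) { log += "<" + std::string(name) + ">"; }
    void end_element(std::string_view name) { log += "</" + std::string(name) + ">"; }
};
```

Because the handler is a template parameter, the calls are resolved at compile time; a handler that lacks one of the expected member functions fails to compile rather than failing at run time.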
Note
Several callbacks receive a transient flag. When it is set, the string
value was decoded into a temporary buffer because it contained one or more
encoded characters, and is only valid for the duration of the callback. In
that case the handler must copy or intern the value before returning rather
than holding on to the std::string_view.
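One way to honor the transient flag is sketched below. The class text_collector is a hypothetical handler fragment; only the shape of characters() follows the callback described in this note. The sketch assumes that non-transient values point into the original content buffer, and that this buffer outlives the collector.

```cpp
#include <deque>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical handler fragment that retains text values beyond the callback.
class text_collector
{
    std::deque<std::string> m_pool;         // owns copies of transient values
    std::vector<std::string_view> m_values; // views that remain valid after the callback

public:
    void characters(std::string_view val, bool transient)
    {
        if (transient)
        {
            // The value lives in a temporary buffer, so take an owned copy.
            // std::deque does not relocate existing elements when appending
            // at the end, so views into earlier copies stay valid.
            m_pool.emplace_back(val);
            val = m_pool.back();
        }

        m_values.push_back(val);
    }

    const std::vector<std::string_view>& values() const { return m_values; }
};
```

A std::deque is used for the pool rather than a std::vector because growing a vector may move its strings, which would invalidate views into short strings stored inline.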
Basic parsing with sax_parser
sax_parser is the most basic of the three. It does not
track namespaces and does not verify that opening and closing tags match; it
simply reports each event as it is encountered.
Start by including the parser header:
#include <orcus/sax_parser.hpp>
#include <iostream>
Define a handler that derives from sax_handler and
overrides the callbacks of interest. This one prints each element, attribute
and text segment:
/**
 * The handler only needs to define the callbacks it cares about; inheriting
 * from orcus::sax_handler supplies empty defaults for the rest.
 */
class sax_parser_handler : public orcus::sax_handler
{
public:
    void start_element(const orcus::sax::parser_element& elem)
    {
        std::cout << "start element: " << elem.name << std::endl;
    }

    void end_element(const orcus::sax::parser_element& elem)
    {
        std::cout << "end element: " << elem.name << std::endl;
    }

    void attribute(const orcus::sax::parser_attribute& attr)
    {
        std::cout << " attribute: " << attr.name << "='" << attr.value << "'" << std::endl;
    }

    void characters(std::string_view val, bool /*transient*/)
    {
        // skip whitespace-only segments between elements
        if (val.find_first_not_of(" \t\r\n") == std::string_view::npos)
            return;

        std::cout << " characters: " << val << std::endl;
    }
};
The characters() callback skips whitespace-only
segments here, since the indentation between elements is itself reported as text
content. Refer to the sax_handler class definition for the
full set of available callbacks.
Next, prepare the XML content to parse:
std::string_view content =
"<?xml version=\"1.0\"?>"
"<catalog>"
"<book id=\"b1\">Go</book>"
"<book id=\"b2\">C++</book>"
"</catalog>";
Finally, construct the parser with the content and the handler, and parse:
// instantiate the parser with the content and our own handler
sax_parser_handler hdl;
orcus::sax_parser<sax_parser_handler> parser(content, hdl);
// parse the content
parser.parse();
Note that the attributes of an element are reported through the
attribute() callback before the element’s
start_element() callback fires. Executing this
code generates the following output:
 attribute: version='1.0'
start element: catalog
 attribute: id='b1'
start element: book
 characters: Go
end element: book
 attribute: id='b2'
start element: book
 characters: C++
end element: book
end element: catalog
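Since sax_parser does not verify that opening and closing tags match, a handler that needs this guarantee has to track it itself. The following is a minimal sketch of such a check using a stack of open element names. tag_balance_checker is a hypothetical helper, not part of orcus, and its callbacks take plain names rather than the parser's element structs so that the sketch stays self-contained.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper that checks whether start/end events are properly nested.
class tag_balance_checker
{
    std::vector<std::string> m_open; // names of the currently open elements
    bool m_ok = true;

public:
    void start_element(std::string_view name)
    {
        m_open.emplace_back(name);
    }

    void end_element(std::string_view name)
    {
        // A closing tag must match the innermost open element.
        if (m_open.empty() || m_open.back() != name)
        {
            m_ok = false;
            return;
        }

        m_open.pop_back();
    }

    // True when no mismatch was seen and every element has been closed.
    bool balanced() const { return m_ok && m_open.empty(); }
};
```

In a real handler, the start_element() and end_element() callbacks would forward elem.name to a checker like this one.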
Namespace-aware parsing with sax_ns_parser
sax_ns_parser adds namespace handling on top of the basic
parser. It uses an xmlns_context to resolve namespace
prefixes into stable xmlns_id_t identifiers, and tracks
element scopes so that non-matching closing tags are detected.
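The idea of resolving prefixes against nested scopes can be sketched independently of the library. The class below is a hypothetical illustration only; it is not how xmlns_context is implemented, and it returns URI strings rather than xmlns_id_t identifiers. Each prefix owns a stack of bindings, so a declaration in an inner element shadows the outer one until that element closes.

```cpp
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical prefix-to-URI scope tracker (illustration only).
class prefix_scopes
{
    // One stack of URIs per prefix; the innermost binding sits at the back.
    std::unordered_map<std::string, std::vector<std::string>> m_bindings;

public:
    // Called when an element declares xmlns:prefix="uri" (empty prefix = default).
    void push(std::string_view prefix, std::string_view uri)
    {
        m_bindings[std::string(prefix)].emplace_back(uri);
    }

    // Called when the declaring element goes out of scope.
    void pop(std::string_view prefix)
    {
        auto it = m_bindings.find(std::string(prefix));
        if (it != m_bindings.end() && !it->second.empty())
            it->second.pop_back();
    }

    // Returns the innermost URI bound to the prefix, or an empty view if the
    // prefix is unbound. The view is valid until that prefix's bindings change.
    std::string_view resolve(std::string_view prefix) const
    {
        auto it = m_bindings.find(std::string(prefix));
        if (it == m_bindings.end() || it->second.empty())
            return {};

        return it->second.back();
    }
};
```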
Include the parser and namespace headers:
#include <orcus/sax_ns_parser.hpp>
#include <orcus/xml_namespace.hpp>
#include <iostream>

// the snippets in this section refer to orcus names without qualification
using namespace orcus;
Define the handler. The element and attribute structs passed to it carry both
the source prefix and the resolved namespace identifier, and namespace
declarations are reported through the
namespace_declaration() callback:
/**
 * Like sax_handler, sax_ns_handler provides empty defaults so we override only
 * the callbacks of interest; the element and attribute structs carry a resolved
 * namespace identifier in addition to the alias used in the source.
 */
class sax_ns_parser_handler : public sax_ns_handler
{
public:
    void start_element(const sax_ns_parser_element& elem)
    {
        std::cout << "start element: " << elem.name;

        // a non-null identifier points to the resolved namespace URI
        if (elem.ns)
            std::cout << " (ns: " << elem.ns << ")";

        std::cout << std::endl;
    }

    void end_element(const sax_ns_parser_element& elem)
    {
        std::cout << "end element: " << elem.name << std::endl;
    }

    // keep the base overload used for declaration/PI attributes visible
    using sax_ns_handler::attribute;

    void attribute(const sax_ns_parser_attribute& attr)
    {
        std::cout << " attribute: " << attr.name << "='" << attr.value << "'";

        if (attr.ns)
            std::cout << " (ns: " << attr.ns << ")";

        std::cout << std::endl;
    }

    void namespace_declaration(std::string_view alias, xmlns_id_t ns_id)
    {
        std::cout << "namespace declaration: alias='" << alias << "' uri='" << ns_id << "'" << std::endl;
    }
};
Because the derived handler declares its own
attribute() overload, the using declaration
keeps the base overload that handles declaration and processing-instruction
attributes visible.
Prepare some namespaced XML content:
std::string_view content =
"<?xml version=\"1.0\"?>"
"<list xmlns=\"http://example.com/default\" xmlns:x=\"http://example.com/extra\">"
"<item x:rank=\"1\"/>"
"<item x:rank=\"2\"/>"
"</list>";
Create an xmlns_repository and a fresh
xmlns_context from it. A new context should be created per
stream:
// a context tracks the prefix-to-URI bindings as parsing descends through
// element scopes; create a fresh one from the repository per stream
xmlns_repository repo;
xmlns_context cxt = repo.create_context();
Then construct the parser - passing the context in addition to the content and handler - and parse:
sax_ns_parser_handler hdl;
sax_ns_parser<sax_ns_parser_handler> parser(content, cxt, hdl);
parser.parse();
For more on namespace management see the
xmlns_repository and xmlns_context
class definitions. Executing this code generates the following output:
namespace declaration: alias='' uri='http://example.com/default'
namespace declaration: alias='x' uri='http://example.com/extra'
start element: list (ns: http://example.com/default)
 attribute: rank='1' (ns: http://example.com/extra)
start element: item (ns: http://example.com/default)
end element: item
 attribute: rank='2' (ns: http://example.com/extra)
start element: item (ns: http://example.com/default)
end element: item
end element: list
Tokenized parsing with sax_token_parser
sax_token_parser further translates element and attribute
names into integer tokens while parsing. The caller supplies a predefined set
of names via a tokens instance; any name found in that
vocabulary is reported as its integer token, while names not in the vocabulary
are reported as XML_UNKNOWN_TOKEN.
Include the parser, token store and namespace headers:
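#include <orcus/sax_token_parser.hpp>
#include <orcus/tokens.hpp>
#include <orcus/xml_namespace.hpp>
#include <iostream>
#include <iterator> // std::size

// the snippets in this section refer to orcus names without qualification
using namespace orcus;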
Define the token vocabulary. Each name maps to a token by its position in the
array, with index 0 reserved for XML_UNKNOWN_TOKEN:
/**
* the token vocabulary maps names to integer tokens by position; index 0 is
* reserved for XML_UNKNOWN_TOKEN, so the first real name starts at index 1
*/
const char* token_names[] = {
    "??",      // 0 - reserved for unknown names
    "catalog", // 1
    "book",    // 2
    "id",      // 3
};
const xml_token_t XML_catalog = 1;
const xml_token_t XML_book = 2;
const xml_token_t XML_id = 3;
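To make the position-based mapping concrete, the following sketch builds a name-to-token lookup from such an array. It is an illustration under stated assumptions, not the implementation of the tokens class: name_to_token, token_t and UNKNOWN_TOKEN are hypothetical stand-ins, and the sketch assumes the name array outlives the lookup object, as string literals do.

```cpp
#include <cstddef>
#include <string_view>
#include <unordered_map>

using token_t = std::size_t;         // hypothetical stand-in for xml_token_t
constexpr token_t UNKNOWN_TOKEN = 0; // mirrors the reserved index 0

// Hypothetical lookup that maps each name to its position in the array.
class name_to_token
{
    std::unordered_map<std::string_view, token_t> m_map;

public:
    name_to_token(const char* const* names, std::size_t count)
    {
        // Index 0 is only a placeholder for unknown names, so skip it.
        for (std::size_t i = 1; i < count; ++i)
            m_map.emplace(names[i], i);
    }

    // Returns the token for a known name, or UNKNOWN_TOKEN otherwise.
    token_t get(std::string_view name) const
    {
        auto it = m_map.find(name);
        return it == m_map.end() ? UNKNOWN_TOKEN : it->second;
    }
};
```

Keeping the token constants in step with the array order is the caller's responsibility; reordering the array without updating the constants silently changes what each token means.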
Define the handler. Because known names arrive as integer tokens, it can
dispatch on them - for example with a switch statement - instead of
comparing strings. The xml_token_element_t passed to the
handler carries the element’s token, its raw name, and the list of tokenized
attributes:
class sax_token_parser_handler : public sax_token_handler
{
public:
    /**
     * Names known to the vocabulary arrive as integer tokens, which can be
     * dispatched on directly instead of comparing strings.
     */
    void start_element(const xml_token_element_t& elem)
    {
        std::cout << "start element: " << elem.raw_name << " (token=" << elem.name << ")";

        switch (elem.name)
        {
            case XML_catalog:
                std::cout << " -> recognized as catalog";
                break;
            case XML_book:
                std::cout << " -> recognized as book";
                break;
            default:
                std::cout << " -> unknown";
        }

        std::cout << std::endl;

        for (const xml_token_attr_t& attr : elem.attrs)
        {
            std::cout << " attribute: " << attr.raw_name << " (token=" << attr.name
                << ") = '" << attr.value << "'";

            if (attr.name == XML_id)
                std::cout << " -> recognized as id";

            std::cout << std::endl;
        }
    }
};
Prepare the XML content. The <magazine> element is intentionally absent
from the vocabulary so that its token resolves to
XML_UNKNOWN_TOKEN:
std::string_view content =
"<?xml version=\"1.0\"?>"
"<catalog>"
"<book id=\"b1\"/>"
"<book id=\"b2\"/>"
"<magazine id=\"m1\"/>"
"</catalog>";
Construct the tokens store and an
xmlns_context. The token store is not copyable and is
typically created once as a global constant:
// the token store is typically a global constant shared across parses
tokens token_map(token_names, std::size(token_names));
xmlns_repository repo;
xmlns_context cxt = repo.create_context();
Finally, construct the parser with the content, token store, context and handler, and parse:
sax_token_parser_handler hdl;
sax_token_parser<sax_token_parser_handler> parser(content, token_map, cxt, hdl);
parser.parse();
Executing this code generates the following output:
start element: catalog (token=1) -> recognized as catalog
start element: book (token=2) -> recognized as book
 attribute: id (token=3) = 'b1' -> recognized as id
start element: book (token=2) -> recognized as book
 attribute: id (token=3) = 'b2' -> recognized as id
start element: magazine (token=0) -> unknown
 attribute: id (token=3) = 'm1' -> recognized as id