Low-level XML parsers

Orcus provides three low-level SAX-style XML parsers that form a layered hierarchy. Each is a class template parameterized on a user-supplied handler type, and each fires event callbacks into that handler as it walks the document:

sax_parser is the foundation. It reports elements, attributes, text content, declarations and other low-level events without any namespace awareness.
sax_ns_parser builds on top of it, adding namespace resolution so that each element and attribute carries a resolved namespace identifier in addition to its prefix.
sax_token_parser builds on the namespace parser, further translating element and attribute names into integer tokens against a predefined vocabulary so that downstream code can dispatch on integers rather than strings.

In all three cases the handler does not need to be derived from any particular base class; the parser simply calls the expected member functions on whatever type it is given. The library does, however, provide base handler classes (sax_handler, sax_ns_handler and sax_token_handler) that supply empty implementations for every callback, so that by deriving from one of them you only need to implement the callbacks you actually care about.
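The relationship between a parser template and an optional base handler can be illustrated with a small standalone sketch. Every name in it (toy_handler_base, tag_counter, toy_parser, count_tags) is hypothetical and it uses only the standard library. It does not show how Orcus is implemented, only the general pattern: the template calls handler member functions by name, and a base class with empty bodies lets a derived handler define just the callbacks it needs.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical base handler: empty defaults for every callback.
struct toy_handler_base
{
    void start_element(std::string_view /*name*/) {}
    void end_element(std::string_view /*name*/) {}
    void characters(std::string_view /*text*/) {}
};

// A handler that only cares about start_element(); the rest is inherited.
struct tag_counter : toy_handler_base
{
    std::size_t count = 0;

    void start_element(std::string_view /*name*/) { ++count; }
};

// Toy "parser": no virtual dispatch, it simply calls the expected member
// functions on whatever handler type it is instantiated with.
template<typename HandlerT>
class toy_parser
{
    HandlerT& m_handler;

public:
    explicit toy_parser(HandlerT& handler) : m_handler(handler) {}

    // Pretend we walked "<a><b>text</b></a>" and fire the matching events.
    void parse()
    {
        m_handler.start_element("a");
        m_handler.start_element("b");
        m_handler.characters("text");
        m_handler.end_element("b");
        m_handler.end_element("a");
    }
};

// Convenience wrapper that runs the toy parser and returns the tag count.
inline std::size_t count_tags()
{
    tag_counter hdl;
    toy_parser<tag_counter> parser(hdl);
    parser.parse();
    return hdl.count;
}
```

Because the handler type is a template parameter, the calls are resolved at compile time and no virtual functions are involved in this sketch.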

Note

Several callbacks receive a transient flag. When it is set, the string value was decoded into a temporary buffer because it contained one or more encoded characters, and is only valid for the duration of the callback. In that case the handler must copy or intern the value before returning rather than holding on to the std::string_view.
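One way to honor the transient flag is sketched below, using only the standard library. The text_collector class and the collect_demo() driver are hypothetical: the driver stands in for a parser by calling characters() directly, once with a stable buffer and once with a temporary one that is overwritten and destroyed before the results are read.

```cpp
#include <deque>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical handler fragment: keeps every text segment it is given.
class text_collector
{
    // std::deque does not relocate existing elements on emplace_back, so
    // views into the stored strings stay valid as the pool grows.
    std::deque<std::string> m_pool;
    std::vector<std::string_view> m_segments;

public:
    // Same shape as the characters() callback described above.
    void characters(std::string_view val, bool transient)
    {
        if (transient)
        {
            // The view points into a temporary buffer: copy it first and
            // keep a view of the copy instead.
            m_pool.emplace_back(val);
            val = m_pool.back();
        }

        m_segments.push_back(val);
    }

    const std::vector<std::string_view>& segments() const { return m_segments; }
};

// Stand-in for a parser: delivers one stable and one transient value, then
// invalidates the temporary buffer before the collected segments are read.
inline std::vector<std::string> collect_demo()
{
    static const std::string stable = "plain";

    text_collector hdl;
    hdl.characters(stable, false);

    {
        std::string decoded = "a < b"; // e.g. decoded from "a &lt; b"
        hdl.characters(decoded, true);
        decoded.assign("overwritten"); // the original buffer is reused
    }

    return std::vector<std::string>(hdl.segments().begin(), hdl.segments().end());
}
```

Copying only when the flag is set avoids an allocation for the values that already point into the stable source buffer.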

Basic parsing with sax_parser

sax_parser is the most basic of the three. It does not track namespaces and does not verify that opening and closing tags match; it simply reports each event as it is encountered.

Start by including the parser header:

#include <orcus/sax_parser.hpp>

#include <iostream>

Define a handler that derives from sax_handler and overrides the callbacks of interest. This one prints each element, attribute and text segment:

/**
 * the handler only needs to define the callbacks it cares about; inheriting
 * from orcus::sax_handler supplies empty defaults for the rest
 */
class sax_parser_handler : public orcus::sax_handler
{
public:
    void start_element(const orcus::sax::parser_element& elem)
    {
        std::cout << "start element: " << elem.name << std::endl;
    }

    void end_element(const orcus::sax::parser_element& elem)
    {
        std::cout << "end element: " << elem.name << std::endl;
    }

    void attribute(const orcus::sax::parser_attribute& attr)
    {
        std::cout << "  attribute: " << attr.name << "='" << attr.value << "'" << std::endl;
    }

    void characters(std::string_view val, bool /*transient*/)
    {
        // skip whitespace-only segments between elements
        if (val.find_first_not_of(" \t\r\n") == std::string_view::npos)
            return;

        std::cout << "  characters: " << val << std::endl;
    }
};

The characters() callback skips whitespace-only segments here, since the indentation between elements is itself reported as text content. Refer to the sax_handler class definition for the full set of available callbacks.

Next, prepare the XML content to parse:

std::string_view content =
    "<?xml version=\"1.0\"?>"
    "<catalog>"
    "<book id=\"b1\">Go</book>"
    "<book id=\"b2\">C++</book>"
    "</catalog>";

Finally, construct the parser with the content and the handler, and parse:

// instantiate the parser with the content and our handler
sax_parser_handler hdl;
orcus::sax_parser<sax_parser_handler> parser(content, hdl);

// parse the content
parser.parse();

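Note that the attributes of an element are reported through the attribute() callback before the element’s start_element() callback fires. The attributes of the XML declaration go through the same callback, which is why the version attribute appears first. Executing this code generates the following output: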

  attribute: version='1.0'
start element: catalog
  attribute: id='b1'
start element: book
  characters: Go
end element: book
  attribute: id='b2'
start element: book
  characters: C++
end element: book
end element: catalog
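Since sax_parser does not verify that opening and closing tags match, a handler that needs that guarantee could track the open elements itself. The following is a minimal sketch of such a check, assuming the handler forwards each element name from start_element() and end_element() to it. The tag_balance_checker class is hypothetical and is not part of Orcus.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper a handler could own to verify tag nesting itself.
class tag_balance_checker
{
    std::vector<std::string> m_open; // names of the currently open elements
    bool m_ok = true;

public:
    // Call from the handler's start_element() with the element name.
    void push(std::string_view name)
    {
        m_open.emplace_back(name);
    }

    // Call from the handler's end_element() with the element name.
    void pop(std::string_view name)
    {
        if (m_open.empty() || m_open.back() != name)
        {
            // the closing tag does not match the innermost open tag
            m_ok = false;
            return;
        }

        m_open.pop_back();
    }

    // True when every closing tag matched and nothing is left open.
    bool balanced() const { return m_ok && m_open.empty(); }
};
```

The names are copied into std::string so that the check does not depend on how long the parser keeps the original views alive.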

Namespace-aware parsing with sax_ns_parser

sax_ns_parser adds namespace handling on top of the basic parser. It uses an xmlns_context to resolve namespace prefixes into stable xmlns_id_t identifiers, and tracks element scopes so that non-matching closing tags are detected.

Include the parser and namespace headers:

#include <orcus/sax_ns_parser.hpp>
#include <orcus/xml_namespace.hpp>

#include <iostream>

// the remaining snippets in this section refer to orcus names unqualified
using namespace orcus;

Define the handler. The element and attribute structs passed to it carry both the source prefix and the resolved namespace identifier, and namespace declarations are reported through the namespace_declaration() callback:

/**
 * like sax_handler, sax_ns_handler provides empty defaults so we override only
 * the callbacks of interest; the element and attribute structs carry a resolved
 * namespace identifier in addition to the alias used in the source
 */
class sax_ns_parser_handler : public sax_ns_handler
{
public:
    void start_element(const sax_ns_parser_element& elem)
    {
        std::cout << "start element: " << elem.name;
        if (elem.ns)
            // a non-null identifier points to the resolved namespace URI
            std::cout << " (ns: " << elem.ns << ")";
        std::cout << std::endl;
    }

    void end_element(const sax_ns_parser_element& elem)
    {
        std::cout << "end element: " << elem.name << std::endl;
    }

    // keep the base overload used for declaration/PI attributes visible
    using sax_ns_handler::attribute;

    void attribute(const sax_ns_parser_attribute& attr)
    {
        std::cout << "  attribute: " << attr.name << "='" << attr.value << "'";
        if (attr.ns)
            std::cout << " (ns: " << attr.ns << ")";
        std::cout << std::endl;
    }

    void namespace_declaration(std::string_view alias, xmlns_id_t ns_id)
    {
        std::cout << "namespace declaration: alias='" << alias << "' uri='" << ns_id << "'" << std::endl;
    }
};

Because the derived handler declares its own attribute() overload, the using declaration keeps the base overload that handles declaration and processing-instruction attributes visible.
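The underlying C++ rule is that declaring a member function in a derived class hides every base-class overload of the same name, unless a using declaration brings them back into scope. The sketch below reproduces the situation with hypothetical stand-in types (toy_attr, toy_base_handler, toy_derived_handler) rather than the Orcus classes.

```cpp
#include <string_view>

// Hypothetical stand-in for a structured attribute type.
struct toy_attr
{
    std::string_view name;
    std::string_view value;
};

// Hypothetical base handler with two attribute() overloads.
struct toy_base_handler
{
    int plain_calls = 0;

    // overload taking a bare name/value pair
    void attribute(std::string_view /*name*/, std::string_view /*value*/) { ++plain_calls; }

    // overload taking the structured attribute
    void attribute(const toy_attr& /*attr*/) {}
};

struct toy_derived_handler : toy_base_handler
{
    int struct_calls = 0;

    // Without this line, the attribute() declared below would hide both base
    // overloads, and the two-argument call would fail to compile.
    using toy_base_handler::attribute;

    // Replaces the base overload that has the same signature.
    void attribute(const toy_attr& /*attr*/) { ++struct_calls; }
};

// Calls both overloads through the derived type and reports where they landed.
inline bool both_overloads_reachable()
{
    toy_derived_handler hdl;
    hdl.attribute("version", "1.0");     // resolves to the base overload
    hdl.attribute(toy_attr{"id", "b1"}); // resolves to the derived overload
    return hdl.plain_calls == 1 && hdl.struct_calls == 1;
}
```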

Prepare some namespaced XML content:

std::string_view content =
    "<?xml version=\"1.0\"?>"
    "<list xmlns=\"http://example.com/default\" xmlns:x=\"http://example.com/extra\">"
    "<item x:rank=\"1\"/>"
    "<item x:rank=\"2\"/>"
    "</list>";

Create an xmlns_repository and a fresh xmlns_context from it. A new context should be created per stream:

// a context tracks the prefix-to-URI bindings as parsing descends through
// element scopes; create a fresh one from the repository per stream
xmlns_repository repo;
xmlns_context cxt = repo.create_context();

Then construct the parser, passing the context in addition to the content and handler, and parse:

sax_ns_parser_handler hdl;
sax_ns_parser<sax_ns_parser_handler> parser(content, cxt, hdl);
parser.parse();

For more on namespace management see the xmlns_repository and xmlns_context class definitions. Executing this code generates the following output:

namespace declaration: alias='' uri='http://example.com/default'
namespace declaration: alias='x' uri='http://example.com/extra'
start element: list (ns: http://example.com/default)
  attribute: rank='1' (ns: http://example.com/extra)
start element: item (ns: http://example.com/default)
end element: item
  attribute: rank='2' (ns: http://example.com/extra)
start element: item (ns: http://example.com/default)
end element: item
end element: list
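To illustrate in principle how prefix bindings can follow element scopes, here is a small standalone sketch. The toy_ns_scopes class is hypothetical and makes no claim about how xmlns_context works internally. It only shows the idea that a binding declared on an element stays in effect for its descendants, may be shadowed by an inner declaration, and disappears when the element closes.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical scope tracker for prefix-to-URI bindings.
class toy_ns_scopes
{
    std::vector<std::pair<std::string, std::string>> m_bindings; // prefix, URI
    std::vector<std::size_t> m_marks; // binding count when each element opened

public:
    // Call when an element opens, before registering its xmlns attributes.
    void open_element() { m_marks.push_back(m_bindings.size()); }

    // Register xmlns:prefix="uri"; an empty prefix is the default namespace.
    void declare(std::string_view prefix, std::string_view uri)
    {
        m_bindings.emplace_back(std::string(prefix), std::string(uri));
    }

    // Call when an element closes: drop the bindings it introduced.
    void close_element()
    {
        if (m_marks.empty())
            return;

        m_bindings.resize(m_marks.back());
        m_marks.pop_back();
    }

    // The innermost binding wins; an empty view means the prefix is unbound.
    // The returned view is valid only until the bindings change.
    std::string_view resolve(std::string_view prefix) const
    {
        for (auto it = m_bindings.rbegin(); it != m_bindings.rend(); ++it)
        {
            if (it->first == prefix)
                return it->second;
        }

        return {};
    }
};
```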

Tokenized parsing with sax_token_parser

sax_token_parser further translates element and attribute names into integer tokens while parsing. The caller supplies a predefined set of names via a tokens instance; any name found in that vocabulary is reported as its integer token, while names not in the vocabulary are reported as XML_UNKNOWN_TOKEN.

Include the parser, token store and namespace headers:

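#include <orcus/sax_token_parser.hpp>
#include <orcus/tokens.hpp>
#include <orcus/xml_namespace.hpp>

#include <iostream>
#include <iterator> // std::size

// the remaining snippets in this section refer to orcus names unqualified
using namespace orcus;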

Define the token vocabulary. Each name maps to a token by its position in the array, with index 0 reserved for XML_UNKNOWN_TOKEN:

/**
 * the token vocabulary maps names to integer tokens by position; index 0 is
 * reserved for XML_UNKNOWN_TOKEN, so the first real name starts at index 1
 */
const char* token_names[] = {
    "??",       // 0 - reserved for unknown names
    "catalog",  // 1
    "book",     // 2
    "id",       // 3
};

const xml_token_t XML_catalog = 1;
const xml_token_t XML_book = 2;
const xml_token_t XML_id = 3;
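The position-based mapping can be pictured with a short standalone sketch. The toy_tokens class, toy_token_t, TOY_UNKNOWN_TOKEN and toy_token_names are all hypothetical names that use only the standard library. This is not the orcus::tokens class, just an illustration of a lookup in which each name resolves to its array index and anything else falls back to the reserved value 0.

```cpp
#include <cstddef>
#include <string_view>
#include <unordered_map>

using toy_token_t = std::size_t;
constexpr toy_token_t TOY_UNKNOWN_TOKEN = 0;

// Hypothetical token store: maps each name to its position in the array.
class toy_tokens
{
    std::unordered_map<std::string_view, toy_token_t> m_lookup;

public:
    // The names must outlive this object, since only views of them are stored.
    toy_tokens(const char* const* names, std::size_t count)
    {
        // start at 1: slot 0 is the placeholder for unknown names
        for (std::size_t i = 1; i < count; ++i)
            m_lookup.emplace(names[i], i);
    }

    // Returns the index of a known name, or TOY_UNKNOWN_TOKEN otherwise.
    toy_token_t get_token(std::string_view name) const
    {
        auto it = m_lookup.find(name);
        return it == m_lookup.end() ? TOY_UNKNOWN_TOKEN : it->second;
    }
};

// Same layout as the vocabulary above: index 0 is the unknown placeholder.
inline const char* const toy_token_names[] = {"??", "catalog", "book", "id"};
```

With this layout, looking up "book" yields 2, while a name such as "magazine" that is absent from the array yields 0.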

Define the handler. Because known names arrive as integer tokens, it can dispatch on them, for example with a switch statement, instead of comparing strings. The xml_token_element_t passed to the handler carries the element’s token, its raw name, and the list of tokenized attributes:

class sax_token_parser_handler : public sax_token_handler
{
public:
    /**
     * names known to the vocabulary arrive as integer tokens, which can be
     * dispatched on directly instead of comparing strings
     */
    void start_element(const xml_token_element_t& elem)
    {
        std::cout << "start element: " << elem.raw_name << " (token=" << elem.name << ")";

        switch (elem.name)
        {
            case XML_catalog:
                std::cout << " -> recognized as catalog";
                break;
            case XML_book:
                std::cout << " -> recognized as book";
                break;
            default:
                std::cout << " -> unknown";
        }
        std::cout << std::endl;

        for (const xml_token_attr_t& attr : elem.attrs)
        {
            std::cout << "  attribute: " << attr.raw_name << " (token=" << attr.name
                << ") = '" << attr.value << "'";

            if (attr.name == XML_id)
                std::cout << " -> recognized as id";

            std::cout << std::endl;
        }
    }
};

Prepare the XML content. The <magazine> element is intentionally absent from the vocabulary so that its token resolves to XML_UNKNOWN_TOKEN:

std::string_view content =
    "<?xml version=\"1.0\"?>"
    "<catalog>"
    "<book id=\"b1\"/>"
    "<book id=\"b2\"/>"
    "<magazine id=\"m1\"/>"
    "</catalog>";

Construct the tokens store and an xmlns_context. The token store is not copyable and is typically created once as a global constant:

// the token store is typically a global constant shared across parses
tokens token_map(token_names, std::size(token_names));

xmlns_repository repo;
xmlns_context cxt = repo.create_context();

Finally, construct the parser with the content, token store, context and handler, and parse:

sax_token_parser_handler hdl;
sax_token_parser<sax_token_parser_handler> parser(content, token_map, cxt, hdl);
parser.parse();

Executing this code generates the following output:

start element: catalog (token=1) -> recognized as catalog
start element: book (token=2) -> recognized as book
  attribute: id (token=3) = 'b1' -> recognized as id
start element: book (token=2) -> recognized as book
  attribute: id (token=3) = 'b2' -> recognized as id
start element: magazine (token=0) -> unknown
  attribute: id (token=3) = 'm1' -> recognized as id