Mapping XML with automatic structure detection

This section extends the previous example to show how orcus_xml can detect the mapping structure of an XML document automatically, without any hand-written XPath rules.

Simple auto-detection

The setup is identical to the previous example. The only difference is that instead of registering namespace aliases and mapping rules manually, a single call to detect_map_definition() handles everything:

filter.detect_map_definition(input.str());
filter.read_stream(input.str());

detect_map_definition() scans the document, identifies every repeating element group, registers namespace aliases automatically, and populates the internal mapping rules. Each detected range is assigned to a new sheet named range-0, range-1, and so on. read_stream() then imports the data using those rules.

The output is:

rows: 7  cols: 5
+--------+----------------------+-----------+----------------+--------------------------------------------------------------------+
| ns0:id | ns0:timestamp        | ns0:level | ns0:service    | ns0:message                                                        |
+--------+----------------------+-----------+----------------+--------------------------------------------------------------------+
| 1 [v]  | 2026-03-23T08:02:11Z | INFO      | AuthService    | User alice@example.com authenticated successfully.                 |
+--------+----------------------+-----------+----------------+--------------------------------------------------------------------+
| 2 [v]  | 2026-03-23T08:14:37Z | WARN      | AuthService    | Failed login attempt for user bob@example.com. Attempt 3 of 5.     |
+--------+----------------------+-----------+----------------+--------------------------------------------------------------------+
| 3 [v]  | 2026-03-23T08:31:05Z | ERROR     | SessionManager | Cache connection timed out after 30s. Session store unreachable.   |
+--------+----------------------+-----------+----------------+--------------------------------------------------------------------+
| 4 [v]  | 2026-03-23T08:31:09Z | INFO      | SessionManager | Cache connection restored. Resuming normal operations.             |
+--------+----------------------+-----------+----------------+--------------------------------------------------------------------+
| 5 [v]  | 2026-03-23T09:45:22Z | ERROR     | ApiGateway     | Request to /api/orders returned 503. Upstream service unavailable. |
+--------+----------------------+-----------+----------------+--------------------------------------------------------------------+
| 6 [v]  | 2026-03-23T10:00:00Z | INFO      | Scheduler      | Daily report job completed. 1,402 records processed in 4.2s.       |
+--------+----------------------+-----------+----------------+--------------------------------------------------------------------+

The column headers carry ns0: prefixes because the auto-detection assigns short aliases to each namespace it encounters, and those aliases become part of the field names used as column headers. ns0 corresponds to the primary namespace http://example.com/server-logs in this document. The aliases are assigned in the order the namespaces are first seen during parsing, so ns0 is always the first-encountered namespace and ns1 the second, and so on.

Note

detect_map_definition() also appends the necessary sheets to the document automatically — one per detected range — so there is no need to call append_sheet() manually.

Inspecting and editing the map definition

detect_map_definition() commits the mapping immediately. When you need to inspect or adjust the detected rules before importing, use write_map_definition() instead. It performs the same structure analysis but serializes the result to a map definition XML string rather than applying it:

std::ostringstream os;
filter.write_map_definition(input.str(), os);

// print it to stdout for inspection
auto map_def = os.str();
std::cout << map_def << std::endl;

This code will print the map definition XML to stdout, but do note that it is written as a single unbroken line with no indentation. The following is the pretty-printed form of the map definition XML:

<?xml version="1.0"?>
<map xmlns="https://gitlab.com/orcus/orcus/xml-map-definition">
  <ns alias="ns0" uri="http://example.com/server-logs"/>
  <ns alias="ns1" uri="http://example.com/server-logs/meta"/>
  <sheet name="range-0"/>
  <range sheet="range-0" row="0" column="0">
    <field path="/ns0:serverLogs/ns0:entry/@ns0:id"/>
    <field path="/ns0:serverLogs/ns0:entry/ns0:level"/>
    <field path="/ns0:serverLogs/ns0:entry/ns0:message"/>
    <field path="/ns0:serverLogs/ns0:entry/ns0:service"/>
    <field path="/ns0:serverLogs/ns0:entry/ns0:timestamp"/>
    <row-group path="/ns0:serverLogs/ns0:entry"/>
  </range>
</map>

Because the document uses XML namespaces, the auto-detection assigns short aliases automatically: ns0 for the log namespace and ns1 for the meta namespace, indexed in the order they are first encountered during parsing. The string can be edited freely at this point — fields can be removed, reordered, or relabelled, sheet names can be changed, and unwanted ranges can be dropped entirely.

To relabel a field, add a label attribute to the <field> element. For example, to give the ns0:id field a friendlier column header, change:

<field path="/ns0:serverLogs/ns0:entry/@ns0:id"/>

to:

<field path="/ns0:serverLogs/ns0:entry/@ns0:id" label="id"/>

The label value is used as the column header in the imported sheet instead of the auto-generated path-derived name.

Once any edits are complete, load the definition and import the data:

filter.read_map_definition(map_def);
filter.read_stream(input.str());

read_map_definition() parses the map definition XML, appends the necessary sheets to the document, and populates the internal mapping rules, exactly as if the rules had been set up manually. read_stream() then applies those rules to import the source document.