Detecting format of data stream

Orcus provides an API to detect the format of a data stream via the orcus::detect() function. This function parses the content of a stream given as std::string_view and reports the detected format type. In this section we will walk through how to use this function with some test input files from the project repository.

First, let’s include the necessary headers:

#include <orcus/format_detection.hpp>
#include <orcus/stream.hpp>

#include <cstdlib>
#include <filesystem>
#include <iostream>

namespace fs = std::filesystem;

We will also create a namespace alias fs to reference the std::filesystem namespace for brevity.

Let’s assume that there is an environment variable named TESTDIR that points to the top-level test directory of the project repository:

const char* testdir = std::getenv("TESTDIR");
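
Note that std::getenv() returns a null pointer when the variable is not set, and constructing a path from a null pointer is undefined behavior. The following is a minimal sketch of a guard, using only the standard library. The helper name resolve_testdir() is hypothetical and is not part of orcus:

```cpp
#include <cstdlib>
#include <filesystem>
#include <optional>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not part of orcus): returns the directory named by the
// given environment variable, or std::nullopt when the variable is unset,
// empty, or does not name an existing directory.
std::optional<fs::path> resolve_testdir(const char* varname = "TESTDIR")
{
    const char* value = std::getenv(varname);
    if (!value || !*value)
        return std::nullopt;

    fs::path dir{value};
    std::error_code ec; // non-throwing overload; any error is treated as "not a directory"
    if (!fs::is_directory(dir, ec))
        return std::nullopt;

    return dir;
}
```

A caller would check that the returned optional holds a value before building file paths from it, and could print a message asking the user to set TESTDIR otherwise.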

We will use this path throughout the examples. Now, let’s try to detect the format of a test input file that we know is in the OpenDocument Spreadsheet (ODS) format.

auto filepath = fs::path{testdir} / "ods" / "raw-values-1" / "input.ods";
orcus::file_content fc{filepath};

auto format = orcus::detect(fc.str());
std::cout << "format: " << format << std::endl;

The first two lines load the input file and expose its content through the orcus::file_content class. This class internally uses mmap to map the content of the loaded file into virtual memory, and its str() method returns a view of that content as std::string_view. This view can then be passed to the detect() function. The returned value, which is of enum type orcus::format_t, can be printed directly to stdout via std::cout. Running this code should produce the following output:
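
Because detect() takes a std::string_view, the content does not necessarily have to come from orcus::file_content. As a sketch under that assumption, the hypothetical helper below (read_stream() is not an orcus function) reads a whole file into a std::string using only the standard library. A view of that string could then be passed to detect() in the same way as fc.str(). Unlike file_content, this copies the file into memory rather than mapping it:

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>
#include <string_view>

namespace fs = std::filesystem;

// Hypothetical helper (not part of orcus): load the entire file into memory.
// The file is opened in binary mode so that the bytes are returned unmodified.
std::string read_stream(const fs::path& filepath)
{
    std::ifstream in{filepath, std::ios::binary};
    if (!in)
        throw std::runtime_error{"failed to open " + filepath.string()};

    return std::string(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
}
```

A std::string_view does not own the bytes it refers to, so the std::string returned by this helper must stay alive for as long as a view of it is in use.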

format: ods

Let’s try another input file. This time it is an Excel 2007 file:

auto filepath = fs::path{testdir} / "xlsx" / "raw-values-1" / "input.xlsx";
orcus::file_content fc{filepath};

auto format = orcus::detect(fc.str());
std::cout << "format: " << format << std::endl;

The file is loaded and examined the same way as before. Running this code should produce the following output:

format: xlsx

You can also detect a generic XML file, as in the following example:

auto filepath = fs::path{testdir} / "xml" / "simple" / "input.xml";
orcus::file_content fc{filepath};

auto format = orcus::detect(fc.str());
std::cout << "format: " << format << std::endl;

The input file used above is an XML file but does not correspond to any specific XML-based file format. It is simply a generic XML document. Running this code should produce the following output:

format: xml

Similarly, using a generic JSON file as the input to detect:

auto filepath = fs::path{testdir} / "json" / "basic1" / "input.json";
orcus::file_content fc{filepath};

auto format = orcus::detect(fc.str());
std::cout << "format: " << format << std::endl;

should produce the following output:

format: json

You can also use a variant of detect() that checks whether an input stream is of a specified format. Let’s take a look at the following example:

auto filepath = fs::path{testdir} / "ods" / "raw-values-1" / "input.ods";

orcus::file_content fc{filepath};
std::cout << "ods? " << orcus::detect(fc.str(), orcus::format_t::ods) << std::endl;
std::cout << "xlsx? " << orcus::detect(fc.str(), orcus::format_t::xlsx) << std::endl;

Here, we are passing the content of what we know to be an ODS file to the detect() function and asking it to report 1) whether it is an ODS file, and then 2) whether it is an XLSX file.

The expected output is:

ods? 1
xlsx? 0
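
The 1 and 0 in this output are how std::cout prints a bool by default. Assuming that this variant of detect() returns a bool, as the output suggests, the std::boolalpha manipulator from the standard library would print true and false instead. Here is a minimal sketch, with a hypothetical helper named format_answer() and a plain bool standing in for the result of detect():

```cpp
#include <ios>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical helper (not part of orcus): render a yes/no detection result.
// `matched` stands in for the value returned by orcus::detect(content, format).
std::string format_answer(std::string_view label, bool matched)
{
    std::ostringstream os;
    // std::boolalpha makes the stream write "true"/"false" instead of 1/0.
    os << label << "? " << std::boolalpha << matched;
    return os.str();
}
```

The same manipulator can be inserted directly into std::cout; it stays in effect for that stream until std::noboolalpha is applied.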

Next, we are going to use an Excel 2003 XML file as the input and ask the detect() function whether it is:

  • an Excel 2003 XML file,

  • an XML file, and

  • a JSON file.

auto filepath = fs::path{testdir} / "xls-xml" / "raw-values-1" / "input.xml";

orcus::file_content fc{filepath};
std::cout << "xls-xml? " << orcus::detect(fc.str(), orcus::format_t::xls_xml) << std::endl;
std::cout << "xml? " << orcus::detect(fc.str(), orcus::format_t::xml) << std::endl;
std::cout << "json? " << orcus::detect(fc.str(), orcus::format_t::json) << std::endl;

Here is the output from this code:

xls-xml? 1
xml? 1
json? 0

The xls-xml alias is what orcus uses to reference the Excel 2003 XML format (often referred to as the SpreadsheetML format). Since this is an XML-based format, asking whether it is an XML format should also yield true. But since it is clearly not a JSON format, the last inquiry should rightly yield false.