Detecting format of data stream
Orcus provides an API to detect the format of a data stream via the orcus::detect()
function. This function parses the content of a stream given as std::string_view
and reports the detected format type. In this section we will walk through
how to use this function with some test input files from the project
repository.
First, let’s include the necessary headers:
#include <orcus/format_detection.hpp>
#include <orcus/stream.hpp>
#include <cstdlib>
#include <iostream>
#include <filesystem>
namespace fs = std::filesystem;
We will also create a namespace alias fs to reference the std::filesystem
namespace for brevity.
Let’s assume that there is an environment variable named TESTDIR that
points to the top-level test directory of the project repository:
const char* testdir = std::getenv("TESTDIR");
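Note that std::getenv() returns a null pointer when the variable is not set, and constructing an fs::path from a null pointer is undefined behavior. A minimal sketch of a guard follows; the helper name env_dir is hypothetical and not part of orcus:

```cpp
#include <cstdlib>
#include <filesystem>
#include <optional>

namespace fs = std::filesystem;

// Hypothetical helper (not part of orcus): returns the directory named by
// the given environment variable, or std::nullopt when the variable is
// unset or empty.
inline std::optional<fs::path> env_dir(const char* name)
{
    const char* value = std::getenv(name);
    if (!value || !*value)
        return std::nullopt;

    return fs::path{value};
}
```

With such a helper, a program could call env_dir("TESTDIR") and exit with an error message when no value is returned, rather than passing a null pointer on to fs::path.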
We will use the path stored in testdir throughout the examples. Now, let’s detect the format of a test input file that we know is in Open Document Spreadsheet (ODS) format:
auto filepath = fs::path{testdir} / "ods" / "raw-values-1" / "input.ods";
orcus::file_content fc{filepath};
auto format = orcus::detect(fc.str());
std::cout << "format: " << format << std::endl;
The first two lines load the input file and expose its
content through the orcus::file_content class. This class
internally uses mmap to map the content of the loaded file into virtual
memory, and its str() method returns a
view of its content as std::string_view. This view can then be
passed to the detect() function. The returned
value, which is of enum type orcus::format_t, can be printed
directly to stdout via std::cout. Running this code should
produce the following output:
format: ods
Let’s try another input file. This time it is an Excel 2007 file:
auto filepath = fs::path{testdir} / "xlsx" / "raw-values-1" / "input.xlsx";
orcus::file_content fc{filepath};
auto format = orcus::detect(fc.str());
std::cout << "format: " << format << std::endl;
The file is loaded and examined the same way as before. Running this code should produce the following output:
format: xlsx
You can also detect a generic XML file, as in the following example:
auto filepath = fs::path{testdir} / "xml" / "simple" / "input.xml";
orcus::file_content fc{filepath};
auto format = orcus::detect(fc.str());
std::cout << "format: " << format << std::endl;
The input file used above is an XML file but does not correspond to any specific XML-based file format. It is simply a generic XML document. Running this code should produce the following output:
format: xml
Similarly, using a generic JSON file as the input to detect:
auto filepath = fs::path{testdir} / "json" / "basic1" / "input.json";
orcus::file_content fc{filepath};
auto format = orcus::detect(fc.str());
std::cout << "format: " << format << std::endl;
should produce the following output:
format: json
You can also use a variant of detect() that checks
whether an input stream is of a specified format. Let’s take a look at
the following example:
auto filepath = fs::path{testdir} / "ods" / "raw-values-1" / "input.ods";
orcus::file_content fc{filepath};
std::cout << "ods? " << orcus::detect(fc.str(), orcus::format_t::ods) << std::endl;
std::cout << "xlsx? " << orcus::detect(fc.str(), orcus::format_t::xlsx) << std::endl;
Here, we are passing the content of what we know to be an ODS file to
the detect() function and asking it to report first whether
it is an ODS file, and then whether it is an XLSX file.
The expected output is:
ods? 1
xlsx? 0
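The 1 and 0 in this output are how a C++ output stream prints a bool by default, assuming that this variant of detect() returns a bool, as the output suggests. If you prefer to see true and false, the std::boolalpha manipulator changes the rendering. The following sketch shows the difference using a hypothetical helper named render_bool, which is not part of orcus:

```cpp
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper (not part of orcus): renders a bool the way an output
// stream would, either as 1/0 (the default) or as true/false when
// std::boolalpha is applied.
inline std::string render_bool(bool value, bool as_text)
{
    std::ostringstream os;
    if (as_text)
        os << std::boolalpha;

    os << value;
    return os.str();
}
```

Applied to the example above, inserting std::boolalpha into std::cout before the result of detect() should print the answers as true and false instead of 1 and 0.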
Next, we are going to use an Excel 2003 XML file as the input and ask the
detect() function whether it is:
an Excel 2003 XML file,
an XML file, and
a JSON file.
auto filepath = fs::path{testdir} / "xls-xml" / "raw-values-1" / "input.xml";
orcus::file_content fc{filepath};
std::cout << "xls-xml? " << orcus::detect(fc.str(), orcus::format_t::xls_xml) << std::endl;
std::cout << "xml? " << orcus::detect(fc.str(), orcus::format_t::xml) << std::endl;
std::cout << "json? " << orcus::detect(fc.str(), orcus::format_t::json) << std::endl;
Here is the output from this code:
xls-xml? 1
xml? 1
json? 0
The xls-xml alias is what orcus uses to reference the Excel 2003 XML
format (often referred to as the SpreadsheetML
format). Since this is an XML-based format, asking whether it is an XML
format should also yield true. But since it is clearly not a JSON format,
the last inquiry should rightly yield false.
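Because a single stream can match more than one format, it can be handy to collect every matching format in one pass. The following is only a sketch of that idea: collect_matches, toy_format and toy_is_format are hypothetical names, and the toy predicate is a deliberately naive stand-in so that the sketch compiles without orcus. It is not how orcus detects formats.

```cpp
#include <string_view>
#include <vector>

// Hypothetical helper (not part of orcus): returns every candidate for which
// is_format(content, candidate) is true, preserving the candidate order.
template<typename Format, typename Predicate>
std::vector<Format> collect_matches(
    std::string_view content, const std::vector<Format>& candidates, Predicate is_format)
{
    std::vector<Format> matches;
    for (const Format& candidate : candidates)
    {
        if (is_format(content, candidate))
            matches.push_back(candidate);
    }
    return matches;
}

// Stand-in format type used only to keep this sketch self-contained.
enum class toy_format { xls_xml, xml, json };

// Naive stand-in predicate, NOT orcus's detection logic: it only looks at
// the first character and searches for the word "Workbook".
inline bool toy_is_format(std::string_view content, toy_format f)
{
    const bool looks_xml = !content.empty() && content.front() == '<';
    switch (f)
    {
        case toy_format::xml:
            return looks_xml;
        case toy_format::xls_xml:
            return looks_xml && content.find("Workbook") != std::string_view::npos;
        case toy_format::json:
            return !content.empty() && content.front() == '{';
    }
    return false;
}
```

With orcus, the same helper could be given a vector of orcus::format_t values and a lambda that forwards to the two-argument detect() shown above, for example [](std::string_view s, orcus::format_t f) { return orcus::detect(s, f); }.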