Anatomy of Persistence

#include <some_header_files>
    int main(int argc, char *argv[]) {
       sn::some_type_t object;
       write( somewhere, object, ... );
       ...
       for( size_t i=0; i<huge_number; i+=batch_size)
          read( somewhere, object, ...);
    }
  • take a program
  • with an object that has some memory layout
  • and the intention to save its state to some device
  • or to retrieve the data later

Properties of the Object

can be categorized by

  • Content: homogeneous vs heterogeneous
  • Placement in memory: contiguous vs non-contiguous
  • How much space is used in total
  • Number of dimensions or rank

Type trait utility?

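  • std::is_compound is not the same as heterogeneous: it is also true for pointers, arrays and references
  • std::vector<T>::data() points to contiguous memory when T = int, but when T = std::string the character data lives outside that block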

A purely trait-based approach requires the type to be known upfront, which makes it less powerful than detecting the presence of certain methods.
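The two bullets above can be checked with the standard library alone. The following is a minimal sketch; `pod_t` and `is_flat_element` are hypothetical names introduced for illustration, and `std::is_trivially_copyable` is used here only as one possible proxy for "can be written as a single block of bytes".

```cpp
#include <string>
#include <type_traits>
#include <vector>

struct pod_t { int id; double value; };   // hypothetical heterogeneous record

// std::is_compound is true for everything that is not a fundamental type,
// so it cannot tell a heterogeneous record from a pointer or an array.
static_assert(std::is_compound<pod_t>::value,  "a struct is compound");
static_assert(std::is_compound<int*>::value,   "so is a plain pointer");
static_assert(std::is_compound<int[3]>::value, "and a homogeneous array");
static_assert(!std::is_compound<int>::value,   "fundamental types are not");

// A closer proxy for "safe to persist byte by byte" (an assumption of this sketch):
static_assert(std::is_trivially_copyable<pod_t>::value, "POD record is flat");
static_assert(!std::is_trivially_copyable<std::string>::value,
              "std::string owns memory elsewhere");

// True when the elements behind std::vector<T>::data() carry all of their
// payload inside the contiguous block itself.
template <typename T>
constexpr bool is_flat_element() {
    return std::is_trivially_copyable<T>::value;
}
```

With this, `is_flat_element<int>()` holds while `is_flat_element<std::string>()` does not, which matches the `std::vector<T>` observation above.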

Heterogeneous types, such as plain old struct, class, union...

require access to field names and may reside in non-contiguous memory. Possible layouts:

  • Table (row) layout: each row is a heterogeneous datatype, which gives fast indexing by rows. On the other hand, accessing a single field wastes I/O bandwidth.
    Contiguous memory: 1 block, heterogeneous
  • Column layout: efficient access by fields, at the cost of added implementation complexity, with one dataset per column.
    Non-contiguous memory: 3 separate blocks, each homogeneous
 //a vector of pod struct
    struct coo_t {
       size_t row;
       size_t column;
       double value;
    };
    std::vector<coo_t> sparse_matrix;
    
 // each field of the struct is a vector
    struct csc_t {
       std::vector<size_t> rowind; // row indices
       std::vector<size_t> colptr; // start of new columns
       std::vector<double> values; // nonzero values
    };
    csc_t sparse_matrix;
    
NOTE: the coordinate (COO) format is efficient for incremental construction, whereas Armadillo C++ arma::SpMat<T> uses the Compressed Sparse Column (CSC) representation.
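To make the relation between the two layouts concrete, here is a minimal sketch that converts the row-oriented vector of `coo_t` into the column-oriented `csc_t`. The function name `to_csc` and the explicit `n_cols` argument are assumptions of this sketch; it uses a counting sort by column and is not taken from Armadillo or H5CPP.

```cpp
#include <cstddef>
#include <vector>

struct coo_t { std::size_t row, column; double value; };

struct csc_t {
    std::vector<std::size_t> rowind; // row index of each nonzero
    std::vector<std::size_t> colptr; // colptr[j]..colptr[j+1] spans column j
    std::vector<double>      values; // nonzero values, column by column
};

// Counting sort by column. Assumes every e.column < n_cols.
// Entries within a column keep their input order.
csc_t to_csc(const std::vector<coo_t>& coo, std::size_t n_cols) {
    csc_t csc;
    csc.colptr.assign(n_cols + 1, 0);
    csc.rowind.resize(coo.size());
    csc.values.resize(coo.size());

    for (const coo_t& e : coo) ++csc.colptr[e.column + 1]; // nonzeros per column
    for (std::size_t j = 0; j < n_cols; ++j)               // prefix sum -> offsets
        csc.colptr[j + 1] += csc.colptr[j];

    // next[j] is the slot where the next entry of column j will be stored
    std::vector<std::size_t> next(csc.colptr.begin(), csc.colptr.end() - 1);
    for (const coo_t& e : coo) {
        const std::size_t k = next[e.column]++;
        csc.rowind[k] = e.row;
        csc.values[k] = e.value;
    }
    return csc;
}
```

The result consists of three homogeneous, contiguous blocks, so each of them can be persisted as its own dataset.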

We need an introspection / reflection mechanism to retrieve the field names of C++ class types

C++ mailing list and papers categorized as related to reflection
  • 2019 8 entries
    • P1390R0 Suggested Reflection TS NB Resolutions Matúš Chochlík, Axel Naumann, and David Sankel
    • P1390R1 Reflection TS NB comment resolutions: summary and rationale Matúš Chochlík, Axel Naumann, and David Sankel
    • N4818 Working Draft, C++ Extensions for Reflection David Sankel
    • P1733R0 User-friendly and Evolution-friendly Reflection: A Compromise David Sankel, Daveed Vandevoorde
    • P1749R0 Access control for reflection Yehezkel Bernat
    • P1240R1 Scalable Reflection in C++ Wyatt Childers, Andrew Sutton, Faisal Vali, Daveed Vandevoorde
    • P1887R0 Typesafe Reflection on attributes Corentin Jabot
  • 2018 13 entries
    • P0194R5 Static reflection by Matúš Chochlík, Axel Naumann, David Sankel
    • P0670R2 Static reflection of functions by Matúš Chochlík, Axel Naumann, David Sankel
    • P0954R0 What do we want to do with reflection? Bjarne Stroustrup
    • P0993R0 Value-based Reflection Andrew Sutton, Herb Sutter
    • P0572R2 Static reflection of bit fields Alex Christensen
    • P0670R3 Function reflection Matúš Chochlík, Axel Naumann, David Sankel
    • P1240R0 Scalable Reflection in C++ Andrew Sutton, Faisal Vali, Daveed Vandevoorde
  • 2017 14 entries
    • P0194R3 Static reflection by Matúš Chochlík, Axel Naumann, David Sankel
    • P0385R2 Static reflection: Rationale, design and evolution by Matúš Chochlík, Axel Naumann, David Sankel
    • P0578R0 Static Reflection in a Nutshell by Matúš Chochlík, Axel Naumann, David Sankel
    • P0590R0 A design for static reflection Andrew Sutton, Herb Sutter
    • P0598R0 Reflect Through Values Instead of Types by Daveed Vandevoorde
  • 2016 19 entries
    • P0194R0 Static reflection (revision 4) Matúš Chochlík, Axel Naumann
    • P0255R0 C++ Static Reflection via template pack expansion Cleiton Santoia Silva, Daniel Auresco
    • P0256R0 C++ Reflection Light Cleiton Santoia Silva
    • P0327R0 Product types access Vicente J. Botet Escriba
    • P0341R0 parameter packs outside of templates Mike Spertus
    • Static reflection: Rationale, design and evolution Matúš Chochlík, Axel Naumann
Introspection/reflection is non-trivial and not yet available as a language feature.

LLVM/Clang LibTooling based static reflection

...
    StatementMatcher h5templateMatcher = callExpr( allOf(
       hasDescendant( declRefExpr( to( varDecl().bind("variableDecl") ) ) ),
       hasDescendant( declRefExpr( to(
          functionDecl( allOf(
             eachOf(
                hasName("h5::write"), hasName("h5::create"), hasName("h5::read"),
                hasName("h5::append"),
                hasName("h5::awrite"), hasName("h5::acreate"), hasName("h5::aread")
             ),
    ... ));
  • identify the relevant nodes
  • marked by I/O operators
  • visit the structure in reverse topological order
  • emit the templates describing the class with fields and types
P0993r0 Value-based Reflection, Andrew Sutton, Herb Sutter:
Static reflection is a programming facility that exposes read-only data about entities in a translation unit as compile-time values. Static reflection does not require support for runtime compilation since reflection values can be used with existing generative facilities (i.e., templates) or additional generative facilities (i.e., metaprogramming) to produce new code.
Dynamic reflection provides information for navigating source-code data structures at runtime. Languages supporting dynamic reflection also tend to make additional facilities available for generating and JIT-compiling new code. Dynamic reflection and code generation are not in the scope of this work.

How about containers? Let's take a look at N4436, the C++ Detection Idiom

It is possible to detect whether a container is STL-like and provides direct access to its contiguous storage, as std::vector<T> does, or alternatively offers iterators for scatter/gather operations

  • identify if there is direct access to contiguous memory
  • or iterator for non-contiguous layouts

    template <typename T> using value_type_f = typename T::value_type;
    template <typename T> using data_f = decltype(std::declval <T>().data());
    template <typename T> using size_f = decltype(std::declval <T>().size());
    template <typename T> using begin_f = decltype(std::declval <T>().begin());
    template <typename T> using end_f = decltype(std::declval <T>().end());
    template <typename T> using cbegin_f = decltype(std::declval <T>().cbegin());
    template <typename T> using cend_f = decltype(std::declval <T>().cend());

    template <typename T> using value = compat::detected_or <T, value_type_f, T>;
    template <typename T> using has_value_type = compat::is_detected <value_type_f, T>;
    template <typename T> using has_data = compat::is_detected <data_f, T>;
    template <typename T> using has_direct_access = compat::is_detected <data_f, T>;
    template <typename T> using has_size = compat::is_detected <size_f, T>;
    template <typename T> using has_begin = compat::is_detected <begin_f, T>;
    template <typename T> using has_end = compat::is_detected <end_f, T>;
    template <typename T> using has_cbegin = compat::is_detected <cbegin_f, T>;
    template <typename T> using has_cend = compat::is_detected <cend_f, T>;

    template <typename T> using has_iterator = std::integral_constant <bool, has_begin <T>::value && has_end <T>::value >;
    template <typename T> using has_const_iterator = std::integral_constant <bool, has_cbegin <T>::value && has_cend <T>::value >;
    
credit: WG21 N4436 C++ Detection Idiom by Walter Brown
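The aliases above rely on `compat::is_detected`, whose definition is not shown here. The following is a minimal sketch of how such a detector can be written in the spirit of N4436. The `compat` namespace below is a stand-in and only an assumption about the real header, and `access` / `access_of` are hypothetical names added to show how the result could drive the choice between direct and iterator-based I/O.

```cpp
#include <list>
#include <type_traits>
#include <utility>
#include <vector>

namespace compat {   // stand-in for the namespace used above (assumption)
    template <typename...> struct make_void { using type = void; };
    template <typename... Ts> using void_t = typename make_void<Ts...>::type;

    // Primary template: the expression Op<Args...> is not well formed.
    template <typename, template <typename...> class Op, typename... Args>
    struct detector : std::false_type {};

    // Chosen only when Op<Args...> can be formed.
    template <template <typename...> class Op, typename... Args>
    struct detector<void_t<Op<Args...>>, Op, Args...> : std::true_type {};

    template <template <typename...> class Op, typename... Args>
    using is_detected = detector<void, Op, Args...>;
}

template <typename T> using data_f  = decltype(std::declval<T>().data());
template <typename T> using begin_f = decltype(std::declval<T>().begin());

template <typename T> using has_data  = compat::is_detected<data_f, T>;
template <typename T> using has_begin = compat::is_detected<begin_f, T>;

enum class access { direct, iterator, none };

// Prefer direct access to contiguous storage, fall back to iterators.
template <typename T>
constexpr access access_of() {
    return has_data<T>::value  ? access::direct
         : has_begin<T>::value ? access::iterator
         : access::none;
}
```

For example, `access_of<std::vector<int>>()` selects direct access, `access_of<std::list<int>>()` falls back to iterators, and a scalar such as `int` matches neither.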

C++ Linear Algebra Systems calling BLAS/LAPACK: specialized containers

form a dedicated category, as they all must provide a mechanism to pass data to and receive data from BLAS calls; however, the naming varies from system to system.

The differences can be mitigated with a combination of

  • type traits
  • feature detection idiom

library   direct access     vector size
arma      memptr()          n_elem
eigen     data()            size()
blaze     data()            n/a
blitz     data()            size()
itpp      _data()           length()
ublas     data().begin()    n/a
dlib      (0,0)             size()
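As an illustration of how the naming differences in the table can be hidden behind one interface, here is a minimal sketch covering only two rows: an Armadillo-style container (`memptr()`, `n_elem`) and a `data()` / `size()` container. `arma_like`, `raw_ptr` and `element_count` are hypothetical names; the sketch does not depend on any of the listed libraries, and it keys the size accessor on the presence of `memptr()` purely for brevity.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Armadillo-style container.
struct arma_like {
    std::vector<double> buf;
    std::size_t n_elem;
    explicit arma_like(std::size_t n) : buf(n, 0.0), n_elem(n) {}
    double* memptr() { return buf.data(); }
};

template <typename...> struct make_void { using type = void; };
template <typename... Ts> using void_t = typename make_void<Ts...>::type;

// Detects a memptr() member function.
template <typename T, typename = void>
struct has_memptr : std::false_type {};
template <typename T>
struct has_memptr<T, void_t<decltype(std::declval<T&>().memptr())>> : std::true_type {};

// Direct access: memptr() when present, data() otherwise.
template <typename T>
auto raw_ptr(T& obj)
    -> typename std::enable_if<has_memptr<T>::value, decltype(obj.memptr())>::type {
    return obj.memptr();
}
template <typename T>
auto raw_ptr(T& obj)
    -> typename std::enable_if<!has_memptr<T>::value, decltype(obj.data())>::type {
    return obj.data();
}

// Element count: n_elem for the memptr() family, size() otherwise.
template <typename T>
auto element_count(const T& obj)
    -> typename std::enable_if<has_memptr<T>::value, std::size_t>::type {
    return obj.n_elem;
}
template <typename T>
auto element_count(const T& obj)
    -> typename std::enable_if<!has_memptr<T>::value, std::size_t>::type {
    return obj.size();
}
```

An I/O routine written against `raw_ptr` and `element_count` then works unchanged for both families; further rows of the table would need their own detectors.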

H5CPP

  • LLVM based static reflection tool
  • C++ templates with CRUD like operators

take a header file with a POD struct


typedef unsigned long long int MyUInt;
namespace sn {
	namespace example {
		struct Record {
			MyUInt               field_01;
			char                 field_02;
			double            field_03[3];
			other::Record field_04[4];
		};
	}
}
  • typedefs are fine
  • nested namespaces are OK
  • char is mapped to H5T_NATIVE_CHAR
  • double[3] becomes H5Tarray_create(H5T_NATIVE_DOUBLE,1, ... )
  • `other::Record` is parsed first: type_hid_t = ...
  • then the generated type is used: H5Tarray_create(type_hid_t, ...)
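Whatever produces the descriptor, it needs the name, byte offset and size of every field of the struct. The sketch below collects exactly that with `offsetof` and `sizeof`, using only the standard library. `inner_t`, `record_t`, `field_desc`, `FIELD` and `describe_record` are hypothetical names chosen to mirror the shape of the record above; this is not the code that the h5cpp compiler emits.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct inner_t { int id; float weight; };   // hypothetical nested record

struct record_t {                           // mirrors the shape of the struct above
    unsigned long long field_01;
    char               field_02;
    double             field_03[3];
    inner_t            field_04[4];
};

// What a compound type descriptor needs to know about one field.
struct field_desc {
    std::string name;
    std::size_t offset;   // byte offset within the struct
    std::size_t size;     // total size in bytes, including array extent
};

// Stringizes the member name and records where it lives.
#define FIELD(type, member) \
    field_desc{ #member, offsetof(type, member), sizeof(type::member) }

// Written by hand here; a reflection tool would generate this list.
std::vector<field_desc> describe_record() {
    return {
        FIELD(record_t, field_01),
        FIELD(record_t, field_02),
        FIELD(record_t, field_03),
        FIELD(record_t, field_04),
    };
}
```

Writing such a list by hand for every struct is the repetitive step that a static reflection tool is meant to remove.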

write your program

write your cpp program as if `generated.h` were already written 
#include "some_header_file.h"
#include <h5cpp/core>
	#include "generated.h"
#include <h5cpp/io>
int main(){
	std::vector<sn::example::Record> stream =
		...
	h5::fd_t fd = h5::create("example.h5",H5F_ACC_TRUNC);
	h5::pt_t pt = h5::create<sn::example::Record>(
		fd, "stream of struct",
		h5::max_dims{H5S_UNLIMITED,7}, h5::chunk{4,7} | h5::gzip{9} );
	...
}
  • sandwich the not-yet-existing `generated.h` between the H5CPP includes
  • write the translation unit (TU) as usual
  • using the POD type with one of the H5CPP CRUD-like operators:
      h5::create | h5::write | h5::read | h5::append | h5::acreate | h5::awrite | h5::aread
    triggers the `h5cpp` compiler to generate code

A header file with HDF5 Compound Type descriptors:

#ifndef H5CPP_GUARD_ErRrk
#define H5CPP_GUARD_ErRrk
namespace h5{
    template<> hid_t inline register_struct(){
        hsize_t at_00_[] ={7};            hid_t at_00 = H5Tarray_create(H5T_NATIVE_FLOAT,1,at_00_);
        hsize_t at_01_[] ={3};            hid_t at_01 = H5Tarray_create(H5T_NATIVE_DOUBLE,1,at_01_);
        hid_t ct_00 = H5Tcreate(H5T_COMPOUND, sizeof (sn::typecheck::Record));
        H5Tinsert(ct_00, "_char",	HOFFSET(sn::typecheck::Record,_char),H5T_NATIVE_CHAR);
		...
		H5Tclose(at_03); H5Tclose(at_04); H5Tclose(at_05); 
        return ct_02;
    };
}
H5CPP_REGISTER_STRUCT(sn::example::Record);
#endif
  • random include guards
  • within namespace
  • template specialization for h5::operators
  • compound types are recursively created
  • calls the template specialization when h5::operator needs it
rich set of HDF5 property lists

Comma Separated Values to HDF5

#include "csv.h"
#include "struct.h"
#include <h5cpp/core>      // has handle + type descriptors
	#include "generated.h" // uses type descriptors
#include <h5cpp/io>        // uses generated.h + core 

int main(){
	h5::fd_t fd = h5::create("output.h5",H5F_ACC_TRUNC);
	h5::ds_t ds = h5::create<input_t>(fd,  "simple approach/dataset.csv",
				 h5::max_dims{H5S_UNLIMITED}, h5::chunk{10} | h5::gzip{9} );
	h5::pt_t pt = ds;
	ds["data set"] = "monroe-county-crash-data2003-to-2015.csv";
	ds["csv parser"] = "https://github.com/ben-strasser/fast-cpp-csv-parser";

	constexpr unsigned N_COLS = 5;
	io::CSVReader<N_COLS> in("input.csv"); // N_COLS may be less than the number of columns in a row; only 5 are read
	in.read_header(io::ignore_extra_column, "Master Record Number", "Hour", "Reported_Location","Latitude","Longitude");
	input_t row;                           // buffer to read line by line
	char* ptr;      // indirection, as `read_row` doesn't take array directly
	while(in.read_row(row.MasterRecordNumber, row.Hour, ptr, row.Latitude, row.Longitude)){
		strncpy(row.ReportedLocation, ptr, STR_ARRAY_SIZE); // defined in struct.h
		h5::append(pt, row);
	}
}
  • header-only CSV library by Ben Strasser, and a type definition for the record
  • h5cpp includes
  • translation unit, the program
  • create HDF5 container, and dataset
  • decorate it with attributes
  • do I/O operations within a loop

Attributes:

do the right thing and come with an easy-to-use operator. Here are some examples:

h5::ds_t ds = h5::write(fd,"some dataset with attributes", ... );
ds["att_01"] = 42 ;
ds["att_02"] = {1.,3.,4.,5.};
ds["att_03"] = {'1','3','4','5'};
ds["att_04"] = {"alpha", "beta","gamma","..."};
ds["att_05"] = "const char[N]";
ds["att_06"] = u8"const char[N]áééé";
ds["att_07"] = std::string( "std::string");
ds["att_08"] = record; // pod/compound datatype
ds["att_09"] = vector; // vector of pod/compound type
ds["att_10"] = matrix; // linear algebra object
  • obtain a handle by h5::create | h5::open | h5::write
  • rank N objects, even compound types when the h5cpp compiler is used
  • arrays of various element types
  • mapped to rank 0 variable length character types