MRS  1.0
A C++ Class Library for Statistical Set Processing
Subpavings for processing statistical sample data

Introduction

Subpavings for processing statistical sample data are a refinement of regular subpavings where each box in a subpaving can be associated with data points in a sample.


Organising data

The subpaving becomes a way to organise the sample data. The boxes of the subpaving can be thought of a 'buckets' for the data points, each box containing some of the data points. A data point x belongs in a box or interval vector [x] if [x] contains x.

The binary tree representation of the subpaving shows how the subpaving has been formed.

nonminimalsp1.png
A 2-dimensional subpaving and its representation as a binary tree

Note that although we are still using regular subpavings (ie, we are bisecting boxes along their longest dimension), our subpavings are no longer minimal - that is, if a box is split, both sub-boxes are always retained in the paving.

Using the binary tree representation of a subpaving we can think of the process of organising data as adding data starting at the root and having the data make its way through the tree to the leaf node whose box contains the data, going through the following steps at each node it gets to from the root down:

  1. If the box of the node contains the data, continue [if the box does not contain the data, return something indicating "not this node"]
  2. If the box has been subdivided (node has children), see which subbox the data is in, ie recurse the addition process from step 1. on the children
  3. If the box has not been subdivided then the data can be associated with this node

Thus the data moves through the tree until it finds a leaf node whose box contains it.

Because a subpaving comprises non-overlapping boxes, a data point should belong to at most only one box. If the data point is outside the box of the root node of a subpaving, it will clearly not be in any of the boxes of the leaves.


A class for statistical subpavings

We extend the SPnode base class to a class specifically designed for processing statistical sample data. This class is called SPSnode and a pointer to an SPSnode is aliased as StatsSubPaving.

SPSnodes summarise as well as organise data: each node can maintain summaries of the data it is associated with, such as a count (number of data points in the box), sum, etc. Because of the tree structure we can maintain a summary of the data at each level of the tree, not just in relation to the final subpaving (the leaf nodes).

As the data moves down the tree, each node it passes through updates its summary statistics, such as a count and sum of data points covered by the box of that node.

statsspaddingdata.png
Summarising data with a binary tree

SPSnodes form non-minimal trees: an SPSnode will either exactly two child nodes or no child nodes. If a tree of SPSnodes is representing a statistical summary of data and the data is imagined plotted as coordinate points on the root box of the paving, then an SPSnode representing a part of the subpaving where no data has fallen will have a SPSnode::counter of zero.

By default, all recursively computable statistics provided are maintained in each SPSnode. However, since this uses memory and is not always needed, an SPSnode can be constructed to only maintain count statistics. See SPSnode::countsOnly.


Using the statistical subpavings class

The statistical subpavings class SPSnode is designed to be a flexible tool.

An SPSnode object represents only part of a subpaving. The entire statistical subpaving is represented by a tree of SPSnodes. The behaviour of the tree of nodes will usually be controlled by a wrapper or manager object. In particular, there are no rules specified within the SPSnode class itself to determine when a box in the subpaving represented by a tree of nodes is split. The wrapper or manager class will specify the methods which determine the development of the subpaving/its tree representation.

An example of such a wrapper or manager object is the AdaptiveHistogram class, see Adaptive histograms and Examples using adaptive histograms

These manager classes will usually determine the rules for constructing the subpaving and simply pass on 'instructions' to the node in the tree representing a particular box in the subpaving. This allows decisions to be made on the basis of the characteristics of the statistical subpaving as a whole and not just an individual node.


Implementing the statistical subpavings class

For a full description of this class, see the SPSnode class documentation. Here we discuss only some of the important points of this class.

Bisecting a box of a statistical subpaving

When we bisect a box of a statistical subpaving to form two subboxes, we have to make a decision about which of the new boxes is considered to include the boundary hyperplane. This is so that, if we have a data point which falls exactly on that boundary, it is clear which of the subboxes it is associated with.

We consider that the boundary to belong to the right child. That is, the interval element of the right child's box on the splitting dimension is a left-closed interval whereas the interval element of the left child's box on the splitting dimension is a right-open interval. For example, if the parent node box interval on the splitting dimension is [1 5] then the split takes place at the midpoint (3) and the right child is considered to have a box whose interval, on that dimesion, is [3 5] whereas the left child has [1 3). [1 5] is split into [1 3)[3 5]. A datapoint which is contained by the parent node's box and which has value 3 on the splitting dimension will be considered to belong to the right child's box and not to the left child's box.

Associating data with a node

A node of a tree representation of a subpaving is associated with a particular box of the subpaving. The box may have been subsequently subdivided, in which case the node will have children. A leaf node represents an undivided box in the subpaving. As well as keeping summaries of the data associated with a node (the data points covered by its box) we want to keep the full dataset itself as well, but without duplication. Thus, we only associate data with leaf nodes. When a box in the paving is subdivided the data associated with the corresponding node is passed down to the new children.

In order to keep our implementation as flexible as possible, we maintain a container for the sample data outside the SPSnode class. A data point in this 'big collection' of data is associated with a node through the iterator to that data point in the big collection. The node does not 'have' the data, it simply knows where the data it is associated with is stored. In the implementation of SPSnode, a node has a data member which is a container for iterators into the big data collection to store references to the data that it is associated with.

C-XSC and processing statistical sample data

The SPSnode class is implemented using the C-XSC library. A data point is a cxsc::rvector and a box or interval vector is a cxsc::ivector. For example, a point in 3-d which would be represented in cartesian coordinates as (1.0, 2.0, 3.0) is a 3-dimensional rvector whose elements are, in order, the reals 1.0, 2.0, and 3.0. This point would be contained in (in cxsc, the operator <= meaning "is an element of") a 3 dimensional ivector. The ivector whose elements are [0.0 2.0], [1.0 3.0], [2.0, 3.0] defines a box which contains this point.

Summaries of data values are stored in csxc::dotprecision accumulators to mitigate the risk of inaccuracy in summing floating point numbers (this risk is higher when the numbers being summed have large differences in order of magnitude, as will be the case when a relatively small new data point value is added to a much larger accumulated sum of data points so far).

 All Classes Namespaces Functions Variables Typedefs Enumerations Friends