A derived class based on SPnode for processing sample data. More...
Public Member Functions | |||
SPSnode () | |||
Default constructor. | |||
SPSnode (const ivector &v, bool cntOnly) | |||
Initialised constructor. | |||
SPSnode (const ivector &v) | |||
Initialised constructor. | |||
SPSnode (const ivector &v, size_t max, bool cntOnly) | |||
Initialised constructor. | |||
SPSnode (const ivector &v, size_t max) | |||
Initialised constructor. | |||
SPSnode (const LabBox &lb, bool cntOnly=false) | |||
Initialised constructor. | |||
SPSnode (const LabBox &lb, size_t max, bool cntOnly=false) | |||
Initialised constructor. | |||
SPSnode (const SPSnode &other) | |||
Copy constructor. | |||
SPSnode & | operator= (SPSnode rhs) | ||
Copy assignment operator. | |||
size_t | getCounter () const | ||
Accessor for the counter. | |||
virtual int | getSplitDim () const | ||
Get the split dimension. | |||
virtual real | getSplitValue () const | ||
Get the split value. | |||
real | getCountsOnly () const | ||
Accessor for the countsOnly value. | |||
NodeData | getData () const | ||
Accessor for the node's data collection. | |||
bool | checkTreeStateLegal () const | ||
Check tree rooted at this is legal with respect to isSplittableNode(). | |||
bool | checkTreeStateLegal (size_t minChildPoints, double minVol) const | ||
Check tree rooted at this is legal with respect to isSplittableNode(size_t minChildPoints, double minVol). | |||
bool | checkTreeStateLegal (size_t minChildPoints) | ||
Check tree rooted at this is legal with respect to isSplittableNode(size_t minChildPoints). | |||
virtual bool | isSplittableNode () const | ||
bool | isSplittableNode (size_t minChildPoints, double minVol) const | ||
Method to check whether a node is splittable. | |||
bool | isSplittableNode (size_t minChildPoints) const | ||
Method to check whether a node is splittable. | |||
size_t | getLeftCountIfSplit () const | ||
The count the left child would have if this node was split. | |||
size_t | getRightCountIfSplit () const | ||
The count the right child would have if this node was split. | |||
size_t | getMinChildCountIfSplit () const | ||
Smallest number of points in either child if this was split. | |||
size_t | getMinChildCountIfSplitNEW () const | ||
Size_tVec & | getChildrensLeftAndRightCountsIfSplit (Size_tVec &grandchildCounts) const | ||
return a container of counts for prospective grandchildren. | |||
Size_tVec & | getChildrensLeftAndRightCountsIfSplitNEW (Size_tVec &grandchildCounts) const | ||
Size_tVec & | getLeafNodeCounts (Size_tVec &counts) const | ||
SPSnodePtrs & | getLeavesInIntersection (const SPnode *const spn, SPSnodePtrs &leaves) | ||
SPSnodeConstPtrs & | getConstLeavesInIntersection (const SPnode *const spn, SPSnodeConstPtrs &leaves) const | ||
SPSnodePtrs & | getSubLeavesInIntersection (const SPnode *const spn, SPSnodePtrs &subleaves) | ||
SPSnodeConstPtrs & | getConstSubLeavesInIntersection (const SPnode *const spn, SPSnodeConstPtrs &subleaves) const | ||
size_t | getRootCounter () const | ||
The count in the node's ultimate ancestor root. | |||
double | getCountOverVolume () const | ||
The count divided by the node volume. | |||
rvector | getMean () const | ||
Get the sample mean. | |||
std::pair< size_t, cxsc::real > | getNonEmptyBoxSummary () const | ||
Get summary information on non-empty leaf box numbers and volumes. | |||
real | getSumLeafCountOverVol () const | ||
Get the sum of the count over volume in the leaf nodes. | |||
size_t | getSmallestLeafCount () const | ||
Get the count of the leaf with the smallest count. | |||
size_t | getLargestLeafCount () const | ||
Get the count in the leaf with the smallest count. | |||
virtual real | getLogLik (const size_t n) const | ||
Get this node's contribution to loglikelihood. | |||
virtual dotprecision | getSplitChangeLogLik () const | ||
Get change in log likelihood on split of this node. | |||
virtual cxsc::real | getUnscaledTreeLogLik () const | ||
Get the unscaled log likihood for the tree rooted at this. | |||
virtual dotprecision | getMergeChangeLogLik () const | ||
Get change in log likelihood on merge of this' leaf chidren. | |||
dotprecision | getBestSplitChangeEMPCOPERR (const size_t n) const | ||
Get best change in EMP under COPERR from splitting any leaf. | |||
dotprecision | getBestSplitChangeEMPAIC () const | ||
Get best change in EMP under AIC from splitting any leaf. | |||
dotprecision | getBestMergeChangeEMPCOPERR (const size_t n) const | ||
Get best change in EMP under COPERR from merging any subleaf. | |||
dotprecision | getBestMergeChangeEMPAIC () const | ||
Get best change in EMP under AIC from merging any subleaf. | |||
real | getEMPContributionCOPERR (const size_t n) const | ||
Get this node's scaled contribution to EMP under COPERR. | |||
real | getEMPContributionAIC (const size_t n) const | ||
Get this node's scaled contribution to EMP under AIC. | |||
dotprecision | getSplitChangeEMPCOPERR (const size_t n) const | ||
Get scaled change in sum term in EMP under COPERR on split. | |||
dotprecision | getSplitChangeEMPAIC () const | ||
Get change in sum term in EMP under AIC on split. | |||
dotprecision | getMergeChangeEMPCOPERR (const size_t n) const | ||
Get scaled change in sum term in EMP under COPERR on merge. | |||
dotprecision | getMergeChangeEMPAIC () const | ||
Get change in sum term in EMP under AIC on merge. | |||
void | reshapeToUnion (const SPnode &other) | ||
Reshape so that the tree rooted at this has shape that is the union of this shape and the shape of another tree. | |||
void | reshapeToUnion (const SPnode &other, size_t minChildPoints) | ||
Reshape so that the tree rooted at this has shape that is as close as possible to the union of this shape and the shape of another tree. | |||
virtual std::ostream & | nodePrint (std::ostream &os) const | ||
Output details of a specific node. | |||
virtual std::ostream & | leavesOutputTabsWithEMPs (const size_t bigN, std::ostream &os, int prec=5) const | ||
Output for for all leaves of a binary tree. | |||
virtual std::ostream & | leavesOutputTabsWithHistHeight (std::ostream &os, int prec=5) const | ||
Output for for all leaves of a binary tree. | |||
virtual std::ostream & | leavesOutputTabsWithHistHeight (const size_t bigN, std::ostream &os, int prec=5) const | ||
Output for for all leaves of a binary tree. | |||
virtual std::ostream & | leavesOutputTabsWithHistHeightAndEMPs (const size_t bigN, std::ostream &os, int prec=5) const | ||
Output for for all leaves of a binary tree. | |||
dotprecision | getEMPSumCOPERR (const size_t n) const | ||
Get scaled EMP sum under COPERR for tree rooted at this. | |||
dotprecision | getEMPSumAIC (const size_t n) const | ||
Get the unscaled EMP sum under AIC for tree rooted at this. | |||
virtual void | nodeExpand (int comp) | ||
Expand a leaf node. | |||
virtual void | nodeExpand (const SplitDecisionObj &boolTest, int comp) | ||
Expand a leaf node. | |||
virtual void | nodeExpand () | ||
Expand a leaf node. | |||
virtual void | nodeExpand (const SplitDecisionObj &boolTest) | ||
Expand a leaf node. | |||
virtual void | nodeReabsorbChildren () | ||
Reabsorbs both children of the node. | |||
bool | splitRootAtLeastToShapeSPS (std::vector< size_t > reqDepths, size_t minPoints=0) | ||
Try to get tree rooted at this into shape at least as deep as described by instruction reqDepths. | |||
bool | randomMCMCSplitRootAtLeastSPS (unsigned long int numLeaves, const MCMCPartitionGenerator &partitioner, RealMappedSPnode *rmsp, dotprecision &nlogn, int &ndepth, bool saveInstructions=false) | ||
bool | randomMCMCSplitRootAtLeastSPS (unsigned long int numLeaves, const MCMCPartitionGenerator &partitioner, RealMappedSPnode *rmsp, dotprecision &nlogn, int &ndepth, size_t minPoints, bool saveInstructions=false) | ||
bool | randomKnuthMCMCSplitRootAtLeastSPS (unsigned long int numLeaves, const MCMCPartitionGenerator &partitioner, RealMappedSPnode *rmsp, dotprecision &nlogn, int &ndepth, bool saveInstructions, const std::string &failureLogFilename="") | ||
bool | randomKnuthMCMCSplitRootAtLeastSPS (unsigned long int numLeaves, const MCMCPartitionGenerator &partitioner, RealMappedSPnode *rmsp, dotprecision &nlogn, int &ndepth, size_t minPoints, bool saveInstructions, const std::string &failureLogFilename="") | ||
SPSnode * | insertOneFind (BigDataItr newItr, OPERATIONS_ON childInd, const SplitDecisionObj &boolTest) | ||
Tries to insert data into the leaves of the tree rooted at this. | |||
const SPSnode * | findContainingNode (const rvector &pt, OPERATIONS_ON childInd=ON_PARENT) const | ||
Find the leaf node which would contain a data point. | |||
void | unionTreeStructure (const SPSnode *const rhs) | ||
cxsc::real | getL1Distance (const SPSnode *const other) const | ||
virtual void | clearAllDataHeld () const | ||
Clear all the data associated with the tree rooted at this. | |||
virtual void | setCountsOnly (bool setTo) | ||
Set an indicator controlling whether this maintains all available statistics or not. | |||
void | swapSPS (SPSnode &spn) | ||
Swap this and another node. | |||
virtual std::string | nodeStringSummary () const | ||
Get a string summary of this node's properties. | |||
virtual std::string | doubleCheckStringSummary () const | ||
Get a string summary of the properties of nodes in the tree rooted at this which can be used for checking and debugging. | |||
Accessors for links between the nodes. | |||
Note that pointers for parent, leftChild, and rightChild are not reference counted so there could potentially be problems with the use of returned pointers (for instance, being used to delete nodes). These pointers might be better implemented with boost::shared_ptr . | |||
SPSnode * | getParent () const | ||
Accessor for the parent of a node. | |||
SPSnode * | getLeftChild () const | ||
Accessor for the left child of a node. | |||
SPSnode * | getRightChild () const | ||
Accessor for the right child of a node. | |||
Get a container of all descendent leaf nodes. | |||
Will contain just this if this is a leaf.
| |||
SPSnodePtrs & | getLeaves (SPSnodePtrs &leaves) | ||
Returns pointers to non-const nodes. | |||
SPSnodeConstPtrs & | getConstLeaves (SPSnodeConstPtrs &leaves) const | ||
Returns pointers to const nodes. | |||
Get a container of all sub-leaf nodes. | |||
Sub-leaf nodes (aka 'cherry nodes) have at least one child but any child must be a leaf, ie sub-leaves are the parents only of leaf nodes. Will just contain this if this is a sub-leaf.
| |||
SPSnodePtrs & | getSubLeaves (SPSnodePtrs &subleaves) | ||
Returns pointers to non-const nodes. | |||
SPSnodeConstPtrs & | getConstSubLeaves (SPSnodeConstPtrs &subleaves) const | ||
Returns pointers to const nodes. | |||
Get the sample variance-covariance matrix. | |||
This calculates the sample variance-covariance matrix for the sample and returns it as a d*d-dimensional vector of reals, where d is the dimension of the data. cov(i,j) = [sumproduct(i,j)-sum(i)xsum(j)/counter]/(counter-1) cov(i,j) is at index i*d+j in the returned RealVec (indices from 0 to d*d-1, where d is the dimension of the data).
| |||
RealVec | getVarCovar () const | ||
RealVec & | getVarCovar (RealVec &varCovar) const | ||
Protected Member Functions | |||
virtual void | stripData () const | ||
virtual void | stripOptionalStatsOnly () | ||
virtual void | addOptionalStatsOnly () | ||
virtual void | splitData (const SplitDecisionObj &boolTest) | ||
Send the data associated with this down to children. | |||
virtual void | nodeExpansionOnly (int comp) | ||
Expand the node with no reallocation of data. | |||
bool | _splitAtLeastToShapeSPS (std::vector< size_t > &reqDepths, size_t myDepth, size_t minPoints) | ||
Try to get tree rooted at this into shape at least as deep as described by instruction reqDepths. | |||
bool | _randomMCMCSplitAtLeastSPS (unsigned long int numLeaves, const MCMCPartitionGenerator &partitioner, RealMappedSPnode *rmsp, size_t myDepth, dotprecision &nlogn, int &ndepth, size_t minPoints) | ||
int | levelsUpToNextRight () const | ||
bool | _randomKnuthMCMCSplitAtLeastSPS (unsigned long int p, unsigned long int q, std::stack< size_t > &lastPair, const MCMCPartitionGenerator &partitioner, RealMappedSPnode *rmsp, size_t myDepth, dotprecision &nlogn, int &ndepth, size_t minPoints, const std::string &failureLogFilename, bool across=false) | ||
void | _reshapeToUnion (const SPnode *const other) | ||
Internal method to reshape this to a union. | |||
bool | _reshapeToUnion (const SPnode *const other, size_t minChildPoints, const std::string &errorFilename) | ||
Internal method to reshape this to a union with a restriction of minChildPoints. | |||
virtual void | _getUnscaledTreeLogLik (dotprecision &nlogn, int &ndepth, int depth) const | ||
Internal method to accumulate the components of the unscaled log likelihood for a tree rooted at this. | |||
dotprecision & | accumulateLeafCountOverVol (dotprecision &sum) const | ||
Accumulate a sum of node counts divided by node volumes. | |||
void | accumulateNonEmptyBoxSummary (size_t &nNonEmptyBoxes, real &vNonEmptyBoxVolumes) const | ||
Accumulate summary information on non-empty box numbers and volumes. | |||
virtual std::ostream & | nodeDataPrint (std::ostream &os) const | ||
Print the data in a specified format. | |||
virtual std::ostream & | nodeMeanPrint (std::ostream &os) const | ||
Print the mean in a specified format. | |||
virtual std::ostream & | nodeVarCovarPrint (std::ostream &os) const | ||
Print the variance-covariance in a specified format. | |||
virtual std::ostream & | leafOutputTabs (std::ostream &os) const | ||
Output for a node in a binary tree, tab-delimited. | |||
virtual std::ostream & | leafOutputTabsWithEMPs (const size_t bigN, std::ostream &os, int prec=5) const | ||
Output for a node in a binary tree, tab-delimited. | |||
virtual std::ostream & | leafOutputTabsWithHistHeight (const size_t bigN, std::ostream &os, int prec=5) const | ||
Output for a node in a binary tree, tab-delimited. | |||
std::ostream & | leafOutputTabsWithHistHeightAndEMPs (const size_t bigN, std::ostream &os, int prec=5) const | ||
Output for a node in a binary tree, tab-delimited. | |||
virtual void | clearNodeData () const | ||
Clears the node's data collection. | |||
virtual void | recalculateStats (const rvector &newdata) const | ||
Recalculate summary statistics associated with node. | |||
virtual void | recalculateSums (const rvector &newdata) const | ||
Recalculate summary statistics associated with node. | |||
virtual void | recalculateSumProducts (const rvector &newdata) const | ||
Recalculate summary statistics associated with node. | |||
virtual void | recalculateOptionalStatsAndGatherData (NodeData &container) | ||
Recalculates optional summary statistics associated with node from scratch, using all the data currently held. | |||
virtual void | recalculateOptionalStatsForData (const NodeData &nodedata) const | ||
Protected Attributes | |||
size_t | spaceIndication | ||
An indication of the maximum number of data points a node needs to carry. | |||
int | splitDim | ||
Dimension the node's box has been split along. | |||
real | splitValue | ||
The value, on split dimension, where node's box was split. | |||
bool | countsOnly | ||
Determines the amount of statistical summary data in node. | |||
mutable data members. | |||
These data members are mutable to allow them to be modified by const functions as data passed to or through the node. Only leaf nodes have data associated with them but the recursively computable statistics, such as counter and sum, are maintained for all nodes. Thus when a data point is sent to the root node and progresses down the tree to find which leaf node it should be associated with, the counter is incremented and the data sum increased for each non-leaf node it passes through (ie, where it is contained in the box of that node but that node is not a leaf node so the box has been sub- divided and the datapoint continues on to one of the children). | |||
size_t | counter | ||
A counter for how many data points are covered by theBox. | |||
VecDotPrec | dpSums | ||
A container representing the sum of the data points covered by theBox. | |||
VecDotPrec | dpSumProducts | ||
A container representing the sumproduct matrix of the data points covered by theBox. | |||
NodeData | dataItrs | ||
A container for the association of data with a node. |
A derived class based on SPnode for processing sample data.
The base class SPnode is a node in the representation of a regular subpaving as a binary tree. A node represents a box (interval vector). SPnodes are linked together to form the tree. The initial box of the subpaving is the box represented by the root node of the tree. A box which has been split will be represented as node with 1 or 2 children.
A subpaving of [x] (union of non-overlapping subboxes of [x]) is represented by the leaves (degenerate/child-less) nodes in the tree.
The SPSnode class has additional data members for statistical data analysis. The SPSnode class is used to form a regular subpaving representing containers of sample data where some data-related criteria is used to determine when the subpaving should be bisected. For example, keeping the number of data points associated with a node below a specified maximum -- "a maximally statistically equivalent blocks" criterion.
Data points are of type csxc::rvector.
Leaves of the SPSnode class have data associated with them in the form of pointers to some big collection of sample data. If an SPSnode is bisected the data associated with it descends to its children, so that only leaf SPSnodes have data associated with them. However, "recursively-computable statistical summaries", such as, count, sum, etc, of the data which would be contained in the box an SPSnode represents are kept for all SPSnodes and continue to be updated when the SPSnode has children and data reaching the node is passed on to be finally associated with a leaf.
For an algebraic statistical formalisation of Fisher's ideas on recursively computable statistics see "Notions of Sufficiency" by S.L. Lauritzen, Contributed Paper, 44th Session of the International Statistical Institute, Madrid, Spain, September 12th--22nd, 1983.
By default, all recursively computable statistics provided are maintained in each SPSnode. However, since this uses memory and is not always needed, an SPSnode can be constructed to only maintain count statistics.
When a box is split along some dimension d, only the right (upper) subbox is closed on the actual split value (the midpoint of the interval which is the dth element in the interval vector/box to be bisected) whereas the left (lower) box has an open interval on this value. Thus [1 5] is subdivided to [1 3) and [3 5]. The parent SPnode class can ignore this point but it must be addressed if we are to be able to decide which subbox (left or right) a datapoint sitting exactly on the split value in the split dimension should descend to. This is an extension of the empirical distribution to a partition of the root box with the leaves of the subpaving.
The class also needs to know which dimension the box represented by a node was split on and what the split value was, so that data reaching the node can be correctly (ie, in cognisance of the open and closed intervals described above) associated with either the right or left child. These values therefore become data members of the class, with default values for leaf nodes.
SPSnode::SPSnode | ( | const ivector & | v, |
bool | cntOnly | ||
) |
Initialised constructor.
Throws a MalconstructedBox_Error if v has no practical (0 or negative) dimensions.
v | interval vector for the box this represents. |
cntOnly | an indicator for whether all available stats are maintained (cntOnly = false) or just counts (cntOnly = true). |
: SPnode(v), counter(0), splitDim(-1), splitValue(0.0), countsOnly(cntOnly) { //invokes the base class constructor with ivector // and then initialises additional data members //dpSums, a vector of dotprecision terms, is not initialised //dpSumProducts, similarly not initialised spaceIndication = static_cast<size_t>(defaultMaxPts); //reserve space - not sure if important - leave for moment dataItrs.reserve(spaceIndication); }
SPSnode::SPSnode | ( | const ivector & | v | ) | [explicit] |
Initialised constructor.
Throws a MalconstructedBox_Error if v has no practical (0 or negative) dimensions.
v | interval vector for the box this represents. |
: SPnode(v), counter(0), splitDim(-1), splitValue(0.0), countsOnly(false) { //invokes the base class constructor with ivector // and then initialises additional data members //dpSums, a vector of dotprecision terms, is not initialised //dpSumProducts, similarly not initialised spaceIndication = static_cast<size_t>(defaultMaxPts); //reserve space - not sure if important - leave for moment dataItrs.reserve(spaceIndication); }
SPSnode::SPSnode | ( | const ivector & | v, |
size_t | max, | ||
bool | cntOnly | ||
) |
Initialised constructor.
Throws a MalconstructedBox_Error if v has no practical (0 or negative) dimensions.
v | interval vector for the box this represents. |
max | an indication of the maximum number of datapoints that may eventually be associated with a leaf. This is used for efficiency only and does not constitute a maximum count for a node. |
cntOnly | an indicator for whether all available stats are maintained (cntOnly = false) or just counts (cntOnly = true). |
: SPnode(v), spaceIndication(max), counter(0), splitDim(-1), splitValue(0.0), countsOnly(cntOnly) { //invokes the base class constructor with ivector argument // and then initialises additional data members //dpSums, a vector of dotprecision terms, is not initialised //dpSumProducts, similarly not initialised //reserve space - not sure if important - leave for moment dataItrs.reserve(spaceIndication+1); }
SPSnode::SPSnode | ( | const ivector & | v, |
size_t | max | ||
) |
Initialised constructor.
Throws a MalconstructedBox_Error if v has no practical (0 or negative) dimensions.
v | interval vector for the box this represents. |
max | an indication of the maximum number of datapoints that may eventually be associated with a leaf. This is used for efficiency only and does not constitute a maximum count for a node. |
: SPnode(v), spaceIndication(max), counter(0), splitDim(-1), splitValue(0.0), countsOnly(false) { //invokes the base class constructor with ivector argument // and then initialises additional data members //dpSums, a vector of dotprecision terms, is not initialised //dpSumProducts, similarly not initialised //reserve space - not sure if important - leave for moment dataItrs.reserve(spaceIndication+1); }
SPSnode::SPSnode | ( | const LabBox & | lb, |
bool | cntOnly = false |
||
) | [explicit] |
Initialised constructor.
Throws a MalconstructedBox_Error if lb has a box with no practical dimensions.
lb | a labeled box which provides the box to be used for this. |
cntOnly | an indicator for whether all available stats are maintained (cntOnly = false) or just counts (cntOnly = true). |
: SPnode(lb), counter(0), splitDim(-1), splitValue(0.0), countsOnly(cntOnly) { //invokes the base class constructor with LabBox argument // and then initialises additional data members //dpSums, a vector of dotprecision terms, is not initialised //dpSumProducts, similarly not initialised spaceIndication = static_cast<size_t>(defaultMaxPts); //reserve space - not sure if important - leave for moment dataItrs.reserve(spaceIndication); }
SPSnode::SPSnode | ( | const LabBox & | lb, |
size_t | max, | ||
bool | cntOnly = false |
||
) |
Initialised constructor.
Throws a MalconstructedBox_Error if v has no practical (0 or negative) dimensions.
lb | a labeled box which provides the box to be used for this. |
max | an indication of the maximum number of datapoints that may eventually be associated with a leaf. This is used for efficiency only and does not constitute a maximum count for a node. |
cntOnly | an indicator for whether all available stats are maintained (cntOnly = false) or just counts (cntOnly = true). |
: SPnode(lb), spaceIndication(max), counter(0), splitDim(-1), splitValue(0.0), countsOnly(cntOnly) { //invokes the base class constructor with LabBox argument //and then initialises additional data members //dpSums, a vector of dotprecision terms, is not initialised //dpSumProducts, similarly not initialised //reserve space - not sure if important - leave for moment dataItrs.reserve(spaceIndication+1); }
SPSnode::SPSnode | ( | const SPSnode & | other | ) |
Copy constructor.
Note that if the node to be copied is a leaf, and therefore has a container of iterators to some data collection, then the node created by a copy constructor will simply copy that container of iterators - ie, both the original and copy are associated with the same data collection. If the node is contained within by some other object that also manages the data collection, it is advisable to ensure that copying that other object creates a different data collection and that the leaf nodes of the copied subpaving tree reference that new data collection.
other | the node to be copied to construct this. |
: spaceIndication(other.spaceIndication), counter(other.counter), dpSums(other.dpSums), dpSumProducts(other.dpSumProducts), splitDim(other.splitDim), splitValue(other.splitValue), countsOnly(other.countsOnly) { if (other.theBox != NULL) { theBox = new ivector( other.getBox() ); } nodeName = other.nodeName; //reserve space dataItrs.reserve((other.dataItrs).size()); //copy dataItrs from other to this dataItrs = other.dataItrs; //recursion on the children if (other.leftChild) { nodeAddLeft(new SPSnode(*(other.getLeftChild()))); } else leftChild=NULL; if (other.rightChild) { nodeAddRight(new SPSnode(*(other.getRightChild()))); } else rightChild=NULL; }
bool SPSnode::_reshapeToUnion | ( | const SPnode *const | other, |
size_t | minChildPoints, | ||
const std::string & | errorFilename | ||
) | [protected] |
Internal method to reshape this to a union with a restriction of minChildPoints.
minChildPoints | is the minimum child points to allow for a split (as usual, if one child gets all the points and the number of points in this is >= minChildPoints, the split is allowed). |
errorFileName | is a file to which to log messages about nodes which could not be split because of minChildPoints. |
{ // indictator for being able to do union exactly bool success = true; if ( other != NULL && !(other->isEmpty()) ) { // this is not a leaf, other is a leaf if (!isLeaf() && other->isLeaf()) { // no need to do anything } // this is a leaf, other is not a leaf if (isLeaf() && !other->isLeaf()) { //we need to expand this if (isSplittableNode(minChildPoints)) nodeExpand(); else { success = false; // log file std::string line = "Could not split " + getNodeName() + " because of minChildPoints"; outputFile(errorFilename, line); } } // now recurse on the children if both have children // note - it won't go here is !success because still isLeaf() if (!isLeaf() && !other->isLeaf()) { success = getLeftChild()->_reshapeToUnion( other->getLeftChild(), minChildPoints, errorFilename); success = getRightChild()->_reshapeToUnion( other->getRightChild(), minChildPoints, errorFilename) && success; } return success; } }
bool SPSnode::_splitAtLeastToShapeSPS | ( | std::vector< size_t > & | reqDepths, |
size_t | myDepth, | ||
size_t | minPoints | ||
) | [protected] |
Try to get tree rooted at this into shape at least as deep as described by instruction reqDepths.
Tries to ensures that this has at least shape of reqDepths, but allows the tree rooted at this to also be more split at some or all nodes.
This can only follow the instruction if the nodes it is required to split are splittable nodes, with respect to both width and minPoints (to be split a node must return isSplittableNode(minPoints) = true).
reqDepths | a collection of leaf levels to try to split to. |
myDepth | the depth this node is in the tree. |
minPoints | the minimum child points used to determine whether this is splittable. |
{ if(reqDepths.empty()) throw std::invalid_argument( "SPSnode::_splitAtLeastToShapeSPS(...): : reqDepths.empty()"); bool success = true; int depth = reqDepths.back(); if (myDepth < depth) { // need to try to go down more if ( isLeaf() ) { if (isSplittableNode(minPoints)) { // split, using the nodeExpand for this subtype if not base nodeExpand(); } else success = false; } // if we are okay, send instruction down RIGHT SIDE FIRST if (success) success = getRightChild()->_splitAtLeastToShapeSPS(reqDepths, myDepth+1, minPoints); if (success) success = getLeftChild()->_splitAtLeastToShapeSPS(reqDepths, myDepth+1, minPoints); } else if (myDepth == depth) { // split enough /* I am not necessarily a leaf though - allow for mcmc situation * where I want to be split to at least this much */ // knock the last element out reqDepths.erase(reqDepths.begin()+reqDepths.size()-1); } else { // myDepth is > depth throw std::invalid_argument( "SPSnode::_splitAtLeastToShapeSPS(...): node depth > instruction depth"); } return success; }
dotprecision & SPSnode::accumulateLeafCountOverVol | ( | dotprecision & | sum | ) | const [protected] |
Accumulate a sum of node counts divided by node volumes.
sum | a reference to an accumulation of the sum so far. |
{ if (isLeaf()) { // this is a leaf accumulate(sum, 1.0*counter, (1.0/nodeRealVolume())); } else { // this is not a leaf sum = getLeftChild()->accumulateLeafCountOverVol(sum); sum = getRightChild()->accumulateLeafCountOverVol(sum); } return sum; }
void SPSnode::accumulateNonEmptyBoxSummary | ( | size_t & | nNonEmptyBoxes, |
real & | vNonEmptyBoxVolumes | ||
) | const [protected] |
Accumulate summary information on non-empty box numbers and volumes.
nNonEmptyBoxes | a reference to an accumulation the number of non-empty boxes so far. |
vNonEmptyBoxVolumes | a reference to an accumulation the volume of non-empty boxes so far. |
{ if (isLeaf()) { if (counter > 0) { ++nNonEmptyBoxes; vNonEmptyBoxVolumes+=nodeRealVolume(); } } else { // recurse getLeftChild()->accumulateNonEmptyBoxSummary(nNonEmptyBoxes, vNonEmptyBoxVolumes); getRightChild()->accumulateNonEmptyBoxSummary(nNonEmptyBoxes, vNonEmptyBoxVolumes); } // we want to count all empty leaves even if parent is empty too }
void SPSnode::addOptionalStatsOnly | ( | ) | [protected, virtual] |
Recalculate the values held to enable the calculation of optional statistics, and set countsOnly to false, in this and the children of this.
{ if (countsOnly) { NodeData container; recalculateOptionalStatsAndGatherData(container); } }
bool SPSnode::checkTreeStateLegal | ( | ) | const [virtual] |
Check tree rooted at this is legal with respect to isSplittableNode().
'Legal' means that all non-leaf nodes in the tree are splittable, ie return isSplittableNode() = true;
Reimplemented from subpavings::SPnode.
{ return SPnode::checkTreeStateLegal(); }
bool SPSnode::checkTreeStateLegal | ( | size_t | minChildPoints, |
double | minVol | ||
) | const |
Check tree rooted at this is legal with respect to isSplittableNode(size_t minChildPoints, double minVol).
'Legal' means that all non-leaf nodes in the tree are splittable, ie return isSplittableNode(size_t minChildPoints, double minVol) = true;
minChildPoints | is the minimum number of points that there would be in the children if the node were to be split. |
minVol | is the minimum node volume to be tested for. |
{ // check current state is legal by looking at everything not a leaf bool legal = true; if ( !isLeaf() ) { // to be splittable, need at least 2*min vol in this node, so each child has min vol legal = isSplittableNode(minChildPoints, minVol); if (legal && hasLCwithBox() ) { legal = getLeftChild()->checkTreeStateLegal(minChildPoints, minVol); } if (legal && hasRCwithBox() ) { legal = getRightChild()->checkTreeStateLegal(minChildPoints, minVol); } } return legal; }
bool SPSnode::checkTreeStateLegal | ( | size_t | minChildPoints | ) |
Check tree rooted at this is legal with respect to isSplittableNode(size_t minChildPoints).
'Legal' means that all non-leaf nodes in the tree are splittable, ie return isSplittableNode(size_T minChildPoints) = true;
minChildPoints | is the minimum number of points that there would be in the children if the node were to be split. |
{ // check current state is legal by looking at everything not a leaf bool legal = true; if ( !isLeaf() ) { legal = isSplittableNode(minChildPoints); if (legal && hasLCwithBox() ) { legal = getLeftChild()->checkTreeStateLegal(minChildPoints); } if (legal && hasRCwithBox() ) { legal = getRightChild()->checkTreeStateLegal(minChildPoints); } } return legal; }
void SPSnode::clearAllDataHeld | ( | ) | const [virtual] |
Clear all the data associated with the tree rooted at this.
Calling this method resets counters to 0, discards values held to enable the computation of optional statistics, and discards data associated with root nodes. The structure of the tree rooted at this is unchanged.
Throws a NonRootNode_Error if this is not a root node.
{ if (getParent() != NULL) { throw NonRootNode_Error("SPSnode::clearAllDataHeld()"); } stripData(); // clears recursively on this and children }
void SPSnode::clearNodeData | ( | ) | const [protected, virtual] |
std::string SPSnode::doubleCheckStringSummary | ( | ) | const [virtual] |
Get a string summary of the properties of nodes in the tree rooted at this which can be used for checking and debugging.
Includes the descendents of this node.
Shows the addresses of the values pointed to by the dataItrs associated with leaf nodes, rather than values themselves.
{ std::ostringstream oss; oss << "I am " << getNodeName() << "(address " << this << "),\n"; oss << "Dimension is " << getDimension() << ", address of box is " << theBox << "\n"; oss << "spaceIndication is " << spaceIndication << ", splitDim is " << splitDim << ", splitValue is " << splitValue << "\n"; oss << "countsOnly is " << countsOnly << ", counter is " << counter << "\n"; oss << "address of dataItrs is " << &dataItrs; if (isLeaf() ) { oss << " and dataItrs (giving addresses pointed to by iterators) is " << endl; for (NodeDataConstItr cit = dataItrs.begin(); cit < dataItrs.end(); ++cit) { oss << &(**cit) << "\t"; } } oss << "\n" << endl; if ( getLeftChild() ) oss << getLeftChild()->doubleCheckStringSummary() << endl; if ( getRightChild() ) oss << getRightChild()->doubleCheckStringSummary() << endl; return oss.str(); }
const SPSnode * SPSnode::findContainingNode | ( | const rvector & | pt, |
OPERATIONS_ON | childInd = ON_PARENT |
||
) | const |
Find the leaf node which would contain a data point.
No data is actually inserted.
Throws a NoBox_Error if this has no box.
pt | a data point. |
childInd | an indicator for whether the current node is a treated as a left or right child or a root. Defaults to ON_PARENT. |
{ // start at the top if (isEmpty()) { throw NoBox_Error( "SPSnode::findContainingNode(const rvector&, OPERATIONS_ON)"); } const SPSnode* retObj = NULL; if(nodeContains(pt, childInd)) { if(isLeaf()) { // give this node as return value retObj = this; } // end of isLeaf // if not a leaf and contains data // recurse on the children if any else { if(rightChild!=NULL && !rightChild->isEmpty()){ retObj = (getRightChild())->findContainingNode( pt, ON_RIGHT); } // only try left if we did not find on the right if(retObj == NULL && leftChild!=NULL && !leftChild->isEmpty()) { retObj = (getLeftChild())->findContainingNode( pt, ON_LEFT); } } } // end if node contains // will return null if does not contain the data return retObj; }
dotprecision SPSnode::getBestMergeChangeEMPAIC | ( | ) | const |
Get best change in EMP under AIC from merging any subleaf.
n, the value to use for scaling, cancels out of change.
Finding the best merge change only makes sense for a node that is not already a leaf: this method throws an UnfulfillableRequest_Error if called on a leaf node.
{ if (isLeaf() ) { throw subpavings::UnfulfillableRequest_Error( "SPSnode::getBestMergeChangeEMPAIC()"); } if (isSubLeaf() ) { return getMergeChangeEMPAIC(); } else { if (getLeftChild()->isLeaf()) { return getRightChild()->getBestMergeChangeEMPAIC(); } else if (getRightChild()->isLeaf()) { return getLeftChild()->getBestMergeChangeEMPAIC(); } else { dotprecision myChangeL = getLeftChild()->getBestMergeChangeEMPAIC(); dotprecision myChangeR = getRightChild()->getBestMergeChangeEMPAIC(); return (myChangeL < myChangeR ? myChangeL : myChangeR); } } }
dotprecision SPSnode::getBestMergeChangeEMPCOPERR | ( | const size_t | n | ) | const |
Get best change in EMP under COPERR from merging any subleaf.
Finding the best merge change only makes sense for a node that is not already a leaf: this method throws an UnfulfillableRequest_Error if called on a leaf node.
n | the value to use for scaling, the total number of points. |
{ if (isLeaf() ) { throw subpavings::UnfulfillableRequest_Error( "SPSnode::getBestMergeChangeEMPCOPERR(const size_t)"); } if (isSubLeaf() ) { return getMergeChangeEMPCOPERR(n); } else { if (getLeftChild()->isLeaf()) { return getRightChild()->getBestMergeChangeEMPCOPERR(n); } else if (getRightChild()->isLeaf()) { return getLeftChild()->getBestMergeChangeEMPCOPERR(n); } else { dotprecision myChangeL = getLeftChild()->getBestMergeChangeEMPCOPERR(n); dotprecision myChangeR = getRightChild()->getBestMergeChangeEMPCOPERR(n); return (myChangeL < myChangeR ? myChangeL : myChangeR); } } }
dotprecision SPSnode::getBestSplitChangeEMPAIC | ( | ) | const |
Get best change in EMP under AIC from splitting any leaf.
n, the value to use for scaling, cancels out of change.
{ if (isLeaf() ) { return getSplitChangeEMPAIC(); } else { dotprecision myChangeL = getLeftChild()->getBestSplitChangeEMPAIC(); dotprecision myChangeR = getRightChild()->getBestSplitChangeEMPAIC(); return (myChangeL < myChangeR ? myChangeL : myChangeR); } }
dotprecision SPSnode::getBestSplitChangeEMPCOPERR | ( | const size_t | n | ) | const |
Get best change in EMP under COPERR from splitting any leaf.
n | the value to use for scaling, the total number of points. |
{ if (isLeaf() ) { return getSplitChangeEMPCOPERR(n); } else { dotprecision myChangeL = getLeftChild()->getBestSplitChangeEMPCOPERR(n); dotprecision myChangeR = getRightChild()->getBestSplitChangeEMPCOPERR(n); return (myChangeL < myChangeR ? myChangeL : myChangeR); } }
Size_tVec & SPSnode::getChildrensLeftAndRightCountsIfSplit | ( | Size_tVec & | grandchildCounts | ) | const |
return a container of counts for prospective grandchildren.
Should be called only on leaf nodes.
returns an indexable container of the number of points the prospective children of each prospective child (ie all four prospective grandchildren) would be associated with, indexed like this [0] = left child's left child count, [1] = left child's rght child count, [2] = rght child's left child count, [3] = rght child's rght child count,
grandchildCounts | a reference to a container to be filled with the prospective grandchild counts |
{ // first find what the children's boxes would be would be int splitMe; // variable to hold first longest dimension ivector box = getBox(); double temp1 = MaxDiam(box, splitMe); // ivectors to be new boxes for new children ivector rCBox; ivector lCBox; // Call Upper() to get what would be the right hand child box Upper(box, rCBox, splitMe); // Call Lower() to get what would be the left hand child box Lower(box, lCBox, splitMe); // mid point of my box on first longest dimension cxsc::real midSplit = cxsc::mid(box[splitMe]); // and if those children were split // left Child int splitChildren = MaxDiamComp(lCBox); cxsc::real midSplitLC = cxsc::mid(lCBox[splitChildren]); // right child // will split on the same dimension as LC cxsc::real midSplitRC = cxsc::mid(rCBox[splitChildren]); // now find how many of this node's data points would go right // and left children of left and right children size_t rightRightCount = 0; size_t rightLeftCount = 0; size_t leftRightCount = 0; size_t leftLeftCount = 0; NodeDataItr it; for (it = dataItrs.begin(); it < dataItrs.end(); it++) { // DataItrs is a container of iterators to a BigDataCollection rvector p = **it; // increment left child? if ( p[splitMe] < midSplit ) { if ( p[splitChildren] < midSplitLC ) leftLeftCount++; else rightLeftCount++; } else { // on right of me if ( p[splitChildren] < midSplitRC) leftRightCount++; else rightRightCount++; } } grandchildCounts.push_back(leftLeftCount); grandchildCounts.push_back(rightLeftCount); grandchildCounts.push_back(leftRightCount); grandchildCounts.push_back(rightRightCount); return grandchildCounts; }
NodeData SPSnode::getData | ( | ) | const |
Accessor for the node's data collection.
Returns a copy of the node's collection of iterators to the big data set.
{ return dataItrs; }
real SPSnode::getEMPContributionAIC | ( | const size_t | n | ) | const |
Get this node's scaled contribution to EMP under AIC.
Under AIC, EMP is -1 x sum over leaves of (counts in leaf x ln(count in leaf / (n * vol of leaf))) where n is the total number of data points in the histogram. And this is -1 * the node's contribution to the loglikelihood of the data given the current state.
The EMP of an entire tree of nodes comes only from the contributions of the current leaf nodes, but this method will not throw an exception if called on a non-leaf node. This allows 'what-if' type calculations to be carried out (eg what if the node reabsorbed its children)...
n | the value to use for scaling, the total number of points. |
real SPSnode::getEMPContributionCOPERR | ( | const size_t | n | ) | const |
Get this node's scaled contribution to EMP under COPERR.
Under COPERR, EMP is -1/n^2 x sum over leaves of (counts in leaf squared / volume of leaf) where n is the total number of data points in the histogram
The EMP of an entire tree of nodes comes only from the contributions of the current leaf nodes, but this method will not throw an exception if called on a non-leaf node. This allows 'what-if' type calculations to be carried out (eg what if the node reabsorbed its children)...
n | the value to use for scaling, the total number of points. |
{ // current number of data points associated to node is counter // current node volume from nodeVolume, and each child will have half #ifdef EMPSDEBUG cout << "in getEMPContributionCOPERR, n = " << n << endl; #endif dotprecision contribution(0.0); if ((n > 0) && (counter > 0)) { accumulate(contribution, -(1.0*counter)/(1.0*n), (1.0*counter)/((n*1.0)*nodeRealVolume())); } // contribution is -counter^2/(n^2 * vol) // default cxsc rounding to nearest #ifdef EMPSDEBUG cout << "EMPContributionCOPERR = " << rnd(contribution) << endl; #endif //return contribution; return rnd(contribution); }
dotprecision SPSnode::getEMPSumAIC | ( | const size_t | n | ) | const |
Get the unscaled EMP sum under AIC for tree rooted at this.
n | the total number of points in the histogram, for scaling |
{ dotprecision retValue; retValue = 0.0; // uses member function getEMPContributionAIC for leaf result if (!(isEmpty()) && isLeaf()) { // this is a non-empty leaf retValue = getEMPContributionAIC(n); } //recurse on the children if (getLeftChild()!=NULL) { retValue+=getLeftChild()->getEMPSumAIC(n); } if (getRightChild()!=NULL) { retValue+=getRightChild()->getEMPSumAIC(n); } return retValue; }
dotprecision SPSnode::getEMPSumCOPERR | ( | const size_t | n | ) | const |
Get scaled EMP sum under COPERR for tree rooted at this.
n | the total number of points in the histogram, for scaling |
{ dotprecision retValue; retValue = 0.0; // uses member function getEMPContributionCOPERR for leaf value if (!(isEmpty()) && isLeaf()) { // this is a non-empty leaf retValue = getEMPContributionCOPERR(n); } //recurse on the children if (getLeftChild()!=NULL) { retValue = retValue + getLeftChild()->getEMPSumCOPERR(n); } if (getRightChild()!=NULL) { retValue = retValue + getRightChild()->getEMPSumCOPERR(n); } return retValue; }
cxsc::real SPSnode::getL1Distance | ( | const SPSnode *const | other | ) | const |
Gets the L1 distance between this and another subpaving.
The L1 distance is defined as the sum of the absolute values of the differences in 'area' represented by the leaf nodes of this and the other paving. The 'area' represented by a leaf node is the proportion of the total count of the tree the leaf is part of that is in that leaf node.
Throws the following exceptions:
other | a pointer to the root of SPSnode tree to calculate the L1 distance against. |
{ cxsc::real retValue(0.0); if ( other == NULL ) { throw NullSubpavingPointer_Error( "SPSnode::getL1Distance(const SPSnode * const)"); } if ( isEmpty() || other->isEmpty() ) { throw NoBox_Error( "SPSnode::getL1Distance(const SPSnode * const)"); } if ( getBox() != other->getBox() ) { throw IncompatibleDimensions_Error( "SPSnode::getL1Distance(const SPSnode * const)"); } std::size_t thisBigN = getCounter(); std::size_t otherBigN = other->getCounter(); // if both BigN's are zero, L1 distance is 0 // else if thisBigN is 0 or other bigN is 0 , L1 distance will be 1 if( (thisBigN + otherBigN)) { // at least one of them has some data if (thisBigN * otherBigN) { // both have some data cxsc::dotprecision retDP(retValue); retDP = _getL1distance(retDP, other, thisBigN, otherBigN); retValue = cxsc::rnd(retDP); } else { // only one has some data retValue += 1.0; } } return retValue; }
size_t SPSnode::getLargestLeafCount | ( | ) | const |
Get the count in the leaf with the smallest count.
Returns the count of the largest (by count) leaf node.
{ if (isLeaf() ) { return getCounter(); } else { size_t myL = getLeftChild()->getLargestLeafCount(); size_t myR = getRightChild()->getLargestLeafCount(); return (myL > myR ? myL : myR); } }
Size_tVec & SPSnode::getLeafNodeCounts | ( | Size_tVec & | counts | ) | const |
Fills in container of leaf counts, left to right.
Traverses the leaves left to right, puts the leaf counts into container.
counts | is reference to the container to fill in. |
{ if (getLeftChild() != NULL) { getLeftChild()->getLeafNodeCounts(counts); } if (getRightChild() != NULL) { getRightChild()->getLeafNodeCounts(counts); } if (isLeaf()) { counts.push_back(counter); } return counts; }
SPSnodePtrs & SPSnode::getLeaves | ( | SPSnodePtrs & | leaves | ) |
Returns pointers to non-const nodes.
{ //if children, recurse on the children if (hasLCwithBox()) { getLeftChild()->getLeaves(leaves); } if (hasRCwithBox()) { getRightChild()->getLeaves(leaves); } if ( isLeaf() ) { // this is a leaf leaves.push_back(this); } return leaves; }
SPSnode * SPSnode::getLeftChild | ( | ) | const |
Accessor for the left child of a node.
Returns a pointer to leftChild node.
Reimplemented from subpavings::SPnode.
size_t SPSnode::getLeftCountIfSplit | ( | ) | const |
The count the left child would have if this node was split.
Does not split the nodes, just calculates how many of the data points currently associated with this node would go to the left child if the node were to be split.
Note that the left child's interval on the split dimension would be an open interval.
{ size_t leftCount = 0; if (isLeaf()) { // first find what the dimension for the split would be // if the split were made // right hand child's box would be if that child // were to be created cxsc::ivector box = getBox(); int split = MaxDiamComp(box); cxsc::real midSplit = cxsc::mid(box[split]); // left child would have everything up to but not including // midSplit, on the split dimension NodeDataItr it; for (it = dataItrs.begin(); it < dataItrs.end(); it++) { // DataItrs is a container of iterators to a BigDataCollection // increment rightCount if the point is in rC if( (**it)[split] < midSplit ) leftCount++; } } else { // already split leftCount = getLeftChild()->getCounter(); } return leftCount; }
real SPSnode::getLogLik | ( | const size_t | n | ) | const [virtual] |
Get this node's contribution to loglikelihood.
A leaf node's contribution to the log likelihood of overall state given the data is (count in leaf x ln(count in leaf / (n * vol of leaf))) where n is the total number of data points in the histogram.
The loglikelihood of an entire tree of nodes comes only from the contributions of the current leaf nodes, but this method will not throw an exception if called on a non-leaf node. This allows 'what-if' type calculations to be carried out (eg what if the node reabsorbed its children)...
n | the value to use for scaling, the total number of points. |
{ // contribution to loglikelihood is counter*log(counter/(n * vol)) dotprecision contribution(0.0); if ((n > 0) && (counter > 0)) { accumulate(contribution, 1.0*counter, log(1.0*counter)); accumulate(contribution, -1.0*counter, log(1.0*n)); real nv = nodeRealVolume(); if (nv < cxsc::MinReal) { accumulate(contribution, -1.0*counter, log(nodeVolume())); } else { accumulate(contribution, -1.0*counter, ln(nv)); } } // default cxsc rounding to nearest return rnd(contribution); }
rvector SPSnode::getMean | ( | ) | const |
Get the sample mean.
{ int dimension = getDimension(); // set up an rvector retMean of the correct dimensions rvector retMean(dimension); // loop through the elements in the dpSums vector for (int i = 0; i< dimension; i++) { // if no data elements each element or if only counts are held, // that element of the mean is cxsc::SignalingNaN if (countsOnly || (counter == 0)) { // cxsc::rvector is indexed 1 to n retMean[i+1] = cxsc::SignalingNaN; } // if data elements, find the element-by-element mean else { // default cxsc rounding dotprecision rnd_next retMean[i+1] = rnd(dpSums[i])/(1.0*counter); } }// end loop through the elements in dpSums return retMean; }
dotprecision SPSnode::getMergeChangeEMPAIC | ( | ) | const |
Get change in sum term in EMP under AIC on merge.
Under AIC, EMP is -1 x sum over leaves of (counts in leaf x ln(count in leaf / (n * volume of leaf))) where n is the total number of data points in the histogram.
The merge change only makes sense for a node that is not already a leaf: this method throws an UnfulfillableRequest_Error if called on a leaf node.
{ if (isLeaf() ) { throw subpavings::UnfulfillableRequest_Error( "SPSnode::getMergeChangeEMPAIC()"); } return -getMergeChangeLogLik(); }
dotprecision SPSnode::getMergeChangeEMPCOPERR | ( | const size_t | n | ) | const |
Get scaled change in sum term in EMP under COPERR on merge.
Under COPERR, EMP is -1/n^2 x sum over leaves of (counts in leaf squared / (n * volume of leaf)) where n is the total number of data points in the histogram.
The merge change only makes sense for a node that is not already a leaf: this method throws an UnfulfillableRequest_Error if called on a leaf node.
n | the value to use for scaling, the total number of points. |
{ if (isLeaf() ) { throw subpavings::UnfulfillableRequest_Error( "SPSnode::getMergeChangeEMPCOPERR(const size_t)"); } // change is 1/(n^2 * vol) x (2(lc_count^2 + rc_count^2) - counter^2) // Change is scaled by n, total points in histogram dotprecision change; change = 0.0; if ((n > 0) && (counter > 0)) { real nv = nodeRealVolume(); if (nv < cxsc::MinReal) { change = cxsc::Infinity; } else { // first find what the left hand child's counter is size_t leftCount = getLeftChild()->getCounter(); // and right child size_t rightCount = counter - leftCount; // current number of data points associated to node is counter // current node volume from nodeVolume, and each child will have half accumulate(change, (1.0*leftCount)/(1.0*n), (2.0*leftCount)/(n*nodeVolume())); accumulate(change, (1.0*(rightCount))/(1.0*n), (2.0*(rightCount))/(n*nodeVolume())); accumulate(change, -(1.0*counter)/(n*nodeVolume()), (1.0*counter)/(1.0*n)); } } return change; }
dotprecision SPSnode::getMergeChangeLogLik | ( | ) | const [virtual] |
Get change in log likelihood on merge of this' leaf chidren.
log likelihood is sum over leaves of (counts in leaf x ln(count in leaf / (n * volume of leaf))) where n is the total number of data points in the histogram.
The merge change only makes sense for a node that is not already a leaf: this method throws an UnfulfillableRequest_Error if called on a leaf node.
{ if (isLeaf() ) { throw subpavings::UnfulfillableRequest_Error( "SPSnode::getMergeChangeLogLik()"); } dotprecision change(0.0); // if counter is 0 there can be no change on merging if (counter > 0) { // first find what the left hand child's counter is size_t leftCount = getLeftChild()->getCounter(); // and right child size_t rightCount = getRightChild()->getCounter(); // change is (count*ln(count) // - (lc_count*ln(lc_count) + rc_count*ln(rc_count) + // count*ln(2)) // note that the terms involving the total count in the histogram // and the volume of this node cancel so this change // is effectively scaled and does not need to use n dotprecision currentEMP(0.0); dotprecision childEMP(0.0); if (leftCount > 0) accumulate(childEMP, 1.0*leftCount, log(1.0*leftCount)); if (rightCount > 0) accumulate(childEMP, 1.0*rightCount, log(1.0*rightCount)); accumulate(childEMP, 1.0*counter, log(2.0)); accumulate(currentEMP, 1.0*counter, log(1.0*counter)); change = currentEMP - childEMP; } return change; }
size_t SPSnode::getMinChildCountIfSplit | ( | ) | const |
Smallest number of points in either child if this was split.
Does not split the nodes, just calculates how many of the data points currently associated with this node would go to the left and right child if the node were to be split.
{ size_t min = getLeftCountIfSplit(); if ((counter - min) < min) min = counter - min; return min; }
std::pair< size_t, cxsc::real > SPSnode::getNonEmptyBoxSummary | ( | ) | const |
Get summary information on non-empty leaf box numbers and volumes.
{ if(isEmpty()) throw NoBox_Error("SPSnode::getNonEmptyBoxSummary"); size_t nNonEmptyBoxes = 0; real vNonEmptyBoxVolumes(0.0); accumulateNonEmptyBoxSummary(nNonEmptyBoxes, vNonEmptyBoxVolumes); return std::pair <size_t, real >(nNonEmptyBoxes, vNonEmptyBoxVolumes/nodeRealVolume()); }
SPSnode * SPSnode::getParent | ( | ) | const |
Accessor for the parent of a node.
Returns a pointer to parent node.
Reimplemented from subpavings::SPnode.
SPSnode * SPSnode::getRightChild | ( | ) | const |
Accessor for the right child of a node.
Returns a pointer to rightChild node.
Reimplemented from subpavings::SPnode.
{ return (SPSnode*) rightChild; }
size_t SPSnode::getRightCountIfSplit | ( | ) | const |
The count the right child would have if this node was split.
Does not split the nodes, just calculates how many of the data points currently associated with this node would go to the right child if the node were to be split.
Note that the left child's interval on the split dimension would be a closed interval.
{ size_t rightCount = 0; if (isLeaf()) { rightCount = counter - getLeftCountIfSplit(); } else { rightCount = getRightChild()->getCounter(); } }
size_t SPSnode::getSmallestLeafCount | ( | ) | const |
Get the count of the leaf with the smallest count.
Returns the count in the smallest (by count) leaf node.
{ if (isLeaf() ) { return getCounter(); } else { size_t myL = getLeftChild()->getSmallestLeafCount(); size_t myR = getRightChild()->getSmallestLeafCount(); return (myL < myR ? myL : myR); } }
dotprecision SPSnode::getSplitChangeEMPAIC | ( | ) | const |
Get change in sum term in EMP under AIC on split.
Under AIC, EMP is -1 x sum over leaves of (counts in leaf x ln(count in leaf / (n * volume of leaf))) where n is the total number of data points in the histogram.
The split change only makes sense for a leaf node, but the calculation can be carried out for a non-leaf and so this method will not throw an exception if called on a non-leaf node.
{ return -getSplitChangeLogLik(); }
dotprecision SPSnode::getSplitChangeEMPCOPERR | ( | const size_t | n | ) | const |
Get scaled change in sum term in EMP under COPERR on split.
Under COPERR, EMP is -1/n^2 x sum over leaves of (counts in leaf squared / (n * volume of leaf)) where n is the total number of data points in the histogram.
The split change only makes sense for a leaf node, but the calculation can be carried out for a non-leaf and so this method will not throw an exception if called on a non-leaf node.
n | the value to use for scaling, the total number of points. |
{ #ifdef EMPSDEBUG cout << "in getSplitChangeEMPCOPERR, n = " << n << endl; #endif // change is 1/(n^2 * vol) x (counter^2 - 2(lc_count^2 + rc_count^2)) // if we split and lc_count, rc_count were the new counts in // left and right children respectively // Change is scaled by n, total points in histogram dotprecision change; change = 0.0; #ifdef EMPSDEBUG cout << "counter = = " << counter << endl; #endif if ((n > 0) && (counter > 0)) { // first find what the left hand child's counter would be if that child // were to be created size_t leftCount = getLeftCountIfSplit(); #ifdef EMPSDEBUG cout << "leftCount = " << leftCount << endl; #endif // current number of data points associated to node is counter // current node volume from nodeVolume, and each child will have half real nv = nodeRealVolume(); #ifdef EMPSDEBUG cout << "node volume = " << nv << endl; cout << "1.0/((n*1.0)*nv) = " << (10.0/((n*1.0)*nv)) << endl; cout << "1.0/((n*1.0)*MinReal) = " << (10.0/((n*1.0)*MinReal)) << endl; #endif if (nv < cxsc::MinReal) { change = -cxsc::Infinity; } else { accumulate(change, (1.0*counter)/((n*1.0)*nv), (1.0*counter)/(1.0*n)); #ifdef EMPSDEBUG cout << "change stage 1 rnd(change) = " << rnd(change) << endl; #endif accumulate(change, (1.0*leftCount)/(1.0*n), -(2.0*leftCount)/((n*1.0)*nv)); #ifdef EMPSDEBUG cout << "change stage 2 = " << rnd(change) << endl; #endif accumulate(change, (1.0*(counter - leftCount))/(1.0*n), -(2.0*(counter - leftCount))/((1.0*n)*nv)); #ifdef EMPSDEBUG cout << "change stage 3 = " << rnd(change) << endl; #endif } } #ifdef EMPSDEBUG cout << "returning rnd(change) = " << rnd(change) << endl; #endif return change; }
dotprecision SPSnode::getSplitChangeLogLik | ( | ) | const [virtual] |
Get change in log likelihood on split of this node.
log likelihood is sum over leaves of (counts in leaf x ln(count in leaf / (n * volume of leaf))) where n is the total number of data points in the histogram.
The split change loglikelihood only makes sense for a leaf node, but the calculation can be carried out for a non-leaf and so this method will not throw an exception if called on a non-leaf node.
{ dotprecision change(0.0); // if counter is 0 there can be no change on splitting if (counter > 0) { // first find what the left hand child's counter would be if // that child were to be created size_t leftCount = getLeftCountIfSplit(); // current number of data points associated to node is counter size_t rightCount = counter-leftCount; // current node volume from nodeVolume; each child will have half // change is // (lc_count*ln(lc_count) + rc_count*ln(rc_count) + // count*ln(2)) - (count*ln(count) // if we split and lc_count, rc_count were the new counts in // left and right children respectively // note that the terms involving the total count in the histogram // and the volume of this node cancel so this change // is effectively scaled and does not need to use n dotprecision currentEMP(0.0); dotprecision childEMP(0.0); if (leftCount > 0) accumulate(childEMP, 1.0*leftCount, log(1.0*leftCount)); if (rightCount > 0) accumulate(childEMP, 1.0*rightCount, log(1.0*rightCount)); accumulate(childEMP, 1.0*counter, log(2.0)); accumulate(currentEMP, 1.0*counter, log(1.0*counter)); change = childEMP - currentEMP; } return change; }
SPSnodePtrs & SPSnode::getSubLeaves | ( | SPSnodePtrs & | subleaves | ) |
Returns pointers to non-const nodes.
{ if (isSubLeaf()) { // this is a subleaf subleaves.push_back(this); } //else if children, recurse on the children else if (!isLeaf()) { getLeftChild()->getSubLeaves(subleaves); getRightChild()->getSubLeaves(subleaves); } return subleaves; }
cxsc::real SPSnode::getUnscaledTreeLogLik | ( | ) | const [virtual] |
Get the unscaled log likihood for the tree rooted at this.
The unscaled log likelihood is sum over leaves of ( count in leaf x ( ln(count in leaf) + depth of leaf * ln2) )
To get the scaled log likelihood we would add ln (1/(n x vol)^n ) = -n x ln (n x vol) where n is the counter for this and vol is the volume of the box associated with this. If we are taking ratios of likelihoods for trees rooted at nodes with the same vol and n (eg different states of a tree rooted at this) we can often ignore this scaling factor.
{ if(isEmpty()) throw NoBox_Error("SPSnode::getUnscaledTreeLogLik()"); dotprecision nlogn(0.0); int ndepth = 0.0; int depth = 0; _getUnscaledTreeLogLik(nlogn, ndepth, depth); // this accumulates values in nlogn, ndepth dotprecision ndlog2(0.0); accumulate (ndlog2, cxsc::Ln2_real, 1.0*ndepth); return (cxsc::rnd(nlogn + ndlog2)); }
SPSnode * SPSnode::insertOneFind | ( | BigDataItr | newItr, |
OPERATIONS_ON | childInd, | ||
const SplitDecisionObj & | boolTest | ||
) |
Tries to insert data into the leaves of the tree rooted at this.
Passes down the tree, seeking the leaf node whose box contains the data point.
If the box that this represents does not contain the data point then the return value is NULL.
If the box that this represents contains the data point, statistics held by this are updated (subject to the value of countsOnly for this).
If this also a leaf node then the data is associated with this leaf and the address of this becomes the return value. If the SplitDecisionObj boolTest determines that this should expand following the insertion of the data, this is expanded.
If the box that this represents contains the data point but this is not a leaf node, the return value will be the address of the ultimate leaf descendent of this that contains the data point.
newItr | an iterator to the data in big data collection. |
childInd | an indicator for whether the current node is a treated as a left or right child or a root. |
boolTest | is a reference to an object determining whether to split a node when a data point arrives. |
{ rvector newData = *newItr; // start at the top SPSnode* retObj = NULL; if(nodeContains(newData, childInd)) { recalculateStats(newData); bool wasLeaf = (isLeaf()); // if it is a leaf, add the data and return this object if(wasLeaf) { dataItrs.push_back(newItr); // give this node as return value retObj = this; // split if we need to if (boolTest(this)) { // expand and split data to children nodeExpand(boolTest); } // end if we need to split } // end of wasLeaf // if not a leaf before we had split, and contains data // recurse on the children if any if (!wasLeaf) { if( rightChild != NULL && !rightChild->isEmpty() ){ retObj = (getRightChild())->insertOneFind( newItr, ON_RIGHT, boolTest); } // only try left if we did not find on the right if(retObj == NULL && leftChild!=NULL && !leftChild->isEmpty()) { retObj = (getLeftChild())->insertOneFind(newItr, ON_LEFT, boolTest); } } } // end if node contains // will return null if does not contain the data return retObj; }
bool SPSnode::isSplittableNode | ( | ) | const [virtual] |
Return boolean to indicate if node is splittable.
A node is splittable if the node volume is >= 2 * cxsc::MinReal (the smallest representable real number).
Reimplemented from subpavings::SPnode.
{ return SPnode::isSplittableNode(); }
bool SPSnode::isSplittableNode | ( | size_t | minChildPoints, |
double | minVol | ||
) | const |
Method to check whether a node is splittable.
Decides whether a node is splittable based on checking volume and number of points that would result in child nodes on split.
Node must satisfy the basic isSplittableNode() test and volume must be >=2*minVol to split and if minChildPoints > 0, then
Thus in general the method will only return true if the given node satisfies both the minVol test and, if it were to be split, both children would have at least minChildPoints data points, but if all the data points would go to one child (none to the other), this will also satisfy the minChildPoints test.
If the node has already been split, the test will use the actual numbers of points in the children; if the node is a leaf (ie not split) then the test will consider the number of points that would go to the each child if it were to be split.
minChildPoints | is the minimum number of points that there would be in the children if the node were to be split. |
minVol | is the minimum node volume to be tested for. This is the minimum volume that each child is permitted to have, i.e. to be splittable the node volume itself must be >= 2*minVol. |
{ bool retValue = true; if (minVol > 0.0) retValue = (!(nodeVolume() < 2*minVol)); if (retValue) { retValue = isSplittableNode(minChildPoints); } else { #ifdef DEBUG_CHECK_NODE_COUNT cout << "isSplittableNode: node failed vol test" << endl; #endif } return retValue; }
bool SPSnode::isSplittableNode | ( | size_t | minChildPoints | ) | const |
Method to check whether a node is splittable.
Decides whether a node is splittable based number of points that would result in child nodes on split.
Node must satisfy the basic isSplittableNode() test and if minChildPoints > 0, then
Thus in general the method will only return true if, if it were to be split, both children would have at least minChildPoints data points, but if all the data points would go to one child (none to the other), this will also satisfy the minChildPoints test.
If the node has already been split, the test will use the actual numbers of points in the children; if the node is a leaf (ie not split) then the test will consider the number of points that would go to the each child if it were to be split.
minChildPoints | is the minimum number of points that there would be in the children if the node were to be split. |
{ bool retValue = isSplittableNode(); //basic check if (!retValue) { #ifdef DEBUG_CHECK_NODE_COUNT cout << "isSplittableNode: node failed basic is splittable test" << endl; #endif #ifdef DEBUG_MCMC_SPLIT_FAIL cout << "Failed isSplittableNode: I am " << nodeName << endl; { ivector box = getBox(); interval maxD = box[MaxDiamComp(box)]; cout << cxsc::SaveOpt; cout << Scientific << SetPrecision(35,30); cout << "interval to be split is " << maxD << endl; cout << cxsc::RestoreOpt; } #endif } #ifdef DEBUG_CHECK_NODE_COUNT cout << "isSplittableNode minChildPoints = " << minChildPoints << endl; #endif if (retValue && minChildPoints > 0) { retValue = false; // need to retest size_t minChildCount = getMinChildCountIfSplitNEW(); if ( (counter >= minChildPoints) && ((minChildCount == 0) || (minChildCount >= minChildPoints)) ) { retValue = true; } #ifdef DEBUG_CHECK_NODE_COUNT cout << "isSplittableNode minChildCount = " << minChildCount << endl; cout << "(minChildCount >= minChildPoints) = " << (minChildCount >= minChildPoints) << endl; cout << "isSplittable = " << retValue << endl; #endif } return retValue; }
std::ostream & SPSnode::leafOutputTabs | ( | std::ostream & | os | ) | const [protected, virtual] |
Output for a node in a binary tree, tab-delimited.
Output intended for a txt file, in numeric form only.
Replaces the format that that the cxsc::<< operator produces for interval vectors. The format used here produces alpha-numeric tab-delimited data. The format for an n-dimensional interval vector is:
nodeName [tab] volume [tabl] counter [tab] Inf(ivector[1]) [tab] Sup(ivector[1]) [tab] ... [tab] Inf(ivector[n]) [tab] Sup(ivector[n])
os | is the stream to send to. |
Reimplemented from subpavings::SPnode.
{ if(theBox != NULL) { // do nothing if there is no box ivector thisBox = *theBox; // copy of theBox // output the node name, nodeVolume, counter os << nodeName; os << "\t" << nodeRealVolume(); os << "\t" << counter; // followed by the intervals of box using Inf and Sup // ie unlike cxsc output, there is no [ ] around them for (int i= Lb(thisBox); i <= Ub(thisBox) ; i++) { os << "\t" << Inf(thisBox[i]) << "\t" << Sup(thisBox[i]); } } }
std::ostream & SPSnode::leafOutputTabsWithEMPs | ( | const size_t | bigN, |
std::ostream & | os, | ||
int | prec = 5 |
||
) | const [protected, virtual] |
Output for a node in a binary tree, tab-delimited.
Output intended for a txt file, in numeric form only.
Replaces the format that that the cxsc::<< operator produces for interval vectors. The format used here produces alpha-numeric tab-delimited data.
The format for a d-dimensional interval vector is
nodeName [tab] volume [tab] counter [tab] scaled EMP contribution COPERR [tab] change in scaled EMP contribution COPERR if split [tab] scaled EMP contribution AIC [tab] change in scaled EMP contribution AIC if split [tab] Inf(ivector[1]) [tab] Sup(ivector[1].[tab] . . [tab] Inf(ivector[d]) [tab] Sup(ivector[d]
bigN | is total datapoints, used by the emps calculation |
os | is the stream to send to |
prec | is the precision used for printing, defaulting to 5 |
{ if(theBox != NULL) { // do nothing if there is no box ivector thisBox = *theBox; // copy of theBox // output the name, nodeVolume, counter os << nodeName; os << cxsc::SaveOpt; os << cxsc::Variable << cxsc::SetPrecision(prec+2,prec); os << "\t" << nodeRealVolume(); os << "\t" << counter; // EMP contributions and changes if split os << "\t" << getEMPContributionCOPERR(bigN); os << "\t" << rnd(getSplitChangeEMPCOPERR(bigN)); os << "\t" << getEMPContributionAIC(bigN); os << "\t" << rnd(getSplitChangeEMPAIC()); // followed by the intervals of box using Inf and Sup // ie unlike cxsc output, there is no [ ] around them for (int i= Lb(thisBox); i <= Ub(thisBox) ; i++) { os << "\t" << Inf(thisBox[i]) << "\t" << Sup(thisBox[i]); } os << cxsc::RestoreOpt; } }
std::ostream & SPSnode::leafOutputTabsWithHistHeight | ( | const size_t | bigN, |
std::ostream & | os, | ||
int | prec = 5 |
||
) | const [protected, virtual] |
Output for a node in a binary tree, tab-delimited.
Output intended for a txt file, in numeric form only.
Includes the height of a bin represented by this leaf node for a normalised histogram, ie counter/(node volume * total count in tree)
Replaces the format that that the cxsc::<< operator produces for interval vectors. The format used here produces alpha-numeric tab-delimited data. The format for a d-dimensional interval vector is:
nodeName [tab] volume [tab] counter [tab] counter/(volume*total count) [tab] Inf(ivector[1]) [tab] Sup(ivector[1]) [tab] ... [tab] Inf(ivector[d]) [tab] Sup(ivector[d])
bigN | is total data points, used by the height calculation |
os | is the stream to send to |
prec | is the precision used for printing, defaulting to 5 |
{ if(theBox != NULL) { // do nothing if there is no box ivector thisBox = *theBox; // copy of theBox // output the node name, nodeVolume, counter, counter/(bigN * vol) os << cxsc::SaveOpt; os << cxsc::Variable << cxsc::SetPrecision(prec+2,prec); cxsc::real vol = nodeRealVolume(); os << setprecision(prec); os << "\t" << vol; os << "\t" << counter; os << "\t" << (1.0 * counter)/(vol * (1.0* bigN)); // followed by the intervals of box using Inf and Sup // ie unlike cxsc output, there is no [ ] around them for (int i= Lb(thisBox); i <= Ub(thisBox) ; ++i) { os << "\t" << Inf(thisBox[i]) << "\t" << Sup(thisBox[i]); } os << cxsc::RestoreOpt; } }
std::ostream & SPSnode::leafOutputTabsWithHistHeightAndEMPs | ( | const size_t | bigN, |
std::ostream & | os, | ||
int | prec = 5 |
||
) | const [protected] |
Output for a node in a binary tree, tab-delimited.
Output intended for a txt file, in numeric form only.
Replaces the format that that the cxsc::<< operator produces for interval vectors. The format used here produces alpha-numeric tab-delimited data.
Includes the height of a bin represented by this leaf node for a normalised histogram, ie counter/(node volume * total count in tree)
The format for a d-dimensional interval vector is
nodeName [tab] volume [tabl] counter [tab] counter/(volume*total count) scaled EMP contribution COPERR [tab] change in scaled EMP contribution COPERR if split [tab] scaled EMP contribution AIC [tab] change in scaled EMP contribution AIC if split [tab] Inf(ivector[1]) [tab] Sup(ivector[1].[tab] . . [tab] Inf(ivector[d]) [tab] Sup(ivector[d]
bigN | is total data points, used by emps and height calculations |
os | is the stream to send to |
prec | is the precision used for printing, defaulting to 5 |
{ if(theBox != NULL) { // do nothing if there is no box ivector thisBox = *theBox; // copy of theBox // output the node name, nodeVolume, counter, counter/(n * vol) os << nodeName; os << cxsc::SaveOpt; os << cxsc::Variable << cxsc::SetPrecision(prec+2,prec); cxsc::real vol = nodeRealVolume(); os << "\t" << vol; os << "\t" << counter; os << "\t" << (1.0 * counter)/(vol * (1.0*bigN)); // EMP contributions and changes if split os << "\t" << getEMPContributionCOPERR(bigN); os << "\t" << rnd(getSplitChangeEMPCOPERR(bigN)); os << "\t" << getEMPContributionAIC(bigN); os << "\t" << rnd(getSplitChangeEMPAIC()); // followed by the intervals of box using Inf and Sup // ie unlike cxsc output, there is no [ ] around them for (int i= Lb(thisBox); i <= Ub(thisBox) ; i++) { os << "\t" << Inf(thisBox[i]) << "\t" << Sup(thisBox[i]); } os << cxsc::RestoreOpt; } }
std::ostream & SPSnode::leavesOutputTabsWithEMPs | ( | const size_t | bigN, |
std::ostream & | os, | ||
int | prec = 5 |
||
) | const [virtual] |
Output for for all leaves of a binary tree.
Output intended for a txt file, in numeric form only.
Includes scaled COPERR and AIC contributions and changes if split.
bigN | is total data points, used by emps and height calculations |
os | is the stream to send to |
prec | is the precision used for printing, defaulting to 5 |
{ if (parent == NULL) { // root std::string headers = "node \t vol \t count \t EMP COPERR "; headers += "\t &change \t EMP AIC \t &change \t dimensions \n"; os << headers; } // uses member function leafOutputTabsWithEMPs to generate node output if (!(isEmpty()) && isLeaf()) { // this is a non-empty leaf leafOutputTabsWithEMPs(bigN, os, prec); return (os << "\n"); } //recurse on the children if (getLeftChild()!=NULL) { getLeftChild()->leavesOutputTabsWithEMPs(bigN, os, prec); } if (getRightChild()!=NULL) { getRightChild()->leavesOutputTabsWithEMPs(bigN, os, prec); } }
std::ostream & SPSnode::leavesOutputTabsWithHistHeight | ( | std::ostream & | os, |
int | prec = 5 |
||
) | const [virtual] |
Output for for all leaves of a binary tree.
Output intended for a txt file, in numeric form only.
Includes output for the height of histogram bins for a normalised histogram based on tree with total number of data points of this root.
os | is the stream to send to |
prec | is the precision used for printing, defaulting to 5 |
{ leavesOutputTabsWithHistHeight(counter, os, prec); return (os); }
std::ostream & SPSnode::leavesOutputTabsWithHistHeight | ( | const size_t | bigN, |
std::ostream & | os, | ||
int | prec = 5 |
||
) | const [virtual] |
Output for for all leaves of a binary tree.
Output intended for a txt file, in numeric form only.
Includes output for the height of histogram bins for a normalised histogram based on tree with total number of data points of this root.
bigN | is total data points, used by emps and height calculations |
os | is the stream to send to |
prec | is the precision used for printing, defaulting to 5 |
{ // uses member function leafOutputTabs to generate node output if (!(isEmpty()) && isLeaf()) { // this is a non-empty leaf leafOutputTabsWithHistHeight(bigN, os, prec); return (os << "\n"); } //recurse on the children if (getLeftChild()!=NULL) { getLeftChild()->leavesOutputTabsWithHistHeight(bigN, os, prec); } if (getRightChild()!=NULL) { getRightChild()->leavesOutputTabsWithHistHeight(bigN, os, prec); } }
std::ostream & SPSnode::leavesOutputTabsWithHistHeightAndEMPs | ( | const size_t | bigN, |
std::ostream & | os, | ||
int | prec = 5 |
||
) | const [virtual] |
Output for for all leaves of a binary tree.
Output intended for a txt file, in numeric form only.
Recursively uses leafOutputTabsWithHistHeightAndEMPs() to output information for each leaf node. Includes output for the height of histogram bins for a normalised histogram based on tree with total number of data points n.
bigN | is total data points, used by emps and height calculations |
os | is the stream to send to |
prec | is the precision used for printing, defaulting to 5 |
{ if (parent == NULL) { // root std::string headers = "node \t vol \t count \t height "; headers += "\t EMP COPERR \t &change \t EMP AIC \t &change "; headers += "\t dimensions \n"; os << headers; } // uses member function leafOutputTabsWithEMPs to generate node output if (!(isEmpty()) && isLeaf()) { // this is a non-empty leaf leafOutputTabsWithHistHeightAndEMPs(bigN, os, prec); return (os << "\n"); } //recurse on the children if (getLeftChild()!=NULL) { getLeftChild()->leavesOutputTabsWithHistHeightAndEMPs(bigN, os, prec); } if (getRightChild()!=NULL) { getRightChild()->leavesOutputTabsWithHistHeightAndEMPs(bigN, os, prec); } }
std::ostream & SPSnode::nodeDataPrint | ( | std::ostream & | os | ) | const [protected, virtual] |
Print the data in a specified format.
Replaces the format that the cxsc::<< operator produces for vectors. The format used here produces numeric tab-delimited data. The format for an n-dimensional real vector data point is:
rvector[1] [tab] . . . [tab] rvector[n]
{ if (!dataItrs.empty()) { NodeDataItr dataItr; int dimension = getDimension(); for (dataItr = dataItrs.begin(); dataItr!= dataItrs.end(); dataItr++) { BigDataItr bigIt = *dataItr; rvector theData = *bigIt; for (int i = 1; i < dimension + 1; i++) { os << theData[i] << " "; // print data } // end loop through data elements } // end loop through data container } // end if counter > 0 // if no data, ie counter = 0, then just return os unaltered return os; }
void SPSnode::nodeExpand | ( | int | comp | ) | [virtual] |
Expand a leaf node.
Expand a leaf node to have two children and pass data down to the children with no further splitting.
Equivalent to bisecting a box in a regular subpaving. Makes two new sibling child nodes of this.
comp | is the dimension on which to to bisect theBox. |
Reimplemented from subpavings::SPnode.
{ nodeExpansionOnly(comp); // expand the node SplitNever sn; // dummy split decision maker splitData(sn); // split the data with no further splitting }
void SPSnode::nodeExpand | ( | const SplitDecisionObj & | boolTest, |
int | comp | ||
) | [virtual] |
Expand a leaf node.
Expand a leaf node to have two children and pass data down to the children, allowing for further splitting.
boolTest | is a reference to an object providing a function operator determining whether to split a node when a data point arrives. |
comp | is the dimension on which to to bisect theBox. |
{ nodeExpansionOnly(comp); // expand the node // split the data, allowing for further splitting splitData(boolTest); }
void SPSnode::nodeExpand | ( | ) | [virtual] |
Expand a leaf node.
Expand a leaf node to have two children and pass data down to the children with no further splitting.
Finds the splitting dimension.
Reimplemented from subpavings::SPnode.
{ int maxdiamcomp; // variable to hold first longest dimension double temp = MaxDiam(getBox(), maxdiamcomp); nodeExpand(maxdiamcomp); // complete nodeExpand }
void SPSnode::nodeExpand | ( | const SplitDecisionObj & | boolTest | ) | [virtual] |
Expand a leaf node.
Expand the leaf node to have two children and pass data down to the children, allowing for further splitting.
Finds the dimension to split on.
boolTest | is a reference to an object providing a function operator determining whether to split a node when a data point arrives. |
{ int maxdiamcomp; // variable to hold first longest dimension double temp = MaxDiam(getBox(), maxdiamcomp); nodeExpand(boolTest, maxdiamcomp); // complete nodeExpand }
void SPSnode::nodeExpansionOnly | ( | int | comp | ) | [protected, virtual] |
Expand the node with no reallocation of data.
Bisect box, make two new nodes (one for each half box) and graft onto this node provided that this node is a leaf. Equivalent to bisecting a box in a regular subpaving.
{ if ( isEmpty() ) { throw NoBox_Error("SPSnode::nodeExpansionOnly(int )"); } // only do something if this SPSnode is a leaf if (isLeaf()) { SPSnode* newLC = NULL; SPSnode* newRC = NULL; try { // ivectors to become boxes for new children ivector lC, rC; // Call Lower() and Upper() to put the split // boxes into lC and rC respectively Lower(getBox(), lC, comp); Upper(getBox(), rC, comp); // when making new children, use constructor // that will give space indication (for data) // of the size of this node's dataItrs size_t space = dataItrs.size(); newLC = new SPSnode(lC, space, countsOnly); newRC = new SPSnode(rC, space, countsOnly); nodeAddLeft(newLC); nodeAddRight(newRC); //name the new children getLeftChild()->setNodeName(nodeName + "L"); getRightChild()->setNodeName(nodeName + "R"); // store the split dimension in this splitDim = comp; // store the split value in this // the split value is the infinum of interval // of right child box for dimension split on splitValue = _double(Inf( ((getRightChild())->getBox())[comp])); } catch(std::exception& e) { // overkill, but try to deal with all eventualities... try { if (getLeftChild() != NULL) { delete (getLeftChild()); leftChild = NULL; } if (getRightChild() != NULL) { delete (getRightChild()); rightChild = NULL; } if (newLC != NULL) { delete newLC; newLC = NULL; } if (newRC != NULL) { delete newRC; newRC = NULL; } } catch(std::exception& e) {} //catch and swallow throw; // rethrow original exception } } }
std::ostream & SPSnode::nodePrint | ( | std::ostream & | os | ) | const [virtual] |
Output details of a specific node.
This is intended for console output or output to a mixed alpha and numeric file.
os | is the stream to send to |
Reimplemented from subpavings::SPnode.
{ // output for box in form: // box, volume, counter, mean, variance covariance, and data if(theBox != NULL) { // do nothing if there is no box ivector thisBox = *theBox; // copy theBox os << "Box is :"; for (int i = Lb(thisBox); i <= Ub(thisBox) ; i++) { // c-xsc default output for intervals os << " " << thisBox[i]; } os << std::endl; os << "Box volume is " << nodeRealVolume() << std::endl; os << "Counter = " << counter << std::endl; os << "Mean is "; nodeMeanPrint(os); os << "\nVariance-covariance matrix is" << std::endl; nodeVarCovarPrint(os); os << "Data is" << std::endl; nodeDataPrint(os); os << std::endl; os << std::endl; } return os; }
void SPSnode::nodeReabsorbChildren | ( | ) | [virtual] |
Reabsorbs both children of the node.
Effectively reverses any split of the node.
Data associated with the children is pushed back up to this and the splitDim and splitValue reset to leaf defaults.
Works even if the children are not leaves.
Reimplemented from subpavings::SPnode.
{ // first recursively deal with the children of the children if (hasLCwithBox()) getLeftChild()->nodeReabsorbChildren(); if (hasRCwithBox()) getRightChild()->nodeReabsorbChildren(); if (hasLCwithBox()) { getLeftChild()->gatherData(dataItrs); delete getLeftChild(); } if (hasRCwithBox()) { getRightChild()->gatherData(dataItrs); delete getRightChild(); } // reset splitDim and splitValue to their defaults splitDim = -1; splitValue = 0.0; leftChild = NULL; rightChild = NULL; }
std::string SPSnode::nodeStringSummary | ( | ) | const [virtual] |
Get a string summary of this node's properties.
Just this node, not this and its descendents.
Reimplemented from subpavings::SPnode.
{ std::ostringstream oss; oss << "I am " << getNodeName() << "(address " << this << "),\n"; oss << "Dimension is " << getDimension(); if (isEmpty() ) { oss << ", the box is NULL\n"; } else { oss << ", address of box is " << theBox << "\n"; } oss << "spaceIndication is " << spaceIndication << ", splitDim is " << splitDim << ", splitValue is " << splitValue << "\n"; oss << "address of dataItrs is " << &dataItrs << ", countsOnly is " << countsOnly << ", counter is " << counter << "\n"; oss << "getMean is "; nodeMeanPrint(oss); oss << "\ngetVarCovar is\n"; nodeVarCovarPrint(oss); oss << "my parent is "; if (getParent() != NULL) oss << getParent()->getNodeName(); else oss << "NULL"; oss << ", my left child is "; if (getLeftChild() != NULL) oss << getLeftChild()->getNodeName(); else oss << "NULL"; oss << ", my right child is "; if (getRightChild() != NULL) oss << getRightChild()->getNodeName(); else oss << "NULL"; return oss.str(); }
void SPSnode::recalculateOptionalStatsAndGatherData | ( | NodeData & | container | ) | [protected, virtual] |
Recalculates optional summary statistics associated with node from scratch, using all the data currently held.
Recalculates sums and sumproducts.
Uses only the data currently associated with the node in the the recalculation.
Does not recalculate non-optional summaries, like counter, because these should be right already
{ countsOnly = false; // make sure the current containers are empty VecDotPrec().swap (dpSums); VecDotPrec().swap (dpSumProducts); NodeData myContainer; if (!isLeaf()) { if (hasLCwithBox()) { getLeftChild()->recalculateOptionalStatsAndGatherData(myContainer); } if (hasRCwithBox()) { getRightChild()->recalculateOptionalStatsAndGatherData(myContainer); } } else { // is a leaf // just my data myContainer = dataItrs; } // now do my stats assert(myContainer.size() == getCounter() ); recalculateOptionalStatsForData(myContainer); // add my data to the container container.insert(container.end(), myContainer.begin(), myContainer.end()); }
void SPSnode::recalculateStats | ( | const rvector & | newdata | ) | const [protected, virtual] |
Recalculate summary statistics associated with node.
Recalculates counter and sums (used for mean) and sumproducts (used for variance-covariance).
{ counter++; // update the counter if (!countsOnly) { recalculateSums(newdata); // update the sums recalculateSumProducts(newdata); // update the sumproducts } }
void SPSnode::recalculateSumProducts | ( | const rvector & | newdata | ) | const [protected, virtual] |
Recalculate summary statistics associated with node.
Recalculates sumproducts (used for variance-covariance).
{ /* the sumproducts can be thought of as an nxn matrix, which is implemented here as a nxn element vector of dotprecision variables, using row-major order. Ie the m-th element (m = 0, . . . nxn-1) is in row floor(m/n) and column m-rowxn in the matrix configuration. Or, the sumproduct of elements i and j in an rvector, i,j = 0,...,n-1, is element m=(ixn+j) of the sumproducts vector. */ int dimension = getDimension(); if (dpSumProducts.empty()) { //nothing there yet // reserve space for all elements dpSumProducts.reserve(dimension*dimension); // for each dimnsn^2 of data, initialise element for (int i = 0; i< (dimension*dimension); i++) { dotprecision dp; dp = 0.0; dpSumProducts.push_back(dp); } } // make a dot precision variable out of the ith element // and jth element of the of the rvector of new data and // store in dpSumProducts. for (int i = 1; i < dimension + 1; i++) { // only need to do columns 1 to i because of symmetry for (int j = 1; j< i + 1; j++) { int index = (i-1)*dimension + (j-1); // rvectors indexed 1 to n accumulate(dpSumProducts[index], newdata[i], newdata[j]); //if not on the diagonal of the matrix, // we can also fill in the symmetric element if (i!=j) { int sym_index = (j-1)*dimension + (i-1); dpSumProducts[sym_index] = dpSumProducts[index]; } // end if }// end j-loop }// end i-loop // sumproducts has been updated for new datapoint }
void SPSnode::recalculateSums | ( | const rvector & | newdata | ) | const [protected, virtual] |
Recalculate summary statistics associated with node.
Recalculates sums (used for mean).
{ int dimension = getDimension(); if (dpSums.empty()) { //nothing in the sums yet // reserve space in dpSums for all elements of the mean dpSums.reserve(dimension); // for each dimnsn of data, initialise element for (int i = 0; i< dimension; i++) { dotprecision dp; dp = 0.0; dpSums.push_back(dp); } } // make a dot precision variable out of the ith element // of the rvector of new data and store in dpSums for (int i = 1; i< dimension + 1; i++) { // rvectors indexed 1 to n, vectors indexed 0 to n-1 accumulate(dpSums[i-1], newdata[i], 1.0); } }
void SPSnode::reshapeToUnion | ( | const SPnode & | other | ) | [virtual] |
Reshape so that the tree rooted at this has shape that is the union of this shape and the shape of another tree.
Throws a NoBox_Error if this has no box or if other has no box.
Throws an IncompatibleDimensions_Error if boxes of this and other are not the same.
Throws an std::runtime_error if other has an illegal state (see checkTreeStateLegal()).
other | is the tree to make the union against. |
Reimplemented from subpavings::SPnode.
{ SPnode::reshapeToUnion(other); }
void SPSnode::reshapeToUnion | ( | const SPnode & | other, |
size_t | minChildPoints | ||
) |
Reshape so that the tree rooted at this has shape that is as close as possible to the union of this shape and the shape of another tree.
If other is more split than this, this will not exactly follow the shape of other if the resulting nodes would not splittable according to isSplittableNode(size_t minChildPoints). If any node cannot be split to follow the shape of other due to minChildPoints, a message will be printed to std::cerr.
Throws a NoBox_Error if this has no box or if other has no box.
Throws an IncompatibleDimensions_Error if boxes of this and other are not the same.
Throws an std::runtime_error if other has an illegal state (see checkTreeStateLegal()).
other | is the tree to make the union against. |
minChildPoints | is the minumum child points to use to check if this can be split in order to follow other. |
{ if (isEmpty() || other.isEmpty()) { throw NoBox_Error( "SPnode::reshapeToUnion(const SPnode&, size_t)"); } if ( getBox() != other.getBox() ) { throw IncompatibleDimensions_Error( "SPnode::reshapeToUnion(const SPnode&, size_t)"); } if ( !other.checkTreeStateLegal() ) throw runtime_error( "SPnode::reshapeToUnion(const SPnode&, size_t) : other has illegal tree state"); std::string baseErrorFilename("ReshapeErrors"); std::string errorFilename = getUniqueFilename(baseErrorFilename); bool success = this->_reshapeToUnion( &other, minChildPoints, errorFilename); // if we returned success there should be no file with that name if(!success) { std::cerr << "\nCould not exactly reshape this to the union:" << " check " << errorFilename << " for errors\n" << endl; } }
void SPSnode::setCountsOnly | ( | bool | setTo | ) | [virtual] |
Set an indicator controlling whether this maintains all available statistics or not.
If the node is changed from holding all available statistics to not holding all available statistics, the existing values held to enable calculation of the unwanted statistics are discarded.
If the node is changed from not holding all available statistics to holding all available statistics, values enabling the calculation of the optional statistics are recalculated.
Throws a NonRootNode_Error if this is not a root node.
setTo | the indicator for whether only counts are maintained (setTo = true) or whether all available statistics are maintained (setTo = false). |
{ if (getParent() != NULL) { throw NonRootNode_Error("SPSnode::setCountsOnly(bool)"); } if (setTo) { // change to keeping counts only stripOptionalStatsOnly(); // clears recursively on this and children } else { // change to not keeping counts only, ie to keeping all stats addOptionalStatsOnly(); // adds recursively on this and children } }
void SPSnode::splitData | ( | const SplitDecisionObj & | boolTest | ) | [protected, virtual] |
Send the data associated with this down to children.
Children may then be resplit using boolTest.
{ // check that both children exist if (!hasLCwithBox() || !hasRCwithBox()) { throw UnfulfillableRequest_Error( "SPSnode::splitData(const SplitDecisionObj&)"); } NodeDataItr dataItr; // iterator //divvie the data up amongst the children for (dataItr = dataItrs.begin(); dataItr!= dataItrs.end(); dataItr++) { BigDataItr newItr = *dataItr; //calls insertOneFind on the children of this node // so stats are not recalculated for this node itself SPSnode* reinsertedInto = NULL; if(rightChild!=NULL && !rightChild->isEmpty()) { reinsertedInto = (getRightChild())->insertOneFind( newItr, ON_RIGHT, boolTest); } // only try the left if it's not on the right if(reinsertedInto==NULL && leftChild!=NULL && !leftChild->isEmpty()) { reinsertedInto = (getLeftChild())->insertOneFind( newItr, ON_LEFT, boolTest); } } clearNodeData(); //clear the data in this node }
bool SPSnode::splitRootAtLeastToShapeSPS | ( | std::vector< size_t > | reqDepths, |
size_t | minPoints = 0 |
||
) |
Try to get tree rooted at this into shape at least as deep as described by instruction reqDepths.
Tries to ensure that this has at least shape of reqDepths, but allows the tree rooted at this to also be more split at some or all nodes.
This can only follow the instruction if the nodes it is required to split are splittable nodes, with respect to both width and minPoints (to be split a node must return isSplittableNode(minPoints) = true).
reqDepths | a collection of leaf levels to try to split to. |
minPoints | the minimum child points used to determine whether a node is splittable (defaults to 0). |
{ if ( isEmpty() ) { throw NoBox_Error("SPSnode::splitRootAtLeastToShape(...)"); } if (getParent() != NULL) { throw NonRootNode_Error("SPSnode::splitRootAtLeastToShape(...)"); } size_t myDepth = 0; bool success = _splitAtLeastToShapeSPS(reqDepths, myDepth, minPoints); if (success && !reqDepths.empty()) throw std::invalid_argument("SPSnode::splitRootAtLeastToShape(...)"); return success; }
void SPSnode::stripData | ( | ) | const [protected, virtual] |
Strip the data from this node and its children, seting counters to 0 and discarding other statistics held, and discarding data held by leaf nodes.
{ //strip me counter = 0; VecDotPrec().swap (dpSums); VecDotPrec().swap (dpSumProducts); NodeData().swap (dataItrs); // and children if (getLeftChild() != NULL) { getLeftChild()->stripData(); } if (getRightChild() != NULL) { getRightChild()->stripData(); } }
void SPSnode::stripOptionalStatsOnly | ( | ) | [protected, virtual] |
Discard the values held to enable the calculation of optional statistics, and set countsOnly to true, in this and the children of this.
{ if (!countsOnly) { //strip me of the sum collectors only VecDotPrec().swap (dpSums); VecDotPrec().swap (dpSumProducts); countsOnly = true; } // and children if (getLeftChild() != NULL) { getLeftChild()->stripOptionalStatsOnly(); } if (getRightChild() != NULL) { getRightChild()->stripOptionalStatsOnly(); } }
void SPSnode::swapSPS | ( | SPSnode & | spn | ) |
Swap this and another node.
Swaps all the data members of this with the other node.
spn | a reference to the node to swap with |
{ /* theBox, parent, leftChild, rightChild and nodeName are inherited from base class */ SPnode::swap(spn); // use the base version std::swap(spaceIndication, spn.spaceIndication); std::swap(counter, spn.counter); std::swap(dpSums, spn.dpSums); std::swap(dpSumProducts, spn.dpSumProducts); std::swap(dataItrs, spn.dataItrs); std::swap(splitDim, spn.splitDim); std::swap(splitValue, spn.splitValue); std::swap(countsOnly, spn.countsOnly); }
void SPSnode::unionTreeStructure | ( | const SPSnode *const | rhs | ) |
Makes the non-minimal union of the tree structures of this and another node.
Discards the actual data. The resulting structure returned has no data associated with it.
Throws the following exceptions:
rhs | a pointer to the root of SPSnode tree to make a union against, which can be a NULL pointer. rhs is unchanged by the operation. |
{ if ( isEmpty() ) { throw NoBox_Error( "SPSnode::unionTreeStructure(const SPSnode * const)"); } if ( rhs != NULL && !rhs->isEmpty() && (getBox() != rhs->getBox()) ) { throw IncompatibleDimensions_Error( "SPSnode::unionTreeStructure(const SPSnode * const)"); } // if rhs is null or empty we carry on unionNoData(rhs); }
bool subpavings::SPSnode::countsOnly [protected] |
Determines the amount of statistical summary data in node.
If true, only counts are maintained in the node. If false, counts and other statistics (sums, sumproducts) are also maintained.
Setting countsOnly to true can help to conserve memory when means (from sums) and covariances (from sumproducts) will not be needed.
NodeData subpavings::SPSnode::dataItrs [mutable, protected] |
A container for the association of data with a node.
Data is associated with a node via this container of iterators. The iterators can, very loosely, in the sense in which they are used here, be thought of as pointers to a big data collection of all data points. Only leaf nodes can have anything in this container. However, not all leaf nodes will necessarily have something in this container: the container will be empty if no data points are covered by the box represented by a leaf node.
VecDotPrec subpavings::SPSnode::dpSumProducts [mutable, protected] |
A container representing the sumproduct matrix of the data points covered by theBox.
The sumproducts matrix is used to obtain the sample variance-covariance matrix.
The for n-dimensional data the sample variance-covariance matrix is an nxn matrix where the element in row i, column j is the sample covariance between the ith-dimension and jth-dimension of the data, which is [sumproduct(i,j)-sum(i)sum(j)/counter]/(counter-1).
So by keeping the sum product and sums up to date, we can calculate a covariance on demand.
The sumproducts can be thought of as an nxn matrix where the element in row i, column j is the sum over all the datapoints associated with that box of the products of the ith element and jth element in the datapoints. ie for each datapoint, we take the product of the ith and jth elements and then sum the products over all the datapoints.
Data points are rvectors so each element is a real, and the the accumulation (sum) of products of reals is implemented here with a dotprecision accumulator.
The sumproduct matrix is stored here as a nxn element vector of dotprecision variables (where n is the dimensions of the rvectors or data points), using row-major order.
Ie the m-th element (m = 0, . . . nxn-1) is in row floor(m/n) and column m-rowxn in the matrix configuration.
Or, the sumproduct of elements i and j in an rvector, i,j = 0,...,n-1, is element m=(ixn+j) of the sumproducts vector.
VecDotPrec subpavings::SPSnode::dpSums [mutable, protected] |
A container representing the sum of the data points covered by theBox.
The sums are used for calculating the mean and also the sample variance-covariance matrix for the data associated with a node.
cxsc::dotprecision accumulators are used to maintain the sum of the data in each dimension of the data because floating point arithmetic can result in inaccuracies during summation, especially in large boxes.
We could use Kahan summation instead with a lot more work. Kahan summation relies on adding a number of points in a sequence and recovering data lost in one summation during the next one. When we simply add two numbers, Kahan summation has no chance to recover the lost part. We would have to implement this by keeping the lost part, say having a vector of pairs, and trying to re-add the lost part each time. See http://en.wikipedia.org/wiki/Kahan_summation_algorithm . The same may apply to arguments for using gsl_mean using a simpler reccurence relation. Speed comparisons have not been performed on the three alternative possible implementations of the recursively computable sample sum or sample mean. However we get most relable and accurate sums using cxsc::dotprecision accumulators.
size_t subpavings::SPSnode::spaceIndication [protected] |
An indication of the maximum number of data points a node needs to carry.
This is used for efficiency only to reserve vector space and a node can have more than this maximum number of data points associated with it. Defaults to defaultMaxPts.
int subpavings::SPSnode::splitDim [protected] |
Dimension the node's box has been split along.
This is needed to accurately divide data between a node's left and right children. Defaults to -1 if the node is a leaf.
real subpavings::SPSnode::splitValue [protected] |
The value, on split dimension, where node's box was split.
This is needed to accurately divide data between a node's left and right children. Defaults to 0.0 if the node is a leaf.