Algorithmic Primitives For Exploratory Data Analysis


This topic contains 2 replies, has 1 voice, and was last updated by Josh Stern on February 25, 2023 at 8:27 am.

  • Author
    Posts
  • #126283

    Josh Stern
    Moderator

    More complete spectral clustering is also interesting, though more computationally costly than the techniques mentioned above. The indicator test statistic screening can be used first to restrict attention to dimensions that are useful for the given analysis. Then, using those dimensions, similarity between all pairs of points can be computed either with the normalized dot product or with another measure such as the normalized Sum(min(X_i(x1,x2)))/Sum(max(X_i(x1,x2))) (more sensitive to features that are more notable when 1 than when 0). Spectral clustering is then run on the similarity matrix to use the most influential similarity centers as a dimensional basis in the manner discussed above (with the cautions mentioned there). These can then be used as kernel estimates or as a basis for reduced-rank GLM regression.
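    A minimal sketch of the pipeline above, assuming non-negative indicator-style features and using scikit-learn's `SpectralClustering` on the precomputed similarity matrix (the synthetic two-group data and the function name `minmax_similarity` are illustrative, not from the original):

    ```python
    import numpy as np
    from sklearn.cluster import SpectralClustering

    def minmax_similarity(X):
        """Sum(min)/Sum(max) similarity over features (generalized Jaccard).

        Emphasizes features that are informative when near 1 rather than
        when near 0. Assumes non-negative feature values.
        """
        n = X.shape[0]
        S = np.ones((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                denom = np.maximum(X[i], X[j]).sum()
                S[i, j] = S[j, i] = (
                    np.minimum(X[i], X[j]).sum() / denom if denom > 0 else 0.0
                )
        return S

    # Two groups of binary indicator features that are "on" in
    # complementary halves of the feature set.
    rng = np.random.default_rng(0)
    A = (rng.random((20, 10)) < np.r_[np.full(5, 0.9), np.full(5, 0.1)]).astype(float)
    B = (rng.random((20, 10)) < np.r_[np.full(5, 0.1), np.full(5, 0.9)]).astype(float)
    X = np.vstack([A, B])

    S = minmax_similarity(X)
    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(S)
    ```

    The `affinity="precomputed"` option lets the similarity matrix drive the spectral decomposition directly, so any pairwise measure (normalized dot product or the min/max ratio) can be swapped in.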

  • #126284

    Josh Stern
    Moderator

    The following set of algorithms is based on the idea of looking for connected clusters with irregular shapes as “explanatory processes”. Some versions are similar to Joshua Tenenbaum’s algorithm for finding manifolds (Isomap).

    Stage 1 – pick a set of dimensions D to describe spatial variation in the space X. The methods described above can be used.

    A PCA alternative could also be used to reduce the original high-dimensional space – for that, we suggest using the normalized/whitened transformations of the original X_i components (after transforming categorical variables to indicators).
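    A sketch of that PCA alternative for Stage 1, assuming scikit-learn; the mixed numeric/categorical data here is made up for illustration:

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler, OneHotEncoder

    # Hypothetical mixed data: two numeric columns on very different
    # scales, plus one categorical column.
    rng = np.random.default_rng(1)
    numeric = rng.normal(size=(100, 2)) * [5.0, 0.5]
    categories = rng.choice(["a", "b", "c"], size=(100, 1))

    # Transform categorical variables to indicators, then standardize
    # everything so no single scale dominates the PCA.
    indicators = OneHotEncoder().fit_transform(categories).toarray()
    X = np.hstack([numeric, indicators])
    X_std = StandardScaler().fit_transform(X)

    # whiten=True gives unit-variance, decorrelated components for the
    # reduced space D (n_components=3 is an illustrative choice).
    D = PCA(n_components=3, whiten=True, random_state=0).fit_transform(X_std)
    ```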

    Stage 2 – We define a measure of geometric proximity. In the case of regression prediction of Y_i, the proximity can be a weighted mixture of the following 3 components: distance in Y_i space, distance in the reduced geometric space D, and the normalized dot product (i.e. correlation) of the whitened X_i, recast as a distance ranging from 0.0 – corresponding to a correlation value of 1.0 – to 1.0 – corresponding to a correlation value of -1.0.
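    The Stage 2 mixture could be computed as follows; the mixture weights and the function name `mixed_proximity` are illustrative assumptions, and the first two components are rescaled to [0, 1] so the weighting is meaningful:

    ```python
    import numpy as np

    def mixed_proximity(y, D, Xw, w=(0.4, 0.4, 0.2)):
        """Weighted mixture of three pairwise distance components:
        distance in Y space, distance in the reduced space D, and the
        correlation of whitened X recast as (1 - corr) / 2 in [0, 1]."""
        # Component 1: distance in Y_i space, scaled to [0, 1].
        dy = np.abs(y[:, None] - y[None, :])
        dy /= dy.max()
        # Component 2: Euclidean distance in the reduced space D, scaled.
        dD = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=-1)
        dD /= dD.max()
        # Component 3: normalized dot product of whitened rows, recast so
        # correlation 1.0 -> distance 0.0 and correlation -1.0 -> 1.0.
        unit = Xw / np.linalg.norm(Xw, axis=1, keepdims=True)
        dc = (1.0 - np.clip(unit @ unit.T, -1.0, 1.0)) / 2.0
        return w[0] * dy + w[1] * dD + w[2] * dc

    # Toy usage on random data of matching shapes.
    rng = np.random.default_rng(0)
    P = mixed_proximity(rng.normal(size=15),
                        rng.normal(size=(15, 3)),
                        rng.normal(size=(15, 6)))
    ```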

    Stage 3 – We perform agglomerative clustering to gradually merge all points into a clustered tree. Initially, every point is a member of its own cluster, and the distance between clusters is the geometric distance described in Stage 2. After clusters merge, their distance to individual points or other clusters is defined as the distance between the closest pair of points in the 2 clusters (i.e. single linkage).
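    Stage 3's closest-pair merge rule is single-linkage agglomerative clustering, which follows connected irregular shapes rather than imposing round clusters. A sketch with SciPy on a precomputed distance matrix (the arc-plus-blob data is an illustrative stand-in for the Stage 2 proximities):

    ```python
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    # One curved cluster (points along an arc) and one compact cluster.
    rng = np.random.default_rng(2)
    theta = np.linspace(0, np.pi, 30)
    arc = np.c_[np.cos(theta), np.sin(theta)]
    blob = rng.normal([3.0, 0.0], 0.1, size=(30, 2))
    pts = np.vstack([arc, blob])

    # Any pairwise proximity (e.g. the Stage 2 mixture) can stand in here.
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)

    # method="single" is exactly the "closest pair between the 2 clusters"
    # merge rule; the result Z encodes the whole clustered tree.
    Z = linkage(squareform(dist, checks=False), method="single")
    labels = fcluster(Z, t=2, criterion="maxclust")
    ```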

    Stage 4 – Repeated steps of clustering produce a tree that can be used as a basis for regression. To produce an estimation algorithm, one needs to identify the best synthetic splitting rules from the empirical splits in the tree for use in the regression model. One can then use any of the “local regression” methods (e.g. kernel regression) to define a regression model for a given node in the tree, and use CART-style or resampling calculations to determine where the tree should terminate in a leaf of the regression model.
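    As a sketch of the "local regression" piece of Stage 4, here is Nadaraya–Watson kernel regression for the points at a single tree node (the bandwidth and the sinusoidal node data are illustrative assumptions):

    ```python
    import numpy as np

    def kernel_regression(x_train, y_train, x_query, bandwidth=0.3):
        """Nadaraya-Watson kernel regression: a Gaussian-weighted local
        average of y_train, one candidate "local regression" for a node."""
        w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
        return (w @ y_train) / w.sum(axis=1)

    # Points at a hypothetical node, following a smooth trend plus noise.
    rng = np.random.default_rng(3)
    x = np.sort(rng.uniform(0, 1, 200))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 200)

    xq = np.array([0.25, 0.75])
    yq = kernel_regression(x, y, xq, bandwidth=0.05)
    ```

    In the full scheme, one such local model would be fit per candidate node, with CART-style or resampling comparisons deciding whether a node becomes a leaf.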

    Alternative form for Density Modeling:
    For pure density modeling we do not have a response Y_i to drive our reduction of the space X. But as discussed in the note above, we can still use many of our indicator tests to look for bumps in 1-D density profiles created by kernel density estimation in 1-D, and the number and significance of bumps become our criteria for including a given X_i in the reduced space D. Following that plan, we can build tree-based non-parametric density models in the manner described above, leaving out the Y_i weightings.
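    A sketch of that bump-screening step, assuming SciPy's `gaussian_kde` for the 1-D density profile and counting its local maxima (the bimodal test data and the helper name `count_bumps` are illustrative):

    ```python
    import numpy as np
    from scipy.stats import gaussian_kde
    from scipy.signal import find_peaks

    def count_bumps(x, grid_size=512):
        """Count local maxima ("bumps") of a 1-D kernel density estimate."""
        grid = np.linspace(x.min(), x.max(), grid_size)
        density = gaussian_kde(x)(grid)
        return len(find_peaks(density)[0])

    # A clearly bimodal feature vs. a unimodal one; under the plan above,
    # the bimodal X_i would be admitted to the reduced space D.
    rng = np.random.default_rng(4)
    bimodal = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])
    unimodal = rng.normal(0, 1, 1000)
    ```

    A real screening rule would also need a significance check (e.g. bootstrap stability of the bump count), since KDE wiggles can produce spurious small maxima.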
