› Forums › Personal Topics › Unbidden Thoughts › Online Exploratory Analysis & Prediction Frameworks
This topic contains 21 replies, has 1 voice, and was last updated by
Josh Stern February 15, 2023 at 9:03 am.
January 21, 2023 at 5:38 pm #126157

Josh Stern
Moderator
Thoughts on an efficient, robust computing architecture that can be distributed:
a) At each stage of model development there are M “grid” dimensions. Conceptually, they are ordered like eigenvectors, from the largest latent value of the problem space to the smallest latent value. But the architecture isn’t literally that rigid, nor does it force a rigid view of the model. The dimensions are what *we* efficiently use to abstractly access the multidimensional X of the abstract problem space while considering multidimensional X => Y maps, where Y can also be multidimensional & represent simultaneous optimization of various concerns. The ordering for processing the multi-indices of the X’ overlay is fixed as a feature of our way of doing things, with the abstract concept of a large-to-small eigenvalue progression.
Each dimension has a data structure/scheme describing something like the cardinality of steps currently prescribed for that dimension. But the modeling framework does not demand that the number of steps be the same for each index in the subspace stack, and it does not demand uniform cubby sizing. Abstractly, we are going to perform computations like:
Model Space Cubby Hole h = Lookup(eigen(M-1), X, Lookup(eigen(M-2), X, … Lookup(eigen(0), X, Init) … )). Each dim knows how to work with the running prep from the more significant dims & a query point x in X, in order to take us to the actual local modeling for a specific hole h and whatever data points are currently assigned to that hole.
Edit: Please note that there is zero requirement here for linear modeling. A given dimension could be cluster indices if that was assigned by the chosen model building framework.
Edit: There are many different kinds of partial computation work that can be carried down from each dimension besides a linear vector orientation. These include range clips along that vector, components of kernel weightings, and other model-specific computations. The general concept is that in the model building phase, the final cubby hole gives convenient access to all the data that are neither clipped nor masked out by resampling masks. Notice that there is a dependency between what should be clipped and the type of modeling being attempted. In general, a modeling component that will only be used for a given cubby hole may still care about support from other data; in the well known case of linear regression, all the data is used. The “trick” is to permit a kitchen sink of potential approaches that can efficiently reuse computational code and strategies to control over- and under-fitting for both global and local model features.
We might use a B-Tree for each eigendimension lookup. The actual computational work of each dimension gets adjusted over time as models & data develop. The number of dimensions can grow or shrink. The interpretation of major to minor stays conceptually fixed. The running “Lookup” structure and the flexibility for complex models in each cubby hole facilitate every form of computation with reasonable efficiency.
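A minimal sketch of the chained dimension lookups described above, assuming each dimension is just a sorted list of breakpoints (a stand-in for the per-dimension B-tree) and the cubby-hole table is a plain dict keyed by the tuple of step indices. The `Dimension` class and `project` callback are illustrative names, not part of any existing library:

```python
import bisect

# One "eigendimension": breakpoints along that axis define its steps.
# A real implementation would use a B-tree; a sorted list + bisect is
# the minimal stand-in. Step counts need not match across dimensions.
class Dimension:
    def __init__(self, breakpoints, project):
        self.breakpoints = sorted(breakpoints)  # non-uniform cubby sizing OK
        self.project = project                  # maps query x -> scalar coord

    def lookup(self, x, prep):
        # Carry the running prep from more-significant dims and append
        # this dimension's step index.
        coord = self.project(x)
        step = bisect.bisect_right(self.breakpoints, coord)
        return prep + (step,)

def cubby_hole(dims, x, init=()):
    # Lookup(eigen(M-1), ..., Lookup(eigen(0), x, Init)): major to minor.
    prep = init
    for dim in dims:
        prep = dim.lookup(x, prep)
    return prep  # tuple of step indices = key for the cubby-hole table

# Two dimensions with different step counts:
dims = [Dimension([0.0, 1.0, 2.0], lambda x: x[0]),
        Dimension([0.5],           lambda x: x[1])]
holes = {}
holes.setdefault(cubby_hole(dims, (1.5, 0.7)), []).append((1.5, 0.7))
```

In a fuller version, `prep` would carry the partial computation work (range clips, kernel weight components) rather than just indices.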
We believe that efficient computation in some areas will benefit from allowing actual grids on small simple problems, abstract hashing of cubby holes on large problems where many might be relatively data-empty, & defaulting to higher-level models in high-dimensional spaces. The design supports this flexibility without adding significant computational cost.
The power & ubiquity of resampling schemes makes it desirable to implement methods for dynamically inserting & retracting masks of active data points. In a parallel, distributed architecture, we want those operations to be copy-on-write and concurrency safe.
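One way to sketch copy-on-write, concurrency-safe mask insertion/retraction: each mask layer is an immutable frozenset of masked-out point ids, and pushing a mask builds a new view that shares the parent's layers rather than copying active sets. The `MaskView` class is an illustrative construction, not an existing API:

```python
# Minimal copy-on-write resampling masks: immutability is what makes
# concurrent readers safe, since no shared state is ever mutated.
class MaskView:
    def __init__(self, parent=None, masked=frozenset()):
        self.parent = parent
        self.masked = frozenset(masked)

    def push(self, point_ids):            # insert a resampling mask
        return MaskView(self, point_ids)  # parent view is untouched

    def pop(self):                        # retract the most recent mask
        return self.parent if self.parent else self

    def is_active(self, pid):
        view = self
        while view is not None:
            if pid in view.masked:
                return False
            view = view.parent
        return True

base = MaskView()
fold = base.push({1, 3})          # hold out points 1 and 3
# base still sees everything; fold sees the masked view
active = [p for p in range(5) if fold.is_active(p)]   # [0, 2, 4]
```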
Analytics about residuals, model influence points, outliers, etc. need to be efficiently implemented within the framework and made convenient for re-use. Improving error operations might well benefit from a sorted cull from most erroneous to least erroneous cubby holes, while computations aimed at model summary & visualization might work with the opposite ordering.
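The sorted cull might be as simple as a heap over per-hole residual scores; the `residuals` table below is hypothetical data, and reversing the sign flips between the error-improvement and summary orderings:

```python
import heapq

# Hypothetical per-cubby-hole mean-squared residuals.
residuals = {("a",): 0.9, ("b",): 0.1, ("c",): 2.5}

# Most-erroneous-first cull for error-improvement passes (max-heap via
# negated keys); reverse the sign for summary/visualization passes.
heap = [(-err, hole) for hole, err in residuals.items()]
heapq.heapify(heap)
worst_first = [heapq.heappop(heap)[1] for _ in range(len(heap))]
# worst_first == [("c",), ("a",), ("b",)]
```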
-
January 21, 2023 at 5:58 pm #126158

Josh Stern
Moderator
Relevant stat perspective. There’s a significant difference in f:X=>Y modeling between starting out with the eigenvectors that have the largest value for X itself vs. those that have the largest weighted value for f(X). The latter computation explicitly involves peeking at Y. If we go for the f-weighted vectors initially, we want some extra anti-biasing care. But what specifically? Weighting towards the eigenvalues for X is mostly heuristic. I’d suggest some kind of L1 centroid robustness makes sense. What are the specific pro recommendations?
-
-
January 22, 2023 at 8:28 am #126160

Josh Stern
Moderator
Passes through all of the data in a given round, or through the regions of the data affected by a partial model modification, are the computational workhorses. The efficiency of the overall framework is enhanced if there are hooks to add extra work at each point that is useful for other analytics or the next round of fitting, e.g. the error residuals of each point could be inserted in a priority queue ordered by value or some other criterion of interest.
-
January 23, 2023 at 2:53 am #126161

Josh Stern
Moderator
To toot the horn a bit more for why this way of doing things is a win:
Combo of:
a) High-dimensional problems & multi-objective problems & missing data are known frontiers of difficulty. Computational cost is one element of the difficulty to be mastered. Combining exploratory modelling, optimized fitting, model refinement, & summarization are other issues. We don’t expect the depth of treatments (# of latent variables, classes, clusters, or depth of regression tree) to grow linearly in the number of problem dimensions. Something like log D is more appropriate. On the other hand, we don’t expect to stay with linear frameworks in large problems with lots of data. Adaptively analyzing “meaningful distance” is a feature of high-dimensional problems.
b) In performance computing, making the right use of fast data structures matters a lot. Picking the right choice of arrays, trees, & hashmaps is engineering. Frameworks like the C++ STL build a bridge between that type of issue & the issue of customizing work for an infinite set of potential applications. The description here is readily compatible with that sort of treatment & exhibits similar motivations.
Edit: When dealing with very large data sets and applications that feature some combination of frequent locality/recurrence of access, B-Trees with recency caching of the nodes are usually a high-performance variant that is unfortunately omitted from many libraries – i.e. the cache is something like a hash lookup for addresses of virtual chunks that include the nodes recently in use, which are often neighbors near the top of the scan space, etc.
c) The concept of non-parametric fitting, analytics, & regression is general. Statistics includes lots of different modeling ideas, but missing data & time series with drifts & jumps are disconnected from the theoretical premises for the bulk of the catalog. Fixing that in a computational framework is a win. Looking for drift & jumps can involve a lot of extra computation. Again, efficiency matters. Missing/partial data is also disconnected & offers similar benefits for unification. We want to get away from the practice that our solution is already broken at “We assume that…”
d) Resampling is the most non-parametric method of model evaluation as data set size increases. It can also be used for testing non-independence assumptions.
So we look for platform ways to promote more generality, more reusability, & faster computations for the kinds of problems where computation speed is a barrier.
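The B-Tree recency caching mentioned in the edit above might be sketched like this; `fetch_node` is a hypothetical stand-in for the real (slow) node read, and the hash-lookup cache in front of it evicts least-recently-used chunks:

```python
from collections import OrderedDict

# Sketch of a recency cache in front of B-tree node fetches: a hash
# lookup keyed by node id holding the chunks recently in use (often
# the nodes near the top of the scan space).
class NodeCache:
    def __init__(self, fetch_node, capacity=64):
        self.fetch_node = fetch_node
        self.capacity = capacity
        self.cache = OrderedDict()
        self.hits = self.misses = 0

    def get(self, node_id):
        if node_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(node_id)      # refresh recency
        else:
            self.misses += 1
            self.cache[node_id] = self.fetch_node(node_id)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)   # evict least recent
        return self.cache[node_id]

cache = NodeCache(fetch_node=lambda nid: {"id": nid}, capacity=2)
for nid in ["root", "leaf1", "root", "leaf2", "root"]:
    cache.get(nid)
# "root" stays cached across the scan: 2 hits, 3 misses
```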
-
January 26, 2023 at 2:29 am #126183

Josh Stern
Moderator
A more pseudo-algorithmic summary of the general idea:
Step 1: Pick an X=>Y space of some type and general metrics for evaluating f(X). Many problems can be fit into a multivariate non-linear, non-parametric regression format. Multivariate Y is not a hard extension. Missing data & non-iid data with temporal drift or discontinuities have less theory, but we aim to treat those also. Bandit-type problems, where elements of X can be controlled and there are variable costs associated with different choices, are another extension of interest that falls a bit outside the simple framework. But potentially we can consider X’ = (X1, X2), where X1 can be controlled and has cost optimization while X2 is like a randomly sampled multivariate. Density Estimation & Classification and Point Estimation are easily fit within the regression framework.
Step 2: Associate each stat methodology for regression with a way of fitting an initial model to f. Conceptually this is like a first principal component without reference to others. After the initial choice, it may be modified by later refinement. And multiple different methods may be evaluated to see which is best. But within each general methodology type, the initial choice potentially matters in details & has far-reaching effects on what comes after in that modeling form. Note, for example, that even in the realm of linear regression with RMSE norms, Ridge regression, PCA, PLS, Lasso, Elastic Net, etc. may give different choices for the first and next eigenvectors. In general, we have no intention to restrict attention to linear models. But we do intend to look for additive combinations of predictors in each region of X that seek to be de-correlated from other predictors of the model operating in the same region of X. And we intend a similar sort of dimensionality reduction for the number of predictors to be employed in any given stack.
Step 3: Associate the choice of Step 2 with a particular total ordering of the data to be considered – e.g. some way of sorting it in a B-Tree. Identical points can be treated in a special way for added efficiency or assigned extra unique tags.
Step 4: Develop a pair of parameter fits for the given predictor with the properties that one of the fits is believed to be over-regularized & biased while the other is believed to have less bias with possible under-regularization.
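For the linear-shrinkage special case, the Step 4 pair of fits can be sketched with ridge regression at two regularization strengths; the data and alpha values below are illustrative assumptions, and the closed form used is w = (XᵀX + αI)⁻¹Xᵀy:

```python
import numpy as np

# Step 4 sketch: fit the same predictor twice with ridge shrinkage,
# once over-regularized (large alpha, biased toward zero) and once
# lightly regularized (small alpha, less bias, possible
# under-regularization).
def ridge_fit(X, y, alpha):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=200)

w_over = ridge_fit(X, y, alpha=100.0)   # biased, shrunk toward zero
w_under = ridge_fit(X, y, alpha=0.01)   # close to plain least squares
```

The gap between the two coefficient vectors is the "envelope" that later levels refer back to.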
Step 5 (Optional): If non-linear modeling is to be employed, then we may wish to divide the span of ordered data points into regions based on criteria that consider both the number of points in each region (is it sufficient to sustain a statistical hypothesis that a division is merited?) & discontinuities in the residual of the over-regularized model predictor. Where we are treating non-iid time series, the same criteria may also apply to the temporal coordinate of the points (assuming it is not used in the X space itself). The point of considering division here is to allow the statistical models for the remaining stack in each region to become uncoupled (as an exploratory hypothesis) and also to permit easier parallelization of the remaining model fitting. In general, a division may also include an apron of neighboring points that are understood as not part of the division being evaluated while still providing various forms of kernel support for fitting purposes. If a division is selected, then the algorithmic form of the predictor in the resultant model contestant will include a corresponding separating decision rule.
At this point we’ve described a framework for looking at the first component of various forms of any (linear or non-linear) regression shrinkage estimators & their unshrunken counterparts, along with additional mechanisms for introducing controlled, computationally efficient non-linearity where that is complementary to the modeling enterprise. Please note that methods which seek to use clusters, latent mixtures, causal models, etc. as explanatory variables may wish to fit a very simple model (perhaps only a multivariate mean) as their default case in the first phase of analysis.
This juncture may also be a good location for consideration of methodologies that seek to throw away some “outlier” points rather than treat them with robust regression methods.
Step 6 – Looping: Continue analysis through consideration of additional explanatory variates in each active region – often the entire data set – focusing on reduction of the residual in the over-regularized version of the current model. In addition to reducing the residual, subsequent choices of explanatory predictors should consider trying to be orthonormal to the previously chosen sets of explanatory variables in their region of interest, but they may conceptually assign less penalty for contributions to correlation that result in predictions for Y that lie between the over-regularized envelope and the midpoint of the gap to the under-regularized envelope of the previous level. There is no need for subsequent levels to use the same modeling style as prior levels.
Fit parameters in 2 versions as above. Again, consider discontinuities as above.
Resampling evaluations/analyses can & should be used within the cycle of parameter fitting at each level & in outside competitions between competing models. The iteration of model depth may stop due to a variety of criteria, but normatively it should stop whenever the resampling error of the “underfitted” model appears to show no positive improvement over the prior levels.
The overall approach offers an overabundance of computational things to try. Implementations should build priority queues for scheduling the work that is most likely to be valuable first. Value is a function of improvement/completion of a current best model & perhaps extra information about the level of utility/interest that varies in different areas of prediction (e.g. mining for high-value bumps in a commercial setting).
-
January 26, 2023 at 2:45 am #126184

Josh Stern
Moderator
Q: Having found some particularly interesting bump at a layered algorithmic stage, it might be tempting to start looking at that bump directly, without the support being shaped by the other layers. What is the right algorithmic way to do that?
A: It depends on what we really mean by “interesting”. If our bump hunting is about finding something missed by other data models, then we want the support of the underlying models in place if they are part of the model. We can, of course, consider different models that initially fit a mean & then look for region bumps on top of that. Finding a bump with large model signal compared to prior regressions might signal the scheduling algorithm to give a high priority to looking at that sort of model first. For cases like “finding best stocks”, we are really in the situation described above where a chunk of X is under our instrumental control (where/when we choose to act). In fitting such models, we probably start with win/loss metrics that are very different from global regression, emphasizing special bumps as dominant features of our metrics. The first pass in that case is probably going to be mean fitting & then clustering. The “PRIM” model was one heuristic attempt to do something like that, but it was not really designed to deal with high dimensionality. In high dimensions, we still benefit from the global search for which dimensions matter. Spectral clustering lets one model the concept of “nearness to exemplar support”. In the stock case, we may find that general features of “balance sheet” and the like are still general contributors to the model from lower dimensions.
-
January 26, 2023 at 1:10 pm #126185

Josh Stern
Moderator
Q: What’s your position on tradeoffs between more complex models that obtain greater prediction accuracy on benchmarks & models with simpler human factors that are easier for humans to interpret?
A: My position is that as data sizes & problem sizes grow, humans need more & more help from machines to find, interpret, analyze, and visualize results. In different scenarios & contexts, one or another attribute of modeling may receive more or less weight. Aspects of models that are believed to aid interpretability or visualization should be formulated as costs & utilities so that they too can benefit from the most efficient computational processes searching the solution space. It’s absolutely valid to use fit metrics that give significant weight to the form factors that are judged to help with interpretability. Interfaces should make those choices available for experimentation/improvement so the issues get improved/buffed by practice.
Note also that the practice of using multi-layer neural nets with fixed structure & an incremental training regime for developing internal latent structure has become popularly known as “Deep Learning” & is used by many applied practitioners for non-linear, non-parametric regression. In those cases, little weight is given to interpretability while significant weight is given to a priori priors and hardwired assumptions about the nature of latent variable structure. The category name “Deep Learning” does not promote penetrating analytical insight into problem structure, but it has proved to be a useful tool for many. This computational framework can be used for multi-staged Deep Learning, for fixed “neural architecture” models, and/or for specific priors about latent variable structure in problems with multivariate, multi-objective Y. And particular choices can be given branded package names for reference. Our development of a utility framework for model search & fitting is at a less methodologically opinionated, more tool-oriented level of utility platform development. Novelty is often associated with being opinionated, but the design here is trying to head in the direction of being non-opinionated on such matters.
-
January 26, 2023 at 1:51 pm #126186

Josh Stern
Moderator
The f(controlPlan, (X1, X2)) => Y scenario discussed above, where “controlPlan” governs which part of X1 is chosen (possibly as a function of X2), is a common situation. It would be helpful to introduce some “branded” terms for the metrics to be optimized – e.g. ConstrainedMaxUtil_low_p(a,b) means Control maximizes ExpectedValue(f(Control(X1,X2),X2)=>Y) subject to the constraint that the probability of a scored value less than a is less than b. Something that is understood but sounds better than that would be good.
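An empirical version of ConstrainedMaxUtil_low_p(a,b) might be evaluated over resampled outcomes like this; `scores_by_plan` is a hypothetical table of sampled scores per candidate control plan:

```python
import numpy as np

# Pick the plan maximizing the mean sampled score subject to the
# tail-risk constraint P(score < a) < b.
def constrained_max_util(scores_by_plan, a, b):
    best_plan, best_mean = None, -np.inf
    for plan, scores in scores_by_plan.items():
        scores = np.asarray(scores, dtype=float)
        if np.mean(scores < a) >= b:      # tail-risk constraint violated
            continue
        if scores.mean() > best_mean:
            best_plan, best_mean = plan, scores.mean()
    return best_plan

scores_by_plan = {
    "aggressive": [10, 10, -10, 10],   # high mean, fat lower tail
    "cautious":   [2, 2, 2, 1],
}
# With a = 0, b = 0.2: "aggressive" has P(score < 0) = 0.25 >= 0.2,
# so the cautious plan wins despite its lower mean.
```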
-
January 26, 2023 at 2:25 pm #126187

Josh Stern
Moderator
Implementation Note: Many methods in non-parametric stats make use of rank-order statistics for various dimensions/indicators. In support of that, B-tree methods can be easily modified to include rank-order mile-posts in their nodes, enabling quick lookup of a value’s rank.
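The mile-post idea can be sketched with a binary search tree whose nodes store subtree sizes, so a rank is read off during the descent; a production version would hang the same counters on B-tree nodes:

```python
# Order-statistic tree sketch: each node tracks the size of its
# subtree, so rank(v) = count of stored values strictly less than v
# falls out of a single root-to-leaf descent.
class Node:
    __slots__ = ("key", "left", "right", "size")
    def __init__(self, key):
        self.key, self.left, self.right, self.size = key, None, None, 1

def insert(node, key):
    if node is None:
        return Node(key)
    node.size += 1
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    return node

def rank(node, key):
    # Number of stored values strictly less than `key`.
    if node is None:
        return 0
    if key <= node.key:
        return rank(node.left, key)
    left_size = node.left.size if node.left else 0
    return left_size + 1 + rank(node.right, key)

root = None
for v in [5, 1, 9, 3, 7]:
    root = insert(root, v)
# rank(root, 7) == 3  (values 1, 3, 5 are below 7)
```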
-
January 26, 2023 at 4:09 pm #126188

Josh Stern
Moderator
Spatial Texture As Statistical Features – An Advanced Topic In Exploratory Data Analysis
Vision is the most informative & dominant sensory mode for most humans, & spatial texture perception is widely recognized as an important component of both low- & high-level vision. One open methodological Q of interest can be posed like this: to what extent are the methods of input feature elaboration in the spatial texture of vision relevant to featural elaboration of audio input, touch input (including, e.g., vibrations), taste input, and perhaps processing of higher-level events like walking around in a crowded grocery supermarket & noticing/encountering different types of input – sights, sounds, people, signs, etc. – as one progresses through a shopping trip?
We believe that:
a) there are relevant generalizations that are useful to various forms of data analysis
b) they are computationally intensive, but “massively” parallelizable
c) generating statistics summarizing distributions of monte carlo tours of local distinctions/boundaries is a potential style of featural elaboration that deserves more attention (where compute resources are available)
d) The elaborations provided by generating spatial texture feature statistics can contribute in the style of support vector pre-processing – extra featural elaboration can make the regression Y distinctions of interest available to simple linear models/hyperplanes.
Abstract Form of Spatial Texture Stats:
Within a bounded, compact subspace region of X, we notice various functions f1, f2, … fn which are like binary features, with perhaps a transition slope between on/off. Various types of tours include following a boundary between on and off, staying within a region of on/off, or trying to jump to a near point in the opposite category. Lengths/angular dot products of steps in some random walk can be compiled as a set of featural distributions for any given starting point in the region; each starting point will have a different elaborated signature. The distribution of these signatures in a region is also a multivariate feature.
All of the above is parameterized by choice of compact sub-space, binary forms f, etc. Which are categorically information bearing for Y as measured by information divergence or similar stats? That’s a type of Q we can throw compute resources at.
Note that for the familiar case of visual texture, the viewer’s distance from a given light-reflecting surface plays a critical role in the scale & perception of spatial textures there. That distance would be an important featural dimension of X in a full treatment of the modeling effects for purposes of understanding human visual perception. If the model is plausible, then different ways of describing paths around a given location should be able to provide vectors for discriminating between a paisley pattern being present vs. not at that location.
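A toy version of the tour-signature idea, under heavy simplifying assumptions: one binary feature f on the plane, random unit steps, and a signature consisting of the boundary-crossing rate plus the mean dot product between consecutive steps. All names here are illustrative:

```python
import numpy as np

# Monte carlo tour: take random unit steps from a starting point,
# record whether each step crosses the on/off boundary of f, and
# summarize the tour as (crossing rate, mean step dot-product).
def tour_signature(f, start, n_steps=200, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.asarray(start, dtype=float)
    prev_dir = None
    crossings, dots = 0, []
    inside = f(pos)
    for _ in range(n_steps):
        ang = rng.uniform(0, 2 * np.pi)
        step = np.array([np.cos(ang), np.sin(ang)])
        pos = pos + step
        if f(pos) != inside:
            crossings += 1
            inside = f(pos)
        if prev_dir is not None:
            dots.append(step @ prev_dir)
        prev_dir = step
    return crossings / n_steps, float(np.mean(dots))

# A striped texture: "on" where floor(x) is even.
stripes = lambda p: int(np.floor(p[0])) % 2 == 0
sig = tour_signature(stripes, (0.5, 0.5))
flat_sig = tour_signature(lambda p: True, (0.5, 0.5))
# Fine stripes cross often; a constant field never crosses.
```

Distributions of such signatures over many starting points would then serve as the elaborated features.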
-
January 28, 2023 at 11:26 pm #126200

Josh Stern
Moderator
In a competition at each level of additive component type, we may often select a provisional “winner” that does the best job of generalizable reduction of the residual objective error function. But some other heuristics may also be relevant. Think of the example of a pool shot – knocking in more balls is good, but the expert player may prefer a result that leaves an easier table for the next shot. The analog of that here is clustering of the residual errors. There is no one definition of clustering, but we can measure clusterability by using various types of data compression, especially those using run-length encoding (RLE), to assess the lumpiness of the residual. Our heuristic search for the best model may preferentially choose a next layer that is not the absolute best at reducing the residual if it shows a significant advantage in clustering of the residuals.
Additional points: The Y residual may be compressed after zeroing out the least significant digits to eliminate noise influence. A further modeling analysis may be more or less concordant with the simplicity concept of a given compression scheme.
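A minimal sketch of the RLE lumpiness score, assuming we quantize residuals to signs (zeroing out the fine digits) and count runs; fewer runs means more clustered residuals and an easier target for the next layer:

```python
# Run-length-encoding based clusterability score for residuals.
def rle_runs(symbols):
    runs = 0
    prev = object()
    for s in symbols:
        if s != prev:
            runs += 1
            prev = s
    return runs

def lumpiness(residuals, tol=0.5):
    # Quantize to {-1, 0, +1}, treating |r| < tol as noise-level zero.
    signs = [0 if abs(r) < tol else (1 if r > 0 else -1) for r in residuals]
    return rle_runs(signs)

clustered = [2.0, 2.1, 1.8, -1.9, -2.0, -2.2]   # 2 runs: easy structure
scattered = [2.0, -2.0, 2.1, -1.9, 1.8, -2.2]   # 6 runs: noise-like
# lumpiness(clustered) < lumpiness(scattered)
```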
-
January 29, 2023 at 12:23 am #126201

Josh Stern
Moderator
Q: How could this framework contribute to models for web agents/web scraping?
A: For web agents, temporal drift/jumps are very real – styles change, libs change, browsers change, & websites change. Multi-objective models of correctness are also important. The feature space of available information is very high-dimensional. Different kinds of data would be available, with some supervised, some semi-supervised, and some unsupervised, based on different collection/processing methods. Missing data would be normal – is there a picture of an item available? Is it confirming or alarming for a given course of action?
-
February 3, 2023 at 6:56 am #126243

Josh Stern
Moderator
I asked myself this Q: “Take away models where basic Lipschitz continuity is an expected feature of major model factors. How would I characterize the class of models that get highest Bayesian priority from me in the remainder, pretending I can assign such a thing to a Q out of domain context?”
FWIW, I think those models show common features of somewhat repeating “local processes, like patterns” – e.g. cities have extra arterial roadways… In abstract terms, we can make an initial guess about distance using the first few eigenvectors of X and then look at neighborhood clustering of similar Y values. Is there a way to describe that as a generative center & then some elaboration around the center?
For example:
Step 1 – Identify large principal components of X
Step 2 – Call that metric space X’ – cluster (X’,Y) using a technique that emphasizes nearest link connectivity (our intuition is about growing local processes rather than any form of Gaussian)
Step 3 – model the clusters as a center plus patterns – for example by converting Y into rank-order statistics and looking at moment distributions of those in each cluster
-
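Steps 1 and 2 above can be sketched as follows, under illustrative assumptions: principal components via SVD, and nearest-link connectivity approximated by union-find over pairs closer than a threshold `eps` (which grows chains the way local processes do, rather than assuming Gaussian blobs):

```python
import numpy as np

# Step 1: project X onto its top-k principal components.
def top_components(X, k):
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T          # X': coordinates in the top-k subspace

# Step 2: single-link clustering via union-find on close pairs.
def single_link_clusters(Xp, eps):
    n = len(Xp)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(Xp[i] - Xp[j]) < eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Two chains well separated along the leading direction:
X = np.array([[0.0, 0], [1, 0], [2, 0], [10, 0], [11, 0], [12, 0]])
labels = single_link_clusters(top_components(X, 1), eps=1.5)
# points 0-2 share a label, points 3-5 share another
```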
February 5, 2023 at 5:04 am #126248

Josh Stern
Moderator
Q: How can the compute scheduling process construct an estimate of “Best First Search” in order to decide what to work on next? This process is complicated by considering a hybrid space of many types of modeling methodologies, some of which may take a long time to evaluate.
A: I suggest combining the following principles:
a) Provide flexibility in how preferences for model form can be promoted as benefits/penalties. We can do this by scoring an estimated model using both pure stats criteria and extra utility scores based on i. simplicity, ii. homogeneity of methodology, iii. computational complexity, etc.
b) Provide flexibility in specifying the number of complete model alternatives we are currently searching for and whether there is an extra bonus for finding heterogeneity in the example set with cardinality > 1.
c) Provide ways of estimating the final score & the amount of time remaining to attain that score along a particular model search path. There is some given probability p that the current path will improve on the current candidate set we are holding. If the model does not improve, then say that we could start again using a better base from the other methodology. So it makes sense to say that the estimated time to model improvement along this path is finite, but might be long & wasteful. The best-first concept is to focus work on the path/node expansion with the shortest expected time to improve the current candidate set. We can and should also add in any additional costs for increasing the working set size of algorithms currently running or putting some candidates in temporary storage suspension – that gives a slightly revised tree of best choices considering also switching/multi-tasking costs.
d) We should develop stat/ML models of final score/time expectation given current progress. Features of current progress will include current & expected contributions to the extra utility score, the current score for the residual in the stat norm being used, and the current shrinkage score for the base that has been configured. We believe that more stable bases (low variability, etc.) have more room to run & improve, other things being equal.
Edit: Different parts of the evaluation function that contribute to the time-to-improvement estimate will go stale at different points, as the solution set improves in one way or another (or the ML model changes). Efficient handling is probably to store the time of last update for each part & the global time change vector. If partial candidates are stored in a priority queue, then some heuristic can be used to determine how much of the front is worth updating prior to picking the next one.
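The scheduling loop with staleness timestamps might be sketched like this; `rescore` is a hypothetical stand-in for the stat/ML time-estimate model, and only the front of the queue gets refreshed:

```python
import heapq

# Best-first model-search scheduler: candidate paths carry an estimated
# time-to-improvement; entries are popped cheapest first, but an entry
# scored before the last solution-set change is stale and is re-scored
# before being trusted.
class Scheduler:
    def __init__(self, rescore):
        self.heap = []           # (est_time, scored_at, path)
        self.clock = 0           # bumps when the candidate set improves
        self.rescore = rescore

    def add(self, path, est_time):
        heapq.heappush(self.heap, (est_time, self.clock, path))

    def note_improvement(self):
        self.clock += 1          # existing estimates go stale

    def next_path(self):
        while self.heap:
            est, scored_at, path = heapq.heappop(self.heap)
            if scored_at == self.clock:
                return path
            # stale: refresh only this front entry, then re-queue
            self.add(path, self.rescore(path, est))
        return None

sched = Scheduler(rescore=lambda path, old: old * 2)  # toy staleness penalty
sched.add("ridge-path", 3.0)
sched.add("tree-path", 5.0)
sched.note_improvement()           # both estimates are now stale
first = sched.next_path()
```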
-
February 7, 2023 at 3:22 pm #126253

Josh Stern
Moderator
Q: What is the best way to incorporate the positive powers of neural net style “Deep Learning” in this framework?
A: All forms of non-linear regression that are designed to support optimized generalizability of predictions must find methods to limit the influence of each individual data point on the chosen model. The influence of each data point must shrink in proportion to the growth of the pool of points that are part of the support kernel for any given estimation. Neural net Deep Learning and Support Vector Regression include heuristic – that is, not mathematically optimized – methods for reducing the influence of individual points as they consider overfitted, highly elaborated models in their universe of numerical hypotheses. Basically I believe that over-elaboration methods should be employed alongside optimized methods for limiting the influence of each supporting data point on the model form. Our framework should focus first on computationally attractive methods for controlling point influence for each model class. Our software problem there is that the generality of our format won’t always give analytical solutions. We have at least these 4 cases: analytical solutions are easy; the user is trusted to supply a computational solution (which we could optionally verify as a check); the framework provides heuristic numerical solutions; or the framework provides precise numerical solutions at greater computational cost.
A given modeling situation may also support information about “noise distributions”. Where this info is available, adding additional cases from the “noise distribution” can be used as an additional mechanism of limiting data point influence and providing additional robustness to unknown artifacts of mismatch between training and future prediction scenarios. This is sometimes called “fuzzing”.
As a platform, the framework can also contribute by combining work on finding interesting elaborations with work on sorting them & fitting them. Intellectually one can be true albeit lazy and say “Any functional elaboration could be useful to some fit.” True, but unhelpful. The style of elaborations that achieve power in ML learning are roughly of this form:
1) Bounding windows of support kernels are pre-chosen in the space X.
2) Other forms of additional kernel support filtering are considered as well. Within each bounding window of X, we seek combinations of functional elaboration of X and possibly additional kernel filtering – which may be based on all of X – with the property that the mutual information between the functional elaboration of X in this window, with this additional filtering, and Y is high.
3) Iteratively, additional elaborations in a given window of support are allowed to utilize prior elaborations as functional inputs.
4) Further elaborations may also consider model predictions using other elaborations & new elaborations that provide high mutual information with the residual.
5) The neural net Deep Learning world tries to fight over-fitting with a mix of mechanisms: fixed coefficient forms for nets, competition between sub-models/elaborations, weight shrinkage, and manually manipulating data point influence by re-weighting the training sets. There is no explicit mathematical optimization of the combo. Fuzzing techniques are also sometimes employed.
We encourage adding algorithmic flexibility to support the various forms of limiting influence – including specification of fuzzing, where available. And we encourage adding libraries of routines that search for interesting elaborations.
-
February 7, 2023 at 5:58 pm #126254

Josh Stern
Moderator
One meta-dimension of data modeling where people often have prior or design preferences: how to treat smoothing in a) densely sampled areas with interpolation, b) loosely sampled or edge areas where there is extrapolation, and c) areas with high functional slope – is it best modeled as an actual discontinuity (cliff vs. steep hill)? We imagine that it might be a good feature to let users specify preferences for each of these areas independently.
-
February 7, 2023 at 6:12 pm #126255

Josh Stern
Moderator
I posted the following idea a week or 2 ago & then lost track of its place or contents.
Concept: Construct non-parametric estimates of a smoothed surface f(X) based on a combination of identifying principal subspaces X’ for f(X) and, for each pair of data points in X’, prior to filtering, considering the “straight line segment” connecting f(x1) and f(x2), x1, x2 in X’. Each line segment potentially contributes a kernel-based vote for the value & slope of f(X’) at a point, based on distance to the straight line segment & distance to the endpoints. Using suitable versions of that, we build estimates of slope & value at a suitable grid of points, which can then be modeled or visualized using familiar techniques. The preferences regarding extrapolation can be factored into this process in a ready way.
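A small sketch of the pairwise-segment voting estimator for values (slopes omitted): every data pair defines a segment, and a grid point receives a kernel-weighted vote for the value linearly interpolated at its projection onto that segment. The Gaussian kernel and bandwidth `h` are assumptions of this sketch:

```python
import numpy as np

# Each pair (x_i, f_i), (x_j, f_j) votes for the value at grid point g,
# weighted by a Gaussian kernel on the distance from g to the segment.
def segment_vote_estimate(xs, fs, grid, h=0.5):
    est = np.zeros(len(grid))
    wts = np.zeros(len(grid))
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            seg = xs[j] - xs[i]
            L2 = seg @ seg
            for k, g in enumerate(grid):
                # Projection parameter onto the segment, clipped to it.
                t = 0.0 if L2 == 0 else np.clip((g - xs[i]) @ seg / L2, 0, 1)
                nearest = xs[i] + t * seg
                w = np.exp(-np.sum((g - nearest) ** 2) / (2 * h * h))
                est[k] += w * (fs[i] + t * (fs[j] - fs[i]))
                wts[k] += w
    return est / wts

xs = np.array([[0.0], [1.0], [2.0]])
fs = np.array([0.0, 1.0, 2.0])           # f(x) = x along a line
grid = np.array([[0.5], [1.5]])
vals = segment_vote_estimate(xs, fs, grid, h=0.1)
# all three segments agree on the line: vals ≈ [0.5, 1.5]
```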
-
February 8, 2023 at 9:13 am #126256

Josh Stern
ModeratorThe linear-regression-related analyses that use principal components place value on selecting the components with largest eigenvalue & on their mutual orthonormality. There are very few applications where anything is proposed as a magical/just-right property of the exact directional vectors chosen. In our modeling context, where we are often not planning to stick with linearity, it is possible & potentially helpful to adjust components piecewise if doing so does not have a high impact on our quest for stability & noise elimination. Consider, for example, the following plan – a running kernel is applied to the residuals from the best linear fit for the first component; at large local minima of this filter, it appears the linear fit was relatively poorer. Does it help the fit to make a piecewise-linear knot at that point & move it about to improve fit? Computationally, that option is attractive. However, it implies that the next linear component will not be orthonormal in the remaining subspace of X or (X,Y). Orthonormality could be approximated by also breaking it into some unknown set of pieces…
Most important is that each component establish a total ordering and avoid collinearity in each region. If that is achieved, then other variations are new ideas to try out. The computational cost is low.
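A minimal sketch of the knot-placement step, assuming a Gaussian running kernel over the absolute residuals of an ordinary least-squares line; here the peak of the smoothed misfit is used as the candidate knot site (all names are illustrative, not any existing library):

```python
import math

def linfit(xs, ys):
    """Ordinary least-squares slope & intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def knot_candidate(xs, ys, bandwidth=1.0):
    """Run a Gaussian kernel over |residual| from the straight-line fit and
    return the x where the smoothed misfit peaks -- the proposed knot site."""
    slope, icept = linfit(xs, ys)
    absres = [abs(y - (icept + slope * x)) for x, y in zip(xs, ys)]
    def smoothed(x0):
        ws = [math.exp(-((x - x0) / bandwidth) ** 2) for x in xs]
        return sum(w * r for w, r in zip(ws, absres)) / sum(ws)
    return max(xs, key=smoothed)
```

Moving the knot about to improve fit would then be a cheap 1-D search around this candidate.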
-
February 9, 2023 at 2:43 am #126257

Josh Stern
Moderator“What If” scenarios include many sorts of hypothetical analyses that are explicitly not amenable to statistical data analysis. But here is a useful feature category that is:
we model multivariate f:(X1,X2)=>Y and we can construct models for the density of (X1,X2) as mixtures of different models – which can be thought of in a Bayesian sense if that helps. Holding f fixed, we can hypothetically consider different mixtures of (X1,X2) – of course we only have noisy examples of f for the observations, but we can choose to resample with varied weights. These might represent a new trend in the data, a future prognosis, the result of a plan we are considering, or an attempt to robustify our modeling & analysis. We can optimize models against those alternative mixtures. We can also consider changes to the weights with which we evaluate optimality or scoring (perhaps becoming more robust to changes in input costs).
Arbitrarily changing f itself is counterfactual speculation, so statistics doesn’t help. But fuzzing models for f’ + noise can be robustified as well.
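The "hold f fixed, vary the input mixture" step can be sketched as importance resampling of the empirical distribution; `reweighted_score` and the mean-squared-error choice are assumptions for illustration:

```python
import random

def reweighted_score(data, model, weights, n_draws=1000, seed=0):
    """Score a fixed model under a hypothetical input mixture by resampling
    the observed (x, y) pairs with the supplied weights. Returns mean
    squared error under that alternative mixture."""
    rng = random.Random(seed)
    sample = rng.choices(data, weights=weights, k=n_draws)
    return sum((model(x) - y) ** 2 for x, y in sample) / n_draws
```

Optimizing a model against several such weight vectors at once is one concrete way to "robustify" against the imagined trends or plans.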
-
February 9, 2023 at 6:03 am #126258

Josh Stern
ModeratorEfficiency & convenience can be gained by standardizing algorithms for computing kernels. In image processing, the kernels usually run over equally spaced grid points & act separably across dimensions.
For large data sets, it probably makes sense to store the data in a moving array with a circle/cycling interpretation of start/center.
Where data are irregularly spaced, it may be possible to store the relevant spatial distance info alongside the values.
Where multiple scales are to be considered, it may be efficient to consider them in parallel in one pass.
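A minimal sketch of the moving-array idea for the equally spaced case, using a `deque` with `maxlen` as the cycling buffer (class and method names are illustrative):

```python
from collections import deque

class RingKernel:
    """Running kernel over a fixed-width window held in a ring buffer
    (deque with maxlen), matching the circular start/center idea. Emits
    the kernel-weighted average once the window fills, None before that."""
    def __init__(self, weights):
        self.weights = weights
        self.buf = deque(maxlen=len(weights))

    def push(self, value):
        self.buf.append(value)
        if len(self.buf) < len(self.weights):
            return None
        total = sum(w * v for w, v in zip(self.weights, self.buf))
        return total / sum(self.weights)
```

For multiple scales in one pass, several such objects with different widths could be fed from the same stream of pushes.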
-
February 15, 2023 at 9:03 am #126276

Josh Stern
ModeratorConsider the possibilities for using mutual information models to clean up high dimensional spaces in the early going. Possible ops include:
a) Ranking I(Y,x_i) for various i
b) Dropping, temporarily or permanently, the x_j with very low info
– note that “missing data” columns in X can change this utility
c) Noticing i,j where I(Y,x_i) > I(Y,x_j) AND I(Y,x_i,x_j) – I(Y,x_i) is < epsilon (i dominates the info from j). There we can consider dropping j, temporarily or permanently, unless missing views of i is an issue.

One benefit of looking at mutual information early is that the analysis can also incorporate the question of whether a given variable is most informative as a linear-type regressor or as a multinomial. In the latter case, we can either focus on using it as a partition variable or associate a step function with a given value for each categorical partition, prior to further analysis. If a given variable x_i is naturally metric then it can also support kernel-based modeling within the dimension itself. That's like the categorical case with a different local modeling strategy for that particular variable.
-
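The ranking and dominance-pruning ops (a)–(c) can be sketched for discrete columns with a plug-in mutual-information estimator; `mi`, `prune_dominated`, and the epsilon default are illustrative assumptions:

```python
import math
from collections import Counter

def mi(xs, ys):
    """Plug-in mutual information (nats) for discrete sequences."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def prune_dominated(columns, y, eps=0.02):
    """columns: {name: discrete values}. Drop x_j when some higher-ranked x_i
    satisfies I(Y; x_i, x_j) - I(Y; x_i) < eps, i.e. i dominates the info
    from j. Returns the surviving columns."""
    keep = dict(columns)
    names = sorted(keep, key=lambda k: mi(keep[k], y), reverse=True)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if b not in keep or a not in keep:
                continue
            joint = list(zip(keep[a], keep[b]))
            if mi(joint, y) - mi(keep[a], y) < eps:
                del keep[b]
    return keep
```

The "missing data" caveat from the post would enter as an extra guard before each deletion, retaining j when i has gaps that j could cover.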