Multi-Dimensional Index Collation Theory


This topic contains 7 replies, has 1 voice, and was last updated by josh on August 1, 2022 at 12:16 am.

  • #119307

    josh

    For the linear version of the algorithm, I'm thinking about this analytical step:

    Focus on the speed of finding the target set in the range [A,B] of the global linear function, with index ranges [low1,max1],[low2,max2],… for each of the indexed dimensions. For that query, we translate low1,low2,… to 0 and clip max1,max2,… to the largest values that could still fit under the translated B. Now each dimension has a maximum % effect it might deliver, and we keep the dimensions sorted by the width of their range of uncertainty. We march on reducing uncertainty by iteratively walking along whichever dimension is currently giving the greatest bang, & then re-sorting.

    Each time we take a step on the greatest-range dimension, we examine at least 1 point, which is either collated into our retrieved set or a miss. In the process of examining from the endpoint of the greatest range, we shrink that range & adjust the potential percentage contribution from all the other ranges. Some of them can move their endpoints according to what remains a possible hit in our target set.

    Is it fast? I think so.
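
    A rough sketch of that bookkeeping, with in-memory sorted lists standing in for real indexes and a linear objective with positive weights assumed (the names and structure are illustrative, not a finished implementation):

      import bisect

      def range_collate(points, weights, A, B):
          """Collect ids of points whose weighted coordinate sum lies in [A, B]."""
          dims = len(weights)
          # One "index" per dimension: point ids sorted by that coordinate.
          idx = [sorted(range(len(points)), key=lambda i, d=d: points[i][d])
                 for d in range(dims)]
          mins = [weights[d] * points[idx[d][0]][d] for d in range(dims)]

          lo = [0] * dims
          hi = []
          for d in range(dims):
              # Translate the other lows to their minimums and clip this max to the
              # largest value that could still fit under B (weights assumed positive).
              budget = B - (sum(mins) - mins[d])
              keys = [weights[d] * points[i][d] for i in idx[d]]
              hi.append(bisect.bisect_right(keys, budget) - 1)

          examined, hits = set(), []

          def width(d):
              # Remaining % effect dimension d could still deliver.
              if lo[d] > hi[d]:
                  return 0.0
              return weights[d] * (points[idx[d][hi[d]]][d] - points[idx[d][lo[d]]][d])

          def upper_bound():
              # Best objective value any unexamined, unclipped point can still reach.
              return sum(weights[d] * points[idx[d][hi[d]]][d] for d in range(dims))

          while all(lo[d] <= hi[d] for d in range(dims)) and upper_bound() >= A:
              d = max(range(dims), key=width)      # greatest remaining bang
              pid = idx[d][hi[d]]                  # examine from that range's endpoint
              hi[d] -= 1                           # ...and shrink the range
              if pid in examined:
                  continue
              examined.add(pid)
              f = sum(w * c for w, c in zip(weights, points[pid]))
              if A <= f <= B:                      # collate the hit; otherwise a miss
                  hits.append(pid)
          return hits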

  • #119308

    josh

    The applications of the algorithm don’t require true database indexes for the function dimensions. But if there are multiple values at a point, they all need to be examined. So if the cardinality of collisions is over some threshold, it’s recommended to split off the equality sub-task there as a special case.
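
    For illustration, a minimal sketch of that threshold test, assuming we can group point ids by their coordinate value in a dimension (the threshold value and the names are made up):

      from collections import defaultdict

      COLLISION_THRESHOLD = 64   # assumed tuning knob

      def group_by_coordinate(points, d):
          """Map each coordinate value in dimension d to the ids of points holding it."""
          groups = defaultdict(list)
          for pid, p in enumerate(points):
              groups[p[d]].append(pid)
          return groups

      def batches_for_dimension(groups, threshold=COLLISION_THRESHOLD):
          """Yield ('scan', value, ids) for sparse values and ('equality', value, ids)
          for heavily collided values that should be split off as a special case."""
          for value in sorted(groups):
              ids = groups[value]
              kind = 'equality' if len(ids) > threshold else 'scan'
              yield kind, value, ids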

  • #119311

    josh

    Should be easy to adapt to starting a cursor for the best K relative to the objective function – there’s a secondary range of uncertainty about whether a given candidate makes the K cut; this closes when the bounds for the remaining points to be examined fall below the value of the worst candidate already qualified.
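
    The stopping rule might look like this sketch: a min-heap of qualified candidates plus a caller-supplied bound on what the unexamined points could still achieve (all names here are hypothetical):

      import heapq

      def best_k(candidates, remaining_bound, K):
          """candidates: iterable of (objective_value, point_id) in examination order.
          remaining_bound: callable giving the best objective value any not-yet-examined
          point could still attain. Returns the K best once nothing left can displace them."""
          heap = []                                   # min-heap holding the current top K
          for value, pid in candidates:
              if len(heap) < K:
                  heapq.heappush(heap, (value, pid))
              elif value > heap[0][0]:
                  heapq.heapreplace(heap, (value, pid))
              # The secondary range of uncertainty closes when even the best remaining
              # point cannot beat the K-th best already qualified.
              if len(heap) == K and remaining_bound() <= heap[0][0]:
                  break
          return sorted(heap, reverse=True)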

  • #119312

    josh

    Update thought: We can construct artificial examples which disprove the conjecture about worst-case complexity. Consider adding to the original setup a maximally uninformative pair of dimensions with a very wide range in each, where the points are arranged to cancel the overall goodness in the other dimension – i.e. whenever a point is good in one, it is bad in the other, so all points are middling, and the wide ranges keep us from looking at the other dimensions. The artificially created problem is linear in the expanded space but can have zero Lipschitz continuity in the linear subspace of the original problem, using the function that has access to the extra oracle dimensions. That’s not a situation which is likely to arise in practice.
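
    A toy construction of that pathology (everything below is made up for illustration): two extra dimensions with enormous ranges that exactly cancel, so they carry nothing about the objective yet dominate any ranking of dimensions by width.

      import random

      random.seed(0)
      N, WIDE = 1000, 1e6
      x = [random.uniform(0, 1) for _ in range(N)]      # original, informative dimension
      u = [random.uniform(-WIDE, WIDE) for _ in range(N)]
      a = u                                             # "oracle" dimension 1
      b = [-v for v in u]                               # "oracle" dimension 2: cancels a exactly
      objective = [xi + ai + bi for xi, ai, bi in zip(x, a, b)]   # reduces to x on this data

      widths = {'x': max(x) - min(x), 'a': max(a) - min(a), 'b': max(b) - min(b)}
      print(widths)   # a and b look far more "promising" by width, yet add no information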

    In practice, the suggested algorithm benefits for large N from the ability to collate only small subsets of interest in memory & to search the B-tree indices more quickly using internal navigation coordinates.

    I’m not sure how to theoretically rule out the artificial dimensions. A posteriori they have low/zero mutual information with the objective function, so they shouldn’t be used over and over…
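
    One hedged sketch of that a-posteriori check: estimate the mutual information between each dimension and the objective from the points examined so far, and stop spending steps on dimensions whose estimate stays near zero. The binned estimator and the gate threshold below are assumptions, not part of the idea itself.

      import numpy as np

      def binned_mi(xs, ys, bins=16):
          """Crude plug-in mutual-information estimate from a 2-D histogram."""
          joint, _, _ = np.histogram2d(xs, ys, bins=bins)
          pxy = joint / joint.sum()
          px = pxy.sum(axis=1, keepdims=True)
          py = pxy.sum(axis=0, keepdims=True)
          nz = pxy > 0
          return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

      def useful_dimensions(examined_coords, examined_objective, gate=0.05):
          """examined_coords: (n_points, n_dims) array of points already examined."""
          mi = [binned_mi(examined_coords[:, d], examined_objective)
                for d in range(examined_coords.shape[1])]
          return [d for d, m in enumerate(mi) if m > gate]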

    • #119313

      josh

      In the artificial example, the objective function we care about becomes part of the “noise” subspace in the eigenanalysis of the expanded domain. Perhaps there is some practical way to exclude that sort of pathology in an incremental algorithm for building the indexes, etc.
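
      To see that in numbers, a small sketch with made-up data (the weights, noise level, and variable names are all assumptions): project the objective's weight vector onto the eigenvectors of the sample covariance and check where its loading lands.

        import numpy as np

        rng = np.random.default_rng(0)
        N, WIDE = 2000, 1e6
        x = rng.uniform(0, 1, N)                 # informative original dimension
        u = rng.uniform(-WIDE, WIDE, N)
        X = np.column_stack([x, u, -u + rng.normal(0, 1e-3, N)])   # expanded domain
        w = np.array([1.0, 1.0, 1.0])            # linear objective f = x + a + b, which is ~x here

        evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
        loading = (evecs.T @ (w / np.linalg.norm(w))) ** 2
        for lam, ld in zip(evals, loading):
            print(f"eigenvalue {lam:12.3g}   objective loading {ld:.3f}")
        # Most of the loading falls on the smallest-eigenvalue direction, i.e. the
        # objective sits in what the eigenanalysis treats as the "noise" subspace.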

      • #119314

        josh

        One possible counter-strategy is to look at the first pair of points for each dimension & consider the local slope of “live unfiltered points” as a gateway factor. If it’s not working together with the broad range, then we look at some other dimension where it is working.
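
        A small sketch of that gateway factor, under assumptions (the scoring formula is an illustration, not a definition): take the two live points nearest each dimension's promising end, and weight the local slope of the objective by the dimension's remaining width.

          def gateway_scores(live_points, objective, dims):
              """live_points: coordinate tuples still unfiltered; objective: point -> value.
              Returns {dimension: score}; a near-zero score says try another dimension."""
              scores = {}
              for d in range(dims):
                  ordered = sorted(live_points, key=lambda p: p[d])
                  if len(ordered) < 2:
                      scores[d] = 0.0
                      continue
                  p1, p2 = ordered[-1], ordered[-2]        # the pair at the promising (top) end
                  dx = p1[d] - p2[d]
                  slope = (objective(p1) - objective(p2)) / dx if dx else 0.0
                  width = ordered[-1][d] - ordered[0][d]   # the broad range of the dimension
                  scores[d] = abs(slope) * width           # slope and range working together
              return scores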

        • #119348

          josh

          At any given time, for each index dimension, we have a range in the tree that we consider for candidates & we designate which end of the range is most promising.

          Depending on the specific task & state of the search, each point examined is either definitely in, definitely out, or possibly in (e.g. in the case of a K-best or nearest-neighbor search).

          Each examined point where the decision is definite gives us some info to update our intervals of overall search uncertainty, and we move the boundary past that point. Finding positive examples is much more essential to our task. In a pruning phase, we continue to shrink the best boundary along each dimension that shows a definitively positive example.
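
          One way to read that pruning phase in code, with assumed structures (each dimension's index as a list of point ids sorted by that coordinate, a cursor sitting at its most promising end, and a classify function returning 'in', 'out', or 'maybe'):

            def prune_phase(indexes, cursors, classify, collated):
                """Shrink each dimension's promising boundary while it keeps yielding
                definite positives; stop that dimension at the first non-positive point."""
                for d, index in enumerate(indexes):
                    while cursors[d] >= 0:
                        pid = index[cursors[d]]
                        if classify(pid) == 'in':
                            collated.add(pid)     # definite positive: collate & keep shrinking
                            cursors[d] -= 1
                        else:
                            break                 # definite negative or uncertain: leave boundary here
                return cursors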

          At the conclusion of the prune phase, we have some mix of uncertain points & dimensions where the last examined value was negative. We proceed to look at the top 2 points on each dimension and rank-order the dimensions by which step produces the greatest reduction in the potential range of objective function values remaining in our hypercube – the reduction that would occur after removal of the set of outside points that take the same value for that dimension (note that the B-tree structure ensures this is an O(log N) op in the worst case).

          Within this border set we may find n1, n2, & n3 positive, negative, & uncertain example points. In general, at any given time, we have already discovered k1 positive examples and k2 uncertain examples, and we only need to retain the best K-k1 uncertain candidates for our overall search.

          However, we lose our algorithmic efficiency if we need to examine O(cN) values in a given jump due to value lumps. So we must modify the description above to limit the candidate region to at most 1 slot jump at the level of a B-tree node. In a pathological case where no dimension varies after a 1-slot jump, we pick the dimension with the best starting value and process its first slot. In the worst case we then see up to O(log N) uncertain candidates. Now either our choice of K was pathological, or we gain info about the remaining range of objective function values relevant to our search, updating the edges of the data cube accordingly.

          Proceeding iteratively this way, we guarantee that each round shrinks both our range of uncertainty about the potential objective function values available and our data search cube. Is the info gain fast enough for our target efficiency?
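
          A hedged sketch of that ranking step, with simplified stand-ins: each dimension's index is a plain sorted list of (coordinate, point_id) pairs, the objective is linear with positive weights, and "one B-tree slot" is approximated by one run of equal coordinate values at the boundary. None of the names come from the posts above.

            def slot_at_boundary(index, cursor):
                """Positions of the boundary run of equal coordinate values: the at-most
                one-slot jump candidate region for this dimension."""
                value = index[cursor][0]
                start = cursor
                while start > 0 and index[start - 1][0] == value:
                    start -= 1
                return list(range(start, cursor + 1))

            def choose_dimension(indexes, cursors, lows, weights):
                """Rank dimensions by how much removing their boundary slot would shrink the
                potential objective range of the remaining data cube; return the best one."""
                best_d, best_reduction = None, -1.0
                for d, index in enumerate(indexes):
                    if cursors[d] < 0:
                        continue
                    slot = slot_at_boundary(index, cursors[d])
                    new_top = index[slot[0] - 1][0] if slot[0] > 0 else lows[d]
                    # This dimension's contribution to the shrinkage of the objective range:
                    reduction = weights[d] * (index[cursors[d]][0] - new_top)
                    if reduction > best_reduction:
                        best_d, best_reduction = d, reduction
                # If every reduction is zero (the value-lump pathology), this falls back to
                # the first live dimension; the text suggests using the best starting value.
                return best_d, best_reduction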
