Generalizing Structure From Motion As Causal Sensing
March 2, 2023 at 6:48 pm #126295

Josh Stern
Q: How can you learn that? Surely it can’t be completely non-parametric…
A: We were recently focused on the point that even non-parametric density estimation comes with some form of constraints: we have to somehow decide on a small number of dimensions that are relevant and usually some kind of neighborhood structure/topology within those dimensions.
In the generalized causal inference problem we similarly have some kinds of constraints on which sensor readings might clump as a thing, which conditions make a scene, which things and their dynamics make an event, and which transformations are candidates to learn. The structure from motion problem, for example, usually assumes rigid, light-reflecting objects and limited adjustments to viewer angle. Different structuring assumptions lead to different problems.
Edit: Consider also the phenomenon of illusory contours that perceptual psychologists emphasize, because it is very fundamental: under those conditions, the human eye is vigorously imposing a model-theoretic view of the visual stimulus that even resists knowledge about how it is objectively “wrong”. In that case, there is a part of the decision acceptance/model choice that was hard-wired into our trained perceptual system. Under normal viewing conditions, our system imposes “clarity” on the view of objects and such which goes way beyond the physical demands of what could be generating the radiation on our retina. We impose a model-theoretic guess about the input.
Example sub-problem: There is a dark patch on something that could be an object. How does the sensor system decide if it is shadow, or dark paint, or a notch cut out, or multiple specialty lights combining from side positions? The answer is a combination of best fit using a model & many sensors and learned transformations involving lighting. These answers are computationally complex and expensive. The human brain comes with a lot of hardware capacity. Don’t think twice about using models that include features which are *absent* from the sensor battery. The combination of present, absent, and unclear is important.
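A minimal sketch of that sub-problem, with hypothetical cue names and invented likelihoods rather than anything from a real sensor model: score a few competing hypotheses for the dark patch against other nearby readings and pick the best joint fit.

```python
import numpy as np

# Hypothetical sub-problem: a dark patch could be shadow, dark paint, a cut-out
# notch, or side lighting. Each hypothesis predicts what other cues should look
# like; we score them jointly. All cue names and numbers below are invented.

HYPOTHESES = ["shadow", "dark_paint", "notch", "side_lighting"]

EXPECTED_CUES = {
    "shadow":        {"edge_softness": 0.8, "depth_step": 0.1, "color_shift": 0.2},
    "dark_paint":    {"edge_softness": 0.1, "depth_step": 0.1, "color_shift": 0.9},
    "notch":         {"edge_softness": 0.3, "depth_step": 0.9, "color_shift": 0.2},
    "side_lighting": {"edge_softness": 0.6, "depth_step": 0.1, "color_shift": 0.3},
}

def log_likelihood(hypothesis, readings, noise=0.1):
    """Gaussian penalty for mismatch between predicted and observed cues."""
    expected = EXPECTED_CUES[hypothesis]
    return -sum((readings[k] - v) ** 2 for k, v in expected.items()) / (2 * noise ** 2)

def posterior(readings):
    """Normalize per-hypothesis likelihoods into a probability assessment."""
    logp = np.array([log_likelihood(h, readings) for h in HYPOTHESES])
    p = np.exp(logp - logp.max())
    return dict(zip(HYPOTHESES, p / p.sum()))

# Soft edges, no depth step, little color shift: best explained as a shadow.
print(posterior({"edge_softness": 0.75, "depth_step": 0.05, "color_shift": 0.15}))
```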
Another Thought Example: At a quick glance, you notice a set of colored pipes, and chutes & wheels that seem to be an example of a playful “mouse trap” that typically involves a long, special tour for a rolling ball. You walk closer and only later notice that some of the pipes are not connected, some open out to nothing, and some are suspended in isolation from the ceiling, hanging by thin wire. Surprise!
March 2, 2023 at 7:53 pm #126296

Josh Stern
Q: Why is causality key?
A: The general domain is sensing of physical phenomena which may undergo physical transformations. The transformations exist in intervals of time. Observations of the transformations exist in intervals of time. Some transformations are reversible and some are not. In many cases we have some opportunity to perceive part of a cause – e.g. pressing the remote control button. Examples like a banana turning from green to yellow to mottled to black without seeing the “cause” are less common in our environment and in our explanatory diagnoses.
In all of these little events, many sensor readings transform in parallel. Identifying the structural coherence of those changes, and finding the dimensional subspaces they occupy, is key to learning causes and to recognizing which coincidences signal their effects.
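One concrete way to look for that coherence, sketched here with synthetic data (the sizes and the use of an SVD are my choices, not anything prescribed above): stack the per-timestep change vectors of the whole sensor battery during an event and see how few directions carry most of the coordinated variation.

```python
import numpy as np

# Synthetic example: 3 hidden causal factors drive 50 sensors over 200 steps.
rng = np.random.default_rng(0)
T, S = 200, 50
latent = rng.normal(size=(T, 3))               # hidden factors of the event
mixing = rng.normal(size=(3, S))               # how the factors drive the sensors
readings = latent @ mixing + 0.1 * rng.normal(size=(T, S))

# Per-step change of every sensor in parallel, centered, then an SVD to find
# the few directions that carry the coordinated variation.
deltas = np.diff(readings, axis=0)
deltas -= deltas.mean(axis=0)
_, singular_values, components = np.linalg.svd(deltas, full_matrices=False)

explained = singular_values**2 / (singular_values**2).sum()
print("variance explained by top 3 directions:", explained[:3].sum())
# The top rows of `components` span the sensor subspace in which this event's
# coordinated changes live; readings outside it look like noise.
```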
March 2, 2023 at 11:43 pm #126297

Josh Stern
Good response to this topic, so more from my skimpy notes:
Using vision as the canonical example: make some generalization of the Marrian concept of a set of primal sketches – it is really different positions in a connected, high-dimensional manifold that varies with what is contributing to a particular event/video-frame timepoint in a given scene, with generalizations of the scene components to other members of overarching families/categories, and with different levels of inferential elaboration.
Consider a “landmark” that is a candidate to be attached to some object or thing family. Unusual image features are super low level. Unusual shape features are at a higher level.
In a given “map”, is the landmark:
Sensed as present
Not sensed but inferred as present
Not sensed but inferred as absent
Not sensed and status unknown

Which elaborations move closer to completing acceptable models while explaining the low-level sensor readings? What is the minimal-complexity set of causes in the frame & the event that explain the overall sensor readings?
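A small sketch of those four statuses plus a toy score for ranking elaborations; the class names and the complexity penalty are hypothetical, just to make the bookkeeping concrete.

```python
from dataclasses import dataclass
from enum import Enum, auto

class LandmarkStatus(Enum):
    SENSED_PRESENT = auto()
    INFERRED_PRESENT = auto()    # not sensed, but the accepted model demands it
    INFERRED_ABSENT = auto()     # not sensed, and the model excludes it
    UNKNOWN = auto()             # not sensed, status left open

@dataclass
class Landmark:
    name: str
    status: LandmarkStatus

@dataclass
class CandidateMap:
    landmarks: list[Landmark]
    posited_causes: int          # objects/lights the model hypothesizes
    readings_explained: float    # fraction of low-level readings accounted for

def elaboration_score(m: CandidateMap, complexity_penalty: float = 0.05) -> float:
    """Prefer maps that explain more readings and resolve more landmarks
    while positing fewer causes (a toy stand-in for minimal complexity)."""
    resolved = sum(lm.status is not LandmarkStatus.UNKNOWN for lm in m.landmarks)
    return m.readings_explained + 0.01 * resolved - complexity_penalty * m.posited_causes

m = CandidateMap(
    landmarks=[Landmark("wheel_rim", LandmarkStatus.SENSED_PRESENT),
               Landmark("axle", LandmarkStatus.INFERRED_PRESENT),
               Landmark("far_face", LandmarkStatus.UNKNOWN)],
    posited_causes=2,
    readings_explained=0.85)
print(elaboration_score(m))
```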
It’s a giant search attached to a giant learning problem. Estimation rules/decisions are about making predictions and either holding them or conducting further investigation. Loss is highest for wrong model acceptance in the cases with the greatest utility at stake (is what I thought I saw out of the corner of my eye consequential?).
In general, things we call everyday objects exhibit a lot more rigidity in 3D space than non-object things (if those even occupy 3D space). Things we notice as parts of objects, like “forearm”, exhibit even more rigidity than the entire human body. Matte surfaces exhibit more rigidity of light highlights than specular surfaces. Paintings exhibit more regularity of paint splotches than drop cloths. And so forth.
Some causes, like flipping a light switch, happen in an instant. Others, like the sun going down or rotating an object in light, happen more slowly, so we see progressions. In all those cases, relating causes to event sequences of specific durations that match the “cause” is important to perceptual modeling.
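A rough sketch, with invented data, of relating a perceived cause to the duration and lag of the sensor changes it produces: cross-correlate a cause indicator with the total per-step sensor change, so an instantaneous cause shows a sharp peak at a short lag while a slow cause shows a broad, delayed profile.

```python
import numpy as np

def cause_response_profile(cause, readings, max_lag=50):
    """cause: (T,) 0/1 indicator of the perceived cause; readings: (T, S) sensors.
    Returns a normalized correlation between the cause and the total per-step
    sensor change at each lag; the peak's position and width characterize how
    long the effect takes to play out."""
    change = np.abs(np.diff(readings, axis=0)).sum(axis=1)
    c = cause[:-1] - cause[:-1].mean()
    ch = change - change.mean()
    lags = np.arange(max_lag)
    profile = np.array([np.dot(c[: len(c) - k], ch[k:]) for k in lags])
    return lags, profile / (np.linalg.norm(c) * np.linalg.norm(ch) + 1e-12)

# Invented demo: a "switch press" at t=50 followed by a short ramp in the sensors.
T = 300
cause = (np.arange(T) == 50).astype(float)
readings = np.zeros((T, 5))
readings[55:70] += np.linspace(0.0, 1.0, 15)[:, None]
lags, profile = cause_response_profile(cause, readings)
print("peak lag:", lags[np.argmax(profile)])
```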
March 3, 2023 at 9:47 am #126298

Josh Stern
The conditions in which SFM is known to succeed give a clue about what is to be learned in a particular model for small transformations & about stringing a set of those together to make larger maps.
For purposes of engineering solutions to new target problems, the learning search can be greatly aided by creating VR simulations with sensing simulations that are near the conditions of the actual target. The VR includes mixtures of all the different factors which can play out in events in a sensing scene. The VR contains the advantages of full labeling of correct answers, and the ability to generate infinitely large data samples. Pattern recognition is mapped too – it’s like looking at a set of selected snapshots from different VR runs and making a judgment. The full virtual “PR” would involve judgments about all the contributing factors – their mixtures and their poses. In typical problems, the statistical tabulating learning doesn’t have any access to that data.
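A toy sketch of such a VR-style generator; the factor names, poses, and the stand-in "snapshot" are all hypothetical, but it illustrates the key advantage noted above: every contributing factor is recorded as ground truth, and labeled samples can be drawn without limit.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneLabel:
    objects: list        # (name, (distance, rotation)) pairs - full ground truth
    lighting: str
    occluded: list       # names of objects partially hidden in this snapshot

@dataclass
class VRSample:
    snapshot: list       # stand-in for rendered pixels / simulated sensor readings
    label: SceneLabel

def sample_scene(rng: random.Random) -> VRSample:
    """Draw one fully labeled scene; every contributing factor is recorded."""
    names = rng.sample(["cup", "pipe", "wheel", "ball"], k=rng.randint(1, 3))
    objects = [(n, (rng.uniform(0.5, 5.0), rng.uniform(0.0, 360.0))) for n in names]
    lighting = rng.choice(["overhead", "side", "dim"])
    occluded = [n for n, _ in objects if rng.random() < 0.3]
    # Deterministic toy "rendering" of the factors into a 16-value snapshot.
    snapshot = [hash((tuple(objects), lighting, i)) % 256 for i in range(16)]
    return VRSample(snapshot=snapshot, label=SceneLabel(objects, lighting, occluded))

# Arbitrarily large, fully labeled training data on demand.
rng = random.Random(0)
dataset = [sample_scene(rng) for _ in range(1000)]
print(dataset[0].label)
```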
Edit: In the “full” PR (snapshot => scene labeling; extra credit, hard problem), the labeling would include labels that are vectors with components that are discrete, pseudo-continuous, and categories that are essentially probability densities over mixed data like that – possibly ad hoc, or possibly containing explanatory unseen features like DNA or “Etsy Q. Smith, the NN designer”.
When performance on the VR problems is satisfactory, then it can be extended to the nearby real world data trials, noting what is different between the VR and the real data and trying to bridge those gaps.
Edit: In general, assigning labels and making predictions should be accompanied by a confidence estimate for the probability of being correct. The confidence estimate can be optimized by scoring rules where training labels are available. These probabilities can be important for practical decision making about what step to take next, consciously or unconsciously. The phenomenological human experience of “good vision” is that when lighting is acceptable, we can quickly determine the essential reality of what we are looking at, or only infrequently have a sense of uncertainty. Being completely wrong is relatively rare.
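A minimal sketch of scoring confidences with a proper scoring rule (the Brier score here) plus a crude accept-or-investigate rule; the costs in the second function are invented for illustration.

```python
import numpy as np

def brier_score(confidences, labels):
    """Proper scoring rule: mean squared gap between predicted P(correct)
    and the 0/1 training label; lower means better-calibrated confidence."""
    confidences = np.asarray(confidences, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((confidences - labels) ** 2))

def act_or_investigate(confidence, cost_of_error, cost_of_another_look):
    """Crude next-step rule: accept the current interpretation only when the
    expected loss of being wrong is below the cost of investigating further."""
    return "accept" if (1.0 - confidence) * cost_of_error < cost_of_another_look else "investigate"

print(brier_score([0.9, 0.8, 0.3], [1, 1, 0]))
print(act_or_investigate(0.95, cost_of_error=10.0, cost_of_another_look=1.0))
```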
March 3, 2023 at 7:35 pm #126299

Josh Stern
We can make lists of many different factors that can play a role in image formation at a given location:
surface color and variation and pattern/texture,
lighting sources,
physical texture,
reflectance – various properties
surface orientation
blurring due to distance, smoke, fog, etc
depth perception from various sources
cracks/creases

In different ways, these can be organized as dimensions with mutually exclusive subsets/values. In a given region, detecting the presence or absence of various values for various dimensions will influence the probability assessment of other values at the same location. The space of all possible label assignments is too large to iterate, even numerically. Where there is a race to acquire an interpretation that is sufficiently probable and sufficiently complete to generate a decision for acceptance, this may be based on focusing first on labeling that supports completeness & is assigned relatively high confidence for specific conclusions, because it is unlikely to be reset to a value different from the current MLE. Backtracking may happen, but the speed of operations suggests that it is relatively infrequent compared to the number of assignments to be cast.
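A toy sketch of that race, with invented dimensions and scores: greedily commit the single most confident remaining assignment and recondition the rest on what has been committed; backtracking is left out, matching the observation that it should be infrequent.

```python
# Invented candidate scores for a few mutually exclusive dimensions at one
# location; real scores would come from learned models conditioned on the
# local sensor readings.
candidates = {
    "surface_color": {"dark_paint": 0.55, "shadowed_white": 0.40, "other": 0.05},
    "lighting":      {"overhead": 0.60, "side": 0.30, "dim": 0.10},
    "reflectance":   {"matte": 0.70, "specular": 0.30},
}

def condition(scores, committed):
    """Toy re-weighting: committing 'shadowed_white' makes 'side' lighting likelier."""
    adjusted = dict(scores)
    if committed.get("surface_color") == "shadowed_white" and "side" in adjusted:
        adjusted["side"] *= 1.5
    total = sum(adjusted.values())
    return {k: v / total for k, v in adjusted.items()}

committed = {}
remaining = dict(candidates)
while remaining:
    # Commit the dimension whose best value is currently the most confident.
    dim = max(remaining, key=lambda d: max(condition(remaining[d], committed).values()))
    scores = condition(remaining[dim], committed)
    value, confidence = max(scores.items(), key=lambda kv: kv[1])
    committed[dim] = value
    del remaining[dim]
print(committed)
```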
Edit: An attempt at special statistical tests that work quickly.
In time we have m snapshots of scenes S_1, …, S_m.
We have previously created D dimensions of subspace that we use to describe basic proximity of points to one another in each scene.
Within S_i we have identified candidate landmarks L1i, …, Lki and cluster regions R1i, …, Rqi.
From some set of the S_i of size at least 2, we examine mappings that inject some elements of Li and candidate-match elements of Lj, and points from regions Rsi and Rti into some Rsj and Rtj. The statistical point is about non-chance similarity of the point matches, especially landmarks, and the region matches. Some particular mapping transformation is much better than others and lets us build correct partial maps for elements of each scene and maps of all the scenes to latent models. This process may involve many candidate tries for isomorphisms of points, but the evidence should be relatively strong with only a limited number of points – so that many non-chance relationships are mostly preserved.
One possible version would compare the distribution for groups of 4 points that are correctly mapped with that for 4 randomly chosen points – i.e. a matching pair from each of 2 frames, with matching, model-based transformations applied to each point. The result is 4 internal comparisons with a specific signature structure.
There may also be sampling models based on exploring populated regions of the space in each frame.
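A sketch of the 4-point comparison above, under the simplifying assumption of a rigid (distance-preserving) transformation between frames; distance preservation stands in for the "specific signature structure", and candidate-matched groups are contrasted with randomly paired groups as a quick non-chance test.

```python
import numpy as np

rng = np.random.default_rng(1)

def pairwise_distance_error(pts_a, pts_b):
    """Sum of |d_ij(A) - d_ij(B)| over the 6 pairs in a 4-point group; near zero
    when the mapping between frames preserves the group's internal structure."""
    err = 0.0
    for i in range(4):
        for j in range(i + 1, 4):
            err += abs(np.linalg.norm(pts_a[i] - pts_a[j]) -
                       np.linalg.norm(pts_b[i] - pts_b[j]))
    return err

def match_statistic(frame1, frame2, matches, n_groups=200):
    """frame1, frame2: (N, D) point arrays; matches: list of (idx1, idx2) pairs.
    Compares matched 4-point groups against randomly paired groups."""
    true_errs, rand_errs = [], []
    for _ in range(n_groups):
        group = rng.choice(len(matches), size=4, replace=False)
        a = np.array([frame1[matches[g][0]] for g in group])
        b = np.array([frame2[matches[g][1]] for g in group])
        true_errs.append(pairwise_distance_error(a, b))
        b_rand = frame2[rng.choice(len(frame2), size=4, replace=False)]
        rand_errs.append(pairwise_distance_error(a, b_rand))
    # A large positive gap signals non-chance structure in the proposed matches.
    return float(np.mean(rand_errs) - np.mean(true_errs))

# Toy demo: frame2 is frame1 rotated by 30 degrees, so correct matches preserve
# all pairwise distances while random pairings do not.
theta = np.deg2rad(30.0)
R = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
frame1 = rng.normal(size=(30, 2))
frame2 = frame1 @ R.T
print(match_statistic(frame1, frame2, [(i, i) for i in range(30)]))
```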