Engineering of IC Speech Disc Jockeying?


This topic contains 5 replies, has 1 voice, and was last updated by josh January 19, 2022 at 1:20 am.

  • #109405

    josh

    II & III are new areas of research, so new progress there might come very quickly. I, by contrast, has been studied for a long time & is hard. In the past, the best progress on it often came from approaches that were less theoretical. Why?

    The dimensionality of the performance space is very, very large. We recognize an enormous variety of voices as “natural human voice” (vs. an artificial robotic sound), and we recognize a voice that sounds like a particular individual under a wide range of performance conditions. We hold a wide range of opinions about pleasant/euphonic vs. unpleasant voices. Any given voice can modify itself to convey different tempos, stresses, emotions, pitches, rhythms, etc. Any given voice can try to modify its accent to sound like another speaker who mimicked the vocal style of a different world region. In the micro-analysis of the signals, “coarticulation” – the way that sound formation varies according to the temporal sequence of phonemic sounds being uttered – turns out to be a very large influence.

    As a result of those factors, the best-performing text-to-speech software often worked by storing vast libraries of coarticulation/style-indexed speech bits from a particular, real voice, & stitching them together in real time rather than performing an entirely synthetic composition.
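    A minimal sketch of that unit-selection idea, purely illustrative – the data structures & names here are ones I’m making up, not any real TTS package’s API:

    ```python
    # Hypothetical unit-selection sketch: recorded snippets indexed by phoneme context.
    from dataclasses import dataclass

    import numpy as np


    @dataclass
    class Unit:
        phoneme: str
        left: str            # preceding phoneme, to capture coarticulation
        right: str           # following phoneme
        samples: np.ndarray  # raw audio for this snippet


    def context_cost(unit: Unit, phoneme: str, left: str, right: str) -> int:
        """Crude mismatch cost: 0 when the stored coarticulation context matches exactly."""
        if unit.phoneme != phoneme:
            return 1_000_000  # wrong phoneme: effectively unusable
        return (unit.left != left) + (unit.right != right)


    def synthesize(library: list[Unit], phonemes: list[str]) -> np.ndarray:
        """Pick the best-matching unit for each phoneme in context and concatenate."""
        padded = ["sil"] + phonemes + ["sil"]
        chosen = []
        for i, ph in enumerate(phonemes, start=1):
            best = min(library, key=lambda u: context_cost(u, ph, padded[i - 1], padded[i + 1]))
            chosen.append(best.samples)
        return np.concatenate(chosen) if chosen else np.zeros(0)
    ```

    A real system would also add a join cost between adjacent units and search the whole sequence (e.g. with Viterbi), but the point is the same: index by coarticulation context, then stitch.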

    Can modern ML help? How would you apply it? I’d approach it like this:
    Let’s agree that for our engineering purposes, Phonology is an overly simplified model without enough control parameters. We need something else. Going back to basics, we pick a model that involves sets of expressive style tags plus a parameterized dynamic system representing the true complexity and variation of human vocal production. We plan to use an ML approach to learn the mappings [SPEECH ACT DESCRIPTIONS] <=> Configured DYNAMIC SYSTEM TRAJECTORIES <=> Digital Sound. We won’t necessarily worry about computing the latter mapping using physics. We’ll just view the dynamic system as a latent subspace representation that makes the learning problems tractable as we gather more & more performance data to learn/improve all 4 of these mappings (two directions across each of the two links). Learning performance is evaluated by the accuracy and quality of various “trips” between fully labeled pairings of input controls and digital recordings. Data from existing software that is considered very high quality can also help speed the learning if the additional control labels are applied.
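    To make that concrete, here’s a rough sketch of the two learned links, assuming PyTorch; every module name & dimension below is a placeholder invented for illustration, not a claim about how it must be built:

    ```python
    # Sketch of the two learned mappings; all sizes are placeholders.
    import torch
    import torch.nn as nn


    class ControlToTrajectory(nn.Module):
        """[SPEECH ACT DESCRIPTIONS] -> latent dynamic-system trajectory (steps x state_dim)."""
        def __init__(self, n_tags: int, state_dim: int = 32, steps: int = 100):
            super().__init__()
            self.steps, self.state_dim = steps, state_dim
            self.net = nn.Sequential(nn.Linear(n_tags, 128), nn.ReLU(),
                                     nn.Linear(128, steps * state_dim))

        def forward(self, tags: torch.Tensor) -> torch.Tensor:
            return self.net(tags).view(-1, self.steps, self.state_dim)


    class TrajectoryToAudio(nn.Module):
        """Latent trajectory -> audio frames; nothing forces this to be a physics simulation."""
        def __init__(self, state_dim: int = 32, frame_len: int = 256):
            super().__init__()
            self.net = nn.Linear(state_dim, frame_len)

        def forward(self, traj: torch.Tensor) -> torch.Tensor:
            return self.net(traj)  # (batch, steps, frame_len)


    def round_trip_loss(tags: torch.Tensor, frames: torch.Tensor,
                        c2t: ControlToTrajectory, t2a: TrajectoryToAudio) -> torch.Tensor:
        """Score one 'trip': reconstruct a labeled recording from its control labels."""
        return nn.functional.mse_loss(t2a(c2t(tags)), frames)
    ```

    The latent trajectory plays exactly the role described above: it doesn’t have to be physically computed, it just has to sit in the middle so that round trips between labeled controls and recordings can be scored.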

    • #109406

      josh

      As an example of a further decomposition strategy: pick an additional subspace mapping that takes the input control parameters to a target temporal decomposition – a sequence t1, t2, t3, … giving real time intervals plus the configuration of the dynamic system at each of those stopwatch points in time.
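      A minimal sketch of what that intermediate representation could look like (the names & the linear interpolation are my own placeholders, not a fixed design):

      ```python
      # Hypothetical temporal decomposition: control parameters -> [(t_i, config_i), ...]
      from dataclasses import dataclass

      import numpy as np


      @dataclass
      class Anchor:
          t: float              # stopwatch time in seconds
          config: np.ndarray    # dynamic-system configuration at that instant


      def decompose(durations: list[float], configs: list[np.ndarray]) -> list[Anchor]:
          """Turn per-segment durations and target configurations into absolute-time anchors."""
          anchors, clock = [], 0.0
          for d, c in zip(durations, configs):
              clock += d
              anchors.append(Anchor(t=clock, config=c))
          return anchors


      def config_at(anchors: list[Anchor], t: float) -> np.ndarray:
          """Linearly interpolate the system configuration at an arbitrary time t."""
          times = np.array([a.t for a in anchors])
          stack = np.stack([a.config for a in anchors])  # (N, state_dim)
          return np.array([np.interp(t, times, stack[:, k]) for k in range(stack.shape[1])])
      ```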

      • #109407

        josh

        Gathering data from dynamic scans of particular speakers’ mouths/tongues/diaphragms could help.

      • #109421

        josh

        We believe that the simple classification of consonants & vowels, or something close to it, is useful for describing the musical/prosodic contour of speech: note timings, amplitude dynamics (large scale & small scale), and the C vs. V pattern as a key basis for the prosody patterns to be planned.
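        A toy illustration of that idea – the phoneme inventory & timing numbers below are invented, just to show a C/V skeleton driving a timing plan:

        ```python
        # Toy prosody skeleton: reduce a phoneme string to a C/V pattern and assign note timings.
        VOWELS = set("aeiou")  # invented, minimal phoneme inventory


        def cv_skeleton(phonemes: str) -> str:
            return "".join("V" if p in VOWELS else "C" for p in phonemes)


        def plan_timings(skeleton: str, vowel_ms: float = 120.0, cons_ms: float = 60.0):
            """Vowels carry the 'notes' of the contour; consonants get shorter slots."""
            plan, clock = [], 0.0
            for symbol in skeleton:
                dur = vowel_ms if symbol == "V" else cons_ms
                plan.append((symbol, clock, clock + dur))
                clock += dur
            return plan


        print(plan_timings(cv_skeleton("banana")))  # [('C', 0.0, 60.0), ('V', 60.0, 180.0), ...]
        ```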

        • #109424

          josh

          The human vocal apparatus makes it easier to produce slightly higher frequencies of sound while articulating more quickly, & lower frequencies while articulating slowly. This effect is heard as “natural”. An open question is: where does high-quality & expressive speech increase or reduce pitch variation beyond the natural variation that comes from convenience at a given tempo?
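          One way to make that question measurable – a sketch with invented numbers, assuming you already have an F0 track and a local articulation-rate estimate: regress F0 on rate, call the fitted line the “natural” tempo-driven component, and look at where expressive speech pushes the residual beyond it.

          ```python
          # Sketch: split an F0 track into a tempo-predicted component and an expressive residual.
          import numpy as np


          def tempo_residual(f0_hz: np.ndarray, rate_syl_per_s: np.ndarray):
              """Fit f0 ~ a * rate + b; return (predicted 'natural' f0, expressive residual)."""
              a, b = np.polyfit(rate_syl_per_s, f0_hz, deg=1)
              natural = a * rate_syl_per_s + b
              return natural, f0_hz - natural


          # Toy usage with made-up numbers: faster stretches trend slightly higher in pitch.
          rate = np.array([3.0, 4.0, 5.0, 6.0, 5.0, 4.0])
          f0 = np.array([110.0, 115.0, 122.0, 140.0, 118.0, 112.0])
          natural, residual = tempo_residual(f0, rate)
          print(residual.round(1))  # large residuals mark pitch movement beyond the tempo trend
          ```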
