Network Data Compression With Agile Codebooks


This topic contains 5 replies, has 1 voice, and was last updated by  Josh Stern March 1, 2023 at 5:36 am.

  • #126287

    Josh Stern
    Moderator

    Q: Considering that a client might cache its useful codebooks between different sessions with a “site” or a “network infrastructure provider”, should there be a meta-system for describing codebook naming and retention?

    A: Yes, it’s a good point. The naming system should consider which naming features will be the most stable. Using ICANN names, as Java does, feels dated/wrong now. But what then? Perhaps the coding itself can define an assembly language strong enough to do the job, with canonical protocol-oriented strings transcoded to hexadecimal.

  • #126288

    Josh Stern
    Moderator

    Q: How would an assembly language for tree construction help?

    A: A couple of ideas.

    1) Rather than just allowing leaf-content substitution in a code, we can also provide an unambiguous language for constructing an entire code tree with contents, plus a special built-in code for transmitting complete codebooks. This helps with the case of frequent usage and an optimal set that’s far from the starting point.

    2) Rather than using something based on a certificate system to uniquely identify namespaces and items within namespaces, we can associate the codebook itself with an unambiguous description of how to construct it, and then use a cryptographic hash of that description as the token for deciding whether or not we are talking about a known codebook.

    3) Our efficiency does not depend much on preventing alternative ways of building a functionally identical codebook, so 2) could be associated with both one method that starts from scratch and a shorter method that starts from commonly stored dictionaries the client is likely to have. The server would recognize two or more different hashes for the same codebook.
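    A minimal sketch of idea 2), assuming a textual construction language (the BUILD-TREE/FROM-DICT description strings, the dictionary name, and the alias values below are hypothetical, not part of any proposal here):

```python
import hashlib

def codebook_token(description: str) -> str:
    # The digest of the canonical construction description, not a registered
    # name, identifies the codebook.
    return hashlib.sha256(description.encode("utf-8")).hexdigest()

# Two functionally equivalent constructions hash to different tokens, as in
# point 3); a server can keep an alias table mapping several tokens to one
# stored codebook.
full_build  = codebook_token("BUILD-TREE leaf a=0 leaf b=10 leaf c=11")
delta_build = codebook_token("FROM-DICT common-en-v1 PATCH leaf c=11")
aliases = {full_build: "book-42", delta_build: "book-42"}
```

    The design choice here is that naming stability comes for free: the token is stable exactly as long as the construction description is byte-identical, with no central registry involved.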

  • #126289

    Josh Stern
    Moderator

    A review of the literature suggests that supporting Adaptive Arithmetic Coding is a smart place to start. On top of that base, though, there are a lot of choices for how to get high performance when adapting to streaming rather than long files.

    Some issues to explore:

    a) Good default codebooks. Consider generating a test database that can mix streams of different data types in a parametrically specified way: popular streaming content, MPEG-4, web pages, email, IM, compressed binaries, compressed source, databases, audio, etc. Random mixtures can serve as a testbed for comparing different schemes.

    b) Add special codes for meta-commands, such as replacing a codeword with something new.

    c) Add repetition of a character X as a special form, if it helps.

    d) How do we efficiently describe the meta-command for replacing a code entry? Should it carry just one edit, or transmit a batch of edits together?

    e) How do we efficiently describe a new “named” codebook?
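    One way to sketch question d) under the batching option: a single reserved escape symbol plus a count amortizes the meta-command overhead over several replacements. The ESC_REPLACE value and the byte-level framing below are assumptions for illustration, not a settled format:

```python
ESC_REPLACE = 0xFF  # hypothetical reserved escape value

def encode_edit_batch(edits):
    """edits: list of (entry_index, new_bytes). Returns one framed batch."""
    out = bytearray([ESC_REPLACE, len(edits)])
    for index, payload in edits:
        # Each edit: which codebook entry to replace, then its new contents.
        out += bytes([index, len(payload)]) + payload
    return bytes(out)

def decode_edit_batch(data):
    assert data[0] == ESC_REPLACE
    count, pos, edits = data[1], 2, []
    for _ in range(count):
        index, length = data[pos], data[pos + 1]
        edits.append((index, data[pos + 2: pos + 2 + length]))
        pos += 2 + length
    return edits
```

    Batching means the fixed escape cost is paid once per burst of adaptation rather than once per changed entry, which matters most when the optimal codebook drifts quickly.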

    Edit: The Wikipedia article mentions the issue of ragged bit-code endings within bytes. Bit operations are slow compared to byte operations in most implementations. Since we are filling in a long sequence of bits most of the time, it might be best to “table” pairs of <1st part of byte, 2nd part of byte> for reading and writing as bytes (and words read as words).
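    A small sketch of the byte-at-a-time idea: refill a shift register one whole byte at a time so the hot loop does byte loads plus shifts rather than per-bit operations. A precomputed table over (state, byte) pairs, as suggested above, would go one step further; this just shows the refill structure:

```python
def bit_reader(data: bytes):
    """Return a read(n) function yielding the next n bits, MSB-first."""
    buf, nbits, pos = 0, 0, 0
    def read(n):
        nonlocal buf, nbits, pos
        while nbits < n:                    # refill one byte at a time
            buf = (buf << 8) | data[pos]
            pos += 1
            nbits += 8
        nbits -= n
        return (buf >> nbits) & ((1 << n) - 1)
    return read

read = bit_reader(bytes([0b10110100, 0b01000000]))
# read(3) -> 0b101, then read(5) -> 0b10100: codes cross no byte boundary
# here, but the same loop handles ragged ends that do.
```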

    There are three different factors that lead to ragged-bit waste: the first is restricting codes to byte alignment; the second is keeping the number of bits describing the interval fixed at a given length; and the third is mapping only codes of token length 1 to terminals. My recommendation is to allow the meta-command to set the upcoming bit length to any value in an allowable range. Switching lengths is similar to switching codebooks but simpler, and a given codebook can have sections with different bit lengths. I would also allow some codes to use multiple tokens. This doesn’t require the other changes, but together they make it more efficient, especially in the adaptive context.

  • #126291

    Josh Stern
    Moderator

    The math model for adaptively optimizing codebooks presents opportunities for collaborative applied research with those who work on the math of portfolio theory for investments. Why is that so?

    First, note that at any given time the network sender will have a lookahead buffer of known data to compress, and a probability density estimate for what the work after that buffer/time might look like. The set of decisions to make under uncertainty is similar to decisions about what to add or trade in a portfolio, under the following mapping:

    Different classes of users with known or unknown tendencies and preferences are similar to different areas/classes of investment that a fund or wealth manager will consider; it could be the entire universe or a mixture that focuses mainly on a narrow subset.

    Trading/slippage costs are a similar concern in both domains.

    The decision to make something like a website icon part of a codebook is analogous to a choice with high volatility: it can pay off with a “home run” or not, depending on whether the icon remains in place and that codebook is available for future encounters.

    Uncertainty about what buffer length to expect and when it might switch to a different channel is like uncertainty models of investment horizon and other client behavior.

    Maximizing the expected overall throughput rate is arguably similar to estimating the discounted present value of the different elements of an investment portfolio, with the objective of maximizing the overall portfolio value.

  • #126293

    Josh Stern
    Moderator

    When all bits are used, regardless of byte alignment, the arithmetic coding schemes could also be reworked as “use prefix codes up to some longest given length, then fill in the rest with the bracket scheme”. Is that better than transmitting two copies of the prefix?

    This paper by Junho Cho offers some tools for constructing codes with given shapes from probability targets:

    https://arxiv.org/pdf/1810.02411.pdf

    Note – if you design, for instance, a special code meaning “the next prefix code only comes from this other codebook”, and that symbol gets k bits, the effective probability weight assigned to the codes in the other book is (1/2^k) times their own weights. So one can optimize over multiple books simultaneously with k as a variable.
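    The weighting argument can be checked numerically. This sketch (the function name and framing are my own) computes the expected bits per symbol spent when routing through a k-bit escape into a side book whose symbols have the given conditional probabilities, using the ideal -log2(p) code length:

```python
import math

def expected_bits(side_weights, k):
    # Entering the side book costs k bits; inside it, an ideal code assigns
    # -log2(p) bits to a symbol of conditional probability p. Equivalently,
    # each side-book symbol's effective weight is (1/2**k) * p.
    return k + sum(-p * math.log2(p) for p in side_weights if p > 0)
```

    Sweeping k with this function shows the trade-off: a shorter escape favors frequent side-book use, while a longer escape leaves more code space for the main book.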
