An Idea for Extreme Data Compression

Discussion over at Yours.org

I’ve got an idea for extreme data compression. It would allow for essentially arbitrarily large amounts of data to be transmitted or stored in a very small space – even within the OP_Return space on the Bitcoin Cash blockchain.

Disclaimer: I am a complete and utter novice. I don’t know anything, nor do I have any technical expertise. This idea popped into my head a few nights ago when thinking about how to send longer messages through memo.cash. However, I’ve run this idea past a few people I respect, and some parts look promising. It might have a catastrophic flaw or assumption that I’m missing – or it might already be in common use, which wouldn’t be surprising – but it’s good enough to share and risk looking silly. Since I don’t speak the technical language, I can only communicate the big-picture concepts without technical precision.

The general idea is this: compress data by assigning unique characters to unique patterns of data, then store or transmit the characters instead of the data. The characters can later be decompressed to reconstruct the original data. I call it “character compression.”

There are currently more than 130,000 unique characters in Unicode alone, which could be a substantial enough set for significant data compression.

So, let’s take a simple example. Let’s say I want to communicate the phrase, “Bitcoin Cash is peer-to-peer electronic cash.” That’s 45 individual characters. Now imagine I assign the following, using Armenian characters, which are 2 bytes each in UTF-8:

“Bitcoin Cash” = “Ի”

“is” = “Թ”

“peer-to-peer” = “Զ”

“electronic cash” = “Բ”

Call these four assignments a “compression key.” Now, I can communicate the phrase “Bitcoin Cash is peer-to-peer electronic cash” using only 4 characters – ԻԹԶԲ. If the receiver shares the same compression key, then he’ll be able to read the 4 symbols and fully reconstruct the phrase “Bitcoin Cash is peer-to-peer electronic cash.” Simple. In this case, the message goes from 45 bytes down to 8 – a more than 80% reduction in size.
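To make the round trip concrete, here’s a minimal Python sketch of that four-entry compression key (the byte counts assume UTF-8 encoding):

```python
# A minimal sketch of the example above: a shared "compression key"
# mapping phrases to single Unicode characters (Armenian letters here).
compression_key = {
    "Bitcoin Cash": "Ի",
    "is": "Թ",
    "peer-to-peer": "Զ",
    "electronic cash": "Բ",
}
decompression_key = {symbol: phrase for phrase, symbol in compression_key.items()}

message = "Bitcoin Cash is peer-to-peer electronic cash."
compressed = "ԻԹԶԲ"  # the four symbols standing in for the four phrases

# Reconstruct by replacing each symbol with its phrase, re-joined by spaces.
reconstructed = " ".join(decompression_key[ch] for ch in compressed) + "."

print(len(message.encode("utf-8")))     # 45 bytes of plain text
print(len(compressed.encode("utf-8")))  # 8 bytes: 4 Armenian letters at 2 bytes each
print(reconstructed == message)         # True
```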

So, I propose we make a public database for assigning specific characters to the most-used English words. Apparently, there are about 3,000 English words that make up 95% of English communication. To start, we could assign each of those 3,000 words its own character. We could use any of the characters not in use by English speakers, like Japanese kanji. “感” could mean anything. That way, we could communicate most English words simply by sharing the corresponding characters with each other. The fewer bytes taken up by the chosen character, the better.
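As a rough sketch of how such a table might be generated, the snippet below assigns each word of a hypothetical frequency-ordered word list (top_words.txt is a stand-in, most common word first) to consecutive characters from the CJK Unified Ideographs block, which English text never uses. Those characters cost 3 bytes each in UTF-8, so the savings come from replacing longer words – which is exactly why fewer bytes per character would be better:

```python
# Sketch: build a word-to-character key from a frequency-ordered word list.
# "top_words.txt" is a hypothetical file, one word per line, most common first.
CJK_START = 0x4E00  # start of the CJK Unified Ideographs block ("感" lives here)

with open("top_words.txt", encoding="utf-8") as f:
    top_words = [line.strip() for line in f if line.strip()]

word_to_char = {word: chr(CJK_START + i) for i, word in enumerate(top_words[:3000])}
char_to_word = {char: word for word, char in word_to_char.items()}

# If "the" happens to be the first word in the list, it compresses to "一" (U+4E00).
```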

Take, for example, the play Romeo and Juliet. It’s got about 25,000 individual words. If each word had its own character, we could compress Romeo and Juliet down to 25,000 characters. Still not small enough to fit inside a BCH transaction, but we can get much smaller.

Extreme compression

With 135,000 characters to use right now, and, say, 5,000 assigned to the words in Romeo and Juliet, that still leaves about 130,000 characters that we’re free to assign. Here’s where the biggest compression happens: what if we assigned unique characters to the most common patterns in English speech – multiple words at a time, spaces included? The words “give or take” are 12 characters together, but they come up all the time in written communication. What if we assigned “give or take” its own character?

You could have one list of, say, the top 100,000 most common English phrases (each a few words long), and compress each one down to a single character. Communicate the character – or put it inside the blockchain – and then somebody else with the same compression key could reconstruct the original data.
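Here’s a sketch of how an encoder might apply such a phrase list, trying longer phrases before shorter ones so the biggest match always wins (the three-entry phrase key is just a stand-in for the imagined 100,000-entry table):

```python
import re

# Hypothetical phrase key: each common multi-word pattern gets one character.
phrase_key = {
    "give or take": "一",
    "peer-to-peer": "丁",
    "electronic cash": "七",
}
reverse_key = {char: phrase for phrase, char in phrase_key.items()}

# Sort longer phrases first so the biggest match always wins.
pattern = re.compile("|".join(re.escape(p) for p in sorted(phrase_key, key=len, reverse=True)))

def compress(text: str) -> str:
    return pattern.sub(lambda m: phrase_key[m.group(0)], text)

def decompress(text: str) -> str:
    return "".join(reverse_key.get(ch, ch) for ch in text)

msg = "It will take a week, give or take, to send electronic cash peer-to-peer."
assert decompress(compress(msg)) == msg
```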

In other words, you could essentially hack the character limit on platforms like Twitter or memo.cash. With Twitter, you might only be able to use 140 characters, or with memo.cash, you might only be able to send 80 bytes’ worth of compressed characters per transaction, but those characters would correspond to a much larger amount of information reconstructed on the receiver’s computer. The reduction in bandwidth would be enormous.

Even common phrases made of short words could be reduced by 50% or more. Take these random phrases:

“And then,”

“To be”

“To the”

“The other”

“A different”

“Not again,” etc.

All of these could be compressed by more than 50% if each had its own 2-byte character, as the quick calculation below shows. The longer the phrase or the bigger the words, the more compression.
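A quick back-of-the-envelope check on those savings, assuming each phrase is replaced by a single 2-byte character like the Armenian letters above:

```python
# Rough savings check: each phrase replaced by one 2-byte UTF-8 character.
phrases = ["And then,", "To be", "To the", "The other", "A different", "Not again,"]
for phrase in phrases:
    original = len(phrase.encode("utf-8"))
    saving = 1 - 2 / original
    print(f"{phrase!r}: {original} bytes -> 2 bytes ({saving:.0%} smaller)")

# "To be" is the worst case here: 5 bytes -> 2 bytes, still a 60% reduction.
```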

(Also, in the long run, why stop at 135,000 characters? Why couldn’t we come up with 10,000,000 individual characters that we could assign to unique and more complex pieces of data, to compress things even further? We could create new 1-byte characters that are not in use in any language, for the sole purpose of assigning common patterns to them. I don’t know the process involved in creating these things, so maybe this is impossible for some reason.)

All of this compression and reconstruction could be done behind the scenes. Humans wouldn’t need to see the symbols; they’d just see the reconstruction of it. Instead of communicating raw data with somebody, you’d instead be communicating the blueprint for reconstructing the data locally.

Public Database

This system requires creating a public database – the compression key that we all share. I don’t think it would be hard. It’d simply be a very large list of words and phrases, with the corresponding symbols they get compressed to. The key could be hosted locally on people’s computers (kind of like downloading a new font for your computer to understand), or it could be hosted on something like Github. We could create a kind of universal standard for the compression of ordinary English into characters. Then, your computer would either automatically convert the phrases to characters, or you might have something like a Chrome plugin that does it for you – or perhaps individual companies will offer this kind of compression on their platform (I’m lookin’ at you, Memo.cash).
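The shared key itself could be as simple as a versioned JSON file hosted on Github or downloaded once to your machine. A hypothetical sketch of the format and of loading it (the file name and layout are made up for illustration):

```python
import json

# Hypothetical shared key file, e.g. "english-key-v1.json":
# {
#   "version": 1,
#   "entries": {"give or take": "一", "peer-to-peer": "丁", "electronic cash": "七"}
# }

def load_key(path: str) -> dict:
    """Load the shared compression key so sender and receiver agree on every mapping."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["entries"]

key = load_key("english-key-v1.json")
```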

Playing out the idea further, companies could even have their own unique compression keys, if particular strings of data come up a lot in their systems. There’s no reason why characters would have to be assigned to English words only; they could also correspond to particular pieces of computer code, or any other string of data that’s in a repeated pattern.

If it’s not already widely in use, this system could not only dramatically reduce the amount of information communicated to people, but it might also dramatically reduce the amount of information stored on hard drives. Think how many times Google has the phrase “I am hungry” stored on their hard drives, throughout all the emails and YouTube comments they host. If the data is duplicated over and over, it could be dramatically reduced by replacing each phrase with a single character, then storing the compression key for data reconstruction when necessary. Heck, you could write a computer program that specifically looks for repeated patterns of information, assigns each pattern a particular symbol, then overwrites the data with that single character and keeps the compression key for later reconstruction. These compressed symbols become placeholders for patterns. Maximum compression would be the replacement of all repeating patterns with single-character placeholders. Again, I don’t know if this is something that’s already done, but if it’s not, it seems like a good idea.
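Here’s a rough sketch of that pattern-finding step: count repeated runs of words in a piece of text and hand the most frequent runs their own characters, drawn here from Unicode’s Private Use Area (the run length and entry limit are arbitrary choices for illustration):

```python
from collections import Counter
from itertools import islice

PRIVATE_USE_START = 0xE000  # Unicode Private Use Area: free for custom placeholders

def build_key(text: str, max_entries: int = 1000, run_length: int = 3) -> dict:
    """Give the most frequently repeated word runs their own single characters."""
    words = text.split()
    runs = Counter(
        " ".join(words[i:i + run_length]) for i in range(len(words) - run_length + 1)
    )
    repeated = (run for run, count in runs.most_common() if count > 1)
    return {run: chr(PRIVATE_USE_START + i) for i, run in enumerate(islice(repeated, max_entries))}
```

The resulting key could then drive the same compress and decompress functions sketched earlier.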

This idea is very broad and open-ended. Who knows: there could be industry-standard conventions for character compression that differ per language or use case. Or it could be used to power a computer program that receives its instructions by monitoring the blockchain for specific inputs. Large amounts of data could be written or executed remotely by communicating only a small number of symbols. Or, if your data isn’t compact enough, you could simply chain together BCH transactions to encode larger and larger amounts of data that could all be decompressed by the same key. Or perhaps you could compress the compression with a different key – finding patterns in your compressed output and then using a secondary key to compress even further.

Perhaps this could even be used in P2P file sharing. Instead of sharing the entire file with somebody, perhaps you could share the compression key with them plus the instructions for reconstructing the data on their own machines. (Again, this key could be unique to the particular data that’s being compressed. It could maximize the compression on a per-file basis. So, every time a common pattern appears, it’s compressed down to one character. The more repetition, the more compression.) Or maybe it could just be used to compress information between two people, who share their own unique keys.
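A sketch of what that handoff might look like: bundle the per-file key together with the compressed payload, so the receiver has everything needed to rebuild the original (since the key travels with the file, the repetition has to be frequent enough to outweigh the key’s own size):

```python
import json

def pack(compressed_text: str, key: dict) -> bytes:
    """Bundle the per-file compression key with the compressed payload."""
    return json.dumps({"key": key, "data": compressed_text}).encode("utf-8")

def unpack(blob: bytes) -> str:
    """Rebuild the original text from the bundled key and payload."""
    bundle = json.loads(blob)
    reverse = {char: phrase for phrase, char in bundle["key"].items()}
    return "".join(reverse.get(ch, ch) for ch in bundle["data"])
```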

Or maybe it’s just a clever hack to pack more information into limited blockspace. What do you think? There are obviously lots of details to work out, but it seems like this system could help us squeeze more into tight spaces like OP_Return.