Monday, February 14, 2011

Human Computation: Compression

Since I have been so interested in compression lately, I decided to do something a little different for this post, just for the heck of it (no, sorry, this isn't really glitch/noise related; don't worry, I have stuff coming up!). I am going to try to compress an English sentence down to the SMALLEST size I can manage while retaining the original idea. I wonder how small it can get before it becomes impossible to understand? Smaller is better, right?

For the text, I chose a somewhat wordy sentence that is quite well known here in the USA, the opening line of the Gettysburg Address:

"Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal"


This sentence is exactly 146 characters long (counting punctuation, but not spaces). Let us see how much we can reduce that with simple steps.
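For anyone who wants to check the counts in this post, here is a minimal Python sketch of the counting rule I am assuming: every non-whitespace character counts (punctuation included), and the surrounding quotation marks do not. The name `count_chars` is just for illustration.

```python
def count_chars(text):
    """Count every character that is not whitespace (punctuation included)."""
    return sum(1 for ch in text if not ch.isspace())


sentence = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent, a new nation, conceived in Liberty, and dedicated to "
    "the proposition that all men are created equal"
)

print(count_chars(sentence))  # 146
```

The same function can be pointed at any of the shorter versions further down.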


First, we can see two numbers spelled out as words at the opening of the sentence. Let us run a simple process, replacing each spelled-out number with its numeral:


"4 score and 7 years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal"


139 Characters. Could be better, but it's a start. Let's continue by stripping all words down to their shortest possible synonym.


"4 score and 7 years ago our dads made on this land, a new state, made free, and loyal to the idea that all men are made same"


The idea is still clearly understandable, and the length has been reduced from 139 Characters to 97. Now we correct the (quite outdated) unit of time at the beginning of the sentence by converting it to ordinary years. A score is 20 years, so 4 score and 7 is 4 × 20 + 7 = 87.


"87 years ago our dads made on this land, a new state, made free, and loyal to the idea that all men are made same"


89 Characters.
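The two number-related steps so far are mechanical enough to sketch in Python. This is only an illustration, not a general solution: the `NUMBER_WORDS` table and both function names are made up for this example, and the second pattern only handles the "N score and M" shape seen here.

```python
import re

# Illustrative lookup table; it only covers one through nine.
NUMBER_WORDS = {
    "one": "1", "two": "2", "three": "3", "four": "4", "five": "5",
    "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def words_to_numerals(text):
    """Replace whole-word number names with numerals, ignoring case."""
    pattern = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: NUMBER_WORDS[m.group(1).lower()], text)


def score_to_years(text):
    """Rewrite 'N score and M' as the plain number N * 20 + M."""
    return re.sub(
        r"\b(\d+) score and (\d+)\b",
        lambda m: str(int(m.group(1)) * 20 + int(m.group(2))),
        text,
    )


opening = "Four score and seven years ago"
step1 = words_to_numerals(opening)
step2 = score_to_years(step1)
print(step1)  # 4 score and 7 years ago
print(step2)  # 87 years ago
```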


Next, I simplify the unnecessarily complicated phrasing.


"87 years ago our dads made here, a new state, made free, and loyal to the idea that all are made same"

Is it just me, or did I hear the sound of feminists cheering?

Anyway, that edit brought our number down by 9 characters, for a grand total of 80.

Now it is starting to get tricky! Let's try adding some abbreviations, texting style:

"87 yrs ago r dads made here, a nw state, made free, n loyal t th idea tht ll r made same"

67 Characters!
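The abbreviations here were picked by hand, so the closest thing to an algorithm is a lookup table. A rough sketch in Python, with the `TEXTING` table read off from the before and after sentences above (the table and the function name are hypothetical):

```python
# Word-for-word substitutions taken from the two versions of the sentence.
TEXTING = {
    "years": "yrs", "our": "r", "new": "nw", "and": "n", "to": "t",
    "the": "th", "that": "tht", "all": "ll", "are": "r",
}


def abbreviate(text):
    """Swap each word for its abbreviation if the table has one."""
    return " ".join(TEXTING.get(word, word) for word in text.split())


before = ("87 years ago our dads made here, a new state, made free, "
          "and loyal to the idea that all are made same")
print(abbreviate(before))
# 87 yrs ago r dads made here, a nw state, made free, n loyal t th idea tht ll r made same
```

Words with punctuation attached (like "here,") are left alone because they are not in the table, which happens to be what we want in this sentence.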

Starting to look like it was written by a 13-year-old girl, but still understandable. Now, let's remove the articles ("a" and "the"). Any linguist will know that this WILL remove some clarity, but hopefully not enough to obscure the original intended idea.

"87 yrs ago r dads made here, nw state, made free, n loyal t idea tht ll r made same"

64 Characters now. Getting a bit harder to understand, but most Twitter users should be able to read it without trouble.
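Comparing the last two versions, the words that disappear are "a" and "th". A minimal sketch of that step in Python, assuming a small hand-picked word list (`ARTICLES` and `drop_articles` are names made up for illustration):

```python
# "th" is the abbreviated "the" from the texting-style version.
ARTICLES = {"a", "an", "the", "th"}


def drop_articles(text):
    """Remove standalone articles, leaving every other word untouched."""
    return " ".join(w for w in text.split() if w.lower() not in ARTICLES)


before = ("87 yrs ago r dads made here, a nw state, made free, "
          "n loyal t th idea tht ll r made same")
print(drop_articles(before))
# 87 yrs ago r dads made here, nw state, made free, n loyal t idea tht ll r made same
```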

Now, let's try replacing a few of the remaining words with admittedly less common abbreviations (that "made" was killing us!):

"87 yrs ago r dds md here, nw state, md fre, n lyl t idea tht ll r md sm"

52 Characters

I suppose it wouldn't hurt to cut out the bit in the middle, since it doesn't necessarily contribute to the overall idea...

"87 yrs ago r dds md here a fre state lyl t idea tht ll r md sm"

45 Characters

Well... Does the number of years really matter?

"R dds md here a fre state lyl t idea tht ll r md sm"

37 Characters!

A bit of rephrasing...

"R dds md us fre state where ll r md sm"

28 Characters

Does it matter who made it? I guess not...

"A fre state where ll r md sm"

21 Characters

Since all are the same, I guess the "free" is a little redundant, right? And the "made" doesn't really need to be in there either.

"A state where ll r sm"

16 Characters

"State" doesn't really add much to the meaning here...

"place where ll r sm"

15 Characters

I guess just a description might suffice?

"whr ll r sm"

8 Characters

A statement?

"equality"

8 Characters Again

Uhh, could we maybe make that an adjective?

"equal"

5 Characters!

Abbreviate!

"eql"

3 Characters.

...

"eql"... The opening sentence to one of the most famous speeches of all time, brought down to three simple letters... I wonder if people will recognize it?

(And this concludes today's special on why over-compression is usually bad.)
