2013-01-11

Poe's Gold Bug cipher

I have been studying Edgar Allan Poe's The Gold Bug recently. I've read that it is the first fiction story using cryptography as an important aspect of the plot. Specifically, I've been looking carefully at the cipher in the story, which is what this post is about.


First, a very quick summary of the story.


Sketch of The Gold Bug (1843, Edgar Allan Poe)

Characters:

WILLIAM LEGRAND, amateur entomologist

DOCTOR TELLER, his friend (the narrator, and actually unnamed in the story)

JUPITER, Legrand's old servant

Time:
Autumn, mid 1800's.

Place:
Sullivan's Island, near Charleston, South Carolina, USA

Backstory:
Pirates have buried a treasure chest, with some clues, on the island Legrand lives on. Among other clues, they left behind a "very dirty foolscap" (later revealed to be a scrap of parchment) with a skull-like drawing made on it in invisible ink.

Plot:
This is structured as a mystery (although technically speaking the mystery genre did not yet exist in the mid 1800's). Our "heroes" discover a treasure in the end.

The first clue is a "gold bug", the body of a beetle-like insect that Legrand and Jupiter find in their travels around the island. It is a very unusual specimen, so they are fascinated by it. In fact, in the story, it is even lent out to another scientist for inspection. When Teller arrives for a visit, the bug is still out on loan, so Legrand has to make do with drawing a picture of it on a scrap of paper he found on the beach and stuck in his pocket. (This scrap of paper was left by the pirates.) Except for this drawing (or more precisely, the drawing on that particular scrap of paper), the "gold bug" is not a significant part of the story.

The next clue Legrand discovers is the cryptogram. How does he discover it? First, Legrand is fascinated by the gold bug. Then he notices a skull pattern on the paper he drew the gold bug on (the scrap he had put in his pocket during his travels around the island). The cipher is on this piece of paper, but it is only revealed when the paper is heated. Once deciphered, the message gives directions to a treasure chest buried on the island. Legrand, aided by Teller and Jupiter, digs up the chest, which is filled with gold coins and jewels.


The story is of course in the public domain, so it is easy to find copies on the internet and various book collections. What I found surprising was the number of errors in either the cipher or the decryption. For example, I have a collection of Poe stories published by Dover. That book gives the wrong decrypted message. The version at wikisource, http://en.wikisource.org/wiki/Tales_(Poe)/The_Gold-Bug, seems to be correct.

In any case, if you translate the symbols used by Poe into letters, here is the cipher:

ABCCDBEAFFGHIJKLGFJCMFJCFIKEGHIJKDKNGEFFKAIOCPIQCHKDKBPKKFAHDIJGPIKKHSGHRIKFHCPIJ
KAFIAHDLQHCPIJSAGHLPAHTJFKNKHIJEGSLKAFIFGDKFJCCIOPCSIJKEKOIKQKCOIJKDKAIJFJKADALKKEGHK
OPCSIJKIPKKIJPCRBJIJKFJCIOGOIQOKKICRI

To derive this, I simply replaced "5" (the first symbol used by Poe in his cipher) by "A", "3" by "B", and so on. Except for the symbols, this is Poe's cipher. For you cryptogram fans, there are spoilers below, so if you want to copy and paste the above cipher into an online substitution cipher site, such as Simon Singh's, you should stop reading and do that.
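For anyone who wants to reproduce the relabeling, here is a minimal Python sketch. It assumes the rule is "give each new symbol the next unused capital letter, in order of first appearance", which is consistent with the cipher above. The function name relabel is my own, and the ten-symbol sample is assumed to be the opening of Poe's cryptogram (it is not the full text).

```python
import string

def relabel(symbols):
    """Replace each distinct symbol by A, B, C, ... in order of first appearance."""
    table = {}   # symbol -> assigned capital letter
    out = []
    for s in symbols:
        if s not in table:
            table[s] = string.ascii_uppercase[len(table)]
        out.append(table[s])
    return "".join(out), table

# Assumed opening symbols of Poe's cryptogram (illustrative sample only).
sample = "53‡‡†305))"
letters, table = relabel(sample)
print(letters)  # ABCCDBEAFF, the first ten letters of the cipher above
```

Any one-to-one relabeling would do just as well; first-appearance order is simply easy to reproduce by hand.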

In the story, Poe actually explains in some detail how he (or rather Legrand) deciphered this. He used what is now called "frequency analysis" - which is simply to guess that the most frequently used symbols correspond to the most frequently used letters in everyday usage, then to guess the other symbols using known word patterns. Only 20 distinct letters are used. Here are the character counts:

A: 12, B: 4, C: 16, D: 8, E: 6, F: 16, G: 11, H: 13, I: 26, J: 19, K: 33,
L: 5, M: 1, N: 2, O: 8, P: 10, Q: 4, R: 3, S: 5, T: 1
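The counts above are easy to recompute. Here is a short Python sketch using the standard library's collections.Counter; the variable names are my own, and the cipher is the relabeled one from this post, joined into a single string.

```python
from collections import Counter

# The relabeled cipher from above, with its three lines joined together.
CIPHER = (
    "ABCCDBEAFFGHIJKLGFJCMFJCFIKEGHIJKDKNGEFFKAIOCPIQCHKDKBPKKFAHDIJGPIKKHSGHRIKFHCPIJ"
    "KAFIAHDLQHCPIJSAGHLPAHTJFKNKHIJEGSLKAFIFGDKFJCCIOPCSIJKEKOIKQKCOIJKDKAIJFJKADALKKEGHK"
    "OPCSIJKIPKKIJPCRBJIJKFJCIOGOIQOKKICRI"
)

counts = Counter(CIPHER)

# Print the letters from most to least frequent.
for letter, n in counts.most_common():
    print(letter, n)

print(len(counts), "distinct letters,", sum(counts.values()), "characters in total")
```

The three most frequent letters come out as K (33), I (26) and J (19), which are exactly the ones attacked first below.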

The six most frequently used letters in the English language are (in order) "E", "T", "A", "O", "I", and "N". In particular, we might guess that "K" is really "E", and that "I" is really "T". Making these substitutions, we have

ABCCDBEAFFGHtJeLGFJCMFJCFteEGHtJeDeNGEFFeAtOCPtQCHeDeBPeeFAHDtJGPteeHSGHRteFHCPtJ
eAFtAHDLQHCPtJSAGHLPAHTJFeNeHtJEGSLeAFtFGDeFJCCtOPCStJeEeOteQeCOtJeDeAtJFJeADALeeEGHe
OPCStJetPeetJPCRBJtJeFJCtOGOtQOeetCRt

(I'm using lower case for the replaced letters, for ease of reading.) You see several "tJe"'s so you might guess "J" is "H", giving

ABCCDBEAFFGHtheLGFhCMFhCFteEGHtheDeNGEFFeAtOCPtQCHeDeBPeeFAHDthGPteeHSGHRteFHCPth
eAFtAHDLQHCPthSAGHLPAHThFeNeHthEGSLeAFtFGDeFhCCtOPCStheEeOteQeCOtheDeAthFheADALeeEGHe
OPCSthetPeethPCRBhtheFhCtOGOtQOeetCRt
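The partial substitutions are tedious to do by hand, so here is a small Python sketch of the idea. The helper apply_key is my own naming: it replaces only the letters guessed so far (in lower case) and leaves everything else alone. The cipher is joined into one string, so the "tJe" count includes one that straddles the first line break.

```python
CIPHER = (
    "ABCCDBEAFFGHIJKLGFJCMFJCFIKEGHIJKDKNGEFFKAIOCPIQCHKDKBPKKFAHDIJGPIKKHSGHRIKFHCPIJ"
    "KAFIAHDLQHCPIJSAGHLPAHTJFKNKHIJEGSLKAFIFGDKFJCCIOPCSIJKEKOIKQKCOIJKDKAIJFJKADALKKEGHK"
    "OPCSIJKIPKKIJPCRBJIJKFJCIOGOIQOKKICRI"
)

def apply_key(cipher, key):
    """Substitute the letters present in key; leave all other letters unchanged."""
    return "".join(key.get(ch, ch) for ch in cipher)

# First guess: K is E and I is T.
key = {"K": "e", "I": "t"}
step1 = apply_key(CIPHER, key)
print(step1.count("tJe"), 'occurrences of "tJe"')

# Second guess: J is H, which turns every "tJe" into "the".
key["J"] = "h"
step2 = apply_key(CIPHER, key)
print(step2[:81])
```

Each later step in this post amounts to adding one or two more entries to key and calling apply_key again.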

What are "C", "F", "H" and "A"? Well, of the most commonly occurring letters, "A", "O", "I" and "N" are left. Note "C" can't be "A" because there is a "CC". Let's try replacing "C" by "O":

ABooDBEAFFGHtheLGFhoMFhoFteEGHtheDeNGEFFeAtOoPtQoHeDeBPeeFAHDthGPteeHSGHRteFHoPth
eAFtAHDLQHoPthSAGHLPAHThFeNeHthEGSLeAFtFGDeFhootOPoStheEeOteQeoOtheDeAthFheADALeeEGHe
OPoSthetPeethPoRBhtheFhotOGOtQOeetoRt

This looks okay. What about "F"? Since there is an "FF", this rules out "A" and "I" as replacements. Since "S" and "N" are about the same in terms of frequency, and since "SS" is more common than "NN", let's try replacing "F" by "S":

ABooDBEAssGHtheLGshoMshosteEGHtheDeNGEsseAtOoPtQoHeDeBPeesAHDthGPteeHSGHRtesHoPth
eAstAHDLQHoPthSAGHLPAHThseNeHthEGSLeAstsGDeshootOPoStheEeOteQeoOtheDeAthsheADALeeEGHe
OPoSthetPeethPoRBhtheshotOGOtQOeetoRt
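The last two guesses both leaned on doubled letters ("CC" and "FF"). Here is a quick Python sketch that lists every letter appearing twice in a row in the cipher; the variable names are my own, and the cipher is again joined into a single string.

```python
CIPHER = (
    "ABCCDBEAFFGHIJKLGFJCMFJCFIKEGHIJKDKNGEFFKAIOCPIQCHKDKBPKKFAHDIJGPIKKHSGHRIKFHCPIJ"
    "KAFIAHDLQHCPIJSAGHLPAHTJFKNKHIJEGSLKAFIFGDKFJCCIOPCSIJKEKOIKQKCOIJKDKAIJFJKADALKKEGHK"
    "OPCSIJKIPKKIJPCRBJIJKFJCIOGOIQOKKICRI"
)

# Compare each character with its successor and keep the letters that repeat.
doubled = sorted({a for a, b in zip(CIPHER, CIPHER[1:]) if a == b})
print(doubled)  # ['C', 'F', 'K']
```

"K" was already taken to be "E" (and "EE" is common in English), so "C" and "F" are the two doubled letters that still needed an explanation at this point.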

The most natural choices left for "H" and "A" are that "H" is really "N" and "A" is really "A" itself:

aBooDBEassGntheLGshoMshosteEGntheDeNGEsseatOoPtQoneDeBPeesanDthGPteenSGnRtesnoPth
eastanDLQnoPthSaGnLPanThseNenthEGSLeastsGDeshootOPoStheEeOteQeoOtheDeathsheaDaLeeEGne
OPoSthetPeethPoRBhtheshotOGOtQOeetoRt

We have now replaced "K", "I", "J", "C", "F", "H" and "A". We recognize several words: "the", "seat", "shoot", "shot". If "D" were "D" we would have:

aBoodBEassGntheLGshoMshosteEGnthedeNGEsseatOoPtQonedeBPeesandthGPteenSGnRtesnoPth
eastandLQnoPthSaGnLPanThseNenthEGSLeastsGdeshootOPoStheEeOteQeoOthedeathsheadaLeeEGne
OPoSthetPeethPoRBhtheshotOGOtQOeetoRt

We now see "death", "and", "head", and "east" (which was there before, but it is now more certain that it is not part of another word). Also, "Bood" suggests "B" should be "G":

agoodgEassGntheLGshoMshosteEGnthedeNGEsseatOoPtQonedegPeesandthGPteenSGnRtesnoPth
eastandLQnoPthSaGnLPanThseNenthEGSLeastsGdeshootOPoStheEeOteQeoOthedeathsheadaLeeEGne
OPoSthetPeethPoRghtheshotOGOtQOeetoRt

We see another word: "good". The string "gEass" suggests "E" is "L" and the string "onedegPeesand" suggests "P" is "R":

agoodglassGntheLGshoMshostelGnthedeNGlsseatOortQonedegreesandthGrteenSGnRtesnorth
eastandLQnorthSaGnLranThseNenthlGSLeastsGdeshootOroStheleOteQeoOthedeathsheadaLeelGne
OroSthetreethroRghtheshotOGOtQOeetoRt

We have new words - "hostel", "degrees" and "north". Guessing "G" is "I" and "R" is "U" gives:

agoodglassintheLishoMshostelinthedeNilsseatOortQonedegreesandthirteenSinutesnorth
eastandLQnorthSainLranThseNenthliSLeastsideshootOroStheleOteQeoOthedeathsheadaLeeline
OroSthetreethroughtheshotOiOtQOeetout

We recognize "through" and "side" and "out". Now, "S" should be "M" and "O" should be "F":

agoodglassintheLishoMshostelinthedeNilsseatfortQonedegreesandthirteenminutesnorth
eastandLQnorthmainLranThseNenthlimLeastsideshootfromthelefteQeofthedeathsheadaLeeline
fromthetreethroughtheshotfiftQfeetout

It seems reasonable to replace "Q" by "Y", "N" by "V", and "L" by "B":

agoodglassinthebishoMshostelinthedevilsseatfortyonedegreesandthirteenminutesnorth
eastandbynorthmainbranThseventhlimbeastsideshootfromthelefteyeofthedeathsheadabeeline
fromthetreethroughtheshotfiftyfeetout

With only two letters left ("M" in "bishoMs" must be "P", and "T" in "branTh" must be "C"), we get the message:

a good glass in the bishops hostel in the devils seat forty one degrees and thirteen minutes north east and by north main branch seventh limb east side shoot from the left eye of the deaths head a beeline from the tree through the shot fifty feet out

This message told Legrand and his friends where to dig for the treasure.
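To double-check the whole derivation, here is a Python sketch that collects all twenty guesses into one table and decrypts the cipher in a single pass. The names KEY and EXPECTED are my own; the table is just the substitutions made above, in alphabetical order of the cipher letters.

```python
CIPHER = (
    "ABCCDBEAFFGHIJKLGFJCMFJCFIKEGHIJKDKNGEFFKAIOCPIQCHKDKBPKKFAHDIJGPIKKHSGHRIKFHCPIJ"
    "KAFIAHDLQHCPIJSAGHLPAHTJFKNKHIJEGSLKAFIFGDKFJCCIOPCSIJKEKOIKQKCOIJKDKAIJFJKADALKKEGHK"
    "OPCSIJKIPKKIJPCRBJIJKFJCIOGOIQOKKICRI"
)

# Cipher letters A..T paired with the plaintext letters guessed in this post.
KEY = dict(zip("ABCDEFGHIJKLMNOPQRST", "agodlsinthebpvfryumc"))

plain = CIPHER.translate(str.maketrans(KEY))

# The message from this post, with the spaces removed for comparison.
EXPECTED = (
    "a good glass in the bishops hostel in the devils seat forty one degrees "
    "and thirteen minutes north east and by north main branch seventh limb "
    "east side shoot from the left eye of the deaths head a beeline from the "
    "tree through the shot fifty feet out"
).replace(" ", "")

print(plain == EXPECTED)
```

Since the cipher has no word breaks, the spacing of the final message is a matter of reading, not something the substitution table can recover.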