Saturday, October 20, 2012

Holiday planning by Debian packages

Today I am facing a rather difficult decision. Should I travel to Lauterbrunnen for a week of BASE jumping or should I stay with my friends in Sweden for the last weekend of skydiving this season (with the obligatory party), and try out my brand new wingsuit? Facing this decision, I did what any rational human being would do: I decided to ask the aptitude package manager for help in making this decision.

Setting up a local Debian repository

If you already have a local Debian repository you can skip this step; otherwise it's useful to set one up. Install a web server of your choice (I happened to be running lighttpd already) and then install, for example, reprepro. I am not going to elaborate on this; here's a guide that may be useful: Local Debian Repository.
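For reference, a minimal setup might look like the sketch below. The paths and names here are assumptions on my part (the repository lives in /var/www/debian and the upload target/incoming ruleset is called "decision", to match the commands later in this post); consult the reprepro and dput documentation for the exact fields.

```
# /var/www/debian/conf/distributions
Codename: unstable
Architectures: amd64 source
Components: main

# /var/www/debian/conf/incoming
Name: decision
IncomingDir: /var/www/debian/incoming
TmpDir: /tmp
Allow: unstable

# ~/.dput.cf
[decision]
method = local
incoming = /var/www/debian/incoming
```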

Creating a debian project

A very simple Debian project consists of a debian directory with the three files "control", "rules" and "changelog", so let's create those.

bjorn@terra:~/src/$ mkdir -p decision-example/debian
bjorn@terra:~/src/$ cd decision-example
bjorn@terra:~/src/decision-example$ touch debian/control
bjorn@terra:~/src/decision-example$ cat debian/rules # create this
#!/usr/bin/make -f

%:
 dh $@
bjorn@terra:~/src/decision-example$ chmod +x debian/rules
bjorn@terra:~/src/decision-example$ cat debian/changelog # create this
decision-example (0.0.0) unstable; urgency=low

  * Initial release. (Closes: #XXXXXX)

 -- Björn Edström   Sat, 20 Oct 2012 11:37:03 +0200

While optional, I am also going to create the "compat" file because it will suppress a bunch of warnings when building the package.

bjorn@terra:~/src/decision-example$ cat debian/compat # create this
7

The interesting file is the control file, which specifies one or more Debian packages to be built. To get this out of the way, we add a source package stanza on top, followed by a single package representing the decision to go skydiving. This package is called decision-gryttjom because my home dropzone is Gryttjom, a place north of Stockholm in Sweden:

Source: decision-example
Section: devel
Priority: optional
Maintainer: Root Key
Build-Depends: debhelper (>= 7)
Standards-Version: 3.8.4

Package: decision-gryttjom
Architecture: amd64
Description: Go to gryttjom

Having set this up, we can now build our package(s)

bjorn@terra:~/src/decision-example$ debuild -d -us -uc

The flags mean: skip the build-dependency check (-d) and do not sign the source package (-us) or the .changes file (-uc). If successful, this will create a bunch of files in the parent directory.

Let's publish this package to our debian repository:

bjorn@terra:~/src$ dput -f -u decision decision-example_0.0.0_amd64.changes
Uploading to decision (via local to localhost):
Successfully uploaded packages.
bjorn@terra:~/src$ sudo /usr/bin/reprepro -b /var/www/debian processincoming decision
Exporting indices...

Given that you have configured your /etc/apt/sources.list correctly, the package will now be available after an aptitude update:

bjorn@terra:~$ aptitude search decision
p   decision-gryttjom            - Go to gryttjom

Because we are going to create a bunch of packages from now on, let us create a build script:

bjorn@terra:~/src$ cat ./decision-example/publish.sh
#!/bin/bash
sudo rm -r /var/www/debian/{pool,db,dists} # lazy way to delete previous state; only do this if you don't have other packages
dput -f -u decision decision-example_0.0.0_amd64.changes
sudo /usr/bin/reprepro -b /var/www/debian processincoming decision
sudo aptitude update

Decisions and dependencies

Aptitude, a package manager for Debian, has a rather advanced dependency resolver. Specifically, you can use the fields "Depends", "Conflicts", "Suggests", "Enhances", "Recommends" and a bunch of others to specify relations between packages. This is what we are going to use to make our decision.

At the top level, we can state our problem as "go to Gryttjom or go to the Alps". So let's create one package for each option.

Package: decision-gryttjom
Provides: decision-root
Priority: standard
Conflicts: decision-alps
Architecture: amd64
Description: Go to gryttjom

Package: decision-alps
Provides: decision-root
Priority: standard
Conflicts: decision-gryttjom
Architecture: amd64
Description: Stay in the alps

Let us build these packages (debuild step) and publish (using the build script). Now we have:

bjorn@terra:~$ aptitude search decision
p   decision-alps                - Stay in the alps
p   decision-gryttjom            - Go to gryttjom
v   decision-root                -

We have now codified a decision problem as shown:

bjorn@terra:~$ sudo aptitude install decision-root
...
"decision-root" is a virtual package provided by:
  decision-gryttjom decision-alps
You must choose one to install.

bjorn@terra:~$ sudo aptitude install --schedule-only decision-gryttjom decision-alps
bjorn@terra:~$ sudo aptitude install # we scheduled above and execute here to avoid installing the default
...
The following packages are BROKEN:
  decision-alps decision-gryttjom
0 packages upgraded, 2 newly installed, 0 to remove and 1 not upgraded.
Need to get 1,966B of archives. After unpacking 57.3kB will be used.
The following packages have unmet dependencies:
  decision-alps: Conflicts: decision-gryttjom but 0.0.0 is to be installed.
  decision-gryttjom: Conflicts: decision-alps but 0.0.0 is to be installed.
The following actions will resolve these dependencies:

Keep the following packages at their current version:
decision-alps [Not Installed]

Score is 117

Accept this solution? [Y/n/q/?] n

The following actions will resolve these dependencies:

Keep the following packages at their current version:
decision-gryttjom [Not Installed]

Score is 117

Accept this solution? [Y/n/q/?]

What we have done is create two packages and a root "virtual" package, and specified the relationships between them. Using the excellent debtree utility we can visualize this as follows:


bjorn@terra:~/src$ debtree -R -S decision-alps | dot -T png > decision-example/img/debtree1.png

Above is the dependency graph centered on the decision to go to the Alps. The root decision shows the two options: alps or gryttjom, while the red arrow shows a conflict.

Adding data points, additional decisions

The decision I am making is heavily influenced by weather forecasts. Furthermore, if I go to the Alps I have the option to either stay in Lauterbrunnen or travel to Mt Brento in Italy to meet up with a friend. So let's create some data points and sub-decisions:

Package: decision-gryttjom
Provides: decision-root
Priority: standard
Conflicts: decision-alps
Architecture: amd64
Description: Go to gryttjom

Package: decision-alps
Provides: decision-root
Priority: standard
Conflicts: decision-gryttjom
Suggests: decision-alps-brento, data-good-weather-alps
Architecture: amd64
Description: Stay in the alps

Package: decision-alps-brento
Depends: decision-alps
Architecture: amd64
Description: Go to brento

Package: data-good-weather-alps
Architecture: amd64
Enhances: decision-alps
Description: Good weather in the alps

Package: data-bad-weather-gryttjom
Architecture: amd64
Enhances: decision-alps
Recommends: decision-alps
Description: Bad weather at home

Here I have basically set up a bunch of hints indicating that if the weather forecast for Sweden looks bad, that leans towards going to the Alps. Building and publishing gives us:

bjorn@terra:~$ aptitude search decision
p   decision-alps                   - Stay in the alps
p   decision-alps-brento            - Go to brento
p   decision-gryttjom               - Go to gryttjom
v   decision-root                   -
bjorn@terra:~$ aptitude search ^data- | egrep 'alps|gryttjom'
p   data-bad-weather-gryttjom       - Bad weather at home
p   data-good-weather-alps          - Good weather in the alps


Now, let's say the forecast says it seems to be bad weather in Sweden. Let's see what aptitude says:

bjorn@terra:~$ sudo aptitude install --schedule-only decision-gryttjom decision-alps data-bad-weather-gryttjom

bjorn@terra:~$ sudo aptitude install
...
The following packages are BROKEN:
  decision-alps decision-gryttjom
The following NEW packages will be installed:
  data-bad-weather-gryttjom
0 packages upgraded, 3 newly installed, 0 to remove and 1 not upgraded.
Need to get 3,010B of archives. After unpacking 86.0kB will be used.
The following packages have unmet dependencies:
  decision-alps: Conflicts: decision-gryttjom but 0.0.0 is to be installed.
  decision-gryttjom: Conflicts: decision-alps but 0.0.0 is to be installed.
The following actions will resolve these dependencies:

Keep the following packages at their current version:
decision-gryttjom [Not Installed]

Score is 117

Accept this solution? [Y/n/q/?] n
The following actions will resolve these dependencies:

Keep the following packages at their current version:
decision-alps [Not Installed]

Leave the following dependencies unresolved:
data-bad-weather-gryttjom recommends decision-alps
Score is -83

As can be seen, given the simple dependencies we have set up, aptitude prefers BASE jumping if the weather is bad in Sweden.

Conclusion

To not make this article too long and tedious, I will stop here and not add more data points. However, by using the Debian packaging fields "Suggests", "Enhances" and "Recommends" on some other data points (the cost and travel overhead of getting to Brento, whether my cold develops further, how much money I have, etc.) you can actually get something useful. As the number of data points increases it is no longer immediately obvious (as it is in the example above) what to do, and aptitude actually gives some suggestions.
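For instance, a travel-cost data point could be expressed as one more stanza in debian/control, following the same pattern as the weather packages above (this is a hypothetical package of my own invention, not from the example project):

```
Package: data-high-travel-cost-brento
Architecture: amd64
Enhances: decision-gryttjom
Recommends: decision-gryttjom
Description: Getting to Brento is expensive this week
```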

Of course, I do not recommend that anyone base their life decisions on what a Linux package manager says; for that reason this article is a little bit tongue-in-cheek, and I realize it is also the result of some procrastination on my part. However, formalizing the data points and the relations between decisions actually helped me out, and once I have formalized the remaining data points (travel costs, weather forecasts etc.) I will give aptitude a spin and hear it out.

Cheers,

Björn

Saturday, July 28, 2012

Fun with the TLS handshake

This article is about the TLS handshake in general, with a focus on how cryptographic signing is used in the protocol. Specifically, the article will attempt to do the following:
  1. Explain how the TLS handshake works, using Python code.
  2. Highlight how some parts of the protocol allow for signing short amount of client-supplied data by the server certificate.
  3. ... and abuse the above parts of the protocol, and common implementations of the protocol, to build poor-mans trusted time-stamping.
The companion code to this article is the little project "toytls" hosted at Github, which implements the client part of the handshake and some surrounding utilities to accomplish point 2 and 3 above.

See the toytls page at github for the code.

An overview of the TLS handshake

TLS, and its predecessor SSL, is a protocol for secure (encrypted, authenticated) communication over an untrusted network. The "TLS handshake" is the initial part of the protocol, where a client and a server negotiate how (and if) to accomplish the goal of secure communication.

The handshake consists of multiple messages being sent back and forth between the server and client, with the ultimate goals of: a) allowing the client to verify the authenticity of the server, usually using cryptographic certificates, b) allowing the server to verify the client, and c) securely agreeing on a method to proceed with the bulk transfer of data.

The handshake typically look like this:

>>> ClientHello
<<< ServerHello
<<< Certificates
<<< ServerHelloDone
>>> ClientKeyExchange
>>> ChangeCipherSpec
>>> Finished
<<< ChangeCipherSpec
<<< Finished

This means that the client first sends a message ClientHello to the server, and will get the response ServerHello back, etc. The handshake ends with both client and server sending Finished. After that point, a secure channel has been set up for the transfer of data.

Cryptographic primitives and cipher suites

TLS has the concept of a "cipher suite". This is a predefined list of cryptographic primitives in different classes (encryption, hashing, authentication etc) to use in the handshake and communication. In this article, we will concern ourselves with three cipher suites:

TLS_RSA_WITH_RC4_128_SHA = 0x0005
TLS_DHE_RSA_WITH_AES_128_CBC_SHA = 0x0033
TLS_ECDHE_RSA_WITH_RC4_128_SHA = 0xc011

The first one, suite TLS_RSA_WITH_RC4_128_SHA (which has id 5), is implemented and supported by almost all clients and servers. The last two are not as common (the servers at Google and Facebook support one each, but not both) and are selected specifically because they have interesting signing behavior, which will be discussed further below.

In English, the three cipher suites above mean:
  • A cipher suite using RSA for secure key exchange, RC4 for bulk data encryption and SHA1 as a hash.
  • A cipher suite using Diffie-Hellman for secure key exhange, RSA signatures for authentication, AES-128 in CBC mode for bulk encryption and SHA1 as hash.
  • A cipher suite using Elliptic Curve Diffie-Hellman for secure key exchange, RSA signatures for authentication, RC4 for bulk data encryption and SHA1 as hash.
As can be seen, a full implementation of one of the versions of TLS requires a full arsenal of different cryptographic primitives (the three examples given are just a small subset of all cipher suites specified in the RFCs). For the remainder of this article, we will mostly concern ourselves with the first suite, and only use the second suite to implement the "clever hacks" mentioned in the very beginning of this article.

To implement a handshake using TLS_RSA_WITH_RC4_128_SHA, you need, of course, an implementation of RSA, RC4 and SHA1. In addition, TLS version 1.1, which is by far the most common version on the Internet today, requires an implementation of MD5, which is used for authenticating the handshake itself. This is not a requirement for TLS version 1.2, which uses SHA-256 instead.

For practical use of RSA, the cipher suites mentioned employ the PKCS#1 v1.5 padding scheme, a method for padding a byte string securely before encrypting it.
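For encryption, PKCS#1 v1.5 formats the message into an encryption block 00 || 02 || PS || 00 || M, where PS is a string of non-zero random bytes. A minimal sketch (pkcs1_v15_pad is my own helper name, not a toytls function):

```python
import os

def pkcs1_v15_pad(message, k):
    """Pad a message for RSA encryption: EB = 00 || 02 || PS || 00 || M,
    where k is the RSA modulus size in bytes and PS is a string of
    non-zero random bytes filling the block."""
    if len(message) > k - 11:
        raise ValueError("message too long for modulus")
    ps = b""
    while len(ps) < k - 3 - len(message):
        byte = os.urandom(1)
        if byte != b"\x00":  # padding bytes must be non-zero
            ps += byte
    return b"\x00\x02" + ps + b"\x00" + message
```

The padded block is then interpreted as an integer and encrypted with the RSA public key from the server certificate.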

In toytls, the cryptographic primitives are implemented in toytls/rsa.py and toytls/rc4.py.

Message structure

To get a general feel for the TLS handshake we will dissect a sample ClientHello message, being sent from the client to the server. A simple ClientHello can look like below, with line breaks and comments added for readability.

16 03 02 00 31 # TLS Header
01 00 00 2d # Handshake header
03 02 # ClientHello field: version number (TLS 1.1)
50 0b af bb b7 5a b8 3e f0 ab 9a e3 f3 9c 63 15 \
33 41 37 ac fd 6c 18 1a 24 60 dc 49 67 c2 fd 96 # ClientHello field: random
00 # ClientHello field: session id
00 04 # ClientHello field: cipher suite length
00 33 c0 11 # ClientHello field: cipher suite(s)
01 # ClientHello field: compression support, length
00 # ClientHello field: compression support, no compression (0)
00 00 # ClientHello field: extension length (0)

All TLS messages have the TLS header which consists of a content type (first byte), version number (second two bytes) and a length (last two bytes). The content type describes the type of the message, and is 0x16 for the plaintext handshake and 0x17 for encrypted application data.

The handshake messages have a second handshake header describing the type of the handshake message (ClientHello, ServerHello etc). The handshake header also has a length field.

Following the handshake header is payload data specific to the handshake message. For ClientHello, this field contains some random data used for encryption purposes, a list of cipher suites supported by the client, information about compression support and optionally support for TLS extensions, which will not be covered in this article.
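To make the layout concrete, here is a small Python sketch that produces a ClientHello with the same structure as the dissected message above (build_client_hello is my own name for this; toytls does the marshalling in its own way):

```python
import os
import struct

def build_client_hello(cipher_suites, client_random=None):
    """Build a TLS 1.1 ClientHello record: version, random, session id,
    cipher suites, compression methods, and an empty extension list."""
    if client_random is None:
        client_random = os.urandom(32)
    body = b"\x03\x02"                         # version number (TLS 1.1)
    body += client_random                      # 32 bytes of client random
    body += b"\x00"                            # empty session id
    body += struct.pack(">H", 2 * len(cipher_suites))  # cipher suite length
    for suite in cipher_suites:
        body += struct.pack(">H", suite)       # each suite is a 16-bit id
    body += b"\x01\x00"                        # one compression method: none
    body += b"\x00\x00"                        # extension length 0
    # Handshake header: type 0x01 (ClientHello) + 3-byte length
    handshake = b"\x01" + struct.pack(">I", len(body))[1:] + body
    # TLS record header: content type 0x16 (handshake), version, length
    return b"\x16\x03\x02" + struct.pack(">H", len(handshake)) + handshake
```

Offering the suites 0x0033 and 0xc011 reproduces the lengths in the dump above: a 49-byte (0x31) record payload containing a 45-byte (0x2d) ClientHello.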

It is worth mentioning that, although the ChangeCipherSpec message is sent in the handshake procedure, it is not considered a handshake message and thus has a different content type (0x14) and does not have the handshake header. All the other messages mentioned above are handshake messages with a handshake header and a handshake content type.

Error handling

If something goes wrong during the handshake, the server can send an Alert message, which has content type 0x15. If you send incorrect TLS messages to the server (such as implementing authentication incorrectly, using incorrect decryption keys, attempting to use unsupported features, or otherwise implementing the protocol wrong) you will get alert messages indicating some kind of failure.

In this article we will mostly ignore these messages. Just be aware that if you play around with toytls and change messages you will surely get these messages from the server.

The handshake from above

In this section of the article we will try to ignore the details of how the different TLS messages are marshalled, and instead focus on the content of each message. If you are interested in the bits and bytes, please have a look at the toytls source code and/or look at Wireshark output. In the description below, we will focus on the cipher suite TLS_RSA_WITH_RC4_128_SHA. Other cipher suites may have
slightly different message contents.

>>> ClientHello

Client says it supports TLS_RSA_WITH_RC4_128_SHA, no compression, and no extensions. It sends some random data used for cryptographic purposes.

<<< ServerHello

Server responds with some random bytes of its own, and says that it accepts the cipher suite and compression option the client suggested. The server also gives a session ID which can be used if the client wants to reuse the TLS connection later. We will skip over the session ID and sessions in this article.

<<< Certificates

Server sends a list of certificates to the client (for HTTPS, these are the certificates shown in the web browser when the website is opened).

<<< ServerHelloDone

Server sends a message indicating that the client should continue with the handshake.

>>> ClientKeyExchange

The client generates a 48 byte random encryption key called the "pre master secret" and encrypts that with the public key in the server certificate. This encrypted key is sent to the server.

>>> ChangeCipherSpec

The client sends the control message ChangeCipherSpec indicating that everything from now on will be encrypted.

>>> Finished

The client sends an encrypted message that contains a hash of all the messages in the handshake sent and received. This makes it possible to detect tampering. This will be described further below.

<<< ChangeCipherSpec

The server sends the control message ChangeCipherSpec indicating that everything from now on will be encrypted.

<<< Finished

The server responds with a Finished message of its own, sending an encrypted hash of the handshake messages.

Cryptography in the handshake

As mentioned above the Finished message contains a hash of the handshake sequence. For TLS 1.1, this hash is a slightly unusual construct: it is the concatenation of an MD5 and a SHA1 hash. The hashed data is the content of the handshake messages, without the TLS header.

This means that when the client constructs ClientHello, it also updates the MD5-SHA1 hash with the content of the message. When it receives ServerHello, it updates the hash again, so that the client and server can compare hashes in the Finished messages.
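A sketch of that running hash (my own minimal rendering; compare the toytls source for the real thing):

```python
import hashlib

class HandshakeHash:
    """Running hash over all handshake message bodies. The TLS 1.1
    Finished verify data is computed over the concatenation of an MD5
    and a SHA-1 digest of everything sent and received."""
    def __init__(self):
        self.md5 = hashlib.md5()
        self.sha1 = hashlib.sha1()

    def update(self, handshake_body):
        # Called for every handshake message, sent or received.
        self.md5.update(handshake_body)
        self.sha1.update(handshake_body)

    def digest(self):
        return self.md5.digest() + self.sha1.digest()  # 16 + 20 = 36 bytes
```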

In addition to the MD5-SHA1 hash, TLS 1.1 also uses a rather peculiar pseudo-random function (PRF) based on the two hashes. This pseudo-random function is used to generate keying material for the cryptographic primitives and MACs. The PRF is a combination of HMAC-MD5 and HMAC-SHA1. Please see toytls/hash.py for details.
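Concretely, the TLS 1.1 PRF splits the secret into two halves, expands each half with an HMAC-based construction (P_MD5 and P_SHA1), and XORs the two streams together. A sketch following the definitions in RFC 4346 (the function names are mine, not the toytls ones):

```python
import hashlib
import hmac

def _p_hash(hash_mod, secret, seed, size):
    # P_hash(secret, seed) = HMAC(secret, A(1) + seed) ||
    #                        HMAC(secret, A(2) + seed) || ...
    # where A(0) = seed and A(i) = HMAC(secret, A(i-1)).
    out = b""
    a = seed
    while len(out) < size:
        a = hmac.new(secret, a, hash_mod).digest()
        out += hmac.new(secret, a + seed, hash_mod).digest()
    return out[:size]

def prf_tls11(secret, label, seed, size):
    """PRF = P_MD5(S1, label + seed) XOR P_SHA1(S2, label + seed),
    where S1/S2 are the two (possibly overlapping) halves of the secret."""
    half = (len(secret) + 1) // 2
    s1, s2 = secret[:half], secret[-half:]
    md5_part = _p_hash(hashlib.md5, s1, label + seed, size)
    sha1_part = _p_hash(hashlib.sha1, s2, label + seed, size)
    return bytes(x ^ y for x, y in zip(md5_part, sha1_part))
```

The master secret is then, for example, prf_tls11(pre_master_secret, b"master secret", client_random + server_random, 48), and the key block is derived from the master secret the same way.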

When the cipher suite is TLS_RSA_WITH_RC4_128_SHA, the PRF is seeded with the pre-master secret, client random and server random data to generate a master secret. This master secret is then seeded to the PRF, again together with the random data, to create keying material. In this case, the keying material is for the RC4 stream cipher and HMAC-SHA1.

Both sides (client/server) hold encryption/MAC contexts for themselves and for the other side, so the client in our case has an RC4 encryption key for both the client and the server.

Demonstrating the handshake

In the toytls project, please try out the script toytls-handshake.py. This will, given a server and port, complete a TLS 1.1 handshake.

Fun with signatures

With the cipher suite TLS_RSA_WITH_RC4_128_SHA, the handshake is authenticated because the pre-master secret is encrypted with the public key in the server certificate, making sure only the holder of the private key can decrypt it. From the client's side, this is some proof that the server is valid.

When the cipher suite sets up a secure channel with Diffie-Hellman, this kind of authentication is lacking: Diffie-Hellman is anonymous, and anyone could impersonate the server. To mitigate this, the TLS cipher suites using Diffie-Hellman use RSA or some other algorithm for signing the handshake and encryption parameters. In TLS jargon this is known as "Ephemeral Diffie-Hellman", which is what DHE stands for in the cipher suite names.

To inform the client of this signature, an extra message is introduced in the TLS handshake, known as ServerKeyExchange. This is sent between Certificate and ServerHelloDone. The signature is over the client and server random data, and the encryption parameters.

This means that a client can, using the client random field (as sent in ClientHello and used for encryption purposes), get up to 32 bytes of data signed by the server certificate. This is not completely useful on its own. However, the TLS specification says that the first part of the random fields should be a timestamp.
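The server's timestamp can be read straight out of its random field (server_timestamp is a small helper of my own for illustration):

```python
import struct

def server_timestamp(server_random):
    """Extract gmt_unix_time from a TLS random field. Per the spec,
    the first 4 of the 32 bytes are a big-endian unix timestamp."""
    ts, = struct.unpack(">I", server_random[:4])
    return ts
```

For example, the server random beginning 5013fa56 shown in the toytls-verify output later in this article decodes to unix time 1343486550.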

Poor mans trusted timestamping

Trusted timestamping is cryptographically associating a timestamp with some data. Typically, you have a document you want to assert you wrote at a certain date. You take a hash of the document, send it to a trusted third party, and have them sign the hash together with a timestamp.

By using the client random field in the TLS handshake, together with a Diffie-Hellman cipher suite, and using the fact that TLS implementations include a timestamp from the server, you can use Google or Facebook as a trusted third party for timestamping up to 32 bytes of data, sufficient for, for example, a SHA-256 hash.

In the toytls project there are two scripts, toytls-sign.py and toytls-verify.py.

$ cat statement
In this statement I claim that I wrote the following documents the year 2012:

cae3614264895c0201525ec7efff4ca6bb34dfc2 toytls/x509.py
f087da9fe9ad13e8064e6ad951b6aac8c3d54799 scripts/toytls-sign.py

Cheers
be@bjrn.se

$ sha256sum statement
08c247a658bcfe4668d853192dfe9a27c4f7bbc75ca6bc567fdc4726b1628ee8 statement

$ sha256sum statement | awk '{print $1}' |
python -c "import sys; sys.stdout.write(sys.stdin.read().strip().decode('hex'))" |
PYTHONPATH=. scripts/toytls-sign.py www.google.com 443 signed_statement
Signing message:
08c247a658bcfe4668d853192dfe9a27c4f7bbc75ca6bc567fdc4726b1628ee8

Bytes signed:
08c247a658bcfe4668d853192dfe9a27c4f7bbc75ca6bc567fdc4726b1628ee8
5013fa565da4ee2d51c9441cf307578161d1f631a3b4784f193bdd7e9d55b598
03001741049c096ff72fe6a7a1bbc5227a7b9806ab0d12129212a3e700138070
42e35fd8f60efc9cda1ecd9bbf61464a179299b43c3cf195956eedd635a7f859
8091910bf3

Signature:
7c7e3ef5aeea49674e3311112b503c6cfe8149810d5615392d0405939667bf15
95a8c9693cc8d4105a52c85615e7132467757939f72f01354a74882f59463e4d
d76b7eb4ec0de9b6922e2fc3e74336eb0ae619f90f53a2384a1465970a11a9d5
66afd335d3ae9cb2e8f7fd757d5cb5fad530923d29b3df195a963ef699711141

Server certificate:
308203213082028aa00302010202104f9d96d966b0992b54c2957cb4157d4d30
...
20e90a70641108c85af17d9eec69a5a5d582d7271e9e56cdd276d5792bf72543
1c69f0b8f9

$ PYTHONPATH=. scripts/toytls-verify.py signed_statement
Signature Verification SUCCESS

Bytes signed:
08c247a658bcfe4668d853192dfe9a27c4f7bbc75ca6bc567fdc4726b1628ee8
5013fa565da4ee2d51c9441cf307578161d1f631a3b4784f193bdd7e9d55b598
03001741049c096ff72fe6a7a1bbc5227a7b9806ab0d12129212a3e700138070
42e35fd8f60efc9cda1ecd9bbf61464a179299b43c3cf195956eedd635a7f859
8091910bf3

Bytes signed, user supplied messsage (hex):
08c247a658bcfe4668d853192dfe9a27c4f7bbc75ca6bc567fdc4726b1628ee8

Bytes signed, user supplied messsage (repr):
"\x08\xc2G\xa6X\xbc\xfeFh\xd8S\x19-\xfe\x9a'\xc4\xf7\xbb\xc7\\\xa6\xbcV\x7f\xdcG&\xb1b\x8e\xe8"

Bytes signed, server unix timestamp:
1343486550

Bytes signed, server UTC timestamp:
2012-07-28 14:42:30

Signature:
7c7e3ef5aeea49674e3311112b503c6cfe8149810d5615392d0405939667bf15
95a8c9693cc8d4105a52c85615e7132467757939f72f01354a74882f59463e4d
d76b7eb4ec0de9b6922e2fc3e74336eb0ae619f90f53a2384a1465970a11a9d5
66afd335d3ae9cb2e8f7fd757d5cb5fad530923d29b3df195a963ef699711141

Server certificate. For more details, do:
$ openssl asn1parse -inform DER -in signed_statement.certificate.der

308203213082028aa00302010202104f9d96d966b0992b54c2957cb4157d4d30
...
20e90a70641108c85af17d9eec69a5a5d582d7271e9e56cdd276d5792bf72543
1c69f0b8f9

Conclusion

Happy hacking. :)

Thursday, January 22, 2009

A proposal for better IRC encryption

January 25 update: The IRCSRP proposal has been updated to version 1.1, fixing a minor replay attack.

February 12 update: The IRCSRP proposal has been updated to version 2.0, adding message authentication. This blog post has NOT been updated; for version 2.0 see the IRCSRP page.

This article assumes a basic understanding of cryptography and IRC terminology (such as “channel”).

IRC is mostly used for non-sensitive discussions in public channels, such as #math or #haskell on Freenode. As anyone can join these discussions, the plain-text nature of the IRC protocol is usually not considered a problem.

Although plain text, and thus not suited for private discussions, the protocol offers some features that lend themselves to this usage behavior. First of all, it is possible for two users on an IRC network to talk to each other outside channels, similar to chat using instant messaging software. It is also usually possible to form password-protected channels, a feature often used by groups of friends.

While the IRC protocol itself does not specify means for encrypting these private conversations, several methods have been implemented over the years. Some IRC clients and servers now support SSL/TLS; however as most large networks do not, and SSL still allows the server admin to eavesdrop on private conversations, other methods are more common.

This short article will attempt to describe how some of the more common methods work, as well as to suggest an improved SRP-based protocol fixing some shortcomings in the available methods.

A short primer on the IRC protocol

The methods described in this article all use the IRC protocol itself for transporting ciphertext. To understand these methods it is thus necessary to understand the IRC protocol, at least to some extent. Fortunately the protocol is not very complicated. Consult RFC 2812 for the gritty details.

In this example, Alice has connected her IRC client to Freenode. On connection, a small handshake takes place where Alice and Freenode negotiate a unique nickname. Alice is not allowed to enter the server if her desired nickname is already in use. With a unique nickname, Alice can talk to other users on the network.

Next, Alice joins #math and asks a question about some mathematical topic. Her client sends to the server:

PRIVMSG #math :How can I prove this set has this property?\r\n

The server then parses this message, and sends to all other users in channel #math:

:alice!alice@host.com PRIVMSG #math :How can I prove this set has this property?\r\n

The first part of the message, after the initial colon and before the first space, is a string of the form nickname!username@hostname. The other users can thus know who sent the message.

The PRIVMSG message is also used for "private messages" (an unfortunate name, given that the same command also carries public channel messages). If Alice and Bob talk to each other outside a channel, the message exchange looks something like this:

Alice sends to server:
PRIVMSG bob :Hi bob!\r\n

Server sends to bob:
:alice!alice@host.com PRIVMSG bob :Hi bob!\r\n

Bob sends to server:
PRIVMSG alice :Howdy alice!\r\n

Server sends to Alice:
:bob!bob@bobshost.com PRIVMSG alice :Howdy alice!\r\n

There is a second method to send messages to a user or channel, known as a NOTICE message, which has the exact same form as PRIVMSG. Client software usually displays these messages differently from PRIVMSG messages. Some of the software mentioned in this article uses the NOTICE message for different purposes.

Encryption

All methods for encrypting IRC conversations within the protocol (that is, excluding layers such as IRC over SSL/TLS) work the same way. The IRC client, or some plugin, or some other software between the client and the IRC server, rewrites the content of the PRIVMSG (or NOTICE) messages before they are sent to the server.

For example, the message:

PRIVMSG bob :Hi bob!\r\n

is rewritten to, using one common method described shortly:

PRIVMSG bob :+OK BRurM1bWPZ1.\r\n

before being sent to the server. (Note that the characters between the colon and the line break are not interpreted by the IRC protocol itself.)

Bob’s software then decrypts this message and the client displays it on the screen, assuming of course Bob knows the details needed for decryption, such as keys.

From here on in the article, the message content (as it would be sent without encryption) will be referred to as the “plaintext”, and the replacement will be referred to as “ircmessage”. In the example above:

plaintext = "Hi bob!"
ircmessage = "+OK BRurM1bWPZ1."
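The rewriting step itself is trivial; here is a sketch (rewrite_privmsg and the pack callback are my own illustration, not part of any of the plugins mentioned):

```python
def rewrite_privmsg(line, pack):
    """Rewrite 'PRIVMSG target :plaintext' so the trailing parameter
    carries the packed ciphertext instead. pack is any function mapping
    plaintext -> ircmessage (e.g. a blowcrypt or Mircryption packer)."""
    prefix, _, plaintext = line.partition(" :")
    return prefix + " :" + pack(plaintext)
```

Usage: rewrite_privmsg("PRIVMSG bob :Hi bob!", pack) yields "PRIVMSG bob :+OK BRurM1bWPZ1." when pack produces that ciphertext.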

Blowfish-ECB (”blowcrypt”)

Probably the most common software for encrypting IRC is Fish, which is a plugin for several IRC clients. Fish implements an encryption method originally used in a program called blowcrypt. This method is supported by several other plugins and clients, such as Mircryption.

blowcrypt users (Alice and Bob talking privately, or a whole channel) agree on a password used as key. Messages are then encrypted with Blowfish using this key, and formatted in a special way before being sent to the server.

There are two points of interest here. First, Blowfish is used in ECB mode, which is in effect no mode at all: each message is split into 8-byte blocks, each encrypted individually. Second, a block shorter than 8 bytes is padded with zeroes.

Because of this design decision, blowcrypt encryption is very easy to implement for developers working on IRC clients. The shortcoming is weaker security: for example, the message "Hi" encrypts to the same string every time for a given key.
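The splitting and zero-padding can be sketched as follows (blowcrypt_blocks is my own helper name; each returned block would be fed to Blowfish individually):

```python
def blowcrypt_blocks(plaintext):
    """Split a message into 8-byte blocks, zero-padding the last one,
    as blowcrypt does before ECB-encrypting each block on its own."""
    data = plaintext.encode()
    blocks = [data[i:i + 8] for i in range(0, len(data), 8)]
    if blocks and len(blocks[-1]) < 8:
        blocks[-1] = blocks[-1].ljust(8, b"\x00")
    return blocks
```

Since there is no IV and no chaining, identical blocks always produce identical ciphertext under the same key, which is exactly the weakness described above.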

Before the ciphertext is sent to the server, it is formatted like so:

ircmessage = "+OK " || BlowcryptBase64(Blowfish(key, plaintext))

Where || denotes concatenation and BlowcryptBase64 is a non-standard base64 implementation. The prefix "+OK " is used as an identifier for the software handling the decryption. A conversation may look like this, where the string "password" is used as the key:

PRIVMSG bob :Hi bob!\r\n

is rewritten to

PRIVMSG bob :+OK BRurM1bWPZ1.\r\n

Here’s some Python code.

>>> b = Blowfish("password")
>>> blowcrypt_pack("Hi bob!", b)
'+OK BRurM1bWPZ1.'
>>> blowcrypt_unpack(_, b)
'Hi bob!'

Blowfish-CBC (“Mircryption”)

An obvious shortcoming of blowcrypt is the use of ECB mode. The Mircryption plugin, which supports several common IRC clients, supports a new encryption method using Blowfish in CBC mode. (For more information about cryptographic modes in general and the CBC in particular, see the TrueCrypt article on this website).

For each message an 8 byte IV is randomly generated. The message is encrypted with the key and the IV. The message sent to the server is

ircmessage = "+OK *" || Base64(IV || BlowfishCBC(key, IV, plaintext))

Here Base64 is the standard MIME base64 implementation. Apart from the shared use of Blowfish, this encryption method is different from, and better than, blowcrypt in most ways.

Here’s some Python code:

>>> b = BlowfishCBC("keyTest")
>>> mircryption_cbc_pack("Hello world!", b)
'+OK *5RQreHBF54PH3wFxsFmf2o1i6dh5ykeA'
>>> mircryption_cbc_unpack(_, b)
'Hello world!'
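For readers curious what the packing does internally, here is a rough sketch of the format: a random 8-byte IV, CBC chaining, and standard base64 with the “+OK *” prefix. The `toy_block_encrypt` stand-in and the function names are mine, not Mircryption's:

```python
import base64
import os

def toy_block_encrypt(key, block):
    # Stand-in for Blowfish; any 8-byte block cipher fits here.
    return bytes(b ^ k for b, k in zip(block, key))

def cbc_encrypt(key, iv, plaintext):
    # CBC: XOR each plaintext block with the previous ciphertext
    # block (the IV for the first block) before encrypting it.
    if len(plaintext) % 8:
        plaintext += b"\x00" * (8 - len(plaintext) % 8)
    prev, out = iv, b""
    for i in range(0, len(plaintext), 8):
        mixed = bytes(p ^ c for p, c in zip(plaintext[i:i + 8], prev))
        prev = toy_block_encrypt(key, mixed)
        out += prev
    return out

def mircryption_cbc_pack_sketch(key, plaintext):
    iv = os.urandom(8)
    ct = cbc_encrypt(key, iv, plaintext.encode())
    return "+OK *" + base64.b64encode(iv + ct).decode()

msg = mircryption_cbc_pack_sketch(b"keyTest\x00", "Hello world!")
assert msg.startswith("+OK *")
```

Because the IV is random per message, encrypting the same plaintext twice gives different ciphertexts, unlike ECB.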

Diffie-Hellman key exchange (“DH1080”)

A problem for users of blowcrypt, Fish, Mircryption or similar software is handling the keys. It is often cumbersome to establish a secret key with the person you want to talk to securely.

The Fish developers have solved this with the software DH1080, which is an implementation of the Diffie-Hellman key exchange for IRC. Diffie-Hellman is a cryptographic protocol where two parties establish a secret key over an insecure channel, thus getting rid of the need for users to remember/establish the key themselves.

Alice, using DH1080, instructs the software to establish a key with Bob. DH1080 then automatically does the necessary computations and communications with Bob to establish a secret key. When everything is done Alice and Bob both know a secret key that can be used by Fish (or other software).

DH1080 can only be used for “private message” conversations. Due to constraints both from the Diffie-Hellman protocol itself and how IRC works, it is not feasible to use DH1080 for establishing a key for a whole channel. So if you want to encrypt messages to a channel, you’ll have to stick with a pre-determined key or use the proposed method discussed later in this article.

That said, DH1080 is a standard Diffie-Hellman implementation, with few surprises. The IRC specific part is how the message is formatted and sent. Using the terminology from the Diffie-Hellman Wikipedia article:

1. Alice and Bob agree to use a prime number p (see below) and a base g = 2.

2. Alice chooses a secret integer 2 < a < p-1 as private key, computes her public key A = g^a mod p and sends Bob, in a NOTICE message:

ircmessage = "DH1080_INIT " || DH1080Base64(IntAsBytes(A))

3. Bob chooses a secret integer 2 < b < p-1 as private key, computes the public key B = g^b mod p and sends Alice, in a NOTICE message:

ircmessage = "DH1080_FINISH " || DH1080Base64(IntAsBytes(B))

4. Alice checks that B is valid and if it is computes the secret key s = B^a mod p. B is valid if 1 < B < p.

5. Bob checks that A is valid and if it is computes the secret key s = A^b mod p. A is valid if 1 < A < p.

Here, DH1080Base64 is another non-standard base64 implementation. IntAsBytes is a variable-length, big-endian representation of an integer. For example, the integer 0xabbcc is represented as the three-byte string "\x0a\xbb\xcc".
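IntAsBytes is simple enough to write out in full. A minimal Python sketch (the function name is mine):

```python
def int_as_bytes(n):
    # Variable-length, big-endian encoding of a non-negative integer:
    # peel off the least significant byte until nothing remains.
    out = b""
    while n:
        out = bytes([n & 0xFF]) + out
        n >>= 8
    return out or b"\x00"

assert int_as_bytes(0xabbcc) == b"\x0a\xbb\xcc"
```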

A point of interest: DH1080 uses a curious 1080 bit prime number p in the default implementation. This prime number is constructed in such a way that it yields a message in English when base64 encoded.

In base 16:

p = 
FBE1022E23D213E8ACFA9AE8B9DFADA3EA6B7AC7A7B7E95AB5EB2DF858921
FEADE95E6AC7BE7DE6ADBAB8A783E7AF7A7FA6A2B7BEB1E72EAE2B72F9FA2
BFB2A2EFBEFAC868BADB3E828FA8BADFADA3E4CC1BE7E8AFE85E9698A783E
B68FA07A77AB6AD7BEB618ACF9CA2897EB28A6189EFA07AB99A8A7FA9AE29
9EFA7BA66DEAFEFBEFBF0B7D8B

And when the prime is represented as bytes, coded with the non-standard base64 implementation:

++ECLiPSE+is+proud+to+present+latest+FiSH+release+featuring+e
ven+more+security+for+you+++shouts+go+out+to+TMG+for+helping+
to+generate+this+cool+sophie+germain+prime+number++++/C32L

While the statement is arguably incorrect – p is a safe prime, and it is (p-1)/2 that is the Sophie Germain prime – this doesn’t affect the security of the key exchange, as the prime chosen satisfies all the desired properties of a “strong” prime.

By RFC 2631, a chosen prime is considered secure if it can be written as p = jq + 1, where q is a large prime and j >= 2. In this case, j = 2 by construction and both p and q = (p-1)/2 are prime, so the condition holds.

Here’s some Python code:

>>> alice = DH1080Ctx()
>>> bob = DH1080Ctx()
>>> dh1080_pack(alice)
'DH1080_INIT qStH1LjBpb47s0XY80W9e3efrVSh2Qfq19291XAuwa7C9UFvW0sYY
424FOS6JNVsoYVH85lj6oPkr8w3KRZDDqVoV+7yCVtLmhCcC3dHyz4Ynbe93HEtR3n
26+Q1dWPm+JgZEGSYnhNunk7FOqsXFUR/2O9dkbpnxZDh2UFFmg0uGbukgG+FA'
>>> dh1080_unpack(_, bob)
True
>>> dh1080_pack(bob)
'DH1080_FINISH mjyk//fqPoEwp5JfbJtzDmlfpzmtMEFw5Ueyk51ydAjXBd8cjqz
m5oW9V0/VTk8ag0DzzKw8+9C6hPotvonaI8PwkSplHlDHjGJgcpirIn7C07jHw2WEt
NUpZtyz2UNT/RStGLe/s+4MrmoUC6vOaIZMUu7sgCXyUbDfYk5QCcbQVRx6rH+hA'
>>> dh1080_unpack(_, alice)
True
>>> dh1080_secret(alice)
'tfu4Qysoy56OYeckat1HpJWzi+tJVx/cm+Svzb6eunQ'
>>> dh1080_secret(bob)
'tfu4Qysoy56OYeckat1HpJWzi+tJVx/cm+Svzb6eunQ'

Misc.

Other than the two Blowfish based solutions mentioned, there exist several other, less common, methods. Here are two of them.

PsyBNC

PsyBNC is a so-called bouncer, a kind of IRC proxy that offers several features. One of these features is encryption. The encryption is very similar to Blowfish-ECB; however, instead of Base64 another serialization method is used.

During the code review, an innocent but dangerous implementation error was discovered in PsyBNC's Blowfish code. Until this problem is fixed, users are advised to switch to a different encryption method. Cryptanalysis is available here, courtesy of David Wagner: sci.crypt: Strength of Blowfish with broken F-function?.

OTR

OTR is worth mentioning not because it’s very common, but because it should be. OTR, or Off-the-Record Messaging, is designed for encrypting instant messaging-style conversations and offers several desirable features for these kinds of conversations.

While OTR cannot encrypt IRC channels due to technical reasons, it is ideal for private message conversations. An OTR plugin currently exists for irssi.

Summary of available methods

There are two different cases to consider when choosing a solution for encrypting IRC conversations.

For “private message” conversations IRC does not severely limit the available choices, so more options are available:

Blowfish-ECB should, due to the ECB mode, not be used at all.

Blowfish-CBC and a shared password: If the two users can keep the shared password secret, this method is decent in its simplicity. The fact that the users share a secret offers some authentication. A problem arises if the password is compromised: then all previous conversations are available to the attacker.

Blowfish-CBC with unauthenticated key exchange: This solution has the benefit of getting rid of the need for the two users to remember/establish a shared secret. Also, as the key can be constantly renewed, perfect forward secrecy is achieved. There are, however, several important shortcomings. Diffie-Hellman alone can be attacked by a man in the middle – in fact it would be trivial for a server admin to automatically perform this attack when a DH1080 exchange takes place on his server. Adding some sort of authentication solves this, however this may require more involvement from the users.

OTR: This is the ideal solution; in fact it was designed for instant message-style conversations. Right now the only problem is lack of implementations. As of January 2009, there exists only one OTR plugin, for irssi (while irssi is popular, mIRC is probably even more so).

For encrypting a whole channel, there aren’t as many reasonable choices. Of the methods described above, Blowfish-CBC with a shared password is recommended. A better method will be described shortly.

All of these methods can be combined with Tor or similar software to achieve anonymity, remembering that the Tor exit node can, like the IRC server admin, attack the DH key exchange.

Future improvements

There are currently no known attacks on Blowfish, so replacing this block cipher with something newer (AES, Twofish) is not necessarily an improvement – of course it doesn’t hurt. A more considerable improvement would be using a cryptographic protocol that somehow solves all the following problems:

  • Perfect forward secrecy.
  • Authentication.
  • Resistance against the attacks mentioned above (e.g. MITM).
  • A compromised password should do as little damage as possible.

With the following constraints

  • Should be easy to use for end users, i.e. not requiring significant public key infrastructure.
  • Should be applicable to group conversations, i.e. IRC channels.
  • Should be reasonably straightforward to implement.
  • Should not abuse the IRC network (this rules out solutions where all conversations take place in private messages, but client software rewrites the recipient giving the appearance of group conversation).

Here is my proposal to the IRC community. Comments are welcome.

IRCSRP version 1.1

February 12 update: The IRCSRP proposal has been updated to version 2.0, adding message authentication. This blog post has NOT been updated; for version 2.0 see the IRCSRP page.

This new method of IRC encryption is based on the “optimized” SRP-6, the Secure Remote Password Protocol. It is described in detail here: http://srp.stanford.edu/doc.html (the first paper is especially readable). SRP is a protocol for password-based authentication between a user and a host, where the user only has to remember a simple password. No public key infrastructure is necessary.

The protocol as described in this article has been adapted for IRC usage.

Sample setup

Alice, Bob, Carol and Dave talk in a channel #friends on a public IRC network. Dave is the most technical user of the four, and will be given a special purpose.

Overview

Instead of everyone on the channel sharing a password together, each one of Alice, Bob and Carol share a so-called verifier with Dave.

The basic idea of the IRC adapted protocol is that each user does an authenticated key exchange with Dave. Alice and Dave share knowledge of a password (Alice knows her password, Dave has the verifier), which is used for authentication. If successful, Dave sends the current session key to Alice, which is used for decrypting the content of the channel.

Once in a while Dave generates a new session key and broadcasts it to everyone on the channel. Thus forward secrecy is achieved.

Details – channel encryption

The messages sent in the actual channel (#friends in our example) are:

ircmessage = "+aes " || Base64(IV || AESCBC(sessionkey, IV, "M" || info || plaintext))

Here Base64 is the standard MIME base64 implementation with padding characters left intact. AESCBC is AES in CBC-mode. sessionkey is a 256 bit key randomly generated by Dave. info is a short information string:

info = len(username) || username || timestamp

username is a string, also known as 'I'. It will be described shortly. len(username) is a single byte telling the length of the string. timestamp is a 32 bit Unix timestamp, represented in big endian.
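The info string is straightforward to build with the struct module. A minimal sketch (the function name is mine):

```python
import struct

def make_info(username, timestamp):
    # len(username) || username || 32-bit big-endian Unix timestamp
    name = username.encode()
    return bytes([len(name)]) + name + struct.pack(">I", timestamp)

info = make_info("alice", 1232899907)
assert info == b"\x05alice" + struct.pack(">I", 1232899907)
```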

As the IRC protocol is limited to messages 512 characters in length (this includes the whole message, including the PRIVMSG command) client software should split plaintext into parts if the complete IRC message gets too long.

When Dave broadcasts a new, updated session key, the message is:

ircmessage = "+aes " || Base64(IV || AESCBC(old_sessionkey, IV, "\xffKEY" || new_sessionkey))

Details – the authenticated key exchange (preparations)

The interesting part of the setup is establishing the session key for a user who doesn’t already know it. Here are some constants we are going to use, using the same terminology as the SRP paper.

N = The prime number from the 2048-bit MODP Group as described in RFC 3526.
g = 2
H = the SHA-256 hash algorithm. Depending on context, the hash is either a 32-byte string or said string interpreted as a 256 bit big-endian integer.

Before the key exchange, Alice and Dave have to share some information.

1) Alice selects a username I and a password P. The username should be constant and not derived from Alice's IRC nickname or host.

2) Alice generates a random salt s then computes the verifier v:

s = random 256 bit integer
x = H(s || I || P)
v = g^x (mod N)

3) Alice gives Dave s and v, which he stores together with Alice's username I. From now on, Alice only has to remember I and P. She can and should discard s, x and v.

It is very important that Alice gives s and v to Dave in person, or through an existing authenticated secure channel (such as a GPG encrypted e-mail).
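The verifier generation maps directly to Python. In this sketch a toy prime stands in for the 2048-bit MODP prime, and the helper names are mine:

```python
import hashlib
import random

N = 2 ** 61 - 1   # toy prime; the real N is the 2048-bit MODP prime
g = 2

def H_int(data):
    # H as an integer: SHA-256 digest read as a big-endian number.
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def generate_verifier(I, P):
    s = random.getrandbits(256)                  # random 256 bit salt
    x = H_int(s.to_bytes(32, "big") + I.encode() + P.encode())
    v = pow(g, x, N)                             # the verifier
    return s, v

s, v = generate_verifier("alice", "passw")
assert 0 < v < N
```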

Details – the authenticated key exchange

The key exchange works as follows, where all ircmessage are sent in a NOTICE:

1) Alice sends Dave her username I. This initiates the exchange.

ircmessage = "+srpa0 " || I

2) Dave looks up Alice's information (s, v), computes and sends:

b = random integer with 1 < b < N.
B = 3v + g^b (mod N)
ircmessage = "+srpa1 " || Base64(s || B)

3) Alice asserts B != 0 (mod N), then computes and sends:

a = random integer with 1 < a < N.
A = g^a (mod N)
x = H(s || I || P)
u = H(A || B)
S = (B - 3g^x)^(a + ux) (mod N)
K = H(S)
M1 = H(A || B || S)
ircmessage = "+srpa2 " || Base64(M1 || IntAsBytes(A))

4) Dave verifies M1, then if Alice is trusted, computes and sends:

u = H(A || B)
S = (Av^u)^b (mod N)
K = H(S)
M2 = H(A || M1 || S)
ircmessage = "+srpa3 " || Base64(IV || AESCBC(K, IV, sessionkey || M2))

5) Alice verifies M2, and decrypts the session key using K. If the verification holds, then Dave and the session key are trusted.

Exactly why and how this works is described in the SRP paper.
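To see that the math works out, the exchange can be run numerically. In this sketch a toy prime replaces the 2048-bit N, and `Hi` and the stand-in for H(s || I || P) are my own simplifications, not part of the proposal:

```python
import hashlib
import random

N = 2 ** 61 - 1   # toy prime; the real N is the 2048-bit MODP prime
g = 2

def Hi(*parts):
    # H over the concatenation of fixed-width big-endian encodings,
    # interpreted as an integer (a simplification of the real H).
    data = b"".join(p.to_bytes(32, "big") for p in parts)
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

x = Hi(12345)                      # stands in for H(s || I || P)
v = pow(g, x, N)                   # Alice's verifier, stored by Dave

b = random.randrange(2, N)         # step 2: Dave
B = (3 * v + pow(g, b, N)) % N

a = random.randrange(2, N)         # step 3: Alice
A = pow(g, a, N)
u = Hi(A, B)
S_alice = pow((B - 3 * pow(g, x, N)) % N, a + u * x, N)

S_dave = pow(A * pow(v, u, N) % N, b, N)   # step 4: Dave

assert S_alice == S_dave           # both sides agree on S, so K = H(S)
```

Both sides end up with g^(b(a + ux)) mod N: Alice strips 3g^x from B to recover g^b, while Dave combines A with v^u.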

Problems

The bottleneck in the protocol as described is the dependence on Dave. There are two notable problems to consider:

Problem 1: What happens when Dave is off-line

The protocol as described is better suited for medium-sized groups than for very small groups of friends. For the case with 10 or more users, it is seldom a problem in practice to find one or two users with good enough uptime to act as Dave.

The problem is for smaller groups, such as the sample setup described above. Assume all four users are talking to each other. Carol then disconnects, and later Dave changes the session key for the channel. Due to network trouble Dave then disconnects from the network. While Alice and Bob can still talk to each other, Carol can no longer decrypt the messages. With Dave offline, she can’t get the session key either.

Solution 1:

There are several potential solutions for this problem. Unfortunately most of the obvious ones (such as distributing Dave’s task) add complexity to the protocol, so that users no longer only have to remember their password.

Until this problem is solved, the users are recommended to fall back on a simpler encryption method.

Problem 2: Net splits

A moderately common problem with IRC networks is so called net splits, where two or more IRC servers can no longer communicate.

Assume Alice and Bob are on one server, and Carol and Dave on another. The two servers split. After this has happened, Dave changes the key. When the two servers rejoin, Alice and Bob can no longer understand Carol and Dave. In this case, they will initiate a key exchange with Dave.

When several users simultaneously perform the SRP exchange with Dave, his CPU usage may spike for an unacceptable amount of time, depending on how fast his protocol implementation is and the number of users.

Solution 2:

There are several possible solutions to this problem:

1) Dave changes the session key manually instead of automatically. In this case he can avoid changing the key during a net split.

2) The software implementation adds some kind of net split detection, and doesn’t change the session key when the network is considered unstable.

3) The problem is ignored – several users exchanging keys with Dave simultaneously is not considered a problem.

Implementation details

While the key exchange has slightly more steps than a simple Diffie-Hellman exchange such as the one used by DH1080, all individual steps are quite easy to implement given a big integer library, a base64 implementation and code for AES, CBC and SHA256. In the Python programming language, the whole key exchange can be performed in about two hundred lines (this implementation is quite slow though).

There are other considerations for implementations of the protocol, especially concerning automation of the key exchange.

1) Alice joins #friends and tells the IRC client her password. This is the only user intervention required.

2) The implementation randomly picks a Dave and initiates the exchange.

3) From here on everything happens automatically, as required.

For the second step, the implementation may look for users prefixed as channel operators, or each user saves nicknames for the Daves in a configuration file.

Code

Here’s some Python code, demonstrating Dave's setup, Alice's setup and the key exchange itself.

>>> s, v = ircsrp_generate("alice", "passw")
>>> dave = IRCSRPCtx(True)
>>> dave.users.db["alice"] = (s, v)
>>> dave.sessionkey = urandom(32)
>>> dave.cipher = AESCBC(dave.sessionkey)
>>>
>>> alice = IRCSRPCtx()
>>> alice.username = "alice"
>>> alice.password = "passw"
>>> 
>>> ircsrp_exchange(alice)
'+srpa0 alice'
>>> ircsrp_exchange(dave, _, "alicenick")
'+srpa1 AU/DMrnF/JrccLBs4EKDW4U4fJHafqvAwIsOxTiI84Z9oeisZlO6D1a
XTuUXeslI/4957Q1kUtJURzjulRSk43bZSeDfE90u/WvcfD5ayh/d7owdaVAsh/
eN6p7dJFSqWEh0Rd8xD1FM+w/SZ+6MF+USybPAe+MFg0cWmJx2UDryvWd0y0UiG
RNvVVdtWOS5TojJ7Psk+4wXrPq+1wz18+h6C/LbElJKezhsu2H02kdpfT9vSJ+7
bQBrfgWqr7QASakO+6I3BuyuVaIirVLrX1XHSwXE3Kt91TQAHY7clLXiYrVOOs1
E27VY8VJ7imXD3tlbx1TVzJRNehrWKPqrn3Wl42PxDN62fG/NisN/YeC6hiTivs
6ivwA6btsWHX9h'
>>> ircsrp_exchange(alice, _)
'+srpa2 6kSQvvZnqioVmLMsrNG0/CPFOnMW6qutgOOHCLCPJJBvHJtbjy2Q0Ee
Al0AAJdymGXyLvADHpDt5UgVLn0TtGLEc86wx+Shf/bdXAsOHaZrRm/P8csN2yy
t4CPSTr/oIGP8bLZIVcVOjZHwPX2uF9tmOE9o9xotLzPJVbEbZoCC/uvvMjQRch
LgW4tg6gknpI5U/noi0w4cMA7OY9HnBRXgGjTbrMIm00bkE3v7YWZ0CXBZX9W7q
F2B28qJauT5l6PL3o/PAgw3I5MP28/eowWGEjNf6W1wWqmE1q2oXS88xeCEAQ6A
mU0qFiHrWBFMP04Um9WNKAuNR9e8p/zGink/nCWX6JmjbApJRbHTjx3nNa7qCCN
3Fg/QsA+uJNkbA'
>>> ircsrp_exchange(dave, _, "alicenick")
'+srpa3 KLSFN+yqid9NXBzJEydFTAJm+9U5dZcbNIGr8sjzrLoOvSWn70H652D
gU/IA8a7PlDiL3YzJOTI/mk1C1x+M8WG8cjbsx8axdlmCvFC/xpk='
>>> ircsrp_exchange(alice, _)
*** Session key is: '\xce\xa3k\xa3\x03\xfdx$=\xd1\xf1P\xa4Cw
\x16\\\xaaw!d\x9f]|\xddl\xe4q\x15%\xfcI'
True
>>> dave.sessionkey
'\xce\xa3k\xa3\x03\xfdx$=\xd1\xf1P\xa4Cw\x16\\\xaaw!d\x9f]|\xddl
\xe4q\x15%\xfcI'
>>>
>>> ircsrp_pack(alice, "Hello everyone!")
'+aes zNRmWAM1WxOedS0twJVOIoBTQKbh/7c5GzgHRnfL+KbSrLCGTLjW/3yvV
AoiPMWh'
>>> ircsrp_unpack(dave, _)
*** Sent by username: alice
*** Sent at time: Sun Jan 25 16:11:47 2009
'Hello everyone!'

Conclusion

Of the available methods for encrypting IRC, this author recommends the Blowfish-CBC method for encrypting channels, and OTR for private messages (in which case it may be easier to use instant messaging instead of IRC). For protection only against passive eavesdroppers, Blowfish-CBC+DH1080 can be used instead of OTR.

The proposed IRCSRP method may not be applicable for all usage scenarios, but when it is, the authentication and session key based approach should give considerably stronger security than a shared password.

The code in this article is available here: irccrypt.py. It is under the free OpenBSD license, and can be used for pretty much any purpose.

Wednesday, October 01, 2008

Let’s build an MP3-decoder!

Even though MP3 is probably the single most well known file format and codec on Earth, it’s not very well understood by most programmers – for many, encoders/decoders are in the class of software “other people” write, like standard libraries or operating system kernels. This article will attempt to demystify the decoder, with short top-down primers on signal processing and information theory when necessary. Additionally, a small but not full-featured decoder will be written (in Haskell), suited to playing around with.

The focus of this article is on concepts and the design choices the MPEG team made when they designed the codec – not on uninteresting implementation details or heavy theory. Some parts of a decoder are quite arcane and are better understood by reading the specification, a good book on signal processing, or the many papers on MP3 (see references at the end).

A note on the code: The decoder accompanying this article is written for readability, not speed. Additionally, some unusual features have been left out. The end result is a decoder that is inefficient and not standards compliant, but with hopefully readable code. You can grab the source here: mp3decoder-0.0.1.tar.gz. Scroll down to the bottom of the article or see README for build instructions.

A fair warning: The author is a hobby programmer, not an authority in signal processing. If you find an error, please drop me an e-mail. be@bjrn.se

With that out of the way, we begin our journey with the ear.

Human hearing and psychoacoustics

The main idea of MP3 encoding, and lossy audio coding in general, is removing acoustically irrelevant information from an audio signal to reduce its size. The job of the encoder is to remove some or all information from a signal component, while at the same time not changing the signal in such a way that audible artifacts are introduced.

Several properties (or “deficiencies”) of human hearing are used by lossy audio codecs. One basic property is that we can’t hear above 20 kHz or below 20 Hz, approximately. Additionally, there’s a threshold of hearing – once a signal is below a certain threshold it can’t be heard; it’s too quiet. This threshold varies with frequency; a 20 Hz tone can only be heard if it’s stronger than around 60 decibels, while frequencies in the region 1-5 kHz can easily be perceived at low volume.

A very important property affecting the auditory system is known as masking. A loud signal will “mask” other signals sufficiently close in frequency or time; meaning the loud signal modifies the threshold of hearing for spectral and temporal neighbors. This property is very useful: not only can the nearby masked signals be removed; the audible signal can also be compressed further as the noise introduced by heavy compression will be masked too.

This masking phenomenon happens within frequency regions known as critical bands – a strong signal within a critical band will mask frequencies within the band. We can think of the ear as a set of band pass filters, where different parts of the ear pick up different frequency regions. An audiologist or acoustics professor would have plenty to say about critical bands and the subtleties of masking effects; however, in this article we are taking a simplified engineering approach: for our purpose it’s enough to think of these critical bands as fixed frequency regions where masking effects occur.

Using the properties of the human auditory system, lossy codecs and encoders remove inaudible signals to reduce the information content, thus compressing the signal. The MP3 standard does not dictate how an encoder should be written (though it assumes the existence of critical bands), and implementers have plenty of freedom to remove content they deem imperceptible. One encoder may decide a particular frequency is inaudible and should be removed, while another encoder keeps the same signal. Different encoders use different psychoacoustic models, models describing how humans perceive sounds and thus what information may be removed.

About MP3

Before we begin decoding MP3, it is necessary to understand exactly what MP3 is. MP3 is a codec formally known as MPEG-1 Audio Layer 3, and it is defined in the MPEG-1 standard. This standard defines three different audio codecs, where layer 1 is the simplest and has the worst compression ratio, while layer 3 is the most complex but has the highest compression ratio and the best audio quality per bit rate. Layer 3 is based on layer 2, in turn based on layer 1. All three codecs share similarities and have many encoding/decoding parts in common.

The rationale for this design choice made sense back when the MPEG-1 standard was first written, as the similarities between the three codecs would ease the job for implementers. In hindsight, building layer 3 on top of the other two layers was perhaps not the best idea. Many of the advanced features of MP3 are shoehorned into place, and are more complex than they would have been if the codec was designed from scratch. In fact, many of the features of AAC were designed to be “simpler” than the counterpart in MP3.

At a very high level, an MP3 encoder works like this: An input source, say a WAV file, is fed to the encoder. There the signal is split into parts (in the time domain), to be processed individually. The encoder then takes one of the short signals and transforms it to the frequency domain. The psychoacoustic model removes as much information as possible, based on the content and phenomena such as masking. The frequency samples, now with less information, are compressed in a generic lossless compression step. The samples, as well as parameters describing how the samples were compressed, are then written to disk in a binary file format.

The decoder works in reverse. It reads the binary file format, decompresses the frequency samples, reconstructs the samples based on information about how content was removed by the model, and then transforms them to the time domain. Let’s start with the binary file format.

Decoding, step 1: Making sense of the data

Many computer users know that an MP3 is made up of several “frames”, consecutive blocks of data. While important for unpacking the bit stream, frames are not fundamental and cannot be decoded individually. In this article, what is usually called a frame we call a physical frame, while we call a block of data that can actually be decoded a logical frame, or simply just a frame.

A logical frame has many parts: it has a 4 byte header easily distinguishable from other data in the bit stream, it has 17 or 32 bytes known as side information, and a few hundred bytes of main data.

A physical frame has a header, an optional 2 byte checksum, side information, but only some of the main data except in very rare circumstances. The screenshot below shows a physical frame as a thick black border, the frame header as 4 red bytes, and the side information as blue bytes (this MP3 does not have the optional checksum). The grayed out bytes are the main data that corresponds to the highlighted header and side information. The header for the following physical frame is also highlighted, to show that the header always begins at offset 0.

The very first thing we do when we decode the MP3 is to unpack the physical frames to logical frames – this is a means of abstraction; once we have a logical frame we can forget about everything else in the bit stream. We do this by reading an offset value in the side information that points to the beginning of the main data.

Why is the main data for a logical frame not kept within the physical frame? At first this seems unnecessarily clumsy, but it has some advantages. The length of a physical frame is constant (within a byte) and solely based on the bit rate and other values stored in the easily found header. This makes seeking to arbitrary frames in the MP3 efficient for media players. Additionally, as frames are not limited to a fixed size in bits, parts of the audio signal with complex sounds can use bytes from preceding frames, in essence giving all MP3s variable bit rate.

There are some limitations though: a frame can save its main data in several preceding frames, but not in following frames – this would make streaming difficult. Also, the main data for a frame cannot be arbitrarily large, and is limited to about 500 bytes. This limit is fairly short, and is often criticized.

The perceptive reader may notice the gray main data bytes in the image above begin with an interesting pattern (3E 50 00 00…) that resembles the first bytes of the main data in the next logical frame (38 40 00 00…). There is some structure in the main data, but usually this won’t be noticeable in a hex editor.

To work with the bit stream, we are going to use a very simple type:

data MP3Bitstream = MP3Bitstream {
    bitstreamStream :: B.ByteString,
    bitstreamBuffer :: [Word8]
}

Where the ByteString is the unparsed bit stream, and the [Word8] is an internal buffer used to reconstruct logical frames from physical frames. Not familiar with Haskell? Don’t worry; all the code in this article is only complementary.

As the bit stream may contain data we consider garbage, such as ID3 tags, we are using a simple helper function, mp3Seek, which takes the MP3Bitstream and discards bytes until it finds a valid header. The new MP3Bitstream can then be passed to a function that does the actual physical to logical unpacking.

mp3Seek :: MP3Bitstream -> Maybe MP3Bitstream
mp3UnpackFrame :: MP3Bitstream -> (MP3Bitstream, Maybe MP3LogicalFrame)

The anatomy of a logical frame

When we’re done decoding proper, a logical frame will have yielded us exactly 1152 time domain samples per channel. In a typical PCM WAV file, storing these samples would require 2304 bytes per channel – more than 4½ KB in total for a stereo signal. While a large part of the compression from 4½ KB audio to a 0.4 KB frame stems from the removal of frequency content, a not insignificant contribution is thanks to a very efficient binary representation.

Before that, we have to make sense of the logical frame, especially the side information and the main data. When we’re done parsing the logical frame, we will have compressed audio and a bunch of parameters describing how to decompress it.

Unpacking the logical frame requires some information about the different parts. The 4-byte header stores some properties about the audio signal, most importantly the sample rate and the channel mode (mono, stereo etc). The information in the header is useful both for media player software and for decoding the audio. Note that the header does not store many parameters used by the decoder, e.g. how audio samples should be reconstructed; those parameters are stored elsewhere.
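A hedged sketch of reading those header fields, in Python for brevity (the decoder itself is Haskell). The tables are abbreviated to the MPEG-1 Layer III cases, and the version/layer bits are not checked here; treat this as an illustration, not a compliant parser:

```python
def parse_header_sketch(header):
    # 4-byte header: 11 sync bits, then version/layer/protection,
    # then bitrate index, sample rate index and channel mode.
    bits = int.from_bytes(header, "big")
    assert bits >> 21 == 0x7FF, "no sync word"
    bitrates = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                128, 160, 192, 224, 256, 320, None]   # kbit/s
    samplerates = [44100, 48000, 32000, None]         # Hz
    modes = ["stereo", "joint stereo", "dual channel", "mono"]
    return (bitrates[(bits >> 12) & 0xF],
            samplerates[(bits >> 10) & 0x3],
            modes[(bits >> 6) & 0x3])

# 0xFFFB9000 is a common header: 128 kbit/s, 44100 Hz, stereo.
assert parse_header_sketch(b"\xff\xfb\x90\x00") == (128, 44100, "stereo")
```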

The side information is 17 bytes for mono, 32 bytes otherwise. There’s lots of information in the side info. Most of the bits describe how the main data should be parsed, but there are also some parameters saved here used by other parts of the decoder.

The main data contains two “chunks” per channel, which are blocks of compressed audio (and corresponding parameters) decoded individually. A mono frame has two chunks, while a stereo frame has four. This partitioning is cruft left over from layer 1 and 2. Most new audio codecs designed from scratch don’t bother with this partitioning.

The first few bits of a chunk are the so-called scale factors – basically 21 numbers, which are used for decoding the chunk later. The reason the scale factors are stored in the main data and not in the side information, like many other parameters, is that the scale factors take up quite a lot of space. How the scale factors should be parsed, for example how long a scale factor is in bits, is described in the side information.

Following the scale factors is the actual compressed audio data for this chunk. These are a few hundred numbers, and take up most of the space in a chunk. These audio samples are actually compressed in a sense many programmers may be familiar with: Huffman coding, as used by zip, zlib and other common lossless data compression methods.

The Huffman coding is actually one of the biggest reasons an MP3 file is so small compared to the raw audio, and it’s worth investigating further. For now let’s pretend we have decoded the main data completely, including the Huffman coded data. Once we have done this for all four chunks (or two chunks for mono), we have successfully unpacked the frame. The function that does this is:

mp3ParseMainData :: MP3LogicalFrame -> Maybe MP3Data

Where MP3Data store some information, and the two/four parsed chunks.

Huffman coding

The basic idea of Huffman coding is simple. We take some data we want to compress, say a list of 8 bit characters. We then create a value table where we order the characters by frequency. If we don’t know beforehand how our list of characters will look, we can order the characters by probability of occurring in the string. We then assign code words to the value table, where we assign the short code words to the most probable values. A code word is simply an n-bit integer designed in such a way there are no ambiguities or clashes with shorter code words.

For example, let’s say we have a very long string made up of the letters A, C, G and T. Being good programmers, we notice it’s wasteful to save this string as 8 bit characters, so we store them with 2 bits each. Huffman coding can compress the string further, if some of the letters are more frequent than others. In our example, we know beforehand ‘A’ occurs in the string with about 40% probability. We create a frequency table:

A    40%
C    35%
G    20%
T     5%

We then assign code words to the table. This is done in a specific way – if we pick code words at random we are not Huffman coding anymore but using a generic variable-length code.

A    0
C    10
G    110
T    111

Say we have a string of one thousand characters. If we save this string in ASCII, it will take up 8000 bits. If we instead use our 2-bit representation, it will only take 2000 bits. With Huffman coding, however, we can save it in only about 1850 bits: the expected code word length is 0.40·1 + 0.35·2 + 0.20·3 + 0.05·3 = 1.85 bits per character.

Decoding is the reverse of coding. If we have a bit string, say 00011111010, we read bits until there’s a match in the table. Our example string decodes to AAATGC. Note that the code word table is designed so there are no conflicts. If the table read

A    0
C    01

… and we encounter the bit 0 in the input, there’s no way we can ever get a C, as the A will match every time.

The standard method of decoding a Huffman coded string is by walking a binary tree, created from the code word table. When we encounter a 0 bit, we move – say – left in the tree, and right when we see a 1. This is the simplest method, and the one used in our decoder.
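The tree walk just described fits in a few lines of Haskell. This is a standalone sketch using the A/C/G/T table from the example above – the HTree type and decode function here are illustrative helpers, not part of the actual decoder.

```haskell
-- A Huffman tree for the example table: A=0, C=10, G=110, T=111.
-- Leaves hold symbols; a 0 bit goes left, a 1 bit goes right.
data HTree a = Leaf a | Node (HTree a) (HTree a)

acgt :: HTree Char
acgt = Node (Leaf 'A') (Node (Leaf 'C') (Node (Leaf 'G') (Leaf 'T')))

-- Walk the tree bit by bit; emit a symbol at each leaf and restart
-- from the root.
decode :: HTree a -> [Int] -> [a]
decode root = go root
  where
    go (Leaf x)   bits   = x : go root bits
    go _          []     = []
    go (Node l r) (b:bs) = go (if b == 0 then l else r) bs
```

Decoding the example bit string 00011111010 with this tree yields AAATGC, as in the text.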

There’s a more efficient method to decode the string, a basic time-space tradeoff that can be used when the same code word table is used to code/decode several different bit strings, as is the case with MP3. Instead of walking a tree, we use a lookup table in a clever way. This is best illustrated with an example:

lookup[0xx] = (A, 1)
lookup[10x] = (C, 2)
lookup[110] = (G, 3)
lookup[111] = (T, 3)

In the table above, xx means all permutations of 2 bits; all bit patterns from 00 to 11. Our table thus contains all indices from 000 to 111. To decode a string using this table we peek 3 bits in the coded bit string. Our example bit string is 00011111010, so our index is 000. This matches the pair (A, 1), which means we have found the value A and we should discard 1 bit from the input. We peek another 3 bits in the string, and repeat the process.
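The lookup procedure can be sketched like this, again for the A/C/G/T example. The table layout and helper names are made up for illustration; a real decoder would use the tables from the standard.

```haskell
-- Padded lookup table indexed by 3 peeked bits (the longest code word
-- is 3 bits). Each entry is (symbol, number of bits actually consumed).
lut :: [(Char, Int)]
lut = [ ('A', 1), ('A', 1), ('A', 1), ('A', 1)  -- 000..011: code word 0
      , ('C', 2), ('C', 2)                      -- 100, 101: code word 10
      , ('G', 3), ('T', 3) ]                    -- 110, 111

-- Peek 3 bits (zero-padding at the end of the stream), look up the
-- symbol, then drop only the bits that were consumed.
decodeLUT :: [Int] -> [Char]
decodeLUT []   = []
decodeLUT bits =
    let idx      = toIndex (take 3 (bits ++ [0, 0]))
        (sym, n) = lut !! idx
    in sym : decodeLUT (drop n bits)
  where
    toIndex = foldl (\acc b -> acc * 2 + b) 0
```

On the example bit string 00011111010 this produces the same AAATGC as the tree walk, but with one table access per symbol instead of one tree step per bit.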

For very large Huffman tables, where the longest code word is dozens of bits, it is not feasible to create a lookup table using this method of padding, as it would require a table of approximately 2^n elements, where n is the length of the longest code word. By carefully looking at a code word table, however, it’s often possible to craft a very efficient lookup table by hand, one that uses “pointers” to different sub-tables which handle the longest code words.

How Huffman coding is used in MP3

To understand how Huffman coding is used by MP3, it is necessary to understand exactly what is being coded or decoded. The compressed data that we are about to decompress is frequency domain samples. Each logical frame has up to four chunks – two per channel – each containing up to 576 frequency samples. For a 44100 Hz audio signal, the first frequency sample (index 0) represents frequencies around 0 Hz, while the last sample (index 575) represents frequencies around 22050 Hz.

These samples are divided into five different regions of variable length. The first three regions are known as the big values regions, the fourth region is known as the count1 region (or quad region), and the fifth is known as the zero region. The samples in the zero region are all zero, so these are not actually Huffman coded. If the big values regions and the quad region decode to 400 samples, the remaining 176 are simply padded with 0.

The three big values regions represent the important lower frequencies in the audio. The name big values refers to the information content: when we are done decoding, the regions will contain integers in the range –8206 to 8206.

These three big values regions are coded with three different Huffman tables, defined in the MP3 standard. The standard defines 15 large tables for these regions, where each table outputs two frequency samples for a given code word. The tables are designed to compress the “typical” content of the frequency regions as much as possible.

To further increase compression, the 15 tables are paired with another parameter for a total of 29 different ways each of the three regions can be compressed. The side information tells us which of the 29 possibilities to use. Somewhat confusingly, the standard calls these possibilities “tables”. We will call them table pairs instead.

As an example, here is Huffman code table 1 (table1), as defined in the standard:

Code word    Value
1            (0, 0)
001          (0, 1)
01           (1, 0)
000          (1, 1)

And here is table pair 1: (table1, 0).

To decode a big values region using table pair 1, we proceed as follows. Say the chunk contains the following bits: 000101010... First we decode the bits as we usually decode Huffman coded strings: the three bits 000 correspond to the two output samples 1 and 1, which we call x and y.

Here’s where it gets interesting: the largest code table defined in the standard has samples no larger than 15. This is enough to represent most signals satisfactorily, but sometimes a larger value is required. The second value in the table pair is known as the linbits (for some reason), and whenever we have found an output sample equal to the maximum value (15) we read linbits number of bits and add them to the sample. For table pair 1, linbits is 0, and the maximum sample value is never 15, so we ignore it in this case. For some table pairs, linbits is as large as 13, so the maximum sample value is 15 + 8191 = 8206.

When we have read linbits for sample x, we get the sign. If x is not 0, we read one bit. This determines whether the sample is positive or negative.

All in all, the two samples are decoded in these steps:

  1. Decode the first bits using the Huffman table. Call the samples x and y.
  2. If x = 15 and linbits is not 0, get linbits bits and add to x. x is now at most 8206.
  3. If x is not 0, get one bit. If 1, then x is –x.
  4. Do step 2 and 3 for y.
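The steps above can be sketched as follows. The Huffman lookup itself is assumed to have already produced the raw pair (x, y); getBits is a hypothetical big-endian bit reader, and the bit stream is just a list of 0/1 values for illustration.

```haskell
type Bits = [Int]

-- Read n bits as an unsigned big-endian integer.
getBits :: Int -> Bits -> (Int, Bits)
getBits n bs = (foldl (\acc b -> acc * 2 + b) 0 (take n bs), drop n bs)

-- Steps 2-4: extend escape values (15) with linbits, then read a
-- sign bit for every non-zero sample.
decodePair :: Int -> (Int, Int) -> Bits -> ((Int, Int), Bits)
decodePair linbits (x, y) bs0 =
    let (x', bs1) = finish x bs0
        (y', bs2) = finish y bs1
    in ((x', y'), bs2)
  where
    finish v bs =
        -- Step 2: the escape value 15 is extended with linbits extra bits.
        let (v', bs') = if v == 15 && linbits > 0
                           then let (e, r) = getBits linbits bs
                                in (v + e, r)
                           else (v, bs)
        -- Step 3: a non-zero sample is followed by a sign bit.
        in if v' /= 0
              then let (s, r) = getBits 1 bs'
                   in (if s == 1 then negate v' else v', r)
              else (v', bs')
```

Continuing the example above: after reading code word 000, the remaining bits 101010 give x = –1 (sign bit 1) and y = 1 (sign bit 0).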

The count1 region codes the frequencies that are so high they have been compressed tightly, and when decoded we have samples in the range –1 to 1. There are only two possible tables for this region; these are known as the quad tables, as each code word corresponds to 4 output samples. There are no linbits for the count1 region, so decoding is only a matter of using the appropriate table and getting the sign bits.

  1. Decode the first bits using the Huffman table. Call the samples v, w, x and y.
  2. If v is not 0, get one bit. If 1, then v is –v.
  3. Do step 2 for w, x and y.
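The sign handling can be sketched under the same assumptions as before: the quad table lookup is assumed to have already produced the four raw samples, each 0 or 1, and the bit stream is a list of 0/1 values.

```haskell
-- Steps 2-3: read a sign bit for every non-zero quad sample.
-- Returns the signed samples and the remaining bits.
signQuad :: [Int] -> [Int] -> ([Int], [Int])
signQuad []       bs = ([], bs)
signQuad (v : vs) bs
    | v == 0    = first (0 :) (signQuad vs bs)
    | otherwise = let (s : bs') = bs
                  in first ((if s == 1 then negate v else v) :)
                           (signQuad vs bs')
  where
    first f (a, b) = (f a, b)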

Step 1, summary

Unpacking an MP3 bit stream is very tedious, and is without doubt the decoding step that requires the most lines of code. The Huffman tables alone are a good 70 kilobytes, and all the parsing and unpacking requires a few hundred lines of code too.

The Huffman coding is undoubtedly one of the most important features of MP3 though. For a 500-byte logical frame with two channels, the output is 4x576 samples (1152 per channel) with a range of almost 15 bits, and that is even before we’ve done any transformations on the output samples. Without the Huffman coding, a logical frame would require up to 4-4½ kilobytes of storage, about an eight-fold increase in size.

All the unpacking is done by Unpack.hs, which exports two functions, mp3Seek and mp3Unpack. The latter is a simple helper function that combines mp3UnpackFrame and mp3ParseMainData. It looks like this:

mp3Unpack :: MP3Bitstream -> (MP3Bitstream, Maybe MP3Data)

Decoding, step 2: Re-quantization

Having successfully unpacked a frame, we now have a data structure containing audio to be processed further, and parameters describing how this should be done. Here are our types, as we got them from mp3Unpack:

data MP3Data = MP3Data1Channels SampleRate ChannelMode (Bool, Bool) 
                                MP3DataChunk MP3DataChunk
             | MP3Data2Channels SampleRate ChannelMode (Bool, Bool) 
                                MP3DataChunk MP3DataChunk 
                                MP3DataChunk MP3DataChunk

data MP3DataChunk = MP3DataChunk {
    chunkBlockType    :: Int,
    chunkBlockFlag    :: BlockFlag,
    chunkScaleGain    :: Double,
    chunkScaleSubGain :: (Double, Double, Double),
    chunkScaleLong    :: [Double],
    chunkScaleShort   :: [[Double]],
    chunkISParam      :: ([Int], [[Int]]),
    chunkData         :: [Int]
}  

MP3Data is simply an unpacked and parsed logical frame. It contains some useful information: first the sample rate, second the channel mode, third the stereo modes (more about them later). Then come the two or four data chunks, decoded separately. What the values stored in an MP3DataChunk represent will be described soon. For now it’s enough to know that chunkData stores the (at most) 576 frequency domain samples. An MP3DataChunk is also known as a granule; however, to avoid confusion we are not going to use this term until later in the article.

Re-quantization

We have already done one of the key steps of decoding an MP3: decoding the Huffman data. We will now do the second key step – re-quantization.

As hinted in the chapter on human hearing, the heart of MP3 compression is quantization. Quantization is simply the approximation of a large range of values with a smaller set of values i.e. using fewer bits. For example if you take an analog audio signal and sample it at discrete intervals of time you get a discrete signal – a list of samples. As the analog signal is continuous, these samples will be real values. If we quantize the samples, say approximate each real valued sample with an integer between –32767 and +32767, we end up with a digital signal – discrete in both dimensions.

Quantization can be used as a form of lossy compression. For 16 bit PCM each sample in the signal can take on one of 2^16 values. If we instead approximate each sample in the range –16383 to +16383, we lose information but save 1 bit per sample. The difference between the original value and the quantized value is known as the quantization error, and this results in noise. The difference between a real valued sample and a 16-bit sample is so small it’s inaudible for most purposes, but if we remove too much information from the sample, the difference from the original will soon be audible.

Let’s stop for a moment and think about where this noise comes from. This requires a mathematical insight, due to Fourier: all continuous signals can be created by adding sinusoids together – even the square wave! This means that if we take a pure sine wave, say at 440 Hz, and quantize it, the quantization error will manifest itself as new frequency components in the signal. This makes sense – the quantized sine is not really a pure sine, so there must be something else in the signal. These new frequencies will be spread all over the spectrum, and are heard as noise. If the quantization error is small, the magnitude of the noise will be small.

And this is where we can thank evolution that our ear is not perfect: if there’s a strong signal within a critical band, the noise due to quantization errors will be masked, up to the threshold. The encoder can thus throw away as much information as possible from the samples within the critical band, up to the point where discarding more information would result in noise passing the audible threshold. This is the key insight of lossy audio encoding.

Quantization methods can be written as mathematical expressions. Say we have a real valued sample in the range –1 to 1. To quantize this value to a form suitable for a 16 bit WAV file, we multiply the sample by 32767 and throw away the fractional part: q = floor(s * 32767), or equivalently in a form many programmers are familiar with: (short)(s * 32767.0). Re-quantization in this simple case is a division, where the difference between the re-quantized sample and the original is the quantization error.
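The round trip can be written out directly. The error bound below is a property of 16-bit quantization in general, not anything MP3-specific.

```haskell
-- 16-bit quantization as in the text: scale and truncate.
quantize :: Double -> Int
quantize s = floor (s * 32767)

-- Re-quantization is the corresponding division.
requantize :: Int -> Double
requantize q = fromIntegral q / 32767

-- The quantization error of a round trip is less than one
-- quantization step, 1/32767.
quantError :: Double -> Double
quantError s = s - requantize (quantize s)
```

For example, quantize 0.5 gives 16383, and the round-trip error is about 1.5·10^-5 – far below what we can hear in a 16-bit signal.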

Re-quantization in MP3

After we unpacked the MP3 bit stream and Huffman decoded the frequency samples in a chunk, we ended up with quantized frequency samples between –8206 and 8206. Now it’s time to re-quantize these samples to real values (floats), like when we take a 16-bit PCM sample and turn it into a float. When we’re done we have a sample in the range –1 to 1, much smaller than 8206. However, our new sample has a much higher resolution, thanks to the information the encoder left in the frame about how the sample should be reconstructed.

The MP3 encoder uses a non-linear quantizer, meaning the difference between consecutive re-quantized values is not constant. This is because low amplitude signals are more sensitive to noise, and thus require more bits than stronger signals – think of it as using more bits for small values, and fewer bits for large values. To achieve this non-linearity, the different scaling quantities are non-linear.

The encoder will first raise all samples to the power 3/4, that is newsample = oldsample^(3/4). The purpose is, according to the literature, to make the signal-to-noise ratio more consistent. We will gloss over the whys and hows here, and simply raise all samples to the power 4/3 to restore them to their original value.

All 576 samples are then scaled by a quantity simply known as the gain, or the global gain because all samples are affected. This is chunkScaleGain, and it’s also a non-linear value.

This far, we haven’t done anything really unusual. We have taken a value, at most 8206, and scaled it with a variable quantity. This is not that much different from a 16 bit PCM WAV, where we take a value, at most 32767, and scale it with the fixed quantity 1/32767. Now things will get more interesting.

Some frequency regions, partitioned into several scale factor bands, are further scaled individually. This is what the scale factors are for: the frequencies in the first scale factor band are all multiplied by the first scale factor, etc. The bands are designed to approximate the critical bands. Here’s an illustration of the scale factor bandwidths for a 44100 Hz MP3. The astute reader may notice there are 22 bands, but only 21 scale factors. This is a design limitation that affects the very high frequencies.

The reason these bands are scaled individually is to better control quantization noise. If there’s a strong signal in one band, it will mask the noise in this band but not others. The values within a scale factor band are thus quantized independently from other bands by the encoder, depending on the masking effects.

For reasons that will hopefully become clearer shortly, a chunk can be scaled in three different ways.

For one type of chunk – called “long” – we scale the 576 frequencies by the global gain and the 21 scale factors (chunkScaleLong), and leave it at that.

For another type of chunk – called “short” – the 576 samples are really three interleaved sets of 192 frequency samples. Don’t worry if this doesn’t make any sense now, we will talk about it soon. In this case, the scale factor bands look slightly different than in the illustration above, to accommodate the reduced bandwidths of the scale factor bands. Also, the scale factors are not 21 numbers, but sets of three numbers (chunkScaleShort). An additional parameter, chunkScaleSubGain, further scales the individual three sets of samples.

The third type of chunk is a mix of the above two.

When we have multiplied each sample with the corresponding scale factor and other gains, we are left with a high precision floating point representation of the frequency domain, where each sample is in the range –1 to 1.

Here’s some code that uses almost all the values in an MP3DataChunk. The three different scaling methods are controlled by the BlockFlag. There will be plenty more information about the block flag later in this article.

mp3Requantize :: SampleRate -> MP3DataChunk -> [Frequency]
mp3Requantize samplerate (MP3DataChunk bt bf gain (sg0, sg1, sg2) 
                         longsf shortsf _ compressed)
    | bf == LongBlocks  = long
    | bf == ShortBlocks = short
    | bf == MixedBlocks = take 36 long ++ drop 36 short
    where 
        long  = zipWith procLong  compressed longbands
        short = zipWith procShort compressed shortbands

        procLong sample sfb = 
            let localgain   = longsf !! sfb
                dsample     = fromIntegral sample
            in gain * localgain * dsample **^ (4/3)

        procShort sample (sfb, win) =
            let localgain = (shortsf !! sfb) !! win
                blockgain = case win of 0 -> sg0
                                        1 -> sg1
                                        2 -> sg2
                dsample   = fromIntegral sample
            in gain * localgain * blockgain * dsample **^ (4/3)
                                
        -- Frequency index (0-575) to scale factor band index (0-21).
        longbands = tableScaleBandIndexLong samplerate
        -- Frequency index to scale factor band index and window index (0-2).
        shortbands = tableScaleBandIndexShort samplerate

A fair warning: This presentation of the MP3 re-quantization step differs somewhat from the official specification. The specification presents the quantization as a long formula based on integer quantities. This decoder instead treats these integer quantities as floating point representations of non-linear quantities, so the re-quantization can be expressed as an intuitive series of multiplications. The end result is the same, but the intention is hopefully clearer.

Minor step: Reordering

Before quantizing the frequency samples, the encoder will in certain cases reorder the samples in a predefined way. We have already encountered this above: after the reordering by the encoder, the “short” chunks – three small blocks of 192 samples each – are combined into 576 samples ordered by frequency (sort of). This is to improve the efficiency of the Huffman coding, as the method with big values and different tables assumes the lower frequencies come first in the list.

When we’re done re-quantizing in our decoder, we will reorder the “short” samples back to their original position. After this reordering, the samples in these chunks are no longer ordered by frequency. This is slightly confusing, so unless you are really interested in MP3 you can ignore this and concentrate on the “long” chunks, which have very few surprises.

Decoding, step 3: Joint Stereo

MP3 supports four different channel modes. Mono means the audio has a single channel. Stereo means the audio has two channels. Dual channel is identical to stereo for decoding purposes – it’s intended as information for the media player in case the two channels contain different audio, such as an audio book in two languages.

Then there’s joint stereo. This is like the regular stereo mode, but with some extra compression steps taking similarities between the two channels into account. This makes sense, especially for stereo music where there’s usually a very high correlation between the two channels. By removing some redundancy, the audio quality can be much higher for a given bit rate.

MP3 supports two joint stereo modes known as middle/side stereo (MS) and intensity stereo (IS). Whether these modes are in use is given by the (Bool, Bool) tuple in the MP3Data type. Additionally, chunkISParam stores parameters used by IS mode.

MS stereo is very simple: instead of encoding two similar channels verbatim, the encoder computes the sum and the difference of the two channels before encoding. The information content in the “side” channel (difference) will be less than in the “middle” channel (sum), and the encoder can use more bits for the middle channel for a better result. MS stereo is lossless, and is a very common mode that’s often used in joint stereo MP3s. Decoding MS stereo is very cute:

mp3StereoMS :: [Frequency] -> [Frequency] -> ([Frequency], [Frequency])
mp3StereoMS middle side =
    let sqrtinv = 1 / (sqrt 2)
        left  = zipWith0 (\x y -> (x+y)*sqrtinv) 0.0 middle side
        right = zipWith0 (\x y -> (x-y)*sqrtinv) 0.0 middle side
    in (left, right)

The only oddity here is the division by the square root of 2 instead of simply 2. This is to scale down the channels for more efficient quantization by the encoder.
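To see that MS stereo really is lossless, we can sketch the encoder side and check that composing the two directions is the identity. This msEncode is an illustrative assumption – the article’s decoder only contains the decoding direction – and the plain zipWith stands in for the decoder’s zipWith0.

```haskell
-- Encoder side: middle is the scaled sum, side the scaled difference.
msEncode :: [Double] -> [Double] -> ([Double], [Double])
msEncode left right =
    let sqrtinv = 1 / sqrt 2
    in ( zipWith (\l r -> (l + r) * sqrtinv) left right
       , zipWith (\l r -> (l - r) * sqrtinv) left right )

-- Decoder side, as in mp3StereoMS above.
msDecode :: [Double] -> [Double] -> ([Double], [Double])
msDecode middle side =
    let sqrtinv = 1 / sqrt 2
    in ( zipWith (\m s -> (m + s) * sqrtinv) middle side
       , zipWith (\m s -> (m - s) * sqrtinv) middle side )
```

Since ((l+r)/√2 + (l−r)/√2)/√2 = l, the round trip recovers the original channels exactly, up to floating point rounding.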

A more unusual stereo mode is known as intensity stereo, or IS for short. We will ignore IS stereo in this article.

Having done the stereo decoding, the only thing remaining is taking the frequency samples back to the time domain. This is the part heavy on theory.

Decoding, step 4: Frequency to time

At this point the only remaining MP3DataChunk values we will use are chunkBlockFlag and chunkBlockType. These are the sole two parameters that dictate how we’re going to transform our frequency domain samples to the time domain. To understand the block flag and block type we have to familiarize ourselves with some transforms, as well as one part of the encoder.

The encoder: filter banks and transforms

The input to an encoder is probably a time domain PCM WAV file, as one usually gets when ripping an audio CD. The encoder takes 576 time samples, from here on called a granule, and encodes two of these granules to a frame. For an input source with two channels, two granules per channel are stored in the frame. The encoder also saves information in the frame about how the audio was compressed. This is the MP3Data type in our decoder.

The time domain samples are transformed to the frequency domain in several steps, one granule at a time.

Analysis filter bank

First the 576 samples are fed to a set of 32 band pass filters, where each band pass filter outputs 18 time domain samples representing 1/32nd of the frequency spectrum of the input signal. If the sample rate is 44100 Hz, each band will be approximately 689 Hz wide (22050/32 Hz). Note that there’s downsampling going on here: common band pass filters will output 576 output samples for 576 input samples, however the MP3 filters also reduce the number of samples by a factor of 32, so the combined output of all 32 filters is the same as the number of inputs.

This part of the encoder is known as the analysis filter bank (throw in the word polyphase for good measure), and it’s a part of the encoder common to all the MPEG-1 layers. Our decoder will do the reverse at the very end of the decoding process, combining the subbands to the original signal. The reverse is known as the synthesis filter bank. These two filter banks are simple conceptually, but real mammoths mathematically – at least the synthesis filter bank. We will treat them as black boxes.

MDCT

The output of each band pass filter is further transformed by the MDCT, the modified discrete cosine transform. This transform is just a method of transforming the time domain samples to the frequency domain. Layers 1 and 2 do not use this MDCT, but it was added on top of the filter bank for layer 3, as a finer frequency resolution than 689 Hz (given a 44.1 kHz sample rate) proved to give better compression. This makes sense: simply dividing the whole frequency spectrum into fixed size blocks means the encoder has to take several critical bands into account when quantizing the signal, which results in a worse compression ratio.

The MDCT takes a signal and represents it as a sum of cosine waves, turning it to the frequency domain. Compared to the DFT/FFT and other well-known transforms, the MDCT has a few properties that make it very suited for audio compression.

First of all, the MDCT has the energy compaction property common to several of the other discrete cosine transforms. This means most of the information in the signal is concentrated to a few output samples with high energy. If you take an input sequence, do an (M)DCT transform on it, set the “small” output values to 0, then do the inverse transform – the result is a fairly small change in the original input. This property is of course very useful for compression, and thus different cosine transforms are used by not only MP3 and audio compression in general but also JPEG and video coding techniques.

Secondly, the MDCT is designed to be performed on consecutive blocks of data, so it has smaller discrepancies at block boundaries compared to other transforms. This also makes it very suited for audio, as we’re almost always working with really long signals.

Technically, the MDCT is a so-called lapped transform, which means we use input samples from the previous input data when we work with the current input data. The input is 2N time samples and the output is N frequency samples. Instead of transforming 2N length blocks separately, consecutive blocks are overlapped. This overlapping helps reduce artifacts at block boundaries. First we perform the MDCT on, say, samples 0-35 (inclusive), then 18-53, then 36-71… To smooth the boundaries between consecutive blocks, the MDCT is usually combined with a windowing function that is applied prior to the transform. A windowing function is simply a sequence of values that are zero outside some region, and often between 0 and 1 within the region, that are multiplied element-wise with another sequence. For the MDCT, smooth, arc-like window functions are usually used, which make the boundaries of the input block go smoothly to zero at the edges.

In the case of MP3, the MDCT is done on the subbands from the analysis filter bank. In order to get all the nice properties of the MDCT, the transform is not done on the 18 samples directly, but on a windowed signal formed by the concatenation of the previous 18 samples and the current 18. This is illustrated in the picture below, showing two consecutive granules (MP3DataChunk) in an audio channel. Remember: we are looking at the encoder here; the decoder works in reverse. This illustration shows the MDCT of the 0-689 Hz band.

The MDCT can either be applied to the 36 samples as described above, or three MDCTs can be done on 12 samples each – in either case the output is 18 frequency samples. The first choice, known as the long method, gives us greater frequency resolution. The second choice, known as the short method, gives us greater time resolution. The encoder selects the long MDCT to get better audio quality when the signal changes very little, and it selects short when there’s lots going on, that is, for transients.

For the whole granule of 576 samples, the encoder can either do the long MDCT on all 32 subbands – this is the long block mode – or it can do the short MDCT in all subbands – this is the short block mode. There’s a third choice, known as the mixed block mode. In this case the encoder uses the long MDCT on the first two subbands, and the short MDCT on the remaining. The mixed block mode is a compromise: it’s used when time resolution is necessary, but using the short block mode would result in artifacts. The lowest frequencies, where the ear is most sensitive to frequency inaccuracies, are thus treated as long blocks. Notice that the boundaries of the mixed block mode are fixed: the first two, and only two, subbands use the long MDCT. This is considered a design limitation of MP3: sometimes it’d be useful to have high frequency resolution in more than two subbands. In practice, many encoders do not support mixed blocks.

We discussed the block modes briefly in the chapter on re-quantization and reordering, and hopefully that part will make a little more sense knowing what’s going on inside the encoder. The 576 samples in a short granule are really three sets of 192 samples each, stored in such a way that the facilities for compressing a long granule can be used.

The combination of the analysis filter bank and the MDCT is known as the hybrid filter bank, and it’s a very confusing part of the decoder. The analysis filter bank is used by all MPEG-1 layers, but as its frequency bands do not reflect the critical bands, layer 3 added the MDCT on top of the analysis filter bank. One of the features of AAC is a simpler method to transform the time domain samples to the frequency domain, which only uses the MDCT, not bothering with the band pass filters.

The decoder

Digesting this information about the encoder leads to a startling realization: we can’t actually decode granules, or frames, independently! Due to the overlapping nature of the MDCT we need the inverse-MDCT output of the previous granule to decode the current granule.

This is where chunkBlockType and chunkBlockFlag are used. If chunkBlockFlag is set to the value LongBlocks, the encoder used a single 36-point MDCT for all 32 subbands (from the filter bank), with overlapping from the previous granule. If the value is ShortBlocks instead, three shorter 12-point MDCTs were used. chunkBlockFlag can also be MixedBlocks. In this case the two lower frequency subbands from the filter bank are treated as LongBlocks, and the rest as ShortBlocks.

The value chunkBlockType is an integer, either 0, 1, 2 or 3. This decides which window is used. These window functions are pretty straightforward and similar: one is for the long blocks, one is for the three short blocks, and the other two are used exactly before and after a transition between a long and a short block.

Before we do the inverse MDCT, we have to take some deficiencies of the encoder’s analysis filter bank into account. The downsampling in the filter bank introduces some aliasing (where signals become indistinguishable from other signals), but in such a way that the synthesis filter bank cancels the aliasing. After the MDCT, the encoder will remove some of this aliasing. This, of course, means we have to undo this alias reduction in our decoder, prior to the IMDCT. Otherwise the alias cancellation property of the synthesis filter bank will not work.

When we’ve dealt with the aliasing, we can IMDCT and then window, remembering to overlap with the output from the previous granule. For short blocks, the three small individual IMDCT inputs are overlapped directly, and this result is then treated as a long block.

The word “overlap” requires some clarifications in the context of the inverse transform. When we speak of the MDCT, a function from 2N inputs to N outputs, this just means we use half the previous samples as inputs to the function. If we’ve just MDCT:ed 36 input samples from offset 0 in a long sequence, we then MDCT 36 new samples from offset 18.

When we speak of the IMDCT, a function from N inputs to 2N outputs, there’s an extra overlap-add step needed to reconstruct the original sequence. We do the IMDCT on the first 18 samples from the output sequence above. This gives us 36 samples. Outputs 18..35 are added, element-wise, to outputs 0..17 of the IMDCT of the next 18 samples. Here’s an illustration.
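The overlap-add can be sketched with a toy block size; here N = 2 instead of the decoder’s N = 18, and the “IMDCT outputs” are just made-up numbers.

```haskell
-- Overlap-add of consecutive 2N-sample blocks: the last N samples of
-- each block are added element-wise to the first N of the next block.
-- The final tail is carried over (in the decoder: to the next granule).
overlapAdd :: Int -> [[Double]] -> ([Double], [Double])
overlapAdd n blocks = go (replicate n 0) blocks
  where
    go carry []       = ([], carry)
    go carry (b : bs) =
        let (cur, next)    = splitAt n b
            (rest, carry') = go next bs
        in (zipWith (+) carry cur ++ rest, carry')
```

For example, overlapAdd 2 [[1,2,3,4],[5,6,7,8]] reconstructs [1,2,8,10] and carries [7,8] forward – the same role the overlap' value plays in the decoder’s mp3IMDCT.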

With that out of the way, here’s some code:

mp3IMDCT :: BlockFlag -> Int -> [Frequency] -> [Sample] -> ([Sample], [Sample])
mp3IMDCT blockflag blocktype freq overlap =
    let (samples, overlap') = 
            case blockflag of
                 LongBlocks  -> transf (doImdctLong blocktype) freq
                 ShortBlocks -> transf (doImdctShort) freq
                 MixedBlocks -> transf (doImdctLong 0)  (take 36 freq) <++>
                                transf (doImdctShort) (drop 36 freq)
        samples' = zipWith (+) samples overlap
    in (samples', overlap')
    where
        transf imdctfunc input = unzipConcat $ mapBlock 18 toSO input
            where
                -- toSO takes 18 input samples b and computes 36 time samples
                -- by the IMDCT. These are further divided into two equal
                -- parts (S, O) where S are time samples for this frame
                -- and O are values to be overlapped in the next frame.
                toSO b = splitAt 18 (imdctfunc b)
                unzipConcat xs = let (a, b) = unzip xs
                                 in (concat a, concat b)

doImdctLong :: Int -> [Frequency] -> [Sample]
doImdctLong blocktype f = imdct 18 f `windowWith` tableImdctWindow blocktype

doImdctShort :: [Frequency] -> [Sample]
doImdctShort f = overlap3 shorta shortb shortc
  where
    (f1, f2, f3) = splitAt2 6 f
    shorta       = imdct 6 f1 `windowWith` tableImdctWindow 2
    shortb       = imdct 6 f2 `windowWith` tableImdctWindow 2
    shortc       = imdct 6 f3 `windowWith` tableImdctWindow 2
    overlap3 a b c = 
      p1 ++ (zipWith3 add3 (a ++ p2) (p1 ++ b ++ p1) (p2 ++ c)) ++ p1
      where
        add3 x y z = x+y+z
        p1         = [0,0,0, 0,0,0]
        p2         = [0,0,0, 0,0,0, 0,0,0, 0,0,0]
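The `imdct` helper used above can be implemented directly from the formula in the specification. Here is a minimal (and slow) sketch of that direct form; `imdctRef` is an illustrative name, since the decoder’s own `imdct` lives in a separate module:

```haskell
-- Direct IMDCT as given in the MP3 specification: nhalf frequency
-- inputs produce 2*nhalf time-domain outputs. For long blocks
-- nhalf = 18; for the three small transforms in a short block,
-- nhalf = 6.
imdctRef :: Int -> [Double] -> [Double]
imdctRef nhalf xs =
    [ sum [ x * cos (pi / (2 * n) * (2 * fromIntegral i + 1 + n / 2)
                                  * (2 * fromIntegral k + 1))
          | (k, x) <- zip [(0 :: Int) ..] xs ]
    | i <- [0 .. 2 * nhalf - 1] ]
  where
    n = fromIntegral (2 * nhalf) :: Double
```

This O(N²) form is fine for a readable decoder; fast decoders replace it with an FFT- or DCT-based equivalent.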

Before we pass the time domain signal to the synthesis filter bank, there’s one final step. Some subbands from the analysis filter bank have inverted frequency spectra, which the encoder corrects; just as with the alias reduction, we have to undo this correction.
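Concretely, the inversion amounts to negating every odd-numbered time sample in every odd-numbered subband. A minimal sketch, assuming the 576 samples of a granule are arranged as 32 subbands of 18 samples each (`freqInvert` is an illustrative name):

```haskell
-- Frequency inversion: in every odd-numbered subband, negate every
-- odd-numbered time sample. Input: 32 subbands of 18 samples each.
freqInvert :: [[Double]] -> [[Double]]
freqInvert = zipWith inv [(0 :: Int) ..]
  where
    inv sb samples
        | odd sb    = zipWith flipOdd [(0 :: Int) ..] samples
        | otherwise = samples
    flipOdd i x = if odd i then negate x else x
```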

Here are the steps required for taking our frequency samples back to time:

  1. [Frequency] Undo the alias reduction, taking the block flag into account.
  2. [Frequency] Perform the IMDCT, taking the block flag into account.
  3. [Time] Invert the frequency spectra for some bands.
  4. [Time] Synthesis filter bank.

A typical MP3 decoder will spend most of its time in the synthesis filter bank – it is by far the most computationally heavy part of the decoder. In our decoder, we will use the (slow) implementation from the specification. Typical real-world decoders, such as the one in your favorite media player, use a highly optimized version of the filter bank built around a fast transform. We will not delve into this optimization technique further.

Step 4, summary

It’s easy to miss the forest for the trees, but we have to remember that this decoding step is conceptually simple; it’s just messy in MP3 because the designers reused parts from layer 1, which blurs the boundaries between time domain, frequency domain and granule.

Using the decoder

Using the decoder is a matter of creating a bit stream, initializing it (mp3Seek), unpacking it to an MP3Data (mp3Unpack) and then decoding the MP3Data with mp3Decode. The decoder does not use any advanced Haskell concepts externally, such as state monads, so hopefully the language will not get in the way of the audio.

module Codec.Audio.MP3.Decoder (
    mp3Seek
   ,mp3Unpack
   ,MP3Bitstream(..)
   ,mp3Decode
   ,MP3DecodeState(..)
   ,emptyMP3DecodeState
) where
...

mp3Decode :: MP3DecodeState -> MP3Data -> (MP3DecodeState, [Sample], [Sample])

data MP3DecodeState = ...

emptyMP3DecodeState :: MP3DecodeState
emptyMP3DecodeState = ...

The code is tested with a recent version of GHC. The decoder requires binary-strict, which can be found on Hackage. See the README in the code for build instructions. Please note that the software is currently version 0.0.1 – it’s very, very slow, and is missing some features.

Code: mp3decoder-0.0.1.tar.gz.

Conclusion

MP3 has its peculiarities, especially the hybrid filter bank, but it’s still a nice codec with a firm grounding in psychoacoustic principles. Not standardizing the encoder was a good choice by the MPEG-1 team, and the available encoders show it’s possible to compress audio satisfactorily within the constraints set by the decoder.

If you decide to play around with the source code, be sure to set your sound card to a low volume if you use headphones! Removing parts of the decoder may result in noise. Have fun.

References

CD 11172-3 Part 3 (the specification)

David Salomon, Data Compression: The Complete Reference, 3rd ed.

Davis Pan, A Tutorial on MPEG/Audio Compression

Rassol Raissi, The Theory Behind Mp3

The source code to libmad, LAME and 8Hz-mp3.