This post was inspired by Seed Labs.

What are we going to do?

The purpose of this lab is to create two different programs with different operations, but with the same MD5 hash. To do this, we will use a property derived from construction of this function. It makes it so that if we have colliding blocks(or set of blocks) B and B' then as long as we keep earlier and later blocks unchanged the sum also will not change.

md5( block(0)...block(m) | B  | block(m+2)...block(n) )

is equivalent to

md5( block(0)...block(m) | B' | block(m+2)...block(n) )

| means concaternation - gluing bytes together block(i) denotes the i-th 64 byte block

Since creating a collider from scratch is much more difficult and beyond the scope of this post, we will use a collider written by someone else.

The program is pretty simple and its sources can be found here: fastcoll_v1.0.0.5-1_source.zip

Building FastColl

After unpacking, you will find that there is no Makefile or any other script to build the program for us. This is not a major problem, but there are a few things to look out for:

  1. It requires a C++ compiler (I'll use g++)
  2. It requires libboost (you need to know what switches to give the linker)
  3. It requires its older version (ubuntu:focal-based docker container will be helpful)
  4. If we do dynamic linking it won't work outside the container (and it would be nice if it did)
  • Enter the temporary container (mounting the current working directory as /fc)
docker container run -v "$PWD":/fc -it --rm ubuntu:focal
  • Install required libs(1.67 version) and g++
apt update
apt install g++ libboost-{filesystem,program-options}1.67-dev
  • Go to the /fc directory and compile all the *.cpp files in it.
cd /fc
g++ -o md5collgen --static *.cpp -lboost_system -lboost_filesystem -lboost_program_options

The --static flag makes it possible to run the program outside the container as well. If it is otherwise, let's not blow the container because it will be needed later.

For comparison, I built both versions - static and dynamic:

❯ file a.out
a.out:
    ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV),
    dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2,
    BuildID[sha1]=e6253061d4ca208aa53dc92e675e021d4892da51,
    for GNU/Linux 3.2.0, not stripped
❯ file b.out
b.out:
    ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux),
    statically linked,
    BuildID[sha1]=6ed8dd891caffef02189c44b406f42700f53af2e,
    for GNU/Linux 3.2.0, not stripped
❯ ll *.out
.rwxr-xr-x root    root    376 KB Fri Apr 14 17:21:13 2023  a.out
.rwxr-xr-x root    root    3.1 MB Fri Apr 14 17:20:57 2023  b.out

The static one is much larger, suggesting that the libraries were indeed included in the executable.

  • Test to see if the program is working
❯ ./md5collgen
MD5 collision generator v1.5
by Marc Stevens (http://www.win.tue.nl/hashclash/)

Allowed options:
  -h [ --help ]           Show options.
  -q [ --quiet ]          Be less verbose.
  -i [ --ihv ] arg        Use specified initial value. Default is MD5 initial
                          value.
  -p [ --prefixfile ] arg Calculate initial value using given prefixfile. Also
                          copies data to output files.
  -o [ --out ] arg        Set output filenames. This must be the last option
                          and exactly 2 filenames must be specified.
                          Default: -o msg1.bin msg2.bin

Sample preparation

We start programming with a simple wireframe:

#include <stdint.h>
#include <stdio.h>

uint8_t check()
{
    return 1;
}

int main()
{
    printf("Hello world!\n");
    
    if(check())
    {
        printf("Bad stuff\n");
    }
    else
    {
        printf("Good stuff\n");
    }
}

Compile the program with the command gcc -o app_base sauce.c. 'app_base' is an arbitrary name, but I prefer it so that I don't lose track of which part I'm cutting and which part I'm gluing together.

The reason for this is that we need to have some kind of common prefix for both programs, collocated blocks, and a common suffix. We can't just write 2 different programs using this method, so the program must contain both "good" and "bad" functionality. This can be done in a less obvious way than using if(check()), but the obfuscation can be done by anyone according to their needs.

We need a space that we can freely manipulate without damaging the program, and at the same time there will be easy access from the code to see what version of the program we are in.

The checking trick is as follows:

#define BSIZE 24

uint64_t collider[BSIZE]  = { 0x42 /* colliding blocks */ };

uint64_t reference[BSIZE] = { 0x41 /* reference blocks */ };

uint8_t check()
{
    uint8_t i;
    for(i=0; i<BSIZE; ++i)
    {
        if(collider[i] != reference[i]) {
            return 0;
        }
    }
    return 1;
}

It is worth checking that our code works by putting a value into each of these blocks, first the same and then different. There may be questions about the type and size of the array. The type only affects the number of iterations during the check, since we will be slicing the program in the hex editor anyway. The size, on the other hand, is with a margin, taking into account that the collision will be implemented in 2 blocks and the arrays may not be properly aligned in memory.

First cuts

We are now trying to find where the beginning of our array is, this will be the end of our prefix. I opened the binary in the rehex program from the AUR, but any option is fine(really, xxd and grep will do as well).

We are looking for the smallest address within the bounds of our array that is a multiple of 64, which is the block size that MD5 works on. In the hexadecimal system, these are addresses with suffixes -00, -40, -80 and -C0. For example, in my case the address is 0x3040, which is 12352 in decimal. Let's remember this address because we're going to need it.

If we have trouble finding the right offset, we can fill the whole array with 0x69696969696969696969('iiiiiiii') or any other preferred value that is easy to find. Remember, hex editors have a search function.

Screenshot of 'app_base' opened in Rehex. Both arrays can be seen starting at offset 0x3040. Only the slice with the array data highlighted in blue is shown.

In order to generate colliding blocks correctly, we need to pass the prefix, because the state of MD5 at any given time will vary depending on what data was previously inserted, so changing that part after this step will require generating a new collision. However, it does not depend on the suffix - as long as we keep the suffix in common, the collision will still occur.

I cut out the prefix with the split program, which is available on virtually every Linux system (any other method of cutting it out, such as inside the editor, is fine).

# 12352 == 0x3040
split -b 12352  ./app_base splitted_

We can see if it was cut correctly:

❯ wc -c splitted_aa
12352 splitted_aa
❯ xxd -g4 -c32 splitted_ab | head -n 12
00000000: 42000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000020: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000040: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000060: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000080: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
000000a0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
000000c0: 41000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
000000e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000100: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000120: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000140: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000160: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

As you can see, the size of the prefix is, as expected, a multiple of 64, and in the second part we have our arrays. We can now use the collider, run it to generate different but colliding blocks.

❯ ./md5collgen -p splitted_aa -o prefix_coll_A prefix_coll_B
MD5 collision generator v1.5
by Marc Stevens (http://www.win.tue.nl/hashclash/)

Using output filenames: 'prefix_coll_A' and 'prefix_coll_B'
Using prefixfile: 'splitted_aa'
Using initial value: f689a84ef7dec30889248f56f67a607e

Generating first block: ..............
Generating second block: S00..
Running time: 13.2727 s

We can check to see if it worked:

❯ md5sum prefix_coll_A prefix_coll_B ; sha1sum prefix_coll_A prefix_coll_B
4eff2e2840c3c0aaee2a5c2967aa0934  prefix_coll_A
4eff2e2840c3c0aaee2a5c2967aa0934  prefix_coll_B
91e9d671c6a6808ea9430058aee178c89f83aecd  prefix_coll_A
90f9fbce8c7fb7c22c6611903ad289cf5552723c  prefix_coll_B

The SHA-1 algorithm shows the difference in the files, but the MD5 sums are identical. That's something!

I should have been a neurosurgeon

We are about to cut the collision from the generated files, but first let's look at the plan:

"bad" version"good" versionMD5's perspective
colliderblock Ablock Bmatch
referenceblock Ablock Amatch
col=ref ?YesNo

From MD5's point of view, it doesn't matter whether we put block A or B into the array, but the result of the check() function we wrote earlier does. To do this, we need to cut out the colliding blocks from the files. We can do this with a hex editor, but we have the command we need right in our history:

❯ split -b 12352  ./prefix_coll_A block_A_
❯ rm block_A_aa
❯ mv block_A_ab block_A

We do the same for block B. We should get two files block_A and block_B each with a size of 128 bytes.

❯ xxd -g4 -c32 block_A
00000000: d0da55a9 5f78f5e1 c49c1391 0ded9308 ab6b6ec0 df5ebcff 2e127ade 54408923
00000020: 73ebf3d0 da8c5be7 eb2759d9 0c7c9e8c 57833e73 50a1ccf0 fa2f9ab3 552d5d09
00000040: 032c3ef4 123634a3 0fd65236 3da76178 4a24ba52 16b5801d ddf715c5 585a5dae
00000060: 3363d6bd 5efba241 6388ad2f eb1d7b84 04b03b4a fceb6b1a 968e478b f10a5a15
❯ xxd -g4 -c32 block_B
00000000: d0da55a9 5f78f5e1 c49c1391 0ded9308 ab6b6e40 df5ebcff 2e127ade 54408923
00000020: 73ebf3d0 da8c5be7 eb2759d9 0cfc9e8c 57833e73 50a1ccf0 fa2f9a33 552d5d09
00000040: 032c3ef4 123634a3 0fd65236 3da76178 4a24bad2 16b5801d ddf715c5 585a5dae
00000060: 3363d6bd 5efba241 6388ad2f eb9d7a84 04b03b4a fceb6b1a 968e470b f10a5a15
❯ wc -c block_A block_B
128 block_A
128 block_B
256 total

I've set up two versions of app_base called app_good and app_evil, to keep me from mixing things up. I found the arrays we need at addresses 0x3040 and 0x3100. Let's swap them out as planned!

Screenshot of 'app_good' opened in Rehex. In red you can see 2 pasted blocks there are some differences(highlighted), for example at offset 0x13(relative to the beginning of the block) 'C0' != '40'

Screenshot of 'app_evil' opened in Rehex. In red you can see 2 pasted blocks - both are the same

We save (red in Rehex means unsaved changes) and check:

❯ ./app_good
Hello world!
Bad stuff
❯ ./app_evil
Hello world!
Good stuff
❯ md5sum app_good app_evil
8e871d437777ebdc1e483844ea366a54  app_good
8e871d437777ebdc1e483844ea366a54  app_evil

Summing up

The MD5 function in terms of collisions was torn apart repeatedly more than a decade ago. Being able to build a pair of binaries with the same hash doesn't seem like an achievement in 2023. It used to be, when MD5 was used along with digital signatures - including in certificates. Nevertheless, the exercise is valuable, showing some features of many hash functions that most of us don't think about on a daily basis. It's also a cool party trick to show off at a house party or Discord. If you haven't already done so, I encourage you to take an interest in cryptography in one aspect or another. It's a field that is very graceful and rewards effort with "fireworks" like the above.

Link to archive with both versions of the program