Implementing Huffman Coding in C


Huffman Coding (link to Wikipedia) is a compression algorithm used for loss-less data compression. Here’s the basic idea: each ASCII character is usually represented with 8 bits, but if we had a text filed composed of only the lowercase a-z letters we could represent each character with only 5 bits (i.e., 2^5 = 32, which is enough to represent 26 values), thus reducing the overall memory needed to store the data.

For example, the table of characters -> binary code could look like this:

  • a = 00000
  • b = 00001
  • c = 00010
  • d = 00011
  • e = 00100

And so on.

This is the fixed-length representation, and it already creates a significant compression rate (around 35% in the example above).

A more efficient approach is to use a variable-length representation, where each character can have a different number of bits to represent it. More specifically we first analyze the frequency of each character in the text, and then we create a binary tree (called Huffman tree) giving a shorter bit representation to the most used characters, so that they can be reached faster. Notice that it must be a prefix tree (i.e., the code of every letter can’t be prefix to the code of any other letter) else the decompression wouldn’t work.

For example, the frequency of the letters in the English language (according to Wikipedia) is the following:

english-letter-frequency

Now the algorithm to create the Huffman tree is the following:

  • Create a forest with one tree for each letter and its respective frequency as value
  • Join the two trees with the lowest value, removing each from the forest and adding instead the resulting combined tree
  • Repeat until there’s only one tree left

The Huffman tree for the a-z letters (and the space character) using the frequency table above would look like this (you would need to expand to the lower branches to see all the letters):

huffman-tree-1

We can traverse the tree and create a table with all the letters and their respective binary codes. It would look like this:

a = 0011
b = 011011
c = 11111
d = 00100
e = 101
f = 000010
g = 011001
h = 1110
i = 0101
j = 011010101
k = 0110101001
l = 00101
m = 000001
n = 0111
o = 0100
p = 011000
q = 0110101000
r = 1101
s = 1100
t = 0001
u = 11110
v = 0110100
w = 000000
x = 011010111
y = 000011
z = 011010110
space = 100

After that you just need to read the original text file letter by letter and to output the respective binary code. If you want to decompress a file, on the other hand, you read bit by bit and move along the Huffman tree until you find a letter, at which point you move back to the root of the tree and continue processing the bits.

Below you’ll find a C implementing of the Huffman coding (it includes all the parts, including the creating of the Huffman tree, the table and so on). If you prefer here’s the huffman.zip file for download.


#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define len(x) ((int)log10(x)+1)

/* Node of the huffman tree */
struct node{
    int value;
    char letter;
    struct node *left,*right;
};

typedef struct node Node;

/* 81 = 8.1%, 128 = 12.8% and so on. The 27th frequency is the space. Source is Wikipedia */
int englishLetterFrequencies [27] = {81, 15, 28, 43, 128, 23, 20, 61, 71, 2, 1, 40, 24, 69, 76, 20, 1, 61, 64, 91, 28, 10, 24, 1, 20, 1, 130};

/*finds and returns the small sub-tree in the forrest*/
int findSmaller (Node *array[], int differentFrom){
    int smaller;
    int i = 0;

    while (array[i]->value==-1)
        i++;
    smaller=i;
    if (i==differentFrom){
        i++;
        while (array[i]->value==-1)
            i++;
        smaller=i;
    }

    for (i=1;i<27;i++){
        if (array[i]->value==-1)
            continue;
        if (i==differentFrom)
            continue;
        if (array[i]->value<array[smaller]->value)
            smaller = i;
    }

    return smaller;
}

/*builds the huffman tree and returns its address by reference*/
void buildHuffmanTree(Node **tree){
    Node *temp;
    Node *array[27];
    int i, subTrees = 27;
    int smallOne,smallTwo;

    for (i=0;i<27;i++){
        array[i] = malloc(sizeof(Node));
        array[i]->value = englishLetterFrequencies[i];
        array[i]->letter = i;
        array[i]->left = NULL;
        array[i]->right = NULL;
    }

    while (subTrees>1){
        smallOne=findSmaller(array,-1);
        smallTwo=findSmaller(array,smallOne);
        temp = array[smallOne];
        array[smallOne] = malloc(sizeof(Node));
        array[smallOne]->value=temp->value+array[smallTwo]->value;
        array[smallOne]->letter=127;
        array[smallOne]->left=array[smallTwo];
        array[smallOne]->right=temp;
        array[smallTwo]->value=-1;
        subTrees--;
    }

    *tree = array[smallOne];

return;
}

/* builds the table with the bits for each letter. 1 stands for binary 0 and 2 for binary 1 (used to facilitate arithmetic)*/
void fillTable(int codeTable[], Node *tree, int Code){
    if (tree->letter<27)
        codeTable[(int)tree->letter] = Code;
    else{
        fillTable(codeTable, tree->left, Code*10+1);
        fillTable(codeTable, tree->right, Code*10+2);
    }

    return;
}

/*function to compress the input*/
void compressFile(FILE *input, FILE *output, int codeTable[]){
    char bit, c, x = 0;
    int n,length,bitsLeft = 8;
    int originalBits = 0, compressedBits = 0;

    while ((c=fgetc(input))!=10){
        originalBits++;
        if (c==32){
            length = len(codeTable[26]);
            n = codeTable[26];
        }
        else{
            length=len(codeTable[c-97]);
            n = codeTable[c-97];
        }

        while (length>0){
            compressedBits++;
            bit = n % 10 - 1;
            n /= 10;
            x = x | bit;
            bitsLeft--;
            length--;
            if (bitsLeft==0){
                fputc(x,output);
                x = 0;
                bitsLeft = 8;
            }
            x = x << 1;
        }
    }

    if (bitsLeft!=8){
        x = x << (bitsLeft-1);
        fputc(x,output);
    }

    /*print details of compression on the screen*/
    fprintf(stderr,"Original bits = %dn",originalBits*8);
    fprintf(stderr,"Compressed bits = %dn",compressedBits);
    fprintf(stderr,"Saved %.2f%% of memoryn",((float)compressedBits/(originalBits*8))*100);

    return;
}

/*function to decompress the input*/
void decompressFile (FILE *input, FILE *output, Node *tree){
    Node *current = tree;
    char c,bit;
    char mask = 1 << 7;
    int i;

    while ((c=fgetc(input))!=EOF){

        for (i=0;i<8;i++){
            bit = c & mask;
            c = c << 1;
            if (bit==0){
                current = current->left;
                if (current->letter!=127){
                    if (current->letter==26)
                        fputc(32, output);
                    else
                        fputc(current->letter+97,output);
                    current = tree;
                }
            }

            else{
                current = current->right;
                if (current->letter!=127){
                    if (current->letter==26)
                        fputc(32, output);
                    else
                        fputc(current->letter+97,output);
                    current = tree;
                }
            }
        }
    }

    return;
}

/*invert the codes in codeTable2 so they can be used with mod operator by compressFile function*/
void invertCodes(int codeTable[],int codeTable2[]){
    int i, n, copy;

    for (i=0;i<27;i++){
        n = codeTable[i];
        copy = 0;
        while (n>0){
            copy = copy * 10 + n %10;
            n /= 10;
        }
        codeTable2[i]=copy;
    }

return;
}

int main(){
    Node *tree;
    int codeTable[27], codeTable2[27];
    int compress;
    char filename[20];
    FILE *input, *output;

    buildHuffmanTree(&tree);

    fillTable(codeTable, tree, 0);

    invertCodes(codeTable,codeTable2);

    /*get input details from user*/
    printf("Type the name of the file to process:n");
    scanf("%s",filename);
    printf("Type 1 to compress and 2 to decompress:n");
    scanf("%d",&compress);

    input = fopen(filename, "r");
    output = fopen("output.txt","w");

    if (compress==1)
        compressFile(input,output,codeTable2);
    else
        decompressFile(input,output, tree);

    return 0;
}


31 thoughts on “Implementing Huffman Coding in C

  1. HImanshu Gupta

    #include
    #include
    #include

    #define len(x) ((int)log10(x)+1)

    /* Node of the huffman tree */
    struct node{
    int value;
    char letter;
    struct node *left,*right;
    };

    typedef struct node Node;

    /* 81 = 8.1%, 128 = 12.8% and so on. The 27th frequency is the space. Source is Wikipedia */
    int englishLetterFrequencies [27] = {81, 15, 28, 43, 128, 23, 20, 61, 71, 2, 1, 40, 24, 69, 76, 20, 1, 61, 64, 91, 28, 10, 24, 1, 20, 1, 130};

    /*finds and returns the small sub-tree in the forrest*/
    int findSmaller (Node *array[], int differentFrom){
    int smaller;
    int i = 0;

    while (array[i]->value==-1)
    i++;
    smaller=i;
    if (i==differentFrom){
    i++;
    while (array[i]->value==-1)
    i++;
    smaller=i;
    }

    for (i=1;ivalue==-1)
    continue;
    if (i==differentFrom)
    continue;
    if (array[i]->valuevalue)
    smaller = i;
    }

    return smaller;
    }

    /*builds the huffman tree and returns its address by reference*/
    void buildHuffmanTree(Node **tree){
    Node *temp;
    Node *array[27];
    int i, subTrees = 27;
    int smallOne,smallTwo;

    for (i=0;ivalue = englishLetterFrequencies[i];
    array[i]->letter = i;
    array[i]->left = NULL;
    array[i]->right = NULL;
    }

    while (subTrees>1){
    smallOne=findSmaller(array,-1);
    smallTwo=findSmaller(array,smallOne);
    temp = array[smallOne];
    array[smallOne] = malloc(sizeof(Node));
    array[smallOne]->value=temp->value+array[smallTwo]->value;
    array[smallOne]->letter=127;
    array[smallOne]->left=array[smallTwo];
    array[smallOne]->right=temp;
    array[smallTwo]->value=-1;
    subTrees–;
    }

    *tree = array[smallOne];

    return;
    }

    /* builds the table with the bits for each letter. 1 stands for binary 0 and 2 for binary 1 (used to facilitate arithmetic)*/
    void fillTable(int codeTable[], Node *tree, int Code){
    if (tree->letterletter] = Code;
    else{
    fillTable(codeTable, tree->left, Code*10+1);
    fillTable(codeTable, tree->right, Code*10+2);
    }

    return;
    }

    /*function to compress the input*/
    void compressFile(FILE *input, FILE *output, int codeTable[]){
    char bit, c, x = 0;
    int n,length,bitsLeft = 8;
    int originalBits = 0, compressedBits = 0;

    while ((c=fgetc(input))!=10){
    originalBits++;
    if (c==32){
    length = len(codeTable[26]);
    n = codeTable[26];
    }
    else{
    length=len(codeTable[c-97]);
    n = codeTable[c-97];
    }

    while (length>0){
    compressedBits++;
    bit = n % 10 – 1;
    n /= 10;
    x = x | bit;
    bitsLeft–;
    length–;
    if (bitsLeft==0){
    fputc(x,output);
    x = 0;
    bitsLeft = 8;
    }
    x = x << 1;
    }
    }

    if (bitsLeft!=8){
    x = x << (bitsLeft-1);
    fputc(x,output);
    }

    /*print details of compression on the screen*/
    fprintf(stderr,"Original bits = %dn",originalBits*8);
    fprintf(stderr,"Compressed bits = %dn",compressedBits);
    fprintf(stderr,"Saved %.2f%% of memoryn",((float)compressedBits/(originalBits*8))*100);

    return;
    }

    /*function to decompress the input*/
    void decompressFile (FILE *input, FILE *output, Node *tree){
    Node *current = tree;
    char c,bit;
    char mask = 1 << 7;
    int i;

    while ((c=fgetc(input))!=EOF){

    for (i=0;i<8;i++){
    bit = c & mask;
    c = c <left;
    if (current->letter!=127){
    if (current->letter==26)
    fputc(32, output);
    else
    fputc(current->letter+97,output);
    current = tree;
    }
    }

    else{
    current = current->right;
    if (current->letter!=127){
    if (current->letter==26)
    fputc(32, output);
    else
    fputc(current->letter+97,output);
    current = tree;
    }
    }
    }
    }

    return;
    }

    /*invert the codes in codeTable2 so they can be used with mod operator by compressFile function*/
    void invertCodes(int codeTable[],int codeTable2[]){
    int i, n, copy;

    for (i=0;i0){
    copy = copy * 10 + n %10;
    n /= 10;
    }
    codeTable2[i]=copy;
    }

    return;
    }

    int main(){
    Node *tree;
    int codeTable[27], codeTable2[27];
    int compress;
    char filename[20];
    FILE *input, *output;

    buildHuffmanTree(&tree);

    fillTable(codeTable, tree, 0);

    invertCodes(codeTable,codeTable2);

    /*get input details from user*/
    printf(“Type the name of the file to process:n”);
    scanf(“%s”,filename);
    printf(“Type 1 to compress and 2 to decompress:n”);
    scanf(“%d”,&compress);

    input = fopen(filename, “r”);
    output = fopen(“output.txt”,”w”);

    if (compress==1)
    compressFile(input,output,codeTable2);
    else
    decompressFile(input,output, tree);

    return 0;
    }
    someone plz gimme its output asap

    Reply
  2. naga lakshmi

    am getting log10: sing error.. i dont know how to correct that error.
    and also what type of data it accepts i mean values or alphabets

    Reply
    1. Rajesh Y

      Hi Lakshmi,

      I am also getting same problem , Did you get any solution ? if you get any solution just mail me “yarasani.rajesh@gmail.com”

      Problem :
      huf.c:(.text+0x3cf): undefined reference to `log10′
      huf.c:(.text+0x40b): undefined reference to `log10′
      collect2: error: ld returned 1 exit status

      waiting reply …thank u..

      Reply
      1. Huang Yibin

        in C language, there’s no function called log10 in math.h, if you want to calculate log10(2), you should calculate log(2) / log(10), log() is the only function in math.h to calculate Logarithm

        Reply
        1. blue

          There is in fact a log function in C.
          log => ln(x)
          log2(x) => base 2
          log10(x) => base 10
          log1p => log( 1 + x )

          http://www.codecogs.com/library/computing/c/math.h/log.php

          NB Log only takes type: ( long long x ), ( double x ), ( float x )

          As it relates to this, i too am getting an error. But it’s not that the function doesn’t exist, but instead it is taking incorrect data type.
          I tried going around it, by implementing log10 directly( log10( array ) + 1 ) on the line of code, “length = len(codeTable[26]);” , but no matter what I do, no matter what I cast, no matter whose data type I change, it won’t compress.
          Changing data types does remove the error message but in effect cause a logic error.

          Reply
          1. Takagi

            log10 is part of . So, in order to correctly compile the code, you must add a -lm at THE END (this is crucial) of your code, so, i.e.
            $(CC) … -o a.exe -lm

      2. Sarwesh Shah

        log10 only accepts float, long float or double.
        Typecast your input (in this case x) to float
        thus change code
        #define len(x) ((int)log10(x)+1)
        to
        #define len(x) ((int)log10(float(x))+1)

        Reply
    1. Yatin

      its giving an error how you solved it
      log10:domain error
      and where the compressed file will be stored????
      i m giving file name directly abc.txt
      without path.
      help if u can…

      Reply
  3. John

    Code is int defined in void fillTable(int codeTable[], Node *tree, int Code){…

    The largest int is 32767, if the tree has more than 5 levels, this line

    fillTable(codeTable, tree->left, Code*10+1);

    will cause overflow.

    Reply
  4. Edna

    On UBUNTU, log10 error can be fixed like this:
    gcc huffman.c -o huffman -lm

    I can do compress, but in output file I get all kinds of signs, more like old Egiptian signs, and when I do decompress of output.txt, i don’t get anything? Anyone can help me???

    Thanks

    Reply
    1. Yashira

      Change the compressed file name and decompress that file using the new name. (Example change output.txt as example.txt and decompress example.txt file)

      Reply
  5. Sinval

    Friend I really liked your project, but I could not run it, open it to put the file name that will be used to compress, and then pressed the option to compress, but apparently it goes into a loop where it is generating a characters in a growing output.txt file.

    Reply
  6. DUFAY Paul

    Hi, i’m a french student, so please apologize me if my english could be Clumsy.

    I try to use this algorithm in parallelization programming but, when i want to use a really large text, to slow down my computer and see if i could improve the code, after compression and decompression, the final text lose a large part.

    Do you have an explication?

    Thanks for your help

    Reply
  7. Shaun

    A good article! Thanks! Not sure why so many are having issues with compiling / figuring out how to use – works fine for me after reading through the code and understanding what its doing.
    Does the Huffman principle only work with a-z and space? If one were to add additional characters such as ‘@’ or ‘-‘ or ‘.’, could one theoretically increment the array size from 27 to encompass the required character set (and of course ordered in frequency where applicable) and still work? Is there a finite limit to the number of characters one wishes to utilize before the algorithm loses effectiveness?

    Reply
  8. Paul

    I’ve got such problem :
    Error error C2440: ‘=’ : cannot convert from ‘void *’ to ‘Node *’

    and code where that error is :

    array[i] = malloc(sizeof(Node));

    (with equal sign underlined)

    Reply
  9. NintendoDS

    I’m using this algorithm because my professor copied and pasted this as an exercise for me, making me implement the compress method.

    Reply

Leave a Reply

Your email address will not be published. Required fields are marked *