Implementing Huffman Coding in C

Improve your writing skills in 5 minutes a day with the Daily Writing Tips email newsletter.

Huffman Coding (link to Wikipedia) is a compression algorithm used for loss-less data compression. Here’s the basic idea: each ASCII character is usually represented with 8 bits, but if we had a text filed composed of only the lowercase a-z letters we could represent each character with only 5 bits (i.e., 2^5 = 32, which is enough to represent 26 values), thus reducing the overall memory needed to store the data.

huffman-coding-c

For example, the table of characters -> binary code could look like this:

a = 00000
b = 00001
c = 00010
d = 00011
e = 00100

And so on.

This is the fixed-length representation, and it already creates a significant compression rate (around 35% in the example above).

A more efficient approach is to use a variable-length representation, where each character can have a different number of bits to represent it. More specifically we first analyze the frequency of each character in the text, and then we create a binary tree (called Huffman tree) giving a shorter bit representation to the most used characters, so that they can be reached faster. Notice that it must be a prefix tree (i.e., the code of every letter can’t be prefix to the code of any other letter) else the decompression wouldn’t work.

For example, the frequency of the letters in the English language (according to Wikipedia) is the following:

english-letter-frequency

Now the algorithm to create the Huffman tree is the following:

Create a forest with one tree for each letter and its respective frequency as value
Join the two trees with the lowest value, removing each from the forest and adding instead the resulting combined tree
Repeat until there’s only one tree left

The Huffman tree for the a-z letters (and the space character) using the frequency table above would look like this (you would need to expand to the lower branches to see all the letters):

huffman-tree-1

We can traverse the tree and create a table with all the letters and their respective binary codes. It would look like this:

a = 0011
b = 011011
c = 11111
d = 00100
e = 101
f = 000010
g = 011001
h = 1110
i = 0101
j = 011010101
k = 0110101001
l = 00101
m = 000001
n = 0111
o = 0100
p = 011000
q = 0110101000
r = 1101
s = 1100
t = 0001
u = 11110
v = 0110100
w = 000000
x = 011010111
y = 000011
z = 011010110
space = 100

After that you just need to read the original text file letter by letter and to output the respective binary code. If you want to decompress a file, on the other hand, you read bit by bit and move along the Huffman tree until you find a letter, at which point you move back to the root of the tree and continue processing the bits.

Below you’ll find a C implementing of the Huffman coding (it includes all the parts, including the creating of the Huffman tree, the table and so on). If you prefer here’s the huffman.zip file for download.


#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define len(x) ((int)log10(x)+1)

/* Node of the huffman tree */
struct node{
    int value;
    char letter;
    struct node *left,*right;
};

typedef struct node Node;

/* 81 = 8.1%, 128 = 12.8% and so on. The 27th frequency is the space. Source is Wikipedia */
int englishLetterFrequencies [27] = {81, 15, 28, 43, 128, 23, 20, 61, 71, 2, 1, 40, 24, 69, 76, 20, 1, 61, 64, 91, 28, 10, 24, 1, 20, 1, 130};

/*finds and returns the small sub-tree in the forrest*/
int findSmaller (Node *array[], int differentFrom){
    int smaller;
    int i = 0;

    while (array[i]->value==-1)
        i++;
    smaller=i;
    if (i==differentFrom){
        i++;
        while (array[i]->value==-1)
            i++;
        smaller=i;
    }

    for (i=1;i<27;i++){
        if (array[i]->value==-1)
            continue;
        if (i==differentFrom)
            continue;
        if (array[i]->value<array[smaller]->value)
            smaller = i;
    }

    return smaller;
}

/*builds the huffman tree and returns its address by reference*/
void buildHuffmanTree(Node **tree){
    Node *temp;
    Node *array[27];
    int i, subTrees = 27;
    int smallOne,smallTwo;

    for (i=0;i<27;i++){
        array[i] = malloc(sizeof(Node));
        array[i]->value = englishLetterFrequencies[i];
        array[i]->letter = i;
        array[i]->left = NULL;
        array[i]->right = NULL;
    }

    while (subTrees>1){
        smallOne=findSmaller(array,-1);
        smallTwo=findSmaller(array,smallOne);
        temp = array[smallOne];
        array[smallOne] = malloc(sizeof(Node));
        array[smallOne]->value=temp->value+array[smallTwo]->value;
        array[smallOne]->letter=127;
        array[smallOne]->left=array[smallTwo];
        array[smallOne]->right=temp;
        array[smallTwo]->value=-1;
        subTrees--;
    }

    *tree = array[smallOne];

return;
}

/* builds the table with the bits for each letter. 1 stands for binary 0 and 2 for binary 1 (used to facilitate arithmetic)*/
void fillTable(int codeTable[], Node *tree, int Code){
    if (tree->letter<27)
        codeTable[(int)tree->letter] = Code;
    else{
        fillTable(codeTable, tree->left, Code*10+1);
        fillTable(codeTable, tree->right, Code*10+2);
    }

    return;
}

/*function to compress the input*/
void compressFile(FILE *input, FILE *output, int codeTable[]){
    char bit, c, x = 0;
    int n,length,bitsLeft = 8;
    int originalBits = 0, compressedBits = 0;

    while ((c=fgetc(input))!=10){
        originalBits++;
        if (c==32){
            length = len(codeTable[26]);
            n = codeTable[26];
        }
        else{
            length=len(codeTable[c-97]);
            n = codeTable[c-97];
        }

        while (length>0){
            compressedBits++;
            bit = n % 10 - 1;
            n /= 10;
            x = x | bit;
            bitsLeft--;
            length--;
            if (bitsLeft==0){
                fputc(x,output);
                x = 0;
                bitsLeft = 8;
            }
            x = x << 1;
        }
    }

    if (bitsLeft!=8){
        x = x << (bitsLeft-1);
        fputc(x,output);
    }

    /*print details of compression on the screen*/
    fprintf(stderr,"Original bits = %dn",originalBits*8);
    fprintf(stderr,"Compressed bits = %dn",compressedBits);
    fprintf(stderr,"Saved %.2f%% of memoryn",((float)compressedBits/(originalBits*8))*100);

    return;
}

/*function to decompress the input*/
void decompressFile (FILE *input, FILE *output, Node *tree){
    Node *current = tree;
    char c,bit;
    char mask = 1 << 7;
    int i;

    while ((c=fgetc(input))!=EOF){

        for (i=0;i<8;i++){
            bit = c & mask;
            c = c << 1;
            if (bit==0){
                current = current->left;
                if (current->letter!=127){
                    if (current->letter==26)
                        fputc(32, output);
                    else
                        fputc(current->letter+97,output);
                    current = tree;
                }
            }

            else{
                current = current->right;
                if (current->letter!=127){
                    if (current->letter==26)
                        fputc(32, output);
                    else
                        fputc(current->letter+97,output);
                    current = tree;
                }
            }
        }
    }

    return;
}

/*invert the codes in codeTable2 so they can be used with mod operator by compressFile function*/
void invertCodes(int codeTable[],int codeTable2[]){
    int i, n, copy;

    for (i=0;i<27;i++){
        n = codeTable[i];
        copy = 0;
        while (n>0){
            copy = copy * 10 + n %10;
            n /= 10;
        }
        codeTable2[i]=copy;
    }

return;
}

int main(){
    Node *tree;
    int codeTable[27], codeTable2[27];
    int compress;
    char filename[20];
    FILE *input, *output;

    buildHuffmanTree(&tree);

    fillTable(codeTable, tree, 0);

    invertCodes(codeTable,codeTable2);

    /*get input details from user*/
    printf("Type the name of the file to process:n");
    scanf("%s",filename);
    printf("Type 1 to compress and 2 to decompress:n");
    scanf("%d",&compress);

    input = fopen(filename, "r");
    output = fopen("output.txt","w");

    if (compress==1)
        compressFile(input,output,codeTable2);
    else
        decompressFile(input,output, tree);

    return 0;
}

37 thoughts on “Implementing Huffman Coding in C”

HImanshu Gupta February 21, 2012 at 7:33 am

#include
#include
#include

#define len(x) ((int)log10(x)+1)

/* Node of the huffman tree */
struct node{
int value;
char letter;
struct node *left,*right;
};

typedef struct node Node;

/* 81 = 8.1%, 128 = 12.8% and so on. The 27th frequency is the space. Source is Wikipedia */
int englishLetterFrequencies [27] = {81, 15, 28, 43, 128, 23, 20, 61, 71, 2, 1, 40, 24, 69, 76, 20, 1, 61, 64, 91, 28, 10, 24, 1, 20, 1, 130};

/*finds and returns the small sub-tree in the forrest*/
int findSmaller (Node *array[], int differentFrom){
int smaller;
int i = 0;

while (array[i]->value==-1)
i++;
smaller=i;
if (i==differentFrom){
i++;
while (array[i]->value==-1)
i++;
smaller=i;
}

for (i=1;ivalue==-1)
continue;
if (i==differentFrom)
continue;
if (array[i]->valuevalue)
smaller = i;
}

return smaller;
}

/*builds the huffman tree and returns its address by reference*/
void buildHuffmanTree(Node **tree){
Node *temp;
Node *array[27];
int i, subTrees = 27;
int smallOne,smallTwo;

for (i=0;ivalue = englishLetterFrequencies[i];
array[i]->letter = i;
array[i]->left = NULL;
array[i]->right = NULL;
}

while (subTrees>1){
smallOne=findSmaller(array,-1);
smallTwo=findSmaller(array,smallOne);
temp = array[smallOne];
array[smallOne] = malloc(sizeof(Node));
array[smallOne]->value=temp->value+array[smallTwo]->value;
array[smallOne]->letter=127;
array[smallOne]->left=array[smallTwo];
array[smallOne]->right=temp;
array[smallTwo]->value=-1;
subTrees–;
}

*tree = array[smallOne];

return;
}

/* builds the table with the bits for each letter. 1 stands for binary 0 and 2 for binary 1 (used to facilitate arithmetic)*/
void fillTable(int codeTable[], Node *tree, int Code){
if (tree->letterletter] = Code;
else{
fillTable(codeTable, tree->left, Code*10+1);
fillTable(codeTable, tree->right, Code*10+2);
}

return;
}

/*function to compress the input*/
void compressFile(FILE *input, FILE *output, int codeTable[]){
char bit, c, x = 0;
int n,length,bitsLeft = 8;
int originalBits = 0, compressedBits = 0;

while ((c=fgetc(input))!=10){
originalBits++;
if (c==32){
length = len(codeTable[26]);
n = codeTable[26];
}
else{
length=len(codeTable[c-97]);
n = codeTable[c-97];
}

while (length>0){
compressedBits++;
bit = n % 10 – 1;
n /= 10;
x = x | bit;
bitsLeft–;
length–;
if (bitsLeft==0){
fputc(x,output);
x = 0;
bitsLeft = 8;
}
x = x << 1;
}
}

if (bitsLeft!=8){
x = x << (bitsLeft-1);
fputc(x,output);
}

/*print details of compression on the screen*/
fprintf(stderr,"Original bits = %dn",originalBits*8);
fprintf(stderr,"Compressed bits = %dn",compressedBits);
fprintf(stderr,"Saved %.2f%% of memoryn",((float)compressedBits/(originalBits*8))*100);

return;
}

/*function to decompress the input*/
void decompressFile (FILE *input, FILE *output, Node *tree){
Node *current = tree;
char c,bit;
char mask = 1 << 7;
int i;

while ((c=fgetc(input))!=EOF){

for (i=0;i<8;i++){
bit = c & mask;
c = c <left;
if (current->letter!=127){
if (current->letter==26)
fputc(32, output);
else
fputc(current->letter+97,output);
current = tree;
}
}

else{
current = current->right;
if (current->letter!=127){
if (current->letter==26)
fputc(32, output);
else
fputc(current->letter+97,output);
current = tree;
}
}
}
}

return;
}

/*invert the codes in codeTable2 so they can be used with mod operator by compressFile function*/
void invertCodes(int codeTable[],int codeTable2[]){
int i, n, copy;

for (i=0;i0){
copy = copy * 10 + n %10;
n /= 10;
}
codeTable2[i]=copy;
}

return;
}

int main(){
Node *tree;
int codeTable[27], codeTable2[27];
int compress;
char filename[20];
FILE *input, *output;

buildHuffmanTree(&tree);

fillTable(codeTable, tree, 0);

invertCodes(codeTable,codeTable2);

/*get input details from user*/
printf(“Type the name of the file to process:n”);
scanf(“%s”,filename);
printf(“Type 1 to compress and 2 to decompress:n”);
scanf(“%d”,&compress);

input = fopen(filename, “r”);
output = fopen(“output.txt”,”w”);

if (compress==1)
compressFile(input,output,codeTable2);
else
decompressFile(input,output, tree);

return 0;
}
someone plz gimme its output asap

Reply ↓
sumpi February 26, 2012 at 2:44 pm

i am not able to run.Domain error.why???????????

Reply ↓
1. Krishna January 11, 2013 at 5:18 pm
  
  @sumpi. If you are using gcc, do this.
  gcc filename.c -lm -0 anyname
  
  Reply ↓
Shobhit March 13, 2013 at 7:31 pm

I am using turboo c ++ and i am getting domain error arrey out of bound. Kindly tell me the solution.
Thank you

Reply ↓
naga lakshmi May 19, 2013 at 6:16 am

am getting log10: sing error.. i dont know how to correct that error.
and also what type of data it accepts i mean values or alphabets

Reply ↓
1. Rajesh Y October 8, 2013 at 11:06 am
  
  Hi Lakshmi,
  
  I am also getting same problem , Did you get any solution ? if you get any solution just mail me “yarasani.rajesh@gmail.com”
  
  Problem :
  huf.c:(.text+0x3cf): undefined reference to `log10′
  huf.c:(.text+0x40b): undefined reference to `log10′
  collect2: error: ld returned 1 exit status
  
  waiting reply …thank u..
  
  Reply ↓
  1. Huang Yibin February 13, 2014 at 6:59 pm
    
    in C language, there’s no function called log10 in math.h, if you want to calculate log10(2), you should calculate log(2) / log(10), log() is the only function in math.h to calculate Logarithm
    
    Reply ↓
    1. blue April 24, 2014 at 7:31 am
      
      There is in fact a log function in C.
      log => ln(x)
      log2(x) => base 2
      log10(x) => base 10
      log1p => log( 1 + x )
      
      http://www.codecogs.com/library/computing/c/math.h/log.php
      
      NB Log only takes type: ( long long x ), ( double x ), ( float x )
      
      As it relates to this, i too am getting an error. But it’s not that the function doesn’t exist, but instead it is taking incorrect data type.
      I tried going around it, by implementing log10 directly( log10( array ) + 1 ) on the line of code, “length = len(codeTable[26]);” , but no matter what I do, no matter what I cast, no matter whose data type I change, it won’t compress.
      Changing data types does remove the error message but in effect cause a logic error.
      
      Reply ↓
      1. Takagi May 29, 2014 at 7:01 pm
        
        log10 is part of . So, in order to correctly compile the code, you must add a -lm at THE END (this is crucial) of your code, so, i.e.
        $(CC) … -o a.exe -lm
  2. Sarwesh Shah July 16, 2014 at 6:41 am
    
    log10 only accepts float, long float or double.
    Typecast your input (in this case x) to float
    thus change code
    #define len(x) ((int)log10(x)+1)
    to
    #define len(x) ((int)log10(float(x))+1)
    
    Reply ↓
hari May 25, 2013 at 2:08 pm

it is working only for one file,how can compress multiple file?

Reply ↓
1. Yatin October 11, 2014 at 8:58 am
  
  its giving an error how you solved it
  log10:domain error
  and where the compressed file will be stored????
  i m giving file name directly abc.txt
  without path.
  help if u can…
  
  Reply ↓
dhiraj July 1, 2013 at 7:46 am

thanks I realy need it

Reply ↓
John August 4, 2013 at 11:35 pm

Code is int defined in void fillTable(int codeTable[], Node *tree, int Code){…

The largest int is 32767, if the tree has more than 5 levels, this line

fillTable(codeTable, tree->left, Code*10+1);

will cause overflow.

Reply ↓
1. muthu October 6, 2015 at 7:03 am
  
  could you please reply me how we can compress a huge file with this code?
  
  Reply ↓
Sunny Dizon December 26, 2013 at 2:52 pm

Hi. Just want to ask, how’d you solve that Log10:domain error. Thanks for the immediate response.

Reply ↓
1. momo November 30, 2016 at 11:42 am
  
  you simply add -lm at the end when you compile
  
  eg.
  gcc -o huffman huffman.c -lm
  
  Reply ↓
Amaro Junior January 2, 2014 at 7:15 pm

Great!

Reply ↓
Edna February 19, 2014 at 8:24 pm

On UBUNTU, log10 error can be fixed like this:
gcc huffman.c -o huffman -lm

I can do compress, but in output file I get all kinds of signs, more like old Egiptian signs, and when I do decompress of output.txt, i don’t get anything? Anyone can help me???

Thanks

Reply ↓
shalini April 29, 2014 at 4:46 pm

Fr compression it s working fine but it s not getting decompressed. Can u pls help me regarding this issue??

Reply ↓
1. Yashira May 31, 2014 at 3:11 pm
  
  Change the compressed file name and decompress that file using the new name. (Example change output.txt as example.txt and decompress example.txt file)
  
  Reply ↓
2. arun October 12, 2014 at 12:01 pm
  
  kindy end me the proper working code.my id is:arunsecret.kumar@gmail.com
  kindly do the needfull
  
  Reply ↓
Sinval August 19, 2014 at 3:08 pm

Friend I really liked your project, but I could not run it, open it to put the file name that will be used to compress, and then pressed the option to compress, but apparently it goes into a loop where it is generating a characters in a growing output.txt file.

Reply ↓
DUFAY Paul November 24, 2014 at 4:40 pm

Hi, i’m a french student, so please apologize me if my english could be Clumsy.

I try to use this algorithm in parallelization programming but, when i want to use a really large text, to slow down my computer and see if i could improve the code, after compression and decompression, the final text lose a large part.

Do you have an explication?

Thanks for your help

Reply ↓
Shaun January 11, 2015 at 6:25 am

A good article! Thanks! Not sure why so many are having issues with compiling / figuring out how to use – works fine for me after reading through the code and understanding what its doing.
Does the Huffman principle only work with a-z and space? If one were to add additional characters such as ‘@’ or ‘-‘ or ‘.’, could one theoretically increment the array size from 27 to encompass the required character set (and of course ordered in frequency where applicable) and still work? Is there a finite limit to the number of characters one wishes to utilize before the algorithm loses effectiveness?

Reply ↓
Paul January 24, 2015 at 9:01 pm

I’ve got such problem :
Error error C2440: ‘=’ : cannot convert from ‘void *’ to ‘Node *’

and code where that error is :

array[i] = malloc(sizeof(Node));

(with equal sign underlined)

Reply ↓
1. Vosiz November 6, 2015 at 9:22 am
  
  try this:
  array[i] = (Node *)malloc(sizeof(Node));
  
  Reply ↓
2. Test November 12, 2015 at 10:42 pm
  
  Just cast array[i] = (Node*)malloc(sizeof(Node))
  
  Reply ↓
sandhiya October 16, 2015 at 1:44 am

kindly sen me the proper code to mail id:sunithabistm@gmail.com

Reply ↓
NintendoDS November 4, 2015 at 7:50 pm

I’m using this algorithm because my professor copied and pasted this as an exercise for me, making me implement the compress method.

Reply ↓
Karan Shah November 7, 2016 at 8:06 pm

Check this one. It helped me. A very different approach here:
http://codehuntertech.com/data-structures/trees/huffman-tree/

Reply ↓
ralph March 11, 2017 at 9:26 am

While compiling the huffman code it ask me the file to type pleace help me solve this

Reply ↓
Nitesh April 2, 2017 at 10:47 am

Whats the output of this program….Can you guys give the input and output of this program!!

Reply ↓
1. Aditya May 12, 2018 at 11:25 am
  
  input can be any file which contains that data with alphabets and spaces. This code may not compress if there is any special characters like ‘-‘, ‘:’, etc… and output contains binary data which will be unreadable format.
  
  Reply ↓
Sunny June 7, 2017 at 4:58 am

I compiled this code currently, but I get run time error “Segmentation Fault (Core Dumped)

Reply ↓
1. Aditya May 12, 2018 at 11:26 am
  
  just debug and find it we can help you …
  
  Reply ↓
AlfhaOmega May 24, 2018 at 3:04 am

use dev cpp to run the code

Reply ↓

Programming Logic

Algorithms, Computer Science and Programming Puzzles

Implementing Huffman Coding in C

37 thoughts on “Implementing Huffman Coding in C”

Leave a Reply Cancel reply