# Implementing Huffman Coding in C

Huffman Coding (link to Wikipedia) is a compression algorithm used for loss-less data compression. Here’s the basic idea: each ASCII character is usually represented with 8 bits, but if we had a text filed composed of only the lowercase a-z letters we could represent each character with only 5 bits (i.e., 2^5 = 32, which is enough to represent 26 values), thus reducing the overall memory needed to store the data. For example, the table of characters -> binary code could look like this:

• a = 00000
• b = 00001
• c = 00010
• d = 00011
• e = 00100

And so on.

This is the fixed-length representation, and it already creates a significant compression rate (around 35% in the example above).

A more efficient approach is to use a variable-length representation, where each character can have a different number of bits to represent it. More specifically we first analyze the frequency of each character in the text, and then we create a binary tree (called Huffman tree) giving a shorter bit representation to the most used characters, so that they can be reached faster. Notice that it must be a prefix tree (i.e., the code of every letter can’t be prefix to the code of any other letter) else the decompression wouldn’t work.

For example, the frequency of the letters in the English language (according to Wikipedia) is the following: Now the algorithm to create the Huffman tree is the following:

• Create a forest with one tree for each letter and its respective frequency as value
• Join the two trees with the lowest value, removing each from the forest and adding instead the resulting combined tree
• Repeat until there’s only one tree left

The Huffman tree for the a-z letters (and the space character) using the frequency table above would look like this (you would need to expand to the lower branches to see all the letters): We can traverse the tree and create a table with all the letters and their respective binary codes. It would look like this:

``````a = 0011
b = 011011
c = 11111
d = 00100
e = 101
f = 000010
g = 011001
h = 1110
i = 0101
j = 011010101
k = 0110101001
l = 00101
m = 000001
n = 0111
o = 0100
p = 011000
q = 0110101000
r = 1101
s = 1100
t = 0001
u = 11110
v = 0110100
w = 000000
x = 011010111
y = 000011
z = 011010110
space = 100``````

After that you just need to read the original text file letter by letter and to output the respective binary code. If you want to decompress a file, on the other hand, you read bit by bit and move along the Huffman tree until you find a letter, at which point you move back to the root of the tree and continue processing the bits.

Below you’ll find a C implementing of the Huffman coding (it includes all the parts, including the creating of the Huffman tree, the table and so on). If you prefer here’s the huffman.zip file for download.

``````
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define len(x) ((int)log10(x)+1)

/* Node of the huffman tree */
struct node{
int value;
char letter;
struct node *left,*right;
};

typedef struct node Node;

/* 81 = 8.1%, 128 = 12.8% and so on. The 27th frequency is the space. Source is Wikipedia */
int englishLetterFrequencies  = {81, 15, 28, 43, 128, 23, 20, 61, 71, 2, 1, 40, 24, 69, 76, 20, 1, 61, 64, 91, 28, 10, 24, 1, 20, 1, 130};

/*finds and returns the small sub-tree in the forrest*/
int findSmaller (Node *array[], int differentFrom){
int smaller;
int i = 0;

while (array[i]->value==-1)
i++;
smaller=i;
if (i==differentFrom){
i++;
while (array[i]->value==-1)
i++;
smaller=i;
}

for (i=1;i<27;i++){
if (array[i]->value==-1)
continue;
if (i==differentFrom)
continue;
if (array[i]->value<array[smaller]->value)
smaller = i;
}

return smaller;
}

/*builds the huffman tree and returns its address by reference*/
void buildHuffmanTree(Node **tree){
Node *temp;
Node *array;
int i, subTrees = 27;
int smallOne,smallTwo;

for (i=0;i<27;i++){
array[i] = malloc(sizeof(Node));
array[i]->value = englishLetterFrequencies[i];
array[i]->letter = i;
array[i]->left = NULL;
array[i]->right = NULL;
}

while (subTrees>1){
smallOne=findSmaller(array,-1);
smallTwo=findSmaller(array,smallOne);
temp = array[smallOne];
array[smallOne] = malloc(sizeof(Node));
array[smallOne]->value=temp->value+array[smallTwo]->value;
array[smallOne]->letter=127;
array[smallOne]->left=array[smallTwo];
array[smallOne]->right=temp;
array[smallTwo]->value=-1;
subTrees--;
}

*tree = array[smallOne];

return;
}

/* builds the table with the bits for each letter. 1 stands for binary 0 and 2 for binary 1 (used to facilitate arithmetic)*/
void fillTable(int codeTable[], Node *tree, int Code){
if (tree->letter<27)
codeTable[(int)tree->letter] = Code;
else{
fillTable(codeTable, tree->left, Code*10+1);
fillTable(codeTable, tree->right, Code*10+2);
}

return;
}

/*function to compress the input*/
void compressFile(FILE *input, FILE *output, int codeTable[]){
char bit, c, x = 0;
int n,length,bitsLeft = 8;
int originalBits = 0, compressedBits = 0;

while ((c=fgetc(input))!=10){
originalBits++;
if (c==32){
length = len(codeTable);
n = codeTable;
}
else{
length=len(codeTable[c-97]);
n = codeTable[c-97];
}

while (length>0){
compressedBits++;
bit = n % 10 - 1;
n /= 10;
x = x | bit;
bitsLeft--;
length--;
if (bitsLeft==0){
fputc(x,output);
x = 0;
bitsLeft = 8;
}
x = x << 1;
}
}

if (bitsLeft!=8){
x = x << (bitsLeft-1);
fputc(x,output);
}

/*print details of compression on the screen*/
fprintf(stderr,"Original bits = %dn",originalBits*8);
fprintf(stderr,"Compressed bits = %dn",compressedBits);
fprintf(stderr,"Saved %.2f%% of memoryn",((float)compressedBits/(originalBits*8))*100);

return;
}

/*function to decompress the input*/
void decompressFile (FILE *input, FILE *output, Node *tree){
Node *current = tree;
char c,bit;
char mask = 1 << 7;
int i;

while ((c=fgetc(input))!=EOF){

for (i=0;i<8;i++){
c = c << 1;
if (bit==0){
current = current->left;
if (current->letter!=127){
if (current->letter==26)
fputc(32, output);
else
fputc(current->letter+97,output);
current = tree;
}
}

else{
current = current->right;
if (current->letter!=127){
if (current->letter==26)
fputc(32, output);
else
fputc(current->letter+97,output);
current = tree;
}
}
}
}

return;
}

/*invert the codes in codeTable2 so they can be used with mod operator by compressFile function*/
void invertCodes(int codeTable[],int codeTable2[]){
int i, n, copy;

for (i=0;i<27;i++){
n = codeTable[i];
copy = 0;
while (n>0){
copy = copy * 10 + n %10;
n /= 10;
}
codeTable2[i]=copy;
}

return;
}

int main(){
Node *tree;
int codeTable, codeTable2;
int compress;
char filename;
FILE *input, *output;

buildHuffmanTree(&tree);

fillTable(codeTable, tree, 0);

invertCodes(codeTable,codeTable2);

/*get input details from user*/
printf("Type the name of the file to process:n");
scanf("%s",filename);
printf("Type 1 to compress and 2 to decompress:n");
scanf("%d",&compress);

input = fopen(filename, "r");
output = fopen("output.txt","w");

if (compress==1)
compressFile(input,output,codeTable2);
else
decompressFile(input,output, tree);

return 0;
}
``````

## 37 thoughts on “Implementing Huffman Coding in C”

1. HImanshu Gupta

#include
#include
#include

#define len(x) ((int)log10(x)+1)

/* Node of the huffman tree */
struct node{
int value;
char letter;
struct node *left,*right;
};

typedef struct node Node;

/* 81 = 8.1%, 128 = 12.8% and so on. The 27th frequency is the space. Source is Wikipedia */
int englishLetterFrequencies  = {81, 15, 28, 43, 128, 23, 20, 61, 71, 2, 1, 40, 24, 69, 76, 20, 1, 61, 64, 91, 28, 10, 24, 1, 20, 1, 130};

/*finds and returns the small sub-tree in the forrest*/
int findSmaller (Node *array[], int differentFrom){
int smaller;
int i = 0;

while (array[i]->value==-1)
i++;
smaller=i;
if (i==differentFrom){
i++;
while (array[i]->value==-1)
i++;
smaller=i;
}

for (i=1;ivalue==-1)
continue;
if (i==differentFrom)
continue;
if (array[i]->valuevalue)
smaller = i;
}

return smaller;
}

/*builds the huffman tree and returns its address by reference*/
void buildHuffmanTree(Node **tree){
Node *temp;
Node *array;
int i, subTrees = 27;
int smallOne,smallTwo;

for (i=0;ivalue = englishLetterFrequencies[i];
array[i]->letter = i;
array[i]->left = NULL;
array[i]->right = NULL;
}

while (subTrees>1){
smallOne=findSmaller(array,-1);
smallTwo=findSmaller(array,smallOne);
temp = array[smallOne];
array[smallOne] = malloc(sizeof(Node));
array[smallOne]->value=temp->value+array[smallTwo]->value;
array[smallOne]->letter=127;
array[smallOne]->left=array[smallTwo];
array[smallOne]->right=temp;
array[smallTwo]->value=-1;
subTrees–;
}

*tree = array[smallOne];

return;
}

/* builds the table with the bits for each letter. 1 stands for binary 0 and 2 for binary 1 (used to facilitate arithmetic)*/
void fillTable(int codeTable[], Node *tree, int Code){
if (tree->letterletter] = Code;
else{
fillTable(codeTable, tree->left, Code*10+1);
fillTable(codeTable, tree->right, Code*10+2);
}

return;
}

/*function to compress the input*/
void compressFile(FILE *input, FILE *output, int codeTable[]){
char bit, c, x = 0;
int n,length,bitsLeft = 8;
int originalBits = 0, compressedBits = 0;

while ((c=fgetc(input))!=10){
originalBits++;
if (c==32){
length = len(codeTable);
n = codeTable;
}
else{
length=len(codeTable[c-97]);
n = codeTable[c-97];
}

while (length>0){
compressedBits++;
bit = n % 10 – 1;
n /= 10;
x = x | bit;
bitsLeft–;
length–;
if (bitsLeft==0){
fputc(x,output);
x = 0;
bitsLeft = 8;
}
x = x << 1;
}
}

if (bitsLeft!=8){
x = x << (bitsLeft-1);
fputc(x,output);
}

/*print details of compression on the screen*/
fprintf(stderr,"Original bits = %dn",originalBits*8);
fprintf(stderr,"Compressed bits = %dn",compressedBits);
fprintf(stderr,"Saved %.2f%% of memoryn",((float)compressedBits/(originalBits*8))*100);

return;
}

/*function to decompress the input*/
void decompressFile (FILE *input, FILE *output, Node *tree){
Node *current = tree;
char c,bit;
char mask = 1 << 7;
int i;

while ((c=fgetc(input))!=EOF){

for (i=0;i<8;i++){
c = c <left;
if (current->letter!=127){
if (current->letter==26)
fputc(32, output);
else
fputc(current->letter+97,output);
current = tree;
}
}

else{
current = current->right;
if (current->letter!=127){
if (current->letter==26)
fputc(32, output);
else
fputc(current->letter+97,output);
current = tree;
}
}
}
}

return;
}

/*invert the codes in codeTable2 so they can be used with mod operator by compressFile function*/
void invertCodes(int codeTable[],int codeTable2[]){
int i, n, copy;

for (i=0;i0){
copy = copy * 10 + n %10;
n /= 10;
}
codeTable2[i]=copy;
}

return;
}

int main(){
Node *tree;
int codeTable, codeTable2;
int compress;
char filename;
FILE *input, *output;

buildHuffmanTree(&tree);

fillTable(codeTable, tree, 0);

invertCodes(codeTable,codeTable2);

/*get input details from user*/
printf(“Type the name of the file to process:n”);
scanf(“%s”,filename);
printf(“Type 1 to compress and 2 to decompress:n”);
scanf(“%d”,&compress);

input = fopen(filename, “r”);
output = fopen(“output.txt”,”w”);

if (compress==1)
compressFile(input,output,codeTable2);
else
decompressFile(input,output, tree);

return 0;
}
someone plz gimme its output asap

2. sumpi

i am not able to run.Domain error.why???????????

1. Krishna

@sumpi. If you are using gcc, do this.
gcc filename.c -lm -0 anyname

3. Shobhit

I am using turboo c ++ and i am getting domain error arrey out of bound. Kindly tell me the solution.
Thank you

4. naga lakshmi

am getting log10: sing error.. i dont know how to correct that error.
and also what type of data it accepts i mean values or alphabets

1. Rajesh Y

Hi Lakshmi,

I am also getting same problem , Did you get any solution ? if you get any solution just mail me “yarasani.rajesh@gmail.com”

Problem :
huf.c:(.text+0x3cf): undefined reference to `log10′
huf.c:(.text+0x40b): undefined reference to `log10′
collect2: error: ld returned 1 exit status

1. Huang Yibin

in C language, there’s no function called log10 in math.h, if you want to calculate log10(2), you should calculate log(2) / log(10), log() is the only function in math.h to calculate Logarithm

1. blue

There is in fact a log function in C.
log => ln(x)
log2(x) => base 2
log10(x) => base 10
log1p => log( 1 + x )

http://www.codecogs.com/library/computing/c/math.h/log.php

NB Log only takes type: ( long long x ), ( double x ), ( float x )

As it relates to this, i too am getting an error. But it’s not that the function doesn’t exist, but instead it is taking incorrect data type.
I tried going around it, by implementing log10 directly( log10( array ) + 1 ) on the line of code, “length = len(codeTable);” , but no matter what I do, no matter what I cast, no matter whose data type I change, it won’t compress.
Changing data types does remove the error message but in effect cause a logic error.

1. Takagi

log10 is part of . So, in order to correctly compile the code, you must add a -lm at THE END (this is crucial) of your code, so, i.e.
\$(CC) … -o a.exe -lm

2. Sarwesh Shah

log10 only accepts float, long float or double.
Typecast your input (in this case x) to float
thus change code
` #define len(x) ((int)log10(x)+1) `
to
` #define len(x) ((int)log10(float(x))+1) `

5. hari

it is working only for one file,how can compress multiple file?

1. Yatin

its giving an error how you solved it
log10:domain error
and where the compressed file will be stored????
i m giving file name directly abc.txt
without path.
help if u can…

6. dhiraj

thanks I realy need it

7. John

Code is int defined in void fillTable(int codeTable[], Node *tree, int Code){…

The largest int is 32767, if the tree has more than 5 levels, this line

fillTable(codeTable, tree->left, Code*10+1);

will cause overflow.

1. muthu

could you please reply me how we can compress a huge file with this code?

8. Sunny Dizon

Hi. Just want to ask, how’d you solve that Log10:domain error. Thanks for the immediate response.

1. momo

you simply add -lm at the end when you compile

eg.
gcc -o huffman huffman.c -lm

9. Amaro Junior

Great!

10. Edna

On UBUNTU, log10 error can be fixed like this:
gcc huffman.c -o huffman -lm

I can do compress, but in output file I get all kinds of signs, more like old Egiptian signs, and when I do decompress of output.txt, i don’t get anything? Anyone can help me???

Thanks

11. shalini

Fr compression it s working fine but it s not getting decompressed. Can u pls help me regarding this issue??

1. Yashira

Change the compressed file name and decompress that file using the new name. (Example change output.txt as example.txt and decompress example.txt file)

2. arun

kindy end me the proper working code.my id is:arunsecret.kumar@gmail.com
kindly do the needfull

12. Sinval

Friend I really liked your project, but I could not run it, open it to put the file name that will be used to compress, and then pressed the option to compress, but apparently it goes into a loop where it is generating a characters in a growing output.txt file.

13. DUFAY Paul

Hi, i’m a french student, so please apologize me if my english could be Clumsy.

I try to use this algorithm in parallelization programming but, when i want to use a really large text, to slow down my computer and see if i could improve the code, after compression and decompression, the final text lose a large part.

Do you have an explication?

14. Shaun

A good article! Thanks! Not sure why so many are having issues with compiling / figuring out how to use – works fine for me after reading through the code and understanding what its doing.
Does the Huffman principle only work with a-z and space? If one were to add additional characters such as ‘@’ or ‘-‘ or ‘.’, could one theoretically increment the array size from 27 to encompass the required character set (and of course ordered in frequency where applicable) and still work? Is there a finite limit to the number of characters one wishes to utilize before the algorithm loses effectiveness?

15. Paul

I’ve got such problem :
Error error C2440: ‘=’ : cannot convert from ‘void *’ to ‘Node *’

and code where that error is :

array[i] = malloc(sizeof(Node));

(with equal sign underlined)

1. Vosiz

try this:
array[i] = (Node *)malloc(sizeof(Node));

2. Test

Just cast array[i] = (Node*)malloc(sizeof(Node))

16. sandhiya

kindly sen me the proper code to mail id:sunithabistm@gmail.com

17. NintendoDS

I’m using this algorithm because my professor copied and pasted this as an exercise for me, making me implement the compress method.

18. ralph

While compiling the huffman code it ask me the file to type pleace help me solve this

19. Nitesh

Whats the output of this program….Can you guys give the input and output of this program!!

1. Aditya

input can be any file which contains that data with alphabets and spaces. This code may not compress if there is any special characters like ‘-‘, ‘:’, etc… and output contains binary data which will be unreadable format.

20. Sunny

I compiled this code currently, but I get run time error “Segmentation Fault (Core Dumped)

1. Aditya

21. AlfhaOmega