Huffman Coding (link to Wikipedia) is a compression algorithm used for loss-less data compression. Here’s the basic idea: each ASCII character is usually represented with 8 bits, but if we had a text filed composed of only the lowercase a-z letters we could represent each character with only 5 bits (i.e., 2^5 = 32, which is enough to represent 26 values), thus reducing the overall memory needed to store the data.

For example, the table of characters -> binary code could look like this:

- a = 00000
- b = 00001
- c = 00010
- d = 00011
- e = 00100

And so on.

This is the fixed-length representation, and it already creates a significant compression rate (around 35% in the example above).

A more efficient approach is to use a variable-length representation, where each character can have a different number of bits to represent it. More specifically we first analyze the frequency of each character in the text, and then we create a binary tree (called Huffman tree) giving a shorter bit representation to the most used characters, so that they can be reached faster. Notice that it must be a prefix tree (i.e., the code of every letter can’t be prefix to the code of any other letter) else the decompression wouldn’t work.

For example, the frequency of the letters in the English language (according to Wikipedia) is the following:

Now the algorithm to create the Huffman tree is the following:

- Create a forest with one tree for each letter and its respective frequency as value
- Join the two trees with the lowest value, removing each from the forest and adding instead the resulting combined tree
- Repeat until there’s only one tree left

The Huffman tree for the a-z letters (and the space character) using the frequency table above would look like this (you would need to expand to the lower branches to see all the letters):

We can traverse the tree and create a table with all the letters and their respective binary codes. It would look like this:

```
a = 0011
b = 011011
c = 11111
d = 00100
e = 101
f = 000010
g = 011001
h = 1110
i = 0101
j = 011010101
k = 0110101001
l = 00101
m = 000001
n = 0111
o = 0100
p = 011000
q = 0110101000
r = 1101
s = 1100
t = 0001
u = 11110
v = 0110100
w = 000000
x = 011010111
y = 000011
z = 011010110
space = 100
```

After that you just need to read the original text file letter by letter and to output the respective binary code. If you want to decompress a file, on the other hand, you read bit by bit and move along the Huffman tree until you find a letter, at which point you move back to the root of the tree and continue processing the bits.

Below you’ll find a C implementing of the Huffman coding (it includes all the parts, including the creating of the Huffman tree, the table and so on). If you prefer here’s the huffman.zip file for download.

```
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define len(x) ((int)log10(x)+1)
/* Node of the huffman tree */
struct node{
int value;
char letter;
struct node *left,*right;
};
typedef struct node Node;
/* 81 = 8.1%, 128 = 12.8% and so on. The 27th frequency is the space. Source is Wikipedia */
int englishLetterFrequencies [27] = {81, 15, 28, 43, 128, 23, 20, 61, 71, 2, 1, 40, 24, 69, 76, 20, 1, 61, 64, 91, 28, 10, 24, 1, 20, 1, 130};
/*finds and returns the small sub-tree in the forrest*/
int findSmaller (Node *array[], int differentFrom){
int smaller;
int i = 0;
while (array[i]->value==-1)
i++;
smaller=i;
if (i==differentFrom){
i++;
while (array[i]->value==-1)
i++;
smaller=i;
}
for (i=1;i<27;i++){
if (array[i]->value==-1)
continue;
if (i==differentFrom)
continue;
if (array[i]->value<array[smaller]->value)
smaller = i;
}
return smaller;
}
/*builds the huffman tree and returns its address by reference*/
void buildHuffmanTree(Node **tree){
Node *temp;
Node *array[27];
int i, subTrees = 27;
int smallOne,smallTwo;
for (i=0;i<27;i++){
array[i] = malloc(sizeof(Node));
array[i]->value = englishLetterFrequencies[i];
array[i]->letter = i;
array[i]->left = NULL;
array[i]->right = NULL;
}
while (subTrees>1){
smallOne=findSmaller(array,-1);
smallTwo=findSmaller(array,smallOne);
temp = array[smallOne];
array[smallOne] = malloc(sizeof(Node));
array[smallOne]->value=temp->value+array[smallTwo]->value;
array[smallOne]->letter=127;
array[smallOne]->left=array[smallTwo];
array[smallOne]->right=temp;
array[smallTwo]->value=-1;
subTrees--;
}
*tree = array[smallOne];
return;
}
/* builds the table with the bits for each letter. 1 stands for binary 0 and 2 for binary 1 (used to facilitate arithmetic)*/
void fillTable(int codeTable[], Node *tree, int Code){
if (tree->letter<27)
codeTable[(int)tree->letter] = Code;
else{
fillTable(codeTable, tree->left, Code*10+1);
fillTable(codeTable, tree->right, Code*10+2);
}
return;
}
/*function to compress the input*/
void compressFile(FILE *input, FILE *output, int codeTable[]){
char bit, c, x = 0;
int n,length,bitsLeft = 8;
int originalBits = 0, compressedBits = 0;
while ((c=fgetc(input))!=10){
originalBits++;
if (c==32){
length = len(codeTable[26]);
n = codeTable[26];
}
else{
length=len(codeTable[c-97]);
n = codeTable[c-97];
}
while (length>0){
compressedBits++;
bit = n % 10 - 1;
n /= 10;
x = x | bit;
bitsLeft--;
length--;
if (bitsLeft==0){
fputc(x,output);
x = 0;
bitsLeft = 8;
}
x = x << 1;
}
}
if (bitsLeft!=8){
x = x << (bitsLeft-1);
fputc(x,output);
}
/*print details of compression on the screen*/
fprintf(stderr,"Original bits = %dn",originalBits*8);
fprintf(stderr,"Compressed bits = %dn",compressedBits);
fprintf(stderr,"Saved %.2f%% of memoryn",((float)compressedBits/(originalBits*8))*100);
return;
}
/*function to decompress the input*/
void decompressFile (FILE *input, FILE *output, Node *tree){
Node *current = tree;
char c,bit;
char mask = 1 << 7;
int i;
while ((c=fgetc(input))!=EOF){
for (i=0;i<8;i++){
bit = c & mask;
c = c << 1;
if (bit==0){
current = current->left;
if (current->letter!=127){
if (current->letter==26)
fputc(32, output);
else
fputc(current->letter+97,output);
current = tree;
}
}
else{
current = current->right;
if (current->letter!=127){
if (current->letter==26)
fputc(32, output);
else
fputc(current->letter+97,output);
current = tree;
}
}
}
}
return;
}
/*invert the codes in codeTable2 so they can be used with mod operator by compressFile function*/
void invertCodes(int codeTable[],int codeTable2[]){
int i, n, copy;
for (i=0;i<27;i++){
n = codeTable[i];
copy = 0;
while (n>0){
copy = copy * 10 + n %10;
n /= 10;
}
codeTable2[i]=copy;
}
return;
}
int main(){
Node *tree;
int codeTable[27], codeTable2[27];
int compress;
char filename[20];
FILE *input, *output;
buildHuffmanTree(&tree);
fillTable(codeTable, tree, 0);
invertCodes(codeTable,codeTable2);
/*get input details from user*/
printf("Type the name of the file to process:n");
scanf("%s",filename);
printf("Type 1 to compress and 2 to decompress:n");
scanf("%d",&compress);
input = fopen(filename, "r");
output = fopen("output.txt","w");
if (compress==1)
compressFile(input,output,codeTable2);
else
decompressFile(input,output, tree);
return 0;
}
```

HImanshu Gupta#include

#include

#include

#define len(x) ((int)log10(x)+1)

/* Node of the huffman tree */

struct node{

int value;

char letter;

struct node *left,*right;

};

typedef struct node Node;

/* 81 = 8.1%, 128 = 12.8% and so on. The 27th frequency is the space. Source is Wikipedia */

int englishLetterFrequencies [27] = {81, 15, 28, 43, 128, 23, 20, 61, 71, 2, 1, 40, 24, 69, 76, 20, 1, 61, 64, 91, 28, 10, 24, 1, 20, 1, 130};

/*finds and returns the small sub-tree in the forrest*/

int findSmaller (Node *array[], int differentFrom){

int smaller;

int i = 0;

while (array[i]->value==-1)

i++;

smaller=i;

if (i==differentFrom){

i++;

while (array[i]->value==-1)

i++;

smaller=i;

}

for (i=1;ivalue==-1)

continue;

if (i==differentFrom)

continue;

if (array[i]->valuevalue)

smaller = i;

}

return smaller;

}

/*builds the huffman tree and returns its address by reference*/

void buildHuffmanTree(Node **tree){

Node *temp;

Node *array[27];

int i, subTrees = 27;

int smallOne,smallTwo;

for (i=0;ivalue = englishLetterFrequencies[i];

array[i]->letter = i;

array[i]->left = NULL;

array[i]->right = NULL;

}

while (subTrees>1){

smallOne=findSmaller(array,-1);

smallTwo=findSmaller(array,smallOne);

temp = array[smallOne];

array[smallOne] = malloc(sizeof(Node));

array[smallOne]->value=temp->value+array[smallTwo]->value;

array[smallOne]->letter=127;

array[smallOne]->left=array[smallTwo];

array[smallOne]->right=temp;

array[smallTwo]->value=-1;

subTrees–;

}

*tree = array[smallOne];

return;

}

/* builds the table with the bits for each letter. 1 stands for binary 0 and 2 for binary 1 (used to facilitate arithmetic)*/

void fillTable(int codeTable[], Node *tree, int Code){

if (tree->letterletter] = Code;

else{

fillTable(codeTable, tree->left, Code*10+1);

fillTable(codeTable, tree->right, Code*10+2);

}

return;

}

/*function to compress the input*/

void compressFile(FILE *input, FILE *output, int codeTable[]){

char bit, c, x = 0;

int n,length,bitsLeft = 8;

int originalBits = 0, compressedBits = 0;

while ((c=fgetc(input))!=10){

originalBits++;

if (c==32){

length = len(codeTable[26]);

n = codeTable[26];

}

else{

length=len(codeTable[c-97]);

n = codeTable[c-97];

}

while (length>0){

compressedBits++;

bit = n % 10 – 1;

n /= 10;

x = x | bit;

bitsLeft–;

length–;

if (bitsLeft==0){

fputc(x,output);

x = 0;

bitsLeft = 8;

}

x = x << 1;

}

}

if (bitsLeft!=8){

x = x << (bitsLeft-1);

fputc(x,output);

}

/*print details of compression on the screen*/

fprintf(stderr,"Original bits = %dn",originalBits*8);

fprintf(stderr,"Compressed bits = %dn",compressedBits);

fprintf(stderr,"Saved %.2f%% of memoryn",((float)compressedBits/(originalBits*8))*100);

return;

}

/*function to decompress the input*/

void decompressFile (FILE *input, FILE *output, Node *tree){

Node *current = tree;

char c,bit;

char mask = 1 << 7;

int i;

while ((c=fgetc(input))!=EOF){

for (i=0;i<8;i++){

bit = c & mask;

c = c <left;

if (current->letter!=127){

if (current->letter==26)

fputc(32, output);

else

fputc(current->letter+97,output);

current = tree;

}

}

else{

current = current->right;

if (current->letter!=127){

if (current->letter==26)

fputc(32, output);

else

fputc(current->letter+97,output);

current = tree;

}

}

}

}

return;

}

/*invert the codes in codeTable2 so they can be used with mod operator by compressFile function*/

void invertCodes(int codeTable[],int codeTable2[]){

int i, n, copy;

for (i=0;i0){

copy = copy * 10 + n %10;

n /= 10;

}

codeTable2[i]=copy;

}

return;

}

int main(){

Node *tree;

int codeTable[27], codeTable2[27];

int compress;

char filename[20];

FILE *input, *output;

buildHuffmanTree(&tree);

fillTable(codeTable, tree, 0);

invertCodes(codeTable,codeTable2);

/*get input details from user*/

printf(“Type the name of the file to process:n”);

scanf(“%s”,filename);

printf(“Type 1 to compress and 2 to decompress:n”);

scanf(“%d”,&compress);

input = fopen(filename, “r”);

output = fopen(“output.txt”,”w”);

if (compress==1)

compressFile(input,output,codeTable2);

else

decompressFile(input,output, tree);

return 0;

}

someone plz gimme its output asap

sumpii am not able to run.Domain error.why???????????

Krishna@sumpi. If you are using gcc, do this.

gcc filename.c -lm -0 anyname

ShobhitI am using turboo c ++ and i am getting domain error arrey out of bound. Kindly tell me the solution.

Thank you

naga lakshmiam getting log10: sing error.. i dont know how to correct that error.

and also what type of data it accepts i mean values or alphabets

Rajesh YHi Lakshmi,

I am also getting same problem , Did you get any solution ? if you get any solution just mail me “yarasani.rajesh@gmail.com”

Problem :

huf.c:(.text+0x3cf): undefined reference to `log10′

huf.c:(.text+0x40b): undefined reference to `log10′

collect2: error: ld returned 1 exit status

waiting reply …thank u..

Huang Yibinin C language, there’s no function called log10 in math.h, if you want to calculate log10(2), you should calculate log(2) / log(10), log() is the only function in math.h to calculate Logarithm

blueThere is in fact a log function in C.

log => ln(x)

log2(x) => base 2

log10(x) => base 10

log1p => log( 1 + x )

http://www.codecogs.com/library/computing/c/math.h/log.php

NB Log only takes type: ( long long x ), ( double x ), ( float x )

As it relates to this, i too am getting an error. But it’s not that the function doesn’t exist, but instead it is taking incorrect data type.

I tried going around it, by implementing log10 directly( log10( array ) + 1 ) on the line of code, “length = len(codeTable[26]);” , but no matter what I do, no matter what I cast, no matter whose data type I change, it won’t compress.

Changing data types does remove the error message but in effect cause a logic error.

Takagilog10 is part of . So, in order to correctly compile the code, you must add a -lm at THE END (this is crucial) of your code, so, i.e.

$(CC) … -o a.exe -lm

Sarwesh Shahlog10 only accepts float, long float or double.

Typecast your input (in this case x) to float

thus change code

`#define len(x) ((int)log10(x)+1)`

to

`#define len(x) ((int)log10(float(x))+1)`

hariit is working only for one file,how can compress multiple file?

Yatinits giving an error how you solved it

log10:domain error

and where the compressed file will be stored????

i m giving file name directly abc.txt

without path.

help if u can…

dhirajthanks I realy need it

JohnCode is int defined in void fillTable(int codeTable[], Node *tree, int Code){…

The largest int is 32767, if the tree has more than 5 levels, this line

fillTable(codeTable, tree->left, Code*10+1);

will cause overflow.

muthucould you please reply me how we can compress a huge file with this code?

Sunny DizonHi. Just want to ask, how’d you solve that Log10:domain error. Thanks for the immediate response.

momoyou simply add -lm at the end when you compile

eg.

gcc -o huffman huffman.c -lm

Amaro JuniorGreat!

EdnaOn UBUNTU, log10 error can be fixed like this:

gcc huffman.c -o huffman -lm

I can do compress, but in output file I get all kinds of signs, more like old Egiptian signs, and when I do decompress of output.txt, i don’t get anything? Anyone can help me???

Thanks

shaliniFr compression it s working fine but it s not getting decompressed. Can u pls help me regarding this issue??

YashiraChange the compressed file name and decompress that file using the new name. (Example change output.txt as example.txt and decompress example.txt file)

arunkindy end me the proper working code.my id is:arunsecret.kumar@gmail.com

kindly do the needfull

SinvalFriend I really liked your project, but I could not run it, open it to put the file name that will be used to compress, and then pressed the option to compress, but apparently it goes into a loop where it is generating a characters in a growing output.txt file.

DUFAY PaulHi, i’m a french student, so please apologize me if my english could be Clumsy.

I try to use this algorithm in parallelization programming but, when i want to use a really large text, to slow down my computer and see if i could improve the code, after compression and decompression, the final text lose a large part.

Do you have an explication?

Thanks for your help

ShaunA good article! Thanks! Not sure why so many are having issues with compiling / figuring out how to use – works fine for me after reading through the code and understanding what its doing.

Does the Huffman principle only work with a-z and space? If one were to add additional characters such as ‘@’ or ‘-‘ or ‘.’, could one theoretically increment the array size from 27 to encompass the required character set (and of course ordered in frequency where applicable) and still work? Is there a finite limit to the number of characters one wishes to utilize before the algorithm loses effectiveness?

PaulI’ve got such problem :

Error error C2440: ‘=’ : cannot convert from ‘void *’ to ‘Node *’

and code where that error is :

array[i] = malloc(sizeof(Node));

(with equal sign underlined)

Vosiztry this:

array[i] = (Node *)malloc(sizeof(Node));

TestJust cast array[i] = (Node*)malloc(sizeof(Node))

sandhiyakindly sen me the proper code to mail id:sunithabistm@gmail.com

NintendoDSI’m using this algorithm because my professor copied and pasted this as an exercise for me, making me implement the compress method.

Karan ShahCheck this one. It helped me. A very different approach here:

http://codehuntertech.com/data-structures/trees/huffman-tree/