数据压缩实验三--Huffman编解码及压缩率的比较
一,Huffman码
1 Huffman 编码
Huffman Coding (霍夫曼编码)是一种无失真编码的编码方式,Huffman编码是可变字长编码(VLC)的一种。
Huffman 编码基于信源的概率统计模型,它的基本思路是,出现概率大的信源符号编长码,出现概率小的信源符号编短码,从而使平均码长最小。
在程序实现中常使用一种叫做树的数据结构实现 Huffman 编码,由它编出的码是即时码。
2 Huffman 编码的方法
2.1 统计符号的发生概率;
2.2 把频率按从小到大的顺序排列;
2.3 每一次选出最小的两个值,作为二叉树的两个叶子节点,将和作为它们的根节点,这两个叶子节点不
再参与比较,新的根节点参与比较;
2.4 重复 3,直到最后得到和为1 的根节点;
2.5 将形成的二叉树的左节点标 0,右节点标 1,把从最上面的根节点到最下面的叶子节点途中遇
到的0,1 序列串起来,就得到了各个符号的编码。
3 Huffman解码
Huffman解码是查表+翻译的过程。读取随接收文件传来的码表后,再逐位读取文件实际数据,对照码表进行翻译即可。
二 代码
流程中最关键的对Huffman树的操作在程序中主要通过两个结构体实现:Huffman_node和Huffman_code。
建立的二叉树上每个节点都以Huffman_node类型存在。节点之间的主要关系有父子、兄弟,Huffman_node中定义了指向父节点的指针*parent和指向孩子的指针*zero, *one来表述节点与节点之间的关系。除此之外,还有节点本身的属性:isLeaf、count、symbol。
而编码码字定义为了Huffman_code,本身属性包括码字占用的比特数和码字本身。
Huffcode.c
/** huffcode - Encode/Decode files using Huffman encoding.* http://huffman.sourceforge.net* Copyright (C) 2003 Douglas Ryan Richardson; Gauss Interprise, Inc** This library is free software; you can redistribute it and/or* modify it under the terms of the GNU Lesser General Public* License as published by the Free Software Foundation; either* version 2.1 of the License, or (at your option) any later version.** This library is distributed in the hope that it will be useful,* but WITHOUT ANY WARRANTY; without even the implied warranty of* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU* Lesser General Public License for more details.** You should have received a copy of the GNU Lesser General Public* License along with this library; if not, write to the Free Software* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA*/#include "huffman.h"
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include <assert.h>#ifdef WIN32
#include <malloc.h>
extern int getopt(int, char**, char*);
extern char* optarg;
#else
#include <unistd.h>
#endifstatic int memory_encode_file(FILE *in, FILE *out);
static int memory_decode_file(FILE *in, FILE *out);static void
version(FILE *out)
{fputs("huffcode 0.3\n""Copyright (C) 2003 Douglas Ryan Richardson""; Gauss Interprise, Inc\n",out);
}static void
usage(FILE* out)
{fputs("Usage: huffcode [-i<input file>] [-o<output file>] [-d|-c]\n""-i - input file (default is standard input)\n""-o - output file (default is standard output)\n""-d - decompress\n""-c - compress (default)\n""-m - read file into memory, compress, then write to file (not default)\n",// step1: by yzhang, for huffman statistics"-t - output huffman statistics\n",//step1:end by yzhangout);
}int
main(int argc, char** argv)
{char memory = 0;char compress = 1;int opt;const char *file_in = NULL, *file_out = NULL;//step1:add by yzhang for huffman statisticsconst char *file_out_table = NULL;//end by yzhangFILE *in = stdin;FILE *out = stdout;//step1:add by yzhang for huffman statisticsFILE * outTable = NULL;//end by yzhang/* Get the command line arguments. */while((opt = getopt(argc, argv, "i:o:cdhvmt:")) != -1) //演示如何跳出循环,及查找括号对{switch(opt){case 'i':file_in = optarg;break;case 'o':file_out = optarg;break;case 'c':compress = 1;//压缩break;case 'd':compress = 0;//解压break;case 'h':usage(stdout);return 0;case 'v':version(stdout);return 0;case 'm':memory = 1;break;// by yzhang for huffman statisticscase 't':file_out_table = optarg; break;//end by yzhangdefault:usage(stderr);return 1;}}/* If an input file is given then open it. */if(file_in){in = fopen(file_in, "rb");if(!in){fprintf(stderr,"Can't open input file '%s': %s\n",file_in, strerror(errno));return 1;}}/* If an output file is given then create it. */if(file_out){out = fopen(file_out, "wb");if(!out){fprintf(stderr,"Can't open output file '%s': %s\n",file_out, strerror(errno));return 1;}}//by yzhang for huffman statisticsif(file_out_table){outTable = fopen(file_out_table, "w");if(!outTable){fprintf(stderr,"Can't open output file '%s': %s\n",file_out_table, strerror(errno));return 1;}}//end by yzhangif(memory){return compress ?memory_encode_file(in, out) : memory_decode_file(in, out);}if(compress) //change by yzhanghuffman_encode_file(in, out,outTable);//step1:changed by yzhang from huffman_encode_file(in, out) to huffman_encode_file(in, out,outTable)elsehuffman_decode_file(in, out);if(in)fclose(in);if(out)fclose(out);if(outTable)fclose(outTable);return 0;
}static int
memory_encode_file(FILE *in, FILE *out)
{unsigned char *buf = NULL, *bufout = NULL;unsigned int len = 0, cur = 0, inc = 1024, bufoutlen = 0;assert(in && out);/* Read the file into memory. */while(!feof(in)){unsigned char *tmp;len += inc;tmp = (unsigned char*)realloc(buf, len);if(!tmp){if(buf)free(buf);return 1;}buf = tmp;cur += fread(buf + cur, 1, inc, in);}if(!buf)return 1;/* Encode the memory. */if(huffman_encode_memory(buf, cur, &bufout, &bufoutlen)){free(buf);return 1;}free(buf);/* Write the memory to the file. */if(fwrite(bufout, 1, bufoutlen, out) != bufoutlen){free(bufout);return 1;}free(bufout);return 0;
}static int
memory_decode_file(FILE *in, FILE *out)
{unsigned char *buf = NULL, *bufout = NULL;unsigned int len = 0, cur = 0, inc = 1024, bufoutlen = 0;assert(in && out);/* Read the file into memory. */while(!feof(in)){unsigned char *tmp;len += inc;tmp = (unsigned char*)realloc(buf, len);if(!tmp){if(buf)free(buf);return 1;}buf = tmp;cur += fread(buf + cur, 1, inc, in);}if(!buf)return 1;/* Decode the memory. */if(huffman_decode_memory(buf, cur, &bufout, &bufoutlen)){free(buf);return 1;}free(buf);/* Write the memory to the file. */if(fwrite(bufout, 1, bufoutlen, out) != bufoutlen){free(bufout);return 1;}free(bufout);return 0;
}
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
- 32
- 33
- 34
- 35
- 36
- 37
- 38
- 39
- 40
- 41
- 42
- 43
- 44
- 45
- 46
- 47
- 48
- 49
- 50
- 51
- 52
- 53
- 54
- 55
- 56
- 57
- 58
- 59
- 60
- 61
- 62
- 63
- 64
- 65
- 66
- 67
- 68
- 69
- 70
- 71
- 72
- 73
- 74
- 75
- 76
- 77
- 78
- 79
- 80
- 81
- 82
- 83
- 84
- 85
- 86
- 87
- 88
- 89
- 90
- 91
- 92
- 93
- 94
- 95
- 96
- 97
- 98
- 99
- 100
- 101
- 102
- 103
- 104
- 105
- 106
- 107
- 108
- 109
- 110
- 111
- 112
- 113
- 114
- 115
- 116
- 117
- 118
- 119
- 120
- 121
- 122
- 123
- 124
- 125
- 126
- 127
- 128
- 129
- 130
- 131
- 132
- 133
- 134
- 135
- 136
- 137
- 138
- 139
- 140
- 141
- 142
- 143
- 144
- 145
- 146
- 147
- 148
- 149
- 150
- 151
- 152
- 153
- 154
- 155
- 156
- 157
- 158
- 159
- 160
- 161
- 162
- 163
- 164
- 165
- 166
- 167
- 168
- 169
- 170
- 171
- 172
- 173
- 174
- 175
- 176
- 177
- 178
- 179
- 180
- 181
- 182
- 183
- 184
- 185
- 186
- 187
- 188
- 189
- 190
- 191
- 192
- 193
- 194
- 195
- 196
- 197
- 198
- 199
- 200
- 201
- 202
- 203
- 204
- 205
- 206
- 207
- 208
- 209
- 210
- 211
- 212
- 213
- 214
- 215
- 216
- 217
- 218
- 219
- 220
- 221
- 222
- 223
- 224
- 225
- 226
- 227
- 228
- 229
- 230
- 231
- 232
- 233
- 234
- 235
- 236
- 237
- 238
- 239
- 240
- 241
- 242
- 243
- 244
- 245
- 246
- 247
- 248
- 249
- 250
- 251
- 252
- 253
- 254
- 255
- 256
- 257
- 258
- 259
- 260
- 261
- 262
- 263
- 264
- 265
- 266
- 267
- 268
- 269
- 270
- 271
- 1
- 2
- 3
- 4
- 5
- 6
- 7
- 8
- 9
- 10
- 11
- 12
- 13
- 14
- 15
- 16
- 17
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
- 31
- 32
- 33
- 34
- 35
- 36
- 37
- 38
- 39
- 40
- 41
- 42
- 43
- 44
- 45
- 46
- 47
- 48
- 49
- 50
- 51
- 52
- 53
- 54
- 55
- 56
- 57
- 58
- 59
- 60
- 61
- 62
- 63
- 64
- 65
- 66
- 67
- 68
- 69
- 70
- 71
- 72
- 73
- 74
- 75
- 76
- 77
- 78
- 79
- 80
- 81
- 82
- 83
- 84
- 85
- 86
- 87
- 88
- 89
- 90
- 91
- 92
- 93
- 94
- 95
- 96
- 97
- 98
- 99
- 100
- 101
- 102
- 103
- 104
- 105
- 106
- 107
- 108
- 109
- 110
- 111
- 112
- 113
- 114
- 115
- 116
- 117
- 118
- 119
- 120
- 121
- 122
- 123
- 124
- 125
- 126
- 127
- 128
- 129
- 130
- 131
- 132
- 133
- 134
- 135
- 136
- 137
- 138
- 139
- 140
- 141
- 142
- 143
- 144
- 145
- 146
- 147
- 148
- 149
- 150
- 151
- 152
- 153
- 154
- 155
- 156
- 157
- 158
- 159
- 160
- 161
- 162
- 163
- 164
- 165
- 166
- 167
- 168
- 169
- 170
- 171
- 172
- 173
- 174
- 175
- 176
- 177
- 178
- 179
- 180
- 181
- 182
- 183
- 184
- 185
- 186
- 187
- 188
- 189
- 190
- 191
- 192
- 193
- 194
- 195
- 196
- 197
- 198
- 199
- 200
- 201
- 202
- 203
- 204
- 205
- 206
- 207
- 208
- 209
- 210
- 211
- 212
- 213
- 214
- 215
- 216
- 217
- 218
- 219
- 220
- 221
- 222
- 223
- 224
- 225
- 226
- 227
- 228
- 229
- 230
- 231
- 232
- 233
- 234
- 235
- 236
- 237
- 238
- 239
- 240
- 241
- 242
- 243
- 244
- 245
- 246
- 247
- 248
- 249
- 250
- 251
- 252
- 253
- 254
- 255
- 256
- 257
- 258
- 259
- 260
- 261
- 262
- 263
- 264
- 265
- 266
- 267
- 268
- 269
- 270
- 271
Huffman.c
/** huffman - Encode/Decode files using Huffman encoding.* http://huffman.sourceforge.net* Copyright (C) 2003 Douglas Ryan Richardson; Gauss Interprise, Inc** This library is free software; you can redistribute it and/or* modify it under the terms of the GNU Lesser General Public* License as published by the Free Software Foundation; either* version 2.1 of the License, or (at your option) any later version.** This library is distributed in the hope that it will be useful,* but WITHOUT ANY WARRANTY; without even the implied warranty of* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU* Lesser General Public License for more details.** You should have received a copy of the GNU Lesser General Public* License along with this library; if not, write to the Free Software* Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA*/#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include "huffman.h"#ifdef WIN32
#include <winsock2.h>
#include <malloc.h>
#define alloca _alloca
#else
#include <netinet/in.h>
#endiftypedef struct huffman_node_tag
{unsigned char isLeaf;unsigned long count;struct huffman_node_tag *parent;union{struct{struct huffman_node_tag *zero, *one;};unsigned char symbol;};
} huffman_node;typedef struct huffman_code_tag
{/* The length of this code in bits. */unsigned long numbits;/* The bits that make up this code. The firstbit is at position 0 in bits[0]. The secondbit is at position 1 in bits[0]. The eighthbit is at position 7 in bits[0]. The ninthbit is at position 0 in bits[1]. */unsigned char *bits;
} huffman_code;//step2:add by yzhang for huffman statistics
//存放信源符号的信息:符号频率、比特数、符号码字
typedef struct huffman_statistics_result
{float freq[256];unsigned long numbits[256];unsigned char bits[256][100];
}huffman_stat;/*huffman_stat *init_huffstatistics()
{ huffman_stat *p;int i;p = (huffman_stat*)malloc(sizeof(huffman_stat));p->freq = (float *)malloc(sizeof(float)*256 );p->numbits = (unsigned long *) malloc(sizeof(unsigned long)*256);for (i=0 ; i<256;i++)p->bits[i] = (unsigned char *)malloc(sizeof(unsigned char)*100); return p;
}*/
//end by yzhang//将bit数转换为其对应的byte数,不能被8整除的部分要多分配一整个byte给它
static unsigned long
numbytes_from_numbits(unsigned long numbits)
{return numbits / 8 + (numbits % 8 ? 1 : 0);
}/** get_bit returns the ith bit in the bits array* in the 0th position of the return value.*/
static unsigned char
get_bit(unsigned char* bits, unsigned long i)
{return (bits[i / 8] >> i % 8) & 1;
}//由于程序中从二叉树形成码字的过程是从叶到根的,所以需要bit反转函数来获得顺序正确的码字,同时以byte为单位对其进行规范化
//例:传入倒序码字为010111011,通过bit反转函数变为00000001 10111010
static void
reverse_bits(unsigned char* bits, unsigned long numbits)
{unsigned long numbytes = numbytes_from_numbits(numbits);unsigned char *tmp =(unsigned char*)alloca(numbytes);//alloca与malloc功能相似,但alloca会自动释放申请的空间unsigned long curbit;long curbyte = 0;memset(tmp, 0, numbytes);//将tmp指向空间的前numbytes个字节内容全部置0for(curbit = 0; curbit < numbits; ++curbit){unsigned int bitpos = curbit % 8;//如果一个byte写满了,就跳到下一个byte继续写if(curbit > 0 && curbit % 8 == 0)++curbyte;//通过get_bit函数从传入的bits里获得当前操作的比特结果,用移位运算将其移动到在一个byte里对应的位置//由于tmp的指向操作是以byte为单位的,这里只能通过按位取或(|=)来把bit一个一个写到tmp指向的空间里去//bit反转是靠numbits-curbit-1实现的tmp[curbyte] |= (get_bit(bits, numbits - curbit - 1) << bitpos);}memcpy(bits, tmp, numbytes);//把反转后的tmp写回到bits里
}/** new_code builds a huffman_code from a leaf in* a Huffman tree.*/
static huffman_code*
new_code(const huffman_node* leaf)
{/* Build the huffman code by walking up to* the root node and then reversing the bits,* since the Huffman code is calculated by* walking down the tree. */unsigned long numbits = 0;unsigned char* bits = NULL;huffman_code *p;//此段while循环的目的是从传入的叶结点开始向上进行寻根,得到该叶结点对应的码字while(leaf && leaf->parent){huffman_node *parent = leaf->parent;unsigned char cur_bit = (unsigned char)(numbits % 8);unsigned long cur_byte = numbits / 8;/* If we need another byte to hold the code,then allocate it. */if(cur_bit == 0){size_t newSize = cur_byte + 1;bits = (unsigned char*)realloc(bits, newSize);//把bits所占的空间大小调整为newSize个字节bits[newSize - 1] = 0; /* Initialize the new byte. */}/* If a one must be added then or it in. If a zero* must be added then do nothing, since the byte* was initialized to zero. */if(leaf == parent->one)//如果叶结点的地址等于该叶结点的爹妈的1孩子地址,则进行对应的移位操作bits[cur_byte] |= 1 << cur_bit;++numbits;leaf = parent;}if(bits)reverse_bits(bits, numbits);p = (huffman_code*)malloc(sizeof(huffman_code));p->numbits = numbits;p->bits = bits;return p;//p里包含了编完的码字、码字长度
}#define MAX_SYMBOLS 256
typedef huffman_node* SymbolFrequencies[MAX_SYMBOLS];
typedef huffman_code* SymbolEncoder[MAX_SYMBOLS];//传入符号,建立其对应的叶结点,设置参数
static huffman_node*
new_leaf_node(unsigned char symbol)
{huffman_node *p = (huffman_node*)malloc(sizeof(huffman_node));p->isLeaf = 1;p->symbol = symbol;p->count = 0;p->parent = 0;return p;
}//建立一个非叶结点,并将它的0、1孩子地址设置为传入的0、1结点地址
static huffman_node*
new_nonleaf_node(unsigned long count, huffman_node *zero, huffman_node *one)
{huffman_node *p = (huffman_node*)malloc(sizeof(huffman_node));p->isLeaf = 0;p->count = count;p->zero = zero;p->one = one;p->parent = 0;return p;
}static void
free_huffman_tree(huffman_node *subtree)
{if(subtree == NULL)return;if(!subtree->isLeaf){free_huffman_tree(subtree->zero);free_huffman_tree(subtree->one);}free(subtree);
}static void
free_code(huffman_code* p)
{free(p->bits);free(p);
}static void
free_encoder(SymbolEncoder *pSE)
{unsigned long i;for(i = 0; i < MAX_SYMBOLS; ++i){huffman_code *p = (*pSE)[i];if(p)free_code(p);}free(pSE);
}static void
init_frequencies(SymbolFrequencies *pSF)
{memset(*pSF, 0, sizeof(SymbolFrequencies));
#if 0unsigned int i;for(i = 0; i < MAX_SYMBOLS; ++i){unsigned char uc = (unsigned char)i;(*pSF)[i] = new_leaf_node(uc);}
#endif
}typedef struct buf_cache_tag
{unsigned char *cache;unsigned int cache_len;unsigned int cache_cur;unsigned char **pbufout;unsigned int *pbufoutlen;
} buf_cache;static int init_cache(buf_cache* pc,unsigned int cache_size,unsigned char **pbufout,unsigned int *pbufoutlen)
{assert(pc && pbufout && pbufoutlen);if(!pbufout || !pbufoutlen)return 1;pc->cache = (unsigned char*)malloc(cache_size);pc->cache_len = cache_size;pc->cache_cur = 0;pc->pbufout = pbufout;*pbufout = NULL;pc->pbufoutlen = pbufoutlen;*pbufoutlen = 0;return pc->cache ? 0 : 1;
}static void free_cache(buf_cache* pc)
{assert(pc);if(pc->cache){free(pc->cache);pc->cache = NULL;}
}static int flush_cache(buf_cache* pc)
{assert(pc);if(pc->cache_cur > 0){unsigned int newlen = pc->cache_cur + *pc->pbufoutlen;unsigned char* tmp = realloc(*pc->pbufout, newlen);if(!tmp)return 1;memcpy(tmp + *pc->pbufoutlen, pc->cache, pc->cache_cur);*pc->pbufout = tmp;*pc->pbufoutlen = newlen;pc->cache_cur = 0;}return 0;
}static int write_cache(buf_cache* pc,const void *to_write,unsigned int to_write_len)
{unsigned char* tmp;assert(pc && to_write);assert(pc->cache_len >= pc->cache_cur);/* If trying to write more than the cache will hold* flush the cache and allocate enough space immediately,* that is, don't use the cache. */if(to_write_len > pc->cache_len - pc->cache_cur){unsigned int newlen;flush_cache(pc);newlen = *pc->pbufoutlen + to_write_len;tmp = realloc(*pc->pbufout, newlen);if(!tmp)return 1;memcpy(tmp + *pc->pbufoutlen, to_write, to_write_len);*pc->pbufout = tmp;*pc->pbufoutlen = newlen;}else{/* Write the data to the cache. */memcpy(pc->cache + pc->cache_cur, to_write, to_write_len);pc->cache_cur += to_write_len;}return 0;
}//为信源符号建立叶结点,统计次数
static unsigned int
get_symbol_frequencies(SymbolFrequencies *pSF, FILE *in)
{int c;unsigned int total_count = 0;/* Set all frequencies to 0. */init_frequencies(pSF);/* Count the frequency of each symbol in the input file. */while((c = fgetc(in)) != EOF){unsigned char uc = c;if(!(*pSF)[uc])//如果第一次遇到这个符号,则新建该符号的叶结点(*pSF)[uc] = new_leaf_node(uc);++(*pSF)[uc]->count;//对所有符号出现的次数分别进行计数++total_count;}return total_count;
}static unsigned int
get_symbol_frequencies_from_memory(SymbolFrequencies *pSF,const unsigned char *bufin,unsigned int bufinlen)
{unsigned int i;unsigned int total_count = 0;/* Set all frequencies to 0. */init_frequencies(pSF);/* Count the frequency of each symbol in the input file. */for(i = 0; i < bufinlen; ++i){unsigned char uc = bufin[i];if(!(*pSF)[uc])(*pSF)[uc] = new_leaf_node(uc);++(*pSF)[uc]->count;++total_count;}return total_count;
}/** When used by qsort, SFComp sorts the array so that* the symbol with the lowest frequency is first. Any* NULL entries will be sorted to the end of the list.*/
static int
SFComp(const void *p1, const void *p2)
{const huffman_node *hn1 = *(const huffman_node**)p1;const huffman_node *hn2 = *(const huffman_node**)p2;/* Sort all NULLs to the end. */if(hn1 == NULL && hn2 == NULL)return 0;if(hn1 == NULL)return 1;if(hn2 == NULL)return -1;if(hn1->count > hn2->count)return 1;else if(hn1->count < hn2->count)return -1;return 0;
}#if 1
static void
print_freqs(SymbolFrequencies * pSF)
{size_t i;for(i = 0; i < MAX_SYMBOLS; ++i){if((*pSF)[i])printf("%d, %ld\n", (*pSF)[i]->symbol, (*pSF)[i]->count);elseprintf("NULL\n");}
}
#endif/** build_symbol_encoder builds a SymbolEncoder by walking* down to the leaves of the Huffman tree and then,* for each leaf, determines its code.*/
static void
build_symbol_encoder(huffman_node *subtree, SymbolEncoder *pSF)
{if(subtree == NULL)return;//如果传入的结点是叶结点,对其进行编码并存放在对应的指针指向的空间里;如果不是,用递归方法不断调用自身传入该结点的左、右孩子,直到叶结点if(subtree->isLeaf)(*pSF)[subtree->symbol] = new_code(subtree);else{ //递归build_symbol_encoder(subtree->zero, pSF);build_symbol_encoder(subtree->one, pSF);}
}/** calculate_huffman_codes turns pSF into an array* with a single entry that is the root of the* huffman tree. The return value is a SymbolEncoder,* which is an array of huffman codes index by symbol value.*/
static SymbolEncoder*
calculate_huffman_codes(SymbolFrequencies * pSF)
{unsigned int i = 0;unsigned int n = 0;huffman_node *m1 = NULL, *m2 = NULL;SymbolEncoder *pSE = NULL;#if 1printf("BEFORE SORT\n");print_freqs(pSF); //演示堆栈的使用
#endif/* Sort the symbol frequency array by ascending frequency. *///qsort是自带的快速排序函数,参数为待排序数组的首地址(*pSF),排序元素数量(MAX_SYMBOLS),每个元素的长度(sizeof((*pSF)[0])),自定义的比较函数(SFComp,返回1则前〉后,-1则后〉前)qsort((*pSF), MAX_SYMBOLS, sizeof((*pSF)[0]), SFComp); //讲解SFComp函数的作用,断点在调试程序里的作用#if 1 printf("AFTER SORT\n");print_freqs(pSF);
#endif/* Get the number of symbols. */for(n = 0; n < MAX_SYMBOLS && (*pSF)[n]; ++n);/** Construct a Huffman tree. This code is based* on the algorithm given in Managing Gigabytes* by Ian Witten et al, 2nd edition, page 34.* Note that this implementation uses a simple* count instead of probability.*/for(i = 0; i < n - 1; ++i){/* Set m1 and m2 to the two subsets of least probability. */m1 = (*pSF)[0];m2 = (*pSF)[1];/* Replace m1 and m2 with a set {m1, m2} whose probability* is the sum of that of m1 and m2. */(*pSF)[0] = m1->parent = m2->parent =new_nonleaf_node(m1->count + m2->count, m1, m2);(*pSF)[1] = NULL;/* Put newSet into the correct count position in pSF. */qsort((*pSF), n, sizeof((*pSF)[0]), SFComp);}/* Build the SymbolEncoder array from the tree. */pSE = (SymbolEncoder*)malloc(sizeof(SymbolEncoder));memset(pSE, 0, sizeof(SymbolEncoder));build_symbol_encoder((*pSF)[0], pSE);return pSE;
}/** Write the huffman code table. The format is:* 4 byte code count in network byte order.* 4 byte number of bytes encoded* (if you decode the data, you should get this number of bytes)* code1* ...* codeN, where N is the count read at the begginning of the file.* Each codeI has the following format:* 1 byte symbol, 1 byte code bit length, code bytes.* Each entry has numbytes_from_numbits code bytes.* The last byte of each code may have extra bits, if the number of* bits in the code is not a multiple of 8.*/
static int
write_code_table(FILE* out, SymbolEncoder *se, unsigned int symbol_count)
{unsigned long i, count = 0;/* Determine the number of entries in se. */for(i = 0; i < MAX_SYMBOLS; ++i){if((*se)[i])++count;}/* Write the number of entries in network byte order. */i = htonl(count); //在网络传输中,采用big-endian序,对于0x0A0B0C0D ,传输顺序就是0A 0B 0C 0D ,//因此big-endian作为network byte order,little-endian作为host byte order。//little-endian的优势在于unsigned char/short/int/long类型转换时,存储位置无需改变if(fwrite(&i, sizeof(i), 1, out) != 1)return 1;/* Write the number of bytes that will be encoded. */symbol_count = htonl(symbol_count);if(fwrite(&symbol_count, sizeof(symbol_count), 1, out) != 1)return 1;/* Write the entries. */for(i = 0; i < MAX_SYMBOLS; ++i){huffman_code *p = (*se)[i];if(p){unsigned int numbytes;/* Write the 1 byte symbol. */fputc((unsigned char)i, out);/* Write the 1 byte code bit length. */fputc(p->numbits, out);/* Write the code bytes. */numbytes = numbytes_from_numbits(p->numbits);if(fwrite(p->bits, 1, numbytes, out) != numbytes)return 1;}}return 0;
}/** Allocates memory and sets *pbufout to point to it. The memory* contains the code table.*/
static int
write_code_table_to_memory(buf_cache *pc,SymbolEncoder *se,unsigned int symbol_count)
{unsigned long i, count = 0;/* Determine the number of entries in se. */for(i = 0; i < MAX_SYMBOLS; ++i){if((*se)[i])++count;}/* Write the number of entries in network byte order. */i = htonl(count);if(write_cache(pc, &i, sizeof(i)))return 1;/* Write the number of bytes that will be encoded. */symbol_count = htonl(symbol_count);if(write_cache(pc, &symbol_count, sizeof(symbol_count)))return 1;/* Write the entries. */for(i = 0; i < MAX_SYMBOLS; ++i){huffman_code *p = (*se)[i];if(p){unsigned int numbytes;/* The value of i is < MAX_SYMBOLS (256), so it canbe stored in an unsigned char. */unsigned char uc = (unsigned char)i;/* Write the 1 byte symbol. */if(write_cache(pc, &uc, sizeof(uc)))return 1;/* Write the 1 byte code bit length. */uc = (unsigned char)p->numbits;if(write_cache(pc, &uc, sizeof(uc)))return 1;/* Write the code bytes. */numbytes = numbytes_from_numbits(p->numbits);if(write_cache(pc, p->bits, numbytes))return 1;}}return 0;
}/** read_code_table builds a Huffman tree from the code* in the in file. This function returns NULL on error.* The returned value should be freed with free_huffman_tree.*/
static huffman_node*
read_code_table(FILE* in, unsigned int *pDataBytes)
{//在解码端重建huffman树huffman_node *root = new_nonleaf_node(0, NULL, NULL);unsigned int count;/* Read the number of entries.(it is stored in network byte order). */if(fread(&count, sizeof(count), 1, in) != 1){free_huffman_tree(root);return NULL;}count = ntohl(count);//将一个无符号长整形数从网络字节顺序转换为主机字节顺序/* Read the number of data bytes this encoding represents. */if(fread(pDataBytes, sizeof(*pDataBytes), 1, in) != 1){free_huffman_tree(root);return NULL;}*pDataBytes = ntohl(*pDataBytes);/* Read the entries. */while(count-- > 0){int c;unsigned int curbit;unsigned char symbol;unsigned char numbits;unsigned char numbytes;unsigned char *bytes;huffman_node *p = root;if((c = fgetc(in)) == EOF)//读取符号并判断{free_huffman_tree(root);return NULL;}symbol = (unsigned char)c;if((c = fgetc(in)) == EOF)//读取字符长度并判断{free_huffman_tree(root);return NULL;}numbits = (unsigned char)c;numbytes = (unsigned char)numbytes_from_numbits(numbits);bytes = (unsigned char*)malloc(numbytes);if(fread(bytes, 1, numbytes, in) != numbytes){free(bytes);free_huffman_tree(root);return NULL;}/** Add the entry to the Huffman tree. The value* of the current bit is used switch between* zero and one child nodes in the tree. New nodes* are added as needed in the tree.*/for(curbit = 0; curbit < numbits; ++curbit){if(get_bit(bytes, curbit)){if(p->one == NULL){p->one = curbit == (unsigned char)(numbits - 1)? new_leaf_node(symbol): new_nonleaf_node(0, NULL, NULL);p->one->parent = p;}p = p->one;}else{if(p->zero == NULL){p->zero = curbit == (unsigned char)(numbits - 1)? new_leaf_node(symbol): new_nonleaf_node(0, NULL, NULL);p->zero->parent = p;}p = p->zero;}}free(bytes);}return root;
}static int
memread(const unsigned char* buf,unsigned int buflen,unsigned int *pindex,void* bufout,unsigned int readlen)
{assert(buf && pindex && bufout);assert(buflen >= *pindex);if(buflen < *pindex)return 1;if(readlen + *pindex >= buflen)return 1;memcpy(bufout, buf + *pindex, readlen);*pindex += readlen;return 0;
}static huffman_node*
read_code_table_from_memory(const unsigned char* bufin,unsigned int bufinlen,unsigned int *pindex,unsigned int *pDataBytes)
{huffman_node *root = new_nonleaf_node(0, NULL, NULL);unsigned int count;/* Read the number of entries.(it is stored in network byte order). */if(memread(bufin, bufinlen, pindex, &count, sizeof(count))){free_huffman_tree(root);return NULL;}count = ntohl(count);/* Read the number of data bytes this encoding represents. */if(memread(bufin, bufinlen, pindex, pDataBytes, sizeof(*pDataBytes))){free_huffman_tree(root);return NULL;}*pDataBytes = ntohl(*pDataBytes);/* Read the entries. */while(count-- > 0){unsigned int curbit;unsigned char symbol;unsigned char numbits;unsigned char numbytes;unsigned char *bytes;huffman_node *p = root;if(memread(bufin, bufinlen, pindex, &symbol, sizeof(symbol))){free_huffman_tree(root);return NULL;}if(memread(bufin, bufinlen, pindex, &numbits, sizeof(numbits))){free_huffman_tree(root);return NULL;}numbytes = (unsigned char)numbytes_from_numbits(numbits);bytes = (unsigned char*)malloc(numbytes);if(memread(bufin, bufinlen, pindex, bytes, numbytes)){free(bytes);free_huffman_tree(root);return NULL;}/** Add the entry to the Huffman tree. The value* of the current bit is used switch between* zero and one child nodes in the tree. New nodes* are added as needed in the tree.*/for(curbit = 0; curbit < numbits; ++curbit){if(get_bit(bytes, curbit)){if(p->one == NULL){p->one = curbit == (unsigned char)(numbits - 1)? new_leaf_node(symbol): new_nonleaf_node(0, NULL, NULL);p->one->parent = p;}p = p->one;}else{if(p->zero == NULL){p->zero = curbit == (unsigned char)(numbits - 1)? new_leaf_node(symbol): new_nonleaf_node(0, NULL, NULL);p->zero->parent = p;}p = p->zero;}}free(bytes);}return root;
}static int
do_file_encode(FILE* in, FILE* out, SymbolEncoder *se)
{unsigned char curbyte = 0;unsigned char curbit = 0;int c;while((c = fgetc(in)) != EOF){unsigned char uc = (unsigned char)c;huffman_code *code = (*se)[uc];unsigned long i;for(i = 0; i < code->numbits; ++i){/* Add the current bit to curbyte. */curbyte |= get_bit(code->bits, i) << curbit;/* If this byte is filled up then write it* out and reset the curbit and curbyte. */if(++curbit == 8){fputc(curbyte, out);curbyte = 0;curbit = 0;}}}/** If there is data in curbyte that has not been* output yet, which means that the last encoded* character did not fall on a byte boundary,* then output it.*/if(curbit > 0)//写最后一个符号没写满8bit的情况fputc(curbyte, out);return 0;
}static int
do_memory_encode(buf_cache *pc,const unsigned char* bufin,unsigned int bufinlen,SymbolEncoder *se)
{unsigned char curbyte = 0;unsigned char curbit = 0;unsigned int i;for(i = 0; i < bufinlen; ++i){unsigned char uc = bufin[i];huffman_code *code = (*se)[uc];unsigned long i;for(i = 0; i < code->numbits; ++i){/* Add the current bit to curbyte. */curbyte |= get_bit(code->bits, i) << curbit;/* If this byte is filled up then write it* out and reset the curbit and curbyte. */if(++curbit == 8){if(write_cache(pc, &curbyte, sizeof(curbyte)))return 1;curbyte = 0;curbit = 0;}}}/** If there is data in curbyte that has not been* output yet, which means that the last encoded* character did not fall on a byte boundary,* then output it.*/return curbit > 0 ? write_cache(pc, &curbyte, sizeof(curbyte)) : 0;
}//step3:add by yzhang for huffman statistics
int huffST_getSymFrequencies(SymbolFrequencies *SF, huffman_stat *st,int total_count)
{int i,count =0;for(i = 0; i < MAX_SYMBOLS; ++i){ if((*SF)[i]){st->freq[i]=(float)(*SF)[i]->count/total_count;count+=(*SF)[i]->count;}else {st->freq[i]= 0;}}if(count==total_count)return 1;elsereturn 0;
}int huffST_getcodeword(SymbolEncoder *se, huffman_stat *st)
{unsigned long i,j;for(i = 0; i < MAX_SYMBOLS; ++i){huffman_code *p = (*se)[i];if(p){unsigned int numbytes;st->numbits[i] = p->numbits;numbytes = numbytes_from_numbits(p->numbits);for (j=0;j<numbytes;j++)st->bits[i][j] = p->bits[j];}elsest->numbits[i] =0;}return 0;
}void output_huffman_statistics(huffman_stat *st,FILE *out_Table)
{int i,j;unsigned char c;fprintf(out_Table,"symbol\t freq\t codelength\t code\n");for(i = 0; i < MAX_SYMBOLS; ++i){ fprintf(out_Table,"%d\t ",i);fprintf(out_Table,"%f\t ",st->freq[i]);fprintf(out_Table,"%d\t ",st->numbits[i]);if(st->numbits[i]){for(j = 0; j < st->numbits[i]; ++j){c =get_bit(st->bits[i], j);fprintf(out_Table,"%d",c);}}fprintf(out_Table,"\n");}
}
//end by yzhang
/** huffman_encode_file huffman encodes in to out.*/
int
huffman_encode_file(FILE *in, FILE *out, FILE *out_Table) //step1:changed by yzhang for huffman statistics from (FILE *in, FILE *out) to (FILE *in, FILE *out, FILE *out_Table)
{SymbolFrequencies sf;SymbolEncoder *se;huffman_node *root = NULL;int rc;unsigned int symbol_count;//step2:add by yzhang for huffman statisticshuffman_stat hs;//end by yzhang/* Get the frequency of each symbol in the input file. */symbol_count = get_symbol_frequencies(&sf, in); //演示扫描完一遍文件后,SF指针数组的每个元素的构成//step3:add by yzhang for huffman statistics,... get the frequency of each symbol huffST_getSymFrequencies(&sf,&hs,symbol_count);//end by yzhang/* Build an optimal table from the symbolCount. */se = calculate_huffman_codes(&sf);root = sf[0];//step3:add by yzhang for huffman statistics... output the statistics to filehuffST_getcodeword(se, &hs);output_huffman_statistics(&hs,out_Table);//end by yzhang/* Scan the file again and, using the tablepreviously built, encode it into the output file. */rewind(in);rc = write_code_table(out, se, symbol_count);if(rc == 0)rc = do_file_encode(in, out, se);/* Free the Huffman tree. */free_huffman_tree(root);free_encoder(se);return rc;
}int
huffman_decode_file(FILE *in, FILE *out)
{huffman_node *root, *p;int c;unsigned int data_count;/* Read the Huffman code table. */root = read_code_table(in, &data_count);if(!root)return 1;/* Decode the file. */p = root;while(data_count > 0 && (c = fgetc(in)) != EOF){unsigned char byte = (unsigned char)c;unsigned char mask = 1;while(data_count > 0 && mask){p = byte & mask ? p->one : p->zero;mask <<= 1;if(p->isLeaf){fputc(p->symbol, out);p = root;--data_count;}}}free_huffman_tree(root);return 0;
}#define CACHE_SIZE 1024int huffman_encode_memory(const unsigned char *bufin,unsigned int bufinlen,unsigned char **pbufout,unsigned int *pbufoutlen)
{SymbolFrequencies sf;SymbolEncoder *se;huffman_node *root = NULL;int rc;unsigned int symbol_count;buf_cache cache;/* Ensure the arguments are valid. */if(!pbufout || !pbufoutlen)return 1;if(init_cache(&cache, CACHE_SIZE, pbufout, pbufoutlen))return 1;/* Get the frequency of each symbol in the input memory. */symbol_count = get_symbol_frequencies_from_memory(&sf, bufin, bufinlen);/* Build an optimal table from the symbolCount. */se = calculate_huffman_codes(&sf);root = sf[0];/* Scan the memory again and, using the tablepreviously built, encode it into the output memory. */rc = write_code_table_to_memory(&cache, se, symbol_count);if(rc == 0)rc = do_memory_encode(&cache, bufin, bufinlen, se);/* Flush the cache. */flush_cache(&cache);/* Free the Huffman tree. */free_huffman_tree(root);free_encoder(se);free_cache(&cache);return rc;
}int huffman_decode_memory(const unsigned char *bufin,unsigned int bufinlen,unsigned char **pbufout,unsigned int *pbufoutlen)
{huffman_node *root, *p;unsigned int data_count;unsigned int i = 0;unsigned char *buf;unsigned int bufcur = 0;/* Ensure the arguments are valid. */if(!pbufout || !pbufoutlen)return 1;/* Read the Huffman code table. */root = read_code_table_from_memory(bufin, bufinlen, &i, &data_count);if(!root)return 1;buf = (unsigned char*)malloc(data_count);/* Decode the memory. */p = root;for(; i < bufinlen && data_count > 0; ++i) {unsigned char byte = bufin[i];unsigned char mask = 1;while(data_count > 0 && mask){p = byte & mask ? p->one : p->zero;mask <<= 1;if(p->isLeaf){buf[bufcur++] = p->symbol;p = root;--data_count;}}}free_huffman_tree(root);*pbufout = buf;*pbufoutlen = bufcur;return 0;
}
实验结果:
以表格形式表示的实验结果
file type | avi | doc | excel | jpg | mp4 |
---|---|---|---|---|---|
average_codelength | 7.970551 | 2.604187 | 7.81014 | 7.991087 | 8.000016 |
entropy | 7.952410383 | 2.47715948 | 7.778867823 | 7.972575064 | 7.999820052 |
original file size(KB) | 7531.64 | 11 | 35.7 | 293 | 58188.58 |
compressed file size(KB) | 7504.74 | 4.44 | 35.7 | 294 | 58189.34 |
Compression ratio | 1.003584401 | 2.477477477 | 1 | 0.996598639 | 0.999986939 |
file type | png | ppt | psd | rar | |
---|---|---|---|---|---|
average_codelength | 7.966299 | 7.998246 | 5.810035 | 6.067207 | 7.999944 |
entropy | 7.949396899 | 7.980677645 | 5.763231865 | 6.040932735 | 7.998544022 |
original file size(KB) | 212 | 249 | 396 | 693 | 313 |
compressed file size(KB) | 212 | 250 | 288 | 526 | 314 |
Compression ratio | 1 | 0.996 | 1.375 | 1.317490494 | 0.996815287 |
实验结果与预期结果一样,对于分布越是不均匀的样本,Huffman编码的压缩率越高。
数据压缩实验三--Huffman编解码及压缩率的比较相关推荐
- 数据压缩 实验三 Huffman编解码算法实现与压缩效率分析
实验目的 掌握Huffman编解码实现的数据结构和实现框架, 进一步熟练使用C编程语言, 并完成压缩效率的分析. 实验原理 1.本实验中Huffman编码算法 (1)将文件以ASCII字符流的形式读入 ...
- 实验三 Huffman编解码算法实现与压缩效率分析
一.Huffman编解码原理 1. Huffman编码 对原始文件进行Huffman编码,首先需要解决以下几点问题: 文件符号的概率分布情况是怎样的? Huffman树是如何建立的? 建立起Huffm ...
- 实验三—Huffman编解码
一.实验原理 1.Huffman编码的步骤: (1)首先将所有字符发生的概率从小到大进行排序: (2)将最小的两个概率进行两两一合并,之后继续找最小的两个概率进行合并包括前面已经合并的和数: (3)一 ...
- 数据压缩原理 实验三 Huffman编解码算法实现与压缩效率分析
实验原理 Huffman编码是一种无失真编码方式,是一种可变长编码,它将出现概率大的信源符号短编码,出现概率小的信源符号长编码. 编码步骤: ①将文件以ASCII字符流的形式读入,统计每个符号的发生概 ...
- 实验三 LZW编解码算法实现与分析
LZW简述 本部分参考wiki https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch LZW压缩算法在1978年提出,由 Abr ...
- [实验三]LZW 编解码算法实现与分析
目录 一.LZW算法 1.1 编码步骤 1.2 解码步骤 1.3 关于有可能出现当前码字CW不在词典中的情况说明 二.代码实现 2.1 程序说明 2.2 数据结构 2.3 bitio.h 2.4 bi ...
- 实验三 LZW编解码实验
一.LZW算法简介 LZW为词典编码的一种,是通过从输入数据中创建"短语词典".在编码过程中遇到词典中出现的"短语"时,编码器就输出其对应的"序号&q ...
- Huffman编解码
Huffman编解码算法实现与压缩效率分析 一.背景知识及相关公式 1.信源熵 信源熵是信息的度量单位,一般用H表示,单位是比特,对于任意一个随机变量,它的熵定义为,变量的不确定性越大,熵也就越大. ...
- 多媒体技术与应用之图像Huffman编解码
多媒体技术与应用之图像Huffman编解码 一.实验内容 1.了解BMP图像的格式,实现BMP图片格式的数据域及文件头的分离 2.熟悉Huffman编码原理 3.使用Huffman编码算法对给定图像文 ...
最新文章
- DL之Encoder-Decoder:Encoder-Decoder结构的相关论文、设计思路、关键步骤等配图集合之详细攻略
- 【JUC并发编程03】线程间通信
- 单片机设置12分频c语言,AT89C51单片机,如何实现延迟一秒
- ios集成firebase_如何使用Firebase将Google Login集成到Ionic应用程序中
- Silverlight WCF RIA服务(十三)数据 3
- java编程思想3感悟(4)---被隐藏的具体实现
- CSS3 高斯模糊与动画效果
- input reset 重置时间
- VBA中数组(Array)与随机数(Rnd)的使用
- 四位共阳极数码管显示函数_初学者,求助!!设计一个4位LED数码管动态扫描显示电路,用...
- Android实战练习——简单的网络视频播放器
- Jzoj4384 Hashit
- 挑战程序竞赛系列(22):3.2弹性碰撞
- 不会比这更详细的前端工程化的入门教程了
- 骨传导蓝牙耳机哪个牌子好?最受欢迎的五款骨传导蓝牙耳机
- 2019-9-11-数据结构查找方法总结
- Linux Ubuntu20.10 安装Process Monitor(Procmon),以及使用方法
- Discourse开源论坛搭建
- java实习生简历模板
- 贪便宜给自己带来的麻烦,哈哈。