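Data compression is a technique that reduces the size of data without losing useful information, so that it takes less storage space and can be transmitted, stored and processed more efficiently. It works by reorganizing the data according to some algorithm in order to reduce redundancy. Data compression comes in two kinds: lossy and lossless.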

This article shows, from scratch, how to implement the string compression and file compression shown in the figure below.

Implementing data compression is simple. You only need to know the following:

Binary
Arrays
Prefix codes
Huffman tree
Huffman coding

Let's get started:

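What is a byte array?

In Scala, every Byte is an 8-bit two's-complement value, and an array of such Bytes is a byte array. The byte is also the basic unit of data storage in a computer.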
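
To make the 8-bit two's-complement representation concrete, here is a minimal sketch. The names ByteArrayDemo and toBinary8 are made up for this illustration and are not part of the article's source.

```scala
object ByteArrayDemo {
  /** Render one byte as its 8-bit two's-complement pattern. */
  def toBinary8(b: Byte): String = {
    val bits = Integer.toBinaryString(b & 0xFF) // keep only the low 8 bits (no sign extension)
    ("0" * (8 - bits.length)) + bits            // left-pad with zeros to 8 characters
  }

  def main(args: Array[String]): Unit = {
    val bytes = "abcd".getBytes("UTF-8")
    println(bytes.mkString(" "))                // the numeric value of each byte
    println(bytes.map(toBinary8).mkString(" ")) // the same bytes as bit patterns
    println(toBinary8((-51).toByte))            // negative bytes have the top bit set
  }
}
```

Masking with 0xFF matters: without it, a negative Byte is widened to a 32-bit Int first and prints as 32 bits.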

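A small compression example:

In the string "abcd" the ASCII codes of the characters are a->97 b->98 c->99 d->100, so the string converted to a byte array is Array(97, 98, 99, 100).
The corresponding two's-complement binary values are:
97->01100001 98->01100010 99->01100011 100->01100100

Likewise, the string "aaaabbbccd" corresponds to the byte array Array(97,97,97,97,98,98,98,99,99,100), which at the machine level is 01100001011000010110000101100001011000100110001001100010011000110110001101100100
So the string "aaaabbbccd" takes 10 bytes.
Now let's re-encode it. Sort the characters of "aaaabbbccd" by how often they occur: a:4 b:3 c:2 d:1
In that order, use the binary codes 0, 1, 10, 11 for a, b, c, d.
The string "aaaabbbccd" then becomes 0000111101011, which takes only 13 bits, less than 2 bytes. That is already a form of data compression.
Here is the problem: the data is compressed, but it cannot be restored. In 0000111101011 we cannot tell whether a given 0 is an a or the second bit of c (10), and we cannot tell whether a 1 stands alone as b or is part of 10 or 11. The flaw is that one character's code may be a prefix of another character's code.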
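
The ambiguity can be checked mechanically. The sketch below counts how many different ways a bit string can be split into code words; AmbiguityDemo and countDecodings are names invented for this illustration.

```scala
object AmbiguityDemo {
  /** Count the number of ways `bits` can be split into code words taken from `codes`. */
  def countDecodings(bits: String, codes: Set[String]): Long = {
    val ways = Array.fill(bits.length + 1)(0L)
    ways(0) = 1L // the empty prefix has exactly one parse
    for (end <- 1 to bits.length; code <- codes) {
      val start = end - code.length
      // a parse of bits[0, end) may finish with `code` if `code` occupies bits[start, end)
      if (start >= 0 && bits.startsWith(code, start)) ways(end) += ways(start)
    }
    ways(bits.length)
  }

  def main(args: Array[String]): Unit = {
    // naive codes a=0, b=1, c=10, d=11 from the example above
    println(countDecodings("0000111101011", Set("0", "1", "10", "11")))
  }
}
```

Any result greater than 1 means the decoder cannot know which split was intended.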

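Prefix codes

What is a prefix code?
If no character's code is a prefix of any other character's code, the encoding is called a prefix code. In other words, while reading the bits we can never match more than one code. Huffman coding is such a prefix code.
Data compressed with a prefix code can be restored: decoding simply reverses the encoding.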
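
As a quick sanity check, the prefix property can be tested in a few lines. This is a sketch only; PrefixCheck and isPrefixFree are hypothetical names, not part of the article's code.

```scala
object PrefixCheck {
  /** True if no code in `codes` is a prefix of (or equal to) another code. */
  def isPrefixFree(codes: Iterable[String]): Boolean = {
    val sorted = codes.toVector.sorted
    // After lexicographic sorting, if some code is a prefix of another one,
    // it is also a prefix of its immediate successor, so adjacent pairs suffice.
    sorted.zip(sorted.drop(1)).forall { case (a, b) => !b.startsWith(a) }
  }

  def main(args: Array[String]): Unit = {
    println(isPrefixFree(Seq("0", "1", "10", "11")))    // the naive codes from the earlier example
    println(isPrefixFree(Seq("0", "10", "110", "111"))) // a prefix code
  }
}
```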

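Huffman Tree

Before discussing Huffman coding, let's first look at what a Huffman tree is.

Given n weights as n leaf nodes, construct a binary tree. If the weighted path length (WPL) of the tree is minimal, the tree is called an optimal binary tree, also known as a Huffman tree.
A Huffman tree is the tree with the shortest weighted path length; nodes with larger weights are closer to the root.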
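
A minimal sketch of how the minimal WPL can be computed, assuming the usual greedy construction (repeatedly merge the two smallest weights). Each merge adds one level above every leaf in the merged subtree, so the WPL equals the sum of all merged weights. WplDemo and minWeightedPathLength are names chosen for this illustration.

```scala
import scala.collection.mutable

object WplDemo {
  /** Minimal weighted path length for the given leaf weights. */
  def minWeightedPathLength(weights: Seq[Int]): Int = {
    // PriorityQueue is a max-heap by default; reverse the ordering to get a min-heap.
    val heap = mutable.PriorityQueue.empty[Int](Ordering[Int].reverse)
    heap ++= weights
    var total = 0
    while (heap.size > 1) {
      val merged = heap.dequeue() + heap.dequeue() // the two smallest weights
      total += merged                              // every leaf below gets one level deeper
      heap.enqueue(merged)
    }
    total
  }

  def main(args: Array[String]): Unit = {
    // weights of a, b, c, d in "aaaabbbccd": merges are 1+2=3, 3+3=6, 4+6=10
    println(minWeightedPathLength(Seq(4, 3, 2, 1)))
  }
}
```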

As shown in the figure:

Next, let's create a Huffman tree in code.

Create a node class. Node has to extend Comparable, because nodes must be sortable.
Build the Huffman tree. Note that in Scala ListBuffer is the variable-length list; List is immutable.
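
The two steps might look like the following compact sketch. It follows the description above (a Comparable node and a ListBuffer) but is simplified compared with the full listing at the end of the article; the object name HuffmanTreeSketch and the isLeaf helper are additions made for this illustration.

```scala
import scala.collection.mutable.ListBuffer

object HuffmanTreeSketch {
  // Nodes are ordered by weight; `data` is only meaningful on leaf nodes.
  class Node(val data: Byte, val weight: Int, val left: Node, val right: Node)
      extends Comparable[Node] {
    def isLeaf: Boolean = left == null && right == null
    override def compareTo(o: Node): Int = Integer.compare(weight, o.weight)
  }

  def createHuffmanTree(nodes: ListBuffer[Node]): Node = {
    require(nodes.nonEmpty, "need at least one node")
    while (nodes.size > 1) {
      val sorted = nodes.sorted // returns a new, sorted buffer; `nodes` itself is unchanged
      val left = sorted(0)      // smallest weight
      val right = sorted(1)     // second smallest weight
      nodes -= left
      nodes -= right
      nodes += new Node(0, left.weight + right.weight, left, right)
    }
    nodes.head // the single remaining node is the root
  }

  def main(args: Array[String]): Unit = {
    val leaves = ListBuffer(
      new Node('a'.toByte, 4, null, null), new Node('b'.toByte, 3, null, null),
      new Node('c'.toByte, 2, null, null), new Node('d'.toByte, 1, null, null))
    println(createHuffmanTree(leaves).weight) // the root weight is the sum of all leaf weights
  }
}
```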

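Huffman Coding

What is Huffman coding?
Huffman coding is an encoding scheme and a kind of variable-length code (VLC). Huffman proposed the method in 1952: it builds prefix-free code words with the shortest average length, based entirely on how likely each character is to occur. For that reason it is sometimes called optimal coding.

Huffman code table
Given the string "aaaabbbccd", we build a Huffman tree using the number of occurrences of each character as its weight. From the tree we assign a code to each character: a step to the left is 0 and a step to the right is 1. The codes are shown in the figure:

With this code table, the string "aaaabbbccd" is encoded as "0000101010111111110". Compared with the original encoding "01100001011000010110000101100001011000100110001001100010011000110110001101100100" this saves a lot of space: the original took 10 bytes, while the Huffman-encoded version takes 19 bits, less than 3 bytes.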
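
The following sketch derives a code table for "aaaabbbccd" and encodes the string. It uses a small immutable tree type (Tree, Leaf, Branch) invented for this example instead of the article's Node class. Because ties between equal weights can be broken differently, the individual codes may differ from the ones in the figure, but the encoded length is the same 19 bits.

```scala
object CodeTableSketch {
  sealed trait Tree { def weight: Int }
  final case class Leaf(symbol: Char, weight: Int) extends Tree
  final case class Branch(left: Tree, right: Tree) extends Tree {
    val weight: Int = left.weight + right.weight
  }

  /** Greedy Huffman construction: repeatedly merge the two lightest trees. */
  def build(freq: Map[Char, Int]): Tree = {
    require(freq.nonEmpty, "need at least one symbol")
    var forest: List[Tree] = freq.toList.sortBy(_._1).map { case (c, w) => Leaf(c, w) }
    while (forest.size > 1) {
      val sorted = forest.sortBy(_.weight)
      forest = Branch(sorted(0), sorted(1)) :: sorted.drop(2)
    }
    forest.head
  }

  /** Walk the tree: left edge = 0, right edge = 1. A lone leaf gets the code "0". */
  def codes(tree: Tree, prefix: String = ""): Map[Char, String] = tree match {
    case Leaf(c, _)   => Map(c -> (if (prefix.isEmpty) "0" else prefix))
    case Branch(l, r) => codes(l, prefix + "0") ++ codes(r, prefix + "1")
  }

  /** Returns the code table and the encoded bit string for `text`. */
  def encode(text: String): (Map[Char, String], String) = {
    val freq = text.groupBy(identity).map { case (c, s) => c -> s.length }
    val table = codes(build(freq))
    (table, text.map(c => table(c)).mkString)
  }

  def main(args: Array[String]): Unit = {
    val (table, bits) = encode("aaaabbbccd")
    println(table)
    println(s"$bits (${bits.length} bits)")
  }
}
```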

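Now for the key part!

How do we turn a byte array into a Huffman-encoded byte array?

Get the byte array and iterate over it; with each byte as the Node's data and its number of occurrences as the weight, build a ListBuffer
Build a Huffman tree from the elements of the ListBuffer (the Nodes)
Generate the Huffman code table from the Huffman tree
Use the Huffman code table to encode the original byte array into a Huffman-encoded byte array, which completes the compression

Let's walk through the process with the following string
"Only if you asked to see me, our meeting would be meaningful to me"

First, convert the string into a byte array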

val content = "Only if you asked to see me, our meeting would be meaningful to me"
val bytes = content.getBytes()
for (b <- bytes) print(s"$b ") // iterating over bytes prints:
/*
79 110 108 121 32 105 102 32 121 111 117 32 97 115 107 101 100 32 116 111 32 115 101 101 32 109 101 44 32 111 117 114 32 109 101 101 116 105 110 103 32 119 111 117 108 100 32 98 101 32 109 101 97 110 105 110 103 102 117 108 32 116 111 32 109 101
*/

Build a ListBuffer from the byte array. Its elements are Nodes: data holds the byte and weight holds the number of times that byte occurs, much like a word count.

val list = getNodes(bytes)
for (x <- list) print(s"${x.data}->${x.weight}; ")
// iterating prints:
/*
32->13; 97->2; 98->1; 100->2; 101->9; 102->2; 103->2; 105->3; 107->1; 44->1; 108->3; 109->4; 110->4; 79->1; 111->5; 114->1; 115->2; 116->3; 117->4; 119->1; 121->2;
*/
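
The counting step on its own can also be written with the standard collection operations. This is an alternative sketch, not the getNodes function from the listing at the end; FrequencyDemo and byteFrequencies are hypothetical names.

```scala
object FrequencyDemo {
  /** Number of occurrences of each byte value in `bytes`. */
  def byteFrequencies(bytes: Array[Byte]): Map[Byte, Int] =
    bytes.groupBy(identity).map { case (b, group) => b -> group.length }

  def main(args: Array[String]): Unit = {
    val content = "Only if you asked to see me, our meeting would be meaningful to me"
    val freq = byteFrequencies(content.getBytes("UTF-8"))
    // print in ascending byte order so the output is stable
    println(freq.toSeq.sortBy(_._1).map { case (b, n) => s"$b->$n" }.mkString("; "))
  }
}
```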

Build a Huffman tree from the elements of the ListBuffer (the Nodes); what is returned is the root node of the tree
Generate the Huffman code table from the tree; the table is stored in a mutable Map

val list = getNodes(bytes)
val tree = createHuffmanTree(list)
val map = getHuffmanCodeTab(tree)
for ((k, v) <- map) print(s"$k->$v; ")
// iterating prints:
/*
32->00; 97->01110; 98->110000; 100->01111; 101->101; 102->10000; 103->10001; 105->11101; 107->110001; 44->110010; 108->11110; 109->0100; 110->0101; 79->110011; 111->1101; 114->111000; 115->10010; 116->11111; 117->0110; 119->111001; 121->10011;
*/

Use the Huffman code table to encode the original byte array into a Huffman-encoded byte array, which completes the compression

val huffmanBytes = enCode(bytes, map)
for (b <- huffmanBytes) print(s"$b ")
// iterating prints:
/*
-51 125 51 -80 39 -84 58 88 -41 -97 -46 86 -119 114 53 -72 18 -33 -11 98 115 -83 -25 -104 81 43 -105 -85 24 55 -113 -24 37 0
*/
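
To see where the first numbers come from: with the code table above, "Only" starts with the codes 110011 (O), 0101 (n), 11110 (l), 10011 (y), and every group of 8 bits becomes one byte. The sketch below packs only complete 8-bit groups; handling of the final partial group and of the trailing flag byte is left to the enCode function in the full listing. PackBitsDemo and packFullBytes are names invented for this illustration.

```scala
object PackBitsDemo {
  /** Pack a bit string whose length is a multiple of 8 into bytes, 8 bits per byte. */
  def packFullBytes(bits: String): Array[Byte] = {
    require(bits.length % 8 == 0, "this sketch only handles complete bytes")
    // parse as an unsigned value 0..255, then narrow to a signed Byte
    bits.grouped(8).map(chunk => Integer.parseInt(chunk, 2).toByte).toArray
  }

  def main(args: Array[String]): Unit = {
    // first 16 bits of the encoded sentence: O n l + the first bit of y
    val first16 = "110011" + "0101" + "11110" + "1"
    println(packFullBytes(first16).mkString(" "))
  }
}
```

Values above 127 wrap around to negative numbers because Byte is signed, which is why the encoded array contains negative entries.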

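At this point the data has been compressed, from 66 bytes down to 34 bytes.

Original byte array:
79 110 108 121 32 105 102 32 121 111 117 32 97 115 107 101 100 32 116 111 32 115 101 101 32 109 101 44 32 111 117 114 32 109 101 101 116 105 110 103 32 119 111 117 108 100 32 98 101 32 109 101 97 110 105 110 103 102 117 108 32 116 111 32 109 101
Encoded array:
-51 125 51 -80 39 -84 58 88 -41 -97 -46 86 -119 114 53 -72 18 -33 -11 98 115 -83 -25 -104 81 43 -105 -85 24 55 -113 -24 37 0

Finally, how do we decode?

Decoding simply reverses the encoding process: expand the bytes back into the bit string, then match the bits against the code table one code at a time.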
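
The matching step can be sketched as follows, working on a bit string and a character table for readability (the full listing works on bytes). DecodeSketch and decode are illustrative names; the table in main is the one used for "aaaabbbccd" earlier in the article.

```scala
object DecodeSketch {
  /** Decode `bits` with a prefix code given as symbol -> code. */
  def decode(bits: String, table: Map[Char, String]): String = {
    val reverse = table.map { case (sym, code) => code -> sym } // code -> symbol
    val out = new StringBuilder
    val current = new StringBuilder
    for (bit <- bits) {
      current += bit
      // With a prefix code, the first match is the only possible match.
      reverse.get(current.toString).foreach { sym =>
        out += sym
        current.clear()
      }
    }
    require(current.isEmpty, "trailing bits do not form a complete code")
    out.toString
  }

  def main(args: Array[String]): Unit = {
    val table = Map('a' -> "0", 'b' -> "10", 'd' -> "110", 'c' -> "111")
    println(decode("0000101010111111110", table))
  }
}
```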

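File compression

Huffman coding works byte by byte, so it can handle any kind of file (images, text files, XML files and so on). Of course, if a file contains little repeated data, the compression will not be very noticeable, for example with a complex, color-rich image.

Let's test it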

zipFile("F:\\src.txt", "F:\\scr.zip")
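
The compressed file has to carry both the encoded bytes and the code table, otherwise it cannot be decoded later. Below is a minimal sketch of such a container using Java object serialization, the same mechanism the full listing uses. ContainerSketch, write and read are names made up for this example, and an immutable Map is assumed here for brevity.

```scala
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

object ContainerSketch {
  /** Write the encoded bytes followed by the code table. */
  def write(path: String, payload: Array[Byte], table: Map[Byte, String]): Unit = {
    val oos = new ObjectOutputStream(new FileOutputStream(path))
    try {
      oos.writeObject(payload)
      oos.writeObject(table) // without the table the payload cannot be decoded
    } finally oos.close()    // closing the outer stream flushes it and closes the file
  }

  /** Read the two objects back in the order they were written. */
  def read(path: String): (Array[Byte], Map[Byte, String]) = {
    val ois = new ObjectInputStream(new FileInputStream(path))
    try {
      val payload = ois.readObject().asInstanceOf[Array[Byte]]
      val table = ois.readObject().asInstanceOf[Map[Byte, String]]
      (payload, table)
    } finally ois.close()
  }
}
```

Closing the ObjectOutputStream rather than the underlying FileOutputStream is deliberate: the object stream buffers data, and closing the file first can leave the tail of the data unwritten.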

To finish, here is the full source code:

import java.io.{FileInputStream, FileOutputStream, InputStream, ObjectInputStream, ObjectOutputStream, OutputStream}
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

object CompressionDemo {

  def main(args: Array[String]): Unit = {
    val content = "Only if you asked to see me, our meeting would be meaningful to me"
    val contentBytes = content.getBytes()
    println("origin size: " + contentBytes.length + "B")
    print("origin: ")
    for (b <- contentBytes) print(s"$b ")
    println("\norigin content: " + content)
    println()

    /*
    Encoding steps
    step 1: build a list of nodes from each byte (data) and its number of occurrences (weight)
    step 2: build the Huffman tree from the elements of nodes
    step 3: generate the Huffman code table from the Huffman tree
    step 4: use the code table to turn the original byte array into the Huffman-encoded byte array
    */
    val huffmanCodeBytes = enCode(contentBytes)
    println("encoded size: " + huffmanCodeBytes.length + "B")
    print("encoded: ")
    for (b <- huffmanCodeBytes) print(s"$b ")
    println("\nencoded content: " + new String(huffmanCodeBytes))
    println()

    /*
    Decoding steps: take the original Huffman code table and reverse the process
    */
    val originBytes = deCode(huffmanCodeBytes, huffmanCodeTab)
    println("decoded size: " + originBytes.length + "B")
    print("decoded: ")
    for (b <- originBytes) print(s"$b ")
    println("\ndecoded content: " + new String(originBytes))

    // compression & decompression
    /*
    zipFile(sourcePath, targetPath)
    unZipFile(sourcePath, targetPath)
    */
  }

  /**
   * Decompression of data
   *
   * @param srcPath file input path
   * @param dstPath file output path
   */
  def unZipFile(srcPath: String, dstPath: String): Unit = {
    var is: InputStream = null
    var ois: ObjectInputStream = null
    var os: OutputStream = null
    try {
      // read file
      is = new FileInputStream(srcPath)
      ois = new ObjectInputStream(is)
      val huffmanCodeBytes = ois.readObject().asInstanceOf[Array[Byte]]
      val huffmanCodeTab = ois.readObject().asInstanceOf[mutable.Map[Byte, String]]
      // decode
      val bytes = deCode(huffmanCodeBytes, huffmanCodeTab)
      // write file
      os = new FileOutputStream(dstPath)
      os.write(bytes)
    } catch {
      case ex: Exception => println(ex.getMessage)
    } finally {
      try {
        if (ois != null) ois.close()
        if (is != null) is.close()
        if (os != null) os.close()
      } catch {
        case ex: Exception => println(ex.getMessage)
      } finally {
        println("decompression finished")
      }
    }
  }

  /**
   * Compression of data
   *
   * @param srcPath file input path
   * @param dstPath file output path
   */
  def zipFile(srcPath: String, dstPath: String): Unit = {
    var originBytes = 0
    var huffmanBytes = 0
    var is: InputStream = null
    var os: OutputStream = null
    var oos: ObjectOutputStream = null
    try {
      // read file
      is = new FileInputStream(srcPath)
      val b: Array[Byte] = new Array[Byte](is.available)
      is.read(b)
      originBytes = b.length
      // encode
      val huffmanCodeBytes = enCode(b)
      huffmanBytes = huffmanCodeBytes.length
      // write file
      os = new FileOutputStream(dstPath)
      oos = new ObjectOutputStream(os)
      oos.writeObject(huffmanCodeBytes)
      // the Huffman code table must be written as well; it is needed for decoding
      oos.writeObject(huffmanCodeTab)
    } catch {
      case ex: Exception => println(ex.getMessage)
    } finally {
      try {
        // close the object stream first so its buffered data is flushed
        // before the underlying file stream is closed
        if (oos != null) oos.close()
        if (os != null) os.close()
        if (is != null) is.close()
      } catch {
        case ex: Exception => println(ex.getMessage)
      } finally {
        println("compression finished: " + originBytes + "B -> " + huffmanBytes + "B")
      }
    }
  }

  /**
   * Get origin byte array
   *
   * @param bytes          Huffman code byte array
   * @param huffmanCodeTab Huffman code table
   * @return array of Byte
   */
  def deCode(bytes: Array[Byte], huffmanCodeTab: mutable.Map[Byte, String]): Array[Byte] = {
    val sb = bitToString(bytes)
    // reverse the table: code -> byte
    val map = mutable.Map[String, Byte]()
    for ((k, v) <- huffmanCodeTab) map.put(v, k)
    val list = ListBuffer[Byte]()
    val checkStr = new StringBuilder
    for (c <- sb) {
      checkStr += c
      // as soon as the collected bits match a code, emit its byte and start over
      map.get(checkStr.toString).foreach { b =>
        list += b
        checkStr.clear()
      }
    }
    list.toArray
  }

  /**
   * Get Huffman code string
   *
   * @param bytes Huffman code array
   * @return Huffman code string
   */
  def bitToString(bytes: Array[Byte]): StringBuilder = {
    val sb = new StringBuilder
    // all bytes except the last data byte and the flag byte are full 8-bit chunks
    for (i <- 0 until bytes.length - 2) {
      val b = bytes(i) | 256
      val str = b.toBinaryString
      sb ++= str.substring(str.length - 8)
    }
    val lastByte = bytes(bytes.length - 2)
    // the flag byte stores the number of leading zeros of the last (partial) chunk
    val flagByte = bytes(bytes.length - 1)
    // Limitation of this flag scheme: if the total number of code bits is an exact
    // multiple of 8, the last data byte is a full chunk and is not restored correctly
    // here. Storing the number of valid bits of the last chunk would avoid this.
    if (flagByte == 0) {
      sb ++= lastByte.toBinaryString
    } else {
      if (lastByte > 0) {
        for (i <- 1 to flagByte) sb += '0'
        sb ++= lastByte.toBinaryString
      } else {
        for (i <- 1 to flagByte) sb += '0'
      }
    }
    sb
  }

  /**
   * Convert array to huffman code array
   *
   * @param bytes array to be processed
   * @return a huffman code array
   */
  def enCode(bytes: Array[Byte]): Array[Byte] = {
    getHuffmanCodeTab(createHuffmanTree(getNodes(bytes)))
    enCode(bytes, huffmanCodeTab)
  }

  /**
   * Convert array to huffman code array
   *
   * @param bytes          array to be processed
   * @param huffmanCodeTab huffman code table
   * @return a huffman code array
   */
  def enCode(bytes: Array[Byte], huffmanCodeTab: mutable.Map[Byte, String]): Array[Byte] = {
    val sb = new StringBuilder
    for (b <- bytes) sb ++= huffmanCodeTab(b)
    val len = (sb.length + 7) / 8 // if (sb.length % 8 == 0) sb.length / 8 else sb.length / 8 + 1
    val huffmanCodeBytes = new Array[Byte](len + 1) // one extra byte for the trailing flag, so no data is lost
    var index = 0
    var lastByte: Byte = 0
    for (i <- 0 until sb.length by 8) {
      val str = if (i + 8 > sb.length) sb.substring(i) else sb.substring(i, i + 8)
      if (str.length < 8) {
        // partial last chunk: remember how many leading zeros parseInt will drop
        lastByte = str.takeWhile(_ == '0').length.toByte
      }
      huffmanCodeBytes(index) = Integer.parseInt(str, 2).toByte
      index += 1
    }
    huffmanCodeBytes(index) = lastByte
    huffmanCodeBytes
  }

  // Huffman code table, for example 32->"01", 97->"100" ...
  val huffmanCodeTab = mutable.Map[Byte, String]()
  // save leaf node path
  val stringBuilder = new StringBuilder

  /**
   * Get a huffman code table
   *
   * @param root the root node
   * @return a map of huffman code table
   */
  def getHuffmanCodeTab(root: Node): mutable.Map[Byte, String] = {
    if (root.weight == 0) return null
    // is single?
    if (root.left == null && root.right == null) {
      huffmanCodeTab.put(root.data, "0")
    }
    markNodePath(root.left, '0', stringBuilder)
    markNodePath(root.right, '1', stringBuilder)
    huffmanCodeTab
  }

  /**
   * Mark the leaf node path
   *
   * @param node the node of huffman tree
   * @param code the leaf node path code
   * @param sb   save leaf node path
   */
  def markNodePath(node: Node, code: Char, sb: StringBuilder): Unit = {
    // copy the path so that sibling branches do not share one builder
    val sbTemp = new StringBuilder(sb.toString)
    sbTemp += code
    if (node != null) {
      if (node.data == 0 && (node.left != null || node.right != null)) { // Parent node
        markNodePath(node.left, '0', sbTemp)
        markNodePath(node.right, '1', sbTemp)
      } else { // Leaf node
        huffmanCodeTab.put(node.data, sbTemp.toString)
      }
    }
  }

  /**
   * Create Huffman tree
   *
   * @param nodes a list of nodes
   * @return parent node of Huffman tree
   */
  def createHuffmanTree(nodes: ListBuffer[Node]): Node = {
    while (nodes.size > 1) {
      // watch out: unlike an in-place sort in Java, sorted returns a new collection
      // and nodes itself is not changed
      val sortedNode = nodes.sorted
      val leftNode = sortedNode(0)
      val rightNode = sortedNode(1)
      val parent = new Node(0, leftNode.weight + rightNode.weight, leftNode, rightNode)
      nodes -= leftNode
      nodes -= rightNode
      nodes += parent
    }
    nodes(0)
  }

  /**
   * Get a list of node
   *
   * @param bytes a byte array
   * @return a list of node
   */
  def getNodes(bytes: Array[Byte]): ListBuffer[Node] = {
    val nodes = ListBuffer[Node]()
    val counts = mutable.Map[Byte, Int]()
    // count the occurrences of each byte b
    for (b <- bytes) {
      counts.put(b, counts.getOrElse(b, 0) + 1)
    }
    for ((k, v) <- counts) {
      nodes += new Node(k, v, null, null)
    }
    nodes
  }

  /**
   * The node of binary tree
   * data: character storage, for example 'a' => 97, 'b' => 98
   * weight: the weight of Huffman tree
   * left: the left node
   * right: the right node
   */
  class Node extends Comparable[Node] {
    var data: Byte = _
    var weight: Int = _
    var left: Node = _
    var right: Node = _

    def this(data: Byte, weight: Int, left: Node, right: Node) = {
      this()
      this.data = data
      this.weight = weight
      this.left = left
      this.right = right
    }

    def preOrder: Unit = {
      print("[data:" + this.data + ", weight:" + this.weight + "] ")
      if (this.left != null) this.left.preOrder
      if (this.right != null) this.right.preOrder
    }

    override def compareTo(o: Node): Int = this.weight - o.weight
  }
}
