  In Scala, every Byte is an 8-bit two's-complement value, and an array of such Bytes is a byte array. The byte is also the basic unit of data storage in a computer.


  Each character of the string "abcd" has an ASCII code: a->97 b->98 c->99 d->100. Converted to a byte array, the string is Array(97, 98, 99, 100).
97->01100001 98->01100010 99->01100011 100->01100100

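  Likewise, the string "aaaabbbccd" corresponds to the byte array Array(97, 97, 97, 97, 98, 98, 98, 99, 99, 100), which at the lowest level is stored as 01100001011000010110000101100001011000100110001001100010011000110110001101100100
  So the string "aaaabbbccd" takes 10 bytes.
  Now let's design a new encoding. Sort the characters of "aaaabbbccd" by how often they occur: a:4 b:3 c:2 d:1
In that order, assign the binary codes 0, 1, 10, 11 to a, b, c, d.
The string "aaaabbbccd" then becomes 0000111101011, which takes only 13 bits, less than 2 bytes. This is already a form of data compression.
  Here is the problem: the data is compressed, but it cannot be restored. In 0000111101011 we cannot tell whether a 0 is an a or the second bit of a c (10), nor whether a 1 stands alone as b or belongs to 10 or 11. In this scheme, one character's code can be the prefix of another character's code.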
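To make the ambiguity concrete, here is a minimal sketch (not part of the original code; the object and method names are made up for illustration). It encodes with the naive table and lists every string that a given bit sequence could stand for.

```scala
object NaiveCodeDemo {
  // The naive table from the text: a -> 0, b -> 1, c -> 10, d -> 11.
  val table: Map[Char, String] = Map('a' -> "0", 'b' -> "1", 'c' -> "10", 'd' -> "11")

  // Concatenate the code of every character.
  def encode(s: String): String = s.map(c => table(c)).mkString

  // Enumerate every string whose encoding equals `bits`.
  // Intended for short inputs only: the number of candidates grows quickly.
  def decodings(bits: String): List[String] =
    if (bits.isEmpty) List("")
    else for {
      (ch, code) <- table.toList
      if bits.startsWith(code)
      rest <- decodings(bits.drop(code.length))
    } yield ch.toString + rest

  def main(args: Array[String]): Unit = {
    println(encode("aaaabbbccd"))
    // Two bits already have more than one reading: "c" and "ba".
    println(decodings("10"))
  }
}
```

Because even the two-bit sequence 10 has two readings, a decoder has no way to choose between them without extra information.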


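  No character's code may be the prefix of another character's code. An encoding that meets this requirement is called a prefix code: a bit sequence can never match more than one code word. Huffman coding is such a prefix code.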
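The prefix requirement can be checked mechanically. The following is a small sketch under the assumption that codes are given as strings of '0' and '1'; `PrefixCheck` and `isPrefixFree` are illustrative names, not part of the original code.

```scala
object PrefixCheck {
  // True when no code word is a prefix of (or equal to) a different entry.
  def isPrefixFree(codes: Seq[String]): Boolean =
    codes.indices.forall { i =>
      codes.indices.forall { j => i == j || !codes(j).startsWith(codes(i)) }
    }

  def main(args: Array[String]): Unit = {
    println(isPrefixFree(Seq("0", "1", "10", "11")))    // the naive table
    println(isPrefixFree(Seq("0", "10", "110", "111"))) // a prefix code
  }
}
```

The naive table 0, 1, 10, 11 fails the check because 1 is a prefix of both 10 and 11.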

Huffman Tree


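  • Given n weights as n leaf nodes, construct a binary tree. If the tree's weighted path length (WPL) is minimal, the tree is called an optimal binary tree, also known as a Huffman tree.
  • A Huffman tree is the tree with the shortest weighted path length; nodes with larger weights sit closer to the root.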
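The weighted path length is WPL = w1*l1 + ... + wn*ln, where wi is the weight of leaf i and li is its depth. As a rough illustration, the sketch below compares two trees over the weights 4, 3, 2, 1; the leaf depths are assumptions chosen for the example, and `WplDemo` is a made-up name.

```scala
object WplDemo {
  // Weighted path length: sum over all leaves of weight * depth.
  def wpl(leaves: Seq[(Int, Int)]): Int =
    leaves.map { case (weight, depth) => weight * depth }.sum

  def main(args: Array[String]): Unit = {
    // Heavier leaves closer to the root: depths 1, 2, 3, 3.
    println(wpl(Seq((4, 1), (3, 2), (2, 3), (1, 3))))
    // A perfectly balanced tree: every leaf at depth 2.
    println(wpl(Seq((4, 2), (3, 2), (2, 2), (1, 2))))
  }
}
```

With these depths the skewed tree has the smaller WPL, which matches the rule that heavier nodes should be nearer the root.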


Below we build a Huffman tree in code.

  1. Create a node class. Node needs to implement Comparable, because nodes must be sortable.
  2. Build the Huffman tree. Note that in Scala, ListBuffer is the growable list, whereas List is immutable.
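The two steps can also be sketched with immutable case classes and a priority queue. This is an alternative to the ListBuffer-and-sort approach described here, not the article's implementation; `TreeSketch`, `Leaf`, `Branch`, `build` and `depths` are names assumed for the example.

```scala
import scala.collection.mutable

object TreeSketch {
  sealed trait Tree { def weight: Int }
  final case class Leaf(data: Byte, weight: Int) extends Tree
  final case class Branch(left: Tree, right: Tree) extends Tree {
    val weight: Int = left.weight + right.weight
  }

  // Repeatedly merge the two lightest trees until a single tree remains.
  def build(weights: Map[Byte, Int]): Tree = {
    require(weights.nonEmpty, "need at least one symbol")
    // PriorityQueue is a max-heap, so the ordering is reversed to pop the smallest weight first.
    val queue = mutable.PriorityQueue.empty[Tree](Ordering.by[Tree, Int](_.weight).reverse)
    weights.foreach { case (b, w) => queue.enqueue(Leaf(b, w)) }
    while (queue.size > 1) {
      val left = queue.dequeue()
      val right = queue.dequeue()
      queue.enqueue(Branch(left, right))
    }
    queue.dequeue()
  }

  // Depth of every leaf, which equals the length of its Huffman code.
  def depths(tree: Tree, depth: Int = 0): Map[Byte, Int] = tree match {
    case Leaf(data, _)       => Map(data -> depth)
    case Branch(left, right) => depths(left, depth + 1) ++ depths(right, depth + 1)
  }

  def main(args: Array[String]): Unit = {
    val weights = Map('a'.toByte -> 4, 'b'.toByte -> 3, 'c'.toByte -> 2, 'd'.toByte -> 1)
    println(depths(build(weights)))
  }
}
```

A priority queue avoids re-sorting the whole list on every merge, at the cost of a slightly less direct mapping to the textbook description.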
Huffman Coding

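  Huffman coding is an encoding scheme and one kind of variable-length code (VLC). Huffman proposed the method in 1952. It builds prefix-free code words with the shortest average length, based entirely on how likely each character is to occur, and is therefore sometimes called optimal coding.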

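  Given the string "aaaabbbccd", we build a Huffman tree using each character's occurrence count as its weight. From the tree we assign a code to each character: a branch to the left is 0 and a branch to the right is 1. The resulting codes are a -> 0, b -> 10, c -> 111, d -> 110.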

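  With this code table, the string "aaaabbbccd" is encoded as "0000101010111111110". Compared with the original encoding "01100001011000010110000101100001011000100110001001100010011000110110001101100100", this saves a lot of space: the original takes 10 bytes (80 bits), while the Huffman-encoded form takes 19 bits, less than 3 bytes.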
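The encoding step itself is a table lookup per character. Below is a minimal sketch, assuming the code table a -> 0, b -> 10, c -> 111, d -> 110 (the table consistent with the encoded string above); `TableEncodeDemo` is an illustrative name.

```scala
object TableEncodeDemo {
  // Assumed code table for "aaaabbbccd" (left branch = 0, right branch = 1).
  val codes: Map[Char, String] = Map('a' -> "0", 'b' -> "10", 'c' -> "111", 'd' -> "110")

  // Replace every character by its code and concatenate the results.
  def encode(s: String): String = s.map(c => codes(c)).mkString

  // Number of whole bytes needed to store a bit string.
  def bytesNeeded(bits: String): Int = (bits.length + 7) / 8

  def main(args: Array[String]): Unit = {
    val bits = encode("aaaabbbccd")
    println(s"$bits (${bits.length} bits, ${bytesNeeded(bits)} bytes)")
  }
}
```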


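  1. Get the byte array and iterate over it. Each distinct byte becomes the data of a Node, and the number of times it occurs becomes the Node's weight. Collect the Nodes in a ListBuffer.
  2. Build the Huffman tree from the elements (Nodes) of the ListBuffer.
  3. Generate the Huffman code table from the Huffman tree.
  4. Use the code table to encode the original byte array into a Huffman-encoded byte array, which completes the compression.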
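Step 4 has to turn a bit string into real bytes, and the last byte is usually only partly filled. One possible way to handle that is sketched below: pad the final chunk with zeros and append one extra byte that records the padding length. This padding scheme and the names `BitPacking`, `pack` and `unpack` are assumptions for illustration; the full listing in this article uses a different trailing flag.

```scala
object BitPacking {
  // Pack a string of '0'/'1' into bytes; the extra last byte records how many padding bits were added.
  def pack(bits: String): Array[Byte] = {
    val padding = (8 - bits.length % 8) % 8
    val padded = bits + "0" * padding
    val data = padded.grouped(8).map(chunk => Integer.parseInt(chunk, 2).toByte).toArray
    data :+ padding.toByte
  }

  // Reverse of pack: expand every data byte to 8 bits, then drop the padding.
  def unpack(bytes: Array[Byte]): String = {
    val padding = bytes.last.toInt
    val bits = bytes.init.map { b =>
      // Mask to 0..255 so negative bytes do not produce 32 binary digits.
      val s = (b & 0xFF).toBinaryString
      "0" * (8 - s.length) + s
    }.mkString
    bits.dropRight(padding)
  }

  def main(args: Array[String]): Unit = {
    val packed = pack("0000101010111111110")
    println(packed.mkString(" "))
    println(unpack(packed))
  }
}
```

Recording the padding explicitly keeps leading and trailing zeros of the final chunk, which would otherwise be lost when the chunk is parsed as a number.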

As an example, take the following string:
“Only if you asked to see me, our meeting would be meaningful to me”

  1. Convert the string to a byte array
val content = "Only if you asked to see me, our meeting would be meaningful to me"
val bytes = content.getBytes()
for (b <- bytes) print(s"$b ")
// Output:
79 110 108 121 32 105 102 32 121 111 117 32 97 115 107 101 100 32 116 111 32 115 101 101 32 109 101 44 32 111 117 114 32 109 101 101 116 105 110 103 32 119 111 117 108 100 32 98 101 32 109 101 97 110 105 110 103 102 117 108 32 116 111 32 109 101
  2. Build a ListBuffer from the byte array. Each element is a Node whose data is the byte value and whose weight is the number of times that byte occurs, much like a word count
val list = getNodes(bytes)
for (x <- list) print(s"${x.data}->${x.weight}; ")
// Output:
32->13; 97->2; 98->1; 100->2; 101->9; 102->2; 103->2; 105->3; 107->1; 44->1; 108->3; 109->4; 110->4; 79->1; 111->5; 114->1; 115->2; 116->3; 117->4; 119->1; 121->2;
  3. Build the Huffman tree from the elements (Nodes) of the ListBuffer; this returns the root node of the tree
  4. Generate the Huffman code table from the tree; the table is stored in a mutable Map
val tree = createHuffmanTree(list)
val map = getHuffmanCodeTab(tree)
for ((k, v) <- map) print(s"$k->$v; ")
// Output:
32->00; 97->01110; 98->110000; 100->01111; 101->101; 102->10000; 103->10001; 105->11101; 107->110001; 44->110010; 108->11110; 109->0100; 110->0101; 79->110011; 111->1101; 114->111000; 115->10010; 116->11111; 117->0110; 119->111001; 121->10011;
  5. Use the code table to encode the original byte array into a Huffman-encoded byte array, which completes the compression
val huffmanBytes = enCode(bytes, map)
for (b <- huffmanBytes) print(s"$b ")
// Output:
-51 125 51 -80 39 -84 58 88 -41 -97 -46 86 -119 114 53 -72 18 -33 -11 98 115 -83 -25 -104 81 43 -105 -85 24 55 -113 -24 37 0

Original bytes (66 B):
79 110 108 121 32 105 102 32 121 111 117 32 97 115 107 101 100 32 116 111 32 115 101 101 32 109 101 44 32 111 117 114 32 109 101 101 116 105 110 103 32 119 111 117 108 100 32 98 101 32 109 101 97 110 105 110 103 102 117 108 32 116 111 32 109 101
Huffman-encoded bytes (34 B, including the trailing flag byte):
-51 125 51 -80 39 -84 58 88 -41 -97 -46 86 -119 114 53 -72 18 -33 -11 98 115 -83 -25 -104 81 43 -105 -85 24 55 -113 -24 37 0
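Decoding reverses the table lookup: read bits one at a time and emit a byte as soon as the bits collected so far match a code. The sketch below assumes the bits are already available as a string and uses four entries taken from the code table printed above (79, 110, 108, 121); `DecodeSketch` is an illustrative name.

```scala
object DecodeSketch {
  // Greedy decoding; it is unambiguous only because the table is a prefix code.
  def decode(bits: String, table: Map[Byte, String]): Array[Byte] = {
    // Invert the table: code -> byte.
    val inverse: Map[String, Byte] = table.map { case (b, code) => code -> b }
    val out = Array.newBuilder[Byte]
    val current = new StringBuilder
    for (bit <- bits) {
      current += bit
      inverse.get(current.toString).foreach { b =>
        out += b
        current.clear()
      }
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    // Codes for 'O', 'n', 'l', 'y' from the table above.
    val table = Map[Byte, String](
      79.toByte -> "110011", 110.toByte -> "0101",
      108.toByte -> "11110", 121.toByte -> "10011")
    println(new String(decode("11001101011111010011", table), "US-ASCII"))
  }
}
```

The first 16 of these bits, 11001101 01111101, are the bytes -51 and 125 that open the encoded output.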






zipFile("F:\\src.txt", "F:\\scr.zip")


import java.io.{FileInputStream, FileOutputStream, InputStream, ObjectInputStream, ObjectOutputStream, OutputStream}
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

object CompressionDemo {

  def main(args: Array[String]): Unit = {
    val content = "Only if you asked to see me, our meeting would be meaningful to me"
    val contentBytes = content.getBytes()
    println("origin size: " + contentBytes.length + "B")
    print("origin: ")
    for (b <- contentBytes) print(s"$b ")
    println("\norigin content: " + content)
    println()

    /*
     Encoding steps
     step 1: build a list of nodes from each byte (data) and its occurrence count (weight)
     step 2: build the Huffman tree from the nodes
     step 3: generate the Huffman code table from the Huffman tree
     step 4: use the code table to turn the original byte array into the Huffman-encoded byte array
    */
    val huffmanCodeBytes = enCode(contentBytes)
    println("encoded size: " + huffmanCodeBytes.length + "B")
    print("encoded: ")
    for (b <- huffmanCodeBytes) print(s"$b ")
    println("\nencoded content: " + new String(huffmanCodeBytes))
    println()

    /* Decoding: take the same Huffman code table and reverse the process */
    val originBytes = deCode(huffmanCodeBytes, huffmanCodeTab)
    println("decoded size: " + originBytes.length + "B")
    print("decoded: ")
    for (b <- originBytes) print(s"$b ")
    println("\ndecoded content: " + new String(originBytes))

    // compression & decompression
    /*
    zipFile(sourcePath, targetPath)
    unZipFile(sourcePath, targetPath)
    */
  }

  /**
   * Decompression of data
   *
   * @param srcPath file input path
   * @param dstPath file output path
   */
  def unZipFile(srcPath: String, dstPath: String): Unit = {
    var is: InputStream = null
    var ois: ObjectInputStream = null
    var os: OutputStream = null
    try {
      // read file
      is = new FileInputStream(srcPath)
      ois = new ObjectInputStream(is)
      val huffmanCodeBytes = ois.readObject.asInstanceOf[Array[Byte]]
      val huffmanCodeTab = ois.readObject.asInstanceOf[mutable.Map[Byte, String]]
      // decode
      val bytes = deCode(huffmanCodeBytes, huffmanCodeTab)
      // write file
      os = new FileOutputStream(dstPath)
      os.write(bytes)
    } catch {
      case ex: Exception => println(ex.getMessage)
    } finally {
      try {
        // a stream is still null if opening it failed
        if (ois != null) ois.close()
        if (is != null) is.close()
        if (os != null) os.close()
      } catch {
        case ex: Exception => println(ex.getMessage)
      } finally {
        println("decompression finished")
      }
    }
  }

  /**
   * Compression of data
   *
   * @param srcPath file input path
   * @param dstPath file output path
   */
  def zipFile(srcPath: String, dstPath: String): Unit = {
    var originBytes = 0
    var huffmanBytes = 0
    var is: InputStream = null
    var os: OutputStream = null
    var oos: ObjectOutputStream = null
    try {
      // read file
      is = new FileInputStream(srcPath)
      val b: Array[Byte] = new Array[Byte](is.available)
      is.read(b)
      originBytes = b.length
      // encode
      val huffmanCodeBytes = enCode(b)
      huffmanBytes = huffmanCodeBytes.length
      // write file
      os = new FileOutputStream(dstPath)
      oos = new ObjectOutputStream(os)
      oos.writeObject(huffmanCodeBytes)
      // the Huffman code table must be written too, otherwise the data cannot be decoded
      oos.writeObject(huffmanCodeTab)
    } catch {
      case ex: Exception => println(ex.getMessage)
    } finally {
      try {
        // close the object stream first so its buffered data is flushed before the file stream closes
        if (oos != null) oos.close()
        if (os != null) os.close()
        if (is != null) is.close()
      } catch {
        case ex: Exception => println(ex.getMessage)
      } finally {
        println("compression finished: " + originBytes + "B -> " + huffmanBytes + "B")
      }
    }
  }

  /**
   * Get origin byte array
   *
   * @param bytes          Huffman code byte array
   * @param huffmanCodeTab Huffman code table
   * @return array of Byte
   */
  def deCode(bytes: Array[Byte], huffmanCodeTab: mutable.Map[Byte, String]): Array[Byte] = {
    val sb = bitToString(bytes)
    // invert the table: code -> byte
    val map = mutable.Map[String, Byte]()
    for ((k, v) <- huffmanCodeTab) map.put(v, k)
    val list = ListBuffer[Byte]()
    val checkStr = new StringBuilder
    for (c <- sb) {
      checkStr += c
      // emit a byte as soon as the collected bits match a code
      map.get(checkStr.toString).foreach { b =>
        list += b
        checkStr.clear()
      }
    }
    list.toArray
  }

  /**
   * Get Huffman code string
   *
   * @param bytes Huffman code array
   * @return Huffman code string
   */
  def bitToString(bytes: Array[Byte]): StringBuilder = {
    val sb = new StringBuilder
    for (i <- 0 until bytes.length - 2) {
      // OR with 256 so the binary string has at least 9 digits, then keep the low 8
      val str = (bytes(i) | 256).toBinaryString
      sb ++= str.substring(str.length - 8)
    }
    // the last data byte may hold fewer than 8 bits; the flag byte stores its leading-zero count
    val lastByte = bytes(bytes.length - 2) & 0xFF
    val flagByte = bytes(bytes.length - 1)
    for (_ <- 1 to flagByte) sb += '0'
    if (lastByte != 0) sb ++= lastByte.toBinaryString
    sb
  }

  /**
   * Convert array to huffman code array
   *
   * @param bytes array to be processed
   * @return a huffman code array
   */
  def enCode(bytes: Array[Byte]): Array[Byte] = {
    getHuffmanCodeTab(createHuffmanTree(getNodes(bytes)))
    enCode(bytes, huffmanCodeTab)
  }

  /**
   * Convert array to huffman code array
   *
   * @param bytes          array to be processed
   * @param huffmanCodeTab huffman code table
   * @return a huffman code array
   */
  def enCode(bytes: Array[Byte], huffmanCodeTab: mutable.Map[Byte, String]): Array[Byte] = {
    val sb = new StringBuilder
    for (b <- bytes) sb ++= huffmanCodeTab(b)
    val len = (sb.length + 7) / 8
    // one extra byte at the end holds a flag so that no bits are lost
    val huffmanCodeBytes = new Array[Byte](len + 1)
    var index = 0
    var lastByte: Byte = 0
    for (i <- 0 until sb.length by 8) {
      val str = if (i + 8 > sb.length) sb.substring(i) else sb.substring(i, i + 8)
      // parseInt drops leading zeros, so record how many the final chunk starts with
      if (i + 8 >= sb.length) lastByte = str.takeWhile(_ == '0').length.toByte
      huffmanCodeBytes(index) = Integer.parseInt(str, 2).toByte
      index += 1
    }
    huffmanCodeBytes(index) = lastByte
    huffmanCodeBytes
  }

  // Huffman code table, for example 32->"01", 97->"100" ...
  val huffmanCodeTab = mutable.Map[Byte, String]()
  // save leaf node path
  val stringBuilder = new StringBuilder

  /**
   * Get a huffman code table
   *
   * @param root the root node
   * @return a map of huffman code table
   */
  def getHuffmanCodeTab(root: Node): mutable.Map[Byte, String] = {
    if (root.weight == 0) return null
    // drop codes left over from a previous run
    huffmanCodeTab.clear()
    // a tree with a single node: give the only byte the code "0"
    if (root.left == null && root.right == null) {
      huffmanCodeTab.put(root.data, "0")
    }
    markNodePath(root.left, '0', stringBuilder)
    markNodePath(root.right, '1', stringBuilder)
    huffmanCodeTab
  }

  /**
   * Mark the leaf node path
   *
   * @param node the node of huffman tree
   * @param code the leaf node path code
   * @param sb   save leaf node path
   */
  def markNodePath(node: Node, code: Char, sb: StringBuilder): Unit = {
    // copy the path so sibling branches do not share state
    val sbTemp = new StringBuilder(sb.toString)
    sbTemp += code
    if (node != null) {
      if (node.data == 0 && (node.left != null || node.right != null)) {
        // Parent node
        markNodePath(node.left, '0', sbTemp)
        markNodePath(node.right, '1', sbTemp)
      } else {
        // Leaf node
        huffmanCodeTab.put(node.data, sbTemp.toString)
      }
    }
  }

  /**
   * Create Huffman tree
   *
   * @param nodes a list of nodes
   * @return parent node of Huffman tree
   */
  def createHuffmanTree(nodes: ListBuffer[Node]): Node = {
    while (nodes.size > 1) {
      // watch out: unlike sorting in place in Java, `sorted` returns a new collection and leaves `nodes` unchanged
      val sortedNode = nodes.sorted
      val leftNode = sortedNode(0)
      val rightNode = sortedNode(1)
      val parent = new Node(0, leftNode.weight + rightNode.weight, leftNode, rightNode)
      nodes -= leftNode
      nodes -= rightNode
      nodes += parent
    }
    nodes(0)
  }

  /**
   * Get a list of node
   *
   * @param bytes a byte array
   * @return a list of node
   */
  def getNodes(bytes: Array[Byte]): ListBuffer[Node] = {
    val nodes = ListBuffer[Node]()
    val counts = mutable.Map[Byte, Int]()
    // count how many times each byte occurs
    for (b <- bytes) {
      counts.put(b, counts.getOrElse(b, 0) + 1)
    }
    for ((k, v) <- counts) {
      nodes += new Node(k, v, null, null)
    }
    nodes
  }

  /**
   * The node of binary tree
   * data: character storage, for example 'a' => 97, 'b' => 98
   * weight: the weight of Huffman tree
   * left: the left node
   * right: the right node
   */
  class Node(var data: Byte, var weight: Int, var left: Node, var right: Node) extends Comparable[Node] {

    def preOrder(): Unit = {
      print("[data:" + this.data + ", weight:" + this.weight + "] ")
      if (this.left != null) this.left.preOrder()
      if (this.right != null) this.right.preOrder()
    }

    // Integer.compare avoids the overflow that a plain subtraction could cause
    override def compareTo(o: Node): Int = Integer.compare(this.weight, o.weight)
  }
}


