Txilm Protocol: Monoxide公链系统中的区块压缩算法

今天介绍一下将在Monoxide公链系统中采用的区块压缩算法。从比特币的BIP152出发的进一步改进算法。使用Txilm 算法将使的区块传输带宽需求降低到原来的1/80，同时使得冲突概率控制在1/1000左右，消解冲突的计算量可以忽略不计。这个算法不依赖各个全节点之间的mempool是高度一致的，无需额外的协议保持mempool的同步。

Txilm Protocol: Lossy Block Compression with Salted Short Hashing

Background

Compact blocks carries only TXIDs by assuming most TXes in a newly created block are already stored in the mempool for most full nodes. This is proposed in BIP152. More details can be found inbitcoincore.org/en/2016

Compact blocks basically replace each TX (300–400 bytes roughly) with a 32-bytes TXID (e.g. SHA256 of the TX). This yields a nearly 10 times bandwidth saving.

We aim to further reduce the size of each TX representation in a compact block to around 40 bits. We can achieve 6.4x compression over the original proposal of Compact blocks (6.4 = 32 bytes / 40 bits). The new proposal is simply without using additional data structures like bloom filter or IBLT. Also, the new proposal doesn’t rely on consistent mempool across full nodes.

Combined with Canonical transaction ordering rule (CTOR), the short hashing can be further reduced to 32 bits, which yields 8x compression over the original proposal of Compact blocks. A totally 80 times data size reduction is realized.

Rationale

We present each TX in a block by a small hash value based on TXID:

TXID-HASH = h(TXID)

In which h is a small hash function that generates 32-bit to 64-bit hash values. It can be just CRC32, CRC40 or CRC64. The proposed new compact block scheme includes only a list of TXID_HASH, ordered as the original list of TX.

Ambiguity may occurs with such a k-bit small-sized hash, which needs to be resolved by each full node. Once receiving a new block that includes the TXID-HASH list from the sender, the receiver searches each received TXID-HASH in the hash list produced by its mempool. For each TXID_HASH, three cases may happen as follows:

Not found: There is no TX in the mempool that matches the received TXID_HASH. The receiver will ask the sender or other peers for the TXID.
A single match is found: the TXID is resolved.
Multiple matches are found: the receiver will collect all matched TXIDs as candidates for a 2nd-stage resolving.

In the 2nd-stage, all combinations of candidates of multiple TXID-HASH are iterated for recomputing the Merkle tree. A correct combination will result in a matched Merkle root with the one carried by the block header.

If any of the combinations in (3) or the resolved TXID list in (2) can not match the Merkle root in the block header, the receiver will fall back to ask the sender to transfer the complete TXID list of the block, which is described in the original Compact blocks proposal. The cause of this situation may happen is that at least one TXID in the receiver mempool has the same TXID-HASH in the received TXID-HASH list, while this is not the one included in the block.

An optional optimization to the 2nd-stage search is to add a lightweight pre-check before actually recomputing the Merkle root. We propose a lightweight Merkle tree by replacing SHA256 with CRC32, the CRC32-Merkle tree, with a 4-byte root. When creating a new block, the 4-byte CRC32-Merkle root will prepended to the encoded TXID-HASH data. This yields 40x acceleration in searching the right combination. (SHA256 over 16 bytes v.s. CRC32 over 8 bytes)

Resolving ambiguity may incur additional latency. Iterating the combination of many ambiguous TXID_HASH may also consume a considerable amount of CPU time. The feasibility of this proposal highly depends on the probability of hash collision, which is related to the length of the hash value (the k-bit) and also the size of mempool.

Probability of a Single Collision

A single collision is defined as any case of the 1 or 3 occurred at least once. Such collision can be within the TXID-HASH list received, or ones between the received list and the mempool.

Given a TXID-HASH with k bits, the collision rate is 1/2^k. So the probability of a single collision occurred in a new block can be formulated as Generalized Birthday problem. Say, we have total mTX in the mempool in average, and the new block carries n TX. The probability of a single collision can be approximated as:

PSC = 1 - (1 - 1/2^k)^( m*n + n*n/2 ).

For example, we have m = 60000, n = 10000:

k = 32, PSC = 0.14
k = 40, PSC = 0.00059
k = 48, PSC = 0.0000023

We recommends k = 40 as a reasonable value with good compression ratio and pretty low collision rate. A sufficient k is roughly proportional to log( m*n + n*n/2 ).

For example, we enlarge to 100x: m = 6000000, n = 1000000:

k = 48, PSC = 0.023
k = 56, PSC = 0.000090
k = 64, PSC = 0.00000035

Or, we reduce to something much smaller: m = 3000, n = 200:

k = 24, PSC = 0.036
k = 32, PSC = 0.00014
k = 40, PSC = 0.00000056

Combined with CTOR

Canonical transaction ordering rule (CTOR) is an ordering scheme to sort transactions in a block and in mempools according to the hash of transaction, which is already deployed in BCH network. Based on CTOR, the proposed scheme will achieve much lower collision probability and/or much higher compression ratio.

Since transactions are ordered in both blocks and mempools, the candidate space of any ambiguous TXID-HASH will be narrowed to a range bound by its previous and next TXIDwith resolved TXID-HASH, instead of the entire mempool. Assuming the newly confirmed TXID are evenly distributed in the sorted mempool, the size of the potential collision space will be reduced from m to m/n . The collision within block only occurs if the ambiguous TXID-HASHare adjacent after ordering. This dramatically reduces the collision probability and the cost of resolving ambiguity, which allows even shorter hash with higher compression ratio.

With transaction ordering by CTOR, the collision probability can be approximated as:

PSC = 1 - (1 - 1/2^k)^( m + n/2 )

For small blocks: m = 3000, n = 200:

k = 16, PSC = 0.046
k = 24, PSC = 0.00018
k = 32, PSC = 0.00000072

For medium blocks: m = 60000, n = 10000:

k = 16, PSC = 0.63
k = 24, PSC = 0.0039
k = 32, PSC = 0.000015

For large blocks: m = 6000000, n = 1000000:

k = 24, PSC = 0.32
k = 32, PSC = 0.0015
k = 40, PSC = 0.0000059

Collision Attack

It is easy to construct a new transaction with its TXID-HASH matches with that of an existing transaction. Massively creation of such malicious transactions for collision may invalidate the collision probability analysis stated above and makes verification of new blocks costly, which eventually results in higher fork rate. We propose two simple strategies to address the attack model.

Short Hashing with Salt A simple strategy for defense is to introduce a salt when calculate the TXID-HASH:

TXID-HASH = h( Salt + TXID )

The salt should be specific to the block carrying those TXID-HASHes and included in the encoded data. For example, just take the CRC32-Merkle as the Salt, or introduce another 4-byte field with random bits.

By introducing Salt: The attacker cannot construct malicious transactions even that existing transactions are known to all, until a new block carrying them is out. Malicious transactions are unlikely reach full nodes earlier than the new block is received and verified. It requires the attacker can flood malicious transactions to the entire network very fast. Theoretically, it is possible (e.g. by controlling a large botnet) especially the underlying P2P network is small. The strategy makes malicious transaction specific to a single block, thus malicious transactions spreaded previously will not be able to attack future block creation and makes attack much inefficient economically.

Fall back when Under Attack Introducing Salt dramatically raised the cost of collision attack, while in the extreme case, massive collision attack is still possible, regardless of the cost. We require miners fall back to TXID list when the entire network is under attack. It is also incentivized because creating orphan blocks is wasting hash rate.

Such attack can be observed by all full nodes including miners when verifying incoming new blocks. It can be done by simply counting the number of ambiguous TXID-HASH per-block. If the counts significantly higher the expected value and forks are observed. The next block should fall back to TXID list.

P.S. 你是高手看到这里了，本专栏诚邀中文技术翻译的小伙伴 ~~ 有兴趣的联系我哦 !
P.S II. 如果有错误，请不吝指正 ~~

Reference

Compact Blocks FAQ

https://bitcoincore.org/en/2016/06/07/compact-blocks-faq/

BIP152: Compact Blocks

https://github.com/bitcoin/bips/blob/master/bip-0152.mediawiki

Canonical Transaction Ordering for Bitcoin

https://blog.vermorel.com/pdf/canonical-tx-ordering-2018-06-12.pdf

Wiki: Birthday problem

https://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem

Big Number Calculation

https://www.ttmath.org/online_calculator