Sorting 1 million 8-digit numbers in 1 MB of RAM

I have a computer with 1 MB of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the sorted list out over another TCP connection.

The list of numbers may contain duplicates, which I must not discard. The code will be placed in ROM, so I need not subtract the size of my code from the 1 MB. I already have code to drive the Ethernet port and handle TCP/IP connections, and it requires 2 KB for its state data, including a 1 KB buffer via which the code will read and write data. Is there a solution to this problem?

Sources of question and answer:

slashdot.org

cleaton.net


Reply #1

Reference: https://stackoom.com/question/rUOs/在-MB-RAM中排序-万个-位数字


Reply #2

A solution is possible only because of the difference between 1 megabyte and 1 million bytes. There are about 2 to the power 8093729.5 different ways to choose 1 million 8-digit numbers with duplicates allowed and order unimportant, so a machine with only 1 million bytes (8000000 bits) of RAM doesn't have enough states to represent all the possibilities. But 1M (less 2k for TCP/IP) is 1022*1024*8 = 8372224 bits, so a solution is possible.
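The exponent quoted above can be checked numerically. The following is a minimal sketch, assuming the count in question is the number of multisets of size 1,000,000 drawn from 100,000,000 values, that is C(k+n-1, n); the class and method names are made up for illustration.

```java
public class SortedListEntropy {

    /** log2 of the number of multisets of size n over k distinct values: log2 C(k+n-1, n). */
    public static double log2Multisets(long k, long n) {
        double sum = 0.0;
        for (long i = 1; i <= n; i++) {
            // C(k+n-1, n) is the product over i = 1..n of (k-1+i)/i
            sum += Math.log((double) (k - 1 + i) / i);
        }
        return sum / Math.log(2.0);
    }

    public static void main(String[] args) {
        double needed = log2Multisets(100_000_000L, 1_000_000L);
        long millionBytes = 1_000_000L * 8;   // bits in 1 million bytes
        long available = 1022L * 1024 * 8;    // bits in 1 MB minus 2 KB
        System.out.printf("bits needed: %.1f%n", needed);
        System.out.println("fits in 1,000,000 bytes: " + (needed <= millionBytes));
        System.out.println("fits in 1 MB minus 2 KB: " + (needed <= available));
    }
}
```

The sum of logarithms avoids ever forming the huge binomial coefficient itself; the result only needs to be compared against the two bit budgets.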

Part 1, initial solution

This approach needs a little more than 1M; I'll refine it to fit into 1M later.

I'll store a compact sorted list of numbers in the range 0 to 99999999 as a sequence of sublists of 7-bit numbers. The first sublist holds numbers from 0 to 127, the second sublist holds numbers from 128 to 255, etc. 100000000/128 is exactly 781250, so 781250 such sublists will be needed.

Each sublist consists of a 2-bit sublist header followed by a sublist body. The sublist body takes up 7 bits per sublist entry. The sublists are all concatenated together, and the format makes it possible to tell where one sublist ends and the next begins. The total storage required for a fully populated list is 2*781250 + 7*1000000 = 8562500 bits, which is about 1.021 M-bytes.

The 4 possible sublist header values are:

00 Empty sublist, nothing follows.

01 Singleton, there is only one entry in the sublist and the next 7 bits hold it.

10 The sublist holds at least 2 distinct numbers. The entries are stored in non-decreasing order, except that the last entry is less than or equal to the first. This allows the end of the sublist to be identified. For example, the numbers 2,4,6 would be stored as (4,6,2). The numbers 2,2,3,4,4 would be stored as (2,3,4,4,2).

11 The sublist holds 2 or more repetitions of a single number. The next 7 bits give the number. Then come zero or more 7-bit entries with the value 1, followed by a 7-bit entry with the value 0. The length of the sublist body dictates the number of repetitions. For example, the numbers 12,12 would be stored as (12,0), the numbers 12,12,12 would be stored as (12,1,0), 12,12,12,12 would be (12,1,1,0) and so on.
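The four header types can be illustrated with a small sketch that works on int arrays rather than on a packed bit stream. It assumes one reading of the "10" rule that is consistent with both examples above: the smallest entry is moved to the end, so the decoder stops at the first entry that is strictly smaller than its predecessor. The class name `SublistCodec` and the array layout (header in element 0, 7-bit entries after it) are illustrative choices, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;

public class SublistCodec {

    /** Encodes one sorted sublist of 7-bit values as {header, body...}. */
    public static int[] encode(int[] sorted) {
        List<Integer> out = new ArrayList<>();
        int n = sorted.length;
        if (n == 0) {
            out.add(0b00);                                   // empty sublist
        } else if (n == 1) {
            out.add(0b01);                                   // singleton
            out.add(sorted[0]);
        } else if (sorted[0] == sorted[n - 1]) {
            out.add(0b11);                                   // repetitions of one number
            out.add(sorted[0]);
            for (int i = 2; i < n; i++) out.add(1);          // one "1" per extra repetition
            out.add(0);                                      // terminator
        } else {
            out.add(0b10);                                   // at least 2 distinct numbers
            for (int i = 1; i < n; i++) out.add(sorted[i]);
            out.add(sorted[0]);                              // smallest entry goes last
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Decodes {header, body...} back into the sorted sublist. */
    public static int[] decode(int[] encoded) {
        List<Integer> out = new ArrayList<>();
        switch (encoded[0]) {
            case 0b00:
                break;
            case 0b01:
                out.add(encoded[1]);
                break;
            case 0b11: {
                int value = encoded[1];
                out.add(value);
                out.add(value);
                for (int i = 2; encoded[i] == 1; i++) out.add(value);
                break;
            }
            default: {
                int i = 2;
                while (encoded[i] >= encoded[i - 1]) i++;    // first strict descent ends the body
                out.add(encoded[i]);                         // the smallest entry was stored last
                for (int j = 1; j < i; j++) out.add(encoded[j]);
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

For instance, `SublistCodec.encode(new int[]{2, 4, 6})` yields the header 2 (binary 10) followed by 4, 6, 2, matching the example in the text.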

I start off with an empty list, read a bunch of numbers in and store them as 32-bit integers, sort the new numbers in place (using heapsort, probably) and then merge them into a new compact sorted list. Repeat until there are no more numbers to read, then walk the compact list once more to generate the output.

The line below represents memory just before the start of the list merge operation. The "O"s are the region that holds the sorted 32-bit integers. The "X"s are the region that holds the old compact list. The "=" signs are the expansion room for the compact list, 7 bits for each integer in the "O"s. The "Z"s are other random overhead.

ZZZOOOOOOOOOOOOOOOOOOOOOOOOOO==========XXXXXXXXXXXXXXXXXXXXXXXXXX

The merge routine starts reading at the leftmost "O" and at the leftmost "X", and starts writing at the leftmost "=". The write pointer doesn't catch the compact list read pointer until all of the new integers are merged, because both pointers advance 2 bits for each sublist and 7 bits for each entry in the old compact list, and there is enough extra room for the 7-bit entries for the new numbers.

Part 2, cramming it into 1M

To squeeze the solution above into 1M, I need to make the compact list format a bit more compact. I'll get rid of one of the sublist types, so that there will be just 3 different possible sublist header values. Then I can use "00", "01" and "1" as the sublist header values and save a few bits. The sublist types are:

A Empty sublist, nothing follows.

B Singleton, there is only one entry in the sublist and the next 7 bits hold it.

C The sublist holds at least 2 distinct numbers. The entries are stored in non-decreasing order, except that the last entry is less than or equal to the first. This allows the end of the sublist to be identified. For example, the numbers 2,4,6 would be stored as (4,6,2). The numbers 2,2,3,4,4 would be stored as (2,3,4,4,2).

D The sublist consists of 2 or more repetitions of a single number.

My 3 sublist header values will be "A", "B" and "C", so I need a way to represent D-type sublists.

Suppose I have the C-type sublist header followed by 3 entries, such as "C[17][101][58]". This can't be part of a valid C-type sublist as described above, since the third entry is less than the second but more than the first. I can use this type of construct to represent a D-type sublist. In bit terms, anywhere I have "C{00?????}{1??????}{01?????}" is an impossible C-type sublist. I'll use this to represent a sublist consisting of 3 or more repetitions of a single number. The first two 7-bit words encode the number (the "N" bits below) and are followed by zero or more {0100001} words followed by a {0100000} word.

For example, 3 repetitions: "C{00NNNNN}{1NN0000}{0100000}", 4 repetitions: "C{00NNNNN}{1NN0000}{0100001}{0100000}", and so on.
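A short sketch of this encoding for runs of 3 or more repetitions follows. The text does not say which of the seven "N" bits land in which word, so the split used here (upper five bits in the first word, lower two in the second) is an assumption, and `RepeatRunCodec` is a hypothetical name. Words are kept as plain ints instead of being packed into a bit stream.

```java
public class RepeatRunCodec {

    static final int MORE = 0b0100001; // one additional repetition
    static final int END  = 0b0100000; // terminator word

    /** Body words (after the C header) for count >= 3 repetitions of the 7-bit value n. */
    public static int[] encode(int n, int count) {
        if (n < 0 || n > 127 || count < 3) throw new IllegalArgumentException("need 0 <= n <= 127 and count >= 3");
        int[] words = new int[count];               // 2 number words + (count - 3) MORE words + 1 END word
        words[0] = n >> 2;                          // 00NNNNN: upper five bits (assumed layout)
        words[1] = 0b1000000 | ((n & 3) << 4);      // 1NN0000: lower two bits (assumed layout)
        for (int i = 2; i < count - 1; i++) words[i] = MORE;
        words[count - 1] = END;
        return words;
    }

    /** True when the first three words form the "impossible" C-type pattern used for D-type runs. */
    public static boolean isRepeatRun(int[] words) {
        return words.length >= 3
                && words[0] < 32                              // 00?????
                && words[1] >= 64                             // 1??????
                && (words[2] == MORE || words[2] == END);     // 01?????, between the first two
    }

    public static int value(int[] words) {
        return (words[0] << 2) | ((words[1] >> 4) & 3);
    }

    public static int count(int[] words) {
        int c = 3;
        for (int i = 2; words[i] == MORE; i++) c++;
        return c;
    }
}
```

The pattern is safe to reuse because the third word is larger than the first but smaller than the second, which an ordinary C-type body can never contain.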

That just leaves lists that hold exactly 2 repetitions of a single number. I'll represent those with another impossible C-type sublist pattern: "C{0??????}{11?????}{10?????}". There's plenty of room for the 7 bits of the number in the first 2 words, but this pattern is longer than the sublist that it represents, which makes things a bit more complex. The five question marks at the end can be considered not part of the pattern, so I have "C{0NNNNNN}{11N????}10" as my pattern, with the number to be repeated stored in the "N"s. That's 2 bits too long.

I'll have to borrow 2 bits and pay them back from the 4 unused bits in this pattern. When reading, on encountering "C{0NNNNNN}{11N00AB}10", output 2 instances of the number in the "N"s, overwrite the "10" at the end with bits A and B, and rewind the read pointer by 2 bits. Destructive reads are OK for this algorithm, since each compact list gets walked only once.

When writing a sublist of 2 repetitions of a single number, write "C{0NNNNNN}11N00" and set the borrowed bits counter to 2. At every write where the borrowed bits counter is non-zero, it is decremented for each bit written and "10" is written when the counter hits zero. So the next 2 bits written will go into slots A and B, and then the "10" will get dropped onto the end.

With 3 sublist header values represented by "00", "01" and "1", I can assign "1" to the most popular sublist type. I'll need a small table to map sublist header values to sublist types, and I'll need an occurrence counter for each sublist type so that I know what the best sublist header mapping is.

The worst case minimal representation of a fully populated compact list occurs when all the sublist types are equally popular. In that case I save 1 bit for every 3 sublist headers, so the list size is 2*781250 + 7*1000000 - 781250/3 = 8302083.3 bits. Rounding up to a 32-bit word boundary, that's 8302112 bits, or 1037764 bytes.

1M minus the 2k for TCP/IP state and buffers is 1022*1024 = 1046528 bytes, leaving me 8764 bytes to play with.
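The budget arithmetic above is easy to reproduce. The sketch below simply recomputes the figures from the text; the class and method names are illustrative only.

```java
public class MemoryBudget {

    static final long SUBLISTS = 781_250L;   // 100000000 / 128
    static final long ENTRIES  = 1_000_000L;

    /** Worst-case list size in bits: 2-bit headers, 1 bit saved on a third of them, 7 bits per entry. */
    public static long worstCaseListBits() {
        double bits = 2.0 * SUBLISTS + 7.0 * ENTRIES - SUBLISTS / 3.0;
        return (long) Math.ceil(bits);
    }

    /** Rounds a bit count up to the next multiple of 32. */
    public static long roundUpToWord(long bits) {
        return (bits + 31) / 32 * 32;
    }

    /** Bytes left over after the list and the 2 KB of TCP/IP state. */
    public static long spareBytes() {
        long listBytes = roundUpToWord(worstCaseListBits()) / 8;
        long availableBytes = 1022L * 1024;
        return availableBytes - listBytes;
    }

    public static void main(String[] args) {
        System.out.println("list bits (word aligned): " + roundUpToWord(worstCaseListBits()));
        System.out.println("list bytes: " + roundUpToWord(worstCaseListBits()) / 8);
        System.out.println("spare bytes: " + spareBytes());
    }
}
```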

But what about the process of changing the sublist header mapping? In the memory map below, "Z" is random overhead, "=" is free space, "X" is the compact list.

ZZZ=====XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Start reading at the leftmost "X" and start writing at the leftmost "=" and work right. When it's done the compact list will be a little shorter and it will be at the wrong end of memory:

ZZZXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX=======

So then I'll need to shunt it to the right:

ZZZ=======XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

In the header mapping change process, up to 1/3 of the sublist headers will be changing from 1-bit to 2-bit. In the worst case these will all be at the head of the list, so I'll need at least 781250/3 bits of free storage before I start, which takes me back to the memory requirements of the previous version of the compact list :(

To get around that, I'll split the 781250 sublists into 10 sublist groups of 78125 sublists each. Each group has its own independent sublist header mapping. Using the letters A to J for the groups:

ZZZ=====AAAAAABBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ

Each sublist group shrinks or stays the same during a sublist header mapping change:

ZZZ=====AAAAAABBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAA=====BBCCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABB=====CCCCDDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCC======DDDDDEEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDD======EEEFFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEE======FFFGGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFF======GGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGG=======HHIJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHH=======IJJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHI=======JJJJJJJJJJJJJJJJJJJJ
ZZZAAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ=======
ZZZ=======AAAAAABBCCCDDDDDEEEFFFGGGGGGGGGGHHIJJJJJJJJJJJJJJJJJJJJ

The worst case temporary expansion of a sublist group during a mapping change is 78125/3 = 26042 bits, under 4k. If I allow 4k plus the 1037764 bytes for a fully populated compact list, that leaves me 8764 - 4096 = 4668 bytes for the "Z"s in the memory map.

That should be plenty for the 10 sublist header mapping tables, 30 sublist header occurrence counts and the other few counters, pointers and small buffers I'll need, and space I've used without noticing, like stack space for function call return addresses and local variables.

Part 3, how long would it take to run?

With an empty compact list the 1-bit list header will be used for an empty sublist, and the starting size of the list will be 781250 bits. In the worst case the list grows 8 bits for each number added, so 32 + 8 = 40 bits of free space are needed for each of the 32-bit numbers to be placed at the top of the list buffer and then sorted and merged. In the worst case, changing the sublist header mapping results in a space usage of 2*781250 + 7*entries - 781250/3 bits.

With a policy of changing the sublist header mapping after every fifth merge once there are at least 800000 numbers in the list, a worst case run would involve a total of about 30M of compact list reading and writing activity.

Source:

http://nick.cleaton.net/ramsortsol.html


Reply #3

Please see the first correct answer or the later answer with arithmetic encoding. Below you may find some fun, but not a 100% bullet-proof solution.

This is quite an interesting task and here is another solution. I hope somebody will find the result useful (or at least interesting).

Stage 1: Initial data structure, rough compression approach, basic results

Let's do some simple math: we have 1M (1048576 bytes) of RAM initially available to store 10^6 8-digit decimal numbers in the range [0; 99999999]. So 27 bits are needed to store one number (assuming that unsigned numbers are used). Thus, ~3.5M of RAM would be needed to store the raw stream. Somebody already said it doesn't seem to be feasible, but I would say the task can be solved if the input is "good enough". Basically, the idea is to compress the input data with a compression factor of 0.29 or better and do the sorting in a proper manner.

Let's solve the compression issue first. There are some relevant tests already available:

http://www.theeggeadventure.com/wikimedia/index.php/Java_Data_Compression

"I ran a test to compress one million consecutive integers using various forms of compression. The results are as follows:"

None     4000027
Deflate  2006803
Filtered 1391833
BZip2    427067
Lzma     255040

It looks like LZMA (Lempel–Ziv–Markov chain algorithm) is a good choice to continue with. I've prepared a simple PoC, but there are still some details to be highlighted:

  1. Memory is limited, so the idea is to presort numbers and use compressed buckets (dynamic size) as temporary storage
  2. It is easier to achieve a better compression factor with presorted data, so there is a static buffer for each bucket (numbers from the buffer are to be sorted before LZMA)
  3. Each bucket holds a specific range, so the final sort can be done for each bucket separately
  4. The bucket size can be set properly, so there will be enough memory to decompress the stored data and do the final sort for each bucket separately

Please note, the attached code is a PoC. It can't be used as a final solution; it just demonstrates the idea of using several smaller buffers to store presorted numbers in some optimal way (possibly compressed). LZMA is not proposed as a final solution. It is used as the fastest way to introduce compression into this PoC.

See the PoC code below (please note it is just a demo; LZMA-Java is needed to compile it):

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;
// Encoder and Decoder come from the LZMA-Java library mentioned above;
// their import lines depend on that library and are not part of the original listing.

public class MemorySortDemo {

    static final int NUM_COUNT = 1000000;
    static final int NUM_MAX   = 100000000;

    static final int BUCKETS      = 5;
    static final int DICT_SIZE    = 16 * 1024; // LZMA dictionary size
    static final int BUCKET_SIZE  = 1024;
    static final int BUFFER_SIZE  = 10 * 1024;
    static final int BUCKET_RANGE = NUM_MAX / BUCKETS;

    /** Generates the random input numbers. */
    static class Producer {
        private Random random = new Random();

        public int produce() {
            return random.nextInt(NUM_MAX);
        }
    }

    /** Holds one value range: a small presort buffer plus the serialized and compressed data. */
    static class Bucket {
        public int size, pointer;
        public int[] buffer = new int[BUFFER_SIZE];

        public ByteArrayOutputStream tempOut = new ByteArrayOutputStream();
        public DataOutputStream tempDataOut = new DataOutputStream(tempOut);
        public ByteArrayOutputStream compressedOut = new ByteArrayOutputStream();

        // Sorts the buffered numbers and appends them to the uncompressed stream.
        public void submitBuffer() throws IOException {
            Arrays.sort(buffer, 0, pointer);
            for (int j = 0; j < pointer; j++) {
                tempDataOut.writeInt(buffer[j]);
                size++;
            }
            pointer = 0;
        }

        public void write(int value) throws IOException {
            if (isBufferFull()) {
                submitBuffer();
            }
            buffer[pointer++] = value;
        }

        public boolean isBufferFull() {
            return pointer == BUFFER_SIZE;
        }

        // Compresses everything submitted so far; returns the encoder properties needed for decoding.
        public byte[] compressData() throws IOException {
            tempDataOut.close();
            return compress(tempOut.toByteArray());
        }

        private byte[] compress(byte[] input) throws IOException {
            final BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(input));
            final DataOutputStream out = new DataOutputStream(new BufferedOutputStream(compressedOut));

            final Encoder encoder = new Encoder();
            encoder.setEndMarkerMode(true);
            encoder.setNumFastBytes(0x20);
            encoder.setDictionarySize(DICT_SIZE);
            encoder.setMatchFinder(Encoder.EMatchFinderTypeBT4);

            ByteArrayOutputStream encoderProperties = new ByteArrayOutputStream();
            encoder.writeCoderProperties(encoderProperties);
            encoderProperties.flush();
            encoderProperties.close();

            encoder.code(in, out, -1, -1, null);
            out.flush();
            out.close();
            in.close();

            return encoderProperties.toByteArray();
        }

        public int[] decompress(byte[] properties) throws IOException {
            InputStream in = new ByteArrayInputStream(compressedOut.toByteArray());
            ByteArrayOutputStream data = new ByteArrayOutputStream(10 * 1024);
            BufferedOutputStream out = new BufferedOutputStream(data);

            Decoder decoder = new Decoder();
            decoder.setDecoderProperties(properties);
            decoder.code(in, out, 4 * size);

            out.flush();
            out.close();
            in.close();

            DataInputStream input = new DataInputStream(new ByteArrayInputStream(data.toByteArray()));
            int[] array = new int[size];
            for (int k = 0; k < size; k++) {
                array[k] = input.readInt();
            }
            return array;
        }
    }

    static class Sorter {
        private Bucket[] bucket = new Bucket[BUCKETS];

        public void doSort(Producer p, Consumer c) throws IOException {
            for (int i = 0; i < bucket.length; i++) {  // allocate buckets
                bucket[i] = new Bucket();
            }

            for (int i = 0; i < NUM_COUNT; i++) {      // produce some data
                int value = p.produce();
                int bucketId = value / BUCKET_RANGE;
                bucket[bucketId].write(value);
                c.register(value);
            }

            for (int i = 0; i < bucket.length; i++) {  // submit non-empty buffers
                bucket[i].submitBuffer();
            }

            byte[] compressProperties = null;
            for (int i = 0; i < bucket.length; i++) {  // compress the data
                compressProperties = bucket[i].compressData();
            }

            printStatistics();

            for (int i = 0; i < bucket.length; i++) {  // decode & sort buckets one by one
                int[] array = bucket[i].decompress(compressProperties);
                Arrays.sort(array);

                for (int v : array) {
                    c.consume(v);
                }
            }
            c.finalCheck();
        }

        public void printStatistics() {
            int size = 0;
            int sizeCompressed = 0;

            for (int i = 0; i < BUCKETS; i++) {
                int bucketSize = 4 * bucket[i].size;
                size += bucketSize;
                sizeCompressed += bucket[i].compressedOut.size();

                System.out.println("  bucket[" + i
                        + "] contains: " + bucket[i].size
                        + " numbers, compressed size: " + bucket[i].compressedOut.size()
                        + String.format(" compression factor: %.2f",
                                ((double) bucket[i].compressedOut.size()) / bucketSize));
            }

            // The divisor 1014*1024 is kept from the original PoC (1024*1024 was probably intended);
            // the sample output shown below was produced with it.
            System.out.println(String.format("Data size: %.2fM", (double) size / (1014 * 1024))
                    + String.format(" compressed %.2fM", (double) sizeCompressed / (1014 * 1024))
                    + String.format(" compression factor %.2f", (double) sizeCompressed / size));
        }
    }

    /** Checks that values arrive in non-decreasing order and that every registered value is consumed. */
    static class Consumer {
        private Set<Integer> values = new HashSet<>();

        int v = -1;

        public void consume(int value) {
            if (v < 0) v = value;

            if (v > value) {
                throw new IllegalArgumentException("Current value is greater than previous: " + v + " > " + value);
            } else {
                v = value;
                values.remove(value);
            }
        }

        public void register(int value) {
            values.add(value);
        }

        public void finalCheck() {
            System.out.println(values.size() > 0 ? "NOT OK: " + values.size() : "OK!");
        }
    }

    public static void main(String[] args) throws IOException {
        Producer p = new Producer();
        Consumer c = new Consumer();
        Sorter sorter = new Sorter();
        sorter.doSort(p, c);
    }
}

With random numbers it produces the following:

bucket[0] contains: 200357 numbers, compressed size: 353679 compression factor: 0.44
bucket[1] contains: 199465 numbers, compressed size: 352127 compression factor: 0.44
bucket[2] contains: 199682 numbers, compressed size: 352464 compression factor: 0.44
bucket[3] contains: 199949 numbers, compressed size: 352947 compression factor: 0.44
bucket[4] contains: 200547 numbers, compressed size: 353914 compression factor: 0.44
Data size: 3.85M compressed 1.70M compression factor 0.44

For a simple ascending sequence (one bucket is used) it produces:

bucket[0] contains: 1000000 numbers, compressed size: 256700 compression factor: 0.06
Data size: 3.85M compressed 0.25M compression factor 0.06

EDIT

Conclusion:

  1. Don't try to fool Nature
  2. Use simpler compression with a lower memory footprint
  3. Some additional clues are really needed. A general bullet-proof solution does not seem to be feasible.

Stage 2: Enhanced compression, final conclusion

As was already mentioned in the previous section, any suitable compression technique can be used. So let's get rid of LZMA in favor of a simpler and better (if possible) approach. There are a lot of good solutions including arithmetic coding, radix trees etc.

Anyway, a simple but useful encoding scheme will be more illustrative than yet another external library providing some nifty algorithm. The actual solution is pretty straightforward: since there are buckets with partially sorted data, deltas can be used instead of numbers.

The random input test shows slightly better results:

bucket[0] contains: 10103 numbers, compressed size: 13683 compression factor: 0.34
bucket[1] contains: 9885 numbers, compressed size: 13479 compression factor: 0.34
...
bucket[98] contains: 10026 numbers, compressed size: 13612 compression factor: 0.34
bucket[99] contains: 10058 numbers, compressed size: 13701 compression factor: 0.34
Data size: 3.85M compressed 1.31M compression factor 0.34

Sample code

  // BinaryOut, BinaryIn and getBinarySize are part of the full code referenced below.
  public static void encode(int[] buffer, int length, BinaryOut output) {
      short size = (short) (length & 0x7FFF);

      output.write(size);
      output.write(buffer[0]);                   // first number is stored as-is

      for (int i = 1; i < size; i++) {
          int next = buffer[i] - buffer[i - 1];  // delta to the previous number
          int bits = getBinarySize(next);

          int len = bits;

          // 2-bit flag: which 8-bit band the width of the delta falls into
          if (bits > 24) {
              output.write(3, 2);
              len = bits - 24;
          } else if (bits > 16) {
              output.write(2, 2);
              len = bits - 16;
          } else if (bits > 8) {
              output.write(1, 2);
              len = bits - 8;
          } else {
              output.write(0, 2);
          }

          // 2-bit length within the band; odd widths get one padding bit.
          // Note: nothing more is written when len == 0, while decode always reads a length
          // and a payload, so getBinarySize must return at least 1 for the streams to match.
          if (len > 0) {
              if ((len % 2) > 0) {
                  len = len / 2;
                  output.write(len, 2);
                  output.write(false);
              } else {
                  len = len / 2 - 1;
                  output.write(len, 2);
              }

              output.write(next, bits);
          }
      }
  }

  public static short decode(BinaryIn input, int[] buffer, int offset) {
      short length = input.readShort();
      int value = input.readInt();
      buffer[offset] = value;

      for (int i = 1; i < length; i++) {
          int flag = input.readInt(2);

          int bits;
          int next = 0;
          switch (flag) {
              case 0:
                  bits = 2 * input.readInt(2) + 2;
                  next = input.readInt(bits);
                  break;
              case 1:
                  bits = 8 + 2 * input.readInt(2) + 2;
                  next = input.readInt(bits);
                  break;
              case 2:
                  bits = 16 + 2 * input.readInt(2) + 2;
                  next = input.readInt(bits);
                  break;
              case 3:
                  bits = 24 + 2 * input.readInt(2) + 2;
                  next = input.readInt(bits);
                  break;
          }

          buffer[offset + i] = buffer[offset + i - 1] + next;  // rebuild the number from the delta
      }

      return length;
  }
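To get a feel for how many bits this delta scheme spends, the cost per delta can be estimated without any bit-stream classes. The sketch below assumes that getBinarySize returns the number of significant bits of the delta (at least 1), which the sample code does not show, and that the leading count and first value take 16 and 32 bits; `DeltaCost` and its methods are hypothetical helpers, not part of the original code.

```java
public class DeltaCost {

    /** Assumed behaviour of getBinarySize: number of significant bits, never less than 1. */
    public static int binarySize(int delta) {
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(delta));
    }

    /** Bits emitted for one delta: 2-bit flag, 2-bit length, payload padded to an even width. */
    public static int encodedBits(int delta) {
        int bits = binarySize(delta);
        int padded = bits + (bits % 2);
        return 2 + 2 + padded;
    }

    /** Total bits for a sorted buffer: 16-bit count, 32-bit first value, then one code per delta. */
    public static long totalBits(int[] sorted) {
        if (sorted.length == 0) return 16;
        long total = 16 + 32;
        for (int i = 1; i < sorted.length; i++) {
            total += encodedBits(sorted[i] - sorted[i - 1]);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("delta 100 costs " + encodedBits(100) + " bits");
        System.out.println("delta 300 costs " + encodedBits(300) + " bits");
    }
}
```

Under these assumptions a delta of about 100, the average gap when 10,000 numbers fall into a range of 1,000,000, costs 12 bits, which is in line with the roughly 0.34 compression factor reported above for 32-bit inputs.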

Please note, this approach:

  1. does not consume a lot of memory
  2. works with streams
  3. provides not so bad results

Full code can be found here; BinaryInput and BinaryOutput implementations can be found here.

Final conclusion

No final conclusion :) Sometimes it is a really good idea to move one level up and review the task from a meta-level point of view.

It was fun to spend some time with this task. BTW, there are a lot of interesting answers below. Thank you for your attention and happy coding.


Reply #4

If the input stream could be received a few times this would be much easier (the question says nothing about that, and it would raise a time-performance problem). Then we could count the decimal values. With counted values it would be easy to make the output stream. Compress by counting the values. It depends on what would be in the input stream.
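The counting idea can be sketched as a multi-pass counting sort: each pass replays the whole input and counts only the values that fall into one window of the value range, then emits that window in order. This is a sketch under the strong assumption that the input can be replayed, which the original problem does not promise; `MultiPassCountingSort`, the `Supplier<IntStream>` input and the window parameter are all illustrative.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;
import java.util.function.Supplier;
import java.util.stream.IntStream;

public class MultiPassCountingSort {

    /**
     * Emits all input values in non-decreasing order.
     *
     * @param replayableInput supplies a fresh pass over the same input each time it is called
     * @param maxValue        exclusive upper bound of the values (100000000 for 8-digit numbers)
     * @param window          how many distinct values are counted per pass
     * @param output          receives the sorted values, duplicates included
     */
    public static void sort(Supplier<IntStream> replayableInput, int maxValue, int window, IntConsumer output) {
        int[] counts = new int[window];
        for (long lo = 0; lo < maxValue; lo += window) {
            final long low = lo;
            final long high = Math.min(lo + window, (long) maxValue);
            Arrays.fill(counts, 0);
            // One full pass over the input, counting only values inside [low, high)
            replayableInput.get().forEach(v -> {
                if (v >= low && v < high) counts[(int) (v - low)]++;
            });
            for (int i = 0; i < high - low; i++) {
                for (int c = 0; c < counts[i]; c++) output.accept((int) low + i);
            }
        }
    }
}
```

With 32-bit counters, a window of roughly 250,000 values would fit in the available memory, which means about 400 passes over the input for the full 8-digit range, hence the time-performance concern.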


Reply #5

A radix tree representation would come close to handling this problem, since the radix tree takes advantage of "prefix compression". But it's hard to conceive of a radix tree representation that could represent a single node in one byte; two is probably about the limit.

But, regardless of how the data is represented, once it is sorted it can be stored in prefix-compressed form, where the numbers 10, 11, and 12 would be represented by, say, 001b, 001b, 001b, indicating an increment of 1 from the previous number. Perhaps, then, 10101b would represent an increment of 5, 1101001b an increment of 9, etc.


Reply #6

(My original answer was wrong, sorry for the bad math, see below the break.)

How about this?

The first 27 bits store the lowest number you have seen, then the difference to the next number seen, encoded as follows: 5 bits to store the number of bits used in storing the difference, then the difference. Use 00000 to indicate that you saw that number again.

This works because as more numbers are inserted, the average difference between numbers goes down, so you use fewer bits to store the difference as you add more numbers. I believe this is called a delta list.
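The size of such a delta list is easy to compute for a given sorted input. The following sketch counts bits only and does not build the bit stream; it assumes the width of a difference is its number of significant bits, and `DeltaListSize` is a made-up name.

```java
public class DeltaListSize {

    /** Bits used by the scheme: 27 for the lowest number, then 5 + width(diff) per further number. */
    public static long encodedBits(int[] sorted) {
        if (sorted.length == 0) return 0;
        long bits = 27;
        for (int i = 1; i < sorted.length; i++) {
            int diff = sorted[i] - sorted[i - 1];
            int width = 32 - Integer.numberOfLeadingZeros(diff); // 0 for a repeat, so only "00000" is stored
            bits += 5 + width;
        }
        return bits;
    }

    public static void main(String[] args) {
        int[] evenlySpaced = new int[1_000_000];
        for (int i = 0; i < evenlySpaced.length; i++) evenlySpaced[i] = i * 100;
        long bits = encodedBits(evenlySpaced);
        System.out.println(bits + " bits = " + (bits + 7) / 8 + " bytes");
    }
}
```

For one million numbers spaced 100 apart, every difference needs 7 bits, so each of the 999,999 differences costs 12 bits on top of the initial 27.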

The worst case I can think of is all numbers evenly spaced (by 100), e.g. assuming 0 is the first number:

000000000000000000000000000 00111 1100100
                            ^^^^^^^^^^^^^
                            a million times

27 + 1,000,000 * (5+7) bits = 12,000,027 bits = ~1465k

Reddit to the rescue!

If all you had to do was sort them, this problem would be easy. It takes 122k (1 million bits) to store which numbers you have seen (0th bit on if 0 was seen, 2300th bit on if 2300 was seen, etc.).

You read the numbers, store them in the bit field, and then shift the bits out while keeping a count.

BUT, you have to remember how many you have seen. I was inspired by the sublist answer above to come up with this scheme:

Instead of using one bit, use either 2 or 27 bits:

  • 00 means you did not see the number.
  • 01 means you saw it once.
  • 1 means you saw it, and the next 26 bits are the count of how many times.

I think this works: if there are no duplicates, you have a 244k list. In the worst case you see each number twice (if you see one number three times, it shortens the rest of the list for you), that means you have seen 50,000 more than once, and you have seen 950,000 items 0 or 1 times.

50,000 * 27 + 950,000 * 2 = 396.7k.

You can make further improvements if you use the following encoding:

  • 0 means you did not see the number.
  • 10 means you saw it once.
  • 11 is how you keep count.

Which will, on average, result in 280.7k of storage.

EDIT: my Sunday morning math was wrong.

The worst case is we see 500,000 numbers twice, so the math becomes:

500,000 * 27 + 500,000 * 2 = 1.77M

The alternate encoding results in an average storage of

500,000 * 27 + 500,000 = 1.70M

:(
