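This is a question our teacher posed in database class. Roughly: given a set S, a hash function, and a corresponding hash table of length m that S is mapped into, what should m be so that, for a given x, the probability of judging from the hash table that x is not in S is less than 0.05?
The teacher said this is a problem any American undergraduate can prove, and that this kind of thinking is exactly what Chinese undergraduates and graduate students lack.
I had no idea where to start. I just kept wondering what this has to do with m, and after class I could not find suitable material either. Here I collect some English-language material I found on how to choose the length of a hash table.
If you only want the key points, skip to the end.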

UCSD_EDU

http://cseweb.ucsd.edu/~kube/cls/100/Lectures/lec16/lec16-8.html
Hash table size

By “size” of the hash table we mean how many slots or buckets it has
Choice of hash table size depends in part on choice of hash function, and collision resolution strategy
But a good general “rule of thumb” is:
The hash table should be an array with length about 1.3 times the maximum number of keys that will actually be in the table, and
Size of hash table array should be a prime number
So, let M = the next prime larger than 1.3 times the number of keys you will want to store in the table, and create the table as an array of length M
(If you underestimate the number of keys, you may have to create a larger table and rehash the entries when it gets too full; if you overestimate the number of keys, you will be wasting some space)
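
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the names `is_prime` and `table_size` are my own, and integer arithmetic (13 * n // 10) stands in for "1.3 times" to avoid floating-point rounding.

```python
import math


def is_prime(n: int) -> bool:
    """Trial division; adequate for table sizes, not for very large numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def table_size(expected_keys: int) -> int:
    """Smallest prime strictly larger than 1.3 * expected_keys."""
    candidate = (13 * expected_keys) // 10 + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


if __name__ == "__main__":
    print(table_size(100))  # M for a table expected to hold 100 keys
```

For 100 expected keys this gives M = 131, the first prime above 130.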

How is the size of a hash table determined? How should optimization be done for it to be fast?

https://www.quora.com/How-is-the-size-of-a-hash-table-determined-How-should-optimization-be-done-for-it-to-be-fast

It’s a tuning parameter - it depends what you’re trying to optimize and what resources you have or are willing to commit but thinking performance will be proportional to average collision chain length is the right thing to be managing.

I don’t know what your application is but assuming you optimize collision handling 3-4 chains will be blisteringly fast on any modern laptop (up). If you’re on a phone or smaller device you might find this is more of a size/speed trade-off.

It’s a common myth for this kind of hash-table that you should pick a prime number but because your hash value has a tendency to have a fixed remainder modulo 33 you should pick something co-prime with 33.
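
To see where a "fixed remainder modulo 33" could come from, here is a small sketch. It assumes the hash in question is a "times 33" string hash (h = h * 33 + byte); the answer does not name the hash, so `hash33_unbounded` is a hypothetical stand-in, and it deliberately skips the 32-bit wrap-around a real implementation would have.

```python
import math


def hash33_unbounded(word: str) -> int:
    """Hypothetical 'times 33' hash without 32-bit truncation."""
    h = 0
    for byte in word.encode("ascii"):
        h = h * 33 + byte
    return h


def is_coprime_with_33(table_size: int) -> bool:
    """True when the table size shares no factor (3 or 11) with 33."""
    return math.gcd(table_size, 33) == 1


if __name__ == "__main__":
    # Every step multiplies the running value by 33, so only the last
    # byte survives modulo 33: words with the same last letter collide.
    for word in ("the", "wayzgoose", "there"):
        print(word, hash33_unbounded(word) % 33)
    print(is_coprime_with_33(33), is_coprime_with_33(64))
```

With a table of 33 slots, every word ending in the same letter would land in the same slot under this assumed hash, which is why a size co-prime with 33 is suggested.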

A smart choice is a power of 2. That’s because you can obtain the remainder by masking bits with & and avoid a (relatively) costly / implicit in your %.
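
A minimal sketch of the masking trick, with function names of my own choosing. The mask only gives the same result as % when the table size is a power of two, so the sketch checks that first.

```python
def slot_by_modulo(h: int, size: int) -> int:
    """Slot index using the remainder operator."""
    return h % size


def slot_by_mask(h: int, size: int) -> int:
    """Slot index using a bit mask; valid only for power-of-two sizes."""
    if size <= 0 or size & (size - 1) != 0:
        raise ValueError("size must be a power of two")
    return h & (size - 1)


if __name__ == "__main__":
    h = 0xDEADBEEF
    print(slot_by_modulo(h, 64), slot_by_mask(h, 64))  # the two agree
```

In Python the speed difference is negligible; the point matters more in languages such as C, where % compiles to a division.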

NB: A side-effect of using powers of 2 is that it’s easy to divide or combine collision chains if you resize the table dynamically. However I get the impression you have a static dictionary and won’t be re-sizing.
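
Why chains are easy to divide: when a power-of-two table doubles from `old_size` to `2 * old_size`, each entry in slot i either stays in slot i or moves to slot i + old_size, decided by a single bit of its full hash. A sketch under the assumption that each chain entry is stored as a (full_hash, key) pair; `split_chain` is an illustrative name.

```python
def split_chain(chain, old_size):
    """Split one chain when the table doubles from old_size to 2 * old_size.

    old_size must be a power of two. Entries whose hash has the old_size
    bit set move to slot i + old_size; the rest stay in slot i.
    """
    stay, move = [], []
    for full_hash, key in chain:
        if full_hash & old_size:
            move.append((full_hash, key))
        else:
            stay.append((full_hash, key))
    return stay, move


if __name__ == "__main__":
    # All four hashes fall in slot 1 of a 4-slot table.
    chain = [(1, "a"), (5, "b"), (9, "c"), (13, "d")]
    stay, move = split_chain(chain, 4)
    print(stay)  # remain in slot 1 of the 8-slot table
    print(move)  # go to slot 5 of the 8-slot table
```

No entry needs to be rehashed from scratch, and combining chains when shrinking is the reverse operation.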

Optimized?

First, make sure you retain the full hash (an unsigned 32-bit int will be likely suitable).
Second when traversing a collision chain compare hash before value. If the hashes don’t match you don’t need a (relatively) expensive string comparison.
The hash you’ve chosen is known to have good performance with English text you should find few if any collisions at full 32-bit hash comparison and make next to zero failed string comparisons.
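
The first two tips can be sketched as a small chained set that stores the full 32-bit hash next to each word and compares hashes before strings. The hash function `hash33` below is an assumed "times 33" hash picked for illustration (the answer does not say which hash the asker chose), and `ChainedSet` is a hypothetical name.

```python
def hash33(word: str) -> int:
    """Assumed 'times 33' string hash, truncated to an unsigned 32-bit value."""
    h = 5381
    for byte in word.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


class ChainedSet:
    """Set of words using separate chaining; size must be a power of two."""

    def __init__(self, size: int = 64):
        self.mask = size - 1
        self.buckets = [[] for _ in range(size)]

    def add(self, word: str) -> None:
        h = hash33(word)
        chain = self.buckets[h & self.mask]
        if not any(fh == h and w == word for fh, w in chain):
            chain.append((h, word))  # keep the full hash with the word

    def __contains__(self, word: str) -> bool:
        h = hash33(word)
        for full_hash, stored in self.buckets[h & self.mask]:
            # Cheap integer comparison first; the string comparison
            # only runs when the full hashes already match.
            if full_hash == h and stored == word:
                return True
        return False


if __name__ == "__main__":
    words = ChainedSet()
    words.add("the")
    words.add("wayzgoose")
    print("the" in words, "them" in words)
```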

Third consider ordering the collision chain. If access is random order it by hash value.
That way you can dive out of a chain when you realize the look-up value can’t be held.
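
A sketch of the early exit, assuming each chain is a list of (full_hash, key) pairs kept in ascending order of hash. The helper names are illustrative.

```python
import bisect


def insert_sorted(chain, full_hash, key):
    """Insert while keeping the chain ordered by (full_hash, key)."""
    bisect.insort(chain, (full_hash, key))


def lookup_sorted(chain, full_hash, key):
    """Search an ordered chain, stopping once stored hashes exceed the target."""
    for stored_hash, stored_key in chain:
        if stored_hash > full_hash:
            return False  # everything after this is larger still
        if stored_hash == full_hash and stored_key == key:
            return True
    return False


if __name__ == "__main__":
    chain = []
    for h, k in [(30, "c"), (10, "a"), (20, "b")]:
        insert_sorted(chain, h, k)
    print(chain)
    print(lookup_sorted(chain, 20, "b"), lookup_sorted(chain, 15, "x"))
```

A failed lookup then walks about half the chain on average instead of all of it.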

Alternatively if access isn’t random consider ordering the collision chains by a static or dynamic frequency. Static frequency would be based on occurrence of a word in some text “corpus”. That is you’d want ‘the’ to appear at the front of its collision chain and ‘wayzgoose’ likely towards the end!
Dynamic frequency would involve moving words that are ‘hit’ to the front of their collision chain knowing words recur in a given text.
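
The dynamic variant is often called move-to-front. A minimal sketch on a single chain, represented here as a plain Python list of words (an assumption made to keep the example short):

```python
def lookup_move_to_front(chain, word):
    """Return True if word is in the chain, moving a hit to the front."""
    for i, stored in enumerate(chain):
        if stored == word:
            if i:  # already at the front when i == 0
                chain.insert(0, chain.pop(i))
            return True
    return False


if __name__ == "__main__":
    chain = ["wayzgoose", "cat", "the"]
    lookup_move_to_front(chain, "the")
    print(chain)  # 'the' is now first, so the next hit is immediate
```

For the static variant, the chain would instead be sorted once by corpus frequency and left alone.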

If you are writing a spell checker (and I’ve somewhat assumed that’s the application) I really do recommend finding a corpus. It doesn’t even have to be very big because (of course) the common words are common and will sort to the front very quickly and even if the ‘uncommon’ words aren’t optimized - they have less impact because they’re uncommon!

PS: I also know a practically perfect (in the formal sense) hash for English words however I think you’ll find your hash is pretty good.

Summary

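The choice of hash table size depends in part on the choice of hash function and on the collision resolution strategy.
What is really being discussed is the balance between performance and average collision chain length.
In general, a shorter table saves space, but collisions become correspondingly more likely, which lengthens the chains a lookup has to walk (the hash function has an even bigger influence).
Handling collisions is a very important part, far more important than the size of the table.
Rules of thumb: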

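About 1.3 times the maximum number of keys that will actually be in the table
A prime number
Using a power of 2 is another option. The benefit is that the remainder can be taken with a plain bit mask, which is cheaper. As a side effect, collision chains are also easy to split or merge if the table has to be resized dynamically.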

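In the end none of this settles the original question about the probability and its proof, but I have a feeling the teacher was talking about the probabilities of hashing and collisions. Maybe I misheard, and of course I can't rule out that the teacher misspoke, haha.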
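
For what it is worth, here is one possible reading of the question worked through. Everything in it is an assumption on my part, not something confirmed in class: S holds n keys, the hash function sends each key independently and uniformly to one of m slots, each slot only records "occupied or not", and the error in question is an x outside S landing on an occupied slot (so the table cannot rule it out).

```latex
\begin{align*}
\Pr[\text{slot of } x \text{ is occupied}]
  &= 1 - \left(1 - \frac{1}{m}\right)^{n}
  \approx 1 - e^{-n/m}, \\
1 - e^{-n/m} \le 0.05
  &\iff \frac{n}{m} \le -\ln 0.95 \approx 0.0513
  \iff m \ge \frac{n}{0.0513} \approx 19.5\,n .
\end{align*}
```

Under these assumptions m would need to be roughly 20 times the number of keys. If the teacher meant a different setup, the bound changes accordingly.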
