Java HashCode in TSQL

The Java HashCode method is used to determine uniqueness or similarity of strings. While implemented in Java, there can be many benefits of creating a similar or customized version of this method.

Introduction

Determining uniqueness or similarity between strings is often important, both in application code and in database scripts. While SQL Server has many built-in functions that can generate hash values or checksums, there is sometimes a need to use a specific algorithm that matches one used in application code.

This article grew out of an unexpected need to validate Java HashCode values for many elements in a list and to manage those results within SQL Server. For the application in question, doing this work in the database improves efficiency by allowing set-based comparisons, rather than having application code pull rows in batches (or one at a time).

We will build from scratch a TSQL version of HashCode, explaining each component, how it works, and supplying a handful of demos that show how to build and use it in any SQL Server environment.

Why Hash?

There are many different reasons why we would be interested in hashing data, regardless of whether we use HashCode, or some other algorithm.

The primary use of hashing is for the organization and location of data. Data can be mapped into a set of hash buckets, where we can control the number and size of those buckets. When we need to locate a value or set of values at a later time, we can reduce our search to only those with a similar hash.

Hashing is also an effective way to detect differences between data values. Differing hashes guarantee that the values differ, whereas identical hashes indicate values that are either identical or, occasionally, different values that happen to collide.

Hashing can also be used to verify if a data element was transmitted successfully across a network or between machines. While a checksum is very effective for this purpose, there can be benefits to customizing a hash specifically for a given application’s needs.

One other use of hashing is to index encrypted data (when needed). By taking a hash of an unencrypted value prior to encryption, we can create a column that can be indexed and searched when decryption is required. This allows us to greatly reduce the search space from a table scan to an index seek in which the resulting number of rows to read will be small.

Details of HashCode

The goal of any hash function is to map data into corresponding buckets in a target data type as evenly as possible. Mathematically, there are many ways to accomplish this, with most involving sums, products, powers, or division of each character of data involved. Prime numbers are often used in these calculations in order to ensure as much uniqueness as possible in each calculation.

Our ultimate goal is to return the hash of any string or set of strings that are passed into the HashCode method. The method itself is based on a mathematical formula that sums a weighted term for each character within a string:

$$\mathrm{HashCode}(STR) = \sum_{i=0}^{n-1} STR[i] \cdot 31^{\,n-1-i}$$

Where:
STR: The input string
STR[i]: The ASCII value of the character at position i
n: Number of characters in the string
i: Zero-based position of the current character in the string
31: An arbitrary prime number.

The prime number helps spread results across a range of numbers large enough to accommodate strings of a given length. Larger strings produce larger calculations. As a result, the data type of the hash result, whether SMALLINT, INT, BIGINT, or DECIMAL, can be chosen based on string length and complexity. For the sake of our demonstrations, we will use 31 as the prime number and INT for the hash results, which matches the HashCode function as currently implemented in Java.

For any application of this function to relatively small strings, this will be more than sufficient. If you were looking to run hashes against VARCHAR(MAX) columns, large volumes of strings, or some combination of the two, then you might want to consider using DECIMAL to gain even more digits. It is unclear why we would want to hash the full text of large comments, descriptions, or text, but if the need exists, we can adjust our scripts to accommodate it.

An obvious question: What happens when our sums exceed the boundaries of the data type used for the results? In that scenario, the result wraps around: we add or subtract the full size of the data type's range ($2^{32}$ for an INT) to bring it back within bounds. This preserves the low-order bits of our hash while discarding the excess leading bits. We’ll run through a few examples below in order to see how this works.

If we want to calculate the HashCode for multiple strings, the result is the sum of the HashCodes for all strings, each calculated independently. As with the summations performed above, if summing multiple HashCode values overflows our output data type, we wrap the result back into range in the same way and continue summing predictably. Since each string is hashed separately and addition is commutative, the order of the strings does not matter when determining the combined HashCode value.

Uniqueness

An important note about hash functions in general is that they do not guarantee uniqueness! Two identical objects will generate the same hash every time it is calculated, but two unequal objects can also share the same hash code. In other words, two objects with the same hash might actually be different.

The number of possible strings that can be formed from combinations of characters will always be far greater than the number of values an INT can hold, or a BIGINT for that matter. For most applications, HashCode will provide more than enough uniqueness and identification value to be worthwhile, but it is important to note that it does not guarantee uniqueness.

Within the scope of a given application, duplication will typically be minimal, but collisions are not mathematically impossible, and uniqueness is not guaranteed.

For example, consider the HashCode values for “Aa” vs. “BB”:
The following are the ASCII values for the characters above:
A: 65 a: 97 B: 66

Therefore, the calculations for these strings would be:
Aa = $65 \cdot 31^{2-1-0} + 97 \cdot 31^{2-1-1}$ = 2,015 + 97 = 2,112.
BB = $66 \cdot 31^{2-1-0} + 66 \cdot 31^{2-1-1}$ = 2,046 + 66 = 2,112.
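
The collision is easy to confirm directly in SQL Server. The following is a minimal sketch that applies the same weighted sum to both two-character strings using the built-in ASCII function; the column aliases are only illustrative.

```sql
-- For a two-character string the formula reduces to: first * 31 + second.
SELECT
    ASCII('A') * 31 + ASCII('a') AS Aa_HashCode,  -- 65 * 31 + 97
    ASCII('B') * 31 + ASCII('B') AS BB_HashCode;  -- 66 * 31 + 66
```

Both columns return 2112, even though the two strings differ.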

As a result, very short strings are far more likely to result in duplication. Additionally, strings drawn from a very limited variety of characters will also collide more often. If you are using a hash algorithm like this one to compare very short or non-diverse strings, consider also performing an equality check to catch the case in which two different strings share a hash. As a bonus, comparing two short strings directly is relatively inexpensive.

To sum up uniqueness:

  • Two identical strings or sets of strings will result in the same HashCode.
  • The same set of strings presented in a different order will produce the same HashCode.
  • Two identical HashCodes can represent two different strings, although this is unlikely for longer, varied strings.
  • Two different strings can result in the same HashCode, although this is unlikely for longer, varied strings.
  • The more different strings that are analyzed, the greater the probability of a collision.

HashCode is excellent for comparing strings, but should never be used as a key for determining uniqueness with certainty, nor should it be used as a primary or alternate key on a table. Instead, HashCode can be used to quickly determine whether two strings or sets of strings might be identical. If the hashes are identical, the values can be further analyzed as needed, whereas if they are different, we can move on with the knowledge that they are definitely not identical.

Example Hash Results

Our inputs for any tests we perform will be comma-delimited lists of strings. For these examples, the HashCode value is calculated for each item in the list, and then those values are added together. For example, consider the string “Table,Column,Function,Stored Procedure,Index”. This list contains 5 elements and will be broken down as follows:

HashCode(‘Table,Column,Function,Stored Procedure,Index’) =
HashCode(‘Table’) + HashCode(‘Column’) + HashCode(‘Function’) + HashCode(‘Stored Procedure’) + HashCode(‘Index’), where:
HashCode(‘Table’) = $84 \cdot 31^{5-1-0} + 97 \cdot 31^{5-1-1} + 98 \cdot 31^{5-1-2} + 108 \cdot 31^{5-1-3} + 101 \cdot 31^{5-1-4}$ = 80,563,118
HashCode(‘Column’) = $67 \cdot 31^{6-1-0} + 111 \cdot 31^{6-1-1} + 108 \cdot 31^{6-1-2} + 117 \cdot 31^{6-1-3} + 109 \cdot 31^{6-1-4} + 110 \cdot 31^{6-1-5}$ = 2,023,997,302
HashCode(‘Function’) = $70 \cdot 31^{8-1-0} + 117 \cdot 31^{8-1-1} + 110 \cdot 31^{8-1-2} + 99 \cdot 31^{8-1-3} + 116 \cdot 31^{8-1-4} + 105 \cdot 31^{8-1-5} + 111 \cdot 31^{8-1-6} + 110 \cdot 31^{8-1-7}$ = 1,445,582,840 (after wrapping into the INT range)
HashCode(‘Stored Procedure’) = $83 \cdot 31^{16-1-0} + 116 \cdot 31^{16-1-1} + 111 \cdot 31^{16-1-2} + 114 \cdot 31^{16-1-3} + 101 \cdot 31^{16-1-4} + 100 \cdot 31^{16-1-5} + 32 \cdot 31^{16-1-6} + 80 \cdot 31^{16-1-7} + 114 \cdot 31^{16-1-8} + 111 \cdot 31^{16-1-9} + 99 \cdot 31^{16-1-10} + 101 \cdot 31^{16-1-11} + 100 \cdot 31^{16-1-12} + 117 \cdot 31^{16-1-13} + 114 \cdot 31^{16-1-14} + 101 \cdot 31^{16-1-15}$ = -1,281,122,736 (after wrapping into the INT range)
HashCode(‘Index’) = $73 \cdot 31^{5-1-0} + 110 \cdot 31^{5-1-1} + 100 \cdot 31^{5-1-2} + 101 \cdot 31^{5-1-3} + 120 \cdot 31^{5-1-4}$ = 70,793,394

HashCode(‘Table,Column,Function,Stored Procedure,Index’) = 2,339,813,918…
BUT, since this value is greater than the upper bound of an INTEGER (2,147,483,647), we need to lop off the leading bit by subtracting $2^{32}$:

HashCode(‘Table,Column,Function,Stored Procedure,Index’) = 2,339,813,918 - $2^{32}$ = -1,955,153,378.

This additional bit of arithmetic allows us to ensure that all results stay within the boundaries of the data type we have chosen (INT). If we used BIGINT instead, then we would need to take action on our results when they exceeded the bounds of $-2^{63}$ to $2^{63} - 1$, at which point we would add or subtract $2^{64}$ in order to maintain our results as valid BIGINT values.
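
The same arithmetic can be sketched in T-SQL. This example simply adds the five per-string values quoted above and wraps the total; the variable names are illustrative and are not part of the function built later in this article. BIGINT is used so that the intermediate total can exceed the INT range without raising an overflow error.

```sql
DECLARE @Two_To_31 BIGINT = POWER(CAST(2 AS BIGINT), 31);
DECLARE @Two_To_32 BIGINT = POWER(CAST(2 AS BIGINT), 32);

-- Per-string HashCode values as quoted in the worked example above.
DECLARE @Total BIGINT =
      CAST(80563118 AS BIGINT)   -- Table
    + 2023997302                 -- Column
    + 1445582840                 -- Function
    + (-1281122736)              -- Stored Procedure
    + 70793394;                  -- Index

SELECT @Total AS Raw_Total;      -- 2339813918, above the INT maximum

-- Wrap back into the INT range by adding or subtracting 2^32.
IF @Total >= @Two_To_31
    SET @Total = @Total - @Two_To_32;
ELSE IF @Total < -1 * @Two_To_31
    SET @Total = @Total + @Two_To_32;

SELECT @Total AS Wrapped_Total;  -- -1955153378
```

Casting the first literal to BIGINT matters: without it, SQL Server would add the literals as INT values and fail with an overflow error before any wrapping could take place.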

Building HashCode in TSQL

The challenge at hand is to build a TSQL function that can do all of this for us, both accurately and quickly. To accomplish this, let’s break down the problem into a number of simpler, discrete steps, each of which we can tackle separately, and then put back together at the end in order to achieve our goals:

  1. Create or use a method for parsing comma-delimited strings.
  2. Create a function that accepts a single delimited list that will be parsed into discrete strings to be processed. For each string in the list:
    1. Loop through each character in that string.
    2. Get the ASCII value for that character.
    3. Calculate the HashCode value for that ASCII code.
    4. Add to the running total for the HashCode for this string.
    5. If the HashCode of this string is outside the boundaries of an INT, adjust it back into those bounds.
    6. Add the result to the running total for all strings to be hashed.
    7. If the grand total is outside of the bounds of an INT, adjust in the same fashion as we did above.
  3. Return the HashCode total from the function.

There are many ways to parse a delimited string. Feel free to reference a previous article in which this topic was delved into quite thoroughly:
Efficient creation and parsing of delimited strings

To keep our code simple, we will use the function STRING_SPLIT, which is included by default in all versions of SQL Server starting with 2016. Feel free to use any other string-splitting method at your disposal, whether it be from the article above or an in-house solution that you use elsewhere in your database schema.
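
As a quick illustration of the splitting step on its own, the sketch below splits a sample list; the literal list and the column alias are just examples. STRING_SPLIT returns one row per element in a single column named value, requires database compatibility level 130 or higher, and does not trim whitespace, so a space after a comma becomes part of the next element (and would change its hash).

```sql
-- One row per list element; the output column is always named value.
SELECT value AS Input_String
FROM STRING_SPLIT('Table,Column,Function,Stored Procedure,Index', ',');
```

This returns five rows. STRING_SPLIT does not promise any particular row order, which is harmless here because the per-string hashes are simply added together.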

With that bit of housekeeping out of the way, we can declare our function:

CREATE FUNCTION dbo.Java_Hashcode(@Input_String_List VARCHAR(MAX)) -- This is a CSV of any number of strings
RETURNS BIGINT
AS
BEGIN

You may ask, “Why does the function return a BIGINT, when the value should always be an INT?” For our work below, I’ve chosen to use BIGINT for all numeric data types. This allows easy expansion of the result range from INT to BIGINT, with only a few changes to the underlying calculations. We need to use BIGINT in order to track running totals within the function, as they will often exceed the boundaries of an integer briefly, until adjusted at the end of each loop. Using BIGINT everywhere maintains consistency across all of our variables.
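
To see why the wider type matters, consider a single step from hashing ‘Function’. This is a minimal sketch using only built-in T-SQL, with illustrative variable names: unlike Java, SQL Server (with default session settings) raises an arithmetic overflow error when INT arithmetic exceeds the INT range rather than wrapping silently, so the intermediate product needs room to grow.

```sql
-- 2115468377 is the running hash after the first six characters of 'Function'.
DECLARE @Hash_As_Int INT = 2115468377;
DECLARE @Hash_As_Bigint BIGINT = 2115468377;

BEGIN TRY
    -- INT * INT overflows here, and SQL Server raises an error instead of wrapping.
    SELECT @Hash_As_Int * 31 + ASCII('o') AS Next_Hash;
END TRY
BEGIN CATCH
    SELECT ERROR_MESSAGE() AS Overflow_Error;
END CATCH;

-- With BIGINT the product fits, and the modulus keeps only the low 32 bits.
SELECT (@Hash_As_Bigint * 31 + ASCII('o')) % POWER(CAST(2 AS BIGINT), 32) AS Next_Hash;  -- 1155010358
```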

To begin our function, we’ll declare a variety of variables that will be used throughout our calculations, as well as perform our string splitting:

DECLARE @Input_Strings TABLE (Input_String VARCHAR(MAX));

-- STRING_SPLIT returns its results in a single column named value.
INSERT INTO @Input_Strings (Input_String)
SELECT value
FROM STRING_SPLIT(@Input_String_List, ',');

DECLARE @Java_Hashcode_Output BIGINT;
DECLARE @Java_Hashcode_Output_Total BIGINT = 0;
DECLARE @Character_Counter BIGINT;
DECLARE @Input_String_Length BIGINT;
DECLARE @Current_Character VARCHAR(1);
DECLARE @Current_Character_Ascii_Value SMALLINT;
DECLARE @Prime_Number BIGINT = 31;
DECLARE @Current_String VARCHAR(MAX);

The @Input_Strings table will be used to store the set of strings passed into the function, which are parsed and inserted using STRING_SPLIT, or any other favorite list-splitting method. The remainder of the variables will be used for loop variables or as placeholders for details about string lengths or the prime number we are choosing (31). With this housekeeping out of the way, we will use a CURSOR to loop through each string. A WHILE loop could also be used, with similar results.

DECLARE String_Cursor CURSOR FOR
SELECT Input_String FROM @Input_Strings;

OPEN String_Cursor;
FETCH NEXT FROM String_Cursor INTO @Current_String;

With a loop CURSOR defined, we can now iterate through each string and calculate the HashCode of each, one character at a time:

WHILE @@FETCH_STATUS = 0
BEGIN
    SELECT @Input_String_Length = LEN(@Current_String);
    SELECT @Character_Counter = 1;
    SELECT @Java_Hashcode_Output = 0;

The check of @@FETCH_STATUS allows us to continue looping until the cursor returns no further items from a FETCH command. We then assign the length of the string to a variable and reset the character counter, which will loop until we hit the length of the string. @Java_Hashcode_Output will store the running total results for the current string, to be used when the inner loop is complete:

    WHILE @Character_Counter <= @Input_String_Length
    BEGIN
        SELECT @Current_Character = SUBSTRING(@Current_String, @Character_Counter, 1);
        SELECT @Current_Character_Ascii_Value = ASCII(@Current_Character);
        -- Multiply the running value by the prime, add the character, and keep the low 32 bits.
        SELECT @Java_Hashcode_Output = (@Java_Hashcode_Output * @Prime_Number + @Current_Character_Ascii_Value) % POWER(CAST(2 AS BIGINT), 32);
        SELECT @Character_Counter = @Character_Counter + 1;
    END

The WHILE loop will iterate once per character in the current string that is being examined. The @Current_Character pulls one character at a time as we run through this loop. We then proceed to determine the ASCII value of the character, also storing that in a scalar variable (@Current_Character_Ascii_Value).

Now we perform the important calculation of the HashCode for this character, which multiplies our existing output by our prime number (31) and adds the current character’s ASCII value to it. We then calculate the modulus (remainder) of that result divided by $2^{32}$. The modulus operation keeps only the low-order 32 bits, so the running value never grows more than one power of 2 beyond the bounds of an INT. For longer strings, this is very important to prevent our calculations from becoming astronomically large!
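
It may help to see why updating a running value this way matches the per-character sum described earlier. Multiplying the running total by 31 at every step raises the power attached to each earlier character by one, which is Horner's rule for evaluating a polynomial. A short derivation follows, writing $c_i$ for the ASCII value of the character at zero-based position $i$ and $h_k$ for the running value after $k$ characters (both symbols are introduced here purely for illustration):

```latex
h_0 = 0, \qquad h_{k+1} = \left(31\,h_k + c_k\right) \bmod 2^{32}

h_n \equiv \sum_{i=0}^{n-1} c_i \cdot 31^{\,n-1-i} \pmod{2^{32}}

\text{Example (``Aa''):}\quad h_1 = 31 \cdot 0 + 65 = 65, \qquad h_2 = 31 \cdot 65 + 97 = 2112
```

Because only the low-order 32 bits ever matter, reducing after every step gives the same final answer as computing the full sum first and reducing once at the end.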

With each loop, we continue to calculate the HashCode value using the existing value, as well as variables for the given character. Lastly, we increment @Character_Counter and continue until we are finished with the current string.

After each HashCode calculation, we need to validate that the results are within the range of our output data type, in this case, INTEGER. The following TSQL will validate and adjust, if needed:

    IF @Java_Hashcode_Output >= POWER(CAST(2 AS BIGINT), 31)
    BEGIN
        SELECT @Java_Hashcode_Output = @Java_Hashcode_Output - POWER(CAST(2 AS BIGINT), 32);
    END
    -- Strictly less than: -2^31 itself is a valid INT and must not be adjusted.
    ELSE IF @Java_Hashcode_Output < -1 * POWER(CAST(2 AS BIGINT), 31)
    BEGIN
        SELECT @Java_Hashcode_Output = @Java_Hashcode_Output + POWER(CAST(2 AS BIGINT), 32);
    END

The first IF statement checks whether our HashCode output is $2^{31}$ or greater, in which case we subtract $2^{32}$ from our result, moving it into negative territory, but not beyond the range of an INTEGER. This is equivalent to reinterpreting the 32-bit value as a signed integer, in which a set leading bit indicates a negative number.

The second IF statement checks for the opposite condition, where the HashCode output is less than $-2^{31}$, in which case we add $2^{32}$ to our result, wrapping the value back into the positive half of the INTEGER range.

We have now calculated the HashCode for the first string in our sequence! If that was the only string supplied in the input list, then we are done, otherwise we will move to the start of the loop and perform our calculations on the next string. In order to keep a running total of HashCodes for each string, we will sum our result with a running total variable, @Java_Hashcode_Output_Total:

SELECT @Java_Hashcode_Output_Total = @Java_Hashcode_Output_Total + @Java_Hashcode_Output;

After the first iteration, @Java_Hashcode_Output_Total will equal @Java_Hashcode_Output, but in future iterations it runs the chance of exceeding the boundaries of an INTEGER. Our solution to this problem is exactly the same as how we dealt with it above:

    IF @Java_Hashcode_Output_Total >= POWER(CAST(2 AS BIGINT), 31)
    BEGIN
        SELECT @Java_Hashcode_Output_Total = @Java_Hashcode_Output_Total - POWER(CAST(2 AS BIGINT), 32);
    END
    -- Strictly less than: -2^31 itself is a valid INT and must not be adjusted.
    ELSE IF @Java_Hashcode_Output_Total < -1 * POWER(CAST(2 AS BIGINT), 31)
    BEGIN
        SELECT @Java_Hashcode_Output_Total = @Java_Hashcode_Output_Total + POWER(CAST(2 AS BIGINT), 32);
    END

These two calculations should look very familiar! They are almost identical to how we kept our HashCode result for a single string within the range of our output variable. If the running total for our entire string list exceeds the boundaries of an INTEGER, on the positive or negative side of the range, then we will subtract or add $2^{32}$ to move it back within our acceptable range.

With the string calculation, as well as a running total complete, we can fetch the next string from our cursor and continue with the next calculation:

    FETCH NEXT FROM String_Cursor INTO @Current_String;
END

CLOSE String_Cursor;
DEALLOCATE String_Cursor;

RETURN @Java_Hashcode_Output_Total;
END
GO

That’s it, all in about 75 lines of TSQL! The most frequently made mistake when converting a hashing algorithm into TSQL is mishandling overflow situations in which results exceed the boundaries of our output data type. In this script, we kept results within the range of the INTEGER data type, allowing for 4,294,967,296 distinct possible result values. INTEGER was chosen to match the behavior of the HashCode method as written in Java, but the script could be altered to use the BIGINT range instead, which would provide 18,446,744,073,709,551,616 possible values. It may seem like crazy talk to write out numbers that large, but if uniqueness were more important, and we wanted a significantly larger result range with less chance of collisions, then the larger range would provide that additional buffer against duplication.

If, for any reason, BIGINT was still inadequate, we could introduce a DECIMAL type, which allows results between $-10^{38} + 1$ and $10^{38} - 1$. If uniqueness is so critical that numbers that absurdly large aren’t good enough, then hashing may not be the correct way to manage your data. It may instead be necessary to compare data as-is, or encrypt using a more complex algorithm that guarantees uniqueness. Hashing provides a high level of uniqueness using very inexpensive calculations, which is good enough for most common uses.

Example Function Use

Calling our HashCode function is as easy as including it in a SELECT statement:

SELECT dbo.Java_Hashcode('TestString1,SecondString,MoreStrings,ThisIsFun!!!');

The result is a single BIGINT (within the range of an integer):

376744202

As with any function, we can use the function as part of a result set when selecting data from a table. As always, test performance against any large data sets before moving into a production environment. Assuming it passes the speed test, you can accomplish this with a query similar to this:

SELECT
    Product.Name,
    dbo.Java_Hashcode(Product.Name) AS Product_Name_HashCode
FROM Production.Product;

These values could then be summed up, if desired, or other calculations performed on the results.
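
One way to put the function to work is as a cheap first-pass filter, followed by an equality check to rule out collisions. The sketch below assumes the dbo.Java_Hashcode function built above already exists; the table variable and its columns are hypothetical. Because the function treats its input as a comma-delimited list, the sample values deliberately contain no commas.

```sql
DECLARE @Candidates TABLE (Left_Value VARCHAR(50), Right_Value VARCHAR(50));

INSERT INTO @Candidates (Left_Value, Right_Value)
VALUES ('Aa', 'BB'),        -- different strings that share a hash (2112)
       ('Table', 'Table'),  -- identical strings
       ('Table', 'Index');  -- different strings with different hashes

SELECT
    Left_Value,
    Right_Value,
    CASE
        -- Different hashes prove the strings differ, so no further work is needed.
        WHEN dbo.Java_Hashcode(Left_Value) <> dbo.Java_Hashcode(Right_Value)
            THEN 'Different'
        -- Matching hashes still need a direct comparison to rule out a collision.
        WHEN Left_Value = Right_Value
            THEN 'Identical'
        ELSE 'Hash collision'
    END AS Comparison_Result
FROM @Candidates;
```

With the function as defined above, the three rows should be reported as a hash collision, identical, and different, respectively.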

Conclusion

Hashing is a useful and inexpensive way to obfuscate, index, or group results into convenient subdivisions. HashCode is a common and simple algorithm used by Java across many platforms, and it can be translated into and used in TSQL if desired. This allows us to guarantee identical results between TSQL scripts and application code.

Attached to this article is the full script that we built above. Feel free to download, reuse, and customize this script as needed. While very iterative in its approach, the fact that we are operating on scalar variables allows this to perform well, even if we decide to parse longer lists of larger strings. Extremely long lists might perform a bit slower, but generally will be acceptable. A test run of 1000 strings with an average of 25 characters completed in under a second, even on slow spinning-disk drive storage.

Enjoy!

Translated from: https://www.sqlshack.com/java-hashcode-tsql/
