Tuesday, May 29, 2012

Hash functions to improve string comparison and DB desgin


Scenario

I have gone through some difficult experience to compare strings much time during database design and Querying and think about we need to get some different solution to perform same operation. Many times there are requirements to create Varchar field as Primary column like Account Number in Chart of Account table in Accounts software or Comments column in any table with varchar(max) or (100) etc.

Problem

When we try to join string based column or use in our where clause even to find a single record with like or equals clause it increase our query cost. I came across same scenario in recent time to compare a string based column to just check whether it equals to parameter string value or not and it took 76% of total query cost. SQL SERVER Optimizer hints to create a cover index using Primary Key plus this column which improve query performance by at least 50% but at the same time I came across Hash Key function that satisfied my needs, storage and optimization. So I am going to discuss what is Hash key and how can we implement it using SQL SERVER 2008 and above.I have gathered information from many sources and msdn is one of them.

What is Hash Function


Hash is the value (int or varbinary) result of an algorithm (known as a hash function) applied to a given string. You just need to provide your string as input and you will get a unique hash value as an output. If we provide a complete page string to this function and then change just a character to pass value to it. It will return different values. There is small possibility to repeat same value and this will known as hash collision.

Where can I use Hash Function?

·         Security implementation as encryption of string data.
·         Reduce network traffic for cross DB queries because of small size required to compare instead of string values.
·         Less space requires to store and campare values because it returns int and binary values instead of string characters.
·         Performance increase due to small db types and indexes uses if we implement index on it.
·         Avoid whole string comparisons and requires comparing just int values as checksum.
·         If we create hash column in our DB design along with string columns then it is easy for us to implement joins on these column that will act like string joins as these are hash values of string values.

Careful using Hash Function

·         Hash collision may return more than your expected result set by returning same hash value. For this reason, CheckSum might return more values then you expected but if you want to use this column to identify your column changes then consider HashBytes. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.
·          
Types of Hash function
  • Hash Keys in Database Design
  • CheckSum()
  • HashBytes
For detail about these functions please go to 

No comments: