Is it possible one and the same object, particularly a string
or any primitive or very simple type (like a struct
), to produce different values of the .GetHashCode()
method when invoked on different machines?
For instance, is it possible for the expression "Hello World".GetHashCode()
to produce a different value on a different machine. I am primarily asking for C#.NET but I suppose this might apply to Java or even other languages?
Edit:
As pointed from answers and comments below, it is known to me that .GetHashCode()
can be overriden, and there is no guarantee for the result it produces between different version of the framework. Therefore it is important to clarify that I have simple types in mind (which cannot be inherited, therefore GetHashCode()
be overriden) and I am using the same versions of the framework on all machines.
Short answer: Yes.
But short answers are no fun, are they?
When you are implementing GetHashCode()
you have to make the following guarantee:
When
GetHashCode()
is called on another object that should be considered equal to this, in this App Domain, the same value will be returned.
That's it. There's some things you really need to try to do (spread the bits around with non-equal objects as much as possible, but don't take so long about it that it outweighs all the benefits of hashing in the first place) and your code will suck if you don't do so, but it won't actually break. It will break if you don't go that far, because then e.g.:
dict[myObj] = 3;
int x = dict[myObj];//KeyNotFoundException
Okay. If I'm implementing GetHashCode()
, why might I go further than that, and why might I not?
First, why might I not?
Maybe it's a slightly different version of the assembly and I improved (or at least attempted to) in between builds.
Maybe one is 32-bit and one is 64-bit and I was going nuts for efficiency and chose a different algorithm for each to make use of the different word sizes (this is not unheard of, especially when hashing objects like collections or strings).
Maybe some element I'm deciding to consider in deciding on what constitutes "equal" objects is itself varying from system to system in this sort of way.
Maybe I actually deliberately introduce a different seed with different builds to catch any case where a colleague is mistakenly depending upon my hash code! (I've heard MS do this with their implementation for string.GetHashCode()
, but can't remember whether I heard that from a credible or credulous source).
Mainly though, it'll be one of the first two reasons.
Now, why might I give such a guarantee?
Most likely if I do, it'll be by chance. If an element can be compared for equality on the basis of a single integer id alone, then that's what I'm going to use as my hash-code. Anything else will be more work for a less good hash. I'm not likely to change this, so I might.
The other reason why I might, is that I want that guarantee myself. There's nothing to say I can't provide it, just that I don't have to.
Okay, let's get to something practical. There are cases where you may want a machine-independent guarantee. There are cases where you may want the opposite, which I'll come to in a bit.
First, check your logic. Can you handle collisions? Good, then we'll begin.
If it's your own class, then implement so as to provide such a guarantee, document it, and you're done.
If it's not your class, then implement IEqualityComparer<T>
in such a way as to provide it. For example:
public class ConsistentGuaranteedComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return x == y;
}
public int GetHashCode(string obj)
{
if(obj == null)
return 0;
int hash = obj.Length;
for(int i = 0; i != obj.Length; ++i)
hash = (hash << 5) - hash + obj[i];
return hash;
}
}
Then use this instead of the built-in hash-code.
There's an interesting case where we may want the opposite. If I can control the set of strings you are hashing, then I can pick a bunch of strings with the same hash-code. Your hash-based collection's performance will hit the worse-case and be pretty atrocious. Chances are I can keep doing this faster than you can deal with it, so it can be a denial of service attack. There's not many cases where this happens, but an important one is if you're handling XML documents I send and you can't just rule out some elements (a lot of formats allow for freedom of elements within them). Then the NameTable
inside your parser will be hurt. In this case we create a new hash mechanism each time:
public class RandomComparer : IEqualityComparer<string>
{
private int hashSeed = Environment.TickCount;
public bool Equals(string x, string y)
{
return x == y;
}
public int GetHashCode(string obj)
{
if(obj == null)
return 0;
int hash = hashSeed + obj.Length;
for(int i = 0; i != obj.Length; ++i)
hash = hash << 5 - hash + obj[i];
hash += (hash << 15) ^ 0xffffcd7d;
hash ^= (hash >>> 10);
hash += (hash << 3);
hash ^= (hash >>> 6);
hash += (hash << 2) + (hash << 14);
return hash ^ (hash >>> 16)
}
}
This will be consistent within a given use, but not consistent from use to use, so an attacker can't construct input to force it to be DoSsed. Incidentally, NameTable
doesn't use an IEqualityComparer<T>
because it wants to deal with char-arrays with indices and lengths without constructing a string unless necessary, but it does do something similar.
Incidentally, in Java the hash-code for string
is specified and won't change, but this may not be the case for other classes.
Edit: Having done some research into the overall quality of the approach taken in ConsistentGuaranteedComparer
above, I'm no longer happy with having such algorithms in my answers; while it serves to describe the concept, it doesn't have as good a distribution as one might like. Of course, if one has already implemented such a thing, then one can't change it without breaking the guarantee, but if I'd now recommend using this library of mine, written after said research as follows:
public class ConsistentGuaranteedComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return x == y;
}
public int GetHashCode(string obj)
{
return obj.SpookyHash32();
}
}
That for RandomComparer
above isn't as bad, but can also be improved:
public class RandomComparer : IEqualityComparer<string>
{
private int hashSeed = Environment.TickCount;
public bool Equals(string x, string y)
{
return x == y;
}
public int GetHashCode(string obj)
{
return obj.SpookyHash32(hashSeed);
}
}
Or for even harder predictability:
public class RandomComparer : IEqualityComparer<string>
{
private long seed0 = Environment.TickCount;
private long seed1 = DateTime.Now.Ticks;
public bool Equals(string x, string y)
{
return x == y;
}
public int GetHashCode(string obj)
{
return obj.SpookyHash128(seed0, seed1).GetHashCode();
}
}
string.GetHashCode()
..." Now that the .NET source is available you now have a authoritative source that shows that they do actually use random haschodes in some builds if the FEATURE_RANDOMIZED_STRING_HASHING
build variable is set. Also, if it is a DEBUG
build they also do hash1 ^= ThisAssembly.DailyBuildNumber;
to make sure no one is doing anything stupid like trying to persist hash values, - Scott Chamberlain
It will produce different result even on the same machine on different runs.
So it basically can be used (and it is actually used) to check something during the current run of the program, but there is no sence to store it, to check something against it after. Cause the number you get is generated by runtime.
EDIT
For specific case of a string it will produce the same result even on different machines, except the case when machines have different architecture.