8

Is it possible one and the same object, particularly a string or any primitive or very simple type (like a struct), to produce different values of the .GetHashCode() method when invoked on different machines?

For instance, is it possible for the expression "Hello World".GetHashCode() to produce a different value on a different machine. I am primarily asking for C#.NET but I suppose this might apply to Java or even other languages?

Edit:

As pointed from answers and comments below, it is known to me that .GetHashCode() can be overriden, and there is no guarantee for the result it produces between different version of the framework. Therefore it is important to clarify that I have simple types in mind (which cannot be inherited, therefore GetHashCode() be overriden) and I am using the same versions of the framework on all machines.


  • For your enjoyment -- blogs.msdn.com/b/ericlippert/archive/2011/02/28/… - Austin Salonen
  • The difference exists for strings only between x86 and 64. Search for "Different return values for strings on x86 versus x64" in the link provided by @CodyGray - digEmAll
  • @digEmAll, I cannot, Actually my question was rather hypothetical than backed with real code. I am thinking of a way to implement internal load balancing on a windows distributed application - I have a service deployed on different local servers, and I need a good and stable hash code (across machines and startups), so I can determine the concrete service to run by a string token. And I was considering the built-in hash to be probably the fastest and easiest way, but it seems I've been wrong. - Ivaylo Slavov
  • @IvayloSlavov: If that is your requirement then GetHashCode is absolutely, positively the wrong thing to use. Use GetHashCode for one thing and one thing only: to balance a hash table. If you need a hash code for some other purpose, write your own hash code algorithm that is suitable for that purpose. - Eric Lippert
  • @IvayloSlavov: To get a stable hashing for strings use methods and classes of the namespace: System.Security.Cryptography. For example you could use MD5 - digEmAll

2 답변


14

Short answer: Yes.

But short answers are no fun, are they?

When you are implementing GetHashCode() you have to make the following guarantee:

When GetHashCode() is called on another object that should be considered equal to this, in this App Domain, the same value will be returned.

That's it. There's some things you really need to try to do (spread the bits around with non-equal objects as much as possible, but don't take so long about it that it outweighs all the benefits of hashing in the first place) and your code will suck if you don't do so, but it won't actually break. It will break if you don't go that far, because then e.g.:

dict[myObj] = 3;
int x = dict[myObj];//KeyNotFoundException

Okay. If I'm implementing GetHashCode(), why might I go further than that, and why might I not?

First, why might I not?

Maybe it's a slightly different version of the assembly and I improved (or at least attempted to) in between builds.

Maybe one is 32-bit and one is 64-bit and I was going nuts for efficiency and chose a different algorithm for each to make use of the different word sizes (this is not unheard of, especially when hashing objects like collections or strings).

Maybe some element I'm deciding to consider in deciding on what constitutes "equal" objects is itself varying from system to system in this sort of way.

Maybe I actually deliberately introduce a different seed with different builds to catch any case where a colleague is mistakenly depending upon my hash code! (I've heard MS do this with their implementation for string.GetHashCode(), but can't remember whether I heard that from a credible or credulous source).

Mainly though, it'll be one of the first two reasons.

Now, why might I give such a guarantee?

Most likely if I do, it'll be by chance. If an element can be compared for equality on the basis of a single integer id alone, then that's what I'm going to use as my hash-code. Anything else will be more work for a less good hash. I'm not likely to change this, so I might.

The other reason why I might, is that I want that guarantee myself. There's nothing to say I can't provide it, just that I don't have to.


Okay, let's get to something practical. There are cases where you may want a machine-independent guarantee. There are cases where you may want the opposite, which I'll come to in a bit.

First, check your logic. Can you handle collisions? Good, then we'll begin.

If it's your own class, then implement so as to provide such a guarantee, document it, and you're done.

If it's not your class, then implement IEqualityComparer<T> in such a way as to provide it. For example:

public class ConsistentGuaranteedComparer : IEqualityComparer<string>
{
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    if(obj == null)
      return 0;
    int hash = obj.Length;
    for(int i = 0; i != obj.Length; ++i)
      hash = (hash << 5) - hash + obj[i];
    return hash;
  }
}

Then use this instead of the built-in hash-code.

There's an interesting case where we may want the opposite. If I can control the set of strings you are hashing, then I can pick a bunch of strings with the same hash-code. Your hash-based collection's performance will hit the worse-case and be pretty atrocious. Chances are I can keep doing this faster than you can deal with it, so it can be a denial of service attack. There's not many cases where this happens, but an important one is if you're handling XML documents I send and you can't just rule out some elements (a lot of formats allow for freedom of elements within them). Then the NameTable inside your parser will be hurt. In this case we create a new hash mechanism each time:

public class RandomComparer : IEqualityComparer<string>
{
  private int hashSeed = Environment.TickCount;
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    if(obj == null)
      return 0;
    int hash = hashSeed + obj.Length;
    for(int i = 0; i != obj.Length; ++i)
      hash = hash << 5 - hash + obj[i];
    hash += (hash <<  15) ^ 0xffffcd7d;
    hash ^= (hash >>> 10);
    hash += (hash <<   3);
    hash ^= (hash >>>  6);
    hash += (hash <<   2) + (hash << 14);
    return hash ^ (hash >>> 16)
  }
}

This will be consistent within a given use, but not consistent from use to use, so an attacker can't construct input to force it to be DoSsed. Incidentally, NameTable doesn't use an IEqualityComparer<T> because it wants to deal with char-arrays with indices and lengths without constructing a string unless necessary, but it does do something similar.

Incidentally, in Java the hash-code for string is specified and won't change, but this may not be the case for other classes.

Edit: Having done some research into the overall quality of the approach taken in ConsistentGuaranteedComparer above, I'm no longer happy with having such algorithms in my answers; while it serves to describe the concept, it doesn't have as good a distribution as one might like. Of course, if one has already implemented such a thing, then one can't change it without breaking the guarantee, but if I'd now recommend using this library of mine, written after said research as follows:

public class ConsistentGuaranteedComparer : IEqualityComparer<string>
{
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    return obj.SpookyHash32();
  }
}

That for RandomComparer above isn't as bad, but can also be improved:

public class RandomComparer : IEqualityComparer<string>
{
  private int hashSeed = Environment.TickCount;
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    return obj.SpookyHash32(hashSeed);
  }
}

Or for even harder predictability:

public class RandomComparer : IEqualityComparer<string>
{
  private long seed0 = Environment.TickCount;
  private long seed1 = DateTime.Now.Ticks;
  public bool Equals(string x, string y)
  {
    return x == y;
  }
  public int GetHashCode(string obj)
  {
    return obj.SpookyHash128(seed0, seed1).GetHashCode();
  }
}


  • This goes far beyond what I've asked. I am very glad to have this info and really appreciate your effort. Thank you - Ivaylo Slavov
  • As I said, short answers are no fun :) - Jon Hanna
  • I was looking for information about NameTable's implementation of hashing - wondering why it was different than GetHashCode() and this answer covered that. Nice work! - Clay
  • "I've heard MS do this with their implementation for string.GetHashCode()..." Now that the .NET source is available you now have a authoritative source that shows that they do actually use random haschodes in some builds if the FEATURE_RANDOMIZED_STRING_HASHING build variable is set. Also, if it is a DEBUG build they also do hash1 ^= ThisAssembly.DailyBuildNumber; to make sure no one is doing anything stupid like trying to persist hash values, - Scott Chamberlain

1

It will produce different result even on the same machine on different runs.

So it basically can be used (and it is actually used) to check something during the current run of the program, but there is no sence to store it, to check something against it after. Cause the number you get is generated by runtime.

EDIT

For specific case of a string it will produce the same result even on different machines, except the case when machines have different architecture.


  • Can you clarify more. Please, take into account the revisiting of my question - I have clarified that I ask for non-inheritable simple or primitive types only. Would you be able to share some links for more information too? Thanks in advance - Ivaylo Slavov
  • That's not true. For strings you will get different value only if you change platform (x86 vs x64) otherwise GetHashValue returns always the same value. - digEmAll
  • @digEmAll: corrected - Tigran
  • @IvayloSlavov: check my edited post. - Tigran
  • Still wrong. Same string, same framework version, same result. It's not a rule that this would have to be the case though. - Jon Hanna

Linked


Related

Latest