This question already has an answer here:
I have asked similar question for the string.GetHashCode()
method in .NET. Taken from then, I have learned that we cannot rely on the implicit implementation of hash code for the buit-in types, if we are to use it across different machines. Therefore, I am assuming that the Java implementation of String.hashCode()
is also unstable across different hardware configurations and may behave differently across VMs (don't forget different VM implementations)
Currently we are discussing a way to safely transform a string into a number in Java, by hashing, but the hash algorithm must be stable across different nodes of a cluster, and be fast to evaluate, since there will be high frequency of usage. My team mates are insisting on the native hashCode
method, and I'll need some reasonable arguments to make them reconsider another approach. Currently, I can think only of the differences between machine configurations (x86 and x64), possibly different vendors of the JVM on some of the machines (hardly applicable in our case) and byte-order differences, depending on the machine the algorithm is being run. Of course, character encoding is probably to be also considered.
While all these things come into my mind, I am not 100% sure in either of them to be strong reason enough, and I'd appreciate your expertize and experience in this area. This will help me build stronger arguments to favor writing a custom hashing algorithm. Also, I'd appreciate advices on what not to do when implementing it.
The implementation of String.hashCode()
is specified in the documentation, so it's guaranteed to be consistent:
The hash code for a String object is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)
All of those operations are implemented platform-independently for Java -- the platform byte order is irrelevant, for example.
That said, ways of getting a String
can be tricky, if you're getting it from a file or another source of bytes. In that case, you're fine so long as you explicitly specify a Charset
. (Remember that String
s don't have different encodings per se; an encoding is a specification for conversions between a byte[]
and a String
.)
You can look at the sourcecode, also shown below. From what I can see (after all of 10 seconds of analysis) this should be stable across machines and architectures. And Louis confirms this by citing a spec, even better if you believe specs. :-)
However, this could vary if a different JRE chooses to implement it differently and violate the spec.
public int hashCode() {
int h = hash;
if (h == 0) {
int off = offset;
char val[] = value;
int len = count;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}
return h;
}
String#hashCode
. - Luiggi MendozaString
class, it looks stable enough. But as it occurs to be the case in .NET there are reasons not to rely on that. Having in mind the virtual machines, the hardware differences and byte-order or encoding varieties, the Java case seems very similar to me. - Ivaylo Slavov