Antti-Juhani Kaijanaho: On identity
This essay was first posted to the mailing list of a course I am attending (TIE363 - Information Management and Sharing).
Identity is the attribute of a thing that determines sameness. By definition, the sameness of a thing is synonymous with the sameness of its identity. However, in most cases, identity is implicit, not an explicit thing: for example, what is the thing that makes 2 and 1 + 1 the same thing? I can state the conditions that hold whenever two number-theoretic terms are the same, but that does not get me any closer toward the identity of a number.
In logic, there is a litmus test for identity: two things cannot be the same if they are not interchangeable. In other words, if two things are the same, it must be possible to replace the one with the other in any context without anything changing. The test is named Leibniz's rule.
Leibniz's rule wreaks havoc on the concept of a cryptographic hash as an identity. This email here is a document. It is a well-known fact that there is another document with the same SHA-1 hash as this email's. Current technology is incapable of finding us that document, but we can still imagine that we have it; that thought experiment is as valid as any actual experiment, since we are talking about concepts here. It is likely (though I have no way of knowing it) that the other document is pure gibberish to all English (or Finnish) speaking entities; hence substituting this mail with the other document on your screens right now would be very far from "not changing anything". With luck, I can expect all of you to understand this mail; I have no hope of you understanding the supposedly same document that I - hypothetically - just switched on your screens.
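The thought experiment can be made concrete with a deliberately weakened hash. The following is a minimal sketch, not a statement about full SHA-1: `toy_hash` and `find_collision` are names made up for this illustration, and the hash is cut down to 16 bits precisely so that a collision can be found by brute force.

```python
import hashlib
from itertools import count


def toy_hash(document: bytes) -> bytes:
    """First two bytes (16 bits) of SHA-1; deliberately weak, for illustration only."""
    return hashlib.sha1(document).digest()[:2]


def find_collision():
    """Try the documents b"0", b"1", b"2", ... until two of them share a toy hash."""
    seen = {}  # toy hash -> first document seen with that hash
    for n in count():
        document = str(n).encode("ascii")
        digest = toy_hash(document)
        if digest in seen:
            return seen[digest], document
        seen[digest] = document


first, second = find_collision()
print(first, second, toy_hash(first).hex())
```

The two documents returned have the same toy hash but are plainly not interchangeable, which is exactly why the hash fails Leibniz's rule as an identity. With only 65536 possible toy hashes, the search must stop after at most 65537 documents.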
Now, for practical purposes identity is a nonissue. To work with a thing, there is no need for us to know the inherent quality of that thing that is its identity. What we need in practice is a way to determine (with reasonable accuracy) whether two arbitrary things from the sphere of things we are interested in are the same. Thus we need a test for sameness; we don't need their identities.
Now, how we do this depends a lot on the properties of the particular sphere of interest. Let me borrow a concept and a distinction from programming language theory:
The concept: STATE
The state of a thing is, let us say, the Cartesian combination of the states of all the attributes of the thing that can be directly observed. My state includes for example my height, my weight, and my name (bending the "directly" bit a bit). To take a more computerly example, the state of a web page is the HTML source combined with the state of the objects embedded in it (inline images, <link>ed stylesheets and other such things). The state of an integer variable in a program is the integer that it holds. The state of a complex number is the combination of its real and imaginary parts, both of which are real numbers.
The distinction: VALUE / OBJECT
A value is a thing whose identity is a function of its state. That is, its identity is determined completely by its state, and if the state changes, the identity changes (in other words, the state of a value cannot change). I am not a value: I am still the same person even though my weight changes constantly. A web page is not a value: my blog's main page is still the same page even after I publish a new entry on it. An integer variable is not a value, since it is still the same variable even after assignment is performed (which is quite often!). However, an integer is a value: there is no distinction between the state of an integer and its identity. A more revealing example is the complex number: its identity is completely determined by the values of its real and imaginary parts.
An object, however, is a thing whose identity is independent of its state. A thing that is not a value is an object. An integer variable is an object. I am an object. A web page is an object. Integers and complex numbers are not objects. An object's state can and often does change.
For values, determining sameness is easy: just observe the state of the two values. However, for objects, direct observation of the state is not a path to determining sameness. So, we need something else.
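Here is a small sketch of the distinction in Python. The classes `Complex` and `IntVariable` are hypothetical names chosen for this illustration; the point is only that `==` on the frozen dataclass compares state, while `is` asks whether two names refer to the same object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Complex:
    """A value: sameness is decided entirely by comparing state."""
    re: float
    im: float


class IntVariable:
    """An object: it stays the same variable whatever integer it holds."""

    def __init__(self, content: int) -> None:
        self.content = content


a = Complex(1.0, 2.0)
b = Complex(1.0, 2.0)
print(a == b)      # True: equal state, hence the same value

x = IntVariable(5)
y = IntVariable(5)
alias = x
print(x is y)      # False: equal state, yet two different variables
x.content = 6      # assignment changes the state of x...
print(alias is x)  # True: ...but it is still the same variable
```

Observing the state answers the sameness question for `a` and `b`, but says nothing useful about `x` and `y`.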
The crucial concept for practical object sameness is an identifier. An identifier is a token for identity - something whose sameness is easy to determine and which is somehow linked to the identity it is standing in for. It is important to realize that an identifier is not the identity! Even the best identifiers are just stand-ins for the real thing.
In programming systems, objects are usually identified by their addresses. The reason why this identifier is not an identity is that even if there are no programming errors, two different objects can still share the same identifier (address), and a single object can have multiple identifiers (addresses). The reason why this works is that the only situation when the first can happen (assuming the absence of dangling pointers) is when the first object has died before the other object is created - and since there are no longer any identifiers of the first object, the second object can reuse the identifier without confusion. The other scenario is that of a moving object: high performance garbage collectors often like to move objects around to reduce fragmentation and to allow super-fast object allocation, but as with the reused-identifier case, this creates no confusion, as all identifiers are changed transparently to refer to the new location of the object.
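The reused-address case can be sketched with a toy model. `ToyHeap` below is a hypothetical class invented for this illustration, in which an address is just a slot number; it is not how any real allocator or garbage collector is implemented.

```python
class ToyHeap:
    """Hypothetical heap: addresses are slot numbers, and freed slots are reused."""

    def __init__(self) -> None:
        self.slots = {}        # address -> state of the live object stored there
        self.free = []         # addresses whose objects have died
        self.next_address = 0  # next never-used address

    def allocate(self, state):
        """Store a new object and return its address (its identifier)."""
        if self.free:
            address = self.free.pop()
        else:
            address = self.next_address
            self.next_address += 1
        self.slots[address] = state
        return address

    def release(self, address) -> None:
        """The object at this address dies; its address becomes reusable."""
        del self.slots[address]
        self.free.append(address)


heap = ToyHeap()
first = heap.allocate("first object")
heap.release(first)                    # the first object dies
second = heap.allocate("second object")
print(first == second)                 # True: same address, different objects
```

The two objects were never alive at the same time, so the shared address causes no confusion as long as nobody keeps using `first` after the release (that would be the dangling pointer).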
It is even possible, and often done, to represent objects as values. This can be done by giving each object an arbitrary identifier, usually an integer, making sure that two different objects get different identifiers. Then one can represent these objects as values where the state of an object is augmented with the corresponding identifier. Now, even though these representations are values, we can always locate the value representation corresponding to a particular object by looking for the value that contains the correct identifier. This trick is a standard part of relational database design: if you want to represent a human in a relational database, you store his or her attributes (such as name, address etc.) in a table and you additionally assign him or her an arbitrary identifier, the "primary key". (Now, later phases of the database design, such as normalization, will probably break this table apart, but the idea is still there.)
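As a sketch of the primary key trick, here is an in-memory SQLite database driven from Python. The table layout and the people in it are made up for the example; declaring the `id` column as `INTEGER PRIMARY KEY` lets SQLite assign the arbitrary identifier for us.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)

# Two different people who happen to have exactly the same observable state.
insert = "INSERT INTO person (name, address) VALUES (?, ?)"
first_id = connection.execute(insert, ("Matti Virtanen", "Tampere")).lastrowid
second_id = connection.execute(insert, ("Matti Virtanen", "Tampere")).lastrowid

# The first person moves: the state changes, the identifier does not.
connection.execute(
    "UPDATE person SET address = ? WHERE id = ?", ("Helsinki", first_id)
)

select = "SELECT name, address FROM person WHERE id = ?"
first_row = connection.execute(select, (first_id,)).fetchone()
second_row = connection.execute(select, (second_id,)).fetchone()
print(first_id != second_id)  # True
print(first_row, second_row)
```

Each row is a value, but the identifier stored in it lets us find the row that represents a particular person, before and after the move.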
Now, where do we get these "arbitrary" identifiers? There is really only one way: designate an assignment authority and task it with making sure that the same identifier is only given to one applicant. The authority may delegate its authority; this is what happens in Internet domain names. It may even designate some scheme for distributed identifier assignment (urn-5, a brainchild of the Fenfire project, is an example of this, based on the practical certainty of always getting a different random number if the random number source is good enough and enough random bits are extracted from the source).
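The two extremes can be sketched as follows. `CentralAuthority` and `random_identifier` are hypothetical names for this illustration, and the random scheme only shows the general idea of drawing enough random bits; it makes no attempt to reproduce the actual urn-5 format.

```python
import itertools
import secrets


class CentralAuthority:
    """A single assignment authority: it hands out each identifier exactly once."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)

    def assign(self) -> int:
        return next(self._counter)


def random_identifier(bits: int = 128) -> str:
    """Distributed assignment: every party draws its own random bits.

    Uniqueness is not guaranteed, only practically certain when the random
    source is good and enough bits are drawn.
    """
    return secrets.token_hex(bits // 8)


secretary = CentralAuthority()
member_a = secretary.assign()
member_b = secretary.assign()
print(member_a, member_b)  # 1 2

document_a = random_identifier()
document_b = random_identifier()
print(document_a, document_b)
```

The central authority needs every applicant to talk to it, but its guarantee is absolute; the random scheme needs no conversation at all and settles for practical certainty.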
The question of which identifier assignment authority to choose is a matter of what is practical. For example, the membership roll of an association does not need a distributed assignment authority, since it is easy enough to designate the secretary in charge of the membership as the assignment authority (with suitable guidance from the executive committee, of course:). However, for a document management system where document sameness is a global property but where documents are created by parties who are unable or unwilling to converse with a central authority, a distributed scheme such as urn-5 is perfectly appropriate.
PS. I think it is a fundamental misunderstanding to talk of hashes or addresses or anything else concrete as identities; they are identifiers.
2005-03-26T15:57+0200 - /en/stuff