How useful is the feature of having an atom data type in a programming language?
A few programming languages have the concept of atom or symbol to represent a consta
As a C programmer, I had trouble understanding what Ruby symbols really are. It became clear once I saw how symbols are implemented in the source code.
Inside the Ruby interpreter's source there is a global hash table that maps strings to integers. All Ruby symbols are kept there. During the parse stage, the interpreter uses that hash table to convert every symbol to an integer, and from then on symbols are treated internally as integers. This means that a symbol occupies only as much memory as an integer (4 bytes here) and that all comparisons are very fast.
So basically you can treat Ruby symbols as strings which are implemented in a very clever way. They look like strings but perform almost like integers.
When a new string is created in Ruby, a new C structure is allocated to hold that object. Two Ruby strings are therefore two pointers to two different memory locations (which may contain the same characters). A symbol, however, is immediately converted to a C int. As a result, there is no way to distinguish two occurrences of the same symbol as two different Ruby objects. This is a side effect of the implementation; just keep it in mind when coding.
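The table described above can be modelled in a few lines of Ruby. This is only a toy sketch of the idea, not the interpreter's actual C code; ToySymbolTable and its methods are hypothetical names made up for the illustration.

```ruby
# Toy model of an intern table. ToySymbolTable is a hypothetical name,
# not part of Ruby or of its C source.
class ToySymbolTable
  def initialize
    @ids = {}    # name (String) => integer ID
    @names = []  # integer ID => name, so an ID can be printed again
  end

  # Return the ID for a name, assigning the next free ID on first sight.
  def intern(name)
    return @ids[name] if @ids.key?(name)

    @names << name.dup.freeze
    @ids[name] = @names.size - 1
  end

  # Recover the name behind an ID.
  def name_of(id)
    @names.fetch(id)
  end
end

table = ToySymbolTable.new
color = table.intern("color")
age   = table.intern("age")
again = table.intern("color")  # same name, so the same ID comes back

puts color == again            # comparing two "symbols" is an integer compare
puts color == age
puts table.name_of(color)
```

After interning, the name is only needed for printing; every comparison works on the integer alone, which is where the speed comes from.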
In Ruby, symbols are often used as keys in hashes, so often that Ruby 1.9 even introduced a shorthand for constructing a hash. What you previously wrote as:
{:color => :blue, :age => 32}
can now be written as:
{color: :blue, age: 32}
Essentially, they are something between strings and integers: in source code they resemble strings, but with considerable differences. Two identical strings are in fact different instances, while identical symbols are always the same instance:
> 'foo'.object_id
# => 82447904
> 'foo'.object_id
# => 82432826
> :foo.object_id
# => 276648
> :foo.object_id
# => 276648
This has consequences for both performance and memory consumption. Symbols are also immutable: they are not meant to be altered once created.
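A quick way to check both properties yourself is the short script below. It uses String.new so that each string is guaranteed to be a fresh object regardless of any frozen-string-literal settings.

```ruby
name  = String.new("foo")       # a fresh, mutable String object
other = String.new("foo")       # equal contents, but a different object

puts name == other              # true: the contents are equal
puts name.equal?(other)         # false: two separate objects in memory
puts name.to_sym.equal?(:foo)   # true: every "foo" interns to the one :foo
puts :foo.frozen?               # true: a symbol cannot be modified
puts :foo.respond_to?(:<<)      # false: no in-place append, unlike String
```

Here equal? tests object identity, while == tests contents, which is why the two strings pass one check and fail the other.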
An arguable rule of thumb would be to use symbols instead of strings for every string not meant for output.
Although it may seem irrelevant, most code-highlighting editors colour symbols differently from the rest of the code, which makes the distinction visible at a glance.
Atoms (in Erlang or Prolog, etc.) or symbols (in Lisp or Ruby, etc.), from here on simply called atoms, are very useful when you have a semantic value that has no natural underlying "native" representation. They take the place of C-style enums like this:
enum days { MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY }
The difference is that atoms don't typically have to be declared and they have NO underlying representation to worry about. The atom monday in Erlang or Prolog has the value of "the atom monday" and nothing more or less.
While it is true that you could get much of the same use out of string types as you would out of atoms, there are some advantages to the latter. First, because atoms are guaranteed to be unique (behind the scenes their string representations are converted into some form of easily-tested ID), it is far quicker to compare them than it is to compare equivalent strings. Second, they are indivisible. The atom monday cannot be tested to see if it ends in day, for example. It is a pure, indivisible semantic unit. In other words, you have less conceptual overloading than you would with a string representation.
You could also get much of the same benefit with C-style enumerations. The comparison speed in particular is, if anything, faster. But... it's an integer. And you can do weird things like have SATURDAY and SUNDAY translate to the same value:
enum days { SATURDAY, SUNDAY = 0, MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY }
This means you can't trust different "symbols" (enumerations) to be different things, which makes reasoning about code a lot more difficult. Sending enumerated types over a wire protocol is also problematic, because there's no way to distinguish them from regular integers. Atoms do not have this problem. An atom is not an integer and will never look like one behind the scenes.
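The aliasing problem is easy to reproduce outside C. The sketch below emulates the enum above with plain Ruby integer constants (the constant names are just carried over from the C example) and contrasts them with symbols. One caveat: Ruby symbols do expose their name via to_s, so the "indivisible" property is weaker in Ruby than in Erlang or Prolog.

```ruby
# Integer constants standing in for the C enum above, in which two
# names end up sharing one value.
SATURDAY = 0
SUNDAY   = 0
MONDAY   = 1

puts SATURDAY == SUNDAY        # true: the two names are indistinguishable
puts :saturday == :sunday      # false: different names stay different values
puts :saturday == 0            # false: a symbol never compares equal to an integer
puts :saturday.is_a?(Integer)  # false
puts :saturday.inspect         # prints :saturday, so it cannot be confused with 0
```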
In Lisp, symbol and atom are two different and unrelated concepts.
Usually in Lisp an ATOM is not a specific data type. It is shorthand for NOT CONS.
(defun atom (item)
  (not (consp item)))
Also the type ATOM is the same as the type (NOT CONS).
Anything that is not a cons cell is an atom in Common Lisp.
A SYMBOL is a specific datatype.
A symbol is an object with a name and identity. A symbol can be interned in a package. A symbol can have a value, a function and a property list.
CL-USER 49 > (describe 'FOO)
FOO is a SYMBOL
NAME "FOO"
VALUE #<unbound value>
FUNCTION #<unbound function>
PLIST NIL
PACKAGE #<The COMMON-LISP-USER package, 91/256 internal, 0/4 external>
In Lisp source code, the identifiers for variables, functions, classes and so on are written as symbols. When a Lisp s-expression is read by the reader, the reader creates new symbols if they are not yet known (not available in the current package) or reuses an existing symbol (if it is available in the current package). If the Lisp reader reads a list like
(snow snow)
then it creates a list of two cons cells. The CAR of each cons cell points to the same symbol snow. There is only one symbol for it in the Lisp memory.
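For readers who don't have a Lisp at hand, a rough Ruby analogue of this reader behaviour is sketched below. read_symbols is a hypothetical helper invented for the illustration; it is far simpler than a real Lisp reader and knows nothing about packages or nesting.

```ruby
# read_symbols is a hypothetical stand-in for a reader: it splits the text
# into tokens and interns each token as a symbol.
def read_symbols(text)
  text.split.map(&:to_sym)
end

first, second = read_symbols("snow snow")
puts first.equal?(second)   # true: both slots refer to the single :snow object
```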
Also note that the plist (the property list) of a symbol can store additional meta information for a symbol. This could be the author, a source location, etc. Users can also use this feature in their own programs.
Atoms provide fast equality testing, since they use identity. Compared to enumerated types or integers, they have better semantics (why would you represent an abstract symbolic value by a number anyway?) and they are not restricted to a fixed set of values like enums.
The trade-off is that they are more expensive to create than literal strings, since the system needs to know all existing instances to maintain uniqueness; this costs time mostly for the compiler, but it costs memory in O(number of unique atoms).
A short example that shows how the ability to manipulate symbols leads to cleaner code: (Code is in Scheme, a dialect of Lisp).
(define men '(socrates plato aristotle))
(define (man? x)
  ;; memq compares with eq?, which is an identity test for symbols
  (if (memq x men) #t #f))
(define (mortal? x)
  (man? x))
;; test
> (mortal? 'socrates)
=> #t
You can write this program using character strings or integer constants. But the symbolic version has certain advantages. A symbol is guaranteed to be unique in the system. This makes comparing two symbols as fast as comparing two pointers. This is obviously faster than comparing two strings. Using integer constants allows people to write meaningless code like:
(define SOCRATES 1)
;; ...
(mortal? SOCRATES)
(mortal? -1) ;; ??
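For comparison, here is roughly the same program in Ruby, using symbols in place of Scheme's quoted names. It is a loose translation, not part of the original example; MEN, man? and mortal? simply mirror the Scheme definitions.

```ruby
# The set of known men, written as symbols rather than strings or integers.
MEN = %i[socrates plato aristotle].freeze

def man?(x)
  MEN.include?(x)
end

def mortal?(x)
  man?(x)
end

puts mortal?(:socrates)  # true
puts mortal?(:zeus)      # false: an unknown name is just another symbol
puts mortal?(-1)         # false: an integer is never mistaken for a man
```

With symbols there is no number behind the name, so a call like mortal?(-1) cannot accidentally hit a valid entry.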
A more detailed answer to this question can probably be found in the book Common Lisp: A Gentle Introduction to Symbolic Computation.