For a course about chemical information we start at the lowest level - how computers deal with information internally. Some of you may already have some understanding of this, but there are some important topics we need to touch on to set things up for other topics. As a start, imagine you have a Word document that contains the following:
904-620-1938
Humans of course can very easily identify that this is a telephone number, but to a computer it looks like the following in its basic format – binary (more information on this later…)
00111001, 00110000, 00110100, 00101101,
00110110, 00110010, 00110000, 00101101,
00110001, 00111001, 00110011, 00111000
Whether it is stored on a regular hard disk drive (magnetic coating on a disk) or a flash/SSD (memory chips), all data on a computer is stored as 1’s and 0’s (see http://www.pcmag.com/article2/0,2817,2404258,00.asp). As a result, all data on a computer must be represented in binary notation. Each recorded 1 or 0 is a ‘bit’, and eight bits make a ‘byte’. A byte is the basic unit for representing data because eight bits, each being either 1 or 0, can together represent numbers up to 255. This is because each bit has two possible states, and eight bits give 2^8 = 256 combinations - the values 0 through 255.
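As a quick sketch (using Python here purely for illustration), we can reproduce the binary listing above by asking for each character's code value and formatting it as an eight-bit binary number:

```python
# Convert each character of the phone number to its 8-bit binary form.
phone = "904-620-1938"
bits = [format(ord(ch), "08b") for ch in phone]  # ord() gives the character's code value
print(", ".join(bits))
# The first byte printed is 00111001 - the character '9'.

# Eight bits, each 0 or 1, give 2**8 = 256 possible values (0 through 255).
print(2 ** 8)  # 256
```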
In the early days of computer systems, a single byte was used to represent text characters, as defined by the American Standard Code for Information Interchange (ASCII – see http://www.ascii-code.com/). Initially, only the values 0-127 were used: letters, numbers, punctuation marks, and symbols (the printable characters, 32-126), and non-printing control characters (0-31 and 127). Subsequently, extended ASCII was introduced, which added accented characters plus additional punctuation marks and symbols (128-255).
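A small Python sketch makes these ranges easy to check: the built-in ord() and chr() functions map between characters and their code values.

```python
# Map between characters and their ASCII code values.
print(ord("A"))   # 65 - an uppercase letter in the printable range
print(ord("0"))   # 48 - the digit characters run from 48 ('0') to 57 ('9')
print(chr(45))    # '-' - code 45 is the hyphen/minus sign
print(chr(9).isprintable())  # False - code 9 (tab) is a control character
```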
Looking back at the example of the telephone number above, we can now translate the binary into the telephone number:
| Binary   | Decimal | ASCII Character |
|----------|---------|-----------------|
| 00110000 | 48      | ‘0’             |
| 00110001 | 49      | ‘1’             |
| 00110010 | 50      | ‘2’             |
| 00110011 | 51      | ‘3’             |
| 00110100 | 52      | ‘4’             |
| 00110101 | 53      | ‘5’             |
| 00110110 | 54      | ‘6’             |
| 00110111 | 55      | ‘7’             |
| 00111000 | 56      | ‘8’             |
| 00111001 | 57      | ‘9’             |
| 00101101 | 45      | ‘-’             |
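To confirm the translation, here is a short Python sketch that decodes the binary representation of the telephone number back into text. int(b, 2) parses a base-2 string into its decimal value, and chr() turns that value into a character:

```python
# The twelve bytes encoding the telephone number.
binary_bytes = [
    "00111001", "00110000", "00110100", "00101101",
    "00110110", "00110010", "00110000", "00101101",
    "00110001", "00111001", "00110011", "00111000",
]
# int(b, 2) parses a base-2 string; chr() maps the code value to a character.
phone = "".join(chr(int(b, 2)) for b in binary_bytes)
print(phone)  # 904-620-1938
```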
Although we still ‘use’ ASCII today, in reality we use something called UTF-8. The name is easier to say than to unpack: Universal Coded Character Set + Transformation Format – 8-bit. Unicode (see http://unicode.org) started in 1987 as an effort to create a universal character set that would encompass characters from all languages, and it originally defined 16 bits (two bytes), giving 2^16 = 256 x 256 = 65,536 possible characters – or code points. Today, the first 65,536 characters are considered the “Basic Multilingual Plane”, and in addition there are sixteen other planes for representing characters, giving a total of 17 x 65,536 = 1,114,112 code points. Thankfully, we don’t need to worry about most of this, because UTF-8 is backward compatible with the first 128 ASCII characters: text containing only those characters is valid both as ASCII and as UTF-8.
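A hedged Python sketch shows this backward compatibility in action: a plain ASCII string encodes to the same single bytes as ASCII, while a character outside the ASCII range (such as the equilibrium symbol, U+21CC) takes several bytes in UTF-8:

```python
# ASCII characters encode to the same single bytes in UTF-8.
print("904".encode("utf-8"))      # b'904' - one byte per character, same as ASCII
print(list("9".encode("utf-8")))  # [57] - identical to the ASCII code for '9'

# Characters beyond ASCII take more bytes; the equilibrium symbol needs three.
print(ord("\u21cc"))                  # 8652 - its Unicode code point
print(len("\u21cc".encode("utf-8")))  # 3 bytes in UTF-8
```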
It’s worth pointing out at this stage that the development of Unicode is a good thing for science. We speak our own language and use special symbols in many different situations (how about the equilibrium symbol? ⇌ ), and so publishers in science and technology have developed fonts for reporting scientific research. Check out and install the STIX fonts (http://www.stixfonts.org), which would not be possible without Unicode.