2.3.1. From one chemical representation to another: translation and identifier exchange
We can summarize what we’ve learned so far:
- Different names and formulas may be designed to represent different sorts of chemical compounds, or different structural features of these compounds, for different purposes.
- You can figure out what a particular name or formula DOES and DOES NOT tell you about the structure of a chemical entity by asking the kinds of questions that we have discussed above.
- Almost all chemical names and formulas, even ones designed for a very specific purpose, get re-used in other ways. Cheminformatics involves a lot of this sort of re-use.
Often, effective re-use of a particular name or formula involves swapping the identifier that you’ve found for another identifier for the same compound that’s more convenient for your purposes. For example, if you are interested in comparing the structures of a list of compounds for which you have registry numbers, you need to swap those registry numbers for structural formulas, connection tables, or another sort of representation that gives you the structural information you’re looking for.
The final section of this module provides an overview of how you should think about this process of swapping one kind of identifier for another.
There are two ways of exchanging identifiers: lookup and translation. In the case of lookup, you locate the identifier that you have in an existing database that lists various different identifiers for each compound, and you select the other identifier that you want. This is like using a thesaurus.
In the case of translation, you use a set of rules (or a computer uses an algorithm) to take apart one sort of representation of a compound and to create another sort of representation for the same compound.
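The difference between the two approaches can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the two-entry table, the field names, and the helper functions `lookup` and `to_molecular_formula` are all hypothetical, and the translation rule handles only simple condensed formulas without parentheses or charges.

```python
import re
from collections import Counter

# Hypothetical miniature identifier table; a real one would live in a database.
IDENTIFIER_TABLE = {
    "64-17-5": {"name": "ethanol", "condensed_formula": "CH3CH2OH"},
    "115-10-6": {"name": "dimethyl ether", "condensed_formula": "CH3OCH3"},
}


def lookup(registry_number, wanted):
    """Lookup: find the existing record and read off another identifier."""
    return IDENTIFIER_TABLE[registry_number][wanted]


def to_molecular_formula(condensed):
    """Translation: take one representation apart by rule and build another.

    Assumes a condensed formula made only of element symbols and counts
    (no parentheses, charges, or isotopes).
    """
    counts = Counter()
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", condensed):
        counts[symbol] += int(digits) if digits else 1
    # Hill order: C then H when carbon is present, everything else alphabetical.
    first = [s for s in ("C", "H") if s in counts] if "C" in counts else []
    rest = sorted(s for s in counts if s not in first)
    return "".join(
        s + (str(counts[s]) if counts[s] > 1 else "") for s in first + rest
    )


for cas in IDENTIFIER_TABLE:
    condensed = lookup(cas, "condensed_formula")
    print(cas, condensed, to_molecular_formula(condensed))
```

In this sketch, the lookup step depends entirely on what someone has already recorded in the table, while the translation step depends entirely on the rules written into the function. Note that both compounds translate to the same molecular formula, C2H6O: the translation has discarded the connectivity that distinguished them.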
Like words for the same object in different languages, even when two names or formulas are meant to refer to exactly the same compound, they differ in their connotations. They describe the compound’s structure in more or less specific ways, they emphasize different kinds of family relationships, and they draw upon different ways of understanding chemical objects and phenomena.
Scholars of literature like to emphasize that there is no such thing as a literal, perfect translation of a poem or novel from one language into another. Translation always involves decisions about which aspects of meaning to try to preserve and which to allow to become obscured. Literary language is often purposefully ambiguous, whereas chemical nomenclature and notation are extraordinarily precise. Nevertheless, even when it comes to chemical names and formulas, it is still often true that what one kind of chemical name or formula communicates cannot be perfectly expressed in another kind of name or formula.
Of course, this does not mean that translating one kind of name or formula into another is a bad idea. In fact, communication about chemical compounds depends upon chemists and computers constantly engaging in this kind of chemical translation. We will discuss how you can approach this process carefully, anticipate where misunderstandings might arise, and take measures to avoid them.
2.3.2. Validation
Naoki Sakai, a scholar of translation in literature and politics, has written, “Every translation calls for a countertranslation.” Any time you take an idea from one language A and put it into another B, you should think about how someone encountering the idea in the second language might translate it into the first.
This goes for chemical communication as well. When you “translate” a formula or name from one format to another, you should perform a countertranslation: that is, you should take your new name or formula and make sure that you can get back to the one you started with. The same goes for lookup: you should make sure that you can look up the identifier that you started with using the identifier that you generated.
This will help you be aware of what might have gotten lost or inadvertently added in translation. You won’t always be able to completely solve any potential problems that arise. Sometimes, the identifier that you started with and the one that you generate are not equally specific: for example, you can translate a structural formula into a single molecular formula, but you cannot translate that molecular formula back into a structural formula. As we have said, perfect translation is often not possible. But the validation exercises of countertranslation and reverse-lookup will help you be aware of any problems that might arise, so that you can figure out other ways to head them off.
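A countertranslation or reverse-lookup check can be sketched as follows. This is a minimal illustration, not a production validator: the three-entry `FORWARD` table and the helper names `build_reverse` and `round_trip` are hypothetical, and a real workflow would draw its forward and reverse mappings from a database or a conversion tool.

```python
from collections import defaultdict

# Hypothetical forward table: condensed structural formula -> molecular formula.
FORWARD = {
    "CH3CH2OH": "C2H6O",
    "CH3OCH3": "C2H6O",
    "CH3OH": "CH4O",
}


def build_reverse(forward):
    """Index every original identifier under the identifier it maps to."""
    reverse = defaultdict(set)
    for original, translated in forward.items():
        reverse[translated].add(original)
    return reverse


def round_trip(original, forward, reverse):
    """Translate forward, then back; report whether we recover only the original."""
    translated = forward[original]
    candidates = reverse[translated]
    return translated, candidates, candidates == {original}


REVERSE = build_reverse(FORWARD)
for original in FORWARD:
    translated, candidates, lossless = round_trip(original, FORWARD, REVERSE)
    status = "ok" if lossless else "ambiguous"
    print(original, "->", translated, "->", sorted(candidates), status)
```

Here the round trip for CH3OH returns only CH3OH, but the round trip for CH3CH2OH returns two candidates, because the molecular formula is less specific than the structural formula it came from. The check does not repair that loss; it only makes it visible so that you can decide how to handle it.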
Large chemical databases use validation and countertranslation as part of standardizing the data included in their chemical records. For example, they may collect data that include both systematic names and molecular structures, run name-to-structure and structure-to-name conversions on each, and use the results to match previous instances of these compounds in their databases or to identify potential errors.
We’ll cover validation in greater detail later on in this course.
2.3.3. Provenance
Above, we discussed how the history of where a system of notation came from and the purpose for which it was designed affects what kinds of chemical information you can express using that sort of name or formula.
Individual names and formulas have their own individual histories, which we call their provenance. Where did you find the name or formula? Who put it there? Who or what created it, and why? Was it copied in from another system? Has it already been translated from another format? Can you trace it back to an original experiment, calculation, or hypothesis?
You should keep these questions in mind for your own sake. You should especially keep in mind that the person or computer that you’re communicating with won’t necessarily know the answers to these questions, and that this is a potential source of misunderstanding. They may also need to evaluate this information for their own subsequent re-use purposes.
Whenever you exchange one chemical name or formula for another via translation or lookup, you should keep track of:
- The source and the form of the original name or formula
- The tools and resources that you used in doing the lookup or (if applicable) the translation.
You may wish to share this information along with your new name or formula itself, in order to avoid miscommunication. Whether or not you do that, it’s your responsibility to keep track of your translation process in this way. Doing so will help you figure out what could have gone wrong when confusion arises (and perhaps prove that the confusion isn’t your fault!).
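One way to keep track of these two items is to store a small provenance record next to every exchanged identifier. The sketch below is only one possible layout: the class name `ExchangeRecord`, its field names, and the example values for the source and tool are hypothetical and not part of any standard.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ExchangeRecord:
    """Hypothetical provenance record for one identifier exchange."""

    original: str        # the identifier you started with
    original_form: str   # what kind of identifier it is
    source: str          # where you found it
    method: str          # "lookup" or "translation"
    tool: str            # database, program, or rule set used
    result: str          # the identifier you obtained
    result_form: str     # what kind of identifier the result is
    performed_on: str    # ISO date of the exchange

    def __post_init__(self):
        # Guard against typos in the one field with a fixed vocabulary.
        if self.method not in {"lookup", "translation"}:
            raise ValueError(f"unknown method: {self.method!r}")


record = ExchangeRecord(
    original="64-17-5",
    original_form="registry number",
    source="supplier catalogue (hypothetical)",
    method="lookup",
    tool="in-house identifier table (hypothetical)",
    result="CH3CH2OH",
    result_form="condensed structural formula",
    performed_on="2015-09-29",
)
print(asdict(record))
```

Because the record is a plain data structure, it can be exported with the new identifier (for example as a dictionary via `asdict`) whenever you choose to share the provenance, and kept privately when you do not.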
2.3.4. Knowing your audience and user community
Chemical communication takes a wide variety of forms. Different formats of chemical nomenclature and notation are more appropriate for different settings. Sometimes, it’s pretty clear what format is right (or wrong) in a given situation. Systematic names aren’t usually much good in casual conversation; you can’t do a Google search for a sketch of a structural formula; a computer can’t analyze a reaction mechanism using trivial names. However, there are plenty of cases in which it takes some thought to figure out what kind of name or formula is most effective for what you want to communicate. In addition to thinking about the object that you’re communicating about, you should always know with whom or with what you are communicating, and select an appropriate variety of name or formula.
One simple way to think about your audience is in terms of chemists and non-chemists, humans and computers.
Of course, things are a little more complicated than this. Synthetic organic chemists have a certain area of expertise, and materials scientists another. Wikipedia “knows” a fair amount of chemistry because human experts in chemistry have manually added chemical information to many of its pages. Treat this simple table not as a set of boxes into which you must slot all of the different people and programs with which you communicate, but rather as a reminder that you should keep your audience in mind.
It is also useful to consider how translatable your chemical notation will be for a diversity of unknown future cheminformatics applications. Follow common practices, such as those used in the large public chemical databases, and/or carefully document your notation mapping and rules.
2.3.5. Further reading & references
Warr, W. A. Representation of chemical structures. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2011, 1, 557–579; DOI: 10.1002/wcms.36 (accessed May 29, 2014).
Warr, W. A. Some Trends in Chem(o)informatics. Chemoinformatics and computational chemical biology. Methods Mol. Biol. 2011, 672, 1–37; DOI: 10.1007/978-1-60761-839-3_1 (accessed May 29, 2014).
Wild, D. Introducing Cheminformatics: Navigating the world of chemical data. http://i571.wikispaces.com (accessed Sept. 29, 2015).
Willett, P. Chemoinformatics: a history. WIREs Comput. Mol. Sci. 2011, 1, 46–56; DOI: 10.1002/wcms.1 (accessed May 29, 2014).