Module 7: Representing and Managing Digital Spectra

Stuart Chalk, University of North Florida

By the end of this module students will:

  • Understand the formats for representing spectral data
    • JCAMP-DX, AnIML, ANDI, NetCDF, CSV, Tab delimited (XY format)
  • Know where to obtain reliable spectral information
    • AIST Spectral Database for Organic Compounds (SDBS)
    • NIST Chemistry WebBook
    • ChemSpider
  • Simulated spectra
  • Spectral software

 


 

Since the early 1970s, microcomputers (as they were called at the time) have been a huge part of the development of scientific instrumentation.  As computer control of instrumentation became more prevalent, there was also a need to interface the detectors of instruments to the computer so that data (analog or digital) could be captured as it was generated, rather than being output to oscilloscope screens or chart recorders (see https://en.wikipedia.org/wiki/Chart_recorder).

In the early years of the digital capture of spectral data the main limitation was storage capacity.  As a result there was a practical limit on the time resolution (points per minute) and signal resolution (the number of bits used to digitize an analog signal; see https://en.wikipedia.org/wiki/Analog-to-digital_converter).  It wasn’t until the early 1980s and the advent of the 5 1/4" floppy disk, which initially stored an amazing ~100 kB (0.1 MB) of data, that scientists were easily able to collect and save digital spectra.

Today, instruments generate a vast amount of data and file sizes can be up to several GB each for certain techniques (e.g. GC-MS).  This module describes some of the common file formats for spectral data, websites where you can obtain reliable spectral data, and software for viewing/simulating spectra.
 


7.1: Spectral File Formats

As there are many instrument vendors, there are many different file formats for spectral data.  This approach was supported by the vendors who developed the instruments and the software to operate them; thus, in order to go back and view spectra you need to use their proprietary software, which of course gets updated on a regular basis and generates revenue.  However, the majority of software does have an export format for the data (and maybe some of the spectral metadata) that can be directly imported into Excel or other applications, because users have needed this for many years.  The export is typically a Tab-Delimited Text (.txt) file or a Comma-Separated Values (.csv) file.
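To give a feel for how simple these export formats are, the sketch below reads X,Y pairs from either tab-delimited or comma-separated text using only the Python standard library. The helper name `read_xy` and the sample values are invented for illustration. Real exports often carry vendor-specific header lines, which this sketch skips whenever the first two fields are not numeric.

```python
import csv
import io


def read_xy(text):
    """Parse X,Y pairs from tab- or comma-delimited text (hypothetical helper)."""
    # Guess the delimiter from the content: any tab means tab-delimited.
    delimiter = "\t" if "\t" in text else ","
    points = []
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        if len(row) < 2:
            continue  # blank or single-column line
        try:
            points.append((float(row[0]), float(row[1])))
        except ValueError:
            continue  # header or comment line, not numeric data
    return points


# Made-up sample data in both export styles.
csv_text = "wavelength,absorbance\n400,0.12\n401,0.15\n"
tab_text = "400\t0.12\n401\t0.15\n"
print(read_xy(csv_text))
print(read_xy(tab_text))
```

Both calls return the same list of points, which illustrates why these formats are convenient for sharing numbers but carry almost none of the metadata discussed next.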

Over time many scientists have lamented (even complained) that it would be much better if all of the data collected on instruments were stored in a common format that would make it easy to view and share data.  While .txt and .csv files can be used for both activities, there has always been a need for a file format that supports both the spectral data and the associated metadata that describes the instrument type and settings, samples, analyte(s) and any other contextual information that gives the data value.  Historically, this has been achieved using the following specifications (those listed below are the major specifications with formal standards; there are others that are less formally defined).

ANalytical Data Interchange (ANDI)

This is for mass spectrometry and chromatography data and is described by the ASTM Standards E1947 – Mass Spectrometry (http://www.astm.org/Standards/E1947.htm) and E1948 – Chromatography (http://www.astm.org/Standards/E1948.htm). ANDI uses the Network Common Data Form (NetCDF), a self-describing data format that is generically defined to store array-oriented scientific data (http://www.unidata.ucar.edu/software/netcdf/).

 

Crystallographic Information Framework (CIF)

Used for crystallographic data only. This format is less traditional in its data organization than other formats because it not only contains x-ray crystallography patterns but a large, and very well organized, array of metadata about the system under study.  This text format is a fundamental standard in the discipline of crystallography and authors of new crystallographic work must provide data in this format to get their work published. (http://www.iucr.org/resources/cif)

 

The Joint Committee on Atomic and Molecular Physical Data – Data Exchange format (JCAMP-DX)

For UV/Visible Spectrophotometry, Infrared, Mass Spectrometry, Nuclear Magnetic Resonance, and Electron Spin Resonance data. JCAMP-DX is currently administered by the International Union of Pure and Applied Chemistry (IUPAC) at http://www.jcamp-dx.org, although the format has not been actively updated in almost 10 years.  Although JCAMP-DX has not formally been standardized, it is currently the de facto standard for sharing spectral data and all the major databases store their data in the format.

JCAMP-DX is a file specification consisting of a number of LABELLED-DATA-RECORDs or LDRs.  These are defined to allow reporting of spectral metadata and raw/processed instrument data. The table below outlines some of the main LDRs in JCAMP-DX.

An example JCAMP-DX file is shown below.

 

##TITLE= Cholesterol (pktab1.jdx)

##JCAMP-DX= 5   $$home made

##DATA TYPE= MASS SPECTRUM

##DATA CLASS= PEAK TABLE

##ORIGIN= Dept of Chem, UWI, Mona, JAMAICA

##OWNER= public domain

##$URL= http://wwwchem.uwimona.edu.jm:1104/spectra/testdata/index.html

##SPECTROMETER/DATA SYSTEM= Finnigan

##INSTRUMENTAL PARAMETERS= LOW RESOLUTION

##.SPECTROMETER TYPE= TRAP

##.INLET= GC

##.IONIZATION MODE= EI+

##XUNITS= m/z

##YUNITS= relative abundance

##XFACTOR= 1

##YFACTOR= 1

##FIRSTX= 0

##LASTX= 386

##NPOINTS= 46

##FIRSTY= 0

##PEAK TABLE= (XY..XY)

0,0

41,520 43,1000 55,630 67,417 69,404 79,544 81,906 91,685 95,772 105,801

107,685 119,439 121,515 133,468 135,461 145,760 147,571 159,529 161,386

173,249 185,150 187,149 199,216 213,454 215,130 228,122 229,187 231,150

247,568 255,378 260,165 261,106 275,306 287,24 297,22 301,207 314,67

325,65 328,20 339,19 353,354 368,791 369,262 371,140 386,324

##END=
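Because every LDR starts with `##` and has the form `LABEL= value`, a file like the one above can be read with very little code. The following is a minimal sketch, not a full JCAMP-DX reader: it assumes a single-block file with a `(XY..XY)` peak table, treats `$$` as the start of a comment, and uses a shortened copy of the cholesterol file as input. The function name `parse_jcamp_peak_table` is hypothetical.

```python
import re


def parse_jcamp_peak_table(text):
    """Split a simple JCAMP-DX peak-table file into LDRs and (x, y) peaks."""
    ldrs = {}
    peaks = []
    in_table = False
    for raw in text.splitlines():
        line = raw.split("$$")[0].strip()  # drop "$$" comments
        if not line:
            continue
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            label = label.strip().upper()
            if label == "END":
                break
            ldrs[label] = value.strip()
            in_table = label == "PEAK TABLE"
        elif in_table:
            # Pairs look like "41,520" and are separated by whitespace.
            for x, y in re.findall(r"([-+\d.Ee]+)\s*,\s*([-+\d.Ee]+)", line):
                peaks.append((float(x), float(y)))
    return ldrs, peaks


# Shortened copy of the example file above (only a few of the 46 peaks).
sample = """##TITLE= Cholesterol (pktab1.jdx)
##JCAMP-DX= 5   $$home made
##DATA TYPE= MASS SPECTRUM
##DATA CLASS= PEAK TABLE
##XUNITS= m/z
##YUNITS= relative abundance
##PEAK TABLE= (XY..XY)
0,0
41,520 43,1000 55,630
386,324
##END=
"""

ldrs, peaks = parse_jcamp_peak_table(sample)
base_peak = max(peaks, key=lambda p: p[1])  # most abundant ion
print(ldrs["DATA TYPE"], base_peak)
```

Keeping the LDRs in a dictionary means the metadata (units, instrument, ionization mode) travels with the numeric data, which is the main advantage of JCAMP-DX over a bare X,Y export.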

 

Note the LDRs for XFACTOR and YFACTOR.  Although in the above example these LDRs are both 1 (because the raw data is already integers), these factors are commonly used to represent numeric values with a large number of decimal places as integers so that only one or two numbers (the factors) have to be stored as decimal values.  This means that storage of rounded numbers is minimized and any error incurred because of rounding is applied evenly to all the data points. An example of data in this format is below.

 

##YFACTOR=  9.5367E-7

##XYDATA= (X++(Y..Y))

 4400   68068800 68092800 68145600 68100800 68140800 68232000

 4394   68304000 68316800 68195200 68152000 68182400 68176000

 4388   68240000 68252800 68156800 68156800 68236800 68292800

 4382   68302400 68265600 68233600 68214400 68224000 68284800

 4376   68353600 68334400 68219200 68230400 68315200 68276800

 4370   68259200 68264000 68257600 68316800 68292800 68339200
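Recovering the real Y values from data like this is a single multiplication per point. The sketch below applies the YFACTOR from the excerpt to its first two data lines. It assumes the uncompressed layout shown above, where the first number on each line is the X value and the remaining numbers are the integer Y values. The function name is made up for illustration.

```python
Y_FACTOR = 9.5367e-7  # from the ##YFACTOR= record in the excerpt

# First two data lines of the excerpt: X value, then the stored integer Ys.
lines = [
    "4400   68068800 68092800 68145600 68100800 68140800 68232000",
    "4394   68304000 68316800 68195200 68152000 68182400 68176000",
]


def scale_xydata_line(line, y_factor):
    """Return (first X, list of scaled Y values) for one X++(Y..Y) line."""
    first, *rest = line.split()
    return float(first), [int(v) * y_factor for v in rest]


for line in lines:
    x0, ys = scale_xydata_line(line, Y_FACTOR)
    print(x0, [round(y, 4) for y in ys])
```

Only the factor is stored as a decimal number; every data point stays an integer, which is the storage saving described above.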

 

The data in a JCAMP-DX file can all be compressed using a number of different human-readable compression formats.  This capability was added so that spectral files were not too large for the storage media available (see above).  There are four compression formats, collectively called ASCII Squeezed Difference Format (ASDF), outlined in one of the original articles on the JCAMP-DX format (http://old.iupac.org/jcamp/protocols/dxir01.pdf). The pseudo-digits they use, which replace digits with letters, are listed below.

 

Pseudo-digits for ASDF Formats

1. ASCII digits          0  1  2  3  4  5  6  7  8  9
2. Positive SQZ digits   @  A  B  C  D  E  F  G  H  I
3. Negative SQZ digits      a  b  c  d  e  f  g  h  i
4. Positive DIF digits   %  J  K  L  M  N  O  P  Q  R
5. Negative DIF digits      j  k  l  m  n  o  p  q  r
6. Positive DUP digits      S  T  U  V  W  X  Y  Z  s

Note: The above characters replace the leading digit, sign, and preceding space for SQZ, DIF, and DUP forms.
The remaining digits of a multi-digit number are standard ASCII.

 

Simple Examples of ASDF Formats

FIX Form: (22 chars)

1 2 3 3 2 1 0 -1 -2 -3

PAC Form: (19 chars)

1+2+3+3+2+1+0-1-2-3

              or:

1 2 3 3 2 1 0-1-2-3

SQZ Form: (10 chars)

1BCCBA@abc

DIF Form: (10 chars)

1JJ%jjjjjj

DIFDUP Form: (7 chars)

1JT%jX

 

As can be seen, the best compression is obtained by using the DIFDUP format, which results in a 22/7 or 3.14x compression of the data.  However, this amount of compression will not always be achieved, because it depends on how variable the data points are in the file, so it is important to choose the right format to get the best compression.
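The pseudo-digit scheme is easier to follow once you see it decoded. The sketch below is a simplified decoder, written for this module rather than taken from any JCAMP-DX library, that turns the example strings above back into the same ten values. It assumes a single run of Y values; real XYDATA lines also begin with an X value and may end with a repeated check value, which this sketch does not handle.

```python
SQZ_POS = "@ABCDEFGHI"  # leading digit 0..9 of a positive value
SQZ_NEG = "abcdefghi"   # leading digit 1..9 of a negative value
DIF_POS = "%JKLMNOPQR"  # leading digit 0..9 of a positive difference
DIF_NEG = "jklmnopqr"   # leading digit 1..9 of a negative difference
DUP = "STUVWXYZs"       # leading digit 1..9 of a repeat count


def _leading(ch):
    """Map a leading character to (kind, sign, first digit as text)."""
    if ch.isdigit():
        return "VAL", 1, ch
    if ch in "+-":
        return "VAL", -1 if ch == "-" else 1, ""
    if ch in SQZ_POS:
        return "VAL", 1, str(SQZ_POS.index(ch))
    if ch in SQZ_NEG:
        return "VAL", -1, str(SQZ_NEG.index(ch) + 1)
    if ch in DIF_POS:
        return "DIF", 1, str(DIF_POS.index(ch))
    if ch in DIF_NEG:
        return "DIF", -1, str(DIF_NEG.index(ch) + 1)
    if ch in DUP:
        return "DUP", 1, str(DUP.index(ch) + 1)
    raise ValueError(f"unexpected character {ch!r}")


def decode_asdf(line):
    """Decode one FIX/PAC/SQZ/DIF/DIFDUP string into a list of integers."""
    values = []
    last_kind = None
    last_diff = 0
    i = 0
    while i < len(line):
        if line[i].isspace():
            i += 1
            continue
        kind, sign, lead = _leading(line[i])
        i += 1
        start = i
        while i < len(line) and line[i].isdigit():
            i += 1  # remaining digits of a multi-digit number are plain ASCII
        number = sign * int(lead + line[start:i])
        if kind == "DUP":
            # A count of n means the previous token occurs n times in total.
            for _ in range(number - 1):
                step = last_diff if last_kind == "DIF" else 0
                values.append(values[-1] + step)
        elif kind == "DIF":
            values.append(values[-1] + number)
            last_kind, last_diff = "DIF", number
        else:
            values.append(number)
            last_kind = "VAL"
    return values


for form in ["1 2 3 3 2 1 0 -1 -2 -3", "1+2+3+3+2+1+0-1-2-3",
             "1BCCBA@abc", "1JJ%jjjjjj", "1JT%jX"]:
    print(form, "->", decode_asdf(form))
```

Note how a DUP character after a DIF character repeats the difference, not the value, which is why slowly varying spectra compress so well in DIFDUP form.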

In addition to these existing formats there are a couple of newer formats that will eventually replace the JCAMP-DX specification.

 

Analytical Information Markup Language (AnIML)

For all spectral data.  AnIML is an eXtensible Markup Language (XML) format for storing instrument data and metadata that has been under development since 2004 (http://animl.sourceforge.net).  The specification is being coordinated under the American Society for Testing and Materials (ASTM) E13.15 committee on analytical data.

It was recognized in 2004 that there needed to be a successor to JCAMP-DX because of i) advances in technology, ii) a recognized need to represent data from all analytical techniques, and iii) issues with variants of JCAMP-DX that made interoperability of the files difficult.  AnIML files consist of up to four data sections: SampleSet, ExperimentStepSet, AuditTrailEntrySet, and SignatureSet.  By design very little data/metadata is required so that legacy data, which may not have much or any metadata to describe it, can be stored in the AnIML format. An example ‘minimum’ AnIML file is shown below:

 

Allotrope Document Format

For all analytical data.  The Allotrope Foundation was formed in 2012 by seven (now twelve) pharma companies around the notion of changing instrument data standards so that they are uniform across all instrument vendors.  Three years later, Allotrope has just made available the first version of the Allotrope Document Format (ADF), which is based on the HDF5 format for managing and storing data (https://www.hdfgroup.org/HDF5/).  The ADF format stores data and metadata from the entire laboratory process (not just the instrument data) and is arranged in layers for metadata (using controlled vocabularies and ontologies), data, and linked files.  The specification is so new that it has not yet been released to the community, but should be out by the end of the year.  See http://www.allotrope.org.

 

Other spectral file formats

Lists of file formats for the following analytical techniques can be found at


 


7.2: Sources of Spectral Data

AIST Spectral Database for Organic Compounds (SDBS) - http://sdbs.db.aist.go.jp

This database from Japan has a wealth of spectral information and is the best for searching, as there are many options to find what you need. Sadly, the majority of the spectra are presented as image files only (no JCAMP-DX), with numeric peak data available only for MS and 1H NMR spectra (click the ‘peak data’ button). Currently, the site has the following amounts of spectral data:

MS: ~25000 spectra
1H NMR: ~15900 spectra
13C NMR: ~14200 spectra
FT-IR: ~54100 spectra
Raman: ~3500 spectra
ESR: ~2000 spectra

NIST Chemistry WebBook - http://webbook.nist.gov/chemistry/

The NIST WebBook is a prime source of spectral information about small organic compounds, available as JCAMP-DX files, Scalable Vector Graphic (SVG - http://www.w3.org/Graphics/SVG/) files (an image format specified in XML), PNG image files, and scans of the original data (some from the 1960s). Searching for spectra on the site is done via the compound search and selection of the spectral data to be returned (if it is available for the compound(s) found). Other information such as thermodynamic property data, gas chromatograms (linked to original journal papers), and vibrational and electronic energy levels is also available on the site. Currently, the site has the following amounts of spectral data:

IR spectra for over 16,000 compounds
Mass spectra for over 33,000 compounds
UV/Vis spectra for over 1600 compounds


7.2.1: ChemSpider Spectra Search

ChemSpider has over 10,000 spectra associated with chemical compounds in its database. Some of these are from commercial companies and organizations, but a large number have been uploaded by users of the website. As a result there may be spectra for compounds that are not available elsewhere. You will find that the spectra are sometimes available as JCAMP files (in which case you can open them in the browser and interact with the spectrum using the embedded display tools, available as both Java and non-Java options), while at other times users have provided PDFs or image files (jpg, png, etc.).

In addition to the spectra that are available within ChemSpider, some records provide links to other sites that have spectral data not available in the ChemSpider database, such as NMR ShiftDB and the NIST Chemistry WebBook.

It is also possible to get some information about spectra using one of the site’s APIs, which you need to create a (free) account to use. Access http://www.chemspider.com/Spectra.asmx?op=GetAllSpectraInfo to get an XML file that lists all current spectral data by entering your security token in the ‘token’ field; the token can be found on the http://www.chemspider.com/UserProfile.aspx page once you log in on the ChemSpider website.

Finding spectra in ChemSpider records

There are tools to help you find ChemSpider records that have "Analytical data"; this term is used to encompass any of the following types of data: NMR, IR, UV-Vis, MS and crystal structure. The search pages provide an option to filter results according to whether they have analytical data. This option is available by clicking on the Filter button on each of the search pages (Simple, Structure, Advanced), but not on the quick search boxes that are provided on the homepage or in the header bar, and then checking the Filter only those having analytical data option.

The Filter options

While you can use this option in any search you should think carefully when performing exact searches: try searching for ubiquinone with the Filter only those having analytical data option checked and unchecked.

By contrast, the Filter only those having analytical data option works well with queries that are expected to return many results, for instance searching for all records which have a name that contains the substring hydroquinone (hint: use *hydroquinone* as the search string). In fact, using the Advanced search page you can add the filter to any of the searches, opening up the ability to search by molecular formula, average mass, predicted properties, etc. and see records which have some analytical data (but do remember that this filter can return records that have crystal structures but no spectra).

Finding spectra in other resources from ChemSpider records

ChemSpider can also help users identify other resources that might have spectra data that are not available in ChemSpider itself.

If you look at the Data Sources tab (you often need to use the More dropdown to access this) in a record (e.g. Malonic acid - http://www.chemspider.com/Chemical-Structure.844.html) there are links to lots of other useful resources. You don’t have to scan through all of these to find useful data: there are sub-headings that display subsets of data sources according to their classification. Two relevant headings are Xtal structures (short for crystal structures) and Spectral data, which list other websites and databases that have spectral data for that compound.

The data sources tab showing the Spectral data classification subset

Where possible the external id will act as a link to the record on the external resource.

 

Filter a ChemSpider search to find only records that reference a (spectral data) datasource

You can actually use the ChemSpider search interface to find data in one or more of the data sources that we aggregate. This means that you can specify that a search only returns records that are listed as being in MassBank.  For example, from the Advanced search page you can use the Data source > Data source interface and check only the MassBank data source (you can actually check as many data sources as you wish). You can also select a data source type to restrict your search to one or more classes of data source, e.g. Spectroscopy databases (just a different label for the data sources that appear under the Spectral data sub-heading), found under Data source > Data source type.

The interface for restricting search results according to data sources

The data source selection options can be combined with other search parameters.

Eg. Identifier (substring) = *yohimb* and Data source type = Spectroscopy databases,

or  monoisotopic mass = 104.01095 +/- 0.005 and Data source = MassBank

However, the one thing to remember is that the data in ChemSpider is only as recent as the last update provided by the data sources, so there might be cases where data that can be found by searching the individual databases is not found in a search on ChemSpider because it was added to the individual database after the last update supplied to ChemSpider.
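A mass-window query such as monoisotopic mass = 104.01095 +/- 0.005 simply asks whether a compound's monoisotopic mass falls inside the stated tolerance. As a worked illustration, the sketch below computes the monoisotopic mass of malonic acid (the example record used earlier), assuming its molecular formula is C3H4O4, and tests it against that window. The isotope masses are rounded to six decimal places and the helper names are hypothetical; this is not how ChemSpider itself performs the search.

```python
# Masses of the most abundant isotopes (12C, 1H, 16O), rounded to 6 decimals.
MONOISOTOPIC = {"C": 12.000000, "H": 1.007825, "O": 15.994915}


def monoisotopic_mass(formula_counts):
    """Sum isotope masses for a formula given as {element: count}."""
    return sum(MONOISOTOPIC[el] * n for el, n in formula_counts.items())


def in_window(mass, target, tolerance):
    """True if mass lies within target +/- tolerance."""
    return abs(mass - target) <= tolerance


malonic_acid = {"C": 3, "H": 4, "O": 4}  # assumed formula C3H4O4
mass = monoisotopic_mass(malonic_acid)
print(round(mass, 4), in_window(mass, 104.01095, 0.005))
```

Narrowing the tolerance reduces the number of candidate records returned, at the cost of missing compounds if the measured mass is less accurate than the window assumes.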


ChemSpider - http://www.chemspider.com

ChemSpider has over 10,000 spectra associated with chemical compounds in its database. Some of these are from commercial companies and organizations, but a large number have been uploaded by users of the website. As a result there may be spectra for compounds that are not available elsewhere. Finding compounds that have spectra available is not easy to do and in fact you can only access this information via one of the site’s APIs (see below), which you need to create a (free) account to use. Access the following URL to get an XML file that lists all current spectral data by entering your security token in the ‘token’ field, which can be found on the http://www.chemspider.com/UserProfile.aspx page once you log in on the ChemSpider website.

http://www.chemspider.com/Spectra.asmx?op=GetAllSpectraInfo

You can also search for spectra using the other commands on the http://www.chemspider.com/Spectra.asmx page, and search the mass spectra using commands found in the MassSpecAPI page at http://www.chemspider.com/MassSpecAPI.asmx. For example to search for peaks in a mass spectrum of mass 1000 ± 0.1 you can go to (no token required):

http://www.chemspider.com/MassSpecAPI.asmx/SearchByMass2?mass=1000&range=0.1
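If you want to generate such query URLs for several masses, it is safer to build the query string programmatically than to edit it by hand. The sketch below reconstructs the URL above using the Python standard library; it only builds the string and makes no network request, since the service may have changed since this module was written. The function name is invented for illustration, while the parameter names `mass` and `range` are taken from the URL shown above.

```python
from urllib.parse import urlencode

# Endpoint exactly as given in the module text above.
BASE = "http://www.chemspider.com/MassSpecAPI.asmx/SearchByMass2"


def search_by_mass_url(mass, mass_range):
    """Build a SearchByMass2 query URL for mass +/- mass_range."""
    return f"{BASE}?{urlencode({'mass': mass, 'range': mass_range})}"


print(search_by_mass_url(1000, 0.1))
```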

Other Sources of Spectra

NIST Atomic Spectra Database - http://www.nist.gov/pml/data/asd.cfm
NIST Molecular Spectra Databases - http://www.nist.gov/pml/data/molspec.cfm
NMR Shift DB - http://nmrshiftdb.nmr.uni-koeln.de/
Human Metabolome Database - http://www.hmdb.ca
EPA Emissions Measurement Center Spectral Database - http://www3.epa.gov/ttn/emc/ftir/index.html
MassBank - http://www.massbank.jp/
Romanian Database of Raman Spectroscopy - http://rdrs.uaic.ro/index.html


 


7.3: Spectroscopy Software

Spectral Viewers

Jmol with JSpecView - http://jmol.sourceforge.net/ - the best and most widely used online and offline spectral viewer. The browser plugin provides a lot of features for display of data and export in different formats (right mouse click the plugin to see options for viewing spectra and exporting data). Go to the links below to test out the functionality:
http://chemapps.stolaf.edu/jmol/jsmol/jsv.htm - drag and drop spectra viewing or search for molecules and display simulated spectra
http://chemapps.stolaf.edu/jmol/jsmol/jsv_jme.htm - draw a molecule and get simulated spectrum
http://chemapps.stolaf.edu/jmol/jsmol/jsv_predict2.htm - same as the previous page but with 3D representation of molecular structure

JCAMP Viewer - http://pslc.uwsp.edu/Viewers.shtml - web page looks old but software has been updated recently
Flot JCAMP Viewer - http://webbook.nist.gov/chemistry - get from any page with spectra – requires web server
SpeckTackle Javascript Viewer - https://bitbucket.org/sbeisken/specktackle - new on the scene – requires web server
OpenSpectralWorks - http://scanedit.sourceforge.net/
SpekWin32 - http://effemm2.de/spekwin/index_en.html
OpenChrom - https://www.openchrom.net/ - primarily for chromatographic data but good for mass spectrometry as well
ACD/Labs NMR Processor - http://www.acdlabs.com/resources/freeware/nmr_proc/index.php - NMR Only
Mestrelab Mnova - http://mestrelab.com/software/mnova/ - NMR Only ($)
ChemDoodle - https://www.chemdoodle.com/ - reads all JCAMP files ($)

Spectral Prediction

nmrdb.org - http://www.nmrdb.org/
NMR Shift DB Online - http://nmrshiftdb.nmr.uni-koeln.de/nmrshiftdb/
ChemDoodle - https://web.chemdoodle.com/demos/simulate-nmr-and-ms/
ACD/Labs iLabs - https://ilab.acdlabs.com/iLab2/ (limited online trial)


 


Module 7: Assignment


1. Go to the NIST WebBook and download the following spectra in JCAMP-DX format
Mass spectra for each of the three isomers of nitrobenzoic acid
Gas phase IR spectra for each of the three isomers of nitroaniline

2. Go to ChemSpider and download the 1H NMR spectra of the three isomers of xylene in JCAMP-DX format (can you find them?)

3. Go to the first JSpecView page above or download the JCAMP Viewer software for your OS. Import each of the spectra into the viewer you choose and export them to X,Y format.

4. Finally, import each of the X,Y data files into Excel with one sheet for the MS data, one sheet for the IR data, and another sheet for the NMR data. Plot the three IR spectra on one graph (so they are overlaid) and do the same for the MS and NMR data (so you end up with three graphs). Save the file.

5. Write a short paragraph comparing/contrasting for each of the thr


