This document is an R
vignette available under the CC BY-SA 4.0
license. It is part of the R package Rnmr1D
, a free
software available under the GPL version 3 or later, with copyright from
the Institut National de la Recherche Agronomique (INRA).
A development version of the Rnmr1D
package can be
downloaded from GitHub. To
install it, we can follow the indications in the README.md
file.
Rnmr1D is the main module in the NMRProcFlow web application (nmrprocflow.org)[1] concerning the NMR spectra processing.
Rnmr1D R package is aimed to performs the complete processing of a set of 1D NMR spectra from the FID (raw data) and based on a processing sequence (macro-command file). An additional file specifies all the spectra to be considered by associating their sample code as well as the levels of experimental factors to which they belong.
NMRProcFlow allows experts to build their own spectra processing workflow, in order to become re-applicable to similar NMR spectra sets, i.e. stated as use-cases. By extension, the implementation of NMR spectra processing workflows executed in batch mode can be considered as relevant provided that we want to process in this way very well-mastered and very reproducible use cases, i.e. by applying the same Standard Operating Procedures (SOP). A subset of NMR spectra is firstly processed in interactive mode in order to build a well-suited workflow. This mode can be considered as the ‘expert mode’. Then, other subsets that are regarded as either similar or being included in the same case study, can be processed in batch mode, operating directly within a R session.
See the NMRProcFlow online documentation https://nmrprocflow.org/ for further information.
To illustrate the possibilities of Rnmr1D 1.0, we will use the dataset provided within the package. This is a very small set of 1H NMR spectra (6 samples) acquired on a Bruker Advanced III 500Mz instrument (ZG sequence, solvent D20, pH 6), derived from plant leaves. The experimental design of the study focused on a treatment (stress vs. control) with 3 replicates for each samples (unpublished).
library(Rnmr1D)
<- system.file("extra", package = "Rnmr1D")
data_dir <- file.path(data_dir, "CD_BBI_16P02")
RAWDIR <- file.path(data_dir, "NP_macro_cmd.txt")
CMDFILE <- file.path(data_dir, "Samples.txt") SAMPLEFILE
The samples matrix with the correspondence of the raw spectra, as well as the levels of the experimental factors
<- read.table(SAMPLEFILE, sep="\t", header=T,stringsAsFactors=FALSE)
samples samples
## Spectrum Samplecode EXPNO PROCNO Treatment
## 1 CD_BBI_16P02-R1 R1 10 1 control
## 2 CD_BBI_16P02-R2 R2 10 1 control
## 3 CD_BBI_16P02-R3 R3 10 1 control
## 4 CD_BBI_16P02-R7 R7 10 1 stress
## 5 CD_BBI_16P02-R8 R8 10 1 stress
## 6 CD_BBI_16P02-R9 R9 10 1 stress
The Macro-commands list for processing
<- readLines(CMDFILE)
CMDTEXT grep("^#$", CMDTEXT, invert=TRUE)] CMDTEXT[
## [1] "#%% Vendor=bruker; Type=fid; LB=0.3; GB=0; ZF=2; BLPHC=FALSE; PHC1=TRUE; FP=0; TSP=TRUE"
## [2] ""
## [3] "# Baseline Correction: PPM Range = ( 4.966 , 9.348 )"
## [4] "airpls 4.966 9.348 3 "
## [5] ""
## [6] "# Baseline Correction: PPM Range = ( 0.396 , 4.712 )"
## [7] "airpls 0.396 4.712 4 "
## [8] ""
## [9] "# Baseline Correction: PPM Range = ( 0.621 , 1.522 )"
## [10] "airpls 0.621 1.522 5 "
## [11] ""
## [12] "# Baseline Correction: PPM Range = ( 1.227 , 1.353 )"
## [13] "airpls 1.227 1.353 6"
## [14] ""
## [15] "# Normalisation ( CSN ) of the Intensities based on the selected PPM ranges..."
## [16] "normalisation CSN"
## [17] "0.98 1.085"
## [18] "5.024 9.282"
## [19] "1.451 4.696"
## [20] "EOL"
## [21] ""
## [22] "# Zeroing the selected zones ..."
## [23] "zero"
## [24] "4.683 5.015"
## [25] "EOL"
## [26] ""
## [27] "# Alignment of the selected zones ( 5.024 , 9.611 )"
## [28] "clupa 10.2 10.5 5.024 9.611 0.01 5 0"
## [29] ""
## [30] "# Alignment of the selected zones ( 2.652 , 2.742 )"
## [31] "align 2.652 2.742 0.05 0"
## [32] ""
## [33] "# Alignment of the selected zones ( 2.651 , 2.677 )"
## [34] "align 2.651 2.677 0.05 0"
## [35] ""
## [36] "# Alignment of the selected zones ( 0.642 , 4.699 )"
## [37] "clupa 10.2 10.5 0.642 4.699 0.01 5 0"
## [38] ""
## [39] "# Bucketing - UNIFORM"
## [40] "bucket unif 10.2 10.5 0.01 3 0"
## [41] "4.696 0.98"
## [42] "9.389 5.008"
## [43] "EOL"
doProcessing is the main function of this package. Indeed, this function performs the complete processing of a set of 1D NMR spectra from the FID (raw data) and based on a processing sequence (macro-command file). An additional file specifies all the spectra to be considered by associating their sample code as well as the levels of experimental factors to which they belong. In this way it is possible to select only a subset of spectra instead of the whole set.
<- Rnmr1D::doProcessing(RAWDIR, cmdfile=CMDFILE, samplefile=SAMPLEFILE, ncpu=2) out
## Rnmr1D: --- READING and CONVERTING ---
## Rnmr1D: Vendor=bruker, Type=fid, LB=0.3, GB=0, ZF=2, BLPHC=FALSE, PHC1=TRUE, FP=0, TSP=TRUE
## Rnmr1D: Generate the 'samples' & 'factors' files from the list of raw spectra
## Rnmr1D: -- Nb Spectra = 6 -- Nb Cores = 2
##
## Rnmr1D: PPM range = [-0.9193 , 11.0776]
##
## Rnmr1D: Generate the final matrix of spectra...
## Rnmr1D: ------------------------------------
## Rnmr1D: Process the Macro-commands file
## Rnmr1D: ------------------------------------
## Rnmr1D:
## Rnmr1D: Baseline Correction: PPM Range = ( 4.966 , 9.348 )
## Rnmr1D: Type=airPLS, lambda= 3 , order= 1
## Rnmr1D: Baseline Correction: PPM Range = ( 0.396 , 4.712 )
## Rnmr1D: Type=airPLS, lambda= 4 , order= 1
## Rnmr1D: Baseline Correction: PPM Range = ( 0.621 , 1.522 )
## Rnmr1D: Type=airPLS, lambda= 5 , order= 1
## Rnmr1D: Baseline Correction: PPM Range = ( 1.227 , 1.353 )
## Rnmr1D: Type=airPLS, lambda= 6 , order= 1
## Rnmr1D: Normalisation of the Intensities based on the selected PPM ranges...
## Rnmr1D: Method =CSN
## Rnmr1D: Zeroing the selected PPM ranges ...
## Rnmr1D: Zone 1 = ( 4.683 , 5.015 )
## Rnmr1D: Alignment: PPM Range = ( 5.024 , 9.611 )
## Rnmr1D: CluPA - Resolution =0.01 - SNR threshold=5 - Reference=0
## Rnmr1D: --- Peak detection : nDivRange = 64
## Rnmr1D: --- Peak detection time: 3.03 sec
## Rnmr1D: --- The reference spectrum is: 3
## Rnmr1D: --- Spectra alignment to the reference: maxShift = 16
## Rnmr1D: --- Spectra alignment time: 0.51 sec
## Rnmr1D: Alignment: PPM Range = ( 2.652 , 2.742 )
## Rnmr1D: Rel. Shift Max.=0.05 - Reference=0
## Rnmr1D: Alignment: PPM Range = ( 2.651 , 2.677 )
## Rnmr1D: Rel. Shift Max.=0.05 - Reference=0
## Rnmr1D: Alignment: PPM Range = ( 0.642 , 4.699 )
## Rnmr1D: CluPA - Resolution =0.01 - SNR threshold=5 - Reference=0
## Rnmr1D: --- Peak detection : nDivRange = 64
## Rnmr1D: --- Peak detection time: 2.69 sec
## Rnmr1D: --- The reference spectrum is: 5
## Rnmr1D: --- Spectra alignment to the reference: maxShift = 16
## Rnmr1D: --- Spectra alignment time: 0.689999999999998 sec
## Rnmr1D: UNIF - Resolution =0.01 - SNR threshold=3
## Rnmr1D: Bucketing the selected PPM ranges ...
## Rnmr1D: UNIF - Resolution =0.01 - SNR threshold=3 - Append=0
## Rnmr1D: Zone 1 = ( 0.98 , 4.696 ), Nb Buckets = 375
## Rnmr1D: Zone 2 = ( 5.008 , 9.389 ), Nb Buckets = 443
## Rnmr1D: Total Buckets = 627
The ouput list includes severals metadata, data and other information.
ls(out)
## [1] "factors" "infos" "nuc" "origin" "procParams"
## [6] "rawids" "samples" "specMat"
$infos out
## Spectrum Samplecode EXPNO PROCNO PULSE NUC SOLVENT GRPDLY
## [1,] "CD_BBI_16P02-R1" "R1" "10" "1" "zg" "1H" "H2O+D2O" "76"
## [2,] "CD_BBI_16P02-R2" "R2" "10" "1" "zg" "1H" "H2O+D2O" "76"
## [3,] "CD_BBI_16P02-R3" "R3" "10" "1" "zg" "1H" "H2O+D2O" "76"
## [4,] "CD_BBI_16P02-R7" "R7" "10" "1" "zg" "1H" "H2O+D2O" "76"
## [5,] "CD_BBI_16P02-R8" "R8" "10" "1" "zg" "1H" "H2O+D2O" "76"
## [6,] "CD_BBI_16P02-R9" "R9" "10" "1" "zg" "1H" "H2O+D2O" "76"
## PHC0 PHC1 SF SI
## [1,] "5.85092517534365" "-0.109905242919922" "500.1625008" "32768"
## [2,] "5.80818530717959" "2.77555756156289e-17" "500.1625008" "32768"
## [3,] "5.73142749467959" "0.1466796875" "500.1625008" "32768"
## [4,] "5.76052905717959" "0.0591796875" "500.1625008" "32768"
## [5,] "5.78318530717959" "0.075" "500.1625008" "32768"
## [6,] "5.72381030717959" "0.140625" "500.1625008" "32768"
## SW SWH RELAXDELAY O1
## [1,] "12.0009016085441" "6002.40096038415" "25" "2500.8"
## [2,] "12.0009016085441" "6002.40096038415" "25" "2500.8"
## [3,] "12.0009016085441" "6002.40096038415" "25" "2500.8"
## [4,] "12.0009016085441" "6002.40096038415" "25" "2500.8"
## [5,] "12.0009016085441" "6002.40096038415" "25" "2500.8"
## [6,] "12.0009016085441" "6002.40096038415" "25" "2500.8"
plotSpecMat(out$specMat, ppm_lim=c(0.5,5), K=0.33)
plotSpecMat(out$specMat, ppm_lim=c(0.5,5), K=0, pY=0.1)
plotSpecMat(out$specMat, ppm_lim=c(0.5,5), K=0.33, asym=0)
<- c( rep("blue",length(out$samples$Treatment)));
cols$samples$Treatment=="stress"] <- "red"
cols[outplotSpecMat(out$specMat, ppm_lim=c(0.5,5), K=0.67, dppm_max=0, cols=cols)
It is possible to apply additionnal processing after the main processing. For that the doProcCmd function can process macro-commands included in a string array to be applied on the spectra set previously generated. In the previous processing, the bucketing has been done with a uniform approach. We are going to change this by an intelligent bucketing [2], more efficient to generate relevant variables as we will further see .
<- Rnmr1D::doProcCmd(out,
specMat.new c( "bucket aibin 10.2 10.5 0.2 3 0", "9.5 4.9", "4.8 0.5", "EOL" ), ncpu=2, debug=TRUE)
## Rnmr1D: AIBIN - Resolution =0.2 - SNR threshold=3
## Rnmr1D: Bucketing the selected PPM ranges ...
## Rnmr1D: AIBIN - Resolution =0.2 - SNR threshold=3 - Append=0
## Rnmr1D: Zone 1 = ( 4.9 , 9.5 ), Nb Buckets = 195
## Rnmr1D: Zone 2 = ( 0.5 , 4.8 ), Nb Buckets = 370
## Rnmr1D: Total Buckets = 565
$specMat <- specMat.new out
Before exporting, in order to make all spectra comparable each other, we have to account for variations of the overall concentrations of samples. In NMR metabolomics, the total intensity normalization (called the Constant Sum Normalization) is often used so that all spectra correspond to the same overall concentration. It simply consists to normalize the total intensity of each individual spectrum to a same value.
<- Rnmr1D::getBucketsDataset(out, norm_meth='CSN')
outMat 1:10] outMat[,
## B0_7495 B0_8354 B0_8663 B0_8989 B0_9167 B0_9343 B0_9436
## R1 13.015044 17.9604126 889.3085 13.340070 361.4917 141.47796 213.52321
## R2 10.884311 7.2600934 584.2360 5.970308 266.6833 107.46365 159.15690
## R3 9.083736 13.8894946 655.7992 8.877060 258.0548 101.69776 158.92785
## R7 10.556448 0.9288146 341.0188 7.403461 138.8360 49.59623 46.11968
## R8 10.653989 2.1476196 368.5862 5.074188 103.1625 40.82064 45.23590
## R9 9.645943 1.4730849 299.3819 10.715649 100.8052 34.04914 36.99747
## B0_9526 B0_9588 B0_9678
## R1 67.455240 50.690610 43.974227
## R2 40.060876 28.435390 26.718695
## R3 57.108014 48.318028 50.604803
## R7 8.070731 4.845127 5.949904
## R8 10.007071 6.674977 6.050443
## R9 7.613871 5.349257 6.717476
Note: The following features are integrated into the BioStatFlow web application the biostatistical analysis companion of NMRProcFlow (the ’ Clustering of Variables’ analysis in the default workflow).
At the bucketing step (see above), we have chosen the intelligent bucketing [2], it means that each bucket exact matches with one resonance peak. Thanks to this, the buckets now have a strong chemical meaning, since the resonance peaks are the fingerprints of chemical compounds. However, to assign a chemical compound, several resonance peaks are generally required in 1D 1 H-NMR metabolic profiling. To generate relevant clusters (i.e. clusters possibly matching to chemical compounds), we take advantage of the concentration variability of each compound in a series of samples and based on significant correlations that link these buckets together into clusters. Two approaches have been implemented:
In this approach an appropriate correlation threshold is applied on the correlation matrix before its cluster decomposition [3]. Moreover, an improvement can be done by searching for a trade-off on a tolerance interval of the correlation threshold : from a fixed threshold of the correlation (cval), the clustering is calculated for the three values (cval-dC, cval, cval+dC), where dC is the tolerance interval of the correlation threshold. From these three sets of clusters, we establish a merger according to the following rules: 1) if a large cluster is broken, we keep the two resulting clusters. 2) If a small cluster disappears, the initial cluster is conserved. Generally, an interval of the correlation threshold included between 0.002 and 0.01 gives good trade-off.
cval=0 => threshold automatically estimated
options(warn=-1)
<- Rnmr1D::getClusters(outMat, method='corr', cval=0, dC=0.003, ncpu=2) clustcor
## #-- Clustering --
## # Correlation Method: pearson
## # Correlation Threshold : 0.997
## # Correlation Tolerance: 0.003
## # Nb Clusters: 47
## #
In this approach a Hierachical Classification Analysis (HCA, hclust) is applied on the data after calculating a matrix distance (“euclidian” by default). Then, a cut is applied on the tree (cutree) resulting from hclust, into several groups by specifying the cut height(s). For finding best cut value, the cut height is chosen i) by testing several values equally spaced in a given range of the cut height, then, 2) by keeping the one that gives the more cluster and by including most bucket variables. Otherwise, a cut value has to be specified by the user (vcutusr)
vcutusr=0 => cut value automatically estimated
options(warn=-1)
<- Rnmr1D::getClusters(outMat, method='hca', vcutusr=0.12) clusthca
## #-- Clustering --
## # Distance Method: euclidean
## # Agglomeration Method: complete
## # Cutting Tree threshold: 0.12
## # Nb Clusters: 117
## #
The getClusters function returns a list containing the several components, in particular:
$clustertab[1:20, ] clustcor
## VAR CLID PPM
## [1,] "B0_8663" "C17" "0.8663"
## [2,] "B2_4291" "C17" "2.4291"
## [3,] "B0_9526" "C45" "0.9526"
## [4,] "B0_9588" "C45" "0.9588"
## [5,] "B0_9907" "C7" "0.9907"
## [6,] "B1_0009" "C7" "1.0009"
## [7,] "B1_0397" "C7" "1.0397"
## [8,] "B1_0536" "C7" "1.0536"
## [9,] "B1_0102" "C1" "1.0102"
## [10,] "B1_0214" "C1" "1.0214"
## [11,] "B1_5043" "C1" "1.5043"
## [12,] "B1_5131" "C1" "1.5131"
## [13,] "B2_1203" "C1" "2.1203"
## [14,] "B2_1282" "C1" "2.1282"
## [15,] "B2_1346" "C1" "2.1346"
## [16,] "B2_1438" "C1" "2.1438"
## [17,] "B2_1487" "C1" "2.1487"
## [18,] "B2_1590" "C1" "2.1590"
## [19,] "B2_1652" "C1" "2.1652"
## [20,] "B2_4481" "C1" "2.4481"
$clusters$C7 # same as clustcor$clusters[['C7']] clustcor
## [1] 0.9907 1.0009 1.0397 1.0536
Based on these clusters, it is possible to find candidate by querying online databases - See for example:
The’Corr’ approach produces fewer clusters but correlations between all buckets are guaranteed to be greater than or equal to the set threshold; unlike the’hca’ approach which produces many more clusters and especially small sizes (2 or 3) but which may result from aggregates having a wider range around the threshold of values of correlations between buckets.
<- ggplotCriterion(clustcor)
g1 #ggplotPlotly(g1, width=820, height=400)
g1
<- ggplotCriterion(clusthca)
g2 #ggplotPlotly(g2, width=820, height=400)
g2