30 mar. 2012

Comparing Spectra with different math treatments

It is amazing the quantity of graphics you can develop with R, and how you can show and manage  these graphics. Here in the same plot we compare the raw demo spectra, treated with MSC, SNV and with the first derivative (differences between consecutive  data points). We can see how SNV and MSC look similar, but if you look to the Y  axis, they are different.
In the first derivative spectra there is a big peak of noise at 1100 nm. At this wavelength the detector change from “Si” to “PbSn”, so the difference in absorbance’s is quite high (from 1098 to 1100), so we have to exclude this area if we develop the calibration with the full spectrum range.


Using: par(mfrow=c(2,2)) opens a window, where we plot our spectra with the four different treatments.

29 mar. 2012

Dividing the Sample Set in two (Validation & Training)

We have in the Demo sample set “66” samples.  In this post we´ll see one way to divide the set in two parts: one for “Validation” and another for Training or Calibration.
The selection will be random. And we are going to use the command: “sample”. I decided to select 10 samples for validation, and the rest for training.
demo_raw_val<-demo_raw[sample(66,10),]
If you repeat this sentence several times, you will get different sets every time.
In my case the samples selected are:
Samples: 25,50,8,49,39,12,16,63,35 y 41
These samples are in rows, and we have to create a training set removing them:
demo_raw_train<-demo_raw[c(-25,-50,-8,-49,-39,-12,-16,-63,-35,-41),]
We will create the same sample sets for the other data frame with math treatments:
demo_msc_train<-demo_msc[c(-25,-50,-8,-49,-39,-12,-16,-63,-35,-41),]
demo_snv_train<-demo_snv[c(-25,-50,-8,-49,-39,-12,-16,-63,-35,-41),]
demo_msc_val<-demo_msc[c(25,50,8,49,39,12,16,63,35,41),]
demo_snv_val<-demo_msc[c(25,50,8,49,39,12,16,63,35,41),]
It is important to look to the summary of the sample sets to check and compare the statistics for the different constituents.
Or to look to the distribution plots, like in this case for moisture:

28 mar. 2012

SCRIPT for NIR "DEMO" Tutorial - 001

########### RAW NIR Demo Data##################################
demo<-read.table("demo.txt",header=TRUE)
VIS<-demo[,6:355]
VIS<-as.matrix(VIS)
NIR<-demo[,356:1055]
NIR<-as.matrix(NIR)
ALL<-demo[,6:1055]
ALL<-as.matrix(ALL)
Protein<-demo[,1]
Fat<-demo[,2]
Ash<-demo[,3]
DM<-demo[,4]
Moisture<-demo[,5]
demo_raw<-data.frame(Protein=I(Protein),Fat=I(Fat),Ash=I(Ash),DM=I(DM),
+ Moisture=I(Moisture),VIS=I(VIS),NIR=I(NIR),ALL=I(ALL))
wave_vis<-seq(400,1098,by=2)
wave_nir<-seq(1100,2498,by=2)
wave_all<-seq(400,2498,by=2)
matplot(wave_vis,t(demo_raw$VIS),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_nir,t(demo_raw$NIR),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_all,t(demo_raw$ALL),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
########### MSC (Multiple Scatter Correction) ##########################
library(ChemometricsWithR)
VIS_msc<-msc(NIR)
VIS_msc<-as.matrix(VIS_msc)
NIR_msc<-msc(NIR)
NIR_msc<-as.matrix(NIR_msc)
ALL_msc<-msc(ALL)
ALL_msc<-as.matrix(ALL_msc)
demo_msc<-data.frame(Protein=I(Protein),Fat=I(Fat),Ash=I(Ash),DM=I(DM),
+ Moisture=I(Moisture),VIS=I(VIS_msc),NIR=I(NIR_msc),ALL=I(ALL_msc))
matplot(wave_vis,t(demo_msc$VIS),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_nir,t(demo_msc$NIR),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_all,t(demo_msc$ALL),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
########### SNV (Standard Normal Variate )####################################
tVIS<-t(VIS)
tNIR<-t(NIR)
tVISsnv<-scale(tVIS,center=TRUE,scale=TRUE)
tNIRsnv<-scale(tNIR,center=TRUE,scale=TRUE)
VIS_snv<-t(tVISsnv)
VIS_snv<-as.matrix(VIS_snv)
NIR_snv<-t(tNIRsnv)
NIR_snv<-as.matrix(NIR_snv)
tALL<-t(ALL)
tALLsnv<-scale(tALL,center=TRUE,scale=TRUE)
ALL_snv<-t(tALLsnv)
ALL_snv<-as.matrix(ALL_snv)
demo_snv<-data.frame(Protein=I(Protein),Fat=I(Fat),Ash=I(Ash),DM=I(DM),
+ Moisture=I(Moisture),VIS=I(VIS_snv),NIR=I(NIR_snv),ALL=I(ALL_snv))
matplot(wave_nir,t(demo_snv$NIR),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_vis,t(demo_snv$VIS),lty=1,pch=".",xlab="nm",ylab="log(1/R)")
matplot(wave_all,t(demo_snv$ALL),lty=1,pch=".",xlab="nm",ylab="log(1/R)")

22 mar. 2012

Export from ISI Scan / Importing into "R"

Some software’s like ISI Scan (Infrasoft International) have a nice feature to export in an ASCII file the spectral data. Every time we analyze a sample in routine, one file with the spectral data of that sample is sent to a folder we have configure previously.
This post was written after checking this point and after that I have imported the spectrum into “R”.
To configure the product to export spectra:


After some preparation of the  file and some work in "R", we can see and work with the spectra. We have seen in previous post how to read file into R with for example: "read.table".

21 mar. 2012

Win ISI (Demo.CAL) imported in "R"


I like to work with “R” (with the different chemometrics packages) and compare the results with other software’s, like in this case Win ISI.
Win ISI has a demo file, acquired in an instrument with a wavelength range from 400-2498 nm.
Spectra can be exported from Win ISI and imported into “R” with “read.table”, for example.
Data frame is generated and the raw spectra are ready to run Chemometrics.

20 mar. 2012

Excel: "Vincular una Matriz a un rango de celdas"

Savitzky-Golay in "VISION"

I had been explaining how to develop this nice pre-treatment in “R”  (see previous posts):
Savitzky-Golay filters in "R"
Applying Savitzky-Golay filters in “R”
, but let’s see how a commercial Chemometric Software as VISION, show the configuration window:

As we can see we have to configure the same parameters than in “R”, Matlab, or any other software which contains SG filters.

Applying Savitzky-Golay filters in “R”

When applying SG, we select a moving average window with an odd value “n” for the number of data points. SG fit a polynomial of “p” degree to this data points and give the value to the central point (this is the reason to have an odd value).
We apply also an smooth in the case of “m” = 0, or the first (m=1), second (m=2) or third (m=3) derivatives.

Apply a Savitzky-Golay smoothing filter
Description
Smooth data with a Savitzky-Golay smoothing filter.
Usage
sgolayfilt(x, p = 3, n = p + 3 - p%%2, m = 0, ts = 1)
## S3 method for class 'sgolayFilter'
filter(filt, x, ...)
Arguments

x
signal to be filtered.
p
filter order.
n
filter length (must be odd).
m
return the m-th derivative of the filter coefficients.
ts
time scaling factor.
filt
filter characteristics (normally generated by sgolay).


Playing with these three parameter we change the shape of the spectra.

Example:
The upper spectra was taken with p=2, n=7 and m=2
The lower one with p=2, n=31 and m=2
As we can see the size of the window has a big influence on the resolution of the peaks.
How to apply it in R:
(For the lower spectrum)
>sgolay_2_31_2<-apply(fattyac$NITm,1,sgolayfilt,p=2,n=31,m=2)
>matplot(wavelengths,sgolay_2_31_2,lty=1,pch=21,
 + xlab="nm",ylab="SG_2_31_2_abs")


17 mar. 2012

Savitzky-Golay filters in "R"

Derivatives are a good way to extract information to our spectra. As we know NIR contents overlapping bands, and spectra must be treated with math operations in order to extract as much information as possible and to correlate it with the constituent  of interest in the case of quantitative analysis or to discriminate two similar products.
Derivatives are one of those math treatments.
With “R”, we can apply a Savitzky Golay smoothing filter to apply derivatives to our spectra, and to select the order of the derivative.
Let´s have a look first to the R help:

Apply a Savitzky-Golay smoothing filter
Description
Smooth data with a Savitzky-Golay smoothing filter.
Usage
sgolayfilt(x, p = 3, n = p + 3 - p%%2, m = 0, ts = 1)
## S3 method for class 'sgolayFilter'
filter(filt, x, ...)
Arguments

x
signal to be filtered.
p
filter order.
n
filter length (must be odd).
m
return the m-th derivative of the filter coefficients.
ts
time scaling factor.
filt
filter characteristics (normally generated by sgolay).
...
additional arguments (ignored).

Let’s apply the default values for SG filter to the “fattyac” data (you can do the same for the “yarn” data or “gasoline” data).
But first remember how this data looks before any math treatment was applied:
Now let’s give to m, the values 1 (First Derivative) and 2 (Second Derivative)

Here we have a powerful tool, to continue progressing doing Chemometrics with R.

Bibliography:
Chemometrics with R (Ron Wehrens)
R help page

14 mar. 2012

NIT: Fatty acids study in R - Part 007

Once we have chosen the model, we can continue acquiring spectra of new samples. Spectra is exported to a txt or csv file and we imported in R to be reprocessed.
We use the function “predict” from the PLS package. I have done this with 20 new samples. We need first to apply to them the adequate math treatment (same as the used in the model). I call this sample set for prediction “fatt2ac_val”, after apply the “msc” math treatment.
So, let´s see the predictions:
> predict(C16_0,ncomp=12,newdata=fatty2ac_val)
, , 12 comps
    C16_0
220   22.01807
221   20.44803
222   19.79991
223   21.64232
224   20.29058
225   20.20099
226   21.52053
227   19.83305
228   18.95492
229   21.39239
230   21.11044
231   20.67454
232   19.28662
233   20.97292
234   21.70614
235   20.27464
236   19.70897
237   21.30686
238   20.21069
239   19.21576
I have used the model with 12 terms.
If this data has more than the spectra, (the Lab values) we can also validate and to check the number of terms to use.
> RMSEP(C16_0,newdata=fatty2ac_val,ncomps=12)
(Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps 
     1.6280       1.5987       1.2554       1.1071       1.3447       0.9122 
    6 comps      7 comps      8 comps      9 comps     10 comps     11 comps 
     0.8754       0.8413       0.6597       0.6154       0.5669       0.5791 
   12 comps     13 comps     14 comps     15 comps     16 comps 
     0.5935       0.6261       0.5994       0.6315       0.5787

We can see our RMSE error and compare it with the RMSEP obtained in the PLSR “LOO validation statistics” which was 0,5733.We can see also that we would get even lower values for validation with a lower number of components (10).
It seems that the is working almost as expected, but let´s have a look to the plots:
>predplot(C16_0,ncomp=7:15,newdata=fatty2ac_val,
+ asp=1,line=TRUE)
I can see a Bias, of almost 0,50.
The error corrected by the Bias would be 0.34.
Bias can be due to diferent reasons (temperature, sample presentation,particle size,optical path,state of the instrument,...).
This samples are fat triturated as a paste, and put it in a petri dish, and place it in an instrument in transmitance.
The next step would be to add this data to the data base and recalibrate.
Divide the data into a Calibration and a Validation Set. Be sure that in the validation set there are some samples of this last set.
The idea with all this series of post related to this data set was to work with data I used almost daily in my job, and I wanted to see how to proceed in R, and once you get use to it the results are very good for the understanding of chemometrics for multivariate data. I will continue exporting some of my data sets to work in R. The analysis of Fatty Acids by NIR/NIT can give good results for some of them (C16:0, C18:0, C18:1,..).


12 mar. 2012

NIT: Fatty acids study in R - Part 006

In one of the columns, for constituent C16_0, one sample (57) has a value of “zero” (we could see this in the histogram).The reason for that is that the laboratory did not supply this value. The PLS regression will consider the lab value as cero, so we will get a plot like this:

I observed also that the sample 219 has a high residual for the regressions of all the constituents, so I decided to remove these two samples from the sample set in order to continue, and to develop the models.
I am starting with R, so I will appreciate if you add comments in order to do this task in a simpler way.
I create two sample sets, in order to remove these two samples (219 and 57):

> fattyac1<-fattyac_msc[1:56,]
> fattyac2<-fattyac_msc[58:218,]
and I combined this three sets again:
> fattyac_msc1<-rbind(fattyac1,fattyac2)
Well, I can develop my regression now:

Now we have to take the decision of how many terms to choose. Let´s see the validation plot with 7 and 12 components (terms).

plot(C16_0,ncomp=7,which="validation")
abline(0,1,col="red")
plot(C16_0,ncomp=12,which="validation")
abline(0,1,col="red")


It is clear that the decision to choose one model or the other will have a great influence in the predictions. We need a validation set to make a better decision. But I think that it will work better with 12 terms.
It will be important, if possible to find samples with C16:0 values bellow 18 to add to our database in order to develop a better model.
Another decision could be to keep out this extreme sample until we find more, but we can decide to leave it, in order to extrapolate better in this zone.
It is important not to have unique samples in the model. In this case we have one. We have to consider this.

If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.

9 mar. 2012

NIT: Fatty acids study in R - Part 005 (Calibration)

There are several algorithms to run a PLS regression (I recommend to consult the books: “Introduction to Multivariate Analysis in Chemometrics - Kurt Varmuza & Peter Filzmozer” and “Chemometrics with R – Ron Wehrens”).
We are going to use the PLS package, and we are going to develop, maybe the constituent which looks more promising: Oleic Acic (C18:1).
Of course we are going to use the MSC pretreatment. For vcross validation we are going to use “leave one out”.

I decide a maximum of 16 PLS terms.
[1] C18_1_reg<- plsr(C18_1~NITmsc, ncomp = 16,data =fattyac_msc, 
    + validation = "LOO")



Looking to the Cross Validation statistics, it seems that 12 is the best number of terms to use, anyway lets see the plots.
par(mfrow=c(1,2))
plot(C18_1_reg,"validation",estimate="CV")
abline(v=12,col="red")
plot(C18_1_reg,ncomp=12,"prediction")
abline(0,1,col="red")




We can see how after Term 12 the RMSEP increase.
We can see in the XY plot, how we have a few samples (like the sample 219) with a high residual (probably wrong lab value in some cases). If you prefer to see the residual plot, where you can see sample 219 in the upper right corner:


If we check the residual list (C18_1_reg$residuals),the extreme sample 66, fits well in the model, so we decide to keep it. No we can remove sample 219 from the sample training set removing this row from the data frame.
C18_1_reg<- plsr(C18_1~NITmsc,ncomp = 16,
   + data =fattyac1_msc,validation = "LOO")
We repeat steo [1] again, but thid time we change "fattyas_msc" for "fattyas1_msc", and look to the plots and statistics again. In this case I overplot the plots with the outlier (red colour) and without the outlier (Blue):
  

 We keep this last model and we will test it in the future with new samples (independent test set).


If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.

7 mar. 2012

NIT: Fatty acids study in R - Part 004

We already know that sample 66 has a MD of 11.6. Let´s see the values for the six constituents for this sample:
> fattyac_msc[66,1:6]
   C16_0  C16_1  C18_0  C18_1  C18_2  C18_3
66  15.8     2     6    62.3   10.2    0.6

Let´s compare with the summary


> summary(fattyac_msc)
        C16_0           C16_1                 
 Min.   : 0.00   Min.   :1.500     
 1st Qu.:20.10   1st Qu.:2.000     
 Median :21.00   Median :2.200     
 Mean   :21.34   Mean   :2.267     
 3rd Qu.:22.90   3rd Qu.:2.500     
 Max.   :26.00   Max.   :3.500    
   
  C18_0            C18_1     
Min.   : 5.800   Min.   :43.80 
1st Qu.: 8.600   1st Qu.:51.95 
Median : 9.400   Median :54.50 
Mean   : 9.711   Mean   :53.93 
3rd Qu.:10.500   3rd Qu.:56.15 
Max.   :14.000   Max.   :62.30 

 C18_2            C18_3      
 Min.   : 5.500   Min.   :0.3000 
 1st Qu.: 7.600   1st Qu.:0.5000 
 Median : 8.500   Median :0.6000 
 Mean   : 8.503   Mean   :0.6032 
 3rd Qu.: 9.100   3rd Qu.:0.7000 
 Max.   :14.700   Max.   :1.3000  


Sample 66 has the higher value for C18:1 (oleic acid), but it is not isolated in the histogram. For some reasons this sample differs from the others especially from 100 to 1050 nm. We will wait forward to take a decision about this sample.
Until now we have been managing with the X matrix.
Now we start to study the Y matrix. First thing to do is to have a look to the summary, and of course to the histograms.
If you want to follow this tutorial, please send me an e_mail. I´ll send you the “txt” file attached.

> hist(C16_0,col="red")
> hist(C16_1,col="blue")
> hist(C18_0,col="green")
> hist(C18_1,col="brown")
> hist(C18_2,col="violet")
> hist(C18_3,col="orange")