21 Feb. 2018

Analysis of milk powder adulteration by NIR


There is not much documentation about the discriminant methods of Win ISI 4, so I will try to post whatever I find so that you can access it from the blog. In this case the new methods are used to detect melamine adulteration in milk powder, and you can download it from this link.
 
The article is translated into Spanish.

20 Feb. 2018

PCR vs. PLS (part 7)


We have used PLS with this data set in the post series "Analyzing Soy Meal in transmittance", so in these coming posts I just want to add a comparison between the regressions developed with PLS and with PCR.
 
The best way to do it is to compare some plots, so in this post I show the comparison of the PCR and PLS regressions with the same number of terms, the same calibration set (odd samples), and the same validation method during the development of the model (cross validation).
 
The plot shows how, as expected, the first terms are more correlated with the constituent of interest (protein in this case) for PLS than for PCR, so the RMSEP values are better; but as soon as more terms are added, they become almost the same (a little bit better for PLS). The plot was generated with this code:
## add extra space to right margin of plot within frame
par(mar=c(5, 4, 4, 6) + 0.1)
plot(Xodd_pcr3,"validation",estimate="CV",ylim=c(0.5,2.0),
     xlim=c(1,10),col="blue",lwd=2,
     main=" Corr & RMSEP PCR vs PLS")
par(new=TRUE)
plot(Xodd_pls3,"validation", estimate="CV",
ylim=c(0.5,2.0),xlim=c(1,10),col="red",lwd=2,
     main="")
## Plot the second plot and put axis scale on right
par(new=TRUE)
plot(terms,Xodd_pls3_cor,  xlab="", ylab="", ylim=c(0,1),
     axes=FALSE, type="l",lty=4,lwd=2, col="red")
par(new=TRUE)
plot(terms,Xodd_pcr3_cor,  xlab="", ylab="", ylim=c(0,1),
     axes=FALSE,type="l",lty=4,lwd=2, col="blue")
## a little farther out (line=3) to make room for labels
mtext("Correlation",side=4,col="black",line=3)
axis(4, ylim=c(0,1), col="black",col.axis="black",las=1)
legend("bottomright",legend=c("RMSEP PLS", "RMSEP PCR",
       "Corr PLS","Corr PCR"),
       col=c("red","blue","red","blue"),
       lty=c(1,1,4,4),lwd=2)
A reason for the improvement in the PLS can be seen in the regression coefficients (with 5 terms), where the PLS coefficients (in red) are sharper and can give better estimations (we will try to analyze this in coming posts).
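A quick way to look at this is to overlay the regression coefficients of both models; a minimal sketch, assuming the Xodd_pcr3 and Xodd_pls3 models from the previous posts:

## regression coefficients with 5 terms from each model
pcr_coef <- drop(coef(Xodd_pcr3, ncomp = 5))
pls_coef <- drop(coef(Xodd_pls3, ncomp = 5))
## overlay them: PCR in blue, PLS in red
matplot(cbind(pcr_coef, pls_coef), type = "l", lty = 1, lwd = 2,
        col = c("blue", "red"), xlab = "Wavelength index",
        ylab = "Coefficient", main = "Regression coefficients (5 terms)")
legend("topleft", legend = c("PCR", "PLS"),
       col = c("blue", "red"), lwd = 2)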
 

18 Feb. 2018

PCR vs. PLS (part 6)

After checking the RMSEP and correlation with different numbers of terms in the PCR I get the plot below, using this code:
par(mar=c(5, 4, 4, 6) + 0.1)
plot(Xodd_pcr3,"validation",estimate="CV",
     col="blue",ylim=c(0.5,2.5),
     xlim=c(1,10),lwd=2,xlab="",ylab="",main="")
par(new=TRUE)
plot(terms,Test_rms,type="l",col="red",ylim=c(0.5,2.5),
     xlim=c(1,10),lwd=2,xlab="Nº Terms",
     ylab="RMSEP",
     main="CV vs Ext Validation & Corr")
## Plot the second plot and put axis scale on right
par(new=TRUE)
plot(terms, Test_corr, pch=15, xlab="", ylab="",
     ylim=c(0,1), axes=FALSE, type="b", col="green")
## a little farther out (line=3) to make room for labels
mtext("Correlation",side=4,col="green",line=3)
axis(4, ylim=c(0,1), col="green",
     col.axis="green",las=1)
legend("topright",legend=c("CV", "Ext Val","Corr"),
       col=c("blue","red","green"),lwd=2)
## In the plot, 5 seems to be the best option for the number of terms
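The vectors terms, Test_rms and Test_corr used above hold the external-validation error and correlation for each number of terms. A minimal sketch of how they can be computed, assuming X_msc_test (a hypothetical name) holds the MSC-treated test spectra and Prot_test their reference values:

terms <- 1:10
## test-set predictions with 1 to 10 terms (one column per term)
preds <- sapply(terms, function(i)
  as.numeric(predict(Xodd_pcr3, ncomp = i, newdata = X_msc_test)))
Test_rms  <- apply(preds, 2, function(p) sqrt(mean((p - Prot_test)^2)))
Test_corr <- apply(preds, 2, function(p) cor(p, Prot_test))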
 
Now we can compare the results in the XY plot, as we saw in the post "PCR vs. PLS (part 5)", to draw the conclusions.
The statistics for this external validation set with 5 terms are (Monitor function):

> monitor10c24xyplot(pred_vs_ref_test_5t)
Validation Samples  = 25 
Reference Mean  = 45.6236 
Predicted Mean  = 46.38495 
RMSEP    : 1.046828 
Bias     : -0.761354 
SEP      : 0.7332779 
Corr     : 0.9131366 
RSQ      : 0.8338185 
Slope    : 0.7805282 
Intercept: 9.418837 
RER      : 8.115895   Poor 
RPD      : 2.428305   Fair 
Residual Std Dev is : 0.633808 
    ***Slope/Intercept adjustment is recommended***
BCL(+/-): 0.3020428 
***Bias adjustment is not necessary***
Without any adjustment and using SEP as std dev
the residual distribution is:
Residuals into 68% prob (+/- 1SEP)      = 15   % = 60
Residuals into 95% prob (+/- 2SEP)      = 21   % = 84
Residuals into 99.5% prob (+/- 3SEP)    = 23   % = 92
Residuals outside 99.5% prob (+/- 3SEP) =  2   % = 8
With S/I correction and using Sres as standard deviation,
the Residual Distribution would be:
Residuals into 68% prob (+/- 1Sres)     = 20   % = 80
Residuals into 95% prob (+/- 2Sres)     = 24   % = 96
Residuals into 99.5% prob (+/- 3Sres)   = 25   % = 100
Residuals outside 99.5% prob (> 3Sres)  =  0   % = 0
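Since the output recommends a slope/intercept adjustment, here is a minimal sketch of how it could be derived, assuming ref and pred (hypothetical names) hold the reference and predicted values of this test set:

## regress the reference on the prediction to get the adjustment
adj <- lm(ref ~ pred)
coef(adj)                # intercept and slope to apply in routine
pred_adj <- fitted(adj)  # slope/intercept-corrected predictions
sd(ref - pred_adj)       # close to the Residual Std Dev above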

17 Feb. 2018

PCR vs. PLS (part 5)

It's time for an external validation with an independent test set. The samples in the calibration are soy meal scanned on a NIT instrument (850 to 1050 nm), but some years passed without testing the calibration with new samples, so I had the occasion to scan new samples on a new instrument (from a new generation). Of course the path length for transmission must be configured to the same value, and the sample must be representative.
 
The validation set has 25 samples, for which we tried to cover as wide a range as possible. The reference values come mainly from two laboratories, but there are two samples from a third laboratory.
 
In the last posts we developed a PCR calibration, where we saw that we obtained the best results with 10 terms, but maybe this value is too high if we want the calibration to be more transferable. The test validation will help us to check this.
 
If we don't consider the lab of origin, and we over-plot the prediction results on the XY calibration plot (including the cross-validation predictions), we get this plot:
Some samples fit quite well, others have a bias, and others random error. We want to understand better what is going on, so we give each sample a color depending on the lab that provided the reference value.
 
plot(pred_pcr_test,Prot_test,xlim=c(40,52),
     ylim=c(40,52),cex=2,pch=21,
     bg=(Lab_test),col=Lab_test,lwd=4,xlab="",ylab="")
legend("topleft",legend=c("Lab1", "Lab2","Lab3"),
       cex=1.5,pch=19,
       col=c(1:3))

Well, we can draw some conclusions from this plot, but we need to check the predictions with different numbers of terms to see if the Standard Error of Prediction (with 10 terms the RMSEP is 1.65) decreases and the RSQ (with 10 terms, 0.593) increases.
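A minimal sketch of how those two statistics at 10 terms can be checked, assuming X_msc_test (a hypothetical name) holds the MSC-treated test spectra:

pred10 <- as.numeric(predict(Xodd_pcr3, ncomp = 10, newdata = X_msc_test))
sqrt(mean((pred10 - Prot_test)^2))  # RMSEP, about 1.65
cor(pred10, Prot_test)^2            # RSQ, about 0.593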
 

16 Feb. 2018

PCR vs. PLS (part 4)

In the post "Splitting spectral data into training and test sets"  , we kept apart the even samples for a validation, so it is the time after the calculations in post (PCR vs PLS part 3) to see if the performance of the model, with this even validation sample set, is satisfactory.
 
First we can over-plot the predictions on the plot of the blue and red samples we saw in the post "PCR vs PLS (part 3)". The first impression is that they fit quite well, so the performance seems to be as expected during the development of the model.
 
par(new=TRUE)
plot(Xodd_pcr3.evenpred,Prot[even,],col="green",
     bg=3,pch=23,xlim=c(40,52),ylim=c(40,52),
     xlab="",ylab="")
legend("topleft",legend=c("Odd", "Odd CV","Even"),
       col=c("blue","red","green"),pch=c(1,1,18),
       cex=0.8)
abline(0,1)

The prediction error for the Even Test Set is:

RMSEP(Xodd_pcr3,ncomp=10,newdata=soy_ift_even$X_msc,
      intercept=FALSE)
[1]  0.9616
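The same value can be checked by hand, predicting the even samples and computing the root mean squared error against their reference values; a sketch, assuming Prot[even,] holds those values:

p_even <- as.numeric(predict(Xodd_pcr3, ncomp = 10,
                             newdata = soy_ift_even$X_msc))
sqrt(mean((p_even - Prot[even,])^2))  # should reproduce 0.9616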

Probably this "even" sample set is not really an independent set, so we need to go a step farther and check the model with a really independent set with samples taken in a different instrument, with reference values from different laboratories,.....This will be the argument of the next part in these series about PCR and PLS regressions.

15 Feb. 2018

Introduction to the dplyr R package


PCR vs. PLS (part 3)

A common way to develop regressions with PCR or PLS is with "Cross Validation", where we divide the training data set into a number of groups that we can select depending on the number of samples: if you have few samples you can select a high value (10 for example), and if you have a lot of samples you can select a low value (4 for example). One group is kept out and the regression is developed with all the others. We use the group kept out to validate, repeating the process as many times as there are validation groups.
 
So, in the case of 4 groups, we obtain 4 validation statistics (RMSEP, RSQ, Bias, ...) for every group we use in the development of the regression, so we can see from which term the RMSEP starts to become higher, or where the RMSEP values are almost flat and it makes no sense to add more terms to the regression.
 
Cross validation really helps us to avoid over-fitting, but it also helps to detect outliers (good or bad outliers, we could say).
 
It is important that the samples in the training set are sorted or randomized in a certain way. We don't want similar samples in both the training and validation sets, but this requires inspecting the spectra before developing the regression, looking for neighbours or other sources of redundancy. If there are similar samples, we want them to stay, if possible, in the same group.
 
In the development of the PCR, we can select the Cross Validation option and also the number of segments or groups for validation, in this case 10:

Xodd_pcr3<-pcr(Prot[odd,] ~ X_msc[odd,],validation="CV",
           segments=10,ncomp=30)

plot(Xodd_pcr3,"validation",estimate="CV")

If we look at the plot of the RMSEP for the 30 terms, we see that at the fourth term the RMSEP decreases dramatically, but after 10 it makes no sense to add more terms, so we have to select 10 or fewer.
An external validation (if possible with more independent samples) can help us with the decision.
 
A particular case of cross validation is Leave One Out validation, where a single sample is kept out for validation and the rest stay in the calibration. The process is repeated until every sample has taken part in the validation. So in this case there are as many segments or groups as there are samples. This approach is quite interesting when we have few samples in the training set.
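In the "pls" package this is just another validation option; a minimal sketch with the same training data as above:

## Leave-One-Out CV: as many segments as training samples
Xodd_pcr_loo <- pcr(Prot[odd,] ~ X_msc[odd,], ncomp = 30,
                    validation = "LOO")
plot(Xodd_pcr_loo, "validation", estimate = "CV")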

Regarding the X-Y plots (predicted vs. reference values), it is important to interpret them correctly in the case of cross validation. There are plots which show the predicted vs. the reference values when all the samples are part of the regression (blue dots), and those plots are not realistic. It is better to look at the plot where every dot (sample) gets its value while it is not part of the model, because it is in the validation group (red dots); this way we have a better idea of the performance of the regression for future samples.
 
plot(Xodd_pcr3,"validation",estimate="CV")
plot(Xodd_pcr3,"prediction",ncomp=10,col="red",
     xlim=c(40,52),ylim=c(40,52))
Xodd_pcr3.pred<-predict(Xodd_pcr3,ncomp=10,
                newdata=X_msc[odd,],
                xlab="Predicted",ylab="Reference")
par(new=TRUE)
plot(Xodd_pcr3.pred,Prot[odd,],col="blue",
     xlim=c(40,52),ylim=c(40,52),
     xlab="",ylab="")
abline(0,1)


This is not a nice tutorial data set where the samples fit well, but it is what you often find in many applications where you want to get the best model for a certain product parameter and instrument.

eRum 2018 (European R users meeting)

The European R Users Meeting, eRum, is an international conference that aims at integrating users of the R language living in Europe. Although the useR! conference series serves similar goals, as it alternates between Europe and the USA (and more recently Australia in 2018), we decided to start another conference series in the years when useR! is outside of Europe.
 

12 Feb. 2018

To consider when validating with LOCAL

I have talked in other posts about LOCAL calibration. Once the LOCAL model is placed in routine use and time passes, it is time for a validation.
 
LOCAL may have several products merged into a single database from which we have developed the calibration, so it is important to label every sample in the validation set with a value (in this case 1: Black, 2: Red and 3: Green) corresponding to 3 different types of pork meat meal.
 
The validation simply compares the reference values from the laboratory with the predicted values from our Near Infrared instrument, and the best way to represent it is an XY plot. The most important constituents in meat meal are Moisture, Protein, Fat and Ash.
 
A first look at the general XY plots gives an idea of how it works:
The plot tells us that we have to split the validation set into 3 validation sets, one for every class, to calculate the statistics for each one.
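A minimal sketch of that split, assuming ref, pred and product (hypothetical names) hold the laboratory values, the NIR predictions and the product labels (1: Black, 2: Red, 3: Green):

## per-product validation statistics
for (p in 1:3) {
  idx <- product == p
  cat("Product", p,
      " n:",    sum(idx),
      " Bias:", round(mean(pred[idx] - ref[idx]), 2),
      " SEP:",  round(sd(pred[idx] - ref[idx]), 2), "\n")
}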
In the case of Protein we will get this validation:

For the Green product:
For the Red Product:
For the Black Product:

As we can see we have a significant bias for product 1, but not for the other two products, so we can proceed with the validation adjustments (3 products linked to a LOCAL database, one of them bias-adjusted until the calibration is updated).
 

11 Feb. 2018

PCR vs. PLS (part 2)

Continuing from the post "PCR vs PLS (part 1)":

As we add more terms to the model (PCR or PLS), the Standard Error of Calibration decreases and decreases, and for this reason it is necessary to have a validation method to decide how many terms to keep.
Cross validation or an external validation set are some of the options to use.
The "pls" R package has the function "RMSEP", which gives us the standard error of the calibration, so we can get an idea of which terms are important for the calibration.
Continuing with the post "PCR vs PLS (part 1)", once we have the model we can check the Standard Error of Calibration with the 5 terms used in the model.

RMSEP(Xodd_pcr, estimate="train")
(Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps  
      1.985        1.946        1.827        1.783        1.167        1.120 
 
As we can see the error decreases all the time, so we could be tempted to use 5 or even more terms in the model.
One of the values of the pcr function is "fitted.values", which is an array with the predictions depending on the number of terms.
As we can see in the RMSEP values, 4 seems to be a good number to choose, because there is a big jump in the RMSEP from PC term 3 to PC term 4 (1.783 to 1.167), so we can keep the predicted values with this number of terms and compare them with the reference values to calculate other statistics.
 
Anyway these statistics must be considered optimistic, and we have to wait for the validation statistics (the 50% of the samples in the even positions set apart in the post "Splitting Samples into a Calibration and a Validation sets").

Statistics with the Odd samples (the ones used to develop the calibration):

pred_pcr_tr<-Xodd_pcr2$fitted.values[,,4]
## The reference values are:
ref_pcr_tr<-Prot[odd,]
## We create a table with the sample number,
## reference and predicted values
pred_vs_ref_tr<-cbind(Sample[odd,],ref_pcr_tr,pred_pcr_tr)


Now using the Monitor function I have developed:
monitor10c24xyplot(pred_vs_ref_tr)

We can see other statistics such as the RSQ, Bias, etc.

Now we make the evaluation with the "Even" samples set apart for the validation:
library(pls)
pred_pcr_val<-as.numeric(predict(Xodd_pcr2,ncomp=4,newdata=X_msc_val))
pred_pcr_val<-round(pred_pcr_val,2)
pred_vs_ref_val<-data.frame(Sample=I(Sample[even]),
                 Reference=I(Prot_val),
                 Prediction=I(pred_pcr_val))
## change NA to 0 for the Monitor function
pred_vs_ref_val[is.na(pred_vs_ref_val)] <- 0

As we can see, in this case the RMSEP is a validation statistic, so it can be considered the Standard Error of Validation, and its value is almost the same as the RMSEP for calibration.
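The validation RMSEP can be computed directly from the prediction and reference vectors above:

rmsep_val <- sqrt(mean((pred_pcr_val - Prot_val)^2))
round(rmsep_val, 2)  # the value reported below (1.13)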
RMSEP Calibration ........ 1.17
RMSEP for Validation ..... 1.13