A NIR or Raman spectrometer outputs an array — typically several hundred to a few thousand intensity values, one per wavelength channel. This is not an answer. The answer is a number you can put on a control chart: a concentration, a moisture percent, a degree of cure, a pass-or-fail. The function that turns the spectrum into the number is the chemometric model.
The toolbox is small. Most production chemometric models in use today fit one of three patterns: principal component analysis for exploratory and detection work, partial least squares for quantitative prediction, and a small set of preprocessing steps that have changed only modestly in three decades. Newer methods exist — convolutional neural networks, Gaussian process regression — and have a place. They have not displaced the classical methods, which remain easier to validate and easier to defend in an audit.
Why ordinary regression fails
If you have one chemical of interest absorbing at one specific wavelength, you do not need chemometrics. You measure the absorbance at that wavelength, plot it against concentration, and use the resulting line. This is Beer-Lambert with a univariate calibration — the kind of thing a spectroscopy textbook covers in chapter two.
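As a concrete sketch (synthetic numbers, not real data), the univariate case is nothing more than a straight-line fit and an inversion:

```python
# Univariate Beer-Lambert calibration on synthetic data: one analyte,
# one wavelength, absorbance roughly proportional to concentration.
import numpy as np

conc = np.array([0.5, 1.0, 2.0, 4.0, 8.0])              # g/L, reference values
absorbance = np.array([0.05, 0.10, 0.20, 0.41, 0.80])   # AU at the analyte peak

slope, intercept = np.polyfit(conc, absorbance, 1)       # A = slope * c + intercept

new_absorbance = 0.30                                    # reading on an unknown sample
predicted_conc = (new_absorbance - intercept) / slope    # invert the calibration line
print(f"predicted concentration: {predicted_conc:.2f} g/L")
```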
The problem with real samples is that you almost never have one chemical absorbing at one specific wavelength. Mixtures overlap; baselines shift; instrument response drifts; the matrix changes. The signal at any single wavelength is a sum of contributions from many components plus noise. Ordinary least squares regression on individual wavelengths fails because the wavelength channels are strongly correlated and often outnumber the calibration samples, so the fitted coefficients, and with them the predictions, become unstable.
Chemometrics is the set of techniques developed to handle this — to do regression and classification when the predictors are correlated, numerous, and noisy.
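A small synthetic demonstration of the failure mode (illustrative only; the band shape and noise level are made up): when adjacent channels carry nearly the same information, an ordinary least-squares fit reproduces the training data but its coefficient vector changes drastically with a tiny change in the measurement noise.

```python
# Synthetic spectra: one broad absorption band, so adjacent channels are
# nearly collinear. Two OLS fits on near-identical data give very
# different coefficient vectors.
import numpy as np

n_samples, n_channels = 100, 50
rng = np.random.default_rng(0)
conc = rng.uniform(1.0, 10.0, n_samples)                          # reference concentrations
band = np.exp(-0.5 * ((np.arange(n_channels) - 25) / 8.0) ** 2)   # one broad band

def ols_coefficients(noise_seed):
    """Ordinary least squares: concentration regressed on all channels at once."""
    noise = 1e-3 * np.random.default_rng(noise_seed).standard_normal((n_samples, n_channels))
    spectra = np.outer(conc, band) + noise
    coef, *_ = np.linalg.lstsq(spectra, conc, rcond=None)
    return coef

coef_a, coef_b = ols_coefficients(1), ols_coefficients(2)
print("largest |coefficient|:", np.abs(coef_a).max())
print("coefficient change between two near-identical datasets:",
      np.abs(coef_a - coef_b).max())
```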
Principal component analysis: detecting that something is different
PCA does one thing well: it finds the directions in a high-dimensional dataset that explain the most variance, ranked. The first principal component is the line through the cloud of spectra that explains the most variance; the second is orthogonal to it, explaining the next-most variance; and so on.
For process analytics, PCA is rarely the final model. Its production role is anomaly detection. You build a PCA model on a training set of acceptable spectra. Each new spectrum gets projected onto that model and you measure two distances: how far it is from the model in the principal-component space (a Hotelling T² statistic, summarizing in-plane distance) and how far it is from the model perpendicular to the plane (a residual, often called Q or SPE).
Spectra that fall outside the training distribution on either statistic indicate something the model has not seen — a contamination, a phase change, a probe fouling event, a recipe deviation. This is among the most useful pieces of inline analytics: a single number, easy to alarm on, that catches problems the calibration was not built for.
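A minimal sketch of that monitoring scheme, using scikit-learn's PCA on spectra stored row-wise in a NumPy array. The training data here are synthetic placeholders, and the alarm limits are deliberately omitted; in practice they come from the training-set distribution and the site's validation procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_pca_monitor(train_spectra, n_components=3):
    """Fit PCA on acceptable spectra; keep per-component score variance for T-squared."""
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(train_spectra)
    return pca, scores.var(axis=0, ddof=1)

def t2_and_q(pca, score_var, spectrum):
    """Project one new spectrum; return (Hotelling T-squared, Q / SPE residual)."""
    scores = pca.transform(spectrum.reshape(1, -1))[0]
    t2 = float(np.sum(scores ** 2 / score_var))           # distance within the PC plane
    reconstructed = pca.inverse_transform(scores.reshape(1, -1))[0]
    q = float(np.sum((spectrum - reconstructed) ** 2))     # residual off the plane
    return t2, q

# Example: train on synthetic "good" spectra, then score one new spectrum
rng = np.random.default_rng(0)
train = rng.standard_normal((100, 600))
pca, score_var = fit_pca_monitor(train)
print(t2_and_q(pca, score_var, rng.standard_normal(600)))
```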
Partial least squares: predicting a number
PLS is the workhorse of quantitative spectroscopy. The setup: you have spectra, and for each spectrum you have a reference value (a concentration measured by HPLC, a moisture by Karl Fischer, a viscosity by rheometer). You want a function that predicts the reference value from the spectrum.
PLS finds latent variables — linear combinations of the original wavelengths — that maximize covariance with the reference values, not just variance among the spectra. This is the key distinction from PCA. PCA finds directions of variance regardless of whether they predict anything; PLS finds directions that predict.
The output is a prediction equation: y = b₀ + b₁x₁ + b₂x₂ + …, where the xᵢ are the intensities at individual wavelengths (or, equivalently, the latent-variable scores) and the bᵢ are regression coefficients. In production it is a single matrix multiplication: the spectrum comes in, the predicted concentration comes out, sub-millisecond.
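A minimal sketch of that workflow with scikit-learn's PLSRegression on synthetic placeholder data; internally, predict() amounts to centering and scaling the spectrum and taking one dot product with the stored coefficient vector, which is why deployment is so cheap.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X_train = rng.standard_normal((60, 800))   # 60 calibration spectra, 800 channels (synthetic)
y_train = rng.uniform(80.0, 120.0, 60)     # reference values, e.g. g/L by HPLC (synthetic)

pls = PLSRegression(n_components=5)        # 5 latent variables
pls.fit(X_train, y_train)

new_spectrum = rng.standard_normal(800)
y_hat = pls.predict(new_spectrum.reshape(1, -1)).ravel()[0]
print(f"predicted value: {y_hat:.2f}")
```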
The model is characterized by three numbers worth knowing:
- Number of latent variables (LVs): how complex the model is. Too few — the model is biased. Too many — the model fits noise (overfit) and breaks on new data. The choice is made by cross-validation.
- Root-mean-square error of cross-validation (RMSECV): how far off the model is on samples held out during fitting. The honest measure of model performance.
- Root-mean-square error of prediction (RMSEP): the same thing, on a properly independent test set. Lower is better; it must be in the same units as the reference value.
A PLS model with seven LVs and an RMSEP of 0.3 g/L on a nominal 100 g/L API concentration is a quantitatively serious result; without those numbers the claim “we use chemometrics” is decoration.
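A minimal sketch of how those numbers are produced, on synthetic single-band data: cross-validation over a range of latent-variable counts gives the RMSECV curve, and RMSEP comes from a set held out entirely from fitting. Everything below is a placeholder; a real calibration would use measured spectra and reference values.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def make_set(n, seed, n_channels=800):
    """Synthetic spectra: one absorption band whose height tracks the reference value."""
    r = np.random.default_rng(seed)
    y = r.uniform(80.0, 120.0, n)
    band = np.exp(-0.5 * ((np.arange(n_channels) - 400) / 60.0) ** 2)
    X = np.outer(y, band) + 0.5 * r.standard_normal((n, n_channels))
    return X, y

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.ravel(y_true) - np.ravel(y_pred)) ** 2)))

X_train, y_train = make_set(60, seed=0)
X_test, y_test = make_set(20, seed=1)        # properly independent test set

# RMSECV as a function of the number of latent variables
rmsecv = {n: rmse(y_train, cross_val_predict(PLSRegression(n_components=n),
                                             X_train, y_train, cv=10))
          for n in range(1, 11)}
best_lv = min(rmsecv, key=rmsecv.get)        # in practice: smallest model near the minimum

# RMSEP on the held-out test set, in the units of the reference method
final = PLSRegression(n_components=best_lv).fit(X_train, y_train)
rmsep = rmse(y_test, final.predict(X_test))
print(f"{best_lv} LVs, RMSECV {rmsecv[best_lv]:.2f}, RMSEP {rmsep:.2f}")
```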
Preprocessing: where models live or die
Raw spectra carry artifacts that are not chemistry: baseline drift, multiplicative scatter, particle-size effects, instrument-warm-up drift. A chemometric model fit to raw spectra will memorize these artifacts and fail on the next instrument or the next batch of probes.
Preprocessing is the set of transformations applied before modeling to remove what is not chemistry. The standard toolbox:
- Standard normal variate (SNV) and multiplicative scatter correction (MSC) — remove additive baseline shifts and multiplicative scatter, common in NIR diffuse reflectance.
- Savitzky-Golay derivatives — emphasize peak features, remove broad baseline curvature.
- Mean centering and unit variance scaling — standard before PLS so all wavelengths are weighted comparably.
- Spectral region selection — keep only the wavelengths where the analyte absorbs, throw away the rest. Often the most consequential preprocessing decision.
The choice of preprocessing usually matters more than the choice of model. A well-preprocessed PLS frequently outperforms a poorly-preprocessed neural network on the same problem.
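Two of the steps above, sketched on a row-wise spectra array (SNV in plain NumPy, the derivative via SciPy's savgol_filter); the window length and polynomial order here are illustrative, not recommendations.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard normal variate: center and scale each spectrum by its own mean and std."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, keepdims=True)
    return (spectra - mean) / std

def sg_derivative(spectra, window_length=15, polyorder=2, deriv=1):
    """Savitzky-Golay derivative along the wavelength axis."""
    return savgol_filter(spectra, window_length=window_length,
                         polyorder=polyorder, deriv=deriv, axis=1)

# Typical chain before PLS: scatter correction, then derivative; mean centering
# is usually handled inside the PLS implementation itself.
spectra = np.random.default_rng(0).standard_normal((10, 600))   # placeholder spectra
preprocessed = sg_derivative(snv(spectra))
```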
What modern methods add
Convolutional neural networks for spectral feature extraction, Gaussian process regression for uncertainty quantification, transfer learning across instruments — these have found legitimate roles. They tend to add value when:
- the relationship between spectrum and reference is non-linear in ways PLS struggles to capture (high-concentration matrices, strong band overlap with chemical interaction);
- a calibrated uncertainty interval is needed for each prediction, not just a point estimate (see the sketch after this list);
- a model trained on one instrument needs to transfer to another without re-calibration.
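A minimal sketch of the uncertainty point above, on synthetic data: Gaussian process regression returns a predictive standard deviation alongside each estimate. Compressing the spectra to PCA scores before the GP is an assumption of this sketch (a common way to avoid fitting a GP on hundreds of raw channels), not a prescription.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
band = np.exp(-0.5 * ((np.arange(400) - 200) / 40.0) ** 2)
y = rng.uniform(80.0, 120.0, 50)                               # synthetic reference values
X = np.outer(y, band) + 0.5 * rng.standard_normal((50, 400))   # synthetic spectra

pca = PCA(n_components=5)
scores = pca.fit_transform(X)                                  # compress spectra to 5 scores

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(scores, y)

# Each prediction comes with a standard deviation, not just a point estimate
y_hat, y_std = gpr.predict(pca.transform(X[:3]), return_std=True)
print(y_hat, y_std)
```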
For percent-level concentration measurements in moderately well-behaved matrices, PLS still wins on operational simplicity, validation cost, and inspector-friendliness. ICH Q14 explicitly recognizes both the minimal (traditional) and the enhanced approach to analytical procedure development as legitimate; the choice is engineering, not regulatory.
What to ask before approving a chemometric model
Five questions will surface most production problems before they happen:
- What is the RMSEP, in the units of the reference method, on a properly independent test set?
- How many latent variables, and what does cross-validation say about that number?
- What preprocessing steps were applied, and what happens to the prediction if they are slightly varied?
- What is the validity range of the model — what concentrations, what matrices, what conditions?
- How will the model be re-validated when the process changes?
A chemometric model that cannot answer these in writing is not yet a production model.