Modelling the uncertainty and stochasticity

We wish to obtain a mathematical model for the qPCR experiment that captures both the stochasticity in amplification at each cycle and the uncertainty in fluorescence reads, so that we may provide a reliable estimate of the magnitude and uncertainty in the initial copy number of DNA molecules. This is particularly important when working in a low-copy number regime (e.g. single cell data) where stochastic effects are expected to be particularly important. This project extends the work of Lalam, 2007 and uses a binomial branching process as the stochastic process of a Hidden Markov Model to model the qPCR amplification.

For a qPCR experiment with n cycles, our mathematical model has the following random variables:

  • X0 – the initial number of molecules (unobserved)
  • X1, ..., Xn – the number of molecules present after each amplification cycle (unobserved)
  • F1, ..., Fn – the fluorescence reads observed after each cycle (observed)

The variable of interest is X0, the initial copy number of mtDNA molecules. The rest of the variables are used to explain the observed fluorescence intensities though the hidden variables Xi, representing the unobserved number of amplified molecules at each cycle i.

At each qPCR cycle, only a proportion of the molecules are amplified. In other words, each target mtDNA molecule has a probability r of being amplified, which is the efficiency of the qPCR experiment. This efficiency is assumed to be constant throughout the exponential phase of the amplification curve. A Binomial distribution is used to capture the number of new copies at each cycle, as each molecule undergoes an independent Bernoulli trial with probability r:

Xi+1 = Xi + Binom(Xi , r), where i ≥ 0

The fluorescence intensity emitted by the reporter dye at a given cycle i is proportional to the number of molecules Xi by a constant α. However, the measurements are made in the presence of background fluorescence, which we model as an additive Gaussian noise. The experimental data used for inference is background corrected, so the Gaussian distribution has mean 0 and unknown variance σ2 (N(0, σ2)):

Fi = αXi + N(0, σ2)


HMM diagram
Figure 12. The amplification process is modelled as a binomial branching process. At each discrete step of the process (qPCR cycle), a molecule is amplified with probability r. When r = 0.5, this is equivalent to a coin flip.
HMM diagram
Figure 13. The Hidden Markov Model used to represent the qPCR experiment. The hidden states are Xi (the copy number at each amplification cycle i) and the observed states are fluorescence intensity reads Fi. The model is parameterised by θ = [X0 , r, σ2].










Inference diagram
Figure 14. Bayesian inference is used to obtain the posterior distribution of the HMM parameters based on fluorescence observations. The model accounts for the stochasticity of amplification (the binomial branching process) and for the uncertainty in observations (Gaussian noise).

Hidden Markov Model

A Hidden Markov Model (HMM) is a stochastic model with hidden states, where future states only depend on the current one. The specification of a HMM consists of hidden state variables, observations, transition probabilities and observation probabilities.

The binomial branching process is a Markov process, as the number of molecules in each state only depends on the number of molecules in the previous state. We model the process as a HMM with the following specification:

  • X1:n – hidden state variables
  • F1:n – observations
  • transition probabilities given by the binomial with probability r
  • observation probability given by Gaussian noise with variance σ2
The model is parameterised by θ = [X0 , r, σ2] (Fig. 13).

The HMM representation influences the choice of inference method, requiring one that accounts for transitions between hidden states when computing the likelihood of observations for a fixed parameter θ. The pseudo-marginal Metropolis Hastings algorithm was used to perform inference on the HMM (more in Inference).

The model can be extended to account for other sources of uncertainty, by including additional states in the graphical model above. For example, the stochasticity in the pipetting procedure can be modelled by adding an extra state (more details in Future Work).

Estimation of the fluorescence coefficient α

The fluorescence coefficient α, also known as an optical calibration factor, relates the number of target molecules to fluorescence intensity:

Fi = αXi + N(0, σ2)

Lalam, 2007 treats α as a known constant and do not specify a method to determine its value. We found that variability in α can induce significant amounts of uncertainty in our estimates of X0. The explanation behind this effect is that α and X0 are co-dependent when the fluorescence intensity is fixed. In other words, X0 can have any value if α is unrestricted. Hence taking variability of α into account is important in order to provide a more reliable estimate in our uncertainty in X0.

The choice of model

The fluorescence coefficient α can be estimated using samples where the initial copy number X0 is known. The experimental data for such samples is in the form of dilution series, with initial copy numbers varying between 1,000 and 100,000 molecules. A binomial branching process, such as the one used to describe the qPCR experiment above, behaves deterministically for molecule numbers as large as the ones in the dilution series. The pseudo-marginal Metropolis Hasting inference algorithm fails for deterministic processes (detailed explanation in Inference).

As it was found that the experiment behaves deterministically for the dilution series used in estimating α, a sigmoid curve model was considered more appropriate to represent the data.

Figure 15. A number of 1000 qPCR amplification simulations with efficiency r = 0.7 were performed in silico for X0=50 and X0=100,000, respectively. The amplification curve segment between cycles 32-35 is shown, with the 5-95% confidence interval, showing the variability in amplification between different experiments with the same initial copy number X0. The plot shows how the stochasiticity has a larger effect for small numbers of molecules.

Sigmoid model

Goll et al., 2006 propose a sigmoidal model to represent the qPCR process. In the paper, the model is used to perform absolute quantification on qPCR data and obtain the initial copy number X0 by using non-linear regression. The model does not capture the stochasticity of the amplification process, which is why it was not considered for single-cell data. However, the amplification process behaves deterministically in the case of dilution series with large initial copy numbers, that are used to obtain the fluorescence coefficient α, making the sigmoid model an appropriate choice.

The fluorescence curve is represented by the following equation:

Sigmoid equation

where Fi represents the fluorescence at cycle i, Fmax the maximal fluorescence intensity, C1/2 the cycle with half of the maximal fluorescence intensity, k is a constant related to efficiency and Fb represents the background fluorescence. Non-linear regression is applied to the data to fit the four sigmoid parameters (Fmax, C1/2, k, Fb), using the Python function scipy.optimize.curve_fit. The fluorescence at cycle 0 is estimated by using these parameters in the following equation:

Sigmoid equation

Finally, the fluorescence coefficient α is obtained:

Sigmoid equation

Figure 16. Example of sigmoid fitting on real fluorescence data where X0 = 1,000,000 is known. Sigmoid fitting is used to estimate F0, the initial fluorescence intensity. The estimate for α = 2.97817284227e-09 is computed as the ratio between F0 and X0.

Variability in the coefficient α

Since literature papers assume that the parameter α is a fixed constant known in advance, it is interesting to see what the variability of α is depending on different factors such as the qPCR primer or experimental setting.

The estimation of the fluorescence coefficient α was done on experimental data from 3 different days (16/12/2016, 02/03/2017, 10/03/2017) and with 3 different primers (BJCOII, BJ-CYB C57, BJmmMUP) used to amplify standards with known initial copy number. The estimation showed both inter- and intra-experimental variability. The adjoining figure shows the estimate value of α plotted on the log scale, with visible inter-experimental variability present on orders of magnitude of α.

Figure 17. The α estimates show significant inter-experimental variability when different primers are used to amplify the standards (plotted on the log scale)..

Reassuringly, it can be observed from the plot above that the distributions of α estimates from samples with the same primer (BJmmMUP) are comparable, although the data comes from very different experimental settings, on different days (16/12/2017 vs. 10/03/2017). More work is needed to establish whether these two distributions differ significantly in their means. However, these results indicate that the variability in α between experiments is explained by the type of primer. This supports using an α estimate in the inference for a new sample if the same primer was used on the standards for α-estimation and on the new sample.

Figure 18. When the same primer is used, the distribution of α does not show significat inter-experimental variability. While further work is needed to establish this, it supports the practice of using an α estimate from standards that use the same primer as the sample of interest for which X0 is inferred.

Exponential phase identification

The mathematical model used to represent the qPCR amplification assumes that the efficiency is constant (see more details in Model). The fluorescence reads corresponding to the exponential phase must first be identified, then used as observation data in the inference method.

In order to identify the exponential phase, we use two models, a sigmoidal curve and an exponential curve. We iteratively fit both models on the curve and progressively discard the last cycle from the data. The exponential phase is identified when the exponential model fits the data better than the sigmoid model. The key aspect is what "fitting the data better" means.

The mathematical definitons of the sigmoid and exponential model are omitted here, but depicted in the adjoining figure. The sigmoid model is more complex with respect to parameter count, as it uses four parameters to restrict the shape of the curve, whereas the exponential model uses three.

We use the Akaike Information Criterion (AIC) to determine which of the two models best fits the data. AIC is a common measure and incorporates both the goodness-of-fit of the data (the likelihood of the data given the model) and the complexity of the model (the number of parameters used to estimate the curve).

The figures to the side show how AIC considers the trade-off between complexity and the goodness-of-fit. For the fluorescence data shown here, the sigmoid is a better fit when considering 36 cycles or more. On the other hand, the exponential is a better fit when considering less than 36 cycles. Therefore, we would use the first 35 cycles for inference on this data set.

Figure 19. The exponential phase is identified by using AIC model selection to determine how many cycles need to be discarded from the end, before the exponential becomes a better model. For 45 cycles, the sigmoid is a better fit than the exponential. For 35 cycles, the exponential is a better fit according to AIC. Although the goodness-of-fit is identical, the exponential has lower complexity with respect to parameter count.