Algorithms for calculating variance: Difference between revisions

Jump to navigation Jump to search
I believe there was a mistake in the computation of sample_reliability_variance. (see analogous case in online_weighted_covariance algorithm below)
 
imported>LionmerterTHE
m clean up, typo(s) fixed: ically- → ically
 
Line 6: Line 6:
A formula for calculating the variance of an entire [[statistical population|population]] of size ''N'' is:
A formula for calculating the variance of an entire [[statistical population|population]] of size ''N'' is:


:<math>\sigma^2 = \overline{(x^2)} - \bar x^2 = \frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N}\right)^2</math>
:<math>\sigma^2 = \overline{(x-\bar x)^2}  = \overline{(x^2)} - \bar x^2 = \frac{\sum_{i=1}^N x_i^2}{N} - \left(\frac{\sum_{i=1}^N x_i}{N}\right)^2</math>


Using [[Bessel's correction]] to calculate an [[estimator bias|unbiased]] estimate of the population variance from a finite [[statistical sample|sample]] of ''n'' observations, the formula is:
Using [[Bessel's correction]] to calculate an [[estimator bias|unbiased]] estimate of the population variance from a finite [[statistical sample|sample]] of ''n'' observations, the formula is:
Line 14: Line 14:
Therefore, a naïve algorithm to calculate the estimated variance is given by the following:
Therefore, a naïve algorithm to calculate the estimated variance is given by the following:


<div style="margin-left: 35px; width: 600px">
{{framebox|blue|margin-left=35px|width=400px}}
{{framebox|blue}}
* Let {{math|''n'' ← 0, Sum ← 0, SumSq ← 0}}
* Let {{math|''n'' ← 0, Sum ← 0, SumSq ← 0}}
* For each datum {{mvar|x}}:
* For each datum {{mvar|x}}:
Line 23: Line 22:
* {{math|Var {{=}} (SumSq − (Sum × Sum) / n) / (n − 1)}}
* {{math|Var {{=}} (SumSq − (Sum × Sum) / n) / (n − 1)}}
{{frame-footer}}
{{frame-footer}}
</div>


This algorithm can easily be adapted to compute the variance of a finite population: simply divide by ''n'' instead of ''n''&nbsp;−&nbsp;1 on the last line.
This algorithm can easily be adapted to compute the variance of a finite population: simply divide by ''n'' instead of ''n''&nbsp;−&nbsp;1 on the last line.


Because {{math|SumSq}} and {{math|(Sum×Sum)/''n''}} can be very similar numbers, [[Catastrophic cancellation|cancellation]] can lead to the [[precision (arithmetic)|precision]] of the result to be much less than the inherent precision of the [[floating-point arithmetic]] used to perform the computation.  Thus this algorithm should not be used in practice,<ref name="Einarsson2005">{{cite book|first=Bo|last=Einarsson|title=Accuracy and Reliability in Scientific Computing|url=https://books.google.com/books?id=8hrDV5EbrEsC|year=2005|publisher=SIAM|isbn=978-0-89871-584-2|page=47}}</ref><ref name="Chan1983">{{cite journal|url=http://cpsc.yale.edu/sites/default/files/files/tr222.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://cpsc.yale.edu/sites/default/files/files/tr222.pdf |archive-date=2022-10-09 |url-status=live|first1=Tony F.|last1=Chan|author1-link=Tony F. Chan|first2=Gene H.|last2=Golub|author2-link=Gene H. Golub|first3=Randall J.|last3=LeVeque|title=Algorithms for computing the sample variance: Analysis and recommendations|journal=The American Statistician|volume=37|issue=3|pages=242–247|year=1983|jstor=2683386|doi=10.1080/00031305.1983.10483115}}</ref> and several alternate, numerically stable, algorithms have been proposed.<ref name=":1">{{Cite book|last1=Schubert|first1=Erich|last2=Gertz|first2=Michael|date=2018-07-09|title=Numerically stable parallel computation of (co-)variance|url=http://dl.acm.org/citation.cfm?id=3221269.3223036|publisher=ACM|pages=10|doi=10.1145/3221269.3223036|isbn=9781450365055|s2cid=49665540}}</ref> This is particularly bad if the standard deviation is small relative to the mean.
Because {{math|SumSq}} and {{math|(Sum×Sum)/''n''}} can be very similar numbers, [[Catastrophic cancellation|cancellation can lead to]] the [[precision (arithmetic)|precision]] of the result to be much less than the inherent precision of the [[floating-point arithmetic]] used to perform the computation.  Thus this algorithm should not be used in practice,<ref name="Einarsson2005">{{cite book|first=Bo|last=Einarsson|title=Accuracy and Reliability in Scientific Computing|url=https://books.google.com/books?id=8hrDV5EbrEsC|year=2005|publisher=SIAM|isbn=978-0-89871-584-2|page=47}}</ref><ref name="Chan1983">{{cite journal|url=http://cpsc.yale.edu/sites/default/files/files/tr222.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://cpsc.yale.edu/sites/default/files/files/tr222.pdf |archive-date=2022-10-09 |url-status=live|first1=Tony F.|last1=Chan|author1-link=Tony F. Chan|first2=Gene H.|last2=Golub|author2-link=Gene H. Golub|first3=Randall J.|last3=LeVeque|title=Algorithms for computing the sample variance: Analysis and recommendations|journal=The American Statistician|volume=37|issue=3|pages=242–247|year=1983|jstor=2683386|doi=10.1080/00031305.1983.10483115}}</ref> and several alternate, numerically stable, algorithms have been proposed.<ref name=":1">{{Cite book|last1=Schubert|first1=Erich|last2=Gertz|first2=Michael|date=2018-07-09|title=Numerically stable parallel computation of (co-)variance|url=http://dl.acm.org/citation.cfm?id=3221269.3223036|publisher=ACM|pages=10|doi=10.1145/3221269.3223036|isbn=9781450365055|s2cid=49665540}}</ref> This is particularly bad if the standard deviation is small relative to the mean.


===Computing shifted data===
==Computing shifted data==


The variance is [[Invariant (mathematics)|invariant]] with respect to changes in a [[location parameter]], a property which can be used to avoid the catastrophic cancellation in this formula.
The variance is [[Invariant (mathematics)|invariant]] with respect to changes in a [[location parameter]], a property which can be used to avoid the catastrophic cancellation in this formula.
Line 43: Line 41:


If just the first sample is taken as <math>K</math> the algorithm can be written in [[Python (programming language)|Python programming language]] as
If just the first sample is taken as <math>K</math> the algorithm can be written in [[Python (programming language)|Python programming language]] as
<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
def shifted_data_variance(data):
def shifted_data_variance(data):
Line 60: Line 57:
</syntaxhighlight>
</syntaxhighlight>


This formula also facilitates the incremental computation that can be expressed as
===Two-pass algorithm===
<syntaxhighlight lang="python">
K = Ex = Ex2 = 0.0
n = 0
 
 
def add_variable(x):
    global K, n, Ex, Ex2
    if n == 0:
        K = x
    n += 1
    Ex += x - K
    Ex2 += (x - K) ** 2
 
def remove_variable(x):
    global K, n, Ex, Ex2
    n -= 1
    Ex -= x - K
    Ex2 -= (x - K) ** 2
 
def get_mean():
    global K, n, Ex
    return K + Ex / n
 
def get_variance():
    global n, Ex, Ex2
    return (Ex2 - Ex**2 / n) / (n - 1)
</syntaxhighlight>
 
==Two-pass algorithm==
An alternative approach, using a different formula for the variance, first computes the sample mean,
An alternative approach, using a different formula for the variance, first computes the sample mean,
:<math>\bar x = \frac {\sum_{j=1}^n x_j} n,</math>
:<math>\bar x = \frac {\sum_{j=1}^n x_j} n,</math>
Line 104: Line 72:
</syntaxhighlight>
</syntaxhighlight>


This algorithm is numerically stable if ''n'' is small.<ref name="Einarsson2005"/><ref>{{cite book |last=Higham |first=Nicholas J. |url=https://epubs.siam.org/doi/book/10.1137/1.9780898718027 |title=Accuracy and Stability of Numerical Algorithms |publisher=Society for Industrial and Applied Mathematics |year=2002 |edition=2nd |publication-place=Philadelphia, PA |chapter=Problem 1.10 |doi= 10.1137/1.9780898718027|isbn=978-0-898715-21-7 |id=e{{ISBN|978-0-89871-802-7}}, 2002075848 }} Metadata also listed at [https://dl.acm.org/doi/10.5555/579525 ACM Digital Library].</ref> However, the results of both of these simple algorithms ("naïve" and "two-pass") can depend inordinately on the ordering of the data and can give poor results for very large data sets due to repeated roundoff error in the accumulation of the sums. Techniques such as [[compensated summation]] can be used to combat this error to a degree.
This piece of code, if run on [[CPython]] 3.12 and newer, is always numerically stable. This is because these versions use Neumaier's [[compensated summation]] scheme for the <code>sum()</code> function, making it resistant to repeated roundoff error.<ref>{{Multiref2|[https://docs.python.org/3.12/whatsnew/3.12.html#other-language-changes What's New in Python 3.12]|{{cite web |last1=Hettinger |first1=R. |title=Improve accuracy of builtin sum() for float inputs · Issue #100425 · python/cpython |url=https://github.com/python/cpython/issues/100425 |website=GitHub - CPython v3.12 Added Features |access-date=7 October 2023 |language=en}}}}</ref> Many numerically oriented languages provide similar constructs.
 
This algorithm, if implemented with a naive addition <code>sum = 0; for x in data: sum += x; mean = sum / n</code>, would only be numerically stable if ''n'' is small because of the accumulation of roundoff error.<ref name="Einarsson2005"/><ref>{{cite book |last=Higham |first=Nicholas J. |url=https://epubs.siam.org/doi/book/10.1137/1.9780898718027 |title=Accuracy and Stability of Numerical Algorithms |publisher=Society for Industrial and Applied Mathematics |year=2002 |edition=2nd |publication-place=Philadelphia, PA |chapter=Problem 1.10 |doi= 10.1137/1.9780898718027|isbn=978-0-898715-21-7 |id=e{{ISBN|978-0-89871-802-7}}, 2002075848 }} Metadata also listed at [https://dl.acm.org/doi/10.5555/579525 ACM Digital Library].</ref>


==Welford's online algorithm==
== Incremental algorithms ==
It is often useful to be able to compute the variance in a [[One-pass algorithm|single pass]], inspecting each value <math>x_i</math> only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.  For such an [[online algorithm]], a [[recurrence relation]] is required between quantities from which the required statistics can be calculated in a numerically stable fashion.
It is often useful to be able to compute the variance in a [[One-pass algorithm|single pass]], inspecting each value <math>x_i</math> only once; for example, when the data is being collected without enough storage to keep all the values, or when costs of memory access dominate those of computation.  For such an [[online algorithm]], a [[recurrence relation]] is required between quantities from which the required statistics can be calculated in a numerically stable fashion.


The following formulas can be used to update the [[mean]] and (estimated) variance of the sequence, for an additional element ''x''<sub>''n''</sub>. Here, <math display="inline">\overline{x}_n = \frac{1}{n} \sum_{i=1}^n x_i </math> denotes the sample mean of the first ''n'' samples <math>(x_1,\dots,x_n)</math>, <math display="inline">\sigma^2_n = \frac{1}{n} \sum_{i=1}^n \left(x_i - \overline{x}_n \right)^2</math> their [[Variance#Biased_sample_variance|biased sample variance]], and <math display="inline">s^2_n = \frac{1}{n - 1} \sum_{i=1}^n \left(x_i - \overline{x}_n \right)^2</math> their [[Variance#Unbiased_sample_variance|unbiased sample variance]].
=== Incremental shifted data ===
Our shifted-data algorithm from before includes the simple updating of data in a loop, which is naturally incremental. It can be expressed as:<ref name="Chan1983" />
 
<syntaxhighlight lang="python">
class ShiftDataVariance:
    def __init__(self):
        self.K = 0.0
        self.n = 0
        self.Ex = 0.0
        self.Ex2 = 0.0
 
    def add_variable(self, x: float):
        if self.n == 0:
            self.K = x
        self.n += 1
        self.Ex += x - self.K
        self.Ex2 += (x - self.K) ** 2
   
    def remove_variable(self, x: float):
        self.n -= 1
        self.Ex -= x - self.K
        self.Ex2 -= (x - self.K) ** 2
 
    def add_variables(self, xs: list[float]):
        # Python uses Neumaier's algorithm for the builtin sum() function,
        # which is more accurate than a simple loop.
        if self.n == 0 and xs:
            self.K = xs[0]
        self.n += len(xs)
        self.Ex += sum(x - self.K for x in xs)
        self.Ex2 += sum((x - self.K) ** 2 for x in xs)
   
    def get_mean(self) -> float:
        return self.K + self.Ex / self.n
   
    def get_variance(self) -> float:
        return (self.Ex2 - self.Ex**2 / self.n) / (self.n - 1)
</syntaxhighlight>
 
===Welford's online algorithm===
Analogous to the two-pass algorithm above, it remains desirable to use the actual [[mean]] of the data instead of just picking the first element. The following formulas can be used to update the [[mean]] and (estimated) variance of the sequence, for an additional element ''x''<sub>''n''</sub>. Here, <math display="inline">\overline{x}_n = \frac{1}{n} \sum_{i=1}^n x_i </math> denotes the sample mean of the first ''n'' samples <math>(x_1,\dots,x_n)</math>, <math display="inline">\sigma^2_n = \frac{1}{n} \sum_{i=1}^n \left(x_i - \overline{x}_n \right)^2</math> their [[Variance#Biased sample variance|biased sample variance]], and <math display="inline">s^2_n = \frac{1}{n - 1} \sum_{i=1}^n \left(x_i - \overline{x}_n \right)^2</math> their [[Variance#Unbiased sample variance|unbiased sample variance]].


:<math>\bar x_n = \frac{(n-1) \, \bar x_{n-1} + x_n}{n} = \bar x_{n-1} + \frac{x_n - \bar x_{n-1}}{n} </math>
:<math>\bar x_n = \frac{(n-1) \, \bar x_{n-1} + x_n}{n} = \bar x_{n-1} + \frac{x_n - \bar x_{n-1}}{n} </math>
Line 117: Line 127:
:<math>s^2_n = \frac{n-2}{n-1} \, s^2_{n-1} + \frac{(x_n - \bar x_{n-1})^2}{n} = s^2_{n-1} + \frac{(x_n - \bar x_{n-1})^2}{n} - \frac{s^2_{n-1}}{n-1}, \quad n>1 </math>
:<math>s^2_n = \frac{n-2}{n-1} \, s^2_{n-1} + \frac{(x_n - \bar x_{n-1})^2}{n} = s^2_{n-1} + \frac{(x_n - \bar x_{n-1})^2}{n} - \frac{s^2_{n-1}}{n-1}, \quad n>1 </math>


These formulas suffer from numerical instability {{cn|date=March 2023}}, as they repeatedly subtract a small number from a big number which scales with ''n''. A better quantity for updating is the sum of squares of differences from the current mean, <math display="inline">\sum_{i=1}^n (x_i - \bar x_n)^2</math>, here denoted <math>M_{2,n}</math>:
These formulas suffer from numerical instability, as they repeatedly subtract a small number from a big number which scales with ''n''. A better quantity for updating is the sum of squares of differences from the current mean, <math display="inline">\sum_{i=1}^n (x_i - \bar x_n)^2</math>, here denoted <math>M_{2,n}</math>:


: <math>\begin{align}
: <math>\begin{align}
Line 125: Line 135:
\end{align}</math>
\end{align}</math>


This algorithm was found by Welford,<ref>{{cite journal |first=B. P. |last=Welford |year=1962 |title=Note on a method for calculating corrected sums of squares and products |journal=[[Technometrics]] |volume=4 |issue=3 |pages=419–420 |jstor=1266577 |doi=10.2307/1266577}}</ref><ref>[[Donald E. Knuth]] (1998). ''[[The Art of Computer Programming]]'', volume 2: ''Seminumerical Algorithms'', 3rd edn., p.&nbsp;232. Boston: Addison-Wesley.</ref> and it has been thoroughly analyzed.<ref name="Chan1983" /><ref>{{cite journal |last=Ling |first=Robert F. |year=1974 |title=Comparison of Several Algorithms for Computing Sample Means and Variances |journal=Journal of the American Statistical Association |volume=69 |issue=348 |pages=859–866 |doi=10.2307/2286154|jstor=2286154 }}</ref> It is also common to denote <math>M_k = \bar x_k</math> and <math>S_k = M_{2,k}</math>.<ref>{{cite web |last=Cook |first=John D. |date=30 September 2022 |orig-date=1 November 2014 |title=Accurately computing sample variance |url=http://www.johndcook.com/standard_deviation.html |website=John D. Cook Consulting: Expert consulting in applied mathematics & data privacy}}</ref>
The following algorithm was found by Welford,<ref>{{cite journal |first=B. P. |last=Welford |year=1962 |title=Note on a method for calculating corrected sums of squares and products |journal=[[Technometrics]] |volume=4 |issue=3 |pages=419–420 |jstor=1266577 |doi=10.2307/1266577}}</ref><ref>[[Donald E. Knuth]] (1998). ''[[The Art of Computer Programming]]'', volume 2: ''Seminumerical Algorithms'', 3rd edn., p.&nbsp;232. Boston: Addison-Wesley.</ref> and it has been thoroughly analyzed.<ref name="Chan1983" /><ref>{{cite journal |last=Ling |first=Robert F. |year=1974 |title=Comparison of Several Algorithms for Computing Sample Means and Variances |journal=Journal of the American Statistical Association |volume=69 |issue=348 |pages=859–866 |doi=10.2307/2286154|jstor=2286154 }}</ref> It is also common to denote <math>M_k = \bar x_k</math> and <math>S_k = M_{2,k}</math>.<ref>{{cite web |last=Cook |first=John D. |date=30 September 2022 |orig-date=1 November 2014 |title=Accurately computing sample variance |url=http://www.johndcook.com/standard_deviation.html |website=John D. Cook Consulting: Expert consulting in applied mathematics & data privacy}}</ref> An example Python implementation for Welford's algorithm is given below, using the same framework as the above "shifted data" algorithm:
 
<syntaxhighlight lang="python">
class WelfordVariance:
    def __init__(self):  # Comparison to ShiftDataVariance:
        self.mean = 0.0  # = K + Ex / n
        self.count = 0    # = n
        self.M2 = 0.0    # = Ex2 - (Ex)^2 / n


An example Python implementation for Welford's algorithm is given below.
    def add_variable(self, x: float):
        self.count += 1
        old_mean = self.mean
        self.mean += (x - self.mean) / self.count
        self.M2 += (x - old_mean) * (x - self.mean)


<syntaxhighlight lang="python">
    def remove_variable(self, x: float):
# For a new value new_value, compute the new count, new mean, the new M2.
        self.count -= 1
# mean accumulates the mean of the entire dataset
        new_mean = self.mean
# M2 aggregates the squared distance from the mean
        self.mean -= (x - self.mean) / self.count
# count aggregates the number of samples seen so far
        self.M2 -= (x - new_mean) * (x - self.mean)
def update(existing_aggregate, new_value):
       
    (count, mean, M2) = existing_aggregate
     def get_mean(self) -> float:
    count += 1
        return self.mean
    delta = new_value - mean
    mean += delta / count
    delta2 = new_value - mean
     M2 += delta * delta2
    return (count, mean, M2)


# Retrieve the mean, variance and sample variance from an aggregate
    def get_variance(self) -> float:
def finalize(existing_aggregate):
        return self.M2 / self.count
    (count, mean, M2) = existing_aggregate
      
     if count < 2:
    def get_sample_variance(self) -> float:
        return float("nan")
         return self.M2 / (self.count - 1)
    else:
         (mean, variance, sample_variance) = (mean, M2 / count, M2 / (count - 1))
        return (mean, variance, sample_variance)
</syntaxhighlight>
</syntaxhighlight>


Line 157: Line 170:
The [[#Parallel algorithm|parallel algorithm]] below illustrates how to merge multiple sets of statistics calculated online.
The [[#Parallel algorithm|parallel algorithm]] below illustrates how to merge multiple sets of statistics calculated online.


==Weighted incremental algorithm==
===Weighted incremental algorithm===
The algorithm can be extended to handle unequal sample weights, replacing the simple counter ''n'' with the sum of weights seen so far.  West (1979)<ref>{{cite journal |last=West |first=D. H. D. |year=1979 |title=Updating Mean and Variance Estimates: An Improved Method |journal=[[Communications of the ACM]] |volume=22 |issue=9 |pages=532–535 |doi=10.1145/359146.359153 |s2cid=30671293 |doi-access=free}}</ref> suggests this [[incremental computing|incremental algorithm]]:
The algorithm can be extended to handle unequal sample weights, replacing the simple counter ''n'' with the sum of weights seen so far.  West (1979)<ref>{{cite journal |last=West |first=D. H. D. |year=1979 |title=Updating Mean and Variance Estimates: An Improved Method |journal=[[Communications of the ACM]] |volume=22 |issue=9 |pages=532–535 |doi=10.1145/359146.359153 |s2cid=30671293 |doi-access=free}}</ref> suggests this [[incremental computing|incremental algorithm]]:


<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
from collections import namedtuple
WeightedVariances = namedtuple("WeightedVariances", "pop freq reli")
def weighted_incremental_variance(data_weight_pairs):
def weighted_incremental_variance(data_weight_pairs):
     w_sum = w_sum2 = mean = S = 0
     w_sum = w_sum2 = mean = S = 0
Line 175: Line 190:
     # Frequency weights
     # Frequency weights
     sample_frequency_variance = S / (w_sum - 1)
     sample_frequency_variance = S / (w_sum - 1)
     # Reliability weights
     # Reliability weights
     sample_reliability_variance = S / (w_sum - w_sum2 / w_sum)
     sample_reliability_variance = S / (w_sum - w_sum2 / w_sum)
    return WeightedVariances(population_variance, sample_frequency_variance, sample_reliability_variance)
</syntaxhighlight>
</syntaxhighlight>


Line 192: Line 208:
This may be useful when, for example, multiple processing units may be assigned to discrete parts of the input.
This may be useful when, for example, multiple processing units may be assigned to discrete parts of the input.


Chan's method for estimating the mean is numerically unstable when <math>n_A \approx n_B</math> and both are large, because the numerical error in <math>\delta = \bar x_B - \bar x_A</math> is not scaled down in the way that it is in the <math>n_B = 1</math> case. In such cases, prefer <math display="inline">\bar x_{AB} = \frac{n_A \bar x_A + n_B \bar x_B}{n_{AB}}</math>.
Chan's method for estimating the mean is numerically unstable when <math>n_A \approx n_B</math> and both are large, because the [[numerical error]] in <math>\delta = \bar x_B - \bar x_A</math> is not scaled down in the way that it is in the <math>n_B = 1</math> case. In such cases, prefer <math display="inline">\bar x_{AB} = \frac{n_A \bar x_A + n_B \bar x_B}{n_{AB}}</math>. Reusing the <code>WelfordVariance</code> class from above, we have:
 
<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
def parallel_variance(n_a, avg_a, M2_a, n_b, avg_b, M2_b):
def merge(a: WelfordVariance, b: WelfordVariance) -> WelfordVariance:
     n = n_a + n_b
     ab = WelfordVariance()
     delta = avg_b - avg_a
    ab.count = a.count + b.count
     M2 = M2_a + M2_b + delta**2 * n_a * n_b / n
     delta = b.mean - a.mean
     var_ab = M2 / (n - 1)
     ab.mean = (a.count * a.mean + b.count * b.mean) / ab.count
    return var_ab
    ab.M2 = a.M2 + b.M2 + delta**2 * a.count * b.count / ab.count
    return ab
      
# example:
ab = merge(a, b)
print(ab.get_sample_variance())
</syntaxhighlight>
</syntaxhighlight>
This can be generalized to allow parallelization with [[Advanced Vector Extensions|AVX]], with [[Graphics processing unit|GPUs]], and [[Computer cluster|computer clusters]], and to covariance.<ref name=":1" />
 
This algorithm allows one to divide a dataset into several pieces, run them in parallel, and then merge the results together. This enables parallelization in any way, including [[Advanced Vector Extensions|AVX]], with [[Graphics processing unit|GPUs]], and [[computer cluster]]s. The algorithm can also be adapted for higher-order statistics as well as covariance.<ref name=":1" /><ref name=Terriberry/>


==Example==
==Example==
Assume that all floating point operations use standard [[IEEE 754#Double-precision 64 bit|IEEE 754 double-precision]]{{Broken anchor|date=2025-06-11|bot=User:Cewbot/log/20201008/configuration|target_link=IEEE 754#Double-precision 64 bit|reason= }} arithmetic. Consider the sample (4, 7, 13, 16) from an infinite population. Based on this sample, the estimated population mean is 10, and the unbiased estimate of population variance is 30.  Both the naïve algorithm and two-pass algorithm compute these values correctly.
Assume that all floating point operations use standard [[Double-precision floating-point format|IEEE 754 double-precision]] arithmetic. Consider the sample (4, 7, 13, 16) from an infinite population. Based on this sample, the estimated population mean is 10, and the unbiased estimate of population variance is 30.  Both the naïve algorithm and two-pass algorithm compute these values correctly.


Next consider the sample ({{nowrap|10<sup>8</sup>&nbsp;+&nbsp;4}}, {{nowrap|10<sup>8</sup>&nbsp;+&nbsp;7}}, {{nowrap|10<sup>8</sup>&nbsp;+&nbsp;13}}, {{nowrap|10<sup>8</sup>&nbsp;+&nbsp;16}}), which gives rise to the same estimated variance as the first sample.  The two-pass algorithm computes this variance estimate correctly, but the naïve algorithm returns 29.333333333333332 instead of 30.
Next consider the sample ({{nowrap|10<sup>8</sup>&nbsp;+&nbsp;4}}, {{nowrap|10<sup>8</sup>&nbsp;+&nbsp;7}}, {{nowrap|10<sup>8</sup>&nbsp;+&nbsp;13}}, {{nowrap|10<sup>8</sup>&nbsp;+&nbsp;16}}), which gives rise to the same estimated variance as the first sample.  The two-pass algorithm computes this variance estimate correctly, but the naïve algorithm returns 29.333333333333332 instead of 30.
Line 211: Line 234:


==Higher-order statistics==
==Higher-order statistics==
Terriberry<ref>{{Cite web |last=Terriberry |first=Timothy B. |date=15 October 2008 |orig-date=9 December 2007 |title=Computing Higher-Order Moments Online |url=http://people.xiph.org/~tterribe/notes/homs.html |url-status=dead |archive-url=https://web.archive.org/web/20140423031833/http://people.xiph.org/~tterribe/notes/homs.html |archive-date=23 April 2014 |access-date=5 May 2008}}</ref> extends Chan's formulae to calculating the third and fourth [[central moment]]s, needed for example when estimating [[skewness]] and [[kurtosis]]:
Terriberry<ref name=Terriberry>{{Cite web |last=Terriberry |first=Timothy B. |date=15 October 2008 |orig-date=9 December 2007 |title=Computing Higher-Order Moments Online |url=http://people.xiph.org/~tterribe/notes/homs.html |url-status=dead |archive-url=https://web.archive.org/web/20140423031833/http://people.xiph.org/~tterribe/notes/homs.html |archive-date=23 April 2014 |access-date=5 May 2008}}</ref> extends Chan's formulae to calculating the third and fourth [[central moment]]s, needed for example when estimating [[skewness]] and [[kurtosis]]:
:<math>
:<math>\begin{align}
\begin{align}
M_{3,X} = M_{3,A} + M_{3,B} & {} + \delta^3\frac{n_A n_B (n_A - n_B)}{n_X^2} + 3\delta\frac{n_AM_{2,B} - n_BM_{2,A}}{n_X} \\[6pt]
M_{3,X} = M_{3,A} + M_{3,B} & {} + \delta^3\frac{n_A n_B (n_A - n_B)}{n_X^2} + 3\delta\frac{n_AM_{2,B} - n_BM_{2,A}}{n_X} \\[6pt]
M_{4,X} = M_{4,A} + M_{4,B} & {} + \delta^4\frac{n_A n_B \left(n_A^2 - n_A n_B + n_B^2\right)}{n_X^3} \\[6pt]
M_{4,X} = M_{4,A} + M_{4,B} & {} + \delta^4\frac{n_A n_B \left(n_A^2 - n_A n_B + n_B^2\right)}{n_X^3} \\[6pt]
Line 238: Line 260:
</math>
</math>


By preserving the value <math>\delta / n</math>, only one division operation is needed and the higher-order statistics can thus be calculated for little incremental cost.
By preserving the value <math>\delta / n</math>, only one division operation is needed and the [[higher-order statistics]] can thus be calculated for little incremental cost.


An example of the online algorithm for kurtosis implemented as described is:
An example of the online algorithm for kurtosis implemented as described is:
Line 300: Line 322:
  |doi=10.1177/1475921709341014
  |doi=10.1177/1475921709341014
}}</ref>
}}</ref>
offer two alternative methods to compute the skewness and kurtosis, each of which can save substantial computer memory requirements and CPU time in certain applications. The first approach is to compute the statistical moments by separating the data into bins and then computing the moments from the geometry of the resulting histogram, which effectively becomes a [[one-pass algorithm]] for higher moments. One benefit is that the statistical moment calculations can be carried out to arbitrary accuracy such that the computations can be tuned to the precision of, e.g., the data storage format or the original measurement hardware.  A relative histogram of a random variable can be constructed in the conventional way: the range of potential values is divided into bins and the number of occurrences within each bin are counted and plotted such that the area of each rectangle equals the portion of the sample values within that bin:
offer two alternative methods to compute the skewness and kurtosis, each of which can save substantial [[computer memory]] requirements and [[CPU time]] in certain applications. The first approach is to compute the statistical moments by separating the data into bins and then computing the moments from the geometry of the resulting [[histogram]], which effectively becomes a [[one-pass algorithm]] for higher moments. One benefit is that the statistical moment calculations can be carried out to arbitrary accuracy such that the computations can be tuned to the precision of, e.g., the data storage format or the original measurement hardware.  A relative histogram of a [[random variable]] can be constructed in the conventional way: the range of potential values is divided into bins and the number of occurrences within each bin are counted and plotted such that the area of each rectangle equals the portion of the sample values within that bin:


: <math> H(x_k)=\frac{h(x_k)}{A}</math>
: <math> H(x_k)=\frac{h(x_k)}{A}</math>
Line 310: Line 332:
             = \frac{1}{A} \sum_{k=1}^K x_k^n h(x_k) \, \Delta x_k
             = \frac{1}{A} \sum_{k=1}^K x_k^n h(x_k) \, \Delta x_k
</math>
</math>
: <math>
: <math>
  \theta_n^{(h)}= \sum_{k=1}^{K} \Big(x_k-m_1^{(h)}\Big)^n \, H(x_k) \, \Delta x_k
  \theta_n^{(h)}= \sum_{k=1}^{K} \Big(x_k-m_1^{(h)}\Big)^n \, H(x_k) \, \Delta x_k
Line 362: Line 383:


==Covariance==
==Covariance==
Very similar algorithms can be used to compute the [[covariance]].
Very similar algorithms can be used to compute the [[covariance]].


===Naïve algorithm===
===Naïve algorithm===
Line 385: Line 406:
:<math>\operatorname{Cov}(X,Y) = \operatorname{Cov}(X-k_x,Y-k_y) = \dfrac {\sum_{i=1}^n (x_i-k_x) (y_i-k_y) - (\sum_{i=1}^n (x_i-k_x))(\sum_{i=1}^n (y_i-k_y))/n}{n}. </math>
:<math>\operatorname{Cov}(X,Y) = \operatorname{Cov}(X-k_x,Y-k_y) = \dfrac {\sum_{i=1}^n (x_i-k_x) (y_i-k_y) - (\sum_{i=1}^n (x_i-k_x))(\sum_{i=1}^n (y_i-k_y))/n}{n}. </math>


and again choosing a value inside the range of values will stabilize the formula against catastrophic cancellation as well as make it more robust against big sums. Taking the first value of each data set, the algorithm can be written as:
and again choosing a value inside the range of values will stabilize the formula against catastrophic cancellation as well as make it more robust against big sums. Taking the first value of each [[data set]], the algorithm can be written as:


<syntaxhighlight lang="python">
<syntaxhighlight lang="python">
Line 395: Line 416:
     ky = data_y[0]
     ky = data_y[0]
     Ex = Ey = Exy = 0
     Ex = Ey = Exy = 0
     for ix, iy in zip(data_x, data_y):
     for ix, iy in zip(data_x, data_y):  # for higher precision, use sum():
         Ex += ix - kx
         Ex += ix - kx                   # Ex = sum(ix - kx for ix in data_x)  # or sum(ix) - kx * n
         Ey += iy - ky
         Ey += iy - ky                   # Ey = sum(iy - ky for iy in data_y)  # or sum(iy) - ky * n
         Exy += (ix - kx) * (iy - ky)
         Exy += (ix - kx) * (iy - ky)    # Exy = sum((ix - kx) * (iy - ky) for ix, iy in zip(data_x, data_y))  
     return (Exy - Ex * Ey / n) / n
     return (Exy - Ex * Ey / n) / n
</syntaxhighlight>
</syntaxhighlight>
Line 416: Line 437:


     covariance = 0
     covariance = 0
     for i1, i2 in zip(data1, data2):
     for i1, i2 in zip(data1, data2):    # for higher precision, use sum():
         a = i1 - mean1
         a = i1 - mean1                 # covariance = sum((i1 - mean1) * (i2 - mean2) for i1, i2 in zip(data1, data2))
         b = i2 - mean2
         b = i2 - mean2
         covariance += a * b / n
         covariance += a * b / n
     return covariance
     return covariance
</syntaxhighlight>
</syntaxhighlight>
A slightly more accurate compensated version performs the full naive algorithm on the residuals.  The final sums <math display="inline">\sum_i x_i</math> and <math display="inline">\sum_i y_i</math> ''should'' be zero, but the second pass compensates for any small error.


===Online===
===Online===
Line 454: Line 473:
         C += dx * (y - meany)
         C += dx * (y - meany)


    population_covar = C / n
     # covariance and Bessel's correction for sample
     # Bessel's correction for sample variance
     return (C / n, C / (n - 1))
     sample_covar = C / (n - 1)
</syntaxhighlight>
</syntaxhighlight>


Line 500: Line 518:


:<math>\operatorname{Cov}_N(X,Y) = \frac{C_N}{\sum_{i=1}^{N} w_i}</math>
:<math>\operatorname{Cov}_N(X,Y) = \frac{C_N}{\sum_{i=1}^{N} w_i}</math>
(This can be used with Python's <code>sum</code> for better accuracy. Block updating is also related to the parallel/merge algorithm.)


==See also==
==See also==