Manuscript

Original Publication:   R.U.E. 't Lam, Analytica Chimica Acta 659 (2010) 68–84.     doi:10.1016/j.aca.2009.11.032

Scrutiny of variance results for outliers: Cochran's test optimized

Ruben U.E. ‘t Lam*

The Dow Chemical Company, P.O. Box
48, 4530 AA  Terneuzen, The Netherlands

* Tel.: +31 115 673746; fax: +31 115 673729; E-mail: rutlam@dow.com.

ABSTRACT
ISO Standard 5725 “Accuracy (trueness and precision) of measurement methods and results” recommends Cochran’s C test to numerically verify if three or more normally distributed data sets show “homogeneity of variances” or “homoscedasticity”. The C test is a one-sided outlier test that will identify deviant standard deviations. It can be run on summary data using a pocket calculator. However, the C test has limitations. It only applies to data sets of equal size. It uses critical values that are only available for the upper tail of the variance distribution, at selected numbers of data sets, selected numbers of replicates per set and only at two significance levels. Cochran’s C test will not identify an outlying low variance, but may mistake a high variance for an outlier instead. We transform the C test into a more general “G test”.
Expressions are derived to calculate upper limit as well as lower limit critical values for data sets of equal and unequal size at any significance level. The expressions are validated against literature values and through simulations in Excel. Representative critical values are tabulated for those who prefer to work from tables. The power of the G test is verified for data sets of equal and unequal size. The G test appears superior to the C test in detecting effects from low variances. The G test allows positive identification of exceptionally low variances. The application of the G test is illustrated with a numerical example.

Keywords:
Variance outliers; unbalanced designs; two-sided test; Cochran's C test; homogeneity of variances; heteroscedasticity.


1. Introduction

1.1. Method validation requirements

                Routine analyses should be performed using validated methods. These methods may be publically available or may have been validated in-house. Validation requirements vary with application area, type of analysis and source of the requirements.
                Most types of methods require the determination of precision [1– 4]. For internal laboratory validation this will be the repeatability precision and/or the intermediate precision. If a method is to become a public standard, its reproducibility precision must also be determined [4]. Following ICH [5], repeatability is the precision under the same operating conditions over a short interval of time. Intermediate precision includes within-laboratory variations: different days, different analysts, different equipment, etc. Reproducibility is the precision between laboratories. The standard deviation and the variance are common measures of precision [1,2].
                If a method will run in-house at one concentration in one specific matrix, a single series of ten repeat analyses [3] should be sufficient for estimating repeatability precision. More complex situations require more elaborate validation designs: several series of repeat analyses at different concentrations, in different matrices and/or different laboratories. Ideally, the design is balanced: Each lab analyzes each sample the same number of times on the same number of occasions. In practice, designs are often unbalanced, e.g. some measurement results appear unsatisfactory, some participants deviate from the original design, or independent (historic) data is appended to the study.
                Whether the design is balanced or not, each separate data series typically generates a different estimate for the repeatability standard deviation. If the differences are small, and if real differences are not anticipated, one often assumes “homogeneity of variances”. In that case, one may construct an overall estimate for the repeatability standard deviation by pooling the individual variances [6]. Homogeneity of variances is also called “homoscedasticity”. Its complement is called “heterogeneity of variances” or “heteroscedasticity”.

1.2. Homoscedasticity tests

                How does one decide if variances are homogeneous? If two data sets are considered, and if each set is normally distributed, one may do an F test [1,2,6–8]. If more data sets are involved, there is no universal answer. A commercial statistics package such as JMP [9] offers four tests: O’Brien’s test [10], Brown–Forsythe’s test [11], Levene’s test [8,12] and Bartlett’s test [1,2,7,8,13]. Other tests are Scheffé–Box’s log-ANOVA test [7,14], Box’s M test [15,16] and Kullback’s test [17]. Public standards [18,19] on determining the precision of tests methods, i.e. ASTM E 691 and ISO 5725, propose Mandel’s k test [1,2] to graphically evaluate homogeneity of variances. On top of that, various min–max tests and outlier tests have been promoted for verifying homoscedasticity: range tests [20] summarized by Youden and Steiner [21], Hartley’s Fmax test [1,2,7,22] and Cochran’s C test [1,2,6,23]. ISO 5725 presents the C test as numerical alternative for Mandel’s k test. The use of the C test was established in 1986 by a team led by Horwitz [24] on behalf of IUPAC and ISO.
                At The Dow Chemical Company we frequently evaluate precision data from both balanced and unbalanced designs. To meet ASTM and ISO standards, we used to rely on Mandel’s k test to evaluate the consistency among variances. The original k test requires data sets of equal size, but we generalized the k test to data sets of unequal size. However, we observed another drawback with the k test: the chance to fail this test increases with increasing numbers of data sets (§ 6.1). 

1.3. Cochran’s C test

                The complications with the k test lead us to examine the numerical alternative from ISO 5725: Cochran’s C test. The C test focuses on the highest observed variance. If this variance classifies as an outlier, it is omitted from the range and the test is repeated on the remaining values [19]. The C test has several advantages:

  • The C test is easily performed. It runs on a pocket calculator.
  • The C test takes summary data, i.e. variances or standard deviations. Knowledge of the individual data points is not a prerequisite.
  • The C test flags the variance value(s) that is (are) supposed to be causing variance heterogeneity.
  • As the C test is an outlier test, it responds to extreme variance values rather than to gradual variance changes. While this property may be a drawback in some applications (e.g. in data screening prior to ANOVA), it is advantageous if individual outliers are actionable or may be eliminated.

                Still, the C test is not ideal: it is only defined for data sets of equal size; it relies on critical values that are only available for selected numbers of data sets, selected numbers of replicates per set and at two significance levels [1,2,6,19,25,27]; its power to detect heteroscedasticity due to high variances is acceptable, but its power to detect heteroscedasticity due to low variances is minor [28]; a high variance may be mistaken for an outlier, while the actual outlier resides at low variance values (§ 3.7); and – last but not the least – its critical values must be read from tables.

1.4. G test

                We report an improved, general version of Cochran’s C test – the “G test” – which allows statistical evaluations of method precision data from balanced and unbalanced designs. Expressions are derived (and validated against literature values and through simulations in Excel) to calculate upper- and lower limit critical values at any significance level for data sets of either equal or unequal size. The critical values for the G test encompass the critical values for the original Cochran’s C test. Eq. (28) reproduces the critical values of Cochran’s original C test.
                The G test is primarily intended to substitute Cochran’s C test in evaluating the internal consistency of variances from precision studies. Mandel’s k test is the most common alternative in this application [1,2,18,19]. Therefore we limit our discussion of the G test to a comparison with the k and the C test (§ 6.1 § 6.2). The G, C and k test are variance outlier tests that should be differentiated from “true” homoscedasticity tests such as O’Brien’s test, Brown–Forsythe’s test, Levene’s test and Bartlett’s test. An in-depth comparison with a variety of main-stream homoscedasticity tests would require an ample study beyond the scope of this paper. 


2. Background

2.1. Determining the repeatability standard deviation



Consider an interlaboratory study (ILS) to determine repeatability precision, with L participating labs. Each lab i (1 i L) performs a number of replicate tests. Each lab reports the number of tests ni it has performed and the standard deviation Si it has obtained. The present treatment allows ni to vary between labs, i.e. the treatment covers both balanced and unbalanced designs. For the pooled repeatability standard deviation Spool over all participating labs we may then write [6]:


and
where Spool the pooled repeatability standard deviation, i any participating lab (1 i L), L the total number of participating labs, ni the number of replicates at lab i, νi the (ni 1) number of degrees of freedom at lab i, Si the repeatability standard deviation at lab i, νpool the number of degrees of freedom of Spool.
For convenience, we identify i with a lab and L with the total number of labs. However, our treatment also applies if we indentify i and L with products, analytes or concentrations.
Imagine we suspect that the standard deviation at lab j is “deviant”, i.e. Sj is significantly different from the standard deviations at the other labs. We may then decide to exclude Sj, pooling the standard deviations over the remaining labs:
and
where S(poolj) the pooled repeatability standard deviation over all labs but lab j, j the lab that is suspected to be deviant, νj the (nj 1) number of degrees of freedom at lab j, Sj the repeatability standard deviation at lab j, ν(poolj) the number of degrees of freedom of S(poolj).

Before excluding Sj from the calculation of the pooled repeatability standard deviation Eq. (3), we must decide if Sj is really deviant. We can base such a decision on hypothesis testing [1,2,7,8,29].

2.2. Simple null hypothesis

                Occasionally, we may want to focus on one specific lab j, e.g. we may have an a-priori reason to suspect that the particular lab could show a deviant standard deviation and we are not concerned if any other lab could show a deviant standard deviation as well. In that case, we may test the following simple null hypothesis:

H0(j): σj = σpool

Our alternative hypothesis depends on the details of the problem under consideration. If we believe that lab j could have suffered from detector instability, we want to test if its standard deviation could be exceptionally large. In that case our alternative hypothesis reads:

H1(j): σj > σ(pool-j)                    (variance too large)

If we suspect that lab j has been reporting averages from duplicate runs instead of single runs, we want to test if its standard deviation could be unusually small. Our alternative hypothesis then becomes:

H1(j): σj < σ(pool-j)                    (variance too small)

If either deviation from H0(j) could occur, we need a two-sided alternative hypothesis:

H1(j): σj ≠ σ(pool-j)                    (variance either too large or too small)

2.3. Composite null hypothesis

In the field of method validation, we often want to verify if any of the observed standard deviations could be deviant. In that case, we need a composite null hypothesis:

H0: σj = σpool; 1 ≤ jL           (the variances are equal)

In practice, H0 is examined by examining L simple null hypotheses:

H0(1): σ1 = σpool;  … ; H0(j): σj = σpool ;  …;  H0(L): σL = σpool

If H0 is rejected, we want to identify the labs j that have produced the outliers. We label these labs f1, f2,.., fm, where m is the total number of identified outliers, and define Φ ≡ {f1, f2, …, fm}. For the three possible alternative hypotheses we write:

H1: σj < σ(pool-Φ);  j ϵ Φ          (variances from labs f1 to fm are too small)
H1: σj > σ(pool-Φ);  j ϵ Φ          (variances from labs f1 to fm are too large)
H1: σjσ(pool-Φ);  j ϵ Φ          (variances from labs f1 to fm are deviant)

2.4. Significance level

If we test a simple null hypothesis H0(j) at significance level α, we accept a probability α to obtain a so called “false positive”. This is the probability to reject H0(j) even if H0(j) is actually true. Such a decision is called a Type I error. To minimize the risk for Type I errors we must minimize α.
                If we want to test a composite null hypothesis H0 at significance level α, we effectively must test L simple null hypotheses H0(i). If the significance level of each separate test is ζ, i.e. if each H0(i) has a probability ζ to be rejected even when it is true, then the overall probability α to incorrectly reject at least one simple H0(i) will exceed ζ. The exact relationship between α and ζ may be unknown, but the following inequality should hold [30]:

  α/Lζ1 (1α)1/L                                                                                                                                       (5)

 
where α the significance level of the overall test, ζ the significance level of a separate test j.

For small α, 1 (1α)1/Lα/L, and we may safely estimate ζ using the so called Bonferroni adjustment:

  ζ = α/L                                                                                                                                                     (6)

For large α, Eq. 6 actually provides an upper limit value for ζ. However, our validation results in § 4.1 indicate that in case of the G test, Eq. 6 is satisfactory for all possible α.

2.5. Power

If we test at significance level α, we accept a probability α to obtain a false positive. However, we can also obtain false negatives: we may accept H0 even if H0 is actually wrong. Such a decision is called a Type II error. The risk for Type II errors increases with decreasing α. To minimize the risk for Type II errors we must maximize α. In practice one often takes α = 0.05, to balance the risk for false positives (§ 2.4) against the risk for false negatives.
                The probability to make a type II error is called β. The power η of a statistical test is defined as the probability that the test will not make a Type II error, so we may write:

  η ≡ 1 β                                                                                                                                                  (7)

The ideal test for evaluating H0 would show a very narrow “power curve” with a minimum η = 0 if H0 is met, and with η = 1 if H0 is not met.

2.6. Existing variance outlier tests

We need a suitable outlier test for verifying the composite null hypothesis H0: σj = σpool. Public standards [18,19] offer two options in case the alternative hypothesis is H1: σj > σ(pool-Φ): Mandel’s k test and Cochran’s C test. Mandel’s k test [1,2] evaluates the ratio:

  kj ≡ Sj / Spool                                                                                                                                            (8)

ASTM E 691 [18] provides a formula to calculate critical values for the k test:
 
where kc(α) the critical value for within-laboratory consistency statistic kj at significance level α, Fc the critical F ratio at significance level α for n and (npool n) degrees of freedom; Fc is available from Excel [31]: Fc = FINV(α,n, (npool n)).
The critical values kc are upper limit values, i.e. when evaluating the k statistic, we demand kj kc. If all kj kc, we accept H0: σj = σpool (the variances are equal). If any kj > kc, we reject H0 and accept H1: σj > σ(pool Φ); j ϵ Φ (variances from labs f1 to fm are too large).
                An issue with the k test is that it disregards the Bonferroni adjustment Eq. (6). The α parameter in Eq. (9) does not represent the overall significance level of the k test: the actual risk to obtain at least one false positive will increase with the number of labs L. The matter is discussed quantitatively in § 6.1. ASTM E 691 reduces the probability of obtaining false positives by selecting α = 0.005 rather than α = 0.05 (§ 2.4). Still the problem remains that the overall probability of false positives will depend on the number of participating labs. For numerical purposes, ISO 5725 [19] relies on Cochran’s C test [1-2,6,23] rather than on the k test. The C test evaluates the ratio:
where Cj is Cochran’s C at lab j.
 
Selections of upper limit critical values CUL have been tabulated [1,2,6,19,25–27] at overall significance levels α = 0.01 and α = 0.05. If all Cj CUL, we accept H0: σj = σpool. If any Cj > CUL, we reject H0, label the lab with the largest variance as “f1”, and remove its variance from the data. The C test is then repeated on the remaining variance data. The process continues until all remaining outliers (from labs f2 to fm) have been identified and removed. Once all outliers have been omitted, we accept the alternative hypothesis
H1: σj > σ(poolΦ); j ϵ Φ (variances from labs f1 to fm are too large).
Neither the k test nor the C test defines lower limit critical values. Consequently, these outlier tests do not apply if H1: σj < σ(poolΦ) or if H1: σj σ(poolΦ).


2.7. Limitations of Cochran’s C test

While we would use the C test rather than the k test for identifying variance outliers, we recognize that the C test does have limitations and drawbacks:

  1. By definition, Cj Eq. (10) gives equal weight to all standard deviations Si. This is realistic for balanced designs, but not for unbalanced designs. Balanced designs are preferred, but unbalanced designs do occur in practice.
  2. There are tables with upper limit critical values CUL to decide if Sj is exceptionally large, but there are no tables with lower limit critical values CLL to decide if Sj is exceptionally small.
  3. As there are no tables on CLL, one cannot perform two-sided tests.
  4. A high Sj value may bemarked as outlier, while the actual outlier is at low standard deviations. See also § 3.8.
  5. Tables on CUL are available only for α = 0.01 and α = 0.05.
  6. For balanced designs, the tables on CUL are complete [19] for 2 n 6 and 2  L 40. For n > 6 and/or L > 41, only selected values are available.
  7. Lengthy look-up tables discourage the incorporation of the C test in dedicated software.
  8. Published tables contain transcription errors. For α = 0.05: Taylor [6] at (n = 2, L = 3) and (n = 10, L = 5); Van Belle [25] at (n = 11, L = 8); ISO 5725 [19] at (n = 6, L = 13).  For CUL at (n = 11, L = 8) we calculate 0.2829.  Appendix C presents the other corrected values.


3. Theoretical development

3.1. G ratio

To handle the limitations in § 2.7 we generalize Eq. (10):
 where Gj is the generalized test statistic for lab j.

3.2. Relation to F ratio

We divide the numerator and denominator by Σνi and substitute Eqs. (1) and (2): 
We invert Eq. (12) and separate Sj from Spool: 
We identify the term (S(pool−j)/Sj with the inverse of Fisher’s F ratio according to Kreyszig [29]:



Fj has νj degrees of freedom in the numerator and ν(poolj) = (νpool νj) degrees of freedom in the denominator. We invert Eq. (13) and substitute Eqs. (14) and (4), to relate the G ratio to Fisher’s F ratio: 
 
Critical F values are available from Excel [31]: 

         Fc = FINV(ζ, νj, νpool νj                                                                                                                      (16)

When calculating F ratios it is common practice [1,2,6–8] to require that the larger of the two variance values is placed in the numerator: F = (Smax)²/(Smin)². This convention has the advantage that Fc only needs to be tabulated at significance levels < 0.5. However, following this usage would complicate our treatment in § 3.3 § 3.7, so we prefer to adhere to the definition from Kreyszig [29]. Here we note that the Excel function FINV [31] will accept any significance level 0 ζ 1.

3.3. One-sided critical values

For estimating the critical values for one-sided G tests at an overall significance level α1, we apply the Bonferroni adjustment Eq. (6) and set:

  upper limit: ζUL(1) = α1/L                                                                                                                      (17)
  lower limit: ζLL(1) = 1 ζUL(1) = 1 α1/L                                                                                          (18)

We combine Eqs. (15) (18) to calculate critical values for G:
When performing the G test, we check if the following inequalities are met:

  upper limit GUL test : Gj < GUL (α1, νj, vpool, L)                                                                               (21)
  lower limit GLL test : GLL (α1, νj, νpool, L) < Gj                                                                                (22)

3.4. Two-sided critical values

For estimating the critical values for two-sided G tests at an overall significance level α2, we again apply Eq. (6) and set:
  upper limit: ζUL(2) = α2/2L                                                                                                                     (23)
  lower limit: ζLL(2) = 1 ζUL(2)= 1 α2/2L                                                                                           (24)

We combine Eqs. (15) and (16) and Eqs. (23) and (24) to calculate critical G values:


When performing the G test, we check if the following inequality is met:

two-sided test : GLL (α2, νj, νpool, L) < Gj < GUL (α2, νj, νpool, L)                                                   (27)

3.5. Balanced designs

For balanced designs Eqs. (19)-(20) and (25)-(26) reduce to:

3.6. Multiple outliers

ISO 5725-2 reads [19]: “7.3.3.6 If the highest standard deviation is classed as an outlier, then the value should be omitted and Cochran’s test repeated on the remaining values. (. . .)”. So if several potential outliers are identified, only the highest value should be eliminated in the first round.
For the G test, we generalize this to:
If several standard deviations are classed as outliers, omit the value that is most likely to be a true positive, and repeat the G test on the remaining values.
Our generalization allows for coexisting outliers at high and/or low variances and recognizes that – for unbalanced designs – the probability that a given standard deviation is an outlier will not only depend on its value, but also on its number of degrees of freedom Inequality (27). A particular numerical value may pass the G test at one degree of freedom, while failing the G test at two degrees of freedom. To develop a criterion for deciding which standard deviation is most likely to be an outlier, we rearrange Eq. (15):


The probability to obtain an F ratio Fj by coincidence is available from Excel [31]:

pF(j) = FDIST(Fj, νj, (νpool νj))                                                                                                            (33)

pF(j) has the potentially confusing property that it decreases with increasing standard deviation. We therefore define:

γj = 1 pF(j) = 1 FDIST(Fj, νj, (νpool νj))                                                                                         (34)

This criterion may also be written as:
At exceptionally small standard deviations, γj is close to zero, while at exceptionally large standard deviations, γj approaches one. Eqs. (34) and (35) are equivalent [7,29]. However, Eq. (34) is sensitive to rounding errors when evaluating small standard deviations, while Eq. (35) is sensitive to rounding errors when evaluating large standard deviations.
When conducting a two-sided G test, it is convenient to use a decision parameter that is either equal to γj, or equal to 1 γj, whichever value is smallest. For this purpose we define:

δj minimum (γj, (1 γj))                                                                                                                        (36)

3.7. Conducting a G test
To conduct a one-sided upper limit GUL test at the α1 significance level, we proceed as follows:
We determine γj Eq. (34) for each data series 1 j L. If all γj (1 α1/L), we accept H0: σj = σpool; 1 j L. If any γj > (1 α1/L), we reject H0, we label the data series with the largest γ as “f1”, and we omit its variance from the data. The test is then repeated on the remainder of the data. The process continues until all remaining variance outliers (from data series f2 to fm) have been identified and removed. Once all outliers have been removed, we accept the alternative hypothesis H1: σj > σ(poolΦ); j ϵ Φ (variances from data series f1 to fm are too large).
To conduct a one-sided lower limit GLL test at the α1 significance level, we proceed as follows:
We determine γj Eq. (35) for each data series 1 j L. If all γj α1/L, we accept H0: σj = σpool; 1 j L. If any γj < α1/L, we reject H0, we label the data series with the smallest γ as “f1”, and we omit its variance from the data. The test is then repeated on the remainder of the data. The process continues until all remaining variance outliers (from data series f2 to fm) have been identified and removed. Once all outliers have been removed, we accept the alternative hypothesis H1: σj < σ(poolΦ); j ϵ Φ (variances from data series f1 to fm are too small).
To conduct a two-sided G test at the α2 significance level, we proceed as follows:
We determine δj Eq. (36) for each data series 1 j L. If all δj α2/2L, we accept H0: σj = σpool; 1 j L. If any δj < α2/2L, we reject H0, we label the data series with the smallest δ as “f1”, and we omit its variance from the data. The test is then repeated on the remainder of the data. The process continues until all remaining variance outliers (from data series f2 to fm) have been identified and removed. Once all outliers have been removed, we accept the alternative hypothesis H1: σj σ(poolΦ); j ϵ Φ (variances from data series f1 to fm are deviant).

3.8. Artificial outliers

An extreme standard deviation Sj will induce an extreme G ratio Gj Eq. (12). However, as an extreme standard deviation will also influence the pooled standard deviation Spool Eq. (1), the G ratios Gij of the other standard deviations are affected too. Fig. 1 illustrates the effect for a balanced design with four labs. Each lab runs ten replicates. The standard deviation of lab 1 is allowed to vary between 0.1 and 10; the standard deviations at labs 2 4 remain constant at S2 = 0.68, S3 = 1.00 and S4 = 1.24. The G ratios are compared to the critical values (GLL and GUL) for one-sided lower limit and upper limit G tests at a significance level α1 = 0.05. 


Fig. 1  Effect of a varying standard deviation S1
on the position of G1, G2, G3 and G4.
In the center of the plot, at 0.50 < S1 < 1.69, all G ratios range between GLL and GUL. Here, when we conduct a one-sided G test (to either side), we will find that we can accept the null hypothesis H0: σj = σpool. This result is satisfactory: in this part of the plot the four labs have similar standard deviations, so H0 should not be rejected.



If we allow S1 to decrease, to 0.39 < S1 < 0.50, G1 will drop below GLL, while the other G ratios remain between GLL and GUL. When conducting a GUL test,wewill accept H0 as before, but when we conduct a GLL test we will reject H0 in favor of the alternative hypothesis H1: σ1 < σ(pool1). This outcome is satisfactory too: apparently S1 has become sufficiently small to be considered an outlier.


If we let S1 decrease further, below 0.39, we still find G1 < GLL, so the outcome of a GLL test will remain unchanged H1: σ1 < σ(pool1). However, for S1 < 0.39 we now find G4 > GUL. Consequently, when we perform a GUL test, we must reject H0 and accept the alternative hypothesis H1: σ4 > σ(pool4). Apparently, the net effect of S1 being extremely small is that S4 may be identified as a high extreme in the GUL test! We conclude that if a data set contains an extremely low variance value, the GUL test may generate an “artificial outlier”: the observation of an outlier at high standard deviations may be caused by the presence of a more extreme value at low standard deviations. The opposite problem can occur as well: A GLL test may mark a low standard deviation as an outlier, due to the presence of a more extreme value at high standard deviations.
The occurrence of artificial outliers can be avoided by using a two-sided test: in the first round only the variance with the lowest δ value Eq. (36) will be considered for omission. Therefore we recommend the two-sided G test – even if one is primarily interested in identifying extremely high (low) variances, unless the existence of extremely low (high) variances can be ruled out beforehand (§ 2.2). In the latter case, we would use a one-sided G test because one-sided tests are more powerful than two-sided tests (at the same overall significance level) [1,7,8].


4. Validation

4.1. Validation of critical values for the G test

If we test at an overall significance level α, we run a risk α to obtain a “false positive”; this is the risk to conclude that any of the participating labs is showing a deviant repeatability even if the true repeatability at all labs is the same (§ 2.4). Put differently, if the true repeatability standard deviations at all labs would be identical, and if we could run an infinite number of interlaboratory studies to determine these standard deviations, 1% of the studies should fail the G test at the 0.01 significance level, 5% of the studies should fail the G test at the 0.05 significance level, etc. We used this notion to validate Eqs. (19) (20), (25) (26) and (28) (31) by simulations in Excel. Five labs were supposed to participate in an ILS to determine the repeatability precision of a test method. Both a balanced design and a severely unbalanced design were conducted, as outlined in Table 1.
 
Table 1  Outline of two ILS designs.
 

 
Table 2  Lower limit and upper limit critical G values for designs 1 and 2.



Replicates were randomly drawn from a normal distribution with μ = 0 and σ = 1 using ZNORMAL (mean, SD) [32]. We calculated the standard deviation Si for each separate lab i, the pooled standard deviation over all labs Eq. (1), and the test statistic Gi for each separate lab Eq. (11). For each Gi value we then checked inequality (27) at selected overall significance levels α2. The used limit values for design 1 (Eqs. (28) (31)) and design 2 (Eqs. (19), (20), (25) and (26)) are supplied in Table 2. The limit values for 0.01 α2 0.10 are practically relevant; the other limit values are considered for validation purposes.
Each design was run 106 times, counting how often Gi failed inequality (27). The observed percentages of false negatives closely matched prediction, at any overall significance level. Table 3 shows the validation results for design 2 as an example. We conclude that the Bonferroni adjustment Eq. (6) holds for any 0 α2 1 and that Eqs. (19), (20), (25), (26) and (28) (31) are satisfactory.


Table 3  Validation results for design 2 (unbalanced design).





Lab 1
Lab 2
Lab 3
Lab 4
Lab 5
False Positives
α1
α2
Limit
Range
G1
G2
G3
G4
G5
ΣGi
Observed
Expected
0.001
0.002
LL
0 - 0.1%
185
173
194
191
211
954
0.095%
0.1%
0.005
0.01
LL
0 - 0.5%
939
1011
957
1009
1027
4943
0.494%
0.5%
0.01
0.02
LL
0 - 1 %
1925
2022
1933
2028
2033
9941
0.994%
1.0%
0.025
0.05
LL
0 - 2.5 %
4929
4975
4957
5070
5062
24993
2.499%
2.5%
0.05
0.1
LL
0 - 5 %
9986
10053
9947
10111
10152
50249
5.025%
5%
0.1
0.2
LL
0 - 10 %
19939
20146
19834
20359
20218
100496
10.050%
10%
0.25
0.5
LL
0 - 25 %
50245
50120
49786
50320
49754
250225
25.023%
25%
0.5
1
LL
0 -50 %
100351
100078
99817
99754
99758
499758
49.976%
50%












0.5
1
UL
50 - 100 %
100326
99836
100175
99902
99982
500221
50.022%
50%
0.25
0.5
UL
75 - 100 %
50232
49809
49785
50095
49952
249873
24.987%
25%
0.1
0.2
UL
90 - 100 %
20116
20065
20175
20116
19763
100235
10.024%
10%
0.05
0.1
UL
95 - 100 %
9994
10125
10059
10077
9860
50115
5.012%
5%
0.025
0.05
UL
97.5 - 100 %
4940
5056
5142
5036
5018
25192
2.519%
2.5%
0.01
0.02
UL
99 - 100 %
1911
2001
2045
2019
2067
10043
1.004%
1%
0.005
0.01
UL
99.5 - 100 %
944
988
1011
1008
1037
4988
0.499%
0.5%
0.001
0.002
UL
99.9 - 100 %
194
188
194
190
174
940
0.094%
0.1%


4.2. Validation of upper limit values for the C test

For validating Eq. (28), we are not restricted to simulation experiments (§ 4.1), as there is literature data on CUL at α1 = 0.01 and α1 = 0.05. We define:

·         CUL() = CUL (α1, n, L) from literature
·         CUL(F) = CUL(α1, n, L) from Eq. (28)
·         ΔCUL = CUL(F) CUL()

                We copied the CUL() data from Van Belle [25] at α1 = 0.01 and α1 = 0.05. We replaced the erroneous value at α1 = 0.05, n = 11, and L = 8 by the correct value from Zwillinger [26]. We then calculated CUL(F) at α1 = 0.01 and α1 = 0.05 from inequality (27) and the corresponding difference values ΔCUL. Both at α1 = 0.01 and α1 = 0.05 we observed that for all CUL(F) values, |ΔCUL| 8×104. This observation confirmed that Eq. (28) is satisfactory.


5. Results

5.1. Reference tables

                 Limit values for Cochran’s C test are usually read from tables. For this purpose Appendices A–C supply reference values for CUL (α1, n, L) at α1 = 0.01, 0.025 and 0.05, while Appendices D–F supply reference values for CLL (α1, n, L) at α1 = 0.01, 0.025 and 0.05. CUL and CLL were calculated with Eqs. (28) and (29). The limit values also apply to two-sided tests at α2 = 0.02, 0.05 and 0.10 Eqs. (30) and (31).

5.2. Power of the G test

                The power of the G test ηG Eq. (7) depends on α, as well as the effect size (σj /σ0), the numbers of degrees of freedom νj and νpool, and the number of labs L. To get an impression of ηG we simulated four interlaboratory studies according to designs 1 and 2 of Table 1. Details on the set-up of the studies are presented in Table 4.

Table 4  Set-up of four simulated interlaboratory studies.
                                                                                                                                              



Simulation


Simulation





1


2
3
4
Lab (i)

νi
σi

νi
σi
σi
σi
1

9
0 to 6

1
10-3 to 103
1
1
2

9
1

2
1
1
1
3

9
1

9
1
10-3 to 103
1
4

9
1

14
1
1
1
5

9
1

19
1
1
10-3 to 103









Σ

45


45




                Except for σi = 0, replicates were randomly drawn from a normal distribution with μ = 0, using ZNORMAL (mean, SD) [32]. We calculated the standard deviation for each separate lab, the pooled standard deviation over all labs Eq. (1), and test statistic Gj for each separate lab Eq. (11).
                We ran simulation 1 at selected σ1 values between 0 and 6. For each σ1 value we completed 2 × 105 runs. Each run we checked if G1 met inequality (27) at α2 = 0.002, α2 = 0.1 and α2 =1 (Table 2, ν = 9). For each series of 2×105 runs value we calculated the fraction of runs that met inequality (27). This fraction was identified with the overall probability β to obtain a “false negative”. The corresponding power was estimated from Eq. (7): η 1 β. If the effect size (σ1/σ0) 1, only “false negatives” and “true positives” may occur. Consequently, if (σ1/σ0) 1, η may also be identified with the fraction of “true positives”, i.e. the fraction of runs that fails inequality (27). The results are shown in Fig. 2.
  
Fig. 2  Power curves for regular outliers in simulation 1.

                As expected, the power of the G test is greatest for α2 = 1.  However, at any overall significance level α2 > 0, the G test will identify standard deviations that are sufficiently deviant, since ηG tends to 1 both for (σ1/σ0) approaching 0 and for (σ1/σ0) approaching . For low effect size this favorable behavior comes as a bonus. According to Legendre and Borcard [28] the power of Cochran’s C test at α = 0.05 is 0.3 at best for (σ1/σ0) approaching 0. Here we must realize that although the C test is a one-sided upper limit test, it will occasionally detect heteroscedasticity due to extremely low variances, through the occurrence of “artificial outliers” at high variance values (§ 3.8).


                We refer to the corresponding points of the power curve as “upper limit artifacts” AUL. The power data on AUL are shown in Fig. 3 (solid lines). Note that while the AUL data are observed at high standard deviations, they are caused by extremely low standard deviations. Therefore the AUL data pose a separate branch of the power curve at small (σ1/σ0). Using Eq. (29) we can also define “lower limit artifacts” ALL at high (σ1/σ0), by checking if Gj (j = 2, 3, 4 or 5) fails inequality (27). The obtained power data are again included in Fig. 3 (dashed lines).
                Fig. 4 compares ηA and ηG at α2 = 0.1. The artificial branches are denoted AUL and ALL, while the regular branches are denoted GLL and GUL. Fig. 4 shows that AUL indeed has low power but that the regular branch GLL resolves this issue.
                                                                                                                                                                                                                   

                Simulations 2 4 were run (at α2 = 0.1) to verify the performance of the G test for unbalanced designs. The obtained power data are shown in Figs. 5 and 6. Fig. 5 again shows the poor power of AUL for identifying small standard deviations, especially at low degrees of freedom, while Fig. 6 confirms that the G test resolves the issue.

 

6. Discussion

6.1. Comparison with Mandel’s k test

                Although different in appearance, Mandel’s k test is related to Cochran’s C test and the G test, as all three tests are outlier tests. Besides, both ASTM E 691 and ISO 5725 offer Mandel’s k test as a graphical way to evaluate homogeneity of variances. We therefore pay special attention to this test. Mandel’s k test evaluates kj:
kj ≡ Sj / Spool                                                                                                                                           (8a)

When we rewrite Eq. (8a) and then substitute Eqs. (1) and (11) we obtain:
Combining Eqs. (37) and (19) we derive for the upper limit critical values of k:
For balanced designs Eq. (38) reduces to:
Eq. (39) generates upper limit critical values for k, and is identical to the original formula Eq. (9) except that Eq. (9) evaluates Fc at α1 rather than at α1/L. Apparently, Mandel’s k test does not apply a Bonferroni adjustment Eq. (6). One should be aware of this complication when performing the k test. The k test should only be used for a visual comparison of variances. For quantitative purposes we recommend the G test.

6.2. Comparison with Cochran’s C test

ISO 5725-2 presents a concise discussion of Cochran’s C test, from which we quote in italics [19]:

                7.3.3.3 Cochran’s criterion applies strictly only when all the standard deviations are derived from the same number (n) of test results obtained under repeatability conditions. In actual cases, this number may vary (. . .). This part of ISO 5725 assumes, however, that in a properly organized experiment such variations (. . .) will be limited and can be ignored, and therefore Cochran’s criterion is applied using for n the number of test results occurring in the majority of cells.
                The G test handles variations in the number of test results accurately Eqs. (19) and (34).

                7.3.3.4 Cochran’s criterion tests only the highest value in a set of standard deviations and is therefore a one-sided outlier test. Variance heterogeneity may also, of course, manifest itself in some of the standard deviations being comparatively too low. However, small values of standard deviation may be very strongly influenced by the degree of rounding of the original data and are for that reason not very reliable. In addition it seems unreasonable to reject data from a laboratory because it has accomplished a higher precision in its test results than the other laboratories. Hence Cochran’s criterion is considered adequate.

                Here we have concerns. In § 3.8 we argue that the C test may identify an artificial outlier at high variances, while the actual outlier is at low variances. This issue is resolved by the two-sided G test. Only if we would have a-priori reasons to assume that outliers could only occur at a given side of the variance range, we would prefer a one-sided G test.

                7.3.3.5 A critical examination (. . .) may sometimes reveal that the standard deviations from a particular laboratory are at all or at most levels lower than those of other laboratories. This may indicate that the laboratory works with a lower repeatability standard deviation than the other laboratories (. . .). If this occurs it should be reported to the panel, which should then decide whether the point is worthy of a more detailed investigation.
                The two-sided G test will confirm if the low standard deviations indeed differ significantly from the other standard deviations. If there are a-priori reasons to suspect the occurrence of extremely low variances only, a one-sided G test should be performed (Eqs. 20 and 35).

                7.3.3.6 If the highest standard deviation is classed as an outlier, then the value should be omitted and Cochran’s test repeated on the remaining values. This process can be repeated but it may lead to excessive rejections when (. . .) the underlying assumption of normality is not sufficiently well approximated to. Repeated application of Cochran’s test is here proposed only as a helpful tool in view of the lack of a statistical test designed for testing several outliers together. Cochran’s test is not designed for this purpose and great caution should be exercised in drawing conclusions.
                Similar considerations apply to the G tests, although the two-sided G test is less likely to produce erroneous outliers than the one-sided G tests. Still, caution is also recommended in drawing conclusions from repeated application of the two-sided G test, as any G test will assume that the evaluated standard deviations are derived from normally distributed data sets.


7. Example case

                We illustrate a practical application of the two-sided G test. Table 5 is supposed to summarize historical repeatability data for a given test method on eight different samples. Each sample is from a different product (1 i 8). The data sets have unequal size. There is no a-priori reason to assume superior or inferior method precision for any product.

Table 5  Assumed historical repeatability data.


                We want to determine with 95% confidence (α2 = 0.05) if any product(s) could show a standard deviation that is significantly different from the other products. We therefore perform a two-sided G test according to § 3.7 as outlined in Table 6:

Table 6  Two-sided G test; cycle 1.
               Both δ2 and δ8 are < (α2/2L = 0.05/16 ) 0.0031; therefore S2 and S8 both qualify as potential outliers. However, δ2 < δ8, so we postpone judging S8. For the time being, we conclude that Sample 2 has a significantly lower standard deviation than the other samples. We omit S2, and repeat the test on the other standard deviations (Table 7).

Table 7  Two-sided G test; cycle 2.
                Now δ1 < (α2/2L = 0.05/14) 0.0036, so S1 qualifies as a potential outlier. As there are no other potential outliers identified, we decide that Sample 1 has a significantly lower standard deviation than Samples 3 to 8. We omit S1, and repeat the test on the standard deviations from Samples 3 to 8 (Table 8).

Table 8  Two-sided G test; cycle 3.


No additional outliers are observed. We decide to accept the standard deviations from Samples 3 to 8.


8. Conclusion

Cochran’s C test has been successfully extended:

To cover unbalanced designs.
To identify variances that are significantly smaller than other variances.
To avoid artificial outliers.
To perform two-sided tests.

                Expressions have been developed to calculate lower limit and upper limit critical values at any significance level and for any (either balanced or unbalanced) design.


Acknowledgements

                Thanks are due to The Dow Chemical Company for kind permission to publish this study. Special thanks are due to Erik Jan van den Heuvel for guidance in generating random data and in developing dedicated software, to Enric Comas for help in evaluating existing software, to Alison Delmage, Jill Doiron, Phil Gaarenstroom, Karen Pell and the Chemometrics Group for critically reading the manuscript, to Swee-Teng Chin and Randy Pell for technical review, to Norman Deck for customer support and to Leslie Fan and Lloyd Colegrove for faith and managerial support.


Appendices

Appendices A–F supply upper limit values CUL (α1, n, L) and lower limit values CLL (α1, n, L) for one-sided tests at α1 = 0.01, 0.025 and 0.05. The limit values also apply to two-sided tests at α2 = 0.02, 0.05 and 0.10.  You may visit Appendices A-F using the links below:
Upper limit values CUL at significance levels α1 = 0.01 (one-sided test) and α2 = 0.02 (two-sided test); Eqs. 28 and 30
Upper limit values CUL at significance levels α1 = 0.025 (one-sided test) and α2 = 0.05 (two-sided test); Eqs. 28 and 30

Upper limit values CUL at significance levels α1 = 0.05 (one-sided test) and α2 = 0.10 (two-sided test); Eqs. 28 and 30

Lower limit values CLL at significance levels α1 = 0.01 (one-sided test) and α2 = 0.02 (two-sided test); Eqs. 29 and 31

Lower limit values CLL at significance levels α1 = 0.025 (one-sided test) and α2 = 0.05 (two-sided test); Eqs. 29 and 31

Lower limit values CLL at significance levels α1 = 0.05 (one-sided test) and α2 = 0.10 (two-sided test); Eqs. 29 and 31

References


Use this link to visit the References.