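Bonferroni correction: if n independent hypotheses are tested simultaneously on the same data set, the significance level used for each hypothesis should be 1/n of the level that would be used if only one hypothesis were tested.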

Introduction

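For example, suppose two independent hypotheses are to be tested on the same data set at the common significance level of 0.05. Each of the two hypotheses should then be tested at the stricter level of 0.025, that is, 0.05 * (1/2). The method was developed by Carlo Emilio Bonferroni, hence the name Bonferroni correction.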
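The two-hypothesis example can be sketched in R as follows. The p-values are hypothetical and chosen only for illustration. Comparing raw p-values with alpha / n and comparing Bonferroni-adjusted p-values (from p.adjust) with alpha should lead to the same decisions.

```r
alpha <- 0.05
n <- 2

# Per-test threshold under the Bonferroni correction.
per_test_alpha <- alpha / n          # 0.025

# Hypothetical p-values for the two hypotheses (illustration only).
p <- c(0.03, 0.01)

# Decision 1: compare the raw p-values with the stricter threshold.
reject_threshold <- p <= per_test_alpha

# Decision 2: multiply the p-values by n (capped at 1) and compare with alpha.
reject_adjusted <- p.adjust(p, method = "bonferroni") <= alpha

print(per_test_alpha)
print(reject_threshold)
print(reject_adjusted)
```

Here only the second hypothesis is rejected, whichever of the two equivalent forms is used.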

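The rationale is the following: when many hypotheses are tested on the same data set, about one in every 20 of them can be expected to reach the 0.05 significance level purely by chance.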
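The arithmetic behind this rationale can be sketched in R. The sketch assumes that every tested null hypothesis is true and that the tests are independent; the variable names are illustrative only.

```r
alpha <- 0.05
n_tests <- 20

# Expected number of tests that reach significance purely by chance.
expected_false_positives <- n_tests * alpha

# Probability of at least one false positive without any correction
# (assumes independent tests and all null hypotheses true).
fwer_uncorrected <- 1 - (1 - alpha)^n_tests

# The same probability when each test uses the Bonferroni level alpha / n_tests.
fwer_bonferroni <- 1 - (1 - alpha / n_tests)^n_tests

print(expected_false_positives)
print(round(fwer_uncorrected, 4))
print(round(fwer_bonferroni, 4))
```

Under these assumptions the uncorrected chance of at least one false positive is about 0.64, while the Bonferroni level keeps it just below 0.05.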

Original Wikipedia text

Bonferroni correction

Bonferroni correction states that if an experimenter is testing n independent hypotheses on a set of data, then the statistical significance level that should be used for each hypothesis separately is 1/n times what it would be if only one hypothesis were tested.

For example, to test two independent hypotheses on the same data at 0.05 significance level, instead of using a p value threshold of 0.05, one would use a stricter threshold of 0.025.

The Bonferroni correction is a safeguard against multiple tests of statistical significance on the same data, where 1 out of every 20 hypothesis-tests will appear to be significant at the α = 0.05 level purely due to chance. It was developed by Carlo Emilio Bonferroni.

A less restrictive criterion is the rough false discovery rate, giving (3/4) * 0.05 = 0.0375 for n = 2 and (21/40) * 0.05 = 0.02625 for n = 20.
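The two values quoted above are consistent with a threshold of alpha * (n + 1) / (2 * n). That formula is an assumption inferred from the two worked values, not something stated in the quoted text, and the function name below is hypothetical. A minimal R sketch comparing it with the Bonferroni threshold:

```r
# Assumed form of the "rough false discovery rate" threshold,
# inferred from the two worked values in the text.
rough_fdr_threshold <- function(alpha, n) alpha * (n + 1) / (2 * n)

# Bonferroni threshold for comparison.
bonferroni_threshold <- function(alpha, n) alpha / n

rough_2  <- rough_fdr_threshold(0.05, 2)    # (3/4)   * 0.05
rough_20 <- rough_fdr_threshold(0.05, 20)   # (21/40) * 0.05

print(c(rough_2, bonferroni_threshold(0.05, 2)))
print(c(rough_20, bonferroni_threshold(0.05, 20)))
```

For n = 20 the rough criterion (0.02625) is roughly ten times more permissive than the Bonferroni threshold (0.0025).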

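The multiple testing problem comes up often in data analysis. In 1995 Benjamini proposed a method that limits the proportion of false positives among the selected results. In statistical terms, this amounts to requiring that the FDR not exceed a preset level such as 5%.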

According to the theorem Benjamini proved in his paper, the procedure for controlling the FDR is actually very simple.

Suppose there are m candidate genes in total, and that their p-values, sorted from smallest to largest, are p(1), p(2), ..., p(m).

The False Discovery Rate (FDR) of a set of predictions is the expected percentage of false predictions in that set. For example, if the algorithm returns 100 genes with a false discovery rate of 0.3, then we should expect 70 of them to be correct.

The FDR is very different from a p-value, and as such a much higher FDR can be tolerated than with a p-value. In the example above, a set of 100 predictions of which 70 are correct might be very useful, especially if there are thousands of genes on the array, most of which are not differentially expressed. In contrast, a p-value of 0.3 is generally unacceptable in any circumstance. Meanwhile, an FDR as high as 0.5 or even higher might be quite meaningful.

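The FDR error-control method, proposed by Benjamini in 1995, determines the p-value threshold by controlling the FDR (False Discovery Rate). Suppose you select R genes as differentially expressed, of which S are truly differentially expressed and the other V are not, that is, they are false positives. In practice one wants the error proportion Q = V/R not to exceed, on average, some preset value (such as 0.05). In statistical terms, this is equivalent to controlling the FDR at no more than 5%.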
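The quantities R, S, V and Q can be made concrete with a small R sketch. The truth labels below are hypothetical; in a real experiment they are unknown, which is why the FDR is defined as the expected value of Q rather than Q itself.

```r
# Hypothetical ground truth for 10 selected genes
# (TRUE = the gene really is differentially expressed).
truly_de <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)

R <- length(truly_de)   # number of selected genes
S <- sum(truly_de)      # true discoveries
V <- sum(!truly_de)     # false discoveries (false positives)

# Proportion of false discoveries; taken as 0 when nothing is selected.
Q <- if (R > 0) V / R else 0

print(c(R = R, S = S, V = V))
print(Q)
```

In this illustration Q = 0.3, matching the earlier example in which 70 of 100 predictions are expected to be correct.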

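Sort the p-values of all candidate genes from smallest to largest. To keep the FDR at or below q, find the largest positive integer i such that p(i) <= (i*q)/m. Then select the genes corresponding to p(1), p(2), ..., p(i) as differentially expressed; this guarantees, in the statistical sense, that the FDR does not exceed q. Accordingly, the FDR-adjusted p-value is computed as follows: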

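adjusted p-value(i) = p(i) * m / i, that is, p * length(p) / rank(p) in R notation

Note that R's p.adjust additionally takes a running minimum starting from the largest p-value, so that the adjusted values never decrease as p increases, and caps them at 1. The plain formula agrees with p.adjust only when it already yields such a non-decreasing sequence.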
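The selection rule "find the largest i with p(i) <= (i*q)/m" can be sketched in R as follows. The function name bh_select and the second p-value vector are hypothetical, introduced only for illustration; the result is cross-checked against R's built-in p.adjust with method = "BH".

```r
# Return the indices (in the original order) of the hypotheses selected
# by the step-up rule at FDR level q.
bh_select <- function(p, q = 0.05) {
  m <- length(p)
  ord <- order(p)                        # positions of p sorted ascending
  passes <- p[ord] <= seq_len(m) * q / m # p(i) <= i * q / m for each rank i
  if (!any(passes)) return(integer(0))   # nothing can be selected
  i_max <- max(which(passes))            # largest rank that passes
  sort(ord[seq_len(i_max)])              # select ranks 1..i_max
}

# p-values from the R session in this article.
p1 <- c(0.0003, 0.0001, 0.02)
sel1 <- bh_select(p1, q = 0.05)

# Hypothetical p-values where rank 1 fails its own threshold
# but is still selected because rank 2 passes.
p2 <- c(0.5, 0.024, 0.02)
sel2 <- bh_select(p2, q = 0.05)

print(sel1)
print(sel2)
print(which(p.adjust(p2, method = "BH") <= 0.05))
```

In the second vector the smallest p-value (0.02) exceeds its own threshold 0.05/3, yet it is selected because the second smallest (0.024) is below 2 * 0.05/3. This is why the rule asks for the largest passing i rather than stopping at the first failure.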

References

1. Audic, S. and J. M. Claverie (1997). The significance of digital gene expression profiles. Genome Res 7(10): 986-95.

2. Benjamini, Y. and D. Yekutieli (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 29: 1165-1188.

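Calculation: see the p.adjust function in the R statistical software:

> p <- c(0.0003, 0.0001, 0.02)
> p
[1] 3e-04 1e-04 2e-02
> # FDR adjustment; the third argument is the number of comparisons
> p.adjust(p, method = "fdr", length(p))
[1] 0.00045 0.00030 0.02000
> # The same values computed by hand: p * m / rank
> p * length(p) / rank(p)
[1] 0.00045 0.00030 0.02000
> length(p)
[1] 3
> rank(p)
[1] 2 1 3
> sort(p)
[1] 1e-04 3e-04 2e-02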
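In the session above, the hand formula and p.adjust happen to agree. A short sketch with hypothetical p-values shows a case where they differ, because p.adjust also takes a running minimum from the largest p-value downward and caps the result at 1. The variable names are illustrative.

```r
# Hypothetical p-values for which the plain formula is not monotone.
p <- c(0.5, 0.024, 0.02)
m <- length(p)

# Plain formula from the text.
naive <- p * m / rank(p)

# Built-in Benjamini-Hochberg adjustment.
adjusted <- p.adjust(p, method = "BH")

# The same adjustment written out by hand.
o  <- order(p, decreasing = TRUE)   # largest p-value first
ro <- order(o)                      # permutation that restores the input order
manual <- pmin(1, cummin(m / (m:1) * p[o]))[ro]

print(naive)
print(adjusted)
print(manual)
```

The plain formula gives 0.06 for the smallest p-value, which is larger than the 0.036 assigned to the next one; the running minimum lowers it to 0.036.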