multiple hypothesis testing
Vijaykumar Muley wrote:
Dear all,
I am Vijaykumar Muley, working as a senior research fellow. By training I am
a computational biologist without a strong knowledge of statistics. I have
done some analysis, which I explain below.
I have 10340 (X) binary profiles, all of the same length (N=845), which I
will call "gene profiles". For example:
    v1  v2  v3  v4  ...  vN
a    1   0   1   0  ...   1
b    0   0   1   0  ...   0
c    1   0   1   1  ...   1
d    0   1   1   1  ...   1
e    0   0   1   1  ...   1
.    .   .   .   .  ...   .
(up to 10340 profiles)
Then I have some other binary profiles of the same length (N=845), which I
will call "expression profiles":
     v1  v2  v3  v4  ...  vN
f1    1   0   1   0  ...   1
f2    0   0   1   0  ...   0
f3    1   0   1   1  ...   1
Now I am comparing profile f1 with all X gene profiles using the
hypergeometric distribution function. What I get is a p-value: the
probability that the similarity between profile f1 and each of the 10340
gene profiles arises by random chance alone.
For example:
#pair       p-value
f1,a        1e-20
f1,b        0.01
.
.
f1,10340    0.05
I am doing the same thing with f2 and f3.
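To make the comparison step concrete, here is a minimal Python sketch of a hypergeometric overlap test between two binary profiles. It assumes the p-value is the upper tail, i.e. the probability of seeing at least the observed number of shared 1s; the post does not say which tail is used. The function name `hypergeom_overlap_pvalue` is hypothetical, and the five-element vectors are the truncated example rows shown earlier, not real data.

```python
from math import comb

def hypergeom_overlap_pvalue(profile_a, profile_b):
    """Upper-tail hypergeometric p-value for the number of shared 1s.

    Assumption: the p-value is P(overlap >= observed overlap) when the 1s of
    profile_b are placed at random among the N positions.
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same length")
    n_total = len(profile_a)          # N (845 in the post)
    ones_a = sum(profile_a)           # number of 1s in the first profile
    ones_b = sum(profile_b)           # number of 1s in the second profile
    overlap = sum(x & y for x, y in zip(profile_a, profile_b))
    # Number of placements of the ones_b positions that hit at least
    # `overlap` of the 1s in profile_a.
    favourable = sum(
        comb(ones_a, i) * comb(n_total - ones_a, ones_b - i)
        for i in range(overlap, min(ones_a, ones_b) + 1)
    )
    return favourable / comb(n_total, ones_b)

# Truncated example rows from the post (5 of the N=845 positions).
f1 = [1, 0, 1, 0, 1]
gene_a = [1, 0, 1, 0, 1]
gene_b = [0, 0, 1, 0, 0]

print("f1,a", hypergeom_overlap_pvalue(f1, gene_a))
print("f1,b", hypergeom_overlap_pvalue(f1, gene_b))
```

Running this over every gene profile for f1, f2 and f3 would give one p-value per (expression profile, gene profile) pair.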
If we arrange this output in a more readable format, it looks like this:
     f1      f2      f3
a    1e-20   0.01    0.10
b    0.01    1e-9    0.02
c    1e-3    0.1     0.30
d    0.03    0.07    1e-5
e    1e-1    0.01    1e-9
.    .       .       .
(up to 10340 rows)
I hope this makes clear what type of output I am getting.
Now I want to perform a multiple hypothesis testing correction (p-value
adjustment) on this data, so that I get the statistically significant
associations between the various "expression profiles" and "gene profiles"
at a specific alpha level.
The most conservative method for p-value adjustment is Bonferroni, and there
are many others that are less conservative. I don't mind which method I use;
the problem is which number I should use to correct or adjust the p-values.
In the case of the Bonferroni correction, should I multiply each p-value by
10340 or, since I have compared 3 expression profiles against 10340 gene
profiles, should I multiply each p-value by 3*10340?
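The two options can be compared directly with a small Python sketch. The Bonferroni factor is the number of tests in the family over which the error rate is to be controlled: 10340 controls it within each expression profile separately, while 3*10340 = 31020 controls it across all comparisons at once. Which family is appropriate depends on the claim being made; the helper name `bonferroni` is hypothetical, and the p-values are the first five f1 entries from the example table.

```python
def bonferroni(pvalues, n_tests):
    """Multiply each p-value by the family size and cap the result at 1."""
    return [min(1.0, p * n_tests) for p in pvalues]

N_GENES = 10340        # gene profiles tested per expression profile
N_EXPRESSION = 3       # expression profiles f1, f2, f3

# f1 column of the example table (rows a to e only; illustrative).
f1_pvalues = [1e-20, 0.01, 1e-3, 0.03, 1e-1]

# Family = one expression profile against all genes.
per_profile = bonferroni(f1_pvalues, N_GENES)
# Family = every (expression profile, gene) comparison.
pooled = bonferroni(f1_pvalues, N_EXPRESSION * N_GENES)

for raw, one, everything in zip(f1_pvalues, per_profile, pooled):
    print(f"raw={raw:g}  x10340={one:g}  x31020={everything:g}")
```

The pooled adjustment is never smaller than the per-profile one, so any association that survives the 3*10340 correction also survives the 10340 correction.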
I am asking this for my own understanding. What I want to do is this: from
the gene p-value table above, I want to calculate the percentage of false
positives at each p-value cutoff from 0.0001 to 0.05, so that I can choose a
good cutoff as the significance level (alpha) and exclude the gene profiles
that are only weakly associated with all expression profiles. (If I am
correct, to do this I need to use other p-value correction methods, either
simulation-based, resampling, or Benjamini and Hochberg (B&H).)

Can anyone please advise me on p-value adjustment or p-value correction?
Statistically or technically, which number should I use for the correction,
10340 or 3*10340, given that I have three features to associate with the
same set of 10340 genes? Or, if I am wrong, can anyone tell me which
protocol I should follow to get a fair number of significant associations
between genes and expression profiles?

I am using the package "multtest" for p-value adjustment, but I do not
understand what to supply for the correction: should I give the p-values for
each expression profile separately, or give all the p-values together, i.e.
3*10340?

I have gone through many tutorials and articles on multiple hypothesis
testing but could not really work out what to do. Some of you may be
actively working on p-value adjustment / multiple hypothesis testing, so I
would appreciate any clues or suggestions. I will be grateful for your kind
help.

Sincerely,
Please do NOT reply to a digest when posting to the list; you should start a
new thread (or at the very least delete the digest to which you are replying
from your email).

You may be interested in the False Discovery Rate (FDR) methods proposed by
Benjamini & Hochberg [1] and various related work/papers/software [2][3].

Neil

[1] Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57:289-300
[2] http://genomics.princeton.edu/storeylab/qvalue/
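For readers who want to see what the Benjamini & Hochberg adjustment does, here is a minimal Python sketch of the step-up procedure. It assumes all p-values from all three expression profiles are pooled into one family, which is one possible choice rather than the only one. The names `bh_adjust` and `count_significant` are hypothetical, and the input is just the five example rows from the question, not the full 10340-row table. In R, `p.adjust(p, method = "BH")` performs the same adjustment.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def count_significant(adjusted, alpha):
    """Number of tests whose adjusted p-value is at or below alpha."""
    return sum(1 for q in adjusted if q <= alpha)

# Example rows a to e for f1, f2, f3, pooled into a single family.
table = {
    "a": [1e-20, 0.01, 0.10],
    "b": [0.01, 1e-9, 0.02],
    "c": [1e-3, 0.1, 0.30],
    "d": [0.03, 0.07, 1e-5],
    "e": [1e-1, 0.01, 1e-9],
}
pooled = [p for row in table.values() for p in row]
adjusted = bh_adjust(pooled)

# Number of associations kept at each target false discovery rate.
for alpha in (0.0001, 0.001, 0.01, 0.05):
    print(alpha, count_significant(adjusted, alpha))
```

With the adjusted values, a cutoff of alpha corresponds to an expected proportion of false positives among the reported associations of at most alpha, under the usual assumptions of the procedure.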
View this message in context: http://www.nabble.com/multiple-hypothesis-testing-tp22512331p22557450.html Sent from the R help mailing list archive at Nabble.com.