Chi-Square test and survey results

4 messages · gheine at mathnmaps.com, Jean V Adams, Jan van der Laan +1 more


gheine at mathnmaps.com

Tue, Oct 11, 2011 12:31 PM #

An organization has asked me to comment on the validity of their
recent all-employee survey.  Survey responses, by geographic region,
compared with the total number of employees in each region, were as
follows:

> ByRegion
          All.Employees Survey.Respondents
Region_1            735                142
Region_2            500                 83
Region_3            897                 78
Region_4            717                133
Region_5            167                 48
Region_6            309                  0
Region_7            806                125
Region_8            627                122
Region_9            858                177
Region_10           851                160
Region_11           336                 52
Region_12          1823                312
Region_13            80                  9
Region_14           774                121
Region_15           561                 24
Region_16           834                134

How well does the survey represent the employee population?
The chi-square test says, not very well:

> chisq.test(ByRegion)

        Pearson's Chi-squared test

data:  ByRegion
X-squared = 163.6869, df = 15, p-value < 2.2e-16

By striking three under-represented regions (3, 6, and 15), we get
a more reasonable, although still not convincing, result:

> chisq.test(ByRegion[setdiff(1:16,c(3,6,15)),])

        Pearson's Chi-squared test

data:  ByRegion[setdiff(1:16, c(3, 6, 15)), ]
X-squared = 22.5643, df = 12, p-value = 0.03166
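
For anyone who wants to reproduce the two results above, here is a minimal sketch. The counts are copied from the table in this post; the names full and reduced are only illustrative.

```r
# Counts copied from the table in the post.
ByRegion <- data.frame(
  All.Employees      = c(735, 500, 897, 717, 167, 309, 806, 627,
                         858, 851, 336, 1823, 80, 774, 561, 834),
  Survey.Respondents = c(142, 83, 78, 133, 48, 0, 125, 122,
                         177, 160, 52, 312, 9, 121, 24, 134),
  row.names = paste0("Region_", 1:16)
)

# Test on all 16 regions, then with regions 3, 6 and 15 dropped.
full    <- chisq.test(ByRegion)
reduced <- chisq.test(ByRegion[setdiff(1:16, c(3, 6, 15)), ])

print(full)
print(reduced)
```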

This poses several questions:

1)  Looking at a side-by-side barchart (proportion of responses vs.
proportion of employees, per region), the pattern of survey responses
appears, visually, to match fairly well the pattern of employees.  Is
this a case where we trust the numbers and not the picture?
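
The side-by-side barchart mentioned in question 1 could be drawn roughly as follows. This is only a sketch using base graphics; the object names shares and gap are illustrative, and the gap in percentage points is one possible way to put a number on what the picture shows.

```r
employees   <- c(735, 500, 897, 717, 167, 309, 806, 627,
                 858, 851, 336, 1823, 80, 774, 561, 834)
respondents <- c(142, 83, 78, 133, 48, 0, 125, 122,
                 177, 160, 52, 312, 9, 121, 24, 134)

# Each row holds one distribution across the 16 regions (each row sums to 1).
shares <- rbind(
  Employees   = employees / sum(employees),
  Respondents = respondents / sum(respondents)
)
colnames(shares) <- 1:16

# Paired bars per region: employee share next to respondent share.
barplot(shares, beside = TRUE, legend.text = TRUE,
        xlab = "Region", ylab = "Share of total")

# Respondent share minus employee share, in percentage points.
gap <- 100 * (shares["Respondents", ] - shares["Employees", ])
print(round(gap, 1))
```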

2) Part of the problem, ironically, is that there were too many
responses to the survey.  If we had only one-tenth the responses, but in
the same proportions by region, the chi-square statistic would look much
better (though with a warning about possible inaccuracy):

data:  data.frame(ByRegion$All.Employees, 0.1 * (ByRegion$Survey.Respondents))
X-squared = 17.5912, df = 15, p-value = 0.2848
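
The dependence of the statistic on the number of responses can be checked directly. This is a sketch that keeps the same two-column layout as in this post (employee totals next to respondent counts); the helper name stat is hypothetical, and suppressWarnings() only hides the small-expected-count warning mentioned above.

```r
employees   <- c(735, 500, 897, 717, 167, 309, 806, 627,
                 858, 851, 336, 1823, 80, 774, 561, 834)
respondents <- c(142, 83, 78, 133, 48, 0, 125, 122,
                 177, 160, 52, 312, 9, 121, 24, 134)

# Pearson statistic when the respondent counts are multiplied by `scale`
# and the regional proportions are left unchanged.
stat <- function(scale) {
  tab <- cbind(employees, scale * respondents)
  unname(suppressWarnings(chisq.test(tab))$statistic)
}

full  <- stat(1)    # all responses
tenth <- stat(0.1)  # one-tenth of the responses, same proportions

print(c(full = full, tenth = tenth, ratio = full / tenth))
```

The statistic shrinks roughly in proportion to the number of responses, even though the regional pattern is identical in both tables.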

Is there a way of reconciling a large response rate with an
unrepresentative response profile?  Or is the bad news that the survey
will give very precise results about a very ill-specified sub-population?

(Of course, I would put it in softer terms, like "you need to assess the
degree of homogeneity across different regions".)

3) Is chi-squared really the right measure of how representative the
survey is?

<<<<<<< >>>>>>>>>

Thanks for any help you can give - hope these questions make sense -

George H.

Jean V Adams

Wed, Oct 12, 2011 5:23 AM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20111012/e2549e09/attachment.pl>

Jan van der Laan

Wed, Oct 12, 2011 6:24 AM #

George,

Perhaps the site of the RISQ project (Representativity indicators for
Survey Quality) might be of use: http://www.risq-project.eu/. They
also provide R code to calculate their indicators.

HTH,
Jan



______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Greg Snow

Wed, Oct 12, 2011 2:34 PM #

The chisq.test function expects a contingency table: one column should hold the count of respondents and the other the count of non-respondents. Yours appears to give the total number of employees instead of the non-respondents, so the table is set up wrong to begin with.

A significant chi-square here just means that the proportion responding differs in some of the regions; it does not by itself tell you whether the sample is representative or not. What is more important (and is not in the data or in standard tests) is whether there is a relationship between why someone chose to respond and the outcomes of interest.
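
A minimal sketch of the table layout described above, using the counts from the original post. The column name Non.Respondents and the object names are just illustrative choices.

```r
employees   <- c(735, 500, 897, 717, 167, 309, 806, 627,
                 858, 851, 336, 1823, 80, 774, 561, 834)
respondents <- c(142, 83, 78, 133, 48, 0, 125, 122,
                 177, 160, 52, 312, 9, 121, 24, 134)

# Each employee is counted exactly once: as a respondent or a non-respondent.
tab <- cbind(Respondents     = respondents,
             Non.Respondents = employees - respondents)
rownames(tab) <- paste0("Region_", 1:16)

# Tests whether the response rate is the same in every region.
res <- chisq.test(tab)
print(res)

# Response rate per region, in percent, to see which regions drive the result.
print(round(100 * respondents / employees, 1))
```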

If you are concerned about different proportions responding, you could use post-stratification to correct for the imbalance when computing other summaries or tests. Region 6 will still give you problems, though: you will need to make some assumptions, possibly combining it with another region that is similar.
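
A rough sketch of what the post-stratification weights could look like. Two things here are assumptions for illustration only: pooling Region 6 with Region 5 (which region is really "similar" is a judgement call that the data cannot make for you), and the helper poststratified_mean, which is not from any package.

```r
employees   <- c(735, 500, 897, 717, 167, 309, 806, 627,
                 858, 851, 336, 1823, 80, 774, 561, 834)
respondents <- c(142, 83, 78, 133, 48, 0, 125, 122,
                 177, 160, 52, 312, 9, 121, 24, 134)

# ASSUMPTION: Region 6 has no respondents, so it is pooled with Region 5
# purely as an example of combining it with another region.
stratum <- paste0("Region_", 1:16)
stratum[6] <- "Region_5"

N_h <- tapply(employees, stratum, sum)    # employees per stratum
n_h <- tapply(respondents, stratum, sum)  # respondents per stratum
w_h <- N_h / n_h                          # employees represented by each respondent

# Hypothetical helper: post-stratified mean of an outcome, given the
# respondent mean in each stratum (same order as N_h).
poststratified_mean <- function(ybar_h, N_h) sum(N_h * ybar_h) / sum(N_h)

print(round(w_h, 2))
```

Each respondent's answers would then count with weight w_h for their stratum, so that under-represented regions are scaled back up to their share of the workforce.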

Throwing away data is rarely, if ever, beneficial.

Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

