Message-ID: <2046958597.14381.1463778378865.JavaMail.yahoo@mail.yahoo.com>
Date: 2016-05-20T21:06:18Z
From: data_analyst20bhrl at yahoo.com
Subject: Inverse normal transformation of ranked data

I am using ddply on a data set that contains 2+ million rows. I am trying to rank the values of a variable within groups and then transform the ranks to (approximate) z-scores, i.e., generate quantiles on the normal scale.
Here is some sample data for one group:
x <- c(NA, 0.3640951, 0.1175880, 0.3453916, 0.4214050, 0.7469022, 0.1091423, 0.6099482, NA, NA, 0.6786140, 0.1785854, 0.9750262, NA)

I have tried the following two alternatives:
(1) Using the qnorm function from the stats package in conjunction with the percent_rank function from the dplyr package.
For example:
y <- qnorm(percent_rank(x))
This produces -Inf and Inf for the extreme values in the sample data. This issue is resolved if I use the rank function from base R instead, for example:
y <- qnorm(rank(x, na.last = "keep", ties.method = "average") / length(x))
but if there are no NAs in a certain group, the upper extreme data point still evaluates to Inf, because the top rank can then equal length(x) and qnorm(1) is Inf.
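One way to keep both extremes finite is to divide by the number of non-missing values rather than length(x) and to shift the ranks by a small offset so that the argument to qnorm stays strictly between 0 and 1. The following is only a minimal sketch under assumptions: the helper name inv_normal is hypothetical, and the offset 3/8 (a Blom-type choice) is one common convention, not the only one.

```r
# Hypothetical helper: rank-based inverse normal transform.
# 'offset' is an assumed Blom-type constant; other values in [0, 0.5] are also used.
inv_normal <- function(x, offset = 3/8) {
  n <- sum(!is.na(x))  # count only non-missing values
  r <- rank(x, na.last = "keep", ties.method = "average")  # NAs stay NA
  # (r - offset) / (n - 2 * offset + 1) lies strictly inside (0, 1) for n >= 1
  qnorm((r - offset) / (n - 2 * offset + 1))
}

x <- c(NA, 0.3640951, 0.1175880, 0.3453916, 0.4214050, 0.7469022, 0.1091423,
       0.6099482, NA, NA, 0.6786140, 0.1785854, 0.9750262, NA)
y <- inv_normal(x)

# Per-group use with base R; 'g' is a made-up grouping vector for illustration.
g <- rep(c("a", "b"), each = 7)
y_by_group <- ave(x, g, FUN = inv_normal)
```

Because the denominator depends only on the non-missing count, the result should not change with the number of NAs in a group, and a group with no NAs no longer reaches qnorm(1).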
(2) Using the ztransform function from the GenABEL package:
For example: 
y <- ztransform(percent_rank(x))
This preserves the extreme values but produces one of the following types of errors when used on my full data set.

Error in ztransform(x) : trait is binary
OR
Error in ztransform(x) : trait is monomorphic
I suspect these errors may be due to the fact that there are very few observations and/or several missing values (NAs) within certain groups, but I am not sure since there are several hundred groups.
Is there a better way?