Hierarchical factors

I think you are perhaps unintentionally obscuring two issues. One is  
whether R might have the statistical functions to deal with such an  
arrangement, and here "mixed models" would be the phrase you ought to  
be watching for, while the other would be whether it would have pre- 
written data management functions that would directly support the  
particular data layout you might be getting from public-access gov't  
files. The second is what I _thought_ you were soliciting in your  
original posting. I was a bit surprised that no one mentioned the  
survey package, since I have seen it used in such situations,  but I  
cannot track down the citation at the moment. You might want to look  
at Gelman's blogs:

http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html

See also work on nested case within cohort designs:
http://aje.oxfordjournals.org/cgi/content/full/kwp055v1

And Damico's article:
"Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis  
Techniques in Health Policy Data"
R Journal, 2009, n. 2.
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf
David.

On May 5, 2010, at 10:23 PM, Marshall Feldman wrote:

> Thanks for sharing this, Ista.
>
> I've come to the conclusion that R doesn't have what I'm looking for,
> either in the base or the packages.
>
> Although your examples are insightful, the examples we've been
> discussing are deliberately easier than what one would expect in most
> serious applications. Imagine for instance that we're studying wage
> structures of industries in different geographic labor markets. We
> therefore might have four variables: wages, industries, occupations, and
> places. We might want to see if wage differentials are more or less
> constant or if they are higher in some geographic areas than in others.
> Since industries, occupations, and places are typically coded
> hierarchically as we've been discussing, we might want to figure out how
> to examine different wage levels within industries, etc. Doing this
> manually would require lots of w
> whereas conceptually  the
>
> On 5/4/2010 6:00 AM,
>> Message: 49
>> Date: Mon, 3 May 2010 13:22:59 -0400
>> From: Ista Zahn <istazahn at gmail.com>
>> To: Marshall Feldman <marsh at uri.edu>
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Hierarchical factors
>>
>> Hi Marshall,
>>
>> I'm not aware of any packages that implement these features as you
>> described them. But most of the tasks are already fairly easy in R -- see
>> below.
>>
>> On Mon, May 3, 2010 at 11:18 AM, Marshall Feldman <marsh at uri.edu> wrote:
>>>>
>>>> Thanks for getting back so quickly Ista,
>>>>
>>>> I was actually casting about for any examples of R software that  
>>>> deals with this kind of structure. But your question is a good  
>>>> one. Here are a few things I'd like to be able to do:
>>>>
>>>> Store data in R at the finest level of detail but easily refer to  
>>>> higher levels of aggregation. If the data include such higher  
>>>> levels, this is trivial, but otherwise I'd like to aggregate  
>>>> fairly easily. The following is not functioning code, but it  
>>>> should give you the idea:
>>>>
>>>> start with a data frame (call it d) having row.names = to the 6  
>>>> digit NAICS code and columns w/ various variables, assume one is  
>>>> named employment.
>>>> d[,"employment"]          # Would print all employment data
>>>> d["441222","employment"]  # Would print only Boat Dealer employment
>>>> d["44","employment"]      # Would print total employment for Retail Trade
>>>
>> d[,"employment"] # prints all employment data
>> d[rownames(d) == "441222","employment"] # prints only boat dealer employment
>> sum(d[grep("^44", rownames(d)),"employment"]) # prints total employment for retail trade
>>
>>
>>>>
>>>> Recursive nesting. I'm not sure how to convey this except with  
>>>> examples. Suppose the data frame also has a "wages" column with  
>>>> average weekly wages in the industry, and the industry code is  
>>>> also a factor variable (industry). So a simple analysis of  
>>>> variance might look like:
>>>>
>>>>     w <- aov(wages ~ industry, d)
>>>>
>>>> But now what I'd like to do is to break this down within
>>>> 2-digit sectors. Assuming the data frame has another variable,
>>>> industry2, this would look like:
>>>>
>>>>     w <- aov(wages ~ industry2/industry)
>>>>
>>>> But what if we either (a) don't want to bother creating
>>>> separate variables for each level of aggregation in industry or
>>>> (b) want to extend the model formula language to include
>>>> various nesting strategies? This might look like:
>>>>
>>>>     # Nest all meaningful levels: industry/industry2/industry3/
>>>>     # industry4/industry5/industry6. If the coding system skips some
>>>>     # levels, R is smart enough to omit the skipped levels.
>>>>     w <- aov(wages ~ industry//*)
>>>>
>>>>     # I'm using "//" as a hypothetical extension to the model language
>>>>     # that is followed by a "levels" keyword and then a list of levels
>>>>     # within the hierarchy. This example would expand to
>>>>     # aov(wages ~ industry2/industry4/industry6)
>>>>     w <- aov(wages ~ industry//levels 2,4,6)
>>>>
>>>> One could extend this last example to include a notation
>>>> allowing the analysis to be repeated at varying levels of depth
>>>> (e.g., industry||2,6 would repeat the ANOVA for industry2 and
>>>> industry6).
>>>>
>>>>
>>>
>> I can see how that might be useful. But it is easy enough to split the
>> variables out, for example (assuming that each level consists of two
>> digits):
>>
>>   d$ind1 <- substr(rownames(d), 1, 2)
>>   d$ind2 <- substr(rownames(d), 3, 4)
>>   d$ind3 <- substr(rownames(d), 5, 6)
>>
>>
>>
>>>> Since the factor hierarchy is completely nested (i.e., every
>>>> 6-digit industry is below a 5-digit industry), a single function
>>>> can operate on the codes recursively. Three variants come to
>>>> mind. In the first, we'd use some kind of apply function to drill
>>>> down to a certain level and return a list of results, one for
>>>> each level:
>>>>
>>>>     # Would return a list. The first component would be a vector of
>>>>     # mean wages for industries at the 2-digit level, the second, a
>>>>     # vector for the 3-digit level, etc.
>>>>     means <- drill(wages, industry, mean)
>>>>
>>>>     # Would stop at the 3rd level of the hierarchy (4-digit code). One
>>>>     # could also imagine a maxdigits option as an alternative
>>>>     # (maxdigits = y means stop at the y-digit level).
>>>>     means <- drill(wages, industry, mean, maxlvl = 3)
>>>>
>>>
>> Again, I can see how this would be useful, but it's already pretty
>> easy (once we have split out the grouping variables) to do something
>> like
>>
>> grp.means<- list(
>> l1 = aggregate(d$wages, list(d$ind1), mean),
>> l2 = aggregate(d$wages, list(d$ind2), mean),
>> l3 = aggregate(d$wages, list(d$ind3), mean)
>> )
>>
>> I know this wasn't what you were looking for (as I said, I'm not aware
>> of any package that implements the functionality you describe). But
>> the existing facilities in R are quite flexible, and handling this
>> kind of data in R is already fairly straightforward.
>>
>> Best,
>> Ista
>>
>>
>>>> Second, suppose we have a data frame like d, only this time it's  
>>>> a time series (each row is a different date). Now we might want  
>>>> to generate vectors of the rate of change in employment at each  
>>>> industry level. It might look like:
>>>>
>>>>     # Period-over-period rate of change; assumes x is a ts object,
>>>>     # so lag(x, -1) is the previous period's value
>>>>     rate <- function(x) { (x - lag(x, -1)) / lag(x, -1) }
>>>>     rates <- list()
>>>>     i <- 1
>>>>     # The (hypothetical) levels function parses the hierarchical
>>>>     # factor into the various levels of its coding system
>>>>     for (j in levels(industry)) {
>>>>         # The (hypothetical) level function sets a particular one
>>>>         # of these levels
>>>>         rates[[i]] <- rate(employment[, level(industry) == j])
>>>>         i <- i + 1
>>>>     }
>>>>
>>>> A third variant would be a genuinely recursive function that  
>>>> keeps on calling itself at each level of the factor until it has  
>>>> either reached a pre-specified depth or exhausted all levels of  
>>>> the factor.
>>>>
>>>> I hope this gives you a good idea of the sorts of things one  
>>>> might do with hierarchical factors.
>>>>
>>>> Marsh Feldman
>>>>
>>>>
>>>>
>>>> On 5/3/2010 9:57 AM, Ista Zahn wrote:
>>>>
>>>> Hi Marshall,
>>>> What exactly do you mean by "handles this kind of data structure"?
>>>> What do you want R to do?
>>>>
>>>> Best,
>>>> Ista
>>>>
>>>> On Mon, May 3, 2010 at 9:44 AM, Marshall Feldman <marsh at uri.edu> wrote:
>>>>
>>>>
>>>> Hello,
>>>>
>>>> Hierarchical factors are a very common data structure. For instance, one
>>>> might have municipalities within states within countries within
>>>> continents. Other examples include occupational codes, biological
>>>> species, software types (R within statistical software within analytical
>>>> software), etc.
>>>>
>>>> Such data structures commonly use hierarchical coding systems. For
>>>> example, the 2007 North American Industry Classification System (NAICS)
>>>> <http://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2007> has twenty
>>>> two-digit codes (e.g., 42 = Wholesale trade), within each of these
>>>> varying numbers of 3-digit codes (e.g., 423 = Merchant wholesalers,
>>>> durable goods), then varying numbers of 4-digit codes (4231 = Motor
>>>> Vehicle and Motor Vehicle Parts and Supplies Merchant Wholesalers), then
>>>> varying numbers of five-digit codes, varying numbers of six-digit codes,
>>>> etc. At the lowest level (longest code) one can readily tell all the
>>>> higher levels. For example, 441222 is "Boat Dealers" who are part of
>>>> 44122, "Motorcycle, Boat, and Other Motor Vehicle Dealers," which is
>>>> part of 4412 (Other Motor Vehicle Dealers), which is part of 441 (Motor
>>>> Vehicle and Parts Dealers), which is part of 44 (Retail Trade). (The US
>>>> Census Bureau has extended the 6-digit NAICS to an even more
>>>> fine-grained 10-digit system.)
>>>>
>>>> I haven't seen any R packages or sample code that handles this kind of
>>>> data, but I don't want to reinvent the wheel and would rather stand on
>>>> the shoulders of you giants. Is there any package or other R-based
>>>> software out there that handles this kind of data structure?
>>>> Thanks,
>>>> Marsh Feldman
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>> --
>>>> Dr. Marshall Feldman, PhD
>>>> Director of Research and Academic Affairs
>>>> Center for Urban Studies and Research
>>>> The University of Rhode Island
>>>> email: marsh @ uri .edu (remove spaces)
>>>>
>>>> Contact Information:
>>>>
>>>> Kingston:
>>>>
>>>> 202 Hart House
>>>> Charles T. Schmidt Labor Research Center
>>>> The University of Rhode Island
>>>> 36 Upper College Road
>>>> Kingston, RI 02881-0815
>>>> tel. (401) 874-5953:
>>>> fax: (401) 874-5511
>>>>
>>>> Providence:
>>>>
>>>> 206E Shepard Building
>>>> URI Feinstein Providence Campus
>>>> 80 Washington Street
>>>> Providence, RI 02903-1819
>>>> tel. (401) 277-5218
>>>> fax: (401) 277-5464
>>>
>>
>> --
>> Ista Zahn
>> Graduate student
>> University of Rochester
>> Department of Clinical and Social Psychology
>> http://yourpsyche.org
>>
>

David Winsemius, MD
West Hartford, CT
On 5/5/10 [May 5, 10] 11:29 PM, David Winsemius wrote:
> [...]

First, I apologize for my last, somewhat incoherent post. I was 
composing it late at night, grew too tired to think, and thought I left 
it open to finish this morning. Looks as if I should have quit about an 
hour earlier since apparently the garbled message went out anyway.

Dave, you're right, although I would describe my question as combining 
rather than obscuring two issues. My thinking is that first one would 
want the data structure (actually a data type or class). A set of 
functions could then handle conversion to factors, etc. that would allow 
easy use of most existing statistical functions. New statistical 
functions could then be designed, or old ones retrofitted, to handle the 
new data type internally. Eventually, it would be great to integrate it 
into the formula language.
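
Until the formula language itself understands hierarchies, the nested right-hand side can be generated rather than typed. What follows is only a minimal sketch, assuming the codes are fixed-width character strings; `add_levels()` and `nested_formula()` are hypothetical helper names (not functions from any package), and the data are made up for illustration:

```r
# Hypothetical helper: add one prefix column per requested code length.
add_levels <- function(d, codes, digits, stem = "industry") {
  for (n in digits) {
    d[[paste0(stem, n)]] <- factor(substr(codes, 1, n))
  }
  d
}

# Hypothetical helper: build response ~ stem2/stem4/stem6 from the digit list.
nested_formula <- function(response, digits, stem = "industry") {
  rhs <- paste0(stem, digits, collapse = "/")
  as.formula(paste(response, "~", rhs))
}

# Made-up example data: six 6-digit codes with two observations each.
codes <- rep(c("441110", "441222", "442110", "423110", "423120", "424110"),
             each = 2)
d <- data.frame(wages = c(510, 530, 480, 470, 450, 455,
                          620, 640, 600, 610, 580, 575))
d <- add_levels(d, codes, digits = c(2, 4, 6))

f <- nested_formula("wages", c(2, 4, 6))   # wages ~ industry2/industry4/industry6
w <- aov(f, data = d)
```

This only automates the bookkeeping; the model that is fitted is the ordinary nested `aov()` already discussed in this thread.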

The data type would have an inheritance pattern sort of like this: 
factor -> hierarchy -> specific system. By "specific system" I mean 
either a standard or user-defined coding system that extends the 
hierarchy class. For example, NAICS would be a data type and any 
variable in this class would be both hierarchical and map to the labels 
associated with the industry definitions. The hierarchy class would be 
what I was describing, with information on how to parse individual 
character strings at various levels of aggregation. Finally, although my 
idea would extend R's factor data type, strictly speaking this would not 
be inheritance. Real factors replicate and include labels in the storage 
associated with individual variables. Most hierarchical systems are very 
large, including hundreds of levels and long labels. So factors would 
usually be a very inefficient way to handle them. Imagine, for example, 
an application analyzing Internet routing or airline traffic, with each 
node on a route having a spatial hierarchical code 
(country.state.county.city) and a separate variable for each node. Ugh!
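
To make the class idea concrete, here is a rough S3 sketch, not an existing package. It assumes fixed-width codes stored as character strings; the names `hierarchy()`, `level_factor()`, and the `"naics"` subclass are all hypothetical:

```r
# Hypothetical constructor: codes are stored once as character strings;
# the code lengths that define the levels are kept as an attribute.
hierarchy <- function(codes, digits, system = NULL) {
  structure(as.character(codes), digits = digits,
            class = c(system, "hierarchy"))
}

# Hypothetical accessor: truncate every code to the requested length and
# return an ordinary factor that existing functions can consume.
level_factor <- function(x, n) {
  stopifnot(inherits(x, "hierarchy"), n %in% attr(x, "digits"))
  factor(substr(as.character(unclass(x)), 1, n))
}

# A "specific system" extends the hierarchy class, here a made-up
# "naics" subclass with the 2- to 6-digit levels.
industry <- hierarchy(c("441222", "441110", "423110"), digits = 2:6,
                      system = "naics")

level_factor(industry, 2)   # factor with levels "42" and "44"
```

The codes are stored only once, and a factor is built on demand for whichever level an analysis needs.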

Instead, my idea would be to use an approach similar to SAS's formats, 
where the labels are stored separately and the individual codes map 
through a few relatively simple algorithms. SAS, for example, maps codes 
to labels either 1:1 (a character representation of the code maps to a 
label) or by evaluating the code and mapping it according to a 
predefined range of values. SAS recently implemented a feature that 
allows 1:many mapping so that, for instance, an AGE variable could 
simultaneously map to "Adult" and "Senior Citizen." Some statistical 
procedures in SAS will now repeat the analysis for all the mappings, so 
a single call to describe a variable generates counts of both adults and 
seniors.
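
In R terms, a format-like lookup could be as simple as a named character vector kept apart from the data. The sketch below is an illustration under that assumption, not a port of SAS formats; `code_label()` and `all_labels()` are hypothetical helper names, and the labels are the NAICS examples quoted earlier in this thread:

```r
# Label table kept once, separately from the data.
naics_labels <- c(
  "44"     = "Retail Trade",
  "441"    = "Motor Vehicle and Parts Dealers",
  "4412"   = "Other Motor Vehicle Dealers",
  "44122"  = "Motorcycle, Boat, and Other Motor Vehicle Dealers",
  "441222" = "Boat Dealers"
)

# Hypothetical helper: 1:1 mapping of a code, truncated to n digits,
# onto its label; codes without a label come back as NA.
code_label <- function(codes, n, labels = naics_labels) {
  unname(labels[substr(codes, 1, n)])
}

# Hypothetical helper: 1:many mapping, returning every label that applies
# to a single code, from the coarsest level to the finest.
all_labels <- function(code, labels = naics_labels) {
  prefixes <- substring(code, 1, seq_len(nchar(code)))
  unname(labels[intersect(prefixes, names(labels))])
}

code_label(c("441222", "441110"), 2)   # both map to "Retail Trade"
all_labels("441222")                   # five labels, "Retail Trade" first
```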

While something similar to SAS formats would itself be a useful addition 
to R (and has been discussed before), my idea extends this by adding the 
ability to parse a hierarchical code at its various levels. This could 
then be integrated into appropriate statistical functions, or the 
analyst could write a function to deparse the code into its levels and 
then call the statistical function as needed. At a minimum, the 
hierarchy class would have to include an as.factor() function.
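
For the "write a function and call the statistical function as needed" route, a rough version of the drill() idea from earlier in the thread might look like the following. It is a sketch assuming fixed-width character codes; the `digits` argument is my stand-in for the maxlvl/maxdigits options, and the data are made up:

```r
# Hypothetical drill(): apply FUN to x within every prefix group of the
# hierarchical codes, one list component per code length.
drill <- function(x, codes, FUN, digits = 2:max(nchar(codes)), ...) {
  out <- lapply(digits, function(n) tapply(x, substr(codes, 1, n), FUN, ...))
  names(out) <- paste0("digits", digits)
  out
}

# Made-up data for illustration only.
industry <- c("441110", "441222", "442110", "423110")
wages    <- c(500, 700, 600, 800)

means <- drill(wages, industry, mean)                 # 2- through 6-digit levels
means <- drill(wages, industry, mean, digits = 2:4)   # stop at the 4-digit level
means$digits2                                         # mean wages by 2-digit sector
```

Each component is an ordinary named vector from tapply(), so the results can be passed straight to existing functions.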

Given R's thousands of packages, I sent my post to find out if something 
like this already existed.

Thanks to everyone for your feedback. This list is great! The answer to 
my question is:

 > answer <- little.red.hen(question)

Marsh Feldman

I have seen statements that R and ROOT can be compiled together on the  
same machine. ROOT is an object oriented database system developed at  
CERN (also where the WWW started) that supports hierarchical  
organization of data:

http://en.wikipedia.org/wiki/ROOT

The BioConductor "project" ought to be considered as a potential  
source of coding, and the geospatial interest group as well.

See for instance the xps package in BioC
http://bioconductor.org/packages/release/bioc/html/xps.html
http://www.iscb.org/uploaded/css/G04Stratowa.pdf

You might try corresponding with the xps author Christian Stratowa.
David Winsemius, MD
West Hartford, CT