Hierarchical factors
4 messages · Marshall Feldman, David Winsemius
I think you are perhaps unintentionally obscuring two issues. One is whether R has the statistical functions to deal with such an arrangement, and here "mixed models" is the phrase you ought to be watching for. The other is whether it has pre-written data management functions that directly support the particular data layout you might be getting from public-access gov't files. The second is what I _thought_ you were soliciting in your original posting.

I was a bit surprised that no one mentioned the survey package, since I have seen it used in such situations, but I cannot track down the citation at the moment. You might want to look at Gelman's blog:
http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html

See also work on nested case within cohort designs:
http://aje.oxfordjournals.org/cgi/content/full/kwp055v1

And Damico's article, "Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data," R Journal, 2009, n. 2:
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf
David.
On May 5, 2010, at 10:23 PM, Marshall Feldman wrote:
> Thanks for sharing this, Ista.
>
> I've come to the conclusion that R doesn't have what I'm looking for,
> either in the base or the packages.
>
> Although your examples are insightful, the examples we've been
> discussing are deliberately easier than what one would expect in most
> serious applications. Imagine for instance that we're studying wage
> structures of industries in different geographic labor markets. We
> therefore might have four variables: wages, industries, occupations,
> and places. We might want to see if wage differentials are more or less
> constant or if they are higher in some geographic areas than in others.
> Since industries, occupations, and places are typically coded
> hierarchically as we've been discussing, we might want to figure out
> how to examine different wage levels within industries, etc. Doing this
> manually would require lots of w
> whereas conceptually the
>
> On 5/4/2010 6:00 AM,
>> Message: 49
>> Date: Mon, 3 May 2010 13:22:59 -0400
>> From: Ista Zahn <istazahn at gmail.com>
>> To: Marshall Feldman <marsh at uri.edu>
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Hierarchical factors
>> Message-ID:
>> <x2xf55e7cf51005031022se4c46967s174efeef95331abc at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> Hi Marshall,
>>
>> I'm not aware of any packages that implement these features as you
>> described them. But most of the tasks are already fairly easy in R --
>> see below.
>>
>> On Mon, May 3, 2010 at 11:18 AM, Marshall Feldman <marsh at uri.edu>
>> wrote:
>>>>
>>>> Thanks for getting back so quickly Ista,
>>>>
>>>> I was actually casting about for any examples of R software that
>>>> deals with this kind of structure. But your question is a good
>>>> one. Here are a few things I'd like to be able to do:
>>>>
>>>> Store data in R at the finest level of detail but easily refer to
>>>> higher levels of aggregation. If the data include such higher
>>>> levels, this is trivial, but otherwise I'd like to aggregate
>>>> fairly easily. The following is not functioning code, but it
>>>> should give you the idea:
>>>>
>>>> start with a data frame (call it d) having row.names = to the 6
>>>> digit NAICS code and columns w/ various variables, assume one is
>>>> named employment.
>>>> d[, "employment"]           # Would print all employment data
>>>> d["441222", "employment"]   # Would print only Boat Dealer employment
>>>> d["44", "employment"]       # Would print total employment for Retail Trade
>>>
>> d[, "employment"] # prints all employment data
>> d[rownames(d) == "441222", "employment"] # prints only boat dealer employment
>> sum(d[grep("^44", rownames(d)), "employment"]) # prints total employment for retail trade
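
The prefix lookups above can be tried on a toy data frame. This is only a sketch: the codes 441222 and 44 come from the thread, while the other row names, all employment figures, and the helper `prefix_total()` are assumptions made up for illustration, not an existing R function.

```r
# Toy data: row names are 6-digit NAICS-style codes (figures are invented).
d <- data.frame(
  employment = c(120, 80, 45, 300),
  row.names  = c("441222", "441110", "442110", "423110")
)

# Hypothetical helper: total of one column over all rows whose code
# starts with the given prefix.
prefix_total <- function(data, prefix, column) {
  keep <- startsWith(rownames(data), prefix)
  sum(data[keep, column])
}

prefix_total(d, "441222", "employment")  # a single 6-digit industry
prefix_total(d, "441", "employment")     # its 3-digit subsector
prefix_total(d, "44", "employment")      # its 2-digit sector
```

Matching on the start of the code works here only because the coding system is strictly nested, so every longer code carries its parents as a prefix.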
>>
>>
>>>>
>>>> Recursive nesting. I'm not sure how to convey this except with
>>>> examples. Suppose the data frame also has a "wages" column with
>>>> average weekly wages in the industry, and the industry code is
>>>> also a factor variable (industry). So a simple analysis of
>>>> variance might look like:
>>>>
>>>>     w <- aov(wages ~ industry, d)
>>>>
>>>> But now what I'd like to do is to break this down within
>>>> 2-digit sectors. Assuming the data frame has another variable,
>>>> industry2, this would look like:
>>>>
>>>>     w <- aov(wages ~ industry2/industry)
>>>>
>>>> But what if we either (a) don't want to bother creating
>>>> separate variables for each level of aggregation in industry or
>>>> (b) want to extend the model formula language to include
>>>> various nesting strategies? This might look like:
>>>>
>>>>     w <- aov(wages ~ industry//*)
>>>>     # Nest all meaningful levels:
>>>>     # industry/industry2/industry3/industry4/industry5/industry6.
>>>>     # If the coding system skips some levels, R is smart enough
>>>>     # to omit the skipped levels.
>>>>
>>>>     w <- aov(wages ~ industry//levels 2,4,6)
>>>>     # I'm using "//" as a hypothetical extension to the model
>>>>     # language that is followed by a "levels" keyword and then a
>>>>     # list of levels within the hierarchy. This example would
>>>>     # expand to aov(wages ~ industry2/industry4/industry6)
>>>>
>>>> One could extend this last example to include a notation
>>>> allowing the analysis to be repeated at varying levels of depth
>>>> (e.g., industry||2,6 would repeat the ANOVA for industry2 and
>>>> industry6).
>>>>
>>>
>> I can see how that might be useful. But it is easy enough to split
>> the variables out, for example (assuming that each level consists
>> of two digits):
>>
>> d$ind1 <- substr(rownames(d), 1, 2)
>> d$ind2 <- substr(rownames(d), 3, 4)
>> d$ind3 <- substr(rownames(d), 5, 6)
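
A possible variant of this split, sketched under assumptions, keeps the cumulative prefix at each level (for example "4412" rather than "12") so that a level is unambiguous on its own, and then builds the nested formula that the hypothetical `industry//levels 2,4,6` notation would expand to. The helpers `add_levels()` and `nested_formula()`, the column names, and the wage figures are invented for illustration.

```r
# Toy data: row names are 6-digit NAICS-style codes; wages are invented.
d <- data.frame(
  wages     = c(510, 640, 480, 820, 790, 560),
  row.names = c("441110", "441222", "442110", "423110", "423120", "424410")
)

# Hypothetical helper: add one factor column per requested digit level,
# holding the cumulative prefix of the row name.
add_levels <- function(data, digits) {
  for (k in digits) {
    data[[paste0("industry", k)]] <- factor(substr(rownames(data), 1, k))
  }
  data
}

# Hypothetical helper: build response ~ industry2/industry4/... as a formula.
nested_formula <- function(response, digits) {
  rhs <- paste0("industry", digits, collapse = "/")
  as.formula(paste(response, "~", rhs))
}

d <- add_levels(d, c(2, 4, 6))
f <- nested_formula("wages", c(2, 4, 6))
f                              # the expanded nested formula
attr(terms(f), "term.labels")  # the terms R derives from the "/" nesting
```

With enough observations per cell, `aov(f, data = d)` would then fit the nested model; the toy data here are too small for that to be informative.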
>>
>>
>>
>>>> Since the factor hierarchy is completely nested (i.e., every
>>>> 6-digit industry is below a 5-digit industry), a single function
>>>> can operate on the codes recursively. Three variants come to
>>>> mind. In the first, we'd use some kind of apply function to drill
>>>> down to a certain level and return a list of results, one for
>>>> each level:
>>>>
>>>>     means <- drill(wages, industry, mean)
>>>>     # Would return a list. The first component would be a vector
>>>>     # of mean wages for industries at the 2-digit level, the
>>>>     # second, a vector for the 3-digit level, etc.
>>>>
>>>>     means <- drill(wages, industry, mean, maxlvl = 3)
>>>>     # Would stop at the 3rd level of the hierarchy (4-digit code).
>>>>     # One could also imagine a maxdigits option as an alternative
>>>>     # (maxdigits = y means stop at the y-digit level).
>>>>
>>>
>> Again, I can see how this would be useful, but it's already pretty
>> easy (once we have split out the grouping variables) to do something
>> like
>>
>> grp.means<- list(
>> l1 = aggregate(d$wages, list(d$ind1), mean),
>> l2 = aggregate(d$wages, list(d$ind2), mean),
>> l3 = aggregate(d$wages, list(d$ind3), mean)
>> )
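
The `drill()` function described in the quoted request does not exist in R, but a rough stand-in can be sketched with `tapply()` and `substr()`. The function name and the `maxdigits` argument follow the wording of the request; the implementation, the codes other than 441222, and the wage figures are assumptions for illustration.

```r
# Toy data: 6-digit NAICS-style codes with invented weekly wages.
industry <- c("441110", "441222", "442110", "423110", "423120", "424410")
wages    <- c(510, 640, 480, 820, 790, 560)

# Hypothetical drill(): apply `fun` to `x` grouped by code prefixes of
# increasing length, returning one named vector per digit level.
drill <- function(x, codes, fun, digits = 2:6, maxdigits = max(digits)) {
  digits <- digits[digits <= maxdigits]
  out <- lapply(digits, function(k) tapply(x, substr(codes, 1, k), fun))
  names(out) <- paste0("digits", digits)
  out
}

means <- drill(wages, industry, mean)
means$digits2                                # mean wage per 2-digit sector
drill(wages, industry, mean, maxdigits = 3)  # stops at the 3-digit level
```

Because the prefixes are computed on the fly, no extra grouping columns need to be stored in the data.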
>>
>> I know this wasn't what you were looking for (as I said, I'm not
>> aware
>> of any package that implements the functionality you describe). But
>> the existing facilities in R are quite flexible, and handling this
>> kind of data in R is already fairly straightforward.
>>
>> Best,
>> Ista
>>
>>
>>>> Second, suppose we have a data frame like d, only this time it's
>>>> a time series (each row is a different date). Now we might want
>>>> to generate vectors of the rate of change in employment at each
>>>> industry level. It might look like:
>>>>
>>>>     rate <- function(x) { (x - lag(x)) / lag(x) }
>>>>     rates <- list()
>>>>     i <- 1
>>>>     for (j in levels(industry)) {
>>>>         # The levels function parses the hierarchical factor into
>>>>         # the various levels of its coding system
>>>>         rates[[i]] <- rate(employment[, level(industry) == j])
>>>>         # The level function sets a particular one of these levels
>>>>         i <- i + 1
>>>>     }
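
A runnable approximation of this idea in base R might look as follows. It is a sketch under assumptions: employment is held in a matrix with one row per date and one column per 6-digit code, the figures are invented, and `rates_by_level()` is a hypothetical helper. It uses `diff()` rather than `lag()`, because base R's `lag()` shifts the time base of a series and does not shift the values of a plain vector.

```r
# Toy time series: rows are dates, columns are 6-digit codes (invented data).
employment <- matrix(
  c(100, 110, 121,
     50,  55,  66,
    200, 190, 209),
  nrow = 3,
  dimnames = list(c("2008", "2009", "2010"), c("441110", "441222", "423110"))
)

# Period-over-period rate of change: (x[t] - x[t-1]) / x[t-1].
rate <- function(x) diff(x) / head(x, -1)

# Hypothetical helper: sum the columns that share a code prefix, then
# compute the rate of change of each total, one result per digit level.
rates_by_level <- function(emp, digits = 2:6) {
  out <- lapply(digits, function(k) {
    totals <- t(rowsum(t(emp), substr(colnames(emp), 1, k)))
    apply(totals, 2, rate)
  })
  names(out) <- paste0("digits", digits)
  out
}

rates <- rates_by_level(employment)
rates$digits2  # one column of rates per 2-digit sector
```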
>>>>
>>>> A third variant would be a genuinely recursive function that
>>>> keeps on calling itself at each level of the factor until it has
>>>> either reached a pre-specified depth or exhausted all levels of
>>>> the factor.
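
The recursive variant can also be sketched in a few lines. Everything here is illustrative: `drill_tree()` is a hypothetical name, and the codes other than 441222 and all wage figures are made up. Each node stores the summary for its members and, until the depth limit or the full code length is reached, a list of child nodes.

```r
industry <- c("441110", "441222", "442110", "423110")
wages    <- c(510, 640, 480, 820)

# Hypothetical recursive drill: split by the current prefix length, apply
# `fun` within each group, then recurse one digit deeper on that group.
drill_tree <- function(x, codes, fun, digits = 2, maxdigits = 6) {
  groups <- split(seq_along(codes), substr(codes, 1, digits))
  lapply(groups, function(idx) {
    node <- list(value = fun(x[idx]))
    deeper <- digits < maxdigits && digits < max(nchar(codes[idx]))
    if (deeper) {
      node$children <- drill_tree(x[idx], codes[idx], fun, digits + 1, maxdigits)
    }
    node
  })
}

tree <- drill_tree(wages, industry, mean, maxdigits = 4)
tree[["44"]]$value                                        # 2-digit sector mean
tree[["44"]]$children[["441"]]$children[["4412"]]$value   # 4-digit group mean
```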
>>>>
>>>> I hope this gives you a good idea of the sorts of things one
>>>> might do with hierarchical factors.
>>>>
>>>> ??? Marsh Feldman
>>>>
>>>>
>>>>
>>>> On 5/3/2010 9:57 AM, Ista Zahn wrote:
>>>>
>>>> Hi Marshall,
>>>> What exactly do you mean by "handles this kind of data structure"?
>>>> What do you want R to do?
>>>>
>>>> Best,
>>>> Ista
>>>>
>>>> On Mon, May 3, 2010 at 9:44 AM, Marshall Feldman<marsh at uri.edu>
>>>> wrote:
>>>>
>>>>
>>>> Hello,
>>>>
>>>> Hierarchical factors are a very common data structure. For
>>>> instance, one might have municipalities within states within
>>>> countries within continents. Other examples include occupational
>>>> codes, biological species, software types (R within statistical
>>>> software within analytical software), etc.
>>>>
>>>> Such data structures commonly use hierarchical coding systems. For
>>>> example, the 2007 North American Industry Classification System
>>>> (NAICS) <http://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2007>
>>>> has twenty two-digit codes (e.g., 42 = Wholesale trade), within each
>>>> of these varying numbers of 3-digit codes (e.g., 423 = Merchant
>>>> wholesalers, durable goods), then varying numbers of 4-digit codes
>>>> (4231 = Motor Vehicle and Motor Vehicle Parts and Supplies Merchant
>>>> Wholesalers), then varying numbers of five-digit codes, varying
>>>> numbers of six-digit codes, etc. At the lowest level (longest code)
>>>> one can readily tell all the higher levels. For example, 441222 is
>>>> "Boat Dealers" who are part of 44122, "Motorcycle, Boat, and Other
>>>> Motor Vehicle Dealers," which is part of 4412 (Other Motor Vehicle
>>>> Dealers), which is part of 441 (Motor Vehicle and Parts Dealers),
>>>> which is part of 44 (Retail Trade). (The US Census Bureau has
>>>> extended the 6-digit NAICS to an even more fine-grained 10-digit
>>>> system.)
>>>>
>>>> I haven't seen any R packages or sample code that handles this
>>>> kind of
>>>> data, but I don't want to reinvent the wheel and would rather
>>>> stand on
>>>> the shoulders of you giants. Is there any package or other R-based
>>>> software out there that handles this kind of data structure?
>>>>
>>>> Thanks,
>>>> Marsh Feldman
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Marshall Feldman, PhD
>>>> Director of Research and Academic Affairs
>>>> Center for Urban Studies and Research
>>>> The University of Rhode Island
>>>> email: marsh @ uri .edu (remove spaces)
>>>>
>>>> Contact Information:
>>>>
>>>> Kingston:
>>>>
>>>> 202 Hart House
>>>> Charles T. Schmidt Labor Research Center
>>>> The University of Rhode Island
>>>> 36 Upper College Road
>>>> Kingston, RI 02881-0815
>>>> tel. (401) 874-5953:
>>>> fax: (401) 874-5511
>>>>
>>>> Providence:
>>>>
>>>> 206E Shepard Building
>>>> URI Feinstein Providence Campus
>>>> 80 Washington Street
>>>> Providence, RI 02903-1819
>>>> tel. (401) 277-5218
>>>> fax: (401) 277-5464
>>>
>>
>> --
>> Ista Zahn
>> Graduate student
>> University of Rochester
>> Department of Clinical and Social Psychology
>> http://yourpsyche.org
>>
>
David Winsemius, MD
West Hartford, CT
On 5/5/10 [May 5, 10] 11:29 PM, David Winsemius wrote:
> I think you are perhaps unintentionally obscuring two issues. [...]
First, I apologize for my last, somewhat incoherent post. I was composing it late at night, grew too tired to think, and thought I left it open to finish this morning. Looks as if I should have quit about an hour earlier since apparently the garbled message went out anyway.

Dave, you're right, although I would describe my question as combining rather than obscuring two issues. My thinking is that first one would want the data structure (actually a data type or class). A set of functions could then handle conversion to factors, etc. that would allow easy use of most existing statistical functions. New statistical functions could then be designed, or old ones retrofitted, to handle the new data type internally. Eventually, it would be great to integrate it into the formula language.

The data type would have an inheritance pattern sort of like this: factor -> hierarchy -> specific system. By "specific system" I mean either a standard or user-defined coding system that extends the hierarchy class. For example, NAICS would be a data type and any variable in this class would be both hierarchical and map to the labels associated with the industry definitions. The hierarchy class would be what I was describing, with information on how to parse individual character strings at various levels of aggregation.

Finally, although my idea would extend R's factor data type, strictly speaking this would not be inheritance. Real factors replicate and include labels in the storage associated with individual variables. Most hierarchical systems are very large, including hundreds of levels and long labels. So factors would usually be a very inefficient way to handle them. Imagine, for example, an application analyzing Internet routing or airline traffic, with each node on a route having a spatial hierarchical code (country.state.county.city) and a separate variable for each node. Ugh!

Instead, my idea would be to use an approach similar to SAS's formats, where the labels are stored separately and the individual codes map through a few relatively simple algorithms. SAS, for example, maps codes to labels either 1:1 (a character representation of the code maps to a label) or by evaluating the code and mapping it according to a predefined range of values. SAS recently implemented a feature that allows 1:many mapping so that, for instance, an AGE variable could map simultaneously to "Adult" and "Senior Citizen." Some statistical procedures in SAS will now repeat the analysis for all the mappings, so a single call to describe a variable generates counts of both adults and seniors.

While something similar to SAS formats would itself be a useful addition to R (and has been discussed before), my idea extends this by adding the ability to parse a hierarchical code at its various levels. This could then be integrated into appropriate statistical functions, or the analyst could write a function to deparse the code into its levels and then call the statistical function as needed. At a minimum, the hierarchy class would have to include an as.factor() function.

Given R's thousands of packages, I sent my post to find out if something like this already existed. Thanks to everyone for your feedback. This list is great! The answer to my question is:

> answer <- little.red.hen(question)

Marsh Feldman
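
One way the proposed design might be prototyped is with a small S3 class. This is a sketch only, not an existing package: `hierarchy()`, `level_factor()`, and the class names are hypothetical, and the code 423110 is a placeholder with no label. The labels used are the ones quoted earlier in the thread. The labels live in a single lookup table per coding system rather than in each variable, loosely in the spirit of a SAS format, and the class vector mirrors the "hierarchy -> specific system" pattern.

```r
# Hypothetical constructor: codes are stored per observation, labels once
# per coding system.
hierarchy <- function(codes, labels, system = "hierarchy") {
  structure(
    list(codes = as.character(codes), labels = labels),
    class = unique(c(system, "hierarchy"))
  )
}

# Hypothetical converter: a factor at the requested digit level, labelled
# from the lookup table where a label exists and by the raw code otherwise.
# (Base as.factor() is not an S3 generic, hence a separate function.)
level_factor <- function(h, digits) {
  prefix <- substr(h$codes, 1, digits)
  lev <- sort(unique(prefix))
  lab <- unname(h$labels[lev])
  lab[is.na(lab)] <- lev[is.na(lab)]
  factor(prefix, levels = lev, labels = lab)
}

naics_labels <- c(
  "42"     = "Wholesale trade",
  "44"     = "Retail Trade",
  "441"    = "Motor Vehicle and Parts Dealers",
  "441222" = "Boat Dealers"
)
industry <- hierarchy(c("441222", "441222", "423110"), naics_labels, "naics")

level_factor(industry, 2)  # factor labelled at the 2-digit level
level_factor(industry, 6)  # falls back to the raw code where no label exists
```

The resulting factors can be passed to existing functions such as `table()` or `aov()` without those functions knowing anything about the hierarchy.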
On May 6, 2010, at 7:13 AM, Marshall Feldman wrote:
> [...]
I have seen statements that R and ROOT can be compiled together on the same machine. ROOT is an object-oriented database system developed at CERN (also where the WWW started) that supports hierarchical organization of data:
http://en.wikipedia.org/wiki/ROOT

The BioConductor "project" ought to be considered as a potential source of coding, and the geospatial interest group as well. See for instance the xps package in BioC:
http://bioconductor.org/packages/release/bioc/html/xps.html
http://www.iscb.org/uploaded/css/G04Stratowa.pdf

You might try corresponding with the xps author, Christian Stratowa.
David Winsemius, MD West Hartford, CT