[Bioc-devel] Syntactically correct names in DataFrames

Fri, Jun 29, 2012 9:22 PM

On 06/29/2012 02:35 PM, Michael Lawrence wrote:

Thanks for the fix. I was worried that validation rejecting duplicated
colnames would be intentional. Looks like I don't need to worry anymore.

Thanks again,
H.

Michael

      > data.frame(list(aa=2:4, aa=LETTERS[2:4]), check.names=FALSE)

      > DataFrame(list(aa=2:4, aa=LETTERS[2:4]), check.names=FALSE)

      DataFrame with 3 rows and 2 columns
               aa        aa.1
        <integer> <character>
      1         2           B
      2         3           C
      3         4           D

    Not OK.

    I also tend to think that automatic name mangling generally causes
    more problems than it solves (if it solves any problem at all).
    Same thing with automatic coercion from character to factor (which I'm
    glad DataFrame() is not trying to mimic).

    Cheers,
    H.



    On 06/28/2012 06:58 AM, Michael Lawrence wrote:

        Hi Florian,

        A guiding principle in the design of DataFrame was consistency with
        data.frame, so that is why we check for syntactic validity of the
        column names.  The underlying reasons for this are probably historic
        and related to the rough equivalence between lists and environments.

        As for the error you encountered below, that seems to be fixed in
        devel.

        Michael

        On Thu, Jun 28, 2012 at 6:40 AM, Hahne, Florian
        <florian.hahne at novartis.com> wrote:

            Hi all,
            I have been playing around with the DataFrame class a bit and
            realized that it always enforces syntactically correct column
            names. Since it is a generalization of the basic R data.frames
            I am not quite sure why that has to be the case.

            Assuming I start with a regular data.frame with non-standard names:

                foo <- data.frame("1a"=1:3, b=4:6, check.names=FALSE)
                foo

              1a b
            1  1 4
            2  2 5
            3  3 6


            Coercing this into a DataFrame forces a name change:

                DataFrame(foo)

            DataFrame with 3 rows and 2 columns
                    X1a         b
              <integer> <integer>
            1         1         4
            2         2         5
            3         3         6


                as(foo, "DataFrame")

            DataFrame with 3 rows and 2 columns
                    X1a         b
              <integer> <integer>
            1         1         4
            2         2         5
            3         3         6


            My first intuition was to try this:

                DataFrame(foo, check.names=FALSE)

            DataFrame with 3 rows and 3 columns
            Error in matrix(unlist(lapply(object, function(x) paste("<", class(x),  :
              length of 'dimnames' [2] not equal to array extent
            In addition: Warning message:
            In if (check.names) vnames <- make.names(vnames, unique = TRUE) :
              the condition has length > 1 and only the first element will be used

            Now apparently there are multiple things going on here. First
            of all, check.names is recycled by the DataFrame constructor
            because it thinks that it is just another variable to add to
            the DataFrame later. The initializer method however seems to
            recognize it for the coercion into a data.frame, but it
            complains because its length is >1. Also the show method is
            broken because things don't really match anymore. The
            Data.Table show method in IRanges seems to be the culprit here.

            My simple question here is: why are syntactic names enforced at
            all? And if that is a feature, couldn't there be a way to turn
            this off?

            A very simple fix would be this:
            Index: DataFrame-class.R
            ===================================================================
            --- DataFrame-class.R   (revision 67116)
            +++ DataFrame-class.R   (working copy)
            @@ -183,7 +183,7 @@
                 varlist <- unlist(varlist, recursive = FALSE, use.names = FALSE)
                 nms <- unlist(varnames[ncols > 0L])
                 if (check.names)
            -      nms <- make.names(nms, unique = TRUE)
            +      nms <- make.unique(nms)
                 names(varlist) <- nms
               } else names(varlist) <- character(0)

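            To illustrate what the one-line change would alter, here is a
            minimal sketch using base R only. The vector nms is made up
            for illustration and is not taken from the DataFrame code.

            ```r
            # Hypothetical column names: one non-syntactic, one duplicated.
            nms <- c("1a", "b", "b")

            # Current behaviour: make.names() repairs the syntax and, with
            # unique = TRUE, also de-duplicates.
            fixed <- make.names(nms, unique = TRUE)
            print(fixed)    # "X1a" "b" "b.1"

            # Proposed behaviour: make.unique() only de-duplicates and leaves
            # the non-syntactic name "1a" untouched.
            deduped <- make.unique(nms)
            print(deduped)  # "1a" "b" "b.1"
            ```

            With make.unique() the columns stay distinguishable, but names
            such as "1a" need backticks or [[ ]] to be accessed.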


            Of course I didn't check all of the downstream effects, but I
            don't really see why anything should rely on syntactically
            correct names. In case something does, the erratic check.names
            behavior certainly needs some fixing; after all, it could just
            be a normal column name in the DataFrame.

            Thanks,
            Florian

            _______________________________________________
            Bioc-devel at r-project.org mailing list
            https://stat.ethz.ch/mailman/listinfo/bioc-devel



    --
    Hervé Pagès

    Program in Computational Biology
    Division of Public Health Sciences
    Fred Hutchinson Cancer Research Center
    1100 Fairview Ave. N, M1-B514
    P.O. Box 19024
    Seattle, WA 98109-1024

    E-mail: hpages at fhcrc.org
    Phone:  (206) 667-5791
    Fax:    (206) 667-1319

Thread (7 messages)

Hahne, Florian    Syntactically correct names in DataFrames    Jun 28
Michael Lawrence  Syntactically correct names in DataFrames    Jun 28
Hahne, Florian    Syntactically correct names in DataFrames    Jun 29
Hahne, Florian    Syntactically correct names in DataFrames    Jun 29
Hervé Pagès       Syntactically correct names in DataFrames    Jun 29
Michael Lawrence  Syntactically correct names in DataFrames    Jun 29
Hervé Pagès       Syntactically correct names in DataFrames    Jun 29