
reading a character translation table into R

I have a text file (attached) that defines equivalents among characters
in latin1 (ISO-8859-1), numeric &#xxx; codes, HTML entities,
and LaTeX equivalents.  A portion of the file is shown inline below, but
it may not be rendered well in this email.

I'd like to read this into R to use as a character translation table, 
but am stuck on two things:
- The 5 fields in the file are column-aligned and are separated by 2+ 
white space characters.
In Perl this is trivial to read and parse via something like
         my @entries = split(/\n/, $charTable);
         foreach (@entries) {
                 # split each line on runs of 2+ whitespace characters
                 my ($desc, $char, $code, $html, $tex) = split(/\s\s+/);
         }
AFAIK, the only function for reading such data is utils::read.fwf, but I 
have to specify the field widths.
I don't know of any function that allows even a simple regex like this
as a sep= argument.

- The TeX field contains many backslashed codes that need to be escaped
in R. Is it necessary to manually edit the file to change
'\pounds' --> '\\pounds', '\S' --> '\\S', etc., or is there something
like a raw input mode that would do this where necessary?
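
A minimal sketch of one possible approach to both points, not a definitive answer: read the lines with readLines() and split them with strsplit(), which accepts a regular expression. The helper name parse_char_table, the column names, and the sample rows are hypothetical, and the sketch assumes that every field in the actual file is separated by two or more blanks. Backslashes read from a file arrive as literal characters; doubling them is only needed when typing string literals in R source code.

```r
# Hypothetical helper: parse lines whose fields are separated by 2+ blanks.
parse_char_table <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]      # drop empty lines
  lines <- sub("[ \t]+$", "", lines)         # drop trailing blanks
  # Split on runs of two or more spaces/tabs, much like split(/\s\s+/) in Perl.
  fields <- strsplit(lines, "[ \t]{2,}")
  # Rows without a TeX equivalent have only 4 fields; pad them with NA.
  fields <- lapply(fields, function(f) c(f, rep(NA_character_, 5))[1:5])
  tab <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
  names(tab) <- c("desc", "char", "code", "html", "tex")
  tab
}

# Made-up sample rows; "x" is an ASCII placeholder for the latin1 character.
# Backslashes are doubled here only because this is R source code; each
# string holds a single backslash, as the file on disk would.
sample_lines <- c(
  "pound sterling     x    &#163;    &pound;    \\pounds",
  "cent sign          x    &#162;    &cent;",
  "section sign       x    &#167;    &sect;     \\S"
)
tmp <- tempfile(fileext = ".txt")
writeLines(sample_lines, tmp)

# readLines() returns the file content verbatim, so the file itself
# does not need to be edited to escape its backslashes.
tab <- parse_char_table(readLines(tmp))
cat(tab$tex[1], "\n")   # prints \pounds
```

For the real latin1 file, readLines(path, encoding = "latin1") would mark the strings with their encoding. Splitting on "[ \t]{2,}" rather than "\\s{2,}" is a deliberate choice here, so that a literal non-breaking space in the Char column is not at risk of being treated as a separator.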

Description                         Char  Code   HTML        TeX
double quote                         "    &#034; &quot;
ampersand                            &    &#038; &amp;       \&
apostrophe                           '    &#039; &apos;
less than                            <    &#060; &lt;        $<$
greater than                         >    &#062; &gt;        $>$
non-breaking space                   .    &#160; &nbsp;      ~
inverted exclamation                 ¡    &#161; &iexcl;     !'
cent sign                            ¢    &#162; &cent;
pound sterling                       £    &#163; &pound;     \pounds
general currency sign                ¤    &#164; &curren;
yen sign                             ¥    &#165; &yen;
broken vertical bar                  ¦    &#166; &brvbar;
section sign                         §    &#167; &sect;      \S
umlaut (dieresis)                    ¨    &#168; &uml;       \"{}
copyright                            ©    &#169; &copy;      \copyright
feminine ordinal                     ª    &#170; &ordf;      $^a$
left angle quote, guillemotleft      «    &#171; &laquo;     \guillemotleft
not sign                             ¬    &#172; &not;
soft hyphen                          ­    &#173; &shy;
registered trademark                 ®    &#174; &reg;       \textregistered
macron accent                        ¯    &#175; &macr;
degree sign                          °    &#176; &deg;       $^o$
plus or minus                        ±    &#177; &plusmn;    $\pm$
superscript two                      ²    &#178; &sup2;      $^2$
superscript three                    ³    &#179; &sup3;      $^3$
acute accent                         ´    &#180; &acute;     \'{}
micro sign                           µ    &#181; &micro;     $\mu$
paragraph sign                       ¶    &#182; &para;      \P
middle dot                           ·    &#183; &middot;    $\cdot$
cedilla                              ¸    &#184; &cedil;     \c{}
superscript one                      ¹    &#185; &sup1;      $^1$
masculine ordinal                    º    &#186; &ordm;      $^o$
right angle quote, guillemotright    »    &#187; &raquo;     \guillemotright
fraction one-fourth                  ¼    &#188; &frac14;    $\frac14$
fraction one-half                    ½    &#189; &frac12;    $\frac12$
fraction three-fourths               ¾    &#190; &frac34;    $\frac34$
inverted question mark               ¿    &#191; &iquest;    ?'
capital A, grave accent              À    &#192; &Agrave;    \`A
capital A, acute accent              Á    &#193; &Aacute;    \'A
capital A, circumflex accent         Â    &#194; &Acirc;     \^A
capital A, tilde                     Ã    &#195; &Atilde;    \~A
capital A, dieresis or umlaut mark   Ä    &#196; &Auml;      \"A
capital A, ring                      Å    &#197; &Aring;     \AA
capital AE diphthong (ligature)      Æ    &#198; &AElig;     \AE