
How can I find nonstandard or control characters in a large file?

andrewH wrote:

            
This is not an R solution, but here's a Windows utility I wrote to 
produce a table of frequency counts for every byte value from x00 to 
xFF in a file.

http://www.efg2.com/Lab/OtherProjects/CharCount.ZIP

Normally, you'll want to scrutinize anything below x20 or above x7E, 
since ASCII printable characters are in the range x20 to x7E. You can 
see how many tab (x09) characters are in the file, and whether the line 
endings are from Linux (x0A alone) or Windows (x0D followed by x0A).
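
For readers who cannot run a Windows executable, the same kind of table 
can be sketched in a few lines of Python. This is only an illustration 
of the idea, not the CharCount utility itself; the function names 
byte_counts and suspicious are hypothetical, and the choice to treat 
tab, LF, and CR as "usual" is an assumption.

```python
from collections import Counter


def byte_counts(path, chunk_size=1 << 20):
    """Count how often each byte value (0x00-0xFF) occurs in a file."""
    counts = Counter()
    with open(path, "rb") as fh:
        # Read in chunks so a large file need not fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            counts.update(chunk)  # iterating bytes yields ints 0..255
    return counts


def suspicious(counts):
    """Return {byte: count} for bytes outside printable ASCII (x20-x7E).

    Tab (x09), LF (x0A), and CR (x0D) are skipped on the assumption
    that they are legitimate in a text file.
    """
    usual = {0x09, 0x0A, 0x0D}
    return {b: n for b, n in sorted(counts.items())
            if (b < 0x20 or b > 0x7E) and b not in usual}
```

Comparing counts[0x0D] with counts[0x0A] gives a quick hint about line 
endings: roughly equal counts suggest Windows files, while a zero x0D 
count suggests Linux files.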


The ZIP includes the Delphi source code as well as a Windows executable. 
No installation is needed: just run the EXE after unzipping. Several 
months ago I added drag-and-drop support, so you can drop a file on the 
application to have its characters counted.

Once you find problem characters in the file, you can read the file as 
character data and use sub/gsub or other tools to remove or alter 
them.
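
As a rough illustration of that cleanup step, here is a Python analogue 
of the gsub approach, working on raw bytes. The name clean_bytes and 
the pattern are assumptions made for this sketch. Note that it treats 
every byte above x7E as a problem, so it would also strip legitimate 
non-ASCII text such as UTF-8 accented letters; adjust the pattern if 
your file is expected to contain those.

```python
import re

# Any byte outside printable ASCII (x20-x7E), except tab (x09),
# LF (x0A), and CR (x0D), which are assumed to be legitimate.
PROBLEM = re.compile(rb"[^\x20-\x7E\x09\x0A\x0D]")


def clean_bytes(data, replacement=b""):
    """Remove problem bytes, or swap each one for `replacement`."""
    return PROBLEM.sub(replacement, data)
```

Passing a visible replacement such as b"?" instead of deleting makes it 
easier to see afterwards where the problem characters were.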

efg
Earl F Glynn
UMKC School of Medicine
Center for Health Insights