md5sum issues

12 messages · Ivan Calandra, Jeff Newmiller, Ivan Krylov +1 more

#
Dear useRs,

I have some kind of a weird issue with md5sum() and I'm not sure where I 
should start.

I have a repository on GitHub, with a local Git installation and 
connected with RStudio.
I am working on Windows 10 and a colleague of mine works on Linux.
We both pull the latest commits of all files, but the checksums are 
different.
Even stranger (to me at least), I get a different checksum from the 
local file (downloaded through Git via pulling) and the same file that I 
manually download from GitHub. The checksum of the manual download from 
GitHub is the same as that of my colleague on Linux.
This happens to all text-based files (Rmd, MD, CSV...) but not to 
non-editable files (PDF, XLSX...).

For example (I have shortened the paths):
 > library(tools)

 > md5sum(file.choose()) # local repo
D:\\...\\SSFAcomparisonPaper\\README.md
"e3b08fc2ab8b3c8b57e681f862a77f32"

 > md5sum(file.choose()) # downloaded from GitHub
C:\\Users\\...\\Downloads\\README.md
"05fab51e18b962a9f3266c7b79016ce6"

 > md5sum(file.choose()) # local repo
D:\\...\\SSFAcomparisonPaper\\...\\SSFA_GuineaPigs_plot.pdf
"d9b331642bfd0d192e4eff5808b2a30f"

 > md5sum(file.choose()) # downloaded from GitHub
C:\\Users\\...\\Downloads\\SSFA_GuineaPigs_plot.pdf
"d9b331642bfd0d192e4eff5808b2a30f"

I am not sure whether it is an issue with the algorithm of md5sum(), 
or an R/RStudio/Git/GitHub/Windows issue, so I would be 
grateful if you could help me sort it out.

Thank you in advance,
Ivan
#
Sounds like a newline discrepancy issue. Highly unlikely to be an R issue.
On February 2, 2021 8:01:05 AM PST, Ivan Calandra <calandra at rgzm.de> wrote:

  
    
#
Thank you Jeff for the pointer.

If it's not an R issue, I guess it will be difficult to solve...
But maybe there is a workaround using R, like using another function or 
editing the files...? Does anyone have any idea?

Ivan

--
Dr. Ivan Calandra
TraCEr, laboratory for Traceology and Controlled Experiments
MONREPOS Archaeological Research Centre and
Museum for Human Behavioural Evolution
Schloss Monrepos
56567 Neuwied, Germany
+49 (0) 2631 9772-243
https://www.researchgate.net/profile/Ivan_Calandra
On 02/02/2021 17:05, Jeff Newmiller wrote:
#
On Tue, 2 Feb 2021 17:01:05 +0100
Ivan Calandra <calandra at rgzm.de> wrote:

            
This is probably caused by Git helpfully converting text files from LF
(0x0A) line endings to CR LF (0x0D 0x0A) when checking out the
repository clone on Windows (and back when checking in).

This configuration option is described in Pro Git:
https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration#_core_autocrlf
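
To illustrate the effect, here is a minimal R sketch (the file contents are made up for the example) that writes the same two lines of text once with LF and once with CR LF endings and compares the checksums:

```r
library(tools)

# Two temporary files with identical text, differing only in line endings.
lf_file   <- tempfile(fileext = ".md")
crlf_file <- tempfile(fileext = ".md")

# writeBin() on a raw vector writes the bytes exactly as given,
# so no newline translation happens on any platform.
writeBin(charToRaw("line one\nline two\n"), lf_file)
writeBin(charToRaw("line one\r\nline two\r\n"), crlf_file)

# The CR LF file is two bytes longer (one extra CR per line).
sizes <- file.size(c(lf_file, crlf_file))
sums  <- md5sum(c(lf_file, crlf_file))

print(sizes)
print(unname(sums))
print(sums[[1]] == sums[[2]])  # the two checksums differ
```

The text looks the same in an editor, but md5sum() hashes the bytes, so the extra CR characters change the result.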
#
On 03/02/2021 2:14 a.m., Ivan Krylov wrote:
I agree with Ivan K, but don't agree with the advice in that book.

It's best to just leave files alone, not to convert between LF and 
CR-LF.  I don't think this confuses many Windows editors these days, but 
if your editor forces files into CR-LF form, you should fix the editor, 
not try to work around it.

In my opinion everyone should run

  git config --global core.autocrlf false

Some more arguments for this (in the context of GitHub Actions) are here:

https://github.community/t/git-config-core-autocrlf-should-default-to-false/16140

Duncan Murdoch
#
Thank you Ivan and Duncan for your help.

I understand your point Duncan, but the thing is that I do have an issue 
here.
Is it then due to RStudio or even Windows? If it is, I can forget about 
a solution on that end, so I would focus on what I can do, and this Git 
setting seems to be the best place to start.

Or am I missing something (I am still a newbie on these things...)?

Ivan C

On 03/02/2021 10:06, Duncan Murdoch wrote:
#
On 03/02/2021 4:42 a.m., Ivan Calandra wrote:
In my opinion, you should run

  git config --global core.autocrlf false

in an RStudio terminal session.  That will set the git options so they 
don't mess up the md5sum values.

You should also go to the RStudio options, and in the Code section, 
Saving tab, choose Serialization to be Posix (LF) and default text 
encoding to be UTF-8.

Unfortunately, RStudio will still mess up the .Rproj file (see 
https://github.com/rstudio/rstudio/issues/1929); there's not much you 
can do about that.  Just try not to commit the Windows version to the 
repository if any non-Windows users are sharing it.

But do note that other people have different opinions.  They argue that 
files should be converted to Windows native format by git.  That works 
in some narrow use cases, but as soon as you try to extract a file from 
git on one system and work on it on another, it breaks.

Duncan Murdoch
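
To check which convention a given file currently uses, something like the following R sketch may help. The helper name count_line_endings() is hypothetical (it is not part of R or any package); it just counts CR LF pairs and bare LF bytes in the raw file contents:

```r
# Hypothetical helper: count CR LF pairs and bare LF bytes in a file.
count_line_endings <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  is_lf <- bytes == as.raw(0x0a)
  # An LF belongs to a CR LF pair when the byte before it is CR (0x0D).
  prev_is_cr <- c(FALSE, head(bytes, -1) == as.raw(0x0d))
  c(crlf = sum(is_lf & prev_is_cr), lf = sum(is_lf & !prev_is_cr))
}

# Example with a made-up file containing one ending of each kind.
mixed_file <- tempfile()
writeBin(charToRaw("a\r\nb\n"), mixed_file)
counts <- count_line_endings(mixed_file)
print(counts)
```

A file checked out with the settings above should report zero CR LF pairs; a nonzero count suggests Git or the editor converted it.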
#
Thank you very much Duncan for your help. I'll try that.

Best,
Ivan

On 03/02/2021 11:48, Duncan Murdoch wrote:
#
This CR vs LF vs CRLF newline discrepancy has been around since the 70s and the CP/M operating system. And it remains an issue in over-the-wire internet text protocols today, which actually use the CRLF version like Windows. Sorry, UNIX... world domination of LF encoding failed.

The problem with pretending there is no issue as Duncan is advocating is that text is treated differently than binary, and every time you pretend it isn't it comes back to bite you. Applying binary algorithms like MD5 to text is one of these areas where your expectation that this will be successful is what creates the problem in the first place. A similar issue occurs in file encoding: two files may both contain the word "Hello" but if they are encoded in UTF-16 and UTF-8 respectively then the MD5 results will be different.

Git does not (currently) support differences in encoding, but it does support text vs non-text (newline) differences because they are unavoidable. Pushing forward with your expectation that text files should compare the same in binary by assuming text will always be like UNIX text just defers the problem for another day.

Since I don't know what problem you are actually trying to solve, I cannot offer a concrete solution. But I would begin by not assuming that MD5 works the same on text and binary files... because it doesn't.
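
The encoding point can be seen with a short R sketch. It assumes the platform's iconv() supports the "UTF-16LE" encoding name (most do), and it writes the word "Hello" to two temporary files in UTF-8 and in UTF-16:

```r
library(tools)

# The same word as bytes in two encodings.
utf8_bytes  <- charToRaw("Hello")
# iconv(..., toRaw = TRUE) returns a list of raw vectors, one per input string.
utf16_bytes <- iconv("Hello", from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

utf8_file  <- tempfile()
utf16_file <- tempfile()
writeBin(utf8_bytes, utf8_file)
writeBin(utf16_bytes, utf16_file)

enc_sums <- md5sum(c(utf8_file, utf16_file))
print(c(utf8 = length(utf8_bytes), utf16 = length(utf16_bytes)))
print(enc_sums[[1]] == enc_sums[[2]])  # different bytes, different checksums
```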
On February 3, 2021 2:48:56 AM PST, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

  
    
#
Dear Jeff,

If I understood you correctly, it makes sense that I explain more about 
my goal here:

I am trying to find ways to have analyses that are as reproducible as 
possible (knowing that it is not going to be perfect). One part is to 
show which file(s) I used as input and what output was created, so that 
potential readers/users of my analysis can check that the file they have 
is indeed the same one that I used (and not a corrupted or modified version).
Does that make sense?

For this purpose, I originally used file information (like creation 
time and so on), but I quickly realized this doesn't help much. Then I 
tried MD5 and thought the problem was solved, but obviously it was not.

Duncan's solution seems to work (I have not fully checked yet, though), 
but I am really open to other, more robust alternatives.

Thanks for the input!
Ivan
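
One possible alternative, sketched here as an assumption rather than something proposed in the thread, is to hash text files only after converting CR LF to LF, so the checksum no longer depends on the platform's line endings. The helper name md5_text() is hypothetical:

```r
library(tools)

# Hypothetical helper: MD5 of a text file after converting CR LF to LF.
md5_text <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  is_cr <- bytes == as.raw(0x0d)
  # A CR is dropped only when it is immediately followed by LF.
  next_is_lf <- c(bytes[-1] == as.raw(0x0a), FALSE)
  normalized <- bytes[!(is_cr & next_is_lf)]
  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeBin(normalized, tmp)
  unname(md5sum(tmp))
}

# Made-up example files: same text, different line endings.
unix_file <- tempfile()
win_file  <- tempfile()
writeBin(charToRaw("id,value\n1,2.5\n"), unix_file)
writeBin(charToRaw("id,value\r\n1,2.5\r\n"), win_file)

print(md5_text(unix_file) == md5_text(win_file))
```

This should only be applied to files known to be text; binary files such as PDF or XLSX can keep using md5sum() directly.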

On 03/02/2021 17:15, Jeff Newmiller wrote:
#
On 03/02/2021 11:15 a.m., Jeff Newmiller wrote:
That misrepresents my position.  Obviously there's an issue.  I'm 
suggesting a simple solution.

Duncan Murdoch

#
Well, you can use binary input files like RDS, qs, or parquet. But you already have your code and data in Git, so checking your input is redundant... just put in a binary output reference file and a test that verifies it.
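
A rough sketch of that idea in R follows. The function run_analysis() and the file name reference_output.rds are placeholders for the real analysis and a reference file that would be committed to the repository; a temporary directory is used here only so the example is self-contained:

```r
# Placeholder for the real analysis step.
run_analysis <- function(x) {
  data.frame(n = length(x), mean = mean(x), sd = sd(x))
}

input <- c(1, 2, 3, 4)

# In a real project this path would point into the Git repository.
reference_path <- file.path(tempdir(), "reference_output.rds")

# Done once: store the expected output as a binary RDS file and commit it.
saveRDS(run_analysis(input), reference_path)

# Done on every machine afterwards: rerun and compare with the reference.
current  <- run_analysis(input)
matches  <- isTRUE(all.equal(current, readRDS(reference_path)))
stopifnot(matches)
```

all.equal() is used instead of identical() so that tiny floating-point differences between platforms do not fail the check.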
On February 3, 2021 8:25:33 AM PST, Ivan Calandra <calandra at rgzm.de> wrote: