Hash Values (SHA-1) in Git: WHa

January 21, 2020

In this post, we’re going to talk about how Git labels and refers to its commits using hash values. In the previous post, we talked about the typical Git workflow and how changes move from our working directory to our staging index and then into our repository. We called those changes simply A, B, and C. They represented different changesets.

In the simple example here, they represented changes to only a single file, but in real usage, a changeset could include changes to multiple files and directories, all packaged together into a single snapshot. It’s a snapshot of changes to our project.

We called them A, B, and C just to keep it simple, but that’s not the way that Git refers to them. Instead of that, Git generates something called a checksum for each of the changesets. That’s the hash value. 

According to Wikipedia:

In cryptography, SHA-1 (Secure Hash Algorithm 1) is a cryptographic hash function that takes an input and produces a 160-bit (20-byte) hash value known as a message digest – typically rendered as a hexadecimal number, 40 digits long. It was designed by the United States National Security Agency and is a U.S. Federal Information Processing Standard.

A checksum is a simple number that’s created by taking data and feeding it into a mathematical algorithm. So the checksum algorithm converts data into a simple number, and we call that simple number a checksum. Git relies heavily on SHA-1 for the identification and integrity checking of all file objects and commits. In principle, it is possible to create two Git repositories with the same commit hash but different contents (one of which could be backdoored).

Attackers could potentially serve either repository selectively to targeted users. Doing so would require the attackers to calculate their own collision. The known SHA-1 collision attack required over 9,223,372,036,854,775,808 SHA-1 computations. That is equivalent to approximately 6,500 years of single-CPU computation and about 110 years of single-GPU computation.
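
To put that number in perspective, here is a small arithmetic check in Python. It confirms that the figure quoted above is exactly 2 to the power 63 and compares it with the generic "birthday bound" for a 160-bit hash, which is about 2 to the power 80. The birthday bound is a general property of hash functions and is brought in here as an assumption for comparison; the variable names are purely illustrative.

```python
# The number of SHA-1 computations quoted above.
attack_cost = 9_223_372_036_854_775_808

# It is exactly 2 to the power 63.
print(attack_cost == 2 ** 63)  # True

# A generic brute-force collision search on a 160-bit hash needs
# roughly 2**(160 / 2) = 2**80 computations (the birthday bound).
brute_force_cost = 2 ** 80

# How many times cheaper the quoted attack is than brute force.
speedup = brute_force_cost // attack_cost
print(speedup)  # 131072
```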

We don’t need to understand much about how the algorithms work, but there’s one fundamental property we should know: the same data put into the same mathematical algorithm always returns the same result, the same checksum. That’s why we call it a checksum: we can check and make sure that the data is still the same. It’s used to guarantee data integrity, and data integrity is fundamentally built into Git because of this. That’s not true of all version control systems.
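
As a quick illustration of that property, here is a minimal Python sketch using the standard library’s hashlib module. The helper name checksum and the sample strings are made up for this example; this is not how Git itself is invoked.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-1 digest of the data as a hexadecimal string."""
    return hashlib.sha1(data).hexdigest()


first = checksum(b"hello git")
second = checksum(b"hello git")    # exactly the same input again
changed = checksum(b"hello git!")  # one extra character

print(first == second)   # same data gives the same checksum
print(first == changed)  # different data gives a different checksum
```

Running it prints True and then False: identical input always yields the identical checksum, and even a one-character change yields a different one.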

The label that Git uses for each one of its snapshots of changes is fundamentally tied to what’s in those changes. If we were to change that information, then the label or hash value would change. So each hash value is not only unique for all practical purposes, it’s directly tied to the contents that are inside of it. The algorithm that Git uses is SHA-1, a cryptographic hash function that takes an input and produces a 160-bit (20-byte) hash value. We don’t need to know the details of SHA-1 or how it compares to other algorithms that are out there, but we do need to know its name because it’s frequently used.

People will refer to this value as the SHA value, or the S-H-A value. So if we hear someone ask, “What’s the SHA value of that commit?”, that’s what they’re referring to. It’s the hash value that’s used to label each one of the commits. The value that it generates is a 40-character hexadecimal string. Hexadecimal means that it can contain the digits zero through nine and the letters A through F, and it would look something like this: 5c15e8bd and so on.
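
The relationship between 160 bits, 20 bytes, and 40 hexadecimal characters can be checked with a short Python sketch. The input string is arbitrary and only serves as sample data; the 8-character prefix at the end mirrors the shortened form shown above.

```python
import hashlib
import string

digest = hashlib.sha1(b"example changeset")

raw = digest.digest()      # the hash as raw bytes
text = digest.hexdigest()  # the same hash written in hexadecimal

print(len(raw) * 8)  # 160 (bits)
print(len(raw))      # 20 (bytes)
print(len(text))     # 40 (hex characters, two per byte)

# Every character is a digit 0-9 or a letter a-f.
is_hex = all(ch in string.hexdigits for ch in text)
print(is_hex)  # True

# A shortened prefix, like the 8-character example in the text.
short = text[:8]
print(len(short))  # 8
```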

So Git takes the entire changeset, meaning all the files and directories that are being changed and have been staged, runs it through the SHA-1 algorithm, and comes up with this 40-character string. That’s what it uses to label the commit. Not only does Git do that with our changeset, but it also does something else important for data integrity.
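
To make this a little more concrete, here is a hedged Python sketch of how Git hashes the contents of a single file. Git stores file contents as "blob" objects and, in repositories using the default SHA-1 object format, hashes a short header (the object type and the size in bytes) followed by the contents. The function name git_blob_id is made up for this example, and the sketch covers only file contents, not whole commits.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 ID Git assigns to a file's contents (a blob)."""
    # Git prepends a header: the object type, a space, the size in
    # bytes as decimal text, and a single NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
hello_id = git_blob_id(b"hello world\n")

print(empty_id)
print(hello_id)
```

To compare against a real Git installation, the same contents can be piped into git hash-object --stdin, which should print the same IDs.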

In addition to using the code that’s in each one of our snapshots, Git also uses the metadata. That means you can’t change the commit message, the commit author, or the parent of the commit without also changing its SHA value. That gives us a nice chain of data integrity, because when Git generates snapshot A, it takes the parent, the author, the message, and all the code changes, and generates the SHA value from them.

Then when we make snapshot B, snapshot B also goes through that same process, but it includes the SHA value from snapshot A, so it’s linked to A. If we were to change something in A, then A’s SHA value would change and B wouldn’t point to it anymore. One of the really nice features of Git is that this data integrity is built in, not only for our changesets but also for the history of changes and how they relate to each other.
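
The chain described above can be sketched with a toy model in Python. This is a simplified illustration, not Git’s real commit format: the function toy_commit_id, the field layout, and the sample author and messages are all hypothetical. It only shows the idea that each snapshot’s hash covers its parent’s hash.

```python
import hashlib
from typing import Optional


def toy_commit_id(parent: Optional[str], author: str,
                  message: str, changes: str) -> str:
    """Hash a toy snapshot: parent ID, author, message, and changes."""
    record = "\n".join([
        f"parent {parent or ''}",
        f"author {author}",
        f"message {message}",
        f"changes {changes}",
    ])
    return hashlib.sha1(record.encode("utf-8")).hexdigest()


# Snapshot A has no parent; snapshot B records A's hash as its parent.
a = toy_commit_id(None, "alice", "Add README", "README: hello")
b = toy_commit_id(a, "alice", "Expand README", "README: hello world")

# Change only A's commit message, then rebuild B on top of the new A.
a_edited = toy_commit_id(None, "alice", "Add README file", "README: hello")
b_rebuilt = toy_commit_id(a_edited, "alice", "Expand README",
                          "README: hello world")

print(a != a_edited)   # True: editing A changes A's hash
print(b != b_rebuilt)  # True: B's hash changes too, since it includes A's
```

The original B still records the old hash of A, so it no longer matches the edited A. That is the sense in which changing history breaks the chain.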

To summarize what we have discussed so far:

  • Git generates a checksum for each of the changesets (the hash value)
  • A checksum algorithm converts data into a simple number (called a checksum)
  • Same data always equals the same checksum
  • Data integrity is fundamental
  • Changing data would change the checksum
  • Git uses SHA-1 hash algorithm to create checksums
