Intermediate

Understanding Git's Use of SHA-1

Explore how Git uses SHA-1 hashes for content-addressable storage, commit integrity, and distributed version control.

Why Git Uses Hashes

Git is fundamentally a content-addressable filesystem. Every object (commit, tree, blob, tag) is identified by the SHA-1 hash of its content:

  • Content-addressable: Objects are stored and retrieved by their hash, not by filename
  • Integrity: Corruption is detectable because the recomputed hash no longer matches
  • Deduplication: Identical content has the same hash, so it is stored only once
  • Distributed: Hashes are unique for all practical purposes, enabling decentralized collaboration without a central ID authority

Git's Design Philosophy

Linus Torvalds designed Git so that corruption cannot go unnoticed. If a single bit flips in any object, its recomputed hash no longer matches its name, and Git reports the mismatch when it reads or checks the object. This makes Git very reliable for distributed development.

Git Object Types

Git stores four types of objects, each identified by a SHA-1 hash:

Blob (Binary Large Object)

Stores file content. The hash is computed from the file data plus a header.

SHA-1("blob " + filesize + "\0" + file_content)

Tree

Stores directory structure. Lists filenames, permissions, and blob/tree hashes.

100644 blob a906cb2a4a904a152e80877d4088654daad0c859  README.md
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0  src/
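
The listing above is the human-readable form that git cat-file -p prints. As the tree format is commonly documented, each entry is stored as mode, space, name, a null byte, and then the 20 raw hash bytes, with directory modes written as 40000 (no leading zero). The sketch below packs the two example entries that way and hashes the result; the helper names git_object_id and serialize_tree are hypothetical and chosen only for illustration.

```python
import hashlib


def git_object_id(obj_type: str, payload: bytes) -> str:
    """SHA-1 over '<type> <size>\\0' followed by the payload."""
    header = f"{obj_type} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).hexdigest()


def serialize_tree(entries):
    """Pack (mode, name, hex_hash) tuples into the binary tree payload."""
    def sort_key(entry):
        mode, name, _ = entry
        # Directories sort as if their name ended with '/'.
        return (name + "/" if mode == "40000" else name).encode()

    payload = b""
    for mode, name, hex_hash in sorted(entries, key=sort_key):
        # Mode and name are text; the hash is stored as 20 raw bytes.
        payload += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(hex_hash)
    return payload


entries = [
    ("100644", "README.md", "a906cb2a4a904a152e80877d4088654daad0c859"),
    ("40000", "src", "99f1a6d12cb4b6f19c8655fca46c3ecf317074e0"),
]
tree_payload = serialize_tree(entries)
print(len(tree_payload))                    # size that goes into the header
print(git_object_id("tree", tree_payload))  # hash of this illustrative tree
```

Because the entry hashes are part of the payload, changing any file below the directory changes the tree hash as well.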

Commit

Stores metadata: tree hash, parent commit(s), author, committer, message, timestamp.

tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0
parent 0d1d7fc32e5a0204bd39d46d4c8a4e9a8b5c6e7f
author Alice <alice@example.com> 1715097600 +0000
committer Alice <alice@example.com> 1715097600 +0000

Add user authentication feature

Tag

Stores annotated tag information: object hash, type, tagger, message.

object 0d1d7fc32e5a0204bd39d46d4c8a4e9a8b5c6e7f
type commit
tag v1.0.0
tagger Alice <alice@example.com> 1715097600 +0000

Release version 1.0.0

How Git Computes Hashes

Let's compute a Git blob hash manually to understand the process:

Example file content:
Hello, World!
Git adds a header:
blob 14\0Hello, World!\n
Format: type + space + size in bytes + null byte + content. The size is 14 because echo appends a newline to the 13 visible characters.
Compute SHA-1:
8ab686eafeb1f44702738c8b0f24f2567c36da6d
Verify with Git:
echo "Hello, World!" | git hash-object --stdin
Output: 8ab686eafeb1f44702738c8b0f24f2567c36da6d
Compute manually with OpenSSL:
(printf "blob 14\0"; echo "Hello, World!") | openssl sha1
Python implementation:
import hashlib

# echo appends a trailing newline, so it is part of the hashed content
content = b"Hello, World!\n"
# Header: object type, space, size in bytes, null byte
header = f"blob {len(content)}\0".encode()
store = header + content

hash_val = hashlib.sha1(store).hexdigest()
print(hash_val)  # 8ab686eafeb1f44702738c8b0f24f2567c36da6d

Content-Addressable Storage

Git stores objects in .git/objects/ using the hash as the path:

Object storage structure:
.git/objects/8a/b686eafeb1f44702738c8b0f24f2567c36da6d
          ↑↑  ↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑
          dir  filename (remaining 38 characters)
First 2 characters = directory, remaining 38 = filename. This prevents too many files in one directory.
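
To make the layout concrete, here is a minimal sketch of a loose-object store. It assumes the commonly documented on-disk form, where the header plus content is zlib-compressed before being written. The function names are hypothetical, and a temporary directory stands in for .git/objects so nothing touches a real repository.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, payload: bytes) -> str:
    """Store an object in the 2 + 38 character layout and return its hash."""
    store = f"{obj_type} {len(payload)}\0".encode() + payload
    object_id = hashlib.sha1(store).hexdigest()
    # First 2 hex characters name the directory, the other 38 the file.
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    if not os.path.exists(path):  # identical content is stored only once
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(store))
    return object_id


def read_loose_object(objects_dir: str, object_id: str):
    """Return (type, payload) for a stored object."""
    path = os.path.join(objects_dir, object_id[:2], object_id[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, payload = store.partition(b"\0")
    obj_type, _size = header.decode().split(" ")
    return obj_type, payload


with tempfile.TemporaryDirectory() as objects_dir:  # stand-in for .git/objects
    first = write_loose_object(objects_dir, "blob", b"some file content\n")
    second = write_loose_object(objects_dir, "blob", b"some file content\n")
    print(first == second)  # same content, same object
    print(read_loose_object(objects_dir, first))
```

Real repositories also keep most objects in packfiles, which this sketch does not model.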

Commit Hash Integrity

Every commit hash depends on its entire history, creating a tamper-evident chain:

Commit dependency chain:
Commit C = SHA-1(tree + parent_B + author + committer + message)
  ↓
Commit B = SHA-1(tree + parent_A + author + committer + message)
  ↓
Commit A = SHA-1(tree + author + committer + message)
Changing commit A changes its hash, which changes B's parent reference, which changes B's hash, which changes C's parent reference, and so on up to the branch tip.

Why Rewriting History Changes Hashes

Commands like git rebase or git commit --amend create new commits with different hashes. The old commits still exist (until garbage collected), but the branch now points to new commits.
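
The ripple effect can be reproduced with a small sketch. It builds simplified commit bodies in the same field order as the commit example earlier and hashes them with a "commit <size>\0" header. The helpers commit_id and build_chain are hypothetical, and real commits may carry extra fields (for example signatures) that are left out here.

```python
import hashlib


def commit_id(tree, parents, author, message):
    """Hash a simplified commit body with a 'commit <size>\\0' header."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {author}", "", message, ""]
    body = "\n".join(lines).encode()
    return hashlib.sha1(f"commit {len(body)}\0".encode() + body).hexdigest()


def build_chain(first_message):
    """Return the hashes of three commits A <- B <- C."""
    tree = "99f1a6d12cb4b6f19c8655fca46c3ecf317074e0"
    who = "Alice <alice@example.com> 1715097600 +0000"
    a = commit_id(tree, [], who, first_message)
    b = commit_id(tree, [a], who, "Second commit")
    c = commit_id(tree, [b], who, "Third commit")
    return a, b, c


original = build_chain("First commit")
rewritten = build_chain("First commit (amended)")
for old, new in zip(original, rewritten):
    print(old[:8], "->", new[:8])  # every commit in the chain gets a new hash
```

Only the first commit's message differs between the two chains, yet all three hashes change, which is why an amended or rebased branch can never reuse the old commit IDs.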

Practical Git Commands

View object type:
git cat-file -t 8ab686ea
Output: blob, tree, commit, or tag
View object content:
git cat-file -p 8ab686ea
Shows the actual content of the object
Compute a file's object hash:
git hash-object README.md
Computes the hash the file would get as a blob without storing it (add -w to write the object)
Verify repository integrity:
git fsck --full
Checks all objects for corruption
Show commit details:
git show --format=raw HEAD
Shows tree hash, parent, author, committer

SHA-1 Collision Concerns

In 2017, researchers from CWI Amsterdam and Google demonstrated the first practical SHA-1 collision (the SHAttered attack). Git's response:

Git's Collision Detection

Since version 2.13, Git computes SHA-1 with a collision-detecting implementation (SHA-1DC) by default. It recognizes inputs that carry the signature of known collision attacks such as SHAttered and refuses to hash them, so objects crafted this way cannot be added to a repository.

Migration to SHA-256

Git is transitioning to SHA-256. Git 2.29+ supports SHA-256 repositories, initially as an experimental feature. Command: git init --object-format=sha256
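
The following sketch shows what changes for an object name when the hash function is swapped. It assumes that SHA-256 repositories keep the same "type size\0content" layout and only replace the hash function, which is how the transition is generally described; the helper object_id is hypothetical.

```python
import hashlib


def object_id(payload: bytes, obj_type: str = "blob", algo: str = "sha1") -> str:
    """Hash '<type> <size>\\0<payload>' with the chosen algorithm."""
    store = f"{obj_type} {len(payload)}\0".encode() + payload
    return hashlib.new(algo, store).hexdigest()


data = b"Hello, Git!\n"
sha1_id = object_id(data, algo="sha1")
sha256_id = object_id(data, algo="sha256")
print(len(sha1_id), sha1_id)      # 40 hex characters
print(len(sha256_id), sha256_id)  # 64 hex characters
```

The longer IDs are the visible difference: every stored reference to an object (tree entries, parent lines, tags) grows accordingly.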

Practical Risk Assessment

Creating a SHA-1 collision requires massive computational resources (the SHAttered computation was estimated at roughly $110,000 of cloud compute time). For most projects, SHA-1 with collision detection remains secure enough. Critical infrastructure should plan a migration to SHA-256.

Real-World Applications

Distributed Collaboration

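Developers can work offline, create commits, and later exchange them without a central server assigning IDs, because each object's hash is derived from its content and is effectively unique. Conflicting edits to the same lines still have to be resolved at merge time.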

Efficient Storage

Identical files across branches are stored only once. Git deduplicates automatically using content hashes.

Corruption Detection

If a disk error corrupts an object file, Git detects it the next time the object is read or checked, because the recomputed hash won't match. Run git fsck to verify integrity.
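
The check itself is simple to sketch: re-hash the stored bytes and compare the result with the name the object was stored under. The helpers blob_id and is_intact are hypothetical and only illustrate the idea; they are not how git fsck is implemented.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Git-style blob hash: SHA-1 over 'blob <size>\\0' plus the content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


def is_intact(expected_id: str, content: bytes) -> bool:
    """Re-hash the content and compare with the name it was stored under."""
    return blob_id(content) == expected_id


original = b"Hello, Git!\n"
stored_id = blob_id(original)

# Simulate a disk error by flipping the lowest bit of the first byte.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]

print(is_intact(stored_id, original))   # True
print(is_intact(stored_id, corrupted))  # False
```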

Reproducible Builds

Commit hashes uniquely identify code state. CI/CD systems use hashes to ensure they're building the exact code that was tested.

Try It Yourself

Experiment with Git hashes using our Hash Calculator:

Exercise: Compute a Git Blob Hash
  1. Create a file: echo "Hello, Git!" > test.txt
  2. Get Git's hash: git hash-object test.txt
  3. Go to the Hash Calculator
  4. Select SHA-1 algorithm
  5. Enter: blob 12\0Hello, Git! followed by a newline (use an actual null byte for \0; the size 12 counts the newline that echo adds)
  6. Compare the hash; it should match Git's output
