What Hash Functions Do
A hash function takes an input of any length and produces a fixed-length output: a short string of characters that acts as a fingerprint of the input data. Because the output space is finite, two different inputs can in principle share a hash, but a good hash function makes such collisions infeasible to find. Change even a single character in the input, and the output changes dramatically and unpredictably. This property makes hash functions useful for verifying data integrity, authenticating messages, indexing data structures, and protecting passwords.
Cryptographic hash functions have five essential properties. Determinism: the same input always produces the same output. Quick computation: calculating the hash is fast, with a cost that grows only linearly with input size. Preimage resistance: given a hash output, it is computationally infeasible to find an input that produces it. Collision resistance: it is computationally infeasible to find two different inputs that produce the same output. Avalanche effect: a tiny change in input produces a completely different output.
Consider a practical example. The SHA-256 hash of the string "hello" is 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824. Change one letter to "hellp" and the hash becomes an entirely different 64-character string with no discernible relationship to the first hash. This dramatic response to small changes is what makes hash functions valuable for detecting modifications.
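The avalanche effect described above can be observed directly. The following is a minimal sketch using Python's standard hashlib module; the helper names sha256_hex and differing_bits are illustrative and not part of any library.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as 64 hex characters."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions at which two equal-length hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


if __name__ == "__main__":
    first = sha256_hex("hello")
    second = sha256_hex("hellp")
    print(first)
    print(second)
    # For a well-behaved hash, roughly half of the 256 bits are expected to flip.
    print(differing_bits(first, second), "of 256 bits differ")
```

Running it prints both digests and the number of output bits that changed after the one-letter edit.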
Hash functions are one-way operations. You can compute the hash of any input, but you cannot reverse the process to recover the original input from the hash. This is not encryption — encryption is reversible with the correct key. Hashing is intentionally irreversible, which makes it suitable for storing data that needs to be verified without being recoverable.
Understanding when to use which hash function — and when not to use certain ones — is critical for both security and functionality. Not all hash functions are created equal, and some that were once considered secure are now dangerously broken.
MD5: Fast but Broken
MD5 (Message Digest Algorithm 5) produces a 128-bit hash value, typically rendered as a 32-character hexadecimal string. It was designed in 1991 by Ronald Rivest and became one of the most widely used hash functions on the internet. For many years, MD5 was the standard for file integrity verification, password storage, and digital signatures.
Today, MD5 is cryptographically broken and should not be used for any security-sensitive purpose. In 2004, researchers demonstrated practical collision attacks — finding two different inputs that produce the same MD5 hash. By 2008, researchers had forged a rogue SSL certificate using an MD5 collision, proving that the vulnerability was exploitable in real-world scenarios.
The practical implication is that an attacker can craft two files that share the same MD5 hash, one harmless and one malicious, have the harmless one reviewed or published, and later substitute the malicious one. If you verify file integrity by comparing MD5 hashes, a matching hash no longer guarantees that the file has not been tampered with, because two different files can produce the same MD5 output.
When MD5 is still acceptable: non-security applications where speed matters more than collision resistance. MD5 is sometimes used for cache keys, checksum deduplication in non-adversarial contexts, and quick data fingerprinting where the threat model does not include malicious actors crafting collisions. If you are hashing log entries to detect duplicates, MD5 works fine because nobody is trying to forge a log entry that matches a specific hash.
When MD5 must never be used: password storage, file integrity verification for downloads, digital signatures, SSL certificates, and any application where an attacker could benefit from finding a collision. For all of these use cases, SHA-256 or stronger is the minimum acceptable standard.
SHA-256: The Current Standard
SHA-256 (Secure Hash Algorithm 256-bit) is part of the SHA-2 family designed by the NSA and published by NIST. It produces a 256-bit hash value, rendered as a 64-character hexadecimal string. SHA-256 is currently the most widely recommended hash function for general-purpose cryptographic use.
SHA-256 is used pervasively across modern systems. The Bitcoin blockchain uses SHA-256 for proof-of-work mining and transaction verification. TLS certificates use SHA-256 for digital signatures, so you have encountered it every time your browser displayed a padlock icon. Git identifies commits, trees, and blobs by their content hash; it has historically used SHA-1 for this and is migrating to SHA-256. Software distributors provide SHA-256 checksums alongside downloads so users can verify that the file they received matches the file the publisher intended.
The security of SHA-256 rests on two properties. Collision resistance: no two different inputs have been found that produce the same SHA-256 output, despite extensive research and massive computational resources devoted to the search. Preimage resistance: given a SHA-256 hash, there is no known method for finding an input that produces it that is faster than brute-force trying every possible input — a computational task that would take longer than the age of the universe with current technology.
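To put the brute-force claim in numbers, here is a rough back-of-the-envelope calculation. The attacker speed of 10**18 hashes per second is a hypothetical and deliberately generous assumption, and the 2**128 figure for collisions is the generic birthday bound for a 256-bit hash, not a known attack.

```python
# Assumed attacker speed: 10**18 hashes per second (hypothetical, generous).
HASHES_PER_SECOND = 10**18
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
AGE_OF_UNIVERSE_YEARS = 13.8e9  # approximate


def years_to_exhaust(bits: int, rate: int = HASHES_PER_SECOND) -> float:
    """Years needed to try 2**bits candidates at `rate` hashes per second."""
    return (2**bits) / rate / SECONDS_PER_YEAR


preimage_years = years_to_exhaust(256)   # brute-force preimage search
collision_years = years_to_exhaust(128)  # generic birthday-bound collision search

if __name__ == "__main__":
    print(f"preimage search:  about {preimage_years:.1e} years")
    print(f"collision search: about {collision_years:.1e} years")
    print(f"age of universe:  about {AGE_OF_UNIVERSE_YEARS:.1e} years")
```

Under these assumptions both searches take far longer than the age of the universe, which is the sense in which SHA-256 is considered out of reach for brute force.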
SHA-256 is typically slower than MD5 in pure software implementations, by roughly a factor of two to three, although processors with dedicated SHA instructions can narrow or remove the gap. Either way, the speed difference is irrelevant for the vast majority of applications. A SHA-256 hash of a typical file computes in milliseconds. The security benefit of using a cryptographically sound algorithm vastly outweighs the marginal performance cost.
For password hashing specifically, raw SHA-256 is not a good choice, precisely because it is fast. Password hashing benefits from intentionally slow algorithms that increase the cost of brute-force attacks. Bcrypt, Argon2, and PBKDF2 are purpose-built for password storage and should be used instead of raw SHA-256 when hashing passwords.
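As an illustration of a purpose-built password hash, the sketch below uses PBKDF2 through Python's standard hashlib.pbkdf2_hmac. The function names, the 16-byte salt, and the iteration count are assumptions chosen for the example, not a recommendation taken from this article; pick a work factor that suits your hardware and policy.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # assumed work factor; tune for your hardware and policy


def hash_password(password: str, iterations: int = ITERATIONS) -> tuple[bytes, bytes, int]:
    """Derive a slow, salted hash. Returns (salt, derived_key, iterations)."""
    salt = os.urandom(16)  # fresh random salt per password
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, derived, iterations


def verify_password(password: str, salt: bytes, expected: bytes, iterations: int) -> bool:
    """Recompute the derived key and compare it in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, expected)
```

Store the salt, the derived key, and the iteration count together so the work factor can be raised later without invalidating existing records.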
Other Hash Algorithms: SHA-1, SHA-512, and Beyond
SHA-1 produces a 160-bit hash and followed MD5 into widespread use in SSL certificates, Git commits, and digital signatures through the 2000s. In 2017, Google and CWI Amsterdam demonstrated the SHAttered attack, a practical collision attack against SHA-1. The attack required approximately 2^63 (about 9.2 quintillion) SHA-1 computations and roughly 6,500 CPU-years of computing. While expensive, this proved that SHA-1 collisions are achievable with sufficient resources. SHA-1 has been deprecated for most security applications since the early 2010s and should not be used in new systems.
SHA-512 is the longer sibling of SHA-256, producing a 512-bit hash (128 hex characters). It provides a larger security margin against future attacks and often performs better than SHA-256 in software on 64-bit processors because it processes data in 64-bit words. However, for most applications the additional security margin is unnecessary: SHA-256 is not known to have any practical vulnerabilities, and doubling the hash length does not double the practical security. SHA-512 is mainly preferred on 64-bit hardware where that performance advantage matters.
SHA-3 (and its extendable-output variants SHAKE128 and SHAKE256) is the newest member of the Secure Hash Algorithm family, selected through a public competition organized by NIST. It uses a different internal structure (the Keccak sponge construction) than SHA-1 and SHA-2 (the Merkle-Damgård construction), which provides algorithmic diversity: a structural weakness found in SHA-2's construction would be unlikely to carry over to SHA-3. SHA-3 is recommended for new systems that want the highest security margin, but SHA-256 remains perfectly adequate for the vast majority of applications.
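The algorithms discussed in this section are all available through Python's standard hashlib module, which makes the differences in output length easy to see. This is a small sketch; the helper names hex_digest and shake256_hex are illustrative.

```python
import hashlib

MESSAGE = b"hello"


def hex_digest(algorithm: str, message: bytes = MESSAGE) -> str:
    """Hash `message` with a named fixed-length algorithm, e.g. 'sha256'."""
    return hashlib.new(algorithm, message).hexdigest()


def shake256_hex(message: bytes, length_bytes: int) -> str:
    """SHAKE256 is extendable-output: the caller chooses the digest length."""
    return hashlib.shake_256(message).hexdigest(length_bytes)


if __name__ == "__main__":
    for name in ("sha1", "sha256", "sha512", "sha3_256"):
        digest = hex_digest(name)
        # Each hex character encodes 4 bits of the digest.
        print(f"{name:9} {len(digest) * 4:4} bits  {digest[:16]}...")
    print("shake_256 (16 bytes requested):", shake256_hex(MESSAGE, 16))
```

SHA-256 and SHA3-256 both produce 256-bit digests, but for the same input the values differ because the internal constructions are unrelated.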
For non-cryptographic purposes — hash tables, bloom filters, checksums in controlled environments — faster algorithms like MurmurHash, CityHash, or xxHash are appropriate. These are not designed for security but are significantly faster for data indexing and comparison tasks where collision resistance against adversarial inputs is not a concern.
Practical Applications in Development
File integrity verification is the most straightforward application. When you download a file from the internet, the publisher often provides a SHA-256 checksum. After downloading, compute the SHA-256 hash of the received file and compare it against the published checksum. If they match, the file is identical to what the publisher intended, provided the checksum itself came from a source you trust. If they differ, the file was corrupted during transfer or tampered with. The Hash Generator on Utiliify computes file hashes directly in the browser without uploading your files to any server.
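For scripted verification, the same check can be sketched in a few lines of Python. The function names and the 1 MiB chunk size are assumptions for the example; reading in chunks simply keeps memory use flat for large downloads.

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """Compare the file's hash with a published checksum, ignoring case and whitespace."""
    return sha256_file(path) == published_hex.strip().lower()
```

A call such as matches_published("download.iso", checksum_from_publisher) returns True only when every byte of the file matches what the publisher hashed; both argument values here are placeholders.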
Data deduplication uses hashes to identify duplicate content without comparing files byte-by-byte. Upload systems hash each file and store the hash alongside the file. When a new upload arrives, the system hashes it and checks whether that hash already exists. If it does, the file is a duplicate and does not need to be stored again. This approach powers the storage efficiency of cloud backup systems, email attachment processing, and content delivery networks.
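The upload flow described above can be sketched with an in-memory dictionary standing in for real storage. The DedupStore class and its method names are hypothetical and exist only for this illustration.

```python
import hashlib


class DedupStore:
    """Toy content store that keeps one copy per distinct SHA-256 hash."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, content: bytes) -> tuple[str, bool]:
        """Store content if unseen. Returns (hash, True if newly stored)."""
        key = hashlib.sha256(content).hexdigest()
        is_new = key not in self._blobs
        if is_new:
            self._blobs[key] = content
        return key, is_new

    def __len__(self) -> int:
        return len(self._blobs)
```

Uploading the same bytes twice returns the same key both times, with the second call reporting that nothing new was stored. A production system would keep the hash index in a database and the content on disk or in object storage.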
API request signing uses HMAC (Hash-based Message Authentication Code) to verify that API requests have not been tampered with in transit. The sender feeds the request parameters and a shared secret key into the HMAC construction, which mixes the key into two nested hash computations rather than simply hashing the key concatenated with the message, and includes the resulting tag in the request. The receiver performs the same computation and compares the tags, ideally with a constant-time comparison. If they match, the request is authentic and unmodified. If they differ, the request was altered in transit or signed with a different key.
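A minimal sketch of request signing with Python's standard hmac module follows. The canonical message layout (method, path, and body joined by newlines) and the function names are assumptions made for the example; real APIs define their own canonical format.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, body: bytes) -> str:
    """Return an HMAC-SHA256 tag over a hypothetical canonical request string."""
    message = b"\n".join([method.upper().encode("utf-8"), path.encode("utf-8"), body])
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret: bytes, method: str, path: str, body: bytes, signature: str) -> bool:
    """Recompute the tag and compare in constant time to avoid timing leaks."""
    expected = sign_request(secret, method, path, body)
    return hmac.compare_digest(expected, signature)
```

The sender attaches the value returned by sign_request to the request, for example in a header, and the receiver calls verify_request with the same shared secret. Any change to the method, path, or body produces a different tag.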
Content-addressable storage uses the hash of the content as its identifier. Git is the most famous example: every object (commit, tree, blob) is identified by its SHA-1 hash (migrating to SHA-256). This means that content and identity are the same thing — if the content changes, the identifier changes, making it impossible to modify content without detection. IPFS, Docker image layers, and Nix package manager all use content-addressable storage based on cryptographic hashes.
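Git's SHA-1 object IDs make the idea concrete: a blob's identifier is the hash of a short header followed by the file contents. The sketch below reproduces that calculation for blobs only; the function name git_blob_id is illustrative, and the sketch does not cover trees, commits, or SHA-256 repositories.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a blob with this content."""
    # Git hashes the header "blob <size in bytes>" plus a NUL byte, then the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()
```

For an empty file this yields e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the ID Git shows for empty blobs. Changing a single byte of the content changes the identifier.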
Decision Guide: Which Hash to Use
For file integrity checks and download verification: SHA-256. It is the current standard, universally supported, and not known to have any practical vulnerabilities. The Hash Generator on Utiliify computes SHA-256 hashes for any text or file input.
For password storage: Bcrypt, Argon2, or PBKDF2 — never MD5, SHA-256, or any raw hash function. Password hashing requires intentionally slow algorithms with per-user salts to resist brute-force and rainbow table attacks. Raw hash functions are too fast for password storage.
For API signatures and authentication: HMAC-SHA256. The HMAC construction adds a secret key to the hash function, providing both integrity verification and authentication in a single operation.
For data indexing and deduplication: SHA-256 for security-sensitive contexts, or xxHash/MurmurHash for performance-sensitive contexts where adversarial collisions are not a concern.
For cache keys and fingerprinting: MD5 is acceptable in non-adversarial contexts where speed matters and nobody is trying to engineer a collision. SHA-256 is the safer default if you have no reason to prefer speed.
The principle is straightforward: when security matters, use SHA-256 or stronger. When only speed matters and the threat model excludes adversarial inputs, faster algorithms are appropriate. Never use MD5 or SHA-1 for any purpose where collision resistance provides a security guarantee — they are broken, and continuing to use them for security purposes creates real vulnerabilities in your systems.