What Actually Happens When You Delete a File
When you run rm holiday-photos.tar.gz and the prompt returns instantly, something counterintuitive has happened: almost nothing. The file's bytes are still there. The filename is gone, but the data it pointed at is not overwritten. On a mechanical hard drive with the right tool and a bit of luck, you can recover most of it days later. On a modern SSD with TRIM, it may already be unreachable before you finish typing the next command. On an ext4 filesystem with an active journal, parts of the file's metadata may be preserved for minutes even after the free list reclaims the data blocks.
This article walks through the full story: from the unlink syscall, through the VFS layer, down into the on-disk structures of ext4 and NTFS, through the block allocator, past the journal, into the world of forensic recovery tools like photorec and extundelete, and finally into the NAND flash controller that makes most of this work irrelevant on an SSD. The goal is not to sell anyone on a recovery tool. It is to explain why deletion behaves the way it behaves, what actually stays on disk, and what your options really are when something important disappears.
The Syscall Behind rm
Let me clear up the most common misconception first. rm does not delete files. It calls unlink(2), a POSIX syscall whose job is to remove one name from the filesystem. A file on Unix can have multiple names (hard links) pointing at the same inode. unlink removes one. If that was the last name, and no process still has the file open, the filesystem is free to reclaim the inode and its data blocks. Otherwise, the file continues to exist, nameless but alive, until the last reference disappears.
You can see this yourself. Open a terminal, write a small log file, tail it in one window, and delete it in another:
# terminal 1
$ echo "important data" > /tmp/obs.log
$ tail -f /tmp/obs.log
# terminal 2
$ rm /tmp/obs.log
$ ls -l /proc/$(pgrep tail)/fd/ | grep obs
lr-x------ 1 alex alex 64 Apr 6 11:42 3 -> /tmp/obs.log (deleted)
The file is marked (deleted) but the file descriptor still works. The tail process can still read from it. Its inode still has a refcount of one (from the open file description). Only when tail closes the file does the filesystem actually release the inode and the data blocks. This is why you can free up disk space by restarting a process that holds a rotated log file. The bytes were already logically deleted, but the reference count kept them alive.
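The same experiment can be reproduced in a single process. The following is a minimal sketch, assuming a POSIX system (on Windows, removing an open file behaves differently); the temporary file and variable names are illustrative only.

```python
import os
import tempfile

# Create a scratch file, keep a descriptor open, then remove its only name.
fd, path = tempfile.mkstemp()
os.write(fd, b"important data\n")

os.unlink(path)                      # detaches the name; the inode survives

name_exists = os.path.exists(path)   # the name is gone from the directory
links = os.fstat(fd).st_nlink        # link count as seen through the open fd

os.lseek(fd, 0, os.SEEK_SET)
data = os.read(fd, 64)               # the bytes are still readable via the fd

os.close(fd)                         # last reference dropped; space can be reclaimed
print(name_exists, links, data)
```

On typical Linux filesystems the link count reported through the descriptor is 0 at this point, which is the state the kernel labels "(deleted)" in /proc.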
The C function behind rm looks roughly like this:
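#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 2;
    }
    /* Remove one name; the kernel decides when the inode can be freed. */
    if (unlink(argv[1]) == -1) {
        perror("unlink");
        return 1;
    }
    return 0;
}
That is the whole operation from userspace. One syscall. The kernel does the work.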
One more subtlety matters here. unlink returns before the file is actually released. The syscall only detaches the name. Everything that happens afterwards (freeing inodes, freeing blocks, issuing TRIM, flushing metadata to disk) is scheduled work that may take milliseconds to seconds. If your application measures "file deleted" by the moment unlink returns zero, you are measuring a name removal, not a data release. A hot loop that creates and deletes files as fast as it can may outrun the filesystem's cleanup machinery and pile up pending work in memory.
VFS, Inodes, and the Dentry Cache
The Linux kernel implements filesystems behind an abstraction called the Virtual Filesystem Switch. VFS defines a shared vocabulary (superblocks, inodes, dentries, files) that individual filesystems implement. When unlink enters the kernel, it runs through the VFS layer before it ever touches ext4 or XFS or Btrfs code.
The path is roughly:
- vfs_unlink() in fs/namei.c locks the parent directory's inode and looks up the target dentry.
- It checks permissions: the process must have write permission on the parent directory. File permissions on the target itself are not required, which is why you can delete files you cannot read.
- It calls the filesystem's own ->unlink method through the inode_operations struct. For ext4 that is ext4_unlink() in fs/ext4/namei.c.
- The filesystem modifies on-disk metadata: directory entry removed, inode link count decremented.
- If link count reaches zero and no open file descriptors exist, the filesystem schedules the inode for deallocation.
- The dentry cache entry is marked negative or freed. The page cache pages belonging to the file are eventually evicted.
A dentry is the kernel's in-memory representation of a directory entry: a name, a pointer to an inode, and a place in a hash table and a tree. Dentries are what the kernel uses to walk paths quickly without rereading directory blocks for every lookup. When unlink removes a name, the corresponding dentry is unhashed. Any future lookup of that name returns -ENOENT without touching the disk.
The inode itself contains everything the file needs to exist: ownership, permissions, timestamps, size, and the block pointers that lead to the actual data. On a classic Unix-style filesystem, an inode lives at a fixed on-disk location and is addressed by an integer inode number. You can see the inode number of any file with stat or ls -i:
$ ls -i /etc/hostname
266241 /etc/hostname
The dentry cache has one more role that matters here. It is the reason repeated lookups of the same path are cheap, and it is also why stale entries can outlive their files. When you delete a file, the kernel invalidates the positive dentry and replaces it with a negative one: a placeholder that remembers "this name does not exist" so future lookups can return -ENOENT in constant time without touching the disk. Negative dentries are reclaimed under memory pressure. On a long-running server with plenty of RAM, the kernel may keep millions of negative dentries around, which is why tools like slabtop often show dentry as one of the largest slab caches on the system.
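Returning to inode numbers and link counts: the relationship between names and inodes can be observed from userspace. This is a small sketch, assuming a filesystem that supports hard links; the file names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    first = os.path.join(d, "first-name")
    second = os.path.join(d, "second-name")

    with open(first, "w") as f:
        f.write("same bytes, two names\n")
    os.link(first, second)                   # add a second name for the same inode

    same_inode = os.stat(first).st_ino == os.stat(second).st_ino
    links_before = os.stat(first).st_nlink   # two names point at the inode

    os.unlink(first)                         # remove one name only
    links_after = os.stat(second).st_nlink   # one name is left
    with open(second) as f:
        survivor = f.read()                  # the data is untouched

print(same_inode, links_before, links_after, survivor)
```

Both names report the same inode number, and removing one of them only lowers the link count; the file itself goes away when the count reaches zero and nothing holds it open.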
How ext4 Actually Removes a File
ext4 is a good filesystem to study because it is a direct descendant of the original ext2 design, still uses classic Unix structures, and is the default on most Linux distributions. Everything we discuss applies conceptually to XFS, Btrfs, APFS, and NTFS, but ext4 is the clearest teaching example.
An ext4 filesystem is divided into block groups, each containing a fixed number of inodes and data blocks. Each block group has:
- A block bitmap: one bit per data block, 1 meaning allocated.
- An inode bitmap: one bit per inode in the group, 1 meaning allocated.
- An inode table: an array of inode structures.
- The data blocks themselves.
When ext4 deletes a file, it touches several of these structures in a single journal transaction:
- Directory entry: the entry for the file's name in the parent directory is removed. In ext4, directories are arrays of variable-length ext4_dir_entry_2 records. Removal is done by extending the previous record's length to cover the deleted entry, rather than physically moving bytes. The old record's inode number and filename remain on disk, still readable if you scan the raw directory block.
- Inode link count: i_links_count is decremented. For a regular file with one link, it goes from 1 to 0.
- Inode deletion time: i_dtime is set to the current Unix timestamp. This is one of the most useful forensic clues in the whole system. On a live filesystem, i_dtime is 0. On a deleted inode, it records the exact moment of deletion.
- Inode bitmap: the bit corresponding to this inode is cleared, marking it free.
- Block bitmap: the bits corresponding to the file's data blocks are cleared, marking them free.
- Block group descriptor: free inode count and free block count are incremented.
Crucially, ext4 does not zero the inode itself. The extent tree (or the older indirect block pointers, on files created before the extents feature was enabled) is left in place. The data blocks are not zeroed either. The block and inode bitmaps are updated, which is how the filesystem knows the space is reusable, but the underlying bytes remain until something else writes over them.
This has a very practical consequence. Immediately after rm, everything you need to reconstruct the file still exists on disk:
- The inode structure, including size, timestamps, and the extent tree pointing at the data blocks.
- The data blocks themselves, still containing the file's bytes.
- Often, the directory entry in its slack space, with the original filename.
What is missing is only the reachability: the filesystem no longer considers those inodes and blocks allocated, so the next write can grab them at any moment. Recovery is a race against reuse.
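The directory-entry step described above can be illustrated with a toy model. This is a sketch under stated assumptions, not ext4's code: it uses the ext4_dir_entry_2 field order (inode, rec_len, name_len, file_type, name) in a 64-byte block, ignores htree indexes and checksums, and assumes a kernel that leaves removed entries in place rather than wiping them. The helper names are hypothetical.

```python
import struct

BLOCK_SIZE = 64                      # toy size; real directory blocks are usually 4096 bytes
HEADER = struct.Struct("<IHBB")      # inode, rec_len, name_len, file_type

def min_len(name):
    # 8-byte header plus the name, rounded up to a multiple of 4
    return (HEADER.size + len(name) + 3) & ~3

def build_block(entries):
    block = b""
    for i, (inode, name) in enumerate(entries):
        last = i == len(entries) - 1
        # The last record stretches to the end of the block.
        rec_len = BLOCK_SIZE - len(block) if last else min_len(name)
        raw = HEADER.pack(inode, rec_len, len(name), 1) + name
        block += raw.ljust(rec_len, b"\0")
    return bytearray(block)

def live_entries(block):
    """Walk the block the way a lookup would, hopping by rec_len."""
    names, off = [], 0
    while off < len(block):
        inode, rec_len, name_len, _ = HEADER.unpack_from(block, off)
        if inode:
            names.append(bytes(block[off + 8:off + 8 + name_len]))
        off += rec_len
    return names

def delete_entry(block, target):
    """Remove `target` by growing the previous record over it."""
    prev, off = None, 0
    while off < len(block):
        inode, rec_len, name_len, _ = HEADER.unpack_from(block, off)
        if bytes(block[off + 8:off + 8 + name_len]) == target:
            if prev is None:
                struct.pack_into("<I", block, off, 0)   # first record: clear the inode field
            else:
                prev_len = struct.unpack_from("<H", block, prev + 4)[0]
                struct.pack_into("<H", block, prev + 4, prev_len + rec_len)
            return
        prev, off = off, off + rec_len
    raise FileNotFoundError(target)

block = build_block([(12, b"notes.txt"), (13, b"secrets.bin"), (14, b"todo.md")])
delete_entry(block, b"secrets.bin")
print(live_entries(block))           # a normal walk skips the removed name
print(b"secrets.bin" in block)       # the raw block still contains it
```

A normal directory walk no longer sees the name because the previous record's length now jumps over it, while a raw scan of the block still finds the old bytes.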
A Concrete Walk-Through
Let me make this tangible. Say you have an ext4 filesystem on /dev/sdb1, mounted at /mnt/data, and you create and delete a 1 MB file:
$ sudo dd if=/dev/urandom of=/mnt/data/secrets.bin bs=1M count=1
$ ls -i /mnt/data/secrets.bin
131073 /mnt/data/secrets.bin
$ sudo debugfs -R "stat <131073>" /dev/sdb1
Inode: 131073 Type: regular Mode: 0644 Flags: 0x80000
Generation: 2938471231 Version: 0x00000000
User: 0 Group: 0 Size: 1048576
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 2048
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x66122a4a -- Sat Apr 6 12:01:46 2026
atime: 0x66122a4a -- Sat Apr 6 12:01:46 2026
mtime: 0x66122a4a -- Sat Apr 6 12:01:46 2026
Size of extra inode fields: 32
EXTENTS:
(ROOT/0):34816-35071
The inode has one extent pointing at blocks 34816 through 35071. Now delete the file and inspect again:
$ rm /mnt/data/secrets.bin
$ sync
$ sudo debugfs -R "stat <131073>" /dev/sdb1
Inode: 131073 Type: regular Mode: 0644 Flags: 0x80000
Generation: 2938471231 Version: 0x00000000
User: 0 Group: 0 Size: 1048576
File ACL: 0 Directory ACL: 0
Links: 0 Blockcount: 2048
Fragment: Address: 0 Number: 0 Size: 0
dtime: 0x66122a82 -- Sat Apr 6 12:02:42 2026
...
EXTENTS:
(ROOT/0):34816-35071
Links is now 0. dtime is set. The extents are still there. The data blocks between 34816 and 35071 still contain the original random bytes, even though the block bitmap now says they are free. That is why tools like extundelete work: they read deleted inodes and reconstruct files from intact extent trees.
The Extent Tree in Detail
ext4 stores file layout as a tree of extents rather than the older array of direct and indirect block pointers used by ext2 and ext3. An extent is a triple: logical block offset, physical block number, length. A single extent can describe up to 32768 contiguous blocks, which is 128 MiB at a 4 KiB block size. Most files on a healthy filesystem fit in the four extents that can be stored inline in the inode itself, no tree walk required.
The on-disk format is worth knowing, because recovery tools work directly with it. An ext4 inode is 256 bytes by default and contains a 60-byte block addressing area. For extents, the first 12 bytes are an ext4_extent_header:
struct ext4_extent_header {
__le16 eh_magic; // 0xF30A
__le16 eh_entries; // number of valid entries
__le16 eh_max; // capacity
__le16 eh_depth; // 0 = leaf, >0 = index
__le32 eh_generation;
};
After the header, up to four 12-byte entries follow. If depth is zero, each entry is an ext4_extent (logical offset, length, upper 16 bits of physical, lower 32 bits of physical). If depth is greater than zero, each entry is an ext4_extent_idx that points at a deeper block, which itself is a fresh header plus entries. This forms a small B+ tree whose depth almost never exceeds three levels.
Here is what makes this important for recovery. As long as an intact copy of the inode survives, whether in the inode table or as an older copy in the journal, you can walk the tree exactly as the kernel would and read every data block the file used to occupy. This is how extundelete works: it looks for inodes with dtime != 0, parses their extent trees, and pulls out the data before the blocks are overwritten.
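Decoding the inline extent area is mostly a matter of unpacking little-endian fields. The following is a minimal sketch that handles only a depth-zero (leaf) header stored in the inode's 60-byte block area; the function name is hypothetical, the sample bytes are hand-built rather than read from a disk, and uninitialized extents (which encode their length differently) are not handled.

```python
import struct

EXT4_EXT_MAGIC = 0xF30A
HEADER = struct.Struct("<HHHHI")   # eh_magic, eh_entries, eh_max, eh_depth, eh_generation
EXTENT = struct.Struct("<IHHI")    # logical block, length, physical high 16, physical low 32

def parse_inline_extents(i_block):
    """Return (logical_start, first_physical, last_physical) for each leaf extent."""
    magic, entries, _max, depth, _generation = HEADER.unpack_from(i_block, 0)
    if magic != EXT4_EXT_MAGIC:
        raise ValueError("not an extent header")
    if depth != 0:
        raise NotImplementedError("index nodes require reading the child block")
    runs = []
    for i in range(entries):
        offset = HEADER.size + i * EXTENT.size
        logical, length, hi, lo = EXTENT.unpack_from(i_block, offset)
        first = (hi << 32) | lo            # 48-bit physical block number
        runs.append((logical, first, first + length - 1))
    return runs

# Hand-built example: one extent of 256 blocks starting at physical block 34816.
i_block = HEADER.pack(EXT4_EXT_MAGIC, 1, 4, 0, 0) + EXTENT.pack(0, 256, 0, 34816)
i_block = i_block.ljust(60, b"\0")
print(parse_inline_extents(i_block))
```

The sample describes the same 34816-35071 run used in the walk-through; a real tool would read the 60 bytes from the inode table or a journal copy instead of constructing them.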
Older ext3 filesystems used indirect block pointers instead. An inode stored 12 direct pointers, one indirect (pointing at a block of pointers), one double-indirect, and one triple-indirect. When a file was deleted, ext3 actively zeroed the indirect blocks as part of truncation to avoid leaking the block list. This made recovery on ext3 much harder than on ext4. Switching from ext3 to ext4 accidentally made undelete easier, a side effect the authors did not intend.
The Journal: The Double-Edged Sword
ext4 uses JBD2 (Journaling Block Device 2) to protect metadata integrity. Every metadata change is first written to a circular log, then flushed to its final location. If a crash happens mid-transaction, the journal is replayed on the next mount, ensuring the filesystem is never left in an inconsistent state.
For deletion, this means the metadata changes (directory entry removal, inode update, bitmap updates) are batched into a single atomic transaction. Either the whole deletion happens or none of it does. This is good for crash safety. It is mixed news for recovery.
The journal buys you something useful: recent metadata changes, including the original inode state before deletion, may still be sitting in the journal. Tools like ext4magic can parse the journal and recover recently deleted files by extracting the pre-deletion inode image. The catch is that ext4 by default uses data=ordered mode, which journals metadata only. File content is not written through the journal, only flushed to disk before the metadata that references it. You get a consistent filesystem, not time travel for file contents.
If you really want the journal to protect file data, you can mount with data=journal. The tradeoff is roughly a 2x write slowdown, because every data block is written twice (once to the journal, once to its final home). Most systems stick with ordered and accept that deletion is generally one-way.
The other role the journal plays in forensics is timing. Journal transactions are sequential and contain timestamps. Even after the filesystem has been heavily modified, the journal can give you a rough reconstruction of what happened and when. This is why corporate incident responders often freeze the journal alongside the raw disk during acquisition.
Debugging a Live Filesystem
One reason I keep reaching for ext4 examples is that debugfs gives you a window into a live filesystem that no other mainstream filesystem offers. You can watch deletion happen step by step. Open two terminals on a spare ext4 filesystem mounted on /mnt/scratch:
# terminal 1: create and watch
$ dd if=/dev/urandom of=/mnt/scratch/target.bin bs=1M count=4
$ INO=$(stat -c '%i' /mnt/scratch/target.bin)
$ BLK=$(sudo debugfs -R "stat <$INO>" /dev/vdb1 2>/dev/null | awk '/EXTENTS/{getline; print}')
$ echo "inode=$INO extent=$BLK"
# terminal 2: dump a data block (substitute the first physical block shown in $BLK for 34816)
$ sudo debugfs -R "bdump 34816" /dev/vdb1 > before.bin
# terminal 1: delete and sync
$ rm /mnt/scratch/target.bin
$ sync
# terminal 2: dump the same block again
$ sudo debugfs -R "bdump 34816" /dev/vdb1 > after.bin
$ diff -q before.bin after.bin
diff prints nothing: the files are identical. The block contains the original random bytes. Only the allocation status has changed. Run debugfs -R "testb 34816" /dev/vdb1 and you will see "Block 34816 not in use", but the bytes are right there, one direct read away.
This is the mental model you want when reasoning about deletion. The filesystem is a reservation system over an underlying store. Deletion is a reservation cancellation, not a destruction of the underlying bytes. The only reason your drive does not retain every file you ever wrote is that later reservations overwrite the same space.
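The reservation model fits in a few lines of code. This is a deliberately simplified sketch: the ToyVolume class is hypothetical, has no inodes, directories, or journal, and always allocates the lowest free block, which is not how ext4's locality-aware allocator behaves.

```python
class ToyVolume:
    """A bitmap plus a flat byte store."""

    def __init__(self, blocks, block_size=16):
        self.block_size = block_size
        self.in_use = [False] * blocks
        self.store = bytearray(blocks * block_size)

    def write_file(self, data):
        """Reserve the lowest free blocks and copy the data into them."""
        needed = -(-len(data) // self.block_size)      # ceiling division
        free = [i for i, used in enumerate(self.in_use) if not used][:needed]
        if len(free) < needed:
            raise OSError("no space left")
        for n, blk in enumerate(free):
            chunk = data[n * self.block_size:(n + 1) * self.block_size]
            start = blk * self.block_size
            self.store[start:start + len(chunk)] = chunk
            self.in_use[blk] = True
        return free

    def delete(self, blocks):
        """Cancel the reservation; the stored bytes are not touched."""
        for blk in blocks:
            self.in_use[blk] = False

    def raw_read(self, blk):
        start = blk * self.block_size
        return bytes(self.store[start:start + self.block_size])

vol = ToyVolume(blocks=8)
blocks = vol.write_file(b"holiday photos!!")   # exactly one 16-byte block
vol.delete(blocks)
after_delete = vol.raw_read(blocks[0])          # old bytes, still readable
vol.write_file(b"browser cache...")             # a later write reuses the block
after_reuse = vol.raw_read(blocks[0])           # now the old bytes are gone
print(after_delete, after_reuse)
```

Deleting changes only the bitmap; the old content disappears when a later write happens to land on the same block.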
NTFS and the Master File Table
NTFS takes a different shape but lands in the same place. Instead of separate inode tables and data blocks, NTFS stores files as records in a single structure called the Master File Table, or MFT. Each MFT record is 1 KiB by default and contains a set of attributes. Small files (a few hundred bytes) are stored entirely inside the MFT record, a feature called resident data. Larger files have their data in ordinary disk clusters, with the MFT record holding a compressed list of cluster runs (offset plus length pairs).
When you delete a file on NTFS:
- NTFS marks the MFT record as unused by clearing a flag in the record header (0x0001 is the "in use" flag).
- The directory index (a B+ tree stored in the parent directory's $INDEX_ROOT and $INDEX_ALLOCATION attributes) removes the file's entry. The entry is not zeroed; it is simply marked as deleted, leaving it readable if you parse the raw index.
- The clusters the file used are marked free in the $Bitmap metafile.
- The $LogFile (NTFS's journal) records the operation so that recovery after a crash can replay it.
As with ext4, the file's actual bytes are not touched. The MFT record still contains the filename, sizes, timestamps, and cluster runs. Tools like ntfsundelete walk the MFT looking for records with the in-use flag cleared, then reconstruct the file from the cluster runs. The main reason recovery fails on NTFS is MFT record reuse. NTFS aggressively reuses deleted MFT records, so by the time you notice the file is gone, its record may already belong to something else.
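The in-use check that such tools perform can be sketched as follows. It assumes the commonly documented FILE record layout, in which a 16-bit flags field sits at offset 0x16 of the record header; the function names are hypothetical and the records are synthetic, not taken from a real volume.

```python
import struct

MFT_RECORD_IN_USE = 0x0001
MFT_RECORD_IS_DIRECTORY = 0x0002
FLAGS_OFFSET = 0x16                  # assumed position of the flags field

def classify_record(record):
    """Return 'live', 'deleted', or 'not a FILE record' for one raw MFT record."""
    if record[:4] != b"FILE":
        return "not a FILE record"
    (flags,) = struct.unpack_from("<H", record, FLAGS_OFFSET)
    return "live" if flags & MFT_RECORD_IN_USE else "deleted"

def fake_record(flags):
    """Build a 1 KiB record with only the signature and flags filled in."""
    record = bytearray(1024)
    record[:4] = b"FILE"
    struct.pack_into("<H", record, FLAGS_OFFSET, flags)
    return bytes(record)

print(classify_record(fake_record(MFT_RECORD_IN_USE)))        # live
print(classify_record(fake_record(0)))                        # deleted
print(classify_record(fake_record(MFT_RECORD_IS_DIRECTORY)))  # deleted directory
print(classify_record(bytes(1024)))                           # not a FILE record
```

A real scanner would go on to parse the record's attributes and cluster runs, and would also have to apply the update sequence fixups that this sketch skips.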
Windows complicates the picture with the Recycle Bin, which is not really deletion at all. When Explorer "deletes" a file, it moves it into a per-user $Recycle.Bin\<SID> folder, renaming it to something like $Rxxx.ext and writing a small sidecar $Ixxx.ext file containing the original path. The actual MFT record is rewritten with the new name and location. The real DeleteFile call happens only when you empty the bin or use Shift+Delete. Understanding this layer is often enough to recover from user error without any forensics at all.
The Free List and Why Reuse Is Fast
Filesystems maintain some form of free list: the data structure they use to find unused blocks quickly when a new file needs storage. In ext4, this is the block bitmap within each block group, plus hints like the group's free block count and the preferred allocation window. In NTFS, it is the $Bitmap metafile. In XFS, it is a pair of B+ trees, one sorted by block number and one sorted by extent size.
Allocators tend to be locality-aware. ext4 prefers to place a file's data near its inode, in the same block group if possible. New allocations start from the group's last allocation point, advancing forward. This means that after deleting a file, the space it freed is rarely the first place the next allocation looks. That is good news for recovery, because deleted blocks linger longer than you might expect on a filesystem with plenty of free space.
The bad news is that modern operating systems do a lot of background writing. Package managers update caches, browsers write session data, log rotators compress archives, desktop environments thumbnail images. On a typical Linux workstation doing nothing obvious, there can be hundreds of megabytes of writes per hour. If you want to recover a specific file, you should unmount the filesystem as soon as possible, then work on a forensic image:
$ sudo umount /mnt/data
$ sudo dd if=/dev/sdb1 of=/tmp/sdb1.img bs=4M status=progress
$ sudo losetup -r /dev/loop0 /tmp/sdb1.img
$ sudo extundelete /dev/loop0 --restore-all --output-dir /tmp/recovered
The -r flag on losetup makes the loop device read-only, so recovery tools cannot accidentally write back to the image.
File Carving: When Metadata Is Gone
All the techniques above rely on metadata: the inode or the MFT record pointing at the data blocks. If the metadata is overwritten but the data blocks themselves are not, recovery by metadata fails. This is the world of file carving.
Carving ignores filesystem structures entirely. It scans the raw disk image byte by byte looking for file headers, then reconstructs files by following known format rules. Every common file format has a recognisable magic number at the start. JPEG files begin with FF D8 FF. PNG files begin with 89 50 4E 47 0D 0A 1A 0A. ZIP archives begin with 50 4B 03 04. PDF files begin with %PDF-. A carver reads the image looking for these signatures, then decides how much of what follows belongs to the file.
For formats with a defined end marker (JPEG's FF D9, PNG's IEND chunk, ZIP's central directory locator), the carver can determine file length exactly. For formats without one, or for files fragmented across non-contiguous blocks, the carver has to guess. This is where tools differ. A simple carver just reads forward from a header until it hits the next header, producing files that may be corrupted at the end. A smarter carver like photorec uses format-specific validators that understand internal structure (JPEG DQT tables, DCT coefficient boundaries, ZIP local file header chaining) to find file boundaries and reject garbage.
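The simplest form of header-to-footer carving looks like this. It is a naive sketch, not how photorec works: it assumes contiguous, unfragmented JPEGs, takes the first FF D9 after a header as the end (real JPEGs can contain that byte pair earlier, for example in an embedded thumbnail), and runs on a synthetic in-memory image. The function name and max_size limit are illustrative.

```python
JPEG_START = b"\xff\xd8\xff"
JPEG_END = b"\xff\xd9"

def carve_jpegs(image, max_size=20 * 1024 * 1024):
    """Return byte strings that start with a JPEG header and end at the next footer."""
    found = []
    pos = 0
    while True:
        start = image.find(JPEG_START, pos)
        if start == -1:
            break
        end = image.find(JPEG_END, start + len(JPEG_START), start + max_size)
        if end == -1:
            pos = start + 1          # header with no footer in range: skip it
            continue
        found.append(image[start:end + len(JPEG_END)])
        pos = end + len(JPEG_END)
    return found

# Synthetic "disk image": two fake JPEGs separated by unrelated bytes.
fake_jpeg = JPEG_START + b"\xe0" + b"pixel data" + JPEG_END
image = b"\x00" * 100 + fake_jpeg + b"unrelated bytes" + fake_jpeg + b"\x00" * 50
carved = carve_jpegs(image)
print(len(carved), carved[0] == fake_jpeg)
```

Everything a smarter carver adds (format validation, fragment handling) exists to correct the cases where this header-to-footer assumption breaks.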
Carving is slow. Running photorec on a 1 TB disk image takes hours even on a fast machine. It is also bad at fragmented files. If your 200 MB video file was split into five non-contiguous pieces by the allocator, a naive carver will pull out the first 40 MB (until the next file's data appears) and stop. Modern carvers can sometimes stitch fragments together using content analysis, but results are much worse for fragmented files than for files that happened to land in one contiguous run.
A realistic carving session looks like:
$ sudo photorec /dev/sdb1
PhotoRec 7.1, Data Recovery Utility, July 2019
Christophe GRENIER <grenier@cgsecurity.org>
https://www.cgsecurity.org
Disk /dev/sdb1 - 500 GB / 465 GiB (RO)
Partition Start End Size in sectors
P ext4 0 0 1 60801 80 63 976773167
PhotoRec will try to locate:
[X] tar tar archive
[X] jpg JPG picture
[X] mp4 MP4
[X] pdf Portable Document Format
[X] zip zip archive
[ ] ...
PhotoRec will write out every candidate file it finds into a numbered directory, one file per recovered item, with generic names like f0001234.jpg. You then have to sort through them manually. For family photos this is often acceptable. For reconstructing a deleted git repository, it is usually hopeless.
The SSD Problem: TRIM and the FTL
Everything above describes how spinning disks and classic filesystems behave. Modern SSDs turn the whole picture upside down.
An SSD does not store data the way the operating system sees it. Internally, an SSD is a collection of NAND flash chips organised into pages (typically 4 KiB to 16 KiB) and blocks (typically 128 to 512 pages). The key property of NAND flash is that you can write a page, you can read a page, but you cannot rewrite a page. To change a page's contents, the entire block containing it must be erased first. And blocks wear out: each block can only be erased a few thousand times (for TLC NAND) or a few hundred (for QLC) before failing.
To hide this from the operating system, every SSD runs a Flash Translation Layer (FTL) on its controller. The FTL maintains a mapping from logical block addresses (what the OS asks for) to physical pages on the NAND. When the OS writes LBA 1234, the FTL picks a fresh erased page somewhere on the flash, writes the new data there, and updates the mapping. The old page becomes "invalid" but stays physically present until garbage collection erases the block.
Now consider what happens when the OS deletes a file. The filesystem clears the bits in its block bitmap. The corresponding LBAs are now free from the filesystem's point of view. But the SSD has no way to know that. From the FTL's perspective, those LBAs are still mapped to valid physical pages. During garbage collection, the FTL will dutifully copy that data to new pages to preserve it, wasting write bandwidth and NAND endurance.
The fix is the TRIM command, part of the ATA standard, and its SCSI equivalent UNMAP. When the filesystem frees blocks, it sends a TRIM command telling the SSD "these LBAs no longer contain meaningful data." The FTL can then mark the corresponding physical pages as invalid immediately, without having to preserve them. This dramatically reduces write amplification and extends the drive's life.
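The interaction between remapping and TRIM can be modelled in a few lines. This is a toy sketch, not a real controller: the ToyFTL class is hypothetical, has no garbage collection or wear levelling, and models a drive that returns zeros for unmapped LBAs.

```python
class ToyFTL:
    """Maps logical block addresses to physical pages; pages are write-once."""

    def __init__(self, pages):
        self.pages = [None] * pages      # physical NAND pages (None = erased)
        self.valid = [False] * pages     # does some LBA still point at this page?
        self.mapping = {}                # LBA -> physical page index
        self.next_free = 0

    def write(self, lba, data):
        if self.next_free >= len(self.pages):
            raise RuntimeError("out of erased pages; a real FTL would garbage-collect")
        old = self.mapping.get(lba)
        if old is not None:
            self.valid[old] = False      # old copy becomes stale, not erased
        page = self.next_free
        self.next_free += 1
        self.pages[page] = data
        self.valid[page] = True
        self.mapping[lba] = page

    def trim(self, lba):
        page = self.mapping.pop(lba, None)
        if page is not None:
            self.valid[page] = False     # nothing left to preserve during GC

    def read(self, lba):
        page = self.mapping.get(lba)
        # Unmapped LBAs read back as zeros in this model.
        return self.pages[page] if page is not None else b"\x00" * 4

ftl = ToyFTL(pages=8)
ftl.write(1234, b"v1!!")
ftl.write(1234, b"v2!!")                 # lands on a new page; page 0 is now stale
ftl.trim(1234)
print(ftl.read(1234), ftl.pages[0], ftl.pages[1])
```

After the TRIM, both versions are still physically present in the page array, but no read through the logical interface can reach them.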
On Linux, TRIM can happen two ways:
- Discard on delete: mount the filesystem with -o discard and TRIMs are issued as part of every deletion. Low latency, slight performance cost on each delete.
- Periodic fstrim: run fstrim on a timer (most distros run it weekly). Free blocks are batched and TRIMmed in bulk. This is the default on systemd-based distros via fstrim.timer.
The forensic consequence is severe. Once TRIM is issued for an LBA, the SSD is free to return zeros or the original data on subsequent reads; which one you get depends on whether the drive advertises "deterministic read zeros after TRIM" (RZAT). Most modern drives do return zeros. On a SATA drive you can check with hdparm -I /dev/sda | grep -i trim (hdparm speaks ATA, so it does not apply to NVMe devices). If the drive reports RZAT, the data is effectively unrecoverable from the operating system's side the moment TRIM lands. Not because the NAND has been erased (it probably has not yet, physically), but because the FTL refuses to return it. The data will be physically erased at the next garbage collection, after which it is gone for good.
This is why undelete tools largely do not work on modern SSDs. By the time you run extundelete, the filesystem has freed the blocks, TRIM has been issued, the drive reports them as zeros, and extundelete gets zero-filled data where the file used to be. The inode metadata might still be intact, but the content it points at is gone.
If you have critical data on an SSD and you think it may have been deleted, the first thing to do is stop writing to the drive. Do not run fstrim. Do not mount with discard. Unmount the filesystem. Power down if necessary. Then image the drive. If the operating system has not issued TRIM yet (perhaps because the filesystem is mounted without discard and fstrim has not run since the deletion), the data may still be readable. This window is short on any modern system, measured in minutes to days.
Secure Deletion: What shred Actually Does
If your worry is the opposite, preventing recovery rather than enabling it, then the usual advice is shred or the -P flag on rm (which does not exist on Linux; BSD rm has it as an alias for shred-like behaviour).
shred works by overwriting the file's contents in place, multiple times, with different patterns, then optionally unlinking the file. The classic rationale was that on certain magnetic media, remnants of overwritten data could be detected using specialised equipment, so multiple overwrite passes were needed to be safe. The academic basis for this (Gutmann's 1996 paper on secure deletion) targeted the MFM and RLL encodings of older drives; Gutmann himself noted that for PRML drives a few passes of random data are the most that makes sense. On modern PMR and SMR hard drives, a single overwrite with random data is more than enough.
But here is the catch. shred overwrites the logical blocks the filesystem says belong to the file. It has no control over what physically happens on the disk. On ext4 with data=ordered, this usually works: the new writes go to the same blocks as the old data, overwriting it in place. On filesystems that do copy-on-write (Btrfs, ZFS, APFS), every write goes to a fresh block, leaving the old data intact. On any SSD, the FTL remaps writes to fresh physical pages. The old pages are marked invalid and will be garbage-collected eventually, but shred has no way to force that erase. The data is still physically on the NAND until the GC decides it is time.
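The logical half of what shred does can be sketched as a single-pass overwrite. This is a minimal illustration, not a replacement for shred: the function name is hypothetical, it makes one pass only, and it is subject to exactly the limits described above, since it cannot control where a copy-on-write filesystem or an SSD's FTL physically places the new bytes.

```python
import os
import tempfile

def overwrite_and_unlink(path, chunk_size=1024 * 1024):
    """Overwrite a file's logical contents once with random bytes, then remove its name."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:          # r+b: write in place, do not truncate
        remaining = size
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(os.urandom(n))
            remaining -= n
        f.flush()
        os.fsync(f.fileno())              # push the overwrite out of the page cache
    os.unlink(path)

# Demonstration on a throwaway temporary file.
fd, path = tempfile.mkstemp()
os.write(fd, b"secret" * 1000)
os.close(fd)
overwrite_and_unlink(path)
print(os.path.exists(path))
```

Opening with "r+b" matters: truncating and rewriting the file would let the filesystem allocate different blocks and leave the old ones untouched.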
For SSDs, the correct approach is the drive's own secure erase mechanism. ATA defines a SECURITY ERASE UNIT command that tells the drive to internally wipe all NAND. Most modern SSDs implement this as a "crypto erase": the drive encrypts all data with an internal key, and the erase command simply throws away the key. After that, the old ciphertext is unreadable. You can run this on Linux with hdparm --security-erase (for SATA) or nvme format --ses=1 (for NVMe). Both leave the drive empty and unrecoverable.
For whole-disk security that does not depend on trusting the drive firmware, the only reliable option is full-disk encryption from the start. If your drive is encrypted with LUKS and you destroy the key, the entire drive becomes indistinguishable from random noise. There is no file to recover because there is no file to begin with.
Recovery Tools in Practice
Let me describe what each of the common tools actually does so you know when to use which.
extundelete walks the ext3/ext4 journal looking for inode structures that have a non-zero dtime and reconstructs the filesystem state at a given historical moment. It can recover files whose inodes are still intact and whose data blocks have not been reused. It fails on files whose extent trees were destroyed.
ext4magic is similar but more aggressive. It can use multiple journal snapshots to find files that extundelete misses, including files deleted some time before the most recent mount.
testdisk operates at the partition level. It repairs corrupted partition tables, recovers lost partitions, and can rebuild MFT and FAT structures. It is the tool of choice when the problem is "my disk won't mount" rather than "I deleted one file."
photorec is testdisk's companion file carver, shipped in the same package. It scans raw blocks looking for file headers regardless of filesystem. It works on any filesystem and even on raw unformatted media. Use it when metadata is hopelessly damaged.
ntfsundelete parses the NTFS MFT for records marked deleted and reconstructs files from intact cluster runs. It is fast and accurate when it works, but NTFS aggressively reuses MFT records, limiting its effectiveness.
foremost and scalpel are older file carvers. They are still useful for specific formats but have been largely superseded by photorec.
bulk_extractor is a forensic scanner that looks for patterns rather than whole files: email addresses, credit card numbers, URLs, cryptographic keys. It is what investigators reach for when they want to triage a disk image quickly.
A realistic recovery attempt on Linux looks like:
# 1. Stop writes immediately
$ sudo umount /dev/sdb1
# 2. Image the disk
$ sudo dd if=/dev/sdb1 of=sdb1.img bs=4M status=progress conv=noerror,sync
# 3. Verify the image and work on a copy
$ sha256sum sdb1.img > sdb1.img.sha256
$ cp sdb1.img sdb1-work.img
# 4. Try metadata-based recovery first
$ mkdir recovered
$ extundelete sdb1-work.img --restore-all --output-dir recovered/
# 5. Fall back to carving if that fails
$ photorec sdb1-work.img
The Filesystem Slack Space
Every file on disk occupies a whole number of allocation units. If your filesystem uses 4 KiB blocks and your file is 5 KiB, it occupies two blocks. The first block is fully used. The second block contains the last 1 KiB of the file plus 3 KiB of whatever was there before. That 3 KiB is called file slack, and it is a major source of forensically interesting data.
When the operating system writes a new file into existing free blocks, only the exact file bytes are written. The filesystem does not zero-pad the slack. So if a 5 KiB file is now occupying a block that previously held a much larger deleted file, the last 3 KiB of the block still contains the end of the old file. Tools like icat and blkcat from The Sleuth Kit can dump this slack and carve it for recognisable content.
NTFS has an analogous concept called RAM slack (the space between the end of a file and the end of its last disk sector) and drive slack (the space from the end of the last sector to the end of the last cluster). Windows zero-fills RAM slack to the end of the sector but leaves drive slack alone, because the sector is the unit at which writes go to the physical medium.
Slack is especially interesting on filesystems with large blocks. At 64 KiB blocks, a 1 KiB log message can leave 63 KiB of previous-file data behind. The Linux kernel's default is 4 KiB, matching the page size, so slack is bounded to under 4 KiB per file. Still useful, but less dramatic than it used to be.
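The slack figures in this section follow from simple arithmetic, shown here as a small sketch. The function name is hypothetical, and the calculation only gives the size of the slack region; whether that region holds old data or zeros depends on the filesystem and operating system.

```python
def slack_bytes(file_size, block_size=4096):
    """Bytes between the end of the file and the end of its last block."""
    if file_size == 0:
        return 0
    blocks = -(-file_size // block_size)      # ceiling division
    return blocks * block_size - file_size

print(slack_bytes(5 * 1024))                     # the 5 KiB file on 4 KiB blocks
print(slack_bytes(1024, block_size=64 * 1024))   # the 1 KiB file on 64 KiB blocks
print(slack_bytes(8192))                         # a file that ends on a block boundary
```

The first two calls reproduce the 3 KiB and 63 KiB figures from the text; the third shows that a file ending exactly on a block boundary leaves no slack at all.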
Crash Consistency and Orphan Inodes
Sometimes a file disappears not because you deleted it but because the system crashed at the wrong moment. Linux filesystems handle this through a concept called orphaned inodes.
When a file is open while its last link is being removed, ext4 keeps the inode alive (for the benefit of the open descriptor) but adds it to the superblock's orphan list. This is an on-disk linked list of inodes that should be deleted when the last reference drops. If the system crashes before that happens, the next mount sees the orphan list, and e2fsck (or the automatic journal replay) walks through it cleaning up.
During normal operation, orphaned inodes are invisible: they have no directory entries, so no path leads to them, and only the open file descriptors keep them alive. But they are allocated, their data blocks are still valid, and their content is intact. Tools aware of the orphan list can sometimes recover files from recent crashes by walking it before the next cleanup.
The key lesson here is that filesystems are not simple. They have multiple concurrent states, multiple metadata structures referring to the same data, and multiple recovery paths. Understanding which state your filesystem is in after an unexpected event is often the difference between recovering everything and recovering nothing.
Copy-on-Write Filesystems Change Everything
Btrfs, ZFS, and APFS change the deletion story fundamentally. They do not overwrite data in place. Every modification writes new blocks, and the old blocks stay referenced until no snapshot, no clone, and no transaction rollback mechanism needs them.
On ZFS, deleting a file merely decrements reference counts. The data blocks remain pinned as long as any snapshot that was taken before the deletion still exists. This means that if you had a snapshot from an hour before the deletion, recovery is trivial: copy the file out of the hidden .zfs/snapshot/<name>/ directory at the root of the dataset's mountpoint, or use zfs rollback if you are willing to discard everything written since the snapshot. No forensic tooling needed. The data is simply there.
Btrfs works the same way. Subvolumes can be snapshotted cheaply, and some distributions build on that: openSUSE, for example, configures Snapper to take snapshots before and after package operations. Where a snapshot covers the subvolume that held the file, an accidental rm is a short restore away.
APFS on macOS and iOS takes this further. Time Machine on modern Apple hardware uses APFS snapshots to capture the filesystem state hourly. Files deleted between snapshots are recoverable from the snapshot chain without any dedicated tools.
The tradeoff is that copy-on-write filesystems are harder to securely delete from. A shred followed by rm is not enough: the overwrite lands in new blocks rather than on top of the old ones, and old versions can linger in snapshots indefinitely. You also have to delete every snapshot that might contain the file. ZFS has zfs destroy -r for this. Btrfs has btrfs subvolume delete. On macOS, tmutil can list and delete local APFS snapshots, but the system otherwise creates and expires them on its own schedule.
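The reference-counting idea in this section can be shown with a toy model. This is a sketch using bash arrays, not how ZFS, Btrfs, or APFS are implemented; the names content, refcount, add_ref, and drop_ref are invented for the example.

```shell
# Toy model: one data block, referenced by the live filesystem and by snapshots.
content=()     # block number -> data
refcount=()    # block number -> number of references

content[0]="holiday photos"
refcount[0]=1                       # the live filesystem's reference

add_ref() {    # a snapshot pins the block
    refcount[$1]=$(( ${refcount[$1]} + 1 ))
}

drop_ref() {   # rm in the live tree, or destroying a snapshot
    refcount[$1]=$(( ${refcount[$1]} - 1 ))
    if [ "${refcount[$1]}" -eq 0 ]; then
        unset "content[$1]"         # only now is the block actually freed
    fi
}

add_ref 0                           # take a snapshot
drop_ref 0                          # rm the file in the live filesystem
after_rm=${content[0]:-}            # still "holiday photos"
drop_ref 0                          # destroy the snapshot
after_destroy=${content[0]:-}       # empty: the block is gone
echo "after rm: '$after_rm', after snapshot destroy: '$after_destroy'"
```

The same model explains the secure-deletion problem: the block only goes away when the last reference does, so every snapshot holding the file has to be destroyed first.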
Physical Layer: Why HDDs and SSDs Diverge
It is worth spending a moment on why hard drives and solid-state drives produce such different deletion behaviour, because the reasons are not obvious until you look at the physics.
A hard drive stores each bit as a tiny magnetised region on a spinning platter. The read head senses the field direction and decodes it back into bits. Writing is destructive: the write head sets the magnetisation of whichever region passes underneath it. Crucially, rewriting the same region is cheap. The head passes over it, applies its write field, and the old state is replaced. The magnetic medium has no practical wear-out from ordinary writes; a drive's mechanics and electronics tend to fail long before rewriting a track becomes a problem.
This is why HDDs behave simply. The filesystem says "write LBA 1234 with this content", the drive maps LBA 1234 to a specific track and sector, and the write head overwrites that sector. If the filesystem later says "LBA 1234 is free, reuse it for this new data", the same physical sector is overwritten again. Deletion, followed by reuse, physically destroys the old bytes at the platter level, one sector at a time.
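The in-place behaviour can be imitated with dd on an ordinary file. This is a sketch only: a 16-sector scratch file stands in for the disk, sector 10 stands in for LBA 1234, and the helper names write_sector and read_sector are made up. It assumes a dd that supports conv=notrunc,sync, which GNU and BSD dd both do.

```shell
img=$(mktemp)
dd if=/dev/zero of="$img" bs=512 count=16 2>/dev/null   # 16 empty sectors

write_sector() {   # write_sector NUMBER TEXT: overwrite one whole sector in place
    printf '%s' "$2" | dd of="$img" bs=512 seek="$1" conv=notrunc,sync 2>/dev/null
}

read_sector() {    # read_sector NUMBER: print the sector without its zero padding
    dd if="$img" bs=512 skip="$1" count=1 2>/dev/null | tr -d '\0'
}

write_sector 10 "old secret"
write_sector 10 "new"          # same sector, written again
read_sector 10; echo

if grep -aq 'old secret' "$img"; then
    echo "old bytes still present"
else
    echo "old bytes destroyed by the overwrite"
fi
```

conv=sync pads the short input to a full 512-byte sector, so the second write replaces every byte of the first, which is the behaviour the paragraph above attributes to a platter.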
NAND flash behaves almost the opposite way. A flash cell stores charge on a floating gate. Writing (or "programming") raises the cell's threshold voltage to encode a value. The problem is that you can only program in one direction: from erased to written. Programming happens a page at a time, but to go back you must perform an erase operation, which resets an entire block of pages. Erase is disruptive, requires a high-voltage pulse through the tunnel oxide, and physically ages the cell. For the dense cells used in consumer drives, the oxide degrades after a few hundred to a few thousand erase cycles and the cell becomes unreliable.
This constraint shapes every aspect of SSD design. Since you cannot overwrite in place, the controller maintains a pool of pre-erased blocks and writes new data to them. The old location becomes garbage. When the pool runs low, garbage collection picks blocks that are mostly garbage, copies the still-valid pages to fresh locations, and erases the source block. Wear-levelling algorithms spread writes evenly so no single block runs out of erase cycles while others sit unused.
The direct consequence is that the OS has lost control of which physical cells hold which data. When you "overwrite LBA 1234" on an SSD, the controller allocates a new physical page, writes to it, and marks the old page as invalid. The old data is still sitting on the NAND until the garbage collector decides to sweep that block. You cannot tell the drive to destroy a specific physical location because the FTL hides that location from you.
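A toy flash translation layer makes the remapping concrete. The sketch below uses bash arrays and invented names (nand, state, l2p, ftl_write, ftl_read, ftl_gc). It is a deliberate simplification: a real controller erases whole blocks and relocates valid pages first, while this model just wipes pages marked invalid.

```shell
nand=()        # physical pages, indexed by page number
state=()       # per physical page: valid, invalid, or erased
l2p=()         # logical block address -> physical page number
next_page=0

ftl_write() {  # ftl_write LBA DATA: always lands on a fresh page
    local lba=$1 data=$2
    local old=${l2p[$lba]:-}
    nand[$next_page]=$data
    state[$next_page]=valid
    if [ -n "$old" ]; then
        state[$old]=invalid        # old copy becomes garbage, but is not wiped
    fi
    l2p[$lba]=$next_page
    next_page=$(( next_page + 1 ))
}

ftl_read() {   # ftl_read LBA: the host only ever sees the mapped page
    echo "${nand[${l2p[$1]}]}"
}

ftl_gc() {     # simplified garbage collection: wipe every invalid page
    local page
    for page in "${!state[@]}"; do
        if [ "${state[$page]}" = invalid ]; then
            nand[$page]=""
            state[$page]=erased
        fi
    done
}

ftl_write 1234 "old secret"
ftl_write 1234 "new data"
ftl_read 1234
echo "page 0 before GC: '${nand[0]}' (${state[0]})"
ftl_gc
echo "page 0 after GC:  '${nand[0]}' (${state[0]})"
```

Between the second write and the call to ftl_gc, the host reads "new data" from LBA 1234 while "old secret" is still sitting in physical page 0. The host has no command that reaches that page, which is the loss of control described above.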
There is no equivalent of "overwrite" at the NAND level except at the granularity of a whole block, and even then, only the FTL controls when the erase actually happens. The closest you can get to deterministic destruction is the ATA secure erase command, which instructs the drive to erase every block on the NAND, or a crypto erase, which throws away the encryption key so all existing ciphertext becomes meaningless.
This is why a reasonable forensic examiner treats "the data is gone on an SSD" as a statement about what the drive will let you read, not about what is physically present on the chips. Sophisticated attacks that remove the NAND packages and read them directly can sometimes find data the FTL has hidden. For ordinary recovery and ordinary operations, though, the FTL's behaviour is what matters, and the FTL is designed to make your old data unreachable as fast as possible.
Why This Matters in Production
All of this matters when you are running systems in production. A few practical rules fall out of the theory.
Backups are the only real recovery mechanism. Every technique in this article is a last resort. The correct answer to "I deleted a file" is to restore from backup. If you do not have backups, the gap between deletion and reuse is short, especially on busy servers, and recovery success rates are low.
Snapshots are a second layer. ZFS, Btrfs, and LVM snapshots give you point-in-time recovery without needing to run recovery tools at all. They are not backups (a snapshot cannot survive the loss of its underlying storage) but they handle the "I deleted the wrong file" case instantly.
Application logs lie about deletion. When your code "deletes" a temporary file and the log says success, what really happened is unlink returned zero. The file may still exist for any process that has it open, including your own. lsof is your friend when debugging "why is /tmp full when I just cleaned it up" mysteries.
Secure deletion requires planning. If you care about data not being recoverable, encrypt the disk from the beginning, manage keys carefully, and rely on crypto erase at end of life. Trying to retroactively shred files on a copy-on-write filesystem or an SSD is wishful thinking.
TRIM is mostly your friend. It extends SSD life, improves write performance, and reduces the attack surface for accidentally recoverable data. The only reason to turn it off is if you are actively trying to keep deleted data recoverable.
Journal entries persist. The ext4 journal is a small ring buffer, but on a lightly loaded system it can retain a surprising amount of recent metadata history. Analysing it is a standard part of an incident responder's toolkit.
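The "logs lie about deletion" rule can be checked by hand. The sketch below assumes Linux with /proc mounted; the temporary file and descriptor number 3 are arbitrary choices for the demonstration. On a real system you would look under /proc/<pid>/fd/ for the process that holds the file, or run lsof +L1, which lists open files whose link count has dropped below one.

```shell
tmp=$(mktemp)
echo "important data" > "$tmp"
exec 3< "$tmp"                         # this shell now holds the file open
rm "$tmp"                              # unlink returns zero; the name is gone

# The kernel still tracks the open file and tags its old path.
target=$(readlink /proc/self/fd/3)
echo "$target"

# The content can be copied back out through the descriptor's /proc entry.
rescued=$(mktemp)
cat /proc/self/fd/3 > "$rescued"
exec 3<&-                              # only after this can the space be freed
cat "$rescued"
```

The readlink output ends in " (deleted)", which is the marker to look for when /tmp is full after a cleanup: the space comes back only when the holding process closes the descriptor or exits.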
A Worked Recovery Example
To show the whole stack working together, here is a complete scenario. You run a small backup service in Barcelona on a Linux box with a single 2 TB SATA HDD, formatted ext4, mounted at /srv/backups. One morning an automation script runs rm -rf /srv/backups/archive/2025 by mistake, deleting several hundred thousand files before anyone notices. The script runs as root, so no permission failure slows it down.
The right response looks like this.
- Freeze the filesystem immediately. Kill the backup service. Unmount the filesystem. If you cannot unmount because the directory is in use, reboot into a rescue environment. Every second of live use eats into the blocks you might recover.
- Image the whole partition. Use dd or ddrescue to produce a full copy on a different disk. ddrescue is the better choice because it handles bad sectors gracefully and writes a map file so you can resume an interrupted copy. Verify the image's hash and store it somewhere safe. All recovery work happens on a copy.
- Snapshot the journal. Use debugfs -R "logdump -a" to dump the JBD2 journal into a text file. This is where recent inode states live, and it is your best source for metadata that was active when the deletion happened.
- Try metadata-based recovery. Run extundelete against the image with --restore-all. For each orphaned inode it finds, it rebuilds a file from the extent tree. Expect a mix of fully recovered files, partially recovered files (where some blocks have been reused), and complete misses.
- Fall back to carving. For files that extundelete cannot rebuild, run photorec against the same image. Carving produces files without names, but a backup archive has recognisable structure (tar headers, gzip signatures, sqlite3 database magic) that carvers can detect. You will get a pile of candidate files to sort.
- Verify. Check the recovered files for content integrity. Open a few at random. Run tar tvf on archives. Compare SHA-256 hashes against the service's existing manifests if they exist. Anything that verifies clean is worth keeping. Everything else goes into a "maybe" pile for manual triage.
- Document. Write up exactly what happened, which tools were used, how many files were recovered, and what the residual data loss is. This is the difference between a recoverable incident and a catastrophe.
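The carving and verification steps above can be rehearsed on a toy image. This is a sketch of signature carving in general, not of how photorec works internally. It assumes GNU grep (for -a, -b, -o), sha256sum, and a tar that writes ustar-style headers; all file names are invented. The one format fact it relies on is that the "ustar" magic sits 257 bytes into a tar header.

```shell
export LC_ALL=C
work=$(mktemp -d)

# Build a toy image: one 4 KiB block of filler, then a tar archive.
printf 'customer ledger\n' > "$work/ledger.txt"
sha256sum "$work/ledger.txt" | cut -d' ' -f1 > "$work/manifest.sha256"  # hash recorded before the loss
tar -cf "$work/archive.tar" -C "$work" ledger.txt
dd if=/dev/zero of="$work/disk.img" bs=4096 count=1 2>/dev/null
cat "$work/archive.tar" >> "$work/disk.img"
rm "$work/ledger.txt" "$work/archive.tar"     # names gone, bytes still in the image

# Carve: find the magic string, step back 257 bytes to the start of the header.
magic_offset=$(grep -abo 'ustar' "$work/disk.img" | head -n 1 | cut -d: -f1)
header_offset=$(( magic_offset - 257 ))
tail -c "+$(( header_offset + 1 ))" "$work/disk.img" > "$work/carved.tar"

# Extract the carved archive and verify it against the manifest.
mkdir "$work/recovered"
tar -xf "$work/carved.tar" -C "$work/recovered"
expected=$(cat "$work/manifest.sha256")
actual=$(sha256sum "$work/recovered/ledger.txt" | cut -d' ' -f1)
if [ "$expected" = "$actual" ]; then
    echo "ledger.txt: verified"
else
    echo "ledger.txt: goes on the maybe pile"
fi
```

Real carving is messier than this: archives are fragmented, compressed, or partly overwritten, which is why the hash comparison at the end matters more than the carve itself.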
On a typical HDD with a healthy ext4 filesystem and a quick response, you can expect to recover most files deleted within the last hour. The success rate drops sharply after that: every subsequent write is a chance that a deleted block is reused. After a day of continued use, recovery rates on a busy server are typically under 20%. On an SSD with TRIM, the same scenario usually produces zero recoverable files, because the blocks are already logically zeroed by the drive.
Backups would have handled this in minutes. Every recovery tool discussed here is strictly worse than having a backup, and much worse than having a recent snapshot. The reason to know about them is to handle the case where the backups are also broken, which happens more often than anyone wants to admit.
The Short Version
When you run rm, the kernel walks a well-defined path: VFS, filesystem-specific unlink, metadata update, free list adjustment, journal commit. The file's data is not touched. Its metadata is marked deleted but often still intact. On a spinning disk with a quiet system, that state persists for minutes to hours and can be recovered with tools that understand the on-disk format.
On an SSD, the TRIM command pierces this entire layer of abstraction by telling the drive that the blocks are free. The drive can then return zeros on read, and the data is gone long before it would be reused on a classic disk. This is why "undelete" went from routine on HDDs to nearly impossible on modern SSDs.
On copy-on-write filesystems with snapshots, deletion is largely cosmetic until the snapshots are destroyed. Recovery is trivial as long as the right snapshot still exists.
Understanding the layers (syscall, VFS, filesystem, block layer, drive firmware) is what lets you reason about what is actually recoverable in any given situation. The lab for this article lets you step through the ext4 deletion process interactively, watching the inode, bitmaps, data blocks, and journal change state in real time. It is worth playing with before you need to think carefully about a real deletion in anger.