Background
Four years of data that nobody ever cleaned up
Like many fast-growing startups, the platform shipped features quickly in its early years and never built a process to clean up data that was no longer referenced. Every uploaded image stayed in object storage indefinitely, whether or not the application still pointed to it.
After roughly four years the unreferenced data had accumulated to a point where it could no longer be ignored, and by then the platform was serving heavy traffic around the clock, which made the cleanup far harder than it would have been early on.
The problem
The junk and the crown jewels lived in the same place
Analysis of the object storage showed that the large majority of stored objects were orphaned: they had been uploaded at some point but were no longer referenced anywhere in the application or its database. Only a small fraction was live, in-use user media.
The catch: orphaned data and live user media lived in the same place, fully intermixed. There was no clean separation, so a cleanup meant telling the two apart with absolute precision.
And it could not be done quietly. The platform serves traffic continuously, so the cleanup had to run safely against a live production system while users were actively uploading new content.
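At its core, telling the two apart is a set difference: objects in storage minus objects the database still references. The sketch below illustrates that idea only; it is not the procedure used on the platform. The function name `find_orphans`, the `min_age` guard, and its seven-day default are assumptions for illustration. The age guard is one possible way to avoid flagging uploads that are still in flight and not yet recorded in the database.

```python
from datetime import datetime, timedelta, timezone


def find_orphans(stored, referenced, now, min_age=timedelta(days=7)):
    """Return keys that are unreferenced AND older than min_age.

    stored:     dict mapping object key -> last-modified time (timezone-aware)
    referenced: set of keys the application database still points to
    min_age:    hypothetical guard so fresh uploads are never classified
    """
    cutoff = now - min_age
    return {
        key
        for key, modified in stored.items()
        if key not in referenced and modified <= cutoff
    }


# Illustrative data, not taken from the platform.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
stored = {
    "img/a.jpg": now - timedelta(days=400),  # old and still referenced
    "img/b.jpg": now - timedelta(days=400),  # old and unreferenced
    "img/c.jpg": now - timedelta(hours=1),   # just uploaded, not yet referenced
}
orphans = find_orphans(stored, referenced={"img/a.jpg"}, now=now)
```

Here only `img/b.jpg` is classified as an orphan: `img/a.jpg` is referenced, and `img/c.jpg` is too recent to judge.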
The risk we caught
Two bugs that would have deleted live user images
A one-line note instead of an unrecoverable loss
Before any deletion ran, we reviewed the proposed cleanup procedure and identified two correctness bugs in the logic that decided which objects were orphaned. Each one would have silently misclassified live, in-use images as orphans, and the cleanup would have deleted them on production with no error and no warning.
If caught after execution
Unrecoverable - live user media gone for good.
Caught in review
A one-line note. Fixed before anything ran.
The solution
A phased, fully reversible operation
We redesigned the cleanup so that no single mistake could cause irreversible loss. Every phase was built to run against the live system, with clear ownership and sign-off before any irreversible step.
Independent backup first
A full copy of the data was taken to an isolated location before anything was touched, providing a recovery path for the whole operation.
Hard safety gates
The logic that classifies orphaned objects aborts automatically if its inputs look implausible, rather than relying on a human to notice.
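A gate of this kind can be as simple as a few plausibility checks that raise before any object is touched. The following is a minimal sketch under stated assumptions: the name `check_gates`, the exception type, and the thresholds (`max_orphan_ratio`, `min_referenced`) are hypothetical and would need to be chosen from what is known about the real data.

```python
class SafetyGateError(RuntimeError):
    """Raised when classification inputs look implausible."""


def check_gates(stored_count, referenced_count, orphan_count,
                max_orphan_ratio=0.97, min_referenced=1):
    """Abort if the inputs to the cleanup look wrong; return the orphan ratio.

    All thresholds are illustrative assumptions, not the platform's values.
    """
    if stored_count <= 0:
        raise SafetyGateError("storage listing is empty; listing may have failed")
    if referenced_count < min_referenced:
        # An empty reference list would make every object look orphaned.
        raise SafetyGateError("reference list is empty or too small")
    if orphan_count > stored_count:
        raise SafetyGateError("more orphans than stored objects")
    ratio = orphan_count / stored_count
    if ratio > max_orphan_ratio:
        raise SafetyGateError(f"orphan ratio {ratio:.2%} exceeds the allowed maximum")
    return ratio


ratio = check_gates(stored_count=1000, referenced_count=50, orphan_count=950)

try:
    check_gates(stored_count=1000, referenced_count=0, orphan_count=1000)
    gate_tripped = False
except SafetyGateError:
    gate_tripped = True
```

The second call models a failed database export: with no references, every object would be classified as an orphan, and the gate stops the run instead of letting it proceed.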
Quarantine instead of direct deletion
Objects identified as orphaned are first moved to a separate holding area and only then removed from the source, so nothing is deleted in a single irreversible step.
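The move can be sketched as copy, verify, then remove. In the example below, local directories stand in for object storage so the sketch stays runnable with the standard library; the function name `quarantine` and the checksum comparison are assumptions for illustration, not a description of the tooling actually used.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def _sha256(path):
    """Hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def quarantine(source_root, quarantine_root, relative_key):
    """Copy an object to the holding area, verify the copy, then remove the source."""
    src = Path(source_root) / relative_key
    dst = Path(quarantine_root) / relative_key
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    if _sha256(src) != _sha256(dst):
        dst.unlink()  # discard the bad copy and leave the source untouched
        raise IOError(f"checksum mismatch while quarantining {relative_key}")
    src.unlink()  # the source is removed only after the copy is verified
    return dst


# Illustrative run against temporary directories.
with tempfile.TemporaryDirectory() as tmp:
    live = Path(tmp) / "live"
    holding = Path(tmp) / "quarantine"
    (live / "img").mkdir(parents=True)
    (live / "img" / "old.jpg").write_bytes(b"example bytes")

    moved = quarantine(live, holding, "img/old.jpg")
    still_in_source = (live / "img" / "old.jpg").exists()
    recoverable = moved.read_bytes() == b"example bytes"
```

After the run, the object is gone from the source but its bytes are intact in the holding area, so a misclassification at this stage can still be undone.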
A verification window
After the cleanup, removed data stays recoverable for an extended period, during which any missed reference can be restored before anything becomes permanent.
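One way to model such a window is to record when each object was quarantined and allow permanent removal only after a retention period has elapsed, while restores remain possible at any time before that. This is a sketch, not the platform's implementation: the 30-day `RETENTION` value and the helper names `purgeable` and `restore` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # hypothetical length of the verification window


def purgeable(quarantined_at, now, retention=RETENTION):
    """Keys whose verification window has fully elapsed."""
    return {key for key, moved in quarantined_at.items() if now - moved >= retention}


def restore(key, quarantined_at, live_keys):
    """Return a quarantined key to the live set, e.g. after a missed reference is found."""
    if key not in quarantined_at:
        raise KeyError(f"{key} is not in quarantine")
    del quarantined_at[key]
    live_keys.add(key)


# Illustrative data, not taken from the platform.
now = datetime(2024, 3, 1, tzinfo=timezone.utc)
quarantined_at = {
    "img/b.jpg": now - timedelta(days=45),  # window elapsed
    "img/d.jpg": now - timedelta(days=3),   # still inside the window
}
live_keys = {"img/a.jpg"}

restore("img/d.jpg", quarantined_at, live_keys)  # a missed reference turned up
ready_to_purge = purgeable(quarantined_at, now)
```

In this run `img/d.jpg` returns to the live set, and only `img/b.jpg`, whose window has passed, becomes eligible for permanent removal.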
The result
A 95% storage reduction, designed to be safe
Projected figures. The numbers below are the projected outcome of the operation as designed. They are to be confirmed against the live system once the cleanup completes.
The platform is set to reclaim the large majority of its object storage while keeping every piece of live user media intact.
Storage volume
~95% reduction (projected)
Object count
Only live, referenced media remains (projected)
More important than the numbers: the operation was structured so that the worst realistic mistake is recoverable, on a system that never stopped serving users.
Takeaway
The value was not the cleanup - it was making it safe
Storage left unmanaged for years is a common outcome of fast early growth, not a failure. The hard part is not deleting data; it is deleting the right data, with certainty, on a live system, when getting it wrong is irreversible. The value here was not the cleanup itself but turning a high-stakes operation into a safe one.
Services applied