Background

Four years of data that nobody ever cleaned up

Like many fast-growing startups, the platform shipped features quickly in its early years and never built a process to clean up data that was no longer referenced. Every uploaded image stayed in object storage indefinitely, whether or not the application still pointed to it.

After roughly four years the unreferenced data had accumulated to a point where it could no longer be ignored. By then the platform was serving heavy traffic around the clock, which made the cleanup far harder than it would have been early on.

The problem

The junk and the crown jewels lived in the same place

Analysis of the object storage showed that the large majority of stored objects were orphaned: uploaded at some point, but no longer referenced anywhere in the application or its database. Only a small fraction was live, in-use user media.

~97% Reclaimable

~97% Orphaned - no longer referenced anywhere

~3% Active - live media the application still uses

The large majority of stored objects were no longer referenced anywhere.

The catch: orphaned data and live user media lived in the same place, fully intermixed. There was no clean separation - a cleanup meant telling the two apart with absolute precision.

For a platform like this, user-generated media is not just data; it is the core product. An irreversible deletion of live content would not be a setback; it would be an extinction-level event: four years of accumulated user data, none of it recreatable.

And it could not be done quietly. The platform serves traffic continuously, so the cleanup had to run safely against a live production system while users were actively uploading new content.

The risk we caught

Two bugs that would have deleted live user images

Caught in review

A one-line note instead of an unrecoverable loss

Before any deletion ran, we reviewed the proposed cleanup procedure and identified two correctness bugs in the logic that decided which objects were orphaned. Each one would have silently misclassified live, in-use images as orphans, and the cleanup would have deleted them in production with no error and no warning.

If caught after execution

Unrecoverable - live user media gone for good.

Caught in review

A one-line note. Fixed before anything ran.

The solution

A phased, fully reversible operation

We redesigned the cleanup so that no single mistake could cause irreversible loss. Every phase was built to run against the live system, with clear ownership and sign-off before any irreversible step.

Independent backup first

A full copy of the data was taken to an isolated location before anything was touched, giving the whole operation a recovery path.
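A backup is only a recovery path if it is complete. As a minimal sketch of how that could be checked, the example below compares a source store against its backup by key and checksum. The in-memory dictionaries stand in for object storage, and `verify_backup` is a hypothetical helper, not the tooling used in the engagement.

```python
import hashlib
from typing import Mapping


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of an object's content."""
    return hashlib.sha256(data).hexdigest()


def verify_backup(source: Mapping[str, bytes], backup: Mapping[str, bytes]) -> list[str]:
    """Return the sorted keys that are missing from the backup or differ in content.

    Both mappings are hypothetical stand-ins for object stores (key -> bytes).
    An empty result means every source object has an identical copy in the backup.
    """
    problems = []
    for key, data in source.items():
        if key not in backup:
            problems.append(key)  # object was never copied
        elif checksum(backup[key]) != checksum(data):
            problems.append(key)  # object was copied but content differs
    return sorted(problems)
```

In this sketch, the cleanup would proceed only when `verify_backup` returns an empty list.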

Hard safety gates

The logic that classifies orphaned objects aborts automatically if its inputs look implausible, rather than relying on a human to notice.
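To illustrate the idea of a hard safety gate, the sketch below classifies orphans as the set of stored keys minus the set of referenced keys, and aborts when the inputs look implausible. The thresholds, the `GateLimits` type, and the specific checks are assumptions chosen for illustration; they are not the two bugs found in review or the gates actually deployed.

```python
from dataclasses import dataclass
from typing import Iterable


class SafetyGateError(RuntimeError):
    """Raised when the classifier's inputs look implausible."""


@dataclass(frozen=True)
class GateLimits:
    # Hypothetical thresholds; real values would come from analysing the system.
    min_referenced: int = 1               # an empty reference set means a failed query
    max_orphan_fraction: float = 0.99     # refuse to classify almost everything as orphaned
    max_dangling_fraction: float = 0.01   # references to objects missing from the listing


def classify_orphans(
    stored_keys: Iterable[str],
    referenced_keys: Iterable[str],
    limits: GateLimits = GateLimits(),
) -> set[str]:
    """Return stored keys that nothing references, or abort if inputs look wrong."""
    stored = set(stored_keys)
    referenced = set(referenced_keys)

    if not stored:
        raise SafetyGateError("storage listing is empty")
    if len(referenced) < limits.min_referenced:
        raise SafetyGateError("too few referenced keys; the reference query may have failed")

    # Many references to keys absent from the listing suggest the two inputs
    # do not describe the same thing (for example, a key-format mismatch).
    dangling = referenced - stored
    if len(dangling) > limits.max_dangling_fraction * len(referenced):
        raise SafetyGateError(f"{len(dangling)} referenced keys are missing from storage")

    orphans = stored - referenced
    if len(orphans) > limits.max_orphan_fraction * len(stored):
        raise SafetyGateError("orphan fraction exceeds the configured limit")
    return orphans
```

The design choice here is that every gate raises rather than logs: a classifier that cannot trust its inputs produces no orphan list at all, so there is nothing for a later step to delete.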

Quarantine instead of direct deletion

Objects identified as orphaned are first copied to a separate holding area and only then removed from the source, so nothing is deleted in a single irreversible step.
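The ordering matters more than the mechanism. As a sketch under stated assumptions, the example below copies an object to a quarantine store, verifies the copy, and only then removes the original. The dictionaries stand in for object storage, and the `is_referenced` callback is a hypothetical last-moment check that accounts for the system being live while the cleanup runs.

```python
import hashlib
from typing import Callable, MutableMapping


def quarantine_object(
    key: str,
    source: MutableMapping[str, bytes],
    quarantine: MutableMapping[str, bytes],
    is_referenced: Callable[[str], bool],
) -> bool:
    """Move one object into quarantine; return False if it turned out to be live.

    All parameters are illustrative stand-ins, not a real storage API.
    """
    # Re-check immediately before acting: the orphan list may be stale.
    if is_referenced(key):
        return False

    data = source[key]
    quarantine[key] = data  # step 1: copy

    # Step 2: verify the copy before touching the original.
    if hashlib.sha256(quarantine[key]).digest() != hashlib.sha256(data).digest():
        del quarantine[key]
        raise RuntimeError(f"quarantine copy of {key!r} failed verification")

    del source[key]  # step 3: remove from source only after a verified copy exists
    return True
```

If the process stops between any two steps, the object still exists in at least one place.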

A verification window

After the cleanup, removed data stays recoverable for an extended period, during which any missed reference can be restored before anything becomes permanent.
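A verification window can be sketched as two operations on the quarantine: restore an object that turns out to be needed, and purge only what has aged past the retention period. The 30-day `RETENTION` value and the in-memory stores below are assumptions for illustration; the source does not state the actual window length.

```python
from datetime import datetime, timedelta
from typing import MutableMapping

RETENTION = timedelta(days=30)  # hypothetical window length

# Hypothetical quarantine layout: key -> (content, time the object was quarantined).
Quarantine = MutableMapping[str, tuple[bytes, datetime]]


def restore(key: str, source: MutableMapping[str, bytes], quarantine: Quarantine) -> None:
    """Put a quarantined object back into the source store."""
    data, _ = quarantine.pop(key)
    source[key] = data


def purge_expired(
    quarantine: Quarantine, now: datetime, retention: timedelta = RETENTION
) -> list[str]:
    """Permanently drop objects quarantined for at least `retention`; return their keys."""
    expired = sorted(
        key for key, (_, quarantined_at) in quarantine.items()
        if now - quarantined_at >= retention
    )
    for key in expired:
        del quarantine[key]
    return expired
```

Only `purge_expired` is irreversible, and it never touches anything younger than the window.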

The result

A 95% storage reduction, designed to be safe

Projected figures. The numbers below are the projected outcome of the operation as designed. They are to be confirmed against the live system once the cleanup completes.

The platform is set to reclaim the large majority of its object storage while keeping every piece of live user media intact.

Storage volume

Before: ~10 TB

After: ~0.5 TB

~95% reduction (projected)

Object count

Before: ~40M

After: ~1M

Only live, referenced media remains (projected)
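The two reductions are measured differently, one by volume and one by object count, which is why they are not the same percentage. The short calculation below works through both using the approximate projected figures quoted above; `reduction` is a helper introduced here for illustration.

```python
def reduction(before: float, after: float) -> float:
    """Fraction of the original amount that is removed."""
    return (before - after) / before


# Approximate projected figures from this case study.
storage = reduction(before=10.0, after=0.5)   # terabytes
objects = reduction(before=40e6, after=1e6)   # object count

print(f"storage reduction: {storage:.1%}")  # storage reduction: 95.0%
print(f"object reduction:  {objects:.1%}")  # object reduction:  97.5%
```

The object-count result is in line with the roughly 97% of objects identified as orphaned, while the smaller volume figure implies that live media accounts for a slightly larger share of bytes than of objects.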

More important than the numbers: the operation was structured so that the worst realistic mistake is recoverable, on a system that never stopped serving users.

Takeaway

The value was not the cleanup - it was making it safe

Storage left unmanaged for years is a common outcome of fast early growth, not a failure. The hard part is not deleting data; it is deleting the right data, with certainty, on a live system, when getting it wrong is irreversible. The value here was not the cleanup itself but turning a high-stakes operation into a safe one.

Services applied

What the engagement covered

Cloud storage analysis

AWS infrastructure design

Data-loss risk review

Code review

Safe operational planning

Production-system change management

Reclaiming years of cloud storage without risking a single user's data