The Strategy Behind ReversingLabs’ Monster Scale Key-Value Migration
Migrating 300+ TB of data and 400+ services from a key-value database to ScyllaDB – with zero downtime
ReversingLabs recently completed the largest migration in their history: moving more than 300 TB of data, more than 400 services, and their data models from an internally developed key-value database to ScyllaDB – seamlessly, and with zero downtime. Services using multiple tables — reading, writing, and deleting data, and even using transactions — had to be switched over quickly and transparently.
How did they pull it off? Martina recently shared their strategy, including data modeling changes, the actual data migration, service migration, and a peek at how they addressed distributed locking.
Here’s her complete tech talk:
And you can read highlights below…
About ReversingLabs
ReversingLabs is a security company that aims to analyze every enterprise software package, container, and file to identify potential security threats and mitigate cybersecurity risks. They maintain a library of 20B classified samples of known “goodware” (benign) and malware files and packages. Those samples are supported by ~300 TB of metadata, which is processed by a network of approximately 400 microservices.
As Martina put it: “It’s a huge system, complex system – a lot of services, a lot of communication, and a lot of maintenance.”
Never build your own database (maybe?)
When the ReversingLabs team set out to select a database in 2011, the options were limited:
Cassandra was at version 0.6, which lacked row-level isolation
DynamoDB was not yet released
ScyllaDB was not yet released
MongoDB 1.6 had consistency issues between replicas
PostgreSQL was struggling with multi-version concurrency control (MVCC), which created significant overhead
“That was an issue for us—Postgres used so much memory,” Martina explained. “For a startup with limited resources, having a database that ate all our memory was a problem. So we built our own data store. I know, it’s scandalous—a crazy idea today—but in this context, in this market, it made sense.”
The team built a simple key-value store tailored to their specific needs—no extra features, just efficiency. It required manual maintenance and was only usable by their specialized database team. But it was fast, used minimal resources, and helped ReversingLabs, as a small startup, handle massive amounts of data (which became a core differentiator).
However, after 10 years, ReversingLabs’ growing complexity and expanding use cases became overwhelming – both for the database itself and for the small database team responsible for it. Realizing that they had reached their home-grown database’s tipping point, they started exploring alternatives.
Enter ScyllaDB. Martina shared: “After an extensive search, we found ScyllaDB to be the most suitable replacement for our existing database. It was fast, resilient, and scalable enough for our use case. Plus, it had all the features our old database lacked. So, we decided on ScyllaDB and began a major migration project.”
Migration Time
The migration involved 300 TB of data, hundreds of tables, and 400 services. The system was complex, so the team followed one rule: keep it simple. They made minimal changes to the data model and didn’t change the code at all.
“We decided to keep the existing interface from our old database and modify the code inside it,” Martina shared. “We created an interface library and adapted it to work with the ScyllaDB driver. The services didn’t need to know anything about the change—they were simply redeployed with the new version of the library, continuing to communicate with ScyllaDB instead of the old database.”
Moving from a database with a single primary node to one with a leaderless ring architecture did require some changes, though. The team had to adjust the primary key structure, but the value itself didn’t need to be changed. In the old key-value store, data was stored as a packed protobuf with many fields. Although ScyllaDB could unpack these protobufs and separate the fields, the team chose to keep them as they were to ensure a smoother migration. At this point, they really just wanted to make it work exactly like before. The migration had to be invisible — they didn’t want API users to notice any differences.
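The talk doesn’t show the actual schema or library code, but the idea can be sketched roughly: a table whose partition key is adapted to ScyllaDB’s token ring, a blob column holding the packed protobuf exactly as before, and a thin wrapper that preserves the old store’s get/put/delete interface. Below is a minimal illustration in Python using the cassandra-driver (which also works with ScyllaDB); the keyspace, table, and method names are all assumptions, not ReversingLabs’ actual code.

```python
# Illustrative sketch only -- ReversingLabs' real schema and interface library are internal.
# Assumes the Python cassandra-driver, a hypothetical "samples" keyspace, and a value
# column that stays a packed protobuf blob, exactly as the old store held it.
from cassandra.cluster import Cluster


class KeyValueStore:
    """Keeps the old store's get/put/delete interface while talking to ScyllaDB."""

    SCHEMA = """
        CREATE TABLE IF NOT EXISTS sample_metadata (
            sample_key text,   -- partition key, hash-distributed across the ring
            value      blob,   -- packed protobuf, unchanged from the old database
            PRIMARY KEY (sample_key)
        )
    """

    def __init__(self, hosts, keyspace="samples"):
        self._session = Cluster(hosts).connect(keyspace)
        self._session.execute(self.SCHEMA)
        self._get = self._session.prepare(
            "SELECT value FROM sample_metadata WHERE sample_key = ?")
        self._put = self._session.prepare(
            "INSERT INTO sample_metadata (sample_key, value) VALUES (?, ?)")
        self._del = self._session.prepare(
            "DELETE FROM sample_metadata WHERE sample_key = ?")

    def get(self, key):
        row = self._session.execute(self._get, (key,)).one()
        return row.value if row else None

    def put(self, key, packed_protobuf):
        self._session.execute(self._put, (key, packed_protobuf))

    def delete(self, key):
        self._session.execute(self._del, (key,))
```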
Here’s an overview of the migration process they performed once the models were ready:
1. Stream the old database output to Kafka
The first step was to set up a Kafka topic dedicated to capturing updates from the old database.
2. Dump the old database into a specified location
Once the streaming pipeline was in place, the team exported the full dataset from the old database.
3. Prepare a ScyllaDB table by configuring its structure and settings
Before loading the data, they needed to create a ScyllaDB table with the new schema.
4. Prepare and load the dump into the ScyllaDB table
With the table ready, the exported data was transformed as needed and loaded into ScyllaDB.
5. Continuously stream data to ScyllaDB
They set up a continuous pipeline with a service that listened to the Kafka topic for updates and loaded the data into ScyllaDB (a minimal sketch of such a consumer appears below).
After the backlog was processed, the two databases were fully in sync, with only a negligible delay between the data in the old database and ScyllaDB.
It’s a fairly straightforward process…but it had to be repeated for 100+ tables.
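The continuous streaming in step 5 is what kept the two databases in sync until cutover. A minimal sketch of such a consumer, assuming kafka-python and the Python cassandra-driver, might look like the following; the topic name, message layout, and delete-on-empty-value convention are assumptions rather than ReversingLabs’ actual pipeline.

```python
# Minimal sketch of step 5: replay the old database's update stream into ScyllaDB.
# kafka-python and cassandra-driver are assumed; the topic name, message layout
# (key bytes plus optional value bytes), and the delete-on-empty-value convention
# are illustrative, not ReversingLabs' actual pipeline.
from kafka import KafkaConsumer
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    "old-db-updates",                 # topic fed by the old database (step 1)
    bootstrap_servers=["kafka:9092"],
    group_id="scylla-migrator",
    auto_offset_reset="earliest",     # start from the backlog accumulated during the dump
    enable_auto_commit=False,
)

session = Cluster(["scylla-node1"]).connect("samples")
upsert = session.prepare(
    "INSERT INTO sample_metadata (sample_key, value) VALUES (?, ?)")
delete = session.prepare(
    "DELETE FROM sample_metadata WHERE sample_key = ?")

for message in consumer:
    key = message.key.decode()
    if message.value:                 # update: overwrite the packed protobuf blob
        session.execute(upsert, (key, message.value))
    else:                             # empty value: the key was deleted upstream
        session.execute(delete, (key,))
    consumer.commit()                 # commit the offset only after the write landed
```

Writes like these are idempotent (the same key and value can be replayed safely), which is what makes it possible to load the dump first and then apply the backlog of streamed updates on top of it.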
Next Up: Service Migration
The next challenge was migrating their ~400 microservices. Martina introduced the system as follows:
“We have master services that act as data generators. They listen for new reports from static analysis, dynamic analysis, and other sources. These services serve as the source of truth, storing raw reports that need further processing. Each master service writes data to its own table and streams updates to relevant queues.
The delivery services in the pipeline combine data from different master services, potentially populating, adding, or calculating something with the data, and combining various inputs. Their primary purpose is to store the data in a format that makes it easy for the APIs to read. The delivery services optimize the data for queries and store it in their own database, while the APIs then read from these new databases and expose the data to users.”
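The talk doesn’t include code for these services, but the delivery-service pattern she describes (consume from the master services’ queues, reshape the data, and store rows optimized for the read APIs) could be sketched roughly like this; the topic names, message fields, and target table are invented for illustration.

```python
# Rough sketch of the delivery-service pattern: combine the outputs of several
# master services and store query-optimized rows for the APIs to read.
# Topic names, message fields, and the target table are invented for illustration.
import json

from kafka import KafkaConsumer
from cassandra.cluster import Cluster

consumer = KafkaConsumer(
    "static-analysis-reports", "dynamic-analysis-reports",  # queues fed by master services
    bootstrap_servers=["kafka:9092"],
    group_id="delivery-sample-summary",
    value_deserializer=json.loads,
)

session = Cluster(["scylla-node1"]).connect("delivery")
store = session.prepare(
    "INSERT INTO sample_summary (sample_key, source, verdict, report) VALUES (?, ?, ?, ?)")

for message in consumer:
    report = message.value
    # Denormalize into the shape the read APIs expect: one row per (sample, source),
    # so an API can fetch everything about a sample with a single query by sample_key.
    session.execute(store, (
        report["sample_key"],
        message.topic,
        report.get("verdict", "unknown"),
        json.dumps(report),
    ))
```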
Here’s the 5-step approach they applied to service migration:
1. Migrate the APIs one by one
The team migrated APIs incrementally. Each API was updated to use the new ScyllaDB-backed interface library. After redeploying each API, the team monitored performance and data consistency before moving on to the next one.
2. Prepare for the big migration day
Once the APIs were migrated, they had to prepare for the big migration day. Since all the services upstream of the APIs were intertwined, they had to be migrated all at once.
3. Stop the master services
On migration day, the team stopped the master services (data generators), letting messages accumulate in the input queues until the migration was complete. During this time, the APIs continued serving traffic without any downtime. However, the data in the databases was delayed for about an hour or two until all services were fully migrated.
4. Migrate the delivery services
After stopping the master services, the team waited for the queues between the master and delivery services to empty – ensuring that the delivery services had processed all data and stopped writing. The delivery services were then migrated one by one to the new database. No new data was arriving at this point because the master services were stopped.
5. Migrate and start the master services
At last, it was time to migrate and start the master services. The final step was to shut down the old database because everything was now working on ScyllaDB.
“It worked great,” Martina shared. “We were happy with the latencies we achieved. If you remember, our old architecture had a single master node, which created a single point of failure. Now, with ScyllaDB, we had resiliency and high availability, and we were quite pleased with the results.”
And Finally…Resource Locking
One final challenge: resource locking. Per Martina, “In the old architecture, resource locking was simple because there was a single master node handling all writes. You could just use a mutex on the master node, and that was it—locking was straightforward. Of course, it needed to be tied to the database connection, but that was the extent of it.”
ScyllaDB’s leaderless architecture meant that the team had to figure out distributed locking. They leveraged ScyllaDB’s lightweight transactions and built a distributed locking mechanism on top of them. The team worked closely with ScyllaDB engineers, going through several proofs of concept (POCs)—some successful, others less so. Eventually, they developed a working solution for distributed locking in their new architecture. You can read all the details in Martina’s blog post, Implementing distributed locking with ScyllaDB.
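That post covers the full design; as a rough illustration of the core building block, here is what a lock based on ScyllaDB’s lightweight transactions can look like: an INSERT ... IF NOT EXISTS that only one client can win, plus a TTL so a crashed holder cannot wedge the lock forever. The table, TTL value, and helper names below are assumptions, not the exact mechanism from the post.

```python
# Bare-bones illustration of LWT-based locking -- not the mechanism from Martina's post,
# which adds more machinery. Table name, TTL, and owner handling are assumptions.
import uuid

from cassandra.cluster import Cluster

session = Cluster(["scylla-node1"]).connect("samples")
session.execute("""
    CREATE TABLE IF NOT EXISTS resource_locks (
        resource text PRIMARY KEY,
        owner    uuid
    )
""")

# "IF NOT EXISTS" turns the insert into a lightweight transaction: only one writer can
# create the row, and the TTL makes the lock expire if its holder crashes.
acquire = session.prepare(
    "INSERT INTO resource_locks (resource, owner) VALUES (?, ?) IF NOT EXISTS USING TTL 30")
# The conditional delete only succeeds for the owner that actually holds the lock.
release = session.prepare(
    "DELETE FROM resource_locks WHERE resource = ? IF owner = ?")


def try_lock(resource):
    """Return an owner id if the lock was acquired, or None if someone else holds it."""
    owner = uuid.uuid4()
    applied = session.execute(acquire, (resource, owner)).one()[0]  # the [applied] column
    return owner if applied else None


def unlock(resource, owner):
    """Release the lock; returns False if this owner no longer held it."""
    return session.execute(release, (resource, owner)).one()[0]


owner = try_lock("sample:123")
if owner:
    try:
        ...  # work on the locked resource
    finally:
        unlock("sample:123", owner)
```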