April 21, 2025 • Kevin Bai
At Slash, we’re powering billions of dollars of spend annually. Having grown over 1000% during the past 12 months, we’re constantly solving new problems at increasing scale. As the amount of data grew and our system became more complex, assumptions we made about data relationships would often break. We encountered issues from multiple sources: bad backfills, bugs in production code, and small mistakes during data migrations. Over time, it became increasingly hard to verify the correctness of our data which led us to build what we now call Health Checks — a system designed to continuously verify data correctness in production.
Health checks are a tool to help us verify invariants in our codebase. When we build systems, we always design them with a set of implicit and explicit invariants, which are rules / assumptions that we expect to always hold true so that we can build more complex systems on top of them.
One trivial example of an invariant: in the
Card
table in our DB, where each row represents a credit card, there is astatus
column where one of the values can be “closed” and aclosedReason
column that is NULLABLE. One invariant is that thestatus
column is “closed” if and only if theclosedReason
field is NOT NULL.
The example above isn't necessarily a condition for which we'd write a health check, but is an example of an invariant that we ensure holds true while working with this area of the codebase.
Our first iteration of health checks looked something like this below:
/**
* This health check ensures that for every declined AuthorizationSet, there are no transfer intents
*/
export const hcrForDeclinedAuthSets = createHealthCheckRoutineDefinition({
programmaticName: 'chargeCard.authSets.declinedAuthSets',
title: 'Charge card declined auth sets',
description:
'This health check ensures for every declined AuthorizationSet, there are no transfer intents',
schedule: {
type: 'interval',
interval: 'minutes',
frequency: 30,
},
})
.bulkCheckFn((deps: DefineDependencies<[HealthCheckModule.Service]>) => {
return async (
cursor:
| HealthCheckRoutineCursor<{
authorizationSetId: string;
}>
| undefined
) => {
// getErrorAccounts -> a query to Snowflake
const errorAccounts = await getErrorAccounts(cursor, deps);
return {
data: errorAccounts.data.map((val) => ({
key: val.authorizationSetId,
})),
nextCursor: errorAccounts.nextCursor,
};
};
})
.singleCheckFn(() => async (params) => {
const singleCheckRes = await runQuery(
sql`...`
);
return {
success: singleCheckRes.length === 0,
};
});
We initially designed health checks to be defined by SQL queries. We would run a SQL query against a table to find any “potential errors”. But because running SQL queries against large tables is expensive, we would run these checks against an eventually consistent read replica (in our case, this was Snowflake). For each result returned from the query, we would then run a check against our main production database to ensure it wasn't a false positive. This design had a few issues:
We decided:
One big learning we’ve had as a team over time is that predictable systems that exert constant load are ideal. The biggest culprit to production issues has typically been sudden unpredictable changes such as a large spike in DB load, or a sudden change in a database internal query planner.
The counterintuitive aspect about putting constant load on systems, especially with respect to databases and queues, is that it can seem wasteful. I used to believe it was better to put no load on systems by default and only do work the few times a day when needed. However, this can lead to spiky workloads that would sometimes degrade our system unexpectedly. In actuality, we've found it's usually better for there to be a small constant load running against the DB over a long period of time, even if the constant load contributes something like 1-5% usage to the overall CPU 24/7. When load is predictable, it’s easy to monitor usage rates across the board. This predictability allows us to scale horizontally with confidence and plan ahead for potential future performance issues.
There's an insightful read about this by the team at AWS: Reliability, constant work, and a good cup of coffee
The second version of health checks now looks like:
export const savingsAccount = createHealthCheckGroup({
over: SlashAccountGroup,
name: 'savings_account_interest_entities',
schedules: {
full_scan: {
type: 'full_scan',
config: {},
props: {
numberOfItemsToPaginate: 2000,
minimumIntervalSinceStartOfLastRun: 3 * 60 * 60 * 1000,
},
},
},
checks: {
interest_limit_rule_should_exist: {
async check(group, ctx) {
if (group.groupType === 'savings') {
const activeInterestLimits = await from(SlashAccountGroupLimitRule)
.leftJoin(LimitRule, {
on: (_) => _.lr.id.equals(_.slr.limitRuleId),
})
.where((_) =>
_.slr.slashAccountGroupId
.equals(group.id)
.and(_.slr.isActive.equals(true))
.and(_.lr.type.equals('interest'))
);
ctx.shouldHaveLengthOf(activeInterestLimits, 1, {
message:
'Savings accounts should have exactly one active interest limit',
});
}
},
},
// other checks
...
}
});
With this design, it becomes very easy to add new health checks onto a single entity since each health check is defined as an asynchronous single function. We could have these health checks iterating over entire tables multiple times a day without needing to make a tradeoff between frequency since the load is constant. For the health check above, each entity performs a fast lookup to the LimitRule
table. These checks run continuously, putting minimal but constant load on the database at any given point in time.
How this system actually runs under the hood can be simplified to roughly the following design:
Health checks today form an important basis in our testing pipeline. It’s the foundation for our ongoing production verification tests. As we’ve grown, more and more general testing has been required to keep our product stable and functional. It took us several tries to get this right. We've wanted to implement some form of ledger testing / data verification in production for the past three years. However, it wasn’t until a year and a half ago when we finally built something out. Overall, some important lessons we’ve learned are:
Today, health checks are the first step we've taken to better test and verify our systems outside of more traditional forms such as testing and observability. As we continue to grow, we'll progressively reflect and figure out new ways to keep our product stable.