SQLHokanson

Keeping the lights on: Managing large database environments


Monitoring the distribution database size

Part 2 of the Replication Monitoring Series

We can answer the question, ‘Is the replication environment healthy?’ with a single monitoring metric: the size of the distribution database.

Because the distribution database functions largely as a queue, it should remain relatively small. The Distributor holds replication commands from the Publisher only until all Subscribers have successfully executed them. Once delivered, the Distribution Clean Up job purges these commands. Consequently, a replication command should only reside in the distribution database for the duration of the cleanup schedule interval.

A distribution database that grows without shrinking signals a replication issue. Since Subscribers process commands via FIFO (First-In, First-Out), a single error can halt processing for that subscription. This forces the distribution database to retain commands, causing it to swell. Therefore, database size is a reliable indicator of overall health: a consistently small database implies a healthy environment, while a larger-than-normal database indicates a bottleneck or failure.

Capture the distribution database size

For a quick and easy way to check the database size, the query below provides an efficient method for checking distribution database size. We have filtered the results to show only distribution databases. Many monitoring tools already include database level metrics, incorporating those metrics into your existing monitoring playbook may be a more streamlined approach.

SELECT
    d.name AS DatabaseName,
    CONVERT(DECIMAL(10,3), ROUND(SUM(f.size) * 8 / 1024.0 / 1024.0, 2)) AS [Size_GB]
FROM sys.databases d
JOIN sys.master_files f ON d.database_id = f.database_id
WHERE d.is_distributor = 1
GROUP BY d.name

In a large production replication environment, the above query may Your sentence is clear and grammatically sound, but there are a few minor punctuation and stylistic adjustments that will make it read more professionally.

If we capture the database size on a regular interval (e.g., every 30 seconds) and visualize it with a monitoring tool, we can quickly assess the health of that distribution database’s replication environment:

Monitoring trends:

  • The database size remains low throughout the day
    • What does this mean? The overall replication environment is healthy; there are no errors or backlogs.
  • The database size steadily trending upwards
    • What does this mean? The Distribution Cleanup job is unable to purge replication commands. This usually happens because one or more Subscribers are unable to apply the replication commands at the destination. Most likely, an error is occurring at a Subscriber. You need to investigate further.
    • Alternatively, the Distribution Cleanup job cannot keep up with the replication volume. Look into tuning the job’s batch size.
  • The database size suddenly increase in row count
    • What does this mean? A large DML transaction most likely occurred at the Publisher. This requires all Subscriber Agents to replicate the same large transaction. Consequently, you may experience a delay while these commands are processed and applied at the Subscribers.
    • Once subscribers consumes all the replication command, we should expect the database size to trend back down.

A database replication horror story

Let me share a horror story that my team lived through—I honestly hope no one else ever comes close to this. I am sure both Andy Mallon and Barys Kuchmel would agreed as we worked on this together.

We had recently moved a ‘high throughput’ distribution database from a physical server to a VM as part of a larger company effort to modernize and consolidate our server footprint. After the move, replication seemed to be flowing perfectly, and all metrics showed the new VM was handling the load just fine. We thought we were in the clear and had pulled off the migration without taking a performance hit.

But over the next few weeks, we started seeing signs that the Distributor was struggling. All subscribers were getting their data on time, but we noticed the distribution database size was growing. Typically, the size hovered around 2–5 GB, but over time, it ballooned to 50–100 GB. Since it’s a queue database, the replication commands should have been short-lived, but they were sticking around. We checked and confirmed that there were not replication errors that would prevent the commands from getting purged. After additional investigation, we determined the Distribution Cleanup job just couldn’t keep up with the volume of replicated commands.

Fast forwarding a bit: the database eventually grew to over 1 TB. We had to perform emergency surgery, truncating both the msrepl_command and msrepl_transactions tables to get things back to a stable state (but saved undistributed commands beforehand). Once the dust settled and additional troubleshooting/performance tuning, we migrated the Distributor back onto physical hardware. I will save “how we did the emergency surgery” for another blog post later on.

The monitoring metric for the distribution database size was the key indicator early on that highlighted that we had a growing problem. Even though replication was flowing and all subscribers were up to date in terms of the replicated data, we discovered then we had a potential problem and we were able to act on it before it impacted our business.



One response to “Monitoring the distribution database size”

  1. This was an invaluable experience troubleshooting SQL internals under tight deadlines, and the amazing teamwork made all the difference! Thanks for sharing the story.

Leave a Reply

Discover more from SQLHokanson

Subscribe now to keep reading and get access to the full archive.

Continue reading