Friday, April 10, 2009

Exchange Cluster Continuous Replication Failed

The storage drive on one of our Exchange 2007 Mailbox servers filled up because log files weren't being cleaned out, which in turn was because the backup job had been failing (don't ask). After we deleted the unnecessary log files and got the store back online, we couldn't get the CCR passive node to bring 3 of the 5 stores online. We ran "Update-StorageGroupCopy -DeleteExistingFiles", and the process would complete. The copy would then show up as "healthy" in "Get-StorageGroupCopyStatus" for a minute or so, and then fail. Checking the event log, we found:
Event ID: 2059
Source: MSExchangeRepl
The log file 404149 for is missing on the production copy. Continuous replication for this storage group is blocked. If you removed the log file, please replace it. If the log is lost, the passive copy will need to be reseeded using the Update-StorageGroupCopy cmdlet in the Exchange Management Shell.
This didn't make sense at first, since the database on the active node was "up to date" and shouldn't have needed those log files. After some research, I found that it was a result of the failed backup process. By deleting log files on the active node that had been created after the last time the database was backed up, we'd broken the log replay process.

Given that we're using a backup solution that backs up the databases from the passive node, I had to use the NTBackup utility to do a normal backup of the database. Once this completed, I was able to use the Update-StorageGroupCopy command to get the database replication back to a healthy state. In theory this process should work on a Standby Continuous Replication (SCR) cluster as well.
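
For reference, the reseed step after the backup can be sketched roughly as follows in the Exchange Management Shell. This is a minimal sketch, not the exact commands we ran: the storage group identity "MBX01\SG1" is a made-up placeholder, and it assumes the normal backup has already completed successfully.

```powershell
# Hypothetical identity: replace MBX01\SG1 with your own server\storage group.
$sg = "MBX01\SG1"

# The passive copy has to be suspended before it can be reseeded.
Suspend-StorageGroupCopy -Identity $sg -Confirm:$false

# Reseed the passive copy, discarding whatever files are already there.
Update-StorageGroupCopy -Identity $sg -DeleteExistingFiles

# Check the result. If the copy is still suspended at this point,
# Resume-StorageGroupCopy -Identity $sg should start replication again.
Get-StorageGroupCopyStatus -Identity $sg
```

It's worth checking the status again a few minutes later, since in our case the copy reported healthy briefly before failing.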

1 comment:

Chris said...

I was having the same issue and this article saved the day. Thanks for the help.