In a Replicated environment, it is possible that someone or the other at some point of time by mistake performs a DML operation against the Replicated Database on the subscriber.In a Transactional Replicated environment which is a One way traffic such a mistake can prove out to be dangerous.In this article,Satnam Singh a Senior SQL Server DBA from Larsen&Toubro Infotech,Mumbai,India discusses one such situation which he faced on his production server a couple of days back.
Scenario:
It was a lazy Sunday and suddenly I got a call from one of the member of my support team saying that the “Transactional Replication between the OLTP and the Reporting Server is not working”. Due to this one of the application which the users in the Australia were using wasn’t reflecting the upto date data. When I viewed the Replication Monitor on the OLTP Server (Publisher), I noticed that the system is throwing a 20598 error message, “The row was not found at the subscriber while applying the Replicated Commands”
Solution:
There was a database named ABC present on the Production OLTP Database Server (10.x.x.x) whose data was getting replicated onto the database named ABC present on the Production Reporting Server (172.x.x.x). I logged onto the Production OLTP Server and decided to check the Replication Status. When I viewed the Log Reader Agent status, I noticed that the Replicated Commands were generated properly but when I viewed the Synchronization status of the subscriber, I viewed the below error message.
The row was not found at the Subscriber when applying the replicated command
At this stage, I was pretty sure that something was wrong with the database on the Reporting Server.
The next challenge was to find the rows which the Replication process wasn’t able to identify on the Reporting Server (172.x.x.x) . Before identifying those rows we first need to find the table to which they belonged to . I then executed the below T-SQL query against the distribution database on the Production OLTP Server (10.x.x.x).
Select * from msrepl_errors WHERE ERROR_TEXT LIKE '%The row was not found at the Subscriber when applying the replicated command.'
Once the above query was executed, I received the following Output Message.
Xact_seqno: 0x0002E3010002E89D006000000000
Command:id :1,2,3,4,5,6,9
Where Xact_seqno is the Transaction sequence number.
I then executed the below query against the distribution database on the Production OLTP Server (10.x.x.x) to find the database id and the article id, article id refers to the id of the particular table. In Transactional Replication, each Replicated table is treated as an individual article.
select * from msrepl_commands (nolock) where command_id = 1 and xact_seqno = 0x0002E3010002E89D006000000000
Once the above T-SQL query was executed, I received the corresponding database id in the output message as 19 and the article id which as 384.
The next step was to find the exact table name which was having an issue on the subscriber.In order to do that, I executed the below T-SQL query against the database named distribution on the Production OLTP Server (Publisher).
select * from dbo.MSarticles where article_id IN (SELECT Article_id from MSrepl_commands where xact_seqno = 0x0002E3010002E89D0060) and publication_id =19
xact_seqno and publication_id is the one which we have obtained by executing the T-SQL commands as discussed above which was 0x0002E3010002E89D0060 and 19 respectively.
When the above T-SQL query was executed, in the output message the name of the table was reflected as tbl_employee .Now at this stage, I was 100% sure that tbl_employee was the table which was having an issue.
Now after identifying the table name the next challenge was to find the Replicated command which was having an issue to get propagated onto the subscriber.
In order to find the Replicated command, I executed the below T-SQL query against the distribution database on the Production OLTP Server.
sp_browsereplcmds @article_id = 384, @command_id = 1, @xact_seqno_start = '0x0002E3010002E89D0060', @xact_seqno_end = '0x0002E3010002E89D0060', @publisher_database_id = 19
Each and every input parameter specified in the sp_browsereplcmds system stored procedure has been obtained from the steps which we have discussed above.
As soon as the above T-SQL query was executed, I received the following command:
{CALL [sp_MSdel_dbotbl_employee] (8463)}
After having a look at the command, I came to a final conclusion that the system couldn’t find a row whose id was 8463 in the table named tbl_employee on the subscriber.
I repeated the same procedure for the command_id 2, 3, 4,5,6,9 and the Replicated commands obtained was as follows:
{CALL [sp_MSdel_dbotbl_employee] (8464)} {CALL [sp_MSdel_dbotbl_employee] (8519)} {CALL [sp_MSdel_dbotbl_employee] (8525)} {CALL [sp_MSdel_dbotbl_employee] (8526)} {CALL [sp_MSdel_dbotbl_employee] (8581)} {CALL [sp_MSdel_dbotbl_employee] (8643)}
Now at this stage, I got a list of all the id’s which were to be present in the table named tbl_employee on the ABC database on the subscriber. Just FYI, the id’s were 8463, 8464, 8519,8525,8526,8581 and 8643 respectively as obtained above.
I then decided to insert dummy records in the table named tbl_employee present in the ABC database on subscriber for each of the corresponding id’s obtained above. It was done by executing the below T-SQL query against the ABC database on subscriber.
Insert tbl_employee (id) Select 8463 Insert tbl_employee (id) Select 8464 Insert tbl_employee (id) Select 8519 Insert tbl_employee (id) Select 8525 Insert tbl_employee (id) Select 8526 Insert tbl_employee (id) Select 8581 Insert tbl_employee (id) Select 8643
Once the above Insert statements got executed, I started checking the synchronization status between the Publisher and the subscriber and found that the Replicated commands are getting delivered properly. It took exactly 53 minutes for both the servers to come back in sync. Just FYI, once the Replication process started again, the dummy records inserted above got updated successfully with the required ones.
At the end, I come to a conclusion that the issue started due to some DML statements been fired against the Database named ABC on the subscriber. I couldn’t afford re-initializing the subscriber because the database size was pretty huge having plenty of indexes on it and the SLA time was just 4 hrs. In order to ensure that I complete the entire activity within the proper SLA it was necessary for me to develop a proper strategy that worked well.
Many Thanks to all the viewers for their valuable time in reading this article.Sincere Thanks to the support team as well with whom i work at Larsen&Toubro Infotech, their co-operation always help me to present some Real World scenarios to the SQL Server community. Sincere Thanks to Amit Bansal and Sarabpreet Singh as well for always motivating me.Please do let me know in case you have a better solution to the above problem. Your comments and feedback will be highly appreciated.
Regards
Satnam Singh
Like us on FaceBook | Follow us on Twitter
Follow me on FaceBook| Join the fastest growing SQL Server group on FaceBook
I always remember That June 24 Stuff, trouble shoot the logs “What was changed” and by Whom?, by the way wonderful explaination in detail.
Hey Satnam,
Could you please let me know when you get such errors? I have tried to re-create such errors but failed Frown. My transactional replication working perfectly as per the specification.
Thanks,
Sushil Rout
Thanks for this…Clear explanation .. Also can you please let me know the status of DMV’s after server restart and why?