Big, complex businesses frequently run into problems capturing historical data.  Perhaps the database in question was never meant to capture history, but now there is a business need for some kind of auditing or data warehousing.  Maybe that database is owned by the company.  Maybe it’s owned by some third party.  Maybe it’s a legacy system that cannot be modified.  The basic form of the problem under discussion here is: there is a database that a) cannot or should not be modified, b) is, for one reason or another, not directly accessible, and c) does not maintain its own history — at least, not on the scale that the business requirements dictate is necessary.

Now normally, if we owned both databases or could develop against both directly, the architecture and design would be a no-brainer: create a database suited to the long-term storage of historical data (a data warehouse or something similar), write the code artifacts/packages needed to transfer the data from the external database to the new warehouse, automate the transfer at appropriate intervals, and then sit back and let it do its thing.  Problem solved.  Or, if we could get a file extract from the source database, we could transfer the file contents to the warehouse via an ETL or ELT process.  Alternatively, if we needed more real-time synchronization, we could take advantage of something like Amazon Kinesis Streams or some other real-time data transfer solution.  This alternative gives us a traditional Computer Science producer/consumer situation: a producer package would be created to send the data out of the source, a consumer package would be created to accept the data coming in, and, barring other development details, we’d again be good to go.

OK, maybe not a “no-brainer” – this constitutes at least a couple of weeks of development and testing, not to mention the planning, architecture, and design work that must precede them.  But still, the above is much more in line with our day-to-day processing.  This particular problem is further complicated by the hypothesized lack of access to the source database, which may be completely expected and required, but represents a technical hurdle nonetheless.  Potential solutions include, but are not limited to: replication, database auditing or Change Data Capture, consumption of a file feed, and use of Kinesis Streams or something similar.  To keep this blog post a reasonable length, I’ll focus on replication.

Database replication is an interesting topic.  This particular discussion focuses on achieving replication within a Microsoft SQL Server environment, but similar (if perhaps not exactly the same) options exist within the other major Relational Database Management Systems.  With replication, one or more databases are kept as copies of some source database.  SQL Server provides three different methods for achieving this: Transactional Replication, Merge Replication, and Snapshot Replication.  Each form of replication solves a different replication problem.

With Transactional Replication, we have a publisher (source) database and one or more subscriber databases.  The publisher pushes the transactions (changes) out to the subscriber(s), and through this process each subscriber is kept up to date with the publisher, so we end up with identical databases.  The publisher in this scenario is the only database that really experiences changes; the others are, or should be, read-only copies.  Since each change is moved to the subscriber individually, the system on the subscriber’s end can be configured to perform custom auditing of the changes to the data, allowing each change to be captured and acted on.  Transactional replication comes with a caveat, however: every table involved must have a primary key (a column, or set of columns, whose values uniquely identify each record).
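As a rough sketch of what setting this up looks like in practice, the following T-SQL (run on the publisher, with hypothetical database, publication, server, and table names) enables the database for publishing, creates a transactional publication with a single article, and points a push subscription at the new subscriber.  A real setup also requires a configured Distributor and running Snapshot and Log Reader Agents, which are omitted here.

```sql
-- Run on the publisher instance.  SourceDB, SourcePub, SUBSCRIBERSERVER,
-- SubscriberDB, and dbo.Orders are hypothetical names for illustration.
USE SourceDB;
GO

-- Enable the database for transactional publishing.
EXEC sp_replicationdboption
    @dbname = N'SourceDB',
    @optname = N'publish',
    @value = N'true';

-- Create a transactional publication that replicates changes continuously.
EXEC sp_addpublication
    @publication = N'SourcePub',
    @repl_freq = N'continuous';

-- Add a table as an article; this fails if dbo.Orders lacks a primary key.
EXEC sp_addarticle
    @publication = N'SourcePub',
    @article = N'Orders',
    @source_owner = N'dbo',
    @source_object = N'Orders';

-- Create a push subscription targeting the new subscriber database.
EXEC sp_addsubscription
    @publication = N'SourcePub',
    @subscriber = N'SUBSCRIBERSERVER',
    @destination_db = N'SubscriberDB',
    @subscription_type = N'Push';
```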

The scenario with Merge Replication is very different, and you’ll soon see why it’s not relevant here.  With Merge Replication, there is one main database and one or more satellite subscriber databases.  Both the main database and the satellites may change; none of the databases are read-only.  A change on any database in the topology replicates to the others, and in this way all of the databases are brought into sync with each other.  In our scenario, data should only ever flow one way, out of the source, so Merge Replication is not an appropriate solution.

With Snapshot Replication, a snapshot is taken of the current state of the publisher database and distributed to the subscribers, which then synchronize from it.  Snapshot replication is actually also the first step of Transactional Replication and Merge Replication; it’s how the subscriber databases are initialized.  Unlike transactional replication, which streams each change as it occurs, this method is better suited to data that changes infrequently, or in large, non-trivial batches where replaying individual transactions would be wasteful.
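For comparison with the transactional case, a snapshot publication differs mainly in its replication frequency and in scheduling the Snapshot Agent to regenerate the snapshot periodically.  A minimal sketch, again with hypothetical names:

```sql
-- Run on the publisher; SourceDB and NightlySnapshotPub are hypothetical.
USE SourceDB;
GO

-- A snapshot publication: subscribers are refreshed from full snapshots
-- rather than from a continuous stream of transactions.
EXEC sp_addpublication
    @publication = N'NightlySnapshotPub',
    @repl_freq = N'snapshot';

-- Schedule the Snapshot Agent to rebuild the snapshot daily at 02:00.
EXEC sp_addpublication_snapshot
    @publication = N'NightlySnapshotPub',
    @frequency_type = 4,               -- daily
    @frequency_interval = 1,           -- every 1 day
    @active_start_time_of_day = 20000; -- 02:00:00 (HHMMSS)
```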

Therefore, for this particular problem, if we focus on replication-oriented solutions, transactional replication is the natural choice.  We can set up transactional replication between the original database and a new subscriber database, and then make modifications on the subscriber database to trigger a transfer to the warehouse whenever the entities we want to capture history on change.  The transfers then happen automatically, history is captured and moved to where it needs to be, and this historical data can be utilized in reporting solutions.
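As a minimal sketch of the subscriber-side piece, assuming a hypothetical replicated table dbo.Orders and a history table dbo.OrdersHistory, a DML trigger like the following could capture every replicated change.  One detail worth noting: the trigger must not be created with the NOT FOR REPLICATION option, or SQL Server will skip it when the Distribution Agent applies the replicated changes.

```sql
-- Run on the subscriber database; dbo.Orders and dbo.OrdersHistory are
-- hypothetical names for illustration.
CREATE TABLE dbo.OrdersHistory (
    OrderId     INT          NOT NULL,
    Status      NVARCHAR(20) NULL,
    ChangeType  CHAR(1)      NOT NULL,  -- 'I'nsert, 'U'pdate, or 'D'elete
    CapturedAt  DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

-- Fires for changes applied by the replication agent as well, because it
-- is deliberately NOT marked NOT FOR REPLICATION.
CREATE TRIGGER dbo.trg_Orders_History
ON dbo.Orders
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Rows in "inserted" are new rows or the new versions of updated rows.
    INSERT INTO dbo.OrdersHistory (OrderId, Status, ChangeType)
    SELECT i.OrderId,
           i.Status,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 'U' ELSE 'I' END
    FROM inserted AS i;

    -- Rows only in "deleted" were removed.
    INSERT INTO dbo.OrdersHistory (OrderId, Status, ChangeType)
    SELECT d.OrderId, d.Status, 'D'
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted);
END;
GO
```

From the history table, a scheduled job or ETL process can then move the captured changes into the warehouse on whatever cadence reporting requires.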

Replication is just one possible way to address the capture of historical data and may, for your specific situation, be a suboptimal solution.  Data-driven solutions aren’t really a one-and-done kind of thing.  They require continued intellectual investment and periodic reevaluation, especially when requirements change, new requirements are discovered, or a better methodology is devised.  The other options mentioned at the beginning will be explored further in a later post.

Sources:

Amazon Kinesis Documentation: http://docs.aws.amazon.com/streams/latest/dev/working-with-kinesis.html

Creating a Successful Auditing Strategy for your SQL Server Databases: http://www.sqlshack.com/creating-successful-auditing-strategy-sql-server-databases/

Merge Replication: https://msdn.microsoft.com/en-us/library/ms152746.aspx

Snapshot Replication: https://msdn.microsoft.com/en-us/library/ms151832.aspx

Specify Synchronization Schedule: https://msdn.microsoft.com/en-us/library/ms151702.aspx

Synchronize Data: https://msdn.microsoft.com/en-us/library/ms151793.aspx

Synchronize a Push Subscription: https://msdn.microsoft.com/en-us/library/ms151169.aspx

Transactional Replication: https://msdn.microsoft.com/en-us/library/ms151176.aspx