Thursday, September 29, 2016

Can a Purely State Based Database Schema Migration Be Trusted? by Robert Lucente

A really good article on the tradeoffs between state based vs migration based schema transformation is titled Critiquing two different approaches to delivering databases: Migrations vs state by Alex Yates on 6/18/2015. The blog Database Schema Migration by S. Lott on 9/27/2016 ends with "It's all procedural migration. I'm not [sure] declarative ("state") tools can be trusted beyond discerning the changes and suggesting a possible migration."

Let me start by examining the statement of the problem: state based vs migration based schema transformation. The problem statement implies that the solution is an either or type of thing. Why not both?

As opposed to speaking in generalities, let me pick a specific problem which is often encountered in real systems which demonstrates the core issue. Also, let me pick a particular tool set to execute on this particular problem.

The specific problem involves
  1. Adding some domain table and its associated data.
    1. Adding a not null / mandatory column to some table that already has data.
      1. Creating a foreign key between the domain table and the new mandatory column.
        Below is a picture for the above words. The "stuff" in color are the tables and columns being added.


        In a migration based schema approach the following steps would have to be performed
        1. SomeTable exists with data.
          1. A new domain table (SomeDomain) gets created.
            1. The new domain table (SomeDomain) gets populated with data.
              1. A new nullable / not mandatory column SomeDomainUuid is added to SomeTable.
                1. Column SomeDomainUuid in SomeTable gets populated with data.
                  1. Column SomeDomainUuid in SomeTable is made not null / mandatory.
                    1. A foreign key is created between the two tables.
                      Notice that the above is very labor intensive and involves 7 steps. Software should be able to figure out all the steps and their sequences except for the following
                      1. The new domain table (SomeDomain) gets populated with data.
                        1. Column SomeDomainUuid in SomeTable gets populated with data.
                          The key thing to notice is that the steps that can't be automated involve data. There is no way for the software to know about the specifics of the data.
                            Now that we have defined a specific problem, let's execute on solving the problem using a specific tool set. I am going to use the Visual Studio SQL Server Database Project concept with SQL Server as the target database. Via a series of clicks, the end state of the database is specified in the "Visual Studio SQL Server Database Project".
                              Next we write a script (Populate_dbo_SomeDomain.sql) to populate SomeDomain table with data.
                              
                              
                              INSERT INTO dbo.SomeDomain
                                  VALUES 
                                  (newid(), 'Fred'), (newid(), 'Barney');
                              
                              
                              
                              The second step is to write a script (Update_dbo_Deal_AdvertiserTypeUuid.sql) to populate SomeDomainUuid column in SomeTable.
                              
                              
                              UPDATE [dbo].[SomeTabe] 
                              SET    [SomeDomainUuid] = (SELECT SomeDomainUuid
                                                        FROM   [dbo].[SomeDomain] 
                                                        WHERE  Name = 'Fred');
                              
                              
                              The last preparation step is to write a script (Script.PostDeployment.sql) to run the above 2 scripts after the state based change has happened.
                              
                              
                              :r ".\Populate_dbo_SomeDomain.sql"
                              go
                              
                              :r ".\Update_dbo_Deal_AdvertiserTypeUuid.sql"
                              go
                              
                              
                              
                              Now that the desired end state has been specified as well as the data manipulation scripts written, it is time to modify a database. The Microsoft terminology for this is "publishing the database project". There will be an issue because the state changes will be made and then the Script.PostDeployment.sql script will be run. In between the state changes and the script being run, there will need to be data in the SomeDomainUuid column in the table SomeTable. This issue is addressed by using the GenerateSmartDefaults option when publishing the database project.

                              Let's summarize what this combination of state based and migration based schema transformation has allowed us to do. We were able to take 7 steps and reduce it down to 2 steps. These 2 steps couldn't be automated anyways because they involved data. These are the pros. The con is that have to be familiar with the framework and select the GenerateSmartDefaults option out of the 60 plus available options.

                              In conclusion, a purely state based approach can't work because of data. There is no way for the software to know how to do data migrations. Only humans know the correct way to do data migrations. In our example, there is no way for the software to know whether or not the new column SomeDomainUuid is to be initially populated with "Fred" or "Barney". This is a long winded and nice way of saying that a purely state based database schema migration can't be trusted. However, the combination of state based and migration based can truly improve productivity.