A friend on twitter posted this great question.
https://twitter.com/SQLKohai/status/1129018263491502081
We know how this was done in ancient times – back up from prod. So what’s the take on this approach?
Looking at the downside, you might say, “Well, I’ll be very clever and write a package that does all the
data masking and obfuscation on the fly. Then it’s just a push of a button and Test is Prod! Poof!” The problem, here, is that the package(s) will need to refresh schemas and mappings before every execution. It’s a mess to maintain and the results are unpredictable. In one such case, I saw a developer raise the error tolerance above 1000 errors before the package would finally give up and die. The project was eventually abandoned because the quality of the data was so poor.
The next logical question to ask is “what on earth are we testing anyway?” If we set up the test environment to be production-like as described above, then its best use is for exploratory testing, load tests and integration tests. Exploratory tests do not require a fresh version of the data, it just needs to be realistic so that reports can be run and users can interact with the data. There’s another way to get here, something akin to A-B testing.
Load testing is tricky because you need existing data, as well as, new data to be added and updated in the database. If you’re dealing with a lot of data, which we mostly are nowadays, then we need to do much better than loading a file into an empty table. The other caveat is that this is most important at the beginning of the project, when the processing performance is more of an unknown. This means that we can Test in Prod to start and ongoing for incremental changes.
From what we outlined so far, the test environment may or may not be very useful. So let’s back up a step and talk Dev. The common way of doing development and testing is on a centralized database. What’s my take on this approach?
So, what is Dev good for if we can’t really change the data and use it for unit testing? We can use it for baseline load tests and exploratory testing by the developers (as needed).
So what is unit testing anyway?
Unit tests:
Unit tests are the lifeblood of Continuous Integration (CI) / Continuous Delivery (CD). When we are discussing data projects, CI and CD are a Reese’s Peanut Butter cup. You start with a little chocolate, but then you need some peanut butter and then you can finish it off with more chocolate. You can build your code, but you cannot test it until it’s actually deployed someplace. That someplace needs to be a place that no developer ever goes. It’s not the Dev server, it’s just a place where we can run all the unit tests and make sure that once all the code is checked in, it still is not broken.
Unit tests in CI provide immediate feedback to the development team if there is any kind of problem. They make iterative changes a trivial thing, because we can have the confidence that we didn’t make the code any worse.
If you don’t have unit tests today, buy my book, Unit testing for dummies! Nah, just kidding, but I did start a project on github devoted to unit testing SQL, SSAS and SSIS packages. You can find out more here: https://www.nuget.org/packages/SqlTest/ and here: https://github.com/MarkWojciechowicz/SqlTest/wiki. tSQLt ( https://tsqlt.org/ ) is also very good, if you are just testing database code.
Getting an established regression testing framework is much easier than you think. You only need to write the test – figure out the setup, execute the code and check the results. Put the result into your assert. Your going in assumption is that the code works, so anything that is output now, is correct.
A-B testing, traditionally, is having two versions of prod. The current version and the beta version, to which you impose these changes on different groups and see how things shake out. A similar concept could be applied for many other tests. In SSRS or Powerbi, create a space as a staging environment which points to prod. For SSAS, you can stand up a stage instance of a tabular model and use prod as the data source. There are challenges, of course, like if the reports depend on database code changes as well, but you can get creative with a migration process that uses an alternate schema, for example.
After all our testing, no matter what, we eventually test in prod. I’d rather do this purposefully, like in A-B testing. But, regardless of choice, CD gets you out of a pinch if you’ve screwed things up. Being able to revert a small change or quickly add something new is the hallmark of modern development. Any less and we’re tooling about with prod in test.