Data performance challenges are a universal part of application development. No technology, platform, or development methodology can allow us to escape completely from them. As data gets larger, the ways to inadvertently abuse it become more numerous.
Solving these problems is often seen as a reactive task. An application performs slowly, someone complains, and a developer or administrator needs to research, find the source of the latency, and fix it (somehow). Often, though, the solution is one that could have been implemented proactively as part of the original release of the offending code.
These alternatives can be seen as choices we make in application development every day:
1. Do it right the first time.
2. Do it quickly the first time.
This challenge is often addressed sarcastically: you can have it fast, or you can have it right. While the framing is silly, it represents real organizational challenges, decisions, and decision-making processes that are not silly at all.
I can think of a seemingly endless list of mistakes made over the years that were the direct result of speed over precision. While not all errors can be avoided in life, there is value in preventing as many as is reasonably possible up-front. This also has the bonus of improving our sleep schedules at those times when bad things happen. Therefore, striking a comfortable balance between design and architecture and technical debt is a valuable skill in software development.
Here are some examples and how they impacted real projects, software, and people. The names and details have been changed, but the mistakes illustrated have been made many times by many people.
Data Retention, Who Needs It?
Creating a new table is a common task. Far less common, though, are the questions we should ask ourselves when creating a new data structure.
Imagine a log table that will accept application log data on a regular basis. The table is created as an afterthought with no additional considerations as to how it will be used in the future:
CREATE TABLE dbo.application_log (
    log_id INT NOT NULL IDENTITY(1,1),
    log_time DATETIME2(3) NOT NULL,
    log_title VARCHAR(100) NOT NULL,
    log_message VARCHAR(MAX) NOT NULL,
    is_error BIT NOT NULL,
    error_detail VARCHAR(MAX) NULL);
The application begins to run for the first time and everything is great! Six months later, though, developers complain that application logging is slow. In addition, the application database has been growing unusually large, consuming an unexpected amount of storage and backup resources.
What was forgotten? Retention! When creating new data, determine a retention policy for it and ensure that computing resources can handle the associated data growth over time.
A retention period for data could be a week, a month, a year, or forever, depending on how quickly it grows and what it is used for. In the log example above, developers likely would have assigned a retention period of 1 week (or maybe a month) to the data and cleaned up any older data during a low-volume time.
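In its simplest form, such a cleanup could be a scheduled DELETE; a minimal sketch, assuming a one-week retention window:

```sql
-- Remove log rows older than the retention window (one week assumed here).
DELETE FROM dbo.application_log
WHERE log_time < DATEADD(DAY, -7, SYSDATETIME());
```

Scheduling this via a SQL Server Agent job during a low-volume window would complete the picture.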
OK, problem solved! A retention process is created that cleans up data older than a week each evening. The cleanup process takes an exceptionally long time, though. So long, in fact, that it is stopped and investigated. What else was forgotten? Indexes! The table above has no clustered or non-clustered indexes, so each cleanup of old data had to scan the entire table. In addition to being slow, that scan blocks other processes that try to log to the table. The following adds a clustered primary key and a supporting non-clustered index on log_time:
CREATE TABLE dbo.application_log (
    log_id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_application_log PRIMARY KEY CLUSTERED,
    log_time DATETIME2(3) NOT NULL,
    log_title VARCHAR(100) NOT NULL,
    log_message VARCHAR(MAX) NOT NULL,
    is_error BIT NOT NULL,
    error_detail VARCHAR(MAX) NULL);

CREATE NONCLUSTERED INDEX IX_application_log_log_time
ON dbo.application_log (log_time);
If this table is highly transactional, even during less busy times, then the deletions made as part of retention can be batched. Batching keeps each transaction small and reduces contention with other transactional processes running at the same time.
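A batched version of the cleanup might look like this sketch (the batch size of 5,000 is an assumption to be tuned for the workload):

```sql
-- Delete old rows in small chunks so each transaction stays short,
-- reducing lock duration and blocking of concurrent log writes.
DECLARE @batch_size INT = 5000;
DECLARE @rows_deleted INT = 1;

WHILE @rows_deleted > 0
BEGIN
    DELETE TOP (@batch_size)
    FROM dbo.application_log
    WHERE log_time < DATEADD(DAY, -7, SYSDATETIME());

    SET @rows_deleted = @@ROWCOUNT;
END;
```

The loop exits once a pass deletes zero rows; the supporting index on log_time keeps each pass a seek rather than a scan.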
The Data Type Follies
Data structures are easy to create and hard to change. Once applications, reports, APIs, and users rely on a specific database schema, changing it becomes challenging, and the more time that passes, the more work is needed. Choosing the best data types on day one can save significant work later and, as a bonus, help prevent bad data.
Consider the following table:
CREATE TABLE dbo.sales_transaction (
    transaction_id INT NOT NULL
        CONSTRAINT PK_sales_transaction PRIMARY KEY CLUSTERED,
    product_id INT NOT NULL,
    salesperson_id INT NOT NULL,
    transaction_amount DECIMAL(18,4) NOT NULL,
    transaction_time VARCHAR(25) NOT NULL,
    shipping_date VARCHAR(25) NULL);

ALTER TABLE dbo.sales_transaction ADD CONSTRAINT FK_sales_transaction_product
    FOREIGN KEY (product_id) REFERENCES dbo.product (product_id);

ALTER TABLE dbo.sales_transaction ADD CONSTRAINT FK_sales_transaction_person
    FOREIGN KEY (salesperson_id) REFERENCES dbo.person (person_id);
After our previous lesson, I made sure to include a clustered primary key, some foreign keys, and an archival process for any transactions over 2 years old. Things are going great until a developer reports errors in production. A closer look reveals a row with a transaction_time of September 31st. That is not a real date, even during the longest and busiest of months! Storing the date as a string seemed reasonable, and it made saving the data from the application quite easy. The right choice, though, was a data type that represents a date and time. Then, when September 31st was entered, the insert would have thrown an error rather than creating bad data.
A few days of work later, a change is deployed and the table now contains a DATETIME:
CREATE TABLE dbo.sales_transaction (
    transaction_id INT NOT NULL
        CONSTRAINT PK_sales_transaction PRIMARY KEY CLUSTERED,
    product_id INT NOT NULL,
    salesperson_id INT NOT NULL,
    transaction_amount DECIMAL(18,4) NOT NULL,
    transaction_time DATETIME NOT NULL,
    shipping_date DATETIME NULL);
For good measure, I also change the shipping_date column to a DATETIME so similar problems cannot happen there. As a bonus, performance on the table improves: DATETIME values from the application are no longer implicitly converted for comparison against VARCHAR values in the table, which allows a non-clustered index on transaction_time to yield index seeks instead of index scans.
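For instance, once transaction_time is a DATETIME, a range search like the following can seek on a supporting index (the index name and date values here are illustrative):

```sql
-- Assumes a supporting index such as:
-- CREATE NONCLUSTERED INDEX IX_sales_transaction_transaction_time
--     ON dbo.sales_transaction (transaction_time);
DECLARE @start_time DATETIME = '2024-03-01';
DECLARE @end_time DATETIME = '2024-04-01';

-- Sargable range predicate: no conversion wraps the column,
-- so the optimizer can perform an index seek.
SELECT transaction_id, transaction_amount
FROM dbo.sales_transaction
WHERE transaction_time >= @start_time
  AND transaction_time < @end_time;
```

With the old VARCHAR column, the same comparison forced a conversion on every row, which turned the seek into a scan.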
A month later, though, another error crops up related to the shipping date. Some investigation reveals shipping_date values that contain a time component. That is not right! The shipping date should be a DATE with no time attached, but because the data type allowed it, some code somewhere inserted one. While the fix itself was easy (truncate the time portion and alter the column to be a DATE), it took development time and a deployment, which meant a late night I would have preferred to spend doing anything else. The new version of the table looks like this:
CREATE TABLE dbo.sales_transaction (
    transaction_id INT NOT NULL
        CONSTRAINT PK_sales_transaction PRIMARY KEY CLUSTERED,
    product_id INT NOT NULL,
    salesperson_id INT NOT NULL,
    transaction_amount DECIMAL(18,4) NOT NULL,
    transaction_time DATETIME NOT NULL,
    shipping_date DATE NULL);
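Getting there from the previous schema could look like this sketch: first strip the time component from existing rows, then narrow the column in place.

```sql
-- Truncate the time portion of existing data while the column
-- is still a DATETIME (casting to DATE drops the time component).
UPDATE dbo.sales_transaction
SET shipping_date = CAST(shipping_date AS DATE)
WHERE shipping_date IS NOT NULL;

-- Narrow the column so a time component can never sneak back in.
ALTER TABLE dbo.sales_transaction
ALTER COLUMN shipping_date DATE NULL;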
This time, I get almost a year of peace and quiet on this table, until one day there is an application outage when all sales stop saving their data to the database. Checking the error logs and testing reveals an arithmetic overflow error on insert. It turns out that this table had an exceptionally high volume of transactions and after a year hit its 2,147,483,647th sales transaction, the maximum value an INT can hold. When transaction_id #2,147,483,648 was inserted, the overflow error was the result. No one had told me that this table would see billions of transactions! Maybe I should have asked?
The problem was worked around by setting the application to use negative numbers as transaction IDs. This bought some time, but another long night lay ahead of me: creating a new table with a BIGINT transaction_id column, backfilling it with the existing data, and then swapping the tables so that the new one became the active table, complete with historical data.
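A sketch of that swap might look like the following (the _new/_old naming is an assumption, and in practice the backfill would likely be batched and the swap carefully coordinated with the application):

```sql
-- Replacement table with a BIGINT key; other columns unchanged.
CREATE TABLE dbo.sales_transaction_new (
    transaction_id BIGINT NOT NULL
        CONSTRAINT PK_sales_transaction_new PRIMARY KEY CLUSTERED,
    product_id INT NOT NULL,
    salesperson_id INT NOT NULL,
    transaction_amount DECIMAL(18,4) NOT NULL,
    transaction_time DATETIME NOT NULL,
    shipping_date DATE NULL);

-- Backfill historical data (INT values widen to BIGINT implicitly).
INSERT INTO dbo.sales_transaction_new
    (transaction_id, product_id, salesperson_id,
     transaction_amount, transaction_time, shipping_date)
SELECT transaction_id, product_id, salesperson_id,
       transaction_amount, transaction_time, shipping_date
FROM dbo.sales_transaction;

-- Swap names so the BIGINT table becomes the active one.
BEGIN TRANSACTION;
EXEC sp_rename 'dbo.sales_transaction', 'sales_transaction_old';
EXEC sp_rename 'dbo.sales_transaction_new', 'sales_transaction';
COMMIT TRANSACTION;
```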
The lesson of this escapade is that choosing data types is a significant decision. They reflect data size, content, and longevity. Knowing up-front how much data will be created, how it will be used, and what it represents allows for smart data types to be chosen immediately. This helps prevent bad data and avoids painful emergencies that my future self would prefer to avoid.
To NULL or Not to NULL, that is the Question
This story starts with a simple table:
CREATE TABLE dbo.person (
    person_id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_person PRIMARY KEY CLUSTERED,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    email_address VARCHAR(100) NOT NULL,
    date_of_birth DATE NOT NULL);
After its release, a question is received regarding what to do if a person does not provide a date of birth. Following some discussion, it is decided that an unknown date of birth can be represented by NULL. The change is made:
CREATE TABLE dbo.person (
    person_id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_person PRIMARY KEY CLUSTERED,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    email_address VARCHAR(100) NOT NULL,
    date_of_birth DATE NULL);
The change goes smoothly, but the next day a ticket is received indicating that searches based on date of birth no longer work. Apparently, the application has no graceful way to deal with NULL, so the developers ask instead for the column to be made NOT NULL with a dummy default for unknown values. Take 3:
CREATE TABLE dbo.person (
    person_id INT NOT NULL IDENTITY(1,1)
        CONSTRAINT PK_person PRIMARY KEY CLUSTERED,
    first_name VARCHAR(100) NOT NULL,
    last_name VARCHAR(100) NOT NULL,
    email_address VARCHAR(100) NOT NULL,
    date_of_birth DATE NOT NULL
        CONSTRAINT DF_person_dob DEFAULT ('1/1/1900'));
Life goes on until one day an annoyed analyst asks why there are so many people in the system who are exactly 121 years old. Some head-scratching and review of timing reveals that the default date of birth was polluting any date calculations that happened to use date of birth. In addition to absurd ages, the system was also sending out birthday offers to everyone with the dummy date of birth, wishing them a happy birthday on January 1st.
This sequence of events illustrates how a simple problem can result in real-world awkwardness. The easiest solution is to make the date of birth a required field at all levels of the application, so that neither dummy data nor NULL is needed. Alternatively, if date of birth is truly optional, then a handful of legitimate solutions exist, including:
1. Make date_of_birth NULL and ensure that this is documented and handled effectively.
2. Normalize date_of_birth into a new optional table. This adds complexity and is not my preferred solution, but is a way to normalize, avoid NULL, and ensure data quality.
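The second option might look like this sketch (table and constraint names are assumptions): a row exists only for people whose date of birth is actually known, so neither NULL nor a dummy value ever appears.

```sql
-- Optional attribute normalized into its own table; the column itself
-- can stay NOT NULL because absence is modeled by a missing row.
CREATE TABLE dbo.person_date_of_birth (
    person_id INT NOT NULL
        CONSTRAINT PK_person_date_of_birth PRIMARY KEY CLUSTERED
        CONSTRAINT FK_person_date_of_birth_person
            REFERENCES dbo.person (person_id),
    date_of_birth DATE NOT NULL);
```

Queries that need the date of birth then use a LEFT JOIN, and people without one simply fall out of any age calculation.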
As always, before making decisions based on data, filter accordingly. If a data element is optional, then decisions made with that column need to properly omit or take into account scenarios where data is not provided.
Where Does This Lead Us Next?
The moral of this short series of stories, code, and vague attempts at humor is to remind us that database architecture and performance optimization go hand-in-hand. Both address the same challenges in different ways and at different times within a software design life cycle.
Asking (and answering) good questions up-front can allow for better data architecture and remove the need for dramatic bug-fixes and changes later on. While not all problems can be proactively solved, a keen eye for detail can prevent many future problems ranging from inconveniences all the way to full-scale disasters.
As this list of design questions grows, so does our experience in seeking the answers and turning that information into well-architected database objects and code. We all have crazy stories of how bad data choices led to messy clean-up operations, and the tales we share over a drink may very well be the motivation and foundation for future good database architecture decisions.