Creating successful SQL indexes is an art. I was reminded of this recently when I was looking at the indexing of a set of reporting tables, with each table expected to accommodate 100 million record.
The reports are run infrequently (a couple of times a week) and though they should generate quickly, they don’t have to be instantaneous. The report data ideally should be created within 10 seconds.
The other factors to consider are how long it take to populate the reporting tables, how often the data changes and the maintenance required for the indexes.
Different strategies
The first step is to plan some different indexing strategies, which we can then test to see how they perform. They may favour data insertion or favour different types of data lookups. In my case I did the following:
- No indexing, to get a base line.
- Indexing that would help cover two of the three columns for the reporting and also help with the updating of the data.
- An additional index that would cover all of the columns in the data lookup.
- I also experimented with additional included columns on the indexes but in my case it didn’t help.
How to test
Make sure you test with the right data. This should include:
- The amount of data that is in the table should be the maximum expected volume you are likely to have and then some. In my case I was expecting around 50 million records per table, but I doubled this and used a data set of 100 million records.
- The data in the table should be realistic, or based on real (obfuscated) data. This will mean that you can correctly estimate data size, index sizes and lookup performance.
- Try to accurately model how the data is inserted and updated in the table, so you can get an idea of how the indexes will fragment over time. In my case I was expecting up to 120,000 new records a day, but around 500,0000 records could also be updated. These records would be added/updated as a batch overnight.
To test this I created a console app that would insert 120,000 records and update 500,0000 in a batch. Using the console app I modelled the inserting and updating of records for several years worth of data. It is also important to monitor how quickly these inserts take over the course of the testing to see if there is any degradation in performance as we get more data in the database.
As well as testing the inserting of the data you will want to test the expected query lookups on the data as the data volume grows.
What to look for?
When measuring an index I like to record the following at different data volumes:
- Query lookup performance.
- How quick it is to insert data.
- How quick it is to update data.
- How quickly the indexes become fragmented, and how this affects lookups.
- How quick it is to rebuild or reorganize the indexes.
- The size of the indexes.
- Also keep an eye on the servers memory and CPU usage through out the testing.
Deciding what is important
Once you have the facts on how well you indexes performs with different data you can decide which elements are most important to your application. In my case I decided to have the index that covered just two of the three columns as:
- The difference in the lookups was negligible. With the chosen index it took on average 2.4 seconds. With the full index was 0.1 seconds quicker over 100 million records.
- The inserting of the data was a lot quicker as there was a smaller index to maintain.
- The index hardly became fragmented. After modelling several years of data, the index was only fragmented by 0.3%. With the other index it became fragmented quickly and would require a lot more maintenance.
- The size of the index was a lot smaller, and therefore took up less disk space and used less memory.
So I was most comfortable with an index that was slightly slower for lookups (though well within my requirements) but performed better on inserts and updates, and the maintenance overhead was a lot lower.
I was surprised that the index that covered all the columns was only slightly quicker. But indexing can often go against your expectations so it’s always best to test your assumptions. Choosing the correct solution is sometimes a compromise between different factors and can definitely be an art form.
As a summary the consideration of an index should include:
- Query lookup performance.
- How quick it is to insert or update data in the table.
- How often the data in the table changes.
- How often the index needs to be rebuilt or reorganized.
- How long it takes to rebuild or reorganize the indexes.
- The size of the indexes.
- As your data changes over time your indexing strategies may also have to adapt.