#2 SQL Tips - A Numbers table (Part 1)
The longer I’m a SQL guy, the more convinced I am that many of the worlds problems can be solved with a numbers table. Well, SQL problems anyway. A numbers table is really nothing more than an ordered table of sequential integers from 0 or 1 to some useful though not huge number. Maybe 8000 or 1Million, maybe more depending on your uses. This table will serve as a counter just like looping through a WHILE loop or incrementing some counter variable or identity column, etc. The difference between those counters and a numbers table is that the numbers table is pre-generated and because SQL was born to work with set-based/table-based data really efficiently (vs generating counters and looping and doing string manipulation like true CLR/Programming languages do), we are able to realize incredible speed gains when we leverage that fact as much as possible.
So, you may say to me “OK, John…now I’ve got a big ass list of numbers...um…so now what?”
Well, the thinking goes like this:
Do I need to parse a string?
Do I need to remove dupes from a string?
Do I need to extract certain items from a string?
Do I need to generate a range of dates?
Do I need to generate a range of times?
Do I need to generate a list of IP addresses?
Do I need to find gaps in identity columns?
Do I need to do anything that steps or loops or adds or otherwise moves sequentially forward or backward through anything?
If you’ve said “yes” to any of those, then you maybe should create a numbers table on any and all of your SQL servers. Now, of course I understand that it’s fairly easy to generate a sequential list of numbers at runtime in a CTE or otherwise, but in many (most?) cases a static numbers table will perform better because the table is likely already in memory and ordered/indexed and there’s no need to add the processing and memory requirements of number generation—light as they may be—to runtime.
CREATE THE NUMBERS TABLE
Obviously, before we can use it, we need to create the table. So…that brings up a few questions:
How big should the numbers table be?
Well, this is up to you. What do you want to do with it? How big are the strings you’re going to parse? How big is the range you want to generate? How high does your loop/counter/incrementer need to go? I guess what I’m saying is you need to know how you’re going to use it and the attributes of the data that will be worked on. In my case, I use a numbers table that goes from 0 to 1Million. I work for a large company that has over 360K desktops so when we muck with data, it’s usually on a pretty large scale. 6M seems to fit my needs nicely. At first I created a numbers table that literally had one of every single integer in it. 2billion and some. First of all, it took hours to fill the table, second it took up gigabytes of space, third I never found myself using more than 6M rows and the table was never all in memory because it was too big. This made me think hard about what I was mainly using it for. I was using it for medium to large sized strings (like a delimited list of computer names…15 characters * 360,000 computers = 5.4Million characters, plus some slack for delimiters).
I realize that most people aren’t in a company as large as this and maybe 1Million is enough, or 10,000 or even 8,000. If all you’ll find yourself using it for is stepping through a VARCHAR(8000) or NVARCHAR(4000) then 8,000 integers is all you may need. And if you outstrip the number of rows in the table you can either just add more rows to it or cross join it with something to generate some more. Anyway, for the sake of this article I’m going to assume 1M rows.
Where should I create the numbers table?
For some reason, I like to put my numbers table in the MASTER database. This way it’s always in the same spot on all my SQL servers. It really doesn’t matter where you put it though…in fact, I know plenty of people, even in my organization who don’t have access to create tables in the MASTER database so just put it where you can. It’ll work the same no matter where it is. For the sake of this article I’m going to call it [Master].[dbo].[Numbers], or I like to shorthand that MASTER..Numbers. The DBO in this case is implied and doesn’t need to be typed.
What’s the best way to populate the numbers table?
Since this is a one-time deal, this is really not that important. If it takes an hour, who cares? Well, personally I like exploring the best ways of doing things instead of just the easiest. By “best” I really mean the fastest. I think it’s important to know what the fastest way is to do things because you may find out things you didn’t know before and can apply those principles to future problems.
So obviously there are a ton of ways to insert 1Million rows into a table, but after exploring this issue I found this link which shows a few different ways and the amount of time taken. So, based on that I think the best way is to use Itzik’s CTE and dump them in that way.
CODE 1: CREATE NUMBERS TABLE & FILL IT WITH EVERY INT BETWEEN 0 AND 1Million
CREATE TABLE dbo.Numbers(
Number INT PRIMARY KEY
;WITH E00(N) AS (SELECT 1 UNION ALL SELECT 1),
E02(N) AS (SELECT 1 FROM E00 a, E00 b),
E04(N) AS (SELECT 1 FROM E02 a, E02 b),
E08(N) AS (SELECT 1 FROM E04 a, E04 b),
E16(N) AS (SELECT 1 FROM E08 a, E08 b),
E32(N) AS (SELECT 1 FROM E16 a, E16 b),
cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E32)
INSERT INTO master..Numbers
SELECT 0 AS Number --start at 0
SELECT CAST(N AS INT) Number
WHERE N <= 1000000; --go to 1M
If you understand SQL, this code is kind of self-explanatory, but let me run through it quickly to make sure everyone gets what’s going on here…
- First, we create a table in the MASTER database, in the DBO schema called Numbers. That table has one column called Number. It’s an INTEGER that’s also the primary key for the table. That means it’s physically ordered/indexed for that column to be fast, fast, fast for lookups and joins.
- Next there’s a WITH statement which is a CTE containing a bunch of fake tables essentially.
Starting at the top with E00, it’s a table containing two rows with a 1 in both rows.
The next is E02 which is E00 cross joined with itself E00. This will make a Cartesian product (all the rows of the first table multiplied by all the rows of the second…2 x 2) and now there’s 4 rows each with a 1 in them.
Then there’s E04 which cross joins E02 with E02 to give 4x4 = 16 rows with a 1 in them.
E08 cross joins E04 with E04 to get 16x16 = 256 rows with a 1.
E16 cross joins E08 with E08 to get 256x256 = 65536 rows with a 1.
E32 cross joins E16 with E16 to give 65536x65536 = 4294967296 rows with a 1.
Lastly cteTally simply uses the ROW_NUMBER() function over all the rows to have SQL internally build a list of consecutive integers. Because all the rows have the exact same number (1) the optimizer won’t bother sorting them so this amounts to SQL generating the numbers for us.
- So far, nothing has actually happened yet…we’ve just been passing the rows from E00 down through E02 and so on, building the fake rows with consecutive cross joins so that we have enough for the ROW_NUMBER() function to operate on to make a big bunch of numbers.
- The INSERT INTO master..Numbers simply tells us we’re putting the ROW_NUMBERs from cteTally into that table.
- The SELECT 0 AS Number line preprends a 0 to the first row since the ROW_NUMBER function starts at one. This part is optional. Some people don’t like having a zero in their number table, some people do. I currently use it at times. If I don’t want to use it, I need only use a WHERE clause that says to use numbers between 1 AND whatever. The UNION ALL is obviously the part that takes the 0 and prepends it to the next select statement.
- The next SELECT statement is the piece that selects all those ROW_NUMBER() generated numbers from cteTally and puts them into the master..Numbers table. Notice the WHERE clause that selects only rows <= 1Million. This keeps us from inserting 4.2Billion rows into that Number column. Also notice the CAST statement. Since the ROW_NUMBER function was technically operating over 4.2Billion rows, it thinks its output needs to be a BIGINT. We CAST it to an INT in order to cut down on the primary key size of that Number column. That means 4 storage bytes instead of 8…half the size.
SUMMARY AND NEXT STEPS
So, we’ve talked about some reasons WHY one would want a numbers table, we’ve got some code to create the numbers table, we’ve got some code to insert the numbers into said table (and learned about the fastest way to do that), and we’ve got an explanation of how the code works. That should be enough to get you started on this. In the next article, I’ll talk about ways we can put this numbers table to good use. Hope this finds some use for you as well!
- Created on .