How to create a new index on a massive SQL table - sql-server-2008

I have a massive (3,000,000,000 rows) fact table in a data warehouse star schema. The table is partitioned on the date key.
I would like to add an index on one of the foreign keys, to allow me to identify and remove childless rows in a large dimension table.
If I just issue a CREATE INDEX statement it will take forever.
Do any SQL gurus have any fancy techniques for this problem?
(SQL 2008)
--Simplified example...
CREATE TABLE FactRisk
(
    DateId int not null,
    TradeId int not null,
    Amount decimal not null
)
--I want to create this index, but the straightforward way will take forever...
CREATE NONCLUSTERED INDEX IX_FactRisk_TradeId on FactRisk (TradeId)

I have a plan...
Switch out all the daily partitions to tables
Index the now empty fact table
Index the individual partition tables
Switch all the partitions back in
Initial investigation implies that this will work. I will report back...
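In outline, the switch approach looks something like this (the simplified DDL above omits the DateId partitioning, so the partition number, staging table name and indexes here are purely illustrative; the staging table must match the fact table's structure and sit on the same filegroup as the partition it receives):
--Staging table with the same structure as the fact table
CREATE TABLE FactRisk_Staging
(
    DateId int not null,
    TradeId int not null,
    Amount decimal not null
)
--Switch one daily partition out to the staging table (repeat per partition)
ALTER TABLE FactRisk SWITCH PARTITION 1 TO FactRisk_Staging
--Index the now-empty fact table; with no rows this is near-instant
CREATE NONCLUSTERED INDEX IX_FactRisk_TradeId ON FactRisk (TradeId)
--Index the staged data to match, then switch it back in (switching in also
--needs a CHECK constraint on DateId proving the rows fit the target partition)
CREATE NONCLUSTERED INDEX IX_FactRisk_TradeId ON FactRisk_Staging (TradeId)
ALTER TABLE FactRisk_Staging SWITCH TO FactRisk PARTITION 1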

Related

Should I create an index on a SQL table column used frequently in a SELECT's WHERE clause?

I wonder whether I should add a non-clustered index to a non-unique column in a SQL 2008 R2 table.
Simplified Example:
SELECT Id, FirstName, LastName, City
FROM Customers
WHERE City = 'MyCity'
My understanding is that the primary key [Id] should be the clustered index.
Can a non-clustered index be added to the non-unique column [City]?
Is this going to improve performance, or should I not bother at all?
Thanks.
I was thinking of creating the clustered index as:
CREATE UNIQUE CLUSTERED INDEX IDX_Customers_City
ON Customers (City, Id);
or a non-clustered one, assuming there is already a clustered index on that table:
CREATE NONCLUSTERED INDEX IX_Customers_City
ON Customers (City, Id);
In reality I am dealing with a table of millions of records. The SELECT statement returns 0.1% to 5% of the records.
Generally yes - you would usually make the clustered index on the primary key.
The exception to this is when you never make lookups based on the primary key, in which case putting the clustered index on another column might be more pertinent.
You should generally add non-clustered indexes to columns that are used as foreign keys, provided there's a reasonable amount of diversity in that column, which I'll explain with an example.
The same applies to columns used in WHERE clauses, ORDER BY, etc.
Example
CREATE TABLE Gender (
    GenderId INT NOT NULL PRIMARY KEY CLUSTERED,
    Value NVARCHAR(50) NOT NULL)
INSERT Gender(GenderId, Value) VALUES (1, 'Male'), (2, 'Female')
CREATE TABLE Person (
    PersonId INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    Name NVARCHAR(50) NOT NULL,
    GenderId INT NOT NULL FOREIGN KEY REFERENCES Gender(GenderId)
)
CREATE TABLE [Order] (
    OrderId INT NOT NULL IDENTITY(1,1) PRIMARY KEY CLUSTERED,
    OrderDate DATETIME NOT NULL DEFAULT GETDATE(),
    OrderTotal DECIMAL(14,2) NOT NULL,
    OrderedByPersonId INT NOT NULL FOREIGN KEY REFERENCES Person(PersonId)
)
In this simple set of tables it would be a good idea to put an index on the OrderedByPersonId column of the Order table, as you are very likely to want to retrieve all the orders for a given person, and it is likely to have a high amount of diversity.
By a high amount of diversity (or selectiveness) I mean that if you have, say, 1000 customers, each customer is only likely to have 1 or 2 orders, so looking up all the values from the Order table with a given OrderedByPersonId will return only a very small proportion of the total records in that table.
By contrast there's not much point in putting an index on the GenderId column in the Person table, as it will have very low diversity. The query optimiser would not use such an index, and INSERT/UPDATE statements would be marginally slower because of the extra need to maintain the index.
Now to go back to your example - the answer would have to be "it depends". If you have hundreds of cities in your database then yes, it might be a good idea to index that column.
If, however, you only have 3 or 4 cities, then no - don't bother. As a guideline I might say that if the selectivity of the column is 0.9 or higher (i.e. a WHERE clause selecting a single value in the column would result in only 10% or less of the rows being returned) an index might help, but this is by no means a hard and fast figure!
Even if the column is very selective/diverse, you might not bother indexing it if queries on it are only made very infrequently.
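A rough way to gauge that diversity yourself, using the table and column from the question (this is only an illustration; the optimiser's own statistics are more detailed):
SELECT COUNT(DISTINCT City) AS distinct_cities,
       COUNT(*) AS total_rows,
       COUNT(*) * 1.0 / NULLIF(COUNT(DISTINCT City), 0) AS avg_rows_per_city
FROM Customers
--The smaller avg_rows_per_city is relative to total_rows, the more likely an
--index on City is to be worthwhile.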
One of the easiest things to do, though, is to try your queries with the execution plan displayed in SQL Server Management Studio. It will suggest indexes for you if the query optimiser thinks that they'll make a positive impact.
Hope that helps!
If you run the query frequently, or you sort by city regularly in online applications, especially if your table is dense or has a large row size, it makes sense to add an index. Too many indexes slow down your inserts and updates. The actual value can only really be evaluated once you have significant data in the table.

MySQL, Two billion rows of data, read only, performance optimisations?

I have a set of integer data. The first is the number 0 and the last is 47055833459. There are two billion of these numbers from the first to the last, and they will never change or be added to. The only insert into the MySQL table will be loading this data into it. From then on, it will only be read from.
I predict the size of the database table to be roughly 20 GB. I plan on having two columns:
id, data
Id will be the primary key, an auto-incremented unsigned INT, and data will be an unsigned BIGINT.
What will be the best way of optimising this data for read only with those two columns? I have looked at the other questions which are similar but they all take into account write speeds and ever increasing tables. The host I am using does not support MySQL partitioning so unfortunately this is not an option at the moment. If it turns out that partitioning is the only way, then I will reconsider a new host.
The table will only ever be accessed by the id column so there does not need to be an index on the data column.
So to summarise, what is the best way of handling a table with 2 billion rows with two columns, without partitioning, optimised for reads, in MySQL?
Assuming you are using InnoDB, you should simply:
CREATE TABLE T (
ID INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
DATA BIGINT UNSIGNED
);
This will effectively create one big B-Tree and nothing else, and retrieving a row by ID can be done in a single index seek [1]. Take a look at "Understanding InnoDB clustered indexes" for more info.
[1] Without a table heap access; in fact, there is no heap at all.
Define your table like so.
CREATE TABLE `lkup` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `data` BIGINT UNSIGNED NOT NULL,
    PRIMARY KEY (`id`, `data`)
);
The compound primary key will consume disk space, but will make lookups very fast; your queries can be satisfied just by reading the index (which is known as a covering index).
And, do OPTIMIZE TABLE lkup when you're finished loading your static data into it. That may take a while, but it will pay off big at runtime.
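For example, a simple lookup by id should then be satisfied entirely from that index (the literal value below is just an illustration):
SELECT data FROM lkup WHERE id = 12345;
-- EXPLAIN shows how MySQL resolves it; an access type of const/eq_ref on the
-- PRIMARY key means a direct index lookup rather than a scan.
EXPLAIN SELECT data FROM lkup WHERE id = 12345;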

MySQL searching query?

I'm new to MySQL and want to know: if I have a table with 25 columns and the first one is the "id", will MySQL scan through the whole table every time to find a particular "id"?
If you construct the query like SELECT * FROM $table_name WHERE table_id=$id; then it will not scan the whole table.
And as @dku.rajkumar says in the comment, it depends on what you want to fetch and on your query structure.
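A quick way to check this yourself is EXPLAIN (the table name and id value below are just placeholders):
EXPLAIN SELECT * FROM my_table WHERE id = 42;
-- If "key" shows PRIMARY (or another index) and "type" is const or ref, MySQL
-- is doing an index lookup; "type: ALL" would mean a full table scan.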
It may depend on the query and also on the storage engine you choose,
such as MyISAM or InnoDB.
For example:
CREATE TABLE tablename (
    id INT UNSIGNED PRIMARY KEY
) ENGINE=MyISAM;
CREATE TABLE tablename (
    id INT UNSIGNED PRIMARY KEY
) ENGINE=InnoDB;
There are differences in the way tables are stored depending on the storage engine, which is certainly reflected in how the MySQL server (mysqld) performs searches to cater to your needs.

Partitioning non-partitioned table in SQL Server 2008

I have a table which in my opinion will benefit from partitioning:
CREATE TABLE [dbo].[my_table](
[id] [int] IDENTITY(1,1) NOT NULL,
[external_id] [int] NOT NULL,
[amount] [money] NOT NULL,
PRIMARY KEY CLUSTERED ([id] ASC));
There are just a few different external_id values and thousands of records for each of them.
The SSMS Create Partition Wizard generates a script that I don't completely understand. After creating the partition function and partition scheme,
--it drops the primary key,
--then creates the primary key again on id, this time as non-clustered,
--then creates a clustered index on external_id on the newly created partition scheme,
--and finally it drops the clustered index created in the previous step.
Everything except the last step seems clear, but I cannot see why it has to drop the clustered index. Should I remove the last step from the batch?
Any help will be greatly appreciated.
It makes sense.
The partition key is going to be the external id, so the clustered index must include that.
It preserves the primary key as a non-clustered index - since it's on id, not external_id.
It creates the clustered index on external_id to physically move the data into the partition scheme.
It drops the clustered index since it was only used to move the data - it was not a previously specified index.
There are a number of alternatives. Assuming you always know the external_id, you could choose to create the clustered index as (id, external_id) - the partition scheme / function field used for the table must be within the clustered index on the partition scheme.
Performance-wise, this is not going to be a huge boost; the use of it is more that you can drop an entire external_id trivially, instead of running a large delete transaction.
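For reference, a stripped-down sketch of the kind of script the wizard produces (the boundary values, object names and the primary key constraint name here are all made up; the real script depends on your data):
CREATE PARTITION FUNCTION pf_external_id (int)
    AS RANGE LEFT FOR VALUES (10, 20, 30);   --hypothetical boundaries
CREATE PARTITION SCHEME ps_external_id
    AS PARTITION pf_external_id ALL TO ([PRIMARY]);
--drop the clustered primary key and re-add it as non-clustered (constraint name assumed)
ALTER TABLE dbo.my_table DROP CONSTRAINT PK_my_table;
ALTER TABLE dbo.my_table ADD CONSTRAINT PK_my_table PRIMARY KEY NONCLUSTERED (id);
--creating a clustered index on the scheme physically moves the rows into the partitions
CREATE CLUSTERED INDEX IX_my_table_external_id
    ON dbo.my_table (external_id)
    ON ps_external_id (external_id);
--dropping it afterwards leaves the data as a partitioned heap on the scheme
DROP INDEX IX_my_table_external_id ON dbo.my_table;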

Table with 80 million records and adding an index takes more than 18 hours (or forever)! Now what?

A short recap of what happened. I am working with 71 million records (not much compared to billions of records processed by others). On a different thread, someone suggested that the current setup of my cluster is not suitable for my needs. My table structure is:
CREATE TABLE `IPAddresses` (
`id` int(11) unsigned NOT NULL auto_increment,
`ipaddress` bigint(20) unsigned default NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM;
And I added the 71 million records and then did a:
ALTER TABLE IPAddresses ADD INDEX(ipaddress);
It's been 14 hours and the operation is still not completed. Upon Googling, I found that there is a well-known approach for solving this problem - Partitioning. I understand that I need to partition my table now based on the ipaddress but can I do this without recreating the entire table? I mean, through an ALTER statement? If yes, there was one requirement saying that the column to be partitioned on should be a primary key. I will be using the id of this ipaddress in constructing a different table so ipaddress is not my primary key. How do I partition my table given this scenario?
OK, it turns out that this problem was more than just a simple "create a table, index it and forget it" problem :) Here's what I did, just in case someone else faces the same problem (I have used an example of IP addresses but it works for other data types too):
Problem: Your table has millions of entries and you need to add an index really fast
Usecase: Consider storing millions of IP addresses in a lookup table. Adding the IP addresses should not be a big problem but creating an index on them takes more than 14 hours.
Solution: Partition your table using MySQL's Partitioning strategy
Case #1: When the table you want is not yet created
CREATE TABLE IPADDRESSES(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ipaddress BIGINT UNSIGNED,
PRIMARY KEY(id, ipaddress)
) ENGINE=MYISAM
PARTITION BY HASH(ipaddress)
PARTITIONS 20;
Case #2: When the table you want is already created.
There seems to be a way to use ALTER TABLE to do this, but I have not yet figured out a proper solution. Instead, there is a slightly inefficient workaround:
CREATE TABLE IPADDRESSES_TEMP(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ipaddress BIGINT UNSIGNED,
PRIMARY KEY(id)
) ENGINE=MYISAM;
Insert your IP addresses into this table. And then create the actual table with partitions:
CREATE TABLE IPADDRESSES(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ipaddress BIGINT UNSIGNED,
PRIMARY KEY(id, ipaddress)
) ENGINE=MYISAM
PARTITION BY HASH(ipaddress)
PARTITIONS 20;
And then finally
INSERT INTO IPADDRESSES(ipaddress) SELECT ipaddress FROM IPADDRESSES_TEMP;
DROP TABLE IPADDRESSES_TEMP;
ALTER TABLE IPADDRESSES ADD INDEX(ipaddress);
And there you go... indexing on the new table took me about 2 hours on a 3.2GHz machine with 1GB RAM :) Hope this helps.
Creating indexes with MySQL is slow, but not that slow. With 71 million records, it should take a couple of minutes, not 14 hours. Possible problems are:
you have not configured sort buffer sizes and other configuration options
look here: http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_myisam_sort_buffer_size
If you try to generate a 1 GB index with an 8 MB sort buffer it's going to take lots of passes. But if the buffer is larger than your CPU cache it will get slower. So you have to test and see what works best (see the sketch at the end of this answer).
someone has a lock on the table
your IO system sucks
your server is swapping
etc
As usual, check iostat, vmstat, logs, etc. Issue a LOCK TABLE on your table to check if someone has a lock on it.
FYI on my 64-bit desktop creating an index on 10M random BIGINTs takes 17s...
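A minimal sketch of the sort-buffer experiment mentioned above (the 256 MB value is only an example; adjust it for your machine):
-- Raise the MyISAM index-build sort buffer for this session only, then rebuild
-- the index; a bigger buffer means fewer merge passes on disk.
SET SESSION myisam_sort_buffer_size = 256 * 1024 * 1024;
ALTER TABLE IPAddresses ADD INDEX (ipaddress);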
I had a problem where I wanted to speed up my query by adding an index. The table only had about 300,000 records, but it also took way too long. When I checked the MySQL server processes, it turned out that the query I was trying to optimize was still running in the background - 4 times! After I killed those queries, indexing was done in a jiffy. Perhaps the same problem applies to your situation.
You are using MyISAM which is being deprecated soon. An alternative would be InnoDB.
"InnoDB is a transaction-safe (ACID compliant) storage engine for MySQL that has commit, rollback, and crash-recovery capabilities to protect user data. InnoDB row-level locking (without escalation to coarser granularity locks) and Oracle-style consistent nonlocking reads increase multi-user concurrency and performance. InnoDB stores user data in clustered indexes to reduce I/O for common queries based on primary keys. To maintain data integrity, InnoDB also supports FOREIGN KEY referential-integrity constraints. You can freely mix InnoDB tables with tables from other MySQL storage engines, even within the same statement."\
http://dev.mysql.com/doc/refman/5.0/en/innodb.html
According to http://dev.mysql.com/tech-resources/articles/storage-engine/part_1.html, you should be able to switch between different engines by using a simple ALTER command, which allows you some flexibility. It also states that each table in your DB can be configured independently.
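A minimal sketch of that switch for the table in the question (note that this rewrites the whole table, so on 71 million rows it will take a while):
-- Convert the existing MyISAM table to InnoDB in place.
ALTER TABLE IPAddresses ENGINE=InnoDB;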
You have already inserted 71 million records into your table. Now, if you want to create partitions on the primary key column of your table, you can use the ALTER TABLE option. An example is given for your reference.
CREATE TABLE t1 (
id INT,
year_col INT
);
ALTER TABLE t1
PARTITION BY HASH(id)
PARTITIONS 8;
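Applied to the table from the question - and bearing in mind that MySQL requires the partitioning column to be part of every unique key, which is why the workaround above uses PRIMARY KEY(id, ipaddress) - a rough sketch might look like this (both statements rebuild the whole table):
-- Make the partitioning column NOT NULL and part of the primary key, ...
ALTER TABLE IPAddresses
    MODIFY ipaddress BIGINT UNSIGNED NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id, ipaddress);
-- ... then repartition the existing table in place.
ALTER TABLE IPAddresses PARTITION BY HASH(ipaddress) PARTITIONS 20;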