Indexes in a Data Warehouse - sql-server-2008

I'm creating a data mart in SQL Server 2008 using SSIS for load, and SSAS for an OLAP cube. So far, everything is working great. However, I haven't created any indexes on the source database other than the default clustering on primary key.
I'm pretty comfortable with designing indexes on the application databases, but since this database is intended primarily to be the source for a cube, I'm not sure what sort of indexing, if any, will be beneficial.
Is there any sort of indexing I should be doing to improve the processing of the dimensions and cube? I'm using regular MOLAP storage.

Generally, the best practice is to keep indexes and constraints off of marts, unless they'll be used directly for reporting. Indexes and constraints can seriously hose your ETL time (especially with the amounts of data that usually go into warehouses).
What I've found works best is to have a single, solitary PK on all of your tables (including fact, because I have composite keys, and I'll just hash the composite to get myself a PK if I have to). Having PK's (that are identity columns) provides you with an autogenerated index, quick joining when the cubes are built, and very quick inserts.
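To illustrate the "hash the composite" idea, here is a minimal sketch in Python. The function name, the choice of SHA-256, and the sample key values are my own assumptions and not something from the setup above; in practice the hashing would more likely happen inside the ETL tool or the database itself.

```python
import hashlib

def surrogate_key(*parts):
    """Derive a single 64-bit key from the columns of a composite key."""
    # Prefix each part with its length so ("ab", "c") and ("a", "bc") differ.
    # None is encoded as the text "None" here (a simplification).
    encoded = "".join(f"{len(str(p))}:{p}" for p in parts)
    digest = hashlib.sha256(encoded.encode("utf-8")).digest()
    # Keep the first 8 bytes so the value fits a signed BIGINT column.
    # Collisions are unlikely but not impossible with a truncated hash.
    return int.from_bytes(digest[:8], "big", signed=True)

# Hypothetical composite key: (date key, store code, product id)
key = surrogate_key(20240131, "store-17", 4211)
```

The same inputs always produce the same key, so reloading a row maps it back to the same PK. Because a truncated hash can collide, a unique constraint on the key column is still worth keeping.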
If you're going to be doing reporting, then build out the indexes as you would, but make sure to disable and then rebuild the indexes as part of your ETL process. Otherwise, bulk inserts take some time to do (hours upon hours to commit, in some cases).
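A rough sketch of that "drop, load, rebuild" sequence, using Python's built-in sqlite3 module purely as a stand-in so it can run anywhere. The table and index names are made up. On SQL Server the corresponding statements would be ALTER INDEX ... DISABLE before the load and ALTER INDEX ... REBUILD after it, typically run from SSIS tasks around the data flow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (id INTEGER PRIMARY KEY, date_key INTEGER, amount REAL)")
conn.execute("CREATE INDEX ix_fact_sales_date ON fact_sales (date_key)")

def bulk_load(conn, rows):
    # 1. Drop the reporting index so the load does not maintain it row by row.
    conn.execute("DROP INDEX IF EXISTS ix_fact_sales_date")
    # 2. Load everything in a single transaction.
    with conn:
        conn.executemany("INSERT INTO fact_sales (date_key, amount) VALUES (?, ?)", rows)
    # 3. Rebuild the index once, after the data is in place.
    conn.execute("CREATE INDEX ix_fact_sales_date ON fact_sales (date_key)")

bulk_load(conn, [(20240101 + i % 28, float(i)) for i in range(10_000)])
row_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('fact_sales')")]
```

The primary key stays in place throughout; only the secondary (reporting) index is dropped and rebuilt.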

Related

Improving MySQL performance on RDS by partitioning

I am trying to improve the performance of some large tables (can be millions of records) in a MySQL 8.0.20 DB on RDS.
Scaling up DB instance and IOPS is not the way to go, as it is very expensive (the DB is live 24/7).
Proper indexes (including composite ones) do already exist to improve the query performance.
The DB is mostly read-heavy, with occasional massive writes - when these writes happen, reads can be just as massive at the same time.
I thought about doing partitioning. Since MySQL doesn't support vertical partitioning, I considered doing horizontal partitioning - which should work very well for these large tables, as they contain activity records from dozens/hundreds of accounts, and storing each account's records in a separate partition makes a lot of sense to me.
But these tables do contain some constraints with foreign keys, which rules out using MySQL's horizontal partitioning: Restrictions and Limitations on Partitioning
Foreign keys not supported for partitioned InnoDB tables. Partitioned tables using the InnoDB storage engine do not support foreign keys. More specifically, this means that the following two statements are true:
No definition of an InnoDB table employing user-defined partitioning may contain foreign key references; no InnoDB table whose definition contains foreign key references may be partitioned.
No InnoDB table definition may contain a foreign key reference to a user-partitioned table; no InnoDB table with user-defined partitioning may contain columns referenced by foreign keys.
What are my options, other than doing "sharding" by using separate tables to store activity records on a per account basis? That would require a big code change to accommodate such tables. Hopefully there is a better way, that would only require changes in MySQL, and not the application code. If the code needs to be changed - the less the better :)
storing each account's records in a separate partition makes a lot of sense to me
Instead, have the PRIMARY KEY start with acct_id. This provides performance at least as good as PARTITION BY acct_id, saves disk space, and "clusters" an account's data together for "locality of reference".
The DB is mostly read-heavy
Replicas allow 'infinite' scaling of reads. But if you are not overloading the single machine now, there may be no need for this.
with occasional massive writes
Let's discuss techniques to help with that. Please explain what those writes entail -- hourly/daily/sporadic? replace random rows / whole table / etc? keyed off what? Etc.
Proper indexes (including composite ones) do already exist to improve the query performance.
Use the slowlog (with long_query_time = 1 or lower) to verify. Use pt-query-digest to find the top one or two queries. Show them to us -- we can help you "think out of the box".
read-heavy
Is the working set size less than innodb_buffer_pool_size? That is, are you CPU-bound and not I/O-bound?
More on PARTITION
PRIMARY KEY(acct_id, ..some other columns..) orders the data primarily on acct_id and makes this efficient: WHERE acct_id=123 AND ....
PARTITION BY .. (acct_id) -- A PARTITION is implemented as a separate "table". "Partition pruning" is the act of deciding which partition(s) are needed for the query. So WHERE acct_id=123 AND ... will first do that pruning, then look for the row(s) in that "table" to handle the AND .... Hopefully, there is a good index (perhaps the PRIMARY KEY) to handle that part of the filtering.
The pruning sort of takes the place of one level of BTree. It is hard to predict which will be slower or faster.
Note that when partitioning by, say, acct_id, it is usually not efficient to start the index with that column. (However, it would need to be later in the PK.)
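To make the PRIMARY KEY(acct_id, ...) suggestion concrete, here is a small runnable sketch. It uses SQLite from Python only because that needs no server; a WITHOUT ROWID table is the closest SQLite equivalent I know of to InnoDB's clustered primary key. The table and column names are invented. In MySQL, if id is AUTO_INCREMENT, you would also need a secondary INDEX(id), since InnoDB wants the auto-increment column to lead some index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WITHOUT ROWID makes SQLite store rows in PRIMARY KEY order, which is the
# closest stand-in here for InnoDB's clustered primary key.
conn.execute("""
    CREATE TABLE activity (
        acct_id INTEGER NOT NULL,
        id      INTEGER NOT NULL,
        action  TEXT,
        PRIMARY KEY (acct_id, id)
    ) WITHOUT ROWID
""")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?, ?)",
    [(acct, n, "login") for acct in range(1, 51) for n in range(100)],
)

# Ask the engine how it would resolve a per-account query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM activity WHERE acct_id = ? AND id > ?",
    (123, 10),
).fetchall()
plan_detail = " ".join(row[-1] for row in plan)

rows_for_acct = conn.execute(
    "SELECT COUNT(*) FROM activity WHERE acct_id = ?", (7,)
).fetchone()[0]
```

Printing plan_detail should show a search on the primary key rather than a full scan: one account's rows sit next to each other and are reached by a single range lookup, with no partitioning involved.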
Big Deletes
There are several ways to do a "big delete" while minimizing the impact on the system. Partitioning by date is optimal but does not sound viable for your type of data. Check out the others listed here: http://mysql.rjweb.org/doc.php/deletebig
Since you say that the deletion is usually less than 15%, the "copy over what needs to be kept" technique is not applicable either.
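One commonly used way to keep a big delete from hurting the rest of the system is to delete in small chunks, each in its own short transaction. The sketch below shows the loop with SQLite from Python; the table, the cutoff column, and the chunk size are all assumptions for illustration. In MySQL you would more likely use DELETE ... WHERE ... ORDER BY id LIMIT n directly, or walk the primary key range by range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (id INTEGER PRIMARY KEY, acct_id INTEGER, created INTEGER)")
conn.executemany(
    "INSERT INTO activity (acct_id, created) VALUES (?, ?)",
    [(i % 10, i) for i in range(5_000)],
)
conn.commit()

def delete_in_chunks(conn, cutoff, chunk_size=500):
    """Delete rows older than cutoff in small transactions; return total deleted."""
    total = 0
    while True:
        with conn:  # one short transaction per chunk
            cur = conn.execute(
                "DELETE FROM activity WHERE id IN ("
                "  SELECT id FROM activity WHERE created < ? ORDER BY id LIMIT ?)",
                (cutoff, chunk_size),
            )
        if cur.rowcount == 0:
            return total
        total += cur.rowcount

deleted = delete_in_chunks(conn, cutoff=600)
remaining = conn.execute("SELECT COUNT(*) FROM activity").fetchone()[0]
```

Short transactions mean locks are held briefly and replicas are not handed one enormous statement to replay. A pause between chunks can be added if the server is busy.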
Before sharding or partitioning, first analyze your queries to make sure they are as optimized as you can make them. This usually means designing indexes specifically to support the queries you run. You might like my presentation How to Design Indexes, Really (video).
Partitioning isn't as much a solution as people think. It has many restrictions, including the foreign key issue you found. Besides that, it only improves queries that can take advantage of partition pruning.
Also, I've done a lot of benchmarking of Amazon RDS for my current job and also a previous job. RDS is slow. It's really slow. It uses remote EBS storage, so it's bound to incur overhead for every read from storage or write to storage. RDS is just not suitable for any application that needs high performance.
Amazon Aurora is significantly better on latency and throughput. But it's also very expensive. The more you use it, the more you use I/O requests, and they charge extra for that. For a busy app, you end up spending as much as you did for RDS with high provisioned IOPS.
The only way I found to get high performance in the cloud is to forget about managed databases like RDS and Aurora, and instead install and run your own instance of MySQL on an EC2 instance with locally-attached NVMe storage. This means the i3 family of EC2 instances. But local storage is ephemeral instance storage, so if the instance is stopped or terminated, or the underlying host fails, you lose your data. So you must add one or more replicas and have a failover plan.
If you need an OLTP database in the cloud, and you also need top-tier performance, you either have to spend $$$ for a managed database, or else you need to hire full-time DevOps and DBA staff to run it.
Sorry to give you the bad news, but the TANSTAAFL adage remains true.

Index on high write table is bad?

I got an Oracle 9i book from the Oracle publisher.
In it it's written
Index is bad on a table which is updated/inserted new rows frequently
Is it true ? or is it just about Oracle [and not about other RDBMS packages] ?
Edit
I got a table in MySQL like this
ID [pk / AI]
User [integer]
Text [TinyText]
Time [Timestamp]
Only write/read is allowed to this table.
As PK creates Index, is the table design broken ?
If yes, how to solve this type of problem [where AI is the primary key]
This is true with any database to an extent. Whenever indexed columns are updated, the index must also be updated. Each additional index adds extra overhead. Whether or not this matters for your specific situation depends on the indices you create and the workload the server is running. Performance implications are best discovered via benchmarking.
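As a starting point for that kind of benchmark, here is a tiny sketch that times the same bulk insert with 0, 1, and 3 secondary indexes. It uses SQLite in memory so it runs anywhere, and the table only loosely mirrors the one in the question. Treat the numbers as a hint at best; a real test needs your actual engine, data, and hardware.

```python
import sqlite3
import time

def time_inserts(extra_indexes, n_rows=20_000):
    """Insert n_rows into a fresh table with the given number of secondary indexes."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE msg (id INTEGER PRIMARY KEY, user INTEGER, text TEXT, time INTEGER)")
    for i, col in enumerate(["user", "time", "text"][:extra_indexes]):
        conn.execute(f"CREATE INDEX ix_{i} ON msg ({col})")
    rows = [((i * 7919) % 1000, f"message {i}", i) for i in range(n_rows)]
    start = time.perf_counter()
    with conn:
        conn.executemany("INSERT INTO msg (user, text, time) VALUES (?, ?, ?)", rows)
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM msg").fetchone()[0]
    conn.close()
    return elapsed, count

results = {k: time_inserts(k) for k in (0, 1, 3)}
for k, (elapsed, count) in results.items():
    print(f"{k} secondary index(es): {count} rows in {elapsed:.3f}s")
```

The primary key index is present in every run, so what this compares is only the cost of the additional indexes.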
Indexes are only for retrieving data. Because they are pointers to data locations, INSERT/UPDATE/DELETE statements are slower, since they must also maintain the indexes. Even then, indexes can become fragmented as deletions and updates change the data, which is why there are tools to maintain them and the table statistics (both are used by the optimizer to determine the EXPLAIN plan).
Keep in mind that indexes are not ANSI -- it's a miracle the syntax & terminology is so similar. But the functionality is near identical between databases that provide it. For example, Oracle only has "indexes" while both MySQL and SQL Server differentiate between clustered (one per table) and non-clustered indexes.
To address your update about the primary key: the primary key is unique and considered immutable (though it can in fact be updated, as long as the new value is unique within the column). Deletion from the table would fragment the index, which requires monitoring with database-vendor-specific tools if performance becomes an issue.
It is not that indexes on highly volatile tables are "bad". It's simply that there is a performance impact on the DML operations. I think what the author was trying to say is that you should carefully consider the need for indexes on such active tables.
As in everything computing, it's all about tradeoffs. As @Michael essentially states, "it depends". You might have a high query rate on the table as well, in which case the indexes on the table avoid a lot of full table scans. In such a case, your index maintenance overhead may well be worth the benefit derived from the indexes on queries.
Also, I'd probably not buy a 9i book anyway, unless it was a real bargain. I'd recommend you read most anything by Tom Kyte you can get your hands on, especially "Expert Oracle Database Architecture" or "Effective Oracle by Design".

Cassandra write performance vs Relational Databases

I am trying to grasp some performance differences between Cassandra and relational databases.
From what I have read, Cassandra's write performance remains constant regardless of data volume. By write performance, I am assuming this implies both new rows being added as well as existing rows being replaced on a key match (like an update in the relational world). Is that assumption correct?
Also, from what I understand about relational databases, updates get slower when tables/partitions become larger. This is because a full table scan must be performed to locate the row, or an index lookup needs to be performed, and both of these things will take longer as the table or partition grows. So do updates take progressively longer as the data volume of the table/partition grows?
When new data is inserted into a relational database, I know any indexes need to have the new data added, but there is no lookup involved, correct? So will inserts also become progressively slower as data volume increases, or stay constant with relational databases?
Thanks for any tips
They will become slower if the table has indexes. Not only must the data be written, but the index must be updated too. Inserting into a table that has no indexes and no constraints is lightning fast, because no checks need to be done. The record can just be written at the end of the table space.
On the relational DB side, I've been doing load testing on our RDBMS where I can see that the performance drops exponentially as data is added to the DB.
I'm still working on a Cassandra setup to be able to realize a comparable test. In the meantime, this Cassandra presentation gives some info on Cassandra compared to MySQL:
http://www.slideshare.net/Eweaver/cassandra-presentation-at-nosql

MySQL sharding approaches?

What is the best approach for Sharding MySQL tables.
The approaches I can think of are :
Application Level sharding?
Sharding at MySQL proxy layer?
Central lookup server for sharding?
Do you know of any interesting projects or tools in this area?
The best approach for sharding MySQL tables is to not do it unless it is totally unavoidable.
When you are writing an application, you usually want to do so in a way that maximizes velocity, developer speed. You optimize for latency (time until the answer is ready) or throughput (number of answers per time unit) only when necessary.
You partition and then assign partitions to different hosts (= shard) only when the sum of all these partitions no longer fits onto a single database server instance, the reason for that being either writes or reads.
The write case is either a) the frequency of writes is overloading this server's disks permanently or b) there are too many writes going on so that replication permanently lags in this replication hierarchy.
The read case for sharding is when the size of the data is so large that the working set of it no longer fits into memory and data reads start hitting the disk instead of being served from memory most of the time.
Shard only when you have to.
The moment you shard, you are paying for that in multiple ways:
Much of your SQL is no longer declarative.
Normally, in SQL you are telling the database what data you want and leave it to the optimizer to turn that specification into a data access program. That is a good thing, because it is flexible, and because writing these data access programs is boring work that harms velocity.
With a sharded environment you are probably joining a table on node A against data on node B, or you have a table larger than a node, on nodes A and B, and are joining data from it against data that is on nodes B and C. You are starting to write application-side hash-based join resolutions manually in order to resolve that (or you are reinventing MySQL cluster), meaning you end up with a lot of SQL that is no longer declarative, but is expressing SQL functionality in a procedural way (e.g. you are using SELECT statements in loops).
You are incurring a lot of network latency.
Normally, an SQL query can be resolved locally and the optimizer knows about the costs associated with local disk accesses and resolves the query in a way that minimizes the costs for that.
In a sharded environment, queries are resolved by either running key-value accesses across a network to multiple nodes (hopefully with batched key accesses and not individual key lookups per round trip) or by pushing parts of the WHERE clause onward to the nodes where they can be applied (that is called 'condition pushdown'), or both.
But even in the best of cases this involves many more network round trips than a local situation, and it is more complicated. Especially since the MySQL optimizer knows nothing about network latency at all (Ok, MySQL cluster is slowly getting better at that, but for vanilla MySQL outside of cluster that is still true).
You are losing a lot of expressive power of SQL.
Ok, that is probably less important, but foreign key constraints and other SQL mechanisms for data integrity are incapable of spanning multiple shards.
MySQL has no API which allows asynchronous queries that is in working order.
When data of the same type resides on multiple nodes (e.g. user data on nodes A, B and C), horizontal queries often need to be resolved against all of these nodes ("Find all user accounts that have not been logged in for 90 days or more"). Data access time grows linearly with the number of nodes, unless multiple nodes can be asked in parallel and the results aggregated as they come in ("Map-Reduce").
The precondition for that is an asynchronous communication API, which does not exist for MySQL in a good working shape. The alternative is a lot of forking and connections in the child processes, which is visiting the world of suck on a season pass.
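To show the shape of such a horizontal query, here is a minimal scatter/gather sketch. The shards are plain in-memory lists and every name in it is made up. With real MySQL nodes, each worker thread would hold its own connection, which is a client-side workaround and not the asynchronous API wished for above.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta

# Hypothetical shards: in a real system each would be a connection to a
# separate MySQL server; plain lists keep the sketch self-contained.
TODAY = date(2024, 6, 1)
SHARDS = {
    "A": [{"user": "ann", "last_login": date(2024, 5, 30)},
          {"user": "bob", "last_login": date(2023, 12, 1)}],
    "B": [{"user": "cy", "last_login": date(2024, 1, 15)}],
    "C": [{"user": "dee", "last_login": date(2024, 5, 1)}],
}

def inactive_on_shard(name, days=90):
    """The 'map' step: run the same filter on one shard."""
    cutoff = TODAY - timedelta(days=days)
    return [row["user"] for row in SHARDS[name] if row["last_login"] <= cutoff]

def inactive_everywhere(days=90):
    """Ask all shards in parallel, then merge ('reduce') the partial results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(lambda name: inactive_on_shard(name, days), SHARDS)
        return sorted(user for part in partials for user in part)

inactive = inactive_everywhere()
```

With the shards queried in parallel, total time is roughly that of the slowest shard instead of the sum over all shards. The price is that the application now owns the fan-out and the merging.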
Once you start sharding, data structure and network topology become visible as performance points to your application. In order to perform reasonably well, your application needs to be aware of these things, and that means that really only application level sharding makes sense.
The question is more if you want to auto-shard (determining which row goes into which node by hashing primary keys for example) or if you want to split functionally in a manual way ("The tables related to the xyz user story go to this master, while abc and def related tables go to that master").
Functional sharding has the advantage that, if done right, it is invisible to most developers most of the time, because all tables related to their user story will be available locally. That allows them to still benefit from declarative SQL as long as possible, and will also incur less network latency because the number of cross-network transfers is kept minimal.
Functional sharding has the disadvantage that it does not allow for any single table to be larger than one instance, and it requires manual attention of a designer.
Functional sharding has the advantage that it is relatively easily done to an existing codebase with a number of changes that is not overly large. http://Booking.com has done it multiple times in the past years and it worked well for them.
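The two routing styles can be sketched side by side. This is only an illustration: the host names, table names, and the use of CRC32 as the hash are my assumptions, not what any particular site runs.

```python
import zlib

SHARD_HOSTS = ["db-a", "db-b", "db-c"]          # hypothetical host names

def auto_shard(primary_key):
    """Auto-sharding: hash the primary key to pick a node."""
    # crc32 is stable across runs, unlike Python's built-in hash() for str.
    return SHARD_HOSTS[zlib.crc32(str(primary_key).encode()) % len(SHARD_HOSTS)]

# Functional sharding: whole groups of tables live together on one master.
FUNCTIONAL_MAP = {                                # hypothetical table names
    "bookings": "master-xyz", "booking_items": "master-xyz",
    "reviews": "master-abc", "review_votes": "master-abc",
}

def functional_shard(table):
    return FUNCTIONAL_MAP[table]

def can_join_locally(*tables):
    """Tables on the same master can still be joined with ordinary SQL."""
    return len({functional_shard(t) for t in tables}) == 1
```

With auto-sharding, rows of a single table are spread over all nodes. With the functional map, a query touching only bookings and booking_items never leaves one server, which is why most developers do not notice the split.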
Having said all that, looking at your question, I do believe that you are asking the wrong questions, or I am completely misunderstanding your problem statement.
1. Application Level sharding: dbShards is the only product that I know of that does "application aware sharding". There are a few good articles on the website. Just by definition, application aware sharding is going to be more efficient. If an application knows exactly where to go with a transaction without having to look it up or get redirected by a proxy, that in itself will be faster. And speed is often one of the primary concerns, if not the only concern, when someone is looking into sharding.
2. Sharding at MySQL proxy layer: Some people "shard" with a proxy, but in my eyes that defeats the purpose of sharding. You are just using another server to tell your transactions where to find the data or where to store it. With application aware sharding, your application knows where to go on its own. Much more efficient.
3. Central lookup server for sharding: This is the same as #2 really.
Do you know of any interesting projects or tools in this area?
Several new projects in this space:
citusdata.com
spockproxy.sourceforge.net
github.com/twitter/gizzard/
Application level of course.
The best approach I've ever read about is in this book:
High Performance MySQL
http://www.amazon.com/High-Performance-MySQL-Jeremy-Zawodny/dp/0596003064
Short description: you could split your data into many parts and store ~50 parts on each server. It will help you to avoid the second biggest problem of sharding - rebalancing. Just move some of them to the new server and everything will be fine :)
I strongly recommend you to buy it and read "mysql scaling" part.
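A rough sketch of the "many parts per server" idea, as I understand it from the short description above. The bucket count, server names, and hash function are my assumptions and are not taken from the book.

```python
import zlib

N_BUCKETS = 100                      # fixed forever; many more buckets than servers

# Hypothetical starting layout: 50 buckets on each of two servers.
bucket_to_server = {b: ("db1" if b < 50 else "db2") for b in range(N_BUCKETS)}

def bucket_for(key):
    return zlib.crc32(str(key).encode()) % N_BUCKETS

def server_for(key):
    return bucket_to_server[bucket_for(key)]

def rebalance(buckets, new_server):
    """Reassign whole buckets; keys never change bucket, only the bucket's home."""
    for b in buckets:
        bucket_to_server[b] = new_server

before = {k: server_for(k) for k in range(1000)}
rebalance(range(40, 50), "db3")      # move 10 buckets from db1 to the new server
after = {k: server_for(k) for k in range(1000)}
moved = [k for k in before if before[k] != after[k]]
```

Only keys in the moved buckets change servers; everything else stays put. That is the point of routing through a fixed set of buckets instead of hashing directly onto the current number of servers.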
As of 2018, there seems to be a MySQL-native solution to that. There are actually at least 2: InnoDB Cluster and NDB Cluster (there is a commercial and a community version of it).
Since most people who use MySQL community edition are more familiar with the InnoDB engine, this is what should be explored as a first priority. It supports replication and partitioning/sharding out of the box and is based on MySQL Router for different routing/load-balancing options.
The syntax for your tables creation would need to change, for example:
CREATE TABLE t1 (col1 INT, col2 CHAR(5), col3 DATETIME) PARTITION BY HASH ( YEAR(col3) );
(this is only one of four partitioning types)
One very important limitation:
InnoDB foreign keys and MySQL partitioning are not compatible. Partitioned InnoDB tables cannot have foreign key references, nor can they have columns referenced by foreign keys. InnoDB tables which have or which are referenced by foreign keys cannot be partitioned.
Shard-Query is an OLAP based sharding solution for MySQL. It allows you to define a combination of sharded tables and unsharded tables. The unsharded tables (like lookup tables) are freely joinable to sharded tables, and sharded tables may be joined to each other as long as the tables are joined by the shard key (no cross shard or self joins that cross shard boundaries). Being an OLAP solution, Shard-Query usually has minimum response times of 100ms or less, even for simple queries so it will not work for OLTP. Shard-Query is designed for analyzing big data sets in parallel.
OLTP sharding solutions exist for MySQL as well. Closed source solutions include ScaleDB and DBShards. Open source OLTP solutions include JetPants, Cubrid, and Flock/Gizzard (Twitter infrastructure).
Do you know of any interesting projects or tools in this area?
As of 2022, here are 2 tools:
Vitess (website: https://vitess.io & repo: https://github.com/vitessio/vitess)
PlanetScale (https://planetscale.com)
You can consider this middleware
shardingsphere

MySQL vs SQLite + UNIQUE Indexes

For reasons that are irrelevant to this question I'll need to run several SQLite databases instead of the more common MySQL for some of my projects, I would like to know how SQLite compares to MySQL in terms of speed and performance regarding disk I/O (the database will be hosted in a USB 2.0 pen drive).
I've read the Database Speed Comparison page at http://www.sqlite.org/speed.html and I must say I was surprised by the performance of SQLite but since those benchmarks are a bit old I was looking for a more updated benchmark (SQLite 3 vs MySQL 5), again my main concern is disk performance, not CPU/RAM.
Also since I don't have that much experience with SQLite I'm also curious if it has anything similar to the TRIGGER (on update, delete) events in the InnoDB MySQL engine. I also couldn't find any way to declare a field as being UNIQUE like MySQL has, only PRIMARY KEY - is there anything I'm missing?
As a final question I would like to know if a good (preferably free or open source) SQLite database manager exists.
A few questions in there:
In terms of disk I/O limits, I wouldn't imagine that the database engine makes a lot of difference. There might be a few small things, but I think it's mostly just whether the database can read/write data as fast as your application wants it to. Since you'd be using the same amount of data with either MySQL or SQLite, I'd think it won't change much.
SQLite does support triggers: CREATE TRIGGER Syntax
SQLite does support UNIQUE constraints: column constraint definition syntax.
To manage my SQLite databases, I use the Firefox Add-on SQLite Manager. It's quite good, does everything I want it to.
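For anyone who wants to try the UNIQUE and trigger support quickly, here is a small sketch using Python's built-in sqlite3 module. The table and trigger names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE          -- UNIQUE column constraint
    );
    CREATE TABLE audit (user_id INTEGER, event TEXT);

    CREATE TRIGGER users_after_update AFTER UPDATE ON users
    BEGIN
        INSERT INTO audit VALUES (OLD.id, 'updated');
    END;

    CREATE TRIGGER users_after_delete AFTER DELETE ON users
    BEGIN
        INSERT INTO audit VALUES (OLD.id, 'deleted');
    END;
""")

conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
try:
    # Second insert with the same email violates the UNIQUE constraint.
    conn.execute("INSERT INTO users (email) VALUES ('a@example.com')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 1")
events = [row[1] for row in conn.execute("SELECT * FROM audit ORDER BY rowid")]
```

The duplicate insert raises sqlite3.IntegrityError, and the audit table ends up with one row per trigger firing.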
In terms of disk I/O limits, I wouldn't imagine that the database engine makes a lot of difference.
In MySQL/MyISAM the data is stored unordered, so range reads on the primary key will theoretically need to issue several HDD seek operations.
In MySQL/InnoDB the data is sorted by primary key, so range reads on the primary key will be done using one disk seek operation (in theory).
To sum that up:
MyISAM - data is written to the HDD unordered. Slow primary key range reads if the primary key is not an AUTO_INCREMENT unique field.
InnoDB - data is ordered, which is bad for flash drives (as data needs to be re-ordered after an insert = additional writes). Very fast for primary key range reads, slow for writes.
InnoDB is not suitable for flash memory: seeks are already very fast there (so you won't get much benefit from keeping the data ordered), and the additional writes needed to maintain the order are damaging to flash memory.
MyISAM vs. InnoDB makes a huge difference for both conventional and flash drives (I don't know about SQLite), but I'd rather use MySQL/MyISAM.
I actually prefer using SQLiteSpy http://www.portablefreeware.com/?id=1165 as my SQLite interface.
It supports things like REGEXP which can come in handy.