I'm trying to create a data import mechanism for a database that requires high availability to readers while serving irregular bulk loads of new data as they are scheduled.
The new data involves just three tables: new datasets are added, along with many new dataset items referenced by them and a few dataset item metadata rows referencing those items. Datasets may have tens of thousands of dataset items.
The dataset items are heavily indexed on several combinations of columns, and the vast majority (but not all) of reads include the dataset id in the WHERE clause. Because of the indexes, inserts are now too slow to keep up with the inflow, but since readers of those indexes take priority I cannot drop the indexes on the main table; I need to work on a copy instead.
I therefore need some kind of working table that I copy into, insert into and reindex before quickly switching it to become part of the queried table/view. The question is how do I quickly perform that switch?
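Conceptually, I'm imagining something along these lines (all object and column names below are placeholders, not my real schema):

-- 1. Bulk load into a working table that carries no nonclustered indexes yet.
INSERT INTO dbo.DatasetItems_Staging (DatasetId, ItemKey, Payload)
SELECT DatasetId, ItemKey, Payload
FROM dbo.InboundDatasetItems;
GO

-- 2. Build the same indexes the main table has, without touching the readers.
CREATE NONCLUSTERED INDEX IX_Staging_DatasetId_ItemKey
    ON dbo.DatasetItems_Staging (DatasetId, ItemKey);
GO

-- 3. The step I don't know how to do quickly: expose the staging data to readers,
--    e.g. by redefining a view that the application queries.
ALTER VIEW dbo.DatasetItemsView
AS
    SELECT DatasetId, ItemKey, Payload FROM dbo.DatasetItems
    UNION ALL
    SELECT DatasetId, ItemKey, Payload FROM dbo.DatasetItems_Staging;
GO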
I have looked into partitioning the dataset items table by a range of dataset id, which is a foreign key, but because this isn't part of the primary key SQL Server doesn't seem to make that easy. I am not able to switch the old data partition for a freshly indexed, updated version.
Different articles suggest use of partitioning, snapshot isolation and partitioned views but none directly answer this situation, being either about bulk loading and archiving of old data (partitioned by date) or simple transaction isolation without considering indexing.
Are there any examples that directly tackle this seemingly common problem?
What different strategies do people have for really minimizing the amount of time that indexes are disabled when bulk loading new data into large indexed tables?
Note that partitioning on a column requires the column to be part of the clustered index key, not of the primary key. The two are independent.
Still, partitioning imposes lots of constraints on what operations you can perform on your table. For example, switching only works if all indexes are aligned and no foreign keys reference the table being modified.
If you can make use of partitioning under all of those restrictions, it is probably the best approach. Partitioned views give you more flexibility but have similar restrictions: all indexes are aligned by construction, and incoming FKs are impossible.
Partitioning data is not easy. It is not a click-through-wizard-and-be-done solution. The set of tradeoffs is very complex.
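If you can live with those restrictions, the load-and-switch pattern looks roughly like this (a minimal sketch with made-up names; the staging table must be on the same partition scheme and filegroup, with identical columns, constraints, and aligned indexes, and the target partition must be empty):

-- Partition the dataset items on DatasetId, which therefore has to be in the clustered index key.
CREATE PARTITION FUNCTION pfDatasetId (int)
    AS RANGE RIGHT FOR VALUES (1000, 2000, 3000);
CREATE PARTITION SCHEME psDatasetId
    AS PARTITION pfDatasetId ALL TO ([PRIMARY]);

-- Staging table with the same structure, created on the same partition scheme.
CREATE TABLE dbo.DatasetItems_Staging (
    DatasetItemId bigint        NOT NULL,
    DatasetId     int           NOT NULL,
    Payload       nvarchar(400) NULL,
    CONSTRAINT PK_DatasetItems_Staging
        PRIMARY KEY CLUSTERED (DatasetId, DatasetItemId)
) ON psDatasetId (DatasetId);

-- Bulk load the staging table, build the aligned indexes, then:
ALTER TABLE dbo.DatasetItems_Staging
    SWITCH PARTITION 2 TO dbo.DatasetItems PARTITION 2;  -- metadata-only, effectively instant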
Related
I am a student and I have a question from my research into MySQL partitioning.
For example, say I have one table "Label" with 10 partitions by hash(TaskId):
resourceId (PK)
TaskId (PK)
...
Alternatively, I could have 10 separate tables named "label" + taskId, one per task:
task1 (resourceId, ...)
task2 (resourceId, ...)
...
Could you please tell me the advantages and disadvantages of each approach?
Thanks
Welcome to Stack Overflow. I wish you had offered a third alternative in your question: "just one table with no partitions." That is by far, in almost all cases in the real world, the best way to handle your data. It only requires maintaining and querying one copy of each index, for example. If your data approaches billions of rows in size, it's time to consider stuff like partitions.
But never mind that. Your question was to compare ten tables against one table with ten partitions. Your ten-table approach is often called sharding your data.
First, here's what the two have in common: they both are represented by ten different tables on your storage device (ssd or disk). A query for a row of data that might be anywhere in the ten involves searching all ten, using whatever indexes or other techniques are available. Each of these ten tables consumes resources on your server: open file descriptors, RAM caches, etc.
Here are some differences:
When INSERTing a row into a partitioned table, MySQL figures out which partition to use. When you are using shards, your application must figure out which table to use and write the INSERT query for that particular table.
When querying a partitioned table for a few rows, MySQL automatically figures out from your query's WHERE conditions which partitions it must search. When you search your sharded data, on the other hand, your application must figure out which table or tables to search.
In the case you presented -- partitioning by hash on the primary key -- the only way to get MySQL to search just one partition is to search for particular values of the PK. In your case this would be WHERE resourceId = foo AND TaskId = bar. If you search based on some other criterion -- WHERE customerId = something -- MySQL must search all the partitions. That takes time. In the sharding case, your application can use its own logic to figure out which tables to search.
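You can see the difference for yourself with EXPLAIN on a toy version of your table (table and column names below are only illustrative):

CREATE TABLE Label (
    resourceId INT NOT NULL,
    TaskId     INT NOT NULL,
    customerId INT,
    PRIMARY KEY (resourceId, TaskId)
)
PARTITION BY HASH (TaskId)
PARTITIONS 10;

-- Prunes to a single partition: the hashed column has a fixed value.
EXPLAIN SELECT * FROM Label WHERE resourceId = 42 AND TaskId = 7;

-- Cannot prune: customerId says nothing about which partition to look in,
-- so all ten partitions must be searched.
EXPLAIN SELECT * FROM Label WHERE customerId = 99;

The partitions column of the EXPLAIN output (EXPLAIN PARTITIONS in older MySQL versions) shows how many of the ten partitions each query has to touch.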
If your system grows very large, you'll be able to move each shard to its own MySQL server running on its own hardware. Then, of course, your application will need to choose the correct server as well as the correct shard table for each access. This won't work with partitions.
With a partitioned table that has an autoincrementing id column, each of your rows will have its own unique id no matter which partition it is in. In the sharding case, each table has its own sequence of autoincrementing ids. Rows from different tables will have duplicate ids.
The Data Definition Language (DDL: CREATE TABLE and the like) for partitioning is slightly simpler than for sharding. It's easier and less repetitive to write the DDL to add a column or an index to a partitioned table than to a bunch of shard tables. With the volume of data that justifies sharding or partitioning, you will need to add and modify indexes to match the needs of your application in the future.
Those are some practical differences. Pro tip: don't partition and don't shard your data unless you have really good reasons to do so.
Keep in mind that server hardware, disk hardware, and the MySQL software are under active development. If it takes several years for your data to grow very large, new hardware and new software releases may improve fast enough in the meantime that you don't have to worry too much about partitioning / sharding.
We have a dataset of roughly 400M rows, 200G in size. 200k rows are added in a daily batch. It mainly serves as an archive that is indexed for full text search by another application.
In order to reduce the database footprint, the data is stored in plain MyISAM.
We are considering a range-partitioned table to streamline the backup process, but cannot figure out a good way to handle unique keys. We absolutely need two of them: one to be directly compatible with the rest of the schema (e.g. custId), another to be compatible with the full text search app (e.g. seqId).
My understanding is that partitions do not support more than one globally unique key. We would have to merge both unique keys (custId, seqId), which will not work in our case.
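To illustrate the restriction I mean (column names simplified):

CREATE TABLE archive (
    seqId   BIGINT NOT NULL,
    custId  BIGINT NOT NULL,
    addedOn DATE   NOT NULL,
    body    TEXT,
    UNIQUE KEY (seqId),
    UNIQUE KEY (custId)
)
PARTITION BY RANGE (TO_DAYS(addedOn)) (
    PARTITION p2012 VALUES LESS THAN (TO_DAYS('2013-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
-- Fails with error 1503: every UNIQUE INDEX on a partitioned table must include
-- all columns of the partitioning function (addedOn here), which neither key does.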
Am I missing something?
I've always heard that "proper" indexing of one's SQL tables is key for performance. I've never seen a real-world example of this and would like to make one using SQLFiddle but not sure on the SQL syntax to do so.
Let's say I have 3 tables: 1) Users 2) Comments 3) Items.
Let's also say that each item can be commented on by any user. So to get item=3's comments here's what the SQL SELECT would look like:
SELECT * FROM comments JOIN users ON comments.commenter_id = users.user_id
WHERE comments.item_id = 3
I've heard that generally speaking, if the number of rows gets large, i.e., many thousands/millions, one should put indices on the columns used in the WHERE clause and in the JOIN. So in this case: comments.item_id, comments.commenter_id, and users.user_id.
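If I've got that right, the indexes in question would be created with something like this:

CREATE INDEX idx_comments_item_id      ON comments (item_id);
CREATE INDEX idx_comments_commenter_id ON comments (commenter_id);
-- users.user_id would only need its own index if it isn't already the primary key:
CREATE INDEX idx_users_user_id         ON users (user_id);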
I'd like to make a SQLFiddle to compare having these tables indexed vs. not, using many thousands or millions of rows for each table. Might someone help with generating this SQLFiddle?
I'm the owner of SQL Fiddle. It definitely is not the place for generating huge databases for performance testing. There are too many other variables that you don't (but should, in real life) have control over, such as memory, HDD configuration, etc. Also, as a shared environment, there are other people using it, which could also impact your tests. That being said, you can still build a small DB in SQL Fiddle and then view the execution plans for queries with and without indexes. These will be consistent regardless of other environmental factors, and will be a good source for learning optimization.
There are quite a few different ways to index a table, and you might choose to index multiple tables differently depending on what your most-used SELECT statements are. The two fundamental types of indexes are called clustered and non-clustered.
Clustered indexes store all of the information in the index itself rather than storing a list of references that the database can follow to find the actual data. The easiest way to visualize this is to think of the index and the table itself as separate objects. With a clustered index, if the column you indexed is used as a criterion (in the WHERE clause), the data the query needs is read directly from the index rather than looked up in a separate table structure.
Non-clustered indexes, on the other hand, are more like a reference table. They tell the query where the actual information it is requesting is stored on the table object itself. So, in essence, there is an extra step involved in actually retrieving the data from the table itself when you use non-clustered indexes.
Clustered indexes store data physically on disk in a sequential order, and as a result you can only have one clustered index per table (since a table can only be stored in one 'physical' order on a disk drive). Clustered index keys also need to be unique (even when that isn't visible to the naked eye, the database makes them unique internally). Because of this, most clustered indexes are put on the primary key, which is unique by definition.
Unlike clustered indexes, you can have as many non-clustered indexes as you want on a table since, after all, they are just reference structures for the actual table itself. Since we have an essentially unlimited number of options for non-clustered indexes, people like to put as many of these as needed on columns that are commonly used in the WHERE clause of a SELECT statement.
But like all things, excess is not always good. The more indexes you put on a table, the more 'overhead' there is on that table: every index has to be maintained on each INSERT, UPDATE, and DELETE. Indexes might speed up your reads, but excessive overhead will slow your writes down. The key is to find a balance between too many indexes and not enough indexes for your particular situation.
As far as a good place to test the performance of your queries with or without indexes, I would recommend using SQL Server. SQL Server Management Studio can display a query's execution plan, which shows you the estimated cost of each operation in the query.
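For the comments example in the question, the two kinds of index would look roughly like this in SQL Server (names here are only illustrative):

-- Only one clustered index per table; here it doubles as the primary key,
-- so the rows themselves are stored in comment_id order.
CREATE TABLE comments (
    comment_id   INT IDENTITY(1,1) NOT NULL,
    item_id      INT            NOT NULL,
    commenter_id INT            NOT NULL,
    body         NVARCHAR(1000) NULL,
    CONSTRAINT PK_comments PRIMARY KEY CLUSTERED (comment_id)
);

-- As many non-clustered indexes as the workload justifies; each one is a separate
-- structure that points back to the clustered index for the remaining columns.
CREATE NONCLUSTERED INDEX IX_comments_item_id
    ON comments (item_id);
CREATE NONCLUSTERED INDEX IX_comments_commenter_id
    ON comments (commenter_id);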
I have an InnoDB based schema with roughly 100 tables, most use GUID/UUID's as the primary key. I started this at a point in time where I didn't really understand the implications of a UUID PK with regard to Disk IO and fragmentation, but wanted the benefits of avoiding a single key dispenser when dealing with server clusters. We're not currently dealing with large numbers of rows, but we will be (in the hundreds of millions) and I would like to be prepared for that.
Now that I understand indexing in InnoDB better, specifically the clustered nature of the primary key, I can see that my UUID's are a poor choice for scalability from a DISK IO perspective, but I don't want to stop using them due to the server clustering requirement.
The accepted/recommended solution seems to be a mix of an auto-increment PK (INT|BIGINT) with UNIQUE-indexed UUID keys. My intention is to add a new first column ai_col to each table and assign it as the new PK; I'm taking cues from:
http://dev.mysql.com/doc/refman/5.1/en/innodb-auto-increment-handling.html
I would then update/recreate a new "UNIQUE" index on my UUID keys and continue to use them in our application layer.
My expectation is that once this is done, I can essentially ignore the ai_col and everything else runs business as usual. InnoDB will have a relatively small integer-based PK to cluster on and to append to the other unique indexes.
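Per table, the change I have in mind is something along these lines (ai_col and the uuid column name are placeholders for my actual columns):

ALTER TABLE some_table
    DROP PRIMARY KEY,
    ADD COLUMN ai_col BIGINT UNSIGNED NOT NULL AUTO_INCREMENT FIRST,
    ADD PRIMARY KEY (ai_col),
    ADD UNIQUE KEY uq_some_table_uuid (uuid);
-- Any foreign keys pointing at the old UUID primary key would need to be satisfied
-- by the new UNIQUE index, or dropped and re-created around this step.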
Question 1: Am I correct in assuming that in this new scenario, I can have my cake and eat it too?
The follow-up question is with regard to smaller 'associational' tables, i.e. only two columns, both foreign keys to other tables, joining them implicitly. In these cases I typically have two indexes: a UNIQUE two-column index with the more heavily used column first, and then a second single-column index on the other column. I know that this is essentially 2.5x as large as the actual row data, but it seems to really help our more complex queries during optimization, and it is on smaller tables, so relatively acceptable.
Most of these associational tables will have only a fraction of the number of records in the primary tables because they're typically more specific; however, there are a few cases where these have many multiples of the number of records of their parent tables, i.e. potentially billions.
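For concreteness, a typical associational table currently looks something like this (all names are placeholders):

CREATE TABLE item_tag (
    item_uuid CHAR(36) NOT NULL,
    tag_uuid  CHAR(36) NOT NULL,
    UNIQUE KEY uq_item_tag (item_uuid, tag_uuid),  -- more heavily used column first
    KEY idx_tag_uuid (tag_uuid),                   -- second, single-column index
    FOREIGN KEY (item_uuid) REFERENCES item (uuid),
    FOREIGN KEY (tag_uuid)  REFERENCES tag (uuid)
) ENGINE=InnoDB;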
Question 2: Is it a good idea to add the numeric PK's to these tables as well? I'm guessing that the answer will be something along the lines of "Benchtest it" but I'm just looking for helpful nuggets of wisdom.
If I've obviously mis-interpreted anything or you can offer insights that I may not be considering, I'd really appreciate that too!
Many thanks!
EDIT: As promised in the answer, I just wanted to follow up for anyone interested... This solution has worked famously :) Read and write performance increased across the board, and so far it's been tested up to about 6 billion i/o's / month, without breaking a sweat.
In the absence of any other suggestions or confirmations, I've begun testing on our dev server with a number of less-used tables, but ones that would nonetheless be affected if the new auto-increment ids were going to affect our application layer.
So far it's looking good: indexes are performing as expected, and the new columns haven't required any changes to our application layer; we've basically been able to ignore them.
I haven't run any thorough benchmarks of the actual disk I/O under heavy load, but from the sheer amount of information out there on the subject, I surmise that we're in good shape for scaling up.
Once this has been in place for a while I'll drop in a follow up in case anyone's in the same boat we were.
This is more of a conceptual question. It's inspired by working with an extremely large table where even a simple, properly indexed query takes a long time. I was wondering whether there is a better structure than just letting the table grow continually.
By large I mean 10,000,000+ records growing every day by something like 10,000/day. A table like that would gain 10,000,000 additional records every 2.7 years. Let's say that more recent records are accessed the most, but the older ones need to remain available.
I have two conceptual ideas to speed it up.
1) Maintain a master table that holds all the data, indexed by date in reverse order. Create a separate view for each year that holds only the data for that year. Then when querying, and let's say the query is expected to pull only a few records from a three-year span, I could use a union to combine the three views and select from those (sketched below).
2) The other option would be to create a separate table for every year. Then, again using a union to combine them when querying.
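Option 1 would look roughly like this (table, view, and column names are placeholders):

-- One view per year over the master table
CREATE VIEW records_2011 AS
    SELECT * FROM records
    WHERE record_date >= '2011-01-01' AND record_date < '2012-01-01';

CREATE VIEW records_2012 AS
    SELECT * FROM records
    WHERE record_date >= '2012-01-01' AND record_date < '2013-01-01';

-- A query expected to touch only those two years unions just those views
SELECT * FROM records_2011 WHERE customer_id = 123
UNION ALL
SELECT * FROM records_2012 WHERE customer_id = 123;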
Does anyone else have any other ideas or concepts? I know this is a problem Facebook has faced, so how do you think they handled it? I doubt they have a single table (status_updates) that contains 100,000,000,000 records.
The main RDBMS providers all have similar concepts in terms of partitioned tables and partitioned views (as well as combinations of the two).
There is one immediate benefit: the data is now split across multiple conceptual tables, so any query that includes the partition key can automatically ignore any partition that the key could not be in.
From an RDBMS management perspective, having the data divided into separate partitions allows operations to be performed at the partition level: backup, restore, indexing, etc. This helps reduce downtime as well as allowing far faster archiving, by just removing an entire partition at a time.
There are also non-relational storage mechanisms such as NoSQL stores and map-reduce, but ultimately how the data is used, loaded, and archived becomes a driving factor in deciding which structure to use.
10 million rows is not that large in the scale of large systems; partitioned systems can and will hold billions of rows.
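As a sketch of the archiving benefit, in MySQL syntax (the other major RDBMSs have equivalent, if differently spelled, operations; names here are made up):

CREATE TABLE status_updates (
    id         BIGINT NOT NULL AUTO_INCREMENT,
    created_at DATE   NOT NULL,
    body       TEXT,
    PRIMARY KEY (id, created_at)   -- the partition column must be part of every unique key
)
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2010 VALUES LESS THAN (2011),
    PARTITION p2011 VALUES LESS THAN (2012),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Removing an entire year is a single partition-level operation,
-- not a huge DELETE that has to rework every index:
ALTER TABLE status_updates DROP PARTITION p2010;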
Your second idea looks like partitioning.
I don't know how well it works, but there is support for partitioning in MySQL -- see, in its manual, Chapter 17: Partitioning.
There is a good scalability approach for tables like this. A union is a step in the right direction, but there is a better way.
If your database engine supports "semantic partitioning", you can split one table into partitions, each covering some subrange (say, one partition per year). It will not affect anything in your SQL syntax except the DDL, and the engine will transparently run the hidden union logic and partitioned index scans with all the parallel hardware it has (CPU, I/O, storage).
For example, Sybase allows up to 255 partitions, as that is the limit of a union. But you will never need the keyword "union" in your queries.
Often the best plan is to have one table and then use database partitioning.
Or you can archive data and create a view over the archived and active data combined, keeping only the active data in the table most functions reference. You will need a good archiving strategy (one that is automated), though, or you risk losing data or moving it inefficiently. This is typically more difficult to maintain.
What you're talking about is horizontal partitioning or sharding.