MySQL - Best primary key for appointments table

I'm not very experienced with SQL and I'd like advice on the best way to set up a table that will contain appointments.
My doubt is about the primary key.
My ideas are:
1 - Use an auto-increment column for the appointment id (for example, an unsigned integer).
My doubts about this solution: the counter could eventually overflow, even if its limit is very high, and performance may decrease as the number of records grows.
2 - Create a table for every year.
Doubts: it would be complex to maintain and to query.
3 - Use a composite index.
Doubts: how to set it up.
4 - Other?
Thanks.

Use an auto-increment primary key. The table will become too large for MySQL to process comfortably long before the integer overflows.
MySQL's performance will degrade on a large table even if you have no primary key at all. That is when you will start thinking about partitioning (your option 2) and archiving old data. But to begin with, an auto-increment primary key on a single table should do just fine.

1 - Do you think you will exceed 4 billion rows? Performance degrades if you don't have suitable indexes for your queries, not because of table size. (Well, there is a slight degradation, but not worth worrying about.) Based on 182K/year, MEDIUMINT UNSIGNED (16M max) will suffice.
2 - NO! This is a common question; the answer is always "do not create identical tables".
3 - What column or combination of columns are UNIQUE for the table? Simply list them inside PRIMARY KEY (...)
Number 3 is usually preferred. If there is no unique column(s), go with Number 1.
182K rows per year does not justify PARTITIONing. Consider it if you expect more than a million rows. (Here's an easy prediction: You will re-design this schema before 182K grows to a million.)
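A minimal sketch of option 1, with a composite UNIQUE key covering option 3's idea; the columns `room_id`, `starts_at`, and `notes` are assumptions for illustration, not from the question:

```sql
-- Hypothetical appointments table: auto-increment surrogate PK (option 1),
-- plus a composite UNIQUE key (option 3) if a natural key exists.
CREATE TABLE appointment (
  id        MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- 16M max; plenty at 182K/year
  room_id   INT UNSIGNED NOT NULL,   -- assumed column
  starts_at DATETIME NOT NULL,       -- assumed column
  notes     VARCHAR(255),
  PRIMARY KEY (id),
  UNIQUE KEY uq_room_slot (room_id, starts_at)  -- only if this combination is truly unique
) ENGINE=InnoDB;
```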

Related

Segregating data or using UNIQUE index for optimization

I have a table:
Orders
* id INT NN AI PK
* userid INT NN
* is_open TINYINT NN DEFAULT 1
* amount INT NN
* desc VARCHAR(255)
and the query SELECT * FROM orders WHERE userid = ? AND is_open = 1; that I run frequently. I would like to optimize the database for this query and I currently have two options:
Move closed orders (is_open = 0) to a different table since currently open orders will be relatively smaller than closed orders thereby minimizing rows to scan on lookup
Set a unique key constraint: ALTER TABLE orders ADD CONSTRAINT UNIQUE KEY(id, userid);
I don't know how the latter will perform and I know the former will help performance but I don't know if it's a good approach in terms of best practices.
Any other ideas would be appreciated.
The table is of orders; there can be multiple open/closed orders for each userid.
WHERE userid = ? AND is_open = 1 would benefit from either of these 'composite' indexes: INDEX(userid, is_open) or INDEX(is_open, userid). The choice of which is better depends on what other queries might benefit from one more than the other.
Moving "closed" orders to another table is certainly a valid option. And it will help performance. (I usually don't recommend it, only because of the clumsy code needed to move rows and/or to search both tables in the few cases where that is needed.)
I see no advantage with UNIQUE(id, userid). Presumably id is already "unique" because of being the PRIMARY KEY? Also, in a composite index, the first column will be checked first; that is what the PK is already doing.
Another approach... The AUTO_INCREMENT PK leads to the data BTree being roughly chronological. But you usually reach into the table by userid? To make that more efficient, change PRIMARY KEY(id), INDEX(userid) to PRIMARY KEY(userid, id), INDEX(id). (However... without knowing the other queries touching this table, I can't say whether this will provide much overall improvement.)
This might be even better:
PRIMARY KEY(userid, is_open, id), -- to benefit many queries
INDEX(id) -- to keep AUTO_INCREMENT happy
The cost of an additional index (on the performance of write operations) is usually more than compensated for by the speedup of Selects.
Setting a unique index on id and userid will gain you nothing, since id is already uniquely indexed as the primary key, and it doesn't feature in your query anyway.
Moving closed orders to a different table will give some performance improvement, but since the closed orders are probably distributed throughout the table, that performance improvement won't be as great as you might expect. It also carries an administrative overhead, requiring that orders be moved periodically, and additional complications with reporting.
Your best solution is likely to be to add an index on userid so that MySQL can go straight to the required userid and search only those rows. You might get a further boost by indexing on userid and is_open instead, but the additional benefit is likely to be small.
Bear in mind that each additional index incurs a performance penalty on every table update. This won't be a problem if your table is not busy.
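A minimal sketch of the composite index discussed above (the index name is an assumption):

```sql
-- Composite index to serve: SELECT * FROM orders WHERE userid = ? AND is_open = 1
-- Column order (userid, is_open): the equality on userid narrows first, then is_open.
ALTER TABLE orders ADD INDEX idx_userid_open (userid, is_open);

-- Verify that the optimizer actually picks it:
EXPLAIN SELECT * FROM orders WHERE userid = 42 AND is_open = 1;
```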

MySQL database with thousands of tables

So I'm building a database in MySQL which contains approximately 20,000 tables, one for each human gene, where each gene's table has a single column listing the alternative names (synonyms) for this gene found in the literature, and often times there's no logic to these synonyms and they exist purely for historical reasons.
First off, is there a better way to set up this database with fewer tables?
The problem is that each gene has a variable number of alternative names, so it's not like I can make one big table with each row corresponding to a gene and a set number of columns. And even if each gene had the same number of alternative names, any particular column would basically be meaningless, since, for example, there would be no relationship between the synonym in column 1 for gene 1 and the synonym in column 1 for gene 2.
What exactly is bad about having thousands of tables in MySQL?
I could potentially break the database up into 23 databases (one for each chromosome), or something like that, and then each database would only have ~900 tables, would something like that be better?
I almost feel like maybe MySQL (a relational database) is the wrong tool for the job. If that's the case, what would be a better database paradigm?
20,000 tables is a lot of tables. There's nothing necessarily "bad" about having 20,000 tables, if you actually need 20,000 tables. We run with innodb_file_per_table, so that's a whole slew of files, and we'd potentially be running up against some limits in MySQL (innodb_open_files, open_files_limit, table_open_cache), which are in turn limited by the OS ulimit.
Add to that the potential difficulty of managing a large number of identical tables. If I need to add a column, I'd need to add that column to 20,000 tables. That's 20,000 ALTER TABLE statements. And if I miss some tables, the tables won't be identical anymore. I just don't want to go there if I can help it.
I'd propose considering a different design.
As a first cut, something like:
CREATE TABLE gene_synonym
( gene VARCHAR(64)
, synonym VARCHAR(255)
, PRIMARY KEY (gene, synonym)
) ENGINE=InnoDB
;
To add a synonym for a gene, rather than inserting a value into a single column of a particular table:
INSERT INTO gene_synonym (gene, synonym) VALUES ('alzwhatever','iforgot');
And for querying, instead of figuring out which of the 20,000 tables to query, we would query just one table and include a condition on the gene column:
SELECT gs.synonym
FROM gene_synonym gs
WHERE gs.gene = 'alzwhatever'
ORDER BY gs.synonym
The WHERE clause makes it so we can view a subset of the one big table, the set returned will emulate one of the currently individual tables.
And if I needed to search for a synonym, I could query just this one table:
SELECT gs.gene
FROM gene_synonym gs
WHERE gs.synonym = 'iforgot'
To do that same search with 20,000 tables, I would need 20,000 different SELECTs, one for each of the 20,000 tables.
I just took a swag at the datatypes. Since MySQL has a limit of 64 characters for a table name, I limited the gene column to 64 characters.
We could populate the gene column with the names of the tables in the current design.
However, what this table can't emulate is an empty table, i.e. a gene that doesn't have any synonyms. (Or maybe our design would make the name of the gene a synonym of itself, so we'd have a row ('alzwhatever','alzwhatever').)
In either case, we'd likely also want to add a table like this:
CREATE TABLE gene
( gene VARCHAR(64)
, PRIMARY KEY (gene)
) ENGINE=InnoDB
;
This is the table that would have the 20,000 rows, one row for each of the tables in your current design.
Further, we can add a foreign key constraint:
ALTER TABLE gene_synonym
ADD CONSTRAINT FK_gene_synonym_gene
FOREIGN KEY (gene) REFERENCES gene (gene)
ON UPDATE CASCADE ON DELETE CASCADE
;
This design is much more in keeping with the normative pattern for relational databases.
This isn't to say that other designs are "bad"; just that this design would be more typical.
You should have a synonym table. One such table:
create table geneSynonyms (
geneSynonymId int auto_increment primary key,
geneId int not null,
synonym varchar(255),
constraint fk_geneSynonyms_geneId foreign key (geneId) references genes(geneId),
constraint unq_geneSynonyms_synonym unique (synonym) -- I assume this is unique
);
Then, you have one row for each synonym for all the genes in a single table.
What is bad about having thousands of tables? Here are a few things:
The data storage is very inefficient. The minimum space occupied by a table is a data page. If you don't fill the page you are wasting space.
By wasting space, you end up filling the page cache with nearly empty pages. This means less data fits into memory, which adversely affects performance.
Your queries are hard-wired to the table being accessed. You cannot write generic code for multiple genes.
You cannot make changes to your data structure, easily.
You cannot validate data, by having rules that say "a synonym needs to be unique across all genes".
You cannot readily find the gene that a synonym refers to.
Improving performance by, say, adding indexes or partitioning the data is a nightmare.
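With the single geneSynonyms table above, the cross-gene lookups that are impractical with 20,000 tables each become one query; a sketch (the gene id 123 is a made-up example value):

```sql
-- Find the gene a synonym refers to (one query instead of 20,000):
SELECT g.geneId
FROM genes g
JOIN geneSynonyms gs ON gs.geneId = g.geneId
WHERE gs.synonym = 'iforgot';

-- List all synonyms for one gene:
SELECT gs.synonym
FROM geneSynonyms gs
WHERE gs.geneId = 123   -- hypothetical gene id
ORDER BY gs.synonym;
```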

One to Many Database

I have created a database with One to many relationship
The parent table, say Master, has 2 columns: NodeId, NodeName; NodeId is the primary key and is of type int, the rest are of type varchar.
The child table, say Student, has 5 columns: NodeId, B, M, F, T; NodeId is the foreign key here.
None of the columns B, M, F, T are unique, and they can have null values, hence none of them has been defined as a primary key.
Assume the Student table has more than 2,000,000 rows.
My fetch query is
SELECT * FROM STUDENT WHERE NODEID = 1 AND B='1-123'
I would like to improve the speed of fetching , Any suggestion regarding improvement of the DB structure or alternative fetch query would be really helpful or any suggestion that can improve overall efficiency is most welcome.
Since the foreign key column is not necessarily indexed for this query, adding indexes on NodeId and on B in Student might improve query performance, if insert performance is not as big of an issue.
Update:
An index is essentially a way of keeping your data sorted to speed up searches/queries. It is good enough to think of it as an ordered list.
An index is quite transparent, so your query would remain exactly the same.
A plain (non-unique) index does allow rows with the same indexed fields, so it should be fine here.
It is worth mentioning that a primary key column is indexed by default; however, a PK does not allow duplicate data.
Also, since the index keeps an ordering of your data, insertion time will increase; however, if your dataset is big, query time should become faster.
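A sketch of the suggestion above; a single composite index matching both equality conditions would likely serve the query best (the index name is an assumption):

```sql
-- Composite index covering: WHERE NodeId = 1 AND B = '1-123'
ALTER TABLE Student ADD INDEX idx_node_b (NodeId, B);

-- Check the plan:
EXPLAIN SELECT * FROM Student WHERE NodeId = 1 AND B = '1-123';
```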

Should I put a non-clustered index on these foreign keys in a fact table

Profile of the foreign keys
FK    Distinct Values   %
----  ---------------   ------
Id1                 1     0.1%
Id2                 4     0.3%
Id3                 5     0.3%
Id4                 6     0.4%
Id5                 6     0.4%
Id6                95     6.1%
Id7                97     6.2%
Id8              1423    90.7%
All foreign keys already make up the clustered Primary Key. This fact table is part of a star schema that includes 6 dimensions (Id's 6,7, and 8 reference the same date dimension).
Fact table currently has approx 1800 rows (incredibly small), and is expected to grow by that amount each month.
Should each foreign key have its own non-clustered non-unique single column index for facilitating joins? If so, why?
Each foreign key will be part of a clustered index (primary key) in its dimension table.
If indexes should be put on the foreign keys, then what should the fill factor and padding index be set to given the low cardinality of the columns?
Your profile doesn't really make sense with the "%" column - why are you finding the "percentage" of distinct values across fields? You need stats on the distribution of the distinct values - are 99% of the keys on Id8 the same? Are they evenly distributed? Etc.
Note that everything I'm saying here applies to larger tables. With 1800 rows / month, indexes are probably a waste of space and time for you to worry about.
#jrara's "rule" about indexing all the dims is an easy rule to apply, but you can easily make mistakes if that's all you do. I wouldn't want to use an Oracle bitmap index on my 100-million-row customer dimension, for example.
Indexing depends on what the queries look like against your data. Indexes won't help if you are doing a full scan of the fact table to perform aggregation and grouping for "summary" reports. They will help when a user is trying to filter on an attribute of a dimension, and that filter results in you only having to look up a small percentage of the records from the fact table. Is there a main entry point to your table? Do people typically filter on an attribute of the "Id8" dimension, then want grouping on an attribute from the other dimensions?
Essentially the answers to your questions are:
Should each foreign key have its own non-clustered non-unique single column index for facilitating joins?
In general, yes, as long as the dimension tables are small and the dim keys are relatively evenly distributed in the fact table. It is usually worse to use an index access to fetch 99% of the fact table rows.
what should the fill factor and padding index be set to given the low cardinality of the columns?
Lowering the FILLFACTOR below 100% will cause slower index reads, since there are more (partially empty) pages in the index for the DB to read. Since a data warehouse is designed for fast selects, I don't really ever recommend adjusting the fillfactor down.
That being said, in a few cases adjusting your FILLFACTOR may make sense: if the fact table is very large (hundreds of GB / TB), index rebuilds take hours, and you might only rebuild indexes once a month or even less often. In that case you need to figure out how much data (as a percentage) you'll be adding to the table each day, and set the fillfactor accordingly.
First of all, I think that you should not create a clustered primary key based on the foreign keys. A clustered index organizes the data on disk, and it is better if it is:
narrow
numeric
increasing (strictly monotonic)
So I think it is better to create e.g. a unique constraint on the foreign keys to make rows unique. Or create a non-clustered primary key on those columns and then create a clustered index (but not a primary key) on e.g. the date foreign key (YYYYMMDD).
Usually the foreign keys are indexed (non-clustered, non-unique) on the fact table to make searches faster. But some people do not enforce referential integrity at all on a dimensional model (the ETL takes care of it), because primary key - foreign key constraints make ETL loads slow.
From Vincent Rainardi
Question: How do you index a fact table? And explain why. {H}
Answer: Index all the dim key columns, individually, non clustered
(SQL Server) or bitmap (Oracle). The dim key columns are used to join
to the dimension tables, so if they are indexed the join will be
faster. An exceptional candidate will suggest 3 additional things: a)
index the fact key separately, b) consider creating a covering index
in the right order on the combination of dim keys, and c) if the fact
table is partitioned the partitioning key must be included in all
indexes.
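Vincent Rainardi's rule above, sketched as SQL Server DDL; the fact table name FactSales is an assumption, and the dim key columns are taken from the question's profile:

```sql
-- One non-clustered index per dim key column, so joins to each dimension can seek:
CREATE NONCLUSTERED INDEX IX_Fact_Id6 ON FactSales (Id6);
CREATE NONCLUSTERED INDEX IX_Fact_Id7 ON FactSales (Id7);
CREATE NONCLUSTERED INDEX IX_Fact_Id8 ON FactSales (Id8);

-- Optionally, a covering composite index for a common filter-then-group pattern,
-- with the most selective dim key (Id8) first:
CREATE NONCLUSTERED INDEX IX_Fact_Id8_Id6_Id7 ON FactSales (Id8, Id6, Id7);
```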

Creating an index on a timestamp to optimize query

I have a query of the following form:
SELECT * FROM MyTable WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime]
I would like to optimize this query, and I am thinking about putting an index on timestamp, but am not sure if this would help. Ideally I would like to make timestamp a clustered index, but MySQL does not support clustered indexes, except for primary keys.
MyTable has 4 million+ rows.
Timestamp is actually of type INT.
Once a row has been inserted, it is never changed.
The number of rows with any given Timestamp is on average about 20, but could be as high as 200.
Newly inserted rows have a Timestamp that is greater than most of the existing rows, but could be less than some of the more recent rows.
Would an index on Timestamp help me to optimize this query?
No question about it. Without the index, your query has to look at every row in the table. With the index, the query will be pretty much instantaneous as far as locating the right rows goes. The price you'll pay is a slight performance decrease in inserts; but that really will be slight.
You should definitely use an index. MySQL has no clue what order those timestamps are in, and in order to find a record for a given timestamp (or timestamp range) it needs to look through every single record. And with 4 million of them, that's quite a bit of time! Indexes are your way of telling MySQL about your data -- "I'm going to look at this field quite often, so keep a list of where I can find the records for each value."
Indexes in general are a good idea for regularly queried fields. The only downside to defining indexes is that they use extra storage space, so unless you're real tight on space, you should try to use them. If they don't apply, MySQL will just ignore them anyway.
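A minimal sketch of the index suggested above (table and column names from the question; the epoch values in the EXPLAIN are made-up examples):

```sql
-- Index the INT timestamp so the range predicate becomes a range scan
-- instead of a full table scan:
ALTER TABLE MyTable ADD INDEX idx_timestamp (Timestamp);

-- The query itself is unchanged; EXPLAIN should now show a range access on idx_timestamp:
EXPLAIN SELECT * FROM MyTable WHERE Timestamp > 1600000000 AND Timestamp < 1600086400;
```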
I don't disagree with the importance of indexing to improve select query times, but if you can index on other keys (and form your queries with these indexes), the need to index on timestamp may not be needed.
For example, if you have a table with timestamp, category, and userId, it may be better to create an index on userId instead. In a table with many different users this will reduce considerably the remaining set on which to search the timestamp.
...and if I'm not mistaken, the advantage of this would be avoiding the overhead of maintaining the timestamp index on each insertion -- in a table with high insertion rates and highly unique timestamps this could be an important consideration.
I'm struggling with the same problems of indexing based on timestamps and other keys. I still have testing to do so I can put proof behind what I say here. I'll try to postback based on my results.
A scenario for better explanation:
timestamp 99% unique
userId 80% unique
category 25% unique
Indexing on timestamp will quickly reduce query results to 1% the table size
Indexing on userId will quickly reduce query results to 20% the table size
Indexing on category will quickly reduce query results to 75% the table size
Insertion with indexes on timestamp will have high overhead **
Even though we know our insertions will have incrementing timestamps, I don't see any discussion of MySQL optimisation based on incremental keys.
Insertion with indexes on userId will have reasonably high overhead.
Insertion with indexes on category will have reasonably low overhead.
** I'm sorry, I don't know the calculated overhead or insertion with indexing.
If your queries are mainly using this timestamp, you could test this design (enlarging the Primary Key with the timestamp as first part):
CREATE TABLE perf (
ts INT NOT NULL
, oldPK INT NOT NULL -- the previous auto-increment id
, ... other columns
, PRIMARY KEY (ts, oldPK)
, UNIQUE (oldPK)
) ENGINE=InnoDB ;
This will ensure that the queries like the one you posted will be using the clustered (primary) key.
The disadvantage is that your inserts will be a bit slower. Also, if you have other indices on the table, they will use a bit more space (as they will include the 4-bytes-wider primary key).
The biggest advantage of such a clustered index is that queries with big range scans, e.g. queries that have to read large parts of the table or the whole table will find the related rows sequentially and in the wanted order (BY timestamp), which will also be useful if you want to group by day or week or month or year.
The old PK can still be used to identify rows by keeping a UNIQUE constraint on it.
You may also want to have a look at TokuDB, a MySQL (and open source) variant that allows multiple clustered indices.