So I'm building a database in MySQL which contains approximately 20,000 tables, one for each human gene. Each gene's table has a single column listing the alternative names (synonyms) for that gene found in the literature; oftentimes there's no logic to these synonyms and they exist purely for historical reasons.
First off, is there a better way to set up this database with fewer tables?
The problem is that each gene has a variable number of alternative names, so it's not like I can make one big table with each row corresponding to a gene and a set number of columns. And even if each gene had the same number of alternative names, any particular column would basically be meaningless, since, for example, there would be no relationship between the synonym in column 1 for gene 1 and the synonym in column 1 for gene 2.
What exactly is bad about having thousands of tables in MySQL?
I could potentially break the database up into 23 databases (one for each chromosome), or something like that, and then each database would only have ~900 tables. Would that be better?
I almost feel like maybe MySQL (a relational database) is the wrong tool for the job. If that's the case, what would be a better database paradigm?
20,000 tables is a lot of tables. There's nothing necessarily "bad" about having 20,000 tables, if you actually have a need for 20,000 tables. We run with innodb_file_per_table, so that's a whole slew of files, and we'd potentially be running up against some limits in MySQL (innodb_open_files, open_files_limit, table_open_cache), which are in turn limited by the OS ulimit.
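Those limits can be checked from within MySQL; a quick sketch:
SHOW VARIABLES LIKE 'innodb_open_files';
SHOW VARIABLES LIKE 'open_files_limit';
SHOW VARIABLES LIKE 'table_open_cache';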
Add to that the potential difficulty of managing a large number of identical tables. If I needed to add a column, I'd need to add that column to 20,000 tables. That's 20,000 ALTER TABLE statements. And if I miss some tables, the tables won't be identical anymore. I just don't want to go there, if I can help it.
I'd propose you consider a different design.
As a first cut, something like:
CREATE TABLE gene_synonym
( gene VARCHAR(64)
, synonym VARCHAR(255)
, PRIMARY KEY (gene, synonym)
) ENGINE=InnoDB
;
To add a synonym for a gene, rather than inserting a value into a single column of a particular table:
INSERT INTO gene_synonym (gene, synonym) VALUES ('alzwhatever','iforgot');
And for querying, instead of figuring out which of the 20,000 tables to query, we would query just one table and include a condition on the gene column:
SELECT gs.synonym
FROM gene_synonym gs
WHERE gs.gene = 'alzwhatever'
ORDER BY gs.synonym
The WHERE clause lets us view a subset of the one big table; the set returned will emulate one of the current individual tables.
And if I needed to search for a synonym, I could query just this one table:
SELECT gs.gene
FROM gene_synonym gs
WHERE gs.synonym = 'iforgot'
To do that same search with 20,000 tables, I would need 20,000 different SELECTs, one for each of the 20,000 tables.
I just took a swag at the datatypes. Since MySQL has a limit of 64 characters for a table name, I limited the gene column to 64 characters.
We could populate the gene column with the names of the tables in the current design.
However, what this table can't emulate is an empty table, a gene that doesn't have any synonyms. (Or maybe our design would be for the name of the gene to be a synonym of itself, so we'd have a row ('alzwhatever','alzwhatever').)
In either case, we'd likely also want to add a table like this:
CREATE TABLE gene
( gene VARCHAR(64)
, PRIMARY KEY (gene)
) ENGINE=InnoDB
;
This is the table that would have the 20,000 rows, one row for each of the tables in your current design.
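As a sketch, the gene table could even be seeded from the names of the existing tables via information_schema (here genes_db is an assumed name for the current schema):
INSERT INTO gene (gene)
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'genes_db';  -- 'genes_db' is an assumed schema name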
Further, we can add a foreign key constraint
ALTER TABLE gene_synonym
ADD CONSTRAINT FK_gene_synonym_gene FOREIGN KEY (gene) REFERENCES gene (gene)
ON UPDATE CASCADE ON DELETE CASCADE
;
This design is much more in keeping with the normative pattern for relational databases.
This isn't to say that other designs are "bad"; it's just that this design is more typical.
You should have a synonym table. One such table:
create table geneSynonyms (
geneSynonymId int auto_increment primary key,
geneId int not null,
synonym varchar(255),
constraint fk_geneSynonyms_geneId foreign key (geneId) references genes(geneId),
constraint unq_geneSynonyms_synonym unique (synonym) -- I assume this is unique
);
Then, you have one row for each synonym for all the genes in a single table.
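For example, looking up all synonyms for one gene is then a single query (a sketch; geneName is an assumed column on the genes table):
select s.synonym
from genes g
join geneSynonyms s on s.geneId = g.geneId
where g.geneName = 'alzwhatever';  -- geneName is assumed; use whatever identifies a gene in your genes table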
What is bad about having thousands of tables? Here are a few things:
The data storage is very inefficient. The minimum space occupied by a table is a data page. If you don't fill the page you are wasting space.
By wasting space, you end up filling the page cache with nearly empty pages. This means less data fits into memory, which adversely affects performance.
Your queries are hard-wired to the table being accessed. You cannot write generic code for multiple genes.
You cannot easily make changes to your data structure.
You cannot validate data by having rules such as "a synonym needs to be unique across all genes".
You cannot readily find the gene that a synonym refers to.
Improving performance by, say, adding indexes or partitioning the data is a nightmare.
Related
My question is regarding database structuring for a table that links 2 other tables for storing the relationship.
for example, I have 3 tables, users, locations, and users_locations.
users and locations table both have an id column.
users_locations table has the user_id and location_id from the other 2 tables.
how do you define your indexes/constraints on these tables to efficiently answer questions such as what locations does this user have or what users belong to this location?
eg.
select user_id from users_locations where location_id = 5;
or
select location_id from users_locations where user_id = 5;
currently, I do not have a foreign key constraint set, which I assume I should add, but does that automatically speed up the queries or create an index?
I don't think I can create an index on each column since there will be duplicates eg. multiple user_id entries for each location, and vice versa.
Will adding a composite key like PRIMARY_KEY (user_id, location_id) speed up queries when most queries only have half of the key?
Is there any reason to just set an AUTO INCREMENT PRIMARY_KEY field on this table when you will never query by that id?
Do I really even need to set a PRIMARY KEY?
Basically, for any table, the decision to create or not create an index depends entirely on the use cases you need to support. Indexes must always be added on a per-use-case basis, not because they are nice to have.
For the particular queries you have mentioned, separate indexes on both columns are good enough; that is, the query doesn't need to go to your rows to fetch the information.
Creating a foreign key on a table column automatically creates an index, so you need not create indexes yourself if you decide to set up foreign keys.
If you keep an auto-increment key as the primary key, you will still have to make the user_id and location_id combination unique, otherwise you will bloat your table with duplicates. So keeping a separate auto-increment key doesn't make sense in your use case. However, if you want to keep track of each visit to a location and save the user experience each time, then an auto-increment primary key will be a required thing.
However, I would like to point out that creating indexes does not guarantee that your queries will use them unless specified explicitly. For a single query there can be many execution plans, and the most efficient one may not use an index.
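To check which plan the optimizer actually picks, you can prefix a query with EXPLAIN; for example, using one of the queries from the question:
EXPLAIN SELECT user_id FROM users_locations WHERE location_id = 5;
The key column of the output names the index chosen, or NULL if no index was used.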
The optimal indexes for a many-to-many mapping table:
PRIMARY KEY (aid, bid),
INDEX(bid, aid)
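Applied to the users_locations table from the question, that might look like this sketch (column types are assumptions):
CREATE TABLE users_locations (
    user_id INT UNSIGNED NOT NULL,      -- types are assumed; match the id columns of users/locations
    location_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, location_id),            -- serves WHERE user_id = ?
    INDEX ix_location_user (location_id, user_id)  -- serves WHERE location_id = ?
) ENGINE=InnoDB;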
More discussion and more tips: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
(Comments on specific points in the Question)
FOREIGN KEYs implicitly create indexes, unless an explicit index has already been provided.
Composite indexes are better for many-to-many tables.
A FOREIGN KEY involves an integrity check, so it is inherently slower than simply having the index. (And the integrity check for this kind of table is of dubious value.)
There is no need for an AUTO_INCREMENT on a many:many table. However, ...
It is important to have a PRIMARY KEY on every table. The pair of columns is fine as a "natural" PRIMARY KEY.
A WHERE clause would like to use the first column(s) of some index; don't worry that it is not using all the columns.
In EXPLAIN you sometimes see "Using index". This means that a "covering index" was used. That means that all the columns used in the SELECT were found in that one index -- without having to reach into the data to get more columns. This is a performance boost. And it necessitates two two-column indexes (one is the PK, one is a plain INDEX).
With InnoDB, any 'secondary' index (INDEX or UNIQUE) implicitly includes the columns of the PK. So, given PRIMARY KEY(a,b), INDEX(b), that secondary index is effectively INDEX(b,a). I prefer to spell out the two columns to point out to the reader that I deliberately wanted those two columns in that order.
Hopefully, the above link will answer any further questions.
I'm not very expert in SQL and I need to ask for advice about the best way to set up a table that will contain appointments.
My doubt is on the primary key.
My ideas are:
1-Use an auto-increment column for the Id of the appointment (for example unsigned integer).
My doubts about this solution are: the id could eventually overflow (even though the limit is very high), and performance may decrease as the number of records grows.
2-Create a table for every year.
Doubts: it will be complex to maintain and to execute queries.
3-Use a composite index.
Doubts: how to set it up.
4-Other?
Thanks.
Use an autoincrement primary key. MySQL will struggle to process a growing table long before your integer overflows.
MySQL's performance will go down on a large table even if you don't have a primary key. That is when you will start thinking about partitioning (your option 2) and archiving old data. But from the beginning, an autoincrement primary key on a single table should do just fine.
1 - Do you think you will exceed 4 billion rows? Performance degrades if you don't have suitable indexes for your queries, not because of table size. (Well, there is a slight degradation, but not worth worrying about.) Based on 182K/year, MEDIUMINT UNSIGNED (16M max) will suffice.
2 - NO! This is a common question; the answer is always "do not create identical tables".
3 - What column or combination of columns are UNIQUE for the table? Simply list them inside PRIMARY KEY (...)
Number 3 is usually preferred. If there is no unique column(s), go with Number 1.
182K rows per year does not justify PARTITIONing. Consider it if you expect more than a million rows. (Here's an easy prediction: You will re-design this schema before 182K grows to a million.)
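As an illustration of option 3, a composite primary key might look like the sketch below; the column names are assumptions, and the right combination is whatever is genuinely unique for one appointment:
CREATE TABLE appointments (
  room_id SMALLINT UNSIGNED NOT NULL,   -- room_id, starts_at, customer_id are assumed columns
  starts_at DATETIME NOT NULL,
  customer_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (room_id, starts_at)      -- at most one appointment per room per start time
) ENGINE=InnoDB;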
I am creating a simple comparison script and I have some questions about the database structure. Firstly, the database will be huge; I am expecting more than 1 million entries in products.
Secondly, there will be a search form whose search term will be matched (LIKE '%$term%') against the name field, and the results will display the product's related info and the shop's info.
Below you can see my database structure named products.
id int(10) NOT NULL
name varchar(50) NOT NULL
link varchar(50) NOT NULL
description varchar(50) NOT NULL
image varchar(50) NOT NULL
price varchar(50) NOT NULL
My questions are:
Do you suggest I index a field? Users will not be able to insert or update products; the only query will be SELECT to display the results, and I will often update the products from XML feeds for possible product changes.
I have to store the shop info like name, shipping, link, image... This gives me two options: a) create a new table named shops and join the two tables with a new field in products, shopID, that looks up the id in shops to display the info, or b) add this info (name, shipping, ...) as extra fields in products for every single product? (I think the answer is obvious but I need your suggestion.)
Are there any other things I should have in mind, or change?
I am not an advanced programmer and what I learn is through the internet, so maybe the questions are too obvious for you, but for me it's the ticket to learning.
Thank you for your answers.
Indexes are required to fetch records very fast, so yes, they're recommended. But what kind of index would you like to use? The MyISAM engine offers a "regular" string index that you can use with a LIKE clause (e.g. LIKE 'hello%'), but it restricts you from using a wildcard at the beginning of the search phrase. In addition, MyISAM has a FULLTEXT index that allows you to search for words in the whole string, not just at the beginning of the string. So you could create a FULLTEXT index on the columns description and name - but 2 FULLTEXT indexes seem redundant in this case. Maybe you could join those columns and separate the values with a token or a character? If so, you'll need to create only 1 FULLTEXT index on the joined column, which can avoid a lot of fragmentation and save disk space.
One of the cons of the MyISAM engine is that when writing to it (UPDATE/DELETE queries) it locks the entire table. So, if the table is written to many times a minute, other queries will probably hang. That's why you should see if the InnoDB engine suits your needs - it enables concurrent read/write operations on the table.
That's probably a good idea, since having an index on the price column seems essential, and FULLTEXT indexes don't work together with other indexes.
I'd say: Use InnoDB and Sphinx, and have a primary index on id & a regular index on price.
The most important thing for you to understand is that when writing code for specific software, you must be well familiar with that software and its caveats. You should read High Performance MySQL - extremely recommended.
Edit:
If you want to add an index to the products table, you can do that with
ALTER TABLE /* etc */ when the table is empty or contains a small amount of data. If the table has a lot of data, then it's recommended to create another table that's similar to products, alter the new table, and populate it with data from the old products table, e.g.:
CREATE TABLE `products_new` LIKE `products`;
ALTER TABLE `products_new` ADD FULLTEXT (`name`);
LOCK TABLES `products` READ, `products_new` WRITE;
INSERT INTO `products_new` SELECT * FROM `products`;
LOCK TABLES `products` WRITE, `products_new` WRITE;
ALTER TABLE `products` RENAME TO `products_bad`;
ALTER TABLE `products_new` RENAME TO `products`;
/* The following doesn't work:
RENAME TABLE `products` TO `products_bad`, `products_new` TO `products`;
See: http://bugs.mysql.com/bug.php?id=22246
*/
DROP TABLE `products_bad`;
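Once the FULLTEXT index exists, searches can use MATCH ... AGAINST instead of a leading-wildcard LIKE; a minimal sketch:
SELECT `id`, `name`, `price`
FROM `products`
WHERE MATCH(`name`) AGAINST('some term');  -- 'some term' is a placeholder search phrase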
Nikolai,
The ID should be a primary key. That automatically puts an index on ID, and will speed up any queries that need to get specific products.
The shop table should be a second table, but you should have a 3rd table that joins products with shops. At its most basic, it would have two fields: shop_id and product_id. This lets you have a single product in multiple shops. These two fields should be foreign keys to the product table and shop table.
If you are ever thinking about having a different price for a product per shop, then the product_store join table should also contain the price, although the base price could be stored in the products table.
Price should be a decimal, so that you can do calculations on the price field.
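In MySQL that could be a change like this (the precision and scale here are assumptions):
ALTER TABLE products MODIFY price DECIMAL(10,2) NOT NULL;  -- 10,2 is an assumed precision/scale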
1) You should generally index fields that are commonly used. However, since your search on name uses a wildcard at the start, an index will have no effect on this query.
2) Creating a shops table and linking to this would be better.
Price for sure, because something tells me you will search over this field and do orderings.
"Premature optimization is a root of all evil" (c) Donald Knuth. So, I suggest to normalize your tables, so YES - create table for shops. Once your applicated grown big, and you faced to highloads, you will be able to denormalize your database to avoid JOINS (one way to optimize your voracious application)
Get back to stackoverflow with your problem ;-)
Generally you should index fields that will be used intensively. But using a wildcard at the start of your search won't help much.
Better to use another table with a foreign key.
Also, shouldn't the "id" field in your products table be defined as a PRIMARY KEY?
Here are my suggestions:
To be able to search for %term% you need full-text search; a regular index will not do you any good when the search term starts with a wildcard.
Yes, you should put an index on the id column (and probably make it auto increment) since that seems to be the unique column in the table. Other than that, there's no point in us suggesting any other indexes since we don't know which queries you are going to run.
Yes, create another table for shops, otherwise you will have data that is not normalized, for shop name and so on (there might be rare cases that "require" de-normalization, such as optimization, but you are not there yet). Non-normalized data will cause problems; in your specific case, what will you do when a shop needs to change its name? You will have to update all matching rows in the product table.
There are many things you should keep in mind, but they are out of scope for this answer. I suggest that you get to work and learn as you go, because learning by doing is a great way to become a better developer. Then when you hit a specific problem, search for it or post it here on stackoverflow.
I'm a complete newbie with MySQL indexes. I have several MyISAM tables on MySQL 5.0x with utf8 charsets and collations, each with 100k+ records. The primary keys are generally integers. Many columns in each table may have duplicate values.
I need to quickly count, sum, average, or otherwise perform custom calculations on any number of fields in each table or joined on any number of others.
I found this page giving an overview of MySQL index usage: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html, but I'm still not sure I'm using indexes right. Just when I think I've made the perfect index out of a collection of fields I want to calculate against, I get the "index must be under 1000 bytes" error.
Can anyone explain how to most efficiently create and use indexes to speed up queries?
Caveat: upgrading MySQL is not possible in this case. I'm using Navicat Lite for db administration, but this app isn't required.
When you create an index on a column or columns in a MySQL table, the database creates a data structure called a B-tree (assuming you use the default index type), in which the key of each record is a concatenation of the values in the indexed columns.
For example, let's say you have a table that is defined like:
CREATE TABLE mytable (
id int unsigned auto_increment,
column_a char(32) not null default '',
column_b int unsigned not null default 0,
column_c varchar(512),
column_d varchar(512),
PRIMARY KEY (id)
) ENGINE=MyISAM;
Then let's give it some data:
INSERT INTO mytable VALUES (1, 'hello', 2, null, null);
INSERT INTO mytable VALUES (2, 'hello', 3, 'hi', 'there');
INSERT INTO mytable VALUES (3, 'how', 4, 'are', 'you?');
INSERT INTO mytable VALUES (4, 'foo', 5, '', 'bar');
Now suppose you decide to add a key to column_a and column_b like:
ALTER TABLE mytable ADD KEY (column_a, column_b);
The database is going to create the aforementioned B-tree, which will have four keys in it, one for each row:
hello-2
hello-3
how-4
foo-5
When you perform a search that references the column_a column, or that references the column_a AND column_b columns, the database will be able to use this index to narrow the record set it has to examine. Let's say you have a query like:
SELECT ... FROM mytable WHERE column_a = 'hello';
Even though the above query does not specify a value for the column_b column, it can still take advantage of our index by looking for all keys that begin with "hello". For the same reason, if you had a query like:
SELECT ... FROM mytable WHERE column_b = '2';
This query would NOT be able to use our index, because it would have to parse the index keys themselves to try to determine which keys' second value matches '2', which is terribly inefficient.
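If lookups on column_b alone are common, a separate index with column_b as its leading column would serve them, for example:
ALTER TABLE mytable ADD KEY (column_b);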
Now, let's address your original question of the maximum length. Suppose we try to create an index spanning all four non-PK columns in this table:
ALTER TABLE mytable ADD KEY (column_a, column_b, column_c, column_d);
You will get an error:
ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes
In this case our column lengths are 32, 10, 512, and 512, which in a single-byte-per-character situation is 1066, which is above the limit of 1000. Suppose that it DID work; you would be creating the following keys:
hello-2-
hello-3-hi-there
how-4-are-you?
foo-5--bar
Now, suppose that you had values in column_c and column_d that were very long -- 512 characters each. Even in a basic single-byte character set, your keys would now be over 1000 bytes in length, which is what MySQL is complaining about. It gets even worse with multibyte character sets, where seemingly "small" columns can still push the keys over the limit.
If you MUST use a large compound key, one solution is to use InnoDB tables rather than the default MyISAM tables, which support a larger key length (3500 bytes) -- you can do this by specifying ENGINE=InnoDB instead of ENGINE=MyISAM in the declaration above. However, generally speaking, if you are using long keys there is probably something wrong with your table design.
Remember that single-column indexes often provide more utility than multi-column indexes. You want to use a multi-column index when you are going to often/always take advantage of it by specifying all of the necessary criteria in your queries. Also, as others have mentioned, do NOT index every column of a table, since each index adds storage overhead to your database. You want to limit your indexes to the columns that are frequently used by queries, and if it seems like you need too many, you should probably think about breaking your tables up into more logical components.
Indexes generally aren't well suited for custom calculations where the user is able to construct their own queries. Typically you choose the indexes to match the specific queries you intend to run, using EXPLAIN to see if the index is being used.
In the case that you have absolutely no idea what queries might be performed it is generally best to create one index per column - and not one index covering all columns.
If you have a good idea of what queries might be run often you could create an extra index for those specific queries. You can also add indexes later if your users complain that certain types of queries run too slow.
Also, indexes generally aren't that useful for calculating counts, sums and averages since these types of calculations require looking at every row.
It sounds like you are trying to put too many fields into your index. The limit is probably the number of bytes it takes to encode all the fields.
The index is used in looking up the records, so you want to choose the fields which you are "WHERE"ing on. In choosing between those fields, you want to choose the ones that will narrow the results the quickest.
As an example, a filter on Male/Female will usually not help much because you are only going to save about 50% of the time. However, a filter on State may be useful because it breaks the data down into many more categories. Then again, if almost everybody in the database is in a single state, that won't help either.
Remember that indexes are for sorting and finding rows.
The error message you got sounds like it is about the 1000-byte prefix limit for MyISAM table indexes. From http://dev.mysql.com/doc/refman/5.0/en/create-index.html:
The statement shown here creates an index using the first 10 characters of the name column:
CREATE INDEX part_of_name ON customer (name(10));
If names in the column usually differ in the first 10 characters, this index should not be much slower than an index created from the entire name column. Also, using column prefixes for indexes can make the index file much smaller, which could save a lot of disk space and might also speed up INSERT operations.
Prefix support and lengths of prefixes (where supported) are storage engine dependent. For example, a prefix can be up to 1000 bytes long for MyISAM tables, and 767 bytes for InnoDB tables.
Maybe you can try a FULLTEXT index for problematic columns.
Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (say 100,000) and a lot of properties (say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table, I always need information about a particular user. So I use, for example, where user_name='Albert Gates'. So every time the MySQL server needs to analyze 1 billion rows to find those of them which contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column - and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows - it just needs to find the appropriate entry in the index, which is stored in a B-tree, making it possible to find a record in a very small amount of time.
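As a sketch (big_table is an assumed name, since the question doesn't name the table):
ALTER TABLE big_table ADD INDEX ix_user_property (user_name, user_property);  -- big_table is an assumed table name
Because user_name is the leading column, this one composite index also serves plain WHERE user_name = 'Albert Gates' lookups.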
If your application is still slow even after correct indexing, it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table and an integer foreign key is used in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
You should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
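Fetching every property value for a single user (the equivalent of WHERE user_name = 'Albert Gates' in the original one-table design) is then a straightforward join:
select u.username, p.name, v.value
from users u
inner join user_property_values v on v.user_id = u.user_id
inner join properties p on p.property_id = v.property_id
where u.username = 'f00';  -- 'f00' is one of the sample users inserted above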
From a performance perspective the innodb clustered index works wonders in this similar example (COLD run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (250K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number 1 way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we came to find out that the tables had no indexes. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion - also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need to have constraints based on other fields? E.g. start and end dates?
Why not simply have the properties as fields rather than some many-to-many relationship?
Have one flat table. When your business rules begin to show that properties should be grouped, then you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization, and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
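A sketch of that flat approach, reusing the example properties from earlier in this thread (column names and types are assumptions):
create table users
(
user_id int unsigned not null auto_increment primary key,
username varchar(64) not null,
age tinyint unsigned null,           -- columns below are illustrative assumptions
gender enum('Male','Female') null,
start_date date null,
end_date date null
)
engine=innodb;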
One way I regularly see database performance get totally castrated is by having a generic
Id, Property Type, Property Name, Property Value table.
This is really lazy but exceptionally flexible, and it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure; it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave, which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join, which further degrades performance.
If data has a 1:1 relationship with an entity then it should be a field on the same table. If your table gets to more than 30 fields wide, then consider moving them into another table, but don't call it normalisation because it isn't. It is a technique to help developers group fields together, at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns, where null values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.