I have inherited a legacy application which makes use of a self-referencing table to facilitate a hierarchical structure. This results in recursive method calls which are creating a "bad smell".
The parent_id column references the primary key of the same table, and there are roughly 25 million records here:
+-----------+---------+------+-----+---------+----------------+
| Field     | Type    | Null | Key | Default | Extra          |
+-----------+---------+------+-----+---------+----------------+
| phase_id  | int(10) | NO   | PRI | NULL    | auto_increment |
| plat_id   | int(10) | YES  | MUL | NULL    |                |
| name      | text    | YES  |     | NULL    |                |
| parent_id | int(10) | YES  | MUL | NULL    |                |
| plan_id   | int(10) | YES  | MUL | NULL    |                |
+-----------+---------+------+-----+---------+----------------+
mysql> show table status like 'ref'\G
*************************** 1. row ***************************
           Name: phase
         Engine: MyISAM
        Version: 10
     Row_format: Dynamic
           Rows: 25223658
 Avg_row_length: 20
    Data_length: 509450960
Max_data_length: 281474976710655
   Index_length: 1026267136
      Data_free: 0
 Auto_increment: 25238013
I have a few questions about this kind of structure:
Is it generally bad practice to implement a self-referencing table? The main negative I can think of is that it is difficult/impossible to get the maximum depth of the hierarchy in a single query, as there may be any number of levels of children (see the sketch after these questions).
Is it worth redesigning this? Having so much data makes it more difficult to move it around.
What are my options? I have heard a little bit about table partitioning, but don't know if it is suitable in my scenario.
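For what it's worth, if the server can be upgraded to MySQL 8.0+, a recursive CTE can walk the hierarchy and compute its maximum depth in a single statement. A minimal sketch, assuming the table is named phase as in the SHOW TABLE STATUS output above:

WITH RECURSIVE phase_tree AS (
    SELECT phase_id, 1 AS depth
    FROM phase
    WHERE parent_id IS NULL              -- root nodes
    UNION ALL
    SELECT p.phase_id, t.depth + 1
    FROM phase p
    JOIN phase_tree t ON p.parent_id = t.phase_id
)
SELECT MAX(depth) AS max_depth
FROM phase_tree;

(On 25 million MyISAM rows this still walks the whole tree, so it is a correctness illustration rather than a performance fix.)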
Any pointers would be really appreciated.
We ended up scrapping the existing self-referencing table and created a new table to house a simpler model.
Thanks to Sebas for the link above. There is a lot of goodness in there!
I have a MySQL database with the following structure:
mysql> describe company;
+-------+-------------+------+-----+---------+----------------+
| Field | Type        | Null | Key | Default | Extra          |
+-------+-------------+------+-----+---------+----------------+
| id    | int         | NO   | PRI | NULL    | auto_increment |
| name  | varchar(50) | NO   |     | NULL    |                |
+-------+-------------+------+-----+---------+----------------+
mysql> describe nameserver;
+-----------+--------------+------+-----+---------+----------------+
| Field     | Type         | Null | Key | Default | Extra          |
+-----------+--------------+------+-----+---------+----------------+
| id        | int          | NO   | PRI | NULL    | auto_increment |
| companyId | int          | NO   | MUL | NULL    |                |
| ns        | varchar(250) | NO   | MUL | NULL    |                |
+-----------+--------------+------+-----+---------+----------------+
mysql> describe domain;
+--------------+--------------+------+-----+-------------------+-------------------+
| Field        | Type         | Null | Key | Default           | Extra             |
+--------------+--------------+------+-----+-------------------+-------------------+
| id           | int          | NO   | PRI | NULL              | auto_increment    |
| nameserverId | int          | NO   | MUL | NULL              |                   |
| domain       | varchar(250) | NO   | MUL | NULL              |                   |
| tld          | varchar(20)  | NO   | MUL | NULL              |                   |
| createDate   | datetime     | NO   |     | CURRENT_TIMESTAMP | DEFAULT_GENERATED |
| updatedAt    | datetime     | YES  |     | NULL              |                   |
| status       | tinyint      | NO   |     | NULL              |                   |
| fileNo       | smallint     | NO   | MUL | NULL              |                   |
+--------------+--------------+------+-----+-------------------+-------------------+
The index structure:
-- Indexes for table `company`
--
ALTER TABLE `company`
ADD PRIMARY KEY (`id`);
--
-- Indexes for table `domain`
--
ALTER TABLE `domain`
ADD PRIMARY KEY (`id`),
ADD KEY `nameserver` (`nameserverId`),
ADD KEY `domain` (`domain`),
ADD KEY `tld` (`tld`),
ADD KEY `fileNo` (`fileNo`);
--
-- Indexes for table `nameserver`
--
ALTER TABLE `nameserver`
ADD PRIMARY KEY (`id`),
ADD KEY `company` (`companyId`),
ADD KEY `ns` (`ns`);
--
-- AUTO_INCREMENT for dumped tables
--
--
-- AUTO_INCREMENT for table `company`
--
ALTER TABLE `company`
MODIFY `id` int NOT NULL AUTO_INCREMENT;
--
-- AUTO_INCREMENT for table `domain`
--
ALTER TABLE `domain`
MODIFY `id` int NOT NULL AUTO_INCREMENT;
--
-- AUTO_INCREMENT for table `nameserver`
--
ALTER TABLE `nameserver`
MODIFY `id` int NOT NULL AUTO_INCREMENT;
--
-- Constraints for dumped tables
--
--
-- Constraints for table `domain`
--
ALTER TABLE `domain`
ADD CONSTRAINT `nameserver` FOREIGN KEY (`nameserverId`) REFERENCES `nameserver` (`id`);
--
-- Constraints for table `nameserver`
--
ALTER TABLE `nameserver`
ADD CONSTRAINT `company` FOREIGN KEY (`companyId`) REFERENCES `company` (`id`);
The amount of data is as follows:
domain table: about 500 million records
nameserver table: about 2 million records
Running this query takes about 4 hours to return the result:
SELECT DISTINCT domain
FROM domain
INNER JOIN nameserver ON nameserver.id = domain.nameserverId
WHERE nameserver.companyId = 2;
The EXPLAIN result for the above query:
+----+-------------+------------+------------+------+-------------------+------------+---------+-----------------------+------+----------+------------------------------+
| id | select_type | table      | partitions | type | possible_keys     | key        | key_len | ref                   | rows | filtered | Extra                        |
+----+-------------+------------+------------+------+-------------------+------------+---------+-----------------------+------+----------+------------------------------+
|  1 | SIMPLE      | nameserver | NULL       | ref  | PRIMARY,company   | company    | 4       | const                 | 1738 |   100.00 | Using index; Using temporary |
|  1 | SIMPLE      | domain     | NULL       | ref  | nameserver,domain | nameserver | 4       | tldzone.nameserver.id |  716 |   100.00 | NULL                         |
+----+-------------+------------+------------+------+-------------------+------------+---------+-----------------------+------+----------+------------------------------+
My question is: how can I improve the speed of this query?
It is possible for me to change the DB structure or even replace it with another DBMS.
MySQL is running on a VPS with 8.0 GB of RAM and a dual-core CPU.
nameserver: INDEX(companyId, id) -- in this order (you have this)
domain: INDEX(nameserverId, domain) -- in this order
("MUL" does not tell me whether you already have either of these composite indexes. SHOW CREATE TABLE is more descriptive than DESCRIBE.)
1. Add indexes to the relevant columns: adding indexes to the companyId, nameserverId, and domain columns in the nameserver and domain tables can help to speed up the query by allowing the database to quickly locate the relevant rows.
2. Use a covering index: a covering index is an index that includes all the columns that are used in the query. By creating a covering index on the companyId, nameserverId, and domain columns, you can avoid the need for the database to look up the data in the actual tables, which can improve query performance.
3. Use a column-store index: a column-store index is an index that stores data by column rather than by row. Column-store indexes can be more efficient for querying large datasets and can improve the performance of the query you provided.
4. Use a database management system that is optimized for large datasets: if you are using a database management system that is not well-suited to handling large datasets, you may see improved performance by switching to a different system. Some options to consider include column-oriented database management systems such as Vertica or ClickHouse, or distributed database management systems such as Cassandra or HBase.
5. Consider using a distributed database: if you have a very large dataset and are still experiencing slow query performance, you may want to consider using a distributed database management system, which allows you to spread your data across multiple servers and can improve the scalability and performance of your database.
Keep in mind that the specific solutions that work best for you will depend on the specific requirements of your database and the workload you are placing on it. It may be helpful to perform some benchmarking and testing to determine which approaches work best for your needs.
I have 2 tables called applications and filters. The structure of the tables are as follows:
mysql> DESCRIBE applications;
+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| id        | tinyint(3) unsigned | NO   | PRI | NULL    | auto_increment |
| name      | varchar(255)        | NO   |     | NULL    |                |
| filter_id | int(3)              | NO   |     | NULL    |                |
+-----------+---------------------+------+-----+---------+----------------+
3 rows in set (0.01 sec)
mysql> DESCRIBE filters;
+----------+----------------------+------+-----+---------+----------------+
| Field    | Type                 | Null | Key | Default | Extra          |
+----------+----------------------+------+-----+---------+----------------+
| id       | smallint(5) unsigned | NO   | PRI | NULL    | auto_increment |
| name     | varchar(100)         | NO   |     | NULL    |                |
| label    | varchar(255)         | NO   |     | NULL    |                |
| link     | varchar(255)         | NO   |     | NULL    |                |
| anchor   | varchar(100)         | NO   |     | NULL    |                |
| group_id | tinyint(3) unsigned  | NO   | MUL | NULL    |                |
| comment  | varchar(255)         | NO   |     | NULL    |                |
+----------+----------------------+------+-----+---------+----------------+
7 rows in set (0.02 sec)
What I want to do is select all the records in applications and make a corresponding record in filters (so that filters.name is the same as applications.name). When the record is inserted in filters I want to get the primary key (filters.id) of the newly inserted record - which is an auto increment field - and update applications.filter_id with it. I should clarify that applications.filter_id is a field I've created for this purpose and contains no data at the moment.
I am a PHP developer and have written a script which can do this, but want to know if it's possible with a pure MySQL solution. In pseudo-code the way my script works is as follows:
Select all the records in applications
Do a foreach loop on (1)
Insert a record in filters (filters.name == applications.name)
Store the inserted ID (filters.id) in a variable, then update applications.filter_id with the variable's data.
I'm unaware of how to do the looping (2) and storing the auto increment ID (4) in MySQL.
I have read "Get the new record primary key ID from MySQL insert query?" so I am aware of LAST_INSERT_ID(), but I'm not sure how to reference this in some kind of "loop" which goes through each of the applications records.
Please can someone advise if this is possible?
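For reference, the four steps above can be expressed in pure MySQL as a stored procedure that loops with a cursor. An untested sketch; the procedure name is mine, and the other NOT NULL columns of filters are filled with placeholder values so the INSERT succeeds:

DELIMITER $$
CREATE PROCEDURE backfill_application_filters()
BEGIN
    DECLARE done INT DEFAULT 0;
    DECLARE v_id TINYINT UNSIGNED;
    DECLARE v_name VARCHAR(255);
    DECLARE cur CURSOR FOR SELECT id, name FROM applications;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
    OPEN cur;
    read_loop: LOOP
        FETCH cur INTO v_id, v_name;
        IF done THEN LEAVE read_loop; END IF;
        -- placeholder values for the remaining NOT NULL columns
        INSERT INTO filters (name, label, link, anchor, group_id, comment)
        VALUES (v_name, '', '', '', 0, '');
        UPDATE applications SET filter_id = LAST_INSERT_ID() WHERE id = v_id;
    END LOOP;
    CLOSE cur;
END$$
DELIMITER ;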
I don't think it is possible to do this with only one query in MySQL.
But I think this is a good use case for MySQL triggers.
You could write it like this:
DELIMITER $$
CREATE TRIGGER before_insert_create_application_filter
BEFORE INSERT ON applications FOR EACH ROW
BEGIN
    -- MySQL forbids a trigger from updating the table it fires on,
    -- so run BEFORE INSERT and assign NEW.filter_id directly
    INSERT INTO filters (name) VALUES (NEW.name);
    SET NEW.filter_id = LAST_INSERT_ID();
END$$
DELIMITER ;
I haven't tested this trigger, but it should show you the general approach. Note the BEFORE INSERT form: MySQL won't let a trigger UPDATE the applications table it fires on (error 1442), so the new filter_id is assigned to NEW directly.
If you don't know MySQL triggers, you can read this part of the documentation.
This isn't an answer to your question, more a comment on your database design.
First of all, if the name fields need to contain the same information, they should have the same type and size (varchar(255)).
Overall though, I think the schema you're using for your tables is wrong. Your description says that each record in applications can only hold one filter_id. If that is the case, there's no point in using two separate tables.
If there is a chance that there will be a many-to-one relationship, link the records via the relevant primary key. If multiple records in application can relate to a single filter, store filters.id in the applications table. If there are multiple filters for a single application, store applications.id in the filters table.
If there is a many-to-many relationship, create another table to store it:
CREATE TABLE `application_filters_mappings` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`application_id` int(10) unsigned NOT NULL,
`filters_id` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
);
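With a mapping table like that, fetching all filters for one application is a plain join; a hypothetical example:

SELECT f.*
FROM filters f
JOIN application_filters_mappings m ON m.filters_id = f.id
WHERE m.application_id = 42;  -- hypothetical application id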
I recently encountered a problem involving the MySQL DBMS.
The Table is like this:
CREATE TABLE `orders` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(60) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
`sex` enum('男','女') DEFAULT NULL,
`amount` float(10,2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `name_i` (`name`),
KEY `sex` (`sex`)
) ENGINE=InnoDB AUTO_INCREMENT=5000001 DEFAULT CHARSET=utf8
As shown above, I created a single-column index on the name column.
I want to perform a range query on name, and the EXPLAIN output is:
mysql> explain select * from orders where name like '王%';
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
| id | select_type | table  | partitions | type  | possible_keys | key    | key_len | ref  | rows  | filtered | Extra                            |
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
|  1 | SIMPLE      | orders | NULL       | range | name_i        | name_i | 183     | NULL | 20630 |   100.00 | Using index condition; Using MRR |
+----+-------------+--------+------------+-------+---------------+--------+---------+------+-------+----------+----------------------------------+
1 row in set, 1 warning (0.10 sec)
so it should use the name_i index and finish the query in a flash (a classmate's run took 0.07 sec).
However, this is how it turned out:
| 4998119 | 王缝 | 27 | 男 | 159.21 |
| 4998232 | 王求葬 | 19 | 男 | 335.65 |
| 4998397 | 王倘予 | 49 | 女 | 103.39 |
| 4998482 | 王厚 | 77 | 男 | 960.69 |
| 4998703 | 王啄淋 | 73 | 女 | 458.85 |
| 4999106 | 王般埋 | 70 | 女 | 700.98 |
| 4999359 | 王胆具 | 31 | 女 | 362.83 |
| 4999510 | 王铁脾 | 31 | 女 | 973.09 |
| 4999880 | 王战万 | 59 | 女 | 127.28 |
| 4999928 | 王忆 | 42 | 女 | 72.47 |
+---------+--------+------+------+--------+
11160 rows in set (3.43 sec)
And it seems not to use the index at all, because the data comes back sorted by the primary key id rather than by name (besides, it is far too slow compared to 0.07 sec).
Has anyone encountered the problem too?
What percentage of the table is "Kings" (王) ? If it is more than about 20%, it will choose to do a table scan instead of use the index. (And this may actually be faster.) (Based on Comments, 0.22% of the table is Kings.)
EXPLAIN and the execution of the query are separate things. Although I don't remember proving this, it is possible that the EXPLAIN might say one thing, but the query would work another way.
Do you have 5 million rows in the table? Was the cache 'cold' when you first ran it? And it had to fetch 11,160 rows from disk? Then the second time, all was in cache, so much faster?
Was the table loaded in "alphabetical" (or whatever the Chinese word for that is) order? If so, there is a good chance the ids and the names are in the same order?
Apparently you are using utf8_general_ci COLLATION? Maybe it does not sort Chinese well. (Provide a test case; I'll do some tests.)
I do not understand why it mentioned MRR.
I, too, am baffled by "1 min 32.24sec". The ORDER BY name should have further encouraged the Optimizer to use INDEX(name). Can you turn on the "Optimizer trace"?
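If you have not used the optimizer trace before, it can be captured per session; a minimal sketch:

SET optimizer_trace = 'enabled=on';
SELECT * FROM orders WHERE name LIKE '王%';
SELECT trace FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';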
To really see whether it used the index, do this:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
If the big number(s) look like the number of rows in the table, then it did a table scan. If they look more like 11160, then they used the index.
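A further experiment, if the counters do point at a table scan, is to force the index and compare timings; a sketch:

SELECT * FROM orders FORCE INDEX (name_i) WHERE name LIKE '王%';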
I'm working on an "online streaming" project and I need some help in constructing a DB for best performance. Currently I have one table containing all relevant information for the player, including file, poster image, post_id etc.
+---------------+-------------+------+-----+---------+----------------+
| Field         | Type        | Null | Key | Default | Extra          |
+---------------+-------------+------+-----+---------+----------------+
| id            | int(11)     | NO   | PRI | NULL    | auto_increment |
| post_id       | int(11)     | YES  |     | NULL    |                |
| file          | mediumtext  | NO   |     | NULL    |                |
| thumbs_img    | mediumtext  | YES  |     | NULL    |                |
| thumbs_size   | mediumtext  | YES  |     | NULL    |                |
| thumbs_points | mediumtext  | YES  |     | NULL    |                |
| poster_img    | mediumtext  | YES  |     | NULL    |                |
| type          | int(11)     | NO   |     | NULL    |                |
| uuid          | varchar(40) | YES  |     | NULL    |                |
| season        | int(11)     | YES  |     | NULL    |                |
| episode       | int(11)     | YES  |     | NULL    |                |
| comment       | text        | YES  |     | NULL    |                |
| playlistName  | text        | YES  |     | NULL    |                |
| time          | varchar(40) | YES  |     | NULL    |                |
| mini_poster   | mediumtext  | YES  |     | NULL    |                |
+---------------+-------------+------+-----+---------+----------------+
With 100k records it takes around 0.5 sec per query, and performance is constantly degrading as I get more records.
+----------+------------+----------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                                |
+----------+------------+----------------------------------------------------------------------+
|        1 | 0.04630675 | SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1' |
+----------+------------+----------------------------------------------------------------------+
explain SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1';
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
| id | select_type | table           | type | possible_keys | key  | key_len | ref  | rows  | Extra       |
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
|  1 | SIMPLE      | dle_playerFiles | ALL  | NULL          | NULL | NULL    | NULL | 61777 | Using where |
+----+-------------+-----------------+------+---------------+------+---------+------+-------+-------------+
How can I improve the DB structure? How do big websites like YouTube construct their databases?
Generally when query time is directly proportional to the number of rows, that suggests a table scan, which means for a query like
SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1'
The database is executing that literally, as in, iterate over every single row and check if it meets criteria.
The typical solution to this is an index, which is a precomputed list of values for a column (or set of columns) and a list of rows which have said value.
If you create an index on the post_id column on dle_playerFiles, then the index would essentially say
1: <some row pointer>, <some row pointer>, <some row pointer>
2: <some row pointer>, <some row pointer>, <some row pointer>
...
100: <some row pointer>, <some row pointer>, <some row pointer>
...
7000: <some row pointer>, <some row pointer>, <some row pointer>
250000: <some row pointer>, <some row pointer>, <some row pointer>
Therefore, with such an index in place, the above query would simply look at node 7000 of the index and know which rows contain it.
Then the database only needs to read the rows where post_id is 7000 and check if their type is 1.
This will be much quicker because the database never needs to look at every row to handle a query. The costs of an index:
Storage space - this is more data and it has to be stored somewhere
Update time - databases keep indexes in sync with changes to the table automatically, which means that INSERT, UPDATE and DELETE statements will take longer because they need to update the data. For small and efficient indexes, this tradeoff is usually worth it.
For your query, I recommend you create an index on 2 columns. Make them part of the same index, not 2 separate indexes:
create index ix_dle_playerFiles__post_id_type on dle_playerFiles (post_id, type)
Caveats to this working efficiently:
SELECT * is bad here. If you are returning every column, then the database must go to the table to read the columns because the index only contains the columns for filtering. If you really only need one or two of the columns, specify them explicitly in the SELECT clause and add them to your index. Do NOT do this for many columns as it just bloats the index.
Functions and type conversions tend to prevent index usage. Your SQL wraps the integer types post_id and type in quotes so they are interpreted as strings. The database may feel that an index can't be used because it has to convert everything. Remove the quotes for good measure.
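For example, the same filter with numeric literals, so the constants match the column types (keeping SELECT * here only to isolate the quoting change):

SELECT * FROM dle_playerFiles WHERE post_id = 7000 AND type = 1;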
If I read your Duration correctly, it appears to take 0.04630675 (seconds?) to run your query, not 0.5s.
Regardless, proper indexing can decrease the time required to return query results. Based on your query SELECT * FROM dle_playerFiles where post_id in ('7000') AND type='1', an index on post_id and type would be advisable.
Also, if you don't absolutely require all the fields to be returned, use individual column references of the fields you require instead of the *. The fewer fields, the quicker the query will return.
Another way to optimize a query is to ensure that you use the smallest data types possible - especially in primary/foreign key and index fields. Never use a bigint or an int when a mediumint, smallint or better still, a tinyint will do. Never, ever use a text field in a PK or FK unless you have no other choice (this one is a DB design sin that is committed far too often IMO, even by people with enough training and experience to know better) - you're far better off using the smallest exact numeric type possible. All this has positive impacts on storage size too.
I have a database with the following stats
Tables    Data        Index     Total
11        579.6 MB    0.9 GB    1.5 GB
So as you can see the Index is close to 2x bigger. And there is one table with ~7 million rows that takes up at least 99% of this.
I also have two indexes that are very similar
a) UNIQUE KEY `idx_customer_invoice` (`customer_id`,`invoice_no`),
b) KEY `idx_customer_invoice_order` (`customer_id`,`invoice_no`,`order_no`)
Update: Here is the table definition (at least structurally) of the largest table
CREATE TABLE `invoices` (
`id` int(10) unsigned NOT NULL auto_increment,
`customer_id` int(10) unsigned NOT NULL,
`order_no` varchar(10) default NULL,
`invoice_no` varchar(20) default NULL,
`customer_no` varchar(20) default NULL,
`name` varchar(45) NOT NULL default '',
`archived` tinyint(4) default NULL,
`invoiced` tinyint(4) default NULL,
`time` timestamp NOT NULL default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
`group` int(11) default NULL,
`customer_group` int(11) default NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_customer_invoice` (`customer_id`,`invoice_no`),
KEY `idx_time` (`time`),
KEY `idx_order` (`order_no`),
KEY `idx_customer_invoice_order` (`customer_id`,`invoice_no`,`order_no`)
) ENGINE=InnoDB AUTO_INCREMENT=9146048 DEFAULT CHARSET=latin1
Update 2:
mysql> show indexes from invoices;
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table    | Non_unique | Key_name                   | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| invoices |          0 | PRIMARY                    |            1 | id          | A         |     7578066 |     NULL | NULL   |      | BTREE      |         |
| invoices |          0 | idx_customer_invoice       |            1 | customer_id | A         |          17 |     NULL | NULL   |      | BTREE      |         |
| invoices |          0 | idx_customer_invoice       |            2 | invoice_no  | A         |     7578066 |     NULL | NULL   | YES  | BTREE      |         |
| invoices |          1 | idx_time                   |            1 | time        | A         |      541290 |     NULL | NULL   |      | BTREE      |         |
| invoices |          1 | idx_order                  |            1 | order_no    | A         |        6091 |     NULL | NULL   | YES  | BTREE      |         |
| invoices |          1 | idx_customer_invoice_order |            1 | customer_id | A         |          17 |     NULL | NULL   |      | BTREE      |         |
| invoices |          1 | idx_customer_invoice_order |            2 | invoice_no  | A         |     7578066 |     NULL | NULL   | YES  | BTREE      |         |
| invoices |          1 | idx_customer_invoice_order |            3 | order_no    | A         |     7578066 |     NULL | NULL   | YES  | BTREE      |         |
+----------+------------+----------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
My questions are:
Is there a way to find unused indexes in MySQL?
Are there any common mistakes that impact the size of the index?
Can indexA safely be removed?
How can you measure the size of each index? All I get is the total of all indexes.
You can remove index A, because, as you have noted, it is a subset of another index. And it's possible to do this without disrupting normal processing.
The size of the index files is not alarming in itself and it can easily be true that the net benefit is positive. In other words, the usefulness and value of an index shouldn't be discounted because it results in a large file.
Index design is a complex and subtle art involving a deep understanding of the query optimizer explanations and extensive testing. But one common mistake is to include too few fields in an index in order to make it smaller. Another is to test indexes with insufficient, or insufficiently representative data.
I may be wrong, but the first index (idx_customer_invoice) is UNIQUE, the second (idx_customer_invoice_order) is not, so you'll probably lose the uniqueness constraint when you remove it. No?
Is there a way to find unused indexes in MySQL?
The database engine optimizer will select a proper index when attempting to optimize your query. Depending on when you collected statistics on your indexes last, the index which is chosen will vary. Unused indexes could suddenly become used because of new data repartition.
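On MySQL 5.7+ with performance_schema enabled, the sys schema also ships a ready-made view for this; a sketch (it only reflects index usage since the server last restarted):

SELECT * FROM sys.schema_unused_indexes
WHERE object_schema = 'your_db_name';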
Can indexA safely be removed?
I would say yes, if indexA and indexB are B-Tree indexes. This is because an index that starts with the same columns in the same order will have the same structure.
Use
show indexes from table;
to see what indexes you have on a particular table. Cardinality tells you how useful each index is.
You can remove your indexes safely (it will not break the table), but beware: some queries might execute slower. First you should analyze your queries to decide whether you need a certain index or not.
I don't think you can find out the data length of a particular index, though.
BUT, if you assume that an index length of more than twice the data length is abnormal, you are wrong. All of your indexes might be useful ;) If a table holds a lot of information and you have to search it on a large number of columns, its indexes can easily be twice the size of the table's data.
indexA can be removed because indexB includes indexA.
What impacts your index length is your column types and column lengths.
Use:
select index_length from information_schema.tables
where table_name='your_table_name' and
table_schema='your_db_name';
to get your table's index_length.
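For a per-index breakdown rather than the table-wide total, newer MySQL versions (5.6+ with InnoDB persistent statistics) expose an estimated size for each index; a hedged sketch, with stat_value counted in pages:

SELECT index_name,
       stat_value * @@innodb_page_size AS size_bytes
FROM mysql.innodb_index_stats
WHERE database_name = 'your_db_name'
  AND table_name = 'invoices'
  AND stat_name = 'size';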