Optimizing MySQL table structure. Advice needed

I have the table structures below and, while they work, running EXPLAIN on certain SQL queries reports 'Using temporary; Using filesort' for one of the tables. This might hamper performance once the table is populated with thousands of rows. Below are the table structures and an explanation of the system.
CREATE TABLE IF NOT EXISTS `jobapp` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`fullname` varchar(50) NOT NULL,
`icno` varchar(14) NOT NULL,
`status` tinyint(1) NOT NULL DEFAULT '1',
`timestamp` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `icno` (`icno`)
) ENGINE=MyISAM;
CREATE TABLE IF NOT EXISTS `jobapplied` (
`appid` int(11) NOT NULL,
`jid` int(11) NOT NULL,
`jobstatus` tinyint(1) NOT NULL,
`timestamp` int(10) NOT NULL,
KEY `jid` (`jid`),
KEY `appid` (`appid`)
) ENGINE=MyISAM;
The query I tried which gives the aforementioned EXPLAIN output:
EXPLAIN SELECT japp.id, japp.fullname, japp.icno, japp.status, japped.jid, japped.jobstatus
FROM jobapp AS japp
INNER JOIN jobapplied AS japped ON japp.id = japped.appid
WHERE japped.jid = '85'
AND japped.jobstatus = '2'
AND japp.status = '2'
ORDER BY japp.`timestamp` DESC
This system is for recruiting new staff. Once registration is opened, hundreds of applicants will register at the same time. Each applicant is allowed to select 5 different jobs. At the end of the registration session, the admin will go through each job one by one. I have used a single table (jobapplied) to store 2 items (applicant id, job id) recording who applied for what, and this is the table that causes the aforementioned EXPLAIN output. I realize this table has no PRIMARY key, but I just can't figure out any other way for the admin to later search for who has applied for a specific job.
Any advice on how can I optimize the table?

Apart from the missing indexes and primary keys others have mentioned...
This might hamper performance once the table is populated with thousands of data.
You seem to be assuming that the query optimizer will use the same execution plan on a table with thousands of rows as it will on a table with just a few rows. Optimizers don't work like that.
The only reliable way to tell how a particular vendor's optimizer will execute a query on a table with thousands of rows--which is still a small table, and will probably easily fit in memory--is to:
1. load a scratch version of the database with thousands of rows, and
2. EXPLAIN the query you're interested in.
FWIW, the last test I ran like this involved close to a billion rows--about 50 million in each of about 20 tables. The execution plan for that query--which included about 20 left outer joins--was a lot different from the plan for the sample data (just a few thousand rows).
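A minimal sketch of that procedure against the question's schema (the scratch-table name and the row-doubling trick are illustrative; any way of fabricating a few thousand realistic rows will do):
CREATE TABLE jobapp_scratch LIKE jobapp;
INSERT INTO jobapp_scratch (fullname, icno, status, `timestamp`)
SELECT fullname, icno, status, `timestamp` FROM jobapp;
-- repeat this statement until the row count is in the thousands:
INSERT INTO jobapp_scratch (fullname, icno, status, `timestamp`)
SELECT fullname, icno, status, `timestamp` FROM jobapp_scratch;
-- then run the EXPLAIN from the question against jobapp_scratch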

You are ordering by jobapp.timestamp, but there is no index on timestamp, so the filesort (and probably the temporary table) will be necessary. Try adding an index on timestamp to jobapp, something like KEY timid (timestamp, id).
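A sketch of that change (the index name timid comes from the suggestion above; the second, composite variant is an untested assumption that also covers the status filter):
ALTER TABLE jobapp ADD KEY timid (`timestamp`, id);
-- variation worth testing: filter on status, then read rows already in timestamp order
ALTER TABLE jobapp ADD KEY status_time (status, `timestamp`);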

Related

MySQL query is slower after index create [duplicate]

First, I will give some information about my test table. It is a books table with 665,647 rows of data; its full structure is shown under Edit1 below. I ran 10 identical queries for books with an equal price:
select * from books where price = 10
Execution time for all 10 queries was 9s 663ms.
After that I created an index on the price column (books_price_index in the structure below) and tried to run the same 10 queries one more time. Execution time for them was 21s 996ms.
show index from books;
showed very weird data to me. The Cardinality value is just one! What did I do wrong? I was sure indexes are a thing that makes queries faster, not slower.
I found this topic: MySQL index slowing down query, but to be honest I don't really understand it, especially the Cardinality column. In my books table the price field has two possible values at the moment, 10 and 30, yet show index from books; still shows a Cardinality of 1.
#Edit1
SHOW CREATE TABLE books
Result:
CREATE TABLE `books` (
`id` bigint unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) COLLATE utf8mb4_unicode_ci NOT NULL,
`description` text COLLATE utf8mb4_unicode_ci NOT NULL,
`isbn` bigint unsigned NOT NULL,
`price` double(8,2) unsigned NOT NULL,
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
`author_id` bigint unsigned NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `books_isbn_unique` (`isbn`),
KEY `books_author_id_foreign` (`author_id`),
KEY `books_price_index` (`price`),
CONSTRAINT `books_author_id_foreign` FOREIGN KEY (`author_id`) REFERENCES `users` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=665648 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
#Edit2
I added a new index with create index nameIndex on books (name), which has a big Cardinality value. When I try the query select * from books where name = 'Inventore cumque quis.' before and after creating that index, I can see the difference in execution time.
But I still don't understand how indexes work. I was sure about one thing: creating a new index builds a new data structure holding the data that fits that index. For example, with rows priced 10 and 30 I would get two "tables", each holding the rows with one of those prices.
Is it realistic to have so many rows with the same price? Is it realistic to return 444K rows from a query? I ask these because query optimization is predicated on "normal" data.
An index (eg, INDEX(price)) is useful when looking for a price that occurs a small number of times. In fact, the Optimizer shuns the index if it sees that the value being searched for occurs more than about 20% of the time. Instead, it would simply ignore the index and do what you tested first--simply scan the entire table, ignoring any rows that don't match.
You should be able to see that by doing
EXPLAIN select * from books where price = 10
with and without the index. Alternatively, you can try:
EXPLAIN select * from books IGNORE INDEX(books_price_index) where price = 10
EXPLAIN select * from books FORCE INDEX(books_price_index) where price = 10
But, ... It seems that the Optimizer did not ignore the index. I see that the "cardinality" of price is "1", which implies that there is only one distinct value in that column. This 'statistic' is either incorrect or misleading. Please run this and see what changes:
ANALYZE TABLE books;
This will recompute the stats via a few random probes, and may change that "1" to perhaps "2".
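To verify, a quick sketch:
ANALYZE TABLE books;
SHOW INDEX FROM books; -- Cardinality for books_price_index should now be closer to 2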
General advice: Beware of benchmarks that run against fabricated data.
Maybe this?
https://stackoverflow.com/questions/755569/why-does-the-cardinality-of-an-index-in-mysql-remain-unchanged-when-i-add-a-new
Cardinality didn't get updated after the index was created. Try running the ANALYZE TABLE command.

Selecting rows where column value changed from previous row, user variables, innodb

I have a problem similar to
SQL: selecting rows where column value changed from previous row
The accepted answer by ypercube, which I adapted to:
CREATE TABLE `schange` (
`PersonID` int(11) NOT NULL,
`StateID` int(11) NOT NULL,
`TStamp` datetime NOT NULL,
KEY `tstamp` (`TStamp`),
KEY `personstate` (`PersonID`, `StateID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `states` (
`StateID` int(11) NOT NULL AUTO_INCREMENT,
`State` varchar(100) NOT NULL,
`Available` tinyint(1) NOT NULL,
`Otherstatuseshere` tinyint(1) NOT NULL,
PRIMARY KEY (`StateID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SELECT
COALESCE((@statusPre <> s.Available), 1) AS statusChanged,
c.PersonID,
c.TStamp,
s.*,
@statusPre := s.Available
FROM schange c
INNER JOIN states s USING (StateID),
(SELECT @statusPre := NULL) AS d
WHERE PersonID = 1 AND TStamp > "2012-01-01" AND TStamp < "2013-01-01"
ORDER BY TStamp;
The query itself worked just fine in testing, and with the right mix of temporary tables I was able to generate reports with daily sum availability from a huge pile of data in virtually no time at all.
The real problem came when I discovered that the tables were using the MyISAM engine, which we have completely abandoned. I recreated the tables with InnoDB and noticed the query no longer works as expected.
After some bashing of head into wall I discovered that MyISAM seems to go over the columns of each row in order (selecting statusChanged before updating @statusPre), while InnoDB seems to do all the variable assignments first, and only after that populate the result rows, regardless of whether the assignment happens in the SELECT or WHERE clause, in functions (COALESCE, GREATEST, etc.), subqueries or otherwise.
Trying to accomplish this in a query without variables always seems to end the same way: a subquery requiring exponentially more time the more rows are in the set, resulting in an excruciating minutes- (or hours-) long wait to get the beginning and ending events for one status, while a finished report should include daily sums for several.
Can this type of query work on the InnoDB engine, and if so, how should one go about it? Or is the only feasible option to go for a database product that supports WITH statements?
Removing
KEY personstate (PersonID, StateID)
fixes the problem.
No idea why, though. It was not really required anyway; the timestamp key is the more important one and speeds up the query nicely.
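For reference, a sketch of that fix:
ALTER TABLE schange DROP KEY personstate;
And although it postdates this question, MySQL 8.0's window functions sidestep the variable-evaluation-order problem entirely; a sketch of the same query with LAG() (untested against this schema):
SELECT c.PersonID, c.TStamp, s.*,
COALESCE(s.Available <> LAG(s.Available) OVER (ORDER BY c.TStamp), 1) AS statusChanged
FROM schange c
INNER JOIN states s USING (StateID)
WHERE c.PersonID = 1 AND c.TStamp > "2012-01-01" AND c.TStamp < "2013-01-01"
ORDER BY c.TStamp;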

Select rows where column LIKE dictionary word

I have 2 tables:
Dictionary - Contains roughly 36,000 words
CREATE TABLE IF NOT EXISTS `dictionary` (
`word` varchar(255) NOT NULL,
PRIMARY KEY (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Datas - Contains roughly 100,000 rows
CREATE TABLE IF NOT EXISTS `datas` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`hash` varchar(32) NOT NULL,
`data` varchar(255) NOT NULL,
`length` int(11) NOT NULL,
`time` int(11) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `hash` (`hash`),
KEY `data` (`data`),
KEY `length` (`length`),
KEY `time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=105316 ;
I would like to somehow select all the rows from datas where the data column contains 1 or more dictionary words.
I understand this is a big ask: it would need to match the rows together in every possible combination, so it needs the best optimization.
I have tried the below query, but it just hangs for ages:
SELECT `datas`.*, `dictionary`.`word`
FROM `datas`, `dictionary`
WHERE `datas`.`data` LIKE CONCAT('%', `dictionary`.`word`, '%')
AND LENGTH(`dictionary`.`word`) > 3
ORDER BY `length` ASC
LIMIT 15
I have also tried something similar to the above with a LEFT JOIN and an ON clause containing the LIKE condition.
This is actually not an easy problem. What you are trying to perform is called full-text search, and relational databases are not the best tools for such a task. If this is some kind of core functionality, consider using a solution dedicated to this kind of operation, like the Sphinx Search Server.
If this is not a "mission critical" system, you can try something else. I can see that the datas.data column isn't really long, so you can create a structure dedicated to your task and keep maintaining it during operational use. For example, create the table:
CREATE TABLE dictionary_datas (
datas_id int(11) NOT NULL,
word varchar(255) NOT NULL,
FOREIGN KEY (datas_id) REFERENCES datas (ID),
FOREIGN KEY (word) REFERENCES dictionary (word)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now, any time you insert, delete or modify rows in the datas or dictionary tables, you update dictionary_datas, recording which datas_id contains which words (basically a many-to-many relation). Of course this will degrade your performance, so if you have a high transactional load on your system you should do it periodically instead, for example with a cron job that runs every night at 03:00 and refreshes the table. To simplify the task you can add a TO_CHECK flag to the datas table and refresh data only for records that have it set to 1 (switching the value back to 0 once dictionary_datas is up to date). Remember, by the way, to refresh the whole datas table after an update to the dictionary table. 36,000 and 100,000 are not big numbers in terms of data processing.
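A sketch of such a periodic refresh, assuming TO_CHECK is a tinyint column named to_check on datas (hypothetical; it is not in the schema above):
DELETE FROM dictionary_datas WHERE datas_id IN (SELECT ID FROM datas WHERE to_check = 1);
INSERT INTO dictionary_datas (datas_id, word)
SELECT d.ID, w.word
FROM datas d
JOIN dictionary w ON d.data LIKE CONCAT('%', w.word, '%')
WHERE d.to_check = 1;
UPDATE datas SET to_check = 0 WHERE to_check = 1;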
Once you have this table you can just query it like:
SELECT datas_id, count(*) AS words_num FROM dictionary_datas GROUP BY datas_id HAVING count(*) > 3;
To speed up the query (at the cost of slowing down its updates) you can create a composite index on its columns (datas_id, word), in EXACTLY that order. If you decide to refresh the data periodically, you should drop the index before the refresh, then refresh the data, and finally recreate the index afterwards; this way it will be faster.
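A sketch of that index management (the index name is illustrative):
CREATE INDEX dd_datas_word ON dictionary_datas (datas_id, word);
-- around each periodic refresh:
DROP INDEX dd_datas_word ON dictionary_datas;
-- ...refresh dictionary_datas here...
CREATE INDEX dd_datas_word ON dictionary_datas (datas_id, word);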
I'm not sure if I understood your problem, but I think this could be a solution. Also, I think people don't like regular expressions, but this works for me to select rows whose value contains more than one word:
SELECT * FROM datas WHERE data REGEXP "([a-z] )+"
Have you tried this?
select *
from dictionary, datas
where position(word in data) > 0
;
This is very inefficient, but might be good enough for you.
For better performance, you could try placing a FULLTEXT index on your data column and then using MATCH ... AGAINST instead of POSITION.
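A sketch of that approach (InnoDB FULLTEXT needs MySQL 5.6+, and AGAINST takes a constant string, so dictionary words have to be checked one at a time or from application code):
ALTER TABLE datas ADD FULLTEXT INDEX data_ft (data);
SELECT * FROM datas WHERE MATCH(data) AGAINST ('someword'); -- one dictionary word per query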

Count the number of rows between unix time stamps for each ID

I'm trying to populate some data for a table. The query is being run on a table that contains ~50 million records. The query I'm currently using is below. It counts the number of rows that match the template id and are BETWEEN two unix timestamps:
SELECT COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN '1346904000' AND '1346993271'
AND `template` = '1'
While the query above does work, performance is rather slow when looping through each template, of which there can at times be hundreds. The timestamps are stored as int and are properly indexed. Just to test things out, I tried running the query below, omitting the time_sent restriction:
SELECT COUNT(*) as count FROM `s_log`
WHERE `template` = '1'
As expected, it runs very fast, but is obviously not restricting count results inside the correct time frame. How can I obtain a count for a specific template AND restrict that count BETWEEN two unix timestamps?
EXPLAIN:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
1 | SIMPLE | s_log | ref | time_sent,template | template | 4 | const | 71925 | Using where
SHOW CREATE TABLE s_log:
CREATE TABLE `s_log` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`email` varchar(255) NOT NULL,
`time_sent` int(25) NOT NULL,
`template` int(55) NOT NULL,
`key` varchar(255) NOT NULL,
`node_id` int(55) NOT NULL,
`status` varchar(55) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `time_sent` (`time_sent`),
KEY `template` (`template`),
KEY `node_id` (`node_id`),
KEY `key` (`key`),
KEY `status` (`status`)
) ENGINE=MyISAM AUTO_INCREMENT=2078966 DEFAULT CHARSET=latin1
The best index you can have in this case is a composite one, template + time_sent:
CREATE INDEX template_time_sent ON s_log (template, time_sent)
PS: Also, since all your columns in the query are integers, DON'T enclose their values in quotes (in some cases this can lead to issues, at least with older MySQL versions).
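Once the composite index exists, a quick way to confirm the optimizer picks it up (note the unquoted integer literals, per the PS above):
EXPLAIN SELECT COUNT(*) AS count FROM s_log
WHERE template = 1 AND time_sent BETWEEN 1346904000 AND 1346993271;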
First, you have to create an index that has both of your columns together (not separately). Also check your table type; I think it would work well if your table used InnoDB.
And lastly, write your WHERE clause in this fashion:
WHERE `template` = '1' AND `time_sent` BETWEEN '1346904000' AND '1346993271'
What this does is check template first; only if it is 1 is the second condition checked, otherwise the row is skipped. This should give you a performance edge.
If you have to run the query for each template, it may be faster to get all the information with one query call by using GROUP BY:
SELECT template, COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN 1346904000 AND 1346993271
GROUP BY template;
It's just a guess that this would be faster, and you would also have to redesign your code a bit.
You could also try using InnoDB instead of MyISAM. InnoDB uses a clustered index, which may perform better on large tables. From the MySQL site:
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. (For example, MyISAM uses one file for data rows and another for index records.)
There are some questions on Stackoverflow which discuss the performance between InnoDB and MyISAM:
Should I use MyISAM or InnoDB Tables for my MySQL Database?
Migrating from MyISAM to InnoDB
MyISAM versus InnoDB