Improve ORDER BY JSON field performance on MySQL

I have this table:
CREATE TABLE `mytable` (
`session_id` mediumint(8) UNSIGNED NOT NULL,
`data` json NOT NULL,
`jobname` varchar(100) COLLATE utf8_unicode_ci GENERATED ALWAYS AS
(json_unquote(json_extract(`data`,'$.jobname'))) VIRTUAL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
PARTITION BY HASH (session_id)
PARTITIONS 10;
ALTER TABLE `mytable`
ADD KEY `session` (`session_id`),
ADD KEY `jobname` (`jobname`);
It has 2 million rows.
When I execute this query, it takes around 23 seconds to get the result.
SELECT JSON_EXTRACT(f.data, '$.jobdesc') AS jobdesc
FROM mytable f
WHERE f.session_id = 1
ORDER BY jobdesc DESC
I understand that it is slow because there is no index on the jobdesc field.
The data column has 12 fields. I want to let the user sort by any of them. If I add an index for each field, is that a good approach?
Is there any way to improve it?
I am using MySQL 5.7.13.

You would have to create a virtual column with an index for each of your 12 fields if you want the user to have the option of sorting by any of them.
CREATE TABLE `mytable` (
`session_id` mediumint(8) UNSIGNED NOT NULL,
`data` json NOT NULL,
`jobname` varchar(100) AS (json_unquote(json_extract(`data`,'$.jobname'))),
`jobdesc` varchar(100) AS (json_unquote(json_extract(`data`,'$.jobdesc'))),
...other extracted virtual fields...
KEY (`jobname`),
KEY (`jobdesc`),
...other indexes on virtual columns...
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
PARTITION BY HASH (session_id)
PARTITIONS 10;
This makes me wonder: why bother using JSON? Why not just declare 12 conventional, non-virtual columns with indexes?
CREATE TABLE `mytable` (
`session_id` mediumint(8) UNSIGNED NOT NULL,
...no `data` column needed...
`jobname` varchar(100),
`jobdesc` varchar(100),
...
KEY (`jobname`),
KEY (`jobdesc`),
...
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
PARTITION BY HASH (session_id)
PARTITIONS 10;
JSON is best when you treat it as a single atomic document, and don't try to use SQL operations on fields within it. If you regularly need to access fields within your JSON, make them into conventional columns.
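If you do keep the JSON column, here is a minimal sketch for the exact query in the question (the index name session_jobdesc is invented here): extract jobdesc into its own virtual column and index it together with session_id, so the filter and the sort can both be served by one index within the pruned partition.
ALTER TABLE `mytable`
ADD COLUMN `jobdesc` varchar(100) COLLATE utf8_unicode_ci
GENERATED ALWAYS AS (json_unquote(json_extract(`data`,'$.jobdesc'))) VIRTUAL,
ADD KEY `session_jobdesc` (`session_id`, `jobdesc`);
-- Reference the generated column directly so the index is considered:
SELECT jobdesc FROM mytable WHERE session_id = 1 ORDER BY jobdesc DESC;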

Related

Django slow inner join on a table with more than 10 million records

I am using MySQL with Django. I am trying to count the number of visitor_page rows for a specific dealer in a certain amount of time.
Here is the raw SQL query that I obtained from the Django Debug Toolbar:
SELECT COUNT(*) AS `__count`
FROM `visitor_page`
INNER JOIN `dealer_visitors`
ON (`visitor_page`.`dealer_visitor_id` = `dealer_visitors`.`id`)
WHERE (`visitor_page`.`date_time` BETWEEN '2021-02-01 05:51:00'
AND '2021-03-21 05:50:00'
AND `dealer_visitors`.`dealer_id` = 15)
The issue is that I have more than 13 million records in the visitor_page table and about 1.5 million records in the dealer_visitors table. I have already indexed date_time. I am thinking of using a materialized view, but before attempting that I would really appreciate suggestions on how I could improve this query.
visitor_page schema:
CREATE TABLE `visitor_page` (
`id` int NOT NULL AUTO_INCREMENT,
`date_time` datetime(6) DEFAULT NULL,
`added_at` datetime(6) DEFAULT NULL,
`updated_at` datetime(6) DEFAULT NULL,
`page_id` int NOT NULL,
`dealer_visitor_id` int NOT NULL,
PRIMARY KEY (`id`),
KEY `visitor_page_page_id_246babdf_fk_web_page_id` (`page_id`),
KEY `visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id` (`dealer_visitor_id`),
KEY `visitor_page_date_time_06e9e9f5` (`date_time`),
CONSTRAINT `visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id` FOREIGN KEY (`dealer_visitor_id`) REFERENCES `dealer_visitors` (`id`),
CONSTRAINT `visitor_page_page_id_246babdf_fk_web_page_id` FOREIGN KEY (`page_id`) REFERENCES `web_page` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=13626649 DEFAULT CHARSET=latin1;
dealer_visitors schema:
CREATE TABLE `dealer_visitors` (
`id` int NOT NULL AUTO_INCREMENT,
`visit_date` datetime(6) DEFAULT NULL,
`added_at` datetime(6) DEFAULT NULL,
`updated_at` datetime(6) DEFAULT NULL,
`dealer_id` int NOT NULL,
`visitor_id` int NOT NULL,
`type` int DEFAULT NULL,
`notes` longtext,
`location` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `dealer_visitors_dealer_id_306e2202_fk_dealer_id` (`dealer_id`),
KEY `dealer_visitors_visitor_id_27ae498e_fk_visitor_id` (`visitor_id`),
KEY `dealer_visitors_type_af0f7d79` (`type`),
KEY `dealer_visitors_visit_date_f2b138c9` (`visit_date`),
CONSTRAINT `dealer_visitors_dealer_id_306e2202_fk_dealer_id` FOREIGN KEY (`dealer_id`) REFERENCES `dealer` (`id`),
CONSTRAINT `dealer_visitors_visitor_id_27ae498e_fk_visitor_id` FOREIGN KEY (`visitor_id`) REFERENCES `visitor` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1524478 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
EXPLAIN ANALYZE output for the query is omitted here.
For this query:
SELECT COUNT(*) AS `__count`
FROM visitor_page vp JOIN
dealer_visitors dv
ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time BETWEEN '2021-02-01 05:51:00' AND '2021-03-21 05:50:00' AND
dv.dealer_id = 15;
The best indexes are on dealer_visitors(dealer_id, id) and visitor_page(dealer_visitor_id, date_time).
An index only on date_time helps a bit. But you are retrieving over a month's worth of data, and that might be a lot of data to process. Having dealer_id as the first column in its index restricts the scan to only the rows for that dealer in that time frame.
Depending on the distribution of the data, the Optimizer might pick one of the tables to start with, or pick the other. So, let's provide optimal indexes for each case:
ON `visitor_page`.`dealer_visitor_id` = `dealer_visitors`.`id`
WHERE `visitor_page`.`date_time` BETWEEN ...
AND `dealer_visitors`.`dealer_id` = 15
Starting with visitor_page:
visitor_page: INDEX(date_time) -- (already exists)
dealer_visitors: (already has PRIMARY KEY(id))
Starting with dealer_visitors:
dealer_visitors: INDEX(dealer_id) -- (already exists)
visitor_page: INDEX(dealer_visitor_id, date_time) -- in this order
and drop visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id as now being redundant.
The net is to add one index and drop one index.
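As a sketch in ALTER form (the new index name dv_dt is invented here; dropping the old single-column index in the same statement is safe because the new index's leftmost column still satisfies the foreign key):
ALTER TABLE visitor_page
ADD INDEX dv_dt (dealer_visitor_id, date_time),
DROP INDEX visitor_page_dealer_visitor_id_e2dddea2_fk_dealer_visitors_id;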
Materialized view -- It is often best for Data Warehouse reports to build and incrementally maintain a "summary table" (a "materialized view"). The very odd date range (1 month + 20 days - 60 seconds) makes this clumsy to do. Typically it is handy to make the table based on whole days. If you can shift to daily (or hourly), then see http://mysql.rjweb.org/doc.php/summarytables
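To illustrate that approach (the summary table and its column names are invented for this sketch): maintain a count per dealer per day, and refresh it once per day.
CREATE TABLE dealer_page_hits_daily (
dealer_id INT NOT NULL,
dy DATE NOT NULL,
hits INT UNSIGNED NOT NULL,
PRIMARY KEY (dealer_id, dy)
);
-- Refresh once per day for the previous day:
INSERT INTO dealer_page_hits_daily (dealer_id, dy, hits)
SELECT dv.dealer_id, DATE(vp.date_time), COUNT(*)
FROM visitor_page vp
JOIN dealer_visitors dv ON vp.dealer_visitor_id = dv.id
WHERE vp.date_time >= CURDATE() - INTERVAL 1 DAY
AND vp.date_time < CURDATE()
GROUP BY dv.dealer_id, DATE(vp.date_time);
The report then only needs to sum a handful of small rows:
SELECT SUM(hits) FROM dealer_page_hits_daily
WHERE dealer_id = 15 AND dy >= '2021-02-01' AND dy < '2021-03-21';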
Something else to check: How much RAM do you have? What does SHOW VARIABLES LIKE 'innodb_buffer_pool_size'; say?
I see that the tables have different charset/collation. This is not a problem for the query in question, but if you have other queries that JOIN on VARCHARs, check that they use the same collation.
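A quick way to check is a query against information_schema (a sketch):
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
AND TABLE_NAME IN ('visitor_page', 'dealer_visitors');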

Optimize MySQL count query with JOIN

I have a query that takes about 20 seconds; I would like to understand whether there is a way to optimize it.
Table 1:
CREATE TABLE IF NOT EXISTS `sessions` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`user_id` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `user_id` (`user_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=9845765 ;
And table 2:
CREATE TABLE IF NOT EXISTS `access` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`session_id` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
KEY `session_id` (`session_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=9467799 ;
Now, what I am trying to do is count all the access rows connected to all sessions of one user, so my query is:
SELECT COUNT(*)
FROM access
INNER JOIN sessions ON access.session_id = sessions.id
WHERE sessions.user_id = '6';
It takes almost 20 seconds... and for user_id 6 there are about 3 million sessions stored.
Is there anything I can do to optimize that query?
Change this line in the sessions table:
KEY `user_id` (`user_id`)
To this:
KEY `user_id` (`user_id`, `id`)
What this will do for you is allow you to complete the query from the index, without going back to the raw table. As it is, you need to do an index scan on the session table for your user_id, and for each item go back to the table to find the id for the join to the access table. By including the id in the index, you can skip going back to the table.
Sadly, this will make your inserts into that table slower, and it seems like this may be a big deal, given that just one user has 3 million sessions. SQL Server and Oracle would address this by allowing you to include the id column in your index without actually indexing on it, saving a little work at insert time, and also by allowing you to specify a lower fill factor for the index, reducing the need to rebuild or reorder the indexes on insert, but MySQL doesn't support these.
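In ALTER form, the suggested change is a sketch like this; afterwards EXPLAIN should show "Using index" for the sessions table, since id is explicitly part of the key (MyISAM does not implicitly append the primary key to secondary indexes the way InnoDB does):
ALTER TABLE sessions
DROP INDEX user_id,
ADD INDEX user_id (user_id, id);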

Slow MySQL search query using EXISTS

We have performance problems with a string-search query (MySQL 5.5). Table main contains a few text columns. Table multi contains zero or more multivalue fields per main entry. We want to count how many main items have at least one of the fields starting with a specific string.
Here is the query:
select count(main_id) from main ma
where s1 like '888%' or s2 like '888%' or s3 like '888%'
or exists (select m.multivalue
from multi m
where ma.main_id=m.main_id and (m.multivalue like '888%'))
Already at a few hundred thousand records in main, and about the same number of records in multi, the query takes seconds to complete.
Here is the explain:
PRIMARY ma ALL s1,s2,s3 100407 100.00 Using where
DEPENDENT SUBQUERY m ref PRIMARY,main_id,multivalue PRIMARY 8 sample.ma.main_id 1 100.00 Using where; Using index
And the table definitions:
CREATE TABLE `main` (
`main_id` bigint(20) NOT NULL AUTO_INCREMENT,
`s1` varchar(50) CHARACTER SET utf8 NOT NULL,
`s2` varchar(50) CHARACTER SET utf8 DEFAULT NULL,
`s3` varchar(50) CHARACTER SET utf8 DEFAULT NULL,
PRIMARY KEY (`main_id`),
KEY `s1` (`s1`),
KEY `s2` (`s2`),
KEY `s3` (`s3`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
CREATE TABLE `multi` (
`main_id` bigint(20) NOT NULL,
`multivalue` varchar(50) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`main_id`,`multivalue`),
KEY `main_id` (`main_id`),
KEY `multivalue` (`multivalue`),
CONSTRAINT `FK_multi_main` FOREIGN KEY (`main_id`) REFERENCES `main` (`main_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
Any ideas how to make this query faster?
InnoDB settings:
innodb_additional_mem_pool_size=32M
innodb_flush_log_at_trx_commit=1
innodb_log_buffer_size=64M
innodb_buffer_pool_size=1000M
innodb_log_file_size=128M
innodb_thread_concurrency=16
innodb_file_per_table
innodb_file_format=Barracuda

Order by two fields - Indexing

So I've got a table with all users and their values, and I want to order them by how much "money" they have. The problem is that the money is split across two separate fields: users.money and users.bank.
So this is my table structure:
CREATE TABLE IF NOT EXISTS `users` (
`id` int(4) unsigned NOT NULL AUTO_INCREMENT,
`username` varchar(54) COLLATE utf8_swedish_ci NOT NULL,
`money` bigint(54) NOT NULL DEFAULT '10000',
`bank` bigint(54) NOT NULL DEFAULT '10000',
PRIMARY KEY (`id`),
KEY `users_all_money` (`money`,`bank`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_swedish_ci AUTO_INCREMENT=100 ;
And this is the query:
SELECT id, (money+bank) AS total FROM users FORCE INDEX (users_all_money) ORDER BY total DESC
This works fine, but when I run EXPLAIN it shows "Using filesort", and I'm wondering if there is any way to optimize it.
Because you want to sort by a derived value (one that must be calculated for each row), MySQL can't use the index to help with the ordering.
The only solution that I can see would be to create an additional total_money or similar column and, whenever you update money or bank, update that value too. You could do this in your application code, or in MySQL with triggers if you wanted.
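A sketch of the trigger route (the column, index, and trigger names here are invented): add the column with an index, backfill it once, then keep it in sync on insert and update.
ALTER TABLE users
ADD COLUMN total_money bigint NOT NULL DEFAULT 0,
ADD KEY users_total_money (total_money);
UPDATE users SET total_money = money + bank;
DELIMITER $$
CREATE TRIGGER users_total_ins BEFORE INSERT ON users
FOR EACH ROW SET NEW.total_money = NEW.money + NEW.bank$$
CREATE TRIGGER users_total_upd BEFORE UPDATE ON users
FOR EACH ROW SET NEW.total_money = NEW.money + NEW.bank$$
DELIMITER ;
ORDER BY on the plain column can then use the index instead of a filesort:
SELECT id, total_money AS total FROM users ORDER BY total_money DESC;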

What should I do with partitioning in a table that requires two ways of selection for best performance in MySQL (MyISAM)?

I have a table like below:
CREATE TABLE `hosts` (
`ID` int(11) unsigned NOT NULL AUTO_INCREMENT,
`Name` varchar(60) NOT NULL,
PRIMARY KEY (`ID`,`Name`),
UNIQUE KEY `UniqueHost` (`Name`)
) ENGINE=MyISAM AUTO_INCREMENT=32527823 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (`Name`)
PARTITIONS 20 */
What I want to select here are:
select * from hosts where Name = 'blah.com';
and:
select * from hosts where ID = 123123;
What should I do for best performance when I have two ways of selection like the above?
Other tables reference the ID of this table.
However, I also need to select hosts by Name frequently.
Another question: how many partitions should I create for 32 million rows?