Historical big data slow queries - mysql

I have a problem with slow queries.
PS: MariaDB 10.3.25, InnoDB. I have already optimized most of the DB configuration.
Structure
create table customers
(
    id bigint unsigned auto_increment primary key,
    email varchar(255) null,
    full_name varchar(255) null,
    country varchar(2) null,
    first_name varchar(255) null,
    second_name varchar(255) null,
    company_name varchar(255) null,
    gender char null,
    birth_date date null,
    state varchar(3) null,
    custom_field_1 varchar(255) null,
    custom_field_2 varchar(255) null,
    custom_field_3 varchar(255) null,
    created_at timestamp null,
    updated_at timestamp null,
    deleted_at timestamp null
)
    collate = utf8mb4_unicode_ci;
create table customer_daily_stats
(
    date date not null,
    campaign_id bigint not null,
    customer_id bigint not null,
    event_1 int unsigned default 0 not null,
    event_2 int unsigned default 0 not null,
    event_3 int unsigned default 0 not null,
    event_4 int unsigned default 0 not null,
    event_5 int unsigned default 0 not null,
    constraint customer_daily_stats_date_customer_id_campaign_id_unique
        unique (date, customer_id, campaign_id)
)
    collate = utf8mb4_unicode_ci;
create index customer_daily_stats_customer_id_date_index
on customer_daily_stats (customer_id, date);
create index customer_daily_stats_campaign_id_index
on customer_daily_stats (campaign_id);
customers ~ 1-5 million rows
customer_daily_stats ~ 1-100 million rows
Queries
select
    customers.*,
    IFNULL(SUM(events_aggregation.event_1), 0) as event_1,
    IFNULL(SUM(events_aggregation.event_2), 0) as event_2,
    IFNULL(SUM(events_aggregation.event_3), 0) as event_3,
    IFNULL(SUM(events_aggregation.event_4), 0) as event_4
from
    `customers`
    left join customer_daily_stats as events_aggregation
        on `customers`.`id` = `events_aggregation`.`customer_id`
        and `events_aggregation`.`date` between '2021-09-06' and '2022-07-06'
group by
    `customers`.`id`;
Problems
The main idea is to be able to get aggregations over any date range.
The problem is that this now works too slowly, and I need to add further aggregations, which will decrease performance even more. One more problem: I don't have a lot of disk space (250 GB, about 80% used already).
I have:
customers ~ 1.5m
customer_daily_stats ~ 50.000
query speed ~ 5s
Questions
Are there any methods or other tools to optimize my DB?
Are there any databases that would help me increase performance?

Change the indexes. You currently have
unique (date, customer_id, campaign_id)
INDEX(customer_id, date)
INDEX(campaign_id)
Maybe change to:
PRIMARY KEY(customer_id, date, campaign_id)
INDEX(campaign_id)
BUT... And this is a big BUT. This rearrangement of indexing may significantly hurt other queries. We really need to see
All the big queries
EXPLAIN SELECT for each
Did you notice that the range is 10 months plus 1 day? This is because BETWEEN is 'inclusive'.
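If the intent was exactly 10 months, a half-open range avoids that extra day; applied to the join condition of the query above:
and `events_aggregation`.`date` >= '2021-09-06'
and `events_aggregation`.`date` <  '2022-07-06'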
If 80% of the disk is already used, you are in deep weeds. Any fix that rebuilds a table needs free space roughly the size of that table, which may well be more than the 20% you have left.
One thing to do (when you have enough disk space) is to shrink BIGINT (8 bytes, probably an excessive range) and INT UNSIGNED (4 bytes, 4 billion max) columns to smaller integer types where practical.
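Put together, the two changes might look like the sketch below. It is only a sketch: the smaller types assume your customer/campaign ids stay under 4 billion and daily event counts under 65,535, and the ALTER rebuilds the table, temporarily needing free disk on the order of the table's size:
ALTER TABLE customer_daily_stats
    DROP INDEX customer_daily_stats_date_customer_id_campaign_id_unique,
    DROP INDEX customer_daily_stats_customer_id_date_index,
    ADD PRIMARY KEY (customer_id, date, campaign_id),
    MODIFY customer_id int unsigned NOT NULL,
    MODIFY campaign_id int unsigned NOT NULL,
    MODIFY event_1 smallint unsigned NOT NULL DEFAULT 0;  -- and likewise event_2 .. event_5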
I'm confused. These seem to contradict each other; please clarify:
customer_daily_stats ~ 1-100 million rows
customer_daily_stats ~ 50.000
Some more things to help with the analysis:
innodb_buffer_pool_size
RAM size
disk footprint for tables (GB)
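For reference, those numbers can be pulled from the server itself; your_schema below is a placeholder for the actual schema name, and RAM size comes from the OS (e.g. free -h on Linux):
SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';

SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM information_schema.TABLES
WHERE table_schema = 'your_schema'
ORDER BY size_gb DESC;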

Related

Improve query speed suggestions

For self-education I am developing an invoicing system for an electricity company. I have multiple time-series tables with different intervals: one table represents consumption, two others represent prices, and a third price table still has to be incorporated. I am now running calculation queries, but they are slow. I would like to improve the query speed, especially since these are only the initial calculations and the queries will only become more complicated. Please note that this is the first database I have created; a simplified explanation is preferred. Thanks for any help provided.
I have indexed DATE, PERIOD_FROM, and PERIOD_UNTIL in each table. This sped up the process from 60 seconds to 5 seconds.
The structure of the tables is the following:
CREATE TABLE `apxprice` (
  `APX_id` int(11) NOT NULL AUTO_INCREMENT,
  `DATE` date DEFAULT NULL,
  `PERIOD_FROM` time DEFAULT NULL,
  `PERIOD_UNTIL` time DEFAULT NULL,
  `PRICE` decimal(10,2) DEFAULT NULL,
  PRIMARY KEY (`APX_id`)
) ENGINE=MyISAM AUTO_INCREMENT=28728 DEFAULT CHARSET=latin1
CREATE TABLE `imbalanceprice` (
  `imbalanceprice_id` int(11) NOT NULL AUTO_INCREMENT,
  `DATE` date DEFAULT NULL,
  `PTU` tinyint(3) DEFAULT NULL,
  `PERIOD_FROM` time DEFAULT NULL,
  `PERIOD_UNTIL` time DEFAULT NULL,
  `UPWARD_INCIDENT_RESERVE` tinyint(1) DEFAULT NULL,
  `DOWNWARD_INCIDENT_RESERVE` tinyint(1) DEFAULT NULL,
  `UPWARD_DISPATCH` decimal(10,2) DEFAULT NULL,
  `DOWNWARD_DISPATCH` decimal(10,2) DEFAULT NULL,
  `INCENTIVE_COMPONENT` decimal(10,2) DEFAULT NULL,
  `TAKE_FROM_SYSTEM` decimal(10,2) DEFAULT NULL,
  `FEED_INTO_SYSTEM` decimal(10,2) DEFAULT NULL,
  `REGULATION_STATE` tinyint(1) DEFAULT NULL,
  `HOUR` int(2) DEFAULT NULL,
  PRIMARY KEY (`imbalanceprice_id`),
  KEY `DATE` (`DATE`,`PERIOD_FROM`,`PERIOD_UNTIL`)
) ENGINE=MyISAM AUTO_INCREMENT=117427 DEFAULT CHARSET=latin1
CREATE TABLE `powerload` (
  `powerload_id` int(11) NOT NULL AUTO_INCREMENT,
  `EAN` varchar(18) DEFAULT NULL,
  `DATE` date DEFAULT NULL,
  `PERIOD_FROM` time DEFAULT NULL,
  `PERIOD_UNTIL` time DEFAULT NULL,
  `POWERLOAD` int(11) DEFAULT NULL,
  PRIMARY KEY (`powerload_id`)
) ENGINE=MyISAM AUTO_INCREMENT=61039 DEFAULT CHARSET=latin1
Now when running this query:
SELECT i.DATE, i.PERIOD_FROM, i.TAKE_FROM_SYSTEM, i.FEED_INTO_SYSTEM,
       a.PRICE, p.POWERLOAD, SUM(a.PRICE * p.POWERLOAD)
FROM imbalanceprice i, apxprice a, powerload p
WHERE i.DATE = a.DATE
  AND i.DATE = p.DATE
  AND i.PERIOD_FROM >= a.PERIOD_FROM
  AND i.PERIOD_FROM = p.PERIOD_FROM
  AND i.PERIOD_FROM < a.PERIOD_UNTIL
  AND i.DATE >= '2018-01-01'
  AND i.DATE <= '2018-01-31'
GROUP BY i.DATE
I have run the query with EXPLAIN and got the following result (all three rows have select_type SIMPLE and partitions NULL):
table | possible_keys | key  | key_len | ref                                         | rows  | filtered | Extra
a     | NULL          | NULL | NULL    | NULL                                        | 28727 | 100      | Using where; Using temporary; Using filesort
p     | NULL          | NULL | NULL    | NULL                                        | 61038 | 10       | Using where; Using join buffer (Block Nested Loop)
i     | DATE          | DATE | 8       | timeseries.a.DATE,timeseries.p.PERIOD_FROM  | 1     | 100      | NULL
Preferably I would run a more complicated query for a whole year, grouped by month for example, with all price tables incorporated. However, that would be too slow. I have indexed DATE, PERIOD_FROM, and PERIOD_UNTIL in each table. The calculation result must not change: in this case, quarter-hourly consumption of two meters multiplied by hourly prices.
"Categorically speaking," the first thing you should look at is indexes.
Your clauses such as WHERE i.DATE = a.DATE ... are categorically known as INNER JOINs, and the SQL engine needs to have the ability to locate the matching rows "instantly." (That is to say, without looking through the entire table!)
FYI: Just like any index in real-life – here I would be talking about "library card catalogs" if we still had such a thing – indexes will assist both "equal to" and "less/greater than" queries. The index takes the computer directly to a particular point in the data, whether that's a "hit" or a "near miss."
Finally, the EXPLAIN verb is very useful: put that word in front of your query, and the SQL engine should "explain to you" exactly how it intends to carry out your query. (The SQL engine looks at the structure of the database to make that decision.) Although the EXPLAIN output is ... (heh) ... "not exactly standardized," it will help you to see if the computer thinks that it needs to do something very time-wasting in order to deliver your answer.
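As a concrete starting point (a sketch only; the index names are made up, so verify the effect with EXPLAIN), composite indexes matching the join conditions of the query above could be:
ALTER TABLE apxprice  ADD INDEX apx_date_period (DATE, PERIOD_FROM, PERIOD_UNTIL, PRICE);
ALTER TABLE powerload ADD INDEX pl_date_period  (DATE, PERIOD_FROM, POWERLOAD);
imbalanceprice already has KEY `DATE` (DATE, PERIOD_FROM, PERIOD_UNTIL) covering its side of the join.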

Optimize MYSQL query with aggregate function

I have one simple query but, on the other hand, a relatively big table.
Here it is:
select `stats_ad_groups`.`ad_group_id`,
       sum(stats_ad_groups.earned) / 1000000 as earned
from `stats_ad_groups`
where `stats_ad_groups`.`day` between '2018-01-01' and '2018-05-31'
group by `ad_group_id`
order by earned asc
limit 10
And here is the table structure:
CREATE TABLE `stats_ad_groups` (
  `campaign_id` int(11) NOT NULL,
  `ad_group_id` int(11) NOT NULL,
  `impressions` int(11) NOT NULL,
  `clicks` int(11) NOT NULL,
  `avg_position` double(3,1) NOT NULL,
  `cost` int(11) NOT NULL,
  `profiles` int(11) NOT NULL DEFAULT 0,
  `upgrades` int(11) NOT NULL DEFAULT 0,
  `earned` int(11) NOT NULL DEFAULT 0,
  `day` date NOT NULL,
  PRIMARY KEY (`ad_group_id`,`day`,`campaign_id`)
)
There are also partitions by range here, but I excluded them so as not to waste space :)
The query above executes in about 9 seconds. Do you know some way to improve it?
If I exclude the LIMIT/ORDER BY, it executes in about 200 ms.
To sum it up:
I need to order by a sum on a big table, if possible with LIMIT and OFFSET.
INDEX(day, ad_group_id, earned)
handles the WHERE and is 'covering'.
Is your PARTITIONing PARTITION BY RANGE(TO_DAYS(day)) with daily partitions? If so, you could leave day off that index.
With that index, PARTITIONing provides no extra performance for this query.
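Creating the suggested index is a single statement (the index name is arbitrary):
ALTER TABLE stats_ad_groups
    ADD INDEX day_group_earned (day, ad_group_id, earned);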
For a significant speedup, build and maintain a summary table that holds day, ad_group_id, and SUM(earned).
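A minimal sketch of such a summary table (all names here are illustrative), refreshed once per day with yesterday's rows:
CREATE TABLE stats_ad_groups_daily (
    day date NOT NULL,
    ad_group_id int NOT NULL,
    earned bigint NOT NULL DEFAULT 0,
    PRIMARY KEY (day, ad_group_id)
);

INSERT INTO stats_ad_groups_daily (day, ad_group_id, earned)
SELECT day, ad_group_id, SUM(earned)
FROM stats_ad_groups
WHERE day = CURDATE() - INTERVAL 1 DAY
GROUP BY day, ad_group_id
ON DUPLICATE KEY UPDATE earned = VALUES(earned);
The 9-second query can then aggregate the small summary rows instead of the raw table.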
Don't use (m,n) on DOUBLE or FLOAT.

How to design a Cassandra Schema for a User Actions Log?

I have a table like this in MySQL to log user actions:
CREATE TABLE `actions` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `module` VARCHAR(32) NOT NULL,
  `controller` VARCHAR(64) NOT NULL,
  `action` VARCHAR(64) NOT NULL,
  `date` TIMESTAMP NOT NULL,
  `userid` BIGINT(20) NOT NULL,
  `ip` VARCHAR(32) NOT NULL,
  `duration` DOUBLE NOT NULL,
  PRIMARY KEY (`id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=MyISAM
AUTO_INCREMENT=1
I have a MySQL query like this to find the count of a specific action per day:
SELECT COUNT(*) FROM actions WHERE actions.action = "join" AND
YEAR(date)=2017 AND MONTH(date)=06 GROUP BY YEAR(date), MONTH(date),
DAY(date)
This takes 50-60 seconds to give me the list of days with the count of the "join" action, with only 5 million rows and indexes on date and action.
So, I want to log actions using Cassandra instead. How should I design the Cassandra schema, and how should I query it, to serve such a request in less than 1 second?
CREATE TABLE actions (
    id timeuuid,
    module varchar,
    controller varchar,
    action varchar,
    date_time timestamp,
    userid bigint,
    ip varchar,
    duration double,
    year int,
    month int,
    dt date,
    PRIMARY KEY ((action, year, month), dt, id)
);
Explanation:
With the above table definition,
SELECT COUNT(*) FROM actions WHERE action = 'join' AND year = 2017 AND month = 6 GROUP BY action, year, month, dt
will hit a single partition.
The dt column holds only the date; you could instead store just the day number as an int. And since id is a timeuuid, it will be unique.
Note: GROUP BY is supported in Cassandra 3.10 and above.
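As an aside, even staying in MySQL, the YEAR(date)/MONTH(date) calls keep the index from being used for a range scan. Assuming a composite index on (action, date) — an assumption, since the question only says both columns are indexed — a sargable rewrite would be:
SELECT DATE(`date`) AS day, COUNT(*) AS join_count
FROM actions
WHERE action = 'join'
  AND `date` >= '2017-06-01'
  AND `date` <  '2017-07-01'
GROUP BY DATE(`date`);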

MARIADB: Index not used for a select with join on a range

I have a first table containing my IPs stored as integers (500k rows), and a second one containing ranges of blacklisted IPs and the reason for the blacklisting (10M rows).
Here is the table structure:
CREATE TABLE `black_lists` (
  `id` INT(11) NOT NULL AUTO_INCREMENT,
  `ip_start` INT(11) UNSIGNED NOT NULL,
  `ip_end` INT(11) UNSIGNED NULL DEFAULT NULL,
  `reason` VARCHAR(3) NOT NULL,
  `excluded` TINYINT(1) NULL DEFAULT NULL,
  PRIMARY KEY (`id`),
  INDEX `ip_range` (`ip_end`, `ip_start`),
  INDEX `ip_start` (`ip_start`),
  INDEX `ip_end` (`ip_end`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
AUTO_INCREMENT=10747741
;
CREATE TABLE `ips` (
  `id` INT(11) NOT NULL AUTO_INCREMENT COMMENT 'Id ips',
  `idhost` INT(11) NOT NULL COMMENT 'Id Host',
  `ip` VARCHAR(45) NULL DEFAULT NULL COMMENT 'Ip',
  `ipint` INT(11) UNSIGNED NULL DEFAULT NULL COMMENT 'Int ip',
  `type` VARCHAR(45) NULL DEFAULT NULL COMMENT 'Type',
  PRIMARY KEY (`id`),
  INDEX `host` (`idhost`),
  INDEX `index3` (`ip`),
  INDEX `index4` (`idhost`, `ip`),
  INDEX `ipsin` (`ipint`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
AUTO_INCREMENT=675651;
My problem is that when I run this query, no index is used and it takes an eternity to finish:
select i.ip,s1.reason
from ips i
left join black_lists s1 on i.ipint BETWEEN s1.ip_start and s1.ip_end;
I'm using MariaDB 10.0.16
True.
The optimizer has no knowledge that the start..end values are non-overlapping, nor anything else obvious about them. So, the best it can do is decide between
s1.ip_start <= i.ipint -- and use INDEX(ip_start), or
s1.ip_end >= i.ipint -- and use INDEX(ip_end)
Either of those could result in upwards of half the table being scanned.
In 2 steps you could achieve the desired goal for one ip; let's say #ip:
SELECT ip_start, reason
FROM black_lists
WHERE ip_start <= #ip
ORDER BY ip_start DESC
LIMIT 1
But after that, you need to check whether the ip_end corresponding to that ip_start is >= #ip before deciding whether you have a blacklisted item.
SELECT reason
FROM ( ... ) a -- fill in the above query
JOIN black_lists b USING(ip_start)
WHERE b.ip_end >= #ip
That will either return the reason or no rows.
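Filled in for a single IP (16909060, i.e. 1.2.3.4, is just an illustrative literal):
SELECT b.reason
FROM (
    SELECT ip_start
    FROM black_lists
    WHERE ip_start <= 16909060
    ORDER BY ip_start DESC
    LIMIT 1
) a
JOIN black_lists b USING (ip_start)
WHERE b.ip_end >= 16909060;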
In spite of the complexity, it will be very fast. But, you seem to have a set of IPs to check. That makes it more complex.
For black_lists, there seems to be no need for id. Suggest you replace the 4 indexes with only 2:
PRIMARY KEY(ip_start, ip_end),
INDEX(ip_end)
In ips, isn't ip unique? If so, get rid of id and change the 5 indexes to 3:
PRIMARY KEY(ipint),
INDEX(idhost, ip),
INDEX(ip)
You have allowed more than enough in the VARCHAR for IPv6, but not in INT UNSIGNED.

MySQL Create a function table?

I am trying to design the layout of the table to work best in the following situation.
I have a product that is sold based on age. The age determines whether that product exists for a given person, and the minimum and maximum quantity one can buy.
Right now I have designed the table as follows:
CREATE TABLE `tblProductsVsAge` (
  `id` int(255) AUTO_INCREMENT NOT NULL,
  `product_id` bigint(255) NOT NULL,
  `age_min` int(255) NOT NULL,
  `age_max` int(255) NOT NULL,
  `quantity_min` decimal(8) NOT NULL,
  `quantity_max` decimal(8) NOT NULL,
  /* Keys */
  PRIMARY KEY (`id`)
) ENGINE = InnoDB;
This is functional and it works, but I feel as if it's not the best-optimized structure.
Any ideas?
I forgot to mention that a product can have many ranges. For example: age min 25, age max 35, with quantity from 12 to 28; for the same product ID we might have age 36 to 60 with quantity from 3 to 8.
Use tinyint unsigned for age_max and age_min since none of the ages in the question pass 255 (highest unsigned tinyint).
Use smallint unsigned for quantity_max and quantity_min if those values > 255 and <= 65535 (highest unsigned smallint).
Use mediumint unsigned for quantity_max and quantity_min if those values > 65535 and <= 16777215 (highest unsigned mediumint).
Use int unsigned for quantity_max and quantity_min if those values > 16777215 and <= 4294967295 (highest unsigned int). (Sometimes, you gotta Think Big !!!)
My recommendation:
CREATE TABLE `tblProductsVsAge` (
  `product_id` int NOT NULL,
  `age_min` tinyint unsigned NOT NULL,
  `age_max` tinyint unsigned NOT NULL,
  `quantity_min` smallint unsigned NOT NULL,
  `quantity_max` smallint unsigned NOT NULL,
  /* Keys */
  PRIMARY KEY (`product_id`, `age_min`)
) ENGINE = InnoDB;
Here is something to consider if the table already has data: you could ask MySQL to recommend column definitions for this table.
Simply run this query:
SELECT * FROM tblProductsVsAge PROCEDURE ANALYSE();
The PROCEDURE ANALYSE() directive causes MySQL not to display the data but to examine the values in each column and come up with its own recommendations. (Note that PROCEDURE ANALYSE() was removed in MySQL 8.0.) Sometimes the recommendation is too granular. For example, if age_min is in the teenage range, it may recommend ENUM('13','14','15','16','17','18','19') instead of tinyint. After PROCEDURE ANALYSE() is done, you still make the final call on the column definitions.
CREATE TABLE `tblProductsVsAge` (
  `product_id` int NOT NULL,
  `age_min` smallint NOT NULL,
  `age_max` smallint NOT NULL,
  `quantity_min` smallint NOT NULL,
  `quantity_max` smallint NOT NULL,
  /* Keys */
  PRIMARY KEY (`product_id`, `age_min`)
) ENGINE = InnoDB;
Changes to your structure:
id is probably not needed (unless you really do need it); and if product_id truly had to be bigint, id would need the same type, since this table can hold more rows than your products table.
I changed the type of product_id to int; I don't think you will have more than 2147483647 products.
age and quantity are smallints, which can hold a maximum value of 32767 (use mediumint or int if that's not enough). decimal is intended for when you need exact precision or numbers bigger than bigint.
The index on (product_id, age_min) makes searches faster for a given product_id and for searches like product_id = {some_id} AND age_min > {user_age} (see the lookup example below).
(255) in an int/bigint definition doesn't make it 255 digits long; it's only a display-width hint for the string representation.
MySQL manual on numeric types: http://dev.mysql.com/doc/refman/5.5/en/numeric-types.html
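With PRIMARY KEY (product_id, age_min) in place, the typical lookup can seek straight into the index; the values 42 and 30 below are purely illustrative:
SELECT quantity_min, quantity_max
FROM tblProductsVsAge
WHERE product_id = 42
  AND age_min <= 30
  AND age_max >= 30;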