2gb table with 10 million rows, late pagination select is slow - mysql

I have table in MySQL with 10 million rows with 2 GB data
selecting IN LIFO format data is slow
Table engine is = InnoDB
table has one primary key and one unique key
SELECT * FROM link LIMIT 999999 , 50;
how I improve the performance of the table. ?
table structure
id int(11) NO PRI NULL auto_increment
url varchar(255) NO UNI NULL
website varchar(100) NO NULL
state varchar(10) NO NULL
type varchar(100) NO NULL
prio varchar(100) YES NULL
change varchar(100) YES NULL
last varchar(100) YES NULL
NOTE:
SELECT * FROM link LIMIT 1 , 50; is taking .9ms but current sql is taking 1000ms its 1000 time taking more

This most likely is due to "early row lookup"
MySQL can be forced to do "late row lookup". Try below query
SELECT l.*
FROM (
SELECT id
FROM link
ORDER BY
id
LIMIT 999999 , 50
) q
JOIN link l
ON l.id = q.id
Check this article
MySQL limit clause and rate low lookups

For the Next and Prev buttons you can use a WHERE clause instead of OFFSET.
Example (using LIMIT 10 - Sample data explained below): You are on some page which shows you 10 rows with the ids [2522,2520,2514,2513,2509,2508,2506,2504,2497,2496]. This in my case is created with
select *
from link l
order by l.id desc
limit 10
offset 999000
For the next page you would use
limit 10
offset 999010
getting rows with ids [2495,2494,2493,2492,2491,2487,2483,2481,2479,2475].
For the previous page you would use
limit 10
offset 998990
getting rows with ids [2542,2541,2540,2538,2535,2533,2530,2527,2525,2524].
All above queries execute in 500 msec. Using the "trick" suggested by Sanj it still takes 250 msec.
Now with the given page with minId=2496 and maxId=2522 we can create queries for the Next and Last buttons using the WHERE clause.
Next button:
select *
from link l
where l.id < :minId -- =2496
order by l.id desc
limit 10
Resulting ids: [2495,2494,2493,2492,2491,2487,2483,2481,2479,2475].
Prev button:
select *
from link l
where l.id > :maxId -- =2522
order by l.id asc
limit 10
Resulting ids: [2524,2525,2527,2530,2533,2535,2538,2540,2541,2542].
To reverse the order you can use the query in a subselect:
select *
from (
select *
from link l
where l.id > 2522
order by l.id asc
limit 10
) sub
order by id desc
Resulting ids: [2542,2541,2540,2538,2535,2533,2530,2527,2525,2524].
These queries execute in "no time" (less than 1 msec) and provide the same result.
You can not use this solution to create page numbers. But i don't think you are going to output 200K page numbers.
Test data:
Data used for the example and benchmarks has been created with
CREATE TABLE `link` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`url` VARCHAR(255) NOT NULL,
`website` VARCHAR(100) NULL DEFAULT NULL,
`state` VARCHAR(10) NULL DEFAULT NULL,
`type` VARCHAR(100) NULL DEFAULT NULL,
`prio` VARCHAR(100) NULL DEFAULT NULL,
`change` VARCHAR(100) NULL DEFAULT NULL,
`last` VARCHAR(100) NULL DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE INDEX `url` (`url`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;
insert into link
select i.id
, concat(id, '-', rand()) url
, rand() website
, rand() state
, rand() `type`
, rand() prio
, rand() `change`
, rand() `last`
from test._dummy_indexes_2p23 i
where i.id <= 2000000
and rand() < 0.5
where test._dummy_indexes_2p23 is a table containing 2^23 ids (about 8M). So the data contains about 1M rows randomly missing every second id. Table size: 228 MB

Due to the large amount of data,
There are few tips for improve the query response time:
Change the storage engine Innodb to myisam.
Create table partitioning
(https://dev.mysql.com/doc/refman/5.7/en/partitioning-management.html)
Mysql cluster (http://dev.mysql.com/doc/refman/5.7/en/mysql-cluster-overview.html)
Increase hardware capacity.
Thanks

First of all running on your table without any order doesn't guaranty your query will return the same data if ran twice.
It's better adding an ORDER BY clause. Taking id as a good candidate, as it's your primary key and seems unique (as it's an auto_increment value).
You could use this as your base:
SELECT * FROM link ORDER BY id LIMIT 50;
This will give you the first 50 rows in your table.
Now for the next 50 rows, instead of using OFFSET, we could save our last location in the query.
You would save the id from the last row last id from the previous query and use it in the next query:
SELECT * FROM link WHERE id > last_id ORDER BY id LIMIT 50;
This will give you the next 50 rows after the last id.
The reason your query runs slowly on high values of OFFSET is because mysql has to run on all rows in the given OFFSET and return the last LIMIT number of rows. This means that the bigger OFFSET is the slower the query will run.
The solution I showed above, doesn't depend on OFFSET, thus the query will run at the same speed independent of the current page.
See also this useful article that explains a few other options you can choose from: http://www.iheavy.com/2013/06/19/3-ways-to-optimize-for-paging-in-mysql/

I have updated my SQL Query to this and this is taking less amount of time.
SELECT * FROM link ORDER BY id LIMIT 999999 , 50 ;

Related

mysql with few tables, subquery on one large table performs slow

We are experiencing slow performance with a query on mysql database and we are not sure if the query is wrong or maybe mysql or server is not good enough.
The query with a subquery returns some project details (3 fields) and filename of the latest taken picture of a online camera.
Info
Table 'projects' contains 40 records.
Table 'cameras' contains approx 40 records (1 project, multiple cameras possible)
Table 'cameraimages' contains around 250000 (250 thousand) records. (1 camera can have thousands of images)
Engine is InnoDb
Database size is about 100Mb approx
No indexes are added yet.
Version number mysql 8.0.15
This is the query
SELECT
pj.title,
pj.description,
pj.city,
(SELECT cmi.filename
FROM cameras cm
LEFT JOIN cameraimages cmi ON cmi.cameraId = cm.id
WHERE cm.projectId = pj.id
ORDER BY cmi.dateRecording DESC
LIMIT 0,1) as latestfilename
FROM
projects pj
It takes 40-50 seconds to return this data.
That is to long for a webpage but I think it should take not that long at all.
We tested the same query on another server, to compare. Same data, same query.
That takes 25 seconds.
My questions are:
Is this query to 'heavy/bad' and if it is, what query should perform better?
Is there a way, or what should I check, to find out why this query runs better on an older/other server?
Hope someone can give some advice.
Thnx!
Additional info
CREATE TABLE `cameras` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`guid` varchar(50) DEFAULT NULL,
`title` varchar(50) DEFAULT NULL,
`longitude` double DEFAULT NULL,
`latitude` double DEFAULT NULL,
`status` smallint(6) DEFAULT NULL,
`cameraUid` varchar(20) DEFAULT NULL,
`cameraFriendlyName` varchar(50) DEFAULT NULL,
`projectId` int(11) DEFAULT NULL,
`dateCreated` datetime DEFAULT NULL,
`dateModified` datetime DEFAULT NULL,
`address` varchar(100) DEFAULT NULL,
`city` varchar(50) DEFAULT NULL,
`createArchive` smallint(6) DEFAULT '0',
`createDaily` smallint(6) DEFAULT '1',
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=88 DEFAULT CHARSET=latin1
Columns cameraId,dateRecording is unique.
One camera takes on picture at the time.
You're using a so-called dependent subquery. That's slow.
I guess cameraimages.id is a primary key for your cameraimages file. That's a guess. You didn't provide enough information in your question to answer it with certainty.
I also guess that the dateRecording values in cameraimages are in the same order as your autoincrementing primary key id values. That is, I guess you INSERT a record to that table at the time each image is captured.
Let's break this down.
You want the id of the most recent image from each project. How can you get that? Write a subquery to retrieve the largest, most recent id for each project.
SELECT cm.projectId,
MAX(cmi.id) imageId
FROM cameras cm
JOIN cameraimages cmi ON cmi.cameraId = cm.id
GROUP BY cm.projectId
That subquery does the heavy lifting of searching your big table. It does it just once, not for every project, so it won't take as long.
Then put that subquery into your query to retrieve the columns you need.
SELECT
pj.title,
pj.description,
pj.city,
cmi.filename latestfilename
FROM projects pj
JOIN (
SELECT cm.projectId,
MAX(cmi.id) imageId
FROM cameras cm
JOIN cameraimages cmi ON cmi.cameraId = cm.id
GROUP BY cm.projectId
) latest ON pj.id = latest.projectId
JOIN cameraimages cmi ON cmi.imageId = latest.imageId
This has a series of JOINs making a chain from projects to the latest subquery and from there to cameraimages.
This depends on cameraimages.id values being in chronological order. It can still be done if they aren't in that order with a more elaborate query.
Indexes:
cm: INDEX(projectId, id)
cmi: INDEX(cameraId, dateRecording, filename)
cmi: INDEX(cameraId, id)
When cameraimages.id values aren't in chronological order, we need to work with the latest dateRecording values.
This is going to require a sequence of subqueries. So, rather than nesting them, let's use MySQL 8+ Common Table Expressions. It's a big query.
WITH
ProjectCameraImage AS (
/* a virtual version of the cameraimages table including projectId */
SELECT cmi.id, cmi.dateRecording, cm.projectId, cm.cameraId
FROM cameras cm
JOIN cameraimages cmi ON cm.id = cmi.cameraId
),
LatestDate AS (
/* the latest date for each entry in ProjectCameraImage */
/* Notice how this uses MAX rather than ORDER BY ... DESC LIMIT 1 */
SELECT projectId, cameraId,
MAX(dateRecording) dateRecording
FROM ProjectCameraImage
GROUP BY projectId, cameraId
),
ProjectCameraLatest AS (
/* the cameraimage.id values for the latest images in ProjectCameraImage */
SELECT ProjectCameraImage.id,
ProjectCameraImage.projectId,
ProjectCameraImage.cameraId,
ProjectCameraImage.dateRecording
FROM ProjectCameraImage
JOIN LatestDate
ON ProjectCameraImage.projectId = LatestDate.projectId
AND ProjectCameraImage.cameraId = LatestDate.cameraId
AND ProjectCameraImage.dateRecording = LatestDate.dateRecording
),
LatestProjectDate AS (
/* the latest data for each entry in ProjectCameraLatest */
SELECT projectId,
MAX(dateRecording) dateRecording
FROM ProjectCameraLatest
GROUP BY projectId
),
ProjectLatest AS (
/* the cameraimage.id values for the latest images in ProjectCameraLatest */
SELECT ProjectCameraLatest.id,
ProjectCameraLatest.projectId
FROM ProjectCameraLatest
JOIN LatestProjectDate
ON ProjectCameraLatest.projectId = LatestProjectDate.projectId
AND ProjectCameraLatest.dateRecording = LatestProjectDate.dateRecording
)
/* the main query */
SELECT pj.title,
pj.description,
pj.city,
cmi.filename latestfilename
FROM projects pj
JOIN ProjectLatest ON pj.id = ProjectLatest.projectId
JOIN cameraimages cmi ON ProjectLatest.id = cmi.id;
It's big because we have to go through two different cycles of finding the cameraimages.id value with the largest dateRecording.
Edit The heavy lifting, in terms of searching your tables, happens in the second common table expression (CTE), the one called LatestDate. I suggest adding an index to your cameraimages table as follows to give it a boost.
CREATE INDEX cmi_cameraid_daterec
ON cameraimages (cameraId, dateRecording DESC);
That compound index should allow random access by cameraId, then quick access to the latest date. Notice that it also should help the ProjectCameraLatest CTE.
You can test the performance of this by changing the last SELECT, the one in the main query, to just SELECT * FROM LatestDate;. And to see whether / how it uses the index try using EXPLAIN or EXPLAIN ANALYZE: use EXPLAIN SELECT * FROM LatestDate; as the main query.
You may learn some useful things about indexes if you run EXPLAIN with and without the index.

Delete all items in a database except the last date

I have a MySQL table that looks (very simplified) like this:
CREATE TABLE `logging` (
`id` bigint(20) NOT NULL,
`time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`level` smallint(3) NOT NULL,
`message` longtext CHARACTER SET utf8 COLLATE utf8_general_mysql500_ci NOT NULL
);
I would like to delete all rows of a specific level, except the last one (time is most recent).
Is there a way to select all rows with level set to a specific value and then delete all rows except the latest one in one single SQL query? How would I start solving this problem?
(As I said, this is a very simplified table, so please don't try to discuss possible design problems of this table. I removed some columns. It is designed per PSR-3 logging standard and I don't think there is an easy way to change that. What I want to solve is how I can select from a table and then delete all but some rows of the same table. I have only intermediate knowledge of MySQL.)
Thank you for pushing me in the right direction :)
Edit:
The Database version is /usr/sbin/mysqld Ver 8.0.18-0ubuntu0.19.10.1 for Linux on x86_64 ((Ubuntu))
You can use ROW_NUMBER() analytic function ( as using DB version 8+ ) :
DELETE lg FROM `logging` AS lg
WHERE lg.`id` IN
( SELECT t.`id`
FROM
(
SELECT t.*,
ROW_NUMBER() OVER (ORDER BY `time` DESC) as rn
FROM `logging` t
-- WHERE `level` = #lvl -- optionally add this line to restrict for a spesific value of `level`
) t
WHERE t.rn > 1
)
to delete all of the rows except the last inserted one(considering id is your primary key column).
You can do this:
SELECT COUNT(time) FROM logging WHERE level=some_level INTO #TIME_COUNT;
SET #TIME_COUNT = #TIME_COUNT-1;
PREPARE STMT FROM 'DELETE FROM logging WHERE level=some_level ORDER BY time ASC LIMIT ?;';
EXECUTE STMT USING #TIME_COUNT;
If you have an AUTO_INCREMENT id column - I would use it to determine the most recent entry. Here is one way doing that:
delete l
from (
select l1.level, max(id) as id
from logging l1
where l1.level = #level
) m
join logging l
on l.level = m.level
and l.id < m.id
An index on (level) should give you good performance and will support the MAX() subquery as well as the JOIN.
View on DB Fiddle
If you really need to use the time column, you can modify the query as follows:
delete l
from (
select l1.level, l1.id
from logging l1
where l1.level = #level
order by l1.time desc, l1.id desc
limit 1
) m
join logging l
on l.level = m.level
and l.id <> m.id
View on DB Fiddle
Here you would want to have an index on (level, time).

Strange query results from MySQL

I have a query that I'm testing on my database, but for some weird reason, and randomly, it returns a different set of results. Interestingly, there are only two distinct result-sets that it returns, from thousands of rows, and the query will randomly return one or the other, but nothing else.
Is there a reason the query only returns one of two datasets? Query and schema below.
My goal is to select the fastest laps for a given track, in a given time period, but only the fastest lap for each user (so there are always 10 different users in the top 10).
Most of the time the correct results are returned, but randomly, a totally different result set is returned.
SELECT `lap`.`ID`, `lap`.`qualificationTime`, `lap`.`userId`
FROM `lap`
WHERE (lap.trackID =4)
AND (lap.raceDateTime >= "2013-07-25 10:00:00")
AND (lap.raceDateTime < "2013-08-04 23:59:59")
AND (isTestLap =0)
GROUP BY `userId`
ORDER BY `qualificationTime` ASC
LIMIT 10
Schema:
CREATE TABLE IF NOT EXISTS `lap` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`userId` int(11) DEFAULT NULL,
`trackId` int(11) DEFAULT NULL,
`raceDateTime` datetime NOT NULL,
`qualificationTime` decimal(7,4) DEFAULT '0.0000',
`isTestLap` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`ID`)
(DB create script trimmed of un-needed columns)
You are using a (mis)feature of MySQL called hidden columns. As others have pointed out, you are allowed to put columns in the select statement that are not in the group by. But, the returned values are arbitrary, and not even guaranteed to be the same from one run to the next.
The solution is to find the max qualification time for each user. Then join this information back to get the other fields. Here is one way:
select l.*
from (SELECT userId, min(qualificationtime) as minqf
FROM lap
WHERE (lap.trackID =4)
AND (lap.raceDateTime >= "2013-07-25 10:00:00")
AND (lap.raceDateTime < "2013-08-04 23:59:59")
AND (isTestLap =0)
GROUP BY `userId`
) lu join
lap l
on lu.minqf = l.qualificationtime
ORDER BY l.`qualificationTime` ASC
LIMIT 10
You are selecting lap.ID, lap.qualificationTime and lap.userId, but you are not GROUPing BY them. You can only select fields you group by, or else aggregate functions on the other fields (MIN, MAX, AVG, etc). Otherwise, results are undefined.
I think you mean that sometimes values for lap.ID, lap.qualificationTime are different. And it's right behaviour for mysql. Because you group by userId and you don't know what values for other fields will be returned. Mysql can select different values depend on first value or last rows reading.
I would check something like this:
SELECT `l1`.`qualificationTime`, `l1`.`userId`,
(SELECT l2.ID FROM `lap` AS l2 WHERE l2.`userId` = l1.userId AND
l2.qualificationTime = min(l1.`qualificationTime`))
FROM `lap` AS `l1`
WHERE (l1.trackID =4)
AND (l1.raceDateTime >= "2013-07-25 10:00:00")
AND (l1.raceDateTime < "2013-08-04 23:59:59")
AND (isTestLap =0)
GROUP BY `userId`
ORDER BY `qualificationTime` ASC
LIMIT 10
It's likely to be your ORDER BY on a decimal entity, and how the DB stores this and then retrieves it.

How to format this mysql Query

What I have is a table statistieken with an ip, hash of browser info, url visited and last visited date in timestamp.
What I could compile from different sources led to this query, the only problem is that this query takes forever(9 minutes) to complete on a table with about 15000 rows, so this query is very inefficient.
I think I'm going to this the wrong way around, but I can't find a decent post or tutorial how to use the results of a select as basis for getting the results I want.
What I simply want is an overview of every entry in the table that matches the hash of the results that are returned that have visted more than 25 pages in the last 12 hours.
CREATE TABLE IF NOT EXISTS `statsitieken` (
`hash` varchar(35) NOT NULL,
`ip` varchar(24) NOT NULL,
`visits` int(11) NOT NULL,
`lastvisit` int(11) NOT NULL,
`browserinfo` text NOT NULL,
`urls` text NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
This is the query I have tried to compile so far.
SELECT * FROM `database`.`statsitieken` WHERE hash in (SELECT hash FROM `database`.`statsitieken`
where `lastvisit` > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
)
group by hash
having count(urls) > 25
order by urls)
I need this to compile in a decent time, like < 1 second which should be possible in my opinion...
I suggest trying this modified query. The subquery is now computed only once instead of being run for each record returned:
SELECT s.*
FROM `database`.`statsitieken` s, (SELECT *
FROM `database`.`statsitieken`
WHERE `lastvisit` > UNIX_TIMESTAMP(DATE_SUB(NOW(),INTERVAL 12 HOUR))
GROUP BY hash
HAVING COUNT(urls)>25) tmp
WHERE s.`hash`=tmp.`hash`
ORDER BY s.urls
Be sure you have indexes on the following fields:
hash to speed up the GROUP BY and WHERE
urls to speed up the ORDER BY
Derived table with INNER JOIN is faster than a subquery. try this optimized query:
SELECT *
FROM statsitieken a
INNER JOIN (SELECT hash
FROM statsitieken
WHERE lastvisit > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
) b
ON a.hash = b.hash
GROUP BY a.hash
HAVING COUNT(urls) > 25
ORDER BY urls;
For better performance of this select query you should add indexes as:
ALTER TABLE statsitieken ADD KEY ix_hash(hash);
ALTER TABLE statsitieken ADD KEY ix_lastvisit(lastvisit);
WHERE hash in (SELECT hash FROM `database`.`statsitieken`
where `lastvisit` > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
)
You are "subquerying" (i don't know if exists that word :P, 'doing a subquery') in the same table, why not to:
where `lastvisit` > unix_timestamp(DATE_SUB(
NOW(),INTERVAL 12 hour
)
do it directly?

Doing some calculations in mysql, numbers off when using GROUP BY

Im running the following query to get the stats for a user, based on which I pay them.
SELECT hit_paylevel, sum(hit_uniques) as day_unique_hits
, (sum(hit_uniques)/1000)*hit_paylevel as day_earnings
, hit_date
FROM daily_hits
WHERE hit_user = 'xxx' AND hit_date >= '2011-05-01' AND hit_date < '2011-06-01'
GROUP BY hit_user
The table in question looks like this:
CREATE TABLE IF NOT EXISTS `daily_hits` (
`hit_itemid` varchar(255) NOT NULL,
`hit_mainid` int(11) NOT NULL,
`hit_user` int(11) NOT NULL,
`hit_date` date NOT NULL,
`hit_hits` int(11) NOT NULL DEFAULT '0',
`hit_uniques` int(11) NOT NULL,
`hit_embed` int(11) NOT NULL,
`hit_paylevel` int(1) NOT NULL DEFAULT '1',
PRIMARY KEY (`hit_itemid`,`hit_date`),
KEY `hit_user` (`hit_user`),
KEY `hit_mainid` (`hit_mainid`,`hit_date`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
The problem in the calculation has to do with the hit_paylevel which acts as a multiplier. Default is one, the other option is 2 or 3, which essentially doubles or triples the earnings for that day.
If I loop through the days, the daily day_earnings is correct, its just that when I group them, it calculates everything as paylevel 1. This happens if the user was paylevel 1 in the beginning, and was later upgraded to a higher level. if user is pay level 2 from the start, it also calculates everything correctly.
Shouldn't this be sum(hit_uniques * hit_paylevel) / 1000?
Like #Denis said:
Change the query to
SELECT hit_paylevel, sum(hit_uniques) as day_unique_hits
, sum(hit_uniques * hit_paylevel) / 1000 as day_earnings
, hit_date
FROM daily_hits
WHERE hit_user = 'xxx' AND hit_date >= '2011-05-01' AND hit_date < '2011-06-01'
GROUP BY hit_user;
Why this fixes the problem
Doing the hit_paylevel outside the sum, first sums all hit_uniques and then picks a random hit_paylevel to multiply it by.
Not what you want. If you do both columns inside the sum MySQL will pair up the correct hit_uniques and hit_paylevels.
The dangers of group by
This is an important thing to remember on MySQL.
The group by clause works different from other databases.
On MSSQL *(or Oracle or PostgreSQL) you would have gotten an error
non-aggregate expression must appear in group by clause
Or words to that effect.
In your original query hit_paylevel is not in an aggregate (sum) and it's also not in the group by clause, so MySQL just picks a value at random.