MySQL: indexed count query vs maintaining summary table

I am working on an e-commerce website where users can show interest in available products, and we store each expression of interest as a lead in a MySQL table. This Leads table contains millions of records and grows by 8 records per second. The table structure is as follows:
LeadId | ProductId | UserId | RequestDate(DateTime)
Table Schema:
`id` int(11) NOT NULL AUTO_INCREMENT,
`ProductId` int(11) DEFAULT NULL,
`UserID` int(11) NOT NULL,
`RequestDateTime` datetime(3) NOT NULL,
PRIMARY KEY (`id`),
KEY `ix_leads_requestdatetime` (`RequestDateTime`) USING BTREE,
KEY `ix_leads_productid` (`ProductId`) USING BTREE,
KEY `ix_leads_userid` (`UserID`) USING BTREE
Now, the requirement is to allow each user to submit a maximum of 10 leads per day. I have the following approaches to implement this:
Run a SELECT query to count that user's records for the day in the Leads table, and check that the count is < 10 before insertion.
Maintain a DailyLeadCount table which contains the count of leads for each UserId on a particular date. Table structure:
UserId | Date | Count
Table Schema:
`RequestDate` date NOT NULL,
`UserId` int(11) NOT NULL,
`LeadCount` smallint(6) NOT NULL,
PRIMARY KEY (`RequestDate`,`UserId`)
I will check the count in this table before inserting into the Leads table, and update the count after each insertion. Also, since only one day's data is useful in this table, I will create a job to archive it daily.
Which approach is better? Is running a SELECT query on the Leads table to get the count heavier than the insert/update and SELECT queries on the DailyLeadCount table?
Is it worth maintaining and archiving a table daily for this?
Is there any other way to handle this?

Change
KEY `ix_leads_userid` (`UserID`) USING BTREE
to
INDEX(UserID, RequestDateTime)
Then reject the lead when:
( SELECT COUNT(*) FROM Leads WHERE UserID = 1234
AND RequestDateTime > NOW() - INTERVAL 24 HOUR
) >= 10
The query will be fast enough to do in real time.
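As an aside (an illustration, not part of the original answer), the check can be folded into the INSERT itself, so the application only has to inspect the affected-row count. A minimal sketch, assuming the Leads schema above; 1234 and 5678 are placeholder UserID and ProductId values:
INSERT INTO Leads (ProductId, UserID, RequestDateTime)
SELECT 5678, 1234, NOW(3)
FROM DUAL
WHERE ( SELECT COUNT(*) FROM Leads
        WHERE UserID = 1234
          AND RequestDateTime > NOW() - INTERVAL 24 HOUR ) < 10;
-- 0 affected rows means the user has already reached the limit. Two
-- concurrent requests could still race past the COUNT; a transaction with
-- locking, or the summary-table approach discussed below, closes that gap.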
The count is between this time yesterday and now -- this may not be exactly what you want. If, instead, you want the clock to start at midnight this morning:
AND RequestDateTime > CURDATE()
If "since midnight yesterday":
AND RequestDateTime > CURDATE() - INTERVAL 1 DAY
If you want to use the user's timezone for midnight, it gets messier.
Potential issue: if the user can somehow batch their leads, they could insert multiple leads in the same millisecond. (I notice you are using DATETIME(3).)
Your idea of a summary table works best if you need to check against "yesterday"; it works less well for "the last 86,400,000 milliseconds".
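If you do go with the DailyLeadCount table, the check and the increment can be made a single atomic statement. A minimal sketch, assuming the exact schema from the question (1234 is a placeholder UserId, 10 is the daily limit):
INSERT INTO DailyLeadCount (RequestDate, UserId, LeadCount)
VALUES (CURDATE(), 1234, 1)
ON DUPLICATE KEY UPDATE
LeadCount = IF(LeadCount < 10, LeadCount + 1, LeadCount);
-- An affected-row count of 0 means the existing row was left unchanged,
-- i.e. the user already hit the limit, so skip the insert into Leads.
Because the whole check-and-bump is one statement, two concurrent requests cannot both sneak past the limit.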

Related

Best way to keep an Earnings Record ensuring performance and maintainability

I'm trying to keep a record of the earnings of the users of my website, and I'm stuck on which of the following designs is best regarding performance and overall usability:
• First way:
In this way, a single database will be created containing a table for each year. Each table will have 13 columns: the user ID and the 12 months. The value of each field will be a stringified array with the values for all the days of the month, like so: [12.5, 28.3, 9.75, ...].
Code:
-- Create a database to keep record of the earnings of all users.
CREATE DATABASE IF NOT EXISTS `Earnings_Record`;
-- Create a table for each year containing 13 columns, the user ID and the 12 months.
CREATE TABLE IF NOT EXISTS `Earnings_Record`.`Earnings_2017` (
`ID` INT(11) NOT NULL,
`January` VARCHAR(250) NOT NULL,
`February` VARCHAR(250) NOT NULL,
...
`December` VARCHAR(250) NOT NULL,
PRIMARY KEY (`ID`)
) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_general_ci;
• Second way:
In this way, multiple databases will be created, one for each year, each containing a table for each month. Each table will have 28-31 + 1 columns: the user ID and the 28-31 days. The value of each field will be a decimal, like so: 12.5.
Code:
-- Create a database to keep record of the earnings of all users for 2017.
CREATE DATABASE IF NOT EXISTS `Earnings_2017`;
-- Create a table for each month with 28-31 + 1 columns, the user ID and the 28-31 days.
CREATE TABLE IF NOT EXISTS `Earnings_2017`.`January` (
`ID` INT(11) NOT NULL,
`Day 1` DECIMAL(10, 2) NOT NULL,
`Day 2` DECIMAL(10, 2) NOT NULL,
...
`Day 31` DECIMAL(10, 2) NOT NULL,
PRIMARY KEY (`ID`)
) ENGINE = InnoDB CHARACTER SET utf8 COLLATE utf8_general_ci;
Since the website will hopefully be running for 5-10 years, which is the best way to design this when it comes to overall performance and long-term maintainability?
(One thing to keep in mind is that the earnings of each user will be updated multiple times every day for active users)
Third way:
Create a single database. Create a single table for each entity:
CREATE DATABASE IF NOT EXISTS Earnings_Record;
CREATE TABLE IF NOT EXISTS Earnings_Record.Users (
UserId INT AUTO_INCREMENT PRIMARY KEY,
. . .
);
CREATE TABLE IF NOT EXISTS Earnings_Record.Earnings (
EarningsID INT AUTO_INCREMENT PRIMARY KEY,
UserId INT NOT NULL,
EarningsDate DATE,
Amount DECIMAL(10, 2), -- or whatever
CONSTRAINT fk_Earnings_UserId FOREIGN KEY (UserId) REFERENCES Users(UserId)
) ;
Simple. Easy-to-query. Just what you need. This is the SQL way to represent data.
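To make "easy-to-query" concrete, here is an illustrative query against the Earnings table just defined (the date range is a placeholder):
-- Illustrative only: total earnings per user per month for 2017.
SELECT UserId,
DATE_FORMAT(EarningsDate, '%Y-%m') AS Month,
SUM(Amount) AS MonthTotal
FROM Earnings_Record.Earnings
WHERE EarningsDate >= '2017-01-01'
AND EarningsDate < '2018-01-01'
GROUP BY UserId, DATE_FORMAT(EarningsDate, '%Y-%m');
Neither of the first two designs can answer this without unpacking strings or UNIONing tables together.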
It's hard to answer this, but any solution that requires multiple databases or tables is probably not maintainable, scalable, or fast.
I really don't understand your business domain - you say you want to maintain earnings per user, but your tables don't have any reference to a user.
To design the database, it would really help to understand typical queries - do you want to find out total earnings for a period? Do you want to find days with high and/or low earnings? Do you want to aggregate earnings over a group of dates, e.g. "every monday"?
I'd start with:
table earnings
-------------
earnings_date (date) (pk)
earnings_amount (decimal 10,2)
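With that shape, the "every Monday" aggregation mentioned above is a one-liner. A sketch using the columns just listed (DAYOFWEEK() returns 2 for Monday in MySQL):
SELECT SUM(earnings_amount) AS monday_total
FROM earnings
WHERE DAYOFWEEK(earnings_date) = 2;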
Multiple databases -- NO
Splaying the months across 12 columns -- NO
Stringifying -- Only if you never need MySQL to filter on the data or do arithmetic on the data.
All of these are discussed in various ways in this forum.
There is no problem having thousands, even millions, of rows in a table. The other approaches are headache-causers.

MySQL Count Distinct - Very Slow

I have a very big MySQL InnoDB table with the following structure:
CREATE TABLE `whois_records` (
`record_id` int(10) unsigned NOT NULL,
`domain_name` varchar(100) NOT NULL,
`tld_id` smallint(5) unsigned DEFAULT NULL,
`create_date` date DEFAULT NULL,
`update_date` date DEFAULT NULL,
`expiry_date` date DEFAULT NULL,
`query_time` datetime NOT NULL,
PRIMARY KEY (`record_id`),
UNIQUE KEY `domain_time` (`domain_name`,`query_time`),
KEY `tld_id` (`tld_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
This table currently has 10 Million rows.
It stores frequently updated details of domain names.
So there can be multiple records for the same domain name in the table.
TLD ID is the numeric value of the type of domain extension.
The problem arises when I try to count the total number of domain names for a particular TLD.
I have tried the following 3 SQL queries:
SELECT COUNT(DISTINCT(domain_name)) FROM `whois_records` WHERE tld_id=159
SELECT COUNT(*) FROM `whois_records` WHERE tld_id=159 GROUP BY domain_name
SELECT COUNT(*) FROM ( SELECT 1 FROM `whois_records` WHERE tld_id=159 GROUP BY domain_name) q
All three are very slow, taking between 5 and 10 minutes, and they use a lot of CPU. There is an index on the tld_id column, so these queries are likely doing a full index scan, yet they are still very slow. TLD ID 159 is ".com", which has the most rows: around 6 million records, or 60% of the entire 10-million-row table, so searching for 159 is slowest. For an unpopular TLD with fewer than 100 domains, the same query takes around 0.10 seconds.
Is there any way to optimize the calculation?
As the table grows, the current queries will take longer, so can anyone please help me with a future-proof solution to this problem? Is any alteration of the table required? Thank you :)
Extend the index to contain domain_name as well:
INDEX `tld_id` (`tld_id`, `domain_name`)
This should make MySQL use only the index, not the table data, to compute the result. If the combination of both values is unique, add a new unique index instead:
UNIQUE INDEX `new_index` (`tld_id`, `domain_name`)
I doubt you can push it a lot further than that. If it is still not fast enough, think about caching the counters.
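Caching could be as simple as a small summary table refreshed by a periodic job. A sketch under that assumption (tld_domain_counts is an invented name):
-- Invented counter-cache table; refresh it from cron or a MySQL EVENT.
CREATE TABLE tld_domain_counts (
tld_id smallint(5) unsigned NOT NULL PRIMARY KEY,
domain_count int unsigned NOT NULL
);
REPLACE INTO tld_domain_counts (tld_id, domain_count)
SELECT tld_id, COUNT(DISTINCT domain_name)
FROM whois_records
WHERE tld_id IS NOT NULL -- tld_id is nullable in the source table
GROUP BY tld_id;
Reads then hit this tiny table instantly; the expensive scan runs once per refresh interval instead of once per query.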

MySQL Query list

I'm going to try to explain this the best I can; I will provide more information quickly if needed.
I'm storing data for each hour in military time. I only need to store a day's worth of data. My table structure is below:
CREATE TABLE `onlinechart` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`user` varchar(100) DEFAULT NULL,
`daytime` varchar(10) DEFAULT NULL,
`maxcount` smallint(20) DEFAULT NULL,
`lastupdate` varchar(100) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=innodb AUTO_INCREMENT=2 DEFAULT CHARSET=latin1
The "user" column is unique to each user. So I will have list for each user.
The "daytime" column I'm having it store the day and hour together. So as for today and hour it would be "2116" so the day is 21 and the current hour is 16.
The "maxcount" column is what data for each hour. I'm tracking just one total number each hour.
The "lastupdate" column is just a timestamp im using to delete data that is 24 hours+ old.
I have the script running in PHP fine for the tracking. It keeps a total of 24 rows of data for each user and deletes anything older then 24hours. My problem is how would I go about a query that would start from the current hour/day and pull that past 24 hours maxcount and display them in order.
Thanks
You will run into an issue handling this near the end of each month, since "daytime" only stores the day of the month. It's advisable to switch to MySQL's native temporal types (described here: http://dev.mysql.com/doc/refman/5.0/en/datetime.html). Then you can grab maxcount by doing something such as:
SELECT * FROM onlinechart WHERE daytime >= ? ORDER BY maxcount
The question mark should be replaced by the current timestamp minus 86400 (the number of seconds in a day).
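A sketch of what that switch could look like, assuming daytime becomes a real DATETIME (existing "2116"-style strings would need converting, or the table repopulating, first):
-- Replace the '2116' string with a native temporal value.
ALTER TABLE onlinechart MODIFY daytime datetime NOT NULL;
-- Last 24 hours of counts for one user, oldest hour first.
SELECT daytime, maxcount
FROM onlinechart
WHERE `user` = 'some_user'
AND daytime >= NOW() - INTERVAL 24 HOUR
ORDER BY daytime;
This would also make the lastupdate column largely redundant for the 24-hour cleanup.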

MySQL Value Difference optimization

Hello guys,
I'm running a very large database (at the moment >5 million rows). My database stores custom generated numbers (which numbers and how they are composed doesn't really matter here) and the corresponding date for each one. In addition, an ID is stored for every product (meaning one product can have multiple entries for different dates, so the primary key is composite). Now I want to SELECT the top 10 IDs with the largest difference in their numbers over the last two days. Currently I try to achieve this using JOINs, but since I have that many rows this approach is far too slow. How could I speed up the whole operation?
SELECT
d1.place,d2.place,d1.ID
FROM
daily
INNER JOIN
daily AS d1 ON d1.date = CURDATE()
INNER JOIN
daily as d2 ON d2.date = DATE_ADD(CURDATE(), INTERVAL -1 DAY)
ORDER BY
d2.code-d1.code LIMIT 10
EDIT: This is what my structure looks like:
CREATE TABLE IF NOT EXISTS `daily` (
`ID` bigint(40) NOT NULL,
`source` char(20) NOT NULL,
`date` date NOT NULL,
`code` int(11) NOT NULL,
`cc` char(2) NOT NULL,
PRIMARY KEY (`ID`,`source`,`date`,`cc`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
That's the output of the EXPLAIN statement:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE d1 ALL PRIMARY NULL NULL NULL 5150350 Using where; Using temporary; Using filesort
1 SIMPLE d2 ref PRIMARY PRIMARY 8 mytable.d1.ID 52 Using where
How about this?
SELECT
d1.ID, d1.place, d2.place
FROM
daily AS d1
CROSS JOIN
daily AS d2
USING (ID)
WHERE
d1.date = CURDATE()
AND d2.date = CURDATE() - INTERVAL 1 DAY
ORDER BY
d2.code - d1.code DESC
LIMIT
10
Some thoughts about your table structure.
`ID` bigint(40) NOT NULL,
Why BIGINT? You would need to be doing 136 inserts/s, 24 hours a day, 7 days a week, for a year to exhaust the range of an unsigned INT. And before you get halfway there, your application will probably need a professional DBA anyway.
Remember, a smaller primary key leads to faster lookups -- which brings us to:
PRIMARY KEY (`ID`,`source`,`date`,`cc`)
Why? A single-column PK on the ID column should be enough. If you need indexes on other columns, create additional indexes (and do it wisely). As it is, you basically have a covering index for the entire table... which is like having the entire table in the index.
Last but not least: where is the place column? You've used it in your query (and then I did in mine), but it's nowhere to be seen.
Proposed table structure:
CREATE TABLE IF NOT EXISTS `daily` (
`ID` int(10) UNSIGNED NOT NULL, -- usually AUTO_INCREMENT is used as well
`source` char(20) NOT NULL,
`date` date NOT NULL,
`code` int(11) NOT NULL,
`cc` char(2) NOT NULL,
PRIMARY KEY (`ID`),
KEY `ID_date` (`ID`,`date`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
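One further hedged tweak, assuming the difference query keeps filtering both sides of the self-join by date: an index that leads with date lets those filters seek instead of scanning all five million rows (the index name is illustrative):
-- Lets the d1/d2 date filters use the index; ID and code are appended
-- for the join and the ORDER BY difference.
ALTER TABLE daily ADD KEY `date_ID_code` (`date`, `ID`, `code`);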

How could I optimise this MySQL query?

I have a table that stores a pupil_id, a category and an effective date (amongst other things). The dates can be past, present or future. I need a query that will extract a pupil's current status from the table.
The following query works:
SELECT *
FROM pupil_status
WHERE (status_pupil_id, status_date) IN (
SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < NOW() -- to ensure we ignore the "future status"
GROUP BY status_pupil_id );
In MySQL, the table is defined as follows:
CREATE TABLE IF NOT EXISTS `pupil_status` (
`status_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`status_pupil_id` int(10) unsigned NOT NULL, -- a foreign key
`status_category_id` int(10) unsigned NOT NULL, -- a foreign key
`status_date` datetime NOT NULL, -- effective date/time of status change
`status_modify` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`status_staff_id` int(10) unsigned NOT NULL, -- a foreign key
`status_notes` text NOT NULL, -- notes detailing the reason for status change
PRIMARY KEY (`status_id`),
KEY `status_pupil_id` (`status_pupil_id`,`status_category_id`),
KEY `status_pupil_id_2` (`status_pupil_id`,`status_date`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=1409 ;
However, with 950 pupils and just over 1400 statuses in the table, the query takes 0.185 seconds to process. Perhaps acceptable now, but I'm worried about scalability as the table swells. The production system is likely to have over 10000 pupils, each with 15-20 statuses.
Is there a better way to write this query? Are there better indexes that I should have to assist the query? Please let me know.
There are a couple of things you could try:
1. Use an INNER JOIN instead of the IN subquery:
SELECT *
FROM pupil_status ps
INNER JOIN
(SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < NOW()
GROUP BY status_pupil_id) X
ON ps.status_pupil_id = x.status_pupil_id
AND ps.status_date = x.status_date
2. Store the value of NOW() in a variable. I am not sure whether the DB engine optimizes this down to a single call to NOW(), but if it doesn't, this might help a bit (see the sketch below).
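A minimal sketch of suggestion 2, using a user variable (illustrative only):
-- Evaluate NOW() once and reuse it in the subquery.
SET @now = NOW();
SELECT status_pupil_id, MAX(status_date)
FROM pupil_status
WHERE status_date < @now
GROUP BY status_pupil_id;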
These are some suggestions; however, you will need to compare the query plans and see whether there is any appreciable improvement.
Depending on how the query plan uses your indexes, robob's suggestion above could also come in handy.
Find out how long the query takes when you load the system with 10000 pupils, each with 15-20 statuses.
Only refactor if it takes too long.