I have no experience writing MySQL triggers, but my boss has asked me to write one that does the following.
There are 5 tables (table1, table2, table3, table4, and table5) in the database that we are trying to back up with triggers, but it is not just about the backup; I will explain everything below:
Create a backup table: CREATE TABLE table1_backup AS SELECT * FROM table1 (for all tables). This is a one-time activity.
Compare the data of both tables.
Delete everything from table1 (data, partitions) and keep only 3 days of data (today, yesterday, and the day before) - every day.
I would also like to add that these tables are partitioned:
+----------------+-----------------------+
| partition_date | count(partition_date) |
+----------------+-----------------------+
| 2023-02-19     |                  4837 |
| 2023-02-18     |                 20213 |
+----------------+-----------------------+
The catch is that my boss wants to compare and verify the data; if any action fails, the whole daily process should roll back.
I tried writing cron jobs; they work sometimes and fail sometimes, and I can only verify the data with a COUNT function.
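To make the requirement concrete, here is a rough sketch of what I imagine the daily job could look like, assuming a partition_date column and an already-created table1_backup. As I understand it, DDL such as ALTER TABLE ... DROP PARTITION causes an implicit commit in MySQL and cannot be rolled back, so this version verifies the copy first and removes rows with DELETE instead of dropping partitions:

DELIMITER //
CREATE PROCEDURE daily_backup_table1()
BEGIN
    DECLARE cutoff DATE;
    DECLARE src_count BIGINT;
    DECLARE bkp_count BIGINT;
    DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
        ROLLBACK;   -- undo the copy if anything fails
        RESIGNAL;   -- surface the error
    END;

    -- The day that ages out today, assuming the job really runs every day.
    SET cutoff = CURRENT_DATE - INTERVAL 3 DAY;

    START TRANSACTION;

    -- 1. Copy that day's rows into the backup table.
    INSERT INTO table1_backup
        SELECT * FROM table1 WHERE partition_date = cutoff;

    -- 2. Verify the copy by comparing row counts for that day.
    SELECT COUNT(*) INTO src_count FROM table1        WHERE partition_date = cutoff;
    SELECT COUNT(*) INTO bkp_count FROM table1_backup WHERE partition_date = cutoff;
    IF src_count <> bkp_count THEN
        SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Backup verification failed';
    END IF;

    -- 3. Remove the verified day so table1 keeps only the last 3 days.
    DELETE FROM table1 WHERE partition_date = cutoff;

    COMMIT;
END //
DELIMITER ;

-- Run it once a day (requires the event scheduler: event_scheduler = ON).
CREATE EVENT ev_daily_backup_table1
    ON SCHEDULE EVERY 1 DAY
    DO CALL daily_backup_table1();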
We are working on a ticket booking platform where the user selects the number of tickets, fills in the attendee forms, and makes the payment. At the database level, we store the transaction entry for a single transaction in one table and the multiple attendee entries in a different table, so there is a one-to-many relation between the transaction table and the attendee table.
Transaction Table:
txnId | order id | buyer name | buyer email | amount | txn_status | attendee json | ....
Attendee Table:
attendeeId | order id | attendee name | attendee email | ......
Now you might be thinking, "Why do I have attendee json in the transaction table?" The answer is: when the user initiates the transaction, we store the attendee data as JSON and mark the transaction as INITIATED. After a successful transaction, the same transaction is marked as SUCCESS and the attendee JSON is saved in the Attendee table. We also use this JSON data to show attendee details to the organizer on the dashboard; this way we save a database hit on the attendee table. The attendee JSON itself is not queryable, which is why we have the attendee table to run the required queries.
Question: Now, for various reasons, we are thinking of merging these tables and removing the json column. If a transaction is initiated for 4 attendees, we are thinking of creating four transaction entries, and we have an algorithm to show these entries as a single one on the dashboard. How is this approach going to affect performance? What would be your suggestions?
Now table will look like this:
txnId | order id | buyer name | buyer email | amount | txn_status | attendee name | attendee email ....
1 | 123 | abc | abc#abc.com | 100 | SUCCESS | xyz | xyz#xyz.com....
2 | 123 | abc | abc#abc.com | 100 | SUCCESS | pqr | pqr#pqr.com....
Normalization attempts to organize the database to minimize redundancy. The technique you're using is called denormalization, and it's used to optimize table reads by adding redundant data to avoid joins. When denormalization is appropriate is hotly debated.
In your case, there should be no performance issue with having two tables and a simple join so long as your foreign keys are indexed.
I would go so far as to say you should eliminate the attendee json column, as it's redundant and likely to fall out of sync, causing bugs. The attendee table will need UPDATE, INSERT, and DELETE triggers to keep it up to date, slowing down writes to the table. Many databases have built-in JSON functions which can create JSON very quickly. At minimum, move the cached JSON to the attendee table.
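For instance (a rough sketch; I'm assuming snake_case column names and MySQL 5.7.22+ for JSON_ARRAYAGG), the attendee JSON for the dashboard can be built on demand rather than cached:

SELECT t.txnId,
       t.order_id,
       t.buyer_name,
       JSON_ARRAYAGG(JSON_OBJECT('name', a.attendee_name,
                                 'email', a.attendee_email)) AS attendee_json
FROM transactions t
JOIN attendees a ON a.order_id = t.order_id   -- indexed foreign key
GROUP BY t.txnId, t.order_id, t.buyer_name;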
In addition, you have order id in both the attendee and txn tables, hinting at another data redundancy. buyer name and buyer email suggest they should also be split off into another table, to avoid gumming up the txn table with too much information.
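One possible normalized layout along those lines (all names here are illustrative, not taken from your schema):

CREATE TABLE buyers (
    buyer_id INT PRIMARY KEY AUTO_INCREMENT,
    name     VARCHAR(255),
    email    VARCHAR(255)
);

CREATE TABLE transactions (
    txn_id     INT PRIMARY KEY AUTO_INCREMENT,
    order_id   INT NOT NULL,
    buyer_id   INT NOT NULL,
    amount     DECIMAL(10,2),
    txn_status VARCHAR(20),
    KEY idx_order (order_id),
    FOREIGN KEY (buyer_id) REFERENCES buyers (buyer_id)
);

CREATE TABLE attendees (
    attendee_id INT PRIMARY KEY AUTO_INCREMENT,
    order_id    INT NOT NULL,   -- join key back to transactions
    name        VARCHAR(255),
    email       VARCHAR(255),
    KEY idx_order (order_id)
);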
The rule of thumb is to work towards normalization unless you have solid data saying otherwise. Use indexes as indicated by EXPLAIN. Then denormalize only as much as you need to make the database perform as required. Even then, consider putting a cache on the application side instead.
You might be able to cheaply squeak some performance out of your database now, but you're mortgaging your future. What happens when you want to add a feature that has to do with attendee information and nothing to do with transactions? Envision yourself explaining this to a new developer...
You get attendee information from the transaction table... buyer information, too. But a single attendee may be part of multiple transactions, so you need to use DISTINCT or GROUP BY... which will slow everything down. Also they might have slightly different information, so you have to use insert complicated mess here to figure that all out... which will slow everything down. Why is it this way? Optimization, of course! Welcome to the company!
I have a MySQL database table that gets around 10-15k inserts daily, and that number will certainly increase over the coming months.
- Table Example (reservations): *important fields*
+----+--------+----------+---------+-----+
| ID | people | modified | created | ... |
+----+--------+----------+---------+-----+
I need to provide daily statistics showing how many entries there were (in total and broken down by number of people), based on a DATE or a date RANGE that the user selects.
Today I'm executing two queries per request. It's working fine, with an acceptable delay, but I'm wondering if it will remain stable with more data.
- Single Date:
SELECT COUNT(*) from reservations WHERE created='DATE USER SELECTED'
SELECT COUNT(*), people from reservations WHERE created='DATE USER SELECTED' GROUP BY people
- Date Range:
SELECT COUNT(*) from reservations WHERE created BETWEEN 'DATE USER SELECTED' AND 'DATE USER SELECTED';
SELECT COUNT(*), people from reservations WHERE created BETWEEN 'DATE USER SELECTED' AND 'DATE USER SELECTED' GROUP BY people
IN MY VIEW
Pros: Real time statistics.
Cons: Can overload the database, with similar and slow queries.
I thought of creating a secondary table, named 'statistics', and running a cron job on my server each morning to calculate all the statistics.
- Table Example (statistics):
+----+------+--------------------+---------------------------+---------------------------+-----+
| ID | date | numberReservations | numberReservations2People | numberReservations3People | ... |
+----+------+--------------------+---------------------------+---------------------------+-----+
- IN MY VIEW
Pros: Faster queries, do not need to count every request.
Cons: Not real time statistics.
What do you think about it? Is there a better approach?
The aggregate queries you've shown can be satisfied efficiently if you have the correct compound index on your table. If you're not sure about compound indexes, you can read about them.
The index (created, people) on your reservations table is the right one for both of those queries. They can both be satisfied with an efficient index scan (MySQL's loose index scan). You'll find that they are fast enough that you won't need to bother with a secondary table for the foreseeable future in your system.
That's good, because secondary tables like the one you propose are a common source of confusion and errors.
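For reference, using the table and column names from your question, that index would be added like this:

ALTER TABLE reservations ADD INDEX idx_created_people (created, people);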
I am building a front-end to a largish DB (tens of millions of rows). The data is water usage for loads of different companies and the table looks something like:
id | company_id | datetime | reading | used | cost
=============================================================
1 | 1 | 2012-01-01 00:00:00 | 5000 | 5 | 0.50
2 | 1 | 2012-01-01 00:01:00 | 5015 | 15 | 1.50
....
On the front end, users can select how they want to view the data, e.g. 6-hourly increments, daily increments, monthly, etc. What would be the best way to do this quickly? Given how much the data changes and how few times any one set of data will be viewed, caching the query results in memcached or something similar is almost pointless, and there is no way to build the data beforehand as there are too many variables.
I figured using some kind of aggregate table would work, having tables such as readings, readings_6h, readings_1d with exactly the same structure, just already aggregated.
If this is a viable solution, what is the best way to keep the aggregate tables up to date and accurate? Besides the data coming in from the meters, the table is read-only; users never have to update or write to it.
A number of possible solutions include:
1) stick to doing queries with group / aggregate functions on the fly
2) doing a basic select and save
SELECT `company_id`, CONCAT_WS(' ', DATE(`datetime`), '23:59:59') AS datetime,
       MAX(`reading`) AS reading, SUM(`used`) AS used, SUM(`cost`) AS cost
FROM `readings`
WHERE `datetime` > '$lastUpdateDateTime'
GROUP BY `company_id`, DATE(`datetime`)
3) duplicate key update (not sure how the aggregation would be done here, or how to make sure the data is accurate and neither counted twice nor missing rows)
INSERT INTO `readings_6h` ...
SELECT FROM `readings` ....
ON DUPLICATE KEY UPDATE .. calculate...
4) other ideas / recommendations?
I am currently doing option 2, which is taking around 15 minutes to aggregate roughly 100k rows into roughly 30k rows over 5 tables (_6h, _1d, _7d, _1m, _1y).
TL;DR What is the best way to view / store aggregate data for numerous reports that can't be cached effectively.
This functionality would be best served by a feature called materialized view, which MySQL unfortunately lacks. You could consider migrating to a different database system, such as PostgreSQL.
There are ways to emulate materialized views in MySQL using stored procedures, triggers, and events. You create a stored procedure that updates the aggregate data. If the aggregate data has to be updated on every insert you could define a trigger to call the procedure. If the data has to be updated every few hours you could define a MySQL scheduler event or a cron job to do it.
There is a combined approach, similar to your option 3, that does not depend on the dates of the input data; imagine what would happen if some new data arrived a moment too late and did not make it into the aggregation. (You might not have this problem, I don't know.) You could define a trigger that inserts new data into a "backlog" table, and have the procedure update the aggregate tables from the backlog only.
All these methods are described in detail in this article: http://www.fromdual.com/mysql-materialized-views
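As a rough sketch of the scheduler-based variant (the readings table comes from your question; the readings_1d layout, the event name, and the hourly refresh window are assumptions on my part):

CREATE TABLE IF NOT EXISTS readings_1d (
    company_id INT  NOT NULL,
    day        DATE NOT NULL,
    reading    INT,
    used       INT,
    cost       DECIMAL(10,2),
    PRIMARY KEY (company_id, day)
);

CREATE EVENT ev_refresh_readings_1d
ON SCHEDULE EVERY 1 HOUR
DO
    -- Re-aggregate the last two days so late-arriving rows are picked up;
    -- the full-day aggregates simply overwrite the previous values.
    INSERT INTO readings_1d (company_id, day, reading, used, cost)
    SELECT company_id, DATE(`datetime`), MAX(reading), SUM(used), SUM(cost)
    FROM readings
    WHERE `datetime` >= CURRENT_DATE - INTERVAL 1 DAY
    GROUP BY company_id, DATE(`datetime`)
    ON DUPLICATE KEY UPDATE
        reading = VALUES(reading),
        used    = VALUES(used),
        cost    = VALUES(cost);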
I have to update the view count of the current post, which lives in a posts table with more than 2 million rows, and the page load time is slow.
Tables:
idpost | iduser | views | title |
1 | 5675 | 45645 | some title |
2 | 345 | 457 | some title 2 |
6 | 45 | 98 | some title 3 |
and many more... up to 2 millions
iduser has an index, and idpost is the primary key.
What if I separate the data, make a new table post_views, and use a LEFT JOIN to get the views value? At first it will be fast since the new table is still small, but over time it too will grow past 2 million rows and become slow again. How do you deal with huge tables?
Split the table
You should split the table to separate different concerns and prevent repetition of the title data. This will be a better design. I suggest the following schema:
posts(idpost, title)
post_views(idpost, iduser, views)
Updating views count
You only need to update the views of one row at a time: someone views your page, and you update the related row. So it's a single-row update with no search overhead (thanks to the key and index). I don't see how this would create an overhead.
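For example, assuming a unique key on (idpost, iduser) - that key is an assumption on my part - the per-view update is a single-row statement:

-- Insert the first view for this (post, user) pair, or bump the counter.
INSERT INTO post_views (idpost, iduser, views)
VALUES (100, 5675, 1)
ON DUPLICATE KEY UPDATE views = views + 1;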
Getting total views
Probably, you run a query like this one:
SELECT SUM(views) FROM post_views WHERE idpost = 100
Yes, this can create overhead. A solution may be to create a new table, total_post_views, and update the corresponding value in it after each update on post_views. That way you get rid of the LEFT JOIN and access the total view count directly.
But updating it on every update of post_views also creates overhead. To increase performance, you can give up updating total_post_views after each update of post_views. If you choose this route, you can perform the update:
periodically, say every 30 seconds,
after a certain number of updates to post_views, say every 30 updates.
This way you will get approximate results, of course. If that is tolerable, then I suggest you go this way.
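As a rough sketch (the total_post_views layout here is just an assumption), the periodic refresh could be as simple as:

-- total_post_views(idpost PRIMARY KEY, total_views)
INSERT INTO total_post_views (idpost, total_views)
SELECT idpost, SUM(views)
FROM post_views
GROUP BY idpost
ON DUPLICATE KEY UPDATE total_views = VALUES(total_views);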
I have a query that works correctly as far as the result is concerned, but it takes about 45 seconds to be processed. That's definitely too long for presenting the data in a GUI.
So I need a much faster/more efficient query (something around a few milliseconds would be nice).
My data table has grown from around 3,000 to ~2,619,395 entries and is still growing.
Schema:
num | station | fetchDate | exportValue | error
1 | PS1 | 2010-10-01 07:05:17 | 300 | 0
2 | PS2 | 2010-10-01 07:05:19 | 297 | 0
923 | PS1 | 2011-11-13 14:45:47 | 82771 | 0
Explanation
the exportValue is always incrementing
the exportValue represents the actual absolute value
in my case there are 10 stations
every ~15 minutes 10 new entries are written to the table
error is just an indicator of whether a station is working properly
Working query:
SELECT YEAR(fetchDate), station, MAX(exportValue) - MIN(exportValue)
FROM registros
WHERE exportValue > 0 AND error = 0
GROUP BY station, YEAR(fetchDate)
ORDER BY YEAR(fetchDate), station
Output:
Year | station | Max-Min
2008 | PS1 | 24012
2008 | PS2 | 23709
2009 | PS1 | 28102
2009 | PS2 | 25098
My thoughts on it:
writing several queries with BETWEEN statements, like 'between 2008-01-01 and 2008-01-02' to fetch the MIN(exportValue) and 'between 2008-12-30 and 2008-12-31' to grab the MAX(exportValue) - problem: a lot of queries, and it's not guaranteed that there is any data in a given time range
limiting the result set to my 10 stations only by using ORDER BY MIN(fetchDate) - problem: this also takes a long time to process
Additional Info:
I'm using the query in a Java application (JPA 2.0), which means it would be possible to do some post-processing on the result set if necessary.
Any help/approaches/ideas are very appreciated. Thanks in advance.
Adding suitable indexes will help.
2 compound indexes will speed things up significantly:
ALTER TABLE tbl_name ADD INDEX (error, exportValue);
ALTER TABLE tbl_name ADD INDEX (station, fetchDate);
This query running on 3000 records should be extremely fast.
Suggestions:
Do you have a PK set on this table? (station, fetchDate)?
Add indexes; you should experiment with indexes as rich.okelly suggested in his answer.
Depending on the results of those experiments, try breaking your query into multiple statements inside one stored procedure; this way you will not lose time on network traffic between multiple queries sent from the client to MySQL.
You mentioned that you tried separate queries and that there is a problem when there is no data for a particular month; that is a regular case in business applications, and you should handle it in a "master query" (stored procedure or application code).
I guess fetchDate is the current date and time at the moment of record insertion; consider keeping previous months' data in a sort of summary table with the fields year, month, station, max(exportValue), min(exportValue) - this means you would insert summary records into the summary table at the end of each month (a possible shape is sketched below); deleting, keeping, or moving detail records to a separate table is your choice.
Since your table is growing rapidly (new rows every 15 minutes), you should take the last suggestion into account. There is probably no need to keep the detailed history in one place; archiving data is a process that should be done as part of maintenance.
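A possible shape for such a summary table, reusing the table and column names from your query (the summary table name itself is made up):

CREATE TABLE registros_monthly (
    year       SMALLINT    NOT NULL,
    month      TINYINT     NOT NULL,
    station    VARCHAR(10) NOT NULL,
    max_export INT,
    min_export INT,
    PRIMARY KEY (year, month, station)
);

-- Filled once at the end of each month for the month that just closed.
INSERT INTO registros_monthly (year, month, station, max_export, min_export)
SELECT YEAR(fetchDate), MONTH(fetchDate), station,
       MAX(exportValue), MIN(exportValue)
FROM registros
WHERE exportValue > 0 AND error = 0
  AND fetchDate >= DATE_FORMAT(CURRENT_DATE - INTERVAL 1 MONTH, '%Y-%m-01')
  AND fetchDate <  DATE_FORMAT(CURRENT_DATE, '%Y-%m-01')
GROUP BY YEAR(fetchDate), MONTH(fetchDate), station;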