MySql table structure - mysql

At the moment, I have a table in mysql that records transactions. These transactions may be updated by users - sometimes never, sometimes often. However, I need to track changes to every field in this table. So, what I have at the moment is a TINYINT(1) field in the table called 'is_deleted', and when a transaction is 'updated' by the user, it simply updates the is_deleted field to 1 for the old record and inserts a brand new record.
This all works well, because I simply have to run the following sql statement to get all current records:
SELECT id, foo, bar, something FROM trans WHERE is_deleted = 0;
However, I am concerned that this table will grow unnecessarily large over time, and I am therefore thinking of actually deleting the old record and 'archiving' it off to another table (trans_deleted) instead. That means that the trans table will only contain 'live' records, therefore making SELECT queries that little bit faster.
It does mean, however, that updating records will take slightly longer, as it will be running 3 queries:
1. INSERT the old record INTO trans_deleted;
2. DELETE FROM trans WHERE id = 5;
3. INSERT the new record INTO trans;
So, updating records will take a bit more work, but reading will be faster.
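For example, the whole update could be wrapped in one transaction so the archive, delete and insert either all happen or none do (a rough sketch with made-up values, assuming trans_deleted has the same columns as trans):
START TRANSACTION;  -- assumes InnoDB so the three statements are atomic
INSERT INTO trans_deleted (id, foo, bar, something)
  SELECT id, foo, bar, something FROM trans WHERE id = 5;
DELETE FROM trans WHERE id = 5;
INSERT INTO trans (foo, bar, something) VALUES ('new foo', 'new bar', 'new something');
COMMIT;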
Any thoughts on this?

I would suggest a table trans and a table revision, where trans has the fields id and current_revision, and revision has the fields id, trans_id, foo and bar.
To get all current items then:
SELECT r.foo, r.bar FROM trans AS t
LEFT JOIN revision AS r ON t.id = r.trans_id
WHERE t.current_revision = r.id
If you now put indexes on r.id and r.trans_id, archiving won't make it much faster for you.
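For example (the index name is illustrative, and r.id would normally be the primary key already, which is an index in itself):
CREATE INDEX idx_revision_trans_id ON revision (trans_id);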

Well, typically you read much more often than you write (and you additionally say that some records may never be changed). So that's one reason to go for the archive table.
There's also another one: You also have to account for programmer time, not only processor time :) If you keep the archived rows in the same table with the live ones, you'll have to remember and take care of that in every single query you perform on that table. Not to speak of future programmers who may have to deal with the table... I'd recommend the archiving approach based on this factor alone, even if there wasn't any speed improvement!

Related

Should a counter column with frequent update be stored in a separate table?

I have a MySQL/MariaDB database where posts are stored. Each post has some statistical counters such as the number of times the post has been viewed for the current day, the total number of views, number of likes and dislikes.
For now, I plan to have all of the counter columns updated in real-time every time an action happens - a post gets a view, a like or a dislike. That means that the post_stats table will get updated all the time while the posts table will rarely be updated and will only be read most of the time.
The table schema is as follows:
posts(post_id, author_id, title, slug, content, created_at, updated_at)
post_stats(post_id, total_views, total_views_today, total_likes, total_dislikes)
The two tables are connected with a post_id foreign key. Currently, both tables use InnoDB. The data from both tables will be always queried together to be able to show a post with its counters, so this means there will be an INNER JOIN used all the time. The stats are updated right after reading them (every page view).
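For reference, the read on every page view looks roughly like this (the lookup by slug is just an example):
SELECT p.*, s.total_views, s.total_views_today, s.total_likes, s.total_dislikes
FROM posts p
INNER JOIN post_stats s ON s.post_id = p.post_id
WHERE p.slug = 'some-post';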
My questions are:
For best performance when the tables grow, should I combine the two tables into one, since the columns in post_stats are directly related to the post entries, or should I keep the counter/summary table separate from the main posts table?
For best performance when the tables grow, should I use MyISAM for the posts table, as I imagine that MyISAM can be more efficient at reads while InnoDB is more efficient at inserts?
This problem is general for this database and also applies to other tables in the same database, such as users (counters such as the total number of views of their posts, the total number of comments written by them, the total number of posts written by them, etc.) and categories (the number of posts in that category, etc.).
Edit 1: The views per day counters are reset once daily at midnight with a cron job.
Edit 2: One reason for having posts and post_stats as two tables is concerns about caching.
For low traffic, KISS -- Keep the counters in the main post table. (I assume you have ruled this out.)
For high traffic, keep the counters in a separate table. But let's do the "today's" counters differently. (This is what you want to discuss.)
For very high traffic, gather up counts so that you can do less than 1 Update per click/view/like. ("Summary Tables" is beyond the scope of this question.)
Let's study total_views_today. Do you have to do a big "reset" every midnight? That is (or will become) too costly, so let's try to avoid it.
Have only total_views in the table.
At midnight copy the table into another table. (SELECT is faster and less-invasive than the UPDATE needed to reset the values.) Do this copy by building a new table, then RENAME TABLE to move it into place.
Compute total_views_today by subtracting the corresponding values in the two tables.
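A minimal sketch of that midnight job, assuming the live table is post_stats and the snapshot table is called post_stats_midnight:
CREATE TABLE post_stats_midnight_new LIKE post_stats;
INSERT INTO post_stats_midnight_new SELECT * FROM post_stats;
RENAME TABLE post_stats_midnight TO post_stats_midnight_old,
             post_stats_midnight_new TO post_stats_midnight;
DROP TABLE post_stats_midnight_old;
-- today's views = current total minus the value captured at midnight
SELECT s.post_id, s.total_views - m.total_views AS total_views_today
FROM post_stats s
JOIN post_stats_midnight m USING (post_id);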
That leaves you with
post_stats(post_id, total_views, total_likes, total_dislikes)
For "high traffic", it is fine to do
UPDATE post_stats SET ... = ... + 1 WHERE post_id = ...;
at the moment needed (for each counter).
But there is a potential problem. You can't increment a counter if the row does not exist. That would be best solved by creating a row with zeros at the same time the post is created. (Otherwise, see IODKU.)
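If the zero-filled row can't be guaranteed, the IODKU (INSERT ... ON DUPLICATE KEY UPDATE) pattern looks like this, shown here for a view and assuming post_id is the PRIMARY KEY (or UNIQUE):
INSERT INTO post_stats (post_id, total_views, total_likes, total_dislikes)
VALUES (123, 1, 0, 0)
ON DUPLICATE KEY UPDATE total_views = total_views + 1;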
(I may come back if I think of more.)

Improve delete with IN performance

I struggle to write a DELETE query in MariaDB 5.5.44 database.
The first of the two following code samples works great, but I need to add a WHERE statement there. That is displayed in the second code sample.
I need to delete only the rows from polozkyTransakci where puvodFaktury <> 'FAKTURA VO CZ' in the transakce_tmp table. I thought that my WHERE statement in the second sample would work with the inner SELECT, but it takes forever to process (about 40 minutes in my cloud-based ETL tool) and even then it does not leave the rows I want untouched.
1.
DELETE FROM polozkyTransakci
WHERE typPolozky = 'odpocetZalohy';
2.
DELETE FROM polozkyTransakci
WHERE typPolozky = 'odpocetZalohy'
AND idTransakce NOT IN (
SELECT idTransakce
FROM transakce_tmp
WHERE puvodFaktury = 'FAKTURA VO CZ');
Thanks a million for any help
David
IN is very bad for performance with a subquery like this. Try using NOT EXISTS():
DELETE FROM polozkyTransakci
WHERE typPolozky = 'odpocetZalohy'
AND NOT EXISTS (SELECT 1
FROM transakce_tmp r
WHERE r.puvodFaktury = 'FAKTURA VO CZ'
AND r.idTransakce = polozkyTransakci.idTransakce );
Before you can performance tune, you need to figure out why it is not deleting the correct rows.
So first start with doing selects until you get the right rows identified. Build your select a bit at time checking the results at each stage to see if you are getting the results you want.
Once you have the select then you can convert it to a delete. When testing the delete, do it in a transaction and run some tests on the data that is left behind to ensure it deleted properly before rolling back or committing. Since you likely want to performance tune, I would suggest rolling back, so that you can then try again on the performance-tuned version to ensure you got the same results. Of course, you only want to do this on a dev server!
Now while I agree that not exists may be faster, some of the other things you want to look at are:
Do you have cascade deletes happening? If you end up deleting many child records, that could be part of the problem.
Are there triggers affecting the delete? Especially look to see if someone set one up to run through things row by row instead of as a set. Row-by-row triggers are a very bad thing when you delete many records. For instance, suppose you are deleting 50K records and you have a delete trigger to an audit table. If it inserts to that table one record at a time, it is executed 50K times. If it inserts all the deleted records in one step, that insert individually might take a bit longer, but the total execution is much shorter.
What indexing do you have and is it helping the delete out?
You will want to examine the explain plan for each of your queries to see the details of how the query will be performed and whether your changes are improving it.
Performance tuning is a complex thing and it is best to get read up on it in detail by reading some of the performance tuning books available for your specific database.
I might be inclined to write the query as a LEFT JOIN, although I'm guessing this would have the same performance plan as NOT EXISTS:
DELETE pt
FROM polozkyTransakci pt LEFT JOIN
transakce_tmp tt
ON pt.idTransakce = tt.idTransakce AND
tt.puvodFaktury = 'FAKTURA VO CZ'
WHERE pt.typPolozky = 'odpocetZalohy' AND tt.idTransakce IS NULL;
I would recommend indexes, if you don't have them: polozkyTransakci(typPolozky, idTransakce) and transakce_tmp(idTransakce, puvodFaktury). These would work on the NOT EXISTS version as well.
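For example (index names are illustrative, and this assumes typPolozky and puvodFaktury are reasonably short VARCHAR columns):
CREATE INDEX idx_polozky_typ_trans ON polozkyTransakci (typPolozky, idTransakce);
CREATE INDEX idx_tmp_trans_puvod ON transakce_tmp (idTransakce, puvodFaktury);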
You can test the performance of these queries using SELECT:
SELECT pt.*
FROM polozkyTransakci pt LEFT JOIN
transakce_tmp tt
ON pt.idTransakce = tt.idTransakce AND
tt.puvodFaktury = 'FAKTURA VO CZ'
WHERE pt.typPolozky = 'odpocetZalohy' AND tt.idTransakce IS NULL;
The DELETE should be slower (due to the cost of logging transactions), but the performance should be comparable.

Can I optimize such a MySQL query without using an index?

A simplified version of my MySQL db looks like this:
Table books (ENGINE=MyISAM)
id <- KEY
publisher <- LONGTEXT
publisher_id <- INT <- This is a new field that is currently null for all records
Table publishers (ENGINE=MyISAM)
id <- KEY
name <- LONGTEXT
Currently books.publisher holds values that keep getting repeated, whereas publishers.name holds each value only once.
I want to get rid of books.publisher and instead populate the books.publisher_id field.
The straightforward SQL code that describes what I want done, is as follows:
UPDATE books
JOIN publishers ON books.publisher = publishers.name
SET books.publisher_id = publishers.id;
The problem is that I have a big number of records, and even though it works, it's taking forever.
Is there a faster solution than using something like this in advance?:
CREATE INDEX publisher ON books (publisher(20));
Your question title says ".. optimize ... query without using an index?"
What have you got against using an index?
You should always examine the execution plan if a query is running slowly. I would guess it's having to scan the publishers table for each row in order to find a match. It would make sense to have an index on publishers.name to speed the lookup of an id.
You can drop the index later but it wouldn't harm to leave it in, since you say the process will have to run for a while until other changes are made. I imagine the publishers table doesn't get updated very frequently, so performance of INSERT and UPDATE on the table should not be an issue.
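Since publishers.name is a LONGTEXT, it can only be indexed with a prefix; the 100-character prefix here is an assumption you would tune to your data:
CREATE INDEX idx_publishers_name ON publishers (name(100));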
There are a few problems here that might be helped by optimization.
First of all, a few thousand rows doesn't count as "big" ... that's "medium."
Second, in MySQL saying "I want to do this without indexes" is like saying "I want to drive my car to New York City, but my tires are flat and I don't want to pump them up. What's the best route to New York if I'm driving on my rims?"
Third, you're using a LONGTEXT item for your publisher. Is there some reason not to use a fully indexable datatype like VARCHAR(200)? If you do that your WHERE statement will run faster, index or none. Large scale library catalog systems limit the length of the publisher field, so your system can too.
Fourth, from one of your comments this looks like a routine data maintenance update, not a one time conversion. So you need to figure out how to avoid repeating the whole deal over and over. I am guessing here, but it looks like newly inserted rows in your books table have a publisher_id of zero, and your query updates that column to a valid value.
So here's what to do. First, put an index on books.publisher_id.
Second, run this variant of your maintenance query:
UPDATE books
JOIN publishers ON books.publisher = publishers.name
SET books.publisher_id = publishers.id
WHERE books.publisher_id = 0
  AND books.id IN (
    -- MySQL doesn't allow LIMIT directly in a multi-table UPDATE, so batch via a derived table
    SELECT id FROM (SELECT id FROM books WHERE publisher_id = 0 LIMIT 100) AS batch
  );
This will limit your update to rows that haven't yet been updated, and it will update only 100 rows at a time. In your weekly data-maintenance job, re-issue this query until MySQL reports that the query affected zero rows (look at mysqli::$affected_rows or the equivalent in your php-to-mysql interface). That's a great way to monitor database update progress and keep your update operations from getting out of hand.
Your update query has invalid syntax but you can fix that later. The way to get it to run faster is to add a where clause so that you are only updating the necessary records.

How to retrieve the new rows of a table every minute

I have a table, to which rows are only appended (not updated or deleted) with transactions (I'll explain why this is important), and I need to fetch the new, previously unfetched, rows of this table, every minute with a cron.
How am I going to do this? In any programming language (I use Perl but that's irrelevant.)
I list the ways I thought of how to solve this problem, and ask you to show me the correct one (there HAS to be one...)
The first way that popped into my head was to save (in a file) the largest auto_incrementing id of the rows fetched, so in the next minute I can fetch with: WHERE id > $last_id. But that can miss rows. Because new rows are inserted in transactions, it's possible that the transaction that saves the row with id = 5 commits before the transaction that saves the row with id = 4. It's therefore possible that the cron script retrieves row 5 but not row 4, and when row 4 gets committed one split second later, it will never get fetched (because 4 is not > than 5, which is the $last_id).
Then I thought I could make the cron job fetch all rows that have a date field in the last TWO minutes, check which of these rows have been retrieved again in the previous run of the cron job (to do this I would need to save somewhere which row ids were retrieved), compare, and process only the new ones. Unfortunately this is complicated, and also doesn't solve the problem that will occur if a certain inserting transaction takes TWO AND A HALF minutes to commit for some weird database reason, which will cause the date to be too old for the next iteration of the cron job to fetch.
Then I thought of installing a message queue (MQ) like RabbitMQ or any other. The same process that does the inserting transaction, would notify RabbitMQ of the new row, and RabbitMQ would then notify an always-running process that processes new rows. So instead of getting a batch of rows inserted in the last minute, that process would get the new rows one-by-one as they are written. This sounds good, but has too many points of failure - RabbitMQ might be down for a second (in a restart for example) and in that case the insert transaction will have committed without the receiving process having ever received the new row. So the new row will be missed. Not good.
I just thought of one more solution: the receiving processes (there are 30 of them, doing the exact same job on exactly the same data, so the same rows get processed 30 times, once by each receiving process) could write in another table that they have processed row X when they process it; then, when the time comes, they can ask for all rows in the main table that don't exist in the "have_processed" table with an OUTER JOIN query. But I believe (correct me if I'm wrong) that such a query will consume a lot of CPU and HD on the DB server, since it will have to compare the entire list of ids of the two tables to find new entries (and the table is huge and getting bigger each minute). It would have been fast if there were only one receiving process - then I would have been able to add an indexed field named "have_read" in the main table that would make looking for new rows extremely fast and easy on the DB server.
What is the right way to do it? What do you suggest? The question is simple, but a solution seems hard (for me) to find.
Thank you.
I believe the 'best' way to do this would be to use one process that checks for new rows and delegates them to the thirty consumer processes. Then your problem becomes simpler to manage from a database perspective and a delegating process is not that difficult to write.
If you are stuck with communicating to the thirty consumer processes through the database, the best option I could come up with is to create a trigger on the table, which copies each row to a secondary table. Copy each row to the secondary table thirty times (once for each consumer process). Add a column to this secondary table indicating the 'target' consumer process (for example a number from 1 to 30). Each consumer process checks for new rows with its unique number and then deletes those. If you are worried that some rows are deleted before they are processed (because the consumer crashes in the middle of processing), you can fetch, process and delete them one by one.
Since the secondary table is kept small by continuously deleting processed rows, INSERTs, SELECTs and DELETEs would be very fast. All operations on this secondary table would also be indexed by the primary key (if you place the consumer ID as first field of the primary key).
In MySQL statements, this would look like this:
CREATE TABLE `consumer`(
`id` INTEGER NOT NULL,
PRIMARY KEY (`id`)
);
INSERT INTO `consumer`(`id`) VALUES
(1),
(2),
(3)
-- all the way to 30
;
CREATE TABLE `secondaryTable` LIKE `primaryTable`;
ALTER TABLE `secondaryTable` ADD COLUMN `targetConsumerId` INTEGER NOT NULL FIRST;
-- alter the secondary table further to allow several rows with the same primary key (by adding targetConsumerId to the primary key)
DELIMITER //
CREATE TRIGGER `mark_to_process` AFTER INSERT ON `primaryTable`
FOR EACH ROW
BEGIN
-- by doing a cross join with the consumer table, this automatically inserts the correct amount of rows and adding or deleting consumers is just a matter of adding or deleting rows in the consumer table
INSERT INTO `secondaryTable` (`targetConsumerId`, `primaryTableId`, `primaryTableField1`, `primaryTableField2`)
SELECT `consumer`.`id`, `primaryTable`.`id`, `primaryTable`.`field1`, `primaryTable`.`field2`
FROM `consumer`, `primaryTable`
WHERE `primaryTable`.`id` = NEW.`id`;
END//
DELIMITER ;
-- loop over the following statements in each consumer until the SELECT doesn't return any more rows
START TRANSACTION;
SELECT * FROM secondaryTable WHERE targetConsumerId = MY_UNIQUE_CONSUMER_ID LIMIT 1;
-- here, do the processing (so before the COMMIT so that crashes won't let you miss rows)
DELETE FROM secondaryTable WHERE targetConsumerId = MY_UNIQUE_CONSUMER_ID AND primaryTableId = PRIMARY_TABLE_ID_OF_ROW_JUST_SELECTED;
COMMIT;
I've been thinking about this for a while. So, let me see if I got it right. You have a HUGE table to which N processes write (let's call them producers), where N may vary over time. Then there are M other processes (M may also vary over time) that need to process each of those added records at least once (let's call them consumers).
The main issues detected are:
Making sure the solution will work with dynamic N and M
It is needed to keep track of the unprocessed records for each consumer
The solution has to scale as much as possible due to the huge amount of records
In order to tackle those issues I thought of this. Create this table, with both columns as the primary key:
PENDING_RECORDS(ConsumerID, HugeTableID)
Modify the producers so that each time they add a record to the HUGE_TABLE they also add M records to the PENDING_RECORDS table, one for each ConsumerID that exists at that time. Each time a consumer runs, it will query the PENDING_RECORDS table and find a small number of matches for itself. It will then join against the HUGE_TABLE (note it will be an inner join, not a left join) and fetch the actual data it needs to process. Once the data is processed, the consumer will delete the records it fetched from the PENDING_RECORDS table, keeping it decently small.
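A sketch of what that could look like (table and column names follow the description above; HUGE_TABLE.id and the consumer id 7 are placeholders):
CREATE TABLE PENDING_RECORDS (
  ConsumerID  INT NOT NULL,
  HugeTableID BIGINT NOT NULL,
  PRIMARY KEY (ConsumerID, HugeTableID)
);
-- one consumer's cycle
SELECT h.*
FROM PENDING_RECORDS p
JOIN HUGE_TABLE h ON h.id = p.HugeTableID
WHERE p.ConsumerID = 7;
-- after processing, clear that consumer's pending entries
DELETE FROM PENDING_RECORDS WHERE ConsumerID = 7 AND HugeTableID = 12345;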
Interesting, I must say :)
1) First of all - is it possible to add a field to the table that has rows only added (let's call it 'transactional_table')? I mean, is it a design paradigm and you have a reason not to do any sort of updates on this table, or is it "structurally" blocked (i.e. the user connecting to the db has no privileges to perform updates on this table)?
Because then the simplest way to do it is to add a "have_read" column to this table with a default of 0, and update this column to 1 on fetched rows (even if 30 processes do this simultaneously, you should be fine as it would be very fast and it won't corrupt your data). Even if 30 processes mark the same 1000 rows as fetched - nothing is corrupted. Although if you do not operate on InnoDB, this might not be the best way as far as performance is concerned (MyISAM locks whole tables on updates, InnoDB only the rows that are updated).
2) If this is not something you can use - I would surely check out the solution you gave as your last one, with a little modification. Create a table (let's say: fetched_ids), and save fetched rows' ids in that table. Then you could use something like:
SELECT tt.* FROM transactional_table tt
LEFT JOIN fetched_ids fi ON tt.id = fi.row_id
WHERE fi.row_id IS NULL
This will return the rows from your transactional table that have not been recorded as already fetched. As long as both tt.id and fi.row_id have (ideally unique) indexes, this should work just fine even on large sets of data. MySQL handles JOINs on indexed fields pretty well. Don't be afraid to try it out - create a new table, copy ids to it, delete some of them and run your query. You'll see the results and you'll know if they are satisfactory :)
P.S. Of course, adding rows to this 'fetched_ids' table should be done carefully so as not to create unnecessary duplicates (30 simultaneous processes could write 30 times the data you need - and if you need performance, you should watch out for this case).
How about a second table with a structure like this:
source_fk - this would hold an ID of the data rows you want to read.
process_id - This would be a unique id for one of the 30 processes.
then do a LEFT JOIN and exclude items from your source that have entries matching the specified process_id.
Once you get your results, just go back and add the source_fk and process_id for each result you get.
One plus about this is you can add more processes later on with no problem.
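A rough sketch under those assumptions (the source table, its id column, and process id 7 are placeholders):
CREATE TABLE processed_rows (
  source_fk  BIGINT NOT NULL,
  process_id INT NOT NULL,
  PRIMARY KEY (process_id, source_fk)
);
SELECT s.*
FROM source s
LEFT JOIN processed_rows pr
  ON pr.source_fk = s.id AND pr.process_id = 7
WHERE pr.source_fk IS NULL;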
I would try adding a timestamp column and using it as a reference when retrieving new rows.

Where to store users visited pages?

I have a project where, for example, I have posts.
The task is this: I must show each user his last visited posts.
This is my solution: every time a user visits a topic that is new for him, I create a new record in the visits table.
The visits table has the following structure: id, user_id, post_id, last_visit.
Now my visits table has ~14,000,000 records and it's still growing every day.
Maybe my solution isn't optimal and there is another way to store user visits?
It's important to save every visit as a standalone record, because I also have a feature to select and use user visits. And I can't purge this table, because the data could be needed a month or a year later. How could I optimize this situation?
Nope, you don't really have much choice other than to store your visit data in a table with columns for (at a bare minimum) user id, post id, and timestamp if you need to track the last time that each user visited each post.
I question whether you need an id field in that table, rather than using a composite key on (user_id, post_id), but I'd expect that to have a minor effect, provided that you already have a unique index on (user_id, post_id). (If you don't have an index on that pair of fields, adding one should improve query performance considerably and making it a unique index or composite key will protect against accidentally inserting duplicate records.)
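For example, something like this would enforce it (the key name is illustrative):
ALTER TABLE visits ADD UNIQUE KEY uq_visits_user_post (user_id, post_id);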
If performance is still an issue despite proper indexing, you should be able to improve it a bit by segmenting the table into a collection of smaller tables, but segment it by user_id or post_id (rather than by date as previous answers have suggested). If you break it up by user or post id, then you will still be able to determine whether a given user has previously viewed a given post and, if so, on what date with only a single query. If you segment it by date, then that information will be spread across all tables and, in the worst-case scenario of a user who has never previously viewed a post (which I expect to be fairly common), you'll need to separately query each and every table before having a definitive answer.
As for whether to segment it by user id or by post id, that depends on whether you will more often be looking for all posts viewed by a user (segment by user_id to get them all in one query) or all users who have viewed a post (segment by post_id).
If it doesn't need to be long-lasting, you could store it in the session instead. If it does, you could either break the records apart by table, say one per month, or you could store only the last 5-10 pages visited and delete old ones as new ones come in. You could also change it to pages visited today, this week, etc.
If you do need all 14 million records, I would create another historical table to archive the visits that are not the most relevant for the day-to-day site operation.
At the end of the month (or week, or quarter, etc...) have some scheduled logic to archive records beyond a certain cutoff point to the historical table and reduce the number of records in the "live" table. This should help increase the query speed on the "live" table since you would have less records in it.
If you do need to query all of the data, you can use both tables and have all of the data available to you.
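Something along these lines, where visits_archive (created beforehand with CREATE TABLE visits_archive LIKE visits) and the six-month cutoff are assumptions:
-- run from a scheduled job
SET @cutoff = NOW() - INTERVAL 6 MONTH;
INSERT INTO visits_archive SELECT * FROM visits WHERE last_visit < @cutoff;
DELETE FROM visits WHERE last_visit < @cutoff;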
you could delete the ones you don't need - if you only want to show the last 10 visited posts then
DELETE FROM visits
WHERE user_id = ?
  -- the derived table works around MySQL's "can't specify target table for update in FROM clause" restriction
  AND id NOT IN (SELECT id FROM (SELECT id FROM visits WHERE user_id = ? ORDER BY last_visit DESC LIMIT 10) AS keep_rows);
(I think that's the best way to do that query - any MySQL guru can tell me otherwise? You can ORDER BY in a DELETE, but the LIMIT there only takes one parameter, so you can't do LIMIT 10, 100.)
after inserting/updating each new row, or every few days if you like
Having a structure like (id, user_id, post_id, last_visit) for your visits table makes it appear as though you are saving all posts, not just the last post per topic. Don't you need a topic ID in there somewhere so that you can determine what their last post PER TOPIC was, and so you know which row to replace when they post in the same topic more than once?
Store the post_ids in $_SESSION and then, using MySQL's IN with one SELECT query, you will be able to show his visited posts. All those ids will be destroyed after the member closes his browser, but anyway, this is much faster and more optimal than using the database.
Edit: sorry, I didn't notice that you must store those records in the database and use them months later. Then I have no idea how to optimize it, but with 14 million records you should definitely use indexes.