Optimising MySQL queries with heavy joins

I currently run a site which tracks up-to-the-minute scores and ratings in a list. The list has thousands of entries that are updated frequently, and the list should be sortable by these score and ratings columns.
My SQL for getting this data currently looks like (roughly):
SELECT e.*, SUM(sa.amount) AS score, AVG(ra.rating) AS rating
FROM entries e
LEFT JOIN score_adjustments sa
       ON sa.entry_id = e.id
      AND sa.created BETWEEN ... AND ...
LEFT JOIN rating_adjustments ra
       ON ra.entry_id = e.id
      AND ra.rating > 0
GROUP BY e.id
ORDER BY score
LIMIT 0, 10
Where the tables are (simplified):
entries:
id: INT(11) PRIMARY
...other data...
score_adjustments:
id: INT(11), PRIMARY
entry_id: INT(11), INDEX, FOREIGN KEY (entries.id)
created: DATETIME
amount: INT(4)
rating_adjustments:
id: INT(11), PRIMARY
entry_id: INT(11), INDEX, FOREIGN KEY (entries.id)
rating: DOUBLE
There are approximately 300,000 score_adjustments rows and they grow by about 5,000 a day. The rating_adjustments table is about a quarter of that size.
Now, I'm no DBA expert but I'm guessing calling SUM() and AVG() all the time isn't a good thing - especially when sa and ra contain hundreds of thousands of records - right?
I already do caching on the query, but I want the query itself to be fast - yet still as up to date as possible. I was wondering if anyone could share any solutions to optimise heavy join/aggregation queries like this? I'm willing to make structural changes if necessary.
EDIT 1
Added more info about the query.

Your data is badly clustered.
InnoDB stores rows with "close" PKs physically close together. Since your child tables use surrogate PKs, their rows are, in effect, stored in random order relative to the parent. When the time comes to make calculations for a given row in the "master" table, the DBMS must jump all over the place to gather the related rows from the child tables.
Instead of surrogate keys, try using more "natural" keys, with the parent's PK in the leading edge, similar to this:
score_adjustments:
entry_id: INT(11), FOREIGN KEY (entries.id)
created: DATETIME
amount: INT(4)
PRIMARY KEY (entry_id, created)
rating_adjustments:
entry_id: INT(11), FOREIGN KEY (entries.id)
rating_no: INT(11)
rating: DOUBLE
PRIMARY KEY (entry_id, rating_no)
NOTE: This assumes created's resolution is fine enough and the rating_no was added to allow multiple ratings per entry_id. This is just an example - you may vary the PKs according to your needs.
This will "force" rows belonging to the same entry_id to be stored physically close together, so a SUM or AVG can be calculated by just a range scan on the PK/clustering key and with very few I/Os.
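For concreteness, here is roughly what those clustered child tables could look like as InnoDB DDL. This is just a sketch carrying over the example keys above; the exact types and constraint names are illustrative:
CREATE TABLE score_adjustments (
    entry_id INT NOT NULL,
    created  DATETIME NOT NULL,
    amount   INT NOT NULL,
    PRIMARY KEY (entry_id, created),   -- clustering key: one entry's rows stay together
    CONSTRAINT fk_sa_entry FOREIGN KEY (entry_id) REFERENCES entries (id)
) ENGINE=InnoDB;
CREATE TABLE rating_adjustments (
    entry_id  INT NOT NULL,
    rating_no INT NOT NULL,            -- allows multiple ratings per entry
    rating    DOUBLE NOT NULL,
    PRIMARY KEY (entry_id, rating_no),
    CONSTRAINT fk_ra_entry FOREIGN KEY (entry_id) REFERENCES entries (id)
) ENGINE=InnoDB;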
Alternatively (e.g. if you are using MyISAM, which doesn't support clustering), cover the query with indexes so the child tables' rows are not touched during querying at all.
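If you take the covering-index route, indexes along these lines would let the aggregates be answered from the index alone (the index names are just illustrative):
-- Covers SUM(amount) filtered by entry_id and created, without reading table rows
ALTER TABLE score_adjustments
    ADD INDEX idx_sa_cover (entry_id, created, amount);
-- Covers AVG(rating) filtered by entry_id, without reading table rows
ALTER TABLE rating_adjustments
    ADD INDEX idx_ra_cover (entry_id, rating);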
On top of that, you could denormalize your design, and cache the current results in the parent table:
Store SUM(score_adjustments.amount) as a physical field and adjust it via triggers every time a row is inserted, updated or deleted from score_adjustments.
Store SUM(rating_adjustments.rating) as "S" and COUNT(rating_adjustments.rating) as "C". When a row is added to rating_adjustments, add it to S and increment C. Calculate S/C at run-time to get the average. Handle updates and deletes similarly.
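A minimal sketch of that trigger-maintained cache, assuming three hypothetical columns (score_cache, rating_sum, rating_count) are added to entries; deletes and updates would need matching triggers that subtract or adjust:
-- Hypothetical cached columns on the parent table
ALTER TABLE entries
    ADD COLUMN score_cache  INT    NOT NULL DEFAULT 0,
    ADD COLUMN rating_sum   DOUBLE NOT NULL DEFAULT 0,
    ADD COLUMN rating_count INT    NOT NULL DEFAULT 0;
CREATE TRIGGER score_adjustments_ai
AFTER INSERT ON score_adjustments
FOR EACH ROW
UPDATE entries
   SET score_cache = score_cache + NEW.amount
 WHERE id = NEW.entry_id;
CREATE TRIGGER rating_adjustments_ai
AFTER INSERT ON rating_adjustments
FOR EACH ROW
UPDATE entries
   SET rating_sum   = rating_sum + NEW.rating,
       rating_count = rating_count + 1
 WHERE id = NEW.entry_id;
-- Reading then becomes a single-table query, no joins or aggregation:
SELECT id, score_cache AS score,
       rating_sum / NULLIF(rating_count, 0) AS rating
FROM entries
ORDER BY score_cache DESC
LIMIT 10;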

If you're worried about performance, you could add score and rating columns to the entries table and update them with triggers whenever the referenced tables are inserted into or updated. This caches the results every time they change, so you don't have to recalculate them on every read, significantly reducing the amount of joining needed to get the results. Just guessing, but in most cases the results of your query are probably fetched much more often than they are updated.
Check out this SQL Fiddle (http://sqlfiddle.com/#!2/b7101/1) to see how to create the triggers and their effect. I only added triggers on insert; you can add update triggers just as easily, and if you ever delete data, add triggers for delete as well.
I didn't add the datetime field; if the BETWEEN ... AND ... parameters change often, you might still have to handle that part manually every time, otherwise you can just add the BETWEEN clause to the score_update trigger.
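For example, an update trigger could adjust the cached value by the difference rather than recomputing it. This is only a sketch: score_cache is the assumed cached column from the earlier example (not something taken from the fiddle), and it assumes entry_id itself is never changed by the update:
CREATE TRIGGER score_adjustments_au
AFTER UPDATE ON score_adjustments
FOR EACH ROW
UPDATE entries
   SET score_cache = score_cache + (NEW.amount - OLD.amount)
 WHERE id = NEW.entry_id;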

Related

Is one table with partitions better for response time than multiple tables, for use in a website where operations (insert, update, delete) are performed frequently?

I have a MySQL database hosted with a web hosting company (Hostinger); this database is used by a mobile app through PHP APIs.
There are many tables.
I will show the important tables, with only the important columns, to make this easier to understand:
user(id, username, password, balance, state);
cardsTrans(id, user_id, number, password, price, state);
customersTrans(id, user_id, location, state);
posTrans(id, user_id, number, state);
I thought about creating one table instead of these three transaction tables, which would look like:
allTransaction(id, user_id, target_id, type, card_number, card_pass, location);
I know that there is redundancy and some columns will be null, and I could normalize this table, but normalization would produce many joins when querying the data, and I am interested in response time.
To explain the main idea: a user can perform three types of transactions (each type is in a different table). These transactions would be stored in the allTransaction table, with user_id as a foreign key to the users table and target_id as a foreign key to another table, determined by the type.
The other columns also depend on the type and may be set to null.
What I want is to determine which is better for response time and performance when users are using the app. The DML operations (insert, update, delete) are applied frequently to these tables, and there are also very many queries, usually by user_id and target_id.
If I use one table, it will have a very large number of rows and many null values in each row, so it will slow the queries and take large storage.
If the table has an index, the index will slow the insert or update operations.
Would creating a partition per user on the table, without indexes, be better for response time for any operation (select, insert, update, or delete), or would creating multiple tables (one table per user) be better? The expected number of users is between 500 and 5000.
I searched and found this similar question MySQL performance: multiple tables vs. index on single table and partitions
But it isn't quite the same context, since I am interested in response time first and then overall performance, and my database is hosted on a hosting server rather than on the same device as the mobile app.
Can anyone tell me which is better and why?
As a general rule:
Worst: Multiple tables
Better: Builtin PARTITIONing
Best: Neither, just better indexing.
If you want to talk specifically about your case, please provide SHOW CREATE TABLE and the main SELECTs, DELETEs, etc.
It is possible to "over-normalize".
three types of transactions (each type is in a different table)
That can be tricky. It may be better to have one table for transactions.
"Response time" -- Are you expecting hundreds of writes per second?
take large storage.
Usually proper indexing (especially with 'composite' indexes) makes table size not a performance issue.
partition per user on the table
That is no faster than having an index starting with user_id.
If the table has an index, the index will slow the insert or update operations.
The burden on writes is much less than the benefit on reads. Do not avoid indexes for that reason.
(I can be less vague if you provide tentative CREATE TABLEs and SQL statements.)
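To make the composite-index advice above concrete: assuming the single allTransaction design and the usual lookups by user_id and target_id, the first index to try would look something like this (the index name is illustrative):
ALTER TABLE allTransaction
    ADD INDEX idx_user_target (user_id, target_id);
-- Serves WHERE user_id = ?  as well as  WHERE user_id = ? AND target_id = ?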
Instead of trying to predict the future, use the simplest schema that will work for now and be prepared to change it when you learn more by actual use. This means avoid scattering assumptions about the schema around the code. Look into the concept of Schema Migrations to safely change your schema and the Repository Pattern to hide the details of how things are stored. 5000 users is not a lot (unless they will all be using the system at the same time).
For now, go with the design that provides the strongest referential integrity. That means as many NOT NULL columns as possible. While you're developing the product, you're going to be introducing bugs which might accidentally insert nulls where they should insert a value. Referential integrity provides another layer of protection.
For example, if you have a single AllTransactions table which might have some fields filled in and some not, depending on the type of transaction, your schema has to make all those columns nullable. The schema cannot protect you from accidentally inserting a null value.
But if you have individual CardTransactions, CustomerTransactions, and PosTransactions tables their schemas can be constrained to ensure all the necessary fields are always filled in. This will catch many different sorts of bugs.
A variation on this is to have a single UserTransaction table which stores all the generic information about a user transaction (user_id, timestamp) and then join tables for each type of transaction. Here's a sketch.
user_transactions
id bigint primary key auto_increment
user_id integer not null references users on delete cascade
-- Fields common to every transaction below
state enum(...) not null
price numeric not null
created_at timestamp not null default current_timestamp()
card_transactions
user_transaction_id bigint not null references user_transactions on delete cascade
card_id integer not null references cards on delete cascade
..any other fields for card transactions...
pos_transactions
user_transaction_id bigint not null references user_transactions on delete cascade
pos_id integer not null references pos on delete cascade
..any other fields for POS transactions...
This provides full referential integrity. You can't make a card transaction without a card. You can't make a POS transaction without a POS. Any fields required by a card transaction can be set not null. Any fields required by a POS transaction can be set not null.
Getting all transactions for a user is a simple indexed query.
select *
from user_transactions
where user_id = ?
And if you only want one type, join through that type's table; it's also a simple indexed query.
select *
from card_transactions ct
join user_transactions ut on ut.id = ct.user_transaction_id
where ut.user_id = ?
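For these to stay simple indexed queries, the filter and join columns need indexes; a minimal sketch (InnoDB will also create an index to back each foreign key if one doesn't already exist, so part of this may be redundant):
CREATE INDEX idx_user_transactions_user_id
    ON user_transactions (user_id);
CREATE INDEX idx_card_transactions_ut_id
    ON card_transactions (user_transaction_id);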

Updating single table frequently vs using another table and CRON to import changes into main table in MySQL?

I have a table with login logs which is an EXTREMELY busy and large InnoDB table. New rows are inserted all the time, the table is queried by other parts of the system, and it is by far the busiest table in the DB. In this table there is logid, which is the PRIMARY KEY and is generated by software as a random hash (not an auto-increment ID). I also want to store some data like the number of items viewed.
create table loginlogs
(
logid bigint unsigned primary key,
some_data varchar(255),
viewed_items bigint unsigned
)
viewed_items is a value that will get updated for multiple rows very often (assume thousands of updates / second). The dilemma I am facing now is:
Should I
UPDATE loginlogs SET viewed_items = XXXX WHERE logid = YYYYY
or should I create
create table loginlogs_viewed_items
(
logid bigint unsigned primary key,
viewed_items bigint unsigned,
exported tinyint unsigned default 0
)
and then execute with CRON
UPDATE loginlogs_viewed_items t
INNER JOIN loginlogs l ON l.logid = t.logid
SET
t.exported = 1,
l.viewed_items = t.viewed_items
WHERE
t.exported = 0;
e.g. every hour?
Note that either way the viewed_items counter will be updated MANY TIMES for one logid; it can be as much as 100 / hour / logid, and there are tons of rows. So whichever table I choose for this, either the main one or the separate one, it will be getting updated quite frequently.
I want to avoid unnecessary locking of loginlogs table and at the same time I do not want to degrade performance by duplicating data in another table.
Hmm, I wonder why you'd want to change log entries and not just add new ones...
But anyway, as you said either way the updates have to happen, whether individually or in bulk.
If you have less busy time windows, then updating in bulk might have an advantage. Otherwise, the bulk update may have a more significant impact while it runs, in contrast to individual updates that "interleave" more with the other operations, making the impact less noticeable.
If the column you need to update is not needed all the time, you could think of having a separate table just for this column. That way queries that just need the other columns may be less affected by the updates.
"Tons of rows" -- To some people, that is "millions". To others, even "billions" is not really big. Please provide some numbers; the answer can be different. Meanwhile, here are some general principles.
I will assume the table is ENGINE=InnoDB.
UPDATEing one row at a time is 10 times as costly as updating 100 rows at a time.
UPDATEing more than 1000 rows in a single statement is problematic. It will lock each row, potentially leading to delays in other statements and maybe even deadlocks.
Having a 'random' PRIMARY KEY (as opposed to AUTO_INCREMENT or something roughly chronologically ordered) is very costly when the table is bigger than the buffer_pool. How much RAM do you have?
"the table is queried by other parts of the system" -- by the random PK? One row at a time? How frequently?
Please elaborate on how exported works. For example, does it get reset to 0 by something else?
Is there a single client doing all the work? Or are there multiple servers throwing data and queries at the table? (Different techniques are needed.)
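If the separate loginlogs_viewed_items table is used, one way to respect the 100-to-1000-rows-per-statement guideline above is to have the cron job work in small batches. A rough sketch, assuming a hypothetical extra exported value (2) used as an "in progress" marker; the script would repeat this until no rows are claimed:
START TRANSACTION;
-- Claim a small batch (single-table UPDATE ... LIMIT is allowed)
UPDATE loginlogs_viewed_items
   SET exported = 2
 WHERE exported = 0
 LIMIT 500;
-- Copy the claimed batch into the main table and mark it done
UPDATE loginlogs l
  JOIN loginlogs_viewed_items t ON t.logid = l.logid
   SET l.viewed_items = t.viewed_items,
       t.exported     = 1
 WHERE t.exported = 2;
COMMIT;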

One to Many Database

I have created a database with a one-to-many relationship.
The parent table, say Master, has 2 columns, NodeId and NodeName; NodeId is the primary key and is of type int, the rest are of type varchar.
The child table, say Student, has 5 columns: NodeId, B, M, F, T; NodeId is the foreign key here.
None of the columns B, M, F, T are unique and they can have null values, hence none of these columns has been defined as a primary key.
Assume the Student table has more than 2,000,000 rows.
My fetch query is
SELECT * FROM STUDENT WHERE NODEID = 1 AND B='1-123'
I would like to improve the speed of fetching. Any suggestion regarding improvement of the DB structure or an alternative fetch query would be really helpful, and any suggestion that can improve overall efficiency is most welcome.
Since a foreign key is not necessarily indexed by default, adding indexes on NodeId and B in Student would probably improve query performance, if insert performance is not as big of an issue.
Update:
An index is essentially a way to keep your data sorted so as to speed up searches/queries. It should be good enough to just think of it as an ordered list.
An index is quite transparent, so your query would remain exactly the same.
A simple (non-unique) index does allow rows with the same values in the indexed fields, so it should be fine.
It is worth mentioning that a primary key is indexed by default; however, a PK does not allow duplicate data.
Also, since the index keeps an ordering of your data, insertion time will increase; however, if your dataset is big, query time should become much faster.
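A minimal sketch of the suggested index, using the table and column names from the question; a single composite index on (NodeId, B) directly serves WHERE NODEID = 1 AND B = '1-123':
ALTER TABLE Student ADD INDEX idx_nodeid_b (NodeId, B);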

How to let data "disappear" from database? MySQL

I've got a bit of a stupid question. The thing is, my program has to have a function to delete data from my database. Yay, that's not really the problem. But how can I delete data without the risk that others can see that something has been deleted?
User Table:
U_ID U_NAME
1 Chris
2 Peter
OTHER TABLE
ID TIMESTAMP FK_U_ID
1 2012-12-01 1
2 2012-12-02 1
Sooooo, the IDs are AUTO_INCREMENT, so if I delete one of them there's a gap. Furthermore, each timestamp is bigger than the one in the row before, so they're ascending.
I want to let the data with ID 1 disappear from the user's profile (U_ID 1).
If I delete it, there is a gap. If I just change the FK_U_ID to 2 (Peter) it's obvious, because when I insert data, there are 20 or 30 data rows with the same U_ID...so it's obvious that there has been a modification.
If I set the FK_U_ID to NULL --> same problem as when I change it to another U_ID.
Is there any solution to make this work? I know that if nobody but me has access to the database, it's just not a problem. But just in case somebody inspects my program, it should not be obvious that there have been modifications.
So here we go.
For the ID gaps issue you can use GUIDs as @SLaks suggests, but then you can't use the native RDBMS auto_increment, which means you have to create the GUID and insert it along with the rest of the record data upon creation. Of course, you don't really need the ID to be globally unique; you could just store a random string of 20 characters or something, but then you have to do a DB read to see if that ID is taken and repeat (recursively) that process until you find an unused ID... could be quite taxing.
It's not at all clear why you would want to "hide" evidence that a delete was performed. That sounds like a really bad idea. I'm not a fan of promulgating misinformation.
Two of the characteristics of an ideal primary key are:
- anonymous (be void of any useful information, doesn't matter what it's set to)
- immutable (once assigned, it will never be changed.)
But, if we set that whole discussion aside...
I can answer a slightly different question (an answer you might find helpful to your particular situation)
The only way to eliminate a "gap" in the values in a column with an AUTO_INCREMENT would be to change the column values from their current values to a contiguous sequence of new values. If there are any foreign keys that reference that column, the values in those columns would need to be updated as well, to preserve the relationship. That will likely leave the current auto_increment value of the table higher than the largest value of the id column, so I'd want to reset that as well, to avoid a "gap" on the next insert.
(I have done re-sequencing of auto_increment values in development and test environments, to "cleanup" lookup tables, and to move the id values of some tables to ranges that are distinct from ranges in other tables... that lets me test SQL to make sure the join predicates aren't inadvertently referencing the wrong table and returning rows that look correct by accident... those are some reasons I've done reassignment of auto_increment values.)
Note that the database can "automagically" update foreign key values (for InnoDB tables) when you change the primary key value, as long as the foreign key constraint is defined with ON UPDATE CASCADE, and FOREIGN_KEY_CHECKS is not disabled.
If there are no foreign keys to deal with, and assuming that all of the current values of id are positive integers, then I've been able to do something like this: (with appropriate backups in place, so I can recover if things don't work right)
UPDATE mytable t
JOIN (
    SELECT s.id AS old_id
         , @i := @i + 1 AS new_id
      FROM mytable s
     CROSS
      JOIN (SELECT @i := 0) i
     ORDER BY s.id
) c
ON t.id = c.old_id
SET t.id = c.new_id
WHERE t.id <> c.new_id
To reset the table's AUTO_INCREMENT counter back down (InnoDB will clamp it to just above the largest id value in the table):
ALTER TABLE mytable AUTO_INCREMENT = 1;
Typically, I will create a table and populate it from that query in the inline view (aliased as c) above. I can then use that table to update both foreign key columns and the primary key column, first disabling the FOREIGN_KEY_CHECKS and then re-enabling it. (In a concurrent environment, where other processes might be inserting/updating/deleting rows from one of the tables, I would of course first obtain an exclusive lock on all of the tables to be updated.)
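A rough sketch of that mapping-table approach, following the same steps (the table and column names here, such as id_map and childtable.mytable_id, are assumptions for illustration; as noted above, do this only with backups and exclusive locks in place):
-- Build the old-id -> new-id map once
CREATE TABLE id_map AS
SELECT s.id AS old_id
     , @i := @i + 1 AS new_id
  FROM mytable s
 CROSS JOIN (SELECT @i := 0) init
 ORDER BY s.id;
SET FOREIGN_KEY_CHECKS = 0;
-- Re-point each referencing table (repeat per child table)
UPDATE childtable c
  JOIN id_map m ON m.old_id = c.mytable_id
   SET c.mytable_id = m.new_id;
-- Finally re-sequence the parent's primary key
UPDATE mytable t
  JOIN id_map m ON m.old_id = t.id
   SET t.id = m.new_id;
SET FOREIGN_KEY_CHECKS = 1;
DROP TABLE id_map;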
Taking up again, the discussion I set aside earlier... this type of "administrative" function can be useful in a test environment, when setting up test cases. But it is NOT a function that is ever performed in a production environment, with live data.

Can I optimize my database by splitting one big table into many small ones?

Assume that I have one big table with three columns: "user_name", "user_property", "value_of_property". Let's also assume that I have a lot of users (let's say 100,000) and a lot of properties (let's say 10,000). Then the table is going to be huge (1 billion rows).
When I extract information from the table I always need information about a particular user. So I use, for example, where user_name='Albert Gates'. So, every time the MySQL server needs to scan 1 billion rows to find those that contain "Albert Gates" as user_name.
Would it not be wise to split the big table into many small ones corresponding to fixed users?
No, I don't think that is a good idea. A better approach is to add an index on the user_name column - and perhaps another index on (user_name, user_property) for looking up a single property. Then the database does not need to scan all the rows - it just needs to find the appropriate entry in the index, which is stored in a B-tree, making it easy to find a record in a very small amount of time.
If your application is still slow even after correctly indexing it can sometimes be a good idea to partition your largest tables.
One other thing you could consider is normalizing your database so that the user_name is stored in a separate table, with an integer foreign key in its place. This can reduce storage requirements and can increase performance. The same may apply to user_property.
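As a concrete illustration of the indexing advice above (the table name user_data is an assumption, since the question doesn't name the table; note that the composite index also covers lookups on user_name alone):
ALTER TABLE user_data
    ADD INDEX idx_user_name (user_name),
    ADD INDEX idx_user_property (user_name, user_property);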
you should normalise your design as follows:
drop table if exists users;
create table users
(
user_id int unsigned not null auto_increment primary key,
username varbinary(32) unique not null
)
engine=innodb;
drop table if exists properties;
create table properties
(
property_id smallint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
drop table if exists user_property_values;
create table user_property_values
(
user_id int unsigned not null,
property_id smallint unsigned not null,
value varchar(255) not null,
primary key (user_id, property_id),
key (property_id)
)
engine=innodb;
insert into users (username) values ('f00'),('bar'),('alpha'),('beta');
insert into properties (name) values ('age'),('gender');
insert into user_property_values values
(1,1,'30'),(1,2,'Male'),
(2,1,'24'),(2,2,'Female'),
(3,1,'18'),
(4,1,'26'),(4,2,'Male');
From a performance perspective, the InnoDB clustered index works wonders in this similar example (cold run):
select count(*) from product
count(*)
========
1,000,000 (1M)
select count(*) from category
count(*)
========
250,000 (250K)
select count(*) from product_category
count(*)
========
125,431,192 (125M)
select
c.*,
p.*
from
product_category pc
inner join category c on pc.cat_id = c.cat_id
inner join product p on pc.prod_id = p.prod_id
where
pc.cat_id = 1001;
0:00:00.030: Query OK (0.03 secs)
Properly indexing your database will be the number 1 way of improving performance. I once had a query take half an hour (on a large dataset, but nonetheless). Then we came to find out that the tables had no indexes. Once indexed, the query took less than 10 seconds.
Why do you need to have this table structure? My fundamental problem is that you are going to have to cast the data in value_of_property every time you want to use it. That is bad in my opinion - also, storing numbers as text is crazy given that it's all binary anyway. For instance, how are you going to have required fields? Or fields that need to have constraints based on other fields, e.g. start and end dates?
Why not simply have the properties as fields rather than some many-to-many relationship?
Have one flat table. When your business rules begin to show that properties should be grouped, then you can consider moving them out into other tables and having several 1:0-1 relationships with the users table. But this is not normalization, and it will degrade performance slightly due to the extra join (however, the self-documenting nature of the table names will greatly aid any developers).
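A minimal sketch of the "properties as real columns" idea, with made-up column names purely for illustration (note that MySQL only enforces CHECK constraints from 8.0.16 onward):
CREATE TABLE users
(
    user_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    username   VARCHAR(64)  NOT NULL UNIQUE,
    age        TINYINT UNSIGNED NULL,           -- stored as a number, not text
    gender     ENUM('Male','Female') NULL,
    start_date DATE NULL,
    end_date   DATE NULL,
    CHECK (end_date IS NULL OR end_date >= start_date)   -- cross-field rule an EAV design cannot express
) ENGINE=InnoDB;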
One way I regularly see database performance get totally castrated is by having a generic
Id, Property Type, Property Name, Property Value table.
This is really lazy but exceptionally flexible, and it totally kills performance. In fact, on a new job where performance is bad, I actually ask if they have a table with this structure - it invariably becomes the center point of the database and is slow. The whole point of relational database design is that the relations are determined ahead of time. This is simply a technique that aims to speed up development at a huge cost to application speed. It also puts a huge reliance on business logic in the application layer to behave - which is not defensive at all. Eventually you find that you want to use properties in a key relationship, which leads to all kinds of casting on the join, which further degrades performance.
If data has a 1:1 relationship with an entity then it should be a field on the same table. If your table gets to more than 30 fields wide, then consider moving them into another table, but don't call it normalisation because it isn't. It is a technique to help developers group fields together at the cost of performance, in an attempt to aid understanding.
I don't know if MySQL has an equivalent, but SQL Server 2008 has sparse columns - null values take no space.
Sparse column datatypes
I'm not saying an EAV approach is always wrong, but I think using a relational database for this approach is probably not the best choice.