Scalable way of doing self join with many to many table

I have a table structure like the following:
user
    id
    name
profile_stat
    id
    name
profile_stat_value
    id
    name
user_profile
    user_id
    profile_stat_id
    profile_stat_value_id
My question is:
How do I evaluate a query where I want to find all users with profile_stat_id and profile_stat_value_id for many stats?
I've tried doing an inner self join, but that quickly gets crazy when searching for many stats. I've also tried doing a count on the actual user_profile table, and that's much better, but still slow.
Is there some magic I'm missing? I have about 10 million rows in the user_profile table and want the query to take no longer than a few seconds. Is that possible?

Databases can typically handle 10 million records without trouble. I have mostly used Oracle in a professional environment with large amounts of data (about 30-40 million rows), and even join queries on those tables have never taken more than a second or two to run.
One important lesson I learned: whenever query performance is bad, check whether indexes are defined properly on the join fields. Here, for example, profile_stat_id and profile_stat_value_id should be indexed (I am assuming user_id is the primary key). If you have not done that yet, it should give you a significant performance gain.
After defining the indexes, run the query once or twice before measuring the gain, so that the database has a chance to load the index into its cache and settle on a query plan.

Superficially, you seem to be asking for this, which includes no self-joins:
SELECT u.name, u.id, s.name, s.id, v.name, v.id
FROM User_Profile AS p
JOIN User AS u ON u.id = p.user_id
JOIN Profile_Stat AS s ON s.id = p.profile_stat_id
JOIN Profile_Stat_Value AS v ON v.id = p.profile_stat_value_id
Any of the joins listed can be changed to a LEFT OUTER JOIN if the corresponding table need not have a matching entry. All this does is join the central User_Profile table with each of the other three tables on the appropriate joining column.
Where do you think you need a self-join?
[I have not included anything to filter on 'the many stats'; it is not at all clear to me what that part of the question means.]
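If "many stats" means that a user must match every one of several (profile_stat_id, profile_stat_value_id) pairs, one common pattern is to filter user_profile down to the wanted pairs, group by user, and keep only the groups whose row count equals the number of pairs. The sketch below runs that idea in Python with the standard sqlite3 module standing in for MySQL. The sample rows, the index, and the helper name users_matching_all are invented for illustration, and the query assumes each user has at most one row per stat.

```python
import sqlite3

def users_matching_all(conn, wanted):
    """Return ids of users whose profile contains every (stat_id, value_id) pair in wanted."""
    if not wanted:
        return []
    # One "(stat = ? AND value = ?)" term per wanted pair, OR-ed together.
    pair_filter = " OR ".join(
        "(profile_stat_id = ? AND profile_stat_value_id = ?)" for _ in wanted
    )
    sql = (
        "SELECT user_id FROM user_profile "
        f"WHERE {pair_filter} "
        "GROUP BY user_id "
        "HAVING COUNT(*) = ? "  # the user matched every wanted pair
        "ORDER BY user_id"
    )
    params = [value for pair in wanted for value in pair] + [len(wanted)]
    return [row[0] for row in conn.execute(sql, params)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_profile ("
    " user_id INTEGER, profile_stat_id INTEGER, profile_stat_value_id INTEGER,"
    " PRIMARY KEY (user_id, profile_stat_id))"
)
# Assumed index: lets the filter and the grouping be answered from the index alone.
conn.execute(
    "CREATE INDEX idx_stat_value"
    " ON user_profile (profile_stat_id, profile_stat_value_id, user_id)"
)
conn.executemany(
    "INSERT INTO user_profile VALUES (?, ?, ?)",
    [(1, 10, 100), (1, 20, 200), (2, 10, 100), (2, 20, 201), (3, 10, 100)],
)

matches = users_matching_all(conn, [(10, 100), (20, 200)])
print(matches)
```

Only user 1 has both wanted pairs in the sample data. The same SELECT should carry over to MySQL unchanged, since it uses only standard GROUP BY and HAVING.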

Related

Mysql inner join vs in clause performance

I have a query to get data of friends of user. I have 3 tables, one is user table, second is a user_friend table which has user_id and friend_id (both are foreign key to user table) and 3rd table is feed table which has user_id and feed content. Feed can be shown to friends. I can query in two ways either by join or by using IN clause (I can get all the friends' ids by graph database which I am using for networking).
Here are two queries:
SELECT
a.*
FROM feed a
INNER JOIN user_friend b ON a.user_id = b.friend_id
WHERE b.user_id = 1;
In this query I get friend ids from graph database and will pass to this query:
SELECT
a.*
FROM feed a
WHERE a.user_id IN (2,3,4,5)
Which query runs faster and performs better when I have millions of records?

With suitable indexes, a one-query JOIN (Choice 1) will almost always run faster than a 2-query (Choice 2) algorithm.
To optimize Choice 1, b needs this composite index: INDEX(user_id, friend_id). Also, a needs an index (presumably the PRIMARY KEY?) starting with user_id.
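As a rough sketch of the two choices with the indexes described above, the following Python script uses the standard sqlite3 module in place of MySQL. The content column, the index names, and the sample rows are assumptions made for illustration; the point is only that both queries return the same feed rows once the friend ids are known.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_friend (user_id INTEGER, friend_id INTEGER);
CREATE TABLE feed (id INTEGER PRIMARY KEY, user_id INTEGER, content TEXT);
-- Composite index for the join: look up by user_id, read friend_id from the index.
CREATE INDEX idx_user_friend ON user_friend (user_id, friend_id);
-- Index starting with user_id so feed rows of a given user are found directly.
CREATE INDEX idx_feed_user ON feed (user_id);
INSERT INTO user_friend VALUES (1, 2), (1, 3), (2, 1);
INSERT INTO feed (user_id, content) VALUES
    (2, 'post by 2'), (3, 'post by 3'), (4, 'post by 4'), (1, 'post by 1');
""")

# Choice 1: a single query that joins feed to user_friend.
join_rows = conn.execute(
    """
    SELECT a.id, a.user_id, a.content
    FROM feed a
    INNER JOIN user_friend b ON a.user_id = b.friend_id
    WHERE b.user_id = ?
    ORDER BY a.id
    """,
    (1,),
).fetchall()

# Choice 2: friend ids fetched elsewhere (e.g. from the graph database), then IN.
friend_ids = [2, 3]
placeholders = ", ".join("?" for _ in friend_ids)
in_rows = conn.execute(
    f"SELECT id, user_id, content FROM feed"
    f" WHERE user_id IN ({placeholders}) ORDER BY id",
    friend_ids,
).fetchall()

print(join_rows == in_rows)
```

Which variant is faster on real data still has to be measured on the actual MySQL server; this script only checks that the results agree.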

This depends on how much data you are comparing. When the subquery or IN list is large, a join is much preferred, because subqueries can be slower than joins (LEFT [OUTER] JOIN or INNER JOIN); in my opinion, the strength of a subquery is slightly higher readability.
If you only have a small amount of data to compare, there is little reason to join against a complete table, so the choice depends on how much data you have.
In my opinion, a small number of values in the IN list is fine, but with a subquery or a large amount of data you should go for a join.

Conditionals in WHEREs or JOINs?

Let's say I have the following query:
SELECT occurs.*, events.*
FROM occurs
INNER JOIN events ON (events.event_id = occurs.event_id)
WHERE events.event_state = 'visible'
Another way to do the same query and get the same results would be:
SELECT occurs.*, events.*
FROM occurs
INNER JOIN events ON (events.event_id = occurs.event_id
AND events.event_state = 'visible')
My question: is there a real difference? Is one way faster than the other? Why would I choose one way over the other?

For an INNER JOIN, there's no conceptual difference between putting a condition in ON and in WHERE. It's a common practice to use ON for conditions that connect a key in one table to a foreign key in another table, such as your event_id, so that other people maintaining your code can see how the tables relate.
If you suspect that your database engine is mis-optimizing a query plan, you can try it both ways. Make sure to time the query several times to isolate the effect of caching, and make sure to run ANALYZE TABLE occurs and ANALYZE TABLE events to provide more info to the optimizer about the distribution of keys. If you do find a difference, have the database engine EXPLAIN the query plans it generates. If there's a gross mis-optimization, you can create an Oracle account and file a feature request against MySQL to optimize a particular query better.
But for a LEFT JOIN, there's a big difference. A LEFT JOIN is often used to add details from a separate table if the details exist or return the rows without details if they do not. This query will return result rows with NULL values for b.* if no row of b matches both conditions:
SELECT a.*, b.*
FROM a
LEFT JOIN b
ON (condition_one
AND condition_two)
WHERE condition_three
Whereas this one will completely omit results that do not match condition_two:
SELECT a.*, b.*
FROM a
LEFT JOIN b ON some_condition
WHERE condition_two
AND condition_three
Code in this answer is dual licensed: CC BY-SA 3.0 or the MIT License as published by OSI.
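To make the LEFT JOIN difference concrete, here is a small sketch in Python using the standard sqlite3 module as a stand-in for MySQL. The tables a and b mirror the placeholders in the answer, while the columns, the 'visible' condition, and the sample rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE b (a_id INTEGER, state TEXT);
INSERT INTO a VALUES
    (1, 'has visible detail'), (2, 'has hidden detail'), (3, 'has no detail');
INSERT INTO b VALUES (1, 'visible'), (2, 'hidden');
""")

# Condition inside ON: every row of a survives; b columns are NULL
# when no row of b satisfies both conditions.
on_rows = conn.execute("""
    SELECT a.id, b.state
    FROM a
    LEFT JOIN b ON (b.a_id = a.id AND b.state = 'visible')
    ORDER BY a.id
""").fetchall()

# Same condition moved to WHERE: rows whose b.state is NULL or 'hidden'
# are filtered out after the join, so the LEFT JOIN behaves like an INNER JOIN.
where_rows = conn.execute("""
    SELECT a.id, b.state
    FROM a
    LEFT JOIN b ON b.a_id = a.id
    WHERE b.state = 'visible'
    ORDER BY a.id
""").fetchall()

print(on_rows)
print(where_rows)
```

With this data the first query returns three rows (two of them with NULL for b.state) and the second returns only the row for id 1.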

Refinement to this MySQL query?

I've got a query which is taking a long time and I was wondering if there was a better way to do it? Perhaps with joins?
It's currently taking ~2.5 seconds which is way too long.
To explain the structure a little: I have products, "themes" and "categories". A product can be assigned any number of themes or categories. The themeitems and categoryitems tables are linking tables to link a category/theme ID to a product ID.
I want to get a list of all products with at least one theme and category. The query I've got at the moment is below:
SELECT *
FROM themes t, themeitems ti, products p, catitems ci, categories c
WHERE t.ID = ti.THEMEID
AND ti.PRODID = p.ID
AND p.ID = ci.PRODID
AND ci.CATID = c.ID
I'm only actually selecting the rows I need when performing the query but I've removed that to abstract a little.
Any help in the right direction would be great!
Edit: EXPLAIN below

The standard response to this issue is to use explicit JOINs and to make sure there are indexes on the fields used in the JOINs.
SELECT *
FROM themes t
INNER JOIN themeitems ti ON t.ID = ti.THEMEID
INNER JOIN products p ON ti.PRODID = p.ID
INNER JOIN catitems ci ON p.ID = ci.PRODID
INNER JOIN categories c ON ci.CATID = c.ID
Specifying the JOINs explicitly helps the query engine work out what it needs to do, and indexes on the columns used in the joins enable faster joining.

Your query is slow because you don't have any indexes on your tables.
Try:
create unique index pk on themes (ID);
create index fk on themeitems (themeid, prodid);
create unique index pk on products (id);
create index fk on catitems (prodid, catid);
create unique index pk on categories (id);
As @symcbean writes in the comments, the catitems and themeitems indexes should probably be unique indexes too; if there is another column that belongs in that index (e.g. "validityDate"), add it to the create statement.

Your query is very simple. I do not think that rewriting it with explicit joins will reduce its cost. You can try adding indexes on the appropriate columns.

Simply selecting less data is the glaringly obvious solution here.
Why do you need to know every column and every row every time you run the query? Addressing any one of these 3 factors will improve performance.
I want to get a list of all products with at least one theme and category
That rather implies you don't care which theme and category, in which case.....
SELECT p.*
FROM themeitems ti, products p, catitems ci
WHERE p.ID = ti.PRODID
AND p.ID = ci.PRODID
It may be possible to make the query run significantly faster - but you've not provided details of the table structure, the indexes, the volume of data, the engine type, the query cache configuration, the frequency of data updates, the frequency with which the query is run.....
update
Now that you've provided the explain plan, it's obvious you've got very small amounts of data AND NO RELEVANT INDEXES!!!!!
As a minimum you should add indexes on the product foreign key in the themeitems and catitems tables. Indeed, the primary keys for these tables should be the product id and category id / theme id, and since it's likely that you will have more products than categories or themes then the fields should be in that order in the indexes. (i.e. PRODID,CATID rather than CATID, PRODID)
update2
Given the requirement "to get a list of all products with at least one theme and category", it might be faster still (but the big wins are reducing the number of joins and adding the right indexes) to....
SELECT p.*
FROM products p
INNER JOIN (
SELECT DISTINCT ti.PRODID
FROM themeitems ti, catitems ci
WHERE ti.PRODID=ci.PRODID
) i ON p.id=i.PRODID
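The effect of the DISTINCT subquery can be checked on a toy data set. The sketch below uses Python's standard sqlite3 module in place of MySQL; the name column and the sample rows are assumptions, and the linking tables use the (PRODID, CATID) / (PRODID, THEMEID) key order suggested above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE themeitems (THEMEID INTEGER, PRODID INTEGER, PRIMARY KEY (PRODID, THEMEID));
CREATE TABLE catitems (CATID INTEGER, PRODID INTEGER, PRIMARY KEY (PRODID, CATID));
INSERT INTO products VALUES (1, 'lamp'), (2, 'rug'), (3, 'vase');
-- lamp has two themes, rug has one theme and no category
INSERT INTO themeitems (THEMEID, PRODID) VALUES (1, 1), (2, 1), (1, 2);
-- lamp has two categories, vase has one category and no theme
INSERT INTO catitems (CATID, PRODID) VALUES (1, 1), (2, 1), (1, 3);
""")

# Plain three-way join: one row per (theme, category) combination of a product.
plain_rows = conn.execute("""
    SELECT p.ID, p.name
    FROM themeitems ti, products p, catitems ci
    WHERE p.ID = ti.PRODID
      AND p.ID = ci.PRODID
""").fetchall()

# DISTINCT subquery: each qualifying product id appears exactly once.
distinct_rows = conn.execute("""
    SELECT p.ID, p.name
    FROM products p
    INNER JOIN (
        SELECT DISTINCT ti.PRODID
        FROM themeitems ti, catitems ci
        WHERE ti.PRODID = ci.PRODID
    ) i ON p.ID = i.PRODID
    ORDER BY p.ID
""").fetchall()

print(len(plain_rows), distinct_rows)
```

Here the plain join repeats the lamp four times (2 themes x 2 categories), while the subquery version returns it once, which is closer to "a list of all products with at least one theme and category".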

I've made this an answer because I could not post it as a comment.
The basic rule of thumb if you want to avoid full table scans with JOINs:
You should index first.
Note that this does not always work with ORDER BY/GROUP BY in combination with JOINs, because "Using temporary; Using filesort" is often still needed.
An extra note, because it is outside the scope of the question, on how to fix a slow query that uses ORDER BY/GROUP BY in combination with a JOIN:
The MySQL optimizer thinks it needs to access the smallest table first to get the best execution plan. As a result, MySQL cannot always use indexes to sort the result, and it needs a temporary table and a filesort to fix the sort order.
(Read more about this in "MySQL slow query using filesort"; that is how I fix this problem, because "Using temporary" really can kill performance when MySQL needs a disk-based temporary table.)

Database design to enable Multiple tags like Stackoverflow?

I have the following tables.
Articles table
a_id INT primary unique
name VARCHAR
Description VARCHAR
c_id INT
Category table
id INT
cat_name VARCHAR
For now I simply use
SELECT a_id,name,Description,cat_name FROM Articles LEFT JOIN Category ON Articles.c_id=Category.id WHERE c_id={$id}
This gives me all articles which belong to a certain category along with category name.
Each article has only one category.
I use a subcategory in a similar way (I have another table named sub_cat), but not every article necessarily has a subcategory; it may belong to multiple categories instead.
I am now thinking of tagging an article with more than one category, just like questions on Stack Overflow are tagged (e.g. with multiple tags like PHP, MYSQL, SQL). Later I will have to display (filter) all articles with certain tags (e.g. tagged with php, or php + MySQL), and I also have to display the tags along with the article name and Description.
Can anyone help me redesign the database? (I am using PHP + MySQL on the back end.)

Create a new table:
CREATE TABLE ArticleCategories(
A_ID INT,
C_ID INT,
Constraint PK_ArticleCategories Primary Key (A_ID, C_ID)
)
(This is SQL Server syntax; it may be slightly different for MySQL.)
This is called a "Junction Table" or a "Mapping Table" and it is how you express Many-to-Many relationships in SQL. So, whenever you want to add a Category to an Article, just INSERT a row into this table with the IDs of the Article and the Category.
For instance, you can initialize it like this:
INSERT Into ArticleCategories(A_ID,C_ID)
SELECT A_ID,C_ID From Articles
Now you can remove c_id from your Articles table.
To get back all of the Categories for a single Article, you would use a query like this:
SELECT Articles.a_id,name,Description,cat_name
FROM Articles
LEFT JOIN ArticleCategories ON Articles.a_id=ArticleCategories.a_id
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id={$a_id}
Alternatively, to return all articles that have a category LIKE a certain string:
SELECT a_id,name,Description
FROM Articles
WHERE EXISTS( Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
)
(You may have to adjust the last line, as I am not sure how string parameters are passed MySQL+PHP.)

OK RBarryYoung, you asked me for a reference/analysis, so here is one.
This analysis is based on the documentation and on source code analysis of the MySQL server.
INSERT Into ArticleCategories(A_ID,C_ID)
SELECT A_ID,C_ID From Articles
On a large Articles table with many rows, this copy can push one CPU core to 100% load and can create a disk-based temporary table, which slows down overall MySQL performance because the disk is stressed by the copy.
If this is a one-time process it is not that bad, but do the math if you run it every time.
SELECT a_id,name,Description
FROM Articles
WHERE EXISTS( Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
)
Note: don't take the execution times on SQL Fiddle at face value; it is a busy server and the times vary too much to support a firm conclusion. Look instead at what View Execution Plan has to say.
See http://sqlfiddle.com/#!2/48817/21 for a demo.
Both queries always trigger a complete table scan on the Articles table plus two DEPENDENT SUBQUERYs, which is not good if you have a large Articles table with many records.
This means the performance depends on the number of Articles rows, even when you only want the articles that are in the category.
Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
This is the inner subquery, but when you try to run it on its own, MySQL cannot run it because it depends on a value from the Articles table. That makes it a correlated subquery: a subquery type that is evaluated once for each row processed by the outer query. Not good indeed.
There are several ways of rewriting RBarryYoung's query; I will show one.
The INNER JOIN approach is much more efficient, even with the LIKE operator.
Note: I have made a habit of starting with the table with the lowest number of records and working my way up. If you start with the Articles table instead, the execution will be the same as long as the MySQL optimizer chooses the right plan.
SELECT
Articles.a_id
, Articles.name
, Articles.description
FROM
Category
INNER JOIN
ArticleCategories
ON
Category.id = ArticleCategories.c_id
INNER JOIN
Articles
ON
ArticleCategories.a_id = Articles.a_id
WHERE
cat_name LIKE '%php%';
See http://sqlfiddle.com/#!2/43451/23 for a demo. Note that this plan looks worse at first, because it appears that more rows need to be checked.
Note: if the Articles table has a small number of records, RBarryYoung's EXISTS approach and the INNER JOIN approach will perform more or less the same in terms of execution time. Further evidence that the INNER JOIN approach scales better as the record count grows:
http://sqlfiddle.com/#!2/c11f3/1 EXISTS: more Articles records now need to be checked (even those not linked through the ArticleCategories table), so the query is less efficient.
http://sqlfiddle.com/#!2/7aa74/8 INNER JOIN: same explain plan as the first demo.
An extra note about scaling: it gets even worse when you also want to ORDER BY or GROUP BY. The EXISTS approach has a bigger chance of creating a disk-based temporary table, which will kill MySQL performance.
Let's also analyse LIKE '%php%' vs = 'php' for the EXISTS approach and the INNER JOIN approach.
The EXISTS approach:
http://sqlfiddle.com/#!2/48817/21 / http://sqlfiddle.com/#!2/c11f3/1 (more Articles): the explain output tells me both patterns are more or less the same, but = 'php' should be a little faster because of the const type vs ref in the TYPE column, and LIKE '%php%' will use more CPU because a string comparison algorithm needs to run.
The INNER JOIN approach:
http://sqlfiddle.com/#!2/43451/23 / http://sqlfiddle.com/#!2/7aa74/8 (more Articles): the explain output tells me LIKE '%php%' should be slower because 3 more rows need to be analysed, but not shockingly slower in this case (you can see the index is not used in the best way).
RBarryYoung's approach works, but it doesn't keep its performance, at least not on a MySQL server.
See http://sqlfiddle.com/#!2/b2bd9/1 or http://sqlfiddle.com/#!2/34ea7/1
for examples that will scale on large tables with lots of records; this is what the topic starter needs.
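One detail worth checking when switching from EXISTS to INNER JOIN is that the join can return an article more than once if several of its categories match the pattern. The sketch below illustrates this with Python's standard sqlite3 module standing in for MySQL; the sample rows are invented, and SELECT DISTINCT is one assumed way to get back to one row per article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Articles (a_id INTEGER PRIMARY KEY, name TEXT, Description TEXT);
CREATE TABLE Category (id INTEGER PRIMARY KEY, cat_name TEXT);
CREATE TABLE ArticleCategories (a_id INTEGER, c_id INTEGER, PRIMARY KEY (a_id, c_id));
INSERT INTO Articles VALUES
    (1, 'Intro to PDO', 'sample text'),
    (2, 'Indexing basics', 'sample text'),
    (3, 'Untagged draft', 'sample text');
INSERT INTO Category VALUES (1, 'php'), (2, 'php-frameworks'), (3, 'mysql');
INSERT INTO ArticleCategories VALUES (1, 1), (1, 2), (2, 3);
""")
pattern = "%php%"

# EXISTS form: at most one row per article by construction.
exists_rows = conn.execute("""
    SELECT a_id, name
    FROM Articles
    WHERE EXISTS (SELECT 1
                  FROM ArticleCategories
                  INNER JOIN Category ON ArticleCategories.c_id = Category.id
                  WHERE Articles.a_id = ArticleCategories.a_id
                    AND Category.cat_name LIKE ?)
    ORDER BY a_id
""", (pattern,)).fetchall()

join_sql = """
    SELECT {distinct} Articles.a_id, Articles.name
    FROM Category
    INNER JOIN ArticleCategories ON Category.id = ArticleCategories.c_id
    INNER JOIN Articles ON ArticleCategories.a_id = Articles.a_id
    WHERE Category.cat_name LIKE ?
    ORDER BY Articles.a_id
"""
# JOIN form without DISTINCT: article 1 matches two categories, so it appears twice.
join_rows = conn.execute(join_sql.format(distinct=""), (pattern,)).fetchall()
# JOIN form with DISTINCT: same result set as the EXISTS form.
distinct_join_rows = conn.execute(join_sql.format(distinct="DISTINCT"), (pattern,)).fetchall()

print(exists_rows, len(join_rows), distinct_join_rows)
```

The pattern is passed as a bound parameter here; how the execution plans compare on a real MySQL server is a separate question that only EXPLAIN on that server can answer.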

MySQL -- joining then joining then joining again

MySQL setup: step by step.
programs -> linked to --> speakers (by program_id)
At this point, it's easy for me to query all the data:
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
Nice and easy.
The trick for me is this. My speakers table is also linked to a third table, "books." So in the "speakers" table, I have "book_id" and in the "books" table, the book_id is linked to a name.
I've tried this (including a WHERE you'll notice):
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
No results.
My questions:
What am I doing wrong?
What's the most efficient way to make this query?
Basically, I want to get back all the programs data and the books data, but instead of the book_id, I need it to come back as the book name (from the 3rd table).
Thanks in advance for your help.
UPDATE:
(rather than opening a brand new question)
The left join worked for me. However, I have a new problem: multiple books can be assigned to a single speaker.
Using the left join returns two rows! What do I need to add to return only a single row, but with the two books kept separate?

Is there any chance that the books table doesn't have any matching rows for speakers.book_id?
Try using a left join, which will still return the program/speaker combinations even if there are no matches in books.
SELECT *
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
LIMIT 5
Btw, could you post the table schemas for all tables involved, and exactly what output (or reasonable representation) you'd expect to get?
Edit: response to the OP's comment
You can use GROUP BY and GROUP_CONCAT to put all the books on one row.
E.g.:
SELECT speakers.speaker_id,
speakers.speaker_name,
programs.program_id,
programs.program_name,
group_concat(books.book_name)
FROM programs
JOIN speakers on programs.program_id = speakers.program_id
LEFT JOIN books on speakers.book_id = books.book_id
WHERE programs.category_id = 1
GROUP BY speakers.speaker_id
LIMIT 5
Note: since I don't know the exact column names, these may be off
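A runnable sketch of the GROUP BY / GROUP_CONCAT idea follows, using Python's standard sqlite3 module (which also provides group_concat) in place of MySQL. The schema is a guess based on the question: it assumes a speaker with two books is stored as two speakers rows sharing a speaker_id, and all column names and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE programs (program_id INTEGER PRIMARY KEY, program_name TEXT, category_id INTEGER);
CREATE TABLE speakers (speaker_id INTEGER, speaker_name TEXT, program_id INTEGER, book_id INTEGER);
CREATE TABLE books (book_id INTEGER PRIMARY KEY, book_name TEXT);
INSERT INTO programs VALUES (1, 'Keynotes', 1), (2, 'Workshops', 2);
INSERT INTO books VALUES (10, 'Book A'), (11, 'Book B');
-- Ann has two books, Bob has none, Cy speaks in another category.
INSERT INTO speakers VALUES
    (1, 'Ann', 1, 10), (1, 'Ann', 1, 11), (2, 'Bob', 1, NULL), (3, 'Cy', 2, 10);
""")

rows = conn.execute("""
    SELECT speakers.speaker_id,
           speakers.speaker_name,
           programs.program_id,
           programs.program_name,
           group_concat(books.book_name) AS book_names
    FROM programs
    JOIN speakers ON programs.program_id = speakers.program_id
    LEFT JOIN books ON speakers.book_id = books.book_id
    WHERE programs.category_id = 1
    GROUP BY speakers.speaker_id, speakers.speaker_name,
             programs.program_id, programs.program_name
    ORDER BY speakers.speaker_id
""").fetchall()

for row in rows:
    print(row)
```

Every non-aggregated column is listed in GROUP BY so the query also works under MySQL's ONLY_FULL_GROUP_BY mode. The order of names inside the concatenated string is not guaranteed here; MySQL's GROUP_CONCAT accepts an ORDER BY clause inside the call if a fixed order is needed.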

That's typically efficient. There is some kind of assumption you are making that isn't true. Do your speakers have books assigned? If they don't that last JOIN should be a LEFT JOIN.
This kind of query is typically pretty efficient, since you almost certainly have primary keys as indexes. The main issue would be whether your indexes are covering (which is more likely to occur if you don't use SELECT *, but instead select only the columns you need).
