MySQL performance issues in SELECT

I have these tables:
IdToName:
Id Name
1 A
2 B
RawData:
Son Father
B A
I want to create a new table called Data that holds Ids instead of strings, i.e.:
Data:
Son Father
2 1
I do this using this query:
INSERT INTO `Data`
SELECT L.`ID`, P.`ID`
FROM `IdToName` L,
`IdToName` P,
`RawData` T
WHERE T.Father = P.Name
AND T.Son = L.Name
I have keys on RawData's son and father and on IdToName's Name. This query takes about 7 minutes for 2,800,000 lines. Does anyone have any idea how I can improve the performance for this?

Check the time of the query alone. I strongly suspect that what you have is really "MySQL performance issues in INSERT", and not "in SELECT". About 7,000 inserts per second (2,800,000 rows in 7 minutes) is quite a lot; it might be the physical limit of your machine.
Also, we don't know the exact shape and content of your tables (or how much memory you have), but I don't think any index can help in your case.

The only apparent reason why that would be slow is the lack of proper indexes.
Please add a UNIQUE index on Id in table IdToName, and an INDEX on both columns in table RawData.
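To separate the two suspicions above (slow lookup vs. slow insert), it can help to run the SELECT on its own and then the INSERT ... SELECT in one transaction. The following is only a minimal sketch: it uses Python's sqlite3 with an in-memory database as a stand-in for MySQL so that it is runnable anywhere, the index name idx_idtoname_name is made up, and making the Name index UNIQUE is an assumption (the question only says there is a key on Name). The comma joins are written as explicit JOINs, which is equivalent but easier to read.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption, for runnability only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE IdToName (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    -- Hypothetical index name; the question only says Name is keyed.
    CREATE UNIQUE INDEX idx_idtoname_name ON IdToName (Name);
    CREATE TABLE RawData (Son TEXT, Father TEXT);
    CREATE TABLE Data (Son INTEGER, Father INTEGER);
    INSERT INTO IdToName VALUES (1, 'A'), (2, 'B');
    INSERT INTO RawData VALUES ('B', 'A');
""")

# Same logic as the query in the question, with explicit JOINs.
mapping_select = """
    SELECT s.Id AS Son, f.Id AS Father
    FROM RawData r
    JOIN IdToName s ON s.Name = r.Son
    JOIN IdToName f ON f.Name = r.Father
"""

# Step 1: run the SELECT alone to see how much time the lookups take.
selected = conn.execute(mapping_select).fetchall()

# Step 2: run the INSERT ... SELECT inside a single transaction.
with conn:
    conn.execute("INSERT INTO Data (Son, Father) " + mapping_select)

data_rows = conn.execute("SELECT Son, Father FROM Data").fetchall()
print(data_rows)  # [(2, 1)]
```

On the real tables, timing step 1 and step 2 separately should show whether the 7 minutes go into the joins or into writing 2,800,000 rows.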


Mysql two ways to select where. Which way uses less resources and is faster?

For example, I have a URL like domain.com/transport/cars
Based on the URL, I want to select from MySQL and show a list of ads for cars.
I want to choose the fastest method (the one that takes the least time to show results and uses the fewest resources).
I am comparing two ways.
First way
Mysql table transport with rows like
FirstLevSubcat | Text
---------------------------------
1 | Text1 car
2 | Text1xx lorry
1 | Text another car
FirstLevSubcat Type is int
Then another mysql table subcategories
Id | NameOfSubcat
---------------------------------
1 | cars
2 | lorries
3 | dogs
4 | flats
Query like
SELECT Text, AndSoOn FROM transport
WHERE
FirstLevSubcat = (SELECT Id FROM subcategories WHERE NameOfSubcat = 'cars')
Or, instead of SELECT Id FROM subcategories, get the Id from an XML file or a PHP array.
Second way
Mysql table transport with rows like
FirstLevSubcat | Text
---------------------------------
cars | Text1 car
lorries | Text1xx lorry
cars | Text another car
FirstLevSubcat Type is varchar or char
And query simply
SELECT Text, AndSoOn FROM transport
WHERE FirstLevSubcat = 'cars'
Please advise which way would use fewer resources and take less time to show results. I read that selecting on an int is faster than selecting on a varchar (see "SQL SELECT speed int vs varchar").
So, as I understand it, the first way would be better?
The first design is much better, because you separate two facts in your data:
There is a category 'cars'.
'Text1 car' is in the Category 'cars'.
Imagine, in your second design you enter another car, but type in 'cors' instead of 'cars'. The dbms doesn't see this, and so you have created another category with a single entry. (Well, in MySQL you could use an enum column instead to circumvent this issue, but this is not available in most other dbms. And anyhow, whenever you want to rename your category, say from 'cars' to 'vans', then you would have to change all existing records plus alter the table, instead of simply renaming the entry once in the subcategories table.)
So stay away from your second design.
As to Praveen Prasannan's comment on sub queries and joins: That is nonsense. Your query is straightforward and good. You want to select from transport where the category is the desired one. Perfect. There are two groups of people who would prefer a join here:
Beginners who simply don't know better and always join from the start and try to sort things out in the end.
Experienced programmers who know that some dbms often handle joins better than sub-queries. But this is a pessimistic habit. It is better to write your queries such that they are easy to read and maintain, as you are already doing, and only change this in case grave performance issues occur.
Yup. As the SO link in your question suggests, int comparison is faster than character comparison and yields faster fetches. With that in mind, the first design is the better one. However, sub queries are never recommended. Use a join instead.
e.g.:
SELECT t.Text, t.AndSoOn FROM transport t
INNER JOIN subcategories s ON s.ID = t.FirstLevSubcat
WHERE s.NameOfSubcat = 'cars'
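Since the two answers disagree on sub query vs. join, a quick check that both forms return the same rows may be useful before arguing about speed. This is a rough sketch only: it runs on Python's sqlite3 with an in-memory database instead of MySQL (an assumption made so it is runnable), the index name idx_transport_subcat is made up, and it says nothing about which form MySQL executes faster. For that, run EXPLAIN on your own server.

```python
import sqlite3

# In-memory SQLite stands in for MySQL (assumption, for runnability only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subcategories (Id INTEGER PRIMARY KEY, NameOfSubcat TEXT UNIQUE);
    CREATE TABLE transport (FirstLevSubcat INTEGER REFERENCES subcategories (Id), Text TEXT);
    -- Hypothetical index name; indexing the int column is what the first design relies on.
    CREATE INDEX idx_transport_subcat ON transport (FirstLevSubcat);
    INSERT INTO subcategories VALUES (1, 'cars'), (2, 'lorries'), (3, 'dogs'), (4, 'flats');
    INSERT INTO transport VALUES (1, 'Text1 car'), (2, 'Text1xx lorry'), (1, 'Text another car');
""")

# Form 1: scalar sub query, as in the question (string literal passed as a parameter).
subquery_form = """
    SELECT Text FROM transport
    WHERE FirstLevSubcat = (SELECT Id FROM subcategories WHERE NameOfSubcat = ?)
    ORDER BY Text
"""

# Form 2: explicit join, as in the answer above.
join_form = """
    SELECT t.Text FROM transport t
    INNER JOIN subcategories s ON s.Id = t.FirstLevSubcat
    WHERE s.NameOfSubcat = ?
    ORDER BY t.Text
"""

via_subquery = conn.execute(subquery_form, ("cars",)).fetchall()
via_join = conn.execute(join_form, ("cars",)).fetchall()
print(via_subquery == via_join)  # True: both forms select the same ads
```

Note the category name is a string value, so it goes in single quotes (or a bound parameter as here), not in backticks, which MySQL reserves for identifiers.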

SnowFlake Diagram and Many to Many relationship

I have a snowflake diagram with:
Fact:
id_movie
id_user
rating
Dim Users:
id_user
...
Dim Movies:
id_movie
...
In my ERD, I also have a table Category, that has a many to many relationship with the movies like this:
Dim_Category:
id_category
...
Map_Category_Movie:
id_movie
id_category
relevance
I am trying to find an efficient way to model this in a snowflake/star schema. My issues:
I could just add these two tables into the snowflake diagram, but this would feel wrong as I usually only use tables that are aggregates of the subtables on the outer fringes of this diagram.
I could create another fact table for the relevance, but as I ultimately want to report on the correlation between relevance and users' behaviour when rating movies, I'd need to use both fact tables, which to me is an incorrect approach.
Any guidance here?
There is a good chance that you have already answered your own question, and welcome to hell.
First, a quotation from http://www.information-management.com/ that may interest you:
The snowflake structure will reduce batch updates to dimensions. Though always said to be slower than a star, some tests have revealed no difference in performance between flattened and snowflaked dimensions. In fact in some cases, the snowflake provides superior performance, such as when a wide dimension (i.e., customer) is segmented into a snowflake.
So, using a bridge table is not going to cause a significant loss of performance. I prefer a snowflake in a good percentage of cases, because sometimes it really is easier to manage your data mart that way, and the hardware and data size often allow it.
My friendly advice is to create a bridge table (movie_ID, category_ID, relevance) and move on.
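To make the bridge table suggestion concrete, here is a small sketch of how the fact table can be joined through the bridge to report per category. Everything beyond the table and column names from the question is an assumption for illustration: the sample rows, the category names, the relevance-weighted average as the metric, and the use of Python's sqlite3 (in-memory) instead of a real warehouse engine.

```python
import sqlite3

# In-memory SQLite stands in for the warehouse engine (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_rating (id_movie INTEGER, id_user INTEGER, rating REAL);
    CREATE TABLE dim_category (id_category INTEGER PRIMARY KEY, name TEXT);
    -- The bridge table: one row per (movie, category) pair, carrying relevance.
    CREATE TABLE map_category_movie (
        id_movie INTEGER, id_category INTEGER, relevance REAL,
        PRIMARY KEY (id_movie, id_category)
    );
    -- Made-up sample data.
    INSERT INTO dim_category VALUES (1, 'comedy'), (2, 'drama');
    INSERT INTO map_category_movie VALUES (10, 1, 0.5), (10, 2, 0.5), (11, 2, 1.0);
    INSERT INTO fact_rating VALUES (10, 1, 4.0), (11, 1, 2.0), (10, 2, 5.0);
""")

# Hypothetical report: average rating per category, weighted by relevance.
rows = conn.execute("""
    SELECT c.name,
           SUM(f.rating * m.relevance) / SUM(m.relevance) AS weighted_rating
    FROM fact_rating f
    JOIN map_category_movie m ON m.id_movie = f.id_movie
    JOIN dim_category c ON c.id_category = m.id_category
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # [('comedy', 4.5), ('drama', 3.25)]
```

The thing to watch with a bridge is double counting: a rating for a movie in two categories appears twice after the join, so totals across categories need a weighting factor such as relevance, or a DISTINCT on the fact key.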
If you have a fixed and small list of categories, create a table with predefined categories:
dim_movies
----------
movies_id
category1_relevance
category2_relevance
category3_relevance
Up to ten is perhaps OK, especially if you work for the company you're creating the DWH for rather than just consulting for it (so you can administer it).
Once, we tried to build a masterpiece of a data warehouse that had a case similar to yours. Payment for the deal was based on performance (data was over 2TB per fact table), so we decided to give a star schema a shot.
We created a dimension like the one I described above, and every time the number of distinct categories grew, the ETL added a new field to the table.
The ETL process also had to dynamically recreate the cube.
It took a lot of pain, but performance was, as I remember, 13% better than the snowflake.
Also, during my most exhausting project, where I believe a 10-year-old kid would have designed the DB better, we had to connect exactly 5 categories per item. Each category points to one of 20+ possible tables. It could be joined ONLY through their software, based on some rules. It was some kind of 1...5:Many relationship (which doesn't exist!?!)
pk code_conto cat1 cat2 cat3 cat4 cat5
----------------------------------------------------------
1 123 17 NULL 5467 12 NULL
2 124 67 1098 NULL 1423 AK12
3 123 NULL NULL NULL 13 23
Code was like this:
If (code_conto == 123)
{
Category1_join_set = 'SELECT cat_id, cat_name FROM cat_customers'; //NOTE THIS
Category2_join_set = 'SELECT cat_id, cat_name FROM cat_products';
Category3_join_set = 'SELECT cat_id, cat_name FROM cat_city';
...
...
}
If (code_conto == 124)
{
Category1_join_set = 'SELECT cat_id, cat_name FROM cat_products'; //AND THIS
Category2_join_set = 'SELECT cat_id, cat_name FROM cat_origin'; //ON SAME FIELD
Category3_join_set = 'SELECT cat_id, cat_name FROM cat_blabla'; //DIFFERENT JOIN TABLE
...
...
}
All hard-coded. So we hard-coded our queries with a WHEN repeated over 100 times in a CASE statement. Guess what? The ERP provider 'improved' its software and created a mapping table containing the 'C' if statements based on the code_conto key.
It took us more than 3 weeks to deliver a good and secure ETL job (with SQL and external tools).
I didn't write all this for nothing. I wanted to convince you and others that using a bridge table for many-to-many relationships is probably the best practice in 97% of cases.
However, there are five possible design solutions to an M:M relationship:
Array or series (I don't want to even try it)
Bridge table
Groupings
Fixed levels
Dynamically created fixed levels
Hope I didn't confuse you.

Efficiency of Query to Select Records based on Related Records in Composite Table

Setup
I am creating an event listing where users can narrow down results by several filters. Rather than having a table for each filter (i.e. event_category, event_price) I have the following database structure (to make it easy/flexible to add more filters later):
event
event_id title description [etc...]
-------------------------------------------
filter
filter_id name slug
-----------------------------
1 Category category
2 Price price
filter_item
filter_item_id filter_id name slug
------------------------------------------------
1 1 Music music
2 1 Restaurant restaurant
3 2 High high
4 2 Low low
event_filter_item
event_id filter_item_id
--------------------------
1 1
1 4
2 1
2 3
Goal
I want to query the database and apply the filters that users specify. For example, if a user searches for events in 'Music' (category) priced 'Low' (price) then only one event will show (with event_id = 1).
The URL would look something like:
www.site.com/events?category=music&price=low
So I need to query the database with the filter 'slugs' I receive from the URL.
This is the query I have written to make this work:
SELECT ev.* FROM event ev
WHERE
EXISTS (SELECT * FROM event_filter_item efi
JOIN filter_item fi on fi.filter_item_id = efi.filter_item_id
JOIN filter f on f.filter_id = fi.filter_id
WHERE efi.event_id = ev.event_id AND f.slug = 'category' AND fi.slug ='music')
AND EXISTS (SELECT * FROM event_filter_item efi
JOIN filter_item fi on fi.filter_item_id = efi.filter_item_id
JOIN filter f on f.filter_id = fi.filter_id
WHERE efi.event_id = ev.event_id AND f.slug = 'price' AND fi.slug = 'low')
This query is currently hardcoded but would be dynamically generated in PHP based on what filters and slugs are present in the URL.
And the big question...
Is this a reasonable way to go about this? Does anyone see a problem with having multiple EXISTS() with sub-queries, and those subqueries performing several joins? This query is extremely quick with only a couple records in the database, but what about when there are thousands or tens of thousands?
Any guidance is really appreciated!
Best,
Chris
While EXISTS is essentially a form of JOIN (a semi-join), the MySQL query optimizer is notoriously "stupid" about executing it optimally. In your case, it will probably do a full table scan on the outer table, then execute the correlated subquery for each row, which is bound to scale badly. People often rewrite EXISTS as an explicit JOIN for that reason. Or, just use a smarter DBMS.
In addition to that, consider using a composite PK for filter_item, where the FK is at the leading edge. InnoDB tables are clustered, and you'd want to group items belonging to the same filter physically close together.
BTW, tens of thousands is not a "large" number of rows - to truly test the scalability use tens of millions or more.
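One common way to rewrite the stacked EXISTS clauses as a single join is to match any of the requested (filter, item) pairs and then keep only events that matched every requested filter, using GROUP BY and HAVING. The sketch below is an illustration under assumptions, not a drop-in: it runs on Python's sqlite3 (in-memory) rather than MySQL, find_event_ids is a made-up helper name, the event titles are invented, and the slugs are passed as bound parameters instead of being pasted into the SQL.

```python
import sqlite3

# In-memory SQLite stands in for MySQL (assumption, for runnability only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (event_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE "filter" (filter_id INTEGER PRIMARY KEY, name TEXT, slug TEXT);
    CREATE TABLE filter_item (filter_item_id INTEGER PRIMARY KEY, filter_id INTEGER,
                              name TEXT, slug TEXT);
    CREATE TABLE event_filter_item (event_id INTEGER, filter_item_id INTEGER,
                                    PRIMARY KEY (event_id, filter_item_id));
    INSERT INTO event VALUES (1, 'Event one'), (2, 'Event two');
    INSERT INTO "filter" VALUES (1, 'Category', 'category'), (2, 'Price', 'price');
    INSERT INTO filter_item VALUES (1, 1, 'Music', 'music'), (2, 1, 'Restaurant', 'restaurant'),
                                   (3, 2, 'High', 'high'), (4, 2, 'Low', 'low');
    INSERT INTO event_filter_item VALUES (1, 1), (1, 4), (2, 1), (2, 3);
""")


def find_event_ids(conn, wanted):
    """Hypothetical helper. wanted maps filter slug -> item slug,
    e.g. {'category': 'music', 'price': 'low'} from the URL query string."""
    pairs = list(wanted.items())
    if not pairs:
        # No filters requested: every event qualifies.
        return [row[0] for row in conn.execute("SELECT event_id FROM event ORDER BY event_id")]
    # One "(filter slug, item slug)" test per requested filter, OR-ed together.
    match_any = " OR ".join("(f.slug = ? AND fi.slug = ?)" for _ in pairs)
    sql = f"""
        SELECT efi.event_id
        FROM event_filter_item efi
        JOIN filter_item fi ON fi.filter_item_id = efi.filter_item_id
        JOIN "filter" f ON f.filter_id = fi.filter_id
        WHERE {match_any}
        GROUP BY efi.event_id
        HAVING COUNT(DISTINCT f.filter_id) = ?  -- must match every requested filter
        ORDER BY efi.event_id
    """
    params = [slug for pair in pairs for slug in pair] + [len(pairs)]
    return [row[0] for row in conn.execute(sql, params)]


print(find_event_ids(conn, {"category": "music", "price": "low"}))  # [1]
print(find_event_ids(conn, {"category": "music"}))                  # [1, 2]
```

Only the number of placeholders is generated dynamically; the slug values themselves never become part of the SQL text, which also takes care of injection through the URL.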

MySQL select users on multiple criteria

My team is working on a PHP/MySQL website for a school project. I have a table of users with typical information (ID, first name, last name, etc.). I also have a table of questions with sample data like below. For this simplified example, all the answers to the questions are numerical.
Table Questions:
qid | questionText
1 | 'favorite number'
2 | 'gpa'
3 | 'number of years doing ...'
etc.
Users will have the ability to fill out a form to answer any or all of these questions. Note: users are not required to answer all of the questions and the questions themselves are subject to change in the future.
The answer table looks like this:
Table Answers:
uid | qid | value
37 | 1 | 42
37 | 2 | 3.5
38 | 2 | 3.6
etc.
Now, I am working on the search page for the site. I would like the user to select what criteria they want to search on. I have something working, but I'm not sure it is efficient at all or if it will scale (not that these tables will ever be huge - like I said, it is a school project). For example, I might want to list all users whose favorite number is between 100 and 200 and whose GPA is above 2.0. Currently, I have a query builder that works (it creates a valid query that returns accurate results - as far as I can tell). A result of the query builder for this example would look like this:
SELECT u.ID, u.name (etc)
FROM User u
JOIN Answer a1 ON u.ID=a1.uid
JOIN Answer a2 ON u.ID=a2.uid
WHERE 1
AND (a1.qid=1 AND a1.value>100 AND a1.value<200)
AND (a2.qid=2 AND a2.value>2.0)
I add the WHERE 1 so that in the for loops, I can just add " AND (...)". I realize I could drop the '1', use implode(' AND ', $array), and add the WHERE only if the array is not empty, but I figured this is equivalent. If not, I can change that easily enough.
As you can see, I add a JOIN for every criterion the searcher asks for. This also allows me to order by a1.value ASC, or a2.value, etc.
First question:
Is this table organization at least somewhat decent? We figured that since the number of questions is variable, and not every user answers every question, that something like this would be necessary.
Main question:
Is the query way too inefficient? I imagine that it is not ideal to join the same table to itself up to maybe a dozen or two times (if we end up putting that many questions in). I did some searching and found these two posts which seem to kind of touch on what I'm looking for:
Mutiple criteria in 1 query
This uses multiple nested (correct term?) queries in EXISTS
Search for products with multiple criteria
One of the comments by youssef azari mentions using 'query 1' UNION 'query 2'
Would either of these perform better/make more sense for what I'm trying to do?
Bonus question:
I left out above for simplicity's sake, but I actually have 3 tables (for number valued questions, booleans, and text)
The decision to have separate tables was because (as far as I could think of) it would either be that or have one big answers table with 3 value columns of different types, having 2 always empty.
This works with my current query builder - an example query would be
SELECT u.ID,...
FROM User u
JOIN AnswerBool b1 ON u.ID=b1.uid
JOIN AnswerNum n1 ON u.ID=n1.uid
JOIN AnswerText t1 ON u.ID=t1.uid
WHERE 1
AND (b1.qid=1 AND b1.value=true)
AND (n1.qid=16 AND n1.value<999)
AND (t1.qid=23 AND t1.value LIKE '...')
With that in mind, what is the best way to get my results?
One final piece of context:
I mentioned this is for a school project. While this is true, the eventual goal (it is an undergrad senior design project) is to have a department use our site for students creating teams for their senior design. For a rough estimate of size, every semester, the department would have somewhere around 200 or so students use our site to form teams. Obviously, when we're done, the department will (hopefully) check our site for security issues and other stuff they need to worry about (what with FERPA and all). We are trying to take into account all common security practices and scalability concerns, but in the end, our code may be improved by others.
UPDATE
As per nnichols suggestion, I put in a decent amount of data and ran some tests on different queries. I put around 250 users in the table, and about 2000 answers in each of the 3 tables. I found the links provided very informative
(links removed because I can't hyperlink more than twice yet) Links are in nnichols' response
as well as this one that I found:
http://phpmaster.com/using-explain-to-write-better-mysql-queries/
I tried 3 different types of queries, and in the end, the one I proposed worked the best.
First: using EXISTS
SELECT u.ID,...
FROM User u WHERE 1
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=# AND value>#) -- or any condition on value
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=another # AND some_condition(value))
AND EXISTS
(SELECT * FROM AnswerText
...
I used 10 conditions on each of the 3 answer tables (resulting in 30 EXISTS)
Second: using IN - a very similar approach (maybe even exactly?) which yields the same results
SELECT u.ID,...
FROM User u WHERE 1
AND (u.ID) IN (SELECT uid FROM AnswerNumber WHERE qid=# AND ...)
...
again with 30 subqueries.
The third one I tried was the same as described above (using 30 JOINs)
The results of using EXPLAIN on the first two were as follows: (identical)
The primary query on table u had a type of ALL (bad, though users table is not huge) and rows searched was roughly twice the size of the user table (not sure why). Each other row in the output of EXPLAIN was a dependent query on the relevant answer table, with a type of eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall not bad.
For the query I suggested (JOINing):
The primary query was actually on whatever table you joined first (in my case AnswerBoolean) with type of ref (better than ALL). The number of rows searched was equal to the number of questions answered by anyone (as in 50 distinct questions have been answered by anyone) (which will be much less than the number of users). For each additional row in EXPLAIN output, it was a SIMPLE query with type eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall almost the same, but a smaller starting multiplier.
One final advantage to the JOIN method: it was the only one I could figure out how to order by various values (such as n1.value). Since the other two queries were using subqueries, I could not access the value of a specific subquery. Adding the order by clause did change the extra field in the first query to also have 'using temporary' (required, I believe, for order by's) and 'using filesort' (not sure how to avoid that). However, even with those slow-downs, the number of rows is still much less, and the other two (as far as I could get) cannot use order by.
You could answer most of these questions yourself with a suitably large test dataset and the use of EXPLAIN and/or the profiler.
Your INNER JOINs will almost certainly perform better than switching to EXISTS but again this is easy to test with a suitable test dataset and EXPLAIN.
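For reference, the one-JOIN-per-criterion builder described in the question can be sketched roughly like this. It is written in Python with sqlite3 (in-memory) rather than PHP/MySQL purely so it can be run as-is; build_search, ALLOWED_OPS, the user names, and the choice to put the qid/value tests in the ON clause are all assumptions for illustration, not the asker's actual code. Values are bound as parameters and operators are checked against a whitelist, which matters given the security review mentioned in the question.

```python
import sqlite3

# Operators the search form may use; anything else is rejected (hypothetical whitelist).
ALLOWED_OPS = {"<", "<=", "=", ">=", ">"}


def build_search(criteria):
    """Hypothetical builder. criteria is a list of (qid, op, value) tuples;
    returns (sql, params) with one self-join on Answer per criterion."""
    joins, params = [], []
    for i, (qid, op, value) in enumerate(criteria, start=1):
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        joins.append(
            f"JOIN Answer a{i} ON a{i}.uid = u.ID AND a{i}.qid = ? AND a{i}.value {op} ?"
        )
        params += [qid, value]
    sql = "SELECT u.ID, u.name FROM User u " + " ".join(joins) + " ORDER BY u.ID"
    return sql, params


# In-memory SQLite stands in for MySQL (assumption); user names are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Answer (uid INTEGER, qid INTEGER, value REAL, PRIMARY KEY (uid, qid));
    INSERT INTO User VALUES (37, 'Ann'), (38, 'Bob');
    INSERT INTO Answer VALUES (37, 1, 42), (37, 2, 3.5), (38, 2, 3.6);
""")

# Favorite number between 40 and 100, and GPA above 2.0.
sql, params = build_search([(1, ">", 40), (1, "<", 100), (2, ">", 2.0)])
matches = conn.execute(sql, params).fetchall()
print(matches)  # [(37, 'Ann')]
```

With a composite key on (uid, qid), each extra join is a single-row lookup per user, which is consistent with the eq_ref rows the asker saw in EXPLAIN.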

Soccer SQL Query Home- and Roadteam Issue

In a soccer environment I want to display the current standings. Meaning: points and goals per team. The relevant tables look similar to the following (simplified).
Match Table
uid (PK) hometeamid roadteamid
------------------------------------------------------------------
Result Table
uid (PK) hometeamscore roadteamscore resulttype (45min, 90min, ..)
-------------------------------------------------------------------
Team Table
uid (PK) name shortname icon
------------------------------------------------------------------
I can't get my head around how to write the standings in one query. What I managed was to write a query which returns the "homegames" standings only. I guess that's the easy part. Anyway, here is how it looks:
SELECT ht.name,
Count(*) As matches,
SUM(res.hometeamscore) AS goals,
SUM(res.roadteamscore) AS opponentgoals,
SUM(res.hometeamscore - res.roadteamscore) AS goalDifference,
SUM(res.hometeamscore > res.roadteamscore) * 3 + SUM(res.hometeamscore = res.roadteamscore) As Points
FROM league_league l
JOIN league_gameday gd
ON gd.leagueid = l.uid
JOIN league_match m
ON m.gamedayid = gd.uid
JOIN league_result res
ON res.matchid = m.uid
AND res.resulttype = 2
JOIN league_team ht
ON m.hometeamid = ht.uid
Where l.uid = 1
Group By ht.uid
Order By points DESC, goalDifference DESC
Any idea how to modify this so that it returns both home and road games would be greatly appreciated.
Many thanks,
Robin
Create views. If your data does not change often and you need performance, create one or more pre-computed tables.
Views in MySQL are just pseudo-tables that are dynamically computed from a SELECT query. Using the SQL in your question, you can create a view of the teams' results at home: CREATE VIEW homegames AS SELECT ...
Then do the same for road games. Then it will be easy to synthesize both views in a third one (you just need to sum up the columns).
Views have at least one flaw: they are slow. A view built on views is like using complex subqueries, and MySQL is quite bad at this. I don't think it's a problem for you as you're probably dealing with hundreds of games at most. But if you find these views to be too slow to query, and provided you don't use any kind of cache that could mitigate this, then use simple tables instead of views. Of course, you'll need to keep them in sync. You can TRUNCATE and INSERT INTO homegames SELECT ... each time you have a new game, or you can be smarter and just UPDATE the tables. Both are right, depending on your needs.
Could you not abstract this out into a stored procedure or stored function to call rather than constructing such a big-ass complicated query?
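For what it's worth, a different route from the views and stored procedures suggested above is to stack the home and road perspectives with UNION ALL in a derived table and aggregate once. The sketch below is only an illustration under assumptions: it runs on Python's sqlite3 (in-memory) instead of MySQL, the team names and scores are invented, and the league and gameday joins from the question are left out to keep it short (they would go inside both halves of the UNION ALL).

```python
import sqlite3

# In-memory SQLite stands in for MySQL (assumption); sample teams and scores are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE league_team (uid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE league_match (uid INTEGER PRIMARY KEY, hometeamid INTEGER, roadteamid INTEGER);
    CREATE TABLE league_result (uid INTEGER PRIMARY KEY, matchid INTEGER,
                                hometeamscore INTEGER, roadteamscore INTEGER,
                                resulttype INTEGER);
    INSERT INTO league_team VALUES (1, 'Reds'), (2, 'Blues');
    INSERT INTO league_match VALUES (1, 1, 2), (2, 2, 1);
    -- Reds 2:0 Blues at home, then Blues 1:1 Reds; resulttype 2 as in the question.
    INSERT INTO league_result VALUES (1, 1, 2, 0, 2), (2, 2, 1, 1, 2);
""")

standings = conn.execute("""
    SELECT t.name,
           COUNT(*) AS matches,
           SUM(g.scored) AS goals,
           SUM(g.conceded) AS opponentgoals,
           SUM(g.scored - g.conceded) AS goalDifference,
           SUM(g.scored > g.conceded) * 3 + SUM(g.scored = g.conceded) AS points
    FROM (
        -- Each match seen from the home team's side ...
        SELECT m.hometeamid AS teamid, r.hometeamscore AS scored, r.roadteamscore AS conceded
        FROM league_match m
        JOIN league_result r ON r.matchid = m.uid AND r.resulttype = 2
        UNION ALL
        -- ... and the same match seen from the road team's side.
        SELECT m.roadteamid, r.roadteamscore, r.hometeamscore
        FROM league_match m
        JOIN league_result r ON r.matchid = m.uid AND r.resulttype = 2
    ) g
    JOIN league_team t ON t.uid = g.teamid
    GROUP BY t.uid, t.name
    ORDER BY points DESC, goalDifference DESC
""").fetchall()
print(standings)  # [('Reds', 2, 3, 1, 2, 4), ('Blues', 2, 1, 3, -2, 1)]
```

UNION ALL rather than UNION matters here: two matches with identical scorelines must both count, and UNION would collapse them into one row.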