I have two tables:
++++++++++++++++++++++++++++++++++++
| Games |
++++++++++++++++++++++++++++++++++++
| ID | Name | Description |
++++++++++++++++++++++++++++++++++++
| 1 | Game 1 | A game description |
| 2 | Game 2 | And another |
| 3 | Game 3 | And another |
| .. | ... | ... |
++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++
| GameReviews |
+++++++++++++++++++++++++++++++++++++++
| ID |GameID| Review |
+++++++++++++++++++++++++++++++++++++++
| 1 | 1 |Review for game 1 |
| 2 | 1 |Another review for game 1|
| 3 | 1 |And another |
| .. | ... | ... |
+++++++++++++++++++++++++++++++++++++++
Option 1:
SELECT
Games.ID,
Games.Name,
Games.Description,
GameReviews.ID,
GameReviews.Review
FROM
GameReviews
LEFT JOIN
Games
ON
Games.ID = GameReviews.GameID
WHERE
Games.ID=?
Option 2:
SELECT
ID,
Name,
Description
FROM
Games
WHERE
ID=?
and then
SELECT
ID,
Review
FROM
GameReviews
WHERE
GameID=?
Obviously option 1 is "simpler" in that it is less code to write, while option 2 would seem to be logically "easier" on the database, as it only queries the Games table once. The question is: when it really gets down to it, is there a real difference in performance and efficiency?
The vast majority of the time option 1 would be the way to go. The performance difference between the two would not be measurable until you have a lot of data. Keep it simple.
Your example is also fairly basic. At scale, performance issues can start revealing themselves depending on which fields are being filtered, joined and pulled. The ideal scenario is to only pull data that exists in indexes (particularly with InnoDB). That is usually not possible, but a common strategy is to pull the actual data you need at the last possible moment, which is sort of what option 2 would be doing.
At extreme scale, you don't want to do any joins in the database at all. Your "joins" would happen in code, minimizing data sent over the network. Go with option 1 until you start having performance issues, which may never happen.
Go with option 1; that is exactly what RDBMSes are optimized for.
And it is almost always better to hit a database once from the client than to hit it multiple times.
I don't believe that you will ever have so many games and reviews that it will make sense to go with option 2.
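For reference, the two options can be put side by side in a small, self-contained sketch. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL (an assumption made only so the example runs anywhere); the function names option_1 and option_2 are hypothetical. Note that this version of option 1 starts from Games and left-joins GameReviews, so a game without reviews still comes back as one row.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL schema in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Games (ID INTEGER PRIMARY KEY, Name TEXT, Description TEXT);
    CREATE TABLE GameReviews (
        ID INTEGER PRIMARY KEY,
        GameID INTEGER REFERENCES Games(ID),
        Review TEXT
    );
    INSERT INTO Games VALUES (1, 'Game 1', 'A game description'),
                             (2, 'Game 2', 'And another');
    INSERT INTO GameReviews VALUES (1, 1, 'Review for game 1'),
                                   (2, 1, 'Another review for game 1'),
                                   (3, 1, 'And another');
""")


def option_1(game_id):
    """One round trip: the game columns are repeated on every review row."""
    return conn.execute(
        """SELECT g.ID, g.Name, g.Description, r.ID, r.Review
           FROM Games AS g
           LEFT JOIN GameReviews AS r ON r.GameID = g.ID
           WHERE g.ID = ?""",
        (game_id,),
    ).fetchall()


def option_2(game_id):
    """Two round trips: the game row once, then its reviews."""
    game = conn.execute(
        "SELECT ID, Name, Description FROM Games WHERE ID = ?", (game_id,)
    ).fetchone()
    reviews = conn.execute(
        "SELECT ID, Review FROM GameReviews WHERE GameID = ?", (game_id,)
    ).fetchall()
    return game, reviews
```

Calling option_1(1) returns three rows that each repeat the game's columns, while option_2(1) returns the game tuple once plus a list of three reviews. Which shape is cheaper depends on row width and network latency, so measuring on real data is the only reliable guide.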
Related
I hope that Stack Overflow is the correct place to ask this. I feel a bit on the fence, but I didn't find that it fit better on another Stack Exchange site.
So, the question is pretty much about "best practice" or design in MySQL. I don't see this done a lot in tutorials and resources, which is why I am a bit afraid that it is not a good way to do it, so I thought I'd try to get some feedback.
I tried to make a layout as an example (thanks for commenting)
https://www.db-fiddle.com/f/rBRUhX3DYiTgGyBPSgQfCm/2
I have a layout similar to this:
table: player
+----+------+------+
| id | name | data |
+----+------+------+
| 1 | foo | bar |
| 2 | test | test |
+----+------+------+
Then I have tables to pick specific information
table: user_external_name
+----+----------+
| id | nickname |
+----+----------+
| 1 | baz |
| 2 | qux |
+----+----------+
And I have a third table containing matches between players, something like:
table: matches
+---------+--------+--------+
| matchid | homeid | awayid |
+---------+--------+--------+
| 0 | 1 | 2 |
+---------+--------+--------+
And then I might do queries like this on matches:
SELECT
(SELECT nickname from user_external_name WHERE id = matches.homeid) as home,
(SELECT nickname from user_external_name WHERE id = matches.awayid) as away
FROM matches;
I also realized that I can use joins to write the query, and that way I get rid of the multiple selects. I am still not sure why the design is dumb, but I figured out that what I need to read about is pretty much relational databases. I will leave my original above for reference in case someone else comes stumbling down this road.
SELECT
h.nickname home,
a.nickname away
FROM `matches` as m
join user_external_name as h on h.id = m.homeid
join user_external_name as a on a.id = m.awayid;
resulting in:
+------+------+
| home | away |
+------+------+
| baz | qux |
+------+------+
So the actual question
Is this a reasonable way of doing it, or is it dumb in some way? One of my main arguments is that this way I can reuse the id to get the specific information by id in other tables (i.e. I never have to copy the actual name). Could you point me to a better way of doing this, or to some resources/suggestions on how to think in this situation?
Thanks for taking the time to read through and hopefully I can learn something good. :)
We have been developing the system at my place of work for some time now and I feel the database design is getting somewhat out of hand.
For example we have a table widgets (I'm spoofing these somewhat):
+-----------------------+
| Widget |
+-----------------------+
| Id | Name | Price |
| 1 | Sprocket | 100 |
| 2 | Dynamo | 50 |
+-----------------------+
*There's about 40+ columns on this table already
We want to add a property to each widget for packaging information. We need to know whether it has packaging information, whether it doesn't, or whether we don't know either way. We then also need to store the type of packaging details (assuming it has any; if it doesn't, this is redundant info).
We already have another table which stores the details information (I personally think this table should be divided up but that's another issue).
PD = PackageDetails
+--------------------------------+
| System Properties |
+--------------------------------+
| Id | Type | Value |
| 28 | PD | Boxed |
| 29 | PD | Vacuum Sealed |
+--------------------------------+
*There's thousands of rows in the table for all system wide table properties
Instinctively I would create a number of mapping tables to capture this information. I have however been instructed to just add another column onto each table to avoid doing a join.
My solution:
Create tables:
+---------------------------------------------------+
| widgets_packaging |
+---------------------------------------------------+
| Id | widget_id | packing_info | packing_detail_id |
| 1 | 27 | PACKAGED | 2 |
| 2 | 28 | UNKNOWN | NULL |
+---------------------------------------------------+
+--------------------+
| packaging |
+--------------------+
| Id | |
| 1 | Boxed |
| 2 | Vacuum Sealed |
+--------------------+
If I want to know what packaging a widget has I join through to widgets_packaging and join again to packaging if I want to know the exact details. Therefore no more columns on the widgets table.
I have however been told to ignore this and instead put an int value on the widget table for the packing information, plus another column as a foreign key to the System Properties table to find the packaging details. That means adding another two columns to the table and creating yet more rows in the System Properties table to store package details.
+------------------------------------------------------------+
| Widget |
+------------------------------------------------------------+
| Id | Name |Price | has_packaging | packaging_details |
| 1 | Sprocket |100 | 1 | 28 |
| 2 | Dynamo |50 | 0 | 29 |
+------------------------------------------------------------+
The reason given is that it's simpler and doesn't involve a join if you only want to know whether the widget has packaging (there are lots of widgets). They are concerned that more joins will slow things down.
Which is the more correct solution here, and are their concerns about speed legitimate? My gut instinct is that we can't just keep adding columns to the widgets table, as it keeps growing and growing with flags for properties at present.
The answer to this really depends on whether the application(s) using this database are read or write intensive. If it's read intensive, the de-normalized structure is a better approach because you can make use of indexes. Selects are faster with fewer joins, too.
However, if your application is write intensive, normalization is a better approach (the structure you're suggesting is a more normalized approach). Tables tend to be smaller, which means they have a better chance of fitting into the buffer. Also, normalization tends to lead to less duplication of data, which means updates and inserts only need to be done in one place.
To sum it up:
Write Intensive --> normalization
  - smaller tables have a better chance of fitting into the buffer
  - less duplicated data, which means quicker updates / inserts
Read Intensive --> de-normalization
  - better structure for indexes
  - fewer joins means better performance
If your application is not heavily weighted toward reads over writes, then a more mixed approach would be better.
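To make the trade-off concrete, here is a minimal sketch of the normalized layout from the question and the two joins it costs to read a widget's packaging. It runs on SQLite through Python's sqlite3 module as a stand-in for the real database; the column name `name` on the packaging table and the widget ids 1 and 2 are assumptions made for illustration, since the question leaves that header blank and uses spoofed ids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE widget (id INTEGER PRIMARY KEY, name TEXT, price INTEGER);
    CREATE TABLE packaging (id INTEGER PRIMARY KEY, name TEXT);  -- 'name' is assumed
    CREATE TABLE widgets_packaging (
        id INTEGER PRIMARY KEY,
        widget_id INTEGER NOT NULL UNIQUE REFERENCES widget(id),
        packing_info TEXT NOT NULL,
        packing_detail_id INTEGER REFERENCES packaging(id)
    );
    INSERT INTO widget VALUES (1, 'Sprocket', 100), (2, 'Dynamo', 50);
    INSERT INTO packaging VALUES (1, 'Boxed'), (2, 'Vacuum Sealed');
    INSERT INTO widgets_packaging VALUES (1, 1, 'PACKAGED', 2),
                                         (2, 2, 'UNKNOWN', NULL);
""")

# Two joins on primary/unique keys; the widget table itself gains no columns.
rows = conn.execute("""
    SELECT w.name, wp.packing_info, p.name
    FROM widget AS w
    LEFT JOIN widgets_packaging AS wp ON wp.widget_id = w.id
    LEFT JOIN packaging AS p ON p.id = wp.packing_detail_id
    ORDER BY w.id
""").fetchall()
# rows == [('Sprocket', 'PACKAGED', 'Vacuum Sealed'), ('Dynamo', 'UNKNOWN', None)]
```

Both joins here go through keys, which is the case databases generally handle well; whether that is fast enough for a given read load is something to measure rather than assume.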
Referring to the table in the link below, I have a category table and another table named "package" that stores a category id.
http://ftp.nchu.edu.tw/MySQL/tech-resources/articles/hierarchical-data.html
Category
+-------------+----------------------+--------+
| category_id | name | parent |
+-------------+----------------------+--------+
| 1 | ELECTRONICS | NULL |
| 2 | TELEVISIONS | 1 |
| 3 | TUBE | 2 |
| 4 | LCD | 2 |
| 5 | PLASMA | 2 |
| 6 | PORTABLE ELECTRONICS | 1 |
| 7 | MP3 PLAYERS | 6 |
| 8 | FLASH | 7 |
| 9 | CD PLAYERS | 6 |
| 10 | 2 WAY RADIOS | 6 |
+-------------+----------------------+--------+
Is there any way I can keep left joining until there is no parent left, without knowing in advance how many times I have to join?
And a second question: my "package" table only stores the last/smallest category id, for example "7 - FLASH". Is it good practice to store only the last/smallest category id and resolve the rest by joining back to the category table? Will querying it back every time put a heavy load on the database?
Thanks in advance!
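It is not possible to do such queries in MySQL before version 8.0. MySQL 8.0 and later support recursive common table expressions (WITH RECURSIVE), which can follow the parent column to any depth.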
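On servers that do support recursive common table expressions (MySQL 8.0 and later), the parent chain can be followed in a single statement without knowing its depth. The sketch below is an illustration under assumptions: it runs against SQLite through Python's sqlite3 module as a stand-in, and the CTE name `chain` and its `depth` column are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE category (category_id INTEGER PRIMARY KEY, name TEXT, parent INTEGER)"
)
conn.executemany(
    "INSERT INTO category VALUES (?, ?, ?)",
    [
        (1, "ELECTRONICS", None), (2, "TELEVISIONS", 1), (3, "TUBE", 2),
        (4, "LCD", 2), (5, "PLASMA", 2), (6, "PORTABLE ELECTRONICS", 1),
        (7, "MP3 PLAYERS", 6), (8, "FLASH", 7), (9, "CD PLAYERS", 6),
        (10, "2 WAY RADIOS", 6),
    ],
)

# Start at one category and keep joining to its parent until parent is NULL.
ANCESTORS_SQL = """
WITH RECURSIVE chain (category_id, name, parent, depth) AS (
    SELECT category_id, name, parent, 0
    FROM category
    WHERE category_id = ?
    UNION ALL
    SELECT c.category_id, c.name, c.parent, chain.depth + 1
    FROM category AS c
    JOIN chain ON c.category_id = chain.parent
)
SELECT name FROM chain ORDER BY depth
"""

path = [name for (name,) in conn.execute(ANCESTORS_SQL, (8,))]
# path == ['FLASH', 'MP3 PLAYERS', 'PORTABLE ELECTRONICS', 'ELECTRONICS']
```

With this available, storing only the most specific category id in the package table is enough to recover the full path on demand.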
If you need to keep this database structure, then the fastest approach is likely to select the relevant data from the table and then process it client-side into the appropriate array/tree.
The above may not work well if you cannot sufficiently narrow down the number of rows to SELECT out, in which case, recursively running multiple queries may be faster. On your second query, the best approach is to do something like WHERE id IN (list_of_parent_values) rather than 1 query per parent.
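The client-side approach can be sketched as follows. The rows are written out literally here, as they might come back from a single `SELECT category_id, name, parent FROM category`; the function name `ancestors` is hypothetical.

```python
# Rows as (category_id, name, parent), e.g. the result of one SELECT over the table.
rows = [
    (1, "ELECTRONICS", None), (2, "TELEVISIONS", 1), (3, "TUBE", 2),
    (4, "LCD", 2), (5, "PLASMA", 2), (6, "PORTABLE ELECTRONICS", 1),
    (7, "MP3 PLAYERS", 6), (8, "FLASH", 7), (9, "CD PLAYERS", 6),
    (10, "2 WAY RADIOS", 6),
]


def ancestors(rows, category_id):
    """Walk parent links in memory, from the given category up to the root."""
    by_id = {cid: (name, parent) for cid, name, parent in rows}
    path = []
    seen = set()  # guards against accidental cycles in the data
    current = category_id
    while current is not None and current not in seen:
        seen.add(current)
        name, parent = by_id[current]
        path.append(name)
        current = parent
    return path


print(ancestors(rows, 8))
# ['FLASH', 'MP3 PLAYERS', 'PORTABLE ELECTRONICS', 'ELECTRONICS']
```

This costs one query regardless of depth, at the price of loading every selected row into memory, which is why it only works well when the row set can be narrowed down.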
Lastly if you can change your data structure, there is a way of using special tree column values to efficiently select all of the nodes out with a single SQL query. Some more work is required to insert and re-organise the tree however.
There are a number of slightly differing implementations of this, see here for one such discussion:
http://web.archive.org/web/20110606032941/http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
awesome_nested_set is also a ruby implementation of this pattern:
https://github.com/collectiveidea/awesome_nested_set
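As a rough illustration of the "special tree column values" idea (the nested set pattern), each node stores a left and a right number, and a node's ancestors are exactly the rows whose interval contains it. The sketch below is an assumption-laden example: it runs on SQLite through Python's sqlite3 module as a stand-in for MySQL, and the table name `nested_category`, the column names `lft`/`rgt` and the specific numbering are one valid choice for this tree, not something taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nested_category ("
    "category_id INTEGER PRIMARY KEY, name TEXT, lft INTEGER, rgt INTEGER)"
)
# Each child's (lft, rgt) interval sits strictly inside its parent's interval.
conn.executemany(
    "INSERT INTO nested_category VALUES (?, ?, ?, ?)",
    [
        (1, "ELECTRONICS", 1, 20), (2, "TELEVISIONS", 2, 9), (3, "TUBE", 3, 4),
        (4, "LCD", 5, 6), (5, "PLASMA", 7, 8), (6, "PORTABLE ELECTRONICS", 10, 19),
        (7, "MP3 PLAYERS", 11, 14), (8, "FLASH", 12, 13), (9, "CD PLAYERS", 15, 16),
        (10, "2 WAY RADIOS", 17, 18),
    ],
)

# One non-recursive query returns the whole path from the root down to the node.
PATH_SQL = """
SELECT parent.name
FROM nested_category AS node
JOIN nested_category AS parent
  ON node.lft BETWEEN parent.lft AND parent.rgt
WHERE node.category_id = ?
ORDER BY parent.lft
"""

path = [name for (name,) in conn.execute(PATH_SQL, (8,))]
# path == ['ELECTRONICS', 'PORTABLE ELECTRONICS', 'MP3 PLAYERS', 'FLASH']
```

Reads are a single self-join, but inserting or moving a node means renumbering part of the tree, which is the extra work mentioned above.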
Just after some opinions on the best way to achieve the following outcome:
I would like to store in my MySQL database products which can be voted on by users (each vote is worth +1). I also want to be able to see how many times in total a user has voted.
To my simple mind, the following table structure would be ideal:
table: product table: user table: user_product_vote
+----+-------------+ +----+-------------+ +----+------------+---------+
| id | product | | id | username | | id | product_id | user_id |
+----+-------------+ +----+-------------+ +----+------------+---------+
| 1 | bananas | | 1 | matthew | | 1 | 1 | 2 |
| 2 | apples | | 2 | mark | | 2 | 2 | 2 |
| .. | .. | | .. | .. | | .. | .. | .. |
This way I can do a COUNT of the user_product_vote table for each product or user.
For example, when I want to look up bananas and the number of votes to show on a web page I could perform the following query:
SELECT p.product AS product, COUNT( v.id ) as votes
FROM product p
LEFT JOIN user_product_vote v ON p.id = v.product_id
WHERE p.id =1
GROUP BY p.id, p.product
If my site became hugely successful (we can all dream) and I had thousands of users voting on thousands of products, I fear that performing such a COUNT with every page view would be highly inefficient in terms of server resources.
A more simple approach would be to have a 'votes' column in the product table that is incremented each time a vote is added.
table: product
+----+-------------+-------+
| id | product | votes |
+----+-------------+-------+
| 1 | bananas | 2 |
| 2 | apples | 5 |
| .. | .. | .. |
While this is more resource friendly - I lose data (eg. I can no longer prevent a person from voting twice as there is no record of their voting activity).
My questions are:
i) am I being overly worried about server resources and should just stick with the three table option? (ie. do I need to have more faith in the ability of the database to handle large queries)
ii) is there a more efficient way of achieving the outcome without losing information?
You can never be too worried about resources. When you first start building an application you should always keep resources, space, speed, etc. in mind; if your site's traffic grows dramatically and you never built for it, you start running into problems.
As for the vote system, personally I would keep the votes like so:
table: product table: user table: user_product_vote
+----+-------------+ +----+-------------+ +----+------------+---------+
| id | product | | id | username | | id | product_id | user_id |
+----+-------------+ +----+-------------+ +----+------------+---------+
| 1 | bananas | | 1 | matthew | | 1 | 1 | 2 |
| 2 | apples | | 2 | mark | | 2 | 2 | 2 |
| .. | .. | | .. | .. | | .. | .. | .. |
Reasons:
Firstly, user_product_vote does not contain text, blobs, etc.; it's purely integers, so it takes up fewer resources anyway.
Secondly, it opens the door to new features within your application, such as total votes in the last 24 hours, highest-rated product over the past 24 hours, etc.
Take this example for instance:
table: user_product_vote
+----+------------+---------+-----------+------+
| id | product_id | user_id | vote_type | time |
+----+------------+---------+-----------+------+
| 1 | 1 | 2 | product |224.. |
| 2 | 2 | 2 | page |218.. |
| .. | .. | .. | .. | .. |
And a simple query:
SELECT COUNT(id) as total FROM user_product_vote WHERE vote_type = 'product' AND time BETWEEN(....) ORDER BY time DESC LIMIT 20
Another thing: if a user voted at 1AM and then tried to vote again at 2PM, you can easily check when they last voted and whether they should be allowed to vote again.
There are so many opportunities that you will be missing if you stick with your incremental example.
In regards to your count(), no matter how much you optimize your queries it would not really make a difference on a large scale.
With an extremely large user base, your resource usage will be looked at from a different perspective, such as load balancers, server settings, Apache, caching, etc.; there's only so much you can do with your queries.
"If my site became hugely successful (we can all dream) and I had thousands of users voting on thousands of products, I fear that performing such a COUNT with every page view would be highly inefficient in terms of server resources."
Don't waste your time solving imaginary problems. MySQL is perfectly able to process thousands of records in a fraction of a second - this is what databases are for. A clean and simple database and code structure is far more important than the mythical "optimization" that no one needs.
Why not mix and match both? Simply keep the final counts in the product and user tables, so that you don't have to count every time, and keep the votes table, so that there is no double voting.
Edit:
To explain it a bit further: the product and user tables will each have a column called "votes". Every time an insert into user_product_vote succeeds, increment the relevant user and product records. This avoids duplicate votes, and you won't have to run the count query every time either.
Edit:
Also, I am assuming that you have created a unique index on product_id and user_id; in this case any duplicate attempt will automatically fail and you won't have to check the table before inserting. You will just need to make sure the insert query ran and that you got a valid value for the "id" in the form of insert_id.
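A minimal sketch of this mixed approach, under some assumptions: it runs on SQLite through Python's sqlite3 module as a stand-in for MySQL, the function name `cast_vote` is hypothetical, and the duplicate is detected by catching the constraint error rather than by inspecting insert_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (id INTEGER PRIMARY KEY, product TEXT,
                          votes INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE user (id INTEGER PRIMARY KEY, username TEXT,
                       votes INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE user_product_vote (
        id INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES product(id),
        user_id INTEGER NOT NULL REFERENCES user(id),
        UNIQUE (product_id, user_id)   -- one vote per user per product
    );
    INSERT INTO product (id, product) VALUES (1, 'bananas'), (2, 'apples');
    INSERT INTO user (id, username) VALUES (1, 'matthew'), (2, 'mark');
""")


def cast_vote(product_id, user_id):
    """Record a vote and bump both counters; return False on a duplicate."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute(
                "INSERT INTO user_product_vote (product_id, user_id) VALUES (?, ?)",
                (product_id, user_id),
            )
            conn.execute(
                "UPDATE product SET votes = votes + 1 WHERE id = ?", (product_id,)
            )
            conn.execute(
                "UPDATE user SET votes = votes + 1 WHERE id = ?", (user_id,)
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index rejected the insert, so the counters were not touched.
        return False
```

Keeping the insert and both increments in one transaction means the cached counters cannot drift from the vote rows if any step fails.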
You have to balance the desire for your site to perform quickly (in which the second schema would be best) and the ability to count votes for specific users and prevent double voting (for which I would choose the first schema). Because you are only using integer columns for the user_product_vote table, I don't see how performance could suffer too much. Many-to-many relationships are common, as you have implemented with user_product_vote. If you do want to count votes for specific users and prevent double voting, a user_product_vote is the only clean way I can think of implementing it, as any other could result in sparse records, duplicate records, and all kinds of bad things.
You don't want to update the product table directly with an aggregate every time someone votes - this will lock product rows which will then affect other queries which are using products.
Assuming that not all product queries need to include the votes column, you could keep a separate productvotes table which would retain the running totals, and keep your userproductvote table as a means to enforce your user voting per product business rules / and auditing.
I'm trying to build a MySQL query that uses the rows in a lookup table as the columns in my result set.
LookupTable
id | AnalysisString
1 | color
2 | size
3 | weight
4 | speed
ScoreTable
id | lookupID | score | customerID
1 | 1 | A | 1
2 | 2 | C | 1
3 | 4 | B | 1
4 | 2 | A | 2
5 | 3 | A | 2
6 | 1 | A | 3
7 | 2 | F | 3
I'd like a query that would use the relevant lookupTable rows as columns in a query so that I can get a result like this:
customerID | color | size | weight | speed
1 | A | C | | B
2 | | A | A |
3 | A | F | |
The kicker of the problem is that there may be additional rows added to the LookupTable and the query should be dynamic and not have the Lookup IDs hardcoded. That is, this will work:
SELECT st.customerID,
(SELECT st1.score FROM ScoreTable st1 WHERE lookupID=1 AND st.customerID = st1.customerID) AS color,
(SELECT st1.score FROM ScoreTable st1 WHERE lookupID=2 AND st.customerID = st1.customerID) AS size,
(SELECT st1.score FROM ScoreTable st1 WHERE lookupID=3 AND st.customerID = st1.customerID) AS weight,
(SELECT st1.score FROM ScoreTable st1 WHERE lookupID=4 AND st.customerID = st1.customerID) AS speed
FROM ScoreTable st
GROUP BY st.customerID
Until there is a fifth row added to the LookupTable . . .
Perhaps I'm breaking the whole relational model and will have to resolve this in the backend PHP code?
Thanks for pointers/guidance.
tom
You have architected an EAV database. Prepare for a lot of pain when it comes to maintainability, efficiency and correctness. "This is one of the design anomalies in data modeling." (http://decipherinfosys.wordpress.com/2007/01/29/name-value-pair-design/)
The best solution would be to redesign the database into something more normal.
What you are trying to do is generally referred to as a cross-tabulation, or cross-tab, query. Some DBMSs support cross-tabs directly, but MySQL isn't one of them, AFAIK (there's a blog entry here depicting the arduous process of simulating the effect).
Three options come to mind for dealing with this:
1. Don't cross-tab at all. Instead, sort the output by row id, then AnalysisString, and generate the tabular output in your programming language.
2. Generate code on-the-fly in your programming language to emit the appropriate query.
3. Follow the blog I mention above to implement a server-side solution.
Also consider @Marek's answer, which suggests that you might be better off restructuring your schema. The advice is not a given, however. Sometimes, a key-value model is appropriate for the problem at hand.
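The "generate the query on the fly" option might look roughly like the sketch below. It is an illustration under assumptions: it runs on SQLite through Python's sqlite3 module as a stand-in for MySQL, the helper name `build_pivot_sql` is hypothetical, and it swaps the correlated subqueries from the question for conditional aggregation (MAX over a CASE expression), which is a different but common way to pivot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LookupTable (id INTEGER PRIMARY KEY, AnalysisString TEXT);
    CREATE TABLE ScoreTable (id INTEGER PRIMARY KEY, lookupID INTEGER,
                             score TEXT, customerID INTEGER);
    INSERT INTO LookupTable VALUES (1, 'color'), (2, 'size'),
                                   (3, 'weight'), (4, 'speed');
    INSERT INTO ScoreTable VALUES (1, 1, 'A', 1), (2, 2, 'C', 1), (3, 4, 'B', 1),
                                  (4, 2, 'A', 2), (5, 3, 'A', 2),
                                  (6, 1, 'A', 3), (7, 2, 'F', 3);
""")


def build_pivot_sql(conn):
    """Build one pivot column per LookupTable row, so new rows need no code change."""
    lookups = conn.execute(
        "SELECT id, AnalysisString FROM LookupTable ORDER BY id"
    ).fetchall()
    select_list = ["customerID"]
    for lookup_id, name in lookups:
        # Double-quoted alias with quotes escaped; MySQL would normally use backticks.
        alias = '"{}"'.format(name.replace('"', '""'))
        select_list.append(
            "MAX(CASE WHEN lookupID = {:d} THEN score END) AS {}".format(lookup_id, alias)
        )
    return (
        "SELECT " + ", ".join(select_list)
        + " FROM ScoreTable GROUP BY customerID ORDER BY customerID"
    )


rows = conn.execute(build_pivot_sql(conn)).fetchall()
# rows == [(1, 'A', 'C', None, 'B'), (2, None, 'A', 'A', None), (3, 'A', 'F', None, None)]
```

Because the column list is read from LookupTable each time, adding a fifth lookup row simply adds a fifth score column to the result.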