I have a snowflake diagram with:
Fact:
id_movie
id_user
rating
Dim Users:
id_user
...
Dim Movies:
id_movie
...
In my ERD, I also have a table Category, that has a many to many relationship with the movies like this:
Dim_Category:
id_category
...
Map_Category_Movie:
id_movie
id_category
relevance
I am trying to find an efficient way to model this in a snowflake/star schema. My issues:
I could just add these two tables into the snowflake diagram, but this would feel wrong as I usually only use tables that are aggregates of the subtables on the outer fringes of this diagram.
I could create another fact table for the relevance, but as I want to ultimately report on the correlation of relevance of users to their behaviour in rating in movie, I'd need to use both fact tables, which to me is an incorrect approach.
Any guidance here?
There is a good chance that you have already answered this for yourself; welcome to hell.
First, a quotation from http://www.information-management.com/ that may interest you:
The snowflake structure will reduce batch updates to dimensions. Though always said to be slower than a star, some tests have revealed no difference in performance between flattened and snowflaked dimensions. In fact in some cases, the snowflake provides superior performance, such as when a wide dimension (i.e., customer) is segmented into a snowflake.
So using a bridge table is not going to cause a significant loss of performance. I prefer a snowflake in a good percentage of cases, because it is often easier to manage the data mart that way, and modern hardware and data sizes give you the opportunity to do it.
My friendly advice is to create a bridge table (movie_ID, category_ID, relevance) and go on.
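A minimal sketch of that bridge, assuming INT surrogate keys, a DECIMAL relevance score, and the names fact_ratings / dim_movies / dim_category for the tables sketched in the question (adjust to your actual names and types):

CREATE TABLE map_category_movie (
    id_movie     INT NOT NULL,
    id_category  INT NOT NULL,
    relevance    DECIMAL(5,2),            -- strength of the movie/category link
    PRIMARY KEY (id_movie, id_category),  -- one row per movie/category pair
    FOREIGN KEY (id_movie)    REFERENCES dim_movies (id_movie),
    FOREIGN KEY (id_category) REFERENCES dim_category (id_category)
);

-- Reporting across the bridge: ratings side by side with category relevance,
-- the correlation-style report the question asks about.
SELECT mc.id_category,
       AVG(f.rating)     AS avg_rating,
       AVG(mc.relevance) AS avg_relevance
FROM fact_ratings f
JOIN map_category_movie mc ON mc.id_movie = f.id_movie
GROUP BY mc.id_category;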
If you have a small, fixed list of categories, create a table with predefined category columns:
dim_movies
----------
movies_id
category1_relevance
category2_relevance
category3_relevance
Up to about ten categories is perhaps OK, especially if you work for the company whose data warehouse you are building rather than just consulting on it (so you can administer the changes yourself).
Once, we tried to create a masterpiece of a data warehouse, which included an example similar to yours. The payment deal was based on performance (the data was over 2 TB per fact table), so we decided to give the star schema a shot.
We created a dimension like the one described above, and every time the number of distinct categories grew, the ETL added a new field to the table.
The ETL process also had to dynamically recreate the cube.
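In MySQL terms, that widening step would look roughly like this (a sketch; the dim_movies layout is from the answer above, while the map_category_movie source table and the categoryN_relevance naming convention are assumptions):

-- When the ETL detects a new distinct category, widen the dimension...
ALTER TABLE dim_movies ADD COLUMN category4_relevance DECIMAL(5,2) NULL;

-- ...and backfill the new column from the bridge/source data:
UPDATE dim_movies d
JOIN map_category_movie mc
  ON mc.id_movie = d.movies_id
 AND mc.id_category = 4
SET d.category4_relevance = mc.relevance;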
It took a lot of pain, but performance was, as I remember, about 13% better than the snowflake.
Also, during the most exhausting project, where I believe a 10-year-old kid would have designed the DB better, we had to connect exactly 5 categories per item. Each category pointed to one of 20+ possible tables, and they could be joined ONLY through their software, based on some rules. It was some kind of 1..5:many relationship (which doesn't exist!?!)
pk   code_conto   cat1   cat2   cat3   cat4   cat5
---------------------------------------------------
1    123          17     NULL   5467   12     NULL
2    124          67     1098   NULL   1423   AK12
3    123          NULL   NULL   NULL   13     23
The code was like this:
If (code_conto == 123)
{
Category1_join_set = 'SELECT cat_id, cat_name FROM cat_customers'; //NOTE THIS
Category2_join_set = 'SELECT cat_id, cat_name FROM cat_products';
Category3_join_set = 'SELECT cat_id, cat_name FROM cat_city';
...
...
}
If (code_conto == 124)
{
Category1_join_set = 'SELECT cat_id, cat_name FROM cat_products'; //AND THIS
Category2_join_set = 'SELECT cat_id, cat_name FROM cat_origin'; //ON SAME FIELD
Category3_join_set = 'SELECT cat_id, cat_name FROM cat_blabla'; //DIFFERENT JOIN TABLE
...
...
}
All hard-coded. So we hard-coded our queries, with WHEN repeated over 100 times in a CASE statement. Guess what? The ERP provider 'improved' their software and created a mapping table that held the 'C' if-statements, keyed on code_conto.
It took us more than 3 weeks to build a good and secure ETL job (with SQL and external tools).
I didn't write all this for nothing. I wanted to convince you and others that using a bridge table for many-to-many relationships is the best practice in probably 97% of cases.
However, there are five possible design solutions to the M:M relationship:
Array or series (I don't even want to try it)
Bridge table
Groupings
Fixed levels
Dynamically created fixed levels
Hope I didn't confuse you.
I am trying to normalise my MySQL 5.7 data schema and am struggling with replacing the SQL queries.
At the moment there is one table containing all attributes of each article:
article_id | title | ref_id | dial_c_id
The task is to retrieve all articles which match two given attributes (ref_id and dial_c_id) and also retrieve all their other attributes.
With just one table, this is straightforward:
SELECT *
FROM test.articles_test
WHERE
ref_id = '127712'
AND dial_c_id = 51
Now, in my effort to normalise, I have created a second table which stores the attributes of each article, and removed those columns from the articles table:
table 1:
article_id | title
table 2:
article_id | attr_group | attribute
1          | ref_id     | 51
1          | dial_c_id  | 33
1          | another    | 5
2          | ..
I would like to retrieve all article details, including ALL attributes, for the articles that match ref_id and dial_c_id, using this two-table schema.
Somehow like this:
SELECT
a.article_id,
a.title,
attr.*
FROM test.articles_test a
INNER JOIN attributes attr ON a.article_id = attr.article_id
AND ref_id = '127712'
AND dial_c_id = 51
How can this be done?
You have used an Entity-Attribute-Value table to record your attributes.
This is the opposite of normalization.
Name the rule of normalization that guided you to put different attributes into the same column. You can't, because this is not a normalization practice.
To accomplish your query with your current EAV design, you need to pivot the result so you get something as if you had your original table.
SELECT * FROM (
SELECT
a.article_id,
a.title,
MAX(CASE attr_group WHEN 'ref_id' THEN attribute END) AS ref_id,
MAX(CASE attr_group WHEN 'dial_c_id' THEN attribute END) AS dial_c_id
-- ...others...
FROM test.articles_test a
INNER JOIN attributes attr ON a.article_id = attr.article_id
GROUP BY a.article_id, a.title) AS pivot
WHERE pivot.ref_id = '127712'
AND pivot.dial_c_id = 51
While the above query can produce the result you want, the performance will be terrible. It has to create a temp table for the subquery, containing all data from both tables, then apply the WHERE clause against the temp table.
You're really better off with each attribute in its own column in your original table.
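In other words, something like this (a sketch reusing the names from the question; the types are assumptions):

CREATE TABLE articles (
    article_id INT PRIMARY KEY,
    title      VARCHAR(200),
    ref_id     VARCHAR(20),
    dial_c_id  INT,
    -- ...one real column per attribute...
    INDEX idx_ref_dial (ref_id, dial_c_id)  -- lets the original WHERE clause use an index
);

With that, the simple single-table query at the top of the question works again, and the index satisfies the WHERE clause without building a temp table.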
I understand that you are trying to allow for many attributes in the future. This is a common problem.
See my answer to
How to design a product table for many kinds of product where each product has many parameters
But you shouldn't call it "normalised," because it isn't. It's not even denormalised. It's derelational.
You can't just use words to describe anything you want — especially not the opposite of what the word means. I can't let the air out of my bicycle tire and say "I'm inflating it."
You commented that you're trying to make your database "scalable." You also misunderstand what the word "scalable" means. By using EAV, you're creating a structure where the queries needed are difficult to write and inefficient to execute, and the data takes 10x the space. It's the opposite of scalable.
What you mean is that you're trying to create a system that is extensible. This is complex to implement in SQL, but I describe several solutions in the other Stack Overflow answer to which I linked. You might also like my presentation Extensible Data Modeling with MySQL.
My team is working on a PHP/MySQL website for a school project. I have a table of users with typical information (ID, first name, last name, etc.). I also have a table of questions with sample data like below. For this simplified example, all the answers to the questions are numerical.
Table Questions:
qid | questionText
1 | 'favorite number'
2 | 'gpa'
3 | 'number of years doing ...'
etc.
Users will have the ability to fill out a form to answer any or all of these questions. Note: users are not required to answer all of the questions, and the questions themselves are subject to change in the future.
The answer table looks like this:
Table Answers:
uid | qid | value
37 | 1 | 42
37 | 2 | 3.5
38 | 2 | 3.6
etc.
Now, I am working on the search page for the site. I would like the user to select what criteria they want to search on. I have something working, but I'm not sure it is efficient at all or if it will scale (not that these tables will ever be huge - like I said, it is a school project). For example, I might want to list all users whose favorite number is between 100 and 200 and whose GPA is above 2.0. Currently, I have a query builder that works (it creates a valid query that returns accurate results - as far as I can tell). A result of the query builder for this example would look like this:
SELECT u.ID, u.name (etc)
FROM User u
JOIN Answer a1 ON u.ID=a1.uid
JOIN Answer a2 ON u.ID=a2.uid
WHERE 1
AND (a1.qid=1 AND a1.value>100 AND a1.value<200)
AND (a2.qid=2 AND a2.value>2.0)
I add the WHERE 1 so that in the for loops, I can just append " AND (...)". I realize I could drop the '1' and just use implode(and, array), adding the WHERE only if the array is not empty, but I figured this is equivalent. If not, I can change that easily enough.
As you can see, I add a JOIN for every criteria the searcher asks for. This also allows me to order by a1.value ASC, or a2.value, etc.
First question:
Is this table organization at least somewhat decent? We figured that since the number of questions is variable, and not every user answers every question, that something like this would be necessary.
Main question:
Is the query way too inefficient? I imagine that it is not ideal to join the same table to itself up to maybe a dozen or two times (if we end up putting that many questions in). I did some searching and found these two posts which seem to kind of touch on what I'm looking for:
Mutiple criteria in 1 query
This uses multiple nested (correct term?) queries in EXISTS
Search for products with multiple criteria
One of the comments by youssef azari mentions using 'query 1' UNION 'query 2'
Would either of these perform better/make more sense for what I'm trying to do?
Bonus question:
I left this out above for simplicity's sake, but I actually have 3 tables (for number-valued questions, booleans, and text).
The decision to have separate tables was because (as far as I could think of) it would either be that or have one big answers table with 3 value columns of different types, 2 of which would always be empty.
This works with my current query builder - an example query would be
SELECT u.ID,...
FROM User u
JOIN AnswerBool b1 ON u.ID=b1.uid
JOIN AnswerNum n1 ON u.ID=n1.uid
JOIN AnswerText t1 ON u.ID=t1.uid
WHERE 1
AND (b1.qid=1 AND b1.value=true)
AND (n1.qid=16 AND n1.value<999)
AND (t1.qid=23 AND t1.value LIKE '...')
With that in mind, what is the best way to get my results?
One final piece of context:
I mentioned this is for a school project. While this is true, the eventual goal (it is an undergrad senior design project) is to have a department use our site for students creating teams for their senior design. For a rough estimate of size, every semester the department would have somewhere around 200 or so students use our site to form teams. Obviously, when we're done, the department will (hopefully) check our site for security issues and other stuff they need to worry about (what with FERPA and all). We are trying to take into account all common security practices and scalability concerns, but in the end, our code may be improved by others.
UPDATE
As per nnichols' suggestion, I put in a decent amount of data and ran some tests on different queries. I put around 250 users in the table, and about 2000 answers in each of the 3 tables. I found the links provided very informative
(links removed because I can't hyperlink more than twice yet) Links are in nnichols' response
as well as this one that I found:
http://phpmaster.com/using-explain-to-write-better-mysql-queries/
I tried 3 different types of queries, and in the end, the one I proposed worked the best.
First: using EXISTS
SELECT u.ID,...
FROM User u WHERE 1
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=# AND value>#) -- or any condition on value
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=another # AND some_condition(value))
AND EXISTS
(SELECT * FROM AnswerText
...
I used 10 conditions on each of the 3 answer tables (resulting in 30 EXISTS)
Second: using IN - a very similar (maybe even identical?) approach, which yields the same results
SELECT u.ID,...
FROM User u WHERE 1
AND (u.ID) IN (SELECT uid FROM AnswerNumber WHERE qid=# AND ...)
...
again with 30 subqueries.
The third one I tried was the same as described above (using 30 JOINs)
The results of using EXPLAIN on the first two were as follows: (identical)
The primary query on table u had a type of ALL (bad, though the users table is not huge), and the number of rows searched was roughly twice the size of the user table (not sure why). Every other row in the EXPLAIN output was a dependent query on the relevant answer table, with a type of eq_ref (good), using WHERE and key=PRIMARY KEY and searching only 1 row. Overall, not bad.
For the query I suggested (JOINing):
The primary query was actually on whatever table you joined first (in my case AnswerBool), with a type of ref (better than ALL). The number of rows searched was equal to the number of distinct questions answered by anyone (as in, 50 distinct questions have been answered by someone), which will be much less than the number of users. Each additional row in the EXPLAIN output was a SIMPLE query with type eq_ref (good), using WHERE and key=PRIMARY KEY and searching only 1 row. Overall almost the same, but with a smaller starting multiplier.
One final advantage of the JOIN method: it was the only one where I could figure out how to order by various values (such as n1.value). Since the other two queries use subqueries, I could not access the value of a specific subquery. Adding the ORDER BY clause did change the Extra field in the first query to also include 'using temporary' (required, I believe, for ORDER BYs) and 'using filesort' (not sure how to avoid that). However, even with those slowdowns, the number of rows is still much lower, and the other two (as far as I could get) cannot use ORDER BY.
You could answer most of these questions yourself with a suitably large test dataset and the use of EXPLAIN and/or the profiler.
Your INNER JOINs will almost certainly perform better than switching to EXISTS but again this is easy to test with a suitable test dataset and EXPLAIN.
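For example, with the schema from the question, a composite index on each answer table lets every joined copy resolve its criterion from the index (a sketch; the index name is made up, and AnswerNum is the numeric answer table from the question):

ALTER TABLE AnswerNum ADD INDEX idx_qid_value_uid (qid, value, uid);

-- Compare plans before and after:
EXPLAIN
SELECT u.ID
FROM User u
JOIN AnswerNum n1 ON u.ID = n1.uid AND n1.qid = 1  AND n1.value BETWEEN 100 AND 200
JOIN AnswerNum n2 ON u.ID = n2.uid AND n2.qid = 16 AND n2.value < 999;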
I need some advice about MySQL.
I have a user table with id, nickname, numDVD, money, and a table DVD with idDVD, idUser, LinkPath, counter.
Now, I believe I could have at most 20 users, and each user has about 30 DVDs.
So when I insert a DVD I should have idDVD (auto-increment), idUser (the same idUser as in the User table), LinkPath (a generic string), and counter, which is a unique number from 1 to 30 (depending on the number of DVDs) for each user.
The problem is handling the last column, "counter", because I would like to select, for example, 2 or 3 random DVDs from 1 to 30 that have the same UserId.
So I was wondering whether this is the best solution in my case, even though it is hard to handle (I have never used MySQL), OR whether it would be better to create 20 tables (1 for each user) containing the ID, DVDname, etc.
Thanks
Don't create 20 tables! That'd be way overkill, and what if you needed to add more users in the future? It'd be practically impossible to maintain and update reliably.
A better way would be like:
Table users
-> idUser
-> other user specific data
Table dvd
-> idDvd
-> DVDname
-> LinkPath
-> other dvd specific data (no user data here)
Table usersDvds
-> idUser
-> idDvd
This way, it's no problem if one or more users has the same DVD, as it's just another entry in the usersDvds table - the idDvd value would be the same, but idUser would be different. And to count how many DVDs a user has, just do a SELECT count(*) FROM usersDvds WHERE idUser = 1
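For the "random DVDs for one user" part of the question, something like this works at these volumes (ORDER BY RAND() is fine for small tables, though it does not scale to large ones):

SELECT d.idDvd, d.DVDname
FROM usersDvds ud
JOIN dvd d ON d.idDvd = ud.idDvd
WHERE ud.idUser = 1
ORDER BY RAND()
LIMIT 3;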
You don't need a table per user, and doing so would make the subsequent SQL programming basically impossible. However, with these data volumes, practically nothing you do is going to cause or relieve bottlenecks. Very probably the entire database will fit into memory, so access via any schema will be practically instantaneous.
If I understand your requirements correctly, you should be able to accomplish this by creating a compound index that lets you select efficiently.
If too much data ever accumulates in that table, it would also help to clear out some historical data.
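For instance, keeping the single DVD table from the question, a compound index like this is the kind of thing meant (a sketch; the index name is made up):

ALTER TABLE DVD ADD INDEX idx_user_counter (idUser, counter);

-- e.g. fetch two particular counters for one user:
SELECT idDVD, LinkPath
FROM DVD
WHERE idUser = 1 AND counter IN (7, 23);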
I am sure this is a basic question, but I am new to SQL. For my user profile, I want to display location = "Hollywood, CA - USA" if a user lives in Hollywood. So I assume the user table will have a column like current_city holding an ID, say 1232, which is a FK to the city table where the city_name for this PK is Hollywood. I then connect to the state and country tables to find the names CA and USA, since the city lookup table only stores the IDs (like CA = 21 and USA = 345).
Is this the best way to design the tables? Or, I was thinking, should I add 2 columns like city_id and city_name to the user table, and also add country_id, country_name, state_id, state_name to the city table? This way I save trips to the parent tables just to fetch the names for the IDs.
This is only a sample use case, but I have lots of lookup ID tables, so I will apply the same principle to all tables once I know how to do it best. My requirements are scalability and performance, so whatever works best for those is what I would like.
The first way you described is almost always better.
Having both the city_id and city_name (or any pair of that kind) in the users table is not best practice since it may cause data discrepancies - a wrong update may result in a city_id that does not match the city_name and then the system behavior becomes unexpected.
As said, your first suggestion would be the common and usually the best way to do this. If the table keys are designed properly, so that all SELECT statements can use them efficiently, this will also give the best performance.
For example, having just the city_name in the users table would make it a little quicker to find and show the city for one user, but for other queries - like finding all users in city X - it would make much less sense.
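To make that concrete, the display string from the question comes out of a single three-join lookup (a sketch; the exact table and column names are assumptions):

SELECT CONCAT(ci.city_name, ', ', st.state_code, ' - ', co.country_code) AS location
FROM users u
JOIN cities    ci ON ci.id = u.current_city
JOIN states    st ON st.id = ci.state_id
JOIN countries co ON co.id = st.country_id
WHERE u.id = 42;

With primary keys on those id columns, the extra "trips" are cheap index lookups, not something to design around.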
You can find a nice series of articles for beginners about DB normalization here:
http://databases.about.com/od/specificproducts/a/2nf.htm. This article has an example which is very much like what you are trying to achieve, and the related articles will probably help you design many other tables in your DB.
Good luck!
I'm trying to select some data from a MySQL database.
I have a table containing business details, and a separate one containing a list of trades. Since we have multiple trade types, there are several trade tables:
business_details
id | business_name | trade_id | package_id
1 | Happy News | 12 | 1
This is the main table, contains the business name, the trade ID and the package ID
shop_trades
id | trade
1 | newsagents
This contains the trade type of the business
configuration_packages
id | name_of_trade_table
1 | shop_trades
2 | leisure_trades
This contains the name of the trade table to look in
So, basically, if I want to find the trade type (e.g., newsagent, fast food, etc) I look in the XXXX_trades table. But I first need to look up the name of XXXX from the configuration_packages table.
What I would normally do is 2 SQL queries:
SELECT business_details.*, configuration_packages.name_of_trade_table
FROM business_details, configuration_packages
WHERE business_details.package_id = configuration_packages.id
AND business_details.id = '1'
That gives me the name of the database table to look in for the trade name, so I look up the name of the table
SELECT trade FROM XXXX WHERE id='YYYY'
Where XXXX is the name of the table returned as part of the first query and YYYY is the id of the package, again returned from the first query.
Is there a way to combine these two queries so that I only run one?
I've used subqueries before, but only on the SELECT side of the query - not the FROM side.
Typically, this is handled by a union in a single query.
Normalization gets you to a logical model. This helps better understand the data. It is common to denormalize when implementing the model. Subtypes as you have here are commonly implemented in two ways:
Separate tables, as you have, which makes retrieval difficult. This results in your question about how to retrieve the data.
A common table for all subtypes with a subtype indicator. This may result in columns which are always null for certain subtypes. It simplifies data access, and may alter the way that the subtypes are handled in code.
If the extra columns for a subtype are relatively rarely accessed, then you may use a hybrid implementation where the common columns are in the type table, and some or all of the subtype columns are in a subtype table. This is more complex to code.
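A sketch of the second option, the common table with a subtype indicator (the trades table name and the ENUM values are assumptions):

CREATE TABLE trades (
    id         INT PRIMARY KEY,
    trade_type ENUM('shop', 'leisure'),  -- the subtype indicator
    trade      VARCHAR(100)
);

-- Retrieval then needs no table-name lookup at all:
SELECT bd.business_name, t.trade
FROM business_details bd
JOIN trades t ON t.id = bd.trade_id
WHERE bd.id = 1;

The trade-off, as noted above, is columns that stay NULL for subtypes that don't use them.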
That's not possible, and it sounds like a problem with your model.
Why don't you put shop_trades and leisure_trades into the same table, with one column to distinguish between the two?
If this is possible, try this
SELECT trade
FROM (SELECT `TABLE_NAME` FROM `INFORMATION_SCHEMA`.`TABLES`
      WHERE `TABLE_SCHEMA` = '*schema name*') AS t
WHERE id = 'YYYY'
UPDATE:
I think the code I have above is not possible. :|
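For completeness: the only way MySQL itself can select from a table whose name comes out of another query is dynamic SQL with a prepared statement (a sketch; it assumes the table name was first fetched into @trade_table from configuration_packages):

SET @sql = CONCAT('SELECT trade FROM ', @trade_table, ' WHERE id = ?');
PREPARE stmt FROM @sql;
SET @trade_id = 1;
EXECUTE stmt USING @trade_id;
DEALLOCATE PREPARE stmt;

That is still two logical steps, so most people would either run the two queries from application code or restructure the trade tables as the other answers suggest.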