I am working with a few tables in a database, and I'm wondering what the best way to query them is.
Here's the setup:
EVENTS:
int event_id
varchar event_name
date event_date
ATTENDANCE:
int attendance_id
int event_id (foreign key for EVENTS)
int user_id (foreign key for USERS)
int status
USERS:
int user_id
varchar first_name
varchar last_name
varchar email
What I want to do is take an event ID, query the attendance table for all records matching that event, and then query the users table for every user referenced by those attendance records.
The first thought that came to mind was to query the database for all attendance entries for the event, get an array, and then loop through each record to query the user information. However, this seems pretty inefficient, and there must be a better way with joins or something similar. I don't have much experience with joins, so I was wondering if I could get some help.
This is the pseudo code of what I was originally thinking:
SELECT * FROM attendance WHERE event_id = eventID
while (row exists):
    SELECT * FROM users WHERE user_id = attendanceUserID
    get info, export in xml, etc.
I don't think this is the best way to do this, so what would be the better way to do it?
The question is "join or loop?" and the technical answer is: use a join. What you are describing is exactly what joins are meant to do: combine tables on conditions.
"select from ... (select more)" isn't the way to go about it. Instead, ask yourself: "what do these two tables have that connects them in a totally reliable and identifiable way?" Here that is user_id, which ATTENDANCE holds as a foreign key to USERS. The note in the above comment is spot on.
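As a rough sketch of what that join could look like, here is a small Python script using the standard sqlite3 module with an in-memory database. The sample rows and the event ID 7 are made up for illustration; the table and column names come from the question. The same SELECT should carry over to MySQL with little change.

```python
import sqlite3

# In-memory database with the USERS and ATTENDANCE tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT, email TEXT
    );
    CREATE TABLE attendance (
        attendance_id INTEGER PRIMARY KEY,
        event_id INTEGER,
        user_id INTEGER REFERENCES users(user_id),
        status INTEGER
    );
    -- Made-up sample data.
    INSERT INTO users VALUES
        (1, 'Ann', 'Lee', 'ann@example.com'),
        (2, 'Bob', 'Ray', 'bob@example.com'),
        (3, 'Cy', 'Orr', 'cy@example.com');
    INSERT INTO attendance VALUES
        (10, 7, 1, 1),
        (11, 7, 3, 0),
        (12, 8, 2, 1);
""")

event_id = 7  # hypothetical event to report on

# One query replaces the loop: each attendance row is matched to its user.
attendees = conn.execute(
    """
    SELECT u.user_id, u.first_name, u.last_name, u.email, a.status
    FROM attendance AS a
    JOIN users AS u ON u.user_id = a.user_id
    WHERE a.event_id = ?
    ORDER BY u.user_id
    """,
    (event_id,),
).fetchall()

for row in attendees:
    print(row)
```

Each returned row already carries the user's details plus the attendance status, so the XML export can loop over one result set without issuing further queries.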
However, speaking as an "old man", the question isn't quite that straightforward. Everything is a question of time, yours and the machine's. So ask yourself this: imagine method A is 100 times more efficient than method B. But you already know how to do method B. AND... the difference is 0.02 milliseconds versus 2 milliseconds. (Let's just say you have a small data set and a fast machine.) If you can code up method B in three minutes and get on with your project, that might be good, especially if there's a deadline. Everything is easier with a working example to start from, even if it is implemented in a different way: it gives you something to test the NEW method that you're just learning against. Lots of people chase efficiency before they know whether it would even matter.
Get things working first, make them faster second. (Of course, don't paint yourself into a horrible design corner. You haven't, though: the database is fine, and you're just looking at different ways of getting info out of it.)
Related
I have a table that has four columns: the id of a material, the family it belongs to (there is some kind of grouping), the type of test, and the result of that test as a number (1-4). The type of test points to a table with a list of tests. I am trying to create a query (or a view, if appropriate) that gives me, for a family of materials, one column per test, with each cell showing the average of that test across all the materials in the family...
Since it's for a small project and I know the tests (I don't think they are going to change, but I'm not sure), I am thinking that I could just add the tests (there are 6 of them) as columns in the table instead of having them as data, but that doesn't feel right to me. Also, there is a chance that a test will be added in the future. At the same time, though, adding a column would not be hard, and I could change the averaging code to disregard a specific placeholder value, so that I could tell apart rows entered before a test was added.
So how would I go about doing it, and is the way I am doing it a good idea?
For now, my plan is to write a select statement for each family/test pair and then somehow create a view (is that a virtual table?) from the results of those queries.
So if the table is
test_result
family_id | material_id | test_id | result
The query would be
SELECT AVG(result) AS 'TEST'
FROM test_result
WHERE family_id = 'family_id'
AND test_id = 'test_id';
But I am not sure how to proceed, or whether there is a better way than doing this 6 times and somehow combining the results.
I am not sure how much you know about database design.
The question you are asking, whether to have six test columns in one table or a separate table with family_id and test_id as the primary key (unique identifier), is a fundamental one about database design. It has to do with first normal form. You can study up on first normal form, and on data normalization generally, if you choose to.
Here is an oversimplified version, for this case.
There are two big problems with the six columns in one table approach.
The first is this: what happens when they change their minds and add a seventh test? If this never happens, everything is OK. But if it does, you have to alter the table by adding another column, and you have to alter any queries that reference the table. If that's only one query in your case, you can manage it. In cases where hundreds of queries may reference the table, and some of those are in application programs that may require a maintenance cycle to revise the query, this can be a nightmare. That is why database tutorials are full of material that you may not need to learn if this small project is the only one you ever do.
The second is this: what happens when you have to write a query that finds every occurrence of test_id = 4, regardless of which of the six columns the value is stored in? You are going to have to write a query with five OR operators in the WHERE clause. This is tedious, error-prone, and slow to run. Again, this may never be a problem.
The generally better approach is to create a third table with family_id and test_id as columns, and maybe result as a third column (I'm not sure what material_id is... is there a material table?)
The first table, families, has the family_id and any data that only depends on the family, like family_name.
The second table, tests, has the test_id and any data that only depends on the test, like test_name.
And the third table contains data that depends on both.
You then write a view that joins all three tables together, to make the data look the way you want to use it.
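To make the three-table layout a bit more concrete, here is a minimal sketch in Python with the standard sqlite3 module. The table names families and tests follow the answer, test_result follows the question, and the view name family_test_avg, the family and test names, and all sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Data that depends only on the family.
    CREATE TABLE families (family_id INTEGER PRIMARY KEY, family_name TEXT);
    -- Data that depends only on the test.
    CREATE TABLE tests (test_id INTEGER PRIMARY KEY, test_name TEXT);
    -- Data that depends on both (plus the material that was tested).
    CREATE TABLE test_result (
        family_id INTEGER REFERENCES families(family_id),
        material_id INTEGER,
        test_id INTEGER REFERENCES tests(test_id),
        result INTEGER,
        PRIMARY KEY (material_id, test_id)
    );
    -- Hypothetical view: one row per family/test pair with the average result.
    CREATE VIEW family_test_avg AS
        SELECT f.family_name, t.test_name, AVG(r.result) AS avg_result
        FROM test_result AS r
        JOIN families AS f ON f.family_id = r.family_id
        JOIN tests AS t ON t.test_id = r.test_id
        GROUP BY f.family_id, f.family_name, t.test_id, t.test_name;

    -- Made-up sample data.
    INSERT INTO families VALUES (1, 'metals'), (2, 'plastics');
    INSERT INTO tests VALUES (1, 'hardness'), (2, 'flex');
    INSERT INTO test_result VALUES
        (1, 100, 1, 4), (1, 101, 1, 2), (1, 100, 2, 3),
        (2, 200, 1, 1);
""")

averages = conn.execute(
    "SELECT family_name, test_name, avg_result FROM family_test_avg "
    "ORDER BY family_name, test_name"
).fetchall()

for row in averages:
    print(row)
```

With this shape, a seventh test is just another row in tests, and the view keeps working unchanged. Turning the rows into one column per test can then be done in the application or with conditional aggregation, whichever fits the project better.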
I apologize if this covers a lot of concepts you already know. Again, I couldn't tell from your question.
On the project I'm working on we have an activity table and each activity can be linked to one of about 20 different "activity details" tables...
e.g. If the activity was of type "work", then it would have a corresponding activity_details_work record, if it was of type "sick leave" then it would have a corresponding activity_details_sickleave record and so on.
Currently we are loading the activities and then for each activity we have a separate query to go fetch the activity details from the relevant table. This obviously doesn't scale well if you have thousands of activities.
So my initial thought was to have a single query which fetches the activities and joins the details in one go e.g.
SELECT * FROM activity
LEFT JOIN activity_details_1_work ON ...
LEFT JOIN activity_details_2_sickleave ON ...
LEFT JOIN activity_details_3_travelwork ON ...
...etc...
LEFT JOIN activity_details_20_yearleave ON ...
But this will result in each record having 100's of fields, most of which are empty and that feels nasty.
Lazy-loading the details isn't really an option either as the details are almost always requested in the core logic, at least for the main types anyway.
Is there a super clever way of doing this that I'm not thinking of?
Thanks in advance
My suggestion is to define a view for each ActivityType, that is tailored specifically to that activity.
Then add an index on the Activity table led by the ActivityType field. Cluster said index unless there is an overwhelming need for some other index to be clustered (or performance benchmarking shows some other clustering selection to be more performant).
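A rough sketch of the view-per-type idea, using Python's sqlite3 module. Only two of the detail tables are shown, and the column names (activity_type, started_on, project, doctor_note), the view names, and the sample rows are all hypothetical. Clustering is engine-specific and is not shown here; the index below is an ordinary one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity (
        activity_id INTEGER PRIMARY KEY,
        activity_type TEXT,
        started_on TEXT
    );
    CREATE TABLE activity_details_work (
        activity_id INTEGER REFERENCES activity(activity_id),
        project TEXT
    );
    CREATE TABLE activity_details_sickleave (
        activity_id INTEGER REFERENCES activity(activity_id),
        doctor_note INTEGER
    );
    -- Index led by the type column, as suggested above.
    CREATE INDEX idx_activity_type ON activity (activity_type, activity_id);

    -- One view per activity type, exposing only that type's fields.
    CREATE VIEW v_activity_work AS
        SELECT a.activity_id, a.started_on, d.project
        FROM activity AS a
        JOIN activity_details_work AS d ON d.activity_id = a.activity_id
        WHERE a.activity_type = 'work';
    CREATE VIEW v_activity_sickleave AS
        SELECT a.activity_id, a.started_on, d.doctor_note
        FROM activity AS a
        JOIN activity_details_sickleave AS d ON d.activity_id = a.activity_id
        WHERE a.activity_type = 'sick leave';

    -- Made-up sample data.
    INSERT INTO activity VALUES
        (1, 'work', '2024-01-02'),
        (2, 'sick leave', '2024-01-03'),
        (3, 'work', '2024-01-04');
    INSERT INTO activity_details_work VALUES (1, 'alpha'), (3, 'beta');
    INSERT INTO activity_details_sickleave VALUES (2, 1);
""")

work_rows = conn.execute(
    "SELECT * FROM v_activity_work ORDER BY activity_id"
).fetchall()
sick_rows = conn.execute(
    "SELECT * FROM v_activity_sickleave ORDER BY activity_id"
).fetchall()
print(work_rows)
print(sick_rows)
```

This costs one query per activity type rather than one per activity, and each result set has only the columns that make sense for that type.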
Is there a particular reason why this degree of denormalization was designed in? Is that reason well known?
Chances are your activity tables are like (date_from, date_to, with_who, descr) or something to that effect. As Pieter suggested, consider tossing in a type varchar or enum field in there, so as to deal with a single details table.
If there are rational reasons to keep the tables apart, consider adding triggers that maintain boolean/tinyint fields (has_work, has_sickleave, etc), or a bit string (has_activites_of_type where the first position amounts to has_work, the next to has_sickleave, etc.).
Either way, you'll probably be better off by fetching the activity's details in one or more separate queries -- if only to avoid field name collisions.
I don't think enum is the way to go, because as you say there might be 1000's of activities, then altering your activity table would become an issue.
There is no point doing a left join on a large number of tables either.
So the options that you have are:
See this. The first comment might be useful.
I am guessing that your activity table has a field called activity_type_id.
Build a table called activity_types containing the fields activity_type_id, activity_name, and activity_details_table_name. First, query in the following way:
SELECT activity.*, activity_types.activity_details_table_name
FROM activity
INNER JOIN activity_types
USING (activity_type_id)
This query gives you the table name on which to query for the details.
This way you can add any new activity type just by adding a row in the activity_types table.
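Here is one way the two-step lookup might be wired up, sketched in Python with sqlite3. The detail columns (project, days) and the sample rows are invented, and batching the second step per details table, rather than per activity, is an assumption on my part that goes slightly beyond what is described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE activity_types (
        activity_type_id INTEGER PRIMARY KEY,
        activity_name TEXT,
        activity_details_table_name TEXT
    );
    CREATE TABLE activity (
        activity_id INTEGER PRIMARY KEY,
        activity_type_id INTEGER REFERENCES activity_types(activity_type_id)
    );
    -- Two hypothetical details tables with made-up columns.
    CREATE TABLE activity_details_work (activity_id INTEGER, project TEXT);
    CREATE TABLE activity_details_sickleave (activity_id INTEGER, days INTEGER);

    INSERT INTO activity_types VALUES
        (1, 'work', 'activity_details_work'),
        (2, 'sick leave', 'activity_details_sickleave');
    INSERT INTO activity VALUES (1, 1), (2, 2), (3, 1);
    INSERT INTO activity_details_work VALUES (1, 'alpha'), (3, 'beta');
    INSERT INTO activity_details_sickleave VALUES (2, 3);
""")

# Step 1: one query maps every activity to the name of its details table.
rows = conn.execute("""
    SELECT a.activity_id, t.activity_details_table_name
    FROM activity AS a
    INNER JOIN activity_types AS t USING (activity_type_id)
""").fetchall()

ids_by_table = {}
for activity_id, table_name in rows:
    ids_by_table.setdefault(table_name, []).append(activity_id)

# Step 2: one query per details table (not one per activity).
details = {}
for table_name, ids in ids_by_table.items():
    placeholders = ", ".join("?" for _ in ids)
    # Table names cannot be bound as parameters. Here they come from our own
    # activity_types table, never from user input.
    query = f'SELECT * FROM "{table_name}" WHERE activity_id IN ({placeholders})'
    for row in conn.execute(query, ids):
        details[row[0]] = row[1:]

print(details)
```

Because the table name is spliced into the SQL text, it should only ever come from a trusted source such as the activity_types table itself.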
I am developing a forum in PHP and MySQL. I want to make my forum as efficient as I can.
I have made these two tables
tbl_threads
tbl_comments
Now, the problem is that there is a Like and a Dislike button under each comment. I have to store the user_name of whoever clicked the Like or Dislike button, together with the comment_id. I have made a column user_likes and a column user_dislikes in tbl_comments to store comma-separated user_names. But on this forum, I have read that this is not an efficient way. I have been advised to create a third table to store the likes and dislikes, so that my database design complies with 1NF.
But the problem is, if I make a third table tbl_user_opinion with two fields like this
1. comment_id
2. type (like or dislike)
will I have to run as many SQL queries as there are comments on my page to get the like and dislike data for each comment? Won't that be inefficient? I think there is some confusion on my part here. Can someone clarify this?
You have a Relational Scheme like this:
There are two ways to solve this. The first one, the "clean" one, is to build your "like" table and do COUNT(*)s on the appropriate column.
The second one would be to store a counter in each comment, indicating how many ups and downs there have been.
If you want to check whether a specific user has voted on a comment, you only have to check one entry, which you can easily handle as its own query, merging the two results outside of your database (for this, use a query that returns the comment_id and the kind of vote the user has cast for the comments in a specific thread).
Your approach with a comma-separated list does not perform well, because you cannot query it without extra logic or a huge amount of string parsing. If you have a database, use it!
("One Information - One Dataset"!)
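To show that the third table does not force one query per comment, here is a small sketch in Python with sqlite3. The table names follow the question; the thread_id and body columns, the user names, and the sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_comments (
        comment_id INTEGER PRIMARY KEY,
        thread_id INTEGER,
        body TEXT
    );
    -- One row per (comment, user) vote instead of a comma-separated list.
    CREATE TABLE tbl_user_opinion (
        comment_id INTEGER REFERENCES tbl_comments(comment_id),
        user_name TEXT,
        type TEXT CHECK (type IN ('like', 'dislike')),
        PRIMARY KEY (comment_id, user_name)
    );
    -- Made-up sample data.
    INSERT INTO tbl_comments VALUES
        (1, 5, 'first'), (2, 5, 'second'), (3, 6, 'other');
    INSERT INTO tbl_user_opinion VALUES
        (1, 'ann', 'like'), (1, 'bob', 'like'), (1, 'cy', 'dislike'),
        (3, 'ann', 'dislike');
""")

thread_id = 5  # hypothetical thread being rendered

# Query 1: like/dislike counts for every comment in the thread at once.
counts = conn.execute("""
    SELECT c.comment_id,
           COUNT(CASE WHEN o.type = 'like' THEN 1 END) AS likes,
           COUNT(CASE WHEN o.type = 'dislike' THEN 1 END) AS dislikes
    FROM tbl_comments AS c
    LEFT JOIN tbl_user_opinion AS o ON o.comment_id = c.comment_id
    WHERE c.thread_id = ?
    GROUP BY c.comment_id
    ORDER BY c.comment_id
""", (thread_id,)).fetchall()

# Query 2: what the current user voted on in this thread.
user_votes = dict(conn.execute("""
    SELECT o.comment_id, o.type
    FROM tbl_user_opinion AS o
    JOIN tbl_comments AS c ON c.comment_id = o.comment_id
    WHERE c.thread_id = ? AND o.user_name = ?
""", (thread_id, "ann")).fetchall())

print(counts)
print(user_votes)
```

That is two queries per page regardless of how many comments the thread has, and the two results can be merged in PHP by comment_id.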
The comma-separated list violates the principle of atomicity, and therefore 1NF. You'll have a hard time maintaining referential integrity and, for the most part, querying as well.
Here is one way to do it in a normalized fashion:
This is very clustering-friendly: it groups up-votes belonging to the same comment physically close together (ditto for down-votes), making the following query rather efficient:
SELECT
COMMENT.COMMENT_ID,
<other COMMENT fields>,
COUNT(DISTINCT UP_VOTE.USER_ID) - COUNT(DISTINCT DOWN_VOTE.USER_ID) SCORE
FROM COMMENT
LEFT JOIN UP_VOTE
ON COMMENT.COMMENT_ID = UP_VOTE.COMMENT_ID
LEFT JOIN DOWN_VOTE
ON COMMENT.COMMENT_ID = DOWN_VOTE.COMMENT_ID
WHERE
COMMENT.COMMENT_ID = <whatever>
GROUP BY
COMMENT.COMMENT_ID,
<other COMMENT fields>;
Please measure on realistic amounts of data whether that works fast enough for you. If not, then denormalize the model and cache the total score in the COMMENT table, keeping it current through triggers every time a row is inserted into or deleted from the *_VOTE tables.
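If the cached-score route turns out to be needed, the triggers could look roughly like this. The sketch uses SQLite trigger syntax through Python's sqlite3 module (MySQL's syntax differs slightly), and the score column and trigger names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comment (
        comment_id INTEGER PRIMARY KEY,
        score INTEGER NOT NULL DEFAULT 0   -- hypothetical cached total
    );
    CREATE TABLE up_vote (
        comment_id INTEGER, user_id INTEGER,
        PRIMARY KEY (comment_id, user_id)
    );
    CREATE TABLE down_vote (
        comment_id INTEGER, user_id INTEGER,
        PRIMARY KEY (comment_id, user_id)
    );

    -- Each trigger fires once per affected row and adjusts the cache.
    CREATE TRIGGER up_vote_ins AFTER INSERT ON up_vote
    BEGIN
        UPDATE comment SET score = score + 1 WHERE comment_id = NEW.comment_id;
    END;
    CREATE TRIGGER up_vote_del AFTER DELETE ON up_vote
    BEGIN
        UPDATE comment SET score = score - 1 WHERE comment_id = OLD.comment_id;
    END;
    CREATE TRIGGER down_vote_ins AFTER INSERT ON down_vote
    BEGIN
        UPDATE comment SET score = score - 1 WHERE comment_id = NEW.comment_id;
    END;
    CREATE TRIGGER down_vote_del AFTER DELETE ON down_vote
    BEGIN
        UPDATE comment SET score = score + 1 WHERE comment_id = OLD.comment_id;
    END;

    -- Made-up activity: three up-votes, one down-vote, one up-vote withdrawn.
    INSERT INTO comment (comment_id) VALUES (1);
    INSERT INTO up_vote VALUES (1, 10), (1, 11), (1, 12);
    INSERT INTO down_vote VALUES (1, 20);
    DELETE FROM up_vote WHERE comment_id = 1 AND user_id = 12;
""")

cached_score = conn.execute(
    "SELECT score FROM comment WHERE comment_id = 1"
).fetchone()[0]
print(cached_score)
```

Reading the score is then a single-row lookup on COMMENT, at the price of a little extra work on every vote.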
If you also need to get which comments a particular user voted on, you'll need indexes on *_VOTE {USER_ID, COMMENT_ID}, i.e. the reverse of the primary/clustering key above. [1]
[1] This is one of the reasons why I didn't go with just one VOTE table containing an additional field that can be either 1 (for up-vote) or -1 (for down-vote): it's less efficient to cover with secondary indexes.
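For reference, here is a minimal sketch of the two vote tables, the reverse indexes, and the score query, written against SQLite through Python's sqlite3 module. The primary keys on (comment_id, user_id) are my reading of the "primary/clustering key" mentioned above; the body column, index names, and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comment (comment_id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE up_vote (
        comment_id INTEGER REFERENCES comment(comment_id),
        user_id INTEGER,
        PRIMARY KEY (comment_id, user_id)
    );
    CREATE TABLE down_vote (
        comment_id INTEGER REFERENCES comment(comment_id),
        user_id INTEGER,
        PRIMARY KEY (comment_id, user_id)
    );
    -- Reverse of the primary key, for "which comments did this user vote on".
    CREATE INDEX up_vote_by_user ON up_vote (user_id, comment_id);
    CREATE INDEX down_vote_by_user ON down_vote (user_id, comment_id);

    -- Made-up sample data.
    INSERT INTO comment VALUES (1, 'hello'), (2, 'world');
    INSERT INTO up_vote VALUES (1, 10), (1, 11), (1, 12);
    INSERT INTO down_vote VALUES (1, 20), (2, 10);
""")

# The score query from above, with the placeholders filled in.
score_row = conn.execute("""
    SELECT comment.comment_id, comment.body,
           COUNT(DISTINCT up_vote.user_id)
             - COUNT(DISTINCT down_vote.user_id) AS score
    FROM comment
    LEFT JOIN up_vote ON comment.comment_id = up_vote.comment_id
    LEFT JOIN down_vote ON comment.comment_id = down_vote.comment_id
    WHERE comment.comment_id = ?
    GROUP BY comment.comment_id, comment.body
""", (1,)).fetchone()

# Lookup served by the reverse index.
upvoted_by_user_10 = [
    r[0] for r in conn.execute(
        "SELECT comment_id FROM up_vote WHERE user_id = ? ORDER BY comment_id",
        (10,),
    )
]
print(score_row, upvoted_by_user_10)
```

The DISTINCT matters: joining both vote tables multiplies the rows (three up-votes times one down-vote here), and counting distinct user IDs undoes that.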
I want to save permissions for both individual users and user groups. In my MySQL database, I have a permissions table where I store a permission name and a permission id. I have a user table where I store a username, password, etc., along with an id, and I have a groups table which stores a group name and group id.
What would now be the most efficient option? To make 2 tables, one containing user permissions and one containing group permissions, so something like this:
int group id | int permission id
int user id | int permission id
or would it be better to have one table like this:
int id | int permission id | enum('user','group')
I would recommend using two tables.
First, you can reduce lookup time by checking the group permission table first and searching the user permission table only if no group permission is found.
Second, it will be easier for other programmers to understand.
I doubt there would be much of a performance difference between your two approaches; but, as with all performance and optimization questions, feelings and guesses don't matter, only profiled results matter. Which one will be more efficient depends on your data, your database, your access patterns, and what "efficient" means (space? time? developer effort? final monetary cost?) in your context.
That said, using two tables is a better structure, as it allows you to have foreign keys from your group-permission table back to your group table and from your user-permission table back to your user table. Even if one table were faster, I'd still go with two: data integrity is more important than saving a couple of µs of processor time, and I don't see much point in quickly accessing unreliable or broken data.
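A quick sketch of what those foreign keys buy you, using Python's sqlite3 module. The table names user_permissions, group_permissions, and user_groups are assumptions (I avoided a table called groups because GROUPS is a reserved word in newer MySQL versions), and the sample rows are made up. Note that SQLite only enforces foreign keys after the PRAGMA shown; MySQL's InnoDB enforces them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE permissions (
        permission_id INTEGER PRIMARY KEY, permission_name TEXT
    );
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE user_groups (group_id INTEGER PRIMARY KEY, group_name TEXT);

    -- Each link table can point at exactly one parent table.
    CREATE TABLE user_permissions (
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        permission_id INTEGER NOT NULL REFERENCES permissions(permission_id),
        PRIMARY KEY (user_id, permission_id)
    );
    CREATE TABLE group_permissions (
        group_id INTEGER NOT NULL REFERENCES user_groups(group_id),
        permission_id INTEGER NOT NULL REFERENCES permissions(permission_id),
        PRIMARY KEY (group_id, permission_id)
    );

    -- Made-up sample data.
    INSERT INTO permissions VALUES (1, 'edit_posts');
    INSERT INTO users VALUES (1, 'ann');
    INSERT INTO user_groups VALUES (1, 'moderators');
    INSERT INTO user_permissions VALUES (1, 1);
    INSERT INTO group_permissions VALUES (1, 1);
""")

# A row pointing at a group that does not exist is refused by the database.
try:
    conn.execute("INSERT INTO group_permissions VALUES (?, ?)", (999, 1))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

With the single-table layout and an enum('user','group') column, the id column could not carry a foreign key to either parent table, so a check like this would have to live in application code.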
The first way looks more time-efficient, the second way looks more space-efficient. For the speed of a unique index, in the 2nd case you'd have to index on the 1st and 3rd fields.
Mulling over it a little more, any potential gains of doing it the 2nd way aren't worth it, IMO. There may be some third way, but of the two you posted the first is superior. Simple, clean, and fast.
I need some help designing a friends system.
The MySQL table:
friends_list
- auto_id
- friend_id
- user_id
- approved_status
Option 1 = Every time a user adds a friend, 2 entries are added to the DB; we can then get their friends like this
SELECT user_id FROM `friends_list` WHERE friend_id='$userId' and approved_status='yes'
Option 2 = We add 1 entry for every friend added and then get the friend list like this
SELECT friend_id AS FriendId FROM `friends_list` WHERE user_id='$userId' and approved_status='yes'
UNION
SELECT user_id as FriendId FROM `friends_list` WHERE friend_id='$userId' and approved_status='yes'
For a friends feature like the ones on MySpace, Facebook, and similar sites, which of the 2 methods above would give the best performance?
The 1st method doubles the amount of rows; for example, a site with 2 million friend rows under the first method would have only 1 million rows under the second.
However, does the UNION method mean that 2 queries are being made, so 2 queries on a million-row table instead of 1?
UPDATE
I just ran some tests on 60,000 friends and here are the results, and yes, the tables are indexed.
Option 1 with 2 entries per friend;
.0007 seconds
Option 2 with 1 entry per friend using UNION to select
.3100 seconds
Option 1.
Do as much work as possible when adding a friend. Adding someone is very rare compared to selecting all friends, which is probably done every time you render a page (on a social site).
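A minimal sketch of Option 1 along those lines, in Python with sqlite3. The helper names add_friendship and friends_of, the index, and the unique constraint are assumptions, and the user IDs are made up. The queries use bound parameters instead of interpolating $userId into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends_list (
        auto_id INTEGER PRIMARY KEY AUTOINCREMENT,
        friend_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        approved_status TEXT NOT NULL DEFAULT 'no',
        UNIQUE (user_id, friend_id)
    );
    -- Index matching the WHERE clause of the read query.
    CREATE INDEX idx_friends_friend ON friends_list (friend_id, approved_status);
""")


def add_friendship(conn, user_a, user_b):
    """Insert both directions in one transaction (the work done at write time)."""
    with conn:  # commits on success, rolls back if either insert fails
        conn.executemany(
            "INSERT INTO friends_list (user_id, friend_id, approved_status) "
            "VALUES (?, ?, 'yes')",
            [(user_a, user_b), (user_b, user_a)],
        )


def friends_of(conn, user_id):
    """Single indexed SELECT, no UNION needed."""
    rows = conn.execute(
        "SELECT user_id FROM friends_list "
        "WHERE friend_id = ? AND approved_status = 'yes' ORDER BY user_id",
        (user_id,),
    )
    return [r[0] for r in rows]


add_friendship(conn, 1, 2)
add_friendship(conn, 1, 3)
print(friends_of(conn, 1), friends_of(conn, 2))
```

Wrapping the two inserts in one transaction keeps the pair of rows consistent, which is the main thing Option 1 has to guarantee.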
If you are indexing on user_id and friend_id, then the two statements should be equivalent in time. It's best to test, though, because databases can have surprising results. The database sees the whole UNION and can use that to optimize the query, though.
Is friendship always mutual and with the same approval status? If so, I'd opt for the second one. If it can be one-way or with separate approvals, then you need the first one, right?
Which DBMS are you using? MySQL always uses an extra temporary table when executing UNION queries. Creating these temporary tables adds some overhead to the query and is probably why your UNION query is slower than the first query.
Have you thought about separating friends and friend requests? This would reduce the content of the friends table, and you would also be able to delete accepted requests from the friend request table to keep its size down. Another advantage is that you keep fewer columns in each table, which makes it easier to get a finely tuned index on them.
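Roughly, the split could look like this, sketched in Python with sqlite3. The table names friend_requests and friends, their columns, and the accept_request helper are all hypothetical, and storing both directions on acceptance is an assumption borrowed from Option 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Pending requests only; rows leave this table once accepted.
    CREATE TABLE friend_requests (
        from_user INTEGER NOT NULL,
        to_user INTEGER NOT NULL,
        PRIMARY KEY (from_user, to_user)
    );
    -- Confirmed friendships only, so no approved_status column is needed.
    CREATE TABLE friends (
        user_id INTEGER NOT NULL,
        friend_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, friend_id)
    );
""")


def accept_request(conn, from_user, to_user):
    """Move a pending request into the friends table in one transaction."""
    with conn:
        deleted = conn.execute(
            "DELETE FROM friend_requests WHERE from_user = ? AND to_user = ?",
            (from_user, to_user),
        ).rowcount
        if deleted == 0:
            return False  # no such pending request
        conn.executemany(
            "INSERT INTO friends (user_id, friend_id) VALUES (?, ?)",
            [(from_user, to_user), (to_user, from_user)],
        )
    return True


with conn:
    conn.execute("INSERT INTO friend_requests VALUES (1, 2)")

first_accept = accept_request(conn, 1, 2)
second_accept = accept_request(conn, 1, 2)
pending = conn.execute("SELECT COUNT(*) FROM friend_requests").fetchone()[0]
friend_rows = conn.execute(
    "SELECT user_id, friend_id FROM friends ORDER BY user_id"
).fetchall()
print(first_accept, second_accept, pending, friend_rows)
```

The friend-list query then needs no status filter at all, and the primary key on (user_id, friend_id) already covers it.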
I am currently building this feature myself, and it would be interesting to hear about your experience on the matter.
Answers to this kind of question tend to depend upon the usage patterns. In your case, which is most common: adding new friendship relationships or querying the friendship relationship?
You also need to consider future flexibility: what other uses might there be for the data?
My guess is that option is a cleaner data model for what you are trying to represent. It looks like it allows for expansion in directions such as asymmetric relationships and friends of friends. I think that option will prove to be unwieldy in the future.