My team is working on a PHP/MySQL website for a school project. I have a table of users with typical information (ID, first name, last name, etc.). I also have a table of questions with sample data like below. For this simplified example, all the answers to the questions are numerical.
Table Questions:
qid | questionText
1 | 'favorite number'
2 | 'gpa'
3 | 'number of years doing ...'
etc.
Users will have the ability to fill out a form to answer any or all of these questions. Note: users are not required to answer all of the questions, and the questions themselves are subject to change in the future.
The answer table looks like this:
Table Answers:
uid | qid | value
37 | 1 | 42
37 | 2 | 3.5
38 | 2 | 3.6
etc.
Now, I am working on the search page for the site. I would like the user to select what criteria they want to search on. I have something working, but I'm not sure it is efficient at all or if it will scale (not that these tables will ever be huge - like I said, it is a school project). For example, I might want to list all users whose favorite number is between 100 and 200 and whose GPA is above 2.0. Currently, I have a query builder that works (it creates a valid query that returns accurate results - as far as I can tell). A result of the query builder for this example would look like this:
SELECT u.ID, u.name (etc)
FROM User u
JOIN Answer a1 ON u.ID=a1.uid
JOIN Answer a2 ON u.ID=a2.uid
WHERE 1
AND (a1.qid=1 AND a1.value>100 AND a1.value<200)
AND (a2.qid=2 AND a2.value>2.0)
I add the WHERE 1 so that in the for loops I can just append " AND (...)". I realize I could drop the '1', collect the conditions in an array, join them with implode(' AND ', $array), and add the WHERE only if the array is not empty, but I figured this is equivalent. If not, I can change that easily enough.
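For comparison, here is a minimal sketch of the "collect the conditions, then join them with AND" variant. It is written in Python against an in-memory SQLite database purely so that it runs on its own; the real site is PHP/MySQL, and the function name build_search_query, the criteria format, and the extra sample users (39 and 40) are assumptions made for the illustration. Values are passed as bound parameters rather than concatenated into the SQL, and operators come from a whitelist.

```python
import sqlite3

# Whitelist: comparison operators are never copied verbatim from user input.
ALLOWED_OPS = {"<", "<=", "=", ">=", ">"}


def build_search_query(criteria):
    """Return (sql, params) for criteria shaped like [(qid, [(op, value), ...]), ...]."""
    joins, conditions, params = [], [], []
    for i, (qid, tests) in enumerate(criteria, start=1):
        alias = f"a{i}"  # one alias, and therefore one JOIN, per question searched on
        joins.append(f"JOIN Answer {alias} ON u.ID = {alias}.uid")
        parts = [f"{alias}.qid = ?"]
        params.append(qid)
        for op, value in tests:
            if op not in ALLOWED_OPS:
                raise ValueError(f"unsupported operator: {op!r}")
            parts.append(f"{alias}.value {op} ?")
            params.append(value)
        conditions.append("(" + " AND ".join(parts) + ")")
    sql = "SELECT u.ID FROM User u " + " ".join(joins)
    if conditions:  # add WHERE only when there is something to filter on
        sql += " WHERE " + " AND ".join(conditions)
    return sql + " ORDER BY u.ID", params


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Answer (uid INTEGER, qid INTEGER, value REAL, PRIMARY KEY (uid, qid));
    INSERT INTO User VALUES (37, 'A'), (38, 'B'), (39, 'C'), (40, 'D');
    INSERT INTO Answer VALUES (37, 1, 42), (37, 2, 3.5), (38, 2, 3.6),
                              (39, 1, 150), (39, 2, 3.0), (40, 1, 150), (40, 2, 1.9);
""")

# Favorite number between 100 and 200, and GPA above 2.0.
sql, params = build_search_query([(1, [(">", 100), ("<", 200)]), (2, [(">", 2.0)])])
matching_ids = [row[0] for row in conn.execute(sql, params)]
print(sql)
print(matching_ids)
```

With no criteria the function simply omits the WHERE clause, so the WHERE 1 trick is not needed; the two approaches should otherwise produce equivalent queries.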
As you can see, I add a JOIN for every criterion the searcher asks for. This also allows me to order by a1.value ASC, or a2.value, etc.
First question:
Is this table organization at least somewhat decent? We figured that since the number of questions is variable, and not every user answers every question, that something like this would be necessary.
Main question:
Is the query way too inefficient? I imagine that it is not ideal to join the same table to itself up to maybe a dozen or two times (if we end up putting that many questions in). I did some searching and found these two posts which seem to kind of touch on what I'm looking for:
Mutiple criteria in 1 query
This uses multiple nested (correct term?) queries in EXISTS
Search for products with multiple criteria
One of the comments by youssef azari mentions using 'query 1' UNION 'query 2'
Would either of these perform better/make more sense for what I'm trying to do?
Bonus question:
I left out above for simplicity's sake, but I actually have 3 tables (for number valued questions, booleans, and text)
The decision to have separate tables was because (as far as I could think of) it would either be that or have one big answers table with 3 value columns of different types, having 2 always empty.
This works with my current query builder - an example query would be
SELECT u.ID,...
FROM User u
JOIN AnswerBool b1 ON u.ID=b1.uid
JOIN AnswerNum n1 ON u.ID=n1.uid
JOIN AnswerText t1 ON u.ID=t1.uid
WHERE 1
AND (b1.qid=1 AND b1.value=true)
AND (n1.qid=16 AND n1.value<999)
AND (t1.qid=23 AND t1.value LIKE '...')
With that in mind, what is the best way to get my results?
One final piece of context:
I mentioned this is for a school project. While this is true, the eventual goal (it is an undergrad senior design project) is to have a department use our site for students creating teams for their senior design. For a rough estimate of size: every semester, the department would have around 200 students use our site to form teams. Obviously, when we're done, the department will (hopefully) check our site for security issues and other things they need to worry about (what with FERPA and all). We are trying to take into account all common security practices and scalability concerns, but in the end, our code may be improved by others.
UPDATE
As per nnichols' suggestion, I put in a decent amount of data and ran some tests on different queries. I put around 250 users in the table, and about 2000 answers in each of the 3 tables. I found the links provided very informative
(links removed because I can't hyperlink more than twice yet) Links are in nnichols' response
as well as this one that I found:
http://phpmaster.com/using-explain-to-write-better-mysql-queries/
I tried 3 different types of queries, and in the end, the one I proposed worked the best.
First: using EXISTS
SELECT u.ID,...
FROM User u WHERE 1
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=# AND value>#) -- or any condition on value
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=another # AND some_condition(value))
AND EXISTS
(SELECT * FROM AnswerText
...
I used 10 conditions on each of the 3 answer tables (resulting in 30 EXISTS)
Second: using IN - a very similar approach (maybe even exactly?) which yields the same results
SELECT u.ID,...
FROM User u WHERE 1
AND (u.ID) IN (SELECT uid FROM AnswerNumber WHERE qid=# AND ...)
...
again with 30 subqueries.
The third one I tried was the same as described above (using 30 JOINs)
The results of using EXPLAIN on the first two were as follows: (identical)
The primary query on table u had a type of ALL (bad, though users table is not huge) and rows searched was roughly twice the size of the user table (not sure why). Each other row in the output of EXPLAIN was a dependent query on the relevant answer table, with a type of eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall not bad.
For the query I suggested (JOINing):
The primary query was actually on whatever table you joined first (in my case AnswerBoolean) with type of ref (better than ALL). The number of rows searched was equal to the number of questions answered by anyone (as in 50 distinct questions have been answered by anyone) (which will be much less than the number of users). For each additional row in EXPLAIN output, it was a SIMPLE query with type eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall almost the same, but a smaller starting multiplier.
One final advantage to the JOIN method: it was the only one I could figure out how to order by various values (such as n1.value). Since the other two queries were using subqueries, I could not access the value of a specific subquery. Adding the order by clause did change the extra field in the first query to also have 'using temporary' (required, I believe, for order by's) and 'using filesort' (not sure how to avoid that). However, even with those slow-downs, the number of rows is still much less, and the other two (as far as I could get) cannot use order by.
You could answer most of these questions yourself with a suitably large test dataset and the use of EXPLAIN and/or the profiler.
Your INNER JOINs will almost certainly perform better than switching to EXISTS but again this is easy to test with a suitable test dataset and EXPLAIN.
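A minimal sketch of that kind of test, assuming a generated data set of roughly the size mentioned in the update (250 users). It uses Python's built-in sqlite3 module only so that it runs on its own; SQLite's EXPLAIN QUERY PLAN is not MySQL's EXPLAIN, so the plan text is illustrative, and the real comparison should be run against MySQL. The table layout and the two search conditions are made up for the example.

```python
import random
import sqlite3

rng = random.Random(0)  # fixed seed so the generated data set is repeatable
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User (ID INTEGER PRIMARY KEY);
    CREATE TABLE AnswerNumber (uid INTEGER, qid INTEGER, value REAL,
                               PRIMARY KEY (uid, qid));
""")
conn.executemany("INSERT INTO User VALUES (?)", [(uid,) for uid in range(1, 251)])
answers = [(uid, qid, rng.randint(0, 100))
           for uid in range(1, 251) for qid in range(1, 9)
           if rng.random() < 0.5]  # not every user answers every question
conn.executemany("INSERT INTO AnswerNumber VALUES (?, ?, ?)", answers)

JOIN_SQL = """
    SELECT u.ID FROM User u
    JOIN AnswerNumber n1 ON u.ID = n1.uid
    JOIN AnswerNumber n2 ON u.ID = n2.uid
    WHERE (n1.qid = 1 AND n1.value > 50)
      AND (n2.qid = 2 AND n2.value < 80)
    ORDER BY u.ID
"""
EXISTS_SQL = """
    SELECT u.ID FROM User u
    WHERE EXISTS (SELECT 1 FROM AnswerNumber WHERE uid = u.ID AND qid = 1 AND value > 50)
      AND EXISTS (SELECT 1 FROM AnswerNumber WHERE uid = u.ID AND qid = 2 AND value < 80)
    ORDER BY u.ID
"""

join_ids = [row[0] for row in conn.execute(JOIN_SQL)]
exists_ids = [row[0] for row in conn.execute(EXISTS_SQL)]
print("same result:", join_ids == exists_ids, "rows:", len(join_ids))

# Show how the engine plans each query (the last column holds the description).
for label, sql in (("JOIN", JOIN_SQL), ("EXISTS", EXISTS_SQL)):
    print(label)
    for row in conn.execute("EXPLAIN QUERY PLAN " + sql):
        print("   ", row[-1])
```

Because (uid, qid) is the primary key, each JOIN matches at most one answer row per user, which is why the JOIN form returns the same set of users as the EXISTS form without needing DISTINCT.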
Related
I'm querying a table from a third party plugin so I don't have control over how the data is being inputted. It's a course plugin broken into 4 different quizzes. One of the questions is being used in all four quizzes.
THERE IS NO "quiz_id", which is why I think the only way to query the data is with some sort of if conditional. There IS a date field and a unique id field.
This is what my subquery looks like:
(SELECT y.post_content
FROM wp_posts AS y
WHERE 137 = y.post_author
AND y.post_title LIKE '%If you checked "Other" in "What are your organization’s primary goals related to hiring people with autism?", please explain%'
ORDER BY y.post_author ASC limit 1)
This works to query the answer (y.post_content) for the first quiz, but not for all 4 quizzes. Is there a conditional I can use for this? For example, say I'm querying results for the 2nd quiz: if there are four answers, pick the 2nd one; if there are 3 answers, pick the 2nd one; if there are 2 answers, pick the most recent one.
I am trying to normalise my MySQL 5.7 data schema and am struggling with replacing the SQL queries.
At the moment there is one table containing all attributes of each article:
article_id | title | ref_id | dial_c_id
The task is to retrieve all articles which match two given attributes (ref_id and dial_c_id) and also retrieve all their other attributes.
With just one table, this is straightforward:
SELECT *
FROM test.articles_test
WHERE
ref_id = '127712'
AND dial_c_id = 51
Now in my effort to normalise, I have created a second table, which stores the attributes of each article and removed the ones in table articles:
table 1:
article_id | title
table 2:
article_id | attr_group | attribute
1 ref_id 51
1 dial_c_id 33
1 another 5
2 ..
I would like to retrieve all article details, including ALL attributes, for the articles that match ref_id and dial_c_id with this two-table schema.
Somehow like this:
SELECT
a.article_id,
a.title,
attr.*
FROM test.articles_test a
INNER JOIN attributes attr ON a.article_id = attr.article_id
AND ref_id = '127712'
AND dial_c_id = 51
How can this be done?
You have used an Entity-Attribute-Value table to record your attributes.
This is the opposite of normalization.
Name the rule of normalization that guided you to put different attributes into the same column. You can't, because this is not a normalization practice.
To accomplish your query with your current EAV design, you need to pivot the result so you get something as if you had your original table.
SELECT * FROM (
SELECT
a.article_id,
a.title,
MAX(CASE attr_group WHEN 'ref_id' THEN attribute END) AS ref_id,
MAX(CASE attr_group WHEN 'dial_c_id' THEN attribute END) AS dial_c_id
-- ...others...
FROM test.articles_test a
INNER JOIN attributes attr ON a.article_id = attr.article_id
GROUP BY a.article_id, a.title) AS pivot
WHERE pivot.ref_id = '127712'
AND pivot.dial_c_id = 51
While the above query can produce the result you want, the performance will be terrible. It has to create a temp table for the subquery, containing all data from both tables, then apply the WHERE clause against the temp table.
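To see the pivot at work on a tiny data set, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The table names articles and attributes, the titles, and the sample rows are assumptions for the illustration; the query has the same shape as the one above. Note that the search values are bound as strings, since an EAV column stores every attribute as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (article_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE attributes (article_id INTEGER, attr_group TEXT, attribute TEXT);
    INSERT INTO articles VALUES (1, 'first'), (2, 'second'), (3, 'third');
    INSERT INTO attributes VALUES
        (1, 'ref_id', '127712'), (1, 'dial_c_id', '51'), (1, 'another', '5'),
        (2, 'ref_id', '127712'), (2, 'dial_c_id', '33'),
        (3, 'ref_id', '999'),    (3, 'dial_c_id', '51');
""")

PIVOT_SQL = """
    SELECT * FROM (
        SELECT a.article_id, a.title,
               MAX(CASE attr_group WHEN 'ref_id'    THEN attribute END) AS ref_id,
               MAX(CASE attr_group WHEN 'dial_c_id' THEN attribute END) AS dial_c_id,
               MAX(CASE attr_group WHEN 'another'   THEN attribute END) AS another
        FROM articles a
        INNER JOIN attributes attr ON a.article_id = attr.article_id
        GROUP BY a.article_id, a.title
    ) AS pivot
    WHERE pivot.ref_id = ? AND pivot.dial_c_id = ?
"""

# Only article 1 has both ref_id 127712 and dial_c_id 51.
rows = conn.execute(PIVOT_SQL, ("127712", "51")).fetchall()
print(rows)
```

Every attribute row of every article is grouped before the outer WHERE is applied, which is the cost described above.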
You're really better off with each attribute in its own column in your original table.
I understand that you are trying to allow for many attributes in the future. This is a common problem.
See my answer to
How to design a product table for many kinds of product where each product has many parameters
But you shouldn't call it "normalised," because it isn't. It's not even denormalised. It's derelational.
You can't just use words to describe anything you want — especially not the opposite of what the word means. I can't let the air out of my bicycle tire and say "I'm inflating it."
You commented that you're trying to make your database "scalable." You also misunderstand what the word "scalable" means. By using EAV, you're creating a structure where the queries needed are difficult to write and inefficient to execute, and the data takes 10x space. It's the opposite of scalable.
What you mean is that you're trying to create a system that is extensible. This is complex to implement in SQL, but I describe several solutions in the other Stack Overflow answer to which I linked. You might also like my presentation Extensible Data Modeling with MySQL.
Hi, I have run into a problem when retrieving a particular data set using 3 tables in a MySQL database. The tables are as follows:
Student
SID | Name | Age | Telephone
Term
TID | Start | End
Payment
PID | TID | SID | Value
SID is primary key of Student table. TID is primary key of Term table. PID is primary key of Payment table. TID and SID in Payment table are foreign key references.
The Student table contains student data, the Term table contains term start and end dates, and the Payment table contains student payments. Records in the Payment table may or may not contain a TID: a registration payment has no TID, while a term fee payment has one. What I want is a list of students who haven't paid this term's fees as of today. Assume this TID is in a variable. How can I obtain the list of students? It seems super easy, but I couldn't find an answer this whole day 😣
You want a list of just those students who do not have a TID-populated record whose start and end dates are either side of today, in Payment
SELECT s.*
FROM Student s
LEFT OUTER JOIN
  -- payments made for the term whose dates are either side of today
  (SELECT p.PID, p.SID
   FROM Payment p
   INNER JOIN Term t ON t.TID = p.TID
   WHERE NOW() BETWEEN t.`Start` AND t.`End`) this_term_payments
ON s.SID = this_term_payments.SID
WHERE this_term_payments.PID IS NULL
There are many ways to skin this cat. Here is one. We filter the payments table down to just a list of this term's payments (that's the inner query), and left join that to students. Left join means we get all students, matched with this_term_payments if the this_term_payments row exists, or NULL in every this_term_payments column if the term payment doesn't exist. The where clause then filters the whole result set down to "just those who don't have a term payment" by looking for those nulls that the left join creates.
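Here is a minimal, self-contained sketch of that left-join-then-look-for-NULLs technique, run against an in-memory SQLite database from Python so it can be tried without a MySQL server. The sample students, terms, and payments are invented, "today" is passed in as a parameter instead of calling NOW() so the result is repeatable, and the payments are joined to Term because that is where the Start and End columns live in the schema from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (SID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Term (TID INTEGER PRIMARY KEY, "Start" TEXT, "End" TEXT);
    CREATE TABLE Payment (PID INTEGER PRIMARY KEY, TID INTEGER, SID INTEGER, Value REAL);
    INSERT INTO Student VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO Term VALUES (1, '2024-01-01', '2024-06-30'),
                            (2, '2024-07-01', '2024-12-31');
    INSERT INTO Payment VALUES
        (1, NULL, 1, 50),   -- Ann: registration payment (no TID)
        (2, 2,    1, 500),  -- Ann: fee for term 2
        (3, 1,    2, 500),  -- Bob: fee for term 1 only
        (4, NULL, 3, 50);   -- Cy: registration payment only
""")

UNPAID_SQL = """
    SELECT s.Name
    FROM Student s
    LEFT OUTER JOIN (
        SELECT p.PID, p.SID
        FROM Payment p
        INNER JOIN Term t ON t.TID = p.TID
        WHERE ? BETWEEN t."Start" AND t."End"
    ) this_term_payments ON s.SID = this_term_payments.SID
    WHERE this_term_payments.PID IS NULL  -- no matching payment row was found
    ORDER BY s.SID
"""


def unpaid_students(today):
    """Names of students with no fee payment for the term containing `today`."""
    return [row[0] for row in conn.execute(UNPAID_SQL, (today,))]


unpaid = unpaid_students("2024-09-15")  # a date inside term 2
print(unpaid)
```

Registration payments drop out of the inner query on their own, because a NULL TID never matches a Term row in the inner join.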
FWIW, your question attracted close votes because it didn't include example data/demonstrate the level of your effort we like to see on SQL questions. If you'd included sample data for all your tables and an example result set you wanted to see out, it means we can write an exact query that meets your requirements.
This is a bit of a double edged sword for me; we can deliver exactly what you ask for even if you later realise it's not what you want (asking in English is far more vague than giving an example result set) but at the same time we basically become some free outsourced homework contractor or similar, doing your work for you and removing learning opportunities along the way. Hopefully you'll take this query (it's likely it doesn't output everything you want, or outputs stuff you don't want) and craft what you want out of it now that the technique has been explained.. :)
For an SQL question that was relatively well received (by the time I'd finished editing it following up on the comments), and attracted some great answers, take a look here:
Fill in gaps in data, using a value proportional to the gap distance to data from the surrounding rows?
That's more how you need to be asking SQL questions - say what you want, give example data, give scripts to help people create your same data so they can have a play with their idea without the boring bits of creating the data first. I picked on that one because I didn't even have any SQL attempts to show at the time; it was just a thought exercise. Having nothing working isn't necessarily a barrier to asking a good question
Try this:
select s.Name from Student s
where not exists (select 1 from Payment p where p.SID = s.SID and p.TID = @tid); -- @tid holds this term's TID
For example have url like domain.com/transport/cars
Based on the url want to select from mysql and show list of ads for cars
Want to choose fastest method (method that takes less time to show results and will use less resources).
Comparing 2 ways
First way
Mysql table transport with rows like
FirstLevSubcat | Text
---------------------------------
1 | Text1 car
2 | Text1xx lorry
1 | Text another car
FirstLevSubcat Type is int
Then another mysql table subcategories
Id | NameOfSubcat
---------------------------------
1 | cars
2 | lorries
3 | dogs
4 | flats
Query like
SELECT Text, AndSoOn FROM transport
WHERE
FirstLevSubcat = (SELECT Id FROM subcategories WHERE NameOfSubcat = 'cars')
Or instead of SELECT Id FROM subcategories get Id from xml file or from php array
Second way
Mysql table transport with rows like
FirstLevSubcat | Text
---------------------------------
cars | Text1 car
lorries | Text1xx lorry
cars | Text another car
FirstLevSubcat Type is varchar or char
And query simply
SELECT Text, AndSoOn FROM transport
WHERE FirstLevSubcat = 'cars'
Please advise which way would use fewer resources and take less time to show results. I read that it is better to select where int than where varchar (see: SQL SELECT speed int vs varchar).
So, as I understand it, the first way would be better?
The first design is much better, because you separate two facts in your data:
There is a category 'cars'.
'Text1 car' is in the Category 'cars'.
Imagine, in your second design you enter another car, but type in 'cors' instead of 'cars'. The dbms doesn't see this, and so you have created another category with a single entry. (Well, in MySQL you could use an enum column instead to circumvent this issue, but this is not available in most other dbms. And anyhow, whenever you want to rename your category, say from 'cars' to 'vans', then you would have to change all existing records plus alter the table, instead of simply renaming the entry once in the subcategories table.)
So stay away from your second design.
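The two points above (a mistyped category cannot sneak in, and a rename touches a single row) can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and it adds a foreign key from transport to subcategories, which is an assumption on my part since the question does not say whether one is declared. The helper name ads_for is also made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked to
conn.executescript("""
    CREATE TABLE subcategories (Id INTEGER PRIMARY KEY, NameOfSubcat TEXT UNIQUE);
    CREATE TABLE transport (
        FirstLevSubcat INTEGER NOT NULL REFERENCES subcategories(Id),
        "Text" TEXT
    );
    INSERT INTO subcategories VALUES (1, 'cars'), (2, 'lorries');
    INSERT INTO transport VALUES (1, 'Text1 car'), (2, 'Text1xx lorry'),
                                 (1, 'Text another car');
""")


def ads_for(name):
    """Ads whose subcategory has the given name (the first design's query)."""
    sql = """
        SELECT "Text" FROM transport
        WHERE FirstLevSubcat = (SELECT Id FROM subcategories WHERE NameOfSubcat = ?)
        ORDER BY rowid
    """
    return [row[0] for row in conn.execute(sql, (name,))]


# A row pointing at a category that does not exist is rejected by the database.
try:
    conn.execute("INSERT INTO transport VALUES (99, 'Text typo car')")
    typo_rejected = False
except sqlite3.IntegrityError:
    typo_rejected = True

cars_before = ads_for("cars")

# Renaming the category updates exactly one row; the ads themselves are untouched.
renamed_rows = conn.execute(
    "UPDATE subcategories SET NameOfSubcat = 'vans' WHERE NameOfSubcat = 'cars'"
).rowcount
vans_after = ads_for("vans")

print(typo_rejected, renamed_rows, vans_after)
```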
As to Praveen Prasannan's comment on subqueries and joins: that is nonsense. Your query is straightforward and good. You want to select from transport where the category is the desired one. Perfect. There are two groups of people who would prefer a join here:
Beginners who simply don't know better and always join from the start and try to sort things out in the end.
Experienced programmers who know that some dbms often handle joins better than sub-queries. But this is a pessimistic habit. Better write your queries such that they are easy to read and maintain, as you are already doing, and only change this in case grave performance issues occur.
Yup. As the SO link in your question suggests, int comparison is faster than character comparison and yields a faster fetch. Keeping this in mind, the first design would be considered the better design. However, subqueries are never recommended. Use a join instead.
eg:
SELECT t.Text, t.AndSoOn FROM transport t
INNER JOIN subcategories s ON s.ID = t.FirstLevSubcat
WHERE s.NameOfSubcat = 'cars'
I'm working on redesigning some parts of our schema, and I'm running into a problem where I just don't know a good clean way of doing something. I have an event table such as:
Events
--------
event_id
for each event, there could be n groups or users associated with it. So there's a table relating Events to Users to reflect that one to many relationship such as:
EventUsers
----------
event_id
user_id
The problem is that we also have a concept of groups. We want to potentially tie n groups to an event in addition to users. So, that user_id column isn't sufficient, because we need to store potentially either a user_id or a group_id.
I've thought of a variety of ways to handle this, but they all seem like a big hack. For example, I could make that a participant_id and put in a participant_type column such as:
EventUsers
----------
event_id
participant_id
participant_type
and if I wanted to get the events that user_id 10 was a part of, it could be something like:
select event_id
from EventUsers
where participant_id = 10
and participant_type = 1
(assuming that somewhere participant_type 1 was defined to be a User). But I don't like that from a philosophical point of view, because when I look at the data, I don't know what the number in participant_id means unless I also look at the value in participant_type.
I could also change EventUsers to be something like:
EventParticipants
-----------------
event_id
user_id
group_id
and allow the values of user_id and group_id to be NULL if that record is dealing with the other type of information.
Of course, I could just break EventUsers and we'll call it EventGroups into 2 different tables but I'd like to keep who is tied to an event stored in one single place if there's a good logical way to do it.
So, am I overlooking a good way to accomplish this?
Tables Events, Users and Groups represent the basic entities. They are related by EventUsers, GroupUsers and EventGroups. You need to union results together, e.g. the attendees for an event are:
select user_id
from EventUsers
where event_id = :event_id
union
select GU.user_id
from EventGroups as EG inner join
GroupUsers as GU on GU.group_id = EG.group_id
where EG.event_id = :event_id
Don't be shy about creating additional tables to represent different types of things. It is often easier to combine them, e.g. with union, than to try to sort out a mess of vague data.
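A minimal sketch of that union on sample data, using Python's built-in sqlite3 module so it can be run as-is. The ids are invented for the illustration, and the attendees helper is a hypothetical name. User 11 is tied to event 1 both directly and through group 5, which shows that UNION (as opposed to UNION ALL) removes the duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EventUsers  (event_id INTEGER, user_id INTEGER);
    CREATE TABLE EventGroups (event_id INTEGER, group_id INTEGER);
    CREATE TABLE GroupUsers  (group_id INTEGER, user_id INTEGER);
    INSERT INTO EventUsers  VALUES (1, 10), (1, 11), (2, 13);
    INSERT INTO EventGroups VALUES (1, 5);
    INSERT INTO GroupUsers  VALUES (5, 11), (5, 12), (6, 14);
""")

ATTENDEES_SQL = """
    SELECT user_id
    FROM EventUsers
    WHERE event_id = :event_id
    UNION
    SELECT GU.user_id
    FROM EventGroups AS EG
    INNER JOIN GroupUsers AS GU ON GU.group_id = EG.group_id
    WHERE EG.event_id = :event_id
    ORDER BY user_id
"""


def attendees(event_id):
    """Users tied to the event directly or through one of its groups."""
    return [row[0] for row in conn.execute(ATTENDEES_SQL, {"event_id": event_id})]


event_1 = attendees(1)
print(event_1)
```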
"Of course, I could just break EventUsers and we'll call it EventGroups into 2 different tables"
This is the good logical way to do it. Create a junction table for each many-to-many relationship; one for events and users, the other for events and groups.
There's no correct answer to this question (although I'm sure if you look hard enough you'll find some purists who believe that their approach is the correct one).
Personally, I'm a fan of the second approach because it allows you to give columns names that accurately reflect the data they contain. This makes your SELECT statements (in particular when it comes to joining) a bit easier to understand. Yeah, you'll end up with a bunch of NULL values in the column that is unused, but that's not really a big deal.
However, if you'll be joining on this table a lot, it might be wise to go with the first approach, so that the column you join on is consistently the same. Also, if you anticipate new types of participant being added in the future, which would result in a third column in EventParticipants, then you might want to go with the first approach to keep the table narrow.
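If you go with the second approach (separate nullable user_id and group_id columns), one way to keep the data honest is a constraint saying that exactly one of the two columns is filled in. The CHECK constraint below is my own addition, not something from the question, and whether it is enforced depends on your DBMS and version, so treat this as a sketch. It runs against SQLite through Python's built-in sqlite3 module; add_participant is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EventParticipants (
        event_id INTEGER NOT NULL,
        user_id  INTEGER,
        group_id INTEGER,
        -- exactly one of the two columns must be filled in
        CHECK ((user_id IS NULL) <> (group_id IS NULL))
    )
""")


def add_participant(event_id, user_id=None, group_id=None):
    """Insert a row; return False if the database rejects it."""
    try:
        conn.execute("INSERT INTO EventParticipants VALUES (?, ?, ?)",
                     (event_id, user_id, group_id))
        return True
    except sqlite3.IntegrityError:
        return False


results = [
    add_participant(1, user_id=10),              # a user: accepted
    add_participant(1, group_id=5),              # a group: accepted
    add_participant(2, user_id=10, group_id=5),  # both set: rejected
    add_participant(2),                          # neither set: rejected
]

# The column name says what the number means, so no participant_type is needed.
events_for_user_10 = [row[0] for row in conn.execute(
    "SELECT event_id FROM EventParticipants WHERE user_id = 10 ORDER BY event_id")]
print(results, events_for_user_10)
```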