Robust SQL query on demographics data set - mysql

I have a rather complex SQL Server query (at least to me) to write on a demographics data set. I need to figure out how many respondents in the system match a specific demographic.
I have 2 main tables. I will list the relevant columns. Assume there are unique ID's on each row.
Table Respondents:
[RespondentID] [SystemEntryDate]
Table RespondentProfiles:
[QuestionID] [AnswerID]
The respondent ID on Respondents links to RespondentProfiles. For each question answered, a row is created. The question id corresponds to a specific question (say gender, ethnicity, state, and car ownership) and the answer id means something different depending on the question. Like 1 is male and 2 is female, or 1 might be white, 2 hispanic, 3 pacific islander, and so on.
I also have a table called Conditions. The conditions table looks like this:
[ConditionSetID] [QuestionID] [AnswerID]
The condition set ID groups conditions together into a collection. So I can pass a condition set ID to the query, and it will return a count of how many respondents meet those criteria, as well as the min and max dates from that set.
My query will look something like this:
create procedure query
@ConditionSetID int
as
select count(distinct r.ID) as Respondents,
min(r.SystemEntryDate) as EarliestDate,
max(r.SystemEntryDate) as LatestDate
from Respondents r
join RespondentProfiles rp
on r.ID = rp.RespondentID
join Conditions c
on c.ConditionSetID = @ConditionSetID
and c.QuestionID = rp.QuestionID
where rp.QuestionID = c.QuestionID
and rp.AnswerID = c.AnswerID
As an example, I might have a respondent profiles table like this
[RespondentID] [QuestionID] [AnswerID]
10001 1 (gender) 1 (male)
10001 2 (ethnicity) 1 (white)
10001 3 (car) 23 (lexus)
10002 1 (gender) 2 (female)
10002 2 (ethnicity) 2 (black)
10002 3 (car) 24 (buick)
10003 1 (gender) 2 (female)
10003 2 (ethnicity) 1 (white)
10003 3 (car) 5 (honda)
10004 1 (gender) 1 (male)
10004 2 (ethnicity) 2 (black)
10004 3 (car) 24 (buick)
And if I pick a specific condition set, the rows I have might look like:
[QuestionID] [AnswerID]
1 (gender) 2 (female)
2 (ethnicity) 2 (black)
3 (car) 24 (buick)
This would be asking for all the black females who own a buick, which should give me a count of 1.
Or I could have:
[QuestionID] [AnswerID]
3 (car) 23 (lexus)
3 (car) 24 (buick)
This is asking for everyone who owns a buick or lexus, which would be 3 people.
And then as a final example:
[QuestionID] [AnswerID]
2 (ethnicity) 2 (black)
3 (car) 23 (lexus)
3 (car) 24 (buick)
This is asking for everyone who is black and owns a lexus or everyone who is black and owns a buick, which would be 2 people.
I know this isn't horribly complicated, but it is the most complex thing I've attempted yet, and any help would be greatly appreciated. I'm having a lot of trouble figuring out how to set up the where clause, and even general direction would be appreciated. There are also about 800,000 records in the respondentprofiles table, so it must be efficient.
The where clause I have set up isn't quite correct, because it will only get the records as if the different questions are being or'd together as opposed to and'ed. So it will return a row for that respondent even if only one answer matches, which is wrong. A particular respondent must meet all the conditions in the condition set to be selected.
Perhaps I need to select into a temp table question at a time or something? Or use some sort of grouping? I am just really confused on where to go with this. I hope I have provided enough information to adequately demonstrate my dilemma.

The examples below show how to get the respondent IDs of respondents who answered:
To question A, Yes
To question B, No
To question C, Yes
Assuming you are actually using SQL server (you tagged both mysql and sql server in your question), you can use:
select id
from RespondentProfiles
where QuestionID = 'a'
and AnswerID = 'yes'
intersect
select id
from RespondentProfiles
where QuestionID = 'b'
and AnswerID = 'no'
intersect
select id
from RespondentProfiles
where QuestionID = 'c'
and AnswerID = 'yes'
Or if you are using MySQL you can use:
select x.id
from RespondentProfiles x
join (select id
from RespondentProfiles
where QuestionID = 'b'
and AnswerID = 'no') y
on x.id = y.id
join (select id
from RespondentProfiles
where QuestionID = 'c'
and AnswerID = 'yes') z
on y.id = z.id
where x.QuestionID = 'a'
and x.AnswerID = 'yes'
Just to add to my answer what I put in the comments: there is no need for your conditions table. You do not need such a table in order to query for respondents who answered 2+ questions a certain way. You can use inline views and/or subqueries to accomplish that (or, in the case of SQL Server, the intersect set operator).
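If the Conditions table is kept, as in the question, the "some sort of grouping" idea the asker mentions can be sketched as follows. This is only an illustration, not the answerer's method: it uses Python's sqlite3 module so it can be run as-is, the condition set ids 1, 2 and 3 are hypothetical labels for the question's three examples, and the Respondents table (and so the min/max dates) is left out. A respondent is counted when they match at least one allowed answer for every distinct question in the set, which gives OR within a question and AND across questions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RespondentProfiles (RespondentID INT, QuestionID INT, AnswerID INT);
    CREATE TABLE Conditions (ConditionSetID INT, QuestionID INT, AnswerID INT);
    -- Sample profile rows copied from the question.
    INSERT INTO RespondentProfiles VALUES
        (10001, 1, 1), (10001, 2, 1), (10001, 3, 23),
        (10002, 1, 2), (10002, 2, 2), (10002, 3, 24),
        (10003, 1, 2), (10003, 2, 1), (10003, 3, 5),
        (10004, 1, 1), (10004, 2, 2), (10004, 3, 24);
    -- Hypothetical set ids 1, 2, 3 for the question's three example sets.
    INSERT INTO Conditions VALUES
        (1, 1, 2), (1, 2, 2), (1, 3, 24),
        (2, 3, 23), (2, 3, 24),
        (3, 2, 2), (3, 3, 23), (3, 3, 24);
""")

COUNT_MATCHES = """
    SELECT COUNT(*) FROM (
        SELECT rp.RespondentID
        FROM RespondentProfiles rp
        JOIN Conditions c
          ON c.QuestionID = rp.QuestionID
         AND c.AnswerID = rp.AnswerID
        WHERE c.ConditionSetID = :set_id
        GROUP BY rp.RespondentID
        -- Keep respondents who matched every distinct question in the set.
        HAVING COUNT(DISTINCT c.QuestionID) =
               (SELECT COUNT(DISTINCT QuestionID)
                FROM Conditions
                WHERE ConditionSetID = :set_id)
    ) AS matched
"""

def count_respondents(set_id):
    """Count respondents who satisfy every question in the condition set."""
    return conn.execute(COUNT_MATCHES, {"set_id": set_id}).fetchone()[0]

counts = [count_respondents(s) for s in (1, 2, 3)]
print(counts)  # [1, 3, 2], the counts expected in the question
```

On the real tables, an index covering (QuestionID, AnswerID, RespondentID) on RespondentProfiles would probably matter more for speed than the exact shape of the query.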

Related

MySQL Select statement with several joins returns duplicated records

I have 4 tables with the following structure:
Table Groups
Groupid
groupname
groupadmin
Table GroupMembership
devid
groupid
Table GroupLocator
devid
name
pass
color
sampling
connected
forget
trace
Table GroupTracker
devid
groupid
latitude
longitude
timestamp
There is only one groupid='1' with groupname="FBorges"
Table GroupLocator has 2 records where devid points to groupid='1' on GroupMembership
GroupTracker has two records where groupid='1'
When I run the following SELECT:
SELECT GroupLocator.name, GroupLocator.color, GroupLocator.sampling,
GroupLocator.forget, GroupLocator.connected, GroupLocator.trace,
Groups.groupname, GroupTracker.latitude, GroupTracker.longitude,
GroupTracker.timestamp
FROM GroupMembership
JOIN GroupLocator ON GroupLocator.devid=GroupMembership.devid
JOIN Groups ON Groups.groupid=GroupMembership.groupid
JOIN GroupTracker ON GroupTracker.groupid=GroupMembership.groupid
WHERE GroupMembership.groupid=1;
I get the result:
name color sampling forget connected trace groupname latitude longitude timestamp
PCBorges 2 1 45 0 1 FBorges -22.883639 -42.822542 2020-01-08 20:29:24
Test 3 2 45 1 0 FBorges -22.883639 -42.822542 2020-01-08 20:29:24
PCBorges 2 1 45 0 1 FBorges -22.873639 -42.322542 2020-01-11 16:56:30
Test 3 2 45 1 0 FBorges -22.873639 -42.322542 2020-01-11 16:56:30
What I hope to get is:
name color sampling forget connected trace groupname latitude longitude timestamp
PCBorges 2 1 45 0 1 FBorges -22.883639 -42.822542 2020-01-08 20:29:24
Test 3 2 45 1 0 FBorges -22.883639 -42.822542 2020-01-08 20:29:24
EDIT: Removed my previous speculation after structure and data was provided and wrote a new answer:
I believe that you want to JOIN GroupTracker on devid instead of on groupid. Groupid 1 matches both rows in the GroupTracker table, so it will produce two results for each row in GroupMembership. Devid only matches one row. A correct JOIN is more efficient than your current GROUP BY solution (in comments) and may also produce more consistent results as your database grows.
SELECT gl.name, gl.color, gl.sampling,
gl.forget, gl.connected, gl.trace,
g.groupname, gt.latitude, gt.longitude,
gt.timestamp
FROM GroupMembership AS gm
JOIN GroupLocator AS gl ON gm.devid = gl.devid
JOIN Groups AS g ON gm.groupid = g.groupid
JOIN GroupTracker AS gt ON gm.devid = gt.devid
WHERE gm.groupid=1
;
I aliased all your tables so the query is much shorter and hence faster to write. I also swapped positions of all your JOIN clauses. I prefer to have the left table on the left side and the right table on the right side. Makes it easier to read. These two changes are not important. It's only style. The query will work perfectly without them.

Limit selected results by unique selected IDs when using left joins

I have a table users and some other tables like images and products
Table users:
user_id user_name
1 andrew
2 lutz
3 sophie
4 michael
5 peter
6 oscor
7 anton
8 billy
9 henry
10 jon
Tables images:
user_id img_type img_url
1 0 url1
1 1 url4
2 0 url5
7 0 url7
8 0 url8
9 1 url9
Table Products
user_id prod_id
1 5
1 55
2 555
8 5555
9 5
9 55
I use this kind of SELECT:
SELECT * FROM
(SELECT users.user_id, users.user_name, img.img_type, prod.prod_id FROM
users
LEFT JOIN images img ON img.user_id = users.user_id
LEFT JOIN products prod ON prod.user_id = users.user_id
WHERE users.user_id <= 5) AS users
ORDER BY users.user_id ASC
The result should be the following output. For performance reasons, I use ORDER BY and an inner select. If I put a LIMIT 5 in the inner or outer select, it doesn't work: MySQL hard-limits the result to 5 rows. What I need (for pagination) is a limit of 5 unique user_id values, which would yield 9 rows in this case.
Could I use something like an if-statement that pushes each found user_id into an array and stops the select once the array contains 5 UIDs? Or can I modify the select somehow?
user_id user_name img_type prod_id
1 andrew 0 5
1 andrew 1 5
1 andrew 0 55
1 andrew 1 55
2 lutz 0 5
2 lutz 0 55
3 sophie null null
4 michael null null
5 peter null null
results: 9
LIMIT 5 and user_id <= 5 do not necessarily give you the same results. One reason: There are multiple rows (after the JOINs) for user_id = 1. This is because there can be multiple images and/or multiple products for a given 'user'.
So, first decide which you want.
LIMIT without ORDER BY gives you an arbitrary set of rows. (Yeah, it is somewhat predictable, but you should not depend on it.)
ORDER BY + LIMIT usually implies gathering all the potentially relevant rows, sorting them, then doing the "limit". There are sometimes ways around this sluggishness.
LEFT leads to the NULLs you got; did you want that?
What do you want pagination to do if you are displaying 5 items per page, but user 1 has 6 images? You need to think about this edge case before we can help you with a solution. Maybe you want all of user 1 on a page, even if it exceeds 5? Maybe you want to break in the middle of '1'; but then we need an unambiguous way to know where to continue from for the next page.
Probably any viable solution will not use nested SELECTs. As you are finding out, it leads to "errors". Think of it this way: First find all the rows you need to display on all the pages, then carve out 5 for the current page.
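One possible reading of "find the rows, then carve out a page" is to page over the users themselves and only then attach images and products. The sketch below is an assumption about what the asker wants (5 distinct users per page, however many joined rows that yields), and it does use a derived table, which the caution above may or may not have had in mind. It runs on Python's sqlite3 with the sample data from the question; the names fetch_page and PAGE_QUERY are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INT, user_name TEXT);
    CREATE TABLE images (user_id INT, img_type INT, img_url TEXT);
    CREATE TABLE products (user_id INT, prod_id INT);
    INSERT INTO users VALUES
        (1, 'andrew'), (2, 'lutz'), (3, 'sophie'), (4, 'michael'), (5, 'peter'),
        (6, 'oscor'), (7, 'anton'), (8, 'billy'), (9, 'henry'), (10, 'jon');
    INSERT INTO images VALUES
        (1, 0, 'url1'), (1, 1, 'url4'), (2, 0, 'url5'),
        (7, 0, 'url7'), (8, 0, 'url8'), (9, 1, 'url9');
    INSERT INTO products VALUES
        (1, 5), (1, 55), (2, 555), (8, 5555), (9, 5), (9, 55);
""")

PAGE_QUERY = """
    SELECT u.user_id, u.user_name, img.img_type, prod.prod_id
    FROM (SELECT user_id, user_name
          FROM users
          ORDER BY user_id
          LIMIT :page_size OFFSET :offset) AS u   -- limit users, not joined rows
    LEFT JOIN images img ON img.user_id = u.user_id
    LEFT JOIN products prod ON prod.user_id = u.user_id
    ORDER BY u.user_id, img.img_type, prod.prod_id
"""

def fetch_page(page, page_size=5):
    """Return all joined rows for one page of distinct users (page is 0-based)."""
    params = {"page_size": page_size, "offset": page * page_size}
    return conn.execute(PAGE_QUERY, params).fetchall()

first_page = fetch_page(0)
users_on_page = sorted({row[0] for row in first_page})
print(users_on_page)  # [1, 2, 3, 4, 5]
```

The LIMIT applies to the inner list of users, so a user with several images or products never gets split across pages.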
Here are some more musings on pagination: http://mysql.rjweb.org/doc.php/pagination

MySQL: Find number of users with identical poll answers

I have a table, poll_response, with three columns: user_id, poll_id, option_id.
Given an arbitrary number of poll/response pairs, how can I determine the number of distinct user_ids that match?
So, suppose the table's data looks like this:
user_id | poll_id | option_id
1 1 0
1 2 1
1 3 0
1 4 0
2 1 1
2 2 1
2 3 1
2 4 0
And suppose I want to know how many users have responded "1" to poll 2 and "0" to poll 3.
In this case, only user 1 matches, so the answer is: there is only one distinct user.
But suppose I want to know how many users have responded "1" to poll 2 and "0" to poll 4.
In this case, both user 1 and user 2 match, so the answer is: there are 2 distinct users.
I'm having trouble constructing the MySQL query to make this happen, especially given that there are an arbitrary number of poll/response pairs. Do I just try to chain a bunch of joins together?
To know how many users have responded "1" to poll 2 and "0" to poll 3.
select count(user_id) from(
select user_id from poll_response
where (poll_id=2 and option_id=1) or (poll_id=3 and option_id=0)
group by user_id
having count(user_id)=2
)m
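For an arbitrary number of poll/response pairs, the same GROUP BY / HAVING pattern can be generated from a list. The following is a minimal sketch using Python's sqlite3 and the sample data from the question; the helper name count_matching_users is hypothetical, and it assumes that each user answers a given poll at most once and that the list of pairs is non-empty and has no duplicates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE poll_response (user_id INT, poll_id INT, option_id INT);
    INSERT INTO poll_response VALUES
        (1, 1, 0), (1, 2, 1), (1, 3, 0), (1, 4, 0),
        (2, 1, 1), (2, 2, 1), (2, 3, 1), (2, 4, 0);
""")

def count_matching_users(pairs):
    """Count users who gave every (poll_id, option_id) answer in pairs."""
    # One OR-ed predicate per pair; values are bound as parameters.
    where = " OR ".join("(poll_id = ? AND option_id = ?)" for _ in pairs)
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT user_id
            FROM poll_response
            WHERE {where}
            GROUP BY user_id
            HAVING COUNT(*) = ?   -- the user must have matched every pair
        ) AS m
    """
    params = [value for pair in pairs for value in pair] + [len(pairs)]
    return conn.execute(sql, params).fetchone()[0]

print(count_matching_users([(2, 1), (3, 0)]))  # 1
print(count_matching_users([(2, 1), (4, 0)]))  # 2
```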

MySQL select re-ask+simplification

The original question is here.. MySQL self-referencing ID and selects
I would like to pose the question in a way with all the relation to a specific case removed.
I have the example table..
id1 id2
1 5
5 1
2 3
3 2
What SQL command would return..
id1 id2
1 5
2 3
Essentially removing the "duplicate rows".
Q1 and Q2 are the aliases I've created for your table, so we can reference the ids as if they were on different tables.
DELETE Q1 FROM table Q1
JOIN table Q2
ON Q1.id1 = Q2.id2
AND Q2.id1 = Q1.id2
WHERE Q1.id1 > Q1.id2
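The statement above removes the mirrored rows from the table. If the goal is only to return the de-duplicated pairs without deleting anything, a read-only variant might look like the sketch below. It is not part of the original answer; the table name pairs is hypothetical, it runs on Python's sqlite3, and it assumes no row has id1 equal to id2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pairs (id1 INT, id2 INT);   -- hypothetical table name
    INSERT INTO pairs VALUES (1, 5), (5, 1), (2, 3), (3, 2);
""")

DEDUP_QUERY = """
    SELECT p.id1, p.id2
    FROM pairs p
    WHERE p.id1 < p.id2              -- keep the "ascending" copy of a mirrored pair
       OR NOT EXISTS (               -- or keep a row that has no mirror at all
            SELECT 1
            FROM pairs q
            WHERE q.id1 = p.id2
              AND q.id2 = p.id1)
    ORDER BY p.id1
"""

rows = conn.execute(DEDUP_QUERY).fetchall()
print(rows)  # [(1, 5), (2, 3)]
```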

MS Access 2007 Rows to columns in recordset

I have a table which is like a questionnaire type ..
My original table contains 450 columns and 212 rows.
SlNo is the ID of the person who answered the questionnaire.
SlNo Q1a Q1b Q2a Q2b Q2c Q2d Q2e Q2f .... Q37c <450 columns>
1 1
2 1 1
3 1
4 1 1
5 1
I have to do analysis for this data , eg Number of persons who is male (Q1a) and who owns a boat (Q2b) i.e ( select * from Questionnaire where Q1a=1 and Q2b=1 ).. etc .. many more combinations are there ..
I designed this in MS Access, and the design worked perfectly except for one major problem: the number of table columns is restricted to 255.
To be able to enter this into an Access table, I inserted it as 450 rows and 212 columns (now I am able to enter it into the Access DB). Now, when fetching the records, I want the record set to transpose the results into the form I wanted, so that I do not have to change my algorithm or logic. How can I achieve this with minimal changes? This is my first time working with an Access database.
You might be able to use a crosstab query to generate what you are expecting. You could also build a transpose function.
Either way, I think you'll still run into the 255 column limit, and MS Access uses temporary tables, etc.
However, I think you'll have far less work and better results if you change the structure of your table.
I assume that this is like a fill-in-the-bubble questionnaire, and it's mostly multiple choice. In that case, instead of recording a result per option, I would record the answer for each question
SlNo Q1 Q2
1 B
2 B
3 A
4 A C
5 A
Then you have far fewer columns to work with, and you query for Q1='A' instead of Q1a=1.
The alternative is to break the table up into sections (personal, career, etc.) and then do a join, showing only the columns you need (so as not to exceed that 255 column limit).
A way to do this that handles more questions is to have a table for the person, a table for the question, and a table for the response
Person
SlNo PostalCode
1 90210
2 H0H 0H0
3
Questions
QID, QTitle, QDesc
1 Q1a Gender Male
2 Q1b Gender Female
3 Q2a Boat
4 Q2b Car
Answers
SlNo QID Result
1 2 True
1 3 True
1 4 True
2 1 True
2 3 False
2 4 True
You can then find the question takers by selecting Persons from a list of Answers
select * from Person
where SlNo in (
select SlNo from Answers, Questions
where
questions.qid = answers.qid
and
qtitle = 'Q1a'
and
answers.result='True')
and SlNo in (
select SlNo from Answers, Questions
where
questions.qid = answers.qid
and
qtitle = 'Q2a'
and
answers.result='True')
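To try the three-table layout quickly, the query above can be generated for any list of question titles. The sketch below uses Python's sqlite3 rather than Access, so it only illustrates the idea; the helper name people_with_all is hypothetical, Result is stored as text to mirror the 'True' comparison above, and the list of titles must be non-empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (SlNo INT, PostalCode TEXT);
    CREATE TABLE Questions (QID INT, QTitle TEXT, QDesc TEXT);
    CREATE TABLE Answers (SlNo INT, QID INT, Result TEXT);
    -- Sample rows copied from the tables above.
    INSERT INTO Person VALUES (1, '90210'), (2, 'H0H 0H0'), (3, NULL);
    INSERT INTO Questions VALUES
        (1, 'Q1a', 'Gender Male'), (2, 'Q1b', 'Gender Female'),
        (3, 'Q2a', 'Boat'), (4, 'Q2b', 'Car');
    INSERT INTO Answers VALUES
        (1, 2, 'True'), (1, 3, 'True'), (1, 4, 'True'),
        (2, 1, 'True'), (2, 3, 'False'), (2, 4, 'True');
""")

# One "SlNo IN (...)" filter per question title, AND-ed together.
ONE_FILTER = """SlNo IN (
        SELECT a.SlNo
        FROM Answers a
        JOIN Questions q ON q.QID = a.QID
        WHERE q.QTitle = ? AND a.Result = 'True')"""

def people_with_all(titles):
    """Return the SlNo of every person who answered True to all given titles."""
    where = " AND ".join(ONE_FILTER for _ in titles)
    sql = f"SELECT SlNo FROM Person WHERE {where} ORDER BY SlNo"
    return [row[0] for row in conn.execute(sql, list(titles))]

print(people_with_all(["Q1b", "Q2a"]))  # [1]
print(people_with_all(["Q1a", "Q2a"]))  # []
```

With the sample rows, nobody is both male (Q1a) and a boat owner (Q2a), which is why the second call returns an empty list.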
I finally got the solution.
I created two tables, one having 225 columns and the other having 225 columns
(450 columns in total).
I created a SQL statement
select count(*) from T1,T2 WHERE T1.SlNo=T2.SlNo
and added the conditions I wanted.
It gives correct results after this.
The data was entered wrongly by other staff in the beginning, but throwing away a week of work was not an option, so I had to stick to this design. The deadline is next week, and now it's working :) :)