Mysql Table and Index Design for Dating Portal [closed]


I am creating a dating portal where we will ask each user around 40-50 questions about themselves, such as religion, caste, date of birth, food preference, and smoking/non-smoking.
I am asking similar questions about what the user is looking for, such as age range, religion preference, and smoking preference.
I have around 30-40 such preferences.
Now I want to show each user matches based on the preferences they have set.
I want to know how I should design the MySQL tables and indexes.
Should I create one big user_preferences table and index all of the preferences?
Should they be multiple-column indexes, or single-column indexes combined by index merge?
Should I keep sets of questions in different tables and join them when fetching the data?

I think this could be a case for EAV (entity-attribute-value):
You should be able to get the matching user pairs in the descending order (from most matching to least) similar to this:
SELECT *
FROM (
    -- Aliases keep the two ID columns distinct; a derived table cannot expose two columns with the same name.
    SELECT U1.USER_ID AS USER_ID_1, U2.USER_ID AS USER_ID_2, COUNT(*) AS MATCH_COUNT
    FROM USER U1
        JOIN USER_PREFERENCE P1
            ON (U1.USER_ID = P1.USER_ID)
        JOIN USER_PREFERENCE P2
            ON (P1.NAME = P2.NAME AND P1.VALUE = P2.VALUE)
        JOIN USER U2
            ON (P2.USER_ID = U2.USER_ID)
    WHERE U1.USER_ID < U2.USER_ID -- To avoid matching the user with herself and duplicated pairs with flipped user IDs.
    GROUP BY U1.USER_ID, U2.USER_ID
) Q
ORDER BY MATCH_COUNT DESC
This just matches the preferences by their exact values. You may want to create additional "preference" tables for range or enum-like values, and replace P1.VALUE = P2.VALUE accordingly. And you may still need special processing if the match is against data in the USER table (such as whether a user's age falls into another user's preferred age range).
Note the index on {NAME, VALUE}, which is meant to help P1.NAME = P2.NAME AND P1.VALUE = P2.VALUE. InnoDB tables are clustered, and one consequence is that secondary indexes contain a copy of the PK fields, which in this case causes the index I1 to completely cover the table. Whether MySQL will actually use it is another matter - as always, look at the query plan and measure on representative data.
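A minimal sketch of the exact-value matching described above, run against an in-memory SQLite database from Python so that it is self-contained (the answer itself targets MySQL/InnoDB). The primary key (USER_ID, NAME), the column types, and the sample rows are assumptions made for illustration; the joins to the USER table are left out for brevity.

```python
import sqlite3

# Hypothetical schema: table, column and index names follow the answer,
# but the key, the types and the rows below are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE USER_PREFERENCE (
        USER_ID INTEGER NOT NULL,
        NAME    TEXT    NOT NULL,
        VALUE   TEXT    NOT NULL,
        PRIMARY KEY (USER_ID, NAME)
    );
    CREATE INDEX I1 ON USER_PREFERENCE (NAME, VALUE);
    INSERT INTO USER_PREFERENCE VALUES
        (1, 'religion', 'any'), (1, 'smoking', 'no'),  (1, 'food', 'veg'),
        (2, 'religion', 'any'), (2, 'smoking', 'no'),  (2, 'food', 'veg'),
        (3, 'religion', 'any'), (3, 'smoking', 'yes'), (3, 'food', 'veg');
""")


def match_counts(conn):
    """Return (user_id_1, user_id_2, match_count) rows, best matches first."""
    return conn.execute("""
        SELECT P1.USER_ID, P2.USER_ID, COUNT(*) AS MATCH_COUNT
        FROM USER_PREFERENCE P1
        JOIN USER_PREFERENCE P2
          ON P1.NAME = P2.NAME AND P1.VALUE = P2.VALUE
        WHERE P1.USER_ID < P2.USER_ID          -- each pair once, no self-matches
        GROUP BY P1.USER_ID, P2.USER_ID
        ORDER BY MATCH_COUNT DESC, P1.USER_ID, P2.USER_ID
    """).fetchall()


pairs = match_counts(conn)
for user_a, user_b, count in pairs:
    print(user_a, user_b, count)
```

Users 1 and 2 agree on all three preferences, so that pair sorts first. On real data the query plan should still be checked, as the answer says.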

I see something like this:
questions is the list of questions to be answered. question_type is an enumeration that indicates what type of answer is expected (e.g. lookup from question_choices, a date, a number, text, etc.) - whatever types of data you expect to be entered. This, along with the other columns in this table, can drive your input form.
question_choices contains a list of predefined answers to questions (such as a predefined list of religions, or hair color, or eye color, etc.). This can be used to build a drop-down list of values on your input form.
users is pretty self explanatory.
user_characteristics contains a list of my answers to the questionnaire. The weight column indicates how important it is to me that someone looking for me have this same answer. The question_choices_id would be populated if the answer came from a select list built from the question_choices table. Otherwise question_choices_id would be NULL. The converse is true for the value column. value would be NULL if the answer came from a select list built from the question_choices table. Otherwise value would contain the user's hand crafted answer to the question.
user_preferences contains answers to the questionnaire for who I am looking for. The weight column indicates how important it is to me that the person I am looking for have this same answer. The question_choices_id and value columns behave the same as in the user_characteristics table.
SQL to find my match might look something like:
SELECT uc.users_id
      ,SUM(up.weight) AS my_weighted_score_of_them
      ,SUM(uc.weight) AS their_weighted_score_of_me
      ,SUM(up.weight) + SUM(uc.weight) AS combined_weighted_score
FROM user_preferences up
JOIN user_characteristics uc
  ON uc.questions_id = up.questions_id
 -- <=> is MySQL's NULL-safe equality; one of these two columns is always NULL
 AND uc.question_choices_id <=> up.question_choices_id
 AND uc.value <=> up.value
 AND uc.users_id != up.users_id
WHERE up.users_id = :my_user_id
GROUP BY uc.users_id
ORDER BY SUM(up.weight) + SUM(uc.weight) DESC
        ,SUM(up.weight) DESC
        ,SUM(uc.weight) DESC
For performance reasons, an index on user_characteristics (id, questions_id, question_choices_id, value, users_id) and an index on user_preferences (id, questions_id, question_choices_id, value, users_id) would be advisable.
Note that the above SQL can return one row for every other user who shares at least one answer with my preferences. This certainly is NOT desirable. Consequently, one might consider adding HAVING SUM(up.weight) + SUM(uc.weight) > :some_minimum_value - or some other way to further filter the results.
Further tweaks might include only returning people who value an answer as much as or more than I do (i.e. their characteristic weight is >= my preference weight).
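A minimal sketch of the weighted match with the HAVING filter, again using SQLite from Python so it runs on its own. Because either question_choices_id or value is NULL on every row, the comparison has to be NULL-safe: SQLite spells that IS, while MySQL spells it <=>. The column types, the find_matches helper, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_characteristics (
        id INTEGER PRIMARY KEY, users_id INTEGER, questions_id INTEGER,
        question_choices_id INTEGER, value TEXT, weight INTEGER);
    CREATE TABLE user_preferences (
        id INTEGER PRIMARY KEY, users_id INTEGER, questions_id INTEGER,
        question_choices_id INTEGER, value TEXT, weight INTEGER);
    -- Invented data: user 1 wants choice 10 on question 1 and 'tennis' on question 2.
    INSERT INTO user_preferences
        (users_id, questions_id, question_choices_id, value, weight) VALUES
        (1, 1, 10,   NULL,     5),
        (1, 2, NULL, 'tennis', 2);
    INSERT INTO user_characteristics
        (users_id, questions_id, question_choices_id, value, weight) VALUES
        (2, 1, 10,   NULL,     3),
        (2, 2, NULL, 'tennis', 1),
        (3, 1, 10,   NULL,     1),
        (3, 2, NULL, 'golf',   4),
        (4, 1, 11,   NULL,     9);
""")


def find_matches(conn, my_id, min_score):
    """Rows of (user, my score of them, their score of me, combined score)."""
    return conn.execute("""
        SELECT uc.users_id,
               SUM(up.weight) AS my_weighted_score_of_them,
               SUM(uc.weight) AS their_weighted_score_of_me,
               SUM(up.weight) + SUM(uc.weight) AS combined_weighted_score
        FROM user_preferences up
        JOIN user_characteristics uc
          ON uc.questions_id = up.questions_id
         AND uc.question_choices_id IS up.question_choices_id  -- NULL-safe
         AND uc.value IS up.value                              -- NULL-safe
         AND uc.users_id != up.users_id
        WHERE up.users_id = ?
        GROUP BY uc.users_id
        HAVING SUM(up.weight) + SUM(uc.weight) > ?
        ORDER BY combined_weighted_score DESC
    """, (my_id, min_score)).fetchall()


matches = find_matches(conn, my_id=1, min_score=0)
strong_matches = find_matches(conn, my_id=1, min_score=8)
print(matches)
print(strong_matches)
```

User 4 never appears because none of their answers match, and raising the minimum score drops user 3 as well.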

Related

Understanding Normalization & Duplicates - I Guess I Don't - Adding Artist & Title Ids [closed]

I began with a table listing the top 100 songs by date for the years 1958 through 1980. For each date, there are 100 records. Obviously many will be duplicates as a song changes position from week to week. Also, the artists will be duplicated (think Elvis) numerous times. There are ~ 116,000 records in the table.
This table had the following fields
uniq,
date,
artist,
title,
position
To eliminate duplicates (normalization as I understand it) I have modified the table so that it now looks like this
uniq,
date,
artistcode,
titlecode,
position
And have two new tables artists and titles.
Artists looks like this
artist,
artistcode
And titles looks like this
title,
titlecode
To get started in the right direction, I simply want to reassemble (join) these tables so that I have a view that looks like the original table, i.e.
uniq,
date,
artist,
title,
position
and has those ~116,000 records. After reading a couple of books and working through several tutorials, I have come to the conclusion that I either have a misconception of what normalization should do, or am simply headed in the wrong direction.
The SQL syntax to create the view would be much appreciated.

To get back to the original output from the multiple tables, you can use the following syntax with JOINs:
SELECT s.uniq, s.date, a.artist, t.title, s.position
FROM songs AS s
JOIN artists AS a ON a.artistcode = s.artistcode
JOIN titles AS t ON t.titlecode = s.titlecode
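If you are trying to eliminate duplicate song entries, you can add this to the query (with ONLY_FULL_GROUP_BY enabled, the default since MySQL 5.7, the other selected columns then need to be wrapped in aggregates such as MIN()):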
GROUP BY t.title

What "duplicates"? There is nothing wrong per se with the same value appearing multiple times. You need to begin reading some academic textbook(s)/slides/course(s) about information modeling and relational databases.
Each row that is in or not in a table makes a statement about the situation.
The sort of "duplicate" and "redundancy" problems that normalization addresses arise sometimes when multiple rows of the same table say the same thing about the situation. (Which might or might not involve subrow values appearing multiple times.)
E.g. if you had a table like this one but with an additional column, and a given artist/title combination always appeared with the same value in that column (like if an artist never has multiple recordings with the same title charting and you added the playing time of each recording), then there would be a problem. ("... AND recording artist/title is time minutes long")
If you had a table like this one but with an additional column, and a value in it always appeared with the same artist/title combination (like if you added a recording id), then there would be a problem. ("... AND recording recordingcode is of title title by artist artist")
Right now there is no problem. What do you expect as an answer? The answer is, normalization says there's no problem, and your impressions are not informed by normalization.
Normalization does not involve replacing values by ids. Introduced id values have exactly the same pattern of appearances as the values they identify/replaced, so that doesn't "eliminate duplicates", and it adds more "duplicates" of the ids in new tables. The original table as a view is a projection of a join of the new tables on equality of ids. (You might want to have ids for ease of update or data compression (etc) at the expense of more tables & joins (etc). That's a separate issue.)
-- hit `uniq` is title `title` by artist `artist` at position `position` on date `date`
/* FORSOME h.*, a.*, t.*,
        hit h.uniq is title with id h.titlecode by artist with id h.artistcode
            at position h.position on date h.date
    AND artist a.artist has id a.artistcode AND h.artistcode = a.artistcode
    AND title t.title has id t.titlecode AND h.titlecode = t.titlecode
    AND `uniq` = h.uniq AND `title` = t.title AND `artist` = a.artist
    AND `position` = h.position AND `date` = h.date
*/
/* FORSOME h.*, a.*, t.*,
        Hit(h.uniq, h.titlecode, h.artistcode, h.position, h.date)
    AND Artist(a.artist, a.artistcode) AND h.artistcode = a.artistcode
    AND Title(t.title, t.titlecode) AND h.titlecode = t.titlecode
    AND `uniq` = h.uniq AND `title` = t.title AND `artist` = a.artist
    AND `position` = h.position AND `date` = h.date
*/
create view HitOriginal as
select h.uniq, h.date, a.artist, t.title, h.position
from Hit h
join Artist a on h.artistcode = a.artistcode
join Title t on h.titlecode = t.titlecode
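A minimal sketch of the round trip, using SQLite from Python so it runs on its own: split a small chart table into Hit, Artist and Title, then check that the view returns exactly the original rows. The Original table name, the column types, and the placeholder rows are assumptions for illustration, not real chart data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Placeholder rows standing in for the chart table from the question.
    CREATE TABLE Original (
        uniq INTEGER PRIMARY KEY, date TEXT, artist TEXT, title TEXT, position INTEGER);
    INSERT INTO Original VALUES
        (1, '1958-08-04', 'Artist A', 'Song X', 1),
        (2, '1958-08-04', 'Artist B', 'Song Y', 2),
        (3, '1958-08-11', 'Artist A', 'Song X', 2),
        (4, '1958-08-11', 'Artist B', 'Song Y', 1);

    -- Introduce ids for artists and titles.
    CREATE TABLE Artist (artistcode INTEGER PRIMARY KEY, artist TEXT UNIQUE);
    CREATE TABLE Title  (titlecode  INTEGER PRIMARY KEY, title  TEXT UNIQUE);
    INSERT INTO Artist (artist) SELECT DISTINCT artist FROM Original;
    INSERT INTO Title  (title)  SELECT DISTINCT title  FROM Original;

    CREATE TABLE Hit AS
        SELECT o.uniq, o.date, a.artistcode, t.titlecode, o.position
        FROM Original o
        JOIN Artist a ON a.artist = o.artist
        JOIN Title  t ON t.title  = o.title;

    CREATE VIEW HitOriginal AS
        SELECT h.uniq, h.date, a.artist, t.title, h.position
        FROM Hit h
        JOIN Artist a ON h.artistcode = a.artistcode
        JOIN Title  t ON h.titlecode  = t.titlecode;
""")

original = conn.execute("SELECT * FROM Original ORDER BY uniq").fetchall()
rebuilt = conn.execute("SELECT * FROM HitOriginal ORDER BY uniq").fetchall()
# Rows in Hit versus distinct artist ids used in Hit.
hit_codes = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT artistcode) FROM Hit").fetchone()
print(rebuilt == original, hit_codes)
```

Hit still has one row per chart entry, and the artist ids repeat exactly as often as the artist names did, which is the point made above about ids not removing "duplicates".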

LEFT JOIN - narrow things down

I'm currently having a problem with a legacy app I just inherited at my new job. I have a SQL query that takes far too long to return and I need to find a way to speed it up.
This query acts on 3 tables:
SESSION contains all users visits
CONTACT contains all the messages people have been sending through a form and contains a "session_id" field that links back to the SESSION id field
ACCOUNT contains user accounts (people who registered on the website); its "id" field is referenced from SESSION (through the "SESSION.account_id" field). ACCOUNT and CONTACT are not linked in any way other than through the SESSION table (legacy app...).
I can't change this structure, unfortunately.
My query tries to retrieve ALL the interesting sessions to show to the administrator. I need to find all sessions that link back to an account OR a contact form.
Currently, the query is structured like this:
SELECT s.id
    /* a few fields from ACCOUNT and CONTACT tables */
FROM session s
    LEFT JOIN account act ON act.id = s.account_id
    LEFT JOIN contact c ON c.session_id = s.id
WHERE s.programme_id = :program_id
    AND (
        c.id IS NOT NULL
        OR
        act.id IS NOT NULL
    )
The problem is that the SESSION table is growing pretty fast (as you can expect), and with 400k records the query is slow for some programmes (:program_id in the query).
I tried to use a UNION query with two INNER JOIN queries, one between SESSION and ACCOUNT and the other between SESSION and CONTACT, but it doesn't give me the same number of records and I don't really understand why.
Can somebody help me find a better way to write this query?
Thanks a lot in advance.

I think you just need indexes. For this query:
SELECT s.id
/* a few fields from ACCOUNT and CONTACT tables */
FROM session s LEFT JOIN
account act
ON act.id = s.account_id LEFT JOIN
contact c
ON c.session_id = s.id
WHERE s.programme_id = :program_id AND
(c.id IS NOT NULL OR act.id IS NOT NULL);
You want indexes on session(programme_id, account_id, id), account(id) and contact(session_id).
It is important that programme_id be the first column in the index on session.

@Gordon already suggested adding an index, which is generally the easy and effective solution, so I'm going to answer a different part of your question.
I tried to use a UNION query with two INNER JOIN queries, one between
SESSION and ACCOUNT and the other between SESSION and CONTACT, but
it doesn't give me the same number of records and I don't really
understand why.
That part is rather simple: a JOIN returns a result set that contains the rows of both tables joined together. So in the first case you would end up with a result that looks like
session.id, session.column2, session.column3, ..., account.id, account.column2, account.column3, ....
and in the second case with
session.id, session.column2, session.column3, ..., contact.id, contact.column2, contact.column3, ....
A UNION of the two will then fail unless the contact and account tables have the same number of columns with corresponding types, which is unlikely. From the docs (emphasis mine):
The column names from the first SELECT statement are used as the column names for the results returned. Selected columns listed in corresponding positions of each SELECT statement should have the same data type. (For example, the first column selected by the first statement should have the same type as the first column selected by the other statements.)
Just perform both INNER JOINs separately and compare the results if you're unsure.
If you want to stick to a UNION solution, make sure each SELECT lists only corresponding columns: selecting just s.id would be trivial, but it should work, for instance.
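One possible reason for differing row counts, which the question does not confirm, is that UNION removes duplicate rows while the LEFT JOIN version returns one row per matching contact. A minimal sketch using SQLite from Python, with invented rows and only the columns the queries need, selecting s.id in every branch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY);
    CREATE TABLE session (id INTEGER PRIMARY KEY, programme_id INTEGER, account_id INTEGER);
    CREATE TABLE contact (id INTEGER PRIMARY KEY, session_id INTEGER);
    INSERT INTO account VALUES (1);
    -- Session 10 has an account and two messages, 11 has one message, 12 has neither.
    INSERT INTO session VALUES (10, 7, 1), (11, 7, NULL), (12, 7, NULL);
    INSERT INTO contact VALUES (100, 10), (101, 10), (102, 11);
""")

LEFT_JOIN_QUERY = """
    SELECT s.id
    FROM session s
    LEFT JOIN account act ON act.id = s.account_id
    LEFT JOIN contact c ON c.session_id = s.id
    WHERE s.programme_id = ?
      AND (c.id IS NOT NULL OR act.id IS NOT NULL)
"""

UNION_QUERY = """
    SELECT s.id FROM session s
    JOIN account act ON act.id = s.account_id
    WHERE s.programme_id = ?
    UNION
    SELECT s.id FROM session s
    JOIN contact c ON c.session_id = s.id
    WHERE s.programme_id = ?
"""

left_join_ids = sorted(row[0] for row in conn.execute(LEFT_JOIN_QUERY, (7,)))
union_ids = sorted(row[0] for row in conn.execute(UNION_QUERY, (7, 7)))
print(left_join_ids)  # session 10 appears once per contact message
print(union_ids)      # UNION keeps each session id once
```

Comparing the two lists on real data, as suggested above, should show whether duplicates explain the difference.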

Complex MySQL Query to calculate percentage improvement by specific categories

I'm stumped after countless hours of trial and error. I'm not a SQL guru, so I appeal to those who are. I'd like to know if/how it's possible to write a single query whose output looks like:
Specialty            Performance on PreTest Questions
Infectious Disease   25% (37/148)
Internal Medicine    17% (2/12)
Pathology            20% (3/15)
This is an Exam database. What I want is a listing by specialty that shows a percentage. The first number represents the total number of people who got a question correct (37). The second is the total number who answered it at all, right or wrong (148).
Pre-Test Questions
A pretest consists of a set of modules and questions
module 2 questions (1,2),
module 3 questions (1,2,3),
module 4 question (1),
module 5 question (1),
module 6 question (1)
Where Clause
This is part of the where clause. It's how we calculate a "correct" question:
(q.type = 'PASS_FAIL' and e.correct = 'T' )
Here is the part that derives our total of those who answered it at all:
(q.type = 'PASS_FAIL' )
My Best Attempt
I'm convinced that we can't do this for the entire set of pre-test questions as one query, so doing it per-question is ok. I think a parametrized query where we drop in the module and question numbers would be fine.
The best I could come up with is totals by specialty using two separate queries. I couldn't figure out how to make this a single query, nor could I link in the percentage calculation (per specialty). Is it possible????
I am a sponge for knowledge!
-thanks
CREATE
    ALGORITHM = UNDEFINED
    VIEW `PretestTotals_M2_Q1_by_specialty_degree`
AS
    (SELECT a.specialty, COUNT(e.question) AS totals_M2_Q1
     FROM Exam AS e
     JOIN Questions AS q USING (module, question)
     JOIN Accounts AS a USING (user_id)
     WHERE (q.type = 'PASS_FAIL')
       AND (e.module = 2 AND e.question = 1)
     GROUP BY a.specialty
    );
CREATE
    ALGORITHM = UNDEFINED
    VIEW `PretestCorrect_M2_Q1_by_specialty_degree`
AS
    (SELECT a.specialty, a.degree, COUNT(e.question) AS Correct_M2_Q1
     FROM Exam AS e
     JOIN Questions AS q USING (module, question)
     JOIN Accounts AS a USING (user_id)
     WHERE (q.type = 'PASS_FAIL' AND e.correct = 'T')
       AND (e.module = 2 AND e.question = 1)
     GROUP BY a.specialty
    );
Accounts Table
[typical stuff, but I've noted the fields that are critical to this query]
user_id,
degree, #college degree as selected from a dropdown on a form
specialty #medical specialization
Exam Table
[*records the result of an online Exam.Their user_id, the module and question number, the attempt counter, their actual answer and the correctness T/F of that answer*]
user_id,
module,
question,
attempt,
answer,
correct
Questions table
[*Records the module number, question number, text of the actual question and the 'type' of question it is. Three possible types (ALWAYS_PASS,PASS_FAIL,POLLING) as enumerations*]
module,
question,
text,
type
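The two counts described in the question (correct answers and all answers to a PASS_FAIL question) can be computed in a single grouped query with conditional aggregation. This is a minimal sketch under stated assumptions, run on SQLite from Python so it is self-contained: the column types, the pretest_by_specialty helper, and the sample rows are invented, and every Exam row is counted, so repeated attempts by the same user are not collapsed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Accounts (user_id INTEGER PRIMARY KEY, degree TEXT, specialty TEXT);
    CREATE TABLE Questions (module INTEGER, question INTEGER, text TEXT, type TEXT);
    CREATE TABLE Exam (user_id INTEGER, module INTEGER, question INTEGER,
                       attempt INTEGER, answer TEXT, correct TEXT);
    INSERT INTO Accounts VALUES
        (1, 'MD', 'Pathology'), (2, 'MD', 'Pathology'), (3, 'DO', 'Internal Medicine');
    INSERT INTO Questions VALUES
        (2, 1, 'sample question', 'PASS_FAIL'), (2, 2, 'sample poll', 'POLLING');
    INSERT INTO Exam VALUES
        (1, 2, 1, 1, 'a', 'T'), (2, 2, 1, 1, 'b', 'F'), (3, 2, 1, 1, 'a', 'T'),
        (1, 2, 2, 1, 'a', 'T');
""")


def pretest_by_specialty(conn, module, question):
    """Rows of (specialty, correct, answered, percent correct) for one question."""
    return conn.execute("""
        SELECT a.specialty,
               SUM(CASE WHEN e.correct = 'T' THEN 1 ELSE 0 END) AS correct,
               COUNT(*) AS answered,
               ROUND(100.0 * SUM(CASE WHEN e.correct = 'T' THEN 1 ELSE 0 END)
                     / COUNT(*)) AS pct
        FROM Exam e
        JOIN Questions q ON q.module = e.module AND q.question = e.question
        JOIN Accounts a ON a.user_id = e.user_id
        WHERE q.type = 'PASS_FAIL'
          AND e.module = ? AND e.question = ?
        GROUP BY a.specialty
        ORDER BY a.specialty
    """, (module, question)).fetchall()


report = pretest_by_specialty(conn, module=2, question=1)
for specialty, correct, answered, pct in report:
    print(f"{specialty} {pct:.0f}% ({correct}/{answered})")
```

Dropping the module and question filter and adding those two columns to the GROUP BY would give one row per specialty and question; whether that suits the pre-test layout is left open here.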

Database Design/SQL Optimisation: WHERE <id> NOT IN (thousands of IDs)

I've been asked to add functionality to an application that lets users vote between two options: A and B. The table for those questions is very basic:
QUESTIONS
question_id (PK)
option_id_1 (FK)
option_id_2 (FK)
urgent (boolean)
Each time a user votes, the fact that the user has voted is stored in an equally simple table:
USER VOTES
vote_id (PK)
user_id (FK)
question_id (FK)
The algorithm for selecting which question appears when a user requests a new one is complex, but for our purposes we can assume it's random. So, the issue?
Each user will be voting on many questions. Likely hundreds, and possibly thousands. I need to ensure no user is presented with a question they've already voted on, and the only way I can think to do that will, I'm guessing, pound the server into oblivion. Specifically, something like:
SELECT * from questions WHERE question_id NOT in (SELECT question_id from user_votes WHERE user_id = <user_id>) ORDER BY RAND() LIMIT 1.
[Note: RAND() is not actually in the query - it's just there as a substitute for a slightly complex (order_by).]
So, keeping in mind that many users could well have voted on hundreds if not thousands of questions, and that it's not possible to present the questions in a set order...any ideas on how to exclude voted-on questions without beating my server into the ground?
All advice appreciated - many thanks.

JOINs generally perform much better than nested queries in MySQL (that might have changed with the latest MySQL releases, but if you are experiencing performance problems then I guess my statement still holds).
What you could do is simply left join the votes onto the questions and pick only those rows where no vote was joined (i.e. this user has not voted on the question):
SELECT q.*
FROM questions q
LEFT JOIN user_votes uv ON
    uv.question_id = q.question_id AND
    uv.user_id = '<user_id>'
WHERE uv.vote_id IS NULL

RAND() is nasty; however, this may mitigate the problem while giving you the results you need. Since you have mentioned that RAND() is only an example, I can't really provide more specific suggestions than the query below, but replacing the ORDER BY should work just fine.
The more you are able to limit the number of rows in the inner query, the faster the entire query will perform.
SELECT
    q.*
FROM (
    -- First get the questions which have not been answered
    SELECT
        questions.*
    FROM questions
    LEFT JOIN user_votes
        ON user_votes.question_id = questions.question_id
        AND user_votes.user_id = <user_id>
    WHERE user_votes.user_id IS NULL
) q
-- Now get a random 1. I hate RAND().
ORDER BY RAND()
LIMIT 1
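A minimal sketch of the LEFT JOIN ... IS NULL anti-join, using SQLite from Python so it runs on its own. The unique index on (user_id, question_id), the helper names, and the sample rows are assumptions for illustration; SQLite's RANDOM() stands in for MySQL's RAND(), which in turn stands in for the real ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (question_id INTEGER PRIMARY KEY, urgent INTEGER);
    CREATE TABLE user_votes (
        vote_id INTEGER PRIMARY KEY, user_id INTEGER, question_id INTEGER);
    -- Assumed index: lets the join probe one (user, question) pair directly.
    CREATE UNIQUE INDEX user_votes_user_question
        ON user_votes (user_id, question_id);
    INSERT INTO questions VALUES (1, 0), (2, 0), (3, 1), (4, 0);
    INSERT INTO user_votes (user_id, question_id) VALUES (42, 1), (42, 3), (7, 2);
""")

ANTI_JOIN = """
    SELECT q.question_id
    FROM questions q
    LEFT JOIN user_votes uv
           ON uv.question_id = q.question_id
          AND uv.user_id = ?
    WHERE uv.vote_id IS NULL   -- no vote row joined: this user has not voted
"""


def unanswered(conn, user_id):
    """All question ids the user has not voted on yet, in id order."""
    rows = conn.execute(ANTI_JOIN + " ORDER BY q.question_id", (user_id,))
    return [row[0] for row in rows]


def next_question(conn, user_id):
    """One not-yet-voted question picked at random, or None if none are left."""
    row = conn.execute(ANTI_JOIN + " ORDER BY RANDOM() LIMIT 1", (user_id,)).fetchone()
    return row[0] if row else None


print(unanswered(conn, 42))
print(next_question(conn, 42))
```

The user id is passed as a bound parameter rather than pasted into the SQL text.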

SELECT from one table while two conditions are NOT TRUE in another table. Possible?

I'm stuck on my matchmaker site. It's like eHarmony for married couples to find other couples to socialize with.
I'm trying to be a good rdba and am using a million tables then using a number of complicated joins to pull the info out I want. But, I'm stuck on one of the last ones. I can do it in the code, but I should be able to do it in the SQL. Here's the deal:
I'm showing a member profile page with their list of matches (all the matchmaking algorithms are in and working). You can then give someone a thumbs up or down, and it records that in a "verdict" table. I then want to refresh the member profile page and eliminate the people who have been voted on.
Now, I'm only showing 4 matches, so I want you to be able to thumbs up someone or thumbs down them and then they disappear and are replaced by someone else.
The problem is generating a sql statement that also checks the verdicts table.
The hard part is that your id might be in one of two columns: the voter or the votee.
SO,
I have these tables and I will list the columns that matter right now:
couples
couples_id
When a new person signs up, I recalculate the matches table, comparing every person with every other person and entering a compatibility quotient for each pair of couples:
matches_couples
matches_couples_couplea
matches_couples_coupleb
matches_couples_matchfactor
(their compatibility number, I sort by this)
When a person votes up or down on someone, I enter a row for that vote: who voted, about whom, and (a)ccepted or (r)ejected:
verdict_couples
verdict_c_couplea (the person voting)
verdict_c_coupleb (the person they're voting about)
verdict_c_verdict (either 'r' for rejected or 'a' for accepted)
So, this is my current, working SQL:
SELECT
*
FROM
match_couples
WHERE
(match_couples_couple_a = '$couples_id'
or
match_couples_couple_b = '$couples_id')
ORDER BY
match_couples_matchfactor desc
LIMIT 4
But it doesn't take the voting into account, and will still show someone you rejected, or who has already rejected you, or whom you approved. I want to strip out anyone who has rejected you, or whom you rejected, or whom you approved.
So basically, if you're EVER the verdict_c_couplea, I don't want to include the person who was the coupleb, since you've already made a decision about them.
And if you're verdict_c_coupleb, and it's an 'r' for reject in verdict_c_verdict, I don't want to show that person either.
SO, I want some super complicated JOIN or nested EXISTS clause or something that strips those people out (that way, my LIMIT 4 still works).
IF NOT, the brute force method is to take off the limit, then for each of those people above, do a second SQL call to check the verdict table before letting them be part of the final list. But that's a major drag that I'm hoping to avoid.
I was able to get a COUNT of the number of times you approved a couple and they also approved you - a complete match. The answer to the above question, I think, is hiding in this working match count SQL, but I can't even believe I got it to work:
SELECT COUNT( * ) AS matches
FROM (
verdict_couples t1
)
JOIN (
verdict_couples same
) ON ( (
t1.verdict_c_couplea = same.verdict_c_coupleb
)
AND (
same.verdict_c_verdict = 'a'
)
AND (
t1.verdict_c_verdict = 'a'
) )
WHERE
same.verdict_c_couplea = '$couples_id'
and
t1.verdict_c_coupleb = '$couples_id'
Basically the ON clause criss-crosses the WHERE clause, because you're looking for:
id   couplea   coupleb   verdict
54   US        YOU       accept
78   YOU       US        accept
That means we approved YOU and you approved US, and amazingly that works. Somewhere in there are the guts of what I need to limit my matches list to just people I haven't voted on yet and who haven't rejected ME.
Once I figure this out, I'll replicate it for individual matches, as well.
Any help on the joins?
K

SELECT *
FROM match_couples m
WHERE
    (m.match_couples_couple_a = '$couples_id'      # we are couple a
     AND m.match_couples_couple_b NOT IN (         # couple b not in the list of couples which:
         # A. we have voted on before
         select verdict_c_coupleb
         from verdict_couples
         where (verdict_c_couplea = '$couples_id')
         UNION
         # or B. have rejected us
         select verdict_c_couplea
         from verdict_couples
         where (verdict_c_coupleb = '$couples_id'
                AND verdict_c_verdict = 'r')))
    OR
    (m.match_couples_couple_b = '$couples_id'      # we are couple b
     AND m.match_couples_couple_a NOT IN (         # same two exclusions for couple a
         select verdict_c_coupleb
         from verdict_couples
         where (verdict_c_couplea = '$couples_id')
         UNION
         select verdict_c_couplea
         from verdict_couples
         where (verdict_c_coupleb = '$couples_id'
                AND verdict_c_verdict = 'r')))
ORDER BY match_couples_matchfactor desc
LIMIT 4
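A minimal sketch of the same filter, written with NOT EXISTS instead of the NOT IN ... UNION form above and run on SQLite from Python so it is self-contained. The column names follow the question's working query; the open_matches helper, the other_id alias, the column types, and the sample rows are assumptions for illustration. The couple id is bound as a parameter rather than interpolated into the SQL string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE match_couples (
        match_couples_couple_a INTEGER,
        match_couples_couple_b INTEGER,
        match_couples_matchfactor INTEGER);
    CREATE TABLE verdict_couples (
        verdict_c_couplea INTEGER,   -- the couple voting
        verdict_c_coupleb INTEGER,   -- the couple being voted on
        verdict_c_verdict TEXT);     -- 'a' accepted, 'r' rejected
    -- Invented data: couple 1 is matched with couples 2 to 6.
    INSERT INTO match_couples VALUES (1, 2, 90), (3, 1, 80), (1, 4, 70), (5, 1, 60), (1, 6, 50);
    -- Couple 1 accepted couple 2, couple 3 rejected couple 1, couple 4 accepted couple 1.
    INSERT INTO verdict_couples VALUES (1, 2, 'a'), (3, 1, 'r'), (4, 1, 'a');
""")


def open_matches(conn, couple_id, limit=4):
    """Best remaining matches: not voted on by us and not having rejected us."""
    return conn.execute("""
        SELECT m.other_id, m.matchfactor
        FROM (
            -- Normalise each match row to "the other couple", whichever side we are on.
            SELECT CASE WHEN match_couples_couple_a = :me
                        THEN match_couples_couple_b
                        ELSE match_couples_couple_a END AS other_id,
                   match_couples_matchfactor AS matchfactor
            FROM match_couples
            WHERE match_couples_couple_a = :me OR match_couples_couple_b = :me
        ) m
        WHERE NOT EXISTS (
            SELECT 1
            FROM verdict_couples v
            WHERE (v.verdict_c_couplea = :me AND v.verdict_c_coupleb = m.other_id)
               OR (v.verdict_c_couplea = m.other_id AND v.verdict_c_coupleb = :me
                   AND v.verdict_c_verdict = 'r')
        )
        ORDER BY m.matchfactor DESC
        LIMIT :limit
    """, {"me": couple_id, "limit": limit}).fetchall()


remaining = open_matches(conn, couple_id=1)
top_two = open_matches(conn, couple_id=1, limit=2)
print(remaining)
```

Couple 2 drops out because couple 1 already voted on them, and couple 3 drops out because they rejected couple 1; couple 4 stays because an acceptance from the other side is not an exclusion.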
