MySQL finding the most similar user - mysql

Let's say I have a table that looks like this:
Mark - Green
Mark - Blue
Mark - Red
Adam - Yellow
Andrew - Red
Andrew - Green
My objective is to compare the user "Mark" with all the other users in the database, to find out which other user he is most similar to. In this case, he would be most similar to Andrew (2/3 matches) and least similar to Adam (0/3 matches). After I've found out which user is most similar to Mark, I would like to extract the entries that Andrew has but Mark doesn't.
Would this be possible in MySQL? I appreciate all help, thank you guys!
Edit: OVERWHELMED by all the good help! THANK you SO MUCH guys! I will be sure to check out all of your contributions!

The following query attempts to list all the users with the number of matches to Mark. It basically joins the table with Mark's entries and counts the common entries for all users.
SELECT ours.user, theirs.user, count(*) as `score`
FROM tableName as `theirs`, (SELECT *
FROM tableName
WHERE user = 'Mark') as `ours`
WHERE theirs.user != 'Mark' AND
theirs.color = ours.color
GROUP BY theirs.user
ORDER BY score DESC
The query, however, wouldn't work if there's duplicate data (i.e. one person picks the same color twice). But that shouldn't be a problem as you mention in the comments that it wouldn't occur.
The query can be modified to show the score for all users:
SELECT ours.user as `myUser`, theirs.user as `theirUser`, count(*) as `score`
FROM tableName as `ours`, tableName as `theirs`
WHERE theirs.user != ours.user AND
theirs.color = ours.color
GROUP BY ours.user, theirs.user
ORDER BY score DESC
Let Q be the above query that gives you the most similar user. Once you have that user, you can use it to show the distinct entries between them. Here's what we're trying to do:
SELECT *
FROM tableName as theirs
WHERE user = 'Andrew'
AND NOT EXISTS (SELECT 1
FROM tableName as ours
WHERE ours.user = 'Mark'
AND ours.color = theirs.color)
Replacing the inputs Andrew and Mark from Q:
SELECT similar.myUser, theirs.user, theirs.color
FROM tableName as theirs JOIN (Q) as similar
ON theirs.user = similar.theirUser
WHERE NOT EXISTS (SELECT 1
FROM tableName as ours
WHERE ours.user = similar.myUser
AND ours.color = theirs.color)
Here's the final query up and running. Hope this makes sense.
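The two steps above (score every other user, then list what the best match has that Mark lacks) can be tried end to end with a small sketch. This uses Python's built-in sqlite3 module rather than MySQL, and the table name `picks` and the helper names are assumptions made for illustration only; the SQL itself follows the same self-join and NOT EXISTS pattern.

```python
import sqlite3

# Sample rows from the question; table and column names are assumptions.
ROWS = [
    ("Mark", "Green"), ("Mark", "Blue"), ("Mark", "Red"),
    ("Adam", "Yellow"),
    ("Andrew", "Red"), ("Andrew", "Green"),
]

def make_db():
    """Create an in-memory database holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE picks (user TEXT, color TEXT)")
    conn.executemany("INSERT INTO picks VALUES (?, ?)", ROWS)
    return conn

def most_similar(conn, me):
    """Return (other_user, shared_color_count) for the best match, or None."""
    return conn.execute(
        """
        SELECT theirs.user, COUNT(*) AS score
        FROM picks AS ours
        JOIN picks AS theirs
          ON theirs.color = ours.color AND theirs.user <> ours.user
        WHERE ours.user = ?
        GROUP BY theirs.user
        ORDER BY score DESC, theirs.user
        LIMIT 1
        """,
        (me,),
    ).fetchone()

def missing_colors(conn, me, other):
    """Return the colors that `other` has and `me` does not."""
    rows = conn.execute(
        """
        SELECT theirs.color
        FROM picks AS theirs
        WHERE theirs.user = ?
          AND NOT EXISTS (SELECT 1 FROM picks AS ours
                          WHERE ours.user = ? AND ours.color = theirs.color)
        ORDER BY theirs.color
        """,
        (other, me),
    ).fetchall()
    return [color for (color,) in rows]
```

With the sample data, Andrew has no color that Mark lacks, so the second step returns an empty list for Mark; swapping the arguments shows that Mark has Blue and Andrew does not.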

Use FULLTEXT indexes. Your query would then look like:
SELECT * FROM user WHERE MATCH (name,color) AGAINST ('Mark blue');
Or, the simplest way is a LIKE search:
SELECT * FROM user WHERE name LIKE '%Mark%' OR color = 'blue'
You can choose whichever way is more suitable for you.

select
t1.name,
sum(case when t2.cnt > t1.cnt then t1.cnt else t2.cnt end) matches
from (
select name, color, count(*) cnt
from tableName -- placeholder; TABLE itself is a reserved word
where name <> 'Mark'
group by name, color
) t1 left join (
select color, count(*) cnt
from tableName
where name = 'Mark'
group by color
) t2 on t2.color = t1.color
group by t1.name
order by matches desc
The derived table t1 contains the # of colors each user (except Mark) has, t2 contains the same for Mark. The tables are then left joined on the color and the smaller of the 2 counts is taken i.e. if Amy has 2 reds and Mark has 1 red, then 1 is taken as the number of matches. Finally group by name and return the largest sum of matches.
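The "take the smaller of the two counts" rule is easy to check on a few rows. The sketch below is an assumption-laden illustration, not the answer's exact query: it runs on SQLite through Python's sqlite3 module, uses a hypothetical table `picks`, replaces the CASE expression with SQLite's two-argument scalar MIN, and uses an inner join, so users with no color in common are simply omitted.

```python
import sqlite3

def duplicate_aware_scores(rows, me):
    """Score every other user against `me`, counting repeated colors.

    rows: iterable of (name, color) tuples; duplicates are allowed.
    Returns a list of (name, matches), best match first.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE picks (name TEXT, color TEXT)")
    conn.executemany("INSERT INTO picks VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT t1.name, SUM(MIN(t1.cnt, t2.cnt)) AS matches
        FROM (SELECT name, color, COUNT(*) AS cnt
              FROM picks WHERE name <> ? GROUP BY name, color) AS t1
        JOIN (SELECT color, COUNT(*) AS cnt
              FROM picks WHERE name = ? GROUP BY color) AS t2
          ON t2.color = t1.color
        GROUP BY t1.name
        ORDER BY matches DESC, t1.name
        """,
        (me, me),
    ).fetchall()
    conn.close()
    return result
```

For example, if Mark has two reds and Amy has three, Amy is credited with two matches for red, because only two of her reds can be paired with Mark's.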

select other.name, count(*) as matches
from tableName as mine
join tableName as other -- MATCH and TABLE are reserved words, so use other names
on other.name <> mine.name
and mine.name = 'Mark'
and other.color = mine.color
group by other.name
order by matches desc

Below query returns matching score between name and matching_name and the maximum score it could get, so that you know what % value your matching has.
This code counts duplicate values in color column as only one, so that if you have record Mark - Red twice, it will only count as 1.
select
foo.name, foo.matching_name, count(*) AS matching_score, goo.color_no AS max_score
from
(
select
distinct a.name, a.color, b.name AS matching_name
from
(
select name, color from yourtable
) a
left join yourtable b on a.color = b.color and a.name <> b.name
where b.name is not null
) foo
left join ( select name, count(distinct color) AS color_no from yourtable group by name ) goo
on foo.name = goo.name
group by foo.name, foo.matching_name
Attaching SQLFiddle to preview the output.

This should get you close. The complexity comes from the fact that you allow each user to pick each color multiple times and require that each otherwise-identical pair be matched in the other user you're comparing to. Therefore, we're really interested in knowing how many total picks a user has per color and how that number compares to the other user's count for that same color.
First, we create a derived relation that does the simple math for us (counting up each user's number of picks by color):
CREATE VIEW UserColorCounts (User, Color, TimesSeen)
AS SELECT User, Color, COUNT(*) FROM YourTable GROUP BY User, Color
Second, we need some kind of relation that compares each color-count for the primary user to the color-counts of each secondary user:
CREATE VIEW UserColorMatches (User, OtherUser, Color, TimesSeen, TimesMatched)
AS SELECT P.User, S.User, P.Color, P.TimesSeen, LEAST(P.TimesSeen, S.TimesSeen)
FROM UserColorCounts P LEFT OUTER JOIN UserColorCounts S
ON P.Color = S.Color AND P.User <> S.User
Finally, we total up the color-counts for each primary user and compare against the matching color counts for each secondary user:
SELECT User, OtherUser, SUM(TimesMatched) AS Matched, SUM(TimesSeen) AS OutOf
FROM UserColorMatches WHERE OtherUser IS NOT NULL
GROUP BY User, OtherUser

Related

SQL query with most recent name and total count

I already have a table, "table_one", set up on phpMyAdmin that has the following columns:
USER_ID: A discord user ID (message.author.id)
USER_NAME: A discord username (message.author.name)
USER_NICKNAME: The user's display name on the server (message.author.display_name)
TIMESTAMP: A datetime timestamp when the message was entered (message.created_at)
MESSAGE_CONTENT: The cleaned keyword the user entered; for this example, consider "apple" and "orange" as the two target keywords.
What I'd like as a result is a view or query that returns a table with the following:
The user's most recent display name (USER_NICKNAME), based off the most recent timestamp
The total number of times a user has entered a specific keyword. Such as confining the search to only include "apple" but not instances "orange"
My intention is that if a user entered a keyword 10 times, then changed their server nickname and entered the same keyword 10 more times, the result would show their most recent nickname and that they entered the keyword 20 times in total.
This is the closest I have gotten to my desired result so far. The query correctly groups instances where a user has changed their nickname, based on the static Discord ID, but I would like it to keep this behavior while showing the most recent USER_NICKNAME instead of the USER_ID:
SELECT USER_ID, COUNT(USER_ID)
FROM table_one
WHERE MESSAGE_CONTENT = 'apple'
GROUP BY USER_ID
I don't think there is an uncomplicated way to do this. In Postgres, I would use the SELECT DISTINCT ON to get the nickname, but in MySQL I believe you are limited to JOINing grouped queries.
I would combine two queries (or three, depending how you look at it).
First, to get the keyword count, use your original query:
SELECT USER_ID, COUNT(USER_ID) as apple_count
FROM table_one
WHERE MESSAGE_CONTENT = 'apple'
GROUP BY USER_ID;
Second, to get the last nickname, group by USER_ID without subsetting rows and use the result as a subquery in a JOIN statement:
SELECT a.USER_ID, a.USER_NICKNAME AS last_nickname
FROM table_one a
INNER JOIN
(SELECT USER_ID, MAX(TIMESTAMP) AS max_ts
FROM table_one
GROUP BY USER_ID) b
ON a.USER_ID = b.USER_ID AND TIMESTAMP = max_ts
I would then JOIN these two, using a WITH statement to increase the clarity of what's going on:
WITH
nicknames AS
(SELECT a.USER_ID, a.USER_NICKNAME AS last_nickname
FROM table_one a
INNER JOIN
(SELECT USER_ID, MAX(TIMESTAMP) AS max_ts
FROM table_one
GROUP BY USER_ID) b
ON a.USER_ID = b.USER_ID AND TIMESTAMP = max_ts),
counts AS
(SELECT USER_ID, COUNT(USER_ID) AS apple_count
FROM table_one
WHERE MESSAGE_CONTENT = 'apple'
GROUP BY USER_ID)
SELECT nicknames.USER_ID, nicknames.last_nickname, counts.apple_count
FROM nicknames
INNER JOIN counts
ON nicknames.USER_ID = counts.USER_ID;
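The combined query can be exercised on a handful of rows with the sketch below. It runs on SQLite via Python's sqlite3 module rather than MySQL, stores timestamps as ISO-formatted text (an assumption, so that MAX compares them correctly), and the function name `keyword_counts` is hypothetical.

```python
import sqlite3

def keyword_counts(rows, keyword):
    """Return (latest_nickname, keyword_count) per user who typed `keyword`.

    rows: iterable of (user_id, nickname, timestamp, message_content),
    with timestamps as 'YYYY-MM-DD HH:MM:SS' strings.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table_one (USER_ID TEXT, USER_NICKNAME TEXT, "
        "TIMESTAMP TEXT, MESSAGE_CONTENT TEXT)"
    )
    conn.executemany("INSERT INTO table_one VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        WITH nicknames AS (
            -- nickname on each user's most recent row, whatever the keyword
            SELECT a.USER_ID, a.USER_NICKNAME AS last_nickname
            FROM table_one a
            JOIN (SELECT USER_ID, MAX(TIMESTAMP) AS max_ts
                  FROM table_one GROUP BY USER_ID) b
              ON a.USER_ID = b.USER_ID AND a.TIMESTAMP = b.max_ts
        ),
        counts AS (
            -- how often each user entered the requested keyword
            SELECT USER_ID, COUNT(*) AS keyword_count
            FROM table_one
            WHERE MESSAGE_CONTENT = ?
            GROUP BY USER_ID
        )
        SELECT n.last_nickname, c.keyword_count
        FROM nicknames n
        JOIN counts c ON n.USER_ID = c.USER_ID
        ORDER BY n.USER_ID
        """,
        (keyword,),
    ).fetchall()
    conn.close()
    return result
```

One thing to keep in mind: if a user has two rows with exactly the same latest timestamp, the nicknames part returns both, and the user shows up twice in the result.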

Creating a SQL view with personal best records

I have the following SQL Database structure:
Users are the registered users. Maps are like circuits or race tracks. When a user finishes a run, a new time record is created containing the userId, the mapId and the time needed to finish the race track.
I wish to create a view where all the users personal bests on all maps are listed.
I tried creating the view like this:
CREATE VIEW map_pb AS
SELECT MID, UID, TID
FROM times
WHERE score IN (SELECT MIN(score) FROM times)
ORDER BY registered
This does not produce the desired result.
Thank you for your help!
I assume you have a 'times' table as in the diagram above, with a 'score' column that you use to measure the best record (MIN(score) being the best).
You can create a view of the personal best records by joining against a grouped subquery, like this:
CREATE VIEW map_pb AS
SELECT a.MID, a.UID, a.TID
FROM times a
INNER JOIN (
-- best score per user per map
SELECT MID, UID, MIN(score) score
FROM times
GROUP BY MID, UID
) b ON a.MID = b.MID AND a.UID = b.UID AND a.score = b.score
-- if you have a 'registered' column in the 'times' table to order the result
ORDER BY a.registered
I hope this may work.
You probably need to use a query that will first return the minimum score for each user on each map. Something like this:
SELECT UID,
MID,
MIN(score) AS best_time
FROM times
GROUP BY UID, MID
Note: I used MIN(score) as this is what is shown in your example query, but perhaps it should be MIN(time) instead?
Then just use the subquery JOINed to your other tables to get the output:
SELECT *
FROM (
SELECT UID,
MID,
MIN(score) AS best_time
FROM times
GROUP BY UID, MID
) a
INNER JOIN users u ON u.UID = a.UID
INNER JOIN maps m ON m.MID = a.MID
Of course, replace SELECT * with the columns you actually want.
Note: code untested but does give an idea as to a solution.
Start with a subquery to determine each user's minimum time on each map:
SELECT UID, MID, MIN(time) time
FROM times
GROUP BY UID, MID
Then join that subquery into a main query:
SELECT times.UID, times.MID, times.TID,
mintimes.time
FROM times
JOIN (
SELECT UID, MID, MIN(time) time
FROM times
GROUP BY UID, MID
) mintimes ON times.MID = mintimes.MID
AND times.UID = mintimes.UID
AND times.time = mintimes.time
JOIN maps ON times.MID = maps.MID
JOIN users ON times.UID = users.UID
This query pattern uses a GROUP BY subquery to find the extreme (MIN in this case) value for each combination. It then joins back to the table to find the detail record for each extreme value.
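The pattern can be checked on a few rows with the sketch below. It assumes a `times` table with columns TID, UID, MID and an integer `score` where lower is better; those column meanings are guesses based on the question, and the example runs on SQLite through Python's sqlite3 module rather than MySQL.

```python
import sqlite3

def personal_bests(rows):
    """Return each user's best run per map as (MID, UID, TID, score).

    rows: iterable of (TID, UID, MID, score); lower score is better.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE times (TID INTEGER PRIMARY KEY, UID INTEGER, "
        "MID INTEGER, score INTEGER)"
    )
    conn.executemany("INSERT INTO times VALUES (?, ?, ?, ?)", rows)
    # Group by user AND map, then join back to recover the full record.
    conn.execute(
        """
        CREATE VIEW map_pb AS
        SELECT t.MID, t.UID, t.TID, t.score
        FROM times t
        JOIN (SELECT UID, MID, MIN(score) AS best
              FROM times GROUP BY UID, MID) b
          ON t.UID = b.UID AND t.MID = b.MID AND t.score = b.best
        """
    )
    result = conn.execute(
        "SELECT MID, UID, TID, score FROM map_pb ORDER BY MID, UID, TID"
    ).fetchall()
    conn.close()
    return result
```

If a user ties their own best score on a map, both records appear in the view, since the join matches every row that carries the minimum.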

Retrieving most recent records in groups

What I'm looking to do is this...
I have a database that logs changes in the status of a record...
For example, it may be set as "inactive" then, a later row might reactivate the record...
1 Company1 Active
2 Company2 Active
3 Company1 Inactive
4 Company1 Active
The query needs to look for currently active results... and should return two records... (one for Company1 and one for Company2.)
I need to return only records that are CURRENTLY active.
This query does part of it...
SELECT id, gid, status
FROM companies
GROUP BY gid
HAVING status = 'Active'
ORDER BY id
But it doesn't look for results to return based on the last record...
What I am basically looking for is how to incorporate something that checks only the most recent record, like "LIMIT 1,1" with "ORDER BY id DESC", within each group... I have no idea how to incorporate it into the query.
Update I've got it down to this so far... Based on an answer but it's bringing back the last row of each group whether it is currently active or not...
select t.*
from (
select status, max(id) as id
from companies
group by gid
having status = 'Active'
) active_companies
inner join companies t on active_companies.id = t.id
If your IDs are always in ascending order, this will select the most recent ID for every company:
SELECT MAX(id), gid
FROM companies
GROUP BY gid
and this will return all the records that you need:
SELECT companies.*
FROM companies INNER JOIN (SELECT MAX(id) m_id
FROM companies
GROUP BY gid) m
ON companies.id = m.m_id
WHERE
status = 'Active'
You might also be tempted to use this little trick:
SELECT *
FROM (SELECT * FROM companies ORDER BY id DESC) t
GROUP BY gid
HAVING status='Active';
but please don't! It almost always works, looks cleaner, and is often faster, but it relies on undocumented behaviour, so it is not guaranteed that you will always get the correct result.
Please see fiddle here!
The following query returns the current status for each company, using a MySQL trick:
select gid, substring_index(group_concat(status order by id desc), ',', 1) as status
from companies
group by gid;
If you want only the ones that are currently active:
select gid
from companies
group by gid
having 'active' = substring_index(group_concat(status order by id desc), ',', 1)
Let's split the problem into two smaller ones:
First, you want to get the most recent entry from each group:
select max(id)
from companies
group by gid;
Next, you want to keep only those last entries that are active:
select t.*
from companies t
inner join (
select max(id) as id
from companies
group by gid
) last_entries
on t.id = last_entries.id
where t.status = 'Active';
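The two-step approach (latest row per group first, status filter second) can be verified with the question's own sample rows. The sketch below runs on SQLite via Python's sqlite3 module; the function name `currently_active` is hypothetical, and it assumes, as the answers do, that a higher id means a more recent row.

```python
import sqlite3

def currently_active(rows):
    """Return the latest row of each gid, but only if that row is Active.

    rows: iterable of (id, gid, status); a higher id is assumed to be newer.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE companies (id INTEGER PRIMARY KEY, gid TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO companies VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT t.id, t.gid, t.status
        FROM companies t
        JOIN (SELECT MAX(id) AS id      -- latest row per group
              FROM companies
              GROUP BY gid) last_entries
          ON t.id = last_entries.id
        WHERE t.status = 'Active'       -- filter only after picking the row
        ORDER BY t.id
        """
    ).fetchall()
    conn.close()
    return result
```

The order of the two steps matters: filtering on status before taking MAX(id) would return the last Active row of a company even when a later row has deactivated it.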
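Try a subselect that gets the records where status = 'Active', ordered by id descending (I assume id is auto-increment), then do the GROUP BY in the parent select. Note that this filters on status before picking a row per group, so it also returns companies whose most recent row is Inactive.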
SELECT t.* FROM (
SELECT *
FROM companies
WHERE `status` = 'Active'
ORDER BY id DESC
) t
GROUP BY t.gid

MySQL evaluate case with subquery

I am trying to create a custom sort that involves the count of some records in another table. For example, if one record has no records associated with it in the other table, it should appear higher in the sort than if it had one or more records. Here's what I have so far:
SELECT People.*, Organizations.Name AS Organization_Name,
(CASE
WHEN Sent IS NULL AND COUNT(SELECT * FROM Graphics WHERE People.Organization_ID = Graphics.Organization_ID) = 0 THEN 0
ELSE 1
END) AS Status
FROM People
LEFT JOIN Organizations ON Organizations.ID = People.Organization_ID
ORDER BY Status ASC
The subquery within the COUNT is not working. What is the correct way to do something like this?
Update: I moved the case statement into the order by clause and added a join:
SELECT People.*, Organizations.Name AS Organization_Name
FROM People
LEFT JOIN Organizations ON Organizations.ID = People.Organization_ID
LEFT JOIN Graphics ON Graphics.Organization_ID = People.Organization_ID
GROUP BY People.ID
ORDER BY
CASE
WHEN Sent IS NULL AND Graphics.ID IS NULL THEN 0
ELSE 1
END ASC
So if the People record does not have any graphics, Graphics.ID will be null. This achieves the immediate need.
If what you tried does not work, it can be done by joining against a subquery, and placing the CASE expression into ORDER BY as well:
SELECT
People.*,
COALESCE(orgcount.num, 0) AS num
FROM People LEFT JOIN (
SELECT Organization_ID, COUNT(*) AS num FROM Graphics GROUP BY Organization_ID
) orgcount ON People.Organization_ID = orgcount.Organization_ID
-- LEFT JOIN keeps people without graphics; their orgcount.num is NULL
ORDER BY
CASE WHEN Sent IS NULL AND orgcount.num IS NULL THEN 0 ELSE 1 END,
orgcount.num DESC
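To see the sort in action, here is a small sketch that joins a per-organization count with a LEFT JOIN and puts the CASE expression in ORDER BY. The column lists for People and Graphics are assumptions (the question does not show the schema), and the example runs on SQLite through Python's sqlite3 module.

```python
import sqlite3

def sorted_people(people, graphics):
    """Sort people so that unsent people without graphics come first.

    people:   iterable of (ID, Name, Organization_ID, Sent)
    graphics: iterable of (ID, Organization_ID)
    Returns a list of (ID, Name, graphics_count).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE People (ID INTEGER PRIMARY KEY, Name TEXT, "
        "Organization_ID INTEGER, Sent TEXT)"
    )
    conn.execute(
        "CREATE TABLE Graphics (ID INTEGER PRIMARY KEY, Organization_ID INTEGER)"
    )
    conn.executemany("INSERT INTO People VALUES (?, ?, ?, ?)", people)
    conn.executemany("INSERT INTO Graphics VALUES (?, ?)", graphics)
    result = conn.execute(
        """
        SELECT p.ID, p.Name, COALESCE(g.num, 0) AS graphics_count
        FROM People p
        LEFT JOIN (SELECT Organization_ID, COUNT(*) AS num
                   FROM Graphics
                   GROUP BY Organization_ID) g
          ON g.Organization_ID = p.Organization_ID
        ORDER BY
          CASE WHEN p.Sent IS NULL AND g.num IS NULL THEN 0 ELSE 1 END,
          p.ID
        """
    ).fetchall()
    conn.close()
    return result
```

Because the counts come from a grouped subquery, each person appears exactly once and no GROUP BY is needed on the outer query.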
You could use an outer join to the Graphics table to get the data needed for your sort.
Since I don't know your schema, I made an assumption that the People table has a primary key column called ID. If the PK column has a different name, you should substitute that in the GROUP BY clause.
Something like this should work for you:
SELECT People.*, (count(Distinct Graphics.Organization_ID) > 0) as Status
FROM People
LEFT OUTER JOIN Graphics ON People.Organization_ID = Graphics.Organization_ID
GROUP BY People.ID
ORDER BY Status ASC
Fairly straightforward with a LEFT JOIN, provided you have some kind of primary key in the People table to GROUP on:
SELECT p.*, sent IS NOT NULL or COUNT(g.Organization_ID) Status
FROM People p LEFT JOIN Graphics g ON g.Organization_ID = p.Organization_ID
GROUP BY p.primary_key
ORDER BY Status
Demo here.

MySQL: count records by value and then update another column with their count

Three very related SQL questions (I am actually using mySQL):
Assuming a people table with two columns name, country
1) How can I show the people who have fellow citizens? I can display the count of citizens by country:
select country, count(*) as fellow_citizens
from people
group by country
But I can't seem to show those records for which fellow_citizens is > 1. The following is invalid:
select name, country, count(*) as fellow_citizens
from people
group by country
where fellow_citizens > 1;
... and would not be what I want any way since I don't want to group people.
2) To solve the above, I had the idea to store fellow_citizens in a new column.
This sounds like a variation on questions such as MySQL: Count records from one table and then update another. The difference is that I want to update the same table.
However the answer given there doesn't work here:
update people as dest
set fellow_citizens =
(select count(*) from people as src where src.country = dest.country);
I get this error message:
You can't specify target table 'dest' for update in FROM clause
It seems like I need to go through a temporary table to do that. Can it be done without a temporary table?
3) As a variant of the above, how can I update a column with a people counter by country? I found I can update a global counter with something like:
set @i:=0; update people set counter = @i:=@i+1 order by country;
But here I would like to reset my @i counter when the value of country changes.
Is there a way to do that in mySQL without going to full-blown procedural code?
Your select query should look something like:
SELECT name, country, count(*) as fellow_citizens
FROM people
GROUP BY country
HAVING fellow_citizens > 1;
Recommended solution
You don't want to group, you just want to repeat the count for every person.
This data is not normalized and code like this is not a great way to do things.
However the following select will give you all the people and their counts:
SELECT p.*
, s.fellow_citizens
FROM people p
INNER JOIN (
SELECT country, count(*) as fellow_citizens
FROM people
GROUP BY country
HAVING count(*) > 1) s ON (s.country = p.country)
If you don't want the actual count, just people from countries with more than 1 citizen, another way to write this query is (this may be faster than previous method, you have to test with your data):
SELECT p.*
FROM people p
WHERE EXISTS
( SELECT *
FROM people p2
WHERE p2.country = p.country
AND p2.PK <> p.PK -- the Primary Key of the table
)
Only do this if you are facing slowness
If you want to store the counts using an update statement, I suggest you use a separate count table, because at least that's somewhat normalized.
INSERT INTO people_counts (country, pcount)
SELECT p.country, count(*) as peoplecount
FROM people p
GROUP BY p.country
ON DUPLICATE KEY UPDATE pcount = VALUES(pcount)
Then you can join the counts table with the people data to speed up the select.
SELECT p.*, c.pcount
FROM people p
INNER JOIN people_counts c ON (p.country = c.country)
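The first approach (join every person to a grouped count filtered with HAVING) can be tried on a few rows with the sketch below. It runs on SQLite through Python's sqlite3 module, and the function name is hypothetical.

```python
import sqlite3

def people_with_fellow_citizens(rows):
    """Return (name, country, fellow_citizens) for people whose country
    has more than one person in the table.

    rows: iterable of (name, country).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, country TEXT)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT p.name, p.country, s.fellow_citizens
        FROM people p
        JOIN (SELECT country, COUNT(*) AS fellow_citizens
              FROM people
              GROUP BY country
              HAVING COUNT(*) > 1) s   -- HAVING filters after grouping
          ON s.country = p.country
        ORDER BY p.country, p.name
        """
    ).fetchall()
    conn.close()
    return result
```

The grouping happens only inside the subquery, so the outer query still returns one row per person, which is what the question asks for.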
1 ->
select country, count(*) as fellow_citizens
from people
group by country
having fellow_citizens > 1
Sorry, but could not really understand points 2 and 3.
I would do this.
create view citizencount as
select country, count(*) citizens
from people
group by country
Then your Query becomes (Something like)
select p.personid, p.country, cc.citizens -1 fellow_citizens
from people p
join citizencount cc on
cc.country = p.country
You might then want to add (a WHERE clause cannot refer to the fellow_citizens alias, so filter on the underlying column)
where cc.citizens > 1
Have you tried this for both of your problems?
select name, fellow_citizens
from people left join
(select country, count(*) as fellow_citizens
from people group by country
having fellow_citizens > 1) as citizens
on people.country = citizens.country
I am not very good at writing SQL queries, but I think this can solve your problem.