Finding a list of mutual followers in mysql - mysql

I'm working on a SQL query which requires me to generate a list of mutual followers (e.g. A follows B and B follows A).
Our table consists of FriendshipID, FollowerUserID and FollowingUserID. Note that FollowerUserID follows FollowingUserID.
I have tried creating a view table using the code below:
Create view MutualFriends AS
(select distinct a.FollowerUserID, a.FollowingUserID
from friendship as a, friendship as b
where a.FollowerUserID = b.FollowingUserID and a.FollowingUserID = b.FollowerUserID);
However, the view returns each pair twice, e.g. 1 follows 3 is repeated as 3 follows 1.
How can we remove the repeated rows in the view?
Or are there any other ways to generate the results (without repeating)?

Here is one way to do this, using a LEAST/GREATEST trick:
SELECT
LEAST(FollowerUserID, FollowingUserID) AS friend1,
GREATEST(FollowerUserID, FollowingUserID) AS friend2
FROM (SELECT DISTINCT FollowerUserID, FollowingUserID FROM friendship) t
GROUP BY
friend1,
friend2
HAVING
COUNT(*) = 2;
The subquery aliased as t first removes all duplicate follower/following pairs from the original friendship table. This may be necessary if a given one way relationship could appear more than once in the table, e.g. 1 -> 2 appears twice for some reason.
Then, we aggregate by the least of the follower/following and the greatest of the same pair. If we find that there are two such records, then it implies that the follower and following are mutual.
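The aggregation described above can be tried out with a small script. This is only a sketch: it uses Python's built-in sqlite3 module rather than MySQL, so SQLite's two-argument min()/max() stand in for LEAST()/GREATEST(), and the sample rows are made up for illustration.

```python
import sqlite3

def mutual_pairs(rows):
    """Return each mutual-follow pair once, as (smaller_id, larger_id)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE friendship (FollowerUserID INT, FollowingUserID INT)")
    con.executemany("INSERT INTO friendship VALUES (?, ?)", rows)
    # SQLite's scalar min()/max() play the role of MySQL's LEAST()/GREATEST().
    query = """
        SELECT min(FollowerUserID, FollowingUserID) AS friend1,
               max(FollowerUserID, FollowingUserID) AS friend2
        FROM (SELECT DISTINCT FollowerUserID, FollowingUserID FROM friendship) t
        GROUP BY friend1, friend2
        HAVING COUNT(*) = 2
        ORDER BY friend1, friend2
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

# Hypothetical data: 1<->3 and 2<->4 are mutual, 1->2 is one-way,
# and 1->3 is stored twice to show why the DISTINCT subquery matters.
sample = [(1, 3), (3, 1), (1, 3), (1, 2), (2, 4), (4, 2)]
print(mutual_pairs(sample))  # [(1, 3), (2, 4)]
```

Without the DISTINCT subquery, the pair (1, 3) in this sample would be counted three times and dropped by HAVING COUNT(*) = 2.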

I agree with the solution from @Tim, but I think we don't need the subquery in the FROM step (select distinct ... from friendship), provided each follower/following pair is stored at most once:
select
least(FollowerUserID, FollowingUserID) as u1,
greatest(FollowerUserID, FollowingUserID) as u2
from friendship
group by u1,u2
having count(*) = 2

Related

Get a certain query result from two different mysql tables

I'm having the following problem:
I am trying to implement an achievement system. I have two tables. Table 1 contains the achievement_id and achievement_info. Table 2 contains the link to the user, meaning achievement_id and player_id, so that you can tell which user has achieved certain things.
I'm trying to write a method that returns me all achievements, but additionally a flag that tells me if a certain user has achieved this row or not.
E.g.: getPlayerAchievements(playerid) --> returns a list of Achievements with id, info, and a bool flag whether the user has achieved it.
table 1:
achievement_id | achievement_info
1              | info1
2              | info2
3              | info3
table 2:
achievement_id | player_id
1              | 15
3              | 15
the result I need by entering the player_id "15":
achievement_id | achievement_info | (bool)achieved
1              | info1            | true
2              | info2            | false
3              | info3            | true
I already have the achievement class so I just have to fill them with my data.
I could always use two separate SQL queries to achieve that, but I thought maybe there was a way to simplify it, since I use PHP to get my data and don't want two connections and queries in one PHP script.
You want to select all records from the achievements table and show them. That's the easy part :-) For every record you want to show whether player 1234 has attained this achievement. You can do this with an EXISTS clause:
select
achievement_id,
achievement_info,
exists
(
select *
from players p
where p.player_id = 1234
and p.achievement_id = a.achievement_id
) as achieved
from achievements a;
Or even simpler with IN:
select
achievement_id,
achievement_info,
achievement_id in (select achievement_id from players where player_id = 1234) as achieved
from achievements;
You can use left join to get a complete list of achievements and the matching records from the user's achievements table:
select t1.achievement_id, t1.achievement_info, (t2.achievement_id is not null) as achieved
from table1 t1
left join table2 t2 on t1.achievement_id=t2.achievement_id and t2.player_id=15
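As a rough check of the LEFT JOIN approach, the sketch below loads the sample data from the question into an in-memory SQLite database (standing in for MySQL) and builds the flag so that it is true when a matching row exists, i.e. when t2.achievement_id IS NOT NULL. The function name is made up for illustration.

```python
import sqlite3

def player_achievements(player_id):
    """Return (achievement_id, achievement_info, achieved) for every achievement."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE table1 (achievement_id INT, achievement_info TEXT);
        CREATE TABLE table2 (achievement_id INT, player_id INT);
        INSERT INTO table1 VALUES (1, 'info1'), (2, 'info2'), (3, 'info3');
        INSERT INTO table2 VALUES (1, 15), (3, 15);
    """)
    rows = con.execute("""
        SELECT t1.achievement_id, t1.achievement_info,
               (t2.achievement_id IS NOT NULL) AS achieved
        FROM table1 t1
        LEFT JOIN table2 t2
          ON t1.achievement_id = t2.achievement_id AND t2.player_id = ?
        ORDER BY t1.achievement_id
    """, (player_id,)).fetchall()
    con.close()
    # SQLite returns the flag as 0/1, so convert it to a Python bool.
    return [(aid, info, bool(flag)) for aid, info, flag in rows]

print(player_achievements(15))
```

A single query returns the full list, so the PHP side would need only one round trip.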

Return records from one table where Field not found in another table

I'm so lost here I don't even know how to best title my question.
I am creating a simple dating site. I want the women to be able to block the men just like all other dating sites. When this happens, I don't want the women's profiles to be returned in a query.
Table A
Members table containing all the profile information including a member name
Table B
Blocked members table containing the woman's name and the man's name for each case in which the woman has blocked a man
So, I want something like this:
$query = Return all records from table A where sex=female and there is no record in table B containing the woman's name and the man's name
I thought I would run a query against table B to retrieve all women who have blocked me, then run a query against table A to return all females in which the woman's username is NOT contained in the results of my first query. However, I can't figure out how to do this.
If I understand your question...seems like a simple join, no? Not sure if I'm misunderstanding. Something like this perhaps:
SELECT * FROM Table1 WHERE Table1.ID NOT IN (SELECT BLOCK_ID FROM table2)
So Table1 has all ID's of the women, and Table2 has all block id's (for example) and you want what is not in that? Obviously some changes required on top of this.
If you wanted to see a list of all the female members who had blocked the current user, you would use a query like:
SELECT member.*
FROM TableA member
JOIN TableB blocked ON (member.name = blocked.user_who_blocked)
WHERE member.sex = 'female'
AND blocked.blocked_user_name = 'Joe McCurrentUser'
;
So, if you want to see the set of users where that is not the case, you use a LEFT JOIN and look for a null id. The filter on the blocked user has to move into the ON clause; in the WHERE clause it would discard exactly the unmatched rows you are looking for.
SELECT member.*
FROM TableA member
LEFT JOIN TableB blocked ON (member.name = blocked.user_who_blocked
AND blocked.blocked_user_name = 'Joe McCurrentUser')
WHERE member.sex = 'female'
AND blocked.id IS NULL
;
You can modify as you wish to use the actual columns in your tables. Make sure you have indices on both the user_who_blocked and blocked_user_name columns in that table.
Would this work?
Select a.* from TableA a
left join TableB b on a.womans_name = b.womans_name and b.mans_name = 'Mans Name'
where b.womans_name IS NULL
If Table B contains a record with the matching womans_name and mans_name then the join will create one record containing all the fields in Table A and Table B but the Where clause will reject this record because the womans_name from Table B will not be null. If Table B does not contain a matching record then all those fields will be null (including B.womans_name) so the Where clause is satisfied.
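The outer-join behaviour described above can be checked with a small script. This is a sketch only: the table names members and blocked, their columns, and the sample people are assumptions made for illustration, and SQLite stands in for MySQL.

```python
import sqlite3

def visible_women(current_user):
    """Names of female members who have not blocked current_user."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE members (name TEXT, sex TEXT);
        CREATE TABLE blocked (womans_name TEXT, mans_name TEXT);
        INSERT INTO members VALUES
            ('alice', 'female'), ('beth', 'female'), ('carol', 'female'),
            ('joe', 'male'), ('dan', 'male');
        INSERT INTO blocked VALUES ('alice', 'joe'), ('beth', 'dan');
    """)
    rows = con.execute("""
        SELECT a.name
        FROM members a
        LEFT JOIN blocked b
          ON a.name = b.womans_name AND b.mans_name = ?  -- filter stays in ON
        WHERE a.sex = 'female' AND b.womans_name IS NULL  -- keep unmatched rows only
        ORDER BY a.name
    """, (current_user,)).fetchall()
    con.close()
    return [name for (name,) in rows]

print(visible_women('joe'))  # ['beth', 'carol']
```

Keeping the man's name in the ON clause is what lets unmatched women survive the join with NULLs in the blocked columns.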

Finding and dealing with duplicate users

In a large user database with the following format and sample data, we are trying to identify duplicated people:
id  first_name  last_name  email
---------------------------------------------------
1   chris       baker
2   chris       baker      chris@gmail.com
3   chris       baker      chris@hotmail.com
4   chris       baker      crayzyguy@crazy.com
5   carl        castle     castle@npr.org
6   mike        rotch      fakeuser@sample.com
I am using the following query:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
This works great; I get a list of duplicates with the id numbers of the involved rows.
We would re-assign any associated data tied to a duplicate to the actual person (set user_id = 2 where user_id = 3), then we delete the duplicating user row.
The trouble comes after we make this report the first time, as we clean up the list after manually verifying that they are indeed duplicates -- some ARE NOT duplicates. There are 2 Chris Bakers that are legitimate users.
We don't want to keep seeing Chris Baker in subsequent duplicate reports until the end of time, so I am looking for a way to flag that user id 1 and user id 4 are NOT duplicates of each other for future reports, but they could be duplicated by new users added later.
What I tried
I added an is_not_duplicate field to the user table, but then if a new duplicate "Chris Baker" gets added to the database, the new duplicate will not show up on the duplicate report; the is_not_duplicate flag improperly excludes one of the accounts. My HAVING statement would not meet the > 1 threshold until there are -two- duplicates of Chris Baker, plus the "real" one marked is_not_duplicate.
Question Summed Up
How can I build exceptions into the above query without looping results or multiple queries?
Sub-queries are fine, but the size of the dataset makes every query count and I'd like the solution to be as performant as possible.
Try to add the is_not_duplicate boolean field and modify your code as follows:
SELECT
GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count",
SUM(is_not_duplicate) AS "real_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
duplicate_count - real_count > 0
Newly added duplicates will have is_not_duplicate=0 so the real_count for that name will be less than duplicate_count and the row will be shown
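To see how the comparison of duplicate_count and real_count behaves, here is a minimal sketch using SQLite in place of MySQL (|| replaces CONCAT, and the aggregates are repeated in HAVING instead of using the aliases). The sample rows are hypothetical.

```python
import sqlite3

def duplicate_report(users):
    """users: iterable of (id, first_name, last_name, is_not_duplicate)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE users (id INT, first_name TEXT, last_name TEXT, "
        "is_not_duplicate INT DEFAULT 0)"
    )
    con.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", users)
    rows = con.execute("""
        SELECT upper(first_name) || upper(last_name) AS name,
               COUNT(*) AS duplicate_count,
               SUM(is_not_duplicate) AS real_count
        FROM users
        GROUP BY upper(first_name) || upper(last_name)
        HAVING COUNT(*) > 1
           AND COUNT(*) - SUM(is_not_duplicate) > 0
        ORDER BY name
    """).fetchall()
    con.close()
    return rows

# After clean-up: the two legitimate Chris Bakers are flagged, so nothing is reported.
cleaned = [(1, 'chris', 'baker', 1), (4, 'chris', 'baker', 1), (5, 'carl', 'castle', 0)]
print(duplicate_report(cleaned))  # []

# A new, unflagged Chris Baker makes the name show up again.
print(duplicate_report(cleaned + [(7, 'Chris', 'Baker', 0)]))  # [('CHRISBAKER', 3, 2)]
```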
My brain is too fried to come up with the actual query for this at the moment, but I might be able to give you a nudge in a path that should work :)
What if you did add another column (maybe a table of valid duplicated users instead?...both will accomplish the same thing), and ran a subquery that would count up all of the valid duplicates and then you could compare against the count in your current query. You would exclude any users that have matching counts, and would pull in any with counts that are higher. Hopefully that makes sense; I will create a use case:
Chris Baker with id 1 and 4 are marked as valid_duplicates
There are 4 Chris Baker's in the system
You get a count of valid Chris Baker's
You get a count of all Chris Baker's
valid_count <> total_count, so return Chris Baker
*You probably can even modify the query so that it does not even list the duplicate id's (even if you get a duplicate marking of only 1 id). Rather than having to re-check which are the valids. This would be a little more complicated. Without it, at least you ignore Chris Baker until another enters the system
I have written up the basic query, dealing with excluding specific id's I will try to roll in tonight. But, this at least solves your initial need. If you do not need the more complicated query, do let me know so that I do not waste my time on it :)
SELECT
GROUP_CONCAT(u.id) AS "ids",
CONCAT(UPPER(u.first_name), UPPER(u.last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users AS u
LEFT JOIN
(
-- count the rows already confirmed as valid duplicates, per name
SELECT
CONCAT(UPPER(first_name), UPPER(last_name)) AS valid_name,
COUNT(*) AS valid_duplicate_count
FROM
users
WHERE
is_valid_duplicate = 1 -- true
GROUP BY
valid_name
) AS valid_users
ON valid_users.valid_name = CONCAT(UPPER(u.first_name), UPPER(u.last_name))
GROUP BY
name
HAVING
duplicate_count > 1
-- hide names whose rows have all been confirmed as valid duplicates
AND duplicate_count <> COALESCE(MAX(valid_users.valid_duplicate_count), 0)
Below is the query that should do the same as above, but the final list will only print the id's that are not in the valid list. This actually ended up being a lot simpler than I thought. I kept the query above so that both options are available. If CTE's or temp tables are available to you, they might make the query more expressive. Hopefully this helps and is what you are looking for
SELECT GROUP_CONCAT(u.id) AS "ids",
CONCAT(UPPER(u.first_name), UPPER(u.last_name)) AS "name",
COUNT(*) AS "final_duplicate_count"
-- This count could actually be 1 due to the nature of the query
FROM
users AS u
WHERE
-- ignore users that are valid when doing the actual counts
-- (assumes is_valid_duplicate is NOT NULL and defaults to 0)
u.is_valid_duplicate = 0
-- keep only names that occur more than once overall
AND EXISTS
(
SELECT 1
FROM users AS other
WHERE
other.id <> u.id
AND CONCAT(UPPER(other.first_name), UPPER(other.last_name)) =
CONCAT(UPPER(u.first_name), UPPER(u.last_name))
)
GROUP BY
name
Since this is basically a many-to-many relationship I would add a new table not_duplicate with fields user1 and user2.
I would probably add two rows for each not_duplicate relationship such that I have one row for 2 -> 3 and a symmetric row for 3 -> 2 to ease querying, but that may introduce data inconsistencies so make sure you delete both rows at the same time (or have only one row and make the correct query in your script).
Well, it seems to me that the is_not_duplicate column is not complex enough to hold the information you want to store. From what I understand, you want to manually tell your detection that two distinct users are not duplicates of each other. So either you create a column like is_not_duplicate_of=other-user-id or, if you want to keep the possibility open that one user can be manually defined as not a duplicate of more than one user, you need a separate table with two user-id columns.
The query telling you the non-overridden duplicates probably has to be a bit more complex than the one you suggested; I cannot think of one that works with GROUP BY and HAVING logic. The only thing that comes to my mind is something like
SELECT u1.* FROM users u1
INNER JOIN users u2
ON u1.id <> u2.id
AND UPPER(u2.first_name) = UPPER(u1.first_name)
AND UPPER(u2.last_name) = UPPER(u1.last_name)
WHERE NOT EXISTS (
SELECT *
FROM users_non_dups un
WHERE (un.id1 = u1.id AND un.id2 = u2.id)
OR (un.id1 = u2.id AND un.id2 = u1.id)
)
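A pairwise check against a table of confirmed non-duplicates can be sketched as follows. SQLite stands in for MySQL, the users_non_dups table with columns id1 and id2 follows the naming used above, and the sample rows are hypothetical. One deliberate variation: the join uses u1.id < u2.id so that each pair is reported once rather than in both directions.

```python
import sqlite3

def unreviewed_pairs(users, non_dups):
    """Return (id, id) pairs with matching names that are not listed as non-duplicates."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (id INT, first_name TEXT, last_name TEXT)")
    con.execute("CREATE TABLE users_non_dups (id1 INT, id2 INT)")
    con.executemany("INSERT INTO users VALUES (?, ?, ?)", users)
    con.executemany("INSERT INTO users_non_dups VALUES (?, ?)", non_dups)
    rows = con.execute("""
        SELECT u1.id, u2.id
        FROM users u1
        INNER JOIN users u2
          ON u1.id < u2.id
         AND upper(u2.first_name) = upper(u1.first_name)
         AND upper(u2.last_name) = upper(u1.last_name)
        WHERE NOT EXISTS (
            SELECT 1
            FROM users_non_dups un
            WHERE (un.id1 = u1.id AND un.id2 = u2.id)
               OR (un.id1 = u2.id AND un.id2 = u1.id)  -- pair may be stored either way
        )
        ORDER BY u1.id, u2.id
    """).fetchall()
    con.close()
    return rows

people = [(1, 'chris', 'baker'), (2, 'Chris', 'Baker'),
          (4, 'chris', 'baker'), (5, 'carl', 'castle')]
print(unreviewed_pairs(people, [(4, 1)]))  # [(1, 2), (2, 4)]
```

Because the exception is stored per pair, a newly added Chris Baker is still compared against users 1 and 4 individually.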
If you were to correct all duplicates each time you run the report, then a very simple solution might be to modify the query:
SELECT
GROUP_CONCAT(id) AS "ids",
MAX(id) AS "max_id",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name",
COUNT(*) AS "duplicate_count"
FROM
users
GROUP BY
name
HAVING
duplicate_count > 1
AND
max_id > MAX_ID_LAST_TIME_DUPLICATE_REPORT_WAS_GENERATED;
I would go ahead and make the "confirmed_unique" column, defaulted as "False."
In order to avoid the problems you mentioned, I would then select all elements that look like duplicates and have a "False" entry for "confirmed_unique."
I am not sure if this will work, but could you consider the reverse logic of adding a *is_duplicate_of* column? That way you can mark duplicates by entering the ID of the first record at this column which will be greater than zero. The records that you wish to retain will have a 0 value at this field. You can set the default (unchecked records) to -1 to keep track of the validation status for each record.
Afterwards you can keep executing an SQL that will compare new records only with correct records having is_duplicate_of = 0 .
If you are OK with a slight change to the format of the report, you could do a self-join like this:
SELECT
CONCAT(u1.id,",", u2.id) AS "ids",
CONCAT(UPPER(u1.first_name), UPPER(u1.last_name)) AS "name"
FROM
users u1, users u2
WHERE
u1.id < u2.id AND
UPPER(u1.first_name) = UPPER(u2.first_name) AND
UPPER(u1.last_name) = UPPER(u2.last_name) AND
CONCAT(u1.id,",", u2.id) NOT IN (SELECT ids from not_dupe)
which reports duplicates as follows:
ids | name
----|--------
1,2 | CHRISBAKER
1,3 | CHRISBAKER
...
And the not_dupe table would have rows like below:
ids
------
1,2
3,4
...
I think it would make sense to create a lookup table storing the ids of the users that are not duplicates. That way confirmed non-duplicates are removed, and the query only has to add a small lookup against that table for the duplicates it actually finds.
For instance, in this example we would have
id 1 | id 2
2    | 4
if crayzyguy@crazy.com and chris@gmail.com are different persons.
If I were you, I would add some geolocation tables/fields to my database schema.
The probability two end-users are having the same names AND are living in the same place is very very low - except in very big town - but you can split geolocalization to small areas too - it's about granularity.
Good luck.
I would suggest you to create a couple of things:
A Boolean column to flag confirmed users
A String column to save ids
A trigger that will check if the first name and last name are already there to fill up the flag, and save in the string column all ids to which this one is a possible duplicate.
And then build a report that looks for duplicated true and decode the string field to match the possible duplicated
I gave Justin Pihony +1 as the 1st to suggest comparing the duplicate count with the not duplicate count, and Hrant Khachatrian +1 for being the 1st to show an efficient way of doing that.
Here is a slightly different method, plus some renaming to make everything a bit more self explanatory, plus some extra columns in the query to make it obvious which records need to be compared as potential duplicates.
I would call the new column "CONFIRMED_UNIQUE" instead of "IS_NOT_DUPLICATE". Like Hrant I would make it Boolean (tinyint(1) with 0=FALSE and 1=TRUE).
The "potential_duplicate_count" is the maximum number of records that would have to be deleted.
select
group_concat(case when not confirmed_unique then id end) as potential_duplicate_ids,
group_concat(case when confirmed_unique then id end) as confirmed_unique_ids,
concat(upper(first_name), upper(last_name)) as name,
sum( case when not confirmed_unique then 1 end ) - (not max(confirmed_unique)) as potential_duplicate_count
from
users
group by
name
having
potential_duplicate_count > 0
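The arithmetic behind potential_duplicate_count can be checked with a short sketch. SQLite replaces MySQL here, so || is used instead of concat and the filter is applied in an outer query rather than through a HAVING alias. The sample rows are hypothetical.

```python
import sqlite3

def potential_duplicates(users):
    """users: iterable of (id, first_name, last_name, confirmed_unique)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE users (id INT, first_name TEXT, last_name TEXT, "
        "confirmed_unique INT)"
    )
    con.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", users)
    rows = con.execute("""
        SELECT name, potential_duplicate_count FROM (
            SELECT upper(first_name) || upper(last_name) AS name,
                   -- unconfirmed rows, minus one if no row is confirmed yet
                   SUM(CASE WHEN NOT confirmed_unique THEN 1 END)
                     - (NOT MAX(confirmed_unique)) AS potential_duplicate_count
            FROM users
            GROUP BY upper(first_name) || upper(last_name)
        )
        WHERE potential_duplicate_count > 0
        ORDER BY name
    """).fetchall()
    con.close()
    return rows

rows = [
    (1, 'chris', 'baker', 1), (4, 'chris', 'baker', 1), (7, 'chris', 'baker', 0),
    (5, 'carl', 'castle', 0),
    (8, 'mike', 'rotch', 0), (9, 'mike', 'rotch', 0),
]
print(potential_duplicates(rows))  # [('CHRISBAKER', 1), ('MIKEROTCH', 1)]
```

When no row in a group is confirmed, one of the rows has to be the original, which is why 1 is subtracted in that case.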
I see someone else has been voted down for the suggestion of merging, but nothing about your problem statement says the data needs to stay in place. The OP followed up with their solution, which happens to be a pure SQL one, but that doesn't imply that every solution needs to be limited to that.
The issue as I understand is around contacts having multiple, similar, but not necessarily identical records in your database, which has cost and reputational implications so you're looking to deduplicate these records.
I would write a batch job that searches for potential duplicates (this can be as complicated or as simple as you like) and then close the two records that it finds are dupes and create a new record.
To enable that you'd need four new columns:
Status, which would be either Open, Merged, Split
RelatedId, which would hold the value of who the record was merged with
ChainId, the new record Id
DateStatusChanged, obvious enough
Open would be the default status
Merged would be when the record is merged (effectively closed and replaced)
Split would be if the merge was reversed
So, as an example, go through all of the records that, for example, have the same name. Merge them in pairs. So if you have three Chris Bakers, records 1, 2 and 3, merge 1 and 2 to make record 4 and then 3 and 4 to make record 5. Your table would end up something like:
ID  NAME         STATUS  RELATEDID  CHAINID  DATESTATUSCHANGED  [other rows omitted]
1   Chris Baker  MERGED  2          4        27-AUG-2012
2   Chris Baker  MERGED  1          4        27-AUG-2012
3   Chris Baker  MERGED  4          5        28-AUG-2012
4   Chris Baker  MERGED  3          5        28-AUG-2012
5   Chris Baker  OPEN
This way you have a full record of what has happened to your data and can reverse any changes by unmerging. If, for example, contacts 1 and 2 weren't the same, you reverse the merge of 3 and 4, then reverse the merge of 1 and 2, and you'd end up with this:
ID  NAME         STATUS  RELATEDID  CHAINID  DATESTATUSCHANGED
1   Chris Baker  SPLIT   2          4        29-AUG-2012
2   Chris Baker  SPLIT   1          4        29-AUG-2012
3   Chris Baker  SPLIT   4          5        29-AUG-2012
4   Chris Baker  CLOSED  3          5        29-AUG-2012
5   Chris Baker  CLOSED                      29-AUG-2012
You could then manually merge, as you'd probably not want your job to automatically remerge split records.
Is there a good reason for not merging duplicate accounts into a single account?
From the comments, it seems like the information is being used mostly for contact information, so merging should be relatively painless and low risk. Once you merge users they will no longer appear in your duplicate report. Furthermore, your users table will actually shrink, which could help with performance.
Add an is_not_duplicate column of type BIT to your table and, once its values are set, use the query below:
SELECT GROUP_CONCAT(id) AS "ids",
CONCAT(UPPER(first_name), UPPER(last_name)) AS "name"
FROM users
GROUP BY name
HAVING COUNT(*) > 1
AND COUNT(*) > SUM(CAST(is_not_duplicate AS UNSIGNED))
The query above compares the total number of rows for each name with the number of rows marked as valid duplicates.
Why don't you make the email column to be a unique identifier in this case, and after you cleanse your records once, you do not allow duplicates from there onwards?

A MySQL query addressing three tables: How many from A are not in B or C?

I have a problem formulating a MySQL query to do the following task, although I have seen similar queries discussed here, they are sufficiently different from this one to snooker my attempts to transpose them. The problem is (fairly) simple to state. I have three tables, 'members', 'dog_shareoffered' and 'dog_sharewanted'. Members may have zero, one or more adverts for things they want to sell or want to buy, and the details are stored in the corresponding offered or wanted table, together with the id of the member who placed the ad. The column 'id' is unique to the member, and common to all three tables. The query I want is to ask how many members have NOT placed an ad in either table.
I have tried several ways of asking this. The closest I can get is a query that doesn't crash! (I am not a MySQL expert by any means). The following I have put together from what I gleaned from other examples, but it returns zero rows, where I know the result should be greater than zero.
SELECT id
FROM members
WHERE id IN (SELECT id
FROM dog_sharewanted
WHERE id IS NULL)
AND id IN (SELECT id
FROM dog_shareoffered
WHERE id IS NULL)
This query looks pleasingly simple to understand, unlike the JOINs I've seen, but I am guessing that maybe I need some sort of join. How would that look in this case?
If you want no ads in either table, then the sort of query you are after is:
SELECT id
FROM members
WHERE id NOT IN ( any id from any other table )
To select ids from other tables:
SELECT id
FROM <othertable>
Hence:
SELECT id
FROM members
WHERE id NOT IN (SELECT id FROM dog_shareoffered)
AND id NOT IN (SELECT id FROM dog_sharewanted)
An earlier version of this answer used SELECT DISTINCT in the subqueries, because one member may put in many ads but there's only one id. As the comments below mention, this is not necessary: NOT IN gives the same result either way.
If you wanted to avoid a sub-query (a possible performance increase, depending..) you could use some LEFT JOINs:
SELECT members.id
FROM members
LEFT JOIN dog_shareoffered
ON dog_shareoffered.id = members.id
LEFT JOIN dog_sharewanted
ON dog_sharewanted.id = members.id
WHERE dog_shareoffered.id IS NULL
AND dog_sharewanted.id IS NULL
Why this works:
It takes the table members and joins it to the other two tables on the id column.
The LEFT JOIN means that if a member exists in the members table but not the table we're joining to (e.g. dog_shareoffered), then the corresponding dog_shareoffered columns will have NULL in them.
So, the WHERE condition picks out rows where there's a NULL id in both dog_shareoffered and dog_sharewanted, meaning we've found ids in members with no corresponding id in the other two tables.
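Both variants can be compared side by side on a tiny data set. This is a sketch under assumptions: SQLite is used instead of MySQL, the ad column and the sample ids are invented, and the table and id column names follow the question.

```python
import sqlite3

def members_without_ads():
    """Return the ids found by the NOT IN query and by the LEFT JOIN query."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE members (id INT);
        CREATE TABLE dog_shareoffered (id INT, ad TEXT);
        CREATE TABLE dog_sharewanted (id INT, ad TEXT);
        INSERT INTO members VALUES (1), (2), (3), (4), (5);
        INSERT INTO dog_shareoffered VALUES (1, 'a'), (1, 'b'), (2, 'c');
        INSERT INTO dog_sharewanted VALUES (2, 'd'), (3, 'e');
    """)
    not_in = con.execute("""
        SELECT id
        FROM members
        WHERE id NOT IN (SELECT id FROM dog_shareoffered)
          AND id NOT IN (SELECT id FROM dog_sharewanted)
        ORDER BY id
    """).fetchall()
    left_join = con.execute("""
        SELECT members.id
        FROM members
        LEFT JOIN dog_shareoffered ON dog_shareoffered.id = members.id
        LEFT JOIN dog_sharewanted ON dog_sharewanted.id = members.id
        WHERE dog_shareoffered.id IS NULL
          AND dog_sharewanted.id IS NULL
        ORDER BY members.id
    """).fetchall()
    con.close()
    return [i for (i,) in not_in], [i for (i,) in left_join]

ids_not_in, ids_left_join = members_without_ads()
print(ids_not_in, ids_left_join, len(ids_not_in))  # [4, 5] [4, 5] 2
```

Since the question asks how many members have placed no ad, the count is simply the number of ids returned, or SELECT COUNT(*) in place of SELECT id.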

How do I select a record from one table in a mySQL database, based on the existence of data in a second?

Please forgive my ignorance here. SQL is decidedly one of the biggest "gaps" in my education that I'm working on correcting, come October. Here's the scenario:
I have two tables in a DB that I need to access certain data from. One is users, and the other is conversation_log. The basic structure is outlined below:
users:
id (INT)
name (TXT)
conversation_log
userid (INT) // same value as id in users - actually the only field in this table I want to check
input (TXT)
response (TXT)
(note that I'm only listing the structure for the fields that are {or could be} relevant to the current challenge)
What I want to do is return a list of names from the users table that have at least one record in the conversation_log table. Currently, I'm doing this with two separate SQL statements, with the one that checks for records in conversation_log being called hundreds, if not thousands of times, once for each userid, just to see if records exist for that id.
Currently, the two SQL statements are as follows:
select id from users where 1; (gets the list of userid values for the next query)
select id from conversation_log where userid = $userId limit 1; (checks for existing records)
Right now I have 4,000+ users listed in the users table. I'm sure that you can imagine just how long this method takes. I know there's an easier, more efficient way to do this, but being self-taught, this is something that I have yet to learn. Any help would be greatly appreciated.
You have to do what is called a 'Join'. This, um, joins the rows of two tables together based on values they have in common.
See if this makes sense to you:
SELECT DISTINCT users.name
FROM users JOIN conversation_log ON users.id = conversation_log.userid
Now JOIN by itself is an "inner join", which means that it will only return rows that both tables have in common. In other words, if a specific conversation_log.userid doesn't exist, it won't return any part of the row, user or conversation log, for that userid.
Also, +1 for having a clearly worded question : )
EDIT: I added a "DISTINCT", which filters out all of the duplicates. If a user appeared in more than one conversation_log row, and you didn't have DISTINCT, you would get the user's name more than once. This is because JOIN returns every combination of rows from the two tables that matches your JOIN ON criteria.
Something like this:
SELECT *
FROM users
WHERE EXISTS (
SELECT *
FROM conversation_log
WHERE users.id = conversation_log.userid
)
In plain English: select every row from users, such that there is at least one row from conversation_log with the matching userid.
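The EXISTS form and the JOIN with DISTINCT can be compared on a few rows. The sketch below uses SQLite in place of MySQL and invents three users and three log rows for illustration; it replaces the per-user loop with a single query.

```python
import sqlite3

def users_with_conversations():
    """Return user names found via EXISTS and via JOIN + DISTINCT."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE users (id INT, name TEXT);
        CREATE TABLE conversation_log (userid INT, input TEXT, response TEXT);
        INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
        INSERT INTO conversation_log VALUES
            (1, 'hi', 'hello'), (1, 'bye', 'goodbye'), (3, 'hey', 'yo');
    """)
    via_exists = con.execute("""
        SELECT name
        FROM users
        WHERE EXISTS (
            SELECT 1
            FROM conversation_log
            WHERE users.id = conversation_log.userid
        )
        ORDER BY name
    """).fetchall()
    via_join = con.execute("""
        SELECT DISTINCT users.name
        FROM users
        JOIN conversation_log ON users.id = conversation_log.userid
        ORDER BY users.name
    """).fetchall()
    con.close()
    return [n for (n,) in via_exists], [n for (n,) in via_join]

names_exists, names_join = users_with_conversations()
print(names_exists, names_join)  # ['ann', 'cy'] ['ann', 'cy']
```

An index on conversation_log.userid would likely help either form on a large table.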
What you need to read is JOIN syntax.
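SELECT count(conversation_log.userid), users.name
FROM users left join conversation_log on users.id = conversation_log.userid
Group by users.name
Counting conversation_log.userid rather than * matters here: with a left join, count(*) is 1 even for a user without any records. To keep only the users that have at least one record, add at the end:
HAVING count(conversation_log.userid) > 0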