How to build a matrix that contains counts of matches between tables? - mysql

Other than diving in brute force one query at a time, I'm stumped on a repeatable, efficient way of doing this:
- Assume I have 4 ticketed events around the country (EventA-2018, EventB-2018, EventC-2018, and EventD-2018).
- I now need to present a simple 4x4 table with counts of people who attended X and also attended Y.
- Each event has an associated MySQL table (e.g., event-a-2018-buyers, event-b-2018-buyers, etc.), and each one contains a single column called email representing the buyer.
The resulting table should look something like this:
+------------+-------------+-------------+-------------+-------------+
| | EventA-2018 | EventB-2018 | EventC-2018 | EventD-2018 |
+------------+-------------+-------------+-------------+-------------+
|EventA-2018 | X | a | b | c |
+------------+-------------+-------------+-------------+-------------+
|EventB-2018 | a | X | d | e |
+------------+-------------+-------------+-------------+-------------+
|EventC-2018 | b | d | X | f |
+------------+-------------+-------------+-------------+-------------+
|EventD-2018 | c | e | f | X |
+------------+-------------+-------------+-------------+-------------+
So the top row basically says, "Of the people who bought tickets for EventA-2018, there were a who also bought for EventB-2018, b who also bought for EventC-2018, and c who also bought for EventD-2018".
The diagonal isn't important since it would represent 100% each time.
Out of the 12 remaining cells, I obviously only need to fill in 6 since they are repeated (i.e., a, b, c, d, e, f).
There are actually more than 4 events and each one happens each year, but I'm assuming I can adapt any solutions to expand accordingly.
My current MySQL skills stop just after doing a join on two of the event tables. So I could easily figure out the 6 inner joins I need to run to fill in this matrix and manually build the table, but I'm hoping there is a more programmatic way that will allow me to automate it into a dashboard.
Here is how I would fill in one cell at a time:
SELECT
COUNT(a.email) AS cell_a
FROM
(SELECT DISTINCT email FROM eventa_2018) AS a
INNER JOIN (SELECT DISTINCT email FROM eventb_2018) AS b ON a.email = b.email;
SIDE NOTE: A completely different approach I'm considering is to combine all tables into one with only two fields: email and event. Then I could strip out everyone who only attended one event. For the rest, I could create a simpler report showing the counts of people who attended each combination of more than one event (whereas the table above only shows two events at a time). The resulting business case for all of this is to learn where to invest in more cross promotion of events and create segments of most valuable customers.

Not an answer. Too long for a comment.
A normalised schema might look something like this:
event year buyer
a 2018 joe@amgil.com
b 2018 kat@plape.com
Start there. See my comments above, and then get back to us.
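A minimal sketch of where the normalised layout leads, written in Python with the standard-library sqlite3 module standing in for MySQL. The table name attendance, the helper overlap_matrix, and the sample addresses are illustrative assumptions, not part of the question's schema. A single self-join on buyer counts, for every pair of events, the distinct buyers who appear in both, which is the whole matrix in one query:

```python
import sqlite3

def overlap_matrix(rows):
    """Return {(event_x, event_y): shared_buyers} for each pair of distinct events."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE attendance (event TEXT, year INTEGER, buyer TEXT)")
    con.executemany("INSERT INTO attendance VALUES (?, ?, ?)", rows)
    # Self-join on buyer: each joined row pairs two different events bought by
    # the same person in the same year. COUNT(DISTINCT ...) ignores repeat purchases.
    query = """
        SELECT x.event, y.event, COUNT(DISTINCT x.buyer)
        FROM attendance AS x
        JOIN attendance AS y
          ON x.buyer = y.buyer AND x.year = y.year AND x.event <> y.event
        GROUP BY x.event, y.event
    """
    counts = {(ex, ey): n for ex, ey, n in con.execute(query)}
    con.close()
    return counts

# Hypothetical sample data.
sample = [
    ("a", 2018, "joe@example.com"),
    ("b", 2018, "joe@example.com"),
    ("a", 2018, "kat@example.com"),
    ("b", 2018, "kat@example.com"),
    ("c", 2018, "kat@example.com"),
    ("c", 2018, "sam@example.com"),
    ("a", 2018, "joe@example.com"),  # repeat purchase, must not be counted twice
]
matrix = overlap_matrix(sample)
events = sorted({event for event, _, _ in sample})
for ex in events:
    # Pairs with no shared buyers are absent from the query result, so default to 0.
    print(ex, ["X" if ex == ey else matrix.get((ex, ey), 0) for ey in events])
```

The query itself is plain SQL and should carry over to MySQL unchanged; pairs of events with no shared buyers simply do not appear in the result, so the dashboard layer has to fill those cells with 0.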

Related

How to extract relational data from a flat table using SQL?

I have a single flat table containing a list of people which records their participation in different groups and their activities over time. The table contains following columns:
- name (first/last)
- e-mail
- secondary e-mail
- group
- event date
+ some other data in a series of columns, relevant to a specific event (meeting, workshop).
I want to extract distinct people from that into a separate table, so that further down the road it could be used for their profiles giving them a list of what they attended and relevant info. In other words, I would like to have a list of people (profiles) and then link that to a list of groups they are in and then a list of events per group they participated in.
Obviously, same people appear a number of times:
| Full name | email | secondary email | group | date |
| John Smith | jsmith@someplace.com | | AcOP | 2010-02-12 |
| John Smith | jsmith@gmail.com | jsmith@somplace.com | AcOP | 2010-03-14 |
| John Smith | jsmith@gmail.com | | CbDP | 2010-03-18 |
| John Smith | jsmith@someplace.com | | BDz | 2010-04-02 |
Of course, I would like to roll it into one record for John Smith with both e-mails in the resulting People table. I can't rule out that there might be more records for the same person with other e-mails than those two, and I can live with that. To make it more complex, ideally I would like to derive a list of groups, creating a Groups table (possibly with further details on the groups), and then a list of meetings/activities for each group. By linking those I would then have a clean relational model.
Now, the question: is there a way to perform such a transformation of data in SQL? Or do I need to write a procedure (program) that would traverse the database and do it?
The database is in MySQL, though I can also use MS Access (it was given to me in that format).
There is no tool that does this automatically. You will have to write a couple queries (unless you want to write a DTS package or something proprietary). Here's a typical approach:
Write two select statements for the two tables you wish to create-- one for users and one for groups. You may need to use DISTINCT or GROUP BY to ensure you only get one row when the source table contains duplicates.
Run the two select statements and inspect them for problems. For example, it's possible some users show up with two different email addresses, or some users have the same name and were combined incorrectly. These will need to be cleaned up in order to proceed. There is no great way to do this; it's more or less a manual process requiring expert knowledge of the data.
Write CREATE TABLE scripts based on the two SELECT statements so that you can store the results somewhere.
Use INSERT ... SELECT (MySQL) or SELECT INTO (MS Access) to populate the tables from your two SELECT statements.
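The steps above can be sketched as follows, using Python's standard-library sqlite3 module in place of MySQL so the example is runnable. The table names flat, people and activity_groups, the column names, and the second sample person are assumptions made for illustration; the real flat table has more columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE flat (full_name TEXT, email TEXT, grp TEXT, event_date TEXT);
    INSERT INTO flat VALUES
        ('John Smith', 'jsmith@someplace.com', 'AcOP', '2010-02-12'),
        ('John Smith', 'jsmith@gmail.com',     'AcOP', '2010-03-14'),
        ('John Smith', 'jsmith@gmail.com',     'CbDP', '2010-03-18'),
        ('Ann Jones',  'ann@example.com',      'CbDP', '2010-03-18');

    -- Step 3: target tables shaped after the two SELECT statements.
    CREATE TABLE people (person_id INTEGER PRIMARY KEY, full_name TEXT UNIQUE);
    CREATE TABLE activity_groups (group_id INTEGER PRIMARY KEY, name TEXT UNIQUE);

    -- Step 4: INSERT ... SELECT, with DISTINCT collapsing the duplicates.
    INSERT INTO people (full_name) SELECT DISTINCT full_name FROM flat;
    INSERT INTO activity_groups (name) SELECT DISTINCT grp FROM flat;
""")

# Step 2: surface rows that need manual review, here names seen with several e-mails.
suspects = con.execute("""
    SELECT full_name, COUNT(DISTINCT email)
    FROM flat
    GROUP BY full_name
    HAVING COUNT(DISTINCT email) > 1
""").fetchall()

people = [name for (name,) in con.execute("SELECT full_name FROM people ORDER BY full_name")]
groups = [name for (name,) in con.execute("SELECT name FROM activity_groups ORDER BY name")]
print(people, groups, suspects)
```

Grouping people by name alone is only a starting point: the suspects list is the part that still needs the manual clean-up described in the second step.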

Crossfilter dimension of joined table without repeating values

I will preface this with that I am both relatively unfamiliar with SQL and dc.js. I feel the solution to this is likely pretty simple.
I am currently drawing a group of charts based on a join of two tables that results in a form similar to the following:
Subject | Gender | TestName
------- | ------ | --------
1 | M | Test 1
2 | M | Test 1
1 | M | Test 2
2 | M | Test 2
Essentially, a lot of unique data that is repeated on the join due to TestName changing per subject. The join is on Subject.
I have one bar chart for Gender, which can be either M or F. I also have a bar graph of a count of each test (TestName) and how many subjects were present in that test. As you can tell from the join, a single subject can and often is a member of more than one test.
My issue is that when I crossfilter this data, the counts for each test are correct (here, it would be 2 each), but my gender information is inflated (4, instead of 2) since it counts what should be each unique subject every time it appears in this joined data set. I want my charts to display n Subjects for Gender, but instead it displays n * 'number of tests'.
Is there a way to chart the correct count per test case but keep my other charts displaying only a maximum count of unique subjects while keeping my crossfilter working?
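To make the inflation concrete, here is a small Python sketch of the arithmetic using the four joined rows from the question. It does not use crossfilter or dc.js; it only shows that counting rows per gender gives 4 while counting distinct subjects per gender gives 2, which is the figure the gender chart needs.

```python
from collections import Counter

# The joined rows from the question's example.
rows = [
    {"subject": 1, "gender": "M", "test": "Test 1"},
    {"subject": 2, "gender": "M", "test": "Test 1"},
    {"subject": 1, "gender": "M", "test": "Test 2"},
    {"subject": 2, "gender": "M", "test": "Test 2"},
]

# Counting joined rows inflates gender: each subject appears once per test.
rows_per_gender = Counter(r["gender"] for r in rows)

# Counting distinct (gender, subject) pairs gives the intended figure.
unique_pairs = {(r["gender"], r["subject"]) for r in rows}
subjects_per_gender = Counter(gender for gender, _ in unique_pairs)

# Per-test counts are already right, because a subject appears once per test.
subjects_per_test = Counter(r["test"] for r in rows)

print(rows_per_gender["M"], subjects_per_gender["M"], dict(subjects_per_test))
# prints: 4 2 {'Test 1': 2, 'Test 2': 2}
```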

Find date range overlaps within the same table, for specific user MySQL

I am by no means a MySQL expert, so I am looking for any help on this matter.
I need to perform a simple test (in principle), I have this (simplified) table:
tableid | userid | car | From | To
--------------------------------------------------------
1 | 1 | Fiesta | 2015-01-01 | 2015-01-31
2 | 1 | MX5 | 2015-02-01 | 2015-02-28
3 | 1 | Navara | 2015-03-01 | 2015-03-31
4 | 1 | GTR | 2015-03-28 | 2015-04-30
5 | 2 | Focus | 2015-01-01 | 2015-01-31
6 | 2 | i5 | 2015-02-01 | 2015-02-28
7 | 2 | Aygo | 2015-03-01 | 2015-03-31
8 | 2 | 206 | 2015-03-29 | 2015-04-30
9 | 1 | Skyline | 2015-04-29 | 2015-05-31
10 | 2 | Skyline | 2015-04-29 | 2015-05-31
I need to find two things here:
If any user has date overlaps in his car assignments of more than one day (end of the assignment can be on the same day as the new assignment start).
Did any two users try to get the same car assigned on the same date, or do the date ranges overlap for them on the same car?
So the query (or queries) I am looking for should return those rows:
tableid | userid | car | From | To
--------------------------------------------------------
3 | 1 | Navara | 2015-03-01 | 2015-03-31
4 | 1 | GTR | 2015-03-28 | 2015-04-30
7 | 2 | Aygo | 2015-03-01 | 2015-03-31
8 | 2 | 206 | 2015-03-29 | 2015-04-30
9 | 1 | Skyline | 2015-04-29 | 2015-05-31
10 | 2 | Skyline | 2015-04-29 | 2015-05-31
I feel like I am bashing my head against the wall here, I would be happy with being able to do these comparisons in separate queries. I need to display them in one table but I could always then join the results.
I've done research and a few hours of testing but I can't get anywhere near the result I want.
SQLFiddle with the above test data
I've tried these posts btw (they were not exactly what I needed but were close enough, or so I thought):
Comparing two date ranges within the same table
How to compare values of text columns from the same table
This was the closest solution I could find but when I tried it on a single table (joining table to itself) I was getting crazy results: Checking a table for time overlap?
EDIT
As a temporary solution I have adopted a different approach, similar to the posts I found during my research (above). I will now check if the new car rental / assignment date overlaps with any date range within the table. If so I will save the id(s) of the rows that the date overlaps with. This way at least I will be able to flag overlaps and allow a user to look at the flagged rows and to resolve any overlaps manually.
Thanks to everyone who offered their help with this. I will flag philipxy's answer as the chosen one (in the next 24h) unless someone has a better way of achieving this. I have no doubt that following his answer I will be able to eventually reach the results I need. At the moment though I need to adopt any solution that works, as I need to finish my project in the next few days, hence the change of approach.
Edit #2
Both answers are brilliant, and to anyone who finds this post having the same issue as I did: read them both and look at the fiddles! :) A lot of amazing brain-work went into them! Temporarily I had to go with the solution I mention in my Edit #1, but I will be adapting my queries to go with @Ryan Vincent's approach plus @philipxy's edits/comments about ignoring the initial one-day overlap.
Here is the first part: Overlapping cars per user...
SQLFiddle - correlated Query and Join Query
Second part - more than one user in one car at the same time: SQLFiddle - correlated Query and Join Query. Query below...
I use the correlated queries:
You will likely need indexes on userid and car. However, please check the 'explain plan' to see how MySQL is accessing the data. And just try it :)
Overlapping cars per user
The query:
SELECT `allCars`.`userid` AS `allCars_userid`,
`allCars`.`car` AS `allCars_car`,
`allCars`.`From` AS `allCars_From`,
`allCars`.`To` AS `allCars_To`,
`allCars`.`tableid` AS `allCars_id`
FROM
`cars` AS `allCars`
WHERE
EXISTS
(SELECT 1
FROM `cars` AS `overlapCar`
WHERE
`allCars`.`userid` = `overlapCar`.`userid`
AND `allCars`.`tableid` <> `overlapCar`.`tableid`
AND NOT ( `allCars`.`From` >= `overlapCar`.`To` /* starts after outer ends */
OR `allCars`.`To` <= `overlapCar`.`From`)) /* ends before outer starts */
ORDER BY
`allCars`.`userid`,
`allCars`.`From`,
`allCars`.`car`;
The results:
allCars_userid allCars_car allCars_From allCars_To allCars_id
-------------- ----------- ------------ ---------- ------------
1 Navara 2015-03-01 2015-03-31 3
1 GTR 2015-03-28 2015-04-30 4
1 Skyline 2015-04-29 2015-05-31 9
2 Aygo 2015-03-01 2015-03-31 7
2 206 2015-03-29 2015-04-30 8
2 Skyline 2015-04-29 2015-05-31 10
Why it works? or How I think about it:
I use the correlated query so I don't have duplicates to deal with and it is probably the easiest to understand for me. There are other ways of expressing the query. Each has advantages and drawbacks. I want something I can easily understand.
Requirement: For each user ensure that they don't have two or more cars at the same time.
So, for each user record (AllCars) check the complete table (overlapCar) to see if you can find a different record that overlaps for the time of the current record. If we find one then select the current record we are checking (in allCars).
Therefore the overlap check is:
the allCars userid and the overLap userid must be the same
the allCars car record and the overlap car record must be different
the allCars time range and the overLap time range must overlap.
The time range check:
Instead of testing directly for overlapping times, use positive tests. The easiest approach is to check that the two ranges do not overlap, and then apply a NOT to that check.
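The same time range check, pulled out of the SQL as a small Python sketch. The function name overlaps is an assumption for illustration; the comparison mirrors the NOT (... >= ... OR ... <= ...) condition in the query, so a hand-over on the same day does not count as an overlap.

```python
from datetime import date

def overlaps(from_a, to_a, from_b, to_b):
    """True when range A and range B share more than a single hand-over day."""
    # Positive test: A lies entirely after B, or entirely before B.
    no_overlap = from_a >= to_b or to_a <= from_b
    return not no_overlap

navara = (date(2015, 3, 1), date(2015, 3, 31))
gtr = (date(2015, 3, 28), date(2015, 4, 30))
fiesta = (date(2015, 1, 1), date(2015, 1, 31))
mx5 = (date(2015, 2, 1), date(2015, 2, 28))

print(overlaps(*navara, *gtr))   # True: 28-31 March is shared
print(overlaps(*fiesta, *mx5))   # False: Fiesta ends before MX5 starts
# Ending and starting on the same day is allowed.
print(overlaps(date(2015, 3, 1), date(2015, 3, 31), date(2015, 3, 31), date(2015, 4, 30)))  # False
```

By De Morgan's law the negated test is equivalent to from_a < to_b and to_a > from_b, which is the form the join-based answer uses.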
One car with More than One User at the same time...
The query:
SELECT `allCars`.`car` AS `allCars_car`,
`allCars`.`userid` AS `allCars_userid`,
`allCars`.`From` AS `allCars_From`,
`allCars`.`To` AS `allCars_To`,
`allCars`.`tableid` AS `allCars_id`
FROM
`cars` AS `allCars`
WHERE
EXISTS
(SELECT 1
FROM `cars` AS `overlapUser`
WHERE
`allCars`.`car` = `overlapUser`.`car`
AND `allCars`.`tableid` <> `overlapUser`.`tableid`
AND NOT ( `allCars`.`From` >= `overlapUser`.`To` /* starts after outer ends */
OR `allCars`.`To` <= `overlapUser`.`From`)) /* ends before outer starts */
ORDER BY
`allCars`.`car`,
`allCars`.`userid`,
`allCars`.`From`;
The results:
allCars_car allCars_userid allCars_From allCars_To allCars_id
----------- -------------- ------------ ---------- ------------
Skyline 1 2015-04-29 2015-05-31 9
Skyline 2 2015-04-29 2015-05-31 10
Edit:
In view of the comments by @philipxy about time ranges needing 'greater than or equal to' checks, I have updated the code here. I haven't changed the SQLFiddles.
For each input and output table find its meaning. Ie a statement template parameterized by column names, aka predicate, that a row makes into a true or false statement, aka proposition. A table holds the rows that make its predicate into a true proposition. Ie rows that make a true proposition go in a table and rows that make a false proposition stay out. Eg for your input table:
rental [tableid] was user [userid] renting car [car] from [from] to [to]
Then phrase the output table predicate in terms of the input table predicate. Don't use descriptions like your 1 & 2:
If any user has date overlaps in his car assignments of more than one day (end of the assignment can be on the same day as the new assignment start).
Instead find the predicate that an arbitrary row states when in the table:
rental [tableid] was user [user] renting car [car] from [from] to [to]
in self-conflict with some other rental
For the DBMS to calculate the rows making this true we must express this in terms of our given predicate(s) plus literals & conditions:
-- query result holds the rows where
FOR SOME t2.tableid, t2.userid, ...:
rental [t1.tableid] was user [t1.userid] renting car [t1.car] from [t1.from] to [t1.to]
AND rental [t2.tableid] was user [t2.userid] renting car [t2.car] from [t2.from] to [t2.to]
AND [t1.userid] = [t2.userid] -- userids id the same users
AND [t1.to] > [t2.from] AND ... -- tos/froms id intervals with overlap more than one day
...
(Inside an SQL SELECT statement the cross product of JOINed tables has column names of the form alias.column. Think of . as another character allowed in column names. Finally the SELECT clause drops the alias.s.)
We convert a query predicate to an SQL query that calculates the rows that make it true:
A table's predicate gets replaced by the table alias.
To use the same predicate/table multiple times make aliases.
Changing column old to new in a predicate adds AND old = new.
AND of predicates gets replaced by JOIN.
OR of predicates gets replaced by UNION.
AND NOT of predicates gets replaced by EXCEPT, MINUS or appropriate LEFT JOIN.
AND condition gets replaced by WHERE or ON condition.
For a predicate true FOR SOME columns to drop or when THERE EXISTS columns to drop, SELECT DISTINCT columns to keep.
Etc. (See this.)
Hence (completing the ellipses):
SELECT DISTINCT t1.*
FROM t t1 JOIN t t2
ON t1.userid = t2.userid -- userids id the same users
WHERE t1.to > t2.from AND t2.to > t1.from -- tos/froms id intervals with overlap more than one day
AND t1.tableid <> t2.tableid -- tableids id different rentals
Did any two users try to get the same car assigned on the same date, or do the date ranges overlap for them on the same car?
Finding the predicate that an arbitrary row states when in the table:
rental [tableid] was user [user] renting car [car] from [from] to [to]
in conflict with some other user's rental
In terms of our given predicate(s) plus literals & conditions:
-- query result holds the rows where
FOR SOME t2.*
rental [t1.tableid] was user [t1.userid] renting car [t1.car] from [t1.from] to [t1.to]
AND rental [t2.tableid] was user [t2.userid] renting car [t2.car] from [t2.from] to [t2.to]
AND [t1.userid] <> [t2.userid] -- userids id different users
AND [t1.car] = [t2.car] -- .cars id the same car
AND [t1.to] >= [t2.from] AND [t2.to] >= [t1.from] -- tos/froms id intervals with any overlap
AND [t1.tableid] <> [t2.tableid] -- tableids id different rentals
The UNION of queries for predicates 1 & 2 returns the rows for which predicate 1 OR predicate 2.
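A sketch of that UNION, run against the question's test data with Python's standard-library sqlite3 module standing in for MySQL. The table name rental and the column names date_from and date_to are assumptions (chosen to avoid quoting the reserved words from and to), and the SQL for predicate 2 is one possible translation of the predicate given above rather than a query taken from the answer.

```python
import sqlite3

ROWS = [
    (1, 1, "Fiesta", "2015-01-01", "2015-01-31"),
    (2, 1, "MX5", "2015-02-01", "2015-02-28"),
    (3, 1, "Navara", "2015-03-01", "2015-03-31"),
    (4, 1, "GTR", "2015-03-28", "2015-04-30"),
    (5, 2, "Focus", "2015-01-01", "2015-01-31"),
    (6, 2, "i5", "2015-02-01", "2015-02-28"),
    (7, 2, "Aygo", "2015-03-01", "2015-03-31"),
    (8, 2, "206", "2015-03-29", "2015-04-30"),
    (9, 1, "Skyline", "2015-04-29", "2015-05-31"),
    (10, 2, "Skyline", "2015-04-29", "2015-05-31"),
]

# Predicate 1 (same user, overlap of more than one day) UNION
# predicate 2 (different users, same car, any overlap).
CONFLICTS = """
    SELECT t1.tableid
    FROM rental t1 JOIN rental t2 ON t1.userid = t2.userid
    WHERE t1.date_to > t2.date_from AND t2.date_to > t1.date_from
      AND t1.tableid <> t2.tableid
    UNION
    SELECT t1.tableid
    FROM rental t1 JOIN rental t2 ON t1.car = t2.car
    WHERE t1.userid <> t2.userid
      AND t1.date_to >= t2.date_from AND t2.date_to >= t1.date_from
    ORDER BY 1
"""

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE rental (tableid INTEGER, userid INTEGER, car TEXT, date_from TEXT, date_to TEXT)"
)
con.executemany("INSERT INTO rental VALUES (?, ?, ?, ?, ?)", ROWS)
# ISO-formatted date strings compare correctly as text.
conflict_ids = [tableid for (tableid,) in con.execute(CONFLICTS)]
print(conflict_ids)  # [3, 4, 7, 8, 9, 10]
```

UNION (as opposed to UNION ALL) removes the rows that satisfy both predicates twice, so each rental is reported once.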
Try to learn to express predicates--what rows state when in tables--if only as the goal for intuitive (sub)querying.
PS It is good to always have data checking edge & non-edge cases for a condition being true & being false. Eg try query 1 with GTR starting on the 31st, an overlap of only one day, which should not be a self-conflict.
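The edge case from the PS can be checked mechanically. The sketch below uses Python's standard-library sqlite3 module in place of MySQL; the table name rental, the columns date_from and date_to, and the helper self_conflicts are assumptions for illustration. It runs query 1 twice: once with the original GTR rental and once with GTR starting on the 31st.

```python
import sqlite3

SELF_CONFLICT = """
    SELECT DISTINCT t1.tableid
    FROM rental t1 JOIN rental t2
      ON t1.userid = t2.userid
    WHERE t1.date_to > t2.date_from AND t2.date_to > t1.date_from
      AND t1.tableid <> t2.tableid
    ORDER BY t1.tableid
"""

def self_conflicts(rows):
    """Return the tableids of rentals that conflict with another rental of the same user."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE rental (tableid INTEGER, userid INTEGER, car TEXT, date_from TEXT, date_to TEXT)"
    )
    con.executemany("INSERT INTO rental VALUES (?, ?, ?, ?, ?)", rows)
    ids = [tableid for (tableid,) in con.execute(SELF_CONFLICT)]
    con.close()
    return ids

original = [
    (3, 1, "Navara", "2015-03-01", "2015-03-31"),
    (4, 1, "GTR", "2015-03-28", "2015-04-30"),
]
edge_case = [
    (3, 1, "Navara", "2015-03-01", "2015-03-31"),
    (4, 1, "GTR", "2015-03-31", "2015-04-30"),  # starts the day Navara ends
]

print(self_conflicts(original))   # [3, 4]
print(self_conflicts(edge_case))  # []
```

With the strict > comparisons the one-day hand-over is not reported; swapping them for >= would flag both rows, which is the difference between the two predicates in this answer.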
PPS Querying involving duplicate rows, as with NULLs, has quite complex query meanings. It's hard to say when a tuple goes in or stays out of a table and how many times. For queries to have the simple intuitive meanings per my correspondences they can't have duplicates. Here SQL unfortunately differs from the relational model. In practice people rely on idioms when allowing non-distinct rows & they rely on rows being distinct because of constraints. Eg joining on UNIQUE columns per UNIQUEs, PKs & FKs. Eg: A final DISTINCT step is only doing work at a different time than a version that doesn't need it; time might or might not be an important implementation issue affecting the phrasing chosen for a given predicate/result.

Collating data from two tables

I'm using the following statement to try and collect and display data correctly. It is necessary to do a 'LEFT JOIN' with one table to collect more information, but I should say that it's not necessary to do this for the second case (but such is my work-around).
SELECT
COALESCE(building.campus_id, campus.campus_id) AS campus,
member.*
FROM location
LEFT JOIN cu_member AS member ON
(member.member_id = location.member_id)
LEFT JOIN cu_building AS building ON
(location.params LIKE 'building_id=%' = building.id)
LEFT JOIN cu_campus AS campus ON
(location.params LIKE 'campus_id=%' = campus.id)
In my above query, I would want to use the value matched by the wildcard, so that the join effectively becomes:
LEFT JOIN cu_building AS building ON
('39' = building.id)
Below is how my location table looks. I'm trying to use the data from the params column to get the resulting campus from another table (building). I only need to do this for fields containing the building_id tag, not for those with 'campus_id', because that is already known.
-----------------------------
member_id | params
-----------------------------
1 | building_id=39
2 | building_id=24
3 | campus_id=6
4 | campus_id=3
5 | building_id=11
6 | campus_id=14
7 | building_id=15
This is how my building table looks. It lists which building is part of which campus.
--------------------------
building_id | campus_id
--------------------------
39 | 5
24 | 4
11 | 2
15 | 2
I have another table named 'campus'. My problem is, this table only lists, but I was hoping to use this table in order to display the correct data in the final result.
--------------------------
campus_id | name
--------------------------
6 | ...
3 | ..
14 | .
The result I want to achieve with the MySQL query is this. Here, the collected results are shown in one table.
-----------------------------
member_id | campus
-----------------------------
1 | 5 (building_id=39)
2 | 4 (building_id=24)
3 | 6 (campus_id=6)
4 | 3 (campus_id=3)
5 | 2 (building_id=11)
6 | 14 (campus_id=14)
7 | 2 (building_id=15)
First things first. You really need to revise your database structure. That location table will give you an infinite number of problems, complicating each and every query you ever need to join members with buildings and campuses. As per the data shown, you really should have building_id and campus_id on the member table. Another, softer, solution would be to have a building_id and a campus_id column in the location table.
That said, you need neither regex nor LIKE to make this query work. Something like this should work adequately:
SELECT
COALESCE(b.campus_id, c.id) AS campus,
m.*
FROM cu_member m
JOIN location l ON l.member_id = m.member_id
LEFT JOIN cu_building b ON l.params = CONCAT('building_id=',b.id)
LEFT JOIN cu_campus c ON l.params = CONCAT('campus_id=',c.id)
I can see that the last join seems a bit redundant, since you really only need the campus id, and not the name or other info. That already resides in the location table, so it would seem unnecessary to JOIN the campus table. The problem is that it is embedded in the params column. Extracting it is a mess.
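A runnable sketch of the join above against the question's sample data, using Python's standard-library sqlite3 module in place of MySQL. Two things differ from the MySQL query and are assumptions of this sketch: SQLite's || operator is used where MySQL would use CONCAT(), and the campus names are made-up placeholders because the question elides them. The cu_member table is left out to keep the example short.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE location (member_id INTEGER, params TEXT);
    INSERT INTO location VALUES
        (1, 'building_id=39'), (2, 'building_id=24'), (3, 'campus_id=6'),
        (4, 'campus_id=3'), (5, 'building_id=11'), (6, 'campus_id=14'),
        (7, 'building_id=15');

    CREATE TABLE cu_building (id INTEGER, campus_id INTEGER);
    INSERT INTO cu_building VALUES (39, 5), (24, 4), (11, 2), (15, 2);

    -- Placeholder names; the real ones are not shown in the question.
    CREATE TABLE cu_campus (id INTEGER, name TEXT);
    INSERT INTO cu_campus VALUES (6, 'Campus 6'), (3, 'Campus 3'), (14, 'Campus 14');
""")

# Build the expected params string from each id and compare it for equality,
# instead of trying to pull the id out of params with LIKE.
resolved = con.execute("""
    SELECT l.member_id, COALESCE(b.campus_id, c.id) AS campus
    FROM location l
    LEFT JOIN cu_building b ON l.params = 'building_id=' || b.id
    LEFT JOIN cu_campus c ON l.params = 'campus_id=' || c.id
    ORDER BY l.member_id
""").fetchall()

for member_id, campus in resolved:
    print(member_id, campus)
```

Each location row matches at most one of the two LEFT JOINs, so COALESCE picks whichever side produced a campus id.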
In the example you provide, it would be better to expand your params column into a building_id and a campus_id column.
Joining this then becomes very easy.
SELECT
COALESCE(building.campus_id, campus.campus_id) AS campus,
member.*
FROM location
LEFT JOIN cu_member AS member ON (member.member_id = location.member_id)
LEFT JOIN cu_building AS building ON (location.building_id = building.building_id)
LEFT JOIN cu_campus AS campus ON (location.campus_id = campus.campus_id)
If there are other reasons that make creating separate columns difficult, then use the code provided by @Frazz.

How do I use mysql to match against multiple possibilities from a second table?

I'm not entirely sure how to ask this question, so I'll lead by providing an example table and an example output and then follow up with a more thorough explanation of what I'm attempting to accomplish.
Imagine that I have two tables. In the first is a list of companies. Some of these companies have duplicate entries due to being imported and continuously updated from different sources. For example, the company table may look something like this:
| rawName | strippedName |
| Kohl's | kohls |
| kohls.com | kohls |
| kohls Corporation | kohls |
So in this situation, we have information that has come in from three different sources. In an attempt to allow my program to understand that each of these sources are all the same store, I created the stripped name column (which I also use for creating URL's and whatnot).
In the second table, we have information about deals, coupons, shipping offers, etc. However, since these come in from their various sources, the end up with the three different rawNames that we identified above. For example, the second table might look something like this:
| merchantName | dealInformation |
| kohls.com | 10% off everything... |
| kohl's | Free shipping on... |
| kohls corporation | 1 Day Flash Sale! |
| kohls.com | Buy one get one... |
So here we have four entries that are all from the same company. However, when a user on the site visits the listing for Kohls, I want it to display all the entries from each source.
Here is what I currently have, but it doesn't seem to be doing the trick. This seems to only work if I set the LIMIT in that sub-query to 1 so that it only brings back one of the rawNames. I need it to match against all of the rawNames.
SELECT * FROM table2
WHERE merchantName = (SELECT rawName FROM table1 WHERE strippedName = '".$strippedName."')
The quickest fix is to replace your merchantName = with merchantName IN:
SELECT * FROM table2
WHERE merchantName IN (SELECT rawName FROM table1 WHERE strippedName = '".$strippedName."')
The = operator needs to have exactly one value on each side - the IN keyword matches a value against multiple values.
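A small runnable sketch of the IN version, using Python's standard-library sqlite3 module in place of MySQL and PHP. Several details are assumptions of the sketch: the helper name deals_for, the extra "Acme" rows, the COLLATE NOCASE declarations (added to approximate MySQL's default case-insensitive comparison, since the two tables differ in capitalisation), and the use of a bound ? parameter where the original interpolates $strippedName into the SQL string.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# COLLATE NOCASE approximates MySQL's default case-insensitive string comparison.
con.execute("CREATE TABLE table1 (rawName TEXT COLLATE NOCASE, strippedName TEXT)")
con.execute("CREATE TABLE table2 (merchantName TEXT COLLATE NOCASE, dealInformation TEXT)")
con.executemany("INSERT INTO table1 VALUES (?, ?)", [
    ("Kohl's", "kohls"),
    ("kohls.com", "kohls"),
    ("kohls Corporation", "kohls"),
    ("Acme Inc", "acme"),  # hypothetical unrelated merchant
])
con.executemany("INSERT INTO table2 VALUES (?, ?)", [
    ("kohls.com", "10% off everything..."),
    ("kohl's", "Free shipping on..."),
    ("kohls corporation", "1 Day Flash Sale!"),
    ("kohls.com", "Buy one get one..."),
    ("acme inc", "Hypothetical unrelated deal"),
])

def deals_for(stripped_name):
    """Return every deal whose merchantName matches any rawName of the given company."""
    return con.execute(
        "SELECT merchantName, dealInformation FROM table2 "
        "WHERE merchantName IN (SELECT rawName FROM table1 WHERE strippedName = ?)",
        (stripped_name,),
    ).fetchall()

kohls_deals = deals_for("kohls")
print(len(kohls_deals))  # 4
```

Binding the value as a parameter also sidesteps quoting problems with names such as Kohl's, which string interpolation would have to escape by hand.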