Confused about how to properly JOIN - mysql

I have the following tables:
log - stores info about each interaction. Indexes on clickID (unique) and businessID (not unique).
actions - stores info about each specific action taken by a customer. indexes on clickID, actionID, personID, businessID
customers - stores info about each specific customer of a specific business. indexes on personID and businessID (neither is unique, but the combo of the two together will be)
people - stores universal stats about each person who is a customer of one or more businesses. Index on personID (unique).
I need to get all of this info in one result set to pull data from, so that I can connect interactions to individual people's data, and their business-specific data.
I am currently using two datasets, that I correlate in PHP, but I'd prefer to work from one returned dataset, if it makes sense.
Here is my current set of queries:
SELECT * FROM `log`
WHERE `timestamp` >= STARTTIME AND `timestamp` <= ENDTIME AND `pageID`='aPageID' AND `businessID`='aBusinessID'
ORDER BY `timestamp` DESC;
SELECT * FROM `actions` AS `t1`
INNER JOIN `people` AS `t2` ON (`t1`.`personID`=`t2`.`personID`)
INNER JOIN `customers` AS `t3` ON (`t1`.`personID`=`t3`.`personID` AND `t1`.`businessID`=`t3`.`businessID`)
WHERE `timestamp` >= STARTTIME AND `timestamp` <= ENDTIME AND `pageID`='aPageID' AND `t1`.`businessID`='aBusinessID'
ORDER BY `timestamp` DESC;
It seems like I'd do better with one query where the actionID (and all following results) might be null, but I don't really know what that would look like, or how it would impact performance. Help?

SELECT * FROM `log` AS t1
INNER JOIN `actions` AS t2 ON (t1.`clickID`=t2.`clickID` AND t1.`businessID`=t2.`businessID`)
INNER JOIN `customers` AS t3 ON (t1.`businessID`=t3.`businessID` AND t2.`personID`=t3.`personID`)
INNER JOIN `people` AS t4 ON (t2.`personID`=t4.`personID`)
WHERE t1.`timestamp` >= STARTTIME AND t1.`timestamp` <= ENDTIME AND t1.`pageID`='aPageID' AND t1.`businessID`='aBusinessID'
ORDER BY t1.`timestamp` DESC
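The question asks for log rows whose actionID (and everything after it) may be null, which an INNER JOIN cannot return; a LEFT JOIN chain is the usual way to get that. Below is a minimal sketch, assuming the column layout described in the question. It runs against an in-memory SQLite database through Python's sqlite3 module as a stand-in for MySQL, and the sample rows plus the `name` and `note` columns are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log (clickID INTEGER PRIMARY KEY, businessID INTEGER, pageID TEXT, timestamp INTEGER);
CREATE TABLE actions (actionID INTEGER PRIMARY KEY, clickID INTEGER, personID INTEGER, businessID INTEGER);
CREATE TABLE customers (personID INTEGER, businessID INTEGER, note TEXT, PRIMARY KEY (personID, businessID));
CREATE TABLE people (personID INTEGER PRIMARY KEY, name TEXT);

-- Hypothetical sample data: click 2 has no action attached.
INSERT INTO log VALUES (1, 10, 'home', 100), (2, 10, 'home', 200), (3, 10, 'home', 300);
INSERT INTO actions VALUES (501, 1, 7, 10), (502, 3, 8, 10);
INSERT INTO customers VALUES (7, 10, 'vip'), (8, 10, 'new');
INSERT INTO people VALUES (7, 'Ann'), (8, 'Bob');
""")

# Every log row is kept; action, customer and person columns are NULL
# when no matching action exists for that click.
rows = conn.execute("""
SELECT l.clickID, a.actionID, p.name, c.note
FROM log AS l
LEFT JOIN actions   AS a ON a.clickID = l.clickID AND a.businessID = l.businessID
LEFT JOIN customers AS c ON c.personID = a.personID AND c.businessID = l.businessID
LEFT JOIN people    AS p ON p.personID = a.personID
WHERE l.timestamp >= ? AND l.timestamp <= ?
  AND l.pageID = ? AND l.businessID = ?
ORDER BY l.timestamp DESC
""", (0, 1000, "home", 10)).fetchall()

for row in rows:
    print(row)
```

Note that a click with several actions comes back once per action, so the PHP side would still need to group rows by clickID.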

Related

Query time that I can't understand in MYSQL

I'm new to this platform, and this is my first question. Sorry for my English; I'm using a translator. Let me know if I have used inappropriate language.
My table looks like this:
CREATE TABLE tbl_records (
id int(11) NOT NULL,
data_id int(11) NOT NULL,
value double NOT NULL,
record_time datetime NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
ALTER TABLE tbl_records
ADD PRIMARY KEY (id),
ADD KEY data_id (data_id),
ADD KEY record_time (record_time);
ALTER TABLE tbl_records
MODIFY id int(11) NOT NULL AUTO_INCREMENT;
COMMIT;
My first query takes 0.0096 seconds:
SELECT b.* FROM tbl_records b
INNER JOIN
(SELECT MAX(id) AS id FROM tbl_records GROUP BY data_id) a
ON a.id=b.id;
My second query takes 2.4957 seconds:
SELECT MAX(id) AS id FROM tbl_records GROUP BY data_id;
When I run these queries over and over again, the timings stay similar.
There are 20 million rows in the table.
Why is the one with the subquery faster?
Also what I really need is MAX(record_time) but
SELECT b.* FROM tbl_records b
INNER JOIN
(SELECT MAX(record_time) AS id FROM tbl_records GROUP BY data_id) a
ON a.id=b.id
It takes minutes when I run it.
I also need hourly, daily, and monthly aggregates. I couldn't see much performance difference between GROUP BY SUBSTR(record_time,1,10) and GROUP BY DATE_FORMAT(record_time,'%Y%m%d'); both take minutes.
What am I doing wrong?
The first query can be simplified to
SELECT * FROM tbl_records
ORDER BY id DESC
LIMIT 1;
The second:
SELECT id FROM tbl_records
ORDER BY data_id DESC
LIMIT 1;
I don't know what the third is trying to do. This does not make sense: MAX(record_time) AS id -- it is a DATETIME that will subsequently be compared to an INT in ON a.id=b.id.
Another option for turning a DATETIME into a DATE is simply DATE(record_time). But it will not be significantly faster.
If the goal is to build daily counts and subtotals, then there is a much better way: build and incrementally maintain a summary table.
(responding to Comment)
The GROUP BY that you have is improper and probably incorrect. I took the liberty of changing from id to data_id:
SELECT b.*
FROM
( SELECT data_id, MAX(record_time) AS max_time
FROM tbl_records
GROUP BY data_id
) AS a
JOIN tbl_records AS b
ON a.data_id = b.data_id
AND a.max_time = b.record_time;
And have
INDEX(data_id, record_time)
Can there be duplicate times for one data_id? To discuss that and other "groupwise-max" queries, see http://mysql.rjweb.org/doc.php/groupwise_max
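To see the groupwise-max pattern from this answer in action, here is a small sketch. It assumes the tbl_records layout from the question, but runs on an in-memory SQLite database via Python's sqlite3 module rather than on MySQL; the index name `idx_data_time` and the four sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_records (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    data_id INTEGER NOT NULL,
    value REAL NOT NULL,
    record_time TEXT NOT NULL
);
-- Composite index suggested in the answer (the name is hypothetical).
CREATE INDEX idx_data_time ON tbl_records (data_id, record_time);

INSERT INTO tbl_records (data_id, value, record_time) VALUES
    (1, 1.5, '2024-01-01 10:00:00'),
    (1, 2.5, '2024-01-02 09:00:00'),
    (2, 7.0, '2024-01-01 12:00:00'),
    (2, 8.0, '2023-12-31 23:00:00');
""")

# Derived table "a" finds the latest time per data_id;
# joining back on both columns fetches the full row for that time.
latest = conn.execute("""
SELECT b.data_id, b.value, b.record_time
FROM (SELECT data_id, MAX(record_time) AS max_time
      FROM tbl_records
      GROUP BY data_id) AS a
JOIN tbl_records AS b
  ON a.data_id = b.data_id AND a.max_time = b.record_time
ORDER BY b.data_id
""").fetchall()

for row in latest:
    print(row)
```

Data id 2 shows why joining on MAX(id) and joining on MAX(record_time) can disagree: its newest row by time was inserted first. If two rows share the same latest time for one data_id, both are returned.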

Optimizing MySQL query removing subquery

Having these tables:
customers
---------------------
`id` smallint(5) unsigned NOT NULL auto_increment,
`name` varchar(100) collate utf8_unicode_ci NOT NULL,
....
customers_subaccounts
-------------------------
`companies_id` mediumint(8) unsigned NOT NULL,
`customers_id` mediumint(8) unsigned NOT NULL,
`subaccount` int(10) unsigned NOT NULL
I need to get all the customers who have been assigned more than one subaccount for the same company.
This is what I've got:
SELECT * FROM customers
WHERE id IN
(SELECT customers_id
FROM customers_subaccounts
GROUP BY customers_id, companies_id
HAVING COUNT(subaccount) > 1)
This query is too slow, though. It's even slower if I add the DISTINCT modifier to customers_id in the SELECT of the subquery, although the whole query still retrieves the same list of customers. Maybe there's a better way without a subquery; anything faster will help. I'm also not sure whether this query retrieves an accurate list.
Any help?
You can replace the subquery with an INNER JOIN:
SELECT t1.id
FROM customers t1
INNER JOIN
(
SELECT DISTINCT customers_id
FROM customers_subaccounts
GROUP BY customers_id, companies_id
HAVING COUNT(*) > 1
) t2
ON t1.id = t2.customers_id
You can also try using EXISTS(), which may be faster than a join:
SELECT * FROM customers t
WHERE EXISTS(SELECT 1 FROM customers_subaccounts s
WHERE s.customers_id = t.id
GROUP BY s.customers_id, s.companies_id
HAVING COUNT(subaccount) > 1)
You should also consider adding the following indexes (if they don't exist yet):
customers_subaccounts (customers_id,companies_id,subaccount)
customers (id)
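As a quick check that the JOIN and EXISTS rewrites above return the same customers, here is a minimal sketch. It uses an in-memory SQLite database through Python's sqlite3 module instead of MySQL, with simplified column types; the index name `idx_cs` and the sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE customers_subaccounts (
    companies_id INTEGER NOT NULL,
    customers_id INTEGER NOT NULL,
    subaccount   INTEGER NOT NULL
);
CREATE INDEX idx_cs ON customers_subaccounts (customers_id, companies_id, subaccount);

INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO customers_subaccounts VALUES
    (100, 1, 1), (100, 1, 2),                           -- Ann: two subaccounts at company 100
    (100, 2, 1), (200, 2, 1),                           -- Bob: one subaccount at each company
    (300, 3, 1), (300, 3, 2), (400, 3, 5), (400, 3, 6); -- Cy: qualifies for two companies
""")

# Variant 1: join against the de-duplicated list of qualifying customers.
join_ids = [r[0] for r in conn.execute("""
SELECT c.id
FROM customers AS c
JOIN (SELECT DISTINCT customers_id
      FROM customers_subaccounts
      GROUP BY customers_id, companies_id
      HAVING COUNT(*) > 1) AS d
  ON d.customers_id = c.id
ORDER BY c.id
""")]

# Variant 2: correlated EXISTS, evaluated per customer.
exists_ids = [r[0] for r in conn.execute("""
SELECT c.id
FROM customers AS c
WHERE EXISTS (SELECT 1
              FROM customers_subaccounts AS s
              WHERE s.customers_id = c.id
              GROUP BY s.customers_id, s.companies_id
              HAVING COUNT(*) > 1)
ORDER BY c.id
""")]

print(join_ids, exists_ids)
```

The DISTINCT in the derived table matters for the join variant: without it, Cy would be listed once per qualifying company. Which variant is faster on real data depends on the MySQL version and the indexes, so compare them with EXPLAIN.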
Assuming that you want different subaccounts for the company (or that they are guaranteed to be different anyway), then the following could be faster under some circumstances:
select c.*
from (select distinct cs.customers_id
from customers_subaccounts cs join
customers_subaccounts cs2
on cs.customers_id = cs2.customers_id and
cs.companies_id = cs2.companies_id and
cs.subaccount < cs2.subaccount
) cc join
customers c
on c.id = cc.customers_id;
In particular, this can take advantage of an index on customers_subaccounts(customers_id, companies_id, subaccount).
Note: This assumes that the subaccounts are different for the rows you want. What is really needed is a way of defining unique rows in the customers_subaccounts table.
There is a way to speed up the query by caching the subquery result. A simple change to your query lets MySQL cache the subquery result:
SELECT * FROM customers
WHERE id IN
(select * from
(SELECT distinct customers_id
FROM customers_subaccounts
GROUP BY customers_id, companies_id
HAVING COUNT(subaccount) > 1) t1);
I used it many years ago and it helped me very much.
Try the following:
SELECT DISTINCT t1.*
FROM customers t1
INNER JOIN customers_subaccounts t2 ON t1.id = t2.customers_id
GROUP BY t1.id, t1.name, t2.companies_id
HAVING COUNT(t2.subaccount) > 1
Also you may add index on customers_id.

MySQL: How do I select fields from multiple tables for insert to a third table?

I am implementing a time tracking solution for our small company. I have this query that inserts into a table from a Perl script. The basic query works fine, but I have two inputs, project_id and category_id, that I need to use to select the id from another table for the insert.
INSERT INTO `time_entries` (`project_id`, `user_id`, `category_id`, `start`)
SELECT a.`project_id`, a.`user_id`, a.`category_id`, a.`start` FROM
(SELECT
(SELECT `id` FROM `projects` WHERE `title` = $scanin[0]) `project_id`,
$scanin[1] `user_id`,
(SELECT `id` FROM `categories` WHERE `barcode` = $scanin[2]) `category_id`,
NOW() `start`) a
WHERE NOT EXISTS
( SELECT 1 FROM `time_entries` WHERE `project_id` = (SELECT `id` FROM `projects` WHERE `title` = $scanin[0])
AND `user_id` = $scanin[1]
AND `category_id` = $scanin[2]
AND `end` = '0000-00-00 00:00:00')
It works fine if I am selecting from one table for the insert, but obviously won't work with two tables. Is it even possible to do this? I am pretty good with simple SQL statements, but this is complex and joins have always been a problem for me. I just don't do a lot of it.
time_entries   projects   categories
------------   --------   ----------
id             id         id
project_id     title      barcode
user_id
category_id
start
end
That is a rather obfuscated query, with far too many unnecessary nested queries.
First let's clean it up, assuming the following database structure:
projects categories time_entries
-------- ---------- ------------
id id id
cat_id title project_id
usr_id user_id
title user_id
category_id
start
end
We can simplify your query to a more developer-friendly version:
INSERT INTO `time_entries` (`project_id`, `user_id`, `category_id`, `start`)
SELECT projects.id, projects.usr_id, categories.id, NOW()
FROM projects JOIN categories ON projects.cat_id = categories.id
WHERE projects.title = $scanin[0]
AND projects.id NOT IN (
SELECT project_id
FROM time_entries
WHERE `user_id` = $scanin[1]
AND `category_id` = $scanin[2]
AND `end` = '0000-00-00 00:00:00'
)
OK, so now it should be much easier to add another table, as you requested, by using the following pattern:
INSERT INTO time_e... , column_n, column_n_plus_1
....
FROM proj....
JOIN table_n on id_n = project_id
JOIN table_n_plus_one on id_n_plus_one = project_id
....
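For the schema shown in the question itself (projects with a title, categories with a barcode), the two lookups and the duplicate check can be folded into one INSERT ... SELECT. This is only a sketch under a few assumptions: it runs on an in-memory SQLite database via Python's sqlite3 module rather than MySQL and Perl, the helper name `clock_in` is hypothetical, an open entry is marked by a NULL `end` instead of '0000-00-00 00:00:00', and the scanned values are passed as bound parameters rather than interpolated into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE projects (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE categories (id INTEGER PRIMARY KEY, barcode TEXT NOT NULL);
CREATE TABLE time_entries (
    id INTEGER PRIMARY KEY,
    project_id INTEGER NOT NULL,
    user_id INTEGER NOT NULL,
    category_id INTEGER NOT NULL,
    "start" TEXT NOT NULL,
    "end" TEXT              -- NULL means the entry is still open (assumption)
);
INSERT INTO projects VALUES (1, 'Website');
INSERT INTO categories VALUES (5, 'BC-001');
""")

def clock_in(title, user_id, barcode):
    """Insert an open entry unless one already exists for this project/user/category."""
    conn.execute("""
        INSERT INTO time_entries (project_id, user_id, category_id, "start")
        SELECT p.id, :user_id, c.id, datetime('now')
        FROM projects AS p
        JOIN categories AS c ON c.barcode = :barcode
        WHERE p.title = :title
          AND NOT EXISTS (
              SELECT 1 FROM time_entries AS t
              WHERE t.project_id = p.id
                AND t.user_id = :user_id
                AND t.category_id = c.id
                AND t."end" IS NULL
          )
    """, {"title": title, "user_id": user_id, "barcode": barcode})

clock_in("Website", 42, "BC-001")   # inserts a row
clock_in("Website", 42, "BC-001")   # skipped: an open entry already exists
clock_in("Unknown", 42, "BC-001")   # skipped: no such project title

open_entries = conn.execute("SELECT COUNT(*) FROM time_entries").fetchone()[0]
stored = conn.execute("SELECT project_id, user_id, category_id FROM time_entries").fetchall()
print(open_entries, stored)
```

Joining the lookup tables in the FROM clause means the ids are resolved once and can be reused inside NOT EXISTS, where the sketch compares category_id against the looked-up id rather than the raw barcode.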
Adding an index gave me the behavior I was looking for.
ALTER TABLE `time_entries` ADD UNIQUE `unique_record_index`(`project_id`,`user_id`,`category_id`,`end`)
Thanks for the help!

SQL unusual query, find max deltas between consecutive elements

I've run into an interesting problem.
I have a table of workers' ids and the days of their visits. Here is the dump:
CREATE TABLE `pp` (
`id` int(11) DEFAULT '1',
`day` int(11) DEFAULT '1',
`key` varchar(45) NOT NULL,
PRIMARY KEY (`key`)
);
INSERT INTO `pp` VALUES
(1,1,'1'),
(1,20,'2'),
(1,50,'3'),
(1,70,'4'),
(2,1,'5'),
(2,120,'6'),
(2,90,'7'),
(1,90,'8'),
(2,100,'9');
I need to find workers who have missed more than 50 days at least once. For example, if a worker visited on the 5th, 95th, 96th, and 97th days, the largest delta (90) is more than 50, so this worker should be included in the result.
The problem is: how do I efficiently find the deltas between visits for each worker?
I can't even imagine how to treat MySQL tables as sequential arrays of data.
We need to separate the day values by worker, sort them, and then find the max delta for each. But how? Is there any way to, for example, enumerate sorted arrays in SQL?
Try this query (edited):
SELECT t.id, MAX(t.day2 - t.day1) AS max_gap FROM (
SELECT p1.id, p1.day AS day1, MIN(p2.day) AS day2 FROM pp p1
JOIN pp p2
ON p1.id = p2.id AND p1.day < p2.day
GROUP BY p1.id, p1.day
) t
GROUP BY t.id
HAVING MAX(t.day2 - t.day1) > 50
This is a way I used to cope with such problems:
SELECT distinct t3.id FROM
(SELECT t1.id, t1.day, MIN(t2.day) nextday
FROM pp t1
JOIN pp t2 ON t1.id=t2.id AND t1.day<t2.day
GROUP BY t1.id, t1.day
HAVING nextday-t1.day >50) t3
(EDIT this version is slightly better)
This finds all the IDs for which there is a delta > 50. (I assumed that this is what you're after)
To see it working: SQL fiddle
To find the max deltas:
SELECT t3.id, MAX(t3.nextday-t3.day) FROM
(SELECT t1.id, t1.day, MIN(t2.day) nextday
FROM pp t1
JOIN pp t2 ON t1.id=t2.id AND t1.day<t2.day
GROUP BY t1.id, t1.day) t3
GROUP BY t3.id
The logic is to find the "next" item, whatever that means. As this is an ordered attribute, the next item can be defined as the row with the lowest value among those rows whose value is larger than the one examined. Then you join the "next" values to the original values, compute the delta, and return only those that are applicable. If you need the other columns too, just JOIN the outer select to the original table.
I'm not sure this is the best solution performance-wise; I only wrote such queries for one-off reports, where I could afford to let the query run for a while.
One semantic error can arise, though: if somebody was present on the 1st, 2nd and 3rd days but never afterwards, this does not find the absence. To overcome this, you could UNION in a special row for every ID that carries tomorrow's day count, but that would make the query ugly enough that I won't try writing it down.
This could also be a solution:
select distinct pp.id
from pp
where pp.day-(select max(day)
from pp pp2
where
pp2.id=pp.id and
pp2.day<pp.day)>=50
(Since days are not ordered by key, I'm not searching for the previous key but for the max day before the current day.)
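To check the "next visit" self-join on the sample data from the question, here is a small sketch. It loads the same pp rows into an in-memory SQLite database through Python's sqlite3 module (an assumption made so the example runs without a MySQL server) and computes each worker's largest gap between consecutive visits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pp (id INTEGER, day INTEGER, "key" TEXT PRIMARY KEY);
INSERT INTO pp VALUES
    (1,1,'1'), (1,20,'2'), (1,50,'3'), (1,70,'4'),
    (2,1,'5'), (2,120,'6'), (2,90,'7'), (1,90,'8'), (2,100,'9');
""")

# Inner query: for each visit, the next visit is the smallest later day
# of the same worker. Outer query: largest gap per worker.
GAPS_SQL = """
SELECT t.id, MAX(t.next_day - t.day) AS max_gap
FROM (SELECT p1.id, p1.day, MIN(p2.day) AS next_day
      FROM pp AS p1
      JOIN pp AS p2 ON p1.id = p2.id AND p1.day < p2.day
      GROUP BY p1.id, p1.day) AS t
GROUP BY t.id
{having}
ORDER BY t.id
"""

max_gaps = conn.execute(GAPS_SQL.format(having="")).fetchall()
long_absences = conn.execute(
    GAPS_SQL.format(having="HAVING MAX(t.next_day - t.day) > 50")
).fetchall()

print(max_gaps)
print(long_absences)
```

Worker 1 visits on days 1, 20, 50, 70 and 90, so the largest gap is 30; worker 2 visits on days 1, 90, 100 and 120, so the gap from day 1 to day 90 is 89 and only worker 2 passes the filter.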

How do I write this kind of query (returning the latest available data for each row)

I have a table defined like this:
CREATE TABLE mytable (id INT NOT NULL AUTO_INCREMENT, PRIMARY KEY(id),
user_id INT REFERENCES user(id) ON UPDATE CASCADE ON DELETE RESTRICT,
amount REAL NOT NULL CHECK (amount > 0),
record_date DATE NOT NULL
);
CREATE UNIQUE INDEX idxu_mybl_key ON mytable (user_id, amount, record_date);
I want to write a query that will have two columns:
user_id
amount
There should be only ONE entry in the returned result set for a given user. Furthermore, the amount returned should be the last recorded amount for the user (i.e. the one with MAX(record_date)).
The complication arises because weights are recorded on different dates for different users, so there is no single LAST record_date for all users.
How may I write a query (preferably ANSI SQL) that returns the columns mentioned previously, while ensuring that only the last recorded amount for each user is returned?
As an aside, it is probably a good idea to return the 'record_date' column as well in the query, so that it is eas(ier) to verify that the query is working as required.
I am using MySQL as my backend db, but ideally the query should be db agnostic (i.e. ANSI SQL) if possible.
First you need the last record_date for each user:
select user_id, max(record_date) as last_record_date
from mytable
group by user_id
Now, you can join previous query with mytable itself to get amount for this record_date:
select
t1.user_id, last_record_date, amount
from
mytable t1
inner join
( select user_id, max(record_date) as last_record_date
from mytable
group by user_id
) t2
on t1.user_id = t2.user_id
and t1.record_date = t2.last_record_date
A problem appears because a user can have several rows for the same last_record_date (with different amounts). In that case you should pick one of them, for example the largest of the different amounts:
select
t1.user_id, t1.record_date as last_record_date, max(t1.amount)
from
mytable t1
inner join
( select user_id, max(record_date) as last_record_date
from mytable
group by user_id
) t2
on t1.user_id = t2.user_id
and t1.record_date = t2.last_record_date
group by t1.user_id, t1.record_date
I do not know about MySQL, but in general SQL you need a subquery for that. You must join the query that calculates the greatest record_date with the original table to get the corresponding amount. Roughly like this:
SELECT B.*
FROM
(select user_id, max(record_date) max_date from mytable group by user_id) A
join
mytable B
on A.user_id = B.user_id and A.max_date = B.record_date
SELECT datatable.* FROM
mytable AS datatable
INNER JOIN (
SELECT user_id,max(record_date) AS max_record_date FROM mytable GROUP BY user_id
) AS selectortable ON
selectortable.user_id=datatable.user_id
AND
selectortable.max_record_date=datatable.record_date
in some SQLs you might need
SELECT MAX(user_id), ...
in the selectortable view instead of simply SELECT user_id,...
The definition of maximum: there is no larger (or "more recent") value than this one. This naturally leads to a NOT EXISTS query, which should be available in any DBMS.
SELECT user_id, amount
FROM mytable mt
WHERE mt.user_id = $user
AND NOT EXISTS ( SELECT *
FROM mytable nx
WHERE nx.user_id = mt.user_id
AND nx.record_date > mt.record_date
)
;
BTW: your table definition allows more than one record to exist for a given {id,date}, but with different amounts. This query will return them all.
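To compare the join-with-MAX approach and the NOT EXISTS approach on the tie case mentioned above, here is a minimal sketch. It uses an in-memory SQLite database via Python's sqlite3 module as a stand-in for MySQL, leaves out the foreign key to the user table, stores dates as ISO strings, and the sample rows are invented; user 3 has two amounts on the same latest date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER,
    amount REAL NOT NULL CHECK (amount > 0),
    record_date TEXT NOT NULL
);
CREATE UNIQUE INDEX idxu_mybl_key ON mytable (user_id, amount, record_date);

INSERT INTO mytable (user_id, amount, record_date) VALUES
    (1, 70.0, '2024-01-01'), (1, 71.5, '2024-02-01'),
    (2, 80.0, '2024-01-15'),
    (3, 60.0, '2024-03-01'), (3, 61.0, '2024-03-01');
""")

# Join against the per-user latest date, then collapse ties with MAX(amount).
join_rows = conn.execute("""
SELECT t1.user_id, t1.record_date AS last_record_date, MAX(t1.amount)
FROM mytable AS t1
JOIN (SELECT user_id, MAX(record_date) AS last_record_date
      FROM mytable
      GROUP BY user_id) AS t2
  ON t1.user_id = t2.user_id AND t1.record_date = t2.last_record_date
GROUP BY t1.user_id, t1.record_date
ORDER BY t1.user_id
""").fetchall()

# NOT EXISTS keeps every row that has no later row for the same user,
# so ties on the latest date are all returned.
not_exists_rows = conn.execute("""
SELECT mt.user_id, mt.record_date, mt.amount
FROM mytable AS mt
WHERE NOT EXISTS (SELECT 1
                  FROM mytable AS nx
                  WHERE nx.user_id = mt.user_id
                    AND nx.record_date > mt.record_date)
ORDER BY mt.user_id, mt.amount
""").fetchall()

print(join_rows)
print(not_exists_rows)
```

The first form guarantees one row per user; the second returns two rows for user 3. A unique constraint on (user_id, record_date) would remove the ambiguity, if the data allows it.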