I'm not exactly sure how to phrase the title. I have a query that I cannot figure out:
I have a table 'values' with timestamps (1970 epoch decimal) and a blob for each row. I have a second table called 'keys' that contains timestamps and the keys needed to decrypt the blobs in 'values'. The key changes periodically at random intervals, and each time it changes a new set of keys is written to the 'keys' table. There are multiple keys, and when a new key set is written, each key gets a separate row with the same timestamp.
if I do something like this:
select distinct timestamp from keys;
I get one row back for every time the keys rotated and a new keyset was written to the database.
What I would like is a MySQL statement that returns the timestamp of each keyset and the total number of records in the 'values' table between each pair of key timestamps.
For instance:
| Timestamp | Count |
| --------- | ----- |
| 1635962134 | 23 |
| 1636048054 | 450 |
| 1636145254 | 701 |
etc...
The last row needs special consideration, since the "current" keyset doesn't have a following entry in the 'keys' table (yet).
SQL Fiddle with Sample Data:
For the sample data above, the results should be:
| Timestamp | Count |
| --------- | ----- |
| 1635962134 | 14 |
| 1636043734 | 28 |
| 1636119328 | 11 |
You are a little limited by the MySQL version, but you can use a user variable to help build the row set. You could do it with joins too (a sketch of that is included after the explanation below), but it would be a little more complicated.
https://dbfiddle.uk/?rdbms=mysql_5.6&fiddle=b5d587b30f1a758ce31e3fa4745f26d0
SELECT k.key1, k.key2, COUNT(*) AS vol
FROM my_values v
JOIN (
    SELECT key_ts AS key1, @curr - 1 AS key2, @curr := key_ts
    FROM (
        SELECT DISTINCT key_ts
        FROM my_keys
        JOIN (SELECT @curr := 9999999999) var
        ORDER BY key_ts DESC
    ) z
) k ON (v.val_ts BETWEEN k.key1 AND k.key2)
GROUP BY key1, key2
First (the innermost subquery), select the distinct timestamps from my_keys and order them descending. The join is there just to initialize the variable; you could also use a SET statement in a separate query. I set the variable to an arbitrarily high timestamp so that the newest timestamp in the series always has a partner.
From that, select the key and the variable value minus 1 (to prevent overlap), and only then assign the variable to the current key; that assignment has to happen after everything else in the select list. This produces pairs of timestamps, each representing a time range.
Then just join my_values to those keys, and use BETWEEN as the join condition.
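For comparison, here is a rough sketch of the join-based approach mentioned above, using the same my_keys/my_values tables from the fiddle. It pairs each key timestamp with the next one via a correlated subquery instead of a user variable; treat it as a sketch rather than a drop-in replacement.

SELECT k.key_ts AS Timestamp, COUNT(*) AS Count
FROM (SELECT DISTINCT key_ts FROM my_keys) k
JOIN my_values v ON v.val_ts >= k.key_ts
-- Each keyset's range ends just before the next keyset's timestamp;
-- the last ("current") keyset gets the arbitrarily high upper bound.
WHERE v.val_ts < COALESCE(
        (SELECT MIN(k2.key_ts) FROM my_keys k2 WHERE k2.key_ts > k.key_ts),
        9999999999)
GROUP BY k.key_ts
ORDER BY k.key_ts;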
Let me know if that works for you.
I am trying to make a query that joins 2 tables run faster.
I have the following 2 tables:
Table "logs"
id varchar(36) PK
date timestamp(2)
more varchar fields, and one text field
That table has what the PHP Laravel Framework calls a "polymorphic many to many" relationship with several other objects, so there is a second table "logs_pivot" :
id unsigned int PK
log_id varchar(36) FOREIGN KEY (logs.id)
model_id varchar(40)
model_type varchar(50)
There are one or several entries in logs_pivot per entry in logs. They have 20+ million and 10+ million rows, respectively.
We do queries like so:
select * from logs
join logs_pivot on logs.id = logs_pivot.log_id
where model_id = 'some_id' and model_type = 'My\Class'
order by date desc
limit 50;
Obviously we have a compound index on the model_id and model_type fields, but the queries are still slow: several (dozens of) seconds every time.
We also have an index on the date field, but an EXPLAIN shows that the model_id_model_type index is the one being used.
Explain statement:
+----+-------------+-------------+------------+--------+--------------------------------------------------------------------------------+-----------------------------------------------+---------+-------------------------------------------+------+----------+---------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+--------+--------------------------------------------------------------------------------+-----------------------------------------------+---------+-------------------------------------------+------+----------+---------------------------------+
| 1 | SIMPLE | logs_pivot | NULL | ref | logs_pivot_model_id_model_type_index,logs_pivot_log_id_index | logs_pivot_model_id_model_type_index | 364 | const,const | 1 | 100.00 | Using temporary; Using filesort |
| 1 | SIMPLE | logs | NULL | eq_ref | PRIMARY | PRIMARY | 146 | the_db_name.logs_pivot.log_id | 1 | 100.00 | NULL |
+----+-------------+-------------+------------+--------+--------------------------------------------------------------------------------+-----------------------------------------------+---------+-------------------------------------------+------+----------+---------------------------------+
For other tables, I was able to make a similar query much faster by including the date field in the index, but in this case the date lives in a separate table.
When we access this data, it is typically a few hours or days old.
Our InnoDB buffer pool is much too small to hold all that data (plus all the other tables) in memory, so the rows are most probably always read from disk.
What are all the ways we could make that query faster?
Ideally only with another index, or by changing how the query is done.
Thanks a lot !
Edit 17h05:
Thank you all for your answers so far. I will try something like O Jones suggests, and also try to somehow include the date field in the pivot table so that I can include it in the index.
Edit 14/10 10h.
Solution:
I ended up changing how the query is actually done, sorting on the id field of the pivot table instead, which does allow it to be included in an index.
The query that counts the total number of rows was also changed to run only on the pivot table, when it is not filtered by date.
Thank you all!
Just a suggestion. Using a compound index is obviously a good thing. Another option is to pre-qualify an ID by date and extend the index on your logs_pivot table to (model_id, model_type, log_id).
If the entire history is 20+ million records, how far back does the data go when you only need a limit of 50 records for a given model id/type? Say 3 months, versus a log spanning 5 years? (Not stated in the post, just a for-instance.) If you can query the minimum log ID where the date is greater than, say, 3 months back, that single ID can limit how much of logs_pivot the rest of the query has to look at.
Something like
select
      lp.*,
      l.date
   from
      logs_pivot lp
      join logs l
         on lp.log_id = l.id
   where
          lp.model_id = 'some_id'
      and lp.model_type = 'My\Class'
      and lp.log_id >= ( select min( id )
                           from logs
                          where date >= date_sub( curdate(), interval 3 month ))
   order by
      l.date desc
   limit
      50;
So the where clause on log_id is evaluated once and returns a single ID from as far back as 3 months, instead of considering the entire history of logs_pivot. The query then uses the optimized two-part key on model id/type, and because the ID is part of the index key it can jump past all the historical entries.
Another thing you MAY want to add is a pre-aggregate table of record counts, e.g. per month/year per model type/id. Use that as a pre-query to present to users, then drill down from it for more detail. The pre-aggregate can be computed once for all the historical data, since that part is static and won't change. The only period you would have to keep refreshing is the current month, for example on a nightly basis, or even better via a trigger that inserts a record (or bumps a count) for the given model/type and year/month every time a row is added. Again, just a suggestion, since there is no other context on how/why the data will be presented to the end user.
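Purely to illustrate that idea (the summary table name and columns below are my own assumptions, not part of your schema), a monthly pre-aggregate refreshed nightly for the current month might look roughly like:

-- One row per model/type/month; historical months never change.
CREATE TABLE logs_monthly_counts (
  model_id   varchar(40)  NOT NULL,
  model_type varchar(50)  NOT NULL,
  yearmonth  char(7)      NOT NULL,   -- e.g. '2021-10'
  log_count  int unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (model_id, model_type, yearmonth)
);

-- Nightly refresh of the current month only.
REPLACE INTO logs_monthly_counts (model_id, model_type, yearmonth, log_count)
SELECT lp.model_id, lp.model_type, DATE_FORMAT(l.date, '%Y-%m'), COUNT(*)
FROM logs_pivot lp
JOIN logs l ON l.id = lp.log_id
WHERE l.date >= DATE_FORMAT(CURDATE(), '%Y-%m-01')
GROUP BY lp.model_id, lp.model_type, DATE_FORMAT(l.date, '%Y-%m');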
I see two problems:
UUIDs are costly when tables are huge relative to RAM size.
The LIMIT cannot be handled optimally because the WHERE clauses come from one table, but the ORDER BY column comes from another table. That is, it will do all of the JOIN, then sort and finally peel off a few rows.
SELECT columns FROM big table ORDER BY something LIMIT small number is a notorious query-performance antipattern. Why? The server sorts a whole mess of long rows and then discards almost all of them. It doesn't help that one of your columns is a LOB -- a TEXT column.
Here's an approach that can reduce that overhead: Figure out which rows you want by finding the set of primary keys you want, then fetch the content of only those rows.
What rows do you want? This subquery finds them.
SELECT logs.id
FROM logs
JOIN logs_pivot
ON logs.id = logs_pivot.log_id
WHERE logs_pivot.model_id = 'some_id'
AND logs_pivot.model_type = 'My\Class'
ORDER BY logs.date DESC
LIMIT 50
This does all the heavy lifting of working out the rows you want. So, this is the query you need to optimize.
It can be accelerated by this index on logs
CREATE INDEX logs_date_desc ON logs (date DESC);
and this three-column compound index on logs_pivot
CREATE INDEX logs_pivot_lookup ON logs_pivot (model_id, model_type, log_id);
This index is likely to be better, since the Optimizer will see the filtering on logs_pivot but not logs. Hence, it will look in logs_pivot first.
Or maybe
CREATE INDEX logs_pivot_lookup ON logs_pivot (log_id, model_id, model_type);
Try one then the other to see which yields faster results. (I'm not sure how the JOIN will use the compound index.) (Or simply add both, and use EXPLAIN to see which one it uses.)
Then, when you're happy -- or satisfied anyway -- with the subquery's performance, use it to grab the rows you need, like this
SELECT *
FROM logs
WHERE id IN (
SELECT logs.id
FROM logs
JOIN logs_pivot
ON logs.id = logs_pivot.log_id
WHERE logs_pivot.model_id = 'some_id'
AND model_type = 'My\Class'
ORDER BY logs.date DESC
LIMIT 50
)
ORDER BY date DESC
This works because it sorts less data. The covering three-column index on logs_pivot will also help.
Notice that both the sub query and main query have ORDER BY clauses, to make sure the returned detail result set is in the order you need.
Edit Darnit, been on MariaDB 10+ and MySQL 8+ so long I forgot about the old limitation. Try this instead.
SELECT *
FROM logs
JOIN (
SELECT logs.id
FROM logs
JOIN logs_pivot
ON logs.id = logs_pivot.log_id
WHERE logs_pivot.model_id = 'some_id'
AND model_type = 'My\Class'
ORDER BY logs.date DESC
LIMIT 50
) id_set ON logs.id = id_set.id
ORDER BY date DESC
Finally, if you know you only care about rows newer than some certain time you can add something like this to your subquery.
AND logs.date >= NOW() - INTERVAL 5 DAY
This will help a lot if you have tonnage of historical data in your table.
I am trying to get the rows whose ids are listed in a column of that same table. Here's the data:
id layer sublayer
1 A 2, 3, 4
2 B 5
3 C NULL
4 D NULL
5 E NULL
Here's what I am trying to do.
For layer A, I want to fetch B, C, D, whose ids are listed in the sublayer column. Here id is the primary key. Is it possible to read individual values from a column separated by special characters?
Use the FIND_IN_SET function (intended for SET datatypes and comma-separated lists) in conjunction with GROUP_CONCAT (the cross join may not be needed, but I like it because it computes the set once instead of once per row).
I'm using GROUP_CONCAT to collapse all the rows into one large comma-separated list so we can simply check for the existence of the ID. However, I'm not sure how well GROUP_CONCAT behaves when the column already holds comma-separated values...
On a large data set I would be concerned about performance. Your best bet long term is to normalize the data (see the sketch after the query below). But if that's not an option...
Working SQL Fiddle:
SELECT *
FROM layer
CROSS JOIN (SELECT group_Concat(sublayer separator ',') v FROM layer) myset
WHERE FIND_IN_SET(ID, myset.v) > 0;
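As an aside, the long-term normalization mentioned above could look roughly like this (the junction-table and alias names are assumptions for illustration): one row per parent/child pair instead of a comma-separated list, so lookups become plain indexed joins.

CREATE TABLE layer_sublayer (
  layer_id    int NOT NULL,
  sublayer_id int NOT NULL,
  PRIMARY KEY (layer_id, sublayer_id),
  FOREIGN KEY (layer_id)    REFERENCES layer (id),
  FOREIGN KEY (sublayer_id) REFERENCES layer (id)
);

-- Fetch the sublayers of layer 'A' without any string parsing:
SELECT child.*
FROM layer parent
JOIN layer_sublayer ls ON ls.layer_id = parent.id
JOIN layer child ON child.id = ls.sublayer_id
WHERE parent.layer = 'A';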
Or try it this way:
SELECT *
FROM layer
WHERE FIND_IN_SET(id, (
          SELECT sublayer FROM layer WHERE layer = 'A')) > 0;
I have two tables named Header and Detail. Both tables have an InvoiceNumber (double) column and a YearMonthCode (char(4)) column.
In the header table there is a column LastChangedDate (char(10)), and in the detail table there are columns ItemNumber (varchar(16)) and OrderedQty (integer).
Before I go further, let me say that I did not design these tables and cannot change them other than to add or remove indexes or stored procedures. I would not have stored dates as strings, but I have to deal with them in the following form (mm-dd-yyyy).
The header table has about 900,000 records and the detail table has about 2,500,000 records.
The YearMonthCode column runs from 1301 (January 2013) to 1312 (December 2013).
The objective of my query is to return sum(OrderedQty) for a specific item over a given LastChangedDate range.
Here is the query that I came up with. It does return the information that I need, BUT it takes about 30 to 45 seconds per item, and the overall project reports on 600 to 100 items...
I have this as a stored Procedure
GetItemTotal(in itm varchar(16), in ymcode char(4), in SDate date, in EDate date, out total int)
Select sum(D.OrderedQty) into total
FROM Header H use index (index_YMC)
Inner Join Detail D use index (index_YMC)
on H.invoicenumber = D.InvoiceNumber
AND H.YearMonthCode = D.YearMonthCode
Where H.YearMonthCode = ymcode
AND D.ItemNumber = itm
AND STR_TO_DATE(H.LastChangeDate,'%m/%d/%Y')
Between SDate AND EDate;
I call the procedure like this:
Call GetItemTotal('6458-20115', '1311', '2013-11-15', '2013-11-30', @total)
Actually those would all be variables but I hard coded values for testing.
Both tables have PRIMARY indexes and index_YMC on the yearmonthcode columns.
Explain shows the following
id  select_type  table  type  possible_keys  key        key_len  ref    rows     extra
1   SIMPLE       H      ref   Index_YMC      Index_YMC  5        const  100408   Using where
1   SIMPLE       D      ALL   Index_YMC      null       null     null   2032249  Using where; Using join buffer
I am very new to database programming and if anyone can give me some ideas on how to make this query faster I would appreciate it.
If you are already summing values for the entire year (in your population of the temp table), then I suggest adding another column to your temp table population query, derived something like:
sum(case when STR_TO_DATE(LastChangeDate,'%m/%d/%Y') between SDate and EDate
         then OrderedQty
    end) as DateRangeTotal
This should remove the need for the GetItemTotal stored procedure altogether.
(You could probably make the rest of your processing more efficient by using set-based operations instead of looping through 600 items, but that's outside the scope of the question.)
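For concreteness, the temp-table population query might then look roughly like this. It is only a sketch: the temporary table name and the hard-coded date range are assumptions, since the temp-table step isn't shown in the question.

-- Build per-item totals for the whole YearMonthCode in one pass,
-- with the date-range total as an extra conditional column.
CREATE TEMPORARY TABLE tmp_item_totals AS
SELECT D.ItemNumber,
       SUM(D.OrderedQty) AS PeriodTotal,
       SUM(CASE WHEN STR_TO_DATE(H.LastChangeDate, '%m/%d/%Y')
                     BETWEEN '2013-11-15' AND '2013-11-30'
                THEN D.OrderedQty END) AS DateRangeTotal
FROM Header H
JOIN Detail D
  ON H.InvoiceNumber = D.InvoiceNumber
 AND H.YearMonthCode = D.YearMonthCode
WHERE H.YearMonthCode = '1311'
GROUP BY D.ItemNumber;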
I have a table that contains event information for users, i.e. each row contains an ID, a user's start date, completion date, and the event number. Example:
Row 1: ID 256 | StartDate 13360500 | FinishDate 13390500 | EventNum 3
I am trying to intersect all of the rows for users who have finished events 3 & 2, but I can't figure out why my query is returning no results:
SELECT table_id FROM table
WHERE table_EventNum = 3
AND table_FinishDate > 0
AND table_id IN (SELECT table_id FROM table WHERE table_EventNum = 2);
This query without the subquery (the line separated from the rest at the bottom) returns a bunch of non-null results, as it should, and the subquery also returns a bunch of non-null results (again, as it should). But for some reason the composite query doesn't return any rows at all. I know the IN command returns NULL if the expression on either side is NULL, but since both queries return results I'm not sure what else might cause this.
Any ideas? Thanks!
Assuming FinishDate is NULL when the event is not complete. Also assuming that there has to be a row with matching id and event number 2 and that event 3 cannot happen before event 2:
SELECT t1.table_id
FROM table t1
INNER JOIN table t2 ON t1.table_id = t2.table_id
WHERE t1.table_EventNum = 3
  AND t2.table_EventNum = 2
  AND t1.table_FinishDate IS NOT NULL
Note that I could not find anything wrong with your query other than the fact that you do not need a subquery.
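For comparison only (this is a different technique from the join above, and it keeps the question's FinishDate > 0 convention): the same intersection can be expressed with GROUP BY and HAVING, which also avoids duplicate rows when a user has several event-2 entries.

SELECT table_id
FROM table
WHERE table_EventNum IN (2, 3)
GROUP BY table_id
-- Require at least one event-2 row and at least one finished event-3 row.
HAVING SUM(table_EventNum = 2) > 0
   AND SUM(table_EventNum = 3 AND table_FinishDate > 0) > 0;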
I have the following table my_table with primary key id set to AUTO_INCREMENT.
id group_id data_column
1 1 'data_1a'
2 2 'data_2a'
3 2 'data_2b'
I am stuck trying to build a query that will take an array of data, say ['data_3a', 'data_3b'], and appropriately increment the group_id to yield:
id group_id data_column
1 1 'data_1a'
2 2 'data_2a'
3 2 'data_2b'
4 3 'data_3a'
5 3 'data_3b'
I think it would be easy to do using a WITH clause, but that is not supported in MySQL. I am very new to SQL, so maybe I am organizing my data the wrong way? (A group is supposed to represent a group of files that were uploaded together via a form. Each row is a single file and the data column stores its path.)
The "pseudo SQL" I had in mind was:
INSERT INTO my_table (group_id, data_column)
VALUES ($NEXT_GROUP_ID, 'data_3a'), ($NEXT_GROUP_ID, 'data_3b')
LETTING $NEXT_GROUP_ID = (SELECT MAX(group_id) + 1 FROM my_table)
where the made up 'LETTING' clause would only evaluate once at the beginning of the query.
You can start a transaction, do a select max(group_id) + 1, and then do the inserts. Locking the table so others can't change (insert into) it in the meantime would also work.
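A minimal sketch of that transaction approach, assuming the my_table shown above (FOR UPDATE keeps a concurrent upload from computing the same next group_id while this transaction is open):

START TRANSACTION;
-- Lock the rows read so the computed group_id stays valid until COMMIT.
SELECT COALESCE(MAX(group_id), 0) + 1 INTO @next_group_id
FROM my_table
FOR UPDATE;
INSERT INTO my_table (group_id, data_column)
VALUES (@next_group_id, 'data_3a'),
       (@next_group_id, 'data_3b');
COMMIT;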
I would rather have a separate table for the groups if a group represents files that belong together, especially if you may want to store metadata about the group (like the uploading user, the date, etc.). Otherwise (in this case) you end up with redundant data, which is bad most of the time.
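A rough sketch of that separate-groups design (the table and column names here are assumptions): insert one group row per upload, then reuse its auto-increment id for every file in the batch.

CREATE TABLE upload_groups (
  id          int unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
  uploaded_by int unsigned,
  uploaded_at datetime
);

-- my_table.group_id would then reference upload_groups.id.
-- One group per form submission; LAST_INSERT_ID() gives its id.
INSERT INTO upload_groups (uploaded_by, uploaded_at) VALUES (42, NOW());
SET @grp := LAST_INSERT_ID();
INSERT INTO my_table (group_id, data_column)
VALUES (@grp, 'data_3a'), (@grp, 'data_3b');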
Alternatively, MySQL does have something like variables. Check out http://dev.mysql.com/doc/refman/5.1/en/set-statement.html
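If you go the variable route from that link without a transaction, it could look like the sketch below; note that two concurrent uploads could still compute the same group_id this way, which is why a transaction or table lock (as suggested in the other answer) is safer.

-- Race-prone without locking: fine for a single writer, risky otherwise.
SET @next_group_id := (SELECT COALESCE(MAX(group_id), 0) + 1 FROM my_table);
INSERT INTO my_table (group_id, data_column)
VALUES (@next_group_id, 'data_3a'), (@next_group_id, 'data_3b');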