I want to read data for reporting purposes. Currently, I populate a table from another table's calculated data, and read the populated table for reporting. My current logic is to delete the old data and insert the new data, all within a transaction.
UPDATE
Requirements
1) The logic below runs once every second. Please note that other processes also update tableB at the same refresh rate.
2) TableB is used for reporting purposes. TableA and TableB reside in different databases.
3) TableB contains around 10 million rows; around 4 million of them are updated once every second by the code below. Other processes update the remaining 6 million rows in tableB at the same refresh rate.
My concern is that:
1) The three statements use similar SUM and WHERE clauses, which might be improved.
2) There are about 1-2 million rows in tableA to update into tableB. Using an explicit temporary table might slow things down.
3) Using a transaction might slow things down, but it seems to be the only way.
4) Updating the data might be a better option than delete-and-insert (which one should I choose?)
I want to find a better performant way (including table redesign etc.). Below is the current way:
Pseudocode below:
START TRANSACTION;
-- delete from tableb the data that is about to be re-inserted,
-- e.g. delete data where Code = 'code'
DELETE FROM tableb WHERE Code = 'code';
INSERT INTO tableb(Code, Total)
SELECT a.Code, SUM(a.price)
FROM tablea a
GROUP BY a.Code;
INSERT INTO tableb(Code, Total)
SELECT a.Code, SUM(a.price)       -- use price
FROM tablea a
WHERE a.meanPrice IS NOT NULL
GROUP BY a.Code;
INSERT INTO tableb(Code, Total)
SELECT a.Code, SUM(a.meanPrice)   -- use meanPrice
FROM tablea a
WHERE a.meanPrice IS NOT NULL
GROUP BY a.Code;
COMMIT;
It is for MySQL, but ideally it should be generic.
Any idea?
Do you actually need to update values in the table? They are not tagged with any id or names to identify them.
The following SELECT statement returns the data you want:
SELECT code,
sum(price),
sum(case when a.meanPrice is not null then price else 0 end),
sum(case when a.meanPrice is not null then meanprice else 0 end)
FROM tablea a
GROUP BY a.Code;
If you needed to insert this into a temp table, you can unpivot the data. However, that format does not make sense to me. Can you explain why you are using a table with one numeric column in this way?
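The conditional-aggregation SELECT above can be sketched end to end like this, using Python's sqlite3 as a stand-in for MySQL; the table layout follows the question, and the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablea (Code TEXT, price REAL, meanPrice REAL);
INSERT INTO tablea VALUES
  ('X', 10, 9),
  ('X', 20, NULL),
  ('Y', 5,  4);
""")

# One pass over tablea computes all three totals per Code.
rows = conn.execute("""
SELECT Code,
       SUM(price)                                                     AS total_price,
       SUM(CASE WHEN meanPrice IS NOT NULL THEN price     ELSE 0 END) AS price_with_mean,
       SUM(CASE WHEN meanPrice IS NOT NULL THEN meanPrice ELSE 0 END) AS total_mean
FROM tablea
GROUP BY Code
ORDER BY Code
""").fetchall()
print(rows)  # [('X', 30.0, 10.0, 9.0), ('Y', 5.0, 5.0, 4.0)]
```

Each output row carries all three sums side by side, which is why the single-column tableb layout looks odd from this angle.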
This query does your INSERT task in one step, but... kids, don't do this at home without measuring actual performance:
http://sqlfiddle.com/#!2/381e2/9
INSERT INTO tableb(Code, Total)
SELECT
tablea.Code,
CASE
WHEN t.v = 1
THEN SUM( price )
WHEN t.v = 2
THEN SUM(
CASE
WHEN meanPrice IS NOT NULL THEN price
ELSE 0
END
)
WHEN t.v = 3
THEN SUM( meanPrice )
END AS Total
FROM tablea
INNER JOIN
( SELECT 1 AS v UNION ALL
SELECT 2 AS v UNION ALL
SELECT 3 AS v
) AS t
GROUP BY tablea.Code, t.v;
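The cross-join trick above can be checked with sqlite3 standing in for MySQL: each Code is joined against a three-row helper table (v = 1..3), so a single GROUP BY pass emits the same three Total rows per Code that the original three INSERTs produced. Sample data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablea (Code TEXT, price REAL, meanPrice REAL);
CREATE TABLE tableb (Code TEXT, Total REAL);
INSERT INTO tablea VALUES ('X', 10, 9), ('X', 20, NULL);
""")

# The derived table t multiplies each group by 3; CASE picks which sum
# each copy contributes.
conn.execute("""
INSERT INTO tableb (Code, Total)
SELECT tablea.Code,
       CASE t.v
         WHEN 1 THEN SUM(price)
         WHEN 2 THEN SUM(CASE WHEN meanPrice IS NOT NULL THEN price ELSE 0 END)
         WHEN 3 THEN SUM(meanPrice)
       END
FROM tablea
JOIN (SELECT 1 AS v UNION ALL SELECT 2 UNION ALL SELECT 3) AS t
GROUP BY tablea.Code, t.v
""")

rows = conn.execute("SELECT Code, Total FROM tableb ORDER BY Total").fetchall()
print(rows)  # [('X', 9.0), ('X', 10.0), ('X', 30.0)]
```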
Point 3 is false.
Solution 1: Create a stored procedure.
Solution 2: Create a trigger on the impacted tables.
Solution 3: Don't recompute the sum every time: do the sum once, save the number in another table, and on every modification of the base table update the saved sum. That won't be 1 million records, only one per code.
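Solution 3 might be sketched with triggers (combining it with solution 2), here in sqlite3 as a stand-in for MySQL; the `totals` table name and sample data are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tablea (Code TEXT, price REAL);
CREATE TABLE totals (Code TEXT PRIMARY KEY, Total REAL NOT NULL DEFAULT 0);

-- keep per-Code running totals in sync on every insert into tablea
CREATE TRIGGER tablea_ins AFTER INSERT ON tablea
BEGIN
  INSERT OR IGNORE INTO totals (Code, Total) VALUES (NEW.Code, 0);
  UPDATE totals SET Total = Total + NEW.price WHERE Code = NEW.Code;
END;

-- and on every delete
CREATE TRIGGER tablea_del AFTER DELETE ON tablea
BEGIN
  UPDATE totals SET Total = Total - OLD.price WHERE Code = OLD.Code;
END;
""")

conn.executemany("INSERT INTO tablea VALUES (?, ?)", [('X', 10), ('X', 20), ('Y', 5)])
conn.execute("DELETE FROM tablea WHERE price = 20")

rows = conn.execute("SELECT Code, Total FROM totals ORDER BY Code").fetchall()
print(rows)  # [('X', 10.0), ('Y', 5.0)]
```

Reads for reporting then hit the small `totals` table instead of re-summing millions of rows.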
Pivot tables!
Related
I have a table which has ~500 GB of data and two queries running on it.
-- Query 1
select Count(*) from table
where C1 = A
-- Query 2
select Count(*) from table
where C1 = A and C2 = B
I feel that executing Query 2 over the whole table is unnecessary, as its results are a subset of Query 1's. Is there any optimized way to first execute Query 1, then run Query 2 on its results, and finally return the counts of both?
SELECT
COUNT(*) AS cnt_1,
SUM(c2 = 'B') AS cnt_2
FROM yourTable
WHERE c1 = 'A';
An index on yourTable (c1, c2) will improve performance.
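This single-pass query can be checked in sqlite3, which, like MySQL, lets you SUM a boolean comparison; the sample data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE yourTable (c1 TEXT, c2 TEXT);
INSERT INTO yourTable VALUES ('A','B'), ('A','C'), ('A','B'), ('D','B');
""")

# COUNT(*) answers query 1; SUM(c2 = 'B') answers query 2 on the same rows.
cnt_1, cnt_2 = conn.execute("""
SELECT COUNT(*)      AS cnt_1,
       SUM(c2 = 'B') AS cnt_2
FROM yourTable
WHERE c1 = 'A'
""").fetchone()
print(cnt_1, cnt_2)  # 3 2
```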
No. Any such inter-query optimization would depend on the database, and I'm not familiar with any database that caches intermediate result sets. In addition, such an optimization would be rendered useless if the underlying table changes -- and relational databases are designed to support changing data.
As a note: the optimization would have to be very sophisticated, because the count returned by the first query has nothing to do with the count returned by the second. You are thinking that the rows in the second are a subset of the first, but those rows are not actually returned. Some databases -- including MySQL -- can cache result sets so the same query run later would use the cache. However, MySQL is removing that support because of the complications it introduces.
If you want to phrase this as two queries, your best bet is an index on t(c1, c2). The index will be used for both queries and should be pretty efficient.
Otherwise, use a single query. Akina's solution is the best approach among the other answers because it filters before aggregating.
Use conditional aggregation:
SELECT
SUM(C1 = A) AS cnt_1,
SUM(C1 = A AND C2 = B) AS cnt_2
FROM yourTable;
The above works because MySQL happens to support summing boolean expressions. On most other databases, you would use this version:
SELECT
COUNT(CASE WHEN C1 = A THEN 1 END) AS cnt_1,
COUNT(CASE WHEN C1 = A AND C2 = B THEN 1 END) AS cnt_2
FROM yourTable;
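Both forms can be checked side by side in sqlite3, which also treats a comparison as 0/1; the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE yourTable (C1 TEXT, C2 TEXT);
INSERT INTO yourTable VALUES ('A','B'), ('A','C'), ('D','B');
""")

# MySQL-style: summing boolean expressions directly.
mysql_style = conn.execute(
    "SELECT SUM(C1 = 'A'), SUM(C1 = 'A' AND C2 = 'B') FROM yourTable").fetchone()

# Portable: COUNT over a CASE that yields NULL for non-matching rows.
portable = conn.execute("""
SELECT COUNT(CASE WHEN C1 = 'A' THEN 1 END),
       COUNT(CASE WHEN C1 = 'A' AND C2 = 'B' THEN 1 END)
FROM yourTable
""").fetchone()
print(mysql_style, portable)  # (2, 1) (2, 1)
```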
SELECT SUM(CASE WHEN C1 = 'A' THEN 1 ELSE 0 END) A_CNT,
       SUM(CASE WHEN C1 = 'A' AND C2 = 'B' THEN 1 ELSE 0 END) B_CNT
FROM table
I have an issue with creating tables using the SELECT keyword (it runs very slowly). The query is meant to take only the details of the animal with the latest entry date; that query will then be inner joined to another query.
SELECT *
FROM amusementPart a
INNER JOIN (
SELECT DISTINCT name, type, cageID, dateOfEntry
FROM bigRegistrations
GROUP BY cageID
) r ON a.type = r.cageID
But because of the slow performance, someone suggested steps to improve it: 1) use a temporary table, 2) store the result and join it to the other statement.
use myzoo
CREATE TABLE animalRegistrations AS
SELECT DISTINCT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
Unfortunately, it is still slow. If I run only the SELECT statement, the result is shown in 1-2 seconds. But if I add CREATE TABLE, the query takes ages (approx. 25 minutes).
Any good approach to improve the query time?
edit: the bigRegistrations table has around 3.5 million rows
Can you please try the query below? It achieves your goal ("take only the details of the animal with the latest entry date; that query will be used to inner join another query") — the query you are using does not fetch records per your requirement, and this one should be faster:
SELECT a.*, b.name, b.type, b.cageID, b.dateOfEntry
FROM amusementPart a
INNER JOIN bigRegistrations b ON a.type = b.cageID
INNER JOIN (SELECT c.cageID, max(c.dateOfEntry) dateofEntry
FROM bigRegistrations c
GROUP BY c.cageID) t ON t.cageID = b.cageID AND t.dateofEntry = b.dateofEntry
I suggest indexing cageID and dateOfEntry.
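The join-to-a-MAX-subquery pattern above can be sketched in sqlite3 with toy data; only the latest dateOfEntry per cageID survives the join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bigRegistrations (name TEXT, type TEXT, cageID INTEGER, dateOfEntry TEXT);
INSERT INTO bigRegistrations VALUES
  ('Leo',  'lion',  1, '2020-01-01'),
  ('Leon', 'lion',  1, '2020-02-01'),
  ('Zara', 'zebra', 2, '2020-01-15');
""")

# The derived table t holds the latest date per cage; joining back to
# bigRegistrations keeps only the matching (latest) rows.
rows = conn.execute("""
SELECT b.name, b.cageID, b.dateOfEntry
FROM bigRegistrations b
JOIN (SELECT cageID, MAX(dateOfEntry) AS dateOfEntry
      FROM bigRegistrations
      GROUP BY cageID) t
  ON t.cageID = b.cageID AND t.dateOfEntry = b.dateOfEntry
ORDER BY b.cageID
""").fetchall()
print(rows)  # [('Leon', 1, '2020-02-01'), ('Zara', 2, '2020-01-15')]
```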
This is a multipart question.
Use Temporary Table
Don't use DISTINCT - group by all columns to make rows distinct (don't forget to check for an index)
Check the SQL Execution plans
Here you are not creating a temporary table. Try the following...
CREATE TEMPORARY TABLE IF NOT EXISTS animalRegistrations AS
SELECT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
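CREATE TEMPORARY TABLE ... AS SELECT works the same way in sqlite3 (a stand-in for MySQL here); the temp table disappears when the connection closes. Sample data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bigRegistrations (name TEXT, type TEXT, cageID INTEGER, dateOfEntry TEXT);
INSERT INTO bigRegistrations VALUES
  ('Leo',  'lion', 1, '2020-01-01'),
  ('Leon', 'lion', 1, '2020-02-01');

-- one row per cageID, carrying the latest entry date
CREATE TEMPORARY TABLE animalRegistrations AS
SELECT name, type, cageID, MAX(dateOfEntry) AS entryDate
FROM bigRegistrations
GROUP BY cageID;
""")

rows = conn.execute("SELECT * FROM animalRegistrations").fetchall()
print(rows)  # [('Leon', 'lion', 1, '2020-02-01')]
```

(Note that pairing bare columns like `name` with MAX() is non-standard; SQLite and MySQL happen to tolerate it, and SQLite picks them from the max row.)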
Have you tried doing an explain to see how the plan is different from one execution to the next?
Also, I have found that there can be locking issues in some DB when doing insert(select) and table creation using select. I ran this in MySQL, and it solved some deadlock issues I was having.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
The reason the query runs so slowly is probably that it is creating the temp table from all 3.5 million rows, when really you only need a subset of those, i.e. the bigRegistrations that match your join to amusementPart. The first single SELECT statement is faster because SQL is smart enough to know it only needs to calculate the bigRegistrations where a.type = r.cageID.
I'd suggest that you don't need a temp table; your first query is quite simple. Rather, you may just need an index. You can determine this manually by studying the estimated execution plan, or by running your query through the database tuning advisor. My guess is you need to create an index similar to the one below. Notice I index by cageID first, since that is what you join to amusementPart, so that helps SQL narrow the results down the quickest. But I'm guessing a bit - view the query plan or tuning advisor to be sure.
CREATE NONCLUSTERED INDEX IX_bigRegistrations ON bigRegistrations
(cageId, name, type, dateOfEntry)
Also, if you want the animal with the latest entry date, I think you want this query instead of the one you're using. I'm assuming the PK is all 4 columns.
SELECT name, type, cageID, dateOfEntry
FROM bigRegistrations BR
WHERE BR.dateOfEntry =
(SELECT MAX(BR1.dateOfEntry)
FROM bigRegistrations BR1
WHERE BR1.name = BR.name
AND BR1.type = BR.type
AND BR1.cageID = BR.cageID)
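The correlated-subquery version above can be checked in sqlite3 on toy data; note that it keys the MAX on (name, type, cageID), per the assumed primary key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bigRegistrations (name TEXT, type TEXT, cageID INTEGER, dateOfEntry TEXT);
INSERT INTO bigRegistrations VALUES
  ('Leo',  'lion',  1, '2020-01-01'),
  ('Leo',  'lion',  1, '2020-03-01'),
  ('Zara', 'zebra', 2, '2020-01-15');
""")

# For each row, the subquery finds the latest date for that same
# (name, type, cageID); only rows matching it survive.
rows = conn.execute("""
SELECT name, type, cageID, dateOfEntry
FROM bigRegistrations BR
WHERE BR.dateOfEntry = (SELECT MAX(BR1.dateOfEntry)
                        FROM bigRegistrations BR1
                        WHERE BR1.name = BR.name
                          AND BR1.type = BR.type
                          AND BR1.cageID = BR.cageID)
ORDER BY cageID
""").fetchall()
print(rows)  # [('Leo', 'lion', 1, '2020-03-01'), ('Zara', 'zebra', 2, '2020-01-15')]
```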
I want to insert 5 new rows into a table if and only if none of the 5 rows is already there. If any of them is in the table, then I want to abort the insertion (without updating anything), and know which one (or ones) were already there.
I can think of long ways to do this (such as checking whether SELECT col1 WHERE col1 IN (value1, value2, ...) returns anything, and then inserting only if it doesn't).
I also guess transactions can do this, but I'm still learning how they work. However, I don't know if a transaction can tell me which entry (or entries) is a duplicate.
With or without transactions, is there any way to do this in only one or two queries?
Thanks
I doubt there is a better way than the solution you mentioned: First run a SELECT query and if it doesn't return anything, INSERT. You asked for something in one or two queries. This is exactly two queries, so pretty efficient in my view. I can't think of an efficient way to use transactions for this. Transactions are good when you have multiple INSERT or UPDATE queries, you have only one.
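The two-query approach endorsed above can be sketched in sqlite3 (table name and values follow the question): one SELECT finds duplicates, which also tells you which values were already there, then one multi-row INSERT runs only if the SELECT came back empty:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FOO (col1 TEXT)")
conn.execute("INSERT INTO FOO VALUES ('val2')")

new_vals = ['val1', 'val2', 'val3', 'val4', 'val5']
placeholders = ','.join('?' * len(new_vals))

# query 1: which of the new values are already present?
dups = [r[0] for r in conn.execute(
    f"SELECT col1 FROM FOO WHERE col1 IN ({placeholders})", new_vals)]

# query 2: insert only when query 1 found nothing
if dups:
    print("aborted; already present:", dups)  # aborted; already present: ['val2']
else:
    conn.executemany("INSERT INTO FOO VALUES (?)", [(v,) for v in new_vals])
```

Wrap both queries in one transaction if other writers may race between them.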
The INSERT statement does not give many ways to do this job. If you add a UNIQUE constraint on the desired field and then insert all the values in a single statement, such as
INSERT INTO FOO(col1) VALUES
(val1),
(val2),
(val3),
(val4),
(val5);
it will raise an error due to the constraint violation and therefore abort the statement.
If you want to avoid the error, the job becomes a little perverse:
INSERT INTO FOO(col1)
SELECT a.* FROM (SELECT val1
                 UNION
                 SELECT val2
                 UNION
                 SELECT val3
                 UNION
                 SELECT val4
                 UNION
                 SELECT val5) a
INNER JOIN
  (SELECT g.* FROM (
     SELECT false AS b FROM foo WHERE col1 IN (val1, val2....)
     UNION
     SELECT true) g
   LIMIT 1) b ON b.b
What happens? The innermost query returns true only when none of the values is already present, so the statement inserts the values only if none of them exists.
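The UNIQUE-constraint behaviour described above can be checked in sqlite3 (as a stand-in for MySQL): a multi-row INSERT that hits a duplicate raises an error and the whole statement is backed out, so none of the five rows sticks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE FOO (col1 TEXT UNIQUE);
INSERT INTO FOO VALUES ('val3');
""")

try:
    conn.execute(
        "INSERT INTO FOO(col1) VALUES ('val1'), ('val2'), ('val3'), ('val4'), ('val5')")
    inserted = True
except sqlite3.IntegrityError:
    inserted = False  # 'val3' was already there; the statement aborted

count = conn.execute("SELECT COUNT(*) FROM FOO").fetchone()[0]
print(inserted, count)  # False 1
```

The caught error tells you *that* a duplicate exists but not *which* one; a follow-up SELECT ... WHERE col1 IN (...) answers that.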
I have two identical tables called LIVE and BACKUP.
What I am trying to do is compare a LIVE record with its equivalent BACKUP record to see if they match. This check is required each time an individual LIVE record is accessed.
I.e. I only want to compare record number 59 (as an example), rather than all records in the LIVE table.
Currently I can do what I want by simply comparing the LIVE record and its equivalent BACKUP record on a field by field basis.
However, I was wondering if it is possible to do a simple "Compare LIVE record A with BACKUP record A".
I don't need to know what the differences are or even in which fields they occur. I only need to know a simple yes/no as to whether both records match or not.
Is such a thing possible or am I stuck comparing the tables on a field by field basis?
Many thanks,
Pete
Here is a hack, assuming the columns really are all the same:
select count(*)
from ((select *
from live
where record = 'A'
) union
(select *
from backup
where record = 'A'
)
) t
This will return "1" if the two records are identical and "2" if they differ. If you want to guard against both rows coming from the same table, then use the modified form:
select count(distinct which)
from ((select 'live' as which, l.*
from live l
where record = 'A'
) union
(select 'backup' as which, b.*
from backup b
where record = 'A'
)
) t;
Also, note the use of UNION: the duplicate removal is very intentional here.
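The union-dedup comparison can be checked in sqlite3: identical rows collapse to one, differing rows count as two. Column names and data are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE live   (record TEXT, f1 TEXT, f2 INTEGER);
CREATE TABLE backup (record TEXT, f1 TEXT, f2 INTEGER);
INSERT INTO live   VALUES ('A', 'x', 1), ('B', 'y', 2);
INSERT INTO backup VALUES ('A', 'x', 1), ('B', 'y', 99);
""")

def matches(rec):
    # UNION (not UNION ALL) removes duplicates, so identical rows count once.
    (n,) = conn.execute("""
        SELECT COUNT(*) FROM (
          SELECT * FROM live WHERE record = ?
          UNION
          SELECT * FROM backup WHERE record = ?)
    """, (rec, rec)).fetchone()
    return n == 1

print(matches('A'), matches('B'))  # True False
```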
This is an extension of the "dual" table concept (a temporary table created on the fly for one query and discarded straight after).
I am trying to join a multi-row dual table with another table, so as to avoid running the same query several times with different parameters, using one statement.
One of the issues I am having is that UNION is very slow for dual tables, and I am unaware of any more efficient way to accomplish the following (100 ms when unioning 50 duals together).
SELECT
b.id,
b.ref_unid,
a.date
FROM
(
SELECT
'b8518a84-c501-11dd-b0b6-001d7dc91168' as unid,
'2010-01-05' as date
UNION
SELECT
'b853a1f2-c501-11dd-b0b6-001d7dc91168',
'2010-01-06'
UNION
SELECT
'b8557bd0-c501-11dd-b0b6-001d7dc91168',
'2010-01-07'
/* ... */
) as a
join other_table b
ON
b.ref_unid = a.unid
Is there another way of accomplishing this goal?
Is there any syntax, similar to that of the INSERT INTO ... VALUES statement, that would accomplish that goal, such as:
SELECT
unid,
id
FROM
(
WITH (unid, date) USING VALUES
(
('b8518a84-c501-11dd-b0b6-001d7dc91168','2010-01-05'),
('b853a1f2-c501-11dd-b0b6-001d7dc91168','2010-01-06'),
('b8557bd0-c501-11dd-b0b6-001d7dc91168','2010-01-07'),
/* ... */
)
) as a
join other_table b
ON
b.ref_unid = a.unid
I'm looking for a 1-statement solution. Multiple trips to the database aren't possible.
There's no other convention I'm aware of that's available in MySQL to construct a derived table in a single statement. If this dealt with a single column, at ~50 values it could be converted to use an IN clause.
The best performing approach is to load the data into a table of one form or another -- in MySQL, for a temporary use I'd recommend using the MEMORY engine. At ~50 tuples, I have to wonder why the data isn't already in the database...
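For what it's worth, standard SQL does have a VALUES row constructor that builds exactly the multi-row "dual" the question asks for; MySQL only gained it much later (the VALUES ROW(...) statement in 8.0.19), but it can be sketched today in sqlite3 via a CTE. The shortened UNIDs and table contents are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE other_table (id INTEGER, ref_unid TEXT);
INSERT INTO other_table VALUES (1, 'b8518a84'), (2, 'b853a1f2');
""")

# The CTE names the columns of the VALUES rows, replacing the chain of UNIONs.
rows = conn.execute("""
WITH a(unid, entry_date) AS (
  VALUES ('b8518a84', '2010-01-05'),
         ('b853a1f2', '2010-01-06'),
         ('b8557bd0', '2010-01-07')
)
SELECT b.id, b.ref_unid, a.entry_date
FROM a
JOIN other_table b ON b.ref_unid = a.unid
ORDER BY b.id
""").fetchall()
print(rows)  # [(1, 'b8518a84', '2010-01-05'), (2, 'b853a1f2', '2010-01-06')]
```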