MySQL (InnoDB): select distinct + calculate values to copy into another table

I'm trying to combine some queries so that I can (eventually) use them in a stored procedure. I'm afraid the calculations will take a long time in the long run, and I expect a data warehouse table to make lookups faster.
I came up with the query below, but I keep getting a syntax error (#1064: check the manual that corresponds to your MySQL server version for the right syntax to use near 'select distinct).
Does anyone know a nice approach for this?
insert into datawarehouse (itemid,rating) values
(select distinct itemid from ratings),
(select sum(rating)/count(*)from ratings where itemid in (select distinct itemid from ratings))
If I run the inner select queries separately it works but combining it seems troublesome.
In a nutshell, I want to (i) retrieve the distinct itemid values from table ratings, (ii) perform some calculations on the rows of table ratings for each itemid,
and (iii) copy the results into table datawarehouse.
If anyone has an idea or a good read on this, I would love to hear it.

Use INSERT INTO ... SELECT and GROUP BY:
INSERT INTO datawarehouse (itemid, rating) SELECT itemid, sum(rating)/count(*) FROM ratings GROUP BY itemid
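A minimal sketch of the statement above, run through Python's built-in sqlite3 module so that it is self-contained. SQLite stands in for MySQL here, and the column types and sample ratings are assumptions, not part of the question.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL tables in the question.
# Column types and sample rows are assumptions made for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ratings (itemid INTEGER, rating REAL);
    CREATE TABLE datawarehouse (itemid INTEGER PRIMARY KEY, rating REAL);
    INSERT INTO ratings VALUES (1, 4), (1, 2), (2, 5), (2, 3), (2, 4);
""")

# A single INSERT ... SELECT: GROUP BY already yields one row per itemid,
# so no DISTINCT and no separate subqueries are needed.
conn.execute("""
    INSERT INTO datawarehouse (itemid, rating)
    SELECT itemid, SUM(rating) / COUNT(*)
    FROM ratings
    GROUP BY itemid
""")

warehouse_rows = conn.execute(
    "SELECT itemid, rating FROM datawarehouse ORDER BY itemid"
).fetchall()
print(warehouse_rows)
```

SUM(rating)/COUNT(*) could probably be replaced by AVG(rating), as long as rating is never NULL.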

Related

Fast to query slow to create table

I have an issue with creating a table from a SELECT statement (it runs very slowly). The query should take only the details of the animal with the latest entry date. That query will then be inner-joined to another query.
SELECT *
FROM amusementPart a
INNER JOIN (
SELECT DISTINCT name, type, cageID, dateOfEntry
FROM bigRegistrations
GROUP BY cageID
) r ON a.type = r.cageID
Because of the slow performance, someone suggested these steps to improve it: 1) use a temporary table, 2) store the result and join it to the other statement.
use myzoo
CREATE TABLE animalRegistrations AS
SELECT DISTINCT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
Unfortunately, it is still slow. If I only run the SELECT statement, the result is shown in 1-2 seconds. But if I add the CREATE TABLE, the query takes ages (approx. 25 minutes).
Any good approach to improve the query time?
edit: the size of big registration table is around 3.5 million rows
Please try the query below. It takes only the details of the animal with the latest entry date for each cage, and it can be inner-joined to your other query. The query you are using does not fetch the records you need, and this version should also be faster:
SELECT a.*, b.name, b.type, b.cageID, b.dateOfEntry
FROM amusementPart a
INNER JOIN bigRegistrations b ON a.type = b.cageID
INNER JOIN (SELECT c.cageID, max(c.dateOfEntry) dateofEntry
FROM bigRegistrations c
GROUP BY c.cageID) t ON t.cageID = b.cageID AND t.dateofEntry = b.dateofEntry
Suggested indexing on cageID and dateofEntry
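The "latest row per group" join above can be tried out with a small sketch like the following. It uses Python's sqlite3 module as a stand-in for the real database, leaves out the join to amusementPart, and the sample animals, dates, and the index name ix_cage_date are invented for illustration.

```python
import sqlite3

# SQLite stands in for the real database; the rows below are made-up samples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bigRegistrations (
        name TEXT, type TEXT, cageID INTEGER, dateOfEntry TEXT
    );
    -- Hypothetical index name; the columns follow the suggestion above.
    CREATE INDEX ix_cage_date ON bigRegistrations (cageID, dateOfEntry);
    INSERT INTO bigRegistrations VALUES
        ('Leo',  'lion',  1, '2020-01-01'),
        ('Nala', 'lion',  1, '2021-06-15'),
        ('Zed',  'zebra', 2, '2019-03-10');
""")

# The derived table t holds the latest date per cage; joining it back to
# bigRegistrations recovers the full row that carries that date.
latest_per_cage = conn.execute("""
    SELECT b.name, b.type, b.cageID, b.dateOfEntry
    FROM bigRegistrations b
    INNER JOIN (SELECT cageID, MAX(dateOfEntry) AS dateOfEntry
                FROM bigRegistrations
                GROUP BY cageID) t
        ON t.cageID = b.cageID AND t.dateOfEntry = b.dateOfEntry
    ORDER BY b.cageID
""").fetchall()
print(latest_per_cage)
```

If two rows in the same cage share the latest date, both are returned, so a tie-breaker column may be needed on real data.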
This is a multipart question.
Use Temporary Table
Don't use DISTINCT - group by all columns to make the rows distinct (don't forget to check for an index)
Check the SQL Execution plans
Here you are not creating a temporary table. Try the following...
CREATE TEMPORARY TABLE IF NOT EXISTS animalRegistrations AS
SELECT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
Have you tried doing an explain to see how the plan is different from one execution to the next?
Also, I have found that there can be locking issues in some DB when doing insert(select) and table creation using select. I ran this in MySQL, and it solved some deadlock issues I was having.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
The reason the query runs so slowly is probably that it is creating the temp table based on all 3.5 million rows, when really you only need a subset of those, i.e. the bigRegistrations that match your join to amusementPart. The first single select statement is faster because SQL is smart enough to know it only needs to calculate the bigRegistrations where a.type = r.cageID.
I'd suggest that you don't need a temp table; your first query is quite simple. Rather, you may just need an index. You can determine this manually by studying the estimated execution plan, or by running your query in the database tuning advisor. My guess is you need to create an index similar to the one below. Notice I index by cageId first since that is what you join to amusementPart, so that would help SQL narrow the results down the quickest. But I'm guessing a bit - view the query plan or tuning advisor to be sure.
CREATE NONCLUSTERED INDEX IX_bigRegistrations ON bigRegistrations
(cageId, name, type, dateOfEntry)
Also, if you want the animal with the latest entry date, I think you want this query instead of the one you're using. I'm assuming the PK is all 4 columns.
SELECT name, type, cageID, dateOfEntry
FROM bigRegistrations BR
WHERE BR.dateOfEntry =
(SELECT MAX(BR1.dateOfEntry)
FROM bigRegistrations BR1
WHERE BR1.name = BR.name
AND BR1.type = BR.type
AND BR1.cageID = BR.cageID)

Why can't I use tuples or ordered pairs in a subquery in MySQL?

Say I want to get the most recent row in a table that has a bunch of records with different IDs.
First, I create a temp table, where I find the most recent rows (grouped by ID of course):
CREATE TEMPORARY TABLE
temp1
AS
SELECT DISTINCT ID, max(date) FROM atable GROUP BY ID;
But, since the whole point was to get all the values for these records, I have to join this back to the original table, atable. Annoying, but what can you do.
I really, really want to use a tuple or an ordered pair. Why can't I do this in MySQL??
SELECT * FROM atable
WHERE (ID, date) IN
(SELECT ID, date FROM temp1);
What is the canonical syntax to do this?
(Further, philosophical question: Why is MySQL so clunky with this? It's been around for decades, and nobody has ever implemented something this basic?)
Why can't I do this in MySQL??
You can. MySQL accepts row constructors with IN, so WHERE (ID, date) IN (SELECT ...) is valid syntax. A likely culprit is the temp table: max(date) was selected without an alias, so temp1 has no column named date. Creating it with max(date) AS date should let your query run. Another way to match multiple columns, which works in most databases, is EXISTS:
SELECT * FROM atable
WHERE EXISTS
(SELECT NULL FROM temp1
WHERE temp1.ID = atable.ID AND temp1.date = atable.date);
Note that you could also use your original query as a correlated subquery. The aliases matter: without them, ID in the inner query is compared with itself rather than with the outer row:
SELECT * FROM atable a
WHERE a.date = (SELECT max(b.date) FROM atable b WHERE b.ID = a.ID)
Why is MySQL so clunky with this?
The most common use of IN is to find records where one value is contained in a list of other values (e.g. WHERE NAME IN ('Smith', 'Jones')). This has been extended to allow a subquery to provide the list rather than a static list, and row constructors extend it to several columns at once.
Row value comparisons are part of standard SQL, but not every database supports them with IN (SQL Server, for example, does not), so the EXISTS form is the more portable choice.
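The EXISTS form and the correlated-subquery form can be compared with a small sketch. It runs on Python's sqlite3 module rather than MySQL, and the payload column and the sample rows are invented for illustration.

```python
import sqlite3

# SQLite stands in for MySQL; payload and the sample rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE atable (ID INTEGER, date TEXT, payload TEXT);
    INSERT INTO atable VALUES
        (1, '2023-01-01', 'old'),
        (1, '2023-02-01', 'new'),
        (2, '2023-01-15', 'only');
    -- The alias gives the aggregate column a name that later queries can use.
    CREATE TEMPORARY TABLE temp1 AS
        SELECT ID, MAX(date) AS date FROM atable GROUP BY ID;
""")

# Match on both columns through the temp table with EXISTS.
via_exists = conn.execute("""
    SELECT * FROM atable
    WHERE EXISTS (SELECT NULL FROM temp1
                  WHERE temp1.ID = atable.ID AND temp1.date = atable.date)
    ORDER BY ID
""").fetchall()

# Same result without the temp table: a correlated subquery with aliases.
via_correlated = conn.execute("""
    SELECT * FROM atable a
    WHERE a.date = (SELECT MAX(b.date) FROM atable b WHERE b.ID = a.ID)
    ORDER BY ID
""").fetchall()

print(via_exists)
print(via_correlated)
```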

Joining two large tables in mysql giving server time out

This query is so inefficient that it fails to execute. The track and desiredspeed tables have almost a million records each. After this we want to self-join the track table for further processing. Any efficient approach to executing the query below is appreciated.
select
t_id,
route_id,
t.timestamp,
s_lat,
s_long,
longitude,
latitude,
SQRT(POW((latitude - d_lat),2) + POW((longitude - d_long),2)) as dst,
SUM(speed*18/5)/count(*) as speed,
'20' as actual_speed,
((20-(speed*18/5))/(speed*18/5))*100 as speed_variation
from
track t,
desiredspeed s
WHERE
LEFT(s_lat,6) = LEFT(latitude,6)
AND LEFT(s_long,6)=LEFT(longitude,6)
AND t_id > 53445
group by
route_id,
s_lat,
s_long
order by
t_id asc
First, you are using the old-style implicit (comma) join syntax; I would change that to an explicit JOIN ... ON.
You are also performing two computations (the LEFT() calls) per comparison across large datasets, which is likely to be inefficient.
The join will not be able to use an index because you are performing computation on the columns. Either store the data precomputed, or add a computed column based on the rule applied above and index it accordingly.
Finally, it may be quicker to use temp tables or common table expressions (although I do not know MySQL too well here).
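A rough sketch of the "store it precomputed and index it" idea, using Python's sqlite3 module in place of MySQL. The key columns lat_key and long_key, the index names, the text column types, and the sample coordinates are all assumptions; substr(x, 1, 6) plays the role of MySQL's LEFT(x, 6).

```python
import sqlite3

# SQLite stands in for MySQL; coordinates are stored as text for simplicity.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE track (t_id INTEGER, latitude TEXT, longitude TEXT, speed REAL);
    CREATE TABLE desiredspeed (s_lat TEXT, s_long TEXT);
    INSERT INTO track VALUES
        (1, '12.34567', '77.12345', 5.0),
        (2, '12.34599', '77.12399', 7.0),
        (3, '13.00000', '78.00000', 9.0);
    INSERT INTO desiredspeed VALUES ('12.3451', '77.1231');

    -- Hypothetical precomputed key columns holding the first 6 characters.
    ALTER TABLE track ADD COLUMN lat_key TEXT;
    ALTER TABLE track ADD COLUMN long_key TEXT;
    ALTER TABLE desiredspeed ADD COLUMN lat_key TEXT;
    ALTER TABLE desiredspeed ADD COLUMN long_key TEXT;
    UPDATE track
        SET lat_key = substr(latitude, 1, 6), long_key = substr(longitude, 1, 6);
    UPDATE desiredspeed
        SET lat_key = substr(s_lat, 1, 6), long_key = substr(s_long, 1, 6);

    -- Plain indexes on the stored keys, which the join below can use.
    CREATE INDEX ix_track_key ON track (lat_key, long_key);
    CREATE INDEX ix_desired_key ON desiredspeed (lat_key, long_key);
""")

# Explicit JOIN on the stored keys instead of LEFT() calls in the WHERE clause.
matches = conn.execute("""
    SELECT t.t_id, s.s_lat, s.s_long
    FROM track t
    INNER JOIN desiredspeed s
        ON s.lat_key = t.lat_key AND s.long_key = t.long_key
    ORDER BY t.t_id
""").fetchall()
print(matches)
```

The key columns have to be kept in sync when rows change; a generated column, where the database supports one, would avoid that manual step.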

Why does MySQL allow you to group by columns that are not selected

I'm reading a book on SQL (Sams Teach Yourself SQL in 10 Minutes) and it's quite good despite its title. However, the chapter on GROUP BY confuses me:
"Grouping data is a simple process. The selected columns (the column list following
the SELECT keyword in a query) are the columns that can be referenced in the GROUP
BY clause. If a column is not found in the SELECT statement, it cannot be used in the
GROUP BY clause. This is logical if you think about it—how can you group data on a
report if the data is not displayed? "
How come when I ran this statement in MySQL it works?
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
You're right, MySQL does allow you to create queries that are ambiguous and have arbitrary results. MySQL trusts you to know what you're doing, so it's your responsibility to avoid queries like that.
You can make MySQL enforce GROUP BY in a more standard way (this mode is enabled by default since MySQL 5.7.5):
mysql> SET SQL_MODE=ONLY_FULL_GROUP_BY;
mysql> select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
ERROR 1055 (42000): 'test.EMPLOYEE_PAY_TBL.EMP_ID' isn't in GROUP BY
Because the book is wrong.
The columns in the group by have only one relationship to the columns in the select according to the ANSI standard. If a column is in the select, with no aggregation function, then it (or the expression it is in) needs to be in the group by statement. MySQL actually relaxes this condition.
This is even useful. For instance, if you want to select rows with the highest id for each group from a table, one way to write the query is:
select t.*
from table t
where t.id in (select max(id)
from table t
group by thegroup
);
(Note: There are other ways to write such a query, this is just an example.)
EDIT:
The query that you are suggesting:
select EMP_ID, SALARY
from EMPLOYEE_PAY_TBL
group by BONUS;
would work in MySQL but probably not in any other database (unless BONUS happens to be a poorly named primary key on the table, but that is another matter). It will produce one row for each value of BONUS. For each row, it will get an arbitrary EMP_ID and SALARY from rows in that group. The documentation actually says "indeterminate", but I think arbitrary is easier to understand.
What you should really know about this type of query is simply not to use it. All the "bare" columns in the SELECT (that is, with no aggregation functions) should be in the GROUP BY. This is required in most databases. Note that this is the inverse of what the book says. There is no problem doing:
select EMP_ID
from EMPLOYEE_PAY_TBL
group by EMP_ID, BONUS;
Except that you might get multiple rows back for the same EMP_ID with no way to distinguish among them.
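To illustrate the point that a grouping column does not have to appear in the SELECT list, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL. The column types and sample rows are assumptions.

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE_PAY_TBL (EMP_ID INTEGER, SALARY INTEGER, BONUS INTEGER);
    INSERT INTO EMPLOYEE_PAY_TBL VALUES
        (1, 1000, 100), (2, 2000, 100), (3, 3000, 200);
""")

# BONUS is the grouping column but is not selected. Every selected expression
# is an aggregate, so each output value is well defined for its group.
per_bonus = conn.execute("""
    SELECT COUNT(*), MAX(SALARY)
    FROM EMPLOYEE_PAY_TBL
    GROUP BY BONUS
    ORDER BY MAX(SALARY)
""").fetchall()
print(per_bonus)
```

The output has one row per BONUS value, although the reader of the result cannot tell which bonus each row belongs to.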

Subquery for fetching table name

I have a query like this :
SELECT * FROM (SELECT linktable FROM adm_linkedfields WHERE name = 'company') as cbo WHERE group='BEST'
Basically, the table name for the main query is fetched through the subquery.
I get the error #1054 - Unknown column 'group' in 'where clause'
When I investigate (removing the where clause), I find that the query only returns the subquery result at all times.
Subquery table adm_linkedfields has structure id | name | linktable
I am currently using MySQL with PDO, but the query should be compatible with the major DBs (viz. Oracle, MSSQL, PgSQL and MySQL)
Update:
The subquery should return the name of the table for the main query. In this case it will return tbl_company
The table tbl_company for the main query has this structure :
id | name | group
Thanks in advance.
Dynamic SQL doesn't work like that. What you created is an inline view; read up on that. What's more, you can't create a dynamic SQL query that will work on every database. If you have a limited number of link tables, you could try using left joins or unions to select from all of them, but unless you have a good reason, you don't want that.
Just select the table name in one query and then run another one to access the right table (by building the query string in PHP).
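The two-step approach might look roughly like the sketch below. It is written in Python with sqlite3 rather than PHP with PDO, and the function fetch_linked_rows, the ALLOWED_TABLES whitelist, and the sample rows are hypothetical names and data introduced for illustration.

```python
import sqlite3

# SQLite stands in for MySQL; the sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE adm_linkedfields (id INTEGER, name TEXT, linktable TEXT);
    CREATE TABLE tbl_company (id INTEGER, name TEXT, "group" TEXT);
    INSERT INTO adm_linkedfields VALUES (1, 'company', 'tbl_company');
    INSERT INTO tbl_company VALUES (1, 'Acme', 'BEST'), (2, 'Other', 'REST');
""")

# Hypothetical whitelist: table names cannot be bound as parameters, so the
# looked-up name is checked before it is placed into the query string.
ALLOWED_TABLES = {"tbl_company"}


def fetch_linked_rows(conn, link_name, group_value):
    """Step 1: look up the table name. Step 2: query that table."""
    row = conn.execute(
        "SELECT linktable FROM adm_linkedfields WHERE name = ?", (link_name,)
    ).fetchone()
    if row is None or row[0] not in ALLOWED_TABLES:
        raise ValueError(f"no usable link table for {link_name!r}")
    table = row[0]
    # The value is still passed as a bound parameter; only the identifier
    # is interpolated, and "group" is quoted because it is a reserved word.
    query = f'SELECT * FROM "{table}" WHERE "group" = ?'
    return conn.execute(query, (group_value,)).fetchall()


best_rows = fetch_linked_rows(conn, "company", "BEST")
print(best_rows)
```

Double-quoted identifiers are standard SQL; MySQL accepts them only with the ANSI_QUOTES mode enabled and otherwise expects backticks.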
Here is an issue:
SELECT * FROM (SELECT linktable FROM adm_linkedfields WHERE name = 'company') as cbo
WHERE group='BEST';
You are selecting from a derived table (DT) that contains only one column, linktable, so you can't reference any other column in the WHERE clause of the outer block. Think in terms of blocks: the outer SELECT refers to a derived table that contains only that one column.
Your problem is similar when you try to do:
create table t1(x1 int);
select * from t1 where z1 = 7; -- error: t1 has no column z1
Your query is:
SELECT *
FROM (SELECT linktable
FROM adm_linkedfields
WHERE name = 'company'
) cbo
WHERE group='BEST'
First, if you are interested in cross-database compatibility, do not name columns or tables after SQL reserved words. group is a really, really bad name for a column.
Second, the from clause is returning a table containing a list of names (of tables, but that is irrelevant). There is no column called group, so that is the problem you are having.
What can you do to fix this? A naive solution would be to run the subquery and use the resulting table name in a dynamic statement to execute the query you want.
The fundamental problem is your data structure. Having multiple tables with the same structure is generally a sign of a bad design. You basically have two choices.
One. If you have control over the database structure, put all the data in a single table, linktable for instance. This would have the information for all companies, and a column for group (or whatever you rename it). This solution is compatible across all databases. If you have lots and lots of data in the tables (think tens of millions of rows), then you might think about partitioning the data for performance reasons.
Two. If you don't have control over the data, create a view that concatenates all the tables together. Something like:
create view vw_linktable as
select 'table1' as which, t.* from table1 t union all
select 'table2', t.* from table2 t
This is also compatible across all databases.
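A small sketch of such a view, run through Python's sqlite3 module as a stand-in database. The two tables are assumed to share the structure id | name, and the sample rows are invented.

```python
import sqlite3

# SQLite stands in for the target database; table contents are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, name TEXT);
    CREATE TABLE table2 (id INTEGER, name TEXT);
    INSERT INTO table1 VALUES (1, 'Acme');
    INSERT INTO table2 VALUES (1, 'Acme'), (2, 'Beta');

    -- The literal in the first column records which table a row came from.
    CREATE VIEW vw_linktable AS
        SELECT 'table1' AS which, t.* FROM table1 t UNION ALL
        SELECT 'table2', t.* FROM table2 t;
""")

# The view can be filtered like a single table, across all source tables.
acme_rows = conn.execute("""
    SELECT which, id, name
    FROM vw_linktable
    WHERE name = 'Acme'
    ORDER BY which
""").fetchall()
print(acme_rows)
```

The view only works while all the underlying tables keep the same column list, which is the design constraint the answer points out.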