merging tables which consist of 17 million records - mysql

I have 3 tables: two of them have 200,000 records and the third has 1,800,000 records. I merge these 3 tables using 2 constraints, OCN and a timestamp (month, year). The first two tables store the month and year in a single column, Monthx (which includes month, date and year); the other table has separate columns for month and year. I ran this query:
insert into trail
select * from A,B,C
where A.OCN=B.OCN
and B.OCN=C.OCN
and C.OCN=A.OCN
and date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b')
and date_format(A.Monthx,'%b')=C.IMonth
and date_format(B.Monthx,'%b')=C.month
and year(A.Monthx)=year(B.Monthx)
and year(B.Monthx)=C.Iyear
and year(A.Monthx)=C.Iyear
I started this query 4 days ago and it is still running. Could you tell me whether this query is correct, and provide an exact query? (I used '%b' because my C table has a column which holds months in the form JAN, MAR.)

Please don't use implicit where joins; bury them in 1989, where they belong. Use explicit joins instead:
select * from a inner join b on (a.ocn = b.ocn and
date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b') ....
This is the select part of your query, rewritten with explicit joins because I refuse to deal with '89 syntax:
select * from A
inner join B on (
A.OCN=B.OCN
and date_format(A.Monthx,'%b')=date_format(B.Monthx,'%b')
and year(A.Monthx)=year(B.Monthx)
)
inner join C on (
C.OCN=A.OCN
and date_format(A.Monthx,'%b')=C.IMonth
and date_format(B.Monthx,'%b')=C.month
and year(B.Monthx)=C.Iyear
and year(A.Monthx)=C.Iyear
)
It has a lot of problems:
Using a function on a field kills any opportunity to use an index on that field.
You are doing a lot of duplicate tests: if (A = B) and (B = C), it logically follows that (A = C).
The translations of the date fields take a lot of time.
I would suggest you rewrite your tables to use fields that can be compared directly, without translating them through functions.
A field like yearmonth : char(6), e.g. 201006, can be indexed and compared much faster.
If the tables A, B and C each have such a field, called ym for short, then your query can be:
INSERT INTO TRAIL
SELECT a.*, b.*, c.* FROM a
INNER JOIN b ON (
a.ocn = b.ocn
AND a.ym = b.ym
)
INNER JOIN c ON (
a.ocn = c.ocn
AND a.ym = c.ym
);
If you put indexes on ocn (probably the primary index) and ym, the query should process about a million rows a second (or more).
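A minimal sketch of the ym idea, using Python's built-in sqlite3 module as a stand-in for MySQL so it can be tried on a tiny data set. The table layouts, sample rows, index names and the MONTHS lookup are all assumptions for illustration; in MySQL the ym value for A and B could be filled with DATE_FORMAT(Monthx, '%Y%m') instead of strftime.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (ocn INTEGER, monthx TEXT, ym TEXT);
CREATE TABLE b (ocn INTEGER, monthx TEXT, ym TEXT);
CREATE TABLE c (ocn INTEGER, imonth TEXT, iyear INTEGER, ym TEXT);
INSERT INTO a (ocn, monthx) VALUES (1, '2010-06-15'), (2, '2010-07-01');
INSERT INTO b (ocn, monthx) VALUES (1, '2010-06-30'), (2, '2010-08-01');
INSERT INTO c (ocn, imonth, iyear) VALUES (1, 'JUN', 2010), (2, 'JUL', 2010);
""")

# Fill ym once, so the join itself never has to call a function per row.
conn.execute("UPDATE a SET ym = strftime('%Y%m', monthx)")
conn.execute("UPDATE b SET ym = strftime('%Y%m', monthx)")

# Table c stores the month as text (JAN, MAR, ...), so map it to a number.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
for number, name in enumerate(MONTHS, start=1):
    conn.execute(
        "UPDATE c SET ym = CAST(iyear AS TEXT) || ? WHERE upper(imonth) = ?",
        (f"{number:02d}", name),
    )

# One composite index per table covers both join columns.
for table in ("a", "b", "c"):
    conn.execute(f"CREATE INDEX {table}_ocn_ym ON {table} (ocn, ym)")

rows = conn.execute("""
SELECT a.ocn, a.ym
FROM a
INNER JOIN b ON a.ocn = b.ocn AND a.ym = b.ym
INNER JOIN c ON a.ocn = c.ocn AND a.ym = c.ym
ORDER BY a.ocn
""").fetchall()
print(rows)
```

Only OCN 1 survives the join here, because the sample rows for OCN 2 fall in different months in A and B.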

To test whether your query is OK, import a small subset of records from A, B and C into a temporary database and test it there.
You have redundancies in your implicit JOIN because you are joining A.OCN with B.OCN, B.OCN with C.OCN and then C.OCN with A.OCN; one of those can be deleted. If A.OCN = B.OCN and B.OCN = C.OCN, then A.OCN = C.OCN is implied. Further, I guess you have the same kind of redundancy in your date comparisons.

Related

Why this table join type in the sql is ALL?

I'm trying to query the students whose course 1 score is better than their course 2 score. Here is my SQL:
SELECT sc1.score C1Score, sc2.score C2Score,s.SId ID, s.Sname Name, s.Sage Age, s.Ssex Sex
FROM sc sc1 INNER JOIN sc sc2
ON sc1.SId = sc2.SId
AND sc1.CId = '01'
AND sc2.CId = '02'
AND sc1.score > sc2.score
INNER JOIN student s
ON s.SId = sc1.SId
;
And the sql explain result:
The join type for table student is ALL. Why is a full table scan done when I specified ON s.SId = sc1.SId?
DB structure:
If a database searches for a row with the help of an index, it follows these steps:
Search in the index for the row ("index seek")
Use the information in the index to look up the row in the table ("lookup")
These steps are very fast for a small number of rows. But they add up if you have to do them often. For a query that returns all rows in a table this method degenerates into very long runtimes.
Just reading the full table ("table scan") is the opposite: it's fast for a large number of rows. You don't have to search and you don't have to do lookups.
As a rule of thumb, when a database expects a join to require more than 100 lookups, it prefers the full table scan.
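To see the seek-versus-scan choice for yourself, here is a small sketch using Python's built-in sqlite3 module. It is only an analogy: SQLite's EXPLAIN QUERY PLAN stands in for MySQL's EXPLAIN, its optimizer makes its own decisions, and the table contents, the index name and the plan helper are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sid TEXT, sname TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?)",
    [(f"{i:02d}", f"name{i}") for i in range(1, 51)],
)

def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

lookup = "SELECT sname FROM student WHERE sid = '07'"

# No index yet: the only option is to read the whole table.
before = plan(lookup)

# With an index on sid, a single-row lookup becomes seek + lookup.
conn.execute("CREATE INDEX student_sid ON student (sid)")
after = plan(lookup)

# When every row is wanted anyway, the table is simply scanned.
everything = plan("SELECT sname FROM student")

print(before, after, everything, sep="\n")
```

The exact wording of the plan lines differs between SQLite versions, but the first one starts with SCAN, the second with SEARCH and the third with SCAN again.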

my sql query keeps on fetching, is it the code or something else?

I try to connect three tables through the following code:
SELECT *
FROM tickets t
JOIN evenementen e
ON e.idEvenement = t.fk_tiEvenementID
JOIN klanttyperesult k
ON k.kltr_idKlant = t.fk_tiKlantID;
Is there a problem with the code or should I look for the problem elsewhere?
By "keeps on fetching" I assume you mean it hangs; chances are you're selecting a lot of data. To break down what your query is doing, and therefore why it would be taking a long time:
SELECT * - Select every single column from every table this query references. This means that you are returning a lot of data: if each table has 7 columns, you'll be returning 21 columns' worth of data.
FROM tickets t - From the table tickets with an alias of t
JOIN evenementen e - Join the table evenementen with alias e, dropping rows that have no match under the join condition.
ON e.idEvenement = t.fk_tiEvenementID - On the condition given
JOIN klanttyperesult k - Join the table klanttyperesult with alias k, again dropping rows that have no match under the join condition.
ON k.kltr_idKlant = t.fk_tiKlantID; - On the condition given.
If these tables have lots of rows, then you're going to very quickly rack up into expensive join land.
You also might be lacking indexes. MySQL executes these joins as nested loops (see https://dev.mysql.com/doc/refman/5.7/en/nested-loop-joins.html), and without an index on the join columns each pass of the loop has to scan the joined table, which can be inefficient for large data sets. Try adding some indexes to the tables like this:
CREATE INDEX tickets_evenementenIdIdx ON tickets (fk_tiEvenementID);
CREATE INDEX tickets_klanttyperesultIdIdx ON tickets (fk_tiKlantID);
CREATE INDEX evenementen_ticketsIdIdx ON evenementen (idEvenement);
CREATE INDEX klanttyperesultIdIdx ON klanttyperesult (kltr_idKlant);

MySQL Query Performance with a Derived Query

I am looking at a few queries for performance and made a change to a query, which is based on the following examples. The change turned a 6 minute query into one which completes in a few seconds, and I was wondering why. How has this altered things to such an extent?
In the example, please assume the BOOK table to contain the general details for all books in a library and the FORMATS table contains details, such as HARDBACK, PAPERBACK and eBOOK (allowing for new formats to be added) where there is a key (called FORMATID) linking the two tables.
Query executes in 6 minutes
select b.bookid, f.formatname
from book b
inner join formats f on f.formatid = b.formatid
select b.bookid, f.formatname
from book b
left join formats f on f.formatid = b.formatid
Query executes in 12 seconds
select b.bookid, (select f.formatname from formats f where f.formatid = b.formatid)
from book b
where b.formatid is not null
select b.bookid, (select f.formatname from formats f where f.formatid = b.formatid)
from book b
In the above, the first query of each pair achieves INNER JOIN results and the second achieves LEFT JOIN results. The result sizes on my database are 295166 and 295376 rows; the time differences remain pretty much the same.
[added] For confirmation; I have tested this (with the same results) by creating the two test tables mentioned herein, populating the BOOKS table with ~1 million rows and NOT applying any index or other optimisation.
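For anyone wanting to check that the rewritten queries really return the same rows as the joins, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The three sample books and two formats are invented, and the equivalence relies on two assumptions: formatid is unique in formats, and every non-null book.formatid has a matching formats row. It says nothing about timing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE formats (formatid INTEGER PRIMARY KEY, formatname TEXT);
CREATE TABLE book (bookid INTEGER PRIMARY KEY, formatid INTEGER);
INSERT INTO formats VALUES (1, 'HARDBACK'), (2, 'PAPERBACK');
INSERT INTO book VALUES (10, 1), (11, 2), (12, NULL);
""")

def rows(sql):
    """Run a query and return its rows ordered by the first column."""
    return conn.execute(sql + " ORDER BY 1").fetchall()

inner_join = rows("""
SELECT b.bookid, f.formatname
FROM book b INNER JOIN formats f ON f.formatid = b.formatid""")

left_join = rows("""
SELECT b.bookid, f.formatname
FROM book b LEFT JOIN formats f ON f.formatid = b.formatid""")

# Correlated subquery in the select list, restricted to books with a format.
subquery_not_null = rows("""
SELECT b.bookid,
       (SELECT f.formatname FROM formats f WHERE f.formatid = b.formatid)
FROM book b WHERE b.formatid IS NOT NULL""")

# Same subquery without the restriction: unmatched books get NULL.
subquery_all = rows("""
SELECT b.bookid,
       (SELECT f.formatname FROM formats f WHERE f.formatid = b.formatid)
FROM book b""")

print(inner_join == subquery_not_null, left_join == subquery_all)
```

If a book pointed at a formatid that does not exist in formats, the first pair would stop matching: the inner join would drop that book while the subquery version would return it with a NULL name.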

INNER JOIN execution/evaluation order

I was wondering how exactly inner joins work in MySQL.
If I do
SELECT * FROM A a
INNER JOIN B b ON a.row = b.row
INNER JOIN C c ON c.row2 = b.row2
WHERE name='Paul';
Does it do the joins first, then pick the rows where name = 'Paul'? Because when I do it this way, it is SUPER DUPER slow.
Is there a way to do something along these lines:
SELECT * FROM (A a WHERE name='paul')
INNER JOIN B b ON a.row = b.row
INNER JOIN C c ON c.row2 = b.row2
When I try it that way, I just get an error.
Or alternatively, is it better to just have 3 separate queries, one each for A, B and C? Example:
string query1 = "SELECT * FROM A WHERE name = 'paul'";
//send query, get data reader
string query2 = "SELECT * FROM b WHERE b = " + query1.b;
//send query, get data reader
string query3 = "SELECT * FROM C WHERE c = " + query1.c;
//send query, get data reader
Obviously this is just pseudo code, but I think it illustrates the point.
Which way is faster/recommended?
Edit
Table structure:
**tblTimesheet**
int timesheetID (primary key)
datetime date
varchar username
int projectID
string description
float hours
**tblProjects**
int projectID (primary key)
string project name
int clientID
**tblClients**
int clientID
string clientName
The join that I want is:
select * from tblTimesheet time
INNER JOIN tblProject proj on time.projectID = proj.projectID
INNER JOIN tblClient client on proj.clientID = client.clientID
WHERE username = 'paul';
something like that
You are probably missing an index on a key table; you can use the MySQL EXPLAIN keyword to help find out where your query is slow.
To answer another part of your question:
is there a way to do something along the lines:
SELECT * FROM (A a WHERE name='paul')
INNER JOIN B b ON a.row = b.row
INNER JOIN C c ON c.row2 = b.row2
You can use a subquery:
SELECT *
FROM (SELECT * FROM tblTimesheet WHERE username = 'Paul') AS time
INNER JOIN tblProject proj on time.projectID = proj.projectID
INNER JOIN tblClient client on proj.clientID = client.clientID
What this query is effectively doing is prefiltering the rows the JOIN will operate on. Rather than joining all the rows together and then filtering the result down, it first restricts tblTimesheet to the rows where the username is 'Paul' and joins only those.
However, the query optimizer should already be doing this so this query should perform similarly to your original query.
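A quick way to convince yourself that the two forms are interchangeable is to run both against a small sample. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL; the sample rows, the alias t and the index name are assumptions, and it only shows that the results match, not how either engine plans the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblClient (clientID INTEGER PRIMARY KEY, clientName TEXT);
CREATE TABLE tblProject (projectID INTEGER PRIMARY KEY, projectName TEXT,
                         clientID INTEGER);
CREATE TABLE tblTimesheet (timesheetID INTEGER PRIMARY KEY, username TEXT,
                           projectID INTEGER, hours REAL);
-- Index the column used in the WHERE clause, as the answers suggest.
CREATE INDEX timesheet_username ON tblTimesheet (username);
INSERT INTO tblClient VALUES (1, 'Acme');
INSERT INTO tblProject VALUES (7, 'Website', 1);
INSERT INTO tblTimesheet VALUES (100, 'paul', 7, 2.5), (101, 'anna', 7, 4.0);
""")

# Form 1: join everything, filter in the WHERE clause.
where_version = conn.execute("""
SELECT t.timesheetID, proj.projectName, client.clientName
FROM tblTimesheet t
INNER JOIN tblProject proj ON t.projectID = proj.projectID
INNER JOIN tblClient client ON proj.clientID = client.clientID
WHERE t.username = ?
""", ("paul",)).fetchall()

# Form 2: prefilter tblTimesheet in a subquery, then join.
subquery_version = conn.execute("""
SELECT t.timesheetID, proj.projectName, client.clientName
FROM (SELECT * FROM tblTimesheet WHERE username = ?) AS t
INNER JOIN tblProject proj ON t.projectID = proj.projectID
INNER JOIN tblClient client ON proj.clientID = client.clientID
""", ("paul",)).fetchall()

print(where_version == subquery_version)
```

Both queries return the single timesheet row for paul, which is why the plain WHERE form is usually the one to keep.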
For more help with indexes, the understanding of which will aid you greatly in database development, start by looking at a tutorial like this one.
It's fantastically unlikely a join will be slower than three database hits. Reordering the clauses shouldn't have an impact either if MySQL's query optimizer is at all competent. Are the columns in the WHERE / ON clauses indexed?
I think you'll find the query optimiser will give you the best possible query most of the time. You need to look at the execution plan to find out why the query is slow - my guess is lack of indexes.
When MySql looks in these tables, it will usually do it in the best way to get the best speed - a simple join as you've illustrated won't confuse the query optimiser, but missing indexes can cause the database engine to scan tables instead of looking up values (i.e. it needs to walk through the table row by row to match the criteria you specified)
An index lets the engine find the matching rows without walking through every row of the table, and will usually speed up queries.
What's the table structure here or is this all hypothetical?
The general rule of thumb with SQL is - try it and see!
Use your first query; the MySQL query optimizer should pick the fastest strategy.
If you want it to be faster, make sure that there is an index on the name column.

MySQL JOIN tables with WHERE clause

I need to gather posts from two mysql tables that have different columns and provide a WHERE clause to each set of tables. I appreciate the help, thanks in advance.
This is what I have tried...
SELECT
blabbing.id,
blabbing.mem_id,
blabbing.the_blab,
blabbing.blab_date,
blabbing.blab_type,
blabbing.device,
blabbing.fromid,
team_blabbing.team_id
FROM
blabbing
LEFT OUTER JOIN
team_blabbing
ON team_blabbing.id = blabbing.id
WHERE
team_id IN ($team_array) ||
mem_id='$id' ||
fromid='$logOptions_id'
ORDER BY
blab_date DESC
LIMIT 20
I know that this is messy, but I'll admit, I am no MySQL veteran. I'm a beginner at best... Any suggestions?
You could put the where-clauses in subqueries:
select
*
from
(select * from ... where ...) as alias1 -- this is a subquery
left outer join
(select * from ... where ...) as alias2 -- this is also a subquery
on
....
order by
....
Note that MySQL versions before 5.7.7 don't allow subqueries like this in the FROM clause of a view definition.
You could also combine the where-clauses, as in your example. Use table aliases to distinguish between columns of different tables (it's a good idea to use aliases even when you don't have to, just because it makes things easier to read). Example:
select
*
from
<table> as alias1
left outer join
<othertable> as alias2
on
....
where
alias1.id = ... and alias2.id = ... -- aliases distinguish between ids!!
order by
....
Two suggestions for you, since you are relatively new to SQL. Use aliases for your tables to help reduce SuperLongTableNameReferencesForColumns, and always qualify the column names in a query. It makes your life easier, and it helps anyone who comes after you know which columns come from which table, especially when the same column name exists in different tables. It also prevents ambiguity in the query. Your left join, judging from the sample, may be ambiguous: can you confirm that B.ID really joins to TB.ID? Typically a "Team_ID" would appear once in a teams table, and each blabbing entry would carry the "Team_ID" the posting came from, in addition to its own "ID" as the blabbing table's unique key.
SELECT
B.id,
B.mem_id,
B.the_blab,
B.blab_date,
B.blab_type,
B.device,
B.fromid,
TB.team_id
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
WHERE
TB.Team_ID IN ( you can't do a direct $team_array here )
OR B.mem_id = SomeParameter
OR b.FromID = AnotherParameter
ORDER BY
B.blab_date DESC
LIMIT 20
Where you were trying to use $team_array, you would have to build out the full list of values, such as
TB.Team_ID IN ( 1, 4, 18, 23, 58 )
Also, prefer the SQL keyword OR to "||": MySQL accepts || as OR by default, but standard SQL treats it as string concatenation.
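Building out the list does not have to mean pasting values into the SQL string. A minimal sketch, using Python's built-in sqlite3 module as a stand-in for PHP and MySQL: generate one placeholder per team id and pass the values separately. The team_filter helper, the cut-down table layouts and the sample rows are hypothetical; sqlite3 uses ? as its placeholder, and the marker your own driver expects may differ.

```python
import sqlite3

def team_filter(team_ids):
    """Return (sql_fragment, params) for a parameterized IN list."""
    if not team_ids:
        # An empty IN () is not valid in MySQL, so match nothing instead.
        return "1 = 0", []
    placeholders = ", ".join("?" for _ in team_ids)
    return f"TB.team_id IN ({placeholders})", list(team_ids)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blabbing (id INTEGER PRIMARY KEY, mem_id INTEGER,
                       fromid INTEGER, blab_date TEXT);
CREATE TABLE team_blabbing (id INTEGER PRIMARY KEY, team_id INTEGER);
INSERT INTO blabbing VALUES (1, 5, 9, '2012-01-01'),
                            (2, 6, 9, '2012-01-02'),
                            (3, 7, 8, '2012-01-03');
INSERT INTO team_blabbing VALUES (2, 4), (3, 99);
""")

clause, params = team_filter([1, 4, 18])
sql = f"""
SELECT B.id
FROM blabbing B
LEFT JOIN team_blabbing TB ON B.id = TB.id
WHERE {clause} OR B.mem_id = ? OR B.fromid = ?
ORDER BY B.blab_date DESC
LIMIT 20
"""
# Team ids first, then the member id and the from id, in placeholder order.
ids = [row[0] for row in conn.execute(sql, params + [5, 0])]
print(clause, ids)
```

Only the placeholders are interpolated into the SQL text; the ids themselves travel as parameters, which also avoids the injection risk of dropping $team_array straight into the string.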
EDIT -- per your comment
This could be done in a variety of ways: building and executing dynamic SQL, calling the query once for each ID and merging the results, or joining to yet another temp table that gets cleaned out, say, daily.
If you have another table such as "TeamJoins" with, say, 3 columns (a date, a sessionid and a team_id), you could purge anything more than a day old daily, and/or clear the rows for a session ID each time that session runs a new query (as it appears to be coming from PHP). Have two indexes, one on the date (to simplify any daily purging) and a second on (sessionID, team_id) for the join.
Then, loop through the team IDs and insert each one into the "TeamJoins" table.
THEN, instead of a hard-coded IN list, you could change that part to
...
FROM
blabbing B
LEFT JOIN team_blabbing TB
ON B.ID = TB.ID
LEFT JOIN TeamJoins TJ
on TB.Team_ID = TJ.Team_ID
WHERE
TJ.Team_ID IS NOT NULL
OR B.mem_id ... rest of query
What I ended up doing is this:
I added an extra column called team_id to my blabbing table and set it to null, and added another column called mem_id to my team_blabbing table.
Then I changed the insert script to also insert a value into mem_id in team_blabbing.
After doing this I did a simple UNION ALL in the query:
SELECT
*
FROM
blabbing
WHERE
mem_id='$id' OR
fromid='$logOptions_id'
UNION ALL
SELECT
*
FROM
team_blabbing
WHERE
team_id
IN
($team_array)
ORDER BY
blab_date DESC
LIMIT 20
I am open to any thought on what I did. Try not to be too harsh though:) Thanks again for all the info.
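One thought on the UNION ALL approach, sketched with Python's built-in sqlite3 module as a stand-in for MySQL: UNION ALL matches columns by position, not by name, so SELECT * only lines up if both tables declare their columns in the same order. Listing the columns explicitly avoids that. The cut-down table layouts and sample rows below are assumptions; the two tables deliberately declare mem_id and team_id in a different order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blabbing (id INTEGER, mem_id INTEGER, team_id INTEGER,
                       the_blab TEXT, blab_date TEXT);
CREATE TABLE team_blabbing (id INTEGER, team_id INTEGER, mem_id INTEGER,
                            the_blab TEXT, blab_date TEXT);
INSERT INTO blabbing VALUES (1, 5, NULL, 'own post', '2012-01-01');
INSERT INTO team_blabbing VALUES (1, 4, 5, 'team post', '2012-01-02');
""")

# Explicit column lists keep both halves aligned; the trailing
# ORDER BY and LIMIT apply to the combined result, not to the second half.
rows = conn.execute("""
SELECT id, mem_id, team_id, the_blab, blab_date
FROM blabbing
WHERE mem_id = ?
UNION ALL
SELECT id, mem_id, team_id, the_blab, blab_date
FROM team_blabbing
WHERE team_id IN (?)
ORDER BY blab_date DESC
LIMIT 20
""", (5, 4)).fetchall()
print(rows)
```

With SELECT * on these two layouts, the team post would come back with mem_id and team_id swapped.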