We are working with large data volumes (row counts given below):
Table 1: 708,408,568 rows -- 708 million
Table 2: 1,416,817,136 rows -- 1.4 billion
Table 1 Schema:
----------------
ID - Int PK
column2 - Int
Table 2 Schema:
----------------
Table1ID - Int FK
SomeColumn1 - Int
SomeColumn2 - Int
Table1's PK (ID) serves as the FK for Table 2.
Index details:
Table1:
PK Clustered Index on ID
Non Clustered (Non Unique) on column2
Table 2:
Clustered Index on Table1ID (FK)
Below is the query which needs to be executed:
SELECT t1.[id]
,t1.[column2]
FROM Table1 t1
inner join Table2 t2
on t1.id = t2.Table1ID
WHERE t1.[column2] in (select [id] from ConvertCsvToTable('1,2,3,4,5.......10000')) -- 10,000 comma-separated Ids
So to summarize: the inner join on ID should be handled by the clustered indexes on the same Ids on both the PK and the FK,
and for the "huge" WHERE condition on column2 we have a nonclustered index.
However, the query is taking 4 minutes for a small subset of just 100 Ids, and we need to pass 10,000.
Is there a better way, design-wise, that we can do this, or would table partitioning possibly help?
I just wanted to get some approaches for solving a huge-volume SELECT with an INNER JOIN and a WHERE ... IN.
Note: ConvertCsvToTable is a split function which has already been determined to perform optimally.
Thanks!
This is what I would try:
Create a temp table with the structure of the function's return. Make sure to set the ID column as the primary key so that the optimizer takes it into consideration...
CREATE TABLE #temp
(id int not null
...
,PRIMARY KEY (id) )
then call the function
insert into #temp select [id] from ConvertCsvToTable('1,2,3,4,5.......10000')
then use the temp table directly joined in the query
SELECT t1.[id], t1.[column2]
FROM Table1 t1, Table2 t2, #temp
where t1.id = t2.Table1ID
and t1.[column2] = #temp.id
Bring the condition into the join.
It gives the optimizer a chance to filter by t1.[column2] first.
Try different hash hints
SELECT t1.[id], t1.[column2]
FROM Table1 t1 with (nolock)
inner join Table2 t2 with (nolock)
on t1.id = t2.Table1ID
and t1.[column2] in (select [id] from ConvertCsvToTable('1,2,3,4,5.......10000'))
You may need to tell it to use that index on column2, but give it a chance to do the right thing first.
With the condition buried in the WHERE clause you were not giving it that chance.
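For example, a hedged sketch of forcing the nonclustered index (the index name IX_Table1_column2 is an assumption; use whatever your index on column2 is actually called):
SELECT t1.[id], t1.[column2]
FROM Table1 t1 with (nolock, INDEX(IX_Table1_column2)) -- hypothetical index name on column2
inner join Table2 t2 with (nolock)
on t1.id = t2.Table1ID
and t1.[column2] in (select [id] from ConvertCsvToTable('1,2,3,4,5.......10000'))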
If you go with #temp then try the following (and declare a PK on the temp table as Rodolfo stated, +1).
This will pretty much force it to start with the small table.
It could still get stupid and do the join on T2 first, but I doubt it.
SELECT t1.[id], t1.[column2]
FROM #temp
JOIN Table1 t1 with (nolock)
on t1.[column2] = #temp.ID
join Table2 t2 with (nolock)
on t2.Table1ID = t1.ID
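If it still starts with the big tables, join hints are one more thing to experiment with; a hedged sketch (FORCE ORDER and HASH JOIN are standard T-SQL query hints, but hints can easily make things worse, so always check the actual plan):
SELECT t1.[id], t1.[column2]
FROM #temp
JOIN Table1 t1 with (nolock)
on t1.[column2] = #temp.id
join Table2 t2 with (nolock)
on t2.Table1ID = t1.id
OPTION (FORCE ORDER) -- or OPTION (HASH JOIN) to compare hash vs. nested loop joins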
Related
The execution of the following MySQL query often takes 2-3 minutes. The objective is to select records from a table where two of its columns' values are also contained in another, previously created temporary table. That temporary table has only one column; it is created instead of using two subqueries, because four tables need to be joined in order to get the values.
The temporary table generally holds around 40,000 records, the values are of type varchar(32) COLLATE 'utf8mb4_bin', and the table1 table has 45,000 records.
table1
a | varchar(32)
b | varchar(32)
temp
name | varchar(32)
CREATE TEMPORARY TABLE IF NOT EXISTS temp AS SELECT name FROM names ...;
SELECT a, b
FROM table1
WHERE a IN (SELECT name FROM temp)
AND b IN (SELECT name FROM temp);
The a and b columns of table1 are indexed.
How can I improve the execution speed? Is there a more efficient way of doing this?
Add an index to the temp table:
ALTER TABLE temp ADD INDEX (name);
Also use JOIN rather than IN. MySQL generally optimizes this better.
SELECT DISTINCT a, b
FROM table1 AS t1
JOIN temp AS t2 ON t1.a = t2.name
JOIN temp AS t3 ON t1.b = t3.name
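The DISTINCT is there because temp may contain duplicate names; if it turns out to be expensive, an EXISTS version avoids the need for it. A sketch against the same schema:
SELECT a, b
FROM table1 AS t1
WHERE EXISTS (SELECT 1 FROM temp AS t2 WHERE t2.name = t1.a)
  AND EXISTS (SELECT 1 FROM temp AS t3 WHERE t3.name = t1.b);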
I have three tables with contents. Now I want to get them and add them into a new table, but I am getting this SQL error: "Column count doesn't match value count at row 1".
Here is the SQL query:
insert into compare_year(yeara,yearb,yearc,data)
SELECT yeara
FROM table_1
UNION ALL
SELECT yearb, data
FROM table_2
UNION ALL
SELECT yearc
FROM table_3
Below is how I created the tables:
create table table_1(id int primary key auto_increment,yeara varchar(100));
create table table_2(id int primary key auto_increment,yearb varchar(100),data varchar(100));
create table table_3(id int primary key auto_increment,yearc varchar(100));
My new table is:
create table compare_year(id int primary key auto_increment,yeara varchar(100),yearb varchar(100),yearc varchar(100),data varchar(100))
Please, can someone help me? Thanks.
Note: when you UNION select queries, the number of columns should be equal.
Also, you cannot insert multiple selected columns into a single row of another table this way.
My solution would be:
If the three tables contain the same ids, then you can do it like this:
insert into compare_year(yeara,yearb,yearc,data)
SELECT T1.yeara,T2.yearb,T3.yearc,T2.data
FROM table_1 T1
left Join table_2 T2 on T2.Id = T1.Id
left Join table_3 T3 on T3.Id = T2.Id
It looks like what you want is a JOIN rather than a UNION. When you union two select statements, they must have the same number of fields in the SELECT. For example,
insert into compare_year(yeara)
SELECT yeara
FROM table_1
UNION ALL
SELECT yearb AS yeara
FROM table_2
UNION ALL
SELECT yearc AS yeara
FROM table_3
would be acceptable syntactically. If you want to join the tables,
INSERT INTO compare_year(yeara, yearb, yearc, data)
SELECT table_1.yeara, table_2.yearb, table_3.yearc, table_2.data
FROM table_1, table_2, table_3
but note that this is the full Cartesian product of the tables. You likely want some conditions in a WHERE clause as well, as sketched below. It's also worth noting that the order of the SELECT clause is what matters for the INSERT, not the field names.
I have:
simple_table
|- first_id
|- second_id
SELECT * FROM table t1 JOIN table t2
ON [many many conditions]
AND t1.id IN (SELECT first_id FROM simple_table)
AND t2.id = (
    SELECT second_id FROM simple_table WHERE t1.id = first_id -- 4th line, can return NULL
)
Questions:
How do I handle the situation where the subquery on the 4th line returns NULL?
Can I use the t1 and t2 aliases inside the subqueries?
Updated [extra explanation]
I have a very big table. I need to iterate through the table and check some conditions. simple_table provides the ids of the table entities whose conditions I should check. I mean:
simple_table
first_id second_id
11 128
table
id <other_fields>
................
11 <other_data>
...............
128 <other_data>
So, I should check whether those two entities in the table satisfy the right conditions relative to one another.
The question is unclear, but given the update the query should work better if there is an index on the ID of the big table (probably it's there already as the PK).
As the condition seems to be on the same table, the easiest query would be:
SELECT ...
FROM bigtable t1
INNER JOIN simple_table st ON t1.ID IN (st.first_id, st.second_id)
or
SELECT ...
FROM bigtable t1
INNER JOIN simple_table st ON t1.ID = st.first_id
INNER JOIN bigtable t2 ON st.second_id = t2.ID
to get the two rows from bigtable on the same row of the result.
The second query will make the checks easier to write; the first will be faster but will most probably need a GROUP BY to return the wanted results.
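A sketch of what that GROUP BY version might look like, assuming the goal is one result row per (first_id, second_id) pair with a count of the matched rows (the real columns to check are still undisclosed, so treat this only as a shape):
SELECT st.first_id, st.second_id, COUNT(*) AS matched_rows
FROM bigtable t1
INNER JOIN simple_table st ON t1.ID IN (st.first_id, st.second_id)
GROUP BY st.first_id, st.second_id;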
Some performance tests on the OP's machine are needed to find the fastest one.
In case one of the IDs in simple_table is NULL, only the other will be considered; the code will have to check for that.
You can use the table aliases in the subqueries, and you'll need to, as you'll probably have the same table in the subqueries.
The relative conditions to check are still undisclosed by the OP, so that's all I can help with.
IN operator alternative in MySQL
I have nearly 25,000 ids and I am using the IN operator on them. I am getting a StackOverflow exception. Is there any other alternative to the IN operator in MySQL?
Thanks in advance.
If the IDs are in another table:
SELECT * FROM table1 WHERE id IN (SELECT id FROM table2);
then you can use a join instead:
SELECT table1.* FROM table1 INNER JOIN table2 ON table1.id = table2.id;
You could do the following:
1 - Create a MySQL Temporary Table
CREATE TEMPORARY TABLE tempIdTable (id int unsigned not null primary key);
2 - Insert All Your ids into the Temporary Table
For every id in your list:
insert ignore into tempIdTable (id) values (anId);
(this will have the added bonus of de-duplicating your list of ids ready for the final step)
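Rather than one statement per id, the inserts can also be batched; a sketch with placeholder values:
insert ignore into tempIdTable (id) values (1), (2), (3); -- extend the VALUES list, e.g. a few thousand ids per statement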
3 - Join Against the Temporary Table
SELECT t1.* FROM table1 t1 INNER JOIN tempIdTable tt ON t1.id = tt.id;
The temporary table will disappear as soon as your connection is dropped, so you don't have to worry about dropping it before you create it next time.
table1_shard1 (1,000,000 rows per shard x 120 shards)
id_user hash
table2 (100,000 rows)
value hash
Desired Output:
id_user hash value
I am trying to find the fastest way to associate id_user with value from the tables above.
My current query ran for 30 hours without result.
SELECT
table1_shard1.id_user, table1_shard1.hash, table2.value
FROM table1_shard1
LEFT JOIN table2 ON table1_shard1.hash=table2.hash
GROUP BY id_user
UNION
SELECT
table1_shard2.id_user, table1_shard2.hash, table2.value
FROM table1_shard2
LEFT JOIN table2 ON table1_shard2.hash=table2.hash
GROUP BY id_user
UNION
( ... )
UNION
SELECT
table1_shard120.id_user, table1_shard120.hash, table2.value
FROM table1_shard120
LEFT JOIN table2 ON table1_shard120.hash=table2.hash
GROUP BY id_user
Firstly, do you have indexes on the hash fields?
I think you should merge your tables into one before the query (at least temporarily):
CREATE TEMPORARY TABLE IF NOT EXISTS tmp_shards
SELECT * FROM table1_shard1;
INSERT INTO tmp_shards
SELECT * FROM table1_shard2;
# ... and so on for the remaining shards
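Since the join is on hash, it is probably worth indexing the merged temporary table too; a minimal sketch (the index name is arbitrary):
ALTER TABLE tmp_shards ADD INDEX idx_hash (hash);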
Then do the main query
SELECT
shd.id_user
, shd.hash
, tb2.value
FROM tmp_shards AS shd
LEFT JOIN table2 AS tb2 ON (shd.hash = tb2.hash)
GROUP BY id_user
;
Not sure about the performance gain, but it'll at least be more maintainable.