Inserting millions of records with deduplication SQL - mysql

This is a theoretical scenario, and I am no more than an amateur when it comes to large-scale SQL databases...
How would I go about inserting around 2 million records into an existing database of 6 million records (table1 into table2), while at the same time de-duplicating on email (some subscribers may already exist in site2, and we don't want to insert those that already exist)?
I understand how to simply get the records from site 1 and add them into site 2, but how would we do this on such a large scale without causing data duplication? Any reading sources would be more than helpful, as I've found them hard to come by.
i.e.:
Table 1: site1Subscribers
site1Subscribers(subID, subName, subEmail, subDob, subRegDate, subEmailListNum, subThirdParties)
Table 2: site2Subscribers
site2Subscribers(subID, subName, subEmail, subDob, subRegDate, subEmailListNum, subThirdParties)

I would try something like this:
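insert into site2Subscribers
select s1.* from site1Subscribers s1
left outer join site2Subscribers s2
on s1.subEmail = s2.subEmail
where s2.subEmail is null;
The left outer join along with the null check will return only those rows from site1Subscribers that have no matching entry in site2Subscribers. Selecting s1.* rather than * keeps the column list identical to that of the target table, since a plain * would also return the (all-null) columns of s2.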
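To make the anti-join pattern above concrete, here is a minimal sketch that runs it against a tiny in-memory database. It assumes SQLite (via Python's sqlite3 module) as a stand-in for MySQL, a cut-down three-column version of the subscriber tables, and made-up sample rows; the helper name copy_new_subscribers is hypothetical.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL, and the tables are trimmed to three columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE site1Subscribers (subID INTEGER, subName TEXT, subEmail TEXT);
    CREATE TABLE site2Subscribers (subID INTEGER, subName TEXT, subEmail TEXT);
    INSERT INTO site1Subscribers VALUES
        (1, 'Ann', 'ann@example.com'),
        (2, 'Bob', 'bob@example.com'),
        (3, 'Cy',  'cy@example.com');
    INSERT INTO site2Subscribers VALUES
        (10, 'Bob', 'bob@example.com'),
        (11, 'Dee', 'dee@example.com');
""")


def copy_new_subscribers(conn):
    """Copy site1 rows whose email has no match in site2 (anti-join)."""
    conn.execute("""
        INSERT INTO site2Subscribers (subID, subName, subEmail)
        SELECT s1.subID, s1.subName, s1.subEmail
        FROM site1Subscribers s1
        LEFT OUTER JOIN site2Subscribers s2 ON s1.subEmail = s2.subEmail
        WHERE s2.subEmail IS NULL
    """)
    conn.commit()


copy_new_subscribers(conn)
copy_new_subscribers(conn)  # a second run finds nothing new to copy

emails = [row[0] for row in conn.execute(
    "SELECT subEmail FROM site2Subscribers ORDER BY subEmail")]
print(emails)
```

On tables with millions of rows, an index on subEmail in both tables is what keeps the join from degenerating into repeated full scans.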

Related

Optimising a SQL query with a huge where clause

I am working on a system (with Laravel) where users can fill in a few filters to get the data they need.
The data is not prepared in real time: once the filters are set, a job is pushed to the queue, and once the query finishes a CSV file is created. The user then receives an email with the file so that they can download it.
I have seen errors where a single job took longer than 30 minutes to process, and when I checked I saw that some users had created filters with more than 600 values.
These filter values are translated like this:
SELECT filed1,
field2,
field6
FROM table
INNER JOIN table2
ON table.id = table2.cid
/* this is how we try not to give same data to the users again so we used NOT IN */
WHERE table.id NOT IN(SELECT data_id
FROM data_access
WHERE data_user = 26)
AND ( /* this bit is auto populated with the filter values */
table2.filed_a = 'text a'
OR table2.filed_a = 'text b'
OR table2.filed_a = 'text c' )
I was not expecting users to go wild and fine-tune with a huge filter set. It is okay for them to do this, but I need a solution to make this query quicker.
One way is to create a temp table on the fly with the filter values and convert the query to an INNER JOIN, but I am not sure whether it would improve performance.
Also, on a normal day the system would need to create at least 40 or so temp tables and delete them afterwards. Would this become another issue in the long run?
I would love to hear any other suggestions, other than the temp table method, that may help me solve this issue.
I would suggest writing the query like this:
SELECT ?.filed1, ?.field2, ?.field6 -- qualify column names (but no effect on performance)
FROM table t JOIN
table2 t2
ON t.id = t2.cid
WHERE NOT EXISTS (SELECT 1
FROM data_access da
WHERE t.id = da.data_id AND da.data_user = 26
) AND
t2.filed_a IN ('text a', 'text b', 'text c') ;
Then I would recommend indexes. Most likely:
table2(filed_a, cid)
table(id) (may not be necessary if id is already the primary key)
data_access(data_id, data_user)
You can test this as your own query. I don't know how to get Laravel to produce this (assuming it meets your performance objectives).
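As a rough illustration of the rewritten query, the sketch below builds the IN list from bound parameters and combines it with NOT EXISTS. It assumes SQLite (via Python's sqlite3) in place of MySQL, renames table to main_table because table is a reserved word, and uses made-up rows; fetch_new_rows is a hypothetical helper, not anything Laravel provides.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; main_table is a stand-in name for "table".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE main_table (id INTEGER PRIMARY KEY, filed1 TEXT);
    CREATE TABLE table2 (cid INTEGER, filed_a TEXT);
    CREATE TABLE data_access (data_id INTEGER, data_user INTEGER);
    CREATE INDEX idx_table2_filter ON table2 (filed_a, cid);
    CREATE INDEX idx_data_access ON data_access (data_id, data_user);
    INSERT INTO main_table VALUES
        (1, 'row one'), (2, 'row two'), (3, 'row three'), (4, 'row four');
    INSERT INTO table2 VALUES
        (1, 'text a'), (2, 'text b'), (3, 'text z'), (4, 'text c');
    INSERT INTO data_access VALUES (2, 26), (4, 99);
""")


def fetch_new_rows(conn, user_id, filter_values):
    """Rows matching the filter that this user has not received before."""
    # One "?" per filter value, so the values are bound rather than spliced in.
    placeholders = ", ".join("?" for _ in filter_values)
    sql = f"""
        SELECT t.id, t.filed1
        FROM main_table t
        JOIN table2 t2 ON t.id = t2.cid
        WHERE NOT EXISTS (SELECT 1
                          FROM data_access da
                          WHERE da.data_id = t.id AND da.data_user = ?)
          AND t2.filed_a IN ({placeholders})
        ORDER BY t.id
    """
    return conn.execute(sql, [user_id, *filter_values]).fetchall()


rows = fetch_new_rows(conn, 26, ["text a", "text b", "text c"])
print(rows)
```

Row 2 matches the filter but is skipped because user 26 already has it, while row 4 is kept because only user 99 has received it.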

SQL most efficient way to check if rows from one table are also present in another

I have two DB tables each containing email addresses
One is mssql with 1.500.000.000 entries
One is mysql with 70.000.000 entries
I now want to check how many identical email addresses are present in both tables.
i.e. the same address is present in both tables.
Which approach would be the fastest:
1. Download both datasets as csv, load it into memory and compare in program code
2. Use the DB queries to get the overlapping resultset.
if 2 is better: What would be a suggested SQL query?
I would go with a DBQuery. Set up a linked server connection between the two DBs (probably on the MSSQL side), and use a simple inner join query to produce the list of e-mails that occur in both tables:
select a.emailAddress
from MSDBServ.DB.dbo.Table1 a
join MySqlServ.DB..Table2 b
on a.EmailAddress = b.EmailAddress
Finding the set difference, that's going to take more processor power (and it's going to produce at least 1.4b results in the best-case scenario of every MySql row matching an MSSQL row), but the query isn't actually that much different. You still want a join, but now you want that join to return all records from both tables whether they could be joined or not, and then you specifically want the results that aren't joined (in which case one side's field will be null):
select a.EmailAddress, b.EmailAddress
from MSDBServ.DB.dbo.Table1 a
full join MySqlServ.DB..Table2 b
on a.EmailAddress = b.EmailAddress
where a.EmailAddress IS NULL OR b.EmailAddress IS NULL
You could run a SQL query to check how many identical email addresses are present in both tables. The first column is the number of matches and the second is the email address.
SELECT COUNT(A.emailAddr), A.emailAddr FROM table1 A
INNER JOIN
table2 B
ON A.emailAddr = B.emailAddr
GROUP BY A.emailAddr
Table1 has the 70,000,000 email addresses, table2 has the 1,500,000,000. I use Oracle so the Upper function may or may not have an equivalent in MySQL.
Select EmailAddress from table1 where Upper(emailaddress) in (select Upper(emailaddress) from table2)
Quicker than comparing spreadsheets and this assumes both tables are in the same database.
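The join-based counting discussed in these answers can be tried on a toy data set. This sketch assumes both tables live in one SQLite database (via Python's sqlite3) rather than on linked MSSQL and MySQL servers, and the sample addresses are invented. It compares addresses case-insensitively with UPPER, as in the last answer.

```python
import sqlite3

# Assumption: one SQLite database stands in for the two separate servers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (emailAddr TEXT);
    CREATE TABLE table2 (emailAddr TEXT);
    INSERT INTO table1 VALUES
        ('a@example.com'), ('B@example.com'), ('c@example.com');
    INSERT INTO table2 VALUES
        ('b@example.com'), ('c@example.com'), ('c@example.com'), ('d@example.com');
""")

# Number of distinct addresses present in both tables, ignoring case.
overlap = conn.execute("""
    SELECT COUNT(DISTINCT UPPER(a.emailAddr))
    FROM table1 a
    JOIN table2 b ON UPPER(a.emailAddr) = UPPER(b.emailAddr)
""").fetchone()[0]

# Per-address match counts: an address repeated in table2 counts once per copy.
per_address = conn.execute("""
    SELECT a.emailAddr, COUNT(*)
    FROM table1 a
    JOIN table2 b ON UPPER(a.emailAddr) = UPPER(b.emailAddr)
    GROUP BY a.emailAddr
    ORDER BY a.emailAddr
""").fetchall()

print(overlap, per_address)
```

Wrapping the column in UPPER usually prevents a plain index on the column from being used, so at this scale it may be worth normalising the case once, up front, instead.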

MySQL statement to read data from one table with checks on another table

I have these two tables:
Achievement:
Achieves:
Question:
I want to retrieve rows from table Achievement. But, I do not want all the rows, I want the rows that a specific Steam ID has acquired. Let's take STEAM_0:0:46481449 for example, I want to check first the list of IDs that STEAM_0:0:46481449 has acquired (4th column in Achieves table states whether achievement is acquired or not) and then read only those achievements.
I hope that made sense, if not let me know so I can explain a little better.
I know how to do this with two MySQL statements, but can this be done with a single MySQL statement? That would be awesome if so please tell me :D
EDIT: I will add the two queries below
SELECT * FROM Achieves WHERE Achieves.SteamID = 'STEAM_0:0:46481449' AND Achieves.Acquired = 1;
Then after that I do the following query
SELECT * FROM Achievement;
And then through PHP I would check the IDs that I should take and output those. That's why I wanted to get the same result in 1 query since it's more readable and easier.
In SQL, when you apply conditions on the second table of a LEFT JOIN in the WHERE clause, they filter the result, so rows without a match are dropped and the join type no longer matters:
Select * from achievement
left join achieves on (achievement.id=achieves.id)
where achieves.acquired=1 and achieves.SteamID = 'STEAM_0:0:46481449'
Besides, I suggest not using ID in the achieves table as the shared key between the two tables. Name it something else.
I don't think a left join makes sense here. There is no case where you don't want to see the Achievement table.
Something like this
SELECT *
FROM Achieves A
JOIN Achievement B on A.ID = B.ID
WHERE A.SteamID = 'STEAM_0:0:46481449'
AND A.Acquired = 1;
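To check that the two answers above return the same rows, here is a small sketch. The original table layouts are not shown in the question, so the columns (ID, Name, SteamID, Acquired) and the sample rows are assumptions, and SQLite (via Python's sqlite3) stands in for MySQL.

```python
import sqlite3

# Assumption: column names and data are invented; the question's tables are not shown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Achievement (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Achieves (ID INTEGER, SteamID TEXT, Acquired INTEGER);
    INSERT INTO Achievement VALUES
        (1, 'First Blood'), (2, 'Marathon'), (3, 'Collector');
    INSERT INTO Achieves VALUES
        (1, 'STEAM_0:0:46481449', 1),
        (2, 'STEAM_0:0:46481449', 0),
        (3, 'STEAM_0:0:46481449', 1),
        (2, 'STEAM_0:0:1', 1);
""")

steam_id = "STEAM_0:0:46481449"

# Inner join: only achievements this player has acquired.
inner_rows = conn.execute("""
    SELECT B.ID, B.Name
    FROM Achieves A
    JOIN Achievement B ON A.ID = B.ID
    WHERE A.SteamID = ? AND A.Acquired = 1
    ORDER BY B.ID
""", (steam_id,)).fetchall()

# Left join with WHERE conditions on the right-hand table: unmatched rows
# have NULLs there, fail the WHERE test, and are dropped as well.
left_rows = conn.execute("""
    SELECT Achievement.ID, Achievement.Name
    FROM Achievement
    LEFT JOIN Achieves ON Achievement.ID = Achieves.ID
    WHERE Achieves.Acquired = 1 AND Achieves.SteamID = ?
    ORDER BY Achievement.ID
""", (steam_id,)).fetchall()

print(inner_rows, left_rows)
```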

Database design to enable Multiple tags like Stackoverflow?

I have the following tables.
Articles table
a_id INT primary unique
name VARCHAR
Description VARCHAR
c_id INT
Category table
id INT
cat_name VARCHAR
For now I simply use
SELECT a_id,name,Description,cat_name FROM Articles LEFT JOIN Category ON Articles.c_id=Category.id WHERE c_id={$id}
This gives me all articles which belong to a certain category along with category name.
Each article is having only one category.
I use a sub category in a similar way (I have another table named sub_cat). But every article doesn't necessarily have a sub category. It may belong to multiple categories instead.
I am now thinking of tagging an article with more than one category, just like the questions at Stack Overflow are tagged (e.g. with multiple tags like PHP, MYSQL, SQL etc.). Later I have to display (filter) all articles with certain tags (e.g. tagged with php, or php + MySQL), and I also have to display the tags along with the article name and Description.
Can anyone help me redesign the database? (I am using PHP + MySQL at the back end.)
Create a new table:
CREATE TABLE ArticleCategories(
A_ID INT,
C_ID INT,
Constraint PK_ArticleCategories Primary Key (A_ID, C_ID)
)
(this is the SQL server syntax, may be slightly different for MySQL)
This is called a "Junction Table" or a "Mapping Table" and it is how you express Many-to-Many relationships in SQL. So, whenever you want to add a Category to an Article, just INSERT a row into this table with the IDs of the Article and the Category.
For instance, you can initialize it like this:
INSERT Into ArticleCategories(A_ID,C_ID)
SELECT A_ID,C_ID From Articles
Now you can remove c_id from your Articles table.
To get back all of the Categories for a single Article, you would use a query like this:
SELECT a_id,name,Description,cat_name
FROM Articles
LEFT JOIN ArticleCategories ON Articles.a_id=ArticleCategories.a_id
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id={$a_id}
Alternatively, to return all articles that have a category LIKE a certain string:
SELECT a_id,name,Description
FROM Articles
WHERE EXISTS( Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
)
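(You may have to adjust the last line, as I am not sure how string parameters are passed with MySQL+PHP. Note that in MySQL + is numeric addition, so the pattern would have to be built with CONCAT('%', ..., '%') instead.)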
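The junction-table design described in this answer can be exercised end to end with a small sketch. It assumes SQLite (via Python's sqlite3) instead of MySQL, invented sample rows, and two hypothetical helpers, categories_of and articles_matching. SQLite concatenates strings with ||, where MySQL would use CONCAT.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the sample data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Articles (a_id INTEGER PRIMARY KEY, name TEXT,
                           Description TEXT, c_id INTEGER);
    CREATE TABLE Category (id INTEGER PRIMARY KEY, cat_name TEXT);
    CREATE TABLE ArticleCategories (
        a_id INTEGER,
        c_id INTEGER,
        PRIMARY KEY (a_id, c_id)
    );
    INSERT INTO Category VALUES (1, 'php'), (2, 'mysql'), (3, 'css');
    INSERT INTO Articles VALUES
        (1, 'PDO basics', 'Using PDO', 1),
        (2, 'Flexbox', 'Layout', 3);
    -- Seed the junction table from the old single-category column.
    INSERT INTO ArticleCategories (a_id, c_id) SELECT a_id, c_id FROM Articles;
    -- Tag article 1 with a second category.
    INSERT INTO ArticleCategories VALUES (1, 2);
""")


def categories_of(conn, article_id):
    """All category names attached to one article."""
    rows = conn.execute("""
        SELECT Category.cat_name
        FROM ArticleCategories
        JOIN Category ON ArticleCategories.c_id = Category.id
        WHERE ArticleCategories.a_id = ?
        ORDER BY Category.cat_name
    """, (article_id,))
    return [row[0] for row in rows]


def articles_matching(conn, match):
    """Articles having at least one category whose name contains `match`."""
    return conn.execute("""
        SELECT Articles.a_id, Articles.name
        FROM Articles
        WHERE EXISTS (SELECT 1
                      FROM ArticleCategories
                      JOIN Category ON ArticleCategories.c_id = Category.id
                      WHERE ArticleCategories.a_id = Articles.a_id
                        AND Category.cat_name LIKE '%' || ? || '%')
        ORDER BY Articles.a_id
    """, (match,)).fetchall()


print(categories_of(conn, 1), articles_matching(conn, "sql"))
```

The composite primary key on (a_id, c_id) also stops the same tag from being attached to an article twice.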
OK RBarryYoung, you asked me for a reference/analysis, so here is one.
This analysis is based on the documentation and on reading the source code of the MySQL server.
INSERT Into ArticleCategories(A_ID,C_ID)
SELECT A_ID,C_ID From Articles
On a large Articles table with many rows, this copy will push one CPU core to 100% load and will create a disk-based temporary table, which slows down overall MySQL performance because the disk is stressed by the copy.
If this is a one-time process that is not so bad, but do the math if you run it every time.
SELECT a_id,name,Description
FROM Articles
WHERE EXISTS( Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
)
Note: don't take the execution times on sqlfiddle at face value. It is a busy server and the times vary too much to draw conclusions, but do look at what View Execution Plan has to say.
See http://sqlfiddle.com/#!2/48817/21 for a demo.
Both queries always trigger a complete table scan on the Articles table plus two DEPENDENT SUBQUERYs, which is not good if you have a large Articles table with many records.
This means the performance depends on the number of Articles rows, even when you only want the articles that are in the category.
Select *
From ArticleCategories
INNER JOIN Category ON ArticleCategories.c_id=Category.id
WHERE Articles.a_id=ArticleCategories.a_id
AND Category.cat_name LIKE '%'+{$match}+'%'
This is the inner subquery, but MySQL cannot run it on its own because it depends on a value from the Articles table. That makes it a correlated subquery: a subquery type that is evaluated once for each row processed by the outer query, which is not good.
There are more ways of rewriting RBarryYoung's query; I will show one.
The INNER JOIN way is much more efficient, even with the LIKE operator.
Note: I have made a habit of starting with the table with the lowest number of records and working my way up. If you start with the Articles table, the execution will be the same, provided the MySQL optimizer chooses the right plan.
SELECT
Articles.a_id
, Articles.name
, Articles.description
FROM
Category
INNER JOIN
ArticleCategories
ON
Category.id = ArticleCategories.c_id
INNER JOIN
Articles
ON
ArticleCategories.a_id = Articles.a_id
WHERE
cat_name LIKE '%php%';
See http://sqlfiddle.com/#!2/43451/23 for a demo. Note that this plan looks worse at first, because it looks like more rows need to be checked.
Note: if the Articles table has a low number of records, RBarryYoung's EXISTS way and the INNER JOIN way will perform more or less the same in terms of execution time. The next two demos give more evidence that the INNER JOIN way scales better when the record count becomes larger:
http://sqlfiddle.com/#!2/c11f3/1 EXISTS: oops, more Articles records need to be checked now (even when they are not linked through the ArticleCategories table), so the query is less efficient now.
http://sqlfiddle.com/#!2/7aa74/8 INNER JOIN: same explain plan as the first demo.
Extra note about scaling: it becomes even worse when you also want to ORDER BY or GROUP BY. The EXISTS way has a bigger chance of creating a disk-based temporary table, which will kill MySQL performance.
Let's also analyse LIKE '%php%' vs = 'php' for the EXISTS way and the INNER JOIN way.
The EXISTS way:
http://sqlfiddle.com/#!2/48817/21 / http://sqlfiddle.com/#!2/c11f3/1 (more Articles): the explain output tells me both patterns are more or less the same, but = 'php' should be a little faster because of the const type vs ref in the TYPE column, while LIKE '%php%' will use more CPU because a string comparison algorithm needs to run.
The INNER JOIN way:
http://sqlfiddle.com/#!2/43451/23 / http://sqlfiddle.com/#!2/7aa74/8 (more Articles): the explain output tells me LIKE '%php%' should be slower because 3 more rows need to be analysed, but not shockingly slower in this case (you can see the index is not really used in the best way).
RBarryYoung's way works but does not keep its performance, at least not on a MySQL server.
See http://sqlfiddle.com/#!2/b2bd9/1 or http://sqlfiddle.com/#!2/34ea7/1
for examples that will scale on large tables with lots of records, which is what the topic starter needs.
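One thing worth checking when swapping EXISTS for INNER JOIN is that the two forms are not always row-for-row identical: if an article has several categories that all match the LIKE pattern, the join returns that article once per matching category. The sketch below shows this on a toy data set. It assumes SQLite (via Python's sqlite3) in place of MySQL and invented rows, and it says nothing about the performance claims above.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the sample data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Articles (a_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Category (id INTEGER PRIMARY KEY, cat_name TEXT);
    CREATE TABLE ArticleCategories (a_id INTEGER, c_id INTEGER,
                                    PRIMARY KEY (a_id, c_id));
    INSERT INTO Articles VALUES (1, 'Intro'), (2, 'Styling');
    INSERT INTO Category VALUES (1, 'php'), (2, 'php-frameworks'), (3, 'css');
    -- Article 1 carries two categories that both contain 'php'.
    INSERT INTO ArticleCategories VALUES (1, 1), (1, 2), (2, 3);
""")

JOIN_SQL = """
    SELECT {distinct} Articles.a_id
    FROM Category
    INNER JOIN ArticleCategories ON Category.id = ArticleCategories.c_id
    INNER JOIN Articles ON ArticleCategories.a_id = Articles.a_id
    WHERE Category.cat_name LIKE '%php%'
    ORDER BY Articles.a_id
"""

EXISTS_SQL = """
    SELECT Articles.a_id
    FROM Articles
    WHERE EXISTS (SELECT 1
                  FROM ArticleCategories
                  INNER JOIN Category ON ArticleCategories.c_id = Category.id
                  WHERE Articles.a_id = ArticleCategories.a_id
                    AND Category.cat_name LIKE '%php%')
    ORDER BY Articles.a_id
"""

join_rows = conn.execute(JOIN_SQL.format(distinct="")).fetchall()
distinct_rows = conn.execute(JOIN_SQL.format(distinct="DISTINCT")).fetchall()
exists_rows = conn.execute(EXISTS_SQL).fetchall()

print(join_rows, distinct_rows, exists_rows)
```

Adding DISTINCT (or a GROUP BY on a_id) to the join form brings the two results back in line.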

MySQL how to select on multiple tables using Not Exist

I have three tables. One is a table of deletion candidates. This table was created with certain criteria, but did not include a couple of factors for consideration (limitations of the system). The other two tables were created considering those "left out" factors. So, I need to run a SELECT query on these three tables to come up with a deletion list.
What I started with is:
SELECT inactive.id
FROM inactive, renamed, returned
WHERE NOT EXISTS (inactive.id = remamed.id and inactive.id = returned.id)
But this is giving me an error. Can someone point out my error here?
Thank you
It's not entirely clear what you are trying to do here.
I assume you want a list of all rows from the inactive table that do not exist in either the renamed table or the returned table. Is that right?
If so you can use a query like this:
SELECT inactive.id
FROM inactive
WHERE NOT EXISTS (select null from renamed where renamed.id = inactive.id)
AND NOT EXISTS (select null from returned where returned.id = inactive.id)
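Here is a minimal sketch of that double NOT EXISTS on sample data, assuming SQLite (via Python's sqlite3) in place of MySQL and single-column tables with invented ids.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; each table is reduced to an id column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE inactive (id INTEGER PRIMARY KEY);
    CREATE TABLE renamed  (id INTEGER PRIMARY KEY);
    CREATE TABLE returned (id INTEGER PRIMARY KEY);
    INSERT INTO inactive VALUES (1), (2), (3), (4), (5);
    INSERT INTO renamed  VALUES (2);
    INSERT INTO returned VALUES (3), (5);
""")

# Keep only deletion candidates that appear in neither exclusion table.
deletion_list = [row[0] for row in conn.execute("""
    SELECT inactive.id
    FROM inactive
    WHERE NOT EXISTS (SELECT NULL FROM renamed  WHERE renamed.id  = inactive.id)
      AND NOT EXISTS (SELECT NULL FROM returned WHERE returned.id = inactive.id)
    ORDER BY inactive.id
""")]

print(deletion_list)
```

Each NOT EXISTS needs its own subquery; the original attempt failed because it passed a bare condition to NOT EXISTS instead of a SELECT.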