Merge two large tables with unique values - MySQL

I have two large tables: a main one (TableA) with around 14 million records, and a second one (TableB) with 20 million records that I want to merge into it. For the most part the first is a subset of the second.
I tried making a unique index from 2 or 3 fields combined that would identify records as such, but MySQL wouldn't do it.
I then made my own field 'Unique' by concatenating those three fields.
My question is: how do I import TableB into TableA using only unique records, i.e. ones where the value in the Unique field of TableB does not already exist in the Unique field of TableA? Since I could not make the Unique field an actual unique index, should I try to make it a PK and/or an ordinary index in each of the respective tables?
Any thoughts on how to do this efficiently are appreciated.

Use the SQL UNION statement, which removes duplicate rows across the two selects:
SELECT * FROM TableA UNION SELECT * FROM TableB;

Related

"Filtering" huge MariaDB/Mysql table based on different table

Struggling with a large dataset in my MariaDB database. I have two tables, where table A contains 57 million rows and table B contains around 500. Table B is a subset of ids related to a column in table A. I want to delete all rows from A which do not have a corresponding ID in table B.
Example table A:
classification_id | Name
20                | Mercedes
30                | Kawasaki
80                | Leitz
70                | HP

Example table B:
classification_id | Type
20                | car
30                | bike
40                | bus
50                | boat
So in this example the last two rows of table A would be deleted (or a mirror table would be made containing only the first two rows; that's also fine).
I tried the mirror-table approach using an inner join, but the query ran for a few minutes before giving an out-of-memory exception.
Any suggestions on how to tackle this?
Try this:
DELETE FROM `table A` WHERE classification_id NOT IN (SELECT classification_id FROM `table B`);
(Backticks, not double quotes, are how MySQL quotes identifiers unless ANSI_QUOTES is enabled.)
Since you say that the filter table contains a relatively small number of rows, your best bet would be creating a separate table that contains the same columns as the original table A and the rows that match your criteria, then replace the original table and drop it. Also, with this number of IDs you probably want to use WHERE IN () instead of joins - as long as the field you're using there is indexed, it will usually be way faster. Bringing it all together:
CREATE TABLE new_A AS
SELECT A.* FROM A
WHERE classification_id IN (SELECT classification_id FROM B);
RENAME TABLE A TO old_A, new_A TO A;
DROP TABLE old_A;
Things to be aware of:
Back up your data! And test the queries thoroughly before running that DROP TABLE. You don't want to lose 57M rows of data because of a random answer on StackOverflow.
If A has any indexes or foreign keys, these won't be copied over, so you'll have to recreate them all manually. I'd recommend running SHOW CREATE TABLE A first and making a note of its structure. Alternatively, you may consider creating new_A explicitly, using the output of SHOW CREATE TABLE A as a template, and then running INSERT INTO new_A SELECT ... with the same query instead of CREATE TABLE new_A AS SELECT ....
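A minimal sketch of that alternative (CREATE TABLE ... LIKE copies column definitions and indexes, though not foreign keys, so still check the SHOW CREATE TABLE output for those):

-- Clone A's structure including its indexes, then copy only the matching rows.
CREATE TABLE new_A LIKE A;
INSERT INTO new_A
SELECT A.* FROM A
WHERE classification_id IN (SELECT classification_id FROM B);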

Spring JPA join query between tables in multiple databases

I have to join 3 tables from 3 different databases.
I am fetching records from table 1 (around 100K records every hour). Using the key from table 1, I will have to get records from table 2, and then using the key from table 2, I will have to fetch records from table 3.
I am thinking of using an IN clause, but I am not sure if that would be a good option.
Also, the records in table 2 and table 3 are fewer in number and won't change frequently. So there is the option of using a second-level cache to keep the records of these two tables in cache, and then filtering them based on what is required for table 1.
Should I use the join approach, or the cache for the less frequently updated tables?
Please suggest.

Is there a way to index information on different tables in MySQL?

My MySQL schema looks like the following:
create table TBL1 (id, person_id, ....otherData)
create table TBL2 (id, tbl1_id, month,year, ...otherData)
I am querying this schema as
select * from TBL1 join TBL2 on (TBL2.tbl1_id=TBL1.id)
where TBL1.person_id = ?
and TBL2.month=?
and TBL2.year=?
The current problem is that there are about 18K records in TBL1 associated with a given person_id, and about 20K records in TBL2 associated with the same month/year values.
For now I have two indexes:
index1 on TBL1(person_id) and index2 on TBL2(month, year)
When the database runs the query it uses index1 (ignoring the month and year params) or index2 (ignoring the person_id param). So, in both cases it scans about 20K records and doesn't perform as expected.
Is there any way for me to create a single index across both tables, or to tell MySQL to merge the indexes when querying?
No, an index can belong to only one table. You will need to look at the EXPLAIN for this query to see if you can determine where the performance issue is coming from.
Do you have indexes on TBL2.tbl1_id and TBL1.id?
No. Indexes are on single tables.
You need compound indices on both tables that include the join column. If you add the ID columns to both indices, the query optimizer should pick that up, as sketched below.
Can you post an "EXPLAIN"?
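A sketch of those compound indices (the index names are illustrative):

ALTER TABLE TBL1 ADD INDEX idx_person_id (person_id, id);
ALTER TABLE TBL2 ADD INDEX idx_month_year_tbl1 (month, year, tbl1_id);

Each side can then satisfy both its filter and the join from a single index instead of scanning ~20K rows.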

Index counter shared by multiple tables in mysql

I have two tables, each one has a primary ID column as key. I want the two tables to share one increasing key counter.
For example, when the two tables are empty, and counter = 1. When record A is about to be inserted to table 1, its ID will be 1 and the counter will be increased to 2. When record B is about to be inserted to table 2, its ID will be 2 and the counter will be increased to 3. When record C is about to be inserted to table 1 again, its ID will be 3 and so on.
I am using PHP as the outside language. Now I have two options:
Keep the counter in the database as a single-row-single-column table. But every time I add things to table A or B, I need to update this counter table.
I can keep the counter as a global variable in PHP. But then I need to initialize the counter from the maximum key of the two tables at the start of Apache, which I have no idea how to do.
Any suggestion for this?
The background is, I want to display a mix of records from the two tables in either ASC or DESC order of the creation time of the records. Furthermore, the records will be displayed in page-style, say, 50 records per page. Records are only added to the database rather than being removed. Following my above implementation, I can just perform a "select ... where key between 1 and 50" from two tables and merge the select datasets together, sort the 50 records according to IDs and display them.
Is there any other idea of implementing this requirement?
Thank you very much
Well, you will gain next to nothing with this setup; if you just keep the datetime of the insert you can easily do
SELECT * FROM
(
  SELECT columnA, columnB, inserttime
  FROM table1
  UNION ALL
  SELECT columnA, columnB, inserttime
  FROM table2
) AS merged          -- MySQL requires an alias on the derived table
ORDER BY inserttime
LIMIT 0, 50          -- the offset starts at 0, so this is the first 50 rows
And it will perform decently.
Alternatively (if chasing the last drop of performance): if you are merging the results, that can be an indicator that you should merge the tables (why have two tables at all if you are merging the results?).
Or do it as SQL subclassing (supertype/subtype): one table maintains the IDs and other common attributes, and the other two reference the common ID sequence as a foreign key.
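A hedged sketch of that layout (all table and column names here are illustrative):

-- Common supertype table owns the shared ID sequence and the creation time.
CREATE TABLE record_base (
  id INT AUTO_INCREMENT PRIMARY KEY,
  inserttime DATETIME NOT NULL
);

-- Each subtype row borrows its PK from the supertype.
CREATE TABLE table1 (
  id INT PRIMARY KEY,
  columnA VARCHAR(100),
  columnB VARCHAR(100),
  FOREIGN KEY (id) REFERENCES record_base (id)
);

CREATE TABLE table2 (
  id INT PRIMARY KEY,
  columnA VARCHAR(100),
  columnB VARCHAR(100),
  FOREIGN KEY (id) REFERENCES record_base (id)
);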
If you need creation time, won't it be easier to add a timestamp field to your db and sort according to that field?
I believe using IDs as a reference for creation time is bad practice.
If you really must do this, there is a way. Create a one-row, one-column table to hold the last-used row number, and set it to zero. On each of your two data tables, create a BEFORE INSERT trigger to read that table, increment it, and set the new row's ID to that value (it has to be BEFORE INSERT, since the row's columns can no longer be changed in an AFTER INSERT trigger). I can't remember the exact syntax because I haven't created a trigger for years; see here http://dev.mysql.com/doc/refman/5.0/en/triggers.html
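A hedged sketch of that setup, concurrency subtleties aside (all names are illustrative; you would need one such trigger per data table, shown here for table1 only):

CREATE TABLE id_counter (last_id INT NOT NULL);
INSERT INTO id_counter VALUES (0);

DELIMITER //
CREATE TRIGGER table1_bi BEFORE INSERT ON table1
FOR EACH ROW
BEGIN
  -- Bump the shared counter, then hand the new value to this row.
  -- The UPDATE takes a row lock, which serializes concurrent inserts.
  UPDATE id_counter SET last_id = last_id + 1;
  SET NEW.id = (SELECT last_id FROM id_counter);
END//
DELIMITER ;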

Deleting Duplicates in Access 2003

I have an Access 2003 table with ~4000 records which was made from 17 different tables. Roughly half of these records are duplicates. There is no natural unique identifying column (id, name, etc.). There is an id column which was auto-filled when the tables were combined, meaning that the duplicates aren't completely identical (though this column could be removed if it makes things easier).
I have used the Access Find Duplicates Query Wizard, which gives me a list of the duplicated records but won't let me delete them (seriously, what use is this query if I can't delete them?). I've tried converting the generated query to a delete query, but that changes the number of rows that it finds. I'd alter the SQL by hand, but it's a bit beyond me and is 7 lines long.
Does anyone know a good way of getting rid of the duplicates?
The reason the find duplicates query won't let you delete the records is because it is basically just an aggregate query, it is counting the number of duplicates it finds and returning the cases where the count is greater than 1.
Consider that if you did make a delete query based on the find duplicates, it would delete all rows that have duplicate values, which is maybe not what you want. You want to delete all but one of the duplicates.
You should try to delete all duplicates of a record apart from one, excluding the ID column from your comparison. I suggest the simplest way to do this is a make-table query of all the unique values (SELECT DISTINCT Field1, Field2, ... FROM MyTable) for every field except the ID field, using the results to create a new table of around 2000 records (if half are duplicates).
Then, create an ID column on your new table, use an update query to update this ID to the first matching ID in the original table (you could do this using DLookup, which will return the first EXPRESSION value where CRITERIA is true in DOMAIN).
The DLookup() function returns one value from a single field even if more than one record satisfies the criteria. If no record satisfies the criteria, or if the domain contains no records, DLookup() returns a Null.
Since you are identifying the first matching ID based on all the other fields, which are unique values, the unmatched IDs will belong to duplicates. You will be reversing the PK relation, identifying the first matching key given a set of unique fields. After that, you should set the ID to be PK. Of course this assumes the ID has no inherent meaning, and you don't care about keeping one particular ID for a given duplicated row over any of the IDs belonging to the other duplicated rows. This assumes you care about the data in the ID column so you want to preserve it for all remaining rows, otherwise just ignore the DLookup step and do a Select Distinct on all columns apart from the ID.
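An illustrative version of that DLookup update query, assuming two text fields named Field1 and Field2 (all names here are assumptions, and numeric fields would drop the single quotes in the criteria string):

UPDATE MYNEWTABLE
SET MYNEWTABLE.ID = DLookup("ID", "MyTable", "Field1='" & [Field1] & "' AND Field2='" & [Field2] & "'")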
Use a select with all columns except the ID column:
SELECT DISTINCTROW Column1, Column2, Column3
INTO MYNEWTABLE
FROM TABLE
You can simply swap the names.
This solution will give you a new table with no duplicates.
The following will preserve original IDs and do it in one step:
DELETE FROM table_with_duplicates
WHERE table_with_duplicates.id NOT IN
(SELECT max(id)
FROM table_with_duplicates
GROUP BY duplicated_field_1, duplicated_field_2, ...
)
Now you have the original table with no duplicates and preserved ids.
And always remember to back up your data before trying large DELETEs.
A delete-query variant that instead removes the row with the highest ID from each duplicated group (note: if a group has more than two copies, run it repeatedly until no rows are deleted):
DELETE * FROM table_with_duplicates
WHERE table_with_duplicates.ID IN
(SELECT Max(ID)
FROM table_with_duplicates
GROUP BY [duplicated_field_1]
HAVING Count(*) > 1
)
Actually, I found a very simple solution. It took a while, but if all of your fields are the same across rows, like a complete duplicate record, then just make one query with every field and group by every field ("Group By"). The duplicates will combine, and you can append the result to a new table and rename it the same as the existing table. If you have a primary key field, you could just leave it out of the query and the data would still combine (assuming you don't care about the data in the primary key field). I don't know why no one has mentioned this solution; it took me 5 hr. to come up with. :)
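A sketch of that make-table query, with illustrative field names (the primary key field is deliberately left out so that complete duplicates collapse into one row):

SELECT Field1, Field2, Field3
INTO MyTable_Deduped
FROM MyTable
GROUP BY Field1, Field2, Field3;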