MySQL most efficient method to find missing values from multiple tables

I have three tables, the first is a list of email addresses:
addresses:
id - integer, this is the primary key
email - varchar(255) field holding the address
sent:
sid - integer, foreign key references id in addresses table
received:
rid - integer, foreign key references id in addresses table
Obviously the "sent" and "received" tables have other columns, but they are not important for this question. The sent and received tables are populated every time an email is sent or received, and if the address is not already in the "addresses" table, it gets added. The tables can get quite large (100,000+ rows).
Entries in the "sent" and "received" tables are purged on a regular basis, and entries are removed for various reasons, leaving orphaned entries in the "addresses" table.
I am looking for the most efficient method in MySQL to purge orphaned entries in the "addresses" table. The query I have so far is:
delete
from addresses
where id not in
(select rid from received)
and id not in
(select sid from sent);
This works, but it can take a looong time to run and is definitely not the most efficient way of doing this! I also tried this:
delete
from addresses
where not exists
(select 'x' from sent where sent.sid=addresses.id)
and not exists
(select 'x' from received where received.rid=addresses.id);
This was a bit quicker, but still takes a long time. I suspect I need to use the JOIN syntax, but my SQL knowledge has run out on me at this point!

This should do the trick:
DELETE addresses.* FROM addresses
LEFT JOIN sent ON sent.sid=addresses.id
LEFT JOIN received ON received.rid=addresses.id
WHERE sent.sid IS NULL AND received.rid IS NULL
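For this anti-join to stay fast as the tables grow, the joined columns should be indexed. A minimal sketch, assuming no such indexes exist yet (the index names here are made up):
CREATE INDEX idx_sent_sid ON sent (sid);
CREATE INDEX idx_received_rid ON received (rid);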

Try this (note that MySQL's multi-table DELETE needs the alias repeated after DELETE):
delete a from addresses a left join sent s
on (s.sid = a.id) where s.sid is null

I'm sorry I can't really give a definite answer. But I had a similar problem, and after looking around it seems there are only two main choices:
using WHERE x NOT IN y
using LEFT JOIN x ON y WHERE z IS NULL
I tried both methods by comparing two tables, of 2822291 and 916626 records respectively.
The performance conclusions are as follows:
Type 1 is significantly faster than Type 2. (600 sec vs 6000 sec)
Indexes or keys have a reasonable impact on performance for this operation on both types.
Performance is almost independent of the actual number of DISTINCT values. Comparing 2000 distinct values or just 15 in both tables takes about the same time.
Thus, concluding, as of now (08-2013) it seems that option 1 is still the faster way to go. Using NOT EXISTS might be even faster, but the performance difference there isn't dramatic compared to type 1.
I hope this helps anyone out eventually.

Did some testing using two 300k-row MyISAM tables which contained two id columns (and several other non-identical columns). The ids were identical except for 2 records in one table. I tried the 3 methods mentioned to find these ids:
WHERE NOT EXISTS
LEFT JOIN
IN ()
Making sure to use SQL_NO_CACHE, all three queries performed identically: the server returned the two results in ~14.6 seconds.
The differences mentioned above must be down to caching, differing versions of MySQL, and/or general server configuration.
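For reference, this is the shape of a test run with the query cache bypassed; the table and column names are placeholders, not the tester's actual schema:
SELECT SQL_NO_CACHE t1.id
FROM table_a t1
LEFT JOIN table_b t2 ON t2.id = t1.id
WHERE t2.id IS NULL;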

Related

How to set up MySQL tables for fast SELECT

The question is about *.FIT files (1 to extremely many, and constantly more) from sports watches, speedometers, etc.,
in which there is always a timestamp (1 to n seconds), as well as 1 to n further parameters (which also have either a timestamp or a counter from 1 to x).
To perform data analysis, I need the data in the database, to calculate e.g. heart rate in relation to altitude over several FIT files / training units / time periods.
Because of the changing number of parameters in a FIT file (depending on the connected devices, the device that created the file, etc.) and the possibility of integrating more/new parameters in the future, my idea was to have a separate table for each parameter instead of writing everything into one big table (which would then have extremely many "empty" cells whenever a parameter is not present in a FIT file).
Basic tables:
1 x tbl_file:
id | filename | date
1  | xyz.fit  | 2022-01-01
2  | vwx.fit  | 2022-01-02
.. | ..       | ..
n x tbl_parameter_xy / tbl_parameter_yz / ...:
id | timestamp/counter | file_id | value
1  | 0                 | 1       | value
2  | 1                 | 1       | value
3  | 0                 | 2       | value
.. | ..                | ..      | ..
And these parameter tables would then be linked to each other via the file_id as well as to the FIT File.
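In SQL terms the layout described would look roughly like this (the column types are my assumption, they are not given in the question):
CREATE TABLE tbl_file (
  id       INT PRIMARY KEY,
  filename VARCHAR(255),
  date     DATE
);
CREATE TABLE tbl_parameter_xy (
  id        INT PRIMARY KEY,
  timestamp INT,     -- timestamp or counter
  file_id   INT,     -- references tbl_file.id
  value     DOUBLE
);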
I then used a test server, set up a MySQL DB to test this, and was shocked:
SELECT * FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
WHERE x.file_id = 999
This took almost 30 seconds to give me the results.
In my parameter tables there are 209918 rows.
file_id 999 consists of 1964 rows.
But my SELECT with JOIN returns 3857269 rows, so there must be an error somewhere, and that's the reason why it takes 30 seconds.
In comparison, fetching from a "large complete" table was done in 0.5 seconds:
SELECT * FROM tbl_all_parameters
WHERE file_id = 999
After some research, I came across INDEX and thought I had the solution.
I created an index (file_id) for each of the parameter tables, but the result was just as slow, or slower.
Right now I'm thinking about building that big "one in all" table, which makes it easier to handle and faster to select from, but I would have to alter it frequently to insert new columns for new parameters. And I'm afraid it will grow so big it kills itself.
I have 2 questions:
Which table setup is recommended, primarily with a focus on SELECT speed, and secondarily on the size of the DB?
Do I have a basic bug in my SELECT that makes it so slow?
Try prefixing your query with EXPLAIN (i.e. EXPLAIN SELECT ...) to see how MySQL executes it.
You're getting a combinatorial explosion in your JOIN. Your result set contains one output row for every pair of input rows in your two parameter tables.
If you say
SELECT * FROM a JOIN b
with no ON condition at all, you get COUNT(a) * COUNT(b) rows in your result set (a Cartesian product). And what you said amounts to this:
SELECT * FROM a JOIN b WHERE a.file_id = b.file_id
which gives you a similarly bloated result set: within file_id 999 it pairs each of the 1964 rows in x with every matching row in y, roughly 1964 * 1964, or about 3.86 million rows, which is exactly the explosion you observed.
You need another ON condition... possibly try this.
SELECT *
FROM tbl_parameter_xy as x
LEFT JOIN tbl_parameter_yz as y
ON x.file_id = y.file_id
AND x.timestamp = y.timestamp
if the timestamps in the two tables are somehow in sync.
But, with respect, I don't think you have a very good database design yet.
This is a tricky kind of data for which to create an optimal database layout, because it's extensible.
If you find yourself with a design where you routinely create new tables in production (for example, when adding a new device type) you almost certainly have misdesigned your database.
An approach you might take is creating an attribute / value table. It will have a lot of rows in it, but they'll be short and easy to index.
Your observations will go into a table like this (a sketch follows the column list):
file_id      - part of your primary key
parameter_id - part of your primary key
timestamp    - part of your primary key
value
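A minimal sketch of that table, assuming integer ids and numeric values (the types are my assumption):
CREATE TABLE observation_table (
  file_id      BIGINT NOT NULL,
  parameter_id INT    NOT NULL,
  timestamp    INT    NOT NULL,
  value        DOUBLE,
  PRIMARY KEY (file_id, parameter_id, timestamp)
);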
Then, when you need to, say, retrieve parameters 2 and 3 from a particular file, you would do
SELECT timestamp, parameter_id, value
FROM observation_table
WHERE file_id = xxxx
AND parameter_id IN (2,3)
ORDER BY timestamp, parameter_id
The multicolumn primary key I suggested will optimize this particular query.
Once you have this working, read about denormalization.

How to do right indexing for fast execution of large data in MySQL

I have a table which has a huge amount of data. I have 9 columns in that table (bp_detail) plus an ID column, which is my primary key. I am fetching data using the query
select * from bp_detail
So what do I need to do to get the data quickly? Should I create indexes? If yes, then on which column(s)?
I am also using that table (bp_detail) in an inner join with a table (extras) to get records based on a where clause, and the query that I am using is:
select * from bp_detail bp inner join extras e
on (bp.id = e.bp_id)
where bp.id = '4' or bp.name = 'john'
I have joined these tables by applying a foreign key between bp_detail.id and extras.bp_id, so in this case what should I do to get the data quickly? Right now I have an index on the column "name" in the extras table.
Any guidance is highly appreciated.
If you are selecting all records, you gain nothing by indexing any column. An index makes filtering/ordering by the database engine quicker. Imagine a large book with 20000 pages. Having an index on the first page with chapter names and page numbers, you can quickly navigate through the book. The same applies to a database, since it is nothing more than a collection of records kept one after another.
You are planning to join tables though. The filtering takes place when JOINING:
on (bp.id = e.bp_id)
and in the WHERE:
where bp.id = '4' or bp.name = 'john'
(Anyway, is there any reason why you are filtering by both the ID and the NAME? The ID should be unique enough.)
Usually table IDs should be primary keys, so joining is covered. If you plan to filter by the name frequently, consider adding an index there too. You ought to check how database indexes work as well.
Regarding the name index, the lookup speed depends on the search type. If you use an = equality search it will be very quick. It will be quite quick with a trailing wildcard too (e.g. name LIKE 'john%'), but quite slow with wildcards on both sides (e.g. name LIKE '%john%').
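To illustrate with sketches (assuming an index exists on bp_detail.name; these queries are illustrative, not from the question):
-- can use the index: equality and prefix searches
SELECT * FROM bp_detail WHERE name = 'john';
SELECT * FROM bp_detail WHERE name LIKE 'john%';
-- cannot use the index: a leading wildcard forces a full scan
SELECT * FROM bp_detail WHERE name LIKE '%john%';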
Anyway, is your database large enough for this to matter? Without much data, and if your application is not read-intensive, this feels like the beginner's mistake called premature optimization.
Depending on your search criteria: if you are just selecting all of the data, then the primary key is enough. To enhance the join part, you can create an index on e.bp_id. We could help you more if you shared the tables' schema.
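A sketch of the suggested indexes (the index names are made up, and the bp_id index may already exist if the foreign key created one):
CREATE INDEX idx_extras_bp_id ON extras (bp_id);
CREATE INDEX idx_bp_detail_name ON bp_detail (name);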

Is JOIN less efficient than two sql queries?

I have two tables
Table A (Primary Key is ID):
id | firstname | lastname | zip | state
Table B:
some_field | business name | zip | id
I need to get the first name and last name associated with the id using the id from Table B (note this is the same id as in Table A)
I did a JOIN on Table A and Table B so that I could get the first name and last name
A friend of mine said I should not use JOIN this way and that I should have just done two separate queries. Does that make any sense?
Does JOIN do anything that makes the process slower than two separate queries? How could two separate queries ever be faster than one query?
Q: Does this make sense?
A: No, without some valid reasons, it doesn't make sense.
Q: Does JOIN do anything that makes the process slower than two separate queries?
A: Yes, there are some things that can make a join slower, so we can't rule out that possibility. We can't make a blanket statement that "two separate queries will be faster" or that a "join will be slower".
An equijoin of two tables that are properly indexed is likely to be more efficient. But performance is best gauged by actually executing the statements, at expected production volumes of data, and observing and measuring performance.
Some of the things that could potentially make a join slower: a complicated join predicate (involving columns wrapped in functions, inequality comparisons, compound predicates combined with OR), multiple tables involved where the optimizer has more join paths and operations to consider to come up with an execution plan, or a join that produces a hugh jass intermediate result that is later collapsed with a GROUP BY. (In short, it is possible to write a horrendously inefficient statement that uses a join operation. But it is usually not the join operation that is the culprit. This list is just a sampling; it's not exhaustive.)
A JOIN is the normative pattern for the use case you describe. It's not clear why your friend recommended that you avoid a JOIN operation, or what reason your friend gave.
If your main query is primarily against (the unfortunately named) Table_B, and you are wanting to do a lookup of first_name and last_name from Table_A, the JOIN is suited to that.
If you are only returning one row (or a few rows) from Table_B, then an extra roundtrip for another query to get first_name and last_name won't be a problem. But if you are returning thousands of rows from Table_B, then executing thousands of separate, singleton queries against Table_A is going to kill performance and scalability.
If your friend is concerned that a value in the foreign key column in Table_B won't match a value in the id column of Table_A, or there is a NULL value in the foreign key column, your friend would be right to point out that an inner join would prevent the row from Table_B from being returned.
In that case, we'd use an outer join, so we can return the row from Table_B even when a matching row from Table_A is not found.
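A sketch of that outer join, using hypothetical names table_a and table_b for the question's Table A and Table B:
SELECT b.some_field, a.firstname, a.lastname
FROM table_b b
LEFT JOIN table_a a ON a.id = b.id;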
Your friend might also be concerned about performance of the JOIN operation, possibly because your friend has been burned by not having suitable indexes defined.
Assuming that a suitable index exists on Table_A (with a leading column id), and that id is UNIQUE in Table_A... then performance of a single query with a simple equijoin between a single-column foreign key and single-column primary key will likely be more efficient than running a bloatload of separate statements.
Or, perhaps your friend is concerned about issues with an immature ORM framework, one that doesn't efficiently handle the results returned from a join query.
If the database is being implemented in a way that the two tables could be on separate database servers, then using a JOIN would fly in the face of that design. And if that was the design intent, a separation of the tables, then the application should also be using a separate connection for each of the two tables.
Unless your friend can provide some specific reason for avoiding a JOIN operation, my recommendation is that you ignore his advice.
(There has to be a good reason to avoid a JOIN operation. I suspect that maybe your friend doesn't understand how relational databases work.)
In your case it doesn't make any big difference, because you just have an id as a foreign key, which has an index anyway. Since it's indexed, it will be efficient, and having a join on that is the best thing.
It becomes more complicated based on what you want, what the fields are, and what you want to accomplish, etc.
So, yes, no big difference in your case.

SQL - Comparing text (combinations) on a 100-million-row table

I have a problem.
I have a table that has around 80-100 million records in it. In that table I have a field that stores from 3 up to 16 different "combinations" (varchar). A combination is a 4-digit number, a colon, and a char (A-E). For example:
'0001:A/0002:A/0005:C/9999:E'. In this case there are 4 different combinations (they can go up to 16). This field is in every row of the table, never null.
Now the problem: I have to go through the table, find every row, and see if they are similar.
Example rows:
0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A
0001:A/0002:A/0003:C
0001:A/0002:A/0003:A/0006:C
0701:A/0709:A/0711:C/0712:A/0713:A
As you can see, each of these rows is similar to the others (in some way). What needs to be done here is: when you send '0001:A/0002:A/0003:C' via a program (or as a parameter in SQL), it has to check every row and see whether it belongs to the same "group". Now the catch here is that it has to go both ways, it has to be done quickly, and the SQL needs to compare them somehow.
So when you send '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' it has to find all fields that share 3-16 of the same combinations and return those rows. This 3-16 can be specified via a parameter, but the problem is that you would need to consider all possible combinations, because you can send '0002:A/0711:C/0713:A', and as you can see 0002:A can be the first parameter.
But you cannot rely on indexing, because a combination can be at any place in the string, and you can send combinations that are not "attached" (there could be a different combination in the middle).
So, sending '0001:A/0002:A/0003:C/0005:A/0684:A/0699:A/0701:A/0707:A/0709:A/0710:D/0711:C/0712:A/0713:A' has to return all fields that share the same 3-16 combinations,
and it has to go both ways: if you send "0001:A/0002:A/0003:C" it has to find the row above + similar rows (all that contain all the parameters).
Some things/options I tried:
Doing LIKE for all sent combinations is not practical + too slow.
Giving the field a full-text index isn't an option (don't know why exactly).
One of the few things that could work would be making some "hash" type of encoding for the fields, calculating it via a program, and searching for all the same "hashes" (don't know how you would do that, given that a hash would generate different values for similar texts; maybe some hash written exactly for this purpose).
Making a new field, calculating/writing (can be done on insert) all possible combinations and checking via SQL/program if they have the same % of combinations. But I don't know how you can store 10080 combinations (in the case of 16) in a "varchar" effectively, or via some hash code, and then know which of them are similar.
There is another catch: this table is in use almost 24/7, and computing the combinations to check whether they are the same in SQL is too slow because the table is too big. It could be done via a program or something, but I don't have any clue how you could store this in a new row in such a way that you would know they are the same. One possibility would be to calculate the combinations and store them via some hash code on each row insert, calculate the "hash" via a program, and check the table like:
SELECT * FROM TABLE WHERE ROW = "a346adsad"
where the parameter would be sent via program.
This script would need to be executed really fast, under 1 minute, because there could be new inserts into the table that you would need to check.
The whole point of this would be to see if there are any similar combinations in SQL already, and to block any new "similar" combination from being inserted.
I have been dealing with this problem for 3 days now without finding any possible solution; the thing that came closest is some kind of insert/hash approach, but I don't know how that could work.
Thank you in advance for any possible help, or if this is even possible!
it checks every row and see if they have the same "group".
IMHO if the group is a basic element of your data structure, your database structure is flawed: to be normalized, it should have each group in its own cell. The structure you described makes it clear that you store a composite value in the field.
I'd tear up the table into 3:
one for the "header" information of the group sequences
one for the groups themselves
a connecting table between the two
Something along these lines:
CREATE TABLE GRP_SEQUENCE_HEADER (
ID BIGINT PRIMARY KEY,
DESCRIPTION TEXT
);
CREATE TABLE GRP (
ID BIGINT PRIMARY KEY,
GROUP_TXT CHAR(6)
);
CREATE TABLE GRP_GRP_SEQUENCE_HEADER (
GROUP_ID BIGINT,
GROUP_SEQUENCE_HEADER_ID BIGINT,
GROUP_SEQUENCE_HEADER_ORDER INT, /* For storing the order in the sequence */
PRIMARY KEY(GROUP_ID, GROUP_SEQUENCE_HEADER_ID)
);
(of course, add the foreign keys, and most importantly the indexes necessary)
Then you only have to break up the input into groups, and execute a simple query on a properly indexed table.
Also, you would probably save on disk space by not storing duplicates...
A sample query for finding the "similar" sequences' IDs:
SELECT ggsh.GROUP_SEQUENCE_HEADER_ID, COUNT(1)
FROM GRP_GRP_SEQUENCE_HEADER ggsh
JOIN GRP g ON ggsh.GROUP_ID = g.ID
WHERE g.GROUP_TXT IN (<groups to check for from the sequence>)
GROUP BY ggsh.GROUP_SEQUENCE_HEADER_ID
HAVING COUNT(1) BETWEEN 3 AND 16 --lower and upper boundaries
This returns all the header IDs that the current sequence is similar to.
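For example, checking the question's short sequence '0001:A/0002:A/0003:C' would mean filling in the IN list with its groups (same query as above, values taken from the question's example):
SELECT ggsh.GROUP_SEQUENCE_HEADER_ID, COUNT(1)
FROM GRP_GRP_SEQUENCE_HEADER ggsh
JOIN GRP g ON ggsh.GROUP_ID = g.ID
WHERE g.GROUP_TXT IN ('0001:A', '0002:A', '0003:C')
GROUP BY ggsh.GROUP_SEQUENCE_HEADER_ID
HAVING COUNT(1) BETWEEN 3 AND 16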
EDIT
Rethinking it a bit more, you could even break each group up into its two parts, but as far as I understand you always have full groups to deal with, so it doesn't seem necessary.
EDIT2 Maybe if you want to speed the process up even more, I'd recommend translating the sequences into numeric data using a bijection. For example, evaluate the first 4 digits as an integer, shift it 4 bits to the left (multiply by 16, but quicker), and add the hex value of the character in the last place.
Examples:
0001:A --> 1 as integer, A is 10, so 1*16+10 = 26
...
0002:B --> 2 as integer, B is 11, so 2*16+11 = 43
...
0343:D --> 343 as integer, D is 13, so 343*16+13 = 5501
...
9999:E --> 9999 as integer, E is 14, so 9999*16+14 = 159998 (max value, if I understood correctly)
Numerical values are handled more efficiently by the DB, so this should result in even better performance - of course with the new structure.
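A sketch of that encoding in SQL, assuming the 'NNNN:X' format from the question (the expression is my own illustration, not part of the answer):
SELECT CAST(SUBSTRING(combo, 1, 4) AS UNSIGNED) * 16
     + ASCII(SUBSTRING(combo, 6, 1)) - ASCII('A') + 10 AS encoded
FROM (SELECT '0343:D' AS combo) AS t;  -- returns 5501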
So basically you want to execute a complex string manipulation on 80-100 million rows in less than a minute! Ha, ha, good one!
Oh wait, you're serious.
You cannot hope to do these searches on the fly. Read Joel Spolsky's piece "Back to Basics" to understand why.
What you need to do is hive off those 80-100 million strings into their own table, broken up into those discrete tokens, i.e. '0001:A/0002:A/0003:C' is broken up into three records (perhaps of two columns - you're a bit vague about the relationship between the numeric and alphabetic components of the tokens). Those records can be indexed.
Then it is simply a matter of tokenizing the search strings and doing a select joining the search tokens to the new table. Not sure how well it will perform: that rather depends on how many distinct tokens you have.
As people have commented, you would benefit immensely from normalizing your data. But can you not cheat, and create a temp table with the key, exploding your column out on the "/", so you go from
KEY | "0001:A/0002:A/0003:A/0006:C"
KEY1| "0001:A/0002:A/0003:A"
to
KEY | 0001:A
KEY | 0002:A
KEY | 0003:A
KEY | 0006:C
KEY1| 0001:A
KEY1| 0002:A
KEY1| 0003:A
Which would allow you to develop a query something like the following (not tested; `key` is a reserved word in MySQL, so it needs backticks):
SELECT
t1.`key`
, t2.`key`
, COUNT(*)
FROM
temp_table t1
, temp_table t2
, ( SELECT t3.`key`, COUNT(*) AS cnt FROM temp_table t3 GROUP BY t3.`key`) t4
WHERE
t1.combination IN (
SELECT
t5.combination
FROM
temp_table t5
WHERE
t5.`key` = t2.`key`)
AND t1.`key` <> t2.`key`
AND t4.`key` = t1.`key`
GROUP BY
t1.`key`, t2.`key`, t4.cnt
HAVING
COUNT(*) = t4.cnt
So return the two keys where key1 is a proper subset of key?
I guess I can recommend building a special "index".
It will be quite big, but you will achieve super-speedy results.
Let's consider this task as searching a set of symbols.
These are the design conditions:
The symbols are made by the pattern "NNNN:X", where NNNN is a number [0001-9999] and X is a letter [A-E].
So we have 5 * 9999 = 49995 symbols in the alphabet.
The maximum length of a word with this alphabet is 16.
We can build for each word the set of combinations (subsets) of its symbols.
For example, the word "abcd" will have the following combinations:
abcd
abc
ab
a
abd
acd
ac
ad
bcd
bc
b
bd
cd
c
d
As the symbols within a word are kept sorted, we have only 2^N-1 combinations (15 for 4 symbols).
For a 16-symbol word there are 2^16 - 1 = 65535 combinations.
So we make an additional index-organized table like this one:
create table spec_ndx(combination varchar(255), original_value varchar(255))
Performance will be excellent, at the price of overhead: in the worst case, for each record in the original table there will be 65535 "index" records.
So for a 100-million-row table we will get a 6-trillion-row "index" table.
But if the values are short, the size of the "special index" reduces drastically.
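Lookups against that table are then simple indexed equality searches. A sketch (the index name and the sample value are illustrative):
CREATE INDEX idx_spec_ndx_combination ON spec_ndx (combination);
SELECT original_value FROM spec_ndx WHERE combination = '0001:A/0002:A/0003:C';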

How to select records in one table but not another with multiple PKIDs?

Here is my setup:
Table records contains multiple (more than two) PKID columns along with some other columns.
Table cached_records only has two columns, which are the same as two of the PKIDs for records.
For instance, let's assume records has PKIDs 'keyA', 'keyB', and 'keyC' and cached_records only has 'keyA' and 'keyB'.
I need to pull the rows from the records table where the appropriate PKIDs (so, 'keyA' and 'keyB') are not in the cached_records table.
IF I was working with only ONE PKID, I know how simple this task would be:
SELECT
pkid
FROM
records
WHERE
pkid NOT IN (SELECT pkid FROM cached_records)
However, the fact that there are two PKIDs means I can't use a simple NOT IN. This is what I currently have:
SELECT
`keys`.`keyA` AS `keyA`,
`keys`.`keyB` AS `keyB`
FROM
(
SELECT DISTINCT
`keyA`,
`keyB`
FROM
`records`
) AS `keys`
LEFT JOIN
`cached_records` AS `cached`
ON
`keys`.`keyA` = `cached`.`keyA`
AND
`keys`.`keyB` = `cached`.`keyB`
WHERE
(
`cached`.`keyA` IS NULL
AND
`cached`.`keyB` IS NULL
)
(The DISTINCT is needed because, since I am only grabbing two of the multiple PKIDs from the records table, there could be duplicates, and I really don't need duplicates; 'keyC' is not being used, even though it helps determine the uniqueness of the records.)
This query above works just fine; however, as the cached_records table grows, the query takes longer and longer to process (we're talking minutes now; sometimes it takes long enough that my code hangs and crashes).
So, I'm wondering what the most efficient way is to do this kind of operation (selecting rows from one table where the rows don't exist in another) with multiple PKIDs instead of just one...
This should be quicker:
SELECT DISTINCT
`records`.`keyA` AS `keyA`,
`records`.`keyB` AS `keyB`
FROM
`records`
LEFT JOIN
`cached_records` AS `cached`
ON
`records`.`keyA` = `cached`.`keyA`
AND
`records`.`keyB` = `cached`.`keyB`
WHERE
`cached`.`keyA` IS NULL -- one is enough here
Notes:
with the subquery as a derived table, you lose a lot of performance. You can do the DISTINCT in the outermost SELECT here.
it is enough to check one of the two keys for NULL, as neither column can be NULL in cached_records
you should verify that the keyA and keyB columns are of the same type in both tables, and that no conversion occurs (I have seen such things in working live code...)
You should have proper indexes on the tables. Minutes for this query is the sign of something awful going on... (Or an insane amount of data.)
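If indexes are missing, composite indexes covering both join columns would be the first thing to add. A sketch (the index names are made up):
CREATE INDEX idx_records_keyA_keyB ON records (keyA, keyB);
CREATE INDEX idx_cached_keyA_keyB ON cached_records (keyA, keyB);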