MySQL database relationship without an ID - mysql

Hi StackOverflow community,
I have these two tables:
tbl_users
ID_user (PRIMARY KEY)
Username (UNIQUE)
Password
...
tbl_posts
ID_post (PRIMARY KEY)
Owner (UNIQUE)
Description
...
Why always everybody make database relationships with foreign keys? What about if I want to relate Username with Owner instead of doing ID_user with ID_user in both tables?
Username is UNIQUE and the Owner is the username of the creator of the post.
Can it be done like that? There is something to correct or make better? Maybe I have a misconception.
I would appreciate detailed and understandable answers.
Thank you in advance.

The reason is primarily for data integrity. The argument concerning performance is a little misleading. While neither exhaustive, nor definitive, I hope this little example will shed some light on that fact:
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(i INT NOT NULL AUTO_INCREMENT PRIMARY KEY
,s CHAR(12) NOT NULL UNIQUE
);
STEP1:
INSERT IGNORE INTO my_table (s)
SELECT CONCAT(CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
,CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
);
STEP2:
INSERT IGNORE INTO my_table (s)
SELECT CONCAT(CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
,CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97),CHAR((RAND()*26)+97)
)
FROM my_table;
[REPEAT STEP 2 SEVERAL TIMES]
SELECT COUNT(*) FROM my_table;
+----------+
| COUNT(*) |
+----------+
| 16384 |
+----------+
1 row in set (0.01 sec)
SELECT * FROM my_table ORDER BY i LIMIT 12;;
+----+------------+
| i | s |
+----+------------+
| 1 | kkxeehxsvy |
| 2 | iuyhrk{vaq |
| 3 | ngpedelooc |
| 4 | irkbyqgkhc |
| 6 | yqkcifcxdz |
| 7 | sgezlgvjjq |
| 8 | blavbvxbnl |
| 9 | wdbtqvgvgt |
| 13 | pakzpbnhxr |
| 14 | vpoy{gdwyd |
| 15 | ezlhz{drwg |
| 16 | ncwcwbpudh |
+----+------------+
SELECT * FROM my_table x JOIN my_table y ON y.i < x.i ORDER BY x.i,y.i LIMIT 1;
+---+------------+---+------------+
| i | s | i | s |
+---+------------+---+------------+
| 2 | iuyhrk{vaq | 1 | kkxeehxsvy |
+---+------------+---+------------+
1 row in set (1 min 22.60 sec)
SELECT * FROM my_table x JOIN my_table y ON y.s < x.s ORDER BY x.s,y.s LIMIT 1;
+-------+------------+------+------------+
| i | s | i | s |
+-------+------------+------+------------+
| 21452 | aabetdlvum | 6072 | aabdnegtav |
+-------+------------+------+------------+
1 row in set (1 min 13.59 sec)
So, we have two queries doing essentially the same thing (a comparison of 270 million values). The first joins the table to itself on an integer value. The second joins the table to itself on a string value. Both columns are indexed. As you can see, in this example, the string join actually performs better than the integer join - even though the hit on the CPU may actually be greater!

Related

how to omit mysql rows that match another table when neither is unique

I have two tables, one that holds potential items, the other holds completed items.
The potential item table currently contains the records that have also been added to the completed items table. I want to remove (either by deleting or selecting new results) the already completed items from the list of potential items.
In both tables, items may appear multiple times, and I only want to remove the number of items that are completed, not all that match.
The real data set is more larger of course, but here are samples.
Potential items:
mysql> select * from stack;
+----------+------+------+
| stack_id | type | name |
+----------+------+------+
| 3 | a | aa |
| 4 | b | bb |
| 5 | c | cc |
| 6 | d | dd |
| 7 | a | aa |
| 8 | b | bb |
+----------+------+------+
6 rows in set (0.00 sec)
Completed items
mysql> select * from temp;
+----------+------+------+
| item_id | type | name |
+----------+------+------+
| 1 | a | aa |
| 2 | b | bb |
| 6 | b | bb |
+----------+------+------+
3 rows in set (0.00 sec)
The IDs between tables do not correlate, so they should be ignored as far as finding matches.
I want to omit 1 instance of a/aa and 2 of b/bb since those have been completed and exist in the other table.
when I try this:
mysql> select stack.* from stack where (type,name) not in (select type,name from temp);
I get this:
+----------+------+------+
| stack_id | type | name |
+----------+------+------+
| 5 | c | cc |
| 6 | d | dd |
+----------+------+------+
2 rows in set (0.03 sec)
But this omitted both instances of type="a" and name="aa" and I want to only omit one of them (since it only exists once in the completed items table)
How do I get this?
+----------+------+------+
| stack_id | type | name |
+----------+------+------+
| 5 | c | cc |
| 6 | d | dd |
| 7 | a | aa |
+----------+------+------+
I don't care which instance of a/aa is deleted (whether id=7 or id=3)
The best I've been able to come up with is to use PHP rather than MySQL to loop through each record in temp and delete with a LIMIT 1 from stack.
But I'd rather not have to run code for this, I'd like to do it in queries, it works better that way in my workflow
Thanks!
CREATE TABLE `test`.`stack` (
`stack_id` INT UNSIGNED NOT NULL,
`type` VARCHAR(45) NULL,
`name` VARCHAR(45) NULL,
PRIMARY KEY (`stack_id`));
CREATE TABLE `test`.`temp` (
`item_id` INT UNSIGNED NOT NULL,
`type` VARCHAR(45) NULL,
`name` VARCHAR(45) NULL,
PRIMARY KEY (`item_id`));
and than something like this:
select
min(stack_id), type, name
from stack _s
inner join
(
select min(item_id) item_id, type, name
from temp
group by type, name
) _t using(type, name)
group by _s.type, _s.name
will give you only one the first item in temp:
stack_id
type
name
3
a
aa
4
b
bb

MySQL query slow with OR in WHERE statement

I have a SQL query which looks simple but runs very slow ~4s:
SELECT tblbooks.*
FROM tblbooks LEFT JOIN
tblauthorships ON tblbooks.book_id = tblauthorships.book_id
WHERE (tblbooks.added_by=3 OR tblauthorships.author_id=3)
GROUP BY tblbooks.book_id
ORDER BY tblbooks.book_id DESC
LIMIT 10
EXPLAIN result:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+----------------+-------+-------------------+---------+---------+------------------------+------+-------------+
| 1 | SIMPLE | tblbooks | index | fk_books__users_1 | PRIMARY | 62 | NULL | 10 | Using where |
| 1 | SIMPLE | tblauthorships | ref | book_id | book_id | 62 | tblbooks.book_id | 1 | Using where |
+------+-------------+----------------+-------+-------------------+---------+---------+------------------------+------+-------------+
2 rows in set (0.000 sec)
If I run the above query individually on each part of OR in WHERE statement, both queries return result in less than 0.01s.
Simplified schema:
tblbooks (~1 million rows):
| Field | Type | Null | Key | Default | Extra |
+---------------+-----------------------+------+-----+---------------------+----------------+
| id | int(10) unsigned | NO | MUL | NULL | auto_increment |
| book_id | varchar(20) | NO | PRI | NULL | |
| added_by | int(11) unsigned | NO | MUL | NULL | |
+---------------+-----------------------+------+-----+---------------------+----------------+
tblauthorships (< 100 rows):
| Field | Type | Null | Key | Default | Extra |
+---------------+------------------+------+-----+---------------------+----------------+
| authorship_id | int(11) unsigned | NO | PRI | NULL | auto_increment |
| book_id | varchar(20) | NO | MUL | NULL | |
| author_id | int(11) unsigned | NO | MUL | NULL | |
+---------------+------------------+------+-----+---------------------+----------------+
Both book_id and author_id columns in tblauthorships have their index created.
Can anyone point me to the right direction?
Note: I'm aware of book_id varchar issue.
My usual analogy for indexing is a telephone book. It's sorted by last name then by first name. If you look up a person by last name, you can find them efficiently. If you look up a person by last name AND first name, it's also efficient. But if you look up a person by first name only, the sort order of the book doesn't help, and you have to search every page the hard way.
Now what happens if you need to search a telephone book for a person by last name OR first name?
SELECT * FROM TelephoneBook WHERE last_name = 'Thomas' OR first_name = 'Thomas';
This is just as bad as searching only by first name. Since all entries matching the first name you searched should be included in the result, you have to find them all.
Conclusion: Using OR in an SQL search is hard to optimize, given that MySQL can use only one index per table in a given query.
Solution: Use two queries and UNION them:
SELECT * FROM TelephoneBook WHERE last_name = 'Thomas'
UNION
SELECT * FROM TelephoneBook WHERE first_name = 'Thomas';
The two individual queries each use an index on the respective column, then the results of both queries are unified (by default UNION eliminates duplicates).
In your case you don't even need to do the join for one of the queries:
(SELECT b.*
FROM tblbooks AS b
WHERE b.added_by=3)
UNION
(SELECT b.*
FROM tblbooks AS b
INNER JOIN tblauthorships AS a USING (book_id)
WHERE a.author_id=3)
ORDER BY book_id DESC
LIMIT 10
The two answers so far are not very optimal. Since they have both UNION and LIMIT, let me further optimize their answers:
( SELECT ...
ORDER BY ...
LIMIT 10
) UNION DISTINCT
( SELECT ...
ORDER BY ...
LIMIT 10
)
ORDER BY ...
LIMIT 10
This gives each SELECT a chance to optimize the ORDER BY and LIMIT, making them faster. Then the UNION DISTINCT dedups. Finally, the first 10 are peeled off to make the resultset.
If there will be pagination via OFFSET, this optimization gets trickier. See http://mysql.rjweb.org/doc.php/index_cookbook_mysql#or
Also... Your table needs two indexes:
INDEX(added_by)
INDEX(author_id)
(Please use SHOW CREATE TABLE; it is more descriptive than DESCRIBE.)

MySQL - nested select queries running many time slower than sequential queries (on a large table)

I have a MySQL query that I am having performance problems with that I do not understand. When I try to debug and run the overall query as a sequence of separate subqueries they seem to perform reasonably well, given the volume of data. When I combine them into a single nested query I get much much much longer execution times.
The main ratings table mentioned below is approx 30 million rows (4GB of disk space), with a couple of foreign keys (it's a many-to-many table linking users and items with a small amount of additional supplementary user specific item information - approx 13 fields and 30 bytes).
Query 1 - approx 23s
SELECT COUNT(1) FROM (SELECT fields FROM ratings WHERE (id >= 0 AND id < 10000)
AND item_type = 1) AS t1;
Query 1 saved to table - approx 65s if I save the results to a temporary table
CREATE TABLE temp_table SELECT fields FROM ratings WHERE (id >= 0 AND id < 10000)
AND item_type = 1;
Query 2 - approx 3s
SELECT COUNT(1) FROM temp_table WHERE id IN (SELECT id from item_stats WHERE
ratings_count > 1000);
Bases on this I would expect a combined query to be approx 30s or so, and not more than approx 70s.
Combined query (Query 1 + Query 2) - indeterminate time (10s of minutes before I give up and cancel)
SELECT COUNT(1) from (SELECT * FROM (SELECT fields FROM ratings WHERE (id >= 0
AND id < 10000) AND item_type = 1) AS t1 WHERE t1.id IN (SELECT id FROM
item_stats WHERE ratings_count > 1000)) as t2;
Can anyone help explain this difference and guide me in creating a query that works? If I need to I can rely on the sequential queries (which would take approx 70s), but that is cumbersome and does not seem the right way to go.
I have tried using INNER JOIN instead of IN but this did not seem to make much difference. The ID count from the item_stats table is about 2700 IDs.
It's using MySQL 8.0 on a laptop (16GB RAM, SSD).
Response to suggestions / questions:
Query 1
EXPLAIN select user_id, game_id, item_type_id, rating, plays, own, bgg_last_modified from collections where (user_id >= 0 and user_id < 10000) and item_type_id = 1;
+----+-------------+-------------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------------+------------+------+---------------+------+---------+------+----------+----------+-------------+
| 1 | SIMPLE | collections | NULL | ALL | user_id | NULL | NULL | NULL | 32898400 | 1.31 | Using where |
+----+-------------+-------------+------------+------+---------------+------+---------+------+----------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
Query 2
EXPLAIN select * from temp_coll where game_id in (select game_id from games_ratings_stats where (ratings_count > 1000) or (ratings_count > 500 and ratings_avg >= 7.0));
+----+--------------+---------------------+------------+------+---------------+------+---------+------+---------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+---------------------+------------+------+---------------+------+---------+------+---------+----------+--------------------------------------------+
| 1 | SIMPLE | <subquery2> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | NULL |
| 1 | SIMPLE | temp_coll | NULL | ALL | NULL | NULL | NULL | NULL | 1674386 | 10.00 | Using where; Using join buffer (hash join) |
| 2 | MATERIALIZED | games_ratings_stats | NULL | ALL | NULL | NULL | NULL | NULL | 81585 | 40.74 | Using where |
+----+--------------+---------------------+------------+------+---------------+------+---------+------+---------+----------+--------------------------------------------+
3 rows in set, 1 warning (0.00 sec)
Combined query
EXPLAIN select * from (select user_id, game_id, item_type_id, rating, plays, own, bgg_last_modified from collections where (user_id >= 0 and user_id < 10000) and item_type_id = 1) as t1 where t1.game_id in (select game_id from games_ratings_stats where (ratings_count > 1000) or (ratings_count > 500 and ratings_avg >= 7.0));
+----+--------------+---------------------+------------+------+-----------------+---------+---------+---------------------+-------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+--------------+---------------------+------------+------+-----------------+---------+---------+---------------------+-------+----------+-------------+
| 1 | SIMPLE | <subquery3> | NULL | ALL | NULL | NULL | NULL | NULL | NULL | 100.00 | Using where |
| 1 | SIMPLE | collections | NULL | ref | user_id,game_id | game_id | 5 | <subquery3>.game_id | 199 | 1.31 | Using where |
| 3 | MATERIALIZED | games_ratings_stats | NULL | ALL | NULL | NULL | NULL | NULL | 81585 | 40.74 | Using where |
+----+--------------+---------------------+------------+------+-----------------+---------+---------+---------------------+-------+----------+-------------+
3 rows in set, 1 warning (0.00 sec)
Your query appears to be functionally identical to the following (rather implausible) query:
SELECT COUNT(*) total
FROM ratings r
JOIN item_stats s
ON s.id = r.id
WHERE r.id >= 0
AND r.id < 10000
AND r.item_type = 1
AND s.ratings_count > 1000
r.id is, presumably, the PRIMARY KEY, so it's automatically included in any INNODB index, which leaves just item_type and ratings_count requiring indexes.
You would benefit a lot from an online tutorial on learning how to read the EXPLAIN plan. The EXPLAINS you shared clearly show missing indexes.
As a general rule, queries should not take 23 seconds or 65 seconds, even with millions of rows. Proper indexes + partitioning should resolve the slowness.
Query 1: The user_id index on that table is not helping performance, as 99% of users are within the range in the where clause. You can add an index on item_type_id
ALTER TABLE collections ADD KEY (item_type_id)
Query 2: The temp_coll table is missing a game_id index. Also, I'm not sure if the underlying code for games_ratings_stats has an index on ratings_count and if that would help. I dont have experience with MySQL materialized tables.
ALTER TABLE temp_coll ADD KEY (game_id)
Query 3:
Would benefit from above indexes.
Increasing the InnoDB Buffer Pool Size (now set to 8GB) seems to have made a significant improvement. If anyone has any further setup or tuning advice on MySQL then that would be appreciated!

MySQL, how to merge table duplicates entries [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How can I remove duplicate rows?
Remove duplicates using only a MySQL query?
I have a large table with ~14M entries. The table type is MyISAM ans not InnoDB.
Unfortunately, I have some duplicate entries in this table that I found with the following request :
SELECT device_serial, temp, tstamp, COUNT(*) c FROM up_logs GROUP BY device_serial, temp, tstamp HAVING c > 1
To avoid these duplicates in the future, I want to convert my current index to a unique constraint using SQL request :
ALTER TABLE up_logs DROP INDEX UK_UP_LOGS_TSTAMP_DEVICE_SERIAL,
ALTER TABLE up_logs ADD INDEX UK_UP_LOGS_TSTAMP_DEVICE_SERIAL ( `tstamp` , `device_serial` )
But before that, I need to clean up my duplicates!
My question is : How can I keep only one entry of my duplicated entries? Keep in mind that my table contain 14M entries, so I would like avoid loops if it is possible.
Any comments are welcome!
Creating a new unique key on the over columns you need to have as uniques will automatically clean the table of any duplicates.
ALTER IGNORE TABLE `table_name`
ADD UNIQUE KEY `key_name`(`column_1`,`column_2`);
The IGNORE part does not allow the script to terminate after the first error occurs. And the default behavior is to delete the duplicates.
Since MySQL allows Subqueries in update/delete statements, but not if they refer to the table you want to update, I´d create a copy of the original table first. Then:
DELETE FROM original_table
WHERE id NOT IN(
SELECT id FROM copy_table
GROUP BY column1, column2, ...
);
But I could imagine that copying a table with 14M entries takes some time... selecting the items to keep when copying might make it faster:
INSERT INTO copy_table
SELECT * FROM original_table
GROUP BY column1, column2, ...;
and then
DELETE FROM original_table
WHERE id IN(
SELECT id FROM copy_table
);
It was some time since I used MySQL and SQL in general last time, so I´m quite sure that there is something with better performance - but this should work ;)
This is how you can delete duplicate rows... I'll write you my example and you'll need to apply to your code. I have Actors table with ID and I want to delete the rows with repeated first_name
mysql> select actor_id, first_name from actor_2;
+----------+-------------+
| actor_id | first_name |
+----------+-------------+
| 1 | PENELOPE |
| 2 | NICK |
| 3 | ED |
....
| 199 | JULIA |
| 200 | THORA |
+----------+-------------+
200 rows in set (0.00 sec)
-Now I use a Variable called #a to get the ID if the next row have the same first_name(repeated, null if it's not).
mysql> select if(first_name=#a,actor_id,null) as first_names,#a:=first_name from actor_2 order by first_name;
+---------------+----------------+
| first_names | #a:=first_name |
+---------------+----------------+
| NULL | ADAM |
| 71 | ADAM |
| NULL | AL |
| NULL | ALAN |
| NULL | ALBERT |
| 125 | ALBERT |
| NULL | ALEC |
| NULL | ANGELA |
| 144 | ANGELA |
...
| NULL | WILL |
| NULL | WILLIAM |
| NULL | WOODY |
| 28 | WOODY |
| NULL | ZERO |
+---------------+----------------+
200 rows in set (0.00 sec)
-Now we can get only duplicates ID:
mysql> select first_names from (select if(first_name=#a,actor_id,null) as first_names,#a:=first_name from actor_2 order by first_name) as t1;
+-------------+
| first_names |
+-------------+
| NULL |
| 71 |
| NULL |
...
| 28 |
| NULL |
+-------------+
200 rows in set (0.00 sec)
-the Final Step, Lets DELETE!
mysql> delete from actor_2 where actor_id in (select first_names from (select if(first_name=#a,actor_id,null) as first_names,#a:=first_name from actor_2 order by first_name) as t1);
Query OK, 72 rows affected (0.01 sec)
-Now lets check our table:
mysql> select count(*) from actor_2 group by first_name;
+----------+
| count(*) |
+----------+
| 1 |
| 1 |
| 1 |
...
| 1 |
+----------+
128 rows in set (0.00 sec)
it works, if you have any question write me back

Find value within a range in database table

I need the SQL equivalent of this.
I have a table like this
ID MN MX
-- -- --
A 0 3
B 4 6
C 7 9
Given a number, say 5, I want to find the ID of the row where MN and MX contain that number, in this case that would be B.
Obviously,
SELECT ID FROM T WHERE ? BETWEEN MN AND MX
would do, but I have 9 million rows and I want this to run as fast as possible. In particular, I know that there can be only one matching row, I now that the MN-MX ranges cover the space completely, and so on. With all these constraints on the possible answers, there should be some optimizations I can make. Shouldn't there be?
All I have so far is indexing MN and using the following
SELECT ID FROM T WHERE ? BETWEEN MN AND MX ORDER BY MN LIMIT 1
but that is weak.
If you have an index spanning MN and MX it should be pretty fast, even with 9M rows.
alter table T add index mn_mx (mn, mx);
Edit
I just tried a test w/ a 1M row table
mysql> select count(*) from T;
+----------+
| count(*) |
+----------+
| 1000001 |
+----------+
1 row in set (0.17 sec)
mysql> show create table T\G
*************************** 1. row ***************************
Table: T
Create Table: CREATE TABLE `T` (
`id` int(10) NOT NULL AUTO_INCREMENT,
`mn` int(10) DEFAULT NULL,
`mx` int(10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `mn_mx` (`mn`,`mx`)
) ENGINE=InnoDB AUTO_INCREMENT=1048561 DEFAULT CHARSET=utf8
1 row in set (0.00 sec)
mysql> select * from T order by rand() limit 1;
+--------+-----------+-----------+
| id | mn | mx |
+--------+-----------+-----------+
| 112940 | 948004986 | 948004989 |
+--------+-----------+-----------+
1 row in set (0.65 sec)
mysql> explain select id from T where 948004987 between mn and mx;
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
| 1 | SIMPLE | T | range | mn_mx | mn_mx | 5 | NULL | 239000 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)
mysql> select id from T where 948004987 between mn and mx;
+--------+
| id |
+--------+
| 112938 |
| 112939 |
| 112940 |
| 112941 |
+--------+
4 rows in set (0.03 sec)
In my example I just had an incrementing range of mn values and then set mx to +3 that so that's why I got more than 1, but should apply the same to you.
Edit 2
Reworking your query will definitely be better
mysql> explain select id from T where mn<=947892055 and mx>=947892055;
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| 1 | SIMPLE | T | range | mn_mx | mn_mx | 5 | NULL | 9 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
It's worth noting even though the first explain reported many more rows to be scanned I had enough innodb buffer pool set to keep the entire thing in RAM after creating it; so it was still pretty fast.
If there are no gaps in your set, a simple gte comparison will work:
SELECT ID FROM T WHERE ? >= MN ORDER BY MN ASC LIMIT 1