Optimizing MySQL query for the Tatoeba project - mysql

I've downloaded the Tatoeba project databases and I'm trying to query them but queries with a subquery are taking way too long.
-- 800.000 rows approx.
CREATE TABLE `sentences` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`language` char(3) DEFAULT NULL,
`text` mediumtext,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=912551 DEFAULT CHARSET=utf8
-- 1.5 million rows approx.
CREATE TABLE `links` (
`sentenceId` int(10) unsigned NOT NULL,
`translatedId` int(10) unsigned NOT NULL,
PRIMARY KEY (`sentenceId`,`translatedId`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
Basically the links table joins two sentences in the sentences table together (the original sentence and one translation). A sentence can have zero or more translations. So I have an id of a sentence I want to work with and want to grab ALL the translations available.
This query gets me what I want but takes almost 18 seconds to complete.
SELECT * FROM `sentences` WHERE `id` IN (SELECT `translatedId` FROM `links` WHERE `sentenceId` = 157967);
Running both queries by themselves just takes an instant.
What am I doing wrong?

SELECT `sentences`.* FROM
`sentences` JOIN
`links` ON `id` = `translatedId`
WHERE `sentenceId` = 157967;
Some versions of MySQL are known not to use indexes in sub-queries.

Try this(using an EXISTS clause):
SELECT *
FROM `sentences` a
WHERE EXISTS (SELECT 1 FROM `links` b WHERE `sentenceId` = 157967 AND b.`translatedId`=a.`id`);
If the translatedId is unique in the links you can go for a inner join as given below
SELECT a.*
FROM `sentences` a INNER JOIN `links` b
ON b.`translatedId`=a.`id`

Related

Can I improve my movie selecting SQL query

I've created a database to store movies data. My tables are the following:
movies:
CREATE TABLE IF NOT EXISTS `movies` (
`movieId` int(11) NOT NULL AUTO_INCREMENT,
`imdbId` varchar(255) DEFAULT NULL,
`imdbRating` float DEFAULT NULL,
`movieTitle` varchar(255) NOT NULL,
`movieLength` varchar(255) NOT NULL,
`imdbRatingCount` varchar(255) NOT NULL,
`poster` varchar(255) NOT NULL,
`year` varchar(255) NOT NULL,
PRIMARY KEY (`movieId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
I have a table in which i store movie actors:
CREATE TABLE IF NOT EXISTS `actors` (
`actorId` int(10) NOT NULL AUTO_INCREMENT,
`actorName` varchar(255) NOT NULL,
PRIMARY KEY (`actorId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;
And one other in which i store the relation between the movies and actors: (movieActor)
CREATE TABLE IF NOT EXISTS `movieActor` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`movieId` int(10) NOT NULL,
`actorId` int(10) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Now when i want to select a list of movies in which are the selected actors my query is:
SELECT *
FROM movies m inner join
(SELECT movieId FROM movieActor WHERE actorId IN(1,2,3) GROUP BY movieId having count(*) = 3) ma ON m.movieId = ma.movieId
WHERE imdbRating IS NOT NULL ORDER BY imdbRating DESC
This is working perfectly, but i don't know that this is the optimal table structure and query to accomplish this. Are there any better table structure to store data or query the list?
First of all, use indexes on your tables. In my opinion it should be useful to have 3 indexes on movieActor. MovieId - ActorID - MovieIdActorId.
Second try tu use foreign keys. These help to identify the best execution plan for your dbs.
Third try to avoid generating temp tables in your execution plan of your query. Subselects often creates temp tables which are used when the database has to temporarily save something in the RAM. To check this, write EXPLAIN in front of goer query.
I would write it like this:
SELECT m.*, movieActor
FROM movies m inner join
movieActor ma ON m.movieId = ma.movieId
WHERE imdbRating IS NOT NULL
and actorId IN(1,2,3)
GROUP BY movieId
having count(*) = 3)
ORDER BY imdbRating DESC
(Not tested)
Just try to optimize it with the EXPLAIN keyword. It also can help you to create the right indexes.

MySQL INSERT INTO ... SELECT ... GROUP BY is too slow

I have a table with about 50M rows and format:
CREATE TABLE `big_table` (
`id` BIGINT NOT NULL,
`t1` DATETIME NOT NULL,
`a` BIGINT NOT NULL,
`type` VARCHAR(10) NOT NULL,
`b` BIGINT NOT NULL,
`is_c` BOOLEAN NOT NULL,
PRIMARY KEY (`id`),
INDEX `a_b_index` (a,b)
) ENGINE=InnoDB;
I then define the table t2, with no indices:
Create table `t2` (
`id` BIGINT NOT NULL,
`a` BIGINT NOT NULL,
`b` BIGINT NOT NULL,
`t1min` DATETIME NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
I then populate t2 using a query from big_table (this will add about 12M rows).
insert into opportunities
(id, a,b,t1min)
SELECT id,a,b,min(t1)
FROM big_table use index (a_b_index)
where type='SUBMIT' and is_c=1
GROUP BY a,b;
I find that it takes this query about a minute to process 5000 distinct (a,b) in big_table.
Since there are 12M distinct (a,b) in big_table then it would take about 40 hours to run
the query on all of big_table.
What is going wrong?
If I just do SELECT ... then the query does 5000 lines in about 2s. If I SELECT ... INTO OUTFILE ..., then the query still takes 60s for 5000 lines.
EXPLAIN SELECT ... gives:
id,select_type,table,type,possible_keys,key,key_len,ref,rows,Extra
1,SIMPLE,stdnt_intctn_t,index,NULL,a_b_index,16,NULL,46214255,"Using where"
I found that the problem was that the GROUP_BY resulted in too many random-access reads of big_table. The following strategy allows one sequential trip through big_table. First, we add a key to t2:
Create table `t2` (
`id` BIGINT NOT NULL,
`a` BIGINT NOT NULL,
`b` BIGINT NOT NULL,
`t1min` DATETIME NOT NULL,
PRIMARY KEY (a,b),
INDEX `id` (id)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Then we fill t2 using:
insert into t2
(id, a,b,t1min)
SELECT id,a,b,t1
FROM big_table
where type='SUBMIT' and is_c=1
ON DUPLICATE KEY UPDATE
t1min=if(t1<t1min,t1,t1min),
id=if(t1<t1min,big_table.id,t2.id);
The resulting speed-up is several orders of magnitude.
The group by might be part of the issue. You are using an index on (a,b), but your where is not being utilized. I would have an index on
(type, is_c, a, b )
Also, you are getting the "ID", but not specifying which... you probably want to do a MIN(ID) for a consistent result.

Eliminating values from one table with another. Super slow

In the same datbase I have a table messages whos columns: id, title, text I want. I want only the records of which title has no entries in the table lastlogon who's title equivalent is then named username.
I have been using this SQL command in PHP, it generally took 2-3 seconds to pull up:
SELECT DISTINCT * FROM messages WHERE title NOT IN (SELECT username FROM lastlogon) LIMIT 1000
This was all good until the table lastlogon started to have about 80% of the values table messages. Messages has about 8000 entries, lastlogon about 7000. Now it takes about a minute to 2 minutes for it to go through. MySQL shoots up to very high CPU usage.
I tried the following but had no luck reducing the time:
SELECT id,title,text FROM messages a LEFT OUTER JOIN lastlogon b ON (a.title = b.username) LIMIT 1000
Why all of a sudden is it taking so long for such low amount of entries? I tried restarting mysql and apache multiple times. I am using debian linux.
Edit: Here are the structures
--
-- Table structure for table `lastlogon`
--
CREATE TABLE IF NOT EXISTS `lastlogon` (
`username` varchar(25) NOT NULL,
`lastlogon` date NOT NULL,
`datechecked` date NOT NULL,
PRIMARY KEY (`username`),
KEY `username` (`username`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
-- --------------------------------------------------------
--
-- Table structure for table `messages`
--
CREATE TABLE IF NOT EXISTS `messages` (
`id` smallint(9) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
`name` varchar(255) NOT NULL,
`email` varchar(50) NOT NULL,
`text` mediumtext,
`folder` tinyint(2) NOT NULL,
`read` smallint(5) unsigned NOT NULL,
`dateline` int(10) unsigned NOT NULL,
`ip` varchar(15) NOT NULL,
`attachment` varchar(255) NOT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`username` varchar(300) NOT NULL,
`error` varchar(500) NOT NULL,
PRIMARY KEY (`id`),
KEY `title` (`title`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=9010 ;
Edit 2
Edited structure with new indexes.
After putting an index on both messages.title and lastlogon.username I came up with these results:
Showing rows 0 - 29 (623 total, Query took 74.4938 sec)
First: replace the key on title, with a compound key on title + id
ALTER TABLE messages DROP INDEX title;
ALTER TABLE messages ADD INDEX title (title, id);
Now change the select to:
SELECT m.* FROM messages m
LEFT JOIN lastlogon l ON (l.username = m.title)
WHERE l.username IS NULL
-- GROUP BY m.id DESC -- faster replacement for distinct. I don't think you need this.
LIMIT 1000;
Or
SELECT m.* FROM messages m
WHERE m.title NOT IN (SELECT l.username FROM lastlogon l)
-- GROUP BY m.id DESC -- faster than distinct, I don't think you need it though.
LIMIT 1000;
Another problem with the slowness is the SELECT m.* part.
By selecting all column, you are forcing MySQL to do extra work.
Only select the columns you need:
SELECT m.title, m.name, m.email, ......
This will speed up the query as well.
There's another trick you can use:
Replace the limit 1000 with a cutoff date.
Step 1: Add an index on timestamp (or whatever field you want to use for the cutoff).
SELECT m.* FROM messages m
LEFT JOIN lastlogon l ON (l.username = m.title)
WHERE (m.id > (SELECT MIN(M2.ID) FROM messages m2 WHERE m2.timestamp >= '2011-09-01'))
AND l.username IS NULL
-- GROUP BY m.id DESC -- faster replacement for distinct. I don't think you need this.
I suggest you to add an index on messages.title . Then try to run again the query and test the performance.

Some help needed with a SQL query

I need some help with a MySQL query. I have two tables, one with offers and one with statuses. An offer can has one or more statuses. What I would like to do is get all the offers and their latest status. For each status there's a table field named 'added' which can be used for sorting.
I know this can be easily done with two queries, but I need to make it with only one because I also have to apply some filters later in the project.
Here's my setup:
CREATE TABLE `test`.`offers` (
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`client` TEXT NOT NULL ,
`products` TEXT NOT NULL ,
`contact` TEXT NOT NULL
) ENGINE = MYISAM ;
CREATE TABLE `statuses` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`offer_id` int(11) NOT NULL,
`options` text NOT NULL,
`deadline` date NOT NULL,
`added` datetime NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
Should work but not very optimal imho :
SELECT *
FROM offers
INNER JOIN statuses ON (statuses.offer_id = offers.id
AND statuses.id =
(SELECT allStatuses.id
FROM statuses allStatuses
WHERE allStatuses.offer_id = offers.id
ORDER BY allStatuses.added DESC LIMIT 1))
Try this:
SELECT
o.*
FROM offers o
INNER JOIN statuses s ON o.id = s.offer_id
ORDER BY s.added
LIMIT 1

MySQL query killing my server

Looking at this query there's got to be something bogging it down that I'm not noticing. I ran it for 7 minutes and it only updated 2 rows.
//set product count for makes
$tru->query->run(array(
'name' => 'get-make-list',
'sql' => 'SELECT id, name FROM vehicle_make',
'connection' => 'core'
));
while($tempMake = $tru->query->getArray('get-make-list')) {
$tru->query->run(array(
'name' => 'update-product-count',
'sql' => 'UPDATE vehicle_make SET product_count = (
SELECT COUNT(product_id) FROM taxonomy_master WHERE v_id IN (
SELECT id FROM vehicle_catalog WHERE make_id = '.$tempMake['id'].'
)
) WHERE id = '.$tempMake['id'],
'connection' => 'core'
));
}
I'm sure this query can be optimized to perform better, but I can't think of how to do it.
vehicle_make = 45 rows
taxonomy_master = 11,223 rows
vehicle_catalog = 5,108 rows
All tables have appropriate indexes
UPDATE: I should note that this is a 1-time script so overhead isn't a big deal as long as it runs.
CREATE TABLE IF NOT EXISTS `vehicle_make` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(32) NOT NULL,
`product_count` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=46 ;
CREATE TABLE IF NOT EXISTS `taxonomy_master` (
`product_id` int(10) NOT NULL,
`v_id` int(10) NOT NULL,
`vehicle_requirement` varchar(255) DEFAULT NULL,
`is_sellable` enum('True','False') DEFAULT 'True',
`programming_override` varchar(25) DEFAULT NULL,
PRIMARY KEY (`product_id`,`v_id`),
KEY `idx2` (`product_id`),
KEY `idx3` (`v_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
CREATE TABLE IF NOT EXISTS `vehicle_catalog` (
`v_id` int(10) NOT NULL,
`id` int(11) NOT NULL,
`v_make` varchar(255) NOT NULL,
`make_id` int(11) NOT NULL,
`v_model` varchar(255) NOT NULL,
`model_id` int(11) NOT NULL,
`v_year` varchar(255) NOT NULL,
PRIMARY KEY (`v_id`,`v_make`,`v_model`,`v_year`),
UNIQUE KEY `idx` (`v_make`,`v_model`,`v_year`),
UNIQUE KEY `idx2` (`v_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Update: The successful query to get what I needed is here....
SELECT
m.id,COUNT(t.product_id) AS CountOf
FROM taxonomy_master t
INNER JOIN vehicle_catalog v ON t.v_id=v.id
INNER JOIN vehicle_make m ON v.make_id=m.id
GROUP BY m.id;
without the tables/columns this is my best guess from reverse engineering the given queries:
UPDATE m
SET product_count =COUNT(t.product_id)
FROM taxonomy_master t
INNER JOIN vehicle_catalog v ON t.v_id=v.id
INNER JOIN vehicle_make m ON v.make_id=m.id
GROUP BY m.name
The given code loops over each make, and then runs a query the counts for each. My answer just does them all in one query and should be a lot faster.
have an index for each of these:
vehicle_make.id cover on name
vehicle_catalog.id cover make_id
taxonomy_master.v_id
EDIT
give this a try:
CREATE TEMPORARY TABLE CountsOf (
id int(11) NOT NULL
, CountOf int(11) NOT NULL DEFAULT 0.00
);
INSERT INTO CountsOf
(id, CountOf )
SELECT
m.id,COUNT(t.product_id) AS CountOf
FROM taxonomy_master t
INNER JOIN vehicle_catalog v ON t.v_id=v.id
INNER JOIN vehicle_make m ON v.make_id=m.id
GROUP BY m.id;
UPDATE taxonomy_master,CountsOf
SET taxonomy_master.product_count=CountsOf.CountOf
WHERE taxonomy_master.id=CountsOf.id;
instead of using nested query ,
you can separated this query to 2 or 3 queries,
and in php insert the result of the inner query to the out query ,
its faster !
#haim-evgi Separating the queries will not increase the speed significantly, it will just shift the load from the DB server to the Web server and create overhead of moving data between the two servers.
I am not sure with the appropriate indexes you run such query 7 minutes. Could you please show the table structure of the tables involved in these queries.
Seems like you need the following indices:
INDEX BTREE('make_id') on vehicle_catalog
INDEX BTREE('v_id') on taxonomy_master