MySQL indexing and Using filesort - mysql

This is related to my last problem. I added two new columns to the listings table: one for bucketed views, views_point (incremented every 100 views), and one for the publish date, publishedon_hourly (truncated to year-month-day-hour), to reduce the number of distinct values.
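Judging by the sample data further down, publishedon_hourly is simply publishedon floored to the hour; a minimal sketch of that derivation:
SELECT publishedon,
       publishedon - (publishedon % 3600) AS publishedon_hourly -- floor to the hour
FROM listings;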
This is my new table:
CREATE TABLE IF NOT EXISTS `listings` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`type` tinyint(1) NOT NULL DEFAULT '1',
`hash` char(32) NOT NULL,
`source_id` int(10) unsigned NOT NULL,
`link` varchar(255) NOT NULL,
`short_link` varchar(255) NOT NULL,
`cat_id` mediumint(5) NOT NULL,
`title` mediumtext NOT NULL,
`description` mediumtext,
`content` mediumtext,
`images` mediumtext,
`videos` mediumtext,
`views` int(10) unsigned NOT NULL DEFAULT '0',
`views_point` int(10) unsigned NOT NULL DEFAULT '0',
`comments` int(11) DEFAULT '0',
`comments_update` int(11) NOT NULL DEFAULT '0',
`editor_id` int(11) NOT NULL DEFAULT '0',
`auther_name` varchar(255) DEFAULT NULL,
`createdby_id` int(10) NOT NULL,
`createdon` int(20) NOT NULL,
`editedby_id` int(10) NOT NULL,
`editedon` int(20) NOT NULL,
`deleted` tinyint(1) NOT NULL,
`deletedon` int(20) NOT NULL,
`deletedby_id` int(10) NOT NULL,
`deletedfor` varchar(255) NOT NULL,
`published` tinyint(1) NOT NULL DEFAULT '1',
`publishedon` int(11) unsigned NOT NULL,
`publishedon_hourly` int(10) unsigned NOT NULL DEFAULT '0',
`publishedby_id` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `hash` (`hash`),
KEY `views_point` (`views_point`),
KEY `listings` (`publishedon_hourly`,`published`,`cat_id`,`source_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=FIXED AUTO_INCREMENT=365513 ;
When I run a query like this:
SELECT *
FROM listings
WHERE (`publishedon_hourly` BETWEEN
UNIX_TIMESTAMP( '2015-09-05 00:00:00' )
AND UNIX_TIMESTAMP( '2015-10-05 12:00:00' ))
AND (published =1)
AND cat_id IN ( 1, 2, 3, 4, 5 )
ORDER BY `views_point` DESC
LIMIT 10
It works great; this is the EXPLAIN for it (screenshot omitted).
But when I change the date range from month to day like this:
SELECT *
FROM listings
WHERE (`publishedon_hourly` BETWEEN
UNIX_TIMESTAMP( '2015-09-05 00:00:00' )
AND UNIX_TIMESTAMP( '2015-09-05 12:00:00' ))
AND (published =1)
AND cat_id IN ( 1, 2, 3, 4, 5 )
ORDER BY `views_point` DESC
LIMIT 10
Then the query becomes slow and the filesort appears. Does anyone know the reason, and how I can fix it?
A data sample (rows matched by the slow query):
INSERT INTO `listings` (`id`, `type`, `hash`, `source_id`, `link`, `short_link`, `cat_id`, `title`, `description`, `content`, `images`, `videos`, `views`, `views_point`, `comments`, `comments_update`, `editor_id`, `auther_name`, `createdby_id`, `createdon`, `editedby_id`, `editedon`, `deleted`, `deletedon`, `deletedby_id`, `deletedfor`, `published`, `publishedon`, `publishedon_hourly`, `publishedby_id`) VALUES
(94189, 1, '44a46d128ce730c72927b19c445ab26e', 8, 'http://Larkin.com/sapiente-laboriosam-omnis-tempore-aliquam-qui-nobis', '', 5, 'And Alice was more and.', 'So they got settled down again very sadly and quietly, and.', 'Dormouse. ''Fourteenth of March, I think it so quickly that the Gryphon only answered ''Come on!'' and ran the faster, while more and more sounds of broken glass, from which she concluded that it was looking down at them, and then a voice sometimes choked with sobs, to sing this:-- ''Beautiful Soup, so rich and green, Waiting in a natural way. ''I thought you did,'' said the Dormouse, without considering at all what had become of it; and as it.', NULL, '', 200, 19700, 0, 0, 0, 'Max', 0, 1441442729, 0, 0, 0, 0, 0, '', 1, 1441442729, 1441440000, 0),
(19030, 1, '3438f6a555f2ce7fdfe03cee7a52882a', 3, 'http://Romaguera.com/voluptatem-rerum-quia-sed', '', 2, 'Dodo said, ''EVERYBODY.', 'I wish I hadn''t to bring but one; Bill''s got the.', 'I wonder what they''ll do well enough; don''t be particular--Here, Bill! catch hold of this remark, and thought to herself. (Alice had no idea what Latitude or Longitude I''ve got to the confused clamour of the other queer noises, would change to dull reality--the grass would be offended again. ''Mine is a long way. So she went on. ''I do,'' Alice said nothing; she had succeeded in curving it down ''important,'' and some were birds,) ''I suppose so,''.', NULL, '', 800, 19400, 0, 0, 0, 'Antonio', 0, 1441447567, 0, 0, 0, 0, 0, '', 1, 1441447567, 1441447200, 0),
(129247, 4, '87d2029a300d8b4314508786eb620a24', 10, 'http://Ledner.com/', '', 4, 'I ever saw one that.', 'The Cat seemed to be a person of authority among them,.', 'I BEG your pardon!'' she exclaimed in a natural way again. ''I wonder what was the same height as herself; and when she looked down at her feet as the question was evidently meant for her. ''I can tell you my history, and you''ll understand why it is I hate cats and dogs.'' It was all dark overhead; before her was another long passage, and the blades of grass, but she had sat down a very little! Besides, SHE''S she, and I''m sure I have dropped them, I wonder?'' As she said to herself; ''his eyes are so VERY tired of being all alone here!'' As she said to itself ''Then I''ll go round a deal.', NULL, '', 1000, 19100, 0, 0, 0, 'Drake', 0, 1441409756, 0, 0, 0, 0, 0, '', 1, 1441409756, 1441407600, 0),
(264582, 2, '5e44fe417f284f42c3b10bccd9c89b14', 8, 'http://www.Dietrich.info/laboriosam-quae-eaque-aut-dolorem', '', 2, 'Alice asked in a very.', 'THINK; or is it directed to?'' said the Mock Turtle,.', 'I can listen all day to such stuff? Be off, or I''ll have you executed.'' The miserable Hatter dropped his teacup and bread-and-butter, and then unrolled the parchment scroll, and read as follows:-- ''The Queen will hear you! You see, she came upon a little of the players to be lost, as she spoke--fancy CURTSEYING as you''re falling through the wood. ''It''s the stupidest tea-party I.', NULL, '', 800, 18700, 0, 0, 0, 'Kevin', 0, 1441441192, 0, 0, 0, 0, 0, '', 1, 1441441192, 1441440000, 0),
(44798, 1, '567cc77ba88c05a4a805dc667816a30c', 14, 'http://www.Hintz.com/distinctio-nulla-quia-incidunt-facere-reprehenderit-sapiente-sint.html', '', 5, 'The Cat seemed to Alice.', 'And the moral of that is--"Be what you mean,'' said Alice..', 'Alice very politely; but she felt very lonely and low-spirited. In a little faster?" said a sleepy voice behind her. ''Collar that Dormouse,'' the Queen said severely ''Who is it directed to?'' said the Footman, and began staring at the Footman''s head: it just at first, but, after watching it a violent blow underneath her chin: it had no pictures or conversations in it, ''and what is the capital of Paris, and Paris is the same thing, you know.'' ''I DON''T.', NULL, '', 300, 17600, 0, 0, 0, 'Rocio', 0, 1441442557, 0, 0, 0, 0, 0, '', 1, 1441442557, 1441440000, 0),
(184472, 1, 'f852e3ed401c7c72c5a9609687385f65', 14, 'https://www.Schumm.biz/voluptatum-iure-qui-dicta-modi-est', '', 4, 'Alice replied, so.', 'I should have liked teaching it tricks very much, if--if.', 'NEVER come to the Dormouse, not choosing to notice this question, but hurriedly went on, ''What''s your name, child?'' ''My name is Alice, so please your Majesty,'' said Two, in a great thistle, to keep back the wandering hair that WOULD always get into her face. ''Wake up, Alice dear!'' said her sister; ''Why, what a dear quiet thing,'' Alice went on, spreading out the answer to shillings and pence. ''Take off your hat,'' the King had said that day. ''No, no!'' said the Gryphon. ''They can''t have anything to say, she simply bowed, and took the watch and looked at it again: but he could.', NULL, '', 900, 17600, 0, 0, 0, 'Billy', 0, 1441407837, 0, 0, 0, 0, 0, '', 1, 1441407837, 1441407600, 0),
(344246, 2, '09dc73287ff642cfa2c97977dc42bc64', 6, 'http://www.Cole.com/sit-maiores-et-quam-vitae-ut-fugiat', '', 1, 'IS the use of a.', 'And when I learn music.'' ''Ah! that accounts for it,'' said.', 'Gryphon answered, very nearly carried it out loud. ''Thinking again?'' the Duchess by this time.) ''You''re nothing but a pack of cards, after all. I needn''t be so stingy about it, you know--'' ''But, it goes on "THEY ALL RETURNED FROM HIM TO YOU,"'' said Alice. ''Call it what you mean,'' the March Hare, ''that "I breathe when I breathe"!'' ''It IS the same side of WHAT? The other guests had taken his watch out of it, and talking over its head. ''Very uncomfortable for the first to speak. ''What size do you like to go and get.', NULL, '', 600, 16900, 0, 0, 0, 'Enrico', 0, 1441406107, 0, 0, 0, 0, 0, '', 1, 1441406107, 1441404000, 0),
(19169, 1, '116c443b5709e870248c93358f9a328e', 12, 'http://www.Gleason.com/et-vero-optio-exercitationem-aliquid-optio-consectetur', '', 4, 'Let this be a lesson to.', 'Sir, With no jury or judge, would be very likely to eat.', 'I wonder who will put on your head-- Do you think I can find them.'' As she said this, she was quite out of sight before the end of every line: ''Speak roughly to your little boy, And beat him when he sneezes; For he can EVEN finish, if he had never heard of such a subject! Our family always HATED cats: nasty, low, vulgar things! Don''t let him know she liked them best, For this must ever be A secret, kept from all the creatures wouldn''t be so kind,'' Alice replied, so eagerly that the way I want to get very tired of being upset, and their curls got entangled together. Alice was not a regular rule: you invented it just grazed his nose, you know?'' ''It''s the thing Mock Turtle would be only.', NULL, '', 700, 16800, 0, 0, 0, 'Unique', 0, 1441407961, 0, 0, 0, 0, 0, '', 1, 1441407961, 1441407600, 0),
(192679, 1, '06a33747b5c95799034630e578e53dc5', 10, 'http://www.Pouros.com/qui-id-molestias-non-dolores-non', '', 5, 'Rabbit just under the.', 'KNOW IT TO BE TRUE--" that''s the jury-box,'' thought Alice,.', 'Mock Turtle, who looked at Two. Two began in a hoarse, feeble voice: ''I heard every word you fellows were saying.'' ''Tell us a story.'' ''I''m afraid I can''t tell you how it was too dark to see what I should say "With what porpoise?"'' ''Don''t you mean by that?'' said the King; and as it was indeed: she was now more than Alice could not make out exactly what they WILL do next! As for pulling me out of court! Suppress him! Pinch him! Off with his head!"'' ''How dreadfully savage!'' exclaimed Alice. ''That''s the first witness,'' said the Duchess. ''Everything''s got a moral, if only you can find it.'' And she squeezed herself up and ran the faster, while more and more faintly came, carried on the end of every line:.', NULL, '', 800, 15900, 0, 0, 0, 'Gene', 0, 1441414720, 0, 0, 0, 0, 0, '', 1, 1441414720, 1441411200, 0),
(251878, 4, '3eafacc53f86c8492c309ca2772fbfe9', 5, 'http://www.Schinner.info/tempora-et-est-qui-nulla', '', 2, 'NOT!'' cried the Mouse,.', 'Twinkle, twinkle--"'' Here the Queen till she heard the.', 'Alice and all of them even when they hit her; and the sounds will take care of the gloves, and she dropped it hastily, just in time to begin at HIS time of life. The King''s argument was, that she had forgotten the Duchess to play croquet with the Dormouse. ''Write that down,'' the King added in an undertone to the fifth bend, I think?'' ''I had NOT!'' cried the Mouse, sharply and very neatly and simply arranged; the only difficulty was, that if something wasn''t done about it in less than a pig, my dear,'' said Alice, a little wider. ''Come, it''s pleased so far,'' said the Gryphon. ''Do you play croquet with the glass table and the King hastily said, and went by without noticing her. Then followed the Knave ''Turn them over!'' The Knave of.', NULL, '', 500, 15900, 0, 0, 0, 'Demarcus', 0, 1441414681, 0, 0, 0, 0, 0, '', 1, 1441414681, 1441411200, 0);

In your first query, the ORDER BY is done using the views_point index: because that index was used to resolve the query, MySQL can also use it for sorting.
In the second query, MySQL resolves the WHERE part using a different index, listing_pcs. That index cannot be used to satisfy the ORDER BY condition, so MySQL uses filesort instead, which is the best option when an index cannot be used.
MySQL only uses an index to sort if it is the same index used in the WHERE condition. This is what the manual means by:
In some cases, MySQL cannot use indexes to resolve the ORDER BY, although it still uses indexes to find the rows that match the WHERE clause. These cases include the following:
The key used to fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
So what can you do:
Try increasing your sort_buffer_size config option to make filesorting as effective as possible. Large results that are too big for the sort buffer cause MySQL to break the sort down into chunks, which is slower.
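For example, a session-scoped bump (the 4 MB figure is purely illustrative; tune it to your result sizes):
SET SESSION sort_buffer_size = 4 * 1024 * 1024; -- 4 MB, affects only this connection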
Force MySQL to choose a different index. It’s worth noting that different MySQL versions choose default indexes differently. Version 5.1, for example, is pretty bad as the Query Optimizer had been vastly re-written for this release and needed lots of refinement. Version 5.6 is pretty good.
SELECT *
FROM listings
FORCE INDEX (views_point)
WHERE (`publishedon_hourly` BETWEEN
UNIX_TIMESTAMP( '2015-09-05 00:00:00' )
AND UNIX_TIMESTAMP( '2015-09-05 12:00:00' ))
AND (published =1)
AND cat_id IN ( 1, 2, 3, 4, 5 )
ORDER BY `views_point` DESC
LIMIT 10

This seems to be some kind of news database, so think about archiving the news every month.
Consider this solution; it's not the best, but it may help.
Add these columns to the listings table:
publishedmonth tinyint(2) UNSIGNED NOT NULL DEFAULT '0'
publishedyear tinyint(2) UNSIGNED NOT NULL DEFAULT '0'
publishedminute mediumint(6) UNSIGNED NOT NULL DEFAULT '0'
Add this index key to the listings table:
ADD KEY published_month (publishedmonth,publishedyear,publishedminute)
When inserting, set these values from PHP:
publishedmonth gets date('n')
publishedyear gets date('y')
publishedminute gets date('jHi')
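For instance (a hedged sketch with a hypothetical id; the values follow the PHP date() formats above for a row published on 2017-02-05 14:30):
UPDATE listings
SET publishedmonth = 2,     -- date('n')
    publishedyear = 17,     -- date('y')
    publishedminute = 51430 -- date('jHi'): day 5, hour 14, minute 30
WHERE id = 12345;           -- hypothetical id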
Load a huge number of records, then test this query:
SELECT * FROM listings WHERE publishedmonth = 2 AND publishedyear = 17 ORDER BY publishedminute

The EXPLAIN says listings_pcs, but the SHOW CREATE TABLE does not list that index. Are we missing something?
Don't use SELECT * if you only need a few columns. In particular the TEXT columns will prevent one form of performance speedup during the query.
Subqueries to work out part of the query usually slow things down. However, in your case (lots of MEDIUMTEXT being fetched, and use of LIMIT), it may be efficient to get the ids in a subquery first, then fetch the bulky columns ("lazy eval"). See below.
A range value (publishedon_hourly) is better off last, not first, in an index.
Starting an index with = column (published) is usually best.
The Optimizer chooses, sometimes incorrectly, to focus on the ORDER BY instead of the WHERE. (Neither is very productive in your case).
INDEX(published, views_point) may avoid the sort, while helping some with the WHERE.
Having a flag (published) that is always tested in queries adds to the complexity and inefficiency of the schema.
BETWEEN is inclusive, so the second query is actually scanning 12 hours plus one second.
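To scan a true half-open 12-hour window, spell the bounds out instead of using BETWEEN:
SELECT *
FROM listings
WHERE publishedon_hourly >= UNIX_TIMESTAMP('2015-09-05 00:00:00')
  AND publishedon_hourly <  UNIX_TIMESTAMP('2015-09-05 12:00:00')
  AND published = 1
  AND cat_id IN (1, 2, 3, 4, 5)
ORDER BY `views_point` DESC
LIMIT 10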
Splitting a date into year+month+day usually hurts more than helps.
Do not set sort_buffer_size bigger than, say, 1% of RAM. Otherwise, you may encounter other problems.
FORCE INDEX may help today, but then hurt tomorrow when the constants change. Caveat emptor.
It is often better to put "click_count" or "likes" or "upvotes" into a separate table. This separates rapidly changing counters from the bulky, relatively static, data. Hence, there is less interference between the two.
If you do the above, simply remove non-published rows from the counter table, thereby simplifying several things.
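A minimal sketch of such a counter table (the name and exact columns are my assumption, not part of the original advice):
CREATE TABLE listing_counters (
  id int(10) unsigned NOT NULL,                   -- same id as listings.id
  views int(10) unsigned NOT NULL DEFAULT 0,
  views_point int(10) unsigned NOT NULL DEFAULT 0,
  PRIMARY KEY (id),
  KEY views_point (views_point)
) ENGINE=MyISAM;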
Most people vilify the filesort, but it is usually other things that are the villains -- in your case, the number and size of rows.
Please provide EXPLAIN FORMAT=JSON SELECT ...; there may be some interesting clues.
Your findings are odd enough to warrant filing a bug at bugs.mysql.com.
I would add these indexes with the columns in the order given, and see what the Optimizer picks:
INDEX(published, views_point) -- aiming at the ORDER BY, plus picking up '='
INDEX(published, cat_id, publishedon_hourly) -- possibly the best for WHERE
Or, maybe, the "lazy eval" of
SELECT L.*
FROM listings AS L
JOIN (
SELECT id
FROM listings
WHERE `publishedon_hourly` BETWEEN UNIX_TIMESTAMP(...)
AND UNIX_TIMESTAMP(...)
AND published = 1
AND cat_id IN ( 1, 2, 3, 4, 5 )
ORDER BY `views_point` DESC
LIMIT 10
) AS s ON L.id = s.id
ORDER BY views_point DESC
-- with
INDEX(published, cat_id, publishedon_hourly, views_point, id)
Notes:
The subquery will be "Using Index"; that is, the index is covering.
There will be two file sorts. One is in the subquery, but working from the index, not the bulky texts. And one is only 10 rows, although bulky.

Very odd behavior. It is hard to see why views_point would not be used for the sort operation without seeing the data in question. You can try giving MySQL an index hint to use views_point for the sort, like this:
SELECT * FROM listings
USE INDEX FOR ORDER BY (`views_point`)
WHERE
(
`publishedon_hourly` BETWEEN UNIX_TIMESTAMP( '2015-09-05 00:00:00' )
AND UNIX_TIMESTAMP( '2015-09-05 12:00:00' )
)
AND (published =1)
AND cat_id IN ( 1, 2, 3, 4, 5 )
ORDER BY `views_point` DESC LIMIT 10

The query optimizer is not perfect. This is one of those borderline cases where it makes the wrong decision. If the data in your table changed even by a small amount, it would perhaps use the other index and run the faster query.
If you don't want to wait for that, you can change your listing_pcs index. It has source_id, which you are not using, so why not replace it with views_point?
KEY `listings` (`publishedon_hourly`,`published`,`views_point`,`cat_id`)
Also, tinyint(1) is not much use for speed or saving space; it still takes one full byte. The same goes for mediumint(5): it takes 3 bytes. Consider combining deleted, type, cat_id, and published into one column and putting an index on that one column.

Related

Improving select speed - mysql - very large tables

Newbie to MySQL and SQL in general - so please be gentle :-)
I have a table with a very high number of rows. The table is:
create table iostat (
pkey int not null auto_increment,
serverid int not null,
datestr char(15) default 'NULL',
esttime int not null default 0,
rs float not null default 0.0,
ws float not null default 0.0,
krs float not null default 0.0,
kws float not null default 0.0,
wait float not null default 0.0,
actv float not null default 0.0,
wsvct float not null default 0.0,
asvct float not null default 0.0,
pctw int not null default 0,
pctb int not null default 0,
device varchar(50),
avgread float not null default 0.0,
avgwrit float not null default 0.0,
primary key (pkey),
index i_serverid (serverid),
index i_esttime (esttime),
index i_datestr (datestr),
index i_rs (rs),
index i_ws (ws),
index i_krs (krs),
index i_kws (kws),
index i_wait (wait),
index i_actv (actv),
index i_wsvct (wsvct),
index i_asvct (asvct),
index i_pctb (pctb),
index i_device (device),
index i_servdate (serverid, datestr),
index i_servest (serverid, esttime)
)
engine = MyISAM
data directory = '${IOSTATdatadir}'
index directory = '${IOSTATindexdir}'
;
Right now the table has 834,317,203 rows.
Yes - I need all the data. The highest level organization of the data is by the collection date (datestr). It is a CHAR instead of a date to preserve the specific date format I use for the various load, extract, and analysis scripts.
Each day adds about 16,000,000 rows.
One of the operations I would like to speed up is (Limit is generally 50 but ranges from 10 to 250):
create table TMP_TopLUNsKRead
select
krs, device, datestr, esttime
from
iostat
where
${WHERECLAUSE}
order by
krs desc limit ${Limit};
WHERECLAUSE is:
serverid = 29 and esttime between X and Y and device like '%t%'
where X and Y are timestamps spanning anywhere from 4 minutes to 24 hours.
I'd prefer not to change the DB engine. This lets me put data and indexes on separate drives, which gave me a significant overall performance gain. It's also a total of 1.6 billion rows, which would take an insane amount of time to reload.
device like '%t%'
This is the killer. The leading % means it is a search of the whole column, or index if it's indexed, not an index lookup. See if you can do without the leading %.
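For contrast, a pattern anchored at the start can use the i_device index as a range scan, while the leading wildcard cannot:
SELECT device FROM iostat WHERE device LIKE 't%';  -- can range-scan i_device
SELECT device FROM iostat WHERE device LIKE '%t%'; -- must examine every device value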
Without knowing what's in your ${WHERECLAUSE} it's impossible to help you. You are correct that this is a huge table.
But here is an observation that might help: A compound covering index on
(krs, device, datestr, esttime)
might speed up the ordering and extraction of your subset of data.
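A sketch of that covering index (the name is arbitrary):
ALTER TABLE iostat ADD INDEX i_krs_cover (krs, device, datestr, esttime);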

How to insert floats into MySQL and then query equality

Wowee... does MySQL work with floats or not!
1) I insert a float into a MySQL field:
price = 0.1
2) I run the below query:
select * from buy_test where price = 0.1
WOW! I get no results
3) I run the below query:
select * from buy_test where price < 0.1
I get no results
4) I run the below query
select * from buy_test where price > 0.1
YAY! I get results but no..I wanted where price =0.1
How do I insert a float into MySQL so that I can query it by equality?
Thanks
CREATE TABLE `buy_test` (
`user_id` varchar(45) DEFAULT NULL,
`order_id` varchar(100) NOT NULL,
`price` float DEFAULT NULL,
`insert_time` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`order_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
That's because 0.1 doesn't exist in floating-point arithmetic.
It would take an infinite number of digits to write the exact value of 0.1 in binary (just as it would take an infinite number of digits to write the exact value of 10/3 in decimal).
In your table, you are storing the price with a 'float' type, which is represented on 32 bits. The value 0.1 is rounded to 0.100000001490116119384765625 (which is the nearest representation of 0.1 in the float type format).
When you are requesting all rows where prices are equal to 0.1, I strongly suspect the interpreter to use the double type, or at least, a more precise type than float.
But let's consider it's using the double type on 64 bits.
In the double type, 0.1 is rounded to 0.1000000000000000055511151231257827021181583404541015625 .
When the engine makes the comparison, it amounts to:
if (0.100000001490116119384765625 ==
0.1000000000000000055511151231257827021181583404541015625) ...
which is obviously false. But it's true for operator > .
I'm pretty sure that this where clause would work: "where price = 0.100000001490116119384765625"
By the way, when the result of your query tells you that the price is "0.1", it's a lie. The value is rounded to be "beautifully displayed".
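You can see the stored value for yourself by casting the float to a high-precision DECIMAL (a quick diagnostic, not a fix):
SELECT CAST(price AS DECIMAL(30,25)) FROM buy_test;
-- returns something like 0.1000000014901161193847656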
There is no real solution to your problem; anyone who knows about floating-point arithmetic problems will discourage you from using equality comparisons on floats.
You can use an epsilon in your query instead, as shown below.
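For example (the tolerance is an arbitrary choice; pick one suited to your price precision):
SELECT * FROM buy_test WHERE ABS(price - 0.1) < 0.000001;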
There is a very interesting article named "What Every Computer Scientist Should Know About Floating-Point Arithmetic"; you can find it here:
http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

MySQL - searching a self join and ranges of data

I've been tasked by my local community center to build a 'newlywed'-type game in time for Valentine's Day, so no rush!
So, we've got 50-odd couples who know each other quite well, and they are going to be asked 100 questions ahead of time. Each question records the user's response and a range allowing a margin of error (this range quota will be limited); they can then select what they think their partner's answer will be, with the same margin-of-error range.
EG (I'll play a round as me and my GF):
Question: Do you like fruit?
I am quite fussy about fruit so I'll put a low score out of 100.. say 20. But what I do like, I LOVE and think that my GF might think I will put a higher answer, so my margin of error I'll allow is going to be 30.
I think she loves fruit and will put at least 90... but she enjoys a lot of foods, so she may just rank it lower, so I'll give her a margin of 20.
Ok, repeat that process for 100 questions and 50 couples.
I'm left with a table like this:
u_a = User answer
u_l = user margin of error level
p_a = partner answer
p_l = partner margin of error level
CREATE TABLE IF NOT EXISTS `large` (
`id_user` int(11) NOT NULL,
`id_q` int(11) NOT NULL,
`u_a` int(11) NOT NULL,
`u_l` int(11) NOT NULL,
`p_a` int(11) NOT NULL,
`p_l` int(11) NOT NULL,
KEY `id_user` (`id_user`,`id_q`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Stackoverflow Test';
So my row will be in the previous example:
(1, 1, 20, 30, 90, 20)
My mission is to search ALL users to see who the best matches are out of the 50... (and hope that the actual couples are good together!).
I want to search the DB for all users whose answers match what I predicted for my partner (and vice versa), for every user.
Here's what I've got so far (note that I've commented out some code because I'm trying two approaches and am not sure which is best):
SELECT
match.id_user,
count(*) as count
from `large` `match`
INNER JOIN `large` `me` ON me.id_q = match.id_q
WHERE
me.id_user = 1 AND
match.id_user != 1 AND
GREATEST(abs(me.p_a - match.u_a), 0) <= me.p_l
AND
GREATEST(abs(match.p_a - me.u_a), 0) <= match.p_l
#match.u_a BETWEEN GREATEST(me.p_a - me.p_l, 0) AND (me.p_a + me.p_l)
#AND
#me.u_a BETWEEN GREATEST(match.p_a - match.p_l, 0) AND (match.p_a + match.p_l)
GROUP BY match.id_user
ORDER BY count DESC
My question today is :
This query takes AGES! I'd like to do it during the game and allow users a chance to change answers on the night and get instant results, so this has to be quick. I'm looking at 40 seconds when looking up all matches for me (user 1).
I'm reading about DB engines and indexing now to make sure I'm doing all that I can... but suggestions are welcome!
Cheers and PHEW!
Your query shouldn't be taking 40 seconds on a smallish data set. The best way to know what is going on is to put EXPLAIN before the query.
However, I suspect the problem is the condition on me. The MySQL engine might be creating all possible combinations for all users and only then filtering down to user 1. You can test this by modifying this code:
from `large` `match` INNER JOIN
`large` `me`
ON me.id_q = match.id_q
WHERE me.id_user = 1 AND
match.id_user != 1 AND . . . .
To:
from `large` `match` INNER JOIN
(select me.*
from `large` `me`
where me.id_user = 1
) me
ON me.id_q = match.id_q
WHERE match.id_user != 1 AND . . . .
In addition, the following indexes might help the query: large(id_user, id_q) and large(id_q).
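A sketch of those indexes (note that (id_user, id_q) already exists as KEY `id_user` in the CREATE TABLE above, so only the single-column one may be new):
ALTER TABLE `large` ADD INDEX idx_q (id_q);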

Optimizing a MySQL query summing and averaging by multiple groups over a given date range

I'm currently working on a home-grown analytics system, currently using MySQL 5.6.10 on Windows Server 2008 (moving to Linux soon, and we're not dead set on MySQL, still exploring different options, including Hadoop).
We've just done a huge import, and what was a lightning-fast query for a small customer is now unbearably slow for a big one. I'm probably going to add an entirely new table to pre-calculate the results of this query, unless I can figure out how to make the query itself fast.
What the query does is take @StartDate and @EndDate as parameters and calculate, for every day of that range, the date, the number of new reviews on that date, a running total of the number of reviews (including any before @StartDate), and the daily average rating (if there is no information for a given day, the average rating is carried over from the previous day).
Available filters are age, gender, product, company, and rating type. Every review has 1-N ratings, containing at the very least an "overall" rating, but possibly more per customer/product, such as "Quality", "Sound Quality", "Durability", "Value", etc...
The API that calls this injects these filters based on user selection. If no rating type is specified, it uses "AND ratingTypeId = 1" in place of the AND clause comment in all three parts of the query I'll be listing below. All ratings are integers between 1 and 5, though that doesn't really matter to this query.
Here are the tables I'm working with:
CREATE TABLE `times` (
`timeId` int(11) NOT NULL AUTO_INCREMENT,
`date` date NOT NULL,
`month` char(7) NOT NULL,
`quarter` char(7) NOT NULL,
`year` char(4) NOT NULL,
PRIMARY KEY (`timeId`),
UNIQUE KEY `date` (`date`)
) ENGINE=MyISAM
CREATE TABLE `reviewCount` (
`companyId` int(11) NOT NULL,
`productId` int(11) NOT NULL,
`createdOnTimeId` int(11) NOT NULL,
`ageId` int(11) NOT NULL,
`genderId` int(11) NOT NULL,
`totalReviews` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`companyId`,`productId`,`createdOnTimeId`,`ageId`,`genderId`),
KEY `companyId_fk` (`companyId`),
KEY `productId_fk` (`productId`),
KEY `createdOnTimeId` (`createdOnTimeId`),
KEY `ageId_fk` (`ageId`),
KEY `genderId_fk` (`genderId`)
) ENGINE=MyISAM
CREATE TABLE `ratingCount` (
`companyId` int(11) NOT NULL,
`productId` int(11) NOT NULL,
`createdOnTimeId` int(11) NOT NULL,
`ageId` int(11) NOT NULL,
`genderId` int(11) NOT NULL,
`ratingTypeId` int(11) NOT NULL,
`negativeRatings` int(10) unsigned NOT NULL DEFAULT '0',
`positiveRatings` int(10) unsigned NOT NULL DEFAULT '0',
`neutralRatings` int(10) unsigned NOT NULL DEFAULT '0',
`totalRatings` int(10) unsigned NOT NULL DEFAULT '0',
`ratingsSum` double unsigned DEFAULT '0',
`totalRecommendations` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`companyId`,`productId`,`createdOnTimeId`,`ageId`,`genderId`,`ratingTypeId`),
KEY `companyId_fk` (`companyId`),
KEY `productId_fk` (`productId`),
KEY `createdOnTimeId` (`createdOnTimeId`),
KEY `ageId_fk` (`ageId`),
KEY `genderId_fk` (`genderId`),
KEY `ratingTypeId_fk` (`ratingTypeId`)
) ENGINE=MyISAM
The 'times' table is pre-filled with every day from 1900-01-01 to 2049-12-31, and the two count tables are populated by an ETL script with a roll-up query grouped by company, product, age, gender, ratingType, etc...
What I'm expecting back from the query is something like this:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 7020 10586 4.017514595496247
2013-01-25 5505 16091 4.058400718778077
2013-01-27 2043 18134 3.992957746478873
2013-01-28 3280 21414 3.983625730994152
2013-01-29 4648 26062 3.921597633136095
...
2013-03-09 1608 60297 3.9409722222222223
2013-03-10 470 60767 3.7743682310469313
2013-03-11 1028 61795 4.036697247706422
2013-03-13 494 62289 3.857388316151203
2013-03-14 449 62738 3.8282208588957056
I'm pretty sure I could pre-calculate everything grouped by age, gender, etc..., except for the average, but I may be wrong on that. If I had three reviews for two products on one day, with all other groups different, and one had a rating of 2 and 5, and the other a 4, the first would have a daily average of 3.5, and the second 4. Averaging those averages would give me 3.75, when I'd expect to get 3.66667. Maybe I could do something like multiplying the average for that grouping by the number of reviews to get the total rating sum for the day, sum those up, then divide them by total ratings count at the end. Seems like a lot of extra work, but it may be faster than what I'm currently doing. Speaking of which, here's my current query:
SET @cumulativeCount :=
(SELECT coalesce(sum(rc.totalReviews), 0)
FROM reviewCount rc
INNER JOIN times dt ON rc.createdOnTimeId = dt.timeId
WHERE dt.date < @StartDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
);
SET @dailyAverageWithCarry :=
(SELECT SUM(rc.ratingsSum) / SUM(rc.totalRatings)
FROM ratingCount rc
INNER JOIN times dt ON rc.createdOnTimeId = dt.timeId
WHERE dt.date < @StartDate
AND rc.totalRatings > 0
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY dt.timeId
ORDER BY dt.date DESC LIMIT 1
);
SELECT
subquery.d AS `Date`,
subquery.newReviewsCount AS `NewReviews`,
(@cumulativeCount := @cumulativeCount + subquery.newReviewsCount) AS `CumulativeReviewsCount`,
(@dailyAverageWithCarry := COALESCE(subquery.dailyRatingAverage, @dailyAverageWithCarry)) AS `DailyRatingAverage`
FROM
(
SELECT
dt.date AS d,
COALESCE(SUM(rc.totalReviews), 0) AS newReviewsCount,
SUM(rac.ratingsSum) / SUM(rac.totalRatings) AS dailyRatingAverage
FROM times dt
LEFT JOIN reviewCount rc ON dt.timeId = rc.createdOnTimeId
LEFT JOIN ratingCount rac ON dt.timeId = rac.createdOnTimeId
WHERE dt.date BETWEEN @StartDate AND @EndDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY dt.timeId
ORDER BY dt.timeId
) AS subquery;
The query currently takes ~2 minutes to run, with the following row counts:
times 54787
reviewCount 276389
ratingCount 473683
age 122
gender 3
ratingType 28
product 70070
Any help would be greatly appreciated. I'd either like to make this query much faster, or if it would be faster to do so, to pre-calculate the values grouped by date, age, gender, product, company, and ratingType, then do a quick roll-up query on that table.
UPDATE #1: I tried Meherzad's suggestions of adding indexes to times and ratingCount with:
ALTER TABLE times ADD KEY `timeId_date_key` (`timeId`, `date`);
ALTER TABLE ratingCount ADD KEY `createdOnTimeId_totalRatings_key` (`createdOnTimeId`, `totalRatings`);
Then ran my initial query again, and it was about 1s faster (~89s), but still too slow. I tried Meherzad's suggested query, and had to kill it after a few minutes.
As requested, here is the EXPLAIN results from my query:
id|select_type|table|type|possible_keys|key|key_len|ref|rows|Extra
1|PRIMARY|<derived2>|ALL|NULL|NULL|NULL|NULL|6808032|NULL
2|DERIVED|dt|range|PRIMARY,timeId_date_key,date|date|3|NULL|88|Using index condition; Using temporary; Using filesort
2|DERIVED|rc|ref|PRIMARY,companyId_fk,createdOnTimeId|createdOnTimeId|4|dt.timeId|126|Using where
2|DERIVED|rac|ref|createdOnTimeId,createdOnTimeId_total_ratings_key|createdOnTimeId|4|dt.timeId|614|NULL
I checked the cache read miss rate as mentioned in the article on buffer sizes, and it was
Key_reads 58303
Key_read_requests 147411279
For a miss rate of 3.9551247635535405672723319902814e-4
UPDATE #2: Solved! The indices definitely helped, so I'll give credit for the answer to Meherzad. What actually made the most difference was realizing that calculating the rolling average and daily/cumulative review counts in the same query was joining those two huge tables together. I saw that the variable initialization was done in two separate queries, and decided to try separating the two big queries into subqueries and then joining them based on the timeId. Now it runs in 0.358s with the following query:
SET @StartDate = '2013-01-24';
SET @EndDate = '2013-04-24';
SELECT
@StartDateId:=MIN(timeId), @EndDateId:=MAX(timeId)
FROM
times
WHERE
date IN (@StartDate , @EndDate);
SELECT
@CumulativeCount:=COALESCE(SUM(totalReviews), 0)
FROM
reviewCount
WHERE
createdOnTimeId < @StartDateId
-- Add Filters
;
SELECT
@DailyAverage:=COALESCE(SUM(ratingsSum) / SUM(totalRatings), 0)
FROM
ratingCount
WHERE
createdOnTimeId < @StartDateId
AND totalRatings > 0
-- Add Filters
GROUP BY createdOnTimeId
ORDER BY createdOnTimeId DESC
LIMIT 1;
SELECT
t.date AS `Date`,
COALESCE(q1.newReviewsCount, 0) AS `NewReviews`,
(@CumulativeCount:=@CumulativeCount + COALESCE(q1.newReviewsCount, 0)) AS `CumulativeReviewsCount`,
(@DailyAverage:=COALESCE(q2.dailyRatingAverage,
COALESCE(@DailyAverage, 0))) AS `DailyRatingAverage`
FROM
times t
LEFT JOIN
(SELECT
rc.createdOnTimeId AS createdOnTimeId,
COALESCE(SUM(rc.totalReviews), 0) AS newReviewsCount
FROM
reviewCount rc
WHERE
rc.createdOnTimeId BETWEEN @StartDateId AND @EndDateId
-- Add Filters
GROUP BY rc.createdOnTimeId) AS q1 ON t.timeId = q1.createdOnTimeId
LEFT JOIN
(SELECT
rc.createdOnTimeId AS createdOnTimeId,
SUM(rc.ratingsSum) / SUM(rc.totalRatings) AS dailyRatingAverage
FROM
ratingCount rc
WHERE
rc.createdOnTimeId BETWEEN @StartDateId AND @EndDateId
-- Add Filters
GROUP BY rc.createdOnTimeId) AS q2 ON t.timeId = q2.createdOnTimeId
WHERE
t.timeId BETWEEN @StartDateId AND @EndDateId;
I had assumed that two subqueries would be incredibly slow, but they were insanely fast because they weren't joining completely unrelated rows. It also pointed out the fact that my earlier results were way off. For example, from above:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 7020 10586 4.017514595496247
Should have been, and now is:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 599 407327 4.017514595496247
The average was correct, but the join was screwing up the number of both new and cumulative reviews, which I verified with a single query.
I also got rid of the joins to the times table inside the subqueries, instead determining the start and end date IDs in a quick initialization query, and then just rejoined to the times table at the end.
Now the results are:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 599 407327 4.017514595496247
2013-01-25 551 407878 4.058400718778077
2013-01-26 455 408333 3.838926174496644
2013-01-27 433 408766 3.992957746478873
2013-01-28 425 409191 3.983625730994152
...
2013-04-13 170 426066 3.874239350912779
2013-04-14 182 426248 3.585714285714286
2013-04-15 171 426419 3.6202531645569622
2013-04-16 0 426419 3.6202531645569622
2013-04-17 0 426419 3.6202531645569622
2013-04-18 0 426419 3.6202531645569622
2013-04-19 0 426419 3.6202531645569622
2013-04-20 0 426419 3.6202531645569622
2013-04-21 0 426419 3.6202531645569622
2013-04-22 0 426419 3.6202531645569622
2013-04-23 0 426419 3.6202531645569622
2013-04-24 0 426419 3.6202531645569622
The last few averages properly carry the earlier ones, too, since we haven't imported from that customer's data feed in about 10 days.
Thanks for the help!
Try this query
You don't have the necessary indexes to optimize your query:
Table times: add a compound index on (timeId, date).
Table ratingCount: add a compound index on (createdOnTimeId, totalRatings).
Since you mentioned that various other AND filters are injected according to user input, also create a compound index on those columns, in the order you add them, for each respective table. For example, for table ratingCount: a compound index on (createdOnTimeId, totalRatings, ratingTypeId, ageId, genderId, productId, companyId). NOTE: this index will be useful only if you actually add those constraints in the query.
I'd also check to make sure your buffer pool is large enough to hold your indexes. You don't want indexes to be paging in and out of the buffer pool during a query.
Check your buffer pool size
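For example (a hedged sketch; since these tables are MyISAM, the key buffer is the relevant cache, while InnoDB uses the buffer pool):
SHOW VARIABLES LIKE 'key_buffer_size';          -- MyISAM index cache
SHOW STATUS LIKE 'Key_read%';                   -- miss rate = Key_reads / Key_read_requests
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';  -- InnoDB equivalent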
If you don't find any improvement in performance, please also post the EXPLAIN output for your query; it will help in understanding the problem properly.
I have tried to understand your query and have written a new one; check whether it works:
SELECT
*
FROM
(SELECT
dt.timeId,
dt.date,
COALESCE(SUM(rc.totalReviews), 0) AS `NewReviews`,
(@cumulativeCount := @cumulativeCount + COALESCE(SUM(rc.totalReviews), 0)) AS `CumulativeReviewsCount`,
(@dailyAverageWithCarry := COALESCE(SUM(rac.ratingsSum) / SUM(rac.totalRatings), @dailyAverageWithCarry)) AS `DailyRatingAverage`
FROM
times dt
LEFT JOIN
reviewCount rc
ON
dt.timeId = rc.createdOnTimeId
LEFT JOIN
ratingCount rac ON dt.timeId = rac.createdOnTimeId
JOIN
(SELECT @cumulativeCount:=0, @dailyAverageWithCarry:=0) tmp
WHERE
dt.date <= @EndDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY
dt.timeId
ORDER BY
dt.timeId
) AS subquery
WHERE
subquery.date >= @StartDate;
Hope this helps....

MySQL - select distinct values against a range of columns

The following table will store exchange rates between various currencies over time:
CREATE TABLE `currency_exchange` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`currency_from` int(11) DEFAULT NULL,
`currency_to` int(11) DEFAULT NULL,
`rate_in` decimal(12,4) DEFAULT NULL,
`rate_out` decimal(12,4) DEFAULT NULL,
`exchange_date` datetime DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
How would I query it to fetch a list of the most recent exchange rates?
It'd essentially be identifying the combined currency_from and currency_to columns as a distinct exchange rate, then finding the one with the most recent date.
For example, let's say I've got the data as:
INSERT INTO currency_exchange (id, currency_from, currency_to, rate_in, rate_out, exchange_date) VALUES
(1, 1, 2, 0.234, 1.789, '2012-07-23 09:24:34'),
(2, 2, 1, 0.234, 1.789, '2012-07-24 09:24:34'),
(3, 2, 1, 0.234, 1.789, '2012-07-24 09:24:35'),
(4, 1, 3, 0.234, 1.789, '2012-07-24 09:24:34');
I'd want it to select row ID's:
1 - as the most recent rate between currencies 1 and 2
3 - as the most recent rate between currencies 2 and 1
4 - as the most recent rate between currencies 1 and 3
The following query should work:
SELECT ce.*
FROM currency_exchange ce
LEFT JOIN currency_exchange newer
ON (newer.currency_from = ce.currency_from
AND newer.currency_to = ce.currency_to
AND newer.exchange_date > ce.exchange_date)
WHERE newer.id IS NULL
The trick of doing a self-LEFT JOIN is to avoid resorting to a subquery that may be very expensive if you have large datasets. It's essentially looking for records where no "newer" record exists.
Alternatively, you could go for a simpler query (although it may (or may not, as noted in comments) be slower):
SELECT *
FROM currency_exchange ce
NATURAL JOIN (
SELECT currency_from, currency_to, MAX(exchange_date) AS exchange_date
FROM currency_exchange
GROUP BY currency_from, currency_to
) AS most_recent
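Whichever form you use, a compound index covering the currency pair plus the date (my suggestion; not part of either answer here) should keep these lookups fast:
ALTER TABLE currency_exchange
  ADD INDEX idx_pair_date (currency_from, currency_to, exchange_date);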
Insert the values of $currency_from and $currency_to into your dynamic query.
The query below will return the row nearest to the current time:
SELECT id FROM currency_exchange WHERE currency_from='$currency_from' AND currency_to='$currency_to' ORDER BY ABS( DATEDIFF( exchange_date, now() ) ) LIMIT 1