How should I batch query an append-only table in MySQL?

Suppose I have an append-only table:
CREATE TABLE IF NOT EXISTS `states` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
`start_date` date DEFAULT NULL,
`end_date` date DEFAULT NULL,
`person_id` int(10) unsigned default NULL,
PRIMARY KEY (`id`)
);
There is an index on name and another on person_id (person_id is a fkey reference to another table).
For each name, we store a mapping to person_id for a given date range. The mapping from name -> person_id is many to one (this is a contrived example, but think of it as storing how a person could change their name). We never want to delete history so when altering the mapping, we insert a new entry. The last entry for a given name is the source of truth. We end up wanting to ask two different types of questions on the dataset, for which I have some general questions.
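For example (with contrived values), a remapping would be recorded like this; the row with the highest id for a given name is the current mapping:
INSERT INTO states (name, start_date, end_date, person_id)
VALUES ('alice', '2020-01-01', NULL, 1),   -- alice initially maps to person 1
       ('bob',   '2020-01-01', NULL, 3),
       ('alice', '2021-06-01', NULL, 2);   -- alice remapped to person 2; this row wins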
What is the current mapping for a given name/list of names?
If there is only one name, the most straightforward query is:
select * from states where name = 'name' ORDER BY `id` DESC LIMIT 1;
If there is more than one name, the best way I could figure out is to do:
select * from states as a
left join states as b on a.name = b.name and a.id < b.id
where isnull(b.id);
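Concretely, for a list of names I would presumably add the names to the outer filter, something like:
select a.* from states as a
left join states as b on a.name = b.name and a.id < b.id
where isnull(b.id) and a.name in ('name1', 'name2', 'name3');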
Is this actually the best way to batch query? For a batch of 1, how much worse would the second query be than the first? Using explain, I can tell we end up doing two index lookups instead of 1. Given we care a lot about the performance of this individual lookup, my gut is to run different queries depending on the number of names we are querying for. I'd prefer if there was a way to defer to mysql's optimizer though. Is there a way to write this query so mysql figures out what to do for me?
What are the current mappings that map to person_id / a list of person_ids?
The way I would query for that is:
select * from states as a
left join states as b on a.name = b.name and a.id < b.id
where isnull(b.id) and a.person_id in (person_id_list)
I am slightly concerned about the performance for small lists though because my understanding of how mysql works is limited. Using explain, I know that mysql filters by person_id via the index on a before filtering by isnull(b.id). But does it do this before the join or after the join? Could we end up wasting a lot of time joining these two tables? How could I figure this out in general?

The code in (1) is "groupwise-max", but done in a very inefficient way. (Follow the tag I added for more discussion.)
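One common groupwise-max formulation (just a sketch; the tag covers more variants) joins against the per-name MAX(id) instead of self-joining on an inequality:
SELECT s.*
FROM states AS s
JOIN ( SELECT name, MAX(id) AS max_id
       FROM states
       WHERE name IN ('name1', 'name2', 'name3')
       GROUP BY name
     ) AS latest ON latest.max_id = s.id;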
May I suggest you have two tables: one that is append-only, like you have now; call this table History. Then have another table called Current. When you add a new entry, INSERT into History and REPLACE INTO Current.
If you do take this approach, consider what differences you might have in the two tables. The PRIMARY KEY will certainly be different; other indexes may be different, and even some columns may be different.
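A minimal sketch of that layout (names and columns are only suggestions, adapted from the states table above):
CREATE TABLE History (              -- append-only audit trail
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name       VARCHAR(255) NOT NULL,
    start_date DATE DEFAULT NULL,
    end_date   DATE DEFAULT NULL,
    person_id  INT UNSIGNED DEFAULT NULL,
    PRIMARY KEY (id),
    INDEX (name),
    INDEX (person_id)
);

CREATE TABLE Current (              -- one row per name: the source of truth
    name      VARCHAR(255) NOT NULL,
    person_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (name),
    INDEX (person_id)
);

-- On every change to a mapping:
INSERT INTO History (name, start_date, end_date, person_id)
       VALUES ('alice', '2021-06-01', NULL, 2);
REPLACE INTO Current (name, person_id) VALUES ('alice', 2);
Both lookup patterns in the question then become single index probes: SELECT * FROM Current WHERE name IN (...) and SELECT * FROM Current WHERE person_id IN (...).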

Related

mysql index selection on large table

I have a couple of tables that look like this:
CREATE TABLE Entities (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
client_id INT NOT NULL,
display_name VARCHAR(45),
PRIMARY KEY (id)
)
CREATE TABLE Statuses (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
PRIMARY KEY (id)
)
CREATE TABLE EventTypes (
id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(45) NOT NULL,
PRIMARY KEY (id)
)
CREATE TABLE Events (
id INT NOT NULL AUTO_INCREMENT,
entity_id INT NOT NULL,
date DATE NOT NULL,
event_type_id INT NOT NULL,
status_id INT NOT NULL
)
Events is large > 100,000,000 rows
Entities, Statuses and EventTypes are small < 300 rows a piece
I have several indexes on Events, but the ones that come into play are
idx_events_date_ent_status_type (date, entity_id, status_id, event_type_id)
idx_events_date_ent_status_type (entity_id, status_id, event_type_id)
idx_events_date_ent_type (date, entity_id, event_type_id)
I have a large complicated query, but I'm getting the same slow query results with a simpler one like the one below (note, in the real queries, I don't use evt.*)
SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt
JOIN `Entities` ent ON evt.entity_id = ent.id
JOIN `EventTypes` et ON evt.event_type_id = et.id
JOIN `Statuses` s ON evt.status_id = s.id
WHERE
evt.date BETWEEN #start_date AND #end_date AND
evt.entity_id IN ( 19 ) AND -- this in clause is built by code
evt.event_type_id = #type_id
For some reason, mysql keeps choosing the index which doesn't cover Events.date and the query takes 15 seconds or more and returns a couple thousand rows. If I change the query to:
SELECT evt.*, ent.name AS ent_name, s.name AS stat_name, et.name AS type_name
FROM `Events` evt force index (idx_events_date_ent_status_type)
JOIN `Entities` ent ON evt.entity_id = ent.id
JOIN `EventTypes` et ON evt.event_type_id = et.id
JOIN `Statuses` s ON evt.status_id = s.id
WHERE
evt.date BETWEEN #start_date AND #end_date AND
evt.entity_id IN ( 19 ) AND -- this in clause is built by code
evt.event_type_id = #type_id
The query takes .014 seconds.
Since this query is built by code, I would much rather not force the index, but mostly, I want to know why it chooses one index over the other. Is it because of the joins?
To give some stats, there are ~2500 distinct dates, and ~200 entities in the Events table. So I suppose that might be why it chooses the index with all of the low cardinality columns.
Do you think it would help to add date to the end of idx_events_date_ent_status_type? Since this is a large table, it takes a long time to add indexes.
I tried adding an additional index,
ix_events_ent_date_status_et(entity_id, date, status_id, event_type_id)
and it actually made the queries slower.
I will experiment a bit more, but I feel like I'm not sure how the optimizer makes its decisions.
Additional Info:
I tried removing the join to the Statuses table, and mysql switches to idx_events_date_ent_type, and the query runs in 0.045 sec
I can't wrap my head around why removing a join to a table that is not part of the filter impacts the choice of index.
I would add this index:
ALTER TABLE Events ADD INDEX (event_type_id, entity_id, date);
The order of columns is important. Put all column(s) used in equality conditions first. This is event_type_id in this case.
The optimizer can use multiple columns to optimize equalities, if the columns are left-most and consecutive.
Then the optimizer can use one more column to optimize a range condition. A range condition is anything other than = or IS NULL. So range conditions include >, !=, BETWEEN, IN(), LIKE (with no leading wildcard), IS NOT NULL, and so on.
The condition on entity_id is also an equality condition if the IN() list has one element. MySQL's optimizer can treat a list of one value as an equality condition. But if the list has more than one value, it becomes a range condition. So if the example you showed of IN (19) is typical, then all three columns of the index will be used for filtering.
It's still worth putting date in the index, because it can at least tell the InnoDB storage engine to filter rows before returning them. See https://dev.mysql.com/doc/refman/8.0/en/index-condition-pushdown-optimization.html It's not quite as good as a real index lookup, but it's worthwhile.
I would also suggest creating a smaller table to test with. Doing experiments on a 100 million row table is time-consuming. But you do need a table with a non-trivial amount of data, because if you test on an empty table, the optimizer behaves differently.
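For example, one rough way to set up such a test copy (the table name and the sample date window here are made up) and check the plan against the simplified filter:
CREATE TABLE Events_test LIKE Events;
ALTER TABLE Events_test ADD INDEX idx_type_ent_date (event_type_id, entity_id, date);

-- Copy a slice that is big enough to make the optimizer behave realistically.
INSERT INTO Events_test
SELECT * FROM Events
WHERE date BETWEEN '2015-01-01' AND '2015-03-31';

EXPLAIN
SELECT * FROM Events_test
WHERE date BETWEEN '2015-02-01' AND '2015-02-28'
  AND entity_id IN (19)
  AND event_type_id = 4;      -- placeholder for #type_id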
Rearrange your indexes to have columns in this order:
Any column(s) that will be tested with = or IS NULL.
Column(s) tested with IN -- If there is a single value, this will be further optimized to = for you.
One "range" column, such as your date.
Note that nothing after a "range" test will be used by WHERE.
(There are exceptions, but most are not relevant here.)
More discussion: Index Cookbook
Since the tables smell like Data Warehousing, I suggest looking into Summary Tables.
In some cases, long queries on Events can be moved to the summary table(s), where they run much faster. Also, this may eliminate the need for some (or maybe even all) secondary indexes.
Since Events is rather large, I suggest using smaller numbers where practical. INT takes 4 bytes. Speed will improve slightly if you shrink those where appropriate.
When you have INDEX(a,b,c), that index will handle cases that need INDEX(a,b) and INDEX(a). Keep the longer one. (Sometimes the Optimizer picks the shorter index 'erroneously'.)
To most effectively use a composite index on multiple values of two different fields, you need to specify the values with joins instead of simple where conditions. So assuming you are selecting dates from 2022-12-01 to 2022-12-03 and entity_id in (1,2,3), do:
select ...
from (select date('2022-12-01') date union all select date('2022-12-02') union all select date('2022-12-03')) dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date
If you pre-create a dates table with all dates from 0000-01-01 to 9999-12-31, then you can do:
select ...
from dates
join Entities on Entities.id in (1,2,3)
join Events on Events.entity_id=Entities.id and Events.date=dates.date
where dates.date between #start_date and #end_date
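If you go that route, here is one way to build and fill such a dates table (this sketch assumes MySQL 8.0 for the recursive CTE, and the date range is arbitrary):
CREATE TABLE dates (
    `date` DATE NOT NULL,
    PRIMARY KEY (`date`)
);

SET SESSION cte_max_recursion_depth = 100000;   -- the default of 1000 is too low for a long range

INSERT INTO dates (`date`)
WITH RECURSIVE d (`date`) AS (
    SELECT DATE('1990-01-01')
    UNION ALL
    SELECT `date` + INTERVAL 1 DAY FROM d WHERE `date` < '2099-12-31'
)
SELECT `date` FROM d;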

Subqueries and Large Tables. How do I Improve The Speed?

I'm not new to MySQL, but I'm definitely way in over my head here.
I'd like to show a table of differences in temperatures for Panama and Belize based on date and atmospheric level. The query is supposed to match the Panama and Belize data based on date and atmospheric level and return the top 30 differences, ordered by the extent of the differences.
However, it is incredibly slow (over 30s) so it times out. Some other queries that I've written for this dataset are also very slow (about 26s). But if I only run the subqueries, they take only 1.7s or so. I should note that both of the tables below are over 440,000 rows long, though I don't think that's very large. The problem is probably the way that I'm joining the tables or the way that I'm creating the subqueries.
Here's my setup: (It's the SQL from the exported tables. I'm omitting some columns)
/**The table for Panama weather data */
CREATE TABLE `panama_weather_data` (
`Id` varchar(40) NOT NULL,
`OwmPackageId` varchar(30) NOT NULL,
`Level` FLOAT DEFAULT NULL,
`Dt` date DEFAULT NULL,
`Temperature` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `panama_weather_data`
ADD PRIMARY KEY (`Id`) USING BTREE;
COMMIT;
/**The table for Belize weather data*/
CREATE TABLE `belize_weather_data` (
`Id` varchar(40) NOT NULL,
`OwmPackageId` varchar(30) NOT NULL,
`Level` FLOAT DEFAULT NULL,
`Dt` date DEFAULT NULL,
`Temperature` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `belize_weather_data`
ADD PRIMARY KEY (`Id`) USING BTREE;
COMMIT;
/**Code to populate the tables here*/
And here's my query:
SELECT ABS(PanamaTemperature-BelizeTemperature) AS TemperatureDif,
PanamaAtmostphericLevel, PanamaTable.Dt
FROM
(SELECT CAST(panama_weather_data.Dt AS DATETIME) AS Dt,
panama_weather_data.Level AS PanamaAtmostphericLevel,
panama_weather_data.Temperature AS PanamaTemperature
FROM panama_weather_data
WHERE panama_weather_data.OwmPackageId = 'openweathermappkg19758' )
AS PanamaTable
JOIN
(SELECT CAST(belize_weather_data.Dt AS DATETIME) AS Dt,
belize_weather_data.Level AS BelizeAtmosphericLevel,
belize_weather_data.Temperature AS BelizeTemperature
FROM belize_weather_data
WHERE belize_weather_data.OwmPackageId = 'openweathermappkg19758' )
AS BelizeTable
ON PanamaAtmostphericLevel = BelizeAtmosphericLevel
AND PanamaTable.Dt = BelizeTable.Dt
ORDER BY TemperatureDif
LIMIT 30
My question is really: Is there any way to optimize this query and make it less painful?
CAST(panama_weather_data.Dt AS DATETIME) AS Dt
Why? (all this will do is slow down the query)
Is there any way to optimize this query
The SQL SELECT statement you have shown us certainly would not be my starting point. However you did not tell us how you intend to query the data in future. Specifically, are you really going to examine all of the data each time you run a query?
Your biggest win comes from not keeping the data in separate tables - it should be a single table with different attributes for the two datasets.
After that, the next biggest improvement would come from storing the temperature difference in the table and indexing it.
A way to increase speed drastically in SQL databases is to use indices. This is a tradeoff between disk space and query performance.
To find out where to put indices, search for the conditions that limit your result sets the most. In your case, you probably have a few hundred thousand rows for both tables, but you only want 30 of those, whose Atmospheric Levels and date are equal. You probably want to put an index on those two columns like so:
CREATE INDEX level_date_panama ON panama_weather_data (Level, Dt);
CREATE INDEX level_date_belize ON belize_weather_data (Level, Dt);
Please tell me if this increases your performance.
You could do a few things to possibly improve performance here:
Remove the subqueries.
From what you posted I see no reason why the subqueries are necessary for the join. You could just as easily remove them and rewrite the query using the actual column names in place of the AS aliases.
Input your Dt data as a Datetime
A CAST is not a particularly expensive operator, but does take time to complete. If you are only using these columns as Datetimes, you should be entering them as such and change the column type to a Datetime. You could directly compare these values instead of having to cast them.
Compare Dt as a Date
Going off of (2), if all your Dt values are Dates, casting them to Datetimes won't be doing anything to the value, so just compare on the natural Date type.
Index
If the above is not possible due to outside constraints, create an index based on how you are joining, this would be a column used in your ON clause.
What kind of values are in id? Perhaps you can get rid of id, and use PRIMARY KEY(level, dt)?
Why is level a FLOAT? If they are really "floating" values, then is it realistic for both tables to have the same values? I guess they are feet or meters above sea level? In which case, won't MEDIUMINT UNSIGNED suffice?
Then...
SELECT ABS(p.Temperature - b.Temperature) AS TemperatureDif,
p.Level,
p.Dt
FROM panama_weather_data AS p
JOIN belize_weather_data AS b
USING (OwmPackageId, Level, Dt)
WHERE p.OwmPackageId = 'openweathermappkg19758'
ORDER BY TemperatureDif DESC
LIMIT 30;
You will need
INDEX(OwmPackageId, Level, Dt)
with those columns in any order, and on either (or both) tables.
As already mentioned, no CAST is needed. However, if you need some format other than "2017-08-13 10:04:12", then use DATE_FORMAT(...) in the SELECT clause (not the USING clause).
Rather than having two 'identical' tables, consider having one table with an extra column for which country is involved. This would make it easy to extend to an arbitrary number of locations. The SELECT would need to be a "self join" and the syntax would be slightly different.
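A rough sketch of that single-table layout and the self join (all names and types here are only suggestions, folding in the MEDIUMINT idea from above):
CREATE TABLE weather_data (
    country      VARCHAR(10)        NOT NULL,   -- e.g. 'panama', 'belize'
    OwmPackageId VARCHAR(30)        NOT NULL,
    Level        MEDIUMINT UNSIGNED NOT NULL,
    Dt           DATE               NOT NULL,
    Temperature  FLOAT              DEFAULT NULL,
    PRIMARY KEY (country, OwmPackageId, Level, Dt)
) ENGINE=InnoDB;

SELECT ABS(p.Temperature - b.Temperature) AS TemperatureDif,
       p.Level, p.Dt
FROM weather_data AS p
JOIN weather_data AS b USING (OwmPackageId, Level, Dt)
WHERE p.country = 'panama'
  AND b.country = 'belize'
  AND p.OwmPackageId = 'openweathermappkg19758'
ORDER BY TemperatureDif DESC
LIMIT 30;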

MySQL indexes - what are the best practices according to this table and queries

I have this table (500,000 rows):
CREATE TABLE IF NOT EXISTS `listings` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`type` tinyint(1) NOT NULL DEFAULT '1',
`hash` char(32) NOT NULL,
`source_id` int(10) unsigned NOT NULL,
`link` varchar(255) NOT NULL,
`short_link` varchar(255) NOT NULL,
`cat_id` mediumint(5) NOT NULL,
`title` mediumtext NOT NULL,
`description` mediumtext,
`content` mediumtext,
`images` mediumtext,
`videos` mediumtext,
`views` int(10) unsigned NOT NULL,
`comments` int(11) DEFAULT '0',
`comments_update` int(11) NOT NULL DEFAULT '0',
`editor_id` int(11) NOT NULL DEFAULT '0',
`auther_name` varchar(255) DEFAULT NULL,
`createdby_id` int(10) NOT NULL,
`createdon` int(20) NOT NULL,
`editedby_id` int(10) NOT NULL,
`editedon` int(20) NOT NULL,
`deleted` tinyint(1) NOT NULL,
`deletedon` int(20) NOT NULL,
`deletedby_id` int(10) NOT NULL,
`deletedfor` varchar(255) NOT NULL,
`published` tinyint(1) NOT NULL DEFAULT '1',
`publishedon` int(11) unsigned NOT NULL,
`publishedby_id` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `hash` (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I'm planning to have every query filter by publishedon between x and y (the site only ever shows records from one month).
At the same time, I want the WHERE clause to include published, cat_id, and source_id along with publishedon,
something like this:
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
That query is OK and fast so far without indexing, but when I try to ORDER BY publishedon it becomes too slow, so I added this index:
CREATE INDEX `listings_pcs` ON listings(
`publishedon` DESC,
`published` ,
`cat_id` ,
`source_id`
)
That worked and the ORDER BY publishedon became fast. Now I want to order by views, like this:
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
This query is too slow because of the ORDER BY views DESC.
Then I tried dropping the old index and adding this one:
CREATE INDEX `listings_pcs` ON listings(
`publishedon` DESC,
`published` ,
`cat_id` ,
`source_id`,
`views` DESC
)
It's too slow as well.
What if I use just a single index on publishedon?
What about a single index on (cat_id, source_id, views, publishedon)?
I can change how the query filters (for example, the one-month publishedon range) if some other indexing approach depends on different columns.
What about an index on (cat_id, source_id, publishedon, published)? Though in some cases I will filter on source_id only.
What is the best indexing scheme for this table?
This query:
SELECT *
FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) AND
(published = 1) AND
(cat_id in (1,2,3,4,5)) AND
(source_id in (1,2,3,4,5));
Is hard to optimize with only indexes. The best index is one that starts with published and then has the other columns -- it is not clear what their order should be. The reason is that every column except published is compared with something other than =.
Because your performance problem is on a sort, that suggests that lots of rows are being returned. Typically, an index is used to satisfy the WHERE clause before the index can be used for the ORDER BY. That makes this hard to optimize.
Suggestions . . . None are that great:
If you are going to access the data by month, then you might consider partitioning the data by month. That will make the query without the ORDER BY faster, but won't help the ORDER BY.
Try various orders of columns after published in the index. You might find the most selective column(s). But, once again, this speeds the query before the sorting.
Think about ways that you can structure the query to have more equality conditions in the WHERE clause or to return a smaller set of data.
(Not really recommended) Put an index on published and the ordering column. Then use a subquery to fetch the data. Put the inequality conditions (IN and so on) in the outer query. The subquery will use the index for sorting and then filter the results.
The reason the last is not recommended is because SQL (and MySQL) do not guarantee the ordering of results from a subquery. However, because MySQL materializes subqueries, the results really are in order. I don't like using undocumented side effects, which can change from version to version.
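For reference, the pattern described in that last suggestion would look roughly like this (the index name is made up, and the caveat about relying on subquery ordering still applies):
ALTER TABLE listings ADD INDEX idx_published_views (published, views);

SELECT *
FROM ( SELECT *
       FROM listings
       WHERE published = 1
       ORDER BY views DESC
     ) sorted
WHERE publishedon BETWEEN 1441105258 AND 1443614458
  AND cat_id IN (1,2,3,4,5)
  AND source_id IN (1,2,3,4,5);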
One important general note as to why your query isn't getting any faster despite your attempts is that DESC in an index definition is not supported by MySQL versions before 8.0 (the keyword is parsed but ignored). See this SO thread, and the source from which it comes.
In this case, your largest problem is in the sheer size of your record. If the engine decides it wouldn't really be faster to use an index, then it won't.
You have a few options, and all are actually pretty decent and can probably help you see significant improvement.
A note on SQL
First, I want to make a quick note about indexing in SQL. While I don't think it's the solution for your woes, it was your main question, and can help.
It usually helps me to think about indexing in three different buckets. The absolutely, the maybe, and the never. You certainly don't have anything in your indexing that's in the never column, but there are some I would consider "maybe" indexes.
absolutely: This is your primary key and any foreign keys. It is also any key you will reference on a very regular basis to pull a small set of data from the massive data you have.
maybe: These are columns which, while you may reference them regularly, are not really referenced by themselves. In fact, through analysis and using EXPLAIN as @Machavity recommends in his answer, you may find that by the time these columns are used to strip out rows, there aren't that many rows left anyway. An example of a column that would solidly be in this pile for me would be the published column. Keep in mind that every INDEX adds to the work your writes (and the optimizer) need to do.
Also: Composite keys are a good choice when you're regularly searching for data based on two different columns. More on that later.
Options, options, options...
There are a number of options to consider, and each one has some drawbacks. Ultimately I would consider each of these on a case-by-case basis as I don't see any of these to be a silver bullet. Ideally, you'd test a few different solutions against your current setting and see which one runs the fastest using a nice scientific test.
Split your SQL table into two or more separate tables.
This is one of the few times where, despite the number of columns in your table, I wouldn't rush to try to split your table into smaller chunks. If you decided to split it into smaller chunks, however, I'd argue that your [action]edon, [action]edby_id, and [action]ed could easily be put into another table, actions:
+-----------+-------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+-------------+------+-----+-------------------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| listing_id | int(11) | NO | | NULL | |
| action | varchar(45) | NO | | NULL | |
| date | datetime | NO | | CURRENT_TIMESTAMP | |
| user_id | int(11) | NO | | NULL | |
+-----------+-------------+------+-----+-------------------+----------------+
The downside to this is that it does not allow you to ensure there is only one creation date without a TRIGGER. The upside is that you don't have to sort as many columns with as many indexes when you're sorting by date. Also, it allows you to sort not only by created, but also by all of your other actions.
Edit: As requested, here is a sample sorting query
SELECT * FROM listings
INNER JOIN actions ON actions.listing_id = listings.id
WHERE (actions.action = 'published')
AND (listings.published = 1)
AND (listings.cat_id in(1,2,3,4,5))
AND (listings.source_id in(1,2,3,4,5))
AND (actions.date BETWEEN FROM_UNIXTIME(1441105258) AND FROM_UNIXTIME(1443614458))
ORDER BY listings.views DESC
Theoretically, it should cut down on the number of rows you're sorting against because it's only pulling relevant data. I don't have a dataset like yours so I can't test it right now!
If you put a composite key on actions.date and actions.listing_id, this should help to increase speed.
As I said, I don't think this is the best solution for you right now because I'm not convinced it's going to give you the maximum optimization. This leads me to my next suggestion:
Create a month field
I used this nifty tool to confirm what I thought I understood of your question: You are sorting by month here. Your example is specifically looking between September 1st and September 30th, inclusive.
So another option is for you to split your integer timestamp into month, day, and year fields. You can still have your timestamp, but timestamps aren't all that great for searching. Run an EXPLAIN on even a simple query and you'll see for yourself.
That way, you can just index the month and year fields and do a query like this:
SELECT * FROM listings
WHERE (publishedmonth = 9)
AND (publishedyear = 2015)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
Slap an EXPLAIN in front and you should see massive improvements.
Because you're planning on filtering by both month and year, you may want to add a composite key on month and year, rather than a key on each separately, for added gains.
Note: I want to be clear, this is not the "correct" way to do things. It is convenient, but denormalized. If you want the correct way to do things, you'd adapt something like this link but I think that would require you to seriously reconsider your table, and I haven't tried anything like this, having lacked the need, and, frankly, will, to brush up on my geometry. I think it's a little overkill for what you're trying to do.
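For completeness, a minimal way to add and backfill those two fields on the existing table (the index name is made up; new rows would have to set the fields on insert):
ALTER TABLE listings
    ADD COLUMN publishedyear  SMALLINT NOT NULL DEFAULT 0,
    ADD COLUMN publishedmonth TINYINT  NOT NULL DEFAULT 0,
    ADD INDEX  idx_year_month (publishedyear, publishedmonth);

-- One-time backfill from the existing integer timestamp.
UPDATE listings
SET publishedyear  = YEAR(FROM_UNIXTIME(publishedon)),
    publishedmonth = MONTH(FROM_UNIXTIME(publishedon));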
Do your heavy sorting elsewhere
This was hard for me to come to terms with because I like to do things the "SQL" way wherever possible, but that is not always the best solution. Heavy computing, for example, is best done using your programming language, leaving SQL to handle relationships.
The former CTO of Digg sorted using PHP instead of MySQL and received a 4,000% performance increase. You're probably not scaling out to this level, of course, so the performance trade-offs won't be clearcut unless you test it out yourself. Still, the concept is sound: the database is the bottleneck, and computer memory is dirt cheap by comparison.
There are doubtless a lot more tweaks that can be done. Each of these has a drawback and requires some investment. The best answer is to test two or more of these and see which one helps you get the most improvement.
If I were you, I'd at least INDEX the fields in question individually. You're building multi-column indices but it's clear you're pulling a lot of disparate records as well. Having the columns indexed individually can't hurt.
Something you should do is use EXPLAIN which lets you look under the hood of how MySQL is pulling the data. It could further point to what is slowing your query down.
EXPLAIN SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
The rows of your table are enormous (all those mediumtext columns), so sorting SELECT * is going to have a lot of overhead. That's a simple reality of your schema design. SELECT * is generally considered harmful to performance. If you can enumerate the columns you need, and you can leave out some of the big ones, you'll get better performance.
You showed us a query with the following filter criteria
single-value equality on published.
range matching on publishedon.
set matching on cat_id
set matching on source_id.
Ordering on views.
Due to the way MySQL indexing works on MyISAM, the following compound covering index will probably serve you well. It's hard to be sure unless you try it.
CREATE INDEX listings_x_pub_date_cover ON listings(
published, publishedon, cat_id, source_id, views, id )
To satisfy your query the MySQL engine will random-access the index at the appropriate value of published, and then at the beginning of the publishedon range. It will then scan through the index filtering on the other two filtering criteria. Finally, it sorts and uses the id value to look up each row that passes the filter. Give it a try.
If that performance isn't good enough try this so-called deferred join operation.
SELECT a.*
FROM listings a
JOIN ( SELECT id, views
FROM listings
WHERE published = 1
AND publishedon BETWEEN 1441105258
AND 1443614458
AND cat_id IN (1,2,3,4,5)
AND source_id IN (1,2,3,4,5)
ORDER BY views DESC
) b ON a.id = b.id
ORDER BY b.views DESC
This does the heavy lifting of ordering with just the id and views columns without having to shuffle all those massive text columns. It may or may not help, because the ordering has to be repeated in the outer query. This kind of thing DEFINITELY helps when you have ORDER BY ... LIMIT n pattern in your query, but you don't.
Finally, considering the size of these rows, you may get best performance by doing this inner query from your php program:
SELECT id
FROM listings
WHERE published = 1
AND publishedon BETWEEN 1441105258
AND 1443614458
AND cat_id IN (1,2,3,4,5)
AND source_id IN (1,2,3,4,5)
ORDER BY views DESC
and then fetching the full rows of the table one-by-one using these id values in an inner loop. (This query that fetches just id values should be quite fast with the help of the index I mentioned.) The inner loop solution would be ugly, but if your text columns are really big (each mediumtext column can hold up to 16MiB) it's probably your best bet.
tl;dr. Create the index mentioned. Get rid of SELECT * if you possibly can, giving a list of columns you need instead. Try the deferred join query. If it's still not good enough try the nested query.

Right way to apply INDEX to foreign keys

I have a table with 2 foreign keys. I'm somewhat new to MySQL; can someone tell me the right way to apply an INDEX to tables?
# Sample 1
CREATE TABLE IF NOT EXISTS `my_table` (
`topic_id` INT UNSIGNED NOT NULL ,
`course_id` INT UNSIGNED NOT NULL ,
PRIMARY KEY (`topic_id`, `course_id`) ,
INDEX `topic_id_idx` (`topic_id` ASC) ,
INDEX `course_id_idx` (`course_id` ASC) )
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_general_ci;
# Sample 2
CREATE TABLE IF NOT EXISTS `my_table` (
`topic_id` INT UNSIGNED NOT NULL ,
`course_id` INT UNSIGNED NOT NULL ,
PRIMARY KEY (`topic_id`, `course_id`) ,
INDEX `topic_id_idx` (`topic_id`, `course_id`) )
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
COLLATE = utf8_general_ci;
I guess what I'm really asking is what's the difference between defining both as separate indexes and the other as combined?
The reason why you might want one of these over the other has to do with how you plan on querying the data. Getting this determination right can be a bit of trick.
Think of the combined key in terms of, for example, looking up a student's folder in a filing cabinet, first by the student's last name, and then by their first name.
Now, in the case of the two single indexes in your example, you could imagine, in the student example, having two different sets of organized folders, one with every first name in order, and another with every last name in order. In this case, you'll always have to work through the greatest number of similar records, but that doesn't matter so much if you only have one name or the other anyway. In such a case, this arrangement gives you the greatest flexibility while still only maintaining indexes over two columns.
In contrast, if given both first and last name, it's a lot easier for us as humans to look up a student first by last name, then by first name (within a smaller set of potentials). However, when the last name is not known, it becomes very difficult to find the student by first name alone, because students with the same first name are potentially interleaved with every variation of last name (table scan). This is all true for the algorithms the computer uses to look up the information too.
So, as a rule of thumb, add the extra key to a single index if you are going to be filtering the data by both values at once. If at times you will have one and not the other, make sure whichever value that is, it's the leftmost key in the index. If the value could be either, you'll probably want both indexes (one of these could actually have both keys for the best of both worlds, but even that comes at a cost in terms of writes). Getting this stuff right can be pretty important, as this often amounts to an all-or-nothing game. If all the data the dbms requires to perform the indexed lookup isn't present, it will probably resort to a table scan. MySQL's EXPLAIN feature is one tool which can be helpful in checking your configuration and identifying optimizations.
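To make the leftmost-key rule concrete with your table (the values are arbitrary):
-- These can use the composite PRIMARY KEY (topic_id, course_id) via its leftmost prefix:
SELECT * FROM my_table WHERE topic_id = 7;
SELECT * FROM my_table WHERE topic_id = 7 AND course_id = 42;

-- This cannot seek on an index that starts with topic_id; it needs its own index on course_id:
SELECT * FROM my_table WHERE course_id = 42;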
If you create an index using only one key, then searches can only use that key to find the data.
INDEX `topic_id_idx` (`topic_id` ASC) ,
INDEX `course_id_idx` (`course_id` ASC)
In this situation, topic_id and course_id are searched separately. But if you combine them into one index, they are searched together.
For example, if you have some data as follows:
topic_id   course_id
--------   ---------
abc        1
pqr        2
abc        3
If you want to search for abc - 3: with separate indexes, the two columns are searched separately and the results combined to find the row.
But if you combine them into one index, abc + 3 is looked up directly.

Proper Indexing/Optimization of a MySQL GROUP BY and JOIN Query

I've done a lot of reading and Googling on this and I cannot find any satisfactory answer so I'd appreciate any help. Most answers I find come close to my situation but do not address it (and attempting to follow the solutions has not done me any good).
See Edit #2 below for the best example
[This was the original question but is not a great representation of what I'm asking.]
Say I have 2 tables, each with 4 columns:
key (int, auto increment)
c1 (a date)
c2 (a varchar of length 3)
c3 (also a varchar of length 3)
And I want to perform the following query:
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.c1, t.c2
Both key fields are indexed as primary keys. I want to get the number of rows returned in each grouping of c1, c2.
When I explain this query I get "using temporary; using filesort". The actual table I'm performing this query on is over 500,000 rows, so that means it's a time consuming query.
So my question is (assuming I'm not doing anything wrong in the query): is there a way to index this table to eliminate the temporary/filesort usage?
Thanks in advance for any help.
Edit
Here is the table definition (in this example both tables are identical - in reality they're not but I'm not sure it makes a difference at this point):
CREATE TABLE `test1` (
`key` int(11) NOT NULL auto_increment,
`c1` date NOT NULL,
`c2` varchar(3) NOT NULL,
`c3` varchar(3) NOT NULL,
PRIMARY KEY (`key`),
UNIQUE KEY `c1` (`c1`,`c2`),
UNIQUE KEY `c2_2` (`c2`,`c1`),
KEY `c2` (`c2`,`c3`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8
Full EXPLAIN statement:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t ALL NULL NULL NULL NULL 2 Using temporary; Using filesort
1 SIMPLE t2 eq_ref PRIMARY PRIMARY 4 tracking.t.key 1 Using index
This is just for my sample tables. In my real tables the rows for t says 500,000+ (every row in the table, though that could be related to something else).
Edit #2
Here is a more concrete example to better explain my situation.
Let's say I have data on Little League baseball games. I have two tables. One holds data on the games:
CREATE TABLE `ex_games` (
`game_id` int(11) NOT NULL auto_increment,
`home_team` int(11) NOT NULL,
`date` date NOT NULL,
PRIMARY KEY (`game_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
The other holds data on the at bats in each game:
CREATE TABLE `ex_atbats` (
`ab_id` int(11) NOT NULL auto_increment,
`game` int(11) NOT NULL,
`team` int(11) NOT NULL,
`player` int(11) NOT NULL,
`result` tinyint(1) NOT NULL,
PRIMARY KEY (`ab_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
So I have two questions. Let's start with the simple version: I want to return a list of games with a count of how many at bats are in each game. So I think I would do something like this:
SELECT date, home_team, COUNT(h.ab_id) FROM `ex_atbats` h
LEFT JOIN ex_games g ON g.game_id = h.game
GROUP BY g.game_id
This query uses filesort/temporary. Is there a better way to structure this or to index the tables to get rid of that?
Then, the trickier part: say I now want to not only include a count of the number of at bats, but also include a count of the number of at bats that were preceded by an at bat with the same result by the same team. I assume that would be something like:
SELECT g.date, g.home_team, COUNT(ab.ab_id), COUNT(ab2.ab_id) FROM `ex_atbats` ab
LEFT JOIN ex_games g ON g.game_id = ab.game
LEFT JOIN ex_atbats ab2 ON ab2.ab_id = ab.ab_id - 1 AND ab2.result = ab.result
GROUP BY g.game_id
Is that the correct way to structure that query? This also uses filesort/temporary.
So what is the optimal way to go about accomplishing these tasks?
Thanks again.
The phrases Using temporary and Using filesort usually are not related to the indexes used in the JOIN operation. There are numerous examples where you can have all indexes set (they show up in the key and key_len columns in EXPLAIN) but you still get Using temporary and Using filesort.
Check out what the manual says about Using temporary and Using filesort:
How MySQL Uses Internal Temporary Tables
ORDER BY Optimization
Having a combined index for all columns used in GROUP BY clause may help to get rid of Using filesort in certain circumstances. If you also issue ORDER BY you may need to add more complex indexes.
If you have a huge dataset consider partitioning it using some criteria like date or timestamp by means of actual partitioning or a simple WHERE clause.
First of all, the tables' definitions do matter. It's one thing to join using two primary keys, another to join using a primary key from one side and a non-unique key in the other, etc. It also matters what type of engine the tables use as InnoDB treats Primary Keys differently than MyISAM engine.
What I notice though is that on table test1, the (c1,c2) combination is Unique and the fields are not nullable. This allows your query to be rewritten as:
SELECT t.c1, t.c2, COUNT(*)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.key
It will give the same results while using the same field for the JOIN and the GROUP BY. Note that MySQL allows you to use in the SELECT list fields that are not in the GROUP BY list, without having aggregate functions on them. This is not allowed in most other systems and is seen as a bug by some. In this situation though it is a very nice feature. Every row can be either identified by (key) or (c1,c2), so it shouldn't matter which of the two is used for the grouping.
Another thing to note is that when you use LEFT JOIN, it's common to use the joining column from the right side for the counting: COUNT(t2.key) and not COUNT(*). Your original query will give 1 in that column for records in test1 that do not match any record in test2, because it counts rows, while you probably want to count the related records in test2 - and show 0 in those cases.
So, try this query and post the EXPLAIN:
SELECT t.c1, t.c2, COUNT(t2.key)
FROM test1 t
LEFT JOIN test2 t2 ON t2.key = t.key
GROUP BY t.key
The indexes help with the join, but you still need to do a full sort in order to do the group by. Essentially, it still has to process every record in the set.
Adding a where clause and limiting the set would run faster, of course. It just won't get you the results you want.
There may be other options than doing a group by on the entire table. I notice you're doing a SELECT * - What are you trying to get out of the query?
SELECT DISTINCT c1, c2
FROM test t
LEFT JOIN test2 t2 ON t2.key = t.key
may run faster, for instance. (I realize this was just a sample query, but understand that it's hard to optimize when you don't know what the end goal is!)
EDIT - In doing some reading (http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html), I learned that, under the correct circumstances, indexes can help significantly with the group by.
What I'm seeing is that it needs to be a sorted index (like BTREE), not a HASH. Perhaps:
CREATE INDEX c1c2 ON test1 (c1, c2) USING BTREE;
might help.
For InnoDB it will work, as the secondary index carries your primary key by default. For MyISAM you have to make `key` the last column of your index. That gives the optimizer all keys in the same order and it can skip the sort. You cannot do any range queries on the index prefix then; that puts you right back into filesort. I am currently struggling with a similar problem.