Best mysql indexing practices for queries with sums of fields - mysql

I'm looking to improve the query performance of the following mysql query:
SELECT * FROM items
WHERE items.createdAt > ?
AND items.createdAt + items.duration < ?
What are the best indexes to use here? Should I have indexes on both (createdAt) and (createdAt, duration)?
Thanks!

Create one index on createdAt.
Then create a generated column (also called a computed column) for createdAt + duration, and create an index on that generated column.
https://dev.mysql.com/doc/refman/8.0/en/create-table-secondary-indexes.html
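A minimal sketch of that approach in MySQL 8.0, assuming duration holds a number of seconds (the endsAt column name and index name below are made up for illustration):
ALTER TABLE items
  ADD COLUMN endsAt DATETIME
    GENERATED ALWAYS AS (createdAt + INTERVAL duration SECOND) STORED;
CREATE INDEX idx_items_endsAt ON items (endsAt);

-- The range condition can then be written against the generated column:
SELECT * FROM items
WHERE createdAt > ?
  AND endsAt < ?;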

You can't easily solve the problem. Indexes are one-dimensional, and you need a 2D index. SPATIAL indexes provide that, but it would be quite contorted to re-frame your data into such a form. PARTITION sort of gives an extra dimension, but, again, I don't see that as viable here.
So, the best you can do is
INDEX(createdAt)
Or perhaps this is a little better:
INDEX(createdAt, duration)
I don't see that a "generated" column will solve the problem; however, it may sometimes run faster, for this simple reason:
If createdAt > ? filters out most of the table because the ? is near the end of the range, then INDEX(createdAt) is a good index.
When that ? is near the beginning of time, such an index is essentially useless, because most of the table needs to be scanned.
An index on a generated createdAt + duration column has the same problem, just in the other direction.
If the real query has an ORDER BY and LIMIT, there may be other tricks to play. I discuss that when trying to map IP-addresses to countries or businesses: http://mysql.rjweb.org/doc.php/ipranges


Should I use timestamp or separated time columns in large EAV table?

I have an extremely long table using the EAV model.
My columns are typically ID, Timestamp, Value.
I currently have an index on ID and Timestamp to increase performance in my queries, but it still seems slow.
What happens if I split the timestamp into separate integer fields and create an index on those fields? Something like this:
Year(Int), Month(Int), Day(Int), Time(TimeStamp), ID, Value.
Does it increase performance?
Today I'm using two kinds of databases, MySQL and PostgreSQL, but I have the same question for both.
That's definitely the wrong direction. The new index won't increase performance. Additionally, some of your queries may cause you trouble. Think, e.g., about the condition
where tstamp between '2015-11-22' and '2016-02-03'
and try to write it so it can use the new index(es).
(At least in MySQL...)
Do not split up a datetime.
Do not use partitioning.
Do not use multiple tables (one per month / user / etc).
Do think about alternatives to EAV; that model is inherently bad. (I added a tag; look at other discussions.)
Do use appropriate 'composite' indexes (see the sketch below).
Do provide SHOW CREATE TABLE and some SELECTs so we can help you more specifically.
Do use this pattern for date ranges:
WHERE dt >= '2017-02-26'
AND dt < '2017-02-26' + INTERVAL 7 DAY
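For instance, a sketch combining the composite-index and date-range advice above (MySQL syntax; the table name eav_data, the index name, and the ID value are made up here, while ID, tstamp, and Value come from the question and the condition above):
CREATE INDEX idx_id_tstamp ON eav_data (ID, tstamp);

SELECT Value
FROM eav_data
WHERE ID = 42
  AND tstamp >= '2017-02-26'
  AND tstamp <  '2017-02-26' + INTERVAL 7 DAY;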

Should I avoid ORDER BY in queries for large tables?

In our application, we have a page that displays a set of data to the user (a part of it, actually). It also allows the user to order it by a custom field. So in the end it all comes down to a query like this:
SELECT name, info, description FROM mytable
WHERE active = 1 -- Some filtering by indexed column
ORDER BY name LIMIT 0,50; -- Just a part of it
And this worked just fine as long as the table was relatively small (it was used only locally in our department). But now we have to scale this application. Let's assume the table has about a million records (we expect that to happen soon). What will happen with the ordering? Do I understand correctly that, in order to do this query, MySQL will have to sort a million records each time and return just a part of them? This seems like a very resource-heavy operation.
My idea is simply to turn off that feature and not let users select their custom ordering (maybe just filtering), so that the order would be the natural one (by id in descending order; I believe indexing can handle that).
Or is there a way to make this query work much faster with ordering?
UPDATE:
Here is what I read from the official MySQL developer page.
In some cases, MySQL cannot use indexes to resolve the ORDER BY, although it still uses indexes to find the rows that match the WHERE clause. These cases include the following:
....
The key used to fetch the rows is not the same as the one used in the ORDER BY:
SELECT * FROM t1 WHERE key2=constant ORDER BY key1;
So yes, it does seem like MySQL will have a problem with such a query. So what do I do - not use an ORDER BY at all?
The 'problem' here seems to be that you have 2 requirements (in the example)
active = 1
order by name LIMIT 0, 50
The former you can easily solve by adding an index on the active field
The latter you can improve by adding an index on name
Since you do both in the same query, you'll need to combine this into an index that lets you resolve the active value quickly and then from there on fetches the first 50 names.
As such, I'd guess that something like this will help you out:
CREATE INDEX idx_test ON myTable (active, name)
(in theory, as always, try before you buy!)
Keep in mind though that there is no such thing as a free lunch; you'll need to consider that adding an index also comes with downsides:
the index will make your INSERT/UPDATE/DELETE statements (slightly) slower, usually the effect is negligible but only testing will show
the index will require extra space in the database; think of it as an additional (hidden) special table sitting next to your actual data. The index will only hold the fields required plus the PK of the originating table, which usually is a lot less data than the entire table, but for 'millions of rows' it can add up.
if your query selects one or more fields that are not part of the index, then the system will have to fetch the matching PK fields from the index first and then go look for the other fields in the actual table by means of the PK. This probably is still (a lot) faster than when not having the index, but keep this in mind when doing something like SELECT * FROM ... : do you really need all the fields?
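As a sketch of that last point, a 'covering' index that contains every selected field (column and table names are taken from the example query above; whether the extra index width is worth it depends on your data):
CREATE INDEX idx_test_covering ON myTable (active, name, info, description);
-- SELECT name, info, description FROM myTable WHERE active = 1 ORDER BY name LIMIT 0,50
-- can now be answered entirely from the index, without touching the base table.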
In the example you use active and name, but from the text I gather these might be 'dynamic', in which case you'd have to foresee all kinds of combinations. From a practical point of view this might not be feasible, as each index comes with the downsides listed above, and each index you add piles those downsides on again (cumulatively).
PS: I use PK for simplicity but in MSSQL it's actually the fields of the clustered index, which USUALLY is the same thing. I'm guessing MySQL works similarly.
Explain your query and check whether it goes for filesort.
If the ORDER BY doesn't get any index, or if the MySQL optimizer prefers to avoid the existing index(es) for sorting, it goes with filesort.
Now, if you're getting filesort, then you should preferably either avoid the ORDER BY or create appropriate index(es).
If the data is small enough, MySQL does the sort in memory; otherwise it goes to disk.
So you may try changing the sort_buffer_size variable as well.
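For example, to check for a filesort (a sketch using the table and columns from the question above):
EXPLAIN SELECT name, info, description FROM mytable
WHERE active = 1 ORDER BY name LIMIT 0,50;
-- If the Extra column shows "Using filesort", MySQL could not use an index for the ORDER BY.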
There are always tradeoffs. One way to improve the performance of an ORDER BY query is to increase the sort buffer size and then run the query:
set sort_buffer_size=100000;
If this size is increased too far, performance will start to decrease again.

Does an index work for a query like 'WHERE created_at > ?'

I am using PostgreSQL and need to run a query like 'WHERE created_at > ?'. I am not sure whether an index is used for such a query.
I did an experiment. After adding an index on the created_at column, I ran EXPLAIN on the following two queries.
1)
EXPLAIN SELECT * FROM categories WHERE created_at > '2014-05-03 21:34:27.427505';
The result is
QUERY PLAN
------------------------------------------------------------------------------------
Seq Scan on categories (cost=0.00..11.75 rows=47 width=528)
Filter: (created_at > '2014-05-03 21:34:27.427505'::timestamp without time zone)
2)
EXPLAIN SELECT * FROM categories WHERE created_at = '2014-05-03 21:34:27.427505';
The result is
QUERY PLAN
---------------------------------------------------------------------------------------------------
Index Scan using index_categories_on_created_at on categories (cost=0.14..8.16 rows=1 width=528)
Index Cond: (created_at = '2014-05-03 21:34:27.427505'::timestamp without time zone)
Note that the first one uses 'Filter' while the second one uses 'Index Cond'; according to the PostgreSQL documentation, the former is just a row-by-row scan while the latter uses the index.
Does this indicate that a query like 'created_at > ?' will not be sped up by adding an index on the 'created_at' column?
Update
I am using Rails 4.0, and according to the console, the index is created by
CREATE INDEX "index_categories_on_created_at" ON "categories" ("created_at")
Indexes on timestamps are normally responsive to range queries, that is, >, <, BETWEEN, <=, etc. However, as univerio points out, selectivity and cost estimation play a strong role.
PostgreSQL is only going to use an index if it thinks using the index is going to be faster than not using it (for that matter, it tries to pick the fastest index to use if several are available). How much of the table are those 47 rows it expects to get back from the > query? If the answer is "10% of the table" then Postgres is not going to bother with the index. For that matter, the query planner rarely uses indexes for scans of really small tables, because if your whole table fits on 3 data pages, it's faster to scan the entire table.
You can easily play with this if you want.
1) Use EXPLAIN ANALYZE instead of just EXPLAIN so you can compare what the query planner expected vs. what it actually got.
2) Turn off and on index and table scanning with any of these statements:
SET enable_seqscan = false; --turns off table scans
SET enable_indexscan = false; -- turns off index scans
SET enable_bitmapscan = false; -- turns off bitmap index scans
If you play around, you can see where using an index is actually slower.
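For example, to force the planner away from the sequential scan and compare actual timings (a sketch using the query from the question):
SET enable_seqscan = false;   -- make sequential scans unattractive for this session
EXPLAIN ANALYZE SELECT * FROM categories
WHERE created_at > '2014-05-03 21:34:27.427505';
RESET enable_seqscan;         -- restore the default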
Using an index means reading the index plus reading the selected rows from the table. There is a trade-off in that it can be more efficient simply to read only the table. The algorithms used by a DBMS to choose which is better for any given query are usually pretty good (though not perfect).
It's easily possible (and likely) that not using the index is the better choice for this query.
Using the #Clockwork-Muse and #univerio suggestion for selectivity is generally a good idea, though it might not matter in this case due to table size. You might also use an ORDER BY created_at to see if it affects the plan.
Experimentation (per #FuzzyChef) can help find trade-off points. Use different table sizes and change other variables to see results.

Optimize performance of complex GROUP BY query

Is there a way to optimize the following query? It takes about 11s:
SELECT
concat(UNIX_TIMESTAMP(date), '000') as datetime,
TRUNCATE(SUM(royalty_price*conversion_to_usd*
(CASE WHEN sales_or_return = 'R' THEN -1 ELSE 1 END)*
(CASE WHEN royalty_currency = 'JPY' THEN .80
WHEN royalty_currency in ('AUD', 'NZD') THEN .95 ELSE 1 END) )
,2) as total_in_usd
FROM
sales_raw
GROUP BY
date
ORDER BY
date ASC
Doing an explain I get:
id: 1, select_type: SIMPLE, table: sales_raw, type: index, possible_keys: NULL, key: date, key_len: 5, ref: NULL, rows: 735855, Extra: NULL
This is an answer to the question in the comment. It formats better here:
An example of filtering on an indexed date column looks like this:
where date >= AStartDateVariable
and date < TheDayAfterAnEndDateVariable
If there is no index on the date field, create one.
You may be able to speed this up. You seem to have an index on date. What is happening is that the rows are read in the index, then each row is looked up. If the data is not ordered by the date field, this might not be optimal, because the reads will land on essentially random pages. In the case where the table does not fit into memory, this results in a condition called "page thrashing": a record is needed, its page is read from disk (displacing another page in the memory cache), and the next read probably also results in a cache miss.
To see if this is occurring, I would suggest one of two things. (1) Try removing the index on date or switching the GROUP BY criterion to concat(UNIX_TIMESTAMP(date), '000'). Either of these should remove the index as a factor.
From your additional comment, this is not occurring, although the benefit of the index appears to be on the small side.
(2) You can also expand the index to include all the columns used in the query. Besides date, the index would need to contain royalty_price, conversion_to_usd, sales_or_return, and royalty_currency. This would allow the index to fully satisfy the query, without looking up additional information in the data pages.
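A sketch of such a covering index (the index name is made up; the columns are the ones the query uses):
ALTER TABLE sales_raw
  ADD INDEX idx_date_covering (date, royalty_price, conversion_to_usd, sales_or_return, royalty_currency);
-- With date first, the GROUP BY date can read the index in order, and every column the query references is in the index.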
You can also check with your DBA to be sure that you have a page cache large enough to match your hardware capabilities.
This is a simple GROUP BY query which does not even involve joins. I would expect the problem to lie in the functions you are using.
Please start with a simple query just retrieving date and the sum of conversion_to_usd. Check performance and build up the query step by step, always checking performance. It should not take long to spot the culprit.
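For example, start from something like this and add one expression back at a time (a sketch derived from the original query):
SELECT date, SUM(conversion_to_usd) AS total
FROM sales_raw
GROUP BY date
ORDER BY date ASC;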
Concats are usually slow operations, but I wonder if the TRUNCATE after the SUM might be confusing the optimiser. The second CASE could be replaced by a join with a table of currency codes and their respective factors, but it's not obvious that it makes a big difference. First spot the culprit operation.
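A sketch of that join-based alternative (the currency_adjustment table and its contents are hypothetical; the factors are taken from the CASE above):
CREATE TABLE currency_adjustment (
  currency CHAR(3) PRIMARY KEY,
  factor   DECIMAL(4,2) NOT NULL
);
INSERT INTO currency_adjustment VALUES ('JPY', 0.80), ('AUD', 0.95), ('NZD', 0.95);

SELECT sr.date,
       SUM(sr.royalty_price * sr.conversion_to_usd *
           (CASE WHEN sr.sales_or_return = 'R' THEN -1 ELSE 1 END) *
           COALESCE(ca.factor, 1)) AS total_in_usd
FROM sales_raw sr
LEFT JOIN currency_adjustment ca ON ca.currency = sr.royalty_currency
GROUP BY sr.date;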
You could also store the values already converted to the correct amount, but that introduces denormalisation.

MySQL: a huge table. can't query, even a simple select!

I have a table with about 200,000 records.
It takes a long time to do a simple SELECT query. I am confused because I am running on a 4-core CPU and 4GB of RAM.
How should I write my query?
Or is there anything to do with INDEXING?
Important note: my table is static (its data won't change).
What are your solutions?
PS
1 - my table has a primary key id
2 - my table has a unique key serial
3 - I want to query over the other fields, like where param_12 not like '%I.S%' or where param_13 = '1'
4 - 200,000 is not big, and this is exactly why I am surprised.
5 - I even have a problem when adding a simple field: my question
6 - can I create an INDEX for BOOL fields? (or is it useful?)
PS: and thanks for the answers
7 - my SELECT should return the rows that contain 'I.S', or the ones that do not.
select * from `table` where `param_12` like '%I.S%'
This is all I want. It seems no index helps here, hmm?
Indexing will help. Please post table definition and select query.
Add an index for all "=" columns in the WHERE clause.
Yes, you'll want/need to index this table, and partitioning could be helpful as well. Doing this properly is something you will need to provide more information for. You'll want to use EXPLAIN and look over your queries to determine which columns you should index and how.
Another aspect to consider is whether or not your table is normalized. Normalized tables tend to give better performance due to lowered I/O.
I realize this is vague, but without more information that's about as specific as we can be.
BTW: a table of 200,000 rows is relatively small.
Here is another SO question you may find useful
1 - my table has a primary key id: Not really useful unless you use some scheme which requires a numeric primary key.
2 - my table has a unique key serial: The id is also unique by definition; why not use serial as the primary? This one is automatically indexed because you defined it as unique.
3 - I want to query over the other fields like where param_12 not like '%I.S%' or where param_13 = '1': A LIKE '%something%' query cannot really use an index; is there some way you can split param_12 into, say, a param_12a holding the part before the match and a param_12b starting with 'I.S'? An index can be used on a LIKE if the start of the string is known.
4 - 200,000 is not big and this is exactly why I am surprised: Yep, 200,000 is not that much. But without good indexes, good queries, and/or a sufficient cache size, MySQL will need to read all the data from disk for comparison, which is slow.
5 - I even have a problem when adding a simple field: my question
6 - can I create an INDEX for BOOL fields? Yes you can, but an index which matches half of the time is fairly useless. An index is used to limit, as much as possible, the number of records MySQL has to load fully; if an index does not dramatically limit that number, as is often the case with booleans (in a 50-50 distribution), using the index only requires more disk IO and can slow searching down. So unless you expect something like an 80-20 distribution or better, creating the index will cost time, not win time.
An index on param_13 might be used, but not one on param_12 in this example, since the leading '%' in the LIKE negates the use of an index.
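To illustrate the difference (a sketch; `table` and param_12 come from the question, while param_12b is a hypothetical column holding the part of the value that starts with 'I.S'):
-- Leading wildcard: an index on param_12 cannot be used, so this is a full scan
SELECT * FROM `table` WHERE param_12 LIKE '%I.S%';
-- Known prefix: an index on param_12b can be used
SELECT * FROM `table` WHERE param_12b LIKE 'I.S%';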
If you're querying data with LIKE '%asdasdasd%' then no index can help you. It will have to do a full scan every time. The problem here is the leading % because that means that the substring you are looking for can be anywhere in the field - so it has to check it all.
Possibly you might look into full-text indexing, but depending on your needs that might not be appropriate.
Firstly, ensure your table has a primary key.
To answer in any more detail than that you'll need to provide more information about the structure of the table and the types of queries you are running.
I don't believe that the keys you have will help. You have to index on the columns used in WHERE clauses.
I'd also wonder if the LIKE requires table scans regardless of indexes. The minute you use a pattern like that, you lose the value of the index, because you have to check each and every row.
You're right: 200K isn't a huge table. EXPLAIN will help here. If you see a full table scan, redesign.