I have a table which will hold about 2-5 million rows on average. It has a primary key/index called 'instruction_id' and another indexed field called 'mode'. Now, 'instruction_id' is of course unique since it is the primary key, but 'mode' will only ever be one of three different values. The query I run all the time is
SELECT * FROM tablename WHERE mode = 'value1' ORDER BY instruction_id LIMIT 50
This currently takes about 25 seconds (anything over 1 second is unacceptably long), and there are only 600K rows right now, so it will only get worse as the table grows. Would indexing in a different way help? Would indexing instruction_id and mode together make a difference? Another way around this would be to keep the table naturally ordered by instruction_id so I don't have to ask for the ORDER BY, but I don't know how to do that... Any help would be great.
You should try an index on (mode, instruction_id), in that order.
The reasoning behind that index is that it creates an index structured like this:
mode  instruction_id
A     1
A     3
A     4
A     5
A     10
A     11
B     2
B     8
B     12
B     13
B     14
C     6
C     7
C     9
C     15
C     16
C     17
If you search for mode B, the server can binary-search the index on mode until it finds the first B, then simply output the next n rows in index order. This is really fast: about 22 comparisons for 4M rows.
Always use ORDER BY if you expect the result to be ordered, regardless of how the data is stored. The query engine might choose a query plan that outputs the rows in a different order than the PK order (maybe not in cases as simple as this, but in general).
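For reference, adding such an index could look like this (a sketch; the index name is arbitrary and tablename is taken from the question):
ALTER TABLE tablename ADD INDEX idx_mode_instruction (mode, instruction_id);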
You should check out the following links on InnoDB clustered indexes:
http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
http://www.xaprb.com/blog/2006/07/04/how-to-exploit-mysql-index-optimizations/
MySQL and NoSQL: Help me to choose the right one
Then build your schema something along the lines of:
drop table if exists instruction_modes;
create table instruction_modes
(
mode_id smallint unsigned not null,
instruction_id int unsigned not null,
primary key (mode_id, instruction_id), -- note the clustered composite PK order !
unique key (instruction_id)
)
engine = innodb;
Cold (mysql restarted) runtime performance as follows:
select count(*) from instruction_modes;
+----------+
| count(*) |
+----------+
|  6000000 |
+----------+
1 row in set (2.54 sec)
select distinct mode_id from instruction_modes;
+---------+
| mode_id |
+---------+
|       1 |
|       2 |
|       3 |
+---------+
3 rows in set (0.06 sec)
select * from instruction_modes where mode_id = 2 order by instruction_id limit 10;
+---------+----------------+
| mode_id | instruction_id |
+---------+----------------+
|       2 |              2 |
|       2 |              3 |
|       2 |              4 |
|       2 |              5 |
|       2 |              6 |
|       2 |              9 |
|       2 |             14 |
|       2 |             25 |
|       2 |             28 |
|       2 |             32 |
+---------+----------------+
10 rows in set (0.04 sec)
0.04 seconds cold seems pretty performant.
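To verify where the time goes, EXPLAIN on the same statement should show the composite primary key being used with no filesort (a sketch; exact output varies by version):
EXPLAIN select * from instruction_modes where mode_id = 2 order by instruction_id limit 10;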
Hope this helps :)
Here is one possible solution:
ALTER TABLE `tablename` ADD UNIQUE (`mode`, `instruction_id`);
Then:
SELECT A.* FROM tablename A JOIN (
SELECT instruction_id FROM tablename
WHERE mode = 'value1'
ORDER BY instruction_id LIMIT 50
) B
ON (A.instruction_id = B.instruction_id);
I have found that for large tables this approach works well for speed, since the subquery only needs to touch the index. I use a similar query on a table with >100 million records and it returns results in 1-2 seconds.
Is 'mode' a character field? If it's only ever going to hold 3 possible values, it sounds like you should make it an enum field, which will still return you the text string but is stored internally as a number.
You should also follow Albin's advice on indexing, which will benefit you further.
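For illustration, the conversion might look like this (a sketch; the three ENUM values stand in for whatever 'mode' actually holds):
ALTER TABLE tablename MODIFY mode ENUM('value1', 'value2', 'value3') NOT NULL;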
Related
I have 2 tables, ticket_data and nps_data.
ticket_data holds general IT issue information and nps_data holds user feedback.
A basic idea of the tables are:
ticket_data table.
approx. 1,500,000 rows, 30 fields:
Index on ticket_number, logged_date, logged_team, resolution_date
| ticket_number | logged_date | logged_team | resolution_date |
| I00001        | 2017-01-01  | Help Desk   | 2017-01-02      |
| I00002        | 2017-02-01  | Help Desk   | 2017-03-01      |
| I00010        | 2017-03-04  | desktop sup | 2017-03-04      |
Obviously there are lots of other fields, but this is what I'm working with.
nps_data table
approx. 83,000 rows, 10 fields:
Index on ticket_number
| ticket_number | resolving team | q1_score |
| I00001        | helpdesk       | 5        |
| I00002        | desktop sup    | 0        |
| I00010        | desktop sup    | 10       |
When I run a simple query such as
select a.*, b.q1_score from
(select * from ticket_data
where resolution_date > '2017-01-01') a
left join nps_data b
on a.ticket_number = b.ticket_number
The query takes forever to run, and when I say that, I mean I stop the query after 10 mins.
However, if I run the query to join ticket_data with a table called ticket_details, which has over 1,000,000 rows, using the following query
select *
from (select * from ticket_data
where resolution_date > '2017-01-01') a
left join ticket_details b
on a.ticket_number = b.ticket_number
the query takes about 1.3 seconds to run.
In the query above, the subquery with the alias a is not running on an index: you are filtering on the field resolution_date, which is not indexed.
The simple fix would be to add an index to that field.
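A sketch of that fix (the index name is arbitrary):
CREATE INDEX idx_resolution_date ON ticket_data (resolution_date);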
Ticket number is indexed. This is probably why when you join on that, the query runs faster.
Another way to further optimize this would be not to do select * in your subquery (which is bad practice in a production system anyway); passing every column up through the subquery creates more overhead for the DBMS.
Another way would be a partial index on the column, such as:
create index idx_tickets on ticket_data (ticket_number) where resolution_date > '2017-01-01'
Partial indexes like this exist in some DBMSs (e.g. PostgreSQL) but not in MySQL, and I would only do that if the timestamp of '2017-01-01' is a constant that will always be used.
You could also create a composite index, so the query engine will run an Index Only Scan whereby it pulls the data straight from the index without having to go back to the table.
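For example (a sketch, using MySQL-style syntax; the index-only scan works only if the subquery selects just the indexed columns, which is another reason to avoid select * there):
CREATE INDEX idx_res_ticket ON ticket_data (resolution_date, ticket_number);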
Of course, all of this depends on what type of DBMS you are running; to point you at specific documentation I would need to know which one it is.
I have a MySQL table creatures:
id | name   | base_hp | quantity
--------------------------------
1  | goblin | 5       | 2
2  | elf    | 10      | 1
And I want to create creature_instances based on it:
id | name   | actual_hp
------------------------
1  | goblin | 5
2  | goblin | 5
3  | elf    | 10
The ids of creature_instances are not important and bear no relation to creatures.id.
How can I do this with just MySQL, in the most optimal way in terms of execution time? A single query would be best, but a procedure is OK too. I use InnoDB.
I know that with the help of e.g. PHP I could:
select each row separately,
run a for ($i = 0; $i < $line->quantity; $i++) loop in which I insert one row into creature_instances per iteration.
The most efficient way is to do everything in SQL. It helps if you have a numbers table. Without one, you can generate the numbers in a subquery. The following works up to 4 copies:
insert into creature_instances (name, actual_hp)  -- id left out, assuming it is AUTO_INCREMENT
select c.name, c.base_hp
from creatures c join
     (select 1 as n union all select 2 union all select 3 union all select 4
     ) n
     on n.n <= c.quantity;
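If quantities can ever exceed 4, a small persistent numbers table avoids ever-longer UNION chains (a sketch; extend the values list to at least the largest possible quantity):
create table numbers (n int unsigned not null primary key);
insert into numbers (n) values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
insert into creature_instances (name, actual_hp)
select c.name, c.base_hp
from creatures c
join numbers n on n.n <= c.quantity;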
In my projects I often need to store the result of a SELECT in another table (we call this a "resultset"). The reason is to dynamically display a large number of rows in a web application while loading only small chunks as necessary.
Typically, this is done by queries such as this one:
SET @counter := 0;
INSERT INTO resultsetdata
SELECT "12345", @counter:=@counter+1, a.ID
FROM sometable a
JOIN bigtable b
WHERE (a.foo = b.bar)
ORDER BY a.whatever DESC;
The fixed "12345" value is just a value to identify the "resultset" as a whole and changes for each query. The second column is a incrementing index counter that is meant to allow direct access to a specific row in the result and the ID column references the specific row in the source data table.
When the application needs a certain range of the result I just join resultsetdata with the source table to get the detailed data - which is quick as opposed to the resultsetdata query above which may take 2-3 seconds to complete (which explains why I need this intermediary table).
The SELECT query itself is not relevant for this question.
resultsetdata has the following structure:
CREATE TABLE `resultsetdata` (
`ID` int(11) NOT NULL,
`ContIdx` int(11) NOT NULL,
`Value` int(11) NOT NULL,
PRIMARY KEY (`ID`,`ContIdx`)
) ENGINE=InnoDB;
This usually works like a charm but lately we noticed that in some cases the ORDER of the result is not correct. This depends on the query itself (for example, adding DISTINCT is a typical cause), the server version and the data contained in the source tables, so I guess one can say that the row order is unpredictable with this method. Probably it depends on internal optimizations.
However, the problem is now that I can't think of any alternative solution that gives me the expected result.
Since the resultset can get several thousands of rows, loading all data in memory and then manually INSERTing it is not feasible.
Any suggestions?
EDIT: For further clarification, have a look at these queries:
DROP TABLE IF EXISTS test;
CREATE TABLE test (ID INT NOT NULL, PRIMARY KEY(ID)) ENGINE=InnoDB;
INSERT INTO test (ID) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
SET @counter:=0;
SELECT "12345", @counter:=@counter+1, ID
FROM test
ORDER BY ID DESC;
This produces the following result as "expected":
+-------+----------------------+----+
| 12345 | @counter:=@counter+1 | ID |
+-------+----------------------+----+
| 12345 |                    1 | 10 |
| 12345 |                    2 |  9 |
| 12345 |                    3 |  8 |
| 12345 |                    4 |  7 |
| 12345 |                    5 |  6 |
| 12345 |                    6 |  5 |
| 12345 |                    7 |  4 |
| 12345 |                    8 |  3 |
| 12345 |                    9 |  2 |
| 12345 |                   10 |  1 |
+-------+----------------------+----+
10 rows in set (0.00 sec)
As I said, in some cases (I can't provide a test case here, sorry) this may lead to a result similar to this:
+-------+----------------------+----+
| 12345 | @counter:=@counter+1 | ID |
+-------+----------------------+----+
| 12345 |                   10 | 10 |
| 12345 |                    9 |  9 |
| 12345 |                    8 |  8 |
| 12345 |                    7 |  7 |
| 12345 |                    6 |  6 |
| 12345 |                    5 |  5 |
| 12345 |                    4 |  4 |
| 12345 |                    3 |  3 |
| 12345 |                    2 |  2 |
| 12345 |                    1 |  1 |
+-------+----------------------+----+
I'm not saying this is a MySQL bug and I fully understand that my method currently provides unpredictable results. Still, I don't know how to tweak this to get predictable results.
This is because the order in which records are inserted is unrelated to the order in which they are retrieved.
When you retrieve them, a query plan is created. If no ORDER BY is specified in your SELECT statement, the order depends on the query plan produced. That is why it is unpredictable and why adding DISTINCT can change the order.
The solution is to store enough data that you can retrieve them in the correct order using an ORDER BY clause. In your case you have ordered your data by a.whatever. Can a.whatever be stored in resultsetdata? If so then you can read the records out in the correct order.
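A sketch of that idea (SortKey is a made-up column name, and INT is an assumption; use whatever type a.whatever actually has). ContIdx then no longer encodes the display order, so the reads order by the stored key instead:
ALTER TABLE resultsetdata ADD COLUMN SortKey INT NOT NULL;
SET @counter := 0;
INSERT INTO resultsetdata
SELECT "12345", @counter:=@counter+1, a.ID, a.whatever
FROM sometable a
JOIN bigtable b
WHERE (a.foo = b.bar);
-- read back in a guaranteed order:
SELECT Value FROM resultsetdata WHERE ID = 12345 ORDER BY SortKey DESC;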
Maybe you could wrap the select in another select (note the explicit column list in the outer select, so the counter lands in the ContIdx column of resultsetdata rather than in Value):
SET @counter := 0;
INSERT INTO resultsetdata
SELECT RSID, @counter := @counter + 1, ID
FROM (
    SELECT "12345" AS RSID, a.ID
    FROM sometable a
    JOIN bigtable b
    WHERE a.foo = b.bar
    ORDER BY a.whatever DESC
) AS tmp
... but you are still at the mercy of the dumbness of MySQL's optimizer.
That's all I found about this topic, but I couldn't find a hard guarantee:
Pure-SQL Technique for Auto-Numbering Rows in Result Set
http://www.xaprb.com/blog/2006/12/02/how-to-number-rows-in-mysql/
http://www.xaprb.com/blog/2005/09/27/simulating-the-sql-row_number-function/
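On MySQL 8.0 and later, window functions do give a hard guarantee without user variables — a sketch using the question's tables:
INSERT INTO resultsetdata
SELECT "12345", ROW_NUMBER() OVER (ORDER BY a.whatever DESC), a.ID
FROM sometable a
JOIN bigtable b
WHERE (a.foo = b.bar);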
I have a table with a barcode column with a unique index. The data has been loaded with additional chars (-xx) at the end of each barcode to prevent duplicates, but there will be lots of duplicates once I strip off the suffix. Here is a sample of the data:
itemnumber   barcode
17912        2-14
18082        2-1
21870        2-10
29219        2-8
Then I created two temporary tables, marty and manny, both with the itemnumber and the stripped-down barcodes. So both tables would contain
itemnumber   barcode
17912        2
18082        2
21870        2
29219        2
etc.
Then I tried to delete all but the first entry with barcode '2' in the marty table (and likewise for every other barcode). I hoped to then update the original table with the correct first entry, and the users could fix up the duplicates themselves over time in the application.
So, this was my query to delete all but the first entry in the marty table for each barcode
DELETE FROM marty
WHERE itemnumber NOT IN
(SELECT MIN(itemnumber) FROM manny GROUP BY barcode)
There are 130,000 rows in marty and manny. The query took over 24 hours and then didn't finish properly. The connection to the server crashed and the query did not do all the updates.
Is there a better way to approach this that would not use the subquery, which I think is causing the delay? The GROUP BY is probably slowing things down too with so many records.
Thanks
One more variant: this one deletes the duplicates without any temporary tables:
DELETE m1
FROM marty m1
JOIN marty m2
  ON m1.barcode = m2.barcode
 AND m1.itemnumber > m2.itemnumber;
For each barcode this keeps only the row with the smallest itemnumber: every other row matches some partner with a lower itemnumber and is deleted.
Here is a two-stage approach that avoids use of NOT IN. It also does not use the temporary table "manny". First, join "marty" to itself to pick out rows for which itemnumber != min(itemnumber). Use UPDATE to set barcode for these rows to NULL. A second pass with DELETE then removes all rows that were flagged in the first phase.
For this example, I split the barcode column of "marty" into two columns; it could be done with the table in its original format with some modification (need to split the column values on the fly).
select * from marty;
+------------+---------+---------+
| itemnumber | barcode | subcode |
+------------+---------+---------+
|      17912 |       2 |      14 |
|      18082 |       2 |       1 |
|      21870 |       2 |      10 |
|      29219 |       2 |       8 |
|      30133 |       3 |       5 |
|      30134 |       3 |       7 |
|      30139 |       3 |       9 |
|      30142 |       3 |      12 |
+------------+---------+---------+
8 rows in set (0.00 sec)
UPDATE marty m1
JOIN (SELECT barcode,
             MIN(itemnumber) AS itemnumber
      FROM marty
      GROUP BY barcode) m2
USING (barcode)
SET m1.barcode = NULL
WHERE m1.itemnumber != m2.itemnumber;
select * from marty;
+------------+---------+---------+
| itemnumber | barcode | subcode |
+------------+---------+---------+
|      17912 |       2 |      14 |
|      18082 |    NULL |       1 |
|      21870 |    NULL |      10 |
|      29219 |    NULL |       8 |
|      30133 |       3 |       5 |
|      30134 |    NULL |       7 |
|      30139 |    NULL |       9 |
|      30142 |    NULL |      12 |
+------------+---------+---------+
8 rows in set (0.00 sec)
DELETE FROM marty WHERE barcode IS NULL;
MySQL is notoriously slow when using IN with very large sets. A scripted alternative:
Use a script to construct a long itemnumber = x OR itemnumber = y OR itemnumber = z clause (in chunks of ~1000) and INSERT the matched rows (i.e. the ones that would not have been DELETEd by your previous query) into a new table. Then TRUNCATE the existing table and load the contents of the new table back with INSERT INTO marty SELECT * FROM marty_tmp.
You may want to lock the table or run in a transaction for the final TRUNCATE, INSERT.
edit:
Run SELECT MIN(itemnumber) FROM manny GROUP BY barcode from a script and store the results in a desiredItemNumbers array.
Take batches of 1000 desiredItemNumbers and construct this query: INSERT INTO manny_tmp SELECT * FROM manny WHERE itemnumber = desiredItemNumbers[0] OR itemnumber = desiredItemNumbers[1] .... Rerun this query until you've exhausted the desiredItemNumbers array (n.b. the last query will probably have fewer than 1000 desiredItemNumbers).
You now have a table with the results that you would have been left with had you DELETEd the rest, so swap the contents of the marty and marty_tmp tables.
TRUNCATE marty
INSERT INTO marty SELECT * FROM marty_tmp
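A sketch of that final swap under a table lock (TRUNCATE commits implicitly in MySQL, so DELETE is used here to keep everything inside the lock):
LOCK TABLES marty WRITE, marty_tmp READ;
DELETE FROM marty;
INSERT INTO marty SELECT * FROM marty_tmp;
UNLOCK TABLES;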
If you are creating temp tables anyway, how about building your table with an INSERT INTO ... or CREATE TABLE ... AS ... based on:
SELECT MIN(itemnumber) AS itemnumber, barcode
FROM marty
GROUP BY barcode
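For instance (a sketch; marty_dedup is a name made up for illustration):
CREATE TABLE marty_dedup AS
SELECT MIN(itemnumber) AS itemnumber, barcode
FROM marty
GROUP BY barcode;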
I don't know if this concept is possible -- maybe in a stored procedure?
Consider a table with these columns:
`id` (int) | `value` (int) | `date` (datetime)
Let's say these rows exist:
1 | 3  | 2011-02-18
2 | 5  | 2011-02-19
3 | 12 | 2011-02-20
4 | 7  | 2011-02-21
5 | 8  | 2011-02-22
6 | 10 | 2011-02-23
I am trying to find trends. It is rather obvious to the human eye that the last three values are going up each day: 7 -> 8 -> 10. Is it possible to get the rows that form this pattern?
I'm thinking a stored procedure might be able to read through the rows sequentially, latest first, and find the first match (10 > 8), then continue checking until the pattern no longer holds: (8 > 7) matches, but (7 > 12) does not, so it would stop.
Any advice in the right direction would be very helpful.
set @was := null;
select id from
(
  select
    id,
    @was as was,        -- the previous row's value (rows are walked in date order)
    value as now,
    (@was := value)     -- stash the current value for the next row
  from the_table order by date
) as trends
where was is not null and now <= was limit 1;
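This returns the id of the first row (in date order) whose value fails to increase; the rows after it form the current upward run. Relying on user variables being evaluated in ORDER BY order is not guaranteed in newer MySQL versions, though; on 8.0+ the same idea can be expressed reliably with a window function (a sketch, using the question's column names):
select id
from (
  select id, date, value,
         lag(value) over (order by date) as was
  from the_table
) as trends
where was is not null and value <= was
order by date
limit 1;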