I have a MySQL table that captures the check-in activity performed by all users. Below is the structure with sample data.
Table Name: check_in
+----+---------+---------------------+
| id | user_id | time                |
+----+---------+---------------------+
|  1 |   10001 | 2016-04-02 12:04:02 |
|  2 |   10001 | 2016-04-02 11:04:02 |
|  3 |   10002 | 2016-10-27 23:56:17 |
|  4 |   10001 | 2016-04-02 10:04:02 |
|  5 |   10002 | 2016-10-27 22:56:17 |
|  6 |   10002 | 2016-10-27 21:56:17 |
+----+---------+---------------------+
On the dashboard, I have to display each user and at what time was their latest check-in activity performed (Sample dashboard view shown below).
User 1's last check-in was at 2016-04-02 12:04:02
User 2's last check-in was at 2016-10-27 23:56:17
What is the most efficient way to write the query that pulls this data?
I have written the query below, but it takes 5-8 seconds to execute. Note: this table has hundreds of thousands of rows.
select user_id, max(time) as last_check_in_at from check_in group by user_id
Your SQL query looks optimized to me.
The reason it is slow is probably that you do not have indexes on the user_id and time columns.
Try adding the following indexes to your table:
ALTER TABLE `check_in` ADD INDEX `user_id` (`user_id`)
ALTER TABLE `check_in` ADD INDEX `time` (`time`)
and then execute your SQL query again to see if it makes a difference.
The indexes should allow the SQL engine to quickly group the relevant records by user_id and also quickly determine the maximum time.
Indexes will also help to quickly sort the data by time (as suggested by Rajesh).
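Another sketch worth testing, assuming MySQL: a single composite index covering both columns, which the optimizer can often use to answer this GROUP BY/MAX combination straight from the index (the index name below is illustrative).
ALTER TABLE `check_in` ADD INDEX `idx_user_time` (`user_id`, `time`);
-- The query itself is unchanged; check EXPLAIN to confirm the index is used.
SELECT user_id, MAX(`time`) AS last_check_in_at
FROM check_in
GROUP BY user_id;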
Simply use order by
Note: try not to use time as a column name; it is a keyword in MySQL and may be reserved in other databases, so quote it with backticks.
select chk.user_id AS user_id, MAX(chk.`time`) AS last_check_in_at from
check_in chk group by chk.user_id ORDER BY last_check_in_at
You need to use ORDER BY in the query.
Please try this query:
select `user_id`, max(`time`) as last_check_in_at from check_in group by `user_id` order by last_check_in_at DESC
It works for me.
I have 2 tables, ticket_data and nps_data.
ticket_data holds general IT issue information and nps_data holds user feedback.
A basic idea of the tables is:
ticket_data table.
approx. 1,500,000 rows, 30 fields:
Index on ticket_number, logged_date, logged_team, resolution_date
| ticket_number | logged_date | logged_team | resolution_date |
|---------------+-------------+-------------+-----------------|
| I00001        | 2017-01-01  | Help Desk   | 2017-01-02      |
| I00002        | 2017-02-01  | Help Desk   | 2017-03-01      |
| I00010        | 2017-03-04  | desktop sup | 2017-03-04      |
Obviously there are lots of other fields, but this is what I'm working with.
nps_data table
approx. 83,000 rows, 10 fields:
Index on ticket_number
| ticket_number | resolving team | q1_score |
|---------------+----------------+----------|
| I00001        | helpdesk       | 5        |
| I00002        | desktop sup    | 0        |
| I00010        | desktop sup    | 10       |
When I do a simple query such as
select a.*, b.q1_score from
(select * from ticket_data
where resolution_date > '2017-01-01') a
left join nps_data b
on a.ticket_number = b.ticket_number
The query takes forever to run, and when I say that, I mean I stop the query after 10 mins.
However, if I run a query joining ticket_data with a table called ticket_details, which has over 1,000,000 rows, using the following query
select *
from (select * from ticket_data
where resolution_date > '2017-01-01') a
left join ticket_details b
on a.ticket_number = b.ticket_number
the query takes about 1.3 seconds to run.
In the query above, you have a subquery with the alias a that is not running on an index. You are querying the field resolution_date, which is un-indexed.
The simple fix would be to add an index to that field.
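Assuming MySQL, that could look like this (the index name is illustrative):
ALTER TABLE ticket_data ADD INDEX idx_resolution_date (resolution_date);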
Ticket number is indexed. This is probably why when you join on that, the query runs faster.
Another way to further optimize this would be not to do select * in your subquery (which is bad practice in a production system anyway); it creates more overhead for the DBMS to pass every column up from the subquery.
Another option, on a DBMS that supports them (PostgreSQL does, MySQL does not), would be a partial index on the column, such as:
create index idx_tickets on ticket_data(ticket_number) where resolution_date > '2017-01-01'
But I would only do that if the timestamp '2017-01-01' is a constant that will always be used.
You could also create a composite index, so the query engine will run an Index Only Scan whereby it pulls the data straight from the index without having to go back to the table.
To give exact syntax I would need to know which DBMS you are running on; all of the above depends on that.
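If it is MySQL, a rough sketch of the composite-index idea could look like the following (the index name is illustrative); it also drops the derived table and SELECT * as suggested above:
-- Filter column first, join column second.
ALTER TABLE ticket_data ADD INDEX idx_res_ticket (resolution_date, ticket_number);
-- Rewrite without the derived table and without SELECT *.
SELECT t.ticket_number, t.logged_date, t.logged_team, t.resolution_date, n.q1_score
FROM ticket_data t
LEFT JOIN nps_data n ON n.ticket_number = t.ticket_number
WHERE t.resolution_date > '2017-01-01';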
Table structure:
CREATE TABLE `mytable` (
`id` varchar(8) NOT NULL,
`event` varchar(32) NOT NULL,
`event_date` date NOT NULL,
`event_time` time NOT NULL,
KEY `id` (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
The data in this table looks like this:
id | event | event_date | event_time
---------+------------+-------------+-------------
ref1 | someevent1 | 2010-01-01 | 01:23:45
ref1 | someevent2 | 2010-01-01 | 02:34:54
ref1 | someevent3 | 2010-01-18 | 01:23:45
ref2 | someevent4 | 2012-10-05 | 22:23:21
ref2 | someevent5 | 2012-11-21 | 11:22:33
The table contains about 500,000,000 records similar to this.
The query I'd like to ask about here looks like this:
SELECT *
FROM `mytable`
WHERE `id` = 'ref1'
ORDER BY event_date DESC,
event_time DESC
LIMIT 0, 500
The EXPLAIN output looks like:
select_type: SIMPLE
table: E
type: ref
possible_keys: id
key: id
key_len: 27
ref: const
rows: 17024 (a common example)
Extra: Using where; Using filesort
Purpose:
This query is generated by a website; the LIMIT values drive a page-navigation element, so if the user wants to see older entries they get adjusted to 500, 500, then 1000, 500 and so on.
Since some values in the id column occur in quite a lot of rows, more rows of course lead to a slower query. Profiling those slow queries showed me the reason is the sorting: most of the time during the query the MySQL server is busy sorting the data. Indexing the fields event_date and event_time didn't change that very much.
Example SHOW PROFILE Result, sorted by duration:
state | duration/sec | percentage
---------------|--------------|-----------
Sorting result | 12.00145 | 99.80640
Sending data | 0.01978 | 0.16449
statistics | 0.00289 | 0.02403
freeing items | 0.00028 | 0.00233
...
Total | 12.02473 | 100.00000
Now the question:
Before delving deeper into MySQL variables like sort_buffer_size and other server configuration options, can you think of any way to change the query or the sorting behaviour so that sorting is no longer such a big performance eater, while the purpose of this query stays in place?
I don't mind a bit of out-of-the-box-thinking.
Thank you in advance!
As I wrote in a comment, a multi-column index on (id, event_date desc, event_time desc) may help.
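A sketch of that index, assuming MySQL (the index name is illustrative; before MySQL 8.0 the DESC keywords in an index definition are parsed but ignored, and the index can still serve this ORDER BY via a backward scan):
ALTER TABLE mytable ADD INDEX idx_id_date_time (id, event_date, event_time);
-- The paging query can then read rows in index order instead of filesorting:
SELECT *
FROM mytable
WHERE id = 'ref1'
ORDER BY event_date DESC, event_time DESC
LIMIT 0, 500;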
If this table will grow fast, you should consider adding an option in the application for the user to select data for a particular date range.
Example: the first step always returns 500 records, but to fetch older records the user sets a date range and then pages within it.
Indexing is most likely the solution; you just have to do it right. See the mysql reference page for this.
The most effective way to do it is to create a three-part index on (id, event_date, event_time). You can specify event_date desc, event_time desc in the index, but I don't think it's necessary.
I would start by doing what sufleR suggests - the multi-column index on (id, event_date desc, event_time desc).
However, according to http://dev.mysql.com/doc/refman/5.0/en/create-index.html, the DESC keyword is supported, but doesn't actually do anything. That's a bit of a pain - so try it, and see if it improves the performance, but it probably won't.
If that's the case, you may have to cheat by creating a "sort_value" column with an automatically decrementing value (pretty sure you'd have to do this in the application layer; I don't think you can auto-decrement in MySQL), and add that column to the index.
You'd end up with:
id | event | event_date | event_time | sort_value
---------+------------+-------------+-------------+-----------
ref1 | someevent1 | 2010-01-01 | 01:23:45 | 0
ref1 | someevent2 | 2010-01-01 | 02:34:54 | -1
ref1 | someevent3 | 2010-01-18 | 01:23:45 | -2
ref2 | someevent4 | 2012-10-05 | 22:23:21 | -3
ref2 | someevent5 | 2012-11-21 | 11:22:33 | -4
and an index on (id, sort_value).
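A minimal sketch of that workaround, assuming the application supplies the decrementing sort_value (column and index names are illustrative):
ALTER TABLE mytable
  ADD COLUMN sort_value BIGINT NOT NULL DEFAULT 0,
  ADD INDEX idx_id_sort (id, sort_value);
-- Because sort_value decreases with every insert, ascending order returns the
-- newest events first and the index avoids the filesort:
SELECT *
FROM mytable
WHERE id = 'ref1'
ORDER BY sort_value
LIMIT 0, 500;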
Dirty, but the only other suggestion is to reduce the number of records matching the where clause in other ways - for instance, by changing the interface not to return 500 records, but records for a given date.
Actually I am stuck in the middle of some MySQL code. Can anyone help with a simple question? I have 6-10 (multiple) tables in a database, all holding different data that is not related to each other.
There is no relation between the tables, but each of them has a posted-time column. All I want is to query all the tables and sort the combined result by that time column.
Eg:
table1:
recipename | cook | timetocook | dated (auto posted time - php time())
-----------+------+------------+------
abc | def | 100 | 10
zxy | orp | 102 | 16
table2:
bookname | author | dated (auto posted time - php time())
---------+--------+------
ab | cd | 11
ef | nm | 14
As you can see there is no relation between the tables (I have read about joins); I want to show the rows one by one, ordered by the posted time.
like this:
abc def 100 10
ab cd 11
ef nm 14
zxy orp 102 16
So, any help to achieve this?
SELECT recipename, cook, timetocook, dated
FROM table1
UNION
SELECT bookname, author, NULL, dated
FROM table2
ORDER BY dated
You have to add a NULL value so both SELECTs return the same number of columns, and the dated column has to sit in the same position in each SELECT so the ORDER BY applies to it.
I have this existing schema where a "schedule" table looks like this (very simplified).
CREATE TABLE schedule (
id int(11) NOT NULL AUTO_INCREMENT,
name varchar(45),
start_date date,
availability int(3),
PRIMARY KEY (id)
);
For each person it specifies a start date and the percentage of work time available to spend on this project. That availability percentage implicitly continues until a newer value is specified.
For example take a project that lasts from 2012-02-27 to 2012-03-02:
id | name | start_date | availability
-------------------------------------
1 | Tom | 2012-02-27 | 100
2 | Tom | 2012-02-29 | 50
3 | Ben | 2012-03-01 | 80
So Tom starts on Feb 27th full time; from Feb 29th on he'll be available with only 50% of his work time.
Ben only starts on March 1st, and only with 80% of his time.
Now the goal is to "normalize" this sparse data, so that there is a result row for each person for each day with the availability coming from the last specified day:
name | start_date | availability
--------------------------------
Tom | 2012-02-27 | 100
Tom | 2012-02-28 | 100
Tom | 2012-02-29 | 50
Tom | 2012-03-01 | 50
Tom | 2012-03-02 | 50
Ben | 2012-02-27 | 0
Ben | 2012-02-28 | 0
Ben | 2012-02-29 | 0
Ben | 2012-03-01 | 80
Ben | 2012-03-02 | 80
Think of a chart showing the availability of each person over time, or calculating the "resource" values in a burndown diagram.
I can easily do this with procedural code in the app layer, but would prefer a nicer, faster solution.
To make this remotely effective, I recommend creating a calendar table: one that contains each and every date of interest. You then use that as a template on which to join your data.
Equally, things improve further if you have a person table to act as the template for the name dimension of your results.
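A minimal sketch of those two template tables, assuming MySQL (names are illustrative; the calendar table is pre-populated with one row per date you care about):
CREATE TABLE calendar (
  `date` DATE NOT NULL,
  PRIMARY KEY (`date`)
);
CREATE TABLE person (
  name VARCHAR(45) NOT NULL,
  PRIMARY KEY (name)
);
-- Fill the person template from the existing schedule rows.
INSERT INTO person (name)
SELECT DISTINCT name FROM schedule;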
You can then use a correlated sub-query in your join, to pick which record in Schedule matches the calendar, person template you have created.
SELECT
*
FROM
calendar
CROSS JOIN
person
LEFT JOIN
schedule
ON schedule.name = person.name
AND schedule.start_date = (SELECT MAX(start_date)
FROM schedule
WHERE name = person.name
AND start_date <= calendar.date)
WHERE
calendar.date >= <yourStartDate>
AND calendar.date <= <yourEndDate>
etc
Often, however, it is more efficient to deal with it in one of two other ways...
Don't allow gaps in the data in the first place. Have a nightly batch process, or some other business logic, that ensures all relevant data points are populated.
Or deal with it in your client. Return each dimension in your report (date and name) as separate data sets to act as your templates, and then return the data as your final data set. Your client can iterate over the data and fill in the blanks as appropriate. It's more code, but it can actually use fewer resources overall than trying to fill the gaps with SQL.
(If your client-side code does this slowly, post another question examining that code. Provided that the data is sorted, this is actually quite quick to do in most languages.)
Some time ago I answered a question on SO (accepted as correct), but the answer left me with a big doubt.
In short, the user had a table with these fields:
id INT PRIMARY KEY
dt DATETIME (with an INDEX)
lt DOUBLE
The query SELECT DATE(dt),AVG(lt) FROM table GROUP BY DATE(dt) was really slow.
We told him that (part of) the problem was using DATE(dt) as a field and grouping by it, but the db was on a production server and it wasn't possible to split that field.
So, with a trigger, another field da DATE (with an INDEX) was added, filled automatically with DATE(dt). The query SELECT da,AVG(lt) FROM table GROUP BY da was a bit faster, but with about 8 million records it still took about 60s!
I tried it on my PC and finally discovered that, after removing the index on the da field, the query took only 7s, while using DATE(dt) after removing the index took 13s.
I've always thought an index on the column used for grouping would really speed a query up, not the contrary (8 times slower!).
Why? Which is the reason?
Thanks a lot.
Because you still need to read all the data from both the index and the data file. Since you're not using any WHERE condition, the query plan will always access all the data, row by row, and there is nothing you can do about that.
If performance is important for this query and it is performed often, I'd suggest caching the results in a separate summary table and updating it hourly (daily, etc.).
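A sketch of that caching idea, assuming MySQL (the summary table name is illustrative, and `table` stands in for the real source table as it does in the question):
CREATE TABLE daily_avg (
  da     DATE   NOT NULL PRIMARY KEY,
  avg_lt DOUBLE NOT NULL
);
-- Rebuild the summary on a schedule (hourly, daily, ...).
REPLACE INTO daily_avg (da, avg_lt)
SELECT da, AVG(lt) FROM `table` GROUP BY da;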
Why it becomes slower: the data in the index is already sorted, and when MySQL calculates the cost of the query it assumes it will be cheaper to read the already-sorted index entries, group them, and then calculate the aggregates. In this case that assumption is wrong.
I think this is because of this or a similar MySQL bug: Index degrades sort performance and optimizer does not honor IGNORE INDEX
I remember the question, as I was going to answer it but got distracted by something else. The problem was that his table design wasn't taking advantage of a clustered primary key index.
I would have redesigned the table, creating a composite clustered primary key with the date as the leading part of the index. The sm_id field is still just a sequential unsigned int to guarantee uniqueness.
drop table if exists speed_monitor;
create table speed_monitor
(
created_date date not null,
sm_id int unsigned not null,
load_time_secs double(10,4) not null default 0,
primary key (created_date, sm_id)
)
engine=innodb;
+------+----------+
| year | count(*) |
+------+----------+
| 2009 | 22723200 | 22 million
| 2010 | 31536000 | 31 million
| 2011 | 5740800 | 5 million
+------+----------+
select
created_date,
count(*) as counter,
avg(load_time_secs) as avg_load_time_secs
from
speed_monitor
where
created_date between '2010-01-01' and '2010-12-31'
group by
created_date
order by
created_date
limit 7;
-- cold runtime
+--------------+---------+--------------------+
| created_date | counter | avg_load_time_secs |
+--------------+---------+--------------------+
| 2010-01-01 | 86400 | 1.66546802 |
| 2010-01-02 | 86400 | 1.66662466 |
| 2010-01-03 | 86400 | 1.66081309 |
| 2010-01-04 | 86400 | 1.66582251 |
| 2010-01-05 | 86400 | 1.66522316 |
| 2010-01-06 | 86400 | 1.66859480 |
| 2010-01-07 | 86400 | 1.67320440 |
+--------------+---------+--------------------+
7 rows in set (0.23 sec)