Hi, I currently have a query which takes 11 seconds to run. I have a report displayed on a website which runs 4 similar queries, each of which also takes 11 seconds. I don't really want the customer having to wait close to a minute for all of these queries to run and display the data.
I am using 4 different AJAX requests to call APIs to get the data I need, and these all start at once, but the queries run one after another. If there were a way to get these queries to all run at once (in parallel) so the total load time is only 11 seconds, that would also fix my issue, but I don't believe that is possible.
Here is the query I am running:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
I can't think of any way to speed this query up at all; below are pictures of the table indexes and the EXPLAIN for this query.
I think the above query is using the relevant indexes in the WHERE conditions.
If there is anything you can think of to speed this query up please let me know. I have been working on it for 3 days and can't seem to figure out the problem. It would be great to get the query times down to 5 seconds maximum. If I am wrong about the AJAX issue please let me know, as this would also fix my issue.
" EDIT "
I have come across something quite strange which might be causing the issue. When I change the day_epoch range to something smaller (5th to 9th), which returns 130,000 rows, the query time is 0.7 seconds, but when I add one more day to that range (5th to 10th) and it returns over 150,000 rows, the query time is 13 seconds. I have run loads of different ranges and have come to the conclusion that if the number of rows returned is over 150,000, it has a huge effect on the query times.
Table Definition -
CREATE TABLE `tracking_daily_stats_zone_unique_device_uuids_per_hour` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`day_epoch` int(10) NOT NULL,
`day_of_week` tinyint(1) NOT NULL COMMENT 'day of week, monday = 1',
`hour` int(2) NOT NULL,
`venue_id` int(5) NOT NULL,
`zone_id` int(5) NOT NULL,
`device_uuid` binary(16) NOT NULL COMMENT 'binary representation of the device_uuid, unique for a single day',
`device_vendor_id` int(5) unsigned NOT NULL DEFAULT '0' COMMENT 'id of the device vendor',
`first_seen` int(10) unsigned NOT NULL DEFAULT '0',
`last_seen` int(10) unsigned NOT NULL DEFAULT '0',
`is_repeat` tinyint(1) NOT NULL COMMENT 'is the device a repeat for this day?',
`prev_last_seen` int(10) NOT NULL DEFAULT '0' COMMENT 'previous last seen ts',
PRIMARY KEY (`id`,`venue_id`) USING BTREE,
KEY `venue_id` (`venue_id`),
KEY `zone_id` (`zone_id`),
KEY `day_of_week` (`day_of_week`),
KEY `day_epoch` (`day_epoch`),
KEY `hour` (`hour`),
KEY `device_uuid` (`device_uuid`),
KEY `is_repeat` (`is_repeat`),
KEY `device_vendor_id` (`device_vendor_id`)
) ENGINE=InnoDB AUTO_INCREMENT=450967720 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY HASH (venue_id)
PARTITIONS 100 */
The straightforward solution is to add this query-specific index to the table:
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
ADD INDEX complex_idx (`venue_id`, `day_epoch`, `zone_id`)
WARNING: this schema change can take a while on a large DB.
And then force it when you call:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
USE INDEX (complex_idx)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND venue_id = 46
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
It is definitely not universal but should work for this particular query.
UPDATE: When you have a partitioned table you can benefit from forcing a particular PARTITION. In our case, since the table is partitioned by venue_id, just force it:
SELECT device_uuid,
day_epoch,
is_repeat
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
PARTITION (`p46`)
WHERE day_epoch >= 1552435200
AND day_epoch < 1553040000
AND zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
Here p46 is the string p concatenated with the venue_id value 46.
And another trick if you go this way: you can remove AND venue_id = 46 from the WHERE clause, because there is no other data in that partition.
What happens if you change the order of conditions? Put venue_id = ? first. The order matters.
Now it first checks all rows for:
- day_epoch >= 1552435200
- then, the remaining set for day_epoch < 1553040000
- then, the remaining set for venue_id = 46
- then, the remaining set for zone_id IN (102,105,108,110,111,113,116,117,118,121,287)
When working with heavy queries, you should always try to make the first "selector" the most effective. You can do that by using a proper index (on a single column or a combination) and by making sure that the first selector narrows the result set down the most (at least for integers; for strings you need another tactic).
Sometimes a query simply is slow. When you have a lot of data (and/or not enough resources) you just can't really do anything about that. That's where you need another solution: make a summary table. I doubt you show 150,000 rows x 4 to your visitor. You can aggregate the data, e.g. hourly or every few minutes, and select from that much smaller table.
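A minimal sketch of that summary-table idea, assuming hourly per-zone counts are what the report actually needs (the table and column names here are invented, not from the original post):
CREATE TABLE zone_hourly_summary (
  venue_id     INT NOT NULL,
  zone_id      INT NOT NULL,
  day_epoch    INT NOT NULL,
  hour         TINYINT NOT NULL,
  device_count INT NOT NULL,
  repeat_count INT NOT NULL,
  PRIMARY KEY (venue_id, zone_id, day_epoch, hour)
);

-- Refresh one day's worth of counts (run hourly or every few minutes):
INSERT INTO zone_hourly_summary
SELECT venue_id, zone_id, day_epoch, hour,
       COUNT(*)       AS device_count,
       SUM(is_repeat) AS repeat_count
FROM tracking_daily_stats_zone_unique_device_uuids_per_hour
WHERE day_epoch = 1552435200
GROUP BY venue_id, zone_id, day_epoch, hour
ON DUPLICATE KEY UPDATE
  device_count = VALUES(device_count),
  repeat_count = VALUES(repeat_count);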
Off topic: putting an index on everything only slows you down when inserting/updating/deleting. Index the smallest number of columns, just the ones you actually filter on (e.g. used in a WHERE or GROUP BY).
450M rows is rather large. So, I will discuss a variety of issues that can help.
Shrink data
A big table leads to more I/O, which is the main performance killer. ('Small' tables tend to stay cached, and not have an I/O burden.)
Any kind of INT, even INT(2), takes 4 bytes. An "hour" can easily fit in a 1-byte TINYINT. That saves over 1 GB in the data, plus a similar amount in INDEX(hour).
If hour and day_of_week can be derived, don't bother having them as separate columns. This will save more space.
Some reason to use a 4-byte day_epoch instead of a 3-byte DATE? Or perhaps you do need a 5-byte DATETIME or TIMESTAMP.
Optimal INDEX (take #1)
If it is always a single venue_id, then this is a good first cut at the optimal index:
INDEX(venue_id, zone_id, day_epoch)
First is the constant, then the IN, then a range. The Optimizer does well with this in many cases. (It is unclear whether the number of items in an IN clause can lead to inefficiencies.)
Better Primary Key (better index)
With AUTO_INCREMENT, there is probably no good reason to include columns after the auto_inc column in the PK. That is, PRIMARY KEY(id, venue_id) is no better than PRIMARY KEY(id).
InnoDB orders the data's BTree according to the PRIMARY KEY. So, if you are fetching several rows and can arrange for them to be adjacent to each other based on the PK, you get extra performance. (cf "Clustered".) So:
PRIMARY KEY(venue_id, zone_id, day_epoch,   -- this order, as discussed above
            id)                             -- to make sure that the entire PK is unique
INDEX(id)                                   -- to keep AUTO_INCREMENT happy
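As a hedged sketch (the index name is an assumption), that change could be applied in a single ALTER; on a 450M-row table expect the rebuild to take a long time:
ALTER TABLE tracking_daily_stats_zone_unique_device_uuids_per_hour
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (venue_id, zone_id, day_epoch, id),
  ADD INDEX idx_id (id);  -- InnoDB needs the AUTO_INCREMENT column to lead some index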
And, I agree with DROPping any indexes that are not in use, including the one I recommended above. It is rarely useful to index flags (is_repeat).
UUID
Indexing a UUID can be deadly for performance once the table is really big. This is because of the randomness of UUIDs/GUIDs, leading to ever-increasing I/O burden to insert new entries in the index.
Multi-dimensional
Assuming day_epoch is sometimes multiple days, you seem to have 2 or 3 "dimensions":
A date range
A list of zones
A venue.
INDEXes are 1-dimensional. Therein lies the problem. However, PARTITIONing can sometimes help. I discuss this briefly as "case 2" in http://mysql.rjweb.org/doc.php/partitionmaint .
There is no good way to get 3 dimensions, so let's focus on 2.
You should partition on something that is a "range", such as day_epoch or zone_id.
After that, you should decide what to put in the PRIMARY KEY so that you can further take advantage of "clustering".
Plan A: This assumes you are searching for only one venue_id at a time:
PARTITION BY RANGE(day_epoch) -- see note below
PRIMARY KEY(venue_id, zone_id, id)
Plan B: This assumes you sometimes search for venue_id IN (.., .., ...), hence it does not make a good first column for the PK:
Well, I don't have good advice here; so let's go with Plan A.
The RANGE expression must be numeric. Your day_epoch works fine as is. Changing to a DATE would necessitate BY RANGE(TO_DAYS(...)), which works fine.
You should limit the number of partitions to 50. (The 81 mentioned above is not bad.) The problem is that "lots" of partitions introduces different inefficiencies; "too few" partitions leads to "why bother".
Note that almost always the optimal PK is different for a partitioned table than the equivalent non-partitioned table.
Note that I disagree with partitioning on venue_id since it is so easy to put that column at the start of the PK instead.
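A pared-down sketch of Plan A (columns trimmed; partition names and boundary values are purely illustrative). One detail worth calling out as my own addition: MySQL requires every unique key on a partitioned table, including the primary key, to contain the partitioning column, so day_epoch is included in the PK here:
CREATE TABLE tracking_daily_stats_plan_a (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  day_epoch   INT NOT NULL,
  venue_id    SMALLINT UNSIGNED NOT NULL,
  zone_id     SMALLINT UNSIGNED NOT NULL,
  device_uuid BINARY(16) NOT NULL,
  is_repeat   TINYINT NOT NULL,
  PRIMARY KEY (venue_id, zone_id, day_epoch, id),  -- clustering order from Plan A
  KEY (id)                                         -- keeps AUTO_INCREMENT happy
) ENGINE=InnoDB
PARTITION BY RANGE (day_epoch) (
  PARTITION p0   VALUES LESS THAN (1551398400),
  PARTITION p1   VALUES LESS THAN (1554076800),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);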
Analysis
Assuming you search for a single venue_id and use my suggested partitioning & PK, here's how the SELECT performs:
Filter on the date range. This is likely to limit the activity to a single partition.
Drill into the data's BTree for that one partition to find the one venue_id.
Hopscotch through the data from there, landing on the desired zone_ids.
For each, further filter based on the date.
For example: I have some GPS devices that send info to my database every second,
so 1 device creates 1 row in the MySQL database with these columns (8):
id=12341 date=22.02.2018 time=22:40
latitude=22.236558789 longitude=78.9654582 deviceID=24 name=device-name someinfo=asdadadasd
so for 1 minute it creates 60 rows, and for 24 hours it creates 86,400 rows,
and for 1 month (31 days) 2,678,400 rows,
so 1 device creates about 2.6 million rows per month in my DB table (records are deleted every month),
so if there are more devices it will be 2.6 million * the number of devices.
so my questions are like this:
Question 1: if I make a search like this from PHP (just for the current day and for 1 device):
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24'
the maximum possible result will be 86,400 rows;
will it overload my server too much?
Question 2: if I limit it to 5 hours (18,000 rows), will that be a problem for the database, or will it load the server like the first example, or less?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 18000
Question 3: if I show just 1 result from the DB, will it overload the server?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 1
Does it mean that if I have millions of rows, showing 1,000 rows will load the server the same as showing just 1 result?
Millions of rows is not a problem; this is what SQL databases are designed to handle, if you have a well-designed schema and good indexes.
Use proper types
Instead of storing your dates and times as separate strings, store them either as a single datetime or as separate date and time types. See the indexing discussion below for more about which one to use. This is more compact, allows indexing and faster sorting, and makes date and time functions available without having to do conversions.
Similarly, be sure to use the appropriate numeric type for latitude and longitude. You'll probably want to use numeric to ensure precision.
Since you're going to be storing billions of rows, be sure to use a bigint for your primary key. A regular int can only go up to about 2 billion.
Move repeated data into another table.
Instead of storing information about the device in every row, store that in a separate table. Then only store the device's ID in your log. This will cut down on your storage size and eliminate mistakes due to data duplication. Be sure to declare the device ID as a foreign key; this will provide referential integrity and an index.
Add indexes
Indexes are what allow a database to search through millions or billions of rows very, very efficiently. Be sure there are indexes on the columns you use frequently, such as your timestamp.
A lack of indexes on date and deviceID is likely why your queries are so slow. Without an index, MySQL has to look at every row in the table, known as a full table scan. This is why your queries are so slow: you're lacking indexes.
You can discover whether your queries are using indexes with EXPLAIN.
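As a hedged example against the current layout (the table name gps_logs and the exact column definitions are assumptions), a composite index plus an EXPLAIN check could look like this:
-- Composite index matching the WHERE clause of the example queries:
CREATE INDEX idx_device_date ON gps_logs (deviceID, `date`);

-- Verify the query now uses the index instead of a full table scan:
EXPLAIN
SELECT * FROM gps_logs WHERE `date` = '22.02.2018' AND deviceID = '24';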
datetime or time + date?
Normally it's best to store your date and time in a single column, conventionally called created_at. Then you can use date to get just the date part like so.
select *
from gps_logs
where date(created_at) = '2018-07-14'
There's a problem. The problem is how indexes work... or don't. Because of the function call, where date(created_at) = '2018-07-14' will not use an index. MySQL will run date(created_at) on every single row. This means a performance killing full table scan.
You can work around this by working with just the datetime column. This will use an index and be efficient.
select *
from gps_logs
where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'
Or you can split your single datetime column into date and time columns, but this introduces new problems. Querying ranges which cross a day boundary becomes difficult. Like maybe you want a day in a different time zone. It's easy with a single column.
select *
from gps_logs
where '2018-07-12 10:00:00' <= created_at and created_at < '2018-07-13 10:00:00'
But it's more involved with a separate date and time.
select *
from gps_logs
where (created_date = '2018-07-12' and created_time >= '10:00:00')
or (created_date = '2018-07-13' and created_time < '10:00:00');
Or you can switch to a database with expression indexes, like PostgreSQL. An expression index allows you to index the result of a function or expression. And PostgreSQL does a lot of things better than MySQL. This is what I recommend.
Do as much work in SQL as possible.
For example, if you want to know how many log entries there are per device per day, rather than pulling all the rows out and calculating them yourself, you'd use group by to group them by device and day.
select gps_device_id, count(id) as num_entries, created_at::date as day
from gps_logs
group by gps_device_id, day;
gps_device_id | num_entries | day
---------------+-------------+------------
1 | 29310 | 2018-07-12
2 | 23923 | 2018-07-11
2 | 23988 | 2018-07-12
With this much data, you will want to rely heavily on group by and the associated aggregate functions like sum, count, max, min and so on.
Avoid select *
If you must retrieve 86,400 rows, simply fetching all that data from the database can be costly. You can speed this up significantly by fetching only the columns you need. This means using select only, the, specific, columns, you, need rather than select *.
Putting it all together.
In PostgreSQL
Your schema in PostgreSQL should look something like this.
create table gps_devices (
id serial primary key,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigserial primary key,
gps_device_id int references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
create index date_and_device on gps_logs((created_at::date), gps_device_id);
A query can generally only use one index per table. Since you'll be searching on the timestamp and device ID together a lot, timestamp_and_device indexes both the timestamp and the device ID.
date_and_device is the same idea, but it's an expression index on just the date part of the timestamp. This will make where created_at::date = '2018-07-12' and gps_device_id = 42 very efficient.
In MySQL
create table gps_devices (
id int primary key auto_increment,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigint primary key auto_increment,
gps_device_id int references gps_devices(id),
foreign key (gps_device_id) references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
Very similar, but there is no expression index. So you'll either need to always use a bare created_at in your where clauses, or switch to separate date and time types.
Just read your question; for me the answer is:
Just create a separate table for latitude and longitude, make your device ID a foreign key, and save the coordinates there.
Without knowing the exact queries you want to run I can only guess at the best structure. Having said that, you should aim for the optimal types that use the minimum number of bytes per row. This should make your queries faster.
For example, you could use the structure below:
create table device (
id int primary key not null,
name varchar(20),
someinfo varchar(100)
);
create table location (
device_id int not null,
recorded_at timestamp not null,
latitude double not null, -- instead of varchar; maybe float?
longitude double not null, -- instead of varchar; maybe float?
foreign key (device_id) references device (id)
);
create index ix_loc_dev on location (device_id, recorded_at);
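For instance (a hedged example with invented values), a per-device, time-range lookup like the one below should be able to use ix_loc_dev, because it filters on device_id and then scans a recorded_at range:
SELECT recorded_at, latitude, longitude
FROM location
WHERE device_id = 24
  AND recorded_at >= '2018-02-22 00:00:00'
  AND recorded_at <  '2018-02-23 00:00:00';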
If you include the exact queries (naming the columns) we can create better indexes for them.
Since your query selectivity is probably bad, your queries may run full table scans. For this case I took it a step further and used the smallest possible data types for the columns, so it will be faster:
create table location (
device_id tinyint not null, -- note: device.id must use the same integer type, or this foreign key will be rejected
recorded_at timestamp not null,
latitude float not null,
longitude float not null,
foreign key (device_id) references device (id)
);
Can't really think of anything smaller than this.
The best I can recommend is to use a time-series database for storing and accessing time-series data. You can host any kind of time-series database engine locally and put a little more resources into developing its access methods, or use a specialized database for telematics data like this.
I'm not new to MySQL, but I'm definitely way in over my head here.
I'd like to show a table of differences in temperatures for Panama and Belize based on date and atmospheric level. The query is supposed to match the Panama and Belize data based on date and atmospheric level and return the top 30 differences, ordered by the extent of the differences.
However, it is incredibly slow (over 30s) so it times out. Some other queries that I've written for this dataset are also very slow (about 26s). But if I only run the subqueries, they take only 1.7s or so. I should note that both of the tables below are over 440,000 rows long, though I don't think that's very large. The problem is probably the way that I'm joining the tables or the way that I'm creating the subqueries.
Here's my setup. (It's the SQL from the exported tables; I'm omitting some columns.)
/**The table for Panama weather data */
CREATE TABLE `panama_weather_data` (
`Id` varchar(40) NOT NULL,
`OwmPackageId` varchar(30) NOT NULL,
`Level` FLOAT DEFAULT NULL,
`Dt` date DEFAULT NULL,
`Temperature` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `panama_weather_data`
ADD PRIMARY KEY (`Id`) USING BTREE;
COMMIT;
/**The table for Belize weather data*/
CREATE TABLE `belize_weather_data` (
`Id` varchar(40) NOT NULL,
`OwmPackageId` varchar(30) NOT NULL,
`Level` FLOAT DEFAULT NULL,
`Dt` date DEFAULT NULL,
`Temperature` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `belize_weather_data`
ADD PRIMARY KEY (`Id`) USING BTREE;
COMMIT;
/**Code to populate the tables here*/
And here's my query:
SELECT ABS(PanamaTemperature-BelizeTemperature) AS TemperatureDif,
PanamaAtmosphericLevel, PanamaTable.Dt
FROM
(SELECT CAST(panama_weather_data.Dt AS DATETIME) AS Dt,
panama_weather_data.Level AS PanamaAtmosphericLevel,
panama_weather_data.Temperature AS PanamaTemperature
FROM panama_weather_data
WHERE panama_weather_data.OwmPackageId = 'openweathermappkg19758' )
AS PanamaTable
JOIN
(SELECT CAST(belize_weather_data.Dt AS DATETIME) AS Dt,
belize_weather_data.Level AS BelizeAtmosphericLevel,
belize_weather_data.Temperature AS BelizeTemperature
FROM belize_weather_data
WHERE belize_weather_data.OwmPackageId = 'openweathermappkg19758' )
AS BelizeTable
ON PanamaAtmosphericLevel = BelizeAtmosphericLevel
AND PanamaTable.Dt = BelizeTable.Dt
ORDER BY TemperatureDif
LIMIT 30
My question is really: is there any way to optimize this query and make it less painful?
CAST(panama_weather_data.Dt AS DATETIME) AS Dt
Why? (all this will do is slow down the query)
Is there any way to optimize this query
The SQL SELECT statement you have shown us certainly would not be my starting point. However you did not tell us how you intend to query the data in future. Specifically, are you really going to examine all of the data each time you run a query?
Your biggest win comes from not keeping the data in separate tables - it should be a single table with different attributes for the two datasets.
After that, the next biggest improvement would come from storing the temperature difference in the table and indexing it.
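A minimal sketch of that idea, assuming MySQL 5.7+ and invented table/column names: one combined table with a temperature column per country, the difference stored as a generated column, and an index that lets the top-30 query be read in order:
CREATE TABLE weather_comparison (
  OwmPackageId      varchar(30) NOT NULL,
  `Level`           float NOT NULL,
  Dt                date NOT NULL,
  PanamaTemperature float DEFAULT NULL,
  BelizeTemperature float DEFAULT NULL,
  TemperatureDif    float AS (ABS(PanamaTemperature - BelizeTemperature)) STORED,
  PRIMARY KEY (OwmPackageId, Dt, `Level`),
  KEY idx_pkg_dif (OwmPackageId, TemperatureDif)
) ENGINE=InnoDB;

-- The top 30 differences become an ordered read of idx_pkg_dif:
SELECT TemperatureDif, `Level`, Dt
FROM weather_comparison
WHERE OwmPackageId = 'openweathermappkg19758'
ORDER BY TemperatureDif DESC
LIMIT 30;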
A way to increase speed drastically in SQL databases is to use indices. This is a tradeoff between disk space and query performance.
To find out where to put indices, search for the conditions that limit your result sets the most. In your case, you probably have a few hundred thousand rows for both tables, but you only want 30 of those, whose Atmospheric Levels and date are equal. You probably want to put an index on those two columns like so:
CREATE INDEX level_date_panama ON panama_weather_data (Level, Dt);
CREATE INDEX level_date_belize ON belize_weather_data (Level, Dt);
Please tell me if this increases your performance.
You could do a few things to possibly improve performance here:
Remove the subqueries.
From what you posted I see no reason why the subqueries are necessary for the join. You could just as easily remove them and rewrite using the actual column names in place of where you wrote the AS values.
Input your Dt data as a Datetime
A CAST is not a particularly expensive operator, but does take time to complete. If you are only using these columns as Datetimes, you should be entering them as such and change the column type to a Datetime. You could directly compare these values instead of having to cast them.
Compare Dt as a Date
Going off of (2), if all your Dt values are Dates, casting them to Datetimes won't be doing anything to the value, so just compare on the natural Date type.
Index
If the above is not possible due to outside constraints, create an index based on how you are joining, this would be a column used in your ON clause.
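A hedged example of such indexes (the names are invented), covering the OwmPackageId filter plus the two join columns on each side:
CREATE INDEX idx_panama_pkg_level_dt ON panama_weather_data (OwmPackageId, `Level`, Dt);
CREATE INDEX idx_belize_pkg_level_dt ON belize_weather_data (OwmPackageId, `Level`, Dt);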
What kind of values are in id? Perhaps you can get rid of id, and use PRIMARY KEY(level, dt)?
Why is level a FLOAT? If they are really "floating" values, then is it realistic for both tables to have the same values? I guess they are feet or meters above sea level? In which case, won't MEDIUMINT UNSIGNED suffice?
Then...
SELECT ABS(p.Temperature - b.Temperature) AS TemperatureDif,
p.Level,
p.Dt
FROM panama_weather_data AS p
JOIN belize_weather_data AS b
USING (OwmPackageId, Level, Dt)
WHERE p.OwmPackageId = 'openweathermappkg19758'
ORDER BY TemperatureDif DESC
LIMIT 30;
You will need
INDEX(OwmPackageId, Level, Dt)
with those columns in any order, and on either (or both) tables.
As already mentioned, no CAST is needed. However, if you need some format other than "2017-08-13 10:04:12", then use DATE_FORMAT(...) in the SELECT clause (not the USING clause).
Rather than having two 'identical' tables, consider having one table with an extra column for which country is involved. This would make it easy to extend to an arbitrary number of locations. The SELECT would need to be a "self join" and the syntax would be slightly different.
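A hedged sketch of that single-table layout and the self join (table, column and index names are assumptions):
CREATE TABLE weather_data (
  Id           varchar(40) NOT NULL,
  Country      varchar(20) NOT NULL,   -- e.g. 'panama' or 'belize'
  OwmPackageId varchar(30) NOT NULL,
  `Level`      float DEFAULT NULL,
  Dt           date DEFAULT NULL,
  Temperature  float DEFAULT NULL,
  PRIMARY KEY (Id),
  KEY idx_cmp (Country, OwmPackageId, `Level`, Dt)
) ENGINE=InnoDB;

SELECT ABS(p.Temperature - b.Temperature) AS TemperatureDif,
       p.`Level`, p.Dt
FROM weather_data AS p
JOIN weather_data AS b
  ON  b.OwmPackageId = p.OwmPackageId
  AND b.`Level`      = p.`Level`
  AND b.Dt           = p.Dt
WHERE p.Country = 'panama'
  AND b.Country = 'belize'
  AND p.OwmPackageId = 'openweathermappkg19758'
ORDER BY TemperatureDif DESC
LIMIT 30;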
I have this table (500,000 rows):
CREATE TABLE IF NOT EXISTS `listings` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`type` tinyint(1) NOT NULL DEFAULT '1',
`hash` char(32) NOT NULL,
`source_id` int(10) unsigned NOT NULL,
`link` varchar(255) NOT NULL,
`short_link` varchar(255) NOT NULL,
`cat_id` mediumint(5) NOT NULL,
`title` mediumtext NOT NULL,
`description` mediumtext,
`content` mediumtext,
`images` mediumtext,
`videos` mediumtext,
`views` int(10) unsigned NOT NULL,
`comments` int(11) DEFAULT '0',
`comments_update` int(11) NOT NULL DEFAULT '0',
`editor_id` int(11) NOT NULL DEFAULT '0',
`auther_name` varchar(255) DEFAULT NULL,
`createdby_id` int(10) NOT NULL,
`createdon` int(20) NOT NULL,
`editedby_id` int(10) NOT NULL,
`editedon` int(20) NOT NULL,
`deleted` tinyint(1) NOT NULL,
`deletedon` int(20) NOT NULL,
`deletedby_id` int(10) NOT NULL,
`deletedfor` varchar(255) NOT NULL,
`published` tinyint(1) NOT NULL DEFAULT '1',
`publishedon` int(11) unsigned NOT NULL,
`publishedby_id` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `hash` (`hash`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
I'm thinking of making each query filter by publishedon between x and y (showing, across the whole site, just records from 1 month).
At the same time, I want to add published, cat_id and source_id to the WHERE clause along with publishedon,
something like this:
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
That query is OK and fast so far without indexing, but when I tried to use ORDER BY publishedon it became too slow, so I used this index:
CREATE INDEX `listings_pcs` ON listings(
`publishedon` DESC,
`published` ,
`cat_id` ,
`source_id`
)
It worked and the ORDER BY publishedon became fast. Now I want to order by views, like this:
SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
Here is the EXPLAIN:
this query is too slow because of ORDER BY views DESC
Then I tried to drop the old index and add this:
CREATE INDEX `listings_pcs` ON listings(
`publishedon` DESC,
`published` ,
`cat_id` ,
`source_id`,
`views` DESC
)
It's too slow also.
What if I use just a single index on publishedon?
What about using a single index on (cat_id, source_id, views, publishedon)?
I can change the query dependencies (like restricting publishedon to one month) if I find another indexing method that depends on other columns.
What about making an index on (cat_id, source_id, publishedon, published)? But in some cases I will use source_id only.
What is the best indexing scheme for that table?
This query:
SELECT *
FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458) AND
(published = 1) AND
(cat_id in (1,2,3,4,5)) AND
(source_id in (1,2,3,4,5));
Is hard to optimize with only indexes. The best index is one that starts with published and then has the other columns -- it is not clear what their order should be. The reason is that all of the conditions except the one on published are not using =.
Because your performance problem is on a sort, that suggests that lots of rows are being returned. Typically, an index is used to satisfy the WHERE clause before the index can be used for the ORDER BY. That makes this hard to optimize.
Suggestions . . . None are that great:
If you are going to access the data by month, then you might consider partitioning the data by month. That will make the query without the ORDER BY faster, but won't help the ORDER BY.
Try various orders of columns after published in the index. You might find the most selective column(s). But, once again, this speeds the query before the sorting.
Think about ways that you can structure the query to have more equality conditions in the WHERE clause or to return a smaller set of data.
(Not really recommended) Put an index on published and the ordering column. Then use a subquery to fetch the data. Put the inequality conditions (IN and so on) in the outer query. The subquery will use the index for sorting and then filter the results.
The reason the last is not recommended is because SQL (and MySQL) do not guarantee the ordering of results from a subquery. However, because MySQL materializes subqueries, the results really are in order. I don't like using undocumented side effects, which can change from version to version.
One important general note as to why your query isn't getting any faster despite your attempts is that DESC on indexes is not currently supported on MySQL. See this SO thread, and the source from which it comes.
In this case, your largest problem is in the sheer size of your record. If the engine decides it wouldn't really be faster to use an index, then it won't.
You have a few options, and all are actually pretty decent and can probably help you see significant improvement.
A note on SQL
First, I want to make a quick note about indexing in SQL. While I don't think it's the solution for your woes, it was your main question, and can help.
It usually helps me to think about indexing in three different buckets. The absolutely, the maybe, and the never. You certainly don't have anything in your indexing that's in the never column, but there are some I would consider "maybe" indexes.
absolutely: This is your primary key and any foreign keys. It is also any key you will reference on a very regular basis to pull a small set of data from the massive data you have.
maybe: These are columns which, while you may reference them regularly, are not really referenced by themselves. In fact, through analysis and using EXPLAIN as #Machavity recommends in his answer, you may find that by the time these columns are used to strip out fields, there aren't that many fields anyway. An example of a column that would solidly be in this pile for me would be the published column. Keep in mind that every INDEX adds to the work your queries need to do.
Also: Composite keys are a good choice when you're regularly searching for data based on two different columns. More on that later.
Options, options, options...
There are a number of options to consider, and each one has some drawbacks. Ultimately I would consider each of these on a case-by-case basis as I don't see any of these to be a silver bullet. Ideally, you'd test a few different solutions against your current setting and see which one runs the fastest using a nice scientific test.
Split your SQL table into two or more separate tables.
This is one of the few times where, despite the number of columns in your table, I wouldn't rush to try to split your table into smaller chunks. If you decided to split it into smaller chunks, however, I'd argue that your [action]edon, [action]edby_id, and [action]ed could easily be put into another table, actions:
+------------+-------------+------+-----+-------------------+----------------+
| Field      | Type        | Null | Key | Default           | Extra          |
+------------+-------------+------+-----+-------------------+----------------+
| id         | int(11)     | NO   | PRI | NULL              | auto_increment |
| listing_id | int(11)     | NO   |     | NULL              |                |
| action     | varchar(45) | NO   |     | NULL              |                |
| date       | datetime    | NO   |     | CURRENT_TIMESTAMP |                |
| user_id    | int(11)     | NO   |     | NULL              |                |
+------------+-------------+------+-----+-------------------+----------------+
The downside to this is that it does not allow you to ensure there is only one creation date without a TRIGGER. The upside is that you don't have to sort as many columns with as many indexes when you're sorting by date. Also, it allows you to sort not only by created, but also by all of your other actions.
Edit: As requested, here is a sample sorting query
SELECT * FROM listings
INNER JOIN actions ON actions.listing_id = listings.id
WHERE (actions.action = 'published')
AND (listings.published = 1)
AND (listings.cat_id in(1,2,3,4,5))
AND (listings.source_id in(1,2,3,4,5))
AND (actions.`date` BETWEEN FROM_UNIXTIME(1441105258) AND FROM_UNIXTIME(1443614458))
ORDER BY listings.views DESC
Theoretically, it should cut down on the number of rows you're sorting against because it's only pulling relevant data. I don't have a dataset like yours so I can't test it right now!
If you put a composite index on actions (date, listing_id), this should help to increase speed.
As I said, I don't think this is the best solution for you right now because I'm not convinced it's going to give you the maximum optimization. This leads me to my next suggestion:
Create a month field
I used this nifty tool to confirm what I thought I understood of your question: You are sorting by month here. Your example is specifically looking between September 1st and September 30th, inclusive.
So another option is for you to split your integer timestamp into month, day, and year fields. You can still have your timestamp, but timestamps aren't all that great for searching. Run an EXPLAIN on even a simple query and you'll see for yourself.
That way, you can just index the month and year fields and do a query like this:
SELECT * FROM listings
WHERE (publishedmonth = 9)
AND (publishedyear = 2015)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
Slap an EXPLAIN in front and you should see massive improvements.
Because you're planning on referring to a month and a year together, you may want to add a composite key on month and year, rather than a key on each separately, for added gains.
Note: I want to be clear, this is not the "correct" way to do things. It is convenient, but denormalized. If you want the correct way to do things, you'd adapt something like this link, but I think that would require you to seriously reconsider your table, and I haven't tried anything like this, having lacked the need and, frankly, the will to brush up on my geometry. I think it's a little overkill for what you're trying to do.
Do your heavy sorting elsewhere
This was hard for me to come to terms with because I like to do things the "SQL" way wherever possible, but that is not always the best solution. Heavy computing, for example, is best done using your programming language, leaving SQL to handle relationships.
The former CTO of Digg sorted using PHP instead of MySQL and received a 4,000% performance increase. You're probably not scaling out to this level, of course, so the performance trade-offs won't be clearcut unless you test it out yourself. Still, the concept is sound: the database is the bottleneck, and computer memory is dirt cheap by comparison.
There are doubtless a lot more tweaks that can be done. Each of these has a drawback and requires some investment. The best answer is to test two or more of these and see which one helps you get the most improvement.
If I were you, I'd at least INDEX the fields in question individually. You're building multi-column indices but it's clear you're pulling a lot of disparate records as well. Having the columns indexed individually can't hurt.
Something you should do is use EXPLAIN which lets you look under the hood of how MySQL is pulling the data. It could further point to what is slowing your query down.
EXPLAIN SELECT * FROM listings
WHERE (publishedon BETWEEN 1441105258 AND 1443614458)
AND (published = 1)
AND (cat_id in(1,2,3,4,5))
AND (source_id in(1,2,3,4,5))
ORDER BY views DESC
The rows of your table are enormous (all those mediumtext columns), so sorting SELECT * is going to have a lot of overhead. That's a simple reality of your schema design. SELECT * is generally considered harmful to performance. If you can enumerate the columns you need, and you can leave out some of the big ones, you'll get better performance.
You showed us a query with the following filter criteria
single-value equality on published.
range matching on publishedon.
set matching on cat_id
set matching on source_id.
Ordering on views.
Due to the way MySQL indexing works on MyISAM, the following compound covering index will probably serve you well. It's hard to be sure unless you try it.
CREATE INDEX listings_x_pub_date_cover ON listings(
published, publishedon, cat_id, source_id, views, id )
To satisfy your query the MySQL engine will random-access the index at the appropriate value of published, and then at the beginning of the publishedon range. It will then scan through the index, filtering on the other two criteria. Finally, it sorts and uses the id value to look up each row that passes the filter. Give it a try.
If that performance isn't good enough try this so-called deferred join operation.
SELECT a.*
FROM listings a
JOIN ( SELECT id, views
FROM listings
WHERE published = 1
AND publishedon BETWEEN 1441105258
AND 1443614458
AND cat_id IN (1,2,3,4,5)
AND source_id IN (1,2,3,4,5)
ORDER BY views DESC
) b ON a.id = b.id
ORDER BY b.views DESC
This does the heavy lifting of ordering with just the id and views columns, without having to shuffle all those massive text columns. It may or may not help, because the ordering has to be repeated in the outer query. This kind of thing DEFINITELY helps when you have an ORDER BY ... LIMIT n pattern in your query, but you don't.
Finally, considering the size of these rows, you may get best performance by doing this inner query from your php program:
SELECT id
FROM listings
WHERE published = 1
AND publishedon BETWEEN 1441105258
AND 1443614458
AND cat_id IN (1,2,3,4,5)
AND source_id IN (1,2,3,4,5)
ORDER BY views DESC
and then fetching the full rows of the table one-by-one using these id values in an inner loop. (This query that fetches just id values should be quite fast with the help of the index I mentioned.) The inner loop solution would be ugly, but if your text columns are really big (each mediumtext column can hold up to 16MiB) it's probably your best bet.
tl;dr. Create the index mentioned. Get rid of SELECT * if you possibly can, giving a list of columns you need instead. Try the deferred join query. If it's still not good enough try the nested query.
I am collecting about 3-6 million lines of stock data per day and storing it in a MySQL database.
All of the data is coming from Interactive Brokers; every piece of information comes with these five fields: Symbol, Date, Time, Value and Type (Type being information on what kind of data I am receiving, such as price, volume, etc.).
Here is my create table statement. idticks is just my unique key, but I am almost never able to use it in queries.
CREATE TABLE `ticks` (
`idticks` int(11) NOT NULL AUTO_INCREMENT,
`symbol` varchar(30) NOT NULL,
`date` int(11) NOT NULL,
`time` int(11) NOT NULL,
`value` double NOT NULL,
`type` double NOT NULL,
KEY `idticks` (`idticks`),
KEY `symbol` (`symbol`),
KEY `date` (`date`),
KEY `idx_ticks_symbol_date` (`symbol`,`date`),
KEY `idx_ticks_type` (`type`),
KEY `idx_ticks_date_type` (`date`,`type`),
KEY `idx_ticks_date_symbol_type` (`date`,`symbol`,`type`),
KEY `idx_ticks_symbol_date_time_type` (`symbol`,`date`,`time`,`type`)
) ENGINE=InnoDB AUTO_INCREMENT=13533258 DEFAULT CHARSET=utf8
/*!50100 PARTITION BY KEY (`date`)
PARTITIONS 1 */;
As you can see, I have no idea what I am doing because I just keep on creating indexes to make my queries go faster.
Right now the data is being stored on a rather slow computer for testing purposes so I understand that my queries are not nearly as fast as they could be (I have a 6 core, 64gig of ram, SSD machine arriving tomorrow which should help significantly)
That being said, I am running queries like this one
select time, value from ticks where symbol = "AAPL" AND date = 20150522 and type = 8 order by time asc
The query above, if I do not limit it, returns 12928 records for one of my test days and takes 10.2 seconds if I do it from cleared cache.
I am doing lots of graphing and eventually would like to be able to just query the data as I need to graph it. Right now I haven't noticed a lot of difference in speed between getting part of a day's worth of data vs just getting the entire day's. It would be cool to have those queries respond fast enough that there is barely any delay when I move to the next day/screen/whatever.
Another query I am using, for usability of a program I am writing to interact with the data, includes:
String query = "select distinct `date` from ticks where symbol = '" + symbol + "' order by `date` desc";
But most of my need is the ability to pull a certain type of data from a certain day for a certain symbol like my first query.
I've googled all over the place and I think I understand that creating tons of indexes makes the database bigger and slows down the input speed (I get about 300 pieces of information per second on a busy day). Should I just index each column individually?
I am willing to throw more hard drives at things if it means a responsive interface.
Basically, my questions relate to the creation/altering of my table. Based on the above query, can you think of anything I could do to make that faster? Or an indexing system that would help me out? Is InnoDB even the right engine? I tried googling this vs MyISAM and after a couple of hours of that, I still wasn't sure.
Thanks :)
Combine date and time into a DATETIME field
Assuming Price and Volume always come in together, put them together (2 columns) and get rid of type.
Get rid of the AUTO_INCREMENT; change to PRIMARY KEY(symbol, datetime)
Get rid of any indexes that are the left part of some other index.
Once you are using DATETIME, use date ranges to find everything in a single date (if you need such). Do not use DATE(datetime) = '...'; performance will be terrible (see the sketch after this list).
Symbol can probably be ascii, not utf8.
Use InnoDB, the clustering of the Primary Key can be beneficial.
Do you expect to collect (and use) more data than will fit in innodb_buffer_pool_size? If so, we need to discuss your SELECTs and look into PARTITIONing.
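A hedged sketch of the date-range advice above (dt is the assumed name of the combined DATETIME column, and price/volume the assumed 2-column replacement for value/type):
-- Fast: a half-open range on the DATETIME column can use PRIMARY KEY(symbol, dt)
SELECT dt, price, volume
FROM ticks
WHERE symbol = 'AAPL'
  AND dt >= '2015-05-22'
  AND dt <  '2015-05-23';

-- Slow: wrapping the column in a function defeats the index
-- ... WHERE symbol = 'AAPL' AND DATE(dt) = '2015-05-22'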
Make those changes, then come back for more advice/abuse.
You're creating a historical database, so MyISAM would work as well as InnoDB. InnoDB is a transactional relational database, and is better suited for relational databases with multiple tables that must remain synchronized.
Your Stock table looks like this.
Stock
-----
Stock ID (idticks)
Symbol
Date
Time
Value
Type
It would be better if you combine the date and time into a time stamp column, and unpack the types like this.
Stock
-----
Stock ID
Symbol
Time Stamp
Volume
Open
Close
Bid
Ask
...
This makes it easier for the database to return rows for a query on a particular type, like the close value.
As far as indexes, you can create as many indexes as you want. You're adding (inserting) information, so the increased time to add information is offset by the decreased time to query the information.
I'd have a primary index on Stock ID, and a unique index on Symbol and Time Stamp descending. You could also have indexes on the values you query most often, like Close.
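A hedged sketch of that layout (the column set and types are illustrative; note that the DESC attribute on index columns is only honored from MySQL 8.0 onward):
CREATE TABLE stock (
  stock_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  symbol   VARCHAR(30) NOT NULL,
  ts       DATETIME NOT NULL,
  volume   INT UNSIGNED DEFAULT NULL,
  `open`   DOUBLE DEFAULT NULL,
  `close`  DOUBLE DEFAULT NULL,
  bid      DOUBLE DEFAULT NULL,
  ask      DOUBLE DEFAULT NULL,
  PRIMARY KEY (stock_id),
  UNIQUE KEY uq_symbol_ts (symbol, ts),
  KEY idx_close (`close`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;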