How would you accomplish this task to get best performance?
Table schema:
CREATE TABLE `test_truck_report` (
`id` INT(11) NOT NULL AUTO_INCREMENT,
`truck_id` INT(11) NOT NULL,
`odometer_initial` INT(11) NOT NULL,
`odometer_final` INT(11) NOT NULL,
`fuel_initial` INT(11) NOT NULL,
`fuel_final` INT(11) NOT NULL,
PRIMARY KEY (`id`)
)
ENGINE=InnoDB;
What I'm trying to execute is this query:
SELECT
truck_id,
(odometer_final - odometer_initial) AS mileage,
(fuel_initial - fuel_final) AS consumed_fuel,
(consumed_fuel / mileage) AS consumption_per_km
FROM
test_truck_report
WHERE
consumption_per_km > 2
Somehow this seemingly obvious logic doesn't work, and I'm forced to use this query instead:
SELECT
truck_id,
(odometer_final - odometer_initial) AS mileage,
(fuel_initial - fuel_final) AS consumed_fuel,
((fuel_initial - fuel_final) / (odometer_final - odometer_initial)) AS consumption_per_km
FROM
test_truck_report
WHERE
((fuel_initial - fuel_final) / (odometer_final - odometer_initial)) > 2
I assume that recalculating each calculated field everywhere it is used causes a significant performance hit. And this is just a test case; the actual working table has 50+ fields, and some of the calculated fields consist of 10+ operands. So it's a really HUGE problem at the moment.
The reason why I don't want to actually create these fields and perform something like:
UPDATE
`test_truck_report`
SET
consumed_fuel = fuel_initial - fuel_final
is that existing records are constantly being updated by the users, so I would need to constantly update that data as well.
So do you consider creating actual fields a better idea? Or is there some better way?
Thanks.
Try to use views:
We need an auxiliary view:
CREATE OR REPLACE VIEW vw_truck_data AS
SELECT truck_id,
(odometer_final - odometer_initial) AS mileage,
(fuel_initial - fuel_final) AS consumed_fuel
FROM test_truck_report;
And the final view:
CREATE OR REPLACE VIEW vw_truck_consumption AS
SELECT data.*,
(data.consumed_fuel / data.mileage) AS consumption_per_km
FROM vw_truck_data data;
Now you can query whenever you want in an easy and readable way:
SELECT *
FROM vw_truck_consumption
WHERE consumption_per_km > 2
This way MySQL should only need to subtract each pair of fields once, so the performance should be at least as good as your solution, or better. Normally the CPU cost of a few arithmetic operations is smaller than the cost of retrieving data from the database, but of course it depends on your hardware, MySQL version, configuration and data distribution. Do some measurements if it is really an issue.
Anyway, remember that you are filtering by consumption_per_km, which is a function of several fields. Older MySQL versions lack functional indexes (indexable generated columns arrived in 5.7 and functional key parts in 8.0.13), so there the query will scan the full table and be slow on a large table.
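To try the two-view idea without touching a real server, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The function name trucks_over_threshold and the sample values are made up for illustration. SQLite has no CREATE OR REPLACE VIEW and does integer division on integers, so the sketch uses plain CREATE VIEW and multiplies by 1.0; neither change is needed in MySQL.

```python
import sqlite3


def trucks_over_threshold(rows, threshold=2):
    """Return (truck_id, mileage, consumed_fuel, consumption_per_km) rows above threshold."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE test_truck_report (
            id INTEGER PRIMARY KEY,
            truck_id INTEGER NOT NULL,
            odometer_initial INTEGER NOT NULL,
            odometer_final INTEGER NOT NULL,
            fuel_initial INTEGER NOT NULL,
            fuel_final INTEGER NOT NULL
        );
        -- Auxiliary view: each difference is written exactly once.
        CREATE VIEW vw_truck_data AS
        SELECT truck_id,
               odometer_final - odometer_initial AS mileage,
               fuel_initial - fuel_final AS consumed_fuel
        FROM test_truck_report;
        -- Final view reuses the aliases; 1.0 * forces real division in SQLite.
        CREATE VIEW vw_truck_consumption AS
        SELECT d.*, 1.0 * d.consumed_fuel / d.mileage AS consumption_per_km
        FROM vw_truck_data AS d;
    """)
    conn.executemany(
        "INSERT INTO test_truck_report "
        "(truck_id, odometer_initial, odometer_final, fuel_initial, fuel_final) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    result = conn.execute(
        "SELECT * FROM vw_truck_consumption "
        "WHERE consumption_per_km > ? ORDER BY truck_id",
        (threshold,),
    ).fetchall()
    conn.close()
    return result
```

Calling it with two sample reports, one consuming 30 units over 10 km and one consuming 10 units over 100 km, returns only the first truck. Whether the views are faster than the repeated expressions in MySQL is something to measure, not assume.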
I am having difficulties in optimizing this SQL statement in MySQL. I have two tables that are populated independently and so the times logged in each table's column will not be the same. What I want is a single table (view) that lists all the records in the sensor_history with the current process information that was present at the sensor's measurement_time. If a process log time was not present, I can live with NULLs in the process fields in the resulting view for that particular record.
What I have here works but it is brute force and woefully inefficient. There are about 500k records in the sensor_history table and about 20k records in the process_history table. I have tried getting my head around different join methods but I run into syntax issues or bad results. I have tried some online optimizers without success and so I am hoping someone here can point me in the right direction.
For simplicity, I have removed the foreign key relations to other tables. There are no indices in use but feel free to suggest any that may help. Here are the basics:
CREATE TABLE `sensor_history` (
`measurement_time_utc` int(11) NOT NULL,
`sensor_id` int(11) NOT NULL,
`sensor_measurement_x` double NOT NULL,
`sensor_measurement_y` double NOT NULL,
`sensor_measurement_z` double NOT NULL,
`sensor_quality` int(11) NOT NULL
);
CREATE TABLE `process_history` (
`log_time_utc` int(11) NOT NULL,
`process_id` int(11) NOT NULL,
`process_speed` double NOT NULL,
`process_load` int(11) NOT NULL
);
CREATE VIEW `rollup` AS SELECT
`sensor_history`.`measurement_time_utc`,
`sensor_history`.`sensor_id`,
`sensor_history`.`sensor_measurement_x`,
`sensor_history`.`sensor_measurement_y`,
`sensor_history`.`sensor_measurement_z`,
`sensor_history`.`sensor_quality`,
(SELECT `process_history`.`process_id` FROM `process_history` WHERE `sensor_history`.`measurement_time_utc`>=`process_history`.`log_time_utc` ORDER BY `process_history`.`log_time_utc` DESC LIMIT 1) AS `process_id`,
(SELECT `process_history`.`process_speed` FROM `process_history` WHERE `sensor_history`.`measurement_time_utc`>=`process_history`.`log_time_utc` ORDER BY `process_history`.`log_time_utc` DESC LIMIT 1) AS `process_speed`,
(SELECT `process_history`.`process_load` FROM `process_history` WHERE `sensor_history`.`measurement_time_utc`>=`process_history`.`log_time_utc` ORDER BY `process_history`.`log_time_utc` DESC LIMIT 1) AS `process_load`
FROM `sensor_history`;
How can I make a more efficient rollup view? Thanks in advance.
Views are really hard to optimize in MySQL. Your best hope is for an index on:
process_history(log_time_utc, process_id, process_speed, process_load)
The last three columns are included so the index covers all three subqueries and doesn't need to refer to the data pages.
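As a rough illustration, the sketch below builds a cut-down version of the two tables in SQLite (through Python's sqlite3, standing in for MySQL), adds a covering index, and looks up the matching process row once per sensor row instead of three times. This is a different formulation from the view in the question: it joins on the latest log_time_utc at or before the measurement, and it assumes log_time_utc is unique in process_history (otherwise rows would be duplicated). The names build_rollup and ix_process_cover are hypothetical.

```python
import sqlite3


def build_rollup(sensor_rows, process_rows):
    """Attach the latest process row at or before each sensor measurement."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sensor_history (
            measurement_time_utc INTEGER NOT NULL,
            sensor_id INTEGER NOT NULL
        );
        CREATE TABLE process_history (
            log_time_utc INTEGER NOT NULL,
            process_id INTEGER NOT NULL,
            process_speed REAL NOT NULL,
            process_load INTEGER NOT NULL
        );
        -- Covering index: the time lookup and all selected columns live in the index.
        CREATE INDEX ix_process_cover
            ON process_history (log_time_utc, process_id, process_speed, process_load);
    """)
    conn.executemany("INSERT INTO sensor_history VALUES (?, ?)", sensor_rows)
    conn.executemany("INSERT INTO process_history VALUES (?, ?, ?, ?)", process_rows)
    result = conn.execute("""
        SELECT s.measurement_time_utc, s.sensor_id,
               p.process_id, p.process_speed, p.process_load
        FROM sensor_history AS s
        LEFT JOIN process_history AS p
          ON p.log_time_utc = (SELECT MAX(log_time_utc)
                               FROM process_history
                               WHERE log_time_utc <= s.measurement_time_utc)
        ORDER BY s.measurement_time_utc
    """).fetchall()
    conn.close()
    return result
```

Sensor rows that precede every process log come back with NULLs (None in Python) in the process columns, which matches what the question said was acceptable.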
While you are trying to figure out what the Analysts really need, let's do some improvements that are easier to do now than later.
DOUBLE takes 8 bytes and delivers about 16 significant digits. That is gross overkill for every sensor I have heard of. Consider the 4-byte FLOAT, which gives you about 7 significant digits.
(Where am I going with this? Capturing "sensor" data keeps coming, and it eventually fills up disk and that makes it slow. So, let's shrink things soon.)
INT is 4 bytes and has a range of +/- 2 billion. Are you expecting that many sensors? How about a 1-byte TINYINT UNSIGNED with a range of 0..255? Or SMALLINT UNSIGNED (2 bytes, range 0..65535)? Ditto for any other ids.
Or... Do you really need to save all the data? Maybe day-old data can be summarized down to hourly min, max, avg, etc? And month-old data is needed only to a day's resolution?
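To make the summarizing idea concrete, here is a small sketch of an hourly min/max/avg rollup for one measurement column. It runs against SQLite through Python's sqlite3 purely so it can be executed anywhere; the function name hourly_summary and the three-column table are simplifications I made up. In MySQL the bucket expression would more likely be written with DIV or FLOOR, since / does not truncate there.

```python
import sqlite3


def hourly_summary(rows):
    """Collapse raw readings into one row per sensor per hour."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE sensor_history (
            measurement_time_utc INTEGER NOT NULL,
            sensor_id INTEGER NOT NULL,
            sensor_measurement_x REAL NOT NULL
        )
    """)
    conn.executemany("INSERT INTO sensor_history VALUES (?, ?, ?)", rows)
    result = conn.execute("""
        SELECT sensor_id,
               (measurement_time_utc / 3600) * 3600 AS hour_start_utc,  -- integer division
               MIN(sensor_measurement_x),
               MAX(sensor_measurement_x),
               AVG(sensor_measurement_x),
               COUNT(*)
        FROM sensor_history
        GROUP BY sensor_id, measurement_time_utc / 3600
        ORDER BY sensor_id, hour_start_utc
    """).fetchall()
    conn.close()
    return result
```

The output of a query like this could be written to a much smaller summary table, after which the raw rows for that period can be dropped if nobody needs them.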
We have lots to discuss once your analysts explain to you what they do want. Then you need to read between the lines to see what they will want. (I can help there, too.)
For example, I have some GPS devices that send info to my database every second,
so 1 device creates 1 row per second in the MySQL database with these 8 columns:
id=12341 date=22.02.2018 time=22:40
latitude=22.236558789 longitude=78.9654582 deviceID=24 name=device-name someinfo=asdadadasd
So in 1 minute it creates 60 rows, in 24 hours 86400 rows,
and in 1 month (31 days) 2678400 rows.
So 1 device creates about 2.6 million rows per month in my table (records are deleted every month).
With more devices, that becomes 2.6 million * number of devices.
My questions are:
Question 1: If I run a search like this from PHP (just for the current day and for 1 device):
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24'
the maximum possible result is 86400 rows.
Will it overload my server too much?
Question 2: If I limit it to 5 hours (18000 rows), will that be a problem for the database? Will it load the server as much as the first example, or less?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 18000
Question 3: If I show just 1 result from the database, will it overload the server?
SELECT * FROM TABLE WHERE date='22.02.2018' AND deviceID= '24' LIMIT 1
Does it mean that, with millions of rows in the table, fetching 1000 rows loads the server the same as fetching just 1 result?
Millions of rows is not a problem, this is what SQL databases are designed to handle, if you have a well designed schema and good indexes.
Use proper types
Instead of storing your dates and times as separate strings, store them either as a single datetime or as separate date and time types. See indexing below for more about which one to use. This is more compact, allows indexing, sorts faster, and makes date and time functions available without conversions.
Similarly, be sure to use the appropriate numeric type for latitude, and longitude. You'll probably want to use numeric to ensure precision.
Since you're going to be storing billions of rows, be sure to use a bigint for your primary key. A regular int can only go up to about 2 billion.
Move repeated data into another table.
Instead of storing information about the device in every row, store that in a separate table. Then only store the device's ID in your log. This will cut down on your storage size, and eliminate mistakes due to data duplication. Be sure to declare the device ID as a foreign key, this will provide referential integrity and an index.
Add indexes
Indexes are what allow a database to search through millions or billions of rows very, very efficiently. Be sure there are indexes on the columns you search on frequently, such as your timestamp.
A lack of indexes on date and deviceID is likely why your queries are so slow. Without an index, MySQL has to look at every row in the table, known as a full table scan.
You can discover whether your queries are using indexes with explain.
datetime or time + date?
Normally it's best to store your date and time in a single column, conventionally called created_at. Then you can use date to get just the date part like so.
select *
from gps_logs
where date(created_at) = '2018-07-14'
There's a problem. The problem is how indexes work... or don't. Because of the function call, where date(created_at) = '2018-07-14' will not use an index. MySQL will run date(created_at) on every single row. This means a performance killing full table scan.
You can work around this by working with just the datetime column. This will use an index and be efficient.
select *
from gps_logs
where '2018-07-14 00:00:00' <= created_at and created_at < '2018-07-15 00:00:00'
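You can watch this difference in a query plan. The sketch below uses SQLite through Python's sqlite3 as a stand-in, so the plan text is SQLite's EXPLAIN QUERY PLAN output and not MySQL's explain, but the behaviour it illustrates is the same: wrapping the indexed column in a function forces a scan, while a bare range comparison can search the index. The names compare_plans and ix_created_at are made up for the example.

```python
import sqlite3


def plan_details(conn, sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


def compare_plans():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE gps_logs (
            id INTEGER PRIMARY KEY,
            gps_device_id INTEGER NOT NULL,
            created_at TEXT NOT NULL,
            latitude REAL NOT NULL,
            longitude REAL NOT NULL
        );
        CREATE INDEX ix_created_at ON gps_logs (created_at);
    """)
    # Function call on the column: the index on created_at cannot be used.
    wrapped = plan_details(
        conn, "SELECT * FROM gps_logs WHERE date(created_at) = '2018-07-14'")
    # Bare column compared against a half-open range: the index is searchable.
    bare = plan_details(
        conn,
        "SELECT * FROM gps_logs "
        "WHERE '2018-07-14 00:00:00' <= created_at "
        "AND created_at < '2018-07-15 00:00:00'")
    conn.close()
    return wrapped, bare


if __name__ == "__main__":
    for label, plan in zip(("date(created_at)", "bare range"), compare_plans()):
        print(label, "->", plan)
```

The first plan should report a SCAN of gps_logs and the second a SEARCH using ix_created_at; the exact wording varies between SQLite versions.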
Or you can split your single datetime column into date and time columns, but this introduces new problems. Querying ranges which cross a day boundary becomes difficult. Like maybe you want a day in a different time zone. It's easy with a single column.
select *
from gps_logs
where '2018-07-12 10:00:00' <= created_at and created_at < '2018-07-13 10:00:00'
But it's more involved with a separate date and time.
select *
from gps_logs
where (created_date = '2018-07-12' and created_time >= '10:00:00')
or (created_date = '2018-07-13' and created_time < '10:00:00');
Or you can switch to a database with expression indexes, like PostgreSQL. An expression index lets you index the result of a function or expression, such as just the date part of a timestamp. And PostgreSQL does a lot of things better than MySQL. This is what I recommend.
Do as much work in SQL as possible.
For example, if you want to know how many log entries there are per device per day, rather than pulling all the rows out and calculating them yourself, you'd use group by to group them by device and day.
select gps_device_id, count(id) as num_entries, created_at::date as day
from gps_logs
group by gps_device_id, day;
gps_device_id | num_entries | day
---------------+-------------+------------
1 | 29310 | 2018-07-12
2 | 23923 | 2018-07-11
2 | 23988 | 2018-07-12
With this much data, you will want to rely heavily on group by and the associated aggregate functions like sum, count, max, min and so on.
Avoid select *
If you must retrieve 86400 rows, simply fetching all that data from the database can be costly. You can speed this up significantly by only fetching the columns you need. This means using select only, the, specific, columns, you, need rather than select *.
Putting it all together.
In PostgreSQL
Your schema in PostgreSQL should look something like this.
create table gps_devices (
id serial primary key,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigserial primary key,
gps_device_id int references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
create index date_and_device on gps_logs((created_at::date), gps_device_id);
A query can generally only use one index per table. Since you'll be searching on the timestamp and device ID together a lot, timestamp_and_device indexes both the timestamp and the device ID.
date_and_device is the same thing, but it's an expression index on just the date part of the timestamp. This will make where created_at::date = '2018-07-12' and gps_device_id = 42 very efficient.
In MySQL
create table gps_devices (
id int primary key auto_increment,
name text not null
-- any other columns about the devices
);
create table gps_logs (
id bigint primary key auto_increment,
gps_device_id int references gps_devices(id),
foreign key (gps_device_id) references gps_devices(id),
created_at timestamp not null default current_timestamp,
latitude numeric(12,9) not null,
longitude numeric(12,9) not null
);
create index timestamp_and_device on gps_logs(created_at, gps_device_id);
Very similar, but no expression index (MySQL only gained functional key parts in 8.0.13). So on older versions you'll either need to always use a bare created_at in your where clauses, or switch to separate date and time types.
I just read your question. For me the answer is:
Just create a separate table for latitude and longitude, make your ID a foreign key, and save the values there.
Without knowing the exact queries you want to run I can just guess the best structure. Having said that, you should aim for the optimal types that use the minimum number of bytes per row. This should make your queries faster.
For example, you could use the structure below:
create table device (
id int primary key not null,
name varchar(20),
someinfo varchar(100)
);
create table location (
device_id int not null,
recorded_at timestamp not null,
latitude double not null, -- instead of varchar; maybe float?
longitude double not null, -- instead of varchar; maybe float?
foreign key (device_id) references device (id)
);
create index ix_loc_dev on location (device_id, recorded_at);
If you include the exact queries (naming the columns) we can create better indexes for them.
Since your query selectivity is probably bad, your queries may run full table scans. For that case I took it a step further and used the smallest possible data types for the columns, so the scans will be faster:
create table location (
device_id tinyint not null, -- device.id must then be tinyint too, or MySQL rejects the foreign key
recorded_at timestamp not null,
latitude float not null,
longitude float not null,
foreign key (device_id) references device (id)
);
Can't really think of anything smaller than this.
The best I can recommend is to use a time-series database for storing and accessing time-series data. You can host any kind of time-series database engine locally and just put a little more effort into developing its access methods, or use a specialized database for telematics data like this.
I'm not new to MySQL, but I'm definitely way in over my head here.
I'd like to show a table of differences in temperatures for Panama and Belize based on date and atmospheric level. The query is supposed to match the Panama and Belize data based on date and atmospheric level and return the top 30 differences, ordered by the extent of the differences.
However, it is incredibly slow (over 30s) so it times out. Some other queries that I've written for this dataset are also very slow (about 26s). But if I only run the subqueries, they take only 1.7s or so. I should note that both of the tables below are over 440,000 rows long, though I don't think that's very large. The problem is probably the way that I'm joining the tables or the way that I'm creating the subqueries.
Here's my setup: (It's the SQL from the exported tables. I'm omitting some columns)
/**The table for Panama weather data */
CREATE TABLE `panama_weather_data` (
`Id` varchar(40) NOT NULL,
`OwmPackageId` varchar(30) NOT NULL,
`Level` FLOAT DEFAULT NULL,
`Dt` date DEFAULT NULL,
`Temperature` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `panama_weather_data`
ADD PRIMARY KEY (`Id`) USING BTREE;
COMMIT;
/**The table for Belize weather data*/
CREATE TABLE `belize_weather_data` (
`Id` varchar(40) NOT NULL,
`OwmPackageId` varchar(30) NOT NULL,
`Level` FLOAT DEFAULT NULL,
`Dt` date DEFAULT NULL,
`Temperature` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `belize_weather_data`
ADD PRIMARY KEY (`Id`) USING BTREE;
COMMIT;
/**Code to populate the tables here*/
And here's my query:
SELECT ABS(PanamaTemperature-BelizeTemperature) AS TemperatureDif,
PanamaAtmostphericLevel, PanamaTable.Dt
FROM
(SELECT CAST(panama_weather_data.Dt AS DATETIME) AS Dt,
panama_weather_data.Level AS PanamaAtmostphericLevel,
panama_weather_data.Temperature AS PanamaTemperature
FROM panama_weather_data
WHERE panama_weather_data.OwmPackageId = 'openweathermappkg19758' )
AS PanamaTable
JOIN
(SELECT CAST(belize_weather_data.Dt AS DATETIME) AS Dt,
belize_weather_data.Level AS BelizeAtmosphericLevel,
belize_weather_data.Temperature AS BelizeTemperature
FROM belize_weather_data
WHERE belize_weather_data.OwmPackageId = 'openweathermappkg19758' )
AS BelizeTable
ON PanamaAtmostphericLevel = BelizeAtmosphericLevel
AND PanamaTable.Dt = BelizeTable.Dt
ORDER BY TemperatureDif
LIMIT 30
My question is really: Is there any way to optimize this query and make it less painful?
CAST(panama_weather_data.Dt AS DATETIME) AS Dt
Why? (all this will do is slow down the query)
Is there any way to optimize this query
The SQL SELECT statement you have shown us certainly would not be my starting point. However you did not tell us how you intend to query the data in future. Specifically, are you really going to examine all of the data each time you run a query?
Your biggest win comes from not keeping the data in separate tables - it should be a single table with different attributes for the two datasets.
After that, the next biggest improvement would come from storing the temperature difference in the table and indexing it.
A way to increase speed drastically in SQL databases is to use indices. This is a tradeoff between disk space and query performance.
To find out where to put indices, search for the conditions that limit your result sets the most. In your case, you probably have a few hundred thousand rows for both tables, but you only want 30 of those, whose Atmospheric Levels and date are equal. You probably want to put an index on those two columns like so:
CREATE INDEX level_date_panama ON panama_weather_data (Level, Dt);
CREATE INDEX level_date_belize ON belize_weather_data (Level, Dt);
Please tell me if this increases your performance.
You could do a few things to possibly improve performance here:
1. Remove the subqueries.
From what you posted I see no reason why the subqueries are necessary for the join. You could just as easily remove them and rewrite using the actual column names in place of where you wrote the AS values.
2. Input your Dt data as a Datetime
A CAST is not a particularly expensive operator, but does take time to complete. If you are only using these columns as Datetimes, you should be entering them as such and change the column type to a Datetime. You could directly compare these values instead of having to cast them.
3. Compare Dt as a Date
Going off of (2), if all your Dt values are Dates, casting them to Datetimes won't be doing anything to the value, so just compare on the natural Date type.
4. Index
If the above is not possible due to outside constraints, create an index based on how you are joining, this would be a column used in your ON clause.
What kind of values are in id? Perhaps you can get rid of id, and use PRIMARY KEY(level, dt)?
Why is level a FLOAT? If they are really "floating" values, then is it realistic for both tables to have the same values? I guess they are feet or meters above sea level? In which case, won't MEDIUMINT UNSIGNED suffice?
Then...
SELECT ABS(p.Temperature - b.Temperature) AS TemperatureDif,
p.Level,
p.Dt
FROM panama_weather_data AS p
JOIN belize_weather_data AS b
USING (OwmPackageId, Level, Dt)
WHERE p.OwmPackageId = 'openweathermappkg19758'
ORDER BY TemperatureDif DESC
LIMIT 30;
You will need
INDEX(OwmPackageId, Level, Dt)
with those columns in any order, and on either (or both) tables.
As already mentioned, no CAST is needed. However, if you need some format other than "2017-08-13 10:04:12", then use DATE_FORMAT(...) in the SELECT clause (not the USING clause).
Rather than having two 'identical' tables, consider having one table with an extra column for which country is involved. This would make it easy to extend to an arbitrary number of locations. The SELECT would need to be a "self join" and the syntax would be slightly different.
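Here is a rough sketch of what the single-table layout and its self join could look like, run against SQLite through Python's sqlite3 so it is easy to try. The table name weather_data, the Country column and the function top_differences are my own placeholders, and Level is stored as an integer on the assumption suggested above that it does not need to be a FLOAT.

```python
import sqlite3


def top_differences(rows, package_id, limit=30):
    """Largest Panama/Belize temperature gaps for one package, biggest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE weather_data (
            Country TEXT NOT NULL,        -- hypothetical extra column, one value per location
            OwmPackageId TEXT NOT NULL,
            Level INTEGER NOT NULL,
            Dt TEXT NOT NULL,
            Temperature REAL,
            PRIMARY KEY (Country, OwmPackageId, Level, Dt)
        )
    """)
    conn.executemany("INSERT INTO weather_data VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute("""
        SELECT ABS(p.Temperature - b.Temperature) AS TemperatureDif,
               p.Level,
               p.Dt
        FROM weather_data AS p
        JOIN weather_data AS b
          ON b.OwmPackageId = p.OwmPackageId
         AND b.Level = p.Level
         AND b.Dt = p.Dt
        WHERE p.Country = 'Panama'
          AND b.Country = 'Belize'
          AND p.OwmPackageId = ?
        ORDER BY TemperatureDif DESC
        LIMIT ?
    """, (package_id, limit)).fetchall()
    conn.close()
    return result
```

The composite primary key doubles as the index the join needs, and comparing a third location only means changing the two Country literals.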
I was reading Django Book and came across interesting statement.
Notice that Django doesn’t use SELECT * when looking up data and instead lists
all fields explicitly. This is by design:
in certain circumstances SELECT * can be slower,
I got this from http://www.djangobook.com/en/1.0/chapter05/
So my question is: can someone explain to me why SELECT * can be slower than listing every single column explicitly? It would be good if you could give me some examples.
Or if you think the opposite (it doesn't matter), can you explain why?
Update:
That's the table :
BEGIN;
CREATE TABLE "books_publisher" (
"id" serial NOT NULL PRIMARY KEY,
"name" varchar(30) NOT NULL,
"address" varchar(50) NOT NULL,
"city" varchar(60) NOT NULL,
"state_province" varchar(30) NOT NULL,
"country" varchar(50) NOT NULL,
"website" varchar(200) NOT NULL
);
And that's how Django will call SELECT * FROM books_publisher:
SELECT
id, name, address, city, state_province, country, website
FROM books_publisher;
Performance will matter only if you are selecting fewer columns than there are in the table.
I am not sure about how Django works; but in some languages/ db drivers "select *" will cause an error if you change the table schema (say add a new column). This is because the DB driver "caches" the table schema and now its internal schema does not match the table schema.
If you have 100 columns, SELECT * will return the data for all columns. Listing the columns explicitly will reduce the columns returned, therefore reducing the amount of data transmitted between the server and application.
This is clearly not faster in many cases, and when one of them is faster, it is by a slight margin. Check for yourself by benchmarking a lot of queries :)
It might be faster to select only some columns in some cases, including when you select only columns that are in a composite index, avoiding the need to read the whole row, and also when you avoid accessing BLOB or TEXT columns on MySQL.
And naturally, if you select fewer columns you will transfer less data between MySQL and your application.
I think in this exact case there will be no performance difference; this is exactly what "in certain circumstances SELECT * can be slower" is all about.
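The composite-index case is easy to see in a query plan. The sketch below uses SQLite through Python's sqlite3 as a stand-in (the plan wording is SQLite's, not MySQL's), with a hypothetical index ix_name_city on the publisher table from the question. When only indexed columns are selected, the engine can answer from the index alone; with SELECT * it also has to visit the table row.

```python
import sqlite3


def publisher_plans():
    """Return (plan for SELECT *, plan for SELECT name, city) as strings."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE books_publisher (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            address TEXT NOT NULL,
            city TEXT NOT NULL,
            state_province TEXT NOT NULL,
            country TEXT NOT NULL,
            website TEXT NOT NULL
        );
        -- Hypothetical composite index, not part of the original schema.
        CREATE INDEX ix_name_city ON books_publisher (name, city);
    """)

    def plan(sql):
        rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
        return " | ".join(row[3] for row in rows)

    star = plan("SELECT * FROM books_publisher WHERE name = 'Apress'")
    narrow = plan("SELECT name, city FROM books_publisher WHERE name = 'Apress'")
    conn.close()
    return star, narrow


if __name__ == "__main__":
    star, narrow = publisher_plans()
    print("SELECT *          ->", star)
    print("SELECT name, city ->", narrow)
```

Only the second plan should mention a COVERING INDEX. On a seven-column table with short strings the saving is small, which fits the point above that the difference only shows up in certain circumstances.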
This might be a basic question: I am using a temporary table in some of my php code like so:
CREATE TEMPORARY TABLE ttable( `d` DATE NOT NULL , `p` DECIMAL( 11, 2 ) NOT NULL , UNIQUE KEY `date` ( `d` ) );
INSERT INTO ttable( d, p ) VALUES ( '$d' , '$p' );
SELECT * FROM ttable;
As we scale up our site, will this ever be a problem? I.e., will user1's ttable and user2's ttable ever get mixed up, so that user1 sees user2's ttable and vice versa? Is it better to create a unique name for each temporary table?
thx
Temporary tables are session-specific. Every time you connect to a host (in PHP, this was done with mysql_connect; current PHP versions use mysqli or PDO instead), temporary tables that you create exist only within that session/connection.
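The same per-connection behaviour can be tried locally with SQLite, whose temporary tables are also private to the connection that created them. This is only an analogy for MySQL sessions, not a test of MySQL itself; the function name and the sample values are invented for the sketch. Two connections open the same database file, each creates a temporary table called ttable, and each sees only its own row.

```python
import os
import sqlite3
import tempfile


def temp_tables_are_private():
    """Two connections, same database file, same temp table name, separate data."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.db")
        user1 = sqlite3.connect(path)
        user2 = sqlite3.connect(path)
        try:
            for conn, price in ((user1, "10.50"), (user2, "99.99")):
                # Same table name on both connections; no clash occurs.
                conn.execute(
                    "CREATE TEMPORARY TABLE ttable "
                    "(d TEXT NOT NULL UNIQUE, p TEXT NOT NULL)")
                conn.execute(
                    "INSERT INTO ttable (d, p) VALUES (?, ?)",
                    ("2018-07-14", price))
            return (
                user1.execute("SELECT p FROM ttable").fetchall(),
                user2.execute("SELECT p FROM ttable").fetchall(),
            )
        finally:
            user1.close()
            user2.close()
```

So a unique table name per user is not needed for isolation; the thing to watch is connection reuse, since a pooled or persistent connection keeps its temporary tables until it is closed.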
It is almost always better to find a different way than using temporary tables.
The only time I would consider them is under the following conditions:
The activity is rare. Meaning, a given user MIGHT do this once a week.
It is used as a holding container prior to doing a regular full import of data.
It deals with data whose structure is unknown prior to being filled.
All three of those really go with building some type of generic bulk import routines where the data mapping is defined at run time.
If you find yourself creating temp tables frequently in the application, there's probably a better way.
Scalability is going to depend on the amount of data being loaded and frequency of temp table usage. For a low trafficked site it might be okay.
We're in the process of ripping out a ton of temp table usage from a client's app. 90% of the queries in their system result in a temp table being created. Analysis of all the queries has shown that the original dev used this mechanism simply because they didn't understand SQL. We're doing this because performance has radically dropped off as new users are added to the system.
Can you post a use case? Maybe we can help provide an alternate mechanism.
UPDATE:
Now that we have a use case, here is a simple table structure to accomplish what you need.
Table ZipCodes
ZipCode char(5) [or char(10) depending on need]
CityName varchar(50)
*other columns as necessary such as latitude or whatever.
Table TempReadings
ZipCode char(5) [foreign key to the ZipCode table]
ReadingDate datetime
Temperature float (or some equivalent)
To get all the temp readings for a given zip code you would do something like:
select ZipCode, ReadingDate, Temperature
from TempReadings
If you need info from the main ZipCodes table:
select Z.ZipCode, Z.CityName, TR.ReadingDate, TR.Temperature
from ZipCodes Z
inner join TempReadings TR on (TR.ZipCode = Z.ZipCode)
Add where clauses as necessary. Note that none of the above requires having a separate table per zip code.