I am working with a database full of songs, with titles and durations.
I need to return all songs with a duration greater than 29:59 (MM:SS).
The data is formatted in two different ways.
Format 1
Most of the data in the table is formatted as MM:SS; songs longer than 60 minutes simply carry a larger minutes value, for example 72:15.
Format 2
Other songs in the table are formatted as HH:MM:SS, where the example given for Format 1 would instead be 01:12:15.
I have tried two different types of queries to solve this problem.
Query 1
The following query returns all of the values I want for Format 1, but I could not find a way to include the values for Format 2.
select title, duration from songs where
time(cast(duration as time)) >
time(cast('29:59' as time))
Query 2
With the next query, I hoped to use the format specifiers in str_to_date to locate those results with the format HH:MM:SS, but instead I received results such as 3:50. The interpreter is assuming that all of the data is of the form HH:MM, and I do not know how to tell it otherwise without ruining the results.
select title, duration from songs where
time(cast(str_to_date(duration, '%H:%i:%s') as time)) >
time(cast(str_to_date('00:29:59', '%H:%i:%s') as time))
I've tried changing the specifiers in the first call to str_to_date to %i:%s, which gives me all values greater than 29:59, but none greater than 59:59. This is worse than the original query. I've also tried 00:%i:%s and '00:' || duration, '%H:%i:%s'. These two in particular would ruin the results anyway, but I'm just fiddling at this point.
I'm thoroughly stumped, but I'm sure the solution is an easy one. Any help is appreciated.
EDIT: Here is some data requested from the comments below.
Results from SHOW CREATE TABLE:
CREATE TABLE `songs` (
`song_id` int(11) NOT NULL,
`title` varchar(100) NOT NULL,
`duration` varchar(20) DEFAULT NULL,
PRIMARY KEY (`song_id`),
UNIQUE KEY `songs_uq` (`title`,`duration`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
Keep in mind, there are more columns than I described above, but I left some out for the sake of simplicity. I will also leave them out in the sample data.
Sample Data
title                 duration
(Allegro Moderato)    3:50
Agatha                1:56
Antecessor Machine    06:16
Very Long Song        01:24:16
Also Very Long        2:35:22
You are storing unstructured data in a relational database. And that is making you unhappy. So structure it.
Either add a TIME column, or copy song_id into a parallel time table on the side that you can JOIN against. Select all the two-colon durations and trivially update TIME. Repeat, prepending '00:' to all the one-colon durations. Now you have parsed all rows, and can safely ignore the duration column.
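A minimal sketch of that cleanup, using the schema from the question. One substitution: I route the one-colon rows through SEC_TO_TIME instead of literally prepending '00:', because an oversized minutes value such as '72:15' would become '00:72:15', which is not a valid TIME.
ALTER TABLE songs ADD COLUMN duration_time TIME NULL;

-- Two-colon rows are already HH:MM:SS.
UPDATE songs
SET duration_time = CAST(duration AS TIME)
WHERE duration LIKE '%:%:%';

-- One-colon rows are MM:SS; go through seconds so '72:15' lands as 01:12:15.
UPDATE songs
SET duration_time = SEC_TO_TIME(SUBSTRING_INDEX(duration, ':', 1) * 60
                                + SUBSTRING_INDEX(duration, ':', -1))
WHERE duration LIKE '%:%' AND duration NOT LIKE '%:%:%';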
Ok, fine, I suppose you could construct a VIEW that offers UNION ALL of those two queries, but that is slow and ugly, much better to fix the on-disk data.
Forget times. Convert to seconds. Here is one way:
select s.*
from (select s.*,
             (substring_index(duration, ':', -1) + 0 +
              substring_index(substring_index(duration, ':', -2), ':', 1) * 60 +
              (case when duration like '%:%:%'
                    then substring_index(duration, ':', 1) * 60 * 60
                    else 0
               end)
             ) as duration_seconds
      from songs s
     ) s
where duration_seconds > 29*60 + 59;
After some research I have come up with an answer of my own that I am happy with.
select title, duration
from songs
where case when length(duration) - length(replace(duration, ':', '')) = 1
           then time_to_sec(duration) > time_to_sec('29:59')
           else time_to_sec(duration) > time_to_sec('00:29:59')
      end
Thank you to Gordon Linoff for suggesting that I convert the times to seconds. This made things much easier. I just found his solution a bit overcomplicated, and it reinvents the wheel by not using time_to_sec.
Output Data
title                    duration
21 Album Mix Tape        45:40
Act 1                    1:20:25
Act 2                    1:12:05
Agog Opus I              30:00
Among The Vultures       2:11:00
Anabasis                 1:12:00
Avalanches Mixtape       60:00
Beautiful And Timeless   73:46
Beggars Banquet Tracks   76:07
Bonus Tracks             68:55
Chindogu                 66:23
Spun                     101:08
Note: Gordon mentioned his reason for not using time_to_sec was to account for songs greater than 23 hours long. After testing, I found that time_to_sec does support hours larger than 23, just as it supports minutes greater than 59.
It is also perfectly fine with other non-conforming formats such as 1:4:32 (i.e. 01:04:32).
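For example (values from that testing):
SELECT TIME_TO_SEC('72:15');   -- 260100: read as 72 hours 15 minutes
SELECT TIME_TO_SEC('1:4:32');  -- 3872: same as '01:04:32'
Note that MySQL reads a one-colon string as HH:MM rather than MM:SS, but since both sides of the comparison in the CASE are read the same way, the results still come out correct.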
I want to populate a table (cust_id, date_id) with randomly generated content.
For the cust_id I am using
rand()*1000
For the datetime I am generating random dates over the past three years as follows:
CONCAT(ROUND(RAND()*-3) + YEAR(NOW()),"-",ROUND(RAND()*11) + 1,"-",ROUND(RAND()*27) + 1)
Then I am generating many instances by creating inner joins from a table with 10 numbers
FROM numbers JOIN number n2 JOIN number n3
Putting it all together I run
INSERT INTO orders (cust_id, date_id)
SELECT ROUND(RAND()*1000) AS cust_id,
       CONVERT(DATETIME, (CONCAT(ROUND(RAND()*-3) + YEAR(NOW()),"-",
                                 ROUND(RAND()*11) + 1,"-",
                                 ROUND(RAND()*27) + 1))) AS date_id
FROM numbers JOIN number n2 JOIN number n3;
I have played around with different conversion formats, and I have tried setting the result to a variable and casting that to DATETIME, but everything throws errors. I suspect that the problem is that MySQL is reading it as a function rather than a string. I have found a workaround that keeps the original datetime and uses intervals, but I would like to know what the issue with my initial approach is. Any insights would be appreciated.
I have your basic formula working using the DATE() function.
SELECT DATE(CONCAT(ROUND(RAND()*-3) + YEAR(NOW()),"-",
ROUND(RAND()*11) + 1,"-",
ROUND(RAND()*27) + 1))
Still, you're much better off using
SELECT CURDATE() - INTERVAL ROUND(RAND()*3*365.25) DAY
Why? If you leave leap-year February 29 out of your test data, you leave out something critical to test. And if you leave out days 29, 30, and 31 from all your test months, you may not get test coverage for end-of-month date arithmetic.
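Putting that together with your cross join, the whole insert might look like this (a sketch; I'm assuming the helper table is consistently named numbers and has ten rows, as described):
INSERT INTO orders (cust_id, date_id)
SELECT ROUND(RAND()*1000) AS cust_id,
       CURDATE() - INTERVAL ROUND(RAND()*3*365.25) DAY AS date_id
FROM numbers n1
JOIN numbers n2
JOIN numbers n3;  -- 10 * 10 * 10 = 1000 rows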
I am trying to set up a MySQL database using phpMyAdmin. Before I get too far into it, I want advice on setting it up and querying it. I set up the table like this.
id: primary key
time_in: date
time_out: date
task: varchar (128)
business: varchar (128)
All I need it to do is keep track of how much time is spent on each task and for which business. Is this a good way of doing it, or is there a better way?
If this is correct, then I am trying to figure out how to query the time. This is what I have come up with as a query, but it is far from what I want.
SELECT `Task`,`Business`, (SELECT `Time-Out` - `Time-In`) as `total time` FROM `Sheet`
Is there a way to convert total time into a more readable format?
Unless you're tracking time in days, I'd recommend using TIME or DATETIME for the time_in and time_out columns.
Personally, I'd probably make time_out nullable, to allow tracking current activity (something I've started, but not yet finished).
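For illustration, one possible layout along those lines (a sketch; names taken from the question):
CREATE TABLE sheet (
    id       INT AUTO_INCREMENT PRIMARY KEY,
    time_in  DATETIME NOT NULL,
    time_out DATETIME NULL,  -- NULL while the task is still in progress
    task     VARCHAR(128) NOT NULL,
    business VARCHAR(128) NOT NULL
);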
There's no need to use a sub-select for subtracting the timestamps, you can subtract those two columns inline as well (just drop the SELECT keyword there). For formatting, you could use the TIMEDIFF function:
SELECT '12:00:00' - '10:45:00';
-> 2
SELECT TIMEDIFF('12:00:00', '10:45:00');
-> '01:15:00'
That would make your query:
SELECT `Task`, `Business`, TIMEDIFF(`Time-Out`, `Time-In`) as `total time` FROM `Sheet`
If you do make time_out (or Time-Out) nullable, you'll need to take that into account in your query:
SELECT TIMEDIFF(NULL, '10:45:00');
-> NULL
So ongoing tasks would give a total time of NULL. If you want to know how long you've been working already, you can wrap it in an IFNULL function and get the current time in that case:
SELECT TIMEDIFF(IFNULL(`Time-Out`, NOW()), `Time-In`);
-> '01:15:00' if `Time-Out` is 12:00:00 and `Time-In` is 10:45:00
-> '02:05:13' if `Time-Out` is NULL and it's currently 12:50:13 (server time)
You will want to use TIMEDIFF(end_time, start_time) and TIME_TO_SEC(time) to convert the difference to a total number of seconds. You can then convert the seconds mathematically to whatever format you want.
So for the time of each task:
select ID
,task
,business
,time_to_sec(timediff(time_out, time_in)) as duration
from sheet
To aggregate by task and business:
select task
,business
,sum(time_to_sec(timediff(time_out,time_in))) as total_time
from sheet
group by task
,business
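If you want the aggregated total back in a readable HH:MM:SS form, you can wrap the sum in SEC_TO_TIME (an illustrative addition):
select task
      ,business
      ,sec_to_time(sum(time_to_sec(timediff(time_out, time_in)))) as total_time
from sheet
group by task
        ,business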
I have a table consisting of about 20 million rows, totalling approximately 2 GB. I need to select every nth row, leaving me with only a few hundred rows. But I cannot for the life of me figure out how to do it without getting a timeout.
ROW_NUMBER is not available, and keeping track of the current row number with a variable (e.g. @row) causes a timeout. I presume this is because it is still iterating over every row, but I'm not too sure. There's no integer index for me to use either; a DATETIME field is used instead. This is an example query using @row:
SET @row = 0;
SELECT `field` FROM `table` WHERE (@row := @row + 1) % 1555200 = 0;
Is there anything else I haven't tried?
Thanks in advance!
It's a tricky one, for sure. You could work out the minimum date and then use a DATEDIFF to get sequential values, but this probably isn't sargable (as below). For me, it took 18 seconds on a table with 16 million rows, but your mileage may vary.
EDIT: I should also add that this was with a nonclustered index scan against an index which included the date column (pretty sure this is forced by the function around the date, but perhaps someone with more knowledge can expand on this). After creating an index against that column, I got 12 seconds.
Try it out and let me know how it goes :)
DECLARE @n INT = 5;

SELECT DATEDIFF(DAY, first_date.min_date, DATE_COLUMN) AS ROWNUM
FROM ss.YOUR_TABLE
OUTER APPLY
    (SELECT MIN(a.DATE_COLUMN) AS min_date
     FROM ss.YOUR_TABLE a) first_date
WHERE DATEDIFF(DAY, first_date.min_date, DATE_COLUMN) % @n = 0
EDIT (again): I just noticed this has been accepted as an answer... In case anyone else comes across this, it probably shouldn't be. On review, this only works if your datetime field has one entry per day and the datetime is sequential (in that rows are added in the same order as the datetime, or the datetime is the primary key).
With the same caveats, you can change the DATEDIFF to use any unit (MONTH, YEAR, MINUTE, etc.) if you have one row added per unit of time, as sketched below.
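For example, with one row per minute, the same pattern would be (a sketch only; table and column names as above):
DECLARE @n INT = 5;

SELECT DATEDIFF(MINUTE, first_date.min_date, DATE_COLUMN) AS ROWNUM
FROM ss.YOUR_TABLE
OUTER APPLY
    (SELECT MIN(a.DATE_COLUMN) AS min_date
     FROM ss.YOUR_TABLE a) first_date
WHERE DATEDIFF(MINUTE, first_date.min_date, DATE_COLUMN) % @n = 0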
I'm looking for help on how to optimize (if possible) the performance of a SQL query used for reading wind information (see below), whether by changing the database structure, the query, or something else.
I use a hosted database to store a table with more than 800,000 rows with wind information (speed and direction). New data is added each minute from an anemometer. The database is accessed using a PHP script which creates a web page for plotting the data using Google's visualization API.
The web page takes approximately 15 seconds to load. I've added some time measurements in both the PHP and Javascript part to profile the code and find possible areas for improvements.
One part where I hope to improve is the following query, which takes approximately 4 seconds to execute. The purpose of the query is to group the wind speed readings (min/max/mean) into 15-minute windows and calculate the mean value and overall min/max for each window.
SELECT AVG(d_mean) AS group_mean,
       MAX(d_max) AS group_max,
       MIN(d_min) AS group_min,
       dir,
       FROM_UNIXTIME(MAX(dt),'%Y-%m-%d %H:%i') AS group_dt
FROM (
    SELECT @i:=@i+1,
           FLOOR(@i/15) AS group_id,
           CAST(mean AS DECIMAL(3,1)) AS d_mean,
           CAST(min AS DECIMAL(3,1)) AS d_min,
           CAST(max AS DECIMAL(3,1)) AS d_max,
           dir,
           UNIX_TIMESTAMP(STR_TO_DATE(dt, '%Y-%m-%d %H:%i')) AS dt
    FROM table, (SELECT @i:=-1) VAR_INIT
    ORDER BY id DESC
) AS T
GROUP BY group_id
LIMIT 0, 360
...
$oResult = mysql_query($sSQL);
The table has the following structure:
1  ID    int(11)      AUTO_INCREMENT
2  mean  varchar(5)   utf8_general_ci
3  max   varchar(5)   utf8_general_ci
4  min   varchar(5)   utf8_general_ci
5  dt    varchar(20)  utf8_general_ci  // Date and time
6  dir   varchar(5)   utf8_general_ci
The following setup is used:
Database: MariaDB, 5.5.42-MariaDB-1~wheezy
Database client version: libmysql - 5.1.66
PHP version: 5.6
PHP extension: mysqli
I strongly agree with the comments so far -- Cleanse the data as you put it into the table.
Once you have done the cleansing, let's avoid the subquery by doing...
SELECT MIN(dt) as 'Start of 15 mins',
FORMAT(AVG(mean), 1) as 'Avg wind speed',
...
FROM table
GROUP BY FLOOR(UNIX_TIMESTAMP(dt) / 900)
ORDER BY FLOOR(UNIX_TIMESTAMP(dt) / 900);
I don't understand the purpose of the LIMIT. I'll guess that you want to see a few days at a time. For that, I recommend you add the following (after cleansing) between the FROM and the GROUP BY:
WHERE dt >= '2015-04-10'
AND dt < '2015-04-10' + INTERVAL 7 DAY
That would show 7 days, starting '2015-04-10' morning.
In order to handle a table of 800K, you would decidedly need (again, after cleansing):
INDEX(dt)
To cleanse the 800K rows, there are multiple approaches. I suggest creating a new table, copying the data in, testing, and eventually swapping over. Something like...
CREATE TABLE new (
dt DATETIME,
mean FLOAT,
...
PRIMARY KEY(dt) -- assuming you have only one row per minute?
) ENGINE=InnoDB;
INSERT INTO new (dt, mean, ...)
SELECT str_to_date(...),
mean, -- I suspect that the CAST is not needed
...;
Write the new select and test it.
By now new is missing the newer rows. You can either rebuild it and hope to finish everything in your one minute window, or play some other game. Let us know if you want help there.
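For the cut-over itself, RENAME TABLE is atomic in MySQL, so once new has caught up you can swap in one statement. A sketch (the question's live table is literally named table, so backticks are needed):
RENAME TABLE `table` TO `old_table`, `new` TO `table`;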
Let's say I have a table that contains the following: id and date (just to keep things simple).
It contains numerous rows.
What would my select query look like to get the average TIME for those rows?
Thanks,
Disclaimer: There may be a much better way to do this.
Notes:
You can't use the AVG() function against a DATETIME/TIME
I am casting DATETIME to DECIMAL(18, 6), which appears to yield a reasonably precise result (within a few milliseconds).
#1 - Average Date
SELECT
CAST( AVG( CAST( TimeOfInterest AS DECIMAL( 18, 6 ) ) ) AS DATETIME )
FROM dbo.MyTable;
#2 - Average Time - Remove Date Portion, Cast, and then Average
SELECT
CAST( AVG( CAST( TimeOfInterest - CAST( TimeOfInterest AS DATE ) AS DECIMAL( 18, 6 ) ) ) AS DATETIME )
FROM dbo.MyTable;
The second example subtracts the date portion of the DATETIME from itself, leaving only the time portion, which is then cast to a decimal for averaging, and back to a DATETIME for formatting. You would need to strip out the date portion (it's meaningless) and the time portion should represent the average time in the set.
SELECT CAST(AVG(CAST(ReadingDate AS real) - FLOOR(CAST(ReadingDate as real))) AS datetime)
FROM Rbh
I know that, in at least some of the SQL standards, the value expression (the argument to the AVG() function) isn't allowed to be a datetime value or a string value. I haven't read all the SQL standards, but I'd be surprised if that restriction had loosened over the years.
In part, that's because the "average" (or arithmetic mean) of n values is defined to be the sum of the values divided by n. And the expression '01-Jan-2012 08:00' + '03-Mar-2012 07:53' doesn't make any sense. Neither does '01-Jan-2012 08:00' / 3.
Microsoft products have a history of playing fast and loose with SQL by exposing the internal representation of their date and time data types. Dennis Ritchie would have called this "an unwarranted chumminess with the implementation."
In earlier versions of Microsoft Access (and maybe in current versions, too), you could multiply the date '01-Jan-2012' by the date '03-Mar-2012' and get an actual return value, presumably in units of square dates.
If your dbms supports the "interval" data type, then taking the average is straightforward, and does what you'd expect. (SQL Server doesn't support interval data types.)
create table test (
n interval hour to minute
);
insert into test values
('1:00'),
('1:30'),
('2:00');
select avg(n)
from test;
avg (interval)
--
01:30:00