I have a "varchar" field in my database where I have stored records like:
(11.1.2015) Log info 1
(17.4.2015) Log info 2
(22.5.2015) Log info 3
(25.5.2015) Log info 3
...
Now I would like to make a SELECT WHERE the date inside the parentheses is the same as or later than today, and take the first such row (so with this sample data and today's date I should get 22.5.2015). I just can't figure out how to do that, so I need some help.
In principle I agree with Pekka웃 on this one.
You should always strive to use proper data types for your data.
This also means never using one column to store two different pieces of data.
However, from the comments to Pekka웃's answer I understand that changing the table is not possible, so here's my attempt to do it.
Assuming your dates are always at the start of the varchar, and always surrounded by parentheses, you can probably do something like this:
SELECT *
FROM (
  SELECT
    -- the date sits between '(' and ')', i.e. positions 2 .. LOCATE(')') - 1
    -- NB: for d.m.Y strings, STR_TO_DATE(..., '%d.%m.%Y') is safer than CAST
    CAST(SUBSTR(Log_data, 2, LOCATE(')', Log_data) - 2) AS DATE) AS LogDate,
    SUBSTR(Log_data, LOCATE(')', Log_data) + 1) AS LogData
  FROM logs_table
) NormalizedLogTable
WHERE LogDate >= CURDATE()
ORDER BY LogDate
LIMIT 1
See sql fiddle here.
Note #1: This is a workaround for your specific situation.
If you ever get the chance, you should normalize your table.
Note #2: I'm not a MySQL guy; most of my SQL experience is with SQL Server.
There is probably a better way to convert the string to a date than a plain CAST: in MySQL, STR_TO_DATE(..., '%d.%m.%Y') with an explicit format avoids the ambiguity of values like 1.3.2015.
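If it helps to sanity-check the slicing logic outside MySQL first, here is a small Python sketch of the same extract-and-filter idea. The sample rows and the `first_on_or_after` helper are my own invention, not from the question:

```python
from datetime import date

# Invented sample rows in the question's "(d.m.Y) text" layout
logs = [
    "(11.1.2015) Log info 1",
    "(17.4.2015) Log info 2",
    "(22.5.2015) Log info 3",
    "(25.5.2015) Log info 3",
]

def split_log(entry):
    # Mirrors the SUBSTR/LOCATE idea: the date sits between '(' and ')'
    close = entry.index(")")
    d, m, y = entry[1:close].split(".")
    return date(int(y), int(m), int(d)), entry[close + 1:].strip()

def first_on_or_after(entries, today):
    # WHERE LogDate >= today ORDER BY LogDate LIMIT 1, in Python
    parsed = sorted(split_log(e) for e in entries)
    return next((d for d, _ in parsed if d >= today), None)

print(first_on_or_after(logs, date(2015, 5, 20)))  # 2015-05-22
```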
This is likely to be impossible, or hellishly complex, to do with a varchar field. While you can theoretically use regex functions in MySQL to match patterns, you are looking for a date range. Even if it were possible to build such a query somehow, it would be a gigantic pile of work, and there is no way MySQL could optimize any aspect of it.
The normal, fast, and straightforward way to go about this is normalizing your table.
In your case, you would probably create a new table named "logs" and connect it to your existing table through an ID field that shows which parent record each log entry belongs to.
Querying for a certain date range for log entries belonging to a specific parent then becomes as easy as
SELECT log_date FROM logs WHERE parent = 155 AND log_date >= CURDATE()
It's painful to do at first (as you have to rebuild parts of your structure and likely, your app) and makes some everyday queries more complex, but cases like this become much easier and faster.
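To make the normalized layout concrete, here is a sketch using Python's sqlite3 as a stand-in for MySQL. The table and column names are assumed, and SQLite's TEXT dates stand in for a real DATE column:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE logs (
        id       INTEGER PRIMARY KEY,
        parent   INTEGER,              -- FK pointing at the existing table
        log_date TEXT,                 -- a real DATE column in MySQL
        log_info TEXT
    );
    CREATE INDEX idx_logs_parent_date ON logs(parent, log_date);
    INSERT INTO logs (parent, log_date, log_info) VALUES
        (155, '2015-01-11', 'Log info 1'),
        (155, '2015-05-22', 'Log info 3'),
        (200, '2015-06-01', 'other parent');
""")

# The "as easy as" query from the answer, parameterized
rows = con.execute(
    "SELECT log_date FROM logs WHERE parent = ? AND log_date >= ?",
    (155, '2015-05-20'),
).fetchall()
print(rows)  # [('2015-05-22',)]
```

The compound index on (parent, log_date) is what makes the lookup a range scan instead of a table scan.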
Related
I know there are several questions similar to this one, but those I've found do not relate directly to my problem.
Some initial context: I have a facts table, called ft_booking, with around 10 million records. I have a dimension called dm_date, with around 11k records, which are dates. These tables are related through foreign keys, as usual. There are 3 date foreign keys in the table ft_booking: one for boarding, one for booking, and one for cancellation. All columns have the very same definition, and the number of distinct values in each is similar (ranging from 2.5k to 3k distinct values per column).
There I go:
EXPLAIN SELECT
*
FROM dw.ft_booking b
LEFT JOIN dw.dm_date db ON db.sk_date = b.fk_date_booking
WHERE date(db.date) = '2018-05-05'
As you can see, the index is being used for the booking table, and the query runs really fast, even though my filter uses the date() function. For brevity, I'll just state that the same happens with the column fk_date_boarding. But check this out:
EXPLAIN SELECT
*
FROM dw.ft_booking b
LEFT JOIN dw.dm_date db ON db.sk_date = b.fk_date_cancellation
WHERE date(db.date) = '2018-05-05';
For some mysterious reason, the planner chooses not to use the index. Now, I understand that applying a function to a column generally forces the database to perform a full table scan, in order to apply that function to every row, thus bypassing the index. But in this case the function is not applied to the actual foreign key column, which is where the lookup in the booking table should be occurring.
If I remove the date() function, the index is used on any of those columns, as expected. One might say, "well, why don't you just get rid of the date() function?" I use Metabase, a tool that lets users build queries through a graphical interface without knowing MySQL, and one of its current limitations is that it always wraps date columns in the date() function for queries not written directly in MySQL. Hence, I have no way to remove the function from the queries I'm running.
Actual question: why does MySQL use the index in the first two cases, but not in the latter, considering the number of distinct values is pretty much the same for all columns and they have the exact same definition, apart from the name? Am I missing something here?
EDIT: Here is the CREATE statement of each table involved. There are more, but we only need ft_booking and dm_date here (the first two tables in the file).
You are "hiding the date inside a function call". If db.date is declared a DATE, then
date (db.date) = '2018-05-05'
can be simply
db.date = '2018-05-05'
If db.date is declared a DATETIME, then change to
db.date >= '2018-05-05'
AND db.date < '2018-05-05' + INTERVAL 1 DAY
In either case, be sure there is an index on db.date.
If by "I have a dimension called dm_date", you mean you built a dimension table to hold just dates, and then you are JOINing to the main table with some id, ... To put it bluntly, don't do that! Do not normalize "continuous" things such as DATE, DATETIME, FLOAT, or other numeric values.
If you need to discuss this further, please provide SHOW CREATE TABLE for the relevant table(s). (And please use text, not screen shots.)
Why??
The simple answer is that the Optimizer does not know how to unravel any function. Perhaps it could; perhaps it should. But it does not. Perhaps the answer involves not wanting to see how the function result will be used... comparing against a DATE? against a DATETIME? being used as a string? other?
Still, I suggest the real performance killer is the existence of dm_date rather than indexing and using the date in the main table.
Furthermore, the main table is bigger than it needs to be! fk_date_booking is a 4-byte INT SIGNED instead of a 3-byte DATE.
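The sargability point is easy to reproduce even in SQLite, used here only because it ships with Python; the effect of hiding a column inside a function call is the same idea as in MySQL. The table is a cut-down, assumed version of dm_date:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dm_date (sk_date INTEGER PRIMARY KEY, date TEXT)")
con.execute("CREATE INDEX idx_dm_date ON dm_date(date)")

# Wrapping the column in date() hides it from the index...
plan_fn = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM dm_date WHERE date(date) = '2018-05-05'"
).fetchall()
# ...while a bare comparison lets the planner probe the index
plan_col = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM dm_date WHERE date = '2018-05-05'"
).fetchall()

print(plan_fn[0][3])   # a SCAN: every row is visited
print(plan_col[0][3])  # a SEARCH using idx_dm_date
```

The fourth column of each plan row is SQLite's human-readable plan detail; SCAN vs SEARCH is the full-scan vs index-lookup distinction.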
If I have a table full of records, which could be payments, or bookings, or a multitude of other entities, is there a best practice for saving the status of each record, beyond a simple 0 for inactive and 1 for active?
For example, a payment might have the status 'pending', 'completed' or 'failed'. The way I have previously done it is to have another table with a series of definitions in value/text pairs, i.e. 0 = 'failed', 1 = 'pending' and 2 = 'completed'. I would then store 0, 1 or 2 in the payments table and use an inner join to read the text from the definitions table when needed.
This method sometimes seems overly complicated and unnecessary, and I have been thinking of changing it to simply saving the word 'completed' directly in the status field of the payments table, for example.
Is this considered bad practice, and if so, what is the best practice?
These seem to be transaction records, so potentially there are many of them and query performance will be an issue. So, it's probably smart to organize your status column or columns in such a way that compound index access to the records you need will be straightforward.
It's hard to give you crisp "do this, don't do that" advice without knowing your query patterns, so here are a couple of scenarios.
Suppose you need to get all the active bookings this month. You'll want a query of the form
SELECT whatever
FROM xactions
WHERE active = 1 and type = 2 /*bookings*/
AND xaction_date >= CURDATE() - INTERVAL DAY(CURDATE()) - 1 DAY
This will perform great with a compound BTREE index on (active, type, xaction_date). The query can be satisfied by random-accessing the index to the first eligible record and then scanning it sequentially.
But if you have type=2 meaning active bookings and type=12 meaning inactive bookings, and you want all bookings both active and inactive this month, your query will look like this:
SELECT whatever
FROM xactions
WHERE type IN (2,12)
AND xaction_date >= CURDATE() - INTERVAL DAY(CURDATE()) - 1 DAY
This won't be able to scan a compound index quite so easily due to the IN(2,12) clause needing disjoint ranges of values.
tl;dr In MySQL it's easier to index separate columns for various items of status to get better query performance. But it's hard to know without understanding query patterns.
For the specific case you mention, MySQL supports ENUM datatypes.
In your example, an ENUM seems appropriate - it limits the range of valid options, it's translated back to human-readable text in results, and it creates legible code. It has some performance advantages at query time.
However, see this answer for possible drawbacks.
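To see the "limits the range of valid options" behavior in action: SQLite, used here as a convenient stand-in since it has no ENUM, can model the same guarantee with a CHECK constraint. The payments table below is an assumed example, not from the question:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A CHECK constraint plays the role of MySQL's
# ENUM('pending','completed','failed') in this sketch
con.execute("""
    CREATE TABLE payments (
        id     INTEGER PRIMARY KEY,
        status TEXT NOT NULL
               CHECK (status IN ('pending', 'completed', 'failed'))
    )""")
con.execute("INSERT INTO payments (status) VALUES ('completed')")

try:
    con.execute("INSERT INTO payments (status) VALUES ('bogus')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True   # invalid status values never reach the table

print(rejected)  # True
```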
If the status is more than an on/off bool type, then I always have a lookup table as you describe. Apart from being (I believe) a better normalised design, it makes objects based on the data entities easier to code and use.
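A sketch of that lookup-table design, using Python's sqlite3 as a stand-in for MySQL; the table and column names are assumed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE payment_status (id INTEGER PRIMARY KEY, label TEXT);
    INSERT INTO payment_status VALUES
        (0, 'failed'), (1, 'pending'), (2, 'completed');
    CREATE TABLE payments (
        id     INTEGER PRIMARY KEY,
        amount REAL,
        status INTEGER REFERENCES payment_status(id)
    );
    INSERT INTO payments (amount, status) VALUES (9.99, 2), (5.00, 1);
""")

# The inner join described in the question: numeric codes in the data
# table, human-readable labels resolved from the lookup table
rows = con.execute("""
    SELECT p.id, s.label
    FROM payments p
    JOIN payment_status s ON s.id = p.status
    ORDER BY p.id
""").fetchall()
print(rows)  # [(1, 'completed'), (2, 'pending')]
```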
I am using MySQL and I have a field in a table that needs to store a year+month value. The field doesn't need day, hour, minute or second info. I am thinking of creating the field as CHAR(6), because it seems fine to use >, = and < to compare the strings.
SELECT '201108' < '201109';
-- returns 1
I want to use this format because I can insert the same string to Lucene index engine.
Is it a good idea or I should stick with DATE?
That will work fine, right up to the point where you have to implement your own code for working out the difference between two values, or figuring out what value you need for a time six months into the future.
Use the date type, that's what it's for (a). If it has too much resolution, enforce the constraint that the day will always be set to 1, or force that with an insert/update trigger.
Then you can use all the fancy date manipulation code that your DBMS vendor has already written, code that's probably going to be much more efficient since it will be dealing with a native binary type rather than a character string.
And you'll save space in this particular case as well since a MySQL date type is actually shorter than a char(6). It's not often that a database decision gives you both space and time advantages (it's usually a trade-off), so you should seize them whenever you can.
(a) This applies to all of those types, such as date, time and datetime.
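For a taste of the code you end up writing yourself with CHAR(6): even simple month arithmetic needs a hand-rolled helper. A Python sketch, where the `add_months` helper is my own, mirroring what a native date type gives you for free via DATE_ADD:

```python
from datetime import date

def add_months(d, n):
    # The month arithmetic a real date type handles for you:
    # count months since year 0, shift, and convert back
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)

# '2011-08' stored as the first of the month, per the answer's advice
ym = date(2011, 8, 1)
print(add_months(ym, 6))   # 2012-02-01
print(add_months(ym, -8))  # 2010-12-01
```

With CHAR(6) this logic has to live in every application that touches the column; with DATE it is one built-in function call.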
You'd want to use a date and simply ignore the day component. The database is more efficient at searching than your code will ever be, because it is optimized for lookups such as this one. You'd store a dummy value (say, 1) in the day to make this work.
Well, since MySQL only takes 3 bytes to store a date (Warning: The link is for MySQL version 5.0. Check the docs for your version to make sure), it would be better from a storage standpoint, as well as a performance standpoint when it comes to comparisons, to use DATE.
You can also use a DATE field for that, and then when selecting the values use the DATE_FORMAT function to pull out the year and month.
With the field as a DATE type, if the date you entered is '2011-08-30' and you want the result 201108, write:
select DATE_FORMAT('2011-08-30','%Y%m');
It will give the result 201108.
For more detailed information on DATE_FORMAT, please visit
http://davidwalsh.name/format-date-mysql-date_format
I need some options.
I have a table laid out as follows, with about 78,000,000 rows...
id INT (Primary Key)
loc VARCHAR (Indexed)
date VARCHAR (Indexed)
time VARCHAR
ip VARCHAR
lookup VARCHAR
Here is an example of a query I have.
SELECT lookup, date, time, count(lookup) as count FROM dnstable
WHERE STR_TO_DATE(`date`, '%d-%b-%Y') >= '$date1' AND STR_TO_DATE(`date`, '%d-%b-%Y') <= '$date2' AND
time >= '$hour1%' AND time <= '$hour2%' AND
`loc` LIKE '%$prov%' AND
lookup REGEXP 'ca|com|org|net' AND
lookup NOT LIKE '%.arpa' AND
lookup NOT LIKE '%domain.ca' AND
ip NOT LIKE '192.168.2.1' AND
ip NOT LIKE '192.168.2.2' AND
ip NOT LIKE '192.168.2.3'
GROUP BY lookup
ORDER BY count DESC
LIMIT 100
I have my MySQL server configured like a few high-usage examples I found. The hardware is decent: 4 cores, 8 GB of RAM.
This query takes about 180 seconds... Does anyone have tips on making this more efficient?
There are a lot of things wrong here. A LOT of things. I would look to the other answers for query options (you use a lot of LIKEs, NOT LIKEs, and functions, and you're applying them to unindexed columns...). If I were in your position, I'd redesign the entire database. It looks as though you're using it to store DNS entries: host names to IP addresses.
You may not have the option to redesign your database - maybe it's a customer database or something that you don't have control over. Maybe they have a lot of applications which depend on the current database design. However, if you can refactor your database, I would strongly suggest it.
Here's a basic rundown of what I'd do:
Store the TLDs (top-level-domains) in a separate column as an ENUM. Make it an index, so it's easily searchable, instead of trying to regex .com, .arpa, etc. TLDs are limited anyway, and they don't change often, so this is a great candidate for an ENUM.
Store the domain without the TLD in a regular column and a reversed column. You could index both columns, but depending on your searches, you may only need to index the reverse column. Basically, having a reverse column allows you to search for all hosts in one domain (ex. google) without having to do a fulltext search each time. MySQL can do a key search on the string "elgoog" in the reverse column. Because DNS is a hierarchy, this fits perfectly.
Change the date and time columns from VARCHAR to DATE and TIME, respectively. This one's an obvious change. No more str_to_time, str_to_date, etc. Absolutely no point in doing that.
Store the IP addresses differently. There's no reason to use a VARCHAR here; it's inefficient and doesn't make sense. Instead, use four separate columns, one for each octet (this is safe because all IPv4 addresses have exactly four octets, no more, no less), as unsigned TINYINT values. This gives you 0-255, the range you need (each IP octet is 8 bits, anyway). This should make searches much faster, especially if you key the columns.
ex: select * from table where octet1 != 10; (this would filter out all 10.0.0.0/8 private IP space)
The basic problem here is that your database design is flawed - and your query is using columns that aren't indexed, and your queries are inefficient.
If you're stuck with the current design....I'm not sure if I can really help you. I'm sorry.
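The reversed-column idea from point 2 above can be sketched in a few lines of Python. The sample host names are my own; in SQL, the `startswith` check would be a LIKE 'moc.elgoog%' prefix match, which a BTREE index on the reversed column can satisfy with a range scan:

```python
# Store the host reversed so a trailing match ("all hosts under
# google.com") becomes a prefix match, which an index can serve.
hosts = ["mail.google.com", "www.google.com", "www.example.org"]

def reversed_key(host):
    return host[::-1]            # "mail.google.com" -> "moc.elgoog.liam"

needle = reversed_key("google.com")   # "moc.elgoog"
matches = [h for h in hosts if reversed_key(h).startswith(needle)]
print(matches)  # ['mail.google.com', 'www.google.com']
```

Because DNS names are hierarchical from the right, reversing them lines the hierarchy up with the left-to-right ordering an index uses.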
I bet the really big problem here is the STR_TO_DATE functions.
If possible, change the date column to a real date datatype (DATE, DATETIME or TIMESTAMP).
Indexing this new or altered column (now with a date datatype) would speed up selections on that column significantly. You have to avoid the date parsing that the wrong datatype for the 'date' column currently forces; that parsing/converting prevents MySQL from using an index on the 'date' column.
Conclusion: give the 'date' column a date datatype, index it, and do not use STR_TO_DATE in your statement.
I suspect that these local IP addresses are not very selective when used with negation, right? (This depends on the typical data in the table.)
Since the ip column is not indexed, selections on that column always result in a full table scan. If an inequality (<>) selection on ip were very selective, you could consider putting an index on it and changing the statement to use <> instead of LIKE. But I do not think an inequality selection on ip is very selective here.
Conclusion: I do not think you can win anything significant here.
The problem is that a LIKE means a full table traversal, which is why you are seeing this.
First thing I would suggest is to get rid of LIKE '192.168.2.1', since without wildcards that is the same as = '192.168.2.1'.
Also, the LIMIT 100 at the end means the query runs against all matching records and only then keeps the first 100. How about doing a first SELECT that applies every operation except the LIKEs, limiting that one, and then running a second SELECT that applies the LIKEs to the smaller result?
Some Tips
Use != instead of NOT LIKE
Avoid REGEXP in mysql query
Avoid STR_TO_DATE(date, '%d-%b-%Y') >= '$date1'; pass dates already in MySQL format to the query rather than converting them with STR_TO_DATE.
lookup should be indexed if you have to use group by on it.
Try caching the query results(if possible).
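Applying a couple of those tips (a real date column compared directly, and != instead of NOT LIKE for exact IPs), here is a cut-down sketch using Python's sqlite3 as a stand-in for MySQL; the sample rows are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dnstable (lookup TEXT, date TEXT, ip TEXT);
    CREATE INDEX idx_dns_date ON dnstable(date);
    INSERT INTO dnstable VALUES
        ('host.example.ca', '2014-03-01', '10.0.0.5'),
        ('host.example.ca', '2014-03-02', '192.168.2.1'),
        ('other.test.org',  '2014-06-01', '10.0.0.6');
""")

# ISO dates compare directly, so no STR_TO_DATE is needed and the
# index on date stays usable; exact IPs use != instead of NOT LIKE
rows = con.execute("""
    SELECT lookup, COUNT(*) AS count
    FROM dnstable
    WHERE date BETWEEN '2014-03-01' AND '2014-03-31'
      AND ip != '192.168.2.1'
    GROUP BY lookup
    ORDER BY count DESC
""").fetchall()
print(rows)  # [('host.example.ca', 1)]
```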
I am currently part of a team designing a site that will potentially have thousands of users who will be doing a number of date related searches. During the design phase we have been trying to determine which makes more sense for performance optimization.
Should we store the datetime field as a MySQL DATETIME, or should we break it up into a number of fields (year, month, day, hour, minute, ...)?
The question is: with a large data set and a potentially large set of users, would we gain performance by breaking the datetime into multiple fields and not relying on MySQL date functions? Or is MySQL already optimized for this?
Have a look at the MySQL Date & Time Functions documentation, because you can pull specific information from a date using existing functions like YEAR, MONTH, etc. But while these exist, if you have an index on the date column(s), using these functions means those indexes cannot be used...
The problem with storing a date as separate components is the work needed to reconstruct them into a date when you want to do range comparisons or date operations.
Ultimately, choose what works best with your application. If there's seldom need for the date to be split out, consider using a VIEW to expose the date components without writing possibly redundant information into your tables.
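A sketch of the VIEW idea, using Python's sqlite3 as a stand-in for MySQL; strftime plays the role of MySQL's YEAR()/MONTH(), and the table and column names are assumed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (id INTEGER PRIMARY KEY, happened_at TEXT);
    INSERT INTO events (happened_at) VALUES ('2010-06-15 13:45:00');
    -- the view exposes the components without duplicating them in the table
    CREATE VIEW event_parts AS
        SELECT id,
               CAST(strftime('%Y', happened_at) AS INTEGER) AS yr,
               CAST(strftime('%m', happened_at) AS INTEGER) AS mon
        FROM events;
""")
rows = con.execute("SELECT yr, mon FROM event_parts").fetchall()
print(rows)  # [(2010, 6)]
```

The base table stays a single datetime column; code that occasionally wants split-out parts reads the view instead.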
Use a regular datetime field. You can always switch over to the separated components down the line if performance becomes an issue. Try to avoid premature optimization - in many cases, YAGNI. You may wind up employing both the datetime field and the separated component methodology, since they both have their strengths.
If you know ahead of time some key criteria that all searches will have, MySQL (>= v5.1) table partitioning might help.
For example, if you have a table like this:
create table Books(pubDate dateTime, title varchar(50));
And you know all searches must at least include a year, you could partition it on the date field, along these lines:
create table Books(pubDate dateTime, title varchar(50))
partition by hash(year(pubDate)) partitions 10;
Then, when you run a select against the table, if your where clause includes criteria that limit the partition the results can exist on, the search will only scan that partition, rather than a full table scan. You can see this in action with:
-- scans entire table
explain partitions select * from Books where title like '%title%';
versus something like:
-- scans just one partition
explain partitions select * from Books
where year(pubDate)=2010
and title like '%title%';
The MySQL documentation on this is quite good, and you can choose from multiple partitioning algorithms.
Even if you opt to break up the date, a table partition on, say, year (int) (assuming searches will always specify a year) could help.
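The pruning mechanics can be modeled in a few lines of Python: hash the partitioning expression into a bucket, and an exact-year predicate only has to touch one bucket. The bucket layout and sample rows here are invented:

```python
from collections import defaultdict

PARTITIONS = 10
buckets = defaultdict(list)   # partition number -> rows

def insert(pub_year, title):
    # PARTITION BY HASH(YEAR(pubDate)) PARTITIONS 10, in miniature
    buckets[pub_year % PARTITIONS].append((pub_year, title))

for y, t in [(2009, "A"), (2010, "B"), (2010, "C"), (2011, "D")]:
    insert(y, t)

# WHERE year(pubDate) = 2010 prunes to partition 2010 % 10; only that
# bucket is scanned instead of the whole table
hits = [t for y, t in buckets[2010 % PARTITIONS] if y == 2010]
print(hits)  # ['B', 'C']
```

A full scan would visit all four rows; the pruned lookup visits only the two in partition 0.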