MySQL performance issue with code and table design

I need some options.
I have a table laid out as follows, with about 78,000,000 rows...
id INT (Primary Key)
loc VARCHAR (Indexed)
date VARCHAR (Indexed)
time VARCHAR
ip VARCHAR
lookup VARCHAR
Here is an example of a query I have.
SELECT lookup, date, time, count(lookup) as count FROM dnstable
WHERE STR_TO_DATE(`date`, '%d-%b-%Y') >= '$date1' AND STR_TO_DATE(`date`, '%d-%b-%Y') <= '$date2' AND
time >= '$hour1%' AND time <= '$hour2%' AND
`loc` LIKE '%$prov%' AND
lookup REGEXP 'ca|com|org|net' AND
lookup NOT LIKE '%.arpa' AND
lookup NOT LIKE '%domain.ca' AND
ip NOT LIKE '192.168.2.1' AND
ip NOT LIKE '192.168.2.2' AND
ip NOT LIKE '192.168.2.3'
GROUP BY lookup
ORDER BY count DESC
LIMIT 100
I have my MySQL server configured like a few high-usage examples I found. The hardware is good: 4 cores, 8 GB of RAM.
This query takes about 180 seconds... Does anyone have some tips on making this more efficient?

There are a lot of things wrong here. A LOT of things. I would look to the other answers for query options (you use a lot of LIKES, NOT LIKES, and functions....and you're doing them on unkeyed columns...). If I were in your case, I'd redesign my entire database. It looks as though you're using this to store DNS entries - host names to IP addresses.
You may not have the option to redesign your database - maybe it's a customer database or something that you don't have control over. Maybe they have a lot of applications which depend on the current database design. However, if you can refactor your database, I would strongly suggest it.
Here's a basic rundown of what I'd do (a schema sketch follows this list):
Store the TLDs (top-level-domains) in a separate column as an ENUM. Make it an index, so it's easily searchable, instead of trying to regex .com, .arpa, etc. TLDs are limited anyway, and they don't change often, so this is a great candidate for an ENUM.
Store the domain without the TLD in a regular column and a reversed column. You could index both columns, but depending on your searches, you may only need to index the reverse column. Basically, having a reverse column allows you to search for all hosts in one domain (ex. google) without having to do a fulltext search each time. MySQL can do a key search on the string "elgoog" in the reverse column. Because DNS is a hierarchy, this fits perfectly.
Change the date and time columns from VARCHAR to DATE and TIME, respectively. This one's an obvious change: no more STR_TO_DATE and friends. There is absolutely no point in parsing dates out of strings on every query.
Store the IP addresses differently. There's no reason to use a VARCHAR here - it's inefficient and doesn't make sense. Instead, use four separate columns for each octet (this is safe because all IPv4 addresses have four octets, no more, no less) as unsigned TINYINT values. This will give you 0-255, the range you need. (Each IP octet is actually 8 bits, anyway.) This should make searches much faster, especially if you key the columns.
ex: select * from table where octet1 != 10; (this would filter out all 10.0.0.0/8 private IP space)
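Putting those pieces together, here is a minimal sketch of what the redesigned table might look like (all names, sizes, and the ENUM value list are illustrative, not taken from the original schema):
CREATE TABLE dns_log (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    log_date DATE NOT NULL,
    log_time TIME NOT NULL,
    -- four keyed octets instead of a VARCHAR ip
    octet1 TINYINT UNSIGNED NOT NULL,
    octet2 TINYINT UNSIGNED NOT NULL,
    octet3 TINYINT UNSIGNED NOT NULL,
    octet4 TINYINT UNSIGNED NOT NULL,
    -- TLD as a searchable ENUM; extend the value list as needed
    tld ENUM('ca','com','org','net','arpa') NOT NULL,
    domain VARCHAR(255) NOT NULL,
    domain_rev VARCHAR(255) NOT NULL,   -- REVERSE(domain), for suffix searches
    loc VARCHAR(64),
    KEY idx_log_date (log_date),
    KEY idx_tld (tld),
    KEY idx_domain_rev (domain_rev)
);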
The basic problem here is that your database design is flawed: your query uses columns that aren't indexed, and your predicates are inefficient.
If you're stuck with the current design....I'm not sure if I can really help you. I'm sorry.

I bet the really big problem here is the STR_TO_DATE function.
If possible, change the date column to a real date datatype (DATE, DATETIME, or TIMESTAMP).
Having this new or altered column indexed would speed up selections over that column significantly. You have to avoid the date parsing that the wrong datatype of the 'date' column currently forces; this parsing/converting prevents MySQL from using the index on the 'date' column.
Conclusion: give the 'date' column a date datatype, have it indexed, and do not use STR_TO_DATE in your statement.
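For example, a minimal migration sketch using the table and date format from the question (the new column name date_d and the index name are illustrative):
-- add a real DATE column, backfill it from the old VARCHAR, and index it
ALTER TABLE dnstable ADD COLUMN date_d DATE;
UPDATE dnstable SET date_d = STR_TO_DATE(`date`, '%d-%b-%Y');
CREATE INDEX idx_date_d ON dnstable (date_d);
-- range predicates like date_d >= '2014-01-01' AND date_d <= '2014-01-31'
-- can then use the index, with no per-row parsing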
I suspect that these local IP addresses are not very selective when used with negation, right? (This depends on the typical data in the table.)
Since the ip column is not indexed, selections on that column always result in a full table scan. If an inequality (<>) selection on ip were very selective, you could consider putting an index on it and changing the statement to use <> instead of LIKE. But I do not think an inequality selection on ip is very selective.
Conclusion: I do not think you can win anything significant here.

The problem is that a LIKE on an unindexed column means a full table scan! That is why you are seeing this.
The first thing I would suggest is to get rid of LIKE '192.168.2.1', since with no wildcards that is the same as = '192.168.2.1'.
Also, the LIMIT 100 at the end means that the query runs against all the records and then returns only the first 100. How about instead doing a SELECT which involves all the other operations but not the LIKEs, and then a second SELECT over that result which applies the LIKEs?
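A sketch of that idea, assuming the date column has already been converted to a real DATE type as other answers suggest (the date literals are placeholders, and the select list is trimmed to the grouped column and the count): the inner query applies the cheap, indexable filters, and the LIKE filters only run over the narrowed set.
SELECT lookup, COUNT(lookup) AS count
FROM (
    -- cheap, indexable filters first
    SELECT lookup
    FROM dnstable
    WHERE `date` >= '2014-01-01' AND `date` <= '2014-01-31'
) AS narrowed
-- expensive pattern filters on the smaller intermediate set
WHERE lookup NOT LIKE '%.arpa'
  AND lookup NOT LIKE '%domain.ca'
GROUP BY lookup
ORDER BY count DESC
LIMIT 100;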

Some tips:
Use != instead of NOT LIKE.
Avoid REGEXP in MySQL queries.
Avoid STR_TO_DATE(date, '%d-%b-%Y') >= '$date1'; try passing a MySQL-formatted date to the query rather than converting with STR_TO_DATE.
lookup should be indexed if you have to GROUP BY on it.
Try caching the query results (if possible).
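Applying those tips, the original query might be rewritten along these lines (a sketch: it assumes `date` and `time` have been converted to native DATE and TIME types, the literals are placeholders, and the REGEXP TLD filter is left out since it would be better served by a separate indexed TLD column, as another answer suggests):
SELECT lookup, COUNT(lookup) AS count
FROM dnstable
WHERE `date` BETWEEN '2014-01-01' AND '2014-01-31'
  AND `time` BETWEEN '08:00:00' AND '17:00:00'
  AND ip NOT IN ('192.168.2.1', '192.168.2.2', '192.168.2.3')  -- instead of three NOT LIKEs
  AND lookup NOT LIKE '%.arpa'
  AND lookup NOT LIKE '%domain.ca'
GROUP BY lookup
ORDER BY count DESC
LIMIT 100;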

MySQL index usage on join

I know there are several questions similar to this one, but those I've found do not relate directly to my problem.
Some initial context: I have a facts table, called ft_booking, with around 10MM records. I have a dimension called dm_date, with around 11k records, which are dates. These tables are related through foreign keys, as usual. There are 3 date foreign keys in the table ft_booking: one for boarding, one for booking, and one for cancellation. All columns have the very same definition, and the number of distinct records for each is similar (ranging from 2.5k to 3k distinct values in each column).
Here I go:
EXPLAIN SELECT
*
FROM dw.ft_booking b
LEFT JOIN dw.dm_date db ON db.sk_date = b.fk_date_booking
WHERE date (db.date) = '2018-05-05'
As you can see, the index is being used on the booking table, and the query runs really fast, even though, in my filter, I'm using the date() function. For brevity, I'll just state that the same happens using the column fk_date_boarding. But check this out:
EXPLAIN SELECT
*
FROM dw.ft_booking b
LEFT JOIN dw.dm_date db ON db.sk_date = b.fk_date_cancellation
WHERE date (db.date) = '2018-05-05';
For some mysterious reason, the planner chooses not to use the index. Now, I understand that using some function over a column generally forces the database to perform a full table scan, in order to be able to apply that function to every row, thus bypassing the index. But in this case, the function is not over the actual foreign key column, which is where the lookup in the booking table should be occurring.
If I remove the date() function, the index will be used on any of those columns, as expected. One might say, then, "well, why don't you just get rid of the date() function?" I use Metabase, a tool that gives users a graphical interface for building queries without knowing MySQL, and one of its current limitations is that it always applies the date() function when building queries that are not written directly in MySQL. Hence, I have no way to remove the function from the queries I'm running.
Actual question: why does MySQL use the index in the first two cases, but not in the latter, considering that the number of distinct values is pretty much the same for all columns and that they have the exact same definition, apart from the name? Am I missing something here?
EDIT: Here are the CREATE statements of each table involved. There are some more, but we only need tables ft_booking and dm_date here (the first two tables of the file).
You are "hiding date in a function call". If db.date is declared a DATE, then
date (db.date) = '2018-05-05'
can be simply
db.date = '2018-05-05'
If db.date is declared a DATETIME, then change to
db.date >= '2018-05-05'
AND db.date < '2018-05-05' + INTERVAL 1 DAY
In either case, be sure there is an index on db.date.
If by "I have a dimension called dm_date", you mean you built a dimension table to hold just dates, and then you are JOINing to the main table with some id, ... To put it bluntly, don't do that! Do not normalize "continuous" things such as DATE, DATETIME, FLOAT, or other numeric values.
If you need to discuss this further, please provide SHOW CREATE TABLE for the relevant table(s). (And please use text, not screen shots.)
Why??
The simple answer is that the Optimizer does not know how to unravel any function. Perhaps it could; perhaps it should. But it does not. Perhaps the answer involves not wanting to see how the function result will be used... comparing against a DATE? against a DATETIME? being used as a string? other?
Still, I suggest the real performance killer is the existence of dm_date rather than indexing and using the date in the main table.
Furthermore, the main table is bigger than it needs to be! fk_date_booking is a 4-byte INT SIGNED instead of a 3-byte DATE.
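To illustrate those last two points, here is a sketch under the assumption that ft_booking carried the booking date directly as an indexed DATE column (booking_date and the index name are hypothetical):
-- store the date in the fact table itself, as a 3-byte DATE
ALTER TABLE dw.ft_booking ADD COLUMN booking_date DATE;
CREATE INDEX idx_booking_date ON dw.ft_booking (booking_date);
-- the filter then needs no join, no function, and can use the index
SELECT * FROM dw.ft_booking WHERE booking_date = '2018-05-05';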

MySql date inside varchar (select correct date)?

I have "varchar" field in my database where I have stored records like:
(11.1.2015) Log info 1
(17.4.2015) Log info 2
(22.5.2015) Log info 3
(25.5.2015) Log info 3
...
Now I would like to make a SELECT WHERE the date inside the parentheses is the same as or later than today, and select the first one (so with this sample data and today's date I should get 22.5.2015). I just can't figure out how to do that, so I need some help.
In principle I agree with Pekka웃 on this one.
You should always strive to use proper data types for your data.
This also means never use one column to store 2 different data segments.
However, from the comments to Pekka웃's answer I understand that changing the table is not possible, so here's my attempt to do it.
Assuming your dates are always at the start of the varchar, and always surrounded by parenthesis, you can probably do something like this:
SELECT *
FROM (
SELECT
STR_TO_DATE(SUBSTR(Log_data, 2, LOCATE(')', Log_data)-2), '%d.%m.%Y') LogDate,
SUBSTR(Log_data, LOCATE(')', Log_data)+1, CHAR_LENGTH(Log_data)) LogData
FROM logs_table
) NormalizedLogTable
WHERE LogDate >= CURDATE()
ORDER BY LogDate
LIMIT 1
See sql fiddle here.
Note #1: This is a workaround for your specific situation.
If you ever get the chance, you should normalize your table.
Note #2: I'm not a MySQL guy. Most of my SQL experience is with SQL Server.
I used STR_TO_DATE with an explicit format above, rather than a plain CAST, to overcome the ambiguity of values like 1.3.2015.
This is likely to be impossible to do with a varchar field, or hellishly complex. While you can theoretically use regex functions in MySQL to match patterns, you are looking for a date range. Even if it were possible to build a query somehow, it would be a gigantic pile of work and there is no way MySQL could optimize any aspect of it.
The normal, fast, and straightforward way to go about this is normalizing your table.
In your case, you would probably create a new table named "logs" and connect it to your existing table through an ID field that shows which parent record each log entry belongs to.
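A sketch of that normalized table, with illustrative names and sizes (parent points at the owning record in your existing table):
CREATE TABLE logs (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    parent INT NOT NULL,            -- FK to the existing parent table
    log_date DATE NOT NULL,
    log_info VARCHAR(255),
    KEY idx_parent_date (parent, log_date)
);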
Querying for a certain date range for log entries belonging to a specific parent then becomes as easy as
SELECT log_date FROM logs WHERE parent = 155 AND log_date >= CURDATE()
It's painful to do at first (as you have to rebuild parts of your structure and likely, your app) and makes some everyday queries more complex, but cases like this become much easier and faster.

How can I improve the response time on my query when using ibatis/spring/mysql?

I have a database with 2 tables, and I must run a simple query:
select *
from tableA, tableB
where tableA.user = tableB.user
and tableA.email LIKE "%USER_INPUT%"
where USER_INPUT is the part of the string in tableA.email that has to match.
The problem:
The table will hold about 10 million records and the query is taking a while. The iBatis cache (as far as I know) will be used only if the new query looks exactly the same as the previous one. For example, for USER_INPUT = john_doe, if the second query is john_doe again the cache will work, but if it is john_do it will not (that is, as I said, as far as I know).
Currently, the tableA structure is like this:
id INT(11) NOT NULL AUTO_INCREMENT
email VARCHAR(255) NOT NULL
many more fields...
I don't know if email, a VARCHAR of 255, might be too long and take more time because of that. If I decrease it to 150 characters, for example, would the response time be shorter?
Right now the query is taking too long... I know I could add more memory to the servers, but I would like to know if there is another way to improve this code.
tableA and tableB have about 30 fields each and they are related by the ID on a relational schema.
I'm going to create an index for tableA.email.
Ideas?
I'd recommend running an execution plan for that query in your DB. That'll tell you how the DB plans to execute your query, and what you're looking for is something like a "full table scan". I'd guess you'll see just that, due to the LIKE clause, and an index on the email field won't help that part.
If you need to search by substrings of email addresses, you might want to consider the granularity of how you store your data. For example, instead of storing email addresses in a single field as usual, you could split them into two fields (or maybe more), where everything before the '@' is in one field and the domain name is in another. Then you could search by either component without needing a LIKE, and indexes would speed things up significantly. For example, you could do this to search:
WHERE tableA.email_username = 'USER_INPUT' OR tableA.email_domain = 'USER_INPUT'
Of course you then have to concatenate the two fields to recreate the email address, but I think iBatis will let you add a method to your data object to do that in a single place instead of all over your app (been a while since I used iBatis, though, so I could be wrong).
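A sketch of the one-time split, assuming the email_username and email_domain columns from the snippet above (the names and sizes are illustrative); MySQL's SUBSTRING_INDEX makes this straightforward:
ALTER TABLE tableA
    ADD COLUMN email_username VARCHAR(64),
    ADD COLUMN email_domain VARCHAR(190);
-- split each address at the '@'
UPDATE tableA
SET email_username = SUBSTRING_INDEX(email, '@', 1),
    email_domain   = SUBSTRING_INDEX(email, '@', -1);
-- index both components so equality searches avoid full scans
CREATE INDEX idx_email_username ON tableA (email_username);
CREATE INDEX idx_email_domain ON tableA (email_domain);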
MySQL cannot utilize indexes on LIKE queries where the wildcard precedes the search string (%query).
You can try a Full-Text search instead. You'll have to add a FULLTEXT index to your email column:
ALTER TABLE tablea
ADD FULLTEXT(email);
From there you can revise your query
SELECT *
FROM tableA,tableB
WHERE tableA.user = tableB.user
AND MATCH (tablea.email) AGAINST ('+USER_INPUT' IN BOOLEAN MODE)
You'll have to make sure you can use full text indexes.
Full-text indexes can be used only with MyISAM tables. (In MySQL 5.6 and up, they can also be used with InnoDB tables.)
http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html

Fast search solution for numeric type of large mysql table?

I have a large MySQL database (5 million rows) and the data is phone numbers.
I have tried many solutions but it's still slow. Currently, I'm using the INT type to store phone numbers and LIKE queries to search them.
Ex: SELECT phonenumber FROM tbl_phone WHERE phonenumber LIKE '%4567'
for searching phone numbers ending in 4567, such as 1704567, 2494567, ...
I need a solution which makes it run faster. Help me, please!
You are storing numbers as INT but querying them as CHAR (the LIKE operator implicitly converts INTs to CHARs), and that is surely not optimal. If you'd like to keep the numbers as INT (probably the best idea in I/O performance terms), you'd better change your queries to use numeric comparisons:
-- instead of CHAR operators
WHERE phone_number LIKE '%4567'
WHERE phone_number LIKE '1234%'
-- use NUMERIC operators
WHERE phone_number % 10000 = 4567
WHERE phone_number >= 12340000 -- considering 8 digit numbers
Besides choosing a homogeneous way to store and query the data, you should remember to create the appropriate index: CREATE INDEX IDX0 ON tbl_phone (phone_number);
Unfortunately, even then your query might not be optimal, because of effects similar to what @ron commented about. In that case you might have to change your table to break this column into more manageable columns (like national_code, area_code and phone_number). This would allow an index-efficient query by area code, for example.
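As a sketch of that idea (the column and index names are hypothetical), a separate area_code column can be indexed and searched by equality:
ALTER TABLE tbl_phone ADD COLUMN area_code SMALLINT UNSIGNED;
CREATE INDEX idx_area_code ON tbl_phone (area_code);
-- an equality search on the component can use the index
SELECT phonenumber FROM tbl_phone WHERE area_code = 170;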
Check the advice here How to speed up SELECT .. LIKE queries in MySQL on multiple columns?
Hope it helps!
I would experiment with using REGEXP rather than LIKE, as in the following example:
SELECT `field` FROM `tbl` WHERE `field` REGEXP '[0-9]';
Other than that, indeed, create an index if the part of the phone number you search for has a constant length.
Here is also a link to the MySQL pattern matching documentation.
That LIKE predicate is operating on a string, so you've got an implicit conversion from INT to VARCHAR happening. And that means an index on the INT column isn't going to help, even for a LIKE predicate that has leading characters. (The predicate is not sargable.)
If you are searching for the last digits of the phone number, the only way (that I know of) to get something like that to be fast would be to add another VARCHAR column, and store the reversed phone number in it, where the order of the digits is backwards.
Create an index on that VARCHAR column, and then to find phone number that end with '4567':
WHERE reverse_phone LIKE '7654%'
-or-
WHERE reverse_phone LIKE CONCAT(REVERSE('4567'),'%')
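A sketch of setting that up (the column and index names are illustrative; the length assumes plain digit strings):
ALTER TABLE tbl_phone ADD COLUMN reverse_phone VARCHAR(15);
-- backfill: store the digits in reverse order
UPDATE tbl_phone SET reverse_phone = REVERSE(CAST(phonenumber AS CHAR));
CREATE INDEX idx_reverse_phone ON tbl_phone (reverse_phone);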

Best way to handle MySQL date for performance with thousands of users

I am currently part of a team designing a site that will potentially have thousands of users who will be doing a number of date-related searches. During the design phase we have been trying to determine which makes more sense for performance optimization.
Should we store the datetime field as a MySQL DATETIME, or should we break it up into a number of fields (year, month, day, hour, minute, ...)?
The question is: with a large data set and a potentially large set of users, would we gain performance by breaking the datetime into multiple fields and avoiding MySQL date functions? Or is MySQL already optimized for this?
Have a look at the MySQL Date & Time Functions documentation, because you can pull specific information from a date using existing functions like YEAR, MONTH, etc. But while these functions exist, if you have an index on the date column(s), using them means those indexes cannot be used...
The problem with storing a date as separate components is the work needed to reconstruct them into a date when you want to do range comparisons or date operations.
Ultimately, choose what works best with your application. If there's seldom need for the date to be split out, consider using a VIEW to expose the date components without writing possibly redundant information into your tables.
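A sketch of such a view, with hypothetical table and column names:
CREATE VIEW searches_with_parts AS
SELECT s.*,
       YEAR(s.searched_at)  AS search_year,
       MONTH(s.searched_at) AS search_month,
       DAY(s.searched_at)   AS search_day,
       HOUR(s.searched_at)  AS search_hour
FROM searches s;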
Use a regular datetime field. You can always switch over to the separated components down the line if performance becomes an issue. Try to avoid premature optimization - in many cases, YAGNI. You may wind up employing both the datetime field and the separated component methodology, since they both have their strengths.
If you know ahead of time some key criteria that all searches will have, MySQL (>= v5.1) table partitioning might help.
For example, if you have a table like this:
create table Books(pubDate dateTime, title varchar(50));
And you know all searches must at least include a year, you could partition it on the date field, along these lines:
create table Books(pubDate dateTime, title varchar(50))
partition by hash(year(pubDate)) partitions 10;
Then, when you run a select against the table, if your where clause includes criteria that limit which partitions the results can exist on, the search will scan only those partitions rather than the full table. You can see this in action with:
-- scans entire table
explain partitions select * from Books where title like '%title%';
versus something like:
-- scans just one partition
explain partitions select * from Books
where year(pubDate)=2010
and title like '%title%';
The MySQL documentation on this is quite good, and you can choose from multiple partitioning algorithms.
Even if you opt to break up the date, a table partition on, say, a year INT column (assuming searches will always specify a year) could help.