I have a table with columns like this:
| Country.Number | CountryName |
| US.01 | USA |
| US.02 | USA |
I'd like to modify this to:
| Country | Number | CountryName |
| US | 01 | USA |
| US | 02 | USA |
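The split itself could be done along these lines (a sketch; it assumes the combined column is literally named `Country.Number`, and the new column types are guesses):

ALTER TABLE mytable
  ADD COLUMN Country VARCHAR(2),
  ADD COLUMN Number VARCHAR(2);

-- 'US.01' -> Country = 'US', Number = '01'
UPDATE mytable
   SET Country = SUBSTRING_INDEX(`Country.Number`, '.', 1),
       Number  = SUBSTRING_INDEX(`Country.Number`, '.', -1);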
Regarding optimization, is there a difference in performance if I use:
select * from mytable where `country.number` like "US.%"
or
select * from mytable where country = "US"
The performance difference will most likely be minuscule in this particular case, as MySQL can still use an index for the prefix pattern "US.%". The degradation is mostly felt when searching for something like "%.US" (the wildcard in front), because MySQL then does a full table scan without using indexes.
EDIT: you can look at it like this:
MySQL internally stores VARCHAR indexes like trees, with the first symbol at the root, branching on each following letter.
So when searching for = "US" it looks for U, then goes one step down for S, and then one more step to make sure that is the end of the value. That's three steps.
Searching for LIKE "US.%", it again looks for U, then S, then ., and then stops searching and returns the results. That's also only three steps, since it doesn't care whether the value terminates there.
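You can see this with EXPLAIN. A minimal sketch, assuming an index exists on the combined column (the index name is illustrative):

CREATE INDEX idx_cn ON mytable (`country.number`);

EXPLAIN SELECT * FROM mytable WHERE `country.number` LIKE 'US.%';
-- type: range, key: idx_cn  (a prefix pattern can walk the index)

EXPLAIN SELECT * FROM mytable WHERE `country.number` LIKE '%.US';
-- type: ALL, key: NULL      (a leading wildcard forces a table scan)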
EDIT2: I'm in no way promoting such database denormalization, I just wanted to point out that this matter may not be as straightforward as it seems at first glance.
The latter query:
select * from mytable where country = "US"
should be much faster because MySQL does not have to match a wildcard pattern, as it does with the LIKE query. It just looks up the exact value.
If you need to optimize, a simple = is way better than a LIKE.
Why?
With =, either the string is exactly the same and it's true, or it doesn't match and it's false.
With LIKE, MySQL must compare the string and test whether it matches the mask, and that takes more time and more operations.
So for the sake of your database, use SELECT * FROM `mytable` WHERE country = "US".
The second is faster if there is an index on the country column. MySQL has to scan fewer index entries to produce the result.
Not technically an answer to the question, but... I would expect them to be close enough in speed for it not to (usually) matter, so using "=" is better because it states the intent in a more obvious way.
Why don't you just make country_id a TINYINT UNSIGNED and have a unique iso_code VARCHAR(3) column? (Saves you from all the BS.)
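That is, something like this (a sketch; the table and column names are just illustrative):

CREATE TABLE countries (
  country_id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  iso_code   VARCHAR(3) NOT NULL UNIQUE  -- e.g. 'US' or 'USA'
);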
I have two tables of historical data - one (OldData) is 40,000 records from a datasource with partial/inaccurate data that I am trying to clean, the other (LookupData) is a definitive source of just over one million accurate records.
I am trying to enrich the first, smaller table with records from the larger one, and I can predict matching records by joining on surname and a numeric value known as the service number, but in the first table these numbers are often incomplete.
OldData (partial/inaccurate data) examples:
Surname | ServiceNumber
Smith | 12345
Jones | 9876
Brown | 234
LookupData examples:
Surname | ServiceNumber
SMITH | 12345
SMITH | 23456
JONES | 98765
JONES | 19182
BROWN | T12345
BROWN | 56789
Desired result:
OldData.Surname | OldData.ServiceNumber | LookupData.ServiceNumber
Smith | 12345 | 12345
Jones | 9876 | 98765
Brown | 234 | T12345
The current query that I have is
SELECT OldData.*,LookupData.ServiceNumber
FROM `OldData`
LEFT JOIN `LookupData`
ON lower(OldData.Surname) = lower(LookupData.Surname)
AND LookupData.ServiceNumber like concat('%',OldData.ServiceNumber,'%')
but this never seems to complete
If I narrow it down to a single surname for testing, and add
WHERE OldData.Surname='Devlin'
I get the 47 rows from OldData and the accurate LookupData.ServiceNumber where any matches are found (and null where they aren't) but this query still takes 27 seconds on average.
I have indexes on both Surname fields and ServiceNumber fields.
If I'm seeking the impossible I'd at least like to know :) Thanks
Let's look at the two JOIN conditions of your query.
lower(OldData.Surname) = lower(LookupData.Surname)
Using a function on both sides of the equality slows down the search, because wrapping an indexed column in a function generally prevents MySQL from using the index. Besides, MySQL string comparisons are case-insensitive by default (unless you use the BINARY operator), so this condition can be rewritten as
OldData.Surname = LookupData.Surname
The second JOIN condition is:
LookupData.ServiceNumber like concat('%',OldData.ServiceNumber,'%')
LIKE is not good for performance, especially when there is a % at the beginning: MySQL indexes are ordered, and a leading wildcard leaves no way to find an optimized starting point for the search, so a full scan is triggered. In your sample data it looks like you could almost remove the starting % (though note that BROWN's 234 matches in the middle of T12345, which a prefix match would miss).
Using INSTR will likely not improve performance.
You could try a regexp, like:
LookupData.ServiceNumber REGEXP OldData.ServiceNumber
If you really need to search on both ends on large data, the way to go in MySQL is Full-Text Search Functions. This would require creating a FULLTEXT index on the service number columns (and possibly converting them from numeric to text), and then:
MATCH (LookupData.ServiceNumber) AGAINST (OldData.ServiceNumber)
(Note that AGAINST normally expects a constant search string, so in practice the search term usually has to be supplied by the application rather than taken from another column.)
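Putting the first two fixes together, the join could look like this (a sketch using the names from the question; the prefix-only LIKE assumes mid-string matches such as BROWN's 234 inside T12345 can be tolerated or handled in a second pass):

SELECT OldData.*, LookupData.ServiceNumber
FROM `OldData`
LEFT JOIN `LookupData`
       ON OldData.Surname = LookupData.Surname  -- case-insensitive under default collations
      AND LookupData.ServiceNumber LIKE CONCAT(OldData.ServiceNumber, '%');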
We need to create an index on the "source Path" column, which already has a MUL key. For example, it holds values like /src/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph and we need to search LIKE '%Sal/2016/Jan%'. The table has almost 10 million records.
Please suggest any ideas for performance improvement.
+------------+----------+------+-----+---------+----------------+
| Field      | Type     | Null | Key | Default | Extra          |
+------------+----------+------+-----+---------+----------------+
| Id         | int(11)  | NO   | PRI | NULL    | auto_increment |
| Name       | char(35) | NO   |     |         |                |
| Country    | char(3)  | NO   | UNI |         |                |
| source Path| char(20) | YES  | MUL |         |                |
| Population | int(11)  | NO   |     | 0       |                |
+------------+----------+------+-----+---------+----------------+
Unfortunately, a search that starts with % cannot use an index (this has little to do with the column being part of a composite index).
You have some options though:
The values in your path seem to have actual meaning. The ideal solution would be to take the meta-data, e.g. the month, the name, whatever "Sal" stands for, store it in its own columns or an attribute table, and then query that meta-data instead. This is obviously only possible in very specific cases where you have the required meta-data for every path, so it is probably not an option here.
You can add a "search table" (e.g. (id, subpath)) that contains all subpaths of your source path, e.g.
'/src/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
'/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
'/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
...
'/Sal/2016/Jan/31-01/Joseph'
...
'/31-01/Joseph'
'/Joseph'
so 11 rows in your example. It's now possible to use an index on that, e.g. in
...
where exists
(select * from subpaths s
where s.subpath like '/Sal/2016/Jan%' and s.id = outerquery.id)
This relies on knowing the start of your search term. If Sal in your example %Sal/2016/Jan should actually include word endings (e.g. /NoSal/2016/Jan), you would have to drop the first word from your input term: %Sal/2016/Jan% would become a search for /2016/Jan% (which can use the index), and you would then recheck the result set against the full pattern %Sal/2016/Jan% afterwards (see the fulltext option for an example; it has the same limitation of only matching the beginnings of words).
You will have to maintain the search table, which is usually done in triggers (updating the subpath table whenever you insert, update or delete rows in your original table).
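A minimal sketch of such a setup, assuming the original table is named paths with columns Id and `source Path`, and the search table is named subpaths (all names are assumptions):

CREATE TABLE subpaths (
  id      INT NOT NULL,
  subpath VARCHAR(255) NOT NULL,
  INDEX (subpath)
);

DELIMITER //
CREATE TRIGGER paths_ai AFTER INSERT ON paths
FOR EACH ROW
BEGIN
  DECLARE p VARCHAR(255) DEFAULT NEW.`source Path`;
  WHILE p IS NOT NULL AND p <> '' DO
    INSERT INTO subpaths (id, subpath) VALUES (NEW.Id, p);
    -- drop the first path segment: '/a/b/c' -> '/b/c'; stop after the last segment
    SET p = IF(LOCATE('/', p, 2) > 0, SUBSTRING(p, LOCATE('/', p, 2)), NULL);
  END WHILE;
END//
DELIMITER ;

Corresponding UPDATE and DELETE triggers would be needed as well.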
Since this is a new table, you cannot (directly) combine its index with another one, e.g. to optimize where country = 'A' and subpath like 'Sal/2016/Jan%' when country = 'A' alone would already get rid of 99.99% of the rows. You may have to check EXPLAIN to see whether MySQL actually uses the index (the optimizer can try something different) and then maybe reorganize your query (e.g. use a join or FORCE INDEX).
You can use a fulltext search. From the user input, you would have to generate a query like
select * from
(select * from table
where match(`source Path`) against ('+SAL +2016 +Jan' in boolean mode)) subquery
where `source path` like '%Sal/2016/Jan%'
The fulltext search will not care about the order of the words, so you have to recheck the result set to verify it actually is the correct path, but the fulltext search will use the (fulltext) index to speed things up. It will only look for the beginnings of words, so, similar to the "search table" option, if Sal can be the end of a word, you have to remove it from the fulltext search. By default, only words with at least 3 or 4 letters (depending on your engine) will be added to the index, so you have to set the value of either ft_min_word_len or innodb_ft_min_token_size to whatever fits your requirements.
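A sketch of the setup (the table and index names are assumptions; changing innodb_ft_min_token_size requires a server restart and a rebuild of the fulltext index):

-- in my.cnf, [mysqld] section:
--   innodb_ft_min_token_size = 2

ALTER TABLE mytable ADD FULLTEXT INDEX ft_path (`source Path`);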
The search table approach is probably the most convenient solution, as it can be used very much like your current search: you can plug the user input directly into one place (without having to interpret it to build the against (...) expression), and you can also use it easily in other situations (e.g. in something like join table2 on concat(table2.Year,'/',table2.Month,'%') like ...). But you will have to set up the triggers (or however else you maintain the table), which is a little more complicated than just adding a fulltext index.
I am working on a data analytics dashboard for a media content broadcasting company. Whenever a user clicks a certain channel, a log record is stored in the MySQL DB. The following table stores data regarding channel play times.
Here is the table structure:
| ID | INT(11) |
| Channel_ID | INT(11) |
| playing_date | DATE |
| country_code | VARCHAR(50) |
| playtime_in_sec | INT(11) |
| count_more_then_30_min_play | INT(11) |
| count_15_30_min_play | INT(11) |
| count_0_15_min_play | INT(11) |
| channel_report_tag | VARCHAR(50) |
| device_report_tag | VARCHAR(50) |
| genre_report_tag | VARCHAR(50) |
The query that I run to build one of the dashboard graphs is:
SELECT
channel_report_tag,
SUM(count_more_then_30_min_play) AS '>30 minutes',
SUM(count_15_30_min_play) AS '15-30 Minutes',
SUM(count_0_15_min_play) AS '0-15 Minutes'
FROM
channel_play_times_cleaned
WHERE
playing_date BETWEEN '' AND ''
AND country_code LIKE ''
AND device_report_tag LIKE ''
AND channel_report_tag LIKE ''
GROUP BY
channel_report_tag
LIMIT 10
This query takes a lot of time to return the result set (the table grows by more than a million records per day, and it keeps increasing). I came across this Stack Overflow question: What generic techniques can be applied to optimize SQL queries?, which mentions employing indexes as one of the techniques to optimize SQL queries. At the moment I am confused about how to apply indexes (i.e. on which columns) in order to optimize the above query. I would be very grateful if someone could help me create indexes for my specific scenario. Any other expert advice for a beginner like me is surely welcome.
EDIT :
As suggested by @Thomas G, I have tried to improve my query and make it more specific:
SELECT
channel_report_tag,
SUM(count_more_then_30_min_play) AS '>30 minutes',
SUM(count_15_30_min_play) AS '15-30 Minutes',
SUM(count_0_15_min_play) AS '0-15 Minutes'
FROM
channel_play_times_cleaned
WHERE
playing_date BETWEEN '' AND ''
AND country_code = 'US'
AND device_report_tag = 'j8'
AND channel_report_tag = 'NAT GEO'
GROUP BY
channel_report_tag
LIMIT 10
I started to write this as a comment because these are hints rather than a clear answer, but it got way too long.
First of all, it is common sense (though not an absolute rule) to index the columns appearing in a WHERE clause:
playing_date BETWEEN '' AND ''
AND country_code LIKE ''
AND device_report_tag LIKE ''
AND channel_report_tag LIKE ''
If your columns have a very low cardinality (only a few distinct values; your tag columns?), it's probably not a good idea to index them. country_code and playing_date should be indexed.
The issue here is that there are so many LIKEs in your query. This operator is a performance killer, and you are using it on 3 columns. That's awful for the database. So the question is: is that really needed?
For instance, I see no obvious reason to use LIKE on a country code. Will you really query like this:
AND country_code LIKE 'U%'
to retrieve both UK and US?
You probably won't. Chances are high that you will know exactly which countries you are searching for, so you should do this instead:
AND country_code IN ('UK','US')
Which will be a lot faster if the country column is indexed
Next, if you really want to do a LIKE on your 2 tag columns, you can try this instead:
AND MATCH(device_report_tag) AGAINST ('anything*' IN BOOLEAN MODE)
It is also possible to index your tag columns as FULLTEXT, especially if you search with LIKE 'anything%'. If you search with LIKE '%anything%', the index probably won't help much.
I could also mention that with millions of rows a day, you might have to PARTITION your tables (on the date, for instance). And depending on your data, a composite index on the date and something else might help.
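For illustration, monthly range partitioning on the date could look like this (a sketch; the partition boundaries are examples, and if ID is a primary key, playing_date must first be added to it, since MySQL requires the partition key to be part of every unique key):

ALTER TABLE channel_play_times_cleaned
PARTITION BY RANGE (TO_DAYS(playing_date)) (
  PARTITION p2017_01 VALUES LESS THAN (TO_DAYS('2017-02-01')),
  PARTITION p2017_02 VALUES LESS THAN (TO_DAYS('2017-03-01')),
  PARTITION pmax     VALUES LESS THAN MAXVALUE
);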
Really, there's no simple and straight answer to your complex question, especially given what you've shown (which is not a lot).
Separate indexes are not as useful as composite indexes. Unfortunately, you have many possible combinations, and you are (apparently) allowing wildcards, which may destroy the utility of indexes.
Suggest you use client code to build the WHERE clause rather than populating it with ''
In composite indexes, put one range last. date BETWEEN ... AND ... is a "range".
LIKE 'abc' -- same as = 'abc', so why not change to that.
LIKE 'abc%' -- is a "range"
LIKE '%abc' -- can't use an index.
IN ('CA', 'TX') -- sometimes optimizes like '=', sometimes like 'range'.
So... Watch what queries the users ask for, then build composite indexes to satisfy them. Some rules:
At most one range, and put it last.
Put '=' column(s) first.
INDEX(a,b) is handled by INDEX(a,b,c), so include only the latter.
Don't have more than, say, a dozen indexes.
Index Cookbook
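Applied to the narrowed query in the EDIT above (equality on three columns plus a date range), those rules would suggest something like this sketch:

ALTER TABLE channel_play_times_cleaned
  ADD INDEX idx_dash (channel_report_tag, device_report_tag, country_code, playing_date);
  -- the '=' columns come first (in any order among themselves), the date range last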
'customer_data' table:
id - int auto increment
user_id - int
json - TEXT field containing json object
tags - varchar 200
* id and user_id are indexed.
Each customer (user_id) may have multiple lines.
"json" is TEXT because it may be very large, with many keys, or not so big, with few keys containing short values.
I usually look up the json by user_id.
Problem: with over 100,000 lines, queries take forever to complete. I understand that TEXT fields are very wasteful and MySQL does not index them well.
Fix 1:
Convert the "json" field to multiple columns in the same table where some columns may be blank.
Fix 2:
Create another table with user_id|key|value, but then I may end up doing huge joins; won't that be much slower? Also, the key is a string but the value may be an int or text of various lengths. How do I reconcile that?
I know this is a pretty common use case; what are the "industry standards" for it?
UPDATE
So I guess Fix 2 is the best option. How would I query this table efficiently and get a one-row-per-id result?
id | key | value
-------------------
1 | key_1 | A
2 | key_1 | D
1 | key_2 | B
1 | key_3 | C
2 | key_3 | E
result:
id | key_1 | key_2 | key_3
---------------------------
1 | A | B | C
2 | D | | E
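One common way to get that shape is conditional aggregation (a sketch; the key/value table name customer_data_kv is assumed, and `key` needs backticks because KEY is a reserved word):

SELECT id,
       MAX(CASE WHEN `key` = 'key_1' THEN value END) AS key_1,
       MAX(CASE WHEN `key` = 'key_2' THEN value END) AS key_2,
       MAX(CASE WHEN `key` = 'key_3' THEN value END) AS key_3
FROM customer_data_kv
GROUP BY id;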
This answer is a bit outside the box defined in your question, but I'd suggest:
Fix 3: Use MongoDB instead of MySQL.
This is not to criticize MySQL at all -- MySQL is a great structured relational database implementation. However, you don't seem interested in using either the structured aspects or the relational aspects (either because of the specific use case and requirements or because of your own programming preferences, I'm not sure which). Using MySQL because relational architecture suits your use case (if it does) would make sense; using relational architecture as a workaround to make MySQL efficient for your use case (as seems to be the path you're considering) seems unwise.
MongoDB is another great database implementation, which is less structured and not relational, and is designed for exactly the sort of use case you describe: flexibly storing big blobs of json data with various identifiers, and storing/retrieving them efficiently, without having to worry about structural consistency between different records. JSON is Mongo's native document representation.
While trying to figure out how to tag a blog post with a single SQL statement, the following thought crossed my mind: using a relation table tag2post that references tags by id, as follows, just isn't necessary:
tags
+-------+-----------+
| tagid | tag |
+-------+-----------+
| 1 | news |
| 2 | top-story |
+-------+-----------+
tag2post
+----+--------+-------+
| id | postid | tagid |
+----+--------+-------+
| 0 | 322 | 1 |
+----+--------+-------+
Why not just use the following model, where you index the tag itself? Given that tags are never renamed, only added and removed, this could make sense, right? What do you think?
tag2post
+----+--------+-------+
| id | postid | tag |
+----+--------+-------+
| 1 | 322 | sun |
+----+--------+-------+
| 2 | 322 | moon |
+----+--------+-------+
| 3 | 4443 | sun |
+----+--------+-------+
| 4 | 2567 | love |
+----+--------+-------+
PS: I keep an id in order to easily display the last n tags added...
It works, but it is not normalized, because you have redundancy in the tags. You also lose the ability to use the "same" tags to tag things besides posts. For small N, optimization doesn't matter, so I have no problems if you run with it.
As a practical matter, your indexes will be larger (assuming you are going to index on tag for searching, you are now indexing duplicates and indexing strings). In the normalized version, the index on the tags table will be smaller, will not have duplicates, and the index on the tag2post table on tagid will be smaller. In addition, the fixed size int columns are very efficient for indexing and you might also avoid some fragmentation depending on your clustering choices.
I know you said no renaming, but in general, in both cases, you might still need to think about the semantics of what it means to rename (or even delete) a tag: do all entries need to be changed, or does the tag get split in some way? Because in the worst case this is a batch operation in a transaction (all the tag2post rows have to be renamed), I don't really classify it as significant from a design point of view.
This sounds fine to me. Using an ID to reference something that you delegated to another table makes sense when the values vary, say a user's name, because you don't want to update it in every place in your database when it changes. In this case, however, the tag names themselves will not vary, so the only potential downside I see is that a text index might be slightly slower to search through than a numeric index.
Where is the real advantage of your proposal over a relation table containing IDs?
Technically they solve the same problem, but your proposed solution does it in a redundant, de-normalized way that only seems to satisfy the instinctive urge to be able to read the data directly from the relation table.
The DB server is pretty good at joining tables, and even more so if the join is over an INT field with an index on it. I don't think you will be facing devastating performance issues when you join another table (like: INT id, VARCHAR(50) TagName) to your query.
But you lose the ability to easily rename a tag (even if you don't plan on doing so), and you needlessly inflate your relation table with redundant data. Over time, this may cost you more performance than the normalized solution.
The de-normalised method may be fine depending on your application.
You may find that it causes a performance hit due to searching a large set of VARCHAR data.
When doing a search for things tagged like "sun*" (e.g. sun, sunny, sunrise), you will not need to do a join, but you will need to do a LIKE comparison on a MUCH larger set of VARCHAR data. Proper indexing may alleviate this issue, but only testing will tell you which method is faster with your dataset.
You also have the option of adding a VIEW that pre-joins the normalised tables. This gives you simpler queries while still allowing you to have highly normalised data.
My recommendation is to go with a normalised structure (and add de-normalised views as necessary for ease of use) until you encounter an issue that de-normalising the data schema fixes.
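Such a view over the normalised tables from the question could be as simple as this sketch (the view name is illustrative):

CREATE VIEW post_tags AS
SELECT tp.postid, t.tag
FROM tag2post AS tp
JOIN tags AS t ON t.tagid = tp.tagid;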
I was considering that too. If you want a list of tags in the database, just select distinct tag from tag2post. I was told that since I wanted to optimize for SELECT statements, it would be better to use an integer key, because it is much faster than a string.