MySQL long super-keys

I am currently working on a project which involves altering data stored in a MySQL database. Since the table I am working on does not have a key, I add one with the following command:
ALTER TABLE deCoupledData ADD COLUMN MY_KEY INT NOT NULL AUTO_INCREMENT KEY
Because I want to group my records according to selected fields, I try to create an index for the table deCoupledData that consists of MY_KEY along with the selected fields. For example, if I want to work with the fields STATED_F and NOT_STATED_F, I type:
ALTER TABLE deCoupledData ADD INDEX (MY_KEY, STATED_F, NOT_STATED_F)
The real issue is that the fields I usually work with number more than 16, and MySQL does not allow indexes over more than 16 columns.
Is there another way to do this? Can I somehow make MySQL order the records according to the desired super-key (something like clustering)? I really need to make my script faster, and the main overhead is that each group may contain records that are not stored on the same disk page, so I assume my machine performs random I/O to retrieve them.
Thank you for your time.
Nick Katsipoulakis
CREATE TABLE deCoupledData (
AA double NOT NULL DEFAULT '0',
STATED_F double DEFAULT NULL,
NOT_STATED_F double DEFAULT NULL,
MIN_VALUES varchar(128) NOT NULL DEFAULT '-1,-1',
MY_KEY int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (MY_KEY),
KEY AA (AA) )
ENGINE=InnoDB AUTO_INCREMENT=74358 DEFAULT CHARSET=latin1

Okay, first of all: when you add an index over multiple columns and your query does not filter on the first column, the index is useless.
Example: You have a query like
SELECT *
FROM deCoupledData
WHERE
stated_f = 5
AND not_stated_f = 10
and an index over (MY_KEY, STATED_F, NOT_STATED_F).
The index can only be used if you also have something like AND my_key = 1 in the WHERE clause.
Imagine you want to look up every person in a telephone book with the first name 'John'. The knowledge that the book is sorted by last name is useless; you still have to look at every single entry.
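A minimal sketch of the fix, assuming the grouping columns are what the query actually filters on: put them first and leave MY_KEY out, and the index becomes usable (the index name idx_stated is just an illustration):
-- The filtered columns come first, so the leftmost prefix of the index
-- matches the WHERE clause and MySQL can use it.
ALTER TABLE deCoupledData ADD INDEX idx_stated (STATED_F, NOT_STATED_F);
-- EXPLAIN should now list idx_stated as a usable key:
EXPLAIN SELECT * FROM deCoupledData WHERE STATED_F = 5 AND NOT_STATED_F = 10;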
Also, the primary key does not have to be a surrogate / artificial one. It's nearly always better to have a primary key which is made up of columns which identify each row uniquely anyway.
Also, it's not always good to have many indexes. Not only do indexes slow down INSERTs and UPDATEs; sometimes they also cost an extra lookup, because the index is read first and a second read is needed to fetch the actual data.
That's just a few tips. Maybe Jordan's hint is not a bad idea, "You should maybe post a new question that has your actual SQL query, table layout, and performance questions".
UPDATE:
Yes, that is possible. According to the manual:
If you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index.
which means that the data is practically sorted on disk, yes.
Be aware that it's also possible to define a primary key over multiple columns!
Like
CREATE TABLE deCoupledData (
AA double NOT NULL DEFAULT '0',
STATED_F double DEFAULT NULL,
NOT_STATED_F double DEFAULT NULL,
MIN_VALUES varchar(128) NOT NULL DEFAULT '-1,-1',
MY_KEY int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (NOT_STATED_F, STATED_F, AA),
KEY AA (AA) )
ENGINE=InnoDB AUTO_INCREMENT=74358 DEFAULT CHARSET=latin1
as long as the combination of the columns is unique.
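With such a clustered primary key the rows are physically stored in key order, so a grouping query along the leading key columns reads each group from adjacent pages instead of doing random I/O. A hedged illustration (the column choice is just an example):
-- Rows are stored in (NOT_STATED_F, STATED_F, AA) order, so each group's
-- rows are read contiguously rather than scattered across the disk.
SELECT NOT_STATED_F, STATED_F, COUNT(*) AS group_size
FROM deCoupledData
GROUP BY NOT_STATED_F, STATED_F;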

Related

Why does MySQL not use index if varchar size is too high?

I am joining with a table and noticed that if the field I join on has a varchar size that is too high, MySQL doesn't use the index for that field in the join, resulting in a significantly longer query time. I've put the EXPLAINs and table definition below. This is MySQL 5.7. Any ideas why this is happening?
Table definition:
CREATE TABLE `LotRecordsRaw` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`lotNumber` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL,
`scrapingJobId` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `lotNumber_UNIQUE` (`lotNumber`),
KEY `idx_Lot_lotNumber` (`lotNumber`)
) ENGINE=InnoDB AUTO_INCREMENT=14551 DEFAULT CHARSET=latin1;
Explains:
explain
(
select lotRecord.*
from LotRecordsRaw lotRecord
left join (
select lotNumber, max(scrapingJobId) as id
from LotRecordsRaw
group by lotNumber
) latestJob on latestJob.lotNumber = lotRecord.lotNumber
)
produces:
The screenshot above shows that the derived table is not using the index on "lotNumber". In that example, the "lotNumber" field was a varchar(255). If I change it to be a smaller size, e.g. varchar(45), then the explain query produces this:
The query then runs orders of magnitude faster (2 seconds instead of 100 sec). What's going on here?
Hooray! You found an optimization reason for not blindly using 255 in VARCHAR.
Please try 191 and 192 -- I want to know if that is the cutoff.
Meanwhile, I have some other comments:
A UNIQUE is a KEY. That is, idx_Lot_lotNumber is redundant and may as well be removed.
The Optimizer can (and probably would) use INDEX(lotNumber, scrapingJobId) as a much faster way to find those MAXes.
Unfortunately, there is no way to specify "make a unique index on lotNumber, but also include that other column in the index".
Wait! With lotNumber being unique, there is only one row per lotNumber. That means MAX and GROUP BY are totally unnecessary!
It seems like lotNumber could be promoted to PRIMARY KEY (and completely get rid of id).
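A sketch of what that restructuring might look like, assuming nothing else references the old id column (the shorter varchar(191) follows the length experiment suggested above):
-- lotNumber becomes the clustered primary key; the surrogate id, the
-- UNIQUE key and the redundant secondary index all disappear.
CREATE TABLE `LotRecordsRaw` (
  `lotNumber` varchar(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `scrapingJobId` int(11) DEFAULT NULL,
  PRIMARY KEY (`lotNumber`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;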

How to efficiently update values without a primary key in MySQL?

I am currently facing an issue with designing a database table and updating/inserting values into it.
The table is used to collect and aggregate statistics that are identified by:
the source
the user
the statistic
an optional material (e.g. item type)
an optional entity (e.g. animal)
My main issue is, that my proposed primary key is too large because of VARCHARs that are used to identify a statistic.
My current table is created like this:
CREATE TABLE `Statistics` (
`server_id` varchar(255) NOT NULL,
`player_id` binary(16) NOT NULL,
`statistic` varchar(255) NOT NULL,
`material` varchar(255) DEFAULT NULL,
`entity` varchar(255) DEFAULT NULL,
`value` bigint(20) NOT NULL)
In particular, the server_id is configurable, the player_id is a UUID, statistic is the representation of an enumeration that may change, material and entity likewise. The value is then aggregated using SUM() to calculate the overall statistic.
So far it works, but I have to use DELETE and INSERT statements whenever I want to update a value, because I have no primary key and I can't figure out how to create one within the constraints of MySQL.
My main question is: How can I efficiently update values in this table and insert them when they are not currently present without resorting to deleting all the rows and inserting new ones?
The main issue seems to be the restriction MySQL puts on the primary key. I don't think adding an id column would solve this.
Simply add an auto-incremented id:
CREATE TABLE `Statistics` (
statistis_id int auto_increment primary key,
`server_id` varchar(255) NOT NULL,
`player_id` binary(16) NOT NULL,
`statistic` varchar(255) NOT NULL,
`material` varchar(255) DEFAULT NULL,
`entity` varchar(255) DEFAULT NULL,
`value` bigint(20) NOT NULL
);
Voila! A primary key. But you probably want an index. One that comes to mind:
create index idx_statistics_server_player_statistic on statistics(server_id, player_id, statistic);
Depending on what your code looks like, you might want additional or different keys in the index, or more than one index.
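If the identifying columns can instead be declared UNIQUE, the usual upsert pattern becomes available and the DELETE-then-INSERT dance goes away. A hedged sketch (the key name, prefix lengths and sample values are made up; note that MySQL allows multiple NULLs in a unique index, so the nullable material/entity columns weaken the constraint unless you store '' instead):
-- Hypothetical unique key; prefix lengths keep it under InnoDB's key-size limit.
ALTER TABLE `Statistics`
  ADD UNIQUE KEY `uq_stat` (`server_id`(64), `player_id`, `statistic`(64), `material`(64), `entity`(64));
-- Upsert instead of DELETE + INSERT:
INSERT INTO `Statistics` (`server_id`, `player_id`, `statistic`, `material`, `entity`, `value`)
VALUES ('srv-1', UNHEX('00112233445566778899AABBCCDDEEFF'), 'MINE_BLOCK', 'STONE', '', 10)
ON DUPLICATE KEY UPDATE `value` = VALUES(`value`);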
Follow the steps below; hopefully it will solve your problem:
- First, add a column to your table, let's suppose it is called "detailed", holding a number.
- In your project, before running an INSERT statement, fetch the current maximum of that column (SELECT MAX(detailed)+1 AS maxid FROM TABLE_NAME) and use that number to identify the record later, which lets you FETCH or DELETE it.
- You can also UPDATE using it, but during an UPDATE the maximum of detailed is not required.
Hope you understand this and that it helps you.
I have dug a bit more through the internet and optimized my code a lot.
I asked this question because of bad performance, which I assumed was because of the DELETE and INSERT statements following each other.
I was thinking that I could reduce the load by doing INSERT IGNORE statements followed by UPDATE statements, or INSERT .. ON DUPLICATE KEY UPDATE statements. But to be useful they require keys, which I did not have because of the constraints in MySQL.
I have fixed the performance issues though:
By reducing the number of statements generated asynchronously (I know JDBC is blocking, but it worked; it just blocked thousands of threads) and disabling auto-commit, I was able to improve the performance by a factor of 600 (from 60 seconds down to 0.1 seconds).
Next steps are to improve the connection string and gain even more performance.
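At the SQL level the same idea looks roughly like this (a sketch only; the real change was batching the statements through JDBC, and the sample rows are invented):
-- Disable per-statement commits and flush many writes in one transaction,
-- paying for one commit instead of one per INSERT.
SET autocommit = 0;
START TRANSACTION;
INSERT INTO `Statistics` (`server_id`, `player_id`, `statistic`, `value`)
VALUES ('srv-1', UNHEX('00112233445566778899AABBCCDDEEFF'), 'PLAY_TIME', 42);
INSERT INTO `Statistics` (`server_id`, `player_id`, `statistic`, `value`)
VALUES ('srv-1', UNHEX('FFEEDDCCBBAA99887766554433221100'), 'PLAY_TIME', 17);
COMMIT;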

Performance of simple SELECT operation on big (2 GB) table

I have a really simple query to get MIN and MAX values; it looks like:
SELECT MAX(value_avg)
, MIN(value_avg)
FROM value_data
WHERE value_id = 769
AND time_id BETWEEN 214000 AND 219760;
And here is the schema of the value_data table:
CREATE TABLE `value_data` (
`value_id` int(11) NOT NULL,
`time_id` bigint(20) NOT NULL,
`value_min` float DEFAULT NULL,
`value_avg` float DEFAULT NULL,
`value_max` float DEFAULT NULL,
KEY `idx_vdata_vid` (`value_id`),
KEY `idx_vdata_tid` (`time_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
As you can see, the query and the table are simple and I don't see anything wrong here, but when I execute this query it takes about ~9 seconds to get the data. I also profiled the query, and 99% of the time is spent in "Sending data".
The table is really big, about 2 GB, but is that the problem? I don't think the table is too big; it must be something else...
MySQL can easily handle a database of that size. However, you should be able to improve the performance of this query and probably the table in general. By changing the time_id column to an UNSIGNED INT NOT NULL, you can significantly decrease the size of the data and indexes on that column. Also, the query you mention could benefit from a composite index on (value_id, time_id). With that index, it would be able to use the index for both parts of the query instead of just one as it is now.
Also, please edit your question with an EXPLAIN of the query. It should confirm what I expect about the indexes, but it's always helpful information to have.
Edit:
You don't have a PRIMARY index defined for the table, which definitely isn't helping your situation. If the values of (value_id, time_id) are unique, you should probably make the new composite index I mention above the PRIMARY index for the table.
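In DDL form, the suggestions above might look like the following sketch; making (value_id, time_id) the primary key is only valid if that pair really is unique:
-- Shrink time_id and add a composite key covering both WHERE conditions.
ALTER TABLE `value_data`
  MODIFY `time_id` int(10) unsigned NOT NULL,
  ADD PRIMARY KEY (`value_id`, `time_id`),
  DROP INDEX `idx_vdata_vid`;  -- redundant once value_id leads the primary key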

Select rows where column LIKE dictionary word

I have 2 tables:
Dictionary - Contains roughly 36,000 words
CREATE TABLE IF NOT EXISTS `dictionary` (
`word` varchar(255) NOT NULL,
PRIMARY KEY (`word`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Datas - Contains roughly 100,000 rows
CREATE TABLE IF NOT EXISTS `datas` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`hash` varchar(32) NOT NULL,
`data` varchar(255) NOT NULL,
`length` int(11) NOT NULL,
`time` int(11) NOT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `hash` (`hash`),
KEY `data` (`data`),
KEY `length` (`length`),
KEY `time` (`time`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=105316 ;
I would like to somehow select all the rows from datas where the column data contains one or more dictionary words.
I understand this is a big ask; it would need to match the rows together in every possible combination, so it needs good optimization.
I have tried the below query, but it just hangs for ages:
SELECT `datas`.*, `dictionary`.`word`
FROM `datas`, `dictionary`
WHERE `datas`.`data` LIKE CONCAT('%', `dictionary`.`word`, '%')
AND LENGTH(`dictionary`.`word`) > 3
ORDER BY `length` ASC
LIMIT 15
I have also tried something similar with a LEFT JOIN and an ON clause containing the LIKE condition.
This is actually not an easy problem; what you are trying to do is called full-text search, and relational databases are not the best tools for such a task. If this is some kind of core functionality, consider using a solution dedicated to this kind of operation, like Sphinx Search Server.
If this is not a "Mission Critical" system, you can try with something else. I can see that datas.data column isn't really long, so you can create a structure dedicated for your task and keep maintaining it during operational use. Fore example, create table:
CREATE TABLE dictionary_datas (
  datas_id int(11) NOT NULL,
  word varchar(255) NOT NULL,
  FOREIGN KEY (datas_id) REFERENCES datas (ID),
  FOREIGN KEY (word) REFERENCES dictionary (word)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Now, any time you insert, delete or simply modify rows in the datas or dictionary tables, you also update dictionary_datas, recording which datas_id contains which words (basically a many-to-many relation). Of course this will degrade your performance, so if you have a high transactional load on your system you can do it periodically instead. For example, schedule a cron job that runs every night at 03:00 and refreshes the table. To simplify the task you can add a TO_CHECK flag to the datas table and refresh the mapping only for records that have it set to 1 (after you refresh dictionary_datas you switch the value back to 0). Remember, by the way, to refresh the whole datas table after an update to the dictionary table. 36,000 and 100,000 are not big numbers in terms of data processing.
Once you have this table you can just query it like:
SELECT datas_id, count(*) AS words_num FROM dictionary_datas GROUP BY datas_id HAVING count(*) > 3;
To speed up the query (at the cost of slower updates) you can create a composite index on its columns (datas_id, word), in exactly that order. If you decide to refresh the data periodically, you should drop the index before the refresh, then refresh the data, and finally recreate the index afterwards; that way is faster.
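The nightly refresh described above could look roughly like this (a sketch, assuming the dictionary_datas table and the TO_CHECK flag column exist as described):
-- Drop stale mappings for changed rows, then rebuild them.
DELETE FROM dictionary_datas
WHERE datas_id IN (SELECT ID FROM datas WHERE TO_CHECK = 1);
INSERT INTO dictionary_datas (datas_id, word)
SELECT d.ID, w.word
FROM datas d
JOIN dictionary w ON d.data LIKE CONCAT('%', w.word, '%')
WHERE d.TO_CHECK = 1;
-- Mark the rows as processed.
UPDATE datas SET TO_CHECK = 0 WHERE TO_CHECK = 1;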
I'm not sure I understood your problem, but I think this could be a solution. I know people tend to dislike regular expressions, but this works for me to select rows whose value contains more than one word.
SELECT * FROM datas WHERE data REGEXP "([a-z] )+"
Have you tried this?
select *
from dictionary, datas
where position(word in data) > 0
;
This is very inefficient, but might be good enough for you. Here is a fiddle.
For better performance, you could try placing a full-text index on your text column data and then using MATCH ... AGAINST instead of POSITION.
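In MySQL that would look roughly like the sketch below (InnoDB supports FULLTEXT indexes from 5.6; the index name and the sample word are invented). Note that MATCH ... AGAINST only finds whole words and its argument must be a constant string, so you would loop over dictionary words from application code rather than joining:
-- One-time index:
ALTER TABLE datas ADD FULLTEXT INDEX ft_data (data);
-- Then, once per dictionary word:
SELECT * FROM datas
WHERE MATCH(data) AGAINST ('+example' IN BOOLEAN MODE)
ORDER BY length ASC
LIMIT 15;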

First database with referential integrity constraints -- suggestions, feedback, errors?

TARGET_RDBMS: MySQL-5.X-InnoDB ("X" equals current stable release)
BACKGROUND: I'm building my first database with true referential integrity constraints. In an effort to get feedback, after creating the "real" DDL I've made an abstraction that I believe captures the "feel" of the database; this is only 3 tables of about 20, all with referential integrity constraints. The only pattern I see missing is a composite-key table, which has no data to be dumped in right now anyway, so I'm just focusing on the first iteration.
Sample Data / Unit Test: One thing I do not know is how to build out a sample data set that will offer 100% coverage of the referential integrity modeled -- AND build "Unit Test" around that sample data and this DDL:
Sample DDL:
(Note: Just to be clear, the LEGEND and naming standards are JUST for this example, which I've abstracted from the "real" database. The column names are robotic in nature, and meant to make the meaning and relationship of a given instance as clear as possible. If you have suggestions on the notation system used, please feel free to comment. I'm open to any suggestions. Thanks!)
CREATE DATABASE sampleDB;
use sampleDB;
# ###############
# LEGEND
# - sID = surrogate key
# - nID = natural key
# - cID = common/shared across tables, but NOT unique/natural-key
# - PK = Primary Key
# - FK = Foreign Key
# - data01 = Sample data (non-key,not-shared-across-tables)
# - data02 = Sample data NOT NULL (non-key,not-shared-across-tables)
#
# - uID = user defined unique/natural key (NOTE: not used)
# ###############
# Behavior
# - create_timestamp (NOT NULL, updated on record creation, NOT update)
# - update_timestamp (NOT NULL, updated on record creation AND updates)
CREATE TABLE `TABLE_01` (
`TABLE_01_sID_PK` MEDIUMINT NOT NULL AUTO_INCREMENT,
`TABLE_01_cID` int(8) NOT NULL,
`TABLE_01_data01` varchar(128) default NULL,
`TABLE_01_data02` varchar(128) default NULL,
`create_timestamp` DATETIME DEFAULT NULL,
`update_timestamp` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`TABLE_01_sID_PK`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `TABLE_02` (
`TABLE_02_sID_PK` MEDIUMINT NOT NULL AUTO_INCREMENT,
`TABLE_02_nID_FK__TABLE_01_sID_PK` int(8) NOT NULL,
`TABLE_02_cID` int(8) NOT NULL,
`TABLE_02_data01` varchar(128) default NULL,
`TABLE_02_data02` varchar(128) NOT NULL,
`create_timestamp` DATETIME DEFAULT NULL,
`update_timestamp` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`TABLE_02_sID_PK`),
FOREIGN KEY (TABLE_02_nID_FK__TABLE_01_sID_PK) REFERENCES TABLE_01(TABLE_01_sID_PK),
INDEX `TABLE_02_nID_FK__TABLE_01_sID_PK` (`TABLE_02_nID_FK__TABLE_01_sID_PK`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
CREATE TABLE `TABLE_03` (
`TABLE_03_sID_PK` MEDIUMINT NOT NULL AUTO_INCREMENT,
`TABLE_03_nID_FK__TABLE_01_sID_PK` int(8) NOT NULL,
`TABLE_03_nID_FK__TABLE_02_sID_PK` int(8) NOT NULL,
`TABLE_03_cID` int(8) NOT NULL,
`TABLE_03_data01` varchar(128) default NULL,
`TABLE_03_data02` varchar(128) NOT NULL,
`create_timestamp` DATETIME DEFAULT NULL,
`update_timestamp` TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`TABLE_03_sID_PK`),
FOREIGN KEY (TABLE_03_nID_FK__TABLE_01_sID_PK) REFERENCES TABLE_01(TABLE_01_sID_PK),
FOREIGN KEY (TABLE_03_nID_FK__TABLE_02_sID_PK) REFERENCES TABLE_02(TABLE_02_sID_PK),
INDEX `TABLE_03_nID_FK__TABLE_01_sID_PK` (`TABLE_03_nID_FK__TABLE_01_sID_PK`),
INDEX `TABLE_03_nID_FK__TABLE_02_sID_PK` (`TABLE_03_nID_FK__TABLE_02_sID_PK`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
SHOW TABLES;
# DROP DATABASE `sampleDB`;
# #######################
# View table definition
# DESC inserttablename;
# #######################
# View table create statement
# SHOW CREATE TABLE example;
Questions:
Any and all feedback on missing, wrong, or "better" ways to do this database build is welcome. If you have questions, just comment and I'll respond ASAP. Again, thanks!
UPDATE (1):
Just added "MEDIUMINT NOT NULL AUTO_INCREMENT" to the PKs -- not sure how I left that off.
First of all, I want to applaud you for defining a standard. There is no end to how much it will come to help you in the future.
Having said that, a couple of very subjective opinions from my part:
I don't like to embed type information in names, such as "TABLE_PERSON" or "PERSON_T" because it becomes confusing the second you replace a table with a view instead. At this point you could of course search and replace "PERSON_T" with "PERSON_VW" instead, but it kind of misses the point :)
The same goes for columns (although I can't see this in your example). Think of an "n_is_dead" column that gets changed from numeric to varchar.
Can a row exist in a table without being created (create_timestamp)? Declare columns as NOT NULL if they really can't be null. In fact, I start off having NOT NULL on most of my columns because it makes me think harder about the nature of the data.
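For instance, a sketch of that suggestion applied to the DDL above (assuming MySQL 5.6.5 or later, where DATETIME columns accept a CURRENT_TIMESTAMP default; on older 5.x the application or a trigger would have to fill it in):
-- Rows can no longer exist without a creation time.
ALTER TABLE `TABLE_01`
  MODIFY `create_timestamp` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP;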
I'm a fan of naming the primary key column something other than ID. For example
company(company_id, etc)
person(person_id, company_id, firstname etc)
I've heard some people have problems with O/R mappers that want the primary key to be named "ID" at all times, but I don't know if this is still true or if it has changed recently.
It's not clear to me whether you intended to embed (s, n, c) in the column names to indicate whether they are surrogate, natural or common keys. But I don't think this is a good idea either: I feel it would "reveal" an implementation detail that doesn't fit naturally in the logical model.
It looks like you are exposing/embedding the foreign key relationships in the column names. I have never thought of doing this, but I think you will deeply regret it, if only because it makes the column names unbearably ugly :)
On choosing a name for an index: the only time I regret an index name is when I look at an execution plan and see "index_01" being used. I always wish I had put the column name in the index name to make it visible in the plan. I don't know MySQL's limit on index-name length, but I always run into the limit on Oracle, so try to come up with a rule for abbreviating the table name; the column name is the important part here.
Regarding mixed case: I always (no exceptions) go with either ALL_UPPER_CASE or all_lower_case. The reason is that in the past I've been burned when migrating queries between databases that treat case differently. Lately I use all_lower_case because the typical font of our editors makes it easier to spot spelling errors in lower case than in upper case. And when I fail at things, it doesn't feel like the editor is SHOUTING AT ME ;)