MySQL index creation strategy and inner logic

This question is looking for a general answer to the broad problem of index creation on a MySQL database.
Let's take this example table:
CREATE TABLE IF NOT EXISTS `article` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`published` tinyint(1) NOT NULL DEFAULT '0',
`author_id` int(11) unsigned NOT NULL,
`modificator_id` int(11) unsigned DEFAULT NULL,
`category_id` int(11) unsigned DEFAULT NULL,
`title` varchar(200) COLLATE utf8_unicode_ci NOT NULL,
`headline` text COLLATE utf8_unicode_ci NOT NULL,
`content` text COLLATE utf8_unicode_ci NOT NULL,
`url_alias` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
`priority` mediumint(11) unsigned NOT NULL DEFAULT '50',
`publication_date` datetime NOT NULL,
`creation_date` datetime NOT NULL,
`modification_date` datetime NOT NULL,
PRIMARY KEY (`id`)
);
On such a table there is a wide range of queries that could be performed on different criteria:
category_id
published
publication_date
e.g.:
SELECT id FROM article WHERE NOT published AND category_id = '2' ORDER BY publication_date;
On many tables you can see a wide range of state fields (like published here), date fields, or reference fields (like author_id or category_id). What strategy should be used to create indexes?
This can be broken down into the following points:
Should an index be made on every field that can be used in a query (either in a WHERE clause or an ORDER BY), even if this leads to a lot of indexes per table?
Should an index also be made on fields that hold only a small set of values, like booleans or enums? This only reduces the scanned rows by a factor of n (assuming n possible values, each used with roughly equal frequency).
I've read that MySQL prior to 5.0 used only one index per query. How does the system pick it? By choosing the most restrictive one?
How is an OR condition processed?
How much will this slow down inserts?
Does the choice of InnoDB or MyISAM change anything here?
I know the EXPLAIN statement can be used to see whether a query is optimized or not, but a bit of concrete theory would really be more constructive than a purely empirical approach!
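For example, for the sample query above, I would imagine a composite index along these lines (the index name is made up, and I'm not sure it is the right approach), with the equality columns first and the ORDER BY column last so the sort can come from the index:
ALTER TABLE article
  ADD INDEX idx_published_cat_date (published, category_id, publication_date);

-- rewriting NOT published as published = 0 so the condition can use the index
EXPLAIN SELECT id FROM article
WHERE published = 0 AND category_id = 2
ORDER BY publication_date;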

Related

How can I define a column as tinyint(4) using Sequelize ORM

I need to define my table as
CREATE TABLE `test` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`o_id` int(11) unsigned NOT NULL,
`m_name` varchar(45) NOT NULL,
`o_name` varchar(45) NOT NULL,
`customer_id` int(11) unsigned NOT NULL,
`client_id` tinyint(4) unsigned DEFAULT '1',
`set_id` tinyint(4) unsigned DEFAULT NULL,
`s1` tinyint(4) unsigned DEFAULT NULL,
`s2` tinyint(4) unsigned DEFAULT NULL,
`s3` tinyint(4) unsigned DEFAULT NULL,
`review` varchar(2045) DEFAULT NULL,
`created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `br_o_id_idx` (`o_id`),
KEY `br_date_idx` (`created_at`),
KEY `br_on_idx` (`o_name`),
KEY `br_mn_idx` (`m_name`)
)
but looking at the Sequelize documentation, it does not have support for TINYINT with a display width.
Given the lack of ZEROFILL in your table definition, I suspect tinyint(4) probably does not do what you think it does. From 11.2.5 Numeric Type Attributes:
MySQL supports an extension for optionally specifying the display width of integer data types in parentheses following the base keyword for the type.
...
The display width does not constrain the range of values that can be stored in the column.
...
For example, a column specified as SMALLINT(3) has the usual SMALLINT range of -32768 to 32767, and values outside the range permitted by three digits are displayed in full using more than three digits.
I'm not sure if other RDBMSs treat the number in parentheses differently, but from perusing the Sequelize source it looks like its authors are under the same incorrect impression you are.
That being said, the important part of your schema, namely that you want to store those fields as TINYINTs (using only a byte of storage to hold values between 0 and 255), is sadly not available in the Sequelize DataTypes. I might suggest opening a PR to add it...
On the other hand, if you really are looking for the ZEROFILL functionality, and need to specify that display width of 4, you could do something like Sequelize.INTEGER(4).ZEROFILL, but obviously, that would be pretty wasteful of space in your DB.
For MySQL, the Sequelize.BOOLEAN data type maps to TINYINT(1). See
https://github.com/sequelize/sequelize/blob/3e5b8772ef75169685fc96024366bca9958fee63/lib/data-types.js#L397
and
http://docs.sequelizejs.com/en/v3/api/datatypes/
As noted by @user866762, the number in parentheses only affects how the data is displayed, not how it is stored. So, TINYINT(1) vs. TINYINT(4) should have no effect on your data.
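If you want to see this for yourself, a throwaway table makes the point (illustrative names; run it in any scratch schema):
CREATE TABLE width_demo (
  a TINYINT(1) UNSIGNED,
  b TINYINT(4) UNSIGNED
);
INSERT INTO width_demo VALUES (255, 255);
-- both columns accept and return the full TINYINT UNSIGNED range (0-255);
-- the (1) and (4) only change padding when ZEROFILL is also set
SELECT a, b FROM width_demo;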

Indexing columns for faster querying in MySQL 5.6 or higher

I'm building a real estate app. I have a table called properties, which is the main table holding all the common columns (10 columns) for all property types (lands, apartments, etc.), and then I have a specific table for each property type, since each type has some specific columns. Here is the properties table:
CREATE TABLE `properties` (
`property_id` int(11) NOT NULL AUTO_INCREMENT,
`property_type` int(11) DEFAULT NULL,
`property_title` varchar(255) NOT NULL,
`property_description` varchar(1000) NOT NULL,
`country_id` int(11) NOT NULL,
`city_id` int(11) NOT NULL,
`city_location_id` int(11) NOT NULL,
`price` int(11) DEFAULT NULL,
`area` decimal(7,2) DEFAULT NULL,
`latitude` decimal(10,8) DEFAULT NULL,
`longitude` decimal(11,8) DEFAULT NULL,
`entry_date` datetime NOT NULL,
`last_modification_date` datetime NOT NULL,
PRIMARY KEY (`property_id`)
)
and here is the apartments table, for example:
CREATE TABLE `apartments` (
`apartment_id` INT NOT NULL COMMENT '',
`num_of_bedrooms` INT NULL COMMENT '',
`num_of_bathrooms` INT NULL COMMENT '',
`num_of_garages` INT NULL COMMENT '',
PRIMARY KEY (`apartment_id`) COMMENT '',
CONSTRAINT `properties_apartments_fk`
FOREIGN KEY (`apartment_id`)
REFERENCES `aqar_world`.`properties` (`property_id`)
ON DELETE CASCADE
ON UPDATE NO ACTION);
Now the user can filter his search based on almost any of these columns or a combination of them, so how should I plan my indexing strategy? The user could filter by price, area, area and price, number of bedrooms and location, and so on, in many combinations. Another point is that property_description and property_title are text, so I'll have to add a FULLTEXT index on each of them, right? There is also a join between these two tables, and between them and some other tables (like an agents table, for example).
I've read some say that since MySQL 5.6 there is something in the optimizer that makes use of multiple indexes, so you could put an index on each column, but I don't know if that is right. Please advise, since I'm not that good at taking care of DB performance.
5.7 has JSON tricks. MariaDB 10 has Dynamic Columns with similar tricks.
The main principle: Expose the more useful fields; throw the more obscure fields into JSON or Dynamic columns. Then let MySQL filter on the former, and your app takes care of further filtering on the latter.
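A sketch of that principle against the properties table (the extra_attrs column, index names, and the JSON key are illustrative, not part of the original schema; requires MySQL 5.7+ for the JSON type):
ALTER TABLE properties
  ADD COLUMN extra_attrs JSON NULL,
  ADD INDEX idx_city_price (city_id, price),
  ADD INDEX idx_city_area (city_id, area);

-- MySQL narrows the result through the indexed columns; the rarely used
-- attributes are filtered from the JSON column afterwards (or in the app)
SELECT property_id, property_title, price, area
FROM properties
WHERE city_id = 5
  AND price BETWEEN 100000 AND 200000
  AND JSON_EXTRACT(extra_attrs, '$.num_of_fireplaces') >= 1;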

Add Indexes correctly to a large database table

The problem is that after I insert 200,000-300,000 rows of data into the table, searches become very slow, and the first thing that came to mind is that I may not have added the indexes correctly. I tried adding as many BTREE indexes as possible, but phpMyAdmin did not let me add them for all columns. What would be the correct indexes for my table? I have the following table with the following indexes:
CREATE TABLE IF NOT EXISTS `carads` (
`ADID` int(7) NOT NULL AUTO_INCREMENT,
`LINK` varchar(255) CHARACTER SET latin1 NOT NULL,
`TITLE` varchar(255) NOT NULL,
`MAKE` varchar(50) CHARACTER SET latin1 NOT NULL,
`MODEL` varchar(100) CHARACTER SET latin1 NOT NULL,
`FUEL` varchar(50) CHARACTER SET latin1 NOT NULL,
`LOC` varchar(100) NOT NULL,
`TRANS` varchar(50) NOT NULL,
`YEAR` varchar(4) CHARACTER SET latin1 NOT NULL,
`BODY` varchar(255) CHARACTER SET latin1 NOT NULL,
`DESCRIPT` text CHARACTER SET latin1 NOT NULL,
`PHONENR` varchar(20) NOT NULL,
`MILEAGE` int(11) NOT NULL,
`PRICE` int(20) NOT NULL,
`DISTANCE` int(250) NOT NULL,
`POSTCODE` varchar(250) NOT NULL,
`IMAGE1` varchar(255) NOT NULL,
`IMAGE2` varchar(255) NOT NULL,
`IMAGE3` varchar(255) NOT NULL,
`IMAGE4` varchar(255) NOT NULL,
`IMAGE5` varchar(255) NOT NULL,
`CPHONE` varchar(250) NOT NULL,
`CEMAIL` varchar(500) NOT NULL,
`COLOUR` varchar(250) NOT NULL,
`EQUIPMENT` text NOT NULL,
`STATUS` tinyint(1) NOT NULL DEFAULT '1',
`DATE` date NOT NULL,
`DEL` int(1) NOT NULL DEFAULT '0',
PRIMARY KEY (`ADID`),
KEY `ix_MakeModelPrice` (`STATUS`,`MAKE`(25),`MODEL`(25),`PRICE`),
KEY `ix_Price` (`PRICE`,`STATUS`,`DEL`,`TITLE`(30),`ADID`),
KEY `ix_Date` (`DATE`,`STATUS`,`DEL`,`TITLE`(30),`ADID`),
KEY `LINK` (`LINK`),
FULLTEXT KEY `MODEL` (`MODEL`),
FULLTEXT KEY `SearchIndex` (`TITLE`,`LOC`,`TRANS`,`CPHONE`,`CEMAIL`,`COLOUR`,`EQUIPMENT`),
FULLTEXT KEY `MAKE` (`MAKE`)
)
ENGINE=MyISAM DEFAULT CHARSET=utf8 AUTO_INCREMENT=2478687;
This is very complicated and we cannot give you the one correct answer; you have to understand the data and find the best answer yourself.
You have to keep the following in mind:
The query optimizer will choose only one index.
Indexes that start with a column like "status" or "del" (boolean values, or values shared by 95% of the rows) add no value on their own, unless these low-selectivity columns are followed by frequently queried, highly selective columns.
You should first find the attributes which:
are used in most of the queries (I could imagine that "make", "price" and "year" are good candidates)
are most selective (meaning that the resulting rows are < 10% of the table)
You have to find out what distribution of values exists in your table for each of these columns. Examples (a query sketch for measuring this follows the list):
Make:
BMW: 5%
Alfa Romeo: 1%
VW: 7%
...
Price-Range:
0..999: 3%
1000..1999: 4%
2000..3000: 5%
...
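One way to measure these distributions against the real data (a sketch against the carads table above):
SELECT MAKE,
       COUNT(*) AS cnt,
       ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM carads), 1) AS pct
FROM carads
GROUP BY MAKE
ORDER BY cnt DESC;

-- the same idea for price, bucketed into ranges of 1000
SELECT FLOOR(PRICE / 1000) * 1000 AS price_bucket, COUNT(*) AS cnt
FROM carads
GROUP BY price_bucket
ORDER BY price_bucket;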
If 80% of all searches contain "make", "price" and "year", then create an index with all 3 columns. Put the columns which are most selective and/or mentioned in most searches at the front, followed by the other columns.
With some luck you can improve the response time of many searches dramatically. You can then dig deeper into the statistics and add some other indexes. Maybe 80% of all searches have a condition on "make", but among the rest there are still many searches without "make" that focus on "price" and "fuel"; then create an index for those searches.
You could also improve performance by using codes (e.g. Alfa Romeo=1, BMW=2, VW=3, ...) or by clustering ranges of values (e.g. price_range: 0..999, 1000..2000, ...). This helps MySQL build more efficient indexes (smaller means a lower memory footprint and less I/O).
And to understand indexes better, try to submit a query like this (the intent is that the index ix_MakeModelPrice is used):
-- ix_MakeModelPrice: `STATUS`,`MAKE`(25),`MODEL`(25),`PRICE`
SELECT * FROM carads
WHERE STATUS=1 AND MAKE='Alfa Romeo'
AND MODEL='159' AND PRICE BETWEEN 100 AND 1000
ORDER BY ADID DESC
LIMIT 10
This query should be fast (hopefully with some matching rows). Do you see why it is fast? "STATUS" is not selective, but the rest should reduce the number of rows found with an index scan to probably well below 1%. The number of physical row reads is reduced to a minimum => faster response.
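To verify which index the optimizer actually picks, prefix the same query with EXPLAIN:
EXPLAIN
SELECT * FROM carads
WHERE STATUS=1 AND MAKE='Alfa Romeo'
AND MODEL='159' AND PRICE BETWEEN 100 AND 1000
ORDER BY ADID DESC
LIMIT 10;
-- the key column should report ix_MakeModelPrice and rows should be small;
-- if another index is chosen, compare the estimated row counts and adjust the column order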

MySQL LIMIT Performance

I have a large table in MySQL, about 1 million records.
I'm using a dynamic query with different parameters in the WHERE clause and ORDER BY, so I can't use something like AND id > 34000 LIMIT 10.
I have indexes on the fields used in WHERE, ORDER BY and LIMIT, but an index alone doesn't help.
I need a better way than LIMIT 34000, 10. Is there any way to solve the offset delay?
I've included my table schema, but I've only copied the more relevant fields, without any indexes, because I'm using dynamic queries.
CREATE TABLE IF NOT EXISTS `p_apartmentbuy` (
`property_id` mediumint(8) unsigned NOT NULL,
`dateadd` int(10) unsigned NOT NULL,
`sqm` smallint(5) unsigned NOT NULL,
`sqmland` smallint(5) unsigned NOT NULL,
`age` tinyint(2) unsigned NOT NULL,
`price` bigint(12) unsigned NOT NULL,
`pricemeter` int(11) unsigned NOT NULL,
`floortotal` tinyint(3) unsigned NOT NULL,
`floorno` tinyint(3) unsigned NOT NULL,
`unittotal` smallint(4) unsigned NOT NULL,
`unitthisfloor` tinyint(3) unsigned NOT NULL,
`room` tinyint(1) unsigned NOT NULL,
`parking` tinyint(1) unsigned NOT NULL,
`renovate` tinyint(1) unsigned NOT NULL,
`address` varchar(255) COLLATE utf8_general_ci NOT NULL,
`describe` varchar(500) COLLATE utf8_general_ci NOT NULL,
`featured` tinyint(1) unsigned NOT NULL,
`l_location_id` smallint(5) unsigned NOT NULL,
`l_city_id` smallint(4) unsigned NOT NULL,
`pf_furnished_id` tinyint(2) unsigned NOT NULL,
PRIMARY KEY (`property_id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_general_ci;
With a table of 1 million records, the problem won't be the AND id > 34000 LIMIT 10 or the LIMIT 34000, 10; it will come down to the structure and the rest of the query. That is, you need indexes, a PK and FKs to speed up the query; beyond that, an ORDER BY will probably slow it down, and a search like LIKE '%text%' will make your query SLOW. It also depends on the table's engine.
So don't expect that changing LIMIT 10 will make a huge difference. There are a couple of tools that will help you determine a 'better' query, but not all queries work the same, so don't expect the "best solution", because it doesn't exist.
You can use SHOW CREATE TABLE, DESCRIBE SELECT ... or EXPLAIN to see what's going on, or use the BENCHMARK function to see the approximate time of a function you are applying, so you can improve it.
EDIT:
Some tools for MySQL
I recommend you take a look at these programs, which will help you with this part of performance tuning.
Mysqlslap (it's like BENCHMARK, but you can customize the results more).
SysBench (tests CPU performance, I/O performance, mutex contention, memory speed, database performance).
MySQLTuner (with this you can analyze general statistics, storage engine statistics, performance metrics).
mk-query-profiler (performs analysis of an SQL statement).
mysqldumpslow (good for knowing which queries are causing problems).
MySQL is able to optimize LIMIT clauses (i.e. only scan / evaluate the rows in the range specified by LIMIT) if it is able to use only indexes to find rows matching the query.
For queries like SELECT * FROM users WHERE active = 1 ORDER BY created_at, adding an index on (active, created_at) is enough.
See http://www.mysqlperformanceblog.com/2006/09/01/order-by-limit-performance-optimization/
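A related pattern that often helps with large offsets, though it is not spelled out above, is to page through a covering index first and only then fetch the full rows (a sketch against the table above; the index name and the city filter are made up):
ALTER TABLE p_apartmentbuy
  ADD INDEX idx_city_date_id (l_city_id, dateadd, property_id);

SELECT p.*
FROM p_apartmentbuy AS p
JOIN (
    SELECT property_id
    FROM p_apartmentbuy
    WHERE l_city_id = 12
    ORDER BY dateadd DESC
    LIMIT 34000, 10      -- the inner query is resolved from the covering index alone
) AS page USING (property_id)
ORDER BY p.dateadd DESC;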

MySQL: Find and delete similar records - Updated with example

I'm trying to dedup a table, where I know there are 'close' (but not exact) rows that need to be removed.
I have a single table with 22 fields, and uniqueness can be established by comparing 5 of those fields. Of the remaining 17 fields (including the unique key), there are 3 fields that make each row unique, meaning a straight dedup will not work.
I was looking at the multi table delete method outlined here: http://blog.krisgielen.be/archives/111 but I can't make sense of the final line of code (AND M1.cd*100+M1.track > M2.cd*100+M2.track) as I am unsure what the cd*100 part achieves...
Can anyone assist me with this? I suspect I could do better by exporting the whole thing to Python, doing something with it, and re-importing it, but then (1) I'm stuck on how to dedup the strings anyway, and (2) I had to break the data into chunks to import it into MySQL at all, as it was timing out after 300 seconds, so it was a whole debacle to get it into MySQL in the first place... (I am very much a novice at both MySQL and Python.)
The table is a dump of some 40 log files from some testing. The test set for each log is some 20,000 files. The repeating values are either the test conditions, the file name/parameters or the results of the tests.
SHOW CREATE TABLE:
CREATE TABLE `t1` (
`DROID_V` int(1) DEFAULT NULL,
`Sig_V` varchar(7) DEFAULT NULL,
`SPEED` varchar(4) DEFAULT NULL,
`ID` varchar(7) DEFAULT NULL,
`PARENT_ID` varchar(10) DEFAULT NULL,
`URI` varchar(10) DEFAULT NULL,
`FILE_PATH` varchar(68) DEFAULT NULL,
`NAME` varchar(17) DEFAULT NULL,
`METHOD` varchar(10) DEFAULT NULL,
`STATUS` varchar(14) DEFAULT NULL,
`SIZE` int(10) DEFAULT NULL,
`TYPE` varchar(10) DEFAULT NULL,
`EXT` varchar(4) DEFAULT NULL,
`LAST_MODIFIED` varchar(10) DEFAULT NULL,
`EXTENSION_MISMATCH` varchar(32) DEFAULT NULL,
`MD5_HASH` varchar(10) DEFAULT NULL,
`FORMAT_COUNT` varchar(10) DEFAULT NULL,
`PUID` varchar(15) DEFAULT NULL,
`MIME_TYPE` varchar(24) DEFAULT NULL,
`FORMAT_NAME` varchar(10) DEFAULT NULL,
`FORMAT_VERSION` varchar(10) DEFAULT NULL,
`INDEX` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`INDEX`)
) ENGINE=MyISAM AUTO_INCREMENT=960831 DEFAULT CHARSET=utf8
The only unique field is the primary key, `INDEX`.
Unique records can be established by looking at DROID_V, Sig_V, SPEED, NAME and PUID.
Of the ~900,000 rows, I have about 10,000 dups that are either a single duplicate of a record or have up to 6 repetitions of the record.
Row examples: As Is
5;"v37";"slow";"10266";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"191977"
5;"v37";"slow";"10268";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"191978"
5;"v37";"slow";"10269";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"191979"
5;"v37";"slow";"10270";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"191980"
5;"v37";"slow";"12766";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"193977"
5;"v37";"slow";"12768";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"193978"
5;"v37";"slow";"12769";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"193979"
5;"v37";"slow";"12770";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"193980"
Row Example: As It should be
5;"v37";"slow";"10266";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/7";"image/tiff";"Tagged Ima";"3";"191977"
5;"v37";"slow";"10268";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/8";"image/tiff";"Tagged Ima";"4";"191978"
5;"v37";"slow";"10269";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/9";"image/tiff";"Tagged Ima";"5";"191979"
5;"v37";"slow";"10270";;"file:";"V1-FL425817.tif";"V1-FL425817.tif";"BINARY_SIG";"MultipleIdenti";"20603284";"FILE";"tif";"2008-11-03";;;;"fmt/10";"image/tiff";"Tagged Ima";"6";"191980"
Please note, you can see from the index column at the end that I have cut out some other rows; I have only identified a very small set of repeating rows. Please let me know if you need any more 'noise' from the rest of the DB.
Thanks.
I figured out a fix. I had been using COUNT(*), which just returned everything in the table; by using COUNT(DISTINCT NAME) I am able to weed out the dup rows that fit the dup criteria (as set out by the field selection in the WHERE clause).
Example:
SELECT `PUID`,`DROID_V`,`SIG_V`,`SPEED`, COUNT(distinct NAME) as Hit FROM sourcelist, main_small WHERE sourcelist.SourcePUID = 'MyVariableHere' AND main_small.NAME = sourcelist.SourceFileName
GROUP BY `PUID`,`DROID_V`,`SIG_V`,`SPEED` ORDER BY `DROID_V` ASC, `SIG_V` ASC, `SPEED`;
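For reference, the near-duplicates can also be removed directly in MySQL with a multi-table self-join delete that keeps the lowest `INDEX` for each combination of the five identifying fields (a sketch; test it on a copy first, and note the null-safe <=> comparisons because several of these columns allow NULL):
DELETE a
FROM t1 AS a
JOIN t1 AS b
  ON  a.DROID_V <=> b.DROID_V
  AND a.Sig_V   <=> b.Sig_V
  AND a.SPEED   <=> b.SPEED
  AND a.NAME    <=> b.NAME
  AND a.PUID    <=> b.PUID
  AND a.`INDEX` > b.`INDEX`;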