Optimizing a giant mysql table - mysql

I have a giant mysql table which is growing at all the time. It's recording chat data.
this what my table looks like
CREATE TABLE `log` (
`id` BIGINT(20) NOT NULL AUTO_INCREMENT,
`channel` VARCHAR(26) NOT NULL,
`timestamp` DATETIME NOT NULL,
`username` VARCHAR(25) NOT NULL,
`message` TEXT NOT NULL,
PRIMARY KEY (`id`),
INDEX `username` (`username`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
AUTO_INCREMENT=2582573
;
Indexing the username is kinda important because queries for a username can take like 5 seconds otherwise.
Is there anyway of optimizing this table even more to prepare it for huge amounts of data.
So that even 100m rows won't be a problem.

`id` BIGINT(20) NOT NULL AUTO_INCREMENT,
Will you have more than 4 billion rows? If not, use INT UNSIGNED, saving 4 bytes per row. Plus another 4 bytes for each row in the secondary index.
`channel` VARCHAR(26) NOT NULL,
`username` VARCHAR(25) NOT NULL,
Normalize each -- that is, replace this by, say, a SMALLINT UNSIGNED and have a mapping between them. Savings: lots.
INDEX `username` (`username`)
That becomes user_id, saving even more.
Smaller --> more cacheable --> faster.
What other queries will you have?
"Memory usage" -- For InnoDB, set innodb_buffer_pool_size to about 70% of available RAM. Then, let it worry about what is in memory, what it not. Once the table is too big to be cached, you should shrink the data (as I mentioned above) and provide 'good' indexes (as mentioned in other comments) and perhaps structure the table for "locality of reference" (without knowing all the queries, I can't address this).
You grumbled about using IDs instead of strings... Let's take a closer look at that. How many distinct usernames are there? channels? How does the data come in -- do you get one row at a time, or batches? Is something doing direct INSERTs or feeding to some code that does the INSERTs? Could there be a STORED PROCEDURE to do the normalization and insertion? If you need hundreds of rows inserted per second, then I can discuss how to do both, and do them efficiently.
You did not ask about PARTITIONs. I do not recommend it for a simple username query.
2.5M rows is about the 85th percentile. 100M rows is more exciting -- 98th percentile.

Related

Multi-Table MariaDB Database Partitioning Solution

Looking for some guidance on how to best tackle partitioning on some database tables for the purpose of archiving/deleting data over a certain age. The main reason for this is to resolve some issues in database size.
You can think of the data akin to telemetry data where is is growing over time, but once it enters the database it doesn't change outside of the first 10-15 minutes in the event there is any form of conflicting data that requires the application to update a recent record (max 15 mins).
Current database size is approaching 500GB and is sitting on NVMe storage across a 3x Node Galera cluster in three cities. Backups are becoming increasingly larger and if an SST is needed between nodes this can take a couple of hours to complete which is less than ideal.
The plan to deal with this is by way of archiving, where we plan to off-board historical data to another server (say once a year) with slower storage that can then be backed up once and won't change for 12 months. The historical data will be rarely accessed, and in the event it is our application will handle querying the archive server if older than a certain date instead of the production servers that are relied on heavily for "recent" data.
We have 3x tables per customer, and they reference each other in a sort of heirarchy. There are no foreign keys in the tables, but they do hold references to one another and are used in JOIN queries. Eg. summary table is the top of the hierarchy and holds one record per "event". Under this is the details table and there could be 1-10 detail records sitting under the summary event. Under details is the digits table that could include 0-10 records per detailed record.
CREATE TABLE data below;
CREATE TABLE `summary_X` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start_utc` datetime DEFAULT NULL,
`end_utc` datetime DEFAULT NULL,
`total_duration` smallint(6) DEFAULT NULL,
`legs` tinyint(4) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `start_utc` (`start_utc`)
) ENGINE=InnoDB
CREATE TABLE `details_X` (
`xid` bigint(20) NOT NULL AUTO_INCREMENT,
`id` int(11) NOT NULL,
`duration` smallint(6) DEFAULT NULL,
`start_utc` timestamp NULL DEFAULT NULL,
`end_utc` timestamp NULL DEFAULT NULL,
`event` varchar(2) DEFAULT NULL,
`event_time` smallint(6) DEFAULT NULL,
`event_a` varchar(7) DEFAULT NULL,
`event_b` varchar(7) DEFAULT NULL,
`ani` varchar(20) DEFAULT NULL,
`dnis` varchar(10) DEFAULT NULL,
`first_time` varchar(30) DEFAULT NULL,
`final_time` varchar(30) DEFAULT NULL,
`digits_count` int(2) DEFAULT 0,
`sys_a` varchar(3) DEFAULT NULL,
`sys_b` varchar(3) DEFAULT NULL,
`log_id_a` varchar(12) DEFAULT NULL,
`seq_a` varchar(1) DEFAULT NULL,
`log_id_b` varchar(12) DEFAULT NULL,
`seq_b` varchar(1) DEFAULT NULL,
`assoc_log_id_a` varchar(12) DEFAULT NULL,
`assoc_log_id_b` varchar(12) DEFAULT NULL,
PRIMARY KEY (`xid`),
KEY `start_utc` (`start_utc`),
KEY `end_utc` (`end_utc`),
KEY `event_a` (`event_a`),
KEY `event_b` (`event_b`),
KEY `id` (`id`),
KEY `final_digits` (`final_digits`),
KEY `log_id_a` (`log_id_a`),
KEY `log_id_b` (`log_id_b`)
) ENGINE=InnoDB
CREATE TABLE `digits_X` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`leg_id` bigint(20) DEFAULT NULL,
`sequence` int(2) NOT NULL,
`digits` varchar(30) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `digits` (`digits`),
KEY `leg_id` (`leg_id`)
) ENGINE=InnoDB
My first thought was to partition on Year, sounds easy enough but we don't have a date column on the digits table, so records here could be orphaned away from their mapped details record and no longer match in a JOIN on the archive server.
We then can also have a similar issue with summary and the timestamps on the "details" records could span multiple years. Eg. Summary event starts at 2021-12-31 23:55:00. First detail record is same timestamp, and then the next detail under the same event could be 2022-01-01 00:11:00. If 2021 partition was archived off to the other server, the 2022 detail would be orphaned and no longer JOIN to the 2021 summary event.
One alternative could be not to partition at all and do SELECT/INSERT/DELETE which isn't practical with the volume of data. Some tables have 30M-40M rows per year so this would be very resource taxing. There are also 400+ customers each with their own sets of tables.
Another I thought of was to add a column to the three tables as a "Year" column we can partition on but would include the Year of first event across all, so all related records can be on the same partitions/server, but this seems like a waste of space and there should be a better way.
Any thoughts or guidance would be appreciated.
To add PARTITIONing will require copying the entire table over. That will involve downtime and disk space. If you can live with that, then...
PARTITION BY RANGE(...) where the expression involves, say, TO_DAYS(...) or possibly TO_SECONDS(...). Then set up cron jobs to add a new partition periodically. (There is nothing automated for such.) And to detach the oldest partition. See Partition for a discussion of the details. (TO_DAYS avoids the need for a 'year' column.)
Note that Partitioning is implemented as several sub-tables under a table. With "transportable tablespaces", you can detach a partition from the big table, turning it into a table unto itself. At that point, you are free to move it to another server of something.
In a situation like yours, I might consider the following.
Write the raw data to a file (perhaps one per day) for archiving;
Insert into a table that will live only briefly; this will be purged by some means frequently;
Update "normalization" tables
"Summarize" the data into Summary Tables, where each set of rows covers one hour (or whatever makes sense).
Write "reports" from the summary table(s).
Be aware that each Partition takes an extra 5.5MB (average), so do not make many partitions. Or do you need only 2, each containing 15 minutes' data?
Meanwhile, I would look carefully at the schema. Can an INT (4 bytes) be turned into a SMALLINT (2 bytes). Can more things be Normalized.
digits_count int(2) -- that is a 4-byte INT; the (2) has no meaning and has been removed in MySQL 8. (MariaDB may follow suit someday.) It sounds like you need only a 1-byte TINYINT UNSIGNED (range: 0..255).
Since this is log info, be aware of Daylight Savings wrt DATETIME. (One hour per year is missing; another hour repeats.) This problem does not occur with TIMESTAMP. Each one takes 5 bytes (unless you include fractional seconds.)
(I can't advise on unnecessary indexes without seeing the queries.) SHOW TABLE STATUS will tell you how much space is being consumed by all the indexes.
Are the 3 tables of similar size?
Re "orphaning" -- You need at least 2 partitions -- one being filled (0-100% full) and an older partition (100% full)
"30M-40M rows per year" times 400 customers. Does that add up to 500 rows inserted per second? Are they INSERTed one row at a time? High speed ingestion
Are there more deletes and selects than inserts? And/or do they involve more than single rows? (I'm fishing for more info go help with some other issues you either have or are threatening to have.) Even with Deletes and no Partitioning, the disk growth will slow down as free space is generated, then reused. ("Rince and repeat.")
Without partitioning, see Huge Deletes . But... DELETEing data from a table does not shrink it disk footprint. However if each 'customer' has 1/400th of the data; and (of course) you do each customer separately, then there may not be any disk problem
I've given you a lot to think about. Answer some of my questions; I may have more advice.

Mysql partitioning effect on DDL and DML

I am using Mysql 5.6 with ~150 million records in Transaction table (InnodB). As the size is increasing this table is becoming unmanageable (adding column or index) and slow even with required indexing. After searching through internet I found it is appropriate time to partition the table. I am confidant that partitioning will solve following purpose for me
Improve DML statements response time (using partitioning pruning)
Improve archival process
But I am not sure wether (and how) it will improve DDL performance for this table or not. More specifically following DDL's performance.
ALTER TABLE ADD/DROP COLUMN
ALTER TABLE ADD/DROP INDEX
I went through Mysql documentation and internet but unable to find my answer. Can anyone please help me in this or provide any relevant documentation for this.
My table structure is as following
CREATE TABLE `TRANSACTION` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`parent_id` int(11) DEFAULT NULL,
`parent_uuid` char(36) DEFAULT NULL,
`order_number` varchar(64) DEFAULT NULL,
`order_id` int(11) DEFAULT NULL,
`order_uuid` char(36) DEFAULT NULL,
`order_type` char(1) DEFAULT NULL,
`business_id` int(11) DEFAULT NULL,
`store_id` int(11) DEFAULT NULL,
`store_device_id` int(11) DEFAULT NULL,
`source` char(1) DEFAULT NULL COMMENT 'instore, online, order_ahead, etc',
`created_at` timestamp NULL DEFAULT NULL,
`updated_at` timestamp NULL DEFAULT NULL,
`flags` int(11) DEFAULT NULL,
`customer_lang` char(2) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `parent_id` (`parent_id`),
KEY `business_id` (`business_id`,`store_id`,`store_device_id`),
KEY `parent_uuid` (`parent_uuid`),
KEY `order_uuid` (`order_uuid`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
And I am partitioning using following statement.
ALTER TABLE TRANSACTION PARTITION BY RANGE (id)
(PARTITION p0 VALUES LESS THAN (5000000) ENGINE = InnoDB,
PARTITION p1 VALUES LESS THAN (10000000) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN MAXVALUE ENGINE = InnoDB)
Thanks!
Partitioning is not a performance panacea. Even the items you mentioned will not speed up; they may even slow down.
Instead, I will critique the table to look for ways to speed up some things.
UUIDs are terrible for performance once the index on it becomes too big to be cached. This is because of its randomness. Possible solutions: compact it into BINARY(16); shrink the table other ways; avoid UUIDs.
Why have both parent_id and parent_uuid??
Shrink the 4-byte INTs to smaller datatypes where practical.
Usually CHAR should be CHARACTER SET ascii (1-byte/character), not utf8mb4 (4 bytes/char).
Caution: 150M is getting remotely close to the 2-billion limit of INT SIGNED. Consider 4B limit of INT UNSIGNED. (Each is 4 bytes.)
Do you ever use created_at or updated_at?
MySQL 8.0.13 has a very fast ADD COLUMN and DROP COLUMN (for limited situations).
5.7.?? has a less-invasive ADD INDEX than previous versions, but I am not sure it applies to partitioned tables.
5.7.4: Online DDL support reduces table rebuild time and permits concurrent DML, which helps reduce user application downtime. For additional information, see Overview of Online DDL.
More importantly, let's see the main queries that are "too slow". There may be composite indexes and/or reformulations of the queries that will speed them up.
There is even a slim chance that partitioning will help but not on the PRIMARY KEY.
I think there are only 4 use cases where partitioning helps performance.

Optimize Query on mysql

I have a query that runs really slow (15 20 seconds) when is not on memory and quite fast when is on memory (2s - 0.6s)
select count(distinct(concat(conexiones.tMacAdres,date_format(conexiones.fFecha,'%Y%m%d')))) as Conexiones,
sum(if(conexiones.tEvento='megusta',1,0)) as MeGusta,sum(if(conexiones.tEvento='megusta',conexiones.nAmigos,0)) as ImpactosMeGusta,
sum(if(conexiones.tEvento='checkin',1,0)) as CheckIn,sum(if(conexiones.tEvento='checkin',conexiones.nAmigos,0)) as ImpactosCheckIn,
min(conexiones.fFecha) Fecha_Inicio, now() Fecha_fin,datediff(now(),min(conexiones.fFecha)) as dias
from conexiones, instalaciones
where conexiones.idInstalacion=instalaciones.idInstalacion and conexiones.idInstalacion=190
and (fFecha between '2014-01-01 00:00:00' and '2016-06-18 23:59:59')
group by instalaciones.tNombre
order by instalaciones.idCliente
This is Table SCHEMAS:
Instalaciones with 1332 rows:
CREATE TABLE `instalaciones` (
`idInstalacion` int(10) unsigned NOT NULL AUTO_INCREMENT,
`idCliente` int(10) unsigned DEFAULT NULL,
`tRouterSerial` varchar(50) DEFAULT NULL,
`tFacebookPage` varchar(256) DEFAULT NULL,
`tidFacebook` varchar(64) DEFAULT NULL,
`tNombre` varchar(128) DEFAULT NULL,
`tMensaje` varchar(128) DEFAULT NULL,
`tWebPage` varchar(128) DEFAULT NULL,
`tDireccion` varchar(128) DEFAULT NULL,
`tPoblacion` varchar(128) DEFAULT NULL,
`tProvincia` varchar(64) DEFAULT NULL,
`tCodigoPosta` varchar(8) DEFAULT NULL,
`tLatitud` decimal(15,12) DEFAULT NULL,
`tLongitud` decimal(15,12) DEFAULT NULL,
`tSSID1` varchar(40) DEFAULT NULL,
`tSSID2` varchar(40) DEFAULT NULL,
`tSSID2_Pass` varchar(40) DEFAULT NULL,
`fSincro` datetime DEFAULT NULL,
`tEstado` varchar(10) DEFAULT NULL,
`tHotspot` varchar(10) DEFAULT NULL,
`fAlta` datetime DEFAULT NULL,
PRIMARY KEY (`idInstalacion`),
UNIQUE KEY `tRouterSerial` (`tRouterSerial`),
KEY `idInstalacion` (`idInstalacion`)
) ENGINE=InnoDB AUTO_INCREMENT=1332 DEFAULT CHARSET=utf8;
Conexiones with 2370365 rows
CREATE TABLE `conexiones` (
`idConexion` int(10) unsigned NOT NULL AUTO_INCREMENT,
`idInstalacion` int(10) unsigned DEFAULT NULL,
`idUsuario` int(11) DEFAULT NULL,
`tMacAdres` varchar(64) DEFAULT NULL,
`tUsuario` varchar(128) DEFAULT NULL,
`tNombre` varchar(64) DEFAULT NULL,
`tApellido` varchar(64) DEFAULT NULL,
`tEmail` varchar(64) DEFAULT NULL,
`tSexo` varchar(20) DEFAULT NULL,
`fNacimiento` date DEFAULT NULL,
`nAmigos` int(11) DEFAULT NULL,
`tPoblacion` varchar(64) DEFAULT NULL,
`fFecha` datetime DEFAULT NULL,
`tEvento` varchar(20) DEFAULT NULL,
PRIMARY KEY (`idConexion`),
KEY `idInstalacion` (`idInstalacion`),
KEY `tMacAdress` (`tMacAdres`) USING BTREE,
KEY `fFecha` (`fFecha`),
KEY `idUsuario` (`idUsuario`),
KEY `insta_fecha` (`idInstalacion`,`fFecha`)
) ENGINE=InnoDB AUTO_INCREMENT=2370365 DEFAULT CHARSET=utf8;
This is EXPLAIN
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE instalaciones const PRIMARY,idInstalacion PRIMARY 4 const 1
1 SIMPLE conexiones ref idInstalacion,fFecha,insta_fecha idInstalacion 5 const 110234 "Using where"
Thanks !
(Edited)
SHOW TABLE STATUS LIKE 'conexiones'
Name Engine Version Row_format Rows Avg_row_length Data_length Max_data_length Index_length Data_free Auto_increment Create_time Update_time Check_time Collation Checksum Create_options Comment
conexiones InnoDB 10 Compact 2305296 151 350060544 0 331661312 75497472 2433305 28/06/2016 22:26 NULL NULL utf8_general_ci NULL
Here's why it is so slow. And I will end with a possible speedup.
First, please do
SELECT COUNT(*) FROM conexiones
WHERE idInstalacion=190
and fFecha >= '2014-01-01'
and fFecha < '2016-06-19
in order to see how many rows we are dealing with. The EXPLAIN suggests 110234, but that is only a crude estimate.
Assuming there are 110K rows of conexiones involved in the query, and assuming the rows were (approximately) inserted in chronological order by fFecha, then...
There are a lot of rows to work with, and
They are scattered around the table on disk, hence
The query takes a lot of I/O, unless it is cached.
Let's further check on my last claim... How much RAM do you have? What is the value of innodb_buffer_pool_size? It should be about 70% of available RAM. Use a lower percentage if you have less than 4GB of RAM.
Assuming that conexiones is too big to be 'cached' in the 'buffer_pool', we need to find a way to decrease the I/O.
There are 1332 different values for idInstalacion. Perhaps you insert 1332 rows every few minutes/hours into conexiones? Since the PRIMARY KEY merely an AUTO_INCREMENT, those rows will be 'appended' to the end of the table.
Now let's look at where the idInstalacion=190 rows are. A new one of them occurs every 1332 (or so) rows. That means they are spread out. It means that (probably) no two rows are in the same block (16KB in InnoDB). That means that the 110234 will be in 110234 different blocks. That's about 2GB. If the buffer_pool is smaller than that, then there will be I/O. Even if it is bigger than that, that's a lot of data to touch.
But what to do about it? If we could arrange the =190 rows to be consecutive in the table, then the 2GB might drop to, say, 20MB -- a much more manageable and cacheable size. But how can that be done? By changing the PRIMARY KEY.
PRIMARY KEY(idInstalacion, fFecha, idConexion),
INDEX(idConexion)
and DROP any other indexes starting with idInstalacion or idConexion. To explain:
Since the PK is "clustered" with the data, all idInstalacion=190 rows over any consecutive fFetcha range will be consecutive in the data. So, fetching one block will get about 100 rows -- much less I/O.
A PK must be unique. Assuming (idInstalacion, fFecha) is not unique, I tacked on idConexion to make it unique.
I added INDEX(idConexion) to make AUTO_INCREMENT happy.
Potential drawback... Since this change rearranges the order of the data, other queries, including the INSERTs may be slowed down. The INSERTs will be scattered, but not really slowed down. 1332 "hots spots" would be accepting the new rows; that many blocks can easily be cached.
Arithmetic... If you have spinning drives, I would expect the existing structure to take about 1102 seconds (perhaps under 110 seconds for SSD) for 110234 rows. Since it is taking under 20 seconds, I suspect there is some caching (or you have SSDs) or the 110234 is grossly overestimated. My suggested change should decrease the "worst" time significantly, and slightly improve the "in memory" time. This "slight improvement" comes from being able to use the PK instead of a secondary key.
Caveat: Since 110234 * 1332 is nowhere near 2370365, much of my numerical analysis is probably nowhere near correct. For example, 2370365 rows with that schema is possible less than 1GB. Please provide SHOW TABLE STATUS LIKE 'conexiones'.
Addenda
"server has 2GB Ram and innodb_buffer_pool_size is 5368709120" -- Either that is a typo or it is terrible. Since the buffer_pool needs to reside in RAM, do not set the buffer_pool to 5GB. 500MB might be OK for your tiny 2GB of RAM.
The SHOW TABLE STATUS confirms that it (data + indexes) won't quite fit in 500M, so you may periodically experience I/O bound queries with 500M.
Increasing your RAM and buffer_pool would temporarily (until the data gets bigger) help performance.
Before putting this into production, test the ALTER and time the various queries you use:
ALTER TABLE conexiones
DROP PRIMARY KEY,
DROP INDEX insta_fecha,
DROP INDEX idInstalacion,
PRIMARY KEY(idInstalacion, fFecha, idConexion),
INDEX(idConexion)
Caution: The ALTER will need about 1GB of free disk space.
When timing, run with the Query Cache off, and run twice -- the first may involve I/O; the second is the 'in memory' as you mentioned.
Revised analysis: Since the bigger table has 300MB of data and some amount of indexes in use, and assuming 500MB buffer pool, I suspect that blocks are bumped out of the buffer pool some of the time. This fits well with your initial comment on the query's speed. My suggested index changes should help avoid the speed variance, but may hurt the performance of other queries.
Try to use a multi column index:
CREATE idx_nn_1 ON conexiones(idInstalacion,fFecha);
You might need to have it the other way around depending on the data, so test both. This avoids reading all the records for between condition on fFecha matching the idInstalacion condition, and should improve performance.
Try the following:
Either delete the idInstalacion INDEX or tell the engine to use the correct key in the from clause:
from conexiones use index (insta_fecha), instalaciones
And you don't need to JOIN, GROUP or ORDER. You are joining on a constant value (190) with one row. And you don't use any column from instalaciones.
So all you need is this:
select count(distinct(concat(conexiones.tMacAdres,date_format(conexiones.fFecha,'%Y%m%d')))) as Conexiones,
sum(if(conexiones.tEvento='megusta',1,0)) as MeGusta,sum(if(conexiones.tEvento='megusta',conexiones.nAmigos,0)) as ImpactosMeGusta,
sum(if(conexiones.tEvento='checkin',1,0)) as CheckIn,sum(if(conexiones.tEvento='checkin',conexiones.nAmigos,0)) as ImpactosCheckIn,
min(conexiones.fFecha) Fecha_Inicio, now() Fecha_fin,datediff(now(),min(conexiones.fFecha)) as dias
from conexiones -- use index (insta_fecha)
where conexiones.idInstalacion=190
and (fFecha between '2014-01-01 00:00:00' and '2016-06-18 23:59:59')
However - it doesn't mean it will be faster. MySQL will probably optimize all that stuff away.

MySql - Handle table size and performance

We are having a Analytics product. For each of our customer we give one JavaScript code, they put that in their web sites. If a user visit our customer site the java script code hit our server so that we store this page visit on behalf of this customer. Each customer contains unique domain name.
we are storing this page visits in MySql table.
Following is the table schema.
CREATE TABLE `page_visits` (
`domain` varchar(50) DEFAULT NULL,
`guid` varchar(100) DEFAULT NULL,
`sid` varchar(100) DEFAULT NULL,
`url` varchar(2500) DEFAULT NULL,
`ip` varchar(20) DEFAULT NULL,
`is_new` varchar(20) DEFAULT NULL,
`ref` varchar(2500) DEFAULT NULL,
`user_agent` varchar(255) DEFAULT NULL,
`stats_time` datetime DEFAULT NULL,
`country` varchar(50) DEFAULT NULL,
`region` varchar(50) DEFAULT NULL,
`city` varchar(50) DEFAULT NULL,
`city_lat_long` varchar(50) DEFAULT NULL,
`email` varchar(100) DEFAULT NULL,
KEY `sid_index` (`sid`) USING BTREE,
KEY `domain_index` (`domain`),
KEY `email_index` (`email`),
KEY `stats_time_index` (`stats_time`),
KEY `domain_statstime` (`domain`,`stats_time`),
KEY `domain_email` (`domain`,`email`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |
We don't have primary key for this table.
MySql server details
It is Google cloud MySql (version is 5.6) and storage capacity is 10TB.
As of now we are having 350 million rows in our table and table size is 300 GB. We are storing all of our customer details in the same table even though there is no relation between one customer to another.
Problem 1: For few of our customers having huge number of rows in table, so performance of queries against these customers are very slow.
Example Query 1:
SELECT count(DISTINCT sid) AS count,count(sid) AS total FROM page_views WHERE domain = 'aaa' AND stats_time BETWEEN CONVERT_TZ('2015-02-05 00:00:00','+05:30','+00:00') AND CONVERT_TZ('2016-01-01 23:59:59','+05:30','+00:00');
+---------+---------+
| count | total |
+---------+---------+
| 1056546 | 2713729 |
+---------+---------+
1 row in set (13 min 19.71 sec)
I will update more queries here. We need results in below 5-10 seconds, will it be possible?
Problem 2: The table size is rapidly increasing, we might hit table size 5 TB by this year end so we want to shard our table. We want to keep all records related to one customer in one machine. What are the best practises for this sharding.
We are thinking following approaches for above issues, please suggest us best practices to overcome these issues.
Create separate table for each customer
1) What are the advantages and disadvantages if we create separate table for each customer. As of now we are having 30k customers we might hit 100k by this year end that means 100k tables in DB. We access all tables simultaneously for Read and Write.
2) We will go with same table and will create partitions based on date range
UPDATE : Is a "customer" determined by the domain? Answer is Yes
Thanks
First, a critique if the excessively large datatypes:
`domain` varchar(50) DEFAULT NULL, -- normalize to MEDIUMINT UNSIGNED (3 bytes)
`guid` varchar(100) DEFAULT NULL, -- what is this for?
`sid` varchar(100) DEFAULT NULL, -- varchar?
`url` varchar(2500) DEFAULT NULL,
`ip` varchar(20) DEFAULT NULL, -- too big for IPv4, too small for IPv6; see below
`is_new` varchar(20) DEFAULT NULL, -- flag? Consider `TINYINT` or `ENUM`
`ref` varchar(2500) DEFAULT NULL,
`user_agent` varchar(255) DEFAULT NULL, -- normalize! (add new rows as new agents are created)
`stats_time` datetime DEFAULT NULL,
`country` varchar(50) DEFAULT NULL, -- use standard 2-letter code (see below)
`region` varchar(50) DEFAULT NULL, -- see below
`city` varchar(50) DEFAULT NULL, -- see below
`city_lat_long` varchar(50) DEFAULT NULL, -- unusable in current format; toss?
`email` varchar(100) DEFAULT NULL,
For IP addresses, use inet6_aton(), then store in BINARY(16).
For country, use CHAR(2) CHARACTER SET ascii -- only 2 bytes.
country + region + city + (maybe) latlng -- normalize this to a "location".
All these changes may cut the disk footprint in half. Smaller --> more cacheable --> less I/O --> faster.
Other issues...
To greatly speed up your sid counter, change
KEY `domain_statstime` (`domain`,`stats_time`),
to
KEY dss (domain_id,`stats_time`, sid),
That will be a "covering index", hence won't have to bounce between the index and the data 2713729 times -- the bouncing is what cost 13 minutes. (domain_id is discussed below.)
This is redundant with the above index, DROP it:
KEY domain_index (domain)
Is a "customer" determined by the domain?
Every InnoDB table must have a PRIMARY KEY. There are 3 ways to get a PK; you picked the 'worst' one -- a hidden 6-byte integer fabricated by the engine. I assume there is no 'natural' PK available from some combination of columns? Then, an explicit BIGINT UNSIGNED is called for. (Yes that would be 8 bytes, but various forms of maintenance need an explicit PK.)
If most queries include WHERE domain = '...', then I recommend the following. (And this will greatly improve all such queries.)
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
domain_id MEDIUMINT UNSIGNED NOT NULL, -- normalized to `Domains`
PRIMARY KEY(domain_id, id), -- clustering on customer gives you the speedup
INDEX(id) -- this keeps AUTO_INCREMENT happy
Recommend you look into pt-online-schema-change for making all these changes. However, I don't know if it can work without an explicit PRIMARY KEY.
"Separate table for each customer"? No. This is a common question; the resounding answer is No. I won't repeat all the reasons for not having 100K tables.
Sharding
"Sharding" is splitting the data across multiple machines.
To do sharding, you need to have code somewhere that looks at domain and decides which server will handle the query, then hands it off. Sharding is advisable when you have write scaling problems. You did not mention such, so it is unclear whether sharding is advisable.
When sharding on something like domain (or domain_id), you could use (1) a hash to pick the server, (2) a dictionary lookup (of 100K rows), or (3) a hybrid.
I like the hybrid -- hash to, say, 1024 values, then look up into a 1024-row table to see which machine has the data. Since adding a new shard and migrating a user to a different shard are major undertakings, I feel that the hybrid is a reasonable compromise. The lookup table needs to be distributed to all clients that redirect actions to shards.
If your 'writing' is running out of steam, see high speed ingestion for possible ways to speed that up.
PARTITIONing
PARTITIONing is splitting the data across multiple "sub-tables".
There are only a limited number of use cases where partitioning buys you any performance. You not indicated that any apply to your use case. Read that blog and see if you think that partitioning might be useful.
You mentioned "partition by date range". Will most of the queries include a date range? If so, such partitioning may be advisable. (See the link above for best practices.) Some other options come to mind:
Plan A: PRIMARY KEY(domain_id, stats_time, id) But that is bulky and requires even more overhead on each secondary index. (Each secondary index silently includes all the columns of the PK.)
Plan B: Have stats_time include microseconds, then tweak the values to avoid having dups. Then use stats_time instead of id. But this requires some added complexity, especially if there are multiple clients inserting data. (I can elaborate if needed.)
Plan C: Have a table that maps stats_time values to ids. Look up the id range before doing the real query, then use both WHERE id BETWEEN ... AND stats_time .... (Again, messy code.)
Summary tables
Are many of the queries of the form of counting things over date ranges? Suggest having Summary Tables based perhaps on per-hour. More discussion.
COUNT(DISTINCT sid) is especially difficult to fold into summary tables. For example, the unique counts for each hour cannot be added together to get the unique count for the day. But I have a technique for that, too.
I wouldn't do this if i were you. First thing that come to mind would be, on receive a pageview message, i send the message to a queue so that a worker can pickup and insert to database later (in bulk maybe); also i increase the counter of siteid:date in redis (for example). Doing count in sql is just a bad idea for this scenario.

Best way to speed up a query in a innodb table with 100.000.000 rows in Mysql 5.6

I have a Mysql 5.6 table with 70 million rows in it, but it will grow to 100+ million rows or more in a few weeks.
I have a dedicated machine with a humble 500GB disk and 4GB RAM and the innodb_buffer_pool_size is set to 2GB.
The database uses 99% to selects and 1% to inserts (once a month).
The most important column is descripcion_detallada_producto varchar(300) and it is where the selects are aimed at in 90% of the times.
My table is:
CREATE TABLE `t1` (
`N_orden` bigint(20) NOT NULL DEFAULT '0',
`Fecha` varchar(15) COLLATE latin1_spanish_ci DEFAULT NULL,
`Ncm` int(11) NOT NULL,
`Origen` int(11) NOT NULL,
`Adquisicion` int(11) NOT NULL,
`Medida_Estadistica` int(11) NOT NULL,
`Unidad_Comercializacion` varchar(30) COLLATE latin1_spanish_ci DEFAULT NULL,
`Descripcion_Detallada_Producto` varchar(300) COLLATE latin1_spanish_ci DEFAULT NULL,
`Cantidad_Estadistica` double DEFAULT NULL,
`Peso_Liquido_Kg` double DEFAULT NULL,
`Valor_Fob` double DEFAULT NULL,
`Valor_Frete` double DEFAULT NULL,
`Valor_Seguro` double DEFAULT NULL,
`Valor_Unidad` double DEFAULT NULL,
`Cantidad` double DEFAULT NULL,
`Valor_Total` double DEFAULT NULL,
PRIMARY KEY (`N_orden`),
KEY `Ncm` (`Ncm`),
KEY `Origen` (`Origen`),
KEY `Adquisicion` (`Adquisicion`),
KEY `Medida_Estadistica` (`Medida_Estadistica`),
KEY `Descripcion_Detallada_Producto` (`Descripcion_Detallada_Producto`),
CONSTRAINT `t1_ibfk_1` FOREIGN KEY (`Ncm`) REFERENCES `ncm` (`Ncm`),
CONSTRAINT `t1_ibfk_2` FOREIGN KEY (`Origen`) REFERENCES `paises` (`Codigo_Pais`),
CONSTRAINT `t1_ibfk_3` FOREIGN KEY (`Adquisicion`) REFERENCES `paises` (`Codigo_Pais`),
CONSTRAINT `t1_ibfk_4` FOREIGN KEY (`Medida_Estadistica`) REFERENCES `medida_estadistica` (`Codigo_Medida_Estadistica`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_spanish_ci;
My question: Today a SELECT query using LIKE '%whatever%' takes normally 5 to 7 minutes, sometimes more. From where I understand the varchar index just are used when 'whatever%' is used, but I NEED to have the possibility to search for strings using left and right wildcards without needing to wait ~7 minutes each search. How can I do it?
The right way to fix the problem is to look at all the queries being run against the table, and their relative frequency. You've only given us part of one. You didn't even say which field it relates to. Since you do say "The most important column is descripcion_detallada_producto varchar(300) and it is where the selects are aimed at in 90% of the times" I'll assume that you only need to optimize
WHERE descripcion_detallada_producto LIKE '%wathever%'
As Vatev has already said, you probably should be using fulltext searches - which are sematically (and syntactically) different from LIKE predicates. Further you should be splitting the descripcion_detallada_producto attribute into it's own relation to reduce the buffer flushing effects of reading huge rows into memory from disk.
If you are searching for entire words that may be anywhere in a text column, you should consider using fulltext indexes, which are obviously used differently than wildcard searches. If you're unsure how to search your fulltext indexes, you can always get help with that.
Doing a search like the following will not use any of your indexes. Instead, it will scan through all rows of your table data, and you're subjected to disk reads (and any correlated disk fragmentation, which isn't usually a problem because we don't usually scan through tables):
SELECT * FROM t1
WHERE Descripcion_Detallada_Producto LIKE `%whatever%'
The following query would just scan through your index on Descripcion_Detallada_Producto which would act as a "covering" index (notice that the columns in the select make the difference):
SELECT N_orden FROM t1
WHERE Descripcion_Detallada_Producto LIKE `%whatever%'
The advantage in scanning an index instead of the actual table data is that the amount of data that is read as it scans is minimized, and ideally with a large innodb_buffer_pool_size, that index would be in memory, which would avoid disk seeks.
Once you get the N_orden values, then you could retrieve the individual records from the table data.
Additional Info
Consider reducing the size of the columns (bigint to unsigned int for N_orden) and reduce size of Descripcion_Detallada_Producto. Even though VARCHAR only uses up actual bytes (plus length) in the table data, each index entry actually uses the max, so reducing even a VARCHAR column size in an index will improve index scan speed.
In addition, if you have categories, restrict searches to selected categories and create a multi-column index on category+description. The following will only have to scan through a portion of a multi-column index on both category and description by restricting the search to a particular category:
SELECT N_orden FROM t1
WHERE Category = 1
AND Descripcion_Detallada_Producto LIKE `%whatever%'
Finally, consider removing wildcard prefixes. Make the user at least type the beginning of the model number.