I have a table that consists of holiday deals, so just to give you an idea, each row will contain the following bits of data:
Departure airport
Arrival airport
Start date
Duration
Hotel destination
Resort
Hotel name
Hotel rating
A few tiny integer columns for 1s and 0s.
Price
Date time the row was updated
Now, all of these deals get packaged up from 3 tables (flights, accommodation and transfers). The packaging finds the cheapest deal per variation, such as per departure airport, duration, board basis, etc.
The table I am importing into will contain around 50 million rows, and the import is extremely slow.
I removed the indexes, which made a massive difference, but re-adding the indexes after all the data is in takes forever to complete.
Is there a way of bulk loading the data quickly, or a quicker way of adding the indexes back to the table after the data has been loaded?
Create Table
```
CREATE TABLE `iv_deals` (
`aid` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT COMMENT 'Deal Autonumber PK',
`startdate` DATE NULL DEFAULT NULL COMMENT 'Holiday Start Date',
`startdatet` TINYINT(2) NOT NULL DEFAULT '0',
`depairport` CHAR(3) NULL DEFAULT NULL COMMENT 'Departure Airport IATA Code',
`arrairport` CHAR(3) NULL DEFAULT NULL COMMENT 'Arrival Airport IATA Code',
`destination` VARCHAR(30) NULL DEFAULT NULL COMMENT 'Holiday Destination',
`resort` VARCHAR(30) NULL DEFAULT NULL COMMENT 'Holiday Resort',
`hotel` VARCHAR(50) NULL DEFAULT NULL COMMENT 'Holiday Property Name',
`iv_PropertyID` INT(11) UNSIGNED NOT NULL DEFAULT '0' COMMENT 'Holiday Property ID',
`rating` VARCHAR(2) NULL DEFAULT NULL COMMENT 'Holiday Property Star Rating',
`board` VARCHAR(10) NULL DEFAULT NULL COMMENT 'Holiday Meal Option',
`duration` TINYINT(2) UNSIGNED NULL DEFAULT '0' COMMENT 'Holiday Duration',
`2for1` TINYINT(1) UNSIGNED NULL DEFAULT '0' COMMENT 'Is 2nd Week FREE Offer, 0 = False, 1 = True',
`3for2` TINYINT(1) UNSIGNED NULL DEFAULT '0' COMMENT 'Is 3rd Week FREE Offer, 0 = False, 1 = True',
`3and4` TINYINT(1) UNSIGNED NULL DEFAULT '0' COMMENT 'Is 3rd and 4th Week FREE Offer, 0 = False, 1 = True',
`4for3` TINYINT(1) UNSIGNED NULL DEFAULT '0' COMMENT 'Is 4th Week FREE Offer, 0 = False, 1 = True',
`freebb` VARCHAR(2) NULL DEFAULT NULL COMMENT 'Free Week Meal Option',
`adults` TINYINT(1) UNSIGNED NULL DEFAULT '0' COMMENT 'Number of Adults',
`children` TINYINT(1) UNSIGNED NULL DEFAULT '0' COMMENT 'Number of Children',
`infants` TINYINT(1) UNSIGNED NULL DEFAULT '0' COMMENT 'Number of Infants',
`price` SMALLINT(4) UNSIGNED NULL DEFAULT '9999' COMMENT 'Price',
`carrier` VARCHAR(40) NULL DEFAULT NULL COMMENT 'Flight Carrier IATA Code',
`DateUpdated` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`aid`, `startdatet`),
UNIQUE INDEX `Unique` (`startdate`, `depairport`, `arrairport`, `iv_PropertyID`, `board`, `duration`, `adults`, `children`, `startdatet`),
INDEX `ik_Price` (`price`),
INDEX `ik_Destination` (`destination`),
INDEX `ik_Resort` (`resort`),
INDEX `ik_DepAirport` (`depairport`),
INDEX `ik_Startdate` (`startdate`),
INDEX `ik_Board` (`board`),
INDEX `ik_FILTER_ALL` (`price`, `depairport`, `destination`, `resort`, `board`, `startdate`),
INDEX `iv_PropertyID` (`iv_PropertyID`),
INDEX `ik_Duration` (`duration`),
INDEX `rating` (`rating`),
INDEX `adults` (`adults`),
INDEX `DirectFromPrice` (`iv_PropertyID`, `depairport`, `arrairport`, `board`, `duration`, `adults`, `children`, `startdate`),
INDEX `DirectFromPrice_wo_depairport` (`iv_PropertyID`, `arrairport`, `board`, `duration`, `adults`, `children`),
INDEX `DirectFromPrice_w_pid_dep` (`iv_PropertyID`, `depairport`, `adults`, `children`, `price`),
INDEX `DirectFromPrice_w_pid_night` (`iv_PropertyID`, `duration`, `adults`, `children`),
INDEX `DirectFromPrice_Dur_Board` (`iv_PropertyID`, `duration`, `board`, `adults`, `children`),
INDEX `join_index` (`destination`, `startdate`, `duration`)
)
COLLATE='utf8_general_ci'
AUTO_INCREMENT=1258378560
/*!50100 PARTITION BY LIST (startdatet)
(PARTITION part0 VALUES IN (1) ENGINE = InnoDB,
PARTITION part1 VALUES IN (2) ENGINE = InnoDB,
PARTITION part2 VALUES IN (3) ENGINE = InnoDB,
PARTITION part3 VALUES IN (4) ENGINE = InnoDB,
PARTITION part4 VALUES IN (5) ENGINE = InnoDB,
PARTITION part5 VALUES IN (6) ENGINE = InnoDB,
PARTITION part6 VALUES IN (7) ENGINE = InnoDB,
PARTITION part7 VALUES IN (8) ENGINE = InnoDB,
PARTITION part8 VALUES IN (9) ENGINE = InnoDB,
PARTITION part9 VALUES IN (10) ENGINE = InnoDB,
PARTITION part10 VALUES IN (11) ENGINE = InnoDB,
PARTITION part11 VALUES IN (12) ENGINE = InnoDB,
PARTITION part12 VALUES IN (0) ENGINE = InnoDB) */;
```
If there are 50M rows but AUTO_INCREMENT=1258378560, let me point out another problem that is looming. (It may be related to the slow load.)
`aid` INT(11) UNSIGNED NOT NULL AUTO_INCREMENT
allows only 4 billion; you are already at 1.2 billion. Do a little math to estimate when you will run out of ids. The brute force solution is to change to BIGINT, but let's analyze why the ids are being 'burned'. There are several ways that INSERT/REPLACE/etc can throw away ids. Please describe how the import is working. REPLACE is perhaps the worst -- it burns ids and is effectively DELETE + INSERT. Other techniques are faster.
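For illustration, here is a sketch of IODKU (INSERT ... ON DUPLICATE KEY UPDATE) against the iv_deals schema above; the column values are made up. IODKU still burns ids on the update path, but it avoids the DELETE+INSERT behavior of REPLACE:

```sql
-- If the UNIQUE key (startdate, depairport, ..., startdatet) already
-- matches a row, only the listed column is updated; no DELETE happens.
INSERT INTO iv_deals
       (startdate, startdatet, depairport, arrairport, iv_PropertyID,
        board, duration, adults, children, price)
VALUES ('2016-06-01', 6, 'LGW', 'PMI', 12345, 'HB', 7, 2, 0, 499)
ON DUPLICATE KEY UPDATE
       price = VALUES(price);   -- DateUpdated refreshes via its ON UPDATE clause
```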
(I will now ramble in many directions...)
The partitioning by month (which I assume you are doing with (startdatet)) probably does not add any performance. What has your experience been? (I usually argue against using PARTITION except for the few use cases where there are benefits. I see no benefit in your case.)
19 indexes means 19 BTrees that must be updated. The 2 unique ones must be checked before the INSERT is finished; the other 17 can be delayed, but not forever. (The details are discussed under "change buffer".)
How much RAM? What is the setting of innodb_buffer_pool_size? It should be about 70% of RAM. The Change buffer is a portion of that.
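Both values can be checked from SQL (these system variables exist in MySQL 5.6+ / MariaDB 10+):

```sql
-- Buffer pool size in GB, and the change buffer's share of it (percent):
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb,
       @@innodb_change_buffer_max_size AS change_buffer_pct;
```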
I see at least 4 indexes that can be dropped, since other indexes handle their need. In general, if you have INDEX(a, b), you don't also need INDEX(a). (Shrinking from 19 indexes to 15 will help some.)
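To be concrete, here is my guess at the 4, based on the DDL above (each is a leading prefix of some longer index). Combining the drops into one ALTER does the work in a single pass:

```sql
ALTER TABLE iv_deals
    DROP INDEX ik_Price,        -- prefix of ik_FILTER_ALL (price, ...)
    DROP INDEX ik_Startdate,    -- prefix of UNIQUE `Unique` (startdate, ...)
    DROP INDEX ik_Destination,  -- prefix of join_index (destination, ...)
    DROP INDEX iv_PropertyID;   -- prefix of the DirectFromPrice* indexes
```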
Flags and other things of low cardinality are virtually useless by themselves as indexes. The Optimizer will decide that it is cheaper to scan the table than to bounce between the index's BTree and the data BTree. I'm thinking of INDEX(rating).
Any SELECT that does not have startdatet in the WHERE is likely to be slower than without partitioning. This is because the query must check all 13 partitions. Even with AND startdatet = 4, performance won't be any better than if there had been an index that included startdatet.
Let me discuss any index starting with a column (perhaps price, rating, startdate) that is queried as a "range" (e.g., WHERE price BETWEEN ...). The processing cannot use any columns after that column. I suspect ik_FILTER_ALL will scan a big chunk of the index, since it only filtered on price. Rearrange the columns. Based on the name, I am guessing this is a "covering" index; that is, a common query references only those 6 columns? Note: SELECT * ... references more than just those 6, so the index would not be "covering" for it. (Show us the query; I can discuss it more.)
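As a purely hypothetical rearrangement (guessing that the other columns are tested with '=' and price with a range; verify against the real query before doing this):

```sql
-- '=' columns first, then the range column; trailing columns after the
-- range can still serve to make the index "covering":
ALTER TABLE iv_deals
    ADD INDEX ik_FILTER_ALL2 (depairport, destination, resort, board,
                              price, startdate);
```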
The 5 "DirectFromPrice" indexes are probably each 'perfect' for some query. But they are awfully long (lots of columns). I would guess that 2 shorter lists would come close to handling the 5 cases "well enough". (Remember, decreasing the number of indexes will help in the goal of insert time.)
What version of MySQL/MariaDB are you using?
The main action item at this point: Show us the import. (I will discuss sorting the input after seeing the method being used.)
Related
I have partitioned a MySQL table containing 53 rows. But when I query the number of records across all partitions, the count is almost 3 times what I expect. Even phpMyAdmin thinks there are 156 records.
Have I done something wrong in my table design and partitioning?
(Screenshots omitted: the per-partition record counts, and the phpMyAdmin row count.)
Finally, this is my table:
CREATE TABLE cl_inbox (
id int(11) NOT NULL AUTO_INCREMENT,
user int(11) NOT NULL,
contact int(11) DEFAULT NULL,
sdate timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
body text NOT NULL,
userstatus tinyint(4) NOT NULL DEFAULT 1 COMMENT '0: new, 1:read, 2: deleted',
contactstatus tinyint(4) NOT NULL DEFAULT 0,
class tinyint(4) NOT NULL DEFAULT 0,
attachtype tinyint(4) NOT NULL DEFAULT 0,
attachsrc varchar(255) DEFAULT NULL,
PRIMARY KEY (id, user),
INDEX i_class (class),
INDEX i_contact_user (contact, user),
INDEX i_contactstatus (contactstatus),
INDEX i_user_contact (user, contact),
INDEX i_userstatus (userstatus)
)
ENGINE = INNODB
AUTO_INCREMENT = 69
AVG_ROW_LENGTH = 19972
CHARACTER SET utf8
COLLATE utf8_general_ci
ROW_FORMAT = DYNAMIC
PARTITION BY KEY (`user`)
(
PARTITION partition1 ENGINE = INNODB,
PARTITION partition2 ENGINE = INNODB,
PARTITION partition3 ENGINE = INNODB,
.....
PARTITION partition128 ENGINE = INNODB
);
Those numbers are approximations, just as with SHOW TABLE STATUS and EXPLAIN.
Meanwhile, you will probably find that PARTITION BY KEY provides no performance improvement. If you find otherwise, I would be very interested to hear about it.
I'm joining two tables and counting returned rows with simple MySQL query:
SELECT SQL_NO_CACHE count(parc2.id)
FROM SHIP__shipments AS ship
JOIN SHIP__shipments_parcels AS parc2 ON ship.shipmentId = parc2.shipmentId
It takes approx. 2 seconds to return the result, which is around 800k rows. The primary table has approx. 700k rows; the joined table has approx. 800k rows.
Both tables have indexes and all that stuff. The join without counting is very fast, approx. 0.005s.
Counting just one table is also very fast, something like 0.01s.
Once counting and the join are in the same query, we drop to 2s, with 99% of the time spent in "sending data" according to the profiler.
Output from explain:
1 SIMPLE ship index PRIMARY senderId 4 NULL 738700 Using index
1 SIMPLE parc2 ref shippmentId,shipmentId shippmentId 4 ship.shipmentId 1 Using index
I tried many things during testing: combined keys, count(*), forcing index use, and more exotic approaches like subqueries. Nothing really helps; it's always that slow.
Tables:
CREATE TABLE `SHIP__shipments` (
`shipmentId` int(11) NOT NULL COMMENT 'generated ID',
`externalId` varchar(255) DEFAULT NULL COMMENT 'spedition number',
`senderId` int(11) NOT NULL COMMENT 'FK - sender address',
`recipientId` int(11) DEFAULT NULL COMMENT 'Fk - recipient address',
`customerId` int(11) NOT NULL COMMENT 'FK - custromer',
`packageCount` int(11) NOT NULL COMMENT 'number of parcels',
`shipmentPickupDate` datetime NOT NULL COMMENT 'when to pickup shipent',
`shipmenmtDescription` varchar(255) DEFAULT NULL COMMENT 'free description',
`codAmount` double DEFAULT NULL COMMENT 'COD to take',
`codReference` varchar(255) DEFAULT NULL COMMENT 'customer''s COD refference',
`codCurrencyCode` varchar(50) DEFAULT NULL COMMENT 'FK - currency',
`codConfirmed` tinyint(1) NOT NULL COMMENT 'COD confirmed by spedition',
`codSent` tinyint(1) NOT NULL COMMENT 'COD paid to customer? 1/0',
`trackingCountryCode` varchar(50) NOT NULL COMMENT 'FK - country of shippment tracking',
`subscriptionDate` datetime NOT NULL COMMENT 'when to enter to the sped. system',
`speditionCode` varchar(50) NOT NULL COMMENT 'FK - spedition',
`shipmentType` enum('DIRECT','WAREHOUSE') NOT NULL DEFAULT 'WAREHOUSE' COMMENT 'internal OLZA flag',
`weight` decimal(10,3) NOT NULL COMMENT 'sum weight of parcells',
`billingPrice` decimal(10,2) NOT NULL COMMENT 'stored price of delivery',
`billingCurrencyCode` varchar(50) NOT NULL COMMENT 'storred currency of delivery price',
`invoiceCreated` tinyint(1) NOT NULL COMMENT 'invoicing has been done? 1/0',
`invoicingDate` datetime NOT NULL COMMENT 'date of creating invoice',
`pickupPlaceId` varchar(100) DEFAULT NULL COMMENT 'pickup place ID, if applicable for shipment',
`created` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`modified` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
`lastCheckDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT 'last date of status check'
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='shippment details';
ALTER TABLE `SHIP__shipments`
ADD PRIMARY KEY (`shipmentId`),
ADD UNIQUE KEY `senderId` (`senderId`) USING BTREE,
ADD UNIQUE KEY `externalId` (`externalId`,`trackingCountryCode`,`speditionCode`),
ADD UNIQUE KEY `recipientId_2` (`recipientId`),
ADD KEY `recipientId` (`recipientId`),
ADD KEY `customerId` (`customerId`),
ADD KEY `codCurrencyCode` (`codCurrencyCode`),
ADD KEY `trackingCountryCode` (`trackingCountryCode`),
ADD KEY `speditionCode` (`speditionCode`);
ALTER TABLE `SHIP__shipments`
MODIFY `shipmentId` int(11) NOT NULL AUTO_INCREMENT COMMENT 'generated ID';
ALTER TABLE `SHIP__shipments`
ADD CONSTRAINT `SHIP__shipments_ibfk_3` FOREIGN KEY (`recipientId`) REFERENCES `SHIP__recipient_list` (`recipientId`),
ADD CONSTRAINT `SHIP__shipments_ibfk_4` FOREIGN KEY (`customerId`) REFERENCES `CUST__customer_list` (`customerId`),
ADD CONSTRAINT `SHIP__shipments_ibfk_5` FOREIGN KEY (`codCurrencyCode`) REFERENCES `SYS__currencies` (`code`),
ADD CONSTRAINT `SHIP__shipments_ibfk_6` FOREIGN KEY (`trackingCountryCode`) REFERENCES `SYS__countries` (`code`),
ADD CONSTRAINT `SHIP__shipments_ibfk_7` FOREIGN KEY (`speditionCode`) REFERENCES `SYS__speditions` (`code`),
ADD CONSTRAINT `SHIP__shipments_ibfk_8` FOREIGN KEY (`senderId`) REFERENCES `SHIP__sender_list` (`senderId`);
CREATE TABLE `SHIP__shipments_parcels` (
`id` int(11) NOT NULL COMMENT 'generated ID',
`shipmentId` int(11) NOT NULL COMMENT 'FK - shippment',
`externalNumber` varchar(255) DEFAULT NULL COMMENT 'number from spedition',
`externalBarcode` varchar(255) DEFAULT NULL COMMENT 'Barcode ID - external reference',
`status` varchar(100) DEFAULT NULL COMMENT 'FK - current status',
`weigth` decimal(10,3) NOT NULL COMMENT 'weight of parcel',
`weightConfirmed` tinyint(1) NOT NULL COMMENT 'provided weight has been confirmed/updated by measuring',
`parcelType` varchar(255) NOT NULL COMMENT 'foreign key',
`created` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`modified` datetime DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='data and relations between shippment and it''s parcels';
ALTER TABLE `SHIP__shipments_parcels`
ADD PRIMARY KEY (`id`),
ADD KEY `shippmentId` (`shipmentId`,`status`),
ADD KEY `status` (`status`),
ADD KEY `parcelType` (`parcelType`),
ADD KEY `externalBarcode` (`externalBarcode`),
ADD KEY `weightConfirmed` (`weightConfirmed`),
ADD KEY `externalNumber` (`externalNumber`),
ADD KEY `shipmentId` (`shipmentId`);
ALTER TABLE `SHIP__shipments_parcels`
MODIFY `id` int(11) NOT NULL AUTO_INCREMENT COMMENT 'generated ID';
ALTER TABLE `SHIP__shipments_parcels`
ADD CONSTRAINT `SHIP__shipments_parcels_ibfk_2` FOREIGN KEY (`status`) REFERENCES `SHIP__statuses` (`statusCode`),
ADD CONSTRAINT `SHIP__shipments_parcels_ibfk_3` FOREIGN KEY (`shipmentId`) REFERENCES `SHIP__shipments` (`shipmentId`),
ADD CONSTRAINT `SHIP__shipments_parcels_ibfk_4` FOREIGN KEY (`parcelType`) REFERENCES `SHIP__parcel_types` (`parcelType`);
The server is running on SSD disks and we are not talking about a lot of data here.
Am I missing something here? Or is 2 seconds a realistic time for counting rows?
Can I get the count in a "normal" time, like 0.01s?
We are running MariaDB 10.
Analysis
Let's dissect some columns and the EXPLAIN:
`shipmentId` int(11) (*3) NOT NULL COMMENT 'generated ID',
`senderId` int(11) (*3) NOT NULL COMMENT 'FK - sender address',
1 SIMPLE ship index PRIMARY
senderId (*2) 4 NULL 738700 Using index (*1)
1 SIMPLE parc2 ref shippmentId,shipmentId
shippmentId (*4) 4 ship.shipmentId 1 Using index (*1)
SELECT ... count(parc2.id) (*5) ... STRAIGHT_JOIN (*6) ...
Notes:
*1 -- Both are Using index; this is likely to help a lot.
*2 -- INDEX(senderId) is probably the "smallest" index. Note that you are using InnoDB. The PK is "clustered" with the data, so it is not "small". Every secondary index has the PK implicitly tacked on, so that is effectively (senderId, shipmentId). This explains why the Optimizer mysteriously picked INDEX(senderId).
*3 -- INT takes 4 bytes, allowing numbers up to +/- 2 billion. Do you expect to have that many senders and shipments? Shrinking the datatype (and making it UNSIGNED) will save some space and I/O, and therefore may speed things up a little.
*4 -- INDEX(shipmentId) is actually like INDEX(shipmentId, id), again 2 INTs.
*5 -- COUNT(x) checks x for being NOT NULL. This is probably unnecessary in your application. Change to COUNT(*) unless you do need the null check. (The performance difference will be minor.)
*6 -- It probably does not matter which table it picks first, except perhaps for what indexes are available. Hence, STRAIGHT_JOIN did not help.
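The change in note *5 amounts to:

```sql
-- COUNT(*) only checks row existence; COUNT(parc2.id) also tests
-- id IS NOT NULL, which is redundant for a NOT NULL primary key.
SELECT SQL_NO_CACHE COUNT(*)
FROM   SHIP__shipments AS ship
JOIN   SHIP__shipments_parcels AS parc2
       ON ship.shipmentId = parc2.shipmentId;
```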
Now let's discuss how the JOIN works. Virtually all JOINs in MySQL are "NLJ" (Nested Loop Join). This is where the code walks through one of the tables (actually just an index for one table), then reaches into the other table (also, just into an index) for each row found.
To do a COUNT(*) it only needs to check for the existence of the row.
So, it walked through the 2-column INDEX(senderId, shipmentId) to find a list of all shipmentIds in the first table. It did not waste time sorting or dedupping that list. And, since shipmentId is the PK, (hence UNIQUE), there won't be any dups.
For each shipmentId, it then looked up all the rows in the second table. That was efficient to do because of INDEX(shipmentId, id).
I/O (or not)
Let's digress into another issue. Was there any I/O? Were all those rows of those two indexes fully cached in RAM? What is the value of innodb_buffer_pool_size?
The way InnoDB fetches a row (from a table or from an index) is to first check to see if it is in the "buffer pool". If it is not there, then it must bump something out of the buffer pool and read the desired 16KB block into the buffer pool.
At one extreme, nothing is in the buffer pool and all the blocks must be read from disk. At the other extreme, all are cached, and no I/O is needed. Since you tried all sorts of things, I assume that all the relevant blocks (those two indexes) were in RAM.
2 INTs * (800K + 700K rows) + some overhead = maybe 50MB. Assuming innodb_buffer_pool_size is more than that, and no swapping occurred, then it is reasonable for there to be no I/O.
So, how long should it take to touch 1.5M rows that are fully cached, in a JOIN? Alas, 2 seconds seems reasonable.
User expectations
It is rare to need an exact, up-to-the-second count that is in the millions. Rethink the User requirement. Or we can discuss ways to pre-compute the value. Or dead-reckon it.
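One sketch of pre-computing the value (the counter table and its name are hypothetical; the application, or a trigger, bumps it on each insert/delete):

```sql
CREATE TABLE parcel_count (
    id  TINYINT UNSIGNED NOT NULL PRIMARY KEY,  -- always 1; single row
    cnt BIGINT UNSIGNED NOT NULL
) ENGINE=InnoDB;

-- After each INSERT into SHIP__shipments_parcels:
UPDATE parcel_count SET cnt = cnt + 1 WHERE id = 1;

-- The "count" query is then a single-row lookup:
SELECT cnt FROM parcel_count WHERE id = 1;
```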
Side notes
(These do not impact the question at hand.)
Don't blindly use 255 for all strings.
UNIQUE(x) is an INDEX, so don't also have INDEX(x).
Having more than 2 PRIMARY or UNIQUE indexes is usually a design error in the schema.
Some columns could (should?) be normalized. Example: parcelType?
Don't use FLOAT or DOUBLE for monetary values; use DECIMAL. (weight could be floating.)
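For example, in the DDL above, KEY recipientId duplicates UNIQUE KEY recipientId_2 (both are on the single column recipientId), and the FOREIGN KEY on recipientId remains satisfied by the UNIQUE index:

```sql
ALTER TABLE SHIP__shipments DROP INDEX recipientId;
```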
I apologize for the ambiguity of the column and table names.
My database has two tables, A and B. It's a many-to-many relationship between them.
Table A has around 200 records.
Table A structure:

    Id    Definition
    12    Def1
    42    Def2
    ...   etc.
Table B has around 5 billion records. Structure:

    Column1    AssociatedId (from table A)
    abc        12
    abc        21
    pqr        42
I am trying to optimize the way data is stored in table B, as it has a lot of redundant data. The structure I am thinking of is as follows:

    Column1    AssociatedIds
    abc        12, 21
    pqr        42
The "Associated Ids" column can be updated when new rows are added to table A.
Is this a good structure for this scenario? If so, what should the column type be for "Associated Ids"? I am using a MySQL database.
Create table statements.
CREATE TABLE `A` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` varchar(100) DEFAULT NULL,
`name` varchar(100) DEFAULT NULL,
`creat_usr_id` varchar(20) NOT NULL,
`creat_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`modfd_usr_id` varchar(20) DEFAULT NULL,
`modfd_ts` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `A_ak1` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=277 DEFAULT CHARSET=utf8;
CREATE TABLE `B`(
`col1` varchar(128) NOT NULL,
`id` int(11) NOT NULL,
`added_dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`creat_usr_id` varchar(20) NOT NULL,
`creat_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`col1`,`id`,`added_dt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(added_dt))
(PARTITION Lessthan_2016 VALUES LESS THAN (1451606400) ENGINE = InnoDB,
PARTITION Lessthan_201603 VALUES LESS THAN (1456790400) ENGINE = InnoDB,
PARTITION Lessthan_201605 VALUES LESS THAN (1462060800) ENGINE = InnoDB,
PARTITION Lessthan_201607 VALUES LESS THAN (1467331200) ENGINE = InnoDB,
PARTITION Lessthan_201609 VALUES LESS THAN (1472688000) ENGINE = InnoDB,
PARTITION Lessthan_201611 VALUES LESS THAN (1477958400) ENGINE = InnoDB,
PARTITION Lessthan_201701 VALUES LESS THAN (1483228800) ENGINE = InnoDB,
PARTITION pfuture VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */;
Indexes.
Table  Non_unique  Key_name  Seq_in_index  Column_name  Collation  Cardinality  Sub_part  Packed  Index_type
B      0           PRIMARY   1             col1         A          2            NULL      NULL    BTREE
B      0           PRIMARY   2             id           A          6            NULL      NULL    BTREE
B      0           PRIMARY   3             added_dt     A          6            NULL      NULL    BTREE
5 billion rows here. Let me walk through things:
col1 varchar(128) NOT NULL,
How often is this column repeated? That is, is it worth it to 'normalize' it?
id int(11) NOT NULL,
Cut the size of this column in half (4 bytes -> 2), since you have only 200 distinct ids:
a_id SMALLINT UNSIGNED NOT NULL
Range of values: 0..65535
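The corresponding change would be something like this (the new name a_id is my suggestion; beware that rebuilding a 5-billion-row table will take a long time, and renaming the column affects every query that references it):

```sql
ALTER TABLE B CHANGE id a_id SMALLINT UNSIGNED NOT NULL;
```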
added_dt timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
Please explain why this is part of the PK. That is a rather odd thing to do.
creat_usr_id varchar(20) NOT NULL,
creat_ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
Toss these as clutter, unless you can justify keeping track of 5 billion actions this way.
PRIMARY KEY (col1,id,added_dt)
I'll bet you will eventually get two rows in the same second. A PK is 'unique'. Perhaps you need only (col1, a_id)? Otherwise, you are allowing a col1-a_id pair to be added multiple times. Or maybe you want IODKU (INSERT ... ON DUPLICATE KEY UPDATE) to add a new row versus update the timestamp?
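IODKU here would look something like this (a sketch assuming the PK is trimmed to (col1, a_id)):

```sql
-- Insert a new pair, or just refresh the timestamp if it already exists:
INSERT INTO B (col1, a_id, added_dt)
VALUES ('abc', 12, NOW())
ON DUPLICATE KEY UPDATE added_dt = NOW();
```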
PARTITION...
This is useful if (and probably only if) you intend to remove 'old' rows. Else please explain why you picked partitioning.
It is hard to review a schema without seeing the main SELECTs. In the case of large tables, we should also review the INSERTs, UPDATEs, and DELETEs, since each of them could pose serious performance problems.
At 100 rows inserted per second, it will take more than a year to add 5B rows. How fast will the rows be coming in? This may be a significant performance issue, too.
I have two tables with 65.5 million rows:
1)
CREATE TABLE RawData1 (
cdasite varchar(45) COLLATE utf8_unicode_ci NOT NULL,
id int(20) NOT NULL DEFAULT '0',
timedate datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
type int(11) NOT NULL DEFAULT '0',
status int(11) NOT NULL DEFAULT '0',
branch_id int(20) DEFAULT NULL,
branch_idString varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (id,cdasite,timedate),
KEY idx_timedate (timedate,cdasite)
) ENGINE=InnoDB;
2)
The same table with partitioning (call it RawData2):
PARTITION BY RANGE ( TO_DAYS(timedate))
(PARTITION p20140101 VALUES LESS THAN (735599) ENGINE = InnoDB,
PARTITION p20140401 VALUES LESS THAN (735689) ENGINE = InnoDB,
.
.
PARTITION p20201001 VALUES LESS THAN (738064) ENGINE = InnoDB,
PARTITION future VALUES LESS THAN MAXVALUE ENGINE = InnoDB);
I'm using the same query:
SELECT count(id) FROM RawData1
where timedate BETWEEN DATE_FORMAT(date_sub(now(),INTERVAL 2 YEAR),'%Y-%m-01') AND now();
Two problems:
1. Why does the partitioned table run longer than the regular table?
2. The regular table returns 36380217 in 17.094 sec. Is that normal? All our R&D leaders think it is not fast enough; it needs to return in ~2 sec.
What do I need to check / do / change?
Is it realistic to scan 35732495 rows and return a count of 36380217 in less than 3-4 sec?
You have found one example of why PARTITIONing is not a performance panacea.
Where does id come from?
How many different values are there for cdasite? If thousands, not millions, build a table mapping cdasite <=> id and switch from a bulky VARCHAR(45) to a MEDIUMINT UNSIGNED (or whatever is appropriate). This item may help the most, but perhaps not enough.
Ditto for status, but probably using TINYINT UNSIGNED. Or think about ENUM. Either is 1 byte, not 4.
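The cdasite mapping table suggested above might look like this (table and column names are hypothetical):

```sql
CREATE TABLE cdasite_map (
    cdasite_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- 3 bytes
    cdasite    VARCHAR(45) COLLATE utf8_unicode_ci NOT NULL,
    PRIMARY KEY (cdasite_id),
    UNIQUE KEY (cdasite)
) ENGINE=InnoDB;
-- RawData1 then stores the 3-byte cdasite_id instead of the VARCHAR(45).
```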
The (20) on INT(20) means nothing. You get a 4-byte integer with a limit of about 2 billion.
Are you sure there are no duplicate timedates?
branch_id and branch_idString -- this smells like a pair that needs to be in another table, leaving only the id here?
Smaller -> faster.
COUNT(*) is the same as COUNT(id) since id is NOT NULL.
Do not include future partitions before they are needed; it slows things down. (And don't use partitioning at all.)
To get that query even faster, build and maintain a Summary Table. It would have at least a DATE in the PRIMARY KEY and at least COUNT(*) as a column. Then the query would fetch from that table. More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables
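A minimal sketch of such a summary table (table and column names are hypothetical; see the link for refinements):

```sql
-- One row per day:
CREATE TABLE RawData_daily (
    dy  DATE NOT NULL,
    cnt INT UNSIGNED NOT NULL,
    PRIMARY KEY (dy)
) ENGINE=InnoDB;

-- Maintain it after each day's load (here: yesterday's rows):
INSERT INTO RawData_daily (dy, cnt)
    SELECT DATE(timedate), COUNT(*)
    FROM   RawData1
    WHERE  timedate >= CURDATE() - INTERVAL 1 DAY
      AND  timedate <  CURDATE()
    GROUP BY DATE(timedate);

-- The 2-year count then becomes a SUM over ~730 small rows:
SELECT SUM(cnt)
FROM   RawData_daily
WHERE  dy >= DATE_FORMAT(CURDATE() - INTERVAL 2 YEAR, '%Y-%m-01');
```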
I have a table storing weekly viewing statistics for around 40K businesses. The table has passed 2.2M records and is starting to slow things down. I'm looking at partitioning it to speed things up, but I'm not sure how best to do it.
My ORM requires an id field as a primary key, but that field has no relevance to the data; I've been using a unique index on the year, week number and business ID fields.
As I need the primary key to be involved in the partition map, I'm not sure how best to organise this (I've never used partitioning before).
Currently I have...
CREATE TABLE `weekly_views` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`business_id` int(11) NOT NULL,
`year` smallint(4) UNSIGNED NOT NULL,
`week` tinyint(2) UNSIGNED NOT NULL,
`hits` int(5) NOT NULL,
`created` timestamp NOT NULL ON UPDATE CURRENT_TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
`updated` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
UNIQUE `search` USING BTREE (business_id, `year`, `week`),
UNIQUE `id` USING BTREE (id, `week`)
) ENGINE=`InnoDB` AUTO_INCREMENT=2287009 DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci ROW_FORMAT=COMPACT CHECKSUM=0 DELAY_KEY_WRITE=0 PARTITION BY LIST(week) PARTITIONS 52 (PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
(5 ... 51)
PARTITION p52 VALUES IN (52) ENGINE = InnoDB);
One partition per week seemed the only logical way to break them up. Am I right that when I search for a record for the current week/business using 'business_id = xx AND week = xx AND year = xx', it will know which partition to use without searching them all? But when I get the result and save it via the ORM, it will use the id field and not know which partition to use?
I guess I could use a custom query to insert or update (I haven't originally done this as the ORM doesn't support it).
Am I going the right way about this, or is there a better way to partition a table like this?
Thanks for your help!
As long as the query has the week column in the WHERE clause, MySQL will look in the correct partition. However, weeks repeat each year, so you'll end up with data from different years in the same partition.
Also, you need 53 partitions, not 52, since some years have a 53rd week.