I apologize for the ambiguity of the column and table names.
My database has two tables, A and B. There is a many-to-many relationship between these tables.
Table A has around 200 records
Table A structure:
Id    Definition
12    Def1
42    Def2
.... etc.
Table B has around 5 Billion records
Column 1    Associated Id (from table A)
e.g.
abc         12
abc         21
pqr         42
I am trying to optimize the way data is stored in table B, as it has a lot of redundant data. The structure I am thinking of is as follows:
Column 1 Associated Ids
abc 12, 21
pqr 42
The "Associated Id" column can have updates when new rows are added to table A.
Is this a good structure to create in this scenario? If yes, what should the column type be for the "Associated Id" column? I am using a MySQL database.
Create table statements.
CREATE TABLE `A` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`title` varchar(100) DEFAULT NULL,
`name` varchar(100) DEFAULT NULL,
`creat_usr_id` varchar(20) NOT NULL,
`creat_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`modfd_usr_id` varchar(20) DEFAULT NULL,
`modfd_ts` timestamp NULL DEFAULT NULL ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
UNIQUE KEY `A_ak1` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=277 DEFAULT CHARSET=utf8;
CREATE TABLE `B`(
`col1` varchar(128) NOT NULL,
`id` int(11) NOT NULL,
`added_dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`creat_usr_id` varchar(20) NOT NULL,
`creat_ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`col1`,`id`,`added_dt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
/*!50100 PARTITION BY RANGE (UNIX_TIMESTAMP(added_dt))
(PARTITION Lessthan_2016 VALUES LESS THAN (1451606400) ENGINE = InnoDB,
PARTITION Lessthan_201603 VALUES LESS THAN (1456790400) ENGINE = InnoDB,
PARTITION Lessthan_201605 VALUES LESS THAN (1462060800) ENGINE = InnoDB,
PARTITION Lessthan_201607 VALUES LESS THAN (1467331200) ENGINE = InnoDB,
PARTITION Lessthan_201609 VALUES LESS THAN (1472688000) ENGINE = InnoDB,
PARTITION Lessthan_201611 VALUES LESS THAN (1477958400) ENGINE = InnoDB,
PARTITION Lessthan_201701 VALUES LESS THAN (1483228800) ENGINE = InnoDB,
PARTITION pfuture VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */;
Indexes.
Table  Non_unique  Key_name  Seq_in_index  Column_name  Collation  Cardinality  Sub_part  Packed  Index_type  Comment  Index_comment
B      0           PRIMARY   1             col1         A          2            NULL      NULL    BTREE
B      0           PRIMARY   2             id           A          6            NULL      NULL    BTREE
B      0           PRIMARY   3             added_dt     A          6            NULL      NULL    BTREE
5 billion rows here. Let me walk through things:
col1 varchar(128) NOT NULL,
How often is this column repeated? That is, is it worth it to 'normalize' it?
id int(11) NOT NULL,
Cut the size of this column in half (4 bytes -> 2), since you have only 200 distinct ids:
a_id SMALLINT UNSIGNED NOT NULL
Range of values: 0..65535
added_dt timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
Please explain why this is part of the PK. That is a rather odd thing to do.
creat_usr_id varchar(20) NOT NULL,
creat_ts timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
Toss these as clutter, unless you can justify keeping track of 5 billion actions this way.
PRIMARY KEY (col1,id,added_dt)
I'll bet you will eventually get two rows in the same second. A PK is 'unique'. Perhaps you need only (col1, a_id)? Otherwise, you are allowing a col1-a_id pair to be added multiple times. Or maybe you want IODKU (INSERT ... ON DUPLICATE KEY UPDATE) to add a new row versus update the timestamp?
PARTITION...
This is useful if (and probably only if) you intend to remove 'old' rows. Else please explain why you picked partitioning.
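The column-by-column suggestions above could be combined roughly as follows. This is only a sketch under stated assumptions: the audit columns are dropped, one row per (col1, a_id) pair is wanted, and the partitioning is removed. MySQL requires every unique key, including the PRIMARY KEY, to contain the partitioning column, which is a likely reason added_dt ended up in the PK. The table name B_slim is hypothetical.

```sql
-- Hypothetical slimmed-down version of table B (assumes no partitioning).
CREATE TABLE B_slim (
  col1     VARCHAR(128)      NOT NULL,
  a_id     SMALLINT UNSIGNED NOT NULL,   -- 2 bytes instead of 4; range 0..65535
  added_dt TIMESTAMP         NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (col1, a_id)               -- each pair can exist only once
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- IODKU: insert the pair, or refresh the timestamp if the pair already exists.
INSERT INTO B_slim (col1, a_id)
VALUES ('abc', 12)
ON DUPLICATE KEY UPDATE added_dt = CURRENT_TIMESTAMP;
```

If old rows really do need to be purged by date, the partitioned layout (with added_dt in the key) may still be the better trade-off.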
It is hard to review a schema without seeing the main SELECTs. In the case of large tables, we should also review the INSERTs, UPDATEs, and DELETEs, since each of them could pose serious performance problems.
At 100 rows inserted per second, it will take more than a year to add 5B rows. How fast will the rows be coming in? This may be a significant performance issue, too.
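The arithmetic behind that estimate can be checked directly in MySQL. The 100 rows per second figure is the assumed rate from the sentence above, not a measured one.

```sql
-- 5 billion rows at 100 rows/second, converted from seconds to days.
-- The result is roughly 579 days, i.e. more than a year and a half.
SELECT 5000000000 / 100 / 86400 AS days_needed;
```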
Related
I have partitioned a MySQL table containing 53 rows. Now when I query number of records in all partitions, the records are almost 3 times the expected. Even phpMyAdmin thinks there are 156 records.
Have I done something wrong in my table design and partitioning?
Count of records per partition: [screenshot not included]
phpMyAdmin: [screenshot not included]
Finally, this is my table:
CREATE TABLE cl_inbox (
id int(11) NOT NULL AUTO_INCREMENT,
user int(11) NOT NULL,
contact int(11) DEFAULT NULL,
sdate timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
body text NOT NULL,
userstatus tinyint(4) NOT NULL DEFAULT 1 COMMENT '0: new, 1:read, 2: deleted',
contactstatus tinyint(4) NOT NULL DEFAULT 0,
class tinyint(4) NOT NULL DEFAULT 0,
attachtype tinyint(4) NOT NULL DEFAULT 0,
attachsrc varchar(255) DEFAULT NULL,
PRIMARY KEY (id, user),
INDEX i_class (class),
INDEX i_contact_user (contact, user),
INDEX i_contactstatus (contactstatus),
INDEX i_user_contact (user, contact),
INDEX i_userstatus (userstatus)
)
ENGINE = INNODB
AUTO_INCREMENT = 69
AVG_ROW_LENGTH = 19972
CHARACTER SET utf8
COLLATE utf8_general_ci
ROW_FORMAT = DYNAMIC
PARTITION BY KEY (`user`)
(
PARTITION partition1 ENGINE = INNODB,
PARTITION partition2 ENGINE = INNODB,
PARTITION partition3 ENGINE = INNODB,
.....
PARTITION partition128 ENGINE = INNODB
);
Those numbers are approximations, just as with SHOW TABLE STATUS and EXPLAIN.
Meanwhile, you will probably find that PARTITION BY KEY provides no performance improvement. If you find otherwise, I would be very interested to hear about it.
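To see the difference between the estimates and the real numbers, something like the following could be run. It is a sketch: the per-partition COUNT uses the PARTITION (...) selection syntax, which assumes MySQL 5.6 or later.

```sql
-- Exact row count for the whole table (scans an index, so it is accurate).
SELECT COUNT(*) FROM cl_inbox;

-- Per-partition figures as reported by the data dictionary.
-- For InnoDB, TABLE_ROWS is an estimate and can be off by a wide margin.
SELECT PARTITION_NAME, TABLE_ROWS
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'cl_inbox';

-- Exact count for a single partition (MySQL 5.6+ syntax).
SELECT COUNT(*) FROM cl_inbox PARTITION (partition1);
```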
I have two tables with 65.5 million rows:
1)
CREATE TABLE RawData1 (
cdasite varchar(45) COLLATE utf8_unicode_ci NOT NULL,
id int(20) NOT NULL DEFAULT '0',
timedate datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
type int(11) NOT NULL DEFAULT '0',
status int(11) NOT NULL DEFAULT '0',
branch_id int(20) DEFAULT NULL,
branch_idString varchar(64) COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (id,cdasite,timedate),
KEY idx_timedate (timedate,cdasite)
) ENGINE=InnoDB;
2)
Same table with partition (call it RawData2)
PARTITION BY RANGE ( TO_DAYS(timedate))
(PARTITION p20140101 VALUES LESS THAN (735599) ENGINE = InnoDB,
PARTITION p20140401 VALUES LESS THAN (735689) ENGINE = InnoDB,
.
.
PARTITION p20201001 VALUES LESS THAN (738064) ENGINE = InnoDB,
PARTITION future VALUES LESS THAN MAXVALUE ENGINE = InnoDB);
I'm using the same query:
SELECT count(id) FROM RawData1
where timedate BETWEEN DATE_FORMAT(date_sub(now(),INTERVAL 2 YEAR),'%Y-%m-01') AND now();
Two problems:
1. Why does the partitioned table run longer than the regular table?
2. The regular table returns 36380217 in 17.094 sec. Is that normal? All the R&D leaders think it is not fast enough; it needs to return in ~2 sec.
What do I need to check / do / change?
Is it realistic to scan 35732495 rows and retrieve 36380217 in less than 3-4 sec?
You have found one example of why PARTITIONing is not a performance panacea.
Where does id come from?
How many different values are there for cdasite? If thousands, not millions, build a table mapping cdasite <=> id and switch from a bulky VARCHAR(45) to a MEDIUMINT UNSIGNED (or whatever is appropriate). This item may help the most, but perhaps not enough.
Ditto for status, but probably using TINYINT UNSIGNED. Or think about ENUM. Either is 1 byte, not 4.
The (20) on INT(20) means nothing. You get a 4-byte integer with a limit of about 2 billion.
Are you sure there are no duplicate timedates?
branch_id and branch_idString -- this smells like a pair that needs to be in another table, leaving only the id here?
Smaller -> faster.
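A minimal sketch of the shrinking steps suggested above. The names Sites, site_id and RawData1_slim are hypothetical, and the sketch assumes that status fits in 0..255 and that branch_idString can be moved to a separate lookup table keyed by branch_id.

```sql
-- Hypothetical mapping table: cdasite <=> small integer id.
CREATE TABLE Sites (
  site_id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
  cdasite VARCHAR(45) COLLATE utf8_unicode_ci NOT NULL,
  PRIMARY KEY (site_id),
  UNIQUE KEY (cdasite)
) ENGINE=InnoDB;

INSERT INTO Sites (cdasite)
SELECT DISTINCT cdasite FROM RawData1;

-- Hypothetical slimmer copy of the raw table.
CREATE TABLE RawData1_slim (
  site_id   MEDIUMINT UNSIGNED NOT NULL,          -- 3 bytes instead of VARCHAR(45)
  id        INT NOT NULL DEFAULT 0,
  timedate  DATETIME NOT NULL,
  type      INT NOT NULL DEFAULT 0,
  status    TINYINT UNSIGNED NOT NULL DEFAULT 0,  -- assumes values 0..255
  branch_id INT DEFAULT NULL,
  PRIMARY KEY (id, site_id, timedate),
  KEY idx_timedate (timedate, site_id)
) ENGINE=InnoDB;

INSERT INTO RawData1_slim (site_id, id, timedate, type, status, branch_id)
SELECT s.site_id, r.id, r.timedate, r.type, r.status, r.branch_id
FROM RawData1 AS r
JOIN Sites AS s ON s.cdasite = r.cdasite;
```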
COUNT(*) is the same as COUNT(id) since id is NOT NULL.
Do not include future partitions before they are needed; it slows things down. (And don't use partitioning at all.)
To get that query even faster, build and maintain a Summary Table. It would have at least a DATE in the PRIMARY KEY and at least COUNT(*) as a column. Then the query would fetch from that table. More on Summary tables: http://mysql.rjweb.org/doc.php/summarytables
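A summary table along those lines might look like the sketch below. The names RawDataDaily, dy and ct are hypothetical, and the sketch assumes a job that runs once a day and summarizes the previous day.

```sql
-- Hypothetical summary table: one row per calendar day.
CREATE TABLE RawDataDaily (
  dy DATE         NOT NULL,
  ct INT UNSIGNED NOT NULL,
  PRIMARY KEY (dy)
) ENGINE=InnoDB;

-- Daily maintenance (assumed to run shortly after midnight):
-- summarize yesterday's rows; rerunning it overwrites the same row.
INSERT INTO RawDataDaily (dy, ct)
SELECT DATE(timedate), COUNT(*)
FROM RawData1
WHERE timedate >= CURDATE() - INTERVAL 1 DAY
  AND timedate <  CURDATE()
GROUP BY DATE(timedate)
ON DUPLICATE KEY UPDATE ct = VALUES(ct);

-- The report then adds up a few hundred summary rows instead of
-- scanning tens of millions of raw rows.
SELECT SUM(ct)
FROM RawDataDaily
WHERE dy >= DATE_FORMAT(NOW() - INTERVAL 2 YEAR, '%Y-%m-01');
```

Rows for the current, still incomplete day are not in the summary; if they matter, they can be counted from RawData1 separately and added to the sum.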
I am in a problematic situation and have found dozens of questions on the same topic, but maybe I am not able to apply those solutions to my issue.
I have a system built in Codeigniter, and it does the following
codeigniter->start_transaction()
UPDATE T SET A = 1, MODIFIED = NOW()
WHERE PK IN
( SELECT PK FROM
(SELECT PK, LAST_INSERT_ID(PK) FROM T
where FK = 31 AND A=0 AND R=1 AND R_FK = 21
AND DEAD = 0 LIMIT 0,1) AS TBL1
) and A=0 AND R = 1 AND R_FK = 21 AND DEAD = 0
-- This query dynamically picks a row that is not dead and not yet assigned,
-- and that is linked to id 21 (R_FK) from table R.
-- When it finds the row, it marks it as assigned (A=1).
-- PK = LAST_INSERT_ID(PK) ensures that last_insert_id is updated with this row's id, so I can retrieve it from PHP.
GOTO MODULE B
MODULE B {
INSERT INTO T(A,B,C,D,E,F,R,FK,R_FK,DEAD,MODIFIED) VALUES(ALL VALUES)
-- this line gives me lock wait timeout exceeded.
}
MySQL version is 5.1.63-community-log
Table T is an INNODB table and has only one normal type index on FK field, and no foreign key constraints are there. PrimaryKey (PK) field is an auto_increment field.
I get a lock wait timeout in the above case, and that is due to the first transactional update holding a lock on the table. How can I avoid locking the table with that update query while using transactions? I cannot commit the transaction until I receive a response from MODULE B.
I don't have much detailed knowledge about databases and their internals, so please bear with me if I have said something that does not make sense.
--UPDATE--
-- TABLE T Structure
CREATE TABLE `T` (
`PK` int(11) NOT NULL AUTO_INCREMENT,
`FK` int(11) DEFAULT NULL,
`P` varchar(1024) DEFAULT NULL,
`DEAD` tinyint(1) NOT NULL DEFAULT '0',
`A` tinyint(1) NOT NULL DEFAULT '0',
`MODIFIED` datetime DEFAULT NULL,
`R` tinyint(4) NOT NULL DEFAULT '0',
`R_FK` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`PK`),
KEY `FK_REFERENCE_54` (`FK`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- Indexes Information
SHOW INDEX FROM T;
1- Field FK, Cardinality 65 , NULL => Yes , Index_Type => BTRee
2- Field PK, Cardinality 11153, Index_Type => BTRee
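One possible way to express what the query comments above describe is sketched below. It is not a confirmed fix for the timeout. InnoDB locks the index records it scans while processing an UPDATE, so the sketch uses a composite index and LIMIT 1 to keep the scan short. The index name idx_claim and its column order are assumptions based only on the WHERE clause shown in the question.

```sql
-- Hypothetical index so the UPDATE does not have to scan (and lock) the whole table.
ALTER TABLE T ADD INDEX idx_claim (R_FK, FK, A, R, DEAD);

START TRANSACTION;

-- Claim one unassigned, live row in a single statement.
UPDATE T
   SET A = 1,
       MODIFIED = NOW(),
       PK = LAST_INSERT_ID(PK)   -- PK keeps its value; the id is remembered for this connection
 WHERE FK = 31 AND A = 0 AND R = 1 AND R_FK = 21 AND DEAD = 0
 LIMIT 1;

-- Id of the claimed row. Check the affected-rows count first:
-- if no row matched, this value is left over from an earlier statement.
SELECT LAST_INSERT_ID();

-- ... MODULE B runs here, then COMMIT or ROLLBACK ...
```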
Problem: slow query.
table1 has about 5 000 rows
table2 has about 50 000 rows
timestamp format is int(11)
MySQL - 20 seconds (with indexes)
PostgreSQL - 0,04 seconds (with indexes)
SELECT *
FROM table1
LEFT JOIN table2
ON table2_timestamp BETWEEN table1_timestamp - 500
AND table1_timestamp + 500 ;
Can anybody help me optimize this query for MySQL?
Explain:
1 SIMPLE a index a 9 2 Using index
1 SIMPLE b index b b 9 5 Using index
Tables:
CREATE TABLE `a` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`table1_timestamp` bigint(20) NULL DEFAULT NULL ,
PRIMARY KEY (`id`),
INDEX `a` (`table1_timestamp`) USING BTREE
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
AUTO_INCREMENT=3
ROW_FORMAT=COMPACT
;
CREATE TABLE `b` (
`id` int(11) NOT NULL AUTO_INCREMENT ,
`table2_timestamp` bigint(20) NULL DEFAULT NULL ,
PRIMARY KEY (`id`),
INDEX `a` (`table2_timestamp`) USING BTREE
)
ENGINE=InnoDB
DEFAULT CHARACTER SET=utf8 COLLATE=utf8_general_ci
AUTO_INCREMENT=3
ROW_FORMAT=COMPACT
;
A few points spring to mind, but they all feel like long shots. Realistically, it looks as though there isn't much you can do to your query, assuming your example is an accurate representation.
1 : You are using BIGINT, which has a maximum value of about 9x10^18 (SIGNED). INT has a maximum of about 4x10^9 (UNSIGNED), while today's timestamp is around 1.4x10^9 (all values approximate), so consider changing the data type of that column in both tables from BIGINT to INT UNSIGNED or DATETIME.
2 : The ROW_FORMAT is COMPACT, which may cause issues with BTREE indexes (source). You are dealing with INT data types, so a ROW_FORMAT of FIXED would suffice; try changing to ROW_FORMAT=FIXED on both tables. Be aware, though, that FIXED is a MyISAM row format: InnoDB does not support it and will either reject the option or fall back to its default row format, depending on innodb_strict_mode.
3 : If you always expect rows to be returned from table2 for table1 rows, then INNER JOIN would be more efficient than LEFT JOIN.
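Points 1 and 3 could be tried as in the sketch below. It uses the table names a and b from the CREATE TABLE statements rather than table1 and table2 from the query, and it assumes all stored values are ordinary Unix timestamps that fit in INT UNSIGNED.

```sql
-- Point 1: shrink the timestamp columns from 8 bytes to 4.
ALTER TABLE a MODIFY table1_timestamp INT UNSIGNED NULL DEFAULT NULL;
ALTER TABLE b MODIFY table2_timestamp INT UNSIGNED NULL DEFAULT NULL;

-- Point 3: INNER JOIN, valid only if unmatched rows from a are not needed.
-- Caveat: with UNSIGNED columns, "table1_timestamp - 500" raises an
-- out-of-range error for values below 500; real timestamps are far above that.
SELECT *
FROM a
INNER JOIN b
        ON b.table2_timestamp BETWEEN a.table1_timestamp - 500
                                  AND a.table1_timestamp + 500;
```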
I have a table storing weekly viewing statistics for around 40K businesses. The table has passed 2.2M records and is starting to slow things down. I'm looking at partitioning it to speed things up, but I'm not sure how best to do it.
My ORM requires an id field as a primary key, but that field has no relevance to the data; I've been using a unique index on the fields for year, week number and business ID.
As I need the primary key to be involved in the partition map, I'm not sure how best to organise this (I've never used partitioning before).
Currently I have...
CREATE TABLE `weekly_views` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`business_id` int(11) NOT NULL,
`year` smallint(4) UNSIGNED NOT NULL,
`week` tinyint(2) UNSIGNED NOT NULL,
`hits` int(5) NOT NULL,
`created` timestamp NOT NULL ON UPDATE CURRENT_TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
`updated` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
UNIQUE `search` USING BTREE (business_id, `year`, `week`),
UNIQUE `id` USING BTREE (id, `week`)
) ENGINE=`InnoDB` AUTO_INCREMENT=2287009 DEFAULT CHARACTER SET latin1 COLLATE latin1_swedish_ci ROW_FORMAT=COMPACT CHECKSUM=0 DELAY_KEY_WRITE=0
PARTITION BY LIST(week) PARTITIONS 52
(PARTITION p1 VALUES IN (1) ENGINE = InnoDB,
PARTITION p2 VALUES IN (2) ENGINE = InnoDB,
PARTITION p3 VALUES IN (3) ENGINE = InnoDB,
PARTITION p4 VALUES IN (4) ENGINE = InnoDB,
(5 ... 51)
PARTITION p52 VALUES IN (52) ENGINE = InnoDB);
One partition per week seemed the only logical way to break them up. Am I right that when I search for a record for the current week/business using 'business_id = xx and week = xx and year = xx' it's going to know which partition to use without searching them all? But, when I get the result and save it via the ORM, it's going to use the id field and not know which partition to use?
I guess I could use a custom query to insert or update (I haven't originally done this as the ORM doesn't support it).
Am I going the right way about this, or is there a better way to partition a table like this?
Thanks for your help!
As long as the query has the week column in the WHERE clause, MySQL will look only in the correct partition. However, weeks repeat each year, so you'll end up with data from different years in the same partition.
Also, you need 53 partitions, not 52: a year is slightly longer than 52 weeks, so some years (not only leap years) contain a week 53.
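Both points can be checked or applied with statements like the ones below. This is a sketch: the literal values are made up, and EXPLAIN PARTITIONS assumes MySQL 5.1 to 5.7 (in 8.0, plain EXPLAIN already shows a partitions column).

```sql
-- Check pruning: the "partitions" column of the output should list
-- only the partition for week 7, not all of them.
EXPLAIN PARTITIONS
SELECT hits
FROM weekly_views
WHERE business_id = 123 AND `year` = 2014 AND `week` = 7;

-- Add the missing partition for week 53.
ALTER TABLE weekly_views
  ADD PARTITION (PARTITION p53 VALUES IN (53));
```

A lookup by id alone, as the ORM would issue on save, carries no week value, so MySQL has to probe every partition for it.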