I'm working on designing a new database that will need to handle an enormous amount of data. It will be a data warehouse system, and will thus be organized around a central hub table:
create table hub(id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
date_time DATETIME NOT NULL, bit_of_data INT NOT NULL);
When this table grows very large, it seems that it will be necessary to partition it based on the 'date_time' column, with each partition being, say, one month of data. However, there will also be another table:
create table other_data(id BIGINT NOT NULL PRIMARY KEY,
more_data INT NOT NULL, FOREIGN KEY(id) REFERENCES hub(id));
This second table will contain records for about 90% of the ids that appear in the main 'hub' table. I'd like to partition the 'other_data' table as well as the 'hub' table, and have the partitions basically match up with each other. Is there any way to partition the 'hub' table on a date range, and then also partition the 'other_data' table on the same date range?
Thanks!
This can be done only by adding a (redundant) date column to the other_data table.
Related
I have a main table called results. E.g.
CREATE TABLE results (
r_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
r_date DATE NOT NULL,
system_id INT NOT NULL,
FOREIGN KEY (system_id) REFERENCES systems(s_id) ON UPDATE CASCADE ON DELETE CASCADE
);
The systems table as:
CREATE TABLE systems (
s_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
system_name VARCHAR(50) NOT NULL UNIQUE
);
I'm writing a program in Python with MySQL connector. Is there a way to add data to the systems table and then auto assign the generated s_id to the results table?
I know I could INSERT into systems, then make another call to that table to look up the ID for the system_name and add it to the results table, but I thought there might be a quirk in SQL that I'm not aware of that would make life easier with fewer calls to the DB?
You could do what you describe in a trigger like this:
CREATE TRIGGER t AFTER INSERT ON systems
FOR EACH ROW
INSERT INTO results SET r_date = NOW(), system_id = NEW.s_id;
This is possible only because the columns of your results table are easy to fill in from the data the trigger has access to. The auto-increment fills itself in, and no additional columns need to be filled in. If you had more columns in the results table, this would be harder.
You should read more about triggers:
https://dev.mysql.com/doc/refman/8.0/en/create-trigger.html
https://dev.mysql.com/doc/refman/8.0/en/triggers.html
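As a minimal sketch of the trigger approach, the snippet below uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. That substitution is an assumption made for illustration: the SQLite trigger body needs BEGIN ... END and uses DATE('now') instead of NOW(), and the system name 'alpha' is made up.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE systems (
    s_id INTEGER PRIMARY KEY AUTOINCREMENT,
    system_name VARCHAR(50) NOT NULL UNIQUE
);
CREATE TABLE results (
    r_id INTEGER PRIMARY KEY AUTOINCREMENT,
    r_date DATE NOT NULL,
    system_id INT NOT NULL,
    FOREIGN KEY (system_id) REFERENCES systems(s_id)
        ON UPDATE CASCADE ON DELETE CASCADE
);
-- SQLite spelling of the trigger: every new systems row gets a results row.
CREATE TRIGGER t AFTER INSERT ON systems
FOR EACH ROW
BEGIN
    INSERT INTO results (r_date, system_id) VALUES (DATE('now'), NEW.s_id);
END;
""")

# A single INSERT from the application; the trigger fills in results.
conn.execute("INSERT INTO systems (system_name) VALUES (?)", ("alpha",))
conn.commit()

rows = conn.execute(
    "SELECT s.system_name, r.system_id "
    "FROM results r JOIN systems s ON s.s_id = r.system_id"
).fetchall()
print(rows)
```

The application issues one statement and the database creates the matching results row, which is the behaviour the trigger above is meant to provide.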
I have a table that contains ids and emails. For simplicity's sake, let's say that an id is the row number. Both of these columns are unique: no two rows will have the same id and no two rows will have the same email. I need fast lookups of id by email and of email by id.
If I were to program this schema myself, in addition to the main table (which is indexed by the id), I would store a hash table which would have the emails as the keys. That would ensure O(1) for searches in both directions.
Here is how I plan on making my tables:
CREATE TABLE main_table (
id INT AUTO_INCREMENT,
email VARCHAR(256) NOT NULL,
...
PRIMARY KEY(id),
UNIQUE(email)
);
CREATE TABLE id_by_email (
email VARCHAR(256),
id INT,
PRIMARY KEY(email),
FOREIGN KEY(email) REFERENCES main_table(email),
FOREIGN KEY(id) REFERENCES main_table(id)
);
Will this setup even work? And if it will, will it produce the O(1) lookup I'm striving for?
The search in a B-tree index is O(log n). For all practical purposes, this is fast enough. After all, the base-2 log of 1,000,000 is only about 20, and a B-tree's high fan-out means far fewer page reads than that.
In addition, with an index, you don't have to worry about whether or not a hash table fits into memory. And the database maintains the index even when the data changes.
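To make this concrete, here is a small sketch that looks up rows in both directions using only the primary key and the index behind a UNIQUE constraint, with no second table. It uses Python's sqlite3 module as a stand-in for MySQL (an assumption so that it runs anywhere), and the generated email addresses are made up.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE main_table (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email VARCHAR(256) NOT NULL UNIQUE
)
""")
conn.executemany(
    "INSERT INTO main_table (email) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(1000)],
)

# email -> id: served by the index that backs the UNIQUE constraint.
(found_id,) = conn.execute(
    "SELECT id FROM main_table WHERE email = ?", ("user41@example.com",)
).fetchone()

# id -> email: served by the primary key.
(found_email,) = conn.execute(
    "SELECT email FROM main_table WHERE id = ?", (found_id,)
).fetchone()

# Ask the engine how it resolves the email lookup (index search, not a scan).
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM main_table WHERE email = ?",
    ("user41@example.com",),
).fetchall()
uses_index = any("INDEX" in row[-1] for row in plan)

# Upper bound on binary-search steps for a million keys.
comparisons = math.ceil(math.log2(1_000_000))
print(found_id, found_email, uses_index, comparisons)
```

The single table with a primary key and a unique index covers both lookups, so the separate id_by_email table adds nothing.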
This is a question about database design. Say I have several tables, some of which share a common expiry field.
CREATE TABLE item (
id INT PRIMARY KEY
);
CREATE TABLE coupon (
id INT PRIMARY KEY,
expiry DATE NOT NULL,
FOREIGN KEY (id) REFERENCES item(id)
);
CREATE TABLE subscription (
id INT PRIMARY KEY,
expiry DATE NOT NULL,
FOREIGN KEY (id) REFERENCES item(id)
);
CREATE TABLE product (
id INT PRIMARY KEY,
name VARCHAR(32),
FOREIGN KEY (id) REFERENCES item(id)
);
The expiry column does need to be indexed so I can easily query by expiry.
My question is, should I pull the expiry column into another table like so?
CREATE TABLE item (
id INT PRIMARY KEY
);
CREATE TABLE expiry (
id INT PRIMARY KEY,
expiry DATE NOT NULL
);
CREATE TABLE coupon (
id INT PRIMARY KEY,
expiry_id INT NOT NULL,
FOREIGN KEY (id) REFERENCES item(id),
FOREIGN KEY (expiry_id) REFERENCES expiry(id)
);
CREATE TABLE subscription (
id INT PRIMARY KEY,
expiry_id INT NOT NULL,
FOREIGN KEY (id) REFERENCES item(id),
FOREIGN KEY (expiry_id) REFERENCES expiry(id)
);
CREATE TABLE product (
id INT PRIMARY KEY,
name VARCHAR(32),
FOREIGN KEY (id) REFERENCES item(id)
);
Another possible solution is to pull the expiry into another base "class" table.
CREATE TABLE item (
id INT PRIMARY KEY
);
CREATE TABLE expiring_item (
id INT PRIMARY KEY,
expiry DATE NOT NULL,
FOREIGN KEY (id) REFERENCES item(id)
);
CREATE TABLE coupon (
id INT PRIMARY KEY,
FOREIGN KEY (id) REFERENCES expiring_item(id)
);
CREATE TABLE subscription (
id INT PRIMARY KEY,
FOREIGN KEY (id) REFERENCES expiring_item(id)
);
CREATE TABLE product (
id INT PRIMARY KEY,
name VARCHAR(32),
FOREIGN KEY (id) REFERENCES item(id)
);
Because refactoring a table structure is difficult once the database is in use, I am having trouble weighing the pros and cons of each approach.
From what I see, the first approach uses the fewest table joins; however, I will have redundant data for each expiring item. The second approach seems good, in that any time I need to add an expiry to an item I simply add a foreign key to that table. But if I discover that expiring items (or a subset of expiring items) actually share another attribute, then I need to add another table for that. I like the third approach best, because it brings me closest to an OOP-like hierarchy. However, I worry that this is my personal bias towards OOP, and that database tables do not use composition in the same way OOP class inheritance does.
Sorry for the poor SQL syntax ahead of time.
I would stick with the first design. The 'redundant' data is still valid data, if only as a record of what was valid at a point in time, and it also allows for renewal with minimal impact. The second option makes little sense, because the expiry is an arbitrary value with no real context outside the table that references it; in other words, unless it is associated with a coupon or a subscription, it is an orphan value. The third option makes no more sense: at what point does an item become expiring? As soon as it is defined? At a set period before expiry? At the end of the day, the expiry is a distinct attribute that happens to have the same name and purpose for both the coupon and the subscription, but the two expiries are not related to each other or, as such, to the item.
Do not normalize "continuous" values such as datetime, float, int, etc. Doing so makes any kind of range test on expiry very inefficient.
Anyway, a DATE takes 3 bytes; an INT takes 4, so the change would increase the disk footprint for no good reason.
So, use the first, not the second. But...
As for the third, you say "expirations are independent", yet you propose having a single expiry?? Which is it??
If they are not independent, then another principle comes into play. "Don't have redundant data in a database." So, if the same expiry really applies to multiple connected tables, it should be in only one of the tables. Then the third schema is the best. (Exception: There may be a performance issue, but I doubt it.)
If there are different dates for coupon/subscription/etc, then you must not use the third.
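The range test mentioned above can be sketched against the first schema, where expiry is stored directly in the coupon table and indexed. This is only an illustration: it uses Python's sqlite3 module instead of MySQL, stores dates as ISO-8601 strings (which compare correctly as text), and the index name idx_coupon_expiry and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (id INTEGER PRIMARY KEY);
CREATE TABLE coupon (
    id INTEGER PRIMARY KEY REFERENCES item(id),
    expiry DATE NOT NULL
);
-- Hypothetical index name; the question only says expiry must be indexed.
CREATE INDEX idx_coupon_expiry ON coupon (expiry);
""")

# Made-up sample data: (item id, expiry date as ISO-8601 text).
sample = [(1, "2024-01-15"), (2, "2024-02-01"), (3, "2024-03-20")]
conn.executemany("INSERT INTO item (id) VALUES (?)", [(i,) for i, _ in sample])
conn.executemany("INSERT INTO coupon (id, expiry) VALUES (?, ?)", sample)

# Range test directly on the indexed column: no join to an expiry table needed.
query = "SELECT id FROM coupon WHERE expiry >= ? AND expiry < ? ORDER BY expiry"
bounds = ("2024-02-01", "2024-04-01")
expiring = [row[0] for row in conn.execute(query, bounds)]

# Print the engine's plan so the access path can be inspected.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, bounds):
    print(row[-1])
print(expiring)
```

With the second schema, the same question would require joining every coupon to the expiry table before the date range could be applied to it.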
I have a hypothetical table with a primary key that is a BIGINT. Let's say my table grows very large and I have to partition it, creating different partitions by date range. What happens to the primary key? Does that mean I can exceed the capacity of the BIGINT since there are more tables now? How does MySQL avoid assigning duplicate primary keys, assuming a BIGINT set to auto-increment a unique value?
Thanks in advance...
Partitioning doesn't create a new table. There is still one table with many partitions. The auto-increment unique value keeps its functionality. The data is simply grouped into partitions by the date field that you choose to partition on.
Take a look at this: http://dev.mysql.com/tech-resources/articles/mysql_55_partitioning.html
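The point can be illustrated with a toy model in Python. This is not how MySQL is implemented; the class PartitionedTable and its methods are hypothetical and only show the idea of one table-wide counter with rows routed to monthly partitions.

```python
from collections import defaultdict
from datetime import date
from itertools import count


class PartitionedTable:
    """Toy model: one logical table, one id counter, rows stored per month."""

    def __init__(self):
        self._next_id = count(1)             # single counter for the whole table
        self.partitions = defaultdict(list)  # (year, month) -> rows

    def insert(self, day, payload):
        row_id = next(self._next_id)         # id is assigned before routing
        self.partitions[(day.year, day.month)].append((row_id, day, payload))
        return row_id


table = PartitionedTable()
ids = [
    table.insert(date(2024, 1, 5), "a"),
    table.insert(date(2024, 2, 9), "b"),
    table.insert(date(2024, 1, 20), "c"),
]

# Largest value a signed BIGINT can hold; partitioning does not change it.
BIGINT_MAX = 2**63 - 1
print(ids, sorted(table.partitions), BIGINT_MAX)
```

Because the counter belongs to the table rather than to a partition, ids never repeat across partitions, and the BIGINT range is shared by all of them rather than multiplied.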
I have a massive (3,000,000,000 rows) fact table in a data warehouse star schema. The table is partitioned on the date key.
I would like to add an index on one of the foreign keys. This is to allow me to identify and remove childless rows in a large dimension table.
If I just issue a CREATE INDEX statement then it would take forever.
Do any SQL gurus have any fancy techniques for this problem?
(SQL Server 2008)
--Simplified example...
CREATE TABLE FactRisk
(
DateId int not null,
TradeId int not null,
Amount decimal not null
)
--I want to create this index, but the straightforward way will take forever...
CREATE NONCLUSTERED INDEX IX_FactRisk_TradeId on FactRisk (TradeId)
I have a plan...
1. Switch out all the daily partitions to tables
2. Index the now-empty fact table
3. Index each of the switched-out partition tables
4. Switch all the partitions back in
Initial investigation implies that this will work. I will report back...