Does Citus support creating shards using mysql_fdw? - mysql

The Citus documentation for the master_get_table_metadata function states:
part_storage_type: Type of storage used for the table. May be ‘t’ (standard table), ‘f’ (foreign table) or ‘c’ (columnar table).
But I searched the entire documentation and found no examples of how to work with tables distributed using the ‘f’ (foreign table) partition storage type.
I suppose the initial foreign table could be created using:
CREATE FOREIGN TABLE audit (
    id integer NOT NULL,
    ctime timestamp without time zone DEFAULT now() NOT NULL,
    site_id integer NOT NULL,
    client_id integer,
    done_time timestamp without time zone,
    status text DEFAULT 'NEW' NOT NULL,
    file_id character varying(16) DEFAULT ''::character varying NOT NULL
) SERVER mysql_svr OPTIONS (dbname 'constructor', table_name 'audit');
But how do I distribute such a table after creating it? How will the shards be created?
Update
I have found this
FOREIGN (‘f’) — Indicates that shard stores foreign data. (Used by distributed file_fdw tables)
So my question remains: is it possible to use other foreign data wrappers, such as mysql_fdw?

Creating distributed foreign tables has only partial support right now within Citus.
Let's take your example:
CREATE FOREIGN TABLE audit (
    id integer NOT NULL,
    ctime timestamp without time zone DEFAULT now() NOT NULL,
    site_id integer NOT NULL,
    client_id integer,
    done_time timestamp without time zone,
    status text DEFAULT 'NEW' NOT NULL,
    file_id character varying(16) DEFAULT ''::character varying NOT NULL
) SERVER mysql_svr
OPTIONS (dbname 'constructor', table_name 'audit');
You can now distribute this using:
SELECT * FROM master_create_distributed_table('audit', 'id', 'append');
And create shards using:
SELECT master_create_worker_shards('audit', <shard_count>);
However, each shard created on the worker node will inherit the same options as the master node. Thus, each shard will point, in this example, to dbname 'constructor', and foreign table 'audit'. There would be limited value in creating such a distribution, since even though Citus will issue parallel queries, they will all again be sent to a single node and table.
To construct a more useful example, suppose you already have 8 sharded MySQL tables, e.g. audit_1, audit_2, ..., audit_8.
You can construct the same table as above, and create a distributed setup like so:
SELECT * FROM master_create_distributed_table('audit', 'id', 'append');
And create shards using:
SELECT master_create_worker_shards('audit', 8);
You would now need to log into each Citus worker node and update each shard to point to its relevant MySQL shard, e.g.:
ALTER FOREIGN TABLE audit_100208 OPTIONS (SET table_name 'audit_1');
If you have tables spread across multiple nodes or databases, you'd need to manually create specific servers for each foreign node on every Citus worker node.
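As a rough sketch of what that per-node setup could look like (the host, port, credentials, and shard name below are all made up for illustration):
-- Hypothetical: define a second mysql_fdw server and user mapping on a
-- Citus worker for another MySQL node.
CREATE SERVER mysql_svr_2 FOREIGN DATA WRAPPER mysql_fdw
    OPTIONS (host 'mysql2.example.com', port '3306');
CREATE USER MAPPING FOR CURRENT_USER SERVER mysql_svr_2
    OPTIONS (username 'app_user', password 'app_password');
-- A foreign table's server cannot be changed after creation, so a shard
-- that should read from mysql_svr_2 would have to be dropped and recreated
-- against that server; shards staying on the original server only need
-- their table_name option updated:
ALTER FOREIGN TABLE audit_100209 OPTIONS (SET table_name 'audit_2');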
There are caveats here to be careful of. For one, we marked the distribution as 'append', because we don't know the underlying distribution of the foreign table. If you use hash, you may get wrong partition pruning via Citus. There may be other caveats too, as this isn't a use-case we actively support or have tested. From a historical perspective, we primarily used this as a proof-of-concept to try reading flat-files spread across multiple nodes.
Edit
Adding responses to the other questions by Eugen.
Also, please note, such Q/A is best suited for the mailing list here:
https://groups.google.com/forum/#!forum/citus-users
By 'partial support', I meant we will push down the foreign table creation, but will not automatically map different foreign table settings to different shards.
SQL and PostgreSQL have a wide range of features, and we don't currently support all of them. We are compiling a list of available features, but in the meantime let us know if there are any features you are interested in.
We do automatically create shards with storage-type 'f', when you issue master_create_distributed_table.

Related

Auto-increment a primary key in MySql

When creating tables with MySQL in phpMyAdmin, I always run into an issue with primary keys and their auto-increments. When I insert rows into my table, the auto_increment works perfectly, adding a value of 1 to the primary key of each new row. But when I delete a row, for example one where the primary key is 'id = 4', and then add a new row to the table, the primary key of the new row gets a value of 'id = 5' instead of 'id = 4'. It acts like the old row was never deleted.
Here is an example of the SQL statement:
CREATE TABLE employe(
    id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(30) NOT NULL
)
ENGINE = INNODB;
How do I find a solution to this problem?
Thank you.
I'm pretty sure this is by design. If you had IDs up to 6 in your table and you deleted ID 2, would you want the next input to be an ID of 2? That doesn't seem to follow the ACID properties. Also, if there was a dependence on that data, for example, if it was user data, and the ID determined user IDs, it would invalidate pre-existing information, since if user X was deleted and the same ID was assigned to user Y, that could cause integrity issues in dependent systems.
Also, imagine a table with 50 billion rows. Should the table run an O(n) search for the smallest missing ID every time you're trying to insert a new record? I can see that getting out of hand really quickly.
Some links you might like to read:
Principles of Transaction-Oriented Database Recovery (1983)
How can we re-use the deleted id from any MySQL-DB table?
Why do you care?
Primary keys are internal row identifiers that are not supposed to be sexy or good looking. As long as they are able to identify each row uniquely, they serve their purpose.
Now, if you care about its value, then you probably want to expose the primary key value somewhere, and that's a big red flag. If you need an external, visible identifier, you can create a secondary column with any formatting sequence and values you want.
As a side note, the term AUTO_INCREMENT is a bit misleading. It doesn't really mean the values increase one by one all the time. It just means MySQL will try to produce sequential numbers as long as that is possible. In multi-threaded apps that's usually not possible, since batches of numbers are reserved per thread, so the actual row insertion sequence may end up not following the natural numbering. Row deletions have a similar effect, as do INSERTs that get rolled back.
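For instance, a minimal sketch of the rollback case against the table above (this reflects InnoDB behavior, which keeps the reserved number even after the rollback):
-- The rolled-back INSERT still consumes an AUTO_INCREMENT value,
-- leaving a gap even though no row was ever visible.
START TRANSACTION;
INSERT INTO employe (name) VALUES ('temporary');
ROLLBACK;
-- This row gets an id one higher than the rolled-back one would have had.
INSERT INTO employe (name) VALUES ('kept');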
Primary keys are meant to be used for joining tables together and indexing; they are not meant for human usage. Reordering primary key columns could orphan data and wreak havoc on your queries.
Tip: Add another column to your table and renumber that column at will if needed, showing that column to your user instead of the primary key (see the sketch below).
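A minimal sketch of that tip against the OP's table (the display_no column name is made up):
-- Keep the AUTO_INCREMENT key internal and maintain a separate,
-- user-facing number that can be renumbered freely after deletions.
ALTER TABLE employe ADD COLUMN display_no INT UNSIGNED NULL;

-- Renumber after deletions without ever touching the primary key.
SET @n := 0;
UPDATE employe SET display_no = (@n := @n + 1) ORDER BY id;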

Issues with a circular table reference MySQL

Not really a DBA, but was tasked with designing a couple new tables for a new feature in a web app. DB is MySQL, using NHibernate as ORM (though that's probably irrelevant to the question).
I'm going to be modelling various "scenarios" which represent different variations of several designs in the app. Aside from the first scenario and "unstarted" scenarios, each scenario will have a parent scenario it builds from. As a result, we'll end up with a sort of "no-loop / no-merge" tree structure as scenarios are branched from one another.
CREATE TABLE `scenarios` (
    `ScenarioID` INT NOT NULL AUTO_INCREMENT,
    `DesignID` INT DEFAULT NULL,
    `ScenarioTreeID` INT NOT NULL,
    `ParentScenarioID` INT DEFAULT NULL,
    `Title` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
    ...
In addition to the scenarios themselves, there's information that's best related to the entire "Tree" of scenarios (e.g. what structure are the scenarios related to, etc). I've tried to factor this data out into another table called scenariotree and reference it from scenarios via ScenarioTreeID. The issue I ran into was, from a querying perspective, that it'd be important to know what the "root scenario" is when I query the tree (I can't just go WHERE ParentScenarioID is NULL as that includes "unstarted" scenarios). So I tried to set up the table as such:
CREATE TABLE `scenariotree` (
    `ScenarioTreeID` INT NOT NULL AUTO_INCREMENT,
    `StructureID` INT NOT NULL,
    `RootScenario` INT DEFAULT NULL,
    ...
But then I couldn't create either table due to the circular foreign key references. I realise I can create the tables first & then add the foreign keys in (or just turn FK checks off & then on again when I'm finished), but should I be doing this? Poking around online I'm finding conflicting opinions. Basically what I want to ask is:
"Is this acceptable schema design, or am I going to run into issues down the road? If so, what issues am I likely to have & how might I restructure these tables to avoid them?"
It's fine to have circular references. They are less common than acyclic designs, but they are legitimate for modelling some data structures.
They do require some special handling, as you discovered. That's okay and it's necessary.
You already identified two ways of handling them:
SET FOREIGN_KEY_CHECKS=0; temporarily while you insert the mutually-dependent data. One problem with this is that some people forget to re-enable the checks, and weeks later discover that their data is full of references pointing to non-existent data.
Create the tables first, then use ALTER TABLE to add the foreign keys after you populate the data. The problem here is that if you need to add new rows to existing tables, you'd have to drop the foreign keys and re-add them every time, and this affects all clients, not just your session.
A couple of other options:
Make one or the other foreign key nullable. When you need to insert mutually-dependent rows in the two tables, insert the row with the nullable FK first, using NULL. Then insert into the other table. Then UPDATE the first row to assign the non-NULL value it should reference (see the sketch after this list).
Finally, don't use FOREIGN KEY constraints. You will have columns that reference other columns, but it's sort of on the "honor system" instead of having an RDBMS-enforced constraint. This comes with its own risks, of course, because data that is supposed to be a foreign key has no assurance that it is correct. But it gives you total freedom to insert in whatever order you need to. You can use a transaction to make sure inserts to both tables happen together.
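As a sketch of the nullable-FK option mentioned above, applied to the two tables in the question (the literal values are made up):
-- RootScenario is nullable, so the tree row can be inserted first.
START TRANSACTION;

INSERT INTO scenariotree (StructureID, RootScenario) VALUES (42, NULL);
SET @tree_id = LAST_INSERT_ID();

-- Insert the root scenario, referencing the tree.
INSERT INTO scenarios (ScenarioTreeID, ParentScenarioID, Title)
VALUES (@tree_id, NULL, 'Initial scenario');
SET @root_id = LAST_INSERT_ID();

-- Close the loop: point the tree at its root scenario.
UPDATE scenariotree SET RootScenario = @root_id WHERE ScenarioTreeID = @tree_id;

COMMIT;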

How can I best maintain integrity between two columns in a table?

Hypothetically, I have an ENUM column named Category, and an ENUM column named Subcategory. I will sometimes want to SELECT on Category alone, which is why they are split out.
CREATE TABLE `Bonza` (
    `EventId` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `Category` ENUM("a", "b", "c") NOT NULL,
    `Subcategory` ENUM("x", "y", "z") NOT NULL,
    PRIMARY KEY (`EventId`)
) ENGINE=InnoDB;
But not all subcategories are valid for all categories (say, "z" is only valid with "a" and "b"), and it irks me that this constraint isn't baked into the design of the table. If MySQL had some sort of "pair" type (where a column of that type were indexable on a leading subsequence of the value) then this wouldn't be such an issue.
I'm stuck with writing long conditionals in a trigger if I want to maintain integrity between category and subcategory. Or am I better off just leaving it? What would you do?
I suppose the most relationally-oriented approach would be storing an EventCategoryId instead, and mapping it to a table containing all valid event type pairs, and joining on that table every time I want to look up the meaning of an event category.
CREATE TABLE `Bonza` (
    `EventId` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `EventCategoryId` INT UNSIGNED NOT NULL,
    PRIMARY KEY (`EventId`),
    FOREIGN KEY (`EventCategoryId`) REFERENCES `EventCategories` (`EventCategoryId`)
        ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB;
CREATE TABLE `EventCategories` (
    `EventCategoryId` INT UNSIGNED NOT NULL,
    `Category` ENUM("a", "b", "c") NOT NULL,
    `Subcategory` ENUM("x", "y", "z") NOT NULL,
    PRIMARY KEY (`EventCategoryId`)
) ENGINE=InnoDB;
-- Now populate this table with valid category/subcategory pairs at installation
Can I do anything simpler? This lookup will potentially cost me complexity and performance in calling code, for INSERTs into Bonza, no?
Assuming that your categories and subcategories don't change that often, and assuming that you're willing to live with a big update when they do, you can do the following:
Use an EventCategories table to control the hierarchical constraint between categories and subcategories. The primary key for that table should be a compound key containing both Category and Subcategory. Reference this table in your Bonza table. The foreign key in Bonza happens to contain both of the columns that you want to filter by, so you don't need to join to get what you're after. It will also be impossible to assign an invalid combination.
CREATE TABLE `Bonza` (
    `EventId` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `Category` CHAR(1) NOT NULL,
    `Subcategory` CHAR(1) NOT NULL,
    PRIMARY KEY (`EventId`),
    FOREIGN KEY (`Category`, `Subcategory`)
        REFERENCES `EventCategories` (`Category`, `Subcategory`)
        ON DELETE RESTRICT ON UPDATE CASCADE
) ENGINE=InnoDB;
CREATE TABLE `EventCategories` (
    `EventCategoryId` INT UNSIGNED NOT NULL,
    `Category` CHAR(1) NOT NULL,
    `Subcategory` CHAR(1) NOT NULL,
    PRIMARY KEY (`Category`, `Subcategory`)
) ENGINE=InnoDB;
My thought is: "best" is almost always opinion-based, but there are still some common things that can be said.
Using relational structure
Once not all pairs are valid, you have to store that information: either which pairs are invalid or which pairs are valid. Your sample with an additional table is completely valid in terms of relational DBMS design. In fact, when facing such an issue, it is nearly the only way to resolve it at the database-design level. With it:
You're storing the valid pairs. As I said, you have to store this information somewhere, and here we are, creating a new table.
You're maintaining referential integrity via a FOREIGN KEY, so your data will always be correct and point to a valid pair.
What bad things may happen, and how could this impact performance?
To reconstruct a full row, you'll need a simple JOIN:
SELECT
    Bonza.EventId,
    EventCategories.Category,
    EventCategories.Subcategory
FROM
    Bonza
    LEFT JOIN EventCategories
        ON Bonza.EventCategoryId = EventCategories.EventCategoryId
Performance of this JOIN will be good: it goes through the FK and thus, by definition, you'll get an index scan. It comes down to index quality (i.e. its cardinality), but in general it will be fast.
How complex is one JOIN? It's a simple operation, but it may add some overhead to complex queries. In my opinion, however, that's OK: there's nothing complex about it.
You are able to change the pairs by simply changing the EventCategories data. That is, you can easily remove restrictions on prohibited pairs, and this will affect nothing; I see that as a great benefit of this structure. Adding a new restriction, however, isn't so simple, because, yes, it requires a DELETE operation. You've chosen the ON DELETE RESTRICT action for your FK, which means you'll have to handle all conflicting records before adding the new restriction. This depends, of course, on your application's logic, but think of it another way: if you add a new restriction, shouldn't all conflicting records then be removed (because the logic says: yes, they should)? If so, change your FK to ON DELETE CASCADE, as sketched below.
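A sketch of that change (the constraint name here is illustrative; use the one SHOW CREATE TABLE Bonza actually reports):
-- Removing a pair from EventCategories will now also remove the
-- conflicting Bonza rows instead of being rejected.
ALTER TABLE Bonza
    DROP FOREIGN KEY bonza_ibfk_1,
    ADD FOREIGN KEY (EventCategoryId)
        REFERENCES EventCategories (EventCategoryId)
        ON DELETE CASCADE ON UPDATE CASCADE;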
So: the additional table is a simple, flexible, and actually easy way to resolve your issue.
Storing in one table
You mentioned that you could use a trigger for your issue, and that is indeed applicable, so I'll show that this approach has its weaknesses (along with some benefits). Let's say we create the trigger:
DELIMITER //
CREATE TRIGGER catCheck BEFORE INSERT ON Bonza
FOR EACH ROW
BEGIN
    IF NEW.Subcategory = 'z' AND NEW.Category = 'c' THEN
        SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Invalid category pair';
    END IF;
END;//
DELIMITER ;
Obviously, we still have to store the information about how to validate our pairs, but in this case we store the invalid combinations. When invalid data comes in, we catch it inside the trigger and abort the insert, returning a proper user-defined errno (45000) together with some explanatory text. Now, what about complexity and performance?
This way allows you to store your data as it is, in one table. This is a benefit: you get rid of the JOIN, since integrity is maintained by another tool. You may forget about storing pairs and handling them, hiding this logic in the trigger.
So you'll win on SELECT statements: your data always contains valid pairs, and no JOIN is needed.
But, yes, you'll lose on INSERT/UPDATE statements: they will invoke the trigger and, within it, some checking condition. The condition may be complex (many IF parts), and MySQL will check the parts one by one. Making it one single condition wouldn't help a lot, because in the worst case MySQL will still check it to its end.
Scalability of this method is poor. Every time you need to add or remove a pair restriction, you'll have to redefine the trigger. Even worse, unlike the JOIN case, you won't be able to use any cascading actions; instead, you'll have to handle everything manually.
What to choose?
For the common case, if you don't know for certain what your application's conditions will be, I recommend the JOIN option. It's simple, readable, and scalable. It fits relational DB principles.
For some special cases, you may want to choose the second option. Those conditions would be:
Allowed pairs will never be changed (or will be changed very rarely).
SELECT statements will be done much, much more often than INSERT/UPDATE statements, and SELECT performance will be the highest priority for your application.
I liked this problem, but with this information I would define the set of valid pairs in just one ENUM column:
CategorySubcategory ENUM("ax", "ay", "az", "bx", "by", "bz", "cx", "cy")
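You can still select on the category alone, since it is the leading character of the combined value; a quick sketch:
-- ENUM values compare as strings here, so a prefix match returns
-- every row in category 'a'.
SELECT EventId
FROM Bonza
WHERE CategorySubcategory LIKE 'a%';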
I think this will only be useful with a limited set of values; when the set gets bigger, I would personally choose your second option rather than the trigger-based one.
The first reason is purely an opinion: I don't like triggers too much, and they don't like me.
The second reason is that a well-indexed and properly sized reference from one table to another performs really well.

Database design for chat room. Need to save every chat

Scenario:
Designing a chat room where various users can chat at the same time. All the chats need to be saved. Whenever a user logs in, he should be able to see all the previous chats.
Here is one example of the table that can be used for storing the chats:
CREATE TABLE chat
(
    chat_id int NOT NULL auto_increment,
    posted_on datetime NOT NULL,
    userid int NOT NULL,
    message text NOT NULL,
    PRIMARY KEY (chat_id),
    FOREIGN KEY (userid) REFERENCES users(userid) ON UPDATE CASCADE ON DELETE CASCADE
);
For retrieving chats in proper order, I need some primary key in the table in which I am storing the chats. So if I use the above table, I cannot store more than 2147483647 chats. Obviously I can use some datatype with a huge range, like unsigned bigint, but it will still have some limit.
But as the scenario says the chats to be saved can be infinite, what kind of table should I make? Should I use some other primary key?
Please help me sort out a solution. I wonder how Google or Facebook manage to save every chat.
If you weren't using MySQL, a primary key of the user id and a timestamp would probably work fine. But MySQL's timestamp only resolves to one second. (See below for recent changes that affect this answer.) There are a few ways to get around that.
Let application code handle a primary key violation by waiting a second, then resubmitting.
Let application code provide a higher-precision timestamp, and store it as a sortable CHAR(n), like '2011-01-01 03:45:46.987'.
Switch to a DBMS that supports microsecond timestamps.
All that application code needs to be server-side code if you intend to write a query that presents rows ordered by timestamp.
Later
The current version of MySQL supports fractional seconds in timestamps.
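So on a recent MySQL (fractional-second support arrived in the 5.6 series), a sketch of the chat table using microsecond precision might look like this (the index name is made up):
-- (userid, posted_on) is now effectively unique and sortable.
CREATE TABLE chat
(
    chat_id bigint UNSIGNED NOT NULL auto_increment,
    posted_on datetime(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
    userid int NOT NULL,
    message text NOT NULL,
    PRIMARY KEY (chat_id),
    KEY chat_user_time (userid, posted_on)
);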

How to restrict a column value in SQLite / MySQL

I would like to restrict a column value in a SQL table. For example, the column values can only be "car", "bike", or "van". My question is how do you achieve this in SQL, and is it a good idea to do this on the DB side, or should I let the application restrict the input?
I also have the intention to add or remove more values in the future, for example, "truck".
The type of Databases I am using are SQLite and MySQL.
Add a new table containing these means of transport, and make your column a foreign key to that table. New means of transport can be added to the table in future, and your column definition remains the same.
With this construction, I would definitely choose to regulate this at the DB level rather than at the application level.
For MySQL, you can use the ENUM data type.
column_name ENUM('small', 'medium', 'large')
See MySQL Reference: The ENUM Type
To add to this, I find it's always better to restrict on the DB side AND on the app side. An Enum plus a Select box and you're covered.
Yes, it is recommended to add check constraints. Check constraints are used to ensure the validity of data in a database and to provide data integrity. If they are used at the database level, applications that use the database will not be able to add invalid data or modify valid data so the data becomes invalid, even if the application itself accepts invalid data.
In SQLite:
create table MyTable
(
    name text check (name in ('car', 'bike', 'van'))
);
In MySQL:
create table MyTable
(
    name ENUM('car', 'bike', 'van')
);
You would use a check constraint. In SQL Server it works like this
ALTER TABLE Vehicles
ADD CONSTRAINT chkVehicleType CHECK (VehicleType in ('car','bike','van'));
I'm not sure if this is ANSI standard, but I'm certain that MySQL has a similar construct (sketched below).
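For the record, the MySQL flavor looks nearly identical, with one caveat worth knowing:
-- MySQL 8.0.16+ enforces CHECK constraints; earlier versions parse
-- them but silently ignore them.
ALTER TABLE Vehicles
ADD CONSTRAINT chkVehicleType CHECK (VehicleType IN ('car','bike','van'));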
If you want to go with DB-side validation, you can use triggers. See this for SQLite, and this detailed how-to for MySQL.
So the question is really whether you should use database validation or not. If you have multiple clients -- whether they are different programs, or multiple users (possibly with different versions of the program) -- then going the database route is definitely best. The database is (hopefully) centralized, so you can decouple some of the details of validation. In your particular case, you can verify that the value being inserted into the column is contained in a separate table that simply lists valid values.
On the other hand, if you have little experience with databases, plan to target several different databases, and don't have the time to develop expertise, perhaps simple application level validation is the most expedient choice.
To add some beginner-level context to the excellent answer of #NGLN above:
First, one needs to check that foreign key enforcement is active, otherwise sqlite won't limit the input to the column to the values in the reference table:
PRAGMA foreign_keys;
...which gives a response of 0 or 1, indicating off or on.
To set the foreign key constraint:
PRAGMA foreign_keys = ON;
This needs to be set for each database connection to ensure that sqlite3 enforces the constraint.
I found it simplest to just set the primary key of the reference table to be the type. In the OP's example:
CREATE TABLE IF NOT EXISTS vehicle_types(
    vehicle_type text PRIMARY KEY);
Then, one can insert 'car', 'bike' etc into the vehicle_types table (and more in the future) and reference that table in the foreign key constraint in the child table (the table in which the OP wished to reference the type of vehicle):
CREATE TABLE IF NOT EXISTS ops_original_table(
    col_id integer PRIMARY KEY,
    ...many other columns...
    vehicle_type text NOT NULL,
    FOREIGN KEY (vehicle_type) REFERENCES vehicle_types(vehicle_type));
Outwith the scope of the OP's question, but also note that when setting up a foreign key constraint, thought should be given to what happens to the column in the child table (ops_original_table) if a parent table value (vehicle_types) is deleted or updated (a short sketch follows). See this page for info.
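A short sketch of what that could look like (the chosen actions are just one sensible combination, not a recommendation):
-- ON UPDATE CASCADE propagates renames of a vehicle type to children;
-- ON DELETE RESTRICT blocks deleting a type that is still referenced.
CREATE TABLE IF NOT EXISTS ops_original_table(
    col_id integer PRIMARY KEY,
    vehicle_type text NOT NULL,
    FOREIGN KEY (vehicle_type) REFERENCES vehicle_types(vehicle_type)
        ON UPDATE CASCADE
        ON DELETE RESTRICT);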