Alternatives to ALTER TABLE in MySQL

We use MySQL (AWS Aurora) to store data for our online payment transactions. One of our tables, in which each row stores the information for a particular transaction, has more than 1 billion rows.
How can I go about adding a new attribute for a transaction? Altering this table is not feasible because of the large amount of time required to do so.
The only possible solution seems to be creating a new table that stores key-value pairs for each transaction. Are there other, more efficient ways to do this, assuming altering the table structure is not possible?

An alternative is to create a parallel table. It would have the same PRIMARY KEY as your current table (but without AUTO_INCREMENT). And it would have the 'new' column(s).
Then you would JOIN on the PK to fetch both old and new columns at the same time.
Pros: No downtime, no big ALTER, etc.
Cons: Now the table is split in two, and any subsequent column additions face the same dilemma.
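A minimal sketch of the parallel-table approach, assuming the original table is called transactions with a BIGINT primary key named id (all names here are illustrative):

-- Parallel table holding only the new column(s); same PK, no AUTO_INCREMENT.
CREATE TABLE transactions_extra (
    transaction_id BIGINT UNSIGNED NOT NULL,
    new_flag TINYINT(1) NOT NULL DEFAULT 0,
    PRIMARY KEY (transaction_id)
) ENGINE=InnoDB;

-- LEFT JOIN so rows without an entry in the new table still come back
-- (the 'new' columns are simply NULL for 'old' rows).
SELECT t.*, x.new_flag
FROM transactions AS t
LEFT JOIN transactions_extra AS x ON x.transaction_id = t.id
WHERE t.id = 12345;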
Alternative to the alternative: Put a JSON column in that new table.
Pros: Very open-ended wrt adding more columns.
Cons: Can't index it very well. (This depends on what version you are using.)
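A sketch of the JSON variant: the same parallel table, but with a single open-ended JSON column instead of fixed columns (JSON_EXTRACT requires MySQL 5.7+; MariaDB has equivalent functions; names are again illustrative):

-- One JSON document of extra attributes per transaction.
CREATE TABLE transactions_extra (
    transaction_id BIGINT UNSIGNED NOT NULL,
    attrs JSON NOT NULL,
    PRIMARY KEY (transaction_id)
) ENGINE=InnoDB;

-- Pull a single attribute out of the document.
SELECT t.id,
       JSON_UNQUOTE(JSON_EXTRACT(x.attrs, '$.new_flag')) AS new_flag
FROM transactions AS t
LEFT JOIN transactions_extra AS x ON x.transaction_id = t.id;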

At my work, we have quite a few tables with over 1 billion rows. Developers add or remove columns, change data types, add or remove indexes, etc. Any kind of ALTER TABLE.
The way we do this is to use pt-online-schema-change, a free tool available from Percona. It allows you to do long-running schema changes, and you can still read and write the table while it's doing the change in the background.
It still takes a long time to do a change to a large table. In the largest cases, it takes weeks. But it doesn't block your work in the meantime.
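For illustration, an invocation might look like the following (host, database, table, and column names are made up; check the Percona documentation for your environment, and run with --dry-run before --execute):

pt-online-schema-change \
  --alter "ADD COLUMN new_flag TINYINT(1) NOT NULL DEFAULT 0" \
  --execute \
  h=localhost,D=payments,t=transactions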

Related

Does it make sense to split a large table into smaller ones to reduce the number of rows (not columns)? [duplicate]

In a Rails app, I have a table that already has hundreds of millions of records. I'm going to split the table into multiple tables, hoping this can speed up reads and writes.
I found the octopus gem, but it is for master/slave replication; I just want to split the big table.
Or what else can I do when the table is too big?
Theoretically, a properly designed table with just the right indexes can handle very large row counts quite easily. As the table grows, the slowdown in queries and insertion of new records is supposed to be negligible. But in practice we find that it doesn't always work that way! However, the solution definitely isn't to split the table in two. The solution is to partition.
Partitioning takes this notion a step further, by enabling you to distribute portions of individual tables across a file system according to rules which you can set largely as needed. In effect, different portions of a table are stored as separate tables in different locations. The user-selected rule by which the division of data is accomplished is known as a partitioning function, which in MySQL can be the modulus, simple matching against a set of ranges or value lists, an internal hashing function, or a linear hashing function.
If you merely split a table, your code is going to become infinitely more complicated: each time you do an insert or a retrieval, you need to figure out which split you should run that query on. When you use partitions, MySQL takes care of that detail for you, and as far as the application is concerned it's still one table.
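As a rough illustration (table and column names invented), here is a log table range-partitioned by year. Note that MySQL requires the partitioning column to be part of every unique key, hence the composite primary key:

-- MySQL routes INSERTs to the right partition and prunes SELECTs automatically.
CREATE TABLE logs (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    created_at DATETIME NOT NULL,
    message VARCHAR(255),
    PRIMARY KEY (id, created_at)
) ENGINE=InnoDB
PARTITION BY RANGE (YEAR(created_at)) (
    PARTITION p2021 VALUES LESS THAN (2022),
    PARTITION p2022 VALUES LESS THAN (2023),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);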
Do you have an ID on each row? If the answer is yes, you could do something like:
CREATE TABLE table2 AS (SELECT * FROM table1 WHERE id >= (SELECT COUNT(*) FROM table1)/2);
The above statement creates a new table with roughly half of the records from table1 (note that this assumes ids are contiguous, starting from 1).
I don't know if you've already tried it, but an index should help query speed on a big table.
CREATE INDEX index_name ON table1 (id)
Note: if you created the table with a unique constraint or a primary key, there's already an index on those columns.

When to add a column vs adding a related table?

I have a big table with over 100 million rows. I have been trimming it down for months, getting rid of bad data (row-wise), trying to keep it small. I already have 9 columns on this table, and I want to add a new boolean column to it.
This table started off small, and now it's getting pretty wide. Yet again, I am tasked with adding more information per row. This time it's a new boolean field. I expect this field to be low volume, meaning less than 10% of rows will have it set to true. I know I can make it default NULL, and a boolean column should be small.
However, I wanted to get some advice. This table cannot get infinitely wide, and I will need to work around this. Under these circumstances, does it make more sense to create another table with a foreign key referencing the record when I have additional data to add? How do the pros handle this in database design?
The best situation for usability is to have all the data on the record, so any query can read or calculate from the table itself without joins. I just do not have confidence that it will scale to 1 BILLION rows (insert meme).
At my job I support MySQL instances that have multi-billion row tables. At that scale, care must be taken to optimize queries properly. You don't want to do a table-scan at that scale.
But that's about rows, not columns. You asked first about columns.
The way to choose between adding a column and adding another table is to follow the rules of database normalization. If the new column is an attribute of the same entity as your current table, add the column to that table. If it's a multi-valued attribute, or if it's really an attribute of some other entity, then add it to a different table.
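For example (illustrative names), a single-valued attribute stays on the entity's own table, while a multi-valued attribute gets a child table keyed back to it:

-- Single-valued attribute of the same entity: just add the column.
ALTER TABLE customers ADD COLUMN birth_date DATE NULL;

-- Multi-valued attribute: one row per value in a separate table.
CREATE TABLE customer_phones (
    customer_id BIGINT UNSIGNED NOT NULL,
    phone VARCHAR(32) NOT NULL,
    PRIMARY KEY (customer_id, phone),
    FOREIGN KEY (customer_id) REFERENCES customers (id)
);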
Very, very rarely is it the right choice to make another table solely for the sake of having too many columns. A given MySQL table can have dozens of columns pretty easily, and hundreds if you're careful.
In theory, there is no limit to the number of columns that might be appropriate to put in the same table with respect to normalization. But there are limitations due to the code to store those columns in a given implementation (e.g. InnoDB storage engine in MySQL).
See https://www.percona.com/blog/2013/04/08/understanding-the-maximum-number-of-columns-in-a-mysql-table/
So the maximum number of columns for a table in MySQL is somewhere between 191 and 2829, depending on a number of factors.
In the comments on that blog, I was able to design a table that failed to be created at 59 columns. Read the blog for details.

Alternatives to MySQL for large reference tables

We currently use mysql for two types of tables:
The first set are the typical transaction-based tables.
The second are tables that store historical data, which is usually written once and read many times. They are large, hundreds of millions of rows or more, and have a couple of indexes.
We have a couple of issues with these tables.
Any schema changes take forever
We’re not comfortable with the whole table being a single point of failure. If anything goes wrong, rebuilding this table would take ages.
It doesn't seem scalable
Are there any features of mysql we are missing that would alleviate these issues? I saw that MariaDB now has a way to add columns that doesn’t lock the whole table, but it doesn’t solve the other issues.
We’re also open to other products that might solve the issue. Any ideas?
Why would you ever need to add columns to historical data? Anyway, what values would you assign to the 'old' rows?
An alternative to adding a column is to create a "parallel" table (aka "vertical partitioning"). The new table would have the same PRIMARY KEY as the original (except for any AUTO_INCREMENT declaration). You would use LEFT JOIN to fetch columns from both tables, and understand that 'old' rows would give you NULLs for the 'new' columns.
Another useful thing to do for historical data is to treat it like a Fact table in data warehousing: build and maintain "summary table(s)" to significantly speed up common report-type queries.
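A rough sketch of such a summary table, with invented names and a daily incremental refresh (adjust the grouping to your actual reports):

-- Daily totals per account, kept alongside the big fact table.
CREATE TABLE daily_txn_summary (
    account_id BIGINT UNSIGNED NOT NULL,
    txn_date DATE NOT NULL,
    txn_count INT UNSIGNED NOT NULL,
    txn_total DECIMAL(15,2) NOT NULL,
    PRIMARY KEY (account_id, txn_date)
);

-- Refresh incrementally, e.g. once per day for yesterday's rows.
INSERT INTO daily_txn_summary (account_id, txn_date, txn_count, txn_total)
SELECT account_id, DATE(created_at), COUNT(*), SUM(amount)
FROM transactions
WHERE created_at >= CURRENT_DATE - INTERVAL 1 DAY
  AND created_at <  CURRENT_DATE
GROUP BY account_id, DATE(created_at)
ON DUPLICATE KEY UPDATE
    txn_count = VALUES(txn_count),
    txn_total = VALUES(txn_total);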
In newer versions of MySQL/MariaDB, ALTER TABLE ... ADD COLUMN ... ALGORITHM=INPLACE removes most of the performance pain.
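For example (illustrative table and column), requesting the online algorithm explicitly, so the statement fails fast instead of silently copying the table if the change can't be done in place:

ALTER TABLE history
    ADD COLUMN source VARCHAR(32) NULL,
    ALGORITHM=INPLACE, LOCK=NONE;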
Adding columns is also solved by moving toward an EAV schema, which has a lot of bad qualities. So, move only part-way toward it: keep the 5-10 main columns that you use for filtering and sorting as real columns, then put the rest of the key-value junk into a JSON column. Both MySQL and MariaDB have such a type (though with some differences), and MariaDB additionally has "Dynamic Columns".
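A sketch of that hybrid layout (names invented): real columns for anything you filter or sort on, one JSON column for the long tail of attributes:

CREATE TABLE events (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id BIGINT UNSIGNED NOT NULL,
    event_type VARCHAR(32) NOT NULL,
    created_at DATETIME NOT NULL,
    extra JSON NULL,  -- open-ended key-value attributes live here
    PRIMARY KEY (id),
    KEY idx_user_time (user_id, created_at)
);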
"but it doesn’t solve the other issues" -- such as??

Multiple tables or one single table?

I already saw a few forums with this question, but they do not answer one thing I want to know. I'll explain my topic first:
I have a system where every action of multiple users is logged to the database (e.g. User1 logged in, User2 logged in, User1 entered User management, User2 changed password, etc.). So I would be expecting 100 to 200 entries per user per day. Right now I'm doing it in a single table, and to view it I just filter by UserID.
My question is, which is more efficient? Should I use one single table or create a table per user?
I am worried that if I use a single table, the system might have some difficulty filtering thousands of entries. I've read some pros and cons using multiple tables and a single table especially concerning updating the table(s).
I also want to know which one saves more space: multiple tables or a single table?
As long as you use indexes on the fields you're selecting from, you shouldn't have any speed problems (although indexes slow writes, so too many are a bad thing). A table with a few thousand entries is nothing to MySQL (or any other database engine).
The overhead of creating thousands of tables is much worse -- say you want to make a change to the fields in your user table -- now you'd have to change thousands of tables.
A table we regularly search against for a single record at work has about 150,000 rows, and because the field we search on is indexed, the search time is a very small fraction of a second.
If you're selecting those records without using the primary key, create an index on the field you use to select like this:
CREATE INDEX my_column_name ON my_table(my_column_name);
That's the most basic form. To learn more about it, check the MySQL documentation on CREATE INDEX.
I would go with a single table. With an index on userId, you should be able to scale easily to millions of rows with little issue.
A table per user might be more efficient, but it's generally poor design. The problem with a table per user is it makes it difficult to answer other kinds of questions like "who was in user management yesterday?" or "how many people have changed their passwords?"
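For instance, with a single log table (column names assumed for illustration), those cross-user questions stay simple:

-- How many times has each user changed their password?
SELECT user_id, COUNT(*) AS password_changes
FROM user_logs
WHERE action = 'changed password'
GROUP BY user_id;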
As for storage space used - I would say a table per user would probably use a little more space, but the difference between the two options should be quite small.
I would go with just one table. I certainly wouldn't want to create a new table every time a user is added to the system. The number of entries you mention for each day is really not that much data.
Also, create an index on the user column of your table to improve query times.
Definitely a single table. Having tables created dynamically for entities that are created by the application does not scale. Also, you would need to build your queries with variable table names, which makes things difficult to debug and maintain.
If you have an index on the user id you use for filtering, it's not a big deal for a database to work through millions of rows.
Any database worth its salt will handle a single table containing all that user information without breaking a sweat. A single table is definitely the right way to do it.
If you used multiple tables, you'd need to create a new table every time a new user registered. You'd need to create a new statement object for each user you queried. It would be a complete mess.
I would go for the single table as well. You might want to go for multiple tables when you want to serve multiple customers with different sets of users (multi-tenancy).
Otherwise if you go for multiple tables, take a look at this refactoring tool: http://www.liquibase.org/. You can do schema modifications on the fly.
I guess if you are using proper indexing, the single-table solution can perform well enough (and the maintenance will be much simpler).
A single table also keeps PHP prepared statements built from $_POST and $_GET input simple. For small to medium platforms, a single table will be fine; in summary, fewer tables are preferable to many.
Multiple tables will not cause that much havoc either, but a single table is the best option.

Should one steer clear of adding yet another field to a larger MySQL table?

I have a MySQL InnoDB table with 350,000+ rows, containing a couple of things like id, otherId, shortTitle and so on. Now I'm in need of a Bool/Bit field for perhaps a couple of hundred or thousand of those rows. Should I just add that bool field to the table, or should I create a new table referencing the IDs of the old table, thereby not risking performance issues in all the existing functions that access the first table?
(Side info: I never use "SELECT * ...". The main table gets lots of reads and rarely writes.)
Adding a field can indeed hamper performance a little, since your table rows grow larger, but for a BIT field it's hardly a problem.
Most probably, you will end up with exactly the same row count per page, which means no performance decrease at all.
On the other hand, using an extra JOIN to access the row value in another table will be much slower.
I'd add the column right into the table.
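For instance (invented table and column names), the change is a one-liner:

-- BIT(1) or TINYINT(1) both work; a NULL default keeps old rows cheap.
ALTER TABLE articles ADD COLUMN is_featured TINYINT(1) NULL DEFAULT NULL;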
What does the new column denote?
From the data modelling perspective, if the column belongs with the data under whichever normal form is in use, then put it with the data; performance impact be damned. If the column doesn't directly belong to the table, then put it in a second table with a foreign key.
Realistically, the performance impact of adding a new column to a table with ~350,000 rows isn't going to be particularly huge. Have you tried issuing the ALTER TABLE statement against a copy, perhaps on a local workstation?
I don't know why people insist on calling 350K-row tables big. In the mainframe world, that's how big the DBMS configuration tables are :-).
That said, you should be designing your tables in third normal form (3NF). If, and only if, you have performance problems should you consider denormalizing.
If you have a column that will apply only to certain of the rows, it's (probably) not going to be 3NF to put it in the same table. You should have a separate table with a foreign key into your 'primary' table.
Keep in mind that's if the boolean field actually doesn't apply to some of the rows. That's a different situation to the field applying to all rows but not being known for some. In that case, a nullable column in the primary table would be better. But that doesn't sound like what you're describing.
Requiring a bit field for only the newer entries sounds like you want to implement inheritance. If that is the case, I would add it to a new table to keep things readable. Otherwise, it doesn't matter whether you add it to the main table or not, unless your queries are not using indexes, in which case I would fix that first before making any other decisions regarding performance.