How do I modify a MySQL table into which hundreds of records are being inserted every second, without any downtime, lost data, or errors?
Ex: Adding a new field
Thanks
Gut feel says that you should avoid modifying the table. An alternative would be to add your column to a new table and link to the original table to maintain referential integrity, so the original table remains untouched.
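As a hedged sketch of that side-table idea (all names here are made up, and it assumes the original table has an integer primary key id):

-- The new attribute lives beside the original table; the original is never ALTERed.
CREATE TABLE the_table_extra (
  the_table_id INT PRIMARY KEY,
  new_field    VARCHAR(60),
  FOREIGN KEY (the_table_id) REFERENCES the_table(id)
);

-- Read it back with a LEFT JOIN so rows without the new attribute still appear.
SELECT t.*, e.new_field
FROM the_table t
LEFT JOIN the_table_extra e ON e.the_table_id = t.id;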
Another, fairly typical approach would be to create a new table with the added column, swap it with the old table, and then add the data back into the new table. Not a great solution, though.
MySQL gurus will likely disagree with me.
Well, if there are only inserts, I think the best approach would be:
CREATE TABLE the_table_copy LIKE the_table;
ALTER TABLE the_table_copy ADD new_field VARCHAR(60);
RENAME TABLE the_table TO the_table_backup;
RENAME TABLE the_table_copy TO the_table;
But first make a copy of the table and try it on the copy to see whether it is fast enough. Only then do it on the real, live table! :)
Well, I would lock the table, add the field, unlock the table. There would then be a little delay on the inserts, but it wouldn't be much of a hassle I guess.
When you say "add a field", are you creating new columns, or just updating values in a column? Those are quite different things. If you are merely updating a column, change your data model to support multiple versions of the same row in the same table, and keep adding the new rows at the end of the table. This gives you the best read-write concurrency.
Other than that, you can take a look at the "Handler Socket" method.
http://yoshinorimatsunobu.blogspot.com/search/label/handlersocket
If you only need to add a column, there is no reason for that to require downtime. The table will be locked for the duration of the ALTER TABLE statement (milliseconds). Any DML submitted during this time will have to wait, so it could briefly impact performance, but it should not cause any exceptions, unless the code somewhere is explicitly checking for locks.
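For illustration, a minimal sketch of such a statement, with made-up table and column names:

-- Adding a nullable column; on recent MySQL/InnoDB versions this runs as an
-- online (or even instant) operation, so concurrent INSERTs only wait on the
-- brief metadata lock mentioned above.
ALTER TABLE the_table ADD COLUMN new_field VARCHAR(60) NULL;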
First of all, I found many discussions on this question around the web, but I'm really skeptical about the solutions proposed, mainly because most of them date from over four years ago, when GDPR was still new, and also because they contradict each other a lot.
In order to be compliant with GDPR, we need to fully remove a user's personal information (email, name, address, etc.) when they request it.
In our system we have the table User and other tables like Training and Session, for example.
Training has a ManyToMany relationship with User, and Session has a user_id field that is a foreign key to the User table. Session also has a training_id field, which is a foreign key to Training.
I want to delete the user's personal info contained in User, but I want to be able to keep their data in Session for statistics and log purposes.
I understand a hard delete is not the way to go: the FK constraints would be problematic, and even though I could use NULL, it's not really recommended. The suggestion I see most often is to just add a flag to the User table to indicate whether the user is deleted (is_deleted) and use that appropriately in my business logic. IMO, that alone does not really solve the GDPR problem, as the user's info would still be in the database, which could get you in trouble in an audit, and I was surprised this kind of answer was upvoted in most threads.
What I'm thinking of doing is to add that is_deleted flag, but also update the user row with fake or empty data to get rid of all the user's personal info. So I'd have something like this:
ID | email | name      | surname | address | is_deleted
1  | ""    | Anonymous | ""      | ""      | true
Would this be sufficient? Do you guys see any flaw with my plan?
You're right: UPDATEs to rows containing personally identifiable information are the way to do this without messing up your database constraints. You may want to change one of the PII columns to say "Redacted at (date-time)" in case an auditor wants to see evidence that you comply with these requests.
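For illustration, a hedged sketch of such an anonymizing UPDATE, using the table and columns from the question (the literal redaction text and the id value are just placeholders):

-- Overwrite the PII in place and flag the row; the primary key, and any
-- foreign keys pointing at it, stay intact.
UPDATE User
SET email      = '',
    name       = CONCAT('Redacted at ', NOW()),
    surname    = '',
    address    = '',
    is_deleted = TRUE
WHERE id = 1;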
It takes a while to get your changes to filter through to all your backups. There's not much you can do about that operationally except making sure you don't retain old backups too far back in time.
If you use PII as a primary, unique, or constraint key, that's a problem. For example, if you use email address as a key to user information, you may have to replace the values in those columns with a randomly-generated text string to erase the data and preserve the uniqueness. You'll have to do that in a way that preserves constraints.
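For example, if email has a UNIQUE constraint, one possible (illustrative) replacement that preserves uniqueness is:

-- UUID() yields a practically unique string, so the UNIQUE index on email
-- still holds after the real address is erased.
UPDATE User
SET email = CONCAT('deleted-', UUID(), '@example.invalid')
WHERE id = 1;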
Don't forget that your web server logs very likely allow you to find users' IP addresses given their user IDs. Be sure you set up a log retention policy (ten days?) and enforce it by automatic deletion.
Make your policy say "it may take up to thirty days to completely delete..." to give your workflows time to affect your server logs and backups.
If a user asks you to remove all data, and then later returns to use your service again, handle them as a user you've never seen before. If you could figure out that you had seen them before, you obviously did not delete all their PII when requested.
I have a question (probably very basic) about how to tell whether a field has been stored with information in the database.
I'm working with Laravel and MySQL as DB engine.
The situation is as follows:
I have a field in the database that stores information (in this case it is a LONGTEXT field with a large amount of information in it). This field is populated automatically (by a cron job).
When listing the information related to the records of that table, I need to know if the field in question contains information or not.
At first I thought of including another field (column) in the same table that tells me whether the field is empty or not. Although I consider that a correct way to do it, I also think I could avoid this extra column by simply checking whether the field in question is empty. However, I'm not sure whether that would be the right way to do it, or whether it could affect the performance of the application (I don't know exactly how MySQL performs this comparison, or whether it could be optimised by making use of the new field).
I hope I have explained myself correctly.
Schematically, the options are:
Option 1:
Have a single field (very large amount of information).
When obtaining the list with the records, check in the corresponding search if the field in question contains information.
Option 2:
Have two fields: one of them contains the information and the other is a boolean that indicates if the first one contains information.
When obtaining the list of records, look at the boolean.
The aim of the question is to use good practices as well as to optimise both the search and minimise the impact on the performance of the process.
Thank you very much in advance.
It takes extra work for MySQL to retrieve the contents of a LONGTEXT or any other BLOB / CLOB column. You'll incur that work even if your query says:
SELECT id FROM tbl WHERE longtext IS NOT NULL /* slow! */
or
SELECT id FROM tbl WHERE CHAR_LENGTH(longtext) >= 1 /* slow! */
So, yes, you should also use another column to indicate whether the LONGTEXT column is populated if you need to run a lot of queries like that.
You could consider using a generated column like this for the purpose:
textlen BIGINT GENERATED ALWAYS AS (CHAR_LENGTH(longtext)) STORED
The generated column will get its value at the time you INSERT or UPDATE the row. Then WHERE textlen >= 1 will be fast. You can even put an index on it.
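A hedged sketch of adding that column to an existing table, reusing the placeholder names tbl and longtext from above (the index name is made up):

-- Add the stored generated column and index it, so WHERE textlen >= 1 can be
-- answered from the index without touching the LONGTEXT contents.
ALTER TABLE tbl
  ADD COLUMN textlen BIGINT GENERATED ALWAYS AS (CHAR_LENGTH(longtext)) STORED,
  ADD INDEX idx_textlen (textlen);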
Go for the length rather than the Boolean value. It doesn't take significantly more space, it gives you slightly more information, and it gives you a nice way to sanity-check your logic during testing.
Consider a trip itinerary. There are 20 possible stops on a tour. A standard tour involves stops 1 through 20 in order. However, each user may create their own tour consisting of 5 or more stops in any order with possibility for repeats. What is the most efficient way to model this in a database?
If we use a join table
user_id, stop_id, order
we would have millions of records very quickly but we could easily pull the stop & user attributes on queries.
If we stored the stops as an array,
user_id, stop_id_array_in_order
we have a much smaller, non-normalized table and we cannot easily access the stop attributes.
Are there other options that allow for accessing of parent attributes while minimizing table size?
I would define the entities and create tables for them with the relations between them in separate tables as you described in the first example:
users table
tours table
stops table
tours_users table (a User can go to a Tour more than once)
stops_order table: stop_id, order, tours_users_id
For querying: for any user whose tour you want to check, you can do it through the tours_users table; if the stops need to be retrieved, you can easily join tours_users with stops_order through tours_users_id.
If the tables are indexed correctly, there should be no problem with performance, and you will be using the relational database engine as you're supposed to.
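A minimal sketch of that schema, assuming InnoDB so the foreign keys are enforced (column types and any names not mentioned above are illustrative):

CREATE TABLE users (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE tours (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE stops (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100));

-- A user can take the same tour more than once, so each row here is one visit.
CREATE TABLE tours_users (
  id      INT AUTO_INCREMENT PRIMARY KEY,
  tour_id INT NOT NULL,
  user_id INT NOT NULL,
  FOREIGN KEY (tour_id) REFERENCES tours(id),
  FOREIGN KEY (user_id) REFERENCES users(id)
);

-- One row per stop of a visit; "order" is a reserved word in MySQL, so the
-- position column is called stop_order here.
CREATE TABLE stops_order (
  tours_users_id INT NOT NULL,
  stop_order     INT NOT NULL,
  stop_id        INT NOT NULL,
  PRIMARY KEY (tours_users_id, stop_order),
  FOREIGN KEY (tours_users_id) REFERENCES tours_users(id),
  FOREIGN KEY (stop_id) REFERENCES stops(id)
);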
You're thinking that saving some space will help you. It won't. It's also arguable how much space you'd actually save.
You'd also be using an unordered data structure, and that's something you don't want. You want an ordered structure (a table) that can relate to other records, and that's exactly why we normalize tables: so we can extract all kinds of data without altering the physical layout. The other benefit is that ordered structures can be indexed, so we can reduce the time needed to find records. The tradeoff is the space spent on the index records.
However, millions, billions, even trillions of rows are OK. Just imagine how difficult it would be to query a structure where an array is saved as a comma-separated list in a column (or multiple columns). It would be a nightmare to write queries, and performance would degrade linearly as the number of records grows.
TL;DR: keep it normalized.
I know how to do this, but I am not sure whether it is wise, so I'll ask: I have one table that stores any issues with software that we use at work. If an issue becomes resolved, should I move that row to a resolved-issues table, or should I just insert the issue's PK into a separate table and use an outer join whenever I query open issues? Just looking for the industry standard on this.
I think you should add a single column named status and update that column as appropriate, and use a trigger to maintain the table's history.
Moving rows around is almost always a bad idea. If you add additional information regarding resolved issues (e.g., who resolved it, when was it resolved, etc.), having an additional "resolutions" table with a foreign key to the "issues" table might be a good idea. Otherwise, I'd just add a boolean field is_resolved to the "issues" table and set it to true when the issue is resolved.
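If you do want that extra metadata, a hedged sketch of such a resolutions table (all names are illustrative) might be:

-- One row per resolution; the issues table itself is never moved or rewritten.
CREATE TABLE resolutions (
  issue_id    INT PRIMARY KEY,
  resolved_by VARCHAR(100),
  resolved_at DATETIME NOT NULL,
  FOREIGN KEY (issue_id) REFERENCES issues(id)
);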
Maybe add a boolean column "resolved": set it to true when the issue is resolved, and find all resolved rows with WHERE resolved = true.
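As a sketch of that flag approach, assuming the table is called issues (the name and id value are illustrative):

-- Rows stay where they are; they are just flagged.
ALTER TABLE issues ADD COLUMN resolved BOOLEAN NOT NULL DEFAULT FALSE;

-- Mark an issue as resolved.
UPDATE issues SET resolved = TRUE WHERE id = 42;

-- List only the open issues.
SELECT * FROM issues WHERE resolved = FALSE;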
I would like to know if it is possible to have the following database behaviour:
Create a USER table, with primary key USER_ID
Data comes in from external source: e.g. "USER_ID, TIMESTAMP, DATA"
For each user in the USER table, create a table to store only data entries pertinent to USER_ID, and store all incoming Data with the correct USER_ID into that table
When querying all the data entries for a specific USER_ID, just return all rows from that table.
I could of course do this all in one table "ALLDATALOG" and then search for all entries in ALLDATALOG that contain USER_ID, but my concern is that as the ALLDATALOG table grows, searches will take too long.
You should not split your tables like that. You will want an index on the USER_ID column in your data log table. Searches do become slower as data size increases, but your strategy will not necessarily mitigate that. It will however make your application more complex to write, harder to debug, and quite likely actually slow it down.
You should also consider unpacking that data blob into additional columns and tables as appropriate in order to take advantage of the relational nature of the database.
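A hedged sketch of that single-table layout with the index (everything except USER_ID and the USER table is an illustrative name):

-- One log table for everyone; the index keeps per-user lookups fast even as
-- the table grows into the millions of rows.
CREATE TABLE alldatalog (
  id      BIGINT AUTO_INCREMENT PRIMARY KEY,
  user_id INT NOT NULL,
  ts      TIMESTAMP NOT NULL,
  data    TEXT,
  INDEX idx_user (user_id),
  FOREIGN KEY (user_id) REFERENCES USER(USER_ID)
);

-- All entries for one user.
SELECT * FROM alldatalog WHERE user_id = 123 ORDER BY ts;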
How many rows do you expect the table to hold over time? Thousands? Millions? Billions? At what rate do you expect rows to be added?
Suggestion 1: As answered by Jonathan, do not split the tables. It won't help you increase the speed.
Suggestion 2: If you still want to split the tables, handle it in your PHP code: check whether the table for a particular user already exists. If it does, insert the values into it; if it doesn't, create it first. Should be quite straightforward. If you want me to share code for this, let me know.