What is the best practice for deleting data from a database table? - mysql

In my case I want to maintain a table to store some kind of data, and after some period remove it from the first table and store it in another table.
I want to clarify what the best practice is in this kind of scenario.
I am using a MySQL database in a Java-based application.

Generally, I follow this procedure. In case I want to delete a row, I have a tinyint column called deleted, and I mark that column as true for the row.
That indicates that the row has been marked as deleted, so I don't pick it up.
Later (maybe once a day), I run a script which in a single pass either deletes those rows entirely or migrates them to another table, etc.
This is useful because every time you delete a row (even if it's just 1 row), MySQL has to update its indexes. This might require significant system resources depending on your data size or number of indexes. You might not want to incur that overhead every time...
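A minimal sketch of that pattern in SQL, assuming a hypothetical table named items with an items_archive twin (the names are placeholders, not from the question):

-- one-time schema change: add a soft-delete flag
ALTER TABLE items ADD COLUMN deleted TINYINT(1) NOT NULL DEFAULT 0;

-- "deleting" a row only marks it
UPDATE items SET deleted = 1 WHERE id = 42;

-- normal reads skip marked rows
SELECT * FROM items WHERE deleted = 0;

-- daily batch job: copy marked rows to the archive table, then remove them
INSERT INTO items_archive SELECT * FROM items WHERE deleted = 1;
DELETE FROM items WHERE deleted = 1;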

You did not provide enough information, but I think if both tables have the same data structure then you can avoid using two tables. Just add another column to the first table and set a status/type for the records that would have gone to the second table.
For Example:
id | Name | BirthDate | type
------------------------------------
1 | ABC | 01-10-2001 | firsttable
2 | XYZ | 01-01-2000 | secondtable
You can pick records like this:
select * from tablename where type='firsttable'
OR
select * from tablename where type='secondtable'

If you are archiving old data, there should be a way to set up a scheduled job in MySQL (the Event Scheduler can do this). I know there is one in SQL Server, and it's the kind of feature most databases provide. Schedule the job to run in the low-usage hours. Have it select all records more than a year old (or however long you want to keep records active), move them to an archive table, and then delete them. Depending on the number of records you would be moving, it might be best to do this weekly or daily. You don't want the number of expiring records to be so large that it hurts performance or makes the job take too long.
In archiving, the critical piece is to make sure you keep all the records that will be needed frequently, and don't forget to consider reporting (many reports need a year's or two years' worth of data, so do not archive records those reports will need). Then you also need to set up a way for users to access the archived records on the rare occasions they may need to see them.
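A rough sketch of such a job using MySQL's Event Scheduler; main_table, archive_table and the created_at column are placeholders for whatever your schema actually uses:

-- the scheduler must be enabled once
SET GLOBAL event_scheduler = ON;

DELIMITER $$
CREATE EVENT archive_old_rows
  ON SCHEDULE EVERY 1 WEEK
DO
BEGIN
  -- copy rows older than a year into the archive, then delete them
  INSERT INTO archive_table
    SELECT * FROM main_table WHERE created_at < NOW() - INTERVAL 1 YEAR;
  DELETE FROM main_table
    WHERE created_at < NOW() - INTERVAL 1 YEAR;
END$$
DELIMITER ;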

Related

How to properly organize related tables in MySQL database?

There are two tables - users and orders:

users:
id | first_name | orders_amount_total
------------------------------------
1  | Jone       | 5634200
2  | Mike       | 3982830

orders:
id | user_id | order_amount
------------------------------------
1  | 1       | 200
2  | 1       | 150
3  | 2       | 70
4  | 1       | 320
5  | 2       | 20
6  | 2       | 10
7  | 2       | 85
8  | 1       | 25
The tables are linked by user id. The task is to show each user the sum of all their orders; there can be thousands of orders, maybe tens of thousands, while there can be hundreds or thousands of users making requests simultaneously. There are two options:
1. With each new order, in addition to writing to the orders table, increase the orders_amount_total counter, and then simply show it to the user.
2. Remove the orders_amount_total field, and show the sum of all orders by JOINing the tables and using the SUM operator to calculate the total for a particular user.
Which option is better to use? Why? Why is the other option bad?
P.S. I believe that the second option is concise and correct, given that the database is relational, but I have strong doubts about the load on the server, because the data scanned when calculating the sum is large even for one user, and there are many users.
Option 2 is the correct one for the vast majority of cases.
Option 1 would cause data redundancy that may lead to inconsistencies. With option 2 you're on the safe side and always get the right values.
Yes, denormalizing tables can improve performance, but that's a last resort and great care needs to be taken. "Tens of thousands" of rows isn't a particularly large set for an RDBMS; they are built to handle millions and more pretty well. So you seem to be far away from that last resort and should go with option 2 and proper indexes.
I agree with @sticky_bit that option 2 is better than option 1. There's another possibility:
Create a VIEW that's a pre-defined invocation of the JOIN/SUM query. A smart DBMS should be able to infer that each time the orders table is updated, it also needs to adjust orders_amount_total for the user_id.
BTW, re your schema design: don't name columns just id, and don't use the same column name in two different tables unless they mean the same thing.
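A minimal sketch of option 2 with the column names from the question, plus the VIEW idea (LEFT JOIN so users without orders still get a total of 0):

SELECT u.id, u.first_name,
       COALESCE(SUM(o.order_amount), 0) AS orders_amount_total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.first_name;

-- the same query wrapped as a pre-defined view
CREATE VIEW user_order_totals AS
SELECT u.id, u.first_name,
       COALESCE(SUM(o.order_amount), 0) AS orders_amount_total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
GROUP BY u.id, u.first_name;

-- an index on the join/filter column keeps the aggregation cheap
CREATE INDEX idx_orders_user_id ON orders (user_id);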

SQL - does database extract repeating joined data multiple times or just once?

This is a performance question. In a query joining another table (the other acting as a dictionary) where the joined data repeats, because the foreign key value is repeated in many records of the base table, will the database engine extract the repeating data multiple times (by that I mean not the presented output, but actually accessing and searching the table again and again), or is it smart enough to somehow cache the results and extract everything just once? I am using MySQL.
I mean a situation like this:
SELECT *
FROM Tasks
JOIN People
ON Tasks.personID = People.ID;
Let's assume the People table consists of:
ID | Name
1 | John
2 | Mary
And Tasks:
ID | personID
1 | 1
2 | 1
3 | 2
Will "John" data be physically extracted twice or once? Is it worth trying to avoid such queries?
John will show up twice in the result set.
However, if I interpret your question right, this is not about the resulting result set, but more about how the data is internally read to produce this result set.
In this case you have a join between two tables. In a join between two tables there's a "driving table" that's read first, and then a "secondary table" that is accessed once for each row of the driving table.
Now:
If MySQL chooses Tasks as the driving table, then the row for John in People will be accessed twice (because People will be the secondary table).
If MySQL chooses People as the driving table, then naturally the row for John will be accessed only once.
So, which option will MySQL pick? Get the execution plan and you'll find out. The table that shows up first in the plan is the driving table; the other is the secondary table. Mind that the execution plan may change in the future without notice.
Note: accessing doesn't mean to perform physical I/O on the disk. Once the row is read, it becomes "hot" and it's usually cached for some time; any repeated access will probably end up reading from the cache and won't cause more physical I/O.
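For example, prefixing the query with EXPLAIN shows the join order; the first table listed in the output is the driving table:

EXPLAIN
SELECT *
FROM Tasks
JOIN People ON Tasks.personID = People.ID;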
The answer to your question is that it repeats the data. The string values are not cached or reduced to one per distinct value.
In general, this isn't a problem because you would run queries that have small result sets by selecting a limited subset of data.
But if you don't limit the query, it would produce a large result set, potentially with strings repeated.
MySQL takes the Tasks table and, for every row, adds the row(s) from People that match.
It has to gather every People row that belongs to a row of the Tasks table.
So for the second task with the same personID it would grab the same data again.
This is usually not a problem, because you would put the join columns in an INDEX so they are found quickly.
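A sketch of the index mentioned above, using the column from the example (the index name is arbitrary):

CREATE INDEX idx_tasks_personid ON Tasks (personID);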

I came up with this SQL structure to allow rolling back and auditing user information, will this be adequate?

So, I came up with an idea to store my user information and the updates users make to their own profiles in a way that makes it always possible to roll back (as an option to give to the user, for auditing and support purposes, etc.), while at the same time improving (?) security and preventing malicious activity.
My idea is to store the user's info in rows but never allow the API backend to delete or update those rows, only to insert new ones, with the latest one marked as the "current" data row. I created a graphical explanation:
Schema image
The potential issue I see with this model is that users may update their information too frequently, bloating the database (1 million users and an average of 5 updates per user is 5 million entries). However, for this I came up with the idea of setting apart the rows with "false" in the "current" column through partitioning, where they should not harm performance and will wait to be cleaned up every so often.
Am I right to choose this model? Is there any other way to do such a thing?
I'd also use a second table, user_settings_history.
When a setting is created, INSERT it into the user_settings_history table, along with a timestamp of when it was created. Then also UPDATE the same settings in the user_settings table. There will be one row per user in user_settings, and it will always hold the current settings.
So the user_settings would always have the current settings, and the history table would have all prior sets of settings, associated with the date they were created.
This simplifies your queries against the user_settings table. You don't have to modify your queries to filter for the current flag column you described. You just know that the way your app works, the values in user_settings are defined as current.
If you're concerned about the user_settings_history table getting too large, the timestamp column makes it fairly easy to periodically DELETE rows over 180 days old, or whatever number of days seems appropriate to you.
By the way, 5 million rows isn't so large for a MySQL database. You'd want your queries to use an index where appropriate, but the size alone isn't a disadvantage.
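A rough sketch of that two-table approach; the table layout and column names here are illustrative guesses, not the poster's actual schema (the JSON type requires MySQL 5.7+):

-- current settings: exactly one row per user
CREATE TABLE user_settings (
  user_id    INT PRIMARY KEY,
  settings   JSON NOT NULL,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

-- history: one row per change, never updated
CREATE TABLE user_settings_history (
  id         BIGINT AUTO_INCREMENT PRIMARY KEY,
  user_id    INT NOT NULL,
  settings   JSON NOT NULL,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  KEY idx_user_created (user_id, created_at)
);

-- on every change: append to the history, then upsert the current row
INSERT INTO user_settings_history (user_id, settings) VALUES (1, '{"theme": "dark"}');
INSERT INTO user_settings (user_id, settings) VALUES (1, '{"theme": "dark"}')
  ON DUPLICATE KEY UPDATE settings = VALUES(settings);

-- periodic cleanup of old history rows
DELETE FROM user_settings_history WHERE created_at < NOW() - INTERVAL 180 DAY;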

best approach to exchanging data dumps between organizations

I am working on a project where I will receive student data dumps once a month. The data will be imported into my system. The initial import will be around 7k records. After that, I don't anticipate more than a few hundred a month. However, there will also be existing records that are updated as students change grades, etc.
I am trying to determine the best way to keep track of what has been received, imported, and updated over time.
I was thinking of setting up a hosted MySQL database with a script that imports the SFTP dump into a table that includes creation_date and modification_date fields. My thought was that the person performing the extraction could connect to the MySQL db and run a query on the imported table each month to get the differences before the next extraction.
Another thought I had was to create a new received table every month for each data dump, then run the difference query against it.
Note: The importing system is legacy and will only accept imports using a utility and specific CSV-type files. So that probably rules out options like XML.
Thank you in advance for any advice.
I'm going to assume you're tracking students' grades in a course over time.
I would recommend a two table approach:
Table 1: transaction-level data. Add-only. New information is simply appended. Sammy got a 75 on this week's quiz, Beth did 5 points of extra credit, etc. Each row is a single transaction. Presumably it has the student's name/id, the value being added, maybe the max possible value or some weighting factor, and of course the timestamp.
All of this just keeps adding to a never-ending (in theory) table.
Table 2: a summary table, rebuilt at some interval. This table does a simple aggregation of the first table, processing the transactional scores into a global one. Maybe it's a simple sum, maybe a weighted average, maybe you have something more complex in mind.
This table has one row per student (per course?). You want this to be rebuilt nightly. If you're lazy, you just DROP/CREATE/INSERT. If you're worried about data loss, you just INSERT and add a timestamp so you can keep snapshots going back.
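A minimal sketch of the lazy DROP/CREATE/INSERT rebuild, with hypothetical table and column names (scores as the transaction table, score_summary as the summary):

DROP TABLE IF EXISTS score_summary;

-- aggregate the transaction rows into one row per student and course
CREATE TABLE score_summary AS
SELECT student_id,
       course_id,
       SUM(score) AS total_score,
       COUNT(*)   AS entries,
       NOW()      AS built_at
FROM scores
GROUP BY student_id, course_id;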

mysql optimization: current & previous orders should be on different tables, or same table with a flag column?

I want to know the most optimized way to work with MySQL.
I have a quite large database for orders. I have another table for previous orders; whenever an order is completed, it's erased from orders and moved to previous orders.
Should I keep using this method, or put them all in the same table and add a column that flags whether it's a current or previous order?
kfir
In general, moving data around between tables is a sensitive process - data can easily get lost or corrupted. The question is how you are querying this table: if you search and filter through it on a frequent basis, you want to keep the table relatively small. If you only access it to read specific rows (using a direct primary key), the size of the table is less crucial, and then I would advise keeping a flag.
A third option you might want to consider is having 3 tables - one for ongoing orders, one for historic orders, and one with the order details. That third table can be long and static, while you query the first two tables.
Lastly - at some point, you might want to move the historic data out of the table altogether. Maybe you could keep a cron job that runs once a month and moves out data older than 6 months to a different, remote database.
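If you end up keeping a flag instead of two tables, a minimal sketch could look like this (the column names, including completed_at, are placeholders):

-- flag completed orders instead of moving them
ALTER TABLE orders ADD COLUMN is_completed TINYINT(1) NOT NULL DEFAULT 0;
CREATE INDEX idx_orders_is_completed ON orders (is_completed);

-- completing an order just flips the flag
UPDATE orders SET is_completed = 1 WHERE id = 123;

-- monthly cron job: push old completed orders to an archive table, then delete them
INSERT INTO orders_archive
  SELECT * FROM orders
  WHERE is_completed = 1 AND completed_at < NOW() - INTERVAL 6 MONTH;
DELETE FROM orders
  WHERE is_completed = 1 AND completed_at < NOW() - INTERVAL 6 MONTH;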