Best Update Method for MySQL DB - mysql

I have read through the solutions to similar problems, but they all seem to involve scripts and extra tools. I'm hoping my problem is simple enough to avoid that.
So the user uploads a csv of next week's data. It gets inserted into the DB, no problem.
BUT
an hour later he gets feedback from everyone, and must make updates accordingly. He updates the csv and goes to upload it to the DB.
Right now, the system I'm using checks to see if the data for that week is already there, and if it is, pulls all of that data from the DB, a script finds the differences and sends them out, and after all of this, the old data is deleted and replaced with the new data.
Obviously, it is a lot easier to just wipe it clean and re-enter the data, but that's not the best method, especially if there are lots of changes or tons of data. And I have to know WHAT changes have been made in order to send out alerts. But I don't want a transaction log, as the alerts only need to be sent out the one time and after that, the old data is useless.
So!
Is there a smart way to compare the new data to the already existing data, get only the rows that are changed/deleted/added, and make those changes? Right now it seems like I could do an update, but then I won't get any response on what has changed...
Thanks!
Quick Edit:
No foreign keys are currently in use. This will soon change, but it shouldn't make a difference, because the foreign keys will only point to who the data affects and thus won't need to be changed. As far as primary keys go, that does present a bit of a dilemma:
The data in question is everyone's work schedule. So it would be nice (for specific applications of this schedule beyond simple output) for each shift to have a key. But here's the problem: let's say that user1 was late on Monday. The tardiness is recorded in a separate table and is tied to the shift using the shift key. But if on Tuesday there is some need to make changes to the week already in progress, my fear is that it will become too difficult to ensure that entries in the DB that have already happened (and thus may have associations that shouldn't be broken) don't get re-keyed in the process. Unfortunately, it is not as simple as only updating events occurring AFTER the current time, as this would add work (and thus make it less marketable) for the people who do the uploading. Basically, they make the schedule in one program, export it to a CSV, and then upload it on a web page for all of the webapps that need that data. So it is simply much easier for them (and less stressful for everyone involved) to follow the same routine every time: export the entire week and upload it.
So my biggest concern is to make the upload script as smart as possible on both ends: it shouldn't get bloated trying to find the changes, it should find the changes no matter the input, AND none of the data that is unchanged should risk getting re-keyed.
Here's a related question:
Suppose Joe User was scheduled to wash dishes from 7:00 PM to 8:00 PM, but the new data has him working 6:45 PM to 8:30 PM. Has the shift been changed? Or has the old one been deleted and a new one added?
And another one:
Say Jane was scheduled to work 1:00 PM to 3:00 PM, but now everyone has a mandatory staff meeting from 2:00 to 3:00. Has she lost one shift and gained two? Or has one shift changed and she gained one?
I'm really interested in knowing how this kind of data is typically handled/approached, more than specific answers to the above.
Again, thank you.

Right now, the system I'm using checks to see if the data for that week is already there, and if it is, pulls all of that data from the DB, a script finds the differences and sends them out, and after all of this, the old data is deleted and replaced with the new data.
So your script knows the differences, right? And you don't want to use any extra tools, apart from your script and MySQL, right?
I'm quite convinced that MySQL doesn't offer any 'diff' tool by itself, so the best you can achieve is making a new CSV file for updates only. I mean, it should contain only the changed rows. Updating would be quicker, and all changed data would be easily available.

If you have a unique key on one of the fields, you can use:
LOAD DATA LOCAL INFILE '/path/to/data.csv' REPLACE INTO TABLE table_name
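If you also need to know exactly what changed before the data is replaced, one option is to load the CSV into a staging table first and diff it against the live table with a couple of joins. A rough sketch (the shifts / shifts_staging tables and their columns are made up for illustration):

-- Load the upload into an empty staging table with the same structure.
CREATE TABLE shifts_staging LIKE shifts;
LOAD DATA LOCAL INFILE '/path/to/data.csv'
INTO TABLE shifts_staging
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(shift_id, employee_id, start_time, end_time);

-- Added or changed rows (these drive the alerts).
SELECT s.*
FROM shifts_staging s
LEFT JOIN shifts t ON t.shift_id = s.shift_id
WHERE t.shift_id IS NULL
   OR t.start_time <> s.start_time
   OR t.end_time   <> s.end_time;

-- Deleted rows: in the live table but missing from the upload.
SELECT t.*
FROM shifts t
LEFT JOIN shifts_staging s ON s.shift_id = t.shift_id
WHERE s.shift_id IS NULL;

-- Once the alerts are sent, apply the upload in one shot
-- (rows that disappeared from the CSV still need a separate DELETE).
REPLACE INTO shifts SELECT * FROM shifts_staging;
DROP TABLE shifts_staging;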

Related

Data pipeline proposal

Our product has been growing steadily over the last few years, and we are now at a turning point as far as data size goes for some of our tables: we expect those tables to double or triple in the next few months, and even more so in the next few years. We are talking in the range of 1.4M rows now, so over 3M by the end of the summer, and (since we expect growth to be exponential) around 10M by the end of the year (M being million, not mega/1000).
The table we are talking about is sort of a logging table. The application receives data files (csv/xls) on a daily basis and the data is transferred into said table. Then it is used in the application for a specific amount of time - a couple of weeks/months - after which it becomes rather redundant. That is: if all goes well. If there is some problem down the road, the data in the rows can be useful to inspect for problem solving.
What we would like to do is periodically clean up the table, removing any number of rows based on certain requirements, but instead of actually deleting the rows, move them 'somewhere else'.
We currently use MySQL as a database and the 'somewhere else' could be the same, but can be anything. For other projects we have a Master/Slave setup where the whole database is involved, but that's not what we want or need here. It's just some tables where the Master table would need to become shorter and the Slave only bigger, not a one-on-one sync.
The main requirement for the secondary store would be that the data should be easy to inspect/query when needed, either by SQL or another DSL, or just visual tooling. So we are not interested in backing up the data to one or more CSV files or another plain-text format, since that is not as easy to inspect. The logs would then sit somewhere on S3 and we would need to download them and grep/sed/awk through them... We'd much rather have something database-like that we can consult.
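For concreteness, the kind of periodic move-then-delete job we have in mind would look roughly like this (the events_log table, the archive schema, the date column and the three-month cutoff are all made up for illustration):

-- Pin the cutoff so the copy and the delete see exactly the same rows.
SET @cutoff = NOW() - INTERVAL 3 MONTH;

-- One-time setup: an archive table with the same structure, in its own schema.
CREATE TABLE archive.events_log LIKE main.events_log;

-- Copy old rows across, then remove them from the main table.
INSERT INTO archive.events_log
SELECT * FROM main.events_log WHERE created_at < @cutoff;

DELETE FROM main.events_log WHERE created_at < @cutoff;

The archive stays queryable with plain SQL, which is the main requirement above.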
I hope the problem is clear?
For the record: while the solution can be anything, we prefer to have the simplest solution possible. It's not that we don't want Apache Kafka (for example), but then we'd have to learn it, install it, maintain it. Every new piece of technology adds to our stack; the lighter it remains, the more we like it ;).
Thanks!
PS: we are not just being lazy here, we have done some research but we just thought it'd be a good idea to get some more insight in the problem.

Get all transactions or rows from MySQL since a certain date

At the moment, we're working on a project that involves an archaic on-board computer (OBC) and a proprietary database. The idea is to use MySQL on the desktop/website, but when an OBC needs to come up to date, we send it the proprietary database files it needs. That is, we don't send it a new copy of the files, just files with the changes, and the OBC updates its own instance of the proprietary database.
At the moment, we are using said database on the desktop as well, but we're trying to move away from it and into MySQL. The problem is that the OBC is so old and so heavily invested in that we can't move away from its use of the proprietary version.
My question boils down to this: Is there a way to search MySQL for every row that has been altered since a given date (not searching for a datetime stored in a column, but the datetime when the row itself was last altered), or would we have to keep track of every change made to the database on our own (there won't be all that many, at least 1000)?
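For reference, what we mean by 'when this row was altered last' would be something like a timestamp column that MySQL maintains by itself (the obc_data table name is made up for illustration); as far as we can tell this only records UPDATEs to rows that still exist and says nothing about deletes:

ALTER TABLE obc_data
  ADD COLUMN last_modified TIMESTAMP NOT NULL
  DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;

-- Everything touched since the OBC's last sync:
SELECT * FROM obc_data WHERE last_modified >= '2012-06-01 00:00:00';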

Cross Stream Data changes - EDW

I've got a scenario where Data Stream B is dependent on Data Stream A. Whenever there is a change in Data Stream A, Stream B needs to be re-processed. So a common process is required to identify the changes across data streams and trigger the re-processing tasks.
Is there a good way to do this besides triggers?
Your question is rather unclear and I think any answer depends very heavily on what your data looks like, how you load it, how you can identify changes, if you need to show multiple versions of one fact or dimension value to users etc.
Here is a short description of how we handle it, it may or may not help you:
1. We load raw data incrementally daily, i.e. we load all data generated in the last 24 hours in the source system (I'm glossing over timing issues, but they aren't important here)
2. We insert the raw data into a loading table; that table already contains all data that we have previously loaded from the same source
3. If rows are completely new (i.e. the PK value in the raw data is new) they are processed normally
4. If we find a row where we already have the PK in the table, we know it is an updated version of data that we've already processed
5. Where we find updated data, we flag it for special processing and re-generate any data depending on it (this is all done in stored procedures)
I think you're asking how to do step 5, but it depends on the data that changes and what your users expect to happen. For example, if one item in an order changes, we re-process the entire order to ensure that the order-level values are correct. If a customer address changes, we have to re-assign him to a new sales region.
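As a rough illustration of steps 3 to 5 (table and column names are invented, and our real version lives in stored procedures, but the shape is the same in MySQL-ish SQL):

-- Step 3: rows whose PK is new are loaded normally.
INSERT INTO loading_table (source_pk, payload, needs_reprocessing)
SELECT r.source_pk, r.payload, 0
FROM raw_daily r
LEFT JOIN loading_table l ON l.source_pk = r.source_pk
WHERE l.source_pk IS NULL;

-- Steps 4 and 5: rows whose PK already exists are updated versions;
-- overwrite them and flag them so anything depending on them is regenerated.
UPDATE loading_table l
JOIN raw_daily r ON r.source_pk = l.source_pk
SET l.payload = r.payload,
    l.needs_reprocessing = 1
WHERE l.payload <> r.payload;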
There is no generic way to identify data changes and process them, because everyone's data and requirements are different and everyone has a different toolset and different constraints and so on.
If you can make your question more specific then maybe you'll get a better answer, e.g. if you already have a working solution based on triggers then why do you want to change? What problem are you having that is making you look for an alternative?

Change Data Capture or Change Tracking - Same as Traditional Audit Trail Table?

Before I delve into the abyss of Microsoft documentation any deeper, I'd like to know if someone experienced with Change Data Capture and Change Tracking know if one or both of these can be used to replace the traditional ...
"Audit trail table copy of the 'real
table' (all of the fields of the original table,
plus date/time, user ID, and DML
action field) inserted into by
Triggers"
... setup for a database table audit trail, where the trigger populates the audit trail table (which is all manual work).
The MSDN overview documentation explains at a high level what Change Data Capture and Change Tracking are, but it isn't clear enough to me, and doesn't state outright, that these tools can be used to replace the traditional audit trail tables we've made so often.
Can someone with any experience using Change Data Capture and Change Tracking save me a lot of time, or confirm that I am spending time looking at the right tool? The critical part of our audit trail is capturing all changes to a table's fields (on INSERT, UPDATE, DELETE), when it happened, and who did it. These changes are commonly provided to an end user chronologically via an audit trail report. Which brings up another question: if Change Data Capture or Change Tracking is the solution, I assume this data can be queried just like data from a normal table?
EDIT: I need a permanent audit trail, regardless of time. I see that Change Data Capture has to do with the transaction logs, so this sounds finite to me.
I think you still need audit tables in your circumstances. Looking in BOL, it appears that a cleanup job is automatically created and scheduled to run every day at 2 A.M. From BOL:
The cleanup job runs daily at 2 A.M. It retains change table entries for 4320 minutes or 3 days, removing a maximum of 5000 entries with a single delete statement.
That sounds like it definitely doesn't do what you want. I can't think that would do what anyone who audits tables would want. It also appears that it would be difficult, if not impossible, to add any fields not in the data table to the audit log other than its own five default fields (I couldn't find what they were). It also appears that the data would not be very useful to query or to use to roll back a specific bad change. Or maybe I just don't understand the process, because BOL is pretty poorly written on this subject; it certainly didn't answer any of the concerns I would have in replacing my auditing with this apparently poorly thought out process.

Never delete entries? Good idea? Usual?

I am designing a system and I don't think it's a good idea to give the end user the ability to delete entries in the database. I think that way because often the end user, once given admin rights, might end up making a mess in the database and then turn to me to fix it.
Of course, they will need to be able to remove entries, or at least think that they did, if they are set as admin.
So, I was thinking that all the entries in the database should have an "active" field. If they try to remove an entry, it will just set the flag to "false" or something similar. Then there will be some kind of super admin that would be my company's team who could change this field.
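Something like this is what I have in mind (table and column names invented):

ALTER TABLE entries ADD COLUMN active TINYINT(1) NOT NULL DEFAULT 1;

-- What the admin's 'delete' actually does:
UPDATE entries SET active = 0 WHERE id = 42;

-- Every normal query then has to remember the flag:
SELECT * FROM entries WHERE active = 1;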
I already saw that in another company I worked for, but I was wondering if it was a good idea. I could just make regular database backups and roll back if they make an error, and adding this field would add some complexity to all the queries.
What do you think? Should I do it that way? Do you use this kind of trick in your applications?
In one of our databases, we distinguished between transactional and dictionary records.
In a couple of words, transactional records are things that you cannot roll back in real life, like a call from a customer. You can change the caller's name, status etc., but you cannot dismiss the call itself.
Dictionary records are things that you can change, like assigning a city to a customer.
Transactional records and things that lead to them were never deleted, while dictionary ones could be deleted all right.
By "things that lead to them" I mean that as soon as the record appears in the business rules which can lead to a transactional record, this record also becomes transactional.
Like, a city can be deleted from the database. But when a rule appeared that said "send an SMS to all customers in Moscow", the cities became transactional records as well, or we would not be able to answer the question "why did this SMS get sent".
A rule of thumb for distinguishing was this: is it only my company's business?
If one of my employees made a decision based on data from the database (like, he made a report based on which some management decision was made, and then the data the report was based on disappeared), it was considered OK to delete this data.
But if the decision affected some immediate actions with customers (like calling, messing with the customer's balance etc.), everything that led to these decisions was kept forever.
It may vary from one business model to another: sometimes it may be required to record even internal data, sometimes it's OK to delete data that affects the outside world.
But for our business model, the rule from above worked fine.
A couple of reasons people do things like this are auditing and automated rollback. If a row is completely deleted, then there's no way to automatically roll back that deletion if it was in error. Also, keeping a row around and its previous state is important for auditing - a super user should be able to see who deleted what and when, as well as who changed what, etc.
Of course, that's all dependent on your current application's business logic. Some applications have no need for auditing and it may be proper to fully delete a row.
The downside to just setting a flag such as IsActive or DeletedDate is that all of your queries must take that flag into account when pulling data. This makes it more likely that another programmer will accidentally forget this flag when writing reports...
A slightly better alternative is to archive that record into a different database. This way it's been physically moved to a location that is not normally searched. You might add a couple fields to capture who deleted it and when; but the point is it won't be polluting your main database.
Further, you could provide an undo feature to bring it back fairly quickly; and do a permanent delete after 30 days or something like that.
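A rough MySQL sketch of that archive-and-purge idea (schema, table and column names are only examples):

-- One-time setup: same structure as the live table, plus who/when columns.
CREATE TABLE archive_db.entries_deleted LIKE app_db.entries;
ALTER TABLE archive_db.entries_deleted
  ADD COLUMN deleted_by VARCHAR(64),
  ADD COLUMN deleted_at DATETIME;

-- 'Delete': copy the row out, then remove it from the main table.
INSERT INTO archive_db.entries_deleted
SELECT e.*, 'jdoe', NOW() FROM app_db.entries e WHERE e.id = 42;
DELETE FROM app_db.entries WHERE id = 42;

-- Undo is just the reverse copy; a scheduled job purges anything older than 30 days:
DELETE FROM archive_db.entries_deleted
WHERE deleted_at < NOW() - INTERVAL 30 DAY;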
UPDATE concerning views:
With views, the data still participates in your indexing scheme. If the amount of potentially deleted data is small, views may be just fine as they are simpler from a coding perspective.
I prefer the method that you are describing. It's nice to be able to undo a mistake. More often than not, there is no easy way of going back on a DELETE query. I've never had a problem with this method, and unless you are filling your database with 'deleted' entries, there shouldn't be an issue.
I use a combination of techniques to work around this issue. For some things, adding the extra "active" field makes sense. Then the user has the impression that an item was deleted, because it no longer shows up on the application screen. The scenarios where I would implement this include items that are required to keep a history... let's say invoices and payments. I wouldn't want such things being deleted for any reason.
However, there are some items in the database that are not so sensitive, let's say a list of categories that I want to be dynamic... I may then allow users with admin privileges to add and delete a category, and the delete could be permanent. However, as part of the application logic I will check whether the category is used anywhere before allowing the delete.
I suggest having a second database like DB_Archives where you add every row deleted from the DB. The is_active field negates the very purpose of foreign key constraints, and YOU have to make sure that a row is not marked as deleted when it's referenced elsewhere. This becomes overly complicated when your DB structure is massive.
There is an acceptable practice that exists in many applications (Drupal's versioning system, et al.). Since MySQL scales very quickly and easily, you should be okay.
I've been working on a project lately where all the data was kept in the DB as well. The status of each individual row was kept in an integer field (data could be active, deleted, in_need_for_manual_correction, historic).
You should consider using views to access only the active/historic/... data in each table. That way your queries won't get more complicated.
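For example (view and table names invented; 1 meaning 'active' in the status field):

CREATE VIEW active_rows AS
SELECT * FROM data_rows WHERE status = 1;

-- Application code selects from the view and never repeats the flag check.
SELECT * FROM active_rows WHERE customer_id = 42;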
Another thing that made things easy was the use of UPDATE/INSERT/DELETE triggers that handled all the flag changing inside the DB and thus kept the complex stuff out of the application (for the most part).
I should mention that the DB was an MSSQL 2005 server, but I guess the same approach should work with MySQL, too.
Yes and no.
It will complicate your application much more than you expect, since every table that does not allow deletion will need an extra check (IsDeleted = false) etc. It does not sound like much, but when you build a larger application and, in a query of 11 tables, 9 require a check for non-deletion... it's tedious and error prone. (Well yeah, then there are deleted/non-deleted views... when you remember to create/use them.)
Some schema upgrades will become a PITA, since you'll have to relax FKs and invent "suitable" data for very, very old data.
I've not tried it, but I have thought a moderate amount about a solution where you'd zip the row data to XML and store that in some "Historical" table. Then, in case of "must have that restored now OMG the world is dying!1eleven", it's possible to dig it out.
I agree with all respondents that if you can afford to keep old data around forever it's a good idea; for performance and simplicity, I agree with the suggestion of moving "logically deleted" records to "old stuff" tables rather than adding "is_deleted" flags (moving to a totally different database seems a bit like overkill, but you can easily change to that more drastic approach later if eventually the amount of accumulated data turns out to be a problem for a single db with normal and "old stuff" tables).