How to separate these two processes? - mysql

I have an ASP.NET website that stores events in a database table. Then I have a Windows service app that reads those events and performs the appropriate actions. Currently it's possible for the two processes to insert and remove records from the same table at the same time.
What is a better pattern for developing such a system, to ensure the two are never working on the same table simultaneously?

I'm not sure about a pattern, but I'd build a WCF service and let both processes use that to access the data. Then share a common lock object between all methods that alter (or read) the table contents.

For this scenario I use a pattern that ensures that the data cannot be updated concurrently.
I always add a special column to the table, usually 'LastModified' of type 'timestamp'. When inserting or updating a row I always set this column.
When I come to update a record I make sure that the stored procedure checks the value that I am passing in against the one stored in the database. If these are different then another user or process has altered this row, and I raise a concurrency error.
This can be propagated up to the calling process or handled in your service.
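As a rough illustration of the check described above, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for the real database. It assumes an integer version column rather than a 'LastModified' timestamp, and the events table, update_event function and ConcurrencyError class are hypothetical names invented for the example.

```python
import sqlite3

class ConcurrencyError(Exception):
    """Raised when the row changed since the caller last read it."""

def update_event(conn, event_id, new_payload, expected_version):
    # The WHERE clause only matches if nobody bumped the version in between.
    cur = conn.execute(
        "UPDATE events SET payload = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_payload, event_id, expected_version),
    )
    conn.commit()
    if cur.rowcount == 0:
        raise ConcurrencyError(
            f"event {event_id} was modified or deleted by another process"
        )

# Hypothetical table for the demo; version plays the role of 'LastModified'.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO events VALUES (1, 'created', 1)")
conn.commit()

update_event(conn, 1, "processed", expected_version=1)  # matches, version becomes 2
try:
    update_event(conn, 1, "stale write", expected_version=1)  # stale version
    stale_write_rejected = False
except ConcurrencyError:
    stale_write_rejected = True
```

The caller reads the version together with the row and passes it back on update, so a write based on stale data fails loudly instead of silently overwriting someone else's change.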

This could be an architecture problem more than anything else.
Why would you need two processes that delete records?
You generally don't need two different processes to CRUD data in the same tables. One thing you can do is wrap the database/tables with a service, then let all processes that require working with the data use that service. The service can then take care of the serialization of calls. Either way, there will be only 1 process working with the DB directly.
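To make the "one service owns the table" idea a little more concrete, here is a small in-process sketch in Python with sqlite3. It is only an approximation: a real deployment would put this behind a network service (the WCF service mentioned earlier, for example), and the EventStore class and its methods are hypothetical names, not part of any library.

```python
import sqlite3
import threading

class EventStore:
    """Hypothetical gateway: the only code that touches the events table."""

    def __init__(self):
        # check_same_thread=False lets several threads share the connection;
        # the lock below is what actually serializes access.
        self._conn = sqlite3.connect(":memory:", check_same_thread=False)
        self._lock = threading.Lock()
        self._conn.execute(
            "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
        )

    def add(self, payload):
        # Hold the lock for the whole call; `with self._conn` commits on success.
        with self._lock, self._conn:
            self._conn.execute("INSERT INTO events (payload) VALUES (?)", (payload,))

    def take_next(self):
        """Remove and return the oldest event, or None if the table is empty."""
        with self._lock, self._conn:
            row = self._conn.execute(
                "SELECT id, payload FROM events ORDER BY id LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            self._conn.execute("DELETE FROM events WHERE id = ?", (row[0],))
            return row[1]

# Five "website" threads add events; one "service" loop drains them.
store = EventStore()
writers = [threading.Thread(target=store.add, args=(f"event-{i}",)) for i in range(5)]
for t in writers:
    t.start()
for t in writers:
    t.join()
taken = list(iter(store.take_next, None))
```

Because every read and write goes through the same lock, the website side and the worker side can never interleave inside a single operation.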
Additionally, it sounds to me like you're in an event-sourcing type of architecture, which makes me wonder why you'd need to delete records in the first place...

Related

Millions of Data to be inserted in MySQL database using Spring data JPA

Our application is based on Java 8, Spring Data JPA and MySQL. We have two different data sources in our application, and our task is to fetch millions of records (text stored in a table) from one data source and insert them into the other after some small computation.
When I tried to iterate through each record and insert it into the other database, it took much longer than expected.
Is there a standard, fast way of doing this? Do I need to use a stored procedure? If so, how would I pass the list of entities to the procedure?
Don't use JPA. JPA's main use case is: loading a non-trivial domain model, manipulating it, then flushing it to the database with automatic detection of what changed. You don't seem to need that in your use case.
Use JDBC and batch inserts. Spring's JdbcTemplate will come in handy.
Select a batch, manipulate it as desired, insert it into the target.
For tuning the select process consider value-based pagination.
For writing consider removing constraints and indexes and creating them after the process.
There might be more MySQL-specific options available, but I don't know about those.
You might want to split your work into three thread pools: one for reading, one for writing, one for processing the data.
I'm not sure, but Spring Batch might help with that.
Load/save entries in batches (100 or 1000 entries in one go).
Load and/or save asynchronously.
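The select-a-batch, transform, insert-a-batch loop with value-based pagination might look roughly like the sketch below. It is written in Python with sqlite3 purely to keep it short and runnable; with Spring the same shape would be built on JdbcTemplate and its batchUpdate methods. The table names src_texts and dst_texts, the copy_in_batches function, the batch size and the assumption that ids are positive integers are all made up for the example.

```python
import sqlite3

BATCH_SIZE = 1000  # assumption; tune for your data and database

def copy_in_batches(source, target, transform, batch_size=BATCH_SIZE):
    """Copy rows from source to target in batches, applying transform to each body."""
    last_id = 0  # assumes ids are positive integers
    copied = 0
    while True:
        # Value-based (keyset) pagination: seek past the last id seen
        # instead of using an ever-growing OFFSET.
        rows = source.execute(
            "SELECT id, body FROM src_texts WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return copied
        batch = [(row_id, transform(body)) for row_id, body in rows]
        with target:  # one transaction per batch
            target.executemany(
                "INSERT INTO dst_texts (id, body) VALUES (?, ?)", batch
            )
        last_id = rows[-1][0]
        copied += len(rows)

# Two separate in-memory databases stand in for the two data sources.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE src_texts (id INTEGER PRIMARY KEY, body TEXT)")
target.execute("CREATE TABLE dst_texts (id INTEGER PRIMARY KEY, body TEXT)")
with source:
    source.executemany(
        "INSERT INTO src_texts (id, body) VALUES (?, ?)",
        [(i, f"text {i}") for i in range(1, 2501)],
    )

total = copy_in_batches(source, target, str.upper)
```

Committing once per batch rather than once per row is usually where most of the speed-up comes from; the right batch size depends on row size and the target database.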

Approaches in managing data from one database to another

There are two databases, MAIN and TEMP, used in a website. The TEMP database is used to manage data fetched from MAIN for insert/update, and on publishing the data is moved back to the MAIN database. What are possible approaches to error handling while publishing?
I can think of the two approaches below:
Rollback script - if an error occurs during insert/update, a rollback script can undo the changes.
Third DB concept - introduce a third database identical to MAIN, run the insert/update against it first, and only if that succeeds execute the same commands against MAIN; otherwise leave MAIN untouched.
I am not sure which approach is better among the two. Can there be any other approach?
Suggestions are really helpful.
Use a transaction to move/update the data from TEMP to MAIN. You either want it to work, or not, right? Presumably leaving the state in TEMP if it doesn't?
The only case I can see where you might want to do anything different is if you deliberately want to NOT leave the data in TEMP if a publish fails (if for example there's no sensible way to follow such a case up), in which case you could consider having 2 transactions, one that removes it from TEMP followed by a second one that adds it back to MAIN only if the first succeeds, and if either of those transactions fails an error is reported and the whole thing has to be restarted again.
Using a third DB doesn't help. You could still succeed with the attempt to the third DB and fail with MAIN, and keeping the third DB up to date with MAIN means you immediately double all your work.
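A minimal sketch of the single-transaction publish, in Python with sqlite3. It assumes both sets of tables can be covered by one transaction (for instance, two databases on the same server); here they are simply two tables in one in-memory database. The names main_articles, temp_articles and publish are hypothetical.

```python
import sqlite3

def publish(conn):
    """Move all staged rows into the main table atomically."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "INSERT INTO main_articles (id, title) SELECT id, title FROM temp_articles"
        )
        conn.execute("DELETE FROM temp_articles")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_articles (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.execute("CREATE TABLE temp_articles (id INTEGER PRIMARY KEY, title TEXT)")
with conn:
    conn.execute("INSERT INTO main_articles VALUES (1, 'already live')")
    # The second staged row reuses id 1, so publishing must fail as a whole.
    conn.executemany(
        "INSERT INTO temp_articles VALUES (?, ?)",
        [(2, "new draft"), (1, "conflicting id")],
    )

try:
    publish(conn)
    publish_failed = False
except sqlite3.IntegrityError:
    publish_failed = True

main_count = conn.execute("SELECT COUNT(*) FROM main_articles").fetchone()[0]
temp_count = conn.execute("SELECT COUNT(*) FROM temp_articles").fetchone()[0]
```

After the failed publish, MAIN still holds only its original row and TEMP still holds both staged rows, so the publish can be retried once the conflict is fixed.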

How to cache infrequently changing mysql query?

I have a mysql query that is taking 8 seconds to execute/fetch (in workbench).
I won't go into the details of why it may be slow (I think the GROUP BY isn't helping though).
What I really want to know is, how I can basically cache it to work more quickly because the tables only change like 5-10 times/hr, while users access the site 1000s times/hour.
Is there a way to just have the results regenerated/cached when the db changes so results are not constantly regenerated?
I'm quite new to sql so any basic thought may go a long way.
I am not familiar with such a caching facility in MySQL. There are alternatives.
One mechanism would be to use application level caching. The application would store the previous result and use that if possible. Note this wouldn't really work well for multiple users.
What you might want to do is store the report in a separate table. Then you can run that every five minutes or so. This would be a simple mechanism using a job scheduler to run the job.
A variation on this would be to have a stored procedure that first checks if the data has changed. If the underlying data has changed, then the stored procedure would regenerate the report table. When the stored procedure is done, the report table would be up-to-date.
An alternative would be to use triggers, whenever the underlying data changes. The trigger could run the query, storing the results in a table (as above). Alternatively, the trigger could just update the rows in the report that would have changed (harder, because it involves understanding the business logic behind the report).
All of these require some change to the application. If your application query is stored in a view (something like vw_FetchReport1) then the change is trivial and all on the server side. If the query is embedded in the application, then you need to replace it with something else. I strongly advocate using views (or in other databases user defined functions or stored procedures) for database access. This defines the API for the database application and greatly facilitates changes such as the ones described here.
EDIT: (in response to comment)
More information about scheduling jobs in MySQL is here. I would expect the SQL code to be something like:
truncate table ReportTable;
insert into ReportTable
select * from <ReportQuery>;
(In practice, you would include column lists in the select and insert statements.)
A simple solution that can be used to speed up the response time for long-running queries is to periodically generate summarized tables, refreshed according to how often the underlying data changes or what the business needs.
For example, if your business doesn't care about sub-minute "accuracy", you can run the process once each minute and have your user interface query this calculated table, instead of summarizing raw data online.
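The summary-table idea could be sketched as follows, in Python with sqlite3. The orders and report_totals tables and both function names are hypothetical, and the GROUP BY query stands in for whatever the slow report query really is. The sketch uses DELETE rather than TRUNCATE so the refresh stays inside one transaction (in MySQL, TRUNCATE TABLE causes an implicit commit).

```python
import sqlite3

def refresh_report(conn):
    """Rebuild the cached report from the raw data in one transaction."""
    with conn:
        conn.execute("DELETE FROM report_totals")
        conn.execute(
            "INSERT INTO report_totals (category, total) "
            "SELECT category, SUM(amount) FROM orders GROUP BY category"
        )

def read_report(conn):
    # Readers hit the small precomputed table, not the expensive GROUP BY.
    return conn.execute(
        "SELECT category, total FROM report_totals ORDER BY category"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount INTEGER)")
conn.execute("CREATE TABLE report_totals (category TEXT PRIMARY KEY, total INTEGER)")
with conn:
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("books", 10), ("books", 20), ("toys", 5)],
    )

refresh_report(conn)          # would be run by a scheduler or after a data change
report = read_report(conn)    # what the website would serve
```

How refresh_report gets triggered (a scheduled job, a trigger, or a call made by the code that changes the data) is the design choice discussed in the answers above.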

MySQL table modified timestamp

I have a test server that uses data from a test database. When I'm done testing, it gets moved to the live database.
The problem is, I have other projects that rely on the data now in production, so I have to run a script that grabs the data from the tables I need, deletes the data in the test DB and inserts the data from the live DB.
I have been trying to figure out a way to improve this model. The problem isn't so much in the migration, since the data only gets updated once or twice a week (without any action on my part). The problem is having the migration take place only when it needs to. I would like to have my migration script include a quick check against the live tables and the test tables and, if need be, make the move. If there haven't been updates, the script quits.
This way, I can include the update script in my other scripts and not have to worry if the data is in sync.
I can't use timestamps. For one, I have no control over the tables on the live side once it goes live, and also because it seems a bit silly to bulk up the tables more for convenience.
I tried doing a "SHOW TABLE STATUS FROM livedb" but because the tables are all InnoDB, there is no "Update Time", plus, it appears that the "Create Time" was this morning, leading me to believe that the database is backed up and re-created daily.
Is there any other property in the table that would show which of the two is newer? A "Newest Row Date" perhaps?
In short: Make the development-live updating first-class in your application. Instead of depending on the database engine to supply you with the necessary information to enable you to make a decision (to update or not to update ... that is the question), just implement it as part of your application. Otherwise, you're trying to fit a round peg into a square hole.
Without knowing what your data model is, and without understanding at all what your synchronization model is, you have a few options:
Match primary keys against live database vs. the test database. When test > live IDs, do an update.
Use timestamps in a table to determine if it needs to be updated
Use the md5 hash of a database table and modification date (UTC) to determine if a table has changed.
Long story short: Database synchronization is very hard. Implement a solution which is specific to your application. There is no "generic" solution which will work ideally.
If you have an autoincrement in your tables, you could compare the maximum autoincrement values to see if they're different.
But which version of mysql are you using?
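The auto-increment comparison could be sketched like this in Python with sqlite3. Two in-memory databases stand in for the live and test servers, and the items table, table_fingerprint and needs_sync are hypothetical names. Note the limitation: comparing the row count and maximum id only detects inserted or deleted rows, not rows updated in place.

```python
import sqlite3

def table_fingerprint(conn, table):
    """Return (row count, highest id) as a cheap change indicator."""
    # Table names cannot be bound as parameters, so only pass trusted names here.
    return conn.execute(
        f"SELECT COUNT(*), COALESCE(MAX(id), 0) FROM {table}"
    ).fetchone()

def needs_sync(live, test, table):
    return table_fingerprint(live, table) != table_fingerprint(test, table)

live = sqlite3.connect(":memory:")
test = sqlite3.connect(":memory:")
for db in (live, test):
    db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
with live:
    live.executemany("INSERT INTO items (name) VALUES (?)", [("a",), ("b",), ("c",)])
with test:
    test.executemany("INSERT INTO items (name) VALUES (?)", [("a",), ("b",)])

before = needs_sync(live, test, "items")   # live has one extra row

with test:
    test.execute("INSERT INTO items (name) VALUES ('c')")

after = needs_sync(live, test, "items")    # now both sides match
```

A migration script could call needs_sync first and exit early when it returns False.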
Rather than rolling your own, you could use a preexisting solution for keeping databases in sync. I've heard good things about SQLYog's SJA (see here). I've never used it myself, but I've been very impressed with their other programs.

What is the best way to update (or replace) an entire database table on a live machine?

I'm being given a data source weekly that I'm going to parse and put into a database. The data will not change much from week to week, but I should be updating the database on a regular basis. Besides this weekly update, the data is static.
For now rebuilding the entire database isn't a problem, but eventually this database will be live and people could be querying the database while I'm rebuilding it. The amount of data isn't small (couple hundred megabytes), so it won't load that instantaneously, and personally I want a bit more of a foolproof system than "I hope no one queries while the database is in disarray."
I've thought of a few different ways of solving this problem, and was wondering what the best method would be. Here are my ideas so far:
Instead of replacing entire tables, query for the difference between my current database and what I want to place in the database. This seems like it could be an unnecessary amount of work, though.
Creating dummy data tables, then doing a table rename (or having the server code point towards the new data tables).
Just telling users that the site is going through maintenance and put the system offline for a few minutes. (This is not preferable for obvious reasons, but if it's far and away the best answer I'm willing to accept that.)
Thoughts?
I can't speak for MySQL, but PostgreSQL has transactional DDL. This is a wonderful feature, and means that your second option, loading new data into a dummy table and then executing a table rename, should work great. If you want to replace the table foo with foo_new, you only have to load the new data into foo_new and run a script to do the rename. This script should execute in its own transaction, so if something about the rename goes bad, both foo and foo_new will be left untouched when it rolls back.
The main problem with that approach is that it can get a little messy to handle foreign keys from other tables that key on foo. But at least you're guaranteed that your data will remain consistent.
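The load-then-rename approach could look roughly like the sketch below. It uses Python's sqlite3 because SQLite, like PostgreSQL, allows DDL inside a transaction; the table names foo and foo_new come from the answer above, while foo_old and swap_in_new_table are made up for the example. (MySQL does not have transactional DDL, but its RENAME TABLE statement can rename several tables in one atomic operation.)

```python
import sqlite3

def swap_in_new_table(conn):
    """Replace foo with foo_new; on any error both tables stay as they were."""
    conn.execute("BEGIN")
    try:
        conn.execute("ALTER TABLE foo RENAME TO foo_old")
        conn.execute("ALTER TABLE foo_new RENAME TO foo")
        conn.execute("DROP TABLE foo_old")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None turns off the module's implicit transactions,
# so the explicit BEGIN/COMMIT above are the only transaction boundaries.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("CREATE TABLE foo_new (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO foo VALUES (1, 'last week')")
conn.executemany(
    "INSERT INTO foo_new VALUES (?, ?)", [(1, "this week"), (2, "added")]
)

swap_in_new_table(conn)
rows = conn.execute("SELECT id, val FROM foo ORDER BY id").fetchall()
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
```

The slow part (loading foo_new) happens while readers still use foo; only the quick rename needs the transaction.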
A better approach in the long term, I think, is just to perform the updates on the data directly (your first option). Once again, you can stick all the updating in a single transaction, so you're guaranteed all-or-nothing semantics. Even better would be online updates, just updating the data directly as new information becomes available. This may not be an option for you if you need the results of someone else's batch job, but if you can do it, it's the best option.
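BEGIN;
-- mytable and mytable_new are placeholder names; mytable_new holds the freshly parsed data.
DELETE FROM mytable;
INSERT INTO mytable SELECT * FROM mytable_new;
COMMIT;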
Users will see the changeover instantly when you hit commit. Any queries started before the commit will run on the old data, anything afterwards will run on the new data. The database will actually clear the old table once the last user is done with it. Because everything is "static" (you're the only one who ever changes it, and only once a week), you don't have to worry about any lock issues or timeouts. For MySQL, this depends on InnoDB. PostgreSQL does it, and SQL Server calls it "snapshotting," and I can't remember the details off the top of my head since I rarely use the thing.
If you Google "transaction isolation" + the name of whatever database you're using, you'll find appropriate information.
We solved this problem by using PostgreSQL's table inheritance/constraints mechanism.
You create a trigger that auto-creates sub-tables partitioned based on a date field.
This article was the source I used.
Which database server are you using? SQL Server 2005 and above provides an isolation level called "Snapshot". It allows you to open a transaction, do all of your updates, and then commit, all while users of the database continue to view the pre-transaction data. Normally, your transaction would lock your tables and block their queries, but snapshot isolation would be perfect in your case.
More info here: http://blogs.msdn.com/craigfr/archive/2007/05/16/serializable-vs-snapshot-isolation-level.aspx
But it requires SQL Server, so if you're using something else....
Several database systems (since you didn't specify yours, I'll keep this general) do offer the SQL:2003 Standard statement called MERGE which will basically allow you to
insert new rows into a target table from a source which don't exist there yet
update existing rows in the target table based on new values from the source
optionally even delete rows from the target that don't show up in the import table anymore
SQL Server 2008 is the first Microsoft offering to have this statement - check out more here, here or here.
Other database systems will probably have similar implementations - it's a SQL:2003 Standard statement after all.
Marc
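For databases without MERGE, the same three steps can be approximated with an upsert plus a delete. The sketch below uses Python's sqlite3 and SQLite's INSERT ... ON CONFLICT DO UPDATE syntax (available in SQLite 3.24 and later), which is a substitute for MERGE and not the standard statement itself; MySQL's closest equivalent is INSERT ... ON DUPLICATE KEY UPDATE. The tables target and import_rows and the function merge_import are hypothetical.

```python
import sqlite3

def merge_import(conn):
    """Make target match import_rows: insert new, update existing, delete missing."""
    with conn:
        # Insert new rows and update existing ones in one statement.
        # "WHERE true" keeps SQLite's parser from misreading ON CONFLICT
        # as part of a join in the SELECT.
        conn.execute(
            "INSERT INTO target (id, val) "
            "SELECT id, val FROM import_rows WHERE true "
            "ON CONFLICT(id) DO UPDATE SET val = excluded.val"
        )
        # Optional third step: drop rows that no longer appear in the import.
        conn.execute(
            "DELETE FROM target WHERE id NOT IN (SELECT id FROM import_rows)"
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("CREATE TABLE import_rows (id INTEGER PRIMARY KEY, val TEXT)")
with conn:
    conn.executemany("INSERT INTO target VALUES (?, ?)", [(1, "old"), (2, "stale")])
    conn.executemany("INSERT INTO import_rows VALUES (?, ?)", [(1, "new"), (3, "added")])

merge_import(conn)
merged = conn.execute("SELECT id, val FROM target ORDER BY id").fetchall()
```

Both statements run in one transaction, so readers see either last week's data or this week's, never a half-applied import.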
Use different table names (mytable_[yyyy]_[wk]) and a view to provide a constant name (mytable). Once a new table is completely imported, update your view so that it uses that table.
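A small sketch of the view-swap idea, again in Python with sqlite3. The weekly table names mytable_2024_01 and mytable_2024_02 and the function point_view_at are invented for illustration. SQLite needs a DROP VIEW followed by CREATE VIEW inside a transaction; MySQL and PostgreSQL offer CREATE OR REPLACE VIEW for the same purpose.

```python
import sqlite3

def point_view_at(conn, table):
    """Repoint the stable name mytable at a freshly loaded weekly table."""
    # Identifiers cannot be bound as parameters; only pass trusted table names.
    conn.execute("BEGIN")
    try:
        conn.execute("DROP VIEW IF EXISTS mytable")
        conn.execute(f"CREATE VIEW mytable AS SELECT * FROM {table}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise

# isolation_level=None: only the explicit BEGIN/COMMIT above open transactions.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE mytable_2024_01 (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO mytable_2024_01 VALUES (1, 'week 1 data')")
point_view_at(conn, "mytable_2024_01")
week1 = conn.execute("SELECT val FROM mytable").fetchall()

# Next week's import goes into its own table, then the view is switched over.
conn.execute("CREATE TABLE mytable_2024_02 (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO mytable_2024_02 VALUES (1, 'week 2 data')")
point_view_at(conn, "mytable_2024_02")
week2 = conn.execute("SELECT val FROM mytable").fetchall()
```

Queries always go through mytable, so application code never needs to know which weekly table is current, and older tables can be kept around for rollback.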