Decentralized synchronized secure data storage - mysql

Introduction
Hi, I am going to ask a question which seems utopian to me, but I need to know if there is a way to achieve what I need. And if not, I need to know why not.
The idea
Suppose I have a database structure in MySQL.
I want to create a solution that allows anyone (no matter who, no matter where) to have a synchronized copy (an up-to-date clone) of this database, content included.
And it is not going to be just one synchronized copy: there could (and should) be many replicas (as a baseline, say ten copies all over the world).
And, the most important thing: it must be secure. By secure I mean that only genuinely accepted transactions will be synchronized with all the other database copies/clones, no matter how many there are.
Note: Since it would be quite difficult to make the synchronization in real-time, I will design everything to make this feature dispensable. So it is not required.
My auto-suggestion
This is how I am thinking to manage it:
Time identifiers and update checking: Every action (insert, update, delete...) will be stored as the action instruction itself, associated with a time identifier. [Rather than a DATETIME field, I think it should be an INT one holding the number of milliseconds elapsed since 1 January 2013, for example.] Each copy then asks its "neighbour copy" for new actions performed since its last update, and executes them after checking that they are allowed.
Problem 1: the "neighbour copy" could be outdated too.
Solution 1: do not ask just one neighbour; build a random list of some of the copies/clones and ask them for news (I could skip the list and ask ALL the clones for updates, but this would be inefficient if the number of clones grows too large).
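The log-and-poll idea described so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Clone` class, `pull_updates`, and the `is_allowed` check are hypothetical names, each clone is reduced to an in-memory list of `(time_id, sql)` pairs, and the epoch is assumed to be midnight UTC on 1 January 2013.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical epoch taken from the question: 1 January 2013 (UTC is an assumption).
EPOCH = datetime(2013, 1, 1, tzinfo=timezone.utc)


def ms_since_epoch(moment):
    """Whole milliseconds between EPOCH and a timezone-aware datetime."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


class Clone:
    """One database copy, reduced to its action log: a list of (time_id, sql)."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def record(self, time_id, sql):
        self.log.append((time_id, sql))
        self.log.sort()

    def actions_since(self, time_id):
        return [entry for entry in self.log if entry[0] > time_id]

    def last_time_id(self):
        return self.log[-1][0] if self.log else -1


def pull_updates(clone, all_clones, sample_size, is_allowed, rng=random):
    """Ask a random sample of neighbours for actions newer than our own log."""
    others = [c for c in all_clones if c is not clone]
    neighbours = rng.sample(others, min(sample_size, len(others)))
    since = clone.last_time_id()
    known = set(clone.log)
    for neighbour in neighbours:
        for entry in neighbour.actions_since(since):
            # Only actions that pass the local check are applied.
            if entry not in known and is_allowed(entry[1]):
                clone.record(*entry)
                known.add(entry)
```

Two caveats on this sketch. Using the newest local time identifier as the cut-off means an older action that arrives late is never picked up. Also, a MySQL INT is 32 bits, so a millisecond counter would overflow within weeks; a BIGINT column would be needed for the time identifier.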
Problem 2: Real-time global synchronization is not active. What if...
Someone at CLONE_ENTERPRISING inserts a row into TABLE.
... this row goes to every clone ...
Someone at CLONE_FIXEMALL deletes this row.
... and at the same time, somewhere in an outdated clone ...
Someone at CLONE_DROPOUT edits this row (now nonexistent at the other clones)
Solution 2: easy stuff, force a GLOBAL synchronization before doing any new "depending-on-third-data action" (edit, for example). This global synch. will be unnecessary when making an INSERT, for instance.
Note: Well, someone could have some fun, and make the same insert in two clones... since they're not getting updated in real-time, this row will exist twice. But, it's the same as when we have one single database, in some needed cases we check if there is an existing same-row before doing the final action. Not a problem.
Problem 3: Someone could edit the code so that it no longer filters actions, and then spread instructions to delete everything, or just do some trolling. This is not a problem, since good clones will always exist somewhere. Clones that have gone bad simply won't be of interest anymore.
I really appreciate it if you read this far. I know this is not the perfect solution and it probably has hundreds of holes, but it is my starting point. I will appreciate anything you can teach me. Thanks a lot.
PS.: It could be that what I am attempting already exists and has its own name. Sorry for asking if so (I'd still be grateful for the name, if it exists).

I would suggest a look at Sync Framework from Microsoft. It might be better suited to SQL Server but it should work with MySQL too. The problem you are tackling is quite a complex one.

Related

SQL: Trying to understand how to safely access and modify a database concurrently

So, I'm working in MySQL at the moment, but any SQL answers will probably do, cuz I'm trying to understand the general concepts.
So thread safety is obviously important in concurrent environments. I program primarily in Java and I'm always extremely careful to write code that guards its mutable state to avoid thread conflicts.
In SQL, though, I'm very confused about how to achieve that same level of safety. So I'm gonna start with what I do know, go on to what I'm confused about, and take it from there.
First, what I do know is transactions. Disable auto commit, use savepoints, rollbacks, etc. Transactions, as I understand them, are atomic at the point of committing them.
But I've also seen references to explicit locking statements and concurrency models (optimistic,pessimistic). And I don't really get where all that fits in. I also don't want to just use transactions for everything and assume it'll be safe. I don't write code unless I understand it in its entirety, I don't want to leave anything to chance.
Moreover, what about triggers, procedures, etc. How do I use them with transactions? How do I ensure atomicity there?
I feel like I'm overcomplicating this a bit, but I'm looking for a comprehensive, clear-cut explanation as to how to ensure that multiple threads and users can modify the database safely. Not quite an ELI5, since I understand SQL better than that, but something that really thoroughly explains the process.
Thanks. I haven't found a good match for this question on this site in my search, but if it is a duplicate I apologize and simply ask that a link to the appropriate answer be provided before this question is locked.

MySQL - What happens when multiple queries hit the database

I am working on a project which will be used by around 500 employees in my organization. Currently it's still in the development phase, and very few people (around 10) are using it. I'm using MySQL. I just want to know what happens if many users are doing front-end edits and then save at the same point in time. Some SELECT queries that I've written take as long as 6 seconds to execute. As only one query can be executed at any point in time, if a query is already in progress and another hits the database, will it create a problem? If this is a common situation in large-scale projects, please let me know how I can handle it. I'm not sure if I've made myself clear :). Any advice or links will be very helpful.
From a technical standpoint, no: nothing bad will happen. The database won't go ballistic and die on you; databases are made for concurrent access.
From a logical point of view, something bad will happen. If two people edit the same thing at the same time and then post it at the same time, the changes get saved to the hard drive one after another. The last one to save is the one whose updates end up on the HDD, effectively causing the first person to lose their changes.
You can approach this problem from several angles. Some projects introduce the concept of locking (not table locking but in-app locking). It revolves around marking a record as locked using a boolean column; if anyone else tries to access that record for updating, the software says that someone else is editing it. It's really difficult to implement and most of the time it doesn't work as expected (I vaguely remember Joomla! using something like that; it was one of the most annoying features ever).
The other option you have is to save each update as a revision. That way you can keep track of who updated what and when, and you never lose a record that would otherwise have been overwritten. I believe that SO and Wikipedia use that approach, and it works really well because you can inspect what two or more people have done and merge their contributions.
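As a rough illustration of the revision approach, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for MySQL. The `article_revision` table and the helper names are assumptions made for the example, not anything from the original project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE article_revision (
        revision_id INTEGER PRIMARY KEY AUTOINCREMENT,
        article_id  INTEGER NOT NULL,
        author      TEXT NOT NULL,
        body        TEXT NOT NULL,
        created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
""")


def save_revision(conn, article_id, author, body):
    """Never UPDATE in place: every save appends a new row."""
    with conn:
        conn.execute(
            "INSERT INTO article_revision (article_id, author, body) "
            "VALUES (?, ?, ?)",
            (article_id, author, body),
        )


def current_body(conn, article_id):
    """The newest revision is what readers see."""
    row = conn.execute(
        "SELECT body FROM article_revision WHERE article_id = ? "
        "ORDER BY revision_id DESC LIMIT 1",
        (article_id,),
    ).fetchone()
    return row[0] if row else None


def history(conn, article_id):
    """Every earlier save is still available for inspection or merging."""
    return conn.execute(
        "SELECT author, body FROM article_revision WHERE article_id = ? "
        "ORDER BY revision_id",
        (article_id,),
    ).fetchall()


save_revision(conn, 1, "alice", "first draft")
save_revision(conn, 1, "bob", "bob's edit")
```

Both saves survive: `current_body` returns the latest text while `history` still lists the earlier one, so nothing is silently lost when two people save close together.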
Optimistic Concurrency Control
http://en.wikipedia.org/wiki/Optimistic_concurrency_control
Make sure that each record contains date metadata on its last changed/modified time, and load that as part of your data object. Then, when attempting to commit the row to the database, check the last_modified time in the table to ensure that it is the SAME as the one stored in memory for your object. If it matches, commit it; otherwise throw an exception.
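A minimal sketch of that check, using Python's built-in `sqlite3` as a stand-in for MySQL. The `employee` table, the integer `last_modified` values, and the `StaleRecordError` exception are assumptions for illustration; the comparison is folded into the UPDATE's WHERE clause so the check and the write happen in one statement.

```python
import sqlite3


class StaleRecordError(Exception):
    """Raised when the row changed after it was loaded."""


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, last_modified INTEGER)"
)
conn.execute("INSERT INTO employee VALUES (1, 'Ann', 100)")
conn.commit()


def load(conn, emp_id):
    row = conn.execute(
        "SELECT id, name, last_modified FROM employee WHERE id = ?", (emp_id,)
    ).fetchone()
    return {"id": row[0], "name": row[1], "last_modified": row[2]}


def save(conn, record, now):
    """Write only if last_modified is still the value we loaded."""
    with conn:  # commits on success, rolls back if we raise
        cur = conn.execute(
            "UPDATE employee SET name = ?, last_modified = ? "
            "WHERE id = ? AND last_modified = ?",
            (record["name"], now, record["id"], record["last_modified"]),
        )
        if cur.rowcount != 1:
            raise StaleRecordError(f"employee {record['id']} was changed by someone else")
    record["last_modified"] = now
```

If two users load the same row and both call `save`, the first succeeds and the second gets `StaleRecordError`, at which point the application can reload and ask the user to reapply the edit.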

Is a 'blackhole' table evil?

Reading this question, I've just learned of the existence of the blackhole table trick: it basically consists of using a single table to insert data, plus a trigger that splits the data into many other tables.
I'm wondering whether this could cause problems, provided the developers working on the project are aware of it.
What are the pros and cons of this technique?
Edit:
The thought that struck me when I saw the example is about transactions: if for some reason the transaction fails, you'll find the blackhole row with the original data, for historical purposes and maybe as a debugging aid. But this seems to be the only +1 I can see with blackholes. Ideas?
I don't think blackhole has any real pros.
Writing the trigger code to move data around is probably not noticeably less work than writing the code to insert the data in the right place in the first place.
As Christian Oudard writes, it doesn't reduce complexity - just moves it to a place where it's really hard to debug.
On the downside:
"Side effects" are usually a bad idea in software development. Triggers are side effects - I intend to do one thing (insert data in a table), and it actually does lots of other things. Now, when I'm debugging my code, I have to keep all the side effects in my head too - and the side effects could themselves have side effects.
Most software spends far more time in maintenance than it does in development. Bringing new developers into the team and explaining the black hole trick is likely to steepen the learning curve, for negligible benefit (in my view).
Because triggers are side effects, and it's relatively easy to set off a huge cascade of triggers if you're not careful, I've always tried to design my databases without a reliance on triggers; where triggers are clearly the right way to go, I've only let my most experienced developers create them. The black hole trick makes triggers into a normal, regular way of working. This is a personal point of view, of course.
The original question that prompted yours does not get at the heart of MySQL's "blackholes."
What is a BLACKHOLE?
In MySQL-speak, BLACKHOLE is a storage engine that simply discards all data INSERTed into it, analogous to a null device. There are a number of reasons to use this backend, but they tend to be a bit abstruse:
A "relay-only" binlog-filtering slave: see the docs, and here and here.
Benchmarking: e.g., measuring the overhead of binary logging without worrying about storage engine overhead.
Various computational tricks: see here.
If you don't know why you need a data sink masquerading as a table, don't use it.
What is the technique you are asking about?
The use under consideration seems to be to:
1. redirect INSERTed data to other tables
2. audit log the original INSERTion action
3. discard the original INSERT data
Thus the answer to the question of "evilness" or pros/cons is the same as the answer to those questions for insertable/updatable VIEWs (the common way to implement #1), trigger-based audit logging (how most people do #2) and behavioral overrides/counteractions generally (there are a number of ways to accomplish #3).
So, what is the answer?
The answer is, of course, "sometimes these techniques are appropriate and sometimes not." :) Do you know why you're doing it? Is the application a better place for this functionality? Is the abstraction too brittle, too leaky, too rigid, etc.?
This doesn't look like a good idea. If you're trying to keep the front end code simple, why not just use a stored procedure? If it's not to keep the front end code simple, I don't understand the point at all.
Funnily enough I learnt about the existence of blackholes today too.
Arguably the question here is actually a broader one i.e. whether or not business logic should be embedded in database triggers or not. In this instance the blackhole table is essentially being used as a transient data store that the trigger on the blackhole table can make use of. Should the trigger be used in the first place? To me that is the real meat of the question.
Personally I feel that the use of triggers should be restricted to logging and DBA-specific tasks only and should not contain business logic (or any logic for that matter) that should belong firmly in the application layer. It appears as though there have been quite a few opinions expressed about whether database triggers are evil or not. I think your question kinda falls into that category too.
Embedding application layer logic in database triggers can be risky. It is likely to end up splitting business logic between application code and the database. This can be very confusing indeed for somebody trying to support and get their head into a code base.
If you end up with too much logic in triggers, and indeed stored procedures, you can easily end up with performance issues on your database server that could have, indeed should have been addressed by distributing the heavy duty processing tasks i.e. complex business logic among application servers and leaving the database server free for its primary purpose i.e. serving data.
Just my two bits' worth of course!
Each time you insert a row into a table, the odds are that you are writing to the same area of the hard drive or the same page (in MS-SQL world, I don't know about postgresql), so this technique will likely lead to contention and locking as all transactions are now competing to write to the same table.
Also this will halve insert performance since inserts require two inserts instead of one.
And this is denormalization since there are now two copies of the data instead of one.
Please don't do this. This doesn't reduce complexity, it just moves it around. This sort of logic belongs in the application layer, where you can use a nicer language like PHP, Python, or Ruby to implement it.
Don't do this. The fact that it's called a trick and not a standard way of doing something says enough for me.
This totally kills the normal usage pattern of the relational model. Not sure that it actually kills normal form as you can still have that all in place. It's just messing with the way data is making it to the destination tables. Looks like a performance nightmare on top of a maintenance nightmare. Imagine one table having a trigger that has to fire for 1,800 plus table inserts for example. That just makes me feel sick.
This is an interesting parlor trick, nothing more.
I would suppose that this would be quite slow, as the advantages of "bulk inserts" cannot be used.

Generate general schema of a huge unknown database

I am required to make a general schema of a huge database that I have never used.
The problem is that I do not know how or where to start because, size aside, I have no idea what each table is for. I can guess some, but in the majority of them the generic field names tell me nothing.
Do you have any advice? What could I do?
There is no documentation about the database and the creators are not able to help me because they are in another company now.
Thank you very much in advance.
This isn't going to be easy.
Start by gathering any documentation, notes, etc. that exist. Also, it'll greatly help to have a thorough understanding of the type of data being stored, and of the application. Keep ample notes of your discoveries, and build the documentation that should have been built before.
If your database contains declared foreign keys, you can start there and at least get down the relationships between the tables, keeping in mind that this may be incomplete. As #John Watson points out, if the relationships are declared, there are tools to do this for you.
Check for stored functions and procedures, including triggers, though these are somewhat uncommon in MySQL databases. Triggers especially will often yield clues ("every update to table X inserts a new row to table Y" -> "table Y is probably a log or audit table").
Some of the tables are hopefully obvious, and if you know what is related to them, you may be able to start figuring out those related tables.
Hopefully you have access to application code, which you can grep and read to find clues. Access to a test environment which you can destroy repeatedly will be useful too ("what happens if I change this in the app, where does the database change?"; "what happens if I scramble these values?"; etc.). You can dump tables and use diff on them, provided you dump them ordered by primary or unique key.
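The dump-and-diff step can be sketched as follows. This uses Python's built-in `sqlite3` and `difflib` as stand-ins (with MySQL you would more likely dump with your usual tooling and run `diff`); the `account` table, its columns, and the `dump_table` helper are made up for the example.

```python
import difflib
import sqlite3


def dump_table(conn, table, key):
    """Return one text line per row, ordered by key so dumps are comparable."""
    # table and key are interpolated directly; only pass trusted names here.
    rows = conn.execute(f"SELECT * FROM {table} ORDER BY {key}")
    return [repr(row) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, status INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 3), (2, 3)])

before = dump_table(conn, "account", "id")

# Stand-in for "change this in the app and see where the database changes".
conn.execute("UPDATE account SET status = 7 WHERE id = 2")

after = dump_table(conn, "account", "id")

# Keep only the added/removed rows, dropping the diff's header lines.
changes = [
    line
    for line in difflib.unified_diff(before, after, lineterm="", n=0)
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(changes)
```

Ordering by the key matters: without it, the same rows can come back in a different order and the diff fills up with noise.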
Doing queries like SELECT DISTINCT foo FROM table can help you see what different things can be in a column.
If it's possible to start from a mostly-empty database (e.g., minimal to get the app to run), you can observe what changes as you add data to the app. It's much quicker to dump the database when it's small. Same for diffing it, same for reading through the output. Some things are easier to understand in a tiny database, but some things are more difficult. When you have a huge dataset and a column is always 3, you can be much more confident it always is.
You can watch SQL traffic from the application(s) to get an idea of what tables and columns they access for each function, and how they join them. Watching SQL traffic can be done in application-specific ways (e.g., DBI trace) or server-specific ways (turn on the general query log) or with a packet tracer like Wireshark or tcpdump. Which is appropriate is going to depend on the environment you're working in. E.g., if you have to do this on a production system, you probably want Wireshark. If you are doing this in dev/test, the disadvantage of the MySQL query log is that all the apps may very well be mixed together, and if multiple people are hitting the apps it'll get confusing. The app-specific log probably won't suffer from this, but of course the app may not have that.
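As an example of the application-specific route, many database drivers can report each statement as it runs. The sketch below uses `set_trace_callback` from Python's built-in `sqlite3` module purely as a stand-in; for a MySQL application the equivalent would be whatever logging its driver offers, or the server's general query log. The `orders` table is hypothetical.

```python
import sqlite3

seen = []  # every statement the connection runs is appended here

conn = sqlite3.connect(":memory:")
conn.set_trace_callback(seen.append)

# Stand-in for the application doing its normal work.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("SELECT id FROM orders WHERE customer_id = 42").fetchall()

# Which statements touched the table we are curious about?
touching_orders = [sql for sql in seen if "orders" in sql]
for sql in touching_orders:
    print(sql)
```

Collecting the statements per application function (one trace per click or report) gives a map of which tables and columns each feature reads and writes.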
Keep in mind the various ways data can be stored. For example, all four of these could mean May 1, 1980:
1980-05-01: as a DATE, TIMESTAMP, or text.
2444360.5: Julian day (includes a time; this value specifies midnight)
44360: Modified Julian day
326001600: UNIX timestamp (seconds since Jan 1 1970 UTC; includes a time, here midnight, assuming local time is US Eastern Time)
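These encodings can be derived from the calendar date with a few lines of Python. This is only a sketch: the UTC-4 offset is hard-coded on the assumption that US Eastern Time was on daylight saving time on that date, and the variable names are illustrative.

```python
from datetime import date, datetime, timedelta, timezone

day = date(1980, 5, 1)

# Modified Julian day counts whole days from 1858-11-17.
mjd = (day - date(1858, 11, 17)).days

# Julian day at midnight UTC is the modified Julian day plus 2400000.5.
jd = mjd + 2400000.5

# Assumption: US Eastern was on daylight time (UTC-4) on this date.
eastern_daylight = timezone(timedelta(hours=-4))
unix_ts = int(datetime(1980, 5, 1, tzinfo=eastern_daylight).timestamp())

print(day.isoformat(), jd, mjd, unix_ts)
# prints: 1980-05-01 2444360.5 44360 326001600
```

Running the same arithmetic in reverse on a mystery numeric column is a quick way to test whether it might be a date in disguise.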
There may be things in the database which are denormalized, and some of them may be denormalized incorrectly. E.g., you may be wondering "why does this user have a first name Bob in one table, and a first name Joe in another?" and the answer is "data corruption".
There may be columns that aren't used. There may be entire tables that aren't used. Despite this, they may still have data from older versions of the app (or other, no-longer-in-use apps), queries run from the MySQL console, etc.
There may be things which aren't visible in the application anywhere, but are used. Their purpose may be completely non-obvious without knowledge of the algorithms implemented in the app(s). For example, a search function in an app may store all kinds of precomputed information about the documents to search and their connections. Worse, these tables may only be updated by batch jobs, so changing a document won't touch them (making you mistakenly believe they have nothing to do with documents). Then, you come in the next morning, and the table is mysteriously very different. Though, in the search case, a query log when running search will tell you.
Try using the free MySQL Workbench (it's specific to MySQL).
I have reverse engineered databases this way and also ended up with great Entity Relationship Diagrams!
I've worked with SQL for 20 years and this product really is great (it's free, from the mysql folks themselves).
It can have occasional problems, crashes, etc. (at least it did on Ubuntu 10), but they've been relatively rare and far outweighed by the benefits! It's also actively developed, so bugs are actually fixed on an ongoing basis.
Assuming that nobody bothered to declare foreign keys in the table definitions, and the database belongs to an application which is in use, after grabbing the current schema the next step for me would be to enable logging of all queries (hoping that the application does NOT use a trivial ORM like [x]hibernate) to identify joins and data semantics.
This Perl script may be helpful.

MS Access antiquated? Anything new in 2011?

Our company has a database of 17,000 entries. We have used MS Access for over 10 years for our various mailings. Is there something new and better out there? I'm not a techie, so keep that in mind when answering. Our problems with Access are:
- no record of what was deleted,
- will not turn up a name in a search if caps or punctuation are not entered exactly,
- the de-duping process is complicated for us to understand,
- we'd like a more nimble program that we can access from more than one dedicated computer.
The only applications I know of that are comparable to Access are FileMaker Pro, and the database component of the Open Office suite. FM Pro is a full-fledged product and gets good marks for ease of use from non-technical users, while Base is much less robust and is not nearly as easy for creating an application.
All of the answers recommending different databases really completely miss the point here -- the original question is about a data store and application builder, not just the data store.
To the specific problems:
PROBLEM 1: no record of what was deleted
This is a design error, not a flaw in Access. There is no database that really keeps a record of what's deleted unless someone programs logging of deleted data.
But backing up a bit, if you are asking this question it suggests that you've got people deleting things that shouldn't be deleted. There are two solutions:
regular backups. That would mean you could restore data from the last backup and likely recover most of the lost data. You need regular backups with any database, so this is not really something that is specific to Access.
design your database so records are never deleted, just marked deleted and then hidden in data entry forms and reports, etc. This is much more complicated, but is very often the preferred solution, as it preserves all the data.
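The "mark as deleted" design can be sketched like this, using Python's built-in `sqlite3` as a neutral stand-in for Access or any other database. The `contact` table, the `deleted_at` column, and the helper names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contact (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        deleted_at TEXT              -- NULL means the record is live
    )
""")
conn.executemany(
    "INSERT INTO contact (id, name) VALUES (?, ?)",
    [(1, "Ann Lee"), (2, "Bob Ray")],
)


def soft_delete(conn, contact_id):
    """'Deleting' only stamps the row; the data itself stays in the table."""
    with conn:
        conn.execute(
            "UPDATE contact SET deleted_at = CURRENT_TIMESTAMP "
            "WHERE id = ? AND deleted_at IS NULL",
            (contact_id,),
        )


def visible_contacts(conn):
    """What data entry forms and reports should show."""
    return [r[0] for r in conn.execute(
        "SELECT name FROM contact WHERE deleted_at IS NULL ORDER BY id")]


def deleted_contacts(conn):
    """The record of what was deleted, available for review or restore."""
    return [r[0] for r in conn.execute(
        "SELECT name FROM contact WHERE deleted_at IS NOT NULL ORDER BY id")]


soft_delete(conn, 2)
```

The extra complication mentioned above is that every form, report, and query has to remember the `deleted_at IS NULL` filter.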
Problem #2: will not turn up a name in a search if cap's or punctuation is not entered exactly
There are two parts to this, one of which is understandable, and the other of which makes no sense.
punctuation -- databases are stupid. They can't tell that Mr. and Mister are the same thing, for instance. The solution to this is that for all data that needs to be entered in a regularized fashion, you use all possible methods to ensure that the user can only enter valid choices. The most common control for this is a dropdown list (i.e., "combo box"), which limits the choices the user has to the ones offered in the list. It ensures that all the data in the field conforms to a finite set of choices. There are other ways of maintaining data regularity and one of those involves normalization. That process avoids the issue of repeatedly storing, say, a company name in multiple records -- instead you'd store the companies in a different table and just link your records to a single company record (usually done with a combo box, too). There are other controls that can be used to help ensure regularity of data entry, but that's the easiest.
capitalization -- this one makes no sense to me, as Access/Jet/ACE is completely case-insensitive. You'll have to explain more if you're looking for a solution to whatever problem you're encountering, as I can't conceive of a situation where you'd actually not find data because of differences in capitalization.
Problem #3: is complicated for us to understand the de-duping process
De-duping is a complicated process, because it's almost impossible for the computer to figure out which record among the candidates is the best one to keep. So, you want to make sure your database is designed so that it is impossible to accidentally introduce duplicate records. Indexing can help with this in certain kinds of situations, but when mailing lists are involved, you're dealing with people data which is almost impossible to model in a way where you have a unique natural key that will allow you to eliminate duplicates (this, too, is a very complicated topic).
So, you basically have to have a data entry process that checks the new record against the existing data and informs the user if there's a duplicate (or near match). I do this all the time in my apps where the users enter people -- I use an unbound form where they type in the information that is the bare minimum to create a new record (usually some combination of lastname, firstname, company and email), and then I present a list of possible matches. I do strict and loose matching and rank by closeness of the match, with the closer matches at the top of the list.
Then the user has to decide if there's a match, and is offered the opportunity to create the duplicate anyway (it's possible to have two people with the same name at the same company, of course), or instead to abandon adding the new record and instead go to one the existing records that was presented as a possible duplicate.
This leaves it up to the user to read what's onscreen and make the decision about what is and isn't a duplicate. But it maximizes the possibility of the user knowing about the dupes and never accidentally creating a duplicate record.
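A rough sketch of the strict-and-loose matching step, using `difflib.SequenceMatcher` from Python's standard library. The normalisation rules, the 0.75 threshold, and the function names are assumptions chosen for illustration, not the method used in the apps described above.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case and drop punctuation so 'Smith, John' equals 'smith john'."""
    cleaned = text.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def possible_matches(new_entry, existing, threshold=0.75):
    """Return existing entries that might be duplicates, closest first."""
    target = normalise(new_entry)
    scored = []
    for candidate in existing:
        other = normalise(candidate)
        if other == target:
            score = 1.0  # strict match
        else:
            score = SequenceMatcher(None, target, other).ratio()  # loose match
        if score >= threshold:
            scored.append((score, candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored]


existing = ["Smyth, Jon", "Garcia, Maria", "Smith, John"]
print(possible_matches("smith john", existing))
```

The list is only a prompt for the user: the exact match appears first, near misses follow, and the person entering the record still makes the final call.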
Problem #4: We'd like a more nimble program that we can access from more than one dedicated computer.
This one confuses me. Access is multi-user out of the box (and has been from the very beginning, nearly 20 years ago). There is no limitation whatsoever to a single computer. There are things you have to do to make it work, such as splitting your database into two parts, one part with just the data tables, and the other part with your forms and reports and such (and links to the data tables in the other file). Then you keep the back end data file on one of the computers that acts as a server, and give a copy of the front end (the reports, forms, etc.) to each user. This works very well, actually, and can easily support a couple of dozen users (or more, depending on what they are doing and how well your database is designed).
Basically, after all of this, I would tend to second #mwolfe02's answer, and agree with him that what you need is not a new database, but a database consultant who can design for you an application that will help you manage your mailing lists (and other needs) without you needing to get too deep into the weeds learning Access (or FileMaker or whatever). While it might seem more expensive up front, the end result should be a big productivity boost for all your users, as well as an application that will produce better output (because the data is cleaner and maintained better because of the improved data entry systems).
So, basically, you either need to spend money upfront on somebody with technical expertise who would design something that allows you to do better work (and more efficiently), or you need to invest time in upping your own technical skills. None of the alternatives to Access are going to resolve any of the issues you've raised without significant investment in interface design to further the goals you have (cleaner data, easier to find information, etc.).
At the risk of sounding snide, what you are really looking for is a consultant.
In the hands of a capable programmer, all of your issues with Access are easily handled. The problems you are having are not the result of using the wrong tool, but using that tool less than optimally.
Actually, if you are not a techie then Access is already the best tool for you. You will not find a more non-techie friendly way to build a data application from bottom to top.
That said, I'd say you have three options at this point:
Hire a competent database consultant to improve your application
Find commercial off-the-shelf (COTS) software that does what you need (I'm sure there are plenty of products to handle mailings; you'll need to research)
Learn about database normalization and building proper MS Access applications
If you can find a good program that does what you want then #2 above will maximize your Return on Investment (ROI). One caveat is that you'll need to convert all of your existing data, which may not be easy or even possible. Make sure you investigate that before you buy anything.
While it may be the most expensive option up-front, hiring a competent database consultant is probably your best option if you need a truly custom solution.
SQL Server sounds like a viable alternative for your scenario. If cost is a concern, you can always use SQL Server Express, which is free. Full-blown SQL Server provides a lot more functionality that might not be needed right away. Express is a lot simpler, as the number of features provided with it is much smaller. With either version, though, you will have a centralized store for your data and the ability to have all transactions recorded in the transaction log. Also, both have the ability to import data from an Access database.
The newest version of SQL Server is 2008 R2
You probably want to take a look at modern databases. If you're into Microsoft-based products, start with SQL Server Express
EDIT: However, since I understand that you're not a programmer yourself, you'd probably be better off having someone experienced look into your technical problem more deeply, like the other answer suggests.
It sounds like you may want to consider a front-end for your existing Access data store. Microsoft has yet to replace Access per se, but they do have a new tool that is a lot lower on the programming totem pole than some other options. Check out Visual Studio Lightswitch - http://www.microsoft.com/visualstudio/en-us/lightswitch.
It's fairly new (still in beta) but showing potential. With it, just as with any Visual Studio project, you can connect to an MS Access datasource and design a front-end to interact with it. The plus-side here is programming requirements are much lower than with straight-up Visual Studio (read: Wizards).
Given that replacing your Access DB will require some front-end programming, you may look into VistaDB. It should allow your front end to be created in .NET with an XCopy database on the backend without requiring a server. One plus is that it retains SQL Server syntax, so if you do decide to move to SQL Server you'll be one step ahead.
(Since you're not a techie and may not understand my previous statement, you might pass my answer on to the consultant/programmer/database guy who is going to do the work for you.)
http://www.vistadb.net/