Archiving Theory - MySQL

I am about to set up a new database that will need to include archiving of records that are still accessible. Records are all associated with certain projects and when the project is archived, I want the records to stay the same, a snapshot. (e.g. If a contact is associated with an archived project, and they move a year later, I want it to still pull the old address.) The archived records do not need to be updated, but they do need to be accessible.
I have an idea of how to go about this, but I am not sure it is the best approach: keep a duplicate of each table to hold the "archived" rows, and when an item is archived, copy it over and update all the FK/PK relationships - though that seems like a cumbersome process.
Another idea I had was that each item (e.g. a contact) would be assigned a PK, and then each item would get a secondary key that is associated with each project. The main problem with this is that it gets difficult when a contact is updated on a live project: a lot of updates would be required.
Please let me know if you have any questions.
Thank you for the help.

Hm... The only time I came across something like this, the idea was to solve it in the application layer, not in the database.
For instance, for Ruby you can use vestal_versions or paper-trail.
Paper-trail, for instance, stores the versions of all objects as serialized objects in a single table, and works with deltas.
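For a sense of what that looks like, the single versions table paper-trail maintains is shaped roughly like this (a sketch from memory, not authoritative; column names may differ between versions of the gem):

```sql
-- Rough shape of paper-trail's single version-store table (from memory;
-- check the gem's generated migration for the real definition).
CREATE TABLE versions (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  item_type  VARCHAR(191) NOT NULL,   -- the model class, e.g. 'Contact'
  item_id    BIGINT NOT NULL,         -- primary key of the versioned row
  event      VARCHAR(20) NOT NULL,    -- 'create', 'update' or 'destroy'
  whodunnit  VARCHAR(191),            -- who made the change
  object     TEXT,                    -- serialized copy of the row before the change
  created_at DATETIME NOT NULL,
  KEY idx_versions_item (item_type, item_id)
);
```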

What you are looking for is to include temporal data alongside your domain data. This sort of thing isn't a trivial task to undertake and is often the cause of great complication in applications that need to support it.
There are a number of ways to go about doing this, each with their pros and cons, and the way you choose will depend on what you need to do with the temporal element of your data. Some of these include:
Audit trail
- You track the changes made to a record over time, and the primary record reflects its current state
- Reduces data duplication to a minimum
- Likely doesn't fit the "snapshot" model you are looking for easily
Most Current
- You have a record for each "version" of an entity, with a timestamp of when it was created
- Makes it easy to jump back to a point in time
- Makes it easy to "fork" an entity
- Has the most data duplication
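To make the versioned option concrete, here is a minimal MySQL sketch (all table and column names are mine, purely illustrative): every edit of a contact inserts a new version row, and a project is linked to the exact version that was current when it was archived, so the snapshot survives later address changes.

```sql
-- Every edit creates a new version row; old rows are never touched.
CREATE TABLE contact_version (
  contact_version_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  contact_id         BIGINT UNSIGNED NOT NULL,  -- stable identity across versions
  address            VARCHAR(255),
  valid_from         DATETIME NOT NULL,         -- when this version came into effect
  KEY idx_contact (contact_id, valid_from)
);

-- A project points at a specific version, not at the "live" contact,
-- so an archived project keeps seeing the old address.
CREATE TABLE project_contact (
  project_id         BIGINT UNSIGNED NOT NULL,
  contact_version_id BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (project_id, contact_version_id),
  FOREIGN KEY (contact_version_id)
    REFERENCES contact_version (contact_version_id)
);
```

For live projects you could instead join on contact_id and always take the latest version, and only pin project_contact rows to specific versions at archive time; that avoids the mass updates you were worried about in your second idea.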
Martin Fowler has written some articles relating to designing models that deal with temporal data, so I would start there for a good, solid grounding in the topic.


Active Collab 5 Webhooks / Maintaining "metric" data

I have an application I am working on that basically takes the data from Active Collab and creates reports / graphs out of the data. The API itself is insufficient to get the proper data on a per request basis so I resorted to pulling the data down into a separate data set that can be queried more efficiently.
So in order to avoid needing to query the entire API constantly I decided to make use of webhooks in order to make the transformations to the relevant data and lower the need to resync the data.
However, I notice that not all events are sent, notably the following:
TaskListUpdated
MemberUpdated
TimeRecordUpdated
ProjectUpdated
There are probably more, but these are the main ones I have noticed so far.
Time records are probably the most important: their absence from the webhooks means that almost any application that needs time-record data has a good chance of holding incorrect data. It's fairly common to make a typo in a time record and then adjust it later.
So am I missing anything here? Is there some way to see these events reliably?
EDIT:
In order to avoid a long comment to Ilija I am putting the bulk here.
"Webhooks apart, what information do you need to pull? API that powers time tracking reports can do all sorts of cross project filtering, so your approach to keep a separate database may be an overkill."
Basically we are doing a multi-variable tiered time report. It can be sorted / grouped by any conceivable method you may want to look at.
http://www.appsmagnet.com/product/time-reports-plus/
This is the closest to what we are trying to do, back when we used Active Collab 4 this did the job, but even with it we had to consolidate it in our own spreadsheets.
So the idea of this is to better integrate our Active Collab data into our own workflow.
So the main data we are looking for in this case is:
Job Types
Projects
Task Lists
Tasks
Time Records
Categories
Members / Clients
Companies
These items can feed not only our reports, but many other aspects of our company as well. For us Active Collab is the point of truth, so we want the data quickly accessible and fully query-able.
So I have set up a sync system that initially grabs all the data it can from Active Collab, and then uses a mix of cron jobs and webhooks to keep it up to date.
Cron jobs work well for all aspects that do not have "sub-items"; for the rest (projects/tasks/task lists/time records) I need to rely on the webhooks, since syncing them takes too much time to keep up to date in real time.
For the webhooks I noticed the events above do not carry through. For time records I figured out a workaround, listed in my answer, and members can be handled through the cron. That leaves task list and project updates as the two of real concern. Projects are fairly important, as the budget can change and that is used in reports; task lists have the start/end dates, which could be used as well. Since constantly walking every project/task list to see if anything changed is really not a great idea, I am looking for a way to reliably see updates for them.
I have based this system on https://developers.activecollab.com/api-documentation/ but I know there are at least a few endpoints that are not listed.
Cross-project time-record filtering using Active Collab 5 API
This question is actually from another developer on the same system (and it also shows a TrackingFilter report not listed in the docs). Due to issues with maintaining an accurate data set we had to adapt it. I notice that you (Ilija) are the person replying there, and you did recommend we move over to this style of system.
This is not a total answer but a way to solve the issue with TimeRecordUpdated not going through the webhook.
There is another API endpoint, /whats-new. This endpoint describes changes over roughly the last day, and it has a category called TrackingObjectUpdatedActivityLog that refers to an updated time record.
So I set up a cron job that checks this fairly frequently and manually pushes the TimeRecordUpdated event through my system to keep it consistent.
For MemberUpdated, since a member's data being updated is unlikely to affect much, a daily cron that checks the users seems good enough.
ProjectUpdated could technically be handled the same way, but combined with the absence of TaskListUpdated that leads to far too many API calls to sync the data. Unfortunately I have not found a solution for this yet.

MySQL - What happens when multiple queries hit the database

I am working on a project which will be used by around 500 employees in my organization. Currently it's still in the development phase, and very few people (around 10) are using it. I'm using MySQL. I just want to know: what happens if many users make front-end edits and then save at the same point in time? Some SELECT queries that I've written take as long as 6 seconds to execute. If only one query can be executed at a time, and another hits the database while one is already in progress, will that create a problem? If this is a common situation in large-scale projects, please let me know how I can handle it. I'm not sure if I've made myself clear :). Any advice or links will be very helpful.
From a technical aspect, no - nothing bad will happen. The database won't go ballistic and die on you; databases are made exactly for purposes like concurrent access.
From a logical point of view, something bad will happen. If two people edit the same thing at the same time and then save at the same time, the writes hit the disk one after another. The last one to save is the one whose updates end up stored, effectively causing the first person to lose their changes.
You can approach this problem from several angles. Some projects introduce the concept of locking (not table locking, but in-app locking). It revolves around marking a record as locked, using a boolean column, so that if anyone else tries to open that record for updating, the software says someone else is editing it. It's really difficult to implement and most of the time it doesn't work as expected (I vaguely remember Joomla! using something like that; it was one of the most annoying features ever).
The other option you have is to save each update as a revision. That way you can keep track of who updated what and when, and you never lose any records that would otherwise get overwritten. I believe that SO and Wikipedia use that approach, and it works really well because you can inspect what two or more people have done and merge their contributions.
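A minimal MySQL sketch of the revision idea (the article table and every name here are illustrative): each save inserts a new revision instead of overwriting, so concurrent editors can't destroy each other's work, and their contributions can be inspected and merged later.

```sql
CREATE TABLE article (
  article_id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  current_revision_id BIGINT UNSIGNED NULL  -- points at the latest revision
);

-- Every save is an INSERT here; nothing is ever UPDATEd or DELETEd.
CREATE TABLE article_revision (
  revision_id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  article_id  BIGINT UNSIGNED NOT NULL,
  body        TEXT NOT NULL,
  edited_by   VARCHAR(100) NOT NULL,
  edited_at   DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  FOREIGN KEY (article_id) REFERENCES article (article_id)
);
```

If two people save at once, both revisions are stored; the current_revision_id pointer is still last-writer-wins, but the "lost" edit remains in the history and can be merged by hand.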
Optimistic Concurrency Control
http://en.wikipedia.org/wiki/Optimistic_concurrency_control
Make sure that each record carries metadata with its last changed/modified time, and load that as part of your data object. Then, when attempting to commit the row to the database, check the last_modified time in the table to ensure that it is the SAME as the one stored in memory for your object. If it matches, commit it; otherwise throw an exception.
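In MySQL terms the check can be folded into the UPDATE itself; a minimal sketch, assuming an illustrative contact table with a last_modified column:

```sql
-- The WHERE clause matches only if nobody has changed the row since we read it.
UPDATE contact
SET    address       = '12 New Street',
       last_modified = NOW()
WHERE  contact_id    = 42
  AND  last_modified = '2011-06-01 10:15:00';  -- value loaded with the object

-- If the affected-row count comes back as 0, someone else saved first:
-- reload the record, show the conflict to the user, and let them retry.
```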

Decentralized synchronized secure data storage

Introduction
Hi, I am going to ask a question which seems utopian to me, but I need to know if there is a way to achieve what I need. And if not, I need to know why not.
The idea
Suppose I have a database structure, in MySql.
I want to create a solution that allows anyone (no matter who, no matter where) to have a synchronized copy (an updated clone) of this database, with its content.
And it is not going to be just one synchronized copy; it could (and should) be replicated multiple times (say, ten copies all over the world).
And, most important: it must be secure. By secure I mean that only real, accepted transactions are synchronized to all the other database copies/clones (no matter how many there are).
Note: since it would be quite difficult to do the synchronization in real time, I will design everything to make this feature dispensable, so it is not required.
My auto-suggestion
This is how I am thinking to manage it:
Time identifiers and update checking: every action (insert, update, delete...) will be stored as the action instruction itself, associated with a time identifier. [Rather than a DATETIME field, I think it should be a BIGINT one, holding the number of milliseconds elapsed since 1 January 2013, for example; milliseconds overflow a 32-bit INT within a few weeks.] Each copy will then ask a "neighbour copy" for the actions done since its last update, and execute them after checking that they are allowed (a sketch of such a log is below).
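A minimal MySQL sketch of that action log (all names are mine, purely illustrative):

```sql
-- One row per action, keyed by the time identifier plus the clone that
-- produced it. BIGINT is needed: milliseconds since 2013 overflow INT fast.
CREATE TABLE action_log (
  action_ms   BIGINT NOT NULL,       -- ms elapsed since 2013-01-01 00:00:00
  origin_node VARCHAR(50) NOT NULL,  -- which clone produced the action
  statement   TEXT NOT NULL,         -- the insert/update/delete instruction itself
  PRIMARY KEY (action_ms, origin_node)
);

-- What a clone asks a neighbour: "everything since my last sync".
SELECT action_ms, origin_node, statement
FROM   action_log
WHERE  action_ms > 123456789  -- the asking clone's last-seen time identifier
ORDER  BY action_ms;
```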
Problem 1: the "neighbour copy" could be outdated too.
Solution 1: do not ask just one neighbour; build a random list of some of the copies/clones and ask them for news (I could skip the list and ask ALL the clones for updates, but this becomes inefficient if the number of clones grows too large).
Problem 2: Real-time global synchronization is not active. What if...
Someone at CLONE_ENTERPRISING inserts a row into TABLE.
... this row goes to every clone ...
Someone at CLONE_FIXEMALL deletes this row.
... and at the same time, somewhere in an outdated clone ...
Someone at CLONE_DROPOUT edits this row (now nonexistent in the other clones)
Solution 2: easy enough - force a GLOBAL synchronization before doing any action that depends on existing data (an edit, for example). This global synch will be unnecessary when making an INSERT, for instance.
Note: well, someone could have some fun and make the same insert in two clones... since they're not updated in real time, this row would exist twice. But it's the same as with a single database: where needed, we check whether the same row already exists before doing the final action. Not a problem.
Problem 3: it is possible to edit the code so that actions are not filtered, so someone could spread instructions to delete everything, or just engage in some trolling. This is not a problem, since good clones will always exist somewhere. Clones that have gone bad simply stop being of interest.
I really appreciate it if you have read this far. I know this is not the perfect solution; it possibly has hundreds of holes, but it is my starting point. I will appreciate anything you can teach me. Thanks a lot.
PS: It could be that what I am trying to do already exists and has its own name. Sorry for asking, then (I'd be grateful for the name anyway, if it exists).
I would suggest a look at Sync Framework from Microsoft. It might be better suited to SQL Server but it should work with MySQL too. The problem you are tackling is quite a complex one.

Organizing a MySQL Database

I'm developing an application that will require me to create my first large-scale MySQL database. I'm currently having a difficult time wrapping my mind around the best way to organize it. Can anyone recommend any reading materials that show different ways to organize a MySQL database for different purposes?
I don't want to try getting into the details of what I imagine the database's main components will be because I'm not confident that I can express it clearly enough to be helpful at this point. That's why I'm just looking for some general resources on MySQL database organization.
The way I learned to work these things out is to stop and make associations.
In more object-oriented languages that force OO (I'm assuming you're using PHP?), you learn to think OO very quickly, which is sort of what you're after here.
My workflow is like this:
Work out what data you need to store. (Customer name etc.)
Work out the main objects you're working with (e.g. Customer, Order, Salesperson etc), and assign each of these a key (e.g. Customer ID).
Work out which data connects to which objects. (Customer name belongs to a customer)
Work out how the main objects connect to each other (Salesperson sold order to Customer)
Once you have these, you have a good object model of what you're after. The next step is to look at the connections. For example:
Each customer has only one name.
Each product can be sold multiple times to anybody
Each order has only one salesperson and one customer.
Once you've worked that out, you want to try something called normalization, which is the art of getting this collection of data into a set of tables while minimizing redundancy. (The idea is: one-to-one data, like the customer's name, is stored in the table with the customer ID, while many-to-one, one-to-many and many-to-many relationships are stored in separate tables, following certain rules.)
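A minimal MySQL sketch of where that lands for the customer/order example above (all names illustrative):

```sql
CREATE TABLE customer (
  customer_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name        VARCHAR(100) NOT NULL  -- one-to-one: lives with the customer key
);

CREATE TABLE salesperson (
  salesperson_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name           VARCHAR(100) NOT NULL
);

CREATE TABLE product (
  product_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name       VARCHAR(100) NOT NULL
);

-- Each order has exactly one customer and one salesperson (many-to-one):
CREATE TABLE sales_order (
  order_id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  customer_id    INT UNSIGNED NOT NULL,
  salesperson_id INT UNSIGNED NOT NULL,
  FOREIGN KEY (customer_id)    REFERENCES customer (customer_id),
  FOREIGN KEY (salesperson_id) REFERENCES salesperson (salesperson_id)
);

-- A product can be sold on many orders and an order can hold many products,
-- so the many-to-many gets its own junction table:
CREATE TABLE order_item (
  order_id   INT UNSIGNED NOT NULL,
  product_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (order_id, product_id),
  FOREIGN KEY (order_id)   REFERENCES sales_order (order_id),
  FOREIGN KEY (product_id) REFERENCES product (product_id)
);
```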
That's pretty much the gist of it; if you ask, I'll scan an example sheet from my workflow for you.
Maybe I can provide some advice based on my own experience:
- Unless you have a very specific need (like a full-text index), use the InnoDB storage engine (transactions, row locking, etc.)
- Specify the default encoding - utf8 is usually a good choice
- Fine-tune the server parameters (key_buffer etc.; there is a lot of material on the net)
- Draw your DB schema by hand and discuss it with colleagues and programmers
- Define data types based not only on how the programs use them, but also on the join queries (joins are faster if the types are equal)
- Create indexes based on the expected queries, also to be discussed with the programmers
- Plan a backup solution (based on DB replication, or scripts, etc.)
- Manage users and access rights: grant only what is necessary, and create a read-only user for the majority of queries, which do not need write access (see the sketch after this list)
- Size the server: disks (RAID?), memory, CPU
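For the read-only user mentioned above, a minimal MySQL sketch (account names, network mask and schema name are illustrative):

```sql
-- Account that can only read; use it for every query that doesn't write.
CREATE USER 'app_reader'@'10.0.%' IDENTIFIED BY 'choose-a-strong-password';
GRANT SELECT ON myapp.* TO 'app_reader'@'10.0.%';

-- Separate, narrower account for the code paths that actually write:
CREATE USER 'app_writer'@'10.0.%' IDENTIFIED BY 'another-strong-password';
GRANT SELECT, INSERT, UPDATE, DELETE ON myapp.* TO 'app_writer'@'10.0.%';
```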
Here are also some tips on using and creating a database.
I can recommend the first chapter of this book: An Introduction to Database Systems. It may help you organize your ideas. I'd also certainly recommend not going to 5th normal form but stopping at 4th; this is very important.
If I could only give you one piece of advice, it would be to generate test data at volumes similar to production and benchmark the main queries.
Just make sure that the data distribution is realistic. (Not all people are named "John", and not all people have unique names. Not all people give their phone number, and most people won't have 10 phone numbers either.)
Also, make sure that the test data doesn't fit into RAM (unless you expect the production data volume to fit as well).
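One way to generate such data in MySQL 8 itself, as a sketch (the customer (name, phone) table is hypothetical; note the session setting the recursive CTE needs):

```sql
-- Generate a million customers with realistic skew: repeated names,
-- and a missing phone number for roughly a third of the rows.
SET SESSION cte_max_recursion_depth = 1000000;

INSERT INTO customer (name, phone)
WITH RECURSIVE seq (n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 1000000
)
SELECT
  ELT(1 + (n MOD 7), 'John Smith', 'Mary Jones', 'John Smith',
      'Wei Chen', 'Anna Kowalska', 'John Smith', 'Fatima Khan'),
  IF(n MOD 3 = 0, NULL, CONCAT('555-', LPAD(n MOD 100000, 5, '0')))
FROM seq;
```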

MS Access antiquated? Anything new in 2011?

Our company has a database of 17,000 entries. We have used MS Access for over 10 years for our various mailings. Is there something new and better out there? I'm not a techie, so keep that in mind when answering. Our problems with Access are:
- no record of what was deleted
- it will not turn up a name in a search if caps or punctuation are not entered exactly
- the de-duping process is complicated for us to understand
- we'd like a more nimble program that we can access from more than one dedicated computer
The only applications I know of that are comparable to Access are FileMaker Pro and the database component of the OpenOffice suite. FM Pro is a full-fledged product and gets good marks for ease of use from non-technical users, while Base is much less robust and not nearly as easy to create an application with.
All of the answers recommending different databases completely miss the point here -- the original question is about a data store plus application builder, not just the data store.
To the specific problems:
PROBLEM 1: no record of what was deleted
This is a design error, not a flaw in Access. There is no database that really keeps a record of what's deleted unless someone programs logging of deleted data.
But backing up a bit: if you are asking this question, it suggests that you've got people deleting things that shouldn't be deleted. There are two solutions:
regular backups. That would mean you could restore data from the last backup and likely recover most of the lost data. You need regular backups with any database, so this is not really something that is specific to Access.
design your database so records are never deleted, just marked as deleted and then hidden in data entry forms and reports, etc. This is much more complicated, but is very often the preferred solution, as it preserves all the data (a sketch of the idea follows this list).
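Expressed in SQL terms (the answer is about Access, where the same idea is a Yes/No field plus query criteria; table and column names are illustrative), the mark-as-deleted design looks roughly like this:

```sql
-- A flag instead of a real delete: the row stays for history and recovery.
ALTER TABLE contact ADD COLUMN is_deleted TINYINT(1) NOT NULL DEFAULT 0;

-- "Deleting" a record:
UPDATE contact SET is_deleted = 1 WHERE contact_id = 42;

-- Forms and reports read through a view that hides the deleted rows:
CREATE VIEW contact_live AS
SELECT * FROM contact WHERE is_deleted = 0;
```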
Problem #2: will not turn up a name in a search if caps or punctuation are not entered exactly
There are two parts to this, one of which is understandable, and the other of which makes no sense.
punctuation -- databases are stupid. They can't tell that Mr. and Mister are the same thing, for instance. The solution is that for all data that needs to be entered in a regularized fashion, you use every possible method to ensure that the user can only enter valid choices. The most common control for this is a dropdown list (i.e., a "combo box"), which limits the choices to the ones offered in the list. It ensures that all the data in the field conforms to a finite set of choices.
There are other ways of maintaining data regularity, and one of them involves normalization. That process avoids repeatedly storing, say, a company name in multiple records -- instead you store the companies in a different table and link your records to a single company record (usually done with a combo box, too). There are other controls that can help ensure regularity of data entry, but the dropdown is the easiest.
capitalization -- this one makes no sense to me, as Access/Jet/ACE is completely case-insensitive. You'll have to explain more if you're looking for a solution to whatever problem you're encountering, as I can't conceive of a situation where you'd actually not find data because of differences in capitalization.
Problem #3: is complicated for us to understand the de-duping process
De-duping is a complicated process, because it's almost impossible for the computer to figure out which record among the candidates is the best one to keep. So you want to make sure your database is designed so that it is impossible to accidentally introduce duplicate records. Indexing can help with this in certain kinds of situations, but when mailing lists are involved you're dealing with people data, which is almost impossible to model in a way that gives you a unique natural key for eliminating duplicates (this, too, is a very complicated topic).
So, you basically have to have a data entry process that checks the new record against the existing data and informs the user if there's a duplicate (or near match). I do this all the time in my apps where the users enter people -- I use an unbound form where they type in the information that is the bare minimum to create a new record (usually some combination of lastname, firstname, company and email), and then I present a list of possible matches. I do strict and loose matching and rank by closeness of the match, with the closer matches at the top of the list.
Then the user has to decide if there's a match, and is offered the opportunity to create the duplicate anyway (it's possible to have two people with the same name at the same company, of course), or instead to abandon adding the new record and go to one of the existing records that was presented as a possible duplicate.
This leaves it up to the user to read what's onscreen and make the decision about what is and isn't a duplicate. But it maximizes the possibility of the user knowing about the dupes and never accidentally creating a duplicate record.
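A sketch of that strict-and-loose ranking in SQL terms (in Access the mechanics differ; 'John', 'Smith' and 'Acme' stand in for whatever the user just typed, and all names are illustrative):

```sql
-- Rank candidate duplicates: exact matches first, then looser ones.
SELECT contact_id, firstname, lastname, company,
       CASE
         WHEN lastname = 'Smith' AND firstname = 'John'
              AND company = 'Acme'                     THEN 1  -- strict match
         WHEN lastname = 'Smith' AND company = 'Acme'  THEN 2
         WHEN SOUNDEX(lastname) = SOUNDEX('Smith')     THEN 3  -- loose match
         ELSE 4
       END AS match_rank
FROM  contact
WHERE lastname = 'Smith'
   OR SOUNDEX(lastname) = SOUNDEX('Smith')
ORDER BY match_rank, lastname, firstname;
```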
Problem #4: We'd like a more nimble program that we can access from more than one dedicated computer.
This one confuses me. Access is multi-user out of the box (and has been from the very beginning, nearly 20 years ago). There is no limitation whatsoever to a single computer. There are things you have to do to make it work well, such as splitting your database into two parts: one with just the data tables, and the other with your forms, reports and such (plus links to the data tables in the first file). You then keep the back-end data file on one of the computers, which acts as a server, and give a copy of the front end (the reports, forms, etc.) to each user. This works very well in practice and can easily support a couple of dozen users (or more, depending on what they are doing and how well your database is designed).
Basically, after all of this, I would tend to second #mwolfe02's answer and agree that what you need is not a new database but a database consultant who can design an application that helps you manage your mailing lists (and other needs) without you having to get too deep into the weeds learning Access (or FileMaker or whatever). While it might seem more expensive up front, the end result should be a big productivity boost for all your users, as well as an application that produces better output (because the data is cleaner and better maintained thanks to the improved data entry).
So, basically, you either need to spend money upfront on somebody with technical expertise who would design something that allows you to do better work (and more efficiently), or you need to invest time in upping your own technical skills. None of the alternatives to Access are going to resolve any of the issues you've raised without significant investment in interface design to further the goals you have (cleaner data, easier to find information, etc.).
At the risk of sounding snide, what you are really looking for is a consultant.
In the hands of a capable programmer, all of your issues with Access are easily handled. The problems you are having are not the result of using the wrong tool, but using that tool less than optimally.
Actually, if you are not a techie then Access is already the best tool for you. You will not find a more non-techie friendly way to build a data application from bottom to top.
That said, I'd say you have three options at this point:
Hire a competent database consultant to improve your application
Find commercial off-the-shelf (COTS) software that does what you need (I'm sure there are plenty of products to handle mailings; you'll need to research)
Learn about database normalization and building proper MS Access applications
If you can find a good program that does what you want then #2 above will maximize your Return on Investment (ROI). One caveat is that you'll need to convert all of your existing data, which may not be easy or even possible. Make sure you investigate that before you buy anything.
While it may be the most expensive option up-front, hiring a competent database consultant is probably your best option if you need a truly custom solution.
SQL Server sounds like a viable alternative for your scenario. If cost is a concern, you can always use SQL Server Express, which is free. Full-blown SQL Server provides a lot more functionality that might not be needed right away. Express is a lot simpler, as the number of features provided with it is much smaller. With either version, though, you will have a centralized store for your data and the ability to have all transactions recorded in the transaction log. Also, both can import data from an Access database.
The newest version of SQL Server is 2008 R2
You probably want to take a look at modern databases. If you're into Microsoft-based products, start with SQL Server Express
EDIT: However, since I understand that you're not a programmer yourself, you'd probably be better off having someone experienced look into your technical problem more deeply, like the other answer suggests.
It sounds like you may want to consider a front-end for your existing Access data store. Microsoft has yet to replace Access per se, but they do have a new tool that is a lot lower on the programming totem pole than some other options. Check out Visual Studio Lightswitch - http://www.microsoft.com/visualstudio/en-us/lightswitch.
It's fairly new (still in beta) but showing potential. With it, just as with any Visual Studio project, you can connect to an MS Access datasource and design a front-end to interact with it. The plus-side here is programming requirements are much lower than with straight-up Visual Studio (read: Wizards).
Given that replacing your Access DB will require some front-end programming, you might look into VistaDB. It should allow your front end to be created in .NET, with an XCopy database on the back end that doesn't require a server. One plus is that it retains SQL Server syntax, so if you do decide to move to SQL Server you'll be one step ahead.
(Since you're not a techie and may not understand my previous statement, you might pass my answer on to the consultant/programmer/database guy who is going to do the work for you.)
http://www.vistadb.net/