I am working on a project that will be used by around 500 employees in my organization. It is still in the development phase, and very few people (around 10) are using it. I'm using MySQL. I just want to know what happens if many users make front-end edits and then save at the same moment. Some SELECT queries that I've written take as long as 6 seconds to execute. As only one query can be executed at any point in time, if a query is already in progress and another hits the database, will it create a problem? If this is a common situation in large-scale projects, please let me know how I can handle it. I'm not sure if I've made myself clear :). Any advice or links will be very helpful.
From a technical standpoint, no: nothing bad will happen. The database won't go ballistic and die on you; databases are made for concurrent access.
From a logical point of view, something bad will happen. If two people edit the same thing at the same time and then post it at the same time, the edits get saved to the hard drive one after another. The last one to save is the one whose updates will be on the HDD, effectively causing the first person to lose their changes.
You can approach this problem from several angles. Some projects introduce the concept of locking (not table locking but in-app locking). It revolves around marking a record as locked using a boolean column, and if anyone else tries to open that record for updating, the software says that someone else is editing it. It's really difficult to implement, and most of the time it doesn't work as expected (I vaguely remember Joomla! using something like that; it was one of the most annoying features ever).
The other option you have is to save each update as a revision. That way you can keep track of who updated what and when, and nothing is lost when an edit would otherwise have been overwritten. I believe that SO and Wikipedia use that approach, and it works really well because you can inspect what two or more people have done and merge their contributions.
Optimistic Concurrency Control
http://en.wikipedia.org/wiki/Optimistic_concurrency_control
Make sure that each record contains date metadata on last changed/modified time, and load that as part of your data object. Then when attempting to commit the row to database, check the last_modified time in the table to ensure that it is the SAME as the one stored in memory for your object. If it matches, commit it, else throw exception.
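Here is a minimal sketch of that check, using Python's built-in sqlite3 module as a stand-in for MySQL so it runs anywhere. The table name `records`, the helper names `load` and `save`, and the `StaleRecordError` exception are all made up for illustration. The key point is that the comparison and the write happen in one UPDATE statement, so no other writer can slip in between them.

```python
import sqlite3


class StaleRecordError(Exception):
    """Raised when the row changed after it was loaded (hypothetical name)."""


def load(conn, record_id):
    # Load last_modified together with the data, as part of the object.
    return conn.execute(
        "SELECT id, body, last_modified FROM records WHERE id = ?", (record_id,)
    ).fetchone()


def save(conn, record_id, new_body, loaded_last_modified, now):
    # The WHERE clause only matches if nobody changed the row since we loaded it.
    cur = conn.execute(
        "UPDATE records SET body = ?, last_modified = ? "
        "WHERE id = ? AND last_modified = ?",
        (new_body, now, record_id, loaded_last_modified),
    )
    conn.commit()
    if cur.rowcount == 0:
        raise StaleRecordError(f"record {record_id} was modified by someone else")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT, last_modified INTEGER)"
)
conn.execute("INSERT INTO records VALUES (1, 'original', 100)")

# Two users load the same row before either of them saves.
_, _, seen_by_alice = load(conn, 1)
_, _, seen_by_bob = load(conn, 1)

save(conn, 1, "alice's edit", seen_by_alice, now=101)
try:
    save(conn, 1, "bob's edit", seen_by_bob, now=102)
    bob_conflict = False
except StaleRecordError:
    bob_conflict = True

final_body = load(conn, 1)[1]
print(bob_conflict, final_body)
```

If the timestamp resolution is coarse, two saves can land on the same value; an integer version column that is incremented on every update is a common alternative to the timestamp.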
I am sure there are lots of tutorials on this kind of topic, but I can't find what I want because I don't know the jargon for it. So I'm asking StackOverflow.
Here's the example:
People can Like or Dislike videos on Youtube, and the database should update the counts for Like or Dislike. However, it's impractical, especially for sites like Youtube, to update the database every time a user clicks the Like / Dislike button.
How can we cache the query / count numbers at a time interval, and when the time expired we send all the queries / update the database at one time? Or any similar technique for this kind of situation?
So what you're observing is the time delay between something happening and being able to view the results of what happened.
And you're on the right path to only update periodically.
But you're on the wrong path as far as where to do the periodic updates.
The thing is, you WANT to update the "database" every time, as soon as possible (namely the database(s) responsible for writes; choose your missing corner of the CAP triangle), so that everything is captured quickly. Your visitors/viewers, however, get a slightly-behind view of the write database(s): a few seconds to maybe a day behind, depending on the situation.
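To make that concrete, here is a rough sketch: every like is written straight away, while readers get a count that is only re-read once it is older than some time-to-live. It uses Python's built-in sqlite3 so it is self-contained; the class name `CachedLikeCount`, the `likes` table and the 5-second TTL are all invented for the example, and a real site would more likely put the cache in a separate service.

```python
import sqlite3
import time


class CachedLikeCount:
    """Writes every like immediately; serves counts up to ttl_seconds old."""

    def __init__(self, conn, ttl_seconds, clock=time.monotonic):
        self.conn = conn
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._cache = {}  # video_id -> (time_read, count)

    def record_like(self, video_id):
        # The write path is never delayed or batched on the client.
        self.conn.execute("INSERT INTO likes (video_id) VALUES (?)", (video_id,))
        self.conn.commit()

    def count(self, video_id):
        now = self.clock()
        cached = self._cache.get(video_id)
        if cached is None or now - cached[0] >= self.ttl_seconds:
            row = self.conn.execute(
                "SELECT COUNT(*) FROM likes WHERE video_id = ?", (video_id,)
            ).fetchone()
            cached = (now, row[0])
            self._cache[video_id] = cached
        return cached[1]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE likes (video_id TEXT NOT NULL)")

fake_now = [0.0]  # a controllable clock so the example does not have to sleep
likes = CachedLikeCount(conn, ttl_seconds=5, clock=lambda: fake_now[0])

likes.record_like("v1")
first = likes.count("v1")   # reads the database and caches the result
likes.record_like("v1")
stale = likes.count("v1")   # still the cached value, slightly behind
fake_now[0] = 6.0
fresh = likes.count("v1")   # cache expired, so the database is read again
print(first, stale, fresh)  # 1 1 2
```

Nothing the user did is ever held back: only the displayed number lags.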
You do NOT want to store this on the browser and potentially lose what the user did should the request fail, the internet go down, etc.
Slightly off topic - you typically do not try to "prematurely optimize" without data showing how much you're going to save by caching, buffering, etc. Optimizations like that add complexity, and you will stay sane longer if you keep things simple for as long as possible. Keep your design simple and optimize your bottlenecks once you know what they are.
Slightly more off topic - I'd recommend reading on distributed computing, specifically as it pertains to databases and then some design. You'll realize these highly focused abstract problems all have "solutions" with various advantages and disadvantages.
Introduction
Hi, I am going to ask a question which seems utopic for me, but I need to know if there is a way to achieve what I need. And if not, I need to know why not.
The idea
Suppose I have a database structure in MySQL.
I want to create some solution that allows anyone (no matter who, no matter where) to have a synchronized copy (an updated clone) of this database, with its content.
And it is not going to be just one synchronized copy: it could (and should) be replicated multiple times (as a baseline, say, ten copies all over the world).
And, the most important thing: It must be secure. By secure I mean only real-accepted transactions will be synchronized with all the others (no matter how many) database copies/clones.
Note: Since it would be quite difficult to make the synchronization in real-time, I will design everything to make this feature dispensable. So it is not required.
My auto-suggestion
This is how I am thinking to manage it:
Time identifiers and update checking: Every action (insert, update, delete...) will be stored as the action instruction itself, associated with a time identifier. [Rather than a DATETIME field, I think an INT one is better, holding the number of milliseconds elapsed since 1 January 2013, for example.] Each copy then asks the "neighbour copy" for new actions done since its last update, and executes them after checking that they are allowed.
Problem 1: the "neighbour copy" could be outdated too.
Solution 1: do not ask just one neighbour, create a random list with some of the copies/clones and ask them for news (I could avoid the list and ask ALL the clones for updates, but this will be inefficient if clones number ascends too much).
Problem 2: Real-time global synchronization is not active. What if...
Someone at CLONE_ENTERPRISING inserts a row into TABLE.
... this row goes to every clone ...
Someone at CLONE_FIXEMALL deletes this row.
... and at the same time, somewhere in an outdated clone ...
Someone at CLONE_DROPOUT edits this row (which no longer exists at the other clones).
Solution 2: easy stuff, force a GLOBAL synchronization before doing any new "depending-on-third-data action" (edit, for example). This global synch. will be unnecessary when making an INSERT, for instance.
Note: Well, someone could have some fun, and make the same insert in two clones... since they're not getting updated in real-time, this row will exist twice. But, it's the same as when we have one single database, in some needed cases we check if there is an existing same-row before doing the final action. Not a problem.
Problem 3: It is possible to edit the code so that it does not filter actions, so someone could spread instructions to delete everything, or just do some trolling. This is not a problem, since good clones will always exist somewhere. Clones that have gone bad will simply no longer be of interest.
I really appreciate it if you have read this far. I know this is not the perfect solution and it possibly has hundreds of holes, but it is my basic start. I will appreciate anything you can teach me. Thanks a lot.
PS.: It could be that everything I am trying to do already exists and has its own name. Sorry for asking, then (I'd still be grateful for the name, if it exists).
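For readers who want to play with the idea, the time-identifier scheme described above (actions stored with a millisecond timestamp, each clone asking several neighbours for news and filtering what it accepts) can be sketched in a few lines of plain Python. Every name here (`Action`, `Clone`, `pull_from`, `is_allowed`, the `action_id` field) is invented for the sketch; it ignores networking, clock disagreement and the actual execution of the statements.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    action_id: str   # hypothetical globally unique id, used to skip duplicates
    at_ms: int       # milliseconds since the chosen epoch
    statement: str   # the instruction to replay


@dataclass
class Clone:
    name: str
    log: dict = field(default_factory=dict)  # action_id -> Action

    def record(self, action):
        self.log[action.action_id] = action

    def last_update_ms(self):
        return max((a.at_ms for a in self.log.values()), default=-1)

    def actions_since(self, at_ms):
        return [a for a in self.log.values() if a.at_ms > at_ms]

    def pull_from(self, peers, is_allowed):
        # Ask several neighbours, merge their answers, keep only allowed actions.
        since = self.last_update_ms()
        incoming = {}
        for peer in peers:
            for action in peer.actions_since(since):
                incoming[action.action_id] = action
        ordered = sorted(incoming.values(), key=lambda a: (a.at_ms, a.action_id))
        accepted = [a for a in ordered if is_allowed(a)]
        for action in accepted:
            self.record(action)
        return accepted


a, b, c = Clone("A"), Clone("B"), Clone("C")
insert = Action("a1", 1000, "INSERT INTO t (id) VALUES (1)")
a.record(insert)
b.record(insert)  # B already synchronized this one
b.record(Action("b1", 2000, "DELETE FROM t WHERE id = 1"))
a.record(Action("x1", 1500, "DROP TABLE t"))  # a clone that does not filter

applied = c.pull_from([a, b], is_allowed=lambda act: not act.statement.startswith("DROP"))
applied_ids = [act.action_id for act in applied]
print(applied_ids)  # ['a1', 'b1']
```

One hole this makes visible: because a clone only asks for actions newer than its latest timestamp, an older action that reaches a neighbour late is never picked up.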
I would suggest a look at Sync Framework from Microsoft. It might be better suited to SQL Server but it should work with MySQL too. The problem you are tackling is quite a complex one.
When is it okay to have duplication of data in your database?
I'm working on this application that is supposed to track the number of user downloads. From my layman's point of view I can
simply have a column in the user table and increment the counter every time the user downloads something, or
have a counter table that has two columns, one for the user and one for the downloaded file.
As I see it both options enable me to track how many downloads each user has. However if this application sees the light of day and has tons of users then querying the database to look through the whole counter table could be quite expensive.
I guess my question is which do you all recommend?
There's no data duplication in the second option, just more data.
If you're not interested in knowing which files are downloaded, I'd go for the first option (it takes the least space). If you are, go for the second.
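A small sketch of the two options side by side, using Python's built-in sqlite3 so it runs without a server. The table and column names (`users.download_count`, `downloads`) are my own guesses at what the question describes. Both options are updated on each download here purely for comparison; in practice you would pick one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Option 1: a counter column on the user row.
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        download_count INTEGER NOT NULL DEFAULT 0
    );
    -- Option 2: one row per download (user and file).
    CREATE TABLE downloads (
        user_id INTEGER NOT NULL REFERENCES users (id),
        file_name TEXT NOT NULL
    );
    -- With this index, counting one user's rows does not scan the whole table.
    CREATE INDEX idx_downloads_user ON downloads (user_id);
    INSERT INTO users (id, name) VALUES (1, 'alice');
    """
)


def record_download(conn, user_id, file_name):
    conn.execute(
        "UPDATE users SET download_count = download_count + 1 WHERE id = ?",
        (user_id,),
    )
    conn.execute(
        "INSERT INTO downloads (user_id, file_name) VALUES (?, ?)",
        (user_id, file_name),
    )
    conn.commit()


for name in ["a.zip", "b.zip", "a.zip"]:
    record_download(conn, 1, name)

option1_total = conn.execute(
    "SELECT download_count FROM users WHERE id = 1"
).fetchone()[0]
option2_total = conn.execute(
    "SELECT COUNT(*) FROM downloads WHERE user_id = 1"
).fetchone()[0]
# Only option 2 can answer "which files?"
per_file = conn.execute(
    "SELECT file_name, COUNT(*) FROM downloads WHERE user_id = 1 "
    "GROUP BY file_name ORDER BY file_name"
).fetchall()
print(option1_total, option2_total, per_file)  # 3 3 [('a.zip', 2), ('b.zip', 1)]
```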
At some point, though, you might also be interested to see the download trend over time :) have you considered logging downloads using Google Analytics? They're probably a lot better at this game than you :)
I am required to make a general schema of a huge database that I have never used.
The problem is that I do not know how or where I could start, because, size aside, I have no idea what each table is for. I can guess some, but for the majority of them the generic field names tell me nothing.
Do you have any advice? What could I do?
There is no documentation about the database, and the creators are not able to help me because they are at another company now.
Thank you very much in advance.
This isn't going to be easy.
Start by gathering any documentation, notes, etc. that exist. Also, it'll greatly help to have a thorough understanding of the type of data being stored, and of the application. Keep ample notes of your discoveries, and build the documentation that should have been built before.
If your database contains declared foreign keys, you can start there and at least get down the relationships between the tables, keeping in mind that this may be incomplete. As @John Watson points out, if the relationships are declared, there are tools to do this for you.
Check for stored functions and procedures, including triggers, though these are somewhat uncommon in MySQL databases. Triggers especially will often yield clues ("every update to table X inserts a new row to table Y" -> "table Y is probably a log or audit table").
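As an illustration of reading declared relationships out of the catalog, here is a sketch against SQLite (via Python's built-in sqlite3) so it can be run as-is; the three-table schema and the function name `declared_relationships` are made up. On MySQL the same information lives in `information_schema.KEY_COLUMN_USAGE` (rows where `REFERENCED_TABLE_NAME` is not NULL), which you would query instead of the PRAGMA used here.

```python
import sqlite3


def declared_relationships(conn):
    """Return (table, column, referenced_table, referenced_column) tuples."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    edges = []
    for table in tables:
        # Each row: (id, seq, referenced_table, from_column, to_column, ...)
        for row in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            edges.append((table, row[3], row[2], row[4]))
    return sorted(edges)


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers (id)
    );
    CREATE TABLE order_items (
        id INTEGER PRIMARY KEY,
        order_id INTEGER REFERENCES orders (id)
    );
    """
)

edges = declared_relationships(conn)
for table, column, ref_table, ref_column in edges:
    print(f"{table}.{column} -> {ref_table}.{ref_column}")
```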
Some of the tables are hopefully obvious, and if you know what is related to them, you may be able to start figuring out those related tables.
Hopefully you have access to application code, which you can grep and read to find clues. Access to a test environment which you can destroy repeatedly will be useful too ("what happens if I change this in the app, where does the database change?"; "what happens if I scramble these values?"; etc.). You can dump tables and use diff on them, provided you dump them ordered by primary or unique key.
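The dump-and-diff trick might look roughly like this. It is a sketch using Python's built-in sqlite3 and difflib rather than mysqldump and diff; the `settings` table and the `dump_table` helper are invented for the example. Ordering by the key is what keeps the two dumps line-for-line comparable.

```python
import difflib
import sqlite3


def dump_table(conn, table, key):
    # Ordering by the primary/unique key gives a stable, diff-friendly dump.
    rows = conn.execute(f'SELECT * FROM "{table}" ORDER BY "{key}"')
    return [repr(row) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE settings (id INTEGER PRIMARY KEY, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO settings VALUES (?, ?, ?)",
    [(1, "theme", "light"), (2, "lang", "en")],
)

before = dump_table(conn, "settings", "id")
# Stand-in for "change something in the app and see what moves".
conn.execute("UPDATE settings SET value = 'dark' WHERE name = 'theme'")
after = dump_table(conn, "settings", "id")

changes = [
    line
    for line in difflib.unified_diff(before, after, lineterm="", n=0)
    if line[0] in "+-" and not line.startswith(("+++", "---"))
]
print("\n".join(changes))
```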
Doing queries like SELECT DISTINCT foo FROM table can help you see what different things can be in a column.
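A slight extension of that query, adding a count per value so you can tell the common values from the oddities. This is just a sketch with Python's built-in sqlite3; the `tickets` table and the `profile_column` helper are hypothetical.

```python
import sqlite3


def profile_column(conn, table, column):
    # Distinct values with their frequencies, most common first.
    query = (
        f'SELECT "{column}", COUNT(*) FROM "{table}" '
        f'GROUP BY "{column}" ORDER BY COUNT(*) DESC, "{column}"'
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO tickets (status) VALUES (?)",
    [("open",), ("open",), ("closed",), ("open",)],
)

status_profile = profile_column(conn, "tickets", "status")
print(status_profile)  # [('open', 3), ('closed', 1)]
```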
If it's possible to start from a mostly-empty database (e.g., the minimum needed to get the app to run), you can observe what changes as you add data to the app. It's much quicker to dump the database when it's small. The same goes for diffing it, and for reading through the output. Some things are easier to understand in a tiny database, but some things are more difficult. When you have a huge dataset and a column is always 3, you can be much more confident it always is.
You can watch SQL traffic from the application(s) to get an idea of what tables and columns they access for each function, and how they join them. Watching SQL traffic can be done in application-specific ways (e.g., DBI trace) or server-specific ways (turn on the general query log) or with a packet tracer like Wireshark or tcpdump. Which is appropriate is going to depend on the environment you're working in. E.g., if you have to do this on a production system, you probably want Wireshark. If you are doing this in dev/test, the disadvantage of the MySQL query log is that all the apps may very well be mixed together, and if multiple people are hitting the apps it'll get confusing. The app-specific log probably won't suffer from this, but of course the app may not have that.
Keep in mind the various ways data can be stored. For example, all of these could mean May 1, 1980:
1980-05-01: as a DATE, TIMESTAMP, or text.
2444360.5: Julian day (with time; this specifies midnight).
44360: Modified Julian day.
326001600: UNIX timestamp (with time; this specifies midnight), assuming local time is US Eastern Time (seconds since Jan 1 1970 UTC).
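If you want to check a suspicious number against a candidate date, the conversions are short enough to do by hand. A sketch in Python using only the standard library; the fixed UTC-4 offset is my assumption for US Eastern Daylight Time on that date, used instead of a time zone database so the snippet runs anywhere.

```python
from datetime import date, datetime, timedelta, timezone

day = date(1980, 5, 1)

# date.toordinal() counts days with 0001-01-01 as day 1; adding 1721424.5
# converts that to the Julian day number at 00:00.
julian_day = day.toordinal() + 1721424.5

# The Modified Julian day is defined as JD - 2400000.5.
modified_julian_day = julian_day - 2400000.5

# Midnight local time, assuming UTC-4 (US Eastern Daylight Time).
EDT = timezone(timedelta(hours=-4))
unix_timestamp = int(datetime(1980, 5, 1, tzinfo=EDT).timestamp())

print(day.isoformat(), julian_day, modified_julian_day, unix_timestamp)
# 1980-05-01 2444360.5 44360.0 326001600
```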
There may be things in the database which are denormalized, and some of them may be denormalized incorrectly. E.g., you may be wondering "why does this user have a first name Bob in one table, and a first name Joe in another?" and the answer is "data corruption".
There may be columns that aren't used. There may be entire tables that aren't used. Despite this, they may still have data from older versions of the app (or other, no-longer-in-use apps), queries run from the MySQL console, etc.
There may be things which aren't visible in the application anywhere, but are used. Their purpose may be completely non-obvious without knowledge of the algorithms implemented in the app(s). For example, a search function in an app may store all kinds of precomputed information about the documents to search and their connections. Worse, these tables may only be updated by batch jobs, so changing a document won't touch them (making you mistakenly believe they have nothing to do with documents). Then, you come in the next morning, and the table is mysteriously very different. Though, in the search case, a query log when running search will tell you.
Try using the free MySQL Workbench (it's specific to MySQL).
I have reverse engineered databases this way and also ended up with great Entity Relationship Diagrams!
I've worked with SQL for 20 years and this product really is great (it's free, from the MySQL folks themselves).
It can have occasional problems, crashes, etc. (at least it did on Ubuntu 10), but they've been relatively rare and far outweighed by the benefits! It's also actively developed, so bugs are actually fixed on an ongoing basis.
Assuming that nobody bothered to declare foreign keys in the table definitions, and the database belongs to an application which is in use, my next step after grabbing the current schema would be to enable logging of all queries (hoping that the application does NOT use a trivial ORM like [x]hibernate) to identify joins and data semantics.
This perl script may be helpful.
In my MS Access application I have several forms that are very data intensive (several subforms based on even more tables). My users are complaining that when opening the data across the network the load times are unbearably long.
I do have a split front end / back end setup using the excellent autofe application.
One solution I have come up with is that, instead of calling DoCmd.Close when the user clicks the "Save & Close" button, I set Me.Visible = False. The user then has the long wait the first time after the application is loaded, but for later loads performance is improved by a noticeable amount.
So far this has been working fairly well. I am just concerned that there may be some hidden gotchas in this strategy that I haven't encountered yet.
My users aren't overly intelligent and I don't use the application myself so I can't expect to get meaningful feedback if something is behaving erratically.
Anyone else employed this strategy successfully or know of a good reason not to do it?
Yes, that strategy is similar to recipe #8.1 Accelerate the Load Time of Forms from the second edition of the Access Cookbook. However that recipe pre-loads a set of forms, with WindowMode:=acHidden, at database startup. So the tradeoff is that database startup takes longer, but subsequent form opens (for the pre-loaded forms) are comparatively fast.
The discussion for that recipe didn't mention any drawbacks for that technique. In limited use, I haven't discovered any. And since it seems to improve your users' experience, I would continue to use it.
Beyond that, I would take a hard look at the amount of data your forms pull from the back-end database. Limit the number of rows retrieved as the Record Sources for the main and subforms. Give the user a method to select a different record or small set of records. Also make sure you use indexing to support Record Source WHERE and ORDER BY clauses. Avoid WHERE conditions that use functions which will force a full table scan to figure out which rows to exclude from the Record Source. Similar considerations apply to combo and list boxes which use saved queries or SELECT statements as their Record Sources; if you can't limit the rows, at least make sure to optimize data retrieval.
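The point about functions in WHERE conditions is easy to see in any engine that will show you its query plan. The sketch below uses SQLite through Python's built-in sqlite3 as a stand-in, since it runs without Access; the `customers` table and index name are invented, and the Access/Jet optimizer differs in its details, so treat it as an illustration of the principle only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT, city TEXT);
    CREATE INDEX idx_customers_last_name ON customers (last_name);
    """
)


def plan(conn, sql):
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The fourth column of each row is the human-readable plan step.
    return " | ".join(row[3] for row in rows)


# Comparing the bare column lets the engine seek into the index.
indexed_plan = plan(conn, "SELECT * FROM customers WHERE last_name = 'Smith'")

# Wrapping the column in a function hides it from the index, forcing a scan.
function_plan = plan(conn, "SELECT * FROM customers WHERE upper(last_name) = 'SMITH'")

print(indexed_plan)
print(function_plan)
```

The first plan reports a SEARCH using the index, the second a SCAN of the whole table; the exact wording varies between SQLite versions.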
At first, just hiding the form is not too bad, I think.
I would dig a bit more into WHY your load times are so long. You mentioned several subforms. Are they all displayed at the same time, or are they on the various pages of a Tab control?
In the latter case, you could quite easily unbind the subforms that are not visible, and bind them on the PageClick event. That makes a big difference in performance.
EDIT:
Also, a bit out of scope for this question, but good for every performance issue:
- Did you double check that the foreign keys in the related tables are properly indexed?
- Make sure the back-end is regularly compacted.
Are you making sure that the data gets refreshed in an appropriate timeframe?
Yes, I've done the same thing myself in very complex forms which had about 10 or 15 tabs, each with a subform. It worked for at least ten years. You have to watch for various form-level values or unbound controls which you assume start as null or zero. But once it's running smoothly it should run just fine. We had to do this back in the Access 97 days because Access would crash with out-of-memory errors after the users had opened and closed various forms thousands of times per day.