When is the duplication of database data okay? - mysql

When is it okay to have duplication of data in your database?
I'm working on an application that is supposed to track the number of user downloads. From my layman's point of view I can either:
have a column in the user table and increment the counter every time the user downloads something, or
have a counter table with two columns, one for the user and one for the downloaded file.
As I see it, both options let me track how many downloads each user has. However, if this application sees the light of day and attracts tons of users, querying the database to scan the whole counter table could become quite expensive.
I guess my question is which do you all recommend?

There's no data duplication in the second option, just more data.
If you're not interested in knowing which files are downloaded, I'd go for the first option (it takes the least space). If you are, go for the second.
At some point, though, you might also be interested to see the download trend over time :) have you considered logging downloads using Google Analytics? They're probably a lot better at this game than you :)
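A minimal sketch of the two schemas, using SQLite as a stand-in for MySQL (table and column names are illustrative, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Option 1: a counter column on the user table.
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, "
            "downloads INTEGER DEFAULT 0)")
cur.execute("INSERT INTO users (name) VALUES ('alice')")
cur.execute("UPDATE users SET downloads = downloads + 1 WHERE name = 'alice'")

# Option 2: one row per download. More data, but no duplication, and it
# retains which file was downloaded.
cur.execute("CREATE TABLE download_log (user_id INTEGER, file TEXT)")
cur.executemany("INSERT INTO download_log VALUES (?, ?)",
                [(1, 'a.zip'), (1, 'b.zip')])

# The per-user count of option 1 is derivable from option 2:
cur.execute("SELECT user_id, COUNT(*) FROM download_log GROUP BY user_id")
print(cur.fetchall())  # [(1, 2)]
```

Note that option 2 subsumes option 1: the counter is just a GROUP BY away, at the cost of storing one row per download.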

Related

MySQL - What happens when multiple queries hit the database

I am working on a project that will be used by around 500 employees in my organization. Currently it's still in the development phase, and very few people (around 10) are using it. I'm using MySQL. I just want to know: what happens if many users make front-end edits and save them at the same point in time? Some SELECT queries that I've written take as long as 6 seconds to execute. As I understand it, only one query can be executed at a time, so if a query is already in progress and another hits the database, will it create a problem? If this is a common situation in large-scale projects, please let me know how I can handle it. I'm not sure if I've made myself clear :). Any advice or links will be very helpful.
From a technical standpoint, no: nothing bad will happen. The database won't go ballistic and die on you; databases are made for exactly this kind of concurrent access.
From a logical point of view, something bad will happen. If two people edit the same record at the same time and then save at the same time, the writes hit the disk one after another. The last one to save wins, effectively causing the first person to lose their changes.
You can approach this problem from several angles. Some projects introduce the concept of locking (not table locking but in-app locking). It revolves around marking a record as locked using a boolean column; if anyone tries to open that record for updating, the software says that someone else is editing it. It's really difficult to implement, and most of the time it doesn't work as expected (I vaguely remember Joomla! using something like that; it was one of the most annoying features ever).
The other option you have is to save each update as a revision. That way you can keep track of who updated what and when, and you never lose any data to overwrites. I believe SO and Wikipedia use that approach, and it works really well because you can inspect what two or more people have done and merge their contributions.
Optimistic Concurrency Control
http://en.wikipedia.org/wiki/Optimistic_concurrency_control
Make sure that each record carries a last-changed/modified timestamp, and load it as part of your data object. Then, when attempting to commit the row to the database, check the last_modified time in the table to ensure it is the SAME as the one stored in memory for your object. If it matches, commit; otherwise throw an exception.
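The check-and-commit step above can be sketched as a single conditional UPDATE; here SQLite stands in for MySQL, and the table, column, and function names are illustrative:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, body TEXT, "
             "last_modified REAL)")
conn.execute("INSERT INTO records VALUES (1, 'v1', ?)", (100.0,))

def save(conn, rec_id, new_body, loaded_at):
    # Optimistic concurrency: the UPDATE only matches if the row's
    # last_modified is still the value we loaded into memory.
    cur = conn.execute(
        "UPDATE records SET body = ?, last_modified = ? "
        "WHERE id = ? AND last_modified = ?",
        (new_body, time.time(), rec_id, loaded_at))
    if cur.rowcount == 0:
        raise RuntimeError("row changed since load; reload and retry")

save(conn, 1, 'v2', 100.0)       # succeeds: timestamps match
try:
    save(conn, 1, 'v3', 100.0)   # fails: last_modified has moved on
except RuntimeError as e:
    print(e)
```

Doing the comparison inside the UPDATE's WHERE clause (rather than with a separate SELECT) keeps the check-and-write atomic.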

best mysql table structure for INSERT only?

I have a website on a shared host, where I expect a lot of visitors. I don't need a database for reading (everything presented on the pages is hardcoded in PHP) but I would like to store data that my users enter, so for writing only. In fact, I only store this to do a statistical analysis on it afterwards (on my local computer, after downloading it).
So my two questions:
Is MySQL a viable option for this? The site is meant to run on shared hosting with PHP/MySQL available, so I cannot really use other fancy packages, but if e.g. writing to a file would be better for this purpose, that's possible too. As far as I understand, appending a line to a huge file is a relatively expensive operation. On the other hand, 100+ users simultaneously connecting to a MySQL database is probably also a big load, even if each just runs one cheap INSERT query.
If MySQL is a good option, how should the table best be configured? Currently I have one InnoDB table with an auto-incrementing primary key id (next to, of course, the columns storing the data). This is a general-purpose configuration, so maybe there are more optimized setups, given that I only need to write to the table and not read from it?
Edit: I mainly fear that the website will go viral as soon as it's released, so I expect the users to visit in a very short timeframe. And of course I would not like to lose the data that they enter due to an overloaded database.
MySQL is a perfectly reasonable choice for this, and probably much better than a flat file, since you say you want to aggregate and analyze this data later. Doing so with a flat file could take a long time, especially if the file is large. Additionally, an RDBMS is built for aggregation and dataset manipulation, which makes it ideal for producing report data.
Put whatever data columns you want in your table, plus some kind of identifier to track a user, in addition to your existing row key. IP address is a logical choice for user tracking, or a magic cookie value could work. It's only a single table; you don't need to think too hard about it. You may want to add nonclustered indexes on columns you'll frequently filter on for reports, e.g. IP address, access date, etc.
I mainly fear that the website will go viral as soon as it's released, so I expect the users to visit in a very short timeframe. And of course I would not like to lose the data that they enter due to an overloaded database.
RDBMS such as MySQL are explicitly designed to handle heavy loads, assuming appropriate hardware backing. Don't sweat it.
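As a sketch of the kind of narrow, write-only table discussed above (SQLite stands in for MySQL, and all names are illustrative; in MySQL you would also specify ENGINE=InnoDB and AUTO_INCREMENT):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE submissions (
        id         INTEGER PRIMARY KEY,  -- AUTO_INCREMENT in MySQL
        ip_address TEXT,
        created_at TEXT,
        payload    TEXT
    )""")

# INSERTs stay cheap because the only index being maintained is the
# primary key; secondary indexes can be added later, at analysis time.
rows = [("203.0.113.%d" % i, "2024-01-01", "answer %d" % i)
        for i in range(100)]
conn.executemany(
    "INSERT INTO submissions (ip_address, created_at, payload) "
    "VALUES (?, ?, ?)",
    rows)
print(conn.execute("SELECT COUNT(*) FROM submissions").fetchone()[0])  # 100
```

The design choice here is to defer indexing: every secondary index slows down each INSERT, and since the analysis happens offline later, the indexes can be created then.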

Django Log File vs MySql Database

So I am going to be building a website using the Django web framework. The website will have an advertising component, and whenever an advertisement is clicked, I need to record it. We charge the customer every time a distinct user clicks on the advertisement. So my question is: should I record all the click entries in a log file, or should I just create a Django model and record the data in a MySQL database? I know it's easier to create a model, but I'm worried about what happens if the website gets a lot of traffic. Please give me some advice. I appreciate you taking the time to read and address my concerns.
Traditionally, this sort of interaction is stored in a DB. You could do it in a log, but I see at least two disadvantages:
log rotation
the fact that after logging you'll still have to process the data into a meaningful form.
IMO, you could put it in a separate DB (see the multiple-db feature in Django). That way, you can balance the performance load somewhat.
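Django's multiple-database support is configured in settings.py; a sketch, where the "clicks" alias, database names, and credentials are purely illustrative:

```python
# settings.py (sketch): a second database alias dedicated to click data.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "site",
        "USER": "webapp",
        "PASSWORD": "change-me",
    },
    "clicks": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "clicks",
        "USER": "webapp",
        "PASSWORD": "change-me",
    },
}
```

A click model could then direct its reads and writes at the second database with `Click.objects.using("clicks")` or `click.save(using="clicks")`, keeping the click traffic off the main database.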
You should save all clicks to a DB. A database is created to handle the kind of data you are trying to save.
Additionally, a database will let you analyze your data much more simply than a flat file. If you want to graph traffic by country, by user agent, or by date range, that will be almost trivial in a database, but parsing gigantic log files is far more involved.
A database will also be easier to extend. Right now you are just tracking clicks, but what happens when you want to start serving advertisements that require some additional user action or conversion? A database lets you extend beyond clicks extremely easily.

Granular 'Up to the minute' data recoverability of mySQL database data

I operate a web-based online game with a MySQL backend. Every day, many writes are performed against hundreds of related tables holding user data.
Frequently a user's account will become compromised. I would like the ability to restore the user's data to a certain point in time prior to the attack without affecting any other user data.
I'm aware of binary logging in MySQL, but as far as I know it gives you whole-database recovery up to a certain point in time. I would like a more granular solution, i.e. being able to specify which tables, which rows, etc.
What should I be looking into? What are the general best-practices?
If you create and use audit tables (populated through triggers) you can always get back to the data for one particular user in any table.
Be sure to write your general restore script before you need it, though. It's much easier to drop a userid into a script you already have than to sit there staring at the audit tables wondering, "how do I do this again?"
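A minimal sketch of a trigger-populated audit table, using SQLite syntax as a stand-in for MySQL (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (id INTEGER PRIMARY KEY, gold INTEGER);
    CREATE TABLE player_audit (
        player_id  INTEGER,
        old_gold   INTEGER,
        new_gold   INTEGER,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Every UPDATE leaves a row behind, so one player's history can be
    -- walked back to any point in time without touching other users.
    CREATE TRIGGER player_update AFTER UPDATE ON player
    BEGIN
        INSERT INTO player_audit (player_id, old_gold, new_gold)
        VALUES (OLD.id, OLD.gold, NEW.gold);
    END;
""")
conn.execute("INSERT INTO player VALUES (1, 100)")
conn.execute("UPDATE player SET gold = 50 WHERE id = 1")  # account compromised
conn.execute("UPDATE player SET gold = 0  WHERE id = 1")
print(conn.execute(
    "SELECT old_gold, new_gold FROM player_audit WHERE player_id = 1"
).fetchall())  # [(100, 50), (50, 0)]
```

Restoring one user then becomes a SELECT against the audit rows for that user up to the chosen timestamp, which is exactly the granularity the binary log lacks.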
MySQL (or any other RDBMS that I'm aware of) is not able to do that by itself. Therefore you should implement that yourself in your application layer.
This is (without external modules) not possible.
As thejh suggested in the comments, revisions would be a good solution. If you only need to handle user data, create a table that resembles the user table with an additional timestamp (or similar), and run a cron job once a week/day/... that copies recently modified user data (tracked via additional flags/dates in the actual user table) into this table.

How to store parts of form-data when they're on separate pages?

Whenever I prepare a long form for a client, I always want to split it into separate pages, so the visitor doesn't have to fill it all in at once, but does it in steps.
Something like:
Step 1 > Step 2 > Step 3 > Thank You!
I've never done it, for one reason: I don't know how to store the data from the separate steps efficiently. By efficiently I mean: how do I store it so that, when a visitor decides not to finish at Step 3, all the data is deleted?
I've come up with few ways of how this could be resolved, but I'm just not convinced by any of them:
Storing form data in database
I can imagine a table with columns representing each question, and a final column holding a boolean for whether the form has been completed.
But I would have to clean up the table every now and then (maybe even every time it gets updated with new data?) and delete all entries with complete = 0.
Store form data in session data.
This, on the other hand, does not have to store data in the database (depending on how sessions are handled), and all the info would live in a cookie. But what if the browser doesn't support cookies, or the user has disabled them (rare, but it happens), or the form has file attachments? Then this is a no-go.
echo'ing form data from previous page as <input type="hidden"> on the next page
Yes, I'm aware this is a rather crude idea, but it's an alternative. A poor one, but an alternative.
Option 1 seems the best, but I feel it's unnecessary to store temporary data in the DB. And what if this becomes a very popular form with a lot of visitors filling it in? The volume of updates/deletes could be massive.
I want to know how you deal with it.
Edit
David asked a good question. What technology I'm using?
I personally use PHP+MySQL, but I feel this is a more generic question. Please share your solutions no matter what server-side technology you use, as I'm sure the concept can be adapted one way or another.
I think the choice between options 1 and 2 comes down to how much data you are storing. I think in most cases the amount of data you are collecting on a form is going to be fairly small (a few kilobytes). In that case, I think storing it in the session data is the way to go. There's not that much overhead in passing that amount of data back and forth. Also, unless your users are on computers where there is a strict security policy in place, the application should work. If you make the page requirements clear users can decide if they want to proceed or not.
If you are storing large amounts of data for the form then a database would be better so you don't need to pass the data back and forth. However, I think this is going to be a fairly rare situation. Even if the application allows the uploading of files you can save those to a temporary location and only write them to the database once the form is completed. The other situation where you might want to use a database is if your form needs to be able to support the user leaving and coming back at a later time to resume the form.
I agree that option 1 is the best, because it has a few benefits over the other 2:
If the data is persisted, users can come back later and continue the process
Your code base will be much cleaner with incremental saves, and it alleviates the need for one massive save operation
Your foot print (each page request) will be lighter than option 3
If you're worried about performance, you can queue the data to be saved, since it's not necessary to save it near-real-time.
Edit to clear up a misconception: data inside PHP sessions is, by default, NOT stored in cookies, and sessions can hold a lot of data without much overhead.
I'd go with number 2, but use the cookie only for identifying the user. The session data should actually be stored on your server and the cookie merely provides a lookup key to the session object that contains all the details.
If the site becomes popular and needs to run on more than a single web server, then your session data will need to be persisted in some kind of database anyway. In that case you would need a database that could handle massive amounts of transactions.
Note: I agree that this is a platform independent question. Stack Overflow users prefer to see code in questions and prefer to give code in answers, so that's why I normally ask what language someone is using.
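The server-side session approach (cookie as lookup key, data kept on the server) can be sketched language-agnostically. In this minimal Python sketch, a dict stands in for the session backend, keyed by the cookie-supplied session id; all names are illustrative:

```python
# session_id -> accumulated form data, held server-side.
sessions = {}

def save_step(session_id, step_data):
    # Each step merges its fields into the same server-side record.
    sessions.setdefault(session_id, {}).update(step_data)

def finish(session_id):
    # Only at the final step does the data get written to the database;
    # popping the session also removes the temporary copy.
    return sessions.pop(session_id)

def abandon(session_id):
    # Unfinished forms just vanish with the session; no DB cleanup pass
    # is needed, which addresses the "complete = 0" worry from option 1.
    sessions.pop(session_id, None)

save_step("abc123", {"name": "Jane"})
save_step("abc123", {"email": "jane@example.com"})
print(finish("abc123"))  # {'name': 'Jane', 'email': 'jane@example.com'}
```

In PHP this maps onto writing to $_SESSION at each step; session garbage collection plays the role of abandon().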
To be brutally honest, just use the database as in option 1 and stop worrying about data volumes. Seriously, if your site becomes successful enough for that to be a problem, then you ought to be able to fund a revamp to cope.
There's nothing wrong with taking the POST data from the previous step and adding hidden input elements. Just take all the POST data from the previous page that you care about and get them into the current page's form. This way, you don't have to worry about using persistent storage in any form, whether it's on the client side or the server side.
What are the perceived downsides? That there are a lot of extra elements on the page? None that the user sees. All you have to do is add an element for each input you've asked the user for (on every page, if you want the user to be able to go back). Beyond these elements, which add no visual clutter, there's nothing extra.
There's also the fact that all the form data has to be transmitted on every page load. True, but this is probably faster than a database lookup, and you don't have to worry about getting rid of stale data.
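One caveat worth making explicit: the re-emitted values are user-supplied, so they must be HTML-escaped. A minimal sketch of generating the hidden inputs (the function name is illustrative):

```python
import html

def hidden_inputs(post_data):
    # Re-emit the previous step's POST data as hidden fields, escaping
    # both names and values so quotes in user input cannot break out of
    # the attribute.
    return "\n".join(
        '<input type="hidden" name="%s" value="%s">'
        % (html.escape(name, quote=True), html.escape(value, quote=True))
        for name, value in post_data.items())

print(hidden_inputs({"name": "Jane", "quote": 'she said "hi"'}))
```

The same escaping requirement applies in PHP via htmlspecialchars() when echoing the previous page's POST data.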