DISTINCT Query or DELETE Query - mysql

On each page load, what's faster, carrying out a distinct, or deleting duplicate records?

If having multiple records means that the data is incorrect, then neither of these options is exactly what you should be doing.
What you should do is determine how bad data is entering the system in the first place and prevent it from happening. At any given point in time the database should contain only good, meaningful data. It should, entirely in and of itself and without the application, maintain the integrity of the data at rest.
Don't add work-arounds to your application to handle the bad data. Add validation to your application and your database to prevent bad data. In this case it sounds like you have a database table which doesn't properly define constraints for what constitutes a "unique" record. That should be fixed.

Related

Memcached with row data that change constantly

I have a question that I didn't find an answer. Yet ;-)
I have a Django/MySQL application that runs memcached in the background. One of my tables change every access. I mean, when the user access the page I have a "count" field that is incremented, this same table contains all data that is going to be displayed.
Is recommended to use memcached in this scenario? Or should I create a new relation table that will contain only "id" and "count" field?
Thanks!
Sure, that's a valid use for memcached. The basic rule is that anytime you update, or delete in the mysql sense of the words, you need to do something to keep the memcache record consistent. Usually that is done by either adjusting the value right there, or deleting it so the next access of it builds it and saves it.
In your case, I would just get the value, increment it, and then set it. Depending on how important accuracy is to you, and how much concurrent traffic you get, you should consider atomicity of the transactions outlined in this post.

What is the best way (in Rails/AR) to ensure writes to a database table are performed synchronously, one after another, one at a time?

I have noticed that using something like delayed_job without a UNIQUE constraint on a table column would still create double entries in the DB. I have assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Passenger Phusion. I am not sure if that is the reason why this would happen, but I would like to make sure that every item in the queue is persisted to AR/DB one after another, in sequence, and to never have more than one write to this DB table happen at the same time. Is this possible? What would be some of the issues that I would have to deal with?
update
The race conditions arise because an AJAX API is used to send data to the application. The application received a bunch of data, each batch of data is identified as belonging together by a Session ID (SID), in the end, the final state of the database has to include the latest most up-to date AJAX PUT query to the API. Sometimes queries arrive at the exact same time for the same SID -- so I need a way to make sure they don't all try to be persisted at the same time, but one after the other, or simply the last to be sent by AJAX request to the API.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
In general I would say that relying on this is poor design and will likely lead to with scalability problems down the road since you're creating an bottleneck in your application flow.
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
For non delayed_job stuff, optimistic locking is often a good compromise between handling the concurrent cases well without slowing down the non concurrent cases.
If you are worried/troubled about/with multiple processes writing to the 'same' rows - as in more users updating the same order_header row - I'd suggest you set some marker bound to the current_user.id on the row once /order_headers/:id/edit was called, and removing it again, once the current_user releases the row either by updating or canceling the edit.
Your use-case (from your description) seems a bit different to me, so I'd suggest you leave it to the DB (in case of a fairly recent - as in post 5.1 - MySQL, you'd add a trigger/function which would do the actual update, and here - you could implement similar logic to the above suggested; some marker bound to the sequenced job id of sorts)

Designing tables for deleted data

I have many tables where data needs to be "marked for deletion" but not deleted, or toggle between published and hidden data.
Most intuitive way to handle these cases is to add a column in the database deleted int(1) or public int(1). This raises the concern of not forgetting to specify WHERE deleted=0 for each and every time that table is being accessed.
I considered overcoming this by creating duplicate tables for deleted/unpublished data such as article => article_deleted and moving the data instead of deleting it. This provides with 2 issues:
Foreign key constraints end up being extremely annoying to maintain
Number of tables with hidden content doubles (in my case ~20 becomes ~40 tables)
My last idea is to create a duplicate of the entire database called unreleased and migrate data there.
My question isn't about safety of the data management, but more of - what is the right way of doing it from the beginning?
I have run into this exact issue before and I think it is a bad idea to create an unnecessarily cumbersome DB because you are afraid of bad code.
I think it would be a better idea to do thorough testing on your Test server before you release to production. Even I was tripped up by the "Deleted" column a few times when I first encountered it but I eventually caught on, and if you have a proper Dev/Test/Production environment you should be fine.
In summary, keep the delete column and demand more from your coders.
UPDATE:
Alternatively you could create a view that only pulls the records that aren't deleted and make sure everyone uses that for select queries.
I think your initial approach is "correct" and "right", but your concern about it being slightly error-prone is a valid one.
You'll probably just have to make sure that your test procedures are rigourous enough to catch errors.
The first approach is the best I've come up with. I call the column active instead of deleted. The record exists but it can be either active or inactive. That then if you really do need to delete things the terminology doesn't get screwy.
Saying "Delete the inactive records" makes sense but saying "Delete the deleted records" just gets confusing.

MySQL DB design help

Pardon the elementary question but my newness to the realm of database design leaves me in a bind quite often.
I have a site that keeps growing with regard to families of information. In the beginning I had one sort of item I was describing and all was well. That item occupied one record and had 34 columns (a lot now that I look back) attributed to it of descriptive data. As I get more and more into this stuff, I see that many developers break out data (when practical) into distinct tables.
I've now got additional tables that relate to the original item but are not always needed when describing the original item so I broke them out so they're not queried unnecessarily.
Anyhow, I have a new item I've been trying to organize which is a USER. The user table has typical columns like username, email, last_login, path to associated image, etc. These users have been making comments, which I keep in yet another table that contains columns with IDs that relate to the user and the item on which they are commenting.
Now... I am in the process of adding the obligatory user profile page to the site. Should I create yet another table containing only essential profile data or append the existing user record with profile data in the original user table? I am thinking housekeeping might be a pain if I am to add a "Remove me from site" function as I would have to run something that kills the user record, the user profile record, and any other data associated with that user ID in other tables.
Basically what I am asking is should I keep going with this "granular" design method - breaking everything out into essential parts or does it ever serve me to consolidate into larger tables? I see a few instances where if a user deletes their account, I'll be left with a bunch of non-relevant data. For instance, the original item are restaurants... if I make a table to record "Visits" to restaurants, containing the Restaurant ID and the User ID, if the user or restaurant get removed from the site, this "Visits" table will have a bunch of useless records saying either "non existent restaurant was visited by user 45" or "Restaurant 21 was visited by non-existent user"
I hope I make sense here... I'm just wondering if it's normal to end up with this "junk" data over time.
Thanks much,
Rob
Deleting that "on-relevant" data is a normal, healthy part of an application's life. It's just what happens. You just have to do it, like you brush your teeth or make your bed. Don't let two or three DELETE queries influence how your tables get structured. They're not that expensive, and honestly, if you think that's too much of a pain, you're in the wrong business :)
If you're using InnoDB tables, you can look into foreign key constraints that will take care of some of the cleanup for you.
You'll be able to make these decisions much more easily if you learn about normalization.
In general, if data all relates to the same logical entity -- the same "thing" -- then it should go in the same table. Breaking one table into two just to keep the tables smaller is generally not a good idea. Depending on what you are doing, it may or may not make queries faster, and it introduces unnecessary complexity. Let me explain.
Whether it makes queries faster depends on the nature of the data and how you use it. If you have some very large field, like "rambling_comments varchar(5000)" or some such, and it is rarely used, then breaking it into a separate table so that what's left in the "main" table is relatively small could indeed make your queries faster, for the fairly obvious reason that there is now less data to read. But if the size of the fields you are thinking of breaking out are modest, and you often need data from both tables, then queries that only use one table don't gain that much, and queries that use both now need to do a join, which is usually more expensive than reading a somewhat bigger record.
But breaking up your tables will certainly make your programs more complex. Now you have to keep track of which data is in which table. You'll constantly be checking if that field is in the Item_Descriptive_Data table or the Item_Stock_Data table or whatever. You're liable to lose track at some point and accidentally put the same field into two tables. (Or worse, you'll decide this is a good idea and do it deliberately.) Then you have redundant and potentially contradictory data.
You have to do joins every time you need data that crosses tables. You create the possibility that records in one or more of the tables may not exist. Like, if you break your User table into User_Main and User_Profile, and you need data from both tables so you do a join, what happens if there is a record in User_Profile with no corresponding record in User_Main? You're going to have to add code to check for the possibility and deal with it. Oh, and blithely saying "That can never happen, no need to worry about it" is a very dangerous attitude: No matter that it's not SUPPOSED to happen, sooner or later it will, and if you don't handle the error gracefully, you could have a real mess.
In short, breaking up tables for performance reasons is usually a premature optimization. If you find that you have some real performance problem, THEN look at the tables and see if you should denormalize for efficiency. But don't start out trashing your database just to avoid a problem that might possibly happen someday.

Never delete entries? Good idea? Usual?

I am designing a system and I don't think it's a good idea to give the ability to the end user to delete entries in the database. I think that way because often then end user, once given admin rights, might end up making a mess in the database and then turn to me to fix it.
Of course, they will need to be able to do remove entries or at least think that they did if they are set as admin.
So, I was thinking that all the entries in the database should have an "active" field. If they try to remove an entry, it will just set the flag to "false" or something similar. Then there will be some kind of super admin that would be my company's team who could change this field.
I already saw that in another company I worked for, but I was wondering if it was a good idea. I could just make regular database backups and then roll back if they commit an error and adding this field would add some complexity to all the queries.
What do you think? Should I do it that way? Do you use this kind of trick in your applications?
In one of our databases, we distinguished between transactional and dictionary records.
In a couple of words, transactional records are things that you cannot roll back in real life, like a call from a customer. You can change the caller's name, status etc., but you cannot dismiss the call itself.
Dictionary records are things that you can change, like assigning a city to a customer.
Transactional records and things that lead to them were never deleted, while dictionary ones could be deleted all right.
By "things that lead to them" I mean that as soon as the record appears in the business rules which can lead to a transactional record, this record also becomes transactional.
Like, a city can be deleted from the database. But when a rule appeared that said "send an SMS to all customers in Moscow", the cities became transactional records as well, or we would not be able to answer the question "why did this SMS get sent".
A rule of thumb for distinguishing was this: is it only my company's business?
If one of my employees made a decision based on data from the database (like, he made a report based on which some management decision was made, and then the data report was based on disappeared), it was considered OK to delete these data.
But if the decision affected some immediate actions with customers (like calling, messing with the customer's balance etc.), everything that lead to these decisions was kept forever.
It may vary from one business model to another: sometimes, it may be required to record even internal data, sometimes it's OK to delete data that affects outside world.
But for our business model, the rule from above worked fine.
A couple reasons people do things like this is for auditing and automated rollback. If a row is completely deleted then there's no way to automatically rollback that deletion if it was in error. Also, keeping a row around and its previous state is important for auditing - a super user should be able to see who deleted what and when as well as who changed what, etc.
Of course, that's all dependent on your current application's business logic. Some applications have no need for auditing and it may be proper to fully delete a row.
The downside to just setting a flag such as IsActive or DeletedDate is that all of your queries must take that flag into account when pulling data. This makes it more likely that another programmer will accidentally forget this flag when writing reports...
A slightly better alternative is to archive that record into a different database. This way it's been physically moved to a location that is not normally searched. You might add a couple fields to capture who deleted it and when; but the point is it won't be polluting your main database.
Further, you could provide an undo feature to bring it back fairly quickly; and do a permanent delete after 30 days or something like that.
UPDATE concerning views:
With views, the data still participates in your indexing scheme. If the amount of potentially deleted data is small, views may be just fine as they are simpler from a coding perspective.
I prefer the method that you are describing. Its nice to be able to undo a mistake. More often than not, there is no easy way of going back on a DELETE query. I've never had a problem with this method and unless you are filling your database with 'deleted' entries, there shouldn't be an issue.
I use a combination of techniques to work around this issue. For some things adding the extra "active" field makes sense. Then the user has the impression that an item was deleted because it no longer shows up on the application screen. The scenarios where I would implement this would include items that are required to keep a history...lets say invoice and payment. I wouldn't want such things being deleted for any reason.
However, there are some items in the database that are not so sensitive, lets say a list of categories that I want to be dynamic...I may then have users with admin privileges be allowed to add and delete a category and the delete could be permanent. However, as part of the application logic I will check if the category is used anywhere before allowing the delete.
I suggest having a second database like DB_Archives whre you add every row deleted from DB. The is_active field negates the very purpose of foreign key constraints, and YOU have to make sure that this row is not marked as deleted when it's referenced elsewhere. This becomes overly complicated when your DB structure is massive.
There is an acceptable practice that exists in many applications (drupal's versioning system, et. al.). Since MySQL scales very quickly and easily, you should be okay.
I've been working on a project lately where all the data was kept in the DB as well. The status of each individual row was kept in an integer field (data could be active, deleted, in_need_for_manual_correction, historic).
You should consider using views to access only the active/historic/... data in each table. That way your queries won't get more complicated.
Another thing that made things easy was the use of UPDATE/INSERT/DELETE triggers that handled all the flag changing inside the DB and thus kept the complex stuff out of the application (for the most part).
I should mention that the DB was a MSSQL 2005 server, but i guess the same approach should work with mysql, too.
Yes and no.
It will complicate your application much more than you expect since every table that does not allow deletion will be behind extra check (IsDeleted=false) etc. It does not sound much but then when you build larger application and in query of 11 tables 9 require chech of non-deletion.. it's tedious and error prone. (Well yeah, then there are deleted/nondeleted views.. when you remember to do/use them)
Some schema upgrades will become PITA since you'll have to relax FK:s and invent "suitable" data for very, very old data.
I've not tried, but have thought a moderate amount about solution where you'd zip the row data to xml and store that in some "Historical" table. Then in case of "must have that restored now OMG the world is dying!1eleven" it's possible to dig out.
I agree with all respondents that if you can afford to keep old data around forever it's a good idea; for performance and simplicity, I agree with the suggestion of moving "logically deleted" records to "old stuff" tables rather than adding "is_deleted" flags (moving to a totally different database seems a bit like overkill, but you can easily change to that more drastic approach later if eventually the amount of accumulated data turns out to be a problem for a single db with normal and "old stuff" tables).