Is a 'blackhole' table evil? - mysql

Reading this question I've just learned of the blackhole table trick: it basically consists of inserting all data into a single table, with a trigger that splits the data out into many other tables.
I'm wondering whether this could cause problems, even once the developers working on the project are aware of it.
What are the pros and cons of this technique?
Edit:
The thought that came to mind when I saw the example is about transactions: if for some reason the transaction fails, you'll still find the blackhole row with the original data, kept for historical purposes and maybe as a debugging aid - but that seems to be the only +1 I can see for blackholes. Ideas?

I don't think blackhole has any real pros.
Writing the trigger code to move data around is probably not noticeably less work than writing the code to insert the data in the right place in the first place.
As Christian Oudard writes, it doesn't reduce complexity - just moves it to a place where it's really hard to debug.
On the downside:
"Side effects" are usually a bad idea in software development. Triggers are side effects - I intend to do one thing (insert data in a table), and it actually does lots of other things. Now, when I'm debugging my code, I have to keep all the side effects in my head too - and the side effects could themselves have side effects.
Most software spends far more time in maintenance than it does in development. Bringing new developers into the team and explaining the black hole trick is likely to add to the learning curve - for negligible benefit (in my view).
Because triggers are side effects, and it's relatively easy to set off a huge cascade of triggers if you're not careful, I've always tried to design my databases without a reliance on triggers; where triggers are clearly the right way to go, I've only let my most experienced developers create them. The black hole trick makes triggers into a normal, regular way of working. This is a personal point of view, of course.

The original question that prompted yours does not get at the heart of MySQL's "blackholes."
What is a BLACKHOLE?
In MySQL-speak, BLACKHOLE is a storage engine that simply discards all data INSERTed into it, analogous to a null device. There are a number of reasons to use this backend, but they tend to be a bit abstruse:
A "relay-only" binlog-filtering slaveSee the docs, and here and here.
BenchmarkingE.g., measuring the overhead of binary logging without worrying about storage engine overhead
Various computational tricksSee here.
If you don't know why you need a data sink masquerading as a table, don't use it.
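For reference, here is a minimal sketch of the engine's behaviour (the table and column names are made up for illustration):

-- A table backed by the BLACKHOLE engine accepts writes but never stores rows.
CREATE TABLE log_sink (
    id  INT NOT NULL,
    msg VARCHAR(255)
) ENGINE = BLACKHOLE;

INSERT INTO log_sink VALUES (1, 'hello');  -- succeeds, but the data is discarded
SELECT * FROM log_sink;                    -- always returns an empty set

The insert still goes through the normal statement path (binary logging, replication, triggers defined on the table); only the storage step is skipped, which is what makes the engine useful for the tricks above.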
What is the technique you are asking about?
The use under consideration seems to be to:
redirect INSERTed data to other tables
audit log the original INSERTion action
discard the original INSERT data
Thus the answer to the question of "evilness" or pros/cons is the same as the answer to those questions for insertable/updatable VIEWs (the common way to implement #1), trigger-based audit logging (how most people do #2) and behavioral overrides/counteractions generally (there are a number of ways to accomplish #3).
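For concreteness, here is a minimal sketch of the trick under discussion, with invented table, column and trigger names:

-- All application writes go to a single blackhole "inbox"; a trigger fans the data out.
CREATE TABLE person_inbox (
    name  VARCHAR(100),
    email VARCHAR(100)
) ENGINE = BLACKHOLE;

CREATE TABLE people (name  VARCHAR(100));
CREATE TABLE emails (email VARCHAR(100));

DELIMITER //
CREATE TRIGGER split_person BEFORE INSERT ON person_inbox
FOR EACH ROW
BEGIN
    INSERT INTO people (name)  VALUES (NEW.name);
    INSERT INTO emails (email) VALUES (NEW.email);
END//
DELIMITER ;

-- The application only ever runs this one statement; the inbox row itself is
-- discarded, but the trigger has already copied the data to the real tables.
INSERT INTO person_inbox (name, email) VALUES ('Ann', 'ann@example.com');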
So, what is the answer?
The answer is, of course, "sometimes these techniques are appropriate and sometimes not." :) Do you know why you're doing it? Is the application a better place for this functionality? Is the abstraction too brittle, too leaky, too rigid, etc.?

This doesn't look like a good idea. If you're trying to keep the front end code simple, why not just use a stored procedure? If it's not to keep the front end code simple, I don't understand the point at all.

Funnily enough I learnt about the existence of blackholes today too.
Arguably the question here is actually a broader one, i.e. whether or not business logic should be embedded in database triggers. In this instance the blackhole table is essentially being used as a transient data store that the trigger on the blackhole table can make use of. Should the trigger be used in the first place? To me that is the real meat of the question.
Personally I feel that the use of triggers should be restricted to logging and DBA-specific tasks only and should not contain business logic (or any logic for that matter) that should belong firmly in the application layer. It appears as though there have been quite a few opinions expressed about whether database triggers are evil or not. I think your question kinda falls into that category too.
Embedding application layer logic in database triggers can be risky. It is likely to end up splitting business logic between application code and the database. This can be very confusing indeed for somebody trying to support and get their head into a code base.
If you end up with too much logic in triggers (and indeed stored procedures), you can easily end up with performance issues on your database server that could, and indeed should, have been addressed by distributing the heavy-duty processing - i.e. complex business logic - among application servers and leaving the database server free for its primary purpose: serving data.
Just my two bits' worth of course!

Each time you insert a row into a table, the odds are that you are writing to the same area of the hard drive or the same page (in MS-SQL world, I don't know about postgresql), so this technique will likely lead to contention and locking as all transactions are now competing to write to the same table.
Also this will halve insert performance since inserts require two inserts instead of one.
And this is denormalization since there are now two copies of the data instead of one.

Please don't do this. This doesn't reduce complexity, it just moves it around. This sort of logic belongs in the application layer, where you can use a nicer language like PHP, Python, or Ruby to implement it.

Don't do this. The fact that it's called a trick and not a standard way of doing something says enough for me.
This totally kills the normal usage pattern of the relational model. Not sure that it actually kills normal form as you can still have that all in place. It's just messing with the way data is making it to the destination tables. Looks like a performance nightmare on top of a maintenance nightmare. Imagine one table having a trigger that has to fire for 1,800 plus table inserts for example. That just makes me feel sick.
This is an interesting parlor trick, nothing more.

I would suppose that this would be quite slow, as the advantages of "bulk inserts" cannot be used.

Related

Which is the right database for the job?

I am working on a feature and could use opinions on which database I should use to solve this problem.
We have a Rails application using MySQL. We have no issues with MySQL and it runs great. But for a new feature, we are deciding whether to stay with MySQL or not. To simplify the problem, let's assume there is a User and a Message model. A user can create messages. The message is delivered to other users based on their association with the poster.
Obviously there is an association based on friendship, but there are many, many more associations based on the user's profile. I plan to store some metadata about the poster along with the message. This way I don't have to pull the metadata each time I query the messages.
Therefore, a message might look like this:
{
  id: 1,
  message: "Hi",
  created_at: 1234567890,
  metadata: {
    user_id: 555,
    category_1: null,
    category_2: null,
    category_3: null,
    ...
  }
}
When I query the messages, I need to be able to query based on zero or more metadata attributes. This call needs to be fast and occurs very often.
Due to the number of metadata attributes, and the fact that any number of them can be included in a query, creating SQL indexes here doesn't seem like a good idea.
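For illustration (using the metadata column names from the example above, with a hypothetical messages table), the queries end up with a shape like this, where a different subset of metadata columns is filtered on each call, so no single composite index covers all the combinations:

SELECT id, message, created_at
FROM messages
WHERE category_1 = 'sports'
  AND category_3 = 'something'   -- the next call might filter category_2 and category_7 instead
ORDER BY created_at DESC
LIMIT 50;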
Personally, I have experience with MySQL and MongoDB. I've started research on Cassandra, HBase, Riak and CouchDB. I could use some help from people who might have done the research as to which database is the right one for my task.
And yes, the messages table can easily grow into millions of rows.
This is a very open-ended question, so all we can do is give advice based on experience. The first thing to consider is whether it's a good idea to decide on something you haven't used before, instead of MySQL, which you are familiar with. It's boring not to use shiny new things when you have the opportunity, but believe me, it's terrible when you've painted yourself into a corner because you thought the new toy would do everything it said on the box. Nothing ever works the way it says in the blog posts.
I mostly have experience with MongoDB. It's a terrible choice unless you want to spend a lot of time trying different things and realizing they don't work. Once you scale up a bit you basically can't use things like secondary indexes, updates, and other things that make Mongo an otherwise awesomely nice tool (most of this has to do with its global write lock and the database format on disk, it basically sucks at concurrency and fragments really easily if you remove data).
I don't agree that HBase is out of the question. It doesn't have secondary indexes, but you can't use those anyway once you get above a certain traffic load. The same goes for Cassandra (which is easier to deploy and work with than HBase). Basically, you will have to implement your own indexing whichever solution you choose.
What you should consider are things like whether you need consistency over availability, or vice versa (e.g. how bad is it if a message is lost or delayed vs. how bad is it if a user can't post or read a message), and whether you will do updates to your data (e.g. data in Riak is an opaque blob; to change it you need to read it and write it back, whereas in Cassandra, HBase and MongoDB you can add and remove properties without first reading the object). Ease of use is also an important factor: Mongo is certainly easy to use from the programmer's perspective, and HBase is horrible, but just spend some time making your own library that encapsulates the nasty stuff - it will be worth it.
Finally, don't listen to me, try them out and see how they perform and how it feels. Make sure you try to load it as hard as you can, and make sure you test everything you will do. I've made the mistake of not testing what happens when you remove lots of data in MongoDB, and have paid for that dearly.
I would recommend looking at the presentation "Why databases suck for messaging", which is mainly about why you shouldn't use databases such as MySQL for messaging.
I think in this scenario CouchDB's changes feed may come in quite handy, although you would probably also have to create some more complex views based on querying the message metadata. If speed is critical, also look at Redis, which is really fast and comes with pub/sub functionality. MongoDB with its ad hoc query support may also be a decent solution for this use case.
I think you're spot-on in storing metadata along with each message! Sacrificing storage for faster retrieval time is probably the way to go. Note that it could get complicated if you ever need to change a user's metadata and propagate that to all the messages. You should consider how often that might happen, whether you'll actually need to update all the message records, and based on that whether it's worth paying the price for the sake of fewer queries (it probably is worth it, but that depends on the specifics of your system).
I agree with @Andrej_L that HBase isn't the right solution for this problem. Cassandra falls in with it for the same reason.
CouchDB could solve your problem, but you're going to have to define views (materialized indices) for any metadata you're going to want to query. If the whole point of not using MySQL here is to avoid indexing everything, then Couch is probably not the right solution either.
Riak would be a much better option since it queries your data using map-reduce. That allows you to build any query you like without the need to pre-index all your data as in couch. Millions of rows are not a problem for Riak - no worries there. Should the need arise, it also scales very well by simply adding more nodes (and it can balance itself too, so this is really a non-issue).
So based on my own experience, I'd recommend Riak. However, unlike you, I've no direct experience with MongoDB, so you'll have to judge it against Riak yourself (or maybe someone else here can answer that).
From my experience, HBase is not a good solution for your application.
Because:
It doesn't include secondary indexes by default (you'd have to install plugins or build something yourself), so you can effectively search only by the primary key. I have implemented secondary indexes using HBase and additional tables, but you can't use that approach in an online application, because getting a result means running a map/reduce job, and that takes a long time over millions of rows.
It's very difficult to support and tune this database. To work effectively you will use HBase with Hadoop, and that requires powerful machines, or several of them.
HBase is very useful when you need to build aggregation reports over large amounts of data. It seems that you don't.
"Due to the number of metadata attributes and the fact any number can be included in a query, creating SQL indexes here doesn't seem like a good idea."
It sounds like you need a join, so you can mostly forget about CouchDB until they sort out the multi-view code that was being worked on (not actually sure it is still being worked on).
Riak can query as fast as you make it, depends on the nodes
Mongo will let you create an index on any field, even if that is an array
CouchDB is very different: it builds indexes using a stored Map-Reduce (but without the reduce) that they call a "view"
RethinkDB will let you have SQL but a little faster
TokuDB will too
Redis will kill all in speed, but it's entirely stored in RAM
Single-level relations can be done in all of them, but differently in each.

how much work should we do in the database?

OK, I'm really confused as to exactly how much "work" should be done IN the database, and how much work should instead be done at the application level.
I mean, I'm not talking about obvious stuff, like converting strings into SHA2 hashes at the application level instead of the database level.
But rather the stuff that is more blurry, including, but not limited to: should we retrieve the data from 4 columns and do the uppercasing/concatenation at the application level, or should we do that at the database level and send the calculated result to the application?
And if you could list any other examples, that would be great.
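To make the uppercase/concatenation example concrete, here are the two options side by side (table and column names are invented):

-- Option A: do the work in the database and send only the calculated result.
SELECT UPPER(CONCAT(first_name, ' ', last_name)) AS display_name
FROM customers;

-- Option B: send the raw columns and let the application do the
-- uppercasing/concatenation itself.
SELECT first_name, last_name
FROM customers;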
It really depends on what you need.
I like to do my business logic in the database; other people are religiously against that.
You can use triggers and stored procedures/functions in SQL.
Links for MySQL:
http://dev.mysql.com/doc/refman/5.5/en/triggers.html
http://www.mysqltutorial.org/introduction-to-sql-stored-procedures.aspx
http://dev.mysql.com/doc/refman/5.5/en/stored-routines.html
My reasons for doing business logic in triggers and stored procedures
Note that I'm not talking about bending the database structure towards the business logic, I'm talking about putting the business logic in triggers and stored procedures.
It centralizes your logic: the database is a central place, and everything has to go through it. If you have multiple insert/update/delete points in your app (or you have multiple apps), you'll need to do the checks multiple times; if you do it in the database, you only have to do the checks in one place.
It simplifies the application, e.g. you can just add a member and the database will figure out if the member is already known and take the appropriate action (see the sketch after this list).
It hides the internals of your database from the application, if you do all your logic in the application you will need intricate knowledge of your database in the application. If you use database code (triggers/procs) to hide that, you don't need to know every database detail in your app.
It makes it easier to restructure your database. If you have the logic in your database, you can just change a table layout, replace the old table with a blackhole table, put a trigger on that and let the trigger do the updates to the new table. Your app does not even need to know the database has changed; this allows legacy apps to keep working unchanged, whilst new apps can use the improved database layout.
Some things are easier in SQL
Some things work faster in SQL
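As a rough sketch of the "add a member" example above (all names invented), a single stored procedure can hold the "is this member already known?" check, so every caller from any app goes through the same path:

DELIMITER //
CREATE PROCEDURE add_member(IN p_email VARCHAR(100), IN p_name VARCHAR(100))
BEGIN
    -- The database decides whether the member already exists.
    IF EXISTS (SELECT 1 FROM members WHERE email = p_email) THEN
        UPDATE members SET name = p_name WHERE email = p_email;
    ELSE
        INSERT INTO members (email, name) VALUES (p_email, p_name);
    END IF;
END//
DELIMITER ;

-- Every application just calls:
CALL add_member('ann@example.com', 'Ann');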
I don't like to use (lots of and/or complicated) SQL code in my application. I like to put the SQL code in a stored procedure/function and try to put only simple queries in my application code; that way I can just write code that explains what I mean in my application and let the database layer do the heavy lifting.
Some people disagree strongly with this, but this approach works well for me and has simplified debugging and maintenance of my applications a lot.
Generally, it's good practice to expect only "data" from the database. It's up to the application(s) to apply business/domain logic and make sense of the data retrieved. It's highly recommended to do the following things in the application layer:
1) Formatting Date
2) Applying Math functions, such as interpolation/extrapolation, etc
3) Dynamic sorting (based on columns)
However, situations sometimes warrant a few things being done at the database level.
In my opinion the application should use data and the database should provide it, and that should be a clear separation of concerns. So the database gives back records sorted, ordered and filtered according to the requested conditions, but it is up to the application to apply some business logic to those records and "convert" them into something meaningful to the user.
For example, in my previous company we worked on a big application for work-time calculations. One of the obvious functionalities in this kind of application is tracking employees' vacation days - how many days an employee has per year, how many he has used, how many are left, etc. Basically we could have written some triggers and procedures that would update those columns automatically: when an employee has his vacation days approved, the number of days he applied for is taken from his "vacation pool" and added to "vacation days used". Pretty easy stuff, but we decided to make it explicit at the application level, and boy, very soon we were happy we did it that way. The application had to be labour-law compliant, and it quickly turned out that vacation days are not calculated equally for all employees, and sometimes a vacation day is not really a vacation day at all - but that is beside the point. Had we put this "easy" operation in the database, we would have had to version our database with every little change to the vacation-days logic, and that would have led us straight to hell on the customer-support front, given that it was possible to update only the application without needing to update the database (except at clear "breakthrough" moments where the database structure was changed, of course).
In my experience I've found that many applications start with a straightforward set of tables and then a handful of stored procedures to provide basic functionality. This works very well; it usually yields high performance and is simple to understand, and it also mitigates any need for a complex middle tier.
However, applications grow. It's not unusual to see large data-driven applications with thousands of stored procedures. Throw triggers into the mix and you have an application which, for anybody other than the original developers (if they're still working on it), is very difficult to maintain.
I will put a word in for applications which place most logic in the database - they can work well when you have some good database developers and/or you have a legacy schema which cannot be changed. The reason I say this is that ORMs take much of the pain out of this part of application development when you let them control the schema (if not, you often need to do a lot of fiddling to get it working).
If I was designing a new application then I would usually opt for a schema which is dictated by my application domain (the design of which will be in code). I would normally let an ORM handle the mapping between the objects and the database. I would treat stored procedures as exceptions to the rule when it came to data access (reporting can be much easier in sprocs than trying to coax an ORM into producing a complex output efficiently).
The most important thing to remember though, is that there are no "best practices" when it comes to design. It is up to you the developer to weigh up the pros and cons of each option in the context of your design.

How easy (or otherwise) is it to tune a database AFTER 'going LIVE'?

It is looking increasingly like I'll have to go live before I have had the time to tweak all the queries/tables etc. for the website (already 6 months behind schedule, so although this is not the ideal scenario - that's how things are).
It's now a case of having to bite the bullet. It's just a case of trying to work out how big that bullet will be when we come to 'biting it'. Once the database goes live, obviously we can't change the data on a whim, because it's live data. I am fairly confident about most of the db schema - e.g. the tables are mostly in 3rd and 4th normal form, and constraints are used to ensure data integrity. I have also put indexes on some columns that (I think) will be used a lot in queries, though this was done quite hurriedly and not tested - this is the bit I am worried about.
To clarify, I am not talking about wholesale structure change. The tables themselves are unlikely to change (if ever), however it is almost guaranteed that I will have to tune the tables at some stage (either personally or by hiring someone).
I want to know how much of a task this is. Specifically, assuming a database of a few gigabytes (so far roughly 300 tables)
Assuming 50% of the tables need tuning in the next few months:
How long will it take to perform the tuning (I know this is a "how long is a piece of string" type question) - but what are the main determinants of the effort required, so I can work out how long it is likely to take?
Is it possible to lock either sections of the database or specific tables whilst the indexes are being reworked, or does the entire database need to go offline? (I am using MySQL 5.x as the db)
Is what I describe (going live before ALL tables perfectly tuned) outrageously risky/inadvisable ? (Does it justify the months of sleepless nights this has caused me so far) ?
In general it is much harder to fix a poor database design that is causing performance issues after going live, because you have to deal with the existing records. Even worse, the poor design may not become apparent until months after going live, when there are many records instead of a few. This is why databases should be designed with performance in mind (no, this is not premature optimization; there are known techniques which generally perform better than others, and they should be considered in the design), and databases should be tested against a set of records that is close to, or larger than, the level of records you would expect after a couple of years.
As to how long it will take to completely fix a badly designed database: months or years. Often the worst part is something that is central to the design (say, an EAV table), and fixing it will require almost every query/sproc/view/UDF to be adjusted to move to a better structure. You then have to ensure all records are moved to the new, better structure. The sooner you can fix a mistake like this, the better. Far better to move a couple of thousand records to a new structure than 100,000,000.
If your structure is OK but your queries are bad, you are better off, as you can take the ten worst-performing queries (chosen not just by total time to run, but by time multiplied by the number of times run), fix them, then rinse and repeat.
If you are in the midst of fixing a poor database, this book might come in handy:
http://www.amazon.com/Refactoring-Databases-Evolutionary-Database-Design/dp/0321293533/ref=sr_1_1?ie=UTF8&s=books&qid=1268158669&sr=8-1
I would try at least to quantify the limits of the database before going live, so that at least you would know when the activity generated from your application is getting near to that threshold.
You may want to simulate (automatically as much as possible) the typical usage of the database from your application, and check how many concurrent users/sessions/transactions, etc it can handle before it breaks. This, at least, should allow you to solve the "sleepless nights" issue.
As for the original "How easy is it...?" question, the answer obviously depends on many factors. However, the above analysis would undoubtedly help, as at the very least you will be in a position to say whether your database requires tweaking or not.
To answer the title question, I'd say it's fairly easy to tune your DB after deploying into Production.
It's a great idea to be improving performance after deploying to any environment. Being Production adds a bit of pressure, along with the schedule. I'd suggest deploying to Prod, and let it perform as it will. Then start measuring:
how long to run Report X in different times (peak vs after-hours, if there is such a concept in your app).
what's the user's experience when using the app for those critical use-cases?
Then take a backup of your Prod environment, and create yourself a pre-Prod environment. There you'll be able to run your upgrade scripts to be able to measure the 'how long' type questions you have. Index creation, upgrade down-times, etc. When tuning queries, etc, you'll have a great idea of how it performs with production data & volumes. Granted, you won't have the benefits of having those users performing those inserts at the same time.
Keep that backup for multiple iterations, failed upgrades, new/unprepared-for issues, etc.
Keep making backups after each deployment, so that you can test the next round of improvements to your DB.
It depends on what you're tuning. Let's say you're adding an index to a couple tables, or changing a table type from MyISAM to InnoDB or something, then with a large enough table, those things could be done in 5 to 10 minutes depending on your hardware. It won't take hours. That said, it's still best to do any live-db tuning in the middle of the night.
You can grab a read lock by calling FLUSH TABLES WITH READ LOCK, but it's probably better to put up a "we're doing maintenance" message in your app for the 15-30 minutes you're doing it, just to be safe.
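For illustration, with hypothetical table and index names, the statements involved in this kind of tuning look like:

-- Typical after-the-fact tuning operations:
ALTER TABLE orders ADD INDEX idx_customer_created (customer_id, created_at);
ALTER TABLE orders ENGINE = InnoDB;   -- e.g. converting from MyISAM

-- The read lock mentioned above; note it blocks all writes (including from
-- the session holding it) until released, so keep the window short.
FLUSH TABLES WITH READ LOCK;
-- ... read-only work / consistent backup here ...
UNLOCK TABLES;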
The risk is inherent to the situation and what happens if there are serious problems. I usually take a more cowboy approach and take stuff live, especially if they aren't under high load so I can easily find pain points and fix them. If this is a mission critical system, then no, load test or whatever you can first to be sure you're as ready as you can be. Also, keep in mind that you cannot foresee all the issues you'll have. If your indexes are good, then you're probably okay to take it live and see what needs to be worked on.

What are the essential dba skills a developer should learn?

Creation of objects like tables and indexes are fairly essential, even if the code has to be authorized or created by the dba. What other areas normally carried out by dbas should the accomplished developer be aware of?
A developer is responsible for doing everything that makes his code a) correct and b) fast.
This of course includes creating indexes and tables.
Making a DBA responsible for indexes is a bad idea. What if the code runs slowly? Who is to be blamed: a developer with bad code or a DBA with a bad index?
A DBA should take care of database-supporting operations like making backups, building the infrastructure, etc., and report any lack of resources.
He or she should not be the sole person making the decisions that affect the performance of the whole database system.
Relational databases, as of now, are not yet at the point that would allow splitting responsibility so that developers make the queries right and the DBA makes them fast. That's a myth.
If there is a lack of resources (say, an index makes some query fast at the expense of some DML operation being slow), this should be reported by a DBA, not fixed.
Now it is decision-making time. What do we need more, a fast query or a fast insert?
This decision should be made by the program manager (and not the DBA or developer).
And when the decision is made, the developer should be given the new task: "make the SELECT query as fast as possible, taking into account that you don't have this index". Or: "make the INSERT query as fast as possible, taking into account that you will have this index".
A developer should know everything about how a database works, when it works normally.
A DBA should know everything about how to make a database to work normally.
The latter includes the ability to make a backup, the ability to restore from a backup, and the ability to detect and report resource contention.
The ins and outs of database storage and optimization are huge. Knowing how to index and partition tables well is invaluable knowledge.
Also, how to read a query execution plan. SQL is such a cool language in that it will tell you exactly how it's going to run your code, so long as you ask nicely. This is absolutely essential in optimizing your code and finding bottlenecks.
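In MySQL, "asking nicely" means prefixing the query with EXPLAIN; a minimal example with hypothetical tables:

EXPLAIN
SELECT o.id, o.total
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'DE';
-- The output lists, per table, which index (if any) is used, the join type,
-- and an estimated row count; a join type of ALL means a full table scan.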
Database maintenance (backups, shrinking files, etc) is always important to keep your server running smoothly. It's something that's often overlooked, too.
Developers should know all about triggers and stored procedures--getting the database to work for you. These things can help automate so many tasks, and often developers overlook them and try to handle it all app side, when they should really be handled by something that thinks in sets.
Which brings me to the most important point: database developers need to think in sets. Too often I hear "for each row, I want to..." and this is generally an alarm in my head. You should be thinking about how the set interacts and the actions you want to take on entire columns.
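A small illustration of the difference, with an invented schema: the row-by-row instinct is to loop over matching customers in application code and issue one UPDATE per row, whereas the set-based version is a single statement acting on the whole set:

UPDATE orders o
JOIN customers c ON c.id = o.customer_id
SET o.discount = 0.10
WHERE c.loyalty_tier = 'gold';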
Optimization. Your code should always use as few resources as you can manage.
I would recommend developing an understanding of the security architecture for the relevant DBMS.
Doing so could facilitate your development of secure code.
With SQL Server specifically in mind for example:
Understand why your “managed code” (such as .NET CLR) should not be granted elevated privileges. What would be the implications of doing so?
What is Cross-Database ownership chaining? How does it work?
Understand execution context.
How does native SQL Server encryption work?
How can you sign a stored procedure? Why would you even want to do this?
Etc.
As a general rule, the more you understand about the engine you are working with, the more performance you can squeeze from it.
One thing that currently springs to mind is how to navigate and understand the information that the database "system" tables/views give you, e.g. in SQL Server the views that sit under the master database. These views hold information such as current logins, lists of tables and partitions, etc., which is all useful in trying to track down things such as hung logins or whether users are currently connected.
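Since the rest of this page is MySQL-flavoured, the rough MySQL counterparts (an assumption on my part, as the answer's examples are SQL Server) would be the SHOW commands and INFORMATION_SCHEMA, for example:

-- Who is connected right now, and what are they running?
SHOW FULL PROCESSLIST;

-- List the tables in one (hypothetical) schema with row counts and sizes:
SELECT table_name, table_rows, data_length, index_length
FROM information_schema.tables
WHERE table_schema = 'mydb';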
Relationships of your tables. You should always have a recent printout and soft copy of your database schema. You need to know the primary keys, foreign keys, and required and auto-filled columns; without that I think you can't write efficient queries or make sure your database is carrying only what it needs.
I think everyone else covered it.
Having a good understanding of the architecture of your database system will definitely be helpful. Can you draw a diagram by heart to show components of your DBMS and their interactions?

Should I put my entire site on one database or split into two databases?

My webapp that I am developing will incorporate a forum (phpbb3 or smf), a wiki (doku wiki flat file storage), and my main web app.
All of these will share a User table. I expect the forum to see moderate usage (roughly 200 posts a day) to begin with, increasing from there. The main web app will make heavy use of triggers, MySQL events, and stored procedures.
Would splitting the forum and main web app into separate databases be the wiser choice (i.e. for maintainability and performance)?
Why would you ever use two databases? What possible reason can there be? Please update the question to explain why you think two databases has some advantage over one database.
Generally, everyone tries to use one database to keep the management overhead down to an acceptable level. Why add complexity?
Keep your app simple to begin with. If you're not expecting a huge amount of traffic to begin with, then it's fine to have a single database. You can always upgrade your site later. Changing which database a table is stored in shouldn't require a huge amount of code.
It depends. If there is a real chance of optimization, then splitting is good; otherwise not. Another thing is that even if you split, you still have to manage all of the databases successfully, which is a bit of overhead.
For the sake of maintainability in the future, I would definitely recommend against splitting DBs when you have a shared table. This will only serve to increase the size of your queries (since joins across DBs will require further qualification). This also will eventually lead to confusion when someone new can't figure out why their query doesn't work.
Furthermore, if the two DBs are actually running on separate servers or instances, you won't be able to join at all and will be forced to move back and forth between code and query to do something that would otherwise be a simple join. That may not be a big deal for simple one-row look-up queries, but for more complicated summary-type queries it means dumping a lot of the processing that RDBMSs are specifically optimized for onto your more general-purpose programming language.
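For example (database and table names invented), a cross-database join on the same MySQL server works but needs the extra qualification everywhere; once the databases live on different servers it stops being possible at all:

SELECT u.username, p.subject
FROM app_db.users u
JOIN forum_db.posts p ON p.user_id = u.id;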
Separating into two databases does not guarantee better performance. Putting tables in different tablespaces (i.e. on different drives) could boost performance, as independent queries then compete less often for the hard disk's read/write head.