In the physics lab I work in, at the beginning of each experiment (of which we run several per minute) we perform a series of Inserts into a MariaDB database. One of the tables has a few hundred columns - each corresponding to a named variable - and serves as a log of the parameters used during that run. For example, one variable is the laser power used during a particular step of the experiment.
Over time, experimenters add new variables to parametrize new steps of the experiment. Originally my code handled this by simply adding new columns to the table - but as the number rows in the table increased beyond around 60000, the time it took to add a column became unusably long (over a minute).
For now, I have circumvented the problem by pre-defining a few hundred extra columns which can be renamed as new variables are needed. At the rate at which variables are added, however, this will only last our lab (or the other labs that use this software) a few years before table maintenance is required. I am wondering whether anyone can recommend a different architecture or different platform that would provide a natural solution to this "number of columns" problem.
I am guessing you are running various different types of experiments and that is why you need an ever increasing number of variables? If that is the case, you may want to consider either:
having a separate table for each type of experiment,
separate tables to hold each experiment type's parameter values (that reference the experiment in the primary table),
have a simpler "experiment parameters" table that has 3 (or more, depending on complexity of values) references: the experiment, the parameter, and parameter value.
My preference would be to one of the first two options, the third one tends to make data a bit more complicated, to analyze AND maintain, than the flexibility is worth.
It would seem that EAV is best for your scenario. I would always steer away from it, but in this case it seems to make sense. I would keep the last n experiments of data in the main table(s), and dog off the other ones to an archive table. Naturally you would know of the speed increases in archiving away data not needed at the moment, yet always available with joins to larger tables.
For an introduction into EAV, see a web ddocument by Rick James (a stackoverflow User). Also, visit the questions available on the stack here.
Everytime I look at EAV I wonder why in the world would anyone use it to program against. But just imagining the academic/experimental/ad-hoc environment that you must work in, I can't help but think it might be the best one for you. The following is a high-level exploratory question entitled Should I use EAV model?.
Related
So basically I am in the process of creating a personal finance tracking system. It occurred to be that keeping tabs on when each instance and transaction was last edited or updated might be of relevant information some day.
Now as far as I can see there are two approaches to implement something like this:
Create "updated" fields to all the tables I want to keep track of and then let mysql update those fields for me (ON UPDATE clause)
Create a completely seperate table for holding the log data and then update that with a triggers and transactions
Now it seems that 1st approach would have the benefit of keeping things simple and easy to maintain. However how this will impact the performance if I suddenly decide to get every log in the database for review. Also this would kind of goes against normalization (not by much though) with same data stored in multiple tables.
The second approach would allow more flexibility to the logging system and might actually shorten the sql query necessary to retrieve certain data. However it would make the schema more complex as two additional tables would have to be created (the actual log table and many-to-many relation table for holding the keys) and maintained. On the other hand if I ever want to implement an activity history this approach would propably be the only one capable of doing it.
As such I would like to know some more pros and cons to each method. Since 2nd option allows more flexibility I am considering implementing it but I am not sure about performance issues. In the end it comes down to two guestions:
Are there any real life examples where both approaches are
implemented?
And:
Are there any studies, comparisons or other resource that might shed
some light on which is considered more performance friendly and "best
practices" approach?
It depends on what kind of reporting you need and your current architecture.
If you just want to know last update date, then having 2 fields (creation date and last update) should be enough. That's because having separate table won't give any perfomance boost, but will make your code harder to maintain.
It's another story if you want to have something more elaborate, like reporting differences (what was changed) and/or have full change log on each transaction (there might be few updates to one transaction, right?). In this case you actually must have separate table, because otherwise it will bloat your table and reduce perfomance.
Based on my experience, I'd go with separate table. That's because it will be easier to maintain - your logging logic will be practically separated from everything else and I think one day you'll need that additional info on your transactions and full transaction history.
As far as perfomance goes, you won't notice any formidable difference unless your system will be under serious load. But as your system is personal, any choice would suffice, just don't forget about proper indexing.
Note that I'm making alot of assumptions here, so if you want something more specific, please provide your actual architecture and reporting needs. I'd suggest some books on high availability/perfomance, but they are not on your specific needs, but on general availability/perfomance.
Related but not quite the same thing:which is more effcient? (or at least reading through it didn't help me any)
So I am working on a new site (selling insurance policies) we already have several sites up (its a rails application) that do this so I have a table in my sql database called policies.
As you can imagine it has lots of columns to support all the different options available.
While working on this new site I realized I needed to keep track of 20+ more options.
My concern is that the policies table is already large, but the columns in it right now are almost all used by every application we have. Whereas if I add these they would only be used for the new site and would leave tons of null cells on all the rest of the policies.
So my question is do I add those to the existing table or create a new table just for the policies sold on that site? Also I believe that if I created a new table I could leave out some of the columns (but not very many) from the main policies table because they are not needed for this application.
"[A]lmost all used" suggests that you could, upon considering it, split it more naturally.
Now, much of the efficiency concern here goes down to three things:
A single table can be scanned through more quickly than joins across several.
Large rows have a memory and disk-space cost in themselves.
If a single table represents something that is really a 1-to-many, then it requires more work on insert, delete or update.
Point 2 only really comes in, should there be a lot of cases where you need one particular subset of the data, and another batch where you need another subset, and maybe just a few where you need them all. If you're using most of the columns in most places, then it doesn't gain you anything. In that case, splitting tables is bad.
Point 1 and 3 argue for and against joining into one big table, respectively.
Before any of that though, let's get back to "almost all". If there are several rows with a batch of null fields, why? Often answering that "why?" reveals that really there's a natural split there, that should be broken off into another table as part of normal normalisation*. Repetition of fields, is an even greater suggestion that this is the case.
Do this first.
To denormalise - whether by splitting what is naturally one table, or joining what is naturally several - is a very particular type of optimisation - it makes some things more efficient at the cost of making other things less efficient, and it introduces possibilities of bugs that don't exist otherwise. I would never say you should never denormalise - I do it myself - but you need to be able to say "I am denormalising table X & Y in this manner, because it will help case C which happens enough and I can live with the extra cost to case D". Then you need to check it actually did help case C significantly and case D insignificantly, along with looking for hidden costs.
One of the reasons for normalising in the first place is it gives good average performance over a wide range of cases. It's the balance you want most of the time. Denormalising from the get-go rather than with a normalised database as a starting point is almost always premature.
*Fun trivia fact: The name "normalization" was in part a take on Richard Nixon's "Vietnamisation" policy meaning there was a running joke in some quarters of adding "-isation" onto just about anything. Were it not for the Whitehouse's reaction to the Tet Offensive, we could be using the gernund "normalising," or something completely different instead.
I'm currently designing a web application using php, javascript, and MySQL. I'm considering two options for the databases.
Having a master table for all the tournaments, with basic information stored there along with a tournament id. Then I would create divisions, brackets, matches, etc. tables with the tournament id appended to each table name. Then when accessing that tournament, I would simply do something like "SELECT * FROM BRACKETS_[insert tournamentID here]".
My other option is to just have generic brackets, divisions, matches, etc. tables with each record being linked to the appropriate tournament, (or matches to brackets, brackets to divisions etc.) by a foreign key in the appropriate column.
My concern with the first approach is that it's a bit too on the fly for me, and seems like the database could get messy very quickly. My concern with the second approach is performance. This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo, so any and all help is appreciated. Thanks!
Do not create tables for each tournament. A table is a type of an entity, not an instance of an entity. Maintainability and scalability would be horrible if you mix up those concepts. You even say so yourself:
This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
How on Earth would you scale to that level if you need to create a whole table for each record?
Regarding the performance of your second approach, why are you concerned? Do you have specific metrics to back up those concerns? Relational databases tend to be very good at querying relational data. So keep your data relational. Don't try to be creative and undermine the design of the database technology you're using.
You've named a few types of entities:
Tournament
Division
Bracket
Match
Competitor
etc.
These sound like tables to me. Manage your indexes based on how you query the data (that is, don't over-index or you'll pay for it with inserts/updates/deletes). Normalize the data appropriately, de-normalize where audits and reporting are more prevalent, etc. If you're worried about performance then keep an eye on the query execution paths for the ways in which you access the data. Slight tweaks can make a big difference.
Don't pre-maturely optimize. It adds complexity without any actual reason.
First, find the entities that you will need to store; things like tournament, event, team, competitor, prize etc. Each of these entities will probably be tables.
It is standard practice to have a primary key for each of them. Sometimes there are columns (or group of columns) that uniquely identify a row, so you can use that as primary key. However, usually it's best just to have a column named ID or something similar of numeric type. It will be faster and easier for the RDBMS to create and use indexes for such columns.
Store the data where it belongs: I expect to see the date and time of an event in the events table, not in the prizes table.
Another crucial point is conforming to the First normal form, since that assures data atomicity. This is important because it will save you a lot of headache later on. By doing this correctly, you will also have the correct number of tables.
Last but not least: add relevant indexes to the columns that appear most often in queries. This will help a lot with performance. Don't worry about tables having too many rows, RDBMS-es these days handle table with hundreds of millions of rows, they're designed to be able to do that efficiently.
Beside compromising the quality and maintainability of your code (as others have pointed out), it's questionable whether you'd actually gain any performance either.
When you execute...
SELECT * FROM BRACKETS_XXX
...the DBMS needs to find the table whose name matches "BRACKETS_XXX" and that search is done in the DBMS'es data dictionary which itself is a bunch of tables. So, you are replacing a search within your tables with a search within data dictionary tables. You pay the price of the search either way.
(The dictionary tables may or may not be "real" tables, and may or may not have similar performance characteristics as real tables, but I bet these performance characteristics are unlikely to be better than "normal" tables for large numbers of rows. Also, performance of data dictionary is unlikely to be documented and you really shouldn't rely on undocumented features.)
Also, the DBMS would suddenly need to prepare many more SQL statements (since they are now different statements, referring to separate tables), which would present the additional pressure on performance.
The idea of creating new tables whenever a new instance of an item appears is really bad, sorry.
A (surely incomplete) list of why this is a bad idea:
Your code will need to automatically add tables whenever a new Division or whatever is created. This is definitely a bad practice and should be limited to extremely niche cases - which yours definitely isn't.
In case you decide to add or revise a table structure later (e.g. adding a new field) you will have to add it to hundreds of tables which will be cumbersome, error prone and a big maintenance headache
A RDBMS is built to scale in terms of rows, not tables and associated (indexes, triggers, constraints) elements - so you are working against your tool and not with it.
THIS ONE SHOULD BE THE REAL CLINCHER - how do you plan to handle requests like "list all matches which were played on a Sunday" or "find the most recent three brackets where Frank Perry was active"?
You say:
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo...
Can you remember another project where tables were cloned whenever a new set was required? If yes, didn't you notice some problems with that approach? If not, have you considered that this is precisely what a DBA would never ever do for any reason whatsoever?
I am stuck between row vs columns table design for storing some items but the decision is which table is easier to manage and if columns then how many columns are best to have? For example I have object meta data, ideally there are 45 pieces of information (after being normalized) on the same level that i need to store per object. So is 45 columns in a heavry read/write table good? Can it work flawless in a real world situation of heavy concurrent read/writes?
If all or most of your columns are filled with data and this number is fixed, then just use 45 fields. It's nothing inherently bad with 45 columns.
If all conditions are met:
You have a possibility of the the attributes which are neither known nor can be predicted at design time
The attributes are only occasionally filled (say, 10 or less per entity)
There are many possible attributes (hundreds or more)
No attribute is filled for most entities
then you have a such called sparce matrix. This (and only this) model can be better represented with an EAV table.
"There is a hard limit of 4096 columns per table", it should be just fine.
Taking the "easier to manage" part of the question:
If the property names you are collecting do not change, then columns is just fine. Even if it's sparsely populated, disk space is cheap.
However, if you have up to 45 properties per item (row) but those properties might be radically different from one element to another then using rows is better.
For example taking a product catalog. One product might have color, weight, and height. Another might have a number of buttons or handles. These are obviously radically different properties. Further this type of data suggests that new properties will be added that might only be related to a particular set of products. In this case, rows is much better.
Another option is to go NoSql and utilize a document based database server. This would allow you to set the named "columns" on a per item basis.
All of that said, management of rows will be done by the application. This will require some advanced DB skills. Management of columns will be done by the developer at design time; which is usually easier for most people to get their minds around.
I don't know if I'm correct but I once read in MySQL to keep your table with minimum columns IF POSSIBLE, (read: http://dev.mysql.com/doc/refman/5.0/en/data-size.html ), do NOTE: this is if you are using MySQL, I don't know if their concept applies to other DBMS like oracle, firebird, posgresql, etc.
You could take a look at your table with 45 column and analyze what you truly need and leave the optional fields into other table.
Hope it helps, good luck
I have a database where most tables have a delete flag for the tables. So the system soft deletes items (so they are no longer accessible unless by admins for example)
What worries me is in a few years, when the tables are much larger, is that the overall speed of the system is going to be reduced.
What can I do to counteract effects like that.
Do I index the delete field?
Do I move the deleted data to an identical delete table and back when undeleted?
Do I spread out the data over a few MySQL servers over time? (based on growth)
I'd appreciate any and all suggestions or stories.
UPDATE:
So partitioning seems to be the key to this. But wouldn't partitioning just create two "tables", one with the deleted items and one without the deleted items.
So over time the deleted partition will grow large and the occasional fetches from it will be slow (and slower over time)
Would the speed difference be something I should worry about? Since I fetch most (if not all) data by some key value (some are searches but they can be slow for this setup)
I'd partition the table on the DELETE flag.
The deleted rows will be physically kept in other place, but from SQL's point of view the table remains the same.
Oh, hell yes, index the delete field. You're going to be querying against it all the time, right? Compound indexes with other fields you query against a lot, like parent IDs, might also be a good idea.
Arguably, this decision could be made later if and only if performance problems actually appear. It very much depends on how many rows are added at what rate, your box specs, etc. Obviously, the level of abstraction in your application (and the limitations of any libraries you are using) will help determine how difficult such a change will be.
If it becomes a problem, or you are certain that it will be, start by partitioning on the deleted flag between two tables, one that holds current data and one that holds historical/deleted data. IF, as you said, the "deleted" data will only be available to administrators, it is reasonable to suppose that (in most applications) the total number of users (here limited only to admins) will not be sufficient to cause a problem. This means that your admins might need to wait a little while longer when searching that particular table, but your user base (arguably more important in most applications) will experience far less latency. If performance becomes unacceptable for the admins, you will likely want to index the user_id (or transaction_id or whatever) field you access the deleted records by (I generally index every field by which I access the table, but at certain scale there can be trade-offs regarding which indexes are most worthwhile).
Depending on how the data is accessed, there are other simple tricks you can employ. If the admin is looking for a specific record most of the time (as opposed to, say, reading a "history" or "log" of user activity), one can often assume that more recent records will be looked at more often than old records. Some DBs include tuning options for making recent records easier to find than older records, but you'll have to look it up for your particular database. Failing that, you can manually do it. The easiest way would be to have an ancient_history table that contains all records older than n days, weeks or months, depending on your constraints and suspected usage patterns. Newer data then lives inside a much smaller table. Even if the admin is going to "browse" all the records rather than searching for a specific one, you can start by showing the first n days and have a link to see all days should they not find what they are looking for (eg, most online banking applications that lets you browse transactions but shows only the first 30 days of history unless you request otherwise.)
Hopefully you can avoid having to go a step further, and sharding on user_id or some such scheme. Depending on the scale of the rest of your application, you might have to do this anyway. Unless you are positive that you will need to, I strongly suggest using vertical partitioning first (eg, keeping your forum_posts on a separate machine than your sales_records), as it is FAR easier to setup and maintain. If you end up needing to shard on user_id, I suggest using google ;-]
Good luck. BTW, I'm not a DBA so take this with a grain of salt.