Every user has, say, 3 GROUP_A's, 10 GROUP_B's per GROUP_A, and 20 GROUP_C's per GROUP_B. Each of the 20 GROUP_C's involves lots of inserts/deletes... and every piece of data is unique across all GROUPs/users.
I'm not an expert. I've done some research, but it's all theoretical at this point, and I certainly don't have hands-on experience with the implementation. I think my options are something like 'adjacency lists' or 'nested sets'?
Any guidance into the right direction would be very much appreciated!
(I posted this on DBA stackexchange too but I'd really appreciate if I could get more opinions and help from the community!)
I know the trivial solution is just to have simple tables with foreign keys to the parent 'container', but I'm thinking about the long term, in the event there are a million users or so.
I would go with just that approach. As long as the number of hierarchy levels remains fixed, the resulting schema will likely scale well precisely because it is so simple. Fancy table structures and elaborate queries might work well enough for small data sets, but for large amounts of data, simple structures work best.
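For concreteness, here's a minimal sketch of that plain foreign-key layout; all table and column names are placeholders for your GROUP_A/B/C entities, and the types are assumptions:

```sql
-- Fixed three-level hierarchy modelled with plain foreign keys.
CREATE TABLE users (
    user_id    BIGINT PRIMARY KEY
);

CREATE TABLE group_a (
    group_a_id BIGINT PRIMARY KEY,
    user_id    BIGINT NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

CREATE TABLE group_b (
    group_b_id BIGINT PRIMARY KEY,
    group_a_id BIGINT NOT NULL,
    FOREIGN KEY (group_a_id) REFERENCES group_a (group_a_id)
);

CREATE TABLE group_c (
    group_c_id BIGINT PRIMARY KEY,
    group_b_id BIGINT NOT NULL,
    payload    TEXT,
    FOREIGN KEY (group_b_id) REFERENCES group_b (group_b_id)
);

-- Index the foreign keys so "all C's under a given B/A/user" stays a
-- handful of indexed lookups even with millions of users.
CREATE INDEX idx_group_a_user ON group_a (user_id);
CREATE INDEX idx_group_b_a    ON group_b (group_a_id);
CREATE INDEX idx_group_c_b    ON group_c (group_b_id);
```

Fetching a user's whole subtree is then just two or three indexed joins, and the heavy insert/delete traffic on GROUP_C touches only one narrow table.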
Things would be a lot more difficult if the number of levels could vary. If you want to be prepared for that case, you could devise a different approach, but it would probably scale badly as the amount of data increases.
I am torn. I am dealing with data that is very hard to model; a "job" currently has over 100 columns.
I put all of the columns into the job because 99.99% of the time, when I get a job's info, I need all of the data. Splitting it would probably get me better grades if I were a student, but it would simply mean joins every time I load the data.
One example I find hard to decide on is cargoes. A ship can have one (80% of the time), two (99% of the time), or three (1% of the time) cargoes, never four. Storing cargoes in a 1:n relationship with the job is very easy, but it also means that:
Every time I load a job, I need an extra query to get the cargoes
CRUD is a little more painful, as I have to make another store, with permissions, etc.
However, now I have these columns in my DB:
cargoId1, cargoDescription1, contractTonnage1,
contractTonnageTolerance1, commentsOnTonnageTolerance1,
tonnageToBeLoaded1, tonnageLoaded1
cargoId2, cargoDescription2, contractTonnage2,
contractTonnageTolerance2, commentsOnTonnageTolerance2,
tonnageToBeLoaded2, tonnageLoaded2
cargoId3, cargoDescription3, contractTonnage3,
contractTonnageTolerance3, commentsOnTonnageTolerance3,
tonnageToBeLoaded3, tonnageLoaded3
What would you do? Ideas?
I'll have to warn you that you will probably get downvotes, close votes and/or delete votes for a "primarily opinion-based" question. I think your question IS primarily opinion-based, as it is essentially synonymous with "pros and cons of normalization". (P.S.: I hate that this should get you downvotes, though.)
One thing you could do, if you would like the best of both worlds, is to keep the table normalized and create a view that returns the de-normalized form with PIVOT. This way, the integrity of your data benefits from normalization, and WRITING a query will be easier. You do pay for the joins (only slightly, with a good index), but IMO that's a small price for integrity.
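As a rough sketch of that idea (the table and column names here are made up; engines without a PIVOT keyword can build the same view with conditional aggregation):

```sql
-- Normalized cargo rows: one row per cargo, at most three per job.
CREATE TABLE job_cargo (
    cargo_id         BIGINT PRIMARY KEY,
    job_id           BIGINT NOT NULL,        -- FK to your jobs table
    cargo_slot       SMALLINT NOT NULL,      -- 1, 2 or 3
    description      VARCHAR(200),
    contract_tonnage DECIMAL(12,2),
    UNIQUE (job_id, cargo_slot)
);

-- "Manual pivot" view that returns one wide row per job, mimicking the
-- old cargoId1/cargoId2/cargoId3 layout for reads.
CREATE VIEW job_cargo_wide AS
SELECT job_id,
       MAX(CASE WHEN cargo_slot = 1 THEN cargo_id    END) AS cargo_id_1,
       MAX(CASE WHEN cargo_slot = 1 THEN description END) AS cargo_description_1,
       MAX(CASE WHEN cargo_slot = 2 THEN cargo_id    END) AS cargo_id_2,
       MAX(CASE WHEN cargo_slot = 2 THEN description END) AS cargo_description_2,
       MAX(CASE WHEN cargo_slot = 3 THEN cargo_id    END) AS cargo_id_3,
       MAX(CASE WHEN cargo_slot = 3 THEN description END) AS cargo_description_3
FROM job_cargo
GROUP BY job_id;
```

Writes go against the narrow normalized table; the wide view is only there so loading a job stays a single SELECT.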
So we're looking to store two kinds of indexes.
First kind will be in the order of billions, each with between 1 and 1000 values, each value being one or two 64 bit integers.
Second kind will be in the order of millions, each with about 200 values, each value between 1KB and 1MB in size.
And our usage pattern will be something like this:
Both kinds of index will have values added to the top up to thousands of times per second.
Indexes will be infrequently read, but when they are read it'll be the entirety of the index that is read
Indexes should be pruned, either on writing values to the index or in some kind of batch type job
Now, we've considered quite a few databases; our favourites at the moment are Cassandra and PostgreSQL. However, our application is in Erlang, which has no production-ready bindings for Cassandra. A major requirement is that it can't require too much manpower to maintain. I get the feeling that Cassandra is going to throw up unexpected scaling issues, whereas PostgreSQL is just going to be a pain to shard, but at least for us it's a known quantity. We're already familiar with PostgreSQL, but not hugely well acquainted with Cassandra.
So. Any suggestions or recommendations as to which data store would be most appropriate to our use case? I'm open to any and all suggestions!
Thanks,
-Alec
You haven't given enough information to support much of an answer re: your index design. However, Cassandra scales up quite easily by growing the cluster.
You might want to read this article: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
A more significant issue for Cassandra is whether it supports the kind of queries you need - scalability won't be the problem. From the numbers you give, it sounds like we are talking about terabytes or tens of terabytes, which is very safe territory for Cassandra.
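If you do end up on the PostgreSQL side, a rough sketch of the "first kind" of index as a plain append-heavy table might look like this; the names, types, and the time-based pruning rule are all assumptions based on your description:

```sql
-- One row per value in an index; a value is one or two 64-bit integers.
CREATE TABLE index_values (
    index_id BIGINT      NOT NULL,
    added_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    value_a  BIGINT      NOT NULL,
    value_b  BIGINT                 -- NULL when the value is a single integer
);

-- Reads are "give me the whole index", so key the lookups on index_id;
-- the timestamp supports pruning old entries in a batch job.
CREATE INDEX idx_index_values_lookup ON index_values (index_id, added_at);

-- Example batch prune (run periodically):
DELETE FROM index_values WHERE added_at < now() - INTERVAL '30 days';
```

At billions of indexes you would still need partitioning or sharding on top of this, which is where the maintenance question comes in.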
Billions is not a big number by today's standards, so why not write a benchmark instead of guessing? That will give you a better decision tool, and it's really easy to do. Just install your target OS and each database engine, then run queries with, let's say, Perl (because I like it).
It won't take you more than one day to do all this; I've done something like this before.
A nice way to benchmark is to write a script that executes queries randomly, or following something like a Gaussian bell curve, "simulating" real usage. Then plot the data, or do it like a boss and just read the logs.
This question already has answers here:
Is storing a delimited list in a database column really that bad?
(10 answers)
Closed 8 years ago.
A while ago, I came to the realization that the way I would like to hold the skills for a player in a game is CSV format. On the player's stats, I made a varchar column of skills stored as CSV (1,6,9,10, etc.). I made a 'skills' table with affiliated stats for each skill (name, effect), and when it comes time to see what skills a player has, all I have to do is query that single column and use PHP's str_getcsv() to check whether a certain skill exists, because it'll be in an array.
However, my coworker suggests that a superior system is to have each skill simply be a row in a master "skills" table that every player uses, and each skill row will have a foreign key ID back to the player. I just query all rows in this table, and what's returned will be their skills!
At first I thought this wouldn't be very good at all, but it appears the Internet disagrees. I understand that it's less searchable - but it was not my intention to ever say, "does the player have x skill?" or "show me all players with this skill!". At worst if I wanted such data, I'd just make a PHP report for it that would, admittedly, be slow.
But it appears as though this is really faster?! I'm having trouble finding a hard answer extending beyond "yeah it's good and normalized". Can Stack Overflow help me out?
Edit: Thanks, guys! I never realized how bad this was. And sorry about the dupe, but believe me, I didn't type all of that without at least checking for dupes. :P
Putting comma-separated values into a single field in a database is not just a bad idea, it is the incarnation of Satan expressed in a database model.
It cannot represent a great many situations accurately (cases in which the value contains a comma or something else that your CSV-consuming code has trouble with), often has problems with values nested in other values, cannot be properly indexed, cannot be used in database JOINs, is difficult to dedupe, cannot have additional information added to it (number of times the skill was earned, in your case, or a skill level), cannot participate in relational integrity, cannot enforce type constraints, and so on. The list is almost endless.
The case against it is especially strong in MySQL, which has the very convenient group_concat function that makes it easy to present this data as a comma-separated string when needed, while still maintaining the full functionality and speed of a normalized database.
You gain nothing from using the comma-separated approach but lose searchability and performance. Get Satan behind thee, and normalize your data.
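For instance (hypothetical table and column names), the normalized layout can still hand your PHP code a comma-separated list whenever you actually want one:

```sql
-- Normalized link table: one row per skill a player has.
CREATE TABLE player_skill (
    player_id INT NOT NULL,
    skill_id  INT NOT NULL,
    PRIMARY KEY (player_id, skill_id)
);

-- Rebuild the old CSV string on demand, purely for display.
SELECT player_id,
       GROUP_CONCAT(skill_id ORDER BY skill_id) AS skills_csv
FROM player_skill
GROUP BY player_id;
```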
Well, there are things such as scalability to consider. What if you need to add or remove a skill? How about renaming a skill? What happens if the number of skills outgrows the size of your field? It's bad practice to have to resize a field just to accommodate something like this.
What about maintainability? Could another developer come in and understand what you've done? What happens if the same skill is given to a player twice?
Your coworker's suggestion is not correct either. You would have 3 tables in this case: a master player table, a skills table, and a table with a relationship to both, creating a many-to-many relationship that allows a single skill to be associated with many players and many players to have the same skill.
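A minimal sketch of that three-table layout (all names hypothetical):

```sql
CREATE TABLE players (
    player_id INT PRIMARY KEY,
    username  VARCHAR(50) NOT NULL
);

CREATE TABLE skills (
    skill_id INT PRIMARY KEY,
    name     VARCHAR(50) NOT NULL,
    effect   VARCHAR(255)
);

-- Junction table: one row per (player, skill) pair. The composite primary
-- key also stops the same skill being given to a player twice.
CREATE TABLE player_skills (
    player_id INT NOT NULL,
    skill_id  INT NOT NULL,
    PRIMARY KEY (player_id, skill_id),
    FOREIGN KEY (player_id) REFERENCES players (player_id),
    FOREIGN KEY (skill_id)  REFERENCES skills (skill_id)
);
```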
Since the database will index the content (assuming that you use indexes), it will be very, very fast to search the content and get the desired rows. Remember: databases are designed to hold a lot of information, and a database such as MySQL, which is a relational database, is made for relations.
Another matter is the maintainability of the system. A normalized system is much, much easier to maintain, and adding or removing a skill becomes much simpler.
When you fetch a player's skills from the database, you can easily pull the information connected to those skills with a simple JOIN.
I say: Let the database do what it does best - handle the data. And let your programming do what it should do ;)
I'm working on a browser-based RPG for one of my websites, and right now I'm trying to determine the best way to organize my SQL tables for performance and maintenance.
Here's my question:
Does the number of columns in an SQL table affect the speed in which it can be queried?
I am not a newbie when it comes to PHP or MySQL. I used to develop things with the common goal of getting them to work, but I've recently advanced to the stage where a functional program is not good enough unless it's fast and reliable.
Anyways, right now I have a members table that has around 15 columns. It contains information such as the player's username, password, email, logins, page views, etcetera. It doesn't contain any information on the player's progress in the game, however. If I added columns for things such as army size, gold, turns, and whatnot, then it could easily rise to around 40 or 50 total columns.
Oh, and my database structure IS normalized.
Will a table with 50 columns that gets constantly queried be a bad idea? Should I split it into two tables; one for the user's general information and one for the user's game statistics?
I know I could check the query time myself, but I haven't actually created the tables yet and I think I'd be better off with some professional advice on this important decision for my game.
Thank you for your time! :)
The number of columns can have a measurable cost if you're relying on table-scans or on caching pages of table data. But the best way to get good performance is to create indexes to assist your queries. If you have indexes in place that benefit your queries, then the width of a row in the table is pretty much inconsequential. You're looking up specific rows through much faster means than scanning through the table.
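As a quick illustration (hypothetical, trimmed-down members table), the index is what turns a full scan into a direct lookup, and EXPLAIN shows you which one you're getting:

```sql
-- Cut-down stand-in for the members table described in the question.
CREATE TABLE members (
    member_id INT PRIMARY KEY,
    username  VARCHAR(50) NOT NULL,
    email     VARCHAR(100),
    gold      INT,
    turns     INT
);

CREATE INDEX idx_members_username ON members (username);

-- With the index this is a direct lookup, no matter how many other
-- columns the row carries; without it, every row gets scanned.
EXPLAIN SELECT * FROM members WHERE username = 'some_player';
```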
Here are some resources for you:
EXPLAIN Demystified
More Mastering the Art of Indexing
Based on your caveat at the end of your question, you already know that you should be measuring performance and only fixing code that has problems. Don't try to make premature optimizations.
Unfortunately, there are no one-size-fits-all rules for defining indexes. The best set of indexes needs to be designed specifically for the queries you need to be fastest. It's hard work, requiring a lot of analysis, testing, and comparative performance measurements. It also requires a lot of reading to understand how your particular RDBMS uses indexes.
In my Rails App I've several models dealing with assets (attachments, pictures, logos etc.). I'm using attachment_fu and so far I have 3 different tables for storing the information in my MySQL DB.
I'm wondering if it makes a difference in the performance if I used STI and put all the information in just 1 table, using a type column and having different, inherited classes. It would be more DRY and easier to maintain, because all share many attributes and characteristics.
But what's faster? Many tables and less rows per table or just one table with many rows? Or is there no difference at all? I'll have to deal with a lot of information and many queries per second.
Thanks for your opinion!
Many tables and fewer rows is probably faster.
That's not why you should do it, though: your database ought to model your Problem Domain. One table is a poor model of many entity types. So you'll end up writing lots and lots of code to find the subset of that table that represents the entity type you're currently concerned with.
Regular, accepted, clean database and front-end client code won't work, because of your one-table-that-is-all-things-and-no-thing-at-all.
It's slower, more fragile, will multiply your code all over you app, and makes a poor model.
Do this only if all the things have exactly the same attributes and the same (or possibly Liskov substitutable) semantic meaning in your problem domain.
Otherwise, just don't even try to do this.
Or if you do, ask why this is any better than having one big map/hash table/associative array holding all entities in your app (and lots of functions, most of them duplicated, cut-and-pasted, and out of date, doing switch cases or RTTI to figure out the real type of each entity).
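For reference, the single-table (STI-style) layout being argued against here would look roughly like this: one table, a type discriminator, and every attribute of every asset kind as a nullable column (names are illustrative, not attachment_fu's actual schema):

```sql
-- One table for all asset kinds, discriminated by a type column.
CREATE TABLE assets (
    id           INT PRIMARY KEY,
    type         VARCHAR(30) NOT NULL,   -- 'Attachment', 'Picture', 'Logo', ...
    filename     VARCHAR(255),
    content_type VARCHAR(100),
    size_bytes   INT,
    -- columns used by only some subclasses end up nullable for every row:
    width        INT,
    height       INT
);

-- Every query for one kind of asset has to filter on the discriminator.
SELECT * FROM assets WHERE type = 'Picture';
```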
The only way to know for sure is to try both approaches and measure the performance.
In general terms, it depends on whether you're doing joins across those tables and, if so, how the tables are indexed. Generally speaking, database joins are expensive, which is why database schemas are sometimes denormalized to improve performance. That doesn't usually become necessary until you're dealing with a serious amount of data, though, i.e. millions of records. You probably don't have that problem yet and maybe never will.
If the rows have the same attributes then, yes, one table is much better, with a single column to specify the type of data. Otherwise, use different tables: that is better for performance, for the amount of code, and even for the readability of the code.