Meta Tables in MySQL

I'm rewriting a system that is currently linked to a MySQL database that is roughly 1GB in size. There are hundreds of thousands of articles, each with a list of contributors (think Wiki style). I've not yet been given access to the existing database schema, but while I wait I've been brainstorming a bit.
Basically, what I'm wondering is if having an article_contributors table would be an efficient way of handling this or if there is a better method to approaching this situation. Considering there are roughly 200,000 articles, if there are 5 contributors on each, that'd be 1,000,000 rows in the meta table.

I'd call that a many-to-many table, not a "meta" table. Or else a multi-valued attribute.
Storing contributors in a separate table, one per row, is the proper way of designing a relational database. There may be other ways to store the data, but they are not relational.
Consider my answer to Is storing a delimited list in a database column really that bad? Storing the contributors as a list in the articles table causes a lot of common SQL queries to break or become horribly inefficient. If you need to do a variety of queries against this data, you will thank yourself for storing it in a normalized fashion.
On the other hand, if you never query anything but the list of contributors as an indivisible unit, then why not store it denormalized (as a list)? That's a valid choice too -- but it depends on how you're going to use the table.
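For illustration, here is a minimal sketch (the table and column names are assumed, not taken from the question) of how the two designs differ when you ask "which articles did contributor 42 work on?":

-- Normalized: a simple join that can use an index on contributor_id
SELECT a.title
FROM articles a
JOIN article_contributors ac ON ac.article_id = a.article_id
WHERE ac.contributor_id = 42;

-- Denormalized list: forces a full scan and fragile string matching
SELECT title
FROM articles
WHERE FIND_IN_SET('42', contributor_list) > 0;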
By the way, 1 million rows is not a large MySQL database by some people's standards. This week I'm advising a client who has a table with 900 million rows.

An interesting question!
You're going to need to see the schema to get a straight answer about this. That's because the schema probably embodies some core decisions made by experts in bibliography (reference librarians, etc).
If you try to use a join table (articles_contributors) so you can avoid listing a given contributor multiple times when she contributes to multiple articles, you're implicitly declaring that you can create a canonical list of contributors, with a contributor_id for each distinct person.
In the world of bibliography and library science, that sort of list is called a "controlled vocabulary." It's controlled by an "authority" (read this: http://en.wikipedia.org/wiki/Authority_control). That is, some organization has the responsibility to decide whether this "Jane Smaith" is a different person from that "Jane Smith." That is surprisingly hard to do correctly with people.
For an example of a relatively simple controlled vocabulary, see the "North American Industry Classification System" (NAICS). This has a code for each distinct kind of industry. http://www.census.gov/eos/www/naics/ It's controlled by national committees in three countries. Many bibliographic databases that cover industry include those terms as one of the ways of classifying their contents.
The designers of the system you're soon to take over will have made decisions about these kinds of controlled vocabularies. Will they have one for contributors? You could wait and see, or you could ask. But one thing is sure: the bibliographic designers won't be too delighted if you, on your own authority, create that kind of controlled vocabulary.
The Library of Congress in the USA doesn't attempt to create a controlled list of authors and contributors.
Edit
If you do have a definitive list of contributors, it is a good idea to create a join table articles_contributors as you suggested. You should consider the following columns (a sketch of the table follows the list):
article_id - part of the primary key
contributor_id - part of the primary key
role - part of the primary key; values like "author", "illustrator", "editor", etc.
order - 1, 2, 3 so contributors can be listed in proper order
contact - 1 or 0 indicating whether readers should contact this author for more info
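A minimal sketch of that join table, assuming articles and contributors tables already exist with the referenced keys (note that ORDER is a reserved word in MySQL, so the ordering column is renamed here):

CREATE TABLE articles_contributors (
  article_id        INT NOT NULL,
  contributor_id    INT NOT NULL,
  role              VARCHAR(20) NOT NULL,           -- 'author', 'illustrator', 'editor', ...
  contributor_order SMALLINT NOT NULL,              -- 1, 2, 3 for display order
  contact           TINYINT(1) NOT NULL DEFAULT 0,  -- 1 = readers should contact this person
  PRIMARY KEY (article_id, contributor_id, role),
  FOREIGN KEY (article_id) REFERENCES articles (article_id),
  FOREIGN KEY (contributor_id) REFERENCES contributors (contributor_id)
) ENGINE=InnoDB;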

Related

Restructure Inventory Management Database (2 to 3 Tables; Development Stage)

I'm developing a database. I'd appreciate some help restructuring 2 to 3 tables so that the database is both compliant with the first 3 normal forms and practical to use and expand on in the future. I want to invest time now to reduce effort and confusion later.
PREAMBLE
Please be aware that I'm both a newbie and an amateur, though I have a certain amount of experience and skill and an abundance of enthusiasm!
BACKGROUND TO PROJECT
I am writing a small (though ambitious!) web application (using PHP and AJAX with a MySQL database). It is essentially an inventory management system for recording and viewing the current location of each individual piece of equipment and its maintenance history. If relevant, transaction volume will be very low (probably fewer than 100 a day, but with the possibility of simultaneous connections/operations). Row count will also be very low (maybe a few thousand).
It will deal with many completely different categories of equipment, e.g. bikes and lamps (to take random examples). Each unit of equipment will have its details or specifications recorded in the database. For a bike, an important specification might be frame colour, whereas for a lamp it might be lampshade material.
Since the categories of equipment have so little in common, I think the most logical way to store the information is 1 table per category. That way, each category can have columns specific to that category.
I intend to store a list of categories in a separate table. Each category will have an id which is unique to that category. (Depending on the final design, this may function as a lookup table and/or as a table to run queries against.) There are likely to be very few categories (perhaps 10 to 20), unless the system is particularly successful and expands.
A list of bikes will be held in the bikes table.
Each bike will have an id which is unique to that bike (eg bike 0001).
But the same id will exist in the lamp table (ie lamp 0001).
With my application, I want the user to select (from a dropdown list) the category type (eg bike).
They will then enter the object's numeric id (eg 0001).
The combination of these two ids is sufficient information to uniquely identify an object.
Images:
Current Table Design
Proposed Additional Table
PROBLEM
My gut feeling is that there should be an “overarching table” that encompasses every single article of equipment no matter what category it comes from. This would be far simpler to query against than god knows how many mini tables. But when I try to construct it, it seems like it will break various normal forms. Eg introducing redundancy, possibility of inconsistency, referential integrity problems etc. It also begins to look like a domain table.
Perhaps the overarching table should be a query or view rather than an entity?
Could you please have a look at the screenshots and let me know your opinion? Thanks.
For various reasons, I’d prefer to use surrogate keys rather than natural keys if possible. Ideally, I’d prefer to have that surrogate key in a single column.
Currently, the bike (or lamp) table uses just the first column as its primary key. Should I expand this to a composite key that includes the Equipment_Category_ID column too, and then make the Equipment_Article table into a view joining on these two columns (iteratively for each equipment category)? Optionally, the Bike_ID and Lamp_ID columns could be renamed to something generic like Equipment_Article_ID. This might make the query simpler, but is there a risk of losing specificity? It would still be qualified by the table name.
Speaking of redundancy, the Equipment_Category_ID in the current lamp or bike tables seems a bit redundant (if every item / row in that table has the same value in that column).
It all still sounds messy! But surely this must be a very common problem for online electronics stores, rental shops, etc. Hopefully someone will say, "Oh, that old chestnut!" Fingers crossed! Sorry for not being concise, but I couldn't work out which bits to leave out. Most of it seems relevant, if a bit chatty. Thanks in advance.
UPDATE 27/03/2014 (Reply to @ElliotSchmelliot)
Hi Elliot.
Thanks for your reply and for pointing me in the right direction. I studied OOP (in Java) but wasn't aware that something similar was possible in SQL. I read the link you sent with interest, and the rest of the site/book looks like a great resource.
Does MySQL InnoDB Support Specialization & Generalization?
Unfortunately, after 3 hours searching and reading, I still can't find the answer to this question. Keywords I'm searching with include: MySQL + (inheritance | EER | specialization | generalization | parent | child | class | subclass). The only positive result I found is here: http://en.wikipedia.org/wiki/Enhanced_entity%E2%80%93relationship_model. It mentions MySQL Workbench.
Possible Redundancy of Equipment_Category (Table 3)
Yes and no. Because this is a lookup table, it currently has a function. However, because every item in the Lamp or the Bike table is of the same category, the column itself may be redundant; and if it is, then the Equipment_Category table may be redundant... unless it is required elsewhere. I had intended to use it as the RowSource/OptionList for a webform dropdown. Would it not also be handy to have Equipment_Category as a column in the proposed Equipment parent table? Without it, how would one return a list of all Equipment_Names for the Lamp category (ignoring DISTINCT for the moment)?
Implementation
I have no way of knowing what new categories of equipment may need to be added in future, so I'll have to limit the attributes included in the superclass/parent to those I am 100% sure are common to all (or allow nulls, I suppose); sacrificing duplication in many child tables for increased flexibility and, hopefully, simpler maintenance in the long run. This is particularly important as we will not have professional IT support for this project.
Changes really do have to be automated. So I like the idea of the stored procedure. And the CreateBike example sounds familiar (in principle if not in syntax) to creating an instance of a class in Java.
Lots to think about and to teach myself! If you have any other comments or suggestions, they'd be most welcome. Also, could you let me know what software you used to create your UML diagram? Its styling is much better than that of the tools I've used.
Cheers!
You sound very interested in this project, which is always awesome to see!
I have a few suggestions for your database schema:
You have individual tables for each Equipment entity i.e. Bike or Lamp. Yet you also have an Equipment_Category table, purely for identifying a row in the Bike table as a Bike or a row in the Lamp table as a Lamp. This seems a bit redundant. I would assume that each row of data in the Bike table represents a Bike, so why even bother with the category table?
You mentioned that your "gut" feeling is telling you to go for an overarching table for all Equipment. Are you familiar with the practice of generalization and specialization in database design? What you are looking for here is specialization (also called "top-down".) I think it would be a great idea to have an overarching or "parent" table that represents Equipment. Then, each sub-entity such as Bike or Lamp would be a child table of Equipment. A parent table only has the fields that all child tables share.
With these suggestions in mind, here is how I might alter your schema:
In the above schema, everything starts as Equipment. However, each Equipment can be specialized into Lamp, Bike, etc. The Equipment entity has all of the common fields. Lamp and Bike each have fields specific to their own type. When creating an entity, you first create the Equipment, then you create the specialized entity. For example, say we are adding the "BMX 200 Ultra" bike. We first create a record in the Equipment table with the generic information (equipmentName, dateOfPurchase, etc.) Then we create the specialized record, in this case a Bike record with any additional bike-specific fields (wheelType, frameColor, etc.) When creating the specialized entities, we need to make sure to link them back to the parent. This is why both the Lamp and Bike entities have a foreign key for equipmentID.
An easy and effective way to add specialized entities is to create a stored procedure. For example, lets say we have a stored procedure called CreateBike that takes in parameters bikeName, dateOfPurchase, wheelType, and frameColor. The stored procedure knows we are creating a Bike, and therefore can easily create the Equipment record, insert the generic equipment data, create the bike record, insert the specialized bike data, and maintain the foreign key relationship.
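A hedged sketch of what this might look like in MySQL (all table, column, and parameter names here are illustrative, not taken from the original schema):

CREATE TABLE Equipment (
  equipmentID    INT AUTO_INCREMENT PRIMARY KEY,
  equipmentName  VARCHAR(100) NOT NULL,
  dateOfPurchase DATE
) ENGINE=InnoDB;

CREATE TABLE Bike (
  bikeID      INT AUTO_INCREMENT PRIMARY KEY,
  equipmentID INT NOT NULL,               -- link back to the parent row
  wheelType   VARCHAR(50),
  frameColor  VARCHAR(30),
  FOREIGN KEY (equipmentID) REFERENCES Equipment (equipmentID)
) ENGINE=InnoDB;

DELIMITER //
CREATE PROCEDURE CreateBike(
  IN p_name VARCHAR(100), IN p_purchased DATE,
  IN p_wheelType VARCHAR(50), IN p_frameColor VARCHAR(30))
BEGIN
  -- create the parent record first, then the specialized record
  INSERT INTO Equipment (equipmentName, dateOfPurchase)
    VALUES (p_name, p_purchased);
  INSERT INTO Bike (equipmentID, wheelType, frameColor)
    VALUES (LAST_INSERT_ID(), p_wheelType, p_frameColor);
END //
DELIMITER ;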
Using specialization will make your transactional life very simple. For example, if you want all Equipment purchased before 1/1/14, no joins are needed. If you want all Bikes with a frameColor of blue, no joins are needed. If you want all Lamps made of felt, no joins are needed. The only time you will need to join a specialized table back to the Equipment table is if you want data both from the parent entity and the specialized entity. For example, show all Lamps that use 100 Watt bulbs and are named "Super Lamp."
Hope this helps and best of luck!
Edit
Specialization and generalization, as mentioned in your provided source, are part of the Enhanced Entity Relationship (EER) model, which helps define a conceptual data model for your schema. As such, they do not need to be "supported" per se; they are more of a design technique. Therefore any database schema naturally supports specialization and generalization, as long as the designer implements it.
As far as your Equipment_Category table goes, I see where you are coming from. It would indeed make it easy to have a dropdown of all categories. However, you could simply have a static table (only contains Strings that represent each category) to help with this population, and still keep your Equipment tables separate. You mentioned there will only be around 10-20 categories, so I see no reason to have a bridge between Equipment and Equipment_Category. The fewer joins the better. Another option would be to include an "equipmentCategory" field in the Equipment table instead of having a whole table for it. Then you could simply query for all unique equipmentCategory values.
I agree that you will want to keep your Equipment table to guaranteed common values between all children. Definitely. If things get too complicated and you need more defined entities, you could always break entities up again down the road. For example maybe half of your Bike entities are RoadBikes and the other half are MountainBikes. You could always continue the specialization break down to better get at those unique fields.
Stored Procedures are great for automating common queries. On top of that, parametrization provides an extra level of defense against security threats such as SQL injections.
I use SQL Server. The diagram I created is straight out of SQL Server Management Studio (SSMS). You can simply expand a database, right click on the Database Diagrams folder, and create a new diagram with your selected tables. SSMS does the rest for you. If you don't have access to SSMS I might suggest trying out Microsoft Visio or if you have access to it, Visual Paradigm.

MySQL table with 40+ columns

I have 40+ columns in my table and I have to add a few more fields like current city, hometown, school, work, uni, college...
This user data will be pulled for many matching users who are mutual friends (joining the friend table with the other user's friends to see mutual friends), who are not blocked, and who are not already friends with the user.
The above request is a little complex, so I thought it would be a good idea to put the extra data in the same user table for fast access, rather than adding more joins that would slow the query down further. But I wanted to get your suggestions on this.
My friend told me to store the extra fields, which won't be searched on, as serialized data in one field.
ERD Diagram:
My current table: http://i.stack.imgur.com/KMwxb.png
If i join into more tables: http://i.stack.imgur.com/xhAxE.png
Some Suggestions
There's nothing wrong with this table and its columns.
Follow the approach in "MySQL: Optimize table with lots of columns", i.e. serialize the extra fields, which are not searchable, into one field.
Create another table and put most of the data there. (This makes joins harder, as I already have 3 or more tables to join to pull the records for users, e.g. friends, user, check mutual friends.)
As usual - it depends.
Firstly, there is a maximum number of columns MySQL can support (a hard limit of 4096 per table, and considerably fewer in practice for InnoDB), and you don't really want to get there.
Secondly, there is a performance impact when inserting or updating if you have lots of columns with an index (though I'm not sure if this matters on modern hardware).
Thirdly, large tables are often a dumping ground for all data that seems related to the core entity; this rapidly makes the design unclear. For instance, the design you present shows 3 different "status" type fields (status, is_admin, and fb_account_verified) - I suspect there's some business logic that should link those together (an admin must be a verified user, for instance), but your design doesn't support that.
This may or may not be a problem - it's more a conceptual, architecture/design question than a performance/will it work thing. However, in such cases, you may consider creating tables to reflect the related information about the account, even if it doesn't have a x-to-many relationship. So, you might create "user_profile", "user_credentials", "user_fb", "user_activity", all linked by user_id.
This makes it neater, and if you have to add more facebook-related fields, they won't dangle at the end of the table. It won't make your database faster or more scalable, though. The cost of the joins is likely to be negligible.
Whatever you do, option 2 - serializing "rarely used fields" into a single text field - is a terrible idea. You can't validate the data (so dates could be invalid, numbers might be text, not-nulls might be missing), and any use in a "where" clause becomes very slow.
A popular alternative is "Entity/Attribute/Value" or "Key/Value" stores. This solution has some benefits - you can store your data in a relational database even if your schema changes or is unknown at design time. However, they also have drawbacks: it's hard to validate the data at the database level (data type and nullability), it's hard to make meaningful links to other tables using foreign key relationships, and querying the data can become very complicated - imagine finding all records where the status is 1 and the facebook_id is null and the registration date is greater than yesterday.
Given that you appear to know the schema of your data, I'd say "key/value" is not a good choice.
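To make the querying pain concrete, here is a hedged sketch of that last example against a hypothetical key/value table user_attributes (user_id, attr_key, attr_value). Every attribute tested needs its own join, and because every value is a string, dates must be cast before comparison:

SELECT s.user_id
FROM user_attributes s
JOIN user_attributes r
  ON r.user_id = s.user_id AND r.attr_key = 'registration_date'
LEFT JOIN user_attributes f
  ON f.user_id = s.user_id AND f.attr_key = 'facebook_id'
WHERE s.attr_key = 'status'
  AND s.attr_value = '1'
  AND f.user_id IS NULL    -- "null facebook_id" here means the row is absent
  AND STR_TO_DATE(r.attr_value, '%Y-%m-%d') > DATE_SUB(CURDATE(), INTERVAL 1 DAY);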
I would advise you to run some tests. Try it both ways and benchmark it. Nobody will be able to give you a definitive answer because you have not shared your hardware configuration, sample data, sample queries, how you plan on using the data, etc. Here is some information that you may want to consider.
Use The Database as it was intended
A relational database is designed specifically to handle data. Use it as such. When written correctly, joining data in a well written schema will perform well. You can use EXPLAIN to optimize queries. You can log SLOW queries and improve their performance. Databases have been around for years, if putting everything into a single table improved performance, don't you think that would be all the buzz on the internet and everyone would be doing it?
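For example (a sketch; the query and table names are made up), both EXPLAIN and the slow query log are built into MySQL:

-- Show whether indexes are used and how many rows are examined
EXPLAIN SELECT u.name
FROM users u
JOIN friends f ON f.friend_id = u.id
WHERE f.user_id = 123;

-- Log any query slower than one second for later analysis
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;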
Engine Types
How will inserts be affected as the row count grows? Are you using MyISAM or InnoDB? You will most likely want to use InnoDB so you get row-level locking instead of table-level locking. Make sure you are using the correct engine type for your tables. Get the information you need to understand the pros and cons of both. The wrong engine type can kill performance.
Enhancing Performance using Partitions
Find ways to enhance performance. For example, as your datasets grow you could partition the data. Data partitioning will improve the performance of a large dataset by keeping slices of the data in separate partitions, allowing you to run queries on parts of large datasets instead of all of the information.
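As a hedged illustration (hypothetical table), MySQL can range-partition by date so queries that filter on the partitioning column only touch the relevant slices. Note that the partitioning column must be part of every unique key:

CREATE TABLE user_activity (
  id      INT NOT NULL,
  user_id INT NOT NULL,
  created DATE NOT NULL,
  PRIMARY KEY (id, created)   -- partition column must appear in the key
)
PARTITION BY RANGE (YEAR(created)) (
  PARTITION p2013 VALUES LESS THAN (2014),
  PARTITION p2014 VALUES LESS THAN (2015),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);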
Use correct column types
Consider using UUID Primary Keys for portability and future growth. If you use proper column types, it will improve performance of your data.
Do not serialize data
Using serialized data is the worst way to go. When you use serialized fields, you are basically using the database as a file management system. It will save and retrieve the "file", but then your code will be responsible for unserializing, searching, sorting, etc. I just spent a year trying to unravel a mess like that. It's not what a database was intended to be used for. Anyone advising you to do that is not only giving you bad advice, they do not know what they are doing. There are very few circumstances where you would use serialized data in a database.
Conclusion
In the end, you have to make the final decision. Just make sure you are well informed and educated on the pros and cons of how you store data. The last piece of advice I would give is to find out what heavy users of mysql are doing. Do you think they store data in a single table? Or do they build a relational model and use it the way it was designed to be used?
When you say "I am going to put everything into a single table", you are saying that you know more about performance and can make better choices for optimization in your code than the team of developers that constantly work on MySQL to make it what it is today. Consider weighing your knowledge against the cumulative knowledge of the MySQL team and the DBAs, companies, and members of the database community who use it every day.
At a certain point you should look at the "short row model", also known as entity-key-value stores, as well as the traditional "long row model".
If you look at the schema used by WordPress you will see that there is a table wp_posts with 23 columns and a related table wp_postmeta with 4 columns (meta_id, post_id, meta_key, meta_value). The meta table is a "short row model" table that allows WordPress to have an unlimited collection of attributes for a post.
Neither the "long row model" nor the "short row model" is the best model; often the best choice is a combination of the two. As @nevillek pointed out, searching and validating "short row" data is not easy, and fetching it can involve pivoting, which is annoyingly difficult in MySQL and Oracle.
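For instance, pivoting a WordPress-style meta table back into columns usually ends up as one conditional aggregate per attribute (a sketch with assumed meta keys, not an actual WordPress query):

SELECT p.ID, p.post_title,
  MAX(CASE WHEN m.meta_key = 'price'  THEN m.meta_value END) AS price,
  MAX(CASE WHEN m.meta_key = 'author' THEN m.meta_value END) AS author
FROM wp_posts p
LEFT JOIN wp_postmeta m ON m.post_id = p.ID
GROUP BY p.ID, p.post_title;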
The "long row model" is easier to validate, relate and fetch, but it can be very inflexible and inefficient when the data is sparse. Some rows may have only a few of the values non-null. Also you can't add new columns without modifying the schema, which could force a system outage, depending on your architecture.
I recently worked on a financial services system that had over 700 possible facts for each instrument; most instruments had fewer than 20 facts. This could have been built by setting up dozens of tables, each for a particular asset class, or as a table with 700 columns, but we chose a combination: a table with about 20 columns containing the most popular facts, and a 4-column table which contained the other facts. This design was efficient but difficult to access, so we built a few table functions in PL/SQL to assist with this.
I have a general comment for you.
Think about it: if you put more than 10-12 columns in a table, even when it seems to make sense, I suspect you are going to pay the price in the short, medium and long term.
Your 3-table approach seems better than the 1-table approach, but consider making those into 5-6 tables rather than 3, because you still can.
Move currently, currently_position and currently_link from the user table, and work from user_profile, into a new table with its own primary key, called USERWORKPROFILE.
Move the locale information from user_profile into a new USERPROFILELOCALE table, because it is generic in nature.
And yes, all your generic attributes in all the tables should be int and not varchar.
For instance, City needs to move out to a new table called LIST_OF_CITIES with a cityid.
And your city attribute should change from varchar to int and point to cityid in LIST_OF_CITIES.
Do not worry about performance issues; the more tables you have, the better the performance, because you are handing the work to the database engine instead of taking it all into your own hands.

Database Design For Tournament Management Software

I'm currently designing a web application using php, javascript, and MySQL. I'm considering two options for the databases.
Having a master table for all the tournaments, with basic information stored there along with a tournament id. Then I would create divisions, brackets, matches, etc. tables with the tournament id appended to each table name. Then when accessing that tournament, I would simply do something like "SELECT * FROM BRACKETS_[insert tournamentID here]".
My other option is to just have generic brackets, divisions, matches, etc. tables with each record being linked to the appropriate tournament, (or matches to brackets, brackets to divisions etc.) by a foreign key in the appropriate column.
My concern with the first approach is that it's a bit too on the fly for me, and seems like the database could get messy very quickly. My concern with the second approach is performance. This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo, so any and all help is appreciated. Thanks!
Do not create tables for each tournament. A table is a type of an entity, not an instance of an entity. Maintainability and scalability would be horrible if you mix up those concepts. You even say so yourself:
This program will hopefully have a national if not international reach, and I'm concerned with so many records in a single table, and with so many people possibly hitting it at the same time, it could cause problems.
How on Earth would you scale to that level if you need to create a whole table for each record?
Regarding the performance of your second approach, why are you concerned? Do you have specific metrics to back up those concerns? Relational databases tend to be very good at querying relational data. So keep your data relational. Don't try to be creative and undermine the design of the database technology you're using.
You've named a few types of entities:
Tournament
Division
Bracket
Match
Competitor
etc.
These sound like tables to me. Manage your indexes based on how you query the data (that is, don't over-index or you'll pay for it with inserts/updates/deletes). Normalize the data appropriately, de-normalize where audits and reporting are more prevalent, etc. If you're worried about performance then keep an eye on the query execution paths for the ways in which you access the data. Slight tweaks can make a big difference.
Don't prematurely optimize. It adds complexity without any actual reason.
First, find the entities that you will need to store; things like tournament, event, team, competitor, prize etc. Each of these entities will probably be tables.
It is standard practice to have a primary key for each of them. Sometimes there are columns (or groups of columns) that uniquely identify a row, so you can use those as the primary key. However, usually it's best just to have a column named ID or something similar of a numeric type. It will be faster and easier for the RDBMS to create and use indexes for such columns.
Store the data where it belongs: I expect to see the date and time of an event in the events table, not in the prizes table.
Another crucial point is conforming to the First normal form, since that assures data atomicity. This is important because it will save you a lot of headache later on. By doing this correctly, you will also have the correct number of tables.
Last but not least: add relevant indexes to the columns that appear most often in queries. This will help a lot with performance. Don't worry about tables having too many rows; RDBMSes these days handle tables with hundreds of millions of rows, and they're designed to do that efficiently.
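As a hedged sketch of what that looks like in practice (names assumed), one set of tables with foreign keys handles any number of tournaments:

CREATE TABLE tournaments (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE brackets (
  id            INT AUTO_INCREMENT PRIMARY KEY,
  tournament_id INT NOT NULL,
  name          VARCHAR(100) NOT NULL,
  FOREIGN KEY (tournament_id) REFERENCES tournaments (id)
) ENGINE=InnoDB;

-- All brackets for one tournament; no per-tournament tables needed
SELECT * FROM brackets WHERE tournament_id = 42;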
Besides compromising the quality and maintainability of your code (as others have pointed out), it's questionable whether you'd actually gain any performance either.
When you execute...
SELECT * FROM BRACKETS_XXX
...the DBMS needs to find the table whose name matches "BRACKETS_XXX", and that search is done in the DBMS's data dictionary, which is itself a bunch of tables. So you are replacing a search within your tables with a search within the data dictionary tables. You pay the price of the search either way.
(The dictionary tables may or may not be "real" tables, and may or may not have performance characteristics similar to real tables, but I bet those characteristics are unlikely to be better than "normal" tables for large numbers of rows. Also, the performance of the data dictionary is unlikely to be documented, and you really shouldn't rely on undocumented features.)
Also, the DBMS would suddenly need to prepare many more SQL statements (since they are now different statements, referring to separate tables), which would present the additional pressure on performance.
The idea of creating new tables whenever a new instance of an item appears is really bad, sorry.
A (surely incomplete) list of why this is a bad idea:
Your code will need to automatically add tables whenever a new Division or whatever is created. This is definitely a bad practice and should be limited to extremely niche cases - which yours definitely isn't.
In case you decide to add to or revise a table structure later (e.g. adding a new field), you will have to apply the change to hundreds of tables, which will be cumbersome, error-prone and a big maintenance headache.
A RDBMS is built to scale in terms of rows, not tables and associated (indexes, triggers, constraints) elements - so you are working against your tool and not with it.
THIS ONE SHOULD BE THE REAL CLINCHER - how do you plan to handle requests like "list all matches which were played on a Sunday" or "find the most recent three brackets where Frank Perry was active"?
You say:
I'm not a complete newb when it comes to database management; however, this is the first one I've done completely solo...
Can you remember another project where tables were cloned whenever a new set was required? If yes, didn't you notice some problems with that approach? If not, have you considered that this is precisely what a DBA would never ever do for any reason whatsoever?

What is the difference between a Relational and Non-Relational Database?

MySQL, PostgreSQL and MS SQL Server are relational database systems; MongoDB and other NoSQL stores are non-relational DBMSs.
What are the differences between the two types of system?
Hmm, not quite sure what your question is.
In the title you ask about Databases (DB), whereas in the body of your text you ask about Database Management Systems (DBMS). The two are completely different and require different answers.
A DBMS is a tool that allows you to access a DB.
Other than the data itself, a DB is the concept of how that data is structured.
So just as you can program with object-oriented methodology using a non-OO compiler, or vice versa, you can set up a relational database without an RDBMS, or use an RDBMS to store non-relational data.
I'll focus on what Relational Database (RDB) means and leave the discussion about what systems do to others.
A relational database (the concept) is a data structure that allows you to link information from different 'tables', or different types of data buckets. A data bucket must contain what is called a key or index (that allows you to uniquely identify any atomic chunk of data within the bucket). Other data buckets may refer to that key so as to create a link between their data atoms and the atom pointed to by the key.
A non-relational database just stores data without explicit and structured mechanisms to link data from different buckets to one another.
As to implementing such a scheme: if you have a paper file with an index, and in a different paper file you refer to the index to get at the relevant information, then you have implemented a relational database, albeit quite a simple one. So you do not even need a computer (though of course it can become tedious very quickly without one); similarly, you do not need an RDBMS, though arguably an RDBMS is the right tool for the job. That said, there are variations in what the different tools out there can do, so choosing the right tool for the job may not be all that straightforward.
I hope this is layman terms enough and is helpful to your understanding.
Relational databases have a mathematical basis (set theory, relational theory), which is distilled into SQL, the Structured Query Language.
NoSQL's many forms (e.g. document-based, graph-based, object-based, key-value store, etc.) may or may not be based on a single underpinning mathematical theory. As S. Lott has correctly pointed out, hierarchical data stores do indeed have a mathematical basis. The same might be said for graph databases.
I'm not aware of a universal query language for NoSQL databases.
Most of what you "know" is wrong.
First of all, as a few of the relational gurus routinely (and sometimes stridently) point out, SQL doesn't really fit nearly as closely with relational theory as many people think. Second, most of the differences in "NoSQL" stuff has relatively little to do with whether it's relational or not. Finally, it's pretty difficult to say how "NoSQL" differs from SQL because both represent a pretty wide range of possibilities.
The one major difference that you can count on is that almost anything that supports SQL supports things like triggers in the database itself -- i.e. you can design rules into the database proper that are intended to ensure that the data is always internally consistent. For example, you can set things up so your database asserts that a person must have an address. If you do so, anytime you add a person, it will basically force you to associate that person with some address. You might add a new address or you might associate them with some existing address, but one way or another, the person must have an address. Likewise, if you delete an address, it'll force you to either remove all the people currently at that address, or associate each with some other address. You can do the same for other relationships, such as saying every person must have a mother, every office must have a phone number, etc.
Note that these sorts of things are also guaranteed to happen atomically, so if somebody else looks at the database as you're adding the person, they'll either not see the person at all, or else they'll see the person with the address (or the mother, etc.)
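A small sketch of that kind of rule in SQL (hypothetical tables): the NOT NULL foreign key makes it impossible to store a person without an existing address.

CREATE TABLE addresses (
  address_id INT AUTO_INCREMENT PRIMARY KEY,
  street     VARCHAR(200) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE people (
  person_id  INT AUTO_INCREMENT PRIMARY KEY,
  name       VARCHAR(100) NOT NULL,
  address_id INT NOT NULL,    -- every person must have an address
  FOREIGN KEY (address_id) REFERENCES addresses (address_id)
) ENGINE=InnoDB;

-- Fails unless address 7 already exists in addresses
INSERT INTO people (name, address_id) VALUES ('Alice', 7);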
Most of the NoSQL databases do not attempt to provide this kind of enforcement in the database proper. It's up to you, in the code that uses the database, to enforce any relationships necessary for your data. In most cases, it's also possible to see data that's only partially correct, so even if you have a family tree where every person is supposed to be associated with parents, there can be times that whatever constraints you've imposed won't really be enforced. Some will let you do that at will. Others guarantee that it only happens temporarily, though exactly how long it can/will last can be open to question.
The relational database uses a formal system of predicates to address data. The underlying physical implementation is of no substance and can vary to optimize for certain operations, but it must always assume the relational model. In layman's terms, that's just saying: I know exactly how many values (attributes) each row (tuple) in my table (relation) has, and now I want to exploit that fact accordingly, thoroughly and to its extreme. That's the true nature of the beast.
Since we're obviously the generation that has had a relational upbringing, if you look at NoSQL database models from the perspective of the relational model, again in layman's terms, the first obvious difference is that no assumptions about the number of values a row can contain are ever made. This is really oversimplifying the matter and does not cleanly apply to the intricacies of the physical models of every NoSQL database, but it's the pinnacle of the relational model and the first assumption we have to leave behind or, if you'd rather, the biggest leap we have to make.
We can agree on two things that are true for every DBMS: it can store any kind of data, and it has enough mathematical underpinnings to make it possible to manage the data in any way imaginable. The reality is that you'll never want to make the mistake of putting either of those two points to the test, but rather just stick with what the actual DBMS was really made for. In layman's terms: respect the beast within!
(Please note that I've avoided comparing the (obviously) well founded standards revolving around the relational model against the many flavors provided by NoSQL databases. If you'd like, consider NoSQL databases as an umbrella term for any DBMS that does not completely assume the relational model, in exclusion to everything else. The differences are too many, but that's the principal difference and the one I think would be of most use to you to understand the two.)
Let me try to explain this at a level that refers to a little bit of technology.
Take MongoDB and traditional SQL for comparison, and imagine the scenario of posting a tweet on Twitter. This tweet contains 9 pictures. How do you store this tweet and its corresponding pictures?
In terms of traditional relational SQL, you can store the tweets and pictures in separate tables, and represent the connection by building a new table.
Alternatively, you could define a field with an image type, zip the 9 pictures into a binary document, and store it in that field.
Using MongoDB, you could build a document like this (similar to the concept of a table in relational SQL):
{
    "id": "XXX",
    "user": "XXX",
    "date": "xxxx-xx-xx",
    "content": {
        "text": "XXXX",
        "picture": ["p1.png", "p2.png", "p3.png"]
    }
}
Therefore, in my opinion, the main difference is in how you store the data and at what level the relationships between the data are stored.
In this example, the data is the tweet and the pictures. The different mechanisms for storing the relationship between them also play an important role in the difference between the two.
I hope this small example helps show the difference between SQL and NoSQL (ACID and BASE).
Here's a link to a picture about the goals of NoSQL from the Internet:
http://icamchuwordpress-wordpress.stor.sinaapp.com/uploads/2015/01/dbc795f6f262e9d01fa0ab9b323b2dd1_b.png
The difference between relational and non-relational is exactly that. The relational database architecture provides constraint objects such as primary keys, foreign keys, etc. that allow you to tie two or more tables together in a relation. This is good because, by normalizing our tables (that is, splitting the information the database represents into many different tables), we can keep the integrity of the data.
For example, say you have a series of tables that house information about an employee. You could not delete a record from one table without deleting all the records that pertain to it in the other tables. In this way you implement data integrity. A non-relational database doesn't provide these constraint constructs that allow you to enforce data integrity.
Unless you implement those constraints in the front-end application that populates the database's tables, you are creating a mess that can be compared with the Wild West.
First up let me start by saying why we need a database.
We need a database to help organise information in such a manner that we can retrieve the stored data efficiently.
Examples of relational database management systems (SQL):
1) Oracle Database
2) SQLite
3) PostgreSQL
4) MySQL
5) Microsoft SQL Server
6) IBM DB2
Examples of non-relational database management systems (NoSQL):
1) MongoDB
2) Cassandra
3) Redis
4) Couchbase
5) HBase
6) DocumentDB
7) Neo4j
Relational databases hold normalized data: information is stored in tables in the form of rows and columns. Normalization helps reduce data redundancy, and the data in the tables is normally related, so when we want to retrieve it we can query across tables using join statements and get exactly what we need. This is suited to workloads with more writes than reads and not much data involved; it is also relatively easy to update data in tables compared to non-relational databases. Horizontal scaling is not possible; vertical scaling is possible to some extent. CAP (Consistency, Availability, Partition Tolerance) considerations apply, and relational systems aim for ACID (Atomicity, Consistency, Isolation, Durability) compliance.
Let me show entering data to a relational database using PostgreSQL as an example.
First create a product table as follows:
CREATE TABLE products (
product_no integer,
name text,
price numeric
);
Then insert the data:
INSERT INTO products (product_no, name, price) VALUES (1, 'Cheese', 9.99);
Let's look at another different example:
Here in a relational database, we can link the student table and subject table using relationships, via a foreign key such as subject ID. In a non-relational database there are no relationships, so there is no need for two documents: we store all the subject details and student details in one document, say a student document. The data then gets duplicated, which makes updating records troublesome.
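As a small sketch (assumed names, PostgreSQL syntax to match the earlier example), the relational version keeps subject details in one place and links students to them:

CREATE TABLE subjects (
  subject_id    SERIAL PRIMARY KEY,
  subject_name  TEXT NOT NULL,
  lecturer_name TEXT
);

CREATE TABLE students (
  student_id   SERIAL PRIMARY KEY,
  student_name TEXT NOT NULL,
  subject_id   INTEGER REFERENCES subjects (subject_id)  -- the link
);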
In non-relational databases there is no fixed schema and data is not normalized; no relationships between data are created, and all the data is mostly put in one document. They are well suited to handling lots of data and can transfer lots of data at once; they work best with high amounts of reads and fewer writes and updates, though it is a bit more difficult to query data since there is no fixed schema. Horizontal and vertical scaling are both possible. CAP (Consistency, Availability, Partition Tolerance) considerations apply, and these systems aim for BASE (Basically Available, Soft state, Eventually consistent) compliance.
Let me show an example of entering data into a non-relational database using MongoDB:
db.users.insertOne({name: 'Mary', age: 28, occupation: 'writer'})
db.users.insertOne({name: 'Ben', age: 21})
Hence you can see that db is the database, users is a collection, and insertOne is the method we use to add a document. There is no fixed schema: our first record has 3 attributes and the second has only 2. This is no problem in non-relational databases, but it cannot be done in relational databases, as relational databases have a fixed schema.
Let's look at another different example
({Studname: 'Ash', Subname: 'Mathematics', LecturerName: 'Mr. Oak'})
Hence we can see that in a non-relational database we can enter both student details and subject details into one document, as no relationships are defined in non-relational databases. But this can lead to data duplication, and errors in updating can therefore occur.
Hope this explains everything
In layman's terms it's strongly structured vs. unstructured, which implies that you have different degrees of adaptability for your DB.
Differences arise particularly in indexing, as you need to ensure that a certain reference index can link to another item; this is a relation. The stricter structure of a relational DB comes from this requirement.
Note that NosDB apparently provides both relational and non-relational DBs and a way to query both: http://www.alachisoft.com/nosdb/sql-cheat-sheet.html

First-time database design: am I overengineering? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
Background
I'm a first year CS student and I work part time for my dad's small business. I don't have any experience in real world application development. I have written scripts in Python, some coursework in C, but nothing like this.
My dad has a small training business and currently all classes are scheduled, recorded and followed up via an external web application. There is an export/"reports" feature but it is very generic and we need specific reports. We don't have access to the actual database to run the queries. I've been asked to set up a custom reporting system.
My idea is to create the generic CSV exports and import them (probably with Python) into a MySQL database hosted in the office every night, from where I can run the specific queries that are needed. I don't have experience with databases but understand the very basics. I've read a little about database creation and normal forms.
We may start having international clients soon, so I want the database to not explode if/when that happens. We also currently have a couple of big corporations as clients, with different divisions (e.g. ACME parent company, ACME healthcare division, ACME bodycare division).
The schema I have come up with is the following:
From the client perspective:
Clients is the main table
Clients are linked to the department they work for
Departments can be scattered around a country: HR in London, Marketing in Swansea, etc.
Departments are linked to the division of a company
Divisions are linked to the parent company
From the classes perspective:
Sessions is the main table
A teacher is linked to each session
A statusid is given to each session. E.g. 0 - Completed, 1 - Cancelled
Sessions are grouped into "packs" of an arbitrary size
Each pack is assigned to a client
I "designed" (more like scribbled) the schema on a piece of paper, trying to keep it normalised to the 3rd form. I then plugged it into MySQL Workbench and it made it all pretty for me: (Click here for full-sized graphic)
(source: maian.org)
Example queries I'll be running
Which clients with credit still left are inactive (those without a class scheduled in the future)
What is the attendance rate per client/department/division (measured by the status id in each session)
How many classes has a teacher had in a month
Flag clients who have low attendance rate
Custom reports for HR departments with attendance rates of people in their division
Question(s)
Is this overengineered or am I headed the right way?
Will the need to join multiple tables for most queries result in a big performance hit?
I have added a 'lastsession' column to clients, as it is probably going to be a common query. Is this a good idea or should I keep the database strictly normalised?
Thanks for your time
Some more answers to your questions:
1) You're pretty much on target for someone who is approaching a problem like this for the first time. I think the pointers from others on this question thus far pretty much cover it. Good job!
2 & 3) The performance hit you will take will largely depend on having and optimizing the right indexes for your particular queries/procedures and, more importantly, on the volume of records. Unless you are talking about well over a million records in your main tables, you seem to be on track to having a sufficiently mainstream design that performance will not be an issue on reasonable hardware.
That said, and this relates to your question 3, with the start you have you probably shouldn't really be overly worried about performance or hyper-sensitivity to normalization orthodoxy here. This is a reporting server you are building, not a transaction based application backend, which would have a much different profile with respect to the importance of performance or normalization. A database backing a live signup and scheduling application has to be mindful of queries that take seconds to return data. Not only does a report server function have more tolerance for complex and lengthy queries, but the strategies to improve performance are much different.
For example, in a transaction based application environment your performance improvement options might include refactoring your stored procedures and table structures to the nth degree, or developing a caching strategy for small amounts of commonly requested data. In a reporting environment you can certainly do this but you can have an even greater impact on performance by introducing a snapshot mechanism where a scheduled process runs and stores pre-configured reports and your users access the snapshot data with no stress on your db tier on a per request basis.
All of this is a long-winded rant to illustrate that what design principles and tricks you employ may differ given the role of the db you're creating. I hope that's helpful.
You've got the right idea. You can however clean it up, and remove some of the mapping (has*) tables.
What you can do is in the Departments table, add CityId and DivisionId.
Besides that, I think everything is fine...
The only changes I would make are:
1- Change your VARCHAR to NVARCHAR; if you might be going international, you may want Unicode.
2- Change your int IDs to GUIDs (uniqueidentifier) if possible (this might just be my personal preference). Assuming you eventually get to the point where you have multiple environments (dev/test/staging/prod), you may want to migrate data from one to the other. Having GUID IDs makes this significantly easier.
3- Three layers for your Company -> Division -> Department structure may not be enough. Now, this might be over-engineering, but you could generalize that hierarchy such that you can support n-levels of depth. This will make some of your queries more complex, so that may not be worth the trade-off. Further, it could be that any client that has more layers may be easily "stuffable" into this model.
4- You also have a Status in the Client Table that is a VARCHAR and has no link to the Statuses table. I'd expect a little more clarity there as to what the Client Status represents.
No. It looks like you're designing at a good level of detail.
I think that Countries and Companies are really the same entity in your design, as are Cities and Divisions. I'd get rid of the Countries and Cities tables (and Cities_Has_Departments) and, if necessary, add a boolean flag IsPublicSector to the Companies table (or a CompanyType column if there are more choices than simply Private Sector / Public Sector).
Also, I think there's an error in your usage of the Departments table. It looks like the Departments table serves as a reference to the various kinds of departments that each customer division can have. If so, it should be called DepartmentTypes. But your clients (who are, I assume, attendees) do not belong to a department TYPE, they belong to an actual department instance in a company. As it stands now, you will know that a given client belongs to an HR department somewhere, but not which one!
In other words, Clients should be linked to the table that you call Divisions_Has_Departments (but that I would call simply Departments). If this is so, then you must collapse Cities into Divisions as discussed above if you want to use standard referential integrity in the database.
By the way, it's worth noting that if you're generating CSVs already and want to load them into a MySQL database, LOAD DATA LOCAL INFILE is your best friend: http://dev.mysql.com/doc/refman/5.1/en/load-data.html . Mysqlimport is also worth looking into; it is a command-line tool that's basically a nice wrapper around LOAD DATA INFILE.
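A hedged example of what the nightly import might look like (the file path, table and column names are made up):

LOAD DATA LOCAL INFILE '/exports/sessions.csv'
INTO TABLE sessions
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(session_id, client_id, teacher_id, session_date, status_id);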
Most things have already been said, but I feel that I can add one thing: it is quite common for younger developers to worry about performance a little bit too much up-front, and your question about joining tables seems to go into that direction. This is a software development anti-pattern called 'Premature Optimization'. Try to banish that reflex from your mind :)
One more thing: Do you believe you really need the 'cities' and 'countries' tables? Wouldn't having a 'city' and 'country' column in the departments table suffice for your use cases? E.g. does your application need to list departments by city and cities by country?
Following comments based on role as a Business Intelligence/Reporting specialist and strategy/planning manager:
I agree with Larry's direction above. IMHO, it's not so much over-engineered; some things just look a little out of place. To keep it simple, I would tag the client directly with a Company ID, Department Description, Division Description, Department Type ID and Division Type ID. Use Department Type ID and Division Type ID as references to lookup tables and internal reporting/analysis fields for long-term consistency.
The Packs table contains a "Credit" column; shouldn't that actually be tied to the Client base table, so that if a client has many packs you can see how much credit is left for future classes? The application can take care of the calculation and store it centrally in the Client table.
Company info could use many more fields, including the obvious address/phone/etc. information. I'd also be prepared to add D&B DUNS columns (Site/Branch/Ultimate) long term; Dun & Bradstreet (D&B) has a huge catalog of companies, and you'll find later down the road that their information is very helpful for reporting/analysis. This will take care of the multiple-division issue you mention, and allow you to roll up the hierarchy for subsidiaries/divisions/branches/etc. of large corporations.
You don't mention how many records you'll be working with, which could mean setting yourself up for a large development initiative that could have been done quicker, and with far fewer headaches, using prepackaged reporting software. If you're not dealing with a large database (< 65,000 rows), make sure MS Access, OpenOffice (Base) or related report/app-dev solutions couldn't do the trick. I use Oracle's free APEX software quite a bit myself; it comes with their free database, Oracle XE, which you can just download from their site.
FYI, a reporting insight: for large databases, you typically have two database instances: a) a transaction database for recording each detailed record, and b) a reporting database (data mart / data warehouse) housed on a separate machine. For more information, search Google for both "star schema" and "snowflake schema".
Regards.
I want to address only the concern that joining multiple tables will cause a performance hit. Do not be afraid to normalize because you will have to do joins. Joins are normal and expected in relational databases, and they are designed to handle them well. You will need to set PK/FK relationships (for data integrity; this is important to consider in designing), but in many databases FKs are not automatically indexed. Since they will be used in the joins, you will definitely want to start by indexing the FKs. PKs generally get an index on creation, as they have to be unique. It is true that data warehouse design reduces the number of joins, but usually one doesn't get to the point of data warehousing until one has millions of records that need to be accessed in one report. Even then, almost all data warehouses start with a transactional database to collect the data in real time; the data is then moved to the warehouse on a schedule (nightly, monthly or whatever the business need is). So this is a good start even if you need to design a data warehouse later to improve report performance.
I must say your design is impressive for a first year CS student.
It isn't over-engineered; this is how I would approach the problem. Joining is fine; there won't be much of a performance hit (it's completely necessary unless you de-normalise the database, which isn't recommended!). For statuses, see if you can use an ENUM datatype instead to optimise that table away.
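For example (a sketch; the table, column name and values here are assumed from the question's "0 - Completed, 1 - Cancelled" statuses), an ENUM column replaces the lookup table entirely:

ALTER TABLE sessions
  MODIFY status ENUM('completed', 'cancelled') NOT NULL;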
I've worked in the training / school domain and I thought I'd point out that there's generally a M:1 relationship between what you call "sessions" (instances of a given course) and the course itself. In other words, your catalog offers the course ("Spanish 101" or whatever), but you might have two different instances of it during a single semester (Tu-Th taught by Smith, Wed-Fri taught by Jones).
Other than that, it looks like a good start. I bet you'll find that the client domain (graphs leading to "clients") is more complex than you've modeled, but don't go overboard with that until you've got some real data to guide you.
A few things came to mind:
The tables seemed geared to reporting, but not really running the business. I would think when a client signs up, there's essentially an order being placed for the client attending a list of sessions, and that order might be for multiple employees in one company. It would seem an "order" table would really be at the center of your system and driving your data capture and eventual reporting. (Compare the paper documents you've been using to run the business with your database design to see if there's a logical match.)
Companies often don't have divisions. Employees sometimes change divisions/departments, maybe even mid-session. Companies sometimes add/delete/rename divisions/departments. Make sure the possible real-time changes to your tables' contents don't make subsequent reporting/grouping difficult. With so much contact data split over so many tables, you might have to enforce very strict data-entry validation to keep your reports meaningful and inclusive. E.g., when a new client is added, make sure his company/division/department/city match the same values as his coworkers'.
The "packs" concept isn't clear at all.
Since you indicate it's a small business, it would be surprising if performance would be an issue, considering the speed and capacity of current machines.