I can't discuss things in great detail due to an NDA, but I'm hoping an overview of the system being built will help you advise me on a decision concerning our databases.
I'm building an app that will help vendors compete to gain clientele by making strategic offers based on records of inventory/purchase from the storefronts.
One side of the app is for the store owners to see presented offers, network, etc. I've got that going with a standard PHP/MySQL setup.
My question concerns the records of inventory. We are talking millions of records here nearly immediately. The sample data I'm using is a roll-up of four of their managers (they have dozens) over the course of a year or two, and it has over 500k rows with about 30 or more columns. When we get scores of stores with all of their managers it will be massive, at least compared to anything I've worked with so far.
The vendors will have a side of the product in which they can search through these records and make competitive offers based on them.
Is the sheer size a good reason to use something like Mongo? Or is it more a matter of how the data is laid out / what it consists of? Or some other element that I'm not considering?
And, if not Mongo/NoSQL, is there some other methodology or technology that such large data stores would benefit from (sharding, an Amazon cloud database, etc.)?
Thanks
Answers ...
Q: Is the sheer size a good reason to use something like mongo?
A: I think so. Mongo was built from the ground up to scale in a massive way. You have replica sets and sharding that can help you scale. It also has features to make sure your data gets stored in the appropriate geographically distributed data centers.
Q: Or is it more a matter of how the data is laid out / what it consists of?
A: Mongo is a document database and you're right, the data models will be different. You have to think of data in a denormalized way instead of normalized. Just like any technology, there are pros and cons to storing things as documents.
Some pros: Schema management is a breeze. Data more naturally fits the objects in your application. You don't have to pay the price of complicated/slow joins.
Some cons: Schemas can be inconsistent, so you have to manage them yourself. Data is repeated, which, if not managed, means it can become inconsistent.
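To make the "repeated data can become inconsistent" point concrete, here is a minimal sketch in plain Python (no MongoDB involved). The vendor/order documents and field names are hypothetical, made up purely for illustration. It shows an embedded copy of a value going stale when the application updates only the original.

```python
# Hypothetical data: each order document embeds a copy of the vendor's name
# instead of referencing a single vendor record (denormalized layout).
vendors = {"v1": {"name": "Acme Supply"}}
orders = [
    {"order_id": 1, "vendor_id": "v1", "vendor_name": "Acme Supply", "total": 120},
    {"order_id": 2, "vendor_id": "v1", "vendor_name": "Acme Supply", "total": 80},
]


def rename_vendor(vendor_id, new_name, propagate):
    """Rename a vendor; optionally rewrite every embedded copy as well."""
    vendors[vendor_id]["name"] = new_name
    if propagate:
        for order in orders:
            if order["vendor_id"] == vendor_id:
                order["vendor_name"] = new_name


def stale_orders():
    """Return ids of orders whose embedded vendor name no longer matches."""
    return [
        order["order_id"]
        for order in orders
        if order["vendor_name"] != vendors[order["vendor_id"]]["name"]
    ]


# Updating only the vendor record leaves the embedded copies out of date.
rename_vendor("v1", "Acme Supply Co.", propagate=False)
stale_after_partial = stale_orders()

# The application, not the database, has to repair the copies.
rename_vendor("v1", "Acme Supply Co.", propagate=True)
stale_after_full = stale_orders()
```

In a normalized relational layout the name would live in one row and be joined in at read time, which is the trade-off being described: no joins on read, but more bookkeeping on write.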
In general I think Mongo would be a good choice to deal with that scale. Mongo has a new aggregation framework that brings a lot of SQL concepts to queries on documents, which makes complex queries easier to write. Mongo also has map/reduce for running other kinds of queries you might have.
After using Mongo daily for about a year, I've really enjoyed the support around it as a product and the general ease of setting it up and working with it.
Related
Despite reading an awful lot of varying opinions and advice online and in SO I still cannot really decide the best solution to solve my current requirements.
In essence I need to make a system where objects can be arbitrarily defined with any number of properties. The application tracks the whereabouts and state of these objects, and I cannot possibly know at compile time what the full gamut of these objects will be (besides, this will be sold to many companies to track whatever they like).
The data itself WILL have some forms of relation to each other. The biggest of which will be the notion of a location hierarchy; think Country->Province->Town->Postcode->Building->Room->Locker->Object
There will be some other Parent->Child relations in the data too. For example an instance of a car has an instance of an engine, has an instance of a piston.
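For what it's worth, a hierarchy like this can be stored relationally as an adjacency list (each row points at its parent) and walked with a recursive query. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 module as a stand-in for whatever SQL database is chosen, and the `node` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical adjacency-list table: every row knows its parent.
    CREATE TABLE node (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),
        kind      TEXT NOT NULL,
        name      TEXT NOT NULL
    );
    INSERT INTO node VALUES
        (1, NULL, 'Country',  'Canada'),
        (2, 1,    'Province', 'Ontario'),
        (3, 2,    'Town',     'Ottawa'),
        (4, 3,    'Building', 'Depot 7'),
        (5, 4,    'Object',   'Engine 42');
""")


def path_to(node_id):
    """Walk from a node up to the root and return the names, root first."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, parent_id, name, depth) AS (
            SELECT id, parent_id, name, 0 FROM node WHERE id = ?
            UNION ALL
            SELECT n.id, n.parent_id, n.name, c.depth + 1
            FROM node n JOIN chain c ON n.id = c.parent_id
        )
        SELECT name FROM chain ORDER BY depth DESC
        """,
        (node_id,),
    ).fetchall()
    return [row[0] for row in rows]


location_path = path_to(5)
```

The same table shape covers the car/engine/piston style of parent-child relation; only the `kind` values differ.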
The history of objects and data will be important. What the state of the object has been at various times and places will be a heavily used feature of this system. Being able to retrieve the full history for reporting will be important too.
The options as I see them:
EAV - Entity Attribute Value (or hybrid of) in SQL
Pros:
It's relational and normalised
Querying is powerful
The relational and hierarchical parts of the data fit this paradigm
History achieved by storing dates against properties
Cons:
Query complexity
90% of the time every attribute for an object will be required, giving a serious number of joins
Most likely there will be pivots everywhere
Others?
Relational with XML catch all columns:
Pros:
Relational goodness, ORMs etc, etc (all of above)
How to store the history?
Cons:
The vast majority of the attributes will be in this XML column (say >70%)
Slow queries?
Others?
Document DB
Pros:
Open Schema
History is as simple as retrieving older docs
Cons:
I've got a fair amount of relational data!
Query support (I'm not well enough versed to say what the pros and cons are of each Document DB tech)
Others?
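To illustrate the EAV option and its "pivots everywhere" drawback, here is a minimal sketch using Python's built-in sqlite3 module. The table names, the `valid_from` column used for history, and the sample car attributes are all assumptions made for the example, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical EAV layout: one row per (entity, attribute, point in time).
    CREATE TABLE entity (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);
    CREATE TABLE attribute_value (
        entity_id  INTEGER NOT NULL REFERENCES entity(id),
        attribute  TEXT NOT NULL,
        value      TEXT,
        valid_from TEXT NOT NULL   -- ISO date, so string comparison sorts correctly
    );
    INSERT INTO entity VALUES (1, 'car');
    INSERT INTO attribute_value VALUES
        (1, 'colour',  'red',   '2020-01-01'),
        (1, 'colour',  'blue',  '2021-06-01'),
        (1, 'mileage', '42000', '2021-06-01');
""")


def state_as_of(entity_id, as_of):
    """Pivot the attribute rows into one record as it stood on a given date."""
    row = conn.execute(
        """
        SELECT MAX(CASE WHEN av.attribute = 'colour'  THEN av.value END),
               MAX(CASE WHEN av.attribute = 'mileage' THEN av.value END)
        FROM attribute_value av
        WHERE av.entity_id = ?
          AND av.valid_from = (
              SELECT MAX(valid_from) FROM attribute_value
              WHERE entity_id = av.entity_id
                AND attribute = av.attribute
                AND valid_from <= ?)
        """,
        (entity_id, as_of),
    ).fetchone()
    return {"colour": row[0], "mileage": row[1]}


before_repaint = state_as_of(1, "2020-12-31")
latest = state_as_of(1, "2022-01-01")
```

Note that every attribute needs its own `CASE` branch (or its own join), and every value comes back as text. That is the query complexity listed under the cons, and it grows with the number of attributes.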
As you can tell, most of my experience has come in the form of relational DBs (SQL). Add to that, I have already prototyped a similar solution with an EAV/relational hybrid in SQL and found it an utter pain when things got even remotely complex.
I'm tech agnostic; I have front-to-back-end experience in lots of techs and am not averse to learning anything new.
What are the thoughts on my situation; the long and short of it is each of the above is a valid way to solve the problem but I'm keen to hear what other people think and have experienced so I can try to avoid any costly blind sides.
I am developing a discussion forum for my university. To store and manipulate the data I'm using CouchDB as the database.
I'm having difficulty designing the structure of my DB so as to maximize its performance.
I want to discuss what good practice is when designing a document database.
Should we create only one database, as in SQL, and store any number of documents in it?
Or should we create several databases in order to flatten the DB structure? This would also reduce the number of documents that need to be created.
The questions you need to ask are simply this: "How do you want to get data out of your database?"
Database design hinges around the queries to be made, not what is available to be stored.
This is especially important for document DBs like Couch since, while it does have a flexible schema, it does not have flexible indexing. By that I mean that, because of the granularity of the data, it's quite likely that later on, when you need to ask a question it was not designed to answer, answering that question may well be very expensive. It's much, much cheaper to design your views and other constructs early, when there is little data in the database, rather than later, after you have thousands or millions of rows.
RDBMS's, since they tend to have a finer granularity of data, tend to be more nimble to new queries and such later in life. Document DBs, not so much.
So think through your use cases up front, design around those, and design them early on; it's much less painful now than later.
It's hard to tell the right way to approach modeling your data since you don't give much information. Generally though you want to keep as much data as possible in one database as this allows you to index it together (indexes cannot span more than one database).
Also, since there is no schema enforcement in the database, you can create different types of records in each database. For example, there is nothing wrong with having both user information and forum entries in the same database.
Last, you will most likely want to keep messages and their replies in different records. This is an old but still relevant discussion on this topic: http://www.cmlenz.net/archives/2007/10/couchdb-joins
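The idea of keeping threads and replies as separate records in one database, and designing the view around the query "show a thread with its replies", can be sketched as follows. This is a plain-Python emulation of a sorted map-style view, not CouchDB's actual API (real CouchDB views are written as JavaScript map functions); the document fields and the `type` discriminator are assumptions for illustration.

```python
# Hypothetical documents of several types living in one database.
docs = [
    {"_id": "t1", "type": "thread", "title": "Exam schedule"},
    {"_id": "r2", "type": "reply", "thread": "t1", "created": "2024-01-03", "body": "Thanks"},
    {"_id": "r1", "type": "reply", "thread": "t1", "created": "2024-01-02", "body": "When?"},
    {"_id": "t2", "type": "thread", "title": "Library hours"},
    {"_id": "u1", "type": "user", "name": "alice"},
]


def map_doc(doc):
    """Emit (key, doc) pairs so a thread sorts just before its own replies."""
    if doc["type"] == "thread":
        yield (doc["_id"], 0, ""), doc
    elif doc["type"] == "reply":
        yield (doc["thread"], 1, doc["created"]), doc
    # Other document types (e.g. users) are simply not part of this view.


def build_view(documents):
    """Build the index once: all emitted rows, sorted by key."""
    rows = [row for doc in documents for row in map_doc(doc)]
    return sorted(rows, key=lambda row: row[0])


def thread_with_replies(view, thread_id):
    """Answer the query the view was designed for with a single key-prefix scan."""
    return [doc for key, doc in view if key[0] == thread_id]


view = build_view(docs)
page_ids = [doc["_id"] for doc in thread_with_replies(view, "t1")]
```

The key layout is chosen for one specific question. A different question (say, "all replies by a given user") would need a different view, which is why the access patterns are worth settling early.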
Cheers.
Ok guys.
I've begun developing a little sparetime project that might become big someday. Before I really get started, I want to be certain that I'm starting with the right setup. So I come to you.
I'm making a service, which will work mostly as a todolist/project planner.
In this system there will be an amount of users and an amount of tasks. Each task can be assigned to multiple users, and each user can have multiple tasks (many to many relation).
Until now I was planning to use MySQL, but a friend of mine, who is part of the project, suggested MongoDB instead. He tells me that it would increase performance and be more scalable.
On the other hand, I'm thinking that in order to get either all tasks assigned to a specific user, or all users assigned to a specific task, one would need to use joins, which MongoDB doesn't have (or has only in a cumbersome way, as far as I have understood).
Now my question to you is: "Which DB system would you suggest: MySQL, MongoDB, or a third option? And why?"
Thank you for your time and your assistance.
Morten
We use MySQL at IGN to store person relationships (many-to-many, like your use case), and have about 5M records in the relationship table. We have 4 MySQL servers in a cluster, and reads are distributed across 3 MySQL slaves. BTW, depending on how read- or write-heavy your system is, you can always denormalize to optimize reads at the cost of slower writes, among other things.
We use the DAO pattern with Spring, so it's fairly easy for us to swap DB providers through configuration (and by writing a Mongo/MySQL DAO implementation as applicable). We moved activities (like in social media) to Mongo almost a year ago, but the person relationships are living happily in MySQL.
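As a rough illustration of the DAO idea (not the Spring/Java setup described above), here is a minimal Python sketch, assuming Python 3.9+. The interface name, its two methods, the `follows` table, and the `make_dao` factory are all hypothetical, and sqlite3 stands in for MySQL so the example is self-contained.

```python
import sqlite3
from typing import Protocol


class RelationshipDao(Protocol):
    """The only surface the rest of the application is allowed to depend on."""

    def follow(self, follower: str, followee: str) -> None: ...

    def followers_of(self, followee: str) -> list[str]: ...


class InMemoryRelationshipDao:
    """Stand-in for a non-relational provider; keeps pairs in a set."""

    def __init__(self) -> None:
        self._pairs = set()  # set of (follower, followee) tuples

    def follow(self, follower: str, followee: str) -> None:
        self._pairs.add((follower, followee))

    def followers_of(self, followee: str) -> list[str]:
        return sorted(f for f, target in self._pairs if target == followee)


class SqliteRelationshipDao:
    """Relational provider; sqlite3 is used here in place of MySQL."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute(
            "CREATE TABLE follows (follower TEXT, followee TEXT, "
            "PRIMARY KEY (follower, followee))"
        )

    def follow(self, follower: str, followee: str) -> None:
        self._conn.execute(
            "INSERT OR IGNORE INTO follows VALUES (?, ?)", (follower, followee)
        )

    def followers_of(self, followee: str) -> list[str]:
        rows = self._conn.execute(
            "SELECT follower FROM follows WHERE followee = ? ORDER BY follower",
            (followee,),
        )
        return [row[0] for row in rows]


def make_dao(provider: str) -> RelationshipDao:
    """Pick the implementation from a configuration value."""
    providers = {"memory": InMemoryRelationshipDao, "sqlite": SqliteRelationshipDao}
    return providers[provider]()
```

Calling code only ever sees `RelationshipDao`, so switching stores becomes a configuration change plus one new class rather than a rewrite.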
The comment to your post by Jonas says it all,
If need be, you can always scale later.
This.
I am very much of the mindset that if you don't have scaling problems, you shouldn't worry too much (if at all) about scaling problems. Why not use what is easiest, smartest and cleanest to deliver the features clients pay for (in my case at least!)? This approach saves a lot of time and energy and is the proper one for 9 projects out of 10.
Learning a technology because it scales is great. Being tied, in an upcoming project, to a technology you have not yet learned just because it scales is not as great. There are many factors other than scalability to consider when using third-party software.
MySQL would seem to be a good choice, being more mature and having loads of client libraries, ORMs and other time-saving technologies. MySQL can handle millions (billions if you have the RAM) of rows. I have yet to encounter a project it could not handle, and I have seen some pretty impressive datasets!
Of course, when you need performance, you may find yourself ripping out ORM and SQL-generating code and replacing it with your own hand-tweaked queries, but that day is way down the line, and chances are it will never come.
MongoDB, although it is really cool, may well, I am sorry to say, bring you issues that have nothing to do with scaling.
My 2 cents, happy coding!
MySQL
Either would likely work for your purposes, but your database seems relatively rigid in its structure, something which SQL deals well with. As such, I would recommend MySQL. A many-to-many relationship is relatively easy to implement and access, as well.
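For reference, the many-to-many relationship between users and tasks is usually modelled with a junction table. Here is a minimal sketch using Python's built-in sqlite3 module (the same SQL shape applies to MySQL); the table names, column names, and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

    -- Junction table: one row per (user, task) assignment.
    CREATE TABLE task_assignments (
        user_id INTEGER NOT NULL REFERENCES users(id),
        task_id INTEGER NOT NULL REFERENCES tasks(id),
        PRIMARY KEY (user_id, task_id)
    );
    -- The primary key covers lookups by user; this index covers lookups by task.
    CREATE INDEX idx_assignments_task ON task_assignments(task_id);

    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO tasks VALUES (10, 'Write spec'), (11, 'Build prototype');
    INSERT INTO task_assignments VALUES (1, 10), (1, 11), (2, 11);
""")


def tasks_for_user(user_id):
    """All task titles assigned to one user."""
    rows = conn.execute(
        """
        SELECT t.title FROM tasks t
        JOIN task_assignments a ON a.task_id = t.id
        WHERE a.user_id = ? ORDER BY t.title
        """,
        (user_id,),
    )
    return [row[0] for row in rows]


def users_for_task(task_id):
    """All user names assigned to one task."""
    rows = conn.execute(
        """
        SELECT u.name FROM users u
        JOIN task_assignments a ON a.user_id = u.id
        WHERE a.task_id = ? ORDER BY u.name
        """,
        (task_id,),
    )
    return [row[0] for row in rows]


alice_tasks = tasks_for_user(1)
prototype_users = users_for_task(11)
```

Both directions of the question are a single indexed join each.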
You may take a tiny bit of a performance hit, but in my experience this is generally not very noticeable with smaller-scale applications (i.e. databases with fewer than millions of entries). I do agree with @Jonas Elfström's comment, however: you should have an abstraction layer between your application and the database, so that, should scaling become an issue, you can address it without too many problems.
Stick with a relational database. It can handle many-to-many relationships and is fully featured for backup and recovery and high availability, and, importantly, you will find that every developer you need is familiar with it. There are plenty of documented methods for scaling a relational database.
Pick an open source database, either MySQL or Postgres, depending on which your team is most familiar with and how it integrates into the rest of your infrastructure stack.
Make sure you design your data model correctly, most importantly the relationships between the entities.
Good luck!
We are currently planning the database structure of a quite complex e-commerce web app that has flexibility as its main cornerstone.
Our app features a large amount of data (products), and we have run into a slight headache trying to keep performance high without compromising normalization rules in the database, or leaving our highly beloved flexibility concept behind when integrating product options (also widely known as product attributes or parameters).
Based on various references and sources available, we have drawn up lists of pros and cons of all the major, well-known database patterns for solving this. After comparing these, we have come up with two final alternatives:
EAV (Entity-attribute-value model) :
Pros: Database is used for all sorting.
Cons: All related queries will include a number of joins between multiple tables in order to complete the collection of data.
SLOB (Serialized LOB, also known as Facade?) :
Pros: Very flexible. Keeps the number of necessary joins low compared to an EAV design pattern. Easy to update/add/remove data for each product, but hard to keep data integrity without additional tables.
Cons: All sorting will be done by the application instead of the database. Will use a lot of resources (memory?) when big datasets are processed by a large number of users.
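To make the SLOB trade-off concrete, here is a minimal sketch using Python's built-in sqlite3 and json modules. The `product` table, the `options` column holding serialized JSON, and the sample products are assumptions for illustration. It shows that adding an option needs no schema change, while sorting by an option means loading and decoding rows in the application.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, options TEXT NOT NULL)"
)

# Hypothetical products; each one carries its own free-form set of options.
products = [
    (1, "Kettle", {"colour": "red", "watts": 2200}),
    (2, "Toaster", {"colour": "black", "watts": 900, "slots": 2}),
    (3, "Blender", {"colour": "white", "watts": 600}),
]
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(pid, name, json.dumps(opts)) for pid, name, opts in products],
)


def products_sorted_by(option):
    """Sort by an option stored inside the blob: the work happens in Python."""
    rows = conn.execute("SELECT name, options FROM product").fetchall()
    decoded = [(name, json.loads(blob)) for name, blob in rows]
    # Products that lack the option are skipped rather than guessed at.
    having_option = [(name, opts) for name, opts in decoded if option in opts]
    return [name for name, _ in sorted(having_option, key=lambda item: item[1][option])]


by_watts = products_sorted_by("watts")
by_slots = products_sorted_by("slots")
```

Every call reads the whole table, which is the memory and CPU cost listed under the cons once the dataset and the number of concurrent users grow.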
Our main questions:
Which pattern/structure would you use, or maybe even a different solution?
Are there better databases than MySQL available nowadays to accomplish what we want?
Thanks a lot!
Reference: How to design a product table for many kinds of product where each product has many parameters
Why limit yourself to one model? It's very possible that you'll be better off with two different models where each one meets a specific goal very well.
Assuming, as is often the case, that the two don't have to be absolutely and instantaneously in sync, you might easily end up with much better overall performance. What kind of hard requirements would you have on synchronization? Milliseconds up to a minute?
Udi Dahan has some good information on command query responsibility separation (CQRS) that's relevant. See also a couple of other articles. InfoQ also has very relevant video of Greg Young from QCon08.
EDIT: Here's another video (by Udi Dahan) that discusses, among other things, the benefits of multiple models.
MySQL performs very well even for very large datasets. I use it at a financial services SaaS company and it has always worked well. I have also used SQL Server and Oracle for very large applications, and MySQL performs no better or worse on the whole. My focus is more the business layer, though, and you may get more detailed opinions from people closer to the DB.
When selecting a pattern, keep in mind that it's much more straightforward to scale the application tier than the data tier (easy and cheap to add application servers). Performing many joins for common operations can cause a real performance bottleneck.
I would suggest you prototype both approaches so that you can both get more familiar with each of them, and benchmark their performance in your specific environment.
Additionally, you may want to look into alternatives to SQL that attempt to achieve a pattern similar to the ones you outline. A friend at a very large, well-known Internet company is starting to use Project Voldemort. He prefers it over similar efforts mostly due to the very active community.
From your solution, it seems you don't want to use a relational model, so perhaps it's better not to use a relational database. Take a look at these alternatives: http://nosql-database.org/ BTW, SQL Server has nice SLOB features in the form of XML fields (which can be indexed and queried through XQuery).
I have a huge database (kind of like WordNet) and want to know whether it would be easier to use Cassandra instead of MySQL or PostgreSQL.
All my life I have used MySQL and PostgreSQL, and I can easily think in terms of relational algebra, but several weeks ago I learned about Cassandra and that it's used at Facebook and Twitter.
Is it more convenient?
What DBMSs are usually used nowadays to store social network data, relationships between objects, or WordNet-like data?
There is no such thing as a silver-bullet solution; everything is built to solve a specific problem and has its own pros and cons. It is up to you to decide what your problem statement is and which solution best fits it. Whether you use Cassandra (NoSQL) or MySQL (RDBMS) is driven by your system's requirements. Below are some inputs that will help you make a better decision when choosing a database.
Why to Use NoSQL
With an RDBMS, the choice is fairly easy because almost all the databases in this category (MySQL, Oracle, MS SQL, PostgreSQL) offer much the same kind of solution, oriented around the ACID properties. With NoSQL, the decision becomes harder because every NoSQL database offers a different solution, and you have to understand which one best suits your app/system requirements. For example, MongoDB fits use cases where your system demands a schema-less document store. HBase might fit search engines, log-data analysis, or any place where scanning huge, two-dimensional, join-less tables is a requirement. Redis is built to provide in-memory search over a variety of data structures such as trees, queues, linked lists, etc., and can be a good fit for real-time leaderboards and pub-sub systems. Similarly, there are other databases in this category (including Cassandra) which fit different problems. Now let's move to the original questions and answer them one by one.
When to use Cassandra
Being part of the NoSQL family, Cassandra offers a solution for problems where you need a very write-heavy system and want a responsive reporting system on top of the stored data. Consider the use case of web analytics, where log data is stored for each request and you want to build an analytics platform around it to count hits by hour, by browser, by IP, etc. in real time. You can refer to this blog post (http://blogs.shephertz.com/2015/04/22/why-cassandra-excellent-choice-for-realtime-analytics-workload/) to understand more about the use cases where Cassandra fits.
When to Use an RDBMS instead of Cassandra/NoSQL
Cassandra is a NoSQL database and does not provide ACID guarantees or relational features. If you have a strong requirement for ACID properties (for example, financial data), Cassandra would not be a fit. You can of course make it work, but you will end up writing lots of application code to handle ACID properties and will lose badly on time to market. Managing that kind of system with Cassandra would also be complex and tedious.
There are many different flavours of "NoSQL" databases. If your application is really like Wordnet perhaps you should look at a graph database such as Neo4j.
I would suggest analysing your requirements.
If you are going to run on many clusters or machines, take NoSQL.
If your data model is complicated and requires efficient structures, take NoSQL (no limits on column types).
If you fit on a few machines without scaling out, you don't need super performance under many concurrent requests (as, for example, in a social network, where lots of users send HTTP requests), and you don't expect scalability to be a concern, take an RDBMS (Postgres has some good functions and structures you can use, like the array column type).
Cassandra should work better for large-scale, multi-purpose data.
Neo4j would be better for special structures such as graphs.
Cassandra and other NoSQL stores are being used for social based sites because of their need for massive write based operations. Not that MySQL and Postgres can't achieve this but NoSQL requires far less time and money, generally speaking.
Sounds like you may want to look at Neo4j though, just in terms of your object model needs.
They are all different products, and they all have their pros and cons. What kind of problem do you have to solve?
Huge, as in TBs?