First of all, I'm not a very experienced developer; I build mid-size apps in PHP, MySQL and JavaScript.
There is one thing, though, that makes it hard for me to design a MySQL InnoDB database before each project: performance. I'm always worried that if I create a normalized database schema and then have to join several tables (say 5-6, usually with a few many-to-many and many-to-one relationships between them), performance will suffer badly once each of those tables has around 100k rows.
The projects I usually work on are analytics platforms, so I'm expecting around 100M clicks in total, and I usually have to join that table to many others (each around 100k rows) to get data displayed. I usually build summary tables for the clicks, but I cannot do the same for the other tables.
I'm not sure whether I should worry about future performance at this stage. Currently I am actively managing a few of these applications with 30M+ clicks and 40k+-row tables joined to the Clicks table. Performance is pretty bad: a SELECT usually takes 10-20s or more to complete, even though I believe I have proper indexing and a reasonably sized innodb_buffer_pool_size.
I've read a lot about how the key to an optimized database is its design, which is why I usually think about the DB schema a lot before creating it.
Do I really have to worry about creating DB schemas where I'll have to join 5-6 many-to-many/many-to-one/one-to-many tables, or is that quite usual and something MySQL should be able to handle easily?
Is there anything else that I should consider before creating a DB schema?
My usual server setup is a MySQL server with 4GB RAM + 2 vCPUs for the DB and a web server with 4GB RAM + 2 vCPUs. Both run Ubuntu 16.04, the latest MySQL (5.7.21) and PHP7-fpm.
Gordon is right. RDBMSs are made to handle your kind of workload.
If you're using virtual machines (cloud, etc.) to host your stuff, you can generally increase your RAM, vCPU count, and IO capacity simply by spending more money. But, usually, throwing money at DBMS performance problems is less helpful than throwing better indexes at them.
At the scale of 100M rows, query performance is a legitimate concern. You will, as your project develops, need to revisit your DBMS indexing to optimize the queries you're actually using. So plan on that. The thing is, you cannot and will not know until you get lots of data what your actual performance issues will be.
Read this for a preview of what's coming: https://use-the-index-luke.com/.
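To make that concrete, here is a minimal sketch of what such a tuning pass can look like; the clicks/campaigns tables and their columns are hypothetical stand-ins, not the asker's actual schema:

-- Hypothetical schema: clicks(id, campaign_id, clicked_at, ...) joined to campaigns(id, name).
-- First, ask MySQL how it plans to run a slow query; watch the `key` and `rows` columns.
EXPLAIN
SELECT cp.name, COUNT(*) AS click_count
FROM clicks c
JOIN campaigns cp ON cp.id = c.campaign_id
WHERE c.clicked_at >= '2018-01-01'
GROUP BY cp.name;

-- If EXPLAIN shows a full scan on `clicks`, a composite index over the join and filter
-- columns gives the optimizer something to work with; which column order wins depends
-- on your actual data distribution.
ALTER TABLE clicks ADD INDEX idx_campaign_clicked (campaign_id, clicked_at);

Running EXPLAIN against production-sized data is the only reliable way to confirm that the optimizer actually picks the new index.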
One piece of advice: partitioning of tables generally doesn't solve performance problems except under very specific circumstances.
Look up this acronym: YAGNI.
And go do your project. Spend your present effort getting it working.
We are in the planning stages for a new multi-tenant SaaS app and have hit a deciding point. When designing a multi-tenant application, is it better to go for one monolithic database that holds all customer data (using a 'customer_id' column), or is it better to have an independent database per customer? Regardless of the database decisions, all tenants will run off of the same codebase.
It seems to me that having separate databases makes backups / restorations MUCH easier, but at the cost of increased complexity in development and upgrades (much easier to upgrade 1 database vs 500). It also makes it easier / possible to split individual customers off to separate dedicated servers if the situation warrants the move. At the same time, aggregating data becomes much more difficult when trying to get a broad overview of how customers are using the software.
We expect to have less than 250 customers for at least a year after launch, but they will be large customers and more will follow afterward.
As this is our first leap into SaaS, we are definitely looking to do this right from the start.
This is a bit long for a comment.
In most cases, you want one database with a separate customer id column in the appropriate tables. This makes it much easier to maintain the application. For instance, it is much easier to replace a stored procedure in one database than in 250 databases.
In terms of scalability, there is probably no issue. If you really wanted to, you could partition your tables by client.
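If you ever did go that route, the mechanics look roughly like this. The `orders` table and its columns are invented for illustration, and note that MySQL requires the partitioning column to appear in every unique key, which is why `customer_id` is folded into the primary key:

-- Hypothetical table, partitioned so that each client's rows stay together.
CREATE TABLE orders (
    id          BIGINT      NOT NULL AUTO_INCREMENT,
    customer_id INT         NOT NULL,
    created_at  DATETIME    NOT NULL,
    total       DECIMAL(10,2),
    PRIMARY KEY (id, customer_id)
)
PARTITION BY HASH (customer_id)
PARTITIONS 16;   -- rows for a given client always land in the same partition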
There are some reasons why you would want a separate database per client:
Access control: maintaining access control at the database level is much easier than at the row level (see the sketch after this list).
Customization: customizing the software for a client is much easier if you can just work in a single environment.
Performance bottlenecks: if the data is really large and/or there are really large numbers of transactions on the system, it might be simpler (and cheaper) to distribute databases on different servers rather than maintain a humongous database.
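As a small illustration of that first point, here is the database-level version versus the row-level version; the database, user and table names are placeholders:

-- One database per client: access control is a single grant per tenant database.
GRANT SELECT, INSERT, UPDATE, DELETE ON client_acme.* TO 'acme_app'@'%';

-- One shared database: every tenant-facing query has to carry the filter itself,
-- enforced in application code or through views, e.g.:
SELECT * FROM invoices WHERE customer_id = 42;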
However, I think the default should be one database because of maintainability and consistency.
By the way, as for backup and restore: if a client requires this functionality, you will probably want to write custom scripts anyway. Although you could use the database-level backup and restore, you might have some particular needs, such as maintaining consistency with data not stored in the database.
I am developing a website using Django 1.7 on Python 3.4, with MySQL as the database engine. For the next 15-20 days I am planning to test it. The site is something like LinkedIn in terms of functionality and complexity, and I am expecting around 20-30 thousand users in the next 6 months.
I have learnt about MySQL only during the development of this website. I am using django-debug-toolbar and have tried to reduce query time and the number of joins. I have a few questions:
What tools can be used to create HTTP requests that automatically fill the database and also query various pages?
Is django-debug-toolbar enough for profiling and optimization, considering an increasing number of requests from multiple users?
Should I work on reducing the number of database hits or the size of the querysets Django would be caching, and how will RAM usage affect the performance of the website?
Considering that I have no prior experience with database administration or running a website, how should I determine whether the website is performing up to the mark? Please share best practices, as I am quite unfamiliar with this.
The single biggest factor in SQL performance is the number of rows in the tables you use. You should figure out how to load 50 thousand fake users and 1 million fake nodes into your test database.
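One way to do that without any external tooling is to cross-join a small digits table against itself; the users table columns below are placeholders for whatever your real schema needs:

-- A throwaway helper table with the digits 0-9.
CREATE TABLE digits (d INT PRIMARY KEY);
INSERT INTO digits VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

-- Cross-joining it five times yields 100,000 distinct numbers; keep the first 50,000
-- and turn each one into a fake user row (column names here are placeholders).
INSERT INTO users (username, joined_at)
SELECT CONCAT('user_', seq.n),
       NOW() - INTERVAL seq.n MINUTE
FROM (
    SELECT d1.d + d2.d*10 + d3.d*100 + d4.d*1000 + d5.d*10000 AS n
    FROM digits d1, digits d2, digits d3, digits d4, digits d5
) AS seq
WHERE seq.n < 50000;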
Then guess which of your pageviews will be most common. Find a free load testing tool on the net (there are quite a few) and use it to hit that page hard on your server.
Figure out which queries are slow. Add appropriate indexes or, if you must, redesign your database, and get those queries to be fast.
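MySQL's slow query log is the simplest way to spot them; the file path and the one-second threshold below are just placeholders to tune:

-- Log every statement slower than one second, then summarise the log with the
-- bundled mysqldumpslow tool or run EXPLAIN on the worst offenders.
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
SET GLOBAL long_query_time = 1;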
Then guess at your second most popular pageview and repeat. Keep going until you run out of time.
Keep in mind that this is guesswork. As your service ramps up in users, you need to keep an eye on the pageviews your real users prefer, and keep an eye on those slow queries.
This will, if you add users at the rate you plan, take a sizeable fraction of your time during your first year in operation.
Read a web site called http://use-the-index-luke.com/
So I'm working on a task of calculating medians for every 100 records in a giant MySQL table, which looks like a straightforward problem but ends up requiring very complex SQL. One of my friends who saw my work asked me: why don't you load the data into memory and process it with C or Python, wouldn't that be easier? My intuition is that it is a bad idea, but can someone elaborate on why it is not recommended? Thank you!
I can think of no good reason to tell you that it's a bad idea to use a front-end to process data stored in a MySQL db... to me it's something like "don't use knives to cut your food because you can cut your own finger".
You can, of course, write some stored procedures or functions that might give you the results you need, but if you can't make it work with MySQL, then the obvious step is to use another tool.
You must, however, take some precautions:
Don't overload your network connection (trivial if you are working on localhost).
Don't try to hold overly large result sets in memory: keep things simple and small (divide and conquer).
Let the database server do the heavy work, and use your front-end to do the fine work: if you need to filter data, let MySQL do that for you, and write the code to do the calculations on the filtered data (see the sketch after this list).
Be sure to take the appropriate precautions when sending queries to MySQL (avoid SQL injection vulnerabilities)
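As a sketch of that heavy-work/fine-work split, you can let MySQL isolate and sort a single 100-row bucket and hand the client only its two middle rows. The clicks(id, value) table is hypothetical, and the final, possibly shorter bucket would need its own handling:

-- Hypothetical table clicks(id INT PRIMARY KEY, value DOUBLE).
-- Bucket 0 is the first 100 rows by id; the client loops, raising OFFSET by 100 each time.
SELECT AVG(value) AS bucket_median
FROM (
    SELECT value
    FROM (
        SELECT value
        FROM clicks
        ORDER BY id
        LIMIT 100 OFFSET 0
    ) AS bucket
    ORDER BY value
    LIMIT 2 OFFSET 49      -- the two middle positions of a full 100-row bucket
) AS middle;

Only two small rows cross the connection per bucket, and for deep buckets a keyed range (WHERE id BETWEEN ...) scales better than a growing OFFSET.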
In general, yes you should do your heavy lifting in the database. If your dataset is fairly small, it wouldn't matter whether you do the calculations on the database server or on the database client.
The primary consideration in whether to do the calculation on the db server or on the db client is usually performance. If you do the heavy calculations on the db client, you may end up having to transfer a lot of data over the db connection. With large datasets, transferring the entire table to the client can become a performance issue, and if your database server lives on a different machine than your application server (i.e. not localhost), the network transfer overhead becomes even worse.
If you have to transfer the entire dataset anyway, then there likely won't be any significant performance difference. The SQL language itself isn't inherently faster than client languages for number crunching; it simply has the advantage of running in the server process and thus avoiding the overhead of data transfer.
There are also applications that use multiple data sources; for these, you will often end up with no choice but to do part of your calculations on the client side.
Ultimately, you have to measure. It doesn't matter whether it's best practice or not: if doing the calculation in the client is fast enough and it simplifies the overall code, then take that route.
I'm writing an application that doesn't necessarily need to scale, as it won't be collecting large amounts of data at the beginning. (However, if I'm lucky, it could down the road.)
I will be running my web server and database on the same box (for now).
That being said, I am looking for performance and efficiency.
The main part of my application will be loading blog articles. Using an RDBMS (MySQL), I will make 6 queries (2 of them joins) just to load a single blog article page:
select blog
select blog_album
select blog_tags
select blog_notes
select blog_comments (join with users)
select blog_author_participants (join with users)
However, with MongoDB I can de-normalize and flatten the 6 tables into just 2 collections and reduce my queries to potentially just 1 query:
users
blogs
->blog_album
->blog_tags
->blog_notes
->blog_comments
->blog_author_participants
Now, going with the MongoDB schema, there will be some data redundancy. However, hard drive space is cheaper than CPU/servers.
1.) Would this be a good scenario to use MongoDB?
2.) Do you only benefit in performance using MongoDB when scaling beyond a single server?
3.) Are there any durability risks in using MongoDB? I hear there is potential for data loss when performing inserts, as inserts are written to memory first, then to the database.
4.) Should this stop me from using MongoDB in production?
You would use MongoDB when you have a use case that matches its strengths.
Do you need a schema-less document store? Nope, you have a stable schema.
Do you need automatic sharding? Nope, you don't have extraordinary data needs or budget for horizontally scaling hardware.
Do you need map/reduce data processing? Not for something like a blog.
So why are you even considering it?
However, with MongoDB I can de-normalize and flatten the 6 tables into just 2 collections and reduce my queries to potentially just 1 query
But you can easily query MySQL for 6 tables' worth of information related to a single blog post with a single properly crafted SQL statement.
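For example, something along these lines pulls the post, its album, its tags and its notes in one round trip; the table and column names are guesses at the asker's schema, and the many-row comment list is often cleaner as a second query:

-- Assumed schema: blog(id, title, body), blog_album(blog_id, title),
-- blog_tags(blog_id, name), blog_notes(blog_id, note); join keys are guesses.
SELECT b.id, b.title, b.body,
       a.title                                       AS album_title,
       GROUP_CONCAT(DISTINCT t.name)                 AS tags,
       GROUP_CONCAT(DISTINCT n.note SEPARATOR ' | ') AS notes
FROM blog b
LEFT JOIN blog_album a ON a.blog_id = b.id
LEFT JOIN blog_tags  t ON t.blog_id = b.id
LEFT JOIN blog_notes n ON n.blog_id = b.id
WHERE b.id = 123
GROUP BY b.id, b.title, b.body, a.title;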
however hard drive space is cheaper than CPU/servers.
If performance and scaling are a priority, then you are going to be concerned with having enough RAM to fit everything into main memory and enough CPU cores to run queries. An enterprise-grade RAID 10 array is a requirement, don't get me wrong, but as soon as your database software (MongoDB or MySQL) needs to scan an index that can't fit into main memory, you'll be in for a world of pain, assuming a large, active database. :)
I like MongoDB, but its big strength in my mind is map/reduce and its document orientation. You require neither of those features. MySQL is time-tested in large-scale deployments and supports partitioning (though I would argue that your database would have to be on the order of 50-100 GB before you'd realize a substantial gain from partitioning versus a single server, plus a passive backup, with tons of RAM, 64 GB+). I would also argue that if performance is truly a concern, then MySQL would be preferable, as you would have supreme control over your indexes.
That's not to say that MongoDB isn't high performance, but its place probably isn't serving blogs. Your concern with inserts is valid as well. MongoDB is not an ACID system. Google transactions in both systems and compare.
Here is a good explanation: http://mod.erni.st/nosql-if-only-it-was-that-easy/
The last paragraph summarizes it:
What am I going to build my next app on? Probably Postgres. Will I use NoSQL? Maybe. I might also use Hadoop and Hive. I might keep everything in flat files. Maybe I’ll start hacking on Maglev. I’ll use whatever is best for the job. If I need reporting, I won’t be using any NoSQL. If I need caching, I’ll probably use Tokyo Tyrant. If I need ACIDity, I won’t use NoSQL. If I need a ton of counters, I’ll use Redis. If I need transactions, I’ll use Postgres. If I have a ton of a single type of documents, I’ll probably use Mongo. If I need to write 1 billion objects a day, I’d probably use Voldemort. If I need full text search, I’d probably use Solr. If I need full text search of volatile data, I’d probably use Sphinx.
NoSQL vs. RDBMS: Apples and Oranges?
I would advise you to read up a little on what NoSQL is and what it does before you decide whether you can use it. You can't take a normal database and turn it into a NoSQL thing just like that. The way you work with the data is completely different.
NoSQL definitely has its uses. But it's definitely not the answer for everything. The main advantage of NoSQL is the easily changeable data model.
Advantages of using MongoDB (as per Moshe Kaplan's DZone article):
Schema-less design
Scalability in managing terabytes of data
Rapid replica-set failover for high availability
Sharding enables linear, scale-out growth without running out of budget
Supports a high write load
Uses data locality for query processing
MongoDB meets the Consistency and Partition-tolerance requirements of the CAP theorem (Consistency, Availability, Partition tolerance)
Related SE questions:
What are the advantages of using a schema-free database like MongoDB compared to a relational database?
When to Redis? When to MongoDB?
I can't speak to the performance considerations, but for me, the first consideration in whether to use a SQL DB vs MongoDB is the structure of the data you want to store.
MongoDB is "schema-less" in the sense that you don't need to know what "tables" and "columns" you want beforehand. It is very flexible. So, if you don't know what information you want to store in your "blogs" Collection for example, or if different blog posts may store different information, then MongoDB allows this flexibility. Whereas with SQL relational databases, you have to know your schema upfront.
But it sounds like you already know what information you want to store, in which case I might just stick with a SQL relational database. I don't think performance is the first consideration in your case - you're not building a real-time application where one or two milliseconds matter all that much.