I have a design question for MySQL. As a side project, I am attempting to create a cloud-based safety management system. In the most basic terms, a company will subscribe to the service, which will manage company document records as blobs, corrective actions, employee information, and audit results.
My initial design concept was to have a separate DB for each company.
However, my question is: if user access control is secure, would it be OK to have all the companies in one DB? What are the pitfalls of this? Are there any performance issues to consider? For identifying records, would it be a compound key of the company ID and a reference number unique within each company? If so, when generating a reference number for one of a company's records, would it slow down as the record set increases?
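To make the compound-key idea concrete, here is roughly what I have in mind (a hypothetical schema; SQLite in-memory is used only as a quick stand-in for MySQL, where the syntax is nearly identical):

```python
import sqlite3

# Hypothetical schema: records identified by (company_id, ref_id),
# where ref_id is unique per company rather than globally.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE company_record (
        company_id INTEGER NOT NULL,
        ref_id     INTEGER NOT NULL,   -- unique within each company
        payload    TEXT,
        PRIMARY KEY (company_id, ref_id)
    )
""")

def next_ref_id(company_id: int) -> int:
    # MAX(ref_id) is answered from the (company_id, ref_id) index, so it
    # should stay fast as the record set grows. In MySQL this SELECT and
    # the INSERT would go in one transaction to avoid races under load.
    row = conn.execute(
        "SELECT COALESCE(MAX(ref_id), 0) + 1 FROM company_record WHERE company_id = ?",
        (company_id,),
    ).fetchone()
    return row[0]

for company in (1, 1, 2):
    conn.execute(
        "INSERT INTO company_record (company_id, ref_id, payload) VALUES (?, ?, ?)",
        (company, next_ref_id(company), "doc"),
    )

print(conn.execute(
    "SELECT company_id, ref_id FROM company_record ORDER BY company_id, ref_id"
).fetchall())   # [(1, 1), (1, 2), (2, 1)] -- company 2 starts again at 1
```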
In terms of constraints, I would expect up to 2,000 companies and initially a maximum of 1,000 records per company, growing at 5% per year. I expect a maximum of 2 GB of blob storage per company, growing at 10% per year. The system is to run on one cloud server, whether with multiple databases or one big one.
Any thoughts on this would be appreciated.
If there is not much inter-company interaction or need for frequent cross-company statistics, and you don't plan to make application updates every week or so that would impact the DB structure, I'd go with a separate DB (and DB user) for each company. It's more scalable, less prone to user-access bugs, and makes some operations, such as removing a company, easier.
On the other hand, 2 million entries is not such a big deal, and if you plan to develop the application further, keeping it in one DB could be the better approach.
You have two questions: performance and security.
If you use the same MySQL user, security will not differ between the two options.
If you need performance, you can get the same results running one or multiple databases (see, for instance, MySQL partitioning).
But there are other things you should consider, such as how easy it will be to run your website against one database, or how easy it would be to manage one database per customer.
In fact, here is my answer: considering the size of your data, don't base your choice on performance, which is roughly equal between the options for your needs, but on whatever will make your life easier.
I am running a finance-related web portal which involves millions of debit/credit transactions in MySQL. Getting the balance of a specific user up to a certain point becomes slow once the table reaches around 3 million rows.
Now I am thinking of creating a separate MySQL container for each user, recording only that user's transactions in each container, and I am sure it will then be fast to calculate any user's balance.
I have around 20 thousand users and I want to know: is it practical to create a separate MySQL container for each user, or should I take another approach? Thanks
I would not recommend a separate MySQL instance per user.
I operated MySQL in docker containers at a past job. Even on very powerful servers, we could run only about 30 MySQL instances per server before running out of resources. Perhaps a few more if each instance is idle most of the time. Regardless, you'll need hundreds or thousands of servers to do what you're describing, and you'll need to keep adding servers as you get more users.
Have you considered how you will make reports if each user's data is in a different MySQL instance? It will be fine to make a report about any individual user, but you probably also need reports about aggregate financial activity across all the users. You cannot run a single query that spans MySQL instances, so you will have to do one query per instance and write custom code to combine the results.
You'll also have more work when you need to do backups, upgrades, schema changes, error monitoring, etc. Every one of these operations tasks will be multiplied by the number of instances.
You didn't describe how your data is organized or any of the specific queries you run, but there are techniques to optimize queries that don't require splitting the data into multiple MySQL instances. Techniques like indexing, caching, partitioning, or upgrading to a more powerful server. You should look into learning about those optimization techniques before you split up your data, because you'll just end up with thousands of little instances that are all poorly optimized.
I have around 20 thousand users and I want to know: is it practical to create a separate MySQL container for each user
No, definitely not. While Docker containers are relatively lightweight, 20k of them is a lot and will require a lot of extra resources (memory, disk, CPU).
getting the balance of a specific user at a certain limit i.e. with 3 million rows becomes slow.
There are several things you can try to do.
First of all, try to optimize the database/queries (this can be combined with vertical scaling, i.e. using a more powerful server for the database)
Enable replication (if not already enabled) and use secondary instances for read queries
Use partitioning and/or sharding
I know this is sacrilegious, but for a table like that I like to use two tables. (The naughty part is the redundancy.)
History -- details of all the transactions.
Current -- the current balance
You seem to have just History, but frequently need to compute the Current for a single user. If you maintain this as you go, it will run much faster.
Further, I would do the following:
Provide Stored Procedure(s) for all actions. The typical action would be to add one row to History and update one row in Current.
Never UPDATE or DELETE rows in History. If a correction is needed, add another row with, say, a negative amount. (This, I think, is "proper" accounting practice anyway.)
Once you have made this design change, your question becomes moot. History won't need to have frequent big scans.
Use InnoDB (not MyISAM).
Another thing that may be useful -- change the main indexes on History from
PRIMARY KEY(id),
INDEX(user_id)
to
PRIMARY KEY(user_id, id), -- clusters a user's rows together
INDEX(id) -- this keeps AUTO_INCREMENT happy
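A minimal sketch of the History + Current pattern described above (hypothetical table and column names; SQLite in-memory stands in for MySQL/InnoDB here, and a Python function stands in for the stored procedure):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE history (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL,
        amount  NUMERIC NOT NULL          -- corrections are extra negative rows
    );
    CREATE INDEX history_user ON history (user_id);
    CREATE TABLE current (
        user_id INTEGER PRIMARY KEY,
        balance NUMERIC NOT NULL
    );
""")

def post_transaction(user_id: int, amount: float) -> None:
    # The "stored procedure": one History insert plus one Current upsert,
    # in a single transaction. History rows are never UPDATEd or DELETEd.
    with conn:
        conn.execute("INSERT INTO history (user_id, amount) VALUES (?, ?)",
                     (user_id, amount))
        conn.execute("""
            INSERT INTO current (user_id, balance) VALUES (?, ?)
            ON CONFLICT (user_id) DO UPDATE SET balance = balance + excluded.balance
        """, (user_id, amount))

post_transaction(7, 100.0)
post_transaction(7, -30.0)   # a correction, posted as a negative amount

# Balance lookup is now a single-row read instead of a scan over History.
print(conn.execute("SELECT balance FROM current WHERE user_id = 7").fetchone()[0])
```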
It is best to explain my question in terms of a concrete example.
Consider an order management application that restaurants use to receive orders from their customers. I have a table called orders which stores all of them.
Now every day the tables keep growing in size but the amount of data accessed is constant. Generally the restaurants are only interested in orders received in the last day or so. After 100 days, for example, 'interesting' data is only about 1/100 of the table size; after 1 year it's 1/365 and so on.
Of course, I want to keep all the old orders, but performance for applications that are only interested in current orders keeps reducing. So what is the best way to not have old data interfere with the data that is 'interesting'?
From my limited database knowledge, one solution that occurred to me was to have two identical tables, order_present and order_past, within the same database. New orders would come into order_present, and a cron job would transfer all processed orders older than two days to order_past, keeping the size of order_present constant.
Is this considered an acceptable solution to deal with this problem? What other solutions exist?
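A rough sketch of what I have in mind (hypothetical names; SQLite in-memory is used just to illustrate the idea, the real job would run against MySQL from cron):

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE order_present (id INTEGER PRIMARY KEY, placed_at TEXT, processed INTEGER);
    CREATE TABLE order_past    (id INTEGER PRIMARY KEY, placed_at TEXT, processed INTEGER);
""")

now = datetime(2024, 1, 10)
rows = [(1, (now - timedelta(days=5)).isoformat(), 1),   # old, processed -> archived
        (2, (now - timedelta(days=1)).isoformat(), 1),   # recent -> stays
        (3, (now - timedelta(days=5)).isoformat(), 0)]   # old but unprocessed -> stays
conn.executemany("INSERT INTO order_present VALUES (?, ?, ?)", rows)

def archive_old_orders(cutoff: datetime) -> None:
    # The cron job body: move processed orders older than the cutoff.
    # INSERT and DELETE are wrapped in one transaction so no row is lost.
    with conn:
        conn.execute("""
            INSERT INTO order_past
            SELECT * FROM order_present WHERE processed = 1 AND placed_at < ?
        """, (cutoff.isoformat(),))
        conn.execute("DELETE FROM order_present WHERE processed = 1 AND placed_at < ?",
                     (cutoff.isoformat(),))

archive_old_orders(now - timedelta(days=2))
```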
Database servers are pretty good at handling volume, but performance can be limited by the physical hardware. If IO latency is what is bothering you, there are several solutions available. You really need to evaluate what fits your use case best.
For example:
you can Partition the table to distribute it onto multiple physical disks.
you can use Sharding to put data onto different physical servers.
you can evaluate using another Storage Engine which best fits your data and application. MyISAM delivers better read performance than InnoDB, at the cost of not being ACID-compliant.
you can use Read Replicas to delegate all (or most) SELECT queries to replicas (slaves) of the main database server (master).
Finally, MySQL Performance Blog is a great resource on this topic.
The case:
I have been developing a web application in which I store data from different automated data sources. Currently I am using MySQL as the DBMS and PHP as the programming language on a shared LAMP server.
I use several tables to identify the data sources and two tables for the data updates. Data sources are in a three level hierarchy, and updates are timestamped.
One table contains the two upper levels of the hierarchy (geographic location and instrument), plus the time-stamp and an “update ID”. The other table contains the update ID, the third level of the hierarchy (meter) and the value.
Most queries involve a join between these two tables.
Currently the first table contains nearly 2.5 million records (290 MB) and the second table has over 15 million records (1.1 GB); each hour roughly 500 records are added to the first table and 3,000 to the second, and I expect these numbers to increase. I don't think these numbers are too big, but I've been experiencing some performance drawbacks.
Most queries involve looking for immediate past activity (per site, per group of sites, and per instrument) which are no problem, but some involve summaries of daily, weekly and monthly activity (per site and per instrument). The page takes several seconds to load, sometimes surpassing the server's timeout (30s).
It also seems that the automatic updates are suffering from these timeouts, causing the connection to fail.
The question:
Is there any rational way to split these tables so that queries perform more quickly?
Or should I attempt other types of optimizations not involving splitting tables?
(I think the tables are properly indexed, and I know that a possible answer is to move to a dedicated server, probably running something else than MySQL, but just yet I cannot make this move and any optimization will help this scenario.)
If the queries that are slow are the historical summary queries, then you might want to consider a Data Warehouse. As long as your history data is relatively static, there isn't usually much risk to pre-calculating transactional summary data.
Data warehousing and designing schemas for Business Intelligence (BI) reporting is a very broad topic. You should read up on it and ask any specific BI design questions you may have.
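As a minimal illustration of pre-calculating transactional summary data (hypothetical schema; SQLite in-memory stands in for MySQL): the raw rows are rolled up once, and the daily/weekly/monthly reports then read the small summary table instead of scanning millions of rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE updates (instrument_id INTEGER, day TEXT, value REAL);
    CREATE TABLE daily_summary (
        instrument_id INTEGER, day TEXT,
        n INTEGER, total REAL,
        PRIMARY KEY (instrument_id, day)
    );
""")
conn.executemany("INSERT INTO updates VALUES (?, ?, ?)", [
    (1, "2024-01-01", 10.0), (1, "2024-01-01", 20.0), (1, "2024-01-02", 5.0),
])

# Nightly batch: rebuild (or incrementally update) the per-day summaries.
conn.execute("""
    INSERT OR REPLACE INTO daily_summary
    SELECT instrument_id, day, COUNT(*), SUM(value)
    FROM updates GROUP BY instrument_id, day
""")

# The monthly report now aggregates a handful of summary rows.
print(conn.execute(
    "SELECT SUM(n), SUM(total) FROM daily_summary WHERE day LIKE '2024-01-%'"
).fetchall())   # [(3, 35.0)]
```

Since the history data is static, the summaries never go stale between batch runs.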
I need to implement a custom-developed web analytics service for large number of websites. The key entities here are:
Website
Visitor
Each unique visitor will have a single row in the database with information like landing page, time of day, OS, browser, referrer, IP, etc.
I will need to do aggregated queries on this database such as 'COUNT all visitors who have Windows as OS and came from Bing.com'
I have hundreds of websites to track and the number of visitors for those websites range from a few hundred a day to few million a day. In total, I expect this database to grow by about a million rows per day.
My questions are:
1) Is MySQL a good database for this purpose?
2) What could be a good architecture? I am thinking of creating a new table for each website. Or perhaps starting with a single table and then spawning a new table (daily) if the number of rows in the existing table exceeds 1 million (is my assumption correct?). My only worry is that if a table grows too big, SQL queries can get dramatically slower. So, what is the maximum number of rows I should store per table? Moreover, is there a limit on the number of tables MySQL can handle?
3) Is it advisable to do aggregate queries over millions of rows? I'm ready to wait for a couple of seconds to get results for such queries. Is it a good practice or is there any other way to do aggregate queries?
In a nutshell, I am trying to design a large-scale data-warehouse kind of setup which will be write-heavy. If you know about any published case studies or reports, that would be great!
If you're talking about larger volumes of data, then look at MySQL partitioning. For these tables, partitioning by date/time would certainly help performance.
Look at creating two separate databases: one for all raw data for the writes with minimal indexing; a second for reporting using the aggregated values; with either a batch process to update the reporting database from the raw data database, or use replication to do that for you.
EDIT
If you want to be really clever with your aggregation reports, create a set of aggregation tables ("today", "week to date", "month to date", "by year"). Aggregate from raw data to "today" either daily or in "real time"; aggregate from "by day" to "week to date" on a nightly basis; from "week to date" to "month to date" on a weekly basis, etc. When executing queries, join (UNION) the appropriate tables for the date ranges you're interested in.
EDIT #2
Rather than one table per client, we work with one database schema per client. Depending on the size of the client, we might have several schemas in a single database instance, or a dedicated database instance per client. We use separate schemas for raw data collection, and for aggregation/reporting for each client. We run multiple database servers, restricting each server to a single database instance. For resilience, databases are replicated across multiple servers and load balanced for improved performance.
Some suggestions in a database agnostic fashion.
The simplest rationale is to distinguish between read-intensive and write-intensive tables. It is probably a good idea to create two parallel schemas: a daily/weekly schema and a history schema. Partitioning can be done appropriately. One can think of a batch job to update the history schema with data from the daily/weekly schema. In the history schema, you can again create separate data tables per website (based on the data volume).
If all you are interested in is the aggregation stats alone (which may not be true), it is a good idea to have summary tables (monthly, daily) in which summaries such as total unique visitors, repeat visitors, etc. are stored, updated at the end of each day. This enables on-the-fly computation of stats without waiting for the history database to be updated.
You should definitely consider splitting the data by site across databases or schemas. This not only makes it much easier to back up, drop, etc. an individual site/client, but also eliminates much of the hassle of making sure no customer can see another customer's data through accident or poor coding. It also makes it easier to make choices about partitioning, over and above database table-level partitioning by time or client.
Also, you said that the data volume is 1 million rows per day. That's not particularly heavy and doesn't require huge grunt power to log/store, nor indeed to report (though if you were generating 500 reports at midnight you might hit a logjam). However, you also said that some sites have 1 million visitors daily, so perhaps your figure is too conservative?
Lastly, you didn't say whether you want real-time reporting à la Chartbeat/Opentracker etc. or cyclical refresh like Google Analytics; this will have a major bearing on what your storage model is from day one.
You really should test your way forward with simulated environments as close as possible to the live environment, with "real fake" data (correct format and length). Benchmark queries and variants of table structures. Since you seem to know MySQL, start there. It shouldn't take you long to set up a few scripts bombarding your database with queries. Studying how your database behaves with your kind of data will help you realise where the bottlenecks will occur.
Not a solution but hopefully some help on the way, good luck :)
I am working on an app right now which has the potential to grow quite large. The whole application runs through a single domain, with customers being given sub-domains, which means that it all, of course, runs through a common code-base.
What I am struggling with is the database design. I am not sure if it would be better to have a column in each table specifying the customer id, to create a new set of tables (in the same database) per customer, or to create a completely new database per customer.
The nice thing about a "flag" in the database specifying the customer id is that everything is in a single location. The downsides are obvious: tables can (and will) get huge, and maintenance can become a complete nightmare. If growth occurs, splitting this up over several servers is going to be a huge pain.
The nice thing about creating new tables is that it is easy to do, and it also keeps the tables pretty small. And since customers' data doesn't need to interact, there aren't any problems there. But again, maintenance might become an issue (although I do have a migrations library that will do updates on the fly per customer, so that is no big deal). The other issue is that I have no idea how many tables can be in a single database. Does anyone know what the limit is, and what the performance issues would be?
The nice thing about creating a new database per customer is that when I need to scale, I will be able to quite nicely. Several sites make use of this design (wordpress.com, etc.). It has been shown to be effective, but it also has some downfalls.
So, basically I am just looking for some advice on which direction I should (could) go.
Single Database Pros
One database to maintain. One database to rule them all, and in the darkness - bind them...
One connection string
Can use Clustering
Separate Database per Customer Pros
Support for customization on per customer basis
Security: No chance of customers seeing each others data
Conclusion
The separate database approach would be valid if you plan to support customer customization. Otherwise, I don't see the security as a big issue - if someone gets the db credentials, do you really think they won't see what other databases are on that server?
Multiple Databases.
Different customers will have different needs, and it will allow you to serve them better.
Furthermore, if a particular customer is hammering the database, you don't want that to negatively affect the site performance for all your other customers. If everything is on one database, you have no damage control mechanism.
The risk of accidentally sharing data between customers is much smaller with separate databases. If you'd like to have all data in one place, for example for reporting, set up a reporting database that the customers cannot access.
Separate databases allow you to roll out, and test, a bugfix for just one customer.
There is no hard limit on the number of tables in MySQL; you can create an insane number of them. I'd call anything above a hundred tables per database a maintenance nightmare, though.
Are you planning to develop a Cloud App?
I think that you don't need to make tables or databases per customer. I recommend you use a more scalable relational database management system. Personally I don't know the capabilities of MySQL, but I'm pretty sure it should support a distributed database model in order to handle the load.
Creating tables or databases per customer can lead you to a maintenance nightmare.
I have worked with multi-company databases where every table contains customer IDs, and to access a customer's data we develop views per customer (for reporting purposes).
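A minimal sketch of that approach (hypothetical names; SQLite in-memory stands in for MySQL): every table carries customer_id, and reporting goes through generated per-customer views that pin the id, so report code cannot leak another customer's rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_no INTEGER, total_cents INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 100, 999), (1, 101, 500), (2, 200, 4200)])

# One view per customer, generated when the customer is created.
for customer_id in (1, 2):
    conn.execute(
        f"CREATE VIEW orders_customer_{customer_id} AS "
        f"SELECT order_no, total_cents FROM orders WHERE customer_id = {customer_id}"
    )

# Report code queries the view and never mentions customer_id at all.
print(conn.execute("SELECT COUNT(*), SUM(total_cents) FROM orders_customer_1").fetchall())
# [(2, 1499)]
```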
Good luck,
You can do whatever you want.
If you've got the customer_id in each table, then you've got to write the whole application that way. That's not exactly true, as it should be enough to add that column only to some tables; the rest could be done using some simple joins.
If you've got one database per user, there won't be any additional code in the application so that could be easier.
If you take the first approach, there won't be a problem moving to many databases later, as you can keep the customer_id column in all those tables. Of course, then there will be the same value in this column throughout each table, but that's not a problem.
Personally, I'd take the simple one-customer-one-database approach. It is easier to use more database servers for all customers, and harder to accidentally show a customer data that belongs to some other customer.