I'm in the middle of redesigning an app that has 100,000s of records in a particular table (currently 250K and growing).
The table contains information about websites and domains.
For the sake of speed and resources, should I include all the data needed about either entity in the original table, or should I be using two lookup tables to store the information that isn't shared - for example, one lookup table which stores all domain-specific info and one which stores all site-specific info?
Thanks
Ideally you should split them into 2 different tables, because a single domain can correspond to multiple sites. With a design in which the metadata of both the domain and the site is stored in a single table, the domain's info would have to be stored redundantly in every record of the site metadata. Instead, with 2 separate tables - the domain table holding one record per domain (possibly with a list of its sites as one of the fields), and the site table carrying a domain name column so you can find the domain for a given site - you get organized storage and no redundancy of data. This is a core principle of traditional RDBMS design, and it is why we have the concept of multiple tables.
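For illustration, here's a minimal sketch of that split (assuming MySQL; the table and column names are made up, with a couple of fields borrowed from your examples - adjust types to your actual data):

create table domains (
    domain_id int auto_increment primary key,
    domain_name varchar(255) not null unique,
    registrar varchar(255),
    hyphen_count int
);

create table sites (
    site_id int auto_increment primary key,
    domain_name varchar(255) not null,
    monthly_traffic int,
    cms varchar(100),
    foreign key (domain_name) references domains(domain_name)
);

Each domain is stored exactly once, and each site row just points back at its domain instead of repeating the domain's metadata.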
Also, you might consider a NoSQL data store if you really want to scale your database, since you said the data is continuously increasing. Apache HBase may be a good fit, as it has this concept of grouping related information together.
Edit:
Clarification in the question:
Just to be clear, domains and sites are not linked. They're just different entities: a domain with no traffic or revenue would be classed as a domain and have domain-related data stored for it, like the number of hyphens or the registrar, while a domain with a WordPress install, for example, and existing traffic would be classed as a site - not a domain - and have site-specific information stored. Would this change your answer?
In the case where they are not inter-related, I don't think that splitting the data into multiple tables is going to help in any way, unless you are going for a distributed RDBMS. In a single-node hosted DB, the rows are indexed by the site/domain id anyway, and a large number of rows in a single table is not going to degrade performance. But if you are looking at a huge volume of data and wish to divide it over multiple nodes in a cluster, then having independent tables will help, so that each table can be hosted on an individual node and the DB can scale horizontally. That is the only benefit I see in this case.
The performance of your application largely depends on the type of queries the application uses. Storing all the data in one table does not necessarily reduce performance and might very well enhance it. You are, of course, wasting disk space if your table holds the information that example.com is owned by Mr XY a few thousand times.
Normalizing your database (splitting your data up) can be helpful, but one would have to know what you want to do with the data to answer that.
The concept of DB sharding makes sense at a high level: split up DB nodes so no single one is responsible for all of the persistent data. However, I'm a little confused about what constitutes the "shard". Does it duplicate entire tables across shards, or usually just a single one?
For instance, if we take Twitter as an example, at the most basic level we need a users table and a tweets table. If we shard based on user ID, with 10 shards, it would stand to reason that the shard function is userID mod 10 === shard location. However, what does this mean for the tweets table? Is that kept separate (a single DB table), or is every single tweet divided up among the 10 shards, based on whichever user ID created the tweet?
If it is the latter, and say we shard on something other than user ID - tweet creation timestamp, for example - how would we know where to look up info relating to the user, given that all tables are sharded on tweet creation time (which the user has no concept of)?
Sharding is splitting the data across multiple servers. The choice of how to split is critical, and may not be obvious.
At first glance, splitting tweets by userid sounds correct. But what other things are there? Is there any "grouping" or do you care who "receives" each tweet?
A photo-sharing site is probably best split on Userid, with meta info for the user's photos also on the same server with the user. (Where the actual photos live is another discussion.) But what do you do with someone who manages to upload a million photos? Hopefully that won't blow out the disk on whichever shard he is on.
One messy case is Movies. Should you split on movies? Reviews? Users who write reviews? Genres?
Sure, "mod 10" is convenient for saying which shard a user is on. That is, until you need an 11th shard! I prefer a compromise between "hashing" and "dictionary". First do mod 4096, then lookup in a 'dictionary' that maps 4096 values to 10 shards. Then, write a robust tool to move one group of users (all with the same mod-4096 value) from one shard to another. In the long run, this tool will be immensely convenient for handling hardware upgrades, software upgrades, trump-sized tweeters, or moving everyone else out of his way, etc.
If you want to discuss sharding tweets further, please provide the main tables that are involved. Also, I have strong opinions on how you could issue unique ids for the tweets, if you need them. (There are ways to do it that end in fiasco.)
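One common technique (not necessarily what the author has in mind) is to give each shard an interleaved auto-increment range, which MySQL supports via two system variables:

-- on shard 0:
set session auto_increment_increment = 10, auto_increment_offset = 1;
-- on shard 1:
set session auto_increment_increment = 10, auto_increment_offset = 2;
-- each shard then generates ids 1, 11, 21, ... or 2, 12, 22, ... with no collisions

In practice these are usually set globally in the server config, and the increment needs headroom for however many shards you might eventually add.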
I need some expert advice for my database. Basically, we have 100s of sensors around the world. We collect data from the sensors and store it in the database for future use.
Currently I create a separate database table for each customer, i.e. when a customer registers with the application, I create a separate table for them, and the data from all of this customer's sensors goes into their separate database table.
Now the number of customers is increasing, and so is the number of tables, and this approach is not looking good anymore (maybe this approach wasn't right in the first place).
I now want to keep all the data in one table, so I copied all the data from the customers' tables into a new table. Now the size of the new table is over 5GB with more than 34 million rows (and growing).
If I insert new rows into this new table simultaneously, from multiple threads (one per sensor), it takes too long. Accessing data from the same table takes a long time too.
How can I resolve this issue? Is there any other solution? Should I use some external cloud service to store the data?
Thanks in advance!
EDIT:
I am using indexes. Here is the table schema
With UNIQUE INDEX idx_userInsDate (userID, instrumentID, utcDateTime)
I have also looked into database sharding, but my main issue is that inserting rows into the same table from multiple threads and reading data from multiple threads are both taking too long.
With this limited information here's my advice.
When collecting millions of rows from many different customers, unless the data has to be collected together for "easy reporting", a customer-specific table or even a customer-specific database can definitely be used, and that is absolutely fine.
This actually has several benefits, including protecting you from accidentally exposing one customer's information to another customer, since all of each customer's data sits in a table of its own.
As your number of customers goes up, you add either a new database or a new table for each customer, and that is fine - it's probably something you would want to automate in your software. For instance, when a customer signs up, their table is created automatically.
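As a sketch (assuming MySQL, with invented table names), that sign-up automation can be as simple as cloning a template:

-- create an empty per-customer table with the same columns and indexes as the template
create table sensor_data_customer_42 like sensor_data_template;

CREATE TABLE ... LIKE copies the column definitions and indexes but none of the rows, so each new customer starts with an empty, identically structured table.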
Both scenarios and designs are common and perfectly fine, depending on your situation. For instance, I once owned a product company, and there every customer had their own entire database, so as my customer count went up, my number of databases went up. That is really no different from you having a database or table per customer, and if you choose that route, that's okay.
Whatever you choose, you must consider your SQL backups, the size of your database versus the hard drive space available, etc. If the number of tables continues to grow, maybe each customer should get their own database - but then consider how hard it would be to back up all of those databases and relate them to a central DB if you needed to. Just weigh everything like this, including security, your reporting needs, how much data you will need to keep, and so on.
Here's another article I wrote some time ago on multi-tenant data architecture.
https://stackoverflow.com/a/38555345/671343
Check it out; hopefully it helps you. You're not the only one to struggle with a design decision like this. Just weigh all your options, considering reporting, security, backups and more.
Hope that's helpful.
Use Mongo or a similar DB for your scenario; this is exactly the kind of scenario that suits Mongo.
Inserting multiple records at once is very fast, and each insert is isolated from other records, hence faster.
Reading is faster if you have a proper tree-like data structure formed for your data.
Proper structuring will further reduce the need to create a new table for each customer.
Say I have an application that adds companies. I have a table like this:
create table companies (
    id int primary key,
    name varchar(255)
);
For each company, I have to store loads of business-type information. Since there is so much of it, I've decided to create one database per company to avoid collisions, very slow performance and complicated queries. The database name would be the company name from my companies table.
My problem is that I would like to give the company name and the database a one-to-one relationship so they stay in sync with each other. Is that possible? If not, is there a better approach besides creating a database per company?
This is an elaboration on my comment.
Databases are designed to handle tables with millions, even billions of rows. I am guessing that your data is not that big.
Why would you want to store all the data for a single entity in a single table? Here are some reasons:
You can readily run queries across different companies.
You can define foreign key relationships between entities.
If you change the data structure, you can do it in one place.
It can be much more efficient in terms of space. Databases generally store data on data pages, and partially filled pages will eat up lots of space.
You have a single database for backup and recovery purposes.
A where clause to select a single company's data is not particularly "complicated".
(Note: This is referring to "entities", a database term. Data for the companies can still be spread across multiple tables.)
For performance, you can then adjust the data model, add indexes, and partition tables. This is sufficient for the vast majority of applications that run on databases.
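As a sketch of the single-table approach (assuming MySQL; the invoices table and its columns are invented for illustration):

-- one shared table, with the owning company identified by a column
create table invoices (
    id int auto_increment primary key,
    company_id int not null,
    amount decimal(10, 2),
    foreign key (company_id) references companies(id),
    index idx_invoices_company (company_id)
);

-- selecting one company's data is just a filter, not a complicated query:
select * from invoices where company_id = 42;

The index on company_id keeps that per-company filter fast even as the table grows.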
There are a handful of situations where a separate database per company/client is needed. Here are some I've encountered:
1. You are told this is what you have to do.
2. Each client really is customized, so there is little common data structure among them.
3. Security requirements specify that data must be in different databases or even on different servers (this is a "good reason" for 1).
I'm trying to develop an application that users can import their e-mails into and then search. As this will probably be used by many users (easily 10k+), the database design is critical. With these numbers of users, the database will probably need to hold over a billion rows (e-mails).
The application will need to return records quickly after a search query is posted. The database will be heavily searched, and I would like some help designing an efficient db schema. I have a lot of experience with MySQL myself, but I've read somewhere that I shouldn't go that way and should look at MongoDB or something instead. Is the difference really that big, or is there any way I can still go with MySQL?
from
to
subject
date (range)
attachments (names & types only)
message contents
(optional) mailbox / folder structure
These are the searchable fields; of course, all e-mails will also have two extra "columns" for the unique id and the user_id. I've found several e-mail db schemas, but I can't find any documentation of a schema that will work with over a billion rows.
You would be best off starting simple with your proposed table definition and going from there - if the site does get near a billion records, then if needed you can move it to Amazon servers or another cloud host, which (should) allow the table to be partitioned.
MySQL can handle a fair amount of data, assuming you are not on a shared host with restrictions.
So, start simple, don't optimise a problem that doesn't exist yet, and see how it goes.
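For concreteness, the "simple" starting point might look roughly like this (a sketch only - every name and size here is invented, and this is not a vetted billion-row design):

create table emails (
    id bigint unsigned auto_increment primary key,
    user_id int unsigned not null,
    from_addr varchar(320) not null,
    to_addr varchar(320) not null,
    subject varchar(998),
    sent_date datetime not null,
    body mediumtext,
    folder varchar(255),
    index idx_user_date (user_id, sent_date),
    fulltext index ft_subject_body (subject, body)
) engine = InnoDB;

The (user_id, sent_date) index covers the common "this user's mail in a date range" lookups, and MySQL's FULLTEXT index (supported on InnoDB since 5.6) handles subject/body searches; attachment names and types would go in a child table keyed by the email id.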
I am working on an app right now which has the potential to grow quite large. The whole application runs through a single domain, with customers being given sub-domains, which means that it all, of course, runs through a common code-base.
What I am struggling with is the database design. I am not sure if it would be better to have a column in each table specifying the customer id, or to create a new set of tables (in the same database), or to create a complete new database per customer.
The nice thing about a "flag" in the database specifying the customer id is that everything is in a single location. The downfalls are obvious: tables can (will) get huge, and maintenance can become a complete nightmare. If growth occurs, splitting this up over several servers is going to be a huge pain.
The nice thing about creating new tables is that it is easy to do, and it also keeps the tables pretty small. And since customers' data doesn't need to interact, there aren't any problems there. But again, maintenance might become an issue (although I do have a migrations library that will do updates on the fly per customer, so that is no big deal). The other issue is that I have no idea how many tables can be in a single database. Does anyone know what the limit is, and what the performance issues would be?
The nice thing about creating a new database per customer is that when I need to scale, I will be able to quite nicely. Several sites make use of this design (wordpress.com, etc.); it has been shown to be effective, but it also has some downfalls.
So, basically I am just looking for some advice on which direction I should (could) go.
Single Database Pros
One database to maintain. One database to rule them all, and in the darkness - bind them...
One connection string
Can use Clustering
Separate Database per Customer Pros
Support for customization on per customer basis
Security: No chance of customers seeing each others data
Conclusion
The separate database approach would be valid if you plan to support customer customization. Otherwise, I don't see the security as a big issue - if someone gets the db credentials, do you really think they won't see what other databases are on that server?
Multiple Databases.
Different customers will have different needs, and it will allow you to serve them better.
Furthermore, if a particular customer is hammering the database, you don't want that to negatively affect the site performance for all your other customers. If everything is on one database, you have no damage control mechanism.
The risk of accidentally sharing data between customers is much smaller with separate database. If you'd like to have all data in one place, for example for reporting, set up a reporting database the customers cannot access.
Separate databases allow you to roll out, and test, a bugfix for just one customer.
There is no limit on the number of tables in MySQL; you can make an insane number of them. I'd call anything above a hundred tables per database a maintenance nightmare, though.
Are you planning to develop a Cloud App?
I think that you don't need to make tables or databases per customer. I recommend using a more scalable relational database management system. Personally, I don't know the capabilities of MySQL in detail, but I'm pretty sure it should support a distributed database model in order to handle the load.
Creating tables or databases per customer can lead you into a maintenance nightmare.
I have worked with multi-company databases where every table contains customer ids, and to access the data we create views per customer (for reporting purposes).
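A minimal sketch of that pattern (the orders table and the view name are invented for illustration):

-- a per-customer view over a shared table; reporting users see only their slice
create view orders_customer_42 as
    select * from orders
    where customer_id = 42;

Each customer's reporting account is then granted access to its views rather than to the underlying tables.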
Good luck,
You can do whatever you want.
If you've got the customer_id column in each table, then you've got to write the whole application that way. Well, that's not exactly true: it should be enough to add that column only to some tables; the rest can be handled with some simple joins.
If you've got one database per user, there won't be any additional code in the application, so that could be easier.
If you take the first approach, there won't be a problem moving to many databases later, since you can keep the customer_id column in all those tables. Of course, within one customer's database the column will then hold the same value in every row, but that's not a problem.
Personally, I'd take the simple one-customer-one-database approach. It makes it easier to use more database servers for all the customers, and more difficult to show a customer data that belongs to some other customer.