I'm trying to develop an application into which users can import their e-mails and then search them. As this will probably be used by many users (easily 10k+), the database design is critical. With these numbers of users the database will probably need to hold over a billion rows (e-mails).
The application will need to return records quickly after a search query is submitted. The database will be searched heavily, and I would like some help designing the table(s) for an efficient db schema. I have a lot of experience with MySQL myself, but I've read somewhere that I shouldn't go that way and should look at MongoDB or something similar instead. Is the difference really that big, or is there a way I can still go with MySQL?
from
to
subject
date (range)
attachments (names & types only)
message contents
(optional) mailbox / folder structure
These are the searchable fields; of course, all e-mails will also have two extra "columns" for the unique id and the user_id. I've found several e-mail db schemas, but I can't find any documentation of a schema that will work with over a billion rows.
You would be best off starting simple with your proposed table definition and going from there - if the site does get near a billion records, you can move it to Amazon servers or another cloud host, which (should) allow the table to be partitioned.
MySQL can handle a fair amount of data, assuming you are not on a shared host with restrictions.
So, start simple, don't optimise a problem that doesn't exist yet, and see how it goes.
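To make "start simple" concrete, here is a minimal sketch of what such a schema could look like in MySQL (InnoDB, with FULLTEXT available from 5.6). All names, column sizes and index choices are illustrative assumptions rather than a tested design:

-- Hypothetical starting point covering the searchable fields from the question;
-- column names and sizes are assumptions, adjust to the real data.
CREATE TABLE emails (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED    NOT NULL,
    from_addr  VARCHAR(255)    NOT NULL,
    to_addr    VARCHAR(255)    NOT NULL,
    subject    VARCHAR(255),
    sent_date  DATETIME        NOT NULL,
    folder     VARCHAR(100),
    body       MEDIUMTEXT,
    PRIMARY KEY (id),
    KEY idx_user_date (user_id, sent_date),
    FULLTEXT KEY ft_subject_body (subject, body)
) ENGINE=InnoDB;

-- Attachment names and types only, as stated in the question.
CREATE TABLE email_attachments (
    email_id  BIGINT UNSIGNED NOT NULL,
    filename  VARCHAR(255)    NOT NULL,
    mime_type VARCHAR(100),
    KEY idx_email (email_id)
) ENGINE=InnoDB;

Searches would then combine a user_id/date filter with MATCH ... AGAINST on the full-text index; whether that holds up at a billion rows is exactly what you would measure before reaching for partitioning or a different datastore.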
Related
I need expert advice for my database. Basically, we have 100s of sensors around the world. We collect data from the sensors and store it in the database for future use.
Currently I create a separate database table for each customer, i.e. when a customer registers with the application, I create a separate table for them, and the data from all of that customer's sensors goes into their own table.
Now, as the number of customers increases, so does the number of tables, and this approach is not looking good anymore (maybe it wasn't right in the first place).
I now want to keep all the data in one table, so I copied all the data from the customers' tables into a new table. The new table is now over 5GB with more than 34 million rows (and growing).
If I insert new rows into this new table simultaneously, from multiple threads (one per sensor), it takes too long. Reading data from the same table takes a long time too.
How can I resolve this issue? Is there another solution? Should I use some external cloud service to store the data?
Thanks in advance!
EDIT:
I am using indexes. Here is the table schema
With UNIQUE INDEX idx_userInsDate (userID, instrumentID, utcDateTime)
I have also looked into database sharding, but my main issue is that inserting rows into the same table from multiple threads, and reading data from multiple threads, is taking too long.
With this limited information here's my advice.
When collecting millions of rows from many different customers, a customer-specific table or even a customer-specific database can definitely be used, unless the data has to be brought together for "easy reporting" - and that is absolutely fine.
This actually has several benefits, including protecting you from accidentally exposing one customer's information to another customer, since each customer's data sits in its own table.
As your number of customers goes up, you add either a new database or a new table per customer, and that is fine - it's something you would probably want to automate in your software, so that when a customer signs up, their table is created automatically.
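For example, the sign-up automation could boil down to running a templated statement like the sketch below; the table-name pattern and the payload column are assumptions, with the key columns taken from the index mentioned in the question:

-- Hypothetical per-customer table created at sign-up; the application substitutes
-- the customer id into the table name (here customer 123). The "value" column is
-- an assumed payload, since the full schema isn't shown in the question.
CREATE TABLE sensor_data_c123 (
    instrumentID INT      NOT NULL,
    utcDateTime  DATETIME NOT NULL,
    value        DOUBLE,
    PRIMARY KEY (instrumentID, utcDateTime)
) ENGINE=InnoDB;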
Both scenarios and designs are common and perfectly fine depending on your situation. For instance, I once owned a product company where every customer had their own entire database, so as my customer count went up, my number of databases went up. That is really no different from you having a database or table per customer, and if you choose that route, that's okay.
Whatever you choose, you must consider your SQL backups, the size of your database versus the hard drive space available, etc. If the number of tables continues to grow, maybe each customer should get their own database - but how hard would it then be for you to back up all of these databases and relate them to a central db if you needed to? Just consider everything like this, including security, your reporting needs, how much data you will need to keep, and so on.
Here's another article I wrote some time ago on multi-tenant data architecture.
https://stackoverflow.com/a/38555345/671343
Check it out, and hopefully it helps you. You're not the only one to struggle with a design decision like this. Just weigh all your options, considering reporting, security, backups and more.
Hope that's helpful.
Use Mongo or a similar DB for your scenario; this is exactly the kind of scenario Mongo is built for.
Inserting multiple records at once is very fast, and each insert is isolated from other records, hence faster.
Reading is faster if you form a proper tree-like document structure for your data.
Proper structuring will further reduce the need to create a new table for each customer.
I will describe a problem using a specific scenario:
Imagine that you create a website to which users can register,
and after they register, they can send Private Messages to each other.
This website enables every user to maintain his own Friends list,
and also maintain a Blocked Users list, from which he prefers not to get messages.
Now the problem:
Imagine this website getting to several million users,
and let's also assume that every user has about 10 Friends in the Friends table, and 10 Blocked Users in the Blocked Users table.
The Friends list Table, and the Blocked Users table, will become very long,
but worse than that, every time someone wants to send a message to another person "X",
we need to go over the whole Blocked Users table and look for the records that user "X" defined - the people he blocked.
This "scanning" of a long database table, each time a message is sent from one user to another, seems quite inefficient to me.
So I have 2 questions about it:
What are possible solutions for this problem?
I am not afraid of long database tables,
but I am afraid of database tables that contain data for so many users,
which means that the whole table needs to be scanned every time, just to pull out a few records from it for that specific user.
A specific solution that I have in my mind, and that I would like to ask about:
One solution that I have in mind for this problem is that every user who registers on the website will have his own "mini-database" dynamically (and programmatically) created for him,
so that the Friends table and the Blocked Users table will contain only records for him.
This makes scanning those tables very easy, because all the records are his.
Does this idea exist in databases like MS-SQL Server or MySQL? And if yes, is it a good solution for the described problem?
(each user will have his own small database created for him, and of course there is also the main (common) database for all other data that is not user specific)
Thank you all
I would hold off on the partitioning and on the mini-database idea. Is your database installed with the data, log and temp files on different RAID drives? Do you have clustered indexes on the tables, and indexes on the search and join columns?
Have you tried reading the query plans to see how and where the slowdowns are occurring? Don't just add memory or try advanced features blindly before doing the basics.
Creating separate databases will become a maintenance nightmare, and it will be challenging to do the types of queries (across all users...) that you will probably want to do in the future.
Partitioning is a wonderful feature of SQL Server, and while in 2014 you can have thousands of partitions, you probably won't see the big performance bump you are looking for (unless you put each partition on a separate drive).
SQL Server has very fast response times, especially for tables with tens of millions of rows (in your case, the user table). Don't let the main table get too wide and the response time will stay extremely fast.
Right off the bat my first thought is this:
https://msdn.microsoft.com/en-us/library/ms188730.aspx
Partitioning can allow you to break it up into more manageable pieces and in a way that can be scalable. There will be some choices you have to make about how you break it up, but I believe this is the right path for you.
In regards to table scanning: if you have proper indexing, you should be getting seeks in your queries. You will want to look at execution plans to know for sure, though.
As for having a mini-DB for each user, that is sort of what you can accomplish with partitioning.
Mini-Database for each user is a definite no-go zone.
Plus, on a side note: with a separate table holding just two columns, UserID and BlockedUserID, both being INT columns and having the correct indexes, you cannot go wrong with this approach - if you write your queries sensibly :)
Look into table partitioning; a well-normalized database with decent indexes will also help.
Also, if you can afford an Enterprise licence, table partitioning combined with the table schema described in the last point will make for a very good, query-friendly database schema.
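As a sketch of that two-column table (names are illustrative), the blocking check becomes a single index seek rather than a scan:

-- Sketch of the two-column blocked-users table described above; names are illustrative.
CREATE TABLE BlockedUsers (
    UserID        INT NOT NULL,  -- the user who created the block
    BlockedUserID INT NOT NULL,  -- the user being blocked
    PRIMARY KEY (UserID, BlockedUserID)
);

-- Checking whether recipient 17 has blocked sender 42 is a point lookup on the primary key:
SELECT 1 FROM BlockedUsers WHERE UserID = 17 AND BlockedUserID = 42;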
I did this once for a social network system. Maybe you should take a look at your normalization. At the time I had a [Relationship] table and it just had:
UserAId Int
UserBId Int
RelationshipFlag Smallint
With 1 million users and each one with 10 "friends", that table had 10 million rows. Not a problem, since we put indexes on the columns and it could retrieve a list of all users (UserB) "related" to a specific UserA in no time.
Take a good look at your schema and your indexes; if they are ok, your DB will have no problem handling it.
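For illustration, such a lookup against the [Relationship] table might look like the sketch below; the index name, the parameter value and the meaning of the flag are assumptions:

-- Illustrative index and query; with the index in place this is a seek, not a table scan.
CREATE INDEX IX_Relationship_UserA ON Relationship (UserAId, RelationshipFlag);

SELECT UserBId
FROM Relationship
WHERE UserAId = 42            -- the user whose related users we want
  AND RelationshipFlag = 1;   -- assumed flag value, e.g. 1 = friend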
Edit
I agree with @M.Ali
Mini-Database for each user is a definite no-go zone.
IMHO you are fine if you stick with the basics and implement them the right way.
I'm in the middle of redesigning an app that has 100,000s of records in a particular table (currently 250K and growing).
The table contains information of websites and domains.
For the sake of speed and resources, should I include all the data needed about either entity in the original table, or should I be using two lookup tables to store information not shared - for example one lookup table which stores all domain specific info and one which stores all site specific info?
Thanks
Ideally you should split them into two different tables, because a single domain can correspond to multiple sites. If the metadata of both the domain and the site were stored in a single table, redundant domain information would have to be stored in every site record. Instead, with two separate tables - a domain table that has one record per domain (with a list of its sites as one of the fields) and a site table with a domain name column to figure out the domain for a given site - you get organized storage and no redundant data. This is the major principle of traditional RDBMS systems, and it is why we have the concept of multiple tables.
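As a rough sketch of that split (all table and column names are assumptions, and a foreign key from sites to domains stands in for the "list of sites" field):

-- Illustrative two-table layout: domain-specific metadata in one table,
-- site-specific metadata in the other, linked by domain_id.
CREATE TABLE domains (
    domain_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    domain_name VARCHAR(255) NOT NULL,
    registrar   VARCHAR(100),          -- example of domain-specific metadata
    PRIMARY KEY (domain_id),
    UNIQUE KEY uq_domain_name (domain_name)
) ENGINE=InnoDB;

CREATE TABLE sites (
    site_id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
    domain_id INT UNSIGNED NOT NULL,
    traffic   INT UNSIGNED,            -- example of site-specific metadata
    PRIMARY KEY (site_id),
    KEY idx_domain (domain_id),
    CONSTRAINT fk_site_domain FOREIGN KEY (domain_id) REFERENCES domains (domain_id)
) ENGINE=InnoDB;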
Also, you may consider using a NoSQL data store if you really want to scale your database, since you said that the data is continuously increasing. Apache HBase may be a good fit, as it has this concept of grouping related information together.
Edit:
Clarification in the question:
Just to be clear, domains and sites are not linked. They're just different entities: a domain with no traffic or revenue would be classed as a domain and have domain-related data stored for it (like the number of hyphens or the registrar), while a domain with, for example, a WordPress install and existing traffic would be classed as a site - not a domain - and have site-specific information stored. Would this change your answer?
In the case where they are not inter-related, I don't think that splitting the data into multiple tables is going to help in any way, unless you are going for a distributed RDBMS. On a single-node hosted DB, the rows are indexed by the site/domain id anyway, and a large number of rows in a single table is not going to degrade performance. But if you are looking at a humongous amount of data and wish to divide it over multiple nodes in a cluster, then having independent tables will help, so that each table can be hosted on an individual node and the DB can scale horizontally. That is the only benefit I see in this case.
The performance of your application largely depends on the types of queries the application uses. Storing all the data in one table does not necessarily reduce performance and might very well enhance it. You are of course wasting disk space if your table stores the fact that example.com is owned by Mr XY a few thousand times over.
Normalizing your database (splitting your data up) can be helpful, but one would have to know what you want to do with the data to answer that.
I would love to hear some opinions or thoughts on a MySQL database design.
Basically, I have a Tomcat server which receives different types of data from about 1000 systems out in the field. Each of these systems is unique and will be reporting unique data.
The data sent can be categorized as frequent and infrequent data. The infrequent data is only sent about once a day and doesn't change much - it is basically just configuration-based data.
Frequent data is sent every 2-3 minutes while the system is turned on, and represents the current state of the system.
This data needs to be stored in the database for each system and be accessible at any given time from a PHP page. Essentially, for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to reflect the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (which doesn't change much) is key-value pairs, and there are currently about 100 of them.
My idea for the design was to have 100+ columns and one row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't very future-proof if I need to add columns later. I am also worried about insert speed if I do it that way. This might blow out to a 2000-row x 200-column table that gets accessed about 100 times a second, so I need to cater for this in my initial design.
I am also wondering if there are any design philosophies out there that cater for frequently changing and seldom-changing data based on the storage engine. This would make sense, as I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from PHP.
I would also love to know how to split up the data. I.e., if frequently changing data can be categorised in a few different ways, should I have a bunch of tables representing the data and join them on selects? I am worried about this because I will probably have to make a report showing common properties across all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction; any help on the matter would be great. Or if someone has done something similar and can offer advice, I would very much appreciate it. Thanks heaps :)
~ Dan
I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID VARCHAR System identifier
POSTTIME DATETIME The time the information was posted
NAME VARCHAR The name of the parameter
VALUE VARCHAR The value of the parameter
The first three of these columns are your composite primary key.
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
(Edit: if you really don't need history, get rid of the POSTTIME column and use MySQL's handy INSERT ... ON DUPLICATE KEY UPDATE extension when you post data. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
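For example, the no-history variant might look like the sketch below; the table name, column sizes and sample values are illustrative:

-- No-history variant: (SYSTEM_ID, NAME) is the primary key, so re-posting a
-- parameter simply overwrites its value. Names, sizes and values are illustrative.
CREATE TABLE system_config (
    SYSTEM_ID VARCHAR(64)  NOT NULL,
    NAME      VARCHAR(128) NOT NULL,
    `VALUE`   VARCHAR(255),
    PRIMARY KEY (SYSTEM_ID, NAME)
) ENGINE=InnoDB;

INSERT INTO system_config (SYSTEM_ID, NAME, `VALUE`)
VALUES ('unit-0042', 'firmware_version', '1.8.3')
ON DUPLICATE KEY UPDATE `VALUE` = VALUES(`VALUE`);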
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY storage engine for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )
I have a user activity log table that records all user activity as it occurs. It is an extremely write-heavy table due to the in-depth, click-by-click tracking. Up to here the database design is fine. The problem is the next step.
I need to output the data for the business folks, and these people can also run queries to fetch past activity data. Hence there is medium-to-high read traffic as well. I do not like the idea of reading and writing from the same high-traffic table.
So ideally I want to split the tables: the first one for quick writes (few to no FKs), then copy that data over - fully formatted, with all the labels for the ids pulled in - into a read table used only for reading.
So questions:
1) Is this the best approach for me?
2) If I do keep 2 tables, how do I keep them in sync? I can't copy the data to the read table the instant it is written to the write table - that would defeat the whole purpose of having separate tables - nor can I let the read table get too stale, because the tracked activity data links to other user data like session_id, etc., so if these IDs are not ready when their use case calls for them, the writes will fail.
I am using MySQL for user data and HBase for some large tables, with PHP CodeIgniter for my app.
Thanks.
Yes, having 2 separate tables is the best approach. I had the same problem to solve a few months ago, though for a daemon-type application and not a website.
Eventually I ended up with one MEMORY table keeping the "live" data, which is inserted/updated/deleted on almost every event, and another table holding duplicates of the live data rows but without the unnecessary system columns - my history table, which was used for reading only, per request.
The live table is only relevant to the running process, so I don't care if the contained data is lost due to a server failure - whatever data needs to be read later is already stored in the history table. So ... there's no problem in duplicating the data in the two tables - your goal is performance, not normalization.
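For what it's worth, the periodic copy into the history table can be a single scheduled statement; the sketch below assumes an auto-increment id on the live table and purely illustrative table and column names:

-- Illustrative sync job (run from cron or a MySQL event): copy only the rows the
-- history table hasn't seen yet, using the auto-increment id as a high-water mark.
SET @last_copied = (SELECT COALESCE(MAX(activity_id), 0) FROM activity_history);

INSERT INTO activity_history (activity_id, session_id, user_id, action, created_at)
SELECT activity_id, session_id, user_id, action, created_at
FROM activity_live
WHERE activity_id > @last_copied;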