Just a general question.
I was wondering how companies like Facebook and Google search through millions of records in such a short span of time.
Let's say I have to log in: I enter my credentials on the login page. How do Facebook and Google store usernames and passwords so that they can search through millions or billions of usernames and check whether a user exists or not?
If there is a startup, how should they store their users' data so that searching for and extracting user details is fast later on? Should we create a separate table for users based on the first letter of their username, or is there a better way to do this?
Let me know if there is any good article related to this question that you would suggest I read.
Searching for data in a centralized database will become a bottleneck as the application's data size grows. If you are thinking about scaling problems while starting development of the application itself, make sure your system can easily be deployed across parallel systems.
For example, think of scenarios where your data can't fit on a single database server, no matter how powerful the server is. You must split this data across multiple hosts. This is called sharding. In sharding, data gets distributed to multiple hosts based on some key. Take the same example of Facebook. It could maintain a database server for each country (just an assumption; I don't really know how they have implemented it). So when a user tries to log in from India, his account is looked up only in the Indian users database rather than in Facebook's whole user base. Considering the huge size of Facebook's database, reducing the search space from the whole user base to the Indian user base will definitely improve query performance.
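As a minimal sketch of what that key-based routing could look like at the application layer (the hostnames and country codes below are hypothetical placeholders, not how Facebook actually does it):

    # Route each lookup to the database host that owns the user's region.
    SHARDS = {
        "IN": "users-db-india.internal",
        "US": "users-db-us.internal",
        "EU": "users-db-eu.internal",
    }
    DEFAULT_SHARD = "users-db-default.internal"

    def shard_for(country_code):
        """Return the host that owns a user's row, based on the shard key."""
        return SHARDS.get(country_code, DEFAULT_SHARD)

    # A login attempt from India only ever touches the Indian shard, so the
    # search space shrinks from the whole user base to a single region.
    print(shard_for("IN"))   # users-db-india.internal
    print(shard_for("BR"))   # users-db-default.internal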
Database servers like MongoDB and Elasticsearch provide built-in support for sharding. With the help of these features, we can horizontally scale a system by adding more and more machines, rather than vertically scaling (pushing a single server to its maximum capacity).
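For example, MongoDB's sharding can be enabled per collection on a chosen key. This is only a rough sketch in Python, assuming a sharded cluster is already running behind a mongos router (the address, database, and collection names are made up):

    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos-router:27017")

    # Enable sharding for the database, then shard the users collection on
    # "country" so documents are distributed across shards by that key.
    client.admin.command("enableSharding", "mydb")
    client.admin.command("shardCollection", "mydb.users", key={"country": 1})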
I have a video surveillance project running on cloud infrastructure and using a MySQL database.
We are now integrating some artificial intelligence into our project, including face recognition, plate recognition, tag search, etc., which implies a huge amount of data every day.
All the photos, and the images derived from those photos by image processing algorithms, are stored in cloud storage, but their references and tags are stored in the database.
I have been thinking about the best way to integrate this: do I have to stick with MySQL or use another system? The different options I thought about are:
1- Use another database, MongoDB, to store the photo references and tags. This will cost me another database server, as well as the integration of a new database system alongside the existing MySQL server.
2- Use Elasticsearch to retrieve data and perform tag searching. This raises the question of whether MySQL has the capacity to store this amount of data.
3- Stick with MySQL only, but will the user experience be impacted?
Would you guide me to the best option to choose, or give me another proposal?
EDIT:
For more information:
The physical pictures are stored in cloud storage, only the URLs are stored in the database.
In the database, we will store the metadata of the picture like id, the id of the client, URL, tags, date of creation, etc...
Operations are of the type:
They will generally be SELECTs based on different criteria and searches by tags.
How big is the data?
Imagine a camera placed outdoors in the street: each time it detects a face, it sends an image.
Imagine thousands of cameras doing so. Then we are talking about millions of images per client.
MySQL can handle billions of rows. You have not provided enough other information to comment on the rest of your questions.
Large blobs (images, videos, etc) are probably best handled by some large, cheap, storage. And then, as you say, a URL to the blob would be stored in the database.
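As a sketch of how the metadata side might be laid out in MySQL, with one row per image and a separate tag table so tag searches can use an index (the table names, columns, and connection settings are illustrative assumptions, not a prescription):

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="surveillance")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS images (
            id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            client_id  INT NOT NULL,
            url        VARCHAR(2048) NOT NULL,      -- pointer to cloud storage
            created_at DATETIME NOT NULL,
            INDEX idx_client_date (client_id, created_at)
        )""")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS image_tags (
            image_id BIGINT UNSIGNED NOT NULL,
            tag      VARCHAR(64) NOT NULL,
            PRIMARY KEY (image_id, tag),
            INDEX idx_tag (tag, image_id)
        )""")

    # Typical lookup: one client's most recent images carrying a given tag.
    cur.execute("""
        SELECT i.id, i.url, i.created_at
        FROM images i JOIN image_tags t ON t.image_id = i.id
        WHERE i.client_id = %s AND t.tag = %s
        ORDER BY i.created_at DESC LIMIT 100""", (42, "face"))
    print(cur.fetchall())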
How many rows? How frequently are you inserting? What are some of the desired SELECT statements? Is it mostly just writing to the database, or will you have large, complex queries?
I'm relatively new to databases and have been looking into a solution that will allow users, under my server, to access their own data and no one else's. I want each user's database to be scalable, so if more space is needed to store files, it can grow with little intervention.
I was looking into MySQL, as I have only done single-database work with it, and was trying to see how this could potentially be done. Would the best course of action be to set up a database for each user? That way the tables for each database are separate, password protected, and cannot exchange data with the tables in other databases. I know there can be essentially unlimited databases and tables, in addition to table sharding/partitioning, so I think this is a solid choice, but I was wondering if anyone who has worked with MySQL more had any input.
Thanks
EDIT: Update for clarification of desires. What I essentially want is a platform where I am the owner, but users can log in to my platform to access their data. This data will probably mostly consist of files, such as PDFs; I cannot tell how big they will be, but I am planning for the worst. Users will be able to use a web/application UI to view, download, upload, sort, and delete these files. So in addition to creating files, there will be the ability to see historic files and download those as well if desired. What my platform will provide is the framework for these files, with fields auto-filled where I can, as well as the UI for file management. My concern is the architecture of having multiple users, with separate data, kept separate and scalable, without completely crashing the server with reads/writes.
It sounds like you are looking to store the users' "files" as BLOBs in the database, which doesn't necessarily lend itself to scaling well in the first place. Depending on the type of files, generally the best solution is to enforce security in the application layer and use cloud-based storage for the files themselves. If you need an additional layer of security (i.e. users can only access the files assigned to them) there are a number of options; one such option, for example, assuming you were using S3, would be to use IAM profiles generated when the user account is set up. The same would apply for any third-party cloud storage with an API.
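As a sketch of that approach: keep each user's objects under their own prefix in the bucket and have the application hand out short-lived presigned URLs only after it has authenticated the user (the bucket name and key layout below are assumptions for illustration):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-app-user-files"   # hypothetical bucket name

    def download_url(user_id, filename, expires=300):
        # The application has already authenticated the user, so it only
        # ever signs keys under that user's own prefix.
        key = f"users/{user_id}/{filename}"
        return s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": BUCKET, "Key": key},
            ExpiresIn=expires,
        )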
Having an individual database per user would be an administrative nightmare unless you could generate each database on login (which would mean a separate data store for credentials anyway, so it would be somewhat pointless), and it also would not work in a BLOB storage scenario.
If you can detail a little more precisely what you are trying to achieve and why, there will for sure be plenty of answers.
I have an application in which we want to provide functionality with which users can add/update/delete columns of different tables. My approach is to create a different database for each client so that their table-specific changes remain in their own database.
Since each client will have their own database, I wonder how I can manage authentication and authorization. Do I need to create a separate database for that as well? Will it affect the performance of the application?
Edit: The approach that I am planning to use for authentication and authorization is to add a field called "Account" on the login page. This account name will guide the program to connect to the correct database. And each database will have its own users to authenticate.
The answer to your question is of course (and unfortunately) Yes and No. :)
This is known as multi-tenant data architecture.
Having separate databases can definitely be a great design option; however, so can using one database shared by all of your clients/customers, and you will need to consider many factors before choosing.
Each design has pluses and minuses.
Here are your three essential choices:
1) Each customer shares the same database and database tables.
2) Each customer shares the same database but they get their own schema inside the database so they each get their own set of tables.
3) Each customer gets their own database.
One major benefit (one that I really like) of the separate-database approach is data security. What I mean by this is that every customer gets their own database, and because of this they will edit/update/delete only their own database. There is thus no risk of end users overwriting other users' data, whether due to a programmatic error on your part or a security breach in your application.
When all users are in the same database, you could accidentally pull and expose another customer's data. Or, worse, you could expose a primary key of a record on screen, forget to secure it appropriately, and a power user could easily change this key to one that belongs to another customer, thus exposing another client's data.
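To make that risk concrete, here is a rough sketch of the discipline a shared-table design demands: every query is scoped by the authenticated customer's id, never by a raw record id coming from the client (the table, columns, and db handle are illustrative, not from any particular framework):

    def get_invoice(db, current_customer_id, invoice_id):
        # Scope the lookup to the logged-in customer; a key that belongs to
        # another customer simply finds nothing.
        row = db.execute(
            "SELECT id, customer_id, total FROM invoices "
            "WHERE id = %s AND customer_id = %s",
            (invoice_id, current_customer_id),
        ).fetchone()
        if row is None:
            raise PermissionError("invoice not found")
        return row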
However, let's say that all of your customers are actually subsidiaries of one large company and you need to roll up financials every day/week/month/year, etc.
If this is the case, then having a database for every client could be a reporting nightmare, and having everyone in a single database sharing tables would make life so much easier. When it comes time to report on your daily sales, for instance, it's easier to just sum up a column than to go to 10,000 databases and sum them up. :)
So the answer definitely depends on your application and what it will be doing.
I work on a large enterprise system where we have tens of thousands of clients in the same database, and in order to support this we took great care to secure all of our data very carefully.
I also work on a side project in my spare time which supports a database per customer multi-tenant architecture.
So, consider what your application will do, how you will back up your data, whether you need to roll up data, etc., and this will help you decide.
Here's a great article on MSDN about this:
https://msdn.microsoft.com/en-us/library/aa479086.aspx
Regarding your question about authentication.
Yes, having a separate database for authentication is a great design. When a customer authenticates, you will authenticate them against your authentication database, and they will receive the connection string to their own database as part of this authentication. From that point on, all data comes from that client's database.
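As a rough sketch of that flow (the table, the verify_password check, and the connect-by-connection-string helper are hypothetical placeholders):

    def login(auth_db, account, username, password):
        row = auth_db.execute(
            "SELECT password_hash, connection_string FROM account_users "
            "WHERE account = %s AND username = %s",
            (account, username),
        ).fetchone()
        if row is None or not verify_password(password, row[0]):
            return None   # unknown account/user or bad password
        # Every query for the rest of the session goes to this tenant's database.
        return connect(row[1])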
Hope this was helpful.
Good luck!
I'll soon be developing a big CMS where users can configure their website, managing news, products, services, and much more about their company.
Think of Shopify without the e-commerce part (at least for now).
The RDBMS is MySQL and the user base will be about 150 (maybe bigger).
I'm trying to figure out which one of these two approaches would fit better.
DEDICATED DATABASE FOR EACH USER
PROS:
performance (and possible future sharding?): is querying a smaller database with just your data better than querying a giant database with every user's data?
easy "export my data" for users: I can simply dump their own DB without fetching everything and putting it in some big encoded logical data structure
SINGLE DATABASE FOR EVERY USER
PROS:
less general overhead
statistics: just one DB to query to get and aggregate whatever I need
backup: one dump (not sure about this one because I have no experience with cluster dumping)
Which way would you go? I don't think Shopify created a dedicated database for every registered user... or maybe they did?
I'd like more experienced people than me to help me figure out the best way and all the variables I can't guess right now because of my ignorance.
It sounds like you're developing a software-as-a-service hosted system, rather than a software package to distribute to customers for them to run on their own servers. In that case, in general, you will have an easier time developing and administering your service if you design it for a single database handling multiple users.
You'll be able to add new users to your system with data manipulation language (DML) rather than data definition language (DDL). That is, you'll insert rows for new users rather than create tables. That will make your life a LOT easier when you go live.
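A tiny illustration of that difference, using SQLite only so it runs anywhere (the shared-schema table names are made up for the example):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, "
               "customer_id INTEGER, title TEXT)")

    # Onboarding a new customer is plain data manipulation, not a schema change:
    cur = db.execute("INSERT INTO customers (name) VALUES (?)", ("Acme News",))
    customer_id = cur.lastrowid
    db.execute("INSERT INTO articles (customer_id, title) VALUES (?, ?)",
               (customer_id, "Hello"))
    print(db.execute("SELECT title FROM articles WHERE customer_id = ?",
                     (customer_id,)).fetchall())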
You are absolutely right that stuff like backups and aggregate reporting will be far easier if you have a single shared database.
Don't worry too much about the user data export functions. You'll have to develop software for those functions anyway; it won't be that hard to filter by user when you do the export.
But there's a downside you should consider to the single-database approach: if part of your requirement is to conceal various users' existence or data from each other, you'll have to be very careful to do this in your development. Will your users be competitors with each other? That could be tricky. You'll need to trust your in-house admin and support teams to refrain from disclosing one user's data to another by mistake (or deliberately). With a separate database per user, you'll have a smaller risk in that area.
150 users aren't many. Don't worry about scalability until you have a workload of paying customers. When that happens you can add MySQL server RAM, partitions, solid-state disks, replication, memcached, sharding, and all that other expensive and high-workload stuff. If you add those things before you go live, you'll just take longer and blow more money before you go live. Not good.
So I am going to be building a website using the Django web framework. In this website, I am going to have an advertising component. Whenever an advertisement is clicked, I need to record it. We charge the customer every time a separate user clicks on the advertisement. So my question is: should I record all the click entries in a log file, or should I just create a Django model and record the data in a MySQL database? I know it's easier to create a model, but I am worried about high traffic to the website. Please give me some advice. I appreciate you taking the time to read and address my concerns.
Awesome. Thank you. I will definitely use a database.
Traditionally, this sort of interaction is stored in a DB. You could do it in a log, but I see at least two disadvantages:
log rotation
the fact that after logging you'll still have to process the data in a meaningful manner.
IMO, you could do it in a separate DB (see the multiple databases feature in Django). This way, the load would be somewhat more balanced.
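As a sketch of what that could look like (the "clicks" alias, the model, and the field names are illustrative choices, not a prescribed layout):

    # settings.py: a second database alias alongside the default one
    DATABASES = {
        "default": {"ENGINE": "django.db.backends.mysql", "NAME": "main"},
        "clicks":  {"ENGINE": "django.db.backends.mysql", "NAME": "clicks"},
    }

    # models.py
    from django.db import models

    class AdClick(models.Model):
        ad_id = models.IntegerField()
        country = models.CharField(max_length=2, blank=True)
        user_agent = models.CharField(max_length=255, blank=True)
        clicked_at = models.DateTimeField(auto_now_add=True)

    # views.py: record each click explicitly on the secondary database
    AdClick.objects.using("clicks").create(ad_id=17, country="US")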
You should save all clicks to a DB. A database is created to handle the kind of data you are trying to save.
Additionally, a database will allow you to analyze your data a lot more simply than a flat file. If you want to graph traffic by country, by user agent, or by date range, this will be almost trivial with a database, but parsing gigantic log files would be much more involved.
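For example, assuming a click table like the hypothetical AdClick model sketched above, those reports reduce to simple ORM aggregations:

    import datetime
    from django.db.models import Count

    # Clicks per country, most active first.
    by_country = (AdClick.objects.values("country")
                  .annotate(total=Count("id"))
                  .order_by("-total"))

    # Clicks for a single (arbitrary, illustrative) day.
    daily = AdClick.objects.filter(
        clicked_at__date=datetime.date(2024, 1, 15)).count()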
Also, a database will be easier to extend. Right now you are just tracking clicks, but what happens if you want to start pushing advertisements that require some sort of additional user action or conversion? You will be able to extend this beyond clicks extremely easily in a database.