What are the ramifications of having a ton of views in MySQL?

I have a multi-tenant MySQL database. For most things I use a tenant-id column in tables to discriminate, but for a few purposes it is much better to have a view that is already filtered by tenant ID. So, for example, I might have a view called 'quarterly_sales_view', but for tenant 30 I would have 'quarterly_sales_view_30' and for tenant 51 I would have 'quarterly_sales_view_51'. I create these views dynamically and everything is working great, but we have only a few tenants right now, and I realize this would never work for millions of tenants.
My question is, am I going to run into either performance problems or just hard limits with a few thousand, few hundred, or few dozen custom views?
ADDITIONAL INFO:
I am using a 3rd-party tool (immature) that requires a table name (or view name, since it's read-only) and operates on that. In the context it's working in, I can't let it have access to the entire master view, so I create another view that is simply defined as SELECT * FROM MasterView WHERE TenantId = 30. I recognize this as a workaround for an unfortunate limitation: the tool has to work directly on a table or view. Luckily this tool is open source, so I can tweak it to use a different approach. I just wanted to have an idea of how long I had before the current approach blew up.
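Concretely, the dynamically created views are nothing more than this (MasterView and the quarterly_sales_view names are the ones from my examples above):

CREATE VIEW quarterly_sales_view_30 AS SELECT * FROM MasterView WHERE TenantId = 30;
CREATE VIEW quarterly_sales_view_51 AS SELECT * FROM MasterView WHERE TenantId = 51;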

The primary concern here (IMO) should be less with performance and more with the design. The number of views should not affect performance, but why do you need a view per tenant? Is it not possible to simply filter for a tenant by ID on a more generic view? E.g.:
SELECT * FROM vwMyTenants WHERE TenantId = 30
Whatever the reason, you should reconsider your approach, because it is a sign of a design smell.

Related

Are SQL views the proper solution?

I have a table named 'Customers' which holds all the information about users of different types (users, drivers, admins), and I cannot split this table right now because it is in production and this is not the right time to do so.
So I want to make 3 views: the first with users only, the second with drivers, and the third with admins.
My goal is to use 3 models instead of one in the project I'm working on.
Is this a good solution, and what does it cost in performance?
How big is your 'Customers' table? Judging by the name, it doesn't sound like a heavy one.
How often these views will be queried?
Do you have any indexes or PK constraints on the attribute you're going to use in the WHERE clause of the views?
I cannot split this table right now because it is in production and this is not the right time to do so.
From what you said, it sounds like a temporary solution, so it is probably a good one. Later you can replace the views with three tables and it will not affect the interface.
I suggest that it is improper to give end-users logins directly into the database. Instead, all requests should go through a database-access layer (API) that users must log into. This layer can provide the filtering you require without (perhaps) any impact to the user. The layer would, while constructing the needed SELECT, tack on, for example, AND type = 'admin' to achieve the goal.
For performance, you might also need to have type at the beginning of some of the INDEXes.
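As a rough sketch of what that could look like (the view and index names are just illustrative, and I'm assuming Customers has a type column and an id primary key):

CREATE VIEW customers_users   AS SELECT * FROM Customers WHERE type = 'user';
CREATE VIEW customers_drivers AS SELECT * FROM Customers WHERE type = 'driver';
CREATE VIEW customers_admins  AS SELECT * FROM Customers WHERE type = 'admin';

-- Putting type first lets both these views and the access layer's "AND type = ..." clause use the index
ALTER TABLE Customers ADD INDEX idx_type_id (type, id);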

Best practices for creating a huge SQL table

I want to store data about "users" for each of the 50 states. Each state has about 2GB worth of data. Which option sounds better?
Option 1: Create one table called "users" that will be 100GB large, OR
Option 2: Create 50 separate tables called "users_{state}", each of which will be 2GB large
I'm looking at two things: performance, and style (best practices)
I'm also running RDS on AWS, and I have enough storage space. Any thoughts?
EDIT: From the looks of it, I will not need info from multiple states at the same time (i.e. I won't need to frequently join tables if I go with Option 2). Here is a common use case: the front-end passes a state ID to the back-end, and based on that ID, I need to query data from the DB regarding the specified state and return it to the front-end.
Are the 50 states truly independent in your business logic? Meaning your queries would only need to run over one given state most of the time? If so, splitting by state is probably a good choice. In this case you would only need joining in relatively rarer queries like reporting queries and such.
EDIT: Based on your recent edit, this first scenario (splitting by state) is the route I would recommend. You will get better performance from the table partitioning when no joining is required, and there are multiple other benefits to having the smaller partitioned tables like this.
If your queries would commonly require joining across a majority of the states, then you should definitely not partition like this. You'd be better off with one large table and just build the appropriate indices needed for performance. Most modern enterprise DB solutions are capable of handling the marginal performance impact going from 2GB to 100GB just fine (with proper indexing).
But if your queries on average would need to join results from only a handful of states (say no more than 5-10 or so), the optimal solution is a more complex gray area. You will likely be able to extract better performance from the partitioned tables with joining, but it may make the code and/or queries (and all coming maintenance) noticeably more complex.
Note that my answer assumes the more common access frequency breakdowns: high reads, moderate updates, low creates/deletes. Also, if performance on big data is your primary concern, you may want to check out NoSQL (for example, Amazon AWS DynamoDB), but this would be an invasive and fundamental departure from the relational system. But the NoSQL performance benefits can be absolutely dramatic.
Without knowing more of your model, it will be difficult for anyone to make judgement calls about performance, etc. However, from a data modelling point of view, when thinking about a normalized model I would expect to see a User table with a column (or columns, in the case of a compound key) which hold the foreign key to a State table. If a User could be associated with more than one state, I would expect another table (UserState) to be created instead, and this would hold the foreign keys to both User and State, with any other information about that relationship (for instance, start and end dates for time slicing, showing the timespan during which the User and the State were associated).
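A minimal sketch of that shape, with hypothetical table and column names:

CREATE TABLE State (
  state_id   INT PRIMARY KEY,
  state_name VARCHAR(50) NOT NULL
);

CREATE TABLE User (
  user_id  INT PRIMARY KEY,
  state_id INT NOT NULL,   -- direct foreign key when a user belongs to exactly one state
  FOREIGN KEY (state_id) REFERENCES State (state_id)
);

-- If a user can be associated with more than one state, drop User.state_id
-- and hold the relationship (with its time slice) here instead
CREATE TABLE UserState (
  user_id    INT NOT NULL,
  state_id   INT NOT NULL,
  start_date DATE,
  end_date   DATE,
  PRIMARY KEY (user_id, state_id),
  FOREIGN KEY (user_id)  REFERENCES User (user_id),
  FOREIGN KEY (state_id) REFERENCES State (state_id)
);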
Rather than splitting the data into separate tables, if you find that you have performance issues you could use partitioning to split the User data by state while leaving it within a single table. I don't use MySQL, but a quick Google turned up plenty of reference information on how to implement partitioning within MySQL.
Until you try building and running this, I don't think you know whether you have a performance problem or not. If you do, following the above design you can apply partitioning after the fact and not need to change your front-end queries. Also, this solution won't be problematic if it turns out you do need information for multiple states at the same time, and won't cause you anywhere near as much grief if you need to look at User by some aspect other than State.
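If you do hit that point, a minimal sketch of what MySQL partitioning by state could look like (table and column names are hypothetical; note that MySQL requires the partition key to be part of every unique key, hence the composite primary key):

CREATE TABLE users (
  user_id  INT NOT NULL,
  state_id INT NOT NULL,
  name     VARCHAR(100),
  PRIMARY KEY (user_id, state_id)
)
PARTITION BY LIST (state_id) (
  PARTITION p_state_1 VALUES IN (1),
  PARTITION p_state_2 VALUES IN (2),
  PARTITION p_state_3 VALUES IN (3)
  -- ...one partition per state, 50 in total
);

-- Queries that filter on state_id only touch the matching partition:
SELECT name FROM users WHERE state_id = 2;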

Creating a MySQL Database Schema for large data set

I'm struggling to find the best way to build out a structure that will work for my project. The answer may be simple but I'm struggling due to the massive number of columns or tables, depending on how it's set up.
We have several tools, each that can be run for many customers. Each tool has a series of questions that populate a database of answers. After the tool is run, we populate another series of data that is the output of the tool. We have roughly 10 tools, all populating a spreadsheet of 1500 data points. Here's where I struggle... each tool can be run multiple times, and many tools share the same data point. My next project is to build an application that can begin data entry for a tool, but allow import of data that shares the same datapoint for a tool that has already been run.
A simple example:
Tool 1 - company, numberofusers, numberoflocations, cost
Tool 2 - company, numberofusers, totalstorage, employeepayrate
So if the same company completed tool 1, I need to be able to populate "numberofusers" (or offer to populate) when they complete tool 2 since it already exists.
I think what it boils down to is this: would it be better to create a structure that has 1,500 tables, one for each data element (with additional data around each data element), or to create a single massive table - something like...
customerID(FK), EventID(fk), ToolID(fk), numberofusers, numberoflocations, cost, total storage, employee pay,.....(1500)
If I go this route and have one large table, I'm not sure how that will impact performance. Likewise, I don't know how difficult it would be to maintain 1,500 tables.
Another dimension is that it would be nice to have a description of each field:
numberofusers, title, description, active (bool). I assume this is only possible if each element is in its own table?
Thoughts? Suggestions? Sorry for the lengthy question, new here.
Build a main table with all the common data: company, # users, ... other stuff. Give each row a unique id.
Build a table for each unique tool, with the company id from above and any data unique to that implementation. Give each table a primary (unique) key on 'tool use' and 'company'.
This covers the common data in one place, identifies each 'customer' and provides for multiple uses of a given tool for each customer. Every use and customer is trackable and distinct.
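A rough sketch of that layout, with made-up table and column names (the tool-specific columns come from the question's example):

CREATE TABLE customer (
  customer_id     INT AUTO_INCREMENT PRIMARY KEY,
  company         VARCHAR(200) NOT NULL,
  number_of_users INT
);

-- One table per tool, keyed by (tool run, customer), holding only the data unique to that tool
CREATE TABLE tool1_run (
  run_id              INT NOT NULL,
  customer_id         INT NOT NULL,
  number_of_locations INT,
  cost                DECIMAL(12,2),
  PRIMARY KEY (run_id, customer_id),
  FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);

CREATE TABLE tool2_run (
  run_id            INT NOT NULL,
  customer_id       INT NOT NULL,
  total_storage     BIGINT,
  employee_pay_rate DECIMAL(8,2),
  PRIMARY KEY (run_id, customer_id),
  FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
);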
More about normalization here.
I agree with etherbubunny on normalization, but with larger datasets there are performance considerations that quickly become important. Joins, which are often required in normalized databases to display human-readable information, can be performance killers on even medium-sized tables, which is why a lot of data warehouse models use de-normalized datasets for reporting. This essentially means pre-building the joined reporting data into new tables, with heavy use of indexing, archiving, and partitioning.
In many cases smart use of partitioning on its own can also effectively help reduce the size of the datasets being queried. This usually takes quite a bit of maintenance unless certain parameters remain fixed though.
Ultimately, in your case (and most others) I highly recommend building it in the way you are able to maintain and understand, and then performing regular performance checks via slow query logs, EXPLAIN, and performance monitoring tools like Percona's toolset. This will give you insight into what is really happening and give you some data to come back here or to the MySQL forums with. We can always speculate here, but ultimately the real data and your setup will be the driving force behind what is right for you.
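For example, with standard MySQL settings (the one-second threshold is just illustrative):

-- Log statements slower than one second so you can see what actually hurts
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- Then inspect the plan of anything that shows up in the log, e.g. against the sketch tables above
EXPLAIN SELECT cost FROM tool1_run WHERE customer_id = 42;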

Best practices: are SQL views really worth it? [duplicate]

This question already has answers here:
Why do you create a View in a database?
(25 answers)
Closed 8 years ago.
I am building a new web application with data stored in a database. Like many web applications, I need to expose data from complex SQL queries (queries with conditions across multiple tables). I was wondering if it would be a good idea to build my queries in the database as SQL views instead of building them in the application. What would be the benefit of that? Database performance? Will I spend longer coding? Longer debugging?
Thank you
This cannot really be answered objectively, since it depends on the case.
With views, functions, triggers and stored procedures you can move part of the logic of your application into the database layer.
This can have several benefits:
performance -- you might avoid round trips of data, and certain treatments are handled more efficiently using the whole range of DBMS features.
conciseness -- some treatments of data are expressed more easily with DBMS features.
But also certain drawbacks:
portability -- the more you rely on specific features of the DBMS, the less portable the application becomes.
maintainability -- the logic is scattered across two different technologies, which means more skills are needed for maintenance, and local reasoning is harder.
If you stick to the SQL92 standard it's a good trade-off.
My 2 cents.
I think your question is a little bit confusing in what you are trying to achieve (Gain knowledge regarding SQL Views or how to structure your application).
I believe all database logic should be stored at the database tier, ideally in a stored procedure, function rather in the application logic. The application logic should then connect to the database and retrieve the data from these procedures/functions and then expose this data in your application.
One of the benefits of storing this at the database tier is taking advantage of the execution plans via SQL Server (which you may not get by accessing it directly via the application logic). Another advantage is that it decouples the application, i.e. if the database needs to be changed you don't have to modify the application directly.
For a direct point on Views, the advantages of using them include:
Restrict a user to specific rows in a table.
For example, allow an employee to see only the rows recording his or her work in a labor-tracking table.
Restrict a user to specific columns.
For example, allow employees who do not work in payroll to see the name, office, work phone, and department columns in an employee table, but do not allow them to see any columns with salary information or personal information.
Join columns from multiple tables so that they look like a single table.
http://msdn.microsoft.com/en-us/library/aa214068(v=sql.80).aspx
Personally I prefer views, especially for reports/apps, since if there are any problems in the data you only have to update a single view rather than rebuilding the app or manually editing the queries.
SQL views have many uses. Try first reading about them and then asking a more specific question:
http://odetocode.com/code/299.aspx
http://msdn.microsoft.com/en-us/library/ms187956.aspx
I have seen that views are used a lot to do two things:
Simplify queries: if you have a HUGE select with join after join after join, you can create a view that will have the same performance, but the query will be only a couple of lines.
For security reasons: if you have a table with some information that shouldn't be accessible to all the developers, you can create views and grant privileges to see the views and not the main table, e.g.:
Table 1: Name, Last_name, User_ID, credit_card, social_security. You create a view on that table. The view: Name, Last_name, User_ID.
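A rough sketch of that (the database name, base table name, and account are made up, and the account is assumed to already exist):

CREATE VIEW customers_public AS
  SELECT Name, Last_name, User_ID
  FROM customers;                      -- credit_card and social_security stay hidden

-- Developers get the view, not the base table
GRANT SELECT ON mydb.customers_public TO 'dev'@'%';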
You can run into performance issues and constraints on the types of queries you can run against a view.
Restrictions on what you can do with views.
http://dev.mysql.com/doc/refman/5.6/en/view-restrictions.html
Looks like the big one is that you cannot create an index on a view. This could cause a big performance hit if your final result set is large.
This is also a good forum discussing views: http://forums.mysql.com/read.php?100,22967,22967#msg-22967
In my experience a well indexed table, using the right engine, and properly tuned (for example setting an appropriate key_buffer value) can perform better than a view.
Alternatively you could create a trigger that updates a table based on the results of other tables. http://dev.mysql.com/doc/refman/5.6/en/triggers.html
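For instance, a minimal sketch of a trigger keeping a summary table up to date (all table and column names here are made up):

CREATE TABLE order_totals (
  customer_id INT PRIMARY KEY,
  total       DECIMAL(12,2) NOT NULL DEFAULT 0
);

DELIMITER //
CREATE TRIGGER orders_after_insert
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
  -- Maintain a running per-customer total instead of re-aggregating at read time
  INSERT INTO order_totals (customer_id, total)
  VALUES (NEW.customer_id, NEW.amount)
  ON DUPLICATE KEY UPDATE total = total + NEW.amount;
END//
DELIMITER ;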
The technique you are describing is called denormalization. Cal Henderson, a software engineer from Flickr, openly supports this technique.
In theory a JOIN is one of the most expensive operations, so it is good practice to denormalize, since you are transforming n queries with m JOINs each into 1 query with m JOINs plus n queries that simply select from a view.
That said, the best way is to test it for yourself, because what is incredibly good for Flickr may not be so good for your application.
Moreover, the performance of views may vary a lot from one RDBMS to another. For instance, depending on the RDBMS, views can be updated when the original table is changed, can have indexes, etc.

Private messaging system, large single table versus many small tables

I'm considering a design for a private messaging system and I need some input here, basically I have several questions regarding this. I've read most of the related questions and they've given me some thought already.
All of the basic messaging systems I've thus far looked into use a single table for all of the users' messages. With indexes etc this approach would seem fine.
What I wanted to know is if there would be any benefit to splitting the user messages into separate tables. So when a new user is created, a new table is created (either in the same database or a dedicated message database) which stores all of the messages - sent and received - for that user.
What are the pitfalls/benefits to approaching things that way?
I'm writing in PHP; would the code required be particularly more cumbersome than with the first, large-table option?
Would the eventual result, with a large number of smaller tables, be a more robust, trouble-free design than one large table?
In the event of large numbers of concurrent users, how would the performance of the server compare when dealing with one large table versus many small tables?
Any help with those questions or other input would be appreciated. I'm currently working through a smaller-scale design for my test site before rewriting the PM module and would like to optimise it. My poor human brain handles separate tables far more easily, but the same isn't necessarily true for a computer.
You'll just get headaches from moving to numerous small tables. Databases are made for handling lots of data; let them do their thing.
You'll likely end up using dynamic table names in queries (SELECT * FROM $username WHERE ...), making smart features like stored procedures and possibly parameterized queries a lot trickier if not outright impossible. Usually a really bad idea.
Try rewriting SELECT * FROM messages WHERE authorID = 1 ORDER BY date_posted DESC when "messages" is anywhere between 1 and 30,000 different tables. Keeping your table relations monogamous will keep them bidirectional, and way more useful.
If you think table size will really be a problem, set up an "archived messages" clone table and periodically move old & not-unread messages there where they won't get in the way. Also note how most forum software with private messaging allows for limiting user inbox sizes. There are a few ways to solve the problem while keeping things sane.
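A rough sketch of the single-table layout plus the archive idea (the column names are just illustrative):

CREATE TABLE messages (
  message_id   BIGINT AUTO_INCREMENT PRIMARY KEY,
  author_id    INT NOT NULL,
  recipient_id INT NOT NULL,
  body         TEXT,
  date_posted  DATETIME NOT NULL,
  is_read      TINYINT(1) NOT NULL DEFAULT 0,
  KEY idx_author_date    (author_id, date_posted),
  KEY idx_recipient_date (recipient_id, date_posted)
);

CREATE TABLE archived_messages LIKE messages;

-- Periodically sweep old, already-read messages out of the hot table
INSERT INTO archived_messages
  SELECT * FROM messages
  WHERE is_read = 1 AND date_posted < NOW() - INTERVAL 1 YEAR;
DELETE FROM messages
  WHERE is_read = 1 AND date_posted < NOW() - INTERVAL 1 YEAR;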
I agree with @MarkR here in that initially one table for messages is definitely the way to proceed. As time progresses, should you end up with a very large table, then you can consider how best to partition it. That's counter to the way I'd normally advise design, but we're talking about one table which is fairly simple - not a huge enterprise system.
A very long time ago (pre availability of SQL databases) I built a system that stored private and public messages, and I can confirm that once you split a message base logical entity into more than one, everything¹ becomes a lot more complicated; and I doubt that a file per user is the right approach - the overheads will be massive compared to the benefit.
Avoid auto-increment² - using natural keys is very important for future scalability. Designing well to ensure that you can insert and retrieve without locking will be of more benefit.
¹ Indexing, threading, searching, purging/archiving.
² Natural keys are better if you can find one for your data, as an auto-incremented ID does not describe the data at all, and databases are good at locating rows by primary key, so a natural primary key can improve things. Auto-increment can cause problems with a distributed database; it also leaks data when presented externally (to see the number of registered users, just create a new account and check your user ID). If you can't find a natural key then a UUID (or GUID) may still be a better option, provided that the database has good support for it as a primary key. See "When to use an auto-incremented primary key and when not to".
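For example, if no natural key exists, a UUID primary key could look like this (MySQL 8.0+ for UUID_TO_BIN; the table and columns are made up):

CREATE TABLE account (
  account_id BINARY(16) PRIMARY KEY,        -- UUID stored compactly
  email      VARCHAR(255) NOT NULL UNIQUE   -- a candidate natural key, if you can rely on it
);

INSERT INTO account (account_id, email)
VALUES (UUID_TO_BIN(UUID()), 'someone@example.com');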
Creating one table per user certainly won't scale well when there are a large number of users with a small number of messages. The way MySQL handles table opening/closing, very large numbers of tables (> 10k, say) become quite inefficient, especially at server startup and shutdown, as well as trying to backup non-transactional tables.
However, the way you've worded your question sounds like a case of premature optimisation. Make it work first, then fix performance problems. This is always the right way to do things.
Partitioning / sharding will become necessary once your scale gets high enough. But there are a lot of other things to worry about in the meantime. Sort them out first :)
One table is the right way to go from an RDBMS PoV. I recommend you use it until you know better.
Splitting large amounts of data into smaller sets makes sense if you're trying to avoid locking issues: for example, locking the messages table while doing big selects or updating huge amounts of data at once. In that case long-running queries could block the whole table and everyone would need to wait... You should ask yourself whether this is going to happen in your case. At least to me it looks like a messaging system is not going to have such problems, because all information is pushed into the table or retrieved from it in rather small sets. If this is a user-centric application then, for example, getting all messages for a single user is quite easy and fast to do, and the same goes for creating new messages for one or another particular user... unless you have really huge numbers of users/messages in your system.
Splitting data into multiple tables also has some drawbacks - you will need some kind of management system or logic for how you split everything, and giving each user a separate table could soon grow into hundreds or thousands of tables, which is, in my opinion, not that nice. So you would probably need some other criterion for how to split the data. And if you want the splitting logic to be dynamic and easily adjustable, you would probably also need to store it in the DB somehow. As you can see, complexity grows...
An advantage of such data sharding is scalability - you could easily put different sets of data on different machines once a single machine is not able to handle the whole load.
It depends how your message system works.
Are there concurrency issues?
Does it need to be scalable as the application accommodates more customers?
A single-table design will work perfectly for a small, one-message-at-a-time, single-user system.
However, if you are considering a multi-user, concurrent messaging system, the tables should be split.
The data model for a real-time application is recommended to be "normalized" (splitting tables) due to "locking & latching" and data redundancy issues.
Locking policy varies by database vendor. If you have tables that the application updates and selects from concurrently, "locking" issues arise (page level, row level, or table level, depending on the vendor). Some bad DB & app designs completely lock the table so messages never go through.
The redundancy issue is clearer: if you use only one table, some information (like the user - I guess one user could send multiple messages) is redundant.
Try googling "normalization" and "locking".