I am building a database that contains public, private (limited to internal users), and confidential data (limited to very few). A firm requirement is that the security of the data is managed on the database side, but I am working in an environment where I do not have direct control of the permissions, and requests to change them are time consuming (2-3 days).
So I created a structure that should meet our needs without requiring a lot of permissioning. I created two databases on the same server. One is the internal database, whose tables can only be edited by certain users within certain subnets of our network. The second is the public database, where, using an admin account, I create views limited to the public fields of tables in the internal database to expose public data, and it seems to work well. However, the data should only flow one way: the views should not be able to write to the source tables. And I cannot simply lock down the public database to be SELECT-only, since it is used for various tasks on our public website.
So I need to create views that limit certain scripts' access to certain fields in a table, and I need to make sure those views are not able to insert, update, or delete data in the source table. To create the view I use:
CREATE ALGORITHM = UNDEFINED
VIEW `table_view` AS
SELECT *
FROM `table`
According to the documentation, to prevent updates the view needs to contain aggregate data, subqueries in the WHERE clause, or ALGORITHM = TEMPTABLE. I would go with TEMPTABLE, but the manual is unclear about the performance impact. In one paragraph the manual states:
It prefers MERGE over TEMPTABLE if possible, because MERGE is usually more efficient
Then immediately states:
A reason to choose TEMPTABLE explicitly is that locks can be released on underlying tables after the temporary table has been created and before it is used to finish processing the statement. This might result in quicker lock release than the MERGE algorithm so that other clients that use the view are not blocked as long.
The views are going to be queried on page load to generate the contents of the page. Would MERGE still be more efficient, or would the shorter lock time serve me better? And no, handling this through account permissions is not really an option, due to the inability to GRANT permissions on individual fields in a way that meets the legal confidentiality requirements. Meeting them that way would require fragmenting each table into 2-3 tables, each containing fields with homogeneous confidentiality.
Should the algorithm be UNDEFINED or TEMPTABLE, or is there another setting in the view definition that will lock down the view? And what performance effects will I experience? Also, if I do something to force the view to be non-updatable, such as adding a HAVING clause to make it an aggregate, does that force it to be TEMPTABLE and make the choice of algorithm moot?
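For reference, a minimal sketch of what a deliberately non-updatable view might look like, assuming `public_col_a` and `public_col_b` are the public fields (all names here are hypothetical):

```sql
-- Hypothetical public-only, non-updatable view.
-- ALGORITHM = TEMPTABLE materializes the result into a temporary
-- table, so MySQL refuses INSERT/UPDATE/DELETE through the view.
CREATE ALGORITHM = TEMPTABLE
VIEW `table_view` AS
SELECT `public_col_a`, `public_col_b`  -- only public fields exposed
FROM `internal_db`.`table`;
```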
I'm wondering why you don't just lock down the grants on the account(s) being used so that they don't have DELETE, INSERT, or UPDATE.
MySQL doesn't appear to support roles, where I'd have defined a role without these grants and just associated the account(s) with that role - pity...
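A sketch of that grant-based approach, assuming a hypothetical `web_public` account used by the website:

```sql
-- Grant only read access on the public database; an account with
-- no INSERT/UPDATE/DELETE privileges cannot write through views
-- or tables (account name is hypothetical).
GRANT SELECT ON public_db.* TO 'web_public'@'%';
```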
Related
Is there any workaround to get the latest changes in a MySQL database using ADO.NET?
That is: which table changed, which column, which operation was performed, and the old and new values - for both single-table and multi-table changes. I want to log the changes in my own new table.
There are several ways change tracking can be implemented for MySQL:
triggers: you can add a DB trigger for insert/update/delete that creates an entry in the audit log.
add application logic to track changes. The implementation highly depends on your data layer; if you use an ADO.NET DataAdapter, the RowUpdating event is suitable for this purpose.
You also have the following alternatives for storing the audit log in the MySQL database:
use one table for the audit log, with columns like id, table, operation, new_value (string), old_value (string). This approach has several drawbacks: the table grows very fast (as it holds the history of changes to all tables), it keeps values as strings, it stores excessive data duplicated between old-new pairs, and changeset calculation takes some resources on every insert/update.
use a 'mirror' table (say, with a '_log' suffix) for each table with change tracking enabled. On insert/update you execute an additional insert into the mirror table - as a result you'll have record 'snapshots' on every save, and from these snapshots it is possible to calculate what changed and when. The performance overhead on insert/update is minimal, and you don't need to determine which values actually changed - but the 'mirror' table will hold a lot of redundant data, since a full row copy is saved even if only one column changed.
a hybrid solution, where record 'snapshots' are saved temporarily and then processed in the background to store the differences in an optimal way without affecting app performance.
There is no one best solution for all cases; everything depends on the concrete application requirements: how many inserts/updates are performed, how the audit log is used, etc.
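As an illustration of the trigger-plus-single-audit-table option, a minimal sketch (all table and column names are hypothetical):

```sql
-- Single audit table holding string-valued old/new pairs.
CREATE TABLE audit_log (
  id         INT AUTO_INCREMENT PRIMARY KEY,
  table_name VARCHAR(64) NOT NULL,
  operation  VARCHAR(10) NOT NULL,
  old_value  TEXT,
  new_value  TEXT,
  changed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Trigger recording updates to a hypothetical customers.name column.
CREATE TRIGGER customers_audit_upd
AFTER UPDATE ON customers
FOR EACH ROW
  INSERT INTO audit_log (table_name, operation, old_value, new_value)
  VALUES ('customers', 'UPDATE', OLD.name, NEW.name);
```

A real implementation would need one such trigger per tracked operation (insert, update, delete) per table.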
I am developing a multi-tenant application where for each tenant I create separate set of 50 tables in a single MySQL database in LAMP environment.
In each set average table size is 10 MB with the exception of about 10 tables having size between 50 to 200MB.
MySQL InnoDB creates 2 files(.frm & .ibd) for each table.
For 100 tenants there will be 100 x 50 = 5000 Tables x 2 Files = 10,000 Files
That looks too high to me. Am I doing this the wrong way, or is it common in this kind of scenario? What other options should I consider?
I also read this question, but it was closed by moderators, so it did not attract many thoughts.
Have one database per tenant. That would be 100 directories, each with 2*50 = 100 files. 100 is reasonable; 10,000 items in a directory is dangerously high in most operating systems.
Addenda
If you have 15 tables that are used by all tenants, put them in one extra database. If you call that db Common, then consider these snippets:
USE Tenant; -- Customer starts in his own db
SELECT ... FROM table1 ...; -- Accesses `table1` for that tenant
SELECT a.this, b.blah
FROM table1 AS a -- tenant's table
JOIN Common.foo AS b ON ... -- common table
Note on grants...
GRANT ALL PRIVILEGES ON Tenant_123.* TO 'tenant_123'@'%' IDENTIFIED BY ...;
GRANT SELECT ON Common.* TO 'tenant_123'@'%';
That is, it is probably OK to 'grant' everything on his own database. But he should have very limited access to the Common data.
If, instead, you manage the logins and all accesses go through, say, a PHP API, then you probably have only one mysql 'user' for all accesses. In this case, my notes above about GRANTs are not relevant.
Do not let the Tenants have access to everything. Your entire system will quickly be hacked and possibly destroyed.
Typically, this has little to do with which way is technically better and more to do with how you've sold your customers on how it's to be done - or, in some cases, with having no choice due to the type of data.
For example, does your application have a policy or similar that defines isolation of user generated data? Does your application store HIPAA or PCI type data? If so, you may not even have a choice, and if the customer is expecting that sort of privacy, that normally comes at a premium due to the potential overhead of creating the separation.
If the separation/isolation of data is not required, then adding a field to tables indicating which application owns the data would be most ideal from a performance perspective, and you would just need to update your queries to filter based on that.
Using MySQL or MariaDB I prefer to use a single database for all tenants and restrict access to data by using a different database user per tenant which only has permission to their data.
You can accomplish this by using a tenant_id column that stores the database username of the tenant that owns the data. I use a trigger to populate this column automatically when new rows are added. I then use views to filter the tables where tenant_id = current_database_user, and I restrict the tenant database users to only have access to the views, not the real tables.
I was able to convert a large single-tenant application to a multi-tenant application over a weekend using this technique because I only needed to modify the database and my database connection code.
I've written a blog post fully describing this approach. https://opensource.io/it/mysql-multi-tenant/
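A rough sketch of that pattern, with a hypothetical `orders` table and `tenant_123` user (MySQL's CURRENT_USER() returns 'user@host', so the host part is stripped):

```sql
-- Stamp each new row with the connecting tenant's username.
CREATE TRIGGER orders_set_tenant
BEFORE INSERT ON orders
FOR EACH ROW
  SET NEW.tenant_id = SUBSTRING_INDEX(CURRENT_USER(), '@', 1);

-- Tenants query this view, which only shows their own rows.
CREATE VIEW orders_v AS
  SELECT order_id, item, amount
  FROM orders
  WHERE tenant_id = SUBSTRING_INDEX(CURRENT_USER(), '@', 1);

-- Grant access to the view only, never the base table.
GRANT SELECT, INSERT ON app_db.orders_v TO 'tenant_123'@'%';
```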
I have a mysql query that is taking 8 seconds to execute/fetch (in workbench).
I won't go into the details of why it may be slow (though I suspect the GROUP BY isn't helping).
What I really want to know is how I can cache the results so the query works more quickly, because the tables only change about 5-10 times/hour while users access the site thousands of times/hour.
Is there a way to just have the results regenerated/cached when the db changes so results are not constantly regenerated?
I'm quite new to sql so any basic thought may go a long way.
I am not familiar with such a caching facility in MySQL. There are alternatives.
One mechanism would be to use application level caching. The application would store the previous result and use that if possible. Note this wouldn't really work well for multiple users.
What you might want to do is store the report in a separate table. Then you can run that every five minutes or so. This would be a simple mechanism using a job scheduler to run the job.
A variation on this would be to have a stored procedure that first checks if the data has changed. If the underlying data has changed, then the stored procedure would regenerate the report table. When the stored procedure is done, the report table would be up-to-date.
An alternative would be to use triggers, whenever the underlying data changes. The trigger could run the query, storing the results in a table (as above). Alternatively, the trigger could just update the rows in the report that would have changed (harder, because it involves understanding the business logic behind the report).
All of these require some change to the application. If your application query is stored in a view (something like vw_FetchReport1) then the change is trivial and all on the server side. If the query is embedded in the application, then you need to replace it with something else. I strongly advocate using views (or in other databases user defined functions or stored procedures) for database access. This defines the API for the database application and greatly facilitates changes such as the ones described here.
EDIT: (in response to comment)
More information about scheduling jobs in MySQL is here. I would expect the SQL code to be something like:
truncate table ReportTable;
insert into ReportTable
select * from <ReportQuery>;
(In practice, you would include column lists in the select and insert statements.)
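For example, the job could be scheduled with MySQL's event scheduler (the report query and table names here are hypothetical, and the scheduler must be enabled with SET GLOBAL event_scheduler = ON):

```sql
-- Rebuild the report table every five minutes.
CREATE EVENT refresh_report_table
ON SCHEDULE EVERY 5 MINUTE
DO
  INSERT INTO ReportTable (customer_id, total)
  SELECT customer_id, SUM(amount)   -- stand-in for the slow report query
  FROM orders
  GROUP BY customer_id
  ON DUPLICATE KEY UPDATE total = VALUES(total);
```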
A simple solution that can speed up the response time of long-running queries is to periodically generate summary tables, based on how often the underlying data refreshes or on business needs.
For example, if your business doesn't care about sub-minute "accuracy", you can run the process once a minute and have your user interface query this calculated table instead of summarizing the raw data online.
I have an Access database containing information about people (employee profiles and related information). The front end has a single console-like interface that modifies one type of data at a time (such as academic degrees from one form, contact information from another). It is currently linked to multiple back ends (one for each type of data, and one for the basic profile information). All files are located on a network share and many of the back ends are encrypted.
The reason I have done that is that I understand that MS Access has to pull the entire database file to the local computer in order to make any queries or updates, then put any changed data back on the network share. My theory is that if a person is changing a telephone number or address (contact information), they would only have to pull/modify/replace the contact information database, rather than pull a single large database containing contact information, projects, degrees, awards, etc. just to change one telephone number, thus reducing the potential for locked databases and network traffic when multiple users are accessing data.
Is this a sane conclusion? Do I misunderstand a great deal? Am I missing something else?
I realize there is the consideration of overhead with each file, but I don't know how great the impact is. If I were to consolidate the back ends, there is also the potential benefit of being able to let Access handle referential integrity for cascading deletes, etc., rather than coding for that...
I'd appreciate any thoughts or (reasonably valid) criticisms.
This is a common misunderstanding:
MS Access has to pull the entire database file to the local computer in order to make any queries or updates
Consider this query:
SELECT first_name, last_name
FROM Employees
WHERE EmpID = 27;
If EmpID is indexed, the database engine will read just enough of the index to find which table rows match, then read the matching rows. If the index includes a unique constraint (say EmpID is the primary key), the reading will be faster. The database engine doesn't read the entire table, nor even the entire index.
Without an index on EmpID, the engine would do a full table scan of the Employees table --- meaning it would have to read every row of the table to determine which ones have matching EmpID values.
But either way, the engine doesn't need to read the entire database ... Clients, Inventory, Sales, etc. tables ... it has no reason to read all that data.
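A sketch of the index that makes the first case possible, if EmpID isn't already the primary key:

```sql
-- With this index, the engine can seek directly to EmpID = 27
-- instead of scanning the whole Employees table.
CREATE INDEX idx_employees_empid ON Employees (EmpID);
```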
You're correct that there is overhead for connections to the back-end database files. The engine must manage a lock file for each database. I don't know the magnitude of that impact. If it were me, I would create a new back-end database and import the tables from the others. Then make a copy of the front-end and re-link to the back-end tables. That would give you the opportunity to examine the performance impact directly.
Seems to me relational integrity should be a strong argument for consolidating the tables into a single back-end.
Regarding locking, you shouldn't ever need to lock the entire back-end database for routine DML (INSERT, UPDATE, DELETE) operations. The database engine supports more granular locking. There is also pessimistic vs. optimistic locking --- whether the lock occurs as soon as you begin editing a row or is deferred until you save the changed row.
Actually "slow network" could be the biggest concern if slow means a wireless network. Access is only safe on a hard-wired LAN.
Edit: Access is not appropriate for a WAN network environment. See this page by Albert D. Kallal.
MS Access is not a good fit over a local area network, and certainly not over a wide area network, which will have lower speed. The solution is to use a client-server database such as MS SQL Server or MySQL. MS SQL Server is much better than MySQL, but it is not free; consider MS SQL Server for large-scale projects. Again, MS Access is really only good on a single computer, not across a network.
I'm being given a data source weekly that I'm going to parse and put into a database. The data will not change much from week to week, but I should be updating the database on a regular basis. Besides this weekly update, the data is static.
For now rebuilding the entire database isn't a problem, but eventually this database will be live and people could be querying the database while I'm rebuilding it. The amount of data isn't small (couple hundred megabytes), so it won't load that instantaneously, and personally I want a bit more of a foolproof system than "I hope no one queries while the database is in disarray."
I've thought of a few different ways of solving this problem and was wondering what the best method would be. Here are my ideas so far:
Instead of replacing entire tables, query for the difference between my current database and what I want to place in the database. This seems like it could be an unnecessary amount of work, though.
Creating dummy data tables, then doing a table rename (or having the server code point towards the new data tables).
Just telling users that the site is going through maintenance and put the system offline for a few minutes. (This is not preferable for obvious reasons, but if it's far and away the best answer I'm willing to accept that.)
Thoughts?
I can't speak for MySQL, but PostgreSQL has transactional DDL. This is a wonderful feature, and means that your second option, loading new data into a dummy table and then executing a table rename, should work great. If you want to replace the table foo with foo_new, you only have to load the new data into foo_new and run a script to do the rename. This script should execute in its own transaction, so if something about the rename goes bad, both foo and foo_new will be left untouched when it rolls back.
The main problem with that approach is that it can get a little messy to handle foreign keys from other tables that key on foo. But at least you're guaranteed that your data will remain consistent.
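In PostgreSQL, the whole swap can be sketched like this, using the foo/foo_new names from the example above:

```sql
-- Atomic swap: either both renames happen or neither does.
BEGIN;
ALTER TABLE foo RENAME TO foo_old;
ALTER TABLE foo_new RENAME TO foo;
COMMIT;

-- Once you're satisfied nothing still references the old data:
DROP TABLE foo_old;
```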
A better approach in the long term, I think, is just to perform the updates on the data directly (your first option). Once again, you can stick all the updating in a single transaction, so you're guaranteed all-or-nothing semantics. Even better would be online updates, just updating the data directly as new information becomes available. This may not be an option for you if you need the results of someone else's batch job, but if you can do it, it's the best option.
BEGIN;
DELETE FROM my_table;                -- my_table is a placeholder name
INSERT INTO my_table SELECT ... ;    -- load the new data
COMMIT;
Users will see the changeover instantly when you hit commit. Any queries started before the commit will run on the old data; anything afterwards will run on the new data. The database will actually clear out the old rows once the last user is done with them. Because everything is "static" (you're the only one who ever changes it, and only once a week), you don't have to worry about lock issues or timeouts. For MySQL, this depends on InnoDB. PostgreSQL does it, and SQL Server calls it "snapshot isolation"; I can't remember the details off the top of my head since I rarely use it.
If you Google "transaction isolation" + the name of whatever database you're using, you'll find appropriate information.
We solved this problem by using PostgreSQL's table inheritance/constraints mechanism.
You create a trigger that auto-creates sub-tables partitioned based on a date field.
This article was the source I used.
Which database server are you using? SQL 2005 and above provides a locking method called "Snapshot". It allows you to open a transaction, do all of your updates, and then commit, all while users of the database continue to view the pre-transaction data. Normally, your transaction would lock your tables and block their queries, but snapshot locking would be perfect in your case.
More info here: http://blogs.msdn.com/craigfr/archive/2007/05/16/serializable-vs-snapshot-isolation-level.aspx
But it requires SQL Server, so if you're using something else....
Several database systems (since you didn't specify yours, I'll keep this general) do offer the SQL:2003 Standard statement called MERGE which will basically allow you to
insert new rows into a target table from a source which don't exist there yet
update existing rows in the target table based on new values from the source
optionally even delete rows from the target that don't show up in the import table anymore
SQL Server 2008 is the first Microsoft offering to have this statement - check the SQL Server documentation for details.
Other database systems will probably have similar implementations - it's a SQL:2003 Standard statement, after all.
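A sketch of the statement in SQL Server syntax (target/source table and column names are hypothetical):

```sql
-- Upsert plus delete in one statement.
MERGE INTO target AS t
USING source AS s
  ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.value = s.value              -- update existing rows
WHEN NOT MATCHED BY TARGET THEN
  INSERT (id, value) VALUES (s.id, s.value) -- insert new rows
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;                                   -- remove rows gone from the import
```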
Use different table names (mytable_[yyyy]_[wk]) and a view to provide a constant name (mytable). Once a new table is completely imported, update your view so that it uses that table.
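In MySQL that swap is a single statement (the weekly table name here is illustrative):

```sql
-- Repoint the stable name at this week's freshly loaded table.
CREATE OR REPLACE VIEW mytable AS
SELECT * FROM mytable_2024_07;
```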