Long story short: I am dealing with a largish system where basic user details (userid (indexed), username, password, parent user, status) are stored in one database and extended user details (same userid (indexed), full name, address, etc.) are stored in another database on another server.
I need to do a query that selects all users owned by a particular user (via the parent user field in the basic user details database), sorted by their full name (from the extended user details table), and returns just 25 at a time (there are thousands, maybe tens of thousands, for any one user).
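To illustrate (table and column names here are simplified stand-ins), the query I ultimately need would look something like this if both tables lived on one server:

SELECT b.userid, e.full_name, e.address
FROM basic_user_details AS b
JOIN extended_user_details AS e ON e.userid = b.userid
WHERE b.parent_user = 12345      -- 12345 stands in for the owning user's id
ORDER BY e.full_name
LIMIT 0, 25;                     -- page through 25 rows at a time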
As far as I can work out there are three possible solutions:
1. No JOIN - get all the user IDs in one query and run the second query based on those IDs. This would be fine, except the number of user IDs could get so high that it would exceed the maximum query length, or be horribly inefficient.
2. Replicate the database table with the basic user details onto the server with the extended details so I can do a JOIN.
3. Use a federated storage engine table to achieve the same results as #2.
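If I'm reading the manual right, option 3 would mean defining something like this on the extended-details server (the connection string, column names and types below are just placeholders):

CREATE TABLE basic_user_details (
  userid      INT UNSIGNED NOT NULL,
  username    VARCHAR(64) NOT NULL,
  password    VARCHAR(64) NOT NULL,
  parent_user INT UNSIGNED NOT NULL,
  status      TINYINT NOT NULL,
  PRIMARY KEY (userid),
  KEY (parent_user)
) ENGINE=FEDERATED
CONNECTION='mysql://fed_user:fed_pass@basic-server:3306/basic_db/basic_user_details';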
It seems that 3 is the best option, but I have been able to find little information about its performance, and I also found one comment warning to be careful about using it on production databases.
I would appreciate any suggestions on what would be the best implementation.
Thanks!
FEDERATED tables are a nice feature, but they do not support indexes in the usual sense, which would slow down your application dramatically.
If (!) you only read from the users database on the remote server, replication would be more effective and also faster.
Talking in terms of performance or limitations, the FEDERATED engine has a lot of limitations: it doesn't support transactions, performance on a FEDERATED table when performing bulk inserts is slower than with other table types, and so on.
Replication and the FEDERATED engine are not meant to do the same things. First of all, did you try both?
I have some questions before implementing the following scenario:
I have Database A (it contains multiple tables with lots of data, and is queried by multiple clients).
This database contains a users table on which I need to create some triggers, but the database is managed by a partner and we don't have permission to create triggers.
Database B is managed by me and is much lighter; its queries come from only one source. I need access to the users table data from Database A so I can create triggers and take action on every update, insert or delete in the users table of Database A.
My main concern is: how will this federated table impact performance on Database A? Database B is not the problem.
Both databases stay in the same geographic location, just different servers.
My goal is to make it possible to take action on every transaction against the users table in Database A.
Queries that read federated tables definitely have performance issues.
https://dev.mysql.com/doc/refman/8.0/en/federated-usagenotes.html says:
A FEDERATED table does not support indexes in the usual sense; because access to the table data is handled remotely, it is actually the remote table that makes use of indexes. This means that, for a query that cannot use any indexes and so requires a full table scan, the server fetches all rows from the remote table and filters them locally. This occurs regardless of any WHERE or LIMIT used with this SELECT statement; these clauses are applied locally to the returned rows.
Queries that fail to use indexes can thus cause poor performance and network overload. In addition, since returned rows must be stored in memory, such a query can also lead to the local server swapping, or even hanging.
(emphasis mine)
The reason the federated engine was created was to support applications that need to write to tables at a rate greater than a single server can support. If you are inserting to a table and overwhelming the I/O of that server, you can use a federated table so you can write to a table on a different server.
Reading from federated tables is likely to be worse than reading local tables, and cannot be optimized with indexes.
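For example (users_fed and its columns are made-up names for a FEDERATED table on server B pointing at A's users table), a query whose WHERE clause can't use an index will ship every row of the remote table across the network before the filtering and LIMIT are applied locally:

SELECT userid, email
FROM users_fed
WHERE DATE(last_login) = '2023-01-01'   -- wrapping the column in a function defeats any index
LIMIT 10;                               -- applied locally, after all rows have been fetched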
If you need good performance, you should use replication or a CDC tool, to maintain a real table on server B that you can query as a local table, not a federated table.
Another solution would be to cache the user's table in the client application, so you don't have to read it on every query.
I am creating a custom analytics system and currently in the database designing process. I'm planning to use MariaDB with the InnoDB engine to be able to handle big loads.
The data I'm expecting could be around 500k clicks/day. I will need to insert these rows into the database, which means that I'll have around 5.8 inserts/sec on average. However, at the same time, I want to record if someone visited a page associated with that click. (basically to record funnels)
So what I'm planning to do is to add additional columns, look up the specific row by its ID, and then update that column with the exact time of the visit.
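To make that concrete, here is a rough sketch of what I have in mind (table, column names and types are only illustrative):

CREATE TABLE clicks (
  id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  campaign_id INT UNSIGNED NOT NULL,
  clicked_at  DATETIME NOT NULL,
  visited_at  DATETIME NULL,            -- filled in later if the visitor reaches the page
  KEY (campaign_id, clicked_at)
) ENGINE=InnoDB;

-- Record the click:
INSERT INTO clicks (campaign_id, clicked_at) VALUES (42, NOW());

-- Later, record the visit by primary key:
UPDATE clicks SET visited_at = NOW() WHERE id = 123456;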
My first question: is this generally a recommended approach to design the database like that? If not, how else is it worth to design the database?
My only concern is that while rows are being updated the table will be locked and inserts won't go through, slowing down the user experience.
My second question: is this something I should worry about? Does the table get locked while updating, and does that slow down inserts and hurt performance?
InnoDB doesn't lock the table for inserts while you're performing an update. Your users won't experience any weird hanging.
It's an MVCC compliant engine, designed to handle concurrent access to underlying tables.
You can control the engine's behavior by choosing an appropriate isolation level, however the default (REPEATABLE READ) is excellent and does the job more than well.
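If you ever do need to change it, it's a one-liner per session; this is shown only as an illustration, the default REPEATABLE READ usually needs no tuning:

SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT @@transaction_isolation;   -- @@tx_isolation on older MySQL/MariaDB versions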
If a table is being modified by multiple users (not users that connect to your site, but connections established to MySQL via a scripting language or some other service) and there are many inserts/updates/deletes, MySQL can throw an error saying a deadlock occurred.
A deadlock happens when two or more transactions each hold a lock that the other one needs, so neither can proceed. InnoDB detects this, rolls one of the transactions back and reports an error; it's an indication that you should retry that transaction.
I'm suggesting that you take care of such scenarios (catching the deadlock error and retrying) in the language of your choice when dealing with a MySQL server that's under heavier I/O.
~6 inserts a second isn't a lot; just make sure you're allowing MySQL to access sufficient system resources. For InnoDB, check the value of innodb_buffer_pool_size, or read up a bit on what it is and how to use it to make your database run fast.
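For example, you can check the current value like this, and on MySQL 5.7+ / MariaDB 10.2+ even resize it online (the 4 GB figure below is only an example; size it to your own RAM):

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';                  -- value is in bytes
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;    -- e.g. 4 GB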
Good luck!
At a mere ~5.8 per second, there won't be much of a problem.
I do, however, suggest vertical partitioning for "Likes", "Upvotes", "Clicks", and similar things. These tend to have a lot of UPDATEs of random single rows, and may interfere with other activity.
That is, have a separate table with (perhaps) just 2 columns:
The id of the item being Liked/Clicked/etc.
A counter.
It is simple enough (and fast enough) to JOIN via that id when you want to display info including the counter.
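A minimal sketch of that layout (the items table and all names here are only illustrative):

CREATE TABLE item_clicks (
  item_id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  clicks  INT UNSIGNED NOT NULL DEFAULT 0
) ENGINE=InnoDB;

-- Bump the counter for one item:
INSERT INTO item_clicks (item_id, clicks) VALUES (123, 1)
ON DUPLICATE KEY UPDATE clicks = clicks + 1;

-- Pull the counter in with the rest of the item data:
SELECT i.*, c.clicks
FROM items AS i
LEFT JOIN item_clicks AS c ON c.item_id = i.id
WHERE i.id = 123;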
As already pointed out, the row is locked, not the table.
I am developing a multi-tenant application where, for each tenant, I create a separate set of 50 tables in a single MySQL database in a LAMP environment.
In each set the average table size is 10 MB, with the exception of about 10 tables whose size is between 50 and 200 MB.
MySQL InnoDB creates 2 files (.frm & .ibd) for each table.
For 100 tenants there will be 100 x 50 = 5000 Tables x 2 Files = 10,000 Files
That looks too high to me. Am I doing this the wrong way, or is it common in this kind of scenario? What other options should I consider?
I also read this question, but it was closed by moderators so it did not attract many answers.
Have one database per tenant. That would be 100 directories, each with 2*50 = 100 files. 100 is reasonable; 10,000 items in a directory is dangerously high in most operating systems.
Addenda
If you have 15 tables that are used by all tenants, put them in one extra database. If you call that db Common, then consider these snippets:
USE Tenant; -- Customer starts in his own db
SELECT ... FROM table1 ...; -- Accesses `table1` for that tenant
SELECT a.this, b.blah
FROM table1 AS a -- tenant's table
JOIN Common.foo AS b ON ... -- common table
Note on grants...
GRANT ALL PRIVILEGES ON Tenant_123.* TO 'tenant_123'@'%' IDENTIFIED BY ...;
GRANT SELECT ON Common.* TO 'tenant_123'@'%';
That is, it is probably OK to 'grant' everything to his own database. But he should have very limited access to the Common data.
If, instead, you manage the logins and all accesses go through, say, a PHP API, then you probably have only one mysql 'user' for all accesses. In this case, my notes above about GRANTs are not relevant.
Do not let the Tenants have access to everything. Your entire system will quickly be hacked and possibly destroyed.
Typically, this has less to do with which way is technically better and more to do with what you have sold your customers on, or in some cases you have no choice at all due to the type of data.
For example, does your application have a policy or similar that defines isolation of user generated data? Does your application store HIPAA or PCI type data? If so, you may not even have a choice, and if the customer is expecting that sort of privacy, that normally comes at a premium due to the potential overhead of creating the separation.
If the separation/isolation of data is not required, then adding a field to the tables indicating which application (tenant) owns the data would be the better option from a performance perspective, and you would just need to update your queries to filter based on that.
Using MySQL or MariaDB I prefer to use a single database for all tenants and restrict access to data by using a different database user per tenant which only has permission to their data.
You can accomplish this by using a tenant_id column that stores the database username of the tenant that owns the data. I use a trigger to populate this column automatically when new rows are added. I then use views to filter the tables where the tenant_id = the current database user, and I restrict the tenant database users to only have access to the views, not the real tables.
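Here is a rough sketch of the idea (the table, view and account names are made up, and details such as WITH CHECK OPTION are left out):

CREATE TABLE customers (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  tenant_id VARCHAR(80) NOT NULL DEFAULT '',   -- database user name of the owning tenant
  name      VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

-- Fill tenant_id automatically from the connected database user:
CREATE TRIGGER customers_tenant_bi BEFORE INSERT ON customers
FOR EACH ROW SET NEW.tenant_id = SUBSTRING_INDEX(SESSION_USER(), '@', 1);

-- Each tenant sees only their own rows through the view
-- (the view runs with the definer's rights, so tenants need no access to the base table):
CREATE VIEW customers_v AS
SELECT id, name
FROM customers
WHERE tenant_id = SUBSTRING_INDEX(SESSION_USER(), '@', 1);

-- Grant access to the view only, never to the base table:
GRANT SELECT, INSERT, UPDATE, DELETE ON app_db.customers_v TO 'tenant_123'@'%';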
I was able to convert a large single-tenant application to a multi-tenant application over a weekend using this technique because I only needed to modify the database and my database connection code.
I've written a blog post fully describing this approach. https://opensource.io/it/mysql-multi-tenant/
I have an Access database containing information about people (employee profiles and related information). The front end has a single console-like interface that modifies one type of data at a time (such as academic degrees from one form, contact information from another). It is currently linked to multiple back ends (one for each type of data, and one for the basic profile information). All files are located on a network share and many of the back ends are encrypted.
The reason I have done that is that I understand that MS Access has to pull the entire database file to the local computer in order to make any queries or updates, then put any changed data back on the network share. My theory is that if a person is changing a telephone number or address (contact information), they would only have to pull/modify/replace the contact information database, rather than pull a single large database containing contact information, projects, degrees, awards, etc. just to change one telephone number, thus reducing the potential for locked databases and network traffic when multiple users are accessing data.
Is this a sane conclusion? Do I misunderstand a great deal? Am I missing something else?
I realize there is the consideration of overhead with each file, but I don't know how great the impact is. If I were to consolidate the back ends, there is also the potential benefit of being able to let Access handle referential integrity for cascading deletes, etc., rather than coding for that...
I'd appreciate any thoughts or (reasonably valid) criticisms.
This is a common misunderstanding:
MS Access has to pull the entire database file to the local computer in order to make any queries or updates
Consider this query:
SELECT first_name, last_name
FROM Employees
WHERE EmpID = 27;
If EmpID is indexed, the database engine will read just enough of the index to find which table rows match, then read the matching rows. If the index includes a unique constraint (say EmpID is the primary key), the reading will be faster. The database engine doesn't read the entire table, nor even the entire index.
Without an index on EmpID, the engine would do a full table scan of the Employees table, meaning it would have to read every row from the table to determine which ones contain matching EmpID values.
But either way, the engine doesn't need to read the entire database ... Clients, Inventory, Sales, etc. tables ... it has no reason to read all that data.
You're correct that there is overhead for connections to the back-end database files. The engine must manage a lock file for each database. I don't know the magnitude of that impact. If it were me, I would create a new back-end database and import the tables from the others. Then make a copy of the front-end and re-link to the back-end tables. That would give you the opportunity to examine the performance impact directly.
Seems to me relational integrity should be a strong argument for consolidating the tables into a single back-end.
Regarding locking, you shouldn't ever need to lock the entire back-end database for routine DML (INSERT, UPDATE, DELETE) operations. The database engine supports more granular locking. There is also pessimistic vs. optimistic locking, i.e. whether the lock occurs as soon as you begin editing a row or is deferred until you save the changed row.
Actually "slow network" could be the biggest concern if slow means a wireless network. Access is only safe on a hard-wired LAN.
Edit: Access is not appropriate for a WAN network environment. See this page by Albert D. Kallal.
MS Access is not well suited to use over a local area network, let alone a wide area network, which will certainly have lower speeds. The solution is to use a client-server database such as MS SQL Server or MySQL. MS SQL Server is much better than MySQL but it is not free; consider MS SQL Server for large-scale projects. Again, MS Access is really only good for a single computer, not for a computer network.
What I have:
A MySQL database running on Ubuntu that maintains a large table of articles (similar to WordPress).
I need to relate a given article to another set of data. This set of data will be fairly large.
There may be various sets of data that will be related.
The query:
Is it better to keep these various large sets of data within the same database as the articles, which will have a lot of traffic on it?
or
Is it better to create different databases (on the same server) that relate to the main articles database by a primary key?
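For what it's worth, the first option would amount to something like this, with everything in one database and the related rows keyed by the article's primary key (the table names are only examples):

CREATE TABLE articles (
  id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

CREATE TABLE article_extra_data (
  id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  article_id INT UNSIGNED NOT NULL,
  payload    TEXT,
  KEY (article_id),
  CONSTRAINT fk_extra_article FOREIGN KEY (article_id) REFERENCES articles (id)
) ENGINE=InnoDB;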
Put them all in the same DB initially, until you find that there is a performance issue. Much easier than prematurely optimising.
Modern RDBMS are very good at optimising data access.
If you need to connect frequently and read records from both, you should put them in the same database. The server then won't have to run permission checks twice, once for each of your databases.
If you have serious traffic, you should consider using persistent connections for that query.
If you don't need to read them together frequently, consider putting them on different machines, so that the high traffic on the bigger database won't cause slowdowns on the other.
Different databases on the same server gives you all the problems of a distributed architecture without any of the benefits of scaling out. One database per server is the way to go.
When you say 'same database' and 'different databases related' don't you mean 'same table' vs 'different tables'?
If that's the question, I'd say:
One table for articles.
If these 'other sets of data' are all of the same structure, put them all in the same table; if not, one table per kind of data.
Everything in the same database.
If you grow big enough to make database size a performance issue (after many millions of records and lots of queries per second), consider table partitioning or maybe replacing the biggest table with a key/value store (CouchDB, MongoDB, Redis, Tokyo Cabinet, etc.), which can be a little faster than MySQL and a lot easier to distribute for performance.