How to query data across many (2000+) databases (schemas) in mysql? - mysql

I have around 2000 equal database schemas I want to query across. I found multiple ways online to do so, but it always requires writing a procedure that combines them all with unions into a view. Isn't there a simpler way, maybe a tool that can do this for me? So is there a way to query always the most up-to-date data and even execute update statements across all schemas, instead always having to generate a view for queries and writing scripts for running updates on all schemas?
I'm using mysql8.

Related

Laravel - Possible solutions for improving query read performance for large data (8m+ rows)

I have a Laravel web app that's using a VueJS front-end and MySQL as the RDBMS. I currently have a table that is 23.8gb and contains 8m+ rows and it's growing every second.
When querying this data, I'm joining it to 4 other tables so the entire dataset is humongous.
I'm currently only pulling and displaying 1000 rows as I don't need anymore than that. VueJS is showing the data in a table and there are 13 filter options for the user to select from to filter the data ranging from date, name, status, etc.
Using Eloquent and having MySQL indexes in place, I've managed to get the query time down to a respectable time but I need this section of the app to be as responsive as possible.
Some of the where clauses that kick off from the filters are taking 13 seconds to execute which I feel is too long.
I've been doing some reading and thinking maybe MongoDB or Redis may be an option but have very little experience with either.
For this particular scenario, what do you think would be the best option to maximise read performance?
If I were to use MongoDB, I wouldn't migrate the current data... I'd basically have a second database that contains all the new data. This app hasn't gone into production yet and in most use cases, only the last 30 days worth of data will be required but the option to query old data is still required hence keeping both MySQL and MongoDB.
Any feedback will be appreciated.
Try to use elasticsearch. It will speed up the read process.
Try converting the query into a stored procedure. You can execute the stored procedure like this..
DB::select('exec stored_procedure("Param1", "param2",..)');
or
DB::select('exec stored_procedure(?,?,..)',array($Param1,$param2));
Try this for without parameters
DB::select('EXEC stored_procedure')
Try using EXPLAIN to optimise the performance.
How to optimise MySQL queries based on EXPLAIN plan

How to cache infrequently changing mysql query?

I have a mysql query that is taking 8 seconds to execute/fetch (in workbench).
I won't go into the details of why it may be slow (I think GROUPBY isnt helping though).
What I really want to know is, how I can basically cache it to work more quickly because the tables only change like 5-10 times/hr, while users access the site 1000s times/hour.
Is there a way to just have the results regenerated/cached when the db changes so results are not constantly regenerated?
I'm quite new to sql so any basic thought may go a long way.
I am not familiar with such a caching facility in MySQL. There are alternatives.
One mechanism would be to use application level caching. The application would store the previous result and use that if possible. Note this wouldn't really work well for multiple users.
What you might want to do is store the report in a separate table. Then you can run that every five minutes or so. This would be a simple mechanism using a job scheduler to run the job.
A variation on this would be to have a stored procedure that first checks if the data has changed. If the underlying data has changed, then the stored procedure would regenerate the report table. When the stored procedure is done, the report table would be up-to-date.
An alternative would be to use triggers, whenever the underlying data changes. The trigger could run the query, storing the results in a table (as above). Alternatively, the trigger could just update the rows in the report that would have changed (harder, because it involves understanding the business logic behind the report).
All of these require some change to the application. If your application query is stored in a view (something like vw_FetchReport1) then the change is trivial and all on the server side. If the query is embedded in the application, then you need to replace it with something else. I strongly advocate using views (or in other databases user defined functions or stored procedures) for database access. This defines the API for the database application and greatly facilitates changes such as the ones described here.
EDIT: (in response to comment)
More information about scheduling jobs in MySQL is here. I would expect the SQL code to be something like:
truncate table ReportTable;
insert into ReportTable
select * from <ReportQuery>;
(In practice, you would include column lists in the select and insert statements.)
A simple solution that can be used to speed-up the response time for long running queries is to periodically generate summarized tables, based on underlying data refreshing or business needs.
For example, if your business don't care about sub-minute "accuracy", you can run the process once each minute and make your user interface to query this calculated table, instead of summarizing raw data online.

Run analytics on huge MySQL database

I have a MySQL database with a few (five to be precise) huge tables. It is essentially a star topology based data warehouse. The table sizes range from 700GB (fact table) to 1GB and whole database goes upto 1 terabyte. Now I have been given a task of running analytics on these tables which might even include joins.
A simple analytical query on this database can be "find number of smokers per state and display it in descending order" this requirement could be converted in a simple query like
select state, count(smokingStatus) as smokers
from abc
having smokingstatus='current smoker'
group by state....
This query (and many other of same nature) takes a lot of time to execute on this database, time taken is in order of tens of hours.
This database is also heavily used for insertion which means every few minutes there are thousands of rows getting added.
In such a scenario how can I tackle this querying problem?
I have looked in Cassandra which seemed easy to implement but I am not sure if it is going to be as easy for running analytical queries on the database especially when I have to use "where clause and group by construct"
Have Also looked into Hadoop but I am not sure how can I implement RDBMS type queries. I am not too sure if I want to right away invest in getting at least three machines for name-node, zookeeper and data-nodes!! Above all our company prefers windows based solutions.
I have also thought of pre-computing all the data in a simpler summary tables but that limits my ability to run different kinds of queries.
Are there any other ideas which I can implement?
EDIT
Following is the mysql environment setup
1) master-slave setup
2) master for inserts/updates
3) slave for reads and running stored procedures
4) all tables are innodb with files per table
5) indexes on string as well as int columns.
Pre-calculating values is an option but since requirements for this kind of ad-hoc aggregated values keeps changing.
Looking at this from the position of attempting to make MySQL work better rather than positing an entirely new architectural system:
Firstly, verify what's really happening. EXPLAIN the queries which are causing issues, rather than guessing what's going on.
Having said that, I'm going to guess as to what's going on since I don't have the query plans. I'm guessing that (a) your indexes aren't being used correctly and you're getting a bunch of avoidable table scans, (b) your DB servers are tuned for OLTP, not analytical queries, (c) writing data while reading is causing things to slow down greatly, (d) working with strings just sucks and (e) you've got some inefficient queries with horrible joins (everyone has some of these).
To improve things, I'd investigate the following (in roughly this order):
Check the query plans, make sure the existing indexes are being used correctly - look at the table scans, make sure the queries actually make sense.
Move the analytical queries off the OLTP system - the tunings required for fast inserts and short queries are very different to those for the sorts of queries which potentially read most of a large table. This might mean having another analytic-only slave, with a different config (and possibly table types - I'm not sure what the state of the art with MySQL is right now).
Move the strings out of the fact table - rather than having the smoking status column with string values of (say) 'current smoker', 'recently quit', 'quit 1+ years', 'never smoked', push these values out to another table, and have the integer keys in the fact table (this will help the sizes of the indexes too).
Stop the tables from being updated while the queries are running - if the indexes are moving while the query is running I can't see good things happening. It's (luckily) been a long time since I cared about MySQL replication, so I can't remember if you can batch up the writes to the analytical query slave without too much drama.
If you get to this point without solving the performance issues, then it's time to think about moving off MySQL. I'd look at Infobright first - it's open source/$$ & based on MySQL, so it's probably the easiest to put into your existing system (make sure the data is going to the InfoBright DB, then point your analytical queries to the Infobright server, keep the rest of the system as it is, job done), or if Vertica ever releases its Community Edition. Hadoop+Hive has a lot of moving parts - its pretty cool (and great on the resume), but if it's only going to be used for the analytic portion of you system it may take more care & feeding than other options.
1 TB is not that big. MySQL should be able to handle that. At least simple queries like that shouldn't take hours! Can't be very helpful without knowing the larger context, but I can suggest some questions that you might ask yourself, mostly related to how you use your data:
Is there a way you can separate the reads and writes? How many read so you do per day and how many writes? Can you live with some lag, e.g write to a new table each day and merge it to the existing table at the end of the day?
What are most of your queries like? Are they mostly aggregation queries? Can you do some partial aggregation beforehand? Can you pre-calculate number of new smokers every day?
Can you use hadoop for the aggregation process above? Hadoop is kinda good at that stuff. Basically use hadoop just for daily or batch processing and store the results into the DB.
On the DB side, are you using InnoDB or MyISAM? Are the indices on String columns? Can you make it ints etc.?
Hope that helps
MySQL is have a serious limitation what prevent him to be able to perform good on such scenarious. The problem is a lack of parralel query capability - it can not utilize multiple CPUs in the single query.
Hadoop has an RDMBS like addition called Hive. It is application capable of translate your queries in Hive QL (sql like engine) into the MapReduce jobs. Since it is actually small adition on top of Hadoop it inherits its linear scalability
I would suggest to deploy hive alongside MySQL, replicate daily data to there and run heavy aggregations agains it. It will offload serious part of the load fro MySQL. You still need it for the short interactive queries, usually backed by indexes. You need them since Hive is iherently not-interactive - each query will take at least a few dozens of seconds.
Cassandra is built for the Key-Value type of access and does not have scalable GroupBy capability build-in. There is DataStax's Brisk which integrate Cassandra with Hive/MapReduce but it might be not trivial to map your schema into Cassandra and you still not get flexibility and indexing capabiilties of the RDBMS.
As a bottom line - Hive alongside MySQL should be good solution.

de-normalization, weighted aggregates for updated tables in MySQL

this time I got a more general question. Should I use multiple views rather than stored procedures for weighted aggregation of data, if the original data is updated periodically?
Basically I have a local MySQL database that is updated periodically by importing the same kind of data (tables) from a bigger transaction database.
The local database is used for statistical analysis. Thus I de-normalize (basically aggregate) the data locally for use with statistical software packages. So far I used stored procedures because I felt it was easier to handle (and arranged more clearly) when weighting schemes (basically other tables containing weights that are multiplied with variables) came into play.
Though the disadvantage of stored procedures is that I have the run all of 'em again when the tables are populated with new data. Obviously I am not a DBA... So don´t shy away from stating the obvious :) What´s the best approach to handle this kind of scenario? SP or views ? Or something completely different?
thx for any suggestions in advance!
It depends (that's the generic answer to any "general" questions, isn't it? :) ). You need to evaluate the tradeoffs to see what the best solution is for your needs.
Views are basically just query re-writing (in MySQL), so using a view will be performing the aggregation/denormalization every time the query is run. That may make your queries slower that you would like. Also, if your procedures are really complicated, maybe it's not practical to try to put that logic into a view.
Stored procedures do the work once, so queries will be faster. But then your updates won't show up automatically. So I think the answer depends on how often the data changes, how often queries are run, and how important the performance of the queries is.
As for alternative suggestions, you could also run your stored procedures using events, if your data updates are regular, and you are just trying to save yourself from the manual task of running the procedures.
Another option is to have denormalization/aggregation tables that are updated with triggers. As you update your data in the source table, the triggers will automatically keep the aggregate tables current.
Here is a link to documentation for stored procedures, views, triggers, and events.

MySQL: Data query from multiple tables with views

I want to create a query result page for a simple search, and i don't know , should i use views in my db, would it be better if i would write a query into my code with the same syntax like i would create my view.
What is the better solution for merging 7 tables, when i want to build a search module for my site witch has lots of users and pageloads?
(I'm searching in more tables at the same time)
you would be better off using a plain query with joins, instead of a view. VIEWS in MySQL are not optimized. be sure to have your tables properly indexed on the fields being used in the joins
If you always use all 7 tables, i think you should use views. Be aware that mysql changes your original query when creating the view so its always good practice to save your query elsewhere.
Also, remember you can tweak mysql's query cache env var so that it stores more data, therefore making your queries respond faster. However, I would suggest that you used some other method for caching like memcached. The paying version of mysql supports memcached natively, but Im sure you can implement it in the application layer no problem.
Good luck!