How to increase the performance of database schema creation? - mysql

For our testing environment, I need to setup und tear down a database multiple times (each test should run independently of any other).
The process is the following:
Create database schema and insert necessary data
Run test 1
Remove all tables in database
Create database schema and insert necessary data
Run test 2
Remove all tables in database
...
The schema and data are the same for each test in the test case.
Basically, this works. The big problem is, that the creation and clearing of the database takes a lot of time. Is there a possibility to improve the performance of mysql for the creation of tables and the insertion of data? Or can you think of a different process for the tests?
Thank for you your help!

Optimize the logical design
The logical level is about the structure of the query and tables themselves. Try to maximize this first. The goal is to access as few data as possible at the logical level.
Have the most efficient SQL queries
Design a logical schema that support the application's need (e.g. type of the columns, etc.)
Design trade-off to support some use case better than other
Relational constraints
Normalization
Optimize the physical design
The physical level deals with non-logical consideration, such as type of indexes, parameters of the tables, etc. Goal is to optimize the IO which is always the bottleneck. Tune each table to fit it's need. Small table can be loaded permanently loaded in the DBMS cache, table with low write rate can have different settings than table with high update rate to take less disk spaces, etc. Depending on the queries, different index can be used, etc. You can denormalized data transparently with materialized views, etc.
Tables paremeters (allocation size, etc.)
Indexes (combined, types, etc.)
System-wide parameters (cache size, etc.)
Partitioning
Denormalization
Try first to improve the logical design, then the physical design. (The boundary between both is however vague, so we can argue about my categorization).
Optimize the maintenance
Database must be operated correctly to stay as efficient as possible. This include a few mainteanance taks that can have impact on the perofrmance, e.g.
Keep statistics up to date
Re-sequence critical tables periodically
Disk maintenance
All the system stuff to have a server that rocks
source from:How to increase the performance of a Database?

I suggest you can write all your need operations into an script using shell、perl or python(init_db).
The first use, you can create、 insert and delete manually,then dump both the schema and data .
You can choose bulk insert and drop table for deleting data to improve the total performance.
Hope this can help you.

Instead of DROP TABLE + CREATE TABLE, just do TRUNCATE TABLE. This may, or may not, be faster; give it a try.
If you are INSERTing multiple rows each time, then either batch them (all rows in one INSERT), or use LOAD DATA. Either of these is much faster than row-by-row INSERTs.
Also fast... If you have the initial data in another table (which you could keep permanently), then do
CREATE TABLE test SELECT * FROM perm_table;
... (run tests using `test`)
DROP TABLE test;

Related

Giant unpartitioned MySQL table issues

I have a MySQL table which is about 8TB in size. As you can imagine, querying is horrendous.
I am thinking about:
Create a new table with partitions
Loop through a series of queries to dump data into those partitions
But the loop will require lots of queries to be submitted & each will be REALLY slow.
Is there a better way to do this? Repartitioning a production database in-situ isn't going to work - this seemed like an OK option, but slow
And is there a tool that will make life easier? Rather than a Python job looping & submitting jobs?
Thanks a lot in advance
You could use pt-online-schema-change. This free tool allows you to partition the table with an ALTER TABLE statement, but it does not block clients from using the table while it's restructuring it.
Another useful tool could be pt-archiver. You would create a new table with your partitioning idea, then pt-archiver to gradually copy or move data from the old table to the new table.
Of course try out using these tools in a test environment on a much smaller table first, so you get some practice using them. Do not try to use them for the first time on your 8TB table.
Regardless of what solution you use, you are going to need enough storage space to store the entire dataset twice, plus binary logs. The old table will not shrink, even as you remove data from it. So I hope your filesystem is at least 24TB. Or else the new table should be stored on a different server (or ideally several other servers).
It will also take a long time no matter which solution you use. I expect at least 4 weeks, and perhaps longer if you don't have a very powerful server with direct-attached NVMe storage.
If you use remote storage (like Amazon EBS) it may not finish before you retire from your career!
In my opinion, 8TB for a single table is a problem even if you try partitioning. Partitioning doesn't magically fix performance, and could make some queries worse. Do you have experience with querying partitioned tables? And you understand how partition pruning works, and when it doesn't work?
Before you choose partitioning as your solution, I suggest you read the whole chapter on partitioning in the MySQL manual: https://dev.mysql.com/doc/refman/8.0/en/partitioning.html, especially the page on limitations: https://dev.mysql.com/doc/refman/8.0/en/partitioning-limitations.html Then try it out with a smaller table.
A better strategy than partitioning for data at this scale is to split the data into shards, and store each shard on one of multiple database servers. You need a strategy for adding more shards as I assume the data will continue to grow.

(MySql) I wonder if partitioning table or database backup via query affects the performance of other activities?

Let say that's a table named orders, having almost 10millions of data coming in daily. There will be a schedule to split the table accordingly at 0000hrs.
May I know is this schedule going to affect the performance of API to retrieve data, insert & update data?
Yes.
120 Inserts/second is about the max that HDD drives can handle. (SSD can handle much more.)
Meanwhile, a backup is doing lots of reads to fetch the data and, in the dump is stored on the same machine, lots of disk writes.
ALTER TABLE (or whatever you mean by "split the table") to change the partitioning may do a lot of I/O. Please provide details. If your SQL is I/O-bound, I may have a much more efficient approach.
Selects will also be slowed down simply because of other things going on.
There are many techniques for speeding up Inserts; we should discuss that, too. Please describe your Insert mechanism -- one row per "order"? multiple tables touched per order? all orders go through one client? batch inserts? Etc, etc.

Automatically Loading the View

Can we create Materialized view in mysql or sql which will automatically reloaded with the data from the underlying bases table without hitting the base table.
Elobration:
I have created viewMasterTable view which is join of 3 tables,
TableA,TableB,Table = viewMasterTable
Now i want this view to be reloaded with the data if any changes i.e upadate,insert or delete is made on the base table without hitting the base table.
**Will this view concept will help in performance increase**
You can create materialized views in SQL-Server Enterprise Edition. In MySQL you cannot create materialized views. Thus this only applicable to SQL-Server and a very specific edition.
Now you don't get something for nothing. If you materialize a view it means that the source columns used in the base table(s) have to remain in synchronization with the data in the base tables. Thus any updates/inserts/deletes on the base table will be impacted as the server now has to write to the base tables and update the view. So you will have an extra operation to complete for every write this will incur a performance penalty on the server itself. Depending on size of tables, views and frequency of updates this might or might not be a small penalty.
You can index materialized views and this is where the power really shines. Say you have a very complex view that can be filtered by various columns a materialize view will allow you to index the fields in the view allowing a user to filter much faster. However the downside is that for every index that you create on a materialized view it will incur more write penalties as the server needs to update indexes when updating the view.
So while it can be a really good way to increase performance for reads on a complex query you will see a performance penalty on writes. How bad will this penalty be? Well that depends on how you have arranged your Disk IO pathways i.e. for example placing your indexes, views and tables on separate physical spindles will help alleviate some of the write overhead.

Only Mysql OR mysql+sqlite OR mysql+own solution

Currently I am building quite big web system and I need strong SQL database solution. I chose Mysql over Postgres because some of tasks needs to be read-only (MyISAM engine) and other are massive-writes (InnoDB).
I have a question about this read-only feature. It has to be extremely fast. User must get answer a lot less than one second.
Let say we have one well-indexed table named "object" with not more than 10 millions of rows and another one named "element" with around 150 millions of rows.
We also have table named "element_object" containing information connecting objects from table "element" with table "object" (hundreds of millions of rows)
So we're going to do partitioning on tables "element" and "element_object" and have 8192 tables "element_hash_n{0..8191}a" and 24576 of tables "element_object_hash_n{0..8191}_m{0..2}".
An Answer on user's question would be a 2-step searching:
Find id of element from tables "element_hash_n"
Do main sql select on table "object" and join with table "element_object..hash_n_m" to filter result with found (from first step) ID
I wonder about first step:
What would be better:
store (all) over 32k tables in mysql
create one sqlite database and store there 8192 tables for first step purpose
create 8192 different sqlite files (databases)
create 8192 files in file system and make own binary solution to find ID.
I'm sorry for my English. Its not my native language.
I think you make way to many partitions. If you have more than 32000 partitions you have a tremendous overhead of management. Given the name element_hash_* it seams as if you want to make a hash of your element and partition it this way. But a hash will give you a (most likely) even distribution of the data over all partitions. I can't see how this should improve performance. If your data is accessed over all those partitions you don't gain anything by having partitions in size of your memory - you will need to load for every query data from another partition.
We used partitions on a transaction systems where more than 90% of the queries used the current day as criteria. In such a case the partition based on days worked very well. But we only had 8 partitions and moved the data then off to another database for long time storage.
My advice: Try to find out what data will be needed that fast and try to group it together. And you will need to make your own performance tests. If it is so important to deliver data that fast there should be enough management support to build a decent test environment.
Maybe your test result will show that you simply can't deliver the data fast enough with a relational database system. If so you should look at NoSQL (as in Not only SQL) solutions.
In what technology do you build your web system? You should test this part as well. A super fast database will not help you much if you lose the time in a poorly performing web application.

Cassandra write performance vs Releational Databases

I am trying to grasp some performance differences between Cassandra and relational databases.
From what I have read, Cassandra's write performance remains constant regardless of data volume. By write performance, I am assuming this implies both new rows being added as well as existing rows being replaced on a key match (like an update in the relational world). Is that assumption correct?
Also, from what I understand about relational databases updates get slower when tables/partitions become larger. This is because a full table scan must be performed to locate the row, or an index lookup needs to be performed and both of these things will take longer as the table or partition grows. So updates take perpetually longer based on the data volume of the table/partition?
When new data is inserted to a relational database, I know any indexes need to to have the new data but there is no lookup involved correct? So will inserts also become perpetually slower as data volume increases or stay constant with relational databases?
Thanks for any tips
They will become slower if the table has indexes. Not only the data must be written, but the index must be updated too. Inserting in a table that has no indexes and no constraints is lightning fast, because no checks need to be done. The record can just be written at the end of the table space.
On the relational DB side, I've been doing load testing on our RDBMS where I can see that the performance drops exponentially as data is added to the DB.
I'm still working on a Cassandra setup to be able to realize a comparable test. In the meantime, this Cassandra presentation gives some info on Cassandra compared to MySQL:
http://www.slideshare.net/Eweaver/cassandra-presentation-at-nosql