Running even basic SQL queries on a >60 GB table in MariaDB - mysql

I am running MariaDB on a vServer (8 CPU vCores, 32 GB RAM) with a few dozen database tables which aggregate data from external services around the web for efficient use across my collection of websites (basically an API layer with database caching and its own API for easy use in all of my projects).
All but one of these database tables allow quick, basic queries such as
SELECT id, content FROM tablename WHERE date_added > somedate
(with "content" being some JSON data). I am using InnoDB as the storage engine to allow inserts without table locking, "id" is always the primary key in any table and most of these tables only have a few thousand or maybe a few hundred thousand entries, resulting in a few hundred MB.
One table, however, does not behave: it already has >6 million entries (potentially heading towards 100 million) and uses >60 GB including indexes. I can insert, update and select by "id", but anything more complex (e.g. filtering on 1 or 2 additional fields or sorting the results) seemingly runs forever. Example:
SELECT id FROM tablename WHERE extra = ''
This query would select entries where "extra" is empty. There is an index on "extra" and
EXPLAIN SELECT id FROM tablename WHERE extra = ''
tells me it is just a SIMPLE query with the correct index automatically chosen ("Using where; Using index"). With a low LIMIT the query is fine, but as soon as it has to return thousands of rows it never stops running. Searching on more than one field fails the same way, even with a combined index and the index name added explicitly to the query (a sketch of that kind of statement follows below).
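For illustration, this is a minimal sketch of the kind of multi-field statement that hangs; the second column ("type") and the index name (idx_extra_type) are hypothetical, since the real schema is not shown:

SELECT id
FROM tablename FORCE INDEX (idx_extra_type)
WHERE extra = '' AND type = 'news'
LIMIT 100000;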
Since there is more than enough storage available on my vServer, and MariaDB/InnoDB table size limits are nowhere near this size, I suspect there is some setting or other limitation preventing me from running queries on larger tables. Looking through all the settings of MariaDB I couldn't find anything appropriate, though.
I would be glad if someone could point me in the right direction.

Related

Is it a good idea to distribute records of one table into several tables having a similar table structure

I have a table with 62 million records.
Table structure: 52 columns
Storage engine: InnoDB
Collation: utf8_general_ci
SELECT - the operation performed most often
INSERT - always in bulk, but it does not happen all the time
UPDATE - usually very few operations, sometimes many and sometimes none at all
Since we are almost always fetching in real time: is it a good idea to distribute the records from this one big table, by some logic, into multiple similar tables in order to make SELECTs faster?
MYSQL Version: mysql Ver 15.1 Distrib 10.2.33-MariaDB
That technique is almost guaranteed to make things slower.
Provide CREATE TABLE and the important queries.
Often a SELECT can be sped up by a composite index and/or a reformulation.
62M rows is above average, but not a scary size.
"INSERT - Always in bulk" -- Let's see your technique; there may be a way to speed it up further.
"Archiving" old data -- Actually removing the data may help some. Keeping it around, but using suitable indexes is usually fine. We need to see your queries and schema.
"Sharding" is putting parts of the data in separate servers. This is handy when you have exhausted the write capacity of a single machine.
"Replication" (Primary + Replica) allows shifting reads to another server, thereby spreading the load. With this technique, you system can handle a virtually unlimited number of SELECTs.
"Yes, indexes have been implemented" -- That may mean that you have one index per column. This is almost always not optimal.
"128GB RAM" -- If the entire dataset is much smaller than that, then most of the RAM is going unused.
"Query design" -- I had one example of a CPU that was pegged at 100%. After making the following change, the CPU dropped to 1%:
SELECT ... WHERE DATE(dt) = CURDATE();
-->
SELECT ... WHERE dt >= CURDATE();
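As a hedged illustration of the composite-index point above (table, column and index names are made up, not taken from the question): instead of one index per column, a single composite index can satisfy a query that filters on two columns at once.

-- separate INDEX(status) and INDEX(created_at) often leave the optimizer
-- scanning far more rows than necessary; one composite index fits the query:
ALTER TABLE big_table ADD INDEX idx_status_created (status, created_at);
SELECT id FROM big_table
WHERE status = 'open' AND created_at >= CURDATE() - INTERVAL 7 DAY;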

Why does my MySQL database's INFORMATION_SCHEMA not accurately represent the tables

I was migrating a database from a server to the AWS cloud, and decided to double check the success of the migration by comparing the number of entries in the tables of the old database and the new one.
I first noticed that, of the 46 tables I migrated, 13 reported different sizes; on further inspection I noticed that 9 of those 13 tables were actually bigger in the new database than in the old one. There are no scripts or code currently set up against either database that would change the data, let alone the amount of data.
I then further inspected one of the smaller tables (only 43 rows) in the old database and noticed that, when running the SQL query below, I was getting a return of 40 TABLE_ROWS instead of the actual 43. The same was the case for another small table in the old database, where the query said 8 rows but there were 15. (I manually counted multiple times to confirm these two cases.)
However, when I ran the same query on the new, migrated database, it displayed the correct number of rows for those two tables.
SELECT TABLE_ROWS, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'db_name';
Any thoughts?
Reading the documentation: https://dev.mysql.com/doc/refman/8.0/en/tables-table.html
TABLE_ROWS
The number of rows. Some storage engines, such as MyISAM, store the exact count. For other storage engines, such as InnoDB, this value is an approximation, and may vary from the actual value by as much as 40% to 50%. In such cases, use SELECT COUNT(*) to obtain an accurate count.
Were there any errors or warnings in the migration log? There are many ways to migrate MySQL table data; I personally like to use mysqldump and import the resulting SQL file using the mysql command line client. In my experience, importing with GUI clients always has some shortcomings.
In order for information_schema to not be painfully slow when retrieving this for large tables, it uses estimates, based on the cardinality of the primary key, for InnoDB tables. Otherwise it would end up having to do SELECT COUNT(*) FROM table_name, which for a table with billions of rows could take hours.
Look at SHOW INDEX FROM table_name and you will see that the number reported in information_schema is the same as the cardinality of the PK.
Running ANALYZE TABLE table_name will update the statistics, which may make them more accurate, but the value will still be an estimate rather than an exact, just-checked row count.
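A minimal sketch of comparing the estimate with the exact count for one table (db_name and my_table are placeholders):

SELECT TABLE_ROWS FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_SCHEMA = 'db_name' AND TABLE_NAME = 'my_table';  -- estimate
SELECT COUNT(*) FROM my_table;                               -- exact count
SHOW INDEX FROM my_table;                                    -- PK cardinality matches the estimate
ANALYZE TABLE my_table;                                      -- refreshes statistics, still an estimate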

Queries fast after creating an index but slow after a few minutes MySQL

I have several tables with ~15 million rows. When I create an index on the id column and then execute a simple query like SELECT * FROM my_table WHERE id = 1, I retrieve the data within one second. But then, after a few minutes, if I execute the query with a different id it takes over 15 seconds.
I'm sure it is not the query cache because I'm trying different ids all the time to make sure I'm not retrieving from the cache. Also, I used EXPLAIN to make sure the index is being used.
The specs of the server are:
CPU: Intel Dual Xeon 5405 Harpertown 2.0Ghz Quad Core
RAM: 8GB
Hard drive 2: 146GB SAS (15k rpm)
Another thing I noticed is that if I execute REPAIR TABLE my_table the queries complete within one second again. I assume something is being cached, either the table or the index. If so, is there any way to tell MySQL to keep it cached? Is it normal, given the specs of the server, to take around 13 seconds on an indexed table? The index is not unique and each query returns around 3000 rows.
NOTE: I'm using MyISAM and I know there won't be any write in these tables, all the queries will be to read data.
SOLVED: thank you for your answers; as many of you pointed out, it was the key_buffer_size. I also reordered the tables by the same column as the index so the records are not scattered; now the queries consistently run in under 1 second (a sketch of that reordering follows below).
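A hedged sketch of that reordering, assuming MyISAM and an index on a column named id (the actual table definition was never posted):

ALTER TABLE my_table ORDER BY id;  -- physically rewrites the rows in index order (useful for MyISAM, read-mostly data)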
Please provide
SHOW CREATE TABLE
SHOW VARIABLES LIKE '%buffer%';
Likely causes:
key_buffer_size (when using MyISAM) is not 20% of RAM, or innodb_buffer_pool_size (when using InnoDB) is not 70% of available RAM (see the sketch after this list).
Another query (or group of queries) is coming in and "blowing out the cache" (key_buffer or buffer_pool). Look for such queries.
When using InnoDB, you don't have a PRIMARY KEY. (It is really important to have such.)
For 3000 rows to take 15 seconds to load, I deduce:
The cache for the table (not necessarily for the index) was blown out, and
The 3000 rows were scattered around the table (hence fetching one row does not help much in finding subsequent rows).
Memory allocation blog: http://mysql.rjweb.org/doc.php/memory
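A minimal sketch of checking and adjusting the caches mentioned above; the values are only illustrative for an 8 GB, MyISAM-only server, not recommendations:

SHOW VARIABLES LIKE 'key_buffer_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- in my.cnf under [mysqld], roughly 20% of RAM for the MyISAM key cache:
-- key_buffer_size = 1600M
-- or, if the data were InnoDB, roughly 70% of RAM:
-- innodb_buffer_pool_size = 5G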
Is it normal, given the specs of the server, to take around 13 seconds on an indexed table?
The high variance in response time indicates that something is amiss. With only 8 GB of RAM and 15 million rows, you might not have enough RAM to keep the index in memory.
Is swap enabled on the server? This could explain the extreme jump in response time.
Investigate the memory situation with a tool like top, htop or glances.

Slow insert statements on SQL Server

A single insert statement occasionally takes more than 2 seconds. The inserts are potentially concurrent, as they depend on our site traffic, which can result in 200 inserts per minute.
The table has more than 150M rows, 4 indexes and is accessed using a simple select statement for reporting purposes.
SHOW INDEX FROM output
How to speed up the inserts considering that all indexes are required?
You haven't provided many details but it seems like you need partitions.
An insertion operation in a database index has, in general, O(log N) time complexity, where N is the number of rows in the table. If your table is really huge, even log N may become too much.
So, to address that scalability issue you can make use of partitioning to transparently split up your table and its indexes into smaller internal pieces and reduce that N, without changing your application or SQL scripts.
https://dev.mysql.com/doc/refman/5.7/en/partitioning-overview.html
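A hedged sketch of what that could look like, assuming a hypothetical table named output with a DATETIME column created_at that is part of the primary key (MySQL requires the partitioning column to be part of every unique key):

ALTER TABLE output
PARTITION BY RANGE (YEAR(created_at)) (
  PARTITION p2019 VALUES LESS THAN (2020),
  PARTITION p2020 VALUES LESS THAN (2021),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);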
[EDIT]
Considering information initially added in the comments and now updated in the question itself.
200 potentially concurrent inserts per minute
4 indexes
1 select for reporting purposes
There are a few not mutually exclusive improvements:
Check the output of EXPLAIN for that SELECT and remove indexes not being used, or, alternatively, combine them into a single index.
Make the inserts in batches (see the sketch after this list)
https://dev.mysql.com/doc/refman/5.6/en/insert-optimization.html
https://dev.mysql.com/doc/refman/5.6/en/optimizing-innodb-bulk-data-loading.html
Partitioning is still an option.
Alternatively, change your approach: save the data to a NoSQL store like Redis and populate the MySQL table asynchronously for reporting purposes.
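A minimal sketch of batching, assuming a hypothetical output(ts, value) table; one multi-row INSERT replaces many single-row statements and touches the indexes far less often:

INSERT INTO output (ts, value) VALUES
  ('2020-01-01 00:00:00', 1),
  ('2020-01-01 00:00:01', 2),
  ('2020-01-01 00:00:02', 3);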

How to manage Huge operations on MySql

I have a MySQL database with a lot of records (about 4,000,000,000 rows) that I want to process in order to reduce them to about 1,000,000,000 rows.
Assume I have following tables:
table RawData: more than 5000 rows per second arrive that I want to insert into RawData
table ProcessedData: processed (aggregated) storage for the rows that were inserted into RawData;
minimum row count > 20,000,000
table ProcessedDataDetail: details for ProcessedData (the data that was aggregated)
Users want to view and search the ProcessedData table, which requires joining more than 8 other tables.
Inserting into RawData and searching in ProcessedData (ProcessedData INNER JOIN ProcessedDataDetail INNER JOIN ...) are very slow. I use a lot of indexes; assume my data length is 1 GB but my index length is 4 GB :). (I want to get rid of these indexes, they slow down my process.)
How can I increase the speed of this process?
I think I need a shadow table of ProcessedData, call it ProcessedDataShadow: process RawData, aggregate it into ProcessedDataShadow, and then insert the result into ProcessedDataShadow and ProcessedData (a sketch of that flow follows below). What do you think?
(I am developing the project in C++.)
Thank you in advance.
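A hedged sketch of that shadow-table flow; the columns (sensor_id, created_at, value) and the unique key on (sensor_id, day) are entirely hypothetical, since the real schema was not posted:

-- aggregate today's raw rows into the shadow table
INSERT INTO ProcessedDataShadow (sensor_id, day, total)
  SELECT sensor_id, DATE(created_at), SUM(value)
  FROM RawData
  WHERE created_at >= CURDATE()
  GROUP BY sensor_id, DATE(created_at)
ON DUPLICATE KEY UPDATE total = VALUES(total);
-- periodically move finished aggregates into the main table
INSERT INTO ProcessedData (sensor_id, day, total)
  SELECT sensor_id, day, total FROM ProcessedDataShadow WHERE day < CURDATE();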
Without knowing more about what your actual application is, I have these suggestions:
Use InnoDB if you aren't already. InnoDB makes use of row locks and is much better at handling concurrent updates/inserts. It will be slower if you don't work concurrently, but the row locking is probably a must-have for you, depending on how many sources you will have for RawData.
Indexes usually speed things up, but badly chosen indexes can make things slower. I don't think you want to get rid of them entirely, but a lot of indexes can make inserts very slow. It is possible to disable index maintenance while inserting batches of data, so that the indexes are not updated on every single insert (see the sketch after this list).
If you will be selecting huge amounts of data in a way that might disturb the data collection, consider using a replicated slave database server that you use only for reading. Even if that locks rows/tables, the primary (master) database won't be affected, and the replica will catch up as soon as it is free to do so.
Do you need to process data in the database? If possible, maybe collect all data in the application and only insert ProcessedData.
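A minimal sketch of pausing index maintenance during a bulk load; DISABLE KEYS applies to non-unique MyISAM indexes, while for InnoDB the usual approach is one large transaction with the checks relaxed (the table name comes from the question, everything else is assumed):

ALTER TABLE RawData DISABLE KEYS;   -- MyISAM: stop maintaining non-unique indexes
-- ... bulk INSERTs into RawData ...
ALTER TABLE RawData ENABLE KEYS;    -- rebuild the indexes in one pass

-- InnoDB variant: one big transaction with relaxed checks
SET unique_checks = 0, foreign_key_checks = 0;
START TRANSACTION;
-- ... bulk INSERTs into RawData ...
COMMIT;
SET unique_checks = 1, foreign_key_checks = 1;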
You've not said what the structure of the data is, how it's consolidated, how promptly data needs to be available to users, nor how lumpy the consolidation process can be.
However, the most immediate problem will be sinking 5000 rows per second. You're going to need a very big, very fast machine (probably a sharded cluster).
If possible I'd recommend writing a consolidating buffer (using an in-memory hash table - not in the DBMS) to put the consolidated data into - even if it's only partially consolidated - and then updating from this into the ProcessedData table rather than trying to populate it directly from RawData.
Indeed, I'd probably consider separating the raw and consolidated data onto separate servers/clusters (the MySQL federated engine is handy for providing a unified view of the data).
Have you analysed your queries to see which indexes you really need? (hint - this script is very useful for this).