MySQL NDBCLUSTER: is it good for large-scale solutions?

A question about NDBCLUSTER.
I inherited a web site built on an NDBCLUSTER 5.1 solution (LAMP platform).
Unfortunately, whoever designed the original solution didn't realize that this database engine has strong limits. For one, the maximum number of fields a table can have is 128. The former programmer designed tables with 369 fields in a single row, one for each day of the year plus some key fields (he originally worked with the MyISAM engine). It must be refactored, I know.
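Something like this is the obvious direction for the refactoring (a sketch only; table and column names are hypothetical): one row per key and day instead of one column per day, which stays far below any column limit.
CREATE TABLE daily_value (
    series_id INT UNSIGNED NOT NULL,     -- hypothetical key field
    value_date DATE NOT NULL,            -- one row per day instead of one column per day
    value DECIMAL(12,2),
    PRIMARY KEY (series_id, value_date)
) ENGINE=NDBCLUSTER;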
What's more, the engine needs a lot of tuning: the maximum number of attributes for a table (which defaults to 1000, a bit too low) and many other parameters, the misinterpretation or underestimation of which can lead to serious problems once your database is in production and you're forced to change something.
Even disk storage for NDBCLUSTER tables is somewhat unpredictable if not precisely configured: even when specified in CREATE TABLE statements, the engine seems to prefer keeping data in memory - which explains the speed - but this can be a pain if the table on node 1 suddenly collapses (as it did during testing). All table data was lost on all nodes, and the table was corrupted after only 1000 records.
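For reference, actually pinning an NDB table to disk takes more than a clause in CREATE TABLE; it needs an explicit logfile group and tablespace, roughly like this (names and sizes are illustrative), and even then indexed columns, including the primary key, stay in memory:
CREATE LOGFILE GROUP lg_1
    ADD UNDOFILE 'undo_1.log'
    INITIAL_SIZE 128M
    ENGINE NDBCLUSTER;
CREATE TABLESPACE ts_1
    ADD DATAFILE 'data_1.dat'
    USE LOGFILE GROUP lg_1
    INITIAL_SIZE 256M
    ENGINE NDBCLUSTER;
CREATE TABLE disk_backed (
    id INT NOT NULL PRIMARY KEY,   -- the key itself still lives in memory
    payload VARCHAR(255)           -- non-indexed columns can go to disk
) TABLESPACE ts_1 STORAGE DISK ENGINE NDBCLUSTER;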
We were on a server with 8 GB of RAM, and the table had just 27 fields.
Please note that no ndb_mgm node-shutdown operation was run that could have compromised the table data. It simply fell over, full stop. Our provider couldn't explain why.
So the question is: would you recommend NDBCLUSTER as a stable solution for a large-scale web service database?
We're talking about a database which should contain several million records, thousands of tables and thousands of catalogues.
If not, which database would you recommend as the best for building a nation-scale web service?
Thanks in advance.

I had a terrible experience with NDBCLUSTER. It's a good replacement for memcached with range invalidation, nothing more. Stability and configurability simply don't exist in this solution. You cannot force all processes to listen on specific ports; backup worked, but I had to edit the backup files in vim to restore the database, and so on.

Related

How to improve a "select min(my_col)" query in MySQL without adding an index

The query below takes about a minute to run on my MySQL instance (running on a fairly beefy machine with 64 GB memory, a 2 TB disk, and a 2.30 GHz CPU with 8 cores and 16 logical processors; the query is running on localhost). This same query runs in less than a second on a SQL Server database I have access to. Unfortunately, I do not have access to the SQL Server host or the DBA, etc.
select min(visit_start_date)
from visit_occurrence;
The table is set to ENGINE=MyISAM; default-storage-engine=INNODB and innodb_buffer_pool_size=16G are set in my.ini.
Is there some configuration I could be missing that would cause this query to run so slowly on MySQL? How can I fix it?
I have a large number of tables and queries I will need to support so I would really like to be able to fix this issue globally rather than having to create indexes everywhere I have slow queries.
The SQL Server database does not seem to have an index on the column being queried.
EDIT:
Untagged MS SQL Server. I had tagged it hoping for help from our MS SQL Server colleagues, with information on whether SQL Server has some way of structuring data and/or queries that makes this type of query run faster on that platform than on others such as MySQL.
Removed the image of code to conform more closely with community standards.
You never know if there is a magic go-faster button if you don't ask (ENGINE=MyISAM is sometimes kind of like a magic go-faster button for some queries in MySQL). I'm fishing for a potential hardware or clustering solution here. Is Apache Ignite a potential solution?
Thanks again to the community for all of your support and help. I hope this fixes most of the issues that have been raised for this post.
SECOND EDIT:
Is the partitioning/sharding described in the links below a potential solution here?
https://user3141592.medium.com/how-to-scale-mysql-42ebd2841fa6
https://dev.mysql.com/doc/refman/8.0/en/partitioning-overview.html
THIRD EDIT: A note on community standards.
Part of our community standards is explicitly to be welcoming, inclusive, and to be nice.
https://stackoverflow.blog/2018/04/26/stack-overflow-isnt-very-welcoming-its-time-for-that-to-change/?fbclid=IwAR1gr6r2qmXs506SAV3H_h6H8LoFy3mlXucfa-fqiiEXMHUR3aF_tdoZGsw
https://meta.stackexchange.com/questions/240839/the-new-new-be-nice-policy-code-of-conduct-updated-with-your-feedback
The MS SQL Server tag was used here because one of the systems I'm comparing is MS SQL Server. We're really working with very limited information here. I have two systems: my MySQL system, which is knowable as I'm running it, and the MS SQL Server running the same database in someone else's system, about which I have very little information (all I have is a read-only SQL prompt). I am comparing apples and oranges: the same query runs well on the orange (MS SQL Server) and does not run well on the apple (my MySQL instance). I'd like to know why, so I can make an informed decision about how to get my queries to run in a reasonable amount of time. How do I get my apple to look like an orange? Do I switch to MS SQL Server? Do I need to deploy on different hardware? Is the other system running some kind of in-memory caching layer on top of its database instance? Most of these possibilities would require a non-trivial amount of time to explore and validate. So yes, I would like help from MS SQL Server experts who might know if there are caching options, transactional vs. warehouse options, etc. that could be set that would make a world of difference - the magic go-faster buttons.
The magic go-faster button comment was perhaps a little condescending.
The picture showing the indexes was included because I was trying to make the point that the other system does not seem to have an index on the column being queried. In this case a picture was worth a thousand words.
If the table says ENGINE=MyISAM, then that is what counts. In almost all cases, this is a bad choice. innodb_buffer_pool_size=16G is not relevant except that it robs memory from MyISAM.
default-storage-engine=INNODB is relevant only when creating a table without explicitly specifying ENGINE=.
Are some of your tables MyISAM and some are InnoDB? How much RAM do you have?
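A quick way to check the engine mix yourself (a sketch; substitute your schema name):
SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'your_database'
ORDER BY ENGINE, TABLE_NAME;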
Most performance solutions necessarily involve an INDEX. Please explain why you can't afford an index. It could turn that query into less than 10ms, regardless of the number of rows in the table.
Sorry, but I don't accept "rather than having to create indexes everywhere I have slow queries".
Changing tables from MyISAM to InnoDB will, in some cases, help with performance. I suggest you change the engine as you add the indexes.
Show us some more queries; we can help you decide what indexes are needed. select min(visit_start_date) from visit_occurrence; needs INDEX(visit_start_date); other queries may not be so trivial. Do not fall into the trap of "indexing every column".
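Since the engine change forces a table rebuild anyway, both steps can be done in one pass, along these lines (a sketch, not a tested migration):
ALTER TABLE visit_occurrence
    ENGINE = InnoDB,
    ADD INDEX (visit_start_date);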
More
In MySQL...
A single connection only uses one core, so more cores only helps when you have more connections. (Some tiny exceptions exist in MySQL 8.0.)
Partitioning rarely helps with performance; do not use it without getting advice. (PS: BY RANGE is perhaps the only useful variant.)
Replication is for read-scaling (and backup and ...)
Sharding is for write-scaling. It requires a bunch of extra architectural things -- such as routing queries to the appropriate servers. (MariaDB has Spider and FederatedX as possible tools.) In any case, sharding is a non-trivial undertaking.
Clustering is for HA (High Availability, auto-failover, etc), while helping some with read and write scaling. Cf: Galera, InnoDB Cluster.
Hardware is rarely more than a temporary solution to performance issues.
Caching leads to potentially inconsistent results, so beware. Also, consider my mantra "don't bother putting a cache in front of a cache".
(I can advise further on any of these topics.)
Whether in MyISAM or InnoDB, or even SQL Server, your query
select min(visit_start_date) from visit_occurrence;
can be satisfied almost instantaneously by this index, because it uses a so-called loose index scan.
CREATE INDEX visit_start_date ON visit_occurrence (visit_start_date);
A query with an aggregate function like MIN() is always a GROUP BY query. But if the GROUP BY clause isn't present in the SQL statement, the server treats the entire table as a single group.
You mentioned a query that can be satisfied immediately when using MyISAM. That's SELECT COUNT(*) FROM whatever_table. Behind the scenes MyISAM keeps table metadata showing the total number of rows in the table, so that query comes back right away. The transactional storage engine InnoDB doesn't do that. It supports so much concurrency that its designers didn't include the total row count in their metadata, because it would be wrong in so many circumstances that it wasn't worth the risk.
Index design isn't a black art. But it is an art informed by the kind of measurements we get from EXPLAIN (or ANALYZE or EXPLAIN ANALYZE). A basic truth of database-driven apps (in any make of database server) is that indexing needs to be revisited as the app grows. The good news: changing, adding, or dropping indexes doesn't change your data.
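As a quick check after adding the index above, EXPLAIN should confirm that no rows are scanned at all:
EXPLAIN SELECT MIN(visit_start_date) FROM visit_occurrence;
-- With the index in place, the Extra column should read
-- "Select tables optimized away": the value is read directly
-- from one end of the index.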

MySQL database size growing over 4 terabytes? Azure supports up to 4

A couple of days ago, I was asked for help in this particular situation. A MySQL database set up on Azure is reaching 4 terabytes in size. I have set up databases before and developed against them, but I'm not really a DBA.
The problem, according to them, is that Azure's size limit is 4 terabytes (the database will double in size in a couple of months, but luckily it won't keep growing like that). I talked to them about archiving some of the data, but apparently they need all 10 years' worth. They don't want to move away from Azure or use something other than MySQL. One thing they pointed out to me was that one table in particular is almost 2 terabytes in size.
Unfortunately, I haven't been given access to the database yet, but I just wanted to ask about my options in a situation like this. I looked into it a bit and saw things like MySQL sharding. Is this the only option? Can it be done on Azure (I found sharding articles for SQL Server on Azure, but not for MySQL)? Can I partition some tables into another MySQL database, for example?
I guess I'm just looking for advice on how to move forward with this. Any link on something like this is appreciated.
Thank you
Simple Answer
4TB is not the MySQL limit, so Azure is limiting you. Switch to another service.
Future problems
But... 4TB is rife with issues, especially for a non-DBA.
Over-sized datatypes (wasting disk space)
Lack of normalization (wasting disk space)
Need for summary tables (if it is a Data Warehouse)
Sub-optimal indexes (performance)
Ingestion speed (bogging down on loading fresh data)
Query speed
Partitioning to aid in purging (if you will eventually purge 'old' data; see the sketch below)
Sharding (This is as big a discussion as all the others combined)
All of these can be tackled, but we need to see the schema, the queries, the dataflow, the overall architecture, etc.
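To illustrate the purge point: a RANGE-partitioned table lets you drop a whole year as a near-instant metadata operation instead of a multi-hour DELETE. All names below are made up.
CREATE TABLE events (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    created_at DATE NOT NULL,
    payload VARCHAR(255),
    PRIMARY KEY (id, created_at)   -- the partition column must be in every unique key
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p2019 VALUES LESS THAN (TO_DAYS('2020-01-01')),
    PARTITION p2020 VALUES LESS THAN (TO_DAYS('2021-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
ALTER TABLE events DROP PARTITION p2019;   -- purges a year of data in seconds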

how to tune MySQL 5.6 database

We're developing an enterprise product that uses MySQL 5.6 on Windows to store reports generated by multiple clients. Our database contains approximately 20 tables, each containing from a few hundred thousand to several million records. All tables have more than 10 columns with a combination of numeric and textual data. All tables use the InnoDB engine with a numeric field as the primary key. The tables are also indexed on another numeric field, different from the primary key.
There are about 10 connections used to merge new data into the database. The data is viewed via a web console. There is no limit as such on the number of instances viewing the data. We also don't have any reference/foreign keys in the tables, so we don't use joins.
We haven't created stored procedures for fetching data. Do stored procedures really improve performance?
While searching for solutions on the internet, I found that by changing the values of innodb_buffer_pool_size, read_rnd_buffer_size, sort_buffer_size, etc. in my.ini/my.cnf, we can improve performance as well as minimize MySQL's memory requirements. But I am not confident about it, because I don't know what the proper values should be or what their side effects are. Currently I have kept the default configuration. Please let me know which values in the configuration file can be changed to improve performance and minimize memory requirements without side effects.
I would also like to know some other ways to optimize and fine-tune the MySQL engine to boost performance and use resources optimally.
The minimum software/hardware requirements of our product are:
OS: Windows 2000 SP4 Professional/Server and later.
RAM: 1 GB
CPU: 1 GHz
There are no predefined performance values for any database; if there were, they would already be used in the default configuration.
So first think about schema design. If your schema is designed well, you have already resolved half the issues.
From your description the schema looks good: you have already applied indexes, have no joins, etc.
First try your application with the default settings provided by the MySQL engine.
Then carry out performance tests of your application's critical, widely used, and resource-intensive workflows (the definition of these terms varies per application).
If you find performance bottlenecks on the database side, then try the following:
Check the heavy queries used by the application and optimize them.
If queries cannot be optimized, make them faster using indexes.
Carry out regular tuning of the database (fragmentation, runstats, etc.).
If you see buffer pool shortages, sort overflows, or read-buffer pressure, tune those settings accordingly; this is not a one-time job (see the sketch below).
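A minimal way to watch for those shortages from SQL, using standard status counters:
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';          -- reads that missed the buffer pool and hit disk
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read_requests';  -- all logical read requests, for the hit ratio
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes';                 -- steady growth hints sort_buffer_size is too small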
Performance tuning is a very difficult job; it requires a lot of patience and expertise.
And yes, stored procedures generally help to improve performance. Apply them whenever you can move heavy code logic into the database.
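A minimal sketch of that idea, with hypothetical table names - the aggregation runs server-side instead of shipping rows to the application:
DELIMITER //
CREATE PROCEDURE merge_client_report(IN p_client_id INT)
BEGIN
    -- summarize one client's report rows inside the database
    INSERT INTO report_summary (client_id, row_count)
    SELECT p_client_id, COUNT(*)
    FROM report_rows
    WHERE client_id = p_client_id
    ON DUPLICATE KEY UPDATE row_count = VALUES(row_count);
END //
DELIMITER ;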
Hope this was useful.

What data quantity is considered as too big for MySQL?

I am looking for a free SQL database able to handle my data model. The project is a production database working on a local network, not connected to the internet, without any replication. The number of applications connected at the same time would be less than 10.
The data volume forecast for the next 5 years are:
3 tables of 100 million rows
2 tables of 500 million rows
20 tables with less than 10k rows
My first idea was to use MySQL, but I have found several articles around the web saying that MySQL is not designed for big databases. But what does "big" mean in this case?
Can someone tell me whether MySQL is able to handle my data model?
I read that Postgres would be a good alternative, but requires many hours of tuning to be efficient with big tables.
I don't think my project would use a NoSQL database.
I would like to know if someone has experience with MySQL to share in this regard.
UPDATE
The database will be accessed by C# software (max 10 at the same time) and web applications (2-3 at the same time).
It is important to mention that only a few updates will be done on the big tables; the queries are mostly inserts. Delete statements will be run only a few times, on the 20 small tables.
The big tables are very often used for select statements, but most often just to check whether an entry exists, not to return grouped and ordered batches of data.
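To make that access pattern concrete (table and column names are hypothetical), the existence check becomes a single index lookup once the looked-up column is indexed:
CREATE INDEX idx_big_entry_key ON big_table (entry_key);
SELECT EXISTS (
    SELECT 1 FROM big_table WHERE entry_key = 'some-value'
) AS entry_present;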
I work for Percona, a company that provides consulting and other services for MySQL solutions.
For what it's worth, we have worked with many customers who are successful using MySQL with very large databases: terabytes of data, tens of thousands of tables, tables with billions of rows, transaction loads of tens of thousands of requests per second. You may get some more insight by reading some of our customer case studies.
You describe the number of tables and the number of rows, but nothing about how you will query these tables. Certainly one could query a table of only a few hundred rows in a way that would not scale well. But this can be said of any database, not just MySQL.
Likewise, one could query a table that is terabytes in size in an efficient way. It all depends on how you need to query it.
You also have to set specific goals for performance. If you want queries to run in milliseconds, that's challenging but doable with high-end hardware. If it's adequate for your queries to run in a couple of seconds, you can be a lot more relaxed about the scalability.
The point is that MySQL is not a constraining factor in these cases, any more than any other choice of database is a constraining factor.
Re your comments.
MySQL has referential integrity checks in its default storage engine, InnoDB. The claim that "MySQL has no integrity checks" is a myth often repeated over the years.
I think you need to stop reading superficial or outdated articles about MySQL, and read some more complete and current documentation.
MySQLPerformanceBlog.com
High Performance MySQL, 3rd edition
MySQL 5.6 manual
MySQL has two important (and significantly different) database engines: MyISAM and InnoDB. The limits depend on usage. MyISAM is non-transactional; imports are relatively fast, but the engine is too simple (it has no data cache of its own), and JOINs on tables larger than 100MB can be slow (due to the too-simple MySQL planner; hash joins are supported from 5.6). InnoDB is transactional and very fast for operations based on the primary key, but imports are slower.
Current versions of MySQL do not have as good a planner as Postgres (though there is progress), so complex queries are usually much better on PostgreSQL, and really simple queries are better on MySQL.
The complexity of PostgreSQL configuration is a myth. It is much simpler than MySQL InnoDB configuration: you only have to set five parameters: max_connections, shared_buffers, work_mem, maintenance_work_mem and effective_cache_size. Almost all of them relate to the memory available to Postgres on the server. It is usually five minutes of work. In my experience, databases up to 100GB usually run without any problems on Postgres (probably on MySQL too). There are two important factors: how much speed you expect, and how much memory and how fast an IO subsystem you have.
With large databases you need experience and knowledge of whatever database technology you choose. Everything is fast when you fit in memory; the higher the ratio of database size to memory, the more work you have to do to get good results.
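For what it's worth, on PostgreSQL 9.4 and later those five settings can even be applied from SQL with ALTER SYSTEM; the values below are purely illustrative for a machine with around 32GB of RAM, not recommendations:
ALTER SYSTEM SET max_connections = 50;           -- takes effect after a restart
ALTER SYSTEM SET shared_buffers = '8GB';         -- also needs a restart
ALTER SYSTEM SET work_mem = '64MB';              -- per sort/hash operation, per connection
ALTER SYSTEM SET maintenance_work_mem = '1GB';   -- speeds up VACUUM and index builds
ALTER SYSTEM SET effective_cache_size = '24GB';  -- planner hint, not an allocation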
First of all, MySQL's table size is limited only by the file size limit of your OS, which is in the terabytes on any modern OS. That poses no problem. More important are questions like these:
What kind of queries will you run?
Are the large table records updated frequently, or are they basically archives of historical data?
What is your hardware budget?
What is the kind of query speed you need?
Are you familiar with table partitioning, archive tables, config tuning?
How fast do you need to write (expected inserts per second)?
What language will you use to connect to the db (Java, .NET, Ruby, etc.)?
What platform are you most familiar with?
Will you run queries which might cause table scans, such as LIKE '%something%', which would have to go through every single row and take forever? (See the sketch after this list.)
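To spell out that last point (table and column names are hypothetical, assuming an indexed string column):
SELECT id FROM articles WHERE body LIKE '%something%';   -- leading wildcard: full scan
SELECT id FROM articles WHERE body LIKE 'something%';    -- prefix match: can use an ordinary index
-- For word lookups, a FULLTEXT index is the usual escape hatch:
ALTER TABLE articles ADD FULLTEXT INDEX ft_body (body);
SELECT id FROM articles WHERE MATCH(body) AGAINST ('something');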
MySQL is used by Facebook, Google, Twitter and others with large tables, and 100,000,000 rows is not much in the age of social media. MySQL has very few drawbacks (even though I prefer PostgreSQL in most cases); one is altering large tables, for example by adding a new index, which might send your company on a couple of days of forced vacation if you don't have a replica in the meantime. Is there a reason why NoSQL is not an option? Sometimes hybrid approaches are a good choice, like keeping your relational business logic in MySQL and huge statistical tables in a NoSQL database like MongoDB, which can scale by adding new servers in minutes (MySQL can too, but it's more complicated). And MongoDB can have an indexed column which can be searched at blistering speed.
Bottom line: you need to answer the above questions first to make a well-informed decision. If you have huge tables and only search on indexed keys, almost any database will do; if you expect many changes to the structure down the road, you want to use a different approach.
Edit:
Based on the update you just posted, I doubt you would run into problems.

Is Postgres better than MySQL when one needs to add a column to a table with millions of rows?

We're having problems with MySQL. When I search around, I see many people having the same problem.
I have joined a product whose database has some tables with as many as 150 million rows. One of these tables has over 30 columns, and about half of them are no longer used. When we try to remove or rename columns, MySQL wants to copy the entire table and rename it. With this amount of data, that would take many hours, and the site would be offline pretty much the whole time. This is just the first of several large migrations to improve the schema. They aren't intended as a regular thing - just a lot of cleanup I inherited.
I tried searching to see if people have the same problem with Postgres, and I find almost nothing discussing this issue in comparison. Is this because Postgres is a lot better at it, or just that fewer people are using Postgres?
In PostgreSQL, adding a new column without default value to a table is instantaneous, because the new column is only registered in the system catalog, not actually added on disk.
When the only tool you know is a hammer, all your problems look like nails. For this problem, PostgreSQL is much, much better at handling these types of changes. And the fact is, it doesn't matter how well you designed your app: you WILL have to change the schema on a live database someday.
While MySQL's various engines really are amazing for certain corner cases, here none of them help. PostgreSQL's very close integration between its various layers means you get things like transactional DDL, which lets you roll back anything that isn't an ALTER/CREATE DATABASE/TABLESPACE, very fast ALTER TABLEs, non-impeding CREATE INDEXes, and so on. This limits PostgreSQL to the things it does well (traditional transactional db load handling is a strong point) and keeps it out of the areas where MySQL fills in the gaps, like live networked clustered storage with the NDB engine.
In this case none of MySQL's different engines let you solve the problem easily. The very versatility of multiple storage engines means that the lexer/parser/top layer of the DB cannot be as tightly integrated with the storage engines, and therefore a lot of the cool things PostgreSQL can do here, MySQL can't.
I've got a 118 GB table in my stats db. It has 1.1 billion rows in it. It really should be partitioned, but it's not read a whole lot, and when it is, we can wait on it. At 300 MB/sec (the speed the array it's on can read), it takes approximately 118 × ~3.3 seconds to read, or around 6.5 minutes. This machine has 32 GB of RAM, so it cannot hold the table in memory.
When I ran the simple statement on this table:
alter table mytable add test text;
it hung waiting for a vacuum. I killed the vacuum (SELECT pg_cancel_backend(12345), with the vacuum's pid in there) and the ALTER finished immediately. A vacuum on this table takes a long time to run, by the way. Normally that's not a big deal, but when making changes to table structure, you have to wait on vacuums, or kill them.
Dropping a column is just as simple and fast.
Now we come to the problem with PostgreSQL: the in-heap MVCC storage. If you add that column and then run UPDATE mytable SET test='abc', it updates each row and exactly doubles the size of the table. Unless HOT can update the rows in place - but then you need a table with a 50% fill factor, which is double-sized to begin with. The only way to get the space back is either to wait and let vacuum reclaim it over time, reusing it one update at a time, or to run CLUSTER or VACUUM FULL to shrink it back down.
You can get around this by running updates on parts of the table at a time (UPDATE ... WHERE pkid BETWEEN 1 AND 10000000; ...) and running VACUUM between each run to reclaim the space, as sketched below.
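A sketch of that batch-and-vacuum loop (mytable, test and pkid are the names from the example above):
UPDATE mytable SET test = 'abc' WHERE pkid BETWEEN 1 AND 10000000;
VACUUM mytable;
UPDATE mytable SET test = 'abc' WHERE pkid BETWEEN 10000001 AND 20000000;
VACUUM mytable;
-- ...and so on through the remaining pkid ranges, so vacuum can
-- reclaim and reuse the dead row versions between batches.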
So, both systems have warts and bumps to deal with.
Maybe because this should not be a regular occurrence.
Perhaps, reading between the lines, you need to be adding rows to another table instead of columns to a large existing table?