Google Cloud SQL sharding - mysql

I am now trying to use Google Cloud SQL as a solution for my project. However, I found no way to scale writing there. It uses innoDB, so I can't use features of NDB cluster. So I wanted to make some sharding, however I am unable to find any information about this. Are there any ways to scale writes to Google Cloud SQL?

A single INSERT statement with 100 row will run 10 times as fast as 100 single-row INSERTs.
LOAD DATA is probably the fastest way to do inserts.
In some situations it can be faster to use a staging table - put the rows to insert in it, then copy to the 'real' table(s).
Please give more details about the data flow, the table schema, and scale (eg, rows/second).

Related

Giant unpartitioned MySQL table issues

I have a MySQL table which is about 8TB in size. As you can imagine, querying is horrendous.
I am thinking about:
Create a new table with partitions
Loop through a series of queries to dump data into those partitions
But the loop will require lots of queries to be submitted & each will be REALLY slow.
Is there a better way to do this? Repartitioning a production database in-situ isn't going to work - this seemed like an OK option, but slow
And is there a tool that will make life easier? Rather than a Python job looping & submitting jobs?
Thanks a lot in advance
You could use pt-online-schema-change. This free tool allows you to partition the table with an ALTER TABLE statement, but it does not block clients from using the table while it's restructuring it.
Another useful tool could be pt-archiver. You would create a new table with your partitioning idea, then pt-archiver to gradually copy or move data from the old table to the new table.
Of course try out using these tools in a test environment on a much smaller table first, so you get some practice using them. Do not try to use them for the first time on your 8TB table.
Regardless of what solution you use, you are going to need enough storage space to store the entire dataset twice, plus binary logs. The old table will not shrink, even as you remove data from it. So I hope your filesystem is at least 24TB. Or else the new table should be stored on a different server (or ideally several other servers).
It will also take a long time no matter which solution you use. I expect at least 4 weeks, and perhaps longer if you don't have a very powerful server with direct-attached NVMe storage.
If you use remote storage (like Amazon EBS) it may not finish before you retire from your career!
In my opinion, 8TB for a single table is a problem even if you try partitioning. Partitioning doesn't magically fix performance, and could make some queries worse. Do you have experience with querying partitioned tables? And you understand how partition pruning works, and when it doesn't work?
Before you choose partitioning as your solution, I suggest you read the whole chapter on partitioning in the MySQL manual: https://dev.mysql.com/doc/refman/8.0/en/partitioning.html, especially the page on limitations: https://dev.mysql.com/doc/refman/8.0/en/partitioning-limitations.html Then try it out with a smaller table.
A better strategy than partitioning for data at this scale is to split the data into shards, and store each shard on one of multiple database servers. You need a strategy for adding more shards as I assume the data will continue to grow.

Redshift design or configuration issue? - My Redshift datawarehouse seems much slower than my mysql database

I have a Redshift datawarehouse that is pulling data in from multiple sources.
One is my from MySQL and the others are some cloud based databases that get pulled in.
When querying in redshift, the query response is significantly slower than the same mysql table(s).
Here is an example:
SELECT *
FROM leads
WHERE id = 10162064
In mysql this takes .4 seconds. In Redshift it takes 4.4 seconds.
The table has 11 million rows. "id" is indexed in mysql and in redshift it is not since it is a columnar system.
I know that Redshift is a columnar data warehouse (which is relatively new to me) and Mysql is a relational database that is able to utilize indexes. I'm not sure if Redshift is the right tool for us for reporting, or if we need something else. We have about 200 tables in it from 5 different systems and it is currently at 90 GB.
We have a reporting tool sitting on top that does native queries to pull data. They are pretty slow but are also pulling a ton of data from multiple tables. I would expect some slowness with these, but with a simple statement like above, I would expect it to be quicker.
I've tried some different DIST and SORT key configurations but see no real improvement.
I've run vacuum and analyze with no improvement.
We have 4 nodes, dc2.large. Currently only using 14% storage. CPU utilization is frequently near 100%. Database connections averages about 10 at any given time.
The datawarehouse just has exact copies of the tables from our integration with the other sources. We are trying to do near real-time reporting with this.
Just looking for advice on how to improve performance of our redshift via configuration changes, some sort of view or dim table architecture, or any other tips to help me get the most out of redshift.
I've worked with clients on this type of issue many times and I'm happy to help but this may take some back and forth to narrow in on what is happening.
First I'm assuming that "leads" is a normal table, not a view and not an external table. Please correct if this assumption isn't right.
Next I'm assuming that this table isn't very wide and that "select *" isn't contributing greatly to the speed concern. Yes?
Next question is wide this size of cluster for a table of only 11M rows? I'd guess it is that there are other much larger data sets on the database and that this table isn't setting the size.
The first step of narrowing this down is to go onto the AWS console for Redshift and find the query in question. Look at the actual execution statistics and see where the query is spending its time. I'd guess it will be in loading (scanning) the table but you never know.
You also should look at STL_WLM_QUERY for the query in question and see how much wait time there was with the running of this query. Queueing can take time and if you have interactive queries that need faster response times then some WLM configuration may be needed.
It could also be compile time but given the simplicity of the query this seems unlikely.
My suspicion is that the table is spread too thin around the cluster and there are lots of mostly empty blocks being read but this is just based on assumptions. Is "id" the distkey or sortkey for this table? Other factors likely in play are cluster load - is the cluster busy when this query runs? WLM is one place that things can interfere but disk IO bandwidth is a share resource and if some other queries are abusing the disks this will make every query's access to disk slow. (Same is true of network bandwidth and leader node workload but these don't seem to be central to your issue at the moment.)
As I mentioned resolving this will likely take some back and forth so leave comments if you have additional information.
(I am speaking from a knowledge of MySQL, not Redshift.)
SELECT * FROM leads WHERE id = 10162064
If id is indexed, especially if it is a Unique (or Primary) key, 0.4 sec sounds like a long network delay. I would expect 0.004 as a worst-case (with SSDs and `PRIMARY KEY(id)).
(If leads is a VIEW, then let's see the tables. 0.4s may be be reasonable!)
That query works well for a RDBMS, but not for a columnar database. Face it.
I can understand using a columnar database to handle random queries on various columns. See also MariaDB's implementation of "Columnstore" -- that would give you both RDBMS and Columnar in a single package. Still, they are separate enough that you can't really intermix the two technologies.
If you are getting 100% CPU in MySQL, show us the query, its EXPLAIN, and SHOW CREATE TABLE. Often, a better index and/or query formulation can solve that.
For "real time reporting" in a Data Warehouse, building and maintaining Summary Tables is often the answer.
Tell us more about the "exact copy" of the DW data. In some situations, the Summary tables can supplant one copy of the Fact table data.

Fastest way to copy a large MySQL table?

What's the best way to copy a large MySQL table in terms of speed and memory use?
Option 1. Using PHP, select X rows from old table and insert them into the new table. Proceed to next iteration of select/insert until all entries are copied over.
Option 2. Use MySQL INSERT INTO ... SELECT without row limits.
Option 3. Use MySQL INSERT INTO ... SELECT with a limited number of rows copied over per run.
EDIT: I am not going to use mysqldump. The purpose of my question is to find the best way to write a database conversion program. Some tables have changed, some have not. I need to automate the entire copy over / conversion procedure without worrying about manually dumping any tables. So it would be helpful if you could answer which of the above options is best.
There is a program that was written specifically for this task called mysqldump.
mysqldump is a great tool in terms of simplicity and careful handling of all types of data, but it is not as fast as load data infile
If you're copying on the same database, I like this version of Option 2:
a) CREATE TABLE foo_new LIKE foo;
b) INSERT INTO foo_new SELECT * FROM foo;
I've got lots of tables with hundreds of millions of rows (like 1/2B) AND InnoDB AND several keys AND constraints. They take many many hours to read from a MySQL dump, but only an hour or so by load data infile. It is correct that copying the raw files with the DB offline is even faster. It is also correct that non-ASCII characters, binary data, and NULLs need to be handled carefully in CSV (or tab-delimited files), but fortunately, I've pretty much got numbers and text :-). I might take the time to see how long the above steps a) and b) take, but I think they are slower than the load data infile... which is probably because of transactions.
Off the three options listed above.
I would select the second option if you have a Unique constraint on at least one column, therefore not creating duplicate rows if the script has to be run multiple times to achieve its task in the event of server timeouts.
Otherwise your third option would be the way to go, while manually taking into account any server timeouts to determine your insert select limits.
Use a stored procedure
Option two must be fastest, but it's gonna be a mighty long transaction. You should look into making a stored procedure doing the copy. That way you could offload some of the data parsing/handling from the MySQL engine.
MySQL's load data query is faster than almost anything else, however it requires exporting each table to a CSV file.
Pay particular attention to escape characters and representing NULL values/binary data/etc in the CSV to avoid data loss.
If possible, the fastest way will be to take the database offline and simply copy data files on disk.
Of course, this have some requirements:
you can stop the database while copying.
you are using a storage engine that stores each table in individual files, MyISAM does this.
you have privileged access to the database server (root login or similar)
Ah, I see you have edited your post, then I think this DBA-from-hell approach is not an option... but still, it's fast!
The best way i find so far is creating the files as dump files(.txt), by using the outfile to a text then using infile in mysql to get the same data to the database

Can I use multiple servers to increase mysql's data upload performance?

I am in the process of setting up a mysql server to store some data but realized(after reading a bit this weekend) I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and then sending it to a shared queue to process/analyze. The data is about 5 billion rows(although its very small data, an ID number in a column and a dictionary of ints in another). Most of the performance reports I have seen have shown insert speeds of 60 to 100k/second which would take over 10 hours. We need the data in very quickly so we can work on it that day and then we may discard it(or achieve the table to S3 or something).
What can I do? I have 8 servers at my disposal(in addition to the database server), can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time but I'm also thinking maybe I can load the data onto each of them and then somehow try to merge all the separated data into one server?
I was going to use mysql with innodb(I can use any other settings it helps) but its not finalized so if mysql doesn't work is there something else that will(I have used hbase before but was looking for a mysql solution first in case I have problems seems more widely used and easier to get help)?
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple mySQL server instances won't help with loading speed. What will make a difference is fast processor chips and very fast disk IO subsystems on your mySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use a MEMORY access method for your big table, which will be very fast indeed. (But if that will work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using myISAM rather than InnoDB for this table; myISAM's lack of transaction semantics makes it faster to load. myISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the mySQL server to read a file from the mySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also tricker to set up in production: your shared queue needs access to the mySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.

Best way to archive live MySQL database

We have a live MySQL database that is 99% INSERTs, around 100 per second. We want to archive the data each day so that we can run queries on it without affecting the main, live database. In addition, once the archive is completed, we want to clear the live database.
What is the best way to do this without (if possible) locking INSERTs? We use INSERT DELAYED for the queries.
http://www.maatkit.org/ has mk-archiver
archives or purges rows from a table to another table and/or a file. It is designed to efficiently “nibble” data in very small chunks without interfering with critical online transaction processing (OLTP) queries. It accomplishes this with a non-backtracking query plan that keeps its place in the table from query to query, so each subsequent query does very little work to find more archivable rows.
Another alternative is to simply create a new database table each day. MyIsam does have some advantages for this, since INSERTs to the end of the table don't generally block anyway, and there is a merge table type to being them all back together. A number of websites log the httpd traffic to tables like that.
With Mysql 5.1, there are also partition tables that can do much the same.
I use mysql partition tables and I've achieve wonderful results in all aspects.
Sounds like replication is the best solution for this. After the initial sync the slave gets updates via the Binary Log, thus not affecting the master DB at all.
More on replication.
MK-ARCHIVER is a elegant tool to archive MYSQL data.
http://www.maatkit.org/doc/mk-archiver.html
MySQL replication would work perfectly for this.
Master -> the live server.
Slave -> a different server on the same network.
Could you keep two mirrored databases around? Write to one, keep the second as an archive. Switch every, say, 24 hours (or however long you deem appropriate). Into the database that was the archive, insert all of todays activity. Then the two databases should match. Use this as the new live db. Take the archived database and do whatever you want to it. You can backup/extract/read all you want now that its not being actively written to.
Its kind of like having mirrored raid where you can take one drive offline for backup, resync it, then take the other drive out for backup.