What data is duplicated when MySQL/MariaDB BLOB columns are copied? - mysql

Let table_1 be created as follows:
CREATE TABLE table_1 (
id INT AUTO_INCREMENT PRIMARY KEY,
some_blob BLOB
);
Let table_2 be created as follows:
CREATE TABLE table_2 (
id INT AUTO_INCREMENT PRIMARY KEY,
some_blob BLOB
);
What I want to know is, after I run this table-copying query
INSERT INTO table_2 (id, some_blob) SELECT id, some_blob FROM table_1;
will the actual text within each some_blob field of the table_1 table be duplicated and stored on disk, or will the DB have only duplicated pointers to the disk locations containing the BLOB data?
One argument for why BLOB copying must involve the duplication of actual content reasons as follows:
Duplication of BLOB content is necessary because changes to BLOB data in table_1 should not also take place in table_2. If only the disk pointers were duplicated then content changes in one table would be reflected in the other table, which violates the properties of a correct copy operation.
Now I present an alternative method that the DB could implement to satisfy this copy operation; this alternative shows the above argument is not necessarily true. The DB could duplicate only the disk pointers during the execution of the given INSERT statement. Then, whenever an UPDATE seeks to modify the BLOB data in one of the tables, the DB would allocate more space on disk at that point to store the new data from the UPDATE query. A BLOB data segment would then be deleted only when no disk pointers to it remain, and a particular BLOB data segment could potentially have many disk pointers pointing to it.
So which of these strategies does MySQL/MariaDB use when executing the given INSERT statement, or does it use a different strategy?
EDIT: Why I am asking this question
Currently I am running a couple of UPDATE queries which are copying large amounts of BLOB data from one table to another in the same database (over 10 million rows of BLOB data). The queries have been running for a while. I am curious about whether the performance is so slow because some of the columns I am comparing are poorly indexed, because these queries are literally copying over the content instead of disk pointers, or perhaps because of both of these reasons.
I use an INSERT in the question's example because this simplifies the concept of database internals that I am trying to understand.
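(For illustration only, a cross-table BLOB copy done via UPDATE, as described above, might look like the following; the join condition is a guess and not the actual query being run.)
-- Hypothetical shape of the running copy queries (names and join are placeholders).
UPDATE table_2 t2
JOIN table_1 t1 ON t1.id = t2.id
SET t2.some_blob = t1.some_blob;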

Each table has its own copy of the blob data, and of all other data. MySQL doesn't do shallow copying of data. It's true that blobs are separately allocated objects, but they are not shared between tables. The description of the internals of the storage engine is provided so you can understand what's going on, not so that you can change it (unless you fork the storage engine source and create a new version ... but get your app working first).
So, your UPDATE queries are erasing old blob data and writing new. That's I/O intensive, and so it's probably slow.
Using INSERT as a way of simplifying your question is incorrect. Writing new blobs to a table is a faster process than overwriting existing ones.
Your best bet in production is to avoid UPDATEs to blob columns.
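If the data really has to be copied, one pattern to consider (just a sketch; it assumes table_2 can be rebuilt wholesale rather than patched row by row) is to build a fresh table with INSERT ... SELECT and then swap it in, instead of overwriting existing blob values in place:
-- Build a fresh copy instead of updating blobs in place (table names are placeholders).
CREATE TABLE table_2_new LIKE table_2;
INSERT INTO table_2_new (id, some_blob)
SELECT id, some_blob FROM table_1;
RENAME TABLE table_2 TO table_2_old, table_2_new TO table_2;
DROP TABLE table_2_old;
Writing new blobs into an empty table avoids the erase-and-rewrite cycle described above, at the cost of briefly keeping two copies on disk.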

Related

Galera: Cannot write to database when indexing large data

I am using Galera Cluster with 3 nodes. I am currently running into the following problem: I want to write more than 500 million records into the database, for example into table Data. These are the steps:
Create table NewData with the same schema as Data but without indexes.
Write 500 million records into this table (using multiple threads to write; each thread writes a batch of records).
After finishing, add indexes to this table.
Rename Data to OldData and rename NewData to Data.
The problem I currently have is that during the indexing phase, other services cannot write/read data. After I increased innodb_buffer_pool_size, other nodes can read data but still cannot write.
I have configured the write job to run against a different node than the other APIs, but the problem is still the same. I think that if one node is under a very high workload, the other nodes should still behave normally. Please tell me why this happens and how to fix it.
Thanks
I think you missed a step.
(one-time) Create table NewData with schema as Data but without index.
Insert into NewData.
Create table Empty (again like Data but without any index)
RENAME TABLE NewData TO ToIndex, Empty TO NewData; -- Now the ingestion can proceed.
ALTER TABLE ToIndex ADD INDEX ...
RENAME TABLE Data TO Old, ToIndex TO Data;
The point is to have two things going on:
Continually writing to the unindexed NewData.
Swap the tables around so that, periodically, that table (under a new name) gets indexed and is then used to replace the live table (which is always seen as Data).
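Put together, one rotation of that cycle might look roughly like the following; the column list and the index definition are hypothetical stand-ins for the real schema:
-- One-time setup: an unindexed ingestion table plus a spare empty copy.
CREATE TABLE NewData (
  id BIGINT NOT NULL PRIMARY KEY,   -- keep a primary key (Galera needs one), but no secondary indexes
  payload VARCHAR(255)
);
CREATE TABLE Empty LIKE NewData;
-- Ingestion threads keep inserting into NewData; then, periodically:
RENAME TABLE NewData TO ToIndex, Empty TO NewData;    -- ingestion continues into the fresh table
ALTER TABLE ToIndex ADD INDEX idx_payload (payload);  -- index the rotated-out batch
RENAME TABLE Data TO Old, ToIndex TO Data;            -- swap it in as the live table
-- (create a fresh Empty table again before the next rotation)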
This is not quite the same situation, but has some similarities: http://mysql.rjweb.org/doc.php/staging_table

Maintain data integrity and consistency when performing sql batch insert/update with unique columns

I have an Excel file whose contents come from the database when it is downloaded. Each row is identified using an identifier called id_number. Users can add new rows to the file with a new unique id_number. When it is uploaded, for each Excel row:
When the id_number exist on the database, an update is performed on the database row.
When the id_number does not exist on the database, an insert is performed on the database row.
Other than the Excel file, data can be added or updated individually using a file called report.php. Users use this page if they only want to add a single record for an employee, for example.
Ideally, I would like to do an INSERT ... ON DUPLICATE KEY UPDATE for maximum performance. I might also put all of them in a transaction. However, I believe this overall process has some flaws:
Before any adds/updates, validation checks have to be done on all Excel rows against their corresponding database rows. The reason is that there are many unique columns in the table, so I'll have to do some SELECT statements to ensure that the data is valid before performing any add/update. Is this efficient on a table with 500 rows and 69 columns? I could probably just fetch all the data, store it in a PHP array, and do the validation check on the array, but what happens if someone adds a new row (with an id_number of 5) through report.php, and the Excel file I uploaded also contains a row with an id_number of 5? That could destroy my validations, because I cannot be sure my data is up to date without performing a lot of SELECT statements.
Suppose the system is in the middle of a transaction adding/updating the data retrieved from the Excel file, and then someone on report.php adds a row because all the validations have been satisfied (e.g. no duplicate id_numbers). Suppose at this point in time the next row to be added from the Excel file and the row that will be added by the user on report.php have the same id_number. What happens then? I don't have much knowledge of transactions; I think they at least prevent two queries from changing a row at the same time? Is that correct?
I don't really mind these kinds of situations that much. But some files have many rows and it might take a long time to process all of them.
One way I've thought of fixing this is: while the excel file upload is processing, I'll have to prevent users using report.php to modify the rows currently held by the excel file. Is this fine?
What could be the best way to fix these problems? I am using mysql.
If you really need to allow the user to generate their own unique ID, then you could lock the table in question while you're doing your validation and inserting.
If you acquire a write lock, then you can be certain the table isn't changed while you do your work of validation and inserting.
`mysql> LOCK TABLES tbl_name WRITE`
don't forget to
`mysql> UNLOCK TABLES;`
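Putting the locking approach together with the desired upsert, the upload handler might issue something like this (table and column names are placeholders, and id_number is assumed to have a UNIQUE index):
LOCK TABLES employee_data WRITE;
-- validation queries against the unique columns go here, e.g.:
SELECT id_number FROM employee_data WHERE id_number IN (5, 6, 7);
-- then one upsert per validated spreadsheet row:
INSERT INTO employee_data (id_number, department)
VALUES (5, 'Accounting')
ON DUPLICATE KEY UPDATE department = VALUES(department);
UNLOCK TABLES;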
The downside with locking is obvious: the table is locked. If it is high traffic, then all your traffic is waiting, and that could lead to all kinds of pain (MySQL running out of connections would be one common one).
That said, I would suggest a different path altogether: let MySQL be the only one who generates a unique ID. That is, make sure the database table has an auto_increment unique ID (primary key) and then have new records in the spreadsheet entered without the unique ID given. Then MySQL will ensure that the new records get a unique ID, and you don't have to worry about locking and can validate and insert without fear of a collision.
In regards to the question about performance with a 500-row, 69-column table, I can only say that if the PHP server and the MySQL server are reasonably sized and the columns aren't large data types, then this amount of data should be readily handled in a fraction of a second. That said, performance can be sabotaged by one bad line of code, so if your code is slow to perform, I would take that as a separate optimisation problem.

Optimization of merging huge tables in MySQL

I have a huge indexed MySQL table (~300-400 GB) that I need to append with new entries from time to time (where the new data takes ~10-20 GB). The raw file with the new data may contain mistakes that can be fixed only manually and that are visible only when the processing script reaches them. Also, the new data should be available in the main DB only after the full processing of the raw data is finished. So, to not screw up the main table, I decided on the following workflow:
The script creates a temporary table with a structure identical to the main table and fills it.
Once it is done and verified, the temporary table is inserted into the main one:
INSERT INTO main_table (all_fields_except_primary_key) SELECT all_fields_except_primary_key FROM new_table;
And this procedure is extremely slow, as I understand it, due to indexing the new rows.
I have read that inserting into indexed tables is very slow in general, and some professionals suggest dropping the indexes before inserting a big amount of data and then indexing again. But with such huge data, indexing the whole table takes very long (much longer than my naive INSERT INTO .. SELECT ..) and, what is more important, the main table can hardly be used during it (without indexes, SELECTs take ages).
So I had the idea of indexing the temporary table before inserting (since that is very fast) and then doing a merge that combines both indexes.
Is it possible somehow in MySQL?
And another question: is there perhaps another workaround for my task?

Merging auto-increment table data

I have multiple end-user MySQL DBs with a fairly large amount of data that must be synchronized with a database (also MySQL) populated by an external data feed. End users can add data to their "local" DB, but not to the feed.
The question is how to merge/synchronize the two databases including the foreign keys between the tables of the DBs, without either overwriting the "local" additions or changing the key of the local additions.
Things I've considered include: using a CSV dump of the feed DB and doing a LOAD DATA INFILE with IGNORE, then comparing the files to see which rows from the feed didn't get written and writing them manually; and writing some script to go line by line through the feed DB and create new rows in the local DBs, creating new keys at the same time. However, this seems like it could be very slow, particularly with multiple DBs.
Any thoughts on this? If there was a way to merge these DBs, preserving the keys with a sort of load infile simplicity and speed, that would be ideal.
Use a compound primary key.
primary key(id, source_id)
Make each db use a different value for source_id. That way you can copy database contents around without having PK clashes.
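As a sketch (the table name and the payload column are hypothetical; only id and source_id come from the suggestion above):
-- Compound primary key so rows from different databases never clash.
CREATE TABLE local_items (
  id INT NOT NULL AUTO_INCREMENT,
  source_id TINYINT NOT NULL,   -- a fixed, distinct value per database
  payload VARCHAR(255),
  PRIMARY KEY (id, source_id)
);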
One option would be to use GUIDs rather than integer keys, but it may not be practical to make such a significant change.
Assuming that you're just updating the user databases from the central "feed" database, I'd use CSV and LOAD INFILE, but load into a staging table within the target database. You could then replace the keys with new values, and finally insert the rows into the permanent tables.
If you're not dealing with huge data volumes, it could be as simple as finding the difference between the highest ID of the existing data and the lowest ID of the incoming data. Add this amount to all of the keys in your incoming data, and there should be no collisions. This would waste some PK values, but that's probably not worth worrying about unless your record count is in the millions. This assumes that your PKs are integers and sequential.
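A sketch of that offset trick, assuming the incoming feed rows sit in a staging table with the same column layout and both tables use integer keys (all names here are placeholders):
-- Shift incoming keys past the highest existing key.
SET @offset = (SELECT COALESCE(MAX(id), 0) FROM local_table)
            - (SELECT MIN(id) FROM feed_staging) + 1;
-- Update from highest id to lowest so the shifted values never collide mid-update.
UPDATE feed_staging SET id = id + @offset ORDER BY id DESC;
-- Apply the same offset to any foreign key columns that reference these ids, then move the rows across.
INSERT INTO local_table SELECT * FROM feed_staging;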

Generating a massive 150M-row MySQL table

I have a C program that mines a huge data source (20 GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially due to the fact that the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys: the table would become over 1000 times larger if I did, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was to speed up all the index lookups and table changes that occur for every INSERT.
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding it is when your current method is taking days), but you may be able to turn off or remove the uniqueness constraints, then use a DELETE query later to re-establish uniqueness, then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was overall much faster.
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
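For example, if the C program emitted a tab-separated file instead of INSERT statements, the load might look something like this (the file path and the column names are made up; memtable is the table name from the question):
LOAD DATA INFILE '/tmp/mined_rows.tsv'
INTO TABLE memtable
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
(hash, col2, col3, val);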
Sorry to keep throwing comments at you (last one, probably).
I just found this article which provides an example of converting a large table from MyISAM to InnoDB; while this isn't what you are doing, he uses an intermediate MEMORY table and describes going from memory to InnoDB in an efficient way: ordering the table in memory the way that InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM it might be worth a look, since you already have a "correct" memory table built.
I don't use MySQL but use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null. One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 gig file I have into staging in around 16 minutes) instead of working row-by-row?
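Translated into MySQL terms, that staging flow might look roughly like this for the table in the question (the column names hash and val are borrowed from the question's description; everything else is a placeholder):
-- 1. Collapse duplicates inside the staging table first (the val accumulation happens here).
CREATE TABLE staging_dedup AS
SELECT hash, SUM(val) AS val
FROM staging
GROUP BY hash;
-- 2. Update rows that already exist in the production table.
UPDATE prod p
JOIN staging_dedup s ON s.hash = p.hash
SET p.val = p.val + s.val;
-- 3. Insert rows that do not exist yet.
INSERT INTO prod (hash, val)
SELECT s.hash, s.val
FROM staging_dedup s
LEFT JOIN prod p ON p.hash = s.hash
WHERE p.hash IS NULL;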