I am using Galera Cluster with 3 nodes and I am currently facing the following problem. I want to write more than 500 million records into the database, for example into table Data. Here are the steps:
Create table NewData with the same schema as Data but without indexes.
Write 500 million records into this table (using multiple threads, each writing a batch of records).
After finishing, add the indexes to this table.
Rename Data to OldData and rename NewData to Data.
The problem I currently have is that during the indexing phase, other services cannot write or read data. After I increased innodb_buffer_pool_size, other nodes can read data but still cannot write.
I have configured the write job to run against a different node than the other APIs, but the problem is still the same. I thought that if one node is under a very high workload, the other nodes should still behave normally. Please tell me why this happens and how to fix it.
Thanks
I think you missed a step.
(one-time) Create table NewData with the same schema as Data but without indexes.
Insert into NewData.
Create table Empty (again like Data but without any indexes).
RENAME TABLE NewData TO ToIndex, Empty TO NewData; -- Now the ingestion can proceed.
ALTER TABLE ToIndex ADD INDEX ...
RENAME TABLE Data TO Old, ToIndex TO Data;
The point is to have two things going on:
Continually writing to the unindexed NewData.
Periodically swapping tables around so that the filled table (under a new name) gets indexed and is then used to replace the live table (which is always seen as Data).
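Put together, one full cycle looks roughly like this. This is only a sketch: the index definition (idx_example on some_column) is a placeholder, and whether you drop or keep the previous Old copy is up to you.
-- one-time setup: NewData and Empty have the same columns as Data but no secondary indexes
CREATE TABLE NewData LIKE Data;   -- then drop its secondary indexes, or define it explicitly without them
CREATE TABLE Empty LIKE NewData;

-- repeated cycle; ingestion keeps writing to whatever table is currently named NewData
RENAME TABLE NewData TO ToIndex, Empty TO NewData;        -- atomic, so ingestion continues uninterrupted
ALTER TABLE ToIndex ADD INDEX idx_example (some_column);  -- index the frozen copy
DROP TABLE IF EXISTS Old;                                 -- clear the previous round's copy (or archive it)
RENAME TABLE Data TO Old, ToIndex TO Data;                -- the freshly indexed copy goes live
CREATE TABLE Empty LIKE NewData;                          -- prepare the next empty table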
This is not quite the same situation, but has some similarities: http://mysql.rjweb.org/doc.php/staging_table
Let table_1 be created as follows:
CREATE TABLE table_1 (
id INT AUTO_INCREMENT PRIMARY KEY,
some_blob BLOB
);
Let table_2 be created as follows:
CREATE TABLE table_2 (
id INT AUTO_INCREMENT PRIMARY KEY,
some_blob BLOB
);
What I want to know is, after I run this table-copying query
INSERT INTO table_2 (id, some_blob) SELECT id, some_blob FROM table_1;
will the actual text within each some_blob field of table_1 be duplicated and stored on disk, or will the DB only duplicate pointers to the disk locations containing the BLOB data?
One argument for why BLOB copying must involve duplication of the actual content goes as follows:
Duplication of BLOB content is necessary because changes to BLOB data in table_1 should not also take place in table_2. If only the disk pointers were duplicated then content changes in one table would be reflected in the other table, which violates the properties of a correct copy operation.
Now I present an alternative method the DB could use to satisfy this copy operation, which shows that the above argument is not necessarily true. The DB could duplicate only the disk pointers during the execution of the given INSERT statement; then, whenever an UPDATE modifies the BLOB data in one of the tables, the DB would at that point allocate new space on disk to store the new data from the UPDATE query. A BLOB data segment would then be deleted only when no disk pointers to it remain, and a particular BLOB data segment could potentially have many disk pointers pointing to it.
So which of these strategies does MySQL/MariaDB use when executing the given INSERT statement, or does it use a different strategy?
EDIT: Why I am asking this question
Currently I am running a couple of UPDATE queries which are copying large amounts of BLOB data from one table to another in the same database (over 10 million rows of BLOB data). The queries have been running for a while. I am curious whether the performance is so slow because some of the columns I am comparing are poorly indexed, because these queries are literally copying the content rather than disk pointers, or both.
I use an INSERT in the question's example because this simplifies the concept of database internals that I am trying to understand.
Each table has its own copy of the blob data, and of all other data. MySQL doesn't do shallow copying of data. It's true that blobs are separately allocated objects, but they are not shared between tables. The description of the internals of the storage engine is provided so you can understand what's going on, not so that you can change it (unless you fork the storage engine source and create a new version ... but get your app working first).
So, your UPDATE queries are erasing old blob data and writing new. That's I/O intensive, and so it's probably slow.
Using INSERT as a way of simplifying your question is misleading: writing new blobs to a table is a faster process than overwriting existing ones.
Your best bet in production is to avoid UPDATEs to blob columns.
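One way to put that into practice, sketched with the question's table and column names (the _new/_old suffixes are just illustrative): rather than UPDATEing blob columns in place, write the rewritten rows into a fresh table and swap it in.
-- build a new table containing the rewritten blobs, then swap it into place
CREATE TABLE table_2_new LIKE table_2;
INSERT INTO table_2_new (id, some_blob)
    SELECT id, some_blob FROM table_1;     -- blobs are written fresh, nothing is overwritten in place
RENAME TABLE table_2 TO table_2_old, table_2_new TO table_2;
DROP TABLE table_2_old;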
I have an Excel file that, when downloaded, contains contents from the database. Each row is identified using an identifier called id_number. Users can add new rows to the file with a new unique id_number. When the file is uploaded, for each Excel row:
When the id_number exists in the database, an update is performed on the database row.
When the id_number does not exist in the database, an insert is performed.
Other than the Excel file, data can be added or updated individually using a file called report.php. Users use this page if, for example, they only want to add one record for an employee.
Ideally, I would like to do an INSERT ... ON DUPLICATE KEY UPDATE for maximum performance. I might also put all of it in a transaction. However, I believe this overall process has some flaws:
Before any adds/updates, validation checks have to be done on all Excel rows against their corresponding database rows, because there are many unique columns in the table. That's why I'll have to run some SELECT statements to ensure that the data is valid before performing any add/update. Is this efficient on tables with 500 rows and 69 columns? I could probably just fetch all the data, store it in a PHP array and do the validation check on the array, but what happens if someone adds a new row (with an id_number of 5) through report.php, and the Excel file I uploaded also contains a row with an id_number of 5? That could break my validations, because I cannot be sure my data is up to date without performing a lot of SELECT statements.
Suppose the system is in the middle of a transaction adding/updating the data retrieved from the Excel file, and someone on report.php adds a row because all the validations have been satisfied (e.g. no duplicate id_numbers). Suppose at this point in time the next row to be added from the Excel file and the row that will be added by the user on report.php have the same id_number. What happens then? I don't have much knowledge of transactions; I think they at least prevent two queries from changing a row at the same time? Is that correct?
I don't really mind these kinds of situations that much. But some files have many rows and it might take a long time to process all of them.
One way I've thought of fixing this: while the Excel file upload is processing, prevent users from using report.php to modify the rows currently held by the Excel file. Is this fine?
What could be the best way to fix these problems? I am using MySQL.
If you really need to allow the user to generate their own unique ID, then you could lock the table in question while you're doing your validation and inserting.
If you acquire a write lock, then you can be certain the table isn't changed while you do your work of validation and inserting.
`mysql> LOCK TABLES tbl_name WRITE`
don't forget to
`mysql> UNLOCK TABLES;`
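A rough sketch of the whole sequence, assuming the table is called employees with a unique key on id_number (the table and the name column are made up for illustration):
LOCK TABLES employees WRITE;
-- run the validation SELECTs against employees here; nothing else can change the table meanwhile
-- then apply the spreadsheet rows, for example with an upsert:
INSERT INTO employees (id_number, name)
    VALUES (5, 'Example Name')
    ON DUPLICATE KEY UPDATE name = VALUES(name);
UNLOCK TABLES;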
The downside of locking is obvious: the table is locked. If it is high traffic, then all your traffic is waiting, and that could lead to all kinds of pain (MySQL running out of connections would be a common one).
That said, I would suggest a different path altogether: let MySQL be the only one who generates a unique id. That is, make sure the database table has an auto_increment unique id (primary key) and then have new records in the spreadsheet entered without the unique id given. Then MySQL will ensure that the new records get a unique id, and you don't have to worry about locking and can validate and insert without fear of a collision.
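As a sketch of that approach (the table name and the columns other than the id are invented for illustration):
CREATE TABLE employees (
    id INT AUTO_INCREMENT PRIMARY KEY,   -- MySQL assigns this; the spreadsheet never supplies it
    name VARCHAR(100),
    department VARCHAR(100)
);

-- rows already in the database carry their id in the spreadsheet and are updated:
UPDATE employees SET name = 'Updated Name' WHERE id = 5;

-- new rows are inserted without an id, so they cannot collide with rows added via report.php:
INSERT INTO employees (name, department) VALUES ('New Person', 'Sales');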
Regarding the question about performance with a 500-row, 69-column table, I can only say that if the PHP server and the MySQL server are reasonably sized and the columns aren't large data types, then this amount of data should be handled in a fraction of a second. That said, performance can be sabotaged by one bad line of code, so if your code is slow, I would treat that as a separate optimisation problem.
I have a huge indexed MySQL table (~300-400 GB) that I need to append with new entries from time to time (where the new data takes ~10-20 GB). The raw file with new data may contain mistakes that can be fixed only manually and that become visible only when the processing script reaches them. Also, the new data should be available in the main DB only after the full processing of the raw data is finished. So, to not screw up the main table, I decided on the following workflow:
The script creates a temporary table with a structure identical to the main table and fills it.
Once that is done and verified, the temporary table is inserted into the main one:
INSERT INTO main_table (all_fields_except_primary_key) SELECT all_fields_except_primary_key FROM new_table;
And this procedure is extremely slow, as I understand because of the indexing of the new rows.
I have read that inserting into indexed tables is very slow in general, and some professionals suggest DROPping the indexes before inserting a big amount of data and then indexing again. But with such huge data, re-indexing the whole table takes very long (much longer than my naive INSERT INTO .. SELECT ..) and, what is more important, the main table can hardly be used during it (without indexes, SELECTs take ages).
So I had the idea of indexing the temporary table before inserting (since that is very fast) and then doing a merge that combines both indexes.
Is it possible somehow in MySQL?
And another question: is there perhaps another workaround for my task?
I have a MySQL database. I have a lot of records (about 4,000,000,000 rows) and I want to process them in order to reduce them (to about 1,000,000,000 rows).
Assume I have the following tables:
table RawData: I have more than 5000 rows per second that I want to insert into RawData.
table ProcessedData: this table is the processed (aggregated) storage for the rows that were inserted into RawData.
minimum row count > 20,000,000
table ProcessedDataDetail: here I write the details of table ProcessedData (the data that was aggregated).
Users want to view and search the ProcessedData table, which requires joining more than 8 other tables.
Inserting into RawData and searching in ProcessedData (ProcessedData INNER JOIN ProcessedDataDetail INNER JOIN ...) are very slow. I use a lot of indexes; assume my data length is 1G but my index length is 4G :). (I want to get rid of these indexes; they slow down my process.)
How can I increase the speed of this process?
I think I need a shadow table of ProcessedData; call it ProcessedDataShadow. I would then process RawData, aggregate it against ProcessedDataShadow, and then insert the result into ProcessedDataShadow and ProcessedData. What do you think?
(I am developing the project in C++.)
thank you in advance.
Without knowing more about what your actual application is, I have these suggestions:
Use InnoDB if you aren't already. InnoDB uses row-level locks and is much better at handling concurrent updates/inserts. It will be slower if you don't work concurrently, but the row-level locking is probably a must-have for you, depending on how many sources you will have for RawData.
Indexes usually speed things up, but badly chosen indexes can make things slower. I don't think you want to get rid of them, but a lot of indexes can make inserts very slow. It is possible to disable indexes when inserting batches of data, in order to avoid updating the indexes on every single insert (see the sketch after these suggestions).
If you will be selecting huge amounts of data in a way that might disturb the data collection, consider using a replicated slave database server that you use only for reading. Even if that locks rows/tables, the primary (master) database won't be affected, and the slave will get back up to speed as soon as it is free to do so.
Do you need to process data in the database? If possible, maybe collect all data in the application and only insert ProcessedData.
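On the index point above, a sketch of the usual batch-insert settings. MyISAM can postpone non-unique index maintenance with DISABLE/ENABLE KEYS; for InnoDB the common approach is to relax checks and commit in batches. RawData stands in for whatever table you are loading.
-- MyISAM: postpone non-unique index maintenance during the bulk load
ALTER TABLE RawData DISABLE KEYS;
-- ... batch INSERTs ...
ALTER TABLE RawData ENABLE KEYS;

-- InnoDB: group the inserts into transactions with relaxed checks
SET unique_checks = 0;
SET foreign_key_checks = 0;
START TRANSACTION;
-- ... batch INSERTs ...
COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;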
You've not said what the structure of the data is, how it's consolidated, how promptly data needs to be available to users, nor how lumpy the consolidation process can be.
However the most immediate problem will be sinking 5000 rows per second. You're going to need a very big, very fast machine (probably a sharded cluster).
If possible I'd recommend writing a consolidating buffer (using an in-memory hash table - not in the DBMS) to put the consolidated data into - even if it's only partially consolidated - then update from this into the processedData table rather than trying to populate it directly from the rawData.
Indeed, I'd probably consider separating the raw and consolidated data onto separate servers/clusters (the MySQL FEDERATED engine is handy for providing a unified view of the data).
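For reference, a FEDERATED table is defined roughly like this. The connection string, host, credentials and columns are placeholders, the column definitions must match the remote table, and the FEDERATED engine has to be enabled on the server.
-- local proxy table pointing at the ProcessedData table on another server
CREATE TABLE ProcessedData_remote (
    id INT NOT NULL,
    total BIGINT,
    PRIMARY KEY (id)
) ENGINE=FEDERATED
  CONNECTION='mysql://fed_user:fed_pass@consolidated-host:3306/mydb/ProcessedData';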
Have you analysed your queries to see which indexes you really need? (hint - this script is very useful for this).
I have a C program that mines a huge data source (20 GB of raw text) and generates loads of INSERTs to execute on a simple blank table (4 integer columns with 1 primary key). Set up as a MEMORY table, the entire task completes in 8 hours. After finishing, about 150 million rows exist in the table. Eight hours is a completely decent number for me. This is a one-time deal.
The problem comes when trying to convert the MEMORY table back into MyISAM so that (A) I'll have the memory freed up for other processes and (B) the data won't be killed when I restart the computer.
ALTER TABLE memtable ENGINE = MyISAM
I've let this ALTER TABLE query run for over two days now, and it's not done. I've now killed it.
If I create the table initially as MyISAM, the write speed seems terribly poor (especially because the query requires the ON DUPLICATE KEY UPDATE technique). I can't temporarily turn off the keys: the table would become over 1000 times larger if I did, and then I'd have to reprocess the keys and essentially run a GROUP BY on 150,000,000,000 rows. Umm, no.
One of the key constraints to realize: The INSERT query UPDATEs records if the primary key (a hash) exists in the table already.
At the very beginning of an attempt at strictly using MyISAM, I'm getting a rough speed of 1,250 rows per second. Once the index grows, I imagine this rate will tank even more.
I have 16GB of memory installed in the machine. What's the best way to generate a massive table that ultimately ends up as an on-disk, indexed MyISAM table?
Clarification: There are many, many UPDATEs going on from the query (INSERT ... ON DUPLICATE KEY UPDATE val=val+whatever). This isn't, by any means, a raw dump problem. My reasoning for trying a MEMORY table in the first place was for speeding-up all the index lookups and table-changes that occur for every INSERT.
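To make the pattern concrete (the column names a, b and the literal values are only illustrative; the real table is 4 integer columns keyed by the hash):
CREATE TABLE memtable (
    hash INT NOT NULL PRIMARY KEY,   -- hash of the mined record
    a INT NOT NULL,
    b INT NOT NULL,
    val INT NOT NULL
) ENGINE=MEMORY;

INSERT INTO memtable (hash, a, b, val)
VALUES (123456, 1, 2, 10)
ON DUPLICATE KEY UPDATE val = val + VALUES(val);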
If you intend to make it a MyISAM table, why are you creating it in memory in the first place? If it's only for speed, I think the conversion to a MyISAM table is going to negate any speed improvement you get by creating it in memory to start with.
You say inserting directly into an "on disk" table is too slow (though I'm not sure how you're deciding that, when your current method is taking days). You may be able to turn off or remove the uniqueness constraints, then use a DELETE query later to re-establish uniqueness, and then re-enable/add the constraints. I have used this technique when importing into an InnoDB table in the past, and found that even with the later delete it was overall much faster.
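One common shape of that later DELETE, assuming the rows land in a staging table that has a surrogate auto_increment id to pick a survivor (in your case the duplicates also need their values summed, so this only shows the general idea; the table and column names are invented):
-- keep the lowest id per hash, delete the other duplicates
DELETE s1
FROM staging s1
JOIN staging s2
  ON s1.hash = s2.hash
 AND s1.id > s2.id;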
Another option might be to create a CSV file instead of the INSERT statements, and either load it into the table using LOAD DATA INFILE (I believe that is faster than the inserts, but I can't find a reference at present) or use it directly via the CSV storage engine, depending on your needs.
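A minimal LOAD DATA sketch (the file path and column list are placeholders; note that LOAD DATA can only REPLACE or IGNORE duplicate keys, not add to them, so for the val=val+... case it would have to go through a staging table first):
LOAD DATA INFILE '/tmp/mined_rows.csv'
INTO TABLE memtable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(hash, a, b, val);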
Sorry to keep throwing comments at you (last one, probably).
I just found this article which provides an example of converting a large table from MyISAM to InnoDB. While this isn't exactly what you are doing, the author uses an intermediate MEMORY table and describes going from MEMORY to InnoDB in an efficient way: ordering the table in memory the way that InnoDB expects it to be ordered in the end. If you aren't tied to MyISAM, it might be worth a look, since you already have a "correct" MEMORY table built.
I don't use MySQL, but I use SQL Server, and this is the process I use to handle a file of similar size. First I dump the file into a staging table that has no constraints. Then I identify and delete the dups from the staging table. Then I search for existing records that might match and put the id field into a column in the staging table. Then I update where the id field column is not null and insert where it is null.
One of the reasons I do all the work of getting rid of the dups in the staging table is that it means less impact on the prod table when I run it, and thus it is faster in the end. My whole process runs in less than an hour (and actually does much more than I describe, as I also have to denormalize and clean the data) and affects the production tables for less than 15 minutes of that time. I don't have to worry about adjusting any constraints or dropping indexes or any of that, since I do most of my processing before I hit the prod table.
Consider whether a similar process might work better for you. Also, could you use some sort of bulk import to get the raw data into the staging table (I pull the 22 GB file I have into staging in around 16 minutes) instead of working row by row?
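In rough MySQL terms (since the question is about MySQL), that flow would look something like the following sketch. The staging and prod table names, the hash/val columns, and the row_id/prod_id helper columns are all invented for illustration.
-- 1. bulk-load the raw file into an unconstrained staging table
LOAD DATA INFILE '/tmp/raw_dump.csv' INTO TABLE staging
    FIELDS TERMINATED BY ',' (hash, val);

-- 2. remove duplicates inside staging (keep one row per hash)
DELETE s1 FROM staging s1
JOIN staging s2 ON s1.hash = s2.hash AND s1.row_id > s2.row_id;

-- 3. mark staging rows that already exist in the production table
UPDATE staging s JOIN prod p ON p.hash = s.hash SET s.prod_id = p.id;

-- 4. update the matched rows, insert the rest
UPDATE prod p JOIN staging s ON p.id = s.prod_id SET p.val = s.val;
INSERT INTO prod (hash, val)
    SELECT hash, val FROM staging WHERE prod_id IS NULL;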