I have multiple end-user MySQL DBs with a fairly large amount of data that must be synchronized with a database (also MySQL) populated by an external data feed. End users can add data to their "local" DB, but not to the feed.
The question is how to merge/synchronize the two databases, including the foreign keys between their tables, without overwriting the "local" additions or changing their keys.
Things I've considered include: doing a CSV dump of the feed DB and a LOAD DATA INFILE with IGNORE, then comparing the files to see which rows from the feed didn't get written and writing those manually; or writing a script that goes row by row through the feed DB and creates new rows in the local DBs, generating new keys as it goes. However, this seems like it could be very slow, particularly with multiple DBs.
Any thoughts on this? If there were a way to merge these DBs, preserving the keys, with something like LOAD DATA INFILE simplicity and speed, that would be ideal.
Use a compound primary key.
primary key(id, source_id)
Make each db use a different value for source_id. That way you can copy database contents around without having PK clashes.
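A minimal sketch of what that could look like, assuming a hypothetical items table (the table and column names are placeholders):
CREATE TABLE items (
  id INT NOT NULL,
  source_id INT NOT NULL,      -- e.g. 1 = feed, 2 and up = each local DB
  name VARCHAR(255),
  PRIMARY KEY (id, source_id)
);
Any table with a foreign key into items would then carry both columns, e.g. FOREIGN KEY (item_id, item_source_id) REFERENCES items (id, source_id).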
One option would be to use GUIDs rather than integer keys, but it may not be practical to make such a significant change.
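If you did go that route, a rough sketch (items and name are placeholder names; MySQL's UUID() generates the value, and BINARY(16) is a more compact alternative to CHAR(36)):
CREATE TABLE items (
  id CHAR(36) NOT NULL,
  name VARCHAR(255),
  PRIMARY KEY (id)
);
INSERT INTO items (id, name) VALUES (UUID(), 'example row');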
Assuming that you're just updating the user databases from the central "feed" database, I'd use CSV and LOAD INFILE, but load into a staging table within the target database. You could then replace the keys with new values, and finally insert the rows into the permanent tables.
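A rough sketch of that flow, assuming a staging table feed_staging and a permanent table items with an AUTO_INCREMENT key (all names and the file path are placeholders):
-- Staging table with the same layout as the target
CREATE TABLE feed_staging LIKE items;

LOAD DATA INFILE '/tmp/feed.csv'
INTO TABLE feed_staging
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';

-- Insert into the permanent table, letting it assign fresh keys; keep
-- feed_staging.id around if child tables' foreign keys need remapping.
INSERT INTO items (name, description)
SELECT name, description
FROM feed_staging;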
If you're not dealing with huge data volumes, it could be as simple as finding the difference between the highest ID of the existing data and the lowest ID of the incoming data. Add this amount to all of the keys in your incoming data, and there should be no collisions. This would waste some PK values, but that's probably not worth worrying about unless your record count is in the millions. This assumes that your PKs are integers and sequential.
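As a sketch, assuming integer AUTO_INCREMENT keys and the incoming data sitting in a staging table (feed_staging and items are placeholder names):
-- Offset large enough that no incoming key collides with an existing one
SELECT @offset := (SELECT MAX(id) FROM items) - (SELECT MIN(id) FROM feed_staging) + 1;

-- Shift the incoming keys (and any foreign keys pointing at them), then insert
UPDATE feed_staging SET id = id + @offset;
INSERT INTO items (id, name, description)
SELECT id, name, description FROM feed_staging;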
I am using Galera Cluster with 3 nodes and I am currently facing the following problem. I want to write more than 500 million records into the database, for example into table Data. Here are the steps:
Create table NewData with the same schema as Data but without indexes.
Write the 500 million records into this table (using multiple threads, each writing a batch of records).
After finishing, add the indexes to this table.
Rename Data to OldData and rename NewData to Data.
The problem I currently have is that during the indexing phase, other services cannot read or write data. After I increased innodb_buffer_pool_size, the other nodes can read data but still cannot write.
I have configured the write job to use a different node than the other APIs, but the problem is still the same. I would think that if one node is under a very heavy workload, the other nodes should still behave normally. Please tell me why this happens and how to fix it.
Thanks
I think you missed a step.
(one-time) Create table NewData with schema as Data but without index.
Insert into NewData.
Create table Empty (again like Data but without any index)
RENAME TABLE NewData TO ToIndex, Empty TO NewData; -- Now the ingestion can proceed.
ALTER TABLE ToIndex ADD INDEX ...
RENAME TABLE Data TO Old, ToIndex TO Data;
The point is to have two things going on:
Continually writing to the unindexed NewData.
Swap tables around so that periodically that table (under a new name) gets indexed, and then used to replace the live table (which is always seen as Data).
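Sketched out as SQL (ix_example and some_col stand in for whatever indexes Data actually needs):
-- One-time: NewData has the same columns as Data but no indexes
-- (CREATE TABLE ... LIKE copies the indexes, so either drop them afterwards
--  or write the column list out by hand)
CREATE TABLE NewData LIKE Data;

-- Ingestion keeps inserting into NewData...

-- Each cycle:
CREATE TABLE Empty LIKE NewData;                      -- another empty, unindexed table
RENAME TABLE NewData TO ToIndex, Empty TO NewData;    -- ingestion carries on into the fresh NewData
ALTER TABLE ToIndex ADD INDEX ix_example (some_col);  -- hypothetical index; use your real ones
RENAME TABLE Data TO Old, ToIndex TO Data;            -- the indexed copy becomes the live table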
This is not quite the same situation, but has some similarities: http://mysql.rjweb.org/doc.php/staging_table
Let table_1 be created as follows:
CREATE TABLE table_1 (
id INT AUTO_INCREMENT PRIMARY KEY,
some_blob BLOB
);
Let table_2 be created as follows:
CREATE TABLE table_2 (
id INT AUTO_INCREMENT PRIMARY KEY,
some_blob BLOB
);
What I want to know is, after I run this table-copying query
INSERT INTO table_2 (id, some_blob) SELECT id, some_blob FROM table_1;
will the actual text within each some_blob field of the table_1 table be duplicated and stored on disk, or will the DB have only duplicated pointers to the disk locations containing the BLOB data?
One argument for why BLOB copying must involve duplicating the actual content goes as follows:
Duplication of BLOB content is necessary because changes to BLOB data in table_1 should not also take place in table_2. If only the disk pointers were duplicated then content changes in one table would be reflected in the other table, which violates the properties of a correct copy operation.
Now I present an alternative method the DB could use to satisfy this copy operation, which shows that the above argument is not necessarily true. The DB could duplicate only the disk pointers during the execution of the given INSERT statement; then, whenever an UPDATE modifies the BLOB data in one of the tables, the DB would allocate new space on disk only for the new data supplied by that UPDATE. A BLOB data segment would be deleted only when no disk pointers to it remain, and a given segment could have many disk pointers referring to it.
So which of these strategies does MySQL/MariaDB use when executing the given INSERT statement, or does it use a different strategy?
EDIT: Why I am asking this question
Currently I am running a couple of UPDATE queries that copy large amounts of BLOB data from one table to another in the same database (over 10 million rows of BLOB data). The queries have been running for a while. I am curious whether the performance is so slow because some of the columns I am comparing are poorly indexed, because these queries literally copy the content rather than just disk pointers, or both.
I use an INSERT in the question's example because this simplifies the concept of database internals that I am trying to understand.
Each table has its own copy of the blob data, and of all other data. MySQL doesn't do shallow copying of data. It's true that blobs are separately allocated objects, but they are not shared between tables. The description of the internals of the storage engine is provided so you can understand what's going on, not so that you can change it (unless you fork the storage engine source and create a new version ... but get your app working first).
So, your UPDATE queries are erasing old blob data and writing new. That's I/O intensive, and so it's probably slow.
Using INSERT as a way of simplifying your question is incorrect: writing new blobs to a table is a faster process than overwriting existing ones.
Your best bet in production is to avoid UPDATEs to blob columns.
I have an Excel file that contains contents from the database when downloaded. Each row is identified by an identifier called id_number. Users can add new rows to the file with a new unique id_number. When it is uploaded, then for each Excel row:
When the id_number exists in the database, an update is performed on that database row.
When the id_number does not exist in the database, an insert is performed.
Other than the Excel file, data can be added or updated individually through a page called report.php. Users use this page if they only want to add a single record for an employee, for example.
Ideally, I would like to do an INSERT ... ON DUPLICATE KEY UPDATE for maximum performance. I might also put all of them in a transaction. However, I believe this overall process has some flaws:
Before any adds/updates, validation checks have to be done on all Excel rows against their corresponding database rows, because there are many unique columns in the table. That's why I'll have to run some SELECT statements to ensure the data is valid before performing any add/update. Is this efficient on a table with 500 rows and 69 columns? I could probably fetch all the data, store it in a PHP array, and do the validation checks against the array, but what happens if someone adds a new row (with an id_number of 5) through report.php while the Excel file I uploaded also contains a row with id_number 5? That could break my validations, because I cannot be sure my data is up to date without running a lot of SELECT statements.
Suppose the system is in the middle of a transaction adding/updating the data retrieved from the Excel file, and someone on report.php adds a row because all the validations have been satisfied (e.g. no duplicate id_numbers). Suppose at this point the next row to be added from the Excel file and the row about to be added via report.php have the same id_number. What happens then? I don't have much knowledge of transactions; I think they at least prevent two queries from changing a row at the same time. Is that correct?
I don't really mind these kinds of situations that much. But some files have many rows and it might take a long time to process all of them.
One way I've thought of fixing this: while the Excel file upload is processing, prevent users on report.php from modifying the rows currently held by the Excel file. Is this a reasonable approach?
What would be the best way to fix these problems? I am using MySQL.
If you really need to allow the user to generate their own unique ID, then you could lock the table in question while you're doing your validation and inserting.
If you acquire a write lock, then you can be certain the table isn't changed while you do your work of validation and inserting.
`mysql> LOCK TABLES tbl_name WRITE`
don't forget to
`mysql> UNLOCK TABLES;`
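Put together, the locked section might look something like this (employees is a placeholder table name, id_number is assumed to have a UNIQUE or PRIMARY key, and the validation SELECT stands in for whatever checks you actually need):
LOCK TABLES employees WRITE;

-- validation queries; nobody else can modify employees while we hold the lock
SELECT COUNT(*) FROM employees WHERE id_number = 5;

-- the actual add/update
INSERT INTO employees (id_number, name) VALUES (5, 'example')
ON DUPLICATE KEY UPDATE name = VALUES(name);

UNLOCK TABLES;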
The downside with locking is obvious: the table is locked. If it sees high traffic, then all that traffic is waiting, and that can lead to all kinds of pain (MySQL running out of connections would be a common one).
That said, I would suggest a different path altogether: let MySQL be the only one who generates a unique id. That is, make sure the database table has an auto_increment unique id (primary key), and then have new records in the spreadsheet entered without the unique id given. Then MySQL will ensure that the new records get a unique id, and you don't have to worry about locking and can validate and insert without fear of a collision.
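A minimal sketch of that, again with employees as a placeholder table:
CREATE TABLE employees (
  id_number INT NOT NULL AUTO_INCREMENT,   -- MySQL hands out the identifier
  name VARCHAR(255),
  PRIMARY KEY (id_number)
);

-- New spreadsheet rows omit id_number entirely; MySQL assigns the next value
INSERT INTO employees (name) VALUES ('new employee from the spreadsheet');

-- Existing rows, which already carry an id_number, can still be upserted
INSERT INTO employees (id_number, name) VALUES (42, 'updated name')
ON DUPLICATE KEY UPDATE name = VALUES(name);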
In regards to the question about performance with a 500-record, 69-column table, I can only say that if the PHP server and the MySQL server are reasonably sized and the columns aren't large data types, then this amount of data should be handled in a fraction of a second. That said, performance can be sabotaged by one bad line of code, so if your code is slow to perform, I would treat that as a separate optimisation problem.
I have to upload about 16 million records to a MySQL 5.1 server on a shared webspace which does not permit LOAD DATA functionality. The table is an InnoDB table. I have not assigned any keys yet.
Therefore, I use a Python script to convert my CSV file (2.5 GB in size) into an SQL file with individual INSERT statements. I've launched the SQL file, and the process is incredibly slow; it feels like only 1000-1500 lines are processed every minute!
In the meantime, I read about bulk inserts, but did not find any reliable source saying how many records one INSERT statement can contain. Do you know?
Is it an advantage to have no keys and add them later?
Would a transaction around all the inserts help speed up the process? In fact, there's just a single connection (mine) working with the database at this time.
If you use the INSERT ... VALUES ... syntax to insert multiple rows in a single statement, your query size is limited by the max_allowed_packet value rather than by a number of rows.
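For example (a sketch; my_table and its values are placeholders):
-- Check the current limit on the server
SHOW VARIABLES LIKE 'max_allowed_packet';

-- One statement, many rows; keep the full statement under max_allowed_packet
INSERT INTO my_table (col1, col2) VALUES
  (1, 'a'),
  (2, 'b'),
  (3, 'c');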
Concerning keys: it's good practice to define keys before any data manipulation. Actually, when you build a model you must think of keys, relations, indexes, etc.
It's better to define indexes before you insert data as well. CREATE INDEX works quite slowly on huge datasets. But postponing index creation is not a huge disadvantage.
To make your inserts faster, turn autocommit off so that batches of inserts share a single commit, and do not run concurrent requests against your tables.
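A sketch of that pattern (the batch size is up to you; my_table and its columns are placeholders):
SET autocommit = 0;

INSERT INTO my_table (col1, col2) VALUES (1, 'a'), (2, 'b');
INSERT INTO my_table (col1, col2) VALUES (3, 'c'), (4, 'd');
-- ...repeat for each batch...

COMMIT;            -- one log flush for the whole batch instead of one per INSERT
SET autocommit = 1;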
I have an "item" table with the following structure:
item
id
name
etc.
Users can put items from this item table into their inventory. I store it in the inventory table like this:
inventory
id
item_id
user_id
Is it OK to insert 1000 rows into inventory table? What is the best way to insert 1000 rows?
MySQL can handle millions of records in a single table without any tweaks. With a few tweaks it can handle hundreds of millions (I've done that). So I wouldn't worry about that.
To improve insert performance you should use batch inserts.
INSERT INTO my_table (col1, col2) VALUES (val1_1, val2_1), (val1_2, val2_2);
Storing records to a file and using load data infile yields even better results (best in my case), but it requires more effort.
It's okay to insert 1000 rows. You should do this as a transaction so the indices are updated all at once at the commit.
You can also construct a single INSERT statement to insert many rows at a time. (See the syntax for INSERT.) However, I wonder how advisable it would be to do that for 1,000 rows.
Most efficient would probably be to use LOAD DATA INFILE or LOAD XML
When it gets into the 1000s, I usually write to a pipe-delimited CSV file and use LOAD DATA INFILE to suck it in quickly. By writing to disk, you avoid issues with overflowing your string buffer if the language you are using has limits on string size. LOAD DATA INFILE is optimized for bulk uploads.
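Something along these lines (the file path is a placeholder, and id is left out of the column list so AUTO_INCREMENT can fill it in, per the note below):
LOAD DATA LOCAL INFILE '/tmp/inventory.csv'
INTO TABLE inventory
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
(item_id, user_id);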
I've done this with up to 1 billion rows (on a cheap $400 4GB 3 year old 32-bit Ubuntu box), so one thousand is not an issue.
Added note: if you don't care about the id assigned and you just want a new unique ID for every record you insert, you could consider setting up AUTO_INCREMENT on id in the table and let MySQL assign an ID for you.
It also depends somewhat on how many users you have: if you have 1,000,000 users all doing 1,000 inserts every few minutes, then the server is going to struggle to keep up. From a MySQL point of view it is certainly capable of handling that much data.