How do I design a schema to handle periodic bulk inserts/updates? - mysql

(tl;dr: I think that periodic updates force the table to use a natural key, and so I'll have to migrate my database schema.)
I have a production database with a table like planets, which, although it has good potential natural keys (e.g., the planet names, which never really change), uses a typical incremented integer as the primary key. The planets table has a self-referencing column or two, such as *parent_planet_id*.
Now I'm building offline cloud-based workers that re-create subsets of the planets records each week, and they need to be integrated with the main server. My plan is:
A worker instance has a mini version of the database (same schema, but no planets records)
Once per week, the worker fires up, does all its processing, creates its 100,000 or so planets records, and exports the data. (I don't think the export format matters for this particular problem: could be mysqldump, yaml, etc.)
Then, the production server imports the records: some are new records, most are updates.
This last step is what I don't know how to solve. I'm not entirely replacing the planets table each time, so the problem is that the two databases each have their own incrementing integer PKs, and so I can't just do a simple import.
I thought about exporting without the id column, but then I realized that the self-referencing columns prevent this.
I see two possible solutions:
Redesign the schema to use a natural key for the planets table. This would be a pain.
Use a UUID instead of an incrementing integer for the key. This would be easier to migrate to, I think. The IDs would be unique, and the new rows could be safely imported. This also avoids the issues with depending on natural data in keys.

Modify the Planets table to use an alternate hierarchy technique, like nested sets, a closure table, or path enumeration, and then export. This will break the ID dependency.
Or, if you still do not like the idea, consider your export and import as an ETL problem:
Modify the records during export to include PlanetName and ParentPlanetName
Import all Planets (PlanetNames) first
Then import the hierarchy (ParentPlanetName), as sketched below
In any case, the surrogate key from the first DB should never leave that DB -- it has no meaning outside of it.
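A minimal sketch of that two-pass import on the production side, assuming the worker export has been loaded into a hypothetical staging table planets_import(PlanetName, ParentPlanetName) and that planets.name has a UNIQUE index:

-- Pass 1: make sure every imported planet exists, matched by name.
-- (Non-key columns would be refreshed the same way; omitted for brevity.)
INSERT IGNORE INTO planets (name)
SELECT PlanetName FROM planets_import;

-- Pass 2: now that every planet row exists, resolve the hierarchy by name.
UPDATE planets AS p
JOIN planets_import AS i    ON i.PlanetName = p.name
LEFT JOIN planets AS parent ON parent.name = i.ParentPlanetName
SET p.parent_planet_id = parent.id;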

The best solution (in terms of design) would be to refine your key architecture and implement a composite key carrying information about when and from where the planets were imported, but you do not want to do this.
An easier (I think), yet somewhat "happy engineering" solution would be to remap the imported keys. You can do this, for example, like so:
1. Lock the planets table in the main system (so no new keys appear during the import).
2. Create a lookup table with two columns, ID and PLANET_NAME, based on the planets table in the main system.
3. Get the maximum key value from that table.
4. Increment every imported key value (both the identifying key and the parent-child planet references) by adding the MAX value retrieved in step #3.
5. Alter the main planets table and set its AUTO_INCREMENT to the actual MAX + 1 value.
6. Now go over the imported rows (a cursor loop within a procedure), checking whether the current planet name already has a different key in your lookup table; if so, first remove the record with the old key from the main table, then update the key of the currently inspected row back to that old ID (that was an update, not an insert).
7. Unlock the table.
A sketch of steps 3-5 follows.
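A rough sketch of steps 3-5, assuming the imported rows sit in a hypothetical staging table imported_planets that still carries the worker's id and parent_planet_id values:

-- Step 3: highest key currently used in the main system.
SELECT MAX(id) INTO @max_id FROM planets;

-- Step 4: shift every imported key, and every parent reference, above that range.
UPDATE imported_planets
SET id = id + @max_id,
    parent_planet_id = parent_planet_id + @max_id;

-- Step 5: push the main table's AUTO_INCREMENT past the shifted range.
-- ALTER TABLE will not accept an expression, so the statement is built dynamically.
SET @sql = CONCAT('ALTER TABLE planets AUTO_INCREMENT = ',
                  (SELECT MAX(id) + 1 FROM imported_planets));
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

Step 6 still needs the name-by-name reconciliation described above, because every key you map back to an old ID also has to be propagated to the imported rows that reference it as a parent.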

Most operations are updates
So you need a "real" merge. In other words, you'll have to identify a proper order in which you can INSERT/UPDATE the data, so no FKs are violated in the process.
I'm not sure what parent_planet_id means, but assuming it means "orbits" and the word "planet" is stretched to also include moons, imagine you have only Phobos in your master database and Mars and Deimos need to be imported. This can only be done in a certain order:
INSERT Mars.
INSERT Deimos, set its parent_planet_id so it points to Mars.
UPDATE Phobos' parent_planet_id so it points to Mars.
While you could exchange steps (2) and (3), you couldn't do either before step (1).
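In SQL terms, matching rows by name since the worker's ids mean nothing on the production side (assuming the table has a name column; the statements are just a sketch):

-- (1) Mars must exist before anything can reference it.
INSERT INTO planets (name, parent_planet_id)
VALUES ('Mars', NULL);

-- (2) Deimos is new; look its parent up by name.
INSERT INTO planets (name, parent_planet_id)
SELECT 'Deimos', p.id FROM planets AS p WHERE p.name = 'Mars';

-- (3) Phobos already exists; only its parent reference changes.
UPDATE planets AS moon
JOIN planets AS parent ON parent.name = 'Mars'
SET moon.parent_planet_id = parent.id
WHERE moon.name = 'Phobos';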
You'll need a recursive descent to determine the proper order and then compare natural keys¹ to see what needs to be UPDATEd and what INSERTed. Unfortunately, MySQL before 8.0 doesn't support recursive queries, so there you'll need to do it manually.
I don't quite see how surrogate keys help in this process - if anything, they add one more level of indirection you'll have to reconcile eventually.
¹ Which, unlike surrogates, are meaningful across different databases. You can't just compare auto-incremented integers because the same integer value might identify different planets in different databases - you'll have false UPDATEs. GUIDs, on the other hand, will never match, even when rows describe the same planet - you'll have false INSERTs.
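On MySQL 8.0 or later, a recursive CTE can compute that processing order for an incoming batch, assuming it sits in a hypothetical staging table planets_import(PlanetName, ParentPlanetName):

-- depth 0 = rows whose parent is not part of the batch, depth 1 = their
-- children, and so on; processing in ascending depth order guarantees every
-- parent is handled before the rows that reference it.
WITH RECURSIVE ordered AS (
    SELECT PlanetName, ParentPlanetName, 0 AS depth
    FROM planets_import
    WHERE ParentPlanetName IS NULL
       OR ParentPlanetName NOT IN (SELECT PlanetName FROM planets_import)
  UNION ALL
    SELECT i.PlanetName, i.ParentPlanetName, o.depth + 1
    FROM planets_import AS i
    JOIN ordered AS o ON i.ParentPlanetName = o.PlanetName
)
SELECT PlanetName, ParentPlanetName, depth
FROM ordered
ORDER BY depth;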

Related

Which is better, using a central ID store or assigning IDs based on tables

In many ERP systems (locally) I have seen that databases (generally MySQL) use a central key store (resource identity). Why is that?
That is, one special table is maintained in the database for ID generation; its first cell holds a number (ID) which is assigned to the next tuple inserted (i.e., common ID generation for all the tables in the same database).
This table also records the details of the last inserted batch: when 5 tuples are inserted into table ABC and, let's say, the last ID of the items in that batch is X, then an entry like ('ABC', X) is also inserted into the table (the central key store).
Is there any significance to this architecture?
And also, where can I find case studies of common large-scale custom-built ERP systems?
If I understand this correctly, you are asking why someone would replace IDs that are unique only within a table
TABLE clients (id_client AUTO_INCREMENT, name, address)
TABLE products (id_product AUTO_INCREMENT, name, price)
TABLE orders (id_order AUTO_INCREMENT, id_client, date)
TABLE order_details (id_order_detail AUTO_INCREMENT, id_order, id_product, amount)
with global IDs that are unique within the whole database
TABLE objects (id AUTO_INCREMENT)
TABLE clients (id_object, name, address)
TABLE products (id_object, name, price)
TABLE orders (id_object, id_object_client, date)
TABLE order_details (id_object, id_object_order, id_object_product, amount)
(Of course you could still call these IDs id_product etc. rather than id_object. I only used the name id_object for clarification.)
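Mechanically, the global-ID variant means every insert first claims a row in objects and then reuses that value; a sketch (the client values are made up):

-- Claim a new global id, then use it as the key of the real row.
INSERT INTO objects () VALUES ();
SET @id = LAST_INSERT_ID();

INSERT INTO clients (id_object, name, address)
VALUES (@id, 'ACME Corp', '1 Example Street');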
The first approach is the common one. When inserting a new row into a table you get the next available ID for the table. If two sessions want to insert at the same time, one must wait briefly.
The second approach hence leads to sessions waiting for their turn every time they want to insert data, no matter what table, as they all get their IDs from the objects table. The big advantage is that when exporting data, you have global references. Say you export orders and the recipient tells you: "We have problems with your order detail 12345. There must be something wrong with your data." Wouldn't it be great if you could tell them "12345 is not an order detail ID, but a product ID. Do you have problems importing the product, or can you tell me an order detail ID this is about?" rather than poring for hours over an order detail record that happens to have the number 12345 but has nothing to do with the issue?
That said, it might be a better choice to use the first approach and add a UUID to all tables you'd use for external communication. No fight for the next ID and still no mistaken IDs in communication :-)
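For example (a sketch; only the UUID column is exposed to external parties, and the names are illustrative):

CREATE TABLE orders (
    id_order    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    public_uuid CHAR(36)     NOT NULL UNIQUE,
    id_client   INT UNSIGNED NOT NULL,
    order_date  DATE         NOT NULL
);

INSERT INTO orders (public_uuid, id_client, order_date)
VALUES (UUID(), 42, CURRENT_DATE);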
This is a common strategy used in data warehouses to track the batch number after a successful or failed data load. If a load fails, you record something like 'ABC', 'Batch_num' and 'Error_Code' in the batch control table, so your subsequent loading logic can decide what to do with the failure and you can easily track the loads; if you want to recheck, the data can be archived. These IDs are usually generated by a database sequence. In a word, this is mostly used for monitoring purposes.
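A batch control table along those lines might look like this (purely illustrative):

CREATE TABLE batch_control (
    table_name VARCHAR(64)     NOT NULL,
    batch_num  INT UNSIGNED    NOT NULL,
    last_id    BIGINT UNSIGNED NULL,
    error_code VARCHAR(32)     NULL,
    loaded_at  TIMESTAMP       NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (table_name, batch_num)
);

-- Record a failed load of table ABC so the next run can react to it.
INSERT INTO batch_control (table_name, batch_num, error_code)
VALUES ('ABC', 17, 'ERR_FK_VIOLATION');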
You can refer to this link for more details.
There are several more techniques, each with pros and cons. But let me start by pointing out two techniques that hit a brick wall at some point when scaling up. Let's assume you have billions of items, probably scattered across multiple servers, whether by sharding or other techniques.
Brick wall #1: UUIDs are handy because clients can create them without having to ask some central server for values. But UUIDs are very random. This means that, in most situations, each reference incurs a disk hit because the id is unlikely to be cached.
Brick wall #2: Ask a central server, which has an AUTO_INCREMENT under the covers, to dole out ids. I watched a social media site that was doing nothing but collecting images for sharing crash because of this. That's in spite of there being a server whose sole purpose was to hand out numbers.
Solution #1:
Here's one (of several) solutions that avoids most problems: Have a central server that hands out 100 ids at a time. After a client uses up the 100 it has been given, it asks for a new batch. If the client crashes, some of the last 100 are "lost". Oh, well; no big deal.
That solution is upwards of 100 times as good as brick wall #2. And the ids are much less random than those for brick wall #1.
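One common way to hand out such blocks atomically in MySQL is the LAST_INSERT_ID(expr) trick against a one-row dispenser table (a sketch; the table name is made up):

CREATE TABLE id_dispenser (
    next_id BIGINT UNSIGNED NOT NULL
);
INSERT INTO id_dispenser (next_id) VALUES (1);

-- A client claims 100 ids in one atomic statement, then reads the block back.
UPDATE id_dispenser SET next_id = LAST_INSERT_ID(next_id + 100);
SELECT LAST_INSERT_ID() - 100 AS block_start,
       LAST_INSERT_ID() - 1   AS block_end;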
Solution #2: Each client can generate its own 64-bit, semi-sequential ids. The number includes a version number, part of the clock, a dedup part, and the client id. So it is roughly chronological (worldwide) and guaranteed to be unique, yet it still has good locality of reference for items created at about the same time.
Note: My techniques can be adapted for use by individual tables or as an uber-number for all tables. That distinction may have been your 'real' question. (The other Answers address that.)
The downside to such a design is that it puts a tremendous load on the central table when inserting new data. It is a built-in bottleneck.
Some "advantages" are:
Any resource id that is found anywhere in the system can be readily identified, regardless of type.
If there is any text in the table (such as a name or description), then it is all centralized facilitating multi-lingual support.
Foreign key references can work across multiple types.
The third is not really an advantage, because it comes with a downside: the inability to specify a specific type for foreign key references.

mySQL - Searching a row by its row number?

Question from a total mySQL newbie. I'm trying to build a table containing information about machine parts (screws, brackets, cylinders, etc), and each part corresponds to a machine that the part belongs to. The database will be designed so that whenever the client reads from the table, all of the parts from one specified machine will be selected. I'm trying to figure out the fastest way in which all rows falling under a certain category can be read from the disk.
Sorting the table is not an option as many people might be adding rows to the table at once. Using a table for each machine is not practical either, since new machines might be created. I expect it to have to handle lots of INSERT and SELECT operations, but almost no DELETE operations. I've come up with a plan to quickly identify each part belonging to any machine, and I've come to ask if it's practical:
Each row containing the data for a machine part will contain the row number of the previous part and the next part for the same machine. A separate table will contain the row number of the last part of each machine that appears on the table. A script could follow the list of these 'pointers,' skipping to different parts of the table until all of the parts were found.
TL;DR
Would this approach of searching a row by its row number be any faster than searching instead by an integer primary key (since a primary key does not necessarily indicate a position on the table)? How much faster would it be? Would it yield noticeable performance improvements over using an index?
This would be a terrible approach. Selecting rows which match some criteria is a fundamental feature of MySQL (or any other DB engine really...).
Just create a column called machine_id in your parts table and give an id to each machine.
You could put your machines in a machines table and use their primary key in the machine_id field of the parts table.
Then all you have to do to retrieve ALL parts of machine 42 is:
SELECT * FROM parts WHERE machine_id = 42;
If your database is massive, you may also consider indexing the machine_id column for better performance.
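A minimal sketch of that layout (table and column names are illustrative):

CREATE TABLE machines (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE parts (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    machine_id INT UNSIGNED NOT NULL,
    name       VARCHAR(100) NOT NULL,
    INDEX idx_parts_machine (machine_id),   -- keeps the per-machine lookup fast
    FOREIGN KEY (machine_id) REFERENCES machines (id)
);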

Not defined database schema

I have a MySQL database with 220 tables. The database is well structured but without any clear relations. I want to find a way to connect the primary key of each table to its corresponding foreign keys.
I was thinking to write a script to discover the possible relation between two columns:
The content range should be similar in both of them
The foreign key name could be similar to the primary key table name
Those features are not sufficient to solve the problem. Do you have any idea how I could be more accurate and get closer to a solution? Also, is there any available tool which does that?
Please advise!
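For what it is worth, heuristic #2 can be approximated directly against MySQL's information_schema, assuming the foreign key columns follow a <table>_id naming pattern ('your_database' is a placeholder):

SELECT c.TABLE_NAME  AS child_table,
       c.COLUMN_NAME AS candidate_fk,
       t.TABLE_NAME  AS candidate_parent
FROM information_schema.COLUMNS AS c
JOIN information_schema.TABLES  AS t
  ON  t.TABLE_SCHEMA = c.TABLE_SCHEMA
 AND  c.COLUMN_NAME  = CONCAT(t.TABLE_NAME, '_id')
WHERE c.TABLE_SCHEMA = 'your_database'
  AND c.TABLE_NAME <> t.TABLE_NAME;

The candidates still need to be verified against the actual data (heuristic #1).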
Sounds like you have a licensed app+RFS, and you want to save the data (which is an asset that belongs to the organisation), and ditch the app (due to the problems having exceeded the threshold of acceptability).
Happens all the time. Until something like this happens, people do not appreciate that their data is precious, that it outlives any app, good or bad, in-house or third-party.
SQL Platform
If it was an honest SQL platform, it would have the SQL-compliant catalogue, and the catalogue would contain an entry for each reference. The catalogue is an entry-level SQL Compliance requirement. The code required to access the catalogue and extract the FOREIGN KEY declarations is simple, and it is written in SQL.
Unless you are saying "there are no Referential Integrity constraints, it is all controlled from the app layers", which means it is not a database, it is a data storage location, a Record Filing System, a slave of the app.
In that case, your data has no Referential Integrity.
Pretend SQL Platform
Evidently non-compliant databases such as MySQL, PostgreSQL and Oracle fraudulently position themselves as "SQL", but they do not have basic SQL functionality, such as a catalogue. I suppose you get what you pay for.
Solution
For (a) such databases, such as your MySQL, and (b) data placed in an honest SQL container that has no FOREIGN KEY declarations, I would use one of two methods.
Solution 1
First preference.
use awk
load each table into an array
write scripts to:
determine the Keys (if your "keys" are ID fields, you are stuffed, details below)
determine any references between the Keys of the arrays
Solution 2
Now you could do all that in SQL, but then the code would be horrendous, and SQL is not designed for that (table comparisons). That is why I would use awk, in which case the code (for an experienced coder) is complex (given 220 files) but straightforward. That is squarely within awk's design and purpose. It would take far less development time.
I wouldn't attempt to provide code here, there are too many dependencies to identify, it would be premature and primitive.
Relational Keys
Relational Keys, as required by Codd's Relational Model, relate ("link", "map", "connect") each row in each table to the rows in any other table that it is related to, by Key. These Keys are natural Keys, and usually compound Keys. Keys are logical identifiers of the data. Thus, writing either awk programs or SQL code to determine:
the Keys
the occurrences of the Keys elsewhere
and thus the dependencies
is a pretty straight-forward matter, because the Keys are visible, recognisable as such.
This is also very important for data that is exported from the database to some other system (which is precisely what we are trying to do here). The Keys have meaning, to the organisation, and that meaning is beyond the database. Thus importation is easy. Codd wrote about this value specifically in the RM.
This is just one of the many scenarios where the value of Relational Keys, the absolute need for them, is appreciated.
Non-keys
Conversely, if your Record Filing System has no Relational Keys, then you are stuffed, and stuffed big time. The IDs are in fact record numbers in the files. They all have the same range, say 1 to 1 million. It is not reasonably possible to relate any given record number in one file to its occurrences in any other file, because record numbers have no meaning.
Record numbers are physical, they do not identify the data.
I see record number 123456 being repeated in the Invoice file; now, what other file does this relate to? Every other possible file (Supplier, Customer, Part, Address, CreditCard), where it occurs only once, has a record number 123456!
Whereas with Relational Keys:
I see IBM plus a sequence 1, 2, 3, ... in the Invoice table; now, what other table does this relate to? The only table that has IBM occurring once is the Customer table.
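In schema terms, that means keys like these (a sketch, not tied to any particular system):

CREATE TABLE Customer (
    CustomerCode CHAR(6)     NOT NULL,   -- e.g. 'IBM'; meaningful outside the database
    CustomerName VARCHAR(60) NOT NULL,
    PRIMARY KEY (CustomerCode)
);

CREATE TABLE Invoice (
    CustomerCode CHAR(6)      NOT NULL,
    InvoiceNo    INT UNSIGNED NOT NULL,  -- 1, 2, 3, ... per customer
    InvoiceDate  DATE         NOT NULL,
    PRIMARY KEY (CustomerCode, InvoiceNo),
    FOREIGN KEY (CustomerCode) REFERENCES Customer (CustomerCode)
);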
The moral of the story, to etch into one's mind, is this. Actually there are a few, even when limiting them to the context of this Question:
If you want a Relational Database, use Relational Keys, do not use Record IDs
If you want Referential Integrity, use Relational Keys, do not use Record IDs
If your data is precious, use Relational Keys, do not use Record IDs
If you want to export/import your data, use Relational Keys, do not use Record IDs

varchar and composite primary keys in mysql?

I am developing a logging database, the ids of the components being logged in this case are not determined by the database itself, but by the system that sends the report. The system id is a unique varchar, and the component's id is determined by the system (in some faraway location), so uniqueness is guaranteed when the component's primary key is system_id + component_id.
What I'm wondering is if this approach is going to be efficient. I could use auto incremented integers as the id, but that would mean I would have to do select operations before inserting so that I can get this generated id instead of using the already known string id that the system provides.
The database is going to be small scale: no more than a few dozen systems, each with a few dozen components, and maybe some thousands of component updates (another table). Old updates will be periodically dumped into a file and removed from the database, so it won't ever get "big."
Any recommendations?
I would lean towards auto incremented integers as a primary key and put indexes on system_id and component_id. Your selects before that insert will be very cheap and fast.
I'm sure you'll find that tables of several million rows will perform fine with varchar() keys.
It's easy enough to test. Just import your data.
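Either way works at that scale; for comparison, here is what the natural composite key version looks like (a sketch; column names are guesses):

CREATE TABLE components (
    system_id    VARCHAR(64)  NOT NULL,
    component_id VARCHAR(64)  NOT NULL,
    description  VARCHAR(255) NULL,
    PRIMARY KEY (system_id, component_id)
);

-- Updates reference the component directly by the ids the reporting
-- system already knows, so no lookup is needed before inserting.
CREATE TABLE component_updates (
    system_id    VARCHAR(64) NOT NULL,
    component_id VARCHAR(64) NOT NULL,
    logged_at    DATETIME    NOT NULL,
    message      TEXT        NULL,
    FOREIGN KEY (system_id, component_id)
        REFERENCES components (system_id, component_id)
);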

What is the best way to merge 2 MySQL data dumps?

We have built an application with MySQL as the database. Every week we export the data dump from the database, and delete all the data. Now we want to merge all these dumps together for some data-analysis tasks.
The problem we are facing is that the "id" field for all the tables is Auto-Increment, so it starts with 1 in all the data dumps, which causes duplicate IDs in the table. I am sure there must be better ways to do it since it should be a pretty common task in MySQL administration.
What would be the best way to go about it?
If you can easily identify your foreign key fields (like they take the form *_id) then you can use the scripting language of your choice to modify the primary and foreign keys in the dump files by adding an "id space offset".
For example, let's say you have two dump files and you know their primary key range does not exceed 1,000,000: you increment the primary and foreign keys in the second dump file by 1,000,000.
This is not entirely trivial to implement, as you will have to detect the position of the foreign key fields in the statements and then modify values at the same column position elsewhere in the statement.
If your foreign keys are not easily identifiable by a common naming convention then you must keep separate information per table about how to find their positions based on column position.
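If editing the dump files directly is awkward, the same offset can be applied after loading each dump into its own staging schema (a sketch; schema, table, and column names are hypothetical):

SET FOREIGN_KEY_CHECKS = 0;
SET @offset = 1000000;

-- Shift the second dump's keys and foreign keys above the first dump's range.
UPDATE dump2.clients SET id = id + @offset;
UPDATE dump2.orders  SET id = id + @offset,
                         client_id = client_id + @offset;

-- Then copy everything into the merged analysis schema.
INSERT INTO merged.clients SELECT * FROM dump2.clients;
INSERT INTO merged.orders  SELECT * FROM dump2.orders;

SET FOREIGN_KEY_CHECKS = 1;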
Good luck.
The best way would be to have another database that acts as a data warehouse, into which you copy the contents of your app's database. After that, don't truncate the tables; simply use DELETE FROM tablename - that way, your auto_increments won't get reset.
It's an ugly solution to export something, truncate the database, and then expect an import to proceed properly. Even if you work around the problem of clashing auto-increments (there's the ON DUPLICATE KEY clause, which lets you do something when a unique key constraint fails), nothing guarantees that relations between tables (foreign keys) will be preserved.
This is a broad topic and the solution given here is quick and not pretty; other people will probably suggest other methods. But if you are doing this to offload the DB your app uses, it's a bad design. Try googling MySQL's partitioning support if you're aiming for better performance with a larger data set.
For the data you've already dumped, load it into a table that doesn't use the ID column as a primary key. You don't have to define any primary key. You will have multiple rows with the same ID, but that won't impede your data analysis.
Going forward, you can set up a discipline where you dump and then DELETE the rows that are more than, say, one day old. That way your IDs will keep incrementing.
Or, you can copy this data to a table that uses the ARCHIVE storage engine. This is good for retaining data for analysis, because it compresses its contents.
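For example (a sketch; the column list is made up, and the table deliberately has no primary key):

CREATE TABLE events_archive (
    id         INT UNSIGNED NOT NULL,    -- kept as plain data, not as a key
    event_type VARCHAR(50)  NOT NULL,
    created_at DATETIME     NOT NULL,
    payload    VARCHAR(255) NULL
) ENGINE=ARCHIVE;

-- Append each weekly dump; ARCHIVE only supports INSERT and SELECT,
-- which is fine for retained analysis data.
INSERT INTO events_archive
SELECT id, event_type, created_at, payload
FROM weekly_dump.events;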