What is the best way to merge 2 MySQL data dumps?

We have built an application with MySQL as the database. Every week we export the data dump from the database, and delete all the data. Now we want to merge all these dumps together for some data-analysis tasks.
The problem we are facing is that the "id" field for all the tables is Auto-Increment, so it starts with 1 in all the data dumps, which causes duplicate IDs in the table. I am sure there must be better ways to do it since it should be a pretty common task in MySQL administration.
What would be the best way to go about it?

If you can easily identify your foreign key fields (like they take the form *_id) then you can use the scripting language of your choice to modify the primary and foreign keys in the dump files by adding an "id space offset".
For example, if you have two dump files and you know their primary key values never exceed 1,000,000, you can increment the primary and foreign keys in the second dump file by 1,000,000.
This is not entirely trivial to implement, as you will have to detect the position of the foreign key fields in the statements and then modify values at the same column position elsewhere in the statement.
If your foreign keys are not easily identifiable by a common naming convention, then you must keep separate, per-table information about which column positions hold the foreign keys.
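As a rough sketch of the same idea done inside MySQL rather than by editing the dump text: load the second dump into its own staging schema first, then shift the keys before copying the rows across (the schema, table, and column names below are only examples):

SET @offset := 1000000;
SET FOREIGN_KEY_CHECKS = 0;   -- the keys are shifted consistently, so skip the per-row checks

UPDATE dump2.orders      SET id = id + @offset;
UPDATE dump2.order_items SET id = id + @offset,
                             order_id = order_id + @offset;   -- foreign key gets the same offset

INSERT INTO merged.orders      SELECT * FROM dump2.orders;
INSERT INTO merged.order_items SELECT * FROM dump2.order_items;

SET FOREIGN_KEY_CHECKS = 1;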
Good luck.

The best way would be to have another database that acts as a data warehouse, into which you copy the contents of your app's database. After that, you don't truncate all the tables; you simply use DELETE FROM tablename - that way, your auto_increment counters won't get reset.
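A minimal sketch of that weekly copy, assuming an app schema, a warehouse schema, and an events table (all names invented for the example):

INSERT INTO warehouse.events SELECT * FROM app.events;
DELETE FROM app.events;   -- unlike TRUNCATE, DELETE does not reset the auto_increment counter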
It's an ugly solution to export something, then truncate the database, then expect an import to proceed properly. Even if you work around the problem of clashing auto-increments (there's the INSERT ... ON DUPLICATE KEY UPDATE clause, which lets you do something when a unique key constraint would otherwise fail), nothing guarantees that relations between tables (foreign keys) will be preserved.
This is a broad topic, and the solution given here is quick and not pretty; other people will probably suggest other methods. But if you are doing this to offload the database your app uses, it's a bad design. Look into MySQL's partitioning support if you're aiming for better performance with a larger data set.

For the data you've already dumped, load it into a table that doesn't use the ID column as a primary key. You don't have to define any primary key. You will have multiple rows with the same ID, but that won't impede your data analysis.
Going forward, you can set up a discipline where you dump and then DELETE the rows that are more than, say, one day old. That way, your IDs will keep incrementing.
Or, you can copy this data to a table that uses the ARCHIVE storage engine. This is good for retaining data for analysis, because it compresses its contents.
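For example (table and column names are made up; adjust to the real dump):

CREATE TABLE events_archive (
  id         INT UNSIGNED NOT NULL,   -- kept as an ordinary column, deliberately not a primary key
  created_at DATETIME     NOT NULL,
  payload    TEXT
) ENGINE=ARCHIVE;

INSERT INTO events_archive
  SELECT id, created_at, payload FROM events_import;

Note that the ARCHIVE engine only supports INSERT and SELECT (no UPDATE or DELETE), which is usually fine for an analysis copy.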

Index every column to add foreign keys

I am currently learning about foreign keys and trying to add them as much as I can in my application to ensure data integrity. I am using InnoDB on MySQL.
My clicks table has a structure something like...
id, timestamp, link_id, user_id, ip_id, user_agent_id, ... etc for about 12 _id columns.
Obviously these all point to other tables, so should I add a foreign key on them? MySQL is creating an index automatically for every foreign key, so essentially I'll have an index on every column? Is this what I want?
FYI - this table will essentially be my most bulky table. My research basically tells me I'm sacrificing performance for integrity but doesn't suggest how harsh the performance drop will be.
Right before inserting such a row, you did 12 inserts or lookups to get the ids, correct? Then, as you do the INSERT, it will do 12 checks to verify that all of those ids have a match. Why bother; you just verified them with the code.
Sure, have FKs in development. But in production, you should have weeded out all the coding mistakes, so FKs are a waste.
A related tip -- Don't do all the work at once. Put the raw (not-yet-normalized) data into a staging table. Periodically do bulk operations to add new normalization keys and get the _id's back. Then move them into the 'real' table. This has the added advantage of decreasing the interference with reads on the table. If you are expecting more than 100 inserts/second, let's discuss further.
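A very rough sketch of that staging pattern, with invented table and column names (clicks_staging, user_agents, clicks), assuming a user_agents lookup table with an auto-increment id and a unique key on its user_agent column so INSERT IGNORE can skip values it already knows:

CREATE TABLE clicks_staging (
  ts         TIMESTAMP    NOT NULL,
  user_agent VARCHAR(255) NOT NULL    -- raw, not-yet-normalized value
);

-- Periodically, in bulk: add any new lookup values...
INSERT IGNORE INTO user_agents (user_agent)
  SELECT DISTINCT user_agent FROM clicks_staging;

-- ...then move the batch into the real table, resolving the _id with a join.
INSERT INTO clicks (ts, user_agent_id)
  SELECT s.ts, ua.id
  FROM clicks_staging s
  JOIN user_agents ua ON ua.user_agent = s.user_agent;

TRUNCATE TABLE clicks_staging;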
The generic answer is that if you considered a data item so important that you created a lookup table for the possible values, then you should create a foreign key relationship to ensure you are not getting any orphan records.
However, you should reconsider whether all data items (fields) in your clicks table need a lookup table. For example, the ip_id field probably represents an IP address. You can simply store the IP address directly in the clicks table; you do not really need a lookup table, since IP addresses span a wide range and each value is essentially unique anyway.
Based on the re-evaluation of the fields, you may be able to reduce the number of related tables, thus the number of foreign keys and indexes.
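To illustrate both points (the table and column names are guesses based on the question): a foreign key for a column that genuinely references a lookup table, and an inline IPv4 column replacing the ip_id lookup:

ALTER TABLE clicks
  ADD CONSTRAINT fk_clicks_link FOREIGN KEY (link_id) REFERENCES links (id);

-- An IPv4 address fits in an INT UNSIGNED; INET_ATON()/INET_NTOA() convert it.
-- (Use VARBINARY(16) with INET6_ATON() if IPv6 must be handled as well.)
ALTER TABLE clicks ADD COLUMN ip INT UNSIGNED;
UPDATE clicks SET ip = INET_ATON('192.0.2.10') WHERE id = 1;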
Here are three things to consider:
What is the ratio of reads to writes on this table? If you are reading much more often than writing, then more indexes could be good, but if it is the other way around then the cost of maintaining those indexes becomes harder to bear.
Are some of the foreign keys not very selective? If you have an index on the gender_id column, then it is probably a waste of space. My general rule is that indexes without included columns should have about 1,000 distinct values (unless the values are unique), and then tweak from there; a quick way to check is sketched after this list.
Are some foreign keys rarely or never going to be used as a filter for a query? If you have a last_modified_user_id field but you never have any queries that will return a list of items which were last modified by a particular user then an index on that field is less useful.
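The quick selectivity check mentioned above (gender_id and clicks are just the running example):

SELECT COUNT(DISTINCT gender_id) AS distinct_values,
       COUNT(*)                  AS total_rows
FROM clicks;
-- A handful of distinct values across millions of rows means an index on the column will rarely help.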
A little bit of knowledge about indexes can go a long way. I recommend http://use-the-index-luke.com

DB Design - any way to avoid duplicating columns here?

I've got a database that stores hash values and a few pieces of data about the hash, all in one table. One of the fields is 'job_id', which is the ID for the job that the hash came from.
The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with fields 'job_id', 'job_name' and 'hash_value'. When a new batch of data is inserted into the DB, the job ID and name would be created here and each hash would go into here as well as the original hash table, but in the Jobs table it'd also be stored against the job.
I don't like this, because I'd be duplicating the hash column across tables. Is there a better way? I can add to the hash table but can't take away any columns because closed-source software depends on it. The hash value is the primary key. It's MySQL and the database stores many millions of records. Thanks in advance!
Adding the new job table is the way to go. It's the normative practice for representing a one-to-many relationship.
It's good to avoid unnecessary duplication of values. But in this case, you aren't really "duplicating" the hash_value column; rather, you are really defining a relationship between job and the table that has hash_value as the primary key.
The relationship is implemented by adding a column to the child table; that column holds the primary key value from the parent table. Typically, we add a FOREIGN KEY constraint on the column as well.
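A sketch of what that could look like here, guessing that the existing table is called hashes, that its primary key hash_value is a CHAR(40), and that a given (job_id, hash_value) pair should only appear once:

CREATE TABLE jobs (
  job_id     INT UNSIGNED NOT NULL,
  job_name   VARCHAR(255) NOT NULL,
  hash_value CHAR(40)     NOT NULL,   -- must match the type of hashes.hash_value exactly
  PRIMARY KEY (job_id, hash_value),
  CONSTRAINT fk_jobs_hash
    FOREIGN KEY (hash_value) REFERENCES hashes (hash_value)
) ENGINE=InnoDB;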
The problem I'm trying to solve is that with this design, a hash can only belong to one job - in reality a hash can occur in many jobs, and I'd like to know each job in which a hash occurs.
The way I'm thinking of doing this is to create a new table called 'Jobs', with fields 'job_id', 'job_name' and 'hash_value'.
As long as you can also get a) the foreign keys right and b) the cascades right for both "job_id" and "hash_value", that should be fine.
Duplicate data and redundant data are technical terms in relational modeling. Technical term means they have meanings that you're not likely to find in a dictionary. They don't mean "the same values appear in multiple tables." That should be obvious, because if you replace the values with surrogate ID numbers, those ID numbers will then appear in multiple tables.
Those technical terms actually mean "identical values with identical meaning." (Relevant: Hugh Darwen's article for definition and use of predicates.)
There might be good, practical reasons for replacing text with an ID number, but there are no theoretical reasons to do that, and normalization certainly doesn't require it. (There's no "every row has an ID number" normal form.)
If I read your question correctly, your design is fundamentally flawed, because of these facts:
the hash is the primary key (quoted from your question)
the same hash can be generated from multiple different inputs (fact)
you have millions of hashes (from question)
With the many millions of rows/hashes, eventually you'll get a hash collision.
The only sane approach is to have job_id as the primary key and hash in a column with a non-unique index on it. Finding job(s) given a hash would be straightforward.
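The shape this answer is proposing, with the same guessed column types as in the earlier sketch:

CREATE TABLE jobs (
  job_id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
  job_name   VARCHAR(255) NOT NULL,
  hash_value CHAR(40)     NOT NULL,
  PRIMARY KEY (job_id),
  KEY idx_hash_value (hash_value)   -- non-unique, so repeated (or colliding) hashes are allowed
);

-- finding the job(s) for a given hash:
SELECT job_id, job_name FROM jobs WHERE hash_value = 'some-hash';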

How do I design a schema to handle periodic bulk inserts/updates?

(tl;dr: I think that periodic updates force the table to use a natural key, and so I'll have to migrate my database schema.)
I have a production database with a table like planets, which although it has good potential natural keys (e.g., the planet names, which never really change), uses a typical incremented integer as the primary key. The planets table has a self-referencing column or two, such as parent_planet_id.
Now I'm building offline cloud-based workers that re-create subsets of the planets records each week, and they need to be integrated with the main server. My plan is:
A worker instance has a mini version of the database (same schema, but no planets records)
Once per week, the worker fires up, does all its processing, creates its 100,000 or so planets records, and exports the data. (I don't think the export format matters for this particular problem: could be mysqldump, yaml, etc.)
Then, the production server imports the records: some are new records, most are updates.
This last step is what I don't know how to solve. I'm not entirely replacing the planets table each time, so the problem is that the two databases each have their own incrementing integer PK's. And so I can't just do a simple import.
I thought about exporting without the id column, but then I realized that the self-referencing columns prevent this.
I see two possible solutions:
Redesign the schema to use a natural key for the planets table. This would be a pain.
Use UUID instead of an incrementing integer for the key. Would be easier, I think, to move to. The id's would be unique, and the new rows could be safely imported. This also avoids the issues with depending on natural data in keys.
Modify the planets table to use an alternate hierarchy technique, like nested sets, a closure table, or path enumeration, and then export. This will break the ID dependency.
Or, if you still do not like the idea, consider your export and import as an ETL problem.
Modify the record during the export to include PlanetName, ParentPlanetName
Import all Planets (PlanetNames) first
Then import the hierarchy (ParentPlanetName)
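Sketched in SQL with an invented planets_import staging table, and assuming planets has a UNIQUE key on name so the natural key can be matched:

-- Pass 1: make sure every planet exists.
INSERT IGNORE INTO planets (name)
  SELECT planet_name FROM planets_import;

-- Pass 2: fill in the hierarchy by resolving the parent through its name.
UPDATE planets p
JOIN planets_import i ON i.planet_name = p.name
JOIN planets parent   ON parent.name   = i.parent_planet_name
SET p.parent_planet_id = parent.id;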
In any case, the surrogate key from the first DB should never leave that DB -- it has no meaning outside of it.
The best solution (in terms of design) would be to refine your key architecture and implement a composite key carrying information about when and from where the planets were imported, but you do not want to do this.
An easier (I think), though somewhat "happy engineering" solution would be to modify the imported keys. You can do it, for example, like this:
1. lock the planets table in the main system (so no new keys appear during the import),
2. create a lookup table with two columns, ID and PLANET_NAME, based on the planets table in the main system,
3. get the maximum key value from that table,
4. increment every imported key (both the ids and the references that identify the parent-child planet relationship) by adding the MAX value retrieved in step #3,
5. alter the main planets table and change its current auto_increment value to the actual MAX + 1,
6. now go over the imported table (a cursor loop within a procedure), checking whether the current planet name already has a different key in your lookup table; if it does, first remove the existing record whose key comes from the lookup (the old one), then update the key value of the currently inspected row to that old id (so the import acts as an update),
7. unlock the table.
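In SQL terms (planets and planets_import are invented names, and the locking in steps 1 and 7 plus the cursor loop in step 6 are left out), steps 2-5 come down to roughly:

CREATE TABLE planet_lookup AS
  SELECT id, name AS planet_name FROM planets;     -- step 2

SET @max := (SELECT MAX(id) FROM planets);         -- step 3

UPDATE planets_import                              -- step 4
SET id = id + @max,
    parent_planet_id = parent_planet_id + @max;

-- step 5: ALTER TABLE planets AUTO_INCREMENT = <MAX + 1>;
-- (AUTO_INCREMENT needs a literal value, so build this statement dynamically or fill it in by hand)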
Most operations are updates
So you need a "real" merge. In other words, you'll have to identify a proper order in which you can INSERT/UPDATE the data, so no FKs are violated in the process.
I'm not sure what parent_planet_id means, but assuming it means "orbits" and the word "planet" is stretched to also include moons, imagine you have only Phobos in your master database and Mars and Deimos need to be imported. This can only be done in a certain order:
INSERT Mars.
INSERT Deimos, set its parent_planet_id so it points to Mars.
UPDATE Phobos' parent_planet_id so it points to Mars.
While you could exchange steps (2) and (3), you couldn't do either before step (1).
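In SQL, roughly (LAST_INSERT_ID() captures Mars' new surrogate id so the next two statements can reference it):

-- (1)
INSERT INTO planets (name) VALUES ('Mars');
SET @mars_id := LAST_INSERT_ID();

-- (2)
INSERT INTO planets (name, parent_planet_id) VALUES ('Deimos', @mars_id);

-- (3)
UPDATE planets SET parent_planet_id = @mars_id WHERE name = 'Phobos';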
You'll need a recursive descent to determine the proper order and then compare natural keys [1] to see what needs to be UPDATEd and what INSERTed. Unfortunately, MySQL doesn't support recursive queries, so you'll need to do it manually.
I don't quite see how surrogate keys help in this process - if anything, they add one more level of indirection you'll have to reconcile eventually.
[1] Which, unlike surrogates, are meaningful across different databases. You can't just compare auto-incremented integers because the same integer value might identify different planets in different databases - you'll have false UPDATEs. GUIDs, on the other hand, will never match, even when rows describe the same planet - you'll have false INSERTs.

varchar and composite primary keys in mysql?

I am developing a logging database, the ids of the components being logged in this case are not determined by the database itself, but by the system that sends the report. The system id is a unique varchar, and the component's id is determined by the system (in some faraway location), so uniqueness is guaranteed when the component's primary key is system_id + component_id.
What I'm wondering is if this approach is going to be efficient. I could use auto incremented integers as the id, but that would mean I would have to do select operations before inserting so that I can get this generated id instead of using the already known string id that the system provides.
The database is going to be small scale, no more than a few dozen systems, each with a few dozen components, and maybe some thousands of component updates (another table). Old updates will be periodically dumped into a file and removed from the database, so it won't ever get "big."
Any recommendations?
I would lean towards auto incremented integers as a primary key and put indexes on system_id and component_id. Your selects before that insert will be very cheap and fast.
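For instance (column types and lengths are guesses, and a single composite unique index stands in here for the two separate indexes mentioned), the surrogate-key shape this answer suggests could look like:

CREATE TABLE components (
  id           INT UNSIGNED NOT NULL AUTO_INCREMENT,
  system_id    VARCHAR(64)  NOT NULL,
  component_id VARCHAR(64)  NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY uq_system_component (system_id, component_id)  -- keeps the natural key unique and makes the lookup cheap
);

-- the "select before insert" to resolve the already-known string ids into the surrogate id:
SELECT id FROM components
WHERE system_id = 'sys-A' AND component_id = 'comp-7';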
I'm sure you'll find that tables of several million rows will perform fine with varchar() keys.
It's easy enough to test. Just import your data.

Add Foreign Key relationships as bulk operation

I've inherited a database with hundreds of tables. Tables may have implicit FK relations that are not explicitly defined as such. I would like to be able to write a script or query that would be able to do this for all tables. For instance, if a table has a field called user_id, then we know there's a FK relationship with the users table on the id column. Is this even doable?
Thanks in advance,
Yes, it's possible, but I would want to explore more first. Many folks design relational databases without foreign keys, especially in the MySQL world. Also, people reuse column names in different tables in the same schema (often with less than optimal results). Double-check that what you think is a foreign key can actually be used that way (same data type, width, collation/character set, etc.).
Then I would recommend you copy the tables to a test machine and start doing your ALTER TABLEs to add the foreign keys. Test like heck.
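As a starting point for the "script or query" part, something like this generates candidate ALTER TABLE statements from information_schema; it naively assumes every foo_id column points at a table named foos with an id column, which is exactly the sort of guess the answer says to double-check before running anything:

SELECT CONCAT('ALTER TABLE `', TABLE_NAME,
              '` ADD FOREIGN KEY (`', COLUMN_NAME,
              '`) REFERENCES `', CONCAT(SUBSTRING_INDEX(COLUMN_NAME, '_id', 1), 's'),
              '` (`id`);') AS candidate_fk
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = 'your_database'   -- replace with the real schema name
  AND COLUMN_NAME LIKE '%\_id';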