I'm currently developing an API for a company that didn't do a very good job of maintaining a test database with test data. The MySQL database structure is quite big and complicated, and the live database should be around 160-200 GB.
Because I'm quite lazy and don't want to create test data for all the tables from scratch, I was wondering what would be the best way to turn such a big database into a smaller test database that keeps all data and relationships in a correct form. Is there an easy way to do this with some kind of script that checks the database model and knows what data it needs to keep or delete when reducing the database to a smaller size?
Or am I doomed and have to go through the tedious task of creating my own test data?
Take a look at Jailer which describes itself as a "Database Subsetting and Browsing Tool". It is specifically designed to select a subset of data, following the database relationships/constraints to include all related rows from linked tables. To limit the amount of data you export, you can set a WHERE clause on the table you are exporting.
The issue of scrubbing your test data to remove customer data is still there, but this will be easier once you have a smaller subset to work with.
In addition to Liath's recommendation:
Maybe it's a hard way, but you can just export your schema (no data) and then write a stored procedure that iterates over your (original) tables and runs a simple:
INSERT INTO dest_table (fields) SELECT * FROM origin_table WHERE (external keys already inserted) LIMIT 100;
or something like that.
Thanks to Liath: for the "external keys already inserted" part, you have to add a filter to ensure that every foreign key of this table already exists in your test database. So you also need to iterate over your tables in foreign-key order.
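As a rough sketch of that idea, assuming a hypothetical parent table customers and a child table orders with an orders.customer_id foreign key (the names are illustrative, not from the original schema) and both schemas living on the same MySQL server:
-- Copy a small slice of the parent table first
INSERT INTO testdb.customers
SELECT * FROM livedb.customers
LIMIT 100;
-- Then copy only those child rows whose foreign keys already exist in the test database
INSERT INTO testdb.orders
SELECT o.*
FROM livedb.orders o
WHERE o.customer_id IN (SELECT id FROM testdb.customers);
Repeating this table by table, parents before children, keeps every foreign key in the subset consistent.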
Another way is to export your data and edit the SQL dump file to remove the unwanted data (a really hard way).
I would suggest that no matter how thorough you are, the risk of getting live customer details into a test database is too high. What happens if you accidentally email or charge a real customer for something you're testing?!
There are a number of products out there such as RedGate's Data Generator which will create test data for you based on your schema (there is a free trial I believe so you can check it meets your needs before committing).
Your other alternative? Hire a temp to enter data all day!
ETA: Sorry - I just saw you're looking at MySQL rather than MSSQL, which probably rules out the tool I recommended. A quick Google search turns up similar tools.
I did some research on this and couldn't find many introductory resources for a beginner, so I'm looking for a basic understanding here of how the process works. The problem I'm trying to solve is this: I want to move data from an old database to a new one with a slightly different structure, possibly mutating the data a little bit in the process. Without going into the nitty-gritty details, what are the general steps involved in doing this?
From what I gathered I would either be...
writing a ton of SQL queries manually (eesh)
using some complex tool that may be overkill for what I'm doing
There is a lot of data in the database, so writing INSERT queries from an SQL dump seems like a nightmare. What I was looking for is some way to write a simple program with logic like: for each row in the table "posts", take the value of the "body" attribute and put it in the "post-body" attribute of the new database, or something like that. I'm also looking for functionality like: append a 0 to the data in the "user id" column, then insert it into the new database (just an example; the point is to mutate the data slightly).
In my head I can construct the logic of how the migration would go very easily (definitely not rocket science here), but I'm not sure how to make this happen on a computer and iterate over the ridiculous amount of data without doing it manually. What is the general process for doing this, and what tools might a beginner want to use? Is this even a good idea for someone who has never done it before?
Edit: by request, here is an example of a mutation I'd like to perform:
Old database: table "posts" with an attribute "post_body" that is a VARCHAR(255).
New database: table "posts" with an attribute "body" that is a TEXT datatype.
I want to take "post_body" from the old one and put it in "body" in the new one. I realize they are different datatypes, but they are both technically strings and should be fine to convert, right? And so on; a bunch of manipulations like this.
Usually, the most time-consuming step of a database conversion is understanding both the old and the new structure, and establishing the correspondence of fields in each structure.
Compared to that, the time it takes to write the corresponding SQL query is ridiculously short.
for each row in the table "posts", take the value of the "body" attribute and put it in the "post-body" attribute of the new database
INSERT INTO newdb.postattribute (id, attribute, value)
SELECT postid, 'post-body', body FROM olddb.post;
In fact, the tool that allows such data manipulation is... SQL! Really, this is already a very high-level language.
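For the concrete mutations described in the question, the queries might look something like this. This is only a sketch: it assumes both databases are on the same MySQL server, that the tables are named as in the question, and that a hypothetical users table holds the "user id" column; the "append a 0" mutation is the question's own example.
-- Copy post bodies into the new column (VARCHAR(255) converts to TEXT implicitly)
INSERT INTO newdb.posts (body)
SELECT post_body FROM olddb.posts;
-- Mutate data on the way over: append a 0 to each user id
INSERT INTO newdb.users (user_id)
SELECT CONCAT(user_id, '0') FROM olddb.users;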
I'm working on a website that will test some applications or websites with test cases, and I don't know how to store the test cases that will be created by users. Is it okay to create a separate table for each user, or should I store all the data in one table? My idea is to create 3 new tables for each user: test_cases_x (stores all test cases the user has created), test_cases_history_x (stores references to all test cases that have been executed), and test_cases_exe_x (stores references to all test cases that are executing at this moment).
Is it okay to create a separate table for each user?
No, this is defeating the whole idea of a relational database. You want the three tables, but link them by user id.
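A minimal sketch of that layout in MySQL, with assumed column names (adjust them to your real data):
CREATE TABLE test_cases (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    user_id    INT NOT NULL,      -- which user created this test case
    definition TEXT
);
CREATE TABLE test_cases_history (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    test_case_id INT NOT NULL,
    executed_at  DATETIME,
    FOREIGN KEY (test_case_id) REFERENCES test_cases (id)
);
CREATE TABLE test_cases_exe (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    test_case_id INT NOT NULL,
    started_at   DATETIME,
    FOREIGN KEY (test_case_id) REFERENCES test_cases (id)
);
-- All of one user's test cases are then just a filtered query:
SELECT * FROM test_cases WHERE user_id = 42;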
It's hard to say without knowing all the information; however, it is usually better, 99% of the time, not to create specific tables on a per-user basis but to use the database to perform the linkage (relationships).
If you're concerned your table will grow really large, you can look at partitioning / sharding / archiving data to reduce it (please don't look there until you need to, as premature optimization can actually make things slower).
I need to create dynamic tables in the database on the fly. For example, in the database I will have tables named:
Table
Column
DataType
TextData
NumberData
DateTimedata
BitData
Here I can add a table in the table named Table, then add all of that table's columns in the Column table and associate a datatype with each column.
Basically I want to create tables without actually creating a table in the database. Is this even possible? If so, can you direct me to the right place so I can research? Also, I would prefer SQL Server or any free database software.
Thanks
What you are describing is an entity-attribute-value model (EAV). It is a very poor way to design a data model.
Although the data model is quite flexible, querying it is quite complicated. You frequently end up having to self-join a table n times if you want to select or filter on n different attributes. That gets slow rather quickly and becomes hard to optimize.
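For example, filtering on just two attributes in a hypothetical EAV layout already needs two joins against the same value table (the table and column names here are made up for illustration):
-- Find the entities whose 'color' is 'red' AND whose 'size' is 'large'
SELECT c.entity_id
FROM text_data c
JOIN text_data s ON s.entity_id = c.entity_id
WHERE c.attribute = 'color' AND c.value = 'red'
  AND s.attribute = 'size'  AND s.value = 'large';
-- a third attribute means a third self-join, and so on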
Plus, you generally end up building a lot of functionality that the database or your ORM would provide.
I'm not sure what the real problem you're having is, but the solution you proposed is the "database within a database" antipattern which makes so many people cringe.
Depending on how you're querying your data, if you were to structure things like you're planning, you'd either need a bunch of piece-wise queries which are joined in the middleware (slow) or one monster monolithic query (either slow or creates massive index bloat), if one is even possible.
If you must create tables on the fly, learn the CREATE TABLE, ALTER TABLE, and DROP TABLE DDL statements for the particular database engine you're using. Better yet, find an ORM that will do this for you. If your real problem is that you need to store unstructured data, check out MongoDB, Redis, or some of the other NoSQL variants.
My final advice is to write up the actual problem you're trying to solve as a separate question, and you'll probably learn a lot more.
Doing this with documents might be easier. Perhaps you should look at a NoSQL solution such as MongoDB.
Or you can still create the temporary tables, but use a cron job to create them every %% hours and rename them to the correct name once the queries are done, so your site stays up.
What you are trying to achieve is not bad, but you must use it in the correct, logical way.
I did something like this in LedgerSMB. While we use EAV modelling for a few things (where the flexibility is needed and the sort of querying we are doing is straightforward; for example, menu nodes use this in part), in general you want to stay away from this as much as possible.
A better approach is to do all of what you are doing except for the data columns. Then you can (shock of shocks) just create the tables. This gives you a catalog of what you have added so your app knows this (and you can diff from the system catalogs if you ever have to check!) but at the same time you get actual relational modelling.
What we did in LedgerSMB was to have stored procedures that accept a table name and check whether a table named 'extends_' || name exists. If so, they add a column with the required datatype and write this to the application catalogs. This gives us relational modelling of extended attributes. At load time, the application loads the application catalogs and writes queries at the appropriate points to load/save the data. It works pretty well, actually.
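A very rough sketch of the general idea, in MySQL syntax rather than the PostgreSQL that LedgerSMB actually uses, with made-up table and column names:
-- Application catalog of the extended attributes the app knows about
CREATE TABLE app_catalog (
    base_table  VARCHAR(64) NOT NULL,
    column_name VARCHAR(64) NOT NULL,
    data_type   VARCHAR(32) NOT NULL,
    PRIMARY KEY (base_table, column_name)
);
-- Extension table for a hypothetical base table customer(id); still fully relational
CREATE TABLE extends_customer (
    customer_id INT PRIMARY KEY,
    FOREIGN KEY (customer_id) REFERENCES customer (id)
);
-- Adding a new attribute means a real column plus a catalog entry
ALTER TABLE extends_customer ADD COLUMN loyalty_level VARCHAR(20);
INSERT INTO app_catalog VALUES ('customer', 'loyalty_level', 'VARCHAR(20)');
At load time the application reads app_catalog and joins the extension table in wherever those attributes are needed.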
I have a user activity tracking log table that logs all user activity as it occurs. This is an extremely high-write table due to the in-depth, click-by-click tracking. Up to here the database design is fine. The problem is the next step.
I need to output the data for the business folks, and these people can also run queries to fetch past activity data. Hence there is medium-to-high read traffic as well. I do not like the idea of reading and writing from the same high-traffic table.
So ideally I want to split the tables: the first one for quick writes (few to no FKs), then copy that data over, fully formatted and with all the labels pulled in for the IDs, into a read table for reading use.
So questions:
1) Is this the best approach for me?
2) If I do keep 2 tables, how do I keep them in sync? I can't copy the data to the read table the instant it is written to the write table - that would defeat the whole purpose of having separate tables. Nor can I let the read table get stale, because the tracked activity data links with other user data like session_id, so if those IDs are not ready when their use case calls for them, the writes will fail.
I am using MySQL for user data and HBase for some large tables, with PHP CodeIgniter for my app.
Thanks.
Yes, having 2 separate tables is the best approach. I had the same problem to solve a few months ago, though for a daemon-type application and not a website.
Eventually I ended up with one MEMORY table keeping the "live" data, which is inserted/updated/deleted on almost every event, and another table holding duplicates of the live data rows but without the unnecessary system columns - my history table, which was used for reading only, per request.
The live table is only relevant to the running process, so I don't care if the contained data is lost due to a server failure - whatever data needs to be read later is already stored in the history table. So ... there's no problem in duplicating the data in the two tables - your goal is performance, not normalization.
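A rough sketch of that split in MySQL, with made-up columns (the MEMORY engine keeps the hot table in RAM, but its contents do not survive a restart, which is exactly the trade-off described above):
-- Hot table: written on every click, kept small and in memory
CREATE TABLE activity_live (
    id         BIGINT AUTO_INCREMENT PRIMARY KEY,
    session_id BIGINT NOT NULL,
    action     VARCHAR(50) NOT NULL,
    created_at DATETIME NOT NULL
) ENGINE=MEMORY;
-- History table: the copy the business folks query
CREATE TABLE activity_history (
    id         BIGINT PRIMARY KEY,
    session_id BIGINT NOT NULL,
    action     VARCHAR(50) NOT NULL,
    created_at DATETIME NOT NULL,
    INDEX (created_at)
) ENGINE=InnoDB;
-- Periodic job (cron or a MySQL event): move rows that are old enough.
-- Snapshot the cutoff first so the INSERT and DELETE cover the same rows.
SET @cutoff = NOW() - INTERVAL 5 MINUTE;
INSERT INTO activity_history
SELECT * FROM activity_live WHERE created_at < @cutoff;
DELETE FROM activity_live WHERE created_at < @cutoff;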
I am assigned to a project where I need to work with a medium-sized database. When I opened that database I saw that it is not correctly designed and it really should have more tables than it does. Not even normalization is applied!
But the problem is that the database already holds data for almost 500 users. When I break up the old database, the existing users will lose their data.
So I must copy this data into the newly structured tables of the new database (but not all fields may match). I think there is no tool to automate this - or is there?
Is there any best practice to follow to do such type of work?
Is the schema really a problem, or do you just want to fix it because it's not in third normal form?
Anyways, I'd create an entirely new database with the desired, normalized schema and write some import routines.
If the database was / is heavily used, I'd create some views to maintain read compatibility (the views would have the same names as the former tables and the same columns); that way, all you have to change are the insert / update parts and, of course, the connection strings.
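As a small illustration of the view trick, suppose the old database had one wide customer table that the new schema splits into customer_core and address (all names hypothetical): old readers keep querying customer, but the data now lives in the normalized tables.
CREATE VIEW customer AS
SELECT c.id,
       c.name,
       a.street,
       a.city
FROM   customer_core c
JOIN   address a ON a.customer_id = c.id;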
The question being asked is:
Is there any tool that can transform a non-normalized database into a normalized database while preserving all of its contents?
The answer is: no.
You have to fine-tune the database optimization to your needs.