I have two large SQL files of around 8 GB each, but in the latest backup I found that one file is missing about 300 MB of data.
I just want to find out which data is missing, so I can check whether it was only temporary data or important data that has vanished.
When comparing the two files with diff on Ubuntu 14.04 I always get a memory allocation error. I have also tried workarounds such as allowing more memory, but with no luck.
I want to gather all data that exists in sql1 but is missing from sql2 into a new file, sql3.
Please help!
EDIT: I recently moved from a plain MySQL server to Percona XtraDB Cluster, and a lot of tables were converted from MyISAM to InnoDB in the process. Could that be the reason for the 300 MB decrease in the mysqldump SQL files? I seriously doubt it, because SQL is SQL, but does the dumped SQL get smaller for InnoDB in any case? Expert advice on this would help.
Comparing SQL dumps is quite hard when dealing with large amounts of data. I would try the following:
Import each SQL file into its own database
Use one of the methods indicated here to compare the database content (I assume the schema is the same), e.g. Toad for MySQL
Comparing this way should be faster, as data manipulation is much quicker once the data is stored in a database, and it also has the advantage that the missing data can easily be used. E.g.
SELECT *
FROM db1.sometable
WHERE NOT EXISTS (SELECT 1
                  FROM db2.sometable
                  WHERE db1.sometable.pkcol = db2.sometable.pk2)
will return exactly the missing rows in a convenient way.
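If you want that difference gathered into the new file sql3, one possible sketch (assuming the dumps were imported as databases db1 and db2, and using the placeholder table/column names above) is to materialize the missing rows into a third database and dump that:
-- Placeholder names throughout; repeat the INSERT for each table of interest.
CREATE DATABASE IF NOT EXISTS db3;
CREATE TABLE db3.sometable LIKE db1.sometable;
INSERT INTO db3.sometable
SELECT t1.*
FROM db1.sometable AS t1
WHERE NOT EXISTS (SELECT 1
                  FROM db2.sometable AS t2
                  WHERE t2.pkcol = t1.pkcol);
A plain mysqldump of db3 then produces sql3, containing only the data that is present in sql1 but missing from sql2.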
If you export the dump you can use tools like Beyond Compare, Semantic Merge, WinMerge, Code Compare or other diff tools.
Note that some tools (e.g. Beyond Compare) have a 4096-character limit per line, which becomes a problem in the comparison (it drove me mad). It's possible to change that in Tools -> File Formats -> [choose your format, probably Everything Else] -> Conversion -> 64000 characters per line (the maximum).
You can also try changing the file format to SQL (it might not help much, though, and it will slow down your comparison).
Related
What's the best way to copy a large MySQL table in terms of speed and memory use?
Option 1. Using PHP, select X rows from the old table and insert them into the new table, then proceed to the next iteration of select/insert until all entries are copied over.
Option 2. Use MySQL INSERT INTO ... SELECT without row limits.
Option 3. Use MySQL INSERT INTO ... SELECT with a limited number of rows copied over per run.
EDIT: I am not going to use mysqldump. The purpose of my question is to find the best way to write a database conversion program. Some tables have changed, some have not. I need to automate the entire copy over / conversion procedure without worrying about manually dumping any tables. So it would be helpful if you could answer which of the above options is best.
There is a program written specifically for this task: mysqldump.
mysqldump is a great tool in terms of simplicity and careful handling of all types of data, but it is not as fast as LOAD DATA INFILE.
If you're copying within the same database, I like this version of Option 2:
a) CREATE TABLE foo_new LIKE foo;
b) INSERT INTO foo_new SELECT * FROM foo;
I've got lots of tables with hundreds of millions of rows (like 1/2B), with InnoDB, several keys and constraints. They take many, many hours to restore from a MySQL dump, but only an hour or so with LOAD DATA INFILE. It is correct that copying the raw files with the DB offline is even faster. It is also correct that non-ASCII characters, binary data, and NULLs need to be handled carefully in CSV (or tab-delimited) files, but fortunately I've pretty much got numbers and text :-). I might take the time to see how long the above steps a) and b) take, but I think they are slower than LOAD DATA INFILE... probably because of transactions.
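For what it's worth, the LOAD DATA INFILE round trip I mean looks roughly like this sketch (table and path names are placeholders; the file has to live on the server host and the account needs the FILE privilege):
-- Export in the default tab-delimited, backslash-escaped format...
SELECT * FROM foo INTO OUTFILE '/tmp/foo.txt';
-- ...then recreate the structure and bulk-load the file.
CREATE TABLE foo_new LIKE foo;
LOAD DATA INFILE '/tmp/foo.txt' INTO TABLE foo_new;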
Of the three options listed above, I would select the second option if you have a unique constraint on at least one column, so that no duplicate rows are created if the script has to be run multiple times to finish its task in the event of server timeouts.
Otherwise your third option would be the way to go, while manually taking any server timeouts into account to determine your INSERT ... SELECT limits.
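A sketch of the second option with that safety net (INSERT IGNORE is my addition, not something named in the options above; table names are placeholders):
-- With a unique key on new_table, re-running an interrupted copy skips the
-- rows that already made it across instead of erroring or duplicating them.
INSERT IGNORE INTO new_table
SELECT * FROM old_table;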
Use a stored procedure
Option 2 is likely the fastest, but it's going to be a mighty long transaction. You should look into making a stored procedure do the copy; that way you can offload some of the data parsing/handling from the client onto the MySQL engine.
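A minimal sketch of such a procedure, assuming the source table has an auto-increment primary key named id (all names and the batch size are placeholders):
DELIMITER //
CREATE PROCEDURE copy_in_chunks()
BEGIN
  DECLARE max_id BIGINT;
  DECLARE batch_start BIGINT DEFAULT 0;
  DECLARE batch_size BIGINT DEFAULT 10000;
  SELECT COALESCE(MAX(id), 0) INTO max_id FROM old_table;
  -- Copy bounded slices so no single statement has to cover the whole table.
  WHILE batch_start < max_id DO
    INSERT INTO new_table
    SELECT * FROM old_table
    WHERE id > batch_start AND id <= batch_start + batch_size;
    SET batch_start = batch_start + batch_size;
  END WHILE;
END //
DELIMITER ;
Running CALL copy_in_chunks(); then does the whole copy server-side, without shuttling rows through a client.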
MySQL's LOAD DATA query is faster than almost anything else; however, it requires exporting each table to a CSV file.
Pay particular attention to escape characters and to how NULL values, binary data, etc. are represented in the CSV, to avoid data loss.
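For illustration, a sketch of a round trip whose clauses keep those cases intact (paths, delimiters and table names are placeholders): with a non-empty ESCAPED BY character, SELECT ... INTO OUTFILE writes NULLs as \N and escapes special characters, and a LOAD DATA using the same clauses reads them back as NULLs.
SELECT * FROM src_table
INTO OUTFILE '/tmp/src_table.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n';
LOAD DATA INFILE '/tmp/src_table.csv'
INTO TABLE dst_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\'
LINES TERMINATED BY '\n';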
If possible, the fastest way will be to take the database offline and simply copy data files on disk.
Of course, this has some requirements:
you can stop the database while copying.
you are using a storage engine that stores each table in individual files; MyISAM does this.
you have privileged access to the database server (root login or similar)
Ah, I see you have edited your post; in that case this DBA-from-hell approach is probably not an option... but still, it's fast!
The best way I have found so far is to create the files as dump files (.txt) using SELECT ... INTO OUTFILE, and then use LOAD DATA INFILE in MySQL to get the same data back into the database.
I am in the process of setting up a MySQL server to store some data, but I realized (after reading a bit this weekend) that I might have a problem uploading the data in time.
I basically have multiple servers generating daily data and sending it to a shared queue to process/analyze. The data is about 5 billion rows (although each row is very small: an ID number in one column and a dictionary of ints in another). Most of the performance reports I have seen show insert speeds of 60k to 100k rows/second, which would take over 10 hours. We need the data in very quickly so we can work on it that day, and then we may discard it (or archive the table to S3 or something).
What can I do? I have 8 servers at my disposal (in addition to the database server); can I somehow use them to make the uploads faster? At first I was thinking of using them to push data to the server at the same time, but I'm also thinking maybe I could load the data onto each of them and then somehow merge all the separate data into one server?
I was going to use MySQL with InnoDB (I can use other settings if it helps), but it's not finalized, so if MySQL doesn't work, is there something else that will? (I have used HBase before, but was looking for a MySQL solution first; it seems more widely used and easier to get help with.)
Wow. That is a lot of data you're loading. It's probably worth quite a bit of design thought to get this right.
Multiple MySQL server instances won't help with loading speed. What will make a difference is fast processor chips and a very fast disk I/O subsystem on your MySQL server. If you can use a 64-bit processor and provision it with a LOT of RAM, you may be able to use the MEMORY storage engine for your big table, which will be very fast indeed. (But if that would work for you, a gigantic Java HashMap may work even better.)
Ask yourself: Why do you need to stash this info in a SQL-queryable table? How will you use your data once you've loaded it? Will you run lots of queries that retrieve single rows or just a few rows of your billions? Or will you run aggregate queries (e.g. SUM(something) ... GROUP BY something_else) that grind through large fractions of the table?
Will you have to access the data while it is incompletely loaded? Or can you load up a whole batch of data before the first access?
If all your queries need to grind the whole table, then don't use any indexes. Otherwise do. But don't throw in any indexes you don't need. They are going to cost you load performance, big time.
Consider using MyISAM rather than InnoDB for this table; MyISAM's lack of transaction semantics makes it faster to load. MyISAM will do fine at handling either aggregate queries or few-row queries.
You probably want to have a separate table for each day's data, so you can "get rid" of yesterday's data by either renaming the table or simply accessing a new table.
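A sketch of that rotation (table names and dates are placeholders); RENAME TABLE swaps both tables in one atomic statement, so queries never see a half-rotated state:
-- Prepare tomorrow's empty table, swap it in, then drop (or archive) the old day.
CREATE TABLE data_staging LIKE data_today;
RENAME TABLE data_today TO data_2011_01_14, data_staging TO data_today;
DROP TABLE data_2011_01_13;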
You should consider using the LOAD DATA INFILE command.
http://dev.mysql.com/doc/refman/5.1/en/load-data.html
This command causes the MySQL server to read a file from the MySQL server's file system and bulk-load it directly into a table. It's way faster than doing INSERT commands from a client program on another machine. But it's also trickier to set up in production: your shared queue needs access to the MySQL server's file system to write the data files for loading.
You should consider disabling indexing, then loading the whole table, then re-enabling indexing, but only if you don't need to query partially loaded tables.
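For a MyISAM table, that disable/re-enable step looks like this sketch (table and file names are placeholders); DISABLE KEYS defers maintenance of the non-unique indexes, and ENABLE KEYS rebuilds them in one pass after the load:
ALTER TABLE daily_data DISABLE KEYS;
LOAD DATA INFILE '/path/to/batch.txt' INTO TABLE daily_data;
ALTER TABLE daily_data ENABLE KEYS;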
I have got a dump of all my SQL databases.
In this dump I have "database1", "database2", "database3".
How can I extract the data into separate files from this dump? Maybe with some program or script?
Or delete only "database2" from the dump, for example?
It depends how big it is.
If it's small (i.e. < 1 GB) then you can easily load it into a MySQL instance on a test box (a VM or somewhere) and then do another dump containing just the DBs you're interested in. This is definitely the most reliable way.
If the dump is very large, say 500 GB, then it could be more difficult.
Applying text processing to MySQL dump files is not advisable, because they aren't actually text files! They can contain arbitrary binary data, and that binary data might happen to contain the very things you're searching for (a problem if you're using an awk program to process the dump, for example).
Depends on your use-case really.
I'm looking at having someone do some optimization on a database. If I gave them a similar version of the DB with different data, could they create a script file to run all the optimizations on my database (i.e. create indexes, etc.) without ever seeing or touching the actual database? I'm looking at MySQL but would be open to other DBs if necessary. Thanks for any suggestions.
EDIT:
What if it were an identical copy with transformed data, along with a couple of sample queries that approximate what the DB is used for (i.e. OLAP vs. OLTP)? Would a script be able to contain everything, or would they need hands-on access to the actual DB?
EDIT 2:
Could I create a copy of the DB, transform the data to make it unrecognizable, create a backup file of that copy, give it to the vendor, and have them give me a script file to run on my DB?
Why are you concerned that they should not access the database? You will get better optimization if they have the actual data, as they can consider table sizes, which queries run the slowest, whether to denormalize where necessary, putting small tables completely in memory, and so on.
If it is an issue of confidentiality, you can always anonymize the data by replacing names.
If it's just adding indices, then yes. However, there are a number of things to consider when "optimizing". Which are the slowest queries in your database? How large are certain tables? How can certain things be changed/migrated to make those certain queries run faster? It could be harder to see this with sparse sample data. You might also include a query log so that this person could see how you're using the tables/what you're trying to get out of them, and how long those operations take.
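If you go the query-log route, one option is the slow query log; a sketch of enabling it at runtime (the threshold and path are placeholders, and you need the appropriate privileges):
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;    -- log anything slower than 1 second
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';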
Is there anything better (faster or smaller) than pages of plain-text CREATE TABLE and INSERT statements for dumping MySQL databases? It seems awfully inefficient for large amounts of data.
I realise that the underlying database files can be copied, but I assume they will only work in the same version of MySQL that they come from.
Is there a tool I don't know about, or a reason for this lack?
Not sure if this is what you're after, but I usually pipe the output of mysqldump directly to gzip or bzip2 (etc.). It tends to be considerably faster than dumping to a plain file and compressing it afterwards, and the output files are much smaller thanks to the compression.
mysqldump --all-databases (other options) | gzip > mysql_dump-2010-09-23.sql.gz
It's also possible to dump to XML with the --xml option if you're looking for "portability" at the expense of consuming (much) more disk space than the gzipped SQL...
Sorry, there's no binary dump for MySQL. However, MySQL's binary logs exist specifically for backup and database replication purposes: http://dev.mysql.com/doc/refman/5.5/en/binary-log.html. They are not hard to configure. Only changes such as updates and deletes are logged, so each log file (created automatically by MySQL) is also an incremental backup of the changes in the DB. This way you can save a whole snapshot of the DB from time to time (once a month?), store just the log files in between, and in case of a crash restore the latest snapshot and replay the logs.
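A couple of statements that go with that setup, assuming log_bin is already enabled in the server configuration (just a sketch, not a full backup procedure):
SHOW BINARY LOGS;   -- list the log files that make up the incremental history
FLUSH LOGS;         -- close the current binary log and start a new one,
                    -- e.g. right after taking your periodic full snapshot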
It's worth noting that MySQL has a special syntax for doing bulk inserts. From the manual:
INSERT INTO tbl_name (a,b,c) VALUES(1,2,3),(4,5,6),(7,8,9);
would insert 3 rows in a single operation. So loading this way isn't as inefficient as it might otherwise be with one statement per row: instead of 129 bytes across 3 separate INSERT statements, this is 59 bytes, and that advantage only gets bigger the more rows you have.
I've never tried this, but aren't MySQL tables just binary files on the hard drive? Couldn't you just copy the table files themselves? Presumably that's essentially what you are asking for.
I don't know how to stitch that together, but it seems to me a copy of /var/lib/mysql would do the trick.