Restructure huge unnormalized MySQL database - mysql

Hi, I have a huge unnormalized MySQL database with ~100 million URLs (~20% dupes) divided into identical split tables of 13 million rows each.
I want to move the URLs into a normalized database on the same MySQL server.
The old database table is unnormalized, and the URLs have no index.
It looks like this:
entry{id, data, data2, data3, data4, position, rank, url}
And I'm going to split it up into multiple tables:
url{id,url}
data{id,data}
data1{id,data}
etc.
The first thing I did was:
INSERT IGNORE INTO newDatabase.url (url)
SELECT DISTINCT unNormalised.url FROM oldDatabase.unNormalised
But the " SELECT DISTINCT unNormalised.url" (13 million rows) took ages, and I figured that that since "INSERT IGNORE INTO" also do a comparison, it would be fast to just do a
INSERT IGNORE INTO newDatabase.url (url)
SELECT unNormalised.url FROM oldDatabase.unNormalised
without the DISTINCT. Is this assumption wrong?
Anyway, it still takes forever and I need some help. Is there a better way of dealing with this huge quantity of unnormalized data?
Would it be best if I did a SELECT DISTINCT unNormalised.url on the entire 100 million row database, exported all the IDs, and then moved only those IDs to the new database with, let's say, a PHP script?
All ideas are welcome; I have no clue how to port all this data without it taking a year!
PS: it is hosted on an Amazon RDS server.
Thank you!

As the MySQL manual states, LOAD DATA INFILE is quicker than INSERT, so the fastest way to load your data would be:
LOCK TABLES url WRITE;
ALTER TABLE url DISABLE KEYS;
LOAD DATA INFILE 'urls.txt'
IGNORE
INTO TABLE url
...;
ALTER TABLE url ENABLE KEYS;
UNLOCK TABLES;
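If you first need to produce urls.txt from the old table, a sketch would be something like the following (this assumes the FILE privilege and server-side file access; on Amazon RDS you generally cannot write server-side files, so you would dump from a client instead and use LOAD DATA LOCAL INFILE):
SELECT url
INTO OUTFILE '/tmp/urls.txt'
FROM oldDatabase.unNormalised;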
But since you already have the data loaded into MySQL and just need to normalize it, you might try:
LOCK TABLES url WRITE, oldDatabase.unNormalised READ;
ALTER TABLE url DISABLE KEYS;
INSERT IGNORE INTO url (url)
SELECT url FROM oldDatabase.unNormalised;
ALTER TABLE url ENABLE KEYS;
UNLOCK TABLES;
My guess is that INSERT IGNORE ... SELECT will be faster than INSERT IGNORE ... SELECT DISTINCT but that's just a guess.
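Note that IGNORE (in both LOAD DATA ... IGNORE and INSERT IGNORE) only skips duplicate URLs if the new url table has a unique index on the url column. A sketch of the assumed target table (the column size is just an example; very long URLs would need a unique index on a hash of the URL instead):
CREATE TABLE url (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  url VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_url (url)   -- this is what makes the IGNORE clause drop duplicates
);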

Related

Adding records in MySQL

I want to add some 1000 records into my table to populate a database. Inserting each record manually is not at all practical. Is there a proper way to do this?
In MySQL you can insert multiple rows with a single insert statement.
INSERT INTO table_name VALUES (data-row-1), (data-row-2), (data-row-3);
If you run a mysqldump on your database, you will see that this is what the output does.
The insert is then run as a single "transaction", so it's much, much faster than running 1000 individual inserts
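For example, with a hypothetical people table:
INSERT INTO people (name, age) VALUES
  ('Alice', 30),
  ('Bob', 25),
  ('Carol', 41);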

Create a view or new table for caching records

I'm experiencing a huge performance problem in one legacy application.
There is a search form where the user can search for records containing a given value.
A result row contains 10 columns, and a stored procedure (SP) returns any row that contains that value in any of its columns.
This SP uses 8 tables, and some of them have about a million records. Every minute I get a new record. The SP handles paging as well.
Execution of this SP sometimes takes around 40 seconds.
What I did was create a new table and fill it with all the records, using the query from this SP but without the conditions.
When there is a new record or an update in one of the source tables, I use a trigger to update this new "cache" table.
Now getting results from this new table takes only 1-3 seconds.
Does anyone have experience with something like this?
One of my colleagues said I'd be better off using a view, but then I'd be doing the JOINs every time.
What do you think? Is there another way?
Oftentimes temporary tables can help you resolve performance issues. One approach might be to collect only the records that you need to consider into temporary tables and then build your final select statement from the temporary tables joined to any other tables that you're not filtering on.
As an example, let's say one of the fields you are searching on is field1 in table1. Start by inserting into the temporary table #table1 only the records that have the value of field1 you are looking for:
select PrimaryKeyTable1, Field1, Field2, Field3, etc...
into #table1
from table1
where Field1 = 'Whatever you are looking for'
This should be pretty fast even for big tables, especially if you have an index on Field1. You do this for every table that has search fields, to collect all the records that are relevant to your search.
Then you also need to be sure to insert into your temporary tables any records that might have foreign key references to any of your other temporary tables. So let's say you also built a table #table2 with the above method that has a foreign key to table1 called PrimaryKeyTable1. You would insert those records like this:
Insert into #table1
(PrimaryKeyTable1, Field1, Field2, Field3, etc...)
select table1.PrimaryKeyTable1, table1.Field1, table1.Field2, table1.Field3, etc...
from table1
join #table2
on table1.PrimaryKeyTable1 = #table2.PrimaryKeyTable1
where table1.PrimaryKeyTable1 not in
(Select PrimaryKeyTable1 from #table1)
Now #table1 will also contain any records that are referenced by a record in #table2 which matched the search criteria. You do this for all your temporary tables that have relevant foreign keys. The order in which you do the inserts matters; be sure that you don't reference any temporary table until after the last insert statement that collects the records it references by foreign key.
Then you can simply do your final select statement, replacing the actual tables with the temporary tables you have built and eliminating all the filters that search your field data. Depending on the structure of your query there might be other optimizations, but that is the general idea.
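For example, the final statement might end up looking something like this (table and column names are hypothetical):
SELECT t1.Field1, t1.Field2, t2.Field1, t3.Description
FROM #table1 AS t1
JOIN #table2 AS t2 ON t2.PrimaryKeyTable1 = t1.PrimaryKeyTable1
JOIN table3 AS t3 ON t3.PrimaryKeyTable2 = t2.PrimaryKeyTable2
ORDER BY t1.Field1;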
If you've already explored all of your indexing options and this still doesn't help, MS SQL Server has a "Change Tracking" feature that may be of use to you in building your cache table. You enable the database for change tracking and configure which tables you wish to track. SQL Server then creates change records on every update, insert, or delete on a table and lets you query for changes to records that have been made since the last time you checked. This is very useful for syncing changes and is more efficient than using triggers. It's also easier to manage than making your own tracking tables. This has been a feature since SQL Server 2008.
How to: Use SQL Server Change Tracking
Change Tracking only captures the primary keys of the tables and lets you query which columns might have been modified. Then you can join the tables on those keys to get the current data. If you want it to capture the data as well, you can use Change Data Capture, but it requires more overhead and at least SQL Server 2008 Enterprise Edition.
Change Data Capture
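A rough sketch of enabling and querying Change Tracking (database, table, and column names are placeholders):
ALTER DATABASE MyDatabase
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.table1
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);

-- @last_sync_version is a BIGINT you saved earlier from CHANGE_TRACKING_CURRENT_VERSION();
-- this returns the keys of rows changed since then:
SELECT ct.PrimaryKeyTable1, ct.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES dbo.table1, @last_sync_version) AS ct;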
Your solution is a robust way of doing what is called an "indexed view" in Microsoft SQL Server or a "materialized view" in Oracle.
Basically you are correct - it's faster to query a single indexed table than a dozen tables that are updated constantly.
You should really try creating an indexed view (start here: https://technet.microsoft.com/en-us/library/dd171921(v=sql.100).aspx) and it will probably solve all your performance issues.
You can use a schema-bound view and create a clustered index on it. That will store your view data physically, but after creating the schema-bound view you cannot alter the underlying table.
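A minimal sketch of such an indexed view, with made-up table and column names:
CREATE VIEW dbo.SearchCache
WITH SCHEMABINDING
AS
SELECT t1.PrimaryKeyTable1, t2.PrimaryKeyTable2, t1.Field1, t2.Field2
FROM dbo.table1 AS t1
JOIN dbo.table2 AS t2 ON t2.PrimaryKeyTable1 = t1.PrimaryKeyTable1;
GO

-- the unique clustered index is what materializes the view's rows on disk
CREATE UNIQUE CLUSTERED INDEX IX_SearchCache
ON dbo.SearchCache (PrimaryKeyTable1, PrimaryKeyTable2);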

How to handle millions of separate insert queries

I have a situation in which I have to insert over 10 million separate records into one table. Normally a batch insert split into chunks does the job for me. The problem, however, is that this 3+ GB file contains over 10 million separate insert statements. Since every query takes 0.01 to 0.1 seconds, it will take over 2 days to insert everything.
I'm sure there must be a way to optimize this, by either lowering the insert time drastically or somehow importing in a different way.
I'm now just using the CLI:
source /home/blabla/file.sql
Note: it's a 3rd party that is providing me this file. I'm
Small update
I removed all indexes.
Drop the indexes, then re-index when you are done!
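For example (the index and column names are made up):
ALTER TABLE tablename DROP INDEX idx_field1;
-- ... run the big import here ...
ALTER TABLE tablename ADD INDEX idx_field1 (field1);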
Maybe you can parse the file data and combine several INSERT queries to one query like this:
INSERT INTO tablename (field1, field2...) VALUES (val1, val2, ..), (val3, val4, ..), ...
There are some ways to improve the speed of your INSERT statements:
Try to insert many rows at once if this is an option.
An alternative can be to create a copy of your desired table without indexes, insert the data there, then add the indexes and rename the table.
Maybe use LOAD DATA INFILE, if this is an option.
The MySQL manual has something to say about that, too.
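In particular, if the target table is InnoDB, the manual suggests turning off autocommit and the uniqueness/foreign-key checks for the duration of the import, roughly like this (re-enable everything afterwards):
SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;

SOURCE /home/blabla/file.sql;

COMMIT;
SET foreign_key_checks = 1;
SET unique_checks = 1;
SET autocommit = 1;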

Delete all fields in huge mysql table

I have a large database, and now I need to clear some tables that contain a lot of data, millions of rows. But if I delete using SQL syntax, the phpMyAdmin interface, or by deleting the table, I still have some data left after refreshing. How do I clear all the data in a table?
The easiest way to ensure you get a table wiped is to use the TRUNCATE TABLE table_name statement. If you still have data in the table after that, it means something is constantly adding data to the table.
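For example (the table name is hypothetical):
TRUNCATE TABLE my_big_table;   -- fast: effectively drops and recreates the table
-- or:
DELETE FROM my_big_table;      -- slower: removes rows one by one, but fires triggers and can be rolled back inside an InnoDB transaction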

Large MySQL Table daily updates?

I have a MySQL table that has a bunch of product pricing information on around 2 million products. Every day I have to update this information for any products whose pricing information has changed [huge pain].
I am wondering what the best way to handle these changes is, other than running something like a compare and updating any products that have changed?
I'd love any advice that you can provide.
For bulk updates you should definitely be using LOAD DATA INFILE rather than a lot of smaller update statements.
First, load the new data into a temporary table:
LOAD DATA INFILE 'foo.txt' INTO TABLE bar (productid, info);
Then run the update:
UPDATE products, bar SET products.info = bar.info WHERE products.productid = bar.productid;
If you also want to INSERT new records from the same file that you're updating from, you can SELECT INTO OUTFILE all of the records that don't have a matching ID in the existing table then load that outfile into your products table using LOAD DATA INFILE.
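A sketch of that insert step, reusing the hypothetical products and bar tables from above (the output file path is just an example, and you need the FILE privilege for INTO OUTFILE):
SELECT bar.productid, bar.info
INTO OUTFILE '/tmp/new_products.txt'
FROM bar
LEFT JOIN products ON products.productid = bar.productid
WHERE products.productid IS NULL;

LOAD DATA INFILE '/tmp/new_products.txt'
INTO TABLE products (productid, info);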
I maintain a price comparison engine with millions of prices. I select each row that I find in the source and update each row individually; if there is no row, then I insert. It's best to use InnoDB transactions to speed this up.
This is all done by a PHP script that knows how to parse the source files and update the tables.
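If the products table has a unique key on the product id, the same check-then-write logic can also be collapsed into a single statement per row with INSERT ... ON DUPLICATE KEY UPDATE, wrapped in a transaction; a rough sketch (the values are made up):
START TRANSACTION;

INSERT INTO products (productid, info)
VALUES (12345, 'new price data')
ON DUPLICATE KEY UPDATE info = VALUES(info);

-- ... repeat for each row parsed from the source file ...

COMMIT;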