Optimize MySQL query (raw data to processed table) - mysql

Hi everyone, my question is this: I have a file that loads roughly 3000 rows of data via the LOAD DATA LOCAL INFILE command. There is a trigger on the table being inserted into that copies three columns from the updated table and two columns from a table that already exists in the database (if it's unclear what I mean, the structures are below). From there, only combinations with unique glNumbers are entered into the processed table. This normally takes over a minute and a half, which seems pretty long to me. Is this normal for what I'm doing (I can't believe that's true), or is there a way to optimize the queries so it goes faster?
The tables being inserted into are named with the first three letters of each month. Here is the default structure.
RawData Structure
| idjan | glNumber | journel | invoiceNumber | date | JT | debit | credit | descriptionDetail | totalDebit | totalCredit |
(sorry for the poor formatting; there doesn't seem to be a good way to lay this out)
After Insert Trigger Query
delete from processedjan;
insert into processedjan(glNumber,debit,credit,bucket1,bucket2)
select a.glNumber, a.totalDebit, a.totalCredit, b.bucket1, b.bucket2
from jan a inner join bucketinformation b on a.glNumber = b.glNumber
group by a.glNumber;
Processed Datatable Structure
| glNumber | bucket1| bucket2| credit | debit |
Also, I guess it helps to know that bucket1 and bucket2 come from another table where they are matched against the glNumber. That table is roughly 800 rows, with three columns: the glNumber and the two buckets.

While PostgreSQL has statement-level triggers, MySQL only has row-level triggers. From the MySQL reference:
A trigger is defined to activate when a statement inserts, updates, or
deletes rows in the associated table. These row operations are trigger
events. For example, rows can be inserted by INSERT or LOAD DATA
statements, and an insert trigger activates for each inserted row.
So while you are managing to load 3000 rows in one operation, unfortunately 3000 more queries are executed by the trigger. And given the complex nature of your transaction, it sounds like you might actually be performing 2-3 queries per row. That's the real reason for the slowdown.
You can speed things up by dropping the trigger and carrying out an INSERT ... SELECT after the LOAD DATA LOCAL INFILE. You can automate it with a small script.
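A minimal sketch of that approach for the January tables (the trigger name and file path are placeholders; ANY_VALUE requires MySQL 5.7+ and mirrors the one-row-per-glNumber behaviour of the original trigger):
-- Drop (or disable) the per-row trigger so the bulk load is a single pass.
DROP TRIGGER IF EXISTS jan_after_insert;
-- Bulk load the raw file (add FIELDS/LINES clauses to match your file format).
LOAD DATA LOCAL INFILE '/path/to/jan.csv' INTO TABLE jan;
-- Rebuild the processed table once, as a single set-based statement.
DELETE FROM processedjan;
INSERT INTO processedjan (glNumber, debit, credit, bucket1, bucket2)
SELECT a.glNumber,
       ANY_VALUE(a.totalDebit),
       ANY_VALUE(a.totalCredit),
       ANY_VALUE(b.bucket1),
       ANY_VALUE(b.bucket2)
FROM jan AS a
INNER JOIN bucketinformation AS b ON a.glNumber = b.glNumber
GROUP BY a.glNumber;
This way the join against the ~800-row bucketinformation table runs once, instead of once per loaded row.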

Related

MySQL update of large table based on another large table too slow

I have one table that looks like this:
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| name        | varchar(255) | NO   | PRI | NULL    |       |
| timestamp1  | int          | NO   |     | NULL    |       |
| timestamp2  | int          | NO   |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
This table has around 250 million rows in it. Once a day I get a CSV that contains around 225 million rows of just one name column. 99% of the names in the daily CSV are already in the database. For all the names that are already there, I want to update their timestamp1 column to UNIX_TIMESTAMP(NOW()); all the names that are in the CSV but not in the original table should be added to it. Right now this is how I am doing it:
DROP TEMPORARY TABLE IF EXISTS tmp_import;
CREATE TEMPORARY TABLE tmp_import (name VARCHAR(255), PRIMARY KEY (name));
LOAD DATA LOCAL INFILE 'path.csv' INTO TABLE tmp_import LINES TERMINATED BY '\n';
UPDATE og_table og SET timestamp1 = UNIX_TIMESTAMP(NOW()) WHERE og.name IN (SELECT tmp.name FROM tmp_import tmp);
DELETE FROM tmp_import WHERE name in (SELECT og.name FROM og_table og);
INSERT INTO og_table SELECT name, UNIX_TIMESTAMP(NOW()) AS timestamp1, UNIX_TIMESTAMP(NOW()) AS timestamp2 FROM tmp_import;
As you might guess, the update line is taking a long time, over 6 hours, or it throws an error. Reading the data in takes upwards of 40 minutes. I know this is mostly because it is building an index on name; when I don't set it as a primary key it only takes 9 minutes to read the data in, but I thought having an index would speed up the operation. I have tried the update several different ways: what I have above, and the following:
UPDATE og_table og SET timestamp1 = UNIX_TIMESTAMP(NOW()) WHERE EXISTS (SELECT tmp.name FROM tmp_import tmp where tmp.name = og.name);
UPDATE og_table og inner join tmp_import tmp on og.name=tmp.name SET og.timestamp1 = UNIX_TIMESTAMP(NOW());
Neither of those attempts worked. It normally takes several hours and then ends up with:
ERROR 1206 (HY000): The total number of locks exceeds the lock table size
I am using InnoDB for these tables, but no foreign keys are required and the benefits of that engine are not strictly needed, so I would be open to trying different storage engines.
I have been looking through a lot of posts and have yet to find something to help in my situation. If I missed a post I apologize.
If the name values are rather long, you might greatly improve performance by using a hash function such as MD5 or SHA-1 and storing and indexing only the hash. You probably don't even need all 128 or 160 bits; an 80-bit portion should be good enough, with a very low chance of a collision. See this.
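A minimal sketch of that idea (MySQL 5.7+; the column and index names here are made up): keep an 80-bit prefix of the SHA-1 as a stored generated column and index that instead of the full name.
-- 10-byte (80-bit) hash of the name, maintained automatically by MySQL.
ALTER TABLE og_table
    ADD COLUMN name_hash BINARY(10) AS (UNHEX(LEFT(SHA1(name), 20))) STORED,
    ADD INDEX idx_name_hash (name_hash);
-- Compare the short hashes instead of the full VARCHAR(255) values;
-- the extra name comparison guards against the (unlikely) collision.
SELECT og.name
FROM og_table AS og
JOIN tmp_import AS ti
  ON og.name_hash = UNHEX(LEFT(SHA1(ti.name), 20))
 AND og.name = ti.name;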
Another thing you might want to check is if you have enough RAM. How big is your table and how much RAM do you have? Also, it's not just about how much RAM you have on the machine, but how much of it is available to MySQL/InnoDB's buffer cache.
What disk are you using? If you are using a spinning disk (HDD), that might be a huge bottleneck if InnoDB needs to constantly make scattered reads.
There are many other things that might help, but I would need more details. For example, if the names in the CSV are not sorted, and your buffer caches are about 10-20% of the table size, you might have a huge performance boost by splitting the work in batches, so that names in each batch are close enough (for example, first process all names that start with 'A', then those starting with 'B', etc.). Why would that help? In a big index (in InnoDB tables are also implemented as indexes) that doesn't fit into the buffer cache, if you make millions of reads all around the index, the DB will need to constantly read from the disk. But if you work on a smaller area, the data blocks (pages) will only be read once and then they will stay in the RAM for subsequent reads until you've finished with that area. That alone might easily improve performance by 1 or 2 orders of magnitude, depending on your case.
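A rough sketch of that batching idea (names are taken from the question; the prefix ranges are up to you, and you would loop over them from a script):
-- Process one alphabetical slice at a time so the index pages touched
-- stay in the buffer pool for the duration of the batch.
UPDATE og_table AS og
INNER JOIN tmp_import AS tmp ON og.name = tmp.name
SET og.timestamp1 = UNIX_TIMESTAMP(NOW())
WHERE og.name LIKE 'A%';
-- ...then repeat with 'B%', 'C%', and so on.
Smaller batches also keep the number of row locks per statement down, which helps with the "total number of locks exceeds the lock table size" error.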
A big update (as Barmar points out) takes a long time. Let's avoid it by building a new table, then swapping it into place.
First, let me get clarification and provide a minimal example.
You won't be deleting any rows, correct? Just adding or updating rows?
You have (in og_table):
A 88 123
B 99 234
The daily load (tmp_import) says
B
C
You want
A 88 123
B NOW() 234
C NOW() NULL
Is that correct? Now for the code:
load nightly data and build the merge table:
LOAD DATA ... (name) -- into TEMPORARY tmp_import
CREATE TABLE merge LIKE og_table; -- not TEMPORARY
Populate a new table with the data merged together
INSERT INTO merge
-- B and C (from the example):
( SELECT ti.name, UNIX_TIMESTAMP(NOW()), og.timestamp2
FROM tmp_import AS ti
LEFT JOIN og_table AS og USING(name)
) UNION ALL
-- A:
( SELECT og.name, og.timestamp1, og.timestamp2
FROM og_table AS og
LEFT JOIN tmp_import AS ti USING(name)
WHERE ti.name IS NULL -- (that is, missing from csv)
);
Swap it into place
RENAME TABLE og_table TO x,
merge TO og_table;
DROP TABLE x;
Bonus: og_table is "down" only very briefly (during the RENAME).
A possible speed-up: Sort the CSV file by name before loading it. (If that takes an extra step, the cost of that step may be worse than the cost of not having the data sorted. There is not enough information to predict.)

How to sync two tables that are not identical?

I have two projects using the same data. However, this data is saved in 2 different databases. Each of these two databases has a table that is almost the same as its counterpart in the other database.
What I am looking for
I am looking for a method to synchronise two tables. Put simply: if database_one.table gets an insert, that same record needs to be inserted into database2.table.
Database and Table One
Table Products
| product_id | name | description | price | vat | flags |
Database and Table Two
Table Articles
| articleId | name_short | name | price | price_vat | extra_info | flags |
The issue
I have never used and wouldn't know how to use any method of database synchronisation. What also worries me is that the tables are not identical and so I will somehow need to map columns to one another.
For example:
database_one.Products.name -> database_two.articles.name_short
Can someone help me with this?
You can use the MERGE statement:
https://www.mssqltips.com/sqlservertip/1704/using-merge-in-sql-server-to-insert-update-and-delete-at-the-same-time/
Then create a procedure that runs at the desired frequency or, if it needs to be instant, put the MERGE inside a trigger.
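A minimal sketch of what that MERGE could look like (SQL Server syntax; it assumes both databases live on the same instance and that articleId corresponds to product_id, which the question does not state):
MERGE database_two.dbo.Articles AS tgt
USING database_one.dbo.Products AS src
    ON tgt.articleId = src.product_id
WHEN MATCHED THEN
    UPDATE SET tgt.name_short = src.name,
               tgt.price      = src.price,
               tgt.flags      = src.flags
WHEN NOT MATCHED BY TARGET THEN
    INSERT (articleId, name_short, price, flags)
    VALUES (src.product_id, src.name, src.price, src.flags);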
One possible method is to use triggers. You need to create triggers for insert, update and delete on database_one.table that perform the corresponding operation on database2.table. I don't expect any problems with insert/update/delete between the two databases. When using triggers, you can map columns very easily.
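For the insert case, a minimal sketch (MySQL, both schemas on the same server; everything beyond the name -> name_short mapping from the question is a guess from the column names):
DELIMITER $$
CREATE TRIGGER database_one.products_after_insert
AFTER INSERT ON database_one.Products
FOR EACH ROW
BEGIN
    -- Map the Products columns onto the Articles columns.
    INSERT INTO database_two.Articles (articleId, name_short, name, price, flags)
    VALUES (NEW.product_id, NEW.name, NEW.name, NEW.price, NEW.flags);
END$$
DELIMITER ;
Similar AFTER UPDATE and AFTER DELETE triggers would cover the other two operations.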
However, you need to consider the pros and cons of using triggers - read something here or here. From my experience, performance is very important, so if you have a heavily loaded DB it is not a good idea to use triggers for data replication.
Maybe you should check this too?

Triggers with complex configurable conditions

Some background
We have a system which optionally integrates to several other systems. Our system shuffles data from a MySQL database to the other systems. Different customers want different data transferred. In order to not trigger unnecessary transfers (when no relevant data has changed) to these external systems, we have an "export" table which contains all the information any customer is interested in and a service which runs SQL queries defined in a file to compare the data in the export table to the data in the other tables and update the export table as appropriate, a solution we're not really happy with for several reasons:
No customer uses more than a fraction of these columns, although each column is used by at least one customer.
As the database grows, the service is causing increasing amounts of strain on the system. Some servers completely freeze while this service compares data, which may take up to 2 minutes (!) even though no customer has particularly large amounts of data (~15000 rows across all relevant tables, max). We fear what might happen if we ever get a customer with very large amounts of data. Performance could be improved by creating some indexes and improving the SQL queries, but we feel like that's attacking the problem from the wrong direction.
It's not very flexible, nor scalable. Having to add new columns every time a customer is interested in transferring data that no other customer has been interested in before (which happens a lot), just feels... icky. I don't know how much it really matters, but we're up to 37 columns in this table at the moment, and it keeps growing.
What we want to do
Instead, we would like to have a very slimmed down "export" table which only contains the bare minimum information, i.e. the table and primary key of the row that was updated, the system this row should be exported to, and some timestamps. A trigger in every relevant table would then update this export table whenever a column that has been configured to warrant an update is updated. This configuration should be read from another table (which, sometime in the future, could be configured from our web GUI), looking something like this:
+--------+--------+-----------+
| system | table  | column    |
+--------+--------+-----------+
| 'sys1' | 'tbl1' | 'column1' |
+--------+--------+-----------+
| 'sys2' | 'tbl1' | 'column2' |
+--------+--------+-----------+
Now, the trigger in tbl1 will read from this table when a row is updated. The configuration above should mean that if column1 in tbl1 has changed, then an export row for sys1 should be updated, if column2 has changed too, then an export row for sys2 should also be updated, etc.
So far, it all seems doable, although a bit tricky when you're not an SQL genius. However, we would preferably like to be able to define a little bit more complex conditions, at least something like "column3 = 'Apple' OR column3 = 'Banana'", and this is kind of the heart of the question...
So, to sum it up:
What would be the best way to allow for triggers to be configured in this way?
Are we crazy? Are triggers the right way to go here, or should we just stick to our service, smack on some indexes and suck it up? Or is there a third alternative?
How much of a performance increase could we expect to see? (Is this all worth it?)
This turned out to be impossible as asked, because MySQL does not allow dynamic SQL (prepared statements) inside triggers. Therefore we came up with reading the config table from PHP and generating "static" triggers. We'll try having 2 tables, one for columns and one for conditions, like so:
Columns
+--------+--------+-----------+
| system | table  | column    |
+--------+--------+-----------+
| 'sys1' | 'tbl1' | 'column1' |
+--------+--------+-----------+
| 'sys2' | 'tbl1' | 'column2' |
+--------+--------+-----------+
Conditions
+--------+--------+-------------------------------------------+
| system | table  | condition                                 |
+--------+--------+-------------------------------------------+
| 'sys1' | 'tbl1' | 'column3 = "Apple" OR column3 = "Banana"' |
+--------+--------+-------------------------------------------+
Then just build a statement like this in PHP; the column-change checks and the extra conditions below are generated from the config tables above:
DROP TRIGGER IF EXISTS `tbl1_AUPD`;
DELIMITER $$
CREATE TRIGGER `tbl1_AUPD` AFTER UPDATE ON tbl1 FOR EACH ROW
BEGIN
    -- updateExportTable is assumed to be an existing stored procedure.
    -- sys1: column1 changed (from the Columns table) AND its configured condition holds
    IF NOT (NEW.column1 <=> OLD.column1)
       AND (NEW.column3 = 'Apple' OR NEW.column3 = 'Banana') THEN
        CALL updateExportTable('sys1', 'tbl1', NEW.primary_key, NEW.timestamp);
    END IF;
    -- sys2: column2 changed, no extra condition configured
    IF NOT (NEW.column2 <=> OLD.column2) THEN
        CALL updateExportTable('sys2', 'tbl1', NEW.primary_key, NEW.timestamp);
    END IF;
END$$
DELIMITER ;
This seems to be the best solution for us, maybe even better than what I was asking for, but if anyone has a better suggestion I'm all ears!

Use of Index to Improve Speed of Aggregate functions in select query

I need to create a new table with sum aggregates of the measure columns in the source table.
The source table is huge.
eg. Source Table
Category | Product | Sales
A | P1 | 100
B | P2 | 200
C | P3 | 300
Query is like :
SELECT Category,
       Product,
       SUM(Sales)
FROM source_table
GROUP BY Category, Product;
There is no where condition.
Will indexing help in speeding up the process?
Any alternate mechanism for speeding the query?
It might help to add an index on Category since it is in the GROUP BY clause. But you're doing a full table dump, so it might just be slow.
Probably a better strategy is to create a new table for the sales report and populate it based on your business needs. If it can be updated only daily, then schedule a stored procedure to run nightly to repopulate it. If it needs to reflect the current state of the table, then you can use triggers to update the report table as the base table is updated. Or you could run a separate query at the application level to update the report table when your base table is updated.
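A sketch of the scheduled variant (MySQL; the table, column types and event name are made up, and the event scheduler has to be enabled):
-- Pre-aggregated report table.
CREATE TABLE sales_report (
    Category   VARCHAR(50)    NOT NULL,
    Product    VARCHAR(50)    NOT NULL,
    TotalSales DECIMAL(15, 2) NOT NULL,
    PRIMARY KEY (Category, Product)
);
-- Repopulate it every night (requires SET GLOBAL event_scheduler = ON).
DELIMITER $$
CREATE EVENT refresh_sales_report
ON SCHEDULE EVERY 1 DAY
DO
BEGIN
    TRUNCATE TABLE sales_report;
    INSERT INTO sales_report (Category, Product, TotalSales)
    SELECT Category, Product, SUM(Sales)
    FROM source_table
    GROUP BY Category, Product;
END$$
DELIMITER ;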
Indexes are a tricky tool. If you're planning to add an index to a column of your table, you should consider at the very least:
1- How many different values do I have in this column?
2- What is the proportion between the total number of records and the number of different values?
3- How often do I apply WHERE, GROUP BY or ORDER BY clauses on this column?
As Kasey's answer states, from what we can see you could add an index on the Category column, but it will depend on the number of different values you have for that column.
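If this query is the main consumer of the table, one option worth testing is a covering index containing the same columns as the query, so MySQL can answer it from the index alone (a sketch; the index name is arbitrary):
-- Covers Category, Product and Sales, so the GROUP BY ... SUM(Sales)
-- can be satisfied without touching the table rows ("Using index" in EXPLAIN).
CREATE INDEX idx_cat_prod_sales ON source_table (Category, Product, Sales);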

Matching algorithm in SQL Server 2008

I have more than 3 million rows in my table. When the user tries to insert into or update this table, I have to check the following conditions sequentially (business need):
Does any of the row has same address?
Does any of the row has same postcode?
Does any of the row has same DOB?
Obviously the newly inserted or updated row will match a lot of the records in this table.
But the business need is that the matching process should end as soon as the first match (row) is found, and that row has to be returned.
I can easily achieve this using a simple SELECT query, but it's taking a very long time to find the match.
Please suggest an efficient way to do this.
If you're just looking for a way to return after the first match, use TOP (1) (SQL Server's equivalent of LIMIT 1).
You may want to maintain a table of either birth dates or postcodes and have each row link to a user, so that you can easily filter customers down to a smaller set. It would allow you to perform a much faster search on the database.
Example:
dob | userID
1/1/1980 | 235
1/1/1980 | 482
1/1/1980 | 123
2/1/1980 | 521
In that scenario, you only have to read 3 rows from the large users table if your target date is 1/1/1980. It's via a primary key index, too, so it'll be really fast.
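Putting the two ideas together, a rough sketch (T-SQL; table, column and parameter names are hypothetical, and @Address/@Postcode/@DOB stand for the values of the row being checked):
-- Check the conditions in the required order and stop at the first hit;
-- run the next query only if the previous one returned no row.
SELECT TOP (1) * FROM dbo.Customers WHERE Address  = @Address;
SELECT TOP (1) * FROM dbo.Customers WHERE Postcode = @Postcode;
SELECT TOP (1) * FROM dbo.Customers WHERE DOB      = @DOB;
-- Supporting indexes so each lookup is a seek rather than a scan.
CREATE INDEX IX_Customers_Address  ON dbo.Customers (Address);
CREATE INDEX IX_Customers_Postcode ON dbo.Customers (Postcode);
CREATE INDEX IX_Customers_DOB      ON dbo.Customers (DOB);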