Some background
We have a system that optionally integrates with several other systems. Our system shuffles data from a MySQL database to those other systems, and different customers want different data transferred. To avoid triggering unnecessary transfers (when no relevant data has changed), we have an "export" table that contains all the information any customer is interested in, plus a service that runs SQL queries defined in a file to compare the data in the export table with the data in the other tables and update the export table as appropriate. We're not really happy with this solution, for several reasons:
No customer uses more than a fraction of these columns, although each column is used by at least one customer.
As the database grows, the service is causing increasing amounts of strain on the system. Some servers completely freeze while this service compares data, which may take up to 2 minutes (!) even though no customer has particularly large amounts of data (~15000 rows across all relevant tables, max). We fear what might happen if we ever get a customer with very large amounts of data. Performance could be improved by creating some indexes and improving the SQL queries, but we feel like that's attacking the problem from the wrong direction.
It's not very flexible or scalable. Having to add new columns every time a customer wants to transfer data that no other customer has been interested in before (which happens a lot) just feels... icky. I don't know how much it really matters, but we're up to 37 columns in this table at the moment, and it keeps growing.
What we want to do
Instead, we would like to have a very slimmed down "export" table which only contains the bare minimum information, i.e. the table and primary key of the row that was updated, the system this row should be exported to, and some timestamps. A trigger in every relevant table would then update this export table whenever a column that has been configured to warrant an update is updated. This configuration should be read from another table (which, sometime in the future, could be configured from our web GUI), looking something like this:
+--------+--------+-----------+
| system | table  | column    |
+--------+--------+-----------+
| 'sys1' | 'tbl1' | 'column1' |
| 'sys2' | 'tbl1' | 'column2' |
+--------+--------+-----------+
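For reference, here is one possible shape for the slim export table and this configuration table, purely as a sketch (all names and types here are assumptions; note that table is a reserved word in MySQL, so the real column names would need backticks or, as here, more explicit names):

-- Slim export queue: which row changed, for which target system, and when
CREATE TABLE export_queue (
    target_system VARCHAR(32) NOT NULL,
    source_table  VARCHAR(64) NOT NULL,
    source_pk     INT         NOT NULL,
    updated_at    TIMESTAMP   NOT NULL,
    exported_at   TIMESTAMP   NULL,
    PRIMARY KEY (target_system, source_table, source_pk)
);

-- Which column changes are interesting to which system
CREATE TABLE export_config (
    target_system VARCHAR(32) NOT NULL,
    source_table  VARCHAR(64) NOT NULL,
    source_column VARCHAR(64) NOT NULL,
    PRIMARY KEY (target_system, source_table, source_column)
);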
Now, the trigger in tbl1 will read from this configuration table when a row is updated. The configuration above should mean: if column1 in tbl1 has changed, an export row for sys1 should be updated; if column2 has also changed, an export row for sys2 should be updated as well; and so on.
So far, it all seems doable, although a bit tricky when you're not an SQL genius. However, we would also like to be able to define slightly more complex conditions, at least something like "column3 = 'Apple' OR column3 = 'Banana'", and that is really the heart of the question...
So, to sum it up:
What would be the best way to allow for triggers to be configured in this way?
Are we crazy? Are triggers the right way to go here, or should we just stick to our service, smack on some indexes and suck it up? Or is there a third alternative?
How much of a performance increase could we expect to see? (Is this all worth it?)
This turned out to be impossible as written, because MySQL does not support dynamic SQL (prepared statements) inside triggers. Instead we came up with reading the config table from PHP and generating "static" triggers. We'll try having two tables, one for columns and one for conditions, like so:
Columns
+--------+--------+-----------+
| system | table  | column    |
+--------+--------+-----------+
| 'sys1' | 'tbl1' | 'column1' |
| 'sys2' | 'tbl1' | 'column2' |
+--------+--------+-----------+
Conditions
+--------+--------+-------------------------------------------+
| system | table | condition |
+--------+--------+-------------------------------------------+
| 'sys1' | 'tbl1' | 'column3 = "Apple" OR column3 = "Banana"' |
+--------+--------+-------------------------------------------+
Then just build a statement like this in PHP (pseudo-code):
DROP TRIGGER IF EXISTS `tbl1_AUPD`;
CREATE TRIGGER `tbl1_AUPD` AFTER UPDATE ON tbl1 FOR EACH ROW
BEGIN
    IF (*sys1 columns changed*) AND (*sys1 condition1*) THEN
        CALL updateExportTable('sys1', 'tbl1', NEW.primary_key, NEW.timestamp);
    END IF;
    IF (*sys2 columns changed*) THEN
        CALL updateExportTable('sys2', 'tbl1', NEW.primary_key, NEW.timestamp);
    END IF;
END;
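For illustration, the trigger generated from the example configuration above might come out looking roughly like this (the updateExportTable procedure and the primary_key/timestamp column names are placeholders carried over from the pseudo-code; the NULL-safe <=> comparison is used so changes to and from NULL are also caught):

DELIMITER $$
DROP TRIGGER IF EXISTS `tbl1_AUPD`$$
CREATE TRIGGER `tbl1_AUPD` AFTER UPDATE ON tbl1 FOR EACH ROW
BEGIN
    -- sys1: column1 changed AND the configured condition holds
    IF NOT (NEW.column1 <=> OLD.column1)
       AND (NEW.column3 = 'Apple' OR NEW.column3 = 'Banana') THEN
        CALL updateExportTable('sys1', 'tbl1', NEW.primary_key, NEW.timestamp);
    END IF;
    -- sys2: column2 changed (no extra condition configured for sys2)
    IF NOT (NEW.column2 <=> OLD.column2) THEN
        CALL updateExportTable('sys2', 'tbl1', NEW.primary_key, NEW.timestamp);
    END IF;
END$$
DELIMITER ;

When the statements are generated and sent from PHP through the MySQL API, the DELIMITER lines are not needed; they only matter when pasting this into the mysql command-line client.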
This seems to be the best solution for us, maybe even better than what I was asking for, but if anyone has a better suggestion I'm all ears!
Related
I have one table that looks like this:
+-------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| name | varchar(255) | NO | PRI | NULL | |
| timestamp1 | int | NO | | NULL | |
| timestamp2 | int | NO | | NULL | |
+-------------+--------------+------+-----+---------+-------+
This table has around 250 million rows in it. Once a day I get a CSV that contains around 225 million rows of just one name column. 99% of the names in the daily CSV are already in the database. For the names that are already there, I want to update their timestamp1 column to UNIX_TIMESTAMP(NOW()); the names that are in the CSV but not yet in the original table should then be added to it. Right now this is how I am doing this:
DROP TEMPORARY TABLE IF EXISTS tmp_import;
CREATE TEMPORARY TABLE tmp_import (name VARCHAR(255), PRIMARY KEY (name));
LOAD DATA LOCAL INFILE 'path.csv' INTO TABLE tmp_import LINES TERMINATED BY '\n';
UPDATE og_table og SET timestamp1 = UNIX_TIMESTAMP(NOW()) WHERE og.name IN (SELECT tmp.name FROM tmp_import tmp);
DELETE FROM tmp_import WHERE name in (SELECT og.name FROM og_table og);
INSERT INTO og_table SELECT name, UNIX_TIMESTAMP(NOW()) AS timestamp1, UNIX_TIMESTAMP(NOW()) AS timestamp2 FROM tmp_import;
As you might guess, the UPDATE line is what takes a long time: over 6 hours, or it throws an error. Reading the data in takes upwards of 40 minutes. I know this is mostly because it is building the index on name; when I don't set name as a primary key, it only takes 9 minutes to read the data in, but I thought having the index would speed up the UPDATE. I have tried the update several different ways: what I have above, and also the following:
UPDATE og_table og SET timestamp1 = UNIX_TIMESTAMP(NOW()) WHERE EXISTS (SELECT tmp.name FROM tmp_import tmp where tmp.name = og.name);
UPDATE og_table og inner join tmp_import tmp on og.name=tmp.name SET og.timestamp1 = UNIX_TIMESTAMP(NOW());
Neither of those attempts worked. They normally run for several hours and then end up with:
ERROR 1206 (HY000): The total number of locks exceeds the lock table size
I am using InnoDB for these tables, but I have no need for foreign keys or other InnoDB-specific features, so I would be open to trying a different storage engine.
I have been looking through a lot of posts and have yet to find something to help in my situation. If I missed a post I apologize.
If the name values are rather long, you might greatly improve performance by using a hash function such as MD5 or SHA-1 and storing & indexing only the hash. You probably don't even need all 128 or 160 bits; an 80-bit portion should be good enough, with a very low chance of a collision. See this.
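A minimal sketch of that idea, assuming MySQL 5.7+ (for the generated column) and an 80-bit (10-byte) slice of the MD5; the name_hash column and index name are made up for the example:

-- Store and index only the first 10 bytes of MD5(name)
ALTER TABLE og_table
    ADD COLUMN name_hash BINARY(10) AS (UNHEX(LEFT(MD5(name), 20))) STORED,
    ADD INDEX idx_name_hash (name_hash);

-- Lookups go through the short fixed-width index;
-- re-check the full name to rule out hash collisions
SELECT name
FROM og_table
WHERE name_hash = UNHEX(LEFT(MD5('some name'), 20))
  AND name = 'some name';

On older MySQL versions the hash column would have to be populated by the application instead, and note that adding the column to a 250-million-row table is itself a slow one-off operation.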
Another thing you might want to check is whether you have enough RAM. How big is your table and how much RAM do you have? Also, it's not just about how much RAM the machine has, but how much of it is available to MySQL/InnoDB's buffer pool.
What disk are you using? If you are using a spinning disk (HDD), that might be a huge bottleneck if InnoDB needs to constantly make scattered reads.
There are many other things that might help, but I would need more details. For example, if the names in the CSV are not sorted and your buffer pool is only about 10-20% of the table size, you might get a huge performance boost by splitting the work into batches so that the names in each batch are close together (for example, first process all names that start with 'A', then those starting with 'B', etc.). Why would that help? In a big index (InnoDB tables are also implemented as indexes) that doesn't fit into the buffer pool, if you make millions of reads scattered all over the index, the DB will need to constantly read from disk. But if you work on a smaller area, the data blocks (pages) will only be read once and then stay in RAM for subsequent reads until you've finished with that area. That alone might easily improve performance by 1 or 2 orders of magnitude, depending on your case.
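A rough sketch of that batching idea, splitting on the first character of name (the exact batch boundaries are an assumption and would be tuned to your data and buffer pool size):

-- Process one alphabetical slice at a time so the index pages touched stay in the buffer pool
UPDATE og_table AS og
INNER JOIN tmp_import AS tmp ON og.name = tmp.name
SET og.timestamp1 = UNIX_TIMESTAMP(NOW())
WHERE og.name >= 'A' AND og.name < 'B';

-- ...then repeat for 'B' to 'C', 'C' to 'D', and so on, committing between batches
-- so the number of row locks held at any one time also stays small.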
A big update (as Barmar points out) takes a long time. Let's avoid it by building a new table, then swapping it into place.
First, let me get clarification and provide a minimal example.
You won't be deleting any rows, correct? Just adding or updating rows?
You have (in og_table):
A 88 123
B 99 234
The daily load (tmp_import) says
B
C
You want
A 88 123
B NOW() 234
C NOW() NULL
Is that correct? Now for the code:
Load the nightly data and build the merge table:
LOAD DATA ... (name) -- into TEMPORARY tmp_import
CREATE TABLE merge LIKE og_table; -- not TEMPORARY
Populate a new table with the data merged together
INSERT INTO merge
    -- B and C (from the example): every name in the csv, keeping the old timestamp2 where it exists
    ( SELECT ti.name, UNIX_TIMESTAMP(NOW()), og.timestamp2
        FROM tmp_import AS ti
        LEFT JOIN og_table AS og USING(name)
    ) UNION ALL
    -- A: rows already in og_table that are missing from the csv, carried over unchanged
    ( SELECT og.name, og.timestamp1, og.timestamp2
        FROM og_table AS og
        LEFT JOIN tmp_import AS ti USING(name)
        WHERE ti.name IS NULL    -- (that is, missing from the csv)
    );
Swap it into place
RENAME TABLE og_table TO x,
merge TO og_table;
DROP TABLE x;
Bonus: og_table is "down" only very briefly (during the RENAME).
A possible speed-up: Sort the CSV file by name before loading it. (If that takes an extra step, the cost of that step may be worse than the cost of not having the data sorted. There is not enough information to predict.)
So I understand, and have found posts indicating, that it is not recommended to omit the ORDER BY clause in an SQL query when you are retrieving data from the DBMS.
Resources & posts consulted (will be updated):
SQL Server UNION - What is the default ORDER BY Behaviour
When no 'Order by' is specified, what order does a query choose for your record set?
https://dba.stackexchange.com/questions/6051/what-is-the-default-order-of-records-for-a-select-statement-in-mysql
Questions:
See the logic of the question below if you want to know more.
My question is: under MySQL with the InnoDB engine, does anyone know how the DBMS actually decides the order of the results it gives us?
I have read that it is implementation-dependent, OK, but is there a way to know it for my current implementation?
Where exactly is this defined?
Does it come from MySQL, from InnoDB, or is it OS-dependent?
Isn't there some kind of list out there?
Most importantly, if I omit the ORDER BY clause and get my result, I can't be sure that this code will still work with newer database versions and that the DBMS will always give me the same result, can I?
Use case & logic:
I'm currently writing a CRUD API, and I have a table in my DB that doesn't contain an "id" field (it does have a PK, though). When I show the results of that table without any search criteria, I don't really have a clue what I should use to order the results. I could use the PK or any field that is never null, but that wouldn't make the order meaningful. Since my CRUD is supposed to work for any table, and I don't want to solve this by adding an exception for this specific table, I was wondering whether I could simply omit the ORDER BY clause.
Final note:
As I read other posts, examples, and code samples, I get the feeling I may be going too far. I understand that it is common knowledge that omitting the ORDER BY clause is simply bad practice, that there is no reliable default order, and indeed that there is no order at all unless you specify one.
I'd just love to know where this is defined and how it works internally, or at least what it depends on (DBMS / storage engine / OS / other / multiple criteria). I think it would also benefit other people to understand the inner mechanisms at play here.
Thanks for taking the time to read! Have a nice day.
Without an explicit ORDER BY, current versions of InnoDB return rows in the order of whichever index they read from. Which index that is varies, but InnoDB always reads from some index; even reading from the "table" is really reading an index, namely the primary key index.
As noted in the comments above, there's no guarantee this will remain the same in the next version of InnoDB. You should treat it as coincidental behavior: it is not documented, and the makers of MySQL don't promise not to change it.
Even if the implementation doesn't change, reading in index order can cause strange effects that you might not expect, and which won't give you result sets that make sense to you.
For example, the default index is the clustered index, PRIMARY. That means index order is the same as the order of the primary key values (not the order in which you inserted the rows).
mysql> create table mytable ( id int primary key, name varchar(20));
mysql> insert into mytable values (3, 'Hermione'), (2, 'Ron'), (1, 'Harry');
mysql> select * from mytable;
+----+----------+
| id | name |
+----+----------+
| 1 | Harry |
| 2 | Ron |
| 3 | Hermione |
+----+----------+
But if your query uses another index to read the table, like if you only access column(s) of a secondary index, you'll get rows in that order:
mysql> alter table mytable add key (name);
mysql> select name from mytable;
+----------+
| name |
+----------+
| Harry |
| Hermione |
| Ron |
+----------+
This shows it's reading the table by using an index-scan of that secondary index on name:
mysql> explain select name from mytable;
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | mytable | index | NULL | name | 83 | NULL | 3 | Using index |
+----+-------------+---------+-------+---------------+------+---------+------+------+-------------+
In a more complex query, it can become very tricky to predict which index InnoDB will use for a given query. The choice can even change from day to day, as your data changes.
All this goes to show: You should just use ORDER BY if you care about the order of your query result set!
Bill's answer is good. But not complete.
If the query is a UNION, it will (I think) deliver first the results of the first SELECT (according to the rules), then the results of the second. Also, if the table is PARTITIONed, it is likely to do a similar thing.
GROUP BY may sort by the grouping expressions, thereby leading to a predictable order, or it may use a hashing technique, which scrambles the rows. I don't know how to predict which.
A derived table used to be delivered as an ordered list whose order propagated into the parent query. But in recent versions the ORDER BY is simply thrown away in that subquery (unless there is a LIMIT).
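For example, an ORDER BY that lives only inside a derived table, as in the first query below, may be optimized away, so the outer query needs its own ORDER BY (reusing the mytable example from above):

-- The inner ORDER BY is not guaranteed to survive into the outer result
SELECT id, name FROM (SELECT id, name FROM mytable ORDER BY name) AS dt;

-- If the order matters, say so at the outermost level
SELECT id, name FROM (SELECT id, name FROM mytable) AS dt ORDER BY name;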
Bottom Line: If you care about the order, add an ORDER BY, even if it seems unnecessary based on this Q & A.
MyISAM, in contrast, starts with this premise: The default order is the order in the .MYD file. But DELETEs leave gaps, UPDATEs mess with the gaps, and INSERTs prefer to fill in gaps over appending to the file. So, the row order is rather unpredictable. ALTER TABLE x ORDER BY y temporarily sets the .MYD order; this 'feature' does not work for InnoDB.
I have two projects using the same data. However, this data is saved in two different databases. Each of these databases has a table that is almost the same as its counterpart in the other database.
What I am looking for
I am looking for a method to synchronise two tables. Put simply: if database_one.table gets an insert, that same record needs to be inserted into database_two.table.
Database and Table One
Table Products
| product_id | name | description | price | vat | flags |
Database and Table Two
Table Articles
| articleId | name_short | name | price | price_vat | extra_info | flags |
The issue
I have never used and wouldn't know how to use any method of database synchronisation. What also worries me is that the tables are not identical and so I will somehow need to map columns to one another.
For example:
database_one.Products.name -> database_two.articles.name_short
Can someone help me with this?
You can use the MERGE statement:
https://www.mssqltips.com/sqlservertip/1704/using-merge-in-sql-server-to-insert-update-and-delete-at-the-same-time/
Then create a procedure that runs at the desired frequency, or, if it needs to be instant, put the MERGE inside a trigger.
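Note that the MERGE in that link is SQL Server syntax; MySQL itself has no MERGE statement, and the closest built-in equivalent is INSERT ... SELECT ... ON DUPLICATE KEY UPDATE. A rough sketch using the column mapping from the question (the price_vat calculation and the assumption that articleId holds the same value as product_id are guesses):

INSERT INTO database_two.articles (articleId, name_short, name, price, price_vat)
SELECT p.product_id, p.name, p.name, p.price, p.price * (1 + p.vat)
FROM database_one.Products AS p
ON DUPLICATE KEY UPDATE
    name_short = VALUES(name_short),
    name       = VALUES(name),
    price      = VALUES(price),
    price_vat  = VALUES(price_vat);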
One possible method is to use triggers. You need to create a trigger for insert, update, and delete on database_one.table that performs the corresponding operation on database_two.table. I don't expect any problems doing insert/update/delete across the two databases, and with triggers you can map columns very easily.
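A minimal sketch of such an insert trigger, assuming both schemas live on the same MySQL server and using the mapping from the question (the trigger name and the exact set of mapped columns are assumptions; update and delete triggers would follow the same pattern):

DELIMITER $$
CREATE TRIGGER database_one.products_ai
AFTER INSERT ON database_one.Products
FOR EACH ROW
BEGIN
    -- Map Products columns onto the Articles layout
    INSERT INTO database_two.articles (articleId, name_short, name, price, flags)
    VALUES (NEW.product_id, NEW.name, NEW.name, NEW.price, NEW.flags);
END$$
DELIMITER ;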
However, you need to consider the pros and cons of using triggers - read something here or here. In my experience performance is very important, so if you have a heavily loaded DB it is not a good idea to use triggers for data replication.
Maybe you should check this too?
Hi everyone, my question is this: I have a file of roughly 3000 rows that I read in with LOAD DATA LOCAL INFILE. After the load, a trigger on the table that was inserted into copies three columns from the loaded table and two columns from a table that already exists in the database (the structures are below, if this is unclear). From there, only combinations that have unique glNumbers are entered into the processed table. This normally takes over a minute and a half, which I find pretty long. Is this normal for what I'm doing (I can't believe that's true), or is there a way to optimize the queries so it goes faster?
The tables that are inserted into are named after the first three letters of each month. Here is the default structure.
RawData Structure
| idjan | glNumber | journel | invoiceNumber | date | JT | debit | credit | descriptionDetail | totalDebit | totalCredit |
(Sorry for the poor formatting; there doesn't seem to be a really good way to present this.)
After Insert Trigger Query
delete from processedjan;
insert into processedjan(glNumber,debit,credit,bucket1,bucket2)
select a.glNumber, a.totalDebit, a.totalCredit, b.bucket1, b.bucket2
from jan a inner join bucketinformation b on a.glNumber = b.glNumber
group by a.glNumber;
Processed Datatable Structure
| glNumber | bucket1| bucket2| credit | debit |
Also, I guess it helps to know that bucket1 and bucket2 come from another table where they are matched against the glNumber. That table has roughly 800 rows, with three columns: the glNumber and the two buckets.
While PostgreSQL has statement-level triggers, MySQL only has row-level triggers. From the MySQL reference:
A trigger is defined to activate when a statement inserts, updates, or
deletes rows in the associated table. These row operations are trigger
events. For example, rows can be inserted by INSERT or LOAD DATA
statements, and an insert trigger activates for each inserted row.
So while you are managing to load 3000 rows in one operation, unfortunately 3000 more queries are executed by the triggers. And given the complex nature of your trigger, you might actually be performing 2-3 queries per row. That's the real reason for the slowdown.
You can speed things up by disabling the trigger and carrying out an INSERT ... SELECT once after the LOAD DATA INFILE. You can automate this with a small script.
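A sketch of that approach for the January tables from the question (the trigger name and the LOAD DATA options are assumptions):

-- 1. Drop (or otherwise disable) the per-row trigger
DROP TRIGGER IF EXISTS jan_AINS;

-- 2. Bulk load the raw data without any per-row trigger overhead
LOAD DATA LOCAL INFILE 'path.csv' INTO TABLE jan
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- 3. Rebuild the processed table once, instead of once per inserted row
DELETE FROM processedjan;
INSERT INTO processedjan (glNumber, debit, credit, bucket1, bucket2)
SELECT a.glNumber, a.totalDebit, a.totalCredit, b.bucket1, b.bucket2
FROM jan AS a
INNER JOIN bucketinformation AS b ON a.glNumber = b.glNumber
GROUP BY a.glNumber;   -- as in the original trigger; ONLY_FULL_GROUP_BY may require ANY_VALUE() around the non-grouped columns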
I have more than 3 million rows in my table. When a user tries to insert into or update this table, I have to check the following conditions sequentially (a business requirement):
Does any row have the same address?
Does any row have the same postcode?
Does any row have the same DOB?
Obviously the newly inserted or updated row will match a lot of the records in this table.
But the business requirement is that the matching process should stop at the first match (row) found, and that row has to be returned.
I can easily achieve this using a simple SELECT query, but it's taking a very long time to find the match.
Please suggest an efficient way to do this.
If you're just looking for a way to return after the first match, use LIMIT 1.
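A hedged sketch of that, assuming the table is called customers, the columns are address, postcode, and dob, and each of those columns has its own index (all of these names are assumptions). Each branch stops at its first hit, and the final LIMIT 1 keeps only the highest-priority match:

(SELECT c.*, 1 AS matched_on FROM customers AS c WHERE c.address  = ? LIMIT 1)
UNION ALL
(SELECT c.*, 2 FROM customers AS c WHERE c.postcode = ? LIMIT 1)
UNION ALL
(SELECT c.*, 3 FROM customers AS c WHERE c.dob      = ? LIMIT 1)
ORDER BY matched_on
LIMIT 1;

Alternatively, the application can simply run the three single-condition SELECT ... LIMIT 1 queries one after another and stop as soon as one returns a row.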
You may want to maintain a table of either birth dates or postcodes and have each row link to a user, so that you can easily filter customers down to a smaller set. It would allow you to perform a much faster search on the database.
Example:
dob | userID
1/1/1980 | 235
1/1/1980 | 482
1/1/1980 | 123
2/1/1980 | 521
In that scenario, you only have to read 3 rows from the large users table if your target date is 1/1/1980. It's via a primary key index, too, so it'll be really fast.
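A sketch of that lookup-table idea with assumed names (dob_lookup, users, and an id primary key on users are all assumptions):

CREATE TABLE dob_lookup (
    dob     DATE NOT NULL,
    user_id INT  NOT NULL,
    PRIMARY KEY (dob, user_id)
);

-- The dob prefix of the primary key narrows the scan to a handful of rows,
-- and each matching user row is then fetched via the users primary key
SELECT u.*
FROM dob_lookup AS d
JOIN users AS u ON u.id = d.user_id
WHERE d.dob = '1980-01-01'
LIMIT 1;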