Matching algorithm in SQL Server 2008 - sql-server-2008

I have more than 3 million rows in my table. When the user tries to insert or update a row in this table, I have to check the following conditions sequentially (business need):
Does any existing row have the same address?
Does any existing row have the same postcode?
Does any existing row have the same DOB?
Obviously the newly inserted or updated row will match a lot of the records in this table.
But the business requirement is that the matching process should stop as soon as the first match (row) is found, and that row has to be returned.
I can easily achieve this using a simple SELECT query, but it takes a very long time to find the match.
Please suggest an efficient way to do this.

If you're just looking for a way to return after the first match, use TOP 1 (the SQL Server equivalent of LIMIT 1).
You may want to maintain a table of either birth dates or postcodes and have each row link to a user, so that you can easily filter customers down to a smaller set. It would allow you to perform a much faster search on the database.
Example:
dob | userID
1/1/1980 | 235
1/1/1980 | 482
1/1/1980 | 123
2/1/1980 | 521
In that scenario, you only have to read 3 rows from the large users table if your target date is 1/1/1980. It's via a primary key index, too, so it'll be really fast.
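A minimal sketch of the sequential check itself in T-SQL, assuming a Customers table with Address, Postcode and DOB columns (hypothetical names; adjust to the real schema). TOP 1 stops each probe at the first hit, and the supporting indexes turn each probe into an index seek instead of a scan of the 3 million rows:
-- hypothetical table, column and parameter names
SELECT TOP 1 * FROM Customers WHERE Address = @Address;
-- if the first check returns nothing, fall through to the next one
SELECT TOP 1 * FROM Customers WHERE Postcode = @Postcode;
-- and finally
SELECT TOP 1 * FROM Customers WHERE DOB = @DOB;
-- supporting indexes so each check is a seek
CREATE INDEX IX_Customers_Address ON Customers (Address);
CREATE INDEX IX_Customers_Postcode ON Customers (Postcode);
CREATE INDEX IX_Customers_DOB ON Customers (DOB);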

Related

Mysql Update one column of multiple rows in one query

I've looked over all of the related questions I could find, but couldn't get one that answers mine.
I have a table like this:
id | name | age | active | ...... | ... |
where "id" is the primary key, and the ... means there are roughly 30 more columns.
The "active" column is of type tinyint.
My task:
Update ids 1, 4, 12, 55, 111 (just an example; it can be 1000 different ids in total) with active = 1 in a single query.
I did:
UPDATE table SET active = 1 WHERE id IN (1,4,12,55,111)
It's inside a transaction, because I'm updating something else in this process.
The engine is InnoDB.
My problem:
Someone told me that such a query is equivalent to 5 queries at execution time, because the IN will be translated into the corresponding number of OR conditions, which run one after another.
So instead of 1 query I effectively get N, where N is the number of values in the IN.
He suggests creating a temp table, inserting all the new values into it, and then updating by a join.
Is he right, both about the equivalence and about the performance?
What do you suggest? I thought INSERT INTO .. ON DUPLICATE KEY UPDATE would help, but I don't have all the data for the row, only its id, and I want to set active = 1 on it.
Maybe this query is better?
UPDATE table SET
  active = CASE
    WHEN id = '1'   THEN '1'
    WHEN id = '4'   THEN '1'
    WHEN id = '12'  THEN '1'
    WHEN id = '55'  THEN '1'
    WHEN id = '111' THEN '1'
    ELSE active
  END
WHERE campaign_id > 0; -- otherwise it throws an error about updating without a WHERE clause in safe mode, and I don't know if I can toggle safe mode off.
Thanks.
It's the other way around. OR can sometimes be turned into IN. IN is then efficiently executed, especially if there is an index on the column. If you have 1000 entries in the IN, it will do 1000 probes into the table based on id.
If you are running a new enough version of MySQL, I think you can do EXPLAIN EXTENDED UPDATE ... OR ...; SHOW WARNINGS; to see this conversion.
The UPDATE ... CASE ... will probably tediously check each and every row.
It would probably be better for other users of the system if you broke the UPDATE up into multiple UPDATEs, each covering 100-1000 rows. More on chunking.
Where did you get the ids in the first place? If it was via a SELECT, then perhaps it would be practical to combine it with the UPDATE to make it one step instead of two.
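For reference, a minimal sketch of the temp-table-plus-join approach mentioned in the question (the table name my_table and the temp table name are hypothetical placeholders):
CREATE TEMPORARY TABLE ids_to_activate (id INT PRIMARY KEY);
-- populate it with the target ids, e.g. from the SELECT that produced them
INSERT INTO ids_to_activate (id) VALUES (1), (4), (12), (55), (111);
-- one multi-table UPDATE driven by the primary key join
UPDATE my_table t
JOIN ids_to_activate i ON i.id = t.id
SET t.active = 1;
DROP TEMPORARY TABLE ids_to_activate;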
I think the query below is better because it uses the primary key.
UPDATE table SET active = 1 WHERE id <= 5

Providing unique results for multiple clients simultaneously

I have a single table with a few million rows. Hundreds of clients access this table simultaneously; each one needs to get 20 unique rows, which then need to be placed last in line.
My setup is:
Table structure:
id | last_access | reserved_id | [Data columns]
id + last_access is indexed
For selecting 20 unique rows I use the following:
UPDATE "table" SET "reserved" = 'client-id_timestamp' WHERE "reserved" = '' ORDER BY "last_access" ASC LIMIT 20
This update query is quite bad performance-wise, which is why I ask:
Is there a better solution for my specific requirements? Another table structure perhaps?
Is last_access a date column? Try expressing it as an integer value (i.e. seconds since 1970-01-01); it might be faster to sort.
A second performance issue might come from the need to update the index after you change the "reserved" field. Performance might improve if you remove the index from that column: though the search will take longer, the more expensive index maintenance is taken out of the equation.
If you are using MySQL 5.6.3 or newer, you can execute EXPLAIN with your query to find out what part of it takes the longest.
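For example, a minimal sketch, assuming MySQL 5.6.3+ and a hypothetical table name work_queue standing in for the question's quoted "table" (the client id and timestamp in the reserved value are placeholders):
EXPLAIN
UPDATE work_queue
SET reserved = 'client-42_1400000000'
WHERE reserved = ''
ORDER BY last_access ASC
LIMIT 20;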

Optimize Mysql Query (rawdata to processed table)

Hi everyone, my question is this: I have a file that loads roughly 3000 rows of data via the LOAD DATA LOCAL INFILE command. There is a trigger on the table being inserted into that copies three columns from the loaded rows and two columns from a table that already exists in the database (the structures are shown below if this is unclear). From there, only combinations that have unique glNumbers are entered into the processed table. This normally takes over a minute and a half, which I find pretty long. Is this normal for what I'm doing (I can't believe that's true), or is there a way to optimize the queries so it goes faster?
The tables that are inserted into are named after the first three letters of each month. Here is the default structure.
RawData Structure
| idjan | glNumber | journel | invoiceNumber | date | JT | debit | credit | descriptionDetail | totalDebit | totalCredit |
(Sorry for the poor formatting; there isn't really a good way to do this, it seems.)
After Insert Trigger Query
DELETE FROM processedjan;
INSERT INTO processedjan (glNumber, debit, credit, bucket1, bucket2)
SELECT a.glNumber, a.totalDebit, a.totalCredit, b.bucket1, b.bucket2
FROM jan a
INNER JOIN bucketinformation b ON a.glNumber = b.glNumber
GROUP BY a.glNumber;
Processed Datatable Structure
| glNumber | bucket1| bucket2| credit | debit |
Also, I guess it helps to know that bucket1 and bucket2 come from another table where they are matched against the glNumber. That table has roughly 800 rows, with three columns for the glNumber and the two buckets.
While PostgreSQL has statement-level triggers, MySQL only has row-level triggers. From the MySQL reference:
A trigger is defined to activate when a statement inserts, updates, or
deletes rows in the associated table. These row operations are trigger
events. For example, rows can be inserted by INSERT or LOAD DATA
statements, and an insert trigger activates for each inserted row.
So while you are managing to load 3000 rows in one operation, unfortunately 3000 more queries are executed by the triggers. And given the complex nature of your trigger, you may actually be performing 2-3 queries per row. That's the real reason for the slowdown.
You can speed things up by disabling the trigger and carrying out a single INSERT ... SELECT after the LOAD DATA INFILE. You can automate it with a small script.
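A minimal sketch of that approach, reusing the statements from the question (the trigger name and the file path are hypothetical placeholders):
-- remove (or disable) the per-row trigger before loading
DROP TRIGGER IF EXISTS jan_after_insert;
LOAD DATA LOCAL INFILE '/path/to/jan_rawdata.csv'
INTO TABLE jan
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
-- rebuild the processed table once, instead of once per loaded row
DELETE FROM processedjan;
INSERT INTO processedjan (glNumber, debit, credit, bucket1, bucket2)
SELECT a.glNumber, a.totalDebit, a.totalCredit, b.bucket1, b.bucket2
FROM jan a
INNER JOIN bucketinformation b ON a.glNumber = b.glNumber
GROUP BY a.glNumber;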

Avoid duplicate rows, without reference to keys or indexes?

I have a MySQL table in which each row is a TV episode. It looks like this:
showTitle   | season | episode | episodeTitle | airdate    | absoluteEpisode
------------|--------|---------|--------------|------------|----------------
The X-Files | 5      | 12      | Bad Blood    | 1998-02-22 | 109
The X-Files | 5      | 13      | Patient X    | 1998-03-01 | 110
(Where absoluteEpisode is the episode's overall number counting from episode 1.)
It is populated using a Ruby program I wrote which fetches the data from a web service. Periodically, I'd like to run the program again to fetch new episodes. The question then becomes, how do I avoid adding duplicates of the already-existing rows? None of the columns in this table are suitable for use as a primary key or unique field.
I had two ideas. The first was to create a new column, md5, with an MD5 hash of all of those values, and make that a unique column, to prevent two rows with identical data from being added. That seems like it would work, but be messy.
My second was to use this solution from StackOverflow. But I can't quite get that to work. My SQL query is
INSERT INTO `tv`.`episodes` (showTitle,episodeTitle,season,episode,date,absoluteEpisode)
SELECT '#{show}','#{title}','#{y['airdate']}' FROM `tv`.`episodes`
WHERE NOT EXISTS (SELECT * from `tv`.`episodes`
WHERE showTitle='#{show}' AND episodeTitle='#{title}' AND season='#{season_string}' AND episode='#{y['seasonnum']}' AND date='#{y['airdate']}' AND absoluteEpisode='#{y['epnum']}'")
The #{...} bits are Ruby variables. This gets me the obvious error: "You have an error in your SQL syntax".
Flipping through the books and documentation I can find on the subject, I'm still not sure how to properly execute this query, or if it's not a smart way of solving my problem. I'd appreciate any advice!
Why not create a composite primary key from showTitle, season, and episode? This will solve the problem because the episode number cannot be duplicated within the same season, and the same applies within the same TV show.
Example:
The X-Files ==> season 1 ==> episode 1: this combination acts as the primary key, as one unit.
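A minimal sketch of that idea in MySQL, using the column names from the question. The INSERT IGNORE example is an illustrative addition; the table listing shows airdate while the question's insert uses date, so adjust to the real column name, and showTitle would need a bounded VARCHAR length (or a key prefix) to be part of the key:
ALTER TABLE `tv`.`episodes`
  ADD PRIMARY KEY (showTitle, season, episode);
-- with the composite key in place, re-running the fetch can simply skip duplicates
INSERT IGNORE INTO `tv`.`episodes`
  (showTitle, season, episode, episodeTitle, airdate, absoluteEpisode)
VALUES ('The X-Files', 5, 12, 'Bad Blood', '1998-02-22', 109);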

Use of Index to Improve Speed of Aggregate functions in select query

I need to create a new table with sum aggregates of the measure columns in the source table.
The source table is very large.
eg. Source Table
Category | Product | Sales
A | P1 | 100
B | P2 | 200
C | P3 | 300
The query is like this:
SELECT Category,
       Product,
       SUM(Sales)
FROM source_table
GROUP BY Category, Product;
There is no WHERE condition.
Will indexing help speed up the process?
Is there an alternative mechanism for speeding up the query?
It might help to add an index on Category since it is in the GROUP BY clause. But you're reading the whole table, so it might just be slow.
Probably a better strategy is to create a new table for the sales report and populate it based on your business needs. If it only needs to be updated daily, schedule a stored procedure to run nightly to repopulate it. If it needs to reflect the current state of the base table, you can use triggers to update the report table as the base table is updated, or run a separate query at the application level to update the report table whenever your base table changes.
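A minimal sketch of the nightly-repopulation idea, using the question's table and column names (the report table name and the column types are assumptions):
CREATE TABLE sales_summary (
  Category   VARCHAR(50),
  Product    VARCHAR(50),
  TotalSales DECIMAL(18,2)
);
-- run on a schedule (e.g. nightly) to rebuild the report
TRUNCATE TABLE sales_summary;
INSERT INTO sales_summary (Category, Product, TotalSales)
SELECT Category, Product, SUM(Sales)
FROM source_table
GROUP BY Category, Product;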
Indexes are a tricky tool. If you're planning to add an index to a column of your table, you should consider at the very least:
1. How many different values there are in this column.
2. The proportion between the total number of records and the number of distinct values.
3. How often you apply WHERE, GROUP BY, or ORDER BY clauses to this column.
As @Kasey's answer states, you could add an index on the Category column, but whether it helps will depend on the number of distinct values in that column.
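If the distribution looks favourable, a covering index is one option worth testing. A minimal sketch using the question's names (the index name is a hypothetical placeholder):
-- covers Category, Product and Sales, so the GROUP BY / SUM can be
-- answered from the index without touching the base rows
CREATE INDEX ix_source_category_product_sales
    ON source_table (Category, Product, Sales);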