SSIS 2008 Fuzzy grouping to identify duplicate contacts but ignoring punctuation

I'm using SSIS in Visual Studio 2008 to perform fuzzy grouping on a customer table.
The columns are ID, Name, Email, etc.
Some duplicate customers in the table have the same email address, and I'm currently able to use Fuzzy Grouping to identify those duplicates for manual checking.
I also have some records which are almost duplicates but contain extra punctuation, e.g.:
ID  Name  Email
1   bob   bob.bob#bob.com
2   bob   bob.bob#bob.com
3   bob   bob..bob#bob.com
7   tom   tom#tom.com
9   frog  tom#tom..com
Currently I can get IDs 1 and 2 to match, but I want 1, 2 and 3 to match and be grouped on the same key, and 7 and 9 to match as well, because I want to treat a double full stop as a single full stop. The name does not matter; only the Email column is currently important.
Any suggestions or help would be appreciated.

Use a derived column transformation before your fuzzy grouping transformation to remove unwanted characters:
REPLACE([Email], "..", ".")
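One thing to watch with a single REPLACE of ".." by ".": a run of three or more full stops still leaves a double behind. The sketch below is not SSIS; it is a small Python illustration of the normalization step only, with an exact match on the cleaned email standing in for the fuzzy grouping. The function names are made up for the example.

```python
import re
from collections import defaultdict

# Sample rows taken from the question: (ID, Name, Email).
rows = [
    (1, "bob", "bob.bob#bob.com"),
    (2, "bob", "bob.bob#bob.com"),
    (3, "bob", "bob..bob#bob.com"),
    (7, "tom", "tom#tom.com"),
    (9, "frog", "tom#tom..com"),
]

def normalize_email(email):
    """Collapse any run of two or more full stops into a single one."""
    return re.sub(r"\.{2,}", ".", email)

def group_by_email(rows):
    """Group row IDs by cleaned email; the name is deliberately ignored."""
    groups = defaultdict(list)
    for row_id, _name, email in rows:
        groups[normalize_email(email)].append(row_id)
    return dict(groups)

groups = group_by_email(rows)
for email, ids in groups.items():
    print(email, ids)
```

In the package itself, the same effect would come from cleaning the column in the derived column step before it reaches the fuzzy grouping, possibly by nesting the replacement if longer runs of full stops can occur.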

Related

How to sum only one of repeated values from joined data in RDLC

I'm not sure if SSRS is dumb, or I am (I'm leaning towards both).
I have a dataset that (as a result of joins etc) has some columns with the same values duplicated across every row (fairly standard database stuff):
rid   cnt    bid   flg1   flg2
-------------------------------
4     2882   1     17     3
5     2784   1     17     3
6     1293   1     17     3
18    9288   2     4      9
20    762    2     4      9
Reporting based on cnt is straightforward enough. I can also make a tablix that shows the following:
bid   flg1   flg2
------------------
1     17     3
2     4      9
(Where the tablix is grouped by Fields!bid.Value and the columns are just Fields!flg1.Value and Fields!flg2.Value respectively.)
What I can't figure out is how to display the sum of these values -- specifically I want to show that the sum of flg1 is 21 and the sum of flg2 is 12 -- not the sum of every row in the dataset (counting each value more than once).
(Note that I'm not looking for a sum of distinct values, as they may not be unique. I want a sum of one value from each bid group, because it's from a table join so they will always have the same value.)
If possible, I'd also like to be able to do a similar calculation at the top level of the report (not in any tablix); although I'd settle for hiding the detail row if that's the only way.
Obviously, Sum(Fields!flg1.Value) isn't the answer, as this either returns 51 (if on the first row inside the group) or 59 (if outside it).
I also tried Sum(Fields!flg1.Value, "bid") but this wasn't considered a valid scope.
I also tried Sum(First(Fields!flg1.Value, "bid")) but apparently you're not allowed to sum first values for some weird reason (and may have had the same scope problem anyway).
Using Sum(Max(Fields!flg1.Value, "bid")) does work, but feels wrong. Is there a better way to do this?
(Related: is there a good way to save the result of that calculation so that I can later also show a Sum of those totals without an even hairier expression?)

There are two basic ways to do this:
1. Do what you have already done (Sum(Max(Fields!flg1.Value, "bid"))).
2. Sum the rendered values. To do this, check the name of the cell containing the data you want (look at its properties) and then use something like =SUM(ReportItems!flg1.Value), where flg1 is the name of the textbox, which is not necessarily the same as the name of the field.
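To make the arithmetic behind Sum(Max(..., "bid")) concrete, here is a minimal Python sketch (not report expression syntax) that takes one value per bid group and then adds those up, using the rows from the question. The helper name is hypothetical.

```python
# Dataset from the question; flg1/flg2 repeat on every row of a bid group.
rows = [
    {"rid": 4, "cnt": 2882, "bid": 1, "flg1": 17, "flg2": 3},
    {"rid": 5, "cnt": 2784, "bid": 1, "flg1": 17, "flg2": 3},
    {"rid": 6, "cnt": 1293, "bid": 1, "flg1": 17, "flg2": 3},
    {"rid": 18, "cnt": 9288, "bid": 2, "flg1": 4, "flg2": 9},
    {"rid": 20, "cnt": 762, "bid": 2, "flg1": 4, "flg2": 9},
]

def sum_once_per_group(rows, key, field):
    """Take the maximum of `field` within each `key` group, then sum the maxima."""
    per_group = {}
    for row in rows:
        group = row[key]
        per_group[group] = max(per_group.get(group, row[field]), row[field])
    return sum(per_group.values())

naive_total = sum(row["flg1"] for row in rows)       # counts 17 three times and 4 twice
flg1_total = sum_once_per_group(rows, "bid", "flg1")
flg2_total = sum_once_per_group(rows, "bid", "flg2")
print(naive_total, flg1_total, flg2_total)
```

Because every row in a group carries the same value, Max is only a way of picking one representative per group; Min would give the same totals.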

How to display two sets of data in a tablix

At the moment I have two tables. One table displays loan summaries:
Loan Client Balance
11 Bob 100000
20 Steven 100000
33 Michael 100000
I need to enhance this table by adding Loan.Notes:
Loan Client Balance
11 Bob 50000
2015-05-06 - Bob came into the office and said we should expect late payments
20 Steven 100000
2015-05-06 - Steven came into the office and he will pay this friday
2015-05-06 - Steven came into the office and said we should expect late payments
33 Michael 700000
The notes section has two columns: the date of the note and Note.Subject. How do I add the notes section to the tablix?

First you need to create a query joining Loan with 'LoanNotes'.
In your main table, group by Loan. In the header of that group, show the loan number, client and balance.
In the detail section, merge the columns and show the notes data as your requirements dictate. In the following case I merged three columns and created an expression: =Fields!NoteDate.Value & " - " & Fields!Subject.Value
Now when you run the report you will get the data the way you want it.
Optional: if a loan has no note, you will need to write an expression on the Visibility tab to hide the detail row:
=IIF(CountRows("LoanGroup") = 1, True, False)
OR
=CountRows("LoanGroup") = 1
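The header-plus-detail layout described above can be sketched outside the report designer. The Python below assumes the rows come from a LEFT JOIN of loans to notes (so a loan without notes still yields one row, with empty note fields) and uses the sample data from the question. Note that this sketch hides a detail line by testing whether the note itself is missing rather than by counting rows, because a loan with exactly one note also produces a single row.

```python
from itertools import groupby

# Assumed result of a LEFT JOIN, ordered by loan:
# (loan, client, balance, note_date, subject)
joined = [
    (11, "Bob", 50000, "2015-05-06",
     "Bob came into the office and said we should expect late payments"),
    (20, "Steven", 100000, "2015-05-06",
     "Steven came into the office and he will pay this friday"),
    (20, "Steven", 100000, "2015-05-06",
     "Steven came into the office and said we should expect late payments"),
    (33, "Michael", 700000, None, None),
]

def render(joined):
    """Emit one header line per loan, followed by one line per note."""
    lines = []
    for (loan, client, balance), group in groupby(joined, key=lambda r: r[:3]):
        lines.append(f"{loan} {client} {balance}")      # group header
        for row in group:
            note_date, subject = row[3], row[4]
            if note_date is None:                       # no note: hide the detail row
                continue
            lines.append(f"{note_date} - {subject}")    # merged detail cell
    return lines

lines = render(joined)
print("\n".join(lines))
```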

The best approach is to combine the tables into one DataTable, joining on a common key.
You can join the tables in your DBMS by using a query to create a new view, or you could use this DataSet JOIN Helper.
You can then group in your Tablix as you see fit.
Another solution would be to add multiple DataSets within the same report.
Using multiple datasets in RDLC

processing MySQL data when there are field values inserted with commas

I have some columns in a MySQL table whose values are separated by commas: fields such as IP address, running_port_ids, dns_range, subnet, etc. A cron job runs every hour to check whether the ports on each appliance are in use. If ports are in use on an appliance, their IDs are inserted into running_port_ids as comma-separated values (like 2,3,7).
How can I process the data to get a report of which ports are least used (I have a static list of port IDs), in ascending order as below, grouped by address, running_port_ids and insert date, for a date range of one month?
address port usage%
10.2.1.3 3 1
10.3.21.22 2 20
There are now thousands of records in the table with comma-separated running_port_ids. Are there any methods available in MySQL to do this?
Any help is much appreciated.

If you can convert your data model to an n:m relation (or "link table"), i.e. normalize your data model, this is pretty easy using grouping (or "aggregate") functions. So I'd advise revising your data model and introducing a table containing one row for each of the ports, instead of storing them denormalized in a text column.
A typical example would be: "student has many classes", and a property of this relation is "attendance":
Student
id  name
1   John
2   Jane
Course
id  name
1   Engineering
2   Databases
Class
id  courseid  date                 room
1   1         2015-08-05 10:00:00  301
2   1         2015-08-13 10:00:00  301
3   1         2015-09-03 10:00:00  301
StudentClass
studentid  classid  attendance
1          1        TRUE
1          2        FALSE
1          3        NULL
2          1        TRUE
2          2        TRUE
2          3        NULL
In this case, you can see the relation between student and class is normalized, i.e. every value is stored vertically instead of horizontally. This way, you can easily query things like "How many classes did John miss?" or "How many students did not miss any class?". NULL in the example shows that we cannot yet tell anything about the attendance (as the date is in the future), but we do know that they should attend.
This is the way you should keep track of properties of a relation between two things. I can't really make out what you're trying to build, but I'm pretty sure you need a similar model.
Hope this helps.
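Applied to the ports question, the same idea might look like the following Python sketch: each comma-separated running_port_ids value is split into one row per port, after which counting is a plain aggregation. The sample data, the static port list, and the definition of usage (share of hourly checks in which a port appeared) are all assumptions made for illustration, since the question does not spell them out.

```python
from collections import Counter

# Hypothetical hourly samples: (address, running_port_ids, insert_date).
samples = [
    ("10.2.1.3", "2,3,7", "2015-08-01 10:00"),
    ("10.2.1.3", "2,7", "2015-08-01 11:00"),
    ("10.2.1.3", "2", "2015-08-01 12:00"),
    ("10.2.1.3", "2,7", "2015-08-01 13:00"),
]
STATIC_PORTS = [2, 3, 7]  # assumed static list of port ids

def explode(samples):
    """Yield one (address, port, insert_date) row per port: the link-table shape."""
    for address, port_ids, insert_date in samples:
        for port in port_ids.split(","):
            if port.strip():
                yield address, int(port), insert_date

def usage_report(samples, ports):
    """Percentage of checks in which each port was in use, least used first."""
    checks = Counter(address for address, _, _ in samples)
    used = Counter((address, port) for address, port, _ in explode(samples))
    report = [
        (address, port, 100.0 * used[(address, port)] / checks[address])
        for address in checks
        for port in ports
    ]
    return sorted(report, key=lambda r: r[2])

report = usage_report(samples, STATIC_PORTS)
for address, port, pct in report:
    print(address, port, pct)
```

Once the data is stored in the exploded shape to begin with, the same report becomes a GROUP BY over the link table, and a date-range filter is an ordinary WHERE clause on the insert date.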

Displaying Multiple Column Values for One Row Value

I currently have a database which houses county codes within a state and the clients doing business within those counties. Sometimes several clients operate within the same county. I am looking to display each county code and then that county code's associated clients as separate columns. An example would look something like the below:
County Code  Client1  Client2  Client3
32           1        2
42           3
43           6        8
44           2        8        5
45           2
As of now, all I have managed to do is display it as two columns with duplicate county codes displaying different lender IDs. However, this is very manual to put it into the above format once I get it into Excel.
Any suggestions on this?

After review, this can be done with GROUP_CONCAT(). It returns the clients as a single string, so some modification in Excel will still be needed, but it's a simple solution.
SELECT
COUNTY_CODE,
GROUP_CONCAT(DISTINCT CLIENT_ID)
FROM CLIENT_TABLE
WHERE STATE = 'IA'
GROUP BY COUNTY_CODE
ORDER BY COUNTY_CODE;
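If the string result is not convenient, the two-column output (county code, client id) can also be pivoted into separate columns after the query. The Python sketch below shows one way to do that; the input pairs mirror the example in the question, and the function name is hypothetical.

```python
# (county_code, client_id) pairs, as the two-column query result would return them.
pairs = [
    (32, 1), (32, 2),
    (42, 3),
    (43, 6), (43, 8),
    (44, 2), (44, 8), (44, 5),
    (45, 2),
]

def pivot_clients(pairs):
    """Return a header row plus one row per county, padded with None."""
    by_county = {}
    for county, client in pairs:
        clients = by_county.setdefault(county, [])
        if client not in clients:          # same effect as DISTINCT
            clients.append(client)
    width = max((len(c) for c in by_county.values()), default=0)
    header = ["County Code"] + [f"Client{i}" for i in range(1, width + 1)]
    body = [
        [county] + clients + [None] * (width - len(clients))
        for county, clients in sorted(by_county.items())
    ]
    return [header] + body

table = pivot_clients(pairs)
for row in table:
    print(row)
```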

When is it better to flatten out data using comma separated values to improve search query performance?

My question is about search query performance.
I've flattened out data into a read-only Person table (MySQL) that exists purely for search. The table has about 20 columns of data (mostly limited text values, dates and booleans and a few columns containing unlimited text).
Person
=============================================================
id First Last DOB etc (20+ columns)...
1 John Doe 05/02/1969
2 Sara Jones 04/02/1982
3 Dave Moore 10/11/1984
Another two tables support the relationship between Person and Activity.
Activity
===================================
id activity
1 hiking
2 skiing
3 snowboarding
4 bird watching
5 etc...
PersonActivity
===================================
id PersonId ActivityId
1 2 1
2 2 3
3 2 10
4 2 16
5 2 34
6 2 37
7 2 38
8 etc…
Search considerations:
Person table has potentially 200-300k+ rows
Each person potentially has 50+ activities
Search may include Activity filter (e.g., select persons with one and/or more activities)
Returned results are displayed with person details and activities as bulleted list
If the Person table is used only for search, I'm wondering if I should add the activities as comma separated values to the Person table instead of joining to the Activity and PersonActivity tables:
Person
===========================================================================
id First Last DOB Activity
2 Sara Jones 04/02/1982 hiking, snowboarding, golf, etc.
Given the search considerations above, would this help or hurt search performance?
Thanks for the input.

Horrible idea. You will lose the ability to use indexes when querying. Do not under any circumstances store data in a comma-delimited list if you ever want to search on that column. Relational databases are designed to perform well with tables joined together. Your database is relatively small and should have no performance issues at all if you index properly.
You may still want to display the results in a comma-delimited fashion. MySQL has a function called GROUP_CONCAT for that.
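As a rough sketch of what the joined approach can look like, the Python below uses an in-memory SQLite database as a stand-in for MySQL (SQLite also provides group_concat). The table and column names follow the question; the index name, the sample rows, and the helper function are assumptions. It finds people who have all of the requested activities and returns their activities as one delimited string for display.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (id INTEGER PRIMARY KEY, First TEXT, Last TEXT);
CREATE TABLE Activity (id INTEGER PRIMARY KEY, activity TEXT UNIQUE);
CREATE TABLE PersonActivity (
    PersonId   INTEGER REFERENCES Person(id),
    ActivityId INTEGER REFERENCES Activity(id),
    PRIMARY KEY (PersonId, ActivityId)
);
-- Assumed extra index so filtering by activity does not scan the link table.
CREATE INDEX idx_pa_activity ON PersonActivity (ActivityId, PersonId);
INSERT INTO Person VALUES (1, 'John', 'Doe'), (2, 'Sara', 'Jones'), (3, 'Dave', 'Moore');
INSERT INTO Activity VALUES (1, 'hiking'), (2, 'skiing'), (3, 'snowboarding');
INSERT INTO PersonActivity VALUES (2, 1), (2, 3), (1, 1), (3, 2);
""")

def people_with_all(conn, wanted):
    """Return people who have every activity in `wanted`, with all their activities."""
    marks = ",".join("?" * len(wanted))
    sql = f"""
        SELECT p.id, p.First, p.Last,
               (SELECT group_concat(a2.activity, ', ')
                  FROM PersonActivity pa2
                  JOIN Activity a2 ON a2.id = pa2.ActivityId
                 WHERE pa2.PersonId = p.id) AS activities
          FROM Person p
          JOIN PersonActivity pa ON pa.PersonId = p.id
          JOIN Activity a ON a.id = pa.ActivityId
         WHERE a.activity IN ({marks})
         GROUP BY p.id, p.First, p.Last
        HAVING COUNT(DISTINCT a.id) = ?
         ORDER BY p.id
    """
    return conn.execute(sql, (*wanted, len(wanted))).fetchall()

both = people_with_all(conn, ["hiking", "snowboarding"])
hikers = people_with_all(conn, ["hiking"])
print(both)
print(hikers)
```

The search itself runs against the indexed link table, and the comma-separated form only appears at display time.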
