SSRS and Comparison Operators on Numeric Portion of varchar - reporting-services

Each returned transaction I am to report on is stored with a return reason code and a description of the return reason code. I built a tablix with two columns - one for return codes and another for descriptions. This works just peachy. The report owner is upset that a long list of codes will split pages - sigh. I was told to display them side-by-side.
I am new to T-SQL and SSRS and their idiosyncrasies. I have minimal support from our DBAs. Two tables, filtered to display codes that meet a criterion, sounded simple enough.
My research:
MSDN's support network, the Operators in Expressions page, and various help topics. I also found SO posts regarding split functions in T-SQL and similar, as well as one specifically asking about comparisons and varchar. I found sites with helpful information like ResultData and Network Steve. I haven't found what I think I'm looking for.
My problem:
The return reason code is a varchar that always consists of the letter 'R' and two numeric digits (R00 to R99). It appears I can't run a comparison operator on an entire varchar that is alphanumeric; it doesn't recognize IIF((Fields!... <= R17),True,False). Additionally, the company will not allow the warehouse or its functions to be edited, so I cannot create my own.
My solution ideas:
Add each Rnn code to the tablix filter, individually. This means ~50 filters per tablix and seems a sloppy or inefficient way of handling this
Separate the varchar string into its alpha and numeric components and compare the latter using standard operators. This sounds like the cleanest method but I'm unsure how to accomplish this in an expression or within SSRS
Forgo the two-table idea and create one table with four columns (code, description, code, description). This still leaves me with how to set a limit on the number of rows that can be created before 'spilling over' to the other side
I appreciate being pointed to any resources or any offered input on the issue and my (not so?) logical approach to it.

You can achieve your second option as follows:
CInt(Fields!ReturnCode.Value.Substring(1,2))
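For example (a hedged sketch: the field name ReturnCode and the R17 threshold are taken from the question, and it assumes every code matches the Rnn pattern), the comparison that SSRS would not accept becomes:
=IIF(CInt(Fields!ReturnCode.Value.Substring(1,2)) <= 17, True, False)
The same conversion works in a tablix filter: set the filter expression to =CInt(Fields!ReturnCode.Value.Substring(1,2)), the type to Integer, the operator to <=, and the value to 17. Equivalently, =CInt(Mid(Fields!ReturnCode.Value, 2, 2)) extracts the same two digits, since Mid is 1-based.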

Related

Obtaining the average of one field with comma delimited values (InfoPath)

I have a field where the user enters multiple values each separated by a comma eg "1.8, 2, 3".
I want to find the average of those values. Is there a way to utilise avg() to accommodate for stripping the comma and producing the mean?
Unfortunately you can't do that with the built-in InfoPath functions (there is no traditional split method for strings).
If you are willing to tackle it, using managed code behind the form will solve your problem very easily (only about four lines of code). Basic math and string manipulation should not impose any security restrictions on the form. However, you will have to set up code-behind, which is easy but can seem like somewhat of a hassle the first time you try it. There are good MSDN articles on how to go about that.
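For reference, here is a minimal sketch of the core logic such code-behind would need, independent of the InfoPath object model (the helper name is made up; in the form you would read the field's value, call something like this, and write the result back to the average field):
using System;

class FieldMath
{
    // Average of a comma-separated list such as "1.8, 2, 3".
    public static double AverageOfCommaList(string input)
    {
        double sum = 0;
        int count = 0;
        foreach (string part in input.Split(','))
        {
            double value;
            if (double.TryParse(part.Trim(), out value))   // skip blank or malformed entries
            {
                sum += value;
                count++;
            }
        }
        return count == 0 ? 0 : sum / count;
    }
}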
Alternatively, if you can change your data entry from comma-separated values to a repeating table, you can use the built-in avg() function.

SSAS calculated measure: Access relational database

I recently asked a question about many-to-many relationships and how they can be used to calculate intersections that got answered pretty fine. Now, there is another nice-to-have requirement for our cube to extend that to more data. The general question remains: How many orders contain both product x and y?
However, the measure groups are now much larger, currently about 1.4 billion rows. I tried to implement that using the method described in the other post, with several hidden cross-referenced measure groups. However, this is simply too much for our hardware: the cube is approaching 0.5 TB, and queries take several minutes to complete.
Now I want to try another option: can I access our relational database in a calculated measure? It seems I can, using UDFs as described in this article. I could write a function in C# that queries our relational database and returns all the orders that contain the products chosen by the user. But in order to do that, I need to supply all the dimensional data the user has selected to the UDF. I also need the UDF to return the calculated value so it can be output as the result of the calculated member. Is that possible? If yes, how? The example Microsoft provides only includes a small deterministic string function as the UDF.
Here are my own results:
It seems to be possible, though with limitations. The class Microsoft.AnalysisServices.AdomdServer.Context can provide you with the CurrentMember of each hierarchy; however, this does not work with Excel-style subselects. It either contains a single member or the AllMember.
Another option is to get the MDX query using the DMV SELECT * FROM $System.DISCOVER_SESSIONS. That view has a column which contains the last MDX query for a given session. However, in order not to overwrite your own last query, you need to open a new connection rather than use the current one. The session ID can be obtained through Microsoft.AnalysisServices.AdomdServer.Context.CurrentConnection.SessionID.
The second approach is OK for our use case. It does not allow you to handle axes, since the UDF has cell scope but you don't know which cell you are in. If any of you knows anything about that last bit, please tell me. Thanks!
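For what it's worth, a hedged sketch of that second approach (the class name and connection string are assumptions; it assumes references to Microsoft.AnalysisServices.AdomdServer for Context and Microsoft.AnalysisServices.AdomdClient for the separate connection):
using Microsoft.AnalysisServices.AdomdClient;

public static class RelationalBridge
{
    public static string GetLastMdxForCurrentSession()
    {
        // Session ID of the session that is evaluating the calculated member.
        string sessionId =
            Microsoft.AnalysisServices.AdomdServer.Context.CurrentConnection.SessionID;

        // Open a NEW connection so this DMV query does not become the
        // "last command" of the session we are inspecting.
        using (var conn = new AdomdConnection("Data Source=localhost"))
        {
            conn.Open();
            var cmd = new AdomdCommand(
                "SELECT SESSION_LAST_COMMAND FROM $System.DISCOVER_SESSIONS " +
                "WHERE SESSION_ID = '" + sessionId + "'", conn);
            using (var reader = cmd.ExecuteReader())
            {
                return reader.Read() ? reader.GetString(0) : null;
            }
        }
    }
}
The string returned here is the raw MDX, which you would still have to parse to recover the user's selection before querying the relational database.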

SSRS - How can I create a custom aggregate function?

I'm creating a report that has an unusual BoxPlot chart. I need to calculate the values for "Low Box" and "High Box" using all of the data for a certain column. The methodology for calculating these values is not that complicated, but I can not disclose it.
Basically I want to create a custom aggregate function. I understand how to create a VB function, but how do I make it take in a series of data instead of a single value? I know there is a Max function already, but for the sake of example, how would one implement a Max function?
Thanks for your help.
"can not disclose it." implies high value, which implies that you are using a recent version of SSRS, so this link should be of value for you. (The blog article also includes how you might implement this in 2005, but doesn't focus on it.)
Essentially create a custom function that gets called for every row of the data, taking in values from that row. That method or another related method can return your aggregate. 2008 includes Group Variables should help with a convenient place to store that.
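As a hedged illustration of that pattern, here is a minimal Max implementation for the report's Code block (Report Properties > Code); the function names and the field used in the calling expressions are made up:
' Running maximum accumulated across detail rows.
Dim runningMax As Double = Double.MinValue

' Call from each detail cell, e.g. =Code.AddValue(Fields!Measure.Value).
Public Function AddValue(ByVal v As Double) As Double
    If v > runningMax Then
        runningMax = v
    End If
    Return v   ' pass the value through so the detail cell still displays it
End Function

' Call where the aggregate should appear, e.g. =Code.GetMax() in a footer cell.
Public Function GetMax() As Double
    Return runningMax
End Function
For a per-group aggregate you would also add a reset function and call it from the group header, and you should verify that the detail cells are evaluated before the cell that calls GetMax in your layout. Your BoxPlot logic would replace the simple comparison inside AddValue.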
Another approach, but much harder I think, would be to implement a custom data provider wrapping your query.

Conditional split based on array variable

I need something like a T-SQL IN statement to filter records in a conditional split based on an array variable (or something similar)
I need to have a list of items that a column can be filtered on.
As Filip has indicated, there is no IN operator in the expression language. I did come up with some options though as I thought this sounded like an interesting problem.
My long analysis is on my blog: Filter list in SSIS
Conditional split
If you can transform your list of values into a delimited string, then you can use FINDSTRING and the current value to determine whether it's in the list. This provided the best throughput for my testing scenario. (FINDSTRING(@[User::MyListStr], [MyColumn], 1)) > 0
Script task
I had assumed using a List in a script task to determine membership would provide the best performance but I was wrong. Row.IsInList = MyListObj.Contains(Row.MyColumn);
Lookup/Cache Connection Manager
The third approach I had come up with was dumping the list into a Cache Connection Manager and then using that in a Lookup task. I thought this was the easiest to conceptualize and maintain, but the performance was lacking.
Conclusion
For this problem domain, the FINDSTRING approach was the most efficient, by a considerable margin. The other approaches consistently averaged a throughput within 7 rows per millisecond of each other. I did find it interesting that the standard deviation of the FINDSTRING approach fluctuated so much. While this box is older and slower, there was not a considerable amount of activity going on during the package executions.
There is no IN operator among the SSIS expression operators, and no similar operator either. Since there is no such operator, you can't do that with built-in expressions and the built-in Conditional Split. But you can do one of the following:
use a Script Transformation to check whether the particular column's value is in the variable array, and add an additional column (a flag) with value 1 if it is and 0 if it is not; then use the Conditional Split on this flag added in the Script Transformation (see the sketch after this list), or
better, put the values in a database table and then use a Lookup or Merge Join to check whether the row exists
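As a rough sketch of the first option (everything here is an assumption: a string variable User::MyListStr holding a pipe-delimited list and exposed as a read-only variable on the component, an input column MyColumn, and an added output column IsInList of type DT_BOOL):
using System.Collections.Generic;

public class ScriptMain : UserComponent
{
    // Values parsed once from the SSIS variable, e.g. "A|B|C".
    private HashSet<string> allowedValues;

    public override void PreExecute()
    {
        base.PreExecute();
        allowedValues = new HashSet<string>(Variables.MyListStr.Split('|'));
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Flag the row; the downstream Conditional Split tests this column.
        Row.IsInList = allowedValues.Contains(Row.MyColumn);
    }
}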

How do I do a fuzzy match of company names in MYSQL with PHP for auto-complete?

My users will import, through cut and paste, a large string that will contain company names.
I have an existing and growing MySQL database of company names, each with a unique company_id.
I want to be able to parse through the string and assign a fuzzy match to each of the user-entered company names.
Right now, just doing a straight-up string match is also slow. Will Soundex indexing be faster? How can I give the user some options as they are typing?
For example, someone writes:
Microsoft -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc. -> Polycom
I have found the following threads that seem similar to this question, but the poster has not accepted an answer and I'm not sure if their use case is applicable:
How to find best fuzzy match for a string in a large string database
Matching inexact company names in Java
You can start with SOUNDEX(); this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).
The drawbacks of SOUNDEX() are:
its inability to differentiate longer strings. Only the first few characters are taken into account; longer strings that diverge only at the end generate the same SOUNDEX value
the fact that the first letter must be the same or you won't find a match easily. SQL Server has a DIFFERENCE() function to tell you how far apart two SOUNDEX values are, but I think MySQL has nothing of that kind built in.
for MySQL, at least according to the docs, SOUNDEX is broken for Unicode input
Example:
SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')
/* all of these return 'M262' */
For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.
The main drawback is that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX value in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.
In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).
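For example, assuming that stored function is installed as LEVENSHTEIN and using made-up table and column names, a lookup could look like this (note it scans the whole table, so pre-filtering on something cheap like the SOUNDEX value or the first letter helps):
SELECT company_id, name
FROM companies
WHERE LEVENSHTEIN(LOWER(name), LOWER('Bare Essentials')) <= 3
ORDER BY LEVENSHTEIN(LOWER(name), LOWER('Bare Essentials'))
LIMIT 5;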
SOUNDEX is an OK algorithm for this, but there have been recent advances on this topic. Another algorithm called Metaphone was created, and it was later revised into the Double Metaphone algorithm. I have personally used the Java Apache Commons implementation of Double Metaphone and it is customizable and accurate.
There are implementations in lots of other languages on the Wikipedia page for it, too. This question has been answered, but should you find any of the identified problems with SOUNDEX appearing in your application, it's nice to know there are options. Sometimes it can generate the same code for two really different words; Double Metaphone was created to help take care of that problem.
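A hedged usage sketch with the Apache Commons Codec implementation mentioned above (commons-codec on the classpath; the inputs are just the examples from the question):
import org.apache.commons.codec.language.DoubleMetaphone;

public class MetaphoneDemo {
    public static void main(String[] args) {
        DoubleMetaphone dm = new DoubleMetaphone();
        // Primary and alternate encodings of a name.
        System.out.println(dm.doubleMetaphone("Microsoft"));
        System.out.println(dm.doubleMetaphone("Microsoft", true));
        // Compares the (primary) encodings of the two inputs.
        System.out.println(dm.isDoubleMetaphoneEqual("Bare Essentials", "Bare Escentuals"));
    }
}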
Stolen from wikipedia: http://en.wikipedia.org/wiki/Soundex
As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm for the same purpose. Philips later developed an improvement to Metaphone, which he called Double-Metaphone. Double-Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.
At the bottom of the double metaphone page, they have the implementations of it for all kinds of programming languages: http://en.wikipedia.org/wiki/Double-Metaphone
Python & MySQL implementation: https://github.com/AtomBoy/double-metaphone
Firstly, I would like to add that you should be very careful when using any form of phonetic/fuzzy matching algorithm, as this kind of logic is exactly that: fuzzy, or to put it more simply, potentially inaccurate. This is especially true when matching company names.
A good approach is to seek corroboration from other data, such as address information, postal codes, telephone numbers, geo-coordinates, etc. This will help confirm the probability of your data being accurately matched.
There is a whole range of issues related to B2B data matching, too many to address here. I have written more about company name matching on my blog (also an updated article), but in summary the key issues are:
Looking at the whole string is unhelpful, as the most important part of a company name is not necessarily at the beginning of the name, e.g. 'The Proctor and Gamble Company' or 'United States Federal Reserve'.
Abbreviations are commonplace in company names, e.g. HP, GM, GE, P&G, D&B, etc.
Some companies deliberately spell their names incorrectly as part of their branding and to differentiate themselves from other companies.
Matching exact data is easy, but matching non-exact data can be much more time-consuming, and I would suggest that you consider how you will validate the non-exact matches to ensure they are of acceptable quality.
Before we built Match2Lists.com, we used to spend an unhealthy amount of time validating fuzzy matches. In Match2Lists we incorporated a powerful visualisation tool enabling us to review non-exact matches; this proved to be a real game changer in terms of match validation, reducing our costs and enabling us to deliver results much more quickly.
Best of Luck!!
Here's a link to the PHP discussion of the SOUNDEX functions in MySQL and PHP. I'd start from there, then expand into your other not-so-well-defined requirements.
Your reference mentions the Levenshtein methodology for matching. Two problems: 1. It's more appropriate for measuring the difference between two known words, not for searching. 2. It discusses a solution designed more to detect things like proofing errors (using "Levenshtien" for "Levenshtein") rather than spelling errors (where the user doesn't know how to spell, say, "Levenshtein" and types in "Levinstein"). I usually associate it with looking for a phrase in a book rather than a key value in a database.
EDIT: In response to a comment:
Can you at least: 1. get the users to put the company names into multiple text boxes; 2. or use an unambiguous name delimiter (say a backslash); 3. leave out articles ("The") and generic abbreviations (or you can filter for these); 4. squoosh the spaces out and match on that also, so Micro Soft => microsoft and Bare Essentials => bareessentials (see the sketch below); 5. filter out punctuation; 6. do "OR" searches on words ("bare" OR "essentials") - people will inevitably leave one or the other out sometimes.
Test like mad and use the feedback loop from users.
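For point 4, a hedged MySQL sketch (the table and column names are made up):
SELECT company_id, name
FROM companies
WHERE REPLACE(LOWER(name), ' ', '') = REPLACE(LOWER('Micro Soft'), ' ', '');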
The best function for fuzzy matching is Levenshtein. It's traditionally used by spell checkers, so that might be the way to go. There's a UDF for it available here: http://joshdrew.com/
The downside to using Levenshtein is that it won't scale very well. A better idea might be to dump the whole table into a spell checker custom dictionary file and do the suggestion from your application tier instead of the database tier.
This approach gives you an indexed lookup of almost any entity using an input of 2 or 3 characters or more.
Basically, create a new table with 2 columns, word and key. Run a process on the original table containing the column to be fuzzy searched. This process will extract every individual word from the original column and write these words to the word table along with the original key. During this process, commonly occurring words like 'the', 'and', etc. should be discarded.
We then create several indices on the word table, as follows...
A normal, lowercase index on word + key
An index on the 2nd through 5th character + key
An index on the 3rd through 6th character + key
Alternatively, create a SOUNDEX() index on the word column.
Once this is in place, we take any user input and search using normal word = input or word LIKE 'input%'. We never do LIKE '%input', as we are always looking for a match on the first 3 or more characters, which are all indexed.
If your original table is massive, you could partition the word table by chunks of the alphabet to ensure the user's input is being narrowed down to candidate rows immediately.
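A minimal MySQL sketch of that structure (all names are illustrative; the character-offset substrings are stored as their own columns so they can be indexed):
CREATE TABLE company_word (
    company_id INT NOT NULL,          -- key back to the original table
    word       VARCHAR(64) NOT NULL,  -- one lowercased word per row
    word_2_5   VARCHAR(4)  NOT NULL,  -- characters 2 through 5 of the word
    word_3_6   VARCHAR(4)  NOT NULL,  -- characters 3 through 6 of the word
    KEY ix_word     (word, company_id),
    KEY ix_word_2_5 (word_2_5, company_id),
    KEY ix_word_3_6 (word_3_6, company_id)
);

-- Lookup on the user's partial input, always anchored at the start so the index is used:
SELECT DISTINCT company_id
FROM company_word
WHERE word LIKE 'mic%';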
Though the question asks about how to do fuzzy searches in MySQL, I'd recommend considering using a separate fuzzy search (aka typo tolerant) engine to accomplish this. Here are some search engines to consider:
ElasticSearch (Open source, has a ton of features, and so is also complex to operate)
Algolia (Proprietary, but has great docs and super easy to get up and running)
Typesense (Open source, provides the same fuzzy search-as-you-type feature as Algolia)
Check if it's spelled wrong before querying, using a trusted and well-tested spell-checking library on the server side; then do a simple query for the original text AND the first suggested correct spelling (if the spell check determined it was misspelled).
You can create custom dictionaries for any spell-check library worth using, which you may need to do for matching more obscure company names.
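A sketch of the resulting query, assuming the spell checker suggested 'Microsoft' for the raw input 'Microsift' (table and column names are made up):
SELECT company_id, name
FROM companies
WHERE name IN ('Microsift', 'Microsoft');  -- the raw input plus the suggested correction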
It's way faster to match against two simple strings than it is to do a Levenshtein distance calculation against an entire table. MySQL is not well suited for this.
I tackled a similar problem recently and wasted a lot of time fiddling around with algorithms, so I really wish there had been more people out there cautioning against doing this in MySQL.
Probably been suggested before, but why not dump the data out to Excel and use the Fuzzy Match Excel plugin? This will give a score from 0 to 1 (1 being 100%).
I did this for business partner (company) data that was held in a database.
Download the latest UK Companies House data and score against that.
For ROW data it's more complex, as we had to do a more manual process.