I have a table as follows:
This is a cat
This is a pet
This is a dog
is
a
is a
is
is a dog
That is a dog
I would like to end up with a table as follows:
This is a cat
This is a pet
This is a dog
That is a dog
Essentially remove the rows that are already contained (as sub-strings) in other rows.
You can create a combination of all rows with the Cross Joiner node (both inputs coming from your example table). Then, with a String Manipulation node (probably followed by a String to Number node) or a Java Snippet node, you can assign 1 or 0 depending on whether the original row is contained in the joined row. After that you can GroupBy on the original column and sum the 0/1 values. With a Row Filter you can keep only the rows whose sum is 1 - every row is contained in itself, so a sum of 1 means it is not contained in any other row.
(Please note that the Cross Joiner can create quite large tables. Maybe the Distance measure nodes can solve this problem more efficiently.)
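If the same data happens to live in a database rather than KNIME, here is a rough SQL sketch of the same containment test (the table name sentences and the column txt are assumptions for illustration):

-- Keep only the rows that are not contained, as a substring, in any other row.
SELECT t1.txt
FROM sentences t1
WHERE NOT EXISTS (
    SELECT 1
    FROM sentences t2
    WHERE t2.txt <> t1.txt
      AND t2.txt LIKE CONCAT('%', t1.txt, '%')
);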
It depends on the exact nature of your dataset, but if you had columns each with some text value (like in the picture), you could treat each row as an itemset and use the Item Set Finder (after a suitable conversion to a bit vector) to find the maximal itemsets.
The maximal itemsets correspond to the rows that are not contained in any other row, which is exactly what you want to keep.
Related
I am trying to write a query that will return similar rows with regard to the "Name" column.
My issue is that within my SQL database, there are examples like the following:
NAME          DOB
Doe, John     1990-01-01
Doe, John A   1990-01-01
I would like a query that returns similar, but not exact, duplicates of the "Name" column. Since I do not know exactly which patients this occurs for, I cannot just query for "Doe, John%".
I have written this query using MySQL Workbench:
SELECT Name, DOB, id, COUNT(*)
FROM Table
GROUP BY DOB
HAVING COUNT(*) > 1;
However, this returns an unwieldy number of rows in which Name is not similar at all. Is there any way I can narrow down my results to include only similar (but not exactly duplicate!) Names? It seems impossible, since I do not know exactly which rows have similar Names, but I figured I'd ask some experts.
To be clear, this is not a duplicate of the other question posted, since I do not know the content of the two (or more) strings, whereas that poster seemed to have known some of the content. Ideally, I would like the query to limit results to rows whose first 3 or 4 characters in the "Name" column are the same.
But again, I do not know the content of the strings in question. Hope this helps clarify my issue.
What I intend on doing with these results is manually auditing the rest of the information in each of the duplicate rows (over 90 other columns per row may or may not have abstract information in them that must be accurate) and then deleting the unneeded row.
I would just like to get the most concise and accurate list I can to go through, so I don't have to scroll through over 10,000 rows looking for similar names.
For the record, I do know for a fact that the two rows will have identical names up until the middle initial. In the past, someone used a tool that exported names from one database to my SQL database, which included middle initials. Since then, I have imported another list that does not include middle initials. I am looking for the ones from that subset that have middle initials.
This is a very large topic, and the effort depends on what you consider "similar" and what the structure of the data is. For example, are you going to want to match Doe, Johnathan as well?
Several algorithms exist, but they can be extremely resource intensive when matching on name alone if you have a large data set. That is why it typically works better to first narrow the possible matches using other attributes such as DOB, email, or address, and then compare names.
When comparing, you can use algorithms such as Jaro-Winkler, Levenshtein distance, or n-grams. But you should also consider the "confidence" of a match by looking at the other information, as suggested above.
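For example, here is a minimal MySQL sketch of narrowing by DOB first and then comparing names (the table name patients is an assumption; Name, DOB, and id come from the question):

SELECT a.id, a.Name, b.id AS possible_dup_id, b.Name AS possible_dup_name, a.DOB
FROM patients a
JOIN patients b
  ON a.DOB = b.DOB                           -- narrow candidates by another attribute first
 AND a.id < b.id                             -- report each pair only once
WHERE a.Name <> b.Name                       -- similar but not exact duplicates
  AND LEFT(a.Name, 4) = LEFT(b.Name, 4);     -- crude similarity: same first 4 characters

Swap the LEFT() comparison for a Jaro-Winkler or Levenshtein UDF if you have one installed; MySQL does not ship with those functions.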
The issue with matching addresses is that you have the same fuzzy-logic problems: 1st vs first. So if I were going this route, I would actually convert addresses into GPS coordinates using another service and then accept records within X amount of distance of each other.
And the age-old issue with this is matching a husband and wife. I personally know a married couple who are both named Michael Hatfield. So you could try to bring in the gender of the name, but then Terry, Tracy, etc. can be either...
Bottom line: only go the route of name similarity if you have to, and if you do, look into other solutions such as the services from Melissa Data or SQL Server Data Quality Services as a tool.
Update per the comment about the middle initial: if you always know the names will be the same except for the middle initial, then this task can be fairly simple and does not need any complicated algorithm. You could match based on one string + '%' being LIKE the other, then test that the lengths differ by only 2 and that the longer string contains one more space than the shorter one. Or you could make an attempt at cleansing/removing the middle initial; this can be a little complicated if the name has a space in it (Doe, Ann Marie), but you could do it by testing whether the second-to-last character is a space.
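A rough sketch of that LIKE + length + space-count check in MySQL (again, the table name patients is an assumption):

SELECT a.id, a.Name, b.id AS dup_id, b.Name AS dup_name
FROM patients a
JOIN patients b
  ON a.DOB = b.DOB
 AND b.Name LIKE CONCAT(a.Name, '%')                         -- longer name starts with the shorter one
 AND CHAR_LENGTH(b.Name) - CHAR_LENGTH(a.Name) = 2           -- only 2 characters longer
 AND CHAR_LENGTH(b.Name) - CHAR_LENGTH(REPLACE(b.Name, ' ', ''))
   = CHAR_LENGTH(a.Name) - CHAR_LENGTH(REPLACE(a.Name, ' ', '')) + 1;  -- exactly one extra space

The extra 2 characters are the space plus the one-letter middle initial, which is what the length and space-count conditions test for.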
I have two tables whose shared columns do not match exactly (differences in capitalization, or the presence of characters like commas, spaces, and so on). How can I merge these two tables based on their shared column (in R, KNIME, Excel Power Query, or SQL)?
In your example Result table it's not clear where the row
gene1 | go3 | 14
comes from, because there's no entry for go3 in Table2. I'm assuming that's a mistake and you meant Table2 to include the row
go3 | 14
If that's correct, here's how to do this in KNIME:
The two Table Creator nodes just create the two tables with the column names shown in your example - replace these with your actual data sources. Cell Splitter splits the Goes column using a comma as the delimiter. The Unpivoting node and the Joiner are then configured as shown in the screenshots.
All other settings were left as default. Add nodes to reorder and filter the columns in the Joiner output if you need to. Note that you'll see different Goes_Arr[n] columns depending on how many different values of Goes there are - the Enforce exclusion and Enforce inclusion settings make sure that Unpivoting handles this correctly.
This workflow should cope with whitespace between the commas, but I think you also mention differences in capital letters - if you need to handle these, pass each table through a Case Converter node to make them consistent.
Pivoting and unpivoting are hard to understand (IMHO - especially given the cryptic descriptions of their KNIME nodes) but very powerful. I recommend taking time to play around with these nodes to figure out how they work.
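If you end up doing this in SQL instead (one of the options in your question), the equivalent of the case and whitespace handling is to normalize both key columns before joining. A minimal sketch - all table and column names here are placeholders, and splitting a comma-separated column (the Cell Splitter / Unpivoting part) would still need its own step:

SELECT a.*, b.*
FROM table_a a
JOIN table_b b
  ON LOWER(REPLACE(TRIM(a.shared_col), ',', ''))
   = LOWER(REPLACE(TRIM(b.shared_col), ',', ''));   -- ignore case, leading/trailing spaces, and commas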
We have a large table with product information. Almost all the time we need to find product names that contain specific words, but unfortunately these queries take forever to run.
Example: Find all the products where the name contains the words "steel" and "102" (not necessarily next to each other, so a product like "Ninja steel iron 102 x" is a match, just like "Dragon steel 102 b" is).
Currently we are doing it like this:
SELECT columns FROM products WHERE name LIKE '%WORD1%' AND name LIKE '%WORD2%' (the number of LIKE words is normally 2-4, but it can in theory be 7-8 or more).
Is there a faster way of doing this?
We are only matching words, so I wonder if that can help somehow (i.e. the products in the example above are matches, but "Samurai swordsteel 102 v" is not a match since "steel" doesn't stand alone).
My own thought is to make a helper table containing the words from the product names and then use that table to get the ids of the matching products.
i.e. a table like: [id, word, productid] so we get for example:
1, samurai, 3
2, swordsteel, 3
3, 102, 3
4, v, 3
Just wondering if there is a built-in way to do this in MySQL, so I don't have to implement my own stuff and maintain two tables.
Thanks!
Unfortunately, you have wildcards at the beginning of the patterns, so MySQL cannot use a standard index for this.
You have two options. First, if the words are really keywords/attributes, then you should have another table, with one row per word.
If that is not the case, you can try a full-text index. Note that MySQL has settings for the minimum word length and uses a stop-word list. You should take these into account before building the index.
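A minimal sketch of the full-text approach, using the table and search terms from the question (check the minimum-token-length and stop-word settings first, as noted above):

ALTER TABLE products ADD FULLTEXT INDEX ft_products_name (name);

-- Boolean mode: every +term must appear as a whole word somewhere in the name,
-- so "swordsteel" will not match "steel", and the words need not be adjacent.
SELECT *
FROM products
WHERE MATCH(name) AGAINST('+steel +102' IN BOOLEAN MODE);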
I was poking around a TFS database today to try and run some statistics, and I came across a table called tbl_Number. This table contains one column, Number, and the values are just 1 to 500,000. None of the values differ from their respective index in the list, as you can see from the queries I ran in LINQPad:
Tbl_Numbers.Max(x => x.Number).Dump(); //max value
Tbl_Numbers.Count().Dump(); //number of entries
var asList = Tbl_Numbers.ToList();
asList.Where(x => asList[x.Number - 1].Number != x.Number).Any().Dump();
//False shows that every entry matches the value at its ordinal location in the list
My question is: What would the use of such a table be? Is this in case one of the referenced numbers needs to change for some reason? The only way to identify a number from this table is by using that same number, so I don't see what use this table could be.
I realize this question could lead to answers that are conjecture, but I'd be interested to see if there's some programming principle that I'm unaware of that's being used here.
It can be used in OUTER JOINS to make sure that you always get all the numbers in a given range, even if there is no data related to that number.
For example, suppose I want to return the count of customers who bought 3, 4, or 5 products on their last order, but in fact there are no customers who bought 4 products. If I just ran a count query on my data, I wouldn't get a row for the customers who bought 4 products at all.
However, if I query my numbers table and LEFT JOIN to my data, I will get the number 4, and a count of 0 or NULL, depending on how I wrote my query.
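A quick sketch of that idea against the table from the question (the last_orders table and its product_count and order_id columns are made up for illustration):

SELECT n.Number AS products_bought,
       COUNT(o.order_id) AS customer_count   -- COUNT over the joined column gives 0 where nothing matched
FROM tbl_Number n
LEFT JOIN last_orders o
       ON o.product_count = n.Number
WHERE n.Number BETWEEN 3 AND 5
GROUP BY n.Number;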
People also often do this with Date tables, by the way.
I have a case where we are maintaining a table containing resources. This table has a varchar column that contains role ids as comma-separated values (I know normalizing SHOULD have been the way to go, but I can't change a long-running working system). E.g. the role_ids column contains '1,4,6,9,10' and another row contains '5,10,15'.
Then, for a user in the system, I have the associated role ids as a list, e.g. 4,15. Now I need to find 'any in many', i.e. any resource that has any of the user's role ids present in its resource.role_ids column.
This question is similar to this one, but here the solution is not expected to be in Grails.
I'm looking for a MySQL solution - either a query or a stored procedure. Finding a set of resources could be achieved using FIND_IN_SET(), but I don't want to perform multiple calls to the DB, one for each entry in the user's role_id list.
Use a function like this one to turn your lists into individual records, then join everything up normally.
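If you would rather not add a split function, a rough single-query alternative is to put the user's role ids into a derived table and let FIND_IN_SET do the 'any in many' test (the role ids 4 and 15 and the resources table come from the question; adjust names as needed):

SELECT DISTINCT r.*
FROM resources r
JOIN (SELECT 4 AS role_id UNION ALL SELECT 15) AS user_roles
  ON FIND_IN_SET(user_roles.role_id, r.role_ids) > 0;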