Transforming less frequent values - KNIME

Suppose I have the following columns in a CSV file that I read through a 'File Reader' node:
id, name, city, income
After reading it, I notice that the column 'city' contains a huge number of unique values. I want to:
Know which values are the 'k' most frequent for 'city'
Change the values that are not among the 'k' most frequent to something like 'Other'
Example:
id, name, city, income
1, Person 1, New York, 100.000
2, Person 2, Toronto, 90.000
3, Person 3, New York, 50.000
4, Person 4, Seattle, 60.000
Choosing k to be 1, I want to produce the following table:
id, name, city, income
1, Person 1, New York, 100.000
2, Person 2, Other, 90.000
3, Person 3, New York, 50.000
4, Person 4, Other, 60.000
This is because, with k = 1, 'New York' is the single most frequent value for 'city' in the original table.
Do you know how I can do that using Knime?
Thanks a lot!

You can use the CSV Reader to read the data. With the Statistics and Row Filter nodes you can find the k most frequent values. From those, you can create a collection cell using GroupBy. With that collection value, you can use Rule Engine with a ruleset similar to this:
$city$ IN $most frequent cities$ => $city$
TRUE => "Other"
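The same top-k recoding that the workflow above performs can be sketched in plain Python (a minimal illustration of the logic only; the KNIME workflow itself needs no code):

```python
from collections import Counter

def keep_top_k(values, k, other="Other"):
    """Replace every value not among the k most frequent with `other`."""
    top = {v for v, _ in Counter(values).most_common(k)}
    return [v if v in top else other for v in values]

cities = ["New York", "Toronto", "New York", "Seattle"]
print(keep_top_k(cities, k=1))
# ['New York', 'Other', 'New York', 'Other']
```

Note that `most_common(k)` breaks ties arbitrarily; if several cities share the k-th highest count, you may want to decide explicitly which ones survive.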

Related

Searching for groups of rows where a column contains ALL given values

I have a table symptom_ratings containing the columns id, user_id, review_id, symptom_id, rate, and strain_id.
Each review can have multiple entries in symptom_ratings, one per symptom.
I would like to search for every strain_id that has all of the symptom_ids the user searches for.
That is, given the columns:
review: 2, strain_id: 3, symptom_id: 43
review: 2, strain_id: 3, symptom_id: 23
review: 2, strain_id: 3, symptom_id: 12
review: 6, strain_id: 1, symptom_id: 3
review: 6, strain_id: 2, symptom_id: 12
Searching for the symptom_id's 43 and 12 should only return results for strain_id 3.
I currently use the following WHERE condition:
Strain.id IN (SELECT strain_id
              FROM symptom_ratings
              WHERE symptom_id IN ($symptoms))
where $symptoms is a comma-separated list of symptom_id values.
My problem is that this query currently performs an OR search (i.e. it finds strains that have any of the symptoms), where instead I'd prefer an AND search (i.e. finding strains that have all of the symptoms). How can I achieve that?
One way to do this would be to group the rows by the strain ID, count the number of distinct matching symptoms in each group, and return only those rows where the count equals the total number of symptoms searched for:
SELECT
strain_id,
COUNT(DISTINCT symptom_id) AS matched_symptoms
FROM symptom_ratings
WHERE symptom_id IN (43, 12)
GROUP BY strain_id
HAVING matched_symptoms = 2
One potentially useful feature of this method is that it's trivial to extend it to support "all of these", "any of these", and "at least n of these" searches just by changing the condition in the HAVING clause. For the latter cases, you can also sort the results by the number of matching symptoms (e.g. with ORDER BY matched_symptoms DESC).
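To check the logic, here is the same data and query run through Python's sqlite3 module (SQLite is only a stand-in for the original engine; note the HAVING clause repeats the aggregate expression, since not every engine lets you reference the matched_symptoms alias there):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE symptom_ratings (review INTEGER, strain_id INTEGER, symptom_id INTEGER)")
conn.executemany(
    "INSERT INTO symptom_ratings VALUES (?, ?, ?)",
    [(2, 3, 43), (2, 3, 23), (2, 3, 12), (6, 1, 3), (6, 2, 12)],
)

wanted = (43, 12)
rows = conn.execute(
    """
    SELECT strain_id, COUNT(DISTINCT symptom_id) AS matched_symptoms
    FROM symptom_ratings
    WHERE symptom_id IN (?, ?)
    GROUP BY strain_id
    HAVING COUNT(DISTINCT symptom_id) = ?
    """,
    (*wanted, len(wanted)),
).fetchall()
print(rows)  # [(3, 2)] -- only strain 3 matches both symptoms
```

COUNT(DISTINCT ...) matters here: if a strain has several reviews reporting the same symptom, a plain COUNT would over-count and could wrongly satisfy the HAVING condition.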

Database Normalization For Table With Tree Like Data

How do I normalize this table? It has a tree-like structure which is expected to keep growing.
By tree-like structure I mean that new students, subjects, levels, and chapters will constantly be added, updated, or removed.
I want to store the result of a quiz in this table. The quiz has multiple subjects, under which there are multiple levels, under which there are multiple chapters. And every student can take different subjects.
So is this table good for storing the results, or do I need to do something with this table?
In this particular case you need to create several independent tables:
Table "Student"
ID, Name
1, John
2, Jack
Table "Subject"
ID, Name
1, Math
2, Science
3, Geography
4, History
5, English
Table "Levels"
ID, Name
1, Intermediate
2, Moderate
3, Difficult
Table "Chapters"
ID, Name
1, Chapter 1
2, Chapter 2
3, Chapter 3
And so on and so on.
Then you define the relations between the tables, like this:
Table "student_subject_level"
ID, student_id, subject_id, level_id
1, 1, 1, 1 (John, Math, Intermediate)
2, 1, 2, 2 (John, Science, Moderate)
So far you have the student, the corresponding subject, and the subject's level. Since we may have multiple chapters for each level, we need another relation:
Table "student_subject_level_chapter" (or use simpler name)
student_subject_level_id, chapter_id
1, 1 (John, Math, Intermediate, Chapter 1)
1, 2 (John, Math, Intermediate, Chapter 2)
2, 1 (John, Science, Moderate, Chapter 1)
And so on and so on. Start by isolating the individual tables and then figure out how you'd like to achieve the actual relation. For each new relation where you have redundant data, you'll want a new table which keeps the relation you need. It's much easier once you have IDs to refer to, so start with the individual tables and work your way through.
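The schema above can be exercised end to end with Python's sqlite3 module (a sketch only; table and column names follow the answer, and SQLite stands in for whatever engine is actually used):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE subject (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE level   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE chapter (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE student_subject_level (
    id INTEGER PRIMARY KEY,
    student_id INTEGER REFERENCES student(id),
    subject_id INTEGER REFERENCES subject(id),
    level_id   INTEGER REFERENCES level(id)
);
CREATE TABLE student_subject_level_chapter (
    student_subject_level_id INTEGER REFERENCES student_subject_level(id),
    chapter_id INTEGER REFERENCES chapter(id)
);
INSERT INTO student VALUES (1, 'John'), (2, 'Jack');
INSERT INTO subject VALUES (1, 'Math'), (2, 'Science');
INSERT INTO level   VALUES (1, 'Intermediate'), (2, 'Moderate');
INSERT INTO chapter VALUES (1, 'Chapter 1'), (2, 'Chapter 2');
INSERT INTO student_subject_level VALUES (1, 1, 1, 1), (2, 1, 2, 2);
INSERT INTO student_subject_level_chapter VALUES (1, 1), (1, 2), (2, 1);
""")

# Join everything back together to reproduce the readable rows from the answer.
rows = conn.execute("""
    SELECT st.name, su.name, le.name, ch.name
    FROM student_subject_level_chapter sslc
    JOIN student_subject_level ssl ON ssl.id = sslc.student_subject_level_id
    JOIN student st ON st.id = ssl.student_id
    JOIN subject su ON su.id = ssl.subject_id
    JOIN level   le ON le.id = ssl.level_id
    JOIN chapter ch ON ch.id = sslc.chapter_id
    ORDER BY st.name, su.name, ch.name
""").fetchall()
for r in rows:
    print(r)
```

The join shows the payoff of the design: each fact (a name, a level) is stored once, and the relation tables are just integer pairs that can grow without duplicating any text.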

Updating a table based on contents of the same table

I have a table:
id, number, name, display_name
0001, 1, Category 1, null
0001-0002, 2, Category 2, null
0001-0002-0003, 3, Category 3, null
The id is the full path to the category, the number is just the final category number.
I'd like display_name updated to include the full names from all categories in the path, so they'd end up as
0001, 1, Category 1, Category 1
0001-0002, 2, Category 2, Category 1 > Category 2
0001-0002-0003, 3, Category 3, Category 1 > Category 2 > Category 3
I know I can generate these on the fly by looking up the number column, but this table doesn't change often and it gets a lot of lookups -- it seems wasteful not to just calculate and store the data once. I can do this in PHP, but it's slow, and I assume there's a better way to do it? Or perhaps I'm going about this in completely the wrong way. I realise there's plenty of redundancy in the table... I'm happy for any input.
I got as far as
update categories set display_name = (select name from (select name from categories where number = 1) t) where number = 1
but that obviously just copies the name to the display name.
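As a sanity check of the intended result, the path expansion itself is easy to express outside SQL. A hedged Python sketch (it assumes each id segment is the zero-padded number of a category, as in the sample data):

```python
def build_display_names(rows):
    """rows: list of (id, number, name) tuples.
    Returns {id: display_name}, where display_name joins the names of
    every category on the id path with ' > '."""
    # Map each zero-padded segment (e.g. '0002') to its category name.
    name_by_segment = {f"{number:04d}": name for _, number, name in rows}
    return {
        cat_id: " > ".join(name_by_segment[seg] for seg in cat_id.split("-"))
        for cat_id, _, _ in rows
    }

rows = [("0001", 1, "Category 1"),
        ("0001-0002", 2, "Category 2"),
        ("0001-0002-0003", 3, "Category 3")]
print(build_display_names(rows)["0001-0002-0003"])
# Category 1 > Category 2 > Category 3
```

A script like this, run once whenever the categories change, matches the "calculate and store once" idea from the question; the same split-and-join logic could also be written as a single UPDATE in engines that support recursive CTEs.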

Where would tables that are not normalised be applicable?

Is it ever best practise or recommended to use a table like the following?
id, uid, fieldname, fieldvalue
4, 12, gender, male
5, 12, age, 21-30
6, 12, location, 5
7, 13, gender, female
8, 13, age, 31-40
9, 13, location, 5
10, 14, gender, female
11, 14, age, 31-40
12, 14, location, 6
13, 15, gender, male
14, 15, age, 21-30
15, 15, location, 7
It is not normalised and you cannot specify the data type of the field.
Would the following not be better
id, uid, gender, age, location
4, 12, male, 21-30, 5
5, 13, female, 31-40, 5
6, 14, female, 31-40, 6
7, 15, male, 21-30, 7
I want to know if you can ever justify having such a table in a database. I know that the first method may make it easier to add more fields (no need to alter the schema) and will probably remove all null values.
However, one cannot specify the datatype, and you will have to convert from string every time you want to use the data.
So is there ever a scenario where the first table is considered best practice, or the better solution?
Working on a system that uses that method will make you lose your sanity. The complexity of the queries required in order to perform basic tasks is dreadful, and performance is a nightmare.
Here's one man's experience: https://www.simple-talk.com/opinion/opinion-pieces/bad-carma/
You can normalize the initial setup by adding another table:
fieldnames (fnid, name)
fieldvalues (id, uid, fnid, value, unique(uid,fnid))
However, I would recommend against it because of its complexity -- it's much easier to use a single table unless you are going to be adding and/or removing fields very frequently, or there could be a large disparity in which rows get which fields (in which case you should probably consider redesigning your DB and application).
The first type of structure you describe is quite common in applications where attributes are not known in advance. For example, you might be developing a system where a user can save details about contacts. You might know some of the details that will be stored, but not others. So, you might allow the user to define custom-attributes for a contact.
Of course, this type of design means that one loses database-applied checks, and has to rely on application-applied checks. That's a con, but should not be a deal-breaker if the flexibility is really required.
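The query-complexity complaint is easy to demonstrate: to get one row per user back out of the attribute-value table, you need one conditional aggregate per attribute, and every new attribute means editing the query. A minimal sketch using Python's sqlite3 module (SQLite as a stand-in engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eav (id INTEGER, uid INTEGER, fieldname TEXT, fieldvalue TEXT)")
conn.executemany("INSERT INTO eav VALUES (?, ?, ?, ?)", [
    (4, 12, "gender", "male"),
    (5, 12, "age", "21-30"),
    (6, 12, "location", "5"),
    (7, 13, "gender", "female"),
    (8, 13, "age", "31-40"),
    (9, 13, "location", "5"),
])

# One MAX(CASE ...) per attribute just to rebuild a normal-looking row.
rows = conn.execute("""
    SELECT uid,
           MAX(CASE WHEN fieldname = 'gender'   THEN fieldvalue END) AS gender,
           MAX(CASE WHEN fieldname = 'age'      THEN fieldvalue END) AS age,
           MAX(CASE WHEN fieldname = 'location' THEN fieldvalue END) AS location
    FROM eav
    GROUP BY uid
    ORDER BY uid
""").fetchall()
print(rows)  # [(12, 'male', '21-30', '5'), (13, 'female', '31-40', '5')]
```

Notice that every value comes back as text: the location "5" would need an explicit cast before any numeric use, which is exactly the datatype objection raised in the question.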

Union Query -> Pick One When Duplicate Value in Jet

I'm using Access to fill in details in a database across 3 offline computers. This means they all have a copy of the database, do a day of info filling, then get manually uploaded to a central database. Horrid, but it's the only option.
I have a pre-filled database; key identifiers etc. were all determined previously. We are adding information to the blank fields for these entries. (Started with 3 key fields, added a few info fields.) The user selects an entry and edits it rather than creating one. I then use a script which takes each table and unions the three databases into a table for each. The users do not duplicate work (meaning you don't have Jack working on entry A as well as Jill working on entry A).
My question: How can I get my union query to select all entries, even the unfilled ones, but let the filled ones take precedence? (aka bypass the "duplicate entry" error by choosing the filled in entry instead of the two unfilled entries?)
ex:
     JOHN's DB       JACK's DB       JILL's DB            MASTER DB
A:   1, 1, __        1, 1, __        1, 1, "Yes"   --->   1, 1, "Yes"
B:   1, 2, "No"      1, 2, __        1, 2, __      --->   1, 2, "No"
C:   1, 3, __        1, 3, __        1, 3, "No"   --->   1, 3, "No"
Completely terrible way to do this (Unioning offline tables, that is) but we have little other choice due to many other uncontrollable factors.
How about
SELECT Id, Max(Field) AS Field
FROM (SELECT Id, Field FROM John
      UNION ALL ...) AS t
GROUP BY Id
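The trick works because Max() ignores Null values, so for each key the one filled-in copy wins over the blanks from the other databases. A runnable sketch of the same idea using Python's sqlite3 module (SQLite standing in for Jet, a single-column key for brevity, and John/Jack/Jill as the assumed table names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Three copies of the same pre-filled table; None stands for the blank fields.
for who, data in {
    "John": [(1, None), (2, "No"), (3, None)],
    "Jack": [(1, None), (2, None), (3, None)],
    "Jill": [(1, "Yes"), (2, None), (3, "No")],
}.items():
    conn.execute(f"CREATE TABLE {who} (Id INTEGER, Field TEXT)")
    conn.executemany(f"INSERT INTO {who} VALUES (?, ?)", data)

# MAX skips NULLs, so the filled-in value survives the merge.
rows = conn.execute("""
    SELECT Id, MAX(Field) AS Field
    FROM (SELECT Id, Field FROM John
          UNION ALL SELECT Id, Field FROM Jack
          UNION ALL SELECT Id, Field FROM Jill)
    GROUP BY Id
    ORDER BY Id
""").fetchall()
print(rows)  # [(1, 'Yes'), (2, 'No'), (3, 'No')]
```

One caveat: if two users ever did fill in the same entry with different text, MAX would silently keep the alphabetically later value, so the "no duplicated work" rule from the question is what makes this safe.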