Is there some kind of way to import data that consists of multiple rows? - rapidminer

In RapidMiner the data table I usually see is like this:
Row Age Class
1 19 Adult
2 10 Minor
3 15 Teenager
In the table above, each row is one complete record.
But how do I import a table into RapidMiner where several rows together make up one complete record?
For example:
Row Word Rho Theta Phi
1 Hello 0.9384 0.4943 1.2750
2 Hello 1.2819 0.8238 1.3465
3 Hello 1.3963 0.1758 1.4320
4 Eat 1.3918 0.3883 1.1756
5 Eat 1.4742 0.0526 1.2312
6 Eat 0.6698 0.2548 1.4769
7 Eat 0.3074 1.2214 0.2059
In the table above, rows 1-3 form one complete record: the combination of rho, theta, and phi values from rows 1-3 means the word "hello". Likewise, rows 4-7 form one complete record that means the word "eat". The table below illustrates the grouping I have in mind.
Row Rho Theta Phi Word
----------------------------
1 |0.9384 0.4943 1.2750|
2 |1.2819 0.8238 1.3465| HELLO
3 |1.3963 0.1758 1.4320|
----------------------------
4 |1.3918 0.3883 1.1756|
5 |1.4742 0.0526 1.2312|
6 |0.6698 0.2548 1.4769| EAT
7 |0.3074 1.2214 0.2059|
----------------------------
Again, my problem is: how do I load this kind of table into RapidMiner so that it understands that multiple rows make up one complete record? Is there a table format like the one below?
Row Word Rho Theta Phi
1 Hello 0.9384 0.4943 1.2750
. Hello 1.2819 0.8238 1.3465
1 Hello 1.3963 0.1758 1.4320
2 Eat 1.4742 0.0526 1.2312
. Eat 0.6698 0.2548 1.4769
. Eat 0.3074 1.2214 0.2059
2 Eat 0.3074 1.2214 0.2059

You can try the Pivot operator to group your results by word.
To do so, I would set the group attribute parameter to "Word" and the index parameter to "Row". It's not exactly the same representation, but it may be close enough depending on your use case, since tables in which one record spans multiple rows are not part of RapidMiner's design.
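To picture what such a pivot does, here is a minimal Python sketch outside RapidMiner. It is not RapidMiner's operator or API: the function name pivot_by_word and the Rho_1-style column names are made up for illustration, and the samples are numbered within each word instead of using the global Row column.

```python
from collections import defaultdict

# Long-format rows copied from the question: (row, word, rho, theta, phi).
rows = [
    (1, "Hello", 0.9384, 0.4943, 1.2750),
    (2, "Hello", 1.2819, 0.8238, 1.3465),
    (3, "Hello", 1.3963, 0.1758, 1.4320),
    (4, "Eat", 1.3918, 0.3883, 1.1756),
    (5, "Eat", 1.4742, 0.0526, 1.2312),
    (6, "Eat", 0.6698, 0.2548, 1.4769),
    (7, "Eat", 0.3074, 1.2214, 0.2059),
]


def pivot_by_word(rows):
    """Collapse all rows that share a word into one wide record (hypothetical helper)."""
    grouped = defaultdict(list)
    for _row, word, rho, theta, phi in rows:
        grouped[word].append((rho, theta, phi))

    wide = {}
    for word, samples in grouped.items():
        record = {}
        # Number the samples within each group, starting at 1.
        for i, (rho, theta, phi) in enumerate(samples, start=1):
            record[f"Rho_{i}"] = rho
            record[f"Theta_{i}"] = theta
            record[f"Phi_{i}"] = phi
        wide[word] = record
    return wide


wide = pivot_by_word(rows)
for word, record in wide.items():
    print(word, record)
```

Groups of different length end up with different numbers of columns ("Eat" has a fourth sample, "Hello" does not), so a real wide table would contain missing values. The sketch also assumes each word appears as a single group; two separate recordings of the same word would need an extra group id.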

Related

Multinomial Logistic Regression Predictors Set Up

I would like to use a multinomial logistic regression to get win probabilities for each of the 5 horses that participate in any given race, using each horse's previous average speed.
RACE_ID H1_SPEED H2_SPEED H3_SPEED H4_SPEED H5_SPEED WINNING_HORSE
1 40.482081 44.199627 42.034929 39.004813 43.830139 5
2 39.482081 42.199627 41.034929 41.004813 40.830139 4
I am stuck on how to handle the independent variables for each horse, given that the average speed of any of the 5 horses can be placed in any of H1_SPEED through H5_SPEED.
For each race I can put any of the 5 horses under H1_SPEED, so there is no real relationship between H1_SPEED in RACE_ID 1 and H1_SPEED in RACE_ID 2 other than the arbitrary position I selected.
Would there be any difference if the dataset looked like the one below, with these changes?
For RACE_ID 1 I swapped H3_SPEED and H5_SPEED and changed WINNING_HORSE from 5 to 3
For RACE_ID 2 I swapped H4_SPEED and H1_SPEED and changed WINNING_HORSE from 4 to 1
RACE_ID H1_SPEED H2_SPEED H3_SPEED H4_SPEED H5_SPEED WINNING_HORSE
1 40.482081 44.199627 43.830139 39.004813 42.034929 3
2 41.004813 42.199627 41.034929 39.482081 40.830139 1
Is this an issue? If so, how should it be handled? What if I wanted to add more independent features per horse?
You cannot rearrange your dataset that way, because each feature (column) has a meaning, and it probably depends on the values of the other features. Think of each race as a point in a six-dimensional space: if you change the value of a feature, the position of the point changes; it does not remain stationary.
If you consider a feature useless for solving your problem (i.e. it is independent of the target), you can drop it or leave it out when training your model.
Edit
To solve your specific problem, you may add a column next to each speed column that identifies the specific horse running at that speed. It is a sort of data augmentation that adds more problem-related features to your model.
RACE_ID H1_SPEED H1_HORSE H2_SPEED H2_HORSE ... WINNING_HORSE
1 40.482081 1 44.199627 2 ... 5
2 39.482081 3 42.199627 5 ... 4
I've invented the numbers associated with each horse, but it seems that this information is present in your dataset.

Extract string from a column in mysql and replace in place

I have a table in MySQL with two columns, 'ID' and 'Branch'. I want to fix the 'Branch' column.
Structure:
ID Branch
1 ?¤All-bank
2 (1) bank
3 ÂJohn & William
4 % Bank
5 ----Road one
6 ?bank
Output expected:
ID Branch
1 All-bank
2 bank
3 John & William
4 Bank
5 Road one
6 bank
So basically the branch can contain English characters and some special characters like ".,/&-".
I was thinking along these lines:
SELECT * FROM table_name WHERE branch like '%[^a-zA-Z0-9]%'
Can anyone help me with this? MySQL newbie here.
Thanks a lot in advance.
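One note on the attempt above: MySQL's LIKE only understands the % and _ wildcards, so a bracket class such as [^a-zA-Z0-9] is matched literally and not as a character set. Going by the expected output, the rule seems to be "drop everything before the first letter". A minimal Python sketch of that rule follows; the names LEADING_JUNK and clean_branch are made up here, and the rule itself is an assumption inferred from the six sample rows.

```python
import re

# Assumption: everything before the first ASCII letter is junk.
# A branch name that legitimately starts with a digit would lose that digit.
LEADING_JUNK = re.compile(r"^[^A-Za-z]+")


def clean_branch(branch):
    """Strip leading characters up to the first ASCII letter (hypothetical helper)."""
    return LEADING_JUNK.sub("", branch)


# Sample values copied from the question.
branches = [
    "?¤All-bank",
    "(1) bank",
    "ÂJohn & William",
    "% Bank",
    "----Road one",
    "?bank",
]

cleaned = [clean_branch(b) for b in branches]
for before, after in zip(branches, cleaned):
    print(repr(before), "->", repr(after))
```

Characters inside the name, such as the "&" in "John & William" or the "-" in "All-bank", are left alone because only the start of the string is touched. MySQL 8.0 and later provide REGEXP_REPLACE, which may allow the same pattern to be applied in an UPDATE; on older versions the cleanup would likely have to happen in application code like this.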

Finding Recurring Number Combinations in Column of Numbers

I have searched and found discussions and solutions to similar problems, but none quite the same as, or as complex as, what I'm trying to figure out.
I have an Access table which consists of two columns, Draw Number and Number Drawn, as shown below. Each Draw Number is repeated 20 times, once for each of the 20 numbers drawn in that particular draw.
I'm trying to figure a way to determine the most frequent occurring combination of numbers (5 numbers) for all of the draws in each of the 20 number sets. So for instance, 12341 occurs n x, 12342 occurs nx, 12343 occurs n x, etc.
I've created parameter queries which let me search for different number combinations, from 2 to 10 numbers, and they work OK, returning the number of occurrences of a combination that I enter through a simple UI. But the goal is to figure out programmatically what the optimum combination of numbers is.
Hope this makes sense. By the way, there are 36 million or so rows in the table. The parameter queries work quite well; however, it takes just over a second to return results for each number added. So querying two numbers means a 2-second wait, three numbers a 3-second wait, etc.
I've been thinking about a loop of some type but don't know how to get started. Processing time isn't an issue; it can take a day if required!
This is written in VBA and uses an assortment of queries, temp tables, etc. to get the job done.
The text says Access, but the tags say MySql, which is it? – RBarryYoung 21 hours ago
This part confuses me: I'm trying to figure a way to determine the most frequent occurring combination of numbers (5 numbers) for all of the draws in each of the 20 number sets. So for instance, 12341 occurs n x, 12342 occurs nx, 12343 occurs n x, etc. – Newd 21 hours ago
^What do you mean five numbers? No where in your sample data do I see 12341. Please explain using the data you have, and give expected results using that data. – McAdam331 21 hours ago
drosberg - clarification:
Thanks for the response. It is an Access application, but as a first-time poster I went with the tags Stack Overflow recommended.
By five numbers I mean the most frequently occurring group of five numbers (I used five as an example, could be groups of 2 to 10 numbers) which occur in each draw, where a draw consists of 20 drawn numbers from a total of 80 numbers. So the data that I posted was intended as an example. The sample provided only has 50, 51 in common. I can plug 50 and 51 into the parameter query and it will tell me that this combination occurs 60,000 times (or whatever), but perhaps 50 and 57 occurs 65,000 times.
If I were to do this manually, and assuming I'm looking for the most frequent 5-number combination, I would enter the following in the parameter query:
1,2,3,4,1 group = 30,000 occurrences
1,2,3,4,2 group = 31,000 occurrences
1,2,3,4,3 group = 31,050 occurrences
1,2,3,4,4 group = 29,050 occurrences
etc.
But I would have to do this for every combination of 5 numbers that can be derived from the numbers 1 through 80. I'm hoping to have a program do the work!
thanks
don
DRAW NUMBER NUMBER DRAWN
1 1
1 28
1 19
1 3
1 38
1 46
1 43
1 29
1 13
1 22
1 20
1 11
1 50
1 51
1 53
1 54
1 57
1 64
1 76
1 78
2 29
2 14
2 2
2 1
2 35
2 40
2 39
2 30
2 10
2 27
2 21
2 6
2 42
2 50
2 51
2 53
2 54
2 61
2 65
2 69
I wrote a post a while ago about generating permutations with and without repetition using Excel. Perhaps you can use it.
https://michiel.wordpress.com/2015/03/29/permutations-with-repetition-using-excel/
Here's how it works. I am using strings, but you can easily modify that for numbers (since you say you need 5).
You can use the MID function to grab a single char from a string, and generate permutations from it.
=MID(Pattern,MOD([N]/[P],Length)+1,1)
N refers to the column N
P refers to the horizontal row (1,4,16). You can generate these with a formula like =4^.
After putting in the code, you can make a list of all permutations in Excel and, in the cell next to each one, generate a SQL query that you can also run from VBA.
Example: Looking up Access database in Excel
Or find a commercial tool like http://thingiequery.com/
I don't know if there are any open source tools for it.
I'm thinking that you should consider the following.
Say there are 100 balls.
Set up a table with one row for each "Draw number" and 100 boolean columns, one for every possible number.
To see which draws had number 23, you just add
WHERE Column23 = true
For numbers 23 and 56:
WHERE Column23 = true AND Column56 = true
This should massively simplify and speed up your SQL.
Then set up a table with every possible combination of numbers, and run SQL to find the counts.
Harvey
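If exporting the table out of Access is an option, the "count every combination" idea can also be sketched directly in Python. This is only an illustration under that assumption, not something taken from the posts above: the function name most_common_combinations is made up, and the two draws are the sample rows from the question.

```python
from collections import Counter
from itertools import combinations


def most_common_combinations(draws, size, top=1):
    """Count every `size`-number combination that appears in any draw (hypothetical helper)."""
    counts = Counter()
    for numbers in draws.values():
        # Sort so the same set of numbers always yields the same tuple.
        for combo in combinations(sorted(set(numbers)), size):
            counts[combo] += 1
    return counts.most_common(top)


# Draw Number -> the 20 numbers drawn, copied from the sample in the question.
draws = {
    1: [1, 28, 19, 3, 38, 46, 43, 29, 13, 22, 20, 11, 50, 51, 53, 54, 57, 64, 76, 78],
    2: [29, 14, 2, 1, 35, 40, 39, 30, 10, 27, 21, 6, 42, 50, 51, 53, 54, 61, 65, 69],
}

for combo, count in most_common_combinations(draws, size=5, top=3):
    print(combo, "occurs", count, "times")
```

Each draw of 20 numbers contains 15,504 five-number combinations, and there are 24,040,016 possible five-number combinations of 1 through 80, so on 36 million rows (about 1.8 million draws) this means roughly 28 billion counter updates and a very large counter. That is slow and memory-hungry in pure Python, but it needs no manual input and may fit the "can take a day" budget.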

COUNTIF for rows which contain a given value in another column

My table lists every character from all 5 of George R. R. Martin's currently published A Song of Ice and Fire novels. Each row contains a record indicating which book in the series the character is from (numbered 1-5) and a single letter indicating the character's gender (M/F). For example:
A B C
1 Character Book Gender
------------------------------
2 Arya Stark - 1 - F
3 Eddard Stark - 1 - M
4 Davos Seaworth - 2 - M
5 Lynesse Hightower - 2 - F
6 Xaro Xhoan Daxos - 2 - M
7 Elinor Tyrell - 3 - F
I can use COUNTIF to find out that there are three females and three males in this table, but I want to know, for example, how many males there are in book 2. How could I write a formula that would make this count? Here is pseudocode for what I'm trying to achieve:
=COUNTIF(C2:C7, Column B = '2' AND Column C = 'M')
This would output 2.
I'm aware that this task is far better suited to databases and a SELECT query, but I'd like to know how to solve this problem within the constraints of a LibreOffice Calc spreadsheet, without using a macro. Excel-based solutions are fine, so long as they also work in Calc. If there's no solution that uses COUNTIF, it doesn't matter, so long as it works.
I worked it out, thanks to a prompt by assylias. The COUNTIFS formula produces the result I want by counting multiple search criteria. For example, this formula works out how many male characters are in Book 1 (A Game of Thrones).
=COUNTIFS($A$2:$A$2102, "=1", $L$2:$L$2102, "=M")
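For readers who think in code, COUNTIFS counts the rows where every criterion holds at once. A small Python sketch of the same logic on the six sample rows from the question (the function name countifs is made up, and this is an analogy, not how the spreadsheet evaluates the formula):

```python
# Sample rows copied from the question: (character, book, gender).
rows = [
    ("Arya Stark", 1, "F"),
    ("Eddard Stark", 1, "M"),
    ("Davos Seaworth", 2, "M"),
    ("Lynesse Hightower", 2, "F"),
    ("Xaro Xhoan Daxos", 2, "M"),
    ("Elinor Tyrell", 3, "F"),
]


def countifs(rows, book, gender):
    """Count rows that match BOTH criteria, like COUNTIFS with two range/criterion pairs."""
    return sum(1 for _name, b, g in rows if b == book and g == gender)


males_in_book_2 = countifs(rows, book=2, gender="M")
print(males_in_book_2)  # 2, matching the count the question asks for
```

The formula above refers to the columns of the full sheet (A and L); for the small sample layout in the question, the book and gender ranges would presumably be B2:B7 and C2:C7.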

When is it better to flatten out data using comma separated values to improve search query performance?

My question is about search query performance.
I've flattened out data into a read-only Person table (MySQL) that exists purely for search. The table has about 20 columns of data (mostly limited text values, dates and booleans and a few columns containing unlimited text).
Person
=============================================================
id First Last DOB etc (20+ columns)...
1 John Doe 05/02/1969
2 Sara Jones 04/02/1982
3 Dave Moore 10/11/1984
Another two tables support the relationship between Person and Activity.
Activity
===================================
id activity
1 hiking
2 skiing
3 snowboarding
4 bird watching
5 etc...
PersonActivity
===================================
id PersonId ActivityId
1 2 1
2 2 3
3 2 10
4 2 16
5 2 34
6 2 37
7 2 38
8 etc…
Search considerations:
Person table has potentially 200-300k+ rows
Each person potentially has 50+ activities
Search may include Activity filter (e.g., select persons with one and/or more activities)
Returned results are displayed with person details and activities as bulleted list
If the Person table is used only for search, I'm wondering if I should add the activities as comma separated values to the Person table instead of joining to the Activity and PersonActivity tables:
Person
===========================================================================
id First Last DOB Activity
2 Sara Jones 04/02/1982 hiking, snowboarding, golf, etc.
Given the search considerations above, would this help or hurt search performance?
Thanks for the input.
Horrible idea. You will lose the ability to use indexes in querying. Do not under any circumstances store data in a comma-delimited list if you ever want to search on that column. Relational databases are designed to perform well with tables joined together. Your database is relatively small and should have no performance issues at all if you index properly.
You may still want to display the results in a comma-delimited fashion. I think MySQL has a function called GROUP_CONCAT for that.
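As a rough illustration of keeping the tables normalized and building the comma-separated list only at display time, here is a sketch using Python's built-in sqlite3 module as a stand-in for MySQL. The index names are made up, only the sample rows whose activity names appear in the question are loaded, and SQLite writes the separator as a second argument where MySQL uses GROUP_CONCAT(activity SEPARATOR ', ').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (id INTEGER PRIMARY KEY, First TEXT, Last TEXT);
    CREATE TABLE Activity (id INTEGER PRIMARY KEY, activity TEXT);
    CREATE TABLE PersonActivity (
        id INTEGER PRIMARY KEY,
        PersonId INTEGER REFERENCES Person(id),
        ActivityId INTEGER REFERENCES Activity(id)
    );
    -- Hypothetical indexes so the activity filter and the join can avoid full scans.
    CREATE INDEX idx_pa_activity ON PersonActivity (ActivityId, PersonId);
    CREATE INDEX idx_pa_person ON PersonActivity (PersonId, ActivityId);

    INSERT INTO Person VALUES (1, 'John', 'Doe'), (2, 'Sara', 'Jones'), (3, 'Dave', 'Moore');
    INSERT INTO Activity VALUES
        (1, 'hiking'), (2, 'skiing'), (3, 'snowboarding'), (4, 'bird watching');
    INSERT INTO PersonActivity (PersonId, ActivityId) VALUES (2, 1), (2, 3);
""")

# Find people who do a given activity, then list ALL of their activities.
query = """
    SELECT p.id, p.First, p.Last, GROUP_CONCAT(a.activity, ', ') AS activities
    FROM Person p
    JOIN PersonActivity pa ON pa.PersonId = p.id
    JOIN Activity a ON a.id = pa.ActivityId
    WHERE p.id IN (
        SELECT pa2.PersonId
        FROM PersonActivity pa2
        JOIN Activity a2 ON a2.id = pa2.ActivityId
        WHERE a2.activity = ?
    )
    GROUP BY p.id, p.First, p.Last
"""
results = conn.execute(query, ("hiking",)).fetchall()
for row in results:
    print(row)
```

The filter runs against the indexed link table, and the concatenated string exists only in the result set, so nothing ever has to search inside a comma-separated column.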