Association Rules for Text File - rapidminer

I am a student using RapidMiner, and I am doing a project using Yummly's What's Cooking dataset (https://www.kaggle.com/c/whats-cooking/data). The dataset has 20 different cuisine types (e.g. Italian, Chinese, Indian, etc.).
Our goal is to develop a data mining model that identifies the cuisine type of future dishes by analyzing the ingredient list of the dish. We are using association rules to do so. However, I keep getting "no rules found" and have no idea why. I am thinking this has something to do with my attributes being formatted as text and not using the nominal to binominal operator, but am not sure how to fix it.
Currently my process looks like....
data -> select attributes -> FP growth -> create association rules
Can you help?

According to the documentation for the FP-Growth operator, all the attributes in the example set need to be binominal.
I'll admit I haven't looked at the data directly because I didn't want to register an account on Kaggle, so I'm not sure exactly how it's formatted. You would probably want to set the type of cuisine as the label and then have each of the remaining attributes represent one ingredient that is included in one or more of the recipes. Each dish would have a 1 in the column if the ingredient is used and a 0 if it's not. (Depending on the original format of the data, since you mentioned it's text, you may want to check out the text processing extension, which can create an example set like the one I just described.) Then, if you convert the 0s and 1s to binominal, you should be able to use FP-Growth.
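The table layout described above can be illustrated outside RapidMiner. This is only a minimal Python sketch, assuming each recipe is a record with a `cuisine` string and an `ingredients` list; the field names and the three sample records are made up for illustration and may not match the actual Kaggle file.

```python
# Hypothetical sample records; the real file's field names may differ.
recipes = [
    {"cuisine": "italian", "ingredients": ["tomato", "basil", "olive oil"]},
    {"cuisine": "indian", "ingredients": ["rice", "cumin", "tomato"]},
    {"cuisine": "italian", "ingredients": ["olive oil", "garlic"]},
]


def to_binary_table(records):
    """Return (columns, rows): one true/false column per ingredient, plus the label."""
    # Every ingredient that appears in at least one recipe becomes a column.
    columns = sorted({ing for r in records for ing in r["ingredients"]})
    rows = []
    for r in records:
        present = set(r["ingredients"])
        row = {"cuisine": r["cuisine"]}  # the label attribute
        for col in columns:
            row[col] = col in present    # True if the dish uses the ingredient
        rows.append(row)
    return columns, rows


columns, rows = to_binary_table(recipes)
```

Each row then has exactly one true/false value per ingredient, which is the two-valued shape that FP-Growth expects once the columns are typed as binominal.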

How do I fill a list with all the world's phone prefixes in Dart on Flutter?

I'd like to implement an app with Dart on Flutter. I'm on my first approach with this new language and for the first time I meet this problem.
My app needs to work with a mobile phone number. I would like to block phone numbers entered without a prefix and also numbers with more digits than expected. For example, in Italy there are at most 10 digits after +39 (0039). I thought I would split the input into two parts to make the lengths easier to check: one field where you select the country and another where you enter the number.
Do you know of a JSON file that contains, for each country: the prefix, the length of the telephone number (excluding the prefix), the name, the flag and the abbreviation (for example: Italy, green-white-red, IT)?
Searching the web a bit, I saw that Flutter seems to provide something for this already, .demoTextFieldEnterITPhoneNumber through GalleryLocalizations, but I didn't quite understand whether it checks a particular regular expression for each country or not. Could I copy and paste a number, for example? Would the country be recognized automatically?
In the end I suspect that such deep validation is not possible, so I would settle for two fields: a list of countries that automatically fills in the selected prefix, and a field where the user types the number. If a number is copied and pasted, I would check whether the string also contains a +prefix.
Thank you very much, I need a lot, since my app will mainly revolve around a correct value for this field. :)
Try using the international_phone_input or country_code_picker Flutter package. They are quite easy to implement.
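If you end up doing the pasted-number check yourself, the logic from the question (look for a +prefix, then limit the remaining digits) might look roughly like this. It is a language-neutral sketch written in Python rather than Dart; the `PREFIXES` table and the function name are hypothetical, and only the Italian entry (+39, at most 10 digits) is taken from the question.

```python
import re

# Hypothetical lookup table. Only the Italian entry comes from the question;
# a real app would load a complete, verified list of countries.
PREFIXES = {
    "IT": {"name": "Italy", "prefix": "39", "max_digits": 10},
}


def split_pasted_number(text, table=PREFIXES):
    """Split a pasted number into (country_code, national_digits).

    Returns None if no known prefix matches or the national part is too long.
    """
    digits = re.sub(r"[^\d+]", "", text)  # drop spaces, dashes, brackets
    if digits.startswith("00"):
        digits = "+" + digits[2:]         # treat 0039... like +39...
    if not digits.startswith("+"):
        return None                       # no international prefix present
    for code, info in table.items():
        if digits.startswith("+" + info["prefix"]):
            national = digits[1 + len(info["prefix"]):]
            if national.isdigit() and 0 < len(national) <= info["max_digits"]:
                return code, national
            return None
    return None
```

A pasted "+39 333 123 4567" would be split into the country selection and the national number, while a number without a prefix is rejected so the user has to pick the country from the list.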

When creating a file format, is it better to create multiple formats or have several optional sections?

Let's say I am the technical lead for a software company specializing in writing applications for chefs that help organize recipes. Starting out, we are developing one app for bakers who make cakes and another for sushi chefs. One of the requirements is to create a standard file format for importing and exporting recipes. (This file format will go on to be an industry standard, with other companies using it to interface with our products.) We are faced with two options: make a standard recipe format (let's say .recipe) which uses common properties where applicable and optional properties where they differ, or make independent formats for each application (let us say .sushi and .cake).
Imagine the file format would look something like this for sushi:
{
  "name": "Big California",
  "type": "sushi",
  "spiciness": 0,
  "ingredients": [
    {
      "name": "rice",
      "amount": 20.0,
      "units": "ounces"
    },
    {
      ...
    }
  ]
}
and imagine the file format would look something like this for cakes:
{
  "name": "Wedding Cake",
  "type": "cake",
  "layers": 3,
  "ingredients": [
    {
      "name": "flour",
      "amount": 40.0,
      "units": "ounces"
    },
    {
      ...
    }
  ]
}
Notice the file formats are very similar, with only the spiciness and layers properties differing between them. Undoubtedly, as the applications grow in complexity and sophistication, many more specialized properties will be added. There will also be more applications added to the suite for other types of chefs. With this context,
Is it wiser to have each application read/write .recipe files that adhere to a somewhat standardized interface, or is it wiser to remove all interdependence and have each application read/write their respective .sushi and .cake file types?
This kind of thing can be very, very hard to get right. I think a lot of it depends on what you want to be able to do with recipes beyond simply displaying them for use by a chef.
For example, does one want to normalise data across the whole system? That is, when a recipe specifies "flour", what do you want to say about that flour, and how standardised do you want that to be? Imagine a chef is preparing an entire menu, and wants to know how much "plain white high gluten zero additives flour" is used by all the recipes in that menu. They might want to know this so they know how much to buy. There's actually quite a lot you can say about just flour that means simply having "flour" as a data item in a recipe may not go far enough.
The "modern" way of going about these things is to simply have plain text fields, and rely on some kind of flexible search to make equivalency associations between fields like "flour, white, plain, strong" and "high gluten white flour". That's what Google does...
The "proper" way to do it is to come up with a rigid schema that allows "flour" to be fully specified. It's going to be hard to come up with a file / database format schema that can exhaustively and unambiguously describe every single possible aspect of "flour". And if you take it too far, then you have the problem of making associations between two different records of "flour" that, for all reasonable purposes are identical, but differ in some minor aspect. Suppose you had a field for particle size; the search would have to be clever enough to realise that there's no real difference in flours that differ by, for example, 0.5 micrometer in average particle size.
We've discussed how far the definition of a "flour" could extend. One also has to consider the method by which the ingredient is prepared. That adds a whole new dimension of difficulty. And then one would have to describe all the conceivable kitchen utensils too. One can see the attractions of free text...
With that in mind, I would aim to have a single file format (.recipe), but not to break down the data too much. I would forget about trying to categorise each and every ingredient down to the last possible level of detail. Instead, for each ingredient I'd have a free text description, then perhaps a well structured quantity field (e.g. a number and a unit, 1 and cup), and finally a piece of free text describing the ingredient preparation (e.g. sieved). Then I'd have something that describes a preparation step, referencing the ingredients; that would have some free text fields and structured fields ("sieve the", a reference to the ingredient, "into a bowl"). The file would contain a list of these. You might also have a list of required utensils, and a general description field too. You'll want to add structured fields for recipe metadata (e.g. 'cake', or 'sushi', or serves 2).
Or something like that. Having some structure allows some additional functionality to be implemented (e.g. tidy layout of the recipe on a screen). Just having a single free-text field for the whole thing means that it'd be difficult to add, say, an ingredient ordering feature - who is to say what lines in the text represent ingredients?
Having separate file formats would involve coming up with a large number of schemas. It would be even more unmanageable.
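To make the single-format idea a bit more concrete, here is a rough Python sketch of how each application might read the same .recipe file. The set of required fields, the `load_recipe` helper and the `description`/`preparation` ingredient keys are assumptions made for illustration; nothing here is an agreed schema.

```python
import json

# Assumed set of fields shared by every recipe type (hypothetical).
REQUIRED = {"name", "type", "ingredients"}


def load_recipe(text):
    """Parse one .recipe document and check that the common fields exist."""
    recipe = json.loads(text)
    missing = REQUIRED - set(recipe)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    return recipe


cake = load_recipe("""
{
  "name": "Wedding Cake",
  "type": "cake",
  "layers": 3,
  "ingredients": [
    {"description": "flour", "amount": 40.0, "units": "ounces",
     "preparation": "sieved"}
  ]
}
""")

# Each application reads only the optional fields it understands;
# fields that are absent simply come back as None.
layers = cake.get("layers")        # used by the cake app
spiciness = cake.get("spiciness")  # used by the sushi app, absent here
```

The common fields are validated once, and type-specific properties stay optional, so adding a new kind of chef means adding optional fields rather than a whole new format.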

Can an OCR run in a split-second if it is highly targeted? (Small dictionary)

I am looking for an open source OCR engine (maybe Tesseract) that uses a dictionary to match words against. For example, I know that this OCR will only be used to search for certain names. Imagine I have a master guest list (written) and I want to scan this list in under a second with the OCR and check it against a database of names.
I understand that a traditional OCR can attempt to read every letter, and then I could just cross-reference the results with the 100 names, but this takes too long. If the OCR were focusing on just those 100 words and nothing else, it should be able to do all this in a split second. That is, there is no point in guessing that a word might be "Jach", since "Jach" isn't a name in my database. The OCR should be able to infer that it is "Jack", since that is an actual name in the database.
Is this possible?
It should be possible. Think of it this way: instead of having your OCR look for 'J' it could be looking for 'Jack' directly, sort of: as an individual symbol.
So when you train / calibrate your OCR, train it with images of whole words, just as you would for an individual symbol.
(If this feature is not directly available in your OCR, first map images of whole words to a unique symbol and later transform that symbol into the final word string.)
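A different and simpler technique than training on whole words is to let the OCR run normally and then snap each recognised word to the closest name in the database. This does not make the OCR itself faster; it only handles the "Jach" versus "Jack" case from the question. A minimal sketch using Python's standard `difflib` module, with a made-up guest list and a hypothetical helper name:

```python
import difflib

# Hypothetical guest list; in practice this would come from the database.
GUESTS = ["Jack", "Maria", "Oliver", "Priya"]


def snap_to_guest(ocr_word, guests=GUESTS, cutoff=0.6):
    """Return the guest name closest to an OCR result, or None if nothing is close."""
    matches = difflib.get_close_matches(ocr_word, guests, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(snap_to_guest("Jach"))
```

The `cutoff` value controls how different a word may be before it is rejected; 0.6 is the library default and would need tuning against real scans.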

authorize.net ambiguity in country names

Hi, I am working on a site and integrating the authorize.net payment gateway. I am thinking of adding a dropdown for country names. Will passing "United States Of America" as the country variable work, or should I use "US"? Should I use ISO codes for every country? I tried it on a test developer account, but it seems to accept everything I pass to it as correct!
~Ajit
I know authorize.net doesn't require country names. A simple way to see if they even validate them would be to run a transaction through the production gateway, pass a nonsense value and see if the transaction still goes through.
If you do standardize to support authorize.net (or for another reason), I'd suggest country codes rather than full names. Codes seem to change less often, and they can also be useful as identifiers. For example, I have an application which presents data for roughly 200 countries; I have flag icons (multiple sizes for each country) that use a two-letter country code in their name. Using codes made this fairly easy to implement and maintain.
According to their AIM Guide:
x_country: Optional
Value: The country of the customer’s billing address
Format: Up to 60 characters (no symbols)

When could CSV records *not* have the same number of fields?

I am storing a series of events to a CSV file, each event type comes with a different set of data.
To illustrate, say I have two events (there will be many more):
Running, which has a data set containing speed and incline.
Sleeping, which has a data set containing snores.
There are two options to store this data in CSV records:
Option A
Storing each possible item of data in its own field...
speed, incline, snores
therefore...
15mph, 20%,
, , 12
16mph, 20%,
14mph, 20%,
Option B
Storing each event in its own record...
event, value1...
therefore...
running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
Without a specific CSV specification, the consensus seems to be:
Each record "should" contain the same number of comma-separated fields.
Context
There are a number of events which each have a large & different set of data values.
The CSV data is meant to be used by other developers (I will/could/should/won't use either structure).
The 'other developers' are expected to be toward the novice end of the spectrum and/or using resource-limited systems. CSV is accessible.
The CSV format is being provided non-exclusively, as a feature rather than a requirement. However, if the application provides a CSV file, it should be provided in the correct manner from now on.
Question
Would it be valid – in this case - to go with Option B?
Thoughts
Option B maintains a level of human readability, which is an advantage if the CSV is read by a human rather than a program. Neither method is more complex to parse with a custom parser, but will Option B void the usefulness of the CSV format with other libraries, frameworks, applications et al.? With Option A, future changes/versions to the data set of an individual event may break the CSV structure (zombie , , to maintain forwards compatibility), whereas Option B will fail gracefully.
edit
This may be aimed at students and frameworks like openFrameworks, Plask, Processing et al., where CSV is easier to implement.
Any "other frameworks, libraries and applications" I've ever used all handle CSV parsing differently, so trying to conform to one or many of these standards might over-complicate your end result. My recommendation would be to keep it simple and use what works for your specific task. If human readability is a requirement, then CSV in the form of Option B would work fine. Otherwise, you may want to consider JSON or XML.
As you say, there is no "CSV Standard" with regard to contents. The real answer depends on what you are doing and why. You mention "other frameworks, libraries and applications". The one thing I've learnt is "Don't over-engineer", i.e. don't write reams of code today on the assumption that you will plug it into some other framework tomorrow.
I'd say option B is fine, unless you have specific requirements to use other apps etc.
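For what it's worth, at least one common library copes with Option B without complaint: Python's standard `csv` module returns each record as a list of whatever length it happens to have. The sketch below reads the Option B sample from the question; the `FIELDS` mapping from event name to field names is an assumption added for illustration.

```python
import csv
import io

# Option B sample from the question; each event type has its own field layout.
DATA = """running, 15mph, 20%
sleeping, 12
running, 16mph, 20%
running, 14mph, 20%
"""

# Hypothetical mapping from event name to the meaning of its value fields.
FIELDS = {"running": ["speed", "incline"], "sleeping": ["snores"]}


def read_events(stream):
    """Read Option B records, where the first field names the event type."""
    events = []
    # skipinitialspace drops the blank that follows each comma in the sample.
    for row in csv.reader(stream, skipinitialspace=True):
        if not row:
            continue  # ignore empty lines
        event, values = row[0], row[1:]
        names = FIELDS.get(event)
        if names is None or len(values) != len(names):
            raise ValueError(f"unexpected record: {row}")
        events.append({"event": event, **dict(zip(names, values))})
    return events


events = read_events(io.StringIO(DATA))
```

An unknown event type or a record with the wrong number of values raises an error for that record only, which is one way of getting the "fail gracefully" behaviour mentioned in the question.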
< edit >
Having re-read your context, I'd probably pick one output format and use it, and forget about having multiple formats:
Having multiple output formats is a source of inconsistency (e.g. bug in one format but not another).
Having multiple formats means more code that needs to be tested, documented and supported.
< /edit >
Is there any reason you can't use XML? Yes, it's slightly more difficult to parse, at least for novices, but if so they probably need the practice. File size would be much greater, of course, but it's compressible.