Blocking in cross validation in mlr with subject id - mlr

I have a dataset with multiple observations by participant. Participants are denoted by id. To account for this in the cross validation process, I add blocking = factor(id) to makeClassifTask() and blocking.cv = TRUE to makeResampleDesc(). However, if I leave id in the dataset, it will be used as a predictor. My question is: How do I correctly use blocking? My take would be to create a new variable, e.g. participant.id (outside of the dataset), next to remove id from the original dataset and then to use blocking = factor(participant.id), but I am not sure if this is the correct way to handle blocking.

Rather than supplying a variable for blocking you can provide a custom factor vector that specifies the observations which belong together. This is also shown in the tutorial.
This way you do not need to have the variable "participant.id" in the dataset.
Also make sure that you really want to use "blocking". Did you have a look at "grouping" already? The differences between both are also described in the linked tutorial section.

Related

SSIS Custom Component: Having any IsSorted Property and output metadata

I have a Custom Synchronous Component that works fine and I use it.
Recently, I sent some Sorted data from a sort component to it (or an IsSorted=true Source Component)
but then i couldn't use the output as the input of a merge join due to not having a IsSorted=true property.
So I have to sort data again and it reduces the package performance too much.
Also I can't have any metadata same as Input, for my output(s) during design time.
I guess when my component is synchronous so it might be sorted as its input
if not, how to make the component output data sorted!
I really wanna know if there is any clever solution to solve this detailed issues about Custom Pipeline Components.
As your component is synchronous, somewhere in your code you are synchronizing the output, IDTSOutput1xx, with the input, IDTSInput1xx, with code like this:
output.SynchronousInputID = input.ID;
In a synchronous component the PipelineBuffer (the one exposed to the output, exposed in ProcessInput) is based upon the input buffer (usually with modified or added columns) respecting the original row ordering.
So, if you check that the input is ordered, you can assure that the output is also ordered. And there is a property that you can use to read this information from the input and set it in the output:
output.IsOrdered = input.IsOrdered
Take into account that you could set this property to true even if the output was not ordered, but in this case, you're relying on the information provided from the input, which should be correct.
You should only change this property explicitly to true in an asynchronous component in which you really sort the rows before returning them. But, as I told, you could lie, and set this property to true without returning ordered rows. In other words, it's informative metadata.
If this doesn't work for you, you'll also have to set the SortKeyPosition of the required columns in output.OutputColumnCollection. This information is also used by Merge Join to ensure that the input apart from being ordered, is ordered by the required columns.
If you want to see how you can do this using the SSIS task editor, instead of doing it "automagically" in your custom component, please, read IsSorted properties in SSIS or SSIS #98 – IsSorted is true, but sorted on what?
The Merge Join Transform in SSIS has a few requirements. To join two data sources,The data sources must be sorted and there must be a key that you can join them with. In some cases, I perform the join in the OLEDB SOURCE from Query.

Rest API design with multiple unique ids

Currently, we are developing an API for our system and there are some resources that may have different kinds of identifiers.
For example, there is a resource called orders, which may have an unique order number and also have an unique id. At the moment, we only have URLs for the id, which are these URLs:
GET /api/orders/{id}
PUT /api/orders/{id}
DELETE /api/orders/{id}
But now we need also the possibility to use order numbers, which normally would result into:
GET /api/orders/{orderNumber}
PUT /api/orders/{orderNumber}
DELETE /api/orders/{orderNumber}
Obviously that won't work, since id and orderNumber are both numbers.
I know that there are some similar questions, but they don't help me out, because the answers don't really fit or their approaches are not really restful or comprehensible (for us and for possible developers using the API). Additionally, the questions and answers are partially older than 7 years.
To name a few:
1. Using a query param
One suggests to use a query param, e.g.
GET /api/orders/?orderNumber={orderNumber}
I think, there are a lot of problems. First, this is a filter on the orders collections, so that the result should be a list as well. However, there is only one order for the unique order number which is a little bit confusing. Secondly, we use such a filter to search/filter for a subset of orders. Additionally, a query params is some kind of a second-class parameter, but should be first-class in this case. This is even a problem, if I the object does not exist. Normally a get would return a 404 (not found), but a GET /api/orders/?orderNumber=1234 would be an empty array, if the order 1234 does not exist.
2. Using a prefix
Some public APIs use some kind of a discriminator to distinguish between different types, e.g. like:
GET /api/orders/id_1234
GET /api/orders/ordernumber_367652
This works for their approach, because id_1234 and ordernumber_367652 are their real unique identifiers that are also returned by other resources. However, that would result in a response object like this:
{
"id": "id_1234",
"ordernumber": "ordernumber_367652"
//...
}
This is not very clean, because the type (id or order number) is modelled twice. And apart from the problem of changing all identifiers and response objects, this would be confusing, if you e.g. want to search for all order numbers greater than 67363 (thus, there is also a string/number clash). If the response does not add the type as a prefix, a user have to add this for some request, which would also be very confusing (sometime you have to add this and sometimes not...)
3. Using a verb
This is what e.g. Twitter does: their URL ends with show.json, so you can use it like:
GET /api/orders/show.json?id=1234
GET /api/orders/show.json?number=367652
I think, this is the most awful solution, since it is not restful. Furthermore, it has some of the problems that I mentioned in the query param approach.
4. Using a subresource
Some people suggest to model this like a subresource, e.g.:
GET /api/orders/1234
GET /api/orders/id/1234 //optional
GET /api/orders/ordernumber/367652
I like the readability of this approach, but I think the meaning of /api/orders/ordernumber/367652 would be "get (just) the order number 367652" and not the order. Finally, this breaks some best practices like using plurals and only real resources.
So finally, my questions are: Did we missed something? And are there are other approaches, because I think that this is not an unusual problem?
to me, the most RESTful way of solving your problem is using the approach number 2 with a slight modification.
From a theoretical point of view, you just have valid identification code to identify your order. At this point of the design process, it isn't important whether your identification code is an id or an order number. It's something that uniquely identify your order and that's enough.
The fact that you have an ambiguity between ids and numbers format is an issue belonging to the implementation phase, not the design phase.
So for now, what we have is:
GET /api/orders/{some_identification_code}
and this is very RESTful.
Of course you still have the problem of solving your ambiguity, so we can proceed with the implementation phase. Unfortunately your order identification_code set is made of two distinct entities that share the format. It's trivial it can't work. But now the problem is in the definition of these entity formats.
My suggestion is very simple: ids will be integers, while numbers will be codes such as N1234567. This approach will make your resource representation acceptable:
{
"id": "1234",
"ordernumber": "N367652"
//...
}
Additionally, it is common in many scenarios such as courier shipments.
Here is an alternate option that I came up with that I found slightly more palatable.
GET /api/orders/1234
GET /api/orders/1234?idType=id //optional
GET /api/orders/367652?idType=ordernumber
The reason being it keeps the pathing consistent with REST standards, and then in the service if they did pass idType=orderNumber (idType of id is the default) you can pick up on that.
I'm struggling with the same issue and haven't found a perfect solution. I ended up using this format:
GET /api/orders/{orderid}
GET /api/orders/bynumber/{orderNumber}
Not perfect, but it is readable.
I'm also struggling with this! In my case, i only really need to be able to GET using the secondary ID, which makes this a little easier.
I am leaning towards using an optional prefix to the ID:
GET /api/orders/{id}
GET /api/orders/id:{id}
GET /api/orders/number:{orderNumber}
or this could be a chance to use an obscure feature of the URI specification, path parameters, which let you attach parameters to particular path elements:
GET /api/orders/{id}
GET /api/orders/{id};id_type=id
GET /api/orders/{orderNumber};id_type=number
The URL using an unqualified ID is the canonical one. There are two options for the behaviour of non-canonical URLs: either return the entity, or redirect to the canonical URL. The latter is more theoretically pure, but it may be inconvenient for users. Or it may be more useful for users, who knows!
Another way to approach this is to model an order number as its own thing:
GET /api/ordernumbers/{orderNumber}
This could return a small object with just the ID, which users could then use to retrieve the entity. Or even just redirect to the order.
If you also want a general search resource, then that can also be used here:
GET /api/orders?number={orderNumber}
In my case, i don't want such a resource (yet), and i could be uncomfortable adding what appears to be a general search resource that only supports one field.
So basically, you want to treat all ids and order numbers as unique identifiers for the order records. The thing about unique identifiers is, of course, they have to be unique! But your ids and order numbers are all numeric; do their ranges overlap? If, say, "1234" could be either an id or an order number, then obviously /api/orders/1234 is not going to reference a unique order.
If the ranges are unique, then you just need discriminator logic in the handler code for /api/orders/{id}, that can tell an id from an order number. This could actually work, say if your order numbers have more digits than your ids ever will. But I expect you would have done this already if you could.
If the ranges might overlap, then you must at least force the references to them to have unique ranges. The simplest way would be to add a prefix when referring to an order number, e.g. the prefix "N". So that if the order with id 1234 has order number 367652, it could be retrieved with either of these calls:
/api/orders/1234
/api/orders/N367652
But then, either the database must change to include the "N" prefix in all order numbers (you say this is not possible) or else the handler code would have to strip off the "N" prefix before converting to int. In that case, the "N" prefix should only be used in the API calls - user facing data-entry forms should not expose it! You can't have a "lookup by any identifier" field where users can enter either id or order number (this would have a non-uniqueness problem anyway.) Instead, you must have separate "lookup by id" and "lookup by order number" options. Then, you should be able to have the order number input handler automatically add the "N" prefix before submitting to the API.
Fundamentally, this is a problem with the database design - if this (using values from both fields as "unique identifiers") was a requirement, then the database fields should have been designed with this in mind (i.e. with non-overlapping ranges) - if you can't change the order number format, then the id format should have been different.

Three rows of almost the same code behave differently

I have three dropdown boxes on a Main_Form. I will add the chosen content into three fields on the form, Form_Applications.
These three lines are added :
Form_Applications.Classification = Form_Main_Form.Combo43.Value
Form_Applications.Countryname_Cluster = Form_Main_Form.Combo56.Value
Form_Applications.Application = Form_Main_Form.Combo64.Value
The first two work perfectly but the last one gives error code 438!
I can enter in the immediate window :
Form_Applications.Classification = "what ever"
Form_Applications.Countryname_Cluster = "what ever"
but not for the third line. Then, after enter, the Object doesn't support this property or method error appears.
I didn't expect this error as I do exactly the same as in the first two lines.
Can you please help or do you need more info ?
In VBA Application is a special word and should not be used to address fields.
FormName.Application will return an object that points to the application instance that is running that form as opposed to an object within that form.
From the Application object you can do all sorts of other things such as executing external programs and other application level stuff like saving files/
Rename your Application field to something else, perhaps ApplicationCombo and change your line of code to match the new name. After doing this the code should execute as you expect.
Form_Applications.Application is referring to the application itself. It is not a field, so therefore it is not assignable (at least with a string).
You really haven't provided enough code to draw any real conclusions though. But looking at what you have posted, you definitely need to rethink your approach.
It's to say definitely but you are not doing the same. It looks like you are reading a ComboBox value the same (I will assume Combo64 is the same as 43 and 56) but my guess is that what you are assigning that value to is the problem:
Form_Applications.Application =
Application is not assignable. Is there another field you meant to use there?

Deal with undefined values in code or in the template?

I'm writing a web application (in Python, not that it matters). One of the features is that people can leave comments on things. I have a class for comments, basically like so:
class Comment:
user = ...
# other stuff
where user is an instance of another class,
class User:
name = ...
# other stuff
And of course in my template, I have
<div>${comment.user.name}</div>
Problem: Let's say I allow people to post comments anonymously. In that case comment.user is None (undefined), and of course accessing comment.user.name is going to raise an error. What's the best way to deal with that? I see three possibilities:
Use a conditional in the template to test for that case and display something different. This is the most versatile solution, since I can change the way anonymous comments are displayed to, say, "Posted anonymously" (instead of "Posted by ..."), but I've often been told that templates should be mindless display machines and not include logic like that. Also, other people might wind up writing alternate templates for the same application, and I feel like I should be making things as easy as possible for the template writer.
Implement an accessor method for the user property of a Comment that returns a dummy user object when the real user is undefined. This dummy object would have user.name = 'Anonymous' or something like that and so the template could access it and print its name with no error.
Put an actual record in my database corresponding to a user with user.name = Anonymous (or something like that), and just assign that user to any comment posted when nobody's logged in. I know I've seen some real-world systems that operate this way. (phpBB?)
Is there a prevailing wisdom among people who write these sorts of systems about which of these (or some other solution) is the best? Any pitfalls I should watch out for if I go one way vs. another? Whoever gives the best explanation gets the checkmark.
I'd go with the first option, using an if switch in the template.
Consider the case of localization: You'll possibly have different templates for each language. You can easily localize the "anonymous" case in the template itself.
Also, the data model should have nothing to do with the output side. What would you do in the rest of the code if you wanted to test whether a user has a name or not? Check for == 'Anonymous' each time?
The template should indeed only be concerned with outputting data, but that doesn't mean it has to consist solely of output statements. You usually have some sort of if user is logged in, display "Logout", otherwise display "Register" and "Login" case in the templates. It's almost impossible to avoid these.
Personally, I like for clean code, and agree that templates should not have major logic. So in my implementations I make sure that all values have "safe" default values, typically a blank string, pointer to a base class or equivalent. That allows for two major improvements to the code, first that you don't have to constantly test for null or missing values, and you can output default values without too much logic in your display templates.
So in your situation, making a default pointer to a base value sounds like the best solution.
Your 3rd option: Create a regular User entity that represents an anonymous user.
I'm not a fan of None for database integrity reasons.

Reference table values in a war against magic numbers

This question bugged me for years now and can't seem to find good solution still. I working in PHP and Java but it sounds like this maybe language-agnostic :)
Say we have a standard status reference table that holds status ids for some kind of entity. Further let's assume the table will have just 5 values, and will remain like this for a long time, maybe edited occasionally with addition of a new status.
When you fetch a row and need to see what status it is you have 2 options(as I see it at least) - put it straight ID values(magic numbers that is) or use a named constant. Latter seem much cleaner, the question though is where those named constants should leave? In a model class? In a class that uses this particular constant? Somewhere else?
It sounds like what you're wanting to do is an enumerated value.
This is a value that has a literal name mapped to a constant value, this would be something like
Statusone = 1
Statustwo = 2
Then anywhere in your program you could refrenece statusone which the compiler would see as 1.
I'm not sure if this exists in php but I'm pretty sure it does in java
EDIT In response to some comments
I would typically put enumerated values in some kind of global namespace, or if you only need them when you are using that class spefically you can put them at the class level.