glmer model error: variable lengths differ - regression

I am trying to run the following model:
glmer(c(EMERGED.COUNT,NOT.EMERGED.COUNT) ~ SPECIES*SOIL*TREATMENT + (1|ID), data = data_emerged_success, family=binomial)
I receive an error message of:
Error in model.frame.default(data = data_emerged_success, drop.unused.levels = TRUE, :
variable lengths differ (found for 'SPECIES')
I have checked my data: all variables are of equal length and there are no NA values. If I change the order of the variables in the model (e.g., treatment first), I get the same error message, but for whichever of the three variables comes first.

XGBoost considers numeric variables as categorical variables because there are too few unique values

I have built an XGBoost model and am trying to write a custom function such as:
cols_when_model_builds = model.get_booster().feature_names
def get_prediction(parameter_1, parameter_2, parameter_3):
    all_columns = ["parameter_1", "parameter_2", "parameter_3"]
    lst = [parameter_1, parameter_2, parameter_3]
    df = pd.DataFrame([lst], columns=all_columns)
    X = pd.get_dummies(df, columns=["parameter_1"], drop_first=False)
    X2 = pd.DataFrame(columns=cols_when_model_builds)
    X = X.reindex(labels=X2.columns, axis=1)
    result = model.predict(X)
Here, "parameter_3" is a categorical feature and the other two parameters are integer (numeric) features. However, when I pass a value to "parameter_3", it returns an error:
ValueError: DataFrame.dtypes for data must be int, float, bool or category. When
categorical type is supplied, DMatrix parameter `enable_categorical` must
be set to `True`. Invalid columns:"parameter_2"
It should be noted that the code has no problem with "parameter_1", even though it is numeric as well.
Setting "enable_categorical" won't help, because the parameter causing the trouble is not categorical.
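One thing worth checking is the dtype of every column just before model.predict(X). The following is only a minimal sketch, assuming the numeric parameters may arrive as strings (for example from user input) and that "parameter_3" is the column to one-hot encode, as the description says. The build_row helper and the model_columns list are hypothetical stand-ins for the original function and for cols_when_model_builds; they are not taken from the original code.

```python
import pandas as pd

def build_row(parameter_1, parameter_2, parameter_3, model_columns):
    """Build a one-row frame whose columns and dtypes match the training data."""
    df = pd.DataFrame([{
        "parameter_1": parameter_1,
        "parameter_2": parameter_2,
        "parameter_3": parameter_3,
    }])
    # Assumption: these two features are numeric; coerce them so a value
    # passed as a string does not leave the column with dtype 'object'.
    for col in ("parameter_1", "parameter_2"):
        df[col] = pd.to_numeric(df[col])
    # One-hot encode the categorical feature only.
    X = pd.get_dummies(df, columns=["parameter_3"])
    # Align with the training columns; dummy columns absent from this row become 0.
    return X.reindex(columns=model_columns, fill_value=0)

# Hypothetical training-time column list, for illustration only.
model_columns = ["parameter_1", "parameter_2", "parameter_3_a", "parameter_3_b"]
X = build_row("3", "7", "b", model_columns)
print(X.dtypes)
```

If any column still prints as object here, that is the column XGBoost will reject, whichever name the error message reports.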

How to loop over the compareGroups function in R

I have a large dataset and I want to apply the compareGroups function in R. The dataset has a number of grouping variables on which I want to compare the other variables, so I wanted to loop over these grouping variables. However, the function does not accept the loop variable:
`vars <- c("fibrosis_stage", "Steatosis_stage", "patient_classification")
for (var in vars){
model <- compareGroups(var~.,data = data)
result.model <- createTable(model)
export2xls(result.model,paste0(var,"comparisons.xlsx"))
}`
but I got the following error:
Error in model.frame.default(formula = var ~ ., data = list(Gender = c(2L, :
variable lengths differ
I even tried wrapping it in a function with the grouping variable as input, but I got an error as well.
Can anyone help?

error: element number 2 undefined in return list. I'm new to this, pls help me

x = fopen('pm10_data.txt');
fseek(x, 8,0);
dat = fscanf (x,'%f',[2,1000]);
dat = transpose(dat);
a = dat(:,1);
b = dat(:,2);
[r,p] = cor_test (a,b)
fclose(x);
r
p
This is what I got:
r =
scalar structure containing the fields:
method = Pearson's product moment correlation
params = 76
stat = 6.2156
dist = t
pval = 2.5292e-08
alternative = !=
Run error
error: element number 2 undefined in return list
error: called from
tester.octave at line 7 column 6
Presumably you're referring to the cor_test function from the statistics package, even though you don't show loading this in your workspace.
According to the documentation of cor_test:
The output is a structure with the following elements:
PVAL The p-value of the test.
STAT The value of the test statistic.
DIST The distribution of the test statistic.
PARAMS The parameters of the null distribution of the test statistic.
ALTERNATIVE The alternative hypothesis.
METHOD The method used for testing.
If no output argument is given, the p-value is displayed.
This seems to be what you're getting too.
If you want the p-value explicitly from that structure, you can access it as r.pval.
The syntax [a, b, ...] = functionname( args, ... ) expects the function to return more than one argument, and capture all the returned arguments into the named variables (i.e. a, b, etc).
In this case, cor_test only returns a single argument, even though that argument is a struct (which means it has fields you can access).
The error you're getting effectively means you requested a second output argument p, but the function you're using does not return a second output argument. It only returns that struct you already captured in r.

Kaggle competition submission error : The value '' in the key column '' has already been defined

This is my first time participating in a Kaggle competition, and I'm having trouble submitting my result table. I built my model using gbm and made a prediction table as shown below. The submission file has 2 columns, named 'fullVisitorId' and 'PredictedLogRevenue', as in other Kaggle competitions.
pred_oob = predict(object = model_gbm, newdata = te_df, type = 'response')
mysub = data.frame(fullVisitorId = test$fullVisitorId, Pred = pred_oob)
mysub = mysub %>%
group_by(fullVisitorId) %>%
summarise(Predicted = sum(Pred))
submission = read.csv('sample_submission.csv')
mysub = submission %>%
left_join(mysub, by = 'fullVisitorId')
mysub$PredictedLogRevenue = NULL
names(mysub) = names(submission)
But when I try to submit the file, I get a 'fail' message saying:
ERROR: The value '8.893887e+17' in the key column 'fullVisitorId' has already been defined (Line 549026, Column 1)
ERROR: The value '8.895317e+18' in the key column 'fullVisitorId' has already been defined (Line 549126, Column 1)
ERROR: The value '8.895317e+18' in the key column 'fullVisitorId' has already been defined (Line 549127, Column 1)
There are 8 more lines like this, not just these 3.
I have no idea what I did wrong. I also checked other kernels but couldn't find the answer. Please...help!!
This issue occurred because fullVisitorId was read as numeric instead of character, so all the leading zeros were dropped. Reading the file with read.csv() and its colClasses argument, or with fread(), makes it work.
I am leaving this here in case someone else runs into similar trouble.
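To see why a numeric ID column produces "already been defined" errors, here is a small sketch using only the Python standard library. The two IDs are made up for illustration; the point is that 19-digit IDs exceed what a double can represent exactly, so distinct IDs can collapse to one value, and leading zeros disappear as well.

```python
import csv
import io

# Hypothetical two-row submission file; the IDs are invented for this example.
sample = io.StringIO(
    "fullVisitorId,PredictedLogRevenue\n"
    "0889531700000000001,0.0\n"
    "0889531700000000002,0.0\n"
)

rows = list(csv.DictReader(sample))

# csv returns every field as text, so the IDs survive unchanged.
ids_as_text = [row["fullVisitorId"] for row in rows]

# Converting to a double mimics what a numeric column does to the IDs.
ids_as_float = [float(value) for value in ids_as_text]

unique_text = len(set(ids_as_text))
unique_float = len(set(ids_as_float))
print(unique_text, unique_float)
```

Kept as text, the two IDs stay distinct; converted to doubles, they become the same key, which is the kind of duplicate the submission checker reports.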
For creating the submission dataframe, the easiest way is this:
subm_df = pd.read_csv('../input/sample_submission.csv')
subm_df['PredictedLogRevenue'] = <your prediction array>
subm_df.to_csv('Subm_1.csv', index=False)
Note that this assumes your sample_submission.csv contains every fullVisitorId, which it usually does on Kaggle. Following this approach, I have never faced any issues.

Using R and R-SQL API to execute SQL query

For an assignment, I am supposed to use SQL to get a list of unique values from a table as a vector in R. I wrote the following code in R:
selection = dbSendQuery(con, statement = "SELECT user_id FROM twitter_message")
user_id = c(dbFetch(selection))
I am supposed to then randomly generate 3 values, preferably using the sample() function. However, when I do that, it generates vectors the size of the original vector (approximately 500 values) rather than selecting 3 values from the vector. I do not know if the error is from how I put the data in a vector or not. I tried writing the following code:
sample(user_id, size = 3, replace = FALSE, prob = NULL)
However, I get the error:
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
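sample() treats a data frame (or a list) as a collection of columns, so its length here is 1, the number of columns, not the number of rows. You need to sample row indices instead. This assumes user_id is the data frame returned by dbFetch(), without the c() wrapper: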
user_id[sample(nrow(user_id), 3, replace = FALSE, prob = NULL),]
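For comparison, the same workflow can be sketched in Python with the standard-library sqlite3 and random modules. This is only an illustration under assumptions: the in-memory table and its contents are invented, and only the table and column names come from the question. The key step is the same as above: flatten the query result into a plain sequence of values before sampling from it.

```python
import random
import sqlite3

# Hypothetical in-memory table standing in for the real database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE twitter_message (user_id INTEGER)")
con.executemany(
    "INSERT INTO twitter_message (user_id) VALUES (?)",
    [(i % 50,) for i in range(500)],  # 500 rows, 50 distinct users
)

# DISTINCT returns each user_id once; every row comes back as a 1-tuple.
rows = con.execute("SELECT DISTINCT user_id FROM twitter_message").fetchall()

# Flatten the 1-tuples into a plain list of values before sampling.
user_ids = [row[0] for row in rows]

# Draw 3 different values without replacement.
picked = random.sample(user_ids, 3)
print(picked)

con.close()
```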