Kaggle competition submission error: The value '' in the key column '' has already been defined

This is my first time participating in a Kaggle competition and I'm having trouble submitting my result table. I made my model using gbm and built a prediction table like below. The submission file has 2 columns named 'fullVisitorId' and 'PredictedLogRevenue', as in any other Kaggle competition.
pred_oob = predict(object = model_gbm, newdata = te_df, type = 'response')
mysub = data.frame(fullVisitorId = test$fullVisitorId, Pred = pred_oob)
mysub = mysub %>%
  group_by(fullVisitorId) %>%
  summarise(Predicted = sum(Pred))
submission = read.csv('sample_submission.csv')
mysub = submission %>%
  left_join(mysub, by = 'fullVisitorId')
mysub$PredictedLogRevenue = NULL
names(mysub) = names(submission)
But when I tried to submit the file, I got a 'fail' message saying ...
ERROR: The value '8.893887e+17' in the key column 'fullVisitorId' has already been defined (Line 549026, Column 1)
ERROR: The value '8.895317e+18' in the key column 'fullVisitorId' has already been defined (Line 549126, Column 1)
ERROR: The value '8.895317e+18' in the key column 'fullVisitorId' has already been defined (Line 549127, Column 1)
Not just these 3 lines; there were 8 more like them.
I have no idea what I did wrong. I also checked other kernels but couldn't find the answer. Please...help!!

This issue was because fullVisitorId was read as numeric instead of character, so the long IDs were mangled: leading zeros were dropped and values collapsed into scientific notation (hence the apparent duplicates in the error messages). Reading the files with read.csv() and its colClasses argument (e.g. colClasses = c(fullVisitorId = "character")) or with fread() makes it work.
I'm leaving this here in case someone else runs into the same trouble.

For creating the submission dataframe, the easiest way is this:
subm_df = pd.read_csv('../input/sample_submission.csv')
subm_df['PredictedLogRevenue'] = <your prediction array>
subm_df.to_csv('Subm_1.csv', index=False)
Note that this assumes your sample_submission.csv contains every fullVisitorId (which it usually does on Kaggle) and that your prediction array is in the same row order. Following this, I have never faced any issues.
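If you build the submission in pandas, the same ID problem can bite you there too, since read_csv will happily parse fullVisitorId as a number. Below is a minimal sketch of a safer variant of the snippet above; test_ids and pred_oob are hypothetical names for your test IDs and raw predictions, and it merges by ID instead of relying on row order:
import pandas as pd

# read the IDs as strings so long numeric IDs are not mangled
subm_df = pd.read_csv('../input/sample_submission.csv', dtype={'fullVisitorId': str})

# per-visitor predictions, summed over duplicate visitors (test_ids should also be strings)
preds_df = pd.DataFrame({'fullVisitorId': test_ids, 'PredictedLogRevenue': pred_oob})
preds_df = preds_df.groupby('fullVisitorId', as_index=False).sum()

# align by ID rather than by position, and fill visitors with no prediction
subm_df = subm_df[['fullVisitorId']].merge(preds_df, on='fullVisitorId', how='left')
subm_df['PredictedLogRevenue'] = subm_df['PredictedLogRevenue'].fillna(0)
subm_df.to_csv('Subm_1.csv', index=False)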

Related

error: element number 2 undefined in return list. I'm new to this, please help me

x = fopen('pm10_data.txt');
fseek(x, 8,0);
dat = fscanf (x,'%f',[2,1000]);
dat = transpose(dat);
a = dat(:,1);
b = dat(:,2);
[r,p] = cor_test (a,b)
fclose(x);
r
p
This is what I got:
r =
  scalar structure containing the fields:
    method = Pearson's product moment correlation
    params = 76
    stat = 6.2156
    dist = t
    pval = 2.5292e-08
    alternative = !=
Run error
error: element number 2 undefined in return list
error: called from
tester.octave at line 7 column 6
Presumably you're referring to the cor_test function from the statistics package, even though you don't show loading this in your workspace.
According to the documentation of cor_test:
The output is a structure with the following elements:
PVAL The p-value of the test.
STAT The value of the test statistic.
DIST The distribution of the test statistic.
PARAMS The parameters of the null distribution of the test statistic.
ALTERNATIVE The alternative hypothesis.
METHOD The method used for testing.
If no output argument is given, the p-value is displayed.
This seems to be what you're getting too.
If you want the p-value explicitly from that structure, you can access it as r.pval.
The syntax [a, b, ...] = functionname( args, ... ) expects the function to return more than one output argument, and captures the returned values in the named variables (i.e. a, b, etc.).
In this case, cor_test only returns a single argument, even though that argument is a struct (which means it has fields you can access).
The error you're getting effectively means you requested a second output argument p, but the function you're using does not return a second output argument. It only returns that struct you already captured in r.

Argument not specified for parameter 'FalsePart' of 'Public Function

I'm trying to get a field expression to return N/A when the value is blank, FAIL under certain conditions, and otherwise the PASS/FAIL value held in the variable.
What is missing from my syntax?
=IIf(IsNothing(Variables!PassFail.Value),"N/A", iif(Variables!NameVer.Value = " x") Or (Variables!levPassFail.Value = "FAIL") Or (Variables!countPassFail.Value = "FAIL"),"FAIL", Variables!PassFail.Value)
The syntax was wrong, but it's difficult to know exactly what you want to achieve.
The following reproduces your original attempt with the parentheses placed correctly:
=IIF(
    IsNothing(Variables!PassFail.Value)
    , "N/A"
    , IIF(
        Variables!NameVer.Value = " x" Or Variables!levPassFail.Value = "FAIL" Or Variables!countPassFail.Value = "FAIL"
        , "FAIL"
        , Variables!PassFail.Value
    )
)
Note: I'm not sure if = " x" is a typo and should be = "x" but I left your original code in there.
If this does not help, please list the possible scenarios and the required output in each, then I can offer a more accurate solution.

Dropping duplicates in a pyarrow table?

Is there a way to sort data and drop duplicates using pure pyarrow tables? My goal is to retrieve the latest version of each ID based on the maximum update timestamp.
Some extra details: my datasets are normally structured into at least two versions:
historical
final
The historical dataset would include all updated items from a source, so it is possible to have duplicates for a single ID, one for each change that happened to it (picture a Zendesk or ServiceNow ticket, for example, where a ticket can be updated many times).
I then read the historical dataset using filters, convert it into a pandas DF, sort the data, and then drop duplicates on some unique constraint columns.
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

dataset = ds.dataset(history, filesystem=filesystem, partitioning=partitioning)
table = dataset.to_table(filter=filter_expression, columns=columns)
df = table.to_pandas().sort_values(sort_columns, ascending=True).drop_duplicates(unique_constraint, keep="last")
table = pa.Table.from_pandas(df=df, schema=table.schema, preserve_index=False)
# ds.write_dataset(final, filesystem, partitioning)
# I tend to write the final dataset using the legacy dataset so I can make use of the partition_filename_cb - that way I can have one file per date_id. Our visualization tool connects to these files directly
# container/dataset/date_id=20210127/20210127.parquet
pq.write_to_dataset(table, final, partition_cols=["date_id"], filesystem=filesystem, use_legacy_dataset=True, partition_filename_cb=lambda x: str(x[-1]).split(".")[0] + ".parquet")
It would be nice to cut out that conversion to pandas and then back to a table, if possible.
Edit March 2022: PyArrow is adding more functionality, though this one isn't there yet. My approach now would be:
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates(table: pa.Table, column_name: str) -> pa.Table:
    # keep the first row for each distinct value in column_name
    unique_values = pc.unique(table[column_name])
    unique_indices = [pc.index(table[column_name], value).as_py() for value in unique_values]
    mask = np.full((len(table)), False)
    mask[unique_indices] = True
    return table.filter(mask=mask)
//end edit
I saw your question because I had a similar one, and I solved it for my work. Due to IP issues I can't post the whole code, but I'll try to answer as well as I can (I've never done this before).
import pyarrow.compute as pc
import pyarrow as pa
import numpy as np
array = table.column(column_name)
dicts = {dct['values']: dct['counts'] for dct in pc.value_counts(array).to_pylist()}
for key, value in dicts.items():
    ...  # do stuff with each unique value and its count
I used value_counts to find the unique values and how many of each there are (https://arrow.apache.org/docs/python/generated/pyarrow.compute.value_counts.html). Then I iterated over those values. If the count was 1, I selected the row by using
mask = pa.array(np.array(array) == key)
row = table.filter(mask)
and if the count was more than 1, I selected either the first or the last one by using numpy boolean arrays as a mask again.
After iterating it was just as simple as pa.concat_tables(tables)
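Putting those pieces together, here is a rough, self-contained sketch of the approach described above (slower than the later versions; I use slice to grab the first/last duplicate instead of a second boolean mask, but the idea is the same):
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def drop_duplicates_slow(table: pa.Table, column_name: str, keep: str = 'last') -> pa.Table:
    array = table.column(column_name)
    # unique values and how often each occurs
    counts = {d['values']: d['counts'] for d in pc.value_counts(array).to_pylist()}
    np_col = np.array(array)  # materialize the column once for the masks
    pieces = []
    for key, count in counts.items():
        rows = table.filter(pa.array(np_col == key))
        if count > 1:
            # keep only the first or the last of the duplicated rows
            rows = rows.slice(0, 1) if keep == 'first' else rows.slice(len(rows) - 1, 1)
        pieces.append(rows)
    return pa.concat_tables(pieces)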
warning: this is a slow process. If you need something quick&dirty, try the "Unique" option (also in the same link I provided).
Edit/extra: you can make it a bit faster and less memory intensive by keeping a numpy boolean mask up to date while iterating over the dictionary, and then at the end returning table.filter(mask=boolean_mask).
I don't know how to calculate the speed though...
edit2:
(sorry for the many edits. I've been doing a lot of refactoring and trying to get it to work faster.)
You can also try something like:
def drop_duplicates(table: pa.Table, col_name: str) -> pa.Table:
    column_array = table.column(col_name)
    mask_x = np.full((table.shape[0]), False)
    _, mask_indices = np.unique(np.array(column_array), return_index=True)
    mask_x[mask_indices] = True
    return table.filter(mask=mask_x)
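As a quick usage sketch of that function on a toy table (np.unique keeps the index of the first occurrence of each value, so the first row per ID survives), using the imports from above:
t = pa.table({'ID': [1, 1, 2, 3, 3, 3],
              'value': ['a', 'b', 'c', 'd', 'e', 'f']})
deduped = drop_duplicates(t, 'ID')
print(deduped.to_pydict())
# {'ID': [1, 2, 3], 'value': ['a', 'c', 'd']}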
The following gives good performance: about 2 minutes for a table with half a billion rows. The reason I don't call combine_chunks(): there is a bug, Arrow seemingly cannot combine chunked arrays if their size is too large. See details: https://issues.apache.org/jira/browse/ARROW-10172?src=confmacro
# build a global row index, chunk by chunk, and attach it as a new column
a = [len(tb3['ID'].chunk(i)) for i in range(len(tb3['ID'].chunks))]
c = np.array([np.arange(x) for x in a])
a = ([0] + a)[:-1]
c = pa.chunked_array(c + np.cumsum(a))
tb3 = tb3.set_column(tb3.shape[1], 'index', c)
# keep only the first occurrence of each ID
selector = tb3.group_by(['ID']).aggregate([("index", "min")])
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=selector['index_min']))
I found that duckdb gives better performance on the group by. Changing the last 2 lines above into the following gives a 2X speedup:
import duckdb
duck = duckdb.connect()
sql = "select first(index) as idx from tb3 group by ID"
duck_res = duck.execute(sql).fetch_arrow_table()
tb3 = tb3.filter(pc.is_in(tb3['index'], value_set=duck_res['idx']))
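Since the original goal was the latest version of each ID by its update timestamp, one variation on the index trick above is to sort before building the positional index and then keep the maximum index per ID. This is only a sketch under the assumption that there is a timestamp column (updated_at here is a made-up name, not from the snippets above):
# sort oldest -> newest, then attach a fresh positional index
tb3 = tb3.sort_by([('updated_at', 'ascending')])
tb3 = tb3.append_column('row_idx', pa.array(np.arange(len(tb3))))

# the newest row of each ID now has the largest index within its group
selector = tb3.group_by(['ID']).aggregate([('row_idx', 'max')])
tb3 = tb3.filter(pc.is_in(tb3['row_idx'], value_set=selector['row_idx_max']))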

Using R and R-SQL API to execute SQL query

For an assignment, I am supposed to use SQL to get a list of unique values from a table as a vector in R. I wrote the following code in R:
selection = dbSendQuery(con, statement = "SELECT user_id FROM twitter_message")
user_id = c(dbFetch(selection))
I am supposed to then randomly generate 3 values, preferably using the sample() function. However, when I do that, it generates vectors the size of the original vector (approximately 500 values) rather than selecting 3 values from the vector. I do not know if the error is from how I put the data in a vector or not. I tried writing the following code:
sample(user_id, size = 3, replace = FALSE, prob = NULL)
However, I get the error:
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
You need to sample from the rows, not from the data frame itself:
user_id[sample(nrow(user_id), 3, replace = FALSE, prob = NULL),]

MySQL Dynamic Query Statement in Python with Dictionary

Very similar to this question MySQL Dynamic Query Statement in Python
However, what I am looking to do, instead of using two lists, is to use a dictionary.
Let's say I have this dictionary:
instance_insert = {
    # sql column : variable value
    'instance_id': 'instance.id',
    'customer_id': 'customer.id',
    'os': 'instance.platform',
}
And I want to populate a MySQL table with an INSERT statement, using each dictionary key as the SQL column name and each dictionary value as the name of the variable holding the value to be inserted.
I'm kind of lost because I don't understand exactly what the statement below does; it was pulled from the question I linked, where the author was using two lists to do what he wanted.
sql = "INSERT INTO instance_info_test VALUES (%s);" % ', '.join('?' for _ in instance_insert)
cur.execute (sql, instance_insert)
Also, I would like it to be dynamic in the sense that I can add or remove columns from the dictionary.
Before you post, you might want to try searching for something more specific to your question. For instance, when I Googled "python mysqldb insert dictionary", I found a good answer on the first page, at http://mail.python.org/pipermail/tutor/2010-December/080701.html. Relevant part:
Here's what I came up with when I tried to make a generalized version
of the above:
def add_row(cursor, tablename, rowdict):
    # XXX tablename not sanitized
    # XXX test for allowed keys is case-sensitive
    # filter out keys that are not column names
    cursor.execute("describe %s" % tablename)
    allowed_keys = set(row[0] for row in cursor.fetchall())
    keys = allowed_keys.intersection(rowdict)

    if len(rowdict) > len(keys):
        unknown_keys = set(rowdict) - allowed_keys
        print >> sys.stderr, "skipping keys:", ", ".join(unknown_keys)

    columns = ", ".join(keys)
    values_template = ", ".join(["%s"] * len(keys))

    sql = "insert into %s (%s) values (%s)" % (
        tablename, columns, values_template)
    values = tuple(rowdict[key] for key in keys)
    cursor.execute(sql, values)

filename = ...
tablename = ...
db = MySQLdb.connect(...)
cursor = db.cursor()
with open(filename) as instream:
    row = json.load(instream)
add_row(cursor, tablename, row)
Peter
If you know your inputs will always be valid (the table name is valid, the columns are present in the table), and you're not importing from a JSON file as the example does, you can simplify this function, and it will still accomplish what you want. While it may initially seem like DictCursor would be helpful, DictCursor is useful for returning rows as dictionaries; it can't execute from a dict.
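For what it's worth, here is a rough sketch of that simplification for a dictionary you already trust; cursor and db are the objects from the code above, and instance/customer stand in for whatever objects actually hold your values (they are not defined in the question):
def insert_dict(cursor, tablename, rowdict):
    # build "insert into t (c1, c2) values (%s, %s)" from the dict keys
    columns = ", ".join(rowdict.keys())
    placeholders = ", ".join(["%s"] * len(rowdict))
    sql = "insert into %s (%s) values (%s)" % (tablename, columns, placeholders)
    cursor.execute(sql, tuple(rowdict.values()))

row = {
    'instance_id': instance.id,        # the actual values, not strings naming them
    'customer_id': customer.id,
    'os': instance.platform,
}
insert_dict(cursor, 'instance_info_test', row)
db.commit()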