Using R and R-SQL API to execute SQL query - mysql

For an assignment, I am supposed to use SQL to get a list of unique values from a table as a vector in R. I wrote the following code in R:
selection = dbSendQuery(con, statement = "SELECT user_id FROM twitter_message")
user_id = c(dbFetch(selection))
I am then supposed to randomly select 3 values, preferably using the sample() function. However, when I do that, it generates a vector the size of the original vector (approximately 500 values) rather than selecting 3 values from it. I do not know whether the error comes from how I put the data into the vector. I tried writing the following code:
sample(user_id, size = 3, replace = FALSE, prob = NULL)
However, I get the error:
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'

dbFetch() returns a data frame, so you need to sample from the rows of your data frame, not from the data frame object itself:
user_id[sample(nrow(user_id), 3, replace = FALSE, prob = NULL),]

Related

Unable to retrieve data from my sql database using pymysql

I have been trying to retrieve data from my database. I was successful before, but this time the query runs inside an if statement. The code looks like:
cur_msql = conn_mysql.cursor(cursor=pymysql.cursors.DictCursor)
select_query = """select x,y,z from table where type='sample' and code=%s"""
cur_msql.execute(select_query, code)
result2 = cur_msql.fetchone()
if(result2==None):
    insert_func(code)
    select_query = f"""select x,y,z from table where type='sample' and code='{code}'"""
    mycur = conn_mysql.cursor(cursor=pymysql.cursors.DictCursor)
    print(select_query)
    mycur.execute(select_query)
    result3 = mycur.fetchone()
    if(result2==None):
        result2=result3
Now I see that insert_func does successfully insert into the table. However, when I try to fetch that row immediately after the insertion, it returns None, as if the row were absent. On debugging, I find that result3 is also None. Nothing looks wrong to me, but it's not working.
You are not executing it the right way: in cur_msql.execute you have to send the query and a tuple of values, but you are sending just a value:
cur_msql = conn_mysql.cursor(cursor=pymysql.cursors.DictCursor)
select_query = "select learnpath_code,learnpath_id,learnpath_name from contentgrail.knowledge_vectors_test where Type='chapters' and code=%s"
cur_msql.execute(select_query, (meta['chapter_code'],))
result2 = cur_msql.fetchone()
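Applied back to the flow in the question, a minimal sketch might look like this (the connection details are placeholders; meta, insert_func, and the table/column names are taken from the code above):
import pymysql

conn_mysql = pymysql.connect(host="localhost", user="user", password="pw", db="contentgrail")  # placeholder connection
cur_msql = conn_mysql.cursor(cursor=pymysql.cursors.DictCursor)
select_query = ("select learnpath_code, learnpath_id, learnpath_name "
                "from contentgrail.knowledge_vectors_test "
                "where Type='chapters' and code=%s")

# Parameters always go in as a tuple, even for a single value; the driver
# does the quoting, so no f-string interpolation is needed.
cur_msql.execute(select_query, (meta['chapter_code'],))
result2 = cur_msql.fetchone()

if result2 is None:
    insert_func(meta['chapter_code'])                         # insert the missing row
    cur_msql.execute(select_query, (meta['chapter_code'],))   # re-run the same parameterized query
    result2 = cur_msql.fetchone()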

Kaggle competition submission error : The value '' in the key column '' has already been defined

This is my first time participating in a Kaggle competition and I'm having trouble submitting my result table. I built my model using gbm and made a prediction table as below. The submission file has 2 columns, named 'fullVisitorId' and 'PredictedLogRevenue', as in other Kaggle competitions.
pred_oob = predict(object = model_gbm, newdata = te_df, type = 'response')
mysub = data.frame(fullVisitorId = test$fullVisitorId, Pred = pred_oob)
mysub = mysub %>%
group_by(fullVisitorId) %>%
summarise(Predicted = sum(Pred))
submission = read.csv('sample_submission.csv')
mysub = submission %>%
left_join(mysub, by = 'fullVisitorId')
mysub$PredictedLogRevenue = NULL
names(mysub) = names(submission)
But when I try to submit the file, I get a 'fail' message saying:
ERROR: The value '8.893887e+17' in the key column 'fullVisitorId' has already been defined (Line 549026, Column 1)
ERROR: The value '8.895317e+18' in the key column 'fullVisitorId' has already been defined (Line 549126, Column 1)
ERROR: The value '8.895317e+18' in the key column 'fullVisitorId' has already been defined (Line 549127, Column 1)
There were not just these 3 lines, but 8 more like them.
I have no idea what I did wrong. I also checked other kernels but couldn't find the answer. Please...help!!
This issue arose because fullVisitorId was read as numeric instead of character, so all the leading zeros were dropped. Therefore, using read.csv() with the colClasses argument, or fread(), makes it work.
I'm leaving this here in case someone else runs into similar trouble.
For creating the submission dataframe, the easiest way is this:
subm_df = pd.read_csv('../input/sample_submission.csv')
subm_df['PredictedLogRevenue'] = <your prediction array>
subm_df.to_csv('Subm_1.csv', index=False)
Note this assumes your sample_submission.csv has all fullVisitorId values, which it usually does in Kaggle competitions. Following this, I have never faced any issues.
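If you build the submission with pandas as above, a minimal sketch of the same fix (the sample_submission.csv path is from the snippet above; test.csv is a hypothetical second file) is to force the ID column to be read as text:
import pandas as pd

# Read fullVisitorId as a string so leading zeros survive and the IDs are
# never collapsed into scientific notation.
subm_df = pd.read_csv('../input/sample_submission.csv', dtype={'fullVisitorId': str})
test = pd.read_csv('../input/test.csv', dtype={'fullVisitorId': str})  # hypothetical test file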

MySQL Dynamic Query Statement in Python with Dictionary

This is very similar to the question MySQL Dynamic Query Statement in Python.
However, instead of two lists, what I am looking to do is use a dictionary.
Let's say I have this dictionary:
instance_insert = {
    # sql column     variable value
    'instance_id' : 'instnace.id',
    'customer_id' : 'customer.id',
    'os' : 'instance.platform',
}
I want to populate a MySQL table with an INSERT statement that uses each dictionary key as the SQL column name and each dictionary value as the name of the variable holding the value to be inserted.
I'm kind of lost because I don't understand exactly what the following statement does, but it was pulled from the question I linked, where the author used two lists to do what he wanted:
sql = "INSERT INTO instance_info_test VALUES (%s);" % ', '.join('?' for _ in instance_insert)
cur.execute (sql, instance_insert)
Also, I would like it to be dynamic in the sense that I can add or remove columns from the dictionary.
Before you post, you might want to try searching for something more specific to your question. For instance, when I Googled "python mysqldb insert dictionary", I found a good answer on the first page, at http://mail.python.org/pipermail/tutor/2010-December/080701.html. Relevant part:
Here's what I came up with when I tried to make a generalized version
of the above:
def add_row(cursor, tablename, rowdict):
    # XXX tablename not sanitized
    # XXX test for allowed keys is case-sensitive
    # filter out keys that are not column names
    cursor.execute("describe %s" % tablename)
    allowed_keys = set(row[0] for row in cursor.fetchall())
    keys = allowed_keys.intersection(rowdict)
    if len(rowdict) > len(keys):
        unknown_keys = set(rowdict) - allowed_keys
        print >> sys.stderr, "skipping keys:", ", ".join(unknown_keys)
    columns = ", ".join(keys)
    values_template = ", ".join(["%s"] * len(keys))
    sql = "insert into %s (%s) values (%s)" % (
        tablename, columns, values_template)
    values = tuple(rowdict[key] for key in keys)
    cursor.execute(sql, values)

filename = ...
tablename = ...
db = MySQLdb.connect(...)
cursor = db.cursor()
with open(filename) as instream:
    row = json.load(instream)
add_row(cursor, tablename, row)
Peter
If you know your inputs will always be valid (the table name is valid and the columns are present in the table), and you're not importing from a JSON file as the example does, you can simplify this function, but it will accomplish what you want. While it may initially seem like DictCursor would be helpful here, DictCursor is useful for returning rows as dictionaries; it can't execute from a dict.
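As a rough sketch of that simplification (the table name instance_info_test comes from the question; the connection details and literal values are placeholders, and the dictionary keys are trusted to be valid column names):
import pymysql  # assumed driver; MySQLdb works the same way

def insert_dict(cursor, tablename, rowdict):
    # Build "col1, col2, ..." and a matching "%s, %s, ..." placeholder list,
    # then let the driver escape the values.
    columns = ", ".join(rowdict.keys())
    placeholders = ", ".join(["%s"] * len(rowdict))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (tablename, columns, placeholders)
    cursor.execute(sql, tuple(rowdict.values()))

conn = pymysql.connect(host="localhost", user="user", password="pw", db="mydb")  # placeholders
cur = conn.cursor()
instance_insert = {
    'instance_id': 'i-12345',     # in the question these come from instance.id etc.
    'customer_id': 'c-67890',
    'os': 'linux',
}
insert_dict(cur, "instance_info_test", instance_insert)
conn.commit()
Adding or removing a key in the dictionary automatically adds or removes the corresponding column in the generated statement.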

LINQ sum returns wrong result

I have a table with rows containing the following numbers:
4191808.51
3035280.22
3437796.06
4013772.33
1740652.56
0
The sum of that table is 16419309.68.
When I run a LINQ Sum query on that table, it returns a completely different number: 19876858.14.
The code I use to do the sum is as follows:
IEnumerable<MyData> data = _myRepository.GetMatching(myValue);
decimal sumSales = data.Sum(x => x.Sales);
What could be causing this? I suspected some maximum decimal value, but couldn't find information on that.
Can you please examine your
IEnumerable<MyData> data = _myRepository.GetMatching(myValue);
in the debugger after execution? I suspect that you are selecting a different set of data, not what you have shown in the sample. I attempted to recreate your situation in LINQPad, but I consistently get the correct answer:
decimal[] data = {
4191808.51m
,3035280.22m
,3437796.06m
,4013772.33m
,1740652.56m
,0m
};
decimal sumSales = data.Sum();
sumSales.Dump();
and I get: 16419309.68

postgres crosstab query with $libdir/tablefunc crosstab_hash function

My crosstab query (see below) runs just fine. However, I have to generate a large number of such queries, and, crucially, the number of column definitions will vary from day to day. If the number of output column definitions does not match that of the second argument of the crosstab, the crosstab will throw an error and abort. Therefore, I cannot "hard-wire" the column definitions as in my current query; instead I need a function which will ensure that the column definitions are synchronized on the fly. Is it possible to write a generic Postgres function that will be reusable in all such instances? Here is my query:
SELECT *
FROM crosstab
('SELECT
to_char(ipstimestamp, ''mon DD HH24h'') As row_name,
ips.objectid::text As category,
COUNT(*)::integer As value
FROM loggingdb_ips_boolean As log
INNER JOIN IpsObjects As ips
ON log.Varid=ips.ObjectId
WHERE (( log.varid = 37551)
OR (log.varid = 27087)
OR (log.varid = 29469)
OR (log.varid = 50876)
OR (log.varid = 45096)
OR (log.varid = 54708)
OR (log.varid = 47475)
OR (log.varid = 54606)
OR (log.varid = 25528)
OR (log.varid = 54729))
GROUP BY to_char(ipstimestamp, ''yyyy MM DD HH24h''), row_name, objectid, category
ORDER BY to_char(ipstimestamp, ''yyyy MM DD HH24h''), row_name, objectid, category',
'SELECT DISTINCT varid
FROM loggingdb_ips_boolean ORDER BY 1;'
)
As CountsPerHour(row_name text,
"25528" integer,
"27087" integer,
"29469" integer,
"37551" integer,
"45096" integer,
"54606" integer,
"54708" integer,
"54729" integer)
PS: Note that this query can be run against test data at the following server:
host: bellariastrasse.com
database: IpsLogging
user: guest
password: guest
I am afraid what you want is not completely possible. If the return type varies, you can either:
create a function returning a generic SETOF record - but then you'd have to provide a column definition list with every call, bringing you right back to where you started; or
create a new function with a matching return type for every different case - but that's what you are trying to avoid.
If you have to write "a large number of such queries", you could use a query-generator function instead, which would not return the results but the query text, which you would then execute in a second step. Basically, a function that takes the variable parts as parameters and generates the query string in your example, declared as RETURNS text.
This can get pretty complex. Several meta-levels on top of each other have to be considered, but it is absolutely possible. Be sure to make heavy use of dollar-quoting to keep the quoting madness at bay.
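Purely to illustrate the idea (the answer above suggests doing this inside a Postgres function that RETURNS text), here is a rough client-side sketch in Python with psycopg2, using the table names and guest credentials from the question. It derives the column definition list from the same category query that crosstab uses, so the two can never drift apart:
import psycopg2  # assumed client library; any driver works the same way

conn = psycopg2.connect(host="bellariastrasse.com", dbname="IpsLogging",
                        user="guest", password="guest")
cur = conn.cursor()

# The category query (crosstab's second argument) also drives the column list.
category_sql = "SELECT DISTINCT varid FROM loggingdb_ips_boolean ORDER BY 1"
cur.execute(category_sql)
varids = [row[0] for row in cur.fetchall()]

# Build e.g. 'row_name text, "25528" integer, "27087" integer, ...'
coldefs = "row_name text, " + ", ".join('"{}" integer'.format(v) for v in varids)

# Source query taken from the question, minus the hard-wired varid filter,
# since the column list now covers every varid present in the table.
source_sql = """
    SELECT to_char(ipstimestamp, 'mon DD HH24h') AS row_name,
           ips.objectid::text AS category,
           COUNT(*)::integer AS value
    FROM loggingdb_ips_boolean AS log
    INNER JOIN IpsObjects AS ips ON log.Varid = ips.ObjectId
    GROUP BY to_char(ipstimestamp, 'yyyy MM DD HH24h'), row_name, objectid, category
    ORDER BY to_char(ipstimestamp, 'yyyy MM DD HH24h'), row_name, objectid, category
"""

# Dollar-quote the embedded queries, as recommended, to avoid doubling quotes.
query = "SELECT * FROM crosstab($src${}$src$, $cat${}$cat$) AS CountsPerHour({})".format(
    source_sql, category_sql, coldefs)

cur.execute(query)
rows = cur.fetchall()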