Duplicate row but update a field in json - json

I'm trying to duplicate all rows in my table whose signal_key contains "signalVersion": "prod", but in each duplicated row I'd like to set signalVersion to "0". The table has two columns, signal_key and signal_value, and both hold JSON objects. signalVersion is a property of the object in the signal_key column. Below is an example of signal_key:
{
"signalType": "OCR_ITEM",
"signalVersion": "prod"
}
This is the code I've written so far but it's failing. Does anyone know why my syntax is incorrect?
insert into signals (signal_key, signal_value)
select signal_key -> '{"signalVersion": "0"}', signal_value
from signals
where signal_key #> '{"signalVersion": "prod"}';

You'll want to use the || operator (jsonb concatenation) to merge {"signalVersion": "0"} into signal_key, not ->:
insert into signals (signal_key, signal_value)
select signal_key || '{"signalVersion": "0"}', signal_value
from signals
where signal_key @> '{"signalVersion": "prod"}';
Alternatively, you could use jsonb_set:
insert into signals (signal_key, signal_value)
select jsonb_set(signal_key, '{signalVersion}', '"0"'), signal_value
from signals
where signal_key @> '{"signalVersion": "prod"}';
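Note that ||, jsonb_set and the @> containment test are all defined for jsonb rather than json. If your columns are of plain json type, a minimal sketch with explicit casts (assuming you want to keep the stored type as json) would be:
insert into signals (signal_key, signal_value)
select (signal_key::jsonb || '{"signalVersion": "0"}'::jsonb)::json, signal_value
from signals
where signal_key::jsonb @> '{"signalVersion": "prod"}';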

Related

Honeysql - Clojure - How to pass a vector as arguments into Values in insert-into statement

I'm working in Clojure, using HoneySQL to format SQL statements.
The statements will run on a MySQL DB.
I'm looking for a way to pass an argument in the params and use it as the values in the insert-into statement.
This is the statement I'm aiming for:
INSERT INTO some_table
(a, b, c)
VALUES (1, 2, 3),
(4, 5, 6);
My insert attempt:
(def insert-demo
  (-> (insert-into :some_table)
      (columns :a :b :c)
      (values [:param :vals])))
This is how I'm calling the format function:
(sql/format dbcst/insert-demo {:inline true
                               :pretty true
                               :params {:vals [[1 2 3] [4 5 6]]}})
Unfortunately this won't work.
I tried a few options in the values section, with no success so far.
(values [:?vals])
(values :?vals)
Is there a way to pass a vector and use it as a param?
To be clear: I'm looking for a way to pass values dynamically.
Thanks
Try this:
(def insert-demo
  (-> (insert-into :some_table)
      (columns :a :b :c)
      (values [[1 2 3]
               [4 5 6]])))
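If the rows need to come in dynamically rather than being hard-coded, one option is to build the statement at call time from whatever vector of row vectors you receive. A minimal sketch (insert-demo and rows are just illustrative names here):
(defn insert-demo [rows]
  (-> (insert-into :some_table)
      (columns :a :b :c)
      (values rows)))

(sql/format (insert-demo [[1 2 3] [4 5 6]]) {:inline true :pretty true})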

How to fetch data from MySQL in parallel with Sequel Pro in R

I want to fetch data from MySQL (Sequel Pro) in R, but when I run the queries it takes ages.
Here is my code:
old_value <- data.frame()
new_value <- data.frame()
counter <- 0
for (i in 1:length(short_list$id)) {
  mydb = OpenConn(dbname = '**', user = '**', password = '**', host = '**')
  query <- paste0("select * from table where id IN (", short_list$id[i], ") and country IN ('", short_list$country[i], "') and date >= '2019-04-31' and `date` <= '2020-09-1';")
  temp_old <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1)
  query <- paste0("select * from table2 where id IN (", short_list$id[i], ") and country IN ('", short_list$country[i], "') and date >= '2019-04-31' and `date` <= '2020-09-1';")
  temp_new <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1)
  RMySQL::dbDisconnect(mydb)
  new_value <- rbind(temp_new, new_value)
  old_value <- rbind(temp_old, old_value)
  counter <- counter + 1
  base::print(paste("completed for ", counter))
}
Is there any way to write this more efficiently and make the queries faster? I have around 5000 rows that have to go through the loop. The query itself works, but it takes a long time.
I have tried the following, but it still gives me an error:
# parallel computing
clust <- makeCluster(length(6))
clusterEvalQ(cl = clust, expr = lapply(c('data.table', "RMySQL", "dplyr", "plyr"), library, character.only = TRUE))
clusterExport(cl = clust, c('config', 'short_list'), envir = environment())
new_de <- parLapply(clust, short_list, function(id, country) {
  for (i in 1:length(short_list$id)) {
    mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
    query <- paste0("select * from table1 where id IN (", short_list$id[i], ") and country IN ('", short_list$country[i], "') and source_event_date >= date >= '2019-04-31' and `date` <= '2020-09-1';")
    temp_data <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1) %>% data.table::data.table()
    RMySQL::dbDisconnect(mydb)
    return(temp_data)
  }
})
stopCluster(clust)
gc(reset = T)
new_de <- data.table::rbindlist(new_de, use.names = TRUE)
I have also defined short_list as a list, as follows:
short_list <- as.list(short_list)
Inside short_list is:
id    country
2 US
3 UK
... ...
However, it gives me this error:
Error in checkForRemoteErrors(val) :
  one node produced an error: object 'i' not found
And when I remove i from id[i] and country[i], it only gives me the result for the first row, not the results for all ids and countries.
I think an alternative is to upload the ids you need into a temporary table, and query for everything at once.
tmptable <- "mytemptable"
dbWriteTable(conn, tmptable, short_list, create = TRUE)
alldat <- dbGetQuery(conn, paste("
select t1.*
from ", tmptable, " tmp
left join table1 t1 on tmp.id=t1.id and tmp.country=t1.country
where t1.`date` >= '2019-04-31' and t1.`date` <= '2020-09-1'"))
dbExecute(conn, paste("drop table", tmptable))
(Many DBMSes use a leading # to indicate a temporary table that is only visible to the local user, is much less likely to clash in the schema namespace, and is automatically cleaned up when the connection is closed. I generally encourage the use of temp tables here; check with your DB docs, schema, and/or DBA for more info.)
The order of tables is important: by selecting everything from mytemptable and then left-joining table1 onto it, we effectively filter out any data from table1 that does not have a matching id and country.
This doesn't solve the speed of data download, but some thoughts on that:
Each time you iterate through the queries, you have not-insignificant overhead; if there's a lot of data then this overhead should not be huge, but it's still there. Using a single query will reduce this overhead significantly.
Query time can also be affected by any index(ices) on the tables. Outside the scope of this discussion, but might be relevant if you have a large-ish table. If the table is not indexed efficiently (or the query is not structured well to use those indices), then each query will take a finite amount of time to "compile" and return data. Again, overhead that will be reduced with a single more-efficient query.
Large queries might benefit from using the command-line tool mysql; it is about as fast as you're going to get, and might iron over any issues in RMySQL and/or DBI. (I'm not saying they are inefficient, but ... it is unlikely that a free open-source driver will be faster than MySQL's own command-line utility.)
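For illustration, here is a rough sketch of shelling out to the mysql client from R; the host/user/database names and the date range are placeholders, and this assumes the client is installed and on your PATH:
# --batch prints tab-separated output with a header row, which fread can ingest
qry <- "select * from table1 where `date` >= '2019-05-01' and `date` <= '2020-09-01'"
system2("mysql",
        args = c("--host=myhost", "--user=myuser", "--password=mypass",
                 "--batch", "-e", shQuote(qry), "mydatabase"),
        stdout = "table1_dump.tsv")
alldat <- data.table::fread("table1_dump.tsv", sep = "\t")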
As for doing this in parallel ...
You're using parLapply incorrectly. It accepts a single vector/list and iterates over each object in that list. You might use it to iterate over the indices of a frame, but you cannot use it to iterate over multiple columns within that frame. This is exactly like base R's lapply.
Let's show what is going on when you do your call. I'll replace it with lapply (because debugging in multiple processes is difficult).
# parLapply(clust, mtcars, function(id, country) { ... })
lapply(mtcars, function(id, country) { browser(); 1; })
# Called from: FUN(X[[i]], ...)
# debug at #1: [1] 1
id
# [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2
# [24] 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
country
# Error: argument "country" is missing, with no default
Because the argument (mtcars here, short_list in yours) is a data.frame, and a data.frame is a list-like object, lapply (and parLapply) operates on one column at a time. You were hoping that it would "unzip" the data, applying the first column's value to id and the second column's value to country. In fact, there is a function that does this: Map (and parallel's clusterMap, as I suggested in my comment). More on that later.
The intent of parallelizing things is to not use the for loop inside the parallel function. If short_list has 10 rows, and if your use of parLapply were correct, then you would be querying all rows 10 times, making your problem significantly worse. In pseudo-code, you'd be doing:
parallelize for each row in short_list:
# this portion is run simultaneously in 10 different processes/threads
for each row in short_list:
query for data related to this row
Two alternatives:
Provide a single argument to parLapply representing the rows of the frame.
new_de <- parLapply(clust, seq_len(NROW(short_list)), function(rownum) {
  mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
  on.exit({ DBI::dbDisconnect(mydb) })
  tryCatch(
    DBI::dbGetQuery(mydb, "
      select * from table1
      where id = ? and country = ?
        and `date` >= '2019-04-31' and `date` <= '2020-09-1'",
      params = list(short_list$id[rownum], short_list$country[rownum])),
    error = function(e) e)
})
Use clusterMap for the same effect.
new_de <- clusterMap(clust, function(id, country) {
  mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
  on.exit({ DBI::dbDisconnect(mydb) })
  tryCatch(
    DBI::dbGetQuery(mydb, "
      select * from table1
      where id = ? and country = ?
        and `date` >= '2019-04-31' and `date` <= '2020-09-1'",
      params = list(id, country)),
    error = function(e) e)
}, short_list$id, short_list$country)
If you are not familiar with Map, it is like "zipping" together multiple vectors/lists. For example:
myfun1 <- function(i) paste(i, "alone")
lapply(1:3, myfun1)
### "unrolls" to look like
list(
myfun1(1),
myfun1(2),
myfun1(3)
)
myfun3 <- function(i,j,k) paste(i, j, k, sep = '-')
Map(f = myfun3, 1:3, 11:13, 21:23)
### "unrolls" to look like
list(
myfun3(1, 11, 21),
myfun3(2, 12, 22),
myfun3(3, 13, 23)
)
Some liberties I took in that adapted code:
I shifted from the dbSendQuery/dbFetch double-tap to a single call to dbGetQuery.
I'm using DBI functions, since DBI functions provide a superset of what each driver's package provides. (You're likely using some of it anyway, perhaps without realizing it.) You can switch back with no issue.
I added tryCatch, since sometimes errors can be difficult to deal with in parallel processes. This means you'll need to check the return value from each of your processes to see if it either inherits(ret, "error") (problem) or is.data.frame(ret) (normal); see the sketch after this list.
I used on.exit so that even if there's a problem, the connection closure should still occur.
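As a small illustration of that check (a sketch only; new_de is the list returned by parLapply or clusterMap above):
errs <- Filter(function(x) inherits(x, "error"), new_de)
if (length(errs)) warning(length(errs), " queries failed; inspect 'errs' for details")
good <- Filter(is.data.frame, new_de)
new_de <- data.table::rbindlist(good, use.names = TRUE)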

Grouping multiple rows in R

I've generated a heatmap in R for microbiome data, using the following link
My data, as far as the row names are concerned, looks like this:
781
782
783
547
519
575
044
045
049
I want to group 781-783, 547-575 and 044-049 as individual groups and give each group a separate colour, using the idea below:
Assigning animals to different groups (2 random groups in this case)
var1 <- round(runif(n=12, min=1, max=2))
var1 <- replace (var1, which(var1 == 1), "deepskyblue")
var1 <- replace (var1, which(var1 == 2), "magenta")
cbind(row.names(data.prop), var1)
How do I go about it? I understand that the above code randomly generates 2 groups, but how can I specify which rows go into which group?
Thank you,
Susheel
Because row names are necessarily character and the only good range operator in R is ":" for numeric values, you need to coerce the numeric ranges to the desired "0nn" format. This is untested in the absence of a proper test case (which questioners are asked to provide):
# look at...
sprintf("%03i", c(781:783, 547:575, 44:49))
# then....
data.prop[sprintf("%03i", c(781:783, 547:575, 44:49)), 'var1'] <-
  unlist(mapply(function(clr, rng) { rep(clr, length(rng)) },
                c("deepskyblue", "magenta", "green"),
                list(781:783, 547:575, 44:49)))
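An alternative sketch of the same idea (assuming the row names of data.prop follow the "0nn" format shown in the question) is to build a named colour vector once and look each row name up in it:
grp <- c(rep("deepskyblue", length(781:783)),
         rep("magenta",     length(547:575)),
         rep("green",       length(44:49)))
names(grp) <- sprintf("%03i", c(781:783, 547:575, 44:49))
# one colour per row of data.prop, in row order
var1 <- grp[row.names(data.prop)]
cbind(row.names(data.prop), var1)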

How do I sum up properties of a JSON object in CoffeeScript?

I have an object that looks like this one:
object =
  title: 'an object'
  properties:
    attribute1:
      random_number: 2
      attribute_values:
        a: 10
        b: 'irrelevant'
    attribute2:
      random_number: 4
      attribute_values:
        a: 15
        b: 'irrelevant'
  some_random_stuff: 'random stuff'
I want to extract the sum of the 'a' values on attribute1 and attribute2.
What would be the best way to do this in CoffeeScript?
(I have already found one way to do it but that just looks like Java-translated-to-coffee and I was hoping for a more elegant solution.)
Here is what I came up with (edited to be more generic based on comment):
sum_attributes = (x) =>
  sum = 0
  for name, value of object.properties
    sum += value.attribute_values[x]
  sum

alert sum_attributes('a') # 25
alert sum_attributes('b') # 0irrelevantirrelevant
So, that does what you want... but it probably doesn't do exactly what you want with strings.
You might want to pass in the accumulator seed, like sum_attributes 0, 'a' and sum_attributes '', 'b' (see the sketch below).
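A minimal sketch of that seeded variant (purely illustrative; it just threads the seed through the same loop):
sum_attributes = (seed, x) ->
  sum = seed
  for name, value of object.properties
    sum += value.attribute_values[x]
  sum

alert sum_attributes(0, 'a')   # 25
alert sum_attributes('', 'b')  # 'irrelevantirrelevant'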
Brian's answer is good. But if you wanted to bring in a functional programming library like Underscore.js, you could write a more succinct version:
sum = (arr) -> _.reduce arr, ((memo, num) -> memo + num), 0
sum _.pluck(_.pluck(object.properties, 'attribute_values'), 'a')
Using a comprehension:
total = (attr.attribute_values.a for key, attr of object.properties).reduce (a, b) -> a + b
or
sum = (arr) -> arr.reduce((a, b) -> a + b)
total = sum (attr.attribute_values.a for k, attr of object.properties)

Subsetting a data frame in a function using another data frame as parameter

I would like to submit a data frame to a function and use it to subset another data frame.
This is the basic data frame:
foo <- data.frame(var1= c(1, 1, 1, 2, 2, 3), var2=c('A', 'A', 'B', 'B', 'C', 'C'))
I use the following function to find out the frequencies of var2 for specified values of var1.
foobar <- function(x, y, z) {
  a <- subset(x, (x$var1 == y))
  b <- subset(a, (a$var2 == z))
  n <- nrow(b)
  return(n)
}
Examples:
foobar(foo, 1, "A") # returns 2
foobar(foo, 1, "B") # returns 1
foobar(foo, 3, "C") # returns 1
This works. But now I want to submit a data frame of values to foobar. Instead of the above examples, I would like to submit df to foobar and get the same results as above (2, 1, 1)
df <- data.frame(var1=c(1, 1, 3), var2=c("A", "B", "C"))
When I change foobar to accept two arguments, like foobar(foo, df), and use y[, c(var1)] and y[, c(var2)] inside the function instead of the separate y and z parameters, it still doesn't work. How can I do this?
edit1: last paragraph clarified
edit2: var1 type corrected
Try this:
library(plyr)
match_df <- function(x, match) {
  vars <- names(match)
  # Create a unique id per row, computed across both data frames so the codes correspond
  keys <- id(rbind(x[vars], match[vars]))
  x_id <- keys[seq_len(nrow(x))]
  match_id <- keys[nrow(x) + seq_len(nrow(match))]
  # For each row of match, find the (first) matching row of x
  x[match(match_id, x_id, nomatch = 0), ]
}
match_df(foo, df)
#   var1 var2
# 1    1    A
# 3    1    B
# 6    3    C
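Note that this returns the matching rows rather than the counts. If what you're after is the frequencies themselves (2, 1, 1), a simple sketch (not part of the original answer) that counts the matches for each row of df directly:
sapply(seq_len(nrow(df)), function(i)
  sum(foo$var1 == df$var1[i] & foo$var2 == df$var2[i]))
# [1] 2 1 1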
Your function foobar is expecting three arguments, and you only supplied two arguments to it with foobar(foo, df). You can use apply to get what you want:
apply(df, 1, function(x) foobar(foo, x[1], x[2]))
And in use:
> apply(df, 1, function(x) foobar(foo, x[1], x[2]))
[1] 2 1 1
To respond to your edit:
I'm not entirely sure what y[, c(var1)] means, but here's an attempt at trying to figure out what you are trying to do.
What I think you were trying to do was: foobar(foo, y = df[, "var1"], z = df[, "var2"]).
First, note that the use of c() is not needed here; you can reference the columns you want by placing the name of the column in quotes OR reference the column by number (as I did above). Secondly, df[, "var1"] returns all of the rows for the column named var1, which has a length of three:
> length(df[, "var1"])
[1] 3
The function you defined is not set up to deal with vectors of length greater than 1. That is why we need to iterate through each row of your dataframe to grab a single value, process it, and then go to the next row in the data.frame. That is what the apply function does. It is equivalent to saying something along the lines of for (i in 1:nrow(df)), but is a more idiomatic way of handling such issues.
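For comparison, here is a sketch of the explicit loop that the apply() call replaces (same result, just more verbose):
res <- integer(nrow(df))
for (i in seq_len(nrow(df))) {
  res[i] <- foobar(foo, df$var1[i], df$var2[i])
}
res
# [1] 2 1 1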
Finally, is there a reason you generated var1 as a factor? It probably makes more sense to treat these as numeric in my opinion. Compare:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ var1: Factor w/ 2 levels "1","3": 1 1 2
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
Versus
> df2 <- data.frame(var1=c(1,1,3), var2=c("A", "B", "C"))
> str(df2)
'data.frame': 3 obs. of 2 variables:
$ var1: num 1 1 3
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
In summary - apply is the function you are after here. You may want to spend some time thinking about whether your data should be numeric or a factor, but apply is still what you want.
foobar2 <- function(x, df) {
  .dofun <- function(y, z) {
    a <- subset(x, x$var1 == y)
    b <- subset(a, a$var2 == z)
    n <- nrow(b)
    return(n)
  }
  ans <- mapply(.dofun, as.character(df$var1), as.character(df$var2))
  names(ans) <- NULL
  return(ans)
}
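Calling it with the example data from the question should reproduce the same counts:
foobar2(foo, df)
# [1] 2 1 1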