Split a column string into multiple column strings - MySQL

I have an entry in the table that is a string delimited by semicolons. Is it possible to split the string into separate columns? I've been looking online and on Stack Overflow, and I couldn't find an approach that splits the string into columns.
The entry in the table looks something like this (anything in brackets [] is not actually in my table. Just there to make things clearer):
sysinfo [column]
miscInfo ; vendor: aaa ; bootr: bbb; revision: ccc; model: ddd [string a]
miscInfo ; vendor: aaa ; bootr: bbb; revision: ccc; model: ddd [string b]
...
There are a little over one million entries with strings that look like this. Is it possible in MySQL to write a query that returns the following
miscInfo, Vendor, Bootr, Revision , Model [columns]
miscInfo_a, vendor_a, bootr_a, revision_a, model_a
miscInfo_b, vendor_b, bootr_b, revision_b, model_b
...
for all of the rows in the table, where the comma indicates a new column?
Edit:
Here's some input and output as Bohemian requested.
sysinfo [column]
Modem <<HW_REV: 04; VENDOR: Arris ; BOOTR: 6.xx; SW_REV: 5.2.xxC; MODEL: TM602G>>
<<HW_REV: 1; VENDOR: Motorola ; BOOTR: 216; SW_REV: 2.4.1.5; MODEL: SB5101>>
Thomson DOCSIS Cable Modem <<HW_REV: 4.0; VENDOR: Thomson; BOOTR: 2.1.6d; SW_REV: ST52.01.02; MODEL: DCM425>>
Some can be longer entries but they all have similar format. Here is what I would like the output to be:
miscInfo, vendor, bootr, revision, model [columns]
04, Arris, 6.xx, 5.2.xxC, TM602G
1, Motorola, 216, 2.4.1.5, SB5101
4.0, Thomson, 2.1.6d, ST52.01.02, DCM425

You could make use of String functions (particularly substr) in mysql: http://dev.mysql.com/doc/refman/5.0/en/string-functions.html

Please take a look at how I've split my coordinates column into 2 lat/lng columns:
UPDATE shops_locations L
LEFT JOIN shops_locations L2 ON L2.id = L.id
SET L.coord_lat = SUBSTRING(L2.coordinates, 1, LOCATE('|', L2.coordinates) - 1),
L.coord_lng = SUBSTRING(L2.coordinates, LOCATE('|', L2.coordinates) + 1)
Overall, I followed the UPDATE JOIN advice from "MySQL - UPDATE query based on SELECT Query" and the STR_SPLIT question "Split value from one field to two".
Yes, I'm only splitting into 2 columns, and SUBSTRING might not work as well for your case, but I hope this helps :)
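If doing the split on the client side is acceptable, the parsing logic can be sketched in Python. This is only an illustration of the splitting step, not a MySQL solution: the helper name split_sysinfo and the regular expression are assumptions, and the sketch assumes every value is written as "KEY: value" between << and >>, as in the sample rows from the question.

```python
import re

# Hypothetical pattern: a word-like key, a colon, then everything up to ';' or '>'.
PAIR_RE = re.compile(r"(\w+)\s*:\s*([^;>]+)")


def split_sysinfo(sysinfo):
    """Return a dict of the KEY: value pairs found between << and >>."""
    start = sysinfo.find("<<")
    end = sysinfo.rfind(">>")
    # Fall back to the whole string if the << >> markers are missing.
    body = sysinfo[start + 2:end] if start != -1 and end != -1 else sysinfo
    return {key.upper(): value.strip() for key, value in PAIR_RE.findall(body)}


# Sample rows copied from the question.
rows = [
    "Modem <<HW_REV: 04; VENDOR: Arris ; BOOTR: 6.xx; SW_REV: 5.2.xxC; MODEL: TM602G>>",
    "<<HW_REV: 1; VENDOR: Motorola ; BOOTR: 216; SW_REV: 2.4.1.5; MODEL: SB5101>>",
]
columns = ["HW_REV", "VENDOR", "BOOTR", "SW_REV", "MODEL"]

for row in rows:
    fields = split_sysinfo(row)
    # A key that is absent from a row becomes an empty column.
    print(", ".join(fields.get(col, "") for col in columns))
```

Looking values up by key rather than by position means rows that list the keys in a different order, or omit one, still land in the right column.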

Related

How to fetch data in parallel from MySQL with Sequel Pro in R

I want to fetch data from MySQL with Sequel Pro in R, but when I run the query it takes ages.
Here is my code:
old_value<- data.frame()
new_value<- data.frame()
counter<- 0
for (i in 1:length(short_list$id)) {
mydb = OpenConn(dbname = '**', user = '**', password = '**', host = '**')
query <- paste0("select * from table where id IN (",short_list$id[i],") and country IN ('",short_list$country[i],"') and date >= '2019-04-31' and `date` <= '2020-09-1';", sep = "" )
temp_old <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1)
query <- paste0("select * from table2 where id IN (",short_list$id[i],") and country IN ('",short_list$country[i],"') and date >= '2019-04-31' and `date` <= '2020-09-1';", sep = "" )
temp_new <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1)
RMySQL::dbDisconnect(mydb)
new_value<- rbind(temp_new,new_value)
old_value<- rbind(temp_old,old_value)
counter=counter+1
base::print(paste("completed for ",counter),sep="")
}
Is there any way I can write this more efficiently and run the queries faster? I have around 5000 rows that need to go through the loop. The query works, but it takes a long time.
I have tried this, but it still gives me an error:
# parallel computing
clust <- makeCluster(length(6))
clusterEvalQ(cl = clust, expr = lapply(c('data.table',"RMySQL","dplyr","plyr"), library, character.only = TRUE))
clusterExport(cl = clust, c('config','short_list'), envir = environment())
new_de <- parLapply(clust, short_list, function(id,country) {
for (i in 1:length(short_list$id)) {
mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
query <- paste0("select * from table1 where id IN (",short_list$id[i],") and country IN ('",short_list$country[i],"') and source_event_date >= date >= '2019-04-31' and `date` <= '2020-09-1';", sep = "" )
temp_data <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1) %>% data.table::data.table()
RMySQL::dbDisconnect(mydb)
return(temp_data)}
})
stopCluster(clust)
gc(reset = T)
new_de <- data.table::rbindlist(new_de, use.names = TRUE)
I have also defined the list of short_list as following:
short_list<- as.list(short_list)
and inside short_list is:
id. country
2 US
3 UK
... ...
However it gives me this error:
Error in checkForRemoteErrors(val) :
one node produced an error: object 'i' not found
However, when I remove i from id[i] and country[i], it only gives me the result for the first row instead of the results for all ids and countries.
I think an alternative is to upload the ids you need into a temporary table, and query for everything at once.
tmptable <- "mytemptable"
dbWriteTable(conn, tmptable, short_list, create = TRUE)
alldat <- dbGetQuery(conn, paste("
select t1.*
from ", tmptable, " tmp
left join table1 t1 on tmp.id=t1.id and tmp.country=t1.country
where t1.`date` >= '2019-04-31' and t1.`date` <= '2020-09-1'"))
dbExecute(conn, paste("drop table", tmptable))
(Many DBMSes use a leading # to indicate a temporary table that is only visible to the local user, is much less likely to clash in the schema namespace, and is automatically cleaned when the connection is closed. I generally encourage use of temp-tables here, check with your DB docs, schema, and/or DBA for more info here.)
The order of tables is important: by pulling all from mytemptable and then left join table1 onto it, we are effectively filtering out any data from table1 that does not include a matching id and country.
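The upload-then-join idea can be tried end to end with a small self-contained Python sketch. It uses an in-memory SQLite database purely as a stand-in for MySQL, and the table name wanted, the column event_date, and the sample rows are all hypothetical. The date bounds are written as valid ISO dates, which is an assumption about what the original filter intended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real MySQL connection

# Hypothetical data table; event_date plays the role of the `date` column.
conn.execute("CREATE TABLE table1 (id INTEGER, country TEXT, event_date TEXT, value REAL)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (2, "US", "2019-06-01", 1.0),  # wanted pair, inside the date range
        (2, "UK", "2019-06-01", 2.0),  # id matches but country does not
        (3, "UK", "2020-01-15", 3.0),  # wanted pair, inside the date range
        (4, "US", "2019-07-01", 4.0),  # id not in the short list
        (3, "UK", "2018-01-01", 5.0),  # wanted pair, outside the date range
    ],
)

short_list = [(2, "US"), (3, "UK")]

# Upload the keys once instead of issuing one query per key.
conn.execute("CREATE TEMP TABLE wanted (id INTEGER, country TEXT)")
conn.executemany("INSERT INTO wanted VALUES (?, ?)", short_list)

# A single query returns every matching row for all (id, country) pairs.
rows = conn.execute(
    """
    SELECT t1.id, t1.country, t1.event_date, t1.value
    FROM wanted AS w
    JOIN table1 AS t1 ON t1.id = w.id AND t1.country = w.country
    WHERE t1.event_date >= ? AND t1.event_date <= ?
    ORDER BY t1.id
    """,
    ("2019-04-30", "2020-09-01"),
).fetchall()

print(rows)
```

The sketch uses a plain JOIN; with a WHERE clause on the joined table's columns, a LEFT JOIN would drop the unmatched keys as well, so the filtering effect is the same.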
This doesn't solve the speed of data download, but some thoughts on that:
Each time you iterate through the queries, you have not-insignificant overhead; if there's a lot of data then this overhead should not be huge, but it's still there. Using a single query will reduce this overhead significantly.
Query time can also be affected by any index(ices) on the tables. Outside the scope of this discussion, but might be relevant if you have a large-ish table. If the table is not indexed efficiently (or the query is not structured well to use those indices), then each query will take a finite amount of time to "compile" and return data. Again, overhead that will be reduced with a single more-efficient query.
Large queries might benefit from using the command-line tool mysql; it is about as fast as you're going to get, and might iron out any issues in RMySQL and/or DBI. (I'm not saying they are inefficient, but ... it is unlikely that a free open-source driver will be faster than MySQL's own command-line utility.)
As for doing this in parallel ...
You're using parLapply incorrectly. It accepts a single vector/list and iterates over each object in that list. You might use it iterating over the indices of a frame, but you cannot use it to iterate over multiple columns within that frame. This is exactly like base R's lapply.
Let's show what is going on when you do your call. I'll replace it with lapply (because debugging in multiple processes is difficult).
# parLapply(clust, mtcars, function(id, country) { ... })
lapply(mtcars, function(id, country) { browser(); 1; })
# Called from: FUN(X[[i]], ...)
debug at #1: [1] 1
id
# [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2
# [24] 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
country
# Error: argument "country" is missing, with no default
Because the argument (mtcars here, short_list in yours) is a data.frame, and a data.frame is a list-like object, lapply (and parLapply) operate on one column at a time. You were hoping that it would "unzip" the data, applying the first column's value to id and the second column's value to country. In fact, there is a function that does this: Map (and the parallel package's clusterMap, as I suggested in my comment). More on that later.
The intent of parallelizing things is to not use the for loop inside the parallel function. If short_list has 10 rows, and if your use of parLapply were correct, then you would be querying all rows 10 times, making your problem significantly worse. In pseudo-code, you'd be doing:
parallelize for each row in short_list:
# this portion is run simultaneously in 10 different processes/threads
for each row in short_list:
query for data related to this row
Two alternatives:
Provide a single argument to parLapply representing the rows of the frame.
new_de <- parLapply(clust, seq_len(NROW(short_list)), function(rownum) {
mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
on.exit({ DBI::dbDisconnect(mydb) })
tryCatch(
DBI::dbGetQuery(mydb, "
select * from table1
where id=? and country=?
and source_event_date >= date >= '2019-04-31' and `date` <= '2020-09-1'",
params = list(short_list$id[rownum], short_list$country[rownum])),
error = function(e) e)
})
Use clusterMap for the same effect.
new_de <- clusterMap(clust, function(id, country) {
mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
on.exit({ DBI::dbDisconnect(mydb) })
tryCatch(
DBI::dbGetQuery(mydb, "
select * from table1
where id=? and country=?
and source_event_date >= date >= '2019-04-31' and `date` <= '2020-09-1'",
params = list(id, country)),
error = function(e) e)
}, short_list$id, short_list$country)
If you are not familiar with Map, it is like "zipping" together multiple vectors/lists. For example:
myfun1 <- function(i) paste(i, "alone")
lapply(1:3, myfun1)
### "unrolls" to look like
list(
myfun1(1),
myfun1(2),
myfun1(3)
)
myfun3 <- function(i,j,k) paste(i, j, k, sep = '-')
Map(f = myfun3, 1:3, 11:13, 21:23)
### "unrolls" to look like
list(
myfun3(1, 11, 21),
myfun3(2, 12, 22),
myfun3(3, 13, 23)
)
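For readers who know Python better than R, the same "zipping" behaviour can be sketched with the built-in map, which also accepts several iterables and takes one element from each per call. The function name myfun3 mirrors the R example; the columns dictionary is a hypothetical miniature of short_list.

```python
def myfun3(i, j, k):
    """Join three values with dashes, like the R myfun3 above."""
    return f"{i}-{j}-{k}"


# map() with several iterables passes one element of each to every call.
zipped = list(map(myfun3, range(1, 4), range(11, 14), range(21, 24)))

# The call above "unrolls" to look like this:
unrolled = [
    myfun3(1, 11, 21),
    myfun3(2, 12, 22),
    myfun3(3, 13, 23),
]

# By contrast, iterating over a table stored column-wise visits whole columns,
# which is what lapply/parLapply do with a data.frame.
columns = {"id": [2, 3], "country": ["US", "UK"]}
visited = [name for name in columns]

print(zipped)
print(visited)
```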
Some liberties I took in that adapted code:
I shifted from the dbSendQuery/dbFetch double-tap to a single call to dbGetQuery.
I'm using DBI functions, since DBI functions provide a superset of what each driver's package provides. (You're likely using some of it anyway, perhaps without realizing it.) You can switch back with no issue.
I added tryCatch, since sometimes errors can be difficult to deal with in parallel processes. This means you'll need to check the return value from each of your processes to see if either inherits(ret, "error") (problem) or is.data.frame (normal).
I used on.exit so that even if there's a problem, the connection closure should still occur.
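The pattern of returning errors as values from parallel workers and sorting them out afterwards is not specific to R. A minimal Python sketch of the same idea follows; fetch_one is a hypothetical stand-in for the per-row database query, and threads are used here instead of a process cluster only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_one(row_id, country):
    """Hypothetical stand-in for one per-row query; returns rows or an error."""
    try:
        if country not in {"US", "UK"}:
            raise ValueError(f"unknown country: {country}")
        return [{"id": row_id, "country": country}]
    except Exception as exc:
        # Return the error instead of raising it, like tryCatch(..., error = function(e) e).
        return exc


ids = [2, 3, 4]
countries = ["US", "UK", "??"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # Executor.map zips the iterables, so each call gets one id and one country.
    results = list(pool.map(fetch_one, ids, countries))

# Separate failed calls from successful ones after the parallel part is done.
failures = [r for r in results if isinstance(r, Exception)]
frames = [r for r in results if not isinstance(r, Exception)]

print(len(frames), "succeeded,", len(failures), "failed")
```

Keeping the failures as ordinary return values means one bad row does not abort the whole batch, and the failed inputs can be retried or reported later.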

How to truncate double precision value in PostgreSQL by keeping exactly first two decimals?

I'm trying to truncate a double precision value when I build JSON using the json_build_object() function in PostgreSQL 11.8, but with no luck. More precisely, I'm trying to truncate the number 19.9899999999999984 to ONLY two decimals, making sure it is NOT rounded to 20.00 (which is what happens), but kept at 19.98.
BTW, what I've tried so far was to use:
1) TRUNC(found_book.price::numeric, 2) and I get value 20.00
2) ROUND(found_book.price::numeric, 2) and I get value 19.99 -> so far this is the closest value, but not what I need
3) ROUND(found_book.price::double precision, 2) and I get
[42883] ERROR: function round(double precision, integer) does not exist
Also here is whole code I'm using:
create or replace function public.get_book_by_book_id8(b_id bigint) returns json as
$BODY$
declare
found_book book;
book_authors json;
book_categories json;
book_price double precision;
begin
-- Load book data:
select * into found_book
from book b2
where b2.book_id = b_id;
-- Get assigned authors
select case when count(x) = 0 then '[]' else json_agg(x) end into book_authors
from (select aut.*
from book b
inner join author_book as ab on b.book_id = ab.book_id
inner join author as aut on ab.author_id = aut.author_id
where b.book_id = b_id) x;
-- Get assigned categories
select case when count(y) = 0 then '[]' else json_agg(y) end into book_categories
from (select cat.*
from book b
inner join category_book as cb on b.book_id = cb.book_id
inner join category as cat on cb.category_id = cat.category_id
where b.book_id = b_id) y;
book_price = trunc(found_book.price, 2);
-- Build the JSON response:
return (select json_build_object(
'book_id', found_book.book_id,
'title', found_book.title,
'price', book_price,
'amount', found_book.amount,
'is_deleted', found_book.is_deleted,
'authors', book_authors,
'categories', book_categories
));
end
$BODY$
language 'plpgsql';
select get_book_by_book_id8(186);
How do I keep EXACTLY the FIRST two decimal digits, 19.98? (Any suggestion/help is greatly appreciated.)
P.S. PostgreSQL version is 11.8
In PostgreSQL 11.8 or 12.3 I cannot reproduce:
# select trunc('19.9899999999999984'::numeric, 2);
trunc
-------
19.98
(1 row)
# select trunc(19.9899999999999984::numeric, 2);
trunc
-------
19.98
(1 row)
# select trunc(19.9899999999999984, 2);
trunc
-------
19.98
(1 row)
Actually I can reproduce with the right type and a special setting:
# set extra_float_digits=0;
SET
# select trunc(19.9899999999999984::double precision::text::numeric, 2);
trunc
-------
19.99
(1 row)
And a possible solution:
# show extra_float_digits;
extra_float_digits
--------------------
3
(1 row)
select trunc(19.9899999999999984::double precision::text::numeric, 2);
trunc
-------
19.98
(1 row)
But note that:
Note: The extra_float_digits setting controls the number of extra
significant digits included when a floating point value is converted
to text for output. With the default value of 0, the output is the
same on every platform supported by PostgreSQL. Increasing it will
produce output that more accurately represents the stored value, but
may be unportable.
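Why the number of output digits changes the truncated result can be illustrated outside PostgreSQL. The Python sketch below is an analogy, not PostgreSQL's code: it assumes that the float-to-text conversion in this PostgreSQL version yields roughly 15 significant digits with extra_float_digits = 0 and 17 with extra_float_digits = 3, and it mimics the ::text::numeric round trip with string formatting and Decimal. The helper name trunc_via_text is hypothetical.

```python
from decimal import Decimal, ROUND_DOWN

# The literal from the question; Python stores it as the nearest binary double.
price = 19.9899999999999984


def trunc_via_text(value, significant_digits):
    """Format a float with N significant digits, then truncate to 2 decimals."""
    text = f"{value:.{significant_digits}g}"  # float -> text
    return Decimal(text).quantize(Decimal("0.01"), rounding=ROUND_DOWN)  # text -> exact decimal


digits15 = trunc_via_text(price, 15)  # analogous to extra_float_digits = 0
digits17 = trunc_via_text(price, 17)  # analogous to extra_float_digits = 3

print(digits15, digits17)
```

With 15 significant digits the text form has already been rounded up before the truncation runs, so truncating can no longer recover 19.98. With 17 digits the trailing ...98 survives and the truncation gives 19.98.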
As @pifor suggested, I've managed to get it done by directly passing trunc(found_book.price::double precision::text::numeric, 2) as the value in json_build_object, like this:
json_build_object(
'book_id', found_book.book_id,
'title', found_book.title,
'price', trunc(found_book.price::double precision::text::numeric, 2),
'amount', found_book.amount,
'is_deleted', found_book.is_deleted,
'authors', book_authors,
'categories', book_categories
)
Using book_price = trunc(found_book.price::double precision::text::numeric, 2); and passing it as value for 'price' key didn't work.
Thank you for your help. :)

Hadoop PIG with nested Json

I have a list of movies with ratings by user.
{"_id":59607,"title":"King Corn (2007)",
"genres":["Documentary"],
"ratings":[ {"userId":1860,"rating":3},
{"userId":9970,"rating":3.5},
{"userId":16929,"rating":1.5},
{"userId":23473,"rating":4},
{"userId":23733,"rating":4},
{"userId":27584,"rating":3},
{"userId":28232,"rating":4},
{"userId":29482,"rating":3},
{"userId":40976,"rating":5},
{"userId":44631,"rating":4},
{"userId":47613,"rating":3},
{"userId":49763,"rating":3},
{"userId":58160,"rating":4.5},
{"userId":62249,"rating":3},
{"userId":65923,"rating":4},
{"userId":67507,"rating":4},
{"userId":68259,"rating":3.5},
{"userId":70331,"rating":5},
{"userId":71420,"rating":3.5}
]
}
I need to count how many ratings are done by every user. This is my attempt to get in the ratings.
a = load '/movies_1m.json' using JsonLoader('id:int, title : chararray, genres : { ( genre : chararray ) }, ratings: { ( userId : int, rating: float) } ');
then
b = FOREACH a GENERATE FLATTEN(ratings);
describe gives me the following:
b: {ratings::userId: int,ratings::rating: float}
To count the ratings per user, I need to access the fields inside ratings, but this is where it fails. I tried this:
c = FOREACH b GENERATE COUNT(ratings);
It gives me an error.
I need to get something like this:
{userId: int, rating: float}
You need to GROUP in order to COUNT since that is an aggregate operation.
b = FOREACH a GENERATE FLATTEN(ratings);
gr = GROUP b by ratings::userId;
c = FOREACH gr GENERATE group,COUNT($1);
\d c
Output
Note, none of the users in your example repeat, so these are all one.
(1860,1)
(9970,1)
(16929,1)
(23473,1)
(23733,1)
(27584,1)
(28232,1)
(29482,1)
(40976,1)
(44631,1)
(47613,1)
(49763,1)
(58160,1)
(62249,1)
(65923,1)
(67507,1)
(68259,1)
(70331,1)
(71420,1)
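The flatten, group and count steps can also be checked on a small sample outside Pig. The Python sketch below mirrors the same three steps with collections.Counter; the second movie is hypothetical and is included only so that one user appears more than once.

```python
from collections import Counter

# First entry abbreviated from the question; the second one is made up.
movies = [
    {
        "_id": 59607,
        "title": "King Corn (2007)",
        "ratings": [{"userId": 1860, "rating": 3}, {"userId": 9970, "rating": 3.5}],
    },
    {
        "_id": 1,
        "title": "Hypothetical second movie",
        "ratings": [{"userId": 1860, "rating": 4}],
    },
]

# FLATTEN(ratings): one item per rating. GROUP BY userId + COUNT: Counter.
ratings_per_user = Counter(
    rating["userId"] for movie in movies for rating in movie["ratings"]
)

for user_id, count in sorted(ratings_per_user.items()):
    print((user_id, count))
```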

Grouping multiple rows in R

I've generated a heatmap in R for microbiome data, using the following link
My data as far as rows is concerned looks like this:
781
782
783
547
519
575
044
045
049
If I want to group 781-783, 547-575 and 044-049 as individual groups and give them separate colours using the below idea:
Assigning animals to different groups (2 random groups in this case)
var1 <- round(runif(n=12, min=1, max=2))
var1 <- replace (var1, which(var1 == 1), "deepskyblue")
var1 <- replace (var1, which(var1 == 2), "magenta")
cbind(row.names(data.prop), var1)
How do I go about it? I understand that the above code, randomly generates 2 groups, but how can I specify which rows go into which group?
Thank you,
Susheel
Because rownames are of necessity character and the only good range-operator in R is ":" for numeric values: you need to coerce ranges to the desired "0nn" format. This is untested in the absence of a proper test case (which questioners are asked to provide):
#look at...
sprintf("%03i", c(781:783, 547:575, 044:049))
# then....
data.prop[ sprintf("%03i", c(781:783, 547:575, 44:49)), 'var1'] <-
# unlist() flattens the per-group colour vectors into one vector
unlist(mapply(function(clr, rng) {rep(clr, length(rng) )},
c("deepskyblue", "magenta", "green"),
list( 781:783, 547:575, 44:49)
))
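The underlying idea, an explicit lookup from row name to group colour, can be sketched in Python. The group membership below is read off the row labels in the question, and the third colour follows the R snippet above; both are assumptions about the intended grouping. The labels are listed one by one rather than as numeric ranges, since a range such as 547 to 575 would also cover labels that are not in the data.

```python
# Hypothetical grouping: each colour lists the row labels that belong to it.
groups = {
    "deepskyblue": ["781", "782", "783"],
    "magenta": ["547", "519", "575"],
    "green": ["044", "045", "049"],
}

# Invert the mapping so each row label points at its colour.
colour_of = {row: colour for colour, rows in groups.items() for row in rows}

# Row names in the order they appear in the data.
row_names = ["781", "782", "783", "547", "519", "575", "044", "045", "049"]
row_colours = [colour_of[name] for name in row_names]

for name, colour in zip(row_names, row_colours):
    print(name, colour)
```

Keeping the labels as strings preserves the leading zeros in names such as "044".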

How do I sum up properties of a JSON object in CoffeeScript?

I have an object that looks like this one:
object =
title : 'an object'
properties :
attribute1 :
random_number: 2
attribute_values:
a: 10
b: 'irrelevant'
attribute2 :
random_number: 4
attribute_values:
a: 15
b: 'irrelevant'
some_random_stuff: 'random stuff'
I want to extract the sum of the 'a' values on attribute1 and attribute2.
What would be the best way to do this in Coffeescript?
(I have already found one way to do it but that just looks like Java-translated-to-coffee and I was hoping for a more elegant solution.)
Here is what I came up with (edited to be more generic based on comment):
sum_attributes = (x) =>
sum = 0
for name, value of object.properties
sum += value.attribute_values[x]
sum
alert sum_attributes('a') # 25
alert sum_attributes('b') # 0irrelevantirrelevant
So, that does what you want... but it probably doesn't do exactly what you want with strings.
You might want to pass in the accumulator seed, like sum_attributes 0, 'a' and sum_attributes '', 'b'
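The seeded-accumulator idea can be sketched in Python to show why the seed matters. This is an illustration rather than CoffeeScript: the dictionary mirrors the object from the question, and the argument order of sum_attributes follows the suggestion above.

```python
from functools import reduce
from operator import add

# Same shape as the CoffeeScript object in the question.
obj = {
    "title": "an object",
    "properties": {
        "attribute1": {"random_number": 2, "attribute_values": {"a": 10, "b": "irrelevant"}},
        "attribute2": {"random_number": 4, "attribute_values": {"a": 15, "b": "irrelevant"}},
    },
    "some_random_stuff": "random stuff",
}


def sum_attributes(seed, name):
    """Fold the named attribute value of every property onto the seed."""
    values = (attr["attribute_values"][name] for attr in obj["properties"].values())
    return reduce(add, values, seed)


print(sum_attributes(0, "a"))   # numeric seed: numeric sum
print(sum_attributes("", "b"))  # string seed: concatenation
```

Because the seed has the same type as the values, the string case no longer starts with a stray 0.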
Brian's answer is good. But if you wanted to bring in a functional programming library like Underscore.js, you could write a more succinct version:
sum = (arr) -> _.reduce arr, ((memo, num) -> memo + num), 0
sum _.pluck(_.pluck(object.properties, 'attribute_values'), 'a')
total = (attr.attribute_values.a for key, attr of obj.properties).reduce (a,b) -> a+b
or
sum = (arr) -> arr.reduce((a, b) -> a+b)
total = sum (attr.attribute_values.a for k, attr of obj.properties)