Get statistical measures of a varchar field in Snowflake - MySQL

I have a field called MER_DATA in a snowflake table having a value as shown below:
[43,44.25,44.5,42.75,44,44.25,42.75,43,42.5,42.5,36.75,42.25,42.75,43.25,43.25,43.25,42.75,43.5,42,43,43.75,43.75,43.25,41.75,43.25,42.5,43.25,42.75,43.25,43.5,43.25,43.25,43.75,...]
Each row has approximately 4,000 numbers in it (this varies from row to row), and the data type of the field is varchar(30000). The table has around 700k rows.
Now I want to calculate the standard deviation of each row using the numbers present in the list shown above.
I have tried doing this in MySQL using the following query:
select mac, `timestamp`, std(res), min(res), max(res)
from (
    select mac, `timestamp`, r.res
    from table cmr,
         json_table(mer_data, '$[*]' columns (res float path '$')) r
) T
group by mac, `timestamp`;
which gives me the correct result but takes a long time for 700k rows.
I want to do the same in Snowflake. Is there an optimal way to do this?
The query also needs to run within 10 minutes in Snowflake; the MySQL query can take up to an hour.

Without the table definition and example source data it's difficult to produce a complete solution for your problem, but here is an example of how to do this using the STRTOK_SPLIT_TO_TABLE table function. It first splits your varchar numbers to rows, so we can then re-aggregate the values to get the standard deviation per row.
First generate some test data at the right scale:
Create or Replace Table cmr (mer_data varchar) as
With gen as (
    select
        uniform(1, 700000, random()) row_num,
        normal(50, 1, random(0))::decimal(4,2) num
    from table(generator(rowcount => 2800000000)) v
)
Select listagg(num, ',') listNums
from gen
group by row_num;
Check that we have 700k rows and a varying count of numbers per row.
Select
    count(*) row_count,
    min(REGEXP_COUNT(mer_data, '[,]')) + 1 min_num_count,
    max(REGEXP_COUNT(mer_data, '[,]')) + 1 max_num_count
from cmr;
Split the varchar number lists to rows with STRTOK_SPLIT_TO_TABLE and group by the generated SEQ column to calculate the stddev of the VALUE.
Select
    seq row_num,
    stddev(value) stdListNums,
    min(value) minNum,
    max(value) maxNum,
    count(value) countListNums
from cmr, table(STRTOK_SPLIT_TO_TABLE(mer_data, ','))
Group By 1;
For my data the query takes just over 3 minutes on an XSMALL Virtual Warehouse, and a little over 30 seconds on a LARGE Virtual Warehouse.
You can read about the STRTOK_SPLIT_TO_TABLE function in the Snowflake documentation.
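Note that the MER_DATA value in the question is wrapped in square brackets, so the edge tokens of a plain comma split would carry [ and ] characters. If the real data is a valid JSON array, one alternative (a sketch, not tested against your data) is PARSE_JSON with LATERAL FLATTEN, re-aggregating on the SEQ column in the same way:
-- Sketch assuming MER_DATA holds a valid JSON array such as '[43,44.25,...]'.
-- PARSE_JSON converts the varchar to a VARIANT; LATERAL FLATTEN emits one row
-- per array element, and SEQ identifies the source row for re-aggregation.
Select
    f.seq row_num,
    stddev(f.value::float) stdListNums,
    min(f.value::float) minNum,
    max(f.value::float) maxNum,
    count(f.value) countListNums
from cmr,
     lateral flatten(input => parse_json(mer_data)) f
Group By 1;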

Related

MySQL select value range within a value range, from a dash-separated column value?

How do I select a value range from a column where two values in it are separated by a dash, using MySQL?
Here's my example table named "example":
The user enters a low value (X) and a high value (Y).
For example X=2.5 and Y=7.2
I want to select all items where the left value is higher than X (in this case 2.5) and the right value is lower than Y (in this case 7.2). Using these X and Y values I should end up with the rows 2 and 5 as a result.
Sort of like this:
SELECT * FROM example WHERE MIN(value) > X AND MAX(value) < Y
How do I do this?
You can use LEFT and RIGHT functions to get X and Y out of your value field.
So I think you are looking for something like this:
SELECT *
FROM example
WHERE CAST(LEFT(value, 3) AS DECIMAL(2,1)) > 2.5
  AND CAST(RIGHT(value, 3) AS DECIMAL(2,1)) < 7.2
First you need to access your table in a fashion that only has one value per column. (Multiple values per column, like 3.5-7.5, are a very common relational database design antipattern. They cripple both performance and clarity.)
This SQL subquery does the trick for pairs of values.
SELECT item_id, name,
       0.0 + SUBSTRING_INDEX(value, '-', 1) first,
       0.0 + SUBSTRING_INDEX(value, '-', -1) last
FROM example;
The expression 0.0+something is a MySQL trick to coerce a value to be numeric.
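For example, with throwaway literals:
SELECT 0.0 + '7.5';       -- 7.5, coerced to a numeric value
SELECT 0.0 + '7.5-junk';  -- also 7.5: MySQL reads the leading numeric prefix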
Then use the subquery to apply your search criteria.
SELECT item_id, name, first, last
FROM ( SELECT item_id, name,
              0.0 + SUBSTRING_INDEX(value, '-', 1) first,
              0.0 + SUBSTRING_INDEX(value, '-', -1) last
       FROM example
     ) s
WHERE first > 2.5
  AND last < 7.2;
In a comment you asked about the situation where you have more than two values in a single column separated by delimiters. For that, see Split comma separated values in MySQL.
Pro tip: Don't put more than one number in a column in an RDBMS table. The next person to use the table will be muttering curses all day while trying to use that data.
Pro tip: Use numeric data types, not VARCHAR(), for numbers.
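To make the first pro tip concrete, a normalized layout might look like this (a sketch; the table and column names are illustrative, not from the question):
-- Hypothetical normalized layout: one numeric column per bound.
CREATE TABLE example_normalized (
  item_id   INT PRIMARY KEY,
  name      VARCHAR(100),
  value_min DECIMAL(3,1) NOT NULL,   -- left of the dash
  value_max DECIMAL(3,1) NOT NULL    -- right of the dash
);

-- The range query then needs no string surgery and can use indexes:
SELECT item_id, name
FROM example_normalized
WHERE value_min > 2.5
  AND value_max < 7.2;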

CAST to UNSIGNED takes very long

I have the following problem:
I want to query some data from 3 big SQL tables:
eintraege ~13000 rows // rubrik2eintrag ~9500 rows // rubriken ~425 rows
This query
SELECT eintraege.id AS id, eintraege.email, eintraege.eintrags_name, eintraege.telefon,
       eintraege.typ, rubrik2eintrag.rubrik AS rubrik, eintraege.status,
       IFNULL(GROUP_CONCAT(rubriken.bezeichnung), '- Keine Rubrik zugeordnet') AS rubrikname
FROM eintraege
LEFT OUTER JOIN rubrik2eintrag ON rubrik2eintrag.eintrag = eintraege.id
LEFT OUTER JOIN rubriken ON rubrik = rubriken.rubrik_id
GROUP BY id
ORDER BY `id` DESC
LIMIT 0, 50
works fine for me (~2 seconds response time), but the entries do not appear in the correct order (e.g. the row with id 500 comes right before the row with id 3000),
so I cast the id to unsigned, like this:
ORDER BY CAST(`id` AS UNSIGNED) DESC
But now the query needs nearly 40 seconds.
Is there a better/faster way to get correctly ordered output?
Apparently, id is not defined with an integer (or numeric) datatype. That would explain the ordering: it's ordering by string value.
Some possibilities:
Introduce a new column in the table with an integer datatype, populate/maintain the contents of that column, add an appropriate index with that column as the leading column, and change the query to order by the new column. (That would be the best MySQL approximation of a function-based index; see the generated-column sketch below.)
Or, store the string value with leading zeros, so they are the same length.
000000000500
000000030000
Or, redefine the id column to be integer type.
Aside from those ideas... no, there's really no getting around a Using filesort operation to order the rows by integer value.
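On MySQL 5.7 or later, the first option can be automated with a stored generated column; a sketch, assuming id is currently a varchar column:
-- Sketch for MySQL 5.7+: the generated column stays in sync automatically,
-- and an index on it lets the ORDER BY read rows in numeric order.
ALTER TABLE eintraege
  ADD COLUMN id_num BIGINT UNSIGNED
    GENERATED ALWAYS AS (CAST(id AS UNSIGNED)) STORED,
  ADD INDEX idx_id_num (id_num);
The query would then end with ORDER BY id_num DESC instead of the CAST expression.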

Table statistics (aka row count) over time

I'm preparing a presentation about one of our apps and was asking myself the following question: "Based on the data stored in our database, how much growth has happened over the last couple of years?"
So I'd like to show, in one output/graph, how much data we have been storing since the beginning of the project.
My current query looks like this:
SELECT DATE_FORMAT(created,'%y-%m') AS label, COUNT(id) FROM table GROUP BY label ORDER BY label;
the example output would be:
11-03: 5
11-04: 200
11-05: 300
Unfortunately, this query is missing the accumulation. I would like to receive the following result:
11-03: 5
11-04: 205 (200 + 5)
11-05: 505 (200 + 5 + 300)
Is there any way to solve this problem in MySQL without having to call the query in a PHP loop?
Yes, there's a way to do that. One approach uses MySQL user-defined variables (and behavior that is not guaranteed)
SELECT s.label
     , s.cnt
     , @tot := @tot + s.cnt AS running_subtotal
  FROM ( SELECT DATE_FORMAT(t.created,'%y-%m') AS `label`
              , COUNT(t.id) AS cnt
           FROM articles t
          GROUP BY `label`
          ORDER BY `label`
       ) s
 CROSS
  JOIN ( SELECT @tot := 0 ) i
Let's unpack that a bit.
The inline view aliased as s returns the same resultset as your original query.
The inline view aliased as i returns a single row. We don't really care what it returns (except that we need it to return exactly one row because of the JOIN operation); what we care about is the side effect: a value of zero gets assigned to the @tot user variable.
Since MySQL materializes the inline view as a derived table before the outer query runs, that variable gets initialized first.
For each row processed by the outer query, the value of cnt is added to @tot.
The return of s.cnt in the SELECT list is entirely optional; it's just there as a demonstration.
N.B. The MySQL reference manual specifically states that this behavior of user-defined variables is not guaranteed.
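For completeness: on MySQL 8.0 and later, a window function produces the same running total with documented semantics, so the caveat above goes away. A sketch against the same articles table:
-- MySQL 8.0+ sketch: SUM() OVER an ordered window yields the running total
-- without relying on user-variable evaluation order.
SELECT s.label
     , s.cnt
     , SUM(s.cnt) OVER (ORDER BY s.label) AS running_subtotal
  FROM ( SELECT DATE_FORMAT(t.created,'%y-%m') AS `label`
              , COUNT(t.id) AS cnt
           FROM articles t
          GROUP BY `label`
       ) s
 ORDER BY s.label;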

SQL query (or function) to find the total number of occurrences of a particular string in a table

I have a big table where one of the columns has text responses from many users (thousands of rows). I need to find the total number of occurrences of a particular word in that response column. If I use
SELECT count(*) from Table where Response like '%wordtobefound%'
it gives the number of rows that contain 'wordtobefound', but I need the total number of occurrences of the word 'wordtobefound' across all rows.
Note: I would prefer a user-defined function that I can add to the database and use again and again.
SELECT SUM(BINARY (LENGTH(field_name) - LENGTH(REPLACE(field_name, "your_word", "")))/LENGTH("your_word")) FROM table_name;
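To see why the LENGTH/REPLACE arithmetic counts occurrences, try it on a throwaway literal:
-- 'word' occurs twice; removing it shrinks the string by 2 * LENGTH('word'),
-- so dividing the shrinkage by LENGTH('word') yields the count.
SELECT (LENGTH('a word and a word')
        - LENGTH(REPLACE('a word and a word', 'word', '')))
       / LENGTH('word') AS occurrences;   -- 2.0000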
Try this:
SELECT
CAST((LENGTH(Response) - LENGTH(REPLACE(Response, 'wordtobefound', ""))) / LENGTH('wordtobefound') AS UNSIGNED) AS wordCount
FROM Table
You can use SUM for the total:
SELECT
SUM(CAST((LENGTH(Response) - LENGTH(REPLACE(Response, 'wordtobefound', ""))) / LENGTH('wordtobefound') AS UNSIGNED)) AS wordCount
FROM Table
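Since the question asks for a reusable user-defined function, here is one possible wrapper. This is a sketch: the function name is mine, and it uses CHAR_LENGTH (which counts characters rather than bytes) instead of LENGTH.
DELIMITER //

-- Hypothetical helper: counts non-overlapping occurrences of needle in haystack.
CREATE FUNCTION count_occurrences(haystack TEXT, needle VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
  RETURN (CHAR_LENGTH(haystack)
          - CHAR_LENGTH(REPLACE(haystack, needle, '')))
         / CHAR_LENGTH(needle);
END//

DELIMITER ;

-- Usage, with the question's table and column names:
SELECT SUM(count_occurrences(Response, 'wordtobefound')) AS total FROM `Table`;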

MySQL query to find specified rows

I have one table, trip_data. Every second I receive packets and insert the data into the database. The trip_data table contains four fields: trip_paramid, fuel_content, creation_time & vehicle_id. I want to select all rows in which the difference between creation times is 2 minutes (not exactly 2, approximately 2). The trip_data table contains 4 million rows, so I need an optimized select query for this. Can anyone help with this? Here are the table schema & sample data for the trip table.
SQLFiddle demo
SELECT
    tp.*
FROM
    trip_parameters tp
GROUP BY
    CONVERT(UNIX_TIMESTAMP(tp.creation_time)/(2*60), unsigned)
ORDER BY
    tp.creation_time asc
Note that using UNIX_TIMESTAMP does not allow you to handle dates beyond year 2037. Using the following instead fixes the problem:
CONVERT(TIMESTAMPDIFF(SECOND,'1970-01-01 00:00:00',tp.creation_time)/(2*60), unsigned)
You can do it in one table scan using MySQL user-defined variables. Unfortunately UDVs have a limited set of data types (integer, decimal, floating-point, binary or nonbinary string), so in this query I use a char @ti variable to store the previous datetime, with CAST to compare it against the Creation_time field. I also set the initial value of this variable to (now()-10000000); you can use any date you wish that is less than MIN(Creation_time).
Here is the SQLFiddle demo
select * from
(
    select trip_parameters.*,
           if(ABS(TIMESTAMPDIFF(MINUTE, Creation_time, cast(@ti as datetime))) >= 2, 1, 0) t,
           @ti := if(ABS(TIMESTAMPDIFF(MINUTE, Creation_time, cast(@ti as datetime))) >= 2,
                     cast(Creation_time as char(100)), @ti)
    from trip_parameters,
         (select @ti := cast(now()-10000000 as char(100))) a
    order by creation_time
) t2
where T=1
order by creation_time
Try this:
SELECT trip_paramid, fuel_content, creation_time, vehicle_id
FROM trip_parameters
GROUP BY FLOOR(UNIX_TIMESTAMP(creation_time)/120)
This takes one item from every 2-minute block.
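Note that GROUP BY with non-aggregated columns relies on MySQL picking an arbitrary row per group. On MySQL 8.0+, the same one-row-per-bucket idea can be made deterministic with ROW_NUMBER(); a sketch:
-- MySQL 8.0+ sketch: keep the earliest row inside each 2-minute bucket.
SELECT trip_paramid, fuel_content, creation_time, vehicle_id
FROM ( SELECT tp.*,
              ROW_NUMBER() OVER (
                  PARTITION BY FLOOR(UNIX_TIMESTAMP(creation_time) / 120)
                  ORDER BY creation_time
              ) AS rn
       FROM trip_parameters tp
     ) b
WHERE rn = 1
ORDER BY creation_time;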