We are trying to determine what is recommended and more performant when multiple queries each rely on the same partition. Imagine we have data consisting of just a name string and a value integer, and we would like two separate queries for detecting patterns in the value for the same name. There are two ways to write this:
Option 1:
define stream InputStream (name string, value int);

partition with (name of InputStream)
begin
    from every s1=InputStream[value == 1],
               s2=InputStream[value == 2]
    select s1.value as initial, s2.value as final
    insert into OutputStream1;
end;

partition with (name of InputStream)
begin
    from every s1=InputStream[value == 10],
               s2=InputStream[value == 11]
    select s1.value as initial, s2.value as final
    insert into OutputStream2;
end;
Option 2:
define stream InputStream (name string, value int);

partition with (name of InputStream)
begin
    from every s1=InputStream[value == 1],
               s2=InputStream[value == 2]
    select s1.value as initial, s2.value as final
    insert into OutputStream1;

    from every s1=InputStream[value == 10],
               s2=InputStream[value == 11]
    select s1.value as initial, s2.value as final
    insert into OutputStream2;
end;
Option 1: It should generate a separate partition stream for each query and be able to execute them in parallel, but it also has the overhead of generating two partition streams for the same name, unless Siddhi is smart enough to realize that the partition streams are identical and merges them into one.
Option 2: The queries are in the same partition stream, so I imagine it will execute them sequentially (unless Siddhi is smart enough to realize that the queries don't depend on each other, since there are no inner streams). The bonus is that only one partition stream needs to be generated.
Either option should work fine, but which one is more performant? Or will they both be functionally the same once Siddhi processes them?
Since you are using Siddhi 4, I would recommend Option 2, since the memory overhead is really high in Option 1.
However, this issue is fixed in Siddhi 5; after upgrading, you can use Option 1 for better performance.
Here is a problem that I'm facing, which I need to solve using the Perl DBI module:
Table:
     c1  c2  c3  c4  c5  c6
     ______________________
r1 |  a   b   c   d   e   f
r2 |  h   i   j   k   x   m
r3 |  n   x   p   q   r   x
r4 |  k   l   m   n   p   q
Task: determine the name of the left-most column that has value 'x' in any of the rows. In the example it is c2.
I see two ways to do this:
First:
1. Select column c1 in all the rows;
2. Loop through the retrieved fields, starting from the top-most;
3. If any of the fields has value 'x', return c1;
4. Otherwise, repeat steps 1-3 for the next column.
How I approximately imagine it to look in Perl:

my $var;
for my $i (1..6) {
    my $statement = "select c$i from table";
    # Fetch the whole column and look for an 'x' in it
    my $fields = $dbh->selectcol_arrayref($statement);
    if ( grep { $_ eq 'x' } @$fields ) { $var = "c$i"; last; }
}
return $var;
Second:
1. Set variable $var to 6 (the number of columns);
2. Select columns c1 to c$var of the next row;
3. Loop through the returned fields, starting from the left-most;
4. If a field has value 'x' and its column number is lower than $var, assign that column number to $var;
5. Repeat steps 2-4 for the next row;
6. Return $var.
How I approximately imagine it to look in Perl:

my $var = 6;                      # left-most 'x' column found so far
for my $row (1..4) {
    my @cols = map { "c$_" } 1 .. $var;
    my $statement = "select " . join(', ', @cols) . " from table";
    # ... plus some WHERE clause that selects row $row
    my @fields = $dbh->selectrow_array($statement);
    for my $i (0 .. $#fields) {
        if ($fields[$i] eq 'x') { $var = $i + 1; last; }
    }
}
return $var;
If I understood the manual correctly, selectrow_array() actually performs a separate SQL call for each row of the table, so both approaches involve a two-level loop.
To people who know more about the inner workings of the Perl DBI module, my question is the following:
Which of the approaches is better performance-wise?
If it's of any significance, I'm working with a MySQL database.
EDIT: Actual table dimensions are potentially c200 x r1000.
EDIT2:
Another idea: use a LIMIT clause to determine whether a column contains the value with the SQL statement itself, for example:
SELECT c1
FROM table
WHERE c1='x'
LIMIT 0,1
This statement should make it possible to determine whether c1 contains the value 'x'. It would move some more of the performance load to the DB engine, correct? Would this improve or worsen performance?
Here is a version using SQLite. I expect the same code to work for MySQL with little or no change. It should work fine unless your database table is huge, but you don't mention its size, so I presume it's not out of the ordinary.
It simply fetches the contents of the table into memory and checks each column, one by one, to see if any field is x, printing the name of the column once it is found.
use strict;
use warnings;
use DBI;
use List::Util qw/ any /;

my $dbh = DBI->connect('dbi:SQLite:test.sqlite');

my $sth = $dbh->prepare('SELECT * FROM "table"');
$sth->execute;
my $table = $sth->fetchall_arrayref;

my $first_column;
for my $i (0 .. $#{$table->[0]}) {
    my @column = map { $_->[$i] } @$table;
    if ( any { $_ eq 'x' } @column ) {
        $first_column = $sth->{NAME}[$i];
        last;
    }
}
print $first_column, "\n";
output
c2
Update
This way is likely to be faster, as it uses the database engine to search for columns that contain an 'x', and very little data is loaded into memory.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:test.sqlite');

my @names = do {
    my $sth = $dbh->prepare('SELECT * FROM "table" LIMIT 0');
    $sth->execute;
    @{ $sth->{NAME_lc} };
};

my $first_column;
for my $col (@names) {
    my $sql = qq{SELECT $col FROM "table" WHERE $col = 'x' LIMIT 1};
    my $row = $dbh->selectrow_arrayref($sql);
    if ($row) {
        $first_column = $col;
        last;
    }
}
print $first_column, "\n";
Short of redesigning your table so that it can be queried more effectively, I think your optimal solution is likely to be a modified version of your Option 1. Instead of using fetchall_arrayref(), use fetchrow_arrayref() to collect one row at a time. Examine each row as you get it, and break the loop if the minimum column ever gets to column 1. This minimizes the memory used in the Perl code; it uses a single SQL statement (with multiple fetch operations, though fetchall_arrayref() also performs multiple fetch operations).
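A minimal sketch of that idea, reusing the test.sqlite connection from the examples above (untested against your real schema):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:test.sqlite');

my $sth = $dbh->prepare('SELECT * FROM "table"');
$sth->execute;

my @names = @{ $sth->{NAME} };    # column names in declaration order
my $min   = @names;               # index one past the last column

# Fetch one row at a time and track the left-most column holding 'x'
while ( my $row = $sth->fetchrow_arrayref ) {
    for my $i ( 0 .. $min - 1 ) {
        if ( defined $row->[$i] && $row->[$i] eq 'x' ) {
            $min = $i;
            last;
        }
    }
    last if $min == 0;            # cannot do better than the first column
}
$sth->finish;                     # we may stop before exhausting the rows

print $min < @names ? $names[$min] : 'no x found', "\n";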
The fact that you need to query your data this way tells me that it's stored in a bizarre and inappropriate way. Relational databases are meant to store relations, and the order of their columns should be irrelevant to how they logically function. Any need to refer to column order is a guaranteed sign that you're doing something wrong.
I understand that sometimes one needs to perform one-time queries to determine unusual things about data sets, but I stand by my assessment: this data is stored inappropriately.
My guess is that there are many columns that define related, sequential attributes, maybe something like "profits_1q2001", "profits_2q2001", etc. You'll want to create a separate table for those, maybe something like:
CREATE TABLE `department_profits` (
`id` int(10) unsigned NOT NULL,
`department_id` same_as_parent_table NOT NULL,
`y` year(4) NOT NULL,
`q` tinyint(3) unsigned NOT NULL,
`profits` decimal(9,2) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_dept_quarter` (`department_id`,`y`,`q`),
KEY `idx_profits_y_q_dept` (`profits`,`y`,`q`,`department_id`)
) ENGINE=InnoDB;
Converting the data from its current format to the proper format is left as an exercise for the reader, but it might involve 200 script-generated queries that look like:
SELECT CONCAT(
"INSERT INTO department_profits (department_id, y, q, profits) VALUES (",
"'", department_id, "',",
2001, ",",
1, ",",
profits_1q2001,
");"
)
FROM old_table;
If your question is then (say) when was the first time profits exceeded $10,000 and in which department, then finding the answer becomes something like:
SELECT department_id, y, q, profits
FROM department_profits
WHERE profits > 10000
ORDER BY y, q LIMIT 1;
For the actual question you asked -- if it really is a one-off -- since there are just 200,000 data points, I would do it manually. Export the whole table as tab-separated, drag it onto Excel, "Find/Replace" to change "x" to "-999" or some small value, then "Data -> Sort" by each column in turn until your answer pops to the top. Heck, plain old "Find" might tell you your answer. With just 200 columns, it won't take long, and you might learn something new about your data by seeing it all on the screen sorted various ways :)
Assuming your columns are c1 .. c6, you can use something like this to get it in SQLite:
select distinct (case when c1 = 'x' then 'c1'
                      when c2 = 'x' then 'c2'
                      when c3 = 'x' then 'c3'
                      when c4 = 'x' then 'c4'
                      when c5 = 'x' then 'c5'
                      when c6 = 'x' then 'c6'
                      else 'x' end)
from mje order by 1 limit 1;
I can find sequenced records (consecutive weeks that carry the same number) using the following query:
SELECT * FROM pointed_numbers A WHERE EXISTS (
SELECT * FROM pointed_numbers B WHERE A.number = B.number AND (A.week = B.week + 1 XOR A.week = B.week - 1)
) ORDER BY A.number, A.week;
How can I identify each gap without a stored procedure? I have tried with user-defined variables, but I had no success.
Take a look at http://www.artfulsoftware.com/infotree/queries.php and look at the stuff under the "sequences" section. This is a super-helpful site with recipes for how to do complicated things in MySQL!
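For instance, here is a sketch in the spirit of those recipes, assuming your table is pointed_numbers(number, week):

-- A gap begins right after any week whose successor is missing for the same number.
-- (This also flags the last week of each number; exclude each number's MAX(week)
-- if you only want interior gaps.)
SELECT a.number, a.week + 1 AS gap_begins
FROM pointed_numbers AS a
LEFT JOIN pointed_numbers AS b
       ON b.number = a.number
      AND b.week   = a.week + 1
WHERE b.number IS NULL
ORDER BY a.number, a.week;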
Is there a way to implement this algorithm in MySQL without 100500 queries and lots of resources?
if (exists %name% in table.name) {
    num = 2;
    while (exists %name%+(num) in table.name) num++;
    %name% = %name%+(num);
}
Thanks
I don't know how much better you can do with a stored procedure in MySQL, but you can definitely do better than 100500 queries:
SELECT name FROM table WHERE name LIKE 'somename%' ORDER BY name DESC LIMIT 1
At that point, you know that you can increment the number at the end of name and the result will be unused.
I'm glossing over some fine print (this approach will never find and fill any "holes" in the naming scheme that may exist, and the name is still not guaranteed to be available due to race conditions), but in practice it can be made to work quite easily.
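For example, something along these lines computes the next free name in one query (a sketch, untested; it assumes every suffix is purely numeric):

-- Take the highest existing numeric suffix after 'somename' and add one.
-- CAST of the empty suffix (the bare name 'somename') yields 0.
SELECT CONCAT('somename',
              COALESCE(MAX(CAST(SUBSTRING(name, LENGTH('somename') + 1) AS UNSIGNED)), 0) + 1)
FROM `table`
WHERE name LIKE 'somename%';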
The simplest way I can see of doing it is to create a table of sequential numbers and then cross join onto it:
SELECT a.name, b.id
FROM `table` a
CROSS JOIN atableofsequentialnumbers b
WHERE a.name = 'somename'
  AND NOT EXISTS (SELECT 1 FROM `table` x WHERE x.name = CONCAT(a.name, b.id))
ORDER BY b.id
LIMIT 10
This will return the first 10 available numbers/names
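If such a helper table doesn't exist yet, it can be generated once; a sketch covering 0-99 (the name atableofsequentialnumbers matches the query above):

-- Cross join two ten-digit derived tables to produce 0..99.
CREATE TABLE atableofsequentialnumbers (id INT NOT NULL PRIMARY KEY);

INSERT INTO atableofsequentialnumbers (id)
SELECT t10.d * 10 + t1.d
FROM (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
      UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
      UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t1,
     (SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
      UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
      UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9) t10;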
The query:
$consulta = "UPDATE `list`
SET `pos` = $pos
WHERE `id_item` IN (SELECT id_item
FROM lists
WHERE pos = '$item'
ORDER BY pos DESC
LIMIT 1)
AND id_usuario = '$us'
AND id_list = '$id_pl'";
The thing is, this query is inside a foreach, and it is supposed to update the order of the items in a list. Before, I had it like this:
$consulta = "UPDATE `list`
SET `pos` = $pos
WHERE `pos` = '$item'
AND id_usuario = '$us'
AND id_list = '$id_pl'";
But when I update pos 2 -> 1, and then 1 -> 2, the result is two times 2 and no 1...
Is there a solution for this query?
Renumbering the items in a list is tricky. When you renumber the items in the list using multiple separate SQL statements, it is even trickier.
Your inner sub-select statement also is not properly constrained. You need an extra condition such as:
AND id_list = '$id_pl'
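That is, the sub-select in the query above would become something like this (a sketch; note that MySQL versions to date reject LIMIT inside an IN subquery, so in practice the inner query has to be wrapped in a derived table):

SELECT id_item
FROM lists
WHERE pos = '$item'
  AND id_list = '$id_pl'
ORDER BY pos DESC
LIMIT 1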
There are probably many ways to do this, but the one that may be simplest follows. I'm assuming that:
the unshown foreach loop generates $pos values in the desired sequence (1, 2, ...)
the value of $id_pl is constant for the loop
the foreach loop gives values for $us and $item for each iteration
the combination of $id_pl, $us, and $item uniquely identifies a row in the list table
there aren't more than 100 pos values to worry about
you are able to use an explicit transaction around the statement sequence
The suggested solution has two stages:
Allocate 100 + pos to each row to place it in its new position
Subtract 100 from each pos
This technique avoids any complicated issues about whether rows that have had their position adjusted are reread by the same query.
Inside the loop:
foreach ...
...$pos, $item, $us...
UPDATE list
SET pos = $pos + 100
WHERE id_item = '$item'
AND id_usuario = '$us'
AND id_list = '$id_pl'
AND pos < 100
end foreach
UPDATE list
SET pos = pos - 100
WHERE id_list = '$id_pl';
If you don't know the size of the lists, you could assign negative pos values in the loop and convert to positive after the loop, or any of a number of other equivalent mappings. The key is to update the table so that the new pos numbers in the loop are disjoint from the old numbers, and then adjust the new values after the loop.
Alternative techniques create a temporary table that maps the old numbers to the new and then executes a single UPDATE statement that changes the old pos value to the new for all rows in a single operation. This is probably more efficient, especially if the mapping table can be generated as a query, but that depends on whether the renumbering is algorithmic. The technique shown, albeit somewhat clumsy, can be made to work for arbitrary renumberings.
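A sketch of that mapping-table variant (the pos_map table and its contents are hypothetical):

-- Build the old->new mapping, then renumber every row in one statement.
CREATE TEMPORARY TABLE pos_map (old_pos INT NOT NULL, new_pos INT NOT NULL);
INSERT INTO pos_map (old_pos, new_pos) VALUES (1, 3), (2, 1), (3, 2);

UPDATE list AS l
JOIN pos_map AS m ON l.pos = m.old_pos
SET l.pos = m.new_pos
WHERE l.id_list = '$id_pl';

DROP TEMPORARY TABLE pos_map;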