Duplicate Siddhi partitions or multiple queries in same partition - partitioning

We are trying to determine which is recommended and more performant when multiple queries each rely on the same partition. Imagine the data is just a name string and a value integer, and we would like 2 separate queries detecting patterns in the value for the same name. There are two ways to write this:
Option 1:
define stream InputStream (name string, value int);

partition with (name of InputStream)
begin
from every s1=InputStream[value == 1],
s2=InputStream[value == 2]
select s1.value as initial, s2.value as final
insert into OutputStream1;
end;

partition with (name of InputStream)
begin
from every s1=InputStream[value == 10],
s2=InputStream[value == 11]
select s1.value as initial, s2.value as final
insert into OutputStream2;
end;
Option 2:
define stream InputStream (name string, value int);

partition with (name of InputStream)
begin
from every s1=InputStream[value == 1],
s2=InputStream[value == 2]
select s1.value as initial, s2.value as final
insert into OutputStream1;

from every s1=InputStream[value == 10],
s2=InputStream[value == 11]
select s1.value as initial, s2.value as final
insert into OutputStream2;
end;
Option 1: It should generate a separate partition stream for each query and be able to execute them in parallel, but it also has the overhead of generating two partition streams for the same name, unless Siddhi is smart enough to realize the partition streams are identical and puts them in the same stream.
Option 2: The queries are in the same partition stream, so I imagine each will execute sequentially (unless Siddhi is smart enough to realize the queries don't depend on each other, since there are no inner streams). But the bonus is that only one partition stream needs to be generated.
Either option should work fine, but which one is more performant? Or will they both be functionally the same once Siddhi processes them?

Since you are using Siddhi 4, I would recommend Option 2, since the memory overhead of Option 1 is really high.
However, this issue is fixed in Siddhi 5; after upgrading you can use Option 1 for better performance.

Related

Reducing two views into one?

Good afternoon all,
I am new to MySQL. I have everything working just as I want it, but I could only achieve what I wanted by creating a first view, then referencing it in the final view. I wanted to know whether this is bad practice that will result in slower performance. Also, just for learning's sake, could it all be done using a single view? I have tried and tried but keep getting errors.
First I create this view, which I call intermediate view:
CREATE VIEW intermediate_view AS (
SELECT
freight,
shipping_date,
receiver,
tracking_no,
left(cast(receiver as int), 3) as vendor_id,
DATEDIFF(now(),shipping_date) AS days_in_transit,
CASE
WHEN DATEDIFF(now(),shipping_date) > 5 then "Problem"
WHEN DATEDIFF(now(),shipping_date) < 0 then "Not shipped yet"
ELSE "In transit"
END
AS Status
FROM tracking
);
I'm then creating a final view to join data to it from another table. My second view is:
CREATE VIEW final_view AS (
SELECT
intermediate_view.freight,
intermediate_view.shipping_date,
intermediate_view.receiver,
intermediate_view.tracking_no,
intermediate_view.days_in_transit,
intermediate_view.Status,
vendors.vendor_name
FROM intermediate_view
JOIN vendors
on intermediate_view.vendor_id = vendors.vendor_id
);
Basically, all the second table is doing is matching the first 3 digits from left(cast(receiver as int), 3) to another table where those 3 digits have a corresponding company name. Is there a way to do this join in one view?
Hopefully I've explained this well enough! Thanks in advance
Your CAST will not work in MySQL, so use this:
CREATE VIEW final_view AS (
SELECT
t.freight,
t.shipping_date,
t.receiver,
t.tracking_no,
left(cast(t.`receiver` as UNSIGNED), 3) as vendor_id,
DATEDIFF(now(),t.shipping_date) AS days_in_transit,
CASE
WHEN DATEDIFF(now(),t.shipping_date) > 5 then "Problem"
WHEN DATEDIFF(now(),t.shipping_date) < 0 then "Not shipped yet"
ELSE "In transit"
END
AS 'Status',
v.vendor_name
FROM tracking t JOIN
vendors v
on left(cast(t.`receiver` as UNSIGNED), 3) = v.vendor_id
);
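If you want to sanity-check the single-view idea without a MySQL instance, here is a minimal sketch using Python's sqlite3 with made-up table contents; SQLite's substr() stands in for MySQL's LEFT(CAST(... AS UNSIGNED), 3):

```python
import sqlite3

# Hypothetical mini-tables: a single view that derives vendor_id inline
# and joins on the same expression, so no intermediate view is needed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tracking (freight TEXT, receiver TEXT);
CREATE TABLE vendors  (vendor_id INTEGER, vendor_name TEXT);
INSERT INTO tracking VALUES ('Widgets', '123456789');
INSERT INTO vendors  VALUES (123, 'Acme Corp');

CREATE VIEW final_view AS
SELECT t.freight,
       CAST(substr(t.receiver, 1, 3) AS INTEGER) AS vendor_id,
       v.vendor_name
FROM tracking t
JOIN vendors v
  ON CAST(substr(t.receiver, 1, 3) AS INTEGER) = v.vendor_id;
""")

print(conn.execute("SELECT freight, vendor_name FROM final_view").fetchone())
# ('Widgets', 'Acme Corp')
```

The key point is that the derived expression is simply repeated in the JOIN condition instead of being materialized in an intermediate view.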

Does the Laravel `increment()` lock the row?

Does calling the Laravel increment() on an Eloquent model lock the row?
For example:
$userPoints = UsersPoints::where('user_id','=',\Auth::id())->first();
if(isset($userPoints)) {
$userPoints->increment('points', 5);
}
If this is called from two different locations in a race condition, will the second call override the first increment and we still end up with only 5 points? Or will they add up and we end up with 10 points?
To answer this (helpful for future readers): what you are asking about depends on the database configuration.
Most MySQL storage engines (MyISAM, InnoDB, etc.) use locking when inserting, updating, or altering the table unless this behaviour is explicitly turned off. (In any case, this is the only correct and sensible implementation for most cases.)
So you can feel comfortable with what you have, because it will work correctly at any number of concurrent calls:
-- this is roughly what the Laravel query builder translates to
UPDATE users SET points = points + 5 WHERE user_id = 1
and calling this twice with a starting value of zero will end up with 10.
The answer is actually a tiny bit different for the specific case with ->increment() in Laravel:
If one would call $user->increment('credits', 1), the following query will be executed:
UPDATE `users`
SET `credits` = `credits` + 1
WHERE `id` = 2
This means that the query can be regarded as atomic, since the actual credits amount is retrieved in the query, and not retrieved using a separate SELECT.
So you can execute this query without running any DB::transaction() wrappers or lockForUpdate() calls because it will always increment it correctly.
To show what can go wrong, a BAD query would look like this:
# Assume this retrieves "5" as the amount of credits:
SELECT `credits` FROM `users` WHERE `id` = 2;
# Now, execute the UPDATE statement separately:
UPDATE `users`
SET `credits` = 5 + 1, `users`.`updated_at` = '2022-04-15 23:54:52'
WHERE `id` = 2;
Or in the Laravel equivalent (DON'T DO THIS):
$user = User::find(2);
// $user->credits will be 5.
$user->update([
// Shown as "5 + 1" in the query above, but it would be just "6" of course.
'credits' => $user->credits + 1
]);
Now, THIS can easily go wrong, since you are assigning the credits value, which depends on the moment the SELECT statement took place. Two queries could update the credits to the same value even though the intention was to increment twice. However, you CAN fix this Laravel code the following way:
DB::transaction(function() {
$user = User::query()->lockForUpdate()->find(2);
$user->update([
'credits' => $user->credits + 1,
]);
});
Now, since the two queries are wrapped in a transaction and the user record with id 2 is locked using lockForUpdate(), any second (or third, or n-th) instance of this transaction running in parallel must wait for that lock, and cannot read the row with its own locking SELECT until the locking transaction is complete.
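To make the difference concrete, here is a sketch using Python's sqlite3 with a made-up users table and a hand-forced interleaving; it shows why the relative UPDATE is safe while the read-then-write pattern loses an increment:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, points INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 0)")

# Atomic form: the addition happens inside the UPDATE itself,
# so two calls always add up.
for _ in range(2):
    conn.execute("UPDATE users SET points = points + 5 WHERE id = 1")
atomic = conn.execute("SELECT points FROM users WHERE id = 1").fetchone()[0]

# Lost-update form: both "callers" read the same stale value first,
# then each writes stale + 5, so one increment is silently lost.
conn.execute("UPDATE users SET points = 0 WHERE id = 1")
stale_a = conn.execute("SELECT points FROM users WHERE id = 1").fetchone()[0]
stale_b = conn.execute("SELECT points FROM users WHERE id = 1").fetchone()[0]
conn.execute("UPDATE users SET points = ? WHERE id = 1", (stale_a + 5,))
conn.execute("UPDATE users SET points = ? WHERE id = 1", (stale_b + 5,))
lost = conn.execute("SELECT points FROM users WHERE id = 1").fetchone()[0]

print(atomic, lost)  # 10 5
```

The interleaving here is forced sequentially for determinism; in production the same loss happens whenever two requests race between the SELECT and the UPDATE.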

Which of the two approaches for the specified task is better performance-wise in Perl DBI module?

Here is a problem that I'm facing, which I need to solve using Perl DBI module:
Table:
c1 c2 c3 c4 c5 c6
__________________
r1 | a b c d e f
r2 | h i j k x m
r3 | n x p q r x
r4 | k l m n p q
Task: determine the name of the left-most column that has value 'x' in any of the rows. In the example it is c2.
I see two ways to do this:
First
Select column c1 in all the rows;
Loop through the retrieved fields, starting from the top-most;
If any of the fields has the value 'x', return c1;
Otherwise, repeat steps 1-3 for the next column.
How I approximately imagine it to look in Perl:
my $var;
for my $i (1..6) {
    my $statement = "SELECT c$i FROM table";
    my $col = $dbh->selectcol_arrayref($statement);
    if ( grep { $_ eq 'x' } @$col ) { $var = $i; last; }
}
return $var;
Second
Set variable $var to 6 (the number of columns);
Select columns c1 to c$var in all the rows;
Loop through the returned fields, starting from the left-most;
If a field has the value 'x' and its column number is lower than $var, assign that column number to $var;
Repeat 2-4 for the next row;
Return $var.
How I approximately imagine it to look in Perl:
my $var = 6;
my $sth = $dbh->prepare("SELECT c1, c2, c3, c4, c5, c6 FROM table");
$sth->execute;
while ( my @row = $sth->fetchrow_array ) {
    for my $i ( 0 .. $var - 1 ) {
        if ( $row[$i] eq 'x' ) { $var = $i + 1; last; }
    }
}
return $var;
If I understood the manual correctly, selectcol_arrayref() actually performs a separate SQL call for each row in the table, so both approaches involve a two-level loop.
To people who know more about the inner workings of the Perl DBI module, my question is the following:
Which of the approaches is better performance-wise?
If it's of any significance, I'm working with a MySQL database.
EDIT: Actual table dimensions are potentially c200 x r1000.
EDIT2:
Another idea: using a LIMIT clause to determine whether a column contains the value within the SQL statement itself, for example:
SELECT c1
FROM table
WHERE c1='x'
LIMIT 0,1
This statement should make it possible to determine whether c1 contains the value 'x'. This would move some more of the performance load to the DB engine, correct? Would this improve or worsen performance?
Here is a version using SQLite. I expect the same code to work for MySQL with little or no change. It should work fine unless your database table is huge, but you don't mention its size so I presume it's not out of the ordinary.
It simply fetches the contents of the table into memory and checks each column, one by one, to see if any field is x, printing the name of the column once it is found.
use strict;
use warnings;
use DBI;
use List::Util qw/ any /;
my $dbh = DBI->connect('dbi:SQLite:test.sqlite');
my $sth = $dbh->prepare('SELECT * FROM "table"');
$sth->execute;
my $table = $sth->fetchall_arrayref;
my $first_column;
for my $i (0 .. $#{$table->[0]}) {
my @column = map { $_->[$i] } @$table;
if ( any { $_ eq 'x' } @column ) {
$first_column = $sth->{NAME}[$i];
last;
}
}
print $first_column, "\n";
output
c2
Update
This way is likely to be faster, as it uses the database engine to search for columns that contain an x, and very little data is loaded into memory.
use strict;
use warnings;
use DBI;
my $dbh = DBI->connect('dbi:SQLite:test.sqlite');
my @names = do {
    my $sth = $dbh->prepare('SELECT * FROM "table" LIMIT 0');
    $sth->execute;
    @{ $sth->{NAME_lc} };
};
my $first_column;
for my $col (@names) {
my $sql = qq{SELECT $col from "table" WHERE $col = 'x' LIMIT 1};
my $row = $dbh->selectrow_arrayref($sql);
if ($row) {
$first_column = $col;
last;
}
}
print $first_column, "\n";
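For readers who want to experiment with the same per-column probe without setting up Perl, here is an equivalent sketch in Python's sqlite3, with the table contents copied from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (c1 TEXT, c2 TEXT, c3 TEXT, c4 TEXT, c5 TEXT, c6 TEXT)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?, ?, ?)",
    [("a", "b", "c", "d", "e", "f"),
     ("h", "i", "j", "k", "x", "m"),
     ("n", "x", "p", "q", "r", "x"),
     ("k", "l", "m", "n", "p", "q")],
)

first_column = None
for col in ("c1", "c2", "c3", "c4", "c5", "c6"):
    # LIMIT 1 lets the engine stop scanning this column at the first hit.
    if conn.execute(f"SELECT 1 FROM t WHERE {col} = 'x' LIMIT 1").fetchone():
        first_column = col
        break

print(first_column)  # c2
```

One query per column, each terminating at the first match, keeps almost nothing in application memory.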
Short of redesigning your table so that it can be queried more effectively, I think your optimal solution is likely to be a modified version of your first approach. Instead of using fetchall_arrayref(), use fetchrow_arrayref() to collect one row at a time. Examine each row as you get it, and break the loop if the minimum column ever gets to column 1. This minimizes the memory used in the Perl code, and it uses a single SQL statement with multiple fetch operations (but then, fetchall_arrayref() also performs multiple fetch operations under the hood).
The fact that you need to query your data this way tells me that it's stored in a bizarre and inappropriate way. Relational databases are meant to store relations, and the order of their columns should be irrelevant to how they logically function. Any need to refer to column order is a guaranteed sign that you're doing something wrong.
I understand that sometimes one needs to perform one-time queries to determine unusual things about data sets, but I stand by my assessment: this data is stored inappropriately.
My guess is that there are many columns that define related, sequential attributes, maybe something like "profits_1q2001", "profits_2q2001", etc. You'll want to create a separate table for those, maybe something like:
CREATE TABLE `department_profits` (
`id` int(10) unsigned NOT NULL,
`department_id` same_as_parent_table NOT NULL,
`y` year(4) NOT NULL,
`q` tinyint(3) unsigned NOT NULL,
`profits` decimal(9,2) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `idx_dept_quarter` (`department_id`,`y`,`q`),
KEY `idx_profits_y_q_dept` (`profits`,`y`,`q`,`department_id`)
) ENGINE=InnoDB;
Converting the data from its current format to the proper format is left as an exercise for the reader, but it might involve 200 script-generated queries that look like:
SELECT CONCAT(
"INSERT INTO department_profits (department_id, y, q, profits) VALUES (",
"'", department_id, "',",
2001, ",",
1, ",",
profits_1q2001,
");"
)
FROM old_table;
If your question is then (say) when was the first time profits exceeded $10,000 and in which department, then finding the answer becomes something like:
SELECT department_id, y, q, profits
FROM department_profits
WHERE profits > 10000
ORDER BY y, q LIMIT 1;
For the actual question you asked -- if it really is a one-off -- since there are just 200,000 data points, I would do it manually. Export the whole table as tab-separated, drag it onto Excel, "Find/Replace" to change "x" to "-999" or some small value, then "Data -> Sort" by each column in turn until your answer pops to the top. Heck, plain old "Find" might tell you your answer. With just 200 columns, it won't take long, and you might learn something new about your data by seeing it all on the screen sorted various ways :)
Assuming your columns are c1 .. c6 you can use something like this to get it in sqlite:
select distinct (case when c1 = 'x' then 'c1' when c2 = 'x' then 'c2' when c3 = 'x' then 'c3' when c4 = 'x' then 'c4' when c5 = 'x' then 'c5' when c6 = 'x' then 'c6' else 'x' end) from mje order by 1 limit 1;

How to get the last updated row using a single query

I searched enough before posting it.
My table structure:
aid | bid | cid | did |
Where aid and bid together form the composite primary key.
When I update the value of cid using a where clause for aid, bid I also want to get the did value of the updated row.
Something like this:
$this->db->set('cid', 1, FALSE)
->where(array(
'aid' => $a_id,
'bid' => $b_id
))
->update('my_table')
->select('did');
The above query says:
Fatal error: Call to a member function select() on a non-object in...
I tried this:
How to get ID of the last updated row in MySQL?
Which is like 3 queries.
I'd suggest fetching the rows you're about to update, storing their IDs in an array, and running the UPDATE with a WHERE id IN (1, 2, ...) clause.
What you're trying to do is not supported by MySQL; you'll need at least 2 queries. But since you're fetching the values first and already know what you're updating, you can also reconstruct the new row and its values without running an extra query after the UPDATE.
In your given example:
$this->db->set('cid', 1, FALSE)
->where(array(
'aid' => $a_id,
'bid' => $b_id
))
->update('my_table')
->select('did');
set(), where() and select() each return an object that the query is built on. However, update() returns a result value, which doesn't have a select() method (nor set() or where(), for that matter).
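Since a single update-and-return statement isn't available, the minimal pattern is an UPDATE followed by a SELECT inside one transaction. A sketch using Python's sqlite3 with a made-up table matching the question's schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (aid INTEGER, bid INTEGER, cid INTEGER, did TEXT,"
    " PRIMARY KEY (aid, bid))"
)
conn.execute("INSERT INTO my_table VALUES (1, 2, 0, 'payload')")

with conn:  # one transaction keeps the UPDATE/SELECT pair consistent
    conn.execute(
        "UPDATE my_table SET cid = 1 WHERE aid = ? AND bid = ?", (1, 2)
    )
    did = conn.execute(
        "SELECT did FROM my_table WHERE aid = ? AND bid = ?", (1, 2)
    ).fetchone()[0]

print(did)  # payload
```

Wrapping both statements in a transaction guarantees no other writer changes the row between the UPDATE and the SELECT.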

How to factorise integers using mysql

While doing my project I need to find the prime factorisation of an integer using MySQL, which I think is more efficient as a query than doing the whole recursive thing.
What I want to achieve is to find the prime numbers composing an integer.
Example: for 102, the prime factors would be: 17, 3, 2.
Thank you.
Back-of-envelope strategy (still requires a programmed loop for step 2):
create a table "primes" with a single int column (the primary key)
run this loop:
for $x = 2 to $n {
    execute("
        insert into primes (id)
        select $x where not exists
            (select * from primes as p
             where p.id <= sqrt($x) and ($x mod p.id) = 0)")
}
use the subquery above (it finds the prime divisors of $x up to sqrt($x)) to list your results for a specific $x
This solution will work for values up to $n^2. Step 2 could be improved by testing only numbers over 9 whose last digit is 1, 3, 7 or 9.
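For comparison, the factorisation itself is only a few lines in application code. A minimal trial-division sketch in Python, producing the factors of 102 from the question:

```python
def prime_factors(n):
    """Return the prime factors of n (with multiplicity) by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:   # divide out each factor completely
            factors.append(d)
            n //= d
        d += 1
    if n > 1:               # whatever remains is itself prime
        factors.append(n)
    return factors

print(prime_factors(102))  # [2, 3, 17]
```

Trial division up to sqrt(n) is more than fast enough for the magnitudes a typical web application handles, and sidesteps the round-trips of driving the loop through SQL.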