Converting an H2/MySQL query to Postgres/CockroachDB

I want to convert the following (admittedly bad) query from H2/MySQL to Postgres/CockroachDB:
SET #UPDATE_TRANSFER=
(select count(*) from transfer where id='+transfer_id+' and consumed=false)>0;
update balance_address set balance =
case when #UPDATE_TRANSFER then balance +
(select value from transaction where transfer_id='+id+' and t_index=0)
else balance end where address =
(select address from transaction where transfer_id='+id+' and t_index=0)
There are three tables involved in this query: balance_address, bundle, and transaction. The goal of the query is to update the overall balance when a fund transfer happens.
A transfer can have many transactions bundled together. For instance, let's assume Paul has $20 in his account and he wants to send $3 to Jane. This will result in 4 transactions:
One that adds $3 to Jane's account
One transaction that removes the $20 from Paul's account
One transaction that sets Paul's account to 0
One transaction that puts the remainder of Paul's funds into a new address, still belonging to him.
Each of these transactions in the transfer bundle has an index and a value, as you can see above. So the goal of this update query is to update Jane's account.
The challenge is that this transfer can be processed by many servers in parallel and there is no distributed lock. So, if we naively process in parallel, each server will increment Jane’s account, leading to erroneous results.
To prevent this, the balance_address table has a column called consumed. The first server that updates the balance, sets the transfer to consumed=true. Other servers or threads can only update if consumed is false.
So, my goal is to 1) improve this query and 2) rewrite it to work with Postgres/CockroachDB. Right now, the variable construct is not accepted.
PS. I cannot change the data model.

CockroachDB doesn't have variables, but the #UPDATE_TRANSFER variable is only used once, so you can just substitute the subquery inline:
update balance_address set balance =
case
when (select count(*) from transfer where id=$1 and consumed=false)>0
then balance + (select value from transaction where transfer_id=$1 and t_index=0)
else balance
end
where address =
(select address from transaction where transfer_id=$1 and t_index=0)
But this doesn't set the consumed flag. The simplest way to do this is to make this a multi step transaction in your client application:
num_rows = txn.execute("UPDATE transfer SET consumed=true
WHERE id=$1 AND consumed=false", transfer_id)
if num_rows == 0: return
value, address = txn.query("SELECT value, address FROM transaction
WHERE transfer_id=$1 and t_index=0", transfer_id)
txn.execute("UPDATE balance_address SET balance = balance+$1
WHERE address = $2", value, address)
In PostgreSQL, I think you could get this into one big statement using common table expressions. However, CockroachDB 2.0 only supports a subset of CTEs, and I don't think it's possible to do this with a CTE in CockroachDB yet.
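For reference, a rough and untested sketch of that single statement in plain PostgreSQL (not CockroachDB 2.0), using a data-modifying CTE; table and column names are taken from the question, and $1 is the transfer id:
with consumed_transfer as (
  update transfer
     set consumed = true
   where id = $1 and consumed = false
   returning id
), tx as (
  select value, address
    from transaction
   where transfer_id = $1 and t_index = 0
)
update balance_address b
   set balance = b.balance + tx.value
  from tx
 where b.address = tx.address
   and exists (select 1 from consumed_transfer);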

How can this query be optimized for speed?

This query creates an export for UPS from the deliveries history:
select 'key'
, ACC.Name
, CON.FullName
, CON.Phone
, ADR.AddressLine1
, ADR.AddressLine2
, ADR.AddressLine3
, ACC.Postcode
, ADR.City
, ADR.Country
, ACC.Code
, DEL.DeliveryNumber
, CON.Email
, case
when CON.Email is not null
then 'Y'
else 'N'
end
Ship_Not_Option
, 'Y' Ship_Not
, 'ABCDEFG' Description_Goods
, '1' numberofpkgs
, 'PP' billing
, 'CP' pkgstype
, 'ST' service
, '1' weight
, null Shippernr
from ExactOnlineREST..GoodsDeliveries del
join ExactOnlineREST..Accounts acc
on ACC.ID = del.DeliveryAccount
join ExactOnlineREST..Addresses ADR
on ADR.ID = DEL.DeliveryAddress
join ExactOnlineREST..Contacts CON
on CON.ID = DEL.DeliveryContact
where DeliveryDate between $P{P_SHIPDATE_FROM} and $P{P_SHIPDATE_TO}
order
by DEL.DeliveryNumber
It takes many minutes to run. The number of deliveries and accounts grows by several hundred each day. Addresses and contacts are mostly 1:1 with accounts. How can this query be optimized for speed in Invantive Control for Excel?
This query is probably run at most once per day, since DeliveryDate does not contain a time component. Therefore, the number of rows selected from ExactOnlineREST..GoodsDeliveries is several hundred. Based upon the statistics given, the number of accounts, delivery addresses and contacts is also approximately several hundred.
Normally, such a query would be optimized with a solution such as "Exact Online query with joins runs more than 15 minutes", but that solution will not work here: the third value of a join_set(soe, orderid, 100) hint is the maximum number of rows on the left-hand side to be used with index joins. At this moment, the maximum number on the left-hand side is something like 125, based upon constraints on the URL length for OData requests to Exact Online. Please remember that the actual OData query is a GET using a URL, not a POST with unlimited size for the filter.
The alternatives are:
Split volume
Data Cache
Data Replicator
Have SQL engine or Exact Online adapted :-)
Split Volume
In a separate query select the eligible GoodsDeliveries and put them in an in-memory or database table using for instance:
create or replace table gdy#inmemorystorage as select ... from ...
Then create a temporary table per 100 or similar rows such as:
create or replace table gdysubpartition1#inmemorystorage as select ... from ... where rowidx$ between 0 and 99
... etc for 100, 200, 300, 400, 500
And then run the query several times, each time with a different gdysubpartition1..gdysubpartition5 instead of the original from ExactOnlineREST..GoodsDeliveries.
Of course, you can also avoid the use of intermediate tables by using an inline view like:
from (select * from goodsdeliveries where date... limit 100)
or something similar.
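As a hedged, untested illustration, the inline-view variant applied to the original query could look roughly like this, pushing the date filter and a row limit into the GoodsDeliveries side (the select list is abbreviated here; keep the full one from the original query):
select 'key', ACC.Name, CON.FullName, ACC.Postcode, DEL.DeliveryNumber
from ( select *
       from ExactOnlineREST..GoodsDeliveries
       where DeliveryDate between $P{P_SHIPDATE_FROM} and $P{P_SHIPDATE_TO}
       limit 100
     ) del
join ExactOnlineREST..Accounts acc on ACC.ID = del.DeliveryAccount
join ExactOnlineREST..Addresses ADR on ADR.ID = DEL.DeliveryAddress
join ExactOnlineREST..Contacts CON on CON.ID = DEL.DeliveryContact
order
by DEL.DeliveryNumber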
Data Cache
When you run the query multiple times per day (unlikely, but I don't know), you might want to cache the Accounts in a relational database and update it every day.
You can also use local memorize results clipboard and local save results clipboard to ... to save the last results to a file manually, and later restore them using local load results clipboard from ... and local insert results clipboard in table .... And maybe then insert into ... from exactonlinerest..accounts where datecreated > trunc(sysdate).
Data Replicator
With Data Replicator enabled, you can have replicas created and maintained automatically within an on-premise or cloud relational database for Exact Online API entities. For low latency, you will need to enable the Exact webhooks.
Have SQL Engine or Exact adapted
You can also register a request to have the SQL engine allow a higher number in the join_set hint, which would require addressing the Exact Online APIs in another way. Or register a request with Exact to also allow POST requests to the API with the filter in the body.

How to immediately get timestamp value after update?

I've got some client code that is committing some data across some tables, in simple terms like so:
Client [Id, Balance, Timestamp]
ClientAssets [Id, AssetId, Quantity]
ClientLog [Id, ClientId, BalanceBefore, BalanceAfter]
When the customer buys an asset, I do the following pseudo code:
BEGIN TRANSACTION
1. GetClientRow Where ID = 1
2. Has enough balance for new asset cost? Yes...
3. Insert Into ClientAssets...
4. UpdateClient -> UPDATE Client SET Balance = f_SumAssetsForClient(1) WHERE ID = 1 and Timestamp = TS From Step 1;
5. GetClientRow Where ID = 1
6. Insert Into ClientLog BalanceBefore = Balance at Step 1, BalanceAfter = Balance at Step 5.
COMMIT
In step 4, the client row is updated in one UPDATE statement using a function f_SumAssetsForClient that just sums the assets for the client and returns the balance of those assets. Also in step 4, the timestamp is automatically updated.
My problem is that when I call GetClientRow again in step 5, someone else could have updated the client's balance, so when I go to write the log in step 6, it's not truly the balance after this set of steps. It would be the balance after a different write outside of this transaction.
If I could get the newly updated timestamp from the client row when I call UPDATE in step 4, I could pass it to step 5 to only grab the client row where the TS equals the newly updated TS. Is this possible at all? Or is my design flawed? I can't see a way out of the problem of stale data between steps 5 and 6. I sense there is a problem in the table design but can't quite see it.
Step 1 needs to be SELECT ... FOR UPDATE. Any other data that needs to change also needs to be "locked" FOR UPDATE.
That way, another thread cannot sneak in and modify those rows. They will probably be delayed until after you have COMMITted, or there might be a Deadlock. Either way, the thing you are worried about cannot happen. No timestamp games.
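A minimal sketch of the same steps with row locking, assuming a MySQL/InnoDB-style engine; the INSERT column lists and literal values are placeholders:
BEGIN;
-- step 1: lock the client row; concurrent writers now block until COMMIT
SELECT Balance FROM Client WHERE Id = 1 FOR UPDATE;
-- step 2: check the balance in application code, then:
INSERT INTO ClientAssets (...) VALUES (...);                       -- step 3
UPDATE Client SET Balance = f_SumAssetsForClient(1) WHERE Id = 1;  -- step 4, no Timestamp check needed
SELECT Balance FROM Client WHERE Id = 1;                           -- step 5, still sees only this transaction's value
INSERT INTO ClientLog (ClientId, BalanceBefore, BalanceAfter)
VALUES (1, ..., ...);                                              -- step 6, balances from steps 1 and 5
COMMIT;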
Copied from comment:
Sounds like you need a step 3.5 that is SELECT f_SumAssetsForClient(1), then store that value, then do the update, then write the log with the values. You shouldn't have to deal with the timestamp at all. Or do the whole procedure as a stored proc.
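In SQL terms that suggestion looks roughly like the sketch below; the engine isn't stated in the question, so MySQL-style session variables are used purely for illustration, and @oldBalance is assumed to hold the balance read in step 1:
-- step 3.5: compute the new balance once, inside the transaction
SELECT f_SumAssetsForClient(1) INTO @newBalance;
-- step 4: apply the stored value
UPDATE Client SET Balance = @newBalance WHERE Id = 1;
-- step 6: log without re-reading the row
INSERT INTO ClientLog (ClientId, BalanceBefore, BalanceAfter)
VALUES (1, @oldBalance, @newBalance);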

How to use Linq to Sql as a Serial Number Generator to avoid Gaps?

I have created the following LINQ to SQL transaction to try to create invoice numbers without gaps.
Assuming 2 Tables:
Table 1: InvoiceNumbers
Columns: ID, SerialNumber, Increment
Example: 1, 10001, 1
Table 2: Invoices
Columns: ID, InvoiceNumber, Name
Example: 1, 10001, "Bob Smith"
Dim db As New Invoices.InvoicesDataContext
Dim lastInvoiceNumber = (From n In db.InvoiceNumbers Order By n.LastSerialNumber Descending
Select n.LastSerialNumber, n.Increment).First
Dim nextInvoiceNumber As Integer = lastInvoiceNumber.LastSerialNumber + lastInvoiceNumber.Increment
Dim newInvoiceNumber = New Invoices.InvoiceNumber With {.LastSerialNumber = nextInvoiceNumber, .Increment = lastInvoiceNumber.Increment}
Dim newInvoice = New Invoices.Invoice With {.InvoiceNumber = nextInvoiceNumber, .Name = "Test" + nextInvoiceNumber.ToString}
db.InvoiceNumbers.InsertOnSubmit(newInvoiceNumber)
db.Invoices.InsertOnSubmit(newInvoice)
db.SubmitChanges()
It all works fine, but is it possible using this method that 2 users might pick up the same invoice number if they hit the transaction at the same time?
If so, is there a better way using LINQ to SQL?
Gaps in sequences are inevitable when dealing with transactional databases.
First, you cannot use SELECT max(id)+1 because it may give the same id to 2 transactions which execute at the same time. This means you have to use database native auto-increment column (MySQL, SQLite) or database sequence (PostgreSQL, MSSQL, Oracle) to obtain next available id.
But even using auto-increment sequence does NOT solve this problem.
Imagine that you have 2 database connections that started 2 parallel transactions almost at the same time. First one acquired some id from auto-increment sequence and it became previously used value +1. One nanosecond later, second transaction acquired next id, which is now +2. Now imagine that first transaction rolled back for some reason (encountered error, your code decided to abort it, program crashed - you name it). After that, second transaction committed with id +2, creating a gap in id numbering.
But what if number of such parallel transactions was more than 2? You cannot predict, and you also cannot tell currently running transactions to reuse ids that were abandoned.
It is theoretically possible to reuse abandoned ids. However, in practice it is prohibitively expensive on database, and creates more problems when multiple sessions try to do the same thing.
TL;DR: stop fighting it; gaps in used ids are perfectly normal.
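To make the rollback scenario above concrete, here is a small MySQL-flavoured sketch (the invoice table is made up for the example) showing how an abandoned auto-increment value becomes a permanent gap:
CREATE TABLE invoice (
  id   INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(50)
);

START TRANSACTION;                        -- first transaction
INSERT INTO invoice (name) VALUES ('A');  -- reserves id 1
ROLLBACK;                                 -- id 1 is abandoned and never reused

START TRANSACTION;                        -- second transaction
INSERT INTO invoice (name) VALUES ('B');  -- gets id 2
COMMIT;

SELECT id FROM invoice;                   -- returns only 2: a gap at 1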
You can always ensure that the transaction is not running in more than one thread at the same time by using lock():
public class MyClass
{
    // A single shared lock object guards the whole transaction.
    private static readonly object myLockObject = new object();

    public void TransactionAndStuff()
    {
        lock (myLockObject)
        {
            // your LINQ query and SubmitChanges() here
        }
    }
}

Is it better to use database polling or events for the following system?

I'm working on an ordering system that works exactly the way Netflix's service works (see end of this question if you're not familiar with Netflix). I have two approaches and I am unsure which approach is the right one; one relies on database polling and the other is event driven.
The following two approaches assume this simplified schema:
member(id, planId)
plan(id, moviesPerMonthLimit, moviesAtHomeLimit)
wishlist(memberId, movieId, rank, shippedOn, returnedOn)
Polling: I would run the following count queries in wishlist
Count movies shippedThisMonth (where shippedOn IS NOT NULL #memberId)
Count moviesAtHome (where shippedOn IS NOT NULL, and returnedOn IS NULL #memberId)
Count moviesInList (#memberId)
The following function will determine how many movies to ship:
moviesToShip = Min(moviesPerMonthLimit - shippedThisMonth, moviesAtHomeLimit - moviesAtHome, moviesInList)
I will loop through each member, run the counts, and loop through their list as many times as moviesToShip. Seems like a pain in the neck, but it works.
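For what it's worth, the three polling counts for one member can be collected in a single pass over wishlist. A rough MySQL-style sketch, assuming the schema above (the current-month filter on shippedOn is left out for brevity):
SELECT
    SUM(shippedOn IS NOT NULL)                        AS shippedThisMonth,
    SUM(shippedOn IS NOT NULL AND returnedOn IS NULL) AS moviesAtHome,
    COUNT(*)                                          AS moviesInList
FROM wishlist
WHERE memberId = ?;
-- moviesToShip = LEAST(moviesPerMonthLimit - shippedThisMonth,
--                      moviesAtHomeLimit - moviesAtHome,
--                      moviesInList)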
Event Driven: This approach involves adding an extra column "queuedForShipping" and setting it to 0 or 1 every time an event takes place. I will do the following counts:
Count movies shippedThisMonth (where shippedOn IS NOT NULL #memberId)
Count moviesAtHome (where shippedOn IS NOT NULL, and returnedOn IS NULL #memberId)
Count moviesQueuedForShipping (where queuedForShipping = 1, #memberId)
Instead of using min, I have to use the following if statements:
If moviesPerMonthLimit > (shippedThisMonth + moviesQueuedForShipping)
AND moviesAtHomeLimit > (moviesAtHome + moviesQueuedForShipping)
If both conditions are true, I will select a row from wishlist where queuedForShipping = 0 and set its queuedForShipping to 1. I will run this function every time someone adds to, deletes from, or reorders their list. When it's time to ship, I would select the rows for #memberId where queuedForShipping = 1. I would also run this when updating shippedOn and returnedOn.
Approach one is simple. It also allows members to mess around with their ranks until someone decides to run the polling. That way, what to ship is always decided by rank. But people keep telling me polling is bad.
The event-driven approach is self-sustaining, but it seems like a waste of time to ping the database with all those counts every time a person changes their list. I would also have to write to the queuedForShipping column. It also means that when a member re-ranks their list while they have pending shipments (shippedOn IS NULL, queuedForShipping = 1), I would have to update those rows and set queuedForShipping back to 1 based on the new ranks. (What if someone added 5 movies and then suddenly went to change the order? Well, queuedForShipping would already be set to 1 on the first two movies he or she added.)
Can someone please give me their opinion on the best approach here and the cons/advantages of polling versus event driven?
Netflix is a monthly subscription service where you create a movie list, and your movies are shipped to you based on your service plan limits.
Based on what you described, there's no reason to keep the data "ready to use" (event) when you can create it very easily when needed (poll).
Reasons to cache it:
If you needed to display the next item to the user.
If the detailed data was being removed due to some retention policy.
If the polling queries were too slow.

Is this a case for denormalisation?

I have a site with about 30,000 members to which I'm adding a functionality that involves sending a random message from a pool of 40 possible messages. Members can never receive the same message twice.
One table contains the 40 messages and another table maps the many-to-many relationship between messages and members.
A cron script runs daily, selects a member from the 30,000, selects a message from the 40 and then checks to see if this message has been sent to this user before. If not, it sends the message. If yes, it runs the query again until it finds a message that has not yet been received by this member.
What I'm worried about now is that this m-m table will become very big: at 30,000 members and 40 messages we already have 1.2 million rows through which we have to search to find a message that has not yet been sent.
Is this a case for denormalisation? In the members table I could add 40 columns (message_1, message_2 ... message_40) in which a 1 flag is added each time a message is sent. If I'm not mistaken, this would make the queries in the cron script run much faster?
I know that doesn't answer your original question, but wouldn't it be way faster if you selected all the messages that haven't yet been sent to a user and then picked one of those randomly?
See this pseudo-MySQL:
SELECT
GROUP_CONCAT(messages.id) AS unsent_messages,
user.id AS user
FROM
messages,
user
WHERE
messages.id NOT IN (
SELECT
id
FROM
sent_messages
WHERE
user.id = sent_messages.user
)
GROUP BY user.id
You could also append the ids of the sent messages to a varchar field in the members table.
Although this is bad practice, it would make it easy to get a message that has not yet been sent to a specific member in one statement.
Just like this (if you surround the ids with '-'):
SELECT message.id
FROM member, message
WHERE member.id = 2321
AND member.sentmessages NOT LIKE CONCAT('%-', message.id, '-%')
1.2 M rows at 8 bytes (+ overhead) per row is not a lot. It's so small I wouldn't even bet it needs indexing (but of course you should do it).
Normalization reduces redundancy, and it is what you'll do if you have a large amount of data, which seems to be your case. You need not denormalize. Let there be an M-to-M table between members and messages.
You can archive the old data as your M-to-M data grows. I don't even see any conflicts, because your cron job runs daily for this task and accounts only for the data for the current day. So you can archive the M-to-M table data every week.
I believe there will be maintenance issues if you denormalize by adding additional columns to the members table. I don't recommend it. Archiving old data can save you from trouble.
You could store only the available (unsent) messages. This implies extra maintenance when you add or remove members or message types (nothing that can't be automated with foreign keys and triggers), but it simplifies delivery: pick a random row for the user, send the message, and remove the row. Also, your database will get smaller as messages get sent ;-)
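A hedged sketch of that idea; the table and column names below are made up for illustration:
CREATE TABLE available_message (
    member_id  INT NOT NULL,
    message_id INT NOT NULL,
    PRIMARY KEY (member_id, message_id)
);

-- daily cron for one member: pick one remaining message at random...
SELECT message_id
FROM available_message
WHERE member_id = ?
ORDER BY RAND()
LIMIT 1;

-- ...send it, then remove the row so it can never be picked again
DELETE FROM available_message
WHERE member_id = ? AND message_id = ?;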
You can achieve the effect of sending random messages by preallocating the random string of message ids in your m-m table, along with a pointer to the offset of the last message sent.
In more detail, create a table MemberMessages with columns
memberId,
messageIdList char(80) or varchar ,
lastMessage int,
primary key is memberId.
Pseudo-code for the cron job then looks like this...
ONE. Select the next message for a member. If no row exists in MemberMessages for this member, go to step TWO. The SQL to select the next message looks like:
select substr(messageIdList, 2*lastMessage + 1, 2) as nextMessageId
from MemberMessages
where memberId = ?
Send the message identified by nextMessageId,
then update lastMessage, incrementing it by 1, unless you have reached 39, in which case reset it to zero:
update MemberMessages
set lastMessage = MOD(lastMessage + 1, 40)
where memberId = ?
TWO. Create a random list of message ids as a string of couplets like 2117390740... This is your random list of message ids as an 80-character string. Insert a row into MemberMessages for your memberId, setting messageIdList to that 80-character string and lastMessage to 1.
Send the message identified by the first couplet from the list to the member.
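For completeness, step TWO boils down to a single insert once the shuffled 80-character couplet string has been built in application code; a sketch using the columns above:
insert into MemberMessages (memberId, messageIdList, lastMessage)
values (?, ?, 1)  -- the second parameter is the shuffled 80-character couplet string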
You can create a kind of queue / heap.
ReceivedMessages
UserId
MessageId
then:
Pick a member and select a message to send:
SELECT * FROM Messages WHERE MessageId NOT IN (SELECT MessageId FROM ReceivedMessages WHERE UserId = #UserId) LIMIT 1
Then insert the MessageId and UserId into ReceivedMessages
and do the send logic there.
I hope that helps.
There are potentially easier ways to do this, depending on how random you want "random" to be.
Consider that at the beginning of the day you shuffle an array A, [0..39], which describes the order of the messages to be sent to users today.
Also, consider that you have at most 40 cron jobs which are used to send messages to the users. Given the Nth cron job and ID, the selected user's numeric id, you can choose M, the index of the message to send:
M = (A[N] + ID) % 40.
This way, a given ID would not receive the same message twice in the same day (because A[N] would be different), and two randomly selected users have a 1/40 chance of receiving the same message. If you want more "randomness" you can potentially use multiple arrays.