Selecting row data and a scalar in SQL - mysql

I have a job table, where each job has some metrics like cost, time taken, etc. I'd like to select information for a set of jobs, like the requestor and job action, and in addition to that row data, select some high-level metrics (min cost, max cost, min time taken, etc.).
The data changes frequently, so I'd like to get this information in a single select. Is it possible to do this? I'm not sure if this is conceptually possible because the DB would have to return row-level data along with aggregate data.
Right now I can get all the details and calculate the min/max, something like this:
select requestor, action, cost, time_taken from job;
But then I have to write code to find the min/max and this query has to download all the cost/time data when I am really only interested in the min/max. I really want to do something like
select (min(cost), max(cost), min(time_taken), max(time_taken)), (requestor, action) from job;
And get the aggregate data first, and then the row level data. Is this possible? (On a real server this is on MySQL, but for dev I locally use sqlite so it'd be nice if it worked there too, but not required).
The table looks something like this:
+----+-----------+--------+------+------------+
| id | requestor | action | cost | time_taken |
+----+-----------+--------+------+------------+
| 1 | 31233 | sync | 8 | 423.3 |
+----+-----------+--------+------+------------+
| 2 | 11229 | read | 1 | 1.3 |
+----+-----------+--------+------+------------+
| 3 | 1434 | edit | 5 | 152.8 |
+----+-----------+--------+------+------------+
| 4 | 101781 | sync | 12 | 712.1 |
+----+-----------+--------+------+------------+
I'd like to get back the stats:
min/max cost: 1/12
min/max time_taken: 1.3/712.1
and all the requestors and actions:
+-----------+--------+
| requestor | action |
+-----------+--------+
| 31233 | sync |
+-----------+--------+
| 11229 | read |
+-----------+--------+
| 1434 | edit |
+-----------+--------+
| 101781 | sync |
+-----------+--------+

Do you just want aggregation?
select requestor, action, min(cost), max(cost), min(time_taken), max(time_taken)
from job
group by requestor, action;
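If per-group aggregation is not what you are after, one alternative is to cross join the rows with a one-row aggregate subquery, so every row carries the overall min/max. Here is a minimal sketch using Python's sqlite3 module with the sample data from the question; the aliases (j, s, min_cost, and so on) are made up for illustration, and the same SELECT should also be valid in MySQL, though that is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job (id INTEGER PRIMARY KEY, requestor INTEGER, "
    "action TEXT, cost INTEGER, time_taken REAL)"
)
conn.executemany(
    "INSERT INTO job VALUES (?, ?, ?, ?, ?)",
    [
        (1, 31233, "sync", 8, 423.3),
        (2, 11229, "read", 1, 1.3),
        (3, 1434, "edit", 5, 152.8),
        (4, 101781, "sync", 12, 712.1),
    ],
)

# The subquery yields exactly one row of overall stats; CROSS JOIN
# attaches that row to every job row.
query = """
SELECT j.requestor, j.action,
       s.min_cost, s.max_cost, s.min_time, s.max_time
FROM job AS j
CROSS JOIN (
    SELECT MIN(cost) AS min_cost, MAX(cost) AS max_cost,
           MIN(time_taken) AS min_time, MAX(time_taken) AS max_time
    FROM job
) AS s
ORDER BY j.id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The stats are repeated on every row, so the client reads them from the first row and ignores them on the rest. The cost and time_taken columns themselves are never sent back.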

Related

Is implementing an enrichment using Spark with MySQL a bad idea?

I am trying to build one giant schema that makes querying easier for data users. To achieve that, streaming events have to be joined with User Metadata on USER_ID and ID. In data engineering, this operation is called "data enrichment", right? The tables below are an example.
# `Event` (Stream)
+---------+--------------+---------------------+
| USER_ID | EVENT | TIMESTAMP |
+---------+--------------+---------------------+
| 1 | page_view | 2020-04-10T12:00:11 |
| 2 | button_click | 2020-04-10T12:01:23 |
| 3 | page_view | 2020-04-10T12:01:44 |
+---------+--------------+---------------------+
# `User Metadata` (Static)
+----+-------+--------+
| ID | NAME | GENDER |
+----+-------+--------+
| 1 | Matt | MALE |
| 2 | John | MALE |
| 3 | Alice | FEMALE |
+----+-------+--------+
==> # Result
+---------+--------------+---------------------+-------+--------+
| USER_ID | EVENT | TIMESTAMP | NAME | GENDER |
+---------+--------------+---------------------+-------+--------+
| 1 | page_view | 2020-04-10T12:00:11 | Matt | MALE |
| 2 | button_click | 2020-04-10T12:01:23 | John | MALE |
| 3 | page_view | 2020-04-10T12:01:44 | Alice | FEMALE |
+---------+--------------+---------------------+-------+--------+
I was developing this using Spark, with User Metadata stored in MySQL. Then I realized that joining against MySQL tables inside the Spark code would waste Spark's parallelism, right?
I guess MySQL will become the bottleneck if traffic increases.
Should I store those tables in a key-value store and update it periodically?
Can you give me some ideas for tackling this problem? How do you usually handle this type of operation?
Solution 1:
As you suggested, you can keep a local key-value cache of the User Metadata table and refresh it at a regular interval.
Solution 2:
You can use a MySQL-to-Kafka connector such as Debezium:
https://debezium.io/documentation/reference/1.1/connectors/mysql.html
For every DML or ALTER TABLE operation on your User Metadata table, a corresponding event is published to a Kafka topic (e.g. db_events). You can run a thread in parallel in your Spark streaming job that polls db_events and updates your local key-value cache.
This solution would make your application near-real-time in the true sense.
One overhead I can see is that you need to run a Kafka Connect service with the MySQL connector (i.e. Debezium) as a plugin.
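To make Solution 1 concrete, here is a minimal sketch of a periodically refreshed key-value cache in plain Python. Every name in it (RefreshingCache, enrich, load_user_metadata, the 300-second interval) is hypothetical, the loader is a hard-coded stand-in for a real query against MySQL, and wiring this into a Spark job is not shown.

```python
import time
from typing import Callable, Dict, Optional

Metadata = Dict[str, str]


class RefreshingCache:
    """Key-value snapshot of a lookup table, reloaded once it is older than ttl_seconds."""

    def __init__(
        self,
        loader: Callable[[], Dict[int, Metadata]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[int, Metadata] = {}
        self._loaded_at: Optional[float] = None

    def get(self, key: int) -> Optional[Metadata]:
        now = self._clock()
        # Reload the whole snapshot on first use or when it has expired.
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._data = self._loader()
            self._loaded_at = now
        return self._data.get(key)


def enrich(event: dict, cache: RefreshingCache) -> dict:
    """Return the event with the user's metadata merged in (unchanged if unknown)."""
    meta = cache.get(event["USER_ID"]) or {}
    return {**event, **meta}


def load_user_metadata() -> Dict[int, Metadata]:
    # Stand-in for reading ID, NAME, GENDER from the MySQL table.
    return {
        1: {"NAME": "Matt", "GENDER": "MALE"},
        2: {"NAME": "John", "GENDER": "MALE"},
        3: {"NAME": "Alice", "GENDER": "FEMALE"},
    }


cache = RefreshingCache(load_user_metadata, ttl_seconds=300)
event = {"USER_ID": 1, "EVENT": "page_view", "TIMESTAMP": "2020-04-10T12:00:11"}
enriched = enrich(event, cache)
print(enriched)
```

With this shape, MySQL is queried once per refresh interval rather than once per event. The trade-off is that metadata can be stale for up to ttl_seconds, which is the gap Solution 2 is meant to close.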

MySQL: How to make sure update is always executed before select?

I am creating a web app that lets N users enter receipt data.
A set of scanned receipts is given to users, but no more than 2 users should work on the same receipt.
I.e. User A and User B can work on receipt-1, but User C cannot work on it (another receipt, say receipt-2, should be assigned to User C).
The table structure I am using looks similar to the following.
[User-Receipt Table]
+------------+--------------+
| user_id | receipt_id |
+------------+--------------+
| 000000001 | R0000000000 |
| 000000001 | R0000000001 |
| 000000001 | R0000000002 |
| 000000002 | R0000000000 |
| 000000002 | R0000000001 |
+------------+--------------+
[Receipt Table]
+-------------+--------+
| receipt_id | status |
+-------------+--------+
| R0000000000 | 0 |
| R0000000001 | 1 |
| R0000000002 | 0 |
| R0000000003 | 2 |
+-------------+--------+
★status 0:not assigned 1:assigned to a user 2: assigned to 2 users
1. Select receipts from the receipt table whose status is not equal to '2'.
2. Insert the receipts fetched in step 1 along with the user to whom the receipts are assigned.
3. Update the receipt status (0->1 or 1->2).
This is how I plan to achieve the above requirement.
The problem with this approach is that there could be a chance that the select(step1) is executed right before the update(step3) is executed.
If this happens, the receipts with status 2 might be fetched and assigned to another user, which does not meet the requirement.
How can I make sure that this does not happen?
In any case, use transactions:
START TRANSACTION;
your SQL commands
COMMIT;
A transaction ensures that either all of your statements are executed or none are, and it implicitly locks the updated rows, which is more efficient than the second approach below.
You can also do it using LOCK TABLES.
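Here is a minimal sketch of the three steps wrapped in a single transaction, using Python's sqlite3 module as a stand-in for MySQL. The table names (receipt, user_receipt), the helper assign_receipt, and the seed data are assumptions for illustration. SQLite's BEGIN IMMEDIATE takes the write lock before the SELECT runs; on MySQL/InnoDB the usual counterpart would be START TRANSACTION followed by SELECT ... FOR UPDATE, which is not exercised here.

```python
import sqlite3

# isolation_level=None: the module does not open transactions for us,
# so BEGIN / COMMIT / ROLLBACK below are fully explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE receipt (receipt_id TEXT PRIMARY KEY, status INTEGER NOT NULL);
CREATE TABLE user_receipt (
    user_id TEXT NOT NULL,
    receipt_id TEXT NOT NULL,
    PRIMARY KEY (user_id, receipt_id)
);
INSERT INTO receipt VALUES ('R0000000000', 0), ('R0000000001', 0);
""")


def assign_receipt(conn, user_id):
    """Assign one receipt with fewer than 2 users to user_id, or return None."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT receipt_id FROM receipt "
            "WHERE status < 2 AND receipt_id NOT IN "
            "  (SELECT receipt_id FROM user_receipt WHERE user_id = ?) "
            "ORDER BY receipt_id LIMIT 1",
            (user_id,),
        ).fetchone()
        if row is None:
            conn.execute("ROLLBACK")
            return None
        receipt_id = row[0]
        conn.execute(
            "INSERT INTO user_receipt (user_id, receipt_id) VALUES (?, ?)",
            (user_id, receipt_id),
        )
        conn.execute(
            "UPDATE receipt SET status = status + 1 WHERE receipt_id = ?",
            (receipt_id,),
        )
        conn.execute("COMMIT")
        return receipt_id
    except Exception:
        conn.execute("ROLLBACK")
        raise


assigned = [assign_receipt(conn, user) for user in ("A", "B", "C")]
print(assigned)
```

The point is that the read in step 1 happens under the same lock as the write in step 3, so a second session cannot select a receipt between another session's select and update.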

What's the best way to get MySQL data into date-related groupings without crushing our DB?

I have a few tables related to an app of ours in a database, and the data needs to be lumped into buckets so we can compare one source with another.
For example, we have an app install table with a source and a timestamp.
Then we have an uninstall table with an app ID.
We basically need to group the data into buckets of "0-7";"7-14";"15-30";"30-60" days of age.
Then, from there, select the number of uninstalls that happen in a similar fashion: first week, second week, second half of the month, second month.
It's not so bad if we only have 50-100k installs. However, when we throw app activity into the mix to see whether a bucket did a certain action, our actions table is in the millions, and the world ends.
Is there a way we can do this with MySQL, or is it just not practical?
It almost seems easier to set up a server-side script to process each row individually into a rollup table.
Install
| App ID | Timestamp | Source
--------------------------------------------------------
| foo-1 | 2015-11-23 03:49:12 | Google
| foo-2 | 2015-12-23 03:49:12 | Facebook
| foo-3 | 2015-12-31 01:10:01 | Google
Purchase:
| App ID | Timestamp | Amount
--------------------------------------------------------
| foo-1 | 2015-11-26 05:49:12 | $10.00
| foo-1 | 2015-12-27 03:49:12 | $5.00
Uninstall:
| App ID | Timestamp
--------------------------------------------------------
| foo-2 | 2015-12-15 05:49:12
Report: (FP = First Purchase, U = Uninstall)
| Source | Total Installs | FP 0-14d | FP in 15-30 | FP in 30-60 | U in 0-14d | U in 15-30
--------------------------------------------------------
| Google | 2 | 1 | - | - | - | -
| Facebook | 1 | - | - | - | 1 | -
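One way the first-purchase columns of such a report could be produced in a single query is conditional aggregation: compute each app's days from install to first purchase once, then count into buckets with SUM(CASE ...). The sketch below uses Python's sqlite3 module with the sample rows above. The table and column names and the exact bucket edges (0-14, 15-30, 31-60) are assumptions. julianday() is SQLite-specific; on MySQL, DATEDIFF or TIMESTAMPDIFF would play that role. Whether this is fast enough against millions of rows depends on indexing and is not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE install (app_id TEXT PRIMARY KEY, ts TEXT, source TEXT);
CREATE TABLE purchase (app_id TEXT, ts TEXT, amount REAL);
INSERT INTO install VALUES
    ('foo-1', '2015-11-23 03:49:12', 'Google'),
    ('foo-2', '2015-12-23 03:49:12', 'Facebook'),
    ('foo-3', '2015-12-31 01:10:01', 'Google');
INSERT INTO purchase VALUES
    ('foo-1', '2015-11-26 05:49:12', 10.00),
    ('foo-1', '2015-12-27 03:49:12', 5.00);
""")

# fp: one row per app that purchased, holding days from install to first purchase.
# Apps without a purchase get NULL days from the LEFT JOIN and fall into ELSE 0.
query = """
SELECT i.source,
       COUNT(*) AS total_installs,
       SUM(CASE WHEN fp.days >= 0  AND fp.days < 15  THEN 1 ELSE 0 END) AS fp_0_14,
       SUM(CASE WHEN fp.days >= 15 AND fp.days < 31  THEN 1 ELSE 0 END) AS fp_15_30,
       SUM(CASE WHEN fp.days >= 31 AND fp.days <= 60 THEN 1 ELSE 0 END) AS fp_31_60
FROM install AS i
LEFT JOIN (
    SELECT p.app_id,
           julianday(MIN(p.ts)) - julianday(i2.ts) AS days
    FROM purchase AS p
    JOIN install AS i2 ON i2.app_id = p.app_id
    GROUP BY p.app_id, i2.ts
) AS fp ON fp.app_id = i.app_id
GROUP BY i.source
ORDER BY i.source
"""
report = conn.execute(query).fetchall()
for row in report:
    print(row)
```

The uninstall columns would follow the same pattern with a second LEFT JOIN against the uninstall table. If the query is still too heavy, its result is also a natural thing to store in the rollup table mentioned in the question.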

Display only one row for values that appear multiple times

I have multiple rows with the same name in this table, and I want to show only one row for each name. For example, with the following data:
| name | number |
+------+--------+
| exe | 1 |
| exe | 10 |
| exe | 2 |
| bat | 1 |
| exe | 3 |
| bat | 4 |
I would like to see the following results:
| name | number |
+------+--------+
| exe | 16 |
| bat | 5 |
How can I achieve this result?
Regarding the duplicate: my question has only one table, and the JOIN ... ON command in the other question makes it confusing to understand. I think this simple question can help many people!
Try something like this:
SELECT t.`name`, SUM(t.`number`) AS `number`
FROM mytable t
GROUP BY t.`name`
ORDER BY `number` DESC
Let the database return the result you want, rather than returning a boatload of rows and collapsing them on the client side. There's plenty of work for the client to do without taking on what the database can do far more efficiently.
You can use an aggregation function for this:
SELECT name, SUM(number) AS total
FROM myTable
GROUP BY name;
Here is a reference on aggregate functions, and here is an SQL Fiddle example using your sample data.
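As a quick check of the GROUP BY approach, here is a small sketch that loads the sample data into an in-memory SQLite database through Python's sqlite3 module and runs the same kind of query. The table name mytable comes from the answer above; the ORDER BY on the summed column is added only to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (name TEXT, number INTEGER)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [("exe", 1), ("exe", 10), ("exe", 2), ("bat", 1), ("exe", 3), ("bat", 4)],
)

# One output row per distinct name, with the numbers summed.
totals = conn.execute(
    "SELECT name, SUM(number) AS total "
    "FROM mytable "
    "GROUP BY name "
    "ORDER BY total DESC"
).fetchall()
print(totals)  # [('exe', 16), ('bat', 5)]
```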

How to create a MySQL DB for account balance, adding and subtracting amounts

I have a project that works like an online service; I built part of it and got stuck. When a user uses a service, it must charge some amount (e.g. $5 per service). I don't know how to build the MySQL tables. I have made two tables: the first for the remaining balance, the second for added and subtracted amounts. Maybe this is the wrong way; what is the best practice?
action_table
id | userId | reason | amount
1 | 4 | for service 3 | -5
2 | 2 | refill account | 100
3 | 13 | for service 3 | -5
balance_table
1 | 4 | 23
2 | 2 | 125
3 | 13 | 0
After a service is used, the query adds one row to action_table and updates balance_table.
Personally, if I was making an account database, I would have one table for an account and one for transactions, like this:
Accounts:
| id | user | name | balance |
Transactions:
| id | account_id | description | amount | is_withdrawal |
The reason I came up with this is because it helps to think of database tables like real world objects sometimes, and in this case you have a Transaction and an Account.
Then, you can use a TRIGGER to update the account table anytime you add a transaction.
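Here is a minimal sketch of that trigger idea, written against SQLite through Python's sqlite3 module so it can be run locally. The columns follow the two tables above; the trigger name apply_transaction, the sample account, and the convention that amount is stored as a positive number with is_withdrawal giving the sign are assumptions. MySQL trigger syntax differs slightly (for example, it requires FOR EACH ROW).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (
    id INTEGER PRIMARY KEY,
    user INTEGER,
    name TEXT,
    balance INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE transactions (
    id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL REFERENCES accounts(id),
    description TEXT,
    amount INTEGER NOT NULL,          -- always positive (assumption)
    is_withdrawal INTEGER NOT NULL DEFAULT 0
);

-- Keep accounts.balance in step with every inserted transaction.
CREATE TRIGGER apply_transaction AFTER INSERT ON transactions
BEGIN
    UPDATE accounts
    SET balance = balance +
        CASE WHEN NEW.is_withdrawal THEN -NEW.amount ELSE NEW.amount END
    WHERE id = NEW.account_id;
END;

INSERT INTO accounts (id, user, name) VALUES (1, 2, 'main');
INSERT INTO transactions (account_id, description, amount, is_withdrawal)
    VALUES (1, 'refill account', 100, 0);
INSERT INTO transactions (account_id, description, amount, is_withdrawal)
    VALUES (1, 'for service 3', 5, 1);
""")

balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(balance)  # 95
```

Because the balance is only ever changed by the trigger, it can always be recomputed from the transactions table if the two ever disagree.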