What's the best way to get MySQL data into date-related groupings without crushing our DB?

I have a few tables related to an app of ours in a database, and the data needs to be lumped into buckets to help compare one source with another.
For example, we have an app install table with a source and a timestamp.
Then we have an uninstall table with an app ID.
We basically need to get the data into groupings of "0-7", "7-14", "15-30", and "30-60" days of age.
Then we select from there the number of uninstalls that happen in similar buckets: first week, second week, second half of the month, second month.
It's not so bad if we only have 50-100k installs; however, when we throw app activity into the mix, to see if a bucket performed a certain action, our actions table is in the millions, and the world ends.
Is there a way we can do this with MySQL, or is it just not practical?
It almost seems easier to set up a server-side script to process each row individually into a rollup table.
Install:
+--------+---------------------+----------+
| App ID | Timestamp           | Source   |
+--------+---------------------+----------+
| foo-1  | 2015-11-23 03:49:12 | Google   |
| foo-2  | 2015-12-23 03:49:12 | Facebook |
| foo-3  | 2015-12-31 01:10:01 | Google   |
+--------+---------------------+----------+
Purchase:
+--------+---------------------+--------+
| App ID | Timestamp           | Amount |
+--------+---------------------+--------+
| foo-1  | 2015-11-26 05:49:12 | $10.00 |
| foo-1  | 2015-12-27 03:49:12 | $5.00  |
+--------+---------------------+--------+
Uninstall:
+--------+---------------------+
| App ID | Timestamp           |
+--------+---------------------+
| foo-2  | 2015-12-15 05:49:12 |
+--------+---------------------+
Report: (FP = First Purchase, U = Uninstall)
+----------+----------------+----------+-----------+-----------+---------+----------+
| Source   | Total Installs | FP 0-14d | FP 15-30d | FP 30-60d | U 0-14d | U 15-30d |
+----------+----------------+----------+-----------+-----------+---------+----------+
| Google   | 2              | 1        | -         | -         | -       | -        |
| Facebook | 1              | -        | -         | -         | 1       | -        |
+----------+----------------+----------+-----------+-----------+---------+----------+
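For what it's worth, a report like this can often be produced in a single pass with conditional aggregation instead of per-row scripting. A minimal sketch, assuming tables named install(app_id, ts, source), purchase(app_id, ts, amount) and uninstall(app_id, ts); all names are illustrative:

-- Sketch only: bucket first purchases and uninstalls by install age.
-- MySQL evaluates each BETWEEN comparison to 1/0 (NULL when there is no
-- matching purchase/uninstall), so SUM() counts the rows in each bucket.
-- Assumes at most one uninstall row per app_id.
SELECT i.source,
       COUNT(*) AS total_installs,
       SUM(DATEDIFF(fp.first_ts, i.ts) BETWEEN 0  AND 14) AS fp_0_14,
       SUM(DATEDIFF(fp.first_ts, i.ts) BETWEEN 15 AND 30) AS fp_15_30,
       SUM(DATEDIFF(fp.first_ts, i.ts) BETWEEN 31 AND 60) AS fp_30_60,
       SUM(DATEDIFF(u.ts,        i.ts) BETWEEN 0  AND 14) AS u_0_14,
       SUM(DATEDIFF(u.ts,        i.ts) BETWEEN 15 AND 30) AS u_15_30
FROM install i
LEFT JOIN (SELECT app_id, MIN(ts) AS first_ts
           FROM purchase
           GROUP BY app_id) fp ON fp.app_id = i.app_id
LEFT JOIN uninstall u ON u.app_id = i.app_id
GROUP BY i.source;

With composite indexes on (app_id, ts), the expensive work stays inside the grouped subquery; the same MIN()-per-app pattern would extend to a first-action subquery against the large actions table, or feed a nightly rollup table if even that proves too slow.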

Is implementing an enrichment using Spark with MySQL a bad idea?

I am trying to build one giant schema that makes it easier for data users to query. To achieve that, streaming events have to be joined with User Metadata by USER_ID and ID. In data engineering, this operation is called "data enrichment", right? The tables below are an example.
# `Event` (Stream)
+---------+--------------+---------------------+
| UERR_ID | EVENT | TIMESTAMP |
+---------+--------------+---------------------+
| 1 | page_view | 2020-04-10T12:00:11 |
| 2 | button_click | 2020-04-10T12:01:23 |
| 3 | page_view | 2020-04-10T12:01:44 |
+---------+--------------+---------------------+
# `User Metadata` (Static)
+----+-------+--------+
| ID | NAME | GENDER |
+----+-------+--------+
| 1 | Matt | MALE |
| 2 | John | MALE |
| 3 | Alice | FEMALE |
+----+-------+--------+
==> # Result
+---------+--------------+---------------------+-------+--------+
| UERR_ID | EVENT | TIMESTAMP | NAME | GENDER |
+---------+--------------+---------------------+-------+--------+
| 1 | page_view | 2020-04-10T12:00:11 | Matt | MALE |
| 2 | button_click | 2020-04-10T12:01:23 | John | MALE |
| 3 | page_view | 2020-04-10T12:01:44 | Alice | FEMALE |
+---------+--------------+---------------------+-------+--------+
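In SQL terms this enrichment is just a join. A sketch of the equivalent query, assuming both tables lived in MySQL under the illustrative names event and user_metadata:

-- The enrichment, written as a plain join (table names are assumptions):
SELECT e.UERR_ID, e.EVENT, e.TIMESTAMP, u.NAME, u.GENDER
FROM event e
JOIN user_metadata u ON u.ID = e.UERR_ID;

The question is where to run this join once the events arrive as a high-volume stream.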
I was developing this using Spark, and the User Metadata is stored in MySQL. Then I realized it would be a waste of Spark's parallelism if the Spark code joined against MySQL tables, right?
The bottleneck would be MySQL if traffic increased, I guess.
Should I store those tables in a key-value store and update them periodically?
Can you give me some ideas to tackle this problem? How do you usually handle this type of operation?
Solution 1:
As you suggested, you can keep a local cache copy of the table as key-value pairs and update the cache at a regular interval.
Solution 2:
You can use a MySQL-to-Kafka connector, such as:
https://debezium.io/documentation/reference/1.1/connectors/mysql.html
For every DML or table-alter operation on your User Metadata table, a corresponding event is fired to a Kafka topic (e.g. db_events). You can run a parallel thread in your Spark streaming job which polls db_events and updates your local key-value cache.
This solution would make your application near-real-time in the true sense.
One overhead I can see is the need to run a Kafka Connect service with the MySQL connector (i.e. Debezium) as a plugin.

Selecting row data and a scalar in SQL

I have a job table, where each job has some metrics like cost, time taken, etc. I'd like to select information for a set of jobs, like the requestor and job action, and in addition to that row data, select some high-level metrics (min cost, max cost, min time taken, etc.).
The data changes frequently, so I'd like to get this information in a single select. Is it possible to do this? I'm not sure if this is conceptually possible because the DB would have to return row-level data along with aggregate data.
Right now I can get all the details and calculate the min/max, something like this:
select requestor, action, cost, time_taken from job;
But then I have to write code to find the min/max, and the query has to download all the cost/time data when I am really only interested in the min/max. I really want to do something like
select (min(cost), max(cost), min(time_taken), max(time_taken)), (requestor, action) from job;
And get the aggregate data first, and then the row-level data. Is this possible? (On a real server this is MySQL, but for dev I use SQLite locally, so it'd be nice if it worked there too, but that's not required.)
The table looks something like this:
+----+-----------+--------+------+------------+
| id | requestor | action | cost | time_taken |
+----+-----------+--------+------+------------+
| 1 | 31233 | sync | 8 | 423.3 |
+----+-----------+--------+------+------------+
| 2 | 11229 | read | 1 | 1.3 |
+----+-----------+--------+------+------------+
| 3 | 1434 | edit | 5 | 152.8 |
+----+-----------+--------+------+------------+
| 4 | 101781 | sync | 12 | 712.1 |
+----+-----------+--------+------+------------+
I'd like to get back the stats:
min/max cost: 1/12
min/max time_taken: 1.3/712.1
and all the requestors and actions:
+-----------+--------+
| requestor | action |
+-----------+--------+
| 31233 | sync |
+-----------+--------+
| 11229 | read |
+-----------+--------+
| 1434 | edit |
+-----------+--------+
| 101781 | sync |
+-----------+--------+
Do you just want aggregation?
select requestor, action, min(cost), max(cost), min(time_taken), max(time_taken)
from job
group by requestor, action;
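If the goal is instead the table-wide min/max alongside every row, as the example above asks for, you can cross join a one-row aggregate subquery; a sketch that works in both MySQL and SQLite:

-- One scan computes the aggregates; every job row then carries the same
-- four values alongside its own requestor/action.
SELECT j.requestor, j.action,
       s.min_cost, s.max_cost, s.min_time, s.max_time
FROM job j
CROSS JOIN (SELECT MIN(cost)       AS min_cost,
                   MAX(cost)       AS max_cost,
                   MIN(time_taken) AS min_time,
                   MAX(time_taken) AS max_time
            FROM job) s;

Because the subquery returns exactly one row, the join simply tags each row with the aggregates, and everything comes back in a single consistent select.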

pyqt4 - MySQL: how to print single/multiple rows of a table in the TableViewWidget

I've recently tried to create an executable with Python 2.7 which can read a MySQL database.
The database (named 'montre') contains two tables: patient and proto_1.
Here is the content of those tables:
mysql> select * from proto_1;
+----+------------+---------------------+-------------+-------------------+---------------+----------+
| id | Nom_Montre | Date_Heure          | Temperature | Pulsion_cardiaque | Taux_oxy_sang | Humidite |
+----+------------+---------------------+-------------+-------------------+---------------+----------+
|  1 | montre_1   | 2017-11-27 19:33:25 |       22.30 |              NULL |          NULL |     NULL |
|  2 | montre_1   | 2017-11-27 19:45:12 |       22.52 |              NULL |          NULL |     NULL |
+----+------------+---------------------+-------------+-------------------+---------------+----------+
mysql> select * from patient;
+----+-----------+--------+------+------+---------------------+------------+--------------+
| id | nom       | prenom | sexe | age  | date_naissance      | Nom_Montre | commentaires |
+----+-----------+--------+------+------+---------------------+------------+--------------+
|  2 | RICHEMONT | Robert | M    |   37 | 1980-04-05 23:43:00 | montre_3   | essaye2      |
|  3 | PIERRET   | Mandy  | F    |   22 | 1995-04-05 10:43:00 | montre_4   | essaye3      |
| 14 | PIEKARZ   | Allan  | M    |   22 | 1995-06-01 10:32:56 | montre_1   | Healthy man  |
+----+-----------+--------+------+------+---------------------+------------+--------------+
As I'm only used to coding in C (no OOP), I didn't create classes in the Python project (shame on me...). But I managed, in two files, to create something (with mysql.connector) which can print my database on the command line and execute subroutines like looking-for() etc.
Now I want to create a GUI for users with PyQt. Unfortunately, I saw that the structure is totally different, with classes etc. But okay, I tried to work through this and I've created a GUI which can display the table "patient". But I didn't manage to find in the Qt documentation how I can reuse the programs I've already created for the display, nor how to display only certain rows of my table patient in a tableWidget (using QtSql).
For example, if I want to display all the table patient, I use this line (pyQt):
self.model.setTable("patient")
For this one, I got it, but it bothers me that no MySQL code is required to display my table, so I don't know how to select only the rows we want to see and display them. If we only want to see, for example, ID no. 2, how do I display only Robert in the table widget?
To recap, I want to know:
whether I can take the code I've created and combine it with PyQt;
how to display (in the tableWidget) only rows which are selected by MySQL. Is that possible?
Please find my code at the URL below for a better understanding of my problem:
https://drive.google.com/file/d/1nxufjJfF17P5hN__CBEcvrbuHF-23aHN/view?usp=sharing
I hope I was clear. Thank you all for your help!
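On the second point: QSqlTableModel supports this directly. Its setFilter() method takes the body of a SQL WHERE clause (without the WHERE keyword), so after self.model.setTable("patient"), calling self.model.setFilter("id = 2") and then self.model.select() makes the model run the equivalent of:

-- What the model executes behind the scenes for that filter:
SELECT * FROM patient WHERE id = 2;

and only Robert's row is shown in the view.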

Multiple Data Sources in Microsoft Excel SQL Query

I have a lot of spreadsheets that pull transactional information from our ERP software into Excel using the Microsoft Query that we then perform other calculations on automatically. Recently we upgraded our ERP system, but management made the decision to leave the transactional history in the old databases to have a clean one going forward in the new system. I still need to have some "rolling 12 months" graphs, but if I use only the old database, I'm missing new data and if I use only the new, I'm missing the last 11 months data.
Is there a way that I can write a query in Excel to pull data from the old database's PartTran table and merge it with the new database's PartTran table without user intervention each time? For instance, I don't want my users (if possible) to need two queries whose results they copy and paste into one Excel table. The schemas of the tables (at least the columns I need) are identically named and defined.
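If both databases happen to live on the same server, the simplest route may be to point Microsoft Query at one UNION ALL statement instead of two queries. A sketch, assuming SQL Server-style three-part names; the database and column names here are illustrative:

-- One result set spanning both histories; adjust names to your ERP.
SELECT PartNum, TranDate, TranQty FROM OldERP.dbo.PartTran
UNION ALL
SELECT PartNum, TranDate, TranQty FROM NewERP.dbo.PartTran;

If they are on different servers, you would need something like a linked server, or the Excel-side approach below.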
If you want to take a bit of a fun, hacky Excel approach, you could do the "copy-paste" bit FOR your users behind the scenes. Given two similar tables OLD and NEW with structures
+-----+------+-------+------------+
| id | foo | bar | date |
+-----+------+-------+------------+
| 95 | blah | $25 | 2015-06-01 |
| 96 | bork | $12 | 2015-07-01 |
| 97 | bump | $200 | 2015-08-01 |
| 98 | fizz | | 2015-09-01 |
| 99 | buzz | $50 | 2015-10-01 |
| 100 | char | ($1) | 2015-11-01 |
| 101 | mope | | 2015-12-01 |
+-----+------+-------+------------+
and
+----+-----+-------+------------+------+---------+
| id | foo | bar | date | fizz | buzz |
+----+-----+-------+------------+------+---------+
| 1 | cat | ($10) | 2016-01-01 | 285B | 1110111 |
| 2 | dog | $25 | 2016-02-01 | 27F5 | 1110100 |
| 3 | ant | $100 | 2016-03-01 | 1F91 | 1001111 |
+----+-----+-------+------------+------+---------+
... you can union together the data for these two datasets with some prudent Excel wizardry, as below:
Your UNION table (named using Alt+J+T+A) should have the following items:
New natural ID
DataSet pointer ( name of old or new table )
Derived ID from original dataset
Columns of data you want from Old & New DataSets
example:
+---------+------------+------------+----+------+-----+------------+------+------+
| UnionId | SourceName | SourceRank | id | foo | bar | date | fizz | buzz |
+---------+------------+------------+----+------+-----+------------+------+------+
| 1 | OLD | | | | | | | |
| 2 | NEW | | | | | | | |
+---------+------------+------------+----+------+-----+------------+------+------+
You will then make judicious use of INDIRECT() and VLOOKUP() to derive the lookup id and column targets. Sample code below.
SourceRank - helper column
=COUNTIFS([SourceName],[#SourceName],[UnionId],"<="&[#UnionId])
id - the id from the original DataSet
=SMALL(INDIRECT([#SourceName]&"[id]"),[#SourceRank])
Everything else is just VLOOKUP madness! I've taken the liberty of copying the sample code below for reference.
foo =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[foo]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
bar =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[bar]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
date =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[date]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
fizz =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[fizz]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
buzz =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[buzz]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
Output
You'll likely want to make prudent use of IF() and/or IFERROR() to help your users ignore references to the new columns for rows sourced from the old table, as well as rows that do not yet have data. Without that, however, you'll end up with something like the below.
This is ready to accept and read new inputs to both the OLD and NEW datasets, and it is sortable to get rid of those pesky placeholder rows...
Hope this helps! Happy coding!

Running a Quartz job in Grails?

I'm very new to Grails. I have a table like this:
+----+---------+----------------+----------------+-------------+--------------------+--------------+--------+---------------------+
| id | version | card_exp_month | card_exp_year | card_number | card_security_code | name_on_card | txn_id | date_created |
+----+---------+----------------+----------------+-------------+--------------------+--------------+--------+---------------------+
| 9 | 0 | ASdsadsd | Asdsadsadasdas | Asdsa | | batman | asd | 2012-08-13 19:38:22 |
+----+---------+----------------+----------------+-------------+--------------------+--------------+--------+---------------------+
In MySQL. I wish to run a Quartz job against this table that compares the date_created timestamp with the present time, such that any row older than 30 minutes is deleted.
How can I do this?
You could define a Job implementing your logic (in the execute() method, check whether now - date_created exceeds 30 minutes and, if so, delete the row from the database) and then trigger this job on a regular basis.
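Inside execute(), the check-and-delete can also be a single statement; a sketch, assuming the table shown above is named payment_card (the name is illustrative):

-- Delete every row whose date_created is more than 30 minutes old.
DELETE FROM payment_card
WHERE date_created < NOW() - INTERVAL 30 MINUTE;

Scheduling the job every few minutes then keeps only rows younger than 30 minutes.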
You can read the documentation at http://quartz-scheduler.org/documentation/quartz-2.1.x/cookbook or have a look at the examples: http://svn.terracotta.org/svn/quartz/branches/quartz-2.2.x/examples/src/main/java/org/quartz/examples/example1/
Check this example of Quartz in Grails:
http://www.juancarlosfernandez.net/2012/02/como-crear-un-proceso-quartz-en-grails.html