Is implementing an enrichment using Spark with MySQL a bad idea? - mysql

I am trying to build one wide schema that makes it easier for data users to query. To achieve that, streaming events have to be joined with User Metadata on USER_ID and ID. In data engineering, this operation is called "data enrichment", right? The tables below are an example.
# `Event` (Stream)
+---------+--------------+---------------------+
| USER_ID | EVENT        | TIMESTAMP           |
+---------+--------------+---------------------+
| 1       | page_view    | 2020-04-10T12:00:11 |
| 2       | button_click | 2020-04-10T12:01:23 |
| 3       | page_view    | 2020-04-10T12:01:44 |
+---------+--------------+---------------------+
# `User Metadata` (Static)
+----+-------+--------+
| ID | NAME  | GENDER |
+----+-------+--------+
| 1  | Matt  | MALE   |
| 2  | John  | MALE   |
| 3  | Alice | FEMALE |
+----+-------+--------+
==> # Result
+---------+--------------+---------------------+-------+--------+
| USER_ID | EVENT        | TIMESTAMP           | NAME  | GENDER |
+---------+--------------+---------------------+-------+--------+
| 1       | page_view    | 2020-04-10T12:00:11 | Matt  | MALE   |
| 2       | button_click | 2020-04-10T12:01:23 | John  | MALE   |
| 3       | page_view    | 2020-04-10T12:01:44 | Alice | FEMALE |
+---------+--------------+---------------------+-------+--------+
I was developing this using Spark, with the User Metadata stored in MySQL, and then I realized it would waste Spark's parallelism if the Spark code joined directly against MySQL tables, right?
I guess MySQL would become the bottleneck as traffic increases.
Should I copy those tables into a key-value store and update them periodically?
Can you give me some ideas for tackling this problem? How do you usually handle this type of operation?
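For illustration, the enrichment described above boils down to a stream-static join; a minimal PySpark sketch might look like the following, where the JDBC URL, Kafka topic, table/column names, and credentials are placeholders based on the example tables:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("user-enrichment").getOrCreate()

# Static side: User Metadata loaded from MySQL over JDBC (placeholder URL/credentials).
users = (spark.read.format("jdbc")
         .option("url", "jdbc:mysql://mysql-host:3306/appdb")
         .option("dbtable", "user_metadata")
         .option("user", "reader").option("password", "secret")
         .load())

# Streaming side: events arriving on a Kafka topic as JSON payloads.
event_schema = StructType([
    StructField("USER_ID", IntegerType()),
    StructField("EVENT", StringType()),
    StructField("TIMESTAMP", StringType()),
])
events = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "events")
          .load()
          .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

# The enrichment: a stream-static join, broadcasting the small metadata table
# so each executor joins locally instead of shuffling.
enriched = (events.join(F.broadcast(users), events.USER_ID == users.ID, "left")
            .drop("ID"))

query = enriched.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
The broadcast keeps the join local to each executor; the open question is how to keep the MySQL side fresh without putting MySQL on the hot path, which the solutions below address.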

Solution 1:
As you suggested, you can keep a local key-value cache copy of the User Metadata table and refresh that cache at a regular interval.
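A minimal sketch of that idea, assuming a plain in-process dict reloaded from MySQL via mysql-connector-python at a fixed interval; the host, credentials, and the user_metadata table/column names follow the example above or are placeholders:
import time
import mysql.connector

class UserMetadataCache:
    """Local key-value copy of user_metadata, reloaded at a fixed interval."""

    def __init__(self, refresh_seconds=300):
        self.refresh_seconds = refresh_seconds
        self._cache = {}
        self._loaded_at = 0.0

    def _reload(self):
        conn = mysql.connector.connect(host="mysql-host", user="reader",
                                       password="secret", database="appdb")
        try:
            cur = conn.cursor()
            cur.execute("SELECT ID, NAME, GENDER FROM user_metadata")
            self._cache = {row[0]: {"NAME": row[1], "GENDER": row[2]} for row in cur}
        finally:
            conn.close()
        self._loaded_at = time.time()

    def lookup(self, user_id):
        # Refresh lazily when the local copy is older than the configured interval.
        if time.time() - self._loaded_at > self.refresh_seconds:
            self._reload()
        return self._cache.get(user_id)

# Usage inside the enrichment step:
# cache = UserMetadataCache()
# meta = cache.lookup(event["USER_ID"])   # -> {"NAME": ..., "GENDER": ...} or None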
Solution 2:
You can use a MySQL-to-Kafka connector such as Debezium:
https://debezium.io/documentation/reference/1.1/connectors/mysql.html
For every DML or ALTER TABLE operation on your User Metadata table, a corresponding event is published to a Kafka topic (e.g. db_events). You can run a parallel thread in your Spark streaming job that polls db_events and updates your local key-value cache.
This would make your application near-real-time in the true sense.
One overhead I can see is that you will need to run a Kafka Connect service with the MySQL connector (i.e. Debezium) as a plugin.
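A rough sketch of that polling thread using kafka-python; the topic name (db_events) follows the example above, while the Debezium envelope fields (payload.op, payload.before, payload.after) and the exact column casing depend on your connector configuration, so treat them as assumptions:
import json
import threading
from kafka import KafkaConsumer

user_cache = {}          # shared key-value cache: ID -> {"NAME": ..., "GENDER": ...}
cache_lock = threading.Lock()

def follow_user_metadata_changes():
    # Debezium publishes one change event per row; 'd' = delete, 'c'/'u'/'r' = upsert-style.
    consumer = KafkaConsumer(
        "db_events",
        bootstrap_servers="kafka:9092",
        group_id="user-metadata-cache",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
    )
    for message in consumer:
        if message.value is None:          # tombstone record, nothing to apply
            continue
        payload = message.value.get("payload", {})
        op, before, after = payload.get("op"), payload.get("before"), payload.get("after")
        with cache_lock:
            if op == "d" and before:
                user_cache.pop(before["ID"], None)
            elif after:
                user_cache[after["ID"]] = {"NAME": after["NAME"], "GENDER": after["GENDER"]}

# Start alongside the Spark streaming job so lookups hit the in-memory copy.
threading.Thread(target=follow_user_metadata_changes, daemon=True).start()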


How does Foundry Magritte append ingestion handle deleted rows in the data source?

If I have a Magritte ingestion that is set to append, will it detect if rows are deleted in the source data? Will it also delete the rows in the ingested dataset?
For your first question, on whether deletions are detected: this will depend on the database implementation you are extracting from (I'll assume JDBC for this answer). If a delete shows up as a modification, and therefore as a new row, then yes, your deletes will show up.
This would look something like the following at first:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1       | 1   | CREATE      | 0         |
| key_2       | 2   | CREATE      | 0         |
| key_3       | 3   | CREATE      | 0         |
Followed by some updates (inside a subsequent run, incremental on update_ts):
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1       | 1   | UPDATE      | 1         |
| key_2       | 2   | UPDATE      | 1         |
Now your database would have to explicitly mark any DELETE rows and increment the update_ts for this to be brought in:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1       | 1   | DELETE      | 2         |
After this, you would then be able to detect the deleted records and adjust accordingly. Your full materialized table view will now look like the following:
| primary_key | val | update_type | update_ts |
|-------------|-----|-------------|-----------|
| key_1       | 1   | CREATE      | 0         |
| key_2       | 2   | CREATE      | 0         |
| key_3       | 3   | CREATE      | 0         |
| key_1       | 1   | UPDATE      | 1         |
| key_2       | 2   | UPDATE      | 1         |
| key_1       | 1   | DELETE      | 2         |
If you are running incrementally in your raw ingestion, these rows will not be automatically deleted from your dataset; you'll have to explicitly write logic to detect these deleted records and remove them in your output clean step. If these deletes are found, you'll have to SNAPSHOT the output to remove them (unless you're doing lower-level file manipulations, where you could perhaps remove the underlying file).
It's worth noting you'll want to materialize the DELETES as late as possible (assuming your intermediate logic allows for it) since this will require a snapshot and will kill your overall pipeline performance.
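To make that clean-step logic concrete, here is a rough PySpark sketch (plain Spark, not any Foundry-specific API) that collapses the materialized history shown above to its current state, keeping only the latest row per primary_key and dropping keys whose latest change is a DELETE:
from pyspark.sql import functions as F, Window

def latest_without_deletes(raw_df):
    """Collapse the append-only history to current state, dropping deleted keys."""
    # Rank each key's history so the most recent change comes first.
    w = Window.partitionBy("primary_key").orderBy(F.col("update_ts").desc())
    latest = (raw_df
              .withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") == 1)
              .drop("rn"))
    # A key whose latest change is a DELETE no longer exists in the source.
    return latest.filter(F.col("update_type") != "DELETE")
Writing this view out still requires the SNAPSHOT mentioned above whenever deletes appear.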
If you aren't dealing with JDBC, then @Kellen's answer will apply.
If this is a file-based ingestion (as opposed to JDBC), Magritte ingestion operates on files, not on rows. If your transaction type for the ingestion is set to UPDATE and you make changes to the file, including deleting rows, then when the ingestion runs the new file will completely replace the existing file in that dataset, so any changes made in the file will be reflected in the dataset.
Two additional notes:
If you have the 'exclude files already synced' filter, you will probably want to have the 'last modified date' and/or 'file size' options enabled, or the modified file won't be ingested.
If your transaction type is set to APPEND and not UPDATE then the ingestion will fail because APPEND doesn't allow changes to existing files.

How to create a table for a user in a Database

I am creating a mobile application. In this app, I have created a Login and a Register activity. I have also created an online database using AWS (Amazon Web Services) to store all of the users' login information upon registering.
In my database, I have a table called 'users'. This table holds the following fields: "fname", "lname", "username", "password". This part works and successfully stores data from my phone in the database.
For example:
| fname | lname | username | password |
| ----- | ----- | -------- | -------- |
| john  | doe   | jhon123  | 1234     |
Inside the app, I have an option where the user may click on "Start Log", which will record start and end values from a SeekBar.
How can I create a table under the user who is logged in?
(Essentially, I want to be able to create multiple tables under a user.)
For example, this table should be under the user "john123":
| servo | Start | End |
| ----- | ----- | --- |
| 1     | 21    | 30  |
| 2     | 30    | 11  |
| 3     | 50    | 41  |
| 4     | 0     | 15  |
I know it's a confusing question, but I am essentially just trying to have multiple tables linked to a user.
As to:
How to create a table for a user in a Database
Here are some GUI tools you might find useful:
MySQL - MySQL Workbench
PostgreSQL - pgAdmin
As for creating a separate table for each user, refer to @Polymath's answer. There is no benefit in creating separate tables for each user (you might as well use a JSON file).
What you should do is create a logs table that has a user_id attribute referencing the id in the users table.
------------------------------------------------------
| id | fname | lname | username | password            |
| -- | ----- | ----- | -------- | ------------------- |
| 1  | john  | doe   | jhon123  | encrypted(password) |
------------------------------------------------------
      |______
             |
             V
-----------------------------------------
| id | user_id | servo_id | start | end |
| -- | ------- | -------- | ----- | --- |
| 1  | 1       | 1        | 21    | 30  |
| 2  | 1       | 2        | 30    | 11  |
-----------------------------------------
You should also look into database normalization, as your "john123" table is not in 3NF. The servo should be decomposed out of the logs table if it will be logged by multiple users or multiple times (which I'm guessing is the case for you).
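A small self-contained sketch of that schema; sqlite3 is used here only so the example runs without a server (the same DDL translates almost directly to MySQL), and the logs table name, columns, and sample values simply mirror the diagram above:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE users (
        id       INTEGER PRIMARY KEY,
        fname    TEXT,
        lname    TEXT,
        username TEXT UNIQUE,
        password TEXT            -- store a hash, never the raw password
    )""")
conn.execute("""
    CREATE TABLE logs (
        id       INTEGER PRIMARY KEY,
        user_id  INTEGER NOT NULL REFERENCES users(id),
        servo_id INTEGER,
        "start"  INTEGER,
        "end"    INTEGER
    )""")

conn.execute("INSERT INTO users (fname, lname, username, password) VALUES (?, ?, ?, ?)",
             ("john", "doe", "jhon123", "encrypted(password)"))
user_id = conn.execute("SELECT id FROM users WHERE username = ?", ("jhon123",)).fetchone()[0]
conn.executemany('INSERT INTO logs (user_id, servo_id, "start", "end") VALUES (?, ?, ?, ?)',
                 [(user_id, 1, 21, 30), (user_id, 2, 30, 11)])

# "All logs for the logged-in user" is just a filter, not a separate table:
rows = conn.execute('SELECT servo_id, "start", "end" FROM logs WHERE user_id = ?',
                    (user_id,)).fetchall()
print(rows)   # [(1, 21, 30), (2, 30, 11)]
The point is that "a table per user" becomes a WHERE user_id = ? filter on one shared table.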
In reading this, I wonder if your design is right. It sounds like you are trying to create a table for each user, and I wonder how scalable it is to have a unique table per user. If you scale to millions of users, you will have millions of tables to manage and will probably need a separate index table to find the right table for each user. Why a table for each? Why not a single table with the UserID as a key value? You can extract the data just by filtering on the UserID.
SELECT * FROM UsersData WHERE UserID = ? ORDER BY DateTime;
However I will leave that for you to ponder.
You mentioned that this is a mobile app. I think what you need to do is look at AWS federated access and Cognito, which will allow you to identify a user using federated identities. Pass the user's unique ID plus temporary (one-use) credentials linked to an access role. Combined this way, you can scale to millions of users with full authentication without managing millions of accounts.
RL

What's the best way to get MySQL data into date-related groupings without crushing our db?

I have a few tables related to an app of ours in a database, and the data needs to be lumped into buckets to help compare one source against another.
For example, we have an app install table with a source and a timestamp.
Then we have an uninstall table with an app ID.
We need to be able to get the data into groupings of "0-7", "7-14", "15-30", "30-60" days of age.
Then, from there, select the number of uninstalls that happen in a similar fashion: first week, second week, second half of the month, second month.
It's not so bad if we only have 50-100k installs; however, when we throw app activity into the mix, to see whether that bucket did a certain action, our actions table is in the millions, and the world ends.
Is there a way we can do this with MySQL, or is it just not practical?
It almost seems easier to set up a server-side script to process each row individually into a rollup table.
Install:
| App ID | Timestamp           | Source   |
|--------|---------------------|----------|
| foo-1  | 2015-11-23 03:49:12 | Google   |
| foo-2  | 2015-12-23 03:49:12 | Facebook |
| foo-3  | 2015-12-31 01:10:01 | Google   |
Purchase:
| App ID | Timestamp           | Amount |
|--------|---------------------|--------|
| foo-1  | 2015-11-26 05:49:12 | $10.00 |
| foo-1  | 2015-12-27 03:49:12 | $5.00  |
Uninstall:
| App ID | Timestamp           |
|--------|---------------------|
| foo-2  | 2015-12-15 05:49:12 |
Report: (FP = First Purchase, U = Uninstall)
| Source   | Total Installs | FP 0-14d | FP 15-30d | FP 30-60d | U 0-14d | U 15-30d |
|----------|----------------|----------|-----------|-----------|---------|----------|
| Google   | 2              | 1        | -         | -         | -       | -        |
| Facebook | 1              | -        | -         | -         | 1       | -        |
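For what it's worth, one way to express the report above in a single pass is conditional aggregation; the sketch below runs it through mysql-connector-python. The table and column names (install, purchase, uninstall, app_id, timestamp, source), the bucket boundaries, and the connection details are assumptions based on the examples, it assumes at most one uninstall row per app ID, and whether it is fast enough still depends on indexes on app_id and timestamp:
import mysql.connector

# Bucket first purchases and uninstalls by age in days relative to the install.
# MySQL treats the BETWEEN conditions as 1/0, so SUM() counts matching installs;
# adjust the bucket edges (e.g. 31-60 vs 30-60) to taste.
ROLLUP_SQL = """
SELECT i.source,
       COUNT(*)                                                        AS total_installs,
       SUM(DATEDIFF(fp.first_purchase, i.timestamp) BETWEEN 0 AND 14)  AS fp_0_14d,
       SUM(DATEDIFF(fp.first_purchase, i.timestamp) BETWEEN 15 AND 30) AS fp_15_30d,
       SUM(DATEDIFF(fp.first_purchase, i.timestamp) BETWEEN 31 AND 60) AS fp_30_60d,
       SUM(DATEDIFF(u.timestamp, i.timestamp) BETWEEN 0 AND 14)        AS u_0_14d,
       SUM(DATEDIFF(u.timestamp, i.timestamp) BETWEEN 15 AND 30)       AS u_15_30d
FROM install i
LEFT JOIN (SELECT app_id, MIN(timestamp) AS first_purchase
           FROM purchase GROUP BY app_id) fp ON fp.app_id = i.app_id
LEFT JOIN uninstall u ON u.app_id = i.app_id
GROUP BY i.source
"""

conn = mysql.connector.connect(host="db-host", user="report",
                               password="secret", database="appdb")
cur = conn.cursor()
cur.execute(ROLLUP_SQL)
for row in cur:
    print(row)   # one row per source, matching the report layout above
conn.close()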

MySQL Many-to-Many Query

I am currently in the process of converting the player-saving features of a game's multiplayer engine into an SQL database, for integration with a webpage to display/modify/sell characters. The original system stored all data in text files, which was an awful way of dealing with this data as it was tied to the game only. The text files stored the user's username, password, ID, and player data, allowing for only one character. I have separated this into tables and can successfully save and load character data. The tables I have are quite large, so for example purposes I will use the following:
account:
+----+----------+----------+
| ID | Username | Password |
+----+----------+----------+
| 1  | Player1  | 123456   | (Secure passwords much?)
| 2  | Player2  | password | (These are actually hashed in the real db)
+----+----------+----------+
account_character:
+------------+--------------+
| Account_ID | Character_ID |
+------------+--------------+
| 1          | 1            |
| 1          | 2            |
| 2          | 3            |
+------------+--------------+
character:
+----+-----------+-----------+-----------+--------+--------+
| ID | PositionX | PositionY | PositionZ | Gender | Energy | etc....
+----+-----------+-----------+-----------+--------+--------+
| 1  | 100       | 150       | 1         | 1      | 100    |
| 2  | 30        | 90        | 0         | 1      | 100    |
| 3  | 420       | 210       | 2         | 0      | 53.5   |
+----+-----------+-----------+-----------+--------+--------+
These tables are linked using relationships.
What I have so far: the user logs in, and the server queries their username and matches the password. If the passwords match, the server begins to load the character data based on the ID loaded from the account during login.
This is where I am stuck. I have successfully done this through phpMyAdmin using the SQL command interface, but as it was around 4 AM, I was tired and accidentally closed the tab that contained the command before I saved it. I have tried to replicate it, but I simply cannot obtain the data I require with the query.
I've recently completed a course in databases at college and got a distinction, but for the life of me I cannot get this to work again... I have followed tutorials, but as the situations usually differ from mine, I cannot apply them until I understand them. I know I'm going to kick myself once I have a working command.
Tl;dr - I wish to query all character data linked to an account using the account's 'ID'.
I think this should work:
SELECT
    *
FROM
    account_character ac
    INNER JOIN account a ON ac.Account_ID = a.ID
    INNER JOIN `character` c ON ac.Character_ID = c.ID
WHERE
    a.Username = ? AND
    a.Password = ?
;
We start by joining together all the relevant tables, and then filter to get characters just for the current user.
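If it helps, this is roughly how a parameterized version of that query looks from application code using mysql-connector-python (which uses %s placeholders rather than ?); the connection details are placeholders, and the select is narrowed to just the character columns:
import mysql.connector

QUERY = """
SELECT c.*
FROM account_character ac
INNER JOIN account a     ON ac.Account_ID   = a.ID
INNER JOIN `character` c ON ac.Character_ID = c.ID
WHERE a.Username = %s AND a.Password = %s
"""

conn = mysql.connector.connect(host="db-host", user="game", password="secret",
                               database="gamedb")
cur = conn.cursor(dictionary=True)
cur.execute(QUERY, ("Player1", "hashed-password-here"))
characters = cur.fetchall()      # one dict per character row linked to this account
conn.close()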

Running a Quartz job in Grails?

I'm very new to Grails. I have a table like this:
+----+---------+----------------+----------------+-------------+--------------------+--------------+--------+---------------------+
| id | version | card_exp_month | card_exp_year  | card_number | card_security_code | name_on_card | txn_id | date_created        |
+----+---------+----------------+----------------+-------------+--------------------+--------------+--------+---------------------+
| 9  | 0       | ASdsadsd       | Asdsadsadasdas | Asdsa       |                    | batman       | asd    | 2012-08-13 19:38:22 |
+----+---------+----------------+----------------+-------------+--------------------+--------------+--------+---------------------+
It's in MySQL. I wish to run a Quartz job against this table which compares the date_created timestamp with the present time, so that any row older than 30 minutes gets deleted.
How can I do this?
You could define a Job implementing your logic (in the execute() method, check whether now minus date_created exceeds 30 minutes and, if so, delete the row from the database) and then trigger this job on a regular basis.
You can read the documentation at http://quartz-scheduler.org/documentation/quartz-2.1.x/cookbook or have a look at the examples: http://svn.terracotta.org/svn/quartz/branches/quartz-2.2.x/examples/src/main/java/org/quartz/examples/example1/
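The execute() body boils down to a single SQL statement; here is that core logic sketched in Python with mysql-connector-python rather than Groovy, purely for illustration. The table name payment_card and the connection details are stand-ins, and it assumes the intent is to purge rows older than 30 minutes:
import mysql.connector

def purge_stale_rows():
    """Delete rows whose date_created is more than 30 minutes in the past."""
    conn = mysql.connector.connect(host="db-host", user="app", password="secret",
                                   database="appdb")
    try:
        cur = conn.cursor()
        cur.execute(
            "DELETE FROM payment_card "
            "WHERE date_created < NOW() - INTERVAL 30 MINUTE"
        )
        conn.commit()
        return cur.rowcount   # number of rows removed this run
    finally:
        conn.close()

# A Quartz trigger would invoke the equivalent of this on a schedule,
# e.g. every few minutes.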
Check this example for Grails Quartz (in Spanish):
http://www.juancarlosfernandez.net/2012/02/como-crear-un-proceso-quartz-en-grails.html