MySQL to Redis - Import and Model

I'm thinking of using Redis to cache some user data snapshots in order to speed up access to that data (one reason being that my MySQL tables suffer from lock contention), and I'm looking for the best way to import, in one step, a table like this (which may contain from a few records to millions of records):
mysql> select * from mytable where snapshot = 1133;
+------+--------------------------+----------------+-------------------+-----------+----------+
| id   | email                    | name           | surname           | operation | snapshot |
+------+--------------------------+----------------+-------------------+-----------+----------+
| 2989 | example-2989#example.com | fake-name-2989 | fake-surname-2989 |         2 |     1133 |
| 2990 | example-2990#example.com | fake-name-2990 | fake-surname-2990 |        10 |     1133 |
| 2992 | example-2992#example.com | fake-name-2992 | fake-surname-2992 |         5 |     1133 |
| 2993 | example-2993#example.com | fake-name-2993 | fake-surname-2993 |         5 |     1133 |
| 2994 | example-2994#example.com | fake-name-2994 | fake-surname-2994 |         9 |     1133 |
| 2995 | example-2995#example.com | fake-name-2995 | fake-surname-2995 |         7 |     1133 |
| 2996 | example-2996#example.com | fake-name-2996 | fake-surname-2996 |         1 |     1133 |
+------+--------------------------+----------------+-------------------+-----------+----------+
into the Redis key-value store.
I can have many "snapshots" to load into Redis, and the basic access pattern is (SQL-like syntax)
select * from mytable where snapshot = ? and id = ?
These snapshots can also come from other tables, so the globally unique ID per snapshot is the snapshot column, e.g.:
mysql> select * from my_other_table where snapshot = 1134;
+------+--------------------------+----------------+-------------------+-----------+----------+
| id   | email                    | name           | surname           | operation | snapshot |
+------+--------------------------+----------------+-------------------+-----------+----------+
| 2989 | example-2989#example.com | fake-name-2989 | fake-surname-2989 |         1 |     1134 |
| 2990 | example-2990#example.com | fake-name-2990 | fake-surname-2990 |         8 |     1134 |
| 2552 | example-2552#example.com | fake-name-2552 | fake-surname-2552 |         5 |     1134 |
+------+--------------------------+----------------+-------------------+-----------+----------+
The snapshots loaded into Redis never change; they are available only for a week via TTL.
Is there a way to load this kind of data (rows and columns) into Redis in one step, combining redis-cli --pipe and HMSET?
What is the best model to use in Redis to store/get this data (given the access pattern)?
I have found redis-cli --pipe Redis Mass Insertion (and also MySQL to Redis in One Step), but I can't figure out the best way to achieve my requirements (load all rows/columns from MySQL in one step, and the best Redis model for this) using HMSET.
Thanks in advance
Cristian.

Model
To be able to query your data from Redis the same way as:
select * from mytable where snapshot = ?
select * from mytable where id = ?
You'll need the model below.
Note: select * from mytable where snapshot = ? and id = ? does not make a lot of sense here, since it's the same as select * from mytable where id = ?.
Key type and naming
[Key type]  [Key name pattern]
HASH        d:{id}
ZSET        d:ByInsertionDate
SET         d:BySnapshot:{snapshot_id}
Note: I used d: as a namespace, but you may want to replace it with the name of your domain model.
Data insertion
Insert a new row from MySQL into Redis:
hmset d:2989 id 2989 email example-2989#example.com name fake-name-2989 ... snapshot 1134
zadd d:ByInsertionDate {current_timestamp} d:2989
sadd d:BySnapshot:1134 d:2989
Another example:
hmset d:2990 id 2990 email example-2990#example.com name fake-name-2990 ... snapshot 1134
zadd d:ByInsertionDate {current_timestamp} d:2990
sadd d:BySnapshot:1134 d:2990
Cron
Here is the algorithm that must be run each day or week depending on your requirements:
for key_name in redis(ZRANGEBYSCORE d:ByInsertionDate -inf {timestamp_one_week_ago})
    // retrieve the snapshot id from the hash d:{id}
    val snapshot_id = redis(hget {key_name} snapshot)
    // remove the hash (d:{id})
    redis(del {key_name})
    // remove the hash's entry from its snapshot set
    redis(srem d:BySnapshot:{snapshot_id} {key_name})
// finally, clean the zset of expired entries
redis(zremrangebyscore d:ByInsertionDate -inf {timestamp_one_week_ago})
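If you prefer to drive this cleanup from application code, here is a minimal Python sketch of the same algorithm, assuming the redis-py client and the key names above (the connection details are placeholders):

import time
import redis

r = redis.Redis(host="localhost", port=6379)  # placeholder connection details

one_week_ago = int(time.time()) - 7 * 24 * 3600

# every hash inserted more than a week ago
for key_name in r.zrangebyscore("d:ByInsertionDate", "-inf", one_week_ago):
    # retrieve the snapshot id stored inside the hash
    snapshot_id = r.hget(key_name, "snapshot")
    # remove the hash itself (d:{id})
    r.delete(key_name)
    # remove the hash's reference from its snapshot set
    if snapshot_id is not None:
        r.srem("d:BySnapshot:" + snapshot_id.decode(), key_name)

# finally, drop the expired entries from the zset
r.zremrangebyscore("d:ByInsertionDate", "-inf", one_week_ago)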
Usage
select * from my_other_table where snapshot = 1134; will be either:
{snapshot_id} = 1134
for key_name in redis(smembers d:BySnapshot:{snapshot_id})
    print(redis(hgetall {key_name}))
or you can write a Lua script to do this directly on the Redis side. Finally:
select * from my_other_table where id = 2989; will be:
{id} = 2989
print(redis(hgetall d:{id}))
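For illustration, the read side could look like this minimal Python sketch (again assuming redis-py; the pipeline just batches the HGETALL calls into one round trip):

import redis

r = redis.Redis()  # placeholder connection details

def load_snapshot(snapshot_id):
    # select * from ... where snapshot = ?
    key_names = r.smembers("d:BySnapshot:%s" % snapshot_id)
    pipe = r.pipeline()
    for key_name in key_names:
        pipe.hgetall(key_name)
    return pipe.execute()  # one dict per row

def load_row(row_id):
    # select * from ... where id = ?
    return r.hgetall("d:%s" % row_id)

print(load_snapshot(1134))
print(load_row(2989))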
Import
This part is quite easy: just read the table and follow the model above. Depending on your requirements you may want to import all (or part) of your data with an hourly/daily/weekly cron, as sketched below.
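To make the redis-cli --pipe part concrete, here is a rough, untested Python sketch that reads one snapshot from MySQL and prints the corresponding Redis protocol to stdout, so it can be piped into redis-cli --pipe. It assumes the pymysql driver and placeholder connection details; any MySQL driver would do.

import sys
import time
import pymysql  # assumption: any driver with a dict cursor works the same way

def gen_redis_proto(*args):
    # encode one command in the Redis protocol expected by redis-cli --pipe
    # (assumes ASCII data; use byte lengths if your data is not)
    proto = "*%d\r\n" % len(args)
    for arg in args:
        arg = str(arg)
        proto += "$%d\r\n%s\r\n" % (len(arg), arg)
    return proto

conn = pymysql.connect(host="localhost", user="user", password="secret",
                       db="mydb", cursorclass=pymysql.cursors.DictCursor)

snapshot = 1133            # hardcoded here for brevity
now = int(time.time())

with conn.cursor() as cur:
    cur.execute("SELECT * FROM mytable WHERE snapshot = %s", (snapshot,))
    for row in cur.fetchall():
        key = "d:%s" % row["id"]
        # HMSET d:{id} field1 value1 field2 value2 ...
        fields = [item for pair in row.items() for item in pair]
        sys.stdout.write(gen_redis_proto("HMSET", key, *fields))
        sys.stdout.write(gen_redis_proto("ZADD", "d:ByInsertionDate", now, key))
        sys.stdout.write(gen_redis_proto("SADD", "d:BySnapshot:%s" % snapshot, key))

Run it as something like python export_snapshot.py | redis-cli --pipe (the script name is hypothetical). This way the whole snapshot is loaded in a single pass over the table and a single pipe into Redis.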

Related

Implementing an enrichment using Spark with MySQL is bad idea?

I am trying to build one giant schema that makes it easier for data users to query. In order to achieve that, streaming events have to be joined with User Metadata by USER_ID and ID. In data engineering, this operation is called "data enrichment", right? The tables below are an example.
# `Event` (Stream)
+---------+--------------+---------------------+
| UERR_ID | EVENT        | TIMESTAMP           |
+---------+--------------+---------------------+
|       1 | page_view    | 2020-04-10T12:00:11 |
|       2 | button_click | 2020-04-10T12:01:23 |
|       3 | page_view    | 2020-04-10T12:01:44 |
+---------+--------------+---------------------+
# `User Metadata` (Static)
+----+-------+--------+
| ID | NAME  | GENDER |
+----+-------+--------+
|  1 | Matt  | MALE   |
|  2 | John  | MALE   |
|  3 | Alice | FEMALE |
+----+-------+--------+
==> # Result
+---------+--------------+---------------------+-------+--------+
| UERR_ID | EVENT        | TIMESTAMP           | NAME  | GENDER |
+---------+--------------+---------------------+-------+--------+
|       1 | page_view    | 2020-04-10T12:00:11 | Matt  | MALE   |
|       2 | button_click | 2020-04-10T12:01:23 | John  | MALE   |
|       3 | page_view    | 2020-04-10T12:01:44 | Alice | FEMALE |
+---------+--------------+---------------------+-------+--------+
I was developing this using Spark, and the User Metadata is stored in MySQL; then I realized it would be a waste of Spark's parallelism if the Spark code joined against the MySQL tables directly, right?
The bottleneck will happen on MySQL if traffic increases, I guess.
Should I store those tables in a key-value store and update them periodically?
Can you give me some ideas to tackle this problem? How do you usually handle this type of operation?
Solution 1:
As you suggested, you can keep a local cached copy of the table as key-value pairs and update the cache at a regular interval.
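If the metadata table is small enough, one way to realize that cached copy inside Spark itself is to load it once over JDBC and broadcast it into the join, so the hot path never queries MySQL per event. A rough PySpark sketch, with placeholder connection options and a static stand-in for the event stream:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("enrichment").getOrCreate()

# Load the small, static User Metadata table once via JDBC and cache it.
# Refresh it on whatever interval you need.
user_metadata = (spark.read.format("jdbc")
                 .option("url", "jdbc:mysql://mysql-host/mydb")  # placeholder
                 .option("dbtable", "user_metadata")             # placeholder
                 .option("user", "user")
                 .option("password", "secret")
                 .load()
                 .cache())

# Stand-in for the event stream; in the real job this is your streaming DataFrame.
events = spark.createDataFrame(
    [(1, "page_view", "2020-04-10T12:00:11"),
     (2, "button_click", "2020-04-10T12:01:23")],
    ["UERR_ID", "EVENT", "TIMESTAMP"])

# broadcast() ships the lookup table to every executor,
# so the join does not shuffle and never hits MySQL per event.
enriched = events.join(broadcast(user_metadata),
                       events["UERR_ID"] == user_metadata["ID"], "left")
enriched.show()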
Solution 2:
You can use a MySQL-to-Kafka connector such as the one below:
https://debezium.io/documentation/reference/1.1/connectors/mysql.html
For every DML or table-alter operation on your User Metadata table, a corresponding event is fired to a Kafka topic (e.g. db_events). You can run a thread in parallel in your Spark streaming job which polls db_events and updates your local key-value cache.
This would make your application a near-real-time application in the true sense.
One overhead I can see is that you will need to run a Kafka Connect service with the MySQL connector (i.e. Debezium) as a plugin.
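Here is a rough sketch of such a cache-refresh thread, assuming the kafka-python client and a db_events topic carrying Debezium change events (topic name, broker address and message layout are placeholders; Debezium's real messages are an envelope with more fields than shown here):

import json
import threading
from kafka import KafkaConsumer  # assumption: kafka-python client

user_cache = {}            # user id -> metadata row, shared with the enrichment job
cache_lock = threading.Lock()

def follow_db_events():
    consumer = KafkaConsumer("db_events",                      # placeholder topic
                             bootstrap_servers="broker:9092",  # placeholder broker
                             value_deserializer=lambda v: json.loads(v.decode("utf-8")))
    for msg in consumer:
        change = msg.value
        # Debezium wraps the row change in an envelope; 'after' holds the new row state
        payload = change.get("payload", change)
        row = payload.get("after") or {}
        if "ID" in row:
            with cache_lock:
                user_cache[row["ID"]] = row

# run the poller next to the streaming job
threading.Thread(target=follow_db_events, daemon=True).start()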

getting the new row id from pySpark SQL write to remote mysql db (JDBC)

I am using pyspark-sql to create rows in a remote mysql db, using JDBC.
I have two tables, parent_table(id, value) and child_table(id, value, parent_id), so each row of parent_table may have as many rows in child_table associated with it as needed.
Now I want to create some new data and insert it into the database. I'm using the code guidelines here for the write operation, but I would like to be able to do something like:
parentDf = sc.parallelize([5, 6, 7]).toDF(('value',))
parentWithIdDf = parentDf.write.mode('append') \
    .format("jdbc") \
    .option("url", "jdbc:mysql://" + host_name + "/" + db_name) \
    .option("dbtable", table_name) \
    .option("user", user_name) \
    .option("password", password_str) \
    .save()
# The assignment at the previous line is wrong, as pyspark.sql.DataFrameWriter#save doesn't return anything.
I would like a way for the last line of code above to return a DataFrame with the new row ids for each row so I can do
childDf = parentWithIdDf.flatMap(lambda x: [[8, x[0]], [9, x[0]]])
childDf.write.mode('append')...
meaning that at the end I would have in my remote database:
parent_table
______________
| id | value |
______________
|  1 |     5 |
|  2 |     6 |
|  3 |     7 |
______________
child_table
__________________________
| id | value | parent_id |
__________________________
|  1 |     8 |         1 |
|  2 |     9 |         1 |
|  3 |     8 |         2 |
|  4 |     9 |         2 |
|  5 |     8 |         3 |
|  6 |     9 |         3 |
__________________________
As I've written in the first code snippet above, pyspark.sql.DataFrameWriter#save doesn't return anything (looking at its documentation), so how can I achieve this?
Am I doing something completely wrong? It looks like there is no way to get data back from a Spark action (which save is), while I would like to use this action as a transformation, which leads me to think I may be approaching all of this in the wrong way.
A simple answer is to use a timestamp + auto-increment number to create a unique ID. This only works if there is only one server running at any instant in time.
:)
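A sketch of that idea is below: the linking key is generated on the Spark side (with UUIDs here rather than timestamp + counter) so parent rows can be tied to their children before anything is written, and nothing needs to come back from save(). It assumes you add an application-generated key column to both tables (app_key / parent_app_key are hypothetical names), and the JDBC options are placeholders.

import uuid
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, array, lit

spark = SparkSession.builder.appName("parent-child").getOrCreate()

# Generate the linking key on the driver so it is fixed before anything is written.
parent_rows = [(str(uuid.uuid4()), v) for v in [5, 6, 7]]
parent_df = spark.createDataFrame(parent_rows, ["app_key", "value"])

jdbc_opts = {"url": "jdbc:mysql://host/db",  # placeholders
             "user": "user", "password": "secret"}

parent_df.write.mode("append").format("jdbc") \
    .options(dbtable="parent_table", **jdbc_opts).save()

# The children reference app_key instead of MySQL's auto-increment id,
# so no value has to come back from the save() call above.
child_df = parent_df.select(col("app_key").alias("parent_app_key"),
                            explode(array(lit(8), lit(9))).alias("value"))

child_df.write.mode("append").format("jdbc") \
    .options(dbtable="child_table", **jdbc_opts).save()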

Script to combine multiple MySQL records into one via summing

I'm a MySQL newbie, but I'm sure there must be a way to do this. I've been looking through StackOverflow for quite a while, though, and haven't found it yet.
I have a MySQL table that is generated from a multi-reducer Hadoop MapReduce job which is analyzing log files. The table is being used in the database that supports a Ruby-on-Rails app, and it looks like this:
+----+-----+------+---------+-----------+
| id | src | dest | time    | requests  |
+----+-----+------+---------+-----------+
|  0 | abc | xyz  | 1000000 | 200000000 |
|  1 | def | uvw  |      10 |       300 |
|  2 | abc | xyz  |  100000 |    200000 |
|  3 | def | xyz  |    1000 |     40000 |
|  4 | abc | uvw  |     100 |      5000 |
|  5 | def | xyz  |   10000 |    100000 |
+----+-----+------+---------+-----------+
I'm trying to coalesce the rows which have the same src and dest, summing their time and requests values, but I just can't figure out how to do it even after searching through the MySQL 5.1 documentation.
I'm trying to write a script which I could run to obtain something like this at the end (neither the order of the rows nor the id column is important):
+----+-----+------+---------+-----------+
| id | src | dest | time    | requests  |
+----+-----+------+---------+-----------+
|  6 | abc | xyz  | 1100000 | 200200000 |
|  7 | def | uvw  |      10 |       300 |
|  8 | abc | uvw  |     100 |      5000 |
|  9 | def | xyz  |   11000 |    140000 |
+----+-----+------+---------+-----------+
Any ideas on how I could figure this out?
You can't really combine the rows in a single table -- at least not easily. That would require both updates and deletes.
So, just create another table:
create table summary_t as
select src, dest, sum(time) as time, sum(requests) as requests
from t
group by src, dest;
If you really want this to go back into the original table, then use a temporary table and re-insert the data:
create temporary table summary_t as
select src, dest, sum(time) as time, sum(requests) as requests
from t
group by src, dest;
truncate table t;
insert into t(src, dest, time, requests)
select src, dest, time, requests
from summary_t;
However, having said all that, you should just add another step to your map-reduce application to do that final summary.
GROUP BY with the SUM aggregate should work:
select src, dest, sum(`time`) as `time`, sum(requests) as requests
from yourtable
group by src, dest
Check if this suits your needs: create a table with the columns src and dest as the primary key, and other fields like totaltime and totalrequest.
Then create an AFTER INSERT trigger on the existing table which updates the other table's totaltime and totalrequest with (old + new), using src and dest as the key in the WHERE condition.

Display ID sharing 2 values from same attribute

I am trying to get the eNum (employee number) of whoever masters 2 values (MySQL and Python) from the same attribute column. The closest I get is down below, but the eNum is duplicated; I want to get each eNum only once. I think I am messing it up in the WHERE clause... I don't know...
mysql> select * from employee_expert;
+------+---------+
| eNum | package |
+------+---------+
| E246 | Excel   |
| E246 | MySQL   |
| E246 | Python  |
| E246 | Word    |
| E403 | Jave    |
| E403 | MySQL   |
| E892 | Excel   |
| E892 | PHP     |
| E892 | Python  |
+------+---------+
mysql> SELECT eNum, package
FROM employee_expert
WHERE (package = 'MySQL' OR package = 'Python') AND (package = 'MySQL' OR package = 'Python')
GROUP BY package;
+------+---------+
| eNum | package |
+------+---------+
| E246 | MySQL   |
| E246 | Python  |
+------+---------+
The WHERE clause contains an unnecessary duplication of the condition package = 'MySQL' OR package = 'Python'; using WHERE (package = 'MySQL' OR package = 'Python') once is enough. Or, to make it more readable, you can write WHERE package IN ('MySQL', 'Python').
Your query selects the employees that know 'MySQL' or 'Python' or both.
It looks like you want to select the employees that know both 'MySQL' and 'Python'. You need to use a self-join for this purpose:
SELECT f.eNum
FROM employee_expert f # 'f' from 'first'
INNER JOIN employee_expert s USING(eNum) # 's' from 'second'
WHERE f.package = 'MySQL'
AND s.package = 'Python'
Unfortunately, this approach does not scale very well if you need to find by a larger set of languages. A better approach would be to use the original query and group the results by eNum like this:
SELECT eNum, COUNT(DISTINCT package) AS nbLangs
FROM employee_expert
WHERE package IN ('MySQL', 'Python') # <------------------------------------+
GROUP BY eNum # Make one entry for each employee |
HAVING nbLangs = 2 # Replace '2' with the number of items in this list --+
This query counts the number of known languages for all the employees that know at least one of the languages in the list, then keeps only those that know all of them.
I think the problem is in the design itself. Come to think of it, an employee can master MANY packages and a package can be mastered by MANY employees; it's a many-to-many relationship. In database terms, that will produce a table, employee_package for example, whose primary key is composed of the primary keys of the two tables:
+------+------------+
| eNum | package_id |
+------+------------+
| E246 |          1 |
| E246 |          2 |
| E246 |          3 |
| E892 |          1 |
+------+------------+
then your query will be something like:
-- let's say that id 1 is for MySQL and id 2 is for Python
SELECT e.eNum
FROM employees e
JOIN employee_package ep ON ep.eNum = e.eNum
WHERE ep.package_id IN (1, 2)
GROUP BY e.eNum
HAVING COUNT(DISTINCT ep.package_id) = 2

MySQL Many-to-Many Query

I am currently in the process of converting the player-saving features of a game's multiplayer engine into an SQL database, for the integration of a webpage to display/modify/sell characters. The original system stored all data in text files, which was an awful way of dealing with this data as it was tied to the game only. Within the text files the user's Username, Password, ID, and player data were stored, allowing for only one character. I have separated this into tables and can successfully save and load character data. The tables I have are quite large, so for example purposes I will use the following:
account:
+----+----------+----------+
| ID | Username | Password |
+----+----------+----------+
|  1 | Player1  | 123456   | (Secure passwords much?)
|  2 | Player2  | password | (These are actually hashed in the real db)
+----+----------+----------+
account_character:
+------------+--------------+
| Account_ID | Character_ID |
+------------+--------------+
|          1 |            1 |
|          1 |            2 |
|          2 |            3 |
+------------+--------------+
character:
+----+-----------+-----------+-----------+--------+--------+
| ID | PositionX | PositionY | PositionZ | Gender | Energy | etc....
+----+-----------+-----------+-----------+--------+--------+
|  1 |       100 |       150 |         1 |      1 |    100 |
|  2 |        30 |        90 |         0 |      1 |    100 |
|  3 |       420 |       210 |         2 |      0 |   53.5 |
+----+-----------+-----------+-----------+--------+--------+
These tables are linked using relationships.
What I have so far is: the user logs in and the server queries their username and matches the password. If the passwords match, the server begins to load the character data based on the account ID obtained during login.
This is where I am stuck. I have successfully done this through phpmyadmin using the SQL command interface, but as it was around 4AM I was tired and accidentally closed the tab that contained the command before I saved it. I have tried to replicate this but I simply cannot obtain the data I require in the query.
I've recently completed a course in databases at college and got a distinction, but for the life of me I cannot get this to work again... I have followed tutorials but as the situations usually differ from mine I cannot apply them until I understand them. I know I'm going to kick myself once I have a working command.
Tl;dr - I wish to query all character data linked to an account using the account's 'ID'.
I think this should work:
SELECT
    *
FROM
    account_character ac
    INNER JOIN account a ON ac.Account_ID = a.ID
    INNER JOIN `character` c ON ac.Character_ID = c.ID
WHERE
    a.Username = ? AND
    a.Password = ?
;
We start by joining together all the relevant tables, and then filter to get characters just for the current user.
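For reference, here is how the two ? placeholders would typically be bound from application code; a minimal sketch assuming the pymysql driver (any DB-API driver looks much the same, using %s placeholders; selecting c.* so only character columns come back):

import pymysql

# placeholder connection details
conn = pymysql.connect(host="localhost", user="user", password="secret",
                       db="gamedb", cursorclass=pymysql.cursors.DictCursor)

def load_characters(username, password):
    # return every character row linked to the given account
    sql = """
        SELECT c.*
        FROM account_character ac
        INNER JOIN account a ON ac.Account_ID = a.ID
        INNER JOIN `character` c ON ac.Character_ID = c.ID
        WHERE a.Username = %s AND a.Password = %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (username, password))
        return cur.fetchall()

print(load_characters("Player1", "123456"))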