The requirement
I am currently building a permissions system. One of the requirements is that it is horizontally scalable.
To achieve this, we have done the following:
There is a single "compiled resource permission" table that looks something like this:
| user_id | resource_id | reason |
| 1       | 1           | 1      |
| 1       | 2           | 3      |
| 2       | 1           | 2      |
The structure of this table denotes that user 1 has access to resources 1 & 2, and user 2 has access only to resource 1.
The "reason" column is a bitmask with bits switched on depending on why they have that permission. Bit 1 (value 1) denotes they are an admin, and bit 2 (value 2) denotes they created the resource.
So user 1 has access to resource 1 because they are an admin. He has access to resource 2 because he is an admin and he created that resource. If he were no longer an admin, he would still have access to resource 2 but not resource 1.
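For illustration, a hypothetical query against the table above that lists the permissions user 1 would keep if the admin bit were revoked everywhere (bit 1 = admin, bit 2 = creator):

-- Clear the admin bit and check whether any other reason bit is still set.
SELECT user_id, resource_id
FROM compiled_permissions
WHERE user_id = 1
  AND (reason & ~1) <> 0;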
To figure out what needs to go into this table, we use a "patcher" class that loops over the users & resources passed to it and looks at all the DB tables necessary to figure out which rows need adding to, and which need removing from, the table.
How we are trying to scale and the problem
To horizontally scale this, we split the logic into chunks and hand it to a number of "workers" on an async queue.
This only scales so far before it stops speeding up, and sometimes row locking occurs, which slows it down.
Is there a particular type of row lock we can use to allow it to scale indefinitely?
Are we approaching this from completely the wrong angle? We have a lot of "reasons" and a lot of complex permission logic that we need to be able to recompile fairly quickly.
SQL Queries that run concurrently, for reference
When we are "adding" reasons:
INSERT INTO `compiled_permissions` (`user_id`, `resource_id`, `reason`) VALUES (1,1,1), (1,2,3), (2,1,2) ON DUPLICATE KEY UPDATE `reason` = `reason` | VALUES(`reason`);
When we are "removing" reasons:
UPDATE `compiled_permissions`
SET `reason` = `reason` & ~(CASE
    WHEN user_id = 1 AND resource_id = 1 THEN 2
    -- ... a WHEN branch for every "reason" removal
    ELSE `reason`
END)
WHERE (`user_id`, `resource_id`) IN ((1,1), (1,2) /* ... etc. */);
Related
I have a table which contains a task list for persons. The following are its columns:
+---------+-----------+-------------------+------------+---------------------+
| task_id | person_id | task_name | status | due_date_time |
+---------+-----------+-------------------+------------+---------------------+
| 1 | 111 | walk 20 min daily | INCOMPLETE | 2017-04-13 17:20:23 |
| 2 | 111 | brisk walk 30 min | COMPLETE | 2017-03-14 20:20:54 |
| 3 | 111 | take medication | COMPLETE | 2017-04-20 15:15:23 |
| 4 | 222 | sport | COMPLETE | 2017-03-18 14:45:10 |
+---------+-----------+-------------------+------------+---------------------+
I want to find out the monthly compliance percentage (completed tasks / total tasks * 100) for each person, like:
+---------------+-----------+------------+------------+
| compliance_id | person_id | compliance | month |
+---------------+-----------+------------+------------+
| 1 | 111 | 100 | 2017-03-01 |
| 2 | 111 | 50 | 2017-04-01 |
| 3 | 222 | 100 | 2017-03-01 |
+---------------+-----------+------------+------------+
Here person_id 111 has 1 task in March 2017 (due 2017-03-14) and its status is COMPLETE; since 1 out of 1 tasks was completed in March, compliance is 100%.
Currently I am using a separate table which stores this compliance, but I have to recalculate the compliance and update that table every time a task's status changes.
I have also tried creating a view, but it takes too long to execute: almost 0.5 seconds for 1 million records.
CREATE VIEW `person_compliance_view` AS
    SELECT
        `t`.`person_id`,
        CAST((`t`.`due_date_time` - INTERVAL (DAYOFMONTH(`t`.`due_date_time`) - 1) DAY)
            AS DATE) AS `month`,
        COUNT(`t`.`status`) AS `total_count`,
        COUNT((CASE
            WHEN (`t`.`status` = 'COMPLETE') THEN 1
        END)) AS `completed_count`,
        CAST(((COUNT((CASE
            WHEN (`t`.`status` = 'COMPLETE') THEN 1
        END)) / COUNT(`t`.`status`)) * 100)
            AS DECIMAL (10, 2)) AS `compliance`
    FROM
        `task` `t`
    WHERE
        `t`.`isDeleted` = 0
            AND `t`.`due_date_time` < NOW()
    GROUP BY `t`.`person_id`, EXTRACT(YEAR_MONTH FROM `t`.`due_date_time`);
Is there any optimized way to do it?
The first question to consider is whether the view can be optimized to give the required performance. This may mean making some changes to the underlying tables and data structure. For example, you might want indexes and you should check query plans to see where they would be most effective.
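For example, a covering index along these lines might help the view's WHERE and GROUP BY (column names taken from your view; the GROUP BY is on an expression, so check EXPLAIN before and after rather than taking this on faith):

-- Candidate index for person_compliance_view; verify with EXPLAIN.
ALTER TABLE task
    ADD INDEX idx_task_compliance (isDeleted, person_id, due_date_time, status);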
Other possible changes which would improve efficiency include adding an extra column "year_month" to the base table, which you could populate via a trigger. Another possibility would be to move all the deleted tasks to an 'archive' table to give the view less data to search through.
Whatever you do, a view will always perform worse than a table (assuming the table has relevant indexes). So depending on your needs you may find you need to use a table. That doesn't mean you should junk your view entirely. For example, if a daily refresh of your table is sufficient, you could use your view to help:
truncate table compliance;
insert into compliance select * from compliance_view;
Truncate is more efficient than delete, but it can't be rolled back, so you might prefer to use delete and top-and-tail it with START TRANSACTION; ... COMMIT;. I've never created scheduled jobs in MySQL, but the Event Scheduler looks like a good starting point if you need one.
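In other words, the transactional variant of the refresh would look something like this (same table and view names as above):

-- DELETE instead of TRUNCATE so a failed reload can be rolled back;
-- TRUNCATE causes an implicit commit and cannot be rolled back.
START TRANSACTION;
DELETE FROM compliance;
INSERT INTO compliance SELECT * FROM compliance_view;
COMMIT;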
If daily isn't often enough, you could schedule this to run more often than daily, but better options would be triggers and/or "partial refreshes" (my term; I've no idea if there is a technical term for the idea).
A perfectly written trigger would spot any relevant insert/update/delete and then insert/update/delete the related records in the compliance table. The logic is a little daunting, and I won't attempt it here. An easier option would be a "partial refresh" called from within a trigger. The trigger would spot the user targeted by the change, delete only the records from compliance which relate to that user, and then insert from your compliance_view the records relating to that user. You should be able to put that into a stored procedure which is called by the trigger.
Update expanding on the options (if a view just won't do):
Option 1: Daily full (or more frequent) refresh via a schedule
You'd want code like this executed (at least) daily.
truncate table compliance;
insert into compliance select * from compliance_view;
Option 2: Partial refresh via trigger
I don't work with triggers often, so can't recall syntax, but the logic should be as follows (not actual code, just pseudo-code)
AFTER INSERT -- you may need one for each of INSERT / UPDATE / DELETE
FOR EACH ROW -- or if there are multiple rows and you can trigger only on the last one to be changed, that would be better
DELETE FROM compliance
WHERE person_id = INSERTED.person_id
INSERT INTO compliance select * from compliance_view where person_id = INSERTED.person_id
END
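For what it's worth, a rough MySQL rendering of that pseudo-code might look like the sketch below. It assumes the task table and person_compliance_view from the question, plus a compliance table whose compliance_id is auto-increment; it is untested, and you would want matching UPDATE and DELETE triggers (using OLD.person_id where appropriate) as well.

DELIMITER //
CREATE TRIGGER task_after_insert
AFTER INSERT ON task
FOR EACH ROW
BEGIN
    -- Rebuild only the rows belonging to the person whose task changed.
    DELETE FROM compliance
    WHERE person_id = NEW.person_id;

    INSERT INTO compliance (person_id, `month`, compliance)
    SELECT person_id, `month`, compliance
    FROM person_compliance_view
    WHERE person_id = NEW.person_id;
END//
DELIMITER ;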
Option 3: Smart update via trigger
This would be similar to option 2, but instead of deleting all the rows from compliance that relate to the relevant person_id and creating them from scratch, you'd work out which ones to update, update them, and work out whether any should be added or deleted. The logic is a little involved, and I'm not going to attempt it here.
Personally, I'd be most tempted by Option 2, but you'd need to combine it with option 1, since the data goes stale due to the use of now().
Here's a similar way of writing the same thing...
Views are of very limited benefit in MySQL, and I think should generally be avoided.
DROP TABLE IF EXISTS my_table;
CREATE TABLE my_table
(task_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY
,person_id INT NOT NULL
,task_name VARCHAR(30) NOT NULL
,status ENUM('INCOMPLETE','COMPLETE') NOT NULL
,due_date_time DATETIME NOT NULL
);
INSERT INTO my_table VALUES
(1,111,'walk 20 min daily','INCOMPLETE','2017-04-13 17:20:23'),
(2,111,'brisk walk 30 min','COMPLETE','2017-03-14 20:20:54'),
(3,111,'take medication','COMPLETE','2017-04-20 15:15:23'),
(4,222,'sport','COMPLETE','2017-03-18 14:45:10');
SELECT person_id
, DATE_FORMAT(due_date_time,'%Y-%m') yearmonth
, SUM(status = 'complete')/COUNT(*) x
FROM my_table
GROUP
BY person_id
, yearmonth;
person_id yearmonth x
111 2017-03 1.0
111 2017-04 0.5
222 2017-03 1.0
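If you need the figure as a percentage, as in the question, it's only a small change:

SELECT person_id
     , DATE_FORMAT(due_date_time,'%Y-%m') yearmonth
     , ROUND(SUM(status = 'COMPLETE')/COUNT(*) * 100, 2) compliance
  FROM my_table
 GROUP
    BY person_id
     , yearmonth;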
I'm coming from MySQL and trying to wrap my head around Redis. Some things were very obvious, but a couple have me stumped. How would you go about implementing such things in Redis?
First
I have a sort of first come/first served reservation system. When a user goes to a specific page, it queries the table below and returns the first badge where reservedby = 0, then updates reservedby with the user's id. If the user doesn't complete the process within 15 minutes, reservedby is reset to 0. If the user completes the process, I delete the row from the table and store the badge with the user data. Order is important: the higher on the list a badge is, the better, so if I were to remove it instead of somehow marking it reserved, it would need to go back in at the top if the process isn't completed within 15 minutes.
id | badge | reservedby
------------------------
240 | abc | 4249
241 | bbb | 0
242 | rrr | 0
Second
I have a set of data that doesn't change very often but is queried a lot. When a page loads, it populates a select box with each color, when you select a color, the corresponding sm and lg are displayed.
id | color | sm | lg
---------------------------
1 | blue | 1 | 5
2 | red | 3 | 10
3 | yellow | 7 | 8
Lastly
As far as storing user data goes, what I'm doing is INCR users, then taking that value and HMSET user:<INCR users value> badge "aab" joindate "10/30/2013" etc. Is that typically how it should be done?
In reverse order:
Yes, that's how you increment IDs in Redis; there is no automatic feature for it.
Depending on how frequently the table changes: if it's once a month, consider serving a static JSON file to the client and letting the client side handle the rest.
Consider using a ZSET, keeping the values unique and using the scores for ordering, or a LIST of JSON values.
IMHO your reservation system's internals kind of suck: I could easily reserve your site's offers simply by sending multiple HTTP requests. Other production sites with a limited offer let the user who pays first have the room; the reservation holds while the payment is being processed, completes on success, and reverts on failure.
I have 6 tables. These are simplified for this example.
user_items
ID | user_id | item_name | version
-------------------------------------
1 | 123 | test | 1
data
ID | name | version | info
----------------------------
1 | test | 1 | info
data_emails
ID | name | version | email_id
------------------------
1 | test | 1 | 1
2 | test | 1 | 2
emails
ID | email
-------------------
1 | email@address.com
2 | second@email.com
data_ips
ID | name | version | ip_id
----------------------------
1 | test | 1 | 1
2 | test | 1 | 2
ips
ID | ip
--------
1 | 1.2.3.4
2 | 2.3.4.5
What I am looking to achieve is the following.
The user (123) has the item with name 'test'. This is the basic information we need for a given entry.
There is data in our 'data' table, and the current version is 1, so the version in our user_items table is also 1. The two tables are linked together by the name and version. The setup is like this because a user could have an item for which we don't have data; likewise, there could be an item for which we have data but which no user owns.
For each item there are also 0 or more emails and ips associated. These can be the same for many items so rather than duplicate the actual email varchar over and over we have the data_emails and data_ips tables which link to the emails and ips table respectively based on the email_id/ip_id and the respective ID columns.
The emails and ips are associated with the data version again through the item name and version number.
My first question is: is this a good, well-optimized database setup?
My next question, and my main one, is about joining this complex data structure.
What i had was:
PHP
- get all the user items
- loop through them and get the most recent data entry (if any)
- if there is one get the respective emails
- get the respective ips
Does that count as 3 queries or essentially infinite depending on the number of user items?
I was made to believe that the above was inefficient and as such I wanted to condense my setup into using one query to get the same data.
I have achieved that with the following code
SELECT user_items.name,GROUP_CONCAT( emails.email SEPARATOR ',' ) as emails, x.ip
FROM user_items
JOIN data AS data ON (data.name = user_items.name AND data.version = user_items.version)
LEFT JOIN data_emails AS data_emails ON (data_emails.name = user_items.name AND data_emails.version = user_items.version)
LEFT JOIN emails AS emails ON (data_emails.email_id = emails.ID)
LEFT JOIN
(SELECT name,version,GROUP_CONCAT( the_ips.ip SEPARATOR ',' ) as ip FROM data_ips
LEFT JOIN ips as the_ips ON data_ips.ip_id = the_ips.ID )
x ON (x.name = data.name AND x.version = user_items.version)
I have done loads of reading to get to this point and worked tirelessly to get here.
This works as I require. This question seeks to clarify: what are the benefits of using this approach instead?
I have had to use a subquery (I believe?) to get the ips, as previously the results were being multiplied (I believe because of the complex joins). How this subquery works is, I suppose, my main confusion.
Summary of questions.
-Is my database well set up for my usage? Any improvements would be appreciated, and any useful resources to help me expand my knowledge would be great.
-How does the subquery in my SQL actually work? What is the query doing?
-Am I correct to keep using LEFT JOINs? I want to return the user item, and NULL values on the right where applicable.
-Am I essentially replacing a potentially infinite number of queries with 2? Does this make a REAL difference? Can the above be improved?
-Given that when I update a version of an item in my data table I now also have to update the version in the user_items table, I have a few more update queries to do. Is the trade-off of this setup worthwhile in practice?
Thanks to anyone who contributes to helping me get a better grasp of this !!
Given your data layout and your objective, the query is correct. If you've only got a small amount of data it shouldn't be a performance problem - that will change quickly as the amount of data grows. However, when you have a large amount of data there are very few circumstances where you should ever see all your data in one go, implying that the results will be filtered in some way. Exactly how they are filtered has a huge impact on the structure of the query.
How does the subquery in my sql actually work
Currently it doesn't work properly - there is no GROUP BY
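Roughly, the derived table x needs to aggregate per (name, version), along these lines (names as in the question; a sketch, not tested):

-- Without the GROUP BY, GROUP_CONCAT collapses everything into one row.
SELECT data_ips.name,
       data_ips.version,
       GROUP_CONCAT(the_ips.ip SEPARATOR ',') AS ip
FROM data_ips
LEFT JOIN ips AS the_ips ON data_ips.ip_id = the_ips.ID
GROUP BY data_ips.name, data_ips.version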
Is the tradeoff off of this setup in practice worthwhile?
No - it implies that your schema is too normalized.
Which of the following options, if any, is considered best practice when designing a table used to store user settings?
(OPTION 1)
USER_SETTINGS
-Id
-Code (example "Email_LimitMax")
-Value (example "5")
-UserId
(OPTION 2)
create a new table for each setting where, for example, notification settings would require you to create:
"USER_ALERT_SETTINGS"
-Id
-UserId
-EmailAdded (i.e true)
-EmailRemoved
-PasswordChanged
...
...
"USER_EMAIL_SETTINGS"
-Id
-UserId
-EmailLimitMax
....
(OPTION 3)
"USER"
-Name
...
-ConfigXML
Other answers have ably outlined the pros and cons of your various options.
I believe that your Option 1 (property bag) is the best overall design for most applications, especially if you build in some protections against the weaknesses of property bags.
(The ERD image is not reproduced here; the tables it shows, SETTING, ALLOWED_SETTING_VALUE and USER_SETTING, are described below.)
In the above ERD, the USER_SETTING table is very similar to OP's. The difference is that instead of varchar Code and Value columns, this design has a FK to a SETTING table which defines the allowable settings (Codes) and two mutually exclusive columns for the value. One option is a varchar field that can take any kind of user input, the other is a FK to a table of legal values.
The SETTING table also has a flag that indicates whether user settings should be defined by the FK or by unconstrained varchar input. You can also add a data_type to the SETTING to tell the system how to encode and interpret the USER_SETTING.unconstrained_value. If you like, you can also add the SETTING_GROUP table to help organize the various settings for user-maintenance.
This design allows you to table-drive the rules around what your settings are. This is convenient, flexible and easy to maintain, while avoiding a free-for-all.
EDIT: A few more details, including some examples...
Note that the ERD, above, has been augmented with more column details (range values on SETTING and columns on ALLOWED_SETTING_VALUE).
Here are some sample records for illustration.
SETTING:
+----+------------------+-------------+--------------+-----------+-----------+
| id | description | constrained | data_type | min_value | max_value |
+----+------------------+-------------+--------------+-----------+-----------+
| 10 | Favourite Colour | true | alphanumeric | {null} | {null} |
| 11 | Item Max Limit | false | integer | 0 | 9001 |
| 12 | Item Min Limit | false | integer | 0 | 9000 |
+----+------------------+-------------+--------------+-----------+-----------+
ALLOWED_SETTING_VALUE:
+-----+------------+--------------+-----------+
| id | setting_id | item_value | caption |
+-----+------------+--------------+-----------+
| 123 | 10 | #0000FF | Blue |
| 124 | 10 | #FFFF00 | Yellow |
| 125 | 10 | #FF00FF | Pink |
+-----+------------+--------------+-----------+
USER_SETTING:
+------+---------+------------+--------------------------+---------------------+
| id | user_id | setting_id | allowed_setting_value_id | unconstrained_value |
+------+---------+------------+--------------------------+---------------------+
| 5678 | 234 | 10 | 124 | {null} |
| 7890 | 234 | 11 | {null} | 100 |
| 8901 | 234 | 12 | {null} | 1 |
+------+---------+------------+--------------------------+---------------------+
From these tables, we can see that some of the user settings which can be determined are Favourite Colour, Item Max Limit and Item Min Limit. Favourite Colour is a pick list of alphanumerics. Item min and max limits are numerics with allowable range values set. The SETTING.constrained column determines whether users are picking from the related ALLOWED_SETTING_VALUEs or whether they need to enter a USER_SETTING.unconstrained_value. The GUI that allows users to work with their settings needs to understand which option to offer and how to enforce both the SETTING.data_type and the min_value and max_value limits, if they exist.
Using this design, you can table drive the allowable settings including enough metadata to enforce some rudimentary constraints/sanity checks on the values selected (or entered) by users.
EDIT: Example Query
Here is some sample SQL using the above data to list the setting values for a given user ID:
-- DDL and sample data population...
CREATE TABLE SETTING
(`id` int, `description` varchar(16)
, `constrained` varchar(5), `data_type` varchar(12)
, `min_value` varchar(6) NULL , `max_value` varchar(6) NULL)
;
INSERT INTO SETTING
(`id`, `description`, `constrained`, `data_type`, `min_value`, `max_value`)
VALUES
(10, 'Favourite Colour', 'true', 'alphanumeric', NULL, NULL),
(11, 'Item Max Limit', 'false', 'integer', '0', '9001'),
(12, 'Item Min Limit', 'false', 'integer', '0', '9000')
;
CREATE TABLE ALLOWED_SETTING_VALUE
(`id` int, `setting_id` int, `item_value` varchar(7)
, `caption` varchar(6))
;
INSERT INTO ALLOWED_SETTING_VALUE
(`id`, `setting_id`, `item_value`, `caption`)
VALUES
(123, 10, '#0000FF', 'Blue'),
(124, 10, '#FFFF00', 'Yellow'),
(125, 10, '#FF00FF', 'Pink')
;
CREATE TABLE USER_SETTING
(`id` int, `user_id` int, `setting_id` int
, `allowed_setting_value_id` varchar(6) NULL
, `unconstrained_value` varchar(6) NULL)
;
INSERT INTO USER_SETTING
(`id`, `user_id`, `setting_id`, `allowed_setting_value_id`, `unconstrained_value`)
VALUES
(5678, 234, 10, '124', NULL),
(7890, 234, 11, NULL, '100'),
(8901, 234, 12, NULL, '1')
;
And now the DML to extract a user's settings:
-- Show settings for a given user
select
US.user_id
, S1.description
, S1.data_type
, case when S1.constrained = 'true'
then AV.item_value
else US.unconstrained_value
end value
, AV.caption
from USER_SETTING US
inner join SETTING S1
on US.setting_id = S1.id
left outer join ALLOWED_SETTING_VALUE AV
on US.allowed_setting_value_id = AV.id
where US.user_id = 234
See this in SQL Fiddle.
Option 1 (as noted, "property bag") is easy to implement - very little up-front analysis. But it has a bunch of downsides.
If you want to restrain the valid values for UserSettings.Code, you need an auxiliary table for the list of valid tags. So you have either (a) no validation on UserSettings.Code, meaning your application code can dump any value in and you miss the chance to catch bugs, or (b) the extra maintenance of that list of valid tags.
UserSettings.Value probably has a string data type to accommodate all the different values that might go into it. So you have lost the true data type (integer, Boolean, float, etc.) and the data type checking that would be done by the RDBMS on insert of incorrect values. Again, you have bought yourself a potential QA problem. Even for string values, you have lost the ability to constrain the length of the column.
You cannot define a DEFAULT value on the column based on the Code. So if you wanted EmailLimitMax to default to 5, you can’t do it.
Similarly, you can’t put a CHECK constraint on the Values column to prevent invalid values.
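For contrast, both of those are one-liners with the named-column approach of Option 2 (table and column names from the question; the CHECK part needs MySQL 8.0.16 or later to actually be enforced, and the range shown is just an example):

-- Possible with a dedicated column, not with a generic Value column.
ALTER TABLE USER_EMAIL_SETTINGS
    MODIFY EmailLimitMax INT NOT NULL DEFAULT 5;
ALTER TABLE USER_EMAIL_SETTINGS
    ADD CONSTRAINT chk_email_limit_max CHECK (EmailLimitMax >= 0);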
The property bag approach loses validation of SQL code. In the named column approach, a query that says “select Blah from UserSettings where UserID = x” will get a SQL error if Blah does not exist. If the SELECT is in a stored procedure or view, you will get the error when you apply the proc/view – way before the time the code goes to production. In the property bag approach, you just get NULL. So you have lost another automatic QA feature provided by the database, and introduced a possible undetected bug.
As noted, a query to find a UserID where conditions apply on multiple tags becomes harder to write – it requires one join into the table for each condition being tested.
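For example, something as simple as "users who have EmailAdded alerts on and a limit above 5" ends up looking roughly like this with Option 1 (the Code values here are illustrative, not from a real schema):

-- One self-join per condition; note the CAST because Value is stored as a string.
SELECT a.UserId
FROM USER_SETTINGS a
JOIN USER_SETTINGS b ON b.UserId = a.UserId
WHERE a.Code = 'Email_LimitMax' AND CAST(a.Value AS UNSIGNED) > 5
  AND b.Code = 'EmailAdded'     AND b.Value = 'true';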
Unfortunately, the Property Bag is an invitation for application developers to just stick a new Code into the property bag without analysis of how it will be used in the rest of application. For a large application, this becomes a source of “hidden” properties because they are not formally modeled. It’s like doing your object model with pure tag-value instead of named attributes: it provides an escape valve, but you’re missing all the help the compiler would give you on strongly-typed, named attributes. Or like doing production XML with no schema validation.
The column-name approach is self-documenting. The list of columns in the table tells any developer what the possible user settings are.
I have used property bags; but only as an escape valve and I have often regretted it. I have never said “gee, I wish I had made that explicit column be a property bag.”
Consider this simple example.
If you have 2 tables, UserTable (which contains user details) and
SettingsTable (which contains settings details), then create a new table UserSettings relating UserTable and SettingsTable, as shown in the sketch below.
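Since the original diagram isn't reproduced here, a minimal sketch of that junction table might be (the column names are assumptions, not taken from the question):

-- Hypothetical UserSettings junction table; adjust names and types to your schema.
CREATE TABLE UserSettings (
    user_id    INT NOT NULL,          -- FK to UserTable
    setting_id INT NOT NULL,          -- FK to SettingsTable
    value      VARCHAR(255) NULL,     -- this user's value for the setting
    PRIMARY KEY (user_id, setting_id),
    FOREIGN KEY (user_id)    REFERENCES UserTable (id),
    FOREIGN KEY (setting_id) REFERENCES SettingsTable (id)
);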
Hope you will find the right solution from this example.
Each option has its place, and the choice depends on your specific situation. I am comparing the pros and cons for each option below:
Option 1: Pros:
Can handle many options
New options can easily be added
A generic interface can be developed to manage the options
Option 1: Cons
When a new option is added, it's more complex to update all user accounts with the new option
Option names can spiral out of control
Validation of allowed option values is more complex; additional metadata is needed for that
Option 2: Pros
Validation of each option is easier than option 1 since each option is an individual column
Option 2: Cons
A database update is required for each new option
With many options the database tables could become more difficult to use
It's hard to evaluate "best" because it depends on the kind of queries you want to run.
Option 1 (commonly known as "property bag", "name value pairs" or "entity-attribute-value" or EAV) makes it easy to store data whose schema you don't know in advance. However, it makes it hard - or sometimes impossible - to run common relational queries. For instance, imagine running the equivalent of
select count(*)
from USER_ALERT_SETTINGS
where EmailAdded = 1
and Email_LimitMax > 5
This would rapidly become very convoluted, especially because your database engine may not compare varchar fields in a numerically meaningful way (so "> 5" may not work the way you expect).
I'd work out the queries you want to run, and see which design supports those queries best. If all you have to do is check limits for an individual user, the property bag is fine. If you have to report across all users, it's probably not.
The same goes for JSON or XML - it's okay for storing individual records, but it makes querying or reporting over all users harder. For instance, imagine searching for the configuration settings for email address "bob@domain.com" - this would require searching through all the XML documents to find the "email address" node.
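In MySQL terms, that search against Option 3's ConfigXML column would look something like the following (the XPath is invented for illustration), and it can only be answered by parsing every row:

-- Full scan: every ConfigXML value has to be parsed to answer the question.
SELECT Name
FROM `USER`
WHERE ExtractValue(ConfigXML, '/settings/email/address') = 'bob@domain.com';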
For some reason, somebody told me never to delete any MySQL records. Just flag it with deleted.
For example, I'm building a "follow" social network, like Twitter.
+-------------+------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| user_id | int(11) | NO | | NULL | |
| to_user_id | int(11) | NO | | NULL | |
+-------------+------------+------+-----+---------+----------------+
User 1 follows User 2...
So if one user stops following someone, should I delete this record? Or should I create a column for is_deleted ?
This is a concept called "soft delete". Google that term to find more. But marking with a flag is only one option - you could also actually perform the delete, but have a trigger which stores a copy in a history table. This way you won't have to update all of your SELECT statements to specifically filter out the deleted records. Also, you won't have as much load on your table, since you won't have to scan through the additional records littering it.
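A minimal sketch of that trigger approach, assuming the table from the question is called follows and the history copy just adds a deleted_at timestamp:

-- History table: same shape as follows, plus when the row was removed.
CREATE TABLE follows_history LIKE follows;
ALTER TABLE follows_history ADD COLUMN deleted_at DATETIME NOT NULL;

DELIMITER //
CREATE TRIGGER follows_before_delete
BEFORE DELETE ON follows
FOR EACH ROW
BEGIN
    INSERT INTO follows_history (id, user_id, to_user_id, deleted_at)
    VALUES (OLD.id, OLD.user_id, OLD.to_user_id, NOW());
END//
DELIMITER ;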
Generalizing about the larger concept of "you should never delete records" would (and should) probably get this question closed as Not Constructive, but you've given a specific scenario:
User 1 follows User 2...
So if one user stops following someone, should I delete this record?
Or should I create a column for is_deleted ?
The answer in your case depends on whether, after an unfollow, you ever again need to know that User 1 followed User 2. Some made-up, possibly silly, examples where this might be the case:
if it was desirable to change the text User 1 sees when electing to follow User 2 from "Follow User 2" to "Follow User 2 again? Really? Didn't you learn your lesson?"
if you wanted to show User 2 a graph of who (or, in aggregate, how many) followers they've had over time
If you don't need functionality that relies on the past state of users following each other, then it's safe to delete the records. No need to take on the complexity of soft delete when you ain't gonna need it.
I wouldn't say "never delete any MySQL records". It depends. If you want to keep track of user interactions, you could do this with delete flags. You could even create a separate logging table which tracks each action like "follow" and "unfollow" with the appropriate user IDs and timestamps, which gives you more information in the end.
It's up to you and depends on which data you want to store. And please consider the privacy of your users. If they want their data explicitly deleted, then do so.
I have always been a fan of creating a blnDeleted field and using that instead of deleting a record. It is much easier to recover or add that data back in if you leave it in the database.
You may think you will never need the data again, but it is possible. Even for something as simple as tracking unsubscribes or something like that.