MySQL products storing - architecture - mysql

I need to lay down architecture for app. It's designed for selling products.
System is going to accept about 30-40k of new products daily.
It will lead to creation of new records in table product.
System should keep history of prices. User should be able to see how price of product A got changed during last year.
So I have two options:
When a new product comes, I move (copy and remove) old products to another table. Let's name it product_history . So product table contains ONLY products which are being sold at the moment. As result I will need to rewrite queries because row from product can be either in table product or in table product_history (If client wants to see history of sales, statistics, etc).
Nothing gets removed. I keep old products lying in the same table and just mark them as old with some attribute ("is_old"). The new records are indexed by Redis.
Solution 2 makes code easier but I fear that table can large too grow.
Advantages are that there no copying of data. No messing with removing.
Solution 1 makes supporting the system higher. Active table product will always stay small. But playing always with two tables is harder than with one table.
One thing to note, not related to question, but it makes things a bit more complex: every product can have up to 12 different prices(probably more in the future). So the field price is stored as json and gets indexed by Redis already.
Which solution should bring less pain in the future? Which one would you pick?

My pick would be:
1. Go for option#2, it is cleaner, the migration part of option#1 adds to the write load & complexity.
Use an 'active' flag for each product item, which is 1 or true by default.
Partition the table on active flag, so that active items, which are the only queried items, lie in single partition, and inactive in other.
For pricing, do not store the json as a varchar/text, use a native JSON field ( mysql 5.7+), it would allow you richer querying within your JSON.
For redis sync, I would also suggest exploring debezium to stream mysql changes to redis.

Related

Database model for a 24/7 Staff roster at a casino

We presently use a pen/paper based roster to manage table games staff at the casino. Each row is an employee, each column is a 20 minute block of time and each cell represents what table the employee is assigned to, or alternatively they've been assigned to a break. The start and end time of shifts for employees vary as do the games/skills they can deal. We need to keep a copy of the rosters for 7 years, with paper this is fairly easy, I'm wanting to develop a digital application and am having difficulty how to store the data in a database for archiving.
I'm fairly new to working with databases, I think I understand how to model the data for a graph database like neo4j, but I had difficulty when it came to working with time. I've tried to learn about RDBMS databases like MySQL, below is how I think the data should be modelled. Please point out if I'm going in the wrong direction or if a different database type would be more appropriate, it would be greatly appreciated!
Basic Data
Here is some basic data to work with before we factor in scheduling/time.
Employee
- ID Number
- Name
- Skills (Blackjack, Baccarat, Roulette, etc)
Table
- ID Number
- Skill/Type (Can only be one skill)
It may be better to store the roster data as a file like JSON instead? Time sensitive data wouldn't be so much of a problem then. The benefit of going digital with a database would be queries, these could help assist time consuming tasks where human error is common.
Possible Queries
Note: Staff that are on shift are either on a break or on the floor (assigned to a table), Skills have a major or minor type based on difficulty to learn.
What staff have been on the floor for 80 minutes or more? (They are due for a break)
What open tables can I assign this employee to based on their skillset?
I need an employee that has Baccarat skill but is not already been assigned to a Baccarat table.
What employee(s) was on this table during this period of time?
Where was this employee at this point in time?
Who is on shift right now?
How many staff on shift can deal Blackjack?
How many staff have 3 major skills?
What staff have had the Baccarat skill for at least 3 months?
These queries could also be sorted by alphabetical order or time, skill etc.
I'm pretty sure I know how to perform these queries with cypher for neo4j provided I model the data right. I'm not as knowledgeable with SQL queries, I've read it can get a bit complicated depending on the query and structure.
----------------------------------------------------------------------------------------
MYSQL Specific
An employee table could contain properties such as their ID number and Name, but am I right that for their skills and shifts these would be separate tables that reference the employee by a unique integer(I think this is called a foreign key?).
Another table could store the gaming Tables, these would have their own ID and reference a skill/gametype with a foreign key.
To record data like the pen/paper roster, each day could have a table with columns starting from 0000 increasing by 20 in value going all the way to 2340? Prior to the time columns I could have one for staff where each employee is represented with their foreign key, the time columns would then have foreign keys to the assigned gaming Tables, the row data is bound to have many cells that aren't populated since the employee shift won't be 24/7. If I'm using foreign keys to reference gaming Tables I now have a problem when the employee is on break? Unless I treat say the first gaming Table entry as a break?
I may need to further complicate things though, management will over time try different gaming Table layouts, some of the gaming Tables can be converted from say Blackjack to Baccarat. this is bound to happen quite a bit over 7 years, would I want to be creating new gaming Table entries or add a column to use a foreign key and refer to a new table that stores the history of game types during periods of time? Employees will also learn to deal new games during their career, very rarely they may also have the skill removed.
----------------------------------------------------------------------------------------
Neo4j Specific
With this data would I have an Employee and a Table node that have "isA" relationship edges mapping to actual employees or tables?
I imagine with the skills for the two types I would be best with a Skill node and establish relationships like so?: Blackjack->isA->Skill, Employee->hasSkill->Blackjack, Table->typeIs->Blackjack?
TIME
I find difficulty when I want this database to now work with a timeline. I've come across the following suggestions for connecting nodes with time:
Unix Epoch seems to be a common recommendation?
Connecting nodes to a year/month/day graph?
Lucene timeline? (I don't know much about this or how to work with it, have seen some mention it)
And some cases with how time and data relate:
Staff have varied days and start/end times from week to week, this could be shift node with properties {shiftStart,shiftEnd,actualStart,actualEnd}, staff may arrive late or get sick during shift. Would this be the right way to link each shift to an employee? Employee(node)->Shifts(groupNode)->Shift(node)
Tables and Staff may have skill data modified, with archived data this could be an issue, I think the solution is to have time property on the relationship to the skill?
We open and close tables throughout the day, each table has open/close times for each day, this could change in a month depending on what management wants, in addition the times are not strict, for various reasons a manager may open or close tables during the shift. The open/closed status of a table node may only be relevant for queries during the shift, which confuses me as I'd want this for queries but for archiving with time it might not make sense?
It's with queries that I have trouble deciding when to use a node or add a property to a node. For an Employee they have a name and ID number, if I wanted to find an employee by their ID number would it be better to have that as a node of it's own? It would be more direct right, instead of going through all employees for that unique ID number.
I've also come across labels just recently, I can understand that those would be useful for typing employee and table nodes rather than grouping them under a node. With the shifts for an employee I think should continue to be grouped with a shifts node, If I were to do cypher queries for employees working shifts through a time period a label might be appropriate, however should it be applied to individual shift nodes or the shifts group node that links back to the employee? I might need to add a property to individual shift nodes or the relationship to the shifts group node? I'm not sure if there should be a shifts group node, I'm assuming that reducing the edges connecting to the employee node would be optimal for queries.
----------------------------------------------------------------------------------------
If there are any great resources I can learn about database development that'd be great, there is so much information and options out there it's difficult to know what to begin with. Thanks for your time :)
Thanks for spending the time to put a quality question together. Your requirements are great and your specifications of your system are very detailed. I was able to translate your specs into a graph data model for Neo4j. See below.
Above you'll see a fairly explanatory graph data model. In case you are unfamiliar with this, I suggest reading Graph Databases: http://graphdatabases.com/ -- This website you can get a free digital PDF copy of the book but in case you want to buy a hard copy you can find it on Amazon.
Let's break down the graph model in the image. At the top you'll see a time indexing structure that is (Year)->(Month)->(Day)->(Hour), which I have abbreviated as Y M D H. The ellipses indicate that the graph is continuing, but for the sake of space on the screen I've only showed a sub-graph.
This time index gives you a way to generate time series or ask certain questions on your data model that are time specific. Very useful.
The bottom portion of the image contains your enterprise data model for your casino. The nodes represent your business objects:
Game
Table
Employee
Skill
What's great about graph databases is that you can look at this image and semantically understand the language of your question by jumping from one node to another by their relationships.
Here is a Cypher query you can use to ask your questions about the data model. You can just tweak it slightly to match your questions.
MATCH (employee:Employee)-[:HAS_SKILL]->(skill:Skill),
(employee)<-[:DEALS]-(game:Game)-[:LOCATION]->(table:Table),
(game)-[:BEGINS]->(hour:H)<-[*]-(day:D)<-[*]-(month:M)<-[*]-(year:Y)
WHERE skill.type = "Blackjack" AND
day.day = 17 AND
month.month = 1 AND
year.year = 2014
RETURN employee, skill, game, table
The above query finds the sub-graph for all employees who have the skill Blackjack and their table and location on a specific date (1/17/14).
To do this in SQL would be very difficult. The next thing you need to think about is importing your data into a Neo4j database. If you're curious on how to do that please look at other questions here on SO and if you need more help, feel free to post another question or reach out to me on Twitter #kennybastani.
Cheers,
Kenny

Disregard changes to a product description when retrieving order records

The title is somewhat hard to understand, so here is the explanation:
I am building a system, that deals with retail transactions. Meaning - purchases. I have a database with products, where each product has an ID, that is also known to the POS system. When a customer makes a purchase, the data is sent to the back-end for parsing, and is saved. Now everything is fine and dandy, until there are changes to the products name, since my client wants to see the name of the product, as it was purchased then.
How do I save this data, while also keeping a nice, normal-formed database?
Solutions I could think of are:
De-normalization, where we correlate the incoming data with the info we have in the database, and then save only the final text values, not id's.
Versioning, where we keep multiple versions of every product, and save the transactions with the id of the products version, when it came in. The problem with this one is, that as our retail store chain grows, and there are more and more changes happening to the products, the complexity of the whole product will greatly increase.
Any thoughts on this?
This is called a slowly changing dimension.
Either solution that you mention works. My preference is the second, versioning. I would have a product table that has an effdate and enddate on the record. You can easily find the current record (where enddate is null) or the record at any point in time.
The first method always strikes me as more "quick-and-dirty", but it also works. It just gets cumbersome when you have more fields and more objects you are trying to track. It does, in general though, win on performance.
If the name has to be the name as it was originally, the easiest, simplest and most reliable way to do that is to save the name of the product in the invoice line item record.
You should still link to the product with a ProductID, of course.
If you want to keep a history of name changes, you can do that in a separate table if you wish:
ProductNameID
ProductID
Date
Description
And store a ProductNameID with the invoice line item.

Database Historization

We have a requirement in our application where we need to store references for later access.
Example: A user can commit an invoice at a time and all references(customer address, calculated amount of money, product descriptions) which this invoice contains and calculations should be stored over time.
We need to hold the references somehow but what if the e.g. the product name changes? So somehow we need to copy everything so its documented for later and not affected by changes in future. Even when products are deleted, they need to reviewed later when the invoice is stored.
What is the best practise here regarding database design? Even what is the most flexible approach e.g. when the user want to edit his invoice later and restore it from the db?
Thank you!
Here is one way to do it:
Essentially, we never modify or delete the existing data. We "modify" it by creating a new version. We "delete" it by setting the DELETED flag.
For example:
If product changes the price, we insert a new row into PRODUCT_VERSION while old orders are kept connected to the old PRODUCT_VERSION and the old price.
When buyer changes the address, we simply insert a new row in CUSTOMER_VERSION and link new orders to that, while keeping the old orders linked to the old version.
If product is deleted, we don't really delete it - we simply set the PRODUCT.DELETED flag, so all the orders historically made for that product stay in the database.
If customer is deleted (e.g. because (s)he requested to be unregistered), set the CUSTOMER.DELETED flag.
Caveats:
If product name needs to be unique, that can't be enforced declaratively in the model above. You'll either need to "promote" the NAME from PRODUCT_VERSION to PRODUCT, make it a key there and give-up ability to "evolve" product's name, or enforce uniqueness on only latest PRODUCT_VER (probably through triggers).
There is a potential problem with the customer's privacy. If a customer is deleted from the system, it may be desirable to physically remove its data from the database and just setting CUSTOMER.DELETED won't do that. If that's a concern, either blank-out the privacy-sensitive data in all the customer's versions, or alternatively disconnect existing orders from the real customer and reconnect them to a special "anonymous" customer, then physically delete all the customer versions.
This model uses a lot of identifying relationships. This leads to "fat" foreign keys and could be a bit of a storage problem since MySQL doesn't support leading-edge index compression (unlike, say, Oracle), but on the other hand InnoDB always clusters the data on PK and this clustering can be beneficial for performance. Also, JOINs are less necessary.
Equivalent model with non-identifying relationships and surrogate keys would look like this:
You could add a column in the product table indicating whether or not it is being sold. Then when the product is "deleted" you just set the flag so that it is no longer available as a new product, but you retain the data for future lookups.
To deal with name changes, you should be using ID's to refer to products rather than using the name directly.
You've opened up an eternal debate between the purist and practical approach.
From a normalization standpoint of your database, you "should" keep all the relevant data. In other words, say a product name changes, save the date of the change so that you could go back in time and rebuild your invoice with that product name, and all other data as it existed that day.
A "de"normalized approach is to view that invoice as a "moment in time", recording in the relevant tables data as it actually was that day. This approach lets you pull up that invoice without any dependancies at all, but you could never recreate that invoice from scratch.
The problem you're facing is, as I'm sure you know, a result of Database Normalization. One of the approaches to resolve this can be taken from Business Intelligence techniques - archiving the data ina de-normalized state in a Data Warehouse.
Normalized data:
Orders table
OrderId
CustomerId
Customers Table
CustomerId
Firstname
etc
Items table
ItemId
Itemname
ItemPrice
OrderDetails Table
ItemDetailId
OrderId
ItemId
ItemQty
etc
When queried and stored de-normalized, the data warehouse table looks like
OrderId
CustomerId
CustomerName
CustomerAddress
(other Customer Fields)
ItemDetailId
ItemId
ItemName
ItemPrice
(Other OrderDetail and Item Fields)
Typically, there is either some sort of scheduled job that pulls data from the normalized datas into the Data Warehouse on a scheduled basis, OR if your design allows, it could be done when an order reaches a certain status. (Such as shipped) It could be that the records are stored at each change of status (with a field called OrderStatus tacking the current status), so the fully de-normalized data is available for each step of the oprder/fulfillment process. When and how to archive the data into the warehouse will vary based on your needs.
There is a lot of overhead involved in the above, but the other common approach I'm aware of carries even MORE overhead.
The other approach would be to make the tables read-only. If a customer wants to change their address, you don't edit their existing address, you insert a new record.
So if my address is AddressId 12 when I first order on your site in Jamnuary, then I move on July 4, I get a new AddressId tied to my account. (Say AddressId 123123 because your site is very successful and has attracted a ton of customers.)
Orders I palced before July 4 would have AddressId 12 associated with them, and orders placed on or after July 4 have AddressId 123123.
Repeat that pattern with every table that needs to retain historical data.
I do have a third approach, but searching it is difficult. I use this in one app only, and it actually works out pretty well in this single instance, which had some pretty specific business needs for reconstructing the data exactly as it was at a specific point in time. I wouldn't use it unless I had similar business needs.
At a specific status, serialize the data into an Xml document, or some other document you can use to reconstruct the data. This allows you to save the data as it was at the time it was serialized, retaining original table structure and relaitons.
When you have time-sensitive data, you use things like the product and Customer tables as lookup tables and store the information directly in your Orders/orderdetails tables.
So the order table might contain the customer name and address, the details woudl contain all relevant information about the produtct including especially price(you never want to rely on the product table for price information beyond the intial lookup at teh time of the order).
This is NOT denormalizing, the data changes over time but you need the historical value, so you must store it at the time the record is created or you will lose data intergrity. You don't want your financial reports to suddenly indicate you sold 30% more last year because you have price updates. That's not what you sold.

Database structure - most common queries span 3-4 tables. Should I reduce tables?

I am creating a new DB in MySQL for an application and wondered if anyone could provide some advice on the following set up. I'll try and simplify things as best as I can.
This DB is designed to store alerts which are related to specific items created by a user. In turn there is the need to store notes related to the items and/or alerts. At first I considered the following structure...
USERS table - to store basic app user info (e.g. user_id. name, email) - this is the only bit I'm fairly certain does not need to be changed
ITEMS table: contains info on particular item (4 fields or so). Contains user_id to indicate which user created/owns this item
ALERTS table: contains info on the alert, item_id to indicate which item the alert is related to, contains user_id to indicate which user created alert
NOTES table: contains note info, user_id of note owner, item_id if associated with an item, alert_id if associated with alert
Relationships:
An item does not always have an an alert associated with it
An item or alert does not always have a note associated with it
An alert is always associated with an item. More than one alert can be associated with the same item.
A note is always associated with an item or alert. More than one note can be associated with the same item or alert.
Once first created item info is unlikely to be updated by a user.
For arguments sake let's say that each user will create an average of 10 items, each item will have an average of 2 alerts associated with it. There will be an average of 2 notes per item/alert.
Very common queries that will be run:
1) Return all items created by a particular user with any associated alerts and notes. Given a user_id this query would span 3 tables
2) Checking each day for alerts that need to be sent to a user's email address. WHERE alert date==today, return user's email address, item name and any associated notes. This would require a query spanning 4 tables which is why I'm wondering if I need to take a different approach...
Option 1) one table to cover items, alerts and notes. user_id owner for each row. Every time you add a note to an item or alert you are repeating the alert and/or item info. Seems a bit wasteful but item and alert info won't be large.
Option 2) I don't foresee the need to query notes (famous last words?) so how about serializing note data so multiple notes are stored in one row in either the item or alert table (or just a combined alert/item table)
Option 3) Anything else you can think of? I'm asking this question as each option I've considered doesn't feel quite right.
I appreciate this is currently a small project and so performance shouldn't be of great concern and I should just go with the 4 tables. It's more that my common queries will end up being relatively complex that makes me think I need to re-evaluate the structure.
I would say that the common wisdom is to normalize to start and denormalize only when performance data suggest that it's necessary.
Make sure that your tables are indexed properly, with foreign key relationships for JOINs.
If you think you'll end up with a lot of data, this might be a good time to think about a partitioning strategy. Partitioning your fast-growing tables by time would be a good first step.
Four tables is not complex. I commonly write report queries that hit 15 or more tables in a database structure that has hundreds of tables (most with millions of records) and I wouldn't even say our dbs are anything more than medium sized (a typical db in our system might have around 200 gigs of data, so not large at all as databases go). Because they are properly indexed, they still run fast unless I am doing very complex calculations. Normalize, don't even consider denormalizing until you are an experienced database designer who knows better than to worry about the number of tables.

mysql optimization: current & previous orders should be on different tables, or same table with a flag column?

I want to know what's the most optimized way to work with mysql.
I have a quite large database for orders. i have another table for previous orders, whenever the order is completed, it's erased from orders and is moved to previous orders.
should i keep using this mothod or put them all in the same table and add a column that flags if it's current or previous orders?
kfir
In general, moving data around tables is a sensitive process - data can easily get lost or corrupted. The question is how are you querying this table - if you search and filter through it on an often basis, you want to keep the table relatively small. If you only access it to read specific lines (using a direct primary key), the size of the table is less crucial, and then I would advise to keep a flag.
A third option you might want to consider, is having 3 tables - one for ongoing orders, one for historic orders, and one with the order details. That third table can be long and static, while you query the first two tables.
Lastly - at some point, you might want to move the historic data out of the table all together. Maybe you would keep a cron job that runs once a month and moves out data which is older than 6 months to a different, remote database.