I've noticed two main approaches for storing and querying historical data in a database:
1. Making a carbon copy of the denormalized data as it stood on a specific date.
2. Keeping a version history for each table.
I'm wondering if it might work to keep an audit table of what was changed, in order to query what the data looked like at a given time:
For example, you have a company with many employees. Over time, employees come and go.
tbl_employee would have id and name
tbl_employee_audit would have id, employee_id, hired_or_left, and datetime
You would then have to take the current list of employees and step backwards through the audit table in order to get to a specific point in time. This would also take into account when someone leaves and then gets hired again.
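A minimal sketch of the replay idea, using SQLite and the table layout described above (the datetime column is renamed event_time here; the data and dates are made up for illustration). Rather than stepping backwards from the current list, this version replays the audit forwards and treats each employee's last event before the cutoff as decisive, which also handles the leave-then-rehired case:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_employee (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tbl_employee_audit (
    id INTEGER PRIMARY KEY,
    employee_id INTEGER REFERENCES tbl_employee(id),
    hired_or_left TEXT CHECK (hired_or_left IN ('hired', 'left')),
    event_time TEXT  -- ISO-8601 so string comparison sorts chronologically
);
""")
conn.executemany("INSERT INTO tbl_employee VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
conn.executemany("INSERT INTO tbl_employee_audit VALUES (?, ?, ?, ?)", [
    (1, 1, "hired", "2019-01-01"),
    (2, 2, "hired", "2019-03-01"),
    (3, 2, "left",  "2019-06-01"),
    (4, 2, "hired", "2019-09-01"),  # Bob leaves and is later re-hired
])

def employees_at(conn, point_in_time):
    """Who was on staff at point_in_time? The latest audit event per
    employee at or before that moment decides."""
    rows = conn.execute("""
        SELECT e.name
        FROM tbl_employee e
        JOIN tbl_employee_audit a ON a.employee_id = e.id
        WHERE a.event_time <= ?
        GROUP BY e.id
        -- included only if the latest event overall is a 'hired' event
        HAVING MAX(a.event_time) = MAX(CASE WHEN a.hired_or_left = 'hired'
                                            THEN a.event_time END)
        ORDER BY e.name
    """, (point_in_time,)).fetchall()
    return [r[0] for r in rows]

print(employees_at(conn, "2019-07-01"))  # ['Alice'] -- Bob had left
print(employees_at(conn, "2019-10-01"))  # ['Alice', 'Bob'] -- re-hired
```

The cost grows with the size of the audit table scanned per query, which is the processing concern raised below; an index on (employee_id, event_time) keeps it tolerable for moderate volumes.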
This is a pretty simple example, but with a more complex one do you think this will work? Would it be too taxing from a processing standpoint?
Related
I am working on a project where I will receive student data dumps once a month. The data will be imported into my system. The initial import will be around 7k records; after that, I don't anticipate more than a few hundred a month. However, there will also be existing records that are updated as the student changes grades, etc.
I am trying to determine the best way to keep track of what has been received, imported, and updated over time.
I was thinking of setting up a hosted MySQL database with a script that imports the SFTP dump into a table that includes creation_date and modification_date fields. My thought was that the person performing the extraction could connect to the MySQL db and run a query on the imported table each month to get the differences before the next extraction.
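That monthly diff could be sketched like this (SQLite standing in for MySQL; column names creation_date and modification_date as proposed above, everything else hypothetical), assuming the import script bumps modification_date on every update and you record when the last extraction ran:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE imported_students (
    student_id INTEGER PRIMARY KEY,
    grade TEXT,
    creation_date TEXT,      -- when the row was first imported
    modification_date TEXT   -- bumped on every update
);
""")
conn.executemany("INSERT INTO imported_students VALUES (?, ?, ?, ?)", [
    (1, "A", "2020-01-05", "2020-01-05"),  # unchanged since January
    (2, "B", "2020-01-05", "2020-02-03"),  # updated in February
    (3, "C", "2020-02-03", "2020-02-03"),  # new in February
])

last_extraction = "2020-02-01"

# New rows: created after the last extraction ran.
new_rows = conn.execute(
    "SELECT student_id FROM imported_students WHERE creation_date > ?",
    (last_extraction,)).fetchall()

# Updated rows: modified after the last extraction, created before it.
updated_rows = conn.execute(
    """SELECT student_id FROM imported_students
       WHERE modification_date > ? AND creation_date <= ?""",
    (last_extraction, last_extraction)).fetchall()

print([r[0] for r in new_rows])      # [3]
print([r[0] for r in updated_rows])  # [2]
```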
Another thought I had was to create a new received table each month for each data dump, and then query for the differences between them.
Note: The importing system is legacy and will accept imports using a utility and unique csv type files. So that probably rules out options like XML.
Thank you in advance for any advice.
I'm going to assume you're tracking students' grades in a course over time.
I would recommend a two table approach:
Table 1: transaction level data. Add-only. New information is simply appended on. Sammy got a 75 on this week's quiz, Beth did 5 points extra credit, etc. Each row is a single transaction. Presumably it has the student's name/id, the value being added, maybe the max possible value or some weighting factor, and of course the timestamp added.
All of this just keeps adding to a never-ending (in theory) table.
Table 2: summary table, rebuilt at some interval. This table does a simple aggregation on the first table, processing the transactional scores into a global one. Maybe it's a simple sum, maybe it's a weighted average, maybe you have something more complex in mind.
This table has one row per student (per course?). You want this to be rebuilt nightly. If you're lazy, you just DROP/CREATE/INSERT. If you're worried about data-loss, you just INSERT and add a timestamp so you can have snapshots going back.
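The two-table approach above can be sketched like this (SQLite standing in for MySQL; all table and column names are made up for illustration). The rebuild shown is the lazy wipe-and-reaggregate variant:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Table 1: add-only transaction log; rows are only ever appended.
CREATE TABLE score_events (
    id INTEGER PRIMARY KEY,
    student TEXT,
    points REAL,
    recorded_at TEXT
);
-- Table 2: summary, rebuilt at some interval from table 1.
CREATE TABLE score_summary (
    student TEXT PRIMARY KEY,
    total_points REAL
);
""")
conn.executemany(
    "INSERT INTO score_events (student, points, recorded_at) VALUES (?, ?, ?)",
    [
        ("Sammy", 75, "2020-03-02"),  # this week's quiz
        ("Beth",  80, "2020-03-02"),
        ("Beth",   5, "2020-03-03"),  # extra credit
    ])

def rebuild_summary(conn):
    """Nightly rebuild: wipe and re-aggregate from the event log.
    A plain SUM here; a weighted average would slot in the same way."""
    with conn:  # one transaction so readers never see a half-built table
        conn.execute("DELETE FROM score_summary")
        conn.execute("""
            INSERT INTO score_summary (student, total_points)
            SELECT student, SUM(points) FROM score_events GROUP BY student
        """)

rebuild_summary(conn)
print(conn.execute(
    "SELECT student, total_points FROM score_summary ORDER BY student"
).fetchall())  # [('Beth', 85.0), ('Sammy', 75.0)]
```

The snapshot variant mentioned above would instead INSERT summary rows tagged with a rebuild timestamp, so earlier snapshots survive.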
I need to lay down the architecture for an app. It's designed for selling products.
The system is going to accept about 30-40k new products daily.
This will lead to the creation of new records in the product table.
The system should keep a history of prices. A user should be able to see how the price of product A changed during the last year.
So I have two options:
When a new product comes in, I move (copy and remove) the old product to another table; let's name it product_history. The product table then contains ONLY products that are being sold at the moment. As a result I will need to rewrite queries, because a given row can be in either product or product_history (if a client wants to see sales history, statistics, etc.).
Nothing gets removed. I keep old products lying in the same table and just mark them as old with some attribute ("is_old"). The new records are indexed by Redis.
Solution 2 makes the code easier, but I fear the table can grow too large.
Its advantages are that there is no copying of data and no messing with removal.
Solution 1 makes supporting the system more work. The active product table will always stay small, but always juggling two tables is harder than working with one.
One thing to note, not related to the question, but it makes things a bit more complex: every product can have up to 12 different prices (probably more in the future). So the price field is stored as JSON and is already indexed by Redis.
Which solution should bring less pain in the future? Which one would you pick?
My pick would be:
1. Go for option #2; it is cleaner, and the migration part of option #1 adds to the write load and complexity.
2. Use an 'active' flag for each product item, which is 1 or true by default.
3. Partition the table on the active flag, so that active items, which are the only items queried, lie in a single partition and inactive items in another.
4. For pricing, do not store the JSON as varchar/text; use a native JSON field (MySQL 5.7+), which allows richer querying within your JSON.
5. For the Redis sync, I would also suggest exploring Debezium to stream MySQL changes to Redis.
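A minimal sketch of the flag-plus-JSON layout, with SQLite's JSON1 functions standing in for MySQL 5.7's native JSON type (in MySQL the column would be declared JSON and queried with prices->>'$.retail'; partitioning is also MySQL-side and not shown here). All names and prices are hypothetical:

```python
import sqlite3  # SQLite JSON1 stands in for MySQL 5.7+ JSON here

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    id INTEGER PRIMARY KEY,
    name TEXT,
    active INTEGER DEFAULT 1,  -- 1 = currently sold; rows are never deleted
    prices TEXT                -- JSON document holding the price variants
);
CREATE INDEX idx_product_active ON product (active);
""")
conn.executemany(
    "INSERT INTO product (id, name, active, prices) VALUES (?, ?, ?, ?)",
    [
        (1, "Widget v1", 0, '{"retail": 9.99,  "wholesale": 7.50}'),
        (2, "Widget v2", 1, '{"retail": 11.99, "wholesale": 8.75}'),
    ])

# Hot path only touches active products; inactive rows stay where they are.
rows = conn.execute("""
    SELECT name, json_extract(prices, '$.retail') AS retail_price
    FROM product
    WHERE active = 1
""").fetchall()
print(rows)  # [('Widget v2', 11.99)]
```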
We presently use a pen-and-paper roster to manage table games staff at the casino. Each row is an employee, each column is a 20-minute block of time, and each cell represents what table the employee is assigned to, or alternatively that they've been assigned to a break. The start and end times of employees' shifts vary, as do the games/skills they can deal. We need to keep a copy of the rosters for 7 years; with paper this is fairly easy. I want to develop a digital application and am having difficulty working out how to store the data in a database for archiving.
I'm fairly new to working with databases. I think I understand how to model the data for a graph database like Neo4j, but I had difficulty when it came to working with time. I've tried to learn about RDBMS databases like MySQL; below is how I think the data should be modelled. Please point out if I'm going in the wrong direction or if a different database type would be more appropriate, it would be greatly appreciated!
Basic Data
Here is some basic data to work with before we factor in scheduling/time.
Employee
- ID Number
- Name
- Skills (Blackjack, Baccarat, Roulette, etc)
Table
- ID Number
- Skill/Type (Can only be one skill)
It may be better to store the roster data as a file (like JSON) instead? Time-sensitive data wouldn't be so much of a problem then. The benefit of going digital with a database would be queries; these could help with time-consuming tasks where human error is common.
Possible Queries
Note: staff that are on shift are either on a break or on the floor (assigned to a table); skills have a major or minor type based on difficulty to learn.
What staff have been on the floor for 80 minutes or more? (They are due for a break)
What open tables can I assign this employee to based on their skillset?
I need an employee that has the Baccarat skill but has not already been assigned to a Baccarat table.
What employee(s) was on this table during this period of time?
Where was this employee at this point in time?
Who is on shift right now?
How many staff on shift can deal Blackjack?
How many staff have 3 major skills?
What staff have had the Baccarat skill for at least 3 months?
These queries could also be sorted alphabetically, or by time, skill, etc.
I'm pretty sure I know how to perform these queries with Cypher for Neo4j, provided I model the data right. I'm not as knowledgeable about SQL queries; I've read they can get a bit complicated depending on the query and structure.
----------------------------------------------------------------------------------------
MySQL Specific
An employee table could contain properties such as their ID number and name, but am I right that their skills and shifts would be separate tables that reference the employee by a unique integer (I think this is called a foreign key?).
Another table could store the gaming tables; these would have their own ID and reference a skill/game type with a foreign key.
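The layout just described (skills and gaming tables referencing each other by foreign keys, with a junction table for the many-to-many employee-skill link) might look like this; SQLite for illustration, and all names here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE skill (
    id INTEGER PRIMARY KEY,
    name TEXT UNIQUE             -- Blackjack, Baccarat, Roulette, ...
);
CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT
);
-- Junction table: an employee can hold several skills.
CREATE TABLE employee_skill (
    employee_id INTEGER REFERENCES employee(id),
    skill_id INTEGER REFERENCES skill(id),
    acquired_on TEXT,            -- supports "had the skill 3+ months" queries
    PRIMARY KEY (employee_id, skill_id)
);
-- A gaming table references its game type by foreign key.
CREATE TABLE gaming_table (
    id INTEGER PRIMARY KEY,
    skill_id INTEGER REFERENCES skill(id)
);
""")
conn.execute("INSERT INTO skill (id, name) VALUES (1, 'Blackjack'), (2, 'Baccarat')")
conn.execute("INSERT INTO employee (id, name) VALUES (1, 'Dana')")
conn.execute("INSERT INTO employee_skill VALUES (1, 1, '2019-05-01')")
conn.execute("INSERT INTO gaming_table (id, skill_id) VALUES (10, 1)")

# Which tables can employee 1 deal, based on their skillset?
dealable = conn.execute("""
    SELECT gt.id FROM gaming_table gt
    JOIN employee_skill es ON es.skill_id = gt.skill_id
    WHERE es.employee_id = 1
""").fetchall()
print(dealable)  # [(10,)]
```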
To record data like the pen/paper roster, each day could have a table with columns starting from 0000 and increasing by 20 all the way to 2340? Before the time columns I could have one for staff, where each employee is represented by their foreign key; the time columns would then hold foreign keys to the assigned gaming tables. The row data is bound to have many unpopulated cells, since an employee's shift won't be 24/7. If I'm using foreign keys to reference gaming tables, I now have a problem when the employee is on break? Unless I treat, say, the first gaming table entry as a break?
I may need to complicate things further, though: management will try different gaming table layouts over time, and some of the gaming tables can be converted from, say, Blackjack to Baccarat. This is bound to happen quite a bit over 7 years. Would I want to create new gaming table entries, or add a column that uses a foreign key to refer to a new table storing the history of game types over periods of time? Employees will also learn to deal new games during their career; very rarely, they may also have a skill removed.
----------------------------------------------------------------------------------------
Neo4j Specific
With this data would I have an Employee and a Table node that have "isA" relationship edges mapping to actual employees or tables?
I imagine with the skills for the two types I would be best with a Skill node and establish relationships like so?: Blackjack->isA->Skill, Employee->hasSkill->Blackjack, Table->typeIs->Blackjack?
TIME
I find difficulty when I want this database to now work with a timeline. I've come across the following suggestions for connecting nodes with time:
Unix Epoch seems to be a common recommendation?
Connecting nodes to a year/month/day graph?
Lucene timeline? (I don't know much about this or how to work with it, have seen some mention it)
And some cases with how time and data relate:
Staff have varied days and start/end times from week to week. This could be a shift node with properties {shiftStart, shiftEnd, actualStart, actualEnd}, since staff may arrive late or get sick during a shift. Would this be the right way to link each shift to an employee: Employee(node)->Shifts(groupNode)->Shift(node)?
Tables and staff may have their skill data modified, which could be an issue with archived data. I think the solution is to have a time property on the relationship to the skill?
We open and close tables throughout the day. Each table has open/close times for each day, and these could change in a month depending on what management wants; in addition the times are not strict, as for various reasons a manager may open or close tables during the shift. The open/closed status of a table node may only be relevant for queries during the shift, which confuses me: I'd want it for queries, but for archiving with time it might not make sense?
It's with queries that I have trouble deciding when to use a node or add a property to a node. An Employee has a name and ID number; if I wanted to find an employee by their ID number, would it be better to have that as a node of its own? It would be more direct, right, instead of going through all employees for that unique ID number.
I've also come across labels just recently. I can understand that those would be useful for typing employee and table nodes, rather than grouping them under a node. I think the shifts for an employee should continue to be grouped under a shifts node. If I were to write Cypher queries for employees working shifts through a time period, a label might be appropriate; however, should it be applied to the individual shift nodes or to the shifts group node that links back to the employee? I might need to add a property to individual shift nodes, or to the relationship to the shifts group node? I'm also not sure there should be a shifts group node at all; I'm assuming that reducing the number of edges connecting to the employee node would be optimal for queries.
----------------------------------------------------------------------------------------
If there are any great resources where I can learn about database development, that'd be great; there is so much information and there are so many options out there that it's difficult to know where to begin. Thanks for your time :)
Thanks for spending the time to put a quality question together. Your requirements are great and your specification of the system is very detailed. I was able to translate your specs into a graph data model for Neo4j; see below.
Above you'll see a fairly self-explanatory graph data model. In case you are unfamiliar with this, I suggest reading Graph Databases: http://graphdatabases.com/ -- on that website you can get a free digital PDF copy of the book, but in case you want a hard copy you can find it on Amazon.
Let's break down the graph model in the image. At the top you'll see a time indexing structure that is (Year)->(Month)->(Day)->(Hour), which I have abbreviated as Y M D H. The ellipses indicate that the graph continues, but for the sake of space on the screen I've only shown a sub-graph.
This time index gives you a way to generate time series or ask certain questions on your data model that are time specific. Very useful.
The bottom portion of the image contains your enterprise data model for your casino. The nodes represent your business objects:
Game
Table
Employee
Skill
What's great about graph databases is that you can look at this image and semantically understand the language of your question by jumping from one node to another by their relationships.
Here is a Cypher query you can use to ask your questions about the data model. You can just tweak it slightly to match your questions.
MATCH (employee:Employee)-[:HAS_SKILL]->(skill:Skill),
(employee)<-[:DEALS]-(game:Game)-[:LOCATION]->(table:Table),
(game)-[:BEGINS]->(hour:H)<-[*]-(day:D)<-[*]-(month:M)<-[*]-(year:Y)
WHERE skill.type = "Blackjack" AND
day.day = 17 AND
month.month = 1 AND
year.year = 2014
RETURN employee, skill, game, table
The above query finds the sub-graph for all employees who have the skill Blackjack and their table and location on a specific date (1/17/14).
To do this in SQL would be very difficult. The next thing you need to think about is importing your data into a Neo4j database. If you're curious how to do that, please look at other questions here on SO, and if you need more help, feel free to post another question or reach out to me on Twitter @kennybastani.
Cheers,
Kenny
I'm designing an Access .accdb for project management. The project contract stipulates a number of milestones for each project, each with an associated date. The exact number of milestones depends on an "either/or" case of project size, but the max is 6.
My employer would like to track a [Forecast] date, an [Actual] date, and a [Paid] date for each milestone, meaning a large-sized project ends up with 24 dates associated with it, often duplicated (if a project runs to time, all four dates will be identical).
Currently, I have tblMilestones, which has a FK linking to tblProject and a record for each Milestone, with the 4 associated dates as fields in the record and a field to mark the milestone as complete or current.
I feel like we're collecting, storing, and entering a lot of pretty pointless data - especially the [Forecast] date, for which we collect data from our project managers (not the most reliable of data anyway). Once the milestone is complete and the [Actual] date is entered, the [Forecast] date is pretty meaningless.
I'd rather have the contract date in one table, entered when a new project is added; a reporting table for the changeable forecast date; the [Actual] date set when the user marks a milestone as complete; and the paid date drawn from transaction records.
Is this a better design approach? The db is small - less than 50 projects, so part of me thinks I'd just be making things more complicated than they need to be, especially in terms of the extra UI required.
Take a page out of dimensional data warehouse design and store dates in their own table with a DateID:
DateID DateValue
------ ----------
1 2000-01-01
... ...
9999 2012-12-31
Then turn all your date fields--Forecast, Actual, Paid, etc.--into foreign key references to the date table's DateID field.
To populate the dates table you can go two ways:
1. Use some VBA to generate a large set of dates, say 2005-01-01 to 2100-12-31, and insert them into the dates table as a one-time operation.
2. Whenever someone types in a new date, check the dates table to see if it already exists, and if not, insert it.
Whichever way you do it, you'll obviously need an index on DateValue.
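The insert-on-first-use approach can be sketched like this (SQLite standing in for Access/Jet; table and column names are hypothetical, and the unique index on DateValue is what backs the existence check):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dates (
    DateID INTEGER PRIMARY KEY,
    DateValue TEXT UNIQUE NOT NULL  -- the unique index required above
);
CREATE TABLE milestone (
    MilestoneID INTEGER PRIMARY KEY,
    ForecastDateID INTEGER REFERENCES dates(DateID),
    ActualDateID   INTEGER REFERENCES dates(DateID),
    PaidDateID     INTEGER REFERENCES dates(DateID)
);
""")

def date_id(conn, value):
    """Return the DateID for value, inserting the date on first use."""
    conn.execute("INSERT OR IGNORE INTO dates (DateValue) VALUES (?)",
                 (value,))
    return conn.execute("SELECT DateID FROM dates WHERE DateValue = ?",
                        (value,)).fetchone()[0]

a = date_id(conn, "2012-06-01")
b = date_id(conn, "2012-06-01")  # second lookup reuses the same row
print(a == b)                    # True

# All three milestone date fields become foreign keys into dates.
conn.execute("""INSERT INTO milestone
                (ForecastDateID, ActualDateID, PaidDateID)
                VALUES (?, ?, ?)""",
             (a, a, date_id(conn, "2012-06-15")))
```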
Taking a step back from the actual question, I'm realising that you're trying to fit two different uses into the same database--regular transactional use (as your project management app) and analytical use (tracking several different dates for your milestones--in other words, the milestone completion date is a Slowly Changing Dimension). You might want to consider splitting up these two uses into a regular transactional database and a data warehouse for analysis, and setting up an ETL process to move the data between them.
This way you can track only a milestone completion date and a payment date in your transactional database and the data warehouse will capture changes to the completion date over time. And allow you to do analysis and forecasting on that without bogging down the performance of the transactional (application) database.
We don't have an existing data warehouse, but we have customers (in the OLTP system) that have been with us for many years and made purchases. How can I populate a customer dimension and then "replay" all the age updates that have occurred over the years, so that the type 2 dimension will have all the updates for those customers?
I want to populate the fact table with sales and refer to the DimCustomerFK, but when our clients query the data I want those customers to have the correct age. If I don't make any changes, the customer will have the same age now as 10 years back when he placed his first order.
Any ideas how this can be done?
Interesting problem, Patrik.
Some options:-
1) Design SQL to parse through your customer/transaction OLTP data and create a daily flat file of customer updates. You will end up with many thousands of fairly small files (obviously depending on the number of customers you have and the date range). Name them Customeryyyymmdd.csv. Then create an ETL suite to read in the flat files in forward date order and apply the type 2 changes, in order, to the DWH.
2) Build a very complex SQL query (I'm waving my hands around here, as I don't know your data structures, so I couldn't say how complex this would be) that creates an ordered customer change list that you can pass through an ETL SCD component record by record.
Either seems logically feasible given what you have said, and hopefully that gives you some ideas to consider towards a more concrete solution.
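Whichever route produces the ordered change list, applying it as type 2 changes amounts to closing the current dimension row and opening a new one. A sketch with a hypothetical DimCustomer layout (SQLite for illustration; validity ranges and surrogate keys named arbitrarily):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimCustomer (
    DimCustomerFK INTEGER PRIMARY KEY,  -- surrogate key the fact table uses
    CustomerID INTEGER,                 -- natural key from the OLTP system
    Age INTEGER,
    ValidFrom TEXT,
    ValidTo TEXT,                       -- NULL = current row
    IsCurrent INTEGER
);
""")

def apply_change(conn, customer_id, age, change_date):
    """Type 2 change: close the current row, insert the new version."""
    with conn:
        conn.execute("""
            UPDATE DimCustomer SET ValidTo = ?, IsCurrent = 0
            WHERE CustomerID = ? AND IsCurrent = 1
        """, (change_date, customer_id))
        conn.execute("""
            INSERT INTO DimCustomer
                (CustomerID, Age, ValidFrom, ValidTo, IsCurrent)
            VALUES (?, ?, ?, NULL, 1)
        """, (customer_id, age, change_date))

# Replay the ordered change list oldest first (one birthday per year here).
for age, on in [(30, "2010-01-15"), (31, "2011-01-15"), (32, "2012-01-15")]:
    apply_change(conn, 42, age, on)

# A sale joins to the dimension version that was valid at the sale date,
# so a 2011 order shows the customer's age in 2011, not today's.
age_at_sale = conn.execute("""
    SELECT Age FROM DimCustomer
    WHERE CustomerID = 42 AND ValidFrom <= '2011-06-01'
      AND (ValidTo IS NULL OR ValidTo > '2011-06-01')
""").fetchone()
print(age_at_sale)  # (31,)
```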
g/l
Mark.