In MS Access, I need to create a history table off a select query that is used for reporting. I don't want an append table, as I need the select's data for reporting.
The answer that works best is you do want to use an append query.
Instead of garbaging up your database with lots of history tables, it's better to have one history table with a unique key to differentiate the multiple history reports.
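Usually a "Time Stamp" field is a good key for this, where each record in a given report run gets the same time stamp. Because that value repeats within a run, it identifies the report rather than a single row, so combine it with the report's own row key if you need a unique primary key.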
Also, you can have other key fields depending on the type of report it is. You may want a version field, or a re-try field. You also may want a final copy field. Having these fields will allow you to go back and delete garbage reports or updated reports or bad attempt reports.
Also having solid date fields will allow you to discriminate between daily reports, weekly reports, or monthly reports. (Let alone if you have to worry about Fiscal year, or retail calendars, etc.)
The good thing about having a single table is, you can always go back into your history table and pull out lots of historical data for other types of report comparisons... all in one query instead of trying to tie multiple tables together (mostly with hard to figure out names).
Do yourself and the future programmers who will have to deal with your code a favor... and put all the history in one table. Especially since one of those future programmers may be you. You'll be thanking yourself.
Oh... and to get the data for the reporting, you use your primary key to pull out that data. Or... you can have a staging table for your report and then you append the staging table data to the History table (with all the proper key info).
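The single-history-table idea can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for Access, and the table and column names (sales, report_history, run_stamp, version, is_final) are made up for the example. The INSERT ... SELECT plays the role of the append query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('East', 100), ('West', 250);

    -- One history table shared by every run of the report (hypothetical layout).
    CREATE TABLE report_history (
        run_stamp TEXT    NOT NULL,           -- same value for every row of one run
        version   INTEGER NOT NULL,           -- lets you keep re-tries apart
        is_final  INTEGER NOT NULL DEFAULT 0, -- flag for the final copy
        region    TEXT    NOT NULL,
        total     REAL    NOT NULL,
        PRIMARY KEY (run_stamp, version, region)
    );
""")

def snapshot_report(conn, run_stamp, version=1):
    """Append the current output of the report query to the history table."""
    with conn:
        conn.execute(
            """
            INSERT INTO report_history (run_stamp, version, region, total)
            SELECT ?, ?, region, SUM(amount) FROM sales GROUP BY region
            """,
            (run_stamp, version),
        )

def load_report(conn, run_stamp, version=1):
    """Pull one stored report back out by its key."""
    return conn.execute(
        "SELECT region, total FROM report_history "
        "WHERE run_stamp = ? AND version = ? ORDER BY region",
        (run_stamp, version),
    ).fetchall()

snapshot_report(conn, "2024-01-31T23:59:00")
with conn:
    conn.execute("UPDATE sales SET amount = 300 WHERE region = 'West'")
snapshot_report(conn, "2024-02-29T23:59:00")

january = load_report(conn, "2024-01-31T23:59:00")
february = load_report(conn, "2024-02-29T23:59:00")
```

The January rows are unaffected by the later change to the live data, and both reports come out of the same table with one WHERE clause.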
Related
I have to design a database schema for an application I'm building. I will be using MySQL. In this application, users enter data and it gets saved in the database. However, this data is not accessible to the public until the user publishes it. Currently, I have one column for storing all the data. I was wondering if a boolean field in this table that indicates whether the data has been published is a good idea. Or is it a much better design to create one table for saved data and one table for published data, and move the saved data to the published table when the user presses Publish?
What are the advantages and disadvantages of using each one and is one of them considered better design than the other?
Case: Binary
They are about equal. Use this as a learning exercise -- Implement it one way; watch it for a while, then switch to the other way.
(same) Space: Since a row exists exactly once, neither option is 'better'.
(favor 1 table) When "publishing" it takes a transaction to atomically delete from one table and insert into the other.
(favor 2 tables) Certain SELECTs will spend time filtering out records with the other value for published. (This applies to deleted, embargoed, approved, and a host of other possible boolean flags.)
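To make the "it takes a transaction" point concrete, here is a small sketch of the two-table variant. It assumes Python's sqlite3 module in place of MySQL, and the table names drafts and published are invented for the example. With a single table, the whole function would collapse to one UPDATE of a boolean column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE drafts    (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    CREATE TABLE published (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
    INSERT INTO drafts VALUES (1, 'first post'), (2, 'second post');
""")

def publish(conn, draft_id):
    """Move one row from drafts to published as a single atomic step."""
    with conn:  # commits on success, rolls back if anything below raises
        cur = conn.execute(
            "INSERT INTO published (id, body) "
            "SELECT id, body FROM drafts WHERE id = ?",
            (draft_id,),
        )
        if cur.rowcount != 1:
            raise LookupError(f"no draft with id {draft_id}")
        conn.execute("DELETE FROM drafts WHERE id = ?", (draft_id,))

publish(conn, 1)

draft_ids = [row[0] for row in conn.execute("SELECT id FROM drafts ORDER BY id")]
published_ids = [row[0] for row in conn.execute("SELECT id FROM published ORDER BY id")]
```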
Case: Revision history
If there are many revisions of a record, then two tables, Current data and History, is better. That is because the 'important' queries involve fetching only the Current data.
(PARTITIONs are unlikely to help in either case.)
I am working on a project where I will receive student data dumps once a month. The data will be imported into my system. The initial import will be around 7k records. After that, I don't anticipate more than a few hundred a month. However, there will also be existing records that will be updated as the student changes grades, etc.
I am trying to determine the best way to keep track of what has been received, imported, and updated over time.
I was thinking of setting up a hosted MySQL database with a script that imports the SFTP dump into a table that includes a creation_date and a modification_date field. My thought was, the person performing the extraction, could connect to the MySQL db and run a query on the imported table each month to get the differences before the next extraction.
Another thought I had, was to create a new received table every month for each data dump. Then I would perform the query on the differences.
Note: The importing system is legacy and will accept imports using a utility and unique csv type files. So that probably rules out options like XML.
Thank you in advance for any advice.
I'm going to assume you're tracking students' grades in a course over time.
I would recommend a two table approach:
Table 1: transaction level data. Add-only. New information is simply appended on. Sammy got a 75 on this week's quiz, Beth did 5 points extra credit, etc. Each row is a single transaction. Presumably it has the student's name/id, the value being added, maybe the max possible value or some weighting factor, and of course the timestamp added.
All of this just keeps adding to a never-ending (in theory) table.
Table 2: summary table, rebuilt at some interval. This table does a simple aggregation on the first table, processing the transactional scores into a global one. Maybe it's a simple sum, maybe it's a weighted average, maybe you have something more complex in mind.
This table has one row per student (per course?). You want this to be rebuilt nightly. If you're lazy, you just DROP/CREATE/INSERT. If you're worried about data-loss, you just INSERT and add a timestamp so you can have snapshots going back.
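A rough sketch of the two-table approach, assuming Python's sqlite3 module rather than a hosted MySQL database. The names score_events and score_summary, and the choice of a points-over-maximum percentage as the aggregate, are illustrative assumptions and not something the question specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table 1: add-only transaction log.
    CREATE TABLE score_events (
        student_id INTEGER NOT NULL,
        points     REAL    NOT NULL,
        max_points REAL    NOT NULL,
        added_at   TEXT    NOT NULL
    );
    -- Table 2: one row per student, rebuilt from the log.
    CREATE TABLE score_summary (
        student_id INTEGER PRIMARY KEY,
        percent    REAL NOT NULL
    );
""")

def record_event(conn, student_id, points, max_points, added_at):
    """Append one transaction; existing rows are never updated."""
    with conn:
        conn.execute(
            "INSERT INTO score_events VALUES (?, ?, ?, ?)",
            (student_id, points, max_points, added_at),
        )

def rebuild_summary(conn):
    """Throw the summary away and recompute it from the log (the 'lazy' rebuild)."""
    with conn:
        conn.execute("DELETE FROM score_summary")
        conn.execute("""
            INSERT INTO score_summary (student_id, percent)
            SELECT student_id, 100.0 * SUM(points) / SUM(max_points)
            FROM score_events
            GROUP BY student_id
        """)

record_event(conn, 1, 75, 100, "2024-03-01")  # Sammy's quiz
record_event(conn, 2, 90, 100, "2024-03-01")  # Beth's quiz
record_event(conn, 2, 5, 0, "2024-03-02")     # Beth's extra credit
rebuild_summary(conn)

summary = dict(conn.execute("SELECT student_id, percent FROM score_summary"))
```

For the snapshot variant, the DELETE would be dropped and a rebuild timestamp added to the summary table's key.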
I am building a project in MS Access 2010. I have previous experience in Oracle. I am reading about MS Access and keep seeing references to table relationships. It looks like a convenient way to assist the average person in data entry and validation and for query building, but I write queries exclusively in SQL mode and enforce data entry for users with forms that have their own validation rules.
Is it really necessary to enforce relationships? It doesn't seem like it really gains me anything at an advanced level, and might actually cause problems for me or someone else who eventually takes over maintenance from me later. I've never used them before and I'm not really seeing a benefit to starting now. Can anyone shed some light on that?
You say you have previous experience of Oracle. Did you never define Foreign Key constraints in Oracle? If you did, then that is what you are doing when you define relationships in Access. You can use it for enforcing referential integrity (not allowing you to delete a parent record if child records still exist) or, if you use the cascade delete option, for automatically deleting child records if you delete a parent record. It's a useful backup to cover coding errors where you might have forgotten about possible child records that would otherwise be orphaned if you did not have the relationship (FK) defined.
For someone who is just querying data, the relationships are not that important. From an application point of view, however, they are VERY helpful, if not outright essential.
For example, you might have a customers table and an orders table. The business rule is that you can’t create an order unless you first have a customer. So if you freely write some SQL to add an order without a customer, your update/insert query will NOT work. And if you need to delete a customer, all orders for that customer can be deleted automatically, without you having to write a complex delete SQL statement. You might, for example, want to delete all customers who have been inactive for 5 or 10 years. When you delete those customers, you want all their orders deleted as well. Without enforced relations you have to delete the child records for each customer yourself, which is a VERY difficult query to write; with an enforced cascade delete, all child records are deleted automatically.
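The customer/order rule can be sketched like this. It uses Python's sqlite3 module rather than Access, so the syntax is SQLite's (including the PRAGMA, which SQLite needs because it leaves foreign key enforcement off by default), and the table layout is invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enforcement is off by default
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL
                    REFERENCES customers(id) ON DELETE CASCADE,
        total       REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Acme');
    INSERT INTO orders VALUES (10, 1, 99.5), (11, 1, 20.0);
""")

# An order for a customer that does not exist is refused by the database itself.
try:
    conn.execute("INSERT INTO orders VALUES (12, 999, 5.0)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the parent removes its child rows without any extra delete query.
with conn:
    conn.execute("DELETE FROM customers WHERE id = 1")

remaining_orders = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```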
It is also important from a reporting point of view. If you write a query to display all customers this month and their billing totals, you get one total. However, if you decide that you do NOT want to include customers, you might query just the orders table and get a total amount that way. The problem is that without RI, someone (by accident, or just a user launching the orders form) might have entered an order (with a total amount) but NO customer.
What happens now is that when you run the two different reports/queries, you find the totals are DIFFERENT! In a complex application, working out “why” two reports on monthly sales do NOT agree with each other can take days, or with lots of data even a week. If you enforce the business rule that no orders can be entered into the system UNLESS they have a customer, then you eliminate such errors in reporting. You can “say” that you are a perfect user of SQL, but with lots of code and lots of forms for data entry, how can you EVER be sure that orders are NEVER entered without a customer? The user may forget to enter the customer in the order form during data entry. And even if you write code in that order form to ENSURE that a customer must be selected, maybe YOU, while writing some SQL, accidentally insert an order record into the system without a customer. Yet your monthly customer total report query “assumes” that you have a customer record to which you THEN join the order totals data.
However, some reports run on just the orders data (a monthly summary total does not need to include customers). The problem now is that somewhere in the system you have an order record with total data that does not have a customer. The result is that different reports and queries on sales totals no longer agree. This is an outright nightmare.
So some bug or error in the application code might occur and leave what is supposed to be relational data with “orphaned” records. Perhaps your business rules do allow entering orders without a customer assigned, but then your monthly sales report will have to show that fact, and any query that hits the orders table without including customers will have to “check” for the possibility that no customer record exists yet.
The above only scratches the surface of the GAZILLION issues that crop up. So while you might just be creating simple queries on the data, the real question is whether that data is correctly related in the system. The old saying about garbage in = garbage out rings true here.
At the end of the day, when your SQL queries pull data from MULTIPLE tables, you HAVE to make assumptions about that data and its relational integrity (RI). So when you write that query to display customers and their order totals, you ASSUME the relationship holds: you drop in the customers table and then join in the orders table. However, if orders exist without a customer record, your query is not going to produce the correct values. And worse, a report that hits only the orders table will now produce different results.
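Here is a tiny reproduction of the "two reports disagree" problem, again with Python's sqlite3 module and made-up tables. No relationship is enforced, so one order has slipped in without a customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No foreign key is declared, so nothing stops an order without a customer.
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Acme');
    INSERT INTO orders VALUES (10, 1, 100.0), (11, NULL, 40.0);
""")

# Report 1: sales total, reached through the customers table.
total_via_customers = conn.execute("""
    SELECT SUM(o.total)
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
""").fetchone()[0]

# Report 2: sales total straight from the orders table.
total_orders_only = conn.execute("SELECT SUM(total) FROM orders").fetchone()[0]

# The check every query would otherwise have to carry around: find the orphans.
orphans = conn.execute("""
    SELECT o.id
    FROM orders AS o
    LEFT JOIN customers AS c ON c.id = o.customer_id
    WHERE c.id IS NULL
""").fetchall()
```

The first report gives 100.0 and the second 140.0; the difference is exactly the orphaned order 11.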
If you enforce RI then no matter what, you cannot enter an order by accident or force without FIRST having created a customer record. If you don’t enforce such rules, then your data will produce incorrect results.
And a typical complex application will have 40 or 70 related tables. For EVERY ONE of those tables, assumptions are going to be made about whether parent (or child) records have been created correctly according to your business rules.
You might have a tour booking system. Customers might phone up and put down a deposit but NOT yet be booked on a particular tour. If you allow this setup, then your query on this month’s customers and their booked tours will have to take it into account. However, maybe the business rule is that any customer in the system who puts money down MUST ALSO be booked on a tour (and thus your query to grab that information can rely on this rule).
If none of your queries ever included data from more than one table, then you likely wouldn’t benefit much from enforcing relational data. However, the instant you start building queries with multiple tables, you MUST know the assumptions being made about that data before you can write a query. So do you allow customers with a deposit in the system without a tour booking or not? This rule decides how you must write that query. If RI is enforced, then you can query a customer “booking” and KNOW that it will be attached to a tour event. The same goes if a deposit does not need a booking, but you HAVE to know the assumption made about that data.
So the ONLY practical way to write a query is on the basis of the assumptions made about the data. And if you enforce RI, then you at least know the data MUST be related and set up according to those assumptions.
At the end of the day? Anyone creating a database that models a business application and its rules without enforcing RI is building a ship without a rudder and without a compass.
And exporting data from each table is a NON-issue. However, if that data is a mess and has orphaned records, then you only wind up exporting an incorrect data model to another database, and all of the issues and problems remain.
If you are building queries purely in SQL mode, defining the relationships probably doesn't make any difference for you. The only thing that might be useful is that if you built something, then didn't look at it again for a few months, you would be able to quickly re-acquaint yourself with the relationships conceptually.
For anyone using the Access query builder, defining the relationships allows you to quickly add tables to the query while Access automatically builds the proper (GIGO) relationships for the query JOIN. Again, if you are writing in SQL, you probably already do this yourself, so it is not much help for you in query building.
Bottom Line - it's more of a graphical tool to streamline the query process, at least until you try exporting the tables to a "real" RDBMS, as someone else already mentioned.
If it's not a requirement for your use case, then you don't necessarily have to use this. A use case where this could be "required" is an order-based scenario.
Let's say you have a database that creates and tracks orders. Each order can have multiple lines tied to the same order. For normalization purposes, most people would separate these into two tables, OrderHead and OrderDetail. You would want to enforce referential integrity here to ensure that there is never a child record in OrderDetail that doesn't link back to a parent order.
I'm sure that you could prevent things like that without it, but it mainly just enforces it.
Relationships help preserve data integrity, and I agree with your point that if the user enters data through an Access form, integrity errors are less likely. But if the user later moves from MS Access to a pure RDBMS, these relationships will definitely be helpful.
Although the purpose of a relationship is not to ease a later migration, in your case that is one valid reason I can think of.
Other than that, for MS Access with its own forms, relationships may not add much value.
Currently, I have 48 fields.
I'm completely new to access. This is how I decided to connect everything together.
It doesn't seem to be very effective. Could somebody help me understand how to normalize this database?
Should I try to put employee information in one table, job information in another table and then have an equipment lookup table?
The current job, last job, and previous job can all be the SAME table. If you sort this table by descending job start date, then you have current, last and previous. You thus neither need nor want a separate table for each of these, since they all amount to the concept of a “job”. If sorting by date is not enough, then you could add a column called Job Type (current, previous, etc.). Again, we are still using only the one table.
The same goes for Equipment. You really don’t care if the limit is the last 3 or the last 300. By building a normalized table, ONE form can edit all types, and you save MASSIVE amounts of work coding, building tables and user interface software, and building queries to retrieve and show the last 3 jobs in a form.
The fact that your design allows the last 3 or 300 jobs, at FAR LESS cost of development, is really beside the point. More important, if some manager comes along and now wants you to save the last 4 jobs, you don’t face some massive re-design. And you can add new job types on the fly. So in addition to current and, say, previous, you can also have un-completed or failed jobs. Adding new business rules thus means you don’t add a new kind of job table, but only a new “type” value in the one column you are already using to mark the job as current or previous.
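As a sketch of the one-table idea, assuming Python's sqlite3 module instead of Access, with invented column names and sample data: the "last 3 jobs" becomes a sort and a limit, and changing 3 to 4 changes an argument, not the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One table for every job an employee has held (hypothetical layout).
    CREATE TABLE jobs (
        id          INTEGER PRIMARY KEY,
        employee_id INTEGER NOT NULL,
        title       TEXT    NOT NULL,
        start_date  TEXT    NOT NULL,             -- ISO dates sort correctly as text
        job_type    TEXT    NOT NULL DEFAULT 'completed'
    );
    INSERT INTO jobs (employee_id, title, start_date) VALUES
        (1, 'Apprentice', '2015-06-01'),
        (1, 'Technician', '2018-02-15'),
        (1, 'Supervisor', '2021-09-01'),
        (1, 'Manager',    '2023-01-10');
""")

def recent_jobs(conn, employee_id, limit=3):
    """Return the most recent job titles, newest first."""
    rows = conn.execute(
        "SELECT title FROM jobs WHERE employee_id = ? "
        "ORDER BY start_date DESC LIMIT ?",
        (employee_id, limit),
    )
    return [row[0] for row in rows]

last_three = recent_jobs(conn, 1)
last_four = recent_jobs(conn, 1, limit=4)
```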
Identify like objects and make one table to store all of them. In your design you have three tables for equipment but each item of equipment has the same fields; they should be one table. Similarly for jobs, each job is pretty much the same; they should be one table. The same for departments.
Figure out one or more columns in each table that can uniquely identify the row in the table (that is, if you know the values for those columns, it is impossible for there ever to be two rows with those values). These are the primary keys for your tables.
Identify cases in which an item in one table needs to "point to" (refer to) an item in another table. In this case, make sure that the referring table has a set of columns that match the referred-to table.
When you've done that, you'll have the beginnings of a correctly factored relational database design.
We have a requirement in our application where we need to store references for later access.
Example: A user can commit an invoice at some point in time, and everything the invoice references (customer address, calculated amount of money, product descriptions), along with its calculations, should be stored over time.
We need to hold the references somehow, but what if, e.g., the product name changes? So somehow we need to copy everything so it's documented for later and not affected by future changes. Even when products are deleted, they need to be reviewable later whenever the stored invoice is looked at.
What is the best practice here regarding database design? And what is the most flexible approach, e.g. when the user wants to edit the invoice later and restore it from the db?
Thank you!
Here is one way to do it:
Essentially, we never modify or delete the existing data. We "modify" it by creating a new version. We "delete" it by setting the DELETED flag.
For example:
If product changes the price, we insert a new row into PRODUCT_VERSION while old orders are kept connected to the old PRODUCT_VERSION and the old price.
When buyer changes the address, we simply insert a new row in CUSTOMER_VERSION and link new orders to that, while keeping the old orders linked to the old version.
If product is deleted, we don't really delete it - we simply set the PRODUCT.DELETED flag, so all the orders historically made for that product stay in the database.
If customer is deleted (e.g. because (s)he requested to be unregistered), set the CUSTOMER.DELETED flag.
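One possible reading of this versioning scheme, sketched with Python's sqlite3 module instead of MySQL. The exact columns are not given in the answer, so everything beyond the PRODUCT, PRODUCT_VERSION and DELETED names (the version_no column, the order_item table, the change_price helper) is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        deleted    INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE product_version (
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        version_no INTEGER NOT NULL,
        name       TEXT    NOT NULL,
        price      REAL    NOT NULL,
        PRIMARY KEY (product_id, version_no)
    );
    -- Orders point at a specific version, not at the product as a whole.
    CREATE TABLE order_item (
        order_id   INTEGER NOT NULL,
        product_id INTEGER NOT NULL,
        version_no INTEGER NOT NULL,
        quantity   INTEGER NOT NULL,
        FOREIGN KEY (product_id, version_no)
            REFERENCES product_version(product_id, version_no)
    );
    INSERT INTO product (product_id) VALUES (1);
    INSERT INTO product_version VALUES (1, 1, 'Widget', 10.0);
    INSERT INTO order_item VALUES (100, 1, 1, 2);
""")

def change_price(conn, product_id, new_price):
    """'Modify' a product by inserting a new version; old rows stay untouched."""
    with conn:
        conn.execute(
            """
            INSERT INTO product_version (product_id, version_no, name, price)
            SELECT product_id, version_no + 1, name, ?
            FROM product_version
            WHERE product_id = ?
            ORDER BY version_no DESC
            LIMIT 1
            """,
            (new_price, product_id),
        )

change_price(conn, 1, 12.5)
with conn:  # 'delete' the product by setting the flag only
    conn.execute("UPDATE product SET deleted = 1 WHERE product_id = 1")

old_order_price = conn.execute("""
    SELECT pv.price
    FROM order_item AS oi
    JOIN product_version AS pv
      ON pv.product_id = oi.product_id AND pv.version_no = oi.version_no
    WHERE oi.order_id = 100
""").fetchone()[0]
version_count = conn.execute("SELECT COUNT(*) FROM product_version").fetchone()[0]
```

Order 100 still resolves to the original price after both the price change and the soft delete.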
Caveats:
If product name needs to be unique, that can't be enforced declaratively in the model above. You'll either need to "promote" the NAME from PRODUCT_VERSION to PRODUCT, make it a key there and give up the ability to "evolve" the product's name, or enforce uniqueness on only the latest PRODUCT_VERSION (probably through triggers).
There is a potential problem with the customer's privacy. If a customer is deleted from the system, it may be desirable to physically remove its data from the database and just setting CUSTOMER.DELETED won't do that. If that's a concern, either blank-out the privacy-sensitive data in all the customer's versions, or alternatively disconnect existing orders from the real customer and reconnect them to a special "anonymous" customer, then physically delete all the customer versions.
This model uses a lot of identifying relationships. This leads to "fat" foreign keys and could be a bit of a storage problem since MySQL doesn't support leading-edge index compression (unlike, say, Oracle), but on the other hand InnoDB always clusters the data on PK and this clustering can be beneficial for performance. Also, JOINs are less necessary.
Equivalent model with non-identifying relationships and surrogate keys would look like this:
You could add a column in the product table indicating whether or not it is being sold. Then when the product is "deleted" you just set the flag so that it is no longer available as a new product, but you retain the data for future lookups.
To deal with name changes, you should be using ID's to refer to products rather than using the name directly.
You've opened up an eternal debate between the purist and practical approach.
From a normalization standpoint of your database, you "should" keep all the relevant data. In other words, if a product name changes, save the date of the change so that you can go back in time and rebuild your invoice with that product name, and all other data, as it existed that day.
A "de"normalized approach is to view that invoice as a "moment in time", recording in the relevant tables the data as it actually was that day. This approach lets you pull up that invoice without any dependencies at all, but you could never recreate that invoice from scratch.
The problem you're facing is, as I'm sure you know, a result of Database Normalization. One of the approaches to resolve this can be taken from Business Intelligence techniques: archiving the data in a de-normalized state in a Data Warehouse.
Normalized data:
Orders table: OrderId, CustomerId
Customers table: CustomerId, Firstname, etc.
Items table: ItemId, Itemname, ItemPrice
OrderDetails table: ItemDetailId, OrderId, ItemId, ItemQty, etc.
When queried and stored de-normalized, the data warehouse table looks like:
OrderId, CustomerId, CustomerName, CustomerAddress, (other Customer fields), ItemDetailId, ItemId, ItemName, ItemPrice, (other OrderDetail and Item fields)
Typically, either a scheduled job pulls data from the normalized tables into the Data Warehouse, OR, if your design allows, it is done when an order reaches a certain status (such as shipped). The records could also be stored at each change of status (with a field called OrderStatus tracking the current status), so the fully de-normalized data is available for each step of the order/fulfillment process. When and how to archive the data into the warehouse will vary based on your needs.
There is a lot of overhead involved in the above, but the other common approach I'm aware of carries even MORE overhead.
The other approach would be to make the tables read-only. If a customer wants to change their address, you don't edit their existing address, you insert a new record.
So if my address is AddressId 12 when I first order on your site in January, and then I move on July 4, I get a new AddressId tied to my account. (Say AddressId 123123, because your site is very successful and has attracted a ton of customers.)
Orders I placed before July 4 would have AddressId 12 associated with them, and orders placed on or after July 4 would have AddressId 123123.
Repeat that pattern with every table that needs to retain historical data.
I do have a third approach, but searching it is difficult. I use this in one app only, and it actually works out pretty well in this single instance, which had some pretty specific business needs for reconstructing the data exactly as it was at a specific point in time. I wouldn't use it unless I had similar business needs.
At a specific status, serialize the data into an XML document, or some other document you can use to reconstruct the data. This allows you to save the data as it was at the time it was serialized, retaining the original table structure and relations.
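A minimal sketch of that serialization idea, using Python's sqlite3 and json modules. JSON is used here as the "some other document" instead of XML, and the tables, the order_snapshots table and the snapshot_order helper are all hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    -- One serialized document per order, written when the order hits the chosen status.
    CREATE TABLE order_snapshots (
        order_id INTEGER PRIMARY KEY,
        taken_at TEXT NOT NULL,
        document TEXT NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Pat', '1 Old Street');
    INSERT INTO orders VALUES (7, 1);
""")

def snapshot_order(conn, order_id, taken_at):
    """Serialize the order and the data it references, exactly as they are right now."""
    row = conn.execute(
        """
        SELECT o.id, c.id, c.name, c.address
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        WHERE o.id = ?
        """,
        (order_id,),
    ).fetchone()
    document = json.dumps({
        "order": {"id": row[0]},
        "customer": {"id": row[1], "name": row[2], "address": row[3]},
    })
    with conn:
        conn.execute(
            "INSERT INTO order_snapshots VALUES (?, ?, ?)",
            (order_id, taken_at, document),
        )

snapshot_order(conn, 7, "2024-07-01")
with conn:  # the customer moves after the snapshot was taken
    conn.execute("UPDATE customers SET address = '9 New Road' WHERE id = 1")

stored = conn.execute(
    "SELECT document FROM order_snapshots WHERE order_id = 7"
).fetchone()[0]
restored = json.loads(stored)
```

The drawback mentioned above still applies: the snapshot is an opaque text column, so searching inside it is awkward.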
When you have time-sensitive data, you use things like the product and Customer tables as lookup tables and store the information directly in your Orders/orderdetails tables.
So the order table might contain the customer name and address, and the details would contain all relevant information about the product, especially the price (you never want to rely on the product table for price information beyond the initial lookup at the time of the order).
This is NOT denormalizing. The data changes over time but you need the historical value, so you must store it at the time the record is created or you will lose data integrity. You don't want your financial reports to suddenly indicate you sold 30% more last year because you have had price updates. That's not what you sold.
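To illustrate, here is a small sketch in which the product table is only consulted once, at the time of the order. It assumes Python's sqlite3 module, and the table names, columns and add_order_line helper are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        price REAL NOT NULL
    );
    -- The order line keeps its own copy of the name and price.
    CREATE TABLE order_details (
        order_id     INTEGER NOT NULL,
        product_id   INTEGER NOT NULL,
        product_name TEXT    NOT NULL,
        unit_price   REAL    NOT NULL,
        quantity     INTEGER NOT NULL
    );
    INSERT INTO products VALUES (1, 'Widget', 10.0);
""")

def add_order_line(conn, order_id, product_id, quantity):
    """Look the product up once and copy its current name and price into the line."""
    with conn:
        conn.execute(
            """
            INSERT INTO order_details
                (order_id, product_id, product_name, unit_price, quantity)
            SELECT ?, id, name, price, ?
            FROM products
            WHERE id = ?
            """,
            (order_id, quantity, product_id),
        )

add_order_line(conn, 500, 1, 3)
with conn:  # a later price update
    conn.execute("UPDATE products SET price = 13.0 WHERE id = 1")
add_order_line(conn, 501, 1, 3)

sales = conn.execute(
    "SELECT order_id, unit_price * quantity FROM order_details ORDER BY order_id"
).fetchall()
```

Order 500 still totals 30.0 after the price update, while order 501 totals 39.0.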