I'm doing a school assignment and..
I have to build a vehicle tracker system. I came up with these three designs. What do you think?
My database schemas
Opinions?
If you always measure and store all parameters within one measuring session, then go for design 1.
Moving the attributes into separate tables only makes sense if the attributes are rarely stored and/or rarely needed.
If you have separate sensors for position and temperature, then go for design 3.
This is the most likely case, since the position is measured by a GPS tracker while temperature and oil level are measured by the vehicle's own sensors; these are separate devices, and their measurements are taken at separate times.
You may even need to add a separate table for each sensor (e.g. if different sensors measure gas and temperature at different times, then make two tables for them).
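For instance, a per-sensor split along those lines might look roughly like this (table and column names are only placeholders, not taken from your designs):

CREATE TABLE PositionReading (
    ReadingId INT PRIMARY KEY,
    VehicleId INT NOT NULL,
    ReadTime  DATETIME NOT NULL,
    Latitude  DECIMAL(9,6) NOT NULL,
    Longitude DECIMAL(9,6) NOT NULL
);

CREATE TABLE EngineReading (
    ReadingId   INT PRIMARY KEY,
    VehicleId   INT NOT NULL,
    ReadTime    DATETIME NOT NULL,
    Temperature DECIMAL(5,1) NULL,
    OilLevel    DECIMAL(5,2) NULL
);

Each device then writes into its own table at its own pace, and nothing is stored as NULL just because another sensor didn't report at that moment.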
Moving liquid into a separate table (as in design 2) makes sense if the list of the liquids you use is not known at design time (e.g. some third liquid, like hydrogen or helium-3 or whatever they invent next, will be used by the vehicles and you will need to track it without redesigning the database).
This is not a likely scenario, of course.
If you're reading from the sensors at the same time, the second design looks like overkill to me. It would only make sense to keep the information separate if you read it at different times.
I would suggest the first design.
Your application needs to deal with two types of things:
Sensors = which type, where in the engine, and even parameters such as polling frequency.
Reads = individual time-stamped recordings from one (or several?) sensors.
There are a few things to think about:
- How can we find ways of abstracting the sensor concept? The idea is that we could then identify and deal with sensor instances through their properties, rather than having to know where they are found in the database.
- Is it best to keep all measurements for a given timestamp in a single "Read" record, or to have one record per sensor, per read, even if several measurements come in sets?
A quick answer to the last question is that a single read event per record seems more flexible; we'll be able to handle, in the very same fashion, both groups of measurements that are systematically polled at the same time and other measurements that are asynchronous to the former. Even if, right now, all measurements come at once, the potential for easily adding sensors without changing the database schema, and for handling them in like fashion, is appealing.
Maybe the following design would be closer to what you need:
tblSensors
SensorId PK
Name clear text description of the sensor ("Oil Temp.")
LongName longer description ("Oil Temperature, Sensor TH-B14 in crankshaft")
SensorType enumeration ("TEMP", "PRESSURE", "VELOCITY"...)
SensorSubType enumeration ("OIL", "AIR"...)
Location enumeration ("ENGINE", "GENERAL", "EXHAUST"...)
OtherCrit other criteria which may be used to identify/search for the sensor.
tblReads
Readid PK
DateTime
SensorId FK to tblSensors
Measurement integer value
Measurement2 optional extra measurement (maybe to handle, say, all
of a GPS sensor read as one "value")
Measurement3 ... you may also have multiple columns for different types of
variables (real-valued?)
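As a concrete sketch of those two tables (the column types here are my assumptions, adjust as needed):

CREATE TABLE tblSensors (
    SensorId      INT PRIMARY KEY,
    Name          VARCHAR(50)  NOT NULL,
    LongName      VARCHAR(200) NULL,
    SensorType    VARCHAR(20)  NOT NULL,  -- 'TEMP', 'PRESSURE', 'VELOCITY', ...
    SensorSubType VARCHAR(20)  NULL,      -- 'OIL', 'AIR', ...
    Location      VARCHAR(20)  NULL,      -- 'ENGINE', 'GENERAL', 'EXHAUST', ...
    OtherCrit     VARCHAR(200) NULL
);

CREATE TABLE tblReads (
    ReadId       INT PRIMARY KEY,
    DateTime     DATETIME NOT NULL,       -- consider a name like ReadTime if your DBMS dislikes this one
    SensorId     INT NOT NULL,
    Measurement  INT NULL,
    Measurement2 INT NULL,
    Measurement3 FLOAT NULL,
    FOREIGN KEY (SensorId) REFERENCES tblSensors (SensorId)
);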
In addition to the above you'd have a few tables where the "enumerations" for the various types of sensors are defined, and the tie-in to the application logic would be by way of the mnemonic-like "keys" of these enumerations, e.g.:
SELECT S.Name, R.DateTime, R.Measurement
FROM tblReads R
JOIN tblSensors S ON S.SensorId = R.SensorId
WHERE S.SensorType IN ('TEMP', 'PRESSURE')
  AND S.Location = 'ENGINE'
  AND R.DateTime > '2009-04-07'
ORDER BY R.DateTime
This would not prevent you from also referring to the sensors by their id, and from grouping reads on the same result line, e.g.
SELECT R1.DateTime, R1.Measurement AS OilTemp, R2.Measurement AS OilPress,
       R3.Measurement AS MotorRpms
FROM tblReads R1
LEFT OUTER JOIN tblReads R2 ON R2.DateTime = R1.DateTime AND R2.SensorId = 11
LEFT OUTER JOIN tblReads R3 ON R3.DateTime = R1.DateTime AND R3.SensorId = 44
WHERE R1.SensorId = 17
  AND R1.DateTime > '2009-04-07' AND R1.DateTime < '2009-04-08'
ORDER BY R3.Measurement DESC  -- sort by RPM, fastest first
Related
I have a customer dimension table, and the location of a customer can change.
The customerid filters the sales fact table.
I have 2 options:
Slowly changing dimension type 2, to hold a new record for each customer location change
Or
Store the location at the time of data load into the sales fact table.
Both ways allow me to see sales by location (although it's a customer location, the ETL will place it on the fact table).
The latter option saves me from implementing SCD on the dim table.
What are factors to decide which of the 2 approaches is suitable?
Your fact table should contain things that we measure, count, or total. Your dimensions should be descriptive elements that allow users to slice their data along an axis - basically, they answer the "by" part of a request:
"I want to see total sales by year and month across this customer-based regional hierarchy."
Don't take my word for it; grab a data warehousing book or go read the freely available information from the Kimball Group.
Storing the customer data on the fact is a bad idea regardless of your database engine. To satisfy a query like the above, the storage engine needs to read in the entirety of your fact table and the supporting dimensions. It could read (Date, RegionId, CustomerId, SalesAmount), which likely costs something like 16 bytes per row times however many rows you have. Or it can read (Date, RegionId, CustomerName, CustomerAddress, CustomerCity, CustomerState, CustomerPostalCode, SalesAmount) at a cost of, what, 70 bytes per row? That inflation costs you when you:
- store your data (disk is cheap, but that's not the point)
- read your data (basic physics: the more data you wrote to disk, the longer it takes to read it back out)
- run other queries (you're in a multi-user/multi-query environment; when you hog memory, there's less for others)
- write data (ETL processing is going to take longer because you have to write more pages to disk than you should have)
- try to optimize (what if the business just wants to see "total sales by year and month" with no customer hierarchy? The database engine will still have to read all the pages with all that useless customer data just to get at the things the user actually wanted)
Finally, the most important takeaway from the Data Warehouse Toolkit is on about page 1: the biggest reason data warehouse projects fail is that IT drives the requirements, and it sounds like you're thinking of doing that to avoid creating an SCD Type 2 dimension. If the business problem you're attempting to solve is that they need to see sales data associated with the customer data as it was at the point in time the sale happened, you have a Type 2 customer dimension.
Yes, technologies like columnstore compression can reduce the amount of storage required, but it's not free because now you're adding workload to the CPU. Maybe you have the spare capacity, maybe you don't. Or you model it correctly, apply the compression as well, and still come out ahead with a proper dimensional model.
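To make that concrete, a minimal Type 2 customer dimension could look something like this (column names are illustrative and the syntax is SQL Server flavoured; neither comes from the question):

CREATE TABLE DimCustomer (
    CustomerKey      INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key stored on the fact row
    CustomerId       INT NOT NULL,                  -- business (natural) key
    CustomerName     VARCHAR(100) NOT NULL,
    City             VARCHAR(50)  NOT NULL,
    RegionId         INT NOT NULL,
    RowEffectiveDate DATE NOT NULL,
    RowExpiryDate    DATE NOT NULL,                 -- e.g. '9999-12-31' for the current row
    IsCurrent        BIT NOT NULL
);

CREATE TABLE FactSales (
    DateKey     INT NOT NULL,
    CustomerKey INT NOT NULL REFERENCES DimCustomer (CustomerKey),
    SalesAmount DECIMAL(18,2) NOT NULL              -- the thing we actually measure
);

When a customer moves, you expire the current dimension row and insert a new one; existing fact rows keep pointing at whichever CustomerKey was current when the sale happened.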
How you model location depends on what it relates to. If it is an attribute of a sale, then it belongs in its own dim related to the sale. If it is an attribute of a customer (such as their home address), then it belongs in the customer dim. If the location is an attribute of both a sale and a customer, then it belongs in both.
I'm considering converting some Excel files I regularly update into a database. The files have a large number of columns. Unfortunately, many of the databases I am looking at, such as Access and PostgreSQL, have very low column limits. MySQL's is higher, but I'm worried that as my dataset expands I might break that limit as well.
Basically, I'm wondering what (open source) databases are effective at dealing with this type of problem.
For a description of the data: I have a number of Excel files (fewer than 10), each containing a particular piece of information on some firms over time. It totals about 100 MB in Excel files. The firms are in the columns (about 3,500 currently), and the dates are in the rows (about 270 currently, but switching to a higher frequency for some of the files could easily cause this to balloon).
The most important queries will likely be to get the data for each of the firms on a particular date and put it in a matrix. However, I may also run queries to get all the data for a particular firm for a particular piece of data over every date.
Changing dates to a higher frequency is also the reason that I'm not really interested in transposing the data (the 270 beats Access' limit anyway, but increasing the frequency would far exceed MySQL's column limits). Another alternative might be to change it so that each firm has its own Excel file (that way I limit the columns to fewer than 10), but that is quite unwieldy for the purposes of updating the data.
This seems to be begging to be split up!
How about using a schema like:
Firms
id
name
Dates
id
date
Data_Points
id
firm_id
date_id
value
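In SQL that might look like this (MySQL flavour; the types are guesses):

CREATE TABLE firms (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);

CREATE TABLE dates (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    date DATE NOT NULL UNIQUE
);

CREATE TABLE data_points (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    firm_id INT NOT NULL,
    date_id INT NOT NULL,
    value   DECIMAL(18,6) NULL,
    FOREIGN KEY (firm_id) REFERENCES firms (id),
    FOREIGN KEY (date_id) REFERENCES dates (id)
);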
This sort of decomposed schema will make reporting quite a bit easier.
For reporting, you can easily get a stream of all values with a query like:
SELECT firms.name, dates.date, data_points.value
FROM data_points
LEFT JOIN firms ON firms.id = data_points.firm_id
LEFT JOIN dates ON dates.id = data_points.date_id
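And the "all firms on one date" query you mention is just a matter of adding a filter (the date literal is only an example):

SELECT firms.name, data_points.value
FROM data_points
JOIN firms ON firms.id = data_points.firm_id
JOIN dates ON dates.id = data_points.date_id
WHERE dates.date = '2014-01-31';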
What I have is about 130 GB of time-varying state data for several thousand financial instruments' orderbooks.
The CSV files I have contain one row per change in the orderbook state (due to an executed trade, an inserted order, etc.). The state is described as: a few fields of general orderbook information (e.g. the ISIN code of the instrument), a few fields of information about the state change (such as orderType and time), and finally the buy and sell levels of the current state. There are up to 20 levels (buy level 1 representing the best buy price, sell level 1 representing the best sell price, and so on) of both sell and buy orders, and each of them consists of 3 fields (price, aggregated volume and order amount). Finally, there are an additional 3 fields of aggregated data for the levels beyond 20, for both the buy and sell side. This amounts to a total maximum of 21 * 2 * 3 = 126 fields of level data per state.
The problem is that since there are rarely anywhere near 20 levels, it doesn't seem to make sense to reserve fields for each of them. E.g. I'd have rows where there are 3 buy levels and the rest of the fields are empty. On the other hand, the same orderbook can have 7 buy levels a few moments later.
I will definitely normalize the general orderbook info into its own table, but I don't know how to handle the levels efficiently.
Any help would be much appreciated.
I have had to deal with exactly this structure of data at one point in time. One important question is how the data will be used. If you are only looking for the best bid and ask price at any given time, then the levels do not make much of a difference. If you are analyzing market depth, then the levels can be important.
For the volume of data you are using, other considerations such as indexing and partitioning may be more important. If the data you need for a particular query fits into memory, then it doesn't matter how large the overall table is.
My advice is to keep the different levels in the same record. Then, you can use page compression (depending on your storage engine) to eliminate most of the space reserved for the empty values. SQL Server does this automatically, so it was a no-brainer to put the levels in a single record.
A compromise solution, if page compression does not work, is to store a fixed number of levels. Five levels would typically be populated, so you wouldn't have the problem of wasted space on empty fields. And, that number of levels may be sufficient for almost all usage.
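As a rough illustration of that compromise (the column names are made up, and the DATA_COMPRESSION clause is SQL Server syntax, per the comment above):

CREATE TABLE OrderBookState (
    StateId    BIGINT IDENTITY(1,1) PRIMARY KEY,
    Isin       CHAR(12)     NOT NULL,
    EventTime  DATETIME2(3) NOT NULL,
    OrderType  TINYINT      NOT NULL,
    Buy1Price  DECIMAL(18,6) NULL, Buy1Volume  BIGINT NULL, Buy1Orders  INT NULL,
    Sell1Price DECIMAL(18,6) NULL, Sell1Volume BIGINT NULL, Sell1Orders INT NULL,
    -- ... repeat the three columns per side for levels 2 through 5 ...
    BuyRestPrice  DECIMAL(18,6) NULL, BuyRestVolume  BIGINT NULL, BuyRestOrders  INT NULL, -- aggregate beyond the stored levels
    SellRestPrice DECIMAL(18,6) NULL, SellRestVolume BIGINT NULL, SellRestOrders INT NULL
)
WITH (DATA_COMPRESSION = PAGE); -- page compression (which includes row compression) keeps the NULL level columns from taking space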
I'm currently developing a website where users can search for other users based on attributes (age, height, town, education, etc.). I now want to implement some kind of rating between user profiles. The rating is calculated via its own algorithm based on the similarity between the two given profiles. User A has a "match rating" of 85 with User B and 79 with User C, for example. B and C have a rating of 94, and so on...
The user should be able to search for certain attributes and filter the results by rating.
Since the rating differs from profile to profile and also depends on the user doing the search, I can't simply add a field to my users table and use ORDER BY. So far I came up with 2 solutions:
My first solution was to have a nightly batch job that calculates the rating for every possible user combination and stores it in a separate table (user1, user2, rating). I can then join this table with the user table and order the result by rating. After doing some math I figured that this solution doesn't scale that well.
Based on the formula n * (n - 1) / 2, there are 45 possible combinations for 10 users. For 1,000 users I suddenly have to insert 499,500 rating combinations into my rating table.
The second solution was to leave MySQL be and just calculate the rating on the fly within my application. This also doesn't scale well. Let's say the search should only return 100 results to the UI (with the highest rated on top). If I have 10,000 users and I want to do a search for every user living in New York sorted by rating, I have to load EVERY user that is living in NY into my app (let's say 3,000), apply the algorithm and then return only the top 100 to the user. This way I have loaded 2,900 useless user objects from the DB and wasted CPU on the algorithm without ever doing anything with the results.
Any ideas how I can design this in my MySQL db or web app so that a user can have an individual rating with every other user in a way that the system scales beyond a couple thousand users?
If you have to match every user against every other user, the algorithm is O(N^2), whatever you do.
If you can exploit some sort of 1-dimensional "metric", then you can try and associate each user with a single synthetic value. But that's awkward and could be impossible.
But what you can do is note which users have changed their profiles (whenever any of the parameters on which the matching is based changes). At that point you can batch-recalculate the table for those users only, thus working in O(N): if you have 10,000 users and only 10 require recalculation, you have to examine 100,000 records instead of 100,000,000.
Other strategies would be to only run the main algorithm for records which have the greater chance of being compared: in your example, "same city". Or, when updating records (this would require storing (user_1, user_2, ranking, last_calculated)), only recalculate those records with a high ranking, those that are very old, or those never calculated. The lowest-ranked matches aren't likely to change so much that they float to the top in a short time.
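A sketch of that "recalculate only what changed" bookkeeping (the user_match table, and the users.profile_updated_at column, are assumptions):

CREATE TABLE user_match (
    user_1          INT NOT NULL,
    user_2          INT NOT NULL,
    ranking         INT NOT NULL,
    last_calculated DATETIME NOT NULL,
    PRIMARY KEY (user_1, user_2)
);

-- pairs involving a user whose profile changed after the match was last computed
SELECT DISTINCT m.user_1, m.user_2
FROM user_match m
JOIN users u ON u.id IN (m.user_1, m.user_2)
WHERE u.profile_updated_at > m.last_calculated;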
UPDATE:
The problem also involves O(N^2) storage space.
How to reduce this space? I can see two approaches. One is to not put some information in the match table at all. The "match" function makes more sense the more rigid and steep it is; having ten thousand "good matches" would mean that matching means very little. We would still need lots of recalculations when User1 changes some key data, in case that brings some of User1's "no-no" matches back into the "maybe" zone, but we would keep a smaller set of active matches for each user.
Storage would still grow quadratically, but less steeply.
Another strategy would be to recalculate the match on the fly; then we would need some method for quickly selecting which users are likely to have a good match (thus limiting the number of rows retrieved by the JOIN), and some method to quickly calculate a match, which could entail somehow rewriting the match between User1 and User2 as a very simple function of a subset of DataUser1 and DataUser2 (maybe using ancillary columns).
The challenge would be to leverage MySQL capabilities and offload some calculations to the MySQL engine.
To this purpose, you might perhaps "map" some data at input time (therefore in O(k)) to spatial information, or to strings, and employ Levenshtein distance.
The storage for a single user would grow, but it would grow linearly, not quadratically, and MySQL SPATIAL indexes are very efficient.
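A very rough sketch of that idea, assuming users already has a NOT NULL POINT column profile_point derived from a couple of numeric profile attributes (MySQL 5.7+ syntax; all names here are assumptions):

CREATE SPATIAL INDEX idx_profile_point ON users (profile_point);

-- candidate matches: users whose mapped point falls near user 42's point
-- (the radius of 10.0 is arbitrary; tune it to your mapping)
SELECT u.id
FROM users u
JOIN users me ON me.id = 42
WHERE MBRContains(ST_Buffer(me.profile_point, 10.0), u.profile_point)
  AND u.id <> me.id;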
If the search should only return the top 100 best matches, then why not just store those? It sounds like you would never want to search the bottom end of the results anyway, so just don't calculate them.
That way, your storage space is only O(n) rather than O(n^2), and updates should be as well. If someone really wants to see matches past the first 100 (and you want to let them), then you have the option of running the query in real time at that point.
I agree with everything @Iserni says.
If you have a web app and users need to "login", then you might have an opportunity to create that user's rankings at that time and stash them into a temporary table (or rows in an existing table).
This will work in a reasonable amount of time (a few seconds) if all the data needed for the calculation fits into memory. The database engine should then be doing a full table scan and creating all the ratings.
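A sketch of that per-login step (compute_rating is a hypothetical stored function standing in for your algorithm, and users.city is also assumed):

SET @me = 42;  -- the logged-in user's id

CREATE TEMPORARY TABLE session_rating (
    other_user_id INT NOT NULL PRIMARY KEY,
    rating        INT NOT NULL
);

INSERT INTO session_rating (other_user_id, rating)
SELECT u.id, compute_rating(@me, u.id)
FROM users u
WHERE u.id <> @me;

-- later searches just join against the precomputed ratings
SELECT u.*, r.rating
FROM users u
JOIN session_rating r ON r.other_user_id = u.id
WHERE u.city = 'New York'
ORDER BY r.rating DESC
LIMIT 100;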
This should work reasonably well for one user logging in. Passably for two... but it is not going to scale very well if you have, say, a dozen users logging in within one second.
Fundamentally, though, your rating does not scale well. You have to do a comparison of all users to all users to get the results. Whether this is batch (at night) or real-time (when someone has a query) doesn't change the nature of the problem. It is going to use a lot of computing resources, and multiple users making requests at the same time will be a bottleneck.
I have read many strong statements here and elsewhere on the subject of storing arrays in MySQL. The rules of normalization seem to suggest it's a bad idea, and searching within the stored array fosters inelegant code. HOWEVER, for the application I am working on it seems like a reasonable solution to store an array in a field. I'm sure that is what everyone wrongly thinks in this position, but I can't figure out a better way. Here is the setup:
I have a series of tables that store registered students, courses they can take and their performance on each course. All are "normalized" to avoid duplication and errors. I want to be able to generate a "myCourses" section so after login the student sees courses they are eligible for and courses they have taken but are free to review. The approach that comes to mind is two arrays; my_eligible_courses and my_completed_courses. On registration, the student is given a set of courses for which they are eligible. This could be stored as rows where there are multiple occurrences of studentid, one for each course they can take:
student1 course 1
student1 course 2
student1 course n
The table could then be queried for all of student 1's eligible courses and displayed as a list when the student logs in.
Alternatively, studentid could be a primary key and in a column "eligible_courses" there would be an array (course 1, course 2, course n).
There is a table for student performance, to record every course taken and metrics associated with student performance. It will be queried to report on student performance, quality of course etc but this table will grow quite large. I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they login just to give them a list of completed courses.
One other complication is that the set of courses a student is eligible for is variable and expanding as new courses are developed, which to me seems to suggest that generating a set of new columns for each new course is a bad idea; for example, new course_name, pretest_score, posttest_score, time_to_complete, ... Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
So to restate the question: is it better to store an "inelegant" arrayed list of eligible and completed courses in the registered student table, or to dynamically generate these lists?
I'm guessing this is still too vague but any discussion of db design that gives an example of an inelegant array vs a restructured schema would be appreciated.
You should feel confident that if you have indexes on your tables for the appropriate columns, querying for my_completed_courses will be pretty snappy.
When your table grows to the point that you notice slowdown, you can configure your MySQL server with appropriate memory allocation settings so that it can keep more data cached in memory. Or you could look into that now.
In response to the edit you made about adding new courses: Don't add a new column for each course. Don't add a new table for each course. Create a table for courses, and add rows for each course.
You should then be able to join your tables together on indexed columns to generate the list of data you need.
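For example, the "myCourses" list then becomes a simple join (table and column names here are assumptions):

SELECT c.name
FROM student_course sc
JOIN course c ON c.id = sc.course_id
WHERE sc.student_id = 123;  -- the logged-in student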
This is a bad idea for two obvious reasons:
The DBMS can't enforce proper referential* (and possibly domain) integrity, and relying on application-level integrity is almost always a bad idea.
While the database will be able to answer the query: "based on given student, give me courses", you won't be able to (efficiently) go in the opposite direction, should you ever need to.
* What's to stop a buggy application from storing a non-existent ID in the array? Or deleting a course that is still referenced by students? Even if your application is careful about course deletion, there is no way to do it efficiently - you'll need a full table scan to examine all the arrays.
Why are you even trying this? A link (aka. junction) table would solve these problems, for a moderate cost of some additional storage space.
If you are really concerned about storage space, you could even switch the DBMS and use one that supports leading-edge index compression (such as Oracle).
"I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they login just to give them a list of completed courses."
Databases are very good at querying humongous amounts of data. In this case, if you use the clustering properly, the DBMS will be able to get this data in very few I/O operations, meaning very fast. Did you perform any actual benchmarks? Have you measured any actual performance problem?
"Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists."
Generating a new table may be justified in case it will have different columns. But, that doesn't sound like what you are trying to do.
It seems to me that you simply need a STUDENT_COURSE table with a COMPLETED flag and a check along these lines:
CHECK (
(COMPLETED = 0 AND (performance fields) IS NULL)
OR (COMPLETED = 1 AND (performance fields) IS NOT NULL)
)
When a student enrolls in a course, insert a row into STUDENT_COURSE, set COMPLETED to 0 and leave the performance fields NULL.
When the student completes the course, set COMPLETED to 1 and fill in the performance fields.
(BTW, you could even omit COMPLETED altogether and just rely on testing the performance fields for NULL.)
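Spelled out as a sketch (SCORE stands in for your actual performance fields, and the STUDENT and COURSE table names are assumptions):

CREATE TABLE STUDENT_COURSE (
    STUDENT_ID INT NOT NULL,
    COURSE_ID  INT NOT NULL,
    COMPLETED  TINYINT NOT NULL DEFAULT 0,
    SCORE      INT NULL,
    PRIMARY KEY (STUDENT_ID, COURSE_ID),
    FOREIGN KEY (STUDENT_ID) REFERENCES STUDENT (STUDENT_ID),
    FOREIGN KEY (COURSE_ID)  REFERENCES COURSE (COURSE_ID),
    CHECK (
        (COMPLETED = 0 AND SCORE IS NULL)
        OR (COMPLETED = 1 AND SCORE IS NOT NULL)
    )
);

(Note that MySQL only enforces CHECK constraints from 8.0.16 onwards; before that you'd enforce this rule in a trigger or in the application.)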
InnoDB tables are clustered, which means that rows in STUDENT_COURSE belonging to the same student are stored physically close together, which means that getting courses of the given student is extremely fast.
If you need to go in the opposite direction (get students of a given course), add an index on the same fields but in the opposite order: {COURSE_ID, STUDENT_ID}. You might even consider making it a covering index in this case.
Since we are talking about a small number of rows, leaving COMPLETED unindexed is just fine. If you are really concerned about that, you can even do something like:
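For instance, a separate table holding only the completed pairs (a sketch; the exact definition is up to you):

CREATE TABLE COMPLETED_STUDENT_COURSE (
    STUDENT_ID INT NOT NULL,
    COURSE_ID  INT NOT NULL,
    PRIMARY KEY (STUDENT_ID, COURSE_ID),
    FOREIGN KEY (STUDENT_ID, COURSE_ID)
        REFERENCES STUDENT_COURSE (STUDENT_ID, COURSE_ID)
);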
The COMPLETED_STUDENT_COURSE table is a B-tree for completed courses only (essentially a subset of STUDENT_COURSE, which is a B-tree for all enrolled courses).
Here are a few thoughts that I believe may assist you in making a good decision.
Generally, it is a rule to use correctly normalised tables. But there can be exceptions to this, and perhaps your project is one of them.
Most of the time, new developers tend to focus on getting the data into a DB. They get stuck when it comes to retrieving it for a specific purpose. So given both cases of arrays vs. relational tables, ask yourself if either method serves your purpose. For example, if you wanted to list the courses of student X, your array method is just fine, because you can retrieve it by a primary key like a student ID. But if you wanted to know how many students are on course A, the array method would be a horrible way to go.
Then again, the above point would depend on your data volume as well. For example, if you only have about a hundred students, you'll probably not notice a difference in performance. But if you're looking at several thousand records and you have a big list of courses for students, the array approach is not the way to go.
Benchmark. This is the best way for you to find your answer. You can use MySQL's EXPLAIN or just time the queries from your program. Try each method with your standard volume of data and see which one works best. For example, in the recent past, MySQL was boasting about the strength of their ISAM engine. Then I had to work on a large application that involved millions of records, and here I noticed that each time a new record came in, indexes had to be rebuilt. So we had to bend the rules. Likewise, you'd better do your tests with the correct volumes of data and make a better decision.
But do not take this example as a rule. Rather, go by the standards of normalisation and only bend the rules for exceptions.