I am collecting data and storing it in MySQL, for:
75 variables
55 countries
Each year
Since I am still building this tool, I have so far created a single table of variables / countries (storing one year's worth of data).
Next year (and for several years after that) a new set of data will be input for each country.
There are therefore three parameters controlling the data returned to a user reviewing all collected data. The general form of any query would be:
Show me these specific variables, for these specific countries, for these specific years.
(Show me average age and weight, for USA and Canada, for 2012 and 2009, for example)
My question is, it seems that I have two options for arranging this data:
- Multiple tables, where I create a country / variable table for each year in which data is collected
- A single table, where I simply add a column (field) for the year the data relates to
As far as I can tell I could make these database calls with either structure, but is one more powerful / efficient / quicker, and why?
Thanks for your consideration.
It's a PDO / PHP interface, if that is relevant.
Using a relational approach generally involves more tables. Queries may be slightly slower because of the joins (though probably not noticeably so in small databases), but the database is smaller because each fact is stored once. This makes it simpler to update information properly and thus ensure data integrity. For example, if Joe's address changes, you know it will change on every report that uses Joe's address.
Using fewer linked tables, where one field can be repeated multiple times, you risk discrepancies between copies of data that you would naturally expect to be equal. Access can be a bit faster if you arrange your tables properly, because your information is grouped according to how you access it.
For example, in the first method you would have an Orders table plus Supplier and Client tables that together make up a complete invoice, whereas in the second method you would put some Supplier and Client information in the Orders table itself, so that finding the row for the invoice number you are looking for returns the entire set of data you need (thus eliminating the joins on Supplier and Client and reducing load on the database server).
Edit: I think a better answer would require a bit more information about your data (samples for example).
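To make the single-table option from the question concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module so it can run anywhere; the table name country_data, the two variable columns and the numbers are placeholders of mine, not the asker's real schema or data. The same SELECT shape works in MySQL through PDO.

```python
import sqlite3

# Hypothetical schema: one row per (country, year), one column per variable.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE country_data (
        country    TEXT    NOT NULL,
        year       INTEGER NOT NULL,
        avg_age    REAL,
        avg_weight REAL,
        PRIMARY KEY (country, year)
    )
""")
conn.executemany(
    "INSERT INTO country_data VALUES (?, ?, ?, ?)",
    [
        ("USA", 2009, 1.0, 10.0),      # placeholder values, not real statistics
        ("USA", 2012, 2.0, 20.0),
        ("Canada", 2009, 3.0, 30.0),
        ("Canada", 2012, 4.0, 40.0),
        ("Mexico", 2012, 5.0, 50.0),
    ],
)

def fetch(conn, countries, years):
    """Return avg_age and avg_weight for the given countries and years."""
    country_marks = ", ".join("?" * len(countries))
    year_marks = ", ".join("?" * len(years))
    sql = (
        "SELECT country, year, avg_age, avg_weight FROM country_data "
        f"WHERE country IN ({country_marks}) AND year IN ({year_marks}) "
        "ORDER BY country, year"
    )
    return conn.execute(sql, [*countries, *years]).fetchall()

result = fetch(conn, ["USA", "Canada"], [2012, 2009])
```

With one table per year, the same request would need one query per year (or a UNION), and the list of tables would have to be computed in application code.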
Every month I get sent a file from an external company which needs to be stored in a database. Each file contains up to a million records. The main data fields are Month, Year, Postcode and TransactionType.
I was proposing that we should save the data in our database as a new SQL table each month, so we know there is only a finite amount of data in each table. However, one of my colleagues said he was once told that creating a new table every month is bad practice, but he didn't know why.
If I was to have multiple tables, there would only be a maximum of 60 tables, though there may be far fewer (down to 12) dependent on how far into the past my client needs to look. This means that every month I will need to delete a month's worth of data.
However when I do my SQL queries I will only need a single row of data from a single table per query. I would think in theory this would be more efficient than having a single table filled with millions of rows.
I was wondering if anyone had any definitive reasons as to why splitting the data this way would be a bad thing to do?
All "like" items should be stored together in a database for the following reasons:
You should be able to provide any subset of the items using a single SELECT statement only by changing the WHERE clause of that statement. With separate tables you will have to write code to decompose the request into the parts that compute the table name and the parts that filter that table. And you will have to duplicate that logic in each application, or teach it to each user, that wants to use your database.
You should not artificially limit the use to which your data can be put. If you have separate monthly tables you have already substantially limited the types of queries you can enter against them without having to write more complex UNION queries.
The addition of more instances of a known data type to your database should not require ALTERing the structure of your database and, as a general principle, regularly-run code should not even have ALTER permissions.
If proper indexes are maintained, there is very little performance difference when SELECTing data from a table 60 times the size of a smaller table. (There can be more effect on INSERT and UPDATE commands, but it sounds like you'll be doing a bulk update rather than updating the data constantly.)
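The points above can be sketched with a single table and a composite index. This is only an illustration under assumptions: the table and index names, the column types and the sample rows are invented, and it runs on Python's built-in sqlite3 rather than the asker's server. Looking up a row and dropping an expired month are both ordinary statements that differ only in their WHERE clause, with no table names to compute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        year             INTEGER NOT NULL,
        month            INTEGER NOT NULL,
        postcode         TEXT    NOT NULL,
        transaction_type TEXT    NOT NULL
    )
""")
# Composite index matching the lookup pattern (period first, then postcode).
conn.execute(
    "CREATE INDEX idx_period_postcode ON transactions (year, month, postcode)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        (2013, 1, "AB1 2CD", "sale"),      # made-up sample rows
        (2013, 2, "AB1 2CD", "refund"),
        (2013, 2, "EF3 4GH", "sale"),
    ],
)

def lookup(conn, year, month, postcode):
    """Fetch the transaction types for one postcode in one month."""
    return conn.execute(
        "SELECT transaction_type FROM transactions "
        "WHERE year = ? AND month = ? AND postcode = ?",
        (year, month, postcode),
    ).fetchall()

feb = lookup(conn, 2013, 2, "AB1 2CD")

# Retiring the oldest month is a DELETE, not a DROP TABLE.
deleted = conn.execute(
    "DELETE FROM transactions WHERE year = ? AND month = ?", (2013, 1)
).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
```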
I can think of only two reasons for sharding data into separate tables:
You discover that you have a performance issue that can't be resolved through better data design.
You have records with different level of security and are relying on GRANT SELECT permissions to allow some users to see the records at higher levels of security.
A simpler method would be to add a column to that table containing a date/time stamp of when each row was loaded into the system. That way you can filter on that particular column to segregate the data into the months/years in which it was loaded.
Another advantage, from a performance perspective, is that if you regularly filter data this way you can create an index on this date column.
Having multiple tables that contain the same kind of information is not recommended, both for performance reasons and because of how information is stored in SQL. Eventually it will take up more space, and if one month's data needs to reference another month's data it will be quite slow.
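A rough sketch of the load-timestamp idea, again on Python's built-in sqlite3. The table name records, the column name loaded_at and the rows are assumptions for illustration. The timestamp is stored as ISO 8601 text here, which sorts in date order; in MySQL a DATETIME column would play the same role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE records (
        postcode         TEXT NOT NULL,
        transaction_type TEXT NOT NULL,
        loaded_at        TEXT NOT NULL  -- ISO 8601 text: string order equals date order
    )
""")
conn.execute("CREATE INDEX idx_records_loaded_at ON records (loaded_at)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("AB1 2CD", "sale", "2013-01-31 23:10:00"),    # made-up sample rows
        ("AB1 2CD", "refund", "2013-02-01 00:05:00"),
        ("EF3 4GH", "sale", "2013-02-28 18:30:00"),
        ("EF3 4GH", "sale", "2013-03-01 00:00:00"),
    ],
)

def loaded_between(conn, start, end):
    """Rows loaded in the half-open interval [start, end)."""
    return conn.execute(
        "SELECT postcode, transaction_type FROM records "
        "WHERE loaded_at >= ? AND loaded_at < ? ORDER BY loaded_at",
        (start, end),
    ).fetchall()

february = loaded_between(conn, "2013-02-01", "2013-03-01")
```

A half-open range (>= start, < end) avoids having to know the last day or second of each month.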
Hope this helps.
If you think separate tables won't make your application difficult to manage, you can do it. But consider:
Will you need to change your SQL queries every month?
If a user needs a report that covers more than one month of data, what happens?
With partitioning, the DBMS splits your data into multiple pieces in physical storage, but you can still refer to all of them by the same table name. The DBMS works out which partition it should read. Performance isn't significantly different.
I am trying to take information from one MySQL table, perform a bunch of calculations on this data, and then put the results in a second MySQL table. What would be the best way of doing this (i.e. in MySQL itself, using python, etc.)?
My apologies for the vagueness, I'll try to be more specific. Table 1 has every meal that every person in my class eats, so each meal is a primary key, and other columns include the person and the number of calories. The primary key for Table 2 is the person, and another column is the percentage of total calories this person has eaten, out of the calories of the entire class. Another column is the percentage of total calories of this person's gender in the class. Every day, I want to take the new eating information, and use it to update the percentages in Table 2. (Thanks for the help!)
Assuming the calculations can be done in SQL (and percentages are definitely do-able), you have some choices.
The first, and academically correct, choice, is not to store this in a table at all. One of the principles of normalization is that you don't store duplicate or calculated values - instead, you calculate them as you need them.
This isn't just an academic concern - it avoids many silly bugs and anomalies, and it means your data is always up to date - you don't have to wait for your calculation query to run before you can use the data.
If the calculation is non-trivial and/or an essential part of the business domain, common practice is to create a database view, which behaves like a table when queried, but is actually calculated on the fly. This means that the business logic is encapsulated in the view, rather than repeated in multiple queries. You can go further, with materialized views etc. - but the basic principle is the same.
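As a sketch of the view approach applied to the calorie question, the following uses Python's built-in sqlite3. The table layout is an assumption: in particular, gender is stored on each meal row here only to keep the example short, and the names and calorie figures are invented. The view is recomputed on every read, so it reflects new meals immediately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meals (
        meal_id  INTEGER PRIMARY KEY,
        person   TEXT    NOT NULL,
        gender   TEXT    NOT NULL,   -- assumption: kept on the meal row for brevity
        calories INTEGER NOT NULL
    );

    -- Each person's share of the class total and of their gender's total.
    CREATE VIEW calorie_share AS
    SELECT
        m.person,
        100.0 * SUM(m.calories) / (SELECT SUM(calories) FROM meals)
            AS pct_of_class,
        100.0 * SUM(m.calories) /
            (SELECT SUM(g.calories) FROM meals g WHERE g.gender = m.gender)
            AS pct_of_gender
    FROM meals m
    GROUP BY m.person, m.gender;
""")
conn.executemany(
    "INSERT INTO meals (person, gender, calories) VALUES (?, ?, ?)",
    [
        ("Ann", "F", 600),    # made-up sample data
        ("Ann", "F", 400),
        ("Bob", "M", 2000),
        ("Cat", "F", 1000),
    ],
)

query = "SELECT person, pct_of_class, pct_of_gender FROM calorie_share ORDER BY person"
before = conn.execute(query).fetchall()

# A new meal shows up in the view without any refresh step.
conn.execute("INSERT INTO meals (person, gender, calories) VALUES ('Cat', 'F', 1000)")
after = dict((row[0], row[1]) for row in conn.execute(query))
```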
In some cases, where the volume of data is huge, or the calculations are time consuming, or you have calculations that are very hard to embed in a single SQL statement, it's common to create "aggregate tables" - this is what you are suggesting. You can populate these tables either by (scheduled) queries, or by using database triggers.
However, aggregate tables are a last resort - they make the solution much harder to maintain and debug - if the data is wrong, you don't have a single query to debug, you've got to follow the chain of logic all the way through.
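For comparison, a minimal sketch of an aggregate table refreshed by a scheduled job, using Python's built-in sqlite3. The table person_totals and the function refresh_person_totals are hypothetical names; the scheduler itself (cron or similar) is left out. It also shows the maintenance cost described above: between refreshes the stored totals are stale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE meals (
        meal_id  INTEGER PRIMARY KEY,
        person   TEXT    NOT NULL,
        calories INTEGER NOT NULL
    );
    CREATE TABLE person_totals (
        person         TEXT PRIMARY KEY,
        total_calories INTEGER NOT NULL
    );
""")

def refresh_person_totals(conn):
    """Rebuild the aggregate table in one transaction (what a daily job would run)."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM person_totals")
        conn.execute(
            "INSERT INTO person_totals (person, total_calories) "
            "SELECT person, SUM(calories) FROM meals GROUP BY person"
        )

def totals(conn):
    return conn.execute(
        "SELECT person, total_calories FROM person_totals ORDER BY person"
    ).fetchall()

conn.execute("INSERT INTO meals (person, calories) VALUES ('Ann', 500)")
refresh_person_totals(conn)

conn.execute("INSERT INTO meals (person, calories) VALUES ('Ann', 300)")
stale = totals(conn)   # still the old total until the next refresh
refresh_person_totals(conn)
fresh = totals(conn)
```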
Assuming you are in a class of a few dozen people, and are reporting on less than 10 million years of meals, any modern RDBMS can calculate this report in milliseconds - there's really no need to store it in an aggregate table.
A possible solution could be that you create a View or a Materialized View with the complex SELECT query behind it.
A Materialized View could be another option too, as you wrote that you would like these results re-queried/refreshed every day.
If you need to do more advanced operations on those tables, you could create a stored procedure and call it when you need its data.
Note: you can't do anything further with a procedure's result set (e.g. you can't call the procedure from a SELECT in order to join its result set) other than, say, storing it in a temporary table.
Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. Each of them needs a column. We need to be able to filter and sort on all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns since that is the most natural way to do things.
Options: My other options are a meta table that stores key and values. Or using a document based database so I have access to a variable schema and scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need a storage that can select, sort and filter really fast.
Update:
Some more information about usage. XML feeds will be generated live from this table. We are talking about 100 - 500 requests per hour, but this will be growing. The fields will not change regularly, but it could happen once every 6 months. We will also be updating the datafeeds daily, so checking whether items are updated, deleting old ones and adding new ones.
120 columns at 100k rows is not enough information, that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a Key-Value pair table. They are not great in a relational database, so stick to proper typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say 'we will be updating the datafeeds daily', then there shouldn't be any need to query the database every time. Although, 1000 per hour is only 17 per minute. That probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with. So it is necessary to analyze it in advance and pick the best table/column layout. If all 120 of your columns contain textual data, then a single row will take several kilobytes of disk space. In such a situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be: will each request extract a single row, a small bunch of rows or a big portion (up to the whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
setup a dictionary table, making an integer PK for it;
replace the actual value in a master table's column with PK from the dictionary.
The transformation is done by triggers written in C, so although it gives me upload penalty, I do have some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
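The dictionary-table transformation can be sketched as follows. Note the substitution: the answer does the lookup in triggers written in C, whereas this sketch does it in application code with Python's built-in sqlite3, purely to show the table shapes. The tables brand and item, the helper brand_id_for and the sample values are hypothetical. INSERT OR IGNORE is SQLite syntax; MySQL spells it INSERT IGNORE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dictionary table: each distinct value is stored once, keyed by an integer.
    CREATE TABLE brand (
        brand_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL UNIQUE
    );
    -- Master table stores the small integer key instead of the repeated text.
    CREATE TABLE item (
        item_id  INTEGER PRIMARY KEY,
        title    TEXT    NOT NULL,
        brand_id INTEGER NOT NULL REFERENCES brand (brand_id)
    );
""")

def brand_id_for(conn, name):
    """Return the dictionary key for a brand name, adding it if it is new."""
    conn.execute("INSERT OR IGNORE INTO brand (name) VALUES (?)", (name,))
    row = conn.execute(
        "SELECT brand_id FROM brand WHERE name = ?", (name,)
    ).fetchone()
    return row[0]

feed = [("Red mug", "Acme"), ("Blue mug", "Acme"), ("Teapot", "Globex")]
for title, brand in feed:
    conn.execute(
        "INSERT INTO item (title, brand_id) VALUES (?, ?)",
        (title, brand_id_for(conn, brand)),
    )

brand_count = conn.execute("SELECT COUNT(*) FROM brand").fetchone()[0]
acme_titles = [
    row[0]
    for row in conn.execute(
        "SELECT i.title FROM item i "
        "JOIN brand b ON b.brand_id = i.brand_id "
        "WHERE b.name = ? ORDER BY i.title",
        ("Acme",),
    )
]
```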
Third, try to split data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are used by all the queries, while the remaining 60-70% are evenly distributed among them and each is used only by some. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another one for the rest of the fields. In fact, you can have several "other" tables, logically grouping the data into separate tables.
In my practice we've had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive, as it was used in the majority of our reports (reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy), we managed to gain a pleasant performance boost (sorry, I don't have the numbers any longer).
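A small sketch of that kind of split, using Python's built-in sqlite3. All table and column names are invented for illustration, and only a read-only view is shown; the "rules" that made the original view writable are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Frequently used fields stay in a narrow master table.
    CREATE TABLE customer_core (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        status      TEXT NOT NULL
    );
    -- Rarely used fields move to a side table sharing the same key.
    CREATE TABLE customer_banking (
        customer_id  INTEGER PRIMARY KEY REFERENCES customer_core (customer_id),
        bank_account TEXT,
        comments     TEXT
    );
    -- The view presents the old wide shape to existing applications.
    CREATE VIEW customer AS
    SELECT c.customer_id, c.name, c.status, b.bank_account, b.comments
    FROM customer_core c
    LEFT JOIN customer_banking b ON b.customer_id = c.customer_id;
""")
conn.execute("INSERT INTO customer_core VALUES (1, 'Joe', 'active')")
conn.execute("INSERT INTO customer_core VALUES (2, 'Ann', 'closed')")
conn.execute("INSERT INTO customer_banking VALUES (1, 'ACC-0001', 'prefers email')")

# Reports that only need core fields scan the narrow table.
active = conn.execute(
    "SELECT name FROM customer_core WHERE status = 'active'"
).fetchall()

# Callers that want everything still see one wide row per customer.
wide = conn.execute("SELECT * FROM customer ORDER BY customer_id").fetchall()
```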
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.
I'm just thinking about MySQL database design and there are often situations where
A particular action is or is not carried out and consequently data is or is not stored in the database
Whether or not a user undertook a particular action is displayed statistically
An example of this would be:
A user does or does not fill out a survey. If they do fill out a survey, the data they provide is stored in the database. The total number of users who filled out the survey is displayed.
Now, in order to get the number of users who filled out the survey, we could either
create a field of type BOOL which is set to TRUE on survey completion; we then calculate the number of users who completed the survey using a simple COUNT(*) WHERE field=TRUE
calculate the number of users who filled out the survey using the data they provided by joining the users and survey results tables and grouping on the user
This isn't a particularly complex example, but there are cases where, without the BOOL flag, queries can become very complex and expensive. But the flag is an almost unnecessary addition to the database tables - we use it only for convenience. It also means we have to ensure that we UPDATE all user flags at the relevant time, as well as storing user data.
What would be your approach to this kind of problem? For smaller applications, I'll usually just write complex queries and cache their results (occasionally using views to make things more manageable). But in larger applications, with potentially many joins, I might be tempted to flag the users with an action field so that reads are simpler and cheaper.
The best solution is an indexed view (SQL Server terminology) or a materialized view (Oracle terminology) or a materialized query table (DB2 terminology). All those solutions keep the data up to date in real time. No maintenance.
When your platform doesn't support those kinds of database objects, you have to resort to using a table, along with all the other things necessary to keep the data right. You can keep the data right with
triggers
cron jobs
If you use triggers, you should probably also run a periodic cron job to make sure the data stored matches the data calculated.
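A minimal sketch of the trigger-plus-periodic-check combination, using Python's built-in sqlite3 (the trigger syntax differs slightly in MySQL). The table, column and trigger names are assumptions. The trigger sets the flag when a survey row is inserted, and the check query lists users whose stored flag disagrees with the data; a cron job could run that check and repair any rows it finds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id          INTEGER PRIMARY KEY,
        name             TEXT    NOT NULL,
        completed_survey INTEGER NOT NULL DEFAULT 0   -- the stored flag
    );
    CREATE TABLE survey_results (
        result_id INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL REFERENCES users (user_id),
        answer    TEXT
    );
    -- Keep the flag in step with the data as rows arrive.
    CREATE TRIGGER mark_survey_completed
    AFTER INSERT ON survey_results
    BEGIN
        UPDATE users SET completed_survey = 1 WHERE user_id = NEW.user_id;
    END;
""")
conn.executemany(
    "INSERT INTO users (user_id, name) VALUES (?, ?)",
    [(1, "Ann"), (2, "Bob"), (3, "Cat")],
)
conn.executemany(
    "INSERT INTO survey_results (user_id, answer) VALUES (?, ?)",
    [(1, "yes"), (3, "no")],
)

flagged = conn.execute(
    "SELECT COUNT(*) FROM users WHERE completed_survey = 1"
).fetchone()[0]

# Periodic check: users whose flag does not match what the data says.
CHECK_SQL = """
    SELECT u.user_id
    FROM users u
    WHERE u.completed_survey !=
          EXISTS (SELECT 1 FROM survey_results s WHERE s.user_id = u.user_id)
    ORDER BY u.user_id
"""
mismatches_ok = conn.execute(CHECK_SQL).fetchall()

# Simulate drift (for example a manual update that bypassed the trigger).
conn.execute("UPDATE users SET completed_survey = 1 WHERE user_id = 2")
mismatches_drift = conn.execute(CHECK_SQL).fetchall()
```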
It helps that, in the real world, most of these kinds of requirements really don't have to be up to date in real time. These kinds of numbers usually support management decisions; a lag of even a day is often acceptable. (In other words, it sometimes helps to think of it as a data warehouse problem or as a report rather than as an OLTP problem.) I've had to negotiate these kinds of requirements many times. I've never had anyone refuse to accept a two-hour update cycle. (But that's certainly application-dependent.)
"calculate the number of users . . . by joining the users and survey results tables and grouping on the user"
If you can join the users and the survey results tables, then the survey results table must have a user identifier, right? If that's right, you don't need to join those two tables to determine the number of users who completed a survey.
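That observation can be sketched as a single-table query. This assumes, hypothetically, a survey_results table with a user_id column and possibly several rows per user; it runs on Python's built-in sqlite3, but COUNT(DISTINCT ...) is standard SQL and works the same way in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survey_results (
        result_id INTEGER PRIMARY KEY,
        user_id   INTEGER NOT NULL,
        answer    TEXT
    )
""")
conn.executemany(
    "INSERT INTO survey_results (user_id, answer) VALUES (?, ?)",
    [(1, "q1: yes"), (1, "q2: no"), (3, "q1: yes")],   # made-up rows
)

# No join to the users table is needed: count each user once.
completed = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM survey_results"
).fetchone()[0]
```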
What you are describing is called a "denormalized view", i.e. a table that contains results which can be computed from other data already in the database. The reason to do this is indeed performance. Whether to do it or not depends on the cost of (re-)generating the data, the effort required in your code to keep it coherent, and the extra database space needed to store the computed values.
I have a question about table design and performance. I have a number of analytical machines that produce varying amounts of data (which have been stored in text files up to this point via the dos programs which run the machines). I have decided to modernise and create a new database to store all the machine results in.
I have created separate tables to store results by type e.g. all results from the balance machine get stored in the balance results table etc.
I have a common results table format for each machine which is as follows:
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
AnalyteName
UnitOfMeasure
Value
A typical ClientRequest might have 50 samples which need to be tested by various machines. Each machine records only 1 line per sample, so there are approximately 50 rows per table associated with any given ClientRequest.
This is fine for all machines except one!
It measures 20-30 analytes per sample (and just spits them out in one long row), whereas with all the other machines I am only ever measuring 1 analyte per RequestID/SampleNumber.
If I stick to this format, this machine will generate over a million rows per year, because every sample can have as many as 30 measurements.
My other tables will only grow at a rate of 3000-5000 rows per year.
So after all that, my question is this:
Am I better off sticking to the common format for this table and having bucket loads of rows, or is it better to just add extra columns to represent each analyte, such that it would generate only 1 row per sample (like the other tables)? The machine can only ever measure a max of 30 analytes (and at $250k per machine, I won't be getting another in my lifetime).
All I am worried about is reporting performance and online editing. In both cases, the PK: RequestID and SampleNumber remain the same, so I guess it's just a matter of what would load quicker. I know the multiple column approach is considered woeful from a design perspective, but would it yield better performance in this instance?
BTW the database is MS Jet / Access 2010
Any help would be greatly appreciated!
Millions of rows in a Jet/ACE database are not a problem if the rows have few columns.
However, my concern is how these records are inserted -- is this real-time data collection? If so, I'd suggest this is probably more than Jet/ACE can handle reliably.
I'm an experienced Access developer who is a big fan of Jet/ACE, but from what I know about your project, if I was starting it out, I'd definitely choose a server database from the get go, not because Jet/ACE likely can't handle it right now, but because I'm thinking in terms of 10 years down the road when this app might still be in use (remember Y2K, which was mostly a problem of apps that were designed with planned obsolescence in mind, but were never replaced).
You can decouple the AnalyteName column from the 'common results' table:
-- Table Common Results
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
UnitOfMeasure
Value
-- Table Results Analyte
ClientRequestID PK
SampleNumber PK
AnalyteName
You join on the PK (Request + Sample). That way you don't duplicate all the rest of the rows needlessly, can avoid the join in queries that don't need the AnalyteName, can support extra analytes, and the design is saner overall. Unless you really start having a performance problem, this is the approach I'd follow.
Heck, even if you start having performance problems, I'd first move to a real database to see if that fixes the problems before adding columns to the results table.
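On the reporting worry from the question: if the measurements are kept one per row, a one-row-per-sample report can still be produced at query time. The sketch below is only an illustration on Python's built-in sqlite3, not Jet/ACE. The table name, the analyte names "A1" and "A2", the values, and the choice to include the analyte name in the primary key are all assumptions of mine. It uses conditional aggregation; Access has its own crosstab (TRANSFORM) queries for the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE multi_results (
        client_request_id INTEGER NOT NULL,
        sample_number     INTEGER NOT NULL,
        analyte_name      TEXT    NOT NULL,
        unit_of_measure   TEXT,
        value             REAL,
        -- assumption: analyte name is part of the key so a sample can hold many rows
        PRIMARY KEY (client_request_id, sample_number, analyte_name)
    )
""")
conn.executemany(
    "INSERT INTO multi_results VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "A1", "mg/L", 0.5),   # placeholder measurements
        (1, 1, "A2", "mg/L", 1.5),
        (1, 2, "A1", "mg/L", 0.7),   # sample 2 has no A2 measurement
    ],
)

# Pivot the narrow rows into one wide row per sample for reporting.
report = conn.execute("""
    SELECT
        client_request_id,
        sample_number,
        MAX(CASE WHEN analyte_name = 'A1' THEN value END) AS a1,
        MAX(CASE WHEN analyte_name = 'A2' THEN value END) AS a2
    FROM multi_results
    GROUP BY client_request_id, sample_number
    ORDER BY client_request_id, sample_number
""").fetchall()
```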