I'm considering converting some Excel files I regularly update to a database. The files have a large number of columns. Unfortunately, many of the databases I am looking at, such as Access and PostgreSQL, have fairly low column limits. MySQL's is higher, but I'm worried that as my dataset expands I might break that limit as well.
Basically, I'm wondering what (open source) databases are effective at dealing with this type of problem.
To describe the data: I have a number of Excel files (fewer than 10), each containing a particular piece of information on some firms over time. It totals about 100 MB of Excel files. The firms are in the columns (about 3,500 currently) and the dates are in the rows (about 270 currently, but switching to a higher frequency for some of the files could easily cause this to balloon).
The most important queries will likely be to get the data for each of the firms on a particular date and put it in a matrix. However, I may also run queries to get all the values of a particular piece of data for a particular firm over every date.
Switching to a higher frequency is also the reason I'm not really interested in transposing the data (the 270 already exceeds Access's limit, and increasing the frequency would far exceed MySQL's column limit as well). Another alternative might be to give each firm its own Excel file (that way I limit the columns to fewer than 10), but that is quite unwieldy for the purposes of updating the data.
This seems to be begging to be split up!
How about using a schema like:
Firms
    id
    name

Dates
    id
    date

Data_Points
    id
    firm_id
    date_id
    value
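A minimal sketch of those tables in SQL (MySQL-flavoured; the column types are assumptions):

CREATE TABLE Firms (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

CREATE TABLE Dates (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    date DATE NOT NULL
);

CREATE TABLE Data_Points (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    firm_id INT NOT NULL,
    date_id INT NOT NULL,
    value   DOUBLE,
    FOREIGN KEY (firm_id) REFERENCES Firms (id),
    FOREIGN KEY (date_id) REFERENCES Dates (id)
);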
This sort of decomposed schema will make reporting quite a bit easier.
For reporting you can easily get a stream of all values with a query like
SELECT firms.name, dates.date, data_points.value
FROM data_points
LEFT JOIN firms ON firms.id = data_points.firm_id
LEFT JOIN dates ON dates.id = data_points.date_id
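And the "all firms on one date" pull described in the question might look like this (a sketch; the date literal is only an example):

-- All firms' values on a single date, ready to be pivoted into a matrix
SELECT firms.name, data_points.value
FROM data_points
JOIN firms ON firms.id = data_points.firm_id
JOIN dates ON dates.id = data_points.date_id
WHERE dates.date = '2012-06-30'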
The scenario is as follows:
There is a table with hundreds of businesses, let's say each business creates its own daily data of 100 to 200 rows.
For example, each business receives 150 orders per day. These data are recorded in the database.
At the end of each month, since the data created by the businesses will have enlarged the database, I expect that producing a statistical report for a business will take longer.
For example, firm A produced 3000 rows of data at the end of the month, while firm B produced 4000 rows.
At the end of the year, the number of rows produced by Company A will be 36000, and the number of rows produced by Company B will be 48000.
When businesses want to see their monthly earnings and how many sales they made, it doesn't make sense for them to walk through a database of 84,000 rows one by one and calculate the desired figures. And I only gave this example for two businesses; imagine if there were hundreds.
I have thought of a solution and would like to check whether it is sensible, because I do not know how this is handled in large applications.
While the businesses are producing their daily data, I can update a statistical summary every day by aggregating the data they produce and keeping it in a single row in a separate table.
At the end of the month, a business could then quickly find its relevant row without having to scan through thousands of rows.
What is the best approach for calculating this kind of data over large tables?
In the scheme of things, 84,000 records in a table is nothing to worry about; you can process that just fine with most RDBMS engines. If you are talking about millions of records, then you most likely want to look at data warehouse strategies. SQL Server has something called Analysis Services which can take your data and create pre-calculated aggregates so that report computations are faster; I expect most database vendors have something similar.
For transaction processing, which is what your original database will be doing (OLTP - Online Transaction Processing), you want data in 3rd Normal Form. For analytical purposes (OLAP - Online Analytical Processing) you want denormalised data, as the aggregations can be done faster if you don't need to jump across multiple tables such as Customer->Territory->Region or Variant->Product->Type->Supertype.
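To illustrate the pre-aggregation idea from the question, a rough MySQL-flavoured sketch might be (all table and column names here are invented):

-- One row per business per day
CREATE TABLE daily_sales_summary (
    business_id  INT NOT NULL,
    sales_date   DATE NOT NULL,
    order_count  INT NOT NULL,
    total_amount DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (business_id, sales_date)
);

-- A nightly job rolls up yesterday's orders into one row per business
INSERT INTO daily_sales_summary (business_id, sales_date, order_count, total_amount)
SELECT business_id, DATE(created_at), COUNT(*), SUM(amount)
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL 1 DAY
  AND created_at <  CURRENT_DATE
GROUP BY business_id, DATE(created_at);

The monthly report then only has to sum roughly 30 summary rows per business instead of scanning every order.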
I am collecting data and storing it in MySQL, for:
75 variables
55 countries
Each year
At this stage, since I am still building this tool, I have created a single table of variables / countries (storing one year's worth of data).
Next year (and for several years after that) a new set of data will be input for each country.
There are therefore three variables controlling the data returned to a user reviewing all collected data. The general form of any query would be:
Show me these specific variables, for these specific countries, for these specific years.
(Show me average age and weight, for USA and Canada, for 2012 and 2009, for example)
My question is, it seems that I have two options for arranging this data:
- Multiple tables, where I create a table of country / variable for each year data is collected
- Single table, where I simply add a column (field) for the year that the data relates to.
As far as I can tell I could make these database calls with either structure, but is one more powerful / efficient / quicker, and why?
Thanks for your consideration.
It's a PDO / PHP interface if that is relevant.
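To make the single-table option concrete, this is roughly what I have in mind (the table name and variable columns such as avg_age are just placeholders):

-- One row per country per year; each variable is a column
CREATE TABLE country_stats (
    country    VARCHAR(100) NOT NULL,
    year       SMALLINT NOT NULL,
    avg_age    DECIMAL(5,2),
    avg_weight DECIMAL(6,2),
    -- ...the remaining variables as further columns...
    PRIMARY KEY (country, year)
);

-- "Show me average age and weight, for USA and Canada, for 2012 and 2009"
SELECT country, year, avg_age, avg_weight
FROM country_stats
WHERE country IN ('USA', 'Canada')
  AND year IN (2009, 2012);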
Using a relational approach generally involves more tables. This translates into queries being a bit slower (though probably not noticeably so in small databases) and the database size being smaller. It also makes it simpler to update information properly and thus ensure data integrity. For example, if Joe's address changes you know it will be changed on all reports using Joe's address.
Using fewer, less-linked tables, where one field can be repeated multiple times, you risk disparities between data in different tables where you would naturally expect it to be equal. Access speed should be a bit faster if you arrange your tables properly, because your information will be grouped according to how you access it.
For example, in the first method you would have an Orders table plus Supplier and Client tables to make a complete invoice, whereas in the second method you would put some of the Supplier and Client information directly in the Orders table, so that finding the row corresponding to the invoice number you are looking for returns the entire set of data you need (thus eliminating the need for joins on Supplier and Client and reducing load on the database server).
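To illustrate the second method with made-up names, the Orders table might carry copies of the client and supplier details it needs for the invoice:

-- Denormalised: each order row repeats the client/supplier details it needs
CREATE TABLE Orders (
    invoice_number   INT PRIMARY KEY,
    order_date       DATE NOT NULL,
    client_name      VARCHAR(100),
    client_address   VARCHAR(255),
    supplier_name    VARCHAR(100),
    supplier_address VARCHAR(255),
    total_amount     DECIMAL(12,2)
);

-- One lookup by invoice number returns everything needed, with no joins
SELECT *
FROM Orders
WHERE invoice_number = 1001;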
Edit: I think a better answer would require a bit more information about your data (samples for example).
I want to build a MySQL database for storing the ranking of a game every 1h.
Since this database will become quite large in a short time, I figured it's important to have a proper design. Therefore, some advice would be greatly appreciated.
In order to keep it as small as possible, I decided to log only the first 1500 positions of the ranking. Every ranking of a player holds the following values:
ranking position, playername, location, coordinates, alliance, race, level1, level2, points1, points2, points3, points4, points5, points6, date/time
My approach was simply to grab all values for each of the top 1,500 players every hour with a PHP script and insert them into MySQL, one row per player. So every day the table will grow by 36,000 rows. I will have a second script that deletes every row older than 28 days, otherwise the database would get insanely huge. Both scripts will run as cron jobs.
The following queries will be performed on this data:
The most important one is simply the query for a certain name. It should return all stats for the player for every hour as an array.
The second is a query that returns all players that didn't gain points1 during a certain time period counting back from the latest entry. This should return a list of players that didn't gain points (over the last 24h, for example).
The third is a query that lists all players that lost a certain amount of points2 or more in a certain time period counting back from the latest entry.
The queries shouldn't take a lifetime, so I thought I should probably index playernames, points1 and points2.
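For example, I imagine the second query would look something like this (the table and column names are placeholders, and it assumes points1 never decreases within the window):

-- Players whose points1 did not increase over the last 24 hours
SELECT playername
FROM ranking_log
WHERE recorded_at >= NOW() - INTERVAL 24 HOUR
GROUP BY playername
HAVING MAX(points1) = MIN(points1);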
Is my approach to this acceptable or will I run into a performance/handling disaster? Is there maybe a better way of doing this?
Here is where you risk a performance problem:
Your indexes will speed up your reads but will considerably slow down your writes, especially since your DB will have over 1 million rows in that one table at any given time. Since your writes happen via cron, you should be okay as long as you insert your 1,500 rows in batches rather than making one round trip to the DB for every row. I'd also look into query compiling (e.g. prepared statements) so that you save that overhead as well.
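For example, a single batched INSERT from the PHP script might look roughly like this (the table name and column list are assumptions, trimmed for brevity):

-- One round trip for many rows instead of 1,500 separate INSERTs
INSERT INTO ranking_log
    (position, playername, location, alliance, race, points1, points2, recorded_at)
VALUES
    (1, 'PlayerA', 'North', 'AllianceX', 'Elf',   123456, 7890, NOW()),
    (2, 'PlayerB', 'South', 'AllianceY', 'Human', 120000, 7500, NOW());
-- ...and so on, up to the full 1,500 rows (or a few batches of a few hundred)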
Ranhiru Cooray is correct: you should only store data like the player name once in the DB. Create a players table and use its primary key to reference the player in your ranking table. The same goes for location, alliance and race. I'm guessing those are more or less enumerated values that you can store in another table to normalize your design, and they can be returned in your results with the appropriate JOINs. Normalizing your data will reduce the amount of redundant information in your database, which will decrease its size and increase its performance.
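For instance (names are purely illustrative):

-- Players stored once, referenced by id from the hourly ranking rows
CREATE TABLE players (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL UNIQUE
);

CREATE TABLE rankings (
    player_id   INT NOT NULL,
    position    INT NOT NULL,
    points1     INT NOT NULL,
    points2     INT NOT NULL,
    recorded_at DATETIME NOT NULL,
    PRIMARY KEY (player_id, recorded_at),
    FOREIGN KEY (player_id) REFERENCES players (id)
    -- location, alliance and race would reference their own lookup tables the same way
);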
Your design may also be flawed regarding the ranking position. Can that not be calculated by the DB when you select your rows? If not, can it be done by PHP? It's the same as with invoice tables: you never store the invoice total because it is redundant; the items/pricing/etc. can be used to calculate the order totals.
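If your MySQL version supports window functions (8.0+), the position could be derived at query time rather than stored, something like this (names follow the sketch above; the timestamp is just an example):

-- Derive the ranking position for one hourly snapshot instead of storing it
SELECT p.name,
       r.points1,
       RANK() OVER (ORDER BY r.points1 DESC) AS position
FROM rankings AS r
JOIN players  AS p ON p.id = r.player_id
WHERE r.recorded_at = '2012-01-01 12:00:00';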
With all the adding/deleting, I'd be sure to run OPTIMIZE TABLE frequently and keep good backups. MySQL tables, if using MyISAM, can become corrupted easily in heavy write/delete scenarios. InnoDB tends to fare a little better in those situations.
Those are some things to think about. Hope it helps.
I have a database called RankHistory that is populated daily with each user's username and rank for the day (rank as in 1,2,3,...). I keep logs going back 90 days for every user, but my user base has grown to the point that the MySQL database holding these logs is now in excess of 20 million rows.
This data is recorded solely for the use of generating a graph showing how a user's rank has changed for the past 90 days. Is there a better way of doing this than having this massive database that will keep growing forever?
How great is the need for historic data in this case? My first thought would be to truncate data older than a certain threshold, or move it to an archive table that doesn't require as frequent or fast access as your current data.
You also mention keeping 90 days of data per user, but the data is only used to show a graph of changes to rank over the past 30 days. Is the extra 60 days' data used to look at changes over previous periods? If it isn't strictly necessary to keep that data (or at least not keep it in your primary data store, as per my first suggestion), you'd neatly cut the quantity of your data by two-thirds.
Do we have the full picture, though? If you have a daily record per user, and keep 90 days on hand, you must have on the order of a quarter-million users if you've generated over twenty million records. Is that so?
Update:
Based on the comments below, here are my thoughts: If you have hundreds of thousands of users, and must keep a piece of data for each of them, every day for 90 days, then you will eventually have millions of pieces of data - there's no simple way around that. What you can look into is minimizing that data. If all you need to present is a calculated rank per user per day, and assuming that rank is simply a numeric position for the given user among all users (an integer between 1 and 200,000, for example), storing twenty million such records should not put unreasonable strain on your database resources.
So, what precisely is your concern? Sheer data size (i.e. hard-disk space consumed) should be relatively manageable under the scenario above. You should be able to handle performance via indexes, to a certain point, beyond which the data truncation and partitioning concepts mentioned can come into play (keep blocks of users in different tables or databases, for example, though that's not an ideal design...)
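For example, a composite index along these lines would let the per-user graph query be served largely from the index (the date column name recorded_on is an assumption):

CREATE INDEX idx_rankhistory_user_date ON RankHistory (username, recorded_on);

-- The 90-day graph for one user
SELECT recorded_on, `rank`
FROM RankHistory
WHERE username = 'some_user'
  AND recorded_on >= CURDATE() - INTERVAL 90 DAY
ORDER BY recorded_on;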
Another possibility (though the specifics are somewhat beyond my realm of expertise) is that you seem to have an ideal candidate for an OLAP cube here: you have a fact (rank) that you want to view in the context of two dimensions (user and date). There are tools out there for managing this sort of scenario efficiently, even on very large datasets.
Could you run an automated task like a cron job that checks the database every day or week and deletes entries that are more than 90 days old?
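For example, the scheduled job could simply run something like this (again, the date column name is an assumption):

-- Daily cleanup of anything older than 90 days
DELETE FROM RankHistory
WHERE recorded_on < CURDATE() - INTERVAL 90 DAY;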
Another option: can you create some "roll-up" aggregate per user based on whatever the criteria are... counts, sales, whatever, all stored per user + date of activity? Then you could keep your pre-aggregated rollups in a much smaller table for however long a history you need. Triggers or nightly procedures can run a query for the day and append the results to the daily summary. Your queries and graphs can then go against that without running into performance issues. This would also help ease moving such records off to a historical database archive.
-- uh... oops... that's what it sounded like you WERE doing and STILL had 20 million+ records... is that correct? That would mean you're dealing with about 220,000+ users???
20,000,000 records / 90 days = about 222,222 users
EDIT -- from feedback.
Having 222k+ users, I would seriously consider how important the "Ranking" really is when you have someone in 222,222nd place. I would pare the daily ranking down to, say, the top 1,000. Again, I don't know the importance, but if someone doesn't make the top 1,000, does it really matter???
I have a question about table design and performance. I have a number of analytical machines that produce varying amounts of data (which has been stored in text files up to this point via the DOS programs which run the machines). I have decided to modernise and create a new database to store all the machine results in.
I have created separate tables to store results by type e.g. all results from the balance machine get stored in the balance results table etc.
I have a common results table format for each machine which is as follows:
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
AnalyteName
UnitOfMeasure
Value
A typical ClientRequest might have 50 samples which need to be tested by various machines. Each machine records only 1 line per sample, so there are approximately 50 rows per table associated with any given ClientRequest.
This is fine for all machines except one!
It measures 20-30 analytes per sample (and just spits them out in one long row), whereas with all the other machines I am only ever measuring 1 analyte per RequestID/SampleNumber.
If I stick to this format, this machine will generate over a million rows per year, because every sample can have as many as 30 measurements.
My other tables will only grow at a rate of 3000-5000 rows per year.
So after all that, my question is this:
Am I better off sticking to the common format for this table and having bucketloads of rows, or is it better to just add extra columns to represent each analyte, so that it generates only 1 row per sample (like the other tables)? The machine can only ever measure a max of 30 analytes (and at $250k per machine, I won't be getting another in my lifetime).
All I am worried about is reporting performance and online editing. In both cases the PK (RequestID and SampleNumber) remains the same, so I guess it's just a matter of which would load quicker. I know the multiple-column approach is considered woeful from a design perspective, but would it yield better performance in this instance?
BTW the database is MS Jet / Access 2010
Any help would be greatly appreciated!
Millions of rows in a Jet/ACE database are not a problem if the rows have few columns.
However, my concern is how these records are inserted -- is this real-time data collection? If so, I'd suggest this is probably more than Jet/ACE can handle reliably.
I'm an experienced Access developer who is a big fan of Jet/ACE, but from what I know about your project, if I were starting it out, I'd definitely choose a server database from the get-go, not because Jet/ACE couldn't handle it right now, but because I'm thinking in terms of 10 years down the road when this app might still be in use (remember Y2K, which was mostly a problem of apps that were designed with planned obsolescence in mind but were never replaced).
You can decouple the AnalyteName column from the 'common results' table:
-- Table Common Results
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
UnitOfMeasure
Value

-- Table Results Analyte
ClientRequestID PK
SampleNumber PK
AnalyteName
You join on the PK (Request + Sample). That way you don't duplicate all the rest of the data needlessly, you can avoid the join in queries where you don't need the AnalyteName, you can support extra analytes, and it is overall saner. Unless you really start having a performance problem, this is the approach I'd follow.
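The join would then look roughly like this (an Access-flavoured sketch; I've assumed table names without spaces):

SELECT cr.ClientRequestID, cr.SampleNumber, cr.MeasureDtTm,
       ra.AnalyteName, cr.UnitOfMeasure, cr.Value
FROM CommonResults AS cr
INNER JOIN ResultsAnalyte AS ra
    ON (ra.ClientRequestID = cr.ClientRequestID)
   AND (ra.SampleNumber = cr.SampleNumber);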
Heck, even if you start having performance problems, I'd first move to a real database to see if that fixes the problems before adding columns to the results table.