I want to create a table whose name is a date. When I gather stock data for that day, I want to store it like this:
$date = date('Y-m-d');
$mysqli->query(
    "CREATE TABLE IF NOT EXISTS `$date` (ID INT PRIMARY KEY)"
);
That way I will have a database like:
2013-05-01: AAPL | 400 | 400K
            MSFT |  30 | 1M
            GOOG | 700 | 2M
2013-05-02: ...
I think it would be easier to store information like this, but I see a similar question to this was closed.
How to add date to MySQL table name?
"Generating more and more tables is exactly the opposite of "keeping
the database clean". A clean database is one with a sensible,
normalized, fixed schema which you can run queries against."
If this is not the right way to do it, could someone suggest what would be? Many people commenting on that question stated that this was not a "clean" solution.
Do not split your data into several tables. This will become a maintenance nightmare, even though it might seem sensible to do so at first.
I suggest you create a date column that holds the information you currently want to put into the table name. Databases are pretty clever in storing dates efficiently. Just make sure to use the right datatype, not a string. By adding an index to that column you will also not get a performance penalty when querying.
What you gain is full flexibility in querying. There will be virtually no limits to the data you can extract from a table like this. You can join with other tables based on date ranges etc. This will not be possible (or at least much more complicated and cumbersome) when you split the data by date into tables. For example, it will not even be easy to just get the average of some value over a week, month or year.
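As a minimal sketch (the table and column names here are made up, based on the sample data in the question), that single table could look like this:

CREATE TABLE stock_quotes (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
    symbol     VARCHAR(10)     NOT NULL,
    price      DECIMAL(10,2)   NOT NULL,
    volume     BIGINT UNSIGNED NOT NULL,
    quote_date DATE            NOT NULL,  -- the value you wanted to put in the table name
    PRIMARY KEY (id),
    KEY idx_quote_date (quote_date)       -- the index keeps date-range filters cheap
);

-- Queries that are painful with one table per day become trivial,
-- e.g. the average price of one symbol over a month:
SELECT AVG(price)
FROM stock_quotes
WHERE symbol = 'AAPL'
  AND quote_date BETWEEN '2013-05-01' AND '2013-05-31';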
If the data grows dramatically some time in the future (that depends on the real amount of data you will collect; I would estimate more than several million rows), you can have a look at the data partitioning features MySQL offers out of the box. However, I would not advise using them immediately, unless you already have a clearly cut growth model for the data.
In my experience there is seldom a real need for this technique. I have worked with tables in the hundreds-of-gigabytes range, with tables having millions of rows. When the data gets huge, it is all a matter of good indexing and carefully crafted queries.
Related
I am in a situation where I need to store data for 1900+ cryptocurrencies every minute; I use MySQL with InnoDB.
Currently, the table looks like this
coins_minute_id | coins_minute_coin_fk | coins_minute_usd | coins_minute_btc | coins_minute_datetime | coins_minute_timestamp
coins_minute_id        = auto-increment id
coins_minute_coin_fk   = MEDIUMINT UNSIGNED
coins_minute_usd       = DECIMAL(20,6)
coins_minute_btc       = DECIMAL(20,8)
coins_minute_datetime  = DATETIME
coins_minute_timestamp = TIMESTAMP
The table grows incredibly fast: every minute, 1900+ rows are added to it.
The data will be used for historical price display as a D3.js line graph for each cryptocurrency.
My question is how to best optimize this database. I have thought of collecting the data only every 5 minutes instead of every minute, but it would still add up to a lot of data in no time. I have also wondered whether it would be better to create a separate table for each cryptocurrency. Do any of you who love to design databases know a smarter, cleverer way to handle something like this?
Kind regards
(From Comment)
SELECT coins_minute_coin_fk, coins_minute_usd
FROM coins_minutes
WHERE coins_minute_datetime >= DATE_ADD(NOW(),INTERVAL -1 DAY)
AND coins_minute_coin_fk <= 1000
ORDER BY coins_minute_coin_fk ASC
Get rid of coins_minute_ prefix; it clutters the SQL without providing any useful info.
Don't specify the time twice -- there are simple functions to convert between DATETIME and TIMESTAMP. Why do you have both a 'created' and an 'updated' timestamp? Are you doing UPDATE statements? If so, then the code is more complicated than simply "inserting", and you need a unique key to know which row to update.
Provide SHOW CREATE TABLE; it is more descriptive than what you provided.
30 inserts/second is easily handled. 300/sec may have issues.
Do not PARTITION the table without some real reason to do so. The common valid reason is that you want to delete 'old' data periodically. If you are deleting after 3 months, I would build the table with PARTITION BY RANGE(TO_DAYS(...)) and use weekly partitions. More discussion: http://mysql.rjweb.org/doc.php/partitionmaint
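If you ever do have that reason, a hypothetical weekly-partitioned layout could look roughly like this (column names are shortened as suggested above, and the date ranges are made up):

CREATE TABLE coins_minutes (
    coin_fk MEDIUMINT UNSIGNED NOT NULL,
    usd     DECIMAL(20,6) NOT NULL,
    btc     DECIMAL(20,8) NOT NULL,
    dt      DATETIME NOT NULL
)
PARTITION BY RANGE (TO_DAYS(dt)) (
    PARTITION p2018_01_08 VALUES LESS THAN (TO_DAYS('2018-01-08')),
    PARTITION p2018_01_15 VALUES LESS THAN (TO_DAYS('2018-01-15')),
    PARTITION p2018_01_22 VALUES LESS THAN (TO_DAYS('2018-01-22')),
    PARTITION p_future    VALUES LESS THAN MAXVALUE
);

-- Deleting an old week is then a cheap metadata operation:
ALTER TABLE coins_minutes DROP PARTITION p2018_01_08;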
Show us the queries. A schema cannot be optimized without knowing how it will be accessed.
"Batch" inserts are much faster than single-row INSERT statements. This can be in the form of INSERT INTO x (a,b) VALUES (1,2), (11,22), ... or LOAD DATA INFILE. The latter is very good if you already have a CSV file.
Does your data come from a single source? Or 1900 different sources?
MySQL and MariaDB are probably identical for your task. (Again, need to see queries.) PDO is fine for either; no recoding needed.
After seeing the queries, we can discuss what PRIMARY KEY to have and what secondary INDEX(es) to have.
1 minute vs 5 minutes? Do you mean that you will gather only one-fifth as many rows in the latter case? We can discuss this after the rest of the details are brought out.
That query does not make sense in multiple ways. Why stop at "1000"? The output is quite large; what client cares about that much data? The ordering is indefinite -- the datetime is not guaranteed to be in order. Why select the usd without also selecting the datetime? Please provide a realistic query; then I can help you with INDEX(es).
I have a basic question about database designing.
I have a lot of files that I have to read and insert into a database. Each file has a few thousand lines, and each line has about 30 fields (of these types: SMALLINT, INT, BIGINT, VARCHAR, JSON). Of course I use multiple threads along with bulk inserting to increase insert speed (in the end I will have 30-40 million records).
After inserting, I want to run some sophisticated analysis, and performance is important to me.
Once I have each line's fields and am ready to insert, I have 3 approaches:
1- One big table:
In this case I create one big table with 30 columns and store all of the files' fields in it. So there is one huge table on which I want to run a lot of analysis.
2- A fairly large table (A) and some little tables (B)
In this case I create some little tables consisting of the columns whose values are largely repeated once separated from the other columns. These little tables then hold only a few hundred or a few thousand records instead of 30 million. In the fairly large table (A), I omit the columns that I moved into the other tables and use foreign keys instead. In the end I have a table (A) with 20 columns and 30 million records, plus some tables (B) with 2-3 columns and 100-50,000 records each. So in order to analyse table A, I have to use some joins, for example in SELECTs.
3- Just a fairly large table
In this case I create a fairly large table like table A in the case above (with 20 columns), but instead of using foreign keys I use a mapping between source columns and destination columns (something like foreign keys, with a small difference). For example, take three columns c1, c2, c3 that in case 2 I would put into another table B and access via a foreign key; now I assign a specific number to each distinct (c1, c2, c3) combination at insert time and keep the relation between the record and its assigned value in the application code. So this table looks exactly like table A in case 2, but there is no need to use joins in SELECTs.
While insert time is important, the analysis time matters even more to me, so I would like to know your opinion about which of these cases is better. I would also be glad to see other solutions.
From a design perspective, 30 to 40 million rows is not that bad a number. Performance depends entirely on how you design your DB.
If you are using SQL Server, you could consider putting the large table on a separate database filegroup. I have worked on a similar case where we had around 1.8 billion records in a single table.
For the analysis, if you are not going to look at the entire data set in one shot, you could consider partitioning the data using a partition scheme based on your needs. One example would be to split the data into yearly partitions, which helps if your analysis is limited to a year's worth of data (just an example).
The major decisions are denormalization/normalization based on your needs and, of course, clustered/non-clustered indexing of the data. Again, this depends on what sort of analysis queries you will be running.
A single thread can INSERT one row at a time and finish 40M rows in a day or two. With LOAD DATA, you can do it in perhaps an hour or less.
But is loading the real question? For doing grouping, summing, etc, the question is about SELECT. For "analytics", the question is not one of table structure. Have a single table for the raw data, plus one or more "Summary tables" to make the selects really fast for your typical queries.
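As a rough illustration only (the real columns aren't known yet, so raw_data, created_at, category and amount are placeholder names), a summary table refreshed once per day could look like this:

CREATE TABLE daily_summary (
    day       DATE NOT NULL,
    category  INT  NOT NULL,            -- whatever dimension you group by
    row_count INT UNSIGNED NOT NULL,
    total_amt DECIMAL(20,6) NOT NULL,
    PRIMARY KEY (day, category)
);

-- Roll up yesterday's raw rows; analytic queries then hit daily_summary instead.
INSERT INTO daily_summary (day, category, row_count, total_amt)
SELECT DATE(created_at), category, COUNT(*), SUM(amount)
FROM raw_data
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY DATE(created_at), category;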
Until you give more details about the data, I cannot give more details about a custom solution.
Partitioning (vertical or horizontal) is unlikely to help much in MySQL. (Again, details needed.)
Normalization shrinks the data, which leads to faster processing. But, it sounds like the dataset is so small that it will all fit in RAM?? (I assume your #2 is 'normalization'?)
Beware of over-normalization.
I have a big table containing trillions of records of the following schema (Here serial no. is the key):
MyTable
Column        | Type                     | Modifiers
--------------+--------------------------+-----------
serial_number | int                      |
name          | character varying(255)   |
Designation   | character varying(255)   |
place         | character varying(255)   |
timeOfJoining | timestamp with time zone |
timeOfLeaving | timestamp with time zone |
Now I want to fire queries of the form given below on this table:
select place from myTable where Designation='Manager' and timeOfJoining>'1930-10-10' and timeOfLeaving<'1950-10-10';
My aim is to achieve fast query execution times. Since, I am designing my own database from scratch, therefore I have the following options. Please guide me as to which one of the two options will be faster.
Create 2 separate tables. Here, table1 contains the schema (serial_no, name, Designation, place) and table2 contains the schema (serial_no, timeOfJoining, timeOfLeaving), and I perform a merge join between the two tables. Here, serial_no is the key in both tables.
Keep one single table, MyTable, and run the following plan: create an index Designation_place_name and, using that index, find the rows that fit the index condition Designation = 'Manager' (the rows on disk are accessed randomly), then use a filter to keep only the rows that match the timeOfJoining criteria.
Please help me figure out which one will be faster. It'll be great if you could also tell me the respective pros and cons.
EDIT: I intend to use my table as read-only.
If you are dealing with lots and lots of rows and you want to use a relational database, then your best bet for such a query is to satisfy it entirely in an index. The example query is:
select place
from myTable
where Designation='Manager' and
timeOfJoining > '1930-10-10' and
timeOfLeaving < '1950-10-10';
The index should contain the four fields mentioned in the table. This suggests an index like: mytable(Designation, timeOfJoining, timeOfLeaving, place). Note that only the first two will be used for the where clause, because of the inequality. However, most databases will do an index scan on the appropriate data.
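As a sketch, the index could be created like this (column order matters: the equality column first, then the range columns, then the selected column so the query can be answered from the index alone):

CREATE INDEX idx_designation_dates_place
    ON myTable (Designation, timeOfJoining, timeOfLeaving, place);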
With such a large amount of data, you have other problems. Although memory is getting cheaper and machines bigger, indexes often speed up queries because an index is smaller than the original table and faster to load in memory. For "trillions" of records, you are talking about tens of trillions of bytes of memory, just for the index -- and I don't know which databases are able to manage that amount of memory.
Because this is such a large system, just the hardware costs are still going to be rather expensive. I would suggest a custom solution that stored the data in a compressed format with special purpose indexing for the queries. Off-the-shelf databases are great products applicable in almost all data problems. However, this seems to be going near the limit of their applicability.
Even small efficiencies over an off-the-shelf database start to add up with such a large volume of data. For instance, the layout of records on pages invariably leaves empty space on a page (records don't exactly fit on a page, the database has overhead that you may not need such as bits for nullability, and so on). Say the overhead of the page structure and empty space amount to 5% of the size of a page. For most applications, this is in the noise. But 5% of 100 trillion bytes is 5 trillion bytes -- a lot of extra I/O time and wasted storage.
EDIT:
The real answer to the choice between the two options is to test them. This shouldn't be hard, because you don't need to test them on trillions of rows -- and if you have the hardware for that, you have the hardware for smaller tests. Take a few billions of rows on a machine with correspondingly less memory and CPUs and see which performs better. Once you are satisfied with the results, multiply the data by 10 and try again. You might want to do this one more time if you are not convinced of the results.
My opinion, though, is that the second is faster. The first duplicates the "serial number" in both tables, adding 8 bytes to each row ("int" is typically 4-bytes and that isn't big enough, so you need bigint). That alone will increase the I/O time and size of indexes for any analysis. If you were considering a columnar data store (such as Vertica) then this space might be saved. The savings on removing one or two columns is at the expense of reading in more bytes in total.
Also, don't store the raw form of any of the variables in the table. The "Designation" should be in a lookup table as well as the "place" and "name", so each would be 4-bytes (that should be big enough for the dimensions, unless one is all people on earth).
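A rough sketch of that layout (names and sizes are illustrative, and the timestamp types are simplified to DATETIME):

-- Lookup (dimension) tables, one row per distinct value:
CREATE TABLE name_dim        (name_id        INT PRIMARY KEY, name        VARCHAR(255) NOT NULL);
CREATE TABLE designation_dim (designation_id INT PRIMARY KEY, designation VARCHAR(255) NOT NULL);
CREATE TABLE place_dim       (place_id       INT PRIMARY KEY, place       VARCHAR(255) NOT NULL);

-- The big table then holds only 4-byte ids plus the timestamps:
CREATE TABLE myTable (
    serial_number  BIGINT   NOT NULL,
    name_id        INT      NOT NULL,
    designation_id INT      NOT NULL,
    place_id       INT      NOT NULL,
    timeOfJoining  DATETIME NOT NULL,
    timeOfLeaving  DATETIME NOT NULL
);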
But . . . The "best" solution in terms of cost, maintainability, and scalability is probably something like Hadoop. That is how companies like Google and Yahoo manage vast quantities of data, and it seems apt here too.
Given the amount and type of data, I would suggest going with the second option. The upside is that you do not need to join anything; joins are usually very costly. However, in that case you are holding a lot of redundant data.
The first option would be more memory efficient, the second more time efficient.
Furthermore, using indices, the DBMS is able to use index scans to read data from storage. You should also consider changing the variable-length datatypes to fixed-length datatypes; then the DBMS has an easier job of jumping between tuples, as every tuple has a fixed (and known) length. In that case, operations like "skip the next 100,000 tuples" are easy for the DBMS.
I am sorry to tell you but this schema just won't work for 'trillions' of records with any relational database. Just to store the index pages for serial_number and Designation for 1 trillion rows will require 465 terabytes. That is more than double the size of the entire World Data Centre for Climate database that currently holds the world record as the largest. If these requirements are for real, you really need to move to a star/snowflake schema. That means no varchars in this fact table, not even dates, only integers. Move all text and date fields to dimensions.
For the most part a single table makes some sense, but it would be ridiculous to store all those values as strings, depending on the uniqueness of your name/designation/place fields you could use something like this:
serial_number  | BIGINT
name_ID        | INT
Designation_ID | INT
place_ID       | INT
timeOfJoining  | timestamp with time zone
timeOfLeaving  | timestamp with time zone
Without knowing the data it's impossible to know which lookups would be practical. As others have mentioned you've got some challenges ahead. Regarding indexing, I agree with Gordon.
Recently I have been thinking about best practices for storing historical data in a MySQL database. For now, each versionable table has two columns, valid_from and valid_to, both of type DATETIME. The record holding the current data has valid_from set to its creation date. When I update that row, I fill valid_to with the update date and add a new record whose valid_from is the same as the previous row's valid_to. Easy stuff. But I know the table will become enormous very quickly, so fetching data could get very slow.
I'd like to know what practices you have for storing historical data.
It's a common mistake to worry about "large" tables and performance. If you can use indexes to access your data, it doesn't really matter whether you have 1,000 or 1,000,000 records - at least not enough that you'd be able to measure it. The design you mention is commonly used; it's a great design where time is a key part of the business logic.
For instance, if you want to know what the price of an item was at the point when the client placed the order, being able to search product records where valid_from < order_date and valid_until is either null or > order_date is by far the easiest solution.
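For example, a sketch of that lookup (the orders and product_prices tables and their columns are assumed here, using the question's valid_from/valid_to naming):

SELECT p.price
FROM orders o
JOIN product_prices p
  ON  p.product_id = o.product_id
  AND p.valid_from <= o.order_date
  AND (p.valid_to IS NULL OR p.valid_to > o.order_date)
WHERE o.id = 12345;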
This isn't always the case - if you're keeping the data around just for archive purposes, it may make more sense to create archive tables. However, you have to be sure that time is really not part of the business logic, otherwise the pain of searching multiple tables will be significant - imagine having to search either the product table OR the product_archive table every time you want to find out about the price of a product at the point the order was placed.
This is not a complete answer, just a few suggestions.
You can add an indexed boolean field such as is_valid. This should improve performance on a big table holding both historical and current records.
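For example (assuming the versioned table is called prices):

ALTER TABLE prices
    ADD COLUMN is_valid TINYINT(1) NOT NULL DEFAULT 1,
    ADD INDEX idx_is_valid (is_valid);

-- Current rows are then trivial to select:
SELECT * FROM prices WHERE is_valid = 1;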
In general, storing historical data in a separate table may complicate your application (just imagine the complexity of a query that is supposed to return a mix of current and historical records...).
Computers today are really fast. I think you should compare and test performance with a single table versus a separate table for historical records.
In addition, try testing your hardware to see how fast MySQL is with big tables, to determine how to design the database. If it is too slow for you, you can tune the MySQL configuration (start with increasing cache/RAM).
I'm nearing completion of an application which does exactly this. Most of my indexes index by key fields first and then the valid_to field, which is set to NULL for current records, thereby allowing current records to be found easily and instantly. Since most of my application deals with real-time operations, the indexes provide fast performance. Once in a while someone needs to see historical records, and in that instance there's a performance hit, but from testing it's not too bad, since most records don't have very many changes over their lifetime.
In cases where you may have a lot more expired records of various keys than current records it may pay to index over valid_to before any key fields.
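Illustrative versions of the two orderings described above (the price_history table and product_id key column are assumed):

-- Key fields first, then valid_to: good when you mostly look up current rows by key.
CREATE INDEX idx_product_valid_to ON price_history (product_id, valid_to);

-- valid_to first: can pay off when expired rows vastly outnumber current ones.
CREATE INDEX idx_valid_to_product ON price_history (valid_to, product_id);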
I developed a stats site for a game as a learning project a few years back. It's still used today and I'd like to get it cleaned up a bit.
The database is one area that needs improvement. I have a table for the game statistics, which has GameID, PlayerID, Kills, Deaths, DamageDealt, DamageTaken, etc. In total, there are about 50 fields in that single table and many more that could be added in the future. At what point are there too many fields? It currently has 57,341 rows and is 153.6 MiB by itself.
I also have a few fields that store arrays in a BLOB in this same table. An example of such an array is player-vs-player matchups: the array stores how many times that player killed each other player in the game. These are the bigger fields in terms of size. Is storing an array in a BLOB advised?
The array looks like:
[Killed] => Array
    (
        [SomeDude]      => 13
        [GameGuy]       => 10
        [AnotherPlayer] => 8
        [YetAnother]    => 7
        [BestPlayer]    => 3
        [APlayer]       => 9
        [WorstPlayer]   => 2
    )
These tend to not exceed more than 10 players.
I prefer not to have one table with an undetermined number of columns (with more to come), but rather an associated table of labels and values: each user has an id, and you use that id as a key into the table of labels and values. That way you only store the data you need per user. I believe this approach is called EAV (entity-attribute-value, as per Triztian's comment), and it's how medical databases are kept, since there are SO many potential fields for an individual patient while any given patient has actual data in only a very small number of them.
so, you'd have
user:
id | username | some_other_required_field
user_data:
id | user_id | label | value
Now you can have as many or as few user_data rows as you need per user.
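A minimal sketch of those two tables in MySQL (column sizes are guesses):

CREATE TABLE user (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    username VARCHAR(64) NOT NULL
    -- ... plus any other required fields
);

CREATE TABLE user_data (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_id INT UNSIGNED NOT NULL,
    label   VARCHAR(64)  NOT NULL,
    value   VARCHAR(255) NOT NULL,
    KEY idx_user_label (user_id, label),
    FOREIGN KEY (user_id) REFERENCES user (id)
);

-- Only the stats a player actually has get stored:
INSERT INTO user_data (user_id, label, value)
VALUES (1, 'Kills', '13'),
       (1, 'DamageDealt', '4200');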
[Edit]
As to your array, I would treat this with a relational table as well. Something like:
player_interaction:
id | player1_id | player2_id | interaction_type
here you would store the two players who had an interaction and what type of interaction it was.
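A hedged sketch of that table; the game_id and occurrences columns are assumptions, added so the per-opponent kill counts from the array have somewhere to live:

CREATE TABLE player_interaction (
    id               INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    game_id          INT UNSIGNED NOT NULL,
    player1_id       INT UNSIGNED NOT NULL,   -- e.g. the killer
    player2_id       INT UNSIGNED NOT NULL,   -- e.g. the victim
    interaction_type VARCHAR(32)  NOT NULL,   -- 'kill', 'assist', ...
    occurrences      INT UNSIGNED NOT NULL DEFAULT 1
);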
The table design seems mostly fine, as long as the columns you are storing can't be calculated from the other columns within the same row. I.e., you're not storing SelfKills, OtherDeaths, and TotalDeaths (where TotalDeaths = SelfKills + OtherDeaths); that would not make sense and could be cut from your table.
I'd be curious to learn more about how you are storing those arrays in a BLOB - what purpose do they serve in a BLOB? Why aren't they normalized into a table for easy data transformation and analytics? (Or are they, and they are just being stored as an array here for ease of data display to end users?)
Also, I'd be curious how much space your BLOBs take up versus the rest of the table. Generally speaking, the size of the rows isn't as big a deal as the number of rows, and ~60K rows is no big deal at all, as long as you aren't writing queries that need to check every column value (ideally you're ignoring the BLOBs when writing a WHERE clause).
With MySQL you've got a hard limit of roughly 4,000 columns (fields) and roughly 65 KB of total storage per row. If you need to store large strings, use a TEXT field; its contents are stored separately on disk. BLOBs really should be reserved for non-textual data (if you must use them at all).
Don't worry in general about the size of your db, but think about the structure and how it's organized and indexed. I've seen small db's run like crap.
If you still want numbers: when your total DB gets into the GB range or past a couple hundred thousand rows in a single table, then start worrying more about these things -- 150 MiB in 60K rows isn't much, and table scans aren't going to cost you much in performance. However, now is the time to make sure you create good covering indexes for your heavily used queries.
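For instance, a hypothetical covering index for a "stats per player" query (the table name game_stats is assumed; the columns are from the question):

CREATE INDEX idx_player_game_kills ON game_stats (PlayerID, GameID, Kills);

-- A query like this can then be answered from the index alone:
SELECT GameID, Kills
FROM game_stats
WHERE PlayerID = 42;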
There's nothing wrong with adding columns to a database table as time goes on. Database designs change all the time. The thing to keep in mind is how the data is grouped. I have always treated a database table as a collection of like items.
Things I consider are as follows:
When inserting data into a row how many columns will be null?
Does this new column apply to 80% of my data that is already there?
Will I be doing several updates to a few columns in this table?
If so, do I need to keep track of what the previous values were, just in case?
By thinking about you data like this you may discover that you need to break your table up into a handful of separate smaller tables linked together by foreign keys.