I am working on finalising a site that will go live soon. It will process up to 1 million files per week and store all the information from these files in multiple tables in a database.
The main table will have 10 records per file, so it will gain about 10 million records per week. Currently that table has 85 columns, storing about 1.6 KiB of data per row.
I'm obviously worried about having 85 columns; it seems crazy, but I'm more worried about the joins if I split the data into multiple tables... If I end up with 4 tables of 20-odd columns and over 500,000,000 records in each of them, won't those joins take a massive amount of time?
The joins would all take place on 1 column (traceid) which will be present in all tables and indexed.
The hardware this will run on is an i7-6700 with 32GB RAM. The table type is InnoDB.
Any advice would be greatly appreciated!
Thanks!
The answer to this depends on the structure of your data. 85 columns is a lot. It will be inconvenient to add an 86th column. Your queries will be verbose. SELECT *, when you use it for troubleshooting, will splat a lot of stuff across your screen and you'll have trouble interpreting it. (Don't ask how I know this. :-)
If every item of data you process has exactly one instance of all 85 values, and they're all standalone values, then you've designed your table appropriately.
If most rows have a subset of your 85 values, then you should figure out the subsets and create a table for each one.
For example, if you're describing apartments and your columns have meanings like these:
livingroom yes/no
livingroom_sq_m decimal (7,1)
diningroom yes/no
diningroom_sq_m decimal (7,1)
You may want an apartments table and a separate rooms table with columns like this.
room_id pk
apt_id fk
room_name text
room_sq_m decimal (7,1)
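A minimal sketch of that layout in MySQL might look like this (names and types are just illustrative):

CREATE TABLE apartments (
    apt_id  INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    address VARCHAR(255) NOT NULL
) ENGINE=InnoDB;

-- One row per room instead of one yes/no column plus one size column per room type
CREATE TABLE rooms (
    room_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    apt_id    INT UNSIGNED NOT NULL,
    room_name VARCHAR(50)  NOT NULL,        -- 'livingroom', 'diningroom', ...
    room_sq_m DECIMAL(7,1) NOT NULL,
    FOREIGN KEY (apt_id) REFERENCES apartments (apt_id)
) ENGINE=InnoDB;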
In another example, if some of your columns are
cost_this_week
cost_week_1
cost_week_2
cost_week_3
etc.
you should consider normalizing that information into a separate table.
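As a hedged sketch (the table and column names here are invented for the example), those repeated week columns could become rows in a child table:

-- One row per (item, week) instead of one column per week
CREATE TABLE item_weekly_costs (
    item_id INT UNSIGNED      NOT NULL,
    week_no SMALLINT UNSIGNED NOT NULL,     -- 0 = this week, 1 = week_1, ...
    cost    DECIMAL(10,2)     NOT NULL,
    PRIMARY KEY (item_id, week_no)
) ENGINE=InnoDB;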
I have a basic question about database design.
I have a lot of files which I have to read and insert into a database. Each file has a few thousand lines, and each line has about 30 fields (of these types: smallint, int, bigint, varchar, json). Of course I use multiple threads along with bulk inserting in order to increase the insert speed (in the end I have 30-40 million records).
After inserting I want to do some sophisticated analysis, and performance is important to me.
Now I have each line's fields and I'm ready to insert, so I have 3 approaches:
1. One big table:
In this case I can create one big table with 30 columns and store all of the files' fields in it. So there is one huge table which I then want to run a lot of analysis on.
2. A fairly large table (A) and some small tables (B):
In this case I can create some small tables consisting of the columns whose values are heavily repeated, so that when separated from the other columns they hold only a few hundred or thousand distinct records instead of 30 million. In the fairly large table (A) I omit the columns I have moved out and use foreign keys instead. Finally I have a table (A) with 20 columns and 30 million records, and some tables (B) with 2-3 columns and 100-50,000 records each. So in order to analyse table A, I have to use some joins in my SELECTs, etc.
3. Just a fairly large table:
In this case I can create a fairly large table like table A above (with 20 columns), but instead of using foreign keys I use a mapping between source values and destination values (something like a foreign key, with a small difference). For example, I have 3 columns c1, c2, c3 which in case 2 I would put in another table B and access through a foreign key; now I instead assign a specific number to each distinct (c1, c2, c3) combination at insert time and keep the relation between the record and its assigned value in the program code. So this table is exactly like table A in case 2, but there is no need to use joins in SELECTs, etc.
While insert time is important, the analysis time matters more to me, so I want to know your opinion on which of these cases is better, and I would also be glad to see other solutions.
From a design perspective, 30 to 40 million is not that bad a number. Performance depends entirely on how you design your DB.
If you are using SQL Server, you could consider putting the large table in a separate database file group. I have worked on a similar case where we had around 1.8 billion records in a single table.
For the analysis, if you are not going to look at the entire data set in one shot, you could consider partitioning the data horizontally. You could use a partition scheme based on your needs; one example would be to split the data into yearly partitions, which helps if your analysis is limited to a year's worth of data (just an example).
The other major decision is de-normalization/normalization based on your needs, and of course clustered/nonclustered indexing of the data. Again, this will depend on what sort of analysis queries you will be running.
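If SQL Server is indeed the platform, a rough sketch of such a yearly scheme (the names and boundary values below are only placeholders) might be:

-- Placeholder names; the boundary values would match the years actually present
CREATE PARTITION FUNCTION pfByYear (int)
    AS RANGE RIGHT FOR VALUES (2019, 2020, 2021);

CREATE PARTITION SCHEME psByYear
    AS PARTITION pfByYear ALL TO ([PRIMARY]);

-- The big table is then created on that scheme, keyed by its year column, e.g.
-- CREATE TABLE dbo.BigTable ( ... , data_year int NOT NULL ) ON psByYear (data_year);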
A single thread can INSERT one row at a time and finish 40M rows in a day or two. With LOAD DATA, you can do it in perhaps an hour or less.
But is loading the real question? For grouping, summing, etc., the question is about SELECTs. For "analytics", the question is not one of table structure. Have a single table for the raw data, plus one or more "summary tables" to make the SELECTs really fast for your typical queries.
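A rough MySQL sketch of that pattern (the table and column names here are assumptions, since the real schema isn't given):

-- Bulk-load the raw rows; the file layout is assumed to match the table
-- LOAD DATA INFILE '/path/to/file.csv' INTO TABLE raw_data
--     FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

-- A summary table that typical analytic queries hit instead of raw_data
CREATE TABLE daily_summary (
    summary_date DATE          NOT NULL,
    category     VARCHAR(50)   NOT NULL,
    row_count    INT UNSIGNED  NOT NULL,
    total_value  DECIMAL(14,2) NOT NULL,
    PRIMARY KEY (summary_date, category)
) ENGINE=InnoDB;

-- Refresh after each load
INSERT INTO daily_summary
SELECT DATE(created_at), category, COUNT(*), SUM(value)
FROM raw_data
GROUP BY DATE(created_at), category
ON DUPLICATE KEY UPDATE
    row_count   = VALUES(row_count),
    total_value = VALUES(total_value);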
Until you give more details about the data, I cannot give more details about a custom solution.
Partitioning (vertical or horizontal) is unlikely to help much in MySQL. (Again, details needed.)
Normalization shrinks the data, which leads to faster processing. But, it sounds like the dataset is so small that it will all fit in RAM?? (I assume your #2 is 'normalization'?)
Beware of over-normalization.
tl;dr at the bottom.
So, I have an application with roughly the following schema:
`budget` hasMany =>
`item1`
`item2`
...
`item10`
Now, these 10 items share a set of 23 fields that are identical across all 10 of them. At least another 20 fields are shared by 7 or more of the items.
It ended up like this; in retrospect it was idiotic, but at the time it seemed like the right thing.
So, with this in mind, I thought: why the hell not make 9 tables disappear and make 1 table that contains all the fields from all the items, given that a lot are shared anyway?
What would I gain? Lots of code would disappear. Lots of tables would disappear. Retrieving a budget with all its items would require only a join with a single table, instead of 10 joins.
My doubts come from the fact that this new table would have around 80 columns. All small columns, storing mostly integers, doubles or small varchars. Still, 80 columns strikes me as a lot. Another problem is that in the future, instead of having 10 tables with 1 million records each, I would have 1 big table with 10 million records.
So, my question is: Is it worth changing in order to remove some redundancy, reduce the amount of code and enhance the ability to retrieve and work with the data?
tl;dr Should I combine 10 tables into 1 table, considering that the 10 tables share a lot of common fields (but the new table will still have 80 columns), in order to reduce the number of tables, reduce the amount of code in the app and improve the way I retrieve data?
As far as I know (which might not be a lot), it is usually best to split a database up into separate pieces, as it currently is. This is called normalizing the database (https://en.wikipedia.org/wiki/Relational_database).
It limits the errors that can creep into the database and makes it less risky to change things through updates and so on. It also works better if you want to insert one item but not another (if you only had 1 table, all the other columns would be NULL and you would always have to go back and fetch info, which makes the insert statements harder).
If you will always insert all the items at once and always query based on all of them (no advanced computation on individual items), then it might be reasonable to put everything into one table. But if you want to insert only a couple of items and then make more complex computations, I would advise you to keep them separate and linked through some kind of Customer_Id or the like.
#yBrodsky, as an example, you could create a table for furniture that stores the furniture name, id and description, and another table that stores its attributes along with the furniture id.
The furniture table will have columns id, furniture_title, description.
And the other table will have:
id, furniture_id, attribute_key, attribute_value
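A minimal sketch of that layout (the types here are arbitrary choices):

CREATE TABLE furniture (
    id              INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    furniture_title VARCHAR(100) NOT NULL,
    description     TEXT
) ENGINE=InnoDB;

CREATE TABLE furniture_attributes (
    id              INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    furniture_id    INT UNSIGNED NOT NULL,
    attribute_key   VARCHAR(50)  NOT NULL,
    attribute_value VARCHAR(255) NOT NULL,
    FOREIGN KEY (furniture_id) REFERENCES furniture (id)
) ENGINE=InnoDB;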
I have to decide how to design a table that will be used to store dates.
I have about 20 different dates for each user, and around 100,000 users right now and growing.
So the question is: which will be faster for SELECT queries? If I make a table with 20 fields, e.g.
"user_dates"
userId, date_registered, date_paid, date_started_working, ... date_reported, date_fired (20 fields in total, with 100,000 records in the table)
or make 2 tables, like:
a first table "date_types" with 3 fields and 20 records for the above column names:
id, date_type_id, date_type_name
1 5 date_reported
2 3 date_registered
...
and a second table with 3 fields holding the actual records:
"user_dates"
userId, date_type, date
201 2 2012-01-28
202 5 2012-06-14
...
but then with 2,000,000 records?
I think the second option is more flexible: if I need to add more dates, I can do it from the front end just by adding a record to the "date_types" table and then using it in "user_dates". However, I am now worried about performance with 2 million records in the table.
So which option do you think will work faster?
A longer table will have a larger index. A wider table will have a smaller index but take more physical space and probably have more overhead. You should carefully examine your schema to see whether normalization is complete.
I would, however, go with your second option. This is because you don't necessarily need the fields to exist if they are empty. So if the user hasn't been fired, there is no need to create a record for them.
If the dates are pretty concrete and the users will have all (or most) of the dates filled in, then I would go with the wide table because it's easier to actually write the queries to get the data. Writing a query that asks for all the users that have date1 in a range and date2 in a range is much more difficult with a vertical table.
I would only go with the longer table if you know you need the option to create date types on the fly.
The best way to determine this is through testing. Generally, the sizes of data you are talking about (20 date columns by 100K records) are really pretty small for MySQL tables, so I would probably just use one table with multiple columns, unless you think you will be adding new types of date fields all the time and want a more flexible schema. You just need to make sure you index all the fields that will be used for filtering, ordering, joining, etc. in queries.
The design may also be informed by what type of queries you want to perform against the data. If for example you expect that you might want to query data based on a combination of fields (i.e. user has some certain date, but not another date), the querying will likely be much more optimal on the single table, as you would be able to use a simple SELECT ... WHERE query. With the separate tables, you might find yourself needing to do subselects, or odd join conditions, or HAVING clauses to perform the same kind of query.
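To make that concrete with a hedged sketch (both tables keep the names from the question, and the date-type ids and ranges below are invented), the same "date1 in a range and date2 in a range" filter in the two layouts might look like:

-- Wide layout (one column per date): a single simple WHERE
SELECT userId
FROM user_dates
WHERE date_registered BETWEEN '2012-01-01' AND '2012-06-30'
  AND date_paid       BETWEEN '2012-01-01' AND '2012-06-30';

-- Vertical layout (userId, date_type, date): one self-join per date type involved
SELECT r.userId
FROM user_dates AS r
JOIN user_dates AS p
  ON p.userId = r.userId
 AND p.date_type = 3                     -- assumed id for date_paid
WHERE r.date_type = 2                    -- assumed id for date_registered
  AND r.date BETWEEN '2012-01-01' AND '2012-06-30'
  AND p.date BETWEEN '2012-01-01' AND '2012-06-30';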
As long as the user ID and the date-type ID are indexed on the main tables and the user_dates table, I doubt you will notice a problem when querying. If you were to query the entire table in either case, I'm sure it would take a pretty long time (mostly to send the data, though). A single user lookup will be instantaneous in either case.
Don't sacrifice the relation for some possible efficiency improvement; it's not worth it.
Usually I go both ways: put the basic and most often used attributes into one table, and make an additional-attributes table for rarely used attributes, which can then be fetched lazily from the application layer. This way you are not doing JOINs every time you fetch a user.
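A sketch of that split for this dates case (which dates count as "rarely used" is just an assumption here):

-- Core, frequently read dates stay on the main row
CREATE TABLE user_dates_core (
    userId          INT UNSIGNED NOT NULL PRIMARY KEY,
    date_registered DATE NOT NULL,
    date_paid       DATE NULL
) ENGINE=InnoDB;

-- Rarely used dates live in a 1:1 side table, fetched lazily when needed
CREATE TABLE user_dates_extra (
    userId        INT UNSIGNED NOT NULL PRIMARY KEY,
    date_reported DATE NULL,
    date_fired    DATE NULL,
    FOREIGN KEY (userId) REFERENCES user_dates_core (userId)
) ENGINE=InnoDB;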
Evening,
I'm going through the long process of importing data from a battered, 15-year-old, read-only data format into MySQL to build some smaller statistical tables from it.
The largest table I have built before was (I think) 32 million rows, but I didn't expect it to get that big and it was really straining MySQL.
The table will look like this:
surname name year rel bco bplace rco rplace
Jones David 1812 head Lond Soho Shop Shewsbury
So, small ints and varchars.
Could anyone offer advice on how to get this to work as quickly as possible? Would indexes on any of the columns help, or would they just slow queries down?
Much of the data in each column will be duplicated many times. Some fields don't have much more than about 100 different possible values.
The main columns I will be querying the table on are: surname, name, rco, rplace.
An INDEX on a column speeds up searches.
Index the columns that you will be using most often in queries. As you mentioned, you will be querying on the columns surname, name, rco and rplace, so I'd suggest you index them.
Since the table has 32 million records, building the indexes will take some time; however, it is worth the wait.
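A hedged example of adding those indexes in MySQL (the table name is assumed):

-- One index per frequently filtered column; a composite index may suit queries
-- that always filter on the same combination of columns
ALTER TABLE records
    ADD INDEX idx_surname (surname),
    ADD INDEX idx_name    (name),
    ADD INDEX idx_rco     (rco),
    ADD INDEX idx_rplace  (rplace);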
Here's the scenario: the old database has this kind of design:
dbo.Table1998
dbo.Table1999
dbo.Table2000
dbo.Table2001
...
dbo.Table2011
and I merged all the data from 1998 to 2011 into one table, dbo.TableAllYears.
Both are now indexed by "application number" and have the same number of columns (56 columns, actually).
Now when I tried
select * from Table1998
and
select * from TableAllYears where Year=1998
the first query returns 139,669 rows in 13 seconds,
while the second query returns the same number of rows but takes 30 seconds.
So, am I just missing something, or are multiple tables better than a single table?
You should partition the table by year; this is almost equivalent to having different tables for each year. That way, when you query by year, it will run against a single partition and performance will be better.
Try adding an index on each of the columns that you're searching on (in the WHERE clause). That should speed up querying dramatically.
So in this case, add a new index for the field Year.
I believe that you should use a single table. Inevitably, you'll need to query data across multiple years, and separating it into multiple tables is a problem. It's quite possible to optimize your query and your table structure such that you can have many millions of rows in a table and still have excellent performance. Be sure your year column is indexed, and included in your queries. If you really hit data size limitations, you can use partitioning functionality in MySQL 5 that allows it to store the table data in multiple files, as if it were multiple tables, while making it appear to be one table.
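As a rough sketch of that MySQL partitioning option (the column names and partition bounds here are just placeholders):

-- MySQL range partitioning by year; the partition column must be part of every unique key
CREATE TABLE TableAllYears (
    ApplicationNumber INT NOT NULL,
    Year              SMALLINT NOT NULL,
    -- ... the remaining columns ...
    PRIMARY KEY (ApplicationNumber, Year)
)
PARTITION BY RANGE (Year) (
    PARTITION p1998 VALUES LESS THAN (1999),
    PARTITION p1999 VALUES LESS THAN (2000),
    -- one partition per year ...
    PARTITION pMax  VALUES LESS THAN MAXVALUE
);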
Regardless of that, 140k rows is nothing, and it's likely premature optimization to split it into multiple tables, and even a major performance detriment if you need to query data across multiple years.
If you're looking for data from 1998, then having only 1998 data in one table is the way to go. This is because the database doesn't have to "search" for the records, but knows that all of the records in this table are from 1998. Try adding the "WHERE Year=1998" clause to the query against Table1998 and you should get a slightly better comparison.
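That is, compare like with like (this assumes Table1998 also carries a Year column, as suggested above):

-- Both queries now do the same filtering work
SELECT * FROM Table1998     WHERE Year = 1998;
SELECT * FROM TableAllYears WHERE Year = 1998;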
Personally, I would keep the data in multiple tables, especially if it is a particularly large data set and you don't have to do queries on the old data frequently. Even if you do, you might want to look at creating a view with all of the table data and running the reports on that instead of having to query several tables.