Big SQL tables and Entity Framework - sql-server-2008

I am trying to find a better way to deal with a large volume of records in a single table. The table currently has 50M records, and about 1.5M records are inserted per day.
Which options are available to avoid performance problems a month or a year from now?
Please point me in the right direction. Currently I am thinking about having separate tables (or maybe views) with different sets of records (like deleted/old/new) and feeding EF the appropriate table/view for each scenario. For example, an ordinary user might not need to work with all records, such as deleted ones or records that are two years old, and including them might cause performance issues.
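A minimal sketch of the filtered-view idea the question describes, assuming SQL Server and entirely invented table/column names; EF could be mapped to such a view instead of the base table (table partitioning, mentioned further down, is the other common route):

    -- Hypothetical names: expose only the rows an ordinary user needs
    -- and map the EF entity to this view instead of dbo.Records.
    CREATE VIEW dbo.ActiveRecords
    AS
    SELECT Id, CustomerId, Payload, CreatedAt
    FROM dbo.Records
    WHERE IsDeleted = 0
      AND CreatedAt >= DATEADD(YEAR, -2, GETDATE());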

Related

Creating sql tables by month and year [duplicate]

Every month I get sent a file from an external company which needs to be stored in a database, each file containing up to a million records. The main data fields are Month, Year, Postcode and TransactionType.
I was proposing that we save the data in our database as a new SQL table each month, so we know there is only a finite amount of data in each table. However, one of my colleagues said he was once told that creating a new table every month is bad practice, but he didn't know why.
If I was to have multiple tables, there would only be a maximum of 60 tables, though there may be far fewer (down to 12) dependent on how far into the past my client needs to look. This means that every month I will need to delete a month's worth of data.
However when I do my SQL queries I will only need a single row of data from a single table per query. I would think in theory this would be more efficient than having a single table filled with millions of rows.
I was wondering if anyone had any definitive reasons as to why splitting the data this way would be a bad thing to do?
All "like" items should be stored together in a database for the following reasons:
You should be able to provide any subset of the items using a single SELECT statement, changing only the WHERE clause of that statement (a sketch follows at the end of this answer). With separate tables you would have to write code to decompose the request into the part that computes the table name and the part that filters that table, and you would have to duplicate that logic in each application, or teach it to each user, that wants to use your database.
You should not artificially limit the use to which your data can be put. If you have separate monthly tables you have already substantially limited the types of queries you can enter against them without having to write more complex UNION queries.
The addition of more instances of a known data type to your database should not require ALTERing the structure of your database; as a general principle, regularly-run code should not even have ALTER permissions.
If proper indexes are maintained, there is very little performance difference when SELECTing data from a table 60 times the size of a smaller table. (There can be more of an effect on INSERT and UPDATE commands, but it sounds like you'll be doing a bulk update rather than updating the data constantly.)
I can think of only two reasons for sharding data into separate tables:
You discover that you have a performance issue that can't be resolved through better data design.
You have records with different levels of security and are relying on GRANT SELECT permissions to allow some users to see records at higher levels of security.
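To illustrate the first point above, a rough sketch with invented table and column names: with a single table, any month is just a different WHERE clause, while separate monthly tables force you to compute table names and stitch multi-month questions together with UNIONs.

    -- Single table: any month is just a WHERE clause.
    SELECT Postcode, TransactionType
    FROM Transactions
    WHERE TransYear = 2016 AND TransMonth = 3;

    -- Separate monthly tables: the table name itself must be computed,
    -- and multi-month questions turn into UNIONs.
    SELECT Postcode, TransactionType FROM Transactions_2016_03
    UNION ALL
    SELECT Postcode, TransactionType FROM Transactions_2016_04;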
A simpler method would be to add a column to the table which contains a timestamp of when each row was loaded into the system. That way you can filter on that particular column to segregate the data into the months/years in which it was loaded.
Another advantage, from a performance perspective, is that if you regularly filter data this way you can create an index on this date column.
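A minimal sketch of that idea, assuming SQL Server syntax and invented names:

    -- Add a load timestamp (existing rows get the default), index it,
    -- and filter by month on that column.
    ALTER TABLE Transactions
        ADD LoadedAt DATETIME NOT NULL DEFAULT GETDATE();

    CREATE INDEX IX_Transactions_LoadedAt ON Transactions (LoadedAt);

    SELECT Postcode, TransactionType
    FROM Transactions
    WHERE LoadedAt >= '20160301' AND LoadedAt < '20160401';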
Having multiple tables that contain the same kind of information is not recommended, both for performance reasons and because of how information is stored in SQL. It will eventually take up more space, and if one month's data needs to reference another month's data it will be quite slow.
Hope this helps.
If you think it won't be difficult to manage in your application, you can do it.
For example: do you need to change your SQL queries every month?
What happens if a user needs a report covering more than one month of data?
With partitioning, the DBMS splits your data across multiple tables in physical storage, but you can query all of them under the same name. The DBMS works out which partition(s) it should read. Performance is not significantly different.
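As a hedged sketch of what that looks like in SQL Server 2008 (table partitioning needs Enterprise Edition; all names and the YearMonth encoding are invented for illustration):

    -- One partition per month, keyed on an INT such as 201603.
    CREATE PARTITION FUNCTION pfByMonth (INT)
    AS RANGE RIGHT FOR VALUES (201601, 201602, 201603);  -- extend as months arrive

    CREATE PARTITION SCHEME psByMonth
    AS PARTITION pfByMonth ALL TO ([PRIMARY]);

    CREATE TABLE Transactions
    (
        Id              BIGINT IDENTITY(1,1) NOT NULL,
        YearMonth       INT          NOT NULL,
        Postcode        VARCHAR(10)  NOT NULL,
        TransactionType VARCHAR(20)  NOT NULL
    ) ON psByMonth (YearMonth);

    -- Queries still target one logical table; the optimizer touches only
    -- the partitions that match the predicate.
    SELECT COUNT(*) FROM Transactions WHERE YearMonth = 201603;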

What is good for creating a database with at least 1000 items per table?

I want to ask all the experts out there for some tips. I want to create a construction supply inventory and management system. It has different product categories (e.g. Aggregates, Paints and Chemicals, Plumbing, etc.), and every product category can have at least 1000 product items. So is it better to create a table for every category, or to create only one table and put all products there? Thanks in advance.
There isn't really any good reason to use multiple tables with the same layout instead of a single table (at least not in your case), but there are several reasons to use a single table:
Databases are built to handle a lot of records in a table, but not so much to handle a lot of tables.
Execution plans can be reused for queries against the same table, but an execution plan for one table cannot be reused for a different table. Each query against a new category table needs a new execution plan, and caching execution plans for all those tables uses memory that could be used to cache data.
If you need to do any change to the table, like adding a field or adding an index, it's a lot easier to do that to a single table than to do it to multiple tables.
To access different tables depending on a value you need to build queries dynamically, and concatenating values into queries brings security issues like SQL injection.
By using data as table names you lose the separation between the data and the database layout. When data changes it should not require the database design to change.
If you ever need to get products from more than one category at once, that gets much more complicated with categories in separate tables.
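A minimal sketch of the single-table layout (invented names): one indexed CategoryId column answers "give me everything in a category" with a reusable execution plan and no dynamic SQL.

    CREATE TABLE Category
    (
        CategoryId INT PRIMARY KEY,
        Name       VARCHAR(100) NOT NULL
    );

    CREATE TABLE Product
    (
        ProductId  INT PRIMARY KEY,
        CategoryId INT NOT NULL,
        Name       VARCHAR(200) NOT NULL,
        UnitPrice  DECIMAL(10,2) NOT NULL,
        FOREIGN KEY (CategoryId) REFERENCES Category (CategoryId)
    );

    CREATE INDEX IX_Product_CategoryId ON Product (CategoryId);

    -- All products of one category; any other category is just a
    -- different parameter value, not a different table.
    SELECT ProductId, Name, UnitPrice
    FROM Product
    WHERE CategoryId = 3;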

Are there any (potential) problems with having large gaps in auto-increment IDs in the neighbouring rows of a table?

I have a mysql web app that allows users to edit personal information.
A single record is stored in the database across multiple tables. There is a single row in one table for the record, and then additional one-to-many tables for related information. Rows in the one-to-many tables can additionally point to other one-to-many-tables.
All this is to say, data for a single personal information record is a tree that is very spread out in the database.
To update a record, rather than trying to deal with a hodgepodge of update and delete and insert statements to address all the different information that may change from save to save, I simply delete the entire old tree, and then re-insert a new one. This is much simpler on the application side, and so far it has been working fine for me without any problems.
However I do note that some of the auto-incrementing IDs in the one-to-many tables are starting to creep higher. It will still be decades at least before I am anywhere close to this bumping against the limits of INT, let alone BIGINT -- however I am still wondering if there are any drawbacks to this approach that I should be aware of.
So I guess my question is: for database structures like mine, which consist of large trees of information spread across multiple tables, when updating the information, any part of which may have changed, is it OK to just delete the old tree and re-insert a new one? Or should I be rethinking this? In other words, is it OK or not OK for there to be large gaps between the IDs of the rows in a table?
Thanks (in advance) for your help.
If your primary keys are indexed (which they should be) you shouldn't get problems, apart from the database files needing some compacting from time to time.
However, the kind of data you are storing could probably be stored better in a document database, like MongoDB. Have you considered using one of these?
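For reference, a sketch of the delete-and-reinsert pattern the question describes, using MySQL/InnoDB syntax and invented names; cascading foreign keys remove the child rows with the root, and the resulting gaps in the AUTO_INCREMENT sequences are harmless.

    CREATE TABLE person (
        id   BIGINT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(200) NOT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE phone (
        id        BIGINT AUTO_INCREMENT PRIMARY KEY,
        person_id BIGINT NOT NULL,
        number    VARCHAR(50) NOT NULL,
        FOREIGN KEY (person_id) REFERENCES person (id) ON DELETE CASCADE
    ) ENGINE=InnoDB;

    -- Replace the whole tree in one transaction; children go with the root.
    START TRANSACTION;
    DELETE FROM person WHERE id = 42;
    INSERT INTO person (id, name) VALUES (42, 'New Name');
    INSERT INTO phone (person_id, number) VALUES (42, '555-0100');
    COMMIT;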

Best database design for storing a high number of columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields, and each of those needs a column. We need to be able to filter and sort on all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns, since that is the most natural way to do things.
Options: My other options are a meta table that stores keys and values, or using a document-based database so I have access to a variable schema and can scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need a storage that can select, sort and filter really fast.
Update:
Some more information about usage. XML feeds will be generated live from this table. We are talking about 100-500 requests per hour, but this will be growing. The fields will not change regularly, but it could happen once every 6 months. We will also be updating the datafeeds daily, so we will be checking whether items have been updated, deleting old ones and adding new ones.
120 columns at 100k rows is not enough information, that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a key-value pair table. They are not great in a relational database, so stick to properly typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say "we will be updating the datafeeds daily", so there shouldn't be any need to query the database on every request. Besides, 1000 per hour is only 17 per minute, which probably rounds down to nothing.
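As a hedged sketch of the "careful indexing" point (MySQL syntax, column names invented): index the combinations the XML requests actually filter and sort on, rather than all 120 columns.

    -- Wide table, one column per feed field (only a few shown here).
    CREATE TABLE feed_items (
        id         BIGINT AUTO_INCREMENT PRIMARY KEY,
        category   VARCHAR(50)   NOT NULL,
        brand      VARCHAR(100)  NOT NULL,
        price      DECIMAL(10,2) NOT NULL,
        updated_at DATETIME      NOT NULL
        -- ... the remaining feed columns ...
    );

    -- Composite index matching the common filter + sort of the XML requests.
    CREATE INDEX ix_feed_category_price ON feed_items (category, price);

    SELECT id, brand, price
    FROM feed_items
    WHERE category = 'shoes'
    ORDER BY price;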
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with, so it is necessary to analyze it in advance and pick the best table/column layout. If all of your 120 columns contain textual data, then a single row will take several kilobytes of disk space. In that situation you will want to make all queries highly selective, so that indexes are used to minimize IO; full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be: will each request extract a single row, a small bunch of rows, or a big portion (up to the whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
set up a dictionary table, with an integer PK;
replace the actual value in the master table's column with the PK from the dictionary (a sketch of the resulting layout follows the list of benefits below).
The transformation is done by triggers written in C, so although it gives me an upload penalty, I do get some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
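A sketch of that dictionary transformation with invented names (the answer does the rewrite in triggers; this just shows the resulting layout):

    -- Dictionary of the limited set of values.
    CREATE TABLE colour_dict (
        colour_id INT PRIMARY KEY,
        colour    VARCHAR(50) NOT NULL UNIQUE
    );

    -- The master table stores the small integer key instead of the text.
    CREATE TABLE feed_master (
        id        BIGINT PRIMARY KEY,
        colour_id INT NOT NULL,
        -- ... other columns ...
        FOREIGN KEY (colour_id) REFERENCES colour_dict (colour_id)
    );

    -- Join back to the dictionary only when the text form is needed.
    SELECT m.id, d.colour
    FROM feed_master m
    JOIN colour_dict d ON d.colour_id = m.colour_id
    WHERE d.colour = 'red';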
Third, try to split the data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in a table are used by almost all queries, while the remaining 60-70% are spread fairly evenly across the queries, each used only partially. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another one for the rest of the fields. In fact, you can have several "other" tables, logically grouping the data into separate tables.
In my practice we had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive, as it was used in the majority of our reports (and reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy), we managed to gain a pleasant performance boost (sorry, I don't have the numbers any longer).
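A rough sketch of that split-plus-view idea (all names invented; the view preserves the old one-wide-table shape for external applications):

    -- Hot fields that almost every query touches.
    CREATE TABLE customer_core (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(200) NOT NULL,
        status      VARCHAR(20)  NOT NULL
    );

    -- Rarely used detail, one row per customer.
    CREATE TABLE customer_detail (
        customer_id INT PRIMARY KEY,
        banking_ref VARCHAR(50),
        comments    TEXT,
        FOREIGN KEY (customer_id) REFERENCES customer_core (customer_id)
    );

    -- The view presents the original wide shape.
    CREATE VIEW customer_full AS
    SELECT c.customer_id, c.name, c.status, d.banking_ref, d.comments
    FROM customer_core c
    LEFT JOIN customer_detail d ON d.customer_id = c.customer_id;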
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.

MySQL structure for DBs larger than 10mm records

I am working with an application which has 3 tables, each with more than 10mm records and larger than 2GB.
Every time data is inserted there's at least one record added to each of the three tables and possibly more.
After every INSERT a script is launched which queries all these tables in order to extract the data relevant to the last INSERT (let's call this the aggregation script).
What is the best way to divide the DB in smaller units and across different servers so that the load for each server is manageable?
Notes:
1. There are in excess of 10 inserts per second and hence the aggregation script is run the same number of times.
2. The aggregation script is resource intensive
3. The aggregation script has to be run on all the data in order to find which one is relevant to the last insert
4. I have not found a way of somehow dividing the DB into smaller units
5. I know very little about distributed DBs, so please use very basic terminology and provide links for further reading if possible
There are two answers to this from a database point of view.
Find a way of breaking up the database into smaller units. This is very dependent on the use of your database. This is really your best bet because it's the only way to get the database to look at less stuff at once. This is called sharding:
http://en.wikipedia.org/wiki/Shard_(database_architecture)
Have multiple "slave" databases in read-only mode. These are basically copies of your database (with a little lag). For any read-only queries where that lag is acceptable, have the code across your entire site access these databases instead. This will take some load off of the master database you are querying, but any particular query will still be resource intensive.
From a programming perspective, you already have nearly all your information (aside from ids). You could try to find some way of using that information for all your needs rather than having to requery the database after insert. You could have some process that only creates ids that you query first. Imagine you have tables A, B, C. You would have other tables that only have primary keys that are A_ids, B_ids, C_ids. Step one, get new ids from the id tables. Step two, insert into A, B, C and do whatever else you want to do at the same time.
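A sketch of that id-table idea in MySQL (table and column names are invented): reserve the id first, then do the real insert with it, so the aggregation work can use the id without re-querying the big tables.

    -- Id-generator table: one row per reserved A id.
    CREATE TABLE A_ids (
        id BIGINT AUTO_INCREMENT PRIMARY KEY
    ) ENGINE=InnoDB;

    -- Step one: reserve an id and remember it.
    INSERT INTO A_ids () VALUES ();
    SET @new_a_id = LAST_INSERT_ID();

    -- Step two: insert into the real table using the reserved id, and hand
    -- the same id to whatever else needs it at the same time.
    INSERT INTO A (id, payload) VALUES (@new_a_id, '...');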
Also, the general efficiency/performance of all queries should be reviewed. Make sure you have indexes on anything you are querying, and run EXPLAIN on all queries to make sure they are using those indexes.
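A small example of that check in MySQL (invented table and column names):

    -- Index the columns the aggregation script filters on.
    CREATE INDEX ix_events_account_time ON events (account_id, created_at);

    -- EXPLAIN shows whether the query actually uses the index
    -- (check the "key" and "rows" columns of the output).
    EXPLAIN
    SELECT *
    FROM events
    WHERE account_id = 123
      AND created_at >= '2014-01-01';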
This is really a midlevel/senior dba type of thing to do. Ask around your company and have them lend you a hand and teach you.