Maximum number of rows in an MS Access database engine table? - ms-access

We know the MS Access database engine is 'throttled' to allow a maximum file size of 2GB (or perhaps internally wired to be limited to fewer than some power of 2 of 4KB data pages). But what does this mean in practical terms?
To help me measure this, can you tell me the maximum number of rows that can be inserted into a MS Access database engine table?
To satisfy the definition of a table, all rows must be unique, therefore a unique constraint (e.g. PRIMARY KEY, UNIQUE, CHECK, Data Macro, etc) is a requirement.
EDIT: I realize there is a theoretical limit but what I am interested in is the practical (and not necessarily practicable), real life limit.

Some comments:
Jet/ACE files are organized in data pages, which means there is a certain amount of slack space when your record boundaries are not aligned with your data pages.
Row-level locking will greatly reduce the number of possible records, since it forces one record per data page.
In Jet 4, the data page size was increased to 4KBs (from 2KBs in Jet 3.x). As Jet 4 was the first Jet version to support Unicode, this meant that you could store 1GB of double-byte data (i.e., 1,000,000,000 double-byte characters), and with Unicode compression turned on, 2GBs of data. So, the number of records is going to be affected by whether or not you have Unicode compression on.
Since we don't know how much room in a Jet/ACE file is taken up by headers and other metadata, nor precisely how much room index storage takes, the theoretical calculation is always going to be under what is practical.
To get the most efficient possible storage, you'd want to use code to create your database rather than the Access UI, because Access creates certain properties that pure Jet does not need. This is not to say there are a lot of these, as properties set to the Access defaults are usually not set at all (the property is created only when you change it from the default value -- this can be seen by cycling through a field's properties collection, i.e., many of the properties listed for a field in the Access table designer are not there in the properties collection because they haven't been set), but you might want to limit yourself to Jet-specific data types (hyperlink fields are Access-only, for instance).
I just wasted an hour mucking around with this using Rnd() to populate 4 fields defined as type byte, with composite PK on the four fields, and it took forever to append enough records to get up to any significant portion of 2GBs. At over 2 million records, the file was under 80MBs. I finally quit after reaching just 700K 7 MILLION records and the file compacted to 184MBs. The amount of time it would take to get up near 2GBs is just more than I'm willing to invest!

Here's my attempt:
I created a single-column (INTEGER) table with no key:
CREATE TABLE a (a INTEGER NOT NULL);
Inserted integers in sequence starting at 1.
I stopped it (arbitrarily after many hours) when it had inserted 65,632,875 rows.
The file size was 1,029,772 KB.
I compacted the file which reduced it very slightly to 1,029,704 KB.
I added a PK:
ALTER TABLE a ADD CONSTRAINT p PRIMARY KEY (a);
which increased the file size to 1,467,708 KB.
This suggests the maximum is somewhere around the 80 million mark.

As others have stated it's combination of your schema and the number of indexes.
A friend had about 100,000,000 historical stock prices, daily closing quotes, in an MDB which approached the 2 Gb limit.
He pulled them down using some code found in a Microsoft Knowledge base article. I was rather surprised that whatever server he was using didn't cut him off after the first 100K records.
He could view any record in under a second.

It's been some years since I last worked with Access but larger database files always used to have more problems and be more prone to corruption than smaller files.
Unless the database file is only being accessed by one person or stored on a robust network you may find this is a problem before the 2GB database size limit is reached.

We're not necessarily talking theoretical limits here, we're talking about real world limits of the 2GB max file size AND database schema.
Is your db a single table or
multiple?
How many columns does each table have?
What are the datatypes?
The schema is on even footing with the row count in determining how many rows you can have.
We have used Access MDBs to store exports of MS-SQL data for statistical analysis by some of our corporate users. In those cases we've exported our core table structure, typically four tables with 20 to 150 columns varying from a hundred bytes per row to upwards of 8000 bytes per row. In these cases, we would bump up against a few hundred thousand rows of data were permissible PER MDB that we would ship them.
So, I just don't think that this question has an answer in absence of your schema.

It all depends. Theoretically using a single column with 4 byte data type. You could store 300 000 rows. But there is probably alot of overhead in the database even before you do anything. I read some where that you could have 1.000.000 rows but again, it all depends..
You can also link databases together. Limiting yourself to only disk space.

Practical = 'useful in practice' - so the best you're going to get is anecdotal. Everything else is just prototyping and testing results.
I agree with others - determining 'a max quantity of records' is completely dependent on schema - # tables, # fields, # indexes.
Another anecdote for you: I recently hit 1.6GB file size with 2 primary data stores (tables), of 36 and 85 fields respectively, with some subset copies in 3 additional tables.
Who cares if data is unique or not - only material if context says it is. Data is data is data, unless duplication affects handling by the indexer.
The total row counts making up that 1.6GB is 1.72M.

When working with 4 large Db2 tables I have not only found the limit but it caused me to look really bad to a boss who thought that I could append all four tables (each with over 900,000 rows) to one large table. the real life result was that regardless of how many times I tried the Table (which had exactly 34 columns - 30 text and 3 integer) would spit out some cryptic message "Cannot open database unrecognized format or the file may be corrupted". Bottom Line is Less than 1,500,000 records and just a bit more than 1,252,000 with 34 rows.

Related

Is it more efficient to split a large table into several tables, or stick with one, in MySQL?

I'm writing a C# program where I'm looking at ~5300 stock tickers. I'm storing the data in a MySQL database with the following fields: date, tickername, closingPrice, movingaverage50, movingaverage200, ... and a few others. Each stock can have up to 15300 different datapoints. So the total database will be 5300x15300x6 or so different fields.
My question is, is there a more efficient way to store all this data other than one big table? Would breaking the data up into different tables, say by decade, buy me anything? Is there some link/website where I should go to get a general feel of what considerations I should look at to design a database to be as fast as possible, or does the MySQL database itself make it efficient?
I'm currently reading in 5500 excel files one at a time to fill my c# objects with data, and that takes around 15minutes... I'm assuming once I get my MySQL going that will be cut way down.
Thanks for any help; this is more of a fishing for a place to get started thinking about database design I guess.
This is too long for a comment.
In general, it is a bad idea to store multiple tables with the same format. That becomes a maintenance problem and has dire consequences for certain types of queries. So, one table is preferred.
The total number of rows is 486,540,000. This is pretty large but not extraordinary.
The question about data layout depends not only on the data but how it is being used. My guess is that the use of indexes and perhaps partitions would solve your performance problems.
Processing 5,500 Excel files in 15 minutes seems pretty good. Whether the database will be significantly faster depends on the volume of data between the server and the application. If you are reading the "Excel" files as CSV text files, then the database may not be a big gain. If you are reading through Excel, then it might be better.
Note: with a database, you may be able to move processing from C# into the database. This allows the database to take advantage of parallel processing, which can open other avenues for performance improvement.
One table.
PRIMARY KEY(ticker, date) -- This makes getting historical info about a single ticker efficient because of the clustering.
PARTITION BY (TO_DAYS(date)) -- This leads to all the INSERT activity being in a single partition. This partition is of finite size, hence the random accessing to insert 5300 new rows every night scattered around will probably still be in cache.
Partition by month, or something of approximately that size -- small enough for a partition to be cached, but not so small enough that you have an unwieldy number of partitions. (It is good to keep a table under 50 partitions. This 'limitation' may lift with "native partitions" that are coming in 5.7.)
If you already have several months' data in a table, put that in a single, oversized, partition; there is no advantage in splitting it up by month.
Minimize column sizes. 2-byte SMALLINT UNSIGNED for ticker_id, linked to a normalization table of tickers. 3-byte DATE; etc. Volume can be too big for INT UNSIGNED, either use FLOAT (with some roundoff error) or some DECIMAL. Prices are tricky -- rounding errors with FLOAT, excessive size with DECIMAL: need at least (9,4) (5 bytes) for US tickers, worse if you go back to the days of fractional pricing (eg, 5-9/16).
Think through the computation of moving averages; this may be the most intensive activity.

Access Table Corrupting database

I have a table, “tblData_Dir_Statistics_Detail” with 15 fields that is holding about 150,000 records. I can load the records into this table but, am having trouble updating other tables from this table (I want to use several different queries to update a couple different tables). There is only one index on the table and the only thing unusual about the table is there are 3 text fields that I run out to 255 characters because some of the paths\data are that long or even exceed 255. I have tried trim these to 150 characters but it has no impact on the correcting the problems that I am having using this table.
Additionally, I manually recreated the table because it acts like it is corrupted. Even this had no impact on the problems.
The original problem that I was getting is that my code would stop with a “System resource exceeded.”.
Here is the list of things I am experience and can’t seem to figure out why:
When I use the table in an update query (using task manager) I always see my Physical Memory usage for Access jump from about 35,000 K to 85,000 K instantly when the code hits this query and then, within a second or two, I get the resources exceed error.
Sometimes, but not all the time, when I compact and repair, tblData_Dir_Statistics_Detail is deleted by the process and is subsequently listed in MSysCompactError table as an error. The “ErrorCode” in the table is -1011 and the “ErrorDiscription” is “System resource exceeded.”
Sometimes, but not all the time, when I compact and repair, if I lose tblData_Dir_Statistics_Detail, I will lose the next one below it in the database window (shows also in the SYS table).
Sometimes, but not all the time, when I compact and repair, if I lose tblData_Dir_Statistics_Detail, I will lose the next TWO positions below it in the database window (shows also in the SYS table).
I have used table structures like this with much larger tables without problems for years. Additionally, I have a parallel table “tblData_Dir_Statistics” which has virtually the same structure and holds the same data but at a summarized level, and have no trouble with that or any other table.
Summary:
My suspicion is that there is some kind of character being imported into one of the fields that is corrupting this entire table.
If this is true, how could I find the corruption?
If it is not this, what else could it be?
Many Thanks!
A few considerations:
Access files have a size-limit of 2 GB. If your file becomes at any time bigger than 2 GB (even by 1 byte) the whole file is corrupted
Access creates temporary objects when sorting data and/or executing queries (and those temporary objects are created and stored in the file). Depending on the complexity of your queries, those temporary objects might be pushing the file size up (see previous paragraph).
If you are using text fields with lengths bigger than 255 characters, consider using Memo fields (these fields cannot be indexed as far as I remember, so be careful when using them)
Consider adding more indexes to your tables to ease and speed up the queries.
Consider dividing the database: Put all the data in one (or more) file(s) and link the tables in it (them) to another Access file, and execute the queries in this last one.

Innodb + 8K length row limit

I've a Innodb table with, amongst other fields, a blob field that can contain up to ~15KB of data. Reading here and there on the web I found that my blob field can lead (when the overall fields exceed ~8000 bytes) to a split of some records into 2 parts: on one side the record itself with all fields + the leading 768 bytes of my blob, and on another side the remainder of the blob stored into multiple (1-2?) chunks stored as a linked list of memory pages.
So my question: in such cases, what is more efficient regarding the way data is cached by MySQL ? 1) Let the engine deal with the split of my data or 2) Handle the split myself in a second table, storing 1 or 2 records depending on the length of my blob data and having these records cached by MySQL ? (I plan to allocate as much memory as I can for this to occur)
Unless your system's performance is so terrible that you have to take immediate action, you are better off using the internal mechanisms for record splitting.
The people who work on MySQL and its forks (e.g. MariaDB) spend a lot of time implementing and testing optimizations. You will be much happier with simple application code; spend your development and test time on your application's distinctive logic rather than trying to work around internals issues.

Best database design for storing a high number columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. Those all needs a column. We need to be able to filter and sort all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job but I'm are open for suggestions. At this moment I'm planning to make a table with all 120 columns since that is the most natural way to do things.
Options: My other options are a meta table that stores key and values. Or using a document based database so I have access to a variable schema and scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need a storage that can select, sort and filter really fast.
Update:
Some more information about usage. XML feeds will be generated live from this table. we are talking about 100 - 500 requests per hours but this will be growing. The fields will not change regularly but it could be once every 6 months. We will also be updating the datafeeds daily. So checking if items are updated and deleting old and adding new ones.
120 columns at 100k rows is not enough information, that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a Key-Value pair table. They are not great in a relational database, so stick to proper typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say 'we will be updating the datafeeds daily', then there shouldn't be any need to query the database every time. Although, 1000 per hour is only 17 per minute. That probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with. So it is necessary to analyze it in advance and pick the best table/column layout. If you have all your 120 columns containing textual data, then a single row will take several K-bytes of disk space. In such situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be, will each request extract a single row, a small bunch of rows or a big portion (up to whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
setup a dictionary table, making an integer PK for it;
replace the actual value in a master table's column with PK from the dictionary.
The transformation is done by triggers written in C, so although it gives me upload penalty, I do have some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
Third, try to split data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are typically being used by the all queries, the rest 60-70% are evenly distributed among all of them and used partially. In this case I would recommend splitting main table accordingly: extract the fields that are always used into single "master" table, and create another one for the rest of the fields. In fact, you can have several "another ones", logically grouping data in a separate tables.
In my practice we've had a table that contained customer detailed information: name details, addresses details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive ones, as it was used in the majority of our reports (reports typically perform Full scans). Splitting this table into a set of smaller ones and building a view with rules on top of them (to make external application happy) we've managed to gain a pleasant performance boost (sorry, don't have numbers any longer).
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.

Performance of additional columns vs additional rows

I have a question about table design and performance. I have a number of analytical machines that produce varying amounts of data (which have been stored in text files up to this point via the dos programs which run the machines). I have decided to modernise and create a new database to store all the machine results in.
I have created separate tables to store results by type e.g. all results from the balance machine get stored in the balance results table etc.
I have a common results table format for each machine which is as follows:
ClientRequestID PK
SampleNumber PK
MeasureDtTm
Operator
AnalyteName
UnitOfMeasure
Value
A typical ClientRequest might have 50 samples which need to tested by various machines. Each machine records only 1 line per sample, so there are apprx 50 rows per table associated with any given ClientRequest.
This is fine for all machines except one!
It measures 20-30 analytes per sample (and just spits them out in one long row), whereas all the other machines, I am only ever measuring 1 analyte per RequestID/SampleNumber.
If I stick to this format, this machine will generate over a miliion rows per year, because every sample can have as many as 30 measurements.
My other tables will only grow at a rate of 3000-5000 rows per year.
So after all that, my question is this:
Am I better to stick to the common format for this table, and have bucket loads of rows, or is it better to just add extra columns to represent each Analyte, such that it would generate only 1 row per sample (like the other tables). The machine can only ever measure a max of 30 analytes (and a $250k per machine, I won;t be getting another in my lifetime).
All I am worried about is reporting performance and online editing. In both cases, the PK: RequestID and SampleNumber remain the same, so I guess it's just a matter of what would load quicker. I know the multiple column approach is considered woeful from a design perspective, but would it yield better performance in this instance?
BTW the database is MS Jet / Access 2010
Any help would be greatly appreciated!
Millions of rows in a Jet/ACE database are not a problem if the rows have few columns.
However, my concern is how these records are inserted -- is this real-time data collection? If so, I'd suggest this is probably more than Jet/ACE can handle reliably.
I'm an experienced Access developer who is a big fan of Jet/ACE, but from what I know about your project, if I was starting it out, I'd definitely choose a server database from the get go, not because Jet/ACE likely can't handle it right now, but because I'm thinking in terms of 10 years down the road when this app might still be in use (remember Y2K, which was mostly a problem of apps that were designed with planned obsolescence in mind, but were never replaced).
You can decouple the AnalyteName column from the 'common results' table:
-- Table Common Results
ClientRequestID PK SampleNumber PK MeasureDtTm Operator UnitOfMeasure Value
-- Table Results Analyte
ClientRequestID PK SampleNumber PK AnalyteName
You join on the PK (Request + Sample.) That way you don't duplicate all the rest of the rows needlessly, can avoid the join in the queries where you don't require the AnalyteName to be used, can support extra Analytes and is overall saner. Unless you really start having a performance problem, this is the approach I'd follow.
Heck, even if you start having performance problems, I'd first move to a real database to see if that fixes the problems before adding columns to the results table.