Large amounts of Data - ms-access

I've been working with MS Access 2010 for a while now and for the most part everything works. YAY. However, I have large amounts of data to eventually plot (x-axis y-axis pairs) that come from a piece of equipment I use for work. I can import this data as a separate table, but I am not particularly fond of the idea of having my database overloaded with separate tables that exist purely to store this data. To my understanding, each table should represent an entity that fits into the larger context of the database. Also, for the equipment I'm using right now, all the x-axis data is redundant. The question is: what is the best way to divide the data for efficient storage?
Considerations:
I keep running into the same problems as I think about this question. Suppose that in either case I made two tables, one to store the x-axis data and another to store the y-axis data, and then had a linking table between the two allowing for a many to many relationship.
On the one hand, I could store one value per Record (all values in one Column). But then there would need to be a tag field in each of these two tables, thus defeating the purpose of the split.
On the other hand, I could store one value per Field (all values in one Row), which in my case would yield over 2000 fields in each table.
There is a third option, the one I'm currently using, to store one pair per row in a single table. However, there is much redundancy.

You should stick with your current method. This is by far the simplest method to both retrieve and add to the data. Below I have my reactions to your other suggestions.
"Suppose that in either case I made two tables, one to store the x-axis data and another to store the y-axis data, and then had a linking table between the two allowing for a many to many relationship."
This might provide a slight hard drive space improvement if X and Y are not integers. However, it would complicate things significantly for questionable benefit.
"On the one hand, I could store one value per Record (all values in one Column). But then there would need to be a tag field in each of these two tables, thus defeating the purpose of the split."
This would make it a lot more complicated to work with the data and is a bad idea. You would need fairly involved queries to get both data points back into the same row. You could do it, but it would complicate both input and retrieval.
"On the other hand, I could store one value per Field (all values in one Row), which in my case would yield over 2000 fields in each table."
If you do this, you will regret it. This would make it nearly impossible to do any meaningful data analysis later on.
"There is a third option, the one I'm currently using, to store one pair per row in a single table. However, there is much redundancy."
This is ideal. You can easily import your data into the two columns, and the data is just as easy to retrieve. The redundancy in the x-axis values costs some disk space, but that is rarely a practical concern.
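For reference, here is a minimal sketch of that one-pair-per-row layout in generic SQL. The table and column names (Readings, RunID, XValue, YValue) are my own invention, and Access's DDL dialect uses slightly different type names (COUNTER, LONG, DOUBLE) and does not accept comments, so treat this as a template to adapt rather than something to paste in verbatim.

    -- one row per measured pair; RunID ties each pair back to an equipment run
    CREATE TABLE Readings (
        ReadingID INTEGER PRIMARY KEY,   -- autonumber/COUNTER in Access
        RunID     INTEGER NOT NULL,      -- which run the pair belongs to
        XValue    DOUBLE  NOT NULL,
        YValue    DOUBLE  NOT NULL
    );

    -- pull one run back out, ready for plotting
    SELECT XValue, YValue
    FROM Readings
    WHERE RunID = 42
    ORDER BY XValue;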

What is more efficient, a table with 100 columns and less rows or 5 columns and 30 times more rows

Edit 1:
Because a few good people have pointed out that my question isn't very clear, I thought I would rewrite it and make it clearer now.
So basically, I am making an app that allows users to create their own forms with their own set of input fields, with data like name, type, etc. After a user creates and publishes a form, whenever there is an entry in the form, the data gets saved into the db, of course. Because the form itself is dynamic, I need a way to save this data.
My first choice was JSONizing it and saving that. But because I cannot run SQL queries on the data if I save it in JSON format, I am eliminating this option.
Then the simple method is storing it in a table like (id, rowid, columnname, value), where I keep the rowid the same for all the data of one row. But this way, if a form contains 30 fields, after 100 entries my db would have 3000 rows. So in the long run it would grow huge, and I think queries will get slow when there are millions of rows in the table.
Then I got the idea of a table like (id, rowid, column1, column2...column100), where I save all the inputs of one form submission into a single row. This way it adds only 1 row per submit and it is easier to query too. I will store the actual column names and map them to the right column (number) from there. That is my idea: column100 because 100 is the maximum number of inputs the user can add to his form.
So my question is whether my idea is good, or whether I should stick to the classic table.
If I've understood your question, you need to design a database structure to store data whose schema you don't know in advance.
This is hard - there's no "efficient" solution in relational databases that I'm aware of.
Option 1 would be to look at a non-relational (NoSQL) solution instead.
I won't elaborate on the benefits and drawbacks, as they are highly dependent on which NoSQL option you choose.
It's worth noting that many relational engines (including MySQL) allow you to store and query structured data formats like JSON. I've not used this feature in MySQL myself, but similar functionality in SQL Server performs very well.
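For illustration, here is a hedged sketch of what that can look like in MySQL 5.7+: a JSON column holding the whole submission, with one attribute exposed through a generated column so it can be indexed. The table and field names (form_entries, payload, power) are made up for this example.

    CREATE TABLE form_entries (
        id      INT AUTO_INCREMENT PRIMARY KEY,
        form_id INT NOT NULL,
        payload JSON NOT NULL,
        -- expose one JSON attribute as a normal, indexable column
        power   INT AS (payload->>'$.power') STORED,
        INDEX idx_power (power)
    );

    -- ordinary SQL against data that arrived as JSON
    SELECT id, payload->>'$.name' AS name
    FROM form_entries
    WHERE power >= 22;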
Within relational databases, the common solution is an "Entity/Attribute/Value" (EAV) schema. This is sorta like your option 2.
EAV designs can theoretically store an unlimited number of columns, and an unlimited number of rows - but common queries quickly become impossible. In your sample data, finding all records where the name begins with a K and the power is at least 22 turns into a very complex SQL query. It also means the application needs to enforce rules of uniqueness, mandatory/optional data attributes, and data transformation from one format to another.
From a performance point of view, this doesn't really scale to complex queries. Every clause in your "where" needs a self join, and indexes won't help much when you search non-text data stored as text (a numeric "greater than 20" is not the same comparison as a textual "greater than 20").
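To make the self-join point concrete, here is roughly what "name starts with K and power is at least 22" looks like against a key/value table shaped like the one in the question (id, rowid, columnname, value); the table name form_values is invented:

    SELECT n.rowid
    FROM form_values AS n
    JOIN form_values AS p ON p.rowid = n.rowid
    WHERE n.columnname = 'name'  AND n.value LIKE 'K%'
      AND p.columnname = 'power' AND CAST(p.value AS SIGNED) >= 22;
    -- every additional condition adds another self join,
    -- and the CAST prevents any index on value from helping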
Option 3 is, indeed, to make the schema logic fit into a limited number of columns (your option 1).
It means you have a limitation on the number of columns, and you still have to manage mandatory/optional, uniqueness etc. in the application. However, querying the data should be easier - finding accounts where the name starts with K and the power is at least 22 is a fairly straightforward exercise.
You do have a lot of unused columns, but that doesn't really impact performance much - disk space is so cheap that all the wasted space is probably less space than you carry around in your smart phone.
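By contrast, the same filter against the fixed-column layout is one plain, indexable predicate per condition. Here I assume a hypothetical form_rows table where the mapping says column1 holds "name" and column2 holds "power":

    SELECT id
    FROM form_rows
    WHERE column1 LIKE 'K%'  -- mapped to "name"
      AND column2 >= 22;     -- mapped to "power"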
If I understand your requirement, what I would do is create a many-to-many relationship, something like this:
(tbl1) form:
- id
- field1
- field2
(tbl2) user_added_fields:
- id
- field_name
(tbl3) form_table_user_added_fields:
- form_id (fk)
- user_added_fields_id (fk)
This may not exactly match your requirements, but I hope it gives you a hint. Happy coding! :)
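For what it's worth, a rough MySQL-flavoured DDL version of those three tables (the column types are my own guesses; storing the submitted values themselves would need a further table, which is outside this sketch):

    CREATE TABLE form (
        id     INT AUTO_INCREMENT PRIMARY KEY,
        field1 VARCHAR(255),
        field2 VARCHAR(255)
    );

    CREATE TABLE user_added_fields (
        id         INT AUTO_INCREMENT PRIMARY KEY,
        field_name VARCHAR(255) NOT NULL
    );

    -- linking table: which user-added fields belong to which form
    CREATE TABLE form_table_user_added_fields (
        form_id              INT NOT NULL,
        user_added_fields_id INT NOT NULL,
        PRIMARY KEY (form_id, user_added_fields_id),
        FOREIGN KEY (form_id) REFERENCES form (id),
        FOREIGN KEY (user_added_fields_id) REFERENCES user_added_fields (id)
    );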

MySQL design regarding a web

I am tackling a problem in class: designing a MySQL representation for a web application that stores a list of events associated with a person. So this table (or tables) would have 2 columns, one of which is the person's name and the other is the event. However, a person will generally have anywhere from 30-1000 events, so this table, which we plan to have for our entire undergraduate class of 6000 students, will have millions of entries. Is there a better way to store this in MySQL that will take less space, but will still be able to retrieve individual events and the list of people that attended them just as easily as if it were a table of two columns?
Yes. There is a technique called many-to-many, and it essentially breaks your one table into three, which makes sense when you consider that there are indeed exactly three entities being modeled (a good sanity check):
Person
Event
A Person's association with an Event
You model this as three tables, with the first two having essentially two columns each: one with a unique index (the "primary key"), and the second holding a semantic name (person name, event name). Note that you can also add any number of columns to these two tables and the storage grows only once per person or event (most likely your first move will be to add a date column to the event table).
The third table is the interesting one, it contains only 2 columns, each numeric, both of which are references to the other tables (each row is simply: (person_id, event_id)). We term these "foreign keys".
This structure means a few things:
No matter how many events someone goes to, that someone is only represented once.
Same with events, no matter how many attendees.
The attendance is a "first-class" entity, and can grow to include its own attributes (e.g. "role").
This structure is called many-to-many because each person may attend many events, and each event may have many attendees.
The quintessential feature of the design is that no single piece of domain knowledge is repeated; only "keys" are repeated as necessary to model the real-world domain. (For example, in your two-column design, accounting for a name change would require an unknown number of updates and might lead to data anomalies, the avoidance of which is a primary concern of database normalization.)
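A sketch of the three tables in MySQL; the names (person, event, attendance) and column types are illustrative, not taken from the original post:

    CREATE TABLE person (
        person_id INT AUTO_INCREMENT PRIMARY KEY,
        name      VARCHAR(100) NOT NULL
    );

    CREATE TABLE event (
        event_id   INT AUTO_INCREMENT PRIMARY KEY,
        event_name VARCHAR(200) NOT NULL,
        event_date DATE                          -- the "first move" mentioned above
    );

    -- the association: one row per (person, event) pair, keys only
    CREATE TABLE attendance (
        person_id INT NOT NULL,
        event_id  INT NOT NULL,
        PRIMARY KEY (person_id, event_id),
        FOREIGN KEY (person_id) REFERENCES person (person_id),
        FOREIGN KEY (event_id)  REFERENCES event (event_id)
    );

    -- everyone who attended a given event
    SELECT p.name
    FROM attendance a
    JOIN person p ON p.person_id = a.person_id
    WHERE a.event_id = 123;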
Don't worry about "space". This isn't the 1970s and we're not going to run out of columns on punch cards to store data. You should be concerned with expressing your requirements in the proper, most normalized data structure. With proper indexing there shouldn't be a problem, not with this volume of data.
Remember indexes need to be defined on anything you will include as part of a WHERE clause, and sometimes you may need to add additional indexes for large lists fetched with ORDER BY and LIMIT.
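For example, continuing with the event/attendance tables sketched above (the exact indexes you need depend on your real queries, so treat these as illustrations):

    -- a column used in WHERE gets an index
    CREATE INDEX idx_event_name ON event (event_name);

    -- a composite index can also serve a filter plus ORDER BY ... LIMIT, e.g.
    --   SELECT * FROM event WHERE event_name = ? ORDER BY event_date DESC LIMIT 10;
    CREATE INDEX idx_name_date ON event (event_name, event_date);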
Whenever possible or practical use an integer identifier instead of a string. These are stored as a small number of bytes, typically 4, compared with a variable length string which is typically at least the length of the string in bytes plus 1.
A properly normalized database will use numerical identifiers for things anyway, so this kind of thing isn't a huge concern. The only time you go against this, or deliberately de-normalize your data, is when you have a legitimate performance problem that cannot be easily solved some other way.
As always, test your schema by generating large amounts of dummy data and see how it performs. Since you have a good idea of the requirements in advance, do some testing at those levels, and then, to be on the safe side, try 2x, 5x and 10x the data to see how much flexibility your design has. It's okay to have performance limitations so long as you know at what kind of scale you'll experience them.
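One quick way to generate throwaway rows for that kind of test in MySQL 8.0+ (the counts and the person table are carried over from the sketch above; adjust as needed):

    -- recursive CTEs default to a depth limit of 1000, so raise the cap first
    SET SESSION cte_max_recursion_depth = 10000;

    -- 6,000 fake students; repeat the pattern for event and attendance,
    -- then again at 2x, 5x and 10x the expected volumes
    INSERT INTO person (name)
    SELECT CONCAT('student_', n)
    FROM (
        WITH RECURSIVE seq (n) AS (
            SELECT 1
            UNION ALL
            SELECT n + 1 FROM seq WHERE n < 6000
        )
        SELECT n FROM seq
    ) AS numbers;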
MySQL and relational databases in general were designed specifically to handle this sort of problem. Handling millions of entries is not an issue. Complex queries may take a couple of seconds but will still perform remarkably well.
Storing one event per row is the best design, and the way you are going about it sounds right. Good luck.

Best database design for storing a high number columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. Those all need a column. We need to be able to filter and sort on all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job, but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns, since that is the most natural way to do things.
Options: My other options are a meta table that stores keys and values, or using a document-based database so that I have a variable schema and can scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need a storage that can select, sort and filter really fast.
Update:
Some more information about usage: XML feeds will be generated live from this table. We are talking about 100-500 requests per hour, but this will be growing. The fields will not change regularly, but they could change once every 6 months. We will also be updating the datafeeds daily, which means checking whether items have been updated, deleting old ones and adding new ones.
120 columns at 100k rows is not enough information; that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a Key-Value pair table. They are not great in a relational database, so stick to proper typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, since you say 'we will be updating the datafeeds daily', there shouldn't be any need to query the database on every request. Although even 1000 requests per hour is only about 17 per minute. That probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with, so it is necessary to analyze it in advance and pick the best table/column layout. If all 120 of your columns contain textual data, then a single row will take several kilobytes of disk space. In that situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be: will each request extract a single row, a small bunch of rows, or a big portion (up to the whole table)?
Second, looking at the data, you might identify a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
set up a dictionary table, making an integer PK for it;
replace the actual value in the master table's column with the PK from the dictionary (sketched below).
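A sketch of that two-step transformation with invented names, using a column that holds a limited set of status strings as the example:

    -- 1. dictionary of the limited value set, keyed by a small integer
    CREATE TABLE status_dict (
        status_id INT AUTO_INCREMENT PRIMARY KEY,
        status    VARCHAR(50) NOT NULL UNIQUE
    );

    -- 2. the master table stores the integer key instead of the repeated text
    CREATE TABLE feed_master (
        id        INT AUTO_INCREMENT PRIMARY KEY,
        status_id INT NOT NULL,
        -- ...the other feed columns go here...
        FOREIGN KEY (status_id) REFERENCES status_dict (status_id)
    );

    -- getting the text back is one join against a tiny, easily cached table
    SELECT m.id, d.status
    FROM feed_master m
    JOIN status_dict d ON d.status_id = m.status_id;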
The transformation is done by triggers written in C, so although it gives me upload penalty, I do have some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
Third, try to split the data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are used by practically all queries, while the remaining 60-70% are spread across the queries and only partially used. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another table for the rest of the fields. In fact, you can have several such "other" tables, logically grouping the data.
In my practice we had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive, as it was used in the majority of our reports (and reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy) we managed to gain a pleasant performance boost (sorry, I don't have the numbers any longer).
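A simplified sketch of that split-plus-view arrangement; the table and column names are invented, and the real table obviously had far more fields:

    -- fields used by almost every query
    CREATE TABLE customer_core (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100) NOT NULL,
        status      VARCHAR(20)  NOT NULL
    );

    -- rarely used fields split off into a 1:1 side table
    CREATE TABLE customer_billing (
        customer_id  INT PRIMARY KEY,
        bank_account VARCHAR(34),
        billing_note TEXT,
        FOREIGN KEY (customer_id) REFERENCES customer_core (customer_id)
    );

    -- the view keeps the external application seeing one wide "table"
    CREATE VIEW customer_full AS
    SELECT c.customer_id, c.name, c.status, b.bank_account, b.billing_note
    FROM customer_core c
    LEFT JOIN customer_billing b ON b.customer_id = c.customer_id;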
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.

How to decide between row based and column based table structures?

I have a data set which has hundreds of parameters (with more coming in).
If I dump them in one table, it'll probably end up having hundreds of columns (and I am not even sure how many, at this point)
I could go row-based, with a bunch of meta tables, but somehow a row-based structure feels unintuitive.
One more option would be to stay column-based but use multiple tables (split the tables logically), which seems like a good solution.
Is there any other way to do it? If yes, could you point me to some tutorial? (I am using mysql)
EDIT:
Based on the answers, I should clarify one thing: updates and deletes are going to be much rarer than inserts and selects. As it is, selects will be the bulk of the operations, so selects have to be fast.
I ran across several designs where a #4 was possible:
Split your columns into searchable and auxiliary
Define a table with only searchable columns, and an extra BLOB column
Put everything in one table: searchable columns go as-is, auxiliary go as a BLOB
We used this approach with BLOBs of XML data or even binary data representing the entire serialized object. The downside is that your auxiliary columns remain non-searchable for all practical purposes. The upside is that you can add new auxiliary columns at will without changing the schema. You can also promote previously auxiliary columns to searchable ones later with a schema change and a very simple migration program.
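A minimal sketch of that searchable-plus-BLOB layout, with invented names; swap BLOB for TEXT or JSON depending on what you serialize:

    CREATE TABLE item (
        item_id    INT AUTO_INCREMENT PRIMARY KEY,
        -- searchable columns: typed, indexable, usable in WHERE and ORDER BY
        name       VARCHAR(100) NOT NULL,
        created_at DATETIME     NOT NULL,
        INDEX idx_name (name),
        -- everything else, serialized as XML/JSON/binary by the application
        aux        BLOB
    );

    -- filters and sorts touch only the searchable columns;
    -- the application deserializes aux after fetching the row
    SELECT item_id, name, aux
    FROM item
    WHERE name LIKE 'K%'
    ORDER BY created_at;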
It all depends on the kind of data you need to store.
If it's not "relational" at all - for instance, a collection of web pages, documents, etc - it's usually not a good fit for a relational database.
If it's relational, but highly variable in schema - e.g. a product catalogue - you have a number of options:
single table with every possible column (your option 1)
"common" table with the attributes that each type shares, and joined tables for attributes for subtypes
table per subtype
If the data is highly variable and you don't want to make schema changes to accommodate the variations, you can use "entity-attribute-value" or EAV - though this has some significant drawbacks in the context of relational database. I think this is what you have in mind with option 2.
If the data is indeed relational, and there is at least the core of a stable model in the data, you could of course use traditional database design techniques to come up with a schema. That seems to correspond with option 3.
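For the "common table plus joined subtype tables" option mentioned above, a small product-catalogue flavoured sketch (all names invented):

    -- attributes every product shares
    CREATE TABLE product (
        product_id INT AUTO_INCREMENT PRIMARY KEY,
        name       VARCHAR(100)  NOT NULL,
        price      DECIMAL(10,2) NOT NULL
    );

    -- attributes that only one subtype has, one table per subtype
    CREATE TABLE product_clothing (
        product_id INT PRIMARY KEY,
        colour     VARCHAR(30),
        size_label VARCHAR(10),
        FOREIGN KEY (product_id) REFERENCES product (product_id)
    );

    CREATE TABLE product_appliance (
        product_id INT PRIMARY KEY,
        wattage    INT,
        FOREIGN KEY (product_id) REFERENCES product (product_id)
    );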
Does every item in the data set have all those properties? If yes, then one big table might well be fine (although scary-looking).
On the other hand, perhaps you can group the properties. The idea being that if an item has one of the properties in the group, then it has all the properties in that group. If you can create such groupings, then these could be separate tables.
So should they be separate? Yes, unless you can prove that the cost of performing joins is unacceptable. Perform all SELECTs via stored procedures and you can denormalise later on without much trouble.

How many columns in table to keep? - MySQL

I am stuck between a row-based and a column-based table design for storing some items. The decision is which table is easier to manage, and if columns, then how many columns are best to have? For example, I have object metadata; ideally there are 45 pieces of information (after being normalized) on the same level that I need to store per object. So is 45 columns in a heavy read/write table good? Can it work flawlessly in a real-world situation of heavy concurrent reads/writes?
If all or most of your columns are filled with data and this number is fixed, then just use 45 fields. There is nothing inherently bad about 45 columns.
If all conditions are met:
The attributes are neither known nor predictable at design time
The attributes are only occasionally filled (say, 10 or fewer per entity)
There are many possible attributes (hundreds or more)
No attribute is filled for most entities
then you have a so-called sparse matrix. This (and only this) kind of model is better represented with an EAV table.
"There is a hard limit of 4096 columns per table", it should be just fine.
Taking the "easier to manage" part of the question:
If the property names you are collecting do not change, then columns is just fine. Even if it's sparsely populated, disk space is cheap.
However, if you have up to 45 properties per item (row), but those properties might be radically different from one element to another, then using rows is better.
For example, take a product catalog. One product might have color, weight, and height. Another might have a number of buttons or handles. These are obviously radically different properties. Further, this type of data suggests that new properties will be added that might only apply to a particular set of products. In this case, rows are much better.
Another option is to go NoSQL and use a document-based database server. This would allow you to set the named "columns" on a per-item basis.
All of that said, management of rows will be done by the application, which will require some advanced DB skills. Management of columns is done by the developer at design time, which is usually easier for most people to get their minds around.
I don't know if I'm correct, but I once read that in MySQL you should keep your tables to the minimum number of columns if possible (see: http://dev.mysql.com/doc/refman/5.0/en/data-size.html). Do note: this is for MySQL; I don't know whether the same advice applies to other DBMSs like Oracle, Firebird, PostgreSQL, etc.
You could take a look at your table with 45 columns, analyze what you truly need, and move the optional fields into another table.
Hope it helps, good luck