Changing database table based on one column - MySQL

I am new to database design, but not to computers and terminology. I need some help with my database design. I am collecting data from a Global Navigation Satellite System (GNSS) receiver, and each packet differs in size depending on which constellation it sees (GPS, GALILEO, GLONASS, etc.). There are some common fields across them all, but the way I currently have it set up, every possible field is a column, and any field that does not apply to the incoming packet is just NULL. This is very inefficient; I just don't know how to go about designing a better way. Thoughts? The main point is that, as it stands, every time I run a query I either specify all the fields that are relevant to that specific packet type, or I get a bunch of useless data.
I was thinking of one option where I have all the common fields in one table and another table for the fields unique to each packet type, plus a column that says what type of packet it is, so when I do a SELECT query I can do a JOIN and only get the data that is relevant to that packet.
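To make the idea concrete, here is a minimal sketch of that layout. All table and column names (and the GLONASS-specific field) are made up for illustration; substitute your real fields:
    CREATE TABLE packet (
        packet_id     BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        received_at   DATETIME(6) NOT NULL,
        constellation ENUM('GPS','GALILEO','GLONASS') NOT NULL,
        satellite_id  SMALLINT UNSIGNED NOT NULL
    );
    -- One child table per packet type holds only the fields unique to that type.
    CREATE TABLE packet_glonass (
        packet_id      BIGINT UNSIGNED NOT NULL PRIMARY KEY,
        frequency_slot TINYINT NOT NULL,
        FOREIGN KEY (packet_id) REFERENCES packet (packet_id)
    );
    -- Querying one packet type touches only its own fields.
    SELECT p.received_at, p.satellite_id, g.frequency_slot
    FROM packet AS p
    JOIN packet_glonass AS g USING (packet_id)
    WHERE p.constellation = 'GLONASS';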

Thousands of rows of a hundred columns, many of which are NULL? Sounds fine.
It might help if you provided a CREATE TABLE; there could be tips on datatypes, etc.
It is usually a good idea to "cleanse" data that comes in from external sources, especially if, say, two constellations provide the same value in different formats. In that case, have one column and convert one (or both) from the given format to the table's datatype. Sometimes that is 'automatic'. For example, for a FLOAT or DECIMAL(6,3) column, "123.456" and "1.23456e2" look different, but go into FLOAT the same. (OK, there could be a rounding difference.) You may choose to use DOUBLE.
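A quick throwaway demo of that (nothing to do with your real schema):
    CREATE TABLE fmt_demo (v FLOAT);
    INSERT INTO fmt_demo VALUES ('123.456'), ('1.23456e2');
    SELECT DISTINCT v FROM fmt_demo;   -- one row: both strings land on the same FLOAT value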
NULLs don't cost much. Perhaps the main concern is your programming effort.
Your title "changing database based on column" -- that is messy; don't do it.

Related

How should I organize user data with several rows in MySQL?

I am currently developing a quiz app that keeps track of user data, such as:
the sets they've studied by the ID of the specific set ([1,2,6,12])
the friends they have by the id of the user ([1,2,3,4])
their schedule(
{"2022-07-03 00:00:00":{"551":{"type":"Flashcards","setid":"1"},"552":{"type":"Flashcards","setid":"1"},"553":{"type":"Flashcards","setid":"1"},"554":{"type":"Flashcards","setid":"1"},"555":{"type":"Flashcards","setid":"1"},"556":{"type":"Flashcards","setid":"1"},"557":{"type":"Flashcards","setid":"6"},"558":{"type":"Flashcards","setid":"6"},"559":{"type":"Flashcards","setid":"6"},"560":{"type":"Flashcards","setid":"6"}}})
every individual day they've logged in (["05/15/2022","05/17/2022","05/18/2022","05/19/2022","05/22/2022","05/23/2022","05/24/2022","05/25/2022","05/28/2022","05/29/2022","05/30/2022","05/31/2022","06/02/2022","06/05/2022","06/07/2022","06/08/2022","06/10/2022","06/11/2022","06/13/2022","06/14/2022","06/15/2022","06/17/2022","06/18/2022","06/19/2022","06/20/2022","06/22/2022","06/24/2022","06/25/2022","06/26/2022","06/28/2022","06/29/2022","06/30/2022","07/01/2022","07/02/2022"])
Note: there are quite a few other types of information stored inside the table.
All of this aforementioned information is collected in a MySQL table called "users", which has a row for each user, with the accompanying data (as mentioned above).
It has recently come to my attention that MySQL has a limit on the amount of data that can be stored in a given row (around 65K bytes). If I continue to represent data this way, I believe that at scale (assume a user uses the app for 5 years; imagine the amount of data inside the "every individual day they've logged in" field) I will hit MySQL's row-size limit, and it may cause problems in the future.
Here's a picture, showing how the information is represented inside of the table "users"
How would I better represent this type of table? Should I use multiple tables inside an SQL database? How should I format it? Do I not have to worry about the data limit, and should I continue saving data in this way?
Thanks.
If I understand this correctly, you are packing way too much information in each row. The structure of your data is not being represented in a way that allows MySQL to do what it is good at. You are just creating big buckets for each user and stashing them in MySQL.
To make this work better, you can either create tables to store each relationship (this is, after all, a relational database), like user_login, user_friend_requests, and so on. The direct answer to your question is that each cell in your current table should become a table of its own.
OR, you can embrace the blob and use something like MongoDB, which is much more suited to storing and retrieving the data in a way that fits your mindset. Since you don't do any real queries on the data, a NoSQL solution would probably fit you better.
So the "right" answer to your question is "modify your schema to store this data better, or switch your database to match the way you want to store the data."
However, having said all that, since it seems you are storing JSON in those cells, you can use the JSON data type (max size 1 GB, though it's better not to get anywhere near that - see https://dev.mysql.com/blog-archive/how-large-can-json-documents-be) or LONGTEXT (4 GB), assuming you are running in a 64-bit environment - see Maximum length for MySQL type text.
The JSON data type actually has some pretty cool features.
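For example, assuming a JSON column named schedule shaped like the document above (the specific path here is just an illustration), you can pull individual values out with a JSON path:
    ALTER TABLE users ADD COLUMN schedule JSON;
    SELECT schedule->>'$."2022-07-03 00:00:00"."551".setid' AS set_id
    FROM users;
You can also index generated columns derived from JSON values if you end up querying them often.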

What is more efficient: a table with 100 columns and fewer rows, or 5 columns and 30 times more rows?

Edit 1:
Because a few good people have pointed out that my question isn't very clear, I thought I would rewrite it and make it clearer.
So basically, I am making an app that allows users to create their own forms with their own sets of input fields, with data like name, type, etc. After a user creates and publishes a form, whenever there is an entry in the form, the data gets saved into the db, of course. Because the form itself is dynamic, I need a way to save this data.
My first choice was JSONizing the entries and saving that. But because I cannot do any SQL queries on them if I save in JSON format, I am eliminating this option.
The simple method, then, is storing the data in a table like (id, rowid, columnname, value), keeping the rowid the same for all the values belonging to one submission. But this way, if a form contains 30 fields, after 100 entries my db would have 3,000 rows. So in the long run it would get huge, and I think queries will get slow when there are millions of rows in the table.
Then I got this idea of a table like (id, rowid, column1, column2, ... column100), saving all the inputs of one form submission into a single row. This way it adds only one row per submission, and it's easier to query too. I will store the actual column names elsewhere and map them to the right numbered column. It's column100 because 100 is the maximum number of inputs a user can add to a form.
So my question is whether my idea is good, or whether I should stick to the classic table.
If I've understood your question, you need to design a database structure to store data whose schema you don't know in advance.
This is hard - there's no "efficient" solution in relational databases that I'm aware of.
Option 1 would be to look at a non-relational (NoSQL) solution instead.
I won't elaborate the benefits and drawbacks, as they are highly dependent on which NoSQL option you choose.
It's worth noting that many relational engines (including MySQL) allow you to store and query structured data formats like JSON. I've not used this feature in MySQL myself, but similar functionality in SQL Server performs very well.
Within relational databases, the common solution is an "Entity/Attribute/Value" (EAV) schema. This is sorta like your option 2.
EAV designs can theoretically store an unlimited number of columns, and an unlimited number of rows - but common queries quickly become impossible. In your sample data, finding all records where the name begins with a K and the power is at least 22 turns into a very complex SQL query. It also means the application needs to enforce rules of uniqueness, mandatory/optional data attributes, and data transformation from one format to another.
From a performance point of view, this doesn't really scale to complex queries, because every clause in your "where" needs a self join, and indexes won't have a big impact on searches for non-text values stored as strings (searching for a numerical "greater than 20" is not the same as searching for a text "greater than 20").
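To make that concrete, here is roughly what "name begins with K and power is at least 22" looks like against an EAV table with your (id, rowid, columnname, value) layout; the table name is hypothetical:
    SELECT n.rowid
    FROM form_values AS n
    JOIN form_values AS p ON p.rowid = n.rowid
    WHERE n.columnname = 'name'  AND n.value LIKE 'K%'
      AND p.columnname = 'power' AND CAST(p.value AS DECIMAL(10,2)) >= 22;
One self join per condition, and the CAST defeats any index on the value column.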
Option 3 is, indeed, to make the schema logic fit into a limited number of columns (your option 1).
It means you have a limitation on the number of columns, and you still have to manage mandatory/optional, uniqueness etc. in the application. However, querying the data should be easier - finding accounts where the name starts with K and the power is at least 22 is a fairly straightforward exercise.
You do have a lot of unused columns, but that doesn't really impact performance much - disk space is so cheap that all the wasted space is probably less space than you carry around in your smart phone.
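With the numbered-column layout plus a mapping of field names to column positions (names assumed), the same search is a plain filter once you know which columns to use; say the map tells you 'name' is column3 and 'power' is column7:
    SELECT *
    FROM form_entries
    WHERE column3 LIKE 'K%'
      AND CAST(column7 AS DECIMAL(10,2)) >= 22;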
If I understand your requirement, what I would do is create a many-to-many relationship, something like this:
(tbl1) form:
- id
- field1
- field2
(tbl2) user_added_fields:
- id
- field_name
(tbl3) form_table_user_added_fields:
- form_id (fk)
- user_added_fields_id (fk)
This may not completely solve your requirements, but I hope it gives you a hint. Happy coding! :)

Having data stored across tables representing individual data types - Why is it wrong?

Say I have lots of time to waste and decide to make a database where information is not stored as entities but in separate inter-related tables representing the INT, VARCHAR, DATE, TEXT, etc. types.
It would be such a revolution to never have to design a database structure ever again, except that the fact that no one else has done it probably indicates it's not a good idea :p
So why is this a bad design? What principles does it go against? What issues could it cause from a practical point of view with a relational database?
P.S: This is for the learning exercise.
Why shouldn't you separate out the fields from your tables based on their data types? Well, there are two reasons, one philosophical, and one practical.
Philosophically, you're breaking normalization
A properly normalized database will have different tables for different THINGS, with each table having all fields necessary and unique for that specific "thing." If the only way to find the make, model, color, mileage, manufacture date, and purchase date of a given car in my CarCollectionDatabase is to join meaningless keys across three tables demarcated by data type, then my database has almost zero discoverability and no real cohesion.
If you designed a database like that, you'd find writing queries and debugging statements would be obnoxiously tiresome. Which is kind of the reason you'd use a relational database in the first place.
(And, really, that will make writing queries WAY harder.)
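For instance, just fetching a few attributes of one car from datatype-split tables might look something like this sketch (all names hypothetical):
    SELECT s.value AS make,
           i.value AS mileage,
           d.value AS purchase_date
    FROM varchar_values AS s
    JOIN int_values     AS i ON i.entity_id = s.entity_id AND i.attribute = 'mileage'
    JOIN date_values    AS d ON d.entity_id = s.entity_id AND d.attribute = 'purchase_date'
    WHERE s.entity_id = 17
      AND s.attribute = 'make';
Compare that with SELECT make, mileage, purchase_date FROM cars WHERE car_id = 17.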
Practically, databases don't work that way.
Every database engine or data-storage mechanism I've ever seen is simply not meant to be used with that level of abstraction. Whatever engine you had, I don't know how you'd get around essentially doubling your data design with fields. And with a five-fold increase in row count, you'd have a massive increase in index size, to the point that once you get a few million rows your indexes wouldn't actually help.
If you tried to design a database like that, you'd find that even if you didn't mind the headache, you'd wind up with slower performance. Instead of 1,000,000 rows with 20 fields, you'd have that one table with just as many fields, and some 5-6 extra tables with 1,000,000+ entries each. And even if you optimized that away, your indexes would be larger, and larger indexes run slower.
Of course, those two ONLY apply if you're actually talking about databases. There's no reason, for example, that an application can't serialize to a text file of some sort (JSON, XML, etc.) and never write to a database.
And just because your application needs to store SQL data doesn't mean that you need to store everything, or can't use homogenous and generic tables. An Access-like application that lets users define their own "tables" might very well keep each field on a distinct row... although in that case your database's THINGS would be those tables and their fields. (And it wouldn't run as fast as a natively written database.)

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
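In MySQL terms, that means a BINARY(20) column and converting at the boundary with UNHEX()/HEX(); for instance (the digest below is just the well-known SHA-1 of an empty string):
    SELECT LENGTH('da39a3ee5e6b4b0d3255bfef95601890afd80709')        AS hex_chars,  -- 40
           LENGTH(UNHEX('da39a3ee5e6b4b0d3255bfef95601890afd80709')) AS raw_bytes;  -- 20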
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id BIGINT UNSIGNED, item_id BINARY(20), PRIMARY KEY (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
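A sketch of that table and its lifecycle (the search id 42 and the staging table name are placeholders):
    CREATE TABLE saved_searches (
        search_id BIGINT UNSIGNED NOT NULL,
        item_id   BINARY(20)      NOT NULL,
        PRIMARY KEY (search_id, item_id)
    ) ENGINE=InnoDB;
    -- Save one result set, keys in sorted order:
    INSERT INTO saved_searches
    SELECT 42, item_id FROM search_staging ORDER BY item_id;
    -- Expire it when the session (or job) is done:
    DELETE FROM saved_searches WHERE search_id = 42;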
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.

Database design - empty fields [closed]

I am currently debating an issue with my dev team. They believe that empty fields are bad news. For instance, say we have a customer details table that stores data for customers from different countries, and each country has a slightly different address configuration, plus 1-2 extra fields: e.g. French customer details may also store details for entry code, floor/level, plus title fields (madame, etc.); South Africa would have a security number; and so on.
Given that we're talking about minor variances my idea is to put all of the fields into the table and use what is needed on each form.
My colleague believes we should have a separate table with the extra data, e.g. customer_info_fr. But this seems to totally defeat the purpose of a combined table in the first place.
The argument is that empty fields / columns are bad - but I'm struggling to find justification, in terms of database design principles, for or against this argument and the preferred solutions.
Another option is a separate mini EAV table that stores extra data with parent_id, key, val fields. Or to serialise extra data into an extra_data column in the main customer_data table.
I think I am confused because what I'm discussing is not covered by 3NF which is what I would typically use as a reference for how to structure data.
So my question, specifically:
If you have slight variances in data for each record (1-2 different fields for instance) what is the best way to proceed?
There is definitely a school of thought which holds that NULL fields are bad, in and of themselves. Relational theory demands that databases consist of facts, and NULLs are the absence of fact. So, a rigorously designed database would have no nullable columns.
Your colleague is proposing something which is on the road to 6th Normal Form, where all the tables consist of a primary key and at most one other column. Except that in such a schema we wouldn't have tables called customer_info_fr; that's not normalised. Many countries might include ENTRY_CODE in their addresses, so we would need address_entry_codes and address_floor_numbers. Not to mention address_building_number and address_building_name, as some places are identified by number and others by name.
It's completely accurate and truthful as a logical design. Alas from a physical perspective it is Teh Suck! The simplest query - select * from addresses - becomes a multi-table join, and outer joins at that. Nullable columns are a way of reconciling ugly design with the hard truth, "you cannae break the laws of physics". Nullable columns allow us to combine disjoint data sets into a single table, albeit at the cost of handling nulls (they can affect data retrieval, index usage, maths, etc).
Some designs attempt to get around the use of nulls by applying magic values. That is, if we don't know the correct value for some column we inject a default value which is a value but also means "unknown". A common instance of this is date '9999-12-31' as an open-ended TO_DATE in a FROM-TO date range. As long as everybody understands and adheres to the convention it's not a problem. It becomes a problem when some tables have date '9999-12-01' or date '9999-01-31' instead.
This is why magic values are not a robust solution. Consumers of our data need to know that -1 is the value we use for DofQ in our stock control system when we don't know the real value. But at least it's obviously not a valid value. Choosing say 20 as a magic value is deadly because it could be a real DofQ: we can no longer tell the actual values from the "don't knows".
So, given a choice between nulls and magic values, choose nulls.
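For instance, with an open-ended date range on a hypothetical price_history table, the two conventions look like this:
    -- NULL meaning "no end date yet":
    SELECT price FROM price_history
    WHERE product_id = 7
      AND from_date <= '2024-03-01'
      AND (to_date IS NULL OR to_date > '2024-03-01');
    -- Magic value: only correct while every row really uses '9999-12-31' as the sentinel.
    SELECT price FROM price_history
    WHERE product_id = 7
      AND '2024-03-01' BETWEEN from_date AND to_date;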
I'd be interested in your colleague's justification as to why empty fields are bad. As far as I'm aware, empty or null fields aren't bad in and of themselves. If you have a lot of empty data values in a column that you are planning on putting an important index on, you may want to consider other options. This actually goes for any column where you have a lot of duplicate values and need an index, as duplicated values lower the cardinality of the column, making indexes less useful. In your case, I don't see it being an issue.
For this kind of data, you're likely using a VARCHAR or some kind of TEXT column anyway, which are variable length fields in the database. It doesn't matter if your field is full of data or empty, you're still going to incur the overhead of a variable-length column (which isn't worth worrying about in normal circumstances). So again, there's no difference to the RDBMS.
From the sounds of what you're designing, I think if you came up with a generic method of handling address variances in a single table, it would be the way to go. Your code and structure would be much simpler at the negligible (in my opinion) cost of some empty data fields.
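A minimal sketch of that single table, with the country-specific bits as nullable columns (column names are examples only):
    CREATE TABLE customer_details (
        customer_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        country_code    CHAR(2)      NOT NULL,
        address_line1   VARCHAR(255) NOT NULL,
        address_line2   VARCHAR(255) NULL,
        city            VARCHAR(100) NOT NULL,
        postal_code     VARCHAR(20)  NULL,
        title           VARCHAR(20)  NULL,   -- madame, etc.
        entry_code      VARCHAR(20)  NULL,   -- FR
        floor_level     VARCHAR(10)  NULL,   -- FR
        security_number VARCHAR(30)  NULL    -- ZA
    );
Each country's form simply ignores the columns it doesn't use.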
That's what nullable fields are for: "Data not available/applicable".
SQL has a different notion of null than most programming languages, so SQL's null is often a misunderstood concept.
Whatever you do, do not go down the EAV route. This is a prescription for a poorly performing database, far, far worse than a few empty fields.
If you must have separate related tables for the different situations, a lot will depend on how different the entities are and how they will be queried. If you will be querying across categories, you will find that joining to a bunch of tables to get all the data you may or may not need is a nightmare (I don't know if Germany will be in my result set, so I join to the Germany details tables; oops, didn't need to). It can be far simpler to handle nulls than to try to figure out which of many tables you need to join to (and to always remember to left join to those tables).
However, if you will never be querying across the entities and the fields make sense separately, then put them in a separate table.
Nulls invariably add complexity to a data model because the behaviour of null in SQL rarely matches the maths, logic or reality that you intended to model with it. In other words, some queries return incorrect results, which you then need to compensate for with additional logic.
All information can be represented accurately without nulls. Since nulls add complexity it is sound design practice to begin your data model without them and then only add a null where you find some special reason to do so or where some database feature or limitation forces a null upon you.
I wouldn't overthink it. NULL can be used, but developers need to be careful using them.
I would prefer to have the Address be a long Text field in the database for any website that deals with multiple countries.
Most websites have Address Line 1, Address Line 2, Postal/ZIP Code, City, State/Region, Country ... anything more than that (like EAV) would be overkill.
I wouldn't mind having the user interface show different labels near the text boxes for each country.
Entry code, floor/level, title fields, security number, and so on should fit in the address lines, the label near it, or a tip in the UI can indicate it.