Encode probability distribution in single cell of table [closed] - mysql

Is there any mechanism for storing all the information of a probability distribution (discrete and/or continuous) into a single cell of a table? If so, how is this achieved and how might one go about making queries on these cells?

Your question is very vague, so I can only give general hints.
I'd say there are two typical approaches for this (if I understood your question correctly):
you can store complex data in a single "cell" (as you call it) of a database table. The easiest way is JSON encoding: take an array of values, encode it to a string, and store that string. To access the values again, you query the string and decode it back into an array. Newer versions of MariaDB and MySQL also offer JSON functions to access such values at the SQL level, though access is fairly slow that way (a sketch follows at the end of this answer).
you use an additional table for the values and store only a reference in the cell. This is the typical and preferred approach; it is how the relational database model works. The advantages are that you can access each value separately in SQL, apply mathematical operations like sums and averages at the SQL level, and avoid the storage limits of a single cell. You can also filter the values, for example by date range or value boundaries.
Taken together, both approaches offer the same capabilities, though they require different handling of the data. The first approach additionally requires a scripting language on the client side to handle encoding and decoding, but that is typically available anyway.
The second approach is considered cleaner and will be faster in most cases, except when you always access the whole set of values at once. So a decision can only be made with more specific details about the environment and the goal of the implementation.
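To illustrate the first approach, here is a minimal sketch using MySQL's JSON support (requires MySQL 5.7+ or MariaDB 10.2+; table and column names are hypothetical):
-- store a discrete distribution as value/probability pairs in one cell
CREATE TABLE distributions (
    id  INT AUTO_INCREMENT PRIMARY KEY,
    pmf JSON
);
INSERT INTO distributions (pmf)
VALUES ('{"1": 0.2, "2": 0.5, "3": 0.3}');
-- SQL-level access via JSON functions (slower than a normalized table):
SELECT JSON_EXTRACT(pmf, '$."2"') AS p_of_2 FROM distributions;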

Say we have a distribution in column B (B1:B13, say), and we want to place the whole distribution in a single cell. In C1 enter:
=B1
and in C2 enter:
=C1 & CHAR(10) & B2
and copy downwards. Finally, format cell C13 with wrap text turned on; it will show all thirteen values, one per line, inside a single cell.

Related

Best way to store variable-sized floating-point arrays in MySQL database [closed]

I have a MySQL database in which I need to store variable-sized floating-point arrays. These arrays are profile measurements made by a metrology tool (an ellipsometer, to be specific) in a lab. The number of points per array varies depending on the recipe the user used: it currently ranges from about 5 to 199 (the largest I've seen), but the maximum could increase if a user creates a recipe in the future that measures more points. Most of the arrays are 49 points (about 70%), and only 0.04% use 199 points.
Currently, we are storing the arrays in the MySQL database as comma-separated strings in a TEXT column. But retrieval requires parsing and typecasting, which makes it very slow. It also takes up about twice as much space as a 32-bit binary array would. My question is: what is the best way to store this data? I do not need to search, sort, or join on this column; I just need to retrieve it alongside the rest of the data that goes with the measurement.
There's a SQLish way of doing this, and it does not involve commas or BLOBs.
Create a datapoints table with three columns:
CREATE TABLE datapoints (
    datapoint_id   BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    measurement_id BIGINT UNSIGNED,
    datapoint      FLOAT  -- or DOUBLE if you need it
);
Then store your sequences of datapoints into the table, in order, in multiple rows. Each distinct measurement gets its own measurement_id value.
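For example, a five-point measurement might be stored like this (the measurement_id value 42 is just an illustration):
INSERT INTO datapoints (measurement_id, datapoint)
VALUES (42, 1.01), (42, 1.02), (42, 0.99), (42, 1.00), (42, 1.03);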
You can retrieve a measurement of any length with
SELECT datapoint
FROM datapoints
WHERE measurement_id = whatever
ORDER BY datapoint_id;
This may seem like too many rows to store. But, SQL is optimized really well for this sort of use pattern.
Creating this index will make things really fast.
CREATE INDEX points ON datapoints
(measurement_id, datapoint_id, datapoint);
And, you can make each measurement contain as few or as many datapoints as needed.
Edit: with this structure you can use SQL to get all sorts of descriptive statistics really easily.
SELECT measurement_id,
       COUNT(*),
       MAX(datapoint),
       MIN(datapoint),
       AVG(datapoint),
       STDDEV(datapoint)
FROM datapoints
GROUP BY measurement_id;
The index I suggested will accelerate that operation too.
If you really do need arrays, consider PostgreSQL in place of MySQL. It has native array types in its SQL dialect.
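A minimal sketch of what that could look like in PostgreSQL (table and column names are hypothetical; REAL[] is a native variable-length array of single-precision floats):
CREATE TABLE measurements (
    measurement_id BIGSERIAL PRIMARY KEY,
    datapoints     REAL[]
);
INSERT INTO measurements (datapoints)
VALUES (ARRAY[1.01, 1.02, 0.99, 1.00, 1.03]);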
All FLOATs in a single table
A single table with about 10K rows is trivial for a database to handle. It would have at least the FLOAT value and something like a recipe_id. (See OJones's answer for more discussion.)
JSON
Most languages can handle JSON. With json_decode() and json_encode() (or however they are spelled in your language), you can easily turn an array into a string and vice versa. The database column can use a JSON datatype if your version has one, or simply TEXT.
BLOB
A BLOB is trickier because of the typecasting required; it is easier in older languages like C / C++, but harder in newer languages. You mentioned JMP as the data source, but not the language you are using.

Writing CSV files - fill columns with whitespace or not? [closed]

When doing various kinds of data analysis, it often makes sense to save some intermediate results as a CSV file. This could be for documentation, to hand over to colleagues who want to work with Excel or similar tools, or to have a quick way to do a sanity check yourself.
But how do I best format such a CSV file? Let's assume I want to have a classic spreadsheet with a header row and the data in columns. Like so:
Device_id;Location;Mean_reading;Error_count
opti-1;Upper-Underburg Backroad 2;1.45;42
ac-4;Valley 23;0.1;2
level-245;Lower-Underburg Central Market Place;1034;5
For opening it in Excel or reading it in with pandas, this works flawlessly, as long as you specify ; as the separator. However, as you can see from this example, it is quite hard to read when opened in a simple text editor, which might be preferable in many cases (remote access, faster opening, no assumptions needed about the separator or decimal dot vs. comma, etc.).
So I could simply add some whitespace to make the CSV look like this:
Device_id ;Location ;Mean_reading ;Error_count
opti-1 ;Upper-Underburg Backroad 2 ;1.45 ;42
ac-4 ;Valley 23 ;0.1 ;2
level-245 ;Lower-Underburg Central Market Place ;1034 ;5
But should I?
Are there any documented best practices or standards on how to write CSV files in such cases?
I can see pros and cons for both ways (see below), so I'm wondering if there are any guidelines on which way to go.
I'm leaning towards the latter. Looking at the CSV files I get from various data loggers and other software, this seems to be the preferred way; on the other hand, searching for "CSV whitespace" on this site mostly turns up questions about how to get rid of it.
And I can see some potential issues with the required field lengths, since I either need to make assumptions (e.g. that Location needs a length of 40 characters) that might or might not hold (what happens when I place a device at Underburg western motorway industrial estate northern fence?), or I need some potentially expensive logic to figure out the needed field lengths.
I work daily with CSV data files (in the printing industry, where CSV is still the common denominator). I usually tell customers that the format to choose depends on the purpose.
CSV files without whitespace are for machine (software) reading, or for cases where you have a common separator that is not used elsewhere, if you want to avoid the path of escaping separators.
Fixed-width files are better for humans, or where the chosen separator will at times be part of the text. This comes at a penalty if you use spaces to pad, since fixed-width files take up more space. And, as you point out, you need to know the longest possible field in advance. For my customers, files in this format are mostly result exports from legacy software dating back many years.
A variant to consider could be tab-separated files, since you can choose on the viewer/editor side how wide a tab should be. That way, you are less dependent on the field size.
Or keep the file in its compact form for machine reading, and make yourself a temporary padded copy using AWK as a filter. It's trivial to do, and you can make the field lengths anything you want without modifying the original file.
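For example, a minimal AWK sketch (assuming ; as the separator and a padding width of 40; the file name data.csv is hypothetical):
awk -F';' '{ for (i = 1; i <= NF; i++) printf "%-40s%s", $i, (i < NF ? ";" : "\n") }' data.csv > padded.csv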

Is it good to keep multi-valued attributes in a table existing in a 24/7 running service? [closed]

I am working on a project where I am using a table with a multi-valued attribute holding 5-10 values. Is it good to keep multi-valued attributes, or should I normalize the table into normal forms?
My concern is that normalization unnecessarily increases the number of rows: if an attribute has 10 values, each row (tuple) will be replaced with 10 new rows, which might increase query running time.
Can anyone give suggestions on this?
The first normal form requires that each attribute be atomic.
I would say that the answer to this question hinges on the “atomic”: it is too narrow to define it as “indivisible”, because then no string would be atomic, as it can be split into letters.
I prefer to define it as “a single unit as far as the database is concerned”. So if this array (or whatever it is) is stored and retrieved in its entirety by the application, and its elements are never accessed inside the database, it is atomic in this sense, and there is nothing wrong with the design.
If, however, you plan to use elements of that attribute in WHERE conditions, if you want to modify individual elements with UPDATE statements or (worst of all) if you want the elements to satisfy constraints or refer to other tables, your design is almost certainly wrong. Experience shows that normalization leads to simpler and faster queries in that case.
Don't try to get away with a few large table rows. Databases are optimized for dealing with many small table rows.
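For illustration, a minimal normalized sketch of such a design (table and column names are hypothetical):
CREATE TABLE items (
    item_id INT PRIMARY KEY
);
-- one row per element of the former multi-valued attribute
CREATE TABLE item_values (
    item_id INT NOT NULL,
    value   VARCHAR(100) NOT NULL,
    PRIMARY KEY (item_id, value),
    FOREIGN KEY (item_id) REFERENCES items (item_id)
);
-- the elements can now be used in WHERE clauses, UPDATEs, and constraints:
SELECT item_id FROM item_values WHERE value = 'red';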

I have to create a table with 96 columns. Is it efficient or not? But 96 columns must be in the table [closed]

I have a table with 96 columns. The problem is that I get confused creating this table with such a large number of columns.
Don't do that, then!
It's rare to genuinely need a table with that many columns. Most likely, you will be able to split the data across multiple tables in a relational design. For example, if each record in your long table contains the name of a product, the price of the product, the store that sells the product, and the address of the store, you will usually want separate Stores and Products tables, probably with a many-to-many relationship between them.
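A minimal sketch of that decomposition (all table and column names are hypothetical):
CREATE TABLE stores (
    store_id INT PRIMARY KEY,
    name     VARCHAR(100),
    address  VARCHAR(200)
);
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    name       VARCHAR(100),
    price      DECIMAL(10, 2)
);
-- junction table implementing the many-to-many relationship
CREATE TABLE store_products (
    store_id   INT NOT NULL,
    product_id INT NOT NULL,
    PRIMARY KEY (store_id, product_id),
    FOREIGN KEY (store_id)   REFERENCES stores (store_id),
    FOREIGN KEY (product_id) REFERENCES products (product_id)
);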
To a large extent you can do so without much thought, by putting your database into some normal form, typically the third normal form. These normal forms are chosen to have nice properties when you want to insert, update, or delete a record. However, you usually have to think about the meaning of the data you store to find a decomposition that makes sense. A lack of repetitions in the initial data doesn't mean there won't be any later.
Read more
Those concepts are well explained in the Manga Guide to Databases.
This answer gives an example of a situation that requires partitioning, and another answer by the same user explains the performance benefits. (Besides not confusing oneself.)
But I need to!
In some odd situations, you might genuinely need a long table. Maybe you're starting a club for people who have exactly 95 names and so you need to store an identifier key (since there is no natural primary key in this case) and each of the names in order. In that case, you will have some test data you can use to immediately verify that the table has the correct format.
To avoid getting confused, it might help to use pen and paper (or a blackboard): write out the test data in the order that's most natural, find a reasonable name and format for each column, and then work off that when writing your table creation procedure. The line numbers in your editor should be enough to make sure you haven't skipped a column.

SQL DB Best Practice [closed]

When it comes to SQL DB schema design, is it better practice to add more "boolean"-like fields, or to keep one "mode" field representing the different combinations? In either case, can you elaborate on why it's better?
Thanks
If you care about specific values of things . . . IsActive, IsBlue, IsEnglishSpeaking, then have a separate flag for each one.
There are a few cases when having a combined "mode" might be beneficial. However, in most cases, you want your columns to represent variables that you care about. You don't want to have special logic dissecting particular values.
In MySQL, a boolean value actually occupies 1 byte (8 bits). MySQL has other data types, such as ENUM and SET, that might do what you want. The BIT data type, despite its name, does not seem to pack flags into a single byte (which happens in some other databases).
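As a hypothetical illustration of the two designs in MySQL (column names borrowed from the answer above):
-- separate flags, one column per property you care about:
CREATE TABLE people_flags (
    person_id           INT PRIMARY KEY,
    is_active           BOOLEAN,
    is_blue             BOOLEAN,
    is_english_speaking BOOLEAN
);
-- combined "mode" using a SET column:
CREATE TABLE people_mode (
    person_id INT PRIMARY KEY,
    traits    SET('active', 'blue', 'english_speaking')
);
-- the SET variant needs special logic to dissect particular values, e.g.:
SELECT person_id FROM people_mode WHERE FIND_IN_SET('blue', traits);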
I think I get where you're coming from... The answer is: use boolean for has_property scenarios, and something else for everything else.
Basically, you're talking about a "flag" versus a "variable". Flags are boolean. Everything else has some other type (integer, datetime, etc.). Databases are strictly typed, so either work with a strictly typed language or make sure the data you pass to your DB is correctly typed. DBs have incredible flexibility as well; for instance, BLOBs can store arbitrary binary data (like pickled Python objects).