We have a table for which we have to present many counts for different combinations of fields.
This takes quite a while to do on the fly and doesn't provide historical data, so I'm thinking about the best way to store those counts in another table, with a timestamp, so we can query them fast and get historical trends.
For each count we need 4 pieces of information to identify it, and there are about 1000 different metrics we would like to store.
I'm considering three different strategies; each stores a count and a timestamp but they vary in how the count is identified for retrieval.
1. One table with 4 fields to identify the count. The 4 fields wouldn't be normalized, as they contain data from different external tables.
2. One table with a single "tag" field, which contains the 4 pieces of information as a tag. These tags could be enriched and kept in another table, maybe with a field for each tag part, linking them to the external tables.
3. Different tables for the different groups of counts, to be able to normalize on one or more fields, but this would need anywhere from 6 to tens of tables.
I'm going with the first one, not normalized at all, but I'm wondering if anyone has a better or simpler way to store all these counts.
Sample of a value:
status,installed,all,virtual,1234,01/05/2015
The first field, status, can have up to 10 values.
The second field, installed, can have up to 10 values per value of the first field.
The third field, all, can have up to 10 different values, which are the same for all categories.
The fourth field, virtual, can have up to 30 values and will also be the same for all previous categories.
The last two fields are the count and a timestamp.
Thanks,
Isaac
When you have a lot of metrics and you don't need to use them for intra-metric calculations, you can go with the first solution.
I would probably build a table like this:
Status_id | Installed_id | All_id | Virtual_id | Date | Value
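For concreteness, a minimal DDL sketch of that layout (the table name, column types and key choice are my assumptions, not something from the question):

CREATE TABLE metric_counts (
    status_id    TINYINT UNSIGNED NOT NULL,  -- up to 10 values
    installed_id TINYINT UNSIGNED NOT NULL,  -- up to 10 values per status
    all_id       TINYINT UNSIGNED NOT NULL,  -- up to 10 values
    virtual_id   TINYINT UNSIGNED NOT NULL,  -- up to 30 values
    date         DATE NOT NULL,
    value        INT NOT NULL,
    PRIMARY KEY (status_id, installed_id, all_id, virtual_id, date)
);

The composite primary key also gives you the lookup index for retrieving one combination over time.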
Or, if the combination of the first four columns has a proper name, I would probably create two tables (I think this is what you refer to as your second option):
Metric Table
Status_id | Installed_id | All_id | Virtual_id | Metric_id | Metric_Name
Values Table
Metric_id | Date | Value
This is good if you have names for your metrics or other details which you would otherwise need to duplicate for each combination with the first approach.
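A sketch of the two-table variant, under the same assumptions about types (the metric_name length and the unique key are illustrative):

CREATE TABLE metric (
    metric_id    INT UNSIGNED NOT NULL AUTO_INCREMENT,
    status_id    TINYINT UNSIGNED NOT NULL,
    installed_id TINYINT UNSIGNED NOT NULL,
    all_id       TINYINT UNSIGNED NOT NULL,
    virtual_id   TINYINT UNSIGNED NOT NULL,
    metric_name  VARCHAR(100) NOT NULL,
    PRIMARY KEY (metric_id),
    UNIQUE KEY uq_metric_combo (status_id, installed_id, all_id, virtual_id)
);

CREATE TABLE metric_value (
    metric_id INT UNSIGNED NOT NULL,
    date      DATE NOT NULL,
    value     INT NOT NULL,
    PRIMARY KEY (metric_id, date),
    FOREIGN KEY (metric_id) REFERENCES metric (metric_id)
);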
In both cases it will be a bit complicated to do intra-row operations using different metrics; for this reason this approach is suggested only for high-level KPIs.
Finally, because all possible combinations of the last two fields are always present in your table, you could consider converting them to columns:
Status_id | Installed_id | Date | All1_Virtual1 | All1_Virtual2 | ... | All10_Virtual30
With 10 values for All and 30 for Virtual you will have 300 columns, which is not very easy to handle, but they will be worth having if you need to do something like:
(All1_Virtual2 - All5_Virtual23) * All6_Virtual12
But in this case I would prefer (if possible) to do the calculation in advance to reduce the number of columns.
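If the formula is fixed and you are on MySQL 5.7 or later, one way to do that calculation in advance is a stored generated column. This is only a sketch: the table name wide_metrics is hypothetical and the expression is just the example above.

ALTER TABLE wide_metrics
    ADD COLUMN derived_kpi INT
    GENERATED ALWAYS AS ((All1_Virtual2 - All5_Virtual23) * All6_Virtual12) STORED;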
I'm currently designing a relational database table in MySQL for handling multiple categories, representing them later in a tree structure on the client side and filtering on them. Here is how the structure looks:
We have a root element which is set by default, and we can then add children to it (level one). In the simplest case, the table structure could be defined like this:
| id | name           | parent_id |
------------------------------------
| 1  | All Categories | NULL      |
| 2  | History        | 1         |
However, I have a requirement to include another tree structure type (Products) in the table (a corresponding API is available). The records from that other source have their own id type (UUID), and basically I need to ingest them into my table. A possible structure would look like this:
| id | UUID         | name                     | parent_id |
--------------------------------------------------------------
| 1  | NULL         | All Categories           | NULL      |
| 2  | NULL         | History                  | 1         |
| 3  | NULL         | Products                 | 1         |
| 4  | CN1001231232 | Catalog electricity      | 3         |
| 5  | CN1001231242 | Catalog basic components | 4         |
| 6  | NULL         | Shipping                 | 1         |
I am new to relational databases, but all of these NULL values for the UUID seem (at least to me) to indicate bad table design. Is there a way of avoiding this, or an even better way to do this "ingestion"?
If you had a table for users, with columns first_name, middle_name, last_name but then a user signed up and said they have no middle name, you could just store NULL for that user's middle_name column. What's bad design about that?
NULL is used when an attribute is unknown or inapplicable on a given row. It seems appropriate for the case you describe, i.e. when records that did not come from the external source have no UUID, and need no UUID.
That said, some computer science theorists insist that NULL is never appropriate. There's a decades-old controversy about whether SQL should even have a NULL.
The alternative would be to create a second table, in which you store only the UUID and a reference to the entity in your first table. Then just don't store rows for the other ids.
| id | UUID         |
----------------------
| 4  | CN1001231232 |
| 5  | CN1001231242 |
And don't store the UUID column in your first table. This eliminates the NULLs, but it means you need to JOIN the two tables whenever you want to query the entities with their UUIDs.
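For example, a query over that layout might look like this (the table names category and category_uuid are illustrative):

SELECT c.id, c.name, c.parent_id, u.UUID
FROM category AS c
LEFT JOIN category_uuid AS u ON u.id = c.id;

The LEFT JOIN keeps the rows that have no external UUID; their UUID column simply comes back as NULL in the result set, even though no NULL is stored.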
First make sure you actually have to combine these in the same table. Are the products categories? If they are categories and are used like categories then it makes sense to have them in the same table, but if they have categories then they should be kept separate and given a category/parent id.
If you're sure it's appropriate to store them in the same table then the way you have it is good with one adjustment. For the UUID you can use a separate naming scheme that makes it interchangeable with id for those entries and avoids collisions with the other uuids. For example:
| id | UUID         | name                     | parent_id |
--------------------------------------------------------------
| 1  | CAT000000001 | All Categories           | NULL      |
| 2  | CAT000000002 | History                  | 1         |
| 3  | CAT000000003 | Products                 | 1         |
| 4  | CN1001231232 | Catalog electricity      | 3         |
| 5  | CN1001231242 | Catalog basic components | 4         |
| 6  | CAT000000006 | Shipping                 | 1         |
Your requirements combine the two things relational databases are not great at out of the box: modelling hierarchies, and inheritance (in the object-oriented sense).
Your design uses the "single table inheritance" model (one of 3 competing options). It's the simplest option in terms of design.
In practical terms, you may want to add a column to explicitly state which type of record you're dealing with ("regular category" and "product category") so your queries are more obvious to others.
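A sketch of what that could look like, assuming the table is called category (the column name and enum values are illustrative):

ALTER TABLE category
    ADD COLUMN record_type ENUM('regular', 'product') NOT NULL DEFAULT 'regular';

-- e.g. fetch only the ingested product categories
SELECT id, UUID, name, parent_id
FROM category
WHERE record_type = 'product';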
Is it better for write performance to write a row with multiple columns?
id1 | field1Name | field1Value | field2Name | field2Value | field3Name | field3Value
or multiple rows with less columns
id1 | field1Name | field1Value
id1 | field2Name | field2Value
id1 | field3Name | field3Value
In terms of query requirements, we can achieve what we want with either structure. I am wondering how write performance compares between these two approaches.
It depends in a complex way on the types of the columns and the binary size of their actual values. For example, if the column is of type TEXT and you write 50 KB of plain text, it will not fit into a single page and MySQL will have to use several data pages to fit it. Obviously more pages means more I/O.
On the other hand, if you write multiple rows (one value per row) you will have 2 headaches:
MySQL will have to update the primary index multiple times - again suboptimal I/O
If you want to SELECT several values - you will have to JOIN multiple rows which is very inconvenient (and you will soon realize it)
If you desperately need the best write performance, you should consider using a column-oriented database. Otherwise stick with the multi-column option.
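To illustrate the second headache, reading three values back from the one-value-per-row layout takes a self-join like this (sketch only; the table kv_fields and its columns id1, name, value are assumed names):

SELECT f1.id1,
       f1.value AS field1Value,
       f2.value AS field2Value,
       f3.value AS field3Value
FROM kv_fields AS f1
JOIN kv_fields AS f2 ON f2.id1 = f1.id1 AND f2.name = 'field2Name'
JOIN kv_fields AS f3 ON f3.id1 = f1.id1 AND f3.name = 'field3Name'
WHERE f1.name = 'field1Name';

With the multi-column layout, the same read is a single-row SELECT.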
Suppose we have two 3-bit numbers concatenated together like '101100', which basically represents 5 and 4 combined. I want to be able to perform aggregation functions like SUM() or AVG() on this column separately for each individual 3-bit part.
For instance:
'101100'
'001001'
sum(first three bits) = 6
sum(last three bits) = 5
I have already tried the SUBSTRING() function; however, speed is an issue because this query will run regularly on millions of rows, and string manipulation slows it down.
I am also open to any new databases or technologies that may support this functionality.
You can use the function conv() to convert any part of the string to a decimal number:
select
sum(conv(left(number, 3), 2, 10)) firstpart,
sum(conv(right(number, 3), 2, 10)) secondpart
from tablename
See the demo.
Results:
| firstpart | secondpart |
| --------- | ---------- |
| 6 | 5 |
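For reference, a minimal setup that reproduces those numbers, using the table and column names assumed in the query above:

CREATE TABLE tablename (number CHAR(6));
INSERT INTO tablename (number) VALUES ('101100'), ('001001');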
With the current understanding I have of your schema (which is next to none), the best solution would be to restructure your schema so that each data point is its own record instead of all the data points being in the same record. Doing this allows you to have a dynamic number of data points per entry. Your resulting table would look something like this:
id | data_type | value
ID is used to tie all of your data points together. If you look at your current table, this would be whatever you are using for the primary key. For this answer, I am assuming id INT NOT NULL but yours may have additional columns.
Data Type indicates what type of data is stored in that record. This would be the current table's column name. I will be using data_type_N as my values, but yours should be a more easily understood value (e.g. sensor_5).
Value is exactly what it says it is, the value of the data type for the given id. Your values appear to be all numbers under 8, so you could use a TINYINT type. If you have different storage types (VARCHAR, INT, FLOAT), I would create a separate column per type (val_varchar, val_int, val_float).
The primary key for this table now becomes a composite: PRIMARY KEY (id, data_type). Since your previously single record will become N records, the primary key will need to adjust to accommodate that.
You will also want to ensure that you have indexes that are usable by your queries.
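Putting that together, a sketch of the restructured table (the data_type length is an assumption; the rest follows the choices described above):

CREATE TABLE my_table (
    id        INT NOT NULL,
    data_type VARCHAR(30) NOT NULL,  -- e.g. 'sensor_5'
    value     TINYINT NOT NULL,
    PRIMARY KEY (id, data_type)
);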
Some sample values (using what you placed in your question) would look like:
1 | data_type_1 | 5
1 | data_type_2 | 4
2 | data_type_1 | 1
2 | data_type_2 | 1
Doing this, summing the values now becomes trivial. You would only need to ensure that data_type_N is summed with data_type_N. As an example, this would be used to sum your example values:
SELECT data_type,
SUM(value)
FROM my_table
WHERE id IN (1,2)
GROUP BY data_type
Here is an SQL Fiddle showing how it can be used.
Say that I have an e-commerce site with 100 kinds of products. Now, I'm offering a voucher to customers. The voucher can be used for some of the products or for all products.
So, the table I have now is like this:
| Voucher |
------------------
| id |
| voucher_number |
| created_at |
| expired_date |
| status | (available, unavailable)
| Voucher_detail |
------------------
| id |
| id_voucher |
| product_id |
So, the question is: if the voucher is set to be available for all products, there will be 100 records in voucher_detail because there are 100 products. Isn't that a waste, because the voucher will only be used for one product?
Or is there another database design that is better than this one?
Well, I don't think there is a better design for your situation. I mean, when designing the database, it should fit all the use-cases, and this way, you cover them all.
Of course, a voucher for all the products needs 100 rows, but this perfectly suits the situation where you have a voucher for, say, 5 products, and it is a secure and sure way to know exactly which products each voucher can be used for.
I was thinking you could store all the products for a voucher in one column, separated by ',', and split them when reading (if you really care about size), but that just doesn't feel right.
You probably don't generate more than a small subset of the 2^100-1 possible combinations, correct? Perhaps you have only a few, and they can be itemized as "all", "clothing", "automotive", etc.? If so ponder working around that.
If you really need true/false for 100 things, consider the SET datatype. Since it is restricted to 64 bits, you will need at least 2 SET columns. Again, you may decide to 'group' the bits in some logical way. At one bit per item, we are talking about around a dozen bytes for the entire 100 choices.
"all" would probably be represented as every bit being on.
With the SET approach, you would have string names for each of the 100 cases. The SET would be any combination of them. Please study the manual on SET; the manipulations are a bit strange.
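A rough sketch of what that could look like; the product member names are placeholders, and in practice each SET column would list up to 64 real product identifiers:

CREATE TABLE voucher_products (
    id         INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    products_a SET('p001', 'p002', 'p003', 'p004') NOT NULL DEFAULT '',
    products_b SET('p065', 'p066', 'p067', 'p068') NOT NULL DEFAULT ''
);

INSERT INTO voucher_products (products_a, products_b)
VALUES ('p001,p003', 'p066');   -- a voucher valid for three products

-- does voucher 1 cover product p066?
SELECT id
FROM voucher_products
WHERE id = 1
  AND FIND_IN_SET('p066', products_b) > 0;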
I am parsing a collection of monthly lists of bulletin board systems from 1993-2000 in a city. The goal is to make visualizations from this data. For example, a line chart that shows month by month the total number of BBSes using various kinds of BBS software.
I have assembled the data from all these lists into one large table of around 17,000 rows. Each row represents a single BBS during a single month in time. I know this is probably not the optimal table scheme, but that's a question for a different day. The structure is something like this:
date | name | phone | codes | sysop | speed | software
1990-12 | Aviary | xxx-xxx-xxxx | null | Birdman | 2400 | WWIV
Google Fusion Tables offers a function called "summarize" (or "aggregation" in the older version). If I make a view summarizing by the "date" and "software" columns, then FT produces a table of around 500 rows with three columns: date, software, count. Each row lists the number of BBSes using a given type of software in a given month. With this data, I can make the graph I described above.
So, now to my question. Rather than FT, I'd like to work on this data in MySQL. I have imported the same 17,000-row table into a MySQL database and have been trying various queries with COUNT and DISTINCT, hoping to return a list equivalent to what I get from FT's Summarize function. But nothing I've tried has worked.
Can anyone suggest how to structure such a query?
Kirkman, you can do this using the COUNT function with a GROUP BY clause (GROUP BY is used in conjunction with aggregate SQL functions):
select date, software, count(*) as cnt
from your_table
group by date, software