MySQL: How to query where a JSON array contains a specific label

MySQL 5.7.24
Let's say I have 3 rows like this:
ID (PK) | Name (VARCHAR) | Data (JSON)
--------+----------------+-------------------------------------
1 | Admad | [{"label":"Color", "value":"Red"}, {"label":"Age", "value":40}]
2 | Saleem | [{"label":"Color", "value":"Green"}, {"label":"Age", "value":37}, {"label":"Hoby", "value":"Chess"}]
3 | Daniel | [{"label":"Food", "value":"Grape"}, {"label":"Age", "value":47}, {"label":"State", "value":"Sel"}]
Rule #1: The JSON column is dynamic, meaning not everybody will have the same structure
Rule #2: Assume I can't modify the data structure
My question: is it possible to query so that I can get the IDs of records where the Age is >= 40? In this case, 1 and 3.
Additional info (after being pointed to a duplicate): if you look at my data, the parent container is an array. If I stored my data like
{"Age":"40", "Color":"Red"}
then I can simply use
Data->>'$.Age' >= 40
My current thinking is to use a stored procedure to loop over the array, but I hope I don't have to take that route. The second option is to use a regex (which I also hope to avoid). If you think "JSON search" is the solution, kindly point me to which function (or some sample for a noob like me). The documentation is too general for my specific needs.

Here's a demo:
mysql> create table letsayi (id int primary key, name varchar(255), data json);
mysql> insert into letsayi values
-> (1, 'Admad', '[{"label":"Color", "value":"Red"}, {"label":"Age", "value":"40"}]'),
-> (2, 'Saleem', '[{"label":"Color", "value":"Green"}, {"label":"Age", "value":"37"}, {"label":"Hoby", "value":"Chess"}]');
mysql> select id, name from letsayi
where json_contains(data, '{"label":"Age","value":"40"}');
+----+-------+
| id | name |
+----+-------+
| 1 | Admad |
+----+-------+
I have to say this is the least efficient way you could store your data. There's no way to use an index to search for your data, even if you use indexes on generated columns. You're not even storing the integer "40" as an integer — you're storing the numbers as strings, which makes them take more space.
Using JSON in MySQL when you don't need to is a bad idea.
Is it still possible to query age >= 40?
Not using JSON_CONTAINS(). That function is not like an inequality condition in a WHERE clause. It only matches exact equality of a subdocument.
To do an inequality, you'd have to upgrade to MySQL 8.0 and use JSON_TABLE(). I answered another question recently about that: MySQL nested JSON column search and extract sub JSON
In other words, you have to convert your JSON into a format as if you had stored it in traditional rows and columns. But you have to do this every time you query your data.
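For reference, here's a minimal sketch of what that looks like with JSON_TABLE() on MySQL 8.0, using the letsayi demo table above (the CAST is there because the demo stores the ages as strings):
SELECT t.id, t.name
FROM letsayi AS t,
     JSON_TABLE(
       t.data, '$[*]'
       COLUMNS (
         label VARCHAR(64)  PATH '$.label',
         value VARCHAR(255) PATH '$.value'
       )
     ) AS j
WHERE j.label = 'Age'
  AND CAST(j.value AS UNSIGNED) >= 40;
Every array element becomes a row of the derived table j, and only then can you apply an ordinary inequality to it.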
If you need to use conditions in the WHERE clause, you're better off not using JSON. It just makes your queries much too complex. Listen to this old advice about programming:
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."
— Brian Kernighan
As for how people tackle dynamically added form fields:
You could create a key/value table for the dynamic form fields:
CREATE TABLE keyvalue (
user_id INT NOT NULL,
label VARCHAR(64) NOT NULL,
value VARCHAR(255) NOT NULL,
PRIMARY KEY (user_id, label),
INDEX (label)
);
Then you can add key/value pairs for each user's dynamic form entries:
INSERT INTO keyvalue (user_id, label, value)
VALUES (123, 'Color', 'Red'),
(123, 'Age', '40');
This is still a bit inefficient in storage compared to real columns, because the label names are stored every time you enter a user's data, and you still store integers as strings. But if the users are really allowed to store any labels of their own choosing, you can't make those real columns.
With the key/value table, querying for age >= 40 is simpler:
SELECT user_id FROM keyvalue
WHERE label = 'Age' AND value >= 40;
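Since value is a VARCHAR, the comparison above relies on MySQL's implicit string-to-number conversion; if you prefer to make that explicit, a small variation:
SELECT user_id FROM keyvalue
WHERE label = 'Age' AND CAST(value AS UNSIGNED) >= 40;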

Related


How to organise the storage many to many relations in Json in mysql

I have a table with a JSON field (example):
# table1
id | json_column
---+------------------------
1 | {'table2_ids':[1,2,3], 'sone_other_data':'foo'}
---+------------------------
2 | {'foo_data':'bar', 'table2_ids':[3,5,11]}
And
# table2
id | title
---+------------------------
1 | title1
---+------------------------
2 | title2
---+------------------------
...
---+------------------------
11 | title11
Yes, I know about storing a many-to-many relation in a third table. But that duplicates data (in the first case the relations are in json_column, in the second in the third table).
I know about generated columns in MySQL, but I don't understand how to use them for storing m2m relations. Maybe I should use views to get pairs of table1.id <-> table2.id. But how would I use an index in this case?
I can't understand your explanation for why you can't use a third table to represent the many-to-many pairs. Using a third table is of course the best solution.
I think views have no relevance to this problem.
You could use JSON_EXTRACT() to access individual members of the array. You can use a generated column to pull each member out so you can easily reference it as an individual value.
create table table1 (
id int auto_increment primary key,
json_column json,
first_table2_id int as (json_extract(json_column, '$.table2_ids[0]'))
);
insert into table1 set json_column = '{"table2_ids":[1,2,3], "sone_other_data":"foo"}';
(You must use double-quotes inside a JSON string, and single-quotes to delimit the whole JSON string.)
select * from table1;
+----+-----------------------------------------------------+-----------------+
| id | json_column | first_table2_id |
+----+-----------------------------------------------------+-----------------+
| 1 | {"table2_ids": [1, 2, 3], "sone_other_data": "foo"} | 1 |
+----+-----------------------------------------------------+-----------------+
But this is still a problem: in SQL, the table must have the columns defined by the table metadata, and all rows therefore have the same columns. There is no such thing as each row populating additional columns based on the data.
So you need to create another extra column for each potential member of the array of table2_ids. If the array has fewer elements than the number of columns, JSON_EXTRACT() will fill in NULL when the expression returns nothing.
alter table table1 add column second_table2_id int as (json_extract(json_column, '$.table2_ids[1]'));
alter table table1 add column third_table2_id int as (json_extract(json_column, '$.table2_ids[2]'));
alter table table1 add column fourth_table2_id int as (json_extract(json_column, '$.table2_ids[3]'));
I'll query using vertical output, so the columns will be easier to read:
select * from table1\G
*************************** 1. row ***************************
id: 1
json_column: {"table2_ids": [1, 2, 3], "sone_other_data": "foo"}
first_table2_id: 1
second_table2_id: 2
third_table2_id: 3
fourth_table2_id: NULL
This is going to get very awkward. How many columns do you need? That depends on the maximum length of the table2_ids array.
If you need to search for rows in table1 that reference some specific table2 id, which column should you search? Any of the columns may have that value.
select * from table1
where first_table2_id = 2
or second_table2_id = 2
or third_table2_id = 2
or fourth_table2_id = 2;
You could put an index on each of these generated columns, but the optimizer won't use them.
These are some reasons why storing comma-separated lists is a bad idea, even inside a JSON string, if you need to reference individual elements.
The better solution is to use a traditional third table to store the many-to-many data. Each value is stored on its own row, so you don't need many columns or many indexes. You can search one column if you need to look up references to a given value.
select * from table1_table2 where table2_id = 2;
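For completeness, here's a minimal sketch of the junction table that last query assumes (the names table1_table2, table1_id and table2_id are illustrative):
create table table1_table2 (
  table1_id int not null,
  table2_id int not null,
  primary key (table1_id, table2_id),
  index (table2_id),
  foreign key (table1_id) references table1(id),
  foreign key (table2_id) references table2(id)
);
Each pairing is one row, and the index on table2_id serves lookups like the one above directly.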

Postgres JSON querying - match all pairs, ignore order

I have a Postgres table that contains a jsonb column, the data in which is arbitrarily deep.
id | jsonb_data
---|----------------------
1 | '{"a":1}'
2 | '{"a":1,"b":2}'
3 | '{"a":1,"b":2,"c":{"d":4}}'
Given a JSON object in my WHERE clause, I want to find the rows that contain objects that contain the same data and no more, but in any order. Including, preferably, nested objects.
SELECT * FROM table
WHERE json_match_ignore_order(jsonb_data, '{"b":2,"a":1}');
id | jsonb_data
---|-----------
2 | '{"a":1,"b":2}'
This would essentially work identically to the following Ruby code, but I'd really like to do it in the database if possible.
table.select { |row| row.jsonb_data_as_a_hash == {b: 2, a: 1} }
How can I do this?
With the jsonb type you can use the equals sign even for values containing nested objects.
Thus the following will also work:
create table jsonb_table(
id serial primary key,
jsonb_data jsonb
);
insert into jsonb_table(jsonb_data)
values
('{"a":1}'),
('{"a":{"c":5},"b":2}'),
('{"a":{"c":5},"b":2,"c":{"d":4}}');
select * from jsonb_table
where jsonb_data = '{"b":2,"a":{"c":5}}'::jsonb;
You will get the rows whose objects contain the same keys with the same values, compared recursively (in this case only the second row).
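Note that equality requires an exact match of the whole document. If the looser "contains at least these pairs" semantics were ever acceptable instead, jsonb also has the @> containment operator, which is likewise order-insensitive:
select * from jsonb_table
where jsonb_data @> '{"b":2,"a":{"c":5}}'::jsonb;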

Mysql - flexible, excel-like structure

I have recently inherited an already-started project, and I have one challenge right now. One of the requirements is to allow a user to create a "database" inside the application that can have a variable number of user-defined columns (it's an Excel-like structure).
Here's the sqlfiddle for my current structure.
Here's a query I am using to fetch rows:
select `row`,
group_concat(dd.value order by field(`col`, 1, 2, 3) asc) as `values`
from db_record dr,
db_dictionary dd
where dr.database_id in (1, 2, 3)
and dr.database_dictionary_id = dd.id
group by `row`
order by group_concat(dd.value order by field(`col`, 1, 2, 3) asc);
Ability to sort by any column is achieved by using group_concat().
I am thinking about that design, because I have some doubts regarding performance and meeting requirements:
It has to be sortable (by any column), meaning that user sorts asc by column 2, and rows are ordered properly.
It has to be searchable/filterable. User can filter by values in any column, and only rows containing search phrase should be returned.
The first requirement, I think, is handled by the query I pasted above. For the second one, I also tried adding a HAVING clause with LIKE to the query, but it compared against the whole GROUP_CONCAT() result.
Can someone advise, whether the current DB structure is ok for the purpose and help me with the latter requirement? Or maybe there's a better approach to the problem?
Last question, is it possible to return values for each column in one query? In DB, records look like this:
-------------------------------------------
| database_id | dictionary_id | row | col |
-------------------------------------------
| 1 | 1 | 1 | 1 |
-------------------------------------------
| 2 | 2 | 1 | 2 |
-------------------------------------------
| 3 | 3 | 1 | 3 |
-------------------------------------------
And I would like to get a query result grouped by row, similar to this (column 1 .. 3 values are dictionary_id values):
----------------------------------------
| row | column 1 | column 2 | column 3 |
----------------------------------------
| 1 | 1 | 2 | 3 |
----------------------------------------
Is that achievable in MySQL? Or is the only solution to use GROUP_CONCAT() and then split into columns in PHP?
I need a flexible and efficient structure, and I hope someone can advise me on that. I would really appreciate any help or suggestions.
Excel-2-MySQL
A Flexible, Dynamic Adaption of Excel Format to a MySQL Relational Schema
The approach of this solution may work for other relational database systems as it does not rely on any specific features of MySQL except for SQL compliant DDL and DML commands. The maintenance of this database can be handled through a combination of internal db constraints and stored procedure apis, or externally by an alternate scripting language and user interface. The focus of this walk through is the purpose of the schema design, organization of the data and supporting values as well as potential points of expansion for additional enhancements.
Schema Overview and Design Concepts to Adapt a Spreadsheet
The schema leverages an assumption that each data point on the spreadsheet grid can be represented by a unique combination of keys. The simplest combination would be a row-column coordinate pair, such as "A1" (Column A, Row Number 1) or "G72" (Column G, Row Number 72)
This walk-through demonstration will show how to adapt the following data sample in spreadsheet form into a reusable, multi-user relational database format.
A pair of coordinates can also include a uniquely assigned spreadsheet/mini-db ID value. For a multi-user environment, the same schema can still be used by adding a supporting user ID value to associate with each spreadsheet ID.
Defining the Smallest Schema Unit: The Vector
After bundling together all the identifying meta info about each data point, the collection is now tagged with a single, globally unique ID, which to some may now appear like a catalog of "vectors".
A VECTOR by mathematical definition is a collection of multiple components and their values used to simplify solutions for problems which exist in spaces that are described through multiple (n) dimensions.
The solution is scalable: mini-databases can be as small as 2 rows x 2 columns or hundreds to thousands of rows and columns wide.
Search, Sort and Pivot Easily
To build search queries from the data values of vectors that have common attributes such as:
Database/Spreadsheet ID and Owner (Example, 10045, Owner = 'HELEN')
Same Column: (Example, Column "A")
Your data set would be all vector id's and their associated data values which have these common values. Pivot outputs could be accomplished generically with probably some simple matrix algebra transformations... a spreadsheet grid is only two dimensions, so it can't be that hard!
Handling Different Data Types: Some Design Considerations
The simple approach: Store all the data as VARCHAR types but keep track of the original data type so that when you query the vector's data value you can apply the right conversion function. Just be consistent and use your API or input process to vigilantly police the population of your data in the data store... the last thing you'll want to end up debugging is a Numeric conversion function that has encountered a STRING typed character.
The next section contains the DDL code to set up a one-table solution which uses multiple columns to manage the different possible data types that may be hosted within a given spreadsheet grid.
A Single Table Solution for Serving a Spreadsheet Grid Through MySQL
Below is the DDL worked out on MySQL 5.5.32.
-- First Design Idea... Using a Single Table Solution.
CREATE TABLE DB_VECTOR
(
vid int auto_increment primary key,
user_id varchar(40),
row_id int,
col_id int,
data_type varchar(10),
string_data varchar(500),
numeric_data int,
date_data datetime
);
-- Populate Column A with CITY values
INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
string_data, numeric_data, date_data)
VALUES ('RICHARD', 2, 1, 'STRING', 'ATLANTA', NULL, NULL);
INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
string_data, numeric_data, date_data)
VALUES ('RICHARD', 3, 1, 'STRING', 'MACON', NULL, NULL);
INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
string_data, numeric_data, date_data)
VALUES ('RICHARD', 4, 1, 'STRING', 'SAVANNAH', NULL, NULL);
INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
string_data, numeric_data, date_data)
VALUES ('RICHARD', 5, 1, 'STRING', 'FORT BENNING', NULL, NULL);
INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
string_data, numeric_data, date_data)
VALUES ('RICHARD', 6, 1, 'STRING', 'ATHENS', NULL, NULL);
-- Populate Column B with POPULATION values
INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
string_data, numeric_data, date_data)
VALUES ('RICHARD', 2, 2, 'NUMERIC', NULL, 1500000, NULL);
INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
string_data, numeric_data, date_data)
VALUES ('RICHARD', 3, 2, 'NUMERIC', NULL, 522000, NULL);
INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
string_data, numeric_data, date_data)
VALUES ('RICHARD', 4, 2, 'NUMERIC', NULL, 275200, NULL);
INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
string_data, numeric_data, date_data)
VALUES ('RICHARD', 5, 2, 'NUMERIC', NULL, 45000, NULL);
INSERT INTO DB_VECTOR (user_id, row_id, col_id, data_type,
string_data, numeric_data, date_data)
VALUES ('RICHARD', 6, 2, 'NUMERIC', NULL, 1325700, NULL);
There is a temptation to run off and start over-normalizing this table, but redundancy may not be that bad. Separate off information that is related to the spreadsheets (Such as OWNER/USER Name and other demographic info) but otherwise keep things together until you understand the purpose of the vector-based design and some of the performance trade-offs.
One such tradeoff with an over-normalized schema is that the required data values are now scattered across multiple tables. Filter criteria may then have to be applied to the different tables involved in these joins. Ironic as it may seem, I have observed that flattened, singular table structures fare well when it comes to querying and reporting despite some apparent redundancy.
An Additional Note: Creating tables for supporting data linked to the main data source via Foreign Key relations is a different story... an implied relation exists between tables, but many RDBMS systems actually self-optimize based on Foreign Key connections.
For Example: Searching the USER_OWNER column with several million records benefits from a potential boost if it is linked by a FK to a supporting table which identifies a finite user list of 20 people... This is also known as an issue of CARDINALITY, which helps the database build execution plans that can take short-cuts through an otherwise unknown data set.
Getting Your Data Back Out: Some Sample Queries
The first is a base query to pull the data back out in an organized, grid-like format... just like the original Excel page.
SELECT base_query.CITY, base_query.POPULATION
FROM (
SELECT CASE WHEN col_a.data_type = 'STRING'
THEN col_a.string_data
WHEN col_a.data_type = 'NUMERIC'
THEN col_a.numeric_data
WHEN col_a.data_type = 'DATETIME'
THEN col_a.date_data ELSE NULL END as CITY,
CASE WHEN col_b.data_type = 'STRING'
THEN col_b.string_data
WHEN col_b.data_type = 'NUMERIC'
THEN col_b.numeric_data
WHEN col_b.data_type = 'DATETIME'
THEN col_b.date_data ELSE NULL END as POPULATION
FROM db_vector col_a, db_vector col_b
WHERE ( col_a.col_id = 1 AND col_b.col_id = 2 )
AND ( col_a.row_id = col_b.row_id)
) base_query WHERE base_query.POPULATION >= 500000
ORDER BY base_query.POPULATION DESC
Even the base query here is still a little specific to manage a scalable, generic solution for a spreadsheet of one or many values in width or length. But you can see how the internal query in this example remains untouched and a complete data set can quickly be filtered or sorted in different ways.
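If the vector table grows large, an index aligned with the predicates of the inner self-join may help; a sketch (the index name and column choice are assumptions, not part of the original walk-through):
-- covers the col_id filter and the row_id join in the base query above
CREATE INDEX idx_db_vector_col_row ON DB_VECTOR (col_id, row_id);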
Some Parting Thoughts: (a.k.a. Some Optional Homework)
It is possible to solve this with a flexible, multi-table solution. I was able to accomplish this in THREE tables.
DB_VECTOR (as you have already seen) underwent some modifications: data values were moved out and strictly positional information (row and column id's) plus a globally unique spreadsheet id was left behind.
DB_DATA was used as the final home for the raw data fields: STRING_DATA, NUMERIC_DATA, and DATE_DATA... each record uniquely identified by a VID (vector id).
In the multi-table solution, I used the unique VID instead as a pointer with multiple associated dimensions (owner, sheet id, row, column, etc.) to point to its corresponding data value.
An example of the utility of this design: the possibility of a "look-up" function or query that identifies a collection of vector ids and the data they point to based on the properties of the data itself, or the vector components (row, column, sheet id, etc.)... or a combination.
The possibility is that instead of circulating a whole lot of data (the spreadsheet itself) between different parts of the code handling this schema, queries deal with only the specific properties and just push around lists (arrays?) or sets of universally unique ids which point to the data as it is needed.
Initializing New Spreadsheets: If you pursue the multi-table design, your DB_VECTOR table becomes a hollow collection of bins with pointers to the actual data. Before you populate the raw data values, the VECTOR_ID (vid) will need to exist first so you can link the two values.
Which Way is UP???: Using numeric values for row and column id's seemed like the easy way first, but I noticed that: (a) I was easily mixing up columns and rows... and worse, not noticing it until it was too late; (b) Excel actually has a convention: Rows (numeric), Columns (Alphabetic: A through ZZ+?) Will users miss the convention or get lost when using our schema? Are there any problems with adopting a non-numeric identification scheme for our data vectors?
Yet Another Dimension: Excel Spreadsheets have MULTIPLE sheets. How would support for this convention change the design of your VECTORS? Engineers and scientists even push this limit to more than the three dimensions humans can see. How would that change things? If you tried it, did you find out if it imposed a limitation, or did it matter at all?
Stumbled Into This One...: My current DB_VECTOR table contains an extra VARCHAR value called "DETAILS". I found it a useful catch-bin for a miscellaneous, custom attribute that can be unique all the way down to the lowest (VECTOR ID/POINTER) level... or you can use it to create a custom label for an unusual collection of vectors that may not have an easily definable relation (like Excel's "Range Name" property)... What would you use it for?
If you're still with me... thanks. This one was a challenging thought exercise in database design. I have purposely left out fully expanded discussions on optimization and performance considerations for the sake of clarity... perhaps something to consider at a later time.
Best Wishes on Your Project.
Why not model tabular storage as a table? Just build the ALTER|CREATE|DROP TABLE statements ad hoc, and you can reap all the benefits of actually having a database server. Indexes and SQL come to mind.
Example schema:
CREATE TABLE Worksheets
(
WorksheetID int auto_increment primary key,
WorkbookID int not null,
Name varchar(256) not null,
TableName nvarchar(256) not null
);
CREATE TABLE Columns
(
ColumnID int auto_increment primary key,
WorksheetID int not null,
ColumnSequenceNo int not null,
Name varchar(256) not null,
PerceivedDatatype enum ('string', 'numeric') not null
);
-- Example of a dynamically generated data table:
-- Note: The number in the column name would correspond to
-- ColumnSequenceNo in the Columns table
CREATE TABLE `data_e293c71b-b894-4652-a833-ba817339809e`
(
RowID int auto_increment primary key,
RowSequenceNo int not null,
Column1String varchar(256) null,
Column1Numeric double null,
Column2String varchar(256) null,
Column2Numeric double null,
Column3String varchar(256) null,
Column3Numeric double null,
-- ...
ColumnNString varchar(256) null,
ColumnNNumeric double null
);
INSERT INTO Worksheets (WorkbookID, Name, TableName)
VALUES (1, 'Countries', 'data_e293c71b-b894-4652-a833-ba817339809e');
SET @worksheetID = LAST_INSERT_ID();
INSERT INTO Columns (WorksheetID, ColumnSequenceNo, Name, PerceivedDatatype)
VALUES (@worksheetID, 1, 'Country Name', 'string'),
(@worksheetID, 2, 'Population', 'numeric'),
(@worksheetID, 3, 'GDP/person', 'numeric');
-- example of an insert for a new row:
-- if the new data violates any perceived types, update them first
INSERT INTO `data_e293c71b-b894-4652-a833-ba817339809e` (
RowSequenceNo,
Column1String,
Column2String, Column2Numeric,
Column3String, Column3Numeric)
VALUES (
1,
'United States of America',
'3000000', 3000000,
'34500', 34500);
-- example of a query on the first column:
select *
from `data_e293c71b-b894-4652-a833-ba817339809e`
where Column1String like 'United%';
-- example of a query on a column with a numeric perceived datatype:
select *
from `data_e293c71b-b894-4652-a833-ba817339809e`
where Column3Numeric between 4000 and 40000;
Moral of the story is that you shouldn't fight the database server — use it to your advantage.
select `row`,
group_concat(if(field(`row`, 1), dd.value, null)) as row1,
group_concat(if(field(`row`, 2), dd.value, null)) as row2,
group_concat(if(field(`row`, 3), dd.value, null)) as row3
from db_record dr
left join db_dictionary dd on (dr.dictionary_id = dd.id)
where dr.database_id = 0
group by `column`
having row1 like '%biu%'
order by `row` asc;
My first impression is that you may be overthinking this quite a bit. I'm guessing that you wish to get a permutation of 3 or more player combinations across all db dictionaries (players). And the sqlfiddle suggests recording all these in the db_record table to be retrieved later on.
Using group_concat is pretty expensive, and so is the use of HAVING. When you view the original sqlfiddle's execution plan, it says in the "Extra" column:
Using where; Using temporary; Using filesort
"Using temporary; Using filesort" are indications of inefficiency around using temporary tables and having to hit the disk multiple times during filesort. The first execution time was 25ms (before it was cached, bringing that fictitiously down to 2ms on the second execution onwards)
To the original question, creating a "database" inside the "application"? If you mean a flexible DB within a DB, you're probably overusing a relational DB. Try shifting some of the responsibilities out to the application layer code (php?), yes outside of the DB, and leave the relational DB to do what it's best at, relating relevant tables of data. Keep it simple.
After some thinking, I think I might have a solution, but I am not sure if it's the best one. Before running the query in the app, I already know how many columns that virtual "database" has, and since I know which column I need to search (column 3 in this example), I can build a query like that:
select `row`,
group_concat(if(field(`column`, 1), dd.value, null)) as column1,
group_concat(if(field(`column`, 2), dd.value, null)) as column2,
group_concat(if(field(`column`, 3), dd.value, null)) as column3
from db_record dr
left join db_dictionary dd on (dr.dictionary_id = dd.id)
where dr.database_id = 1
group by `row`
having column3 like '%biu%'
order by `columns` asc;
So, in PHP I can add group_concat(if(...)) for each column and add HAVING clause to search.
But I would like to get some feedback about that solution if possible.

Handling custom user fields with possibility of grow and shrink

This question is mostly about how to approach this, ideas, etc.
I have a situation where a user can create as many custom fields as he wants, of type Number, Text, or Date, and use them to make a form. I have to design a table model which can store these values so that queries can be run on them once saved.
Previously I hard-coded the format for 25 user-defined fields (UDFs). I made a table with 25 columns (10 Number, 10 Text, and 5 Date) and stored the label in it if a user made use of any field, then mapped it to another table with the same format to store the value. The mapping is there to know which field has which label, but this is not an efficient way, I suspect.
Any suggestion would be appreciated.
Users have permission to create any number of UDFs of the above types. These can then be used to make forms (again, any number of them), and the data has to be saved for each form type.
E.g. let's say a user created 10 number, 10 date, and 10 text fields, used the first 5 of each to make form1 and all 10 to make form2, and then saved the data.
My thoughts on it:
Make a table1 with [id,name(as UDF_xxx where xxx is data type),UserLabel ]
table2 to map form and table1 [id(f_key table1_id), F_id(form id)]
and make 1 table of each data type as [ id(f_key of table1),F_id(form number),R_id(row id for data, would be same for all data type),value]
Thanks to all. I'm going to implement it; both the DataSet entry and JSON approaches look good, as they give wider extensibility. I still have to figure out which will best fit the existing format.
There are two approaches I have used.
XML: To create dynamic user attributes, you may use XML. This XML will be stored in a CLOB column, say user_attributes. Store the entire user data as XML key-value pairs, with the type as an attribute or another field. This gives you maximum freedom. You can use XOM or any other XML object model API to display or operate on the data. A typical node will look like:
<userdata>
...
...
<datanode>
<key type="Date">User Birth</key>
<value>1994-02-25</value>
</datanode>
...
</userdata>
Attribute-AttributeValue: This is the same thing as above but using tables. You create a table attributes with user_id as an FK, and another table attribute_values with attribute_id as an FK. attributes contains the field names and types for each user, and attribute_values contains the values of those attributes. So basically:
users
  user_id
attributes
  attr_id
  user_id (FK)
  attr_type
  attr_name
attribute_values
  attr_val_id
  attr_id (FK)
  attr_val
As you can see, in both approaches you are not limited by how many fields or what types of data you have. But there is a downside: parsing. In either case, you will have to do a small amount of processing to display or analyze the data.
The best-of-both-worlds approach (rigid column structure vs. completely dynamic data) is to have a users table with the must-have columns (like user_name, age, sex, address, etc.) and keep the user-created data (like favorite pet, house, etc.) in either XML or attribute-attribute_value form.
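A minimal sketch of the attribute/attribute-value tables outlined above (the column types and sizes are assumptions):
create table attributes (
  attr_id int auto_increment primary key,
  user_id int not null,            -- FK to users.user_id
  attr_type varchar(16) not null,  -- e.g. 'NUMBER', 'TEXT', 'DATE'
  attr_name varchar(64) not null
);
create table attribute_values (
  attr_val_id int auto_increment primary key,
  attr_id int not null,            -- FK to attributes.attr_id
  attr_val varchar(255)
);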
What do you want to achieve?
A table per form permutation or might each dataset consist of different sets?
Two possibilities pop into my mind:
Create a table that describes one field of a dataset, i.e. the key might be dataset id + field id and additional columns could contain the value stored as a string and the type of that value (i.e. number, string, boolean, etc.).
That way each dataset might be different but upon reading a dataset and storing it into an object you could create the appropriate value types (Integer, Double, String, Boolean etc.)
Create a table per form, using some naming convention. When the form layout is changed, execute ALTER TABLE statements to add, remove, or rename columns or change their types.
When the user changes the type of a column or deletes it, you might need to either deny that if the values are not null or at least ask the user if she's willing to drop values that don't match the new requirements.
Edit: Example for approach 1
Table UDF // describes the available fields
  id (PK)
  user_id (FK)
  type
  name
Table FORM // describes a form's general attributes
  id (PK)
  user_id (FK)
  name
  description
Table FORM_LAYOUT // describes a form's field layout
  form_id (FK)
  udf_id (FK)
  mapping // mapping info like column index, form field name etc.
Table DATASET_ENTRY // describes one entry of a dataset, i.e. the value of one UDF in one row of a form
  id (PK)
  row_id
  form_id (FK)
  udf_id (FK)
  value
Selecting the content for a specific form might then be done like this:
SELECT e.value, f.type, l.mapping from DATASET_ENTRY e
JOIN UDF f ON e.udf_id = f.id
JOIN FORM_LAYOUT l ON e.form_id = l.form_id AND e.udf_id = l.udf_id
WHERE e.row_id = ? AND e.form_id = ?
Create a table which manages which fields exist. Then create tables for each data type you want to support, where the users' values will be stored.
create table Fields(
fieldid int not null,
fieldname text not null,
fieldtype int not null
);
create table FieldDate
(
ValueId int not null,
fieldid int not null,
value date
);
create table FieldNumber
(
ValueId int not null,
fieldid int not null,
value double
);
..
Another possibility would be to use ALTER TABLE to create custom fields. If your application has the rights to perform this command and the custom fields change very rarely, this would be the option I would choose.
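As a rough illustration of that option (the table and column names here are hypothetical):
-- add a user-defined date field to an existing form table
alter table form_data add column udf_birthday date null;
-- remove it again if the user deletes the field
alter table form_data drop column udf_birthday;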