I am working on a project that has multiple processes, and each process has different data items to process. Data items for different processes have different columns (but always the same columns for the same process).
At first, I assumed it would be fine to have one table for all of the processes and then, whenever a new process is created, to create another table for its item data. It turns out, however, that new processes would appear far too often to keep creating new tables. I then looked into nested tables but found that MySQL has no concept of nested tables. (I've heard that this could be done with MariaDB. Has anyone worked with it?)
To make it a bit clearer, here is the current concept (columns and values are only approximate):
process_table:
ID | process_name | item_id | ...
---------------------------------
1 | some_process | 111 | ...
2 | other_process| 222 | ...
3 | third_process| 333 | ...
4 | third_process| 444 | ...
...
item_tables:
item_table_1:
ID | Column1 | Column2 | process_name | ...
--------------------------------------
111| val1 | val2 | some_process | ...
...
item_table_2:
ID | Column4 | Column5 | process_name | ...
--------------------------------------
333| val1 | val2 | third_process| ...
444| val3 | val4 | third_process| ...
...
So for each new process there would be a new item_table, each with its own column names, and each row in an item table would be linked to the 'item_id' column in the process table.
I think the easiest solution (when creating new tables all the time is not an option) would be nested tables: the process table would get another column that holds the item_table values, and those could have different columns depending on the process.
So the big question is: is there anything similar to nested tables, or anything else in MySQL, that would help me implement a structure like this without creating new tables all the time? If not, are there any tips or reviews about MariaDB? Maybe someone has already implemented nested tables with it (if that is possible at all).
One solution would be a single 'item_table' with one column that holds all the process-specific values, stored as JSON for example, but this would make the table a lot harder to read.
For example:
item_table:
ID | process_name | data
--------------------------------------
111| some_process | {values: {column1:val1,column2:val2,...}}
Do you use the values from the items table for processing or something like that (do you run queries against them)?
This table/database structure looks inefficient and unmaintainable, in my opinion.
This should all be done with just two tables: the processes table, and the items table that contains the process_id (not the name) from the processes table.
If the column count for the items is always the same, just use "generic" names for the values, like value_1, value_2 (or whatever suits the process best), or use a JSON/BLOB/VARCHAR field holding a JSON string, for example. (It depends on whether you need to run queries against this data.)
id | process_id | data
EDIT:
Your edit and second solution should be the way to go.
"Easy readability" does not take priority over functionality and performance.
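A minimal sketch of the two-table layout described above, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table names `processes` and `items`, the helper `items_for_process`, and the sample values are assumptions made for illustration; the JSON is stored as plain text and decoded in the application.

```python
import json
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE processes (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE items (
    id         INTEGER PRIMARY KEY,
    process_id INTEGER NOT NULL REFERENCES processes(id),
    data       TEXT NOT NULL  -- JSON string holding the process-specific columns
);
""")

conn.execute(
    "INSERT INTO processes (id, name) VALUES (1, 'some_process'), (3, 'third_process')"
)

# Hypothetical sample items; each process keeps its own set of keys.
sample_items = [
    (111, 1, {"column1": "val1", "column2": "val2"}),
    (333, 3, {"column4": "val1", "column5": "val2"}),
    (444, 3, {"column4": "val3", "column5": "val4"}),
]
conn.executemany(
    "INSERT INTO items (id, process_id, data) VALUES (?, ?, ?)",
    [(item_id, process_id, json.dumps(data)) for item_id, process_id, data in sample_items],
)


def items_for_process(name):
    """Return (item id, decoded data dict) pairs for one process, ordered by id."""
    cur = conn.execute(
        "SELECT i.id, i.data "
        "FROM items AS i JOIN processes AS p ON p.id = i.process_id "
        "WHERE p.name = ? ORDER BY i.id",
        (name,),
    )
    return [(item_id, json.loads(data)) for item_id, data in cur]


third_items = items_for_process("third_process")
```

Adding a new process then only means inserting one row into `processes`; no new table is needed. If the values must be filtered inside SQL rather than in the application, a native JSON column type (available in MySQL since 5.7) may be worth looking at instead of a text column.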
Related
I have a list of ids in text format, as comma-separated values, like so:
("12345", "12346", "12347", etc, etc)
I would like to check whether each of them exists in a table, say a devices table, which has a column called device_id (not the primary key).
Ideally, I would like to get a list which says whether each item exists or not.
So far I have only managed to query the ones that exist, and I have to find the non-existing ones manually.
Do I have to run a for loop in a stored procedure or something like that? Please help.
Table structure
| id   | device_id       | device_name   |
+------+-----------------+---------------+
| 71   | 352701060409650 | 57X           |
| 13   | 352701060409700 | 582           |
You need to create a query with a left join to the same table and an 'IFNULL' condition. There has already been a post on this topic. Please check this out here.
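One way to sketch the left-join idea: turn the list of ids into a derived table and LEFT JOIN it against devices, so that ids without a match still produce a row. This is a variation that counts matches instead of using IFNULL. It runs on Python's built-in sqlite3 as a stand-in for MySQL; the helper name `check_device_ids` and the choice to store device_id as text are assumptions.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE devices (id INTEGER PRIMARY KEY, device_id TEXT, device_name TEXT)"
)
conn.executemany(
    "INSERT INTO devices (id, device_id, device_name) VALUES (?, ?, ?)",
    [(71, "352701060409650", "57X"), (13, "352701060409700", "582")],
)


def check_device_ids(conn, wanted):
    """Map every id in `wanted` to True/False depending on whether it exists."""
    if not wanted:
        return {}
    # One "SELECT ?" per wanted id, glued together into a derived table.
    id_list = " UNION ALL ".join(["SELECT ? AS device_id"] * len(wanted))
    sql = f"""
        SELECT w.device_id, COUNT(d.id) > 0 AS found
        FROM ({id_list}) AS w
        LEFT JOIN devices AS d ON d.device_id = w.device_id
        GROUP BY w.device_id
    """
    return {device_id: bool(found) for device_id, found in conn.execute(sql, wanted)}


result = check_device_ids(conn, ["352701060409650", "12345", "12346"])
```

COUNT(d.id) only counts non-NULL values, so ids with no matching device row come back as not found, and duplicate device rows do not multiply the output. The ids are passed as bound parameters rather than pasted into the SQL string.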
I'm looking for a way to check whether a value is present in any of the rows of the page column.
For example, how can I check whether the value '45' is present?
Id | page |
---------------
1 | 23 |
---------------
2 | |
---------------
3 | 33,45,55 |
---------------
4 | 45 |
---------------
The find_in_set function is just what you're looking for:
SELECT *
FROM mytable
WHERE FIND_IN_SET('45', page) > 0
You should not store values in lists. This is especially true in this case:
Values should be stored in the proper data type. You are storing numbers as characters.
Foreign key relationships should be properly defined.
SQL doesn't have very good string processing functions.
Resulting queries cannot make use of indexes.
SQL has a great data type for lists, called a table. In this case, you want a junction table.
Sometimes, you are stuck with other people's really bad design decisions. In that case, you can use find_in_set() as suggested by Mureinik.
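A rough sketch of the junction-table alternative mentioned above, using Python's built-in sqlite3 as a stand-in for MySQL. The table name `mytable_pages`, its columns, and the index name are assumptions; the legacy comma-separated values are taken from the question and split once while migrating.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (
    id INTEGER PRIMARY KEY
);
-- Junction table: one row per (row, page) pair instead of a comma-separated list.
CREATE TABLE mytable_pages (
    mytable_id INTEGER NOT NULL REFERENCES mytable(id),
    page       INTEGER NOT NULL,
    PRIMARY KEY (mytable_id, page)
);
CREATE INDEX idx_mytable_pages_page ON mytable_pages (page);
""")

# Legacy data from the question: id -> comma-separated page list.
legacy_rows = {1: "23", 2: "", 3: "33,45,55", 4: "45"}

for row_id, pages in legacy_rows.items():
    conn.execute("INSERT INTO mytable (id) VALUES (?)", (row_id,))
    # filter(None, ...) drops the empty string produced by an empty list.
    for page in filter(None, pages.split(",")):
        conn.execute(
            "INSERT INTO mytable_pages (mytable_id, page) VALUES (?, ?)",
            (row_id, int(page)),
        )

# The lookup is now a plain equality test on an indexed integer column.
ids_with_45 = [
    row[0]
    for row in conn.execute(
        "SELECT mytable_id FROM mytable_pages WHERE page = ? ORDER BY mytable_id",
        (45,),
    )
]
```

With this layout the pages are stored as numbers, the foreign key can be declared, and the search can use the index on `page`, which addresses the points listed above.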
'customer_data' table:
id - int auto increment
user_id - int
json - TEXT field containing json object
tags - varchar 200
* id + user_id are set as index.
Each customer (user_id) may have multiple rows.
"json" is TEXT because it may be very large, with many keys, or fairly small, with a few keys containing short values.
I usually look up the json by user_id.
Problem: with over 100,000 rows, a query takes forever to complete. I understand that TEXT fields are very wasteful and that MySQL does not index them well.
Fix 1:
Convert the "json" field to multiple columns in the same table where some columns may be blank.
Fix 2:
Create another table with user_id|key|value, but I may end up with huge joins; won't that be much slower? Also, the key is a string but the value may be an int or text of various lengths. How do I reconcile that?
I know this is a pretty common use case; what are the "industry standards" for it?
UPDATE
So I guess Fix 2 is the best option, how would I query this table and get one row result, efficiently?
id | key | value
-------------------
1 | key_1 | A
2 | key_1 | D
1 | key_2 | B
1 | key_3 | C
2 | key_3 | E
result:
id | key_1 | key_2 | key_3
---------------------------
1 | A | B | C
2 | D | | E
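The one-row-per-id result above can be sketched with conditional aggregation: one `MAX(CASE ...)` expression per key, grouped by id. The example runs on Python's built-in sqlite3 as a stand-in for MySQL. The table name `customer_attrs` and the column names `attr_key` and `attr_value` are assumptions, chosen partly because `key` is a reserved word in MySQL and would otherwise need backticks.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE customer_attrs (
    id         INTEGER NOT NULL,
    attr_key   TEXT    NOT NULL,
    attr_value TEXT,
    PRIMARY KEY (id, attr_key)
)
""")
conn.executemany(
    "INSERT INTO customer_attrs (id, attr_key, attr_value) VALUES (?, ?, ?)",
    [
        (1, "key_1", "A"),
        (2, "key_1", "D"),
        (1, "key_2", "B"),
        (1, "key_3", "C"),
        (2, "key_3", "E"),
    ],
)

# Each CASE picks the value of one key; MAX collapses the group to that value
# (or NULL when the id has no row for the key).
PIVOT_SQL = """
SELECT id,
       MAX(CASE WHEN attr_key = 'key_1' THEN attr_value END) AS key_1,
       MAX(CASE WHEN attr_key = 'key_2' THEN attr_value END) AS key_2,
       MAX(CASE WHEN attr_key = 'key_3' THEN attr_value END) AS key_3
FROM customer_attrs
GROUP BY id
ORDER BY id
"""

pivoted = conn.execute(PIVOT_SQL).fetchall()
```

The set of keys has to be known when the query is written. If keys are open-ended, the query text would have to be generated from the distinct keys first, or the pivot done in application code.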
This answer is a bit outside the box defined in your question, but I'd suggest:
Fix 3: Use MongoDB instead of MySQL.
This is not to criticize MySQL at all -- MySQL is a great structured relational database implementation. However, you don't seem interested in using either the structured aspects or the relational aspects (either because of the specific use case and requirements or because of your own programming preferences, I'm not sure which). Using MySQL because relational architecture suits your use case (if it does) would make sense; using relational architecture as a workaround to make MySQL efficient for your use case (as seems to be the path you're considering) seems unwise.
MongoDB is another great database implementation, which is less structured and not relational, and is designed for exactly the sort of use case you describe: flexibly storing big blobs of json data with various identifiers, and storing/retrieving them efficiently, without having to worry about structural consistency between different records. JSON is Mongo's native document representation.
I have 6 tables. These are simplified for this example.
user_items
ID | user_id | item_name | version
-------------------------------------
1 | 123 | test | 1
data
ID | name | version | info
----------------------------
1 | test | 1 | info
data_emails
ID | name | version | email_id
------------------------
1 | test | 1 | 1
2 | test | 1 | 2
emails
ID | email
-------------------
1 | email#address.com
2 | second#email.com
data_ips
ID | name | version | ip_id
----------------------------
1 | test | 1 | 1
2 | test | 1 | 2
ips
ID | ip
--------
1 | 1.2.3.4
2 | 2.3.4.5
What I am looking to achieve is the following.
The user (123) has the item with name 'test'. This is the basic information we need for a given entry.
There is data in our 'data' table and the current version is 1; as such, the version in our user_items table is also 1. The two tables are linked together by the name and version. The setup is like this because a user could have an item for which we don't have data; likewise, there could be an item for which we have data but which no user owns.
For each item there are also 0 or more emails and ips associated. These can be the same for many items so rather than duplicate the actual email varchar over and over we have the data_emails and data_ips tables which link to the emails and ips table respectively based on the email_id/ip_id and the respective ID columns.
The emails and ips are associated with the data version again through the item name and version number.
My first question is: is this a good, well-optimized database setup?
My next and main question is about joining this complex data structure.
What I had was:
PHP
- get all the user items
- loop through them and get the most recent data entry (if any)
- if there is one get the respective emails
- get the respective ips
Does that count as 3 queries or essentially infinite depending on the number of user items?
I was made to believe that the above was inefficient and as such I wanted to condense my setup into using one query to get the same data.
I have achieved that with the following code
SELECT user_items.name,GROUP_CONCAT( emails.email SEPARATOR ',' ) as emails, x.ip
FROM user_items
JOIN data AS data ON (data.name = user_items.name AND data.version = user_items.version)
LEFT JOIN data_emails AS data_emails ON (data_emails.name = user_items.name AND data_emails.version = user_items.version)
LEFT JOIN emails AS emails ON (data_emails.email_id = emails.ID)
LEFT JOIN
(SELECT name,version,GROUP_CONCAT( the_ips.ip SEPARATOR ',' ) as ip FROM data_ips
LEFT JOIN ips as the_ips ON data_ips.ip_id = the_ips.ID )
x ON (x.name = data.name AND x.version = user_items.version)
I have done loads of reading to get to this point and worked tirelessly to get here.
This works as I require. This question seeks to clarify what the benefits of using this instead are.
I have had to use a subquery (I believe?) to get the ips, as previously it was multiplying results (I believe because of the complex joins). How this subquery works is, I suppose, my main confusion.
Summary of questions.
-Is my database well set up for my usage? Any improvements would be appreciated, and any useful resources to help me expand my knowledge would be great.
-How does the subquery in my SQL actually work? What is the query doing?
-Am I correct to keep using left joins? I want to return the user item, and null values where applicable on the right.
-Am I essentially replacing a potentially infinite number of queries with 2? Does this make a REAL difference? Can the above be improved?
-Given that when I update the version of an item in my data table I now have to update the version in the user_items table as well, I have a few more update queries to do. Is the tradeoff of this setup worthwhile in practice?
Thanks to anyone who helps me get a better grasp of this!
Given your data layout and your objective, the query is correct. If you've only got a small amount of data, it shouldn't be a performance problem, but that will change quickly as the amount of data grows. However, when you have a large amount of data, there are very few circumstances where you should ever see all your data in one go, implying that the results will be filtered in some way. Exactly how they are filtered has a huge impact on the structure of the query.
How does the subquery in my sql actually work
Currently it doesn't work properly - there is no GROUP BY
Is the tradeoff off of this setup in practice worthwhile?
No - it implies that your schema is too normalized.
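To illustrate the GROUP BY point, here is a sketch that aggregates emails and ips in two separate derived tables, each grouped by (name, version), before joining them to the user's items. It runs on Python's built-in sqlite3 as a stand-in for MySQL and uses GROUP_CONCAT with its default comma separator, which both engines accept. The table aliases and the filter on user_id are assumptions, and `item_name` follows the table listing in the question rather than the `user_items.name` used in the posted query.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_items  (ID INTEGER PRIMARY KEY, user_id INTEGER, item_name TEXT, version INTEGER);
CREATE TABLE data        (ID INTEGER PRIMARY KEY, name TEXT, version INTEGER, info TEXT);
CREATE TABLE data_emails (ID INTEGER PRIMARY KEY, name TEXT, version INTEGER, email_id INTEGER);
CREATE TABLE emails      (ID INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE data_ips    (ID INTEGER PRIMARY KEY, name TEXT, version INTEGER, ip_id INTEGER);
CREATE TABLE ips         (ID INTEGER PRIMARY KEY, ip TEXT);

INSERT INTO user_items  VALUES (1, 123, 'test', 1);
INSERT INTO data        VALUES (1, 'test', 1, 'info');
INSERT INTO data_emails VALUES (1, 'test', 1, 1), (2, 'test', 1, 2);
INSERT INTO emails      VALUES (1, 'email#address.com'), (2, 'second#email.com');
INSERT INTO data_ips    VALUES (1, 'test', 1, 1), (2, 'test', 1, 2);
INSERT INTO ips         VALUES (1, '1.2.3.4'), (2, '2.3.4.5');
""")

# Each derived table collapses its rows to one row per (name, version) BEFORE
# the join, so emails and ips cannot multiply each other.
QUERY = """
SELECT ui.item_name, e.emails, i.ips
FROM user_items AS ui
JOIN data AS d
  ON d.name = ui.item_name AND d.version = ui.version
LEFT JOIN (
    SELECT de.name, de.version, GROUP_CONCAT(em.email) AS emails
    FROM data_emails AS de
    JOIN emails AS em ON em.ID = de.email_id
    GROUP BY de.name, de.version
) AS e ON e.name = d.name AND e.version = d.version
LEFT JOIN (
    SELECT di.name, di.version, GROUP_CONCAT(ip.ip) AS ips
    FROM data_ips AS di
    JOIN ips AS ip ON ip.ID = di.ip_id
    GROUP BY di.name, di.version
) AS i ON i.name = d.name AND i.version = d.version
WHERE ui.user_id = ?
"""

rows = conn.execute(QUERY, (123,)).fetchall()
```

Without the GROUP BY inside a derived table, the aggregate collapses the whole table into a single row, which only looks right while there is a single item. Joining emails and ips directly without pre-aggregating would instead pair every email with every ip (2 x 2 rows in this sample) and repeat values in the concatenated lists.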
Update: Question refined, I still need help!
I have the following table structure:
table reports:
ID | time | title | (extra columns)
1 | 1364762762 | xxx | ...
Multiple object tables that have the following structure
ID | objectID | time | title | (extra columns)
1 | 1 | 1222222222 | ... | ...
2 | 2 | 1333333333 | ... | ...
3 | 3 | 1444444444 | ... | ...
4 | 1 | 1555555555 | ... | ...
In the object tables, when an object is updated, a new version with the same objectID is inserted, so that the old versions are still available. For example, see the entries with objectID = 1.
In the reports table, a report is inserted but never updated/edited.
What I want to be able to do is the following:
For each entry in my reports table, I want to be able to query the state of all objects as they were when the report was created.
For example, let's look at the sample report above with ID 1. At the time it was created (see the time column), the current version of objectID 1 was the entry with ID 1 (entry ID 4 did not exist at that point).
ObjectID 2 also existed, with its current version being entry ID 2.
I am not sure how to achieve this.
I could use a query that selects the object versions by the time column:
SELECT *
FROM (
SELECT *
FROM objects
WHERE time < [reportTime]
ORDER BY time DESC
)
GROUP BY objectID
Let's not talk about the performance of this query; it is only meant to make clear what I want to do. My problem is the comparison of the time columns. I don't think this is a reliable way to make sure I get the right object versions, because the system time may change "for any reason", and the time column would then contain wrong data, which would lead to wrong results.
What would be another way to do so?
I thought about not using a time column for this, but instead a GLOBAL incremental value, so that I know the insertion order across the database tables.
If you are inserting new versions of the object and your problem is the time column (I assume you are using this column to decide which version is newer), I suggest using an auto-incrementing ID column for the versions. Even if the time value is not reliable, the ID will be, since it is always increasing: the higher the ID, the newer the version.
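One possible way to apply the auto-increment idea to the report use case, sketched with Python's built-in sqlite3 as a stand-in for MySQL: when a report is created, it records the highest object row ID that exists at that moment, and later queries pick, per objectID, the newest row at or below that ID. The column `last_object_row_id` and the helper functions are assumptions, and the sketch covers a single object table only; with several object tables, each would need its own snapshot value or a shared counter.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE objects (
    ID       INTEGER PRIMARY KEY AUTOINCREMENT,
    objectID INTEGER NOT NULL,
    time     INTEGER,
    title    TEXT
);
CREATE TABLE reports (
    ID                 INTEGER PRIMARY KEY AUTOINCREMENT,
    time               INTEGER,
    title              TEXT,
    last_object_row_id INTEGER NOT NULL  -- hypothetical snapshot column
);
""")


def insert_object(object_id, time, title):
    conn.execute(
        "INSERT INTO objects (objectID, time, title) VALUES (?, ?, ?)",
        (object_id, time, title),
    )


def insert_report(time, title):
    """Create a report and remember the newest object row that exists right now.

    In a real system both statements would need to run in one transaction.
    """
    (last_id,) = conn.execute("SELECT COALESCE(MAX(ID), 0) FROM objects").fetchone()
    cur = conn.execute(
        "INSERT INTO reports (time, title, last_object_row_id) VALUES (?, ?, ?)",
        (time, title, last_id),
    )
    return cur.lastrowid


def objects_at_report(report_id):
    """Return (row ID, objectID) of each object version current at report time."""
    sql = """
        SELECT o.ID, o.objectID
        FROM objects AS o
        JOIN (
            SELECT objectID, MAX(ID) AS max_id
            FROM objects
            WHERE ID <= (SELECT last_object_row_id FROM reports WHERE ID = ?)
            GROUP BY objectID
        ) AS latest ON latest.max_id = o.ID
        ORDER BY o.objectID
    """
    return conn.execute(sql, (report_id,)).fetchall()


# Same insertion order as the sample data: two objects, the report, two more rows.
insert_object(1, 1222222222, "first version of object 1")
insert_object(2, 1333333333, "first version of object 2")
report_id = insert_report(1364762762, "xxx")
insert_object(3, 1444444444, "first version of object 3")
insert_object(1, 1555555555, "second version of object 1")

snapshot = objects_at_report(report_id)
```

The time columns are kept for display but play no part in the comparison, so a changed system clock does not affect which versions a report sees.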