Mysql efficiently storing dynamic customer data in row or column - mysql

'customer_data' table:
id - int auto increment
user_id - int
json - TEXT field containing json object
tags - varchar 200
* id + user_id are set as index.
Each customer (user_id) may have multiple lines.
"json" is text because it may be very large with many keys or or not so big with few keys containing short values.
I usually search for the json for user_id.
Problem: with over 100,000 lines and it takes forever to complete a query. I understand that TEXT field are very wasteful and mysql does not index them well.
Fix 1:
Convert the "json" field to multiple columns in the same table where some columns may be blank.
Fix 2:
Create another table with user_id|key|value, but I may go into huge "joins" and will that not be much slower? Also the key is string but value may be int or text and various lengths. How to I reconcile that?
I know this is a pretty regular usecase, what are the "industry standards" for this usecase?
UPDATE
So I guess Fix 2 is the best option, how would I query this table and get one row result, efficiently?
id | key | value
-------------------
1 | key_1 | A
2 | key_1 | D
1 | key_2 | B
1 | key_3 | C
2 | key_3 | E
result:
id | key_1 | key_2 | key_3
---------------------------
1 | A | B | C
2 | D | | E

This answer is a bit outside the box defined in your question, but I'd suggest:
Fix 3: Use MongoDB instead of MySQL.
This is not to criticize MySQL at all -- MySQL is a great structured relational database implementation. However, you don't seem interested in using either the structured aspects or the relational aspects (either because of the specific use case and requirements or because of your own programming preferences, I'm not sure which). Using MySQL because relational architecture suits your use case (if it does) would make sense; using relational architecture as a workaround to make MySQL efficient for your use case (as seems to be the path you're considering) seems unwise.
MongoDB is another great database implementation, which is less structured and not relational, and is designed for exactly the sort of use case you describe: flexibly storing big blobs of json data with various identifiers, and storing/retrieving them efficiently, without having to worry about structural consistency between different records. JSON is Mongo's native document representation.

Related

Relational databases: Integrate one tree structure into another

I'm currently designing a relational database table in MySQL for handling multiple categories, representing them later in a tree structure on the client side and filtering on them. Here is a picture of how the structure looks like:
So we have a root element which is set by default. We can after that add children to it (Level one). So far a table structure in the simplest case could be defined so:
| id | name | parent_id |
--------------------------------
1 All Categories NULL
2 History 1
However, I have a requirement that I need to include another tree structure type (Products) in the table (a corresponding API is available). The records from the other table have their own id types (UUID). Basically I need to ingest them in my table. A possible structure will look like so:
| id | UUID | name | parent_id |
----------------------------------------------------------
1 NULL All Categories NULL
2 NULL History 1
3 NULL Products 1
4 CN1001231232 Catalog electricity 3
5 CN1001231242 Catalog basic components 4
6 NULL Shipping 1
I am new to relational databases, but all of these possible NULL values for the UUID indicate (at least for me) to be bad design of database table. Is there a way of avoiding this, or even better way for this "ingestion"?
If you had a table for users, with columns first_name, middle_name, last_name but then a user signed up and said they have no middle name, you could just store NULL for that user's middle_name column. What's bad design about that?
NULL is used when an attribute is unknown or inapplicable on a given row. It seems appropriate for the case you describe, i.e. when records that did not come from the external source have no UUID, and need no UUID.
That said, some computer science theorists insist that NULL is never appropriate. There's a decades-old controversy about whether SQL should even have a NULL.
The alternative would be to create a second table, in which you store only the UUID and the reference to the entity in your first table. Then just don't store rows for the other id's.
| id | UUID |
-------------------
4 CN1001231232
5 CN1001231242
And don't store the UUID column in your first table. This eliminates the NULLs, but it means you need to do a JOIN of the two tables whenever you want to query the entities with their UUID's.
First make sure you actually have to combine these in the same table. Are the products categories? If they are categories and are used like categories then it makes sense to have them in the same table, but if they have categories then they should be kept separate and given a category/parent id.
If you're sure it's appropriate to store them in the same table then the way you have it is good with one adjustment. For the UUID you can use a separate naming scheme that makes it interchangeable with id for those entries and avoids collisions with the other uuids. For example:
| id | UUID | name | parent_id |
----------------------------------------------------------
1 CAT000000001 All Categories NULL
2 CAT000000002 History 1
3 CAT000000003 Products 1
4 CN1001231232 Catalog electricity 3
5 CN1001231242 Catalog basic components 4
6 CAT000000006 Shipping 1
Your requirements combine the two things relational database are not great with out of the box: modelling hierarchies, and inheritance (in the object-oriented sense).
Your design users the "single table inheritance" model (one of 3 competing options). It's the simplest option in terms of design.
In practical terms, you may want to add a column to explicitly state which type of record you're dealing with ("regular category" and "product category") so your queries are more obvious to others.

check if value is present in one of the database rows

Im looking for a way to check if a value is present in one of the rows of the page column.
For example if should check if the value '45' is present?
Id | page |
---------------
1 | 23 |
---------------
2 | |
---------------
3 | 33,45,55 |
---------------
4 | 45 |
---------------
The find_in_set function is just what you're looking for:
SELECT *
FROM mytable
WHERE FIND_IN_SET('45', page) > 0
You should not store values in lists. This is especially true in this case:
Values should be stored in the proper data type. You are storing numbers as characters.
Foreign key relationships should be properly defined.
SQL doesn't have very good string processing functions.
Resulting queries cannot make use of indexes.
SQL has a great data type for lists, called a table. In this case, you want a junction table.
Sometimes, you are stuck with other people's really bad design decisions. In that case, you can use find_in_set() as suggested by Mureinik.

Database design. How to approach this please?

I'm designing a database (MySQL) that will manage a fleet of vehicles.
Company has many garages across the city, at each garage, vehicles gets serviced (operation). An operation can be any of 3 types of services.
Table Vehicle, Table Garagae, Table Operation, Table Operation Type 1, Table Operation Type 2, Table Operation type 3.
Each Operation has the vehicle ID, garage ID, but how do I link it to the the other tables (service tables) depending on which type of service the user chooses?
I would also like to add a billing table, but I'm lost at how to design the relationship between these tables.
If I have fully understood it I would suggest something like this (first of all you shouldn't have three operation tables):
Vehicles Table
- id
- garage_id
Garages Table
- id
Operations/Services Table
- id
- vehicle_id
- garage_id
- type
Customer Table
- id
- service_id
billings Table
- id
- customer_id
You need six tables:
vechicle: id, ...
garage: id, ...
operation: id, vechicle_id, garage_id, operation_type (which can be
one of the tree options/operations available, with the possibility to be extended)
customer: id, ...
billing: id, customer_id, total_amount
billingoperation: id, billing_id, operation_id, item_amount
You definitely should not creat three tables for operations. In the future if you would like to introduce a new operation that would involve creating a new table in the database.
For the record, I disagree with everyone who is saying you shouldn't have multiple operation tables. I think that's perfectly fine, as long as it is done properly. In fact, I'm doing that with one of my products right now.
If I understand, at the core of your question, you're asking how to do table inheritance, because Op Type 1 and Op Type 2 (etc.) IS A Operation. The short answer is that you can't. The longer answer is that you can't...at least not without some helper logic.
I assume you have some sort of program that will pull data from the database, rather than you just writing sql commands by hand. Working under that assumption, let's use this as a subset of your database:
Garage
------
GarageId | GarageLocation | etc.
---------|----------------|------
1 | 123 Main St. | XX
Operation
---------
OperationId | GarageId | TimeStarted | TimeEnded | OperationTypeDescId | OperationTypeId
------------|----------|-------------|-----------|---------------------|----------------
2 | 1 | noon | NULL | 2 | 2
OperationTypeDesc
-------------
OperationTypeDescId | Name | Description
--------------------|-------|-------------------------
1 | OpFoo | Do things with the stuff
2 | OpBar | Do stuff with the things
OpFoo
-----
OpID | Thing1 | Thing2
-----|--------|-------
1 | 123 | abc
OpBar
-----
OpID | Stuff1 | Stuff2
-----|--------|-------
1 | 456 | def
2 | 789 | ghi
Using this setup, you have the following information:
A garage has it's information, plain and simple
An operation has a unique ID (OperationId), a garage where it was executed, an ID referencing the description of the operation, and the OperationType ID (more on this in a moment).
A pre-populated table of operation types. Each type has a unique ID (OperationTypeDescId), the name of the operation, and a human-readable description of what that operation is.
1 table for each row in OperationTypeDesc. For convenience, the table name should be the same as the Name column
Now we can begin to see where inheritance comes into play. In the operation table, the OperationTypeId references the OpId of the relevant table...the "relevant table" is determined by the OperationTypeDescId.
An example: Let's say we had the above data set. In this example we know that there is an operation happening in a garage at 123 Main St. We know it started at noon, and has not yet ended. We know the type of operation is "OpBar". Since we know we're doing an OpBar operation instead of an OpFoo operation, we can focus on only the OpBar-relevant attributes, namely stuff1 and stuff2. Since the Operations's OperationTypeId is 2, we know that Stuff1 is 789 and Stuff2 is ghi.
Now the tricky part. In your program, this is going to require Reflection. If you don't know what that is, it's the practice of getting a Type from the NAME of that type. In our example, we know what table to look at (OpBar) because of its name in the OperationTypeDesc table. Put another way, you don't automatically know what table to look in; reflection tells you that information.
Edit:
Csaba says "In the future if you would like to introduce a new operation that would involve creating a new table in the database". That is correct. You would also need to add a new row to the OperationTypeDesc table. Csaba implies this is a bad thing, and I disagree - with a few provisions. If you are going to be adding a new operation type frequently, then yes, he makes a very good point. you don't want to be creating new tables constantly. If, however, you know ahead of time what types of operations will be performed, and will very rarely add new types of operations, then I maintain this is the way to go. All of your info common to all operations goes in the Operation table, and all op-specific info goes into the relevant "sub-table".
There is one more very important note regarding this. Because of how this is designed, you, the human, must be aware of the design. Whenever you create a new operation type, it's not as simple as creating the new table. Specifically, you have to make sure that the new table name and the OperationTypeDesc "Name" entry are the same. Think of it as an extra constraint - an "INTEGER" column can only contain ints, otherwise the db won't allow the data. In the same manner, the "Name" column can only contain the name of an existing table. You the human must be aware of that constraint, because it cannot be (easily) automatically enforced.

Whether to merge avatar and profile tables?

I have two tables:
Avatars:
Id | UserId | Name | Size
-----------------------------------------------
1 | 2 | 124.png | Large
2 | 2 | 124_thumb.png | Thumb
Profiles:
Id | UserId | Location | Website
-----------------------------------------------
1 | 2 | Dallas, Tx | www.example.com
These tables could be merged into something like:
User Meta:
Id | UserId | MetaKey | MetaValue
-----------------------------------------------
1 | 2 | location | Dallas, Tx
2 | 2 | website | www.example.com
3 | 2 | avatar_lrg | 124.png
4 | 2 | avatar_thmb | 124_thumb.png
This to me could be a cleaner, more flexible setup (at least at first glance). For instance, if I need to allow a "user status message", I can do so without touching the database.
However, the user's avatars will be pulled far more than their profile information.
So I guess my real questions are:
What king of performance hit would this produce?
Is merging these tables just a really bad idea?
This is almost always a bad idea. What you are doing is a form of the Entity Attribute Value model. This model is sometimes necessary when a system needs a flexible attribute system to allow the addition of attributes (and values) in production.
This type of model is essentially built on metadata in lieu of real relational data. This can lead to referential integrity issues, orphan data, and poor performance (depending on the amount of data in question).
As a general matter, if your attributes are known up front, you want to define them as real data (i.e. actual columns with actual types) as opposed to string-based metadata.
In this case, it looks like users may have one large avatar and one small avatar, so why not make those columns on the user table?
We have a similar type of table at work that probably started with good intentions, but is now quite the headache to deal with. This is because it now has 100s of different "MetaKeys", and there is no good documentation about what is allowed and what each does. You basically have to look at how each is used in the code and figure it out from there. Thus, figure out how you will document this for future developers before you go down that route.
Also, to retrieve all the information about each user it is no longer a 1-row query, but an n-row query (where n is the number of fields on the user). Also, once you have that data, you have to post-process each of those based on your meta-key to get the details about your user (which usually turns out to be more of a development effort because you have to do a bunch of String comparisons). Next, many databases only allow a certain number of rows to be returned from a query, and thus the number of users you can retrieve at once is divided by n. Last, ordering users based on information stored this way will be much more complicated and expensive.
In general, I would say that you should make any fields that have specialized functionality or require ordering to be columns in your table. Since they will require a development effort anyway, you might as well add them as an extra column when you implement them. I would say your avatar pics fall into this category, because you'll probably have one of each, and will always want to display the large one in certain places and the small one in others. However, if you wanted to allow users to make their own fields, this would be a good way to do this, though I would make it another table that can be joined to from the user table. Below are the tables I'd suggest. I assume that "Status" and "Favorite Color" are custom fields entered by user 2:
User:
| Id | Name |Location | Website | avatarLarge | avatarSmall
----------------------------------------------------------------------
| 2 | iPityDaFu |Dallas, Tx | www.example.com | 124.png | 124_thumb.png
UserMeta:
Id | UserId | MetaKey | MetaValue
-----------------------------------------------
1 | 2 | Status | Hungry
2 | 2 | Favorite Color | Blue
I'd stick with the original layout. Here are the downsides of replacing your existing table structure with a big table of key-value pairs that jump out at me:
Inefficient storage - since the data stored in the metavalue column is mixed, the column must be declared with the worst-case data type, even if all you would need to hold is a boolean for some keys.
Inefficient searching - should you ever need to do a lookup from the value in the future, the mishmash of data will make indexing a nightmare.
Inefficient reading - reading a single user record now means doing an index scan for multiple rows, instead of pulling a single row.
Inefficient writing - writing out a single user record is now a multi-row process.
Contention - having mixed your user data and avatar data together, you've forced threads that only one care about one or the other to operate on the same table, increasing your risk of running into locking problems.
Lack of enforcement - your data constraints have now moved into the business layer. The database can no longer ensure that all users have all the attributes they should, or that those attributes are of the right type, etc.

should i really use a relation table when tagging blog posts?

while trying to figure out how to tag a blog post with a single sql statement here, the following thought crossed my mind: using a relation table tag2post that references tags by id as follows just isn't necessary:
tags
+-------+-----------+
| tagid | tag |
+-------+-----------+
| 1 | news |
| 2 | top-story |
+-------+-----------+
tag2post
+----+--------+-------+
| id | postid | tagid |
+----+--------+-------+
| 0 | 322 | 1 |
+----+--------+-------+
why not just using the following model, where you index the tag itself as follows? taken that tags are never renamed, but added and removed, this could make sense, right? what do you think?
tag2post
+----+--------+-------+
| id | postid | tag |
+----+--------+-------+
| 1 | 322 | sun |
+----+--------+-------+
| 2 | 322 | moon |
+----+--------+-------+
| 3 | 4443 | sun |
+----+--------+-------+
| 4 | 2567 | love |
+----+--------+-------+
PS: i keep an id, i order to easily display the last n tags added...
It works, but it is not normalized, because you have redundancy in the tags. You also lose the ability to use the "same" tags to tag things besides posts. For small N, optimization doesn't matter, so I have no problems if you run with it.
As a practical matter, your indexes will be larger (assuming you are going to index on tag for searching, you are now indexing duplicates and indexing strings). In the normalized version, the index on the tags table will be smaller, will not have duplicates, and the index on the tag2post table on tagid will be smaller. In addition, the fixed size int columns are very efficient for indexing and you might also avoid some fragmentation depending on your clustering choices.
I know you said no renaming, but in general, in both cases, you might still need to think about the semantics of what it means to rename (or even delete) a tag - do all entries need to be changed, or does the tag get split in some way. Because this is a batch operation in a transaction in the worst case (all the tag2post have to be renamed), I don't really classify it as significant from a design point of view.
This sounds fine to me, using an ID to reference something that you delegated into another table makes sense when you have things that vary, say a user's name or whatever, because you don't want to change it's name in every place in your database when he changes it. However in this case the tag names themselves will not vary, so the only potential downside I see is that a text index might be slightly slower than a numeric index to search through.
Where is the real advantage of your proposal over a relation table containing IDs?
Technically they solve the same problem, but your proposed solution does it in a redundant, de-normalized way that only seems to satisfy the instinctive urge to be able to read the data directly from the relation table.
The DB server is pretty good at joining tables, and even more so if the join is over an INT field with an index on it. I don't think you will be facing devastating performance issues when you join another table (like: INT id, VARCHAR(50) TagName) to your query.
But you lose the ability to easily rename a tag (even if you don't plan on doing so), and you needlessly inflate your relation table with redundant data. Over time, this may cost you more performance than the normalized solution.
The de-normalised method may be fine depending on your application.
You may find that it causes a performance hit due to searching a large set of VARCHAR data.
When doing a search for things tagged like "sun*" (e.g. sun, sunny, sunrise)
you will not need to do a join. However, you will need to do a like comparison on a MUCH larger set of VARCHAR data. Proper indexing may alleviate this issue but only testing will tell you which method is faster with your dataset.
You also have the option of adding a VIEW that pre-joins the normalised tables. This gives you simpler queries while still allowing you to have highly normalised data.
My recommendation is to go with a normalised structure (and add de-normalised views a necessary for ease of use) until you encounter an issue that de-normalising the data schema fixes.
I was considering that too. Want a list of tags in the database, just select distinct tag from tag2post. I was told that since I wanted to optimize for select statements, it would be better to use an integer key because it was much faster than using a string.