SQL Query to find duplicates where a String contains a specific id - mysql

I have a database table where one field (payload) is a string in which a JSON object is stored. This JSON object has multiple attributes. I would like to find a way to query all entries where the payload JSON object contains the same value for the attribute id_o, in order to find duplicates.
So, for example, if there were multiple entries where the id_o of the payload string is "id_o: 100", I want to get those rows back.
How can I do this?
Thanks in advance!

I have faced a similar issue before.
I used REGEXP_SUBSTR:
SELECT REGEXP_SUBSTR(yourJSONcolumn, '"id_o":"[^,]*') AS give_it_a_name FROM your_table;
Note that MySQL's REGEXP_SUBSTR (available since 8.0) returns the whole match rather than a capture group, so the result still includes the "id_o":" prefix. The comma in [^,]* can be replaced with a "." (or whatever character follows the value) if the id_o value is followed by a period or something else you want to exclude from the match.

I think storing JSON in the database like this is not good practice. Your DB needs normalization: it would be better to create a new column, index it, and store this id_o property there.
UPDATE
Here is what I found in another question:
If you really want to be able to add as many fields as you want with no limitation (other than an arbitrary document size limit), consider a NoSQL solution such as MongoDB.
For relational databases: use one column per value. Putting a JSON blob in a column makes it virtually impossible to query (and painfully slow when you actually find a query that works).
Relational databases take advantage of data types when indexing, and are intended to be implemented with a normalized structure.
As a side note: this isn't to say you should never store JSON in a relational database. If you're adding true metadata, or if your JSON is describing information that does not need to be queried and is only used for display, it may be overkill to create a separate column for all of the data points.

I guess your JSON looks like this: {..,"id_o":"100",..}
SELECT * FROM your_table WHERE your_column LIKE '%"id_o":"100"%'
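If you want all rows whose payload shares an id_o with at least one other row (rather than one specific value), one possible sketch, assuming MySQL 5.7+ so the JSON functions can read the JSON text, and using placeholder table/column names:

-- Sketch only: group rows by the id_o extracted from the payload string
-- and keep the groups that occur more than once (names are placeholders)
SELECT JSON_UNQUOTE(JSON_EXTRACT(payload, '$.id_o')) AS id_o,
       COUNT(*) AS cnt
FROM your_table
GROUP BY id_o
HAVING cnt > 1;

To get the full rows back you could then join this result against the table on the same extracted value.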

Related

A use case for MySql JSON datatype

I am creating a DB schema for a website on which users can write articles. I was almost done with the design when I suddenly read a few blog posts about the JSON datatype in MySQL.
According to those blogs, there are certain use cases where JSON can be used:
- for storing metadata, e.g. a product having its height, width and colour stored as JSON
- for storing non-standard schema-type data
- for storing tags as JSON, e.g. this question could have the tags mysql and JSON, so the blogs recommended using a JSON structure that holds all the tags
The last one seems doubtful to me. Why?
OK, I have stored the tag values in JSON as {"tags": ["mysql", "JSON", "mysql-datatype"]}. I agree this makes it easy to keep the tags with the article.
But suppose a user wants to read all the articles related to the mysql tag. If I had maintained a separate article_id - tag_id table, I could easily have fetched all the articles based on the tag. With JSON this becomes a much more awkward requirement; it can be solved, but at a cost: slower queries, of course.
This is my schema for Article:
Is my way of thinking correct, or am I missing something here? I'd love to hear some suggestions.
The task you're trying to do, to associate articles with tags, is better handled as a many-to-many relationship. For this you need another table, which I believe is the article_tags table in your diagram.
This makes it easy to query for all articles with a given tag.
SELECT ...
FROM article AS a
JOIN article_tags AS t USING (article_id)
WHERE t.topic_id = 1234 -- whatever is the id for the topic you want to read
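For reference, here is a minimal sketch of what that article_tags link table could look like; the column names follow the query above, and the exact keys are assumptions about the diagram:

-- Hypothetical DDL for the many-to-many link table between articles and topics/tags
CREATE TABLE article_tags (
  article_id INT NOT NULL,
  topic_id   INT NOT NULL,
  PRIMARY KEY (article_id, topic_id),
  KEY idx_topic (topic_id)   -- lets "all articles with a given topic" use an index
);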
Doing the same thing if you use JSON to store tags in the article table is different:
SELECT ...
FROM article AS a
WHERE JSON_CONTAINS(a.article_tags, '1234')
This might seem simpler, since it does not require a JOIN.
But any search that puts the column you need to search inside a function call will not be able to use an index. This will result in a table-scan, so the query will always search every row in the table. As your table grows, it will become slower and slower to do this search "the hard way."
The first method with the article_tags table uses an index two ways:
Look up the entries in article_tags matching the desired tag quickly
Look up the corresponding articles by their primary key quickly
No table-scan needed. The query reads only the rows that are going to be in the query result.
My take on the JSON data type and JSON functions follows this general rule:
Reference JSON columns in the select-list, but not in the WHERE clause.
That is, if you can do your search conditions in the WHERE clause using non-JSON columns, you can take advantage of indexes to make the query as efficient as possible.
Once the relevant rows have been found that way, then you may extract parts of your JSON data to return in the result. Compared to the cost of searching JSON documents in a table-scan, the cost of extracting a field from the JSON documents on the rows matching your search is relatively small.
The select-list is not evaluated for rows unless they match the search conditions.
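As an illustration of that rule (the table and column names here are made up, not from the question):

-- Filter on an indexed, non-JSON column in WHERE; extract JSON only in the select-list
SELECT a.article_id,
       a.title,
       JSON_EXTRACT(a.extra_info, '$.reading_time') AS reading_time
FROM article AS a
WHERE a.author_id = 42;   -- indexed, non-JSON search condition

The WHERE clause can use an index on author_id; the JSON extraction only runs for the rows that survive the filter.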
I haven't used it myself yet, but from my understanding I wouldn't use JSON for items you'd want to look up / filter by. For example, I'd use it for storing a JSON config where the config schema might change frequently (meaning no DB schema changes).
However, it looks like MySQL does have functions to search within JSON: https://dev.mysql.com/doc/refman/8.0/en/json-search-functions.html
JSON_CONTAINS(target, candidate[, path])
Not sure on the efficiency of this compared to an indexed string column.
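For example, against a tags document shaped like the one in the question, a call using the optional path argument might look like this (the column name is assumed):

-- The candidate must itself be valid JSON, hence the inner double quotes
SELECT article_id
FROM article
WHERE JSON_CONTAINS(article_tags, '"mysql"', '$.tags');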

Single Column vs Multi Column Design (for Non Primary Key columns)

In database table design, which of the following is the better design for event-log style data growth?
Design 1) Numeric columns (Long) and character columns (Varchar2), with an index:
..(pkey)|..|..|StockNumber Long | StockDomain Varchar2 |...
.. |..|..|11111 | Finance
.. |..|..|23458 | Medical
Design 2) Character column Varchar2 with Index:
..(pkey)|..|..|StockDetails Varchar2(1000) |..|..
.. |..|..|11111;Finance |..|..
.. |..|..|23458;Medical |..|..
Design advantages: the first design is very specific, while the second design is more general and can accommodate a wider range of data. In both cases the columns are indexed.
Storage: the first design's indexes require less storage than the second's.
Performance: Same?
My question is about performance vs. flexibility. Obviously the first design is better, but the second design is more general-purpose. Let me know your insights.
Note: Edited the question for more clarity.
In general, having discrete columns is the better way to go for a few reasons:
Datatypes - You have guarantees that the data you have saved is in the right format, at least as far as the non-string columns go: your StockNumber will always be a number if it's a bigint/long, and trying to set it to anything else will cause your insert/update to fail. As part of a delimiter-separated string, there is always a chance of bad data creeping in, because it is just part of a string.
Querying - Querying a single combined column has to be done using LIKE, since you are looking for a substring of the whole string. If I search with WHERE StockDetails LIKE '%11111%' I will find the first line, but I may also find another line where a different field inside that column happens to contain a dollar value of $11111. With discrete columns your query would be WHERE StockNumber = 11111, guaranteeing it finds the value only in that column.
Using the data - Once you have found the row you want, you then have to read the data. This means parsing the delimited string into separate fields. If one of those fields contains a semicolon and it is improperly escaped, the rest of the data will be parsed wrongly, and you still need the values in a guaranteed order, leaving blank sections (;;) where a column would have held a null value.
There is a middle ground between storing delimited strings and separate columns. I have seen, and in fact am using on one major project, data stored in a table as JSON. With JSON you have property names, so you don't care what order the fields appear in within the string: domain will always be domain; any non-standard fields you don't need for an entry (say a property that only exists for the medical domain) simply aren't there, rather than needing an empty ;; placeholder; and parsers for JSON exist in every language I can think of that you would connect to your database, so there's no need to hand-code something to parse a delimited string. For example, your StockDetails above would look like this:
+--------------------------------------+
| StockDetails |
+--------------------------------------+
| {"number":11111, "domain":"Finance"} |
| {"number":23458, "domain":"Medical"} |
+--------------------------------------+
This solves issues 2 and 3 above:
You now write your query as WHERE StockDetails LIKE '%"number":11111%'; including the JSON property name guarantees you don't match that value anywhere else in the string.
You don't need to worry about fields being out of order or missing from the string and making your data unusable; JSON gives you key/value pairs, and all you need to do is handle nulls where a key doesn't exist. This also lets you add fields easily: adding a new field to a delimited string can break the code that parses it, since the number of values in existing rows will be off and you would potentially need to update all of them, whereas with JSON you only store non-null fields, so a new field is simply treated like any other null value on existing data.
In relational database design, you need discrete columns. One value per column per row.
This is the only way to use data types and constraints to implement some data integrity. In your second design, how would you implement a UNIQUE constraint on either StockNumber or StockDomain? How would you make sure StockNumber is actually a number?
This is the only way to create indexes on each column individually, or create a compound index that puts the StockDomain first.
As an analogy, look in the telephone book: can you find all people whose first name is "Bill" easily or efficiently? No, you have to search the whole book to find people with a specific first name. The order of columns in an index matters.
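To make that concrete, here is the kind of constraint and index the discrete-column design allows (table and column names are illustrative; syntax shown for MySQL):

-- Only possible with discrete columns: type checking, uniqueness, and per-column indexes
CREATE TABLE stock_events (
  id          BIGINT PRIMARY KEY,
  StockNumber BIGINT NOT NULL,        -- guaranteed numeric
  StockDomain VARCHAR(50) NOT NULL,
  UNIQUE KEY uq_stock_number (StockNumber),
  KEY idx_domain_number (StockDomain, StockNumber)   -- compound index with the domain first
);

None of this is possible when both values live inside one delimited string column.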
The second design is practically not a database at all — it's a file.
To respond to your comments, I'm reiterating what I wrote in a comment:
Sometimes denormalization is worthwhile, but I can't tell [if your second design is worthwhile], because you haven't described how you will query this data. You must take into account your query needs before you can decide on any optimization.
Stated another way: denormalization, like all other optimizations, benefits one query type, at the expense of other query types. Therefore you need to know which queries you need to be optimal, and which queries are less important, so it won't hurt your overall performance if the other queries are degraded.
If you can't predict the queries, default to designing a database with rules of normalization. Normalization is not designed for performance optimization, it's designed to prevent data anomalies, which is a good goal too.
You have posted several new comments, I guess in the hopes that I will suddenly understand and endorse your second design. But you still haven't described any specific query that will be optimized by using your second design.

json field type vs. one field for each key

I'm working on a website which has a database table with more than 100 fields.
The problem is that when the number of records gets large (more than 10,000 or so), responses become very slow and sometimes no answer comes back at all.
Now I want to optimize this table.
My question is: can I use the JSON type for some fields to reduce the number of columns?
My limitation is that I still want to be able to search, change and maybe remove the specific data stored in the JSON.
PS: I read this question: Storing JSON in database vs. having a new column for each key, but that was asked in 2013, and as we know the JSON field type was added in MySQL 5.7.
Thanks for any guidance.
First of all, a table with 100 columns may suggest you should rethink your architecture before proceeding; otherwise it will only become more and more painful in later stages.
Maybe you are storing as separate columns data that could be broken down and stored as separate rows.
I suspect the SQL query you are writing is something like SELECT * ..., where you may be fetching more columns than you require. Specify only the columns you need; it will definitely speed up the API response.
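A tiny illustration of the difference, with made-up table and column names:

-- Fetches all 100+ columns even if the API only needs three of them
SELECT * FROM articles WHERE id = 123;

-- Fetches only what the response actually uses
SELECT id, title, created_at FROM articles WHERE id = 123;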
In my personal view, storing active data as JSON inside SQL is not useful. JSON should be used as a last resort for metadata that does not mutate and does not need to be searched.
Please make your question more descriptive about the schema of your database and the query you are making for the API.

Position independent string matching

I have 2,000,000 strings in my MySQL database. Now, when a new string comes in as input, I try to find out whether the string is already in my database; if not, I insert it.
Definition of String Match
In my case, the position of a word in the text doesn't matter; the two strings just have to contain exactly the same words, with no extra words in either one.
Ex - "Ram is a boy" and "boy is a Ram" will be said to match. "Ram is a good boy" won't match.
PS - Please ignore the meaning of the sentences.
Now, my question is: what is the best way to do this matching, given the number of strings (2,000,000) I have to match against?
The solution I could think of:
Index all the strings in SOLR/Sphinx. On a new search, I just hit the search server and have to consider at most the top 10 strings.
Advantages:
- Faster than MySQL full-text search
Disadvantages:
- Keeping the search server updated with the new entries in the MySQL database
Are there any better solutions I could go for? Any suggestions and approaches to tackle this are most welcome :)
Thanks !
You could just compute a second column that has the words in sorted order, then put a unique index on that column :)
ALTER TABLE `table` ADD sorted VARCHAR(255) NOT NULL, ADD UNIQUE INDEX (sorted);
then... (PHP for convenience, but other languages will be similar)
$words = explode(' ', trim($string));            // split the incoming string into words
sort($words);                                    // put the words into a canonical order
$sorted = mysql_real_escape_string(implode(' ', $words));
$string = mysql_real_escape_string($string);
// the unique index on `sorted` makes the insert a no-op if an equivalent string already exists
$sql = "INSERT IGNORE INTO `table` SET `string`='$string', `sorted`='$sorted'";
I would suggest creating some more tables that store information about your existing data,
so that regardless of how much data your table has, you will not have to deal with performance issues during the "match/check and insert" logic in your query.
Please check the schema suggestion I made for a similar requirement in another post on SO:
accommodate fuzzy matching
In the above post, to achieve your needs you will need just the one extra table where I mention matching data with 90% accuracy. Let me know if that answer is not clear or if you have any doubts about it.
EDIT-1
In your case you will have 3 tables: the one you already have, where your 2,000,000 string messages are stored, plus the two other tables I was talking about, which are as follows:
a second table to store every unique Expression (each unique word across all messages)
a third table to store the link between each Expression (word) and the messages that word appears in.
See the query results below.
Now let's say your input is the string "Is Boy Ram".
First extract each Expression from the string; you have 3 in this string: "Is", "Ram" and "Boy".
Now it is just a matter of completing the SELECT query to see whether all of these expressions exist in the last table,
"MyData_ExpressionString", for a single StringID. I guess you now have a better picture and know what to do next. And yes, I haven't created indexes, but I guess you already know which indexes you will need.
Calculate a Bloom filter for each string by adding all of its words to the filter. On any new string lookup, calculate its Bloom filter and look up the matching strings in the DB.
You can probably get by with a fairly short Bloom filter; some testing on your strings could tell you how long it needs to be.

MySQL: One row or more?

I have content of various kinds, each with an ID, and for a given piece of content I can specify multiple types.
The question is: should I use multiple rows to store the multiple types, or use a single type field, put the types in it separated by commas, and parse them in PHP?
Multiple Rows
`content_id` | `type`
1 | 1
1 | 2
1 | 3
VS
Single Row
`content_id` | `type`
1 | 1,2,3
EDIT
I'm looking for the faster answer, not the easier one; please consider this. Performance is really important for me, and I'm talking about a really huge database with millions or tens of millions of rows.
I'd generally always recommend the "multiple rows" approach as it has several advantages:
You can use SQL to query, for example, WHERE type=3 without any great difficulty, since you don't have to use WHERE type LIKE '%3%', which is less efficient (see the sketch below this list)
If you ever need to store additional data against each content_id and type pair, you'll find it a lot easier in the multiple row version
You'll be able to apply one, or more, indexes to your table when it's stored in the "multiple row" format to improve the speed at which data is retrieved
It's easier to write a query to add/remove content_id and type pairs when each pair is stored separately than when you store them as a comma-separated list
It'll (nearly) always be quicker to let SQL process the data to give you a subset than to pass it to PHP, or anything else, for processing
In general, let SQL do what it does best, which is allow you to store the data, and obtain subsets of the data.
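A minimal sketch of the multiple-rows version and the indexed lookup it enables (table and column names are made up):

CREATE TABLE content_type (
  content_id INT NOT NULL,
  type       INT NOT NULL,
  PRIMARY KEY (content_id, type),
  KEY idx_type (type)            -- makes "all content of a given type" an index lookup
);

-- No LIKE '%3%' needed: this can use idx_type directly
SELECT content_id FROM content_type WHERE type = 3;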
I always use multiple rows. If you use single rows your data is hard to read and you have to split it up once you grab it from the database.
Use multiple rows. That way, you can index that type column later, and search it faster if you need to in the future. Also it removes a dependency on your front-end language to do parsing on query results.
Normalised vs de-normalised design.
Usually I would recommend sticking to the "multiple rows" style (normalised), although sometimes (for performance/storage reasons) people deliberately implement the "single row" style (de-normalised).
Have a look here:
http://www.databasedesign-resource.com/denormalization.html
The single row could be better in a few cases; reporting, which tends to be easier with some denormalization, is the main example. So if your code is cleaner or performs better with the single row, go for that. Otherwise, multiple rows would be the way to go.
Never, ever, ever cram multiple logical fields into a single field with comma separators.
The right way is to create multiple rows.
If there's some performance reason that demands you use a single row, at least make multiple fields in the row. But that said, there is almost never a good performance reason to do this. First make a good design.
Do you ever want to know all the records with, say, type=2? With multiple rows, this is easy: "select content_id from mytable where type=2". With the crammed field, you would have to say "select content_id from mytable where type like '%2%'". Oh, except what happens if there are more than 11 types? The above query would find "12". Okay, you could say "where type like '%,2,%'". Except that doesn't work if 2 is the first or the last in the list. Even if you came up with a way to do it reliably, a LIKE search with an initial % means a sequential read of every record in the table, which is very slow.
How big will you make the crammed field? What if the string of types is too big to fit in your maximum?
Do you carry any data about the types? If you create a second table keyed on "type" with, say, a description of that type, how will you join to that table? With multiple rows, you could simply write "select content_id, type_id, description from content join type using (type_id)". With a crammed field ... not so easy.
If you add a new type, how do you make it consistent? Suppose it used to say "3,7,9" and now you add "5". Can you say "3,7,9,5" ? Or do they have to be in order? If they're not in order, it's impossible to check for equality, because "1,2" and "2,1" will not look equal but they are really equivalent. In either case, updating a type field now becomes a program rather than a single SQL statement.
Even if there is some trivial performance gain, it's just not worth it.