I'm using MySQL, and I'm trying to find common strings over a given character length within a series of messages that are highly dynamic. Each message may contain a common phrase, but the phrases are appended with reference codes or names that don't follow a specific format on either side of the string. For example, this screenshot shows the types of common phrases I'm trying to scan for, with dynamic content embedded in different formats: https://screencast.com/t/rlABTWitQ
The end result I am looking for is something akin to this (https://screencast.com/t/qXzrGNFuf)
Because of the highly variable formats of these messages, my attempts with SUBSTRING_INDEX and REGEXP (as far as my amateur familiarity with REGEXP has taken me) haven't gotten anywhere.
SELECT LEFT("first_middle_last", CHAR_LENGTH("first_middle_last") - LOCATE('_', REVERSE("first_middle_last")));
I can't use something like this, as it only strips at a specific character. As you can see, the strings vary too much in format.
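For reference, this is the kind of SUBSTRING_INDEX attempt I mean (the sample string here is made up); it only works when the dynamic part sits behind a known delimiter, which these messages don't have:
-- Hypothetical example: keep everything before the second underscore.
-- This only helps when the variable part is always delimited the same way.
SELECT SUBSTRING_INDEX('common_phrase_REF1234', '_', 2) AS stripped;
-- returns 'common_phrase'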
I'm searching for cases in MySQL/MariaDB where the value transmitted when storing will differ from the value that can be retrieved later on. I'm only interested in fields with non-binary string data types like VARCHAR and *TEXT.
I'd like to get a more comprehensive understanding of how much a stored value can be trusted. This would be especially interesting for cases where the output simply lacks certain characters (like with the escaping example below), as this is specifically dangerous when validating.
So, this boils down to: Can you create an input string (and/or define an environment) where this doesn't output <value> in the second statement?
INSERT INTO t SET v = <value>, id = 1; -- success
SELECT v FROM t WHERE id = 1;
Things I can think of:
strings containing escaping (\a → a)
truncated if too long
character encoding of the table not supporting the input
Whether something fails silently probably also depends on how strict the SQL mode is set (as with the last two examples).
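To make the last two points concrete, here is a sketch of the silent-truncation case (table layout made up to match the statements above); with strict mode disabled, the INSERT succeeds with only a warning and the SELECT returns a shortened string:
-- Hypothetical illustration: silent truncation under a non-strict SQL mode.
SET sql_mode = '';                           -- strict mode off
CREATE TABLE t (id INT PRIMARY KEY, v VARCHAR(5));
INSERT INTO t SET v = 'abcdefgh', id = 1;    -- warning only, no error
SELECT v FROM t WHERE id = 1;                -- returns 'abcde', not 'abcdefgh'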
Thanks a lot in advance for your input!
You can trust that all databases do what the standards prescribe. With strings and integers it is simple, because the database saves the binary representation of that number or character in your chosen character set.
Decimal, double, and single (float) values are different, because they can't always be saved exactly and the fractional part gets approximated; see decimal representation.
That also follows the standards, but you have to account for it.
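A minimal sketch of what to account for (table and values made up): a FLOAT column stores only the closest binary approximation of the inserted value, so an exact equality check on the retrieved value can fail:
-- Hypothetical example: floating-point storage is approximate.
CREATE TABLE nums (f FLOAT);
INSERT INTO nums VALUES (0.1);
SELECT * FROM nums WHERE f = 0.1;   -- no row: the stored value is only close to 0.1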
I have a MySQL 8.0.20 database with a table that describes metadata about uploaded image files. One column contains a JSON object with a whole bunch of auto-generated data that I'm trying to clean up.
This JSON object sometimes contains one or more variable key names that match a specific pattern. Something like
{
"image_name": "P10043983",
"image_size": "60138",
"image_original_exifdata": "{
'FileName':'P10043983.jpg',
'MimeType':'image/jpeg',
'UndefinedTag:0xA435':'\u0000\u0000\u0000\u0000\u0000\u0000'
}"
}
That UndefinedTag:0xA435 (with many permutations) is the problem. It's referring to various image Exif details like lens type, GPS data, etc. It's stuff that I'm not interested in and that these cameras mostly don't provide, so I've ended up with a table full of long strings of useless characters just taking up space. I want those JSON fields gone for performance and cleanliness.
Is there a way to run a SQL query that would use wildcards or regular expressions to find (and, ideally, remove) all of these pesky variable keys? I'd like to avoid manually making a list of all of the possible "UndefinedTag" keys to search against, and I also didn't like the results when I just treated the whole thing as a string and did REGEXP_REPLACE calls (it sometimes left trailing commas that broke my JSON and were difficult for me to avoid/resolve).
I know some of the JSON functions like JSON_SEARCH() accept wildcards, but it explicitly says the search path can't end in a wildcard (so no UndefinedTag:0x** allowed). Many of the functions I'm after (e.g., JSON_REMOVE()) don't accept wildcards at all. Hell, I've even had trouble finding known keys, and I suspect that silly colon in the key name might have something to do with it.
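For what it's worth, the colon itself does seem to matter: as far as I can tell, a key like that has to be double-quoted inside the path expression before a function like JSON_REMOVE will even accept it (simplified here to a made-up top-level key, not my real nested structure):
-- Simplified, hypothetical example: a key containing a colon must be quoted in the path.
SELECT JSON_REMOVE(
  '{"image_name": "P10043983", "UndefinedTag:0xA435": "xxx"}',
  '$."UndefinedTag:0xA435"'
) AS cleaned;
-- returns {"image_name": "P10043983"}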
So, how can I clean up my table and remove the many forms of this UndefinedTag problem? Maybe it's easier to just go back to the regex_replace plan and deal instead with the trailing commas?
I have a JSON string which contains a long number, but in scientific notation (like 1.559101974041E12 instead of 1559101974041). Because of this, I am not able to parse it using ?eval, as the value must be in double quotes in order to be parsed.
I thought of one solution: putting double quotes around them using regex so they can be evaluated, and then using some FreeMarker method to convert the value into a long. But this solution is very risky and could alter other values as well.
I'm not sure how your template looks, but if you have a variable s that contains the string "1.559101974041E12" (the quotation marks aren't part of the string value itself), then you can parse it like s?number. s?eval doesn't work because scientific notation is not part of the FreeMarker syntax (but ?number can parse more formats).
If you re-print the number in the template, note that depending on locale and configuration settings, it might look like 1,559,101,974,041. You can prevent that with ?c (for example, ${s?number?c}), in which case it will always look like 1559101974041.
I have the following text strings:
"Name":"John"}]
"Age":36
"Address":"ABC,PQR234[]/.,#ANYCHARACTERS"
"Gender":null
I need to get two groups (key value pair) from this such that the output would be only:
Key|Value
Name|John
Age|36
Address|ABC,PQR234[]/.,#ANYCHARACTERS
The requirement is to have a single regex to grab everything in the double quotes if the double quotes are present. If not, take the value without the quotes.
In our example above, 36 and null are the ones without the quotes, and they need to be captured as well.
I have tried a lot but have failed to do so.
UPDATE:
I don't know why I am getting downvotes for this question. Yes, this is JSON that I am trying to parse, but there is a reason why I am doing it this way instead of using a document parser.
I am supposed to use Talend to convert a dynamic JSON into key-value pairs. What I mean by dynamic is that the fields of the JSON can vary, so I do not have a fixed schema and hence cannot use a document parser (which demands a fixed JSON structure). I am devising a workaround using Normalizer (on the comma) and then extracting the key-value pairs, which will be in double quotes, using regular expressions. I have tried many things on my own, and since I am not an expert in regular expressions, I have come here for input.
If you know any better solution to this, I would be very happy to get your inputs.
How about this?
/"?([^\n"]*)"?:"?([^\n"]*)"?/
Explained in detail at:
https://regex101.com/r/UM0rl2/1/
I am trying to create a regex to validate usernames which should match the following :
Only one special char (._-) allowed and it must not be at the extremes of the string
The first character cannot be a number
All the other characters allowed are letters and numbers
The total length should be between 3 and 20 chars
This is for an HTML validation pattern, so sadly it must be one big regex.
So far this is what I've got:
^(?=(?![0-9])[A-Za-z0-9]+[._-]?[A-Za-z0-9]+).{3,20}
But the positive lookahead can be repeated more than once, allowing more than one special character, which is not what I wanted, and I don't know how to fix that.
You should split your regex into two parts (not two expressions!) to make your life easier:
First, match the format the username needs to have:
^[a-zA-Z][a-zA-Z0-9]*[._-]?[a-zA-Z0-9]+$
Now we just need to validate the length constraint. In order not to mess with the already-matched pattern, you can use a non-consuming match that only validates the number of characters (it's literally a hack for creating an AND for your regular expression): (?=^.{3,20}$)
The regex will only try to match the format if the length constraint is satisfied. The lookahead is non-consuming, so after it succeeds, the engine is still at the start of the string.
So, all together:
(?=^.{3,20}$)^[a-zA-Z][a-zA-Z0-9]*[._-]?[a-zA-Z0-9]+$
Debugger Demo
I think you need to use ? instead of +, so the special character is matched only once or not at all.
^(?=(?![0-9])?[A-Za-z0-9]?[._-]?[A-Za-z0-9]+).{3,20}