I'm trying to do a query on a Japanese dictionary DB I created that identifies repeating words—words like ニコニコ (niko niko), ピカピカ (pika pika), etc. While I know how to do LIKE %% queries, I'm not certain how to get it to define a pattern off one part and see if the other part matches it.
Parameters:
All of the words I'm looking for are 4 double-byte characters long
Pattern A consists of the first two characters, Pattern B consists of the last two
The query is being run on a headwords table that is structured rather simply: It has two fields, id and headword
Collation on the table is set to utf8_bin
We want to filter to search only headwords that are 4 characters long, then identify Pattern A and see if Pattern B is identical. If so, return the id.
Bonus: If there is a way to run the search as straight utf8 instead of utf8_bin, that would be helpful for picking up some additional matches (e.g. つれづれ tsure dure). The headwords column has a UNIQUE index on it, and requires utf8_bin collation to enforce the index properly for normal operations.
Data & Result Example (added per Strawberry's suggestion)
id | headword
=============
1 | たべる
2 | あらわれる
3 | ばかばかしい
4 | ニコニコ
5 | じゅんびする
6 | ぴかぴか
7 | する
8 | つれづれ
9 | ひとびと
10 | ひと
Desired result would return ids 4 and 6; an optimal result would also return 8 and 9.
1 is too short by 1 character, and Pattern A (たべ) does not match Pattern B (る)
2 is too long by 1 character, and Pattern A (あら) does not match Pattern B (われ). Ditto for 5
3 has matches for Patterns A and B (ばか), however it's too long at 6 characters
7 and 10 are too short by 2 characters. While there's a possible Pattern A (e.g. ひと in 10 appears in ひとびと in 9), it's not long enough to provide a Pattern B to compare against
In PHP, this is what you are looking for: preg_match('/^(..)\1$/u', 'ニコニコ') will be true.
The u modifier says that the pattern and the target string are treated as utf8, so . matches a whole character rather than a single byte.
The .. finds any 2 characters.
The \1 is a back-reference to (..), hence matching a duplicate.
The ^ and $ 'anchor' the regexp to the start and end of the target string.
The 'ニコニコ' is merely one of the test cases.
So, start at the beginning, find 2 utf8 characters, make sure they are immediately repeated, and nothing else follows.
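For comparison, here is a minimal Python sketch of the same back-reference idea, run against the sample headwords from the question. The "loose" mode is only one possible way to get the bonus matches: it assumes that stripping the combining voicing marks (U+3099 and U+309A) after NFD normalization is an acceptable stand-in for a non-binary collation. The helper names strip_voicing and is_repeating are made up for this illustration.

```python
import re
import unicodedata

# Two characters, immediately repeated, and nothing else (fullmatch anchors both ends).
REPEAT = re.compile(r"(..)\1")

def strip_voicing(word):
    """Decompose kana and drop the combining dakuten/handakuten marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if ch not in "\u3099\u309a")

def is_repeating(word, ignore_voicing=False):
    """True if the word is exactly AB followed by AB."""
    if ignore_voicing:
        word = strip_voicing(word)
    return REPEAT.fullmatch(word) is not None

headwords = {
    1: "たべる", 2: "あらわれる", 3: "ばかばかしい", 4: "ニコニコ", 5: "じゅんびする",
    6: "ぴかぴか", 7: "する", 8: "つれづれ", 9: "ひとびと", 10: "ひと",
}

strict = [i for i, w in headwords.items() if is_repeating(w)]
loose = [i for i, w in headwords.items() if is_repeating(w, ignore_voicing=True)]
print(strict)  # ids matching under exact comparison
print(loose)   # ids matching once voicing marks are ignored
```

With the sample data, the strict list corresponds to the desired result and the loose list to the optimal one.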
I am trying to identify Spanish ID numbers using REGEX on MySQL. I took this regex and adapted it to my dataset, as the items are not isolated and might not start/end with those characters. The expressions are:
Original: ^(x?\d{8}|[xyz]\d{7})[trwagmyfpdxbnjzsqvhlcke]$
Mine:[0-9]{8,8}[A-Za-z]{1}
When I run the search using my REGEX, this is a sample of what I get:
GOOD --> 47099085T
GOOD --> D73654109H
NOT OK --> 8.30781719e-05
NOT OK --> 0113:11:19%2000:54:17.042828927Z
How can I modify [0-9]{8,8}[A-Za-z]{1} to exclude the "NOT OK" items?
Spanish ID syntax:
The number of the National Identity Document includes 8 digits and one letter for security. The letter is found by taking all 8 digits as a number and dividing it by 23. The remainder of this digit, which is between 0 and 22, gives the letter used for security. The letters I, Ñ, O, U are not used. The letters I and O are not used – to avoid confusions with the numbers 0 and 1. The Ñ is not used to avoid confusions with N.
Remainder: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Letter: T R W A G M Y F P D X B N J Z S Q V H L C K E
-- EDIT II --
After running a test on a bigger data set, I have found other matches that should be excluded.
How can I modify (^|[^0-9.])([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke]) so that it does NOT match:
70ce4827ce88530583ed5a1a40245f24
BE4-SGS-V2-00199982a5aa
2945a6bf-86b6-4ea0-94d9-aec84980762d
0x01010083B5627CCA663946A282DE573804AA85
xmp.iid:FE7F11740720681189A59382544B2855
Ok, according to documentation the Spanish ID system (DNI) is structured thus:
The number of the National Identity Document includes 8 digits and one letter for security. The letter is found by taking all 8 digits as a number and dividing it by 23. The remainder of this digit, which is between 0 and 22, gives the letter used for security. The letters I, Ñ, O, U are not used. The letters I and O are not used – to avoid confusions with the numbers 0 and 1. The Ñ is not used to avoid confusions with N.
Remainder: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
Letter: T R W A G M Y F P D X B N J Z S Q V H L C K E
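The letter rule quoted above can be sketched in a few lines of Python. This is only an illustration of the modulo-23 lookup described in the documentation; the function names control_letter and has_valid_letter are invented here, and the sketch covers only the plain 8-digits-plus-letter form.

```python
import re

# Index = remainder of the 8-digit number divided by 23, per the table above.
LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"

def control_letter(number):
    """Return the security letter for an 8-digit DNI number."""
    return LETTERS[number % 23]

def has_valid_letter(candidate):
    """Check that a string is 8 digits plus the letter the rule predicts."""
    match = re.fullmatch(r"([0-9]{8})([A-Za-z])", candidate)
    if match is None:
        return False
    digits, letter = match.groups()
    return letter.upper() == control_letter(int(digits))

print(control_letter(12345678))
print(has_valid_letter("12345678Z"), has_valid_letter("12345678A"))
```

A check like this could be applied to regex candidates after they are extracted, since the regex alone only tests the shape of the string.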
After some exploration with negative lookaheads and completely failing to get them to work, we can use a more manual approach: explicitly checking that the found "block" of 8 digits is not preceded by a digit or a decimal point:
/[^\.\d][\d]{8}[TRWAGMYFPDXBNJZSQVHLCKE]/gmi
MySQL safe/syntax version:
(^|[^0-9.])([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])
Example usage with REGEXP_REPLACE to return the rows where id_column matches the ID syntax, along with the matching strings:
SELECT REGEXP_REPLACE(`id_column`,
'(^|[^\\d.])(\\d{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])', '$2') as id_output
FROM `table_name`
WHERE id_column REGEXP '(^|[^\\d.])(\\d{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])'
NOTE: Prior to MySQL 8.0.17, the result returned by this function used the UTF-16 character set; in MySQL 8.0.17 and later, the character set and collation of the expression searched for matches is used. (Bug #94203, Bug #29308212)
This matches the two correct matches on your example as well as checking that only one of the valid letters comes after the numerical match.
It is important to note that the max value in the quantifier {min,max} is largely irrelevant here, because it does not mean that no more than max digits may exist in the source string; a longer run of digits still contains a run of 8. Please see here for further reading.
What does my Regex do:
Checks that a set of 8 digits is not preceded by either another digit or a decimal point (so 9 digits are never "captured").
Checks that the set of 8 found integers is immediately followed by one of the valid letters of either case.
You can see my Regex in action here and the corresponding MySQL demo here.
47099085T // matches
D73654109H // matches
8.30781719e-05 // unmatched
0113:11:19%2000:54:17.042828927Z // unmatched
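The same check can be reproduced outside MySQL. The sketch below feeds the MySQL-safe pattern to Python's re module (an assumption of convenience; the two engines agree on this simple syntax) and runs it over the four sample strings. find_ids is a hypothetical helper name.

```python
import re

# Same pattern as the MySQL-safe version above; group 2 is the candidate ID.
DNI_PATTERN = re.compile(
    r"(^|[^0-9.])([0-9]{8}[TRWAGMYFPDXBNJZSQVHLCKEtrwagmyfpdxbnjzsqvhlcke])"
)

def find_ids(text):
    """Return every candidate ID found in the text."""
    return [m.group(2) for m in DNI_PATTERN.finditer(text)]

samples = [
    "47099085T",
    "D73654109H",
    "8.30781719e-05",
    "0113:11:19%2000:54:17.042828927Z",
]
for sample in samples:
    print(sample, find_ids(sample))
```

Note that the pattern only inspects the single character before the digits, so a hex string such as 70ce4827ce88530583ed5a1a40245f24 from EDIT II still yields a candidate (88530583e).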
Which of these methods would be the most efficient way of storing, retrieving, processing and searching a large (millions of records) index of stored URLs along with their keywords?
Example 1: (Using one table)
TABLE_URLs-----------------------------------------------
ID DOMAIN KEYWORDS
1 mysite.com videos,photos,images
2 yoursite.com videos,games
3 hissite.com games,images
4 hersite.com photos,pictures
---------------------------------------------------------
Example 2: (one-to-one Relationship from one table to another)
TABLE_URLs-----------------------------------------------
ID DOMAIN KEYWORDS
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_KEYWORDS---------------------------------------------
ID DOMAIN_ID KEYWORDS
1 1 videos,photos,images
2 2 videos,games
3 3 games,images
4 4 photos,pictures
---------------------------------------------------------
Example 3: (one-to-one Relationship from one table to another (Using a reference table))
TABLE_URLs-----------------------------------------------
ID DOMAIN
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_TO_KEYWORDS------------------------------------
ID DOMAIN_ID KEYWORDS_ID
1 1 1
2 2 2
3 3 3
4 4 4
---------------------------------------------------------
TABLE_KEYWORDS-------------------------------------------
ID KEYWORDS
1 videos,photos,images
2 videos,games
3 games,images
4 photos,pictures
---------------------------------------------------------
Example 4: (many-to-many Relationship from url to keyword ID (using reference table))
TABLE_URLs-----------------------------------------------
ID DOMAIN
1 mysite.com
2 yoursite.com
3 hissite.com
4 hersite.com
---------------------------------------------------------
TABLE_URL_TO_KEYWORDS------------------------------------
ID DOMAIN_ID KEYWORDS_ID
1 1 1
2 1 2
3 1 3
4 2 1
5 2 4
6 3 4
7 3 3
8 4 2
9 4 5
---------------------------------------------------------
TABLE_KEYWORDS-------------------------------------------
ID KEYWORDS
1 videos
2 photos
3 images
4 games
5 pictures
---------------------------------------------------------
My understanding is that Example 1 would take the largest amount of storage space however searching through this data would be quick (Repeat keywords saved multiple times, however keywords are sat next to the relevant domain)
Whereas Example 4 would save a lot of storage space, but searching through it would take longer. (Not having to store duplicate keywords, however referencing multiple keywords for each domain would take longer)
Could anyone give me any insight or thoughts on which method would be best when designing a database that can handle huge amounts of data? Bear in mind that you may want to display a URL with its associated keywords OR search for one or more keywords and bring up the most relevant URLs.
You do have a many-to-many relationship between url and keywords. The canonical way to represent this in a relational database is to use a bridge table, which corresponds to example 4 in your question.
Using the proper data structure, you will find out that the queries will be much easier to write, and as efficient as it gets.
I don't know what drives you to think that searching in a structure like the first one will be faster. It requires you to do pattern matching when searching for each single keyword, which is notably slow. On the other hand, using a junction table lets you search for exact matches, which can take advantage of indexes.
Finally, maintaining such a structure is also much easier; adding or removing keywords can be done with insert and delete statements, while the other structures require you to do string manipulation on a delimited list, which again is tedious, error-prone and inefficient.
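To make the junction-table point concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained). The table and column names (urls, keywords, url_keywords) are illustrative, and the rows are the ones from Example 4.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE urls (id INTEGER PRIMARY KEY, domain TEXT NOT NULL UNIQUE);
CREATE TABLE keywords (id INTEGER PRIMARY KEY, keyword TEXT NOT NULL UNIQUE);
CREATE TABLE url_keywords (
    url_id INTEGER NOT NULL REFERENCES urls(id),
    keyword_id INTEGER NOT NULL REFERENCES keywords(id),
    PRIMARY KEY (url_id, keyword_id)
);
-- Reverse index so keyword -> url lookups are also exact-match index scans.
CREATE INDEX idx_keyword_url ON url_keywords (keyword_id, url_id);
""")
conn.executemany("INSERT INTO urls VALUES (?, ?)", [
    (1, "mysite.com"), (2, "yoursite.com"), (3, "hissite.com"), (4, "hersite.com"),
])
conn.executemany("INSERT INTO keywords VALUES (?, ?)", [
    (1, "videos"), (2, "photos"), (3, "images"), (4, "games"), (5, "pictures"),
])
conn.executemany("INSERT INTO url_keywords VALUES (?, ?)", [
    (1, 1), (1, 2), (1, 3), (2, 1), (2, 4), (3, 4), (3, 3), (4, 2), (4, 5),
])

def domains_for(keyword):
    """Exact-match search: which domains carry this keyword?"""
    rows = conn.execute(
        """SELECT u.domain FROM urls u
           JOIN url_keywords uk ON uk.url_id = u.id
           JOIN keywords k ON k.id = uk.keyword_id
           WHERE k.keyword = ? ORDER BY u.domain""",
        (keyword,),
    )
    return [r[0] for r in rows]

def keywords_for(domain):
    """Display case: list the keywords attached to one domain."""
    rows = conn.execute(
        """SELECT k.keyword FROM keywords k
           JOIN url_keywords uk ON uk.keyword_id = k.id
           JOIN urls u ON u.id = uk.url_id
           WHERE u.domain = ? ORDER BY k.keyword""",
        (domain,),
    )
    return [r[0] for r in rows]

print(domains_for("games"))
print(keywords_for("mysite.com"))
```

Removing a keyword from a domain is then a single DELETE on url_keywords, with no string editing involved.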
None of the above.
Simply have a table with 2 string columns:
CREATE TABLE domain_keywords (
domain VARCHAR(..) NOT NULL,
keyword VARCHAR(..) NOT NULL,
PRIMARY KEY(domain, keyword),
INDEX(keyword, domain)
) ENGINE=InnoDB
Notes:
It will be faster.
It will be easier to write code.
Having a plain id is very much a waste.
Normalizing the domain and keyword buys little space savings, but at a big loss in efficiency.
"Huge database"? I predict that this table will be smaller than your Domains table. That is, this table is not your main concern for "huge".
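As a rough illustration of how the two-column table answers the "most relevant URLs" part of the question, the sketch below ranks domains by how many of the requested keywords they carry. It uses sqlite3 in place of MySQL so that it runs standalone; rank_domains and the index name are made up for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE domain_keywords (
    domain  TEXT NOT NULL,
    keyword TEXT NOT NULL,
    PRIMARY KEY (domain, keyword)
);
CREATE INDEX idx_keyword_domain ON domain_keywords (keyword, domain);
""")
db.executemany("INSERT INTO domain_keywords VALUES (?, ?)", [
    ("mysite.com", "videos"), ("mysite.com", "photos"), ("mysite.com", "images"),
    ("yoursite.com", "videos"), ("yoursite.com", "games"),
    ("hissite.com", "games"), ("hissite.com", "images"),
    ("hersite.com", "photos"), ("hersite.com", "pictures"),
])

def rank_domains(keywords):
    """Return (domain, hits) pairs, best match first."""
    placeholders = ", ".join("?" for _ in keywords)
    sql = (
        "SELECT domain, COUNT(*) AS hits FROM domain_keywords "
        f"WHERE keyword IN ({placeholders}) "
        "GROUP BY domain ORDER BY hits DESC, domain"
    )
    return db.execute(sql, list(keywords)).fetchall()

print(rank_domains(["games", "images"]))
```

Both lookups (by domain and by keyword) are served by one of the two indexes, which is the point of declaring both column orders.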
Supposing I have 1000 numbers from 1 -> 1000, and a user can have any of the 1000 combination (eg: 4, 25, 353..).
How can I efficiently store that combination in a MySQL DB.
What I thought. I can use the power of 2, and store each number in a really large int, like:
1 -> 01
2 -> 10
4 -> 100
etc.
So if I happen to get the number 6 (110) I know the user has the combination of numbers 2, 4 (2 | 4 = 6) .
So we can have 2^1000 combinations, which takes 125 bytes. But that is not efficient at all, since BIGINT has 8 bytes and I can't store that in MySQL without using VARCHARs etc. Node.js can't handle a number that big either (and neither can I), with 2^53-1 being the max.
Why I am asking this question: can I do the above with base 10 instead of base 2 and minimize the maximum number of bytes the int can take? That was silly; I think switching to base 10, or any base other than 2, changes nothing.
Edit: Additional thoughts;
So one possible solution is to split them into sets of 16-digit numbers, convert those to strings, concatenate them with a delimiter, and store that instead of numbers. (Potentially replace runs of 1's or 0's with a certain character to make it even smaller. I have a feeling that falls into the field of compression, but nothing better has come to my mind.)
Based on your question I am assuming you are optimizing for space
If most users have many numbers from the set, then 125 bytes the way you described is the best you can do. You can store that in a BINARY(125) column though. In Node.js you could just use a Buffer (a plain string would work, but a Buffer is the better fit) to operate on the 125-byte bit-field.
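A minimal sketch of that bit-field, written in Python rather than Node.js purely for illustration: each number from 1 to 1000 maps to one bit of a 125-byte value that could be stored in the BINARY(125) column. The bit order used here (lowest bit of the first byte is number 1) is an arbitrary assumption, and pack/unpack are invented names.

```python
SET_SIZE = 1000
FIELD_BYTES = (SET_SIZE + 7) // 8  # 125 bytes for 1000 bits

def pack(numbers):
    """Turn a collection of numbers in 1..1000 into a 125-byte bit-field."""
    field = bytearray(FIELD_BYTES)
    for n in numbers:
        if not 1 <= n <= SET_SIZE:
            raise ValueError(f"{n} is outside 1..{SET_SIZE}")
        bit = n - 1
        field[bit // 8] |= 1 << (bit % 8)
    return bytes(field)

def unpack(field):
    """Recover the sorted list of numbers from a bit-field."""
    return [i + 1 for i in range(SET_SIZE) if field[i // 8] & (1 << (i % 8))]

stored = pack([4, 25, 353])
print(len(stored), unpack(stored))
```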
If most users have only a few elements in the set then it will take less space to have a separate table with two columns such as:
user_id | has_element (SMALLINT)
---------------------
1 | 4
1 | 25
1 | 353
2 | 7
2 | 25
2 | 512
2 | 756
2 | 877
This will also make queries cleaner and more efficient for doing simple queries like SELECT user_id FROM user_elements WHERE has_element = 25;. You should probably add an index on has_element if you do queries like that to make them many times more efficient than storing a bitfield in a column.
My table lists every character from all 5 of George R. R. Martin's currently published A Song of Ice and Fire novels. Each row contains a record indicating which book in the series the character is from (numbered 1-5) and a single letter indicating the character's gender (M/F). For example:
A B C
1 Character Book Gender
------------------------------
2 Arya Stark - 1 - F
3 Eddard Stark - 1 - M
4 Davos Seaworth - 2 - M
5 Lynesse Hightower - 2 - F
6 Xaro Xhoan Daxos - 2 - M
7 Elinor Tyrell - 3 - F
I can use COUNTIF to find out that there are three females and three males in this table, but I want to know, for example, how many males there are in book 2. How could I write a formula that would make this count? Here is a pseudocode of what I'm trying to achieve:
=COUNTIF(C2:C7, Column B = '2' AND Column C = 'M')
This would output 2.
I'm aware that this task is far better suited to databases and a SELECT query, but I'd like to know how to solve this problem within the constraints of a LibreOffice Calc spreadsheet, without using a macro. Excel-based solutions are fine, so long as they also work in Calc. If there's no solution that uses COUNTIF, it doesn't matter, so long as it works.
I worked it out, thanks to a prompt by assylias. The COUNTIFS formula produces the result I want by counting multiple search criteria. For example, this formula works out how many male characters are in Book 1 (A Game of Thrones).
=COUNTIFS($A$2:$A$2102, "=1", $L$2:$L$2102, "=M")
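For readers more used to code than spreadsheet formulas, the logic of COUNTIFS is a count of rows where every criterion holds at once. The Python sketch below mirrors that on the six sample rows from the question; the function name countifs and the tuple layout are just illustrative.

```python
# (character, book, gender) rows from the example table in the question.
rows = [
    ("Arya Stark", 1, "F"),
    ("Eddard Stark", 1, "M"),
    ("Davos Seaworth", 2, "M"),
    ("Lynesse Hightower", 2, "F"),
    ("Xaro Xhoan Daxos", 2, "M"),
    ("Elinor Tyrell", 3, "F"),
]

def countifs(table, book, gender):
    """Count rows that satisfy both criteria, like COUNTIFS with two ranges."""
    return sum(1 for _, b, g in table if b == book and g == gender)

print(countifs(rows, 2, "M"))
```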
Let's say I have these strings in a MySQL table:
id | hash
1 | 462a276e262067573e553b5f6a2b4a323e35272d3c6b6227417c4f2654
2 | 5c2670355b6e503f39427a435a423d6d4c7c5156344c336c6c244a7234
3 | 35785c5f45373c495b70522452564b6f4531792b275e40642854772764
...
millions of records !
Now I have a set of substrings (6 character size), for example this:
["76e262", "435a42", "75e406", "95b705", "344c33"]
What I want is to know how many of these substrings are in each string, so the result could be:
id | matches
63 | 5
34 | 5
123 | 3
153 | 3
13 | 2
9 | 1
How can I achieve this in a fast way?
Real numbers and sizes are:
1) Table with 100.000/200.000 hashes
2) Main Hash size: 256 bytes
3) Substrings (mini-hashes): 16 of them, 32 each
NOTE: I'd like to avoid the "%LIKE%" since it's 16 likes for each row, and millions rows
You can accomplish this by using the Aho-Corasick algorithm: http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
MySQL doesn't have a function for that, so you'd need to write your own or consider using a language like Java or C to process the data.
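As a rough idea of what doing this outside MySQL could look like, here is a compact Aho-Corasick sketch in Python. It builds one automaton from the substrings and then scans each hash once, counting how many distinct substrings occur. It is an illustrative implementation rather than a tuned one, and the names build_automaton and count_matches are made up.

```python
from collections import deque

def build_automaton(patterns):
    """Build trie transitions, failure links and output sets for the patterns."""
    goto, fail, out = [{}], [0], [set()]
    for idx, pattern in enumerate(patterns):
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                fail.append(0)
                out.append(set())
            node = goto[node][ch]
        out[node].add(idx)
    # Breadth-first pass: depth-1 nodes keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] |= out[fail[child]]  # inherit matches ending at the suffix
            queue.append(child)
    return goto, fail, out

def count_matches(text, automaton):
    """Return how many distinct patterns occur somewhere in the text."""
    goto, fail, out = automaton
    node, found = 0, set()
    for ch in text:
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        found |= out[node]
    return len(found)

substrings = ["76e262", "435a42", "75e406", "95b705", "344c33"]
hashes = {
    1: "462a276e262067573e553b5f6a2b4a323e35272d3c6b6227417c4f2654",
    2: "5c2670355b6e503f39427a435a423d6d4c7c5156344c336c6c244a7234",
    3: "35785c5f45373c495b70522452564b6f4531792b275e40642854772764",
}
automaton = build_automaton(substrings)
matches = {row_id: count_matches(h, automaton) for row_id, h in hashes.items()}
print(matches)
```

Each hash is scanned once regardless of how many substrings are in the set, which is what avoids the 16 separate LIKE scans per row.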
How about a different approach?
You could also consider storing shifted copies of your data and running the check against those. For example, if your key is 462a276e262067573e553b5f6a2b4a323e35272d3c6b6227417c4f2654 and you know that your hash will have 58 chars, then you would have these variations:
62a276e262067573e553b5f6a2b4a323e35272d3c6b6227417c4f26544
2a276e262067573e553b5f6a2b4a323e35272d3c6b6227417c4f265446
a276e262067573e553b5f6a2b4a323e35272d3c6b6227417c4f2654462
276e262067573e553b5f6a2b4a323e35272d3c6b6227417c4f2654462a
...
Each of these would be stored in its own column, and every one of those columns would be indexed.
So your query would be simply:
SELECT * FROM table WHERE hash LIKE "a276e262%" OR s1 LIKE "a276e262%" ...
Note that this would be MUCH faster than LIKE "%value%", as the column is indexed and the LIKE only checks the prefix.
There are many disadvantages to this solution: space required for the extra columns, slower inserts and updates because of the time spent calculating the shifted columns, and the time required to process the result of the select. But you wouldn't need to implement the algorithm in MySQL.
You could also require that the minimum length of the string being searched is 6 chars, so you won't need to shift the whole string, only to keep the first 6 digits. If a match is found then you keep looking for the next 6 digits on the next match.
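The "keep only the first 6 characters of each shift" idea amounts to indexing every 6-character window of each hash. The Python sketch below shows an in-memory analogue of that index; it is an assumption-laden illustration (a dict instead of indexed columns, and windows that do not wrap around the end of the string, unlike the rotations listed above), with build_index and count_hits as invented names.

```python
from collections import defaultdict

WINDOW = 6  # assumed minimum length of a searched substring

def build_index(hashes):
    """Map every 6-character window to the ids of the hashes containing it."""
    index = defaultdict(set)
    for row_id, value in hashes.items():
        for start in range(len(value) - WINDOW + 1):
            index[value[start:start + WINDOW]].add(row_id)
    return index

def count_hits(index, substrings):
    """Count, per row id, how many of the substrings were found."""
    counts = defaultdict(int)
    for sub in set(substrings):
        for row_id in index.get(sub, ()):
            counts[row_id] += 1
    return dict(counts)

hashes = {
    1: "462a276e262067573e553b5f6a2b4a323e35272d3c6b6227417c4f2654",
    2: "5c2670355b6e503f39427a435a423d6d4c7c5156344c336c6c244a7234",
    3: "35785c5f45373c495b70522452564b6f4531792b275e40642854772764",
}
index = build_index(hashes)
hits = count_hits(index, ["76e262", "435a42", "75e406", "95b705", "344c33"])
print(hits)
```

In a database, the equivalent would be a separate table of (window, id) rows with an index on the window column, so each lookup is an exact match rather than a LIKE.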