Can I use InnoDB full text search to tokenize documents? - mysql

I used to think that SQL cannot process unstructured data (like text) unless we write some user-defined functions in C. However, InnoDB's full-text search feature seems to do much of this work already.
According to https://dev.mysql.com/doc/refman/5.6/en/innodb-fulltext-index.html, the index is saved in InnoDB tables named FTS_00000..._00000..._INDEX_?.
I tried to run SELECT * FROM FTS_00000..._00000..._INDEX_1, hoping to see the tokens in each document (perhaps with stopwords already removed). However, I got the error message
ERROR 1146 (42S02): Table 'tf.FTS_0000000000000028_0000000000000030_INDEX_1' doesn't exist
even though select * from information_schema.INNODB_SYS_TABLES; reveals that the table exists.
Does anyone know how I could get the tokens of each document I inserted into the full-text index? It would be great if I could get the information in the following schema:
token_id document_id count
"apple" 103343 3
"orange" 9593 1
...

Just because InnoDB uses a table as an internal data structure doesn't mean you have access to query those FTS tables with SQL statements. They don't appear in INFORMATION_SCHEMA.TABLES.
After creating the table opening_lines, which is the example given in that manual page, I see this:
mysql> SELECT table_id, name, space FROM INFORMATION_SCHEMA.INNODB_SYS_TABLES
-> WHERE name LIKE 'test/%';
+----------+----------------------------------------------------+-------+
| table_id | name | space |
+----------+----------------------------------------------------+-------+
| 52 | test/FTS_000000000000002e_0000000000000085_INDEX_1 | 36 |
| 53 | test/FTS_000000000000002e_0000000000000085_INDEX_2 | 37 |
| 54 | test/FTS_000000000000002e_0000000000000085_INDEX_3 | 38 |
| 55 | test/FTS_000000000000002e_0000000000000085_INDEX_4 | 39 |
| 56 | test/FTS_000000000000002e_0000000000000085_INDEX_5 | 40 |
| 57 | test/FTS_000000000000002e_0000000000000085_INDEX_6 | 41 |
| 47 | test/FTS_000000000000002e_BEING_DELETED | 31 |
| 48 | test/FTS_000000000000002e_BEING_DELETED_CACHE | 32 |
| 49 | test/FTS_000000000000002e_CONFIG | 33 |
| 50 | test/FTS_000000000000002e_DELETED | 34 |
| 51 | test/FTS_000000000000002e_DELETED_CACHE | 35 |
| 46 | test/opening_lines | 30 |
+----------+----------------------------------------------------+-------+
12 rows in set (0.00 sec)
mysql> SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA='test';
+---------------+
| TABLE_NAME |
+---------------+
| opening_lines |
+---------------+
1 rows in set (0.00 sec)
As far as I know, there's no way to query the FTS tables directly at all. They are only for InnoDB's internal implementation of fulltext indexing.
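If the goal is only the (token, document_id, count) layout from the question, one option is to build it outside the database. The sketch below is a rough client-side approximation, not InnoDB's tokenizer: the regular expression, the STOPWORDS set, and the token_counts name are all assumptions made for illustration. Separately, the MySQL manual describes INFORMATION_SCHEMA.INNODB_FT_INDEX_TABLE (selected through the innodb_ft_aux_table variable) as a read-only view of the indexed words, which may be closer to what the question is after.

```python
import re
from collections import Counter

# Hypothetical stopword list for illustration only; InnoDB uses its own list.
STOPWORDS = {"a", "an", "the", "of", "and"}


def token_counts(documents):
    """Return (token, document_id, count) rows for a {doc_id: text} mapping."""
    rows = []
    for doc_id, text in documents.items():
        # Lowercase, split on anything that is not a letter or digit,
        # then drop the stopwords.
        tokens = [t for t in re.findall(r"[a-z0-9]+", text.lower())
                  if t not in STOPWORDS]
        # One output row per distinct token in this document.
        for token, count in sorted(Counter(tokens).items()):
            rows.append((token, doc_id, count))
    return rows
```

The returned rows could then be bulk-inserted into an ordinary table with those three columns and queried with plain SQL.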

Related

Finding closest value. How to tell MySQL that the data is already ordered?

Let's say I have a table like the following:
+------------+--------+------+-----+---------+
| Field      | Type   | Null | Key | Default |
+------------+--------+------+-----+---------+
| datetime   | double | NO   | PRI | NULL    |
| some_value | float  | NO   |     | NULL    |
+------------+--------+------+-----+---------+
The date has to be stored as a double: it is recorded as Unix time with fractional seconds (I cannot install MySQL 5.6 to use fractional DATETIME). In addition, the values of the datetime field are not only the primary key, they are also always increasing. I would like to find the row closest to a certain value. Usually you can use something like:
select * from table order by abs(datetime - $myvalue) limit 1
However, I'm afraid that this implementation will be slow for hundreds of thousands of values, because it has to search the whole table. Since the values are ordered, I know a binary search could speed this up, but I have no idea how to tell MySQL to perform that kind of search.
To test the performance, I ran the following:
SET profiling = 1;
SELECT * FROM table order by abs(datetime - $myvalue) limit 1;
SHOW PROFILE FOR QUERY 1;
With the following results:
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000122 |
| Waiting for query cache lock | 0.000051 |
| checking query cache for query | 0.000191 |
| checking permissions | 0.000038 |
| Opening tables | 0.000094 |
| System lock | 0.000047 |
| Waiting for query cache lock | 0.000085 |
| init | 0.000103 |
| optimizing | 0.000031 |
| statistics | 0.000057 |
| preparing | 0.000049 |
| executing | 0.000023 |
| Sorting result | 2.806665 |
| Sending data | 0.000359 |
| end | 0.000049 |
| query end | 0.000033 |
| closing tables | 0.000050 |
| freeing items | 0.000089 |
| logging slow query | 0.000067 |
| cleaning up | 0.000032 |
+--------------------------------+----------+
As I understand it, sorting the result takes 2.8 seconds, even though my data is already sorted. For additional context, I have around 240,000 rows.
It won't scan the entire database. A primary key is indexed by a B-tree. Forcing it into a binary search would be slower, if you could do it, which you can't.
Try making it a field:
select abs(datetime - $myvalue) as date_diff, table.*
from table
order by date_diff
limit 1
Indexes are supported in RDBMSs. Define an index on datetime, or on whichever field you are interested in, and the database will not do a complete table scan.
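Because the sort key abs(datetime - $myvalue) is an expression, the index on datetime most likely cannot be used to order the rows, which would be consistent with the 2.8 s "Sorting result" step in the profile. One index-friendly alternative is to ask for the nearest row at or above the target and the nearest row below it, then keep the closer of the two. The sketch below uses Python's sqlite3 as a stand-in for MySQL; the samples table name and the closest_row helper are hypothetical.

```python
import sqlite3


def make_samples(conn, rows):
    """Create the hypothetical `samples` table and load (datetime, some_value) rows."""
    conn.execute(
        "CREATE TABLE samples (datetime REAL PRIMARY KEY, some_value REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)


def closest_row(conn, target):
    """Return the row whose datetime is nearest to target, or None if the table is empty."""
    # Each query is a range condition plus ORDER BY on the indexed column,
    # so the engine can walk the index from the target instead of sorting all rows.
    above = conn.execute(
        "SELECT datetime, some_value FROM samples "
        "WHERE datetime >= ? ORDER BY datetime ASC LIMIT 1",
        (target,),
    ).fetchone()
    below = conn.execute(
        "SELECT datetime, some_value FROM samples "
        "WHERE datetime < ? ORDER BY datetime DESC LIMIT 1",
        (target,),
    ).fetchone()
    candidates = [row for row in (above, below) if row is not None]
    if not candidates:
        return None
    # Compare the (at most) two neighbours and keep the closer one.
    return min(candidates, key=lambda row: abs(row[0] - target))
```

The same two SELECT statements should carry over to MySQL unchanged apart from the parameter placeholders; running EXPLAIN on them would confirm whether the primary key is actually used.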

TRUNCATE-INSERT vs SELECT-UPDATE-INSERT

I have a table that I am using as a temporary table. A cron runs every hour to set a certain value for each row.
| id | item_id | value |
+====+=========+=======+
| 1 | 5 | 52 |
| 2 | 34 | 314 |
| 3 | 27 | 189 |
| 4 | 19 | 200 |
+====+=========+=======+
What I would like to know is whether it is better to first TRUNCATE and then refill this table, or to SELECT each existing row and UPDATE it, or INSERT it if it doesn't exist.
If a record does not exist in your temporary table, insert it; if it already exists and its value needs to change, update only that specific record.
That is the wiser approach, because it reduces the execution time of the operation.
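A minimal sketch of the update-or-insert approach described above, using Python's sqlite3 as a stand-in for MySQL. The item_values table name and both helper functions are hypothetical, and the sketch assumes item_id is unique. In MySQL itself, a single INSERT ... ON DUPLICATE KEY UPDATE statement against a unique key on item_id is the usual way to express the same idea.

```python
import sqlite3


def make_item_values(conn, rows):
    """Create the hypothetical `item_values` table and load (item_id, value) rows."""
    conn.execute(
        "CREATE TABLE item_values ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "item_id INTEGER NOT NULL UNIQUE, "
        "value INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO item_values (item_id, value) VALUES (?, ?)", rows)


def refresh_values(conn, new_values):
    """Update existing rows in place and insert only the missing ones."""
    for item_id, value in new_values.items():
        cur = conn.execute(
            "UPDATE item_values SET value = ? WHERE item_id = ?", (value, item_id)
        )
        # SQLite reports every matched row here. MySQL by default reports only
        # rows whose value actually changed, so this rowcount test is not a
        # reliable "row exists" check there.
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO item_values (item_id, value) VALUES (?, ?)",
                (item_id, value),
            )
    conn.commit()
```

Compared with TRUNCATE plus refill, this keeps the existing id values stable and never leaves the table empty between the two steps.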

How does GROUP BY DESC select its order?

So I am creating sections for a store. The store can have multiple scopes; if there isn't a section_identifier set for a given store_id, it should fall back to the global store, which is 0.
The SQL command that I want should return a list of section_options for any related given store.
Example of my table:
SELECT * FROM my_table:
+----+--------------------+----------------------+----------+
| id | section_identifier | option_identifier | store_id |
+----+--------------------+----------------------+----------+
| 17 | header | header_option_one | 1 |
| 18 | footer | footer_option_one | 0 |
| 19 | homepage_feature | homepage_feature_one | 0 |
| 23 | header | header_option_three | 0 |
| 25 | homepage_feature | homepage_feature_one | 1 |
+----+--------------------+----------------------+----------+
So section_identifier is unique, the IDs I need back for store 1 would be 17, 18 and 25.
When I run:
SELECT * FROM my_table GROUP BY section_identifier it returns:
+----+--------------------+----------------------+----------+
| id | section_identifier | option_identifier | store_id |
+----+--------------------+----------------------+----------+
| 18 | footer | footer_option_one | 0 |
| 23 | header | header_option_three | 0 |
| 19 | homepage_feature | homepage_feature_one | 0 |
+----+--------------------+----------------------+----------+
This means if I run SELECT * FROM my_table GROUP BY section_identifier DESC:
I get the response (this is my desired output):
+----+--------------------+----------------------+----------+
| id | section_identifier | option_identifier | store_id |
+----+--------------------+----------------------+----------+
| 25 | homepage_feature | homepage_feature_one | 1 |
| 17 | header | header_option_one | 1 |
| 18 | footer | footer_option_one | 0 |
+----+--------------------+----------------------+----------+
Although this works, I don't understand why.
It's my understanding that the initial GROUP BY should get the first instance in the database, i.e. the response I expect should be:
+----+--------------------+----------------------+----------+
| id | section_identifier | option_identifier | store_id |
+----+--------------------+----------------------+----------+
| 18 | footer | footer_option_one | 0 |
| 17 | header | header_option_three | 1 |
| 19 | homepage_feature | homepage_feature_one | 0 |
+----+--------------------+----------------------+----------+
However, it seems to be referencing my store_id somehow? I have tried a few different combinations and, weirdly, I'm getting my expected result each time, but I don't understand why.
Can anybody explain this to me please?
PS
I have tried updating the option_identifier of id = 7 to see if MySQL references the row most recently saved on disk, and it didn't change the result.
Also: I'm not planning on using this feature or asking for an alternative; I'm asking what's going on with it.
SELECT * FROM my_table GROUP BY section_identifier
is an invalid SQL query.
How does GROUP BY work?
Let's get the query above and see how GROUP BY works. First the database engine selects all the rows that match the WHERE clause. There is no WHERE clause in this query; this means all the rows of the table are used to generate the result set.
It then groups the rows using the expressions specified in the GROUP BY clause:
+----+--------------------+----------------------+----------+
| id | section_identifier | option_identifier | store_id |
+----+--------------------+----------------------+----------+
| 17 | header | header_option_one | 1 |
| 23 | header | header_option_three | 0 |
+----+--------------------+----------------------+----------+
| 18 | footer | footer_option_one | 0 |
+----+--------------------+----------------------+----------+
| 19 | homepage_feature | homepage_feature_one | 0 |
| 25 | homepage_feature | homepage_feature_one | 1 |
+----+--------------------+----------------------+----------+
I marked the groups in the listing above to make everything clear.
On the next step, from each group the database engine produces a single row. But how?
The SELECT clause of your query is SELECT *. * stands for the full list of table columns; in this case, SELECT * is a short way to write:
SELECT id, section_identifier, option_identifier, store_id
Let's analyze the values of column id for the first group. What value should the database engine choose for id? 17 or 23? Why 17 and why 23?
It does not have any criteria to favor 17 over 23. It just picks one of them (probably 17, but this depends on a lot of internal factors) and goes on.
There is no problem to determine the value for section_identifier. It is the column used to GROUP BY, all its values in a group are equal.
The choosing dilemma occurs again on columns option_identifier and store_id.
According to standard SQL, your query is not valid and cannot be executed. However, some database engines run it as described above. The values of expressions that are not at least one of the following:
used in the GROUP BY clause;
used with GROUP BY aggregate functions in the SELECT clause;
functionally dependent on columns used in the GROUP BY clause;
are indeterminate.
Since version 5.7.5, MySQL implements functional dependency detection and, by default, it rejects an invalid GROUP BY query like yours.
How to make it work
It's not clear to me how you want to get the result set. In any case, if you want to get some rows from the table, GROUP BY is not the correct way to do it. GROUP BY does not select rows from a table; it generates new values using the values from the table. Most of the time, a row generated by GROUP BY does not match any row of the source table.
You can find a possible solution to your problem in this answer. You'll have to write the query yourself after you read and understand the idea (and once it is very clear to you how the "winner" rows should be selected).
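One way to make the "winner" rule explicit, assuming the rule is "use the store's own row for a section if it has one, otherwise the global row from store 0", is a NOT EXISTS filter instead of GROUP BY. This is only a sketch of that assumed rule, run through Python's sqlite3 rather than MySQL; load_my_table and options_for_store are hypothetical helper names.

```python
import sqlite3

# Sample rows copied from the question: (id, section_identifier, option_identifier, store_id)
ROWS = [
    (17, "header", "header_option_one", 1),
    (18, "footer", "footer_option_one", 0),
    (19, "homepage_feature", "homepage_feature_one", 0),
    (23, "header", "header_option_three", 0),
    (25, "homepage_feature", "homepage_feature_one", 1),
]


def load_my_table(conn):
    """Create my_table and load the sample rows."""
    conn.execute(
        "CREATE TABLE my_table (id INTEGER PRIMARY KEY, section_identifier TEXT, "
        "option_identifier TEXT, store_id INTEGER)"
    )
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", ROWS)


def options_for_store(conn, store_id):
    """Return the ids of the rows that apply to store_id, with fallback to store 0."""
    sql = """
        SELECT t.id
        FROM my_table AS t
        WHERE t.store_id = :store
           OR (t.store_id = 0
               AND NOT EXISTS (SELECT 1
                               FROM my_table AS s
                               WHERE s.section_identifier = t.section_identifier
                                 AND s.store_id = :store))
        ORDER BY t.id
    """
    return [row[0] for row in conn.execute(sql, {"store": store_id})]
```

With the sample data, store 1 gets its own header and homepage_feature rows plus the global footer row, and the result no longer depends on which row the engine happens to pick per group.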
GROUP BY sorts records in ascending order by default. Your store_id is not being referenced at all; instead, the records returned are sorted in ascending order of section_identifier.

mysql table having a->b and b->a values, select only a->b set of values

I have one table having 5 columns
linkid, orinodeno, oriifindex, ternodeno, terifindex
linkid is auto-incremented. orinodeno and oriifindex together form one combined value, and ternodeno and terifindex form another (orinodeno, oriifindex is the originating end and ternodeno, terifindex is the terminating end, with a link between them, like two points on a map joined by a connecting link). So my table contains a->b values (where a is the combination of orinodeno and oriifindex, and b is the combination of ternodeno and terifindex) as well as b->a values. I have to select only the a->b set of values, not b->a. I am also attaching an image of my table: My Table
There is no map definition in SQL databases; forget it. Check any database normalization tutorial, and then you shouldn't have any problems with SELECT statements.
Please be clear about what you are asking. If you cannot explain it in words, please give example input and your expected output.
From the linked table image and your description, it looks like you expect the following:
Data in current table:
------------------------------------------------------------------
|linkid | orinodenumber | oriifindex | ternodenumber | terifindex|
------------------------------------------------------------------
|305 | 261 | 2 | 309 | 2 |
|306 | 309 | 2 | 261 | 2 |
|307 | 257 | 10 | 310 | 10 |
|308 | 310 | 10 | 257 | 10 |
|309 | 257 | 11 | 310 | 11 |
------------------------------------------------------------------
Expected Output:
------------------------------------------------------------------
|linkid | orinodenumber | oriifindex | ternodenumber | terifindex|
------------------------------------------------------------------
|305 | 261 | 2 | 309 | 2 |
|307 | 257 | 10 | 310 | 10 |
------------------------------------------------------------------
If that is your case, the following query might help (assuming the table is named link_table):
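SELECT *
FROM link_table o
-- keep a row only if a later row is its exact reverse (b->a)
WHERE EXISTS (SELECT linkid
FROM link_table i
WHERE o.orinodenumber = i.ternodenumber
AND o.oriifindex = i.terifindex
AND o.ternodenumber = i.orinodenumber
AND o.terifindex = i.oriifindex
AND o.linkid < i.linkid);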
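As a quick check of the idea, the sketch below loads the sample rows into an in-memory SQLite database and runs an EXISTS query that compares both ends of the link, so a row is kept only when a later row is its exact reverse. SQLite stands in for MySQL here, and load_links and forward_links are hypothetical helper names.

```python
import sqlite3

# Sample rows from the listing above:
# (linkid, orinodenumber, oriifindex, ternodenumber, terifindex)
LINKS = [
    (305, 261, 2, 309, 2),
    (306, 309, 2, 261, 2),
    (307, 257, 10, 310, 10),
    (308, 310, 10, 257, 10),
    (309, 257, 11, 310, 11),
]


def load_links(conn):
    """Create link_table and load the sample rows."""
    conn.execute(
        "CREATE TABLE link_table (linkid INTEGER PRIMARY KEY, orinodenumber INTEGER, "
        "oriifindex INTEGER, ternodenumber INTEGER, terifindex INTEGER)"
    )
    conn.executemany("INSERT INTO link_table VALUES (?, ?, ?, ?, ?)", LINKS)


def forward_links(conn):
    """Return linkids of rows that have a later row pointing in the opposite direction."""
    sql = """
        SELECT o.linkid
        FROM link_table AS o
        WHERE EXISTS (SELECT 1
                      FROM link_table AS i
                      WHERE o.orinodenumber = i.ternodenumber
                        AND o.oriifindex = i.terifindex
                        AND o.ternodenumber = i.orinodenumber
                        AND o.terifindex = i.oriifindex
                        AND o.linkid < i.linkid)
        ORDER BY o.linkid
    """
    return [row[0] for row in conn.execute(sql)]
```

Row 309 is left out because no reverse row exists for it, which matches the expected output listed above.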

Erasing duplicate records from MySQL

Due to a bug in my JavaScript click handling, multiple Location objects are posted to a JSON array that is sent to the server. I think I know how to fix that bug, but I'd also like to implement a server-side function that erases duplicates from the database. However, I'm not sure how to write this query.
The only affected table is laid out as
+----+------------+--------+
| ID | locationID | linkID |
+----+------------+--------+
| 64 | 13 | 14 |
| 65 | 14 | 13 |
| 66 | 14 | 15 |
| 67 | 15 | 14 |
| 68 | 15 | 16 |
| 69 | 16 | 17 |
| 70 | 16 | 14 |
| 71 | 17 | 16 |
| 72 | 17 | 16 |
| 73 | 17 | 16 |
| 74 | 17 | 16 |
| 75 | 17 | 16 |
| 76 | 17 | 16 |
| 77 | 17 | 16 |
+----+------------+--------+
As you can see, I have multiple pairs of (17, 16), while 14 has two pairs of (14, 13) and (14, 15). How can I delete all but one record of any duplicate entries?
Don't implement after-the-fact correction logic; put a unique index on the fields that need to be unique, so that the database stops duplicate inserts before it's too late.
On MySQL 5.1 through 5.6 you can remove dupes and create a unique index in one command (ALTER IGNORE TABLE was deprecated in 5.6.17 and removed in 5.7.4):
ALTER IGNORE TABLE `YOURTABLE`
ADD UNIQUE INDEX somefancynamefortheindex (locationID, linkID);
You can create a temporary table to hold the distinct records, then empty the original table and insert the data back from the temporary table.
CREATE TEMPORARY TABLE temp_table (locationId INT,linkID INT);
INSERT INTO temp_table (locationId,linkId) SELECT DISTINCT locationId,linkId FROM table1;
DELETE from table1;
INSERT INTO table1 (locationId,linkId) SELECT * FROM temp_table ;
delete from tbl
using tbl,tbl t2
where tbl.locationID=t2.locationID
and tbl.linkID=t2.linkID
and tbl.ID>t2.ID
I am assuming you don't mean this for the cleanup, but for checking new rows? Put a unique index on the columns if possible; if you don't have control of the DB, do an upsert and check for nulls instead of a plain insert.
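The cleanup and the unique index can be combined: first delete every row except the lowest ID of each (locationID, linkID) pair, then add the index so new duplicates are rejected. The sketch below shows this with Python's sqlite3 as a stand-in for MySQL; table1 follows the name used above, and the helper and index names are hypothetical. Note that MySQL rejects a DELETE whose subquery reads the same table (error 1093), so on MySQL the self-join DELETE shown earlier is the form to use for the first step.

```python
import sqlite3


def load_table1(conn, rows):
    """Create table1 and load (ID, locationID, linkID) rows."""
    conn.execute(
        "CREATE TABLE table1 (ID INTEGER PRIMARY KEY, locationID INTEGER, linkID INTEGER)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", rows)


def remove_duplicates(conn):
    """Keep the lowest ID of each (locationID, linkID) pair, then forbid new duplicates."""
    conn.execute(
        "DELETE FROM table1 WHERE ID NOT IN ("
        "SELECT MIN(ID) FROM table1 GROUP BY locationID, linkID)"
    )
    # With the duplicates gone, the unique index can be created and will
    # make later duplicate inserts fail.
    conn.execute(
        "CREATE UNIQUE INDEX uniq_location_link ON table1 (locationID, linkID)"
    )
    conn.commit()
```

After this runs on the sample data, a second insert of (17, 16) raises an integrity error instead of silently adding another row.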