How to improve my natural language search query - mysql

I am trying to build a query that does some natural language searching. I am unsure of the best way to do this in MySQL. I believe MySQL has some cool natural language stuff that I can use.
I have two tables which I have shown below.
1. transaction_category...
+--------------------+--------------------+-------------------+----------+
| tran_category_code | tran_category_desc | tran_category_seq | btn_type |
+--------------------+--------------------+-------------------+----------+
| CarParking | Car Parking | 2 | default |
| Electricity | Electricity | 1 | default |
| Groceries | Groceries | 4 | default |
| HealthInsurance | Health Insurance | 5 | default |
| Other | Other | 7 | default |
| Petrol | Petrol | 3 | default |
| Phone | Phone | 6 | default |
+--------------------+--------------------+-------------------+----------+
2. transaction_category_keyword...
+---------------------------------+------------------------------+--------------------+
| transaction_category_keyword_id | transaction_category_keyword | tran_category_code |
+---------------------------------+------------------------------+--------------------+
| 6 | Telstra | Phone |
| 7 | Park | CarParking |
| 8 | Coles | Groceries |
| 9 | Bp Connect | Petrol |
| 10 | Bupa | HealthInsurance |
+---------------------------------+------------------------------+--------------------+
My query is below and that returns the results I want but I was just wondering if anyone could give me advice on whether this could be improved using mysql's natural language functions. This would help me because the search is very simple now but I will be building on it a lot soon.
SELECT
tck.transaction_category_keyword_id,
tck.transaction_category_keyword,
tck.tran_category_code
FROM transaction_category tc, transaction_category_keyword tck
WHERE tc.tran_category_code = tck.tran_category_code
AND 'Coles Menai Syd Au' like '%' ||UPPER(tck.transaction_category_keyword) || '%'
+---------------------------------+------------------------------+--------------------+
| transaction_category_keyword_id | transaction_category_keyword | tran_category_code |
+---------------------------------+------------------------------+--------------------+
| 7 | Park | CarParking |
| 8 | Coles | Groceries |
| 10 | Bupa | HealthInsurance |
| 9 | Bp Connect | Petrol |
| 6 | Telstra | Phone |
+---------------------------------+------------------------------+--------------------+
thanks

In general, if you have a wildcard at both the beginning and end of your search pattern, then your searches are going to be fairly slow on any non-trivial table size, because an ordinary index cannot be used and every value has to be scanned from every character position.
You would definitely benefit from full text search and match as you are searching for bags of words (and their relative frequencies in the index), rather than a specific string within some other field. I assume you have read the docs at http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html. There are a number of subtleties you need to understand such as stop words, boolean search, query expansion, etc. The comments on these pages are very good as they have the accumulated knowledge of people who have been there before and experimented.
It is also worth reading about tf-idf, which is how MySQL (and many other full-text search engines) work internally; see the docs. It basically ranks a search result according to a combination of how rare a word is across all documents and how many times it occurs in a particular document.
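To make the ranking idea concrete, here is a minimal tf-idf sketch in Python. It is not MySQL's actual formula (the engine's weighting differs in detail), and every sample description except the first is made up for illustration. The function name `tf_idf_scores` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(documents, query):
    """Return one tf-idf score per document for a whitespace-tokenised query.

    Textbook weighting, used here only as an illustration:
    tf = occurrences of the term / words in the document,
    idf = log(number of documents / documents containing the term).
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    scores = []
    for words in tokenized:
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term at all.
            df = sum(1 for other in tokenized if term in other)
            if df == 0 or not words:
                continue
            tf = counts[term] / len(words)
            idf = math.log(n_docs / df)
            score += tf * idf
        scores.append(score)
    return scores


# The first description comes from the question; the others are assumed.
docs = [
    "Coles Menai Syd Au",
    "Telstra Sydney Au",
    "Bp Connect Menai Au",
    "Coles Coles Express Au",
]
scores = tf_idf_scores(docs, "coles")
```

A word that occurs in every document (here "au") gets an idf of zero, so it contributes nothing to the ranking, while a document that repeats a rare word scores higher.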
I can't give you any more focused examples or performance metrics, but your question is essentially asking whether full-text search will outperform a double-wildcard LIKE search, to which the answer is a pretty much unqualified yes.
CAVEAT: Always worth mentioning, given the differences between engines: before MySQL version 5.6, full-text search only works with MyISAM; from 5.6 onwards it works with InnoDB too.

Related

Can you use "knight moves" to query a single database and join it to itself?

I am working on a project that involves code in both Prolog and SQL to solve the same problem. A problem I've run across is I can't use a single database to form a hierarchy. In this list of prolog facts you can see that the "assembly" parts are related to each other.
basicpart(spoke).
basicpart(rearframe).
basicpart(handles).
basicpart(gears).
basicpart(bolt).
basicpart(nut).
basicpart(fork).
assembly(bike,[wheel,wheel,frame]).
assembly(wheel,[spoke,rim,hub]).
assembly(frame,[rearframe,frontframe]).
assembly(frontframe,[fork,handles]).
assembly(hub,[gears,axle]).
assembly(axle,[bolt,nut]).
If I put all of these "assembly" definitions into one SQL database, can I use knight moves (joining a table to itself on 2 different columns in it) to build this hierarchy in SQL in only 2 tables?
If I understand the question correctly (I'm not familiar with the term "knight moves"), you cannot construct your bike with just one ordinary query. In fact, you can, but it must be a recursive SQL query, because you will be computing the transitive closure of the part-subpart relationship.
Unfortunately I don't immediately know how to write these. SQL syntax is frankly abysmal and recursive SQL looks even abysmaller, so below is example code using a loop instead.
You actually need only one table to represent the data as the basicpart/1 relation does not bring anything to the table, except label certain "things" as basic. But these are also the things that do not appear in assembly/2 in the first position.
Notes:
Not using ENUMS which are not really "types" in MySQL/MariaDB but just a constraint on a field of a specific table. (Like, WTF!)
The multiset representation of the Prolog code ("a bike has two wheels") is flattened into multiple rows separately identified by a numeric surrogate id. This is due to the "First Normal Form" dogma of RDBMS practice. There is a priori nothing wrong with having multisets as values, if the query language and the RDBMS engine can support it. For example, you can have XML values in PostgreSQL complete with queries over its content, as I remember (see note 1).
DELIMITER //
DROP PROCEDURE IF EXISTS prepare;
CREATE PROCEDURE prepare()
BEGIN
   DROP TABLE IF EXISTS assembly;
   CREATE TABLE assembly
      (id INT AUTO_INCREMENT KEY, -- surrogate key because a bike may have several wheels
       part VARCHAR(10) NOT NULL,
       subpart VARCHAR(10) NOT NULL);
   INSERT INTO assembly(part,subpart) VALUES
      ("bike","wheel"),
      ("bike","wheel"),
      ("bike","frame"),
      ("wheel","spoke"),
      ("wheel","rim"),
      ("wheel","hub"),
      ("frame","rearframe"),
      ("frame","frontframe"),
      ("frontframe","fork"),
      ("frontframe","handles"),
      ("hub","gears"),
      ("hub","axle"),
      ("axle","bolt"),
      ("axle","nut");
END;
DROP PROCEDURE IF EXISTS compute_transitive_closure;
CREATE PROCEDURE compute_transitive_closure()
BEGIN
   DROP TABLE IF EXISTS pieces;
   CREATE TABLE pieces
      (id INT AUTO_INCREMENT KEY,
       part VARCHAR(10) NOT NULL,
       subpart VARCHAR(10) NOT NULL,
       path VARCHAR(500) NOT NULL DEFAULT "",
       depth INT NOT NULL DEFAULT 0);
   -- seed row: the bike itself, at depth 0
   INSERT INTO pieces(part,subpart,path,depth) VALUES
      ("ROOT","bike","/bike",0);
   SET @depth=0;
   l: LOOP
      -- add the subparts of everything found at the current depth
      INSERT INTO pieces(part,subpart,path,depth)
         SELECT
            p.subpart,
            a.subpart,
            CONCAT(p.path,'/',a.subpart),
            @depth+1
         FROM
            pieces p,
            assembly a
         WHERE
            p.depth = @depth AND p.subpart = a.part;
      -- stop as soon as a pass adds no new rows
      IF ROW_COUNT() <= 0 THEN
         LEAVE l;
      ELSE
         SELECT * FROM pieces;
      END IF;
      SET @depth=@depth+1;
   END LOOP;
END; //
DELIMITER ;
Put the above into a file SQL.txt, and then, in a database testme:
MariaDB [testme]> source SQL.txt;
MariaDB [testme]> CALL prepare;
MariaDB [testme]> CALL compute_transitive_closure;
Then after 4 passes through the loop, you get:
+----+------------+------------+--------------------------------+-------+
| id | part | subpart | path | depth |
+----+------------+------------+--------------------------------+-------+
| 1 | ROOT | bike | /bike | 0 |
| 2 | bike | wheel | /bike/wheel | 1 |
| 3 | bike | wheel | /bike/wheel | 1 |
| 4 | bike | frame | /bike/frame | 1 |
| 5 | wheel | spoke | /bike/wheel/spoke | 2 |
| 6 | wheel | spoke | /bike/wheel/spoke | 2 |
| 7 | wheel | rim | /bike/wheel/rim | 2 |
| 8 | wheel | rim | /bike/wheel/rim | 2 |
| 9 | wheel | hub | /bike/wheel/hub | 2 |
| 10 | wheel | hub | /bike/wheel/hub | 2 |
| 11 | frame | rearframe | /bike/frame/rearframe | 2 |
| 12 | frame | frontframe | /bike/frame/frontframe | 2 |
| 20 | frontframe | fork | /bike/frame/frontframe/fork | 3 |
| 21 | frontframe | handles | /bike/frame/frontframe/handles | 3 |
| 22 | hub | gears | /bike/wheel/hub/gears | 3 |
| 23 | hub | gears | /bike/wheel/hub/gears | 3 |
| 24 | hub | axle | /bike/wheel/hub/axle | 3 |
| 25 | hub | axle | /bike/wheel/hub/axle | 3 |
| 27 | axle | bolt | /bike/wheel/hub/axle/bolt | 4 |
| 28 | axle | nut | /bike/wheel/hub/axle/nut | 4 |
| 29 | axle | bolt | /bike/wheel/hub/axle/bolt | 4 |
| 30 | axle | nut | /bike/wheel/hub/axle/nut | 4 |
+----+------------+------------+--------------------------------+-------+
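For comparison, the recursive query mentioned earlier can be sketched with a recursive common table expression. This is a sketch under assumptions, not a tested MySQL script: it runs against SQLite through Python's `sqlite3` module so that it is self-contained. MySQL 8.0 and MariaDB 10.2 or later also accept `WITH RECURSIVE`, but there you would use CONCAT() instead of `||`, and you may need to CAST the seed path to a wide CHAR type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assembly ("
    " id INTEGER PRIMARY KEY,"  # surrogate key: a bike may have several wheels
    " part TEXT NOT NULL,"
    " subpart TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO assembly(part, subpart) VALUES (?, ?)",
    [
        ("bike", "wheel"), ("bike", "wheel"), ("bike", "frame"),
        ("wheel", "spoke"), ("wheel", "rim"), ("wheel", "hub"),
        ("frame", "rearframe"), ("frame", "frontframe"),
        ("frontframe", "fork"), ("frontframe", "handles"),
        ("hub", "gears"), ("hub", "axle"),
        ("axle", "bolt"), ("axle", "nut"),
    ],
)

# The seed row plays the role of the ("ROOT", "bike") insert in the loop version;
# the recursive member joins the rows found so far back to the assembly table.
CLOSURE_SQL = """
WITH RECURSIVE pieces(part, subpart, path, depth) AS (
    SELECT 'ROOT', 'bike', '/bike', 0
    UNION ALL
    SELECT p.subpart, a.subpart, p.path || '/' || a.subpart, p.depth + 1
    FROM pieces AS p
    JOIN assembly AS a ON a.part = p.subpart
)
SELECT part, subpart, path, depth FROM pieces ORDER BY depth, path
"""

pieces = conn.execute(CLOSURE_SQL).fetchall()
for row in pieces:
    print(row)
```

UNION ALL keeps duplicates on purpose, so the two wheels each contribute their own spokes, rims and hubs, just as in the loop version.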
1: This made me dig out "Database in Depth: Relational Theory for Practitioners", O'Reilly 2005, by Chris Date, an excellent introduction to the relational model. On page 30, Date considers "sets as values" (but does not consider "multisets"):
Second (and regardless of what you might think of my first argument),
the fact is that a set like {P2,P4,P5} is no more and no less
decomposable by the DBMS than a character string is. Like character
strings, sets do have some inner structure; as with characters
strings, however, it's convenient to ignore that structure for certain
purposes. In other words, if a character string is compatible with the
requirements of 1NF - that is, if character strings are atomic - then
sets must be, too. The real point I'm getting at here is that the
notion of atomicity has no absolute meaning; it just depends on what
we want to do with the data. Sometimes we want to deal with an entire
set of part numbers as a single thing, and sometimes we want to deal
with individual part numbers within that set - but then we are
descending to a lower level of detail (a lower level of abstraction).

MySQL -> HTML Report, Styled like a Pivot Table

Ok, I'd like to start off by apologizing (profusely), since this seems to be a common question. Most of the examples seem to be somewhat similar, as well, but - for the life of me, I cannot wrap my brain around how to apply the myriad of quality responses to my specific table. And, I'm sure it's probably just the easiest thing in the world, what with all the very thorough responses/examples/links to resources with explanations/etc.
So, I suppose I'll just get right to it. The basics:
We host off-site copies of our clients' backups.
We need to know how much space they're using.
We are not at all consistent in Naming Convention, folder vs. disk per client, etc.
We need to automate a 'report', monthly, with data as follows:
-[C.Srv 01]---Size(GB)--Free(%)
Client 01 [Total] [AVG]
Server 01 109.43 25
Server 02 415.19 25
WHERE C.Srv = [Specified Cloud Server]
Clients Get a Total Size(GB) and an Average Free(%)
My MySQL table is this:
# Name DataType Length/Set Unsigned Allow NULL ZeroFill Default
1. ID INT 11 AUTO_INCREMENT
2. Client TEXT
3. Server TEXT
4. C.Srv TEXT
5. Size DECIMAL 10,2
6. Free DECIMAL 10,4
So, for Example, let's say I have this...
___ ________ ________ _________ _________ _______
ID | CLIENT | SERVER | C.SRV | SIZE | FREE
---|--------|--------|---------|---------|-------
1 | a | adc | cs_01 | 109.43 | 0.2504
2 | a | asql | cs_01 | 415.19 | 0.2504
3 | b | bdc | cs_01 | 583.91 | 0.1930
4 | b | bdev | cs_01 | 316.52 | 0.1930
5 | b | bsql | cs_01 | 1259.56 | 0.1930
6 | c | cdc | cs_01 | 355.30 | 0.7631
7 | d | ddc | cs_01 | 398.21 | 0.5808
Is it possible to get something pretty, in HTML (preferably), that has the basic structure of this...
_______ __________ ________
CS_01 | Size(GB) | Free(%)
-------|----------|--------
-a | 524.62 | 25.04%
-------|----------|--------
adc | 109.43 | 25.04%
asql | 415.19 | 25.04%
-b | 2159.99 | 19.30%
-------|----------|--------
bdc | 583.91 | 19.30%
bdev | 316.52 | 19.30%
bsql | 1259.56 | 19.30%
+c | 355.30 | 76.31%
-------|----------|--------
+d | 398.21 | 58.08%
_______|__________|________
Or, am I just S.O.L.? Format, I can mess with in CSS, or whatever (I hope), just so long as it's in that basic structure. (I don't know if it matters, but the final goal will be to collapse at the Client Level; in case that somehow factors into the approach/data-gathering.)
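One possible way to get that structure, sketched in Python under stated assumptions: the table name `backups` and the column name `c_srv` are made up (the question does not name the table, and `C.Srv` is awkward as an identifier), and SQLite stands in for MySQL so the example is self-contained. The per-client row comes from a GROUP BY with SUM and AVG, and the per-server rows are fetched for each client.

```python
import sqlite3
from html import escape

# Sample rows copied from the question; table and column names are assumptions.
ROWS = [
    ("a", "adc", "cs_01", 109.43, 0.2504),
    ("a", "asql", "cs_01", 415.19, 0.2504),
    ("b", "bdc", "cs_01", 583.91, 0.1930),
    ("b", "bdev", "cs_01", 316.52, 0.1930),
    ("b", "bsql", "cs_01", 1259.56, 0.1930),
    ("c", "cdc", "cs_01", 355.30, 0.7631),
    ("d", "ddc", "cs_01", 398.21, 0.5808),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE backups (client TEXT, server TEXT, c_srv TEXT, size REAL, free REAL)"
)
conn.executemany("INSERT INTO backups VALUES (?, ?, ?, ?, ?)", ROWS)


def table_row(css_class, name, size, free):
    """Format one report row; `free` is a fraction and is shown as a percentage."""
    return (
        f'<tr class="{css_class}"><td>{escape(name)}</td>'
        f"<td>{size:.2f}</td><td>{free * 100:.2f}%</td></tr>"
    )


def report_html(conn, cloud_server):
    """Build an HTML table: one summary row per client, then its servers."""
    lines = [
        "<table>",
        f"<tr><th>{escape(cloud_server)}</th><th>Size(GB)</th><th>Free(%)</th></tr>",
    ]
    clients = conn.execute(
        "SELECT client, SUM(size), AVG(free) FROM backups "
        "WHERE c_srv = ? GROUP BY client ORDER BY client",
        (cloud_server,),
    ).fetchall()
    for client, total_size, avg_free in clients:
        lines.append(table_row("client", client, total_size, avg_free))
        servers = conn.execute(
            "SELECT server, size, free FROM backups "
            "WHERE c_srv = ? AND client = ? ORDER BY server",
            (cloud_server, client),
        )
        for server, size, free in servers:
            lines.append(table_row("server", server, size, free))
    lines.append("</table>")
    return "\n".join(lines)


html_report = report_html(conn, "cs_01")
print(html_report)
```

The `client` and `server` CSS classes are there so that styling and collapsing at the client level can be handled in CSS or JavaScript afterwards.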

How to save language skill levels correctly in a database

I think I am facing a problem that many of you have faced before. I have a registration form where a user can pick any language on the planet and then pick their skill level for that language from a select box.
So, for example:
Language1: German
Skill: Fluent
Language2: English
Skill: Basic
I'm wondering what the best way is to store these values in a MySQL database.
I thought of two ways.
First way: creating a column for each language and assigning a skill value to it.
--------------------------------------------------
| UserID | language_en | language_ge |
--------------------------------------------------
| 22 | 1 | 4 |
--------------------------------------------------
| 23 | 3 | 4 |
--------------------------------------------------
So the language is always the column's name and the number represents the skill level (1. Basic, 2. Average ... )
I believe this is a nice way to work with these things and it is also pretty fast. The problem starts when there are 50 languages or more. It doesn't sound like a good idea to make 50 columns where the script always has to check them all to see whether a user has any skill in that language.
Second way: inserting an array in one of the table's column. The table will look like this:
----------------------------------
| UserID | languages |
----------------------------------
| 22 | "ge"=>"4", "en"=>"1" |
----------------------------------
This way the user with ID 22 has skill level 4 for German and skill level 1 for English. This is fine because we don't need to check 50 additional columns (or even more), but it's not the right way in my eyes either.
We would have to parse a lot of results to find a user who, for example, has level 1 for German and level 2 for Spanish, regardless of their English skill level. That will take the server longer, and when the data grows we are in trouble.
I bet many of you have experienced this kind of issue. Please, can someone advise me how to sort this out?
Thanks a lot.
I'd advise you to have a separate table with all the languages:
Table: Language
+------------+-------------------+--------------+
| LanguageID | LanguageNameShort | LanguageName |
+------------+-------------------+--------------+
| 1 | en | English |
| 2 | de | German |
+------------+-------------------+--------------+
And another table to link the users to the languages:
Table: LanguageLink
+--------+------------+--------------+
| UserID | LanguageID | SkillLevelID |
+--------+------------+--------------+
| 22 | 1 | 1 |
| 22 | 2 | 4 |
| 23 | 1 | 3 |
| 23 | 2 | 4 |
+--------+------------+--------------+
This is the normalised way to represent this kind of relationship in a DB. All data is easily searchable and you don't have to change the DB schema if you add a language.
To render a user's languages you could use a query like this. It will give you one row per language a user speaks:
SELECT
LanguageLink.UserID,
LanguageLink.SkillLevelID,
Language.LanguageNameShort
FROM
LanguageLink,
Language
WHERE
LanguageLink.UserID = 22
AND LanguageLink.LanguageID = Language.LanguageID
If you want to go further, you could create another table for the skill level:
Table: Skill
+--------------+-----------+
| SkillLevelID | SkillName |
+--------------+-----------+
| 1 | bad |
| 2 | mediocre |
| 3 | good |
| 4 | perfect |
+--------------+-----------+
What I've done here is called Database normalization. I'd recommend reading about it, it may help you design further databases.
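As an illustration of why this layout is easy to search, here is a small sketch of the kind of lookup described in the question (users with given levels in some languages, ignoring all other languages). It uses SQLite through Python's `sqlite3` module as a stand-in for MySQL; the helper name `users_with_skills`, the composite primary key and the foreign key declarations are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Language (
    LanguageID INTEGER PRIMARY KEY,
    LanguageNameShort TEXT NOT NULL UNIQUE,
    LanguageName TEXT NOT NULL
);
CREATE TABLE Skill (
    SkillLevelID INTEGER PRIMARY KEY,
    SkillName TEXT NOT NULL
);
CREATE TABLE LanguageLink (
    UserID INTEGER NOT NULL,
    LanguageID INTEGER NOT NULL REFERENCES Language(LanguageID),
    SkillLevelID INTEGER NOT NULL REFERENCES Skill(SkillLevelID),
    PRIMARY KEY (UserID, LanguageID)  -- one skill level per user and language
);
INSERT INTO Language VALUES (1, 'en', 'English'), (2, 'de', 'German');
INSERT INTO Skill VALUES (1, 'bad'), (2, 'mediocre'), (3, 'good'), (4, 'perfect');
INSERT INTO LanguageLink VALUES (22, 1, 1), (22, 2, 4), (23, 1, 3), (23, 2, 4);
""")


def users_with_skills(conn, wanted):
    """Return the IDs of users who have exactly the given level in every listed language.

    `wanted` maps a short language code to a skill level, e.g. {"de": 4, "en": 3}.
    Languages that are not listed are ignored.
    """
    if not wanted:
        return []
    conditions = " OR ".join(
        "(l.LanguageNameShort = ? AND ll.SkillLevelID = ?)" for _ in wanted
    )
    params = [value for pair in wanted.items() for value in pair]
    sql = f"""
        SELECT ll.UserID
        FROM LanguageLink AS ll
        JOIN Language AS l ON l.LanguageID = ll.LanguageID
        WHERE {conditions}
        GROUP BY ll.UserID
        HAVING COUNT(*) = ?   -- every requested language must have matched
        ORDER BY ll.UserID
    """
    return [row[0] for row in conn.execute(sql, params + [len(wanted)])]


print(users_with_skills(conn, {"de": 4, "en": 3}))
```

Adding a language is then just one more row in Language; neither the schema nor the query text has to change.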

Convert Mysql Query to Rails ActiveRecord Query Without using find_by_sql

I have table named questions like follows
+----+---------------------------------------------------------+----------+
| id | title | category |
+----+---------------------------------------------------------+----------+
| 89 | Tinker or work with your hands? | 2 |
| 54 | Sketch, draw, paint? | 3 |
| 53 | Express yourself clearly? | 4 |
| 77 | Keep accurate records? | 6 |
| 32 | Efficient? | 6 |
| 52 | Make original crafts, dinners, school or work projects? | 3 |
| 70 | Be elected to office or make your opinions heard? | 5 |
| 78 | Take photographs? | 3 |
| 84 | Start your own political campaign? | 5 |
| 9 | Free spirit or a rebel? | 3 |
| 38 | Lead a group? | 5 |
| 71 | Work in groups? | 4 |
| 2 | Helpful? | 4 |
| 4 | Mechanical? | 6 |
| 14 | Responsible? | 6 |
| 66 | Pitch a tent, an idea? | 1 |
| 62 | Write useful business letters? | 5 |
| 28 | Creative? | 3 |
| 68 | Perform experiments? | 2 |
| 10 | Like to figure things out? | 2 |
+----+---------------------------------------------------------+----------+
I have a SQL query to get one random record from each category. Can anyone convert the MySQL query to a Rails ActiveRecord query (without using Question.find_by_sql)? This MySQL query is working absolutely fine, but I need an ActiveRecord query because of my dependency in further steps.
Here is the MySQL query:
SELECT t.id, title as question, category
FROM
(
    SELECT
    (
        SELECT id
        FROM questions
        WHERE category = t.category
        ORDER BY RAND()
        LIMIT 1
    ) id
    FROM questions t
    GROUP BY category
) q JOIN questions t
ON q.id = t.id
Thank You for your consideration!
When things get crazy, one has to reach for Arel:
It is intended to be a framework framework; that is, you can build
your own ORM with it, focusing on innovative object and collection
modeling as opposed to database compatibility and query generation.
So what we want to do is let Arel create the query for us. The approach used is as follows: the questions table is left joined with a randomized version of itself:
q_normal = Arel::Table.new("questions")
q_random = Arel::Table.new("questions").project(Arel.sql("*")).order("RAND()").as("q2")
Time to left join
query = q_normal.join(q_random, Arel::Nodes::OuterJoin).on(q_normal[:category].eq(q_random[:category])).group(q_normal[:category]).order(q_random[:category])
Now you can select whichever columns you want using project, e.g.:
query.project(q_normal[:id])
The only way I can think of to do this requires a good bit of application code. I don't think there's a way of accessing the RAND() functionality in MySQL (or equivalent in other DB technologies) using ActiveRecord. Here's what I came up with:
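# Number of questions per category (the column in the question's table is `category`)
counts = Question.group(:category).count(:id)
# Pick a random row offset within each category
offsets = {}
counts.each do |category, count|
  offsets[category] = rand(count)
end
# Fetch the row at that offset for each category
random_questions = []
offsets.each do |category, offset|
  random_questions.push(Question.where(:category => category).offset(offset).first)
end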

MySQL/RDBMS: Is it okay to index long strings? Will it do the job?

Let's suppose I have a table of books:
+------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| title | tinytext | YES | | NULL | |
| synopsis | synopsis | YES | | NULL | |
| year | int(4) | YES | | NULL | |
| ISBN | varchar(13) | YES | | NULL | |
| category | tinytext | YES | | NULL | |
| author | tinytext | YES | | NULL | |
| theme | tinytext | YES | | NULL | |
| edition | int(2) | YES | | NULL | |
| search | text | YES | | NULL | |
+------------+---------------------+------+-----+---------+----------------+
In this example, I'm using search column as a summary of the table. So, a possible record would be like the following:
+------------+-------------------------------------------------------------+
| Field | Value |
+------------+-------------------------------------------------------------+
| id | 1 |
| title | Awesome Book |
| synopsis | This is a cool book with a cool history |
| year | 2013 |
| ISBN | 1234567890123 |
| category | Horror |
| author | John Doe |
| theme | Programmer goes insane |
| edition | 2nd |
| search | 2013 horror john doe awesome book this is a cool book (...) |
+------------+-------------------------------------------------------------+
This column search will be the one scanned when a search is made. Notice that it has all the words of other fields, in lower case, and possibly some extra words to help on a search.
I have two questions about it:
1) Knowing that this column is a text field and can get really big, is it okay to index it? Will it improve the performance as expected? Why?
2) Despite the index, is it a good idea to use this method to search or is it better to try every column on my query? How can I improve it?
OBS: I don't really have this table, it's just for example purposes. Please ignore any error in datatypes or syntax I may have done.
1) Knowing that this column is a text field and can get really big, is
it okay to index it? Will it improve the performance as expected? Why?
Yes, you can index it, but no, it won't improve performance. An index on string-type columns only helps when the query matches the start of the column - so in your case, someone searching '2013 horror john' would hit the index, but someone searching 'horror john 2013' would not.
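The reason can be sketched with a sorted list, which behaves roughly like an ordered index: a prefix can be located by binary search, while a term in the middle of the string can only be found by looking at every entry. This is an analogy in Python, not how MySQL implements its indexes; the sample values other than the first are made up, and the `"\uffff"` upper bound assumes the data contains no characters at or above that code point.

```python
import bisect


def prefix_search(sorted_values, prefix):
    """Find values starting with `prefix` using binary search on a sorted list."""
    lo = bisect.bisect_left(sorted_values, prefix)
    hi = bisect.bisect_left(sorted_values, prefix + "\uffff")
    return sorted_values[lo:hi]


def infix_search(values, needle):
    """Find values containing `needle` anywhere; this has to touch every value."""
    return [value for value in values if needle in value]


search_column = sorted([
    "2013 horror john doe awesome book",
    "2012 comedy jane roe funny book",    # assumed sample row
    "2013 drama john smith sad book",     # assumed sample row
])

print(prefix_search(search_column, "2013 horror john"))  # found via the ordering
print(prefix_search(search_column, "horror john 2013"))  # nothing starts with this
print(infix_search(search_column, "john"))               # full scan
```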
2) Despite the index, is it a good idea to use this method to search
or is it better to try every column on my query? How can I improve it?
As Gordon Linoff writes, the best solution is probably full-text searching: it is blazingly fast for text searches, deals with "fuzzy" matching, and generally allows you to write a search function similar to the way Google works.
Indexing the search column is not helpful.
What you may want is full text search capabilities on the column, which you can read about here.
Which you use for search depends on whether the searches will be using context. If someone searches for "Clinton", do you want them to restrict the search to authors named "Clinton" or to books about "Clinton"? If you don't care about the context, then full text on one field is quite reasonable.
I need to add: you don't need to put all the search terms in a separate field to use full text search. You can create a full text index on multiple columns. This gives you the flexibility of using full text searches with context (by looking only in specific columns) or without context (by looking in all of them). Your question was about the search column in particular, but that is not the best way to implement the functionality that you are looking for.
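The idea of searching with or without context can be sketched with a toy inverted index that remembers which column each word came from. This is only an illustration in Python of the concept, not MySQL's full-text implementation; the second book row and the function names `build_index` and `search` are made up for the example.

```python
from collections import defaultdict

# Row 1 follows the example record in the question; row 2 is an assumed extra row.
books = {
    1: {"title": "Awesome Book", "author": "John Doe",
        "category": "Horror", "theme": "Programmer goes insane"},
    2: {"title": "Doe Valley", "author": "Jane Roe",
        "category": "Drama", "theme": "Life on a farm"},
}


def build_index(rows):
    """Map each lower-cased word to the set of (row id, column) pairs it occurs in."""
    index = defaultdict(set)
    for row_id, columns in rows.items():
        for column, text in columns.items():
            for word in text.lower().split():
                index[word].add((row_id, column))
    return index


def search(index, word, columns=None):
    """Return matching row ids; restrict to `columns` to search with context."""
    hits = index.get(word.lower(), set())
    return sorted(
        {row_id for row_id, column in hits if columns is None or column in columns}
    )


index = build_index(books)
print(search(index, "doe"))                      # without context: any column
print(search(index, "doe", columns={"author"}))  # with context: authors only
```

Because the column is stored next to each word, no separate summary column is needed to support both kinds of search.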