Storing and searching similar phrases

Let's say I want to store an item in my database like "visit Spain". I'm going to allow user submissions, and I'd like to keep track of all the users who wish to visit Spain, but I'd also like them to be able to type "Visit Spain" as well as "Go to Spain", "See Spain", or "tour spain".
I am looking for an efficient way to do this. Currently my thinking has me going along these lines (simplified):
Nouns
    uniqueId
    noun
    verb [fk]
Verbs
    uniqueId
    verb
synonyms
    uniqueId
    verb [fk]
    synonym
Am I off base, or is this the best way to be going about it? I'm looking for both performance and ease of maintenance...

You should look into some simple natural language processing (NLP).
Ideally, you need to normalize the input so that you can search for users that have the same normalized values.
First tokenize the input, separating the words. "Visit Spain" would become ("Visit", "Spain").
Look for single words that have equivalences. For example, you can ignore case for many things.
Use table lookup to find more advanced single-word equivalences, such as "Visit" => "Tour" and "See" => "Tour". Using this, ("Visit", "Spain") and ("See", "Spain") would both be translated to ("Tour", "Spain").
Look for phrase equivalences. For example, "go to" => "Visit". Combined with the single-word lookup, this would make ("Go", "to", "Spain") become ("Tour", "Spain").
Apply pattern matching. For example, with the rule ("Tour" X "and" Y) => ("Tour" X), ("Tour" Y), the input ("Tour", "Spain", "and", "France") could become two separate items, ("Tour", "Spain") and ("Tour", "France").
When you have applied all of your transformations, store the resulting normalized items.
Your work is in defining classes of translations, finding many instances of those translations, and then applying them to your input.
Once you have a normalized item, you can search for other users that have the same normalized item.
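The steps above can be sketched in a few lines of Python. This is only a minimal illustration: the lookup tables `PHRASES` and `WORDS` and all function names are hypothetical, a real system would load its equivalences from the database, and case is folded to lower case as the simplest form of "ignore case".

```python
import re

# Hypothetical lookup tables; a real system would load these from the database.
PHRASES = {("go", "to"): ("visit",)}      # phrase equivalences
WORDS = {"visit": "tour", "see": "tour"}  # single-word equivalences


def tokenize(text):
    """Fold case and split the input into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def replace_phrases(tokens):
    """Replace known multi-word phrases, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        for phrase, replacement in PHRASES.items():
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.extend(replacement)
                i += len(phrase)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def split_conjunctions(tokens):
    """Apply the pattern (verb X "and" Y) => (verb X), (verb Y)."""
    if not tokens:
        return []
    verb, groups = tokens[0], [[]]
    for tok in tokens[1:]:
        if tok == "and":
            groups.append([])
        else:
            groups[-1].append(tok)
    return [(verb, *group) for group in groups if group]


def normalize(text):
    """Return the list of normalized items to store for one submission."""
    tokens = replace_phrases(tokenize(text))
    tokens = [WORDS.get(tok, tok) for tok in tokens]
    return split_conjunctions(tokens)


print(normalize("Go to Spain"))
print(normalize("Visit Spain and France"))
```

Storing the resulting tuples (for example as a joined string in an indexed column) would let an equality lookup find every user with the same normalized item.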

Related

When creating a file format. Is it better to create multiple formats or have several optional sections?

Let's say I am the technical lead for a software company specializing in writing applications for chefs that help organize recipes. Starting out, we are developing one app for bakers who make cakes and another for sushi chefs. One of the requirements is to create a standard file format for importing and exporting recipes. (This file format will go on to be an industry standard, with other companies using it to interface with our products.) We are faced with two options: make a standard recipe format (let's say .recipe) which uses common properties where applicable and optional properties where they differ, or make independent formats for each application (let us say .sushi and .cake).
Imagine the file format would look something like this for sushi:
{
  "name": "Big California",
  "type": "sushi",
  "spiciness": 0,
  "ingredients": [
    {
      "name": "rice",
      "amount": 20.0,
      "units": "ounces"
    },
    {
      ...
    }
  ]
}
and imagine the file format would look something like this for cakes:
{
  "name": "Wedding Cake",
  "type": "cake",
  "layers": 3,
  "ingredients": [
    {
      "name": "flour",
      "amount": 40.0,
      "units": "ounces"
    },
    {
      ...
    }
  ]
}
Notice the file formats are very similar, with only the spiciness and layers properties differing between them. Undoubtedly, as the applications grow in complexity and sophistication, many more specialized properties will be added. There will also be more applications added to the suite for other types of chefs.
With this context, is it wiser to have each application read/write .recipe files that adhere to a somewhat standardized interface, or is it wiser to remove all interdependence and have each application read/write its respective .sushi or .cake file type?

This kind of thing can be very, very hard to get right. I think a lot of it depends on what you want to be able to do with recipes beyond simply displaying them for use by a chef.
For example, does one want to normalise data across the whole system? That is, when a recipe specifies "flour", what do you want to say about that flour, and how standardised do you want that to be? Imagine a chef is preparing an entire menu, and wants to know how much "plain white high gluten zero additives flour" is used by all the recipes in that menu. They might want to know this so they know how much to buy. There's actually quite a lot you can say about just flour that means simply having "flour" as a data item in a recipe may not go far enough.
The "modern" way of going about these things is to simply have plain text fields, and rely on some kind of flexible search to make equivalency associations between fields like "flour, white, plain, strong" and "high gluten white flour". That's what Google does...
The "proper" way to do it is to come up with a rigid schema that allows "flour" to be fully specified. It's going to be hard to come up with a file / database format schema that can exhaustively and unambiguously describe every single possible aspect of "flour". And if you take it too far, then you have the problem of making associations between two different records of "flour" that, for all reasonable purposes are identical, but differ in some minor aspect. Suppose you had a field for particle size; the search would have to be clever enough to realise that there's no real difference in flours that differ by, for example, 0.5 micrometer in average particle size.
We've discussed the possible extent of the definition of a "flour". One also has to consider the method by which the ingredient is prepared. That adds a whole new dimension of difficulty. And then one would have to describe all the conceivable kitchen utensils too. One can see the attractions of free text...
With that in mind, I would aim to have a single file format (.recipe), but not to break down the data too much. I would forget about trying to categorise each and every ingredient down to the last possible level of detail. Instead, for each ingredient I'd have a free text description, then perhaps a well-structured quantity field (e.g. a number and a unit, 1 and cup), and finally a piece of free text describing the ingredient preparation (e.g. sieved). Then I'd have something that describes a preparation step, referencing the ingredients; that would have some free text fields and structured fields ("sieve the", [ingredient reference], "into a bowl"). The file will contain a list of these. You might also have a list of required utensils, and a general description field too. You'll be wanting to add structured fields for recipe metadata (e.g. 'cake', or 'sushi', or serves 2).
Or something like that. Having some structure allows some additional functionality to be implemented (e.g. tidy layout of the recipe on a screen). Just having a single free-text field for the whole thing means that it'd be difficult to add, say, an ingredient ordering feature - who is to say what lines in the text represent ingredients?
Having separate file formats would involve coming up with a large number of schemas. It would be even more unmanageable.
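As a rough sketch of what such a single .recipe format might look like in code, here is one possible Python model. All class and field names are hypothetical (in particular the `extras` dictionary for type-specific properties such as layers or spiciness is an assumption, not part of the question), and JSON is used only because the question's examples look like JSON.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Ingredient:
    description: str       # free text, e.g. "high gluten white flour"
    amount: float          # structured quantity ...
    units: str             # ... and its unit, e.g. "ounces"
    preparation: str = ""  # free text, e.g. "sieved"


@dataclass
class Recipe:
    name: str
    type: str                                         # metadata: "cake", "sushi", ...
    ingredients: list = field(default_factory=list)   # list of Ingredient
    steps: list = field(default_factory=list)         # free-text preparation steps
    extras: dict = field(default_factory=dict)        # optional, type-specific properties


def to_recipe_json(recipe):
    """Serialize a Recipe to the text that would be stored in a .recipe file."""
    return json.dumps(asdict(recipe), indent=2)


def from_recipe_json(text):
    """Parse .recipe text back into a Recipe, rebuilding the Ingredient objects."""
    data = json.loads(text)
    data["ingredients"] = [Ingredient(**item) for item in data.get("ingredients", [])]
    return Recipe(**data)


cake = Recipe(
    name="Wedding Cake",
    type="cake",
    ingredients=[Ingredient("flour", 40.0, "ounces", "sieved")],
    steps=["sieve the flour into a bowl"],
    extras={"layers": 3},
)
print(to_recipe_json(cake))
```

With this shape, an application that does not know about `layers` can still read, display, and re-export a cake recipe, which is the main argument for one shared format.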

How to extract needed information from text?

I have a lot of publications from which I want to parse and extract needed and useful information.
Suppose I have this publication A
2 places available tomorrow at 12AM from California to Alaska. Cost is 100$. And this is my phone number 814141243.
Another one B
One place available to Texas. We will be leaving at 13PM today. Cost will be discussed. Tel: 2323575456.
I want to find the best way to extract data from these publications using an algorithm with linear complexity.
For each publication, the algorithm must produce this:
{ "publication": [
{ "id":"A",
"date":"26/01/2016",
"time":"12AM",
"from":"California",
"to":"Alaska",
"cost":"100$",
"nbrOfPlaces":"2",
"tel":"814141243" },
{ "id":"B",
"date":"25/01/2016",
"time":"13PM",
"from":"",
"to":"Texas",
"cost":"",
"nbrOfPlaces":"1",
"tel":"2323575456" }
]
}
So I want to get as much information as possible from those publications. But obviously the problem is with the words chosen by the writer of the publication and how they are structured. Simply put, publications don't have a common structure, so I can't easily parse them and extract the needed information.
Are there any concepts or paradigms that deal with this kind of problem?
Note: I can't force publications' writers to respect a precise structure for the text.

It seems all the comments are discouraging you from trying to do this. However, the variation in the text seems quite limited; I can see a simple algorithm finding the info in most (but obviously not all) input. I'd try something like this:
Split the text into parts on punctuation (.;?!() and so on), then look at the text part by part; this will help determine context.
Use a list of often-used words and abbreviations to determine where each bit of info is located.
Date: look for the names of days or months, "today", "tomorrow" or typical notations of dates like "12/31".
Time: look for combinations with "AM", "PM", "morning", "noon" etc., or typical time notations like "12:30"
Route: look for "from" and "to", possibly combined with "going", "driving", "traveling" etc. and maybe look for capital letters to find the place names (and/or use a list of often-used destinations).
Cost: look for a line that contains "$" or "cost" or "price" or similar, and find the number, or typical "to be discussed" or "to be determined" phrasing.
Places: look for "places", "seats", "people" and find the number, or "place", "seat" or "person" and conclude there is 1 place.
Phone: look for a sequence of digits of a certain length, with maybe spaces or ./() between them.
If you're certain that you've found a part of the info, mark it so that it isn't used again; e.g. if you find "8.30" together with "AM", it's obviously a time. However, if you just find "8.30" it could be a date or a time, or even $8.30.
You'll have to take into account that a small percentage of input will never be machine-readable; something like "off to the big apple at the crak-o-dawn, wanna come with? you pay the gas-moh-nay!" will always need human interpretation.
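A minimal Python sketch of this keyword-and-pattern approach is shown below. The regular expressions, the `NUMBER_WORDS` table, and the function name `extract` are all illustrative assumptions tuned to the two sample publications; they are not a general solution. Each match is blanked out of the remaining text so it cannot be reused for another field, and relative dates ("today", "tomorrow") are returned as found rather than resolved to a calendar date.

```python
import re

# Hypothetical helper tables and patterns, tuned to the sample publications only.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4"}
PLACE = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"  # one or more capitalized words

# (field, pattern, flags). Order matters: the most certain patterns come first.
PATTERNS = [
    ("tel", r"\b(\d{7,})\b", 0),
    ("time", r"\b(\d{1,2}(?::\d{2})?\s?(?:AM|PM))\b", re.IGNORECASE),
    ("cost", r"(\d+(?:\.\d+)?\s?\$|\$\s?\d+(?:\.\d+)?)", 0),
    ("nbrOfPlaces",
     r"\b(\d+|" + "|".join(NUMBER_WORDS) + r")\s+(?:places?|seats?)\b",
     re.IGNORECASE),
    ("date", r"\b(today|tomorrow)\b", re.IGNORECASE),
    ("from", r"\bfrom\s+(" + PLACE + ")", 0),
    ("to", r"\bto\s+(" + PLACE + ")", 0),
]


def extract(text):
    """Return a dict of the fields found in one publication ('' if not found)."""
    info = {}
    remaining = text
    for name, pattern, flags in PATTERNS:
        match = re.search(pattern, remaining, flags)
        if match is None:
            info[name] = ""
            continue
        info[name] = match.group(1)
        # Blank out the matched span so later patterns cannot reuse it.
        remaining = remaining[:match.start()] + " " + remaining[match.end():]
    # Convert number words such as "One" to digits.
    places = info["nbrOfPlaces"]
    info["nbrOfPlaces"] = NUMBER_WORDS.get(places.lower(), places)
    return info


a = extract("2 places available tomorrow at 12AM from California to Alaska. "
            "Cost is 100$. And this is my phone number 814141243.")
b = extract("One place available to Texas. We will be leaving at 13PM today. "
            "Cost will be discussed. Tel: 2323575456.")
print(a)
print(b)
```

Each pattern is a single scan over the text, so the total work stays linear in the length of the publication for a fixed set of patterns.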

How to realize a context search based on synonyms?

Let's say an internet user searches for "trouble with gmail".
How can I return entries with "problem|problems|issues|issue|trouble|troubles with gmail|googlemail|google mail"?
I don't want to add these links between different keywords manually, so the links between "issue <> problem <> trouble" and "gmail <> googlemail <> google mail" are completely unknown. They should be found by an automated process.
Approach to solve the problem
I provide a synonyms/thesaurus platform like thesaurus.com, synonym.com, etc., or use a synonyms database/API, and use this user-generated input for my queries on a third website.
But this won't cover all synonyms like the "gmail"-example.
Which other options do I have? Maybe something based on the given data and logged search phrases of the past?

You have to think about it independently of the language.
When you show a baby the same thing using two different words, he understands that those words are synonyms. He might not understand perfectly at first, but he will learn as this is repeated.
You type "problem with gmail".
There are two cases:
Your search gives results, and you click on one item.
The system identifies that this item was already clicked before when searching for "google mail bug". That's a match, and we will call it a "relative search".
Your search gives poor results.
We search our history for a matching search.
We propose: "Did you mean trouble with yahoo mail? yes/no". You click no, so that's a "no match". We might then propose other suggestions, such as a list of known "relative searches" or a list of possibly related searches found by combining full-text search over our history with Levenshtein distance.
When a term has scored highly enough to be considered a "synonym", you can treat it as one. The algorithm might be wrong, but that depends on what you really expect.
If I search for "sending a message is difficult with google" and for "gmail issue", no words are synonyms, but the searches are essentially the same. This is more important to me than true synonyms.
And if you really want to get the synonyms, I would do it in a second phase by comparing the words inside "relative searches", and I would include a manual check.
I think Google's algorithm uses synonyms mainly to highlight search terms in the results page, not to do the actual search, for which it uses the relative search terms except in known situations, since the results for "gmail" and "google mail" are not the same.
But if you identify 10 relative searches for "gmail" which all contain "google mail", that is a good starting point for guessing they are synonyms.
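The click-based idea can be sketched as follows. This is a toy in-memory version under stated assumptions: the class name `SearchHistory`, its methods, and the scoring (number of shared clicked items) are all hypothetical, and a real system would persist the history and probably weight clicks by recency or frequency.

```python
from collections import defaultdict


class SearchHistory:
    """Toy in-memory log of which results were clicked for which query."""

    def __init__(self):
        self.clicks = defaultdict(set)  # query -> set of clicked item ids

    def record_click(self, query, item_id):
        self.clicks[query.lower()].add(item_id)

    def relative_searches(self, query, min_shared=1):
        """Return past queries that share clicked items, best match first."""
        query = query.lower()
        mine = self.clicks.get(query, set())
        scored = []
        for other, items in self.clicks.items():
            if other == query:
                continue
            shared = len(mine & items)
            if shared >= min_shared:
                scored.append((shared, other))
        # Highest number of shared clicks first, then alphabetical for stability.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [other for _, other in scored]


history = SearchHistory()
history.record_click("google mail bug", 42)
history.record_click("google mail bug", 7)
history.record_click("trouble with yahoo mail", 99)
history.record_click("problem with gmail", 42)
print(history.relative_searches("problem with gmail"))
```

Raising `min_shared` is one simple way to require that a pair of searches is "sufficiently scored" before treating them as related.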

This is a bit long for a comment.
What you are looking for is called a "thesaurus" or "synonyms" list in the world of text searching. Apparently, there is a proposal for such functionality in MySQL. It is not yet implemented. (Here is a related question on Stack Overflow, although the link in the question doesn't seem to work.)
The work-around would be to modify queries before sending them to the database. That is, parse the query into words, then look up all the synonyms for those words, and reconstruct the query. This works better for the natural language searches than the boolean searches (which require more careful reconstruction).
Pseudo-code for getting the final word list with synonyms would be something like:
-- Assumes @words holds the query words as a comma-separated list, e.g. 'trouble,gmail'
select @finalwords := group_concat(s.synonyms separator ' ')
from synonyms s
where find_in_set(s.baseword, @words) > 0;
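The same work-around (split the query into words, look up synonyms, rebuild the query) could also be done in application code before the query reaches the database. The sketch below is a minimal illustration; the `SYNONYMS` table and the function name `expand_query` are hypothetical, and in practice the table would be loaded from a synonyms table like the one in the pseudo-code.

```python
# Hypothetical synonym table; in practice this would come from the database.
SYNONYMS = {
    "trouble": ["problem", "issue"],
    "gmail": ["googlemail"],
}


def expand_query(query):
    """Return the query words followed by their synonyms, without duplicates."""
    words = []
    for word in query.lower().split():
        for candidate in [word, *SYNONYMS.get(word, [])]:
            if candidate not in words:
                words.append(candidate)
    return " ".join(words)


print(expand_query("trouble with gmail"))
```

The expanded string is suited to a natural language full-text search, where extra words simply widen the match; a boolean search would need the synonyms grouped per original word instead.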

Seems to me that you have two problems on your hands:
Lemmatisation, which reduces words to their lemma, sometimes called the headword or root word. This is more difficult than stemming, as it doesn't just chop suffixes off words, but tries to find a true root, e.g. "are" => "be". This is something that is often done programmatically, although it appears to be a complex task. Here is an online example of text being lemmatized: http://lemmatise.ijs.si/Services
Searching for synonymous lemmas. This is a very complex problem. One approach to this that I have heard of is modifying the lemmatisation engine to return more than one lemma for a given set of words, i.e. "problems" => "problem" and "issue", thereby allowing a more flexible set of results. However, this means that the synonymous lemmas must be provided to the lemmatisation engine from elsewhere. I truly have no idea how you would build a list of synonyms programmatically.
So, you may consider a strategy whereby you lemmatise the text to be searched for, then pass each lemma out to your synonym finder (however that works) to get a final list of lemmas to perform your search with.
I think you have bitten off a very large problem for yourself.

If the system in question is a publicly accessible website, one 'out there' option is to ensure all content can be crawled by Google and then use a Google search on your own site, which should give you the synonym capability 'for free'. There would obviously be some vagaries in the results though and lag in getting match results for newly created content, depending upon how regularly the crawlers hit the site. Probably not suitable in your use case, but for some people, this may be sufficient.

Seeing your revised question, what about using a public API?
http://www.programmableweb.com/category/reference/apis?category=20066&keyword=synonym

Should query languages have priority of operator OR higher than priority of AND?

Traditionally, most programming languages give AND a higher priority than OR, so that the expression "a OR b AND c" is treated as "a OR (b AND c)". Following that idea, search engines and query/addressing languages (CSS, XPath, SQL, ...) used the same prioritization. Wasn't that a mistake?
When dealing with large enough data, that prioritization is inconvenient because it makes it impossible to create reusable query context without use of parentheses. It is more convenient to create context by using AND and then union results within that context by using OR. It is even more convenient if space is used as AND operator and comma is used as OR operator.
Examples:
When searching the internet for airline tickets to bahamas in november or december it would be more convenient to type "airline ticket bahamas november,december" instead of "airline ticket bahamas november", "airline ticket bahamas december" or "airline ticket bahamas (november,december)"
In CSS, if we need to style 2 elements red, we have to write: body.app1 div.d1 td.phone span.area, body.app1 div.d1 td.fax span.area{color:red}, essentially duplicating the prefix body.app1 div.d1 and the suffix span.area
If priority of OR was higher than AND we would write this in CSS: body.app1 div.d1 td.phone,td.fax span.area{color:red}
Of course, this idea can be developed into having 2 OR operators, one with higher priority than AND and one with lower (for example, ',' is higher and ';' is lower). But in many cases languages don't have spare symbols to extend in that way, and the existing priority of ',' where it is used is low.
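To make the proposal concrete, here is a small Python sketch of a search syntax where a space means AND, a comma means OR, and the comma binds more tightly. The function names `parse` and `matches` and the whole-word matching rule are assumptions made only for illustration.

```python
def parse(query):
    """Parse a query into a list of OR-groups that are ANDed together.

    A space means AND, a comma means OR, and the comma binds more tightly.
    """
    return [group.split(",") for group in query.lower().split()]


def matches(query, text):
    """Return True if the text contains at least one word from every OR-group."""
    words = set(text.lower().split())
    return all(any(alt in words for alt in group) for group in parse(query))


query = "airline ticket bahamas november,december"
print(parse(query))
print(matches(query, "cheap airline ticket to bahamas in december"))
print(matches(query, "cheap airline ticket to bahamas in january"))
```

Here "november,december" forms a single group inside the surrounding AND context, which is exactly the reuse of context the question asks for.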

Considering the fact that the background of OR and AND is from mathematical logic where they have well defined precedence, you can not violate that precedence in your design without confusing a VAST majority of the users.

I'd rather have consistency everywhere, in code, SQL, search queries, so that I won't have to remember which way it goes in this particular situation.

I think that you are missing the point of the priority of operators. They are there merely as a convenience to the programmer/book writer. The use of the order of operations makes it so that certain clauses can be written without parens, but the use of parens makes sure the reader knows exactly what the code is doing, especially when different languages have different order of operations.
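As a small illustration of how parentheses remove the need to remember the default grouping, consider this Python snippet (Python gives `and` higher precedence than `or`; the variable names are arbitrary):

```python
a, b, c = True, False, False

default = a or b and c      # no parentheses: parsed as a or (b and c)
and_first = a or (b and c)  # explicit version of the default grouping
or_first = (a or b) and c   # the alternative grouping gives a different result

print(default, and_first, or_first)
```

The unparenthesized expression agrees with `and_first` but not with `or_first`, so a reader who misremembers the precedence would misread the code, while the parenthesized forms cannot be misread.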

I don't know of any language which does this, but perhaps it should be an error to have AND and OR combined without parentheses. This is a very common cause of stupid errors.

When Boolean operators are used to join logical statements, the "and" operator should have precedence over "or". I think the point of confusion is that many query languages implicitly form logical statements from nouns but don't make clear what those statements are.
For example, "foo & bar" might be interpreted as accepting only pages where both of the following are true:
The page contains an item that matches "foo".
The page contains an item that matches "bar".
The query "foo | bar" might be interpreted to evaluate the above conditions and accept any page where either condition holds true, but it could also be interpreted as involving a single condition:
This page contains an item that matches either "foo" or "bar".
Note that in the simple "foo | bar" case it wouldn't matter which interpretation one chose, but given "foo & moo | bar", it would be impossible to adopt the latter interpretation for the | operator without giving it priority over the & operator unless one interpreted foo & moo as meaning:
This page contains an item that matches both "foo" and "moo".
If the arguments to & included wildcards, such an interpretation might be meaningful (e.g. foo* & *oot might mean that a single item must start with "foo" and end with "oot", rather than meaning that the page had to have an item that starts with "foo" and a possibly-different item that ends in "oot"), but without such wildcards, there are no items any page could contain that match both "foo" and "moo", and thus no page could contain such an item.
Perhaps the solution would be to have separate operators to join items versus joining pages. For example, if &&, &&!, and || join pages, while &, &!, and | join things to be matched, then foo && bar || moo && jar || quack && foo* | m* & *l &! *ll would match every page that contains both "foo" and "bar", every page that contains both "moo" and "jar" as well as every page which contains the word "quack" and also contains a word that starts with "foo", or that starts with "m", ends with "l", and doesn't end with "ll".

Ironically, in your first example you implicitly use the AND with higher priority, since
airline tickets bahama november
is surely to be understood like, for example
.... WHERE transport = "AIR" AND target like "%bahamas%" AND month = "Nov"
This exposes nicely the silliness of your idea.
The point is, it is always possible to come up with queries that are longer with the current precedences and would be shorter with alternative ones, just as it is possible to come up with arithmetic expressions that could be written with fewer parentheses if addition had a higher precedence than multiplication. But this in itself is not a sufficient reason to alter the behaviour.

Can you programmatically detect pluralizations of English words, and derive the singular form?

Given some (English) word that we shall assume is a plural, is it possible to derive the singular form? I'd like to avoid lookup/dictionary tables if possible.
Some examples:
Examples -> Example a simple 's' suffix
Glitches -> Glitch 'es' suffix, as opposed to above
Countries -> Country 'ies' suffix.
Sheep -> Sheep no change: possible fallback for indeterminate values
Or, this seems to be a fairly exhaustive list.
Suggestions of libraries in language x are fine, as long as they are open-source (ie, so that someone can examine them to determine how to do it in language y)

It really depends on what you mean by 'programmatically'. Part of English works on easy to understand rules, and part doesn't. It has to do mainly with frequency. For a brief overview, you can read Pinker's "Words and Rules", but do yourself a favor and don't take the whole generative theory of linguistics entirely to heart. There's a lot more empiricism there than that school of thought really lends to the pursuit.
A lot of English can be statistically lemmatized. By the way, stemming or lemmatization is the term you're looking for. One of the most effective lemmatizers which work off of statistical rules bootstrapped with frequency-based exceptions is the Morpha Lemmatizer. You can give this a shot if you have a project that requires this type of simplification of strings which represent specific terms in English.
There are even more naive approaches that accomplish much with respect to normalizing related terms. Take a look at the Porter Stemmer, which is effective enough to cluster together most terms in English.

Going from singular to plural, English plural form is actually pretty regular compared to some other European languages I have a passing familiarity with. In German for example, working out the plural form is really complicated (eg Land -> Länder). I think there are roughly 20-30 exceptions and the rest follow a fairly simple ruleset:
-y -> -ies (family -> families)
-us -> -i (cactus -> cacti)
-s -> -ses (loss -> losses)
otherwise add -s
That being said, plural to singular form becomes that much harder because the reverse cases have ambiguities. For example:
pies: is it py or pie?
ski: is it singular or plural for 'skus'?
molasses: is it singular or plural for 'molasse' or 'molass'?
So it can be done but you're going to have a much larger list of exceptions and you're going to have to store a lot of false positives (ie things that appear plural but aren't).
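The asymmetry can be seen by coding only the four rules listed above and then trying to run them backwards. This is a sketch, not a usable inflector: the function names are hypothetical, and the rule set deliberately ignores the exceptions mentioned (and cases such as "day" or "glitch").

```python
def pluralize(word):
    """Apply only the simple rules listed above; exceptions are ignored."""
    if word.endswith("y"):
        return word[:-1] + "ies"
    if word.endswith("us"):
        return word[:-2] + "i"
    if word.endswith("s"):
        return word + "es"
    return word + "s"


def singular_candidates(plural):
    """Return every singular form that pluralize() would map to this plural."""
    candidates = []
    if plural.endswith("ies"):
        candidates.append(plural[:-3] + "y")
    if plural.endswith("i"):
        candidates.append(plural[:-1] + "us")
    if plural.endswith("ses"):
        candidates.append(plural[:-2])
    if plural.endswith("s"):
        candidates.append(plural[:-1])
    # Keep only candidates that really round-trip through the forward rules.
    return [word for word in candidates if pluralize(word) == plural]


print(pluralize("family"), pluralize("cactus"), pluralize("loss"))
print(singular_candidates("pies"))
print(singular_candidates("molasses"))
print(singular_candidates("ski"))
```

The forward direction gives one answer per word, while the reverse direction returns several candidates for "pies" and "molasses" and a wrong one for "ski", which is why an exception list is needed to choose between them.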

Is "axes" the plural of "ax" or of "axis"? Even a human cannot tell without context.

You can take a look at Inflector.net - my port of Rails' inflection class.

No - English isn't a language which sticks to many rules.
I think your best bet is either:
use a dictionary of common words and their plurals (or group them by their plural rule, eg: group words where you just add an S, words where you add ES, words where you drop a Y and add IES...)
rethink your application

It is not possible, as nickf has already said. It would be simple for the classes of words you have described, but what about all the words that end with s naturally? My name, Marius, for example, is not plural of Mariu. Same with Bus I guess. Pluralization of words in English is a one way function (a hash function), and you usually need the rest of the sentence or paragraph for context.
