Best Way to Implement Natural Language Search on a Site


I implemented a basic search on my site using a LIKE clause in MySQL, but it doesn't help in many cases.
I have a search I am testing with: "swift bird"
The entry in the database is: "Swift"
What do people usually do in order to catch as many of the possibilities, abbreviations, and variations of the words they need to find when implementing their own basic search on their site?
If you want to test, here is the url for this:
http://www.comehike.com/outdoors/birds/search_birds.php
Thanks,
Alex

Have you investigated MySQL's Full Text Search capabilities?
http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html
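As a rough sketch of what that could look like from application code, the helper below builds a parameterized MATCH ... AGAINST query in natural language mode. The table name birds and the columns name and description are assumptions made for illustration, the columns would need a FULLTEXT index, and the %s placeholder style assumes a typical MySQL driver for Python.

```python
def build_fulltext_query(user_query, table="birds", columns=("name", "description")):
    """Return (sql, params) for a MySQL natural-language full-text search.

    The table and column names are hypothetical. The listed columns
    must be covered by a FULLTEXT index for MATCH ... AGAINST to work.
    """
    # Collapse stray whitespace; MySQL splits the text into words itself.
    text = " ".join(user_query.split())
    if not text:
        raise ValueError("empty search query")

    cols = ", ".join(columns)
    match = f"MATCH({cols}) AGAINST (%s IN NATURAL LANGUAGE MODE)"
    # The same expression is used twice: once to filter, once to rank.
    sql = (
        f"SELECT id, name, {match} AS score "
        f"FROM {table} WHERE {match} ORDER BY score DESC LIMIT 20"
    )
    return sql, (text, text)
```

In natural language mode a row does not have to contain every word of the query, so a search for "swift bird" can return an entry that only contains "Swift", subject to MySQL's stopword and minimum word length settings.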

Related

MySQL/PostgreSQL record search: any alternative to LIKE?

I'm implementing a search system in my system and I'm curious about the usage of LIKE. Many websites and books "crucify" the usage of LIKE, but what's the proper alternative? I really don't want to install a third-party system like Elasticsearch or similar.

For search, the usual approach is the (very powerful) full text search functionality:
http://www.postgresql.org/docs/current/static/textsearch.html
Depending on your specific needs, there are also colorful tools such as n-grams (pg_trgm) and a case-insensitive text type (citext) in contrib:
http://www.postgresql.org/docs/current/static/pgtrgm.html
http://www.postgresql.org/docs/current/static/citext.html
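To give a feel for the n-gram idea, here is a simplified sketch of trigram similarity in Python. It follows the general approach pg_trgm describes (lowercase, pad each word, compare sets of three-character substrings), but it is an approximation for illustration, not PostgreSQL's exact implementation; for example, it splits words on whitespace only.

```python
def trigrams(text):
    """Return the set of three-character substrings of each padded word."""
    grams = set()
    for word in text.lower().split():
        # Pad with two leading spaces and one trailing space so that
        # word boundaries contribute their own trigrams.
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

With this measure a misspelling such as "swfit" still shares some trigrams with "swift", while an unrelated word shares none, which is what makes the technique useful for catching variations that LIKE misses.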

How can I get started on programmatically analyzing web site content?

I've been looking for a new hobby programming project, and I think it would be interesting to dabble in ways to programmatically gather information from websites and then analyze that data to do things like aggregate or filter it. For example, I might want to write an application that takes Craigslist listings and displays only the ones matching a specific city, not just a geographical area. That's just a simple example, but you could go as advanced and sophisticated as how Google analyzes a site's content to know how to rank it.
I know next to nothing about that subject and I think it would be fun to learn more about it, or hopefully do a very modest programming project in that topic. My problem is, I know so little that I don't even know how to find more information about the subject.
What are these types of programs called? What are some useful keywords to use when searching on Google? Where can I get some introductory reading material? Are there interesting papers I should read?
All I need is someone to disabuse me of my ignorance, so that I can do some research on my own.

cURL (http://en.wikipedia.org/wiki/CURL) is a good tool to fetch a website's contents and hand it off to a processor.
If you are proficient with a particular language, see if it supports cURL. If not, PHP (php.net) may be a good place to start.
When you have retrieved a website's content via cURL, you can use the language's text processing functionality to parse the data. You can use regular expressions (http://www.regular-expressions.info/) or functions such as PHP's strstr() to find and extract the particular data you seek.
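The fetch-then-parse flow described above can be sketched in Python, using the standard library's urllib in place of cURL. The function names and the keyword filter are made up for this illustration, and the regular expression is deliberately simple (it only handles double-quoted href attributes), so treat it as a starting point rather than a robust parser.

```python
import re
from urllib.request import urlopen

# Matches <a ... href="...">inner</a>; the inner part may contain other tags.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                     re.IGNORECASE | re.DOTALL)


def fetch(url, timeout=10):
    """Download a page and decode it to text (plays the role of cURL)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_links(html):
    """Return (href, visible text) pairs for every link found in the HTML."""
    links = []
    for href, inner in LINK_RE.findall(html):
        text = re.sub(r"<[^>]+>", "", inner).strip()  # drop nested tags
        links.append((href, text))
    return links


def filter_by_keyword(links, keyword):
    """Keep only the links whose visible text mentions the keyword."""
    keyword = keyword.lower()
    return [(href, text) for href, text in links if keyword in text.lower()]
```

A typical use would be filter_by_keyword(extract_links(fetch(url)), "seattle"), where url is whatever listing page you want to analyze.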

Programs that "scan" other sites are usually called web crawlers or spiders.

I recently completed a project that uses the Google Search Appliance (GSA), which basically crawls the whole .com domain of the web server.
GSA is a very powerful tool that pretty much indexes all the URLs it encounters and serves the results.
http://code.google.com/apis/searchappliance/documentation/60/xml_reference.html

How do I find methods?

Here's a somewhat general computer question. I've always been able to follow the LOGIC of programming, but when I go to code something, I always find that I don't know some method or another to get what I need to get done. When I see it, I always think, "OF COURSE!".
How do you go about finding relevant methods for your programming needs that are "built-in?" I don't enjoy re-inventing the wheel, but I find it difficult to find what I need to do what I want to do.

First try Google:
You can use Google to search for the method you need. For example, if I want to search for a value in an array in PHP, I go to Google and type "Search values in array in PHP". The function I need is usually the first result.
Then try the standard documentation:
Browse the standard documentation for the method you need. For example, if my problem is related to strings in PHP, I go to the String Functions documentation and find the required function there.
Finally try Stack Overflow:
Otherwise, you can ask about your problem on Stack Overflow to find the methods and libraries you need. You will usually get the shortest way.

What you are asking here is for the best way to do research. Well, that's a hard skill to explain, even more so to teach.
Nevertheless here are some tips:
1. Go to a search engine. It makes no sense to start in a place like MSDN, since all of its content is indexed by the search engines anyway. Phrase your question several different ways.
2. As you learn more about the issue you will learn new vocabulary about it. Use that new vocabulary to do even more searches.
3. If the searches turn out empty, switch to browsing a specific section of the official documentation that you think is the most related to what you are doing. If nothing else, it will expand your horizons around the issue and give you more vocabulary to do more searches.
4. Finally, if all else fails, ask a question on StackOverflow explaining what you want to do as clearly as possible.
Note that if there's a simple API that does what you need, you will rarely reach step 4.

You say:
It's very frustrating to suddenly find an "easy" button mid-way through.
Try to see it differently. Think of these moments as blessings. You've just learned something. You invested a lot of effort - and instead of seeing that effort as wasted, see it as critical to proper learning. You - better than the guy who just happened across the magic method - really understand what it's for and something about how it works. And you really, really, understand why you need it, and you properly appreciate its value. You're never going to forget that method.
So it was costly, but you learned something important. Celebrate, and move on.

It is usually included in some form of documentation. Most IDEs support the documentation format and give you auto-complete functionality.

If you are using MVS, MSDN is really good for it.

In addition to this and this answer above, google's basic and advanced searching tips prove very helpful.
In addition to above, changing the order of keywords in search criteria also sorts the list in different orders.
In essence I believe that searching is still an art rather than a science, and is best learnt - quoting from David Reis' answer above: "2. As you learn more about the issue you will learn new vocabulary about it. Use that new vocabulary to do even more searches."

Search in the API documentation. But the best way (or so I have found) is to search the internet for multiple solutions and then choose the one that you think is best. Make your search as narrow as possible. For example, if you want to implement a random number generation function, search like this: "How to generate random numbers in Java?".

Namespaces, naming conventions, autocomplete/IntelliSense
I assume that you are trying to find some kind of object-oriented API. I use .NET in my example.
First, try to find a class that might be responsible for the method you are looking for.
Example: If you want to "make a new directory in the filesystem", you must know (or learn) that in .NET these classes are in the namespace System.IO.
This namespace contains sub-namespaces like Compression and classes like File, Path, Directory, ...
Second, you should know the naming conventions. There are common prefixes for method names like Get, Set, Insert, Create. In the documentation for the class Directory you will find a CreateDirectory method.
If you have an intelligent editor that knows your programming language and its classes and namespaces, learning is much easier. In the .NET world this feature is called autocomplete/IntelliSense.
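The same discovery process works outside .NET. As a small sketch in Python (not the .NET API from the example above), the hypothetical helper below lists the public names in a module that contain a keyword, which is roughly what an autocomplete popup does once you have guessed the right module.

```python
import importlib


def find_members(module_name, keyword):
    """List public names in a module whose name contains the keyword."""
    module = importlib.import_module(module_name)
    keyword = keyword.lower()
    return sorted(
        name for name in dir(module)
        if not name.startswith("_") and keyword in name.lower()
    )
```

For the "make a new directory" task, guessing the os module and the keyword "dir" is enough to surface candidates such as mkdir and makedirs, whose documentation you can then read with help().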

HTML meta "keywords". Worth including?

Do you use "keywords" meta in your site, knowing that Google does not use them (and has no plans) in page ranking, and perhaps even search?

Yes. Google is not the only search engine on the web, although it has the major market share. There are other engines, including Yahoo, which use the keywords meta tag to some extent.

No. I don't want competitors knowing what I am trying to rank for. Keywords are very valuable in some markets. If you found a good keyword phrase that is converting well (and your competitors don't know about it) do yourself a favor, and keep a monopoly on it.

That is NOT a reason to abandon keywords altogether.
They are still indexed and searchable, and so still have a function.

Yes, we do. No one knows the exact Google (or any other search engine) algorithm.
In addition, lots of companies use "keywords" for internal websites that host tons of generated HTML content.

I've heard from so many people that Google doesn't use meta keywords, but I'm not sure I'm convinced of that. Whether they do or not, you should still use them, because Bing and Yahoo (the other 2/3 of the big 3) still do use them. But remember to limit your keywords, because (based on popular opinion) none of the 3 engines reads past character 46.

I did, because I didn't know that fact. Now I think it is not worth doing: the "energy conversion efficiency" seems insignificant to me. Even if we leave Google out of consideration, other search engines do not seem to pay much attention to it either.

Randomizing pages in Wikipedia with MySQL and Perl?

I found a Perl script that picks random Wikipedia articles here. The code seems to be slightly computer generated. Due to my present interest in MySQL, I thought you could possibly have the links and related data in a database.
I know that MySQL is good at maintaining relations between tables, while it seems you can easily implement things with Perl. I find it hard to draw a line between their specialties. So:
How can you randomize Wikipedia articles with MySQL and Perl?

If you really want to know how THEY (Wikipedia) do it, have a look at this code directly from MediaWiki:
http://svn.wikimedia.org/svnroot/mediawiki/trunk/phase3/includes/specials/SpecialRandompage.php
It is open source software after all ;), and that's the beauty of it.
Edit: From having a quick glance at the code, I am pretty sure they're using a field called page_random, set at row creation time. Then, since it's an indexed field, ordering by it with limit 1 is instant (with a given random offset, valid for this application, of course).
This is a very standard way to make random access quick, due to ORDER BY RAND() being extremely slow, as I mentioned in the other answer.
Edit #2: I love how clean and proper the OOP in MediaWiki's code is. Definitely bookmarking it to show PHP newbies what good PHP code looks like (and to remind myself).
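The page_random idea from the first edit can be sketched as follows. This is a guess at the general technique based on the description above, not MediaWiki's actual code; it uses Python with an in-memory SQLite database instead of MySQL so that it runs on its own, and the table layout (page, title) is an assumption. Only the page_random column name comes from the description.

```python
import random
import sqlite3


def create_pages(conn, titles, rng=random):
    """Create a hypothetical page table and store a random key per row."""
    conn.execute(
        "CREATE TABLE page ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, page_random REAL NOT NULL)"
    )
    # The index is what makes the lookup below cheap.
    conn.execute("CREATE INDEX idx_page_random ON page (page_random)")
    conn.executemany(
        "INSERT INTO page (title, page_random) VALUES (?, ?)",
        [(title, rng.random()) for title in titles],
    )


def random_page(conn, rng=random):
    """Pick the first row whose stored key is at or above a fresh random value."""
    threshold = rng.random()
    row = conn.execute(
        "SELECT title FROM page WHERE page_random >= ? "
        "ORDER BY page_random LIMIT 1",
        (threshold,),
    ).fetchone()
    if row is None:
        # The threshold was above every stored key: wrap around to the lowest.
        row = conn.execute(
            "SELECT title FROM page ORDER BY page_random LIMIT 1"
        ).fetchone()
    return row[0] if row else None
```

Because the random value is stored once per row and indexed, each lookup is an index seek instead of a sort of the whole table. The trade-off is that the selection is only approximately uniform, since rows that follow a large gap in page_random are picked more often.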

SELECT id FROM articles ORDER BY RAND() LIMIT 1
You could, of course, just link to http://en.wikipedia.org/wiki/Special:Random
