Full-text search in SQL Server (which Stack Overflow turned down) - sql-server-2008

My application is a help (user assistance) system, much like the online MSDN, but the only way to navigate it is through search. Either the search is good or my system is dead.
I am looking for a third-party search engine that can connect to a database and provide out-of-the-box full-text searching.
I have researched SQL Server 2008 iFTS, the Lucene.NET API, and SQLite FTS4, but none of them ranks results as well as Google does.
I'm not expecting something like Google, but I need the search engine product with the best ranking.
Any suggestions or experience?
Maybe I should not go for a third-party search engine and should use Lucene.NET or SQL Server 2008 FTS instead, but then how can I establish good ranking for a user-provided search query such as
"how can i do upload excel file in XYZ interface"?

My short answer is discouraging: you won't be able to do it yourself, even for an "okay" solution.
If you want good ranking:
Make your site friendly to search engines (which doesn't necessarily mean that you have to open it to the public, just make sure search engines understand the URLs).
Pay Google to do it (look for Google Apps).
As you said, a search engine has to do two things at least. The first one is indexing, i.e., finding the documents out of the database based on queried keywords. The second is ranking, which sorts all documents and highlights the most relevant ones.
Ranking is one of the key factors in how good a search engine is. It's not surprising that ranking is hard.
To give you an idea how hard it is, take the sentence in your question (i.e., "how can i do upload excel file in XYZ interface") for example. A search engine has to answer at least two questions to get good results:
Which keywords are most important? For example, "XYZ" might be more important than the words "how" and "can".
What are the possible meanings of a word? "Excel" can be Microsoft Excel, or Xcel Energy (a company whose name sounds like "excel").
There is a whole field in computer science dedicated to this problem. If you want more evidence, take a quick look at ACM WWW.
One thing that is even more discouraging is that getting even an "okay" solution would be difficult. The high-level point is that the computer knows nothing about English; it has to read a lot to learn how to rank documents.
Sadly, "a lot" means a lot of work. For example, many textbooks suggest ranking documents based on TF/IDF, but getting a reasonable cut for these values requires crawling millions of web pages.
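To make the TF/IDF idea concrete, here is a minimal sketch in Python. It is a textbook-style scoring, not what any of the products mentioned above actually do, and the function names and the three sample documents are made up for illustration. With a corpus this small the IDF values are very crude, which is the point made above about needing a lot of text.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def tf_idf_scores(query, documents):
    """Return one score per document: the sum of TF * IDF over query terms."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term in tf:
                idf = math.log(n_docs / df[term])  # 0 for terms in every doc
                score += (tf[term] / len(tokens)) * idf
        scores.append(score)
    return scores


# Hypothetical help-system articles.
docs = [
    "how to upload an excel file in the xyz interface",
    "how to print a report",
    "how to export a report to excel",
]
scores = tf_idf_scores("how can i do upload excel file in XYZ interface", docs)
```

Note that "how" occurs in every sample document, so its IDF is zero and it contributes nothing, while "xyz" occurs in only one and carries the most weight.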
To summarize:
Ranking is hard.
Therefore it's not surprising that you won't be able to find any free, out-of-the-box solutions, and Google and Microsoft keep their ranking algorithms proprietary.
If you want to rank documents in a large database, get a search engine.

Check out the new semantic search feature in SQL Server 2012:
http://msdn.microsoft.com/en-us/library/gg492075%28v=sql.110%29.aspx It won't be a silver bullet, but it might provide you with an "out of the box" approach.

Related

How do I set up the architecture for a "big data" analysis project?

A friend of mine and I are in our senior year and will be starting a senior project soon. We had the idea to do a data analysis and data visualization project for it. Our project involves reading a CSV file that is updated every 2 minutes, parsing that data, then storing it in a database. Once that data is stored we want to run some analysis on it and provide an API through which we could access that data to visualize in some way. Our end goal would be to build an Android app that displays some of the raw data from the CSV and the analysis in a user-friendly format. I talked to another CS major and he explained that I would need a few different servers to accomplish this: one for the storage, another for analysis, and another for some type of queue that would make sure things don't get screwy while we are doing scraping and analysis. The problem is, I don't really know where to start with this. I've done some work with a SQL database before and a PHP front end, but nothing with multiple servers. I've heard of tools to use with big data projects like Hadoop, but I'm not exactly sure where it fits in. If someone could point me to a resource of some kind to explain, or explain themselves, how I would start to structure this kind of project, that would be awesome!
Since you don't have much experience with these things you'll probably want to look at projects like Cloudera. Specifically their resources page has a nice set of videos and articles.
Another source of solid information (one that I personally use) is to click on a Stack Overflow tag and select the votes option. Many good questions on a plethora of big data topics already exist.

What does it mean to "Index a page"?

For example, in the sentence "This tells Google how to index the page", what does "index the page" mean in the grand scheme of things? Why would a page have an "index"? What is it useful for?
Google servers are constantly visiting pages on the Internet (crawling) and reading their contents. Based on the contents Google builds an internal index, which is basically a data structure mapping from keywords to pages containing them (very simplified). Also when the crawler discovers hyperlinks, it will follow them and repeat the process on linked pages. This process happens all the time on thousands of servers.
In general, the term indexing means analyzing large amounts of data and building some sort of index so that the data can be accessed more efficiently based on some search criteria. Compare it with database indexes.
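The "mapping from keywords to pages" described above is usually called an inverted index. A very simplified sketch in Python, assuming whitespace tokenization and a couple of made-up example URLs (a real crawler and index are far more involved):

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical crawled pages.
pages = {
    "http://example.com/a": "upload an excel file",
    "http://example.com/b": "print an excel report",
}
index = build_index(pages)
```

Looking up a word is then a dictionary access instead of a scan over every page, which is the efficiency gain the answer refers to.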
I guess you are asking why indexing with Google is needed. Here is why.
Suppose you have created a website that is beautiful and has all the right features. As you probably know, the web is all about connecting web pages, and so far you are the only one who can look at your site. If you want the world to know about it, the next step is hosting. After that, you have to get your pages indexed by a search engine, for example Google. Your site will then be indexed by the Google bot (I can't explain how the bot works). When someone searches for your site's name in a search engine, that engine can use its index to retrieve your page as a result :) This is how you connect to the web!
This simply means Google is reading your page, figuring out what content is on it (via the page structure, links, etc.) assigning a page rank to it, among other things, and adding it to their database.
There is no specific terminology here.
See Web Crawler: http://en.wikipedia.org/wiki/Web_crawler
In short, an index page is a page that originates from a table of contents and helps you search for material, so that you can easily access data or information within a given body of data, such as a book or a web page.

Creating a Full Text Index search

I've created a blog and I wish to search through certain tables in my MySQL databases and then return results for the user on a separate search page. I do not wish to use Google CSE. How would I go about creating this for my site? I found a post on StackOverflow.com from a friend of mine in which he wished to make his search engine more efficient. How would I go about implementing his search engine in my site?
His Code - Here
Are you limited to SQL? There is a lot of software better suited to text search than any relational database engine: Sphinx, Lucene, Xapian, just to name a few.
EDIT
MySQL has some full-text indexing capabilities as well. You may want to check them out.

Best Way To Partial Search in SQL 2008

I've looked into SQL 2008's built-in Full-Text Search and also Lucene.NET, but I don't think they'll do what I need. I just want to make sure I'm building my program as efficiently as possible.
So here's the dream. I want to have a single textbox on a page (like Google) and allow the user to enter ANYTHING. Based on their text, I will search tens of tables to find what they're looking for.
Example: my database contains thousands of locations, each of which has multiple names / codes. Each location has tonnes of data associated with it.
So if the user wants to display all the locations with codes that contain "VM" ("CD-VM01", "CD-VM02", "CD-VM03", etc.), they should be able to. Or if they want to find all the locations in Toronto, they just type Toronto. I want to make the search as easy as possible for people. (I've found that people don't like thinking.)
Plus it ends up being easier to scale to more search options if I can just search the database, and not have to add new fields to a search screen.
So if I don't use Full-Text Search (which I can't for partial matches), the only thing I can see that I'm left with is LIKE. Is that right? Is that my only option?
I guess the question is, even if you were able to do this in the database, how would you handle it in the UI?
Most likely every search result from a different table will have different attributes that need to be displayed in order for the end user to understand what it is.
The Google search box only needs to search one thing - the content of web pages - and return one type of result - web page URLs and excerpts. Fundamentally you are trying to search for many different things, and so you'll most likely need to handle each case separately.
Alternatively, you could maintain a denormalized search table that contains only the search text and the common attributes you think need to be displayed with each hit. Maintain it either with a scheduled task or with triggers. You'd be able to use FTS on this as well.
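A rough sketch of the denormalized search table idea, written in Python against an in-memory SQLite database purely so it can be run anywhere. The table and column names (`locations`, `location_codes`, `search_index`) are invented for the example, the rebuild function stands in for the scheduled task, and the query uses a plain substring match rather than SQL Server FTS:

```python
import sqlite3


def make_demo_db():
    # Hypothetical source tables standing in for the real schema.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE locations (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE location_codes (location_id INTEGER, code TEXT);
        INSERT INTO locations VALUES (1, 'Main Depot', 'Toronto'),
                                     (2, 'North Yard', 'Ottawa');
        INSERT INTO location_codes VALUES (1, 'CD-VM01'), (2, 'CD-VM02'),
                                          (2, 'AB-XX01');
    """)
    return conn


def rebuild_search_table(conn):
    # One row per searchable thing: where it came from, what to display,
    # and the text to search. This is what a scheduled task would rerun.
    conn.executescript("""
        DROP TABLE IF EXISTS search_index;
        CREATE TABLE search_index (
            source_table TEXT NOT NULL,
            source_id    INTEGER NOT NULL,
            label        TEXT NOT NULL,
            search_text  TEXT NOT NULL
        );
        INSERT INTO search_index
            SELECT 'locations', id, name, name || ' ' || city FROM locations;
        INSERT INTO search_index
            SELECT 'location_codes', location_id, code, code
            FROM location_codes;
    """)


def search(conn, term):
    # Single query over one table, regardless of how many sources feed it.
    rows = conn.execute(
        "SELECT source_table, source_id, label FROM search_index "
        "WHERE search_text LIKE ? "
        "ORDER BY source_table, source_id, label",
        (f"%{term}%",),
    )
    return rows.fetchall()


conn = make_demo_db()
rebuild_search_table(conn)
```

Because every hit carries `source_table` and `source_id`, the UI can still decide how to render each kind of result.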
Update
Some of the comments express some uncertainty over what SQL Server Full-Text Search is capable of. FTS can most definitely search for a single string anywhere within the text of a column, and can do other things as well (proximity search, free-text search, etc.). If you're just getting started then I'd recommend the TechNet pages on the subject; the documentation is very comprehensive.
In particular I'd suggest having a look at the section on Configuring Catalogs and the Getting Started page (Cole's Notes: you have to create catalogs - writing CONTAINS queries without them won't get you very far). Then take a look at the querying page. I'd be very surprised if you can't find answers to any and all of your questions there.
If you still can't get it to work, I would post a new question with the specifics of your problem - what you've tried, what you're expecting, and what's happening instead.
I believe Lucene does exactly what you're looking for. You can add an index from any external data source (including multiple database tables), then query that index and you'll get back pointers to the matching records.
The drawback is that unlike with full-text indexing, you're responsible for building and maintaining the index yourself.
You can see an example of how Lucene.NET might be used.
It appears that the easiest / quickest solution for this exact problem would be to use LIKE.
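If you go the LIKE route with free-form user input, one thing to watch is that `%` and `_` are wildcards, so a search for "CD_VM" would also match "CD-VM01". A minimal sketch of escaping the input, shown in Python against SQLite only so that it is runnable; the `codes` table and the helper name are invented, and SQL Server's LIKE additionally treats `[` specially, which this sketch does not handle:

```python
import sqlite3


def like_pattern(user_text, escape="\\"):
    """Build a 'contains' pattern with LIKE wildcards in the input escaped."""
    escaped = (
        user_text.replace(escape, escape + escape)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )
    return f"%{escaped}%"


def find_codes(conn, user_text):
    rows = conn.execute(
        "SELECT code FROM codes WHERE code LIKE ? ESCAPE '\\' ORDER BY code",
        (like_pattern(user_text),),
    )
    return [row[0] for row in rows]


# Hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE codes (code TEXT)")
conn.executemany(
    "INSERT INTO codes VALUES (?)",
    [("CD-VM01",), ("CD_VM02",), ("CDXVM03",)],
)
```

Keep in mind that a pattern with a leading `%` generally cannot use an ordinary index, so this approach scans the table.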

Can I run an HTTP GET directly in SQL under MySQL?

I'd love to do this:
UPDATE table SET blobCol = HTTPGET(urlCol) WHERE whatever LIMIT n;
Is there code available to do this? I know this should be possible, as the MySQL docs include an example of adding a function that does a DNS lookup.
MySQL / windows / Preferably without having to compile stuff, but I can.
(If you haven't heard of anything like this but you would expect that you would have if it did exist, A "proly not" would be nice.)
EDIT: I know this would open a whole can-o-worms re security; however, in my case, the only access to the DB is via the mysql console app. It is not a world-accessible system. It is not a web back end. It is only a local data logging system.
No, thank goodness — it would be a security horror. Every SQL injection hole in an application could be leveraged to start spamming connections to attack other sites.
You could, I suppose, write it in C and compile it as a UDF. But I don't think it really gets you anything in comparison to just SELECTing in your application layer and looping over the results doing HTTP GETs and UPDATEing. If we're talking about making HTTP connections, the extra efficiency of doing it in the database layer will be completely dwarfed by the network delays anyway.
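The application-layer loop described above could look roughly like this in Python. This is a sketch under assumptions: it uses the standard-library `sqlite3` module as a stand-in for a MySQL connection, the table and column names (`pages`, `url_col`, `blob_col`) are made up, and "rows with no blob yet" is just a guess at what the question's `WHERE whatever` might be.

```python
import sqlite3
import urllib.request


def http_get(url, timeout=10):
    # Plain standard-library GET; returns the response body as bytes.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fill_blobs(conn, limit, fetch=http_get):
    """SELECT up to `limit` rows, GET each URL, UPDATE the row with the body."""
    rows = conn.execute(
        "SELECT id, url_col FROM pages WHERE blob_col IS NULL LIMIT ?",
        (limit,),
    ).fetchall()
    updated = 0
    for row_id, url in rows:
        try:
            body = fetch(url)
        except OSError:
            # urllib's URLError is a subclass of OSError; skip failed URLs.
            continue
        conn.execute(
            "UPDATE pages SET blob_col = ? WHERE id = ?", (body, row_id)
        )
        updated += 1
    conn.commit()
    return updated
```

The `fetch` parameter is only there so the HTTP call can be swapped out, for example to try the loop without touching the network.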
I don't know of any function like that as part of MySQL.
Are you just trying to retrieve HTML data from many URLs?
An alternative solution might be to use Google spreadsheet's importHtml function.
Google Spreadsheets Lets You Import Online Data
Proly not. Best practice in a web environment is to have database servers isolated from the outside, both ways, meaning that the DB server wouldn't be allowed to fetch stuff from the internet.
Proly not.
If you're absolutely determined to get web content from within an SQL environ, there are as far as I know two possibilities:
Write a custom MySQL UDF in C (as bobince mentioned). This could potentially be a huge job, depending on your experience of C, how much security you want, and how complete you want the UDF to be: e.g. just GET requests? How about POST? HEAD? etc.
Use a different database which can do this. If you're happy with SQL you could probably do this with PostgreSQL and one of the snap-in languages such as Python or PHP.
If you're not too fussed about sticking with SQL you could use something like eXist. You can do this type of thing relatively easily with XQuery, and would benefit from being able to easily modify the results to fit your schema (rather than just lumping it into a blob field) or store the page "as is" as an xhtml doc in the DB.
Then you can run queries very quickly across all documents to, for instance, get all the links or quotes or whatever. You could even apply XSL to such a result with very little extra work. Great if you're storing the pages for reference and want to adapt the results into a personal "intranet"-style app.
Also, since eXist is document-centric, it has lots of great methods for fuzzy-text searching and near-word searching, and has a great full-text index (much better than MySQL's). Perfect if you're after doing some data mining on the content, e.g. find me all documents where a word like "burger" appears within 50 words of "hotdog" and the word isn't in a UL list. Try doing that natively in MySQL!
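To show what a near-word (proximity) query is actually checking, here is a small plain-Python sketch. It is not eXist or XQuery code and ignores the "not inside a UL list" part; the function name and the punctuation handling are assumptions made for the example.

```python
def within_n_words(text, first, second, max_distance):
    """True if `first` occurs within `max_distance` words of `second`."""
    # Naive tokenization: split on whitespace, strip common punctuation.
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


sentence = "I ate a burger and then a hotdog."
near = within_n_words(sentence, "burger", "hotdog", 50)
```

A database with a proper full-text index answers the same question from stored word positions instead of rescanning each document.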
As an aside, and with no malice intended: I often wonder why eXist is overlooked when people build CMSs. It's a database that can store content in its native format (XML, or its subset (x)HTML), query it with ease in its native format, and can translate it from its native format with a powerful templating language which looks and acts like its native format. Sometimes SQL is just plain wrong for the job!
Sorry. Didn't mean to waffle! :-$