Indexing HTML with Sphinx without complex scripts

As far as I know, the Sphinx search engine can index HTML, but it has no built-in drivers for it like it does for SQL data. That means we have to parse and prepare the HTML content ourselves.
Does anyone know of any drivers or third-party add-ons that make Sphinx index HTML automatically?
Can anyone help? Thanks in advance.

Well, if you have a database of the .html filenames, you can use
http://sphinxsearch.com/docs/current.html#conf-sql-file-field
to index them. Sphinx will load each individual file in turn and index its contents.
Combine it with
http://sphinxsearch.com/docs/current.html#conf-html-strip
so that the HTML markup is stripped before indexing.
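As a rough sketch of the "database of the .html filenames" part, the Python below walks a directory and produces (id, path) rows that could be loaded into such a table. The function name `html_file_rows` and the row layout are assumptions made for illustration, not something the answer specifies; the column holding the path would be the one named by `sql_file_field`.

```python
from pathlib import Path


def html_file_rows(root):
    """Return (id, absolute_path) rows for every .html file under root.

    Hypothetical helper: ids start at 1 because Sphinx document ids
    must be unique, non-zero positive integers.
    """
    paths = sorted(p.resolve() for p in Path(root).rglob("*.html"))
    return [(doc_id, str(path)) for doc_id, path in enumerate(paths, start=1)]
```

The rows could then be inserted with whatever MySQL client library you already use, preferably through a parameterized query rather than string concatenation.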

Related

Full Text Search with NodeJS

I want to implement full-text search (FTS) queries in my Node.js application. The database I am using is MySQL. I know that MySQL has built-in support for FTS, but sadly it does not support singular/plural forms, synonyms, or inflectional words.
There are other FTS libraries available that can work with MySQL. These are the two I am interested in:
Lucene
Sphinx Search
I am fairly sure that Sphinx Search has an npm package and can be used with Node.js. I am not sure whether Lucene can be used with Node.js.
Please let me know if Lucene can be used with Node.js, and if so, point me to the documentation.
Thanks !
You have several alternatives, such as query-engine, and several other tools are available for Lucene.
Also, if you want to use FTS with Node, you could have a look at Norch, as suggested by this answer on a similar topic.
Best,
Maintainer of search-index and Norch here. They might be what you are looking for. You can even use your MySQL database as a back end if you want.
https://github.com/fergiemcdowall/search-index (the lib)
https://github.com/fergiemcdowall/norch (the server)

Indexing MySQL data with ElasticSearch

I would like to get some feedback from anyone who's had experience indexing MySQL data with ElasticSearch for full-text searching. How did you accomplish this? I've been researching this a bit and unfortunately I've noticed that ElasticSearch has no official plugin to accomplish this although I've come across three different 3rd party tools:
elasticsearch-river-jdbc
go-mysql-elasticsearch
elasticsearch-river-mysql
I'm unsure which one would be best in terms of performance, although I suspect the Go tool might have an advantage due to its compiled nature and the fact that it uses the MySQL binary logs. I would appreciate any advice or examples anyone could provide.
Thanks!
UPDATE:
You can now use Logstash to do it. Here is a useful blog post about it: click here
Well, you can use MySQL UDFs in MySQL triggers to execute external scripts, which will index the newly updated or inserted data in Elasticsearch. That's one way. Check this for how-tos on MySQL UDFs.
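Neither answer shows what an external indexing script would actually send to Elasticsearch. One possible shape, sketched here under the assumption that the script uses Elasticsearch's bulk API (newline-delimited JSON posted to the `_bulk` endpoint), is below. The function name `bulk_index_body`, the `id` column, and the index name are hypothetical; this is not taken from Logstash or any of the tools listed in the question.

```python
import json


def bulk_index_body(rows, index_name, id_key="id"):
    """Build a bulk-API request body from MySQL rows given as dicts.

    Each row becomes two lines: an "index" action line followed by
    the document itself. The body must end with a newline.
    """
    lines = []
    for row in rows:
        action = {"index": {"_index": index_name, "_id": str(row[id_key])}}
        lines.append(json.dumps(action))
        # default=str is a simplification for dates, decimals, and so on.
        lines.append(json.dumps(row, default=str))
    return "\n".join(lines) + "\n" if lines else ""
```

The resulting string would be sent as the request body with the `Content-Type: application/x-ndjson` header, using whichever HTTP client the script already has.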

Can Lucene, Sphinx (or any other engine) index binary data?

I already have a SQL Server 2008 based app in production, where I am using full-text search by storing the binary (along with the file extension). This means the same column can store doc, xls, pdf, docx, etc. I went for that approach (knowing it would be costly on inserts) because varied file types can be uploaded and I don't want to run into the madness of extracting text from various file types (xls, xlsx, doc, docx, pdf, etc.). Also, I am not aware of any free tools that can do that for me. I don't want to use the filesystem, as that would be insecure and costly to maintain.
Now I am looking at how easy (or difficult) it would be to move to MySQL. There are some options for full-text search with MySQL, for example MySQL full-text search (which does not index binary data), Sphinx, and Solr.
I found this question, which is the closest to what I need, although I guess Sphinx doesn't index binary data. However, by using SphinxSE I can query the MySQL tables and Sphinx to get a related result set (in the same connection). I hope that understanding is correct, but I am not sure about the performance. Can someone add more insight?
From what I have heard, integrating Lucene with MySQL is difficult.
My need is to fetch ranked results based on criteria that can be structured (stored in the RDBMS) and unstructured (textual data that will be indexed).
Also, is there any other option that looks more suitable for my situation?
Have a look at ElasticSearch (it uses Lucene under the hood, like Solr). I think it may do what you require, though I haven't needed document indexing myself, so I haven't tried it.
See here for more information:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-attachment-type.html
It uses Apache Tika to convert the documents to indexable content (the same as SQL Server does with IFilter plugins).

Opening MySql database to search engines

Most of the content in my web application is stored in a MySQL database. I want to open this content up so that search engines can index it.
What is the best way to do this?
"Best" could mean either best performance or easiest to implement.
Thanks in advance!
You can also create a sitemap XML file that could sit at example.com/sitemaps.xml and list the URLs of all blog posts, products, user profiles, etc. in a format Google can understand (more so than a normal webpage).
You can also ping a URL to tell Google to come and check your sitemap whenever you add or edit content.
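A minimal sketch of generating such a file in Python, assuming the standard sitemap protocol (`urlset`, `url`, `loc`, `lastmod` elements in the sitemaps.org namespace). The function name `build_sitemap` and the idea that your database query yields (url, lastmod) pairs are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as UTF-8 bytes.

    entries: iterable of (url, lastmod) pairs, where lastmod is a
    'YYYY-MM-DD' string or None. Hypothetical input shape; in practice
    these would come from a query over your MySQL tables.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc  # special characters are escaped on output
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
```

The returned bytes can be written to the sitemap file by a scheduled job, or served from a dynamic route if the content changes often.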
Assuming you are talking about web based search engines (such as Google), then they index webpages.
Make webpages for all entries in the database and link to them.
Like David said, a webpage should be available for each resource. Not only to force indexing, but also as a "landing page" to which the search result will then direct you. This can then of course be a redirect to another page.
The pages can be dynamic, of course, but make sure that they are referenced somewhere on your site so the spiders can reach them.

Spider that tosses results into mysql

Looking to use Sphinx for site search, but not all of my site is in MySQL. Rather than reinvent the wheel, I'm just wondering if there's an open source spider that easily tosses its findings into a MySQL database so that Sphinx can then index them.
Thanks for any advice.
There's also the XML pipe datasource that can feed documents to Sphinx. Not sure if it'd be any easier to set something up to output your site's content as XML than it would be to insert it into the DB, but it's an option.
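For a sense of what the XML pipe option involves, here is a rough Python sketch that prints documents in the xmlpipe2 layout (`sphinx:docset`, `sphinx:schema`, `sphinx:document`). The function name `xmlpipe2_stream`, the field names `title` and `content`, and the shape of the input dicts are assumptions for illustration; field names are assumed to be plain identifiers.

```python
from xml.sax.saxutils import escape, quoteattr


def xmlpipe2_stream(docs, fields=("title", "content")):
    """Render docs as an xmlpipe2 stream.

    docs: iterable of dicts with an integer 'id' plus one key per field
    (hypothetical shape; a spider or site export would produce these).
    """
    out = ['<?xml version="1.0" encoding="utf-8"?>', "<sphinx:docset>", "<sphinx:schema>"]
    for name in fields:
        out.append("<sphinx:field name=%s/>" % quoteattr(name))
    out.append("</sphinx:schema>")
    for doc in docs:
        out.append("<sphinx:document id=%s>" % quoteattr(str(int(doc["id"]))))
        for name in fields:
            # Escape &, < and > so page text cannot break the XML.
            out.append("<%s>%s</%s>" % (name, escape(doc.get(name, "")), name))
        out.append("</sphinx:document>")
    out.append("</sphinx:docset>")
    return "\n".join(out) + "\n"
```

A script that prints this output could be wired up through an `xmlpipe2` source with `xmlpipe_command` pointing at the script, so the indexer reads the stream directly without a database in between.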
If you're not 100% stuck on using Sphinx, you could consider Lucene, which this site uses. This should work regardless of the underlying technology (database driven or static pages).
I am also currently looking to implement a site search. This question may also help.