How to add HTML and text files to a Sphinx index? - html

From the Sphinx reference manual: «The data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files, mailboxes, and so on»
But I can't find out how to add text files and HTML files to the index. The quick Sphinx usage tour shows the setup for a MySQL database only.
How can I do this?

You should look at the xmlpipe2 data source.
From the manual:
xmlpipe2 lets you pass arbitrary full-text and attribute data to Sphinx in yet another custom XML format. It also allows to specify the schema (ie. the set of fields and attributes) either in the XML stream itself, or in the source settings.
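As a rough illustration of that approach, here is a minimal Python sketch that reads text and HTML files and prints an xmlpipe2 stream. The field names `title` and `content`, the regex-based tag stripping, and the sequential document IDs are assumptions made for this example, not something the manual prescribes. The script would be wired into a source of `type = xmlpipe2` through its `xmlpipe_command` setting.

```python
import html
import re
import sys
from pathlib import Path
from xml.sax.saxutils import escape


def strip_tags(markup: str) -> str:
    """Crude HTML-to-text: drop tags, decode entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", markup)
    return " ".join(html.unescape(text).split())


def xmlpipe2(paths) -> str:
    """Build an xmlpipe2 document set with one document per file."""
    out = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="title"/>',    # assumed field: the file name
        '<sphinx:field name="content"/>',  # assumed field: the file text
        "</sphinx:schema>",
    ]
    for doc_id, path in enumerate(paths, start=1):  # IDs must be unique and non-zero
        path = Path(path)
        raw = path.read_text(encoding="utf-8", errors="replace")
        text = strip_tags(raw) if path.suffix.lower() in {".html", ".htm"} else raw
        # Remove control characters that are not allowed in XML 1.0.
        text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
        out.append(f'<sphinx:document id="{doc_id}">')
        out.append(f"<title>{escape(path.name)}</title>")
        out.append(f"<content>{escape(text)}</content>")
        out.append("</sphinx:document>")
    out.append("</sphinx:docset>")
    return "\n".join(out)


if __name__ == "__main__":
    # Usage: python make_xmlpipe.py file1.html file2.txt ...
    print(xmlpipe2(sys.argv[1:]))
```

Because the IDs here depend on argument order, a real setup would probably want stable IDs (for example from a lookup table) so that search results can be mapped back to files between reindexing runs.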

I would suggest that you insert the texts into a database. That way you can retrieve them, and probably highlight your search results, much more easily and quickly.

Related

Best practice to store HTML in DB

I'm developing a company catalog, where each description is kept in HTML format.
Should I store both the HTML and the plain-text version of the description?
Will this affect the full-text search that will be implemented later?
Of course, I can just strip the HTML tags when rendering.
What is the best practice for this?
You could of course consider doing it the other way around. Store the fields in the database, then use an HTML template to insert the fields in the required places. Then your data is not duplicated, and you can potentially have multiple HTML templates for the same underlying data.
Alternatively, you could store your fields in a single database field in some structured format (e.g. XML), and then transform that into HTML (e.g. with XSL). Note: some databases can understand XML natively; if yours doesn't support this, then you can store individual fields, generate XML from them, then apply XSL to get your HTML.
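The first suggestion (plain fields in the database, HTML produced by a template) could look roughly like the following Python sketch. The field names `name` and `description` and the markup in the template are made up for illustration.

```python
import html
from string import Template

# Hypothetical template; several templates could share the same fields.
COMPANY_TEMPLATE = Template(
    '<div class="company"><h2>$name</h2><p>$description</p></div>'
)


def render_company(fields: dict) -> str:
    """Escape the plain-text fields and insert them into the HTML template."""
    safe = {key: html.escape(str(value)) for key, value in fields.items()}
    return COMPANY_TEMPLATE.substitute(safe)


print(render_company({"name": "Acme & Co", "description": "Anvils and more"}))
```

With this layout the full-text index only ever sees the plain-text fields, so no tag stripping is needed at indexing time.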

Can Lucene, Sphinx (or any other engine) index binary data?

I already have a SQL Server 2008 based app in production, where I am using full-text search by storing the binary (along with the file extension). This means the same column can store doc, xls, pdf, docx, etc. I went for that approach (knowing it would be costly on inserts) because a variety of files can be uploaded, and I don't want to run into the madness of extracting text from the various file types (xls, xlsx, doc, docx, pdf, etc.). Also, I am not aware of any free tools that can do that for me. I don't want to use the filesystem, as that would be insecure and costly to maintain.
Now I am looking at the ease (or difficulty) of moving to MySQL. I do have some options for full-text search with MySQL, for example: MySQL full-text search (which does not index binary), Sphinx and Solr.
I found this question, which is the closest to what I need, although I guess Sphinx doesn't index binary data. However, by using SphinxSE I can query the MySQL tables and Sphinx to get a related result set (in the same connection). I hope that understanding is correct, but I am not sure about the performance. Can someone add more insight?
From what I have heard, integrating Lucene with MySQL is difficult.
My need is to fetch ranked results based on criteria which can be structured (stored in an RDBMS) and unstructured (textual data which shall be indexed).
Also, is there any other option which looks more suitable in my situation?
Have a look at ElasticSearch (it uses Lucene under the hood, like Solr). I think it may do what you require. I haven't needed document indexing, though, so I have not tried it.
See here for more information:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-attachment-type.html
It uses Apache Tika to convert the documents to indexable content (the same as SQL Server does with IFilter plugins).

iOS multi language from database

A year ago I created an application in Dutch. Now I want to make this app multi-language. I read that Xcode has localized strings, but all my text is downloaded from a MySQL database in an external location, so there is no local text.
Do I need to create this from the ground up? My idea was to read the user's preferred language setting and then point to the right table in the database. Is this the best way to support a multi-language application from a database?
By the way, the current method is just downloading the desired content from MySQL with PHP and JSON.
You need to re-create your application only to the extent that you need to store your text as part of your application. Whatever else you're loading from your database shouldn't be an issue. But Cocoa's (not Xcode's) localization scheme dictates that the text to be localized be stored as part of the application.
That being said, what do you do? Start by reading Apple's very own documentation on the subject. There's also a link within that, Preparing Your Nib Files for Localization, that you should read as well.
You'll need to create a Localizable.strings file for each language you wish to support. Each of these files contains key/value pairs as described in the documentation. The key is a string that can be any arbitrary value, but it has to remain consistent across all of your Localizable.strings files. The value is the string rendered in the given language for that file.
Think about why you're loading your text from a database. It might be because some of it needs to be updated, but surely not all of it.
Best wishes to you in your endeavors ahead.

Can MySQL load from XML directly?

I am aware of the batch LOAD XML technique, e.g. Load XML Update Table--MySQL
Can MySQL insert/replace rows directly from XML? I'd like to pass an XML string to MySQL.
Something like replace into user XML VALUES, maybe even using as to map the tags to the column names?
The primary thing is that I don't want to parse the XML in my code; I'd like MySQL to handle this. I don't have a file, I have the XML as a string.
I have looked and found that there are some XML functions:
12.11. XML Functions
The XML functions can do XPath, but I think this is a little fiddly, as I have a 1:1 mapping from the XML to the table structure, so I'd just like to be able to say: hey MySQL, insert the values in the XML string into the table.
Is this possible?
In a nutshell, no.
What you're looking for is an XML storage engine for MySQL. There has never been one created officially, and I have never seen a third-party one either (but feel free to google).
If you really want to achieve this, then the closest you would get is to look for an alternative (R)DBMS, but that might not support the type of queries you wish to perform, may involve a bit of a learning curve, would no doubt require a server with superuser access, and could potentially mean refactoring a lot of your code.
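If the mapping really is 1:1, the client-side fallback can at least be kept small. The following Python sketch is only an illustration: it uses the standard-library `sqlite3` module as a stand-in for a MySQL connection (a MySQL driver would typically use `%s` placeholders instead of `?`), and the `user` table columns and the `ALLOWED_COLUMNS` whitelist are assumptions for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical columns of the `user` table; the whitelist keeps arbitrary
# tag names from ending up in the SQL text.
ALLOWED_COLUMNS = {"id", "name", "email"}


def replace_user_from_xml(conn, xml_string: str) -> None:
    """Map the child tags of one XML row element 1:1 onto table columns."""
    row = ET.fromstring(xml_string)
    values = {
        child.tag: child.text for child in row if child.tag in ALLOWED_COLUMNS
    }
    if not values:
        raise ValueError("no known columns found in XML")
    columns = ", ".join(values)
    placeholders = ", ".join("?" for _ in values)
    conn.execute(
        f"REPLACE INTO user ({columns}) VALUES ({placeholders})",
        list(values.values()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
replace_user_from_xml(
    conn, "<user><id>1</id><name>Ann</name><email>ann@example.com</email></user>"
)
```

The values are always passed as bound parameters, so only the whitelisted column names are interpolated into the statement.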

MySQL or XML files or something else?

I have a lot of text data broken down into articles, and I anticipate that once I release this data a lot of people will be viewing it.
So do I house the data in a MySQL database and let each person query it, or should I use XML, with one article per XML file, and just parse it on the fly?
I am using PHP, MySQL.
I think that MySQL would be a lot faster?
What about some other format to store all this data in?
All things considered, you should look into NoSQL and document oriented databases.
As always, there's more than one way to skin a cat.
One way would be to store the articles in a database and use full-text indexes.
You could also consider reformatting the data into HTML (stored in a structure of files and directories) and indexing the articles using Nutch or Solr.