Sphinx and Big Data - mysql

I would like to use a full-text search engine and I have decided on Sphinx. But I am working with Hadoop and a big data platform, and Sphinx Search is typically paired with a MySQL database, which cannot handle big data.
So is there a way to use Sphinx with big data environments like Hadoop/HDFS, or with any other NoSQL database?

Well, Sphinx comes with built-in drivers for loading data from RDBMSs, but it is certainly not limited to them.
For starters, there are the 'pipe' indexing options:
http://sphinxsearch.com/docs/current.html#xmlpipe2
http://sphinxsearch.com/docs/current.html#xsvpipe
These just run a script and index the output. That script can fetch the data from just about any system imaginable.
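As an illustration, a minimal xmlpipe2 generator might look like the following. The field names and sample rows are made up for the sketch; in practice the rows would be fetched from HDFS, MongoDB, an API, or whatever system holds your data, and Sphinx's indexer would run this script and consume its stdout.

```python
# Minimal xmlpipe2 generator: emits the XML stream Sphinx's indexer
# expects on stdout, for rows pulled from any source.
import sys
from xml.sax.saxutils import escape

def emit_xmlpipe2(rows, out=sys.stdout):
    """rows: iterable of (doc_id, title, content) tuples."""
    out.write('<?xml version="1.0" encoding="utf-8"?>\n')
    out.write('<sphinx:docset>\n')
    out.write('<sphinx:schema>\n')
    out.write('  <sphinx:field name="title"/>\n')
    out.write('  <sphinx:field name="content"/>\n')
    out.write('</sphinx:schema>\n')
    for doc_id, title, content in rows:
        out.write('<sphinx:document id="%d">\n' % doc_id)
        out.write('  <title>%s</title>\n' % escape(title))
        out.write('  <content>%s</content>\n' % escape(content))
        out.write('</sphinx:document>\n')
    out.write('</sphinx:docset>\n')

if __name__ == "__main__":
    # Replace this with a fetch from HDFS, MongoDB, an API, etc.
    sample = [(1, "first doc", "hello <world>"), (2, "second doc", "more text")]
    emit_xmlpipe2(sample)
```

You would then point an xmlpipe2 source at this script in sphinx.conf and run the indexer as usual.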
There are plenty of projects you can use to get started; a random example:
https://github.com/georgepsarakis/mongodb-sphinx
You might also be able to ingest CSV output from Hadoop directly.
There are also real-time indexes, where data is inserted directly into an index on the fly. I'm not a Hadoop expert, but in theory a Hadoop job could inject its results directly into Sphinx (via the OutputCommitter?), rather than (or in addition to) writing them to HDFS.
http://sphinxsearch.com/docs/current.html#rt-indexes
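For the real-time route, searchd speaks the MySQL wire protocol (port 9306 by default), so any MySQL client library can push documents in via SphinxQL. A rough sketch, assuming a hypothetical RT index named rt_docs with title and content fields, and using pymysql as just one example client:

```python
# Pushing a document into a Sphinx real-time index over SphinxQL.
# The index name "rt_docs" and its fields are assumptions for this sketch.

def build_rt_insert(index, doc_id, title, content):
    """Build a parameterized SphinxQL INSERT for a real-time index."""
    sql = "INSERT INTO %s (id, title, content) VALUES (%%s, %%s, %%s)" % index
    return sql, (doc_id, title, content)

def push_document(doc_id, title, content):
    # Imported lazily so the statement builder above has no dependencies;
    # any MySQL client works, pymysql is one choice.
    import pymysql
    conn = pymysql.connect(host="127.0.0.1", port=9306)  # searchd, not MySQL
    try:
        sql, params = build_rt_insert("rt_docs", doc_id, title, content)
        with conn.cursor() as cur:
            cur.execute(sql, params)
    finally:
        conn.close()
```

A Hadoop job's output step could call something like push_document per result record instead of (or as well as) writing to HDFS.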
You might also be able to use something like
https://www.percona.com/blog/2014/06/02/using-infinidb-engine-mysql-hadoop-cluster-data-analytics/
as a bridge between Hadoop and Sphinx (i.e. Sphinx's indexer builds an index through the MySQL-compatible InfiniDB engine).

Related

Implementing a search with Elasticsearch using mysql data

I am new to Elasticsearch. I was using MySQL Full Text features till now.
I want to keep MySQL as my primary database and use Elasticsearch alongside it as the search engine for my website. I ran into several problems when thinking about this. The main one is syncing between the MySQL database and Elasticsearch.
Some say to use Logstash. But even if I use it, would I still need to write separate functions in my program for database transactions and Elasticsearch indexing?
You will need to run a periodic job that does a full reindex and/or send individual document updates to ES as they happen. Logstash sounds ill-suited for this purpose; you just need the usual ES API to index documents.
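To make the "usual ES API" point concrete, here is a rough sketch of the dual-write approach: after committing a row to MySQL, the application indexes the same record into Elasticsearch with the standard client. The index name "products" and the field names are assumptions, not anything from your actual schema.

```python
# Dual-write sync: map the MySQL row to a JSON document and index it
# into Elasticsearch right after the database transaction commits.

def to_es_doc(row):
    """Map a MySQL row (as a dict) to the document Elasticsearch stores."""
    return {
        "title": row["title"],
        "body": row["body"],
        "updated_at": row["updated_at"],
    }

def index_row(row):
    # Imported lazily so the mapping helper above has no dependencies.
    from elasticsearch import Elasticsearch
    es = Elasticsearch("http://localhost:9200")
    es.index(index="products", id=row["id"], document=to_es_doc(row))
```

The periodic full reindex is then just a loop over a SELECT of the whole table calling the same mapping, which also repairs any documents the incremental updates missed.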

Only solr or with Mysql

I want to use Solr for my search index. What confuses me is: should I put most of the data fields in Solr, or only search for the ID and then get the data from MySQL? Please help. Which is faster and better?
I had the same question in 2010 and decided to use Solr as a search index only: get a list of IDs in the first step, then read the data related to those IDs from MySQL in the second step.
That worked fine in an environment with 20 million docs.
During a reconstruction of the whole application in 2014, we decided to additionally store the data in Solr (not only index it) in order to fetch whole documents during a search, so that the MySQL connection is no longer necessary.
We are talking about a web application with a maximum of only 1-3 thousand parallel users, and there is absolutely no perceived difference in application speed between the 2010 and 2014 versions.
But there are some benefits, if you take the documents from Solr, not MySql.
The application code is a bit cleaner.
You only need one connection to get the data.
But the main reason we began storing the documents in Solr is that we needed the highlighting feature. That only works well if you store the docs in Solr and fetch them from Solr too.
Btw: there is no change in search performance whether you store the docs or not.
The disadvantage is, that you have to hold the data twice:
1.) in MySQL as the base dataset and
2.) in Solr for your application.
And if you have very big documents, Solr is probably not the right tool to serve them.
Putting all the data into Solr will absolutely be faster: you save yourself from making two queries, and you remove the need for a slow piece of code (PHP or whatever) to bridge the gap by pulling the IDs out of Solr and then querying MySQL. Alternatively, you could put everything into MySQL, which would be of comparable speed. I.e. choose the technology that suits your needs best, but don't mix them unnecessarily for performance reasons. A good comparison of when you might want to use Solr vs. MySQL can be found here.
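To make the two access patterns concrete, here is a sketch using Solr's standard select handler and its fl parameter, which controls which stored fields come back. The host, core name, and fields are illustrative assumptions:

```python
# Two ways to read search results: IDs only (then join against MySQL),
# or the full stored documents straight from Solr.
from urllib.parse import urlencode

SOLR = "http://localhost:8983/solr/docs/select"

def solr_query_url(q, fields):
    """Build a select URL returning only the given stored fields."""
    params = {"q": q, "fl": ",".join(fields), "wt": "json"}
    return SOLR + "?" + urlencode(params)

# Pattern 1: fetch only IDs, then hit MySQL with "WHERE id IN (...)".
ids_url = solr_query_url("title:hadoop", ["id"])

# Pattern 2: fetch the stored document fields in one round trip.
docs_url = solr_query_url("title:hadoop", ["id", "title", "body"])
```

Pattern 2 requires the fields to be marked stored="true" in the Solr schema; pattern 1 works with index-only fields but costs the extra MySQL round trip.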

Right database for machine learning on 100 TB of data

I need to perform classification and clustering on about 100tb of web data and I was planning on using Hadoop and Mahout and AWS. What database do you recommend I use to store the data? Will MySQL work or would something like MongoDB be significantly faster? Are there other advantages of one database or the other? Thanks.
The simplest and most direct answer would be to just put the files directly in HDFS or S3 (since you mentioned AWS) and point Hadoop/Mahout directly at them. Other databases have different purposes, but Hadoop/HDFS is designed for exactly this kind of high-volume, batch-style analytics. If you want a more database-style access layer, then you can add Hive without too much trouble. The underlying storage layer would still be HDFS or S3, but Hive can give you SQL-like access to the data stored there, if that's what you're after.
Just to address the two other options you brought up: MongoDB is good for low-latency reads and writes, but you probably don't need that. And I'm not up on all the advanced features of MySQL, but I'm guessing 100TB is going to be pretty tough for it to deal with, especially when you start getting into large queries that access all of the data. It's more designed for traditional, transactional access.
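As a sketch of the Hive layer mentioned above: an EXTERNAL TABLE overlays a SQL schema on files already sitting in HDFS or S3 without copying anything. The path, table name, and column list here are assumptions for illustration, and PyHive is just one common client:

```python
# HiveQL DDL mapping an existing S3/HDFS directory to a queryable table.
# Nothing is moved: Hive reads the files in place.
CREATE_PAGES = """
CREATE EXTERNAL TABLE IF NOT EXISTS pages (
    url STRING,
    fetched_at BIGINT,
    body STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION 's3://my-bucket/crawl-data/'
"""

def run_ddl(statement):
    # Imported lazily so the DDL string can be inspected without the
    # dependency installed; assumes a HiveServer2 on the default port.
    from pyhive import hive
    conn = hive.Connection(host="localhost", port=10000)
    conn.cursor().execute(statement)
```

After that, ordinary SELECT queries against pages run as Hadoop jobs over the underlying files.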

With Solr, Do I Need A SQL db as well?

I'm thinking about using Solr to implement spatial and text indexing. At the moment, I have entries going into a MySQL database as well as Solr. When Solr starts, it reads all the data from MySQL. As new entries come in, my web servers write them to MySQL and, at the same time, add documents to Solr. More and more, it seems that my MySQL implementation is just becoming a write-only persistent store (more or less a backup for the data in Solr); all of the reading of entries is done via Solr queries. Really the only data being read from MySQL is user info, which doesn't need to be indexed/searched.
A few questions:
Do I really need the MYSQL implementation or could I simply store all of my data in solr?
If solr only, what are the risks associated with this solution?
Thanks!
Almost always, the answer is yes. It needn't necessarily be a database, but you should retain the original data somewhere outside of Solr in case you ever alter how you index the data in Solr. Unlike most databases (which Solr is not), Solr can't simply re-index itself. You could hypothetically configure your schema so that all your original data is marked as "stored", then do a CSV dump and re-index that way, but I wouldn't recommend that approach.
Shameless plug: For any information on using Solr, I recommend my book.
I recommend a separate repository. MySQL is one choice. Some people use the filesystem.
You often want a different schema for searching than for storing. That is easy to do with a separate repository.
When you change the Solr schema, you need to reload the content. Unloading all the content from Solr can be slow. If it is already in a separate repository, you don't need to dump it from Solr; you can just overwrite what is there.
In general, making Solr be both a search engine and a repository really reduces your flexibility and options for making search the best it can be.
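As a sketch of what re-indexing from the separate repository can look like: stream rows out of the repository and post them to Solr's JSON update handler in batches, so a schema change never requires dumping data back out of Solr itself. The host, core name, and batch size below are illustrative assumptions.

```python
# Bulk re-index: batch rows from the external repository and POST each
# batch to Solr's JSON update handler.
import json

def to_batches(rows, size=500):
    """Group an iterable of row dicts into lists for bulk updates."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch

def post_batch(batch):
    from urllib.request import Request, urlopen
    req = Request(
        "http://localhost:8983/solr/docs/update?commit=false",
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urlopen(req).read()  # issue a single commit after the last batch
```

Deferring the commit until all batches are in is considerably faster than committing per batch.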

Pulling data from MySQL into Hadoop

I'm just getting started with learning Hadoop, and I'm wondering the following: suppose I have a bunch of large MySQL production tables that I want to analyze.
It seems like I have to dump all the tables into text files, in order to bring them into the Hadoop filesystem -- is this correct, or is there some way that Hive or Pig or whatever can access the data from MySQL directly?
If I'm dumping all the production tables into text files, do I need to worry about affecting production performance during the dump? (Does it depend on what storage engine the tables are using? What do I do if so?)
Is it better to dump each table into a single file, or to split each table into 64 MB (or whatever my block size is) files?
Importing data from MySQL can be done very easily. I recommend using Cloudera's Hadoop distribution; it comes with a program called 'sqoop' which provides a very simple interface for importing data straight from MySQL (other databases are supported too).
Sqoop can be used with mysqldump or with a normal MySQL query (select * ...).
With this tool there's no need to manually partition tables into files, but for Hadoop it's much better to have one big file.
Useful links:
Sqoop User Guide
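For reference, a typical Sqoop import can also be driven from a script. The connection details and table name below are placeholders; the flags themselves (--connect, --table, --target-dir, --num-mappers) are standard Sqoop options.

```python
# Build and run a Sqoop import of one MySQL table into HDFS.
import subprocess

def sqoop_import_cmd(jdbc_url, table, target_dir, mappers=4):
    """Assemble the argv for `sqoop import`; mappers controls parallelism
    (and how many output files land in target_dir)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]

def run_import(jdbc_url, table, target_dir):
    subprocess.check_call(sqoop_import_cmd(jdbc_url, table, target_dir))
```

Dropping --num-mappers to 1 also reduces the load the import places on the production database, at the cost of a slower transfer.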
2)
Since I don't know your environment, I will err on the safe side: YES, worry about affecting production performance.
Depending on the frequency and quantity of data being written, you may find that it processes in an acceptable amount of time, particularly if you are just writing new/changed data. [subject to complexity of your queries]
If you don't require real-time data, or your servers typically have periods when they are under-utilized (overnight?), then you could create the files at that time.
Depending on how your environment is set up, you could replicate/log-ship to specific DB server(s) whose sole job is to create your data file(s).
3)
No need for you to split the file; HDFS will take care of partitioning the data file into blocks and replicating it over the cluster. By default it will automatically split into 64 MB data blocks.
see - Apache - HDFS Architecture
re: Wojtek's answer - Sqoop link (links don't work in comments)
If you have more questions or specific environment info, let us know
HTH
Ralph