Database for Full Text Search and 200M+ Records - mysql

I am about to create a huge database with at least 200 million entries.
The database needs to be searchable using full text and should be fast.
My database gets data from many different data sources, and I need to import new or updated data regularly.
Is it a good idea to store all my data in a relational database like MySQL and then create a NoSQL document database (e.g. MongoDB or Elasticsearch) just for searching, or does that not provide any benefit in terms of reliability and the prevention of redundant information?

I believe that keeping primary records in a SQL database and duplicating them to a noSQL database is a very common approach.
ElasticSearch has an ongoing status page about their resiliency. Even in the newest version, ElasticSearch can lose data in a number of different situations. A major change in the structure of an ElasticSearch index (such as adding analyzers) requires that you re-index all of the documents. This process is safer if you have another source for the documents. At the end of the day, ElasticSearch isn't designed to consistently store documents - I would only ever choose to use ElasticSearch as the primary store in situations where occasional data loss isn't a disaster.
Unlike ElasticSearch, MongoDB is designed to be resilient. You should be able to safely store documents in MongoDB. I've found trying to do full text searches in MongoDB can be a little painful, at least compared to ElasticSearch. In my opinion, for text search, the only advantage MongoDB has over MySQL's FULLTEXT is that it is distributed.
We are running ElasticSearch and MySQL right now - and the benefits greatly outweigh the hassles of extra infrastructure and dealing with replication between the two. We had previously attempted to use a NoSQL solution as the primary datastore, with disastrous results. Running ES in conjunction with MySQL gets you the best of both worlds - consistency & safety of data in SQL, with scalable, effective full-text search in ES.
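To make that replication step concrete, here is a minimal sketch in Python of pushing new or changed rows from MySQL into Elasticsearch. The table, column, and index names are hypothetical, and it assumes the mysql-connector-python and elasticsearch packages; a production version would use the bulk helpers and persist the sync timestamp between runs.

```python
# Minimal sketch: replicate new/updated rows from MySQL into Elasticsearch.
# Table, column, and index names are hypothetical.
import mysql.connector
from elasticsearch import Elasticsearch

db = mysql.connector.connect(host="localhost", user="app",
                             password="secret", database="catalog")
es = Elasticsearch("http://localhost:9200")

last_sync = "1970-01-01 00:00:00"  # in practice, loaded from the previous run

cursor = db.cursor(dictionary=True)
cursor.execute("SELECT id, title, body, updated_at FROM articles "
               "WHERE updated_at > %s", (last_sync,))

for row in cursor:
    # Reuse the MySQL primary key as the ES document id, so re-syncing
    # updates documents in place instead of duplicating them.
    es.index(index="articles", id=row["id"], document={
        "title": row["title"],
        "body": row["body"],
        "updated_at": row["updated_at"].isoformat(),
    })
```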

I don't know how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret and Solr), running some benchmarks.

Related

Solr-ish Query API on top of relational database

I have a data source sitting in a relational database. I managed to index/store everything into Solr and was thrilled to see the search performance and the awesome API (search/admin, etc.).
However, people say that if your data is truly structured, a relational database should be fast if you index everything. Yet even if I dump all the data into a relational database like MySQL, what I am missing is all the beautiful query API.
I guess my question is:
Is it possible to use only the Solr-ish query API, with a relational database as the backend instead of an index at all?
If that is not possible, is there any mature project/product that can build a full-stack query API on a relational database?
Document search engines and relational databases serve different usage patterns. If you're using Solr for anything that involves tokenization and analysis chains, replicating that in an RDBMS requires implementing that functionality yourself (or using just a subset, such as the full-text indices in certain RDBMSes). I detailed some of these differences and features in Should I just query the database or use a proper search engine solution?.
It's usually better to use the RDBMS as the main storage for your data and then push it into the search index as required. This also lets you pick up new features from those who care about search and the problem it tries to solve, without having to wait for a niche product to implement them on top of your RDBMS (there are still quite a few new features in each iteration of Lucene, Elasticsearch and Solr).
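For a feel of what that query API buys you, here is a minimal sketch of a faceted full-text query against Solr's HTTP API; the core name ("products") and field names are hypothetical, and it assumes the requests package. The analysis chain and faceting behind this one call are exactly the parts that are painful to rebuild on a plain RDBMS.

```python
# Minimal sketch: a faceted, analyzed full-text query via Solr's HTTP API.
# Core and field names are hypothetical.
import requests

resp = requests.get("http://localhost:8983/solr/products/select", params={
    "q": "title:laptop",     # runs through the field's analysis chain
    "facet": "true",
    "facet.field": "brand",  # per-brand counts, computed by the index
    "rows": 10,
    "wt": "json",
})
data = resp.json()
print(data["response"]["numFound"], "hits")
for doc in data["response"]["docs"]:
    print(doc["id"], doc.get("title"))
print(data["facet_counts"]["facet_fields"]["brand"])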

MongoDB vs MySQL storage space comparison

I am building a data warehouse in the range of 15+ TB. While storage is cheap, due to a limited budget we have to squeeze as much data as possible into that space while maintaining performance and flexibility, since the data format changes quite frequently.
I tried Infobright (community edition) as a SQL solution, and it works wonderfully in terms of storage and performance, but the limitations on data/table alteration make it almost a no-go, and Infobright's pricing for the enterprise version is quite steep.
After checking out MongoDB, it seems promising except for one thing. I was in a chat with a 10gen guy, and he stated that they don't give much thought to storage space, since they flatten out the data to achieve performance and flexibility, and in their opinion storage is too cheap nowadays to be bothered with.
So can any experienced Mongo user out there comment on its storage space vs MySQL (as that is the standard we are comparing against right now)? If it's larger or smaller, can you give a rough ratio? I know it's very dependent on what sort of data you put in SQL and how you define the fields, indexing and such... but I am just trying to get a general idea.
Thanks in advance for the help!
MongoDB is not optimized for small disk space - as you've said, "disk is cheap".
From what I've seen and read, it's pretty difficult to estimate the required disk space due to:
Padding of documents to allow in-place updates
Attribute names are stored in each document, so you might save quite a bit by using abbreviations
No built-in compression (at the moment)
...
IMHO the general approach is to build a prototype, insert data, and see how much disk space your specific use case requires. The more realistically you can model your queries (inserts and updates), the better your results will be.
For more details see http://www.mongodb.org/display/DOCS/Excessive+Disk+Space as well.
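As a minimal sketch of that prototype-and-measure approach (assuming the pymongo package and a local mongod; the database/collection names and document shape are made up):

```python
# Minimal sketch: insert representative documents, then read back the
# collection statistics to estimate disk usage.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.prototype

# Short attribute names ("n", "p") instead of "name", "price" - per the
# answer above, key names are stored in every document, so this saves space.
docs = [{"n": f"item-{i}", "p": i * 0.99, "tags": ["a", "b"]}
        for i in range(100_000)]
db.items.insert_many(docs)

stats = db.command("collstats", "items")
print("document data size:", stats["size"])          # uncompressed BSON bytes
print("storage size on disk:", stats["storageSize"])
print("total index size:", stats["totalIndexSize"])
```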
Pros and Cons of MongoDB
For the most part, users seem to like MongoDB. Reviews on TrustRadius give the document-oriented database 8.3 out of 10 stars.
Some of the things that authenticated MongoDB users say they like about the database include its:
Scalability.
Readable queries.
NoSQL.
Change streams and graph queries.
A flexible schema for altering data elements.
Quick query times.
Schema-less data models.
Easy installation.
Users also have negative things to say about MongoDB. Some cons reported by authenticated users include:
User interface, which has a fairly steep learning curve.
Lack of joins, which can make some data retrieval projects difficult.
Occasional slowness in the cloud environment.
High memory consumption.
Poorly structured documentation.
Lack of built-in analytics.
Pros and Cons of MySQL
MySQL gets a slightly higher rating (8.6 out of 10 stars) on TrustRadius than MongoDB. Despite the higher rating, authenticated users still mention plenty of pros and cons of choosing MySQL.
Some of the positive features that users mention frequently include MySQL’s:
Portability that lets it connect to secondary databases easily.
Ability to store relational data.
Fast speed.
Excellent reliability.
Exceptional data security standards.
User-friendly interface that helps beginners complete projects.
Easy configuration and management.
Quick processing.
Of course, even people who enjoy using MySQL find features that they don’t like. Some of their complaints include:
Reliance on SQL, which creates a steeper learning curve for users who do not know the language.
Lack of support for full-text search in InnoDB tables (addressed in MySQL 5.6 and later).
Occasional stability issues.
Dependence on add-on features.
Limitations on fine-tuning and common table expressions.
Difficulties with some complex data types.
MongoDB vs MySQL Performance
When comparing the performance of MongoDB and MySQL, you must consider how each database will affect your projects on a case-by-case basis. While some performance features may appear to be objectively promising, your team members may never use the features that drew you to a database in the first place.
MongoDB Performance
Many people claim that MongoDB outperforms MySQL because it allows them to create queries in multiple ways. To put it another way, MongoDB can be used without knowing SQL. While the flexibility improves MongoDB's performance for some organizations, SQL queries will suffice for others.
MongoDB is also praised for its ability to handle large amounts of unstructured data. Depending on the types of data you collect, this feature could be extremely useful.
MongoDB does not bind you to a single vendor, giving you the freedom to improve its performance. If a vendor fails to provide you with excellent customer service, look for another vendor.
MySQL Performance
MySQL performs extremely well for teams that want an open-source relational database that can store information in multiple tables. The performance that you get, however, depends on how well you configure the MySQL database. Configurations should differ depending on the intended use. An e-commerce site, for example, might need a different MySQL configuration than a team of research scientists.
No matter how you plan to use MySQL, the database’s performance gets a boost from full-text indexes, a high-speed transactional system, and memory caches that prevent you from losing crucial information or work.
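As an illustration of the full-text indexes mentioned above, here is a minimal sketch using MySQL's MATCH ... AGAINST. The table and column names are hypothetical, and it assumes the mysql-connector-python package and MySQL 5.6+ (where InnoDB gained FULLTEXT support):

```python
# Minimal sketch: create a full-text index and run a ranked full-text query.
# Table and column names are hypothetical.
import mysql.connector

db = mysql.connector.connect(host="localhost", user="app",
                             password="secret", database="catalog")
cursor = db.cursor()

# One-time setup: a full-text index over the searchable columns.
cursor.execute("CREATE FULLTEXT INDEX ft_articles ON articles (title, body)")

# Natural-language full-text query, ordered by relevance score.
cursor.execute(
    "SELECT id, title, MATCH(title, body) AGAINST (%s) AS score "
    "FROM articles "
    "WHERE MATCH(title, body) AGAINST (%s) "
    "ORDER BY score DESC LIMIT 10",
    ("database performance", "database performance"),
)
for row in cursor.fetchall():
    print(row)
```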
If you don’t get the performance that you expect from MySQL data warehouses and databases, you can improve performance by integrating them with an excellent ETL tool that makes data storage and manipulation easier than ever.
MySQL vs MongoDB Speed
In most speed comparisons between MySQL and MongoDB, MongoDB is the clear winner. MongoDB is much faster than MySQL at accepting large amounts of unstructured data. When dealing with large projects, it's difficult to say how much faster MongoDB is than MySQL. The speed you get depends on a number of factors, including the bandwidth of your internet connection, the distance between your location and the database server, and how well you organise your data.
If all else is equal, MongoDB should be able to handle large data projects much faster than MySQL.
Choosing Between MySQL and MongoDB
Whether you choose MySQL or MongoDB probably depends on how you plan to use your database.
Choosing MySQL
For projects that require a strong relational database management system, such as storing data in a table format, MySQL is likely to be the better choice. MySQL is also a great choice for cases requiring data security and fault tolerance. MySQL is a good choice if you have high-quality data that you've been collecting for a long time.
Keep in mind that to use MySQL, your team members will need to know SQL. You'll need to provide training to get them up to speed if they don't already know the language.
Choosing MongoDB
When you want to use data clusters and search languages other than SQL, MongoDB may be a better option. Anyone who knows how to code in a modern language will be able to get started with MongoDB. MongoDB is also good at scaling quickly, allowing multiple teams to collaborate, and storing data in a variety of formats.
Because MongoDB does not use data tables to make browsing easy, some people may struggle to understand the information stored there. Users can grow accustomed to MongoDB's document-oriented storage system over time.

Database: MongoDB or MySQL

I'm creating a search engine for deals, discounts and coupons. First my engine collects deals from some sites and writes those deals into a database. So the records have:
records: name, discount, price, latitude, longitude
Right now I'm using MySQL, but would my search engine be faster if I used MongoDB, since all the results are in a similar JSON format?
Which is the better solution if I have 1,000,000 records, MySQL or MongoDB? I need faster searching.
http://test.pluspon.com
For your use case MongoDB would indeed be faster.
You can easily implement processing with multiple mongos routers in a sharded environment; there would not be any blocking, and even more performance gain for your use case.
But keep in mind that speed benchmarks and fast data processing are not the only things you should care about. MongoDB is still at a very young age compared to more mature enterprise databases. But for the use case you describe, I would advise going with it.
Also, as commented, there are other NoSQL databases that could serve you even better in some cases. Read up on this blog for more background.
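For what it's worth, here is a minimal sketch of how the deal records from the question might look in MongoDB, with a text index for search and a 2dsphere index for "deals near me" queries on the latitude/longitude fields. The database/collection names are hypothetical, and it assumes the pymongo package:

```python
# Minimal sketch: store deal records with text search and geo queries.
from pymongo import MongoClient, TEXT, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")
deals = client.pluspon.deals

# Text index for search, 2dsphere index for location queries.
deals.create_index([("name", TEXT)])
deals.create_index([("location", GEOSPHERE)])

deals.insert_one({
    "name": "50% off pizza",
    "discount": 50,
    "price": 4.99,
    # GeoJSON points are [longitude, latitude].
    "location": {"type": "Point", "coordinates": [16.44, 43.51]},
})

# Full-text search over deal names.
for deal in deals.find({"$text": {"$search": "pizza"}}):
    print(deal["name"])

# Deals within 5 km of a point.
near = {"$near": {"$geometry": {"type": "Point",
                                "coordinates": [16.44, 43.51]},
                  "$maxDistance": 5000}}
for deal in deals.find({"location": near}):
    print(deal["name"], deal["price"])
```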

Which strategy to use for designing a log data storage?

We want to design a data store with a relational database keeping the request message (HTTP/S, XMPP, etc.) logs. For generating logs we use a solution based on the Apache Synapse ESB. However, since we want to store the logs and read them only for maintenance issues, the read/write ratio will be low (the write volume will be intensive, since the system will receive many messages to be logged). We thought of using Cassandra for its distributed nature and clustering capabilities. However, with Cassandra schemas, search queries with filters are difficult, always requiring secondary indexes.
To keep it short, my question is: should we try MySQL's clustering solutions, or use Cassandra with a suitable schema design for search queries with filters?
If you wish to do real-time analytics over your semi-structured or unstructured data, you can go with a Cassandra + Hadoop cluster; the Cassandra wiki itself suggests the DataStax Brisk edition for that kind of architecture. It is worth giving it a try.
On the other hand, if you wish to do real-time queries over raw logs for a small set of data, e.g.
select useragent from raw_log_table where id='xxx'
then you should do a lot of research on your row key and column key design, because that decides the complexity of the query. Have a look at the case studies here: http://www.datastax.com/cassandrausers1
Regards,
Tamil
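A minimal sketch of the row key advice above: the id lookup from the example query maps directly to the partition key. The keyspace, table, and extra column names are hypothetical, and it assumes the cassandra-driver package:

```python
# Minimal sketch: a Cassandra log table whose partition key matches the
# query pattern (fetch one log row by id).
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS logs
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS logs.raw_log_table (
        id text PRIMARY KEY,   -- partition key: lookups by id are cheap
        useragent text,
        protocol text,         -- http/https/xmpp, per the question
        body text,
        logged_at timestamp
    )
""")

# The query from the answer - efficient because id is the partition key.
row = session.execute(
    "SELECT useragent FROM logs.raw_log_table WHERE id=%s", ("xxx",)
).one()
print(row)
```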

Solr only vs. Solr/MySQL solution

Currently I have a system based solely on Solr, meaning I store all data in Solr (using SolrJ) with no other datastore involved. The problem is that I am now experiencing some performance issues. I thought it might make sense to store the data in MySQL and then synchronize it with Solr, e.g. via the DataImportHandler, so that read operations hit the Solr index and the main write operations go to MySQL, with only occasional Solr writes when synchronizing.
The thing is that I expect hundreds of millions of documents to be stored, and I don't really know whether the MySQL/Solr combination makes sense.
Is there a better solution? Maybe a Solr master for writing and Solr slaves for reading?
Update: What I forgot to say is that the "store data in MySQL" solution could also be useful in case of a schema.xml change, in my opinion, because I could then re-index all the data without depending on Solr's self-stored data.
It's not preferable to use the same Solr instance for both reading and writing, as the write-side activity on Solr (commits and optimizes) would heavily impact read operations.
A master-slave configuration would be a nicer approach, with the master primarily for writes and the slaves for read-only purposes.
The slaves are periodically refreshed with the contents from the master (so there would be some delay).
You can always scale by adding multiple slaves.
Using MySQL as a persistent store with master-slave Solr would be the best approach.
MySQL provides a stable data store and would guard you against index corruption or other issues that would result in data loss.
Using the DataImportHandler you can do this easily with incremental (delta) updates, though there would be more lag before the latest data appears on the slaves.
With this you can also use index swapping for full refreshes.
In case the index grows too huge to be maintainable and impacts performance, you may want to look at Solr shards.
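As a rough sketch of driving those incremental updates from a script, assuming a DataImportHandler is already configured at /dataimport for a hypothetical core named "articles" (uses the requests package; what counts as "changed" is defined by the deltaQuery in data-config.xml):

```python
# Minimal sketch: trigger and monitor a DataImportHandler delta-import.
import requests

base = "http://localhost:8983/solr/articles/dataimport"

# Pull only rows changed since the last import, then commit.
requests.get(base, params={"command": "delta-import", "commit": "true",
                           "wt": "json"})

# Poll the handler for progress.
status = requests.get(base, params={"command": "status",
                                    "wt": "json"}).json()
print(status["status"])  # e.g. "busy" while importing, "idle" when done
```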
I also thought about the same issue: store everything in Solr, or store in MySQL and index in Solr.
I decided to go the second way: store in MySQL and index in Solr.
The reason: handling data (reading and writing) in MySQL is much better than in Solr. Also, data import/export from/to MySQL is supported by lots of tools out of the box.
Next point: backup. There are far more established ways to back up a MySQL DB than a Solr index.
Of course, for full-text search Solr is much better than MySQL. So I decided that each system should do the work it does best.
For your information: I'm talking about a medium-sized index: 4 GB for a few million documents.
Edit: don't forget that some features, like highlighting, require stored data in Lucene (not only indexed). If you need this, you have to store the documents in Solr as well. An alternative would be implementing those features on the client side. (I did it this way.)