What data is stored inside of a Couchbase Index? - couchbase

Situation
I'm working with a Couchbase database that keeps running into OOM issues; much of the time it's the Index Service running out of memory. When I open the Couchbase dashboard to get more info, I can see that index foobar has X items that take up Y amount of memory.
Question
Is there a way to view the data stored in the Index Service? Is it a clone of the document in a new list? Is it a pointer filter plus a list of references to foo documents? What is an "item"?
I've combed through the docs, and the closest I've come is here, where they talk about the "items" column in the dashboard table, but they never define what an item actually is.
Using Node.js Package
couchbase: "2.6.12"

Is there a way to view the data stored in the index service?
Yes, you can use one of the dump utilities in the Couchbase bin directory. For example, the plasma_dump tool extracts data from an index on Couchbase Server Enterprise Edition 7.x.
Here's an example extracting the data from an index created when the built-in travel-sample database is loaded into a Linux cluster.
/opt/couchbase/bin/plasma_dump dump /opt/couchbase/var/lib/couchbase/data/\#2i/travel-sample_def_airportname_4970026472206047478_0.index/mainIndex/
{"k":Raleigh Durham Intlairport_3626
","v":""},
{"k":Ralph Wien Memairport_3693
","v":""},
{"k":Ramona Airportairport_8608
","v":""},
{"k":Rampart Airportairport_7112
","v":""},
{"k":Rancho Murietaairport_3643
","v":""},
{"k":Rancho San Simeon Airportairport_9104
","v":""},
{"k":Randall Airportairport_8531
","v":""},
{"k":Randolph Afbairport_3757
","v":""},
{"k":Rapid City Regional Airportairport_4087
","v":""},
{"k":Rawlins Municipal Airport-Harvey Fieldairport_7986
","v":""},
{"k":Reading Regional Carl A Spaatz Fieldairport_5764
","v":""},
{"k":Red Bluff Municipal Airportairport_8137.....
Is it a clone of the document in a new list? Is it a pointer filter plus a list of references to foo documents? What is an "item"?
As you can see, each index item is a key and a value built from the actual fields your index definition specifies; it is not a pointer. In this dump, each key is the indexed airportname value followed by the key of the document it came from (e.g. airport_3626), and the value is empty.
This is how indexes speed up queries in databases generally. You typically want to include the document fields needed to satisfy your queries, and exclude the fields that aren't required, to keep the index small.
In database terminology, a covering index is an index that includes all the fields needed to satisfy a query, so that the query engine does not need to look up the actual documents from the Data Service. This can improve performance, because reading from an index is generally faster than gathering the raw data from the Data Service.
If a query references fields that are not included in the index, the query engine has to fetch the underlying documents from the bucket to retrieve them, which reduces the performance benefit of using the index.
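As a minimal sketch (the index name and query are hypothetical, not the travel-sample index from the dump above), a covering index and a query it covers might look like this in N1QL:

-- index both the filter field and the projected field
CREATE INDEX idx_city_airportname ON `travel-sample`(city, airportname);

-- every field the query references is an index key, so it is covered
SELECT airportname
FROM `travel-sample`
WHERE city = 'Paris';

Because both the filter (city) and the projected field (airportname) are index keys, the query engine can answer the query from the Index Service alone; the EXPLAIN plan will show the index scan as covering, and the Data Service is never consulted.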
Thanks,
Ian McCloy (Couchbase Product Manager)

Related

How to keep normalized models when searching via ElasticSearch?

When setting up a MySQL / ElasticSearch combo, is it better to:
1) Completely sync all model information to ES (even the non-search data), so that when a result is found, I have all its information handy.
2) Only sync the searchable fields, and then when I get the results back, use the id field to find the actual data in the MySQL database?
Elasticsearch usually prefers denormalized data. Depending on the use case (large amounts of data, underpowered machines, too few nodes, etc.), keeping relationships in ES (parent-child) to mimic the inner joins and the like from the RDBMS world is expensive.
Your question is very open-ended and the answer depends on the use case. Generally speaking:
avoid mimicking the exact DB tables and their relationships with ES indices
the advantage of keeping everything in ES is that you don't need to update two systems at the same time
if your searchable data is very small compared to the overall amount of data, I don't see why you couldn't synchronize just the searchable data with ES
try to flatten the data in ES and resist any impulse to use parent/child just because that's how it's done in MySQL
I'm not saying you cannot use parent/child. You can, but make sure you test it before adopting this approach and make sure you are OK with the response times. That is, anyway, valid advice for any approach you choose.
Elasticsearch is a search engine; I would advise you not to use it as a database system. I suggest you index only the search data plus a unique id from your database, so that you can retrieve the full results from MySQL using the unique key returned by Elasticsearch.
This way you'll be using both applications for what they're intended. Elasticsearch is not the best at querying relations, and you'll have to write a lot more code to operate on related data than if you simply use MySQL for it.
Also, you don't want to tie your persistence layer to your search layer. These should be as independent as possible, so that a change in one does not affect the other, as much as possible. Otherwise you'll have to update both systems whenever either has to change.
Querying MySQL on a set of IDs is very fast, so use it for that and leave the slow part (full-text querying) to Elasticsearch.
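For illustration (the table and ids here are hypothetical), that second step is just an indexed lookup on the ids Elasticsearch returned:

SELECT *
FROM products
WHERE id IN (17, 42, 93);   -- ids returned by the Elasticsearch query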
Although it depends on the situation, I would suggest going with #2:
Faster indexing: you only fetch the searchable data from the DB and index that into ES, compared to fetching and indexing everything.
Smaller storage size: since the indexed data is smaller than in #1, it's easier to back up, restore, recover, and upgrade your ES in production. It also keeps your storage size small as your data grows, and you can consider SSDs to improve performance at a lower cost.
In general, a search app searches on some fields but shows the user much more data, e.g. searching for products but showing pricing/stock info on the results page, which is only available in the DB. So it's natural to have a second step that queries the DB for the extra info and combines it with the search results for display.
Hope it helps.

MySQL partitioning by customer

We have a product that uses different MySQL schemas for different customers, and a single Java application that uses a different persistence unit for every customer. This makes it difficult to add a customer without redeploying the application.
We are planning to use a single MySQL database schema that holds all the customers, with each table having a key field identifying the customer, so that adding a new customer is a matter of a few SQL updates/inserts.
What is the best approach to handle this kind of data in MySQL? Does MySQL provide any way to partition tables by key or something like that? And what could be the performance issues of that approach?
There are a few questions here:
Schema Design Question
Partitioning question
Can MySQL handle a hash-map-style O(1) query?
Schema Design Question:
Yes, this is much better than launching a new app per customer.
Can MySQL handle a hash-map-style O(1) query?
Yes. If the data remains in memory and there are enough CPU cycles, MySQL can easily do 300K selects a second. If the workload is I/O-bound and the disk subsystem is not saturated, MySQL can do 20-30K selects per second, depending on the traffic pattern, concurrency, and how many IOPS the database's disk subsystem can deliver.
Partitioning
Partitioning means different things in the context of MySQL. MySQL partitioning is a storage-engine layer that sits on top of another storage engine, allocating data across a group of partition tables while exposing them to the calling application as a single table. Partitioning could also mean having certain database servers hold a subset of all tables. In your context, I think you are asking what the performance impact is if you federate by customer, i.e. whether you can allocate a database per customer, with the same schema, if necessary. That concept is closer to sharding: taking the data as a whole and allocating resources per unit of data, e.g. a customer.
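To make the built-in option concrete, here is a rough sketch (table and column names are hypothetical) of MySQL's PARTITION BY KEY on the customer column; note that MySQL requires every unique key, including the primary key, to contain the partitioning column:

-- one shared schema, every row tagged with its customer, partitioned by that key
CREATE TABLE orders (
    customer_id INT NOT NULL,
    order_id    BIGINT NOT NULL,
    created_at  DATETIME NOT NULL,
    amount      DECIMAL(10,2),
    PRIMARY KEY (customer_id, order_id)
)
PARTITION BY KEY (customer_id)
PARTITIONS 16;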
My suggestion to you
Keep the schema the same for every customer. Benchmark all the query patterns a customer would run. Verify with EXPLAIN that no query produces a filesort or temporary table, or scans 100K rows at a time, and you should be able to scale without problems. Once a box (or set of boxes) gets close to its IOPS ceiling, think about splitting the data.
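A sketch of that check against the hypothetical orders table above: add a composite index matching the per-customer access pattern, then confirm with EXPLAIN that the Extra column shows neither Using filesort nor Using temporary:

-- composite index so the per-customer sort can come straight from the index
ALTER TABLE orders ADD KEY idx_customer_created (customer_id, created_at);

EXPLAIN
SELECT order_id, amount
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 50;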

GraphDatabase (Neo4J) vs Relational database (MySql) - query on specific column of a specific table

Is it true that a relational database, like MySQL, performs better than a graph database, like Neo4j, when a query searches for specific data within a specific table and a specific column?
For instance, if the query is: "search for all events that took place in Paris".
Let's assume for simplicity that MySQL would have an Event table with an index on "City" to optimize this kind of query.
What about Neo4j?
One might think that a graph database has to traverse the whole graph to retrieve the matching events...
However, it's possible to create indexes with Neo4j, as its documentation explains.
Why would an RDBMS be faster than Neo4j for this kind of analysis/statistics request?
As you already mentioned: you would create indices for this purpose. The default index provider in Neo4j is Lucene, which is very fast and allows fine-grained indexing and querying.
Indices can be used for nodes or relationships and (normally) keep track of which values have been set on certain properties of those nodes or relationships.
You normally have to do the indexing in your application code, unless you're using Neo4j's auto-indexing feature, which automatically indexes all nodes and/or relationships with given properties.
So queries like "search for all events that took place in Paris" are absolutely no problem and are very performant when indices are used.

Efficient and scalable storage for JSON data with NoSQL databases

We are working on a project which should collect journal and audit data and store it in a datastore for archive purposes and some views. We are not quite sure which datastore would work for us.
we need to store small JSON documents, about 150 bytes, e.g. "audit:{timestamp:'86346512', host:'foo', username:'bar', task:'foo', result:0}" or "journal:{timestamp:'86346512', host:'foo', terminalid:1, type:'bar', rc:0}"
we are expecting about one million entries per day, about 150 MB data
data will be stored and read but never modified
data should be stored in an efficient way, e.g. the binary format used by Apache Avro
after a retention time data may be deleted
custom queries, such as 'get audit for user and time period' or 'get journal for terminalid and time period'
replicated database for fault tolerance
scalable
Currently we are evaluating NoSQL databases like Hadoop/Hbase, CouchDB, MongoDB and Cassandra. Are these databases the right datastore for us? Which of them would fit best?
Are there better options?
One million inserts / day is about 10 inserts / second. Most databases can deal with this, and it's well below the max insertion rate we get from Cassandra on reasonable hardware (50k inserts / sec).
Your requirement "after a retention time data may be deleted" fits Cassandra's column TTLs nicely - when you insert data you can specify how long to keep it for, then background merge processes will drop that data when it reaches that timeout.
"data should stored in an efficient way, e.g. binary format used by Apache Avro" - Cassandra (like many other NOSQL stores) treats values as opaque byte sequences, so you can encode you values how ever you like. You could also consider decomposing the value into a series of columns, which would allow you to do more complicated queries.
custom queries, such as 'get audit for user and time period' - in Cassandra, you would model this by making the row key the user id and the column key the time of the event (most likely a timeuuid). You would then use a get_slice call (or, even better, CQL) to satisfy this query.
or 'get journal for terminalid and time period' - as above, have the row key be terminalid and the column key be the timestamp. One thing to note is that in Cassandra (like many join-less stores), it is typical to insert the data more than once (in different arrangements) to optimise for different queries.
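A sketch of the 'audit for user and time period' query against the hypothetical audit table above, using CQL's maxTimeuuid/minTimeuuid helpers to bound the timeuuid clustering column:

-- one partition per user, clustered by event time
SELECT host, task, result
FROM audit
WHERE username = 'bar'
  AND event_time > maxTimeuuid('2013-01-01 00:00:00')
  AND event_time < minTimeuuid('2013-02-01 00:00:00');

The 'journal for terminalid and time period' query would use the same shape against a second table keyed by terminalid, which is exactly the "insert the data more than once" pattern mentioned above.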
Cassandra has a very sophisticated replication model, where you can specify different consistency levels per operation. Cassandra is also a very scalable system with no single point of failure or bottleneck. This is really the main difference between Cassandra and things like MongoDB or HBase (not that I want to start a flame war!).
Having said all of this, your requirements could easily be satisfied by a more traditional database and simple master-slave replication; nothing here is too onerous.
Avro supports schema evolution and is a good fit for this kind of problem.
If your system does not require low latency data loads, consider receiving the data to files in a reliable file system rather than loading directly into a live database system. Keeping a reliable file system (such as HDFS) running is simpler and less likely to have outages than a live database system. Also, separating the responsibilities ensures that your query traffic won't ever impact the data collection system.
If you will only have a handful of queries to run, you could leave the files in their native format and write custom MapReduce jobs to generate the reports you need. If you want a higher-level interface, consider running Hive over the native data files. Hive will let you run arbitrary SQL-like queries over your raw data files. Or, since you only have 150 MB/day, you could just batch-load it into read-only compressed MySQL tables.
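A rough sketch of the Hive option (the path and field names are hypothetical): point an external table at the raw JSON files and pull fields out at query time with get_json_object:

-- external table over the raw log files; each line lands in one string column
CREATE EXTERNAL TABLE audit_raw (line STRING)
LOCATION '/data/audit';

-- extract JSON fields at query time
SELECT get_json_object(line, '$.username') AS username,
       get_json_object(line, '$.result')   AS result
FROM audit_raw
WHERE get_json_object(line, '$.host') = 'foo';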
If for some reason you need the complexity of an interactive system, HBase or Cassandra might be good fits, but beware that you'll spend a significant amount of time playing "DBA", and 150 MB/day is so little data that you probably don't need the complexity.
We're using Hadoop/HBase, and I've looked at Cassandra, and they generally use the row key as the means to retrieve data the fastest, although of course (in HBase at least) you can still have it apply filters on the column data, or do it client side. For example, in HBase, you can say "give me all rows starting from key1 up to, but not including, key2".
So if you design your keys properly, you could get everything for 1 user, or 1 host, or 1 user on 1 host, or things like that. But, it takes a properly designed key. If most of your queries need to be run with a timestamp, you could include that as part of the key, for example.
How often do you need to query the data/write the data? If you expect to run your reports and it's fine if it takes 10, 15, or more minutes (potentially), but you do a lot of small writes, then HBase w/Hadoop doing MapReduce (or using Hive or Pig as higher level query languages) would work very well.
If your JSON data has variable fields, then a schema-less model like Cassandra could suit your needs very well. I'd expand the data into columns rather than storing it in binary format; that will make it easier to query. At the given data rate, it would take you about 20 years to fill a 1 TB disk, so I wouldn't worry about compression.
For the example you gave, you could create two column families, Audit and Journal. The row keys would be TimeUUIDs (i.e. timestamp + MAC address to turn them into unique keys). Then the audit row you gave would have four columns, host:'foo', username:'bar', task:'foo', and result:0. Other rows could have different columns.
A range scan over the row keys would allow you to query efficiently over time periods (assuming you use ByteOrderedPartitioner). You could then use secondary indexes to query on users and terminals.
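A brief CQL approximation of that layout (all names are hypothetical): one row per event keyed by a timeuuid, the event fields expanded into columns, and a secondary index for the terminal filter:

-- one row per journal event, keyed by a timeuuid
CREATE TABLE journal (
    event_id   timeuuid PRIMARY KEY,
    host       text,
    terminalid int,
    type       text,
    rc         int
);

-- secondary index for the terminal filter
CREATE INDEX ON journal (terminalid);

SELECT * FROM journal WHERE terminalid = 1;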

Combine MySQL, Sphinx and MongoDB. Good idea?

For a new project I'm looking to combine MySQL, Sphinx and MongoDB: MySQL for the relational data and searching on numeric values, Sphinx for free-text search, and MongoDB for geodata. As far as my (quick) benchmarks show, MongoDB is the fastest for geo queries, Sphinx for free-text search, and MySQL for relational data searches. So to get the best performance I might have to combine them in my project.
There are however three drawbacks to this.
Three points of failure, i.e. Sphinx, MySQL, and MongoDB can each crash, which will stop my site
I need data in three databases and need to keep them up to date (all data only changes once per day, so it's not the worst problem)
Hardware requirements, and mainly RAM, go through the roof, since every database wants a large portion of the RAM to perform well
So the question is: should I combine the three, leave one out (probably MongoDB, using Sphinx for geodata as well), or even go with only one (MongoDB or MySQL)?
To give an idea of the data: the relational data is approx. 6 GB, the geodata about 4 GB, and the free-text data about 16 GB.
I didn't quite understand whether the records/collections/documents in the 3 DBs have inter-DB references, e.g. if user names, jobs, and telephone numbers are in MySQL and user addresses are in Mongo. I'll assume the answer is yes.
IMHO having 3 different storage solutions is not recommended, because:
1) (most important) You cannot aggregate data from two DBs (in a scalable way).
Example:
Let's say that you keep user data (user names) in MySQL and user geo coordinates in Mongo. You can't query with filters/sorts on fields located in both DBs. For example, you can't:
SELECT all users
WHERE name starts with 'A'
SORT BY distance_from_center
Same applies for Sphinx.
Solution: you either limit yourself to the data available in a single DB, or you duplicate/mirror data from one DB to another.
2) Maintenance costs: 3 servers to maintain, different backup/redundancy strategies, different scaling strategies. Development costs: developers must use 3 querying libraries, with 3 different ways to query, etc.
3) Inconsistency/synchronization issues that must be dealt with manually (e.g. you want to insert data both in Mongo and in MySQL; say Mongo wrote the data but MySQL raised a referential integrity exception, so now you have an inconsistency between the DBs)
4) About HW costs: the only real RAM-eater is MongoDB (the recommendation is that it should hold all its indexes in RAM). For the MySQL and Sphinx servers, you can control memory consumption.
What I would do:
If I don't need all the SQL features (like transactions, referential integrity, joins, etc) I would go with Mongo
If I need those features, and I can live with a lower performance on geo operations, I would go with MySQL
Now, if I need (I mean, really really need) full-text search, and Mongo/MySQL FTS capabilities are not enough, I would also attach an FTS server like Sphinx, Solr, Elasticsearch, etc.