How to keep normalized models when searching via ElasticSearch? - mysql

When setting up a MySQL / ElasticSearch combo, is it better to:
Completely sync all model information to ES (even the non-search data), so that when a result is found, I have all its information handy.
Only sync the searchable fields, and then when I get the results back, use the id field to find the actual data in the MySQL database?

The Elasticsearch model of data prefers non-normalized data, usually. Depending on the use case (large amount of data, underpowered machines, too few nodes etc) keeping relationships in ES (parent-child) to mimic the inner joins and the like from the RDB world is expensive.
Your question is very open-ended and the answer depends on the use-case. Generally speaking:
avoid mimicking the exact DB Tables - ES indices plus their relationships
advantage of keeping everything in ES is that you don't need to update both mechanisms at the same time
if your search-able data is very small compared to the overall amount of data, I don't see why you couldn't synchronize just the search-able data with ES
try to flatten the data in ES and resist any impulse of using parent/child just because this is how it's done in MySQL
I'm not saying you cannot use parent/child. You can, but make sure you test this before adopting this approach and make sure you are ok with the response times. This is, anyway, a valid advice for any kind of approach you choose.

ElasticSearch is a search engine. I would advise you to not use it as a database system. I suggest you to only index the search data and a unique id from your database so that you can retrieve the results from MySQL using the unique key returned by ElasticSearch.
This way you'll be using both applications for what they're intended. Elastic search is not the best for querying relations and you'll have to write lot more code for operating on related data than simply using MySql for it.
Also, you don't want to tie up your persistence layer with search layer. These should be as independent as possible, and change in one should not affect the other, as much as possible. Otherwise, you'll have to update both your systems if either has to change.
Querying MySQL on some IDs is very fast, so you can use it and leave the slow part (querying on full text) to elastic search.

Although it's depend on situation, I would suggest you to go with #2:
Faster when indexing: we only fetch searchable data from DB and index to ES, compare to fetch all and index all
Smaller storage size: since indexed data is smaller than #1, it's more easier to backup, restore, recover, upgrade your ES in production. It'll also keep your storage size small when your data growing up, and you can also consider to use SSD to enhance performance with lower cost.
In general, a search app will search on some fields and show all possible data to user. E.g searching for products but will show pricing/stock info.. in result page, which only available in DB. So it's nature to have a 2nd step to query for extra info in DB and combine it with search results to display.
Hope it help.

Related

Syncing pagination between relational and non-relational database

I use mysql as my main database and I sync some data to elasticsearch to make use of features like fuzzy search and aggregations. However, this problem can be applied to and couple of relational and non-relational databases.
When user searches something, I make query to elastic, get ids (primary keys in mysql) and make another query to mysql database, where I filter by ids that were returned from elastic. I use this approach as you often need to load some additional data from relational database, and it would be hell to maintain these relations inside document-based elastic (e.g. load user with comment).
Problem is, same filters will not be applied to elastic query and mysql query. In above example, what if you need to filter comments by some user param - that filter will be applied to mysql query, but not elastic. If same filters won't be applied, pagination will mismatch - 2nd page in mysql can be 4th in elastic. If I take all of the ids from elastic (no pagination), I am afraid of a long response time and clusters failing + you can't get more than 10K records from elastic without scroll api.
I need a conceptual solution here, not actual query examples. Feel free to suggest totaly different approach altogether. Also, I don't need a perfect pagination match, since mysql will do pagination anyway. If elastic needs to get more records, it's fine, I just don't want to couse too heavy load.
Im afraid there is no general solution for the problem you are explaining . It varies by your response time expectations; size of data etc.
For example,
If you can ensure that one side of JOIN data will be much lesser - you could change join direction; First do the query on mySQL and then do an id based terms search in ES.
Consider using database embedded search like postgres depending on how complex your queries are and other features of ES you are leveraging

Big quantity of data with MySql

i have a question about Mass storage. Actually, i'm working with 5 sensors which sends a lot of datas with a different frequency for each one and i'm using MySQL DATABASE.
so here is my questions:
1) is MySQL the perfect solution.
2) if not, is there a solution to store this big quantity of data in a data base?
3) I'm using Threads in this and i'm using mutexs also, i'm afraid if this can cause problems, Actually,it seems to be.
i hope i will have an answer to this question.
MySql is good solution for OLTP scenarios where you are storing transactions to serve web or mobile apps. But it does not scale well (despite of cluster abilities).
There are many options out there based on what is important to you:
File System: You can device your own write-ahead-log solution to solve multi-threading problems and achieve "eventual consistency". That way you don't have to lock data for one thread at a time. You can use schema-full files like CSV, Avro or Parquet. Also you can use S3 or WSB for cloud based block storage. Or HDFS for just block and replicated storage.
NoSql: You can store each entry as document in NoSql Document stores. If you want to keep data in memory for faster read, explore Memcached or Redis. If you want to perform searches on data, use Solr or ElasticSearch. MongoDB is popular but it has scalability issues similar to MySql, instead I would chose Cassandra or HBase if you need more scalability. With some of NoSql stores, you might have to parse your "documents" at read time which may impact analytics performance.
RDBMS: As MySql is not scalable enough, you might explore Teradata and Oracle. Latest version of Oracle offers petabyte query capabilities and in-memory caching.
Using a database can add extra computation overhead if you have a "lot of data". Another question is what you do with the data? If you only stack them, a map/vector can be enough.
The first step is maybe to use map/vector that you can serialize to a file when needed. Second you can add the database if you wish.
About mutex if you share some code with different thread and if (in this code) you work on the same data at the same time, then you need them. Otherwise remove them. BTW if you can separate read and write operations then you don't need mutex/semaphore mechanism.
You can store data anywhere, but the data storage structure selection would depends on the use cases (the things, you want to do with the data).
It could be HDFS files, RDBMS, NoSQL DB, etc.
For example your common could be:
1. to save the sensor data very quickly.
2. get the sensor data on the definite date.
Then, you can use MongoDB or Cassandra.
If you want to get deep analytics (to get monthly average sensor data), you definitely should think about another solutions.
As for MySQL, it could also be used for some reasonable big data storage,
as it supports sharding. It fits some scenarios well, some not.
But I repeat, all would depend on use cases, i.e. the things you want to do with data.
So you could provide question with more details (define desired use-cases), or ask again.
There are several Questions that discuss "lots of data" and [mysql]. They generally say "yes, but it depends on what you will do with it".
Some general statements (YMMV):
a million rows -- no problem.
a billion rows or a terabyte of data -- You will run into problems, but they are no insurmountable.
100 inserts per second on spinning disk -- probably no problem
1000 rows/second inserted can be done; troubles are surmountable
creating "reports" from huge tables is problematical until you employ Summary Tables.
Two threads storing into the same table at the "same" time? Every RDBMS (MySQL included) solves that problem before the first release. The Mutexes (or whatever) are built into the code; you don't have to worry.
"Real time" -- If you are inserting 100 sensor values per second and comparing each value to one other value: No problem. Comparing to a million other values: big problem with any system.
"5 sensors" -- Read each hour? Yawn. Each minute? Yawn. Each second? Probably still Yawn. We need more concrete numbers to help you!

JPA - Hibernate : Select query on the continuously growing table

I have a Mysql table which holds about 10 million records currently. Records are inserted by another batch application on continues basis and keep on growing.
On the front end user can search the data on this table based on different criterion. I am using query DSL and JPA repository to create dynamic queries and getting data from the table. But the performance of the query with pagination is very slow. I have tried indexing, InnoDB related tweaks,session management by HikariCP and ehcahe solutions but still it is taking about 100 seconds to get the data.
Also entities are simple POJO with no relation with other entities.
What is the best way/technology/framework to implement this scenario?
In a table of this size, dynamic query is a really, REALLY bad idea, you need to really control the access to the table and avoid table scans at all cost.
Ultimately, this sounds like a data warehouse solution, whereas the data is ETL'ed into a report-like format and not raw transactional data. Even so, you'll still need to define the access patterns you need and design your DWH to support that.
If you decide that the raw data is still the best format, another approach would be to define support metadata tables that could be queried to more quickly reduce the number of rows returned.
Could also look at clustering the data if you can find some way to logically break out data into chunks. However, when you say dynamic queries, this might not be possible.
My suggestion would be to create a dedicated cache and the web app should query the cache instead of the DB. If the ETL batch to your main table is at a defined period, you can keep the cache hot by triggering a loading from the main table to the cache. This can any in memory cache like Ignite or Infinispan.
However, this is not a sustainable solution and eventually you would need to restrict your users into seeing data within a manageable date range only and will have to either discard or send the old data asynchronous via flat file generated reports.
Not the entire history of the huge dataset can be made available in the UI to the user.
You could also try data virtualization tools to figure out what users are more comfortable with before deciding on the partition strategy in production.

How to handle a table with billion of rows with lots of read and write operations

Please guide me through my problem
I receive data at every 1 sec at my server from different sources.My data is structured i parse it and now i have to store this parsed data into single table around 5 lacs of records in a day. Also daily i do lots of read operation on this table.After some time this table will have billions of record.
How should i solve this problem? I want to know should i go with RDBMS or HBase or any other option.
My question is regarding what sort of database repository you wish to use: RAM? Flash? Disk?
RAM responds in nanoseconds.
Flash in microseconds.
Disk in milliseconds.
And, of course, you might want to create a hybrid of all three, especially if some keys were "hotter" than others -- more likely to be read over and over.
If you want to do a lot of fast processing, and scale it "wide" (many CPUs in a cluster for faster read performance), you are a likely candidate for a NoSQL database. I'd need to know more about your data model to know whether it would work as a key-value store, and how it might require more internal structure such as JSON/BSON.
Caveat: I am biased towards Aerospike, my employer. Yet you should do some kicking-of-the-tires with us or any other key-value stores you're considering to see if it would work with your data before betting the farm. Obviously, each NoSQL vendor would claim itself to be "the best," but much depends on your use case. A vendor's "solution" will only work well for certain data models. We tend to be best for fast in-memory RAM/Flash or hybrid implementations.
If in case your table would reach billions of records, RDBMS definitely won't scale.
Regarding HBASE, it depends on your requirements whether it would be a good solution or not.
If you are looking for real time reads, Hbase would only help if you are only looking for a specific key. If you want to do random reads on different columns, Hbase won't be an ideal solution here. Hbase would scale really well in case of updates.
I would suggest you to design your Hbase schema efficiently and store your data in way which suits your querying.
However if you are interested in running aggregation queries you can also map your hbase table to an external table in Hive and run sql type queries on your data.
You can use HBase as a NoSQL database in this case. To make search more customized and faster use ElasticSearch along with Hbase.
If you writes are at 1/second, most of the available databases should be able to support this. Since you are looking for longer term/persistent store, you should consider a database that provides you horizontal scale so that you could add more nodes as and when you would like to increase the capacity. Databases with auto-sharding abilities would be great fit for you (cassandra, aerospike ...). Make sure you choose a auto-sharding database that doesn't require client/application to manage which data is stored where. In-memory databases would not fit the bill in this case.
When your storage is a few tera-bytes, you may have to worry about the database scale, throughput so that your infra cost doesn't bogg you down.
Your query patterns would be very crucial in choosing the right solution. You may not want to index everything, but fine-tune what you index so that you could query on the keys and/or only those data elements from within your records so that index storage overhead doesn't become too much, and hence you keep the cost under control. You should also look for time-range query ability for the database solutions, which seems to be part of your typical query pattern.
Last but not the least, you would want to have your queries processes in fastest possible time. You should try out Cassandra (good for horizontal scaling, less on the throughput) and aerospike (good for horizontal scaling, pretty good on throughput).

Redis vs MySQL for Financial Data?

I realize that this question is pretty well discussed, however I would like to get your input in the context of my specific needs.
I am developing a realtime financial database that grabs stock quotes from the net multiple times a minute and stores it in a database. I am currently working with SQLAlchemy over MySQL, but I came across Redis and it looks interesting. It looks good especially because of its performance, which is crucial in my application. I know that MySQL can be fast too, I just feel like implementing heavy caching is going to be a pain.
The data I am saving is by far mostly decimal values. I am also doing a significant amount of divisions and multiplications with these decimal values (in a different application).
In terms of data size, I am grabbing about 10,000 symbols multiple times a minute. This amounts to about 3 TB of data a year.
I am also concerned by Redis's key quantity limitation (2^32). Is Redis a good solution here? What other factors can help me make the decision either toward MySQL or Redis?
Thank you!
Redis is an in-memory store. All the data must fit in memory. So except if you have 3 TB of RAM per year of data, it is not the right option. The 2^32 limit is not really an issue in practice, because you would probably have to shard your data anyway (i.e. use multiple instances), and because the limit is actually 2^32 keys with 2^32 items per key.
If you have enough memory and still want to use (sharded) Redis, here is how you can store space efficient time series: https://github.com/antirez/redis-timeseries
You may also want to patch Redis in order to add a proper time series data structure. See Luca Sbardella's implementation at:
https://github.com/lsbardel/redis
http://lsbardel.github.com/python-stdnet/contrib/redis_timeseries.html
Redis is excellent to aggregate statistics in real time and store the result of these caclulations (i.e. DIRT applications). However, storing historical data in Redis is much less interesting, since it offers no query language to perform offline calculations on these data. Btree based stores supporting sharding (MongoDB for instance) are probably more convenient than Redis to store large time series.
Traditional relational databases are not so bad to store time series. People have dedicated entire books to this topic:
Developing Time-Oriented Database Applications in SQL
Another option you may want to consider is using a bigdata solution:
storing massive ordered time series data in bigtable derivatives
IMO the main point (whatever the storage engine) is to evaluate the access patterns to these data. What do you want to use these data for? How will you access these data once they have been stored? Do you need to retrieve all the data related to a given symbol? Do you need to retrieve the evolution of several symbols in a given time range? Do you need to correlate values of different symbols by time? etc ...
My advice is to try to list all these access patterns. The choice of a given storage mechanism will only be a consequence of this analysis.
Regarding MySQL usage, I would definitely consider table partitioning because of the volume of the data. Depending on the access patterns, I would also consider the ARCHIVE engine. This engine stores data in compressed flat files. It is space efficient. It can be used with partitioning, so despite it does not index the data, it can be efficient at retrieving a subset of data if the partition granularity is carefully chosen.
You should consider Cassandra or Hbase. Both allow contiguous storage and fast appends, so that when it comes to querying, you get huge performance. Both will easily ingest tens of thousands of points per second.
The key point is along one of your query dimensions (usually by ticker), you're accessing disk (ssd or spinning), contiguously. You're not having to hit indices millions of times. You can model things in Mongo/SQL to get similar performance, but it's more hassle, and you get it "for free" out of the box with the columnar guys, without having to do any client side shenanigans to merge blobs together.
My experience with Cassandra is that it's 10x faster than MongoDB, which is already much faster than most relational databases, for the time series use case, and as data size grows, its advantage over the others grows too. That's true even on a single machine. Here is where you should start.
The only negative on Cassandra at least is that you don't have consistency for a few seconds sometimes if you have a big cluster, so you need either to force it, slowing it down, or you accept that the very very latest print sometimes will be a few seconds old. On a single machine there will be zero consistency problems, and you'll get the same columnar benefits.
Less familiar with Hbase but it claims to be more consistent (there will be a cost elsewhere - CAP theorem), but it's much more of a commitment to setup the Hbase stack.
You should first check the features that Redis offers in terms of data selection and aggregation. Compared to an SQL database, Redis is limited.
In fact, 'Redis vs MySQL' is usually not the right question, since they are apples and pears. If you are refreshing the data in your database (also removing regularly), check out MySQL partitioning. See e.g. the answer I wrote to What is the best way to delete old rows from MySQL on a rolling basis?
>
Check out MySQL Partitioning:
Data that loses its usefulness can often be easily removed from a partitioned table by dropping the partition (or partitions) containing only that data. Conversely, the process of adding new data can in some cases be greatly facilitated by adding one or more new partitions for storing specifically that data.
See e.g. this post to get some ideas on how to apply it:
Using Partitioning and Event Scheduler to Prune Archive Tables
And this one:
Partitioning by dates: the quick how-to