Elasticseach stuck at 213 documents and lost data - json

I have an elasticsearch setup with 1 node and no replica nodes sharing a droplet with a Kibana setup in digital ocean. My droplet has 2GB ram and enough CPU. My elasticsearch JVM is set to use 768MB RAM (so kibana can have its share).
My problem is that I seem to be losing data since my node is stuck at 213 documents and I have already noticed that some important documents are gone.
I couldn't find documentation on how this works. The only thing I found about this is that more ram is better when dealing with large amounts of data, and that having a secondary node to store replicas is a good practice.
Should I allocate more ram? How can I know if my data is beeing deleted to allocate more? Is this some sort of pagination? Can this be a kibana problem?

I solved it. The problem was my ID generation. I was generating IDs manually and not storing persistently the last id generated. Once the system is restarted ids are lost and elastic search allows id overriding.

Related

What is the best strategy to store redis data to MySQL for permanent storage?

I am running a couple of crawlers that produce millions of datasets per day. The bottleneck is the latency between the spiders and the remote database. In case the location of the spider server is too large, the latency will slow the crawler down to a point where it can not longer complete the datasets needed for a day.
In search for a solution I came upon redis with the idea on installing redis the spider server where it will temporarily store the data collected with low latency and then redis will pull that data to mysql some how.
The setup is like this until now:
About 40 spiders running on multiple instances feed one central MySQL8 remote server on a dedicated machine over TCP/IP.
Each spider writes different datasets, one kind of spider gets positions and prices of search results, where there are 100 results with around 200-300 inserts on one page. Delay is about 2-10s between the next request/page.
The later one is the problem as the spider yields every position within that page and creates a remote insert within a transaction, maybe even a connect (not sure at the moment).
This currently only works as spiders and remote MySQL server are close (same data center) with ping times of 0.0x ms, it does not work with ping times of 50ms as the spiders can not write fast enough.
Is redis or maybe DataMQ a valid approach to solve the problem or are there other recommended ways of doing this?
Did you mean you have installed a Redis Server on each spider?
Actually it was not a good solution for you case. But if you have already done this and still want to use MySQL to persistent your data, cronjob on each server will be an option.
You can create a cronjob on each spider server(based on your dataset and your need, you can choose daily or hourly sync job). And write a data transfer script to scan your Redis and transfer to MySQL tables.
I recommend using MongoDB instead of MySQL to store data

Can I really persist data on disk using couchbase

I have a lot of data needed to stored on disk.
Since it is only key-value pairs, I want to use couchbase to do it.
The data is several GB and I only allocate 1 GB RAM to the bucket.
I though RAM to couchbase is only a cache.
But after inserting a lot of data I got:
Hard Out Of Memory Error. Bucket "test2" on node 100.66.32.169 is full. All memory allocated to this bucket is used for metadata.
when I open the couchbase web console.
Can couchbase be a database to store data on disk? Or it is RAM oritented?
Update:
OK, let me make the question more specific:
In couchbase:
If I allocate the RAM of a bucket to be 1 GB, can I store 10 GB data to that bucket?
If I can do 1. , can I consider that 1 GB RAM is a kind of cache of the 10 GB data (just like CPU L2 cache is a cache of RAM) ?
By default, Couchbase stores all keys (and some metadata) in RAM, and fills whatever remains with values. Starting with version 3.0, you can set your bucket to full-eviction mode, which only keeps the keys of cached documents in RAM. This lets you store much more data than you have memory, but at a cost to performance to some read operations, especially trying to retrieve keys that don't exist.
To solve your specific problem, edit the bucket and set it to full metadata eviction. Note that this will restart the bucket.
Couchbase tries to keep as much of the "live dataset" (ie. most used / requested keys) into the node's memory. This is key to the performance of the database, and part of the design, so good memory sizing of your nodes and quotas of your buckets are key.
It does offer persistence, but I'd say this is not a disk-first oriented database.
Persistence to disk is mainly for two things: making data durable and resilient to node shutdown (of course) and off-load data (in priority least used data) from RAM to disk.
I think you're asking a bunch of different questions here.
Specifically about the error message: looks like your bucket is simply too small to hold all the data you're storing in it.
About persisting to disc: you can force couchbase to write to disc (and even configure the number of nodes that docs are replicated into) but as noted above, that would probably hurt your performance a little.
Have a look, for example, at the persist_to flag in the set() api of the python client for couchbase.
couchbase client for python

Caching the data result of complex computation

I have a Spring Boot server application. Clients of this server ask for statistics about different things all the time. These statistics can be shared among clients, and must not be real time.
It's good enough if these statistics are refreshed every 15-30 mins.
Also, computing these statistics requires reading the whole database.
So, I'd like to cache these computed statistics and update them now and then.
What is your suggestion, what tool or pattern should I use?
I have the following ideas so far:
using memcached
upgrading to MySQL 5.7 which has JSON store, and store the data there
Please keep in mind that the hardware of my server is not too powerful: 512MB RAM and 1 CPU (cheapest option in DigitalOcean).
Thank you in advance!
Edit 1:
These statistics are composed of quite simple data structures: int to int maps, lists, etc. and they are NOT fitting well for a relational database.
Edit 2:
The whole data is only a few megabytes. The crutial point is that creating this data requires a lot of database reads, and a lot of clients are asking for it.
I also want to keep my server application stateless. I think it's important to mention.
A simple solution for the problem, is saving the data in JSON format to a file, and that's it.
Additionally, this file can be on a ram disk partition, so it will be blazing fast.

Mysql cluster for dummies

So what's the idea behind a cluster?
You have multiple machines with the same copy of the DB where you spread the read/write? Is this correct?
How does this idea work? When I make a select query the cluster analyzes which server has less read/writes and points my query to that server?
When you should start using a cluster, I know this is a tricky question, but mabe someone can give me an example like, 1 million visits and a 100 million rows DB.
1) Correct. Every data node does not hold a full copy of the cluster data, but every single bit of data is stored on at least two nodes.
2) Essentially correct. MySQL Cluster supports distributed transactions.
3) When vertical scaling is not possible anymore, and replication becomes impractical :)
As promised, some recommended readings:
Setting Up Multi-Master Circular Replication with MySQL (simple tutorial)
Circular Replication in MySQL (higher-level warnings about conflicts)
MySQL Cluster Multi-Computer How-To (step-by-step tutorial, it assumes multiple physical machines, but you can run your test with all processes running on the same machine by following these instructions)
The MySQL Performance Blog is a reference in this field
1->your 1st point is correct in a way.But i think if multiple machines would share the same data it would be replication instead of clustering.
In clustering the data is divided among the various machines and there is horizontal partitioning means the dividing of the data is based on the rows,the records are divided by using some algorithm among those machines.
the dividing of data is done in such a way that each record will get a unique key just as in case of a key-value pair and each machine also has a unique machine_id related which is used to define which key value pair would go to which machine.
we call each machine a cluster and each cluster consists of an individual mysql-server, individual data and a cluster manager.and also there is a data sharing between all the cluster nodes so that all the data is available to the every node at any time.
the retrieval of data is done through memcached devices/servers for fast retrieval and
there is also a replication server for a particular cluster to save the data.
2->yes, there is a possibility because there is a sharing of all the data among all the cluster nodes. and also you can use a load balancer to balance the load.But the idea of load balancer is quiet common because they are being used by most of the servers. but if you are trying you just for your knowledge then there is no need because you will not get to notice the type of load that creates the requirement of a load balancer the cluster manager itself can do the whole thing.
3->RandomSeed is right. you do feel the need of a cluster when your replication becomes impractical means if you are using the master server for writes and slave for reads then at some time when the traffic becomes huge such that the sever would not be able to work smoothly then you will feel the need of clustering. simply to speed up the whole process.
this is not the only case, this is just one of the scenario this is only just a case.
hope this is helpful for you!!

Storing image files in Mongo database, is it a good idea?

When working with mysql, it is a bad idea to store images as BLOB in the database, as it makes the database quite large which is harmful for normal usage of the database. Then, it is better to save image files on disk and save link to them within the database.
However, I think this is different for MongoDB, as increasing the database file size has a negligible influence on performance (this is the reason that MongoDB can successfully handle billions of records).
Do you think it is better to save image files on MongoDB (as GridFS) to reduce number of files stored on the server; or still it is better to keep the database as small as possible?
The problem isn't so much that the database gets big, databases can handle that (although MongoDB isn't as good as many other in that respect). The problem is that to send the data to the client it first has to be moved into RAM by the database, then copied over to the application's memory, then handed off to the kernel to be sent through the socket. It's wasting lots of RAM and CPU cycles. The reason it's better to have large files in the filesystem is that it's easier to get around copying it, you can ask the kernel to stream the file from disk to the socket directly.
The downside of storing large files in the filesystem is that it's much harder to distribute. Using a database, and something like Mongo's GridFS makes it possible to scale out. You just have to make sure you don't copy the whole file into the application's memory at once, but a chunk at a time. Most web app frameworks have some support for sending chunked HTTP responses nowadays.
The answer is yes. Back in the old cave-man days, servers had mutable file systems you could change. This was great till we tried to scale things.
Cave-people nowadays build apps with immutable deployments. Heroku and Dokku are examples of this. Because the web app server has no state, they can be created, upgraded, scaled, and destroyed easily.
Since we still have files, we need to put them somewhere. There are several solutions: nfs, our database, someone elses database.
nfs is a 'network file system' which let's you do file i/o on network resources. If you're dealing with the network anyways, IMHO it doesn't add much value unless it's what you know already.
Our database - For MongoDB there are two options: (file > 16mb) ? GridFS : BinData
Someone elses database - Some are basic like Amazon S3 and some offer extra services like Cloudinary or Dropbox.
If you're on an big-budget enterprise team and someone spends 40 hrs a week taking care of servers then sure - use the file system. If you're building web apps that scale, putting files in the DB makes sense.
If you're concerned about performance:
1) Using a proxy (e.g. nginx) or a CDN to host your content for clients. Your server should just be serving cache misses.
2) Use streaming IO Nodeschool has a cool tutorial for Node.js.
Storing images is not a good idea in any DB, because:
read/write to a DB is always slower than a filesystem
your DB backups grow to be huge and more time consuming
access to the files now requires going through your app and DB layers
The last two are the real killers.
Source: Three things you should never put in your database.
So if you can make your application crafty, then better not to upload your pictures to MongoDB.
However, if you are close to deadline... and the database will be so small that it will not grow up a lot and its size will never exceed the available RAM on the machine running your application, then I think (as opposed to the author of the cited article), you may consider storing the images in MongoDB. It's simply, convenient, quick to implement and gives you some flexibility.
MongoDB's GridFS is designed for this sort of storage and is quite handy for storing image files across many different servers in a way that all servers can use them.