Get streaming stats for each node inside an IPFS cluster?

I'm using IPFS for video file storage and streaming. The cluster currently has 4 nodes, and anyone can join the cluster as an IPFS node and provide streaming in exchange for rewards.
How can I calculate the amount of streaming a particular node has provided since the beginning, so that we can reward each node every month?
What are the possible solutions?
A reference from the docs would be helpful as well.

Related

how to do ipfs pin add and get within 10 seconds?

In my project, I need to download data from ipfs by giving a CID.
What I do is:
ipfs pin add {CID}
ipfs get {CID}
But I found these two steps are quite time-consuming; it takes at least a minute.
I tried both localhost and Infura.
What can I do to make the download faster?
When you just want to download files, there is no need to pin them first. This might save a (tiny) bit of overhead.
However, the bulk of the time is probably spent on:
Looking up nodes in the distributed hash table (DHT) that provide your data, and
Actually transferring the data from those nodes.
For small data sizes, the first item is probably the limiting factor. In general, the duration depends on how many of the nodes hosting your data your client connects to, and how fast those nodes (can) transfer the data to your client.
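If it helps, here is a minimal sketch of fetching content without pinning it first, using the java-ipfs-http-client library; the daemon address and the CID below are placeholders you would replace with your own.

import io.ipfs.api.IPFS;
import io.ipfs.multihash.Multihash;

public class FetchWithoutPin {
    public static void main(String[] args) throws Exception {
        // Talk to a local IPFS daemon over its HTTP API (default port 5001)
        IPFS ipfs = new IPFS("/ip4/127.0.0.1/tcp/5001");
        // Placeholder CID: replace with the hash you actually want to download
        Multihash cid = Multihash.fromBase58("QmYourContentHashHere");
        // cat fetches the content directly; no pin is needed just to read it
        byte[] data = ipfs.cat(cid);
        System.out.println("Fetched " + data.length + " bytes");
    }
}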

couchbase cluster documents not replicating but splitting up

I've set up a Couchbase cluster with 2 nodes containing 300k docs across 4 buckets. The replicas option is forced to 1, as there are only 2 machines.
But the documents are split, half on one node and half on the other. I need a full copy of each document on both nodes, so that if one node goes down the other can still supply all the data to my app.
Is there a setting I missed when creating the cluster?
Can I still set the cluster to replicate all documents?
I hope someone can help.
Thanks
PS: I'm using Couchbase Community 4.5
UPDATE:
I added screenshots of the cluster web interface and cbstats output:
the following is the state with one node only
next, the one with both nodes up:
then the cbstats results on both nodes when both are up and running:
As you can see, with only one node up, half of the items are displayed. Does that mean the other half resides as replicas but is not shown?
Can my app still run consistently with only one node?
UPDATE:
I had to click fail-over manually to see the replicas become active on the remaining node. With just two nodes in the cluster, auto fail-over is disabled!
Couchbase Server will partition or shard the documents across the two nodes, as you observed. It will also place replicas on those nodes, based on your one-replica configuration.
To access a replica, you must use one of the Client SDKs.
For example, this Java code will attempt to retrieve a replica (getFromReplica("id", ReplicaMode.ALL)) if the active document retrieval fails (get("id")).
bucket.async()
.get("id")
.onErrorResumeNext(bucket.async().getFromReplica("id", ReplicaMode.ALL))
.subscribe();
The ReplicaMode.ALL tells Couchbase to try all nodes with replicas and the active node.
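For reference, here is a rough blocking-API equivalent of the async snippet above, assuming Couchbase Java SDK 2.x and made-up node and bucket names; it falls back to the replicas only when the read of the active copy fails:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.ReplicaMode;
import com.couchbase.client.java.document.JsonDocument;
import java.util.List;

public class ReplicaReadExample {
    public static void main(String[] args) {
        // Placeholder node addresses and bucket name
        Cluster cluster = CouchbaseCluster.create("node1.example.com", "node2.example.com");
        Bucket bucket = cluster.openBucket("myBucket");

        JsonDocument doc;
        try {
            doc = bucket.get("id"); // read from the node holding the active copy
        } catch (RuntimeException e) {
            // Active node unreachable (e.g. timed out): try the replicas instead
            List<JsonDocument> replicas = bucket.getFromReplica("id", ReplicaMode.ALL);
            doc = replicas.isEmpty() ? null : replicas.get(0);
        }
        System.out.println(doc);
        cluster.disconnect();
    }
}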
So what was happening with only two nodes in the cluster was that auto fail-over didn't start automatically, as specified here:
https://developer.couchbase.com/documentation/server/current/clustersetup/automatic-failover.html
This means the data replicas were not activated on the remaining node unless fail-over was triggered manually.
The best thing is to have more than TWO nodes in the cluster before going into production.
To be honest, I should have read the documentation more carefully before asking.
Thanks Jeff Kurtz for your help, you pushed me towards the solution (understanding how Couchbase's replica policy works).

Frequently updating a large JSON file on Amazon S3 and potential write conflict

I first want to give a little overview on what I'm trying to tackle. My service is frequently fetching posts from various sources such as Instagram, Twitter, etc. and I want to store the posts in one large JSON file on S3. The file name would be something like: {slideshowId}_feed.json
My website will display the posts in a slideshow, and the slideshow will simply poll the S3 file every minute or so to get the latest data. It might even poll another file such as {slideshowId}_meta.json that has timestamp from when the large file changed in order to save bandwidth.
The reason I want to keep the posts in a single JSON file is mainly to save cost. I could have each source as its own file, e.g. {slideshowId}_twitter.json, {slideshowId}_instagram.json, etc. but then the slideshow would need to send GET request to every source every minute, thus increasing the cost. We're talking about thousands of slideshows running at once, so the cost needs to scale well.
Now back to the question. There may be more than one instance of the service running that checks Instagram and other sources for new posts, depending on how much I need to scale out. The problem with that is the risk of one service overwriting the S3 file while another one is already writing to it.
Each service that needs to save posts to the JSON file would first have to GET the file, process it, check that the new posts are not duplicated in the JSON file, and then store the new or updated posts.
Could I have each service write the data to some queue like the Simple Queue Service (SQS) and then have some worker that takes care of writing the posts to the S3 file?
I thought about using AWS Kinesis, but it just processes the data from the sources and dumps it to S3. I need to process what has been written to the large JSON file as well, to do some bookkeeping.
I had an idea of using DynamoDB to store the posts (basically to do the bookkeeping), and then I would simply have the service query all the data needed for a single slideshow from DynamoDB and store it to S3. That way the services would simply send the posts to DynamoDB.
There must be some clever way to solve this problem.
Ok, for your use case:
there are many users of a single large S3 file
the file is updated often
the file path (ideally) should be consistent, to make it easier to GET and cache
the S3 file is generated by a process on an EC2 instance and updated once per minute
If the GET rate is less than 800 per second, then AWS is happy with it. If not, you'll have to talk to them and maybe find another way. See http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
The file updates will be atomic, so there are no issues with locking etc. See http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPUT.html
Presumably, if a user requests the file "during" an update, they will see the old version. This behaviour is transparent to both parties.
File updates are "eventually" consistent. As you want to keep the URL the same, you will be updating the same object path in S3.
If you are serving across regions, then the time it takes to become consistent might be an issue. For the same region it seems to take a few seconds. AWS doesn't seem to be very open about this, so it's probably best to test it for your use case. As your file is small and the updates happen every 60 seconds, I would imagine it would be ok. You might have to state in your API description that updates actually propagate over a longer window than 60 seconds, to take this into account.
As EC2 and S3 run on different parts of the AWS infrastructure (EC2 in a VPC and S3 behind a public HTTPS endpoint), you will pay for transfer costs from EC2 to S3.
I would imagine that you will be serving the S3 file via the S3 "pretend to be a website" feature (static website hosting). You will have to configure this too, but that is trivial.
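To make the atomic-update point concrete, here is a minimal sketch of the writer side using the AWS SDK for Java (v1); the bucket name, key, and feed contents are placeholders:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class FeedUploader {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Placeholder bucket and the fixed key the slideshow polls every minute
        String bucket = "my-slideshow-feeds";
        String key = "12345_feed.json";
        String feedJson = "{\"posts\": []}"; // the freshly rebuilt feed

        // Each PUT replaces the whole object in one shot: readers see either
        // the old version or the new one, never a partially written file.
        s3.putObject(bucket, key, feedJson);
    }
}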
This is what I would do:
The Kinesis stream would need enough capacity to handle the writes from all your feed producers. For about $25/month you get to do 2000 writes per second.
A Lambda would simply be fired whenever there are enough new items on your stream. You can configure the trigger to wait for 1000 new items and then run the Lambda to read all the new items from the stream, process them, and write them to Redis (ElastiCache). Your bill for that should be well under $10/month.
Smart key selection would take care of duplicate items. You can also set the items to expire if you need to. According to your description, your items should definitely fit into memory, and you can add instances if you need more capacity for reading and/or reliability. Running two Redis instances with enough memory to handle your data would cost around $26/month.
Your service would use Redis instead of S3, so you would only pay for the data transfer, and only if your service is not on AWS (under $10/month?).
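A very rough sketch of the Lambda in that setup, assuming the aws-lambda-java-events and Jedis libraries, a placeholder ElastiCache endpoint, and that your producers put a unique post id in the Kinesis partition key so duplicates simply overwrite themselves:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import redis.clients.jedis.Jedis;
import java.nio.charset.StandardCharsets;

public class FeedIngestLambda implements RequestHandler<KinesisEvent, Void> {
    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        // Placeholder ElastiCache (Redis) endpoint
        try (Jedis redis = new Jedis("my-feed-cache.example.cache.amazonaws.com", 6379)) {
            for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
                String postJson = new String(
                        record.getKinesis().getData().array(), StandardCharsets.UTF_8);
                // Key by the partition key (assumed to be a unique post id),
                // so re-delivered or duplicate posts just overwrite the same entry
                redis.set("post:" + record.getKinesis().getPartitionKey(), postJson);
            }
        }
        return null;
    }
}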

cassandra sstableloader load data from csv with various partition keys

I want to load a large CSV file into my Cassandra cluster (1 node at the moment).
Based on: http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
My data is transformed by CQLSSTableWriter into SSTable files, then I use SSTableLoader to load those SSTables into a Cassandra table that already contains some data.
The CSV file contains various partition keys.
Now let's assume that a multi-node Cassandra cluster is used.
My questions:
1) Is the loading procedure that I use correct in the case of a multi-node cluster?
2) Will the SSTable files be split by SSTableLoader and sent to the nodes responsible for the specific partition keys?
Thank you
1) Loading into a single-node cluster or 100-node cluster is the same. The only difference is that the data will be distributed around the ring if you have a multi-node cluster. The node where you run sstableloader becomes the coordinator (as #rtumaykin already stated) and will send the writes to the appropriate nodes.
2) No. As in my response above, the "splitting" is done by the coordinator. Think of the sstableloader utility as just another instance of a client sending writes to the cluster.
3) In response to your follow-up question, the sstableloader utility is not sending files to nodes but sending writes of the rows contained in those SSTables. The sstableloader reads the data and sends write requests to the cluster.
Yes
It will actually be done by the coordinator node, not by the SSTableLoader.
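For completeness, here is a minimal sketch of the CQLSSTableWriter step described in the question (the keyspace, table, columns, and output directory are placeholders), after which the generated directory is streamed to the cluster with sstableloader:

import org.apache.cassandra.io.sstable.CQLSSTableWriter;
import java.util.UUID;

public class CsvToSSTables {
    public static void main(String[] args) throws Exception {
        String schema = "CREATE TABLE myks.mytable (id uuid PRIMARY KEY, value text)";
        String insert = "INSERT INTO myks.mytable (id, value) VALUES (?, ?)";

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/myks/mytable")   // must exist and follow the keyspace/table layout
                .forTable(schema)
                .using(insert)
                .build();

        // In the real job you would loop over the parsed CSV rows here
        writer.addRow(UUID.randomUUID(), "some value");
        writer.close();

        // Then stream the generated SSTables to the cluster, e.g.:
        //   sstableloader -d <node_address> /tmp/myks/mytable
    }
}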

Starting MySQL Cluster

I'm new to clustering and I'm doing a project on a clustered database. I want to make use of MySQL Cluster. I'm using it for a small-scale database and this is my plan:
5 nodes:
1 management node,
2 SQL nodes,
2 API nodes.
My questions are:
1) Is my plan for the node processes alright?
2) What should I do when I get the error "Failed to allocate node id..."?
3) Is it a requirement to use the multi-threaded data node?
4) Where do I place my web server page for the user to access the database?
Please reply. Thank you so much.
This answer might be a little late, but it can be helpful for someone else getting started:
1) Is my plan for the node processes alright?
Your plan for the node processes is OK for a small cluster. I would recommend adding an additional management node and 2 more data nodes if the number of replicas you have configured is 2. The reason is that with only 2 data nodes, your cluster will not be functional once one of those nodes "dies". This is because of the two-phase commit that takes place: in the case of a failure, only 1 data node would be able to persist the data, the other one would be unreachable, and therefore the transaction would be marked as incomplete.
2) What should I do when I get the error "Failed to allocate node id..."?
This error is usually thrown if you have assigned the same id to more than one node in your configuration file. Each node should have a unique id.
3) Is it a requirement to use the multi-threaded data node?
It is not a requirement, but it is recommended. Using the multi-threaded data node lets you leverage modern machines with multiple CPUs, so your queries and updates are processed much faster.
4) Where do I place my web server page for the user to access the database?
Hmm, not sure what you want to achieve here. This would really be a separate question. Whether you are using PHP or any other language, you will have to have a web server configured. Place the pages in the root of the HTTP directory to get started.
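On question 4, the key point is that the web application's backend connects to one of the SQL (mysqld) nodes like any ordinary MySQL server; tables created with ENGINE=NDBCLUSTER are visible from every SQL node. A minimal sketch with plain JDBC, using made-up host, credentials, and table names:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ClusterQueryExample {
    public static void main(String[] args) throws Exception {
        // Placeholder host name of one of the SQL nodes in the cluster
        String url = "jdbc:mysql://sql-node-1:3306/mydb";
        try (Connection conn = DriverManager.getConnection(url, "appuser", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM mytable")) {
            if (rs.next()) {
                System.out.println("Rows: " + rs.getLong(1));
            }
        }
    }
}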