Ethereum: What's a good way to retrieve a large amount of old smart contract log data from an RPC service for a backfill?

The problem I'm posed with is backfilling a specialized database using data from the event log of a given smart contract on an Ethereum blockchain.
The question, however, is how to do this without hitting the limits of eth_getLogs (and, even without hard limits, how to keep the RPC responses reasonably sized).
What I tried so far
I prefer to use Infura, but they limit this call to 100 entries per response. And rightfully so: querying should be done in small chunks for load balancing etc. Is API pagination + eth_getLogs the right way to collect data for backfills?
Idea 1: eth_getLogs on ranges of blocks
I don't know of any way to paginate eth_getLogs other than querying ranges of blocks. However, a block may contain more than 100 events, which prevents me from reading all of the data when using Infura. Maybe there is a way to paginate on log index? (The 100-entry limit is something I came across while experimenting; I can't find documentation on it.)
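For reference, this is roughly what I mean by querying ranges of blocks (a minimal sketch using go-ethereum's ethclient; the endpoint URL, contract address, chunk size and block window are placeholders, and a real backfill would retry or shrink the range instead of bailing out):

```go
package main

import (
	"context"
	"log"
	"math/big"

	"github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	client, err := ethclient.Dial("https://mainnet.infura.io/v3/YOUR-PROJECT-ID")
	if err != nil {
		log.Fatal(err)
	}
	contract := common.HexToAddress("0x0000000000000000000000000000000000000000") // placeholder

	const chunkSize uint64 = 1000                          // tune against provider limits
	start, end := uint64(10_000_000), uint64(10_010_000) // example backfill window

	for from := start; from <= end; from += chunkSize {
		to := from + chunkSize - 1
		if to > end {
			to = end
		}
		logs, err := client.FilterLogs(context.Background(), ethereum.FilterQuery{
			FromBlock: new(big.Int).SetUint64(from),
			ToBlock:   new(big.Int).SetUint64(to),
			Addresses: []common.Address{contract},
		})
		if err != nil {
			// A real backfill would retry here, or halve the range if the
			// provider complains that the response is too large.
			log.Fatalf("range %d-%d: %v", from, to, err)
		}
		for _, l := range logs {
			_ = l // decode the event and write it to the database
		}
	}
}
```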
Idea 2: log filters
Using a filter RPC call is another option, i.e. starting a "watcher" on a range of old blocks. I tried this, but the Infura websocket RPC I am using doesn't seem to give any response, and neither does Ganache when testing locally. Non-archive (i.e. live watching) logs work, so I know that my code is working as intended at least. (My go-ethereum Watch... generated binding call works, but does not produce responses on the output channel when I specify an old block in bind.WatchOpts.Start.)
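For completeness, the Filter... counterpart that abigen generates next to each Watch... method goes through eth_getLogs with an explicit block range rather than a live subscription, so this is roughly what I would call for the backfill (TokenFilterer and FilterTransfer are stand-ins for whatever names my actual generated binding uses):

```go
import (
	"context"

	"github.com/ethereum/go-ethereum/accounts/abi/bind"
)

// backfillTransfers reads historical events through an abigen binding.
// "TokenFilterer" and "FilterTransfer" are hypothetical names; substitute
// the filterer type and event of your own generated binding.
func backfillTransfers(ctx context.Context, token *TokenFilterer, start, end uint64) error {
	it, err := token.FilterTransfer(&bind.FilterOpts{
		Start:   start,
		End:     &end, // inclusive upper bound; nil would mean "up to latest"
		Context: ctx,
	}, nil, nil) // no filtering on the indexed event parameters
	if err != nil {
		return err
	}
	defer it.Close()
	for it.Next() {
		_ = it.Event // decode/store the event in the database
	}
	return it.Error()
}
```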
Does anyone have any suggestions on how to retrieve large amounts of log data? Or a link to other projects that tackled this problem?

Related

'Search' endpoint seems unstable

When using the Forge Data Management API endpoint projects/:project_id/folders/:folder_id/search we have two problems.
It seems that we sometimes have to wait for several minutes (hours?) after a model is uploaded until it can be found by the search.
We often get error 429 "Too Many Requests" even though we only make very few calls (fewer than 10 within an hour).
These issues make the endpoint hard to use in production code. Is there anything we can do to improve the success rate? Is Autodesk going to improve the endpoint?
This question is related to How to find cloud Item id of a Revit model?
We are aware of those issues and are working on improving things in those areas. However, I cannot say exactly when those improvements will become available.
In the meantime, depending on your workflow, there are two things that could be of help:
Use webhooks in order to be notified about new files being added to BIM 360/ACC
Use the folder/contents endpoint to find the file you need; it also supports filtering, just like the folder/search endpoint. You would have to iterate through subfolders, though, if you wanted to look for items in them as well. Newly added files should show up here straight away.
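As a rough illustration, calling folder/contents from Go could look like the sketch below; the project and folder IDs, the token, and the filter[type]=items query are placeholders, so check the Data Management documentation for the filter fields the endpoint actually supports:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
)

func main() {
	// Illustrative IDs and token; substitute your own values.
	projectID := "b.your-project-id"
	folderID := "urn:adsk.wipprod:fs.folder:co.your-folder-id"
	token := "YOUR-ACCESS-TOKEN"

	// folder/contents with a filter, roughly playing the role of folder/search
	// but returning newly added items straight away.
	endpoint := fmt.Sprintf(
		"https://developer.api.autodesk.com/data/v1/projects/%s/folders/%s/contents",
		url.PathEscape(projectID), url.PathEscape(folderID))
	req, err := http.NewRequest(http.MethodGet, endpoint+"?filter[type]=items", nil)
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(resp.Status)
	fmt.Println(string(body)) // JSON:API document listing the folder contents
}
```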

Best strategy to script a web journey in JMeter with too many JSON submits in requests

I am in the process of developing a JMeter script for a web journey which has close to 30 transactions. I observe that around 20 requests (API calls) submit heavy JSON payloads (with lots of fields), and this JSON structure varies significantly with the key data (account number). Correlating hundreds of fields for each of these JSON requests doesn't look like an efficient way of scripting. I did try submitting predefined JSON files in the requests, but that way I would need at least 10 JSON files (one per request) for each account number, and considering I am looking to test around 200 account numbers, this also doesn't look like a reasonable approach. Can someone please suggest how to approach this? Should I test the APIs independently?
There is only one strategy: your load test must represent real-life application usage with 100% accuracy, so I would go for separate API testing only in two cases:
As a last resort, if for some reason a realistic workload is not possible
When one particular API is known to be a bottleneck and you need to repeat the test focusing on one endpoint only
With regards to correlation, there are some semi- and fully automated correlation solutions, such as:
Correlations Recorder Plugin for JMeter where you can specify the correlation rules prior to recording and all matching values will be automatically replaced with the respective extractors/variables
BlazeMeter Proxy Recorder which is capable of exporting recorded scripts in "SmartJMX" mode with automatic detection and correlation of dynamic elements
If the next request can be "constructed" from the previous response but with some transformation of the JSON structure, you can amend the response and convert it to the required form using JSR223 Test Elements - see the Apache Groovy - Parsing and Producing JSON article for more information

What is the RESTful way to return a JSON + binary file in an API

I have to implement a REST endpoint that receives start and end dates (among other arguments). It does some computations to generate a result that is a kind of forecast, according to the server state at the time of invocation and the input data (imagine a weather forecast for the next few days).
Since the endpoint does not alter the system state, I plan to use the GET method and return JSON.
The issue is that the output also includes an image file (a plot). So my idea is to create a unique id for the file and include a URI in the JSON response to be consumed later (I think this is the way suggested by the HATEOAS principle).
My question is: since this image file is a resource that is valid only as part of the response to a single invocation of the original endpoint, I need a way to delete it once it has been consumed.
Would it be RESTful to delete it after serving it via a GET?
or expose it only via a DELETE?
or not delete it on consumption and keep it for some time? (A purge has to be performed anyway, since I can't ensure the client ever consumes the file.)
I would appreciate your ideas.
Would it be RESTful to delete it after serving it via a GET?
Yes.
or expose it only via a DELETE?
Yes.
or not delete it on consumption and keep it for some time?
Yes.
The last of these options (caching) is a decent fit for REST in HTTP, since we have metadata that we can use to communicate to general-purpose components that a given representation has a finite lifetime.
So this representation of the report (which includes the link to the plot) could be accompanied by an Expires header that informs the client that the representation has an expected shelf life.
You might, therefore, plan to garbage collect the image resource after 10 minutes, and if the client hasn't fetched it before then - poof, gone.
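A minimal sketch of that in Go - the in-memory store, the illustrative /plots/ path and the ten-minute shelf life are my assumptions; the forecast endpoint would put the rendered image into the store and return a JSON body containing the link:

```go
package main

import (
	"log"
	"net/http"
	"strings"
	"sync"
	"time"
)

// plotStore keeps generated plot images for a limited time.
type plotStore struct {
	mu    sync.Mutex
	plots map[string][]byte
	ttl   time.Duration
}

func (s *plotStore) put(id string, png []byte) {
	s.mu.Lock()
	s.plots[id] = png
	s.mu.Unlock()
	// Garbage-collect the image once its shelf life is over.
	time.AfterFunc(s.ttl, func() {
		s.mu.Lock()
		delete(s.plots, id)
		s.mu.Unlock()
	})
}

func (s *plotStore) serve(w http.ResponseWriter, r *http.Request) {
	id := strings.TrimPrefix(r.URL.Path, "/plots/")
	s.mu.Lock()
	png, ok := s.plots[id]
	s.mu.Unlock()
	if !ok {
		http.NotFound(w, r) // expired or never existed: a 404 is acceptable
		return
	}
	// Tell clients and caches how long this representation is expected to live.
	w.Header().Set("Content-Type", "image/png")
	w.Header().Set("Expires", time.Now().Add(s.ttl).UTC().Format(http.TimeFormat))
	w.Write(png)
}

func main() {
	store := &plotStore{plots: map[string][]byte{}, ttl: 10 * time.Minute}
	http.HandleFunc("/plots/", store.serve)
	// The forecast endpoint would call store.put(id, png) and include
	// "/plots/<id>" as the image link in its JSON response.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```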
The reason that you might want to keep the image around after you send the response to the GET: the network is unreliable, and the GET message may never reach its destination. Having things in cache saves you the compute of trying to recalculate the image.
If you want confirmation that the client did receive the data, then you must introduce another message to the protocol, for the client to inform you that the image has been downloaded successfully.
It's reasonable to combine these strategies: schedule yourself to evict the image from the cache in some fixed amount of time, but also evict the image immediately if the consumer acknowledges receipt.
But REST doesn't make any promises about liveness - you could send a response with a link to the image, but 404 Not Found every attempt to GET it, and that's fine (not useful, of course, but fine). REST doesn't promise that resources have stable representations, or that the resource is somehow eternal.
REST gives us standards for how we request things, and how responses should be interpreted, but we get a lot of freedom in choosing which response is appropriate for any given request.
You could offer a download link to that binary resource in the JSON response, a link that also contains the parameters required to generate the resource. Then you can decide for yourself when to clean the file up (to manage disk space) or cache it - you can always regenerate it because you still have the parameters. I assume here that the generation doesn't take significant time.
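A short sketch of that idea (a net/http handler fragment), where the link carries the generation parameters so the plot can be rebuilt on demand; renderPlot is a hypothetical stand-in for the plotting code:

```go
// Handles e.g. GET /plot?start=2024-01-01&end=2024-01-07. Because the URL
// carries the same parameters that produced the forecast, the image can be
// regenerated even after the original file has been cleaned up.
func plotHandler(w http.ResponseWriter, r *http.Request) {
	start := r.URL.Query().Get("start")
	end := r.URL.Query().Get("end")
	if start == "" || end == "" {
		http.Error(w, "missing start/end", http.StatusBadRequest)
		return
	}
	png, err := renderPlot(start, end) // hypothetical: regenerate, or fetch from cache
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "image/png")
	w.Write(png)
}
```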
It's a tricky one. Typically GET requests should be repeatable, as an important HTTP feature, in case the original failed. Some people might rely on it.
It could also be construed as a 'non-safe' operation, with a GET resulting in what is effectively a DELETE.
I would be inclined to expire the image after X seconds/minutes instead, perhaps also supporting DELETE at that endpoint if the client got the result and wants to clean up early.

Is using a WebServer for executing methods remotely a good architectural decision?

A very small explanation of what I am talking about is presented below.
In my organization a strange kind of effort at load distribution is implemented; I will attempt to explain it below.
A JBoss server runs on a machine with some IP xxx.xxx.xxx.xxx.
There is an app which has a lot of heavy, time- and resource-consuming work to do (heavy in terms of I/O operations, e.g. large file uploads and downloads, in gigabytes).
Methods are written in a REST web app and are accessible by URL; parameters to the methods are passed as URL params, the requests are of type POST, and the return values of the methods come back as JSON output.
The app in question has a framework written just to make POST calls to the aforementioned web app, wrapped nicely so that the calls can be made without the programmer knowing what happens in the background.
This framework has externalized parameters which take in the IPs of the machines running the web apps, configure a list of the machines available, and route each method call to the one least busy whenever a call to the framework is made.
Everything looks good, but I suspect that wrapping things in an HTTP web server, doing the processing there, and sending the outputs as JSON may slow things down, and that collecting the logs in case of failures may be tough.
Questions
I want to know the views of other programmers on this, and whether this is a good approach or not.
Also, do any existing commercial applications follow something similar when trying to distribute load?
In my opinion, the efficiency of load balancing lies in the tricky process of choosing the right node that will process your request.
A usual approach is to monitor the CPU usage of the nodes that have to take the load and send the load to the one least busy. This process should be both accurate and efficient.
In your case, wrapping up the request and transferring the JSON data should be the least of your concerns, as they seem to be required activities and light ones too. The focus should be on the load-balancing activity.
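As a minimal illustration of the "least busy" selection, assuming each method server reports some load figure (CPU, queue depth, in-flight requests) through a mechanism of your own:

```go
import "errors"

// node describes one method server as seen by the calling framework.
type node struct {
	URL  string
	Load float64 // most recent load sample; lower means less busy
}

// leastBusy picks the node with the lowest reported load.
func leastBusy(nodes []node) (node, error) {
	if len(nodes) == 0 {
		return node{}, errors.New("no method servers configured")
	}
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.Load < best.Load {
			best = n
		}
	}
	return best, nil
}
```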
Answer to comment
If I understood correctly, the HTTP method servers are there to process the requests from the central app. Load balancing will not be done by them. There has to be a central load-balancing mechanism/tool that decides which method server will take which request.
All the method servers (slaves) should be identical. That reduces maintenance effort, as a fix can be made on one node and propagated to the other nodes. This is how it is done in my organization.
It may be a requirement if a similar operation is done repeatedly; an application server may reduce some duplication here.

Message queuing solution for millions of topics

I'm thinking about a system that will notify multiple consumers about events happening to a population of objects. Every subscriber should be able to subscribe to events happening to zero or more of the objects, and multiple subscribers should be able to receive information about events happening to a single object.
I think that some message queuing system will be appropriate in this case, but I'm not sure how to handle the fact that I'll have millions of objects - using a separate topic for every one of the objects does not sound good [or is it just fine?].
Can you please suggest an approach I should take, and maybe even some open-source message queuing system that would be reasonable?
A few more details:
there will be thousands of subscribers [so not that many of them],
subscribers will subscribe to tens or hundreds of objects each,
there will be ~5-20 million objects,
the events themselves don't have to carry any message; just the information that the object has changed is enough,
the vast majority of objects will never be subscribed to,
events occur at a maximum rate of a few hundred per second,
ideally the server should run under Linux and be able to integrate with the rest of the ecosystem via HTTP long-poll [using node.js? continuations under Jetty?].
Thanks in advance for your feedback and sorry for somewhat vague question!
I can highly recommend RabbitMQ. I have used it in a couple of projects before and, from my experience, I think it is very reliable and offers a wide range of configurations. Basically, RabbitMQ is an open-source (Mozilla Public License (MPL)) message broker that implements the Advanced Message Queuing Protocol (AMQP) standard.
As documented on the RabbitMQ web-site:
RabbitMQ can potentially run on any platform that Erlang supports, from embedded systems to multi-core clusters and cloud-based servers.
... meaning that an operating system like Linux is supported.
There is a library for node.js here: https://github.com/squaremo/rabbit.js
It comes with an HTTP-based API for management and monitoring of the RabbitMQ server - including a command-line tool and a browser-based user interface as well - see: http://www.rabbitmq.com/management.html.
In the projects I have been working on, I have communicated with RabbitMQ using C# and two different wrappers, EasyNetQ and Burrow.NET. Both are excellent wrappers for RabbitMQ, but I ended up liking Burrow.NET the most, as it is easier and more obvious to work with (it doesn't do a lot of magic under the hood) and provides good flexibility for injecting loggers, serializers, etc.
I have never worked with the number of objects that you are going to work with - I have worked with thousands (not millions). However, no matter how many objects I have been playing around with, RabbitMQ has always been really stable and has never been the source of errors in the system.
So to sum up - RabbitMQ is simple to use and set up, supports AMQP, can be managed via HTTP and, what I like the most, it's rock solid.
Break up the topics to carry specific events, e.g. an "object updated" topic, an "object deleted" topic, and so on. Clients then only have to subscribe to the finite number of event-based topics they are interested in.
Inject headers into your messages when you publish them, and put intelligence into the clients to use these headers as message selectors. For example, the client knows the list of objects it is interested in - and say you identify each object by an "id" - the id can be a header, and the client will use the "id" header to determine whether it is interested in the message.
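To make the header idea concrete, here is a hedged sketch using a RabbitMQ headers exchange with the Go amqp091 client (a broker-side analogue of JMS message selectors); the exchange name, header key, object ids and connection URL are illustrative:

```go
package main

import (
	"context"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// One headers exchange per event type ("object.updated", "object.deleted", ...);
	// the object id travels as a header rather than as part of a topic name.
	if err := ch.ExchangeDeclare("object.updated", "headers", true, false, false, false, nil); err != nil {
		log.Fatal(err)
	}

	// A subscriber binds an anonymous queue once per object it follows.
	q, _ := ch.QueueDeclare("", false, true, true, false, nil)
	for _, id := range []string{"42", "4711"} { // illustrative object ids
		ch.QueueBind(q.Name, "", "object.updated", false, amqp.Table{
			"x-match":   "any",
			"object-id": id,
		})
	}

	// The publisher stamps every event with the id of the affected object;
	// the event needs no payload beyond "this object changed".
	ch.PublishWithContext(context.Background(), "object.updated", "", false, false,
		amqp.Publishing{Headers: amqp.Table{"object-id": "42"}})

	msgs, _ := ch.Consume(q.Name, "", true, false, false, false, nil)
	for m := range msgs {
		log.Printf("object %v changed", m.Headers["object-id"])
	}
}
```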
Depending on your requirements, you may also want to consider guaranteed delivery, to make sure that the client will receive the message even if it goes offline and comes back later.
The options I would recommend off the top of my head are ActiveMQ, RabbitMQ and Redis Pub/Sub (I haven't really worked with Redis pub/sub, so please do your own due diligence).
Finally, here are some performance benchmarks for RabbitMQ and Redis.
Just saw that you only have a few hundred messages getting pushed out per second; this is not a big deal for ActiveMQ. I have been using AMQ on a system that processes 240 messages per second, and it works just fine. I do use a thread pool of workers to process the messages asynchronously, though. Look at a framework like Akka if you are in Java land; if not, stick with node.js and the cool ecosystem around it.
If it has to be open source, I'd go for ActiveMQ plus an application server to provide the JMS functionality for topics; it also has Ajax support, so you can access the topics from your client.
So, you would use the JMS infrastructure to publish the topics for the objects, and you can create topics as you need them.
Besides, by using a Java application server you may be able to take advantage of clustering, load balancing and other high-availability features (obviously depending on the selected product).
Hope that helps!!!
Since your messages are very small, you might want to consider MQTT, which is designed for small devices, although it works fine on powerful devices as well. The key consideration is the low overhead - basically a 2-byte header for a small message. You probably can't use just any simple or open-source MQTT server, due to your volume; you probably need a heavy-duty dedicated appliance like MessageSight to handle it.
Some more details on your application would certainly help. Also you don't mention security at all. I assume you must have some needs in this area.
Though I'm not sure about your work environment, here are my two cents. Can you identify each object with a unique ID in your system? If so, you can have one topic per event type - e.g. you want to track an object-deletion event, an object-update event, and so on. These topics would be published with the IDs of the objects whenever the corresponding event happens to an object. This limits the number of topics you need.
The second part of your problem is that different subscribers want to subscribe to different objects, so not all subscribers are interested in events from all objects. This part maps to the message selector (filtering) mechanism provided by the messaging framework. Basically, you need to determine on what basis a subscriber is interested in a particular object, and use that basis as the message-filtering criterion. It could be anything: object type, object state, etc. Ultimately your system would consist of one topic per event type, with publishers sending event messages carrying {object-type:object-id} information, and subscribers subscribing to any topic with a filtering criterion.
If the above approach works for you, you can use any messaging solution: ActiveMQ, WMQ, RabbitMQ.