Is there a way to capture tuples/sec through an operator in IBM Streams (not through the Streams console)? - infosphere-spl

I want to capture the number of tuples/sec through an operator and log it in a file. I can't use the Throttle operator to set the tuple rate myself. To be clear, I am not asking about capturing this information through the console, but through the SPL application itself.

There is no direct "give me the throughput for this operator" metric available. You could implement a primitive operator that accesses the nTuplesProcessed metric over time and calculates the throughput from that (see the list of available metrics). But I actually find it much easier to use the following composite operator:
public composite PeriodicThroughputSink(input In) {
    param
        expression<float64> $period;
        expression<rstring> $file;
    graph
        stream<boolean b> Period = Beacon() {
            param period: $period;
        }

        stream<float64 throughput> Throughput = Custom(In; Period) {
            logic
                state: {
                    mutable uint64 _count = 0;
                    float64 _period = $period;
                }
                onTuple In: {
                    ++_count;
                }
                onTuple Period: {
                    if (_count > 0ul) {
                        submit({throughput=((float64)_count / _period)}, Throughput);
                        _count = 0ul;
                    }
                }
            // ensures that the throughput calculation and file writing is on a
            // different thread from the rest of the application
            config threadedPort: queue(Period, Sys.Wait);
        }

        () as Sink = FileSink(Throughput) {
            param
                file: $file;
                format: txt;
                flush: 1u;
        }
}
You can then use the composite operator as a "throughput tap": it consumes the stream produced by the operator whose throughput you want to record. For example, you may use it like so:
stream<Data> Result = OperatorYouCareAbout(In) {}

() as ResultThroughput = PeriodicThroughputSink(Result) {
    param
        period: 5.0;
        file: "ResultThroughput.txt";
}
Of course, you can then still use the Result stream elsewhere in your application. Keep in mind that this method may have some impact on the performance of the application: we're putting a tap on the data path. But the impact should not be large, particularly if you make sure that the operators in the PeriodicThroughputSink are fused into the same PE as the operator you're tapping. Also, the shorter the period, the more likely it is to impact application performance.
Again, we could do something similar in a C++ or Java primitive operator by accessing the nTuplesProcessed metric. You could also grab the system metrics from outside of your application: for example, a script could periodically run streamtool capturestate or call the REST API, parse the output, find the nTuplesProcessed metric for the operator you care about, and use that to calculate throughput. But I find the technique in this composite operator much easier.


How to convert Pulumi Output<T> to string?

I am dealing with creating an AWS API Gateway. I am trying to create a CloudWatch Log group and name it API-Gateway-Execution-Logs_${restApiId}/${stageName}. I have no problem with the Rest API creation.
My issue is converting restApi.id, which is of type pulumi.Output, to a string.
I have tried these two versions, which are proposed in their PR#2496:
const restApiId = apiGatewayToSqsQueueRestApi.id.apply((v) => `${v}`);
const restApiId = pulumi.interpolate `${apiGatewayToSqsQueueRestApi.id}`
Here is the code where it is used:
const cloudWatchLogGroup = new aws.cloudwatch.LogGroup(
    `API-Gateway-Execution-Logs_${restApiId}/${stageName}`,
    {},
);
stageName is just a string.
I have also tried to apply again, like:
const restApiIdString = restApiId.apply((v) => v);
I always get this error from pulumi up:
aws:cloudwatch:LogGroup API-Gateway-Execution-Logs_Calling [toString] on an [Output<T>] is not supported.
Please help me convert Output to string
@Cameron answered the naming question; I want to answer the question in your title.
It's not possible to convert an Output<string> to string, or any Output<T> to T.
Output<T> is a container for a future value T which may not be resolved even after the program execution is over. Most likely, your restApiId is generated by AWS at deployment time, so if you run your program in preview, there is no value for restApiId yet.
Output<T> is like a Promise<T> which will eventually be resolved, potentially only after some resources are created in the cloud.
Therefore, the only operations with Output<T> are:
Convert it to another Output<U> with apply(f), where f: T -> U
Assign it to an Input<T> to pass it to another resource constructor
Export it from the stack
Any value manipulation has to happen within an apply call.
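As a small illustration of those three options, here is a minimal TypeScript sketch; the S3 bucket and log group below are hypothetical resources invented purely for the example, not something from the original question:
import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

// A hypothetical resource whose `id` is an Output<string>.
const bucket = new aws.s3.Bucket("example-bucket");

// 1. Convert it to another Output<U> with apply(f).
const upperId: pulumi.Output<string> = bucket.id.apply(id => id.toUpperCase());

// 2. Pass it as an Input<T> to another resource constructor
//    (pulumi.interpolate builds an Output<string> from it).
const logs = new aws.cloudwatch.LogGroup("example-logs", {
    name: pulumi.interpolate`logs-for-${bucket.id}`,
});

// 3. Export it from the stack; the resolved value is shown once the deployment finishes.
export const bucketId = bucket.id;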
So long as the Output is resolvable while the Pulumi script is still running, you can use an approach like the one below:
import { Output } from "@pulumi/pulumi";
import * as gcp from "@pulumi/gcp";
import * as fs from "fs";

// create a GCP registry
const registry = new gcp.container.Registry("my-registry");
const registryUrl = registry.id.apply(_ => gcp.container.getRegistryRepository().then(reg => reg.repositoryUrl));

// create a GCP storage bucket
const bucket = new gcp.storage.Bucket("my-bucket");
const bucketURL = bucket.url;

// resolve an Output<T> into a Promise<T> so it can be awaited
function GetValue<T>(output: Output<T>) {
    return new Promise<T>((resolve, reject) => {
        output.apply(value => {
            resolve(value);
        });
    });
}

(async () => {
    fs.writeFileSync("./PulumiOutput_Public.json", JSON.stringify({
        registryURL: await GetValue(registryUrl),
        bucketURL: await GetValue(bucketURL),
    }, null, "\t"));
})();
To clarify, this approach only works when you're doing an actual deployment (i.e. pulumi up), not merely a preview (as explained here).
That's good enough for my use-case though, as I just want a way to store the registry-url and such after each deployment, for other scripts in my project to know where to find the latest version.
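If you want to guard against accidentally running this during a preview, one option is to check pulumi.runtime.isDryRun() before writing the file. This is only a sketch that reuses the GetValue helper and the outputs defined above:
import * as pulumi from "@pulumi/pulumi";

(async () => {
    // During `pulumi preview` the outputs may never resolve, so skip the export file.
    if (pulumi.runtime.isDryRun()) {
        return;
    }
    fs.writeFileSync("./PulumiOutput_Public.json", JSON.stringify({
        registryURL: await GetValue(registryUrl),
        bucketURL: await GetValue(bucketURL),
    }, null, "\t"));
})();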
Short Answer
You can set the physical name of your LogGroup via the name input, and you can construct it from the API Gateway id output using pulumi.interpolate. You must use a static string as the first argument to your resource. I would recommend using the same name you're providing to your API Gateway resource as the name for your Log Group. Here's an example:
const apiGatewayToSqsQueueRestApi = new aws.apigateway.RestApi("API-Gateway-Execution");

const cloudWatchLogGroup = new aws.cloudwatch.LogGroup(
    "API-Gateway-Execution", // this is the logical name and must be a static string
    {
        // this is the physical name and can be constructed from other resource outputs
        name: pulumi.interpolate`API-Gateway-Execution-Logs_${apiGatewayToSqsQueueRestApi.id}/${stageName}`,
    },
);
Longer Answer
The first argument to every resource type in Pulumi is the logical name and is used by Pulumi to track the resource internally from one deployment to the next. By default, Pulumi auto-names the physical resources from this logical name. You can override this behavior by specifying your own physical name, typically via a name input to the resource. More information on resource names and auto-naming is here.
The specific issue here is that logical names cannot be constructed from other resource outputs. They must be static strings. Resource inputs (such as name) can be constructed from other resource outputs.
Encountered a similar issue recently; adding this for anyone who comes looking.
For Pulumi Python, some policies require the input to be stringified JSON. Say you're writing an SQS queue and a DLQ for it; you might initially write something like this:
import json

import pulumi_aws

dlq = pulumi_aws.sqs.Queue("dlq")

queue = pulumi_aws.sqs.Queue(
    "queue",
    redrive_policy=json.dumps({
        "deadLetterTargetArn": dlq.arn,
        "maxReceiveCount": "3"
    })
)
The issue we see here is that the json lib errors out, stating that objects of type Output are not JSON serializable. When you print() dlq.arn, you see a memory address for it like <pulumi.output.Output object at 0x10e074b80>.
To work around this, we leverage Pulumi's Output helpers and write a callback function:
import json

import pulumi_aws
from pulumi import Output

def render_redrive_policy(arn):
    return json.dumps({
        "deadLetterTargetArn": arn,
        "maxReceiveCount": "3"
    })

dlq = pulumi_aws.sqs.Queue("dlq")

queue = pulumi_aws.sqs.Queue(
    "queue",
    redrive_policy=Output.all(arn=dlq.arn).apply(
        lambda args: render_redrive_policy(args["arn"])
    )
)
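For comparison, the same pattern in TypeScript looks like the sketch below; the resource names ("dlq", "queue") are illustrative only:
import * as aws from "@pulumi/aws";

const dlq = new aws.sqs.Queue("dlq");

const queue = new aws.sqs.Queue("queue", {
    // Build the stringified JSON inside apply, once the ARN is actually known.
    redrivePolicy: dlq.arn.apply(arn => JSON.stringify({
        deadLetterTargetArn: arn,
        maxReceiveCount: 3,
    })),
});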

Apache NiFi - When utilizing SplitText on large files, how can I make PutFile write the files out immediately?

I am reading in text files with 50k rows of data where each row represents a complete record.
Our NiFi flow uses SplitText to handle the file in batches of 1000 rows. (This was set up before my time, for memory issues I'm told.)
Is it possible to have PutFile execute immediately? I want the files to be written out by PutFile as soon as each record is done, not sit in a queue waiting for all 50k+ rows of data to be processed. It seems rather wasteful to do that if the data is being split up anyway.
I have been reading the documentation but cannot find whether this is by design and not configurable.
I'd appreciate any documentation or guidance that can help me answer this or configure my flow.
TL;DR: A workaround is to use multiple SplitText processors, the first one splitting into 10k rows for example, and the second splitting into 1000 rows. Then the first 10k rows will be split into 10 flow files and sent downstream while the next batch of 10k rows is still being processed by the second SplitText.
EDIT: Adding another workaround, a Groovy script to be used in InvokeScriptedProcessor:
import org.apache.nifi.components.PropertyDescriptor
import org.apache.nifi.components.ValidationContext
import org.apache.nifi.components.ValidationResult
import org.apache.nifi.logging.ComponentLog
import org.apache.nifi.processor.ProcessContext
import org.apache.nifi.processor.ProcessSessionFactory
import org.apache.nifi.processor.Processor
import org.apache.nifi.processor.ProcessorInitializationContext
import org.apache.nifi.processor.Relationship
import org.apache.nifi.processor.exception.ProcessException
import org.apache.nifi.processor.io.OutputStreamCallback

class GroovyProcessor implements Processor {
    def REL_SUCCESS = new Relationship.Builder().name("success").description('FlowFiles that were successfully processed are routed here').build()
    def REL_FAILURE = new Relationship.Builder().name("failure").description('FlowFiles that were not successfully processed are routed here').build()
    def REL_ORIGINAL = new Relationship.Builder().name("original").description('After processing, the original incoming FlowFiles are routed here').build()

    def ComponentLog log

    void initialize(ProcessorInitializationContext context) { log = context.logger }
    Set<Relationship> getRelationships() { return [REL_FAILURE, REL_SUCCESS, REL_ORIGINAL] as Set }
    Collection<ValidationResult> validate(ValidationContext context) { null }
    PropertyDescriptor getPropertyDescriptor(String name) { null }
    void onPropertyModified(PropertyDescriptor descriptor, String oldValue, String newValue) { }
    List<PropertyDescriptor> getPropertyDescriptors() { null }
    String getIdentifier() { null }

    void onTrigger(ProcessContext context, ProcessSessionFactory sessionFactory) throws ProcessException {
        // One session for the incoming flow file and a second one for the outgoing splits,
        // so the splits can be committed (and sent downstream) as soon as they are created.
        def session1 = sessionFactory.createSession()
        def session2 = sessionFactory.createSession()
        try {
            def inFlowFile = session1.get()
            if (!inFlowFile) return
            def inputStream = session1.read(inFlowFile)
            inputStream.eachLine { line ->
                // Emit each line as its own flow file and commit immediately
                def outFlowFile = session2.create()
                outFlowFile = session2.write(outFlowFile, { outputStream ->
                    outputStream.write(line.bytes)
                } as OutputStreamCallback)
                session2.transfer(outFlowFile, REL_SUCCESS)
                session2.commit()
            }
            inputStream.close()
            session1.transfer(inFlowFile, REL_ORIGINAL)
            session1.commit()
        } catch (final Throwable t) {
            log.error('{} failed to process due to {}; rolling back session', [this, t] as Object[])
            session2.rollback(true)
            session1.rollback(true)
            throw t
        }
    }
}

processor = new GroovyProcessor()
For completeness:
The Split processors were designed to support the Split/Merge pattern, and in order to merge them back together later, they each need the same "parent ID" as well as the count.
If you send flow files out before you've split everything up, you won't know the total count and won't be able to merge them back later. Also, if something goes wrong with the split processing, you may want to "roll back" the operation instead of having some flow files already downstream and the rest of them sent to failure.
In order to send out some flow files before all processing is complete, you have to "commit the process session". This prevents you from doing the things above, and it creates a break in the provenance for the incoming flow file, as you have to commit/transfer that file in the session that originally takes it in. All following commits will need new flow files created, which breaks the provenance/lineage chain.
Although there is an open Jira for this (NIFI-2878), there has been some dissent on the mailing lists and pull requests about adding this feature to processors that accept input (i.e. non-source processors). NiFi's framework is fairly transactional, and this kind of feature flies in the face of that.

Pre-/suffixing a Source stream in Play 2.5

Let's say I have a Play controller with this method:
def persons(): Action[AnyContent] =
  Action { _ =>
    Ok.chunked(personSource.map { p => JsObject(p) })
  }
The Akka Source is a large but finite stream of Persons from, say, our db. Loading all of it into memory at once would lead to out-of-memory exceptions.
The code above works fine; I get a long stream of JSON objects:
{"name": "TestPerson1"}{"name": "TestPerson2"}
But now a client has requested that the response have this format:
[{"name": "TestPerson1"},{"name": "TestPerson2"}]
I am having trouble finding how to emit a prefix/suffix to the stream. Maybe a filter, or nesting Actions? But the examples I find of those tend to operate on the Request, such as redirecting, or perform side-effecting operations such as logging something before handing processing over to the inner Action.
I would like to emit "[" at the start of the http response, keep the Source async chunked processing in the middle, and then emit a "]" at the end.
Solution found thanks to @cchantep:
val persons = source.map { p => JsObject(p).toString }.intersperse(",")

Action { _ =>
  Ok.chunked(Source(List("[")).concat(persons).concat(Source(List("]"))))
}
Or even simpler (thanks to this page, which I didn't find before):
Ok.chunked(source.map { p => JsObject(p).toString }.intersperse("[", ",", "]") )

Returning a reference does not live long enough

I've just started learning Rust, and have come from a mainly JavaScript background so I'm a bit stumped when it comes to the whole borrowing system and memory management.
I have the following code:
fn load(db: &MyPool, id: i32) -> &Account {
    let accounts: Vec<Account> = db.prepare("SELECT id, balance, name FROM `accounts` WHERE `id`=?")
        .and_then(|mut stmt| {
            stmt.execute(&[&id]).map(|result| {
                result.map(|x| x.unwrap()).map(|row| {
                    Account {
                        id: from_value(&row[0]),
                        balance: from_value(&row[1]),
                        name: from_value(&row[2])
                    }
                }).collect()
            })
        }).unwrap();
    &accounts[0]
}
And I've managed to fix all the errors the compiler throws out, apart from:
/main.rs:42:4: 42:12 error: 'accounts' does not live long enough
Is this the best way to get one result from the MySQL query, or have I been going at it completely wrong?
You don't want to return a reference to an account, but you want to pass ownership to the caller after retrieving from the db.
Thus, change the signature to:
fn load(db: &MyPool, id: i32) -> Account
Now the idea would be to return the object by value, not by reference:
accounts[0]
However, doing so will fail with the error: cannot move out of indexed content. A better approach would be to avoid collecting into a vector altogether, and use Iterator::next(&mut self) to take the first element. This would look like:
fn load(db: &MyPool, id: i32) -> Account {
    let account: Account = db.prepare("SELECT id, balance, name FROM `accounts` WHERE `id`=?")
        .and_then(|mut stmt| {
            stmt.execute(&[&id]).map(|result| {
                result.map(|x| x.unwrap()).map(|row| {
                    Account {
                        id: from_value(&row[0]),
                        balance: from_value(&row[1]),
                        name: from_value(&row[2])
                    }
                }).next().unwrap() // <- next() takes the first element of the iterator
            })
        }).unwrap();
    account // <- return by value, pass ownership to caller
}
(Untested as I couldn't reproduce your dev environment.)
Kind of unrelated, but it is worth noting that those multiple unwrap() calls render your function extremely brittle, as any failure will crash your whole program with a panic. Fortunately the answer to this bad smell is easy: you want to return Option<Account> rather than Account. Then remove all calls to unwrap() and let the Option<Account> propagate through the calls (your use of map() is good because it says "return None if you find None and return Some(f(a)) if you find Some(a)").

Nodejs MySql Global Variable Limitation

Why is the new Monsters object only populated when it is accessed inside the mysql callback scope?
Monsters = {};

db.query('SELECT * from rpg_monsters ', function(err, results) {
    for (i = 0; i < results.length; i++) {
        Monsters[results[i].monster_id] = {
            monster_id: results[i].monster_id,
            monster_name: results[i].monster_name,
            monster_filename: results[i].monster_filename,
            monster_hp: results[i].monster_hp,
            monster_chp: results[i].monster_hp,
            monster_type: results[i].monster_type,
            monster_level: results[i].monster_level,
            xPosition: results[i].monster_xPosition,
            yPosition: results[i].monster_yPosition
        };
    }
    console.log(Monsters); // This Returns True with all the objects properties!
    db.end();
});

console.log(Monsters); // This Returns Empty?
Is it possible to use the Monsters object (or any other) outside of the mysql callback?
Short answer: No.
Long answer: Welcome to asynchronous programming in NodeJS! Your error is not due to a scoping issue but a timing issue: your asynchronous code does not run in source order. Your code does this:
1. Initialize the global variable Monsters to reference an empty JavaScript object.
2. Submit an asynchronous request via db.query, passing a callback that takes the params err and results.
3. console.log the contents of Monsters, which still references an empty object.
4. NodeJS reaches the end of its main loop and ticks. This may happen many times while waiting for db.query to finish its asynchronous IO. At some point in the future we resume:
5. db.query resolves and runs your callback function.
6. Copy all elements of results into the object referenced by the global variable Monsters.
7. Log the value of Monsters, which is now populated.
You will need to restructure your code to follow the asynchronous callback structure, or you can investigate alternatives like Promises or coroutines.
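For example, here is a minimal sketch of the Promise approach, reusing the db connection from the snippet above; for brevity it keys the raw rows by monster_id instead of reshaping each record:
// Wrap the callback-style query in a Promise so the result can be awaited.
function loadMonsters(db: any): Promise<Record<string, any>> {
    return new Promise((resolve, reject) => {
        db.query('SELECT * from rpg_monsters ', (err, results) => {
            if (err) {
                return reject(err);
            }
            const monsters: Record<string, any> = {};
            for (const row of results) {
                monsters[row.monster_id] = row;
            }
            resolve(monsters);
        });
    });
}

// Everything that needs Monsters runs only after the query has completed.
(async () => {
    const Monsters = await loadMonsters(db);
    console.log(Monsters); // populated here
    db.end();
})();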
Please see How to return the response from an asynchronous call?, which has quite the explanation of your issue.