Detect data anomalies in a data pipeline and trigger a scheduled pipeline run - palantir-foundry

In Foundry, we have a data pipeline where we want to insert a code node (repo or workbook) that detects anomalies and then sends an email or some other alert about the problem.
Having trouble finding this in the documentation, can someone point me to it?
Ideally we would love to have the code trigger the Scheduler to do a pipeline run that creates a REPORT (maybe even in Quiver, to do some timeline analysis). Is this possible? Are there examples in the documentation?

Check out the documentation in the Data Health section of the platform documentation. There are a number of patterns possible, including defining data expectations in your code.
Whether defined as expectations or dataset health checks, failures can be set up to create Issues within the platform. Issues can have default assignees (individuals or groups) and will send notifications both in platform and over email (depending on per-user configuration).
Health check failures will also automatically populate the data health tab in the Project Catalog view, which can serve as a dashboard for the overall health of the project. You can also surface these in the Data Lineage view, colored by data health, to understand issues across the breadth of the pipeline.
For a comprehensive approach to pipeline health, review the Pipelines and best practices section in the Code Repositories documentation.
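For illustration, here is a minimal sketch of what a data expectation could look like in a Python transforms repository; the dataset paths, column name and threshold below are made up, so check the Data Expectations documentation for the exact API available in your environment.

```python
# Hypothetical Python transform with a data expectation attached to its output.
# A failing check can surface in Data Health and open an Issue that notifies
# its default assignees (in platform and/or over email).
from transforms.api import transform_df, Input, Output, Check
from transforms import expectations as E


@transform_df(
    Output(
        "/Project/datasets/cleaned_readings",        # hypothetical output path
        checks=Check(
            E.col("reading_value").gte(0),           # treat negative readings as anomalies
            "reading_value must be non-negative",
            on_error="WARN",                         # or "FAIL" to abort the build
        ),
    ),
    source=Input("/Project/datasets/raw_readings"),  # hypothetical input path
)
def compute(source):
    return source.filter(source.reading_value.isNotNull())
```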

Related

How to implement Event-carried State Transfer?

I watched Mr. Martin Fowler's seminar on Event-Driven Architecture. I see the benefits of Event-carried State Transfer but still haven't found a way to do it as he described. How can I copy data from one database to another continuously, and can this copying cause errors?
Copying directly from one database to another is usually a bad idea, as it creates coupling. A better approach is for one service to publish events about the changes, events that others can then subscribe to.
The publishing of events can be implemented in many different ways. For example:
The publisher can publish an ATOM feed that the subscribers can poll and traverse for changes. For example, EventStoreDB publishes ATOM feeds to support this.
The publisher can publish its events to Kafka, which subscribers can then consume events from.
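As a rough illustration of the Kafka option, a publisher for an event-carried state transfer event could look like the sketch below; the topic name, event shape and broker address are invented, and kafka-python is just one of several client libraries.

```python
# Minimal event publisher sketch using kafka-python (topic and fields are illustrative).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# An event-carried state transfer event carries the changed state itself,
# so consumers can update their own copy without querying the source database.
event = {
    "type": "CustomerAddressChanged",
    "customer_id": "42",
    "new_address": {"street": "1 Main St", "city": "Springfield"},
}

producer.send("customer-events", value=event)
producer.flush()
```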

Programmatically Recomputing Precise Part Volume From Third-Party Files Using Forge APIs

I'm looking for best practices and performance-guided recommendations for recomputing a model's volume when it's missing from the source file. This is in the context of a web application I am building that enables:
Uploading 3D models in a variety of file formats
Interacting with these models using the AutoDesk Viewer
Displaying mass properties, e.g. volume and surface area, alongside the viewer (subject of this post)
Background
Some file formats have very reliable volume information that is computed and written to the file by the authoring application. For these files, we can access volume as a property via AutoDesk Viewer.
Other formats, however, do not carry volume information - at least not in a manner that is openly accessible using tools other than the authoring application (prime example here is SolidWorks). This leaves us with a giant gap to fill - we need to recompute the model's volume using what's in the file.
Known Workarounds and Options
AutoDesk published a blog post detailing an approach for approximating model volume using triangles of the model inside the viewer. I think it's an ideal solution for use cases that can afford to trade accuracy for a bump in performance - and it centers everything in the viewer making development and subsequent maintenance simpler. This application, however, cannot rely on such approximations. I'm left reviewing options for leveraging the AutoDesk Design Automation API to:
Spin up an instance of Inventor
Load the model file
Rely on iLogic to trigger a re-computation of the model's part properties (perhaps like this?)
Push that data back to my web application
Where I Need Help
My understanding is that an AppBundle and Activity are defined ahead of time and then every uploaded model would be submitted as a work item.
I am hoping for guidance in:
whether this is the only approach or whether there are other options worth considering
how best to orchestrate the end-to-end process from an order of operations/workflow standpoint to maximize performance
Current Thinking
For example, I'm thinking that my first step after the source file is uploaded is to immediately initialize two parallel processes: the first to translate the source file for the viewer, the second to spin up Inventor and trigger the related downstream process to get volume.
The other option I'm considering is handling all of the work in Inventor - and pushing out an SVF file to the viewer that's enriched with volume data. The advantage of this approach is that my frontend will have only one source for volume data (it will be in the enriched SVF whether or not it was supplied in the original file).
In an ideal world I'd be able to invoke the Design Automation API only when volume data is missing from the source file - but I'd only know that after translating the file and bringing it into the viewer. Given that many of our files are created in SolidWorks and other high-end proprietary CAD platforms, my working hypothesis is that we'll need to fill in volume gaps more often than not.
Your understanding is correct:
appbundle is simply a collection of files (binaries, data) encapsulating a specific Inventor/Revit/3dsMax/AutoCad plugin
activity is a kind of job template specifying which application should be invoked, which appbundle should be loaded into the application, what inputs will be provided to the job, and what outputs will be generated
work item is then a specific instance of a job, binding the activity inputs and outputs to specific URLs
There is currently no other way to access the Design Automation functionality other than using these three types of entities.
I would suggest the following:
wherever possible, use the Design Automation for Inventor to compute the precise areas/volumes
for file formats that cannot be imported into Inventor or any other Design Automation engine, you could use tools like https://github.com/petrbroz/forge-convert-utils to parse the SVF and compute (a very rough estimate of) the area/volume from the triangular meshes; however, this will be quite computationally expensive and imprecise
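For the Design Automation route, submitting a work item against a predefined Activity boils down to a single call to the v3 endpoint. The sketch below is only illustrative: the activity alias, signed URLs and token are placeholders, not values from your setup.

```python
# Sketch of submitting a Design Automation v3 work item (all values are placeholders).
import requests

DA_BASE = "https://developer.api.autodesk.com/da/us-east/v3"
ACCESS_TOKEN = "<2-legged OAuth token>"  # obtained separately

payload = {
    "activityId": "MyNickname.ComputeVolume+prod",  # hypothetical activity alias
    "arguments": {
        "inputPart": {"url": "https://example.com/signed-url-to-uploaded-model"},
        "result": {
            "verb": "put",
            "url": "https://example.com/signed-url-for-volume-json",  # where the iLogic output lands
        },
    },
}

resp = requests.post(
    f"{DA_BASE}/workitems",
    json=payload,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
resp.raise_for_status()
print(resp.json()["id"])  # poll GET /workitems/{id} until the status is "success"
```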

Message bus vs. Service bus vs. Event hub vs Event grid

I'm learning about messaging systems and got confused by the terminology.
All the messaging systems below provide loose coupling between services, with different sets of features.
queue - FIFO, pull mechanism, one consumer per queue but any number of producers?
message bus - pub/sub model, any number of consumers and any number of producers processing messages? Is Azure Service Bus an implementation of a message bus?
event bus - pub/sub model, any number of consumers and any number of producers processing events?
Do people use "message bus" and "event bus" interchangeably as far as terminology goes?
What is the difference between events and messages? Are they just synonyms in this context?
event hub - pub/sub model, partitions, replay, consumers can store events in external storage or do close-to-real-time data analysis. What exactly is an event hub?
event grid - it can be used as a downstream service of an event hub. What exactly does it do that an event hub doesn't?
Can someone provide some historical context on how each technology evolved into the next, each tied to some practical use cases?
I've found message bus vs. message queue helpful
Even though all these services deal with the transfer of data from source to target and might seem similar, falling under the umbrella of messaging services, they do differ in their intent.
High-level definition:
Azure Event Grid – Event-driven publish-subscribe model (think reactive programming)
Azure Event Hubs – Multiple-source big data streaming pipeline (think telemetry data)
Azure Service Bus – Traditional enterprise broker messaging system (similar to Azure Queue but provides many advanced features depending on the use case; full comparison)
Difference between Event Grid & Event Hubs
Event Grid doesn't guarantee the order of events, but Event Hubs uses partitions, which are ordered sequences, so it can maintain the order of events within the same partition.
Event Hubs only accepts endpoints for the ingestion of data and doesn't provide a mechanism for sending data back to publishers. Event Grid, on the other hand, sends HTTP requests to notify subscribers of events that happen in publishers.
Event Grid can trigger an Azure Function. In the case of Event Hubs, the Azure Function needs to pull and process an event.
Event Grid is a distribution system, not a queueing mechanism. If an event is pushed in, it gets pushed out immediately, and if it doesn't get handled, it's gone forever - unless we send the undelivered events to a storage account, a process known as dead-lettering.
In Event Hubs the data can be kept for up to seven days and then replayed. This gives us the ability to resume from a certain point or to restart from an older point in time and reprocess events when we need it.
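To make the replay point concrete, here is a minimal consumer sketch with the azure-eventhub Python SDK that reads partitions from the start of the retention window; the connection string and hub name are placeholders.

```python
# Sketch: replaying Event Hubs partitions from the start of retention
# using the azure-eventhub SDK (connection details are placeholders).
from azure.eventhub import EventHubConsumerClient

client = EventHubConsumerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>",
    consumer_group="$Default",
    eventhub_name="telemetry",
)

def on_event(partition_context, event):
    # Process the event; with a checkpoint store configured you could also call
    # partition_context.update_checkpoint(event) to resume from here later.
    print(partition_context.partition_id, event.body_as_str())

with client:
    # starting_position="-1" means "from the beginning of the partition",
    # which is what allows older events to be reprocessed.
    client.receive(on_event=on_event, starting_position="-1")
```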
Difference between Event Hubs & Service Bus
To the external publisher or receiver, Service Bus and Event Hubs can look very similar, and this is what makes it difficult to understand the differences between the two and when to use which.
Event Hubs focuses on event streaming where Service Bus is more of a traditional messaging broker.
Service Bus is used as the backbone to connect applications running in the cloud to other applications or services and to transfer data between them, whereas Event Hubs is more concerned with receiving massive volumes of data with high throughput and low latency.
Event Hubs decouples multiple event-producers from event-receivers whereas Service Bus aims to decouple applications.
Service Bus messaging supports a message property ‘Time to Live’ whereas Event Hubs has a default retention period of 7 days.
Service Bus has the concept of a message session. It allows relating messages based on their session-id property, whereas Event Hubs does not.
In Service Bus, messages are pulled out by the receiver and cannot be processed again, whereas an Event Hubs message can be ingested by multiple receivers.
Service Bus uses the terminology of queues and topics, whereas Event Hubs uses the terminology of partitions.
Use this loose general rule of thumb.
SOMETHING HAS HAPPENED – Event Hubs
DO SOMETHING or GIVE ME SOMETHING – Service Bus
As @Louie Almeda stated, you may find this link to the official Azure documentation useful.
I found this comparison from Azure docs extremely helpful. Here's the key distinction between events and messages.
Event vs. message services
There's an important distinction to note between services that deliver an event and services that deliver a message.
Event
An event is a lightweight notification of a condition or a state change. The publisher of the event has no expectation about how the event is handled. The consumer of the event decides what to do with the notification. Events can be discrete units or part of a series.
Discrete events report state change and are actionable. To take the next step, the consumer only needs to know that something happened. The event data has information about what happened but doesn't have the data that triggered the event. For example, an event notifies consumers that a file was created. It may have general information about the file, but it doesn't have the file itself. Discrete events are ideal for serverless solutions that need to scale.
Series events report a condition and are analyzable. The events are time-ordered and interrelated. The consumer needs the sequenced series of events to analyze what happened.
Message
A message is raw data produced by a service to be consumed or stored elsewhere. The message contains the data that triggered the message pipeline. The publisher of the message has an expectation about how the consumer handles the message. A contract exists between the two sides. For example, the publisher sends a message with the raw data, and expects the consumer to create a file from that data and send a response when the work is done.
A comparison of those different services is also discussed there, so be sure to check it out.
I agree with your remarks about overloaded terms, especially with cloud-service marketing jargon....
Historically, events and messages had more distinct meanings:
- event was a term used to refer to communication within the same process, whereas
- message referred to communication across different processes.
Regarding the "bus", I can give you some "historical" information, as I learned to be a sound engineer. In a music mixer, you also have a "bus" and "routing" for mixing signals. In the case of a mixer, we are talking about electrical signals, either being in the mix or not!
Regarding the messaging systems, think of "bus", "hub" and "grid" as synonyms! They are all fancy words for the same thing. They all try to express some kind of transportation system that includes some kind of routing, because you always have producers and consumers - and this can be an N:M relation, depending on the use case.
A queue is typically a bit different, but its effect can be the same. A queue means something where things wait in line, like a queue of people waiting to buy something (theatre tickets...).
Nowadays, everything is digital, which in essence means it is countable. That's how "messages" came into existence! A music mixer traditionally mixes analog signals, which are not countable but continuous, so the information would be, for example, spoken voices or any kind of sound. Today, a "message" means some kind of information package, which is unique and countable. So it is a "thing" you can add to and remove from a queue, or send to a hub for consumers to consume.
Don't worry, you'll get used to those terms! I hope I was able to give you an idea.

Message queuing solution for millions of topics

I'm thinking about a system that will notify multiple consumers about events happening to a population of objects. Every subscriber should be able to subscribe to events happening to zero or more of the objects, and multiple subscribers should be able to receive information about events happening to a single object.
I think that some message queuing system will be appropriate in this case, but I'm not sure how to handle the fact that I'll have millions of objects - using a separate topic for every one of the objects does not sound good [or is it just fine?].
Can you please suggest an approach I should take, and maybe even some open-source message queuing system that would be reasonable?
A few more details:
there will be thousands of subscribers [meaning not that many of them],
subscribers will subscribe to tens or hundreds of objects each,
there will be ~5-20 million objects,
the events themselves don't have to carry any payload; just the information that the object was changed is enough,
the vast majority of objects will never be subscribed to,
events occur at a maximum rate of a few hundred per second,
ideally the server should run under Linux and be able to integrate with the rest of the ecosystem via HTTP long-poll [using node.js? continuations under Jetty?].
Thanks in advance for your feedback, and sorry for the somewhat vague question!
I can highly recommend RabbitMQ. I have used it in a couple of projects before and, from my experience, I think it is very reliable and offers a wide range of configurations. Basically, RabbitMQ is an open-source (Mozilla Public License (MPL)) message broker that implements the Advanced Message Queuing Protocol (AMQP) standard.
As documented on the RabbitMQ web-site:
RabbitMQ can potentially run on any platform that Erlang supports, from embedded systems to multi-core clusters and cloud-based servers.
... meaning that an operating system like Linux is supported.
There is a library for node.js here: https://github.com/squaremo/rabbit.js
It comes with an HTTP-based API for management and monitoring of the RabbitMQ server - including a command-line tool and a browser-based user interface as well - see: http://www.rabbitmq.com/management.html.
In the projects I have worked on, I have communicated with RabbitMQ using C# and two different wrappers, EasyNetQ and Burrow.NET. Both are excellent wrappers for RabbitMQ, but I ended up preferring Burrow.NET, as it is easier and more obvious to work with (it doesn't do a lot of magic under the hood) and provides good flexibility to inject loggers, serializers, etc.
I have never worked with the number of objects that you are going to work with - I have worked with thousands (not millions). However, no matter how many objects I have been playing around with, RabbitMQ has always been really stable and has never been the source of errors in the system.
So to sum up - RabbitMQ is simple to use and set up, supports AMQP, can be managed via HTTP, and, what I like the most, it's rock solid.
Break up the topics to carry specific events, e.g. an "object updated" topic, an "object deleted" topic, and so on. Clients then only have to subscribe to the finite number of event-based topics they are interested in.
Inject headers into your messages when you publish them and put intelligence into the clients to use these headers as message selectors. For example, a client knows the list of objects it is interested in - say you identify each object by an "id" - the id can be a header, and the client will use the "id" header to determine whether it is interested in the message, as sketched below.
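Here is a rough sketch of that client-side filtering with RabbitMQ and the pika client; the exchange, queue and header names are invented, and with a JMS broker such as ActiveMQ the same filtering could instead be pushed to the broker via message selectors.

```python
# Sketch: a subscriber that consumes an "object updated" topic and filters
# by an "object_id" header (all names are illustrative).
import pika

INTERESTING_IDS = {"1001", "2042", "31337"}  # objects this subscriber cares about

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.exchange_declare(exchange="object-updated", exchange_type="fanout")
queue = channel.queue_declare(queue="", exclusive=True).method.queue
channel.queue_bind(exchange="object-updated", queue=queue)

def on_message(ch, method, properties, body):
    object_id = (properties.headers or {}).get("object_id")
    if object_id in INTERESTING_IDS:  # the "message selector" lives in the client
        print(f"object {object_id} changed")
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue=queue, on_message_callback=on_message)
channel.start_consuming()
```

The publisher side would attach the header when publishing, e.g. with pika.BasicProperties(headers={"object_id": "1001"}).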
Depending on your requirements, you may also want to consider guaranteed delivery, to make sure that the client will receive the message even if it goes offline and comes back later.
The options that I would recommend off the top of my head are ActiveMQ, RabbitMQ and Redis PUB/SUB (I haven't really worked with Redis pub/sub, so please do your due diligence).
Finally, here are some performance benchmarks for RabbitMQ and Redis.
Just saw that you only have a few hundred messages getting pushed out per second; this is not a big deal for ActiveMQ. I have been using AMQ on a system that processes 240 messages per second, and it just works fine. I use a thread pool of workers to asynchronously process the messages, though. Look at a framework like Akka if you are in Java land; if not, stick with Node.js and the cool ecosystem around it.
If it has to be open source, I'd go for ActiveMQ, plus an application server to provide the JMS functionality for topics; it also has Ajax support so you can access the topics from your client.
So you would use the JMS infrastructure to publish the topics for the objects, and you can create topics as you need them.
Besides, by using a Java application server you may be able to take advantage of clustering, load balancing and other high-availability features (obviously depending on the selected product).
Hope that helps!!!
Since your messages are very small, you might want to consider MQTT, which is designed for small devices, although it works fine on powerful devices as well. The key consideration is the low overhead - basically a 2-byte header for a small message. You probably can't use any simple or open-source MQTT server due to your volume; you probably need a heavy-duty dedicated appliance like MessageSight to handle it.
Some more details on your application would certainly help. Also, you don't mention security at all; I assume you must have some needs in this area.
Though I'm not sure about your environment, here are my two bits. Can you identify each object with a unique ID in your system? If so, you can have a topic per event type. For example, you may want to track object deletion events, object update events and so on, so you can have a topic for each event type. These topics would be published with the IDs of objects whenever the corresponding event happens to an object. This limits the number of topics you need.
The second part of your problem is that different subscribers want to subscribe to different objects, so not all subscribers are interested in events for all objects. This problem maps to the message selector (filtering) mechanism provided by the messaging framework. Basically, you need to decide on what basis a subscriber is interested in a particular object, and use that basis as the message filtering criterion. It could be anything: object type, object state, etc. So ultimately your system would consist of one topic for each event type, with publishers attaching {object-type: object-id} information to the event messages. Subscribers could subscribe to any topic with a filtering criterion.
If the above solution satisfies your needs, you can use any messaging solution: ActiveMQ, WMQ, RabbitMQ.

ways to learn implementing workflow of a software

How many ways are there to learn how to implement the workflow of a piece of software? What are they?
If you mean the user workflow, how the user is guided through the software...
I usually use some sort of state machine to limit what functionality can be triggered by the user and what information is presented to the user in a particular state of the workflow. This way I can concentrate on designing each segment of the flow in its own "sandbox", and decision making becomes a lot easier.
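As a tiny sketch of that idea (the states and actions below are invented for illustration), each state lists the actions it allows, so the UI only exposes what the current state permits.

```python
# Minimal user-workflow state machine sketch (states and actions are illustrative).
ALLOWED_ACTIONS = {
    "draft":     {"edit", "submit"},
    "submitted": {"approve", "reject"},
    "approved":  {"archive"},
}

TRANSITIONS = {
    ("draft", "submit"):      "submitted",
    ("submitted", "approve"): "approved",
    ("submitted", "reject"):  "draft",
    ("approved", "archive"):  "archived",
}

class Workflow:
    def __init__(self):
        self.state = "draft"

    def available_actions(self):
        # What the UI may offer the user right now.
        return ALLOWED_ACTIONS.get(self.state, set())

    def perform(self, action):
        if action not in self.available_actions():
            raise ValueError(f"{action!r} is not allowed in state {self.state!r}")
        self.state = TRANSITIONS[(self.state, action)]
```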
If you do not mean user workflow, you can ignore this reply.
Usually you have steps in a workflow. A step consists of some precondition (business logic hidden from the UI), some user interaction (the user entering some data and doing some "user stuff"), and post conditions. Usually the user interaction part has one or more user-chosen "exits", and every exit has its own post condition (usually every exit has its own business logic depending on the meaning of leaving a step that way). Exits navigate the workflow to the next step. Sometimes you can have fully automatic steps (e.g. using some external data source, calling some web service, an important calculation, and so on).
If your workflow is simple, you may implement it as a set of classes representing each step, with the configuration of the step order put in XML. When your workflow grows bigger and bigger, it may be reasonable to look for a workflow engine (a discussion of workflow engines is, I think, beyond the scope of this question).
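A minimal sketch of the step-class idea (the class names, exits and configuration format are invented; a plain dict stands in for the XML): each step checks its precondition, runs, and returns the exit it took, while the step order lives in configuration rather than in code.

```python
# Sketch: workflow steps as classes, with the step order kept in configuration.
class CollectUserData:
    def run(self, context):
        context["user"] = {"name": "Ada"}  # stand-in for real user interaction
        return "done"                      # the exit chosen by this step

class CreateAccount:
    def run(self, context):
        assert "user" in context, "precondition: user data must exist"
        context["account_id"] = 42
        return "done"

STEPS = {"CollectUserData": CollectUserData(), "CreateAccount": CreateAccount()}

# Configuration: which step follows which (step, exit) pair.
FLOW = {
    ("CollectUserData", "done"): "CreateAccount",
    ("CreateAccount", "done"): None,  # end of the workflow
}

def run_workflow(start, context):
    name = start
    while name is not None:
        exit_taken = STEPS[name].run(context)
        name = FLOW[(name, exit_taken)]
    return context

print(run_workflow("CollectUserData", {}))
```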
One important thing - steps can be orthogonal, but that is harder to design. If your steps rely on one another, the person configuring the workflow and step order must be fully aware of such dependencies (e.g. a "user address" step will probably depend on a "user object creation" step, and removing the user object creation step from the workflow will result in trying to access a nonexistent object).