Duplicates on Apache Beam / Dataflow inputs even when using withIdAttribute

I am trying to ingest data from a 3rd party API into a Dataflow pipeline. Since the 3rd party doesn't make webhooks available, I wrote a custom script that constantly polls their endpoint for more data.
The data is refreshed every 15 minutes, but since I don't want to miss any datapoints and I want to consume as soon as new data is available, my "crawler" runs every 1 minute. The script then sends the data to a PubSub topic. Easy to see that PubSub will receive about 15 repeated messages for each datapoint in the source.
My first attempt to identify and discard those repeated messages was to add a custom attribute to each PubSub message (eventid), created from a hash of its [ID + updated_time] at source.
const attributes = {
  eventid: Buffer.from(`${item.lastupdate}|${item.segmentid}`).toString('base64'),
  timestamp: item.timestamp.toString()
};
const dataBuffer = Buffer.from(JSON.stringify(item));
publisher.publish(dataBuffer, attributes);
Then I configured Dataflow with a withIdAttribute() (which is the new idLabel(), based on Record IDs).
PCollection<String> input = p
    .apply("ReadFromPubSub", PubsubIO
        .readStrings()
        .fromTopic(String.format("projects/%s/topics/%s", options.getProject(), options.getIncomingDataTopic()))
        .withTimestampAttribute("timestamp")
        .withIdAttribute("eventid"))
    .apply("OutputToBigQuery", ...)
With that implementation, I was expecting that when the script sends the same datapoint a second time, the repeated eventid would be the same and the message discarded. But for some reason, I still see duplicates on the output dataset.
Some questions:
Is there a clever way to ingest the data to dataflow from that 3rd party API if they don't provide webhooks?
Any ideas on why Dataflow is not discarding the messages in this situation?
I know about the 10-minute restriction for deduplication on dataflow, but I see duplicated data even on the 2nd insertion (2 minutes).
Any help will be greatly appreciated!

I think you are on the right track, but instead of the hash I recommend using timestamps. A better way to do this is by using windows. Review this document, which filters data that is outside of the window.
Regarding the additional duplicate data: if you are using pull subscriptions and the acknowledgement deadline is reached before the data has been processed, the message will be resent, as per at-least-once delivery. In this case, increase the acknowledgement deadline; the default is 10 seconds.
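To make the expected behaviour concrete, here is a minimal Python sketch of ID-based deduplication with a time limit. It is not how Dataflow implements withIdAttribute; the function name dedupe_by_id, the tuple layout and the 600-second default (taken from the 10-minute limit mentioned in the question) are assumptions for illustration only.

```python
from typing import Dict, Iterable, List, Tuple


def dedupe_by_id(
    messages: Iterable[Tuple[float, str, str]],
    ttl_seconds: float = 600.0,
) -> List[str]:
    """Keep the first message per event ID and drop repeats seen within the TTL.

    Each message is (arrival_time_in_seconds, event_id, payload), given in
    arrival order. This is an illustrative model, not Dataflow's implementation.
    """
    first_seen: Dict[str, float] = {}
    kept: List[str] = []
    for arrival, event_id, payload in messages:
        seen_at = first_seen.get(event_id)
        if seen_at is None or arrival - seen_at > ttl_seconds:
            # New ID, or the previous sighting is too old to be remembered.
            kept.append(payload)
            first_seen[event_id] = arrival
        # Otherwise the message is treated as a duplicate and dropped.
    return kept
```

If the pipeline behaved like this sketch, a repeat arriving after 1 or 2 minutes with the same eventid would be dropped, so duplicates at that interval suggest either that the IDs differ between polls or that the repeats are redeliveries handled elsewhere.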

How to fix a query in Functions within Foundry which is hitting ObjectSet:PagingAboveConfiguredLimitNotAllowed?

I have a Phonograph object with billions of rows, and we are querying it through the object set service.
For example, I want to get all DriverLicences from a certain city.
@Function()
public getDriverLicences(city: string): ObjectSet<DriverLicences> {
    const drivers = Objects.search().DriverLicences().filter(row => row.city.exactMatch(city));
    return drivers;
}
I am facing this error when I try to query it from Slate:
ERROR 400: {"errorCode":"INVALID_ARGUMENT","errorName":"ObjectSet:PagingAboveConfiguredLimitNotAllowed","errorInstanceId":"0000-000","parameters":{}}
I understand that I am probably retrieving more than 100,000 results, but I need all of them: the front end is a complex Slate dashboard built by another team, and we cannot refactor its logic.
The issue here is that, specifically in the Slate <> Function connector, there is a "translation layer" that serializes the contents of the object set and provides a response data structure that materializes the property:value pairs for each object in the set.
This clearly doesn't work for large object sets where throwing so much data into the browser is likely to overwhelm the resources allocated to the tab.
From context it seems like you might be migrating an existing Slate app over to Functions; in the current version, how is the query limiting the number of results returned? Surely it is not returning several hundred thousand results for further processing on the front end? (If it is, that might be an anti-pattern to consider addressing.)
As for options that you could currently explore, you can sort your object set and then specify a smaller limit to return:
Objects.search().DriverLicences().filter(row => row.city.exactMatch(city)).orderBy(row => row.date_of_issue.asc()).take(100)
You'll find a few more details in the Functions documentation Reference entry on Ontology API: Object Sets in the section on Ordering and limiting.
You can even work around the (current) lack of paging when returning an ObjectSet to Slate by using the last value of the property you ordered on (i.e. date_of_issue) as a filter in the subsequent request, and returning the next N objects.
This can work if you need a Slate table or HTML widget that renders one set of results and then, on a user action, gets the next page.
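The paging workaround described above is often called keyset pagination. The following Python sketch shows the idea on a plain in-memory list; it does not use the Foundry API, and next_page, its parameters and the sample rows are hypothetical names chosen for illustration.

```python
from typing import Any, Dict, List, Optional


def next_page(
    rows: List[Dict[str, Any]],
    order_key: str,
    page_size: int,
    after: Optional[Any] = None,
) -> List[Dict[str, Any]]:
    """Return the next page of rows ordered by order_key.

    `after` is the order_key value of the last row on the previous page;
    only rows strictly greater than it are considered.
    """
    ordered = sorted(rows, key=lambda row: row[order_key])
    if after is not None:
        ordered = [row for row in ordered if row[order_key] > after]
    return ordered[:page_size]


# Sample data, purely illustrative.
licences = [
    {"id": i, "date_of_issue": d}
    for i, d in enumerate(
        ["2020-01-03", "2020-01-01", "2020-01-05", "2020-01-02", "2020-01-04"]
    )
]
first = next_page(licences, "date_of_issue", 2)
second = next_page(licences, "date_of_issue", 2, after=first[-1]["date_of_issue"])
```

One caveat of this approach: if several objects share the same value of the ordering property at a page boundary, a strict "greater than" filter skips them, so in practice the ordering property should be unique or combined with a tie-breaker.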

Looking for an example of an OBD-II complete data frame

I'm developing an OBD-II reader in which I want to send requests to read PID parameters with an STM32 processor. I already understand what should go in the data field, but the ID is giving me a headache. As I have read, one must send 0x7DF to broadcast a request, and each ECU will respond with its own ID. However, I have been asked to do this within the SAE J1939 protocol, which uses the 29-bit extended identifier, and I don't know what I need to add to this ID.
As I stated in the title, could someone show me some actual data from a bus using this method? I've been searching on the internet for real frames but did not have any luck so far.
I would also appreciate it if someone could shed some light on whether OBD-II communication needs some acknowledgment to work properly.
Thanks
I would suggest taking a look at the SAE J1939 documentation, more specifically J1939/21, J1939/71 and J1939/73.
Generally, a J1939 transport protocol response sequence can be processed as follows:
Identify the BAM frame, indicating a new sequence being initiated (via PGN 60416, 0xEC00, which can be reached with ID 0x1CECFF00).
Extract the J1939 PGN from bytes 6-8 of the BAM payload to use as the identifier of the new frame.
Construct the new data payload by concatenating bytes 2-8 of the data transfer frames (i.e. excluding the 1st byte).
The J1939 data transfer messages have ID 0x1CEBFF00 (PGN 60160, 0xEB00).
Above, the last 3 bytes of the BAM equal E3FE00. When reordered, these equal the PGN FEE3, aka Engine Configuration 1 (EC1). Further, the payload is found by combining the first 39 bytes across the 6 data transfer packets/frames.
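The reassembly steps above can be sketched in Python as follows. This is a simplified illustration, not a complete J1939/21 implementation: the function name reassemble_bam is hypothetical, timeouts and multiple concurrent senders are ignored, and the layout of the BAM payload (control byte 0x20, message size in bytes 2-3, packet count in byte 4, PGN in bytes 6-8, least significant byte first) is stated here as an assumption about the transport protocol format.

```python
from typing import Iterable, Tuple


def reassemble_bam(bam_payload: bytes, data_frames: Iterable[bytes]) -> Tuple[int, bytes]:
    """Rebuild a multi-packet J1939 message from a BAM frame and its data frames.

    bam_payload: the 8 data bytes of the TP.CM frame (PGN 0xEC00).
    data_frames: the 8-byte payloads of the TP.DT frames (PGN 0xEB00).
    Returns (pgn, payload).
    """
    if len(bam_payload) != 8 or bam_payload[0] != 0x20:
        raise ValueError("not a BAM frame (expected 8 bytes with control byte 0x20)")

    total_size = int.from_bytes(bam_payload[1:3], "little")  # bytes 2-3
    num_packets = bam_payload[3]                              # byte 4
    pgn = int.from_bytes(bam_payload[5:8], "little")          # bytes 6-8, LSB first

    # Byte 1 of each data frame is its sequence number; bytes 2-8 carry data.
    ordered = sorted(data_frames, key=lambda frame: frame[0])
    if len(ordered) != num_packets:
        raise ValueError("unexpected number of data transfer frames")

    data = b"".join(frame[1:8] for frame in ordered)
    # The last frame is padded, so trim to the size announced in the BAM.
    return pgn, data[:total_size]
```

With a BAM payload of 20 27 00 06 FF E3 FE 00, the function reports PGN 0xFEE3 and keeps the first 39 of the 42 bytes carried by the 6 data transfer frames, which matches the description above.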
The administrative control device or any device issuing the vehicle use status PID should be sensitive to the run switch status (SPN 3046 - 0xFDC0, which can probably be reached with 0xCFDC000) and any other locally defined criteria for authorized use (i.e., driver log-ons) before the vehicle use status PID is used to generate an unauthorized use alarm.
Also, don't forget to send and receive using extended-ID messages, since J1939 uses the 29-bit identifier.
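To see what goes into a 29-bit J1939 identifier such as 0x1CECFF00, the following Python sketch splits an ID into its fields. The function name decode_j1939_id is hypothetical, and the bit layout used (3-bit priority, data page bit, PDU format, PDU specific, source address) is given as an assumption about the J1939/21 frame format; check it against the standard before relying on it.

```python
from typing import Dict, Optional


def decode_j1939_id(can_id: int) -> Dict[str, Optional[int]]:
    """Split a 29-bit extended CAN identifier into J1939 fields."""
    if not 0 <= can_id < (1 << 29):
        raise ValueError("J1939 identifiers are 29 bits wide")

    priority = (can_id >> 26) & 0x7
    data_page = (can_id >> 24) & 0x1
    pdu_format = (can_id >> 16) & 0xFF
    pdu_specific = (can_id >> 8) & 0xFF
    source = can_id & 0xFF

    if pdu_format < 240:
        # PDU1: PDU specific is a destination address and is not part of the PGN.
        pgn = (data_page << 16) | (pdu_format << 8)
        destination: Optional[int] = pdu_specific
    else:
        # PDU2: broadcast message, PDU specific extends the PGN.
        pgn = (data_page << 16) | (pdu_format << 8) | pdu_specific
        destination = None

    return {
        "priority": priority,
        "pgn": pgn,
        "destination": destination,
        "source": source,
    }
```

For example, decode_j1939_id(0x1CECFF00) gives priority 7, PGN 0xEC00 (60416), destination 0xFF (global) and source address 0x00.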
In fact, I suggest using can-utils to make your analysis even easier. With a simple candump or cansniffer you can see what is being broadcast on your bus.
DBC files for some cars: https://github.com/commaai/opendbc

Google Dataflow (Apache beam) JdbcIO bulk insert into mysql database

I'm using the Dataflow SDK 2.X Java API (Apache Beam SDK) to write data into MySQL. I've created pipelines based on the Apache Beam SDK documentation to write data into MySQL using Dataflow. It inserts a single row at a time, whereas I need to implement bulk insert. I cannot find any option in the official documentation to enable bulk insert mode.
Is it possible to set bulk insert mode in a Dataflow pipeline? If yes, please let me know what I need to change in the code below.
.apply(JdbcIO.<KV<Integer, String>>write()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "com.mysql.jdbc.Driver", "jdbc:mysql://hostname:3306/mydb")
        .withUsername("username")
        .withPassword("password"))
    .withStatement("insert into Person values(?, ?)")
    .withPreparedStatementSetter(new JdbcIO.PreparedStatementSetter<KV<Integer, String>>() {
        @Override
        public void setParameters(KV<Integer, String> element, PreparedStatement query) throws SQLException {
            // Bind the key and value of the current element to the two placeholders.
            query.setInt(1, element.getKey());
            query.setString(2, element.getValue());
        }
    }));
EDIT 2018-01-27:
It turns out that this issue is related to the DirectRunner. If you run the same pipeline using the DataflowRunner, you should get batches that are actually up to 1,000 records. The DirectRunner always creates bundles of size 1 after a grouping operation.
Original answer:
I've run into the same problem when writing to cloud databases using Apache Beam's JdbcIO. The problem is that while JdbcIO does support writing up to 1,000 records in one batch, I have never actually seen it write more than 1 row at a time (I have to admit: this was always using the DirectRunner in a development environment).
I have therefore added a feature to JdbcIO where you can control the size of the batches yourself by grouping your data together and writing each group as one batch. Below is an example of how to use this feature based on the original WordCount example of Apache Beam.
p.apply("ReadLines", TextIO.read().from(options.getInputFile()))
    // Count words in input file(s)
    .apply(new CountWords())
    // Format as text
    .apply(MapElements.via(new FormatAsTextFn()))
    // Make key-value pairs with the first letter as the key
    .apply(ParDo.of(new FirstLetterAsKey()))
    // Group the words by first letter
    .apply(GroupByKey.<String, String> create())
    // Get a PCollection of only the values, discarding the keys
    .apply(ParDo.of(new GetValues()))
    // Write the words to the database
    .apply(JdbcIO.<String> writeIterable()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(options.getJdbcDriver(), options.getURL()))
        .withStatement(INSERT_OR_UPDATE_SQL)
        .withPreparedStatementSetter(new WordCountPreparedStatementSetter()));
The difference with the normal write-method of JdbcIO is the new method writeIterable() that takes a PCollection<Iterable<RowT>> as input instead of PCollection<RowT>. Each Iterable is written as one batch to the database.
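The "one group, one batch" idea can be illustrated outside Beam with a short Python sketch. It uses the standard library's sqlite3 module and an in-memory database instead of JdbcIO and MySQL, and the table name words and the helper names are made up for the example, so treat it as an analogy rather than the behaviour of writeIterable() itself.

```python
import sqlite3
from itertools import groupby
from typing import Iterable, List, Tuple

Row = Tuple[str, int]


def create_table(conn: sqlite3.Connection) -> None:
    conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, count INTEGER)")


def group_by_first_letter(rows: Iterable[Row]) -> List[List[Row]]:
    """Group (word, count) rows by the first letter of the word."""
    ordered = sorted(rows, key=lambda row: row[0][0])
    return [list(group) for _, group in groupby(ordered, key=lambda row: row[0][0])]


def write_in_batches(conn: sqlite3.Connection, groups: Iterable[List[Row]]) -> int:
    """Write each group with a single executemany call; return the batch count."""
    batches = 0
    for group in groups:
        conn.executemany("INSERT INTO words (word, count) VALUES (?, ?)", group)
        conn.commit()  # one commit per group, i.e. one batch per group
        batches += 1
    return batches
```

The grouping key controls the batch size: grouping by first letter, as in the WordCount example, gives at most one batch per letter, while a key with more distinct values gives smaller batches.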
The version of JdbcIO with this addition can be found here: https://github.com/olavloite/beam/blob/JdbcIOIterableWrite/sdks/java/io/jdbc/src/main/java/org/apache/beam/sdk/io/jdbc/JdbcIO.java
The entire example project containing the example above can be found here: https://github.com/olavloite/spanner-beam-example
(There is also a pull request pending on Apache Beam to include this in the project)

Loggly - Refactoring context format - Indexing unique fieldnames limit

For the past couple of months, we have been logging to Loggly incorrectly. Our contexts have historically been a numerical array of strings.
['message1', 'message2', 'message3' ...]
Moving forward, we are looking to send Loggly an array of key-value pairs, which should use fewer keys.
Example new loggly payload:
['orderId' => 123, 'logId' => 456, 'info' => json_encode(SOMEARRAY)]
While testing the new, cleaner logging format, Loggly returns the following message:
2 out of 9 sent in this event were not indexed due to max allowed
(100) unique fieldnames being exceeded for this account. The following
were the affected fields: [json.context.queue, json.context.demandId]
We are on a 30 day plan. Does this mean that for our contexts to be indexed correctly, we need to wait 30 days for the old indexed logs to expire? Is there a way of rebuilding the indexing to accommodate the new format logs?
You do not need to wait for 30 days. As long as you stop sending logs in the old format, usually within a few hours or at most a couple of days you will be able to send data with new fields. You can also reach out to support@loggly.com.
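As a rough way to check how many distinct field names a payload would produce before sending it, here is a small Python sketch. It assumes that field names are built from the dotted path of nested keys, which is inferred from names such as json.context.queue in the message above and is not a documented guarantee; the function name field_names and the sample payloads are hypothetical.

```python
import json
from typing import Any, Set


def field_names(value: Any, prefix: str = "json") -> Set[str]:
    """Collect dotted field names for every leaf of a nested dictionary."""
    if isinstance(value, dict):
        names: Set[str] = set()
        for key, child in value.items():
            names |= field_names(child, f"{prefix}.{key}")
        return names
    # Strings, numbers and lists are treated as a single field.
    return {prefix}


nested = {"context": {"queue": "q1", "demandId": 7, "details": {"a": 1, "b": 2}}}
flat = {"context": {"orderId": 123, "logId": 456, "info": json.dumps({"a": 1, "b": 2})}}
```

Under that assumption, the nested payload yields four field names while the one that serializes its details into a single info string yields three, however many keys the encoded data contains.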

Set UITableViewCell data from remote JSON file

I have a UITableView representing a list of cities (100 cities).
For each city I want to call a specific remote (URL) JSON to get the city's weather information and populate the corresponding city cell in the UITableView with the response data.
When I run the application, I want to see my table as fast as possible, so I don't want to wait for all the JSON responses. I want the information to be fetched asynchronously (when a specific JSON is loaded, set its information for the corresponding city cell in the UITableView).
Note: It is important for me to call separate remote JSON files.
Which technique is the best for this task?
I would start with the following approach:
Create a data structure to hold city information, including:
path to your data service,
service call "state" (idle, waiting, completed, error),
weather information (from JSON returned by service call)
When you first show the table, you will want to:
initialize your array (of the aforementioned data structure),
initiate each service call asynchronously,
set each row (city) state to waiting.
You will also probably want to return a custom UITableViewCell with the city name (if you already have it) and a spinning activity indicator. This will be your best option to have a fast load time (not waiting for services to complete) and give some visual indication that the data is loading.
Each service call should use the ViewController as its delegate; you will need a key field so that when the services return, they can identify with which row/city they are associated.
As each service completes and calls the delegate, it will send the data to the ViewController, which (in turn) will update the array and initiate a UITableView update.
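The bookkeeping described above (one record per city, a state per service call, and a key that maps each response back to its row) can be sketched in a language-neutral way. The Python below is only an illustration of that flow, not iOS code: CityRow, load_all, fetch and on_row_updated are hypothetical names, and a thread pool stands in for the asynchronous service calls and delegate callbacks.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class CityRow:
    name: str
    url: str                        # path to the data service
    state: str = "idle"             # idle, waiting, completed, error
    weather: Optional[dict] = None  # parsed JSON from the service call


def load_all(
    rows: List[CityRow],
    fetch: Callable[[str], dict],
    on_row_updated: Callable[[int], None],
) -> None:
    """Start one fetch per row and report each row index as it finishes."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = {}
        for index, row in enumerate(rows):
            row.state = "waiting"
            # The row index is the key that ties a response back to its row.
            futures[pool.submit(fetch, row.url)] = index

        for future in as_completed(futures):
            index = futures[future]
            row = rows[index]
            try:
                row.weather = future.result()
                row.state = "completed"
            except Exception:
                # Record the failure so the row does not spin forever.
                row.state = "error"
            on_row_updated(index)  # e.g. refresh just that row in the table
```

In the real app, on_row_updated corresponds to the point where the view controller updates its array and asks the table to refresh the affected row.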
The UITableView update is, in my opinion, the most difficult part. Typically cells are drawn or updated when they become visible; the table pre-fetches all visible cells' geometry and then queries the actual contents when it's ready to draw each cell; as a result, your strategy for updating cells will depend on how your table is used.
If your cell geometry changes, you will most likely need to redraw your entire table; I shudder to think about what 50 simultaneous UITableView redraws will do for your app, so you might need to set a time-threshold to "chunk" updates and handle drawing more intelligently.
[theTableView reloadData] will cause the entire table to be re-queried and redrawn.
If your cell geometry does not change, you can try to be more surgical of updating only the visible cells (the non-visible ones aren't an issue since their data will be queried when they become visible).
[theTableView visibleCells] returns an array of visible cells; when your service call returns, you could update the data and then search the array to see if the cell in question is visible; if it is, you will probably need to send the specific UITableViewCell a setNeedsDisplay message.
There is a good explanation of setNeedsDisplay, setNeedsLayout, and 'reloadData' at http://iosdevelopertips.com/cocoa/understanding-reload-repaint-and-re-layout-for-uitableview.html.
There is a relevant SO question at How to refresh UITableViewCell?
Lastly, you will probably want to implement some updating logic in the service delegate error routine, just so you don't create endlessly spinning activity indicators.
I do this now while searching multiple servers. I use Core Data, but you can use an NSMutableArray to accumulate your JSON responses.
Every time you finish receiving data from one of your servers (for example, when connectionDidFinishLoading executes), take the JSON data object and add it to an NSMutableArray (let's call it weatherResults) using the addObject method. You may want to convert the JSON to an NSDictionary before adding it to the mutable array weatherResults.
Assuming your dataSource delegate methods refer to what is in the weatherResults NSMutableArray (for example, getting the number of rows from the size of the array using [weatherResults count]) you can do the following:
After inserting the object into the array, you can simply call reloadData in the dataSource controller. You will see the table update as each new JSON result arrives. The results should be appended to the bottom of the table as they come in. If you want to sort the NSMutableArray each time a JSON result arrives, you can do that too.
I do this, and the time it takes to re-sort and reload the table is insignificant on my iPad. If you do not re-sort, it should be even faster.
By the way, in this explanation, I assume that the JSON response contains all of the information that you need to fill in your table cell. That may not be the case. If it's not, you will have to correlate the response with other information you have, such as a list of cities that your program is presenting.