Can someone explain the differences, including any performance differences, between these two functions (CZML.load and CZML.process) when loading CZML data sources in our Cesium files?
Thanks
Per the documentation, load clears existing data from the dataSource before processing the incoming data. process does not, so it can be used to update existing entities, for example when real-time telemetry is being streamed in.
I am in the process of building my first live Node.js web app. It contains a form that accepts data about my client's current stock. When the form is submitted, an object is created and saved to an array of current stock. This stock is then displayed on their website until the entry is modified or deleted.
It is unlikely that there will ever be more than 20 objects stored at any time, and these will only be updated perhaps once a week. I am not sure whether it is necessary to use MongoDB to store these, or whether there is a simpler, more appropriate alternative. Perhaps the objects could be stored in a JSON file instead? Or would this have too big an impact on page load times?
You could potentially store the data in a JSON file, or even in a cache such as Redis, but I still think MongoDB would be your best bet for a live site.
Storing something in a JSON file is not scalable, so if you end up storing a lot more data than originally planned (this often happens) you may find you run out of storage on your server's hard drive. Also, if you end up scaling and putting your app behind a load balancer, you will need to make sure there are matching copies of that JSON file on each server. Furthermore, it is easy to run into race conditions when updating a JSON file: if two processes try to update the file at the same time, you can lose data. Technically speaking, a JSON file would work, but it's not recommended.
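One common way to reduce the risk of a half-written file is to write to a temporary file and then rename it into place. The sketch below is in Python rather than Node.js, purely to illustrate the idea; save_stock and load_stock are hypothetical names, not part of any library. It protects readers from seeing a partial file, but it does not stop two writers from overwriting each other's changes (the last writer wins), so it only softens the race condition described above.

```python
import json
import os
import tempfile


def save_stock(path, items):
    """Write items to path atomically: readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(items, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic replacement of the target file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_stock(path):
    """Return the stored list, or an empty list if the file does not exist yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return []
```

With around 20 objects updated weekly, a scheme like this may well be enough on a single server; the load balancer and concurrent-writer concerns still apply.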
Storing in memory (e.g. in Redis) has a similar implication: the data is only available on that one server. Also, unless Redis persistence (RDB snapshots or AOF) is enabled, the data is not durable, so if your server restarted for whatever reason, you'd lose what was stored in memory.
For all intents and purposes, MongoDB is your best bet.
The only way to know for sure is to test it with a load test. But as you probably read HTML and JS files from the file system when serving web pages anyway, the extra load of reading a few JSON files shouldn't be a problem.
If you want to go the simpler route, i.e. a JSON file, use the nedb API, which is plenty fast as well.
My app would be consuming data from multiple APIs. This data can either be a single event or a batch of events. The data I am dealing with is click streams: my app would run a cron job every minute to fetch data from our partners using their APIs and eventually save everything to MySQL for detailed analysis. I am looking for ways to buffer this data somewhere and then batch insert it into MySQL.
For example, say I receive a batch of 1000 click events with one API call. What data structures can I use to buffer it in Redis, so that a worker process can eventually consume this data and insert it into MySQL?
One simple approach would be to just fetch the data and store it in MySQL directly. But since I am dealing with ad tech, where the size and the velocity of the data are always subject to change, this hardly seems like an approach to start with.
Oh, and the app would be built on top of Spring Boot and Tomcat.
Any help/discussion would be greatly appreciated. Thank you!
I read the official AWS Kinesis Firehose documentation, but it doesn't mention how to handle duplicated events. Does anyone have experience with this? From searching, I found that some people use ElastiCache to do the filtering. Does that mean I need to use AWS Lambda to encapsulate such filtering logic? Is there a simple way, like Firehose, to ingest data into Redshift and at the same time get "exactly once" semantics? Thanks a lot!
You can have duplication on both sides of the Kinesis stream. You might put the same event into the stream twice, and your consumers might read the same event twice.
Duplication on the producer side can happen if you try to put an event into the Kinesis stream but for some reason are not sure whether it was written successfully, and you decide to put it again. Duplication on the consumer side can happen if you get a batch of events and start processing them, but crash before you manage to checkpoint your location; the next worker then picks up the same batch of events from the Kinesis stream, based on the last checkpointed sequence ID.
Before you start solving this problem, you should evaluate how often you have such duplication and what its business impact is. Not every system is handling financial transactions that can't tolerate duplication. Nevertheless, if you decide that you need de-duplication, a common way to solve it is to use some event ID and track whether you have already processed that event ID.
ElastiCache with Redis is a good place to track your event IDs. Every time you pick up an event for processing, you check whether you already have it in the hash table in Redis. If you find it, you skip it; if you don't, you add it to the table (with some TTL based on the possible time window for such duplication).
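As a rough sketch of that check, Redis can do the "check and add with a TTL" in a single SET with the NX and EX options, which avoids a gap between the lookup and the insert when two workers see the same event. The code below assumes each event carries an "id" field and uses a small in-memory stand-in (InMemorySeenStore, a hypothetical class) so it runs without a server; a redis-py client exposes a set(name, value, nx=True, ex=seconds) call with the same shape. The key prefix and the one-hour TTL are illustrative assumptions.

```python
import time


class InMemorySeenStore:
    """Hypothetical stand-in for a Redis client, mimicking set(name, value, nx=True, ex=seconds)."""

    def __init__(self, clock=time.monotonic):
        self._expiry = {}
        self._clock = clock

    def set(self, name, value, nx=False, ex=None):
        now = self._clock()
        expires_at = self._expiry.get(name)
        if nx and expires_at is not None and expires_at > now:
            return None  # key is still live, so an NX write is refused
        self._expiry[name] = (now + ex) if ex is not None else float("inf")
        return True


def is_first_delivery(store, event_id, ttl_seconds=3600):
    """Record event_id and return True only the first time it is seen within the TTL."""
    return store.set(f"seen:{event_id}", "1", nx=True, ex=ttl_seconds) is True


def process_batch(store, events, handle):
    """Call handle(event) for each event not seen before; return how many were skipped."""
    skipped = 0
    for event in events:
        if is_first_delivery(store, event["id"]):
            handle(event)
        else:
            skipped += 1
    return skipped
```

Note that this marks an event as seen before it is handled, so a crash in between would drop that event; whether that trade-off is acceptable depends on the business impact discussed above.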
If you choose to use Kinesis Firehose (instead of Kinesis Streams), you no longer have control over the consumer application and you can't implement this process. Therefore, you either want to run such de-duplication logic on the producer side, switch to Kinesis Streams and run your own code in Lambda or KCL, or settle for the de-duplication functions in Redshift (see below).
If you are not too sensitive to duplication, you can use some functions in Redshift, such as COUNT(DISTINCT ...) or LAST_VALUE as a window function.
Not sure if this could be a solution, but to handle duplicates you need to write your own KCL consumer. Firehose cannot guarantee that there is no duplication. You can get rid of Firehose once you have your own KCL consumers that process your data from the Kinesis Data Stream.
If you do so, you can follow the linked article (full disclosure: author here), which stores events in S3 after deduplicating and processing them through a KCL consumer.
Store events by grouping them based on the minute they were received by the Kinesis data stream, using their ApproximateArrivalTimestamp. This allows us to always save a given batch of records under the same key prefix, no matter when they are processed. For example, all events received by Kinesis at 15:55 on 2020/02/02 will be stored at /2020/02/02/15/55/*. Therefore, if the key is already present for the given minute, the batch has already been processed and stored in S3.
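The minute-based prefix can be sketched as follows. This is an illustration in Python rather than the article's KCL code; the function names are hypothetical, records are assumed to be dicts carrying a timezone-aware ApproximateArrivalTimestamp, and formatting the prefix in UTC is an assumption.

```python
from datetime import datetime, timezone


def minute_prefix(arrival: datetime) -> str:
    """Build a key prefix such as /2020/02/02/15/55/ from a timezone-aware arrival time."""
    return arrival.astimezone(timezone.utc).strftime("/%Y/%m/%d/%H/%M/")


def group_by_minute(records):
    """Group records by the minute prefix of their ApproximateArrivalTimestamp."""
    groups = {}
    for record in records:
        prefix = minute_prefix(record["ApproximateArrivalTimestamp"])
        groups.setdefault(prefix, []).append(record)
    return groups


def unprocessed_groups(records, existing_prefixes):
    """Return only the groups whose prefix is not already present in the store."""
    return {
        prefix: batch
        for prefix, batch in group_by_minute(records).items()
        if prefix not in existing_prefixes
    }
```

Because the prefix depends only on the arrival time, a batch that is replayed after a crash maps to the same prefix and can be recognised as already stored.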
You can implement your own ISequenceStore which will be implemented against Redshift in your case (In the article, it is done against S3). Read the full article below.
https://www.nabin.dev/avoiding-duplicate-records-with-aws-kcl-and-s3
I have a number of systems, most of which are capable of generating data using JSON Activity Streams[1] (or can be coerced into doing so), and I want to use this data for analytics.
I want to use both a traditional SQL datamart for OLAP use, and also to dump the raw JSON data into Hadoop for running batch mapreduce jobs.
I've been reading up on Kafka, Flume, Scribe, S4, Storm and a whole load of other tools but I'm still not sure which is best suited to the task at hand. These seem to be either focussed on logfile data, or real-time processing of the activity stream, whereas I guess I'm more interested in doing ETL on activity streams.
The type of setup I'm thinking of is where I provide a configuration for all the streams I'm interested in (URLs, params, credentials), and the tool periodically polls them, dumps the output in HDFS, and also has a hook for me to process and transform the JSON for insertion into the datamart.
Do any of the existing open-source tools fit this case particularly well?
(In terms of scale I expect a max of 30,000 users interacting with ~10 systems - not simultaneously - so not really "Big Data", but not trivial either.)
Thanks!
[1] http://activitystrea.ms/
You should check out streamsets.com
It's an open source tool (and free to use) built exactly for these kinds of use cases.
You can use the HTTP Client source and HDFS destination to achieve your main goals. If you decide you need to use Kafka or Flume as well, support for both is also built in.
You can also do transformations in a variety of ways, including writing Python or JavaScript for more complex transformations (or your own stage in Java if you choose).
You can also check out Logstash (elastic.co) and NiFi to see if one of those works better for you.
*Full disclosure, I'm an engineer at StreamSets.
I know this question has been posed before, but the explanation was a little unclear to me, and my question is a little more general. I'm trying to conceptualize how one would periodically update data in an iPhone app using a remote web service. In theory, a portion of the data on the phone would be synced periodically (only when updated), while other data would require the user to be online and would be requested on the fly.
Conceptually, this seems possible using XML-RPC or JSON and Core Data. I wonder if anyone has an opinion on the best way to implement this. I am a novice iPhone developer, but I understand much of the process conceptually.
Thanks
To synchronize a set of entities when you don't have control over the server, here is one approach:
1. Add a touched BOOL attribute to your entity description.
2. In a sync attempt, mark all entity instances as untouched (touched = [NSNumber numberWithBool:NO]).
3. Loop through your server-side (JSON) instances and add or update entities from your Core Data store to your server-side store, or vice versa. The direction of updating will depend on your synchronization policy, and what data is "fresher" on either side. Either way, mark added, updated or sync'ed Core Data entities as touched (touched = [NSNumber numberWithBool:YES]).
4. Depending on your sync policy, delete all entity instances from your Core Data store which are still untouched. Untouched entities were presumably deleted from your server-side store, because no addition, update or sync event took place between the Core Data store and the server for those objects.
Synchronization is a fair amount of work to implement and will depend on what degree of synchronization you need to support. If you're just pulling data, step 3 is considerably simpler because you won't need to push object updates to the server.
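For the simpler pull-only case, the mark, update, and delete passes described above can be sketched like this. It is a language-neutral illustration in Python, not Core Data code: the local store is assumed to be a dict keyed by a unique id, and sync_pull_only is a hypothetical name.

```python
def sync_pull_only(local, server_items):
    """Mark-and-sweep sync of a local store (dict keyed by id) against server items.

    Each local value is a dict with a 'touched' flag, standing in for the
    touched attribute on a Core Data entity.
    """
    # Mark pass: flag every local entity as untouched.
    for entity in local.values():
        entity["touched"] = False

    # Update pass: add or update from the server copy, marking each one as touched.
    for item in server_items:
        entity = local.setdefault(item["id"], {})
        entity.update(item)
        entity["touched"] = True

    # Delete pass: remove whatever is still untouched (assumed deleted on the server).
    stale_ids = [key for key, entity in local.items() if not entity["touched"]]
    for key in stale_ids:
        del local[key]
    return stale_ids
```

The function returns the ids it removed, which can be handy for logging what a sync run deleted.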
Syncing is hard, very hard. Ideally you would want to receive deltas of the changes from the server and then, using a unique ID for each record in Core Data, update only those records that are new or changed.
Assuming you can do that, the code is pretty straightforward. If you are syncing in both directions, things get more complicated because you need to track deltas on both sides and handle collisions.
Can you clarify what type of syncing you want to accomplish? Is it bi-directional or pull only?
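A minimal sketch of the pull-only delta case, in Python for illustration rather than Core Data code. The delta shape (an 'id', an 'op' of 'upsert' or 'delete', and a 'fields' dict) is an assumption, since the server format is not specified here, and apply_deltas is a hypothetical name.

```python
def apply_deltas(local, deltas):
    """Apply server deltas to a local store keyed by unique record id.

    Each delta is a dict with 'id', 'op' ('upsert' or 'delete'), and for
    upserts a 'fields' dict. This shape is an assumption for illustration.
    """
    for delta in deltas:
        record_id = delta["id"]
        if delta["op"] == "delete":
            local.pop(record_id, None)  # ignore deletes for records we never had
        elif delta["op"] == "upsert":
            local.setdefault(record_id, {"id": record_id}).update(delta["fields"])
        else:
            raise ValueError(f"unknown delta op: {delta['op']!r}")
    return local
```

Only records named in the deltas are touched; everything else in the local store is left alone, which is what keeps this cheaper than a full comparison.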
I have an answer, but it's sucky. I'm currently looking for a more acceptable/reliable solution (i.e. anything Marcus Zarra cooks up).
What I've done needs some work ... seriously, because it doesn't work all the time...
The mobile device has a JSON catalog of entities, their versions, and a URL pointing to a JSON file with the entity contents.
The server has the same setup, the catalog listing the entities, etc.
Whenever the mobile device starts, it compares the entity versions in its local catalog with the catalog on the server. If any of the versions on the server are newer, it offers the user an opportunity to download the entity updates.
When the user elects to update, the mobile device now has the URL for each of the new/changed entities and downloads it. Once downloaded, the app will blow away all objects for each of the changed entities, and then insert the new objects from JSON. In the event of an error, the deletions/insertions are rolled back to pre-update status.
This works, sort of. I can't catch it in a debug session when it goes awry, so I'm not sure what might cause corruption or inconsistency in the process.
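For reference, the catalog comparison and the replace-with-rollback flow described above could look roughly like the sketch below. It is Python for illustration only, not the app's actual code and not a diagnosis of the inconsistency; the catalog shape (entity name mapped to a dict with 'version' and 'url'), the function names, and the download callable are all assumptions.

```python
import copy


def entities_needing_update(local_catalog, server_catalog):
    """Return names of entities that are newer on the server or missing locally."""
    return sorted(
        name
        for name, entry in server_catalog.items()
        if name not in local_catalog or entry["version"] > local_catalog[name]["version"]
    )


def replace_entities(store, local_catalog, server_catalog, names, download):
    """Replace all objects of each named entity; restore the previous state if anything fails."""
    store_backup = copy.deepcopy(store)
    catalog_backup = copy.deepcopy(local_catalog)
    try:
        for name in names:
            entry = server_catalog[name]
            store[name] = download(entry["url"])  # drop old objects, insert the new ones
            local_catalog[name] = dict(entry)
    except Exception:
        # Roll back both the objects and the catalog to their pre-update state.
        store.clear()
        store.update(store_backup)
        local_catalog.clear()
        local_catalog.update(catalog_backup)
        raise
```

One design point worth noting: the local catalog is rolled back together with the objects, so a failed update cannot leave the catalog claiming a version whose objects were never inserted.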