I'm mostly interested in ingesting data, but I'd also be curious to know whether there is also a way to emit messages.
After investigating, it appears this isn't possible at the moment, but new connectors are being developed all the time, so users should reach out to their Palantir representative in the future to double-check.
Related
I’m looking at storing raw JSON data received from a bunch of websockets and am feeling a bit overwhelmed with all the choices available. I found this from 2012 but things have moved on a bit since.
My requirements are to be able to query the data through an API, to get the latest message (ideally in realtime) or a subset of messages (like all messages received on a particular date).
I'll be storing around 10,000 messages daily from different sources with different schemas.
From some basic research I think I need some kind of NoSQL document store? The main ones seem to be:
Elasticsearch
Redis
MongoDB
CouchDB
Am I on the right lines here? What DB service should I use to store JSON data?
Thanks.
I ended up streaming data from Kafka to both Postgres (for persistent storage) and Elasticsearch (you know, for search) if that helps anyone.
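For anyone who wants to try the same setup, here's a minimal sketch of that fan-out using kafka-python, psycopg2 and the official Elasticsearch client. The topic, table and index names (and the local hosts) are just placeholders for illustration.

```python
# Minimal sketch: fan out JSON messages from Kafka to Postgres and Elasticsearch.
# Assumes a local Kafka broker, a Postgres table `messages(payload jsonb)`,
# and a local Elasticsearch node -- adjust hosts, topic and index names to taste.
import json

from kafka import KafkaConsumer          # pip install kafka-python
import psycopg2
from psycopg2.extras import Json
from elasticsearch import Elasticsearch  # pip install elasticsearch

consumer = KafkaConsumer(
    "raw-json-messages",                  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
pg = psycopg2.connect("dbname=ingest user=ingest")
es = Elasticsearch("http://localhost:9200")

for msg in consumer:
    doc = msg.value
    # Persistent copy in Postgres (a jsonb column keeps the raw document queryable).
    with pg.cursor() as cur:
        cur.execute("INSERT INTO messages (payload) VALUES (%s)", [Json(doc)])
    pg.commit()
    # Searchable copy in Elasticsearch (use body=doc instead of document=doc on 7.x clients).
    es.index(index="messages", document=doc)
```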
I'm working on the migration/integration of a large on-premise monolithic Oracle app to cloud-based microservices. For a long time, the microservices will need to be fed from and synchronized with the Oracle DB.
One of the alternatives is using Oracle GoldenGate for DB-to-DB(s) near-real-time replication. The advantage is that it seems to be reliable and resilient. The disadvantage is that it works on low-level CDC/DB changes (as opposed to app-level events).
An alternative is creating higher-level business events from the source DB by enriching the data and then pushing it to something like Kafka. The disadvantage is that this puts more load on the source DB and requires durability on the source.
Anybody dealt with similar problems? Any advice is appreciated.
The biggest problem for us has been that legacy data is on the LAN, and our microservices are in the public cloud (in an attempt to avoid a "new legacy" hybrid cloud future).
Oracle GoldenGate for Big Data can push change records as JSON to Kafka/Confluent. There's also the option to write your own handlers. You can find a lot of our PoC code on GitHub.
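If it helps, here's a rough sketch of what consuming those JSON change records and turning them into higher-level business events can look like. The field names ("table", "op_type", "after", "op_ts") are what the GoldenGate JSON formatter typically emits, but check them against your handler configuration; the topic names and the mapping itself are made up.

```python
# Rough sketch: turn low-level GoldenGate change records (JSON on Kafka) into
# higher-level business events on another topic.
import json
from kafka import KafkaConsumer, KafkaProducer  # pip install kafka-python

consumer = KafkaConsumer("ogg-changes", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

def to_business_event(change):
    """Map a raw CDC record to an app-level event (hypothetical mapping)."""
    if change.get("table", "").endswith("ORDERS") and change.get("op_type") == "I":
        row = change.get("after", {})
        return {"event": "OrderCreated",
                "order_id": row.get("ORDER_ID"),
                "ts": change.get("op_ts")}
    return None  # ignore changes we don't care about

for msg in consumer:
    event = to_business_event(json.loads(msg.value))
    if event is not None:
        producer.send("business-events", json.dumps(event).encode("utf-8"))
```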
As time has gone by, it has become apparent that the number of feeds is going to end up in the 300+ range, and we're now considering a data virtualisation + caching approach rather than pushing the legacy data to the cloud apps.
http://pubapi.cryptsy.com/api.php?method=marketdatav2
I would like to synchronize market data on a continuous basis (e.g. from Cryptsy and other exchanges). I would like to show the latest buy/sell prices from the respective orders on these exchanges on a regular basis, as a historical time series.
What backend database should I use to store the retrieved data and render or plot any parameter from it as a historical time series?
I'd suggest you look at a database tuned for handling time series data. The one that springs to mind is InfluxDB. This question has a more general take on time series databases.
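As a rough idea of what that looks like, here's a small sketch using the InfluxDB 1.x Python client; the measurement, tag and field names are invented for illustration.

```python
# Minimal sketch of writing exchange ticks into InfluxDB (v1.x Python client).
from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host="localhost", port=8086, database="marketdata")

points = [{
    "measurement": "orderbook_top",
    "tags": {"exchange": "cryptsy", "market": "BTC/USD"},
    "fields": {"best_bid": 421.10, "best_ask": 421.45},
}]
client.write_points(points)

# Query back a time series, e.g. the last day's best bids:
result = client.query("SELECT best_bid FROM orderbook_top WHERE time > now() - 1d")
```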
I think it needs more detail about the requirement.
It just says "it needs to sync time series data". What is the scenario? What are the data source and destination?
Option 1.
If it is just a data synchronization issue between two databases, the easiest solution is the CouchDB NoSQL family (CouchDB, Couchbase, Cloudant).
They are all based on CouchDB, and they provide data-center-level replication (e.g. XDCR in Couchbase). So you can replicate the data to another CouchDB in another data center, or even to a CouchDB on mobile devices.
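For example, here's a hedged sketch of starting continuous replication between two CouchDB instances through the _replicate endpoint; hostnames and database names are placeholders:

```python
# Kick off continuous replication between two CouchDB instances.
import requests  # pip install requests

resp = requests.post(
    "http://dc1.example.com:5984/_replicate",
    json={
        "source": "http://dc1.example.com:5984/marketdata",
        "target": "http://dc2.example.com:5984/marketdata",
        "continuous": True,   # keep replicating as new documents arrive
    },
)
resp.raise_for_status()
```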
I hope it will be useful to you.
Option 2.
The other approach is data integration. You can sync data by using an ETL batch job: a batch worker copies data to the destination periodically. This is the most common way to replicate data to another destination. There are a lot of tools that support ETL, like Pentaho ETL, Spring Integration, and Apache Camel.
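For example, a bare-bones sketch of such a batch worker in Python; extract() and load() are hypothetical placeholders for whatever your source and destination need:

```python
# A worker that periodically copies new records from a source to a destination.
import time

def extract(since):
    """Fetch records changed after `since` from the source system (placeholder)."""
    raise NotImplementedError

def load(records):
    """Write records into the destination system (placeholder)."""
    raise NotImplementedError

last_sync = 0.0
while True:
    batch = extract(last_sync)
    if batch:
        load(batch)
    last_sync = time.time()
    time.sleep(600)  # run every 10 minutes
```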
If you give me a more detailed scenario, I can help you in more detail.
Enjoy
-Terry
I think MongoDB is a good choice. Here is why:
You can easily scale out and thus store a tremendous amount of data. With a suitable shard key, you might even be able to position the shards close to the exchange they follow in order to improve speed, should that become a concern.
Replica sets offer automatic failover, covering an availability concern that could otherwise implicitly become an issue.
Using the TTL feature, data can be automatically deleted once its TTL expires, effectively creating a round-robin database (see the sketch after this list).
Both the aggregation framework and map/reduce will be helpful.
There are some free classes at MongoDB University which will help you avoid the most common pitfalls.
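Here's the sketch mentioned above for the TTL point, using pymongo; the collection and field names are made up for illustration.

```python
# Documents expire automatically once "receivedAt" is older than the TTL.
import datetime
from pymongo import MongoClient  # pip install pymongo

coll = MongoClient("mongodb://localhost:27017")["market"]["ticks"]

# TTL index: MongoDB's background task removes documents (with ~60s granularity)
# once receivedAt + expireAfterSeconds has passed.
coll.create_index("receivedAt", expireAfterSeconds=7 * 24 * 3600)

coll.insert_one({
    "exchange": "cryptsy",
    "receivedAt": datetime.datetime.utcnow(),
    "payload": {"best_bid": 421.10, "best_ask": 421.45},
})
```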
I have a number of systems, most of which are capable of generating data using JSON Activity Streams[1] (or can be coerced into doing so), and I want to use this data for analytics.
I want to use both a traditional SQL datamart for OLAP use, and also to dump the raw JSON data into Hadoop for running batch mapreduce jobs.
I've been reading up on Kafka, Flume, Scribe, S4, Storm and a whole load of other tools, but I'm still not sure which is best suited to the task at hand. These seem to be focused either on logfile data or on real-time processing of the activity stream, whereas I guess I'm more interested in doing ETL on activity streams.
The type of setup I'm thinking of is where I provide a configuration for all the streams I'm interested in (URLs, params, credentials), and the tool periodically polls them, dumps the output in HDFS, and also has a hook for me to process and transform the JSON for insertion into the datamart.
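To make that concrete, this is roughly the glue code I'm imagining (and would rather get from an existing tool): poll each configured stream, append the raw JSON, and push it into HDFS. All URLs, credentials and paths below are placeholders.

```python
# Poll each configured activity stream, write one activity per line to a dated
# local file, then ship the file to HDFS for later MapReduce jobs.
import datetime
import json
import subprocess

import requests  # pip install requests

STREAMS = [
    {"url": "https://example.com/api/activity", "params": {"since": "1h"},
     "auth": ("user", "secret")},
]

today = datetime.date.today().isoformat()
local_path = f"/tmp/activity-{today}.json"

with open(local_path, "a") as out:
    for stream in STREAMS:
        resp = requests.get(stream["url"], params=stream["params"], auth=stream["auth"])
        resp.raise_for_status()
        for activity in resp.json().get("items", []):
            out.write(json.dumps(activity) + "\n")

subprocess.run(
    ["hdfs", "dfs", "-put", "-f", local_path, f"/data/activity/{today}.json"],
    check=True,
)
```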
Do any of the existing open-source tools fit this case particularly well?
(In terms of scale I expect a max of 30,000 users interacting with ~10 systems - not simultaneously - so not really "Big Data", but not trivial either.)
Thanks!
[1] http://activitystrea.ms/
You should check out streamsets.com
It's an open source tool (and free to use) built exactly for these kinds of use cases.
You can use the HTTP Client source and HDFS destination to achieve your main goals. If you decide you need to use Kafka or Flume as well, support for both is also built-in.
You can also do transformations in a variety of ways, including writing Python or JavaScript for more complex transformations (or your own stage in Java if you choose).
You can also check out logstash (elastic.co) and NiFi to see if one of those works better for you.
*Full disclosure, I'm an engineer at StreamSets.
Our team currently uses Zabbix for monitoring and alerting. In addition, we use Fluentd to gather logs into a central MongoDB, and that has been running for a week. Recently we were discussing another solution - Logstash. I'd like to ask: what is the difference between them? My current thinking is to use Zabbix as the data-gathering and alert-sending platform, with Fluentd playing the 'data-gathering' role in the whole infrastructure. However, I've looked at the Logstash website and found that Logstash is not only a log-gathering system, but a whole solution for gathering, presentation and search.
Could anybody give some advice or share some experience?
Logstash is pretty versatile (disclaimer: I've only been playing with it for a few weeks).
We'd been looking at Graylog2 for a while (listening for syslog and providing a nice search UI), but the message processing functionality in it is based on the Drools engine and is arcane at best.
I found it was much easier to have logstash read syslog files from our central server, massage the events and output to Graylog2. Gave us much more flexibility and should allow us to add application level events alongside the OS level syslog data.
It has a zabbix output, so you might find it's worth a look.
Logstash is a great fit with Zabbix.
I forked a repo on GitHub to take the Logstash statsd output and send it to Zabbix for trending/alerting. As another answer mentioned, Logstash also has a Zabbix output plugin, which is great for notifying/sending on matching events.
Personally, I prefer the native Logstash->Elasticsearch backend to Logstash->Graylog2(->Elasticsearch).
It's easier to manage, especially if you have a large volume of log data. At present, Graylog2 uses Elasticsearch as well, but it uses only a single index for all data. If you're periodically cleaning up old data, this means the equivalent of a lot of SQL "delete from table where date < YYYY.MM.DD" calls to clear out old data, whereas Logstash defaults to daily indexes (the equivalent of "drop table YYYY.MM.DD"), so clean-up is nicer.
It also results in cleaner searches, requiring less heap space, as you can search over a known date because the index is named for the day's data it contains.
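For what it's worth, here's a rough sketch of that daily-index clean-up, assuming Logstash's default logstash-YYYY.MM.dd index naming and a local Elasticsearch node:

```python
# Drop whole logstash-YYYY.MM.dd indexes older than a retention window,
# instead of deleting individual documents.
import datetime
import requests  # pip install requests

ES = "http://localhost:9200"
RETENTION_DAYS = 30

cutoff = datetime.date.today() - datetime.timedelta(days=RETENTION_DAYS)
day = cutoff - datetime.timedelta(days=60)   # look back a bit further than the retention window
while day < cutoff:
    index = "logstash-" + day.strftime("%Y.%m.%d")
    resp = requests.delete(f"{ES}/{index}")   # 404 simply means that day's index is already gone
    if resp.status_code == 200:
        print("dropped", index)
    day += datetime.timedelta(days=1)
```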