How to synchronize market data frequently and show it as historical time series data - mysql

http://pubapi.cryptsy.com/api.php?method=marketdatav2
I would like to synchronize market data on a continuous basis from several exchanges (e.g. Cryptsy). I would like to show the latest buy/sell prices from the respective orders on these exchanges on a regular basis, as a historical time series.
What backend database should I use to store the retrieved data and to render or plot any parameter from it as a historical time series?

I'd suggest you look at a database tuned for handling time series data. The one that springs to mind is InfluxDB. This question has a more general take on time series databases.
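To make that concrete, here is a minimal sketch of writing one price tick to InfluxDB with the influxdb Python package (the 1.x client); the database, measurement, tag, and field names are illustrative, not anything the question prescribes:

from influxdb import InfluxDBClient  # pip install influxdb

client = InfluxDBClient(host="localhost", port=8086)
client.create_database("market_data")  # no-op if it already exists
client.switch_database("market_data")

# One buy/sell observation; InfluxDB indexes by time, so plotting a
# historical series later is a simple SELECT over a time range.
client.write_points([{
    "measurement": "ticker",
    "tags": {"exchange": "cryptsy", "pair": "BTC/USD"},
    "fields": {"buy": 320.5, "sell": 321.1},
}])

Omitting "time" from the point lets the server assign the current timestamp.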

I think the question needs more detail about the requirement.
It just says "it needs to sync time series data". What is the scenario? What are the data source and destination?
Option 1.
If it is just a data synchronization issue between two databases, the easiest solution is the CouchDB NoSQL family (CouchDB, Couchbase, Cloudant).
They are all based on CouchDB, and they provide a data-center-level replication feature (XDCR), so you can replicate the data to another CouchDB in a different data center, or even to a CouchDB on a mobile device.
I hope it will be useful to you.
Option 2.
The other approach is data integration: you can sync data using an ETL batch job, where a batch worker copies data to the destination periodically. It is the most common way to replicate data to another destination, and there are a lot of tools that support ETL, like Pentaho ETL, Spring Integration, and Apache Camel. A minimal sketch of the batch approach follows.
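This sketch assumes Python with the requests and mysql-connector-python packages, and it guesses at the shape of the marketdatav2 payload, so adjust the field names to the real response:

import requests
import mysql.connector  # pip install mysql-connector-python

API = "http://pubapi.cryptsy.com/api.php?method=marketdatav2"

def sync_once():
    # Assumed payload shape: {"return": {"markets": {"BTC/USD": {...}}}}
    markets = requests.get(API, timeout=30).json()["return"]["markets"]
    conn = mysql.connector.connect(user="etl", password="...", database="ticks")  # placeholder credentials
    cur = conn.cursor()
    for name, m in markets.items():
        cur.execute(
            "INSERT INTO tick (market, last_price, fetched_at) VALUES (%s, %s, NOW())",
            (name, m["lasttradeprice"]),
        )
    conn.commit()
    conn.close()

Run it from cron (or any scheduler) at whatever interval you need.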
If you provide a more detailed scenario, I can help in more detail.
Enjoy
-Terry

I think MongoDB is a good choice. Here is why:
You can easily scale out, and thus store a tremendous amount of data. With an appropriate shard key, you might even be able to position the shards close to the exchange they follow in order to improve speed, should that become a concern.
Replica sets offer automatic failover, heading off an availability issue that could otherwise bite you.
Using the TTL feature, data can be deleted automatically once it expires, effectively creating a round-robin database (see the sketch after this list).
Both the aggregation and map/reduce frameworks will be helpful.
There are some free classes at MongoDB University which will help you avoid the most common pitfalls.
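For the TTL point above, a minimal sketch with pymongo; the collection and field names are made up:

import datetime
from pymongo import ASCENDING, MongoClient

col = MongoClient().market.ticks
# Documents expire ~60 days after their `ts` value; a background thread
# in mongod deletes them, giving round-robin-like behavior.
col.create_index([("ts", ASCENDING)], expireAfterSeconds=60 * 60 * 24 * 60)
col.insert_one({
    "exchange": "cryptsy",
    "pair": "BTC/USD",
    "buy": 320.5,
    "sell": 321.1,
    "ts": datetime.datetime.utcnow(),
})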

Related

What is a good AWS solution (DB, ETL, Batch Job) to store large historical trading data (with daily refresh) for machine learning analysis?

I want to build a machine learning system (a Python program) on a large amount of historical trading data.
The trading company has an API to grab their historical and real-time data. Data volume is about 100 GB for historical data and about 200 MB per day.
Trading data is typical time series data: price, name, region, timeline, etc. The data can be retrieved as large files or stored in a relational DB.
So my question is: what is the best way to store this data on AWS, and what's the best way to add new data every day (through a cron job, or an ETL job)? Possible solutions include storing it in a relational database like MySQL, in NoSQL databases like DynamoDB or Redis, or in a file system to be read by the Python program directly. I just need a solution to persist the data in AWS so multiple teams can grab it for research.
Also, since it's a research project, I don't want to spend too much time exploring new systems or emerging technologies. I know there are time series databases like InfluxDB and the new Amazon Timestream; considering the learning curve and the deadline, I'm not inclined to learn and use them for now.
I'm familiar with MySQL. If really needed, I can pick up NoSQL, like Redis/DynamoDB.
Any advice? Many thanks!
If you want to use AWS EMR, then the simplest solution is probably just to run a daily job that dumps data into a file in S3. However, if you want to use something a little more SQL-ey, you could load everything into Redshift.
If your goal is to make the data available in some form to other people, then you should definitely put it in S3. AWS has ETL and data migration tools that can move data from S3 to a variety of destinations, so others will not be restricted in how they use the data just because it is stored in S3.
On top of that, S3 is the cheapest (warm) storage option available in AWS, and for all practical purposes its throughput is unlimited. If you store the data in a SQL database, you significantly limit the rate at which it can be retrieved. If you store it in a NoSQL database, you may be able to support more traffic (maybe), but at significant cost.
Just to further illustrate my point, I recently did an experiment to test certain properties of one of the S3 APIs, and part of my experiment involved uploading ~100GB of data to S3 from an EC2 instance. I was able to upload all of that data in just a few minutes, and it cost next to nothing.
The only thing you need to decide is the format of your data files. You should talk to some of the other people and find out whether JSON, CSV, or something else is preferred.
As for adding new data, I would set up a Lambda function triggered by a CloudWatch event. The Lambda function can get the data from your data source and put it into S3. The CloudWatch event trigger is cron-based, so it's easy enough to switch between hourly, daily, or whatever frequency meets your needs.
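A minimal sketch of such a function; the bucket name and source URL are placeholders, and boto3 is preinstalled in the Lambda Python runtime:

import urllib.request
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-trading-data"                 # placeholder bucket
SOURCE = "https://example.com/daily-dump"  # placeholder data source

def handler(event, context):
    # Invoked by a cron-style CloudWatch Events / EventBridge rule.
    body = urllib.request.urlopen(SOURCE, timeout=60).read()
    key = datetime.now(timezone.utc).strftime("daily/%Y/%m/%d.json")
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    return {"written": key}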

Best IoT Database?

I have many IoT devices currently sending data to a MySQL database.
I want to port it to some other database, which should be open source and provide me with:
JSON support
Scalability
Flexibility to add multiple columns automatically as per payload
Python and PHP Support
Extremely Fast Read, Write
Ability to export at least 6 months of data in CSV format
Please reply soon.
Any help will be appreciated.
Thanks
Shaping your database based on the input data is a mistake. Suppose tomorrow your data arrives as CSV or XML, or in a slightly different format. Design your database based on your abstract data model, normalize it, and map the incoming data onto that model. Shape your structure based on what input you have and what output you plan to get. If you only ever retrieve the same content as the input, storing the data in files is sufficient; you don't need a database.
Also, you don't want to store "raw" records in the database. Even if your database can compose a data record out of the raw elements at run time, you cannot run a selection based on a certain extracted element without visiting all the records.
Most databases allow you to connect from anywhere (there is no such thing as better support for PostgreSQL in Java compared to Python, but the quality and level of standardization of drivers may vary). The question is what features your driver must support. For example, you may require support for bulk import (don't issue large INSERT sets to the database).
What you are actually looking for is:
scalability: can your database grow with your data? Would the DB benefit from additional CPUs (MySQL in particular doesn't for large queries)? Can you shard your database across multiple instances? (Again, MySQL doesn't handle that natively.)
does your model look like a snowflake? If yes, you may consider NoSQL; otherwise stay away from it. If you manage to model it as a snowflake (and this means you are open to compromises), you may use anything from Lucene-based search products to Mongo, Cassandra, etc. The fact that you have time series doesn't by itself qualify you for NoSQL. For example, you may have 10K devices issuing 5K message types, with specific data recorded redundantly at the device level and at the message-type level; in that case, because of the n:m relation, you don't have a snowflake anymore.
why do you store the data? What queries are you going to issue?
Why do you want to move away from MySQL? It is open source and can meet all of the criteria you listed above (a minimal sketch of its JSON support follows). This is a very subjective question, so it's hard to give a definitive answer, but MySQL is not a bad option.
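This sketch assumes MySQL 5.7+ and the mysql-connector-python package; the table and field names are illustrative. It also includes an extracted, indexed column so you can filter without scanning every raw payload, which addresses the caveat above about selecting on elements buried inside raw records:

import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(user="iot", password="...", database="telemetry")  # placeholder credentials
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS reading (
        id BIGINT AUTO_INCREMENT PRIMARY KEY,
        device_id VARCHAR(64) NOT NULL,
        payload JSON NOT NULL,
        -- generated column extracted from the payload, so it can be indexed
        temperature DOUBLE AS (JSON_EXTRACT(payload, '$.temp')) STORED,
        INDEX (device_id),
        INDEX (temperature)
    )
""")
cur.execute(
    "INSERT INTO reading (device_id, payload) VALUES (%s, %s)",
    ("sensor-1", '{"temp": 21.5, "humidity": 40}'),
)
conn.commit()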

Why do we need SSIS and star schema of Data Warehouse?

If SSAS in MOLAP mode stores data, what is the application of SSIS and why do we need a Data Warehouse and the ETL process of SSIS?
I have a SQL Server OLTP database. I am using SSIS to transfer my SQL Server data from OLTP database to a Data Warehouse database that contains fact and dimension tables.
After that I want to create cubes using SSAS from the Data Warehouse data.
I know that MOLAP stores data. Do I still need a Data Warehouse with fact and dimension tables?
Isn't it better to avoid creating a Data Warehouse and to create cubes directly from the OLTP database?
This might be a candidate for "Too Broad" but I'll give it a go.
Why would I want to store my data 3 times?
I have my data in my OLTP (online, transaction processing system), why would I want to move that data into a completely new structure (data warehouse) and then move it again into an OLAP system?
Let's start simple. You have only one system of record and it's not amazingly busy. Maybe you can get away with an abstraction layer (views in the database or named queries in SSAS) and skip the data warehouse.
So, you build out your cubes, dimensions and people start using it and they love it.
"You know what'd be great? If we could correlate our Blats to the Foos and Bars we already have in there" Now you need to integrate your simple app with data from a completely unrelated app. Customer id 10 in your app is customer id {ECA67697-1200-49E2-BF00-7A13A549F57D} in the CRM app. Now what? You're going to need to present a single view of the Customer to your users or they will not use the tool.
Maybe you rule with an iron fist and say No, you can't have that data in the cube and your users go along with it.
"Do people's buying habits change after having a child?" We can't answer that because our application only stores the current version of a customer. Once they have a child, they've always had a child so you can't cleanly identify patterns before or after an event.
"What were our sales like last year" We can't answer that because we only keep a rolling 12 weeks of data in the app to make it manageable.
"The data in the cubes is stale, can you refresh it?" Egads, it's the middle of the day. The SSAS processing takes table locks and would essentially bring our app down until it's done processing.
Need I go on with these scenarios?
Summary
The data warehouse serves as an integration point for diverse systems. It has conformed dimensions (everyone has a common definition of what a thing is). The data in the warehouse may outlive the data in the source systems. Business needs might drive the tracking of data that the source application does not support. The data in the DW supports business activities, while your OLTP system supports itself.
SSIS is just a tool for moving data. There are plenty out there, some better, some worse.
So No, generally speaking, it is not better to avoid creating a DW and build your cubes based on your OLTP database.

What would be a preferable approach for rendering time series data

We have a simple JSON feed which provides stock/price information at a certain point in time.
e.g.
{t0, {MSFT, 20}, {AAPL, 30}}
{t1, {MSFT, 10}, {AAPL, 40}}
{t2, {MSFT, 5}, {AAPL, 50}}
What would be a preferred mechanism to store/retrieve this data and to plot a graph based on it (say, for MSFT)? Should I use Redis or MySQL?
I would also want to show the latest entries to all users in the portal as and when new data is received. The data could be retrieved every minute. Should I use node.js for this?
Ours is a Rails application, and I would like to know what libraries/databases I should use to model this capability.
It depends on the traffic and the data. If the data is relational, meaning it is formally described and organized according to the relational model, then MySQL is better. If most of the queries are key->value gets and sets, meaning you fetch the data by a single key and need to support many clients and many sets/gets per minute, then definitely go with Redis. There are many other NoSQL DBs that might fit; have a look at this post for a great review of some of the most popular ones.
There are so many ways to do this. If getting an update once a minute is enough, have the client do AJAX calls every minute to fetch the updated data, and build your server side using PHP, .NET, Java servlets, or node.js, again depending on the expected user concurrency. PHP is very easy to develop with, while node.js can support many short I/O requests. Another option you might want to consider is server push (Node's socket.io, for example) instead of client AJAX calls; that way the client is notified immediately on an update.
Personally, I like both node.js and Redis and have used them in a couple of production applications supporting many concurrent users on a single server: node because it's easy to develop with and supports many users, and Redis for its amazing speed and handling of concurrent requests. Having said that, I also use MySQL for saving relational data, and PHP servers for fast development of APIs. Each has its own benefits.
Hope you'll find this info helpful.
Kuf.
As Kuf mentioned, there are so many ways to go about this, and it really does depend on your needs: low latency, data storage, or ease of implementation.
Redis will most likely be the best solution if you are going for low latency and ease of implementation. You can use Pub/Sub to push updates to clients (e.g. via Node's socket.io) in real time, and run a second Redis instance to store the JSON data as a sorted set, using the timestamp as the score. I've used the same approach with much success for storing time-based statistical data. The downside of this solution is that it is resource (i.e. memory) expensive if you want to store a lot of data.
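A minimal sketch of that pattern with redis-py; the key names and payload shape are made up:

import json
import time

import redis  # pip install redis

r = redis.Redis()

def record_tick(symbol, price):
    ts = time.time()
    entry = json.dumps({"t": ts, "symbol": symbol, "price": price})
    # Sorted set scored by timestamp -> cheap time-range reads later.
    r.zadd("ticks:" + symbol, {entry: ts})
    # Push the update to live subscribers (e.g. a socket.io bridge).
    r.publish("ticks", entry)

def last_minute(symbol):
    now = time.time()
    return [json.loads(e) for e in r.zrangebyscore("ticks:" + symbol, now - 60, now)]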
If you are looking to store a lot of data in JSON format and want to use a pull to fetch data every minute, then using ElasticSearch to store/retrieve data is another possibility. You can use ElasticSearch’s range query to search using a timestamp field, for example:
"range": {
"#timestamp": {
"gte": date_from,
"lte": now
}
}
This adds the flexibility of using an extremely scalable and redundant system, storing larger amounts of data, and a RESTful real-time API. 
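For completeness, a sketch of issuing that query from Python with the pre-8.x elasticsearch client; the index name is made up:

from datetime import datetime, timedelta
from elasticsearch import Elasticsearch  # pip install "elasticsearch<8"

es = Elasticsearch()
date_from = (datetime.utcnow() - timedelta(hours=1)).isoformat()
res = es.search(index="ticks", body={
    "query": {"range": {"@timestamp": {"gte": date_from, "lte": "now"}}}
})
for hit in res["hits"]["hits"]:
    print(hit["_source"])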
Best of luck!
Since you're basically storing JSON data...
Postgres has a native JSON datatype
MongoDB might be a good fit too, since JSON maps directly to BSON.
But if it's just serving data, even something as simple as memcached would suffice.
If you have a lot of data to keep updated in real-time like stock ticker prices, the solution should involve the server publishing to the client, not the client continually hitting the server for updates. Publish/subscribe (pub/sub) type model with websockets might be a good choice at the moment, depending on your client requirements.
For plotting the data using data from websockets there is already a question about that here.
Ruby-toolbox has a category called HTTP Pub Sub which might be a good place to start. Whether MySQL or Redis is better depends on what you will be doing with the data aside from streaming stock prices; Redis may be the better choice for performance. Note also that websocket-rails assumes Redis, if you were to use that, just as an example.
I would not recommend a simple JSON API (non-pubsub) in this case, because it will not scale as well (see this answer), but if you don't think you'll have many clients, go for it.
Cube could be a good example for reference. It uses MongoDB for data storage.
For plotting time series data, you may try out cubism.js.
Both projects are from Square.

Which strategy to use for designing a log data storage?

We want to design a data store on a relational database to keep request message (HTTP/S, XMPP, etc.) logs. For generating the logs we use a solution based on the Apache Synapse ESB. Since we want to store the logs and read them only for maintenance issues, the read/write ratio will be low (writes will be intensive, since the system will receive many messages to be logged). We thought of using Cassandra for its distributed nature and clustering capabilities. However, with Cassandra schemas, search queries with filters are difficult, always requiring secondary indexes.
To keep it short, my question is whether we should try MySQL's clustering solutions or use Cassandra with a schema designed for search queries with filters.
If you wish to do real-time analytics over your semi-structured or unstructured data, you can go with a Cassandra + Hadoop cluster; the Cassandra wiki itself suggests the DataStax Brisk edition for this kind of architecture, so it is worth giving it a try.
On the other hand, if you wish to do real-time queries over raw logs for a small set of data, e.g.
select useragent from raw_log_table where id='xxx'
then you should do a lot of research on your row key and column key design, because that decides the complexity of your queries (a rough sketch follows). Have a look at the case studies from users here: http://www.datastax.com/cassandrausers1
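As a rough sketch of what that design work produces, using the Python cassandra-driver; the partitioning scheme is just one option, chosen so writes spread across the cluster and a maintenance query scans one bounded partition per day:

from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("logs")  # assumes a "logs" keyspace exists
session.execute("""
    CREATE TABLE IF NOT EXISTS raw_log (
        day text,            -- partition key: one partition per day
        logged_at timeuuid,  -- clustering key: ordered within the day
        id text,
        useragent text,
        body text,
        PRIMARY KEY (day, logged_at)
    )
""")
session.execute(
    "INSERT INTO raw_log (day, logged_at, id, useragent, body) "
    "VALUES (%s, now(), %s, %s, %s)",
    ("2014-03-01", "xxx", "Mozilla/5.0", "GET /status"),
)
# Note: with this key you cannot query by id alone without a secondary
# index -- exactly the trade-off described above.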
Regards,
Tamil