What is a good AWS solution (DB, ETL, Batch Job) to store large historical trading data (with daily refresh) for machine learning analysis?

I want to build a machine learning system (a Python program) on top of a large amount of historical trading data.
The trading company has an API for retrieving both historical and real-time data. The volume is about 100 GB for the historical data and about 200 MB of new data per day.
The trading data is typical time-series data: price, name, region, timeline, etc. It can be retrieved either as large files or from a relational DB.
So my question is: what is the best way to store this data on AWS, and what is the best way to add new data every day (e.g. via a cron job or an ETL job)? Possible solutions include storing it in a relational database (e.g. MySQL), in a NoSQL database like DynamoDB or Redis, or in a file system that the Python program reads directly. I just need a way to persist the data in AWS so that multiple teams can access it for research.
Also, since it's a research project, I don't want to spend too much time exploring new systems or emerging technologies. I know there are time-series databases like InfluxDB and the new Amazon Timestream, but given the learning curve and the deadline, I'm not inclined to learn and use them for now.
I'm familiar with MySQL. If really needed, I can pick up a NoSQL store like Redis or DynamoDB.
Any advice? Many thanks!

If you want to use AWS EMR, then the simplest solution is probably just to run a daily job that dumps data into a file in S3. However, if you want to use something a little more SQL-ey, you could load everything into Redshift.
If your goal is to make it available in some form to other people, then you should definitely put the data in S3. AWS has ETL and data migration tools that can move data from S3 to a variety of destinations, so the other people will not be restricted in their use of the data just because of it being stored in S3.
On top of that, S3 is the cheapest (warm) storage option available in AWS, and for all practical purposes its throughput is unlimited. If you store the data in a SQL database, you significantly limit the rate at which the data can be retrieved. If you store the data in a NoSQL database, you may be able to support more traffic (maybe), but at significant cost.
Just to further illustrate my point, I recently did an experiment to test certain properties of one of the S3 APIs, and part of my experiment involved uploading ~100GB of data to S3 from an EC2 instance. I was able to upload all of that data in just a few minutes, and it cost next to nothing.
The only thing you need to decide is the format of your data files. You should talk to some of the other people and find out whether JSON, CSV, or something else is preferred.
As for adding new data, I would set up a Lambda function triggered by a scheduled CloudWatch Events rule. The Lambda function can fetch the data from your data source and put it into S3. The schedule is cron-based, so it's easy to switch between hourly, daily, or whatever frequency meets your needs.
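Here's a minimal sketch of such a handler, assuming the trading API is a plain HTTP endpoint; the URL, bucket name, and key layout are placeholders, not anything from your actual setup:

```python
# Minimal sketch of a scheduled Lambda handler (triggered by a cron-style
# CloudWatch Events / EventBridge rule). The API URL, bucket name, and key
# layout below are placeholders.
import datetime
import urllib.request

import boto3

S3_BUCKET = "my-trading-data"               # hypothetical bucket
API_URL = "https://example.com/api/daily"   # hypothetical trading API endpoint

s3 = boto3.client("s3")

def handler(event, context):
    today = datetime.date.today().isoformat()
    # Pull the daily snapshot from the vendor API.
    with urllib.request.urlopen(API_URL) as resp:
        payload = resp.read()
    # Partition objects by date so daily loads never overwrite each other.
    key = f"trading/daily/{today}.json"
    s3.put_object(Bucket=S3_BUCKET, Key=key, Body=payload)
    return {"stored": key, "bytes": len(payload)}
```

At ~200 MB per day a single put_object call is fine; if the daily pull ever grows much larger, or takes longer than Lambda's 15-minute limit, the same logic can run from a cron job on EC2 or an AWS Batch job instead.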

Related

Alternate methods to supply data for machine learning (Other than using CSV files)

I have a question relating to applying machine learning in the real world. It might sound stupid, lol.
I've been self-studying machine learning for a while, and most of the exercises use CSV files as the data source (both processed and raw). I would like to ask: are there methods other than importing a CSV file to supply data for machine learning?
Example: streaming Facebook/Twitter live feed data into a machine learning model in real time, rather than collecting old data and storing it in a CSV file.
The data source can be anything. Usually it's provided as a CSV or JSON file. But in the real world, say you have a website such as Twitter, as you mention: you'd be storing your data in a relational DB such as a SQL database, and some of the data would go into an in-memory cache.
You can use both of these to retrieve and process your data. The catch is that when you have too much data to fit in memory, you can't simply query everything and process it at once; in that case you process the data in chunks.
The nice thing about SQL databases is that they provide a set of functions you can invoke right in your SQL query to compute aggregates efficiently. For example, you can get the sum of a column across the whole table with the SUM() function, which makes this kind of data manipulation efficient and easy.
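To make those two points concrete, here's a small sketch with pandas and SQLite; the `impressions` table and its `amount` column are made up for illustration:

```python
# Sketch: aggregate inside the database vs. stream raw rows in chunks.
# The "impressions" table and its "amount" column are hypothetical.
import sqlite3

import pandas as pd

conn = sqlite3.connect("analytics.db")

# 1) Let the database do the heavy lifting: SUM() runs inside the engine
#    and only the aggregated result comes back.
total = pd.read_sql_query("SELECT SUM(amount) AS total FROM impressions", conn)

# 2) When you do need the raw rows but they don't fit in memory,
#    process them chunk by chunk.
running_sum = 0.0
for chunk in pd.read_sql_query("SELECT amount FROM impressions", conn,
                               chunksize=100_000):
    running_sum += chunk["amount"].sum()

conn.close()
```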

Backing up DynamoDB tables via data pipeline vs manually creating a json for dynamoDB

I need to back up a few DynamoDB tables, which are not too big for now, to S3. However, these are tables another team uses and works on, not me. These backups need to happen once a week and will only be used to restore the DynamoDB tables in disastrous situations (so hopefully never).
I saw that there is a way to do this by setting up a data pipeline, which I'm guessing can be scheduled to do the job once a week. However, it seems like this would keep the pipeline open and start incurring charges. So I was wondering whether there is a significant cost difference between backing the tables up via the pipeline and keeping it open, versus something like a PowerShell script scheduled to run on an already existing EC2 instance, which would manually create a JSON mapping file and upload it to S3.
Also, I guess my other question is more about practicality: how difficult is it to back up DynamoDB tables to JSON format? It doesn't seem too hard, but I wasn't sure. Sorry if these questions are too general.
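For reference, the kind of manual backup I had in mind is roughly this (sketched here with Python/boto3 rather than PowerShell; table and bucket names are placeholders, and a full Scan obviously consumes read capacity):

```python
# Sketch of the "manual" route: scan a DynamoDB table, dump it to JSON,
# and upload the file to S3. Table and bucket names are placeholders.
import json

import boto3

dynamodb = boto3.client("dynamodb")
s3 = boto3.client("s3")

def backup_table(table_name: str, bucket: str) -> None:
    items = []
    paginator = dynamodb.get_paginator("scan")
    for page in paginator.paginate(TableName=table_name):
        items.extend(page["Items"])   # items keep DynamoDB's typed JSON form
    body = json.dumps(items)
    s3.put_object(Bucket=bucket, Key=f"backups/{table_name}.json", Body=body)

backup_table("my-table", "my-backup-bucket")
```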
Are you working under the assumption that Data Pipeline keeps a server up forever? That is not the case.
For instance, if you have defined a Shell Activity, the server will terminate after the activity completes. (You may manually set termination protection.)
Since you only run a pipeline once a week, the costs are not high.
If you run a cron job on an EC2 instance, that instance needs to be up whenever you want to run the backup, and that could be a point of failure.
Incidentally, Amazon provides a Data Pipeline sample for exporting data from DynamoDB.
I just checked the pipeline cost page, and it says "For example, a pipeline that runs a daily job (a Low Frequency activity) on AWS to replicate an Amazon DynamoDB table to Amazon S3 would cost $0.60 per month". So I think I'm safe.

Logging high volume of impression data (50 million records/month)

We are currently logging impression data for several websites using MySQL and are seeking a more appropriate replacement for logging the high volume of traffic our sites now see.
By "high volume" I mean that we are logging about 50 million entries per month for this impression data. It is important to note that this table activity is almost exclusively write and only rarely read. (Different from this use-case on SO: Which NoSQL database for extremely high volumes of data). We have worked around some of the MySQL performance issues by partitioning the data by range and performing bulk inserts, but in the big picture, we shouldn't be using MySQL.
What we ultimately need in the MySQL database is aggregated data and I believe there are other technologies much better suited for the high-volume logging portion of this use-case. I have read about mongodb, HBase (with map reduce), Cassandra, and Apache Flume and I feel like I'm on the right track, but need some guidance on what technology (or combination) I should be looking at.
What I would like to know specifically is what platforms are best suited for high-volume logging and how to get an aggregated/reduced data set fed into MySQL on a daily basis.
Hive doesn't store data itself; it only lets you query "raw" data with a SQL-like language (HQL).
If your aggregated data is small enough to be stored in MySQL and that is the only use of your data, then HBase could be too much for you.
My suggestion is to use Hadoop (HDFS and MapReduce):
Create log files (text files) with the impression events.
Then move them into HDFS (an alternative is Kafka or Storm if you need a near-real-time solution).
Create a MapReduce job that reads and aggregates your logs, and in the reduce step use a DBOutputFormat to store the aggregated data into MySQL (a Python sketch of this aggregation step follows below).
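DBOutputFormat is part of the Java MapReduce API; if your team is more comfortable in Python, a rough equivalent of the aggregation step using Hadoop Streaming could look like this (the log layout and field index are assumptions), with the reducer output then bulk-loaded into MySQL:

```python
# Sketch of the aggregation step as a Hadoop Streaming job in Python.
# Log lines are assumed to look like "timestamp<TAB>site_id<TAB>..." -
# adjust the field index to your actual log format.
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        site_id = fields[1]
        print(f"{site_id}\t1")            # emit one count per impression

def reducer():
    current_key, count = None, 0
    for line in sys.stdin:                # Streaming sorts input by key
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

You would submit it with the Hadoop Streaming jar, roughly with "python job.py map" as the mapper and "python job.py reduce" as the reducer, and then load the resulting per-key totals into MySQL (e.g. with LOAD DATA INFILE).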
One approach could be to simply dump the raw impression logs into flat files. A daily batch job would then process these files with a MapReduce program. The aggregated MapReduce output could be stored in Hive or HBase.
Please let me know if you see any problems with this approach. The big data technology stack offers many options depending on the type of data and the way it needs to be aggregated.

How to synchronize market data frequently and show as a historical timeseries data

http://pubapi.cryptsy.com/api.php?method=marketdatav2
I would like to synchronize market data on a continuous basis (e.g. cryptsy and other exchanges). I would like to show latest buy/sell price from the respective orders from these exchanges on a regular basis as a historical time series.
What backend database should I use to store the retrieved data and render or plot any parameter from it as a historical time series?
I'd suggest you look at a database tuned for handling time series data. The one that springs to mind is InfluxDB. This question has a more general take on time series databases.
I think we need more detail about the requirement.
The question just says "it needs to sync time-series data". What is the scenario? What are the data source and destination?
Option 1.
If it is just a data synchronization issue between two databases, the easiest solution is the CouchDB family of NoSQL databases (CouchDB, Couchbase, Cloudant).
They are all based on CouchDB, and they provide data-center-level replication (XDCR), so you can replicate the data to another CouchDB in another data center, or even to a CouchDB on mobile devices.
I hope it will be useful to you.
Option 2.
Another approach is data integration: you can sync the data with an ETL batch job. A batch worker copies data to the destination periodically. This is the most common way to replicate data to another destination. There are a lot of tools that support ETL, like Pentaho ETL, Spring Integration, and Apache Camel.
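For example, a minimal batch worker that polls the ticker endpoint from the question and appends the latest buy/sell prices to a local table might look like this (the JSON field names are assumptions, so check the real payload structure before relying on them):

```python
# Sketch of a periodic ETL job: poll an exchange's public ticker endpoint
# and append the latest buy/sell prices to a local time-series table.
# The JSON field names are assumptions - adapt them to the API you use.
import json
import sqlite3
import time
import urllib.request

API_URL = "http://pubapi.cryptsy.com/api.php?method=marketdatav2"

conn = sqlite3.connect("marketdata.db")
conn.execute("""CREATE TABLE IF NOT EXISTS ticks (
                    ts      INTEGER,
                    market  TEXT,
                    buy     REAL,
                    sell    REAL)""")

def poll_once():
    with urllib.request.urlopen(API_URL, timeout=30) as resp:
        data = json.loads(resp.read())
    now = int(time.time())
    # Field names below are illustrative only.
    for name, market in data.get("return", {}).get("markets", {}).items():
        conn.execute("INSERT INTO ticks VALUES (?, ?, ?, ?)",
                     (now, name,
                      float(market["buyorders"][0]["price"]),
                      float(market["sellorders"][0]["price"])))
    conn.commit()

while True:              # or run one-shot from cron instead of looping
    poll_once()
    time.sleep(60)
```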
If you provide a more detailed scenario, I can help in more detail.
Enjoy
-Terry
I think MongoDB is a good choice. Here is why:
You can easily scale out, and thus store a tremendous amount of data. With a suitable shard key, you might even be able to position the shards close to the exchange they follow in order to improve speed, if that should become a concern.
Replica sets offer automatic failover, should availability become an issue.
Using the TTL feature, data can be deleted automatically once it expires, effectively creating a round-robin database (see the sketch below).
Both the aggregation framework and map/reduce will be helpful.
There are free classes at MongoDB University which will help you avoid the most common pitfalls.
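A minimal sketch of that TTL idea with pymongo (the database/collection names and the 30-day retention window are arbitrary choices, not requirements):

```python
# Sketch: a time-bounded collection of ticks using MongoDB's TTL index.
# Database/collection names and the 30-day retention are arbitrary choices.
import datetime

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
ticks = client.marketdata.ticks

# TTL index: MongoDB removes documents once 'ts' is older than
# expireAfterSeconds, giving the "round-robin" behaviour mentioned above.
ticks.create_index("ts", expireAfterSeconds=30 * 24 * 3600)

ticks.insert_one({
    "ts": datetime.datetime.utcnow(),
    "exchange": "cryptsy",
    "pair": "BTC/USD",
    "buy": 412.5,
    "sell": 413.1,
})
```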

Database for sequential data

I'm completely new to databases, so pardon the simplicity of the question. We have an embedded Linux system that needs to store data collected over a span of several hours. The data needs to be searchable sequentially and includes things like GPS and environmental data. It will be saved in a folder on a removable SSD and labeled as a "Mission". Several "Missions" can exist on a single SSD and should not be mixed together, because they need to be copied and saved off individually, at the user's discretion, to external media. Data will be saved as often as 10 times a second and needs to be very robust because of the potential for power outages.
The data needs to be searchable on the system it is created on, but also after the removable disk is taken to another system (also Linux) it needs to be loaded and used there. In the past we have used custom file formats to store the data, but it seems like a database might be the better option. How portable are databases like MySQL? Can a user easily remove a disk with a database on it and plug it into a new machine without too much effort? Our queries will mostly be time-based, because the user will be "playing" through the data after it is collected, at perhaps 10x the collection rate. Also, our base code is written in Qt (C++), so we would need to interact with the database from there.
I'd go with SQLite. It's small and light, and it stores all its data in one file. You can copy or move the file to another computer and read it there. Your data writer can simply create the file, empty, when it detects that the current SSD does not already have one.
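To make that concrete, here is a minimal sketch with Python's built-in sqlite3 module; the schema and file path are made up, and from Qt you would do the same thing through QSqlDatabase with the QSQLITE driver:

```python
# Sketch of the one-file-per-mission idea. Column choices and the file
# path are illustrative only.
import sqlite3
import time

db = sqlite3.connect("/media/ssd/mission_2024_06_01.db")
db.execute("PRAGMA journal_mode=WAL")       # better resilience to power loss
db.execute("""CREATE TABLE IF NOT EXISTS samples (
                  ts    REAL,               -- seconds since epoch
                  lat   REAL,
                  lon   REAL,
                  temp  REAL)""")
db.execute("CREATE INDEX IF NOT EXISTS idx_ts ON samples(ts)")

# Write a sample ~10x per second; commit in small batches to bound data loss.
db.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
           (time.time(), 59.3293, 18.0686, 21.5))
db.commit()

# Time-based playback: everything between two timestamps, in order.
rows = db.execute("SELECT * FROM samples WHERE ts BETWEEN ? AND ? ORDER BY ts",
                  (0, time.time())).fetchall()
db.close()
```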
It's also worth mentioning that SQLite undergoes testing at a level afforded to only a select few safety-critical pieces of software. The test suite, while partly auto-generated, is a staggering 100 million lines of code. It is not "lite" at all when it comes to robustness. I would trust SQLite more than a random self-made database implementation.
SQLite is used in certified avionics AFAIK.