I have a huge dataset in AWS S3 that I can't load onto my computer to train a neural network. I want to use PyTorch Lightning to train that NN on that dataset.
My question is, how can I use data modules (or any other tool from PyTorch Lightning) to load batches from that dataset, preprocess them, and feed them into the training loop?
Keep in mind that my filesystem is not big enough to hold the data, so I guess caching is out of the picture. However, it would be interesting to load several batches in parallel so that data loading is not a bottleneck.
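One way to approach this, as a minimal sketch (the bucket name, key prefix, and the decoding step are assumptions, not something PyTorch Lightning prescribes): an IterableDataset lists the objects under an S3 prefix with boto3, shards the keys across DataLoader workers, and streams each object on demand; a LightningDataModule wraps it so that several worker processes fetch and decode batches in parallel while nothing is cached on disk.

```python
import io

import boto3
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info
import pytorch_lightning as pl


class S3IterableDataset(IterableDataset):
    """Streams samples straight from S3 without materializing the dataset locally."""

    def __init__(self, bucket, prefix):
        self.bucket = bucket
        self.prefix = prefix

    def _keys(self):
        # List every object under the prefix (paginated, so it scales to huge datasets).
        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix):
            for obj in page.get("Contents", []):
                yield obj["Key"]

    def __iter__(self):
        s3 = boto3.client("s3")  # one client per worker process
        info = get_worker_info()
        for i, key in enumerate(self._keys()):
            # Shard keys across DataLoader workers so each object is read exactly once.
            if info is not None and i % info.num_workers != info.id:
                continue
            body = s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()
            # Placeholder decoding: replace with whatever matches your file format.
            yield torch.load(io.BytesIO(body))


class S3DataModule(pl.LightningDataModule):
    def __init__(self, bucket, prefix, batch_size=32, num_workers=4):
        super().__init__()
        self.dataset = S3IterableDataset(bucket, prefix)
        self.batch_size = batch_size
        self.num_workers = num_workers

    def train_dataloader(self):
        # num_workers > 0 means several batches are downloaded and decoded in parallel.
        return DataLoader(self.dataset, batch_size=self.batch_size,
                          num_workers=self.num_workers)
```

Training would then be the usual `Trainer().fit(model, datamodule=S3DataModule("my-bucket", "train/"))`, with the bucket and prefix again being placeholders.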
Standard S3 storage costs 2.3 cents per GB per month, but S3 Glacier Instant Retrieval costs 0.4 cents per GB per month. I'm wondering if this storage class would be a good fit for a personal web project that serves up recipe data in JSON format. (It's also amazing that the data can be stored on tape and yet be accessed in milliseconds, which makes it confusing why it behaves so differently from the regular Glacier storage class.)
I haven't seen any problems with horrible latency for customers: they fetch the JSON data once when visiting the website, and the frontend SPA then sorts and filters that JSON document as needed.
Is there an obvious reason not to use this storage class if my goal is to reduce my own costs?
I have Datadog log archives streaming to Azure Blob Storage: each archive is a single ~150 MB JSON file compressed into a ~15 MB .gz file, and one is generated every 5 minutes. I need to do some analytics on this data. What is the most efficient and cost-effective way to get it into a Delta lake?
From what I understand, the driver that unpacks this data can only run on a single-node Spark cluster, which will take a very long time and cost a lot of DBUs.
Has anyone done this successfully without breaking the bank?
> From what I understand, the driver that unpacks this data can only run on a single-node Spark cluster, which will take a very long time and cost a lot of DBUs.
Yes, that's the big downside of the gzip format: it is not splittable, so decompression cannot be distributed across all your workers and cores - the driver has to load a file in its entirety and decompress it in a single batch.
The only sensible workaround I've used myself is to give the driver only a few cores, but make them as powerful as possible. Since you are using Azure Blob Storage, I assume you are running Databricks on Azure as well; here you can find all the Azure VM types - just pick the one with the fastest cores.
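For the load into Delta itself, a rough sketch (the storage path and table location are assumptions): Spark reads the gzipped JSON archives directly, each .gz file being decompressed whole by a single process, and appends the result to a Delta table.

```python
# Rough sketch with hypothetical paths: read the gzipped Datadog JSON archives from
# Azure Blob Storage and append them to a Delta table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Spark infers gzip from the .gz extension; each file is decompressed whole.
raw = spark.read.json(
    "wasbs://datadog-archives@myaccount.blob.core.windows.net/*.json.gz"
)

(
    raw.write
    .format("delta")
    .mode("append")
    .save("/mnt/delta/datadog_logs")  # hypothetical Delta table location
)
```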
While implementing an NLP system, I wonder why CSV files are so often used to store text corpora in academia and in common Python examples (in particular, NLTK-based ones). I have personally run into issues using a system that generates a number of corpora automatically and accesses them later.
These are issues that come from CSV files:
- Difficult to automate backups
- Difficult to ensure availability
- Potential transaction race and thread accessing issues
- Difficult to distribute/shard over multiple servers
- Schema not clear or defined if the corpus becomes complicated
- Accessing via a filename is risky, since the file could be altered
- File Corruption possible
- Fine grained permissions not typically used for file-access
Issues from using MySQL or MongoDB:
- Initial setup: keeping a dedicated server running with the DB instance online
- Requires spending time creating and defining a Schema
Pros of CSV:
- Theoretically easier to automate zipping and unzipping of contents
- More familiar to some programmers
- Easier to transfer to another academic researcher, via FTP or even e-mail
Looking at multiple academic articles, even in more advanced NLP research, for example named entity recognition or statement extraction, CSV still seems to be the format of choice.
Are there other advantages to the CSV format, that make it so widely used? What should an Industry system use?
I will organize the answer into two parts:
Why CSV:
A dataset for an NLP task, be it classification or sequence annotation, basically requires two things per training instance in a corpus:
- Text (might be a single token, a sentence, or a document) to be annotated, and optionally pre-extracted features.
- The corresponding labels/tags.
Because of this simple tabular organization of data, which is consistent across different NLP problems, CSV is a natural choice. CSV is easy to learn, easy to parse, easy to serialize, and can accommodate different encodings and languages. CSV is easy to work with from Python (the dominant language for NLP), and there are excellent libraries like pandas that make it really easy to manipulate and re-organize the data.
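As a small illustration of that workflow (the file name and column names here are hypothetical):

```python
import pandas as pd

# A labeled corpus stored as CSV with two hypothetical columns: text, label.
corpus = pd.read_csv("corpus.csv", encoding="utf-8")

print(corpus["label"].value_counts())                       # class distribution at a glance
corpus["n_tokens"] = corpus["text"].str.split().str.len()   # quick derived feature

train = corpus.sample(frac=0.8, random_state=42)            # simple train/test split
test = corpus.drop(train.index)
```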
Why not a database:
A database is overkill, really. An NLP model is typically trained offline, i.e. you fit all the data at once into an ML/DL model. There are no concurrency issues; the only parallelism that exists during training is managed inside the GPU. There is no security issue during training either: you train the model on your own machine and only deploy the trained model to a server.
I'm having a Lambda run a query in MySQL and then save the query result to S3, where it will be used by another Lambda to do something else with all that data.
I'm running out of Lambda memory since the query result is too big and is being held in a variable before being sent to S3.
Do you have any ideas on how I can approach this?
As outlined in the Lambda limits documentation, the maximum memory is 3,008 MB and the maximum /tmp directory storage a Lambda can use is 512 MB.
While I do not know your exact use case, here is an example of streaming data to S3 in a memory-efficient way using Python: Memory-efficient large dataset streaming to S3
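In that spirit, a hedged sketch of one way to stream the rows (the host, credentials, table, and bucket below are all placeholders): a server-side MySQL cursor fetches rows incrementally, and smart_open writes them to S3 as a multipart upload, so the full result set never sits in a variable.

```python
import csv

import pymysql
import pymysql.cursors
from smart_open import open as s3_open

# Server-side (unbuffered) cursor: rows are fetched from MySQL incrementally.
conn = pymysql.connect(host="db-host", user="user", password="pw", database="mydb",
                       cursorclass=pymysql.cursors.SSCursor)

# smart_open streams the file to S3 as a multipart upload while it is written.
with conn.cursor() as cur, s3_open("s3://my-bucket/exports/result.csv", "w") as out:
    writer = csv.writer(out)
    cur.execute("SELECT id, name, created_at FROM big_table")  # hypothetical query
    for row in cur:           # rows arrive one at a time
        writer.writerow(row)  # and are flushed to S3 in chunks

conn.close()
```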
In my experience it is better to use tools designed around data extraction/loading from RDS to S3 than to build that tooling in Lambda. For example, you could use Data Pipeline to export MySQL data to Amazon S3. Data Pipeline can be configured with EC2 instances to handle larger data sets - much larger, and more efficiently, than a Lambda handler.
Is the data you're sending to S3 in CSV or JSON format? Because if it is, you can use an S3 feature called S3 Select, which queries that object for you given some SQL syntax and saves memory in your Lambda. Would this work for the use case? Here is a blog post where they show how to do this:
https://aws.amazon.com/blogs/aws/s3-glacier-select/
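For reference, a rough sketch of the boto3 call involved (the bucket, key, and columns are made up for the example): the Lambda receives only the rows matching the SQL expression instead of downloading the whole object.

```python
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-bucket",
    Key="exports/result.csv",
    ExpressionType="SQL",
    Expression="SELECT s.id, s.name FROM s3object s WHERE s.region = 'eu'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; only the matching records come back.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```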
I want to build a machine learning system (a Python program) around a large amount of historical trading data.
The trading company has an API for grabbing their historical and real-time data. The data volume is about 100 GB for historical data and about 200 MB for daily data.
Trading data is typical time-series data, like price, name, region, timeline, etc. The data can be retrieved as large files or stored in a relational DB.
So my question is: what is the best way to store this data on AWS, and what's the best way to add new data every day (e.g. through a cron job or an ETL job)? Possible solutions include storing it in a relational database, in a NoSQL database like DynamoDB or Redis, or in a file system read directly by the Python program. I just need a way to persist the data in AWS so multiple teams can grab it for research.
Also, since this is a research project, I don't want to spend too much time exploring new systems or emerging technologies. I know there are time-series databases like InfluxDB and the new Amazon Timestream, but considering the learning curve and the deadline, I'm not inclined to learn and use them for now.
I'm familiar with MySQL. If really needed, I can pick up a NoSQL option like Redis or DynamoDB.
Any advice? Many thanks!
If you want to use AWS EMR, then the simplest solution is probably just to run a daily job that dumps data into a file in S3. However, if you want something a little more SQL-flavored, you could load everything into Redshift.
If your goal is to make the data available in some form to other people, then you should definitely put it in S3. AWS has ETL and data migration tools that can move data from S3 to a variety of destinations, so the other people will not be restricted in how they use the data just because it is stored in S3.
On top of that, S3 is the cheapest (warm) storage option available in AWS, and for all practical purposes its throughput is unlimited. If you store the data in a SQL database, you significantly limit the rate at which it can be retrieved. If you store it in a NoSQL database, you may be able to support more traffic (maybe), but at significant cost.
Just to further illustrate my point, I recently did an experiment to test certain properties of one of the S3 APIs, and part of my experiment involved uploading ~100GB of data to S3 from an EC2 instance. I was able to upload all of that data in just a few minutes, and it cost next to nothing.
The only thing you need to decide is the format of your data files. You should talk to some of the other teams and find out whether JSON, CSV, or something else is preferred.
As for adding new data, I would set up a Lambda function that is triggered by a scheduled CloudWatch event. The Lambda function can get the data from your data source and put it into S3. The CloudWatch event trigger is cron-based, so it's easy to switch between hourly, daily, or whatever frequency meets your needs.
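A minimal sketch of that handler (the API endpoint, bucket, and key scheme are placeholders): the scheduled event fires, the function pulls the day's data from the trading API and writes it to S3, partitioned by date so it is easy for other teams to find.

```python
import datetime
import urllib.request

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    day = datetime.date.today().isoformat()
    # Hypothetical trading-data API endpoint.
    with urllib.request.urlopen(f"https://api.example-broker.com/daily?date={day}") as resp:
        payload = resp.read()
    key = f"daily/{day}.json"  # date-partitioned key scheme
    s3.put_object(
        Bucket="my-trading-data",  # hypothetical bucket
        Key=key,
        Body=payload,
        ContentType="application/json",
    )
    return {"written": key, "bytes": len(payload)}
```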