I've recently been researching data transfer methods that could replace my current inefficient setup.
So just to get started I will explain the issue that I am having with my current MySQL data transfer method...
I have a database that keeps track of product inventory levels in my warehouses; this stock data changes constantly and at a rapid rate.
A CSV file is generated by a cron job every 15 minutes; I then deliver this CSV file via FTP to multiple sites that we manage in order to update their inventory stock levels. These sites also use MySQL, and the data in the CSV file is imported with a script.
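For context, here is a minimal sketch of that 15-minute pipeline as I understand it (table names, hosts, and credentials are placeholders, not the real setup):

```python
#!/usr/bin/env python3
"""Rough sketch of the current 15-minute export: dump stock levels to CSV
and push the file to each remote site over FTP. Table, host, and credential
names are placeholders."""
import csv
import ftplib
import pymysql  # assumes the PyMySQL driver is installed

SITES = [("site1.example.com", "ftpuser", "ftppass"),
         ("site2.example.com", "ftpuser", "ftppass")]

def export_csv(path="stock_levels.csv"):
    conn = pymysql.connect(host="localhost", user="app", password="secret",
                           database="warehouse")
    try:
        with conn.cursor() as cur, open(path, "w", newline="") as f:
            cur.execute("SELECT sku, warehouse_id, qty FROM stock_levels")
            writer = csv.writer(f)
            writer.writerow(["sku", "warehouse_id", "qty"])
            writer.writerows(cur.fetchall())
    finally:
        conn.close()
    return path

def push_to_sites(path):
    for host, user, password in SITES:
        with ftplib.FTP(host, user, password) as ftp, open(path, "rb") as f:
            ftp.storbinary("STOR stock_levels.csv", f)

if __name__ == "__main__":  # run from cron every 15 minutes
    push_to_sites(export_csv())
```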
This method is very inefficient and doesn't propagate stock levels to our other databases as quickly as we would like.
My question is: are there any other methods to transfer MySQL data in real time between multiple databases, other than CSV files? I have been researching alternative methods but haven't come across anything useful as of yet.
Thanks
I want to build a machine learning system (a Python program) that uses a large amount of historical trading data.
The trading company has an API for retrieving their historical and real-time data. The data volume is about 100 GB for historical data and about 200 MB per day of new data.
Trading data is typical time-series data: price, name, region, timestamp, etc. It can be retrieved as large files or stored in a relational DB.
So my question is: what is the best way to store this data on AWS, and what's the best way to add new data every day (for example through a cron job or an ETL job)? Possible solutions include storing it in a relational database, in NoSQL databases like DynamoDB or Redis, or in a file system that the Python program reads directly. I just need to find a solution to persist the data in AWS so multiple teams can grab the data for research.
Also, since it's a research project, I don't want to spend too much time exploring new systems or emerging technologies. I know there are time-series databases like InfluxDB or the new Amazon Timestream. Considering the learning curve and the deadline, I'm not inclined to learn and use them for now.
I'm familiar with MySQL. If really needed, I can pick up NoSQL like Redis/DynamoDB.
Any advice? Many thanks!
If you want to use AWS EMR, then the simplest solution is probably just to run a daily job that dumps data into a file in S3. However, if you want to use something a little more SQL-ey, you could load everything into Redshift.
If your goal is to make it available in some form to other people, then you should definitely put the data in S3. AWS has ETL and data migration tools that can move data from S3 to a variety of destinations, so the other people will not be restricted in their use of the data just because of it being stored in S3.
On top of that, S3 is the cheapest (warm) storage option available in AWS, and for all practical purposes, its throughput is unlimited. If you store the data in a SQL database, you significantly limit the rate at which the data can be retrieved. If you store the data in a NoSQL database, you may be able to support more traffic (maybe), but it will come at significant cost.
Just to further illustrate my point, I recently did an experiment to test certain properties of one of the S3 APIs, and part of my experiment involved uploading ~100GB of data to S3 from an EC2 instance. I was able to upload all of that data in just a few minutes, and it cost next to nothing.
The only thing you need to decide is the format of your data files. You should talk to some of the other people and find out if JSON, CSV, or something else is preferred.
As for adding new data, I would set up a Lambda function that is triggered by a scheduled CloudWatch Events rule. The Lambda function can get the data from your data source and put it into S3. The CloudWatch event trigger is cron-based, so it's easy enough to switch between hourly, daily, or whatever frequency meets your needs.
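A minimal sketch of that Lambda, assuming a simple HTTP endpoint for the trading API and a placeholder bucket name (both are illustrative, not part of your setup):

```python
"""Sketch of the suggested Lambda: pull the day's data from the trading API
and drop it into S3. The API URL, bucket name, and key layout are placeholders;
wire the function to a scheduled CloudWatch Events (EventBridge) rule for the
cron-style trigger."""
import datetime
import urllib.request

import boto3

BUCKET = "my-trading-data"                 # placeholder bucket name
API_URL = "https://api.example.com/daily"  # placeholder data-source endpoint

s3 = boto3.client("s3")

def handler(event, context):
    today = datetime.date.today().isoformat()
    with urllib.request.urlopen(f"{API_URL}?date={today}") as resp:
        payload = resp.read()
    # Partition by date so researchers can pull just the days they need.
    s3.put_object(Bucket=BUCKET, Key=f"daily/{today}.csv", Body=payload)
    return {"stored": f"daily/{today}.csv", "bytes": len(payload)}
```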
I have a very large CSV file arriving every day, containing about 2 million log rows per customer. We have to develop an analytics tool that produces summaries for various GROUP BY aggregations of the CSV data.
We built it on MySQL (InnoDB), but it runs very slowly, even though we have applied proper indexing on the tables and the hardware is good.
Is MySQL capable of powering this type of analytical tool, or do I need to look at another database?
Each SQL SELECT query takes 15-20 seconds to return output from a single table.
I am assuming that the data you have is insert-only and that you are mostly looking to build dashboards that show some metrics to clients.
You can approach this problem in a different way. Instead of storing the CSV data directly in the SQL database, you can process the CSV first using Spark, Spring Batch, or Airflow, depending on your language options. Doing this lets you reduce the amount of data that you have to store.
Another approach you can consider is processing the CSVs and pushing the results to something like BigQuery or Redshift. These databases are designed to process and query large datasets.
To speed up queries, you could also create materialized views to build dashboards quickly. I would not recommend this, though, as it is not a very scalable approach.
I recommend that you process the data first, generate the metrics that are required, store those in SQL, and build dashboards on top of them instead of saving the raw data directly; a rough sketch of this idea follows.
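Here is a minimal sketch of that pre-aggregation step using pandas; the column names (customer_id, event_type, amount) and the summary table are invented for illustration:

```python
"""Sketch of the pre-aggregation idea: roll the raw CSV up to the metrics the
dashboards actually need before it ever reaches MySQL. Column names
(customer_id, event_type, amount) are made up for illustration."""
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://app:secret@localhost/analytics")

def load_daily_summary(csv_path: str) -> None:
    # Read the large CSV in chunks so memory stays bounded.
    chunks = pd.read_csv(csv_path, chunksize=500_000)
    partials = [
        chunk.groupby(["customer_id", "event_type"])["amount"]
             .agg(["count", "sum"])
        for chunk in chunks
    ]
    # Combine the per-chunk partials into one small summary table.
    summary = (pd.concat(partials)
                 .groupby(level=["customer_id", "event_type"]).sum()
                 .reset_index())
    # Store only the summary; dashboards query this instead of the raw logs.
    summary.to_sql("daily_summary", engine, if_exists="append", index=False)
```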
I'm building a cloud sync application which syncs a user's data across multiple devices. I am at a crossroads, deciding whether to store the data on the server as files or in a relational database. I am using Amazon Web Services and will use S3 for user files, or their database service if I choose to store the data in a table instead. The data I'm storing is the state of the application every ten seconds. This could be problematic to store in a database because the average number of rows per user would be about 100,000, and with my current user base of 20,000 people that's 2 billion rows right off the bat. Would I be better off storing that information in files? That would be about 100 files totaling 6 megabytes per user.
As discussed in the comments, I would store these as files.
S3 is perfectly suited to be a key/value store and if you're able to diff the changes and ensure that you aren't unnecessarily duplicating loads of data, the sync will be far easier to do by downloading the relevant files from S3 and syncing them client side.
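As a hedged sketch of that key/value layout, one object per user per state snapshot could look like the following; the bucket name and key scheme are assumptions, not a prescribed design:

```python
"""Sketch of an S3 key/value layout: one object per user per state snapshot,
keyed by user id and timestamp. Bucket name and key scheme are assumptions."""
import json
import time

import boto3

s3 = boto3.client("s3")
BUCKET = "app-sync-state"  # placeholder

def push_state(user_id: str, state: dict) -> str:
    key = f"{user_id}/{int(time.time())}.json"
    s3.put_object(Bucket=BUCKET, Key=key,
                  Body=json.dumps(state).encode("utf-8"),
                  ContentType="application/json")
    return key

def latest_state(user_id: str) -> dict:
    # Ten-digit Unix timestamps sort lexicographically for the foreseeable
    # future; ISO timestamps would make the ordering explicit.
    objs = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{user_id}/")["Contents"]
    newest = max(objs, key=lambda o: o["Key"])
    body = s3.get_object(Bucket=BUCKET, Key=newest["Key"])["Body"].read()
    return json.loads(body)
```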
You get a big cost saving from not having to operate a database server that can store tonnes of rows and stay up to serve them to the clients quickly.
My only real concern would be that the data in these files can be difficult to parse if you wanted to aggregate stats/data/info across multiple users as a backend or administrative view. You wouldn't be able to write simple SQL queries to sum up values etc, and would have to open the relevant files, process them with something like awk or regular expressions etc, and then compute the values that way.
You're likely doing that on the client side anyway for the specific files that relate to that user, so there's probably some overlap there!
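To illustrate that backend/administrative aggregation concern, a sum across users' objects would look roughly like this (the "items_synced" field and bucket name are invented for the example):

```python
"""Sketch of the backend aggregation mentioned above: instead of a SQL SUM,
walk the per-user objects in S3 and compute the value in code. The
"items_synced" field is invented for illustration."""
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "app-sync-state"  # same placeholder bucket as above

def total_items_synced() -> int:
    total = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            total += json.loads(body).get("items_synced", 0)
    return total
```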
Just looking for some advice on the best way to handle data importing via scheduled WebJobs.
I have 8 JSON files that are imported every 5 hours via an FTP client, deserialized into memory with a JSON serializer, and then the JSON objects are processed and inserted into Azure SQL using EF6. Each file is processed sequentially in a loop because I wanted to make sure that all data is inserted correctly; when I tried to use a Parallel.ForEach, some of the data was not being inserted into related tables. So if the WebJob fails I know there has been an error and we can run it again. The problem is that this now takes a long time to complete, nearly 2 hours, as we have a lot of data: each file has 500 locations, and each location has 11 days of 24-hour data.
Does anyone have any ideas on how to do this faster while ensuring that the data is always inserted correctly, or on how to handle any errors? I was looking at using Storage queues, but we may need to point to other databases in the future. Or could I use one WebJob per file, so 8 WebJobs each scheduled every 5 hours? I think there is a limit to the number of WebJobs I can run per day.
Or is there an alternative way of importing data into Azure SQL that can be scheduled?
Azure WebJobs (via the WebJobs SDK) can monitor and process blobs. There is no need to create a scheduled job: the SDK can monitor for new blobs and process them as they are created. You could break your processing up into smaller files and load them as they are created.
Azure Stream Analytics has similar capabilities.
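The WebJobs SDK itself is .NET, but the same blob-triggered pattern can be sketched with an Azure Functions blob trigger in Python (v2 programming model); the container name and connection setting below are placeholders, and the Azure SQL insert is only hinted at:

```python
"""Hedged sketch of the blob-trigger pattern using Azure Functions (Python v2
programming model). Container name and connection setting are placeholders;
the original solution uses the .NET WebJobs SDK and EF6 for the SQL insert."""
import json
import logging

import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="imports/{name}",            # container/path to watch
                  connection="AzureWebJobsStorage")  # storage connection setting
def process_import(blob: func.InputStream):
    records = json.loads(blob.read())
    logging.info("Picked up %s with %d records", blob.name, len(records))
    # ...insert the records into Azure SQL here (the original uses EF6 in C#)...
```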
I need to back up a few DynamoDB tables, which are not too big for now, to S3. However, these are tables another team uses and works on, not me. These backups need to happen once a week and will only be used to restore the DynamoDB tables in disastrous situations (so hopefully never).
I saw that there is a way to do this by setting up a Data Pipeline, which I'm guessing can be scheduled to do the job once a week. However, it seems like this would keep the pipeline open and start incurring charges. So I was wondering whether there is a significant cost difference between backing the tables up via the pipeline and keeping it open, versus creating something like a PowerShell script scheduled to run on an EC2 instance that already exists, which would manually create a JSON mapping file and upload it to S3.
Also, I guess another question is more of a practicality question: how difficult is it to back up DynamoDB tables to JSON format? It doesn't seem too hard, but I wasn't sure. Sorry if these questions are too general.
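For scale, a minimal sketch of the "manual export" idea in Python with boto3 (table and bucket names are placeholders); a paginated Scan like this is fine for small tables, but it does read the whole table's capacity:

```python
"""Sketch of a manual DynamoDB-to-S3 backup: scan the table, serialize the
items to JSON, and upload the dump to S3. Table and bucket names are
placeholders."""
import datetime
import json

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

def backup_table(table_name: str, bucket: str) -> str:
    table = dynamodb.Table(table_name)
    items, kwargs = [], {}
    while True:  # follow the Scan pagination until the whole table is read
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    key = f"backups/{table_name}/{datetime.date.today().isoformat()}.json"
    body = json.dumps(items, default=str)  # Decimal/binary values -> strings
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key
```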
Are you working under the assumption that Data Pipeline keeps the server up forever? That is not the case.
For instance, if you have defined a Shell Activity, the server will terminate after the activity completes. (You can manually set termination protection if you need the instance to stay up.)
Since you only run a pipeline once a week, the costs are not high.
If you run a cron job on an EC2 instance, that instance needs to be up when you want to run the backup, and that could be a point of failure.
Incidentally, Amazon provides a Data Pipeline sample showing how to export data from DynamoDB.
I just checked the pipeline cost page, and it says "For example, a pipeline that runs a daily job (a Low Frequency activity) on AWS to replicate an Amazon DynamoDB table to Amazon S3 would cost $0.60 per month". So I think I'm safe.