Data Importing using Azure Web Jobs to Azure SQL - json

Just looking for some advice on the best way to handle data importing via scheduled WebJobs.
I have 8 JSON files that are downloaded every 5 hours via an FTP client, deserialized into memory with a JSON serializer, and then processed and inserted into Azure SQL using EF6. The files are processed sequentially in a loop because I wanted to make sure all data is inserted correctly: when I tried Parallel.ForEach, some of the data was not inserted into related tables. If the WebJob fails, I know there has been an error and we can run it again. The problem is that this now takes a long time to complete, nearly 2 hours, as we have a lot of data: each file has 500 locations, and each location has 11 days of 24-hour data.
Does anyone have ideas on how to do this faster while ensuring the data is always inserted correctly and any errors are handled? I was looking at using Storage queues, but we may need to point to other databases in the future. Alternatively, could I use one WebJob per file, so 8 WebJobs each scheduled every 5 hours? I think there is a limit to the number of WebJobs I can run per day.
Or is there an alternative way of importing data into Azure SQL that can be scheduled?

Azure WebJobs (via the WebJobs SDK) can monitor and process blobs. There is no need to create a scheduled job: the SDK can watch for new blobs and process them as they are created. You could break up your processing into smaller files and load them as they are created.
Azure Stream Analytics has similar capabilities.
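To illustrate the "smaller files" idea, here is a minimal sketch of splitting one large JSON array into several smaller documents before they are uploaded. It assumes each file is a JSON array of location objects; the function names and the chunk size of 50 are hypothetical and not taken from the question.

```python
import json
from typing import Iterator, List


def chunk_locations(locations: List[dict], chunk_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most chunk_size locations."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(locations), chunk_size):
        yield locations[start:start + chunk_size]


def split_file_payload(payload: str, chunk_size: int = 50) -> List[str]:
    """Turn one JSON array of locations into several smaller JSON documents."""
    locations = json.loads(payload)  # assumed shape: a JSON array of objects
    return [json.dumps(chunk) for chunk in chunk_locations(locations, chunk_size)]
```

Each resulting document could then be written as its own blob, so a failure only requires reprocessing one small chunk rather than the whole file.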

Related

Converting JSON .gz files into Delta Tables

I have Datadog log archives streaming to Azure Blob storage. Each archive is a single 150MB JSON file compressed into a 15MB .gz file, and a new one is generated every 5 minutes. I need to do some analytics on this data. What is the most efficient and cost-effective way to get this data into Delta Lake?
From what I understand the driver that unpacks this data can only run on a single node spark cluster, which will take a very long time and cost a lot of DBU's.
Has anyone done this successfully without breaking the bank?
From what I understand the driver that unpacks this data can only run on a single node spark cluster, which will take a very long time and cost a lot of DBU's.
Yes, that's the big downside of the gzip format: it is not splittable and therefore cannot be distributed across all your workers and cores. The Driver has to load a file in its entirety and decompress it in a single batch. Topic related to this question.
The only sensible workaround I've used myself is to give the Driver only a few cores, but make them as powerful as possible. I assume that since you are using Azure Blob you are using Databricks on Azure as well; here you can find all Azure VM types, so you just have to pick the one with the fastest cores.
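The sequential nature of gzip can be seen in a plain Python reader. This is only a sketch, and it assumes the archive holds one JSON object per line (newline-delimited JSON), which the question does not confirm; the function name is hypothetical. The stream has to be decompressed from the start, so a single reader walks the whole file.

```python
import gzip
import json
from typing import Iterator


def iter_gzip_json_lines(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a gzip-compressed file."""
    # gzip.open in text mode decompresses incrementally, so the file is
    # never fully held in memory, but it is still read front to back.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because only one reader can consume the stream, per-core speed matters more than core count for this step, which is the point of the workaround above.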

Strategies for mass insertion in MySQL

My app would be consuming data from multiple APIs. This data can be either a single event or a batch of events. The data I am dealing with is click streams: my app would run a cron job every minute to fetch data from our partners using their APIs and eventually save everything to MySQL for detailed analysis. I am looking for ways to buffer this data somewhere and then batch insert it into MySQL.
For example, say I receive a batch of 1000 click events with one API call. What data structures can I use to buffer it in Redis, and then have a worker process consume this data and insert it into MySQL?
One simple approach would be to just fetch the data and store it to MySQL just like that. But since I am dealing with ad tech, where the size and the velocity of the data is always subject to change, this hardly seems like an approach to start with.
Oh, and the app would be built on top of Spring Boot and Tomcat.
Any help/discussion would be greatly appreciated. Thank you!
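For discussion, the buffer-then-batch-insert flow described above can be sketched as follows. This is only an illustration under loud assumptions: a `collections.deque` stands in for a Redis list used as a FIFO queue, `sqlite3` stands in for MySQL, and the `clicks` table and the `(partner, clicked_at)` event shape are hypothetical.

```python
import sqlite3
from collections import deque
from typing import Deque, List, Tuple

Event = Tuple[str, str]  # (partner, clicked_at); hypothetical event shape


def drain(buffer: Deque[Event], max_batch: int) -> List[Event]:
    """Pop up to max_batch events from the front of the buffer (FIFO)."""
    batch: List[Event] = []
    while buffer and len(batch) < max_batch:
        batch.append(buffer.popleft())
    return batch


def flush(conn: sqlite3.Connection, buffer: Deque[Event], max_batch: int = 500) -> int:
    """Insert one batch in a single transaction; return the rows written."""
    batch = drain(buffer, max_batch)
    if not batch:
        return 0
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO clicks (partner, clicked_at) VALUES (?, ?)", batch
            )
    except sqlite3.Error:
        # Put the events back at the front, in order, so nothing is lost.
        buffer.extendleft(reversed(batch))
        raise
    return len(batch)
```

A worker would call `flush` repeatedly until it returns 0; one multi-row insert per batch keeps the number of round trips to the database low.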

Google Cloud SQL Timeseries Statistics

I have a massive table that records events happening on our website. It has tens of millions of rows.
I've already tried adding indexing and other optimizations.
However, it's still very taxing on our server (even though we have quite a powerful one) and takes 20 seconds on some large graph/chart queries. So long in fact that our daemon intervenes to kill the queries often.
Currently we have a Google Compute instance on the frontend and a Google SQL instance on the backend.
So my question is this - is there some better way of storing and querying time series data using the Google Cloud?
I mean, do they have some specialist server or storage engine?
I need something I can connect to my php application.
Elasticsearch is awesome for time series data.
You can run it on compute engine, or they have a hosted version.
It is accessed via an HTTP JSON API, and there are several PHP clients (although I tend to make the API calls directly, as I find it easier to understand their query language that way).
https://www.elastic.co
They also have an automated graphing interface for time series data. It's called Kibana.
Enjoy!!
Update: I missed the important part of the question "using the Google Cloud?" My answer does not use any specialized GC services or infrastructure.
I have used Elasticsearch for storing events and profiling information from a web site. I even wrote a statsd backend that stores stat information in Elasticsearch.
After Elasticsearch changed Kibana from 3 to 4, I found the interface extremely bad for looking at stats. You can only chart 1 metric from each query, so if you want to chart time, average time, and average time of the 90% you must run 3 queries instead of 1 that returns 3 values. (The same issue existed in 3; version 4 just looked uglier and was more confusing to my users.)
My recommendation is to choose a time series database that is supported by Grafana, a time series charting front end. OpenTSDB stores information in a Hadoop-like format, so it will be able to scale out massively. Most of the others store events in a form similar to row-based information.
For capturing statistics, you can use either statsd or Riemann (or Riemann and then statsd). Riemann can add alerting and monitoring before events are sent to your stats database; statsd merely collates, averages, and flushes stats to a DB.
http://docs.grafana.org/
https://github.com/markkimsal/statsd-elasticsearch-backend
https://github.com/etsy/statsd
http://riemann.io/
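As a rough picture of the "collates, averages, and flushes" behaviour, here is a minimal sketch of an in-memory timing aggregator. It is not statsd's actual implementation; the class name, method names, and the `mean_90` key are hypothetical, and `mean_90` is computed here simply as the mean of the fastest 90% of samples.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


class TimingAggregator:
    """Collect timing samples per metric name and summarize them on flush."""

    def __init__(self) -> None:
        self._timings: Dict[str, List[float]] = defaultdict(list)

    def record(self, name: str, millis: float) -> None:
        """Collate: remember one timing sample for the named metric."""
        self._timings[name].append(millis)

    def flush(self) -> Dict[str, Dict[str, float]]:
        """Average: summarize every metric, then reset for the next interval."""
        summary: Dict[str, Dict[str, float]] = {}
        for name, values in self._timings.items():
            ordered = sorted(values)
            keep = max(1, (len(ordered) * 9) // 10)  # fastest 90% of samples
            summary[name] = {
                "count": len(ordered),
                "mean": mean(ordered),
                "mean_90": mean(ordered[:keep]),
            }
        self._timings.clear()
        return summary
```

One flush yields the count, the average, and the 90% average together, so a charting front end can plot all three from a single stored record per interval.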

Backing up DynamoDB tables via data pipeline vs manually creating a json for dynamoDB

I need to back up a few DynamoDB tables, which are not too big for now, to S3. However, these are tables that another team works on, not me. The backups need to happen once a week and will only be used to restore the DynamoDB tables in disastrous situations (so hopefully never).
I saw that there is a way to do this by setting up a data pipeline, which I'm guessing you can schedule to do the job once a week. However, it seems like this would keep the pipeline open and start incurring charges. So I was wondering whether there is a significant cost difference between backing the tables up via the pipeline and keeping it open, and creating something like a PowerShell script, scheduled to run on an EC2 instance that already exists, which would manually create a JSON mapping file and upload it to S3.
Another question is more about practicality: how difficult is it to back up DynamoDB tables to JSON format? It doesn't seem too hard, but I wasn't sure. Sorry if these questions are too general.
Are you working under the assumption that Data Pipeline keeps the server up forever? That is not the case.
For instance, if you have defined a Shell Activity, the server will terminate after the activity completes. (You may manually set termination protection.)
Since you only run a pipeline once a week, the costs are not high.
If you run a cron job on an EC2 instance, that instance needs to be up when you want to run the backup, and that could be a point of failure.
Incidentally, Amazon provides a Data Pipeline sample on how to export data from DynamoDB.
I just checked the pipeline cost page, and it says "For example, a pipeline that runs a daily job (a Low Frequency activity) on AWS to replicate an Amazon DynamoDB table to Amazon S3 would cost $0.60 per month". So I think I'm safe.

'Real Time' data methods MySQL

I've recently been researching data transfer methods that could replace my current inefficient setup.
So just to get started I will explain the issue that I am having with my current MySQL data transfer method...
I have a database that keeps track of product inventory levels within my warehouses, this stock data is constantly changing at a rapid rate.
A CSV file is created by a cron job every 15 minutes. I then deliver this CSV file via FTP to multiple sites that we manage, to update their inventory stock levels. These sites also use MySQL, and the data in the CSV file is added with a script.
This method is very inefficient and doesn't update our stock levels to our other databases as fast as we would like.
My question is: are there any methods other than CSV files for transferring MySQL data in real time between multiple databases? I have been researching alternatives but haven't come across anything useful as of yet.
Thanks