Strategies for mass insertion in MySQL

My app will be consuming data from multiple APIs. This data can be either a single event or a batch of events. The data I am dealing with is click streams: my app runs a cron job every minute to fetch data from our partners via their APIs and eventually saves everything to MySQL for detailed analysis. I am looking for ways to buffer this data somewhere and then batch-insert it into MySQL.
For example, say I receive a batch of 1000 click events in one API call. What data structures could I use to buffer them in Redis, and then have a worker process eventually consume the data and insert it into MySQL?
One simple approach would be to just fetch the data and write it straight to MySQL as it arrives. But since I am dealing with ad tech, where the size and velocity of the data are always subject to change, that hardly seems like the right approach to start with.
Oh, and the app would be built on top of Spring Boot and Tomcat.
Any help/discussion would be greatly appreciated. Thank you!
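For what it's worth, here is a minimal sketch of the buffer-and-drain pattern the question describes, assuming a Redis list as the buffer and a plain JDBC batch insert. The queue name, table, and connection details are all placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import redis.clients.jedis.Jedis;

public class ClickEventWorker {
    public static void main(String[] args) throws Exception {
        // Producers RPUSH raw click events (e.g. JSON strings) onto a
        // Redis list; this worker drains up to 1000 of them per run.
        try (Jedis redis = new Jedis("localhost");
             Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/analytics", "user", "password")) {
            conn.setAutoCommit(false);
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO clicks (payload) VALUES (?)");
            for (int i = 0; i < 1000; i++) {
                String event = redis.lpop("click_events");
                if (event == null) break;   // queue drained
                ps.setString(1, event);
                ps.addBatch();
            }
            ps.executeBatch();              // one round trip for the whole batch
            conn.commit();
        }
    }
}
```

In a Spring Boot app this would naturally live in a @Scheduled method rather than a main(), and with MySQL it helps to add rewriteBatchedStatements=true to the JDBC URL so the driver collapses the batch into a multi-row INSERT.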

Related

JavaFX program - How to keep TableView data synchronized among different client computers?

I'm new to Java and just started writing some JavaFX applications.
My current project is to write an application for a consulting company that stores a list of customers, adds them to a queue, and serves them one by one. There are a few staff members, and each will run a copy of the application on their PC.
What I've done so far:
create Customer.class to handle personal info and store it in a MySQL db
create Staff.class to handle staff info
create Service.class to handle the kinds of services available to customers
create Consultation.class to handle the info of a particular consultation, such as the date, the customer being served, which staff member is providing the service, the services offered, and the outcome
create an ObservableArrayList, store the data in the MySQL db, and display the data in a TableView on each client PC
What I want to do is this: after a staff member edits the data in the list, the changes should be updated automatically on the TableView of the other client PCs.
The possible solutions I can think of include:
Option 1
Program the application to query the db regularly for an update.
This method is simpler to implement, but I don't want to keep the MySQL server busy with non-stop queries from a number of clients, and I don't want any delay between a data write and the update on the other clients. There are more than 10 clients; if each client polls once a second, that means at least 10 queries per second and the server will never rest. I don't want to put that kind of stress on the server's hard disk.
Option 2
Program the application to broadcast a message every time it writes data to the db, and have the other clients query the database whenever they receive a broadcast. I'd prefer to do it this way, but I'm not familiar with network programming, which means I'll have to spend some time on it before I can continue the project.
Which of the above is the better choice? Is there another way to keep the TableView on the clients synchronized?
Before choosing, you may want to optimize both options.
Option 1 seems quite expensive as it has to query frequently, but you can optimize it by using a connection pool and fetching the data at a fixed interval (at minimum every 10 seconds or so).
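A rough sketch of that polling idea; Customer is the asker's existing class, while CustomerDao and loadAll() are placeholders for however the app already queries MySQL:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javafx.application.Platform;
import javafx.collections.ObservableList;

public class TablePoller {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // Re-query at a fixed interval and push the result onto the FX thread.
    public void start(ObservableList<Customer> table, CustomerDao dao) {
        scheduler.scheduleAtFixedRate(() -> {
            List<Customer> rows = dao.loadAll();          // off the FX thread
            Platform.runLater(() -> table.setAll(rows));  // UI update on FX thread
        }, 0, 10, TimeUnit.SECONDS);
    }
}
```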
Option 2 is much more convincing, as it applies a lazy-loading concept. You may consider looking into socket programming to notify all clients to fetch data.
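If you go the socket route, the server side can be as small as a fan-out loop. A minimal, untested sketch (the port number and the REFRESH message are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Tiny fan-out server: every client keeps one connection open; after a
// client writes to the db it sends "REFRESH", which is relayed to everyone.
public class RefreshBroadcastServer {
    private static final List<PrintWriter> clients = new CopyOnWriteArrayList<>();

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000)) {
            while (true) {
                Socket socket = server.accept();
                PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
                clients.add(out);
                new Thread(() -> {
                    try (BufferedReader in = new BufferedReader(
                            new InputStreamReader(socket.getInputStream()))) {
                        String line;
                        while ((line = in.readLine()) != null) {
                            if ("REFRESH".equals(line)) {
                                for (PrintWriter c : clients) c.println("REFRESH");
                            }
                        }
                    } catch (IOException ignored) {
                    } finally {
                        clients.remove(out);   // drop disconnected clients
                    }
                }).start();
            }
        }
    }
}
```

On each client, a background thread reads lines from this connection and, on REFRESH, re-queries the db and updates the ObservableArrayList inside Platform.runLater().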
It's quite hard to say which one is the better option; I would lean toward the first approach if your application inserts data frequently, and otherwise go with the second one.
An alternative solution: listening for data changes directly.
Here are some Q&As whose solutions may help you implement this:
How to implement a DB listener in Java
How to make a database listener with java?
How to listen to new DB records through java

A way to execute pipeline periodically from bounded source in Apache Beam

I have a pipeline taking data from a MySQL server and inserting it into a Datastore using the Dataflow Runner.
It works fine as a batch job executed once. The thing is that I want to get new data from the MySQL server into the Datastore in near real time, but JdbcIO gives bounded data as a source (as it is the result of a query), so my pipeline executes only once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline redo the work automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job, but I cannot find the CountingInput class. I thought that maybe it was replaced by the GenerateSequence class, but I don't really understand how to use it.
Any help would be welcome!
This is possible, and there are a couple of ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)), which will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy, because it takes a PCollection of query parameters and so can be triggered every time a new element appears in a PCollection.
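To make that concrete, here is a rough, untested sketch of the GenerateSequence-plus-JdbcIO.readAll() idea; the 30-second period, table and column names, and connection settings are all assumptions:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class PeriodicMySqlSync {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        // Unbounded source: one Long emitted every 30 seconds, forever.
        PCollection<Long> ticks = p.apply(
                GenerateSequence.from(0).withRate(1, Duration.standardSeconds(30)));

        // Turn each tick into a query parameter (here, a lower bound on the
        // insertion timestamp), then run the query once per tick.
        PCollection<String> rows = ticks
                .apply(ParDo.of(new DoFn<Long, Instant>() {
                    @ProcessElement
                    public void process(ProcessContext c) {
                        c.output(Instant.now().minus(Duration.standardSeconds(30)));
                    }
                }))
                .apply(JdbcIO.<Instant, String>readAll()
                        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                                "com.mysql.jdbc.Driver", "jdbc:mysql://host/db"))
                        .withQuery("SELECT payload FROM events WHERE created_at > ?")
                        .withParameterSetter((ts, stmt) -> stmt.setTimestamp(
                                1, new java.sql.Timestamp(ts.getMillis())))
                        .withRowMapper(rs -> rs.getString(1))
                        .withCoder(StringUtf8Coder.of()));

        // ... transform `rows` and write them with DatastoreIO here ...
        p.run();
    }
}
```

Note that the naive sliding timestamp window shown here can both miss and duplicate rows at the boundaries; keeping the last synced timestamp in a side table, as suggested above, is more robust.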
If the amount of data in MySQL is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (emitting records additionally to Pubsub) is also a very valid approach.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline redo the work automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
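A sketch of that producer-side change, using the Google Cloud Pub/Sub Java client; the project ID, topic name, and payload are placeholders:

```java
import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class RecordPublisher {
    public static void main(String[] args) throws Exception {
        // Publish the same payload that gets inserted into MySQL, so the
        // streaming Dataflow pipeline can consume it from the topic.
        Publisher publisher = Publisher.newBuilder(
                TopicName.of("my-project", "mysql-records")).build();
        try {
            PubsubMessage msg = PubsubMessage.newBuilder()
                    .setData(ByteString.copyFromUtf8("{\"id\": 42, \"value\": \"...\"}"))
                    .build();
            publisher.publish(msg).get();   // block until acknowledged
        } finally {
            publisher.shutdown();
        }
    }
}
```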

Data Importing using Azure Web Jobs to Azure SQL

Just looking for some advice on the best way to handle data importing via scheduled Web Jobs.
I have 8 JSON files that are imported every 5 hours via an FTP client, deserialized into memory, and then processed and inserted into Azure SQL using EF6. Each file is processed sequentially in a loop, because when I tried a Parallel ForEach some of the data was not being inserted correctly into related tables; this way, if the WebJob fails, I know there has been an error and we can run it again. The problem is that this now takes a long time to complete, nearly 2 hours, as we have a lot of data: each file has 500 locations, and each location has 11 days of 24-hour data.
Does anyone have any ideas on how to do this more quickly while ensuring that the data is always inserted correctly and errors are handled? I was looking at using Storage queues, but we may need to point to other databases in the future. Alternatively, could I use one web job per file, i.e. 8 web jobs each scheduled every 5 hours? I think there is a limit to the number of web jobs I can run per day.
Or is there an alternative, schedulable way of importing data into Azure SQL?
Azure Web Jobs (via the Web Jobs SDK) can monitor and process BLOBs, so there is no need to create a scheduled job: the SDK can watch for new BLOBs and process them as they are created. You could break your processing up into smaller files and load them as they arrive.
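The Web Jobs SDK blob-trigger samples are C#; purely as an illustration of the same react-to-new-blobs pattern, here is a sketch using the Azure Functions Java library instead. The container path and connection setting are placeholders:

```java
import com.microsoft.azure.functions.ExecutionContext;
import com.microsoft.azure.functions.annotation.BindingName;
import com.microsoft.azure.functions.annotation.BlobTrigger;
import com.microsoft.azure.functions.annotation.FunctionName;

public class ImportBlobFunction {
    // Fires once per new blob in the "imports" container; split the 8 big
    // files into smaller ones and each upload triggers its own invocation.
    @FunctionName("ProcessImportFile")
    public void run(
            @BlobTrigger(name = "content", path = "imports/{name}",
                    dataType = "binary", connection = "AzureWebJobsStorage")
            byte[] content,
            @BindingName("name") String name,
            ExecutionContext context) {
        context.getLogger().info(
                "Processing " + name + " (" + content.length + " bytes)");
        // ... deserialize the JSON and insert into Azure SQL here ...
    }
}
```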
Azure Stream Analytics has similar capabilities.

Google Cloud SQL Timeseries Statistics

I have a massive table that records events happening on our website. It has tens of millions of rows.
I've already tried adding indexing and other optimizations.
However, it's still very taxing on our server (even though we have quite a powerful one) and takes 20 seconds on some large graph/chart queries, so long in fact that our daemon often intervenes to kill them.
Currently we have a Google Compute Engine instance on the frontend and a Google Cloud SQL instance on the backend.
So my question is this: is there some better way of storing and querying time-series data on Google Cloud?
I mean, do they have some specialist server or storage engine?
I need something I can connect to my PHP application.
Elasticsearch is awesome for time series data.
You can run it on Compute Engine, or they have a hosted version.
It is accessed via an HTTP JSON API, and there are several PHP clients (although I tend to make the API calls directly, as I find it better to understand the query language that way).
https://www.elastic.co
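As an illustration of those direct API calls, here is a sketch (in Java) that runs a date_histogram aggregation over the raw HTTP JSON API; the index name clicks and the timestamp field are assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EventsPerHourQuery {
    public static void main(String[] args) throws Exception {
        // Bucket the last 7 days of events into 1-hour counts server-side,
        // instead of scanning tens of millions of rows per chart.
        String body = "{"
                + "\"size\": 0,"
                + "\"query\": {\"range\": {\"timestamp\": {\"gte\": \"now-7d\"}}},"
                + "\"aggs\": {\"per_hour\": {\"date_histogram\": "
                + "{\"field\": \"timestamp\", \"interval\": \"1h\"}}}"
                + "}";
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:9200/clicks/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());  // hourly buckets, ready to chart
    }
}
```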
They also have an automated graphing interface for time series data. It's called Kibana.
Enjoy!!
Update: I missed the important part of the question "using the Google Cloud?" My answer does not use any specialized GC services or infrastructure.
I have used Elasticsearch for storing events and profiling information from a web site. I even wrote a statsd backend that stores stat information in Elasticsearch.
After Elastic moved Kibana from 3 to 4, I found the interface extremely bad for looking at stats. You can only chart one metric from each query, so if you want to chart time, average time, and average time at the 90th percentile, you must make 3 queries instead of 1 that returns 3 values. (The same issue existed in 3; version 4 just looked uglier and was more confusing to my users.)
My recommendation is to choose a time-series database that is supported by Grafana, a time-series charting front end. OpenTSDB stores information in a Hadoop-like format, so it will be able to scale out massively. Most of the others store events as row-based information.
For capturing statistics, you can use either statsd or Riemann (or Riemann feeding into statsd). Riemann can add alerting and monitoring before events are sent to your stats database; statsd merely collates, averages, and flushes stats to a DB.
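statsd's wire protocol is simple enough to show inline: a counter increment is a single UDP datagram of the form <metric>:<value>|c. A minimal sketch, assuming the usual localhost:8125 defaults:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class StatsdClient {
    public static void main(String[] args) throws Exception {
        // "page.views:1|c" increments the counter "page.views" by 1;
        // statsd aggregates these and flushes them to the backing DB.
        byte[] data = "page.views:1|c".getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.send(new DatagramPacket(
                    data, data.length, InetAddress.getByName("localhost"), 8125));
        }
    }
}
```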
http://docs.grafana.org/
https://github.com/markkimsal/statsd-elasticsearch-backend
https://github.com/etsy/statsd
http://riemann.io/

Technology stack for a multiple queue system

I'll describe the application I'm trying to build and the technology stack I'm currently considering, so I can get your opinion.
Users should be able to work on a list of tasks. These tasks come from an API with all the information about them: id, image URLs, description, etc. The API is only available in one datacenter, so to avoid the delay (for example, in China), the tasks are stored in a queue.
So you'll have different queues depending on your country, and once you finish with your task it will be sent to another queue, which will later write this information back to the original datacenter.
The list of tasks is quite large; that's why there is an API call to get the tasks (~10k rows), store them in a queue, and let users work on them from the queue for their country.
For this system, where you can have around 100 queues, I was thinking of Redis to manage the task-list requests (e.g., get 5k rows from the China queue, write 500 rows to the write queue, etc.).
The API responses come as a list of JSON objects, and these 10k rows, for example, need to be stored somewhere. Since you need to be able to filter within this queue, MySQL isn't an option unless I store every field of the JSON object as a new row. My first thought was a NoSQL DB, but I wasn't too happy with MongoDB in the past, and an API response doesn't change much. Since I also need relational tables for other things, I was thinking of PostgreSQL: it's a relational database, and it gives you the ability to store JSON and filter by it.
What do you think? Ask me if something isn't clear.
You can use the HStore extension in PostgreSQL to store JSON, or dynamic columns in MariaDB (a MySQL clone).
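The question's PostgreSQL idea can also be done with a plain jsonb column. A small JDBC sketch; the table layout and the country field are made-up examples:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JsonTaskFilter {
    public static void main(String[] args) throws Exception {
        // Assumed schema: CREATE TABLE tasks (id SERIAL PRIMARY KEY, payload JSONB);
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/tasks_db", "user", "password");
             // ->> extracts a JSON field as text, so it can be filtered directly.
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT payload FROM tasks WHERE payload ->> 'country' = ?")) {
            ps.setString(1, "CN");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}
```

A GIN index on the payload column keeps such filters from degenerating into sequential scans as the queue grows.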
If you can move your persistence stack to Java, then many interesting options become available: MapDB (but it requires memory and its API is changing rapidly), Persistit, or MVStore (the engine behind H2).
All of these would let you store JSON with decent performance. I suggest you use a full-text search engine like Lucene to avoid searching JSON content in a slow way.