My project will be deployed in multiple cities, and it uses a simple request/response policy: data is loaded from MySQL into Redis when the project boots, read requests get data from Redis, and write requests put data into Redis and MySQL at the same time.
However, I want the different deployed instances to share the same data as soon as possible. I used MySQL to synchronise the data, but it works badly because of the poor network: data written in one city often cannot be read from another city for hours.
I want your suggestions. Many thanks.
Related
I am running a couple of crawlers that produce millions of datasets per day. The bottleneck is the latency between the spiders and the remote database. If the distance between the spider server and the database is too large, the latency slows the crawler down to the point where it can no longer complete the datasets needed for a day.
Searching for a solution, I came upon Redis, with the idea of installing Redis on the spider server, where it would temporarily store the collected data with low latency, and then somehow push that data on to MySQL.
The setup is like this until now:
About 40 spiders running on multiple instances feed one central remote MySQL 8 server on a dedicated machine over TCP/IP.
Each spider writes different datasets; one kind of spider gets positions and prices of search results, where one page has 100 results and generates around 200-300 inserts. The delay between requests/pages is about 2-10 s.
The latter is the problem, as the spider yields every position within that page and performs a remote insert within a transaction, maybe even a connect per insert (not sure at the moment).
This currently only works because the spiders and the remote MySQL server are close (same data center) with ping times of 0.0x ms; it does not work with ping times of 50 ms, as the spiders cannot write fast enough.
Is Redis or maybe DataMQ a valid approach to solve the problem, or are there other recommended ways of doing this?
Did you mean you have installed a Redis server on each spider server?
Actually, that is not a good solution for your case. But if you have already done this and still want to use MySQL to persist your data, a cron job on each server is an option.
You can create a cron job on each spider server (based on your dataset and your needs, you can choose a daily or hourly sync job) and write a data-transfer script that scans Redis and transfers the data to MySQL tables.
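A minimal sketch of such a transfer script, assuming the spiders push JSON rows onto a Redis list (the queue name, column names, and batch size below are all assumptions, not from the question). The key point is batching: each cron run should issue a few bulk INSERTs rather than thousands of single-row round trips.

```python
import json

BATCH_SIZE = 500  # rows per bulk INSERT; tune to your transaction/network limits

def drain_batches(pop, batch_size=BATCH_SIZE):
    """Pop JSON rows from a queue until it is empty, yielding lists of row tuples."""
    batch = []
    while True:
        raw = pop()  # with redis-py this would be r.lpop("spider:rows")
        if raw is None:
            break
        row = json.loads(raw)
        batch.append((row["position"], row["price"], row["ts"]))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Simulated Redis list standing in for lpop(); in the real cron job you would
# connect with redis.Redis() and a MySQL client, then for each yielded batch run
#   cursor.executemany("INSERT INTO results (position, price, ts)"
#                      " VALUES (%s, %s, %s)", batch)
# and commit once per batch.
queue = [json.dumps({"position": i, "price": 9.99, "ts": "2024-01-01"})
         for i in range(1200)]

def pop_or_none():
    return queue.pop() if queue else None

batches = list(drain_batches(pop_or_none))
print([len(b) for b in batches])  # 1200 rows become three bulk inserts
```

With 2-10 s between pages, the spiders never block on the 50 ms link; only the periodic transfer pays the WAN latency, once per batch.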
I recommend using MongoDB instead of MySQL to store data
I need to create a system of local web servers on Raspberry Pi 4 running Laravel for API calls, WebSockets, etc. An RPi will be installed at each of multiple customer sites.
For this project I want the ability to save/sync the database to a remote server (when the local system is connected to the internet).
Multiple local databases => one customer-based remote database
The question is how to synchronize the databases, properly identify each customer's data, and render it in a mutualised remote dashboard.
My first thought was to set a customer_id or a team_id on each table, but it seems dirty.
The other way is to create multiple databases on the remote server for the synchronization, plus one extra database that stores customer IDs and database connection information...
Has someone already experimented with something like that? Is there a reliable and clean way to do this?
You refer to locale but I am assuming you mean local.
From what you have said, you have two options at the central site. The central database can either store information from the remote databases in a single table with an additional column that indicates which remote site it's from, or you can set up a separate table (or database) for each remote site.
How do you want to use the data?
If you only ever want to work with the data from one remote site at a time, it doesn't really matter - in both scenarios you need to identify what data you want to work with and build your SQL statement either to filter by the appropriate column or to point at the appropriate table(s).
If you want to work on data from multiple remote sites at the same time, then using different tables requires UNION queries to extract the data, and this is unlikely to scale well. In that case you would be better off using a column to mark each record with the remote site it references.
I recommend that you consider using UUIDs as primary keys - key collisions may not be an issue in your scenario, but if they become one, trying to alter the design retrospectively is likely to be quite a bit of work.
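For illustration, a quick check in Python (the question's stack is PHP/Laravel, but the idea is language-agnostic): random UUIDs can be minted offline on each Pi with no central coordination, and stored in MySQL as CHAR(36) or, more compactly, BINARY(16).

```python
import uuid

# Each site mints random 128-bit IDs offline; no central sequence is needed,
# and collisions between independent sites are vanishingly unlikely.
record_id = uuid.uuid4()
print(len(str(record_id)))  # 36: the canonical text form, fits a CHAR(36) column

a, b = uuid.uuid4(), uuid.uuid4()
print(a == b)               # False: independently generated keys do not clash
```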
You also asked how to synchronize the databases. That will depend on what type of connection you have between the sites and the capabilities of your software, but typically the local system would periodically talk to a web service at the central site. Assuming you are collecting sensor data or some such, the dialogue would be something like:
Client - Hello Server, my last sensor reading is timestamped xxxx
Server - Hello Client, [ send me sensor readings from yyyy | I don't need any data ]
You can include things like a signature check (for example, an MD5 sum of the records within a time period) if you want to, but that may be overkill.
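One way to implement that signature check, sketched in Python (the field names and window encoding are assumptions): both sides hash a canonical encoding of the rows in a time window and compare digests, instead of re-shipping the rows themselves.

```python
import hashlib
import json

def window_signature(rows):
    """MD5 over a canonical JSON encoding of the rows in one time window."""
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(payload.encode()).hexdigest()

local = [{"ts": "2024-01-01T10:00", "value": 21.5},
         {"ts": "2024-01-01T10:05", "value": 21.7}]
central = [dict(r) for r in local]

# Matching signatures mean the window needs no re-send:
print(window_signature(local) == window_signature(central))  # True
central[1]["value"] = 21.8
print(window_signature(local) == window_signature(central))  # False
```

Canonical encoding (sorted keys, fixed separators) matters: both ends must serialize identically or the digests will never match.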
I want to build a machine learning system (a Python program) on a large amount of historical trading data.
The trading company has an API to grab their historical data and real-time data. The data volume is about 100 GB for historical data and about 200 MB per day.
Trading data is typical time-series data: price, name, region, timestamp, etc. The data can be retrieved as large files or stored in a relational DB.
So my question is: what is the best way to store this data on AWS, and what's the best way to add new data every day (e.g. through a cron job or an ETL job)? Possible solutions include storing it in a relational database like MySQL, in NoSQL databases like DynamoDB or Redis, or in a file system where the Python program reads it directly. I just need a way to persist the data in AWS so that multiple teams can grab it for research.
Also, since it's a research project, I don't want to spend too much time exploring new systems or emerging technologies. I know there are time-series databases like InfluxDB or the new Amazon Timestream, but considering the learning curve and deadline requirements, I'm not inclined to learn and use them for now.
I'm familiar with MySQL. If really needed, I can pick up NoSQL, like Redis/DynamoDB.
Any advice? Many thanks!
If you want to use AWS EMR, then the simplest solution is probably just to run a daily job that dumps data into a file in S3. However, if you want to use something a little more SQL-ey, you could load everything into Redshift.
If your goal is to make it available in some form to other people, then you should definitely put the data in S3. AWS has ETL and data migration tools that can move data from S3 to a variety of destinations, so the other people will not be restricted in their use of the data just because of it being stored in S3.
On top of that, S3 is the cheapest (warm) storage option available in AWS, and for all practical purposes its throughput is unlimited. If you store the data in a SQL database, you significantly limit the rate at which the data can be retrieved. If you store the data in a NoSQL database, you may be able to support more traffic (maybe), but at significant cost.
Just to further illustrate my point, I recently did an experiment to test certain properties of one of the S3 APIs, and part of my experiment involved uploading ~100GB of data to S3 from an EC2 instance. I was able to upload all of that data in just a few minutes, and it cost next to nothing.
The only thing you need to decide is the format of your data files. You should talk to some of the other people and find out whether JSON, CSV, or something else is preferred.
As for adding new data, I would set up a lambda function that is triggered by a CloudWatch event. The lambda function can get the data from your data source and put it into S3. The CloudWatch event trigger is cron based, so it’s easy enough to switch between hourly, daily, or whatever frequency meets your needs.
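A sketch of that Lambda handler in Python (the bucket name, key layout, and CSV format below are placeholder assumptions): the scheduled CloudWatch/EventBridge event delivers an ISO `time` field, which the handler turns into a date-partitioned S3 key.

```python
import datetime

def object_key(run_time, prefix="trading/daily"):
    """Date-partitioned S3 key for one day's pull, e.g. trading/daily/2024/01/15.csv."""
    return f"{prefix}/{run_time:%Y/%m/%d}.csv"

def handler(event, context):
    # In the real Lambda you would fetch the day's data from the vendor API
    # and then call
    #   boto3.client("s3").put_object(Bucket="my-research-data",
    #                                 Key=key, Body=payload)
    # where "my-research-data" is a placeholder bucket name.
    run_time = datetime.datetime.fromisoformat(event["time"].rstrip("Z"))
    return object_key(run_time)

# A scheduled CloudWatch/EventBridge event carries an ISO-8601 "time" field:
print(handler({"time": "2024-01-15T06:00:00Z"}, None))
```

Partitioning keys by year/month/day keeps daily appends cheap and lets tools like Athena or EMR prune by date later.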
For a project we are working with an external partner, and we need access to their MySQL database. The problem is, they can't grant that: their database is hosted in a managed environment where they don't have many configuration options, and they don't want to give us access to all of their data. So the solution they came up with is the FEDERATED storage engine.
We now have one table for each table of their database. The problem is that the amount of data we get is huge and will increase even further. That means a lot of inserts are performed on our database. The optimal solution for us would be to intercept all incoming MySQL traffic, process it, and then store it in bulk. We have also thought about using something like Redis to store the data.
Additionally, we plan to get more data from different partners, who will potentially provide the data in different ways, so using Redis would allow us to have all our data in one place.
Copying the data to Redis after it is stored in the MySQL database is not an option: we just can't handle that many inserts, and we need the data as fast as possible.
TL;DR
Is there a way to pretend to be a MySQL server so that we can directly process the data received via the FEDERATED storage engine?
We have also thought about using the BLACKHOLE engine in combination with binary logging on our side, so that incoming data would only be written to the binary log and wouldn't be stored in the database. But then performance would still be limited by disk I/O.
I have a dashboard application with a PHP backend and a JavaScript frontend. Data is read from multiple sources, and I have access to the databases of all the sources.
While designing the application, is it a good idea to store remote data locally instead of hitting the remote databases every time the application gets a request?
Store locally? The reason being that the data is not live: I can write a cron job to update the data every 5 minutes in the background, and the application will always read from the local DB, giving faster load times.
Read from remote every time? Since I have direct database access to all these remote DBs, I don't notice any performance gain from storing data locally over fetching from remote every time.
Which approach scales better?
What you're describing is called "caching." It's a common optimization.
Fetching data remotely is much more expensive than getting it out of a local cache.
You should learn the Latency Numbers Every Programmer Should Know.
The tricky part of caching is knowing when you need to discard the local cached copy of data and re-fetch it from the remote database. This is a hard problem with no single answer.
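A time-based (TTL) policy is the simplest answer, and it matches the 5-minute cron idea from the question. A minimal read-through sketch in Python (class and field names are illustrative, not from the question):

```python
import time

class TTLCache:
    """Tiny read-through cache: entries expire after ttl seconds."""
    def __init__(self, fetch, ttl=300.0):  # 300 s matches a 5-minute refresh
        self.fetch = fetch                 # callable that hits the remote DB
        self.ttl = ttl
        self._store = {}                   # key -> (value, fetched_at)

    def get(self, key):
        hit = self._store.get(key)
        if hit and time.monotonic() - hit[1] < self.ttl:
            return hit[0]                  # fresh: serve locally
        value = self.fetch(key)            # stale or missing: go remote
        self._store[key] = (value, time.monotonic())
        return value

calls = []
def remote_fetch(key):                     # stands in for the remote DB query
    calls.append(key)
    return f"row-for-{key}"

cache = TTLCache(remote_fetch, ttl=300.0)
print(cache.get("user:1"))  # first read goes remote
print(cache.get("user:1"))  # second read is served from the cache
print(len(calls))           # only 1 remote round trip, not 2
```

The trade-off is staleness: with a 5-minute TTL, the dashboard may show data up to 5 minutes old, which the question already accepts ("the data is not live").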
There's an old joke attributed to Phil Karlton:
“There are only two hard things in Computer Science: cache invalidation and naming things.”