I'm building a cloud sync application that syncs a user's data across multiple devices. I'm at a crossroads, deciding whether to store the data on the server as files or in a relational database. I'm using Amazon Web Services: S3 if I store user files, or one of its database services if I store the data in tables instead.

The data I'm storing is the state of the application every ten seconds. That could be problematic in a database: the average user would generate about 100,000 rows, and with my current user base of 20,000 people that's 2 billion rows right off the bat. Would I be better off storing that information in files? It would come to roughly 100 files totaling 6 megabytes per user.
As discussed in the comments, I would store these as files.
S3 is perfectly suited to being a key/value store, and if you're able to diff the changes to ensure that you aren't unnecessarily duplicating loads of data, the sync will be far easier to do by downloading the relevant files from S3 and syncing them client side.
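As a rough sketch of what that could look like, assuming boto3, a hypothetical bucket name, and one object per user per state chunk (with S3 ETags standing in for a real diffing scheme):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "app-sync-state"  # hypothetical bucket name

    def upload_state_chunk(user_id, chunk_name, data: bytes):
        # One object per user per chunk, e.g. "user-123/2024-05-01.json"
        s3.put_object(Bucket=BUCKET, Key=f"{user_id}/{chunk_name}", Body=data)

    def download_changed_chunks(user_id, known_etags):
        # List the user's objects and fetch only those whose ETag differs
        # from what the client already has -- a cheap file-level "diff".
        changed = {}
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=f"{user_id}/"):
            for obj in page.get("Contents", []):
                key, etag = obj["Key"], obj["ETag"]
                if known_etags.get(key) != etag:
                    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
                    changed[key] = (etag, body)
        return changed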
You also get a big cost saving from not having to operate a database server that can store tonnes of rows and stay up to serve them to the clients quickly.
My only real concern is that the data in these files can be difficult to parse if you ever want to aggregate stats/data across multiple users for a backend or administrative view. You wouldn't be able to write simple SQL queries to sum up values; you'd have to open the relevant files, process them with something like awk or regular expressions, and compute the values that way.
You're likely doing that on the client side anyway for the specific files that relate to that user, though, so there's probably some overlap there!
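That backend processing is straightforward, just more manual than SQL. A minimal sketch, assuming the state files are JSON snapshots already pulled down under a per-user directory (paths and field names are made up):

    import json
    from pathlib import Path

    # Hypothetical layout: ./state/<user_id>/*.json, each file a list of
    # snapshots with a numeric "value" field we want to sum across all users.
    def total_value_across_users(root="state"):
        total = 0.0
        for path in Path(root).glob("*/*.json"):
            with open(path) as f:
                for snapshot in json.load(f):
                    total += snapshot.get("value", 0)
        return total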
Related
I have a video surveillance project running on cloud infrastructure and using a MySQL database.
We are now integrating some artificial intelligence into our project, including face recognition, plate recognition, tag search, etc., which implies a huge amount of data every day.
All the photos, and the images derived from them by image-processing algorithms, are stored in cloud storage, but their references and tags are stored in the database.
I have been thinking about the best way to integrate this: do I have to stick with MySQL, or use another system? The different options I thought about are:
1- Use another database, MongoDB, to store the photo references and tags. This will cost me another database server, plus the work of integrating a new database system alongside the existing MySQL server.
2- Use Elasticsearch to retrieve data and perform tag searching. This still leaves the question of whether MySQL can store this amount of data.
3- Stick with MySQL alone, but will the user experience be impacted?
Would you guide me to the best option, or give me another proposal?
EDIT:
For more information:
The physical pictures are stored in cloud storage; only the URLs are stored in the database.
In the database we will store the picture's metadata: id, client id, URL, tags, creation date, etc.
Operations will mostly be SELECTs based on different criteria, plus searches by tag.
How big is the data?
Imagine a camera placed outdoors in the street; each time it detects a face, it sends an image.
Now imagine thousands of cameras doing the same. We are talking about millions of images per client.
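To make the intended layout concrete, here is a rough sketch of the kind of schema I have in mind (table and column names are only illustrative; I'm showing it through Python's mysql-connector just for readability):

    import mysql.connector  # assumes the mysql-connector-python package

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="...",  # placeholder credentials
                                   database="surveillance")
    cur = conn.cursor()

    # One row per image; a child table keeps tags indexable for searches.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS images (
            id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
            client_id  INT NOT NULL,
            url        VARCHAR(2048) NOT NULL,
            created_at DATETIME NOT NULL,
            INDEX idx_client_date (client_id, created_at)
        )""")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS image_tags (
            image_id BIGINT UNSIGNED NOT NULL,
            tag      VARCHAR(64) NOT NULL,
            PRIMARY KEY (image_id, tag),
            INDEX idx_tag (tag)
        )""")

    # Typical operation: all images for a client carrying a given tag.
    cur.execute("""
        SELECT i.id, i.url, i.created_at
        FROM images i JOIN image_tags t ON t.image_id = i.id
        WHERE i.client_id = %s AND t.tag = %s
        ORDER BY i.created_at DESC LIMIT 100""", (42, "face"))
    rows = cur.fetchall()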
MySQL can handle billions of rows. You have not provided enough other information to comment on the rest of your questions.
Large blobs (images, videos, etc.) are probably best handled by some large, cheap storage. And then, as you say, a URL to the blob would be stored in the database.
How many rows? How frequently are you inserting? What are some desired SELECT statements? Is it mostly just writing to the database, or will you have large, complex queries?
Our employees do work for clients, and the clients send us files containing information that we turn into performance metrics (we do not have direct access to this information; it has to be sent by the clients). These files are normally .csv or .xlsx, so I typically read them with pandas and output a much cleaner, smaller file.
1) Some files contain call drivers or other categorical information which repeats constantly (for example, Issue Driver 1 with around 20 possibilities and Issue Driver 2 with 100 possibilities); these files run to 100+ million records per year, so they become pretty large if I consolidate them. Is it better to create a dictionary and map each driver to an integer? I read a bit about the category dtype in pandas; does this make output file sizes smaller too, or just the in-memory footprint? (See the sketch after point 3.)
2) I store the output as .csv, which means I lose the dtypes if I ever read the file again. How do I maintain dtypes, and should I perhaps save the files to sqlite instead of massive .csv files? My issue now is that I literally write code to break the files up into separate .csvs per month and then maintain one massive file which I use for analysis (I normally dump it into Tableau). If I need to make changes to the monthly files I have to re-write them all, which is slow on my laptop's non-SSD hard drive.
3) I normally only need to share data with one or two people, and most analysis requests are ad hoc but involve one to three years' worth of very granular data (individual surveys or interactions, each represented by a single row in separate files). In other words, I do not need a system with high read-write concurrency; I just want something fast, efficient, and consolidated.
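To make points 1) and 2) concrete, here is roughly what I'm considering (file and column names are made up):

    import sqlite3
    import pandas as pd

    # Hypothetical yearly extract with repeating categorical columns.
    df = pd.read_csv("interactions_2023.csv")
    for col in ("issue_driver_1", "issue_driver_2"):  # assumed column names
        df[col] = df[col].astype("category")  # shrinks memory; plain .csv output stays the same size

    # Parquet keeps dtypes (including categories) and is much smaller than CSV
    # (requires pyarrow or fastparquet).
    df.to_parquet("interactions_2023.parquet")
    df2 = pd.read_parquet("interactions_2023.parquet")  # dtypes come back intact

    # Or push rows into SQLite so monthly changes don't mean rewriting flat files.
    with sqlite3.connect("metrics.db") as conn:
        df.to_sql("interactions", conn, if_exists="append", index=False)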
I'm developing an app in which I'll need to collect five years of daily data from a MySQL server (so, approximately 1,825 rows of a table with about 6 or 7 columns).
To handle this data I can, after retrieving it, either store it in a local SQLite database or just keep it in memory.
I admit that, so far, the only advantage I could find for storing it in a local database, instead of just using what's already loaded, would be having the data accessible the next time the user opens the app.
But I think I might not be taking into account all important factors.
Which factors should I take into account when deciding between storing the data in a local database and keeping it in memory?
Best regards,
Nicolas Reichert
With respect, you're overthinking this. You're talking about a small amount of data: 2K rows is nothing for a MySQL server.
Therefore, I suggest you keep your app simple. When you need those rows in your app, fetch them from MySQL. If you run the app again tomorrow, run the query again and fetch them again.
Are the rows the result of some complex query? To keep things simple you might consider creating a VIEW from the query. On the other hand, you can just as easily keep the query in your app.
Are the rows the result of a time-consuming query? In that case you could create a table in MySQL to hold your historical data. That way you'd only have to do the time-consuming query on your newer data.
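A rough sketch of those two ideas, with invented table names and shown through Python's mysql-connector just for concreteness:

    import mysql.connector  # assumes the mysql-connector-python package

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="...",  # placeholder credentials
                                   database="mydb")
    cur = conn.cursor()

    # Wrap the complex query in a VIEW so the app just SELECTs from it.
    cur.execute("CREATE OR REPLACE VIEW daily_summary AS "
                "SELECT day, SUM(amount) AS total FROM transactions GROUP BY day")

    # Or materialize the older results once, so the expensive query only ever
    # has to run over recent data.
    cur.execute("CREATE TABLE daily_summary_hist AS "
                "SELECT * FROM daily_summary WHERE day < CURDATE()")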
At any rate, adding some alternative storage tech to your app (be it RAM or be it a local sqlite instance) isn't worth the trouble IMHO. Keep It Simple™.
If you're going to store the data locally, you have to figure out how to make it persistent. sqlite does that. It's not clear to me how RAM would do that unless you dump it to the file system.
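For what it's worth, if you did decide you want that persistence, the sqlite side is only a few lines. A rough sketch with made-up column names, where rows is whatever your MySQL query returned:

    import sqlite3

    def cache_locally(rows, db_path="cache.db"):
        # rows: list of 6-tuples (day, v1..v5) fetched from MySQL
        with sqlite3.connect(db_path) as conn:
            conn.execute("""CREATE TABLE IF NOT EXISTS daily_data (
                                day TEXT PRIMARY KEY,
                                v1 REAL, v2 REAL, v3 REAL, v4 REAL, v5 REAL)""")
            conn.executemany("INSERT OR REPLACE INTO daily_data VALUES (?,?,?,?,?,?)",
                             rows)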
I'm completely new to databases, so pardon the simplicity of the question. We have an embedded Linux system that needs to store data collected over a time span of several hours. The data will need to be searchable sequentially and includes things like GPS position, environmental readings, etc. This data will need to be saved off in a folder on a removable SSD and labeled as a "Mission". Several "Missions" can exist on a single SSD and should not be mixed together, because they need to be copied and saved off individually, at the user's discretion, to external media. Data will be saved as often as 10 times a second and needs to be very robust because of the potential for power outages.
The data will need to be searchable on the system it is created on, but also after the removable disk is taken to another system (also Linux), where it needs to be loaded and used as well. In the past we have written custom files to store the data, but it seems like a database might be the best option. How portable are databases like MySQL? Can a user easily remove a disk with a database on it and plug it into a new machine to use without too much effort? Our queries will mostly be time based, because the user will be "playing" through the data after it is collected, at perhaps 10x the collection rate. Also, our base code is written in Qt (C++), so we would need to interact with the database that way.
I'd go with SQLite. It's small and lite. It stores all its data in one file. You can copy or move the file to another computer and read it there. Your data writer can just remake the file, empty, when it detects that today's SSD does not already have one.
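Since your base code is Qt/C++ you'd go through Qt's SQL module rather than Python, but the schema and pragmas are the same. A rough sketch of what one "Mission" file might look like (paths and column names are made up), with WAL journaling for crash robustness and a timestamp key for time-based playback:

    import sqlite3, time

    # One database file per "Mission" on the removable SSD.
    conn = sqlite3.connect("/media/ssd/mission_2024_05_01.db")
    conn.execute("PRAGMA journal_mode=WAL")   # better behaviour across power loss
    conn.execute("PRAGMA synchronous=FULL")   # flush every commit to disk
    conn.execute("""CREATE TABLE IF NOT EXISTS samples (
                        ts REAL PRIMARY KEY,  -- timestamp drives playback queries
                        lat REAL, lon REAL, temperature REAL)""")

    def record(lat, lon, temperature):
        with conn:  # each insert in its own transaction
            conn.execute("INSERT INTO samples VALUES (?,?,?,?)",
                         (time.time(), lat, lon, temperature))

    # Playback: fetch a time window in order, e.g. for a 10x-speed replay.
    window = conn.execute("SELECT * FROM samples WHERE ts BETWEEN ? AND ? ORDER BY ts",
                          (0, time.time())).fetchall()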
It's also worth mentioning that SQLite undergoes testing at a level afforded to only a select few safety-critical pieces of software. The test suite, while partly autogenerated, is a staggering 100 million lines of code. It is not "lite" at all when it comes to robustness. I would trust SQLite more than a random self-made database implementation.
SQLite is used in certified avionics AFAIK.
What do you think is an appropriate data store for sensor data such as temperature, humidity, velocity, etc.? Users will have different sets of fields, and they need to be able to run basic queries against the data.
Relational databases like MySQL are not flexible in terms of schema, so I was looking into NoSQL approaches, but I'm not sure whether any particular project is more suitable for sensor data. Most NoSQL stores seem to be geared toward log output.
Any feedback is appreciated.
Thanks!
I still think I would use an RDBMS, simply because of the flexibility you have to query the data. I am currently developing applications that log approximately one thousand remote sensors to SQL Server, and though I've had some growing pains due to the "inflexible" schema, I've been able to expand it in ways that provide for multiple customers. I'm providing enough data columns that, collectively, they can handle a vast assortment of sensor values, and each user just queries against the fields that interest them (or that their sensor has features for).
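To give a feel for what I mean, here is a sketch only (invented column names, and SQLite instead of SQL Server for brevity): the table is wide enough that most sensor types map onto some subset of generic value columns.

    import sqlite3  # illustration only; the same idea works on SQL Server or MySQL

    conn = sqlite3.connect("sensors.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS readings (
                        sensor_id   INTEGER NOT NULL,
                        recorded_at TEXT    NOT NULL,
                        temperature REAL,   -- nullable: each sensor fills only
                        humidity    REAL,   -- the columns it supports
                        velocity    REAL,
                        extra_1     REAL,   -- spare generic columns for
                        extra_2     REAL    -- future sensor features
                    )""")

    # Each customer just queries the fields their sensors populate.
    rows = conn.execute("""SELECT recorded_at, temperature, humidity
                           FROM readings WHERE sensor_id = ? ORDER BY recorded_at""",
                        (17,)).fetchall()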
That said, I originally began this project by writing to a comma-separated values (CSV) file and writing an app that would parse it for graphing, etc. Each sensor stored its data in a separate CSV, but by opening multiple files I was able to compare two or more sets of sensor data. CSV is also highly compressible, opens in major office applications, and is fairly user-editable (such as trimming sections that you don't need).