I am analyzing how to store over 10,000 signals 50 times per second. I will probably be reading them from memory. Each signal has a timestamp (8 bytes) and a double (8 bytes). The process will run 4 hours a day, 1 day a week. So:
10,000 signals x 50 times/s x 16 bytes = 8 MB/s
8 MB/s x 3,600 s/h x 4 h = ~115 GB per week
What database (or other option, such as plain files) should I use to store this data quickly? Are MongoDB or Cassandra good options? What language would be a good fit? Is Java fast enough to read the data from memory and store it in the database, or is C a better choice?
Is a clustered solution needed?
Thanks.
Based on your description, I'd suggest an SQLite database. It's very lightweight and faster than MySQL and MongoDB.
See benchmark here.
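If you go that route, the main thing for write speed is batching inserts inside transactions instead of committing row by row. A minimal SQLite sketch, assuming a hypothetical signals table matching the 16 bytes per sample described above:

CREATE TABLE signals (ts INTEGER NOT NULL, value REAL NOT NULL);

-- optional: trade a little durability for write throughput
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;

-- commit one batch per transaction (e.g. one second of data, ~500,000 rows in this workload)
BEGIN TRANSACTION;
INSERT INTO signals (ts, value) VALUES (1351234567000, 3.14);
-- ... remaining rows of the batch ...
COMMIT;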
That is roughly 700-800 MB of data per day, so if you need to query it after one month, about 25 GB will have to be scanned.
In that case you will probably need a clustered/sharded solution to split the load.
As the data will grow constantly, you need a dynamic solution; MongoDB shards and replica sets can spread the load and manage data distribution.
I need to compare a few scenarios, which can be fulfilled either by:
fetching an additional timestamp column from the MySQL database, or
looping over the resulting array.
Elaborating further:
CASE 1: 144-byte columns + a 4-byte timestamp column for 10K rows, then looping over an array of size 50 (download size: 1,480,000 bytes).
CASE 2: 144-byte columns for 10K rows, then looping over an array of size 10,000 (download size: 1,440,000 bytes).
The data download is roughly 40 KB larger for case 1, while case 2 needs about 10,000 more loop iterations.
Which of the two scenarios would be faster: downloading 40 KB more, or performing 10,000 extra loop iterations?
Your first scenario is by far the best. Here is why.
SQL is designed to extract subsets of rows from tables. It's designed to allow your tables to be many orders of magnitude bigger than your RAM. If you use a query like
SELECT *
FROM mytable
WHERE mytimestamp >= CURDATE() - INTERVAL 1 DAY
  AND mytimestamp < CURDATE()
you will get all the rows with timestamps anytime yesterday, for example. If you put an index on the mytimestamp column, SQL will satisfy this query very quickly and only give the rows you need. And it will satisfy any queries looking for a range of timestamps similarly quickly.
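Assuming MySQL (as the question mentions), creating that index is a one-liner; the names below are taken from the query above:

CREATE INDEX idx_mytable_mytimestamp ON mytable (mytimestamp);

-- EXPLAIN should then show a range scan on the index rather than a full table scan
EXPLAIN SELECT *
FROM mytable
WHERE mytimestamp >= CURDATE() - INTERVAL 1 DAY
  AND mytimestamp < CURDATE();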
There are no answers that are true in 100% of situations. For example, I would prefer to do the first query when I use a fast enough network (anything 1Gbps or higher) and data transfer is free. The difference in the example you show is 40,000 bytes, but it's only 2.7% of the total.
On the other hand, if you need to optimize to minimize bandwidth usage, that's a different scenario. Like if you are transmitting data to a satellite over a very slow link. Or you have a lot of competing network traffic (enough to use up all the bandwidth), and saving 2.7% is significant. Or if you are using a cloud vendor that charges for total bytes transferred on the network.
You aren't taking into account the overhead of executing 1,000 separate queries. That means 1,000x the bytes sent to the database server just to transmit the queries, which takes network bandwidth too. The database server also has to parse and optimize each query (MySQL does not cache optimization plans as some other RDBMS products do), and then begin executing it, starting with an index lookup that has no context from the previous query's result.
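To make that concrete, here is the difference between the two access patterns (a sketch; the id column is illustrative, not from the question):

-- one round trip: parsed, optimized and executed once, scanning the index range
SELECT * FROM mytable WHERE mytimestamp >= ? AND mytimestamp < ?;

-- versus 1,000 round trips: each statement is sent, parsed, optimized and executed separately
SELECT * FROM mytable WHERE id = 1;
SELECT * FROM mytable WHERE id = 2;
-- ... and so on, once per row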
"Performance" is a slippery term. You seem to be concerned only with bandwidth use, but there are other ways of measuring performance.
Throughput (bytes per second)
Latency (seconds per response)
Wait time (seconds until the request begins)
All of these can be affected by the total load, i.e. number of concurrent requests. A busy system may have traffic congestion or request queueing.
You can see that this isn't a simple problem.
I am considering moving my MySQL architecture to AWS DynamoDB. My application requires 1,000 read/write requests per second. How does this play with PHP and updates? Having 1,000 workers process DynamoDB reads/writes seems like it will take a higher toll on CPU/memory than MySQL reads/writes.
I have thought about writing the updated information to a log file and then creating scripts that process the log file, to take load off the database; however, I'm stuck on file locking, since 300 separate scripts would be writing to a single log file. The log file could then be flushed to the database every minute. I'm not sure how this could be implemented without locking, and would be curious if anyone has ideas. The server-side script is written in PHP.
Current Setup
MySQL database (RDS on AWS)
Table A has 5M records and is the main DB table: 30 columns, mostly numerical, plus text < 500 chars (growth: +30K records per day). It has relationships with 4 other tables:
Table B - 15M records (growth: +90K records per day).
Table C - 2M records (growth: +10K records per day).
Table D - 4M records (growth: +15K records per day).
Table E - 1M records (growth: +5K records per day).
Table A updates around 1,000 records per second, and updated/added rows are then queued for indexing in SOLR search.
I would appreciate some much-needed advice on lowering costs. Are there hidden costs or other solutions I should be aware of before starting development?
I'm afraid the scope for improving your DB's performance is just too broad to cover fully, but here are a few points:
IOPS: some DevOps teams choose to provision 200 GB of general-purpose storage (200 x 3 = 600 baseline IOPS) rather than buying provisioned IOPS for a smaller volume (say they only need 50 GB, so they would otherwise purchase provisioned IOPS). You need to open a spreadsheet to find the pricing/performance sweet spot.
You might need to create another denormalised table derived from table A if you frequently select from table A but don't need to traverse the whole text column (< 500 chars). Don't underestimate the text workload. (A hypothetical sketch of this, together with the indexing point below, follows after these notes.)
Index, index , index.
If you deal with a lot of non-linear search, perhaps copy the relevant part of the data to DynamoDB where you think it will improve performance; test it first, but maintain the RDBMS structure.
There is no one-size-fits-all solution. Also look into using a message queue if required.
Adding 200K records/day is actually not much for today's RDBMSs, and even 1,000 IOPS typically only happen in bursts. If querying is the heaviest part, then that is the part you need to optimize.
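As a concrete (entirely hypothetical) illustration of the denormalised-table and indexing points above, since the real schema isn't shown:

-- narrow copy of the frequently selected columns, without the wide text column
CREATE TABLE table_a_summary (
  id         INT PRIMARY KEY,
  status     TINYINT,
  amount     DECIMAL(10,2),
  updated_at DATETIME
);

-- composite index matching the most common WHERE + ORDER BY pattern
CREATE INDEX idx_summary_status_updated ON table_a_summary (status, updated_at);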
I have a lot of spectra that I want to store in a database. A spectrum is basically an array of integers, in my case with a variable length of typically 512 or 1024. How do I best store these spectra? Along with the spectra I want to store some additional data like a time and a label, which will be simple fields in my database. The spectra will not be retrieved often, and when I need them, I need them as a whole.
For storing the spectra I can think of 2 possible solutions:
Storing them as a string, like "1,7,9,3,..."
Storing the spectra in a separate table, with each value in a separate row, containing fields like spectrum_id, index and value
Any suggestions on which one to use? Other solutions are much appreciated of course!
Your first solution is a common mistake people make when they transition from the procedural/OO programming mindset to the database mindset: it's all about efficiency, fetching the smallest number of records, etc. The database world requires a different paradigm for storing and retrieving data.
Here's how I'd do it: make 2 tables:
spectra
---------
spectra_id (primary key)
label
time
spectra_detail
---------
spectra_id
index
value
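In MySQL DDL this could look roughly as follows (a sketch; note that index is a reserved word, so it must be quoted):

CREATE TABLE spectra (
  spectra_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  label      VARCHAR(100),
  time       DATETIME
);

CREATE TABLE spectra_detail (
  spectra_id INT NOT NULL,
  `index`    INT NOT NULL,
  value      INT NOT NULL,
  PRIMARY KEY (spectra_id, `index`),
  FOREIGN KEY (spectra_id) REFERENCES spectra (spectra_id)
);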
To retrieve them:
SELECT *
FROM spectra s
INNER JOIN spectra_detail sd ON s.spectra_id = sd.spectra_id
WHERE s.spectra_id = 42
If you have a small dataset (hundreds of MB), there is no problem in using an SQL DBMS with any of the alternatives.
As proposed by Maciej, serialization is an improvement over the other alternative, since you can group each spectrum sweep into a single tuple (a row in a table), reducing the overhead of keys and other bookkeeping.
For the serialization, you may consider using geometry objects such as LINESTRING or MULTIPOINT so that you can still process the data with SQL functions. This will require some scaling of the values, but it keeps the data queryable, and if you use WKB you may also achieve a relevant gain in storage with little loss in performance.
The problem is that spectrum data tends to accumulate, and storage usage may become a problem that the serialization trick will not easily solve. You should consider this carefully in your project.
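A rough illustration of that idea in MySQL (5.7+ spatial functions; ids and values are made up): each point of the LINESTRING is (sample index, value), so a whole sweep becomes a single geometry.

ALTER TABLE spectra ADD COLUMN sweep LINESTRING;

-- x = sample index, y = measured value
UPDATE spectra
SET sweep = ST_GeomFromText('LINESTRING(0 1, 1 7, 2 9, 3 3)')
WHERE spectra_id = 42;

-- read it back in compact WKB form
SELECT ST_AsWKB(sweep) FROM spectra WHERE spectra_id = 42;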
Working on a similar problem, I came to the conclusion that it is a bad idea to use any SQL DBMS (MySQL, SQL Server, PostgreSQL, and such) to manage large numerical matrix data, such as spectrum sweep measurements. It's a bit like trying to build an image library CMS by storing images pixel by pixel in a database.
The following table compares a few formats from my experiment. It may help illustrate the problem of using an SQL DBMS to store numerical data matrices.
Format              Contents                                                 Size
MySQL table         key INT(10) + value DECIMAL(4,1), one row per sample    1 157 627 904 B
TXT                 CSV, DECIMAL(4,1), equivalent to 14 bit                    276 895 606 B
BIN (original)      matrix, 1 byte x 51200 columns x 773 rows + metadata        40 038 580 B
HDF5                matrix, 3 bytes x 51200 columns x 773 rows + metadata       35 192 973 B
TXT + zip           CSV, DECIMAL(4,1) + standard zip compression                34 175 971 B
PNG RGBA            matrix, 4 bytes x 51200 columns x 773 rows                  33 997 095 B
ZIP(BIN)            the original BIN file with standard zip compression         26 028 780 B
PNG 8-bit indexed   matrix, 1 byte x 51200 columns x 773 rows + colour scale    25 947 324 B
The MySQL example didn't use any serialization. I didn't try it, but one might expect the occupied storage to drop to almost half by using WKT linestrings or similar features. Even so, the storage used would still be almost double that of the corresponding CSV and more than 20 times the size of an 8-bit PNG with the same data.
These numbers are expected when you stop to think about how much extra data you are storing in terms of keys and search optimization when you use an SQL DBMS.
As a closing remark, I would suggest that you consider using PNG, TIFF, HDF5, or any other digital format that better suits your front end to store the spectrum data (or any other large matrix), and perhaps use an SQL DBMS for the dimensions around this core data: who measured, when, with which equipment, to what end, etc. In short, keep a BLOB with the files inside the database, or keep the files outside it, whichever better suits your system architecture.
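For instance, the split between core data as files and dimensions in SQL could be as simple as this (a sketch; all names invented):

CREATE TABLE spectrum_files (
  spectrum_id  INT PRIMARY KEY,
  measured_at  DATETIME,
  equipment    VARCHAR(100),
  operator     VARCHAR(100),
  -- either keep the HDF5/PNG payload in the row ...
  payload      LONGBLOB,
  -- ... or keep the file on disk / object storage and store only its location
  payload_path VARCHAR(255)
);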
Alternatively, it is worthwhile to consider a big data solution built around a digital format such as HDF5. Each tool to its end.
I am currently deciding on a long-term architecture for storing DNS logs. The amount of data we are talking about is some 80 GB of logs per day at peak. Currently I am looking at NoSQL databases such as MongoDB, as well as relational ones like MySQL. I want to structure a solution that meets three requirements:
Storage: this is a long-term project, so I need the capacity to store 80 GB of logs per day (~30 TB a year!). I realize this is pretty ridiculous, so I'm willing to apply a retention period (keeping 6 months' worth of logs = a constant ~15 TB).
Scalability: as it is a long-term solution, this is a big issue. I've heard that MongoDB is horizontally scalable while MySQL is not? Any elaboration on this would be very well received.
Query speed: as close to instantaneous querying as possible.
It should be noted that our logs are stored on an intermediary server, so we do not need to forward logs from our DNS servers.
I want to retrieve the data at lightning speed for my analytics solution. The problem is as follows.
I am processing a lot of data every 15 minutes and creating different cubes [tuples] with a very large number of distinct dimensions. In detail, I am segregating data into 100 countries x 10,000 states x 5,000 TV channels x 999 programs x ... x 15-minute time intervals, so every 15 minutes this creates a lot of different tuples. Currently I am using a MySQL database and load the data via dump files, which makes writing much faster. I also have separate aggregated tables at 15-minute, 1-hour, 1-day, 1-week, 1-month and 1-year granularity, which I use for different types of queries. But retrieval takes a lot of time, even with the best indexing on the database tables.
Can anyone please suggest a solution to this problem? If possible, please also contrast NoSQL with MySQL for this use case.
I am getting my data as a simple text server log, which is generated by my Java web service application using its logging functionality, like this ... – Dhimant Jayswal 13 mins ago
As I mentioned, I am creating different dimensions by processing these logs every 15 minutes and saving them into a MySQL database table, aggregating over the same dimensions for that period (for example, for 1 hour, four 15-minute dimensions will be aggregated to create 1 dimension). In more detail, I am creating a dimension [/object] like time=2012-11-16 10:00:00, Country='US', Location='X', His=15; another one is time=2012-11-16 10:15:00, Country='US', Location='X', His=67. These values then go into the database tables, so I can retrieve data at any time by hour, 15 minutes or day.
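Expressed as SQL, that 15-minute-to-1-hour roll-up might look like this (table and column names are guesses based on the description above):

INSERT INTO agg_1hour (ts, country, location, his)
SELECT DATE_FORMAT(ts, '%Y-%m-%d %H:00:00') AS hour_bucket,
       country,
       location,
       SUM(his)
FROM agg_15min
WHERE ts >= '2012-11-16 10:00:00'
  AND ts <  '2012-11-16 11:00:00'
GROUP BY hour_bucket, country, location;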