Spark memory requirements for querying 20 GB of CSV

Before diving into the actual coding, I am trying to understand the logistics around Spark.
I have server logs split into 10 CSV files of around 2 GB each.
I am looking for a way to extract some data, e.g. how many failures occurred in a period of 30 minutes per server (a sketch of that kind of query is shown below).
(The logs have entries from multiple servers, i.e. there is no predefined ordering by time or by server.)
Is that something I could do with Spark?
If yes, would that mean I need a box with 20+ GB of RAM?
When I operate on RDDs in Spark, is the full dataset taken into account? E.g. would an operation ordering by timestamp and server ID execute over the full 20 GB dataset?
Thanks!
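
For what it's worth, this kind of aggregation is a good fit for Spark, and Spark does not need to hold the full 20 GB in RAM at once: data is processed partition by partition and spilled to disk during shuffles, so a wide operation such as a full sort or group-by still runs over the whole 20 GB, just not all of it in memory at the same time. A minimal PySpark sketch of the failure count, assuming the logs have columns named timestamp, server_id and status (those names are guesses, adjust them to the real schema):

# Count failures per server per 30-minute window.
# Column names and the "FAILURE" value are assumptions about the log schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("failure-counts").getOrCreate()

logs = (spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/data/server-logs/*.csv"))   # picks up all 10 files, ~20 GB in total

# If the timestamp column is read as a string, cast it first with
# F.to_timestamp("timestamp", "<format>").
failures = (logs
            .filter(F.col("status") == "FAILURE")
            .groupBy("server_id", F.window("timestamp", "30 minutes"))
            .count())

failures.show(truncate=False)

The same job on a machine with less memory than the dataset will still complete; it will just spend more time shuffling and spilling.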

Related

Loading 5 million records from Server A to Server B using SSIS

I'm new to SSIS and need your suggestions. I have created an SSIS package which retrieves around 5 million records from source server A and saves the data into the destination server. This process takes nearly 3 hours to complete. Is there any other way to reduce this time? I have tried increasing the buffer size, but it is still the same.
Thanks in advance.
There are many factors influencing the speed of execution, both hardware and software. A solution can be determined based on the structure of the database.
In a test project, I have transferred 40 million records in 30 minutes on a system with 4 GB of RAM.

Tell MySQL to store a table in memory or on disk

I have a rather large (10M+ rows) table with quite a lot of data coming from IoT devices.
People only access the last 7 days of data, but have on-demand access to older data.
As this table is growing at a very fast pace (100k rows/day), I chose to split it into 2 tables: one holding only the last 7 days of data, and another one with the older data.
I have a cron job running that basically takes the oldest data and moves it to the other table (a sketch of such a job is shown below).
How could I tell MySQL to keep only the '7 days' table loaded in memory to speed up read access, and keep the 'archives' table on disk (SSD) for less frequent access?
Is this something implemented in MySQL (Aurora)?
I couldn't find anything in the docs besides in-memory tables, but those are not what I'm after.
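
For illustration, the cron job described above could be a small script along these lines; the table names (data_7days, data_archive), the created_at column and the connection details are placeholders, not taken from the question:

# Hypothetical archival job: copy rows older than 7 days into the archive
# table, then delete them from the "hot" table. All names are placeholders.
from datetime import datetime, timedelta
import mysql.connector

cutoff = datetime.now() - timedelta(days=7)

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="iot")
cur = conn.cursor()
cur.execute("INSERT INTO data_archive "
            "SELECT * FROM data_7days WHERE created_at < %s", (cutoff,))
cur.execute("DELETE FROM data_7days WHERE created_at < %s", (cutoff,))
conn.commit()   # both statements commit together
cur.close()
conn.close()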

MySQL RAM requirement for a 22-billion-record SELECT query

I have a table which is expected to have 22 billion records yearly. How much RAM will be required if each record costs around 4 KB of data?
Around 8 TB of storage is expected for the same table.
[update]
There are no join queries involved. I just need the SELECT queries to be executed efficiently.
I have found that there is no general rule of thumb for how much RAM you need for x number of records in MySQL.
The first factor you need to look at is the design of the database itself. This is one of the most impactful factors of all. If your database is poorly designed, throwing RAM at it isn't going to fix your problem.
Another factor is how this data is going to be accessed, i.e. if a specific row is being accessed by 100 people with SELECT * FROM table WHERE column = value, then you could get away with a tiny amount of RAM, as you would just use query caching.
It MAY (but not always) be a good idea to keep your entire database in RAM so it can be read more quickly, depending on the total size of the database. E.g. if your database is 100 GB in size, then 128 GB of RAM should be sufficient to deal with overheads such as the OS and other factors.
On my system, I support 224 GB of Oracle CDR records daily for a network operator.
For another system, around 2 million (20 lakh) records are retrieved daily from a SQL database.
You can use 128 GB if you are using one server; if you are using a load balancer, then you can use 62 GB on every PC.

MySQL database size is larger than export file?

I have a Drupal site with a MySQL database at my hosting provider. He tells me that my DB is almost at its max limit of about 1 GB, but when I export the DB to an output file, the file is only 80 MB for the whole DB.
It would make sense for the DB to be smaller than my export file, or about the same size, but when the DB on the hosting is almost 10 times larger than the export file, I think that is impossible.
Can you help me find out whether this is possible, or whether my hosting provider is manipulating my data and writing me messages every day to push me towards bigger (and more expensive) DB storage?
Functioning MySQL databases use plenty of data storage for such things as indexes. A 12x ratio between your raw (exported) data size and the disk space your database uses is a large ratio, but it's still very possible.
MySQL makes the tradeoff of using lots of disk space to improve query performance. That's a wise tradeoff these days because the cost of disk space is low, and decreasing very fast.
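If you want to see where the space is actually going, one option (a sketch; the connection details are placeholders) is to compare the data size, index size and free space per table via information_schema:

# List data size vs. index size per table in the current schema.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="drupal",
                               password="secret", database="drupal_db")
cur = conn.cursor()
cur.execute("""
    SELECT table_name,
           ROUND(data_length  / 1024 / 1024, 1),
           ROUND(index_length / 1024 / 1024, 1),
           ROUND(data_free    / 1024 / 1024, 1)
    FROM information_schema.tables
    WHERE table_schema = DATABASE()
    ORDER BY data_length + index_length DESC
""")
for table, data_mb, index_mb, free_mb in cur.fetchall():
    print(f"{table}: data {data_mb} MB, indexes {index_mb} MB, free {free_mb} MB")
cur.close()
conn.close()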
You may (or may not) recover some space by going into phpMyAdmin and optimizing all your tables. This statement will optimize a list of tables:
OPTIMIZE TABLE tablename, tablename, tablename
It's smart to back up your database before doing this.
A 1GiB limit on a MySQL database size is absurdly low. (A GiB of storage at AWS or another cloud provider costs less than $US 0.20 per month these days.) It may be time to get a better hosting service.

Reducing processing time of database

We have two databases, DB-A and DB-B, with more than 5000 tables in DB-A, and we process our whole database daily. Here, processing means we get the data from multiple tables of DB-A and then insert it into some of the tables of DB-B. After inserting this data into DB-B, we access it many times, because we need to process all of the data in DB-B. We access the data in DB-B whenever we need to process it, which is more than 500 times a day, and every time we access only the data we need to process. Since we are accessing this database (DB-B) so many times, it takes more than 2 hours for everything to get processed.
Now the problem is that I want to access the data from DB-A, process it, and then insert it into DB-B in one shot. The constraint is that we have limited resources: we have only 16 GB of RAM, and we are not in a position to increase it.
We have done indexing and so on, but it is still taking more than 2 hours. Please suggest how I can reduce the processing time of this data.
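
For illustration only, if both databases happen to live on the same MySQL server (an assumption; the question does not say which database engine is used), the copy from DB-A to DB-B can be pushed down to the server in keyed batches, so the 16 GB machine never has to hold a full result set; all table and column names below are placeholders:

# Hypothetical batched, server-side copy from DB-A to DB-B.
# Assumes an integer auto-increment primary key "id" on the source table.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="etl", password="secret")
cur = conn.cursor()

cur.execute("SELECT MIN(id), MAX(id) FROM db_a.source_table")
lo, hi = cur.fetchone()

step = 50_000
if lo is not None:
    start = lo
    while start <= hi:
        # The rows never pass through the client; the server copies them directly.
        cur.execute("INSERT INTO db_b.target_table "
                    "SELECT * FROM db_a.source_table "
                    "WHERE id >= %s AND id < %s", (start, start + step))
        conn.commit()   # keep each transaction small
        start += step

cur.close()
conn.close()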