Storing a large CSV file with 2 million log entries per day - MySQL

I have a very large CSV file containing 2 million log entries per customer arriving every day. We have to develop an analytics tool that produces summaries over various GROUP BY combinations of the CSV data.
We built it using MySQL (InnoDB), but it runs very slowly. We have applied proper indexes to the tables and the hardware is also good.
Is MySQL capable of powering this type of analytical tool, or do I need to look at another database?
Each SQL SELECT query takes 15-20 seconds to return results from a single table.

I am assuming that the data you have is insert-only and that you are mostly looking to build dashboards that show some metrics to clients.
You can approach this problem in a different way. Instead of storing the CSV data directly in the SQL database, you can process the CSV first using Spark, Spring Batch, or Airflow, depending on your language options. Doing this lets you reduce the amount of data that you have to store.
Another approach you can consider is processing the CSV and pushing the results to something like BigQuery or Redshift. These databases are designed to process and query large volumes of data.
To speed up queries, you could also create materialized views to build dashboards quickly, but I would not recommend this, as it is not a very scalable approach.
I recommend that you process the data first, generate the required metrics, store those in SQL, and build dashboards on top of them instead of saving the raw data directly.
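As an illustration of that last point, here is a minimal PySpark sketch of the pre-aggregation step. The column names (customer_id, event_type, created_at), file path, table name, and connection details are assumptions, and the MySQL JDBC driver is assumed to be on the Spark classpath:

```python
# Minimal sketch of the "pre-aggregate, then store" approach.
# Assumed CSV columns: customer_id, event_type, created_at.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-log-rollup").getOrCreate()

logs = (spark.read
        .option("header", True)
        .csv("/data/incoming/customer_logs_2024-01-01.csv"))

# Collapse millions of raw rows into one row per customer/event_type/day.
daily_summary = (logs
                 .withColumn("day", F.to_date("created_at"))
                 .groupBy("customer_id", "event_type", "day")
                 .agg(F.count("*").alias("event_count")))

# Store only the aggregated metrics in MySQL; dashboards query this table.
(daily_summary.write
 .format("jdbc")
 .option("url", "jdbc:mysql://db-host:3306/analytics")
 .option("dbtable", "daily_event_summary")
 .option("user", "report_user")
 .option("password", "...")
 .mode("append")
 .save())
```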

Related

How can we use streaming in Spark from multiple sources? E.g. first take data from HDFS and then consume a stream from Kafka

The problem arises because I already have a system and I want to implement Spark Streaming on top of it.
I have 50 million rows of transactional data in MySQL and I want to do reporting on that data. I thought of dumping the data into HDFS.
New data also arrives in the database every day, and I am adding Kafka for the new data.
I want to know how I can combine data from multiple sources, do analytics in near real time (a 1-2 minute delay is OK), and save those results, because future data needs the previous results.
Joins are possible in Spark SQL, but what happens when you need to update data in MySQL? Then your HDFS data becomes invalid very quickly (faster than a few minutes, for sure). Tip: Spark can read MySQL over JDBC rather than needing HDFS exports.
Without knowing more about your systems, I'd say keep the MySQL database running, as there is probably something else actively using it. If you want to use Kafka, then that's a continuous feed of data, but HDFS/MySQL are not. Combining remote batch lookups with streams will be slow (it could take more than a few minutes).
However, if you use Debezium to get the data from MySQL into Kafka, then your data is centralized in one location, and you can ingest from Kafka into an indexable store such as Druid, Apache Pinot, ClickHouse, or maybe ksqlDB.
Query from those, as they are purpose-built for that use case, and you don't need Spark. Pick one or more, as they each support different use cases / query patterns.
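To make the "JDBC instead of HDFS exports" tip concrete, here is a hedged Spark Structured Streaming sketch that joins a Kafka stream with the existing MySQL data. The topic, hosts, credentials, output paths, and the account_id join key are all assumptions; note that the MySQL side behaves as a lookup table, not a continuous feed, which is exactly the caveat raised above:

```python
# Sketch: stream-static join between Kafka events and a MySQL table.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-mysql-join").getOrCreate()

# Existing transactional data read over JDBC (no HDFS export needed).
accounts = (spark.read
            .format("jdbc")
            .option("url", "jdbc:mysql://db-host:3306/app")
            .option("dbtable", "accounts")
            .option("user", "report_user")
            .option("password", "...")
            .load())

# New events arriving on Kafka.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "transactions")
          .load()
          .select(F.col("key").cast("string").alias("account_id"),
                  F.col("value").cast("string").alias("payload")))

# Stream-static join, with results written out every couple of minutes.
enriched = events.join(accounts, "account_id", "left")

query = (enriched.writeStream
         .format("parquet")
         .option("path", "/data/enriched")
         .option("checkpointLocation", "/data/checkpoints/enriched")
         .trigger(processingTime="2 minutes")
         .start())
```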

Should I use files or a database?

I'm building a cloud sync application which syncs a user's data across multiple devices. I am at a crossroads, deciding whether to store the data on the server as files or in a relational database. I am using Amazon Web Services and will use S3 for user files, or their database service if I choose to store the data in a table instead. The data I'm storing is the state of the application every ten seconds. This could be problematic to store in a database, because the average number of rows per user would be 100,000, and with my current user base of 20,000 people that's 2 billion rows right off the bat. Would I be better off storing that information in files? That would be about 100 files totaling 6 megabytes per user.
As discussed in the comments, I would store these as files.
S3 is perfectly suited to be a key/value store, and if you're able to diff the changes and ensure that you aren't unnecessarily duplicating loads of data, the sync will be far easier to do by downloading the relevant files from S3 and syncing them client-side.
You get a big cost saving by not having to operate a database server that can store tonnes of rows and stay up to serve them to clients quickly.
My only real concern would be that the data in these files can be difficult to parse if you wanted to aggregate stats/data/info across multiple users as a backend or administrative view. You wouldn't be able to write simple SQL queries to sum up values etc, and would have to open the relevant files, process them with something like awk or regular expressions etc, and then compute the values that way.
You're likely doing that on the client side anyway for the specific files that relate to that user, though, so there's probably some overlap there!
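For reference, here is a minimal boto3 sketch of treating S3 as a key/value store for per-user state files. The bucket name and the user_id/device_id/timestamp key layout are assumptions:

```python
# Sketch of S3 as a key/value store for per-user application state.
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "app-sync-state"  # placeholder bucket name

def save_state(user_id, device_id, timestamp, state):
    """Write one state snapshot (or diff) as a small JSON object."""
    key = f"{user_id}/{device_id}/{timestamp}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(state).encode("utf-8"))

def load_states(user_id, device_id):
    """List and download the snapshots a client needs to sync from."""
    prefix = f"{user_id}/{device_id}/"
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    for obj in listing.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        yield json.loads(body)
```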

Logging high volume of impression data (50 million records/month)

We are currently logging impression data for several websites using MySQL and are seeking a more appropriate replacement for logging the high volume of traffic our sites now see. What we ultimately need in the MySQL database is aggregated data.
By "high volume" I mean that we are logging about 50 million entries per month for this impression data. It is important to note that this table activity is almost exclusively write and only rarely read. (Different from this use-case on SO: Which NoSQL database for extremely high volumes of data). We have worked around some of the MySQL performance issues by partitioning the data by range and performing bulk inserts, but in the big picture, we shouldn't be using MySQL.
I believe there are other technologies much better suited to the high-volume logging portion of this use case. I have read about MongoDB, HBase (with MapReduce), Cassandra, and Apache Flume, and I feel like I'm on the right track, but I need some guidance on which technology (or combination) I should be looking at.
What I would like to know specifically is which platforms are best suited for high-volume logging and how to get an aggregated/reduced data set fed into MySQL on a daily basis.
Hive doesn't store data itself; it only allows you to query "raw" data with an SQL-like language (HQL).
If your aggregated data is small enough to be stored in MySQL and that is the only use of your data, then HBase could be overkill for you.
My suggestion is to use Hadoop (HDFS and MapReduce):
Create log files (text files) with the impression events.
Then move them into HDFS (an alternative could be to use Kafka or Storm if you require a near-real-time solution).
Create a MapReduce job that reads and aggregates your logs, and in the reduce output use a DBOutputFormat to store the aggregated data in MySQL.
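DBOutputFormat is part of the Java MapReduce API, so as a rough, hedged illustration of just the map/aggregate step, here is a Hadoop Streaming sketch in Python. It assumes tab-separated log lines with the site id in the second column; the aggregated output lands in HDFS, from where a separate step (a Java job using DBOutputFormat, or a bulk LOAD DATA INFILE) would push it into MySQL:

```python
# Hadoop Streaming sketch: count impressions per site id.
# Assumed input: tab-separated log lines, site id in the second column.
# Invoked by Hadoop Streaming as the -mapper ("impressions.py map")
# and -reducer ("impressions.py reduce") commands.
import sys

def mapper():
    # Emit one count per impression, keyed by site id.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 1:
            print("%s\t1" % fields[1])

def reducer():
    # Hadoop sorts mapper output by key, so equal keys arrive together.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key == current_key:
            count += int(value)
        else:
            if current_key is not None:
                print("%s\t%d" % (current_key, count))
            current_key, count = key, int(value)
    if current_key is not None:
        print("%s\t%d" % (current_key, count))

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```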
One approach could be to simply dump the raw impression logs into flat files. A daily batch job would then process these files using a MapReduce program, and the aggregated MapReduce output could be stored in Hive or HBase.
Please let me know if you see any problem with this approach. The big data technology stack has many options depending on the type of data and the way it needs to be aggregated.

HIVE or HBASE: which one should I use for my data analytics?

I have 150 GB of MySQL data and plan to replace MySQL with Cassandra as the backend.
For analytics, I plan to go with Hadoop, Hive, or HBase.
Currently I have 4 physical machines for a POC. Please help me come up with the most efficient architecture.
Per day I will get 5 GB of data.
I have to send a daily status report to each customer.
I also have to provide analysis reports on request: for example, a 1-week report, or a report for the first 2 weeks of last month. Is it possible to produce such reports instantly using Hive or HBase?
I want the best performance using Cassandra and Hadoop.
Hadoop can process your data using the MapReduce paradigm or newer technologies such as Spark. The advantage is a reliable distributed filesystem and the use of data locality to send the computation to the nodes that hold the data.
Hive is a good SQL-like way of processing files and generating your reports once a day. It's batch processing, and 5 more GB a day shouldn't have a big impact. It does have high overhead latency, but that shouldn't be a problem if you run it once a day.
HBase and Cassandra are NoSQL databases whose purpose is to serve data with low latency. If that's a requirement, you should go with one of them. HBase uses HDFS to store its data, and Cassandra has good connectors to Hadoop, so it's simple to run jobs consuming from either of these sources.
For reports based on a request that specifies a date range, you should store the data in an efficient way so you don't have to read data that isn't needed for the report. Hive supports partitioning, which can be done by date (i.e. /<year>/<month>/<day>/). Using partitioning can significantly reduce your job execution times.
If you go with the NoSQL approach, make sure the row keys have a date prefix (e.g. 20140521...) so that you can select the rows that start with the dates you want.
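As a hedged sketch of that partitioning idea, the following Spark job writes events into a date-partitioned layout (Hive-style year=/month=/day= directories) that a partitioned Hive table can be defined over, so date-range reports only read the relevant directories. The column names and paths are assumptions:

```python
# Sketch: write events partitioned by date for efficient date-range reports.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioned-events").getOrCreate()

events = spark.read.json("/data/raw/events/")   # assumed raw input

partitioned = (events
               .withColumn("ts", F.to_timestamp("event_time"))
               .withColumn("year", F.year("ts"))
               .withColumn("month", F.month("ts"))
               .withColumn("day", F.dayofmonth("ts")))

# Produces a /year=YYYY/month=MM/day=DD/ directory layout; a partitioned
# Hive table over this path prunes partitions for date-range queries.
(partitioned.write
 .partitionBy("year", "month", "day")
 .mode("overwrite")
 .parquet("/data/warehouse/events/"))
```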
Some questions you should also consider are:
How much data do you want to store in your cluster, e.g. the last 180 days? This will affect the number of nodes/disks. Beware that data is usually replicated 3 times.
How many files will you have in HDFS? When the number of files is high, the Namenode is hit hard when retrieving file metadata. Some solutions exist, such as a replicated Namenode or the MapR Hadoop distribution, which doesn't rely on a Namenode per se.

Mechanism for extracting data out of Cassandra for load into relational databases

We use Cassandra as the primary data store for our application, which collects a very large amount of data and requires a large amount of storage and very fast write throughput.
We plan to extract this data on a periodic basis and load it into a relational database (like MySQL). What extraction mechanisms exist that can scale to the tune of hundreds of millions of records daily? Expensive third-party ETL tools like Informatica are not an option for us.
So far my web searches have revealed only Hadoop with Pig or Hive as an option. However, being very new to this field, I am not sure how well they would scale, or how much load they would put on the Cassandra cluster itself while running. Are there other options as well?
You should take a look at Sqoop; it has an integration with Cassandra, as shown here.
This will also scale easily. You need a Hadoop cluster to get Sqoop working, and the way it works is basically:
Slice your dataset into different partitions.
Run a Map/Reduce job where each mapper is responsible for transferring one slice.
So the bigger the dataset you wish to export, the higher the number of mappers, which means that if you keep growing your cluster, the throughput will keep increasing. It's all a matter of what resources you have.
As for the load on the Cassandra cluster, I am not certain, since I have not personally used the Cassandra connector with Sqoop, but if you wish to extract data you will need to put some load on your cluster anyway. You could, for example, run it once a day at the time when traffic is lowest, so that if your Cassandra availability drops, the impact is minimal.
I'm also thinking that if this is related to your other question, you might want to consider exporting to Hive instead of MySQL, in which case Sqoop works too because it can load data into Hive directly. And once it's in Hive, you can use the same cluster Sqoop runs on for your analytics jobs.
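Since the Sqoop/Cassandra integration details live behind the link above, here is a hedged sketch of the same slice-and-transfer-in-parallel idea using Spark with the DataStax spark-cassandra-connector instead of Sqoop. The keyspace, table, hosts, and credentials are placeholders, and the connector jar has to be supplied separately (e.g. via --packages):

```python
# Sketch: parallel extract from Cassandra into MySQL using Spark.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-to-mysql")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

# The connector splits the read across Cassandra token ranges, so adding
# executors increases extraction throughput, much like adding mappers.
events = (spark.read
          .format("org.apache.spark.sql.cassandra")
          .options(keyspace="metrics", table="events")
          .load())

# Each partition is written to MySQL over JDBC.
(events.write
 .format("jdbc")
 .option("url", "jdbc:mysql://db-host:3306/warehouse")
 .option("dbtable", "events_extract")
 .option("user", "etl_user")
 .option("password", "...")
 .mode("append")
 .save())
```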
There is no way to extract data out of Cassandra other than paying for an ETL tool. I tried different ways, such as the COPY command and CQL queries -- all of these methods time out regardless of how I change the timeout parameters in cassandra.yaml. Cassandra experts say you cannot query the data without a WHERE clause. That is a big restriction for me, and it may be one of the main reasons not to use Cassandra, at least for me.