MySQL and HBase optimized data storage (1 TB)

I have approximately 600 million rows of data in 157 CSV files. The data is in the following format:
A: 8-digit int
B: 64-bit unsigned int
C: 140-character string
D: int
I will use the CSVs to load the data into a MySQL database and an HBase database, and I am trying to optimize the loading process. I need help with the following questions:
Should I use a single table to store all the data, or shard it into multiple tables?
What optimizations can I do to reduce load time?
How can I improve the overall performance of the database? Should I normalize the table to store the information?
I will be using one m1.large EC2 instance each to load the CSVs into the MySQL and HBase databases.
============UPDATE============
I used a c3.8xlarge instance and it took 2 hours to load 20 of the 157 CSV files (250 MB each). Eventually I had to stop it as it was taking too long. CPU utilization was only 2% throughout the entire time. If anyone can help, please do!

For HBase, you can use the standard CSV bulk load.
For MySQL, you will have to use the regular MySQL CSV load (LOAD DATA INFILE).
Whether to normalize the data is up to you.
Looking at your data structure, I think you probably don't need normalization.
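As a rough sketch of what that regular MySQL CSV load could look like for your layout (one table matching columns A-D, looping over the 157 files with LOAD DATA LOCAL INFILE): the table name, credentials, and paths below are made up, and the server needs local_infile enabled.

```python
# Minimal sketch: create one table matching the described row layout and
# bulk-load each of the CSV files with LOAD DATA LOCAL INFILE.
# All names, credentials and paths are placeholders.
import glob
import mysql.connector

conn = mysql.connector.connect(
    host="127.0.0.1",
    user="loader",
    password="secret",
    database="bulkload",
    allow_local_infile=True,   # client side; the server also needs local_infile=ON
)
cur = conn.cursor()

# A: 8-digit int, B: 64-bit unsigned int, C: 140-character string, D: int
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_rows (
        a INT,
        b BIGINT UNSIGNED,
        c VARCHAR(140),
        d INT
    ) ENGINE=InnoDB
""")

load_template = """
    LOAD DATA LOCAL INFILE '{path}'
    INTO TABLE raw_rows
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\\n'
    (a, b, c, d)
"""

# Loop over all 157 files, committing after each one so a failed load
# can be restarted at file granularity.
for path in sorted(glob.glob("/data/csv/*.csv")):
    cur.execute(load_template.format(path=path))
    conn.commit()
    print("loaded", path)

cur.close()
conn.close()
```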

Related

What to do with a 4 GB dataset? MySQL?

I always use MySQL for storing and operating on data.
But this time I have a CSV dataset of around 4 GB.
I have imported it into MySQL.
The import took about 2-3 hours.
It's one table with about 7,500,000 rows and a few columns.
The import time was long, and operating on this dataset with MySQL queries takes a long time too.
Am I really doing the right thing by using MySQL for this?
Maybe I should use something like a NoSQL database, or a serverless database?
I don't know if I am doing this the proper way.
What should I do with it? How should I operate on this dataset?

How can I increase the speed of loading CSV data into SQL Server?

I have used NiFi 0.6.1 with a combination of the GetFile + SplitText + ReplaceText processors to split CSV data of about 30 MB (300,000 rows).
GetFile is able to pass the 30 MB to SplitText very quickly.
SplitText + ReplaceText then takes 25 minutes to split the data into JSON.
Just 30 MB of data takes 25 minutes to get from CSV into SQL Server.
It performs the conversion byte by byte.
I have tried the Concurrent Tasks option on the processor; it helps a little, but it still takes a long time, and at that point CPU usage hits 100%.
How can I get the CSV data into SQL Server faster?
Your incoming CSV file has ~300,000 rows? You might try using multiple SplitText processors to break that down in stages. One big split can be very taxing on system resources, but dividing it into multiple stages can smooth out your flow. The typically recommended maximum is between 1,000 and 10,000 rows per split.
See this answer for more details.
You mention splitting the data into JSON, but you're using SplitText and ReplaceText. What does your incoming data look like? Are you trying to convert to JSON to use ConvertJSONtoSQL?
If you have CSV incoming, and you know the columns, SplitText should pretty quickly split the lines, and ReplaceText can be used to create an INSERT statement for use by PutSQL.
Alternatively, as @Tomalak mentioned, you could put the CSV file somewhere SQL Server can access it, then use PutSQL to issue a BULK INSERT statement.
If neither of these is sufficient, you could use ExecuteScript to perform the split, column parsing, and translation to SQL statement(s).
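Purely as a standalone illustration of the ReplaceText step (this is not NiFi code, and the table and column names are invented), each split line needs to become something PutSQL can execute, along these lines:

```python
# Standalone illustration (not NiFi code) of what ReplaceText needs to produce
# from each split CSV line: an INSERT statement that PutSQL can execute.
# Table and column names are invented; every value is emitted as a quoted
# literal for simplicity.
import csv
import io

TABLE = "dbo.incoming_rows"
COLUMNS = ("col_a", "col_b", "col_c")   # hypothetical column list

def line_to_insert(line: str) -> str:
    values = next(csv.reader(io.StringIO(line)))
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"INSERT INTO {TABLE} ({', '.join(COLUMNS)}) VALUES ({quoted});"

print(line_to_insert("1,hello,2.5"))
# INSERT INTO dbo.incoming_rows (col_a, col_b, col_c) VALUES ('1', 'hello', '2.5');
```

In NiFi itself, ReplaceText would do this transformation with a regex against the lines produced by SplitText.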

Optimizations for MySQL Data Import

So far, every time I have done a data load it was almost always on enterprise-grade hardware, so I guess I never realized that 3.4 million records is big. Now to the question...
I have a local MySQL server on a Windows 7, 64-bit, 4 GB RAM machine. I am importing a CSV through the standard 'Import into Table' functionality that ships with the developer package.
My data has around 3,422,000 rows and 18 columns. Three columns are of type double and the rest are all text. The CSV file is about 500 MB.
Both the data source (CSV) and the destination (MySQL) are on the same machine, so I guess there is no network bottleneck.
It took almost 7 hours to load 200,000 records. At this speed it might take me 4 days to load the entire dataset. Given the widespread popularity of MySQL, I think there has to be a better way to make the data load faster.
I only have the data in CSV format, and the rudimentary way I can think of is to split it into different blocks and try loading them, as sketched below.
Can you please suggest what optimizations I can do to speed this up?
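Something like the following is what I have in mind for the splitting (the block size and file name are just placeholders, and I'm assuming the CSV has a header row); each chunk could then be loaded and retried independently:

```python
# Sketch of the "split into blocks" idea: cut the big CSV into chunks of
# CHUNK_ROWS lines, repeating the header in each chunk, so every piece can be
# loaded (and retried) separately. Block size and file name are arbitrary.
import csv
import itertools

CHUNK_ROWS = 200_000
SRC = "big_export.csv"   # placeholder file name

with open(SRC, newline="", encoding="utf-8") as src:
    reader = csv.reader(src)
    header = next(reader)
    for i in itertools.count():
        rows = list(itertools.islice(reader, CHUNK_ROWS))
        if not rows:
            break
        with open(f"chunk_{i:03d}.csv", "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)
```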

Load very large CSV into Neo4j

I want to load a set of large RDF triple files into Neo4j. I have already written MapReduce code to read all the input N-Triples and output two CSV files: nodes.csv (7 GB, 90 million rows) and relationships.csv (15 GB, 120 million rows).
I tried the batch-import command from Neo4j v2.2.0-M01, but it crashes after loading around 30M rows of nodes. I have 16 GB of RAM in my machine, so I set wrapper.java.initmemory=4096 and wrapper.java.maxmemory=13000. I then decided to split nodes.csv and relationships.csv into smaller parts and run batch-import for each part, but I don't know how to merge the databases created from the multiple imports.
I appreciate any suggestion on how to load large CSV files into Neo4j.
I could finally load the data using the batch-import command in Neo4j 2.2.0-M02. It took 56 minutes in total. The issue preventing Neo4j from loading the CSV files was having \" in some values, which was interpreted as a quotation character to be included in the field value, and this broke the parsing of everything after that point.
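One way to clean that up before importing, as a rough sketch, is to rewrite the backslash-escaped quotes as standard doubled CSV quotes; whether that is the right rewrite for your data is an assumption.

```python
# Sketch: rewrite backslash-escaped quotes (\") as doubled quotes ("") so the
# importer's CSV parser sees well-formed quoted fields. Whether this is the
# right rewrite for your values is an assumption; file names are from above.
def clean(src_path: str, dst_path: str) -> None:
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.replace('\\"', '""'))

for name in ("nodes.csv", "relationships.csv"):
    clean(name, "clean_" + name)
```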
Why don't you try this approach (using Groovy): http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
You create a uniqueness constraint on the nodes, so duplicates won't be created.

MongoDB mongoimport: possible speed-up?

I have a 2 GB CSV file with 9M records that I import into MongoDB using the native mongoimport tool. It imports the CSV at a rate of 8K records per second, and the overall time taken is 10 minutes. The import speed is quite reasonable, but it is much slower than MySQL's LOAD DATA INFILE (which takes only 2 minutes to insert all of the records into the database). While this is acceptable (MongoDB is built for JSON-type documents, and the speed-ups are generally in querying rather than inserting), I would like to know if there is some way I can increase the number of inserts per second that mongoimport achieves.
I have only one computer with 8 GB RAM and 4 cores.
Thanks.
Since the majority of the time is likely spent serializing JSON objects into BSON (MongoDB's native format), you will likely get a faster import if you split up your file and run several parallel mongoimport jobs, each on a separate file.
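As a rough sketch of that (the database and collection names and the chunk paths are placeholders, and it assumes you have already split the CSV so that each piece keeps the header line):

```python
# Rough sketch of the parallel-import idea: one mongoimport process per
# pre-split piece of the CSV, run concurrently. Database, collection and
# paths are placeholders; each piece is assumed to keep the header line.
import glob
import subprocess

procs = []
for path in sorted(glob.glob("chunks/part_*.csv")):
    procs.append(subprocess.Popen([
        "mongoimport",
        "--db", "mydb",
        "--collection", "records",
        "--type", "csv",
        "--headerline",
        "--file", path,
    ]))

# Wait for all imports to finish.
for p in procs:
    p.wait()
```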