I have a 2 GB CSV file with 9M records that I import into MongoDB using the native mongoimport tool. It imports the CSV at a rate of 8K records per second, and the overall time taken is 10 minutes. The import speed is quite reasonable, but it is much slower than the MySQL LOAD DATA INFILE version (which takes only 2 minutes to insert all of the records into the database). While this is acceptable (MongoDB is built for JSON-type objects, and its speed-ups are generally in querying rather than inserting), I would like to know: is there some way I can speed up the number of inserts per second that mongoimport does in MongoDB?
I have only one computer with 8 GB RAM and 4 cores.
Thanks.
Since the majority of the time is likely spent serializing the records into BSON (MongoDB's native format), you will likely get a faster import if you can split up your file and run several mongoimport jobs in parallel, each on a separate file.
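A minimal sketch of that approach, driving parallel mongoimport jobs from Python (the file, database, and collection names are placeholders, and it assumes each chunk was written with its own header row):

import subprocess

# Hypothetical pre-split chunks of the original 2 GB CSV.
chunks = ["records_part1.csv", "records_part2.csv",
          "records_part3.csv", "records_part4.csv"]

procs = []
for chunk in chunks:
    cmd = [
        "mongoimport",
        "--db", "mydb",
        "--collection", "records",
        "--type", "csv",
        "--headerline",   # assumes every chunk starts with a header row
        "--file", chunk,
    ]
    procs.append(subprocess.Popen(cmd))

# Wait for all import jobs and report their exit codes.
for proc in procs:
    proc.wait()
    print(proc.args[-1], "exited with", proc.returncode)

With 4 cores, four parallel jobs is a reasonable starting point.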
Related
I'm trying to perform machine learning on a 20 GB dataset which is in .csv format.
One of the things that is advertised about Spark is its speed. I'm a bit new to Spark.
In a PySpark environment, if I simply perform spark.read.csv("file.csv"), it takes about 5 seconds over NFS and 1.5 seconds over HDFS. The problem with this is that the headers are labeled _C1, _C2, _C3 instead of using the actual headers of the dataset.
So I thought I'd try the following to read the dataset with its real headers:
spark.read.csv("file.csv", header=True, inferSchema=True)
The dataframe then has the appropriate schema; however, this takes 10 minutes over NFS and even longer on HDFS, which is slower than Pandas. Is there a configuration setting I need to change to make it go faster?
I've tried this dataset in a Dask environment, which takes only 4 seconds, and the Dask dataframe provides all the appropriate labels and headers, but my machine doesn't have enough memory to load all the data, which is why I can't use this option.
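For what it's worth, inferSchema makes Spark do an extra full pass over the CSV just to guess the column types; supplying an explicit schema avoids that pass while header=True still keeps the real column names. A minimal sketch, with hypothetical column names and types since the real headers aren't shown here:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("csv-read").getOrCreate()

# Hypothetical schema -- replace with the dataset's real columns and types.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("feature_1", DoubleType(), True),
    StructField("feature_2", DoubleType(), True),
    StructField("label", StringType(), True),
])

# header=True keeps the real column names; the explicit schema avoids the
# second pass over the data that inferSchema would otherwise trigger.
df = spark.read.csv("file.csv", header=True, schema=schema)
df.printSchema()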
Before diving into the actual coding, I am trying to understand the logistics around Spark.
I have server logs split into 10 CSVs of around 2 GB each.
I am looking for a way to extract some data, e.g. how many failures occurred in a 30-minute period per server.
(The logs have entries from multiple servers, i.e. there is no predefined ordering by time or by server.)
Is that something I could do with Spark?
If yes, would that mean I need a box with 20+ GB of RAM?
When I operate on RDDs in Spark, does it take the full dataset into account? E.g. would an operation that orders by timestamp and server id execute over the full 20 GB dataset?
Thanks!
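For reference, the kind of extraction described above (failure counts per server in 30-minute windows) can be sketched roughly like this in PySpark; the column names and the failure condition are assumptions about the log layout:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-failures").getOrCreate()

# Hypothetical log layout: timestamp, server_id, status -- adjust to the real logs.
logs = spark.read.csv("logs/*.csv", header=True, inferSchema=True)

failures_per_window = (
    logs
    .withColumn("ts", F.to_timestamp("timestamp"))          # ensure a timestamp type
    .where(F.col("status") == "FAILURE")                    # assumed failure marker
    .groupBy(F.window("ts", "30 minutes"), "server_id")     # 30-minute buckets
    .count()
)

failures_per_window.orderBy("window", "server_id").show(truncate=False)

Spark processes the data partition by partition and spills to disk when needed, so the whole 20 GB does not have to fit in RAM at once.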
In tests ingesting files directly from GCS into BigQuery, we get much better performance than with streaming inserts. However, the performance also fluctuates much more.
For example, we tested loading a large CSV into BQ (10M rows, 2 GB): it loaded in 2.275 minutes the first time but took ~8 minutes the second time. Why is there such a fluctuation in the import times?
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
Update: This turned out to be a change in a threshold value:
It turned out to depend on the MaxError property. The time the CSV imported in 2 minutes was when MaxError was set too low and some errors (such as overly long fields) prevented the CSV file from being parsed fully. I have since raised MaxError to 1000.
I tried it a couple of times, and it takes 7-8 minutes to complete parsing with this threshold value set.
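For context, that threshold corresponds to maxBadRecords in the load job configuration linked above; a hedged sketch of a GCS-to-BigQuery load with the Python client (the bucket, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1      # CSV header row
job_config.autodetect = True
job_config.max_bad_records = 1000     # the error threshold discussed above

# Placeholder URI and table id -- substitute your own.
uri = "gs://my-bucket/large-file.csv"
table_id = "my-project.my_dataset.my_table"

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()                     # block until the load job completes
print("Loaded", client.get_table(table_id).num_rows, "rows")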
A load is basically a query over federated data sources, with the results saved to the destination table. The performance of a query depends on the load of the backend system. Felipe explains this well in BigQuery Performance.
So far, every time I have done a data load it has almost always been on enterprise-grade hardware, so I guess I never realized that 3.4 million records is big. Now to the question...
I have a local MySQL server on a Windows 7, 64-bit, 4 GB RAM machine. I am importing a CSV through the standard 'Import into Table' functionality that ships with the Developer package.
My data has around 3,422,000 rows and 18 columns: 3 columns of type double and the rest all text. The CSV file is about 500 MB.
Both the data source (CSV) and the destination (MySQL) are on the same machine, so I guess there is no network bottleneck.
It took almost 7 hours to load 200,000 records. At this speed it might take me 4 days to load the entire dataset. Given the widespread popularity of MySQL, I think there has to be a better way to make the data load faster.
I have the data only in CSV format, and the only rudimentary approach I can think of is to split it into blocks and load them separately.
Can you please suggest what optimizations I can do to speed this up?
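For comparison, the bulk path the import wizard is usually measured against is LOAD DATA INFILE (the same statement mentioned in the MongoDB question above). A minimal Python sketch, assuming the target table already exists and local_infile is enabled on the server (connection details, table, and file names are placeholders):

import mysql.connector

# allow_local_infile must be enabled on the client; the server must allow it too.
conn = mysql.connector.connect(
    host="localhost",
    user="root",
    password="secret",
    database="mydb",
    allow_local_infile=True,
)

load_sql = """
    LOAD DATA LOCAL INFILE 'data.csv'
    INTO TABLE my_table
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES
"""

cursor = conn.cursor()
cursor.execute(load_sql)
conn.commit()
print("Rows loaded:", cursor.rowcount)
cursor.close()
conn.close()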
I have approximately 600 million rows of data in 157 CSV files. The data is in the following format:
A: 8-digit int
B: 64-bit unsigned int
C: 140-character string
D: int
I will use the CSVs to load the data into MySQL and HBase databases. I am deciding how to optimize the loading process, and I need help with the following questions:
Should I use a single table to store all the data, or shard it into multiple tables?
What optimizations can I do to reduce load time?
How can I improve the overall performance of the database? Should I normalize the table to store the information?
I will be using an M1.Large EC2 instance each to load the CSVs into the MySQL and HBase databases.
============UPDATE============
I used a C3.8XLarge instance and it took 2 hours to load 20 of the CSV files (157 total) of 250 MB each. Eventually I had to stop it as it was taking too long. The CPU utilization was only 2% throughout the entire time. If anyone can help, please do!
For HBase, you can use the standard CSV bulk load (see the sketch after this answer).
For MySQL, you will have to use the regular MySQL CSV load (LOAD DATA INFILE).
Normalizing the data is up to you.
Looking at your data structure, I think you probably don't need normalization.
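A rough sketch of what the HBase CSV bulk load mentioned above might look like, driving the stock ImportTsv tool from Python (the table name, column family, and HDFS paths are placeholders, and the exact class path of the bulk-load completion tool varies between HBase versions):

import subprocess

# Step 1: run HBase's built-in ImportTsv MapReduce job over the CSVs,
# writing HFiles instead of doing puts directly.
subprocess.run([
    "hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
    "-Dimporttsv.separator=,",                        # CSV instead of the default tab
    "-Dimporttsv.columns=HBASE_ROW_KEY,cf:b,cf:c,cf:d",
    "-Dimporttsv.bulk.output=hdfs:///tmp/hfiles",
    "my_table",
    "hdfs:///data/csv/",
], check=True)

# Step 2: complete the bulk load by moving the generated HFiles into the table.
# (The class path varies by HBase version.)
subprocess.run([
    "hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
    "hdfs:///tmp/hfiles", "my_table",
], check=True)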