I'm trying to perform machine learning on a 20 GB dataset which is in .csv format.
One of the things advertised about Spark is its speed. I'm a bit new to Spark.
In a PySpark environment, if I simply perform a spark.read.csv(file.csv), it takes about 5 seconds over NFS and 1.5 seconds over HDFS. The problem with this is that the columns are labeled _c0, _c1, _c2 instead of using the actual headers of the dataset.
So I thought I'd try the following to get the dataset to read:
spark.read.csv(path="file.csv", header=True, inferSchema=True)
The DataFrame then has the appropriate schema, but the read takes about 10 minutes over NFS and even longer over HDFS, which is slower than pandas. Is there a configuration setting I need to change to make it go faster?
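One thing I'm considering (not tested yet) is passing an explicit schema instead of inferring it, so Spark doesn't have to scan the whole file to work out the types. A rough sketch, with placeholder column names standing in for my actual headers:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("csv-load").getOrCreate()

# Placeholder column names and types; the real file has different headers.
schema = StructType([
    StructField("id", StringType(), True),
    StructField("feature_1", DoubleType(), True),
    StructField("feature_2", DoubleType(), True),
])

# Supplying the schema avoids the full type-inference scan; header=True just skips the header row.
df = spark.read.csv("file.csv", header=True, schema=schema)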
I've also tried this dataset in a Dask environment, where the read takes only 4 seconds and the Dask DataFrame provides all the appropriate labels and headers, but my machine doesn't have enough memory to load all the data, which is why I can't use this option.
Related
In tests ingesting files directly from GCS into BigQuery, we get much better performance than with streaming inserts. However, the performance also fluctuates much more.
For example, we tested loading a large CSV into BigQuery (10M rows, 2 GB): it loaded in 2.275 minutes the first time but took ~8 minutes the second time. Why is there such a fluctuation in the import times?
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
Update: This turned out to be a change in a threshold value:
It turned out to depend on the MaxError property. The time the CSV imported in 2 minutes was when MaxError was too low and some errors (like overly long fields) prevented the CSV file from being parsed fully. I have since raised MaxError to 1000.
I tried a couple of times, and it takes 7-8 minutes to complete parsing with this threshold value set.
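For reference, this is roughly how that threshold can be set when submitting a load job through the BigQuery Python client, assuming the MaxError value I was adjusting corresponds to the load configuration's maxBadRecords field; the bucket, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# max_bad_records (maxBadRecords in the REST config) is the number of bad rows tolerated before the job fails.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    max_bad_records=1000,  # analogous to raising MaxError to 1000
)

# Placeholder GCS URI and destination table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete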
Load is basically a query on federated data sources, with the results saved to the destination table. Performance of a query is dependent on the load of the backend system. Felipe explains this well in BigQuery Performance.
I am using NiFi 0.6.1 with a combination of the GetFile + SplitText + ReplaceText processors to split a 30 MB CSV file (300,000 rows).
GetFile is able to pass the 30 MB file to SplitText very quickly.
SplitText + ReplaceText take 25 minutes to split the data into JSON.
Just 30 MB of data is taking 25 minutes to store the CSV into SQL Server.
It performs the conversion byte by byte.
I have tried the Concurrent Tasks option on the processor. It can speed things up, but it still takes a long time, and at that point CPU usage reaches 100%.
How can I load the CSV data into SQL Server faster?
Your incoming CSV file has ~300,000 rows? You might try using multiple SplitText processors to break that down in stages. One big split can be very taxing on system resources, but dividing it into multiple stages can smooth out your flow. The typically recommended maximum is between 1,000 and 10,000 lines per split.
See this answer for more details.
You mention splitting the data into JSON, but you're using SplitText and ReplaceText. What does your incoming data look like? Are you trying to convert to JSON to use ConvertJSONtoSQL?
If you have CSV incoming, and you know the columns, SplitText should pretty quickly split the lines, and ReplaceText can be used to create an INSERT statement for use by PutSQL.
Alternatively, as @Tomalak mentioned, you could try to put the CSV file somewhere SQL Server can access it, then use PutSQL to issue a BULK INSERT statement.
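If it helps, the statement PutSQL would issue in that case looks roughly like the sketch below; I'm showing it driven from Python with pyodbc only so it can be tested outside NiFi, and the connection string, table name, and file path are all placeholders (the file must be readable by the SQL Server service account):

import pyodbc

# Placeholder connection string.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes;"
)
cur = conn.cursor()

# Placeholder table and path; FIRSTROW = 2 skips the CSV header line.
cur.execute(r"""
    BULK INSERT dbo.my_table
    FROM 'C:\data\input.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2)
""")
conn.commit()
conn.close()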
If neither of these is sufficient, you could use ExecuteScript to perform the split, column parsing, and translation to SQL statement(s).
I'm using the neo4j-import command line tool to load large CSV files into Neo4j. I've tested the command line with a subset of the data and it works well. The CSV file is about 200 GB, containing ~10M nodes and ~B relationships. Currently I'm using the default Neo4j configuration, and it takes hours to create the nodes; it got stuck at [*SORT:20.89 GB-------------------------------------------------------------------------------] 0. I'm worried that it will take even longer to create the relationships, so I would like to know possible ways to speed up the data import.
It's a 16GB machine, and the neo4j-import output message shows the following.
free machine memory: 166.94 MB
Max heap memory : 3.48 GB
Should I change the Neo4j configuration to increase memory? Will it help?
I'm running neo4j-import with --processors 8. However, the CPU usage of the Java process is only about 1%. Does that look right?
Can someone give me a ballpark loading time, given the size of my dataset? It's an 8-core, 16 GB memory standalone machine.
Anything else I should look at to speedup the data import?
Updated:
The machine does not have an SSD disk.
I ran the top command, and it shows that 85% of RAM is being used by the Java process, which I think belongs to the neo4j-import command.
The import command is: neo4j-import --into /var/lib/neo4j/data/graph.db/ --nodes:Post Posts_Header.csv,posts.csv --nodes:User User_Header.csv,likes.csv --relationships:LIKES Likes_Header.csv,likes.csv --skip-duplicate-nodes true --bad-tolerance 100000000 --processors 8
Posts_Header: Post_ID:ID(Post),Message:string,Created_Time:string,Num_Of_Shares:int,e:IGNORE,f:IGNORE
User_Header: a:IGNORE,User_Name:string,User_ID:ID(User)
Likes_Header: :END_ID(Post),b:IGNORE,:START_ID(User)
I ran the sample data import and it's pretty fast, just a few seconds. Since I'm using the default Neo4j heap setting and the default Java memory settings, will it help if I configure these numbers?
Some questions:
What kind of disk do you have? (SSD is preferable.)
It also seems all your RAM is already used up; check with top or ps which other processes are using the memory, and kill them.
Can you share the full neo4j-import command?
What does a sample of your CSV and the header line look like?
It seems that you have a lot of properties. Are they all properly quoted? Do you really need all of them in the graph?
Try with a sample first, like head -100000 file.csv > file100k.csv
Usually it can import 1M records/s with a fast disk.
That includes node, property, and relationship records.
So far, every time I have done a data load it was almost always on enterprise-grade hardware, so I guess I never realized that 3.4 million records is big. Now to the question...
I have a local MySQL server on a Windows 7, 64-bit, 4 GB RAM machine. I am importing a CSV through the standard 'Import into Table' functionality that ships with the Developer package.
My data has around 3,422,000 rows and 18 columns: 3 columns of type double and the rest all text. The CSV file is about 500 MB.
Both the data source (CSV) and the destination (MySQL) are on the same machine, so I assume there is no network bottleneck.
It took almost 7 hours to load 200,000 records. At this speed it might take me 4 days to load the entire dataset. Given the widespread popularity of MySQL, I think there has to be a better way to make the data load faster.
I only have the data in CSV format, and the rudimentary approach I can think of is to split it into blocks and load them separately.
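The other thing I keep seeing recommended, but have not tried yet, is MySQL's LOAD DATA INFILE bulk loader. A rough, untested sketch of what I mean, using mysql-connector-python with placeholder table, file path, and connection details:

import mysql.connector

# Placeholder connection details; allow_local_infile is required for LOCAL loads,
# and the server may also need local_infile enabled.
conn = mysql.connector.connect(
    host="localhost", user="root", password="secret",
    database="mydb", allow_local_infile=True,
)
cur = conn.cursor()

# Placeholder table name and path; IGNORE 1 LINES skips the CSV header row.
cur.execute(r"""
    LOAD DATA LOCAL INFILE 'C:/data/mydata.csv'
    INTO TABLE my_table
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES
""")
conn.commit()
cur.close()
conn.close()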
Can you please suggest what optimizations I can do to speed this up?
I have a 2 GB CSV file with 9M records that I import into MongoDB using the native mongoimport tool. It imports the CSV at a rate of 8K records per second, and the overall time taken is 10 minutes. The speed of import is quite reasonable, but it seems to be much slower than the MySQL LOAD DATA INFILE version (which takes only 2 minutes to insert all of the records into the database). While this is acceptable (MongoDB is built for JSON-type objects, and speed-ups are generally in querying rather than inserting), I would like to know if there is some way I can speed up the number of inserts per second done by mongoimport.
I have only one computer with 8 GB RAM and 4 cores.
Thanks.
Since the majority of the time is likely spent serializing JSON objects into BSON (MongoDB's native format), you will likely get a faster import if you can split up your file and have several parallel jobs, each running mongoimport with a separate file.
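A minimal sketch of that parallel approach, assuming the CSV has already been split into chunk files that each keep the header line (the database, collection, and chunk file names below are placeholders):

import subprocess

# Placeholder chunk files, each expected to start with the same CSV header line.
chunks = ["records_part1.csv", "records_part2.csv",
          "records_part3.csv", "records_part4.csv"]

# Start one mongoimport process per chunk so the CSV-to-BSON work runs in parallel.
procs = [
    subprocess.Popen([
        "mongoimport",
        "--db", "mydb",
        "--collection", "records",
        "--type", "csv",
        "--headerline",
        "--file", chunk,
    ])
    for chunk in chunks
]

# Wait for every import job to finish.
for p in procs:
    p.wait()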