I have a CSV file of 1.9M rows (187MB), and when I try to LOAD CSV it I get TransientError: There is not enough memory to perform the current task.
I increased dbms.memory.heap.max_size as the error message suggested, setting the initial heap size to 4G and the max size to 32G.
So my question is: how much memory do I need to load this dataset, which as I understand it is not that big? Is it even possible on a home computer with 16G of RAM?
Many thanks for any help.
If you are not already specifying USING PERIODIC COMMIT, as indicated by the dev manual for your data size, you should. That would allow LOAD CSV to process your data in smaller chunks instead of trying to do everything in a single transaction, which is likely why you are running out of memory.
Here is a simple example:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///foo.csv' AS line
CREATE (:Person { name: line[1], address: line[2] });
The load is taking too long, and I have no way of knowing whether the data will be loaded as expected once it finishes. Can I query the table to at least check that the data is being loaded correctly? Is there a way of seeing some rows while the load is running?
If we assume you are using the LOAD DATA INFILE statement to do the bulk load, then the answer is no, the bulk-load executes atomically. This means no other session can see the result of the bulk-load until it is complete. If it fails for some reason, the entire dataset is rolled back.
If you want to see incremental progress, you would need to use some client that reads the CSV file and inserts individual rows (or at least subsets of rows) and commits the inserts at intervals.
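A client of that kind could look roughly like the sketch below. It is only an illustration under stated assumptions: `load_in_batches`, the `insert_sql` parameter and the batch size are hypothetical names chosen here, and `conn` is assumed to be any Python DB-API connection. The placeholder syntax inside `insert_sql` depends on the driver (for example `?` for sqlite3, `%s` for most MySQL drivers).

```python
import csv


def load_in_batches(conn, csv_path, insert_sql, batch_size=10_000):
    """Insert CSV rows through a DB-API connection, committing per batch.

    Returns the number of rows inserted. Each commit makes the rows
    loaded so far visible to other sessions.
    """
    cur = conn.cursor()
    inserted = 0
    batch = []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) >= batch_size:
                cur.executemany(insert_sql, batch)
                conn.commit()  # progress can be checked from another session now
                inserted += len(batch)
                batch = []
    if batch:  # flush the final partial batch
        cur.executemany(insert_sql, batch)
        conn.commit()
        inserted += len(batch)
    return inserted
```

Row-by-row inserts like this are usually much slower than a bulk load, so this trades speed for visibility.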
Or you could use LOAD DATA INFILE if you split your CSV input file into multiple smaller files, so you can load them in batches. If you just want to test if the loading is done properly, you should start with a much smaller file and load that.
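One way to split the input is sketched below. It is a minimal example, not a definitive tool: `split_csv` and its parameters are hypothetical names, and it assumes the file has no header row and that no quoted field contains an embedded newline, since it cuts on raw line boundaries. Lines are copied as bytes, so quoting and line endings stay exactly as in the original file.

```python
from pathlib import Path


def split_csv(src, out_dir, lines_per_file=100_000):
    """Split src into numbered part files of at most lines_per_file lines.

    Returns the list of part file paths, in order.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    out = None
    try:
        with src.open("rb") as f:  # bytes mode: lines are copied unchanged
            for i, line in enumerate(f):
                if i % lines_per_file == 0:  # time to start a new part file
                    if out is not None:
                        out.close()
                    part = out_dir / f"{src.stem}_part{len(parts):04d}{src.suffix}"
                    out = part.open("wb")
                    parts.append(part)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

Each part can then be loaded with its own LOAD DATA INFILE statement, and the table can be queried between batches.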
I'm batch-loading a 15GB CSV (30 million rows) into a mysql-8 database.
Problem: the task takes about 20 min, with an approximate throughput of 15-20 MB/s, while the hard drive is capable of transferring files at 150 MB/s.
I have a RAM disk of 20GB, which holds my csv. Import as follows:
mysqlimport --user="root" --password="pass" --local --use-threads=8 mytable /tmp/mydata.csv
This uses LOAD DATA under the hood.
My target table does not have any indexes, but approx 100 columns (I cannot change this).
What is strange: I tried tweaking several config parameters as follows in /etc/mysql/my.cnf, but they did not give any significant improvement:
log_bin=OFF
skip-log-bin
innodb_buffer_pool_size=20G
tmp_table_size=20G
max_heap_table_size=20G
innodb_log_buffer_size=4M
innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0
innodb_autoinc_lock_mode=2
Question: does LOAD DATA / mysqlimport respect those config changes? Or does it bypass? Or did I use the correct configuration file at all?
At least a select on the variables shows they are correctly loaded by the mysql server. For example show variables like 'innodb_doublewrite' shows OFF
Anyways, how could I improve import speed further? Or is my database the bottleneck and there is no way to overcome the 15-20 MB/s threshold?
Update:
Interestingly if I import my csv from harddrive into the ramdisk, performance is almost the same (just a little bit better, but never over 25 MB/s). I also tested the same amount of rows, but only with a few (5) columns. And there I'm getting to about 80 MB/s. So clearly the number of columns is the bottleneck? But why do more columns slow down this process?
The MySQL/MariaDB engines have little parallelization for bulk inserts: a LOAD DATA statement can use only one CPU core. If you monitor CPU utilization during the load, you will probably see one core fully utilized. That core can only produce so much output data, which leaves the disk throughput underutilized.
The most recent version of MySQL has new parallel load feature: https://dev.mysql.com/doc/mysql-shell/8.0/en/mysql-shell-utilities-parallel-table.html . It looks promising but probably hasn't received much feedback yet. I'm not sure it would help in your case.
I saw various checklists on the internet that recommended having higher values in the following config params: log_buffer_size, log_file_size, write_io_threads, bulk_insert_buffer_size . But the benefits were not very pronounced when I performed comparison tests (maybe 10-20% faster than just innodb_buffer_pool_size being large enough).
This could be normal. Let's walk through what is being done:
The csv file is being read from a RAM disk, so no IOPs are being used.
Are you using InnoDB? If so, the data is going into the buffer_pool. As blocks are being built there, they are being marked 'dirty' for eventual flushing to disk.
Since the buffer_pool is large, but probably not as large as the table will become, some of the blocks will need to be flushed before it finishes reading all the data.
After all the data is read, and the table is finished, the dirty blocks will gradually be flushed to disk.
If you had non-unique indexes, they would similarly be written to disk in a delayed manner (cf 'Change buffering'). The change_buffer occupies 25% of the buffer_pool by default.
How large is the resulting table? It may be significantly larger, or even smaller, than the 15GB of the csv file.
How much time did it take to bring the csv file into the ram disk? I proffer that that was wasted time and it should have been read from disk while doing the LOAD DATA; that I/O can be overlapped.
Please provide the output of SHOW GLOBAL VARIABLES LIKE 'innodb%'; there are several others that may be relevant.
More
These are terrible:
tmp_table_size=20G
max_heap_table_size=20G
If you have a complex query, 20GB could be allocated in RAM, possibly multiple times! Keep those to under 1% of RAM.
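As a quick worked example of the 1% guideline, the sketch below computes the cap for a given amount of RAM. The 64 GiB machine size is a hypothetical figure (the question does not state the total RAM), and `temp_table_cap_bytes` is just an illustrative helper name.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def temp_table_cap_bytes(total_ram_bytes, percent=1):
    """Upper bound for tmp_table_size / max_heap_table_size as a share of RAM."""
    return total_ram_bytes * percent // 100


cap = temp_table_cap_bytes(64 * GIB)  # 64 GiB is a hypothetical machine size
print(f"{cap // MIB}M")  # prints 655M
```

On such a machine both settings would therefore be a few hundred megabytes at most, not 20G.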
If copying the csv from hard disk to ram disk runs slowly, I would suspect the validity of 150 MB/s.
If you are loading the table once every 6 hours, and it takes 1/3 of an hour to perform, I don't see the urgency of making it faster. OTOH, there may be something worth looking into. If that 20 minutes is downtime due to the table being locked, that can be easily eliminated:
CREATE TABLE t LIKE real_table;
LOAD DATA INFILE '...' INTO TABLE t ...; -- not blocking anyone
RENAME TABLE real_table TO old, t TO real_table; -- atomic; fast
DROP TABLE old;
In tests ingesting files directly from GCS into BigQuery, we get much better performance than with streaming inserts. However, the performance also fluctuates much more.
For example, we tested loading large CSV into BQ (10M rows, 2GB): loaded in 2.275 min the first time but ~ 8 minutes the second time. Why is there such a fluctuation in the import times?
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
Update: This turned out to be a change in a threshold value:
It turned out to depend on the MaxError property. The time the CSV was imported in 2 min, MaxError was too low, and some errors (such as overly long fields) prevented the CSV file from being parsed fully. I have since raised MaxError to 1000.
I tried a couple of times, and it takes 7-8 minutes to complete parsing with this threshold value set.
Load is basically a query on federated data sources, with the results saved to the destination table. Performance of a query is dependent on the load of the backend system. Felipe explains this well in BigQuery Performance.
I'm using the neo4j-import command line tool to load large CSV files into Neo4j. I've tested it with a subset of the data and it works well. The CSV file is about 200G, containing ~10M nodes and ~B relationships. Currently I'm using the default Neo4j configuration; it takes hours to create the nodes, and it got stuck at:
[*SORT:20.89 GB-------------------------------------------------------------------------------] 0
I'm worried that creating the relationships will take even longer, so I would like to know possible ways to speed up the data import.
It's a 16GB machine, and the neo4j-import output message shows the following.
free machine memory: 166.94 MB
Max heap memory : 3.48 GB
Should I change neo4j configuration to increase memory? Will it help?
I'm setting neo4j-import --processors 8. However, the CPU usage of the Java process is only about 1%. Does that look right?
Can someone give me a ballpark number for the loading time, given the size of my dataset? It's an 8-core, 16GB memory standalone machine.
Anything else I should look at to speed up the data import?
Updated:
The machine does not have SSD disk
I run top command, and it shows that 85% of RAM is being used by the JAVA process, which I think belongs to the neo4j-import command.
The import command is: neo4j-import --into /var/lib/neo4j/data/graph.db/ --nodes:Post Posts_Header.csv,posts.csv --nodes:User User_Header.csv,likes.csv --relationships:LIKES Likes_Header.csv,likes.csv --skip-duplicate-nodes true --bad-tolerance 100000000 --processors 8
The headers are:
Posts_Header: Post_ID:ID(Post),Message:string,Created_Time:string,Num_Of_Shares:int,e:IGNORE, f:IGNORE
User_Header: a:IGNORE,User_Name:string,User_ID:ID(User)
Likes_Header: :END_ID(Post),b:IGNORE,:START_ID(User)
I ran the sample data import and it's pretty fast, like several seconds. Since I use the default neo4j heap setting and default Java memory setting, will it help if I configure these numbers?
Some questions:
What kind of disk do you have? (SSD is preferable.)
It also seems that all your RAM is already used up. Check with top or ps which other processes are using the memory and kill them.
Can you share the full neo4j-import command?
What does a sample of your CSV and the header line look like?
It seems that you have a lot of properties? Are they all properly quoted? Do you really need all of them in the graph?
Try with a sample first, like head -100000 file.csv > file100k.csv
Usually it can import 1M records / s, with a fast disk.
That includes nodes, property and relationship-records.
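For a rough ballpark, that rate can be turned into a time estimate as in the sketch below. The function name and the record counts in the example are hypothetical, and the 1M records/s figure only applies with a fast disk and enough free memory, so treat the result as a lower bound.

```python
def estimate_import_seconds(nodes, relationships, properties,
                            records_per_second=1_000_000):
    """Estimate import time from the total record count and a records/s rate."""
    total_records = nodes + relationships + properties
    return total_records / records_per_second


# Hypothetical counts: 10M nodes with 3 properties each, relationships left out.
print(estimate_import_seconds(10_000_000, 0, 30_000_000))  # prints 40.0
```

If an import of that size takes hours instead of seconds, the bottleneck is more likely the disk or memory pressure than the record count itself.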
I'm trying to create a database with data collected from Google n-grams. It's actually a lot of data, but after the creation of the CSV files the insertion was pretty fast. The problem is that, immediately after the insertion, the neo4j-import tool indexes the data, and this step is taking too much time. It's been more than an hour and it looks like it has reached 10% progress.
Nodes
[*>:9.85 MB/s---------------|PROPERTIES(2)====|NODE:198.36 MB--|LABE|v:22.63 MB/s-------------] 25M
Done in 4m 54s 828ms
Prepare node index
[*SORT:295.94 MB-------------------------------------------------------------------------------] 26M
This is the console info atm. Does anyone have a suggestion about what to do to speed up this process?
Thank you. (:
How long indexing takes depends on the number of nodes. I tried indexing 10 million nodes and it took around 35 minutes, but you can still try these settings:
Increase your page cache size, which is set in the '/var/lib/neo4j/conf/neo4j.properties' file (on my Ubuntu system). Edit the following line:
dbms.pagecache.memory=4g
Set the size according to your RAM; here, 4g means 4 GB. You can also try changing the Java memory size, which is set in neo4j-wrapper.conf:
wrapper.java.initmemory=1024
wrapper.java.maxmemory=1024
You can also read neo4j documentation on this - http://neo4j.com/docs/stable/configuration-io-examples.html