I want to load a set of large RDF triple files into Neo4j. I have already written a MapReduce job to read all the input N-Triples and output two CSV files: nodes.csv (7GB, 90 million rows) and relationships.csv (15GB, 120 million rows).
I tried the batch-import command from Neo4j v2.2.0-M01, but it crashes after loading around 30 million node rows. My machine has 16GB of RAM, so I set wrapper.java.initmemory=4096 and wrapper.java.maxmemory=13000. I then decided to split nodes.csv and relationships.csv into smaller parts and run batch-import on each part, but I don't know how to merge the databases created by the separate imports.
I'd appreciate any suggestions on how to load large CSV files into Neo4j.
I was finally able to load the data using the batch-import command in Neo4j 2.2.0-M02; it took 56 minutes in total. The issue preventing Neo4j from loading the CSV files was that some values contained \", which was interpreted as a quotation character to be included in the field value, and this threw off the parsing of everything from that point forward.
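For anyone hitting the same problem, here is a minimal pre-cleaning sketch in Python. It is not the exact fix applied above; the file names, the tab delimiter, and UTF-8 encoding are assumptions. It simply drops the backslash escapes and lets a standard CSV writer re-quote fields the way most parsers expect (doubled quotes, per RFC 4180):

```python
# Sketch only: rewrite the CSVs so embedded quotes are escaped in the
# standard CSV way instead of as \" which confused the importer.
# File names, delimiter, and encoding are assumptions, not from the post.
import csv

def sanitize(src_path, dst_path, delimiter="\t"):
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        # Read quotes literally, then let the writer re-quote correctly.
        reader = csv.reader(src, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
        for row in reader:
            writer.writerow([field.replace('\\"', '"') for field in row])

sanitize("nodes.csv", "nodes_clean.csv")
sanitize("relationships.csv", "relationships_clean.csv")
```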
Why don't you try this approach (using Groovy): http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
It creates a uniqueness constraint on the nodes, so duplicates won't be created.
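The linked post does this in Groovy; just to illustrate the uniqueness-constraint idea itself, here is a small Python sketch using the official Neo4j driver. The label and property names are my own placeholders, and the constraint syntax shown is for recent Neo4j releases (2.x used CREATE CONSTRAINT ON ... ASSERT ... IS UNIQUE):

```python
# Sketch: a uniqueness constraint plus MERGE prevents duplicate nodes during
# import. The Resource label and uri property are hypothetical examples.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Syntax for current Neo4j; older 2.x releases used
    # CREATE CONSTRAINT ON (n:Resource) ASSERT n.uri IS UNIQUE
    session.run(
        "CREATE CONSTRAINT resource_uri IF NOT EXISTS "
        "FOR (n:Resource) REQUIRE n.uri IS UNIQUE"
    )
    # MERGE reuses the existing node for a repeated uri instead of
    # creating a duplicate.
    session.run(
        "MERGE (n:Resource {uri: $uri}) SET n.name = $name",
        uri="http://example.org/resource/1",
        name="Example resource",
    )

driver.close()
```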
In tests ingesting files directly from GCS into BigQuery, we get much better performance than with streaming inserts. However, the performance also fluctuates much more.
For example, we tested loading a large CSV into BigQuery (10M rows, 2GB): it loaded in 2.275 minutes the first time but took ~8 minutes the second time. Why is there such a fluctuation in the import times?
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.load
Update: this turned out to come down to a threshold value:
It turned out to depend on the MaxError property. The run where the CSV imported in ~2 minutes was when MaxError was set too low and some errors (such as overly long fields) prevented the CSV file from being parsed fully. I have since raised MaxError to 1000.
I've tried a couple of times, and with this threshold set it takes 7-8 minutes to finish parsing.
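For reference, a hedged sketch of setting that threshold with the BigQuery Python client; the project, dataset, table, and bucket names are placeholders, and max_bad_records corresponds to the maxBadRecords field of the load job configuration linked above:

```python
# Sketch (not from the original post): load a CSV from GCS, allowing up to
# 1000 bad rows before the job fails. All resource names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    max_bad_records=1000,  # maxBadRecords in the REST API
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/big-file.csv",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # waits for completion; raises if the job ultimately fails
print(load_job.output_rows, "rows loaded")
```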
A load job is basically a query over federated data sources, with the results saved to the destination table. Query performance depends on the load of the backend system. Felipe explains this well in BigQuery Performance.
I have used NiFi 0.6.1 with a combination of the GetFile, SplitText, and ReplaceText processors to split CSV data that is 30MB (300,000 rows).
GetFile is able to pass the 30MB file to SplitText very quickly.
SplitText + ReplaceText then take 25 minutes to split the data into JSON.
Just 30MB of data takes 25 minutes to get the CSV stored into SQL Server.
It performs the conversion byte by byte.
I have tried the Concurrent Tasks option on the processor. It speeds things up somewhat, but it still takes a long time, and at that point CPU usage hits 100%.
How can I get the CSV data into SQL Server faster?
Your incoming CSV file has ~300,000 rows? You might try using multiple SplitText processors to break that down in stages. One big split can be very taxing on system resources, but dividing it into multiple stages can smooth out your flow. The typically recommended maximum is between 1,000 and 10,000 lines per split.
See this answer for more details.
You mention splitting the data into JSON, but you're using SplitText and ReplaceText. What does your incoming data look like? Are you trying to convert to JSON to use ConvertJSONtoSQL?
If you have CSV incoming, and you know the columns, SplitText should pretty quickly split the lines, and ReplaceText can be used to create an INSERT statement for use by PutSQL.
Alternatively, as @Tomalak mentioned, you could put the CSV file somewhere SQL Server can access it, then use PutSQL to issue a BULK INSERT statement.
If neither of these is sufficient, you could use ExecuteScript to perform the split, column parsing, and translation to SQL statement(s).
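Purely as an illustration of the BULK INSERT route above (in NiFi the same statement would go through PutSQL), here is a Python/pyodbc sketch. The connection string, table name, and file path are made up, and the file has to be readable by the SQL Server machine itself:

```python
# Illustration only: issue the BULK INSERT from Python with pyodbc.
# Connection string, table name, and file path are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=sqlhost;DATABASE=staging;UID=loader;PWD=secret",
    autocommit=True,
)

bulk_insert = """
BULK INSERT dbo.incoming_rows
FROM 'C:\\data\\incoming.csv'      -- path as seen by the SQL Server machine
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\\n',
    FIRSTROW        = 2            -- skip the header line
)
"""

cur = conn.cursor()
cur.execute(bulk_insert)
cur.close()
conn.close()
```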
Every month, I do some analysis on a customer database. My predecessor would create a segment in Eloqua (our CRM) for each country and then spend about 10 (tedious, slow) hours refreshing them all. When I took over, I knew I wouldn't be able to do it in Excel (we had over 10 million customers), so I used Access.
This process has worked pretty well. We're now up to 12 million records, and it's still going strong. However, when importing the master list of customers prior to doing any work on it, the database balloons. This month it hit 1.3 GB.
Now, I'm not importing ALL of my columns - only 3. And Access freezes if I try to do my manipulations on a linked table. What can I do to reduce the size of my database during import? My source files are linked CSVs with only the bare minimum of columns; after I import the data, my next steps have to be:
Manipulate the data to get counts instead of individual lines
Store the manipulated data (only a few hundred KB)
Empty my imported table
Compact and Repair
This wouldn't be a problem, but I have to do all of this 8 times (8 segments, each showing a different portion of the database), and the 2GB limit is looming on the horizon.
An alternate question might be: how can I simulate / re-create the "Linked Table" functionality in MySQL/MariaDB/some other free database?
For such a large number of records, MS Access with its 2 GB limit is not a good solution for data storage. I would use MySQL as the backend:
Create a table in MySQL and link it to MS Access
Import the CSV data directly into the MySQL table using MySQL's native import features (see the sketch after this list). Of course, Access can be used for the import, but it will be slower.
Use Access for the data analysis, treating the linked MySQL table as a regular table.
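A rough sketch of that native-import step with mysql-connector-python; the table, column, and file names are invented (the question only says three columns are imported):

```python
# Sketch: LOAD DATA LOCAL INFILE into the MySQL table that Access links to.
# Table, column, and file names are hypothetical placeholders.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost",
    user="analyst",
    password="secret",
    database="customers_db",
    allow_local_infile=True,  # LOCAL must also be enabled on the server
)

load_sql = """
LOAD DATA LOCAL INFILE '/data/master_customers.csv'
INTO TABLE customers
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
(email, country, signup_date)
"""

cur = conn.cursor()
cur.execute(load_sql)
conn.commit()
cur.close()
conn.close()
```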
You could import the CSV to a (new/empty) separate Access database file.
Then, in your current application, link the table from that file. Access will not freeze during your operations as it will when linking text files directly.
I'm working on importing a CSV into a MySQL table that has roughly 11.8 million rows. Right now I'm using MySQL's LOAD DATA LOCAL INFILE to load the data. This part is fine; it takes roughly 10 minutes to load all 11.8 million rows.
The problem comes afterwards, when I need to update all of the records with data from another table. I have a working process (currently running and nearing the 24-hour mark) that runs an UPDATE across the whole table, and I'm starting to question whether there isn't a faster way of doing this.
One idea I had: since the process already splits the CSV into 100k-line files, I could run the update as each 100k chunk is added, updating only where the fields populated by the join are still null. Another idea is to preprocess the CSV to add the extra data to it up front and then upload it using the fast method.
Overall, I'm just contemplating what the best way to do the update is while the one I wrote keeps running. Any help or ideas would be awesome!
Sidenote: this process is only run yearly, so it doesn't have to be amazingly optimized, but I'd like some glimmer of hope that once the data is in staging it won't take an extremely long time to move it to production.
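For what it's worth, a sketch of the second idea (pre-joining the extra data into the CSV before the fast load). All table, column, and file names here are placeholders, and it assumes the lookup table fits in memory:

```python
# Sketch only: enrich the CSV with the values the slow post-load UPDATE would
# have pulled from the other table, then load the enriched file with
# LOAD DATA as before. Names and the join key are placeholders.
import csv
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="etl", password="secret", database="staging"
)

# 1) Pull the lookup table into a dict (assumes it fits in memory).
cur = conn.cursor()
cur.execute("SELECT join_key, extra_value FROM lookup_table")
lookup = dict(cur.fetchall())
cur.close()
conn.close()

# 2) Stream the big CSV once, appending the joined column.
with open("input.csv", newline="") as src, \
     open("enriched.csv", "w", newline="") as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    writer.writerow(header + ["extra_value"])
    for row in reader:
        # Column 0 is assumed to be the join key.
        writer.writerow(row + [lookup.get(row[0], "")])

# 3) enriched.csv can now be loaded with LOAD DATA LOCAL INFILE as before,
#    skipping the whole-table UPDATE afterwards.
```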
I've read a few CSV vs. database debates, and in many cases people recommended the database solution over CSV.
However, it has never been exactly the same setup as mine.
So here is the setup:
- Every hour, around 50 CSV files are generated, each representing a performance group, from around 100 hosts
- Each performance group has from 20 to 100 counters
- I need to extract data to create a number of predefined reports (e.g. daily, for certain counters and nodes) - this should be relatively static
- I need to extract data ad hoc when needed (e.g. for investigation purposes) based on a variable time period, host, and counter
- In total, around 100MB a day (across all 50 files)
Possible solutions?
1) Keep it in CSV
- Create a master CSV file for each performance group and, every hour, just append the latest CSV file
- Generate my reports using scripts with shell commands (grep, sed, cut, awk)
2) Load it into a database (e.g. MySQL)
- Create tables mirroring the performance groups and load the CSV files into them
- Generate my reports using SQL queries (a rough sketch follows below)
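For illustration only, here is roughly what option 2 could look like from Python; the table layout, counter names, and file names are all assumptions on my part:

```python
# Sketch of option 2: one table per performance group, hourly files appended
# with LOAD DATA, reports produced with plain SQL. All names are made up.
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="perf", password="secret",
    database="perfdata", allow_local_infile=True,
)
cur = conn.cursor()

# One wide table per performance group (20-100 counter columns in practice).
cur.execute("""
CREATE TABLE IF NOT EXISTS cpu_group (
    ts       DATETIME NOT NULL,
    host     VARCHAR(64) NOT NULL,
    counter1 DOUBLE,
    counter2 DOUBLE,
    PRIMARY KEY (host, ts)
)
""")

# Hourly append of the latest CSV for this group.
cur.execute("""
LOAD DATA LOCAL INFILE '/data/cpu_group_latest.csv'
INTO TABLE cpu_group
FIELDS TERMINATED BY ','
IGNORE 1 LINES
""")
conn.commit()

# A "predefined daily report" then becomes a simple aggregate query.
cur.execute("""
SELECT host, DATE(ts) AS day, AVG(counter1) AS avg_counter1
FROM cpu_group
WHERE ts >= CURDATE() - INTERVAL 1 DAY
GROUP BY host, DATE(ts)
""")
for host, day, avg_counter1 in cur.fetchall():
    print(host, day, avg_counter1)

cur.close()
conn.close()
```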
When I tried to simulate this using just shell commands on the CSV files, it was very fast.
I worry that database queries would be slower (considering the amount of data).
I also know that databases don't like very wide tables; in my scenario, I would in some cases need 100+ columns.
It will be read-only most of the time (only appending new files).
I'd like to keep the data for a year, so it would be around 36GB. Would the database solution still perform OK (a 1-2 core VM with 2-4GB of memory is expected)?
I haven't simulated the database solution, which is why I'd like to ask whether you have any views on or experience with a similar scenario.