i want to import csv-Files with about 40 million lines into neo4j. For this i try to use the "batchimporter" from https://github.com/jexp/batch-import.
Maybe it's a problem that i provide own IDs. This is the example
nodes.csv
i:id
l:label
315041100 Person
201215100 Person
315041200 Person
rels.csv :
start
end
type
relart
315041100 201215100 HAS_RELATION 30006
315041200 315041100 HAS_RELATION 30006
the content of batch.properties:
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=1000M
neostore.relationshipstore.db.mapped_memory=5000M
neostore.propertystore.db.mapped_memory=4G
neostore.propertystore.db.strings.mapped_memory=2000M
neostore.propertystore.db.arrays.mapped_memory=1000M
neostore.propertystore.db.index.keys.mapped_memory=1500M
neostore.propertystore.db.index.mapped_memory=1500M
batch_import.node_index.node_auto_index=exact
./import.sh graph.db nodes.csv rels.csv
will be processed without errors, but it takes about 60 seconds!
Importing 3 Nodes took 0 seconds
Importing 2 Relationships took 0 seconds
Total import time: 54 seconds
When i use smaller IDs - for example 3150411 instead of 315041100 - it takes just 1 second!
Importing 3 Nodes took 0 seconds
Importing 2 Relationships took 0 seconds
Total import time: 1 seconds
Actually i would take even bigger IDs with 10 digits. I don't know what i'm doing wrong. Can anyone see an error?
JDK 1.7
batchimporter 2.1.3 (with neo4j 2.1.3)
OS: ubuntu 14.04
Hardware: 8-Core-Intel-CPU, 16GB RAM
I think the problem is that the batch importer is interpreting those IDs as actual physical ids on disk. And so the time is spent in the file system, inflating the store files up to the size where they can fit those high ids.
The ids that you're giving are intended to be "internal" to the batch import, or? Although I'm not sure how to tell the batch importer that is the case.
#michael-hunger any input there?
the problem is that those ID's are internal to Neo4j where they represent disk record-ids. if you provide high values there, Neo4j will create a lot of empty records until it reaches your ids.
So either you create your node-id's starting from 0 and you store your id as normal node property.
Or you don't provide node-id's at all and only lookup nodes via their "business-id-value"
i:id id:long l:label
0 315041100 Person
1 201215100 Person
2 315041200 Person
start:id end:id type relart
0 1 HAS_RELATION 30006
2 0 HAS_RELATION 30006
or you have to configure and use an index:
id:long:people l:label
315041100 Person
201215100 Person
315041200 Person
id:long:people id:long:people type relart
0 1 HAS_RELATION 30006
2 0 HAS_RELATION 30006
HTH Michael
Alternatively you can also just write a small java or groovy program to import your data if handling those ids with the batch-importer is too tricky.
See: http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
Related
I'm using Apache Hive.
I created a table in Hive (similar to external table) and loaded data into the same using the LOAD DATA LOCAL INPATH './Desktop/loc1/kv1.csv' OVERWRITE INTO TABLE adih; command.
While I am able to retrieve simple data from the hive table adih (e.g. select * from adih, select c_code from adih limit 1000, etc), Hive gives me errors when I ask for data involving slight computations (e.g. select count(*) from adih, select distinct(c_code) from adih).
The Hive cli output is as shown in the following link -
hive> select distinct add_user from adih;
Query ID = latize_20161031155801_8922630f-0455-426b-aa3a-6507aa0014c6
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapreduce.job.reduces=
Starting Job = job_1477889812097_0006, Tracking URL = http://latize-data1:20005/proxy/application_1477889812097_0006/
Kill Command = /opt/hadoop-2.7.1/bin/hadoop job -kill job_1477889812097_0006
[6]+ Stopped $HIVE_HOME/bin/hive
Hive stops displaying any further logs / actions beyond the last line of "Kill Command"
Not sure where I have gone wrong (many answers on StackOverflow tend to point back to YARN configs (environment config detailed below).
I have the log as well but it contains more than 30000 characters (Stack Overflow limit)
My hadoop environment is configured as follows -
1 Name Node & 1 Data Node. Each has 20 GB of RAM with sufficient ROM. Have allocated 13 GB of RAM for the yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb each with the mapreduce.map.memory.mb being set as 4 GB and the mapreduce.reduce.memory.mb being set as 12 GB. Number of reducers is currently set to default (-1). Also, Hive is configured to run with a MySQL DB (rather than Derby).
You should set the appropriate values to the properties show in your trace,
eg: Edit the properties in hive-site.xml
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>67108864</value></property>
Looks like you have set mapred.reduce.tasks = -1, which makes Hive refer to its config to decide the number of reduce tasks.
You are getting an error as the number of reducers is missing in Hive config.
Try setting it using below command:
Hive> SET mapreduce.job.reduces=XX
As per official documentation: The right number of reduces seems to be 0.95 or 1.75 multiplied by (< no. of nodes > * < no. of maximum containers per node >).
I managed to get Hive and MR to work - increased the memory configurations for all the processes involved:
Increased the RAM allocated to YARN Scheduler and maximum RAM allocated to the YARN Nodemanager (in yarn-site.xml), alongside increasing the RAM allocated to the Mapper and Reducer (in mapred-site.xml).
Also incorporated parts of the answers by #Sathiyan S and #vmorusu - set the hive.exec.reducers.bytes.per.reducer property to 1 GB of data, which directly affects the number of reducers that Hive uses (through application of its heuristic techniques).
I am trying to use weka to analyze some data. I've got a dataset with 3 variables and 1000+ instances.
The dataset references movie remakes and
how similar they are (0.0-1.0)
the difference in years between the movie and the remake
and lastly if they were made by the same studio (yes or no)
I am trying to make a decision tree to analyze the data. Using the J48 (because that's all I have ever used) I only get one leaf. Im assuming I'm doing something wrong. Any help is appreciated.
Here is a snippet from the data set:
Similarity YearDifference STUDIO TYPE
0.5 36 No
0.5 9 No
0.85 18 No
0.4 10 No
0.5 15 No
0.7 6 No
0.8 11 No
0.8 0 Yes
...
If interested the data can be downloaded as a csv here http://s000.tinyupload.com/?file_id=77863432352576044943
Your data set is not balanced cause there are almost 5 times more "No" then "Yes" for class attribute. That's why J48 is tree which is actually just one leaf that classifies everything as "NO". You can do one of these things:
sample your data set so you have equal number of No and Yes
Try using better classification algorithm e.g. Random Forest (it's located few spaces below J48 in Weka explorer GUI)
While testing out some of our save task performances on Django (bulk_create),
I noticed in a cProfile test that prior to saving, django runs a get_db_prep_save, which calls a value_to_db_decimal. However, because we have so many rows (dealing with financial data) we are updating at a time, the cumulative time is over 20 seconds for this process. You can see the amount of calls made in the ncalls.
cProfile
ncalls tottime percall cumtime percall filename:lineno(function)
328166 0.344 0.000 21.733 0.000 __init__.py:891(get_db_prep_save)
328166 0.334 0.000 20.521 0.000 __init__.py:860(value_to_db_decimal)
309035 1.253 0.000 20.288 0.000 util.py:142(format_number)
I've attempted creating Django objects with DecimalFields set up exactly to our Django models.py (4 Decimal places) using Decimal(x).quantize(.0004), but no matter what I do, this specific function ends up taking 20+ seconds. (I was secretly hoping by passing Decimal types, it wouldn't need to be reformatted by Django.) Is there a way I can get around this when I create the Django object to be saved into MySQL?
One option seems to be override get_db_prep_save, but I'm just wondering if there's something simpler I'm missing. Thanks!
I am new to Cassandra and just run a cassandra cluster (version 1.2.8) with 5 nodes, and I have created several keyspaces and tables on there. However, I found all data are stored in one node (in the below output, I have replaced ip addresses by node numbers manually):
Datacenter: 105
==========
Address Rack Status State Load Owns Token
4
node-1 155 Up Normal 249.89 KB 100.00% 0
node-2 155 Up Normal 265.39 KB 0.00% 1
node-3 155 Up Normal 262.31 KB 0.00% 2
node-4 155 Up Normal 98.35 KB 0.00% 3
node-5 155 Up Normal 113.58 KB 0.00% 4
and in their cassandra.yaml files, I use all default settings except cluster_name, initial_token, endpoint_snitch, listen_address, rpc_address, seeds, and internode_compression. Below I list those non-ip address fields I modified:
endpoint_snitch: RackInferringSnitch
rpc_address: 0.0.0.0
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
parameters:
- seeds: "node-1, node-2"
internode_compression: none
and all nodes using the same seeds.
Can I know where I might do wrong in the config? And please feel free to let me know if any additional information is needed to figure out the problem.
Thank you!
If you are starting with Cassandra 1.2.8 you should try using the vnodes feature. Instead of setting the initial_token, uncomment # num_tokens: 256 in the cassandra.yaml, and leave initial_token blank, or comment it out. Then you don't have to calculate token positions. Each node will randomly assign itself 256 tokens, and your cluster will be mostly balanced (within a few %). Using vnodes will also mean that you don't have to "rebalance" you cluster every time you add or remove nodes.
See this blog post for a full description of vnodes and how they work:
http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2
Your token assignment is the problem here. An assigned token are used determines the node's position in the ring and the range of data it stores. When you generate tokens the aim is to use up the entire range from 0 to (2^127 - 1). Tokens aren't id's like with mysql cluster where you have to increment them sequentially.
There is a tool on git that can help you calculate the tokens based on the size of your cluster.
Read this article to gain a deeper understanding of the tokens. And if you want to understand the meaning of the numbers that are generated check this article out.
You should provide a replication_factor when creating a keyspace:
CREATE KEYSPACE demodb
WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor': 3};
If you use DESCRIBE KEYSPACE x in cqlsh you'll see what replication_factor is currently set for your keyspace (I assume the answer is 1).
More details here
About 2 months ago, I imported EnWikipedia data(http://dumps.wikimedia.org/enwiki/20120211/) into mysql.
After finished importing EnWikipedia data, I have been creating index in the tables of the EnWikipedia database in mysql for about 2 month.
Now, I have reached the point of creating index in "pagelinks".
However, it seems to take an infinite time to pass that point.
Therefore, I checked the time remaining to pass to ensure that my intuition was correct or not.
As a result, the expected time remaining was 60 days(assuming that I create index in "pagelinks" again from the beginning.)
My EnWikipedia database has 7 tables:
"categorylinks"(records: 60 mil, size: 23.5 GiB),
"langlinks"(records: 15 mil, size: 1.5 GiB),
"page"(records: 26 mil, size 4.9 GiB),
"pagelinks"(records: 630 mil, size: 56.4 GiB),
"redirect"(records: 6 mil, size: 327.8 MiB),
"revision"(records: 26 mil, size: 4.6 GiB) and "text"(records: 26 mil, size: 60.8 GiB).
My server is...
Linux version 2.6.32-5-amd64 (Debian 2.6.32-39),Memory 16GB, 2.39Ghz Intel 4 core
Is that common phenomenon for creating index to take so long days ?
Does anyone have a good solution to create index more quickly ?
Thanks in advance !
P.S: I made following operations for checking the time remaining.
References(Sorry,following page is written in Japanese): http://d.hatena.ne.jp/sh2/20110615
1st. I got records in "pagelink".
mysql> select count(*) from pagelinks;
+-----------+
| count(*) |
+-----------+
| 632047759 |
+-----------+
1 row in set (1 hour 25 min 26.18 sec)
2nd. I got the amount of records increased per minute.
getHandler_write.sh
#!/bin/bash
while true
do
cat <<_EOF_
SHOW GLOBAL STATUS LIKE 'Handler_write';
_EOF_
sleep 60
done | mysql -u root -p -N
command
$ sh getHandler_write.sh
Enter password:
Handler_write 1289808074
Handler_write 1289814597
Handler_write 1289822748
Handler_write 1289829789
Handler_write 1289836322
Handler_write 1289844916
Handler_write 1289852226
3rd. I computed the speed of recording.
According to the result of 2. ,the speed of recording is
7233 records/minutes
4th. Then the time remaining is
(632047759/7233)/60/24 = 60 days
Those are pretty big tables, so I'd expect the indexing to be pretty slow. 630 million records is a LOT of data to index. One thing to look at is partitioning, with data sets that large, without correctly partitioned tables, performance will be sloooow. Here's some useful links:
using partioning on slow indexes you could also try looking at the buffer size settings for building the indexes (the default is 8MB, do for your large table that's going to slow you down a fair bit. buffer size documentation