Loading a CSV into Neo4j is time consuming

I want to load a CDR CSV file with 648,000 records into Neo4j (4.4.10), but it has been running for about 4 days and has not yet completed.
My CSV has 648,000 records with 7 columns, and the file is about 48 MB.
My computer has 100 GB of RAM and an Intel Xeon E5 CPU.
The columns of the CSV are:
OP_Name
TP_Name
Called_Number
OP_ANI
Setup_Time
Duration
OP_Price
The code that I use to load the CSV into Neo4j is:
```Cypher
:auto load csv with headers from 'file:///cdr.csv' as line FIELDTERMINATOR ','
with line
where line['Called_Number'] is not null and line['OP_ANI'] is not null
with line['OP_ANI'] as OP_Phone,
(CASE line['OP_Name']
WHEN 'TIC' THEN 'IRAN'
ELSE 'Foreign' END) AS OP_country,
line['Called_Number'] as Called_Phone,
(CASE line['TP_Name']
WHEN 'TIC' THEN 'IRAN'
ELSE 'Foreign' END) AS TP_country,
line['Setup_Time'] as Setup_Time,
line['Duration'] as Duration,
line['OP_Price'] as OP_Price
call {
with OP_Phone, OP_country, Called_Phone, TP_country, Setup_Time, Duration, OP_Price
MERGE (c:Customer{phone: toInteger(Called_Phone)})
on create set c.country = TP_country
WITH c, OP_Phone, OP_country, Called_Phone, TP_country, Setup_Time, Duration, OP_Price
CALL apoc.create.addLabels( c, [ c.country ] ) YIELD node
MERGE (c2:Customer{phone: toInteger(OP_Phone)})
on create set c2.country = OP_country
WITH c2, OP_Phone, OP_country, Called_Phone, TP_country, Setup_Time, Duration, OP_Price, c
CALL apoc.create.addLabels( c2, [ c2.country ] ) YIELD node
MERGE (c2)-[r:CALLED{setupTime: Setup_Time,
duration: Duration,
OP_Price: OP_Price}]->(c)
} IN TRANSACTIONS
```
How can I speed up the load operation?

MERGE acts as an upsert in Neo4j. So the statement:
MERGE (c:Customer{phone: toInteger(Called_Phone)})
checks whether a Customer node with the given phone number already exists. If it does, it updates that node; otherwise it creates it. When there is a large number of nodes, this lookup can be very slow without an index, and the CSV import will be slow overall. Creating an index on the phone property of Customer should do the trick. You can create the index like this:
CREATE INDEX phone IF NOT EXISTS FOR (n:Customer) ON (n.phone)
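If you prefer to drive the import from outside the Browser, the same idea (create the index first, then MERGE in bounded batches) can be scripted. Below is a minimal, untested sketch using the official Neo4j Python driver; the bolt URI, credentials, batch size and CSV path are assumptions, and the Cypher is a simplified version of the question's query (it stores the country as a property instead of adding it as an extra label via apoc.create.addLabels).
```python
import csv
from neo4j import GraphDatabase  # official driver: pip install neo4j

URI = "bolt://localhost:7687"      # assumption: local instance
AUTH = ("neo4j", "password")       # assumption: replace with real credentials
BATCH_SIZE = 10_000                # assumption: tune to taste

# One UNWIND per batch keeps each transaction a bounded size,
# similar to what CALL { ... } IN TRANSACTIONS does.
MERGE_BATCH = """
UNWIND $rows AS row
MERGE (c:Customer {phone: toInteger(row.Called_Number)})
  ON CREATE SET c.country = CASE row.TP_Name WHEN 'TIC' THEN 'IRAN' ELSE 'Foreign' END
MERGE (c2:Customer {phone: toInteger(row.OP_ANI)})
  ON CREATE SET c2.country = CASE row.OP_Name WHEN 'TIC' THEN 'IRAN' ELSE 'Foreign' END
MERGE (c2)-[:CALLED {setupTime: row.Setup_Time,
                     duration: row.Duration,
                     OP_Price: row.OP_Price}]->(c)
"""

def write_batch(tx, rows):
    tx.run(MERGE_BATCH, rows=rows)

with GraphDatabase.driver(URI, auth=AUTH) as driver, driver.session() as session:
    # The index is what actually fixes the slowness: without it every MERGE is a label scan.
    session.run("CREATE INDEX phone IF NOT EXISTS FOR (n:Customer) ON (n.phone)")

    with open("cdr.csv", newline="") as f:    # assumption: the CSV sits next to the script
        batch = []
        for row in csv.DictReader(f):
            if row.get("Called_Number") and row.get("OP_ANI"):
                batch.append(row)
            if len(batch) >= BATCH_SIZE:
                session.execute_write(write_batch, batch)  # 5.x driver API; use write_transaction on a 4.x driver
                batch = []
        if batch:
            session.execute_write(write_batch, batch)
```
Either way, the index is the decisive change; the batching only keeps individual transactions small so the heap stays under control.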

Related

Adding frequency counter between nodes in neo4j during csv import

I've got a csv file with ManufacturerPartNumbers and Manufacturers. Both values can potentially be duplicated across rows one or more times, meaning I could have ManufacturerPartNumber,Manufacturer pairs like: A|X, A|Y, A|Y, B|X, C|X
In this case, I'd like to create ManufacturerPartNumber nodes (A), (B), (C) and Manufacturer nodes (X), (Y)
I also want to create relationships of
(A)-[MADE_BY]->(X)
(A)-[MADE_BY]->(Y)
And I also want to apply a weighting value in the relationship between A -> Y since it appears twice in my dataset, so that I know that there's a more frequent relationship between A|Y than there is between A|X.
Is there a more efficient way of doing this? I'm dealing with 10M rows of csv data and it is crashing during import.
:param UploadFile => 'http://localhost:11001/project-f64568ab-67b6-4560-ae89-8aea882892b0/file.csv';
//open the CSV file
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM $UploadFile AS csvLine with csvLine where csvLine.id is not null
//create nodes
MERGE (mfgr:Manufacturer {name: COALESCE(trim(toUpper(csvLine.Manufacturer)),'NULL')})
MERGE (mpn:MPN {name: COALESCE(trim(toUpper(csvLine.MPN)),'NULL')})
//set relationships
MERGE (mfgr)-[a:MAKES]->(mpn)
SET a += {appearances: (CASE WHEN a.appearances is NULL THEN 0 ELSE a.appearances END) + 1, refid: (CASE WHEN a.refid is NULL THEN csvLine.id ELSE a.refid + ' ~ ' + csvLine.id END)}
;
Separating the node creation from the relationship creation, and then setting the values, helped a bit.
Ultimately what had the most impact was that I spun up an AuraDB at max size and then imported all of the data, followed by resizing it back down. Probably not an ideal way to handle it, but it worked better than all the other optimization and only cost me a few bucks!
//QUERY ONE: var2 and var1 nodes
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM $UploadFile AS csvLine with csvLine where csvLine.id is not null
MERGE (var2:VAR2 {name: COALESCE(trim(toUpper(csvLine.VAR2)),'NULL')})
MERGE (var1:VAR1 {name: COALESCE(trim(toUpper(csvLine.VAR1)),'NULL')})
;
//QUERY TWO: relationships between var2 and var1
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM $UploadFile AS csvLine with csvLine where csvLine.id is not null
MERGE (var2:VAR2 {name: COALESCE(trim(toUpper(csvLine.VAR2)),'NULL')})
MERGE (var1:VAR1 {name: COALESCE(trim(toUpper(csvLine.VAR1)),'NULL')})
MERGE (var2)-[a:RELATES_TO]->(var1) SET a += {appearances: (CASE WHEN a.appearances is NULL THEN 0 ELSE a.appearances END) + 1}
;
//QUERY THREE: handle descriptors
//open the CSV file
:auto USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM $UploadFile AS csvLine with csvLine where csvLine.id is not null
UNWIND split(trim(toUpper(csvLine.Descriptor)), ' ') AS DescriptionSep1 UNWIND split(trim(toUpper(DescriptionSep1)), ',') AS DescriptionSep2 UNWIND split(trim(toUpper(DescriptionSep2)), '|') AS DescriptionSep3 UNWIND split(trim(toUpper(DescriptionSep3)), ';') AS DescriptionSep4
MERGE (var2:VAR2 {name: COALESCE(trim(toUpper(csvLine.VAR2)),'NULL')})
MERGE (var1:VAR1 {name: COALESCE(trim(toUpper(csvLine.VAR1)),'NULL')})
MERGE (descriptor:Descriptor {name: COALESCE(trim(toUpper(DescriptionSep4)),'NULL')})
SET descriptor += {appearances: (CASE WHEN descriptor.appearances is NULL THEN 0 ELSE descriptor.appearances END) + 1}
MERGE (descriptor)-[d:DESCRIBES]->(var1)
SET d += {appearances: (CASE WHEN d.appearances is NULL THEN 0 ELSE d.appearances END) + 1}
;

How to fetch data from MySQL in parallel with Sequel Pro in R

I want to fetch data from MySQL with Sequel Pro in R, but when I run the query it takes ages.
Here is my code:
old_value<- data.frame()
new_value<- data.frame()
counter<- 0
for (i in 1:length(short_list$id)) {
mydb = OpenConn(dbname = '**', user = '**', password = '**', host = '**')
query <- paste0("select * from table where id IN (",short_list$id[i],") and country IN ('",short_list$country[i],"') and date >= '2019-04-31' and `date` <= '2020-09-1';", sep = "" )
temp_old <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1)
query <- paste0("select * from table2 where id IN (",short_list$id[i],") and country IN ('",short_list$country[i],"') and date >= '2019-04-31' and `date` <= '2020-09-1';", sep = "" )
temp_new <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1)
RMySQL::dbDisconnect(mydb)
new_value<- rbind(temp_new,new_value)
old_value<- rbind(temp_old,old_value)
counter=counter+1
base::print(paste("completed for ",counter),sep="")
}
Is there any way I can write this more efficiently and call the queries faster? I have around 5,000 rows which have to go through the loop. The query works, but it takes time.
I have tried the following, but it still gives me an error:
# parallel computing
clust <- makeCluster(length(6))
clusterEvalQ(cl = clust, expr = lapply(c('data.table',"RMySQL","dplyr","plyr"), library, character.only = TRUE))
clusterExport(cl = clust, c('config','short_list'), envir = environment())
new_de <- parLapply(clust, short_list, function(id,country) {
for (i in 1:length(short_list$id)) {
mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
query <- paste0("select * from table1 where id IN (",short_list$id[i],") and country IN ('",short_list$country[i],"') and source_event_date >= date >= '2019-04-31' and `date` <= '2020-09-1';", sep = "" )
temp_data <- RMySQL::dbFetch(RMySQL::dbSendQuery(mydb, query), n = -1) %>% data.table::data.table()
RMySQL::dbDisconnect(mydb)
return(temp_data)}
})
stopCluster(clust)
gc(reset = T)
new_de <- data.table::rbindlist(new_de, use.names = TRUE)
I have also defined short_list as a list, as follows:
short_list<- as.list(short_list)
and inside short_list is:
id  country
2 US
3 UK
... ...
However it gives me this error:
Error in checkForRemoteErrors(val) :
one node produced an error: object 'i' not found
However, when I remove i from id[i] and country[i], it only gives me the result for the first row, not the results for all ids and countries.
I think an alternative is to upload the ids you need into a temporary table, and query for everything at once.
tmptable <- "mytemptable"
dbWriteTable(conn, tmptable, short_list, create = TRUE)
alldat <- dbGetQuery(conn, paste("
select t1.*
from ", tmptable, " tmp
left join table1 t1 on tmp.id=t1.id and tmp.country=t1.country
where t1.`date` >= '2019-04-31' and t1.`date` <= '2020-09-1'"))
dbExecute(conn, paste("drop table", tmptable))
(Many DBMSes use a leading # to indicate a temporary table that is only visible to the local user, is much less likely to clash in the schema namespace, and is automatically cleaned up when the connection is closed. I generally encourage the use of temp tables here; check with your DB docs, schema, and/or DBA for more info.)
The order of tables is important: by pulling everything from mytemptable and then left joining table1 onto it, we are effectively filtering out any data from table1 that does not have a matching id and country.
This doesn't solve the speed of data download, but some thoughts on that:
Each time you iterate through the queries, you have not-insignificant overhead; if there's a lot of data then this overhead should not be huge, but it's still there. Using a single query will reduce this overhead significantly.
Query time can also be affected by any index(ices) on the tables. Outside the scope of this discussion, but might be relevant if you have a large-ish table. If the table is not indexed efficiently (or the query is not structured well to use those indices), then each query will take a finite amount of time to "compile" and return data. Again, overhead that will be reduced with a single more-efficient query.
Large queries might benefit from using the command-line tool mysql; it is about as fast as you're going to get, and might iron out any issues in RMySQL and/or DBI. (I'm not saying they are inefficient, but ... it is unlikely that a free open-source driver will be faster than MySQL's own command-line utility.)
As for doing this in parallel ...
You're using parLapply incorrectly. It accepts a single vector/list and iterates over each object in that list. You might use it iterating over the indices of a frame, but you cannot use it to iterate over multiple columns within that frame. This is exactly like base R's lapply.
Let's show what is going on when you do your call. I'll replace it with lapply (because debugging in multiple processes is difficult).
# parLapply(clust, mtcars, function(id, country) { ... })
lapply(mtcars, function(id, country) { browser(); 1; })
# Called from: FUN(X[[i]], ...)
debug at #1: [1] 1
id
# [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2
# [24] 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
country
# Error: argument "country" is missing, with no default
Because the argument (mtcars here, short_list in yours) is a data.frame, and therefore a list-like object, lapply (and parLapply) operates on one column at a time. You were hoping that it would "unzip" the data, applying the first column's value to id and the second column's value to country. In fact, there is a function that does this: Map (and parallel's clusterMap, as I suggested in my comment). More on that later.
The intent of parallelizing things is to not use the for loop inside the parallel function. If short_list has 10 rows, and if your use of parLapply were correct, then you would be querying all rows 10 times, making your problem significantly worse. In pseudo-code, you'd be doing:
parallelize for each row in short_list:
# this portion is run simultaneously in 10 different processes/threads
for each row in short_list:
query for data related to this row
Two alternatives:
Provide a single argument to parLapply representing the rows of the frame.
new_de <- parLapply(clust, seq_len(NROW(short_list)), function(rownum) {
mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
on.exit({ DBI::dbDisconnect(mydb) })
tryCatch(
DBI::dbGetQuery(mydb, "
select * from table1
where id=? and country=?
and source_event_date >= date >= '2019-04-31' and `date` <= '2020-09-1'",
params = list(short_list$id[rownum], short_list$country[rownum])),
error = function(e) e)
})
Use clusterMap for the same effect.
new_de <- clusterMap(clust, function(id, country) {
mydb = OpenConn(dbname = '*', user = '*', password = '*', host = '**')
on.exit({ DBI::dbDisconnect(mydb) })
tryCatch(
DBI::dbGetQuery(mydb, "
select * from table1
where id=? and country=?
and source_event_date >= date >= '2019-04-31' and `date` <= '2020-09-1'",
params = list(id, country)),
error = function(e) e)
}, short_list$id, short_list$country)
If you are not familiar with Map, it is like "zipping" together multiple vectors/lists. For example:
myfun1 <- function(i) paste(i, "alone")
lapply(1:3, myfun1)
### "unrolls" to look like
list(
myfun1(1),
myfun1(2),
myfun1(3)
)
myfun3 <- function(i,j,k) paste(i, j, k, sep = '-')
Map(f = myfun3, 1:3, 11:13, 21:23)
### "unrolls" to look like
list(
myfun3(1, 11, 21),
myfun3(2, 12, 22),
myfun3(3, 13, 23)
)
Some liberties I took in that adapted code:
I shifted from the dbSendQuery/dbFetch double-tap to a single call to dbGetQuery.
I'm using DBI functions, since DBI functions provide a superset of what each driver's package provides. (You're likely using some of it anyway, perhaps without realizing it.) You can switch back with no issue.
I added tryCatch, since errors can sometimes be difficult to deal with in parallel processes. This means you'll need to check the return value from each of your processes to see whether inherits(ret, "error") is true (problem) or the value is a data.frame (normal).
I used on.exit so that even if there's a problem, the connection closure should still occur.

Python. If value on column 1 (row X) = value from column 2 (row Y), print row Y of column 3

I have a .csv file, df, with 3 columns (C1, C2 and C3). All columns are of the same length (approx. 600,000 rows) and have unique values. Values in C1, which represent SNPs (single nucleotide polymorphisms), are ordered according to their location on chromosomes. C2 has the same values as C1, but they are disordered. Values in C2 are coupled to the corresponding values (chromosome locations) in the same row of C3. What I want to do is couple the chromosomal locations in C3 to the values in C1, keeping the column order of C1. In other words, generate another column with the chromosome locations for the ordered SNPs in C1. So far, I tried to create a dictionary with keys from C2 and values from C3, and then use a for loop to match values in C1 and print the ordered chromosome positions, but I just get C3. I understand why I get that, but I don't manage to get what I want.
Any suggestion/help would be welcome. I am new into programming.
import csv
from collections import OrderedDict # to save keys order
import sys
sys.stdout = open("output1.csv", "w")
# C1= rows[0], C2= rows[1], C3= rows[2]
with open('df1.csv', 'rU') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader) #skip header
    d = OrderedDict((rows[1], rows[2]) for rows in reader)
    for rows in reader:
        if rows[0] in d:
            print rows[2]
Input example:
C1 C2 C3
12082473 2980300 785989
11240776 4245756 799463
2980300 12082473 740857
2905036 2341354 918573
4245756 3748597 888659
3748597 11240776 765269
2341354 2905036 792480
2465126 2465126 947034
Desired output:
C1 C4
12082473 740857
11240776 765269
2980300 785989
2905036 792480
4245756 799463
3748597 888659
2341354 918573
2465126 947034
I am not entirely sure I understand what you are trying to do.
I think your error is that the generator expression d = OrderedDict((rows[1], rows[2]) for rows in reader) consumes the whole reader, so the for loop that follows has nothing left to iterate over (and once the with block ends, the file is closed as well).
You might try something along these lines:
import csv
from collections import OrderedDict
d=OrderedDict()
with open('df1.csv', 'rU') as csv1, open('df2.csv', 'rU') as csv2:
    reader1 = csv.reader(csv1, delimiter=',')
    reader2 = csv.reader(csv2, delimiter=',')
    next(reader1) #skip header
    next(reader2) #skip header
    for row in reader1:
        d[row[0]]=row[3]
    # d = OrderedDict(("a", "b") for rows in reader1)
    for row in reader2:
        if row[0] in d:
            print d[row[0]]
I do not see any reason you need an OrderedDict since this is just a mapping between row[0] and row[3] as written. You are not using the order currently.
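For the single-file layout actually described in the question (df1.csv with columns C1, C2, C3), a minimal sketch would be to read the file once, build the C2 -> C3 mapping, and then walk C1 in its original order, writing out the matched location as a new C4 column. This is my own adaptation, not the answerer's code; it assumes a comma-separated file whose header row literally names the columns C1, C2 and C3, and it uses Python 3.
```python
import csv

# Read the whole file once; ~600,000 small rows fit comfortably in memory.
with open('df1.csv', newline='') as f:
    rows = list(csv.DictReader(f))

# Map each SNP id in C2 to its chromosome location in C3.
location_by_snp = {row['C2']: row['C3'] for row in rows}

# Walk C1 in its original order and emit the matching location as C4.
with open('output1.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['C1', 'C4'])
    for row in rows:
        writer.writerow([row['C1'], location_by_snp.get(row['C1'], '')])
```
With the sample input from the question this produces the desired output: for example, 12082473 is paired with 740857 because 12082473 appears in C2 on the row whose C3 is 740857.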

Data set too large to load into memory for processing

I have a large, rapidly growing data set of around 4 million rows. In order to define and exclude the outliers (for statistics/analytics usage) I need the algorithm to consider all entries in this data set. However, this is too much data to load into memory and my system chokes. I'm currently using this to collect and process the data:
@scoreInnerFences = innerFence Post.where( :source => 1 ).
                      order( :score ).
                      pluck( :score )
Using the typical divide and conquer method won't work, I don't think because every entry has to be considered to keep my outlier calculation accurate. How can this be achieved efficiently?
innerFence identifies the lower quartile and upper quartile of the data set, then uses those findings to calculate the outliers. Here is the (yet to be refactored, non-DRY) code for this:
def q1(s)
  q = s.length / 4
  if s.length % 2 == 0
    return ( s[ q ] + s[ q - 1 ] ) / 2
  else
    return s[ q ]
  end
end
def q2(s)
  q = s.length / 4
  if s.length % 2 == 0
    return ( s[ q * 3 ] + s[ (q * 3) - 1 ] ) / 2
  else
    return s[ q * 3 ]
  end
end
def innerFence(s)
  q1 = q1(s)
  q2 = q2(s)
  iq = (q2 - q1) * 3
  if1 = q1 - iq
  if2 = q2 + iq
  return [if1, if2]
end
This is not the best way, but it is an easy way:
Do several queries. First you count the number of scores:
q = Post.where( :source => 1 ).count
then you do your calculations
then you fetch the scores
q1 = Post.where( :source => 1 ).
reverse_order(:score).
select("avg(score) as score").
offset(q).limit((q%2)+1)
q2 = Post.where( :source => 1 ).
reverse_order(:score).
select("avg(score) as score").
offset(q*3).limit((q%2)+1)
The code is probably wrong but I'm sure you get the idea.
For large datasets, I sometimes drop down below ActiveRecord. It's a memory hog, even when using pluck, I imagine. Of course it's less portable, but sometimes it's worth it.
scores = Post.connection.execute('select score from posts where score > 1 order by score').map(&:first)
Don't know if that will help enough for 4 million records. If not, maybe look at a stored procedure?

A way to improve performance of the following query (update table)

This question is related to a question I published a while ago, which can be found here: Update values of database with values that are already in DB.
I have the following situation: a table that stores data from different sensors (I have a total of 8 sensors). Each row of the table has the following structure:
SensorID --- TimestampMS --- RawData --- Data
So, for example, for a temperature sensor called TEMPSensor1 I have the following:
TEMPSensor1 --- 1000 --- 200 --- 2
TEMPSensor1 --- 2000 --- 220 --- 2.2
And so on for each sensor (in total I have 8). I have some problems reading the data, and there are rows whose data "is not correct". Specifically, when the rawdata field is 65535, I should update that particular row. What I would like to do is put the next value (in time) into that "corrupted" row. So, if we have this:
TEMPSensor1 --- 1000 --- 200 --- 2
TEMPSensor1 --- 2000 --- 220 --- 2.2
TEMPSensor1 --- 3000 --- 65535 --- 655.35
TEMPSensor1 --- 4000 --- 240 --- 2.4
After doing the Update, the content of the table should be changed to:
TEMPSensor1 --- 1000 --- 200 --- 2
TEMPSensor1 --- 2000 --- 220 --- 2.2
TEMPSensor1 --- 3000 --- 240 --- 2.4
TEMPSensor1 --- 4000 --- 240 --- 2.4
I've ended up doing the following:
UPDATE externalsensor es1
INNER JOIN externalsensor es2 ON es1.sensorid = es2.sensorid AND (es2.timestampms - es1.timestampms) > 60000 AND (es2.timestampms - es1.timestampms) < 120000 AND es2.rawdata <> 65535
SET es1.rawdata = es2.rawdata, es1.data = es2.data
WHERE es1.rawdata = 65535
Because I know that between two reads from a sensor there are between 60000 and 120000 ms. However, if I have two consecutive "corrupted" readings, that won't work. Can anyone suggest a way to do this more efficiently, using JOINs rather than subquery selects? My idea would be to have a JOIN that gives all the possible values for that sensor after its timestampms and just take the first one, but I don't know how to limit that JOIN result.
Appreciate.
Here's a solution without correlated subqueries, but with a triangular join (not sure which is worse):
UPDATE externalsensor bad
INNER JOIN (
SELECT
es1.SensorID,
es1.TimestampMS,
MIN(es2.TimestampMS) AS NextGoodTimestamp
FROM externalsensor es1
INNER JOIN externalsensor es2
ON es1.SensorID = es2.SensorID AND
es1.TimestampMS < es2.TimestampMS
WHERE es1.RawData = 65535
AND es2.RawData <> 65535
GROUP BY
es1.SensorID,
es1.TimestampMS
) link ON bad.SensorID = link.SensorID AND
bad.TimestampMS = link.TimestampMS
INNER JOIN externalsensor good
ON link.SensorID = good.SensorID AND
link.NextGoodTimestamp = good.TimestampMS
SET
bad.RawData = good.RawData,
bad.Data = good.Data
It is assumed that the timestamps are unique within a single sensor group.
A completely different approach, using a procedure that runs through the whole table (in descending time order for every sensor):
DELIMITER $$
CREATE PROCEDURE updateMyTable()
BEGIN
SET @dummy := -9999 ;
SET @dummy2 := -9999 ;
SET @sensor := -999 ;
UPDATE myTable m
JOIN
( SELECT n.SensorID
, n.TimestampMS
, @d := (n.RawData = 65535) AND (@sensor = n.SensorID) AS problem
, @dummy := IF(@d, @dummy, n.RawData) as goodRawData
, @dummy2 := IF(@d, @dummy2, n.Data) as goodData
, @sensor := n.SensorID AS previous
FROM myTable n
ORDER BY n.SensorID
, n.TimeStampMS DESC
) AS upd
ON m.SensorID = upd.SensorID
AND m.TimeStampMS = upd.TimeStampMS
SET m.RawData = upd.goodRawData
, m.Data = upd.goodData
WHERE upd.problem
;
END$$
DELIMITER ;
Since you don't want to use Dems' solution from the previous question, here's a "solution" with JOINs:
UPDATE myTable m
JOIN myTable n
ON m.SensorID = n.SensorID
AND n.RawData <> 65535
AND m.TimestampMS < n.TimestampMS
JOIN myTable q
ON n.SensorID = q.SensorID
AND q.RawData <> 65535
AND n.TimestampMS <= q.TimestampMS
SET
m.RawData = n.RawData,
m.Data = n.Data
WHERE
m.RawData = 65535
;
EDIT
My query above is wrong, dead wrong. It appears to be working in my test db but the logic is flawed. I'll explain below.
Why the above query works fine but is dead wrong:
First, why it's wrong.
Because it will not return one row for every (sensorID, bad timestamp) combination, but many rows. If m (m.TimestampMS) is the bad timestamp we want to find, it will return all combinations of that bad timestamp and later good timestamps n and q with n.TimestampMS <= q.TimestampMS. It would be a correct query if it found the MINIMUM of those n timestamps.
Now, how come it actually works all right in my test db?
I think it's because when MySQL comes to apply the SET ... and has a lot of options (rows), it just uses the first one. But lucky for me, I added the test rows in increasing timestamp order, so they were saved in that order in the db, and (again lucky for me) that is how the query plan happened to be scheduled (I presume).
Even this query works in my test db:
UPDATE myTable m
JOIN myTable n
ON m.SensorID = n.SensorID
AND n.RawData <> 65535
AND m.TimestampMS < n.TimestampMS
SET
m.RawData = n.RawData,
m.Data = n.Data
WHERE
m.RawData = 65535
;
while being flawed for the same reasons.