I am trying to load a *.csv file into neo4j and, in the same load statement, split the line (which has no delimiters but has set locations for the data that I need to create nodes from). I want to use the substring function, but I can't figure out how to get it to work. The data reads in as a single line:
0067011990999991958051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
I have tried using the following code:
LOAD CSV WITH HEADERS FROM "file:/c:/itw/Ltemps.csv" AS line
WITH line
WHERE line.year IS split((substring(line, 15, 19))) and line.temp IS split((substring(line, 88, 92))) and line.qlfr IS split((substring(line, 87, 88))) and line.qual IS split((substring(line, 92, 93)))
MERGE (y:Year {year:line.year})
MERGE (t:Temp {temp:line.temp})
MERGE (f:Qlfr {qlfr:line.qlfr})
MERGE (q:Qual {qual:line.qual})
CREATE (y)-[r:HAS_TEMP]->(t);
I am looking to get 4 nodes: year, temp (an absolute value), a qualifier (positive or negative symbol), and a quality number. The indexes for where the data lies in the string should be accurate.
First, let's get the indexes and types right. Note that Cypher's substring(original, start, length) takes a 0-based start index and a length, not an end offset, which is why the indexes below differ from yours. To convert numeric substrings to integers, we use the toInteger function:
WITH '0067011990999991958051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999' AS line
RETURN
toInteger(substring(line, 15, 4)) AS year,
toInteger(substring(line, 88, 2)) AS temp,
substring(line, 87, 1) AS qlfr,
toInteger(substring(line, 92, 1)) AS qual
This gives:
╒══════╤══════╤══════╤══════╕
│"year"│"temp"│"qlfr"│"qual"│
╞══════╪══════╪══════╪══════╡
│1958  │0     │"+"   │1     │
└──────┴──────┴──────┴──────┘
If the results look good, add back the LOAD CSV and MERGE clauses. Two things:
I don't think it makes sense to use WITH HEADERS, as headers are useless in this case. Simply load the row and use row[0] as the line for splitting.
It is possible to simplify the MERGE by combining your first two MERGE clauses with the CREATE clause.
So the loader code is the following:
LOAD CSV FROM 'file:/c:/itw/Ltemps.csv' AS row
WITH row[0] AS line
WITH
toInteger(substring(line, 15, 4)) AS year,
toInteger(substring(line, 88, 2)) AS temp,
substring(line, 87, 1) AS qlfr,
toInteger(substring(line, 92, 1)) AS qual
MERGE (y:Year {year: year})-[r:HAS_TEMP]->(t:Temp {temp: temp})
MERGE (f:Qlfr {qlfr: qlfr})
MERGE (q:Qual {qual: qual})
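Note that the combined MERGE matches or creates the whole (y)-[:HAS_TEMP]->(t) pattern as a unit, so a brand-new year/temp pair gets both nodes and the relationship from a single clause.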
The data looks like:
212253820000025000.00000002500.00000000375.00111120211105202117
212456960000000750.00000000075.00000000011.25111120211102202117
212387470000010000.00000001000.00000000150.00111120211105202117
and I need to add separators, like:
21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17
The CSV file is large, nearly 20,000 rows. Is there any way to do this?
This question is generally about reading "fixed width data".
If you're stuck with this data, you'll need to parse it line by line then column by column. I'll show you how to do this with Python.
First off, the columns you counted off in the comment do not match your sample output. You seem to have omitted the last column, with a count of 2 characters.
You'll need accurate column widths to perform the task. I took your sample data and counted the columns for you and got these numbers:
8, 13, 12, 12, 8, 8, 2
So, we'll read the input data line by line, and for every line we'll:
Read 8 chars and save it as a column, then 13 chars and save it as a column, then 12 chars, etc... till we've read all the specified column widths
As we move through the line we'll keep track of our position with the variables beg and end to denote where a column begins (inclusive) and where it ends (exclusive)
The end of the first column becomes the beginning of the next, and so on down the line
We'll store those columns in a list (array) that is the new row
At the end of the line we'll save the new row to a list of all the rows
Then, we'll repeat the process for the next line
Here's how this looks in Python:
import pprint

Col_widths = [8, 13, 12, 12, 8, 8, 2]

all_rows = []
with open("data.txt") as in_file:
    for line in in_file:
        row = []
        beg = 0
        for width in Col_widths:
            end = beg + width
            col = line[beg:end]
            row.append(col)
            beg = end
        all_rows.append(row)

pprint.pprint(all_rows, width=100)
all_rows is just a list of lists of text:
[['21225382', '0000025000.00', '000002500.00', '000000375.00', '11112021', '11052021', '17'],
['21245696', '0000000750.00', '000000075.00', '000000011.25', '11112021', '11022021', '17'],
['21238747', '0000010000.00', '000001000.00', '000000150.00', '11112021', '11052021', '17']]
With this approach, if you miscounted the column width or the number of columns, you can easily modify Col_widths to match your data.
From here we'll use Python's CSV module to make sure the CSV file is written correctly:
import csv

with open("data.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerows(all_rows)
and my data.csv file looks like:
21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17
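As an aside, if pandas happens to be available, its read_fwf function does the fixed-width parsing in a couple of lines. A sketch using the same column widths (dtype=str preserves the leading zeros, which would otherwise be lost when the values parse as numbers):

import pandas as pd

# read fixed-width columns as text so values like 0000025000.00 keep their zeros
df = pd.read_fwf("data.txt", widths=[8, 13, 12, 12, 8, 8, 2], header=None, dtype=str)
df.to_csv("data.csv", header=False, index=False)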
If you have access to the command-line tool awk, you can fix your data like the following:
substr() gives a portion of the string $0, which is the entire line
you start at char 1 then specify the width of your first column, 8
for the next substr(), you again use $0, you start at 9 (1+8 from the last substr), and give it the second column's width, 13
and repeat for each column, starting at "the start of the last column plus the last column's width"
#!/bin/sh
# Col_widths = [8, 13, 12, 12, 8, 8, 2]
awk '{print substr($0,1,8) "," substr($0,9,13) "," substr($0,22,12) "," substr($0,34,12) "," substr($0,46,8) "," substr($0,54,8) "," substr($0,62,2)}' data.txt > data.csv
Data = [{'Ferrari': 51078},
        {'Volvo': 83245, 'Ferrari': 70432, 'Skoda': 29264, 'Lambo': 862},
        {'Ferrari': 306415, 'Jeep': 4025, 'Saab': 2708, 'Lexus': 161},
        {'Fiat': 27583, 'Maserati': 11030, 'Renault': 3194, 'Volvo': 259, 'Skoda': 164},
        {'Ferrari': 2313172, 'Renault': 2475},
        {'Volvo': 198671}, {'Volvo': 15762}]
I want to add together the numbers for each car, so I get the total amount for each element (the numbers below don't match the Data above; they're just an example):
Ferrari: 152455
Volvo: 13515
Skoda: 1532
Lambo: 4366
Renault: 4262
Maserati: 2345
Lexus: 235
Jeep: 124
Saab: 15
I've tried sum(), appending to new lists, collections, and many other potential solutions, but I just cannot get this one right. I'm looking for a general solution, not one only applicable to my problem: if I change my dataset, and hence the numbers and cars, it needs to work for the new Data as well.
I'm using Python 3.
You can use defaultdict. The code below iterates over the list of dicts, popping out key-value pairs until each dict is empty and summing the results (note that popitem empties the input dicts as it goes).
from collections import defaultdict

data = [{'Ferrari': 51078},
        {'Volvo': 83245, 'Ferrari': 70432, 'Skoda': 29264, 'Lambo': 862},
        {'Ferrari': 306415, 'Jeep': 4025, 'Saab': 2708, 'Lexus': 161},
        {'Fiat': 27583, 'Maserati': 11030, 'Renault': 3194, 'Volvo': 259, 'Skoda': 164},
        {'Ferrari': 2313172, 'Renault': 2475},
        {'Volvo': 198671},
        {'Volvo': 15762}]

output = defaultdict(int)

for d in data:
    while d:
        k, v = d.popitem()
        output[k] += v

print(output)
Output:
defaultdict(<class 'int'>, {'Ferrari': 2741097,
                            'Lambo': 862,
                            'Skoda': 29428,
                            'Volvo': 297937,
                            'Lexus': 161,
                            'Saab': 2708,
                            'Jeep': 4025,
                            'Renault': 5669,
                            'Maserati': 11030,
                            'Fiat': 27583})
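As an aside, if you'd rather keep the input dicts intact, a sketch with collections.Counter gives the same totals non-destructively:

from collections import Counter

# Counter.update with a mapping adds its values into the running totals
totals = Counter()
for d in data:
    totals.update(d)

print(dict(totals))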
I've been looking around and couldn't find the answer, so here goes.
I'm trying to find a way to automate converting the content of a CSV file into something else for machine learning purposes. I have the content of a single line like this:
0, 0, 0, -2.3145, 5.567...... 65, 65, 125, 70.
(516 columns)
And trying to change it to this:
0,
0,
-2.3145,
5.567
....
65,
65,
125,
70.
(516 rows)
So basically I'm transposing the data from horizontal to vertical (a single row to a single column).
It's easily done using Excel, but the problem is that I have 4,000+ of these CSV files, so it takes a lot of time.
On top of that, I have to store the first 512 rows in a CSV in one folder and the last 4 rows in a CSV in another folder, with both files keeping the same name.
Eg:
features(folder)
1.CSV
2.CSV
.....
4000+.CSV
labels(folder)
1.CSV
2.CSV
.....
4000+.CSV
Any suggestions on how I can speed things up? I tried writing my own program, but I'm stumped on changing it from row to column. I've only managed to split the single CSV file into its 4,000+ pieces.
EDIT:
I've tested putting the CSV rows into an array and then storing the array into a CSV; the code looks like this:
import csv

with open('FFTMIM16_512L1H1S0D0_1194.csv', 'r') as f:
    reader = csv.reader(f)
    your_list = list(reader)

print(your_list[0:512])
print(your_list[512:516])
print(your_list)

with open('test.csv', 'w', newline='') as fa:
    writer = csv.writer(fa)
    writer.writerows(your_list[0:511])

with open('test1.csv', 'w', newline='') as fb:
    writer = csv.writer(fb)
    writer.writerows(your_list[512:516])
It works, but I just need to run it in a loop. A problem that I don't understand: if I save the values from 0 to 512 in test.csv, it shows 512 rows, but when I store from 513 to 516 in test1.csv, it only shows three of the four rows that I need. Changing fb's slice from 512 to 516 works, which doesn't make sense to me, because the value at 512 in test.csv is 0 while in test1.csv it is 69. Why is that? From what I understand, array indexing starts at 0 and runs to the number I specify. Or is that not the case in Python?
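For reference, Python slices are half-open: lst[a:b] takes the items at indices a through b-1, so your_list[0:511] holds only 511 rows while your_list[0:512] holds 512. A quick sketch with a stand-in list:

lst = list(range(516))      # a stand-in for your 516 rows
print(len(lst[0:512]))      # 512 -> indices 0..511
print(len(lst[512:516]))    # 4   -> indices 512..515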
EDIT 2:
My new code is as follows:
import csv
import os
import glob
#import itertools

directory = input("INPUT FOLDER: ")
output1 = input("FEATURES FOLDER: ")
output2 = input("LABELS FOLDER: ")

in_files = os.path.join(directory, '*.csv')

for in_file in glob.glob(in_files):
    with open(in_file) as input_file:
        reader = csv.reader(input_file)
        your_list = list(reader)
    filename = os.path.splitext(os.path.basename(in_file))[0] + '.csv'
    with open(os.path.join(output1, filename), 'w', newline='') as output_file1:
        writer = csv.writer(output_file1)
        writer.writerow(your_list[0:512])
    with open(os.path.join(output2, filename), 'w', newline='') as output_file2:
        writer = csv.writer(output_file2)
        writer.writerow(your_list[512:516])
It shows the output as I wanted, but now it stores quotes and brackets as well, e.g. ['0.0'], ['2.321223']. How do I remove these?
I don't understand why you can't do it programmatically: if you have your 4,000+ pieces, just write every piece on a new line?
In my opinion the easiest way, though not automated, would be an editor like Notepad++.
There you can replace "," with "\r\n", or if you want to keep the "," you can replace it with ",\r\n".
If you want it automated, I don't see a non-programmatic way.
By the way, if you use Python with numpy/scipy, you can just use the .transpose() function.
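A rough sketch of that numpy route, assuming each input file holds a single row of 516 comma-separated numeric values and that the output folders already exist (the filenames here are placeholders):

import numpy as np

row = np.loadtxt("1.CSV", delimiter=",", ndmin=2)   # shape (1, 516)
col = row.transpose()                               # shape (516, 1)
np.savetxt("features/1.CSV", col[:512], fmt="%g")   # first 512 values
np.savetxt("labels/1.CSV", col[512:], fmt="%g")     # last 4 values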
*Edit to your comment:
What do you mean by "split from the first to the 512"? If you want parts of size 512, it would be something like:
new_array = []
temp_array = []
k = 0

for num in your_array:
    temp_array.append(num)
    k += 1
    if k % 512 == 0:
        new_array.append(temp_array)
        k = 0
        temp_array = []

# to append the last block, which might not be 512 sized
if len(temp_array) > 0:
    new_array.append(temp_array)

# save arrays
for i in range(len(new_array)):
    saveToCsv(array=new_array[i], name="csv_" + str(i))
Your new_array would now be an array filled with 512-sized arrays.
There might be mistakes here; I did not test the code. To save, you only need a function saveToCsv(array, name) that saves an array into a file.
What I am trying to do is import a dataset with a tree data structure from CSV into neo4j. Nodes are stored along with their parent node and their depth level (max 6) in the tree. So I try to check the depth level using CASE and then attach each node to its parent, like this (creating nodes just for the 1st level so far, for testing purposes):
export FILEPATH=file:///Example.csv
CREATE CONSTRAINT ON (n:Node) ASSERT n.id IS UNIQUE;
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS
FROM {FILEPATH} AS line
WITH DISTINCT line,
line.`Level` AS level,
line.`ParentCodeID_Cal` AS parentCode,
line.`CodeSet` AS codeSet,
line.`Category` AS nodeCategory,
line.`Type` AS nodeType,
line.`L1code` AS l1Code, line.`L1Description` AS l1Description, line.`L1Name` AS l1Name, line.`L1NameAb` AS l1NameAb,
line.`L2code` AS l2Code, line.`L2Description` AS l2Description, line.`L2Name` AS l2Name, line.`L2NameAb` AS l2NameAb,
line.`L3code` AS l3Code, line.`L3Description` AS l3Description, line.`L3Name` AS l3Name, line.`L3NameAb` AS l3NameAb,
line.`L1code` AS l4Code, line.`L4Description` AS l4Description, line.`L4Name` AS l4Name, line.`L4NameAb` AS l4NameAb,
line.`L1code` AS l5Code, line.`L5Description` AS l5Description, line.`L5Name` AS l5Name, line.`L5NameAb` AS l5NameAb,
line.`L1code` AS l6Code, line.`L6Description` AS l6Description, line.`L6Name` AS l6Name, line.`L6NameAb` AS l6NameAb,
codeSet + parentCode AS nodeId
CASE line.`Level`
WHEN '1' THEN CREATE (n0:Node{id:nodeId, description:l1Description, name:l1Name, nameAb:l1NameAb, category:nodeCategory, type:nodeType})
ELSE
END;
But I get this result:
WARNING: Invalid input 'S': expected 'l/L' (line 17, column 3 (offset: 982))
"CASE level "
  ^
I'm aware there is a syntax mistake somewhere.
I'm using neo4j 3.0.4 & Windows 10 (using neo4j shell running it with D:\Program Files\Neo4j CE 3.0.4\bin>java -classpath neo4j-desktop-3.0.4.jar org.neo4j.shell.StartClient).
You have several syntax errors. For example, a CASE expression cannot contain a CREATE clause.
In any case, you should be able to greatly simplify your Cypher. For example, this might suit your needs:
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS
FROM {FILEPATH} AS line
WITH DISTINCT line, ('L' + line.Level) AS prefix
CREATE (:Node {
  id: line.CodeSet + line.ParentCodeID_Cal,
  description: line[prefix + 'Description'],
  name: line[prefix + 'Name'],
  nameAb: line[prefix + 'NameAb'],
  category: line.Category,
  type: line.Type})
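For a row whose Level is '2', prefix evaluates to 'L2', so line[prefix + 'Description'] dynamically reads the L2Description column (and likewise for Name and NameAb); this map-subscript syntax lets one CREATE clause serve all six levels.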
I am trying to create a function that takes a filename and returns a 2-tuple with the number of non-empty lines in that program and the sum of the lengths of those lines. Here is my current program:
import re

def code_metric(file):
    with open(file, 'r') as f:
        lines = len(list(filter(lambda x: x.strip(), f)))
        num_chars = sum(map(lambda l: len(re.sub(r'\s', '', l)), f))
    return (lines, num_chars)
The result I get if I run:
if __name__ == "__main__":
    print(code_metric('cmtest.py'))
is
(3, 0)
when it should be:
(3, 85)
Also, is there a better way of finding the sum of the lengths of the lines using the functionals map, filter, and reduce? I did it for the first part but couldn't figure out the second half. I am kinda new to Python, so any help would be great.
Here is the test file called cmtest.py:
import prompt,math
x = prompt.for_int('Enter x')
print(x,'!=',math.factorial(x),sep='')
First line has 18 characters (including whitespace)
Second line has 29 characters
Third line has 38 characters
[(1, 18), (1, 29), (1, 38)]
The total is 85 characters, including whitespace. I apologize, I misread the problem. The length total for each line should include the whitespace as well.
A fairly simple approach is to build a generator to strip trailing whitespace, then enumerate over that (with a start value of 1), filtering out blank lines and summing the length of each line in turn, e.g.:

def code_metric(filename):
    line_count = char_count = 0
    with open(filename) as fin:
        stripped = (line.rstrip() for line in fin)
        for line_count, line in enumerate(filter(None, stripped), 1):
            char_count += len(line)
    return line_count, char_count

print(code_metric('cmtest.py'))
# (3, 85)
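Since the question asks about map, filter, and reduce specifically, here is a sketch of the same metric in that style (in Python 3, reduce lives in functools). Reading the file into a list once also avoids the original bug, where the second pass ran over an already-exhausted file object and produced (3, 0):

from functools import reduce

def code_metric(filename):
    with open(filename) as f:
        # materialize once; filter(None, ...) drops lines that are empty after rstrip
        lines = list(filter(None, (line.rstrip() for line in f)))
    return len(lines), reduce(lambda total, line: total + len(line), lines, 0)

print(code_metric('cmtest.py'))
# (3, 85)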
In order to count lines, maybe this code is cleaner:

with open(file) as f:
    lines = len(f.readlines())
For the second part of your program, if you intend to count only non-whitespace characters, then you also need to strip out '\t' and '\n'. If that's the case:

with open(file) as f:
    num_chars = len(re.sub(r'\s', '', f.read()))
Some people have advised you to do both things in one loop. That is fine, but if you keep them separate you can make them into different functions and get more reusability that way. Unless you are handling huge files (or executing this code millions of times), it shouldn't matter in terms of performance.