Is there a way to set a primary key for text files using Python? I am currently connecting a lottery web scraper to plain text files (edited in Notepad++) to update the data sets. To prevent duplicates I think I need unique IDs, and for lottery results the draw date seems like it would work. Currently I can only concatenate the new and old data and overwrite the current file. If an answer is found, I will also apply it to my other problem of linking Sublime and Notepad++.
I found that if you have Task Scheduler (Windows), you can write to the files the regular way.
with open("Filename.txt", "r") as f:
    data = f.read()

# Fantasy5 holds the newly scraped results as a string.
with open("Filename.txt", "w") as f:
    f.write('{}{}{}'.format(Fantasy5, '\n' if data else '', data))
Then schedule the script to update the file. With this there is no need for a primary key, and it prevents duplicates.
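If you do want the draw date to act as a primary key, here is a minimal sketch of that idea, assuming each result is a single comma-separated line that starts with the date and that Fantasy5 holds the newly scraped line:
with open("Filename.txt", "r") as f:
    existing = f.read().splitlines()

# Treat the draw date (the text before the first comma) as the unique ID.
existing_dates = {line.split(",")[0] for line in existing if line}

# Only prepend the scraped result if its date is not already in the file.
if Fantasy5.split(",")[0] not in existing_dates:
    with open("Filename.txt", "w") as f:
        f.write("\n".join([Fantasy5] + existing))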
I am trying to figure out why MySQL isn't working as expected.
I imported my data from a CSV into a table called Products, which is shown in the screenshot. It's a small table of just ID and Name.
But when I run a query with a WHERE clause, looking for rows where Name = 'SMS', it returns nothing. I don't understand what the issue is.
My CSV contents in Notepad++ are shown below:
This is what I used to load my CSV, in case there are any errors here.
Could you share your CSV file content?
This has happened to me before, and the problem was some blank space in the data in the CSV file.
So you could first parse your CSV data (remove the unneeded blank space) before importing it into the database; see the sketch below.
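A minimal sketch of that clean-up step in Python, assuming the file is named products.csv (the file name is a placeholder):
import csv

# Strip leading/trailing whitespace from every field before importing the file.
with open("products.csv", newline="") as src, open("products_clean.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(field.strip() for field in row)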
This is often caused by spaces or look-alike characters. If caused by spaces or invisible characters at the beginning/end, you can try:
where name like '%SMS%'
You can then make this more general:
where name like '%S%M%S%'
When you get a match, you'll need to investigate further to find the actual cause.
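One quick way to make such invisible characters visible is to print the raw representation of each field, for example with a small Python check (the file name products.csv is a placeholder):
import csv

# repr() makes trailing spaces, tabs, BOMs and look-alike Unicode characters show up explicitly.
with open("products.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print([repr(field) for field in row])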
I have many CSV files with names 0_0.csv, 0_1.csv, 0_2.csv, ..., 1_0.csv, 1_1.csv, ..., z_17.csv.
I want to know how I can import them in a loop or something similar.
I would also like to know whether I am doing it well (each file is 50 MB and the total size is about 100 GB).
This is my code:
create index on :name(v)
create index on :value(v)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///0_0.csv" AS csv
FIELDTERMINATOR ','
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value})
CREATE (n)-[:kind {v:csv.kind}]->(m)
You could handle multiple files by constructing the file name, as in the query below. Unfortunately this approach seems to break when combined with the USING PERIODIC COMMIT query hint, so it won't be a good option for you. You could, however, create a script that generates one command per file and sends them to bin/cypher-shell; see the sketch after the query.
UNWIND ['0','1','z'] as outer
UNWIND range(0,17) as inner
LOAD CSV WITH HEADERS FROM 'file:///'+ outer +'_' + toString(inner) + '.csv' AS csv
FIELDTERMINATOR ','
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value})
CREATE (n)-[:kind {v:csv.kind}]->(m)
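A minimal sketch of such a wrapper in Python, which keeps USING PERIODIC COMMIT by sending one query per file to bin/cypher-shell (the credentials and the prefix list are placeholders):
import subprocess

# One PERIODIC COMMIT load per file, piped to cypher-shell on stdin.
QUERY = """USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///{name}.csv' AS csv FIELDTERMINATOR ','
MERGE (n:name {{v:csv.name}})
MERGE (m:value {{v:csv.value}})
CREATE (n)-[:kind {{v:csv.kind}}]->(m);
"""

for outer in ['0', '1', 'z']:          # extend to the full range of prefixes
    for inner in range(18):
        query = QUERY.format(name='{}_{}'.format(outer, inner))
        subprocess.run(['bin/cypher-shell', '-u', 'neo4j', '-p', 'password'],
                       input=query, text=True, check=True)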
As far as your actual load query goes: do your name and value nodes come up multiple times in the files? If they are unique, you would be better off loading the data in multiple passes. Load the nodes first without the indexes; then add the indexes once the nodes are loaded; then do the relationships as the last step.
Using CREATE for the :kind relationship will result in multiple relationships even if it is the same value for csv.kind. You might want to use MERGE instead if that is the case.
For 100 GB of data, though, if you are starting with an empty database and are looking for speed, I would take a look at using bin/neo4j-admin import.
I have a CSV file with about 15 million rows. I am trying to import them with LOAD CSV but it's taking too long.
When I try to import them with CREATE they get imported in a decent amount of time, but that creates a lot of duplicates. So I tried to use MERGE instead, but it's taking a lot of time. The query was running for more than 10 hours before I terminated it. After that I tried to import just a few columns and waited more than 30 minutes before I terminated the query. Here is the code for running just those few columns:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber:line1.CompanyNumber,
companyName:line1.CompanyName,
uri:line1.URI
})
My question is: does MERGE usually behave like this or am I doing something wrong?
Based on the name of your input file (companydata.csv) and the columns used in the MERGE, it looks like you're misusing MERGE, unless the URI is really part of a composite primary key. The usual pattern is to:
Match on a unique key
Set the properties
Otherwise, you'll be creating duplicate nodes, as MERGE either finds a match on all criteria or creates a new node.
The query should probably look like
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
SET company.companyName = line1.CompanyName,
company.uri = line1.URI;
The uniqueness constraint will also create an index, speeding up lookups.
Update
There's a comment about how to address nodes which don't have a unique property, such as a CompanyInfo node related to the Company and containing extra properties: the CompanyInfo might not have a unique property, but it has a unique relationship to a unique Company, and that's enough to identify it.
A more complete query would then look like:
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
SET company.companyName = line1.CompanyName,
company.uri = line1.URI
// This pattern is unique
MERGE (company)-[:BASIC_INFO]->(companyInfo:CompanyInfo)
SET companyInfo.category = line1.CompanyCategory,
companyInfo.status = line1.CompanyStatus,
companyInfo.countryOfOrigin = line1.CountryofOrigin;
Note that if companydata.csv is the first file imported, and there are no duplicates, you could simply use CREATE instead of MERGE and get rid of the SET, which brings you back to something like your initial query:
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
CREATE (company:Company {companyNumber: line1.CompanyNumber,
companyName: line1.CompanyName,
uri: line1.URI})
CREATE (company)-[:BASIC_INFO]->(companyInfo:CompanyInfo {category: line1.CompanyCategory,
status: line1.CompanyStatus,
countryOfOrigin: line1.CountryofOrigin});
However, I noticed that you mentioned trying CREATE and getting duplicates, so I guess you do have duplicates in the file. In that case, you can stick to the MERGE query, but you might want to adjust your deduplication strategy by replacing SET (last entry wins) with ON CREATE SET (first entry wins) or ON CREATE SET / ON MATCH SET (full control).
I might be on the wrong track, so I could use some helpful input. I receive data from other systems via CSV files, which I can import into my DB with LOAD CSV. So far so good.
I got stuck when I needed to reload the CSV to pick up updates. I cannot delete the former data, as I might already have additional user input attached, so I need a query that imports the CSV data, matches each record, and, when it finds the node, just uses SET to override the existing properties. That said, I am unsure how to catch the cases where there is no node in the DB (a new record) and we need to create one.
LOAD CSV FROM "file:xxx.csv" AS csvLine
MATCH (c:Customer {code:"ABC"})
SET c.name = name: csvLine[0]
OPTIONAL MATCH // Here I am unsure how to express when the node is not found
MERGE (c:Customer { name: csvLine[0], code: csvLine[1]})
So ideally Cypher would check whether the node is there and, if so, update it by SETting the new property coming from the CSV, or, if the node cannot be found, create a new one with the CSV data.
And, as a side note: how would I find nodes that are in the DB but not in the CSV file, in order to mark them as obsolete? (This might not be possible during the import, but maybe someone has an idea how to solve it to keep the DB clean of deleted records, which can only be detected by comparison with the latest CSV import. I'm happy for every idea.)
Any idea or hint on how to write the query for updating the graph while importing?
You need to use MERGE's ON MATCH and/or ON CREATE clauses, see http://neo4j.com/docs/stable/query-merge.html#_use_on_create_and_on_match. I assume the customer code in the second column is the identifier, so the name in column one might change on updates:
LOAD CSV FROM "file:xxx.csv" AS csvLine
MERGE (c:Customer {code:csvLine[1]})
ON CREATE SET c.name = csvLine[0]
ON MATCH SET c.name = csvLine[0]
I'm working on test scripts and I want to load the results.csv file into a database as a BLOB.
Basically I want my table to look like:
serial_number | results
So one row for each device. My code would look like this for a device with serial number A123456789 (changed names and path for simplicity). The table is called test.
create table test (serial_number varchar(20), results longblob);
insert into test values ('A123456789',load_file('C:/results.csv'));
When I do this, however, the second column, which should contain a BLOB, comes out containing NULL, with no errors raised.
If I open my results.csv file in Notepad, then save it as a .txt file with no changes whatsoever, I get exactly what I want when I run the same code substituting ".csv" with ".txt" in the path. Basically, it would also solve my problem if I could load the CSV file as a text file.
Thanks for anything you may be able to contribute.
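As a side note, here is a minimal sketch of a client-side alternative: read the file in Python and insert its contents as a query parameter, which avoids load_file()'s server-side file access rules. The mysql-connector-python dependency and the connection details are assumptions:
import mysql.connector  # assumes the mysql-connector-python package is installed

# Read the file on the client and insert its contents as a parameter,
# instead of relying on the server-side load_file() function.
with open("C:/results.csv", "rb") as f:
    contents = f.read()

conn = mysql.connector.connect(user="user", password="password", database="testdb")  # placeholder credentials
cur = conn.cursor()
cur.execute("INSERT INTO test (serial_number, results) VALUES (%s, %s)",
            ("A123456789", contents))
conn.commit()
conn.close()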