Neo4j: add property from CSV to a node

I have a label Person which contains millions of nodes. The nodes have some properties, and I am trying to add a new property to them from a CSV file.
I am trying to match nodes by the person's forename and surname, but the query is too slow. The query is:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM
'file:///personaldata.csv' AS line1
MATCH (p:Person {forename:line1.forename, surname:line1.surname})
SET p.newPersonNumber=line1.newPersonNumber
I left the query running for maybe an hour before I terminated it.
Am I doing something wrong?
Note that I have indexes on both forename and surname.

Try profiling the query to see if it really uses the indices:
PROFILE
WITH "qwe" AS forename, "asd" AS surname
MATCH (p:Person {forename: forename, surname: surname})
RETURN p
If it doesn't, you can force it:
WITH "qwe" AS forename, "asd" AS surname
MATCH (p:Person {forename: forename, surname: surname})
USING INDEX p:Person(forename)
USING INDEX p:Person(surname)
RETURN p
As mentioned in the Cypher refcard (emphasis mine):
Index usage can be enforced, when Cypher uses a suboptimal index or more than one index should be used.
See also the chapter on USING.
Update
Since using multiple indices on the same node is not currently supported, let's focus back on why the query is slow, and whether it actually does anything. You can profile the actual LOAD CSV for a subset, and see if the data matches anything:
PROFILE
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///personaldata.csv' AS line1
WITH line1
LIMIT 10
OPTIONAL MATCH (p:Person {forename:line1.forename, surname:line1.surname})
RETURN p, line1.newPersonNumber
That way, you can check that the MATCH finds something (i.e. the forename and surname don't need trimming or similar), and you can also check which index is more beneficial to the query. Since only one index will be used, the results will be filtered on the other property, and the query will be faster if you use the most discriminant index: if all the persons are Johns, you'd better use the index on surname, but if they're all Does, use the index on forename. (If they're all John Does, you have a duplication problem...) In any case, comparing the numbers on the filtering steps between the two profiles (with either index) should give you an idea of the distribution of the indices.

Related

Search in multiple columns with at least 2 words in keyword

I have a table which stores some data. This is my table structure:
Course    Location
Wolden    New York
Sertigo   Seattle
Monad     Chicago
Donner    Texas
I want to search that table, for example with the keyword Sertigo Seattle, and have it return row number two as a result.
I have this query, but it doesn't work:
SELECT * FROM courses_data a WHERE CONCAT_WS(' ', a.Courses, a.Location) LIKE '%Sertigo Seattle%'
Maybe anyone knows how to make query to achieve my needs?
If you want to search against the course and location then use:
SELECT *
FROM courses_data
WHERE Course = 'Sertigo' AND Location = 'Seattle';
Efficient searching is usually implemented by preparing the search string before running the actual search:
You split the search string "Sertigo Seattle" into two words, "Sertigo" and "Seattle", and trim those words (remove enclosing whitespace characters). You might also want to normalize the words, e.g. convert them to lower case to implement a case-insensitive search.
Then you run a search for the discrete words:
SELECT *
FROM courses_data
WHERE
(Course = 'Sertigo' AND Location = 'Seattle')
OR
(Course = 'Seattle' AND Location = 'Sertigo');
Of course, that query should be created using a prepared statement with parameter binding, using the extracted and trimmed words as dynamic parameters.
This is much more efficient than a wildcard-based search with the LIKE operator, because the database engine can make use of the indexes you (hopefully) created for that table. You can check this with the EXPLAIN feature MySQL offers.
It also makes sense to measure performance: run the different search approaches in a loop, say 1000 times, and take the required time. You will get a clear and meaningful comparison. Monitoring CPU and memory usage in such a test is also of interest.
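The split-trim-bind steps above can be sketched in Python (using SQLite purely for illustration; the table and column names follow the question, and the helper assumes exactly two search words):

```python
import sqlite3

def search_courses(conn, raw_query):
    # Split the raw search string into discrete words and trim them;
    # lower-casing both sides gives a case-insensitive comparison.
    words = [w.strip().lower() for w in raw_query.split()]
    a, b = words[0], words[1]  # assumes exactly two words
    # Parameter binding (?) instead of concatenating user input into SQL.
    return conn.execute(
        "SELECT Course, Location FROM courses_data"
        " WHERE (lower(Course) = ? AND lower(Location) = ?)"
        "    OR (lower(Course) = ? AND lower(Location) = ?)",
        (a, b, b, a),
    ).fetchall()
```

Because the query checks both column orders, either "Sertigo Seattle" or "Seattle Sertigo" finds the row.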

Queries with one item in the list in `__in` are extremely slow. Otherwise, super fast

I am retrieving event_id's by name with the code below:
events = Events.objects.values_list('event__id', flat=True). \
filter(name__in=names).distinct()
Everything works great except when names consists of just one name. If I change my code to:
events = Events.objects.values_list('event__id', flat=True). \
filter(name__in=names + ['x']).distinct()
Once again, it becomes super fast. I am seriously going crazy because this makes no sense. I used print(events.query), and it is basically the same query; just the list changes. How is this possible?
The execution time with one name in the list is 30-60 seconds; otherwise it takes just 100-1000 ms. The number of event_ids doesn't change dramatically, so it's not a size issue.
I used EXPLAIN and the difference seems to be:
Extra: Using where; Using index
Extra: Using index
And:
type: range
type: ref
More details and clarification would definitely help.
Such as:
the Event model (will help reproduce the issue and give the required background)
the events.query SQL statement (very helpful)
values_list('event__id') suggests the Event model may have a ForeignKey to itself; combined with "retrieving event_id's by name" this just adds more confusion (it may be valid, in fact)
how many records are in the events table? 100-1000 ms is not a very good query time
First thing to suggest - take a look at distinct().
To make sure only the selected column is present in the SELECT (so that DISTINCT applies to just that one column and the query plan is simpler), clear the ordering from the QuerySet with an empty order_by():
events = Events.objects.values_list('event__id', flat=True). \
filter(name__in=names + ['x']).order_by().distinct()
Description:
With distinct(), Django performs a SELECT DISTINCT SQL query to remove duplicate rows. Note that "duplicate rows" means rows identical across all columns of the SELECT, not rows with duplicate values in one specific column.
values_list('event__id', flat=True) may at first suggest that only event_id is present in the SELECT (i.e. SELECT DISTINCT event_id FROM events ...), but that is not the case - Django just takes the values of the columns listed in values_list from the result, while the SELECT may contain any other columns Django thinks are required for the query.
So your events.query may actually look like SELECT DISTINCT event_id, col_2, name FROM events ..., which not only produces different results than DISTINCT on a single column (the results are the same only in some cases, e.g. if a unique column such as id is included) but may also result in a more complicated query plan. Moreover, col_2 may not even be present in the values_list.
Django includes the columns it thinks are required to run the QuerySet - e.g. the default ordering column set on the model, which applies whenever no ordering is set on the QuerySet.
Have you checked the type of names when there is just one name? The query should work the same regardless of the length of names (list, tuple, etc.). However, it may be that when you have only one name, names is a string, not a list.
Check the example in the documentation: if you pass a string, Django (and Python in general) treats the string as a sequence of characters.
Then, if names='Django Reinhardt':
filter(name__in=names)
would become:
filter(name__in=['D', 'j', 'a', 'n', 'g', 'o', ' ',
'R', 'e', 'i', 'n', 'h', 'a', 'r', 'd', 't'])
which surely isn't the desired behavior in your case.
Be sure to enforce that names is a list even when just one name is provided. Then, when names=['Django Reinhardt'], your code would evaluate to:
filter(name__in=['Django Reinhardt'])
If you provide more details on how you obtain/construct names, I could provide more help on this.
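A small defensive helper (a sketch; the function name is made up) ensures names is always a list before it reaches filter(name__in=...):

```python
def as_name_list(names):
    # A lone string would be iterated character by character by __in,
    # so wrap it in a one-element list first.
    if isinstance(names, str):
        return [names]
    return list(names)
```

With this, as_name_list('Django Reinhardt') yields ['Django Reinhardt'], while an existing list or tuple passes through unchanged.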

Is there a way to separate data from one field in mySQL?

We have a drug column in our database and this contains the Drug Name (e.g. Xanax) and its size (0.5mg).
E.g. Drug Name: Xanax 0.5mg
However, there's a need for us to separate the two values without creating a new column for the size, as doing so would have a huge effect on the database structure. We just need to populate a list with the drug name alone, without its size, based on this single field/column.
Is there a way to extract just the name of the drug? Let's say by forcing the user to add a parenthesis around the drug size (e.g. Xanax (0.5mg))? Meaning just extract all the string that comes before the first "(" character?
Or is there a better way?
Try this. Given mytable:
id  name
1   Xanax (0.5mg)
the query:
select id, substring_index(name, '(', 1) from mytable;
will return:
1, Xanax
So use it accordingly.
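The same extraction (everything before the first '(') can be sketched outside of SQL as well, e.g. in Python:

```python
def drug_name(full_name):
    # Equivalent of MySQL's SUBSTRING_INDEX(name, '(', 1):
    # keep everything before the first '(' and trim trailing spaces.
    return full_name.split('(', 1)[0].strip()
```

Note that a value without any parenthesis is returned unchanged, matching SUBSTRING_INDEX's behavior when the delimiter is absent.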
You can store the drug data in JSON format:
{"some_unique_id": {
"name": "Xanax",
"quantity": "0.5mg"
}}
and then use the functions that search JSON values (MySQL 5.7+) to get the parameter you need.
I have some experience with databases used in the pharma industry, and I can say it's not OK to do it like that.
Here is what I think you should do (normalize):
Table UM (units like mg, ml, etc.)
Table Packing (quantity per piece, piece number, FK ID_UM)
Table Drugs (name, FK id_packing)
Don't worry about space: the UM and Packing tables will have a lot of reused IDs, and an int column takes less space than a varchar.
Alternatively, you can use the JSON idea, but then you will have some problems in the reporting part.

Select all from DB where a column contains a defined value

I am trying to retrieve a list of database records which have specific 'interest codes' inside the 'custom_fields' column. For example, right now there are 100 records, and I need the Name, Email and Interest Code from each of them.
I've tried with the following statement:
SELECT * FROM `subscribers` WHERE list = '27' AND custom_fields LIKE 'CV'
But with no luck, the response was:
MySQL returned an empty result set (i.e. zero rows). ( Query took 0.0003 sec )
You can see in this screenshot that at least two rows have 'CV' inside custom_fields. Whilst within the database they're not called 'Interest Code', that's what these values are, which is why I am referring to them this way.
You need to enclose your search string in wildcards:
select * from subscribers where list = 27 and custom_fields like '%CV%';
The % wildcard means "zero or more characters at this position"; the _ wildcard means "exactly one character at this position". Please read the reference manual on the topic. You may also want to read about regular expressions in MySQL for more complex string comparisons.
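A quick way to see the difference the wildcards make (illustrated here with SQLite and made-up data, since the original schema isn't shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (list INTEGER, custom_fields TEXT)")
conn.executemany(
    "INSERT INTO subscribers VALUES (?, ?)",
    [(27, "AB,CV,XY"), (27, "AB"), (3, "CV")],
)

# Without wildcards: the whole column value must equal 'CV' -> no rows here,
# since the only exact 'CV' row belongs to a different list.
exact = conn.execute(
    "SELECT * FROM subscribers WHERE list = 27 AND custom_fields LIKE 'CV'"
).fetchall()

# With wildcards: 'CV' may appear anywhere inside the value.
partial = conn.execute(
    "SELECT * FROM subscribers WHERE list = 27 AND custom_fields LIKE '%CV%'"
).fetchall()
```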

MySQL - FULLTEXT in BOOLEAN mode + Relevance using views field

I have the following table:
CREATE TABLE IF NOT EXISTS `search`
(
`id` BIGINT(16) NOT NULL AUTO_INCREMENT PRIMARY KEY,
`string` TEXT NOT NULL,
`views` BIGINT(16) NOT NULL,
FULLTEXT(string)
) ENGINE=MyISAM;
It has a total of 5,395,939 entries. To perform a search on a term (like 'a'), I use the query:
SELECT * FROM `search` WHERE MATCH(string) AGAINST('+a*' IN BOOLEAN MODE) ORDER BY `views` DESC LIMIT 10
But it's really slow =(. The query above took 15.4423 seconds to perform. Obviously, it's fast without sorting by views, which takes less than 0.002s.
I'm using ft_min_word_len=1 and an empty ft_stopword_file.
Is there any way to use views as the relevance in the fulltext search without making it too slow? I want the search term "a b" to match "big apple", for example, but not "ibg apple" (I just need the search prefixes to match).
Thanks
Since no one answered my question, I'm posting my solution (not the one I would expect to find if I were googling, since it isn't as easy to apply as a simple database-design fix would be, but it's still a solution to this problem).
I couldn't really solve it with any engine or function offered by MySQL. Sorry =/.
So I decided to write my own software to do it (in C++, but you can do it in any other language).
If what you are looking for is a way to search for prefixes of words in small strings (the average length of my strings is 15), you can use the following algorithm:
1. Create a trie. Each word of each string is put on the trie.
Each leaf has a list of the ids that match that word.
2. Use a map/dictionary (or an array) to memorize the informations
for each id (map[id] = information).
Searching for a string:
Note: the search string will be in the format "word1 word2 word3...". If it contains symbols such as # or $, you might treat them as " " (spaces).
Example: "Rafael Perrella"
1. Search for the prefix "Rafael" in the trie. Put all the ids you
get in a set (a Binary-Search Tree that ignores repeated values).
Let's call this set "mainSet".
2. Search for the prefix "Perrella" in the trie. For each result,
put them in a second set (secSet) if and only if they are already
in the mainSet. Then, clear mainSet and do mainSet = secSet.
3. If there are still words left to search, repeat the second step
   for all of them.
After these steps, you will have a set with all the results. Build a vector of (views, id) pairs and sort it in descending order, then just take the results you want; I limited it to 30 results.
Note: you can sort the words first to remove duplicates and words that are prefixes of other words (for example, in "Jan Ja Jan Ra" you only need "Jan Ra"). I won't go into detail, since the algorithm is pretty obvious.
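The basic algorithm above can be sketched in Python (a simplified version: here every trie node keeps the set of matching ids rather than only the leaves, so a prefix lookup is a single walk down the trie; the class names are made up):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.ids = set()    # ids of strings with a word passing through here

class PrefixIndex:
    def __init__(self):
        self.root = TrieNode()
        self.views = {}  # id -> view count (the map[id] = information step)

    def add(self, id_, text, views):
        self.views[id_] = views
        for word in text.lower().split():
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
                node.ids.add(id_)  # every prefix of the word knows this id

    def _prefix_ids(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return set()
            node = node.children[ch]
        return node.ids

    def search(self, query, limit=30):
        words = query.lower().split()
        if not words:
            return []
        # Intersect the id sets of all query words (mainSet & secSet steps).
        ids = set(self._prefix_ids(words[0]))
        for w in words[1:]:
            ids &= self._prefix_ids(w)
        # Sort the surviving ids by views, descending.
        return sorted(ids, key=lambda i: -self.views[i])[:limit]
```

With the example from the question, searching "a b" matches "big apple" but not "ibg apple", since no word of the latter starts with "b".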
This algorithm may be bad sometimes (for example, if I search for "a b c d e f ... z", I will traverse almost the entire trie...). So I made an improvement.
1. For each "id" in your map, create also a small trie, that will
contain the words of the string (include a trie for each m[id]...
m[id].trie?).
Then, to make a search:
1. Choose the longest word in the search string (not guaranteed,
   but it is probably the word with the fewest results in the trie...).
2. Apply the step 1 of the old algorithm.
3. Make a vector with the ids in the mainSet.
4. Let's make the final vector. For each id in the vector you've created
in step 3, search in the trie of this id (m[id].trie?) for all words
in the search string. If it includes all words, it's a valid id and
you might include it in the final vector; else, just ignore this id.
5. Repeat step 4 until there are no more ids to verify. After that, just
   sort the final vector by (views, id).
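Step 4's per-id check (the small per-id trie) can be approximated with a plain prefix test over the words of each candidate string; words_of_id here is an assumed precomputed set of the words of that id's string:

```python
def matches_all(words_of_id, query_words):
    # An id survives only if every query word is a prefix of at least
    # one word of its string (a stand-in for the per-id trie lookup).
    return all(
        any(word.startswith(q) for word in words_of_id)
        for q in query_words
    )
```

A real per-id trie makes each check faster than this linear scan, but the accept/reject decision is the same.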
Now I use the database just as a way to easily store and load my data; all queries against this table go directly to this software. When I add or remove a record, I send it both to the DB and to the software, so both stay up to date. It takes about 30 s to load all the data, but then the queries are fast (0.03 s for the slowest ones, 0.001 s on average, on my own notebook; I haven't tried it on dedicated hosting, where it might be much faster).