invalid column count on line 1 - mysql

I'm trying to load a CSV that has 21 columns and 240 rows through phpMyAdmin. The most common error message is:
"invalid column count on line 1" (using CSV import)
though when using LOAD DATA, I get:
"error: #1083 – Field separator argument is not what is expected; check the manual"
Columns separated with ,
Columns enclosed with "
Columns escaped with \
Lines terminated with auto (though I've tried \r, \n, \r\n and any combination of the 3)
I have also tried escaping the quotes and commas, but it doesn't seem to make any difference.
This is the first row of the data:
Denis,NULL,Wirtz,"221Maryland Hall 3400 North Charles Street\, Baltimore\, MD 21236",,410-516-7006,410-516-5528,wirtz#jhu.edu,Theophilus Halley Smoot Professor,NULL,NULL,"K.L. Yap\, S.I. Fraley\, M.M. Thiaville\, N. Jinawath\, K. Nakayama\, J.-L. Wang\, T.-L. Wang\, D. Wirtz\, and I.-M. Shih\, ÒNAC1 is an actin-binding protein that is essential for effective cytokinesis in cancer cellsÓ\, Cancer Research 72: 4085_4096 (2012).D.H. Kim\, S.B. Khatau\, Y. Feng\, S. Walcott\, S.X. Sun\, G.D. Longmore\, and D. Wirtz\, ÒActin cap associated focal adhesions and their distinct role in cellular mechanosensingÓ\, Scientific Reports (Nature) 2:555-568 (2012).S.I. Fraley\, Y. Feng\, G.D. Longmore\, and D. Wirtz\, ""Dimensional and temporal controls of cell migration by zyxin and binding partners in three-dimensional matrix""\, Nature Communications 3:719-731 (2012)P.-H. Wu\, C.M. Hale\, J.S.H. Lee\, Y. Tseng\, and D. Wirtz\, ÒHigh-throughput ballistic injection nanorheology (htBIN) to measure cell mechanicsÓ\, Nature Protocols 7: 155_170 (2012)C.M. Hale\, W.-C. Chen\, S.B. Khatau\, B.R. Daniels\, J.S.H. Lee\, and D. Wirtz\, ÒSMRT analysis of MTOC and nuclear positioning reveals the role of EB1 and LIC1 in single-cell polarizationÓ\, Journal of Cell Science124: 4267-4285 (2011).D. Wirtz\, K. Konstantopoulos\, and P.C. Searson\, ÒPhysics of cancer: the role of physical interactions and mechanical forces in cancer metastasisÓ\, Nature Reviews Cancer 11: 512-522 (2011)",,NULL,http://www.jhu.edu/chembe/faculty-template/DenisWirtz.jpg,Department of Chemical and Biomolecular Engineering,NULL,Whiting School of Engineering,"Postdoctoral\, Physics\, Biophysics. ESPCI\, Paris. 1993 - 1994Ph.D.\, Cemical Engineering. Stanford University. 1993M.S.\, Chemical Engineering. Stanford University. 1989B.S.\, Physics Engineering. Free University of Brussels. 1983-1988",Johns_Hopkins_University
Any help is greatly appreciated.

There are backslashes in front of the commas inside the double-quoted strings. If the importer doesn't handle that escaping the way the file expects (or ignores the enclosing quotes), those commas end up being read as column separators and you get the wrong number of columns:
"221Maryland Hall 3400 North Charles Street\, Baltimore\, MD 21236"
Similarly, the doubled double-quotes ("") are usually a way of escaping a single double-quote inside a string, but if the parser reads them as a string terminator instead, that's another way the column count can be thrown off.
I have seen Excel mess up exported data in many fascinating ways, but this one is new to me.
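One quick way to see how those settings change the split is to run a fragment of the row through a CSV parser outside phpMyAdmin. A minimal diagnostic sketch using Python's csv module, with a shortened version of the address field:

import csv
import io

# A shortened version of the problem row: commas inside the quoted field
# are preceded by backslashes, as in the export.
row = '"221Maryland Hall 3400 North Charles Street\\, Baltimore\\, MD 21236",410-516-7006,wirtz#jhu.edu'

# Quotes honoured and backslash treated as an escape: the quoted commas
# stay inside one field, giving 3 columns.
reader = csv.reader(io.StringIO(row), delimiter=',', quotechar='"', escapechar='\\')
print(next(reader))

# No quoting and no escape character: every comma acts as a separator,
# giving 5 columns instead of 3.
reader = csv.reader(io.StringIO(row), quoting=csv.QUOTE_NONE)
print(next(reader))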

Related

Get first sentence in MySQL string by identifying the period, space, then capital letter of the next sentence

String1 = "Widgets Inc. is the largest widgets producer in the world. It's much bigger than McWidgets Inc."
String2 = "Fidgets Inc is the second largest fidgets producer. It's just behind McFidgets Inc. The CEO of this company loves synergy."
String3 = "Glorious Gagets Co. is considered blah blah jdfglmdslgmldfg."
For all of the above scenarios, I would like to reliably select the first sentence only. I would use:
SUBSTRING_INDEX(string, '. ', 1)
However, this would cause issues with the first and third string, as they sometimes have a '.' after the name and sometimes not.
My thought was to use something like SUBSTRING_INDEX(string, '. [A-Z]', 1), and essentially tell it to look for the first '.' which is followed by a space and then any capital letter (i.e. the start of the next sentence), but my SQL-fu is not strong enough yet to figure out how to do that.
[EDIT]: note that there are no real patterns in the sentences.
Any help would be appreciated!
When you have a fixed pattern, you can use LOCATE to find the position and then use SUBSTRING to extract the text. For the starting point you need a regular expression, unless you want to use functions or stored procedures, which you would also need for more complex patterns.
CREATE TABLE table1 (tex varchar(200))
INSERT INTO table1 VALUES ("Widgets Inc. is the largest widgets producer in the world. It's much bigger than McWidgets Inc.")
,("Fidgets Inc is the largest fidgets producer in the world. It's much bigger than McFidgets Inc.")
SELECT SUBSTRING(tex,REGEXP_INSTR(tex, '[A-Z]'),LOCATE('producer in the world.',tex)+ 21) FROM table1
| SUBSTRING(tex,REGEXP_INSTR(tex, '[A-Z]'),LOCATE('producer in the world.',tex)+ 21) |
| :--------------------------------------------------------------------------------- |
| Widgets Inc. is the largest widgets producer in the world. |
| Fidgets Inc is the largest fidgets producer in the world. |
db<>fiddle here
OK, it looks like I have a workaround in the absence of actually identifying sentences in the requested manner, i.e. by somehow including a capital-letter check in the substring parameter.
I found a list of abbreviations that contain a period (Co., Inc., Ltd., etc.), hardcoded replacements for them without the period (Co, Inc, Ltd, etc.), and then did the substring as normal, as sketched below. Not ideal, but it works.
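A minimal sketch of that workaround, written here in Python purely to illustrate the idea (the abbreviation list is just an example):

ABBREVIATIONS = ["Co.", "Inc.", "Ltd.", "Corp."]   # example list only

def first_sentence(text):
    # Strip the period from known abbreviations, then cut at the first ". ",
    # which is the equivalent of SUBSTRING_INDEX(text, '. ', 1).
    for abbr in ABBREVIATIONS:
        text = text.replace(abbr, abbr.rstrip("."))
    head, sep, _ = text.partition(". ")
    return head + ("." if sep else "")

print(first_sentence("Widgets Inc. is the largest widgets producer in the world. "
                     "It's much bigger than McWidgets Inc."))
# Widgets Inc is the largest widgets producer in the world.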

SSRS White space issue

I have 2 scenarios
Scenario 1:
abc Ins Services,
123 Pine St Fl 23
San Francisco, CA, USA
SCENARIO 2:
abc Ins Services,
#4567
123 Pine St Fl 23
San Francisco, CA, USA
All fields are dynamic and I used Trim in every expression, but white space still appears as shown in scenario 1. I don't want this white space.
The space that you're seeing there isn't just a space character, it's a line return. These can be stored in the strings in your database as part of the address. They are hard to see when you preview results in a program like SSMS. The Trim function only removes spaces. Line returns are usually made up of ASCII characters 10 and 13. In order to remove line returns you can use the Replace function like so:
=REPLACE(REPLACE(<string to search in>, CHR(13), ""), CHR(10), "")
This allows you to add your own line returns where you actually want them.

pos_tagging and NER tagging of a MUC dataset does not work correctly

I have a problem with the MUC dataset. I want to do NER on it, but all the words in this dataset are in capital letters, so when the POS tagger is run it incorrectly tags every word as a noun. To solve this, I first converted the whole text to lower case. However, this raises another problem: if the text is in lowercase, the NER does not work properly and finds literally no PERSON, ORGANIZATION or LOCATION entities. So I kept the conversion to lower case so that POS tagging would succeed, and then manually capitalized each word before feeding it into the NER module. But this raises yet another problem: now NER detects everything as LOCATION.
Here is my code:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

def NER(input_file, output_file):
    output = open('{0}_NER.txt'.format(output_file), 'w')
    testset = open(input_file).readlines()
    for line in testset:
        line_clean = line.lower().strip()
        tokens = nltk.word_tokenize(line_clean)
        poss = nltk.pos_tag(tokens)
        mylist = []
        for w in poss:
            s = list(w)
            s1 = s[0].upper()
            tmp = (s1, w[1])
            mylist.append(tmp)
        ner_ = nltk.ne_chunk(mylist)
Any help would be greatly appreciated.
Thanks.
Here is a piece of this dataset:
SAN SALVADOR, 3 JAN 90 -- [REPORT] [ARMED FORCES PRESS COMMITTEE,
COPREFA] [TEXT] THE ARCE BATTALION COMMAND HAS REPORTED THAT ABOUT 50
PEASANTS OF VARIOUS AGES HAVE BEEN KIDNAPPED BY TERRORISTS OF THE
FARABUNDO MARTI NATIONAL LIBERATION FRONT [FMLN] IN SAN MIGUEL
DEPARTMENT. ACCORDING TO THAT GARRISON, THE MASS KIDNAPPING TOOK PLACE ON
30 DECEMBER IN SAN LUIS DE LA REINA. THE SOURCE ADDED THAT THE TERRORISTS
FORCED THE INDIVIDUALS, WHO WERE TAKEN TO AN UNKNOWN LOCATION, OUT OF
THEIR RESIDENCES, PRESUMABLY TO INCORPORATE THEM AGAINST THEIR WILL INTO
CLANDESTINE GROUPS.
Your best bet is to train your own named entity classifier on case-folded text. The NLTK book has a step-by-step tutorial in chapters 6 and 7. For training you could use the CoNLL 2003 corpus.
Consider also training your own POS tagger on case-folded text; it might work better than the NLTK POS tagger you're using now (but check).
Why do you need POS tagging if your task is NER? As far as I know, POS tags do not really improve the NER result. I agree with Alexis that you need to train your own classifier, since you don't have access to word-shape features without case information.
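As a very rough illustration of that suggestion, here is a minimal sketch of a per-token classifier trained on lower-cased text using NLTK's NaiveBayesClassifier. The two training sentences are made-up placeholders; a real model would be trained on a corpus such as CoNLL 2003 and would use much richer features.

import nltk

def token_features(tokens, i):
    # Simple case-insensitive features for the token at position i.
    word = tokens[i]
    return {
        "word": word,
        "suffix3": word[-3:],
        "prev": tokens[i - 1] if i > 0 else "<START>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "<END>",
    }

# Placeholder training data: (lower-cased tokens, per-token IOB labels).
training_sentences = [
    (["the", "arce", "battalion", "reported", "from", "san", "miguel"],
     ["O", "B-ORG", "I-ORG", "O", "O", "B-LOC", "I-LOC"]),
    (["peasants", "were", "kidnapped", "in", "san", "luis", "de", "la", "reina"],
     ["O", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC"]),
]

train_set = [(token_features(tokens, i), labels[i])
             for tokens, labels in training_sentences
             for i in range(len(tokens))]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# With this toy training set the predicted labels are only illustrative.
sentence = "terrorists kidnapped peasants in san miguel".split()
print([(w, classifier.classify(token_features(sentence, i)))
       for i, w in enumerate(sentence)])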

What is the best data type for ISBN10 and ISBN13 in a MySQL database

For an application I'm currently building I need a database to store books. The schema of the books table should contain the following attributes:
id, isbn10, isbn13, title, summary
What data types should I use for ISBN10 and ISBN13? My first thought was a big integer type, but I've read some unsubstantiated comments that say I should use a varchar.
You'll want a CHAR/VARCHAR (CHAR is probably the best choice, as you know the length - 10 and 13 characters). Numeric types like INTEGER will remove leading zeroes in ISBNs like 0-684-84328-5.
ISBN numbers should be stored as strings, varchar(17) for instance.
You need 17 characters for ISBN13, 13 numbers plus the hyphens, and 13 characters for ISBN10, 10 numbers plus hyphens.
ISBN10
ISBN10 numbers, though called "numbers", may contain the letter X. The last character of an ISBN10 is a check digit that ranges from 0 to 10, and 10 is represented as X. They may also begin with a double zero, such as 0062472100, and in a numeric column the leading 00 would be stripped once stored.
84-7844-453-X is a valid ISBN10, in which 84 means Spain, 7844 is the publisher's number, 453 is the book number and X (i.e. 10) is the check digit. If we remove the hyphens we mix the publisher with the book id. Is that really important? It depends on the use you'll give to the number. Bibliographic researchers (I've found myself in that situation) might need it for many reasons that I won't go into here, since they have nothing to do with storing data. I would advise against removing hyphens, but the truth is everyone does it.
ISBN13
ISBN13 faces the same issues regarding meaning, in that, with the hyphens you get 4 blocks of meaningful data, without them, language, publisher and book id would become lost.
Nevertheless, the check digit will only ever be 0-9; there will never be a letter. But should you feel tempted to store only ISBN13 numbers (since an ISBN10 can automatically and without fail be upgraded to ISBN13) and use an int for that, you could run into issues in the future: all ISBN13 numbers currently begin with 978 or 979, but in the future a prefix such as 078 could be added.
A light explanation about ISBN13
A deeper explanation of ISBN numbers
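If you want to sanity-check values before storing them, the check-digit rules described above are easy to verify in application code. A minimal sketch in Python (purely illustrative, not tied to the MySQL schema):

def is_valid_isbn10(isbn):
    # Validate an ISBN10 check digit; hyphens are ignored.
    chars = isbn.replace("-", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for i, ch in enumerate(chars):
        if ch == "X" and i == 9:
            value = 10          # X stands for a check digit of 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total += value * (10 - i)
    return total % 11 == 0

def is_valid_isbn13(isbn):
    # Validate an ISBN13 check digit; hyphens are ignored.
    chars = isbn.replace("-", "")
    if len(chars) != 13 or not chars.isdigit():
        return False
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(chars))
    return total % 10 == 0

print(is_valid_isbn10("84-7844-453-X"))      # True
print(is_valid_isbn13("978-0-306-40615-7"))  # True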

Human name comparison: ways to approach this task

I'm not a Natural Language Processing student, yet I know it's not as trivial as strcmp(n1, n2).
Here's what I've learned so far:
comparing Personal Names can't be solved 100%
there are ways to achieve a certain degree of accuracy.
the answer will be locale-specific, that's OK.
I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.
For example, all the names below can refer to the same person:
Berry Tsakala
Bernard Tsakala
Berry J. Tsakala
Tsakala, Berry
I'm trying to:
build (or copy) an algorithm which grades the relationship between 2 input names
find an indexing method (for names in my database, for hash tables, etc.)
note:
My task isn't about finding names in text, but to compare 2 names. e.g.
name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%
I used Tanimoto Coefficient for a quick (but not super) solution, in Python:
"""
Formula:
Na = number of set A elements
Nb = number of set B elements
Nc = number of common items
T = Nc / (Na + Nb - Nc)
"""
def tanimoto(a, b):
c = [v for v in a if v in b]
return float(len(c)) / (len(a)+len(b)-len(c))
def name_compare(name1, name2):
return tanimoto(name1, name2)
>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
Edit: A link to a good and useful book.
Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
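For reference, a minimal Python sketch of the classic American Soundex rules, together with a helper that picks the surname out of either word order (this is only an illustration; ready-made implementations also exist in third-party libraries):

def soundex(word):
    # Classic American Soundex: keep the first letter, encode the rest
    # as digits, collapse repeats, ignore vowels, pad to 4 characters.
    word = word.upper()
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "HW":
            continue            # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code
    return (word[0] + "".join(digits) + "000")[:4]

def surname(full_name):
    # Take the part before the comma, or else the last word, as the surname.
    return full_name.split(",")[0].strip() if "," in full_name else full_name.split()[-1]

print(soundex(surname("Berry Tsakala")), soundex(surname("Tsakala, Berry")))  # T224 T224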
We've just been doing this sort of work non-stop lately and the approach we've taken is to have a look-up table or alias list. If you can discount misspellings/misheard/non-english names then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between would be discarded (middle names, initials). Berry and Bernard would be in the alias list - and when Tsakala did not match to Berry we would flip the word order around and then get the match.
One thing you need to understand is the database/people lists you are dealing with. In the English speaking world middle names are inconsistently recorded. So you can't make or deny a match based on the middle name or middle initial. Soundex will not help you with common name aliases such as "Dick" and "Richard", "Berry" and "Bernard" and possibly "Steve" and "Stephen". In some communities it is quite common for people to live at the same address and have 2 or 3 generations living at that address with the same name. The only way you can separate them is by date of birth. Date of birth may or may not be recorded. If you have the clout then you should probably make the recording of date of birth mandatory. A lot of "people databases" either don't record date of birth or won't give them away due to privacy reasons.
Effectively, people-name matching is not that complicated. It's entirely based on the quality of the data supplied. What happens in practice is that a lot of records remain unmatched, and even a human looking at them can't resolve the mismatch. A human may notice name aliases not recorded in the alias list, or may be able to look up details of the person on the internet, but you can't really expect your programme to do that.
Banks, credit rating organisations and the government have a lot of detailed information about us. Previous addresses, date of birth etc. And that helps them join up names. But for us normal programmers there is no magic bullet.
Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.
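A minimal sketch of that alias-list idea in Python, with a toy alias table (a real one would be far larger):

ALIASES = {"berry": "bernard", "dick": "richard", "steve": "stephen"}

def normalise(name):
    # Return (forename, surname): handle "Surname, Forename" order and
    # discard anything in between (middle names, initials).
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
    else:
        parts = name.split()
        first, last = parts[0], parts[-1]
    first = first.split()[0].lower()
    return ALIASES.get(first, first), last.lower()

def same_person(name1, name2):
    return normalise(name1) == normalise(name2)

print(same_person("Berry J. Tsakala", "Tsakala, Bernard"))  # True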
I had real problems with the Tanimoto coefficient and UTF-8.
What works for languages that use diacritical signs is difflib.SequenceMatcher():
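For example, a small comparison helper along those lines (the name_similarity wrapper is just an illustration):

from difflib import SequenceMatcher

def name_similarity(name1, name2):
    # ratio() returns a similarity score between 0 and 1 and copes with
    # accented characters without any special handling.
    return SequenceMatcher(None, name1.lower(), name2.lower()).ratio()

print(name_similarity("Berry Tsakala", "Bernard Tsakala"))
print(name_similarity("José García", "Jose Garcia"))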