I need to split data within a cell separated by - (dash) and put it into separate columns. The problem I am having is that there may be more than one -.
So using the table below with the original data coming from sic_orig, I need to put everything before the first - in sic_num and everything after the first - in sic_desc. I'm sure this is really easy, but I can't seem to find anything clear on this.
This is what my table should look like with sic_orig being the source and sic_num and sic_desc being data pulled from sic_orig:
sic_orig                          | sic_num | sic_desc
----------------------------------|---------|--------------------------
509406 - Jewelers-Wholesale       | 509406  | Jewelers-Wholesale
506324 - Burglar Alarm Systems    | 506324  | Burglar Alarm Systems
502317 - Picture Frames-Wholesale | 502317  | Picture Frames-Wholesale
This code works, but only gives the right result when there are exactly two -'s, and some cells may have 1, 2 or 3 -'s:
UPDATE test_tbl_1
SET sic_num = SUBSTRING_INDEX(`sic_orig`, '-', 1),
    sic_desc = SUBSTRING_INDEX(`sic_orig`, '-', -2);
How do I split everything before the first - and everything after the first -?
One method is to compute the length of the first part and use that with SUBSTR():
UPDATE test_tbl_1
SET sic_num = SUBSTRING_INDEX(`sic_orig`, '-', 1),
    sic_desc = SUBSTR(`sic_orig`, CHAR_LENGTH(SUBSTRING_INDEX(`sic_orig`, '-', 1)) + 2);
You can use a combination of the SUBSTR() and LOCATE() functions to slice the string:
UPDATE test_tbl_1
SET sic_num = SUBSTR(sic_orig, 1, LOCATE('-', sic_orig) - 1),
    sic_desc = SUBSTR(sic_orig, LOCATE('-', sic_orig) + 1);
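If some rows might contain no dash at all, LOCATE() returns 0 and both slices come out wrong; a guarded variant (a sketch, which also trims the stray spaces around the dash) would be:
UPDATE test_tbl_1
SET sic_num = TRIM(SUBSTR(sic_orig, 1, LOCATE('-', sic_orig) - 1)),
    sic_desc = TRIM(SUBSTR(sic_orig, LOCATE('-', sic_orig) + 1))
WHERE LOCATE('-', sic_orig) > 0;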
See the MySQL reference manual for the full list of string functions.
Another alternative is to get a count of the dashes in the string: replace all dash characters with an empty string, and subtract that length from the length of the original string.
As a demonstration:
SELECT `sic_orig`
     , CHAR_LENGTH(`sic_orig`)-CHAR_LENGTH(REPLACE(`sic_orig`,'-','')) AS cnt_dashes
  FROM ( SELECT '509406 - Jewelers-Wholesale ' AS sic_orig
         UNION ALL SELECT '506324 - Burglar Alarm Systems'
         UNION ALL SELECT '502317 - Picture Frames-Wholesale'
         UNION ALL SELECT 'lots-of - -dashes- --everywhere-- --'
         UNION ALL SELECT ' zero dashes '
       ) t
returns:
sic_orig                              cnt_dashes
------------------------------------  ----------
509406 - Jewelers-Wholesale                    2
506324 - Burglar Alarm Systems                 1
502317 - Picture Frames-Wholesale              2
lots-of - -dashes- --everywhere-- --          10
 zero dashes                                   0
We can use the expression that returns the count of dashes as the third argument of SUBSTRING_INDEX, multiplying by negative 1 to get a negative value...
SELECT `sic_orig`
     , TRIM(
         SUBSTRING_INDEX(`sic_orig`,'-'
           , 1
         )
       ) AS before_first_dash
     , TRIM(
         SUBSTRING_INDEX(`sic_orig`,'-'
           , -1*(CHAR_LENGTH(`sic_orig`)-CHAR_LENGTH(REPLACE(`sic_orig`,'-','')))
         )
       ) AS after_first_dash
  FROM ( SELECT '509406 - Jewelers-Wholesale ' AS sic_orig
         UNION ALL SELECT '506324 - Burglar Alarm Systems'
         UNION ALL SELECT '502317 - Picture Frames-Wholesale'
         UNION ALL SELECT 'lots-of - -dashes- - -every-where-'
         UNION ALL SELECT ' zero dashes '
       ) t
returns:
sic_orig                            before_first_dash  after_first_dash
----------------------------------  -----------------  -----------------------------
509406 - Jewelers-Wholesale         509406             Jewelers-Wholesale
506324 - Burglar Alarm Systems      506324             Burglar Alarm Systems
502317 - Picture Frames-Wholesale   502317             Picture Frames-Wholesale
lots-of - -dashes- - -every-where-  lots               of - -dashes- - -every-where-
 zero dashes                        zero dashes
The extra line breaks and formatting are intended to make the expressions easier to decipher, the parens easier to balance, etc.
I always test my expressions with a SELECT statement first, before I put those expressions into an UPDATE statement.
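Once the SELECT looks right, the same expressions drop straight into the UPDATE; a sketch (rows with no dash simply get an empty sic_desc):
UPDATE test_tbl_1
SET sic_num = TRIM(SUBSTRING_INDEX(sic_orig, '-', 1)),
    sic_desc = TRIM(SUBSTRING_INDEX(sic_orig, '-',
                 -1*(CHAR_LENGTH(sic_orig)-CHAR_LENGTH(REPLACE(sic_orig,'-','')))));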
I have a column phone_number in a database table where an entry may contain more than one phone number. The plan is to identify entries which do not pass a regex validation.
This is the query I am using to accomplish my objective:
SELECT id, phone_number FROM store WHERE phone_number NOT REGEXP '^\s*\(?(020[78]?\)? ?[1-9][0-9]{2,3} ?[0-9]{4})|(0[1-8][0-9]{3}\)? ?[1-9][0-9]{2} ?[0-9]{3})\s*$';
Problem is, every time I run the code, I get an error:
Error Code: 1139. Got error 'repetition-operator operand invalid' from regexp
Thanks in advance.
The regex you are using has at least 2 issues: 1) the escapes should be doubled, and 2) there are 2 groups separated with |, which makes the ^ apply only to the first alternative and the $ only to the second.
'^\s*\(?(020[78]?\)? ?[1-9][0-9]{2,3} ?[0-9]{4})|(0[1-8][0-9]{3}\)? ?[1-9][0-9]{2} ?[0-9]{3})\s*$'
        ^--------------------------------------^ ^------------------------------------------^
You can use
'^[[:space:]]*\\(?(020[78]?\\)? ?[1-9][0-9]{2,3} ?[0-9]{4}|0[1-8][0-9]{3}\\)? ?[1-9][0-9]{2} ?[0-9]{3})[[:space:]]*$'
Breakdown:
^ - start of string
[[:space:]]* - 0+ whitespaces
\\(? - 1 or 0 ( chars
(020[78]?\\)? ?[1-9][0-9]{2,3} ?[0-9]{4}|0[1-8][0-9]{3}\\)? ?[1-9][0-9]{2} ?[0-9]{3}) - An alternation group matching 2 alternatives:
020[78]?\\)? ?[1-9][0-9]{2,3} ?[0-9]{4} - 020 + optional 7 or 8 + an optional ) + an optional space + a digit from 1 to 9 + 2 or 3 digits + an optional space + 4 digits
| - or
0[1-8][0-9]{3}\\)? ?[1-9][0-9]{2} ?[0-9]{3} - 0 + a digit from 1 to 8 + 3 digits + an optional ) + an optional space + a digit from 1 to 9 + 2 digits + an optional space + 3 digits
[[:space:]]* - 0+ whitespaces
$ - end of string
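Plugged back into the query from the question, the corrected filter reads:
SELECT id, phone_number
FROM store
WHERE phone_number NOT REGEXP '^[[:space:]]*\\(?(020[78]?\\)? ?[1-9][0-9]{2,3} ?[0-9]{4}|0[1-8][0-9]{3}\\)? ?[1-9][0-9]{2} ?[0-9]{3})[[:space:]]*$';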
I have a dataset in the form of a CSV file that is sent to me on a regular basis. I want to import this data into my MySQL database and turn it into a proper set of tables. The problem I am having is that one of the fields is used to store multiple values. For example, the field stores email addresses; it may have one email address, or two, or three, or four, etc. The field contents would look something like this: "user1@domain.com,user2@domain.com,user3@domain.com".
I need to be able to take the undetermined number of values from each field and then add them to a separate table so that they look like this:
user1@domain.com
user2@domain.com
user3@domain.com
I am not sure how I can do this. Thank you for the help.
Probably the simplest way is a brute force approach of inserting the first email, then the second, and so on:
insert into newtable(email)
select substring_index(emails, ',', 1)
from emails;
insert into newtable(email)
select substring_index(substring_index(emails, ',', 2), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 1;
insert into newtable(email)
select substring_index(substring_index(emails, ',', 3), ',', -1)
from emails
where (length(replace(emails, ',', ',,')) - length(emails)) >= 2;
And so on.
That is, extract the nth element from the list and insert that into the table. The where clause counts the number of commas in the list, which is a proxy for its length; the nth element exists when there are at least n - 1 commas.
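The comma-counting expression is easy to check on its own: replacing every comma with two commas grows the string by exactly one character per comma:
SELECT emails,
       LENGTH(REPLACE(emails, ',', ',,')) - LENGTH(emails) AS num_commas
FROM emails;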
You need to repeat this up to the maximum number of emails in the list.
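On MySQL 8.0 or later, a recursive CTE can walk lists of any length in a single statement instead; a sketch, assuming the same emails table and newtable as above:
INSERT INTO newtable(email)
WITH RECURSIVE split AS (
    -- anchor: the first element plus the remainder of the list (NULL when done)
    SELECT SUBSTRING_INDEX(emails, ',', 1) AS email,
           CASE WHEN emails LIKE '%,%'
                THEN SUBSTRING(emails, LOCATE(',', emails) + 1) END AS rest
    FROM emails
    UNION ALL
    -- step: peel one element off the remainder on each iteration
    SELECT SUBSTRING_INDEX(rest, ',', 1),
           CASE WHEN rest LIKE '%,%'
                THEN SUBSTRING(rest, LOCATE(',', rest) + 1) END
    FROM split
    WHERE rest IS NOT NULL
)
SELECT email FROM split;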
Instead of importing the csv file directly and then trying to fix the problems in it, I found the best way to attack this was to first pass the csv through AWK.
AWK outputs three separate csv files that follow the normal forms. I then import those tables and all is well.
info="`ncftpget -V -c -u myuser -p mypassword ftp://fake.com/data_map.csv`"

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
    split($6, keyvalue, ";")
    for (var in keyvalue) {
        gsub(/.*:/, "", keyvalue[var])
        print $1, keyvalue[var]
    }}' > ~/sqlrw/table1.csv

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
    split($6, keyvalue, ";")
    for (var in keyvalue) {
        gsub(/:/, ",", keyvalue[var])
        print keyvalue[var]
    }}' > ~/sqlrw/table2.csv

sort -u ~/sqlrw/table2.csv -o ~/sqlrw/table2.csv

echo "$info" | \
awk -F, -v OFS="," 'NR > 1 {
    print $1, $2, $3, $4, $5, $7, $8
}' > ~/sqlrw/table3.csv
Maybe a simple PHP script would do the trick:
<?php
$file = file_get_contents("my_file.csv");
$tmp = explode(",", $file); // the addresses in the field are separated by commas
for ($i = 0; $i < count($tmp); $i++)
{
    $field = trim($tmp[$i]);
    $q = "INSERT INTO my_table (emails) VALUES ('$field')";
    // or use $i as an id if you don't have an autoincrement
    $q = "INSERT INTO my_table (id, emails) VALUES ($i, '$field')";
    // execute query ....
}
?>
Hope this helps even if it's not pure SQL .....
I was looking for a way to write a function in R which converts an IP address into an integer.
My dataframe looks like this:
total IP
626 189.14.153.147
510 67.201.11.8
509 64.22.53.140
483 180.9.85.10
403 98.8.136.126
391 64.06.187.68
I export this data from a MySQL database. I have a query that converts an IP address into an integer in MySQL:
SELECT CAST(SUBSTRING_INDEX(SUBSTRING_INDEX('75.19.168.155', '.', 1), '.', -1) << 24 AS UNSIGNED)
     + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX('75.19.168.155', '.', 2), '.', -1) << 16 AS UNSIGNED)
     + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX('75.19.168.155', '.', 3), '.', -1) << 8 AS UNSIGNED)
     + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX('75.19.168.155', '.', 4), '.', -1) AS UNSIGNED) AS final;
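(As an aside, MySQL's built-in INET_ATON() performs the same big-endian conversion in one call:
SELECT INET_ATON('75.19.168.155'); -- returns 1259579547
)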
But I want to do this conversion in R; any help would be awesome.
You were not entirely specific about what conversion you wanted, so I multiplied the decimal values by what I thought might be appropriate (thinking the three-digit items were actually digit equivalents in "base 256" numbers then redisplayed in base 10). If you wanted the order of the locations to be reversed, as I have seen suggested elsewhere, you would reverse the indexing of 'vals' in both solutions.
convIP <- function(IP) {
    vals <- read.table(text = as.character(IP), sep = ".")
    return(vals[1] + 256*vals[2] + 256^2*vals[3] + 256^3*vals[4])
}
> convIP(dat$IP)
V1
1 2476281533
2 134990147
3 2352289344
4 173345204
5 2122844258
6 1153107520
(It's usually better practice to specify what you think the correct answer should be, so testing can be done. Bertelson's comment above would be faster and implicitly uses 1000, 1000^2 and 1000^3 as the factors.)
I am taking a crack at simplifying the code but fear that the need to use Reduce("+", ...) may make it more complex. You cannot use sum because it is not vectorized.
convIP <- function(IP) {
    vals <- read.table(text = as.character(IP), sep = ".")
    # Map pairs each column with its own factor; a plain vals * 256^(3:0)
    # would recycle the factors down the columns and give wrong results
    return(Reduce("+", Map("*", vals, 256^(3:0))))
}
> convIP(dat$IP)
[1] 3171850643 1137249032 1075197324 3020510474 1644726398 1074182980
I have a MySQL database with 7 columns (chr, pos, num, iA, iB, iC, iD) and a file that contains 40 million lines, each containing a dataset. Each line has 4 tab-delimited columns; the first three columns always contain data, and the fourth column can contain up to four different key=value pairs separated by semicolons:
chr pos num info
1 10203 3 iA=0.34;iB=nerv;iC=45;iD=dskf12586
1 10203 4 iA=0.44;iC=45;iD=dsf12586;iB=nerv
1 10203 5
1 10213 1 iB=nerv;iC=49;iA=0.14;iD=dskf12586
1 10213 2 iA=0.34;iB=nerv;iD=cap1486
1 10225 1 iD=dscf12586
The key=value pairs in the column info have no specific order. I'm also not sure if a key can occur twice (I hope not).
I'd like to write the data into the database. The first three columns are no problem, but extracting the values from the info column puzzles me, since the key=value pairs are unordered and not every key has to be present in a line.
For a similar dataset (with an ordered info column) I used a Java program in connection with regular expressions, which allowed me to (1) check and (2) extract the data, but now I'm stranded.
How can I resolve this task, preferably with a bash script or directly in MySQL?
You did not mention exactly how you want to write the data, but the example below with awk shows how you can get each individual key and value from each line. Instead of the printf, you can use your own logic to write the data:
[[bash_prompt$]]$ cat test.sh; echo "###########"; awk -f test.sh log
{
if(length($4)) {
split($4,array,";");
print "In " $1, $2, $3;
for(element in array) {
key=substr(array[element],0,index(array[element],"="));
value=substr(array[element],index(array[element],"=")+1);
printf("found %s key and %s value for %d line from %s\n",key,value,NR,array[element]);
}
}
}
###########
In 1 10203 3
found iD= key and dskf12586 value for 1 line from iD=dskf12586
found iA= key and 0.34 value for 1 line from iA=0.34
found iB= key and nerv value for 1 line from iB=nerv
found iC= key and 45 value for 1 line from iC=45
In 1 10203 4
found iB= key and nerv value for 2 line from iB=nerv
found iA= key and 0.44 value for 2 line from iA=0.44
found iC= key and 45 value for 2 line from iC=45
found iD= key and dsf12586 value for 2 line from iD=dsf12586
In 1 10213 1
found iD= key and dskf12586 value for 4 line from iD=dskf12586
found iB= key and nerv value for 4 line from iB=nerv
found iC= key and 49 value for 4 line from iC=49
found iA= key and 0.14 value for 4 line from iA=0.14
In 1 10213 2
found iA= key and 0.34 value for 5 line from iA=0.34
found iB= key and nerv value for 5 line from iB=nerv
found iD= key and cap1486 value for 5 line from iD=cap1486
In 1 10225 1
found iD= key and dscf12586 value for 6 line from iD=dscf12586
Awk solution from @abasu with inserts, which also handles the unordered key-value pairs.
parse.awk:
NR>1 {
col["iA"]=col["iB"]=col["iC"]=col["iD"]="null";
if(length($4)) {
split($4,array,";");
for(element in array) {
split(array[element],keyval,"=");
col[keyval[1]] = "'" keyval[2] "'";
}
}
print "INSERT INTO tbl VALUES (" $1 "," $2 "," $3 "," col["iA"] "," col["iB"] "," col["iC"] "," col["iD"] ");";
}
Test/run:
$ awk -f parse.awk file
INSERT INTO tbl VALUES (1,10203,3,'0.34','nerv','45','dskf12586');
INSERT INTO tbl VALUES (1,10203,4,'0.44','nerv','45','dsf12586');
INSERT INTO tbl VALUES (1,10203,5,null,null,null,null);
INSERT INTO tbl VALUES (1,10213,1,'0.14','nerv','49','dskf12586');
INSERT INTO tbl VALUES (1,10213,2,'0.34','nerv',null,'cap1486');
INSERT INTO tbl VALUES (1,10225,1,null,null,null,'dscf12586');
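If the goal is to load these rows straight into MySQL, the generated statements can be piped into the mysql command-line client (a sketch; substitute your own credentials and schema name):
$ awk -f parse.awk file | mysql -u myuser -p mydb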