Replace not working - mysql

I have a text file of data in key-value pairs that I have managed to convert to a format where the key-value pairs are all separated by underscores, and each key is separated from its value by a colon. I thought this format would be useful for keeping spaces intact within the data. Here's an example with the data replaced by ~~~~~~~s.
_ID:~~~_NAME:~~~~~_DESCRIPTION:~~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~ ...etc
I want to convert this to a MySQL script to insert the data into a table. My problem is that there are nullable fields which aren't included in every record, e.g. every record has a _TYPE1: but may or may not have a _TYPE2:
... _DESCRIPTION:~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~_ADDRESS:~~~~~~~ ...
... _DESCRIPTION:~~~~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~ ...
... _DESCRIPTION:~~~~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~ ...
... _DESCRIPTION:~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~_ADDRESS:~~~~~~~ ...
... _DESCRIPTION:~~~~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~ ...
I thought I could fix this by inserting _TYPE2: after every _TYPE1: without a _TYPE2:. Since there are only a few different possible types, I managed to select the _ after each _TYPE1:~~~~~~ without a _TYPE2: following it. I used the following regex, where egtype is one example of a possible type:
(?<=_TYPE1:egtype)_(?!TYPE2:)
At this point, all I have to do is replace that _ with _TYPE2:_ and every field is present in every line, which makes it easy to convert every row to a MySQL insert statement! Unfortunately, Notepad++ is not replacing it when I click the Replace button. I'm not sure why.
Does anyone know why it wouldn't replace an _ with _TYPE2:_ using that particular regex? Or does anyone have any other suggestions on how to turn all this data into a MySQL insert script?

Regex
To do what you want, try this:
Find:
_TYPE1:[^_]+\K(?!.*_TYPE2)
Replace:
_TYPE2:
You can test it with your sample data, and get the pattern explained piece by piece, in an online regex tester.
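If you want to sanity-check the transformation outside Notepad++, here is a minimal Python sketch of the same idea. Python's re module doesn't support \K, so the sketch captures the _TYPE1 field and re-emits it instead; the sample values are made-up placeholders:

import re

lines = [
    "_ID:~~~_NAME:~~~~~_TYPE1:egtype_TYPE2:~~~~~~_ADDRESS:~~~~~~~",
    "_ID:~~~_NAME:~~~~~_TYPE1:egtype_ADDRESS:~~~~~~~",
]

# Capture the _TYPE1 field; substitute only when no _TYPE2 appears later on the line.
pattern = re.compile(r"(_TYPE1:[^_\n]+)(?!.*_TYPE2)")

for line in lines:
    print(pattern.sub(r"\1_TYPE2:", line))

The first line comes through unchanged; the second gains an empty _TYPE2: field, just like the Notepad++ replacement above.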
Python Script plugin
As a side note, I don't think it's possible to convert your data into SQL insert statements with a single regular expression, and while I see what you are trying to do by adding a fake TYPE2, I don't think it is your best option.
So, my suggestion is to use Notepad++'s Python Script plugin.
Install the Python Script plugin, from the Plugin Manager or from the official website.
Then go to Plugins > Python Script > New Script. Choose a filename for your new file (e.g. sql_insert.py) and copy in the code that follows.
Run Plugins > Python Script > Scripts > sql_insert.py and a new tab will open with the desired result.
Script:
columns = [[]]
values = [[]]
current_line = 0

def insert(line, match):
    global current_line
    # Keep one list per document line, even when lines in between have no matches.
    while line > current_line:
        current_line += 1
        columns.append([])
        values.append([])
    if match:
        # Group 1 is the column name, group 2 is its value.
        columns[line].append(match.group(1))
        values[line].append(match.group(2))

# pysearch calls insert(line, match) for every match of the regex in the document.
editor.pysearch("_([A-Z0-9]+):([^_\n]+)", insert)

notepad.new()
for line in range(len(columns)):
    editor.addText("INSERT INTO table (" + ",".join(columns[line]) + ") values (" + ",".join(values[line]) + ");\n")
Note: I'm still learning Python and I've a feeling that this one could be written in a better way. Feel free to edit my answer or drop a comment if you can suggest improvements!
Example input:
_ID:~~~_NAME:~~~~~_DESCRIPTION:~~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~
_ID:~~~_NAME:~~~~~_DESCRIPTION:~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~_ADDRESS:~~~~~~~
_ID:~~~_NAME:~~~~~_DESCRIPTION:~~~~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~
Example output:
INSERT INTO table (ID,NAME,DESCRIPTION,TYPE1,TYPE2) values (~~~,~~~~~,~~~~~~~,~~~~~~,~~~~~~);
INSERT INTO table (ID,NAME,DESCRIPTION,TYPE1,TYPE2,ADDRESS) values (~~~,~~~~~,~~~~~~,~~~~~~,~~~~~~,~~~~~~~);
INSERT INTO table (ID,NAME,DESCRIPTION,TYPE1,ADDRESS) values (~~~,~~~~~,~~~~~~,~~~~~~,~~~~~~~);

Try searching for: (_TYPE1:)(\S\S\S\S\S\S)(_ADDRESS:)
and replacing with: \1\2_TYPE2:~~~~~~\3
I tested it in Notepad++ with your data and it works.
Don't forget to change the Search Mode to regular expression.
To turn it into an INSERT script, just keep using regular expressions like I did above: bracket whichever fields you want, replace them with \number references, and move them around. It should be pretty simple manual labor; have fun.
For example, search for your whole line (here I am only doing DESCRIPTION, TYPE1, and TYPE2) using regular expression:
(_DESCRIPTION)(:)(\S\S\S\S\S\S)(_TYPE1)(:)(\S\S\S\S\S\S)(_TYPE2)(:)(\S\S\S\S\S\S)
then replace with something like (in Notepad++):
INSERT INTO table1\(desc,type1,type2\)values\('\3','\6','\9'\);
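If you'd rather script that last capture-and-reorder step than do it by hand, here is a rough Python sketch of the same idea; table1 and the column names are the placeholders used above, and \S{6} is just shorthand for the six \S's:

import re

line = "_DESCRIPTION:~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~"

# Same groups as the Notepad++ search, reordered into an INSERT statement.
pattern = re.compile(r"(_DESCRIPTION)(:)(\S{6})(_TYPE1)(:)(\S{6})(_TYPE2)(:)(\S{6})")
print(pattern.sub(r"INSERT INTO table1(desc,type1,type2) values('\3','\6','\9');", line))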

If this is a one-off problem then a two-step process would work. The first step would add a _TYPE2:SomeDefaultValue to every line; the second would remove it from lines where it was not needed.
Step 1: Find what: $, Replace with: _TYPE2:xxx
Step 2: Find what: (_TYPE2:.*)_TYPE2:xxx$, Replace with: \1
In both steps select "regular expression" and un-select "dot matches newline". Also change xxx to your default value.
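The same two passes are easy to prototype in Python before touching the real file (xxx stands in for your default value, as above):

import re

lines = [
    "_DESCRIPTION:~~~~~~_TYPE1:~~~~~~_TYPE2:~~~~~~_ADDRESS:~~~~~~~",
    "_DESCRIPTION:~~~~~~_TYPE1:~~~~~~_ADDRESS:~~~~~~~",
]

for line in lines:
    step1 = line + "_TYPE2:xxx"                               # Step 1: append a default to every line
    step2 = re.sub(r"(_TYPE2:.*)_TYPE2:xxx$", r"\1", step1)   # Step 2: drop it where a real _TYPE2 exists
    print(step2)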

Related

Multiline Filebeat pattern to match multiple words

There is some confusion here where I have to use a Filebeat multiline pattern to collect data.
My question is: how do I use multiple patterns?
Here is what I use now:
multiline.pattern : '^Select'
With the above pattern, every line that starts with Select will match. So my question is: what about the INSERT, UPDATE and DELETE words?
Also, one more question: can I use the pattern below to indicate the end of a multiline match?
multiline.flush_pattern: ';'
Any idea or help is highly appreciated.
To your first question:
You can specify multiple words for the beginning of the message within a single regex. So if I understood you correctly, you want to include all log lines that start with Select, INSERT, UPDATE and DELETE. To achieve this you would define a group of valid values like so:
multiline.pattern : '^(Select|INSERT|UPDATE|DELETE)'
The pipe character ( | ) acts as an OR operator. Please note that regex is case sensitive by default, so e.g. messages that start with an uppercase SELECT would be ignored by the sample above.
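A quick Python check illustrates both the alternation and the case sensitivity (Filebeat compiles its patterns with Go's regexp package, which accepts the same (?i) prefix for case-insensitive matching, though that is worth verifying against your Filebeat version):

import re

pattern = re.compile(r"^(Select|INSERT|UPDATE|DELETE)")

for line in ["Select foo from bar", "DELETE from bar", "SELECT baz", "  where id = 1"]:
    print(bool(pattern.match(line)), line)
# True Select foo from bar
# True DELETE from bar
# False SELECT baz      <- uppercase SELECT is not in the alternation
# False   where id = 1  <- continuation line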
To your second question:
Besides multiline.pattern you have to specify the settings multiline.match and multiline.negate:
multiline.match determines if the log lines before or after the pattern should be put into a single event.
multiline.negate determines whether the pattern is negated, i.e. whether it is the matching or the non-matching lines that get appended.
So instead of specifying a particular end character, you tell Filebeat that every log line following a line that matches the pattern, and not itself matching it, should get aggregated with that line, until a following line matches the pattern again.
(See https://www.elastic.co/guide/en/beats/filebeat/current/multiline-examples.html for a full reference and description).
Example:
Assuming your log file is structured as follows:
Select foo from bar\n where baz = 1\n and id =4711;\n\n
DELETE from bar\n where baz = null;\n\n
INSERT ...
the following config should do the job:
multiline.pattern : '^(Select|INSERT|UPDATE|DELETE)'
multiline.match: after
multiline.negate: true
I hope I could help you.

MySQLAdmin replace text in a field with percent in text

Using MySQLAdmin. I moved data from a Windows server and am trying to replace the case in URLs, but I am not finding the matches. I need the slashes, as I don't want to replace text in anything but the URLs (in the post table). I think the %20s are the problem somehow?
UPDATE table_name SET field = replace(field, '/user%20name/', '/User%20Name/')
The actual string is more like:
https://www.example.com/forum/uploads/user%20name/GFCI%20Stds%20Rev%202006%20.pdf
In case you are using MariaDB, you have the REGEXP_REPLACE() function.
But the best approach is to dump the table into a file, open it in Notepad++,
and run a regex replace like this:
Pattern is: (https:[\/\w\s\.]+uploads/)(\w+)\%20(\w+)((\/.*)+)
Replace with: $1\u$2\%20\u$3$4
Then import the table again.
Hope this helps.
If it's MariaDB, you can do the following:
UPDATE table_name SET field = REGEXP_REPLACE(field, '\/user%20name\/', '\/User%20Name\/');
First, please check what is actually stored in the database: %20 is the URL encoding of a space character. Usually, when you store such a URL in the database, it will be represented as an actual whitespace (converted before you store it), hence your replace doesn't match the actual data.
The second option that might be possible, depending on what you want to do: you are seeing the URL containing %20, so you created the database records (which you would like to fetch) with that additional %20. When you now query based on the actual URL, the %20 is replaced with an actual whitespace before your query runs, and hence it doesn't match your stored data.
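Python's standard library makes it easy to see which of the two forms you are dealing with; unquote decodes %20 into a real space and quote goes the other way (the sample path is from the question):

from urllib.parse import quote, unquote

print(unquote("/user%20name/"))   # -> /user name/    (decoded form, as often stored)
print(quote("/User Name/"))       # -> /User%20Name/  (encoded form, as seen in the URL)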

Read a list of CSV files in Talend with ; in field

I have a list of CSV files which I receive for ETL into a database every month. They are in a folder. My data has ; in many columns as well. For example, the location column has values like New York; USA, which I want to appear in a single column instead of being split across many columns. How do I specify the delimiter then?
I think you cannot have the field separator included in the field content, or you have to enclose these values between double quotes. For example:
blabla;"New York; USA";blabla
Another solution: change the field delimiter to a more specific (and unused) character.
I'm afraid there is no better solution.
Regards,
TRF
As TRF mentioned, you can't have the delimiter as part of the non-delimiting text in your file.
My workaround for that would be the following:
1) Read the file with a tFileInputFullRow (https://help.talend.com/display/TalendComponentsReferenceGuide54EN/tFileInputFullRow)
2) Use a tReplace to replace the ; with some other character, say -, for the problem cells (in your case, replace "New York;USA" with "New York-USA"). You can also use the regex option in the tReplace component to make it a generic rule.
3) Save that output into another file
4) Now read the new file using ; as the delimiter (a rough sketch of this pre-processing follows the references below)
References:
1) tReplace: https://help.talend.com/display/TalendOpenStudioComponentsReferenceGuide521EN/18.16+tReplace
2) Regex: https://docs.oracle.com/javase/tutorial/essential/regex/
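As promised, here is a rough Python sketch of steps 1-3, assuming the problem values are the ones wrapped in double quotes; the file names and the - replacement character are placeholders:

import re

# Steps 1-2: read each full row and replace ; with - inside double-quoted cells,
# e.g. "New York; USA" becomes "New York- USA".
def fix_row(row):
    return re.sub(r'"[^"]*"', lambda m: m.group(0).replace(";", "-"), row)

# Step 3: write the repaired rows to a new file.
with open("input.csv") as src, open("fixed.csv", "w") as dst:
    for row in src:
        dst.write(fix_row(row))

# Step 4: read fixed.csv in Talend with ; as the delimiter.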

MySQL find/replace with a unique string inside

Not sure how far I'm going to get with this, but I'm going through a database removing certain bits and pieces in preparation for a conversion to different software.
I'm struggling with the image tags as on the site they currently look like
[img:<string>]<image url>[/img:<string>]
Those strings are in another field called bbcode_uid.
The query I'm running to make the changes so far is
UPDATE phpbb_posts SET post_text = REPLACE(post_text, '[img:]', '');
So my actual question: is there any way of pulling each string from bbcode_uid into that SQL query, so that I don't have to run the same command 10,000+ times, changing the unique string every time?
Alternatively, could I include something inside [img:] to also match the next 8 characters, whatever they may be, as that is the length of the string that is used?
Hoping to save time with this, otherwise I might have to think of another way of doing it.
As requested.
The text I wish to replace would be
[img:1nynnywx]http://i.imgur.com/Tgfrd3x.jpg[/img:1nynnywx]
I want to end up with just
http://i.imgur.com/Tgfrd3x.jpg
Just removing the code around the URL; however, each post_text has a different string, which is contained inside bbcode_uid.
Method 1
LIB_MYSQLUDF_PREG
If you want more regular expression power in your database, you can consider using LIB_MYSQLUDF_PREG. This is an open source library of MySQL user functions that imports the PCRE library. LIB_MYSQLUDF_PREG is delivered in source code form only. To use it, you'll need to be able to compile it and install it into your MySQL server. Installing this library does not change MySQL's built-in regex support in any way. It merely makes the following additional functions available:
PREG_CAPTURE extracts a regex match from a string. PREG_POSITION returns the position at which a regular expression matches a string. PREG_REPLACE performs a search-and-replace on a string. PREG_RLIKE tests whether a regex matches a string.
All these functions take a regular expression as their first parameter. This regular expression must be formatted like a Perl regular expression operator. E.g. to test if a regex matches a subject case-insensitively, you'd use the MySQL code PREG_RLIKE('/regex/i', subject). This is similar to PHP's preg functions, which also require the extra // delimiters for regular expressions inside the PHP string.
You can refer to this link: github.com/hholzgra/mysql-udf-regexp
Method 2
Use a PHP program, fetch records one by one, and use PHP's preg_replace.
Refer to: www.php.net/preg_replace
Reference: http://www.online-ebooks.info/article/MySql_Regular_Expression_Replace.html
You might be able to do this with substring_index().
The following will work on your example:
select substring_index(substring_index(post_text, '[/img:', 1), ']', -1)
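A quick way to convince yourself the nested calls do the right thing is to replay them in Python on the sample value; the helper below is a minimal stand-in for MySQL's SUBSTRING_INDEX semantics:

def substring_index(s, delim, count):
    # Positive count: everything before the count-th delimiter from the left.
    # Negative count: everything after the count-th delimiter from the right.
    parts = s.split(delim)
    return delim.join(parts[:count]) if count > 0 else delim.join(parts[count:])

post_text = "[img:1nynnywx]http://i.imgur.com/Tgfrd3x.jpg[/img:1nynnywx]"
inner = substring_index(post_text, "[/img:", 1)   # strips the closing tag
print(substring_index(inner, "]", -1))            # -> http://i.imgur.com/Tgfrd3x.jpg

If it checks out on your data, the same expression can be dropped into an UPDATE ... SET post_text = ... statement.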

Why does SSIS TOKEN function fail to count adjacent column delimiters?

I ran into a problem with SQL Server Integration Services 2012's new string function in the Expression Editor called TOKEN().
This is supposed to help you parse a delimited record. If the record comes out of a flat file, you can do this with the Flat File Source. In this case, I am dealing with old delimited import records that were stored as strings in a database VARCHAR field. Now they need to be extracted, massaged, and re-exported as delimited strings. For example:
1^Apple^0001^01/01/2010^Anteater^A1
2^Banana^0002^03/15/2010^Bear^B2
3^Cranberry^0003^4/15/2010^Crow^C3
If these strings are in a column called OldImportRecord, the delimiter is a caret (as shown), and we wish to put the fifth field into a Derived Column, we would use an expression like:
TOKEN(OldImportRecord,"^",5)
This returns Anteater, Bear, Crow, etc. In fact, we can create Derived Columns for each of the fields in this record (note that the index is one-based), change them as needed, and then build another delimited record for export.
Here's the problem. What if some of our data includes some empty strings (or Nulls rendered as empty strings)?
4^^0004^6/15/2010^Duck^D4
TOKEN() fails to count the adjacent column delimiters, which throws off the column count. Now it only sees five columns instead of six. Our TOKEN(OldImportRecord,"^",5) returns "D4" instead of the intended "Duck". When we extract the fourth column, we wind up trying to put "Duck" into a Date column, and all sorts of fun ensues.
Here's a partial workaround:
TOKEN(REPLACE(OldImportRecord,"^^","^ ^"),"^",5)
Notice this misses every second delimiter pair, so it will fail for a string like "5^^^^Emu^E5", which looks like "5^ ^^ ^Emu^E5" after the REPLACE(). The column count is still wrong.
So here's my full workaround. This includes two nested REPLACE() statements, an RTRIM() to remove the superfluous spaces, and a DT_STR cast because I would like to keep the result in VARCHAR:
(DT_STR,255,1252)RTRIM(TOKEN(REPLACE(REPLACE(OldImportRecord,"^^","^ ^"),"^^","^ ^"),"^",5))
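The behaviour is easy to reproduce outside SSIS. A short Python sketch, using str.replace, which has the same left-to-right, non-overlapping semantics as the REPLACE here, shows why the nested call is needed:

s = "5^^^^Emu^E5"
once = s.replace("^^", "^ ^")      # -> 5^ ^^ ^Emu^E5  (misses every second pair)
twice = once.replace("^^", "^ ^")  # -> 5^ ^ ^ ^Emu^E5 (all delimiters now separated)
print(once)
print(twice)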
I am posting this for information, since others may also run into this problem.
Does anyone have a better workaround, or even a real solution?
Reason for the issue:
The TOKEN method in SSIS uses the implementation of the strtok function in C++. I gathered this information while reading the book Microsoft® SQL Server® 2012 Integration Services. It is mentioned as a note on page 113 (I like this book! Lots of nice information.).
I searched for the implementation of strtok function and I found the following links.
INFO: strtok(): C Function -- Documentation Supplement - The code sample in this link shows that the function does ignore consecutive delimiter characters.
The answers to the following SO questions point out that strtok function is designed to ignore consecutive delimiters.
Need to know when no data appears between two token separators using strtok()
strtok_s behaviour with consecutive delimiters
I think that the TOKEN and TOKENCOUNT functions are working as per design but whether that is how SSIS should behave might be a question for the Microsoft SSIS team.
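You can mimic the difference with Python's str.split, which keeps empty fields, versus a strtok-style split that discards them; this is a sketch of the semantics, not of the SSIS internals:

record = "4^^0004^6/15/2010^Duck^D4"

print(record.split("^"))
# ['4', '', '0004', '6/15/2010', 'Duck', 'D4']  <- six fields, empties kept

print([t for t in record.split("^") if t])
# ['4', '0004', '6/15/2010', 'Duck', 'D4']      <- strtok-style: empties collapsed, like TOKEN()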
Original Post - Above section is an update:
I created a simple package in SSIS 2012 based on your data inputs. As you had described in your question, the TOKEN function does not behave as intended. I agree with you that the function doesn't seem to work. This post is not an answer to your original issue.
Here is an alternative way to write the expression in a relatively simpler fashion. This will only work if the last segment in your input record will always have a value (say A1, B2, C3 etc.).
The expression can be rewritten as shown below.
This statement takes the input record as the first parameter and the delimiter caret (^) as the second. The third parameter calculates the total number of segments in the record when split by the delimiter; since the last segment always has a value, the count is reliable, and you can subtract 1 to fetch the penultimate segment.
(DT_STR,50,1252)TOKEN(OldImportRecord,"^",TOKENCOUNT(OldImportRecord,"^") - 1)
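In Python terms, remembering that TOKEN and TOKENCOUNT collapse empty segments, the expression amounts to this sketch:

record = "1^Apple^0001^01/01/2010^Anteater^A1"

tokens = [t for t in record.split("^") if t]  # TOKEN's strtok-style segments
print(tokens[len(tokens) - 2])                # TOKEN(record, "^", TOKENCOUNT(...) - 1) -> Anteater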
I created a simple package with a data flow task. An OLE DB source retrieves the data, and the derived column transformation parses and splits the data as per the screenshot below. The output is then inserted into the destination table. You can see the source and destination tables in the last screenshot. The destination table has two columns: the first stores the penultimate segment data, and the second the segment count based on the delimiter (which again isn't correct). You can notice that the last record didn't fetch the correct results. If the last record didn't have the value 8, the above expression would fail, because it would evaluate to a zero index.
Hope that helps to simplify your expression.
If you don't hear from anyone else, I would recommend logging this issue in Microsoft Connect website.
Create table and populate scripts:
CREATE TABLE [dbo].[SourceTable](
[OldImportRecord] [varchar](50) NOT NULL
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[DestinationTable](
[NewImportRecord] [varchar](50) NOT NULL,
[CaretCount] [int] NOT NULL
) ON [PRIMARY]
GO
INSERT INTO dbo.SourceTable (OldImportRecord) VALUES
('1^Apple^0001^01/01/2010^Anteater^A1'),
('2^Banana^0002^03/15/2010^Bear^B2'),
('3^Cranberry^0003^4/15/2010^Crow^C3'),
('4^^0004^6/15/2010^Duck^D4'),
('5^^^^Emu^E5'),
('6^^^^Geese^F6'),
('^^^^Pheasant^G7'),
('8^^^^Sparrow^');
GO
Derived column transformation inside data flow task:
Data in source and destination tables:
Not only does TOKEN skip adjacent delimiters, it skips leading and trailing delimiters as well. So, using your example, if you had a "good" field that looks like this:
1^Apple^0001^01/01/2010^Anteater^A1
Followed by one with adjacent and leading delimiters like this:
^^^0004^6/15/2010^Duck^
TOKENCOUNT would only find two delimiters and you'd end up with 0004 assigned to Token1, 6/15/2010 for Token2, and Duck for Token3.
I used a different kind of replace. Rather than placing spaces between adjacent delimiters, which wouldn't help with leading or trailing ones, I used REPLACE to surround the delimiters with characters I absolutely wouldn't find in my text. The following expression works well for me. It's wordy, but it is what it is.
(DT_STR,255,1252)REPLACE(TOKEN(REPLACE(OldImportRecord,"^","~^~"),"^",1),"~","")
Of course, you'd replace the number 1 with whatever Token you wanted and adjust the cast according to your needs. Hope that helps.
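Here is a quick Python sketch of that surround-and-strip trick, again using a strtok-style split to stand in for TOKEN (token numbers are 1-based in SSIS, so the helper subtracts 1):

def token(s, delim, n):
    # strtok-style: consecutive delimiters collapse, like SSIS TOKEN()
    parts = [t for t in s.split(delim) if t]
    return parts[n - 1] if n <= len(parts) else ""

record = "^^^0004^6/15/2010^Duck^"
padded = record.replace("^", "~^~")            # pad, so no field is ever empty
print(token(padded, "^", 4).replace("~", ""))  # -> 0004 (correct despite the leading ^^^)
print(token(padded, "^", 6).replace("~", ""))  # -> Duck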