PostgreSQL replace HTML entities function - html

I've found this very interesting function on internet:
CREATE OR REPLACE FUNCTION strip_tags(TEXT) RETURNS TEXT AS $$
SELECT regexp_replace(regexp_replace($1, E'(?x)<[^>]*?(\s alt \s* = \s* ([\'"]) ([^>]*?) \2) [^>]*? >', E'\3'), E'(?x)(< [^>]*? >)', '', 'g')
$$ LANGUAGE SQL;
But it doesn't remove html codes like: "
Is it possible to remove them using regexp_replace?

Yes it is possible to replace HTML or other character entities with the respective characters using a function. First create a character entity table:
create table character_entity(
name text primary key,
ch char(1) unique
);
insert into character_entity (ch, name) values
(E'\u00C6','AElig'),(E'\u00C1','Aacute'),(E'\u00C2','Acirc'),(E'\u00C0','Agrave'),(E'\u0391','Alpha'),(E'\u00C5','Aring'),(E'\u00C3','Atilde'),(E'\u00C4','Auml'),(E'\u0392','Beta'),(E'\u00C7','Ccedil'),
(E'\u03A7','Chi'),(E'\u2021','Dagger'),(E'\u0394','Delta'),(E'\u00D0','ETH'),(E'\u00C9','Eacute'),(E'\u00CA','Ecirc'),(E'\u00C8','Egrave'),(E'\u0395','Epsilon'),(E'\u0397','Eta'),(E'\u00CB','Euml'),
(E'\u0393','Gamma'),(E'\u00CD','Iacute'),(E'\u00CE','Icirc'),(E'\u00CC','Igrave'),(E'\u0399','Iota'),(E'\u00CF','Iuml'),(E'\u039A','Kappa'),(E'\u039B','Lambda'),(E'\u039C','Mu'),(E'\u00D1','Ntilde'),
(E'\u039D','Nu'),(E'\u0152','OElig'),(E'\u00D3','Oacute'),(E'\u00D4','Ocirc'),(E'\u00D2','Ograve'),(E'\u03A9','Omega'),(E'\u039F','Omicron'),(E'\u00D8','Oslash'),(E'\u00D5','Otilde'),(E'\u00D6','Ouml'),
(E'\u03A6','Phi'),(E'\u03A0','Pi'),(E'\u2033','Prime'),(E'\u03A8','Psi'),(E'\u03A1','Rho'),(E'\u0160','Scaron'),(E'\u03A3','Sigma'),(E'\u00DE','THORN'),(E'\u03A4','Tau'),(E'\u0398','Theta'),
(E'\u00DA','Uacute'),(E'\u00DB','Ucirc'),(E'\u00D9','Ugrave'),(E'\u03A5','Upsilon'),(E'\u00DC','Uuml'),(E'\u039E','Xi'),(E'\u00DD','Yacute'),(E'\u0178','Yuml'),(E'\u0396','Zeta'),(E'\u00E1','aacute'),
(E'\u00E2','acirc'),(E'\u00B4','acute'),(E'\u00E6','aelig'),(E'\u00E0','agrave'),(E'\u2135','alefsym'),(E'\u03B1','alpha'),(E'\u0026','amp'),(E'\u2227','and'),(E'\u2220','ang'),(E'\u00E5','aring'),
(E'\u2248','asymp'),(E'\u00E3','atilde'),(E'\u00E4','auml'),(E'\u201E','bdquo'),(E'\u03B2','beta'),(E'\u00A6','brvbar'),(E'\u2022','bull'),(E'\u2229','cap'),(E'\u00E7','ccedil'),(E'\u00B8','cedil'),
(E'\u00A2','cent'),(E'\u03C7','chi'),(E'\u02C6','circ'),(E'\u2663','clubs'),(E'\u2245','cong'),(E'\u00A9','copy'),(E'\u21B5','crarr'),(E'\u222A','cup'),(E'\u00A4','curren'),(E'\u21D3','dArr'),
(E'\u2020','dagger'),(E'\u2193','darr'),(E'\u00B0','deg'),(E'\u03B4','delta'),(E'\u2666','diams'),(E'\u00F7','divide'),(E'\u00E9','eacute'),(E'\u00EA','ecirc'),(E'\u00E8','egrave'),(E'\u2205','empty'),
(E'\u2003','emsp'),(E'\u2002','ensp'),(E'\u03B5','epsilon'),(E'\u2261','equiv'),(E'\u03B7','eta'),(E'\u00F0','eth'),(E'\u00EB','euml'),(E'\u20AC','euro'),(E'\u2203','exist'),(E'\u0192','fnof'),
(E'\u2200','forall'),(E'\u00BD','frac12'),(E'\u00BC','frac14'),(E'\u00BE','frac34'),(E'\u2044','frasl'),(E'\u03B3','gamma'),(E'\u2265','ge'),(E'\u003E','gt'),(E'\u21D4','hArr'),(E'\u2194','harr'),
(E'\u2665','hearts'),(E'\u2026','hellip'),(E'\u00ED','iacute'),(E'\u00EE','icirc'),(E'\u00A1','iexcl'),(E'\u00EC','igrave'),(E'\u2111','image'),(E'\u221E','infin'),(E'\u222B','int'),(E'\u03B9','iota'),
(E'\u00BF','iquest'),(E'\u2208','isin'),(E'\u00EF','iuml'),(E'\u03BA','kappa'),(E'\u21D0','lArr'),(E'\u03BB','lambda'),(E'\u2329','lang'),(E'\u00AB','laquo'),(E'\u2190','larr'),(E'\u2308','lceil'),
(E'\u201C','ldquo'),(E'\u2264','le'),(E'\u230A','lfloor'),(E'\u2217','lowast'),(E'\u25CA','loz'),(E'\u200E','lrm'),(E'\u2039','lsaquo'),(E'\u2018','lsquo'),(E'\u003C','lt'),(E'\u00AF','macr'),
(E'\u2014','mdash'),(E'\u00B5','micro'),(E'\u00B7','middot'),(E'\u2212','minus'),(E'\u03BC','mu'),(E'\u2207','nabla'),(E'\u00A0','nbsp'),(E'\u2013','ndash'),(E'\u2260','ne'),(E'\u220B','ni'),
(E'\u00AC','not'),(E'\u2209','notin'),(E'\u2284','nsub'),(E'\u00F1','ntilde'),(E'\u03BD','nu'),(E'\u00F3','oacute'),(E'\u00F4','ocirc'),(E'\u0153','oelig'),(E'\u00F2','ograve'),(E'\u203E','oline'),
(E'\u03C9','omega'),(E'\u03BF','omicron'),(E'\u2295','oplus'),(E'\u2228','or'),(E'\u00AA','ordf'),(E'\u00BA','ordm'),(E'\u00F8','oslash'),(E'\u00F5','otilde'),(E'\u2297','otimes'),(E'\u00F6','ouml'),
(E'\u00B6','para'),(E'\u2202','part'),(E'\u2030','permil'),(E'\u22A5','perp'),(E'\u03C6','phi'),(E'\u03C0','pi'),(E'\u03D6','piv'),(E'\u00B1','plusmn'),(E'\u00A3','pound'),(E'\u2032','prime'),
(E'\u220F','prod'),(E'\u221D','prop'),(E'\u03C8','psi'),(E'\u0022','quot'),(E'\u21D2','rArr'),(E'\u221A','radic'),(E'\u232A','rang'),(E'\u00BB','raquo'),(E'\u2192','rarr'),(E'\u2309','rceil'),
(E'\u201D','rdquo'),(E'\u211C','real'),(E'\u00AE','reg'),(E'\u230B','rfloor'),(E'\u03C1','rho'),(E'\u200F','rlm'),(E'\u203A','rsaquo'),(E'\u2019','rsquo'),(E'\u201A','sbquo'),(E'\u0161','scaron'),
(E'\u22C5','sdot'),(E'\u00A7','sect'),(E'\u00AD','shy'),(E'\u03C3','sigma'),(E'\u03C2','sigmaf'),(E'\u223C','sim'),(E'\u2660','spades'),(E'\u2282','sub'),(E'\u2286','sube'),(E'\u2211','sum'),
(E'\u2283','sup'),(E'\u00B9','sup1'),(E'\u00B2','sup2'),(E'\u00B3','sup3'),(E'\u2287','supe'),(E'\u00DF','szlig'),(E'\u03C4','tau'),(E'\u2234','there4'),(E'\u03B8','theta'),(E'\u03D1','thetasym'),
(E'\u2009','thinsp'),(E'\u00FE','thorn'),(E'\u02DC','tilde'),(E'\u00D7','times'),(E'\u2122','trade'),(E'\u21D1','uArr'),(E'\u00FA','uacute'),(E'\u2191','uarr'),(E'\u00FB','ucirc'),(E'\u00F9','ugrave'),
(E'\u00A8','uml'),(E'\u03D2','upsih'),(E'\u03C5','upsilon'),(E'\u00FC','uuml'),(E'\u2118','weierp'),(E'\u03BE','xi'),(E'\u00FD','yacute'),(E'\u00A5','yen'),(E'\u00FF','yuml'),(E'\u03B6','zeta'),
(E'\u200D','zwj'),(E'\u200C','zwnj')
;
This is the function:
create or replace function entity2char(t text)
returns text as $body$
declare
r record;
begin
for r in
select distinct ce.ch, ce.name
from
character_entity ce
inner join (
select name[1] "name"
from regexp_matches(t, '&([A-Za-z]+?);', 'g') r(name)
) s on ce.name = s.name
loop
t := replace(t, '&' || r.name || ';', r.ch);
end loop;
for r in
select distinct
hex[1] hex,
('x' || repeat('0', 8 - length(hex[1])) || hex[1])::bit(32)::int codepoint
from regexp_matches(t, '&#x([0-9a-f]{1,8}?);', 'gi') s(hex)
loop
t := regexp_replace(t, '&#x' || r.hex || ';', chr(r.codepoint), 'gi');
end loop;
for r in
select distinct
chr(codepoint[1]::int) ch,
codepoint[1] codepoint
from regexp_matches(t, '&#([0-9]{1,10}?);', 'g') s(codepoint)
loop
t := replace(t, '&#' || r.codepoint || ';', r.ch);
end loop;
return t;
end;
$body$
language plpgsql immutable;
Use it like this:
select entity2char('HH■XXXÆYYY×ZZZ■UUU');
entity2char
--------------------
HH■XXXÆYYY×ZZZ■UUU
It only works for UTF-8.

This classic quote may apply here: Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. Regex are useful, but HTML parsing is not a job they're well suited for. Jeff Atwood explains this well. To strip tags from HTML correctly some kind of parsing is necessary.
What I would recommend is that you use a more powerful PL like PL/Perl or PL/Pythonu to invoke mature and well tested HTML-stripping libraries. For example, you could use Perl's HTML::Strip via a plperl function that accepts text and returns text.
The quick and dirty way to handle this would be to use another layer of regexp_replace expressions to convert entities. This will rapidly lead you down the path alluded to by Igor though, and is best avoided by using tools that aready exist. For example, if you use HTML::Strip it'll use HTML::Entities to convert entities for you as part of the process.

I've been using this successfully for while - thanks for the solution.
However I've just discovered that this doesn't seem to work with HTML items such as ² (superscript 2 = &sup2 ), and I suspect any other HTML item that has digit just before the closing ";".
I believe the line
from regexp_matches(t, '&([A-Za-z]+?);', 'g') r(name)
should be
from regexp_matches(t, '&([A-Za-z]+[0-9]?);', 'g') r(name)
I've tried this with a few examples and it seems to work.

Related

How to crop text between braces

I have data in MySQL in one string field with below structure:
{language *lang_code*}text{language}{language *lang_code*}text{language}
And here is example:
{language en}text in english{language}{language de}text in german{language}
The ideal output for this would be in this case
text in english
So we want to disregard the other languages, just want to extract the first one, and put it into new column, because it's often the title of the product, with translations, and for us the first one is the most important.
The values in first braces may be different, so for example here the first one is english, but in other example it might be in german, so the lang code might also be dynamic.
I am wondering if it's possible to extract the text value between two first braces through SQL query?
This is really horrible but it works for your simple example -
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR('{language en}text in english{language}{language de}text in german{language}', '\\{language en\\}(.*?)\\{language\\}'), '}', -2), '{', 1);
or
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR('{language en}text in english{language}{language de}text in german{language}', '\\{language de\\}(.*?)\\{language\\}'), '}', -2), '{', 1);
to retrieve the german text.
To retrieve the first text in the string regardless of language you can use -
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR('{language en}text in english{language}{language de}text in german{language}', '\\{language [a-z]{2}\\}(.*?)\\{language\\}'), '}', -2), '{', 1);
Note this version assumes the language code is always 2 x a-z chars - [a-z]{2}
Here is an example of the above wrapped in a stored function -
DELIMITER $$
CREATE FUNCTION `ExtractLangString`(content TEXT, lang CHAR(8))
RETURNS text
DETERMINISTIC
BEGIN
-- if lang is not 2 chars in length or lang not found return first language string
IF LENGTH(lang) <> 2 OR content NOT LIKE CONCAT('%{language ', lang, '}%') THEN
SET lang = '[a-z]{2}';
END IF;
RETURN SUBSTRING_INDEX(SUBSTRING_INDEX(REGEXP_SUBSTR(content, CONCAT('\\{language ', lang, '\\}(.*?)\\{language\\}')), '}', -2), '{', 1);
END$$
DELIMITER ;
There is probably a cleaner way of doing it but I cannot think of it right now.
Obviously, the better solution would be to normalise the data that is currently serialised into this column.

Sql script to get English text from script

I have a column with both English and Chinese text.
Example: The hills have eyes. 隔山有眼
Expected results: The hills have eyes.
How can I extract the English text from that string using sql, please.
Thanks for help.
A quick-and-dirty way simply converts the string to ASCII and removes the '?' -- which is the representation of the other characters:
select replace(convert(t.str using ascii), '?', '')
from t;
The only downside is that you lose '?' characters in the original string as well.
Here is a db<>fiddle.
For more control over the replacement, you can use regexp_replace():
select regexp_replace(t.str, '[^a-zA-Z0-9.?, ]', '')
from t;
Unfortunately, I am not aware of a character class for ASCII-only characters.
One option you have is to use a function that returns just the english only text.
Additionally, you could make it dual-purpose with another parameter to determine if you want the English text or Non-English text to switch the <127 comparison.
CREATE FUNCTION `EnglishOnly`(String VARCHAR(100))
RETURNS varchar(100)
NO SQL
BEGIN
DECLARE output VARCHAR(100) DEFAULT '';
DECLARE i INTEGER DEFAULT 1;
DECLARE ch varchar(1);
IF LENGTH(string) > 0 THEN
WHILE(i <= LENGTH(string)) DO
SET ch=SUBSTRING(string, i, 1);
IF ASCII(ch)<127 then
set output = CONCAT(output,ch);
END IF;
SET i = i + 1;
END WHILE;
END IF;
RETURN output;
END;
You can then sinply use it like so
select EnglishOnly ("The hills have eyes 隔山有眼that see all.")
Output
The hills have eyes that see all.
Example Fiddle

unescape diactrics in \u0 format (json) in ms sql (SQL Server)

I'm getting json file, which I load to Azure SQL databese. This json is direct output from API, so there is nothing I can do with it before loading to DB.
In that file, all Polish diactircs are escaped to "C/C++/Java source code" (based on: http://www.fileformat.info/info/unicode/char/0142/index.htm
So for example:
ł is \u0142
I was trying to find some method to convert (unescape) those to proper Polish letters.
In worse case scenario, I can write function which will replace all combinations
Repalce(Replace(Replace(string,'\u0142',N'ł'),'\u0144',N'ń')))
And so on, making one big, terrible function...
I was looking for some ready functions like there is for URLdecode, which was answered here on stack in many topics, and here: https://www.codeproject.com/Articles/1005508/URL-Decode-in-T-SQL
Using this solution would be possible but I cannot figure out cast/convert with proper collation and types in there, to get result I'm looking for.
So if anyone knows/has function that would make conversion in string for unescaping that \u this would be great, but I will manage to write something on my own if I would get right conversion. For example I tried:
select convert(nvarchar(1), convert(varbinary, 0x0142, 1))
I made assumption that changing \u to 0x will be the answer but it gives some Chinese characters. So this is wrong direction...
Edit:
After googling more I found exactly same question here on stack from #Pasetchnik: Json escape unicode in SQL Server
And it looks this would be the best solution that there is in MS SQL.
Onlty thing I needed to change was using NVARCHAR instead of VARCHAR that is in linked solution:
CREATE FUNCTION dbo.Json_Unicode_Decode(#escapedString nVARCHAR(MAX))
RETURNS nVARCHAR(MAX)
AS
BEGIN
DECLARE #pos INT = 0,
#char nvarCHAR,
#escapeLen TINYINT = 2,
#hexDigits TINYINT = 4
SET #pos = CHARINDEX('\u', #escapedString, #pos)
WHILE #pos > 0
BEGIN
SET #char = NCHAR(CONVERT(varbinary(8), '0x' + SUBSTRING(#escapedString, #pos + #escapeLen, #hexDigits), 1))
SET #escapedString = STUFF(#escapedString, #pos, #escapeLen + #hexDigits, #char)
SET #pos = CHARINDEX('\u', #escapedString, #pos)
END
RETURN #escapedString
END
Instead of nested REPLACE you could use:
DECLARE #string NVARCHAR(MAX)= N'\u0142 \u0144\u0142';
SELECT #string = REPLACE(#string,u, ch)
FROM (VALUES ('\u0142',N'ł'),('\u0144', N'ń')) s(u, ch);
SELECT #string;
DBFiddle Demo

function return varray error

I keep getting a error when i run this code, What wrong with the code?
create or replace function f_vars(line varchar2,delimit varchar2 default ',')
return line_type is type line_type is varray(1000) of varchar2(3000);
sline varchar2 (3000);
line_var line_type;
pos number;
begin
sline := line;
for i in 1 .. lenght(sline)
loop
pos := instr(sline,delimit,1,1);
if pos =0 then
line_var(i):=sline;
exit;
endif;
string:=substr(sline,1,pos-1);
line_var(i):=string;
sline := substr(sline,pos+1,length(sline));
end loop;
return line_var;
end;
LINE/COL ERROR
20/5 PLS-00103: Encountered the symbol "LOOP" when expecting one of
the following:
if
22/4 PLS-00103: Encountered the symbol "end-of-file" when expecting
one of the following:
end not pragma final instantiable order overriding static
member constructor map
Stack Overflow isn't really a de-bugging service.
However, I'm feeling generous.
You have spelt length incorrectly; correcting this should fix your first error. Your second is caused by endif;, no space, which means that the if statement has no terminator.
This will not correct all your errors. For instance, you're assigning something to the undefined (and unnecessary) variable string.
I do have more to say though...
I cannot over-emphasise the importance of code-style and whitespace. Your code is fairly unreadable. While this may not matter to you now it will matter to someone else coming to the code in 6 months time. It will probably matter to you in 6 months time when you're trying to work out what you wrote.
Secondly, I cannot over-emphasise the importance of comments. For exactly the same reasons as whitespace, comments are a very important part of understanding how something works.
Thirdly, always explicitly name your function when ending it. It makes things a lot clearer in packages so it's a good habit to have and in functions it'll help with matching up the end problem that caused your second error.
Lastly, if you want to return the user-defined type line_type you need to declare this _outside your function. Something like the following:
create or replace object t_line_type as object ( a varchar2(3000));
create or replace type line_type as varray(1000) of t_line_type;
Adding whitespace your function might look something like the following. This is my coding style and I'm definitely not suggesting that you should slavishly follow it but it helps to have some standardisation.
create or replace function f_vars ( PLine in varchar2
, PDelimiter in varchar2 default ','
) return line_type is
/* This function takes in a line and a delimiter, splits
it on the delimiter and returns it in a varray.
*/
-- local variables are l_
l_line varchar2 (3000) := PLine;
l_pos number;
-- user defined types are t_
-- This is a varray.
t_line line_type;
begin
for i in 1 .. length(l_line) loop
-- Get the position of the first delimiter.
l_pos := instr(l_line, PDelimiter, 1, 1);
-- Exit when we have run out of delimiters.
if l_pos = 0 then
t_line_var(i) := l_line;
exit;
end if;
-- Fill in the varray and take the part of a string
-- between our previous delimiter and the next.
t_line_var(i) := substr(l_line, 1, l_pos - 1);
l_line := substr(l_line, l_pos + 1, length(l_line));
end loop;
return t_line;
end f_vars;
/

Optimize escape JSON in PostgreSQL 9.0

I'm currently using this JSON escaping function in PostgreSQL as a stand in for future native JSON support. While it works, it's also limiting our systems performance. How can I go about optimizing it? Maybe some kind of lookup array?
CREATE OR REPLACE FUNCTION escape_json(i_text TEXT)
RETURNS TEXT AS
$body$
DECLARE
idx INTEGER;
text_len INTEGER;
cur_char_unicode INTEGER;
rtn_value TEXT := i_text;
BEGIN
-- $Rev: $ --
text_len = LENGTH(rtn_value);
idx = 1;
WHILE (idx <= text_len) LOOP
cur_char_unicode = ASCII(SUBSTR(rtn_value, idx, 1));
IF cur_char_unicode > 255 THEN
rtn_value = OVERLAY(rtn_value PLACING (E'\\u' || LPAD(UPPER(TO_HEX(cur_char_unicode)),4,'0')) FROM idx FOR 1);
idx = idx + 5;
text_len = text_len + 5;
ELSE
/* is the current character one of the following: " \ / bs ff nl cr tab */
IF cur_char_unicode IN (34, 92, 47, 8, 12, 10, 13, 9) THEN
rtn_value = OVERLAY(rtn_value PLACING (E'\\' || (CASE cur_char_unicode
WHEN 34 THEN '"'
WHEN 92 THEN E'\\'
WHEN 47 THEN '/'
WHEN 8 THEN 'b'
WHEN 12 THEN 'f'
WHEN 10 THEN 'n'
WHEN 13 THEN 'r'
WHEN 9 THEN 't'
END)
)
FROM idx FOR 1);
idx = idx + 1;
text_len = text_len + 1;
END IF;
END IF;
idx = idx + 1;
END LOOP;
RETURN rtn_value;
END;
$body$
LANGUAGE plpgsql;
Confession: I am the Google Summer of Code 2010 student who was going to try to bring JSON support to PostgreSQL 9.1. Although my code was fairly feature-complete , it wasn't completely ready for upstream, and the PostgreSQL development community was looking at some alternative implementations. However, with spring break coming up, I'm hoping to finish my rewrite and give it a final push this week.
In the mean time, you can download and install the work-in-progress JSON data type module, which should work on PostgreSQL 8.4.0 and up. It is a PGXS module, so you can compile and install it without having to compile all of PostgreSQL. However, you will need the PostgreSQL server development headers.
Installation goes something like this:
git clone git://git.postgresql.org/git/json-datatype.git
cd json-datatype/
USE_PGXS=1 make
sudo USE_PGXS=1 make install
psql -f json.sql <DBNAME1> # requires database superuser privileges
Although the build and install only needs to be done once, json.sql needs to be run on every database you plan to use the JSON data type on.
With that installed, you can now run:
=> SELECT to_json(E'"quotes and \n newlines"\n'::TEXT);
to_json
--------------------------------
"\"quotes and \n newlines\"\n"
(1 row)
Note that this does not escape non-ASCII characters.
All my approaches boil down to "do it some other way":
Write it in some other language, e.g. use pl/perl, pl/python, pl/ruby
Write a wrapper round some external JSON library written in C
Do the JSON escaping in the client rather than in the query (assuming your client has some good JSON escaping support)
In my experience pl/pgsql isn't fast at this sort of thing- its strength is in its integral support for exchanging data with the database, not as a general-purpose programming language.
Example:
create or replace function escape_json_perl(text) returns text
strict immutable
language plperlu as $$
use JSON;
return JSON->new->allow_nonref->encode($_[0]);
$$;
A quick test suggests this is on the order of 15x faster than the plpgsql function (although it returns quotes around the value which you probably want to strip off)
I have found a PostgreSQL function implemented in C here : http://code.google.com/p/pg-to-json-serializer/
I have not compared it with your PLSQL method but it should be faster than any interpreted language.
Another one : http://miketeo.net/wp/index.php/projects/json-functions-for-postgresql