Using SQL to determine word count stats of a text field - mysql

I've recently been working on some database search functionality and wanted to get some information like the average words per document (e.g. text field in the database). The only thing I have found so far (without processing in language of choice outside the DB) is:
SELECT AVG(LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1)
FROM documents
This seems to work* but do you have other suggestions? I'm currently using MySQL 4 (hope to move to version 5 for this app soon), but am also interested in general solutions.
Thanks!
* I can imagine that this is a pretty rough way to determine this as it does not account for HTML in the content and the like as well. That's OK for this particular project but again are there better ways?
Update: To define what I mean by "better": either more accurate, performs more efficiently, or is more "correct" (easy to maintain, good practice, etc). For the content I have available, the query above is fast enough and is accurate for this project, but I may need something similar in the future (so I asked).

The text handling capabilities of MySQL aren't good enough for what you want. A stored function is an option, but will probably be slow. Your best bet to process the data within MySQL is to add a user defined function. If you're going to build a newer version of MySQL anyway, you could also add a native function.
The "correct" way is to process the data outside the DB since DBs are for storage, not processing, and any heavy processing might put too much of a load on the DBMS. Additionally, calculating the word count outside of MySQL makes it easier to change the definition of what counts as a word. How about storing the word count in the DB and updating it when a document is changed?
Example stored function:
DELIMITER $$
CREATE FUNCTION wordcount(str LONGTEXT)
RETURNS INT
DETERMINISTIC
SQL SECURITY INVOKER
NO SQL
BEGIN
DECLARE wordCnt, idx, maxIdx INT DEFAULT 0;
DECLARE currChar, prevChar BOOL DEFAULT 0;
SET maxIdx=char_length(str);
SET idx = 1;
WHILE idx <= maxIdx DO
SET currChar=SUBSTRING(str, idx, 1) RLIKE '[[:alnum:]]';
IF NOT prevChar AND currChar THEN
SET wordCnt=wordCnt+1;
END IF;
SET prevChar=currChar;
SET idx=idx+1;
END WHILE;
RETURN wordCnt;
END
$$
DELIMITER ;
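For the "process it outside the DB" route, a minimal Python sketch might look like the following. It assumes the documents have already been fetched as plain strings, and it applies the same rule as the stored function above (a word is a run of ASCII letters or digits). The names `word_count` and `average_words` are made up for this illustration.

```python
import re

# A "word" is a maximal run of ASCII letters/digits, mirroring the
# [[:alnum:]] test in the stored function. Change this one pattern to
# change the definition of a word.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def word_count(text):
    """Count words in one document; None is treated as an empty document."""
    return len(WORD_RE.findall(text or ""))


def average_words(documents):
    """Average word count over an iterable of document strings."""
    counts = [word_count(doc) for doc in documents]
    return sum(counts) / len(counts) if counts else 0.0


if __name__ == "__main__":
    docs = ["one two three", "teacher's book", ""]
    print([word_count(d) for d in docs], average_words(docs))
```

The per-document count could then be written back to a column whenever a document changes, as suggested above.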

This is quite a bit faster, though just slightly less accurate. I found it 4% light on the count, which is OK for "estimate" scenarios.
SELECT
ROUND (
(
CHAR_LENGTH(content) - CHAR_LENGTH(REPLACE (content, " ", ""))
)
/ CHAR_LENGTH(" ")
) AS count
FROM documents
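To see where the space-counting estimate drifts, here is a small Python sketch of the same arithmetic (not MySQL code; `space_estimate` and `split_count` are hypothetical helpers). Counting spaces counts gaps, so a single-spaced text of N words yields N-1, while repeated or leading/trailing spaces push the number up.

```python
def space_estimate(text):
    # Same arithmetic as
    # CHAR_LENGTH(content) - CHAR_LENGTH(REPLACE(content, ' ', ''))
    return len(text) - len(text.replace(" ", ""))


def split_count(text):
    # Reference count: whitespace-separated tokens, runs of spaces collapsed.
    return len(text.split())


if __name__ == "__main__":
    for sample in ("one two three", "one  two three", " padded "):
        print(repr(sample), space_estimate(sample), split_count(sample))
```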

Simple solution for some similar cases (MySQL):
SELECT *,
(CHAR_LENGTH(student)-CHAR_LENGTH(REPLACE(student,' ','')))+1 as 'count'
FROM documents;

You can use the word_count() UDF from https://github.com/spachev/mysql_udf_bundle. I ported the logic from the accepted answer, with the difference that my code only supports the latin1 charset. The logic would need to be reworked to support other charsets. Also, both implementations always treat a non-alphanumeric character as a delimiter, which may not always be desirable: for example, "teacher's book" is counted as three words by both implementations.
The UDF version is, of course, significantly faster. For a quick test I tried both on a dataset from Project Gutenberg consisting of 9751 records totaling about 3 GB. The UDF did all of them in 18 seconds, while the stored function took 63 seconds to process just 30 records (which the UDF does in 0.05 seconds). So the UDF is roughly 1000 times faster in this case.
A UDF will beat any other method in speed that does not involve modifying MySQL source code. This is because it has access to the string bytes in memory and can operate directly on bytes without them having to be moved around. It is also compiled into machine code and runs directly on the CPU.

Well I tried to use the function defined above and it was great, except one scenario.
In English you have strong use of ' as part of the word. The function above, at least to me, counted "haven't" as 2.
So here is my little correction:
DELIMITER $$
CREATE FUNCTION wordcount(str TEXT)
RETURNS INT
DETERMINISTIC
SQL SECURITY INVOKER
NO SQL
BEGIN
DECLARE wordCnt, idx, maxIdx INT DEFAULT 0;
DECLARE currChar, prevChar BOOL DEFAULT 0;
SET maxIdx=CHAR_LENGTH(str);
-- SUBSTRING positions are 1-based, so start at 1
SET idx = 1;
-- use <= so that the last character is examined as well
WHILE idx <= maxIdx DO
-- a character belongs to a word if it is alphanumeric or an apostrophe
SET currChar=SUBSTRING(str, idx, 1) RLIKE '[[:alnum:]]' OR SUBSTRING(str, idx, 1) RLIKE "'";
IF NOT prevChar AND currChar THEN
SET wordCnt=wordCnt+1;
END IF;
SET prevChar=currChar;
SET idx=idx+1;
END WHILE;
RETURN wordCnt;
END
$$
DELIMITER ;
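The effect of treating the apostrophe as a word character can be checked outside MySQL with a rough Python equivalent. This is only a sketch: the two patterns assume ASCII letters and digits, and the function names are invented here.

```python
import re

ALNUM_RUN = re.compile(r"[A-Za-z0-9]+")        # original rule
ALNUM_APOS_RUN = re.compile(r"[A-Za-z0-9']+")  # rule with apostrophe added


def count_original(text):
    """Words are runs of letters/digits; an apostrophe splits a word."""
    return len(ALNUM_RUN.findall(text))


def count_with_apostrophe(text):
    """Words are runs of letters/digits/apostrophes."""
    return len(ALNUM_APOS_RUN.findall(text))


if __name__ == "__main__":
    sample = "I haven't seen it"
    print(count_original(sample), count_with_apostrophe(sample))
```

One side effect of this rule is that a stray apostrophe standing on its own is also counted as a word.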

Related

How to CREATE a FUNCTION in SQL to concatenate two strings of text?

I am trying to create a function that concatenates two strings of text. I cannot simply use the concat function; I have to create a function that takes two strings of text as input, like String 1 = "Hi my name is" and String 2 = "Gary".
EDITOR'S NOTE: This is homework help. For future readers, these answers may not be applicable outside of given homework guidelines.
EDIT NOTE: In looking back at the original question, it asked for a function. This could be made into a function, but I think a function is way overkill for a fairly simple built-in operation. A user function would be a lot of overhead for something the engine can already easily handle.
In my personal opinion, this is a flawed question from your instructor. The correct answer is to do the very thing you aren't allowed to do. In fact, in MySQL, it's pretty much the only way the engine WANTS you to do that task. However, it can be done.
<teachingMoment>
First, we'll set up simple data.
CREATE TABLE t1 (s1 varchar(50), s2 varchar(50)) ;
INSERT INTO t1 (s1,s2) VALUES ('Hello','World') ;
To do it the way we SHOULD do it:
SELECT concat(s1, ', ', s2, '!') FROM t1 ;
Which gives us: Hello, World!. Easy, peasy.
In most* other flavors of SQL, the double-pipe operator (||) can be used for concatenation.
* : "Most" does not include Microsoft T-SQL. It uses + instead.
SELECT s1 || ', ' || s2 || '!' FROM t1 ;
In those flavors, this gives us Hello, World! again. However, using pipes (||) to concatenate is not turned on by default in MySQL (or MariaDB), where || means logical OR, so we must use concat(). To turn the pipe behavior on, set the SQL mode:
SET sql_mode = 'PIPES_AS_CONCAT' ;
SELECT s1 || ', ' || s2 || '!' FROM t1 ;
And now we're back to giving us Hello, World! without concat().
</teachingMoment>
Again, I think this is a very contrived question that doesn't really provide much useful learning other than how to make yourself do something that MySQL was smart enough to stop you from doing. However, I guess it is useful in teaching you how to do it in other languages (except T-SQL).
Good luck.
https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=325e5180aecbf5b632f960e777650a76
From the comments, it seems you are using MySQL, so you might use the function below (MySQL has no VARCHAR2 type, so plain VARCHAR is used):
DELIMITER $$
CREATE FUNCTION STR_ADD(STRING_1 VARCHAR(50)
,STRING_2 VARCHAR(50))
RETURNS VARCHAR(100)
DETERMINISTIC
BEGIN
DECLARE RES_STRING VARCHAR(100);
SELECT CONCAT(STRING_1, STRING_2) INTO RES_STRING;
RETURN RES_STRING;
END$$
DELIMITER ;

MySQL Spatial Simplify Geometries

I'm trying to create a web map which contains large polygon data-sets. In order to increase performance, I'm hoping to decrease polygon detail when zooming out.
Is MySQL able to simplify polygons and other geometries as part of a query?
EDIT: As James has pointed out, since 5.7, MySQL does support ST_Simplify.
I am afraid there is no simplify function in MySQL spatial. I filed that as a feature request five years ago, and you can see it has received no attention since then.
You have a number of options depending on whether you want to do this as a one off, or if you have dynamic data.
1). Write a quick and dirty function using the PointN function to access your points, only taking every 5th point say, creating a WKT string representing a simplified geometry, and then recreating a geometry using the GeomFromText function.
2). Dump your polygon as WKT, using AsText(geom), out to CSV. Import it into Postgres/Postgis using the COPY command (the equivalent of LOAD DATA INFILE), use the ST_Simplify function there, then reverse the process to bring it back into MySQL.
3). Use ogr2ogr to dump to shp format, and then a tool like mapshaper to simplify it, output to shp and then import again using ogr2ogr. Mapshaper is nice, because you can get a feel for how the algorithm works, and could potentially use it to implement your own, instead of option 1.
There are other options, such as using Java Topology Suite if you are using Java server side, but I hope this gives you some idea of how to proceed.
I am sorry that the initial answer is, No. I went through this a number of years ago and eventually made a permanent switch to Postgres/Postgis, as it is much more fully featured for spatial work.
MySQL 5.7 contains ST_Simplify function for simplifying geometries.
From https://dev.mysql.com/doc/refman/5.7/en/spatial-convenience-functions.html:
ST_Simplify(g, max_distance)
Simplifies a geometry using the Douglas-Peucker algorithm and returns a simplified value of the same type, or NULL if any argument is NULL.
MySQL 5.7 includes a huge overhaul of spatial functionality compared to 5.6 and is now in General Availability status as of 5.7.9.
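For readers who want to see what the Douglas-Peucker algorithm does, here is a textbook-style Python sketch for a single line given as (x, y) tuples. It is not MySQL's implementation: the function names are illustrative, and details such as how MySQL measures distance may differ.

```python
import math


def _perp_dist(p, a, b):
    """Distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # a and b coincide: fall back to plain point distance
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)


def douglas_peucker(points, max_distance):
    """Return a simplified copy of points, always keeping both endpoints."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord first -> last.
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax > max_distance:
        # Keep that point and simplify both halves recursively.
        left = douglas_peucker(points[: idx + 1], max_distance)
        right = douglas_peucker(points[idx:], max_distance)
        return left[:-1] + right  # drop the duplicated split point
    # Everything lies within tolerance: the chord is enough.
    return [points[0], points[-1]]
```

A larger max_distance removes more points, which is what makes it suitable for zoomed-out map views.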
I had a use case where I wanted to use ST_Simplify, but the code had to run on MySQL 5.6 (which doesn't have it). Therefore, I developed a solution like the one suggested by John Powell in another answer.
MySQL does not unfortunately offer any aggregates, whereby you can create geometry by progressively adding points to it (i.e. there is no ST_AddPoint or similar). The only way you can compose a geometry is by building it step-by-step as a WKT string, then finally converting the completed string to a geometry.
Here is an example of a stored function that accepts a MultiLineString, and simplifies each LineString in it by only keeping every nth point, making sure that the start and end points are always kept. This is done by looping through the LineStrings in the MultiLineString, then through the points for each (skipping as required), and accumulating the lot in a WKT string, which is finally converted to a geometry using ST_GeomCollFromText.
-- geometryCollection: MultiLineString collection to simplify
-- skip: Number of points to remove between every two points
CREATE FUNCTION `sp_CustomSimplify`(gc geometrycollection, skip INT) RETURNS geometrycollection
BEGIN
DECLARE i, j, numLineStrings, numPoints INT;
DECLARE ls LineString;
DECLARE pt, lastPt Point;
DECLARE txt VARCHAR(20000);
DECLARE digits INT;
SET digits = 4;
-- Start WKT string:
SET txt = 'MULTILINESTRING(';
-- Loop through the LineStrings in the geometry (which is a MultiLineString)
SET i = 1;
SET numLineStrings = ST_NumGeometries(gc);
loopLineStrings: LOOP
IF i > numLineStrings THEN LEAVE loopLineStrings; END IF;
SET ls = ST_GeometryN(gc, i);
-- Add first point to LineString:
SET pt = ST_StartPoint(ls);
SET txt = CONCAT(txt, '(', TRUNCATE(ST_X(pt),digits), ' ', TRUNCATE(ST_Y(pt),digits));
-- For each LineString, loop through points, skipping
-- points as we go, adding them to a running text string:
SET numPoints = ST_NumPoints(ls);
SET j = skip;
loopPoints: LOOP
IF j > numPoints THEN LEAVE loopPoints; END IF;
SET pt = ST_PointN(ls, j);
-- For each point, add it to a text string:
SET txt = CONCAT(txt, ',', TRUNCATE(ST_X(pt),digits), ' ', TRUNCATE(ST_Y(pt),digits));
SET j = j + skip;
END LOOP loopPoints;
-- Add last point to LineString:
SET lastPt = ST_EndPoint(ls);
SET txt = CONCAT(txt, ',', TRUNCATE(ST_X(lastPt),digits), ' ', TRUNCATE(ST_Y(lastPt),digits));
-- Close LineString WKT:
SET txt = CONCAT(txt, ')');
IF(i < numLineStrings) THEN
SET txt = CONCAT(txt, ',');
END IF;
SET i = i + 1;
END LOOP loopLineStrings;
-- Close MultiLineString WKT:
SET txt = CONCAT(txt, ')');
RETURN ST_GeomCollFromText(txt);
END
(This could be a lot prettier by extracting bits into separate functions.)
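The point-skipping idea is easy to prototype outside the database before committing to a stored function. The Python sketch below is an illustration under assumptions, not a port of the function above: `keep_every_nth` and `to_wkt_linestring` are made-up names, points are plain (x, y) tuples, and coordinates are rounded rather than truncated.

```python
def keep_every_nth(points, step):
    """Keep points 0, step, 2*step, ... and always keep the last point."""
    if step < 1:
        raise ValueError("step must be at least 1")
    if len(points) <= 2:
        return list(points)
    kept = [points[i] for i in range(0, len(points), step)]
    # Append the end point unless the stride already landed on it.
    if (len(points) - 1) % step != 0:
        kept.append(points[-1])
    return kept


def to_wkt_linestring(points, digits=4):
    """Format points as a WKT LINESTRING with a fixed number of decimals."""
    coords = ",".join(f"{x:.{digits}f} {y:.{digits}f}" for x, y in points)
    return f"LINESTRING({coords})"


if __name__ == "__main__":
    line = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
    print(to_wkt_linestring(keep_every_nth(line, 2), digits=1))
```

Checking the stride against the last index avoids emitting the end point twice.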

Filter Dataset using SQL Query

I am using Zeos and MySQL in my Delphi project.
What I would like to do is filter a dataset using a textbox.
To do that, I am using the following query in the textbox's 'OnChange' event:
ZGrips.Active := false;
ZGrips.SQL.Clear;
ZGrips.SQL.Add('SELECT Part_Name, Description, OrderGerman, OrderEnglish FROM Part');
ZGrips.SQL.Add('WHERE Part_Name LIKE ' + '"%' + trim(txt_search.Text) + '%"');
ZGrips.Active := true;
After I run the program and type the first character in the textbox, I get an empty dataset in my DBGrid,
so the DBGrid shows nothing. Then, if I type a second character, I get some results in the DBGrid. There is even stranger behavior: if I use an AS clause in my SQL query, like:
Part_Name AS blablabla,
Description AS blablabla,
OrderGerman AS OG,
OrderEnglish AS OE
In that case the DBGrid shows only 2 columns, Part_Name and Description. I don't understand why it is ignoring the 3rd and 4th columns.
Thanks in advance for any help.
Always use parameters
Firstly, you need to use parameters; otherwise your query will break, or worse, when the user enters the wrong characters in the search box.
See: How does the SQL injection from the "Bobby Tables" XKCD comic work?
Parameters also make your query faster, because the database engine only has to parse the query once.
If you change a parameter, the engine knows that the query itself has not changed and will not parse it again.
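As a language-neutral illustration of the parameter idea, here is a small Python sketch that uses the standard-library sqlite3 module as a stand-in for MySQL and Zeos. The table contents and the `search_parts` helper are invented for the example. The SQL text never changes; only the bound value does, so a quote typed into the search box cannot break the statement.

```python
import sqlite3

SEARCH_SQL = (
    "SELECT Part_Name, Description FROM Part "
    "WHERE Part_Name LIKE :searchtext ORDER BY Part_Name"
)


def search_parts(conn, text):
    """Prefix search; the wildcard is part of the bound value, not the SQL."""
    pattern = text.strip() + "%"
    return conn.execute(SEARCH_SQL, {"searchtext": pattern}).fetchall()


# Throwaway in-memory table with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Part (Part_Name TEXT, Description TEXT)")
conn.executemany(
    "INSERT INTO Part VALUES (?, ?)",
    [("Grip A", "small"), ("Grip B", "large"), ('O"Ring', "seal")],
)

if __name__ == "__main__":
    print(search_parts(conn, "Grip"))
    print(search_parts(conn, 'O"'))  # a quote in the input is harmless
```

Note that % and _ typed by the user still act as LIKE wildcards; binding protects the statement, not the pattern semantics.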
Don't use clear and add
Just supply the SQL as text in one go; it's faster.
This is especially true in a loop; outside a loop you will not notice the difference.
Your code should read something like:
procedure TForm1.SetupSearch; // Run this only once.
var
  SQL: string;
begin
  ZGrips.Active := False;
  // Note the trailing space after "Part" and that there is no % here.
  SQL := 'SELECT Part_Name, Description, OrderGerman, OrderEnglish FROM Part ' +
         'WHERE Part_Name LIKE :searchtext';
  ZGrips.SQL.Text := SQL; // Don't use Clear and don't use SQL.Add.
end;
// See: http://docwiki.embarcadero.com/Libraries/XE2/en/Vcl.StdCtrls.TEdit.OnChange
procedure TForm1.Edit1Change(Sender: TObject);
begin
  if Edit1.Modified then begin
    Timer1.Enabled := True; // A TTimer is switched on via Enabled.
  end;
end;
procedure TForm1.Timer1Timer(Sender: TObject);
var
  Pattern: string;
begin
  Timer1.Enabled := False;
  Pattern := Edit1.Text + '%'; // The trailing wildcard goes into the parameter value.
  if Pattern <> ZGrips.Params[0].AsString then begin
    ZGrips.Active := False; // Close the dataset before changing the parameter.
    ZGrips.Params[0].AsString := Pattern;
    ZGrips.Active := True;
  end;
end;
Use a timer
As per @MartinA's suggestion, use a timer and start the query only every so often.
The weird behaviour you're getting may be because you're stopping and reactivating a new query before the old one has had time to finish.
The Params[index: integer] property is a bit faster than the ParamByName method.
This does not really matter outside a loop, though.
Allow the database to use an index!
Using only a trailing wildcard % is faster than using a leading wildcard, because the database can use an index only if the pattern does not start with a wildcard.
If you want to use a leading wildcard, then consider storing the data in reverse order and using a trailing wildcard instead.
Full-text indexes are much better than LIKE
Of course, if you need both a leading and a trailing wildcard, then you have to use a full-text index.
In MySQL you'll then use the MATCH AGAINST syntax,
see: Differences between INDEX, PRIMARY, UNIQUE, FULLTEXT in MySQL?
and: Which SQL query is better, MATCH AGAINST or LIKE?
The latest versions of MySQL support full-text indexes in InnoDB.
Remember to never use MyISAM, it's unreliable.

Using a scalar valued find string function when searching multiple rows returned?

Given SQL Server 2008, I have written a simple find in string function as follows:
ALTER FUNCTION [dbo].[FindInString]
(
@FindText VARCHAR(255),
@TextSource VARCHAR(512)
)
RETURNS INT
AS
BEGIN
DECLARE @Result INT
SET @Result = 0
SELECT @Result = CHARINDEX(@FindText, @TextSource)
RETURN @Result
END
The complexity of the find function may change in the future, which is why I wanted to encapsulate it in a function.
Now, when I only have one matching record in a table, this works:
SELECT @FindCount = dbo.FindInString('somestring', (SELECT TableSearch FROM Segments WHERE CID=22793))
However, when the SELECT statement returns more than one row, an error is thrown, which makes sense.
What I'd like to know is what I need to do to still have this work as a simple call, as above.
I only need to know whether there is at least one match (that is, whether @FindCount > 0). I'm guessing some sort of loop may be required, but I would like to keep this as simple as possible.
Thanks.
You can use aggregate functions and one select:
select
@FindCount = sum(dbo.FindInString('somestring', TableSearch))
from
Segments
where
CID = 22793
Just take care with this, as FindInString will fire for each row, which can significantly reduce query performance. In this case, it's the only way to solve your problem, but just beware of the troubles that could arise.
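To make the logic of the aggregate explicit: CHARINDEX returns a 1-based position, or 0 when there is no match, so the sum is greater than zero exactly when at least one row matches (it is not a count of matching rows). A rough Python model of that reasoning, with invented names and sample rows, and ignoring collation rules:

```python
def find_in_string(find_text, text_source):
    # str.find is 0-based and returns -1 on a miss; adding 1 gives a
    # 1-based position with 0 for "not found", like CHARINDEX does for
    # non-empty search text.
    return text_source.find(find_text) + 1


def any_row_matches(find_text, rows):
    # Equivalent in spirit to: SUM(FindInString(...)) > 0
    return sum(find_in_string(find_text, row) for row in rows) > 0


if __name__ == "__main__":
    rows = ["nothing here", "contains somestring inside"]
    print(any_row_matches("somestring", rows))
```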

Why is my custom MySQL function so much slower than inlining same in query?

I repeatedly use this SELECT query to read unsigned integers representing IPv4 addresses and present them as human readable dotted quad strings.
SELECT CONCAT_WS('.',
FLOOR(ip/POW(256,3)),
MOD(FLOOR(ip/POW(256,2)), 256),
MOD(FLOOR(ip/256), 256),
MOD(ip, 256))
FROM ips;
With my test data, this query takes 3.6 seconds to execute.
I thought that creating a custom stored function for the int->string conversion would allow for easier to read queries and allow reuse, so I made this:
CREATE FUNCTION IntToIp(value INT UNSIGNED)
RETURNS char(15)
DETERMINISTIC
RETURN CONCAT_WS(
'.',
FLOOR(value/POW(256,3)),
MOD(FLOOR(value/POW(256,2)), 256),
MOD(FLOOR(value/256), 256),
MOD(value, 256)
);
With this function my query looks like this:
SELECT IntToIp(ip) FROM ips;
but with my test data, this takes 13.6 seconds to execute.
I would expect this to be slower on first run, as there is an extra level of indirection involved, but nearly 4 times slower seems excessive. Is this much slowness expected?
I'm using out of the box MySQL server 5.1 on Ubuntu 10.10 with no configuration changes.
To reproduce my test, create a table and populate it with 1,221,202 rows:
CREATE TABLE ips (ip INT UNSIGNED NOT NULL);
DELIMITER //
CREATE PROCEDURE AddIps ()
BEGIN
DECLARE i INT UNSIGNED DEFAULT POW(2,32)-1;
WHILE (i>0) DO
INSERT INTO ips (ip) VALUES (i);
SET i = IF(i<3517,0,i-3517);
END WHILE;
END//
DELIMITER ;
CALL AddIps();
Don't reinvent the wheel, use INET_NTOA():
mysql> SELECT INET_NTOA(167773449);
-> '10.0.5.9'
Using this one you could get better performance:
CREATE FUNCTION IntToIp2(value INT UNSIGNED)
RETURNS char(15)
DETERMINISTIC
RETURN CONCAT_WS(
'.',
(value >> 24),
(value >> 16) & 255,
(value >> 8) & 255,
value & 255
);
> SELECT IntToIp(ip) FROM ips;
1221202 rows in set (18.52 sec)
> SELECT IntToIp2(ip) FROM ips;
1221202 rows in set (10.21 sec)
Launching your original SELECT just after adding your test data took 4.78 secs on my system (a 2 GB MySQL 5.1 instance on a quad core, Fedora 64-bit).
EDIT: Is this much slowness expected?
Yes, stored procedures are slow, orders of magnitude slower than interpreted or compiled code. They are useful when you need to tie up some database logic that you want to keep out of your application because it is outside its specific domain (i.e., logging or administrative tasks). If a stored function contains no queries, it is always better practice to write a utility function in your chosen language, as that won't prevent reuse (there are no queries) and will run much faster.
And that's the reason why, in this particular case, you should use the INET_NTOA function instead, which is available and fulfils your needs, as suggested in sanmai's answer.
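If the conversion does end up in application code, a utility function is only a few lines. Here is a minimal Python sketch using the same shift-and-mask approach as IntToIp2; the name `int_to_ip` is made up for this example, and the standard-library ipaddress module is shown alongside because it already does the same job.

```python
import ipaddress


def int_to_ip(value):
    """Convert an unsigned 32-bit integer to a dotted-quad string."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value must fit in an unsigned 32-bit integer")
    # Take the four bytes from most to least significant.
    return ".".join(str((value >> shift) & 0xFF) for shift in (24, 16, 8, 0))


if __name__ == "__main__":
    print(int_to_ip(167773449))
    # The standard library offers the same conversion:
    print(str(ipaddress.IPv4Address(167773449)))
```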