MySQL version 5.5.35-log
I have an exceptionally large data set built around a many-to-many relationship, closely analogous to people shopping at outlets. A person may shop at many hundreds of different outlets and, similarly, many thousands of people may shop at any particular outlet. The number of people and the number of outlets each run into the millions.
I am in a situation where checking whether a person shops at a particular outlet must be resolved quickly, so I opted for reverse lookups; i.e. each 'person' row stores a list of IDs for the stores they shop in. Due to the volume of data, a third relationship table is presumed to be unsuitable, i.e. one with a row for each person's outlets; my assumption is that it would have little choice but to produce table scans through many, many rows.
To store this reverse lookup in MySQL, however, SET is also unsuitable, as it has a maximum of 64 members, which is of course not enough in this situation. So I opted for a BLOB structured simply as a block of 4-byte little-endian IDs.
But a different problem arises here: when it comes time to find out, using SQL, whether an outlet ID is contained in the BLOB, unusual things start occurring. From other questions it seems the only way to do this is to use SUBSTRING on the BLOB in a loop, but this doesn't seem to work; SUBSTRING is returning a blank string. First, here's some code:
CREATE FUNCTION `DoesShopAt`(shopperID INT UNSIGNED, outletID TINYBLOB) RETURNS VARCHAR(20)
BEGIN
    -- Set up a loop. We're going to loop through each ID in the blob:
    declare i_max int unsigned default 0;
    declare i int unsigned default 0;
    declare offset int unsigned default 0;
    declare storeID tinyblob;
    -- Set up the blob store - all the shops a particular shopper goes to:
    declare allShops blob;
    -- Grab the set of IDs - a blob of each 4 byte outlet ID:
    select AllStores from Shoppers where ID=shopperID into allShops;
    -- How many shops?
    select length(allShops)/4 into i_max;
    while i < i_max do
        -- Grab the shop's ID:
        set storeID=substring(allShops,offset,4);
        if outletID = storeID then
            return "Yep, they do!";
        end if;
        -- Update the ID offset in the blob:
        set offset=offset+4;
        -- Update the loop counter:
        set i=i+1;
    end while;
    return "Nope, they don't.";
END
For debugging purposes it is set to return a string. The intention is that it returns true or false depending on whether the given shopper shops at the given outlet.
Ideally, this function would receive two numbers, the shopperID and the outletID; however, converting the outletID into a block of 4 little-endian bytes seems unreliable and slow at best, as it has to go via hex (as far as I can tell). So instead, the calling service provides the block of 4 bytes.
Interestingly, returning storeID immediately after it is set results in a blank string. This is the case whether storeID is declared as varchar, binary, or tinyblob; it seems that no matter what, a blank string comes back.
So as a final resort for testing purposes, I instead tried this:
set storeID=substring(hex(allShops),offset,8);
Ensuring that the offset counter was increased by 8 this time, and that the input ID was adjusted to suit. Yet again it returned a blank string (again with return storeID immediately after it is set), even though the allShops data is non-zero.
Edit: Although I found the issue, I can't help but think that maybe there is a better approach to reverse lookups like this in MySQL; do you have any suggestions?
I started playing around with SUBSTRING and realised what the issue was: offset was being initialised to 0 when it should be 1. Changing this made the function start returning results correctly:
declare offset int unsigned default 0;
Should have been:
declare offset int unsigned default 1;
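For anyone else hitting this: MySQL string positions are 1-based, and a starting position of 0 makes SUBSTRING return an empty string, which is exactly the blank result described above:
SELECT SUBSTRING('abcdefgh', 0, 4); -- returns ''
SELECT SUBSTRING('abcdefgh', 1, 4); -- returns 'abcd'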
However, please see the note at the bottom of the original question.
Sorry for the long pre-history, but it is needed to clarify the question.
In my org the computers have names like CNT30[0-9]{3}[1-9a-z], for example cnt300021 or cnt30253a.
The last symbol is a "qualifier", so a single workplace may have several identically numbered computers assigned to it, distinguished by this qualifier. For example, cnt300021 may be the desktop computer at workplace #002, and cnt30002a may be the notebook assigned to the same workplace. Workplaces are "virtual" and exist only for our (IT dept) convenience.
Each dept has its own unique range [0-9]{3}. For example, the accounting computers have names from cnt302751 up to cnt30299z, which gives them 25 unique workplaces max, with up to 35 computers per workplace. (In real life most users have one desktop PC, far fewer have a desktop and a notebook, and only 2 or 3 technicians have more than one notebook at their disposal.)
Recently, while doing an inventory of the computers' passports (unsure about the term: a paper that is to a computer what a passport is to a person), I found that there are some holes in the sequential numbering. For example, we have cnt302531 and cnt302551, but no cnt302541, which means there is no workplace #254.
What do I want to do? I want to find these gaps without searching manually. For this I need a loop from 1 to MaxComp=664 (no higher workplace numbers have been assigned yet).
That's what I could write using some pseudo-SQL-BASIC:
for a=0 to MaxComp
a$="CNT30"+right(a+1000,3)
'comparing only 8 leftmost characters, ignoring 9th one - the qualifier
b$=(select name from table where left(name,8) like a$)
print a$;b$
next a
That code should give me two columns: possible names and existing ones.
But I can't figure out how to implement this as an SQL query. Here is what I tried:
# because of qualifier there may be several computers with same
# 8 leftmost characters
select @cnum:=@cnum+1 as CompNum, group_concat(name separator ',')
# PCs are inventoried by OCS-NG Inventory software
from hardware
cross join (select @cnum:=0) cnt
where left(hardware.name,8)=concat('CNT30',right(@cnum+1000,3))
limit 100
But this construct returns exactly one row. I can't work out whether this is possible without stored procedures and, if it is, what I did wrong.
I found a working path:
(at first I tried using a stored function)
CREATE FUNCTION `count_comps`(num smallint) RETURNS tinytext CHARSET utf8
BEGIN
return (select group_concat(name separator ',')
from hardware where left(hardware.name,8)=concat('CNT30',right(num+1000,3))
);
END
Then I tried hard to replicate the function's results in a subquery, and I did it! Note: the inner select returns exactly the same results as the function does.
# Starting point. May be INcreased to narrow the results list
set @cnum:=0;
select
    @cnum:=@cnum+1 as CompNum,
    concat('CNT30',right(@cnum+1000,3)) as CalcNum,
    # this
    count_comps(@cnum) as hwns,
    # and this gives equal results
    (select group_concat(name separator ',')
     from hardware where left(name,8)=calcnum
    ) hwn2
from hardware
# no more dummy tables here
# Ending point. May be DEcreased to narrow the results list
where @cnum<665;
So the wrong part of the "classical" approach was the use of a dummy table, which turns out to be unnecessary.
Partial results example (starting with set @cnum:=479; and finishing with where @cnum<530;):
CompNum, CalcNum, hwns, hwn2
'488', 'CNT30488', 'CNT304881', 'CNT304881'
'489', 'CNT30489', 'CNT304892', 'CNT304892'
'490', 'CNT30490', 'CNT304901,CNT304902,CNT304903', 'CNT304901,CNT304902,CNT304903'
'491', 'CNT30491', NULL, NULL
'492', 'CNT30492', NULL, NULL
'493', 'CNT30493', 'CNT304932', 'CNT304932'
'494', 'CNT30494', 'CNT304941', 'CNT304941'
I found that there are no workplaces #491 and #492. The next time PCs are added for the 'October Region' dept (range 480-529), at least two of the new PCs will get the names CNT304911 and CNT304921, filling this gap.
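For what it's worth, the same idea can be wrapped in a derived table so that only the gaps are listed. This is just a sketch built on the query above (same table and column names, same reliance on @cnum evaluation order), not a guaranteed-portable construct:
set @cnum:=0;
select CompNum, CalcNum
from (
    select @cnum:=@cnum+1 as CompNum,
           concat('CNT30',right(@cnum+1000,3)) as CalcNum,
           (select group_concat(name separator ',')
            from hardware
            where left(name,8)=concat('CNT30',right(@cnum+1000,3))) as hwns
    from hardware
    where @cnum<665
) gaps
where hwns is null;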
The SQLite JSON1 extension has some really neat capabilities. However, I have not been able to figure out how I can update or insert individual JSON attribute values.
Here is an example
CREATE TABLE keywords
(
id INTEGER PRIMARY KEY,
lang INTEGER NOT NULL,
kwd TEXT NOT NULL,
locs TEXT NOT NULL DEFAULT '{}'
);
CREATE INDEX kwd ON keywords(lang,kwd);
I am using this table to store keyword searches and to record, in the object locs, the locations from which each search was initiated. A sample entry in this table would look like the one shown below
id:1,lang:1,kwd:'stackoverflow',locs:'{"1":1,"2":1,"5":1}'
The location object attributes here are indices to the actual locations stored elsewhere.
Now imagine the following scenarios
A search for stackoverflow is initiated from location index "2". In this case I simply want to increment the value at that index so that after the operation the corresponding row reads
id:1,lang:1,kwd:'stackoverflow',locs:'{"1":1,"2":2,"5":1}'
A search for stackoverflow is initiated from a previously unknown location index "7" in which case the corresponding row after the update would have to read
id:1,lang:1,kwd:'stackoverflow',locs:'{"1":1,"2":1,"5":1,"7":1}'
It is not clear to me that this can in fact be done. I tried something along the lines of
UPDATE keywords json_set(locs,'$.2','2') WHERE kwd = 'stackoverflow';
which failed with an error near json_set. I'd be most obliged to anyone who might be able to tell me how/whether this should/can be done.
It is not necessary to create such complicated SQL with subqueries to do this.
The SQL below would solve your needs.
UPDATE keywords
SET locs = json_set(locs,'$.7', IFNULL(json_extract(locs, '$.7'), 0) + 1)
WHERE kwd = 'stackoverflow';
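Applied to the sample row in the question, the same pattern handles both scenarios: an existing location index is incremented, and a previously unknown index is created with a count of 1 (a usage illustration of the statement above):
-- Increment the existing location index "2":
UPDATE keywords
SET locs = json_set(locs, '$.2', IFNULL(json_extract(locs, '$.2'), 0) + 1)
WHERE kwd = 'stackoverflow';
-- locs becomes '{"1":1,"2":2,"5":1}'

-- A previously unknown index "7" simply gets added with a count of 1:
UPDATE keywords
SET locs = json_set(locs, '$.7', IFNULL(json_extract(locs, '$.7'), 0) + 1)
WHERE kwd = 'stackoverflow';
-- locs now also contains "7":1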
I know this is old, but it is one of the first results when searching, so it deserves a better solution.
I could have just deleted this question but given that the SQLite JSON1 extension appears to be relatively poorly understood I felt it would be more useful to provide an answer here for the benefit of others. What I have set out to do here is possible but the SQL syntax is rather more convoluted.
UPDATE keywords set locs =
    (select json_set(json(keywords.locs),'$.N',
        ifnull(
            (select json_extract(keywords.locs,'$.N') from keywords where id = '1'),
            0)
        + 1)
    from keywords where id = '1')
where id = '1';
will accomplish both of the updates I have described in my original question above (N here stands for the location index being updated). Given how complicated this looks, a few explanations are in order:
The UPDATE keywords part does the actual updating, but it needs to know what to update
The SELECT json_set part is where we establish the value to be updated
If the relevant value does not exist in the first place we do not want to do a + 1 on a null value, so we do an IFNULL test
The WHERE id = bits ensure that we target the right row
Having now worked with JSON1 in SQLite for a while, I have a tip to share with others going down the same road. It is easy to waste your time writing extremely convoluted and hard-to-maintain SQL in an effort to perform in-place JSON manipulation. Consider using SQLite temporary tables - CREATE TEMP TABLE... - to store intermediate results and write a sequence of SQL statements instead. This makes the code a whole lot easier to understand and to maintain.
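For instance, a minimal sketch of that staged approach for the increment described above (table and column names come from the question; the temp table name is illustrative):
-- Stage the current counter value for the row(s) of interest:
CREATE TEMP TABLE tmp_locs AS
SELECT id, IFNULL(json_extract(locs, '$.2'), 0) AS hits
FROM keywords
WHERE kwd = 'stackoverflow';

-- Apply the update from the staged values:
UPDATE keywords
SET locs = json_set(locs, '$.2',
        (SELECT hits FROM tmp_locs WHERE tmp_locs.id = keywords.id) + 1)
WHERE id IN (SELECT id FROM tmp_locs);

DROP TABLE tmp_locs;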
I have the following problem.
When I run the query in DataBase1..
declare @x float
select @x = Descripcion from Automotor.Modelo
where id = 57
select @x
the result is "1,7E+27"
But when I run the same query in DataBase2..
The result is
"Msg 8114, Level 16, State 5, Line 2
Error converting data type varchar to float."
The structure is the same. Any idea why this happens?
The table in one database could have character data that isn't in the other, and the assignment (or at least validating the assignment operation) is trying to happen before the filter. This could even happen with the same data but different indexes, different statistics or a different query plan for other reasons, which could mean that SQL Server sees the data in a different order and comes across the bad value. But I would guess that even though the structures are identical, the data is not.
My suggestions are either:
(a) store your float data in a float column instead of a varchar column (preferable)
(b) short circuit the assignment problem like this:
SELECT @x = CASE WHEN Descripcion NOT LIKE '%[^0-9.]%' THEN Descripcion END
FROM Automotor.Modelo WHERE id = 57;
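As a rough illustration of why the CASE guards the assignment (the sample values here are hypothetical): any value containing a character other than a digit or a period collapses to NULL instead of being handed to the float conversion.
SELECT CASE WHEN 'N/A'    NOT LIKE '%[^0-9.]%' THEN 'N/A'    END AS bad_value,  -- NULL
       CASE WHEN '123.45' NOT LIKE '%[^0-9.]%' THEN '123.45' END AS good_value; -- '123.45'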
Given SQL Server 2008, I have written a simple find in string function as follows:
ALTER FUNCTION [dbo].[FindInString]
(
    @FindText VARCHAR(255),
    @TextSource VARCHAR(512)
)
RETURNS INT
AS
BEGIN
    DECLARE @Result INT
    SET @Result = 0
    SELECT @Result = CHARINDEX(@FindText, @TextSource)
    RETURN @Result
END
The complexity of the find function may change in the future, which is why I wanted to encapsulate it in a function.
Now, when I only have one matching record in a table, this works:
SELECT @FindCount = dbo.FindInString('somestring', (SELECT TableSearch FROM Segments WHERE CID=22793))
However, when the select statement returns more than one row, it makes sense why an error is thrown.
What I'd like to know is what I need to do to still have this work as a simple call, as above.
I only need to know whether there is at least one match (i.e. whether @FindCount > 0), and I'm guessing some sort of loop may be required, but I would like to keep this as simple as possible.
Thanks.
You can use aggregate functions and one select:
select
    @FindCount = sum(dbo.FindInString('somestring', TableSearch))
from
    Segments
where
    CID = 22793
Just take care with this, as FindInString will fire for each row, which can significantly reduce query performance. In this case, it's the only way to solve your problem, but just beware of the troubles that could arise.
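For example (a sketch, assuming the table is called Segments as in the question), the aggregated result can then be checked directly:
DECLARE @FindCount INT;

SELECT @FindCount = SUM(dbo.FindInString('somestring', TableSearch))
FROM Segments
WHERE CID = 22793;

IF ISNULL(@FindCount, 0) > 0
    PRINT 'At least one segment matches';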
I've recently been working on some database search functionality and wanted to get some statistics, such as the average number of words per document (e.g. a text field in the database). The only thing I have found so far (without processing in a language of choice outside the DB) is:
SELECT AVG(LENGTH(content) - LENGTH(REPLACE(content, ' ', '')) + 1)
FROM documents
This seems to work* but do you have other suggestions? I'm currently using MySQL 4 (hope to move to version 5 for this app soon), but am also interested in general solutions.
Thanks!
* I can imagine that this is a pretty rough way to determine this as it does not account for HTML in the content and the like as well. That's OK for this particular project but again are there better ways?
Update: To define what I mean by "better": either more accurate, performs more efficiently, or is more "correct" (easy to maintain, good practice, etc). For the content I have available, the query above is fast enough and is accurate for this project, but I may need something similar in the future (so I asked).
The text handling capabilities of MySQL aren't good enough for what you want. A stored function is an option, but will probably be slow. Your best bet to process the data within MySQL is to add a user defined function. If you're going to build a newer version of MySQL anyway, you could also add a native function.
The "correct" way is to process the data outside the DB since DBs are for storage, not processing, and any heavy processing might put too much of a load on the DBMS. Additionally, calculating the word count outside of MySQL makes it easier to change the definition of what counts as a word. How about storing the word count in the DB and updating it when a document is changed?
Example stored function:
DELIMITER $$
CREATE FUNCTION wordcount(str LONGTEXT)
    RETURNS INT
    DETERMINISTIC
    SQL SECURITY INVOKER
    NO SQL
BEGIN
    DECLARE wordCnt, idx, maxIdx INT DEFAULT 0;
    DECLARE currChar, prevChar BOOL DEFAULT 0;
    SET maxIdx = char_length(str);
    SET idx = 1;
    WHILE idx <= maxIdx DO
        SET currChar = SUBSTRING(str, idx, 1) RLIKE '[[:alnum:]]';
        IF NOT prevChar AND currChar THEN
            SET wordCnt = wordCnt + 1;
        END IF;
        SET prevChar = currChar;
        SET idx = idx + 1;
    END WHILE;
    RETURN wordCnt;
END
$$
DELIMITER ;
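Usage against the documents table from the question would then look like this (a sketch; table and column names as in the original query):
-- Average words per document using the stored function:
SELECT AVG(wordcount(content)) AS avg_words
FROM documents;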
This is quite a bit faster, though just slightly less accurate. I found it 4% light on the count, which is OK for "estimate" scenarios.
SELECT ROUND(
           (CHAR_LENGTH(content) - CHAR_LENGTH(REPLACE(content, " ", "")))
           / CHAR_LENGTH(" ")
       ) AS count
FROM documents
Simple solution for some similar cases (MySQL):
SELECT *,
(CHAR_LENGTH(student)-CHAR_LENGTH(REPLACE(student,' ','')))+1 as 'count'
FROM documents;
You can use the word_count() UDF from https://github.com/spachev/mysql_udf_bundle. I ported the logic from the accepted answer, with the difference that my code only supports the latin1 charset. The logic would need to be reworked to support other charsets. Also, both implementations always consider a non-alphanumeric character to be a delimiter, which may not always be desirable - for example, "teacher's book" is considered to be three words by both implementations.
The UDF version is, of course, significantly faster. For a quick test I tried both on a dataset from Project Gutenberg consisting of 9751 records totaling about 3 GB. The UDF did all of them in 18 seconds, while the stored function took 63 seconds to process just 30 records (which the UDF does in 0.05 seconds). So the UDF is roughly 1000 times faster in this case.
UDF will beat any other method in speed that does not involve modifying MySQL source code. This is because it has access to the string bytes in memory and can operate directly on bytes without them having to be moved around. It is also compiled into machine code and runs directly on the CPU.
Well, I tried to use the function defined above and it was great, except for one scenario.
In English the apostrophe ' is heavily used as part of a word, and the function above counted "haven't" as 2 words for me.
So here is my little correction:
DELIMITER $$
CREATE FUNCTION wordcount(str TEXT)
    RETURNS INT
    DETERMINISTIC
    SQL SECURITY INVOKER
    NO SQL
BEGIN
    DECLARE wordCnt, idx, maxIdx INT DEFAULT 0;
    DECLARE currChar, prevChar BOOL DEFAULT 0;
    SET maxIdx = CHAR_LENGTH(str);
    -- Start at position 1; MySQL string positions are 1-based:
    SET idx = 1;
    WHILE idx <= maxIdx DO
        -- Treat the apostrophe as part of a word as well:
        SET currChar = SUBSTRING(str, idx, 1) RLIKE '[[:alnum:]]' OR SUBSTRING(str, idx, 1) RLIKE "'";
        IF NOT prevChar AND currChar THEN
            SET wordCnt = wordCnt + 1;
        END IF;
        SET prevChar = currChar;
        SET idx = idx + 1;
    END WHILE;
    RETURN wordCnt;
END
$$
DELIMITER ;