CONTAINS(Oracle) vs. MATCH(MySql) - mysql

I want to know how to get the same results from MySQL that I get from Oracle.
For example, I use this statement in Oracle:
Select *
FROM PEP INNER JOIN ZUSAMMEN ON PEP.ID = ZUSAMMEN.PEPID
WHERE CONTAINS(ZUSAMMEN.NAMEN, '%Angela% and %Merkel%',0 ) > 0;
So I've set a CONTEXT Index on the 'Namen' Column.
Now, in MySql the syntax looks like this:
SELECT *
FROM PEP INNER JOIN ZUSAMMEN ON PEP.ID = ZUSAMMEN.PEPID
WHERE MATCH(ZUSAMMEN.NAMEN) AGAINST ('Angela Merkel' IN BOOLEAN MODE);
The problem is that the MySQL statement finds more results than Oracle.
Oracle finds only the exact name (Angela Dorothea Merkel).
MySQL does not.
How can I write the MySQL query so that it returns the same results as Oracle?

According to mysql documentation on fulltext search in boolean mode:
In implementing this feature, MySQL uses what is sometimes referred to
as implied Boolean logic, in which
+: stands for AND
-: stands for NOT
[no operator]: implies OR
You have not specified any operators, so MySQL searches for records that contain either Angela or Merkel in the NAMEN field. Modify the search using the + operator to require both words to be present:
MATCH(ZUSAMMEN.NAMEN) AGAINST ('+Angela +Merkel' IN BOOLEAN MODE)
Please note that the MySQL search will still behave slightly differently. The Oracle search, with its % wildcards, should return a record where the name is Angelas Merkels, while the MySQL search will not, because it matches whole words.
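If the search words come from user input, the + prefixes can be added programmatically. The following is a minimal Python sketch, not part of the original answer; the function name require_all_terms is hypothetical, and it assumes that stripping the boolean-mode operator characters from the input is acceptable for your data.

```python
import re

# Characters that act as operators in MySQL boolean-mode full-text queries.
_BOOLEAN_OPERATORS = re.compile(r'[+\-<>()~*"@]')


def require_all_terms(user_input):
    """Build a boolean-mode search string in which every word is required."""
    # Replace operator characters with spaces so the input cannot change the logic.
    cleaned = _BOOLEAN_OPERATORS.sub(" ", user_input)
    # Prefix each remaining word with '+', which stands for AND.
    return " ".join("+" + word for word in cleaned.split())


print(require_all_terms("Angela Merkel"))
```

The returned string would then be passed as a bound parameter to AGAINST (... IN BOOLEAN MODE) rather than concatenated into the SQL text.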

Related

how to speed up mysql regex query

I want to develop a site for announcing jobs, but because I have a lot of conditions (title, category, tags, city, ...) I use a MySQL regex statement. However, it's very slow and sometimes results in a 500 Internal Server Error.
Here is one example :
select * from job
where
( LOWER(title) REGEXP 'dév|freelance|free lance| 3eme grade|inform|design|site|java|vb.net|poo '
or
LOWER(description) REGEXP 'dév|freelance|free lance| 3eme grade|inform|design|site|java|vb.net|poo '
or
LOWER(tags) REGEXP 'dév|freelance|free lance| 3eme grade|inform|design|site|java|vb.net|poo')
and
LOWER(ville) REGEXP LOWER('Agadir')
and
`date`<'2016-01-11'
order by `date` desc
Any advice?
You can't optimize a query based exclusively on regexes. Use full text indexing (or a dedicated search engine such as Mnogo) for text search and geospatial indexing for locations.
The big part of the WHERE, namely the OR of 3 REGEXPs cannot be optimized.
LOWER(ville) REGEXP LOWER('Agadir') can be turned into simply ville REGEXP 'Agadir' if your collation is ..._ci. Please provide SHOW CREATE TABLE job.
Then that can be optimized to ville = 'Agadir'.
But maybe this query is "generated" by your UI? And the users are allowed to use regexp thingies? (SECURITY WARNING: SQL injection is possible here!)
If it is "generated", then generate the "=" version when the input contains no regexp codes.
Provide these:
INDEX(ville, date) -- for cases when you can do `ville = '...'`
INDEX(date) -- for cases when you must have `ville REGEXP '...'`
The first will be used (and reasonably optimal) when appropriate. The second is better than nothing. (It depends on how many rows have that date range.)
It smells like there may be other SELECTs. Let's see some other variants. What I have provided here may or may not help with them.
See my indexing cookbook: http://mysql.rjweb.org/doc.php/index_cookbook_mysql
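The advice above about generating the "=" version and avoiding SQL injection could look roughly like the Python sketch below. It is only an illustration: the function name city_condition is hypothetical, the list of regexp metacharacters is an assumption, and the %s placeholder style assumes a MySQL driver such as PyMySQL or mysqlclient.

```python
import re

# Characters treated here as "regexp codes"; adjust to what your UI allows.
_REGEX_META = re.compile(r"[.^$*+?()\[\]{}|\\]")


def city_condition(city):
    """Return (sql_fragment, params) for filtering on the ville column."""
    if _REGEX_META.search(city):
        # The user really entered a pattern: fall back to REGEXP.
        return "ville REGEXP %s", [city]
    # Plain text: an equality test can use INDEX(ville, date).
    return "ville = %s", [city]


fragment, params = city_condition("Agadir")
sql = "SELECT * FROM job WHERE " + fragment + " AND `date` < %s ORDER BY `date` DESC"
params.append("2016-01-11")
print(sql)
print(params)
```

Because the values travel as parameters and never become part of the SQL text, user input cannot inject SQL through them.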

Django mysql count distinct gives different result to postgres

I'm trying to count distinct string values for a filtered set of results in a Django query against a MySQL database versus the same data in a Postgres database. However, I'm getting really confusing results.
In the code below, NewOrder represents queries against the same data in a postgres database, and OldOrder is the same data in a MYSQL instance.
( In the old database, completed orders had status=1, in the new DB complete status = 'Complete'. In both the 'email' field is the same )
OldOrder.objects.filter(status=1).count()
6751
NewOrder.objects.filter(status='Complete').count()
6751
OldOrder.objects.filter(status=1).values('email').distinct().count()
3747
NewOrder.objects.filter(status='Complete').values('email').distinct().count()
3825
print NewOrder.objects.filter(status='Complete').values('email').distinct().query
SELECT DISTINCT "order_order"."email" FROM "order_order" WHERE "order_order"."status" = Complete
print OldOrder.objects.filter(status=1).values('email').distinct().query
SELECT DISTINCT "order_order"."email" FROM "order_order" WHERE "order_order"."status" = 1
And here is where it gets really bizarre
new_orders = NewOrder.objects.filter(status='Complete').values_list('email', flat=True)
len(set(new_orders))
3825
old_orders = OldOrder.objects.filter(status=1).values_list('email',flat=True)
len(set(old_orders))
3825
Can anyone explain this discrepancy? And possibly point me as to why results would be different between postgres and mysql? My only guess is a character encoding issue, but I'd expect the results of the python set() to also be different?
Sounds like you're probably using a case-insensitive collation in MySQL. There's no equivalent in PostgreSQL; the closest is the citext data type, but usually you just compare lower(...) of strings, or use ILIKE for pattern matching.
I don't know how to say it in Django, but I'd see if the count of the set of distinct lowercased email addresses is the same as the old DB.
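According to the Django docs, something like this might work (Lower is imported from django.db.models.functions, and the expression is passed as a keyword argument):
NewOrder.objects.filter(status='Complete').values(lower_email=Lower('email')).distinct()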
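The effect can be reproduced in plain Python without a database. The sketch below uses made-up addresses (not data from the question) to show why a case-sensitive set() and a case-insensitive DISTINCT can disagree; it assumes that letter case is the only difference the MySQL collation ignores.

```python
# Hypothetical sample data: two addresses appear twice with different casing.
emails = [
    "Anna@example.com",
    "anna@example.com",
    "bob@example.com",
    "BOB@EXAMPLE.COM",
    "carol@example.com",
]

# Python's set() compares strings exactly, like PostgreSQL's default DISTINCT.
exact_distinct = len(set(emails))

# Lowercasing first mimics a case-insensitive (..._ci) collation.
folded_distinct = len({e.lower() for e in emails})

print(exact_distinct)   # 5
print(folded_distinct)  # 3
```

This matches the pattern in the question: Python's set() over the MySQL rows gives the larger number, while MySQL's own DISTINCT gives the smaller one.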

How best to retrieve result of SELECT COUNT(*) from SQL query in Java/JDBC - Long? BigInteger?

I'm using Hibernate but doing a simple SQLQuery, so I think this boils down to a basic JDBC question. My production app runs on MySQL but my test cases use an in memory HSQLDB. I find that a SELECT COUNT operation returns BigInteger from MySQL but Long from HSQLDB.
MySQL 5.5.22
HSQLDB 2.2.5
The code I've come up with is:
SQLQuery tq = session.createSQLQuery(
"SELECT COUNT(*) AS count FROM calendar_month WHERE date = :date");
tq.setDate("date", eachDate);
Object countobj = tq.list().get(0);
int count = (countobj instanceof BigInteger) ?
((BigInteger)countobj).intValue() : ((Long)countobj).intValue();
This problem of the return type negates answers to other SO questions such as getting count(*) using createSQLQuery in hibernate? where the advice is to use setResultTransformer to map the return value into a bean. The bean must have a type of either BigInteger or Long, and fails if the type is not correct.
I'm reluctant to use a cast operator on the 'COUNT(*) AS count' portion of my SQL for fear of database interoperability. I realise I'm already using createSQLQuery so I'm already stepping outside the bounds of Hibernates attempts at database neutrality, but having had trouble before with the differences between MySQL and HSQLDB in terms of database constraints
Any advice?
I don't know of a clean solution for this problem, but I suggest using the H2 database for your tests.
H2 has a feature that lets you connect in a compatibility mode for several different databases.
For example, to use MySQL mode, connect with the URL jdbc:h2:~/test;MODE=MySQL.
You can downcast to Number and then call the intValue() method. E.g.
SQLQuery tq = session.createSQLQuery("SELECT COUNT(*) AS count FROM calendar_month WHERE date = :date");
tq.setDate("date", eachDate);
Object countobj = tq.list().get(0);
int count = ((Number) countobj).intValue();
Two ideas:
You can get the result value as a String and then parse it into a Long or BigInteger.
Do not use COUNT(*) AS count FROM ...; use something like COUNT(*) AS cnt ... instead. In your example code, though, you read the result column by its index rather than by its name, so you can simply use COUNT(*) FROM ...

Use SQL Server FTS Stemmer

Is there any way to directly access the stemmer used in the FORMSOF() option of a CONTAINS full-text search query, so that it returns the stems/inflections of an input word and not just those derivations that exist in a search column?
For example, the query
SELECT * FROM dbo.MyDB WHERE contains(CHAR_COL,'FORMSOF(INFLECTIONAL, prettier)')
returns the stem "pretty" and other inflections such as "prettiest" if they exist in the CHAR_COL column. What I want is to call the FORMSOF() function directly, without referencing a column at all. Any chance?
EDIT:
The query that met my needs ended up being
SELECT * FROM
(SELECT ROW_NUMBER() OVER (PARTITION BY group_ID ORDER BY GROUP_ID) ord, display_term
from sys.dm_fts_parser('FORMSOF( FREETEXT, running) and FORMSOF(FREETEXT, jumping)', 1033, null, 1)) a
WHERE ord=1
Requires membership in the sysadmin fixed server role and access rights to the specified stoplist.
No, you cannot do this. There is no direct access to the stemmer.
You can get an idea of how it works by looking into Solr source code. But it might (and I guess will) be different from the one implemented in MS SQL FT.
UPDATE: It turns out that in SQL Server 2008 R2 you can do something quite close to what you want. A special table-valued UDF was added:
sys.dm_fts_parser('query_string', lcid, stoplist_id, accent_sensitivity)
It lets you get the tokenization result (i.e. the result after word breaking, thesaurus expansion and stoplist processing). So if you feed it 'FORMSOF(....)', it will give you the result you want (although you will still have to process the result set). Here's the corresponding article in MSDN.

SQL query execution - different outcomes on Windows and Linux

The following is a query generated by Hibernate (except that I replaced the list of fields with *):
select *
from
resource resource0_,
resourceOrganization resourceor1_
where
resource0_.active=1
and resource0_.published=1
and (
resource0_.resourcePublic=1
or resourceor1_.resource_id=resource0_.id
and resourceor1_.organization_id=2
and (
resourceor1_.resource_id=resource0_.id
and resourceor1_.forever=1
or resourceor1_.resource_id=resource0_.id
and (
current_date between resourceor1_.startDate and resourceor1_.endDate
)
)
)
Currently I have 200+ records in both the Windows and Linux databases, and for each record the following happens to be true:
active = 1
published = 1
resourcePublic = 1
When I run this directly in a SQL client, this SQL query gets me all the matching records on Windows but none on Linux. I've MySQL 5.1 on both Windows and Linux.
If I apply the Boolean logic, (true and true and (true or whatever)), I expect the outcome to be true. It indeed is true on Windows but false on Linux!!!
If I modify the query as the following, it works on both Windows and Linux:
select *
from
resource resource0_
where
resource0_.active=1
and resource0_.published=1
and (
resource0_.resourcePublic=1
)
So, just the presence of conditions related to resourceOrganization is making the query bring 0 results on Linux and I expected that since it is the second part of an 'or' condition whose first part is true, the outcome should be true.
Any idea why the behavior differs between the two OSs, and why what should obviously work doesn't work on Linux?
Thanks in advance!
Check the case sensitivity and collation settings (collation issues).
Check the table-name case sensitivity. In particular, note that table names are case-insensitive on Windows and case-sensitive on Linux.
Have you tried a simple test case on both systems?
Check that current_date() returns the same format on both platforms.
I notice that the second test query only consults the resource table, not the resourceOrganization table.
I suspect that the resourceOrganization table is populated differently on the two machines, and the corresponding rows may not exist in your Linux MySQL.
What does this query return?
select *
from
resource resource0_,
resourceOrganization resourceor1_
where
resource0_.active=1
and resource0_.published=1
and (
resource0_.resourcePublic=1
or resourceor1_.resource_id=resource0_.id
and resourceor1_.organization_id=2
)
Also, don't forget to check the collation and case sensitivity. If one server uses a different collation from the other, you will have this same issue.
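The suspicion that resourceOrganization is populated differently can be checked with a small experiment. The sketch below uses Python's built-in sqlite3 rather than MySQL, with a cut-down, assumed schema, so it only illustrates the join semantics: a comma join against an empty table yields no rows at all, however the WHERE clause evaluates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE resource (id INTEGER PRIMARY KEY, active INT, published INT, resourcePublic INT);
    CREATE TABLE resourceOrganization (resource_id INT, organization_id INT);
    INSERT INTO resource VALUES (1, 1, 1, 1), (2, 1, 1, 1);
""")

QUERY = """
    SELECT COUNT(*)
    FROM resource resource0_, resourceOrganization resourceor1_
    WHERE resource0_.active = 1
      AND resource0_.published = 1
      AND (resource0_.resourcePublic = 1
           OR resourceor1_.resource_id = resource0_.id
          AND resourceor1_.organization_id = 2)
"""


def count_rows():
    """Run the simplified query and return the number of joined rows."""
    return conn.execute(QUERY).fetchone()[0]


# With an empty resourceOrganization table the cross product is empty.
rows_when_empty = count_rows()

# A single row, even an unrelated one, lets the public resources through.
conn.execute("INSERT INTO resourceOrganization VALUES (99, 7)")
rows_with_one_unrelated_row = count_rows()

print(rows_when_empty, rows_with_one_unrelated_row)
```

If the Linux database behaves like the first case, a SELECT COUNT(*) FROM resourceOrganization on both machines should confirm it; an outer join from resource to resourceOrganization would avoid the problem.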