Wildcard (*) Searches With Thinking Sphinx

Is it possible to put a star (*) in the middle of the search text?
Example => peo*le

The enable_star option allows prefix matching ('foo*') and infix matching ('*foo*'). It does not, however, allow you to put the * in the middle of a word, as the question asks. The simplest solution I can suggest for the described case is searching for the two parts as separate words with 'any' matching:
IndexedThingie.search('peo le', :match_mode => :any)
If you specifically need 'all' style matching for everything else, you should look into the expression matching syntax in the Sphinx manual (http://sphinxsearch.com/docs/2.0.1/extended-syntax.html), which is available if you specify the 'extended' match mode (see TS match mode documentation: http://freelancing-god.github.com/ts/en/searching.html#matchmodes). It might be complicated, but with some manipulation of your search input, you should be able to manage it. In particular, look at the 'strict order' operator, '<<'.
IndexedThingie.search('peo << le', :match_mode => :extended)
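The "manipulation of your search input" mentioned above could look roughly like the following Python sketch (Python is used only for illustration; the application in the question is Ruby). The helper name to_strict_order is hypothetical and not part of Thinking Sphinx or Sphinx. It only rewrites the query string; whether the fragments then match depends on how the index is configured.

```python
def to_strict_order(query: str) -> str:
    """Rewrite a query such as 'peo*le' into 'peo << le'.

    Hypothetical helper: splits on '*' and joins the non-empty pieces
    with the strict order operator '<<' from the extended syntax.
    """
    parts = [part.strip() for part in query.split("*") if part.strip()]
    return " << ".join(parts)


rewritten = to_strict_order("peo*le")
print(rewritten)
```

The rewritten string would then be passed to the search call with the extended match mode, as in the line above.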

Yes. enable_star: 1 or enable_star: true in your sphinx.yml.

Related

How to extract in Splunk at indexed time json field with same child-key from different father-key using regex? [duplicate]

This question already has answers here:
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
I have this gigantic ugly string:
J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM
J0000010: Project name: E:\foo.pf
J0000011: Job name: MBiek Direct Mail Test
J0000020: Document 1 - Completed successfully
I'm trying to extract pieces from it using regex. In this case, I want to grab everything after Project Name up to the part where it says J0000011: (the 11 is going to be a different number every time).
Here's the regex I've been playing with:
Project name:\s+(.*)\s+J[0-9]{7}:
The problem is that it doesn't stop until it hits the J0000020: at the end.
How do I make the regex stop at the first occurrence of J[0-9]{7}?

Make .* non-greedy by adding '?' after it:
Project name:\s+(.*?)\s+J[0-9]{7}:
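To see the difference between the two patterns, here is a small Python sketch. It assumes the log text is one string with the lines separated by newlines, and uses re.DOTALL so that "." can cross line breaks; the variable names are made up for the example.

```python
import re

# The log text from the question, joined with newlines (an assumption).
log = (
    "J0000000: Transaction A0001401 started on 8/22/2008 9:49:29 AM\n"
    "J0000010: Project name: E:\\foo.pf\n"
    "J0000011: Job name: MBiek Direct Mail Test\n"
    "J0000020: Document 1 - Completed successfully"
)

# Greedy: (.*) grabs as much as possible, then backs off to the LAST J-number.
greedy = re.search(r"Project name:\s+(.*)\s+J[0-9]{7}:", log, re.DOTALL)

# Non-greedy: (.*?) grows one character at a time and stops at the FIRST J-number.
lazy = re.search(r"Project name:\s+(.*?)\s+J[0-9]{7}:", log, re.DOTALL)

greedy_value = greedy.group(1)
lazy_value = lazy.group(1)

print(repr(greedy_value))
print(repr(lazy_value))
```

The greedy capture runs on through the "Job name" line, while the non-greedy capture contains only the project path.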

Using non-greedy quantifiers here is probably the best solution, also because it is more efficient than the greedy alternative: Greedy matches generally go as far as they can (here, until the end of the text!) and then trace back character after character to try and match the part coming afterwards.
However, consider using a negated character class instead:
Project name:\s+(\S*)\s+J[0-9]{7}:
\S means "any character except whitespace", and this is exactly what you want here.

Well, ".*" is a greedy selector. You make it non-greedy by using ".*?". With the latter construct, each time the regex engine is about to match another character with the ".", it first attempts to match whatever may come after the ".*?". This means that if, for instance, nothing comes after the ".*?", then it matches nothing.
Here's what I used. s contains your original string. This code is .NET specific, but most flavors of regex will have something similar.
string m = Regex.Match(s, @"Project name: (?<name>.*?) J\d+").Groups["name"].Value;

I would also recommend you experiment with regular expressions using "Expresso". It's a great (and free) utility for regex editing and testing.
One of its upsides is that its UI exposes a lot of regex functionality that people inexperienced with regex might not be familiar with, in a way that makes these new concepts easy to learn.
For example, when building your regex using the UI, and choosing "*", you have the ability to check the checkbox "As few as possible" and see the resulting regex, as well as test its behavior, even if you were unfamiliar with non-greedy expressions before.
Available for download at their site:
http://www.ultrapico.com/Expresso.htm
Express download:
http://www.ultrapico.com/ExpressoDownload.htm

(Project name:\s+[A-Z]:(?:\\\w+)+\.[a-zA-Z]+\s+J[0-9]{7})(?=:)
This will work for you.
Using (?:\\\w+)+\.[a-zA-Z]+ instead of .* makes the pattern more restrictive.

Regex getting the tags from an <a href= ...> </a> and the likes

I've tried the answers I've found on SOF, but none of them worked here: https://regexr.com
I essentially have an .OPML file with a large number of podcasts and descriptions,
in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex can I use to get just the title and the link:
Software Engineering Daily
http://softwareengineeringdaily.com/feed/podcast/

Brief
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Code
Method 1
This method works in almost any regex flavour (at least the normal ones).
This method only checks against the attribute value opening and closing marks of " and doesn't include the possibility for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
See regex in use here
\b(text|xmlUrl)="[^"]*"
Similarly, the following methods add more value to the above expression
\b(text|xmlUrl)\s*=\s*"[^"]*" Allows whitespace around =
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') Allows for ' to be used as attribute value delimiter
As another alternative (following the comments below my answer), if you wanted to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attributes, but you can just replace this with whatever valid characters you want. \S can be used to specify any non-whitespace characters or a set such as [\w-] may be used to specify any word or hyphen character. The negation of the specific attributes occurs with (?!text|xmlUrl), which says don't match those characters. Also, note that the word boundary \b at the beginning ensures that we're matching the full attribute name of text and not the possibility of other attributes with the same termination such as subtext.
\b((?!text|xmlUrl)\w+)="[^"]*"
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. This information is according to this article about numbered backreferences from Regular-Expressions.info
This method uses a backreference to ensure the same closing mark is used at the start and end of the attribute's value and also includes the possibility of whitespace surrounding the = symbol. This doesn't allow the possibility for attributes with no delimiter specified (using xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
See regex in use here
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
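As an illustration of Method 2, here is a minimal Python sketch (Python is one of the flavours listed above as supporting backreferences). The variable names are made up for the example; the input is the line from the question.

```python
import re

line = (
    '<outline text="Software Engineering Daily" type="rss" '
    'xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" '
    'htmlUrl="http://softwareengineeringdaily.com" />'
)

# Group 1: attribute name, group 2: opening quote, group 3: attribute value.
# \2 requires the closing quote to be the same character as the opening one.
pattern = re.compile(r"""\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2""")

# Map each matched attribute name to its value.
attrs = {m.group(1): m.group(3) for m in pattern.finditer(line)}

for name, value in attrs.items():
    print(name, value)
```

Note that htmlUrl is not picked up, because "xmlUrl" does not occur inside "htmlUrl" and \b anchors the start of the attribute name.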
Method 3
This method is the same as Method 2 but also allows attributes with no delimiters (note that delimiters are now considered to be space characters, thus, it will only match until the next space).
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
Method 4
While Method 3 works, some people might complain that the attribute value might end up in either of 2 groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few languages, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), Boost 1.42+ according to Regular-Expressions.info
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"]*)"|'([^']*)')
See regex in use here
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpatterns are not often supported. See this Regular-Expressions.info article for more information
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice)
See regex in use here
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
Results
Input
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Output
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
Explanation
I'll explain different parts of the regexes used in the Code section that way you understand the usage of each of these parts. This is more of a reference to the methods above.
"[^"]*" This is the fastest method possible (to the best of my knowledge) to grab anything between two " symbols. Note that it does not handle backslash-escaped quotes; it will match any run of non-" characters between two ". Whilst "(.*?)" can also be used, it's slightly slower
(["'])(.*?)\2 is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"]*)"|'([^']*)') <-- slightly faster than line above
(?|) This is a branch reset group. When you place groups inside it like (?|(x)|(y)) it returns the same group index for both matches. This means that if x is captured, it'll get group index of 1, and if y is captured, it'll also get a group index of 1.

For simple HTML strings you might get along with
Url=(['"])(.+?)\1
Here, take group $2, see a demo on regex101.com.
Obligatory: consider using a parser instead (see here).
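A parser-based version might look like the following Python sketch, using the standard library's xml.etree.ElementTree. The surrounding opml and body elements are an assumption about the rest of the file, since the question only shows a single outline line.

```python
import xml.etree.ElementTree as ET

# A tiny OPML document; the <opml>/<body> wrapper is assumed for the example.
opml = """<opml version="2.0">
  <body>
    <outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
  </body>
</opml>"""

root = ET.fromstring(opml)

# Collect (title, feed URL) for every outline element that has an xmlUrl.
feeds = [
    (outline.get("text"), outline.get("xmlUrl"))
    for outline in root.iter("outline")
    if outline.get("xmlUrl")
]

for title, url in feeds:
    print(title)
    print(url)
```

Unlike the regex approaches, this does not care about attribute order, quoting style, or whitespace around the = sign.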

Erlang binary pattern matching fails

Why does this issue a badmatch error? I can't figure out why this would fail:
<<IpAddr, ":*:*">> = <<"2a01:e34:ee8b:c080:a542:ffaf:*:*">>.

You need to specify the size of IpAddr so that it can be pattern-matched:
1> <<IpAddr:28/binary, ":*:*">> = <<"2a01:e34:ee8b:c080:a542:ffaf:*:*">>.
<<"2a01:e34:ee8b:c080:a542:ffaf:*:*">>
2> IpAddr.
<<"2a01:e34:ee8b:c080:a542:ffaf">>

Pattern matching of a binary proceeds left-to-right, so it will match IpAddr first before it tries the following segment. The matcher does not backtrack and retry other lengths until it finds a match. A default typed variable like IpAddr matches one byte. See Bit Syntax Expressions and Bit Syntax for a proper description and more examples.
As an alternative to using pattern matching here, you might consider using the binary module. There are two functions which could be useful to you: binary:match/2/3 and binary:split/2/3. These search the binary for a pattern, which may better fit your problem.
As a last alternative you could try using regular expressions and the re module.

How to select rows that start with a digit in Rails?

I have a page that shows items in an index.
I'm able to get items by letter using the following:
scope :by_letter, lambda { |letter| where("name LIKE '#{letter}%'") }
But I can't figure out an elegant solution for names that start with a number (0-9).
How could I rewrite this, or write a separate scope, that would let me search for names starting with a digit?
EDIT: I'm trying to get all rows that start with 0-9 in one go (not separately for each number).

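This should work (REGEXP takes a regular expression, so the start of the name is anchored with ^ rather than using the LIKE wildcard %):
scope :starts_with_number, where("name REGEXP '^[0-9]'")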

Jacob, try this slightly rewritten version of what you ended up with:
@letter_merchants = (0..9).map { |d| Merchant.by_letter(d) }
Please note that this is only meant to illustrate how awesome a language Ruby is, not how the problem should be solved (there would be too many database calls).

Here's how I ended up doing it:
@letter_merchants = []
(0..9).to_a.each do |digit|
  @letter_merchants |= Merchant.by_letter(digit)
end

One disadvantage of REGEXP is that it can't use indexes. However,
scope :starts_with_number, where("name >= '0' and name < ':'")
can use an index on name. It does rely on the characters 0-9 and : being in precisely that order, with nothing in between. That will be the case in anything like ASCII or UTF-8, but not if you used EBCDIC or anything crazy like that.
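The range condition can be tried out with a quick Python sketch against an in-memory SQLite database. SQLite and the sample rows are stand-ins chosen for the example; the original question appears to target a different database, and its collation may compare strings differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE merchants (name TEXT)")
conn.execute("CREATE INDEX idx_merchants_name ON merchants (name)")

# Made-up sample names: some start with a digit, some do not.
conn.executemany(
    "INSERT INTO merchants (name) VALUES (?)",
    [("7-Eleven",), ("Acme",), ("99 Cents",), ("zebra",), (":colon",), ("0 Degrees",)],
)

# ':' is the character right after '9', so this range covers names
# whose first character is 0 through 9.
rows = conn.execute(
    "SELECT name FROM merchants WHERE name >= '0' AND name < ':' ORDER BY name"
).fetchall()
digit_names = [name for (name,) in rows]

print(digit_names)
```

Only the three names starting with a digit come back; ":colon" is excluded because it sorts after ':'.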

Making sure a url parameter is not present, using regex

I need 2 regular expressions that I will use in MySQL:
One that matches if one of the URL parameters equals something (e.g. page_id=5).
I came up with this: ^https?:.*[?&]page_id=5([#&].*)?$
One that matches if a certain parameter is not present in the URL (e.g. it should not match [?&]page_id=).
This is the one I need help with.
This functionality is part of a bigger problem that does need to be implemented with regular expressions, and they have to be compatible with MySQL's RLIKE.

Your regexp looks fine. Just use NOT RLIKE.

AFAIK, MySQL's regex library does not support look-aheads, which are necessary for this kind of thing. As already stated, NOT RLIKE seems to be the only option.
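The idea behind NOT RLIKE, matching the "parameter is present" pattern and negating the result instead of writing a look-ahead, can be sketched in Python as follows. This only illustrates the logic and is not MySQL code; the function name and the sample URLs are made up for the example.

```python
import re

# Matches when a page_id parameter IS present in the query string.
HAS_PAGE_ID = re.compile(r"[?&]page_id=")


def lacks_page_id(url: str) -> bool:
    """Return True when the URL has no page_id parameter (negated match)."""
    return HAS_PAGE_ID.search(url) is None


urls = [
    "http://example.com/?page_id=5",
    "http://example.com/?a=1&page_id=5#top",
    "http://example.com/?a=1",
    "http://example.com/?my_page_id=5",
]

for url in urls:
    print(url, lacks_page_id(url))
```

The last URL counts as lacking the parameter because the pattern requires ? or & directly before page_id.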
