How to identify path/file/url in href - html

I'm trying to grab the href value in <a> HTML tags using Nokogiri.
I want to identify whether they are a path, file, URL, or even a <div> id.
My current work is:
hrefvalue = []
html.css('a').each do |atag|
  hrefvalue << atag['href']
end
The possible values in a href might be:
somefile.html
http://www.someurl.com/somepath/somepath
/some/path/here
#previous
Is there a mechanism to identify whether the value is a valid full URL, or file, or path or others?

Try URI:
require 'uri'
URI.parse('somefile.html').path
=> "somefile.html"
URI.parse('http://www.someurl.com/somepath/somepath').path
=> "/somepath/somepath"
URI.parse('/some/path/here').path
=> "/some/path/here"
URI.parse('#previous').path
=> ""

Nokogiri is often used with Ruby's URI or open-uri, so if that's the case in your situation you'll have access to their methods. You can use URI.parse to attempt to parse the href. You can also generally use URI.join(base_uri, retrieved_href) to construct the full URL, provided you've stored the base_uri.
(Edit/side-note: further details on using URI.join are available here: https://stackoverflow.com/a/4864170/624590 ; do note that URI.join takes strings as parameters, not URI objects, so coerce where necessary)
Basically, to answer your question:
Is there a mechanism to identify whether the value is a valid full url, or file, or path or others?
If the retrieved_href and the base_uri are well formed, and retrieved_href == the joined pair, then it's an absolute URL. Otherwise it's relative (again, assuming well-formed inputs).
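As a rough illustration of that comparison, here is a minimal sketch written in Python's urllib.parse rather than Ruby, purely to keep it self-contained. is_absolute is a made-up helper name, and the base URL is an assumed page address built from the question's sample host.

```python
from urllib.parse import urljoin

def is_absolute(base_uri, href):
    # Joining an absolute URL onto a base leaves it unchanged;
    # anything relative gets rewritten against the base.
    return urljoin(base_uri, href) == href

# Assumed address of the page the hrefs were scraped from.
base = "http://www.someurl.com/index.html"

for href in ["somefile.html",
             "http://www.someurl.com/somepath/somepath",
             "/some/path/here",
             "#previous"]:
    print(href, is_absolute(base, href))
```

Only the second sample compares equal to its joined form; the other three come back rewritten against the base, so they are relative.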

If you use URI to parse the href values, then apply some heuristics to the results, you can figure out what you want to know. This is basically what a browser has to do when it's about to send a request for a page or a resource.
Using your sample strings:
require 'uri'

%w[
  somefile.html
  http://www.someurl.com/somepath/somepath
  /some/path/here
  #previous
].each do |u|
  puts URI.parse(u).class
end
Results in:
URI::Generic
URI::HTTP
URI::Generic
URI::Generic
The only one that URI recognizes as a true HTTP URI is "http://www.someurl.com/somepath/somepath". All the others are missing the scheme "http://". (There are many more schemes you could encounter. See the specification for more information.)
Of the generic URIs, you can use some rules to sort through them so you'd know how to react if you have to open them.
If you gathered the HREF strings by scraping a page, you can assume it's safe to use the same scheme and host if the URI in question doesn't supply one. So, if you initially loaded "http://www.someurl.com/index.html", you could use "http://www.someurl.com/" as your basis for further requests.
From there, look inside the strings to determine whether they are anchors, absolute or relative paths. If the string:
1. Starts with #, it's an anchor and would be applied to the current page without any need to reload it.
2. Doesn't contain a path delimiter /, it's a filename and would be added to the currently retrieved URL, substituting the file name, and retrieved. A nice way to do the substitution is to use File.dirname, File.basename and File.join against the string.
3. Begins with a path delimiter, it's an absolute path and is used to replace the path in the original URL. URI::split and URI::join are your friends here.
4. Doesn't begin with a path delimiter, it's a relative path and is added to the current URI similarly to #2.
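The rules above can be sketched as a small classifier. This version uses Python's urllib.parse only for illustration (the Ruby counterparts would be URI.parse and URI.join). classify_href and its category labels are hypothetical names, and the base URL is an assumed page address.

```python
from urllib.parse import urlparse, urljoin

def classify_href(href):
    parts = urlparse(href)
    if parts.scheme:
        return "full URL"        # has a scheme such as http://
    if href.startswith("#"):
        return "anchor"          # applies to the current page
    if href.startswith("/"):
        return "absolute path"   # replaces the path of the base URL
    if "/" not in parts.path:
        return "filename"        # replaces the last segment of the base URL
    return "relative path"       # resolved against the base URL's directory

# Assumed address of the page the hrefs were scraped from.
base = "http://www.someurl.com/dir/index.html"

for href in ["somefile.html",
             "http://www.someurl.com/somepath/somepath",
             "/some/path/here",
             "#previous",
             "docs/page.html"]:
    print(classify_href(href), "->", urljoin(base, href))
```

In practice the join step does the substitution work for every category, so the classification is mostly useful for deciding whether a new request is needed at all.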
Regarding:
hrefvalue = []
html.css('a').each do |atag|
  hrefvalue << atag['href']
end
I'd use this instead:
hrefvalue = html.search('a').map { |a| a['href'] }
But that's just me.
A final note: URI has some problems with age and needs an update. It's a useful library but, for heavy-duty URI rippin' apart, I highly recommend looking into using Addressable/URI.


OData: How to add operations for get by id?

My swagger.json on the backend lists two different paths for each operation like so:
"paths": {
"/api/Clients": {
...
"/api/Clients({key})": {
...
When I try to edit the OpenAPI markup directly and add a new path, it says duplicate path.
I also tried adding {key} as an optional parameter to the existing Clients operation, but it didn't like a parameter being marked optional while its value comes from the path. From this post it looks like it's possible, but I cannot figure out how.
Based on the post that you shared, the recommendation was to use a path like /api/Clients/{key} and then rewrite the URI as required.
To follow the recommendation exactly, you could use /api/{entity}/{key} itself, which catches all entities.

How to replace a character by another in a variable

I want to know if there is a way to replace a character by another in a variable. For example, replacing every dot with an underscore in a string variable.
I haven't tried it, but based on the Variables specification, the way I'd try to approach this would be to try to match on the text before and after a dot, and then make new variables based on the matches. Something like:
require ["variables"];
set "value" "abc.def";
if string :matches "${value}" "*.*" {
    set "newvalue" "${1}_${2}";
}
This will, of course, only match on a single period because Sieve doesn't include any looping structures. While there's a regex match option, I'm not aware of any regex replacement Sieve extensions.
Another approach to complex mail filtering you can do with Dovecot (if you do need loops and have full access to the mail server) is their Dovecot-specific extensions like vnd.dovecot.pipe which allows the mail administrator to define full programs (written in whatever language one wishes) to process mail on its way through.
Following @BluE's comment, if your use case is to store e-mails in folders per recipient address or something like that, perhaps you don't actually want a generic character replace function but some way to create mailboxes with dots in their names. In the case of Dovecot, there seems to be a solution: [Dovecot] . (dot) in maildir folder names
https://wiki2.dovecot.org/Plugins/Listescape
Ensure one of the files in /etc/dovecot/conf.d contains this line:
mail_plugins = listescape
Then you can filter mailing lists into separate boxes based on their IDs.
This Sieve script snippet picks the ID from the x-list-id header:
require ["fileinto", "mailbox", "variables", "regex"];
if exists "x-list-id" {
    if header :regex "x-list-id" "<([\.@a-z_0-9-]+)" {
        set :lower "listname" "${1}";
        fileinto :create "mailing_list\\${listname}";
    } else {
        keep;
    }
    stop;
}

Actionscript 3.0 Reg Exp Find first URL, Ignore Email

I've got a bit of code where I have a loop with a string.search() to parse a string of HTML. The purpose is to seek out any valid URLs and surround each one with HREF tags appropriately while ignoring anything else, like email addresses. The problem is no matter how I modify the regular expression, it either kicks back the part of the email address after the @ sign and highlights it or it highlights or ignores everything.
An example string would be:
"</span><span class='blue'>Weaselgrease:</span><span class='magenta'>weaselgrease@weasel.grs vs weaselgrease.weasel.grs</span><span class='blue'> [12:41:33 AM]</span>"
Where 'weaselgrease.weasel.grs' would be identified as a proper URL and 'weaselgrease@weasel.grs' would be ignored.
The code I have currently is /([fh]t{1,2}ps?:\/\/)?[\w-]+\.\w{2,4}/
I know it's rather simple, but it doesn't need to be complex yet.
I've tried a conditional and gotten nowhere. I may just be missing something, but my searching and even playing legos with http://regex101.com/ has gotten me no closer.
Ultimately I'm going to have it do the following:
Identify a valid URL's index in the string
Ignore if it's an email
Ignore if it's just an IP address (no prepending http:// and no trailing slash)
But I'd be happy with just an inkling of help on what I need to do to get it to ignore email addresses.
A URL without a proper protocol (i.e. http, https, ftp) cannot be validated as such, because then almost anything that has a . (dot) in it is a valid URL.
So there is no way to properly tell a URL from an e-mail address if you don't require the protocol. Examples:
end of sentence.New sentence -> sentence.New is a valid URL in your case
weaselgrease@weasel.grs -> everything before the @ is ignored and weasel.grs is a valid URL
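To make the trade-off concrete, here is a minimal sketch in Python (not ActionScript; the lookaround syntax used here is an assumption about what your regex engine supports). LOOSE_URL and STRICT_URL are hypothetical patterns written for this example, not a complete URL grammar. The loose one skips anything glued to an @, yet still reports sentence.New, which is exactly the ambiguity described above. The strict one only accepts matches that carry a protocol.

```python
import re

# Loose: protocol optional; refuse tokens that touch an "@" on either side.
LOOSE_URL = re.compile(
    r"(?<![\w@.-])(?:(?:https?|ftp)://)?[\w-]+(?:\.[\w-]+)*\.\w{2,4}\b(?!@)"
)

# Strict: protocol required, so e-mail addresses and stray dots never match.
STRICT_URL = re.compile(r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+")

def loose_urls(text):
    return [m.group(0) for m in LOOSE_URL.finditer(text)]

def strict_urls(text):
    return [m.group(0) for m in STRICT_URL.finditer(text)]

print(loose_urls("weaselgrease@weasel.grs vs weaselgrease.weasel.grs"))
print(loose_urls("end of sentence.New sentence"))
print(strict_urls("see http://weasel.grs now, not sentence.New"))
```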

Regex not matching URL Params

I am currently working on a stub server I can plug into a webpage so I do not need to hit sagepay every time I test my payment screen. I need the server to receive a request from the web page and use the dynamic parameters contained in the URL to build the server response. The stub uses regex targets to pick out the parameters I need and add them to the response.
I am using this stub server
I built the accepted URL piece by piece, using the regex tester contained here to test each bit of logic. The expressions work separately, but when I try to join two or more of them together they refuse to work. Each parameter is separated by an ampersand (&) and the name of the parameter.
Here is a sample of the parameters:
paymentType=A&amount=147.06&policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad&paymentMethod=A&script=Retail/accept.py&scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A&description=New Business Payment&firstName=Adam&surname=Har&addressLine1=20 Potters Road&city=London&postalCode=EC1 4JS&payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b&cardType=valid&continuousAuthority=true&makeCurrent=true
and in a list for ease of reading (without &'s)
paymentType=A
amount=147.06
policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad
paymentMethod=A
script=Retail/accept.py
scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true&paymentType=A
description=New Business Payment
firstName=Adam
surname=Har
addressLine1=20 Chase road
city=London
postalCode=EC1 3PF
payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b
cardType=valid
continuousAuthority=true
makeCurrent=true
And here is my accepted URL parameters with the regex logic:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)$)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))&description=([a-zA-Z0-9 ]+s)&firstName=[A-Za-z]&surname=[A-Za-z]&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)$)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
again in a list:
registerPayment?outputType=xml
country=GB
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)))
description=([a-zA-Z0-9 ]+s)
firstName=[A-Za-z]
surname=[A-Za-z]
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
My question is: why do my regexes match the sample OK separately, but not when I put them all together?
Additional question:
I am using the logic (([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+))) for the whole scriptParams parameter (the &'s here are part of the parameter). If I just want to get the 'uid' part and leave the rest, what expression would I need to target this (it is made up of A-Z, a-z, 0-9 and dashes)?
thank you
UPDATE
I have tweaked your answer slightly, because the stub server I am using will not accept the (?:[\s-]) when it loads the file containing the URL templates. I have also incorporated a lot of % and 0-9 because the request is URL-encoded (percent-encoded) before it is matched (which I had not anticipated), and a few of the params have rogue spaces beyond my control. Other than that, your solution worked great :)
Here is my new version of the scriptParams regex:
&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+
This accepts the whole parameter, and works fine in the regex tester. Now when I link anything after this part, there is an unsuccessful match.
I do not understand why this is a problem as the regex seem to string together nicely otherwise. Any ideas are appreciated.
Here is the full regex:
paymentType=[-%a-zA-Z0-9 ]+&amount=[0-9]+.[0-9]{2}&policyUid=([-A-Za-z0-9]+)&paymentMethod=([%a-zA-Z0-9]+)&script=[%/.a-zA-Z0-9]+&scriptParams=[a-zA-Z]{3}%3d[-A-Za-z0-9]+&description=[%a-zA-Z0-9 ]+&firstName=[-%A-Za-z0-9]+&surname=[-%A-Za-z0-9]+&addressLine1=[-%a-zA-Z0-9 ]+&city=[-%a-zA-Z 0-9]+&postalCode=[-%a-zA-Z 0-9]+&payerUid=([-A-Za-z0-9]+)&cardType=[%A-Za-z0-9]+&continuousAuthority=[A-Za-z]+&makeCurrent=[A-Za-z]+
And here is the full set of URL params (with the URL encoding present):
paymentType=A&amount=104.85&policyUid=16a9cc22-0000-0000-5a96-5654d9a31f92&paymentMethod=A%20&script=RetailQuotes%2FacceptQuote.py%20&scriptParams=uid%3d16a9c958-0000-0000-5a96-565435311d07%26invokePCL%3dtrue%26paymentType%3dA%20&description=New%2520Business%2520Payment&firstName=Adam&surname=Har%20&addressLine1=26%2520Close&city=Potters%2520Town&postalCode=EC1%25206LR%20&payerUid=16a9c24e-0000-0000-5a96-5654b3f956e0&cardType=valid%20&continuousAuthority=true&makeCurrent=true
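For anyone puzzling over the % sequences, decoding a couple of the values shows what the matcher has to cope with. This is a small Python sketch using urllib.parse.unquote, with the sample values copied from the parameters above. Note that some values are encoded twice (%2520 is an encoded %20), and others end in an encoded space.

```python
from urllib.parse import unquote

# Values copied from the request above.
samples = {
    "script": "RetailQuotes%2FacceptQuote.py%20",
    "description": "New%2520Business%2520Payment",
}

decoded = {}
for name, raw in samples.items():
    once = unquote(raw)    # one round of percent-decoding
    twice = unquote(once)  # a second round, for double-encoded values
    decoded[name] = (once, twice)
    print(name, repr(once), repr(twice))
```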
Thank you
PS
(Solved the server problem. Was a slight mistake I was making in the usage of URL params.)
First, not all of your regexes work: some are missing quantifiers, others have a $ for some reason, and some parameters are missing altogether! Here's what they should have been:
paymentType=A
amount=([0-9]+.[0-9]{2})
policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
paymentMethod=([a-zA-Z]+)
script=([a-zA-Z]+/[a-zA-Z]+.py)
scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))
invokePCL=([a-z]+)
paymentType=A
description=([a-zA-Z0-9 ]+)
firstName=[A-Za-z]+
surname=[A-Za-z]+
addressLine1=[a-zA-Z0-9 ]+
city=([a-zA-Z ]+)
postalCode=[a-zA-Z0-9 ]+
payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)
cardType=[a-zA-Z]+
continuousAuthority=[a-zA-Z]+
makeCurrent=[a-zA-Z]+
And combined, you get:
paymentType=A&amount=([0-9]+.[0-9]{2})&policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&paymentMethod=([a-zA-Z]+)&script=([a-zA-Z]+/[a-zA-Z]+.py)&scriptParams=[a-zA-Z]{3}=(([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)+))&invokePCL=([a-z]+)&paymentType=A&description=([a-zA-Z0-9 ]+)&firstName=[A-Za-z]+&surname=[A-Za-z]+&addressLine1=[a-zA-Z0-9 ]+&city=([a-zA-Z ]+)&postalCode=[a-zA-Z0-9 ]+&payerUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&cardType=[a-zA-Z]+&continuousAuthority=[a-zA-Z]+&makeCurrent=[a-zA-Z]+
regex101 demo
[Note, I took your regexes where they matched and ran minimal edits to them].
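To show why the stray $ matters once the pieces are joined, here is a minimal sketch using Python's re module (an assumption; the stub server's regex engine may differ in details). It takes two adjacent parameters from the sample. On its own, the policyUid piece matches because the value really is at the end of the string. Once &paymentMethod follows, the $ can never be satisfied.

```python
import re

alone = r"policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*$)"
broken = alone + r"&paymentMethod=([a-zA-Z]+)"
fixed = r"policyUid=([A-Za-z0-9]+(?:[\s-][A-Za-z0-9]+)*)&paymentMethod=([a-zA-Z]+)"

single = "policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad"
joined = single + "&paymentMethod=A"

print(bool(re.search(alone, single)))   # piece tested by itself
print(re.search(broken, joined))        # $ in the middle: no match
print(re.search(fixed, joined).groups())
```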
For your second question, I'm not sure what you mean by the Uid part and that & are part of the parameter. Given that there are 3 Uids in the url with similar format (policy, scriptparams, user), you will have to put them in the expression, unless you know a specific pattern to the scriptparams' Uid.
In the expression below, I made use of the fact that only scriptparams' uid was in lowercase:
uid=[0-9a-f]+(?:-[0-9a-f]+)+
regex101 demo
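A quick check of that expression, again sketched with Python's re module as an assumed engine and a shortened version of the sample query. Because the match is case-sensitive, policyUid and payerUid (capital U) are skipped and only the scriptParams uid is found. The capture group around the value is an addition for this example.

```python
import re

# Shortened sample query containing all three Uid-style parameters.
query = ("policyUid=07ef493b-0000-0000-6a05-9fa4d6a5b5ad"
         "&scriptParams=uid=07ef461a-0000-0000-6a059fa44a8870bf&invokePCL=true"
         "&payerUid=07ef3ff7-0000-0000-6a05-9fa42e92d56b")

# Same expression as above, with a capture group added around the value.
uids = re.findall(r"uid=([0-9a-f]+(?:-[0-9a-f]+)+)", query)
print(uids)
```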

Naming variables that contain filenames?

If I've got a variable that contains the fully-qualified name of a file (for example, a project file), should it be called projectFile, projectFileName or projectPath? Or something else?
I usually go with these:
FileName for the file name only (without path)
FilePath for the parent path only (without the file name)
FileFullName for the fully qualified name with path
I don't think there is such a thing as an accepted standard for this. It depends on your (team's) preferences and whether you need to differentiate between the three in a given situation.
EDIT: My thoughts on these particular naming conventions are these:
intuitively, a "Name" is a string, so is a "Path" (and a "FileName")
a "Name" is relative unless it is a "FullName"
related variable names should begin with the same prefix ("File" + ...), I think this improves readability
concretions/properties are right-branching: "File" -> "FileName"
specializations are left-branching: "FileName" -> "ProjectFileName" (or "ProjectFileFullName")
a "File" is an object/handle representing a physical object, so "ProjectFile" cannot be a string
I cannot always stick to these conventions, but I try to. If I decided to use a particular naming pattern I am consistent even if it means that I have to write more descriptive (= longer) variable names. Code is more often read than written, so the little extra typing doesn't bother me too much.
System.IO.Path seems to refer to the fully qualified name of the file as the path, the name of the file itself as filename, and its containing directory as directory. I would suggest that in your case projectPath is most in keeping with this nomenclature if you are using it in the context of System.IO.Path. I would then refer to the name of the file as fileName and its containing directory as parentDirectory.
I don't think there's a consensus on this, just try to be consistent.
Examples from the .NET Framework:
FileStream(string path...);
Assembly.LoadFrom(string assemblyFile)
XmlDocument.Load(string filename)
Even the casing of filename (filename / fileName) is inconsistent in the framework, e.g.:
FileInfo.CopyTo(string destFileName)
You should title it after the file it is likely to contain, ie:
system_configuration_file_URI
user_input_file_URI
document_template_file_URI
"file" or "filename" is otherwise mostly useless
Also, "file" could mean "I am a file pointer", which is ambiguous; "filename" doesn't state whether or not it has context (i.e. directories); fileURI is contextually unambiguous, as it says "this is a resource identifier that, when observed, points to a resource (a file)".
Some thoughts:
projectFile might not be a string -- it might be an object that represents the parsed contents of the file.
projectFileName might not be fully-qualified. If the file is actually "D:\Projects\MyFile.csproj", then this might be expected to contain "MyFile.csproj".
projectPath might be considered to be the fully-qualified path to the file, or it might be considered to be the name of the parent folder containing the file.
projectFolder might be considered to hold the name of the parent folder, or it might actually be an implementation of some kind of Folder abstraction in the code.
.NET sometimes uses path to refer to a file, and sometimes uses filename.
Notice that “filename” is one word in English! There's no need to capitalize the “n” in the middle of the identifier.
That said, I append Filename to all my string variables that contain filenames. However, I try to avoid this whole scenario in favour of using strongly typed variables in languages that support a type of files and directories. After all, this is what extensible type systems are there for.
In strongly typed languages, the descriptive postfix is then often unnecessary (especially in function arguments) because the variable's type and usage imply its content.
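As one possible illustration of the typed approach, here is a short sketch using Python's pathlib (the path itself is a made-up example). Once the variable is a path object, the name no longer has to say whether it holds a bare file name, a directory or the full path, because each part is an attribute.

```python
from pathlib import PurePosixPath

# Hypothetical project file location, used only for illustration.
project_file = PurePosixPath("/path/to/project.csproj")

print(project_file.name)    # file name without the directory
print(project_file.parent)  # containing directory
print(project_file.suffix)  # extension, including the dot
```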
All of this depends partly on the size of the method and if the variables are class variables.
If they are class variables or in a large complicated method, then follow Kent Fredric's advice and name them something that indicates what the file is used for, i.e. "projectFileName".
If this is a small utility method that, say, deletes a file, then don't name it "projectFileName". Call it simply "filename" then.
I would never name it "path", since that implies that it's referring to the folder that it's in.
Calling it "file" would be OK, if there were not also another variable, like "fileID" or "filePtr".
So, I would use "folder" or "path" to identify the directory that the file is in.
And "fileID" to represent a file object.
And finally, "filename" for the actual name of the file.
Happy Coding,
Randy
Based on the above answers of the choices being ambiguous as to their meaning and there not being a widely accepted name, if this is a .NET method requesting a file location then specify in the XML comments what you want, along with an example, as another developer can refer to that if they are unsure of what you want.
It depends on naming conventions used in your environment e.g. in Python the answer would be projectpath due to naming conventions of os.path module.
>>> import os.path
>>> path = "/path/to/tmp.txt"
>>> os.path.abspath(path)
'c:\\path\\to\\tmp.txt'
>>> os.path.split(path)
('/path/to', 'tmp.txt')
>>> os.path.dirname(path)
'/path/to'
>>> os.path.basename(path)
'tmp.txt'
>>> os.path.splitext(_)
('tmp', '.txt')