Here is a Ruby question, guys. I need to parse an HTML file and catch URLs and emails, but I can't come up with a proper regex. I've tried 100+ regexes, and every time I catch something else along with the URL.
File.open("/Desktop/file.html").each_line do |line|
if line.split("href=\"") =~ /???/
puts line
end
end
# I can use line.split("href=\"") so each new line will start with url =>
(https://www.facebook.com/students">
The question is: what regex can I use to catch everything from https to the end of the URL, which ends with a double quote (")? There could be one or more occurrences of the same URL, so {1,2} is needed.
Try this
file = File.open('filename_path')
links = file.read().scan(/href=\"(?<url>.*?)\"/)
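You get the links in an array (because the pattern contains a capture group, scan returns each match as an array of its captures).
It also works if you remove ?<url> from the pattern above (it's just a named capture group).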
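For comparison, here is a minimal sketch of the same non-greedy href capture in Python, extended to separate emails from URLs. The function name extract_links and the sample HTML are assumptions made for illustration, and a regex like this only handles double-quoted href attributes:

```python
import re

# Same idea as the Ruby pattern: lazily capture everything between href=" and the next ".
HREF_PATTERN = re.compile(r'href="(.*?)"')


def extract_links(html_text):
    """Return (urls, emails) found in double-quoted href attributes."""
    hrefs = HREF_PATTERN.findall(html_text)
    # mailto: links carry the email address after the scheme.
    emails = [h[len("mailto:"):] for h in hrefs if h.startswith("mailto:")]
    urls = [h for h in hrefs if h.startswith(("http://", "https://"))]
    return urls, emails


# Hypothetical sample input, not taken from the asker's file.
sample = (
    '<a href="https://www.facebook.com/students">FB</a> '
    '<a href="mailto:someone@example.com">mail</a>'
)
urls, emails = extract_links(sample)
print(urls)
print(emails)
```

The lazy quantifier .*? is what stops the match at the first closing quote instead of running to the last quote on the line.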
Right now I have the line of code in python:
urls = re.findall("(?:(?:https?|ftp):\/\/)?[\w/\-?=%.]+\.[\w/\-?=%.]+",str(field))
This searches if a keyword is in a URL, however it doesn't parse URLs which include a # correctly. An example link I am trying to parse is
https://partalert.net/product.html?v=51421546#asin=B08KH7RL89&price=&smid=A3P5ROKL5A1OLE&tag=partalert-21&timestamp=00%3A17+UTC+%281.3.2021%29&title=Gigabyte+GeForce+RTX+3080+VISION+OC+10GB+Graphics+Card&tld=.co.uk
However the parsing excludes the hashtag and everything after it:
https://partalert.net/product.html?v=51421546
I managed to solve this. I needed to add a few symbols to the character classes; here is the working regex: "(?:(?:https?|ftp):\/\/)?[\w/\-?=%.#&+]+\.[\w/\-?=%.#&+]+"
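A quick way to check the widened character classes is to run the pattern against a link with a fragment. This is only a sketch: the shortened sample URL and the helper name find_urls are assumptions, and the pattern is written as a raw string with the hyphen and dot escaped:

```python
import re

# The character classes now also accept '#', '&' and '+'.
URL_PATTERN = re.compile(r"(?:(?:https?|ftp)://)?[\w/\-?=%.#&+]+\.[\w/\-?=%.#&+]+")


def find_urls(text):
    """Return every substring of text that the pattern treats as a URL."""
    return URL_PATTERN.findall(text)


# Shortened, hypothetical variant of the link from the question.
sample = "see https://partalert.net/product.html?v=51421546#asin=B08KH7RL89&price=&tld=.co.uk now"
print(find_urls(sample))
```

Because the pattern contains only non-capturing groups, findall returns the whole match, fragment included.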
EDIT: Now that the problem is solved, I realize that it had more to do with properly reading/writing byte-strings, rather than HTML. Hopefully, that will make it easier for someone else to find this answer.
I have an HTML file that's poorly formatted. I want to use a Python lib to just make it tidy.
It seems like it should be as simple as the following:
import sys
from lxml import etree, html
#read the unformatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html', 'r', encoding='utf-8') as file:
    #read the whole file into a single string
    file_text = ''.join(file.readlines())
#format the HTML
document_root = html.fromstring(file_text)
document = etree.tostring(document_root, pretty_print=True)
#write the nice, pretty, formatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/Pretty.html', 'w') as file:
    #write the pretty XML to a file
    file.write(document)
But this chunk of code complains that file_lines is not a string or bytes-like object. Okay, it makes sense that the function can't take a list, I suppose.
But then, it's 'bytes' not a string. No problem, str(document)
But then I get HTML that's full of '\n' sequences that are not newlines... they're a backslash followed by an n. And there are no actual line breaks in the result; it's just one long line.
I've tried a number of other weird things like specifying the encoding, trying to decode, etc. None of which produce the desired result.
What's the right way to read and write this kind of (is non-ASCII the right term?) text?
What you are missing is that etree's tostring method returns bytes, and you need to take that into account when writing to a file. Open the file in binary mode with the b flag, like this, and forget about the str() conversion:
with open('Pretty.html', 'wb') as file:
    #write the pretty XML to a file
    file.write(document)
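To see why str() produced the literal '\n' sequences, here is a small sketch that uses a hand-written bytes value as a stand-in for the output of etree.tostring. The sample data and the output path are assumptions, and lxml itself is not needed to run it:

```python
import os
import tempfile

# Stand-in for the bytes that etree.tostring() returns (assumed sample data).
document = b"<html>\n  <body>ok</body>\n</html>\n"

as_repr = str(document)             # the repr of the bytes object: "b'<html>\\n ...'"
as_text = document.decode("utf-8")  # real text with real newlines

print(as_repr)
print(as_text)

# Writing the bytes unchanged requires binary mode.
out_path = os.path.join(tempfile.gettempdir(), "pretty_example.html")  # hypothetical path
with open(out_path, "wb") as out_file:
    out_file.write(document)
```

Calling str() on bytes gives the printable representation rather than decoded text, which is where the one long line came from; decode() is the conversion to use when text is actually wanted.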
Addendum
Even though this answer solves the immediate problem at hand and teaches about bytestrings, the solution by Padraic Cunningham is the cleaner and faster way to write lxml etrees to a file.
This can all be done with lxml in a couple of lines of code, without ever needing open; the .write method is meant for exactly what you are trying to do:
# parse using the file name, which is also the recommended way.
tree = html.parse("C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html")
# call write on the tree
tree.write("C:/Users/mhurley/Portable_Python/notebooks/Pretty.html", pretty_print=True, encoding="utf-8")
Also, file_text = ''.join(file.readlines()) is exactly the same as file_text = file.read().
I have a file upload form and would like to use the filename on the server, however I notice that when I upload it the spaces are dropped. On the client/browser I can do something like this in an event called after the input type='file' element has changed:
function process_svg (e) {
    var files = e.target.files || e.originalEvent.dataTransfer.files;
    console.log(files[0].name);
}
If I upload a file named 'some file - type.ext', then 'some file - type.ext' is printed in the console. On the server (running bottle), however, if I run:
@route('/some_route')
def some_route():
    print(request.files['form_name_attr'].filename)
I get 'somefile-type.ext'. I am guessing this has to do with URI escaping (or lack thereof), but since you cannot change a file before upload, how do you get around this and preserve the name? Strangely, I cannot find any mention of this on Google, in part because I have had trouble thinking of appropriate search terms, but I'm also aware that this may not be native behaviour at all, but a bug elsewhere in my code.
I do not think that is the case, as I've issued these console.log and print statements at the end (right before the upload) and at the beginning (right when the server starts processing the request), and I do not believe I have any code that touches the name in between. However, if that is the case, please let me know, as I could be looking in the wrong direction.
You want raw_filename, not filename.
(Note that it may contain unsafe characters.)
@route('/some_route', method='POST')
def some_route():
    print(request.files['form_name_attr'].filename)  # "cleaned" file name
    print(request.files['form_name_attr'].raw_filename)  # unmodified file name
Found this in the source code for FileUpload.filename:
Only ASCII letters, digits, dashes, underscores and dots are allowed
in the final filename. Accents are removed, if possible. Whitespace is
replaced by a single dash. Leading or tailing dots or dashes are
removed. The filename is limited to 255 characters.
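The rules quoted above can be approximated in a few lines. This is a rough sketch of those rules, not bottle's actual implementation: clean_filename is a hypothetical helper, and its output may differ from what a given bottle version returns:

```python
import re
import unicodedata


def clean_filename(raw_name):
    """Approximate the quoted rules (hypothetical helper, not bottle's code)."""
    # Split accented characters into base letter plus combining mark, then drop non-ASCII.
    name = unicodedata.normalize("NFKD", raw_name)
    name = name.encode("ascii", "ignore").decode("ascii")
    # Keep only ASCII letters, digits, dashes, underscores, dots and whitespace.
    name = re.sub(r"[^a-zA-Z0-9\-_.\s]", "", name).strip()
    # Collapse runs of whitespace and dashes into one dash, then trim dots and dashes.
    name = re.sub(r"[-\s]+", "-", name).strip(".-")
    return name[:255]


print(clean_filename("some file - type.ext"))
```

If the original name matters, keep raw_filename for display and use a cleaned name like this only when building a path on disk.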
I have a huge number of saved HTML pages on my PC. I have parsed each HTML page and extracted the image src. I need to store the images from every HTML page in a specific structure in a separate directory. I tried Net::HTTP.get, but I am getting a "Filename too long" error. Is there any way to do this?
Below are the approaches I tried.
Method 1:
{
require 'open-uri'
def save_image(imgsrc)
File.open("images/img1","w") do |f|
asdf = open(imgsrc).read
f.write(asdf)
end
end
}
Method 2:
{
require 'net/http'
def save_image(imgsrc)
  File.open("images/img1","w") do |f|
    asdf = Net::HTTP.get_response(URI.parse(imgsrc))
    f.write(asdf)
  end
end
}
imgsrc => 
You already have the image: the value you posted in the imgsrc variable is the image data itself, Base64-encoded.
You only need to decode it using the base64 module and save the result to a file.
To decode your image I've used this service.
To decode with Base64, you should use the #strict_decode64 method:
$ cat testb64.rb
imgsrc='/9j/4AAQS... ...oooA//2Q==' #( snipped here your long variable,
# removed "data:image/jpeg;base64,"
# from the beginning )
require 'base64'
print Base64.strict_decode64(imgsrc)
$ ruby testb64.rb >img.jpg
$ xxd -p img.jpg
ffd8ffe000104a464946....
(valid JFIF header, viewable JPEG by Gwenview and Dolphin)
This should work:
require 'base64'
require 'open-uri'
def save_image(imgsrc)
  File.open("images/img1", "wb") do |fo|
    fo.write(Base64.decode64(open(imgsrc).read))
  end
end
It will save to the file path "images/img1", so you'll want a separate path for each file; otherwise each image will overwrite the previous one.
"wb" means to open the output file using binary mode, which avoids line-end translations appropriate for your OS. Without b, Ruby will look for "\r" and "\n" and either remove or add them as necessary for a text file, which will corrupt a binary file. b avoids that step. This is documented in the IO.new description.
You can't pass
imgsrc => 
as the URL for an image, as that isn't a URL. Both OpenURI and Net::HTTP expect a URL to the image, which they will then request and read the resulting response, returning the data back to your code. You'd need to do a Base64 decode against that data, which will result in a binary string in memory, which you can then write to a file opened in binary mode.
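The decode-then-write-binary step described above could look roughly like this in Python. It is a sketch under assumptions: the helper names are hypothetical, and the sample payload is just the first four bytes of a JPEG header, not the asker's image:

```python
import base64


def decode_data_uri(data_uri):
    """Return the raw bytes carried by a 'data:<type>;base64,<payload>' string."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a Base64 data URI")
    # validate=True rejects characters outside the Base64 alphabet instead of skipping them.
    return base64.b64decode(payload, validate=True)


def save_image(data_uri, path):
    """Decode the data URI and write the bytes in binary mode."""
    with open(path, "wb") as out_file:
        out_file.write(decode_data_uri(data_uri))


# Tiny hypothetical payload: the bytes FF D8 FF E0 that start a JFIF file.
print(decode_data_uri("data:image/jpeg;base64,/9j/4A==").hex())
```

No network request is involved at any point, which is why neither OpenURI nor Net::HTTP is the right tool for this input.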
I'd like to ask if it's possible for 4D to create a document on a network directory. For example:
vIP:="\\100.100.100.100" // this is a hypothetical IP
vPath:=vIP+"\storage\"
vDoc:=Create document(vPath+"notes.txt")
If(OK=1)
SEND PACKET(vDoc;"Hello World")
CLOSE DOCUMENT(vDoc)
End if
One way to do this is to map the second machine's drive on the machine where your 4D database is running.
The mapped drive then behaves like a local drive.
For example:
I have mapped the "D" drive of the remote machine, and it appears as the "W" drive on the machine where the 4D database is running.
Then you can use this code:
c_Text(vPath)
vPath:="W:\var\www....." //temp path.....
vDoc:=Create document(vPath+"notes.txt")
If(OK=1)
SEND PACKET(vDoc;"Hello World")
CLOSE DOCUMENT(vDoc)
End if
I know this is an old question, but there aren't too many of us 4D coders floating around here, so I'll answer this for posterity!
Yes, you can create a document on a network share like this, assuming you have the appropriate permissions to do so.
In this case, I think you just need to be careful of how you escape the path. Ensure that you're doubling up your backslashes so that the code block looks like this (note the extra backslashes around the IP address and folder name):
vIP:="\\\\100.100.100.100" // this is a hypothetical IP
vPath:=vIP+"\\storage\\"
vDoc:=Create document(vPath+"notes.txt")
If (OK=1)
SEND PACKET(vDoc;"Hello World")
CLOSE DOCUMENT(vDoc)
End if
Hope this helps!
Yes, although undocumented, the CREATE DOCUMENT command does work with a valid UNC path, provided that you have sufficient privileges to create a document at the path given.
However, there is an issue with your sample code, and it comes down to your use of the backslash \ character.
The backslash \ character introduces escape sequences in 4D, so a literal backslash must itself be escaped. Simply doubling all of the backslashes in your sample code, from \ to \\, should correct the issue.
Your sample code:
vIP:="\\100.100.100.100" // this is a hypothetical IP
vPath:=vIP+"\storage\"
vDoc:=Create document(vPath+"notes.txt")
If(OK=1)
SEND PACKET(vDoc;"Hello World")
CLOSE DOCUMENT(vDoc)
End if
Should be written like this:
vIP:="\\\\100.100.100.100" // this is a hypothetical IP
vPath:=vIP+"\\storage\\"
vDoc:=Create document(vPath+"notes.txt")
If(OK=1)
SEND PACKET(vDoc;"Hello World")
CLOSE DOCUMENT(vDoc)
End if
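The same doubling rule applies in other languages whose string literals treat the backslash as an escape character. As a comparison only, here is the equivalent path built in Python; this is not 4D code, and the IP address is the same hypothetical one used above:

```python
# Each "\\" in the source is one literal backslash in the resulting string.
server = "\\\\100.100.100.100"   # two backslashes, then the hypothetical IP
folder = server + "\\storage\\"
doc_path = folder + "notes.txt"

print(doc_path)
print(len(server))  # counts characters in the string, not in the source literal
```

Printing the value shows the single backslashes that the file system actually receives.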
Your code could be further improved by using Test path name to confirm that the path is valid and to check whether the file already exists. If it does exist, you could even use Open document and SET DOCUMENT POSITION to append to the document, like this:
vIP:="\\\\100.100.100.100"
vPath:=vIP+"\\storage\\"
vDocPath:=vPath+"notes.txt"
If (Test path name(vPath)=Is a folder)
  // is a valid path
  If (Not(Test path name(vDocPath)=Is a document))
    // document does not exist
    vDoc:=Create document(vDocPath)
    If (OK=1)
      SEND PACKET(vDoc;"Hello World")
      CLOSE DOCUMENT(vDoc)
    End if
  Else
    // file already exists at location!
    vDoc:=Open document(vDocPath)
    If (OK=1)
      SET DOCUMENT POSITION(vDoc;0;2) // position 0 bytes from EOF
      SEND PACKET(vDoc;"\rHello Again World") // new line prior to Hello
      CLOSE DOCUMENT(vDoc)
    End if
  End if
Else
  // path is not valid!
  ALERT(vPath+" is invalid")
End if