How to get images from a saved html page - html

I have a huge number of saved HTML pages on my PC. I have parsed the HTML pages and extracted the image src values. I need to store the images from every HTML page in a specific structure in a separate directory. I tried Net::HTTP.get, but I am getting a "Filename too long" error. Is there any way to do this?
Below are the approaches I tried.
Method 1:
require 'open-uri'
def save_image(imgsrc)
  File.open("images/img1","w") do |f|
    asdf = open(imgsrc).read
    f.write(asdf)
  end
end
Method 2:
require 'net/http'
def save_image(imgsrc)
  File.open("images/img1","w") do |f|
    asdf = Net::HTTP.get_response(URI.parse(imgsrc))
    f.write(asdf)
  end
end
imgsrc => 

You already have the images. The one you posted (in the imgsrc variable) is a Base64-encoded data URI.
You only need to decode it using the base64 module and save the result to a file.
To decode your image I've used this service.
To decode using Base64 you should use the #strict_decode64 method:
$ cat testb64.rb
imgsrc='/9j/4AAQS... ...oooA//2Q==' #( snipped here your long variable,
# removed "data:image/jpeg;base64,"
# from the beginning )
require 'base64'
print Base64.strict_decode64(imgsrc)
$ ruby testb64.rb >img.jpg
$ xxd -p img.jpg
ffd8ffe000104a464946....
(valid JFIF header, viewable JPEG by Gwenview and Dolphin)

This should work:
require 'base64'
require 'open-uri'
def save_image(imgsrc)
  File.open("images/img1", "wb") do |fo|
    # Kernel#open no longer handles URLs as of Ruby 3.0; use URI.open
    fo.write(Base64.decode64(URI.open(imgsrc).read))
  end
end
It will save to the file path "images/img1", so you'll want to create a separate path for each file; otherwise each image will overwrite the previous one.
"wb" means to open the output file using binary mode, which avoids line-end translations appropriate for your OS. Without b, Ruby will look for "\r" and "\n" and either remove or add them as necessary for a text file, which will corrupt a binary file. b avoids that step. This is documented in the IO.new description.
You can't pass
imgsrc => 
as the URL for an image, as that isn't a URL. Both OpenURI and Net::HTTP expect a URL to the image, which they request; they then read the response and return the data to your code. You'd need to do a Base64 decode against that data, which will result in a binary string in memory, which you can then write to a file opened in binary mode.
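If the src values in the saved pages are data URIs like the one described above, a minimal sketch of the decode-and-save step could look like the following. The helper name save_data_uri, the directory layout, and the file naming are assumptions made for illustration; they are not from the question.

```ruby
require 'base64'
require 'fileutils'

# Hypothetical helper: decodes a "data:image/<type>;base64,<payload>" string
# and writes the bytes to <dir>/<basename>.<ext>.
# Returns the path written, or nil if imgsrc is not a Base64 image data URI.
def save_data_uri(imgsrc, dir, basename)
  match = imgsrc.match(%r{\Adata:image/([a-z0-9.+-]+);base64,(.+)\z}mi)
  return nil unless match

  type = match[1].downcase
  ext = type == 'jpeg' ? 'jpg' : type
  bytes = Base64.decode64(match[2]) # lenient decoder: tolerates line breaks in the payload

  FileUtils.mkdir_p(dir)
  path = File.join(dir, "#{basename}.#{ext}")
  File.binwrite(path, bytes) # binary write, same effect as opening with "wb"
  path
end
```

Calling it with a different basename per image (for example, page name plus a counter) avoids the overwrite problem mentioned above. A src that is an ordinary http(s) URL returns nil here and would still need to be fetched separately.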

How to read data from a CSV file of two possible encodings?

I want to read data from CSV files with two possible encodings (UTF-8 and ISO-8859-15). I mean different files with different encodings, not the same file with two encodings.
Right now I can only read data correctly from a UTF-8 encoded file. Can I implement this just by adding an extra option? For example, encoding: 'ISO-8859-15'.
What I have:
def csv
  # renamed from `file` so the local variable does not shadow the `file` method
  tempfile = File.open(file.tempfile)
  CSV.open(tempfile, **csv_options)
end
private
def csv_options
  {
    col_sep: ";",
    headers: true,
    return_headers: false,
    skip_blanks: true
  }
end
Once you know what encoding your file has, you can pass it in the CSV options, i.e.
external_encoding: Encoding::ISO_8859_15,
internal_encoding: Encoding::UTF_8
(This establishes that the file is ISO-8859-15, but that you want the strings internally as UTF-8.)
So the strategy is to decide first (before opening the file) which encoding you want, and then use the appropriate option Hash.
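One possible way to make that decision up front is to read the raw bytes and check whether they form valid UTF-8, falling back to ISO-8859-15 otherwise. This is only a sketch under that assumption: the names read_as_utf8 and parse_csv are hypothetical, and the check is a heuristic, since every byte sequence is technically valid ISO-8859-15 and a short ISO-8859-15 file could in rare cases also pass as UTF-8.

```ruby
require 'csv'

CSV_OPTIONS = {
  col_sep: ';',
  headers: true,
  return_headers: false,
  skip_blanks: true
}.freeze

# Returns the file content as a UTF-8 string.
# Assumption: the file is either UTF-8 or ISO-8859-15, nothing else.
def read_as_utf8(path)
  raw = File.binread(path)
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Not valid UTF-8, so treat the bytes as ISO-8859-15 and transcode.
  raw.force_encoding(Encoding::ISO_8859_15).encode(Encoding::UTF_8)
end

# Parses the file with the same options as in the question.
def parse_csv(path)
  CSV.parse(read_as_utf8(path), **CSV_OPTIONS)
end
```

The result is a CSV::Table whose fields are UTF-8 strings regardless of which of the two encodings the file used.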

Client checks if all prefixes are stored correctly in a JSON file. Discord.py

Answer (updated)
New update: I chose to remove this approach because I read that it would not scale. If the bot got larger, storing every server would overload the .JSON file and cause problems. So I did this:
Main File:
def get_prefix(client, message):
    with open('prefixes.json', 'r') as file:
        prefixes = json.load(file)
    if str(message.guild.id) in prefixes:
        return prefixes[str(message.guild.id)]
    else:
        return "/"
client = commands.Bot(command_prefix = get_prefix)
Prefixes File: (INSIDE A COG)
@commands.Cog.listener()
async def on_guild_remove(self, guild):
    with open('prefixes.json', 'r') as f:
        prefixes = json.load(f)
    # guilds that never set a custom prefix are not in the file, so pass a default to avoid KeyError
    prefixes.pop(str(guild.id), None)
    with open('prefixes.json', 'w') as f:
        json.dump(prefixes, f, indent=4)
Before this update, the bot would write the default prefix to the .JSON file for every server. Now get_prefix checks whether the guild's ID is in the .JSON file, and if it is not, it uses / as the default prefix. This helps because the code no longer has to store every single server in the .JSON file, which would cause problems in the future if the bot got bigger and more well known.
In the Prefixes file (this can also go in your main file, but you will have to remove the self that I have used), I removed the code that added the default prefix to the .JSON file and kept only the part that removes the guild's custom prefix if it is in the .JSON file. That's pretty much it. If you have any more questions, just comment on this question!
Question
I'm struggling to find a solution to a large problem I have with my bot. I'm using per-server prefix code that I found in a tutorial on the Internet. It's pretty basic, and I believe there are more advanced approaches, but that's not the point at the moment. Before I describe my problem, I want to explain how my code works. Whenever the client joins a new server, it saves the server's ID in a JSON file and gives it a default prefix.
import discord
import json
import asyncio
import os
from discord.ext import commands
class Prefixes(commands.Cog):
    def __init__(self, client):
        self.client = client
    @commands.Cog.listener()
    async def on_guild_join(self, guild):
        with open('prefixes.json', 'r') as f:
            prefixes = json.load(f)
        prefixes[str(guild.id)] = '/'
        with open('prefixes.json', 'w') as f:
            json.dump(prefixes, f, indent=4)
    @commands.Cog.listener()
    async def on_guild_remove(self, guild):
        with open('prefixes.json', 'r') as f:
            prefixes = json.load(f)
        prefixes.pop(str(guild.id))
        with open('prefixes.json', 'w') as f:
            json.dump(prefixes, f, indent=4)
Then, if you wish to change the prefix to something else, you can use the command /prefix. The prefix changes and the client saves the new prefix in the JSON file.
    @commands.command()
    @commands.has_permissions(manage_roles=True, ban_members=True)
    async def prefix(self, ctx, prefix=None):
        if prefix is None:
            try:
                x = ""
                pfp = self.client.user.avatar_url
                prefix = discord.Embed(title=x, description=f"My prefix for **{ctx.guild.name}** is `{ctx.prefix}`. If you want to find out more information about me type `{ctx.prefix}help`.", color = 0x456383)
                prefix.set_footer(text=f"ChizLanks", icon_url=pfp)
                await ctx.channel.send(embed = prefix)
                return
            except discord.Forbidden:
                return await ctx.channel.send(f"My prefix for **{ctx.guild.name}** is `{ctx.prefix}`. If you want to find out more information about me type `{ctx.prefix}help`")
        else:
            with open('prefixes.json', 'r') as f:
                prefixes = json.load(f)
            prefixes[str(ctx.guild.id)] = prefix
            with open('prefixes.json', 'w') as f:
                json.dump(prefixes, f, indent=4)
            await ctx.channel.send(f'Server Prefix has changed to `{prefix}`')
That's pretty much the code. I also have some error handlers for the command, but they are not needed at the moment. My problem is that while the client is offline, the code is not running. If the bot gets invited to a new server during that time, the client cannot save the prefix to the JSON file. When the client comes back online, that server has no prefix, which means its members cannot use any commands.
How can that be fixed? I already have an idea of how it might work. I will probably need to use the on_ready event: it would check whether the servers the client has joined are in the JSON file and use their prefixes if so; otherwise it would create a new prefix (the default one) and save it to the JSON file. That's my idea, but I need some help because I do not know if this is even possible. How might this work?
You could add a default prefix that is recognized by the bot if there is no prefix found for the server. That way, when the very first command is sent in a server, the bot can recognize it even though it wasn't added to the prefix file. Then, when it runs that default prefix, the server could then be registered in the JSON file and everything will be normal from there.

I need a good regex for HTML file parsing in ruby

Here is a Ruby question, guys. I need to parse an HTML file and catch URLs and emails, but I can't come up with a proper regular expression. I have tried 100+ regexes, and every time I catch something else along with the URL.
File.open("/Desktop/file.html").each_line do |line|
  if line.split("href=\"") =~ /???/
    puts line
  end
end
# I can use line.split("href=\"") so each new line will start with url =>
(https://www.facebook.com/students">
The question is: what regex can I use to catch everything from https to the end of the URL, which ends with a double quote (")? The same URL can appear more than once, so I assume something like {1,2} is needed.
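Try this:
links = File.read('filename_path').scan(/href="(?<url>.*?)"/)
This gives you the links in an array. Because the pattern has a capture group, each match is itself a one-element array; call flatten if you want a flat list.
It also works if you remove ?<url> from the pattern; that is just a named capture group.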
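Since the question also asks about emails, here is a small sketch that builds on the same scan approach. It assumes the emails appear as mailto: links in href attributes, which the question does not actually state; the name extract_links and the returned hash layout are made up for illustration.

```ruby
# Matches the value of any double-quoted href attribute.
HREF_RE = /href="([^"]*)"/i

# Hypothetical helper: returns the unique http(s) URLs and mailto addresses
# found in href attributes of the given HTML string.
def extract_links(html)
  hrefs = html.scan(HREF_RE).flatten.uniq # uniq drops repeated URLs
  mailtos, others = hrefs.partition { |h| h.start_with?('mailto:') }
  {
    urls: others.grep(%r{\Ahttps?://}),
    # strip the scheme and any "?subject=..." query part
    emails: mailtos.map { |m| m.delete_prefix('mailto:').sub(/\?.*\z/, '') }
  }
end
```

Usage would be something like extract_links(File.read('file.html')). For anything beyond simple, well-formed pages, an HTML parser is usually more robust than a regex, because attributes can be single-quoted or unquoted.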

What is the proper method for reading and writing HTML/XML (byte string) with Python and lxml and etree?

EDIT: Now that the problem is solved, I realize that it had more to do with properly reading/writing byte-strings, rather than HTML. Hopefully, that will make it easier for someone else to find this answer.
I have an HTML file that's poorly formatted. I want to use a Python lib to just make it tidy.
It seems like it should be as simple as the following:
import sys
from lxml import etree, html
#read the unformatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html', 'r', encoding='utf-8') as file:
    #read the whole file into one string
    file_text = ''.join(file.readlines())
#format the HTML
document_root = html.fromstring(file_text)
document = etree.tostring(document_root, pretty_print=True)
#write the nice, pretty, formatted HTML
with open('C:/Users/mhurley/Portable_Python/notebooks/Pretty.html', 'w') as file:
    #write the pretty XML to a file
    file.write(document)
But this chunk of code complains that file_lines is not a string or bytes-like object. Okay, it makes sense that the function can't take a list, I suppose.
But then the result is 'bytes', not a string. No problem, I thought: str(document).
But then I get HTML that's full of '\n' sequences that are not newlines; each is a literal backslash followed by the letter n. There are no actual line breaks in the result; it's just one long line.
I've tried a number of other weird things like specifying the encoding, trying to decode, etc. None of which produce the desired result.
What's the right way to read and write this kind of (is non-ASCII the right term?) text?
You are missing that you get bytes from tostring method from etree and need to take that into account when writing (a bytestring) to a file. Use the b switch in the open function like this and forget about the str() conversion:
with open('Pretty.html', 'wb') as file:
    #write the pretty XML to a file
    file.write(document)
Addendum
Even though this answer solves the immediate problem at hand and teaches about bytestrings, the solution by Padraic Cunningham is the cleaner and faster way to write lxml etrees to a file.
This can all be done with lxml in a couple of lines of code, without ever needing to use open. The .write method is meant for exactly what you are trying to do:
# parse using the file name, which is also the recommended way.
tree = html.parse("C:/Users/mhurley/Portable_Python/notebooks/View_Custom_Report.html")
# call write on the tree
tree.write("C:/Users/mhurley/Portable_Python/notebooks/Pretty.html", pretty_print=True, encoding="utf-8")
Also file_text = ''.join(file.readlines()) is exactly the same as file_text = file.read()

Invalid byte sequence in UTF-8, CSV import, Rails 4

I have a rake task that populates my database from a CSV file:
require 'csv'
namespace :import_data_csv do
  desc "Import teams from csv file"
  task import_data: :environment do
    CSV.foreach(file, :headers => true) do |row|
      #various import tasks
This had been working properly, but with a new CSV file, I'm getting the following error on the 6th row of the CSV file:
Invalid byte sequence in UTF-8
I have looked through the row and can't seem to find any irregular characters.
I've also attempted a couple of other fixes recommended on Stack Overflow:
- Changing the CSV.foreach to:
reader = CSV.open(file, "r")
reader.each do |row|
- And changing:
CSV.foreach(file, :headers => true) do |row|
to:
CSV.foreach(file, encoding: "r:ISO-8859-1", :headers => true) do |row|
Neither of these seems to correct the issue.
Suggestions?
I solved this by saving the file as an MS-DOS CSV, instead of the standard CSV file or the Windows CSV format.
The answer for me was to take the CSV file and save it as a text file. Then replace the tabs with commas. Then save the file as UTF-8 encoded. Finally, change the .txt extension to .csv and make sure it opens correctly in Excel, BUT DON'T save it in Excel. Just close it when you see that it looks correct. Then upload it.
A long, non-programmatic solution, but for my purposes it's sufficient.
Source is here: https://help.salesforce.com/apex/HTViewSolution?id=000003837&language=en_US
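For a programmatic alternative, one option is to tell CSV the file's on-disk encoding and have it transcode to UTF-8 while reading, using the "external:internal" form of the encoding option. The sketch below assumes the file really is ISO-8859-1, which the question does not confirm; if the actual encoding is something else (for example Windows-1252), the first half of the string would need to change. The method name read_latin1_csv is hypothetical.

```ruby
require 'csv'

# Reads a CSV file that is assumed to be ISO-8859-1 on disk and returns
# its rows as hashes with UTF-8 strings.
def read_latin1_csv(path)
  rows = []
  # "ISO-8859-1:UTF-8" means: external encoding ISO-8859-1, transcode to UTF-8
  CSV.foreach(path, headers: true, encoding: 'ISO-8859-1:UTF-8') do |row|
    rows << row.to_h
  end
  rows
end
```

Note that the mode prefix "r:" used in the attempt above is not part of the value for the encoding option; it belongs to file open mode strings such as "r:ISO-8859-1:UTF-8".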