Issues downloading HTML from a website with pywebcopy - html

I'm making a web scraper that I'm using for drop/add at my school, but I'm having trouble getting it to run on my Debian VPS. It works perfectly on my Mac, but for some reason pywebcopy just doesn't want to work on the VPS. I'm also willing to use something else, because I only need the HTML, whereas from what I have seen pywebcopy is built for getting everything from a website. So recommendations on other ways to go about it, or a fix to my issue, would be a blessing. Also, this code gets run every second and is intended to keep running until I get all of my classes added. On my Mac it runs with Python 3.6, but on the VPS it is Python 3.7, so that might be another issue.
import os
from pywebcopy import config, WebPage

def getwebpage(crn):
    # example CRN: 18139
    kwargs = {'project_name': 'site folder'}
    url = 'https://.edu/bprod/bwckschd.p_disp_listcrse?term_in=202101&subj_in=CSC&crse_in=4780&crn_in=' + crn
    config.setup_config(
        # url of the website
        url,
        # folder where the copy will be saved
        project_folder='rawhtml/' + crn,
        **kwargs
    )
    wp = WebPage()
    wp.get(url)
    wp.save_html()
    dirs = os.listdir("rawhtml/" + crn + "/site folder/.edu/bprod")
    file_path = "rawhtml/" + crn + "/site folder/.edu/bprod/" + dirs[0]
    return file_path
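
Since only the HTML of that page is needed, one lighter-weight option is to skip pywebcopy entirely and fetch the page with the requests library, writing it to disk yourself. Below is a minimal sketch under that assumption; the listing.html file name and the 10-second timeout are just illustrative choices, and the per-CRN rawhtml folder layout is kept from the original function.

import os
import requests

def getwebpage(crn):
    url = ('https://.edu/bprod/bwckschd.p_disp_listcrse'
           '?term_in=202101&subj_in=CSC&crse_in=4780&crn_in=' + crn)
    # fetch only the HTML instead of mirroring the whole site
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # keep the same per-CRN folder layout as the pywebcopy version
    folder = os.path.join('rawhtml', crn)
    os.makedirs(folder, exist_ok=True)
    file_path = os.path.join(folder, 'listing.html')  # illustrative file name
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(response.text)
    return file_path

Because this runs every second, it may also be worth reusing a single requests.Session and backing off if the server starts rejecting requests.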

Related

Opening a downloaded mht file from Selenium (Help needed)

Long story short, I'm not a coder.
My team used to have a coder who created this Python/Selenium code to extract some information from the Chrome browser (echocardiography reports) and/or a downloaded mht file (also echocardiography reports).
This code was working fine until recently, when it stopped working.
The program still successfully downloads the mht file via Chrome.
However, it fails to open the file, and hence the code continues without extracting any information, resulting in empty extractions.
This is the part I need help figuring out:
driver.get('chrome://downloads')
# driver.get('file:///C:/Users/name/Downloads/')
root1 = driver.find_element_by_tag_name('downloads-manager')
shadow_root1 = expand_shadow_element(root1)
time.sleep(2)
root2 = shadow_root1.find_element_by_css_selector('downloads-item')
shadow_root2 = expand_shadow_element(root2)
time.sleep(1.5)
openEchoFileButton = shadow_root2.find_element_by_id('file-link')
mhtFileName = openEchoFileButton.text
driver.get('file:///C:/Users/name/Downloads/' + mhtFileName) # go to web page
try:
    echoDateElement = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.XPATH, '/html/body/div[3]/p[1]/span[3]')))
except TimeoutException:
    print("Loading page took too much time!")
I'm trying to figure out why it suddenly fails to open the downloaded mht files.
The last time our team tried using this code was back in 2020, and it was successful.
Were there any updates to Chrome perhaps?
Help would be immensely appreciated.
Thank you so much in advance.
There are three obvious weaknesses in this code.

The first two are the uses of time.sleep() to wait for an element to appear and become manipulable. What if the machine is busy doing something else and 1.5 seconds isn't enough? The right way to handle that is to repeatedly check for the element to be ready, and you already have a great example of how to do that with WebDriverWait() in this code.

The third weakness is the locator used in that presence_of_element_located() call. XPath locators rooted at "/html" are notoriously fragile and are broken by small changes to the web page. Try to find something in the page that you can check via a more stable locator - ideally, an element with an ID attribute.
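
As a rough sketch of the first point, the fixed sleeps could be replaced with explicit waits along these lines, assuming the same driver, delay, and expand_shadow_element() helper from the question (the lambda form is used because the shadow-root lookups can't be expressed with the stock expected_conditions helpers):

from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, delay)

driver.get('chrome://downloads')
# wait for each element to actually be present instead of sleeping a fixed time
root1 = wait.until(lambda d: d.find_element_by_tag_name('downloads-manager'))
shadow_root1 = expand_shadow_element(root1)
root2 = wait.until(lambda d: shadow_root1.find_element_by_css_selector('downloads-item'))
shadow_root2 = expand_shadow_element(root2)
openEchoFileButton = wait.until(lambda d: shadow_root2.find_element_by_id('file-link'))
mhtFileName = openEchoFileButton.text

WebDriverWait retries the callable until it returns an element or the delay expires, so a busy machine simply takes a little longer instead of failing outright.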

Play 2 - External configuration file fails to load in production mode

I am having trouble including an external file in the application.conf of my Play 2.1.1 application in production mode (when launched with start).
Following the official documentation, I added an include statement in my application.conf:
[...]
include "/absolute/path/to/external/config/file.conf"
The content is loaded this way:
configuration.getConfig("some-key")
It works fine in dev mode, but fails in production mode (it's always None).
This is preventing me from deploying my application to production.
Any help/ideas would be greatly appreciated.
EDIT:
Following Saffron's comment I tried a few workarounds.
Removing the first slash from the include statement did not work.
Loading the configuration file via -Dconfig.file=/abs/path gave weird results and it seems that Play is not behaving in a consistent way:
play start -Dconfig.file=/path/to/file.conf does not work. However, launching play first and THEN running start -Dconfig.file=/path/to/file.conf does work?!
So I ended up creating a new configuration instead of using Play's:
val conf = ConfigFactory.parseFile(new File("/path/to/file.conf")).resolve()
val myValues = new Configuration(conf).getConfig("some-key").get
Hope it can help someone who ran into the same issue.
I had a problem in which my images weren't displayed correctly in prod mode, but were ok in dev mode.
The solution was to write the path as ("images/...") instead of ("/images/..."). Try it, just for science's sake.
Anyway, if this doesn't work, here is some documentation about additional configuration in prod mode, with console lines to override the file:
$ start -Dconfig.file=/opt/conf/prod.conf
http://www.playframework.com/documentation/2.0/ProductionConfiguration

Joomla content invisible after upgrade

I have got an old Joomla version running, some 1.0.x. (I did not yet intend to upgrade this site, but will do so as soon as I find some time.)
However, I had to upgrade the outdated Linux (SUSE 10.1) on that server and installed Ubuntu 12.04.
Then I copied all the files that I had backed up before the OS upgrade back to the server, and I re-created the database and the user that Joomla was using to access the DB. I imported the tables and data using phpMyAdmin, which I had used before to export the old database.
I have done that before with other (more modern versions of) Joomla installations. As far as I can see, the database was recovered fine and all the files are installed in the proper place. The backoffice/admin site works fine. All links (an extension/component) and all content items are still there and look just fine. (Given it is a rather old version :)
But on the frontend site the content items are missing. The front page looks fine and the menu looks fine, but the content is empty.
Menu Items to components (old zoom gallery, weblinks component) work just right. Samples:
http://www.klecker.de/photo/index.php?option=com_weblinks&Itemid=52
http://www.klecker.de/photo/index.php?option=com_zoom&Itemid=26&catid=13
But "internal" links to content items - static and normal - don't work at all. Sample:
http://www.klecker.de/photo/index.php?option=com_content&task=view&id=121&Itemid=199
What could be wrong? What did I miss or overlook? Something related to the file system structure, which is slightly different between these two Linux distributions and Plesk versions? Or might different versions of PHP 5 or MySQL cause some side effect?
Could you turn on your error debugging or let us know what the error is?
If you are on PHP 5.3, try the following. It worked for me on an archived (locked-down) 1.0.15 site:
Open /includes/Cache/Lite/Function.php
Go to line 74, i.e. $arguments = func_get_args();
Replace it with this:
$arguments = func_get_args();
$numargs = func_num_args();
for ($i = 1; $i < $numargs; $i++)
{
    $arguments[$i] = &$arguments[$i];
}
Save
Test
5.3 support was not officially added to Joomla until version 1.5.15.

Interpret/Render output from puts() as HTML

When I run my Ruby script, I want the output to be rendered as HTML, preferably in a browser (e.g. Chrome). However, I would very much prefer not to have to start a web service, because I'm not making a website. I've tried Sinatra, and the problem with it is that I have to restart the server every time I make changes to my code, plus it is built around requests (like GET/POST arguments) which I don't really need.
I simply prefer the output from my Ruby program to appear as HTML as opposed to console text, since HTML allows for more creative/expressive output. Is there a good/simple/effective way to do this? (I'm using Notepad++ to edit my code, so if it's possible to combine the above with it somehow, that would be awesome.)
Thanks a lot :)
Using the gem shotgun you can run a Sinatra app that automatically reloads changes without restarting the server.
Alternatively, using a library like awesome_print which has HTML formatting, you could write a function which takes the output and saves it to a file. Then open the file in Chrome.
If you don't want to have to manually refresh the page in Chrome, you could take a look at guard-livereload (https://github.com/guard/guard-livereload) which will monitor a given file using the guard gem and reload Chrome. Ryan Bates has a screenshot on guard here, http://railscasts.com/episodes/264-guard.
Here's a function that overrides Kernel#puts to print the string to STDOUT and write the HTML formatted version of it to output.html.
require 'awesome_print'

module Kernel
  alias :old_puts :puts

  def puts(string)
    old_puts string
    File.open("output.html", "w") do |file|
      file.puts string.ai(:html => true)
    end
  end
end

puts "test"

Is there a way to convert Trac Wiki pages to HTML?

I see the suggestion of using Mylyn WikiText to convert wiki pages to HTML from this question, except I'm not sure if it's what I'm looking for from reading the front page of the site alone. I'll look into it further. I would prefer it to be a Trac plug-in so I could initiate the conversion from within the wiki options, but all the plugins at Trac-Hacks export single pages only, whereas I want to dump all formatted pages in one go.
So is there an existing Trac plug-in or stand-alone application that'll meet my requirements? If not where would you point me to start looking at implementing that functionality myself?
You may find some useful information in the comments for this ticket on trac-hacks. One user reports using the wget utility to create a mirror copy of the wiki as if it was a normal website. Another user reports using the XmlRpc plugin to extract HTML versions of any given wiki page, but this method would probably require you to create a script to interface with the plugin. The poster didn't provide any example code, unfortunately, but the XmlRpc Plugin page includes a decent amount of documentation and samples to get you started.
If you have access to a command line on the server hosting Trac, you can use the trac-admin command like:
trac-admin /path/to/trac wiki export <wiki page name>
to retrieve a plain-text version of the specified wiki page. You would then have to parse the wiki syntax to HTML, but there are tools available to do that.
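
If you want every page rather than one at a time, trac-admin can also dump the whole wiki in one call (assuming your Trac version supports the wiki dump subcommand). Here is a rough Python sketch; the Trac environment path and the wiki-dump output directory are placeholders:

import pathlib
import subprocess

TRAC_ENV = '/path/to/trac'            # placeholder: your Trac environment path
OUT_DIR = pathlib.Path('wiki-dump')   # placeholder: where to put the exported pages
OUT_DIR.mkdir(exist_ok=True)

# 'wiki dump' writes every page as a separate wiki-markup text file into the directory;
# each file then still needs to be run through a wiki-to-HTML converter
subprocess.run(['trac-admin', TRAC_ENV, 'wiki', 'dump', str(OUT_DIR)], check=True)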
In our case, we wanted to export each of the wiki pages individually, without the header/footer and other instance-specific content, and the XML-RPC interface was a good fit for that. Here's the Python 3.6+ script I created for exporting the whole of the wiki into HTML files in the current directory. Note that this technique doesn't rewrite any hyperlinks, so they will resolve absolutely to the site.
import os
import xmlrpc.client
import getpass
import urllib.parse

def add_auth(url):
    host = urllib.parse.urlparse(url).netloc
    realm = os.environ.get('TRAC_REALM', host)
    username = getpass.getuser()
    try:
        import keyring
        password = keyring.get_password(realm, username)
    except Exception:
        password = getpass.getpass(f"password for {username}@{realm}: ")
    if password:
        url = url.replace('://', f'://{username}:{password}@')
    return url

def main():
    trac_url = add_auth(os.environ['TRAC_URL'])
    rpc_url = urllib.parse.urljoin(trac_url, 'login/xmlrpc')
    trac = xmlrpc.client.ServerProxy(rpc_url)
    for page in trac.wiki.getAllPages():
        filename = f'{page}.html'.lstrip('/')
        dir = os.path.dirname(filename)
        dir and os.makedirs(dir, exist_ok=True)
        with open(filename, 'w') as f:
            doc = trac.wiki.getPageHTML(page)
            f.write(doc)

__name__ == '__main__' and main()
This script requires only Python 3.6, so download it and save it as an export-wiki.py file, then set the TRAC_URL environment variable and invoke the script. For example, on Unix:
$ TRAC_URL=http://mytrac.mydomain.com python3.6 export-wiki.py
It will prompt for a password. If no password is required, just hit enter to bypass. If a different username is needed, also set the USER environment variable. Keyring support is also available but can be disregarded.