Python + Selenium + PhantomJS render to PDF - multiple files - html

I am trying to adapt the code found in
Python + Selenium + PhantomJS render to PDF
so that, instead of saving one web page as a PDF file, I can iterate over a list of URLs and save each one with a specific name (from another list).
count = 0
while count < length:
    def execute(script, args):
        driver.execute('executePhantomScript', {'script': script, 'args' : args })
    driver = webdriver.PhantomJS('phantomjs')
    # hack while the python interface lags
    driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
    driver.get(urls[count])
    # set page format
    # inside the execution script, webpage is "this"
    pageFormat = '''this.paperSize = {format: "A4", orientation: "portrait" };'''
    execute(pageFormat, [])
    # render current page
    render = '''this.render("test.pdf")'''
    execute(render, [])
    count+=1
I tested modifying
render = '''this.render("test.pdf")'''
to
render = '''this.render(names[count]+".pdf")'''
so as to include each name in the list using count, but have not been successful.
Also tried:
dest = file_user[count]+".pdf"
render = '''this.render(dest)'''
execute(render, [])
But that did not work either.
I would greatly appreciate a suggestion for the appropriate syntax.
It must be very simple, but I am a newbie.

Use string formatting:
render = 'this.render("{file_name}.pdf")'.format(file_name=names[count])
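The earlier attempts fail because the script string is sent to PhantomJS and evaluated there as JavaScript, so Python names such as names, count or dest do not exist inside it. The file name has to be baked into the string on the Python side. A minimal sketch of that idea follows; the helper name build_render_script and the sample names list are hypothetical, and using json.dumps for the quoting is my own assumption rather than part of the answer above.

```python
import json


def build_render_script(file_name):
    """Return the JavaScript snippet that renders the current page to <file_name>.pdf."""
    # json.dumps produces a double-quoted, escaped string that is also a valid
    # JavaScript string literal, so quotes in the name cannot break the script.
    return "this.render({});".format(json.dumps(file_name + ".pdf"))


# Hypothetical sample data; in the question these would come from the names list.
names = ["report_a", "report_b"]
scripts = [build_render_script(name) for name in names]
for script in scripts:
    print(script)
```

Each resulting string can then be passed to execute(script, []) inside the loop, in place of the fixed "test.pdf" script.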

Related

Firebase Cloud Functions-ImageMagick CLI PDF to images

I am trying to work with Firebase Cloud Functions, and ImageMagick, similar to how the thumbnail demo is done. By re-purposing the demo script, I want to execute a CLI command for ImageMagick to split PDF pages to images.
convert -density 150 presentation.pdf -quality 90 output-%3d.jpg
The snippet
exports.splitPdfPages = functions.storage.object().onFinalize(async (object) => {
  const fileBucket = object.bucket; // The Storage bucket that contains the file.
  const filePath = object.name; // File path in the bucket.
  const contentType = object.contentType; // File content type.
  const metageneration = object.metageneration; // Number of times metadata has been generated. New objects have a value of 1.
  const fileName = path.basename(filePath); // Base name of the uploaded file.
  // Download file from bucket.
  const bucket = admin.storage().bucket(fileBucket);
  const tempFilePath = path.join(os.tmpdir(), fileName);
  // Output pattern for the page images; %3d is replaced by the page index.
  const tempSplitImagesPath = tempFilePath.replace(".pdf", "_%3d.png");
  await bucket.file(filePath).download({destination: tempFilePath});
  console.log('PDF downloaded locally to', tempFilePath);
  // Generate split page images using ImageMagick.
  await spawn('convert', ['-density', '150', tempFilePath, '-quality', '90', tempSplitImagesPath]);
  console.log('pages split images created at', tempSplitImagesPath);
  ...
  // Uploading the split images.
  ...
  // Once the split images have been uploaded delete the local file to free up disk space.
  return fs.unlinkSync(tempFilePath);
});
Unfortunately, the Cloud Functions log shows the following error:
ChildProcessError: convert -density 150 /tmp/7eCxdDKqCb0rlYVw3AYf__foobar.pdf -quality 100 /tmp/7eCxdDKqCb0rlYVw3AYf__foobar_%3d.png failed with code 1
I searched for a resolution to the error, but the results only point to whitespace in the arguments as the main cause (and my command doesn't contain any). Invoking the generateThumbnail function works properly, so I presume the problem is caused by my changes.
Am I missing something needed to properly call the ImageMagick command for converting PDF pages to images?
Looking forward to hearing from you.
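An exit code of 1 on its own does not say what went wrong; the message ImageMagick writes to stderr usually does. As a rough sketch (in Python rather than the Node.js used above, and with hypothetical helper names build_convert_args and run_convert), the same command can be built as an argument list, which rules out whitespace problems, and run with stderr captured:

```python
import subprocess


def build_convert_args(pdf_path, output_pattern, density=150, quality=90):
    """Build the ImageMagick argument list; each list item is passed as one argument."""
    return [
        "convert",
        "-density", str(density),
        pdf_path,
        "-quality", str(quality),
        output_pattern,
    ]


def run_convert(args):
    """Run the command and return (exit_code, stderr_text) so the real error is visible."""
    # Assumes the convert binary is on PATH; raises FileNotFoundError otherwise.
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stderr


args = build_convert_args("/tmp/foobar.pdf", "/tmp/foobar_%3d.png")
print(args)
```

Logging the captured stderr text in the same way from the Cloud Function should show the underlying reason for the failure.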

Can't parse <content:encoded> from RSS

This is what RSS looks like: https://reddit.0qz.fun/r/dankmemes/top.json
My script perfectly parses "title", "description" and other items tags from the RSS. But it doesn't parse "content:encoded".
I tried this:
item.getChild("content:encoded").getText();
And this:
item.getChild("encoded").getText();
And this (found on Stack Overflow):
item.getChild("http://purl.org/rss/1.0/modules/content/","encoded").getText();
But nothing works... Could you help me?
The namespace is important for the getChild and similar methods to parse the content successfully.
Your third example is close, but you have the parameter order backwards, and you need to use the XmlService.getNamespace method, not a raw string. (The signature is getChild(string, namespace), not getChild(string, string).)
This one is tricky as the namespace should be included for some of the elements, and not for others. I am not an XML expert, so I don't know if this is expected behavior or not. The minimal example script below does find and log the text of the <content:encoded> elements using getChild, but I was only able to figure out when to include or exclude the namespace through trial and error. (If anyone has further info on why this is, please let me know in the comments.)
function logContentEncoded() {
  const result = UrlFetchApp.fetch("https://reddit.0qz.fun/r/dankmemes/top.json");
  const document = XmlService.parse(result.getContentText());
  const root = document.getRootElement();
  const namespace = XmlService.getNamespace("http://purl.org/rss/1.0/modules/content/");
  const channel = root.getChild("channel"); // fails if namespace is included
  const item = channel.getChild("item"); // fails if namespace is included
  const encoded = item.getChild("encoded", namespace); // fails if namespace is EXCLUDED
  console.log(encoded.getText());
}
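As far as I understand it, the difference comes from the feed itself: in RSS 2.0 the channel and item elements carry no prefix and belong to no namespace, while content:encoded is bound to the content module namespace through the xmlns:content declaration. The same behaviour can be reproduced outside Apps Script. Here is a small sketch using Python's xml.etree.ElementTree; the sample feed text and the function name first_encoded_text are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal feed; a real feed would be fetched over HTTP.
SAMPLE_RSS = """<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First post</title>
      <content:encoded><![CDATA[<p>Hello</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

# Prefix-to-URI mapping used only for lookups in this script.
NAMESPACES = {"content": "http://purl.org/rss/1.0/modules/content/"}


def first_encoded_text(rss_text):
    """Return the text of the first <content:encoded> element of the first item."""
    root = ET.fromstring(rss_text)
    # channel and item are in no namespace, so plain names work here.
    item = root.find("channel/item")
    # encoded lives in the content namespace, so the lookup must name it.
    encoded = item.find("content:encoded", NAMESPACES)
    return encoded.text


print(first_encoded_text(SAMPLE_RSS))
```

Looking up "encoded" without the namespace finds nothing in this sketch, which mirrors the trial-and-error result above.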
Alternatively, add this library to the project: 1Mc8BthYthXx6CoIz90-JiSzSafVnT6U3t0z_W3hLTAX5ek4w0G_EIrNw
Then you can scrape the page as plain text. With this code, for example, you can get the content of the first <content:encoded> tag.
function getDataFromJson() {
  var url = "https://reddit.0qz.fun/r/dankmemes/top.json";
  var fromText = '<content:encoded>';
  var toText = '</content:encoded>';
  var content = UrlFetchApp.fetch(url).getContentText();
  var scraped = Parser
    .data(content)
    .from(fromText)
    .to(toText)
    .build();
  Logger.log(scraped);
  return scraped;
}
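The from/to scraping idea boils down to finding the text between two markers. A rough Python equivalent, with a hypothetical helper name extract_between and made-up sample text, might look like this:

```python
def extract_between(text, start, end):
    """Return the text between the first start marker and the next end marker, or None."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]


# Hypothetical sample; a real call would pass the fetched feed text.
sample = "<item><content:encoded>first</content:encoded><content:encoded>second</content:encoded></item>"
print(extract_between(sample, "<content:encoded>", "</content:encoded>"))
```

Note that this treats the feed as plain text, so CDATA wrappers and escaped entities are returned as they appear in the source.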

How do I parse a html page using nodejs to find a qr code?

I want to parse a web page, searching for QRcodes in the page. When I find them, I am going to read them using the QRcode npm module.
The hard part is that I don't know how to parse the HTML page so that I detect only the image tags that contain a QR code.
I tried finding some kind of pattern in the images that contain a QR code; the pattern usually starts with "?qr", but I think the ending is different every time.
I'm using the request-promise module to get the raw HTML, and then I parse through it:
const rp = require('request-promise');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
rp(url)
  .then(function(html){
    //success!
    console.log(html);
  })
  .catch(function(err){
    //handle error
  });
I want to be able to download the image of the QRcode.
You need to pass the html returned into something like https://www.npmjs.com/package/node-html-parser
const rp = require('request-promise');
const parser = require('node-html-parser');
const url = 'https://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States';
rp(url)
  .then(function(html){
    const data = parser.parse(html);
    console.log(JSON.stringify(data));
  })
  .catch(function(err){
    //handle error
  });
Then you can query the parsed data object, for example with data.querySelectorAll('img'), to find the image tags that hold a QR code.
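To make the filtering step concrete, here is a small sketch of collecting the src of every img tag and keeping those that contain a marker. It is written in Python with the standard html.parser module rather than node-html-parser. The "?qr" marker comes from the pattern described in the question and may not hold for every page; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_qr_candidates(html, marker="?qr"):
    """Return image URLs whose src contains the marker (assumed QR pattern)."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return [src for src in collector.sources if marker in src]


# Hypothetical page fragment.
page = '<p><img src="/logo.png"><img src="/img?qr=abc123"/></p>'
print(find_qr_candidates(page))
```

The resulting URLs can then be downloaded and handed to a QR decoder; whether an image really contains a QR code can only be confirmed by decoding it.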

Sending a plotly graph over flask

Right now I have code that uses plotly to create a figure:
def show_city_frequency(number_of_city = 10):
    plot_1 = go.Histogram(
        x=dataset[dataset.city.isin(city_count[:number_of_city].index.values)]['city'],
        showlegend=False)
    ## Creating the grid for all the above plots
    fig = tls.make_subplots(rows=1, cols=1)
    fig.append_trace(plot_1,1,1)
    fig['layout'].update(showlegend=True, title="Frequency of cities in the dataset ")
    return plot(fig)
I want to incorporate this into a Flask view function and send it to an HTML template as a BytesIO object using send_file. I was able to do this for a matplotlib figure just using:
img = io.BytesIO()
plt.plot(x,y, label='Fees Paid')
plt.savefig(img, format='png')
img.seek(0)
return send_file(img, mimetype='image/png')
I've read that I can do basically the same thing except using:
img = plotly.io.to_image(fig, format='png')
img.seek(0)
return send_file(img, mimetype='image/png')
but I can't seem to find where to download plotly.io. I've read that plotly offline doesn't work on Ubuntu, so I am wondering whether that is my issue as well. I am also open to other suggestions for how to send this image dynamically to my HTML page.
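One detail worth noting: as far as I know, plotly.io.to_image returns the image as raw bytes, not as a file-like object, so calling seek on the result would fail; the bytes need to be wrapped in io.BytesIO first. Also, plotly.io is a submodule of the plotly package itself in newer releases, not a separate download, and static image export needs an additional export engine installed (orca or kaleido, depending on the plotly version). A minimal sketch of the wrapping step, with the hypothetical helper name png_bytes_to_stream and the plotly and Flask calls shown only as comments so the block runs without them:

```python
import io


def png_bytes_to_stream(png_bytes):
    """Wrap raw image bytes in a seekable in-memory file positioned at the start."""
    stream = io.BytesIO(png_bytes)
    stream.seek(0)
    return stream


# Inside a Flask view this could be used roughly as follows (not executed here,
# and assuming plotly with an image export engine is installed):
#   png_bytes = plotly.io.to_image(fig, format="png")
#   return send_file(png_bytes_to_stream(png_bytes), mimetype="image/png")

# Stand-in bytes for demonstration: the 8-byte PNG file signature.
demo = png_bytes_to_stream(b"\x89PNG\r\n\x1a\n")
print(demo.read(4))
```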

HTML from Database to PDF

I need to generate a PDF from HTML dynamically using ASP.NET. The HTML is stored in a database; it has tables and CSS and runs up to 10 pages. I have tried iTextSharp by passing the HTML directly, but it produces a PDF that will not open. The library at pdf.codeplex.com has no documentation, and it produces a PDF with styles from the parent page.
Any other solution would be helpful.
I've tried many HTML-to-PDF solutions, including iTextSharp, wkhtmltopdf and ABCpdf (paid).
I've currently settled on PhantomJS, a headless, open-source, WebKit-based browser. It is scriptable with a JavaScript API, which is reasonably well documented.
The only disadvantage I found was that attempting to use stdin to pass HTML into the process was unsuccessful because the REPL still has some bugs. I also found that using stdout seemed to be a lot slower than simply allowing the process to write to disk.
The code below avoids stdin and stdout by writing the JavaScript input to a temp file, executing PhantomJS, copying the output file to a MemoryStream and cleaning up the temporary files at the end.
using System.IO;
using System.Text; // StringBuilder
using System.Drawing;
using System.Diagnostics;
public Stream HTMLtoPDF (string html, Size pageSize) {
    string path = "C:\\dev\\";
    string inputFileName = "tmp.js";
    string outputFileName = "tmp.pdf";
    // Build the PhantomJS script that loads the HTML and renders it to a PDF.
    StringBuilder input = new StringBuilder();
    input.Append("var page = require('webpage').create();");
    input.Append(String.Format("page.viewportSize = {{ width: {0}, height: {1} }};", pageSize.Width, pageSize.Height));
    input.Append("page.paperSize = { format: 'Letter', orientation: 'portrait', margin: '1cm' };");
    input.Append("page.onLoadFinished = function() {");
    input.Append(String.Format("page.render('{0}');", outputFileName));
    input.Append("phantom.exit();");
    input.Append("};");
    // html is embedded in a JavaScript string literal, so escape backslashes, double quotes and line breaks
    string escapedHtml = html.Replace("\\", "\\\\").Replace("\"", "\\\"").Replace("\r", "").Replace("\n", "\\n");
    input.Append("page.content = \"" + escapedHtml + "\";");
    File.WriteAllText(path + inputFileName, input.ToString());
    // Run PhantomJS on the generated script and wait up to 10 seconds for it to finish.
    Process p;
    ProcessStartInfo psi = new ProcessStartInfo();
    psi.FileName = path + "phantomjs.exe";
    psi.Arguments = inputFileName;
    psi.WorkingDirectory = Path.GetDirectoryName(psi.FileName);
    psi.UseShellExecute = false;
    psi.CreateNoWindow = true;
    p = Process.Start(psi);
    p.WaitForExit(10000);
    // Copy the rendered PDF into memory so the temporary files can be deleted.
    Stream strOut = new MemoryStream();
    Stream fileStream = File.OpenRead(path + outputFileName);
    fileStream.CopyTo(strOut);
    fileStream.Close();
    strOut.Position = 0;
    File.Delete(path + inputFileName);
    File.Delete(path + outputFileName);
    return strOut;
}
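The code above embeds the HTML in a JavaScript string literal by escaping it by hand, which is easy to get wrong for backslashes and line breaks. One alternative, sketched here in Python rather than C#, is to let a JSON encoder produce the literal, since a JSON string is also a valid JavaScript string. The function name build_phantom_script and the default viewport size are my own assumptions for illustration.

```python
import json


def build_phantom_script(html, output_file, width=1024, height=768):
    """Return PhantomJS script text that loads the given HTML and renders it to output_file."""
    lines = [
        "var page = require('webpage').create();",
        "page.viewportSize = { width: %d, height: %d };" % (width, height),
        "page.paperSize = { format: 'Letter', orientation: 'portrait', margin: '1cm' };",
        "page.onLoadFinished = function() {",
        # json.dumps quotes and escapes the file name for use inside the script.
        "  page.render(%s);" % json.dumps(output_file),
        "  phantom.exit();",
        "};",
        # json.dumps escapes quotes, backslashes and newlines in the HTML.
        "page.content = %s;" % json.dumps(html),
    ]
    return "\n".join(lines)


script = build_phantom_script('<p class="a">hi</p>\n', "tmp.pdf")
print(script)
```

The resulting text would be written to the temporary .js file in the same way as in the code above.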