GitHub Pages mangling syntax highlighting after upgrade to Jekyll 3 - jekyll

I use GitHub Pages for my personal website. They're upgrading from Jekyll 2 to Jekyll 3 and sending deprecation warnings. I complied with the warnings and switched from redcarpet to kramdown and from pygments to rouge. When I build locally (with bundle exec jekyll serve) everything works, but when I push the changes the syntax highlighting gets mangled wherever I have linenos in my code blocks.
This is the code block:
{% highlight python linenos %}
'''
scrape lyrics from vagalume.com.br
(author: thiagomarzagao.com)
'''

import json
import time
import pickle
import requests
from bs4 import BeautifulSoup

# get each genre's URL
basepath = 'http://www.vagalume.com.br'
r = requests.get(basepath + '/browse/style/')
soup = BeautifulSoup(r.text)
genres = [u'Rock',
          u'Ax\u00E9',
          u'Forr\u00F3',
          u'Pagode',
          u'Samba',
          u'Sertanejo',
          u'MPB',
          u'Rap']
genre_urls = {}
for genre in genres:
    genre_urls[genre] = soup.find('a', class_ = 'eA', text = genre).get('href')

# get each artist's URL, per genre
artist_urls = {e: [] for e in genres}
for genre in genres:
    r = requests.get(basepath + genre_urls[genre])
    soup = BeautifulSoup(r.text)
    counter = 0
    for artist in soup.find_all('a', class_ = 'top'):
        counter += 1
        print 'artist {} \r'.format(counter)
        artist_urls[genre].append(basepath + artist.get('href'))
        time.sleep(2) # don't reduce the 2-second wait (here or below) or you get errors

# get the lyrics, per genre
api = 'http://api.vagalume.com.br/search.php?musid='
genre_lyrics = {e: {} for e in genres}
for genre in artist_urls:
    print len(artist_urls[genre])
    counter = 0
    artist1 = None
    for url in artist_urls[genre]:
        success = False
        while not success: # loop in case your connection flickers
            try:
                r = requests.get(url)
                success = True
            except:
                time.sleep(2)
        soup = BeautifulSoup(r.text)
        hrefs = soup.find_all('a')
        for href in hrefs:
            if href.has_attr('data-song'):
                song_id = href['data-song']
                print song_id
                time.sleep(2)
                success = False
                while not success:
                    try:
                        song_metadata = requests.get(api + song_id).json()
                        success = True
                    except:
                        time.sleep(2)
                if 'mus' in song_metadata:
                    if 'lang' in song_metadata['mus'][0]: # discard if no language info
                        language = song_metadata['mus'][0]['lang']
                        if language == 1: # discard if language != Portuguese
                            if 'text' in song_metadata['mus'][0]: # discard if no lyrics
                                artist2 = song_metadata['art']['name']
                                if artist2 != artist1:
                                    if counter > 0:
                                        print artist1.encode('utf-8') # change as needed
                                        genre_lyrics[genre][artist1] = artist_lyrics
                                    artist1 = artist2
                                    artist_lyrics = []
                                lyrics = song_metadata['mus'][0]['text']
                                artist_lyrics.append(lyrics)
                                counter += 1
                                print 'lyrics {} \r'.format(counter)

    # serialize
    with open(genre + '.json', mode = 'wb') as fbuffer:
        json.dump(genre_lyrics[genre], fbuffer)
{% endhighlight %}
This is what I see locally:
This is what I see on GitHub Pages:
(Without linenos the syntax highlighting works fine.)
What could be happening?

I think I got it!
Your code block seems to be fine. No problem there.
Make sure you have added this to your _config.yml:
highlighter: rouge
markdown: kramdown
kramdown:
  input: GFM
What you're probably missing is the kramdown input: GFM part, isn't it?
I tested this locally and it worked fine. When uploaded to GitHub it worked fine as well, so it should work for you too.
Let me know how it goes, ok? :)
UPDATE!
Add this to your stylesheet and check how it goes:
.lineno { width: 35px; }
It looks like something in your CSS is breaking the layout. Keep tweaking your CSS and you should be fine.

Related

Not saving html interactive file with R

I am trying to design a circos plot using the BioCircos R package. BioCircos allows saving the plots as interactive .html files. However, when I run the script using Rscript the saved .html file is empty. To save the .html file I used the saveWidget function from the htmlwidgets package. Is there something wrong with the saveWidget call? The code I used follows:
#!/usr/bin/Rscript
###### R script for BioCircos test

library(htmlwidgets)
library(BioCircos)

genomes <- list("chra1" = 217471166, "chra2" = 181034961, "chra3" = 153873357, "chra4" = 153961319, "chra5" = 164033575,
                "chra6" = 154486312, "chra7" = 133565930, "chra8" = 147241510, "chra9" = 91218944, "chra10" = 52432566,
                "chrb1" = 843366180, "chrb2" = 842558404, "chrb3" = 707956555, "chrb4" = 635713434, "chrb5" = 567300182,
                "chrb6" = 439630435, "chrb7" = 236595445, "chrb8" = 231667822, "chrb9" = 230778867, "chrb10" = 151572763, "chrb11" = 103205957) # custom genome

links_chromosomes_01 <- c("chra1", "chra2", "chra3", "chra4", "chra4", "chra5", "chra6", "chra7", "chra7", "chra8", "chra8", "chra9", "chra10") # Chromosomes on which the links should start
links_chromosomes_02 <- c("chrb2", "chrb3", "chrb1", "chrb9", "chrb10", "chrb4", "chrb5", "chrb6", "chrb1", "chrb8", "chrb3", "chrb7", "chrb6") # Chromosomes on which the links should end

links_pos_01 <- c(115060347, 102611974, 14761160, 128700431, 128681496, 42116205, 58890582, 40356090,
                  146935315, 136481944, 157464876, 39323393, 84752508, 136164354,
                  99573657, 102580613,
                  111139346, 120764772, 90748238, 122164776,
                  44933176, 18823342,
                  48771409, 128288229, 150613881, 18509106, 123913217, 51237349,
                  34237851, 53357604, 78270031,
                  25306417, 25320614,
                  94266153,
                  41447919, 28810876, 2802465,
                  45583472,
                  81968637, 27858237, 17263637,
                  30569409) ### links chra chromosomes

links_pos_02 <- c(410543481, 463189512, 825903588, 353914638, 354135472, 717707494, 643107332, 724899652,
                  583713545, 558756961, 642015290, 154999098, 340216235, 557731577,
                  643350872, 655077847,
                  85356666, 157889318, 226411560, 161566470,
                  109857786, 25338955,
                  473876792, 124495704, 46258030, 572314729, 141584107, 426419779,
                  531245660, 220131772, 353941099,
                  62422773, 62387030,
                  116923325,
                  76544045, 33452274, 7942164,
                  642047816,
                  215981114, 39278129, 23302654,
                  418922633) ### links chrb chromosomes

links_labels <- c("aldh1a3", "amh", "cyp26b1", "dmrt1", "dmrt3", "fgf20", "hhip", "srd5a3",
                  "amhr2", "dhh", "fgf9", "nr0b1", "rspo1", "wnt1",
                  "aldh1a2", "cyp19a1",
                  "lhx9", "pdgfb", "ptch2", "sox10",
                  "cbln1", "wt1",
                  "esr1", "foxl2", "gata4", "lrpprc", "serpine2", "srd5a2",
                  "asns", "ctnnb1", "srd5a1",
                  "cyp26a1", "cyp26c1",
                  "wnt4",
                  "ar", "nr5a1", "ptgds",
                  "fgf16",
                  "cxcr4", "pdgfa", "sox8",
                  "sox9")

tracklist <- BioCircosLinkTrack('myLinkTrack', links_chromosomes_01, links_pos_01,
                                links_pos_01, links_chromosomes_02, links_pos_02, links_pos_02,
                                maxRadius = 0.55, labels = links_labels)

# plotting results
plot_chra_chrb <- BioCircos(tracklist, genome = chra_chrb_genomes, genomeFillColor = "RdBu", chrPad = 0.02, displayGenomeBorder = FALSE, genomeLabelTextSize = "10pt", genomeTicksScale = 4e+3,
                            elementId = "chra_chrb_comp_plot_test.html")

saveWidget(plot_chra_chrb, "chra_chrb_comp_plot_test.html", selfcontained = F, libdir = "lib")
The command line to run this script:
Rscript /path_to/Circle_plot_test.r
I tried to use this script in RStudio (without the saveWidget() command), however it took too long to run on my personal computer and the result was not displayed. This could be due to memory limits, because when I removed some of the data the script easily generated the plot in RStudio and I was able to save it. Is there another way to save the .html interactive files in R, or am I doing something wrong using the htmlwidgets package in my script?
Thanks all in advance for any help and comments.
When you said it took too long to run, that was a sign that something was wrong! You weren't getting anything when you used saveWidget because nothing was returned from BioCircos.
I found two things that are a problem. The first one will result in a blank output: you can't use a '.' in the element ID. This ID will be used in the HTML code.
You were getting huge delays due to the scale you set for genomeTicksScale. That scaling value is for a tick-mark attribute. I'm not sure why you set it to .004. However, when I comment out that line, it renders immediately, and I have no issues with saving the widget either.
One other thing: you had chra_chrb_genomes as the object assigned to the genome parameter of the BioCircos function. I assumed you meant the genomes object from your question, since it was the only unused object.
The only things I changed were in the BioCircos function:
(plot_chra_chrb <- BioCircos(tracklist, genome = genomes, # chra_chrb_genomes,
                             genomeFillColor = "RdBu",
                             chrPad = 0.02,
                             displayGenomeBorder = FALSE,
                             genomeLabelTextSize = "10pt",
                             # genomeTicksScale = 4e+3, # problematic
                             elementId = "chra_chrb_comp_plot_test" # no periods
))

The srcset= strings on the website are https links, but when I use selenium's driver.find('img')['srcset'], it gives me base64 strings

I'm trying to scrape images from a website. In the website's HTML code, the srcset attributes exist and are of the form
srcset="https://...."
For example,
srcset="https://secure.img1-fg.wfcdn.com/im/80458162/resize-h300-w300%5Ecompr-r85/1068/106844956/Derry+84%2522+Reversible+Sectional.jpg 300w,https://secure.img1-fg.wfcdn.com/im/19496430/resize-h400-w400%5Ecompr-r85/1068/106844956/Derry+84%2522+Reversible+Sectional.jpg 400w,https://secure.img1-fg.wfcdn.com/im/75516274/resize-h500-w500%5Ecompr-r85/1068/106844956/Derry+84%2522+Reversible+Sectional.jpg 500w"
However, when I try to get these srcset links using Selenium and Beautiful Soup, I get the following:
"data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
Moreover, every time the srcset lookup fails to return a valid link, the string it returns is always
"data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
I tried a bunch of different lines of code, but haven't had success with any of them. Here is the full code I currently have:
def get_info_from_product_link(product_link): # get the price and correctly filtered image link
    info = dict()
    driver = webdriver.Chrome('C:/Users/Brian/Downloads/chromedriver_win32/chromedriver.exe')
    driver.implicitly_wait(200)
    try:
        driver.get(product_link)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        time.sleep(60)
        image_carousel = soup.find_all('li', {"class" : "ProductDetailImageCarouselVariantB-carouselItem"})
        print("Number of images in gallery: ", len(image_carousel))

        # deal with captcha
        while len(image_carousel) <= 0:
            print("CAPTCHA ENCOUNTERED. FIX")
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            image_carousel = soup.find_all('li', {"class" : "ProductDetailImageCarouselVariantB-carouselItem"})
            time.sleep(30)

        valid_image_links = []
        highest_resolution_images = []

        # get correct image links
        #i = 1
        for image_block in image_carousel:
            try:
                #print("image_block:")
                #print(image_block)
                #print("Image: ", i)
                #i += 1
                images = image_block.find('div', {"class" : "ImageComponent ImageComponent--overlay"})
                #image_links = images.find('img').get_attribute('srcset').split(',')
                print(images)
                #driver.implicitly_wait(60)
                #wait = WebDriverWait(images, 30)
                #image_links = wait.until(EC.visibility_of_element_located((By.tagName, "img"))).get_attribute("srcset").split(',')
                #image_links = wait.until(EC.text_to_be_present_in_element_value((By.tagName, 'img'), "https")).get_attribute("srcset").split(',')
                #image_links = wait.until(EC.text_to_be_present_in_element_value((By.CSS_SELECTOR, "img [srcset*='https']"), "https")).get_attribute("srcset").split(',')
                #image_links = wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, "img[src*='https']"))).get_attribute("src").split(',')
                images.implicitly_wait(30)
                image_links = images.find_element_by_tag_name('img').get_attribute('srcset').split(',')
                #"div[class='ajax_enabled'] [style='display:block']"
                #image_links = images.find('img')['srcset'].split(',')
                #print("Image links:")
                #print(image_links)
                #print("Number of links: ", len(image_links))
                for link in image_links:
                    print(link)
                for image_info in image_links:
                    image_link = image_info.split(" ")[0]
                    try:
                        if hasValidBackground(image_link) and hasValidSize(image_link):
                            valid_image_links.append(image_link)
                        else:
                            print("Invalid image size or background")
                    except:
                        print('ERROR when reading image: ' + image_link)
                        continue
                if len(valid_image_links) > 0:
                    highest_resolution_images.append(valid_image_links[-1])
                valid_image_links.clear()
            except:
                print("Error. Invalid image link.")
                pass

        # extract one link to a correctly filtered image
        if len(highest_resolution_images) <= 0:
            return -1
        valid_image_link = highest_resolution_images[0];
        info['img_url'] = valid_image_link

        # get price information
        standard_price_block = soup.find('div', {"class" : "StandardPriceBlock"})
        base_price_block = standard_price_block.find('div', {"class" : "BasePriceBlock BasePriceBlock--highlight"})
        if base_price_block is None:
            base_price_block = standard_price_block.find('div', {"class" : "BasePriceBlock"})
        base_price = base_price_block.find('span').text
        #price_block = soup.find('span', {"class" : "notranslate"})
        #base_price = standard_price_block.find('span').text
        info['price'] = base_price
        print(base_price)
        #print(f"Image link: {image_link}\n")
        #print(f"Link to product: {product_link}\n")
        driver.close()
        #browser.switch_to.window(browser.window_handles[0])
        return info
    except TimeoutException as e:
        print("Page Load Timeout Occurred. Quitting...")
        driver.close()
I was testing using this website:
https://www.wayfair.com/furniture/pdp/foundstone-derry-84-reversible-sectional-w001832490.html
My goal is to process each image in the image gallery/carousel and find one that has a white background and a valid size of height >= 80 and width >= 80.
I'm just starting to learn web scraping, so any help would be much appreciated!!
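For what it's worth, that data:image/gif;base64,R0lGODlh... value is the classic 1x1 transparent GIF that many sites use as a lazy-loading placeholder, so the real https srcset may only be filled in once the image has scrolled into view. Below is a minimal sketch of one way to test that idea with Selenium; the CSS selector and the scroll-then-wait approach are assumptions about how this particular page behaves, not a confirmed fix:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

driver.get(product_link)  # driver and product_link as in the question

# find the carousel images and force each one into the viewport
imgs = driver.find_elements(By.CSS_SELECTOR,
                            "li.ProductDetailImageCarouselVariantB-carouselItem img")
srcsets = []
for img in imgs:
    driver.execute_script("arguments[0].scrollIntoView();", img)
    # wait until srcset no longer holds the base64 placeholder
    # (raises TimeoutException if it never changes)
    WebDriverWait(driver, 10).until(
        lambda d, el=img: (el.get_attribute("srcset") or "").startswith("https"))
    srcsets.append(img.get_attribute("srcset").split(","))
If the srcset values still come back as the placeholder after scrolling, the images are probably injected only when the carousel is interacted with, and clicking the carousel controls first would be the next thing to try.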

Prevent jekyll from cleaning up generated JSON file?

I've written a simple plugin that generates a small JSON file
module Jekyll
  require 'pathname'
  require 'json'

  class SearchFileGenerator < Generator
    safe true

    def generate(site)
      output = [{"title" => "Test"}]
      path = Pathname.new(site.dest) + "search.json"
      FileUtils.mkdir_p(File.dirname(path))
      File.open(path, 'w') do |f|
        f.write("---\nlayout: null\n---\n")
        f.write(output.to_json)
      end
      # 1/0
    end
  end
end
But the generated JSON file gets deleted every time Jekyll runs to completion. If I uncomment the division by zero line and cause it to error out, I can see that the search.json file is being generated, but it's getting subsequently deleted. How do I prevent this?
I found the following issue, which suggested adding the file to keep_files: https://github.com/jekyll/jekyll/issues/5162. That worked.
The new code keeps search.json from getting deleted:
module Jekyll
  require 'pathname'
  require 'json'

  class SearchFileGenerator < Generator
    safe true

    def generate(site)
      output = [{"title" => "Test"}]
      path = Pathname.new(site.dest) + "search.json"
      FileUtils.mkdir_p(File.dirname(path))
      File.open(path, 'w') do |f|
        f.write("---\nlayout: null\n---\n")
        f.write(output.to_json)
      end
      site.keep_files << "search.json"
    end
  end
end
Add your new page to site.pages:
module Jekyll
  class SearchFileGenerator < Generator
    def generate(site)
      @site = site
      search = PageWithoutAFile.new(@site, site.source, "/", "search.json")
      search.data["layout"] = nil
      search.content = [{"title" => "Test 32"}].to_json
      @site.pages << search
    end
  end
end
Inspired by jekyll-feed code.

How to parse all the text content from the HTML using Beautiful Soup

I wanted to extract the content of an email message. It is HTML content, and I used BeautifulSoup to fetch the From, To and Subject. When fetching the body content, it fetches the first line alone and leaves out the remaining lines and paragraphs.
I am missing something here. How do I read all the lines/paragraphs?
CODE:
email_message = mail.getEmail(unreadId)
print (email_message['From'])
print (email_message['Subject'])
if email_message.is_multipart():
    for payload in email_message.get_payload():
        bodytext = email_message.get_payload()[0].get_payload()
        if type(bodytext) is list:
            bodytext = ','.join(str(v) for v in bodytext)
else:
    bodytext = email_message.get_payload()[0].get_payload()
    if type(bodytext) is list:
        bodytext = ','.join(str(v) for v in bodytext)
print (bodytext)
parsedContent = BeautifulSoup(bodytext)
body = parsedContent.findAll('p').getText()
print body
Console:
body = parsedContent.findAll('p').getText()
AttributeError: 'list' object has no attribute 'getText'
When I use
body = parsedContent.find('p').getText()
It fetches the first line of the content and does not print the remaining lines.
Added
After getting all the lines from the HTML tags, I get an = symbol at the end of each line, and &nbsp; and &lt; are also displayed. How do I overcome those?
Extracted text:
Dear first,All of us at GenWatt are glad to have xyz as a
customer. I would like to introduce myself as your Account
Manager. Should you = have any questions, please feel free to
call me at or email me at ash= wis#xyz.com. You
can also contact GenWatt on the following numbers: Main:
810-543-1100Sales: 810-545-1222Customer Service & Support:
810-542-1233Fax: 810-545-1001I am confident GenWatt will serve you
well and hope to see our relationship=
Let's inspect the result of soup.findAll('p')
python -i test.py
----------
import requests
from bs4 import BeautifulSoup
bodytext = requests.get("https://en.wikipedia.org/wiki/Earth").text
parsedContent = BeautifulSoup(bodytext, 'html.parser')
paragraphs = parsedContent.findAll('p')
----------
>> type(paragraphs)
<class 'bs4.element.ResultSet'>
>> issubclass(type(paragraphs), list)
True # It's a list
Can you see? It's a list of all paragraphs. If you want to access their content you will need to iterate over the list or access an element by index, like a normal list.
>> # You can print all content with a for-loop
>> for p in paragraphs:
>>     print p.getText()
Earth (otherwise known as the world (...)
According to radiometric dating and other sources of evidence (...)
...
>> # Or you can join all content
>> content = []
>> for p in paragraphs:
>>     content.append(p.getText())
>>
>> all_content = "\n".join(content)
>>
>> print(all_content)
Earth (otherwise known as the world (...) According to radiometric dating and other sources of evidence (...)
Using a list comprehension, your code will look like:
parsedContent = BeautifulSoup(bodytext)
body = '\n'.join([p.getText() for p in parsedContent.findAll('p')])
When I use
body = parsedContent.find('p').getText()
It fetches the first line of the content and it is not printing the
remaining lines.
parsedContent.find('p') is exactly the same as parsedContent.findAll('p')[0]:
>> parsedContent.findAll('p')[0].getText() == parsedContent.find('p').getText()
True
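On the follow-up about the stray = signs and the &nbsp;/&lt; entities: an = at the end of each line is what a quoted-printable soft line break looks like, which suggests the payload is being read without decoding its Content-Transfer-Encoding. A minimal sketch, assuming the part really is quoted-printable HTML (it reuses email_message and the same first part the question reads):
part = email_message.get_payload()[0]            # same part the question uses
raw = part.get_payload(decode=True)              # undoes quoted-printable / base64, returns bytes
charset = part.get_content_charset() or 'utf-8'
html = raw.decode(charset, errors='replace')

parsedContent = BeautifulSoup(html, 'html.parser')
# getText() also turns entities such as &nbsp; and &lt; into plain characters
body = '\n'.join(p.getText() for p in parsedContent.findAll('p'))
print(body)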

batch convert HTML to Markdown

I have a whole lot of HTML files that live in one folder. I need to convert these to Markdown. I found a couple of gems out there that do this great, one file at a time.
My question is:
How can I loop through each file in the folder and run the command to convert each one to an .md file in a separate folder?
UPDATE
#!/usr/bin/ruby

root = 'C:/Doc'
inDir = File.join(root, '/input')
outDir = File.join(root, '/output')
extension = nil
fileName = nil

Dir.foreach(inDir) do |file|
  # Dir.foreach will always show current and parent directories
  if file == '.' or file == '..' then
    next
  end
  # makes sure the current iteration is not a sub directory
  if not File.directory?(file) then
    extension = File.extname(file)
    fileName = File.basename(file, extension)
  end
  # strips off the last string if it contains a period
  if fileName[fileName.length - 1] == "." then
    fileName = fileName[0..-1]
  end
  # this is where I got stuck
  reverse_markdown File.join(inDir, fileName, '.html') > File.join(outDir, fileName, '.md')
end
Dir.glob(directory) {|f| ... } will loop through all files inside a directory. For example, using the Redcarpet library you could do something like this:
require 'redcarpet'

markdown = Redcarpet::Markdown.new(Redcarpet::Render::HTML, :autolink => true)

Dir.glob('*.md') do |in_filename|
  out_filename = File.join(File.dirname(in_filename), "#{File.basename(in_filename,'.*')}.html")

  File.open(in_filename, 'r') do |in_file|
    File.open(out_filename, 'w') do |out_file|
      out_file.write markdown.render(in_file.read)
    end
  end
end