Detect automatically opened tabs with Pyppeteer

With pyppeteer it is possible to get all open tabs via the .pages() method. This works fine until a website opens a new tab by itself (e.g. after a click on a button). In that case, the new tab isn't listed in the return value of .pages().
Is there a way to detect this new tab, so that I can work with it like I can with the other tabs/pages?
(I didn't test it with puppeteer, but I think it'll behave the same.)
Code example (sadly I have to use an old Python version without async/await support, so I have to use yield from):
self.browser = yield from launch(appMode=True, closeAtExit=False)
pages = yield from self.browser.pages()
self.page = pages[len(pages) - 1] # Open w3schools in the init tab
yield from self.page.goto("https://www.w3schools.com/tags/att_a_target.asp")
link = yield from self.page.waitForSelector('a.w3-btn:nth-child(4)')
yield from link.click()
yield from asyncio.sleep(5) # Just to give some extra time...
pages1 = yield from self.browser.pages()
self.log.info("Count: " + str(len(pages1))) # Should be 2 now
for mpage in pages1:
    self.log.info("URL: " + str(mpage.url))
Output:
TARGETS: {'246562630E35EEAD0384B80658C827F8': <pyppeteer.target.Target object at 0x03482F10>}
TARGETS: {'246562630E35EEAD0384B80658C827F8': <pyppeteer.target.Target object at 0x03482F10>}
INFO:__main__:Count: 1
INFO:__main__:URL: https://www.w3schools.com/tags/att_a_target.asp
INFO:__main__:Done!
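For reference, pyppeteer mirrors puppeteer's event API, so a new tab can usually be caught by subscribing to the browser's 'targetcreated' event before triggering the click. Below is a minimal sketch (untested here, written with async/await for brevity; the URL and selector are the ones from the question):

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch()
    page = (await browser.pages())[0]

    # pyppeteer's Browser is an event emitter: it fires 'targetcreated'
    # whenever a new tab/target appears, even before .pages() lists it
    new_target = asyncio.get_event_loop().create_future()
    browser.on('targetcreated',
               lambda t: new_target.done() or new_target.set_result(t))

    await page.goto('https://www.w3schools.com/tags/att_a_target.asp')
    link = await page.waitForSelector('a.w3-btn:nth-child(4)')
    await link.click()

    target = await new_target       # resolves when the popup opens
    new_page = await target.page()  # the Page object for the new tab
    print('New tab URL:', new_page.url)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())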

Related

Difficulties with web scraping

I just came across an article called The 500 Greatest Songs of All Time and thought "oh, that's cool, I bet they also made a Spotify/Apple Music list that I can follow". Well... they don't.
So, in a nutshell, I wonder if it's possible to 1) scrape the website to extract the songs and 2) then do some kind of bulk upload to Spotify to create the list.
Song titles and artists are structured like this on the website:
[Website screenshot]
I have already tried to scrape the page with the IMPORTXML() formula in Google Sheets, but with no success.
I understand the scraping part is easier than the upload part and, as I am new to programming, I would be happy to even partially achieve this goal. I am sure this task can be done easily in Python.
I feel like explaining everything would go beyond the scope here, so I tried to comment the code well enough.
1. Scrape the songs
I used Python 3 and Selenium; their website doesn't block that.
Be sure to adjust your chromedriver path, and the output path of the .txt file at the bottom if necessary. Once it's done and you have your .txt file, you can close the browser.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

s = Service(r'/Users/main/Desktop/chromedriver')
driver = webdriver.Chrome(service=s)

# just setting some vars, I used XPath because I know that
top_500 = 'https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/'
cookie_button_xpath = "//button[@id='onetrust-accept-btn-handler']"
div_containing_links_xpath = "//div[@id='pmc-gallery-list-nav-bar-render']//child::a"
song_names_xpath = "//article[@class='c-gallery-vertical-album']/child::h2"

links = []
songs = []
driver.get(top_500)

# accept cookies, give the page time to load
time.sleep(3)
cookie_btn = driver.find_element(By.XPATH, cookie_button_xpath)
cookie_btn.click()
time.sleep(1)

# extract all the page links, since there are only 50 songs per page
links_to_next_pages = driver.find_elements(By.XPATH, div_containing_links_xpath)
for element in links_to_next_pages:
    l = element.get_attribute('href')
    links.append(l)

# extract the songs, then go to the next page and so on until we hit 500
counter = 1  # we're starting with 1 here since links[0] is the current page we are already on
while True:
    song_elements = driver.find_elements(By.XPATH, song_names_xpath)
    for element in song_elements:
        songs.append(element.text)
    if len(songs) == 500:
        break
    driver.get(links[counter])
    counter += 1
    time.sleep(2)

# verify that there are no duplicates; if there were, something would be off
if len(songs) != len(set(songs)):
    print('you f***** up')
else:
    print('seems fine')

with open('/Users/main/Desktop/output_songs.txt', 'w') as file:
    file.writelines(line + '\n' for line in songs)
2. Prepare Spotify
Go to the Spotify Developer Dashboard and create an account (use your Spotify account).
Then create an app; call it whatever you want.
On your app, click settings and whitelist http://localhost:8888/callback
On your app, click "users and access" and add your Spotify account
Leave the tab open; we'll come back to it.
3. Prepare Your Environment
You need Node.js, so make sure it is installed on your machine
Download this from Spotify's GitHub
Unzip it, cd into the folder and run npm install
Go into the authorization_code folder and open app.js in an editor
Find var scope and append ' playlist-modify-public' to the string; this is so that your app can access your Spotify playlists, see here
Now go back to the app in your Spotify Developer Dashboard; we'll need to copy the Client ID and the Client Secret into var client_id and var client_secret respectively (in the app.js file). var redirect_uri will be http://localhost:8888/callback - don't forget to save your changes.
4. Run the Spotify side of things
cd into the authorization_code folder and run app.js with node app.js (this is basically a server running on your PC)
Now, if that works, leave it running and go to http://localhost:8888 and authorise your Spotify account there
Then copy the full token, including the overflow; use inspect element to get it
Adjust the user_id and auth variables, as well as the path to output_songs.txt (at with open), in the following Python script, then run it. Songs which are not found will be printed out at the end; give them a search on Google. They are usually on Spotify as well, but Google seems to have the better search algorithm (surprised Pikachu face).
import requests
import re
import json

# this is NOT your display name, it's your user name!!
user_id = 'YOUR_USERNAME'
# paste your auth token from Spotify; it can time out, then you have to get a new one,
# so don't panic if you get a bunch of responses in the 400s after some time
auth = {"Authorization": "Bearer YOUR_AUTH_KEY_FROM_LOCALHOST"}

playlist = []
err_log = []
base_url = 'https://api.spotify.com/v1'
search_method = '/search'

with open('/Users/main/Desktop/output_songs.txt', 'r') as file:
    songs = file.readlines()

# this queries Spotify, does some magic, and then appends each track's Spotify URI to an array
def query_song_uris():
    for n, entry in enumerate(songs):
        x = re.findall(r"'([^']*)'", entry)
        title_len = len(entry) - len(x[0]) - 4
        title = x[0]
        artist = entry[:title_len]
        payload = {
            'q': entry,
            'track:': title,
            'artist:': artist,
            'type': 'track',
            'limit': 1
        }
        url = base_url + search_method
        try:
            r = requests.get(url, params=payload, headers=auth)
            print('\nquerying spotify; ', r)
            c = r.content.decode('UTF-8')
            dic = json.loads(c)
            track_uri = dic["tracks"]["items"][0]["uri"]
            playlist.append(track_uri)
            print(track_uri)
        except Exception:
            err = f'\nNr. {len(songs) - n}: ' + f'{entry}'
            err_log.append(err)
    playlist.reverse()

query_song_uris()

# creates a playlist and returns the playlist id
def create_playlist():
    payload = {
        "name": "Rolling Stone: Top 500 (All Time)",
        "description": "music for old men xD with occasional hip hop appearances. just kidding"
    }
    url = base_url + f'/users/{user_id}/playlists'
    r = requests.post(url, headers=auth, json=payload)
    c = r.content.decode('UTF-8')
    dic = json.loads(c)
    print(f'\n\ncreating playlist #{dic["id"]}; ', r)
    return dic["id"]

# adds the collected URIs to the new playlist, at most 100 per request
def add_to_playlist():
    playlist_id = create_playlist()
    while True:
        if len(playlist) > 100:
            p = playlist[:100]
        else:
            p = playlist
        payload = {"uris": p}
        url = base_url + f'/playlists/{playlist_id}/tracks'
        r = requests.post(url, headers=auth, json=payload)
        print(f'\nadding {len(p)} songs to playlist; ', r)
        del playlist[:len(p)]
        if len(playlist) == 0:
            break

add_to_playlist()

print('\n\ncheck your spotify :)')
print("\n\n\nthese tracks didn't make it, check manually:\n")
for line in err_log:
    print(line)
print('\n\n')
Done
If you don't want to run the code yourself, here's the playlist:
https://open.spotify.com/playlist/5fdLKYNFlA4XSvhEl36KXS
If you have trouble, everything from step 2 onwards is also described in the Web API quick start, or in general in the Web API docs.
Regarding Apple Music
So, Apple seems very closed up (surprise, haha). What I found, though, is that you can query the iTunes Store; the response also contains a direct link to the song(s) on Apple Music.
You might be able to go from there.
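For illustration, a minimal query against the iTunes Search API might look like this (the search term is just a made-up example; 'term', 'entity' and 'limit' are standard parameters of that API):

import requests

# the iTunes Search API is public and keyless
r = requests.get('https://itunes.apple.com/search',
                 params={'term': 'Respect Aretha Franklin',
                         'entity': 'song', 'limit': 1})
results = r.json()['results']
if results:
    # trackViewUrl links to the song in the iTunes Store / Apple Music
    print(results[0]['trackName'], '->', results[0]['trackViewUrl'])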
Get ISRC code from iTunes Search API (Apple music)
PS: undeniably regex is witchcraft, but y'all here got my back
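To make the witchcraft slightly less opaque: the title/artist split in query_song_uris assumes lines shaped roughly like the hypothetical example below, with the title in single quotes. A quick way to sanity-check it:

import re

entry = "Aretha Franklin, 'Respect'"          # hypothetical line from output_songs.txt
title = re.findall(r"'([^']*)'", entry)[0]    # grabs the text between the quotes
artist = entry[:len(entry) - len(title) - 4]  # strips ", '" + title + "'" from the end
print(artist, '|', title)                     # -> Aretha Franklin | Respect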

Python Full Web Parsing

As of right now I'm attempting to make a simple music player app that streams music or video directly from a YouTube URL, and in order to do that I need the full download of the search page that's used to search for videos to stream. But I'm having some problems with the urlopen function in Python 3, which is what I'm using to make the command-line application. It won't load the ytd-app tag on YouTube, which is where a good deal of the video and playlist references are placed when the search first loads. Does anyone know what's going on, or know some type of workaround for it? Thanks!
My code so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS

# query and filtercriteria are defined elsewhere in the app
BASICURL = "https://www.youtube.com/results?"
query = query.split()
ret = ""
stufffound = {}
for x in query:
    ret = ret + x + "+"
ret = ret[:len(ret) - 1]
# URL BUILDER
if filtercriteria:
    URL = BASICURL + "sp={0}".format(filtercriteria) + "&search_query={0}".format(ret)
else:
    URL = BASICURL + "search_query={0}".format(ret)
page = urlopen(str(URL))
passdict = {}

def findvideosonpage(page, dictToAddTo):
    soup = BS(page.read(), 'html.parser')
    for i, x in enumerate(soup.findAll(attrs={'class': 'yt-simple-endpoint style-scope ytd-video-renderer'})):
        dictToAddTo[i] = x['href']
        print(x)
    # the dictionary is meant to be converted into a list later to order the results
    return list([x for _, x in sorted(zip(dictToAddTo.values(), dictToAddTo.keys()))])
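For what it's worth, the likely culprit is that YouTube renders its search results client-side: the contents of the ytd-app tag never appear in the raw HTML that urlopen returns. The underlying data is embedded in the page as a JSON blob called ytInitialData, so one possible workaround (a rough sketch; the blob's structure is undocumented and changes over time) is to pull that out directly:

import re
import json
from urllib.request import urlopen

html = urlopen('https://www.youtube.com/results?search_query=python+tutorial').read().decode('utf-8')
match = re.search(r'ytInitialData\s*=\s*(\{.+?\});', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))
    # the video entries live deep inside this structure, under keys like
    # 'twoColumnSearchResultsRenderer' -> ... -> 'videoRenderer'
    print('parsed', len(match.group(1)), 'bytes of embedded search data')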

.text is scrambled with numbers and special keys in BeautifulSoup

Hello, I am currently using Python 3, BeautifulSoup 4 and requests to scrape some information from supremenewyork.com UK. I have implemented a proxy script (that I know works) into the script. The only problem is that this website does not like programs scraping its information automatically, so they have decided to scramble the text, which I think makes it unusable as text.
My question: is there a way to get the text without using the .text attribute, and/or is there a way to get the script to read the text so that when it sees a special character like # it skips over it, and when it sees & it skips until it sees ;?
Because that is basically how this website scrambles the text. Here is an example; the text shown when you inspect the element is:
supremetshirt
which is supposed to say "supreme t-shirt", and so on (you get the idea; they don't use letters to scramble, only numbers and special keys).
This is highlighted in a box automatically when you inspect the element using a VPN on the UK Supreme website, and it is different from the rendered text (which isn't highlighted at all). Whenever I run my script without the proxy code against my local supremenewyork.com it works fine, but only because the code isn't scrambled on my local site, and I want to pull this info from the UK website. Any ideas? Here is my code:
import requests
from bs4 import BeautifulSoup

categorys = ['jackets', 'shirts', 'tops_sweaters', 'sweatshirts', 'pants', 'shorts', 't-shirts', 'hats', 'bags', 'accessories', 'shoes', 'skate']
catNumb = 0

# use a new proxy every so often for testing (will add something that pulls proxies and uses them for you)
UK_Proxy1 = '51.143.153.167:80'
proxies = {
    'http': 'http://' + UK_Proxy1,
    'https': 'https://' + UK_Proxy1,
}

for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl, proxies=proxies).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"' + catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        #name = item_soup.find('h1', itemprop='name')
        style = item_soup.find('p', itemprop='model').text
        #style = item_soup.find('p', itemprop='model')
        print(alt + ' --- ' + name + ' --- ' + style)
        #print(alt)
        #print(str(name))
        #print(str(style))
When I run this script I get this error:
name = item_soup.find('h1', itemprop='name').text
AttributeError: 'NoneType' object has no attribute 'text'
So what I did was uncomment the commented-out lines above (and comment out the similar lines they replace), and then I get some kind of str error, which is why I tried print(str(name)). I am able to print the alt fine (with every script; the alt is not scrambled), but when it comes to printing the name and style, all that is printed is None under every alt.
I have been working on fixing this for days and have come up with no solutions. Can anyone help me solve this?
I have since solved it myself using this solution:
thetable = soup5.find('div', class_='turbolink_scroller')
items = thetable.find_all('div', class_='inner-article')
for item in items:
    alt = item.find('img')['alt']
    name = item.h1.a.text
    color = item.p.a.text
    print(alt, ' --- ', name, ' --- ', color)
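As an aside: if the scrambling really is done with HTML character entities (the &...; sequences described in the question), Python's built-in html module can decode them, which may be simpler than skipping characters manually. A tiny sketch with a made-up input:

from html import unescape

scrambled = 'supreme&#32;t&#45;shirt'  # hypothetical entity-scrambled string
print(unescape(scrambled))             # -> supreme t-shirt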

PyQt stacked widget not moving to next page until function ends

Hi, I have a program where, when a button is pressed, it should move to the next stacked widget page, replace some text in some labels, and then execute some functions. But this is not working: it only moves to the next page when the functions complete.
The code is:
QtCore.QObject.connect(self.StartBtn, QtCore.SIGNAL(_fromUtf8("clicked()")), self.start)  # Start

def nextPage(self):
    current_page = self.stackedWidget.currentIndex()
    i = int(current_page) + 1
    self.stackedWidget.setCurrentIndex(i)

def start(self):
    self.nextPage()
    self.animation()
    self.runFunctions()

def runFunctions(self):
    try:
        self.DbLabel.setText(_translate("MainWindow", "Checking Database", None))
        if checkDb == True:
            self.DbLabel.setText(_translate("MainWindow", "Checking Database ", None))
            self.checkDbFun()
            self.DbLabel.setText(_translate("MainWindow", "Database checked", None))
        else:
            self.checkedDbImg.setPixmap(QtGui.QPixmap(_fromUtf8("Files\\x.png")))
            self.DbLabel.setText(_translate("MainWindow", "Database not checked", None))
    except Exception as e:
        self.AlertMessage(e)

def animation(self):
    self.LoadingGif = QtGui.QLabel(MainWindow)
    movie = QtGui.QMovie("Files\\loading.png")
    self.LoadingGif.setMovie(movie)
    self.LoadingGif.setAlignment(QtCore.Qt.AlignCenter)
    self.gridLayout_2.addWidget(self.LoadingGif, 4, 1, 1, 1)
    movie.start()
So what I want is to press StartBtn, then move to the next stacked widget page, load the animation image, and then run the functions.
You probably need to let Qt process events in order for the tab change to take effect. You could do that in two ways:
insert a qApp.processEvents() between the animation() and runFunctions() calls (qApp is in PyQt5.QtWidgets)
call runFunctions() via a single-shot timer: QTimer.singleShot(0, runFunctions), which will schedule runFunctions via the event loop, so any pending events will be processed first (because runFunctions() was added last), then runFunctions() will get called; see the sketch after the code below. If you actually have params for runFunctions(), use a lambda.
I favor the first approach because I find it more clearly indicates what is happening (events need to be processed), but I recommend also adding a comment on that line saying "so stack tab can change".
BTW, you should use the new-style syntax for signal-slot connections, which is much cleaner, of the form signal.connect(slot):
self.StartBtn.clicked.connect(self.start)
So for approach #1 your code would look like this:
from PyQt5.QtWidgets import qApp
...
self.StartBtn.clicked.connect(self.start)
...
def start(self):
    self.nextPage()
    self.animation()
    qApp.processEvents()
    self.runFunctions()
...
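And for completeness, approach #2 would look roughly like this (QTimer lives in PyQt5.QtCore):

from PyQt5.QtCore import QTimer
...
def start(self):
    self.nextPage()
    self.animation()
    # schedule runFunctions through the event loop: pending events,
    # including the stacked-widget page change, are processed first
    QTimer.singleShot(0, self.runFunctions)
...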

Generate an HTML report based on user interactions in shiny

I have a shiny application that allows my user to explore a dataset. The idea is that the user explores the dataset, and shares anything interesting he finds with his client via email. I don't know in advance how many things the user will find interesting. So, next to each table or chart I have an "add this item to the report" button, which isolates the current view and adds it to a reactiveValues list.
Now, what I want to do is the following:
Loop through all the items in the reactiveValues list,
Generate some explanatory text describing the item (This text should preferably be formatted HTML/markdown, rather than code comments)
Display the item
Capture the output of this loop as HTML
Display this HTML in Shiny as a preview
write this HTML to a file
knitr seems to do exactly the reverse of what I want - where knitr allows me to add interactive shiny components in an otherwise static document, I want to generate HTML in shiny (maybe using knitr, I don't know) based on static values the user has created.
I've constructed a minimum not-working example below to try to indicate what I would like to do. It doesn't work, it's just for demonstration purposes.
ui = shinyUI(fluidPage(
  title = "Report generator",
  sidebarLayout(
    sidebarPanel(textInput("numberinput", "Add a number", value = 5),
                 actionButton("addthischart", "Add the current chart to the report")),
    mainPanel(plotOutput("numberplot"),
              htmlOutput("report"))
  )
))

server = shinyServer(function(input, output, session){
  # ensure I can plot
  library(ggplot2)
  # make a holder for my stored data
  values = reactiveValues()
  values$Report = list()
  # generate the plot
  myplot = reactive({
    df = data.frame(x = 1:input$numberinput, y = (1:input$numberinput)^2)
    p = ggplot(df, aes(x = x, y = y)) + geom_line()
    return(p)
  })
  # display the plot
  output$numberplot = renderPlot(myplot())
  # when the user clicks a button, add the current plot to the report
  observeEvent(input$addthischart, {
    chart = isolate(myplot)
    isolate(values$Report <- c(values$Report, list(chart)))
  })
  # make the report
  myreport = eventReactive(input$addthischart, {
    reporthtml = character()
    if(length(values$Report) > 0){
      for(i in 1:length(values$Report)){
        explanatorytext = tags$h3(paste("Now please direct your attention to plot number", i, "\n"))
        chart = values$Report[[i]]()
        theplot = HTML(chart) # this does not work - this is the crux of my question - what should I do here?
        reporthtml = c(reporthtml, explanatorytext, theplot)
        # ideally, at this point, the output would be an HTML file that includes some header text, as well as a plot
        # I made this example to show what I hoped would work. Clearly, it does not work. I'm asking for advice on an alternative approach.
      }
    }
    return(reporthtml)
  })
  # display the report
  output$report = renderUI({
    myreport()
  })
})

runApp(list(ui = ui, server = server))
You could capture the HTML of your page using html2canvas and then save the captured portion of the DOM as an image, as described in this answer. That way, your client can embed it in any HTML document without worrying about the origin of the page contents.