Python Full Web Parsing - html

Right now I'm attempting to make a simple music player app that streams music or video directly from a YouTube URL, and to do that I need the full download of the search results page that's used to find videos to stream. But I'm having problems with urlopen (from urllib.request) in Python 3, which is what I'm using to build this command-line application: it won't load the ytd-app tag on YouTube, which is where a good deal of the video and playlist references live when the search page first loads. Does anyone know what's going on, or know of some kind of workaround? Thanks!
My code so far:
BASICURL = "https://www.youtube.com/results?"
query = query.split()
ret = ""
stufffound = {}
for x in query:
    ret = ret + x + "+"
ret = ret[:len(ret) - 1]

# URL BUILDER
if filtercriteria:
    URL = BASICURL + "sp={0}".format(filtercriteria) + "&search_query={0}".format(ret)
else:
    URL = BASICURL + "search_query={0}".format(ret)
query = urlopen(str(URL))
passdict = {}

def findvideosonpage(query, dictToAddTo):
    for x in BS(urlopen(query).read()).findAll(attrs={'class': 'yt-simple-endpoint style-scope ytd-video-renderer'}):
        dictToAddTo[query.index(x)] = x['href']
        print(x)
    return list([x for _, x in sorted(zip(dictToAddTo.values(), dictToAddTo.keys()))])
# Dictionary is meant to be converted into a list later to order the results
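One possible workaround (my addition, not part of the original question) is to let a full browser render the page before parsing, since urlopen only returns the initial HTML and the ytd-app contents are filled in by JavaScript afterwards. A minimal sketch, assuming selenium, ChromeDriver and BeautifulSoup are available; the selector used below is only illustrative, since YouTube's markup changes often:

# Workaround sketch (not from the original post): render the search page in a real
# browser, then hand the fully-built DOM to BeautifulSoup.
from urllib.parse import urlencode

from bs4 import BeautifulSoup
from selenium import webdriver

def search_youtube(query, filtercriteria=None):
    params = {"search_query": query}
    if filtercriteria:
        params["sp"] = filtercriteria
    url = "https://www.youtube.com/results?" + urlencode(params)

    driver = webdriver.Chrome()  # assumes a ChromeDriver is available
    try:
        driver.get(url)  # JavaScript builds the ytd-app contents here
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()

    # 'a#video-title' is only an illustrative selector for the rendered results
    return [a.get("href") for a in soup.select("a#video-title") if a.get("href")]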

Difficulties with web scraping

I just came across an article called The 500 Greatest Songs of All Time and thought "oh that's cool, I bet they also made a Spotify/Apple Music list that I can follow". Well... they don't.
So, in a nutshell, I wonder if it's possible to 1) scrape the website to extract the songs and 2) then do some kind of bulk upload to Spotify to create the list.
Songs' titles and artists are structured like this on the website:
[website screenshot] I have already tried to scrape the site with the importxml() formula in Google Sheets, but with no success.
I understand the scraping part is easier than the other and, as I am new to programming, I would be happy to manage to even partially achieve this goal. I am sure this task can be achieved easily in Python.
I feel like explaining everything would go beyond the scope here, so I tried to comment the code well enough.
1. Scrape the songs
I used Python 3 and Selenium; their website doesn't block that.
Be sure to adjust your chromedriver path, and the output path of the .txt file at the bottom, if necessary. Once it's done and you have your .txt file, you can close it.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

s = Service(r'/Users/main/Desktop/chromedriver')
driver = webdriver.Chrome(service=s)

# just setting some vars, I used XPath because that's what I know
top_500 = 'https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/'
cookie_button_xpath = "//button[@id = 'onetrust-accept-btn-handler']"
div_containing_links_xpath = "//div[@id = 'pmc-gallery-list-nav-bar-render']//child::a"
song_names_xpath = "//article[@class = 'c-gallery-vertical-album']/child::h2"
links = []
songs = []

driver.get(top_500)

# accept cookies, give time to load
time.sleep(3)
cookie_btn = driver.find_element(By.XPATH, cookie_button_xpath)
cookie_btn.click()
time.sleep(1)

# extracting all the links since there are only 50 songs per page
links_to_next_pages = driver.find_elements(By.XPATH, div_containing_links_xpath)
for element in links_to_next_pages:
    l = element.get_attribute('href')
    links.append(l)

# extracting the songs, then going to the next page and so on until we hit 500
counter = 1  # we're starting with 1 here since links[0] is the current page we are already on
while True:
    elements = driver.find_elements(By.XPATH, song_names_xpath)
    for element in elements:
        s = element.text
        songs.append(s)
    if len(songs) == 500:
        break
    driver.get(links[counter])
    counter += 1
    time.sleep(2)

# verify that there are no duplicates; if there were, something would be off
if len(songs) != len(set(songs)):
    print('you f***** up')
else:
    print('seems fine')

with open('/Users/main/Desktop/output_songs.txt', 'w') as file:
    file.writelines(line + '\n' for line in songs)
2. Prepare Spotify
Go to the Spotify Developer Dashboard and create an account (use your Spotify account).
Then create an app, call it whatever you want.
On your app, click settings and whitelist http://localhost:8888/callback
On your app, click "users and access" and add your Spotify account
Leave the tab open, we'll come back to it
3. Prepare Your Environment
You need Node.js, so make sure it is installed on your machine
Download this from Spotify's GitHub
Unzip it, cd into the folder and run npm install
Go into the authorization_code folder and open app.js in an editor
Find var scope and append ' playlist-modify-public' to the string; this is so that your app can access your Spotify playlists, see here
Now go back to the app in your Spotify Developer Dashboard; we'll need to copy the Client ID and the Client Secret into var client_id and var client_secret respectively (in the app.js file). var redirect_uri will be http://localhost:8888/callback - don't forget to save your changes.
4. Run the Spotify side of things
cd into the authorization_code folder and run app.js with node app.js (this is basically a server running on your PC)
If that works, leave it running, go to http://localhost:8888 and authorise your Spotify account there
Copy the full token there, including the part that overflows; use inspect element to get it
Adjust the user_id and auth variables, as well as the path to output_songs.txt (at with open), in the following Python script and run it. Songs which are not found will be printed out at the end; give them a search on Google. They are usually on Spotify as well, but Google seems to have the better search algorithm (surprised Pikachu face).
import requests
import re
import json

# this is NOT your display name, it's your user name!!
user_id = 'YOUR_USERNAME'
# paste your auth token from spotify; it can time out, then you have to get a new one,
# so don't panic if you get a bunch of responses in the 400s after some time
auth = {"Authorization": "Bearer YOUR_AUTH_KEY_FROM_LOCALHOST"}

playlist = []
err_log = []
base_url = 'https://api.spotify.com/v1'
search_method = '/search'

with open('/Users/main/Desktop/output_songs.txt', 'r') as file:
    songs = file.readlines()

# this queries spotify, does some magic and then appends the track's spotify uri to an array
def query_song_uris():
    for n, entry in enumerate(songs):
        x = re.findall(r"'([^']*)'", entry)
        title_len = len(entry) - len(x[0]) - 4
        title = x[0]
        artist = entry[:title_len]
        payload = {
            'q': (entry),
            'track:': (title),
            'artist:': (artist),
            'type': 'track',
            'limit': 1
        }
        url = base_url + search_method
        try:
            r = requests.get(url, params=payload, headers=auth)
            print('\nquerying spotify; ', r)
            c = r.content.decode('UTF-8')
            dic = json.loads(c)
            track_uri = dic["tracks"]["items"][0]["uri"]
            playlist.append(track_uri)
            print(track_uri)
        except Exception:
            err = f'\nNr. {(len(songs)-n)}: ' + f'{entry}'
            err_log.append(err)
    playlist.reverse()

query_song_uris()

# creates a playlist and returns the playlist id
def create_playlist():
    payload = {
        "name": "Rolling Stone: Top 500 (All Time)",
        "description": "music for old men xD with occasional hip hop appearances. just kidding"
    }
    url = base_url + f'/users/{user_id}/playlists'
    r = requests.post(url, headers=auth, json=payload)
    c = r.content.decode('UTF-8')
    dic = json.loads(c)
    print(f'\n\ncreating playlist #{dic["id"]}; ', r)
    return dic["id"]

# adds the collected uris to the new playlist, 100 at a time
def add_to_playlist():
    playlist_id = create_playlist()
    while True:
        if len(playlist) > 100:
            p = playlist[:100]
        else:
            p = playlist
        payload = {"uris": (p)}
        url = base_url + f'/playlists/{playlist_id}/tracks'
        r = requests.post(url, headers=auth, json=payload)
        print(f'\nadding {len(p)} songs to playlist; ', r)
        del playlist[:len(p)]
        if len(playlist) == 0:
            break

add_to_playlist()
print('\n\ncheck your spotify :)')

print("\n\n\nthese tracks didn't make it, check manually:\n")
for line in err_log:
    print(line)
print('\n\n')
Done
If you don't want to run the code yourself, here's the playlist:
https://open.spotify.com/playlist/5fdLKYNFlA4XSvhEl36KXS
If you have trouble, everything from step 2 onwards is also described in the Web API quick start or, more generally, in the Web API docs.
Regarding Apple Music
So Apple seems very closed up (surprise, haha). What I found, though, is that you can query the iTunes store. The response also contains a direct link to the song(s) on Apple Music.
You might be able to go from there.
Get ISRC code from iTunes Search API (Apple music)
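For illustration (my addition, not part of the original answer), a minimal sketch of querying the iTunes Search API with requests; the store link for each result is usually in the trackViewUrl field:

# Sketch: look a song up in the iTunes Search API and print its store link.
import requests

def itunes_lookup(term, limit=1):
    r = requests.get(
        "https://itunes.apple.com/search",
        params={"term": term, "media": "music", "entity": "song", "limit": limit},
    )
    r.raise_for_status()
    for result in r.json().get("results", []):
        # trackViewUrl points at the track in the iTunes / Apple Music store
        print(result.get("artistName"), "-", result.get("trackName"), result.get("trackViewUrl"))

itunes_lookup("Bob Dylan Like a Rolling Stone")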
PS: undeniably regex is witchcraft, but y'all here got my back

Corona SDK JSON usage

I have a working JSON library which I use to load an array of tile IDs. When I double-click main.lua directly from the file explorer, it runs great, but when I open the Corona Simulator and open my project from there, or build my project and run it on my testing device, it gives me a nil reference error when I attempt to use the data I loaded.
Here is the function to load a table from a JSON file:
function fileIO.loadJSONFile (fileName)
    local path = fileName
    local contents = ""
    local loadingTable = {}
    local file = io.open (path, "r")
    print (file)
    if file then
        local contents = file:read ("*a")
        loadingTable = json.decode (contents)
        io.close (file)
        return loadingTable
    end
    return nil
end
Here is the usage:
function wr:renderChunkFile (path)
    local data = fileIO.loadJSONFile (path)
    self:renderChunk (data)
end

function wr:renderChunk (data)
    local a, b = 1
    if (self.img ~= nil) then
        a = #self.img + 1
        self.img[a] = {}
    else
        self.img[1] = {}
    end
    if (self.chunks ~= nil) then
        b = #self.chunks + 1
        self.chunks[b] = display.newGroup ()
    else
        self.chunks[1] = display.newGroup ()
    end
    for i = 1, #data do -- Y axis ERROR IS HERE
        self.img[a][i] = {}
        for j = 1, #data[i] do -- Z axis
            self.img[a][i][j] = {}
            for k = 1, #data[i][j] do -- X axis
                if (data[i + 1] ~= nil) then
                    if (data[i + 1][j][k] < self.transparentLimit) then
                        self.img[a][i][j][k] = display.newImage ("images/tiles/"..data[i][j][k]..".png", k*self.tileWidth, display.contentHeight -j*self.tileDepth - i*self.tileThickness)
                        self.chunks[b]:insert (self.img[a][i][j][k])
                    elseif (data[i + 1] == nil) then
                        self.img[a][i][j][k] = display.newImage ("images/tiles/"..data[i][j][k]..".png", k*self.tileWidth, display.contentHeight -j*self.tileDepth - i*self.tileThickness)
                        self.chunks[b]:insert (self.img[a][i][j][k])
                    end
                end
            end
        end
    end
end
When it gets to the line for i = 1, #data do it tells me it is trying to access the length of a nil field. Where did I go wrong here?
EDIT: I feel the need to give a more clear explanation of what my problem is. I am getting inconsistent results from this program. When I select main.lua in file explorer and open it with Corona Simulator, it works. When I open Corona Simulator and internally navigate to main.lua, it does not work. When I build the project and test it on my device, it does not work. What I really need is some insight into Corona's JSON library and APK internal directory structure requirements (directory nesting limits, naming restrictions, etc.). If someone thinks of something else that might cause the issue I am having, please bring it up! I am open to anything.
Without seeing the entire error message and not knowing what the value of "path" is, it's going to be hard to speculate. But Corona SDK uses four base directories:
system.ResourceDirectory -- Same folder as main.lua and is read-only
system.DocumentsDirectory -- Your writable folder where your data lives
system.CachesDirectory -- for downloaded files
system.TemporaryDirectory -- for temp files.
The last three, while in the simulator, are in the project's Sandbox master folder. On a device, who knows where the folders really are.
In your case, if your JSON file is going to be included with your downloadable app, the .json file should be in the same folder as your main.lua (or a subfolder) and referenced in system.ResourceDirectory.

Image uploads with Pyramid and SQLAlchemy

How should one do image file uploads with Pyramid, SQLAlchemy and deform? Preferably so that one can easily get image thumbnail tags in the templates. What configuration is needed (storing images on a file-system backend, and so on)?
This question is by no means asking just one specific thing. Here, however, is a view which defines a form upload with deform, tests the input for a valid image file, saves a record to a database, and then even uploads it to Amazon S3. This example is shown below the links to the various pieces of documentation I have referenced.
To upload a file with deform, see the documentation.
If you want to learn how to save image files to disk, see this article or the official documentation.
Then, if you want to learn how to save new items with SQLAlchemy, see the SQLAlchemy tutorial.
If you want to ask a better question where a more precise, detailed answer can be given for each section, then please do so.
@view_config(route_name='add_portfolio_item',
             renderer='templates/user_settings/deform.jinja2',
             permission='view')
def add_portfolio_item(request):
    user = request.user

    # define a store for uploaded files
    class Store(dict):
        def preview_url(self, name):
            return ""

    store = Store()

    # create a form schema
    class PortfolioSchema(colander.MappingSchema):
        description = colander.SchemaNode(colander.String(),
            validator=Length(max=300),
            widget=text_area,
            title="Description, tell us a few short words describing your picture")
        upload = colander.SchemaNode(
            deform.FileData(),
            widget=widget.FileUploadWidget(store))

    schema = PortfolioSchema()
    myform = Form(schema, buttons=('submit',), action=request.url)

    # if the form has been submitted
    if 'submit' in request.POST:
        controls = request.POST.items()
        try:
            appstruct = myform.validate(controls)
        except ValidationFailure as e:
            return {'form': e.render(), 'values': False}

        # Data is valid as far as colander goes
        f = appstruct['upload']
        upload_filename = f['filename']
        extension = os.path.splitext(upload_filename)[1]
        image_file = f['fp']

        # Now we test for a valid image upload
        image_test = imghdr.what(image_file)
        if image_test == None:
            error_message = "I'm sorry, the image file seems to be invalid"
            return {'form': myform.render(), 'values': False, 'error_message': error_message,
                    'user': user}

        # generate date and random timestamp
        pub_date = datetime.datetime.now()
        random_n = str(time.time())
        filename = random_n + '-' + user.user_name + extension

        upload_dir = tmp_dir
        output_file = open(os.path.join(upload_dir, filename), 'wb')
        image_file.seek(0)
        while 1:
            data = image_file.read(2 << 16)
            if not data:
                break
            output_file.write(data)
        output_file.close()

        # we need to create a thumbnail for the user's profile pic
        basewidth = 530
        max_height = 200

        # open the image we just saved
        root_location = open(os.path.join(upload_dir, filename), 'r')
        image = pilImage.open(root_location)
        if image.size[0] > basewidth:
            # if the image width is greater than the base width
            # we need to reduce its size
            wpercent = (basewidth / float(image.size[0]))
            hsize = int((float(image.size[1]) * float(wpercent)))
            portfolio_pic = image.resize((basewidth, hsize), pilImage.ANTIALIAS)
        else:
            # else the image can stay the same size as it is;
            # assign the portfolio_pic var to the image
            portfolio_pic = image

        portfolio_pics_dir = os.path.join(upload_dir, 'work')
        quality_val = 90
        output_file = open(os.path.join(portfolio_pics_dir, filename), 'wb')
        portfolio_pic.save(output_file, quality=quality_val)
        profile_full_loc = portfolio_pics_dir + '/' + filename

        # S3 stuff
        new_key = user.user_name + '/portfolio_pics/' + filename
        key = bucket.new_key(new_key)
        key.set_contents_from_filename(profile_full_loc)
        key.set_acl('public-read')
        public_url = key.generate_url(0, query_auth=False, force_http=True)

        output_dir = os.path.join(upload_dir)
        output_file = output_dir + '/' + filename
        os.remove(output_file)
        os.remove(profile_full_loc)

        new_image = Image(s3_key=new_key, public_url=public_url,
                          pub_date=pub_date, bucket=bucket_name, uid=user.id,
                          description=appstruct['description'])
        DBSession.add(new_image)
        # add the new entry to the association table.
        user.portfolio.append(new_image)
        return HTTPFound(location=route_url('list_portfolio', request))

    return dict(user=user, form=myform.render())

How to obtain a list of titles of all Wikipedia articles

I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia-powered wiki. One would be the API and the other one would be a database dump.
I'd prefer not to download the wiki dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only a list of the article titles, and even if I could, it would need > 4 million requests, which would probably get me blocked from any further requests anyway.
So my question is:
Is there a way to obtain only the titles of Wikipedia articles via the API?
Is there a way to combine multiple requests/queries into one? Or do I actually have to download a Wikipedia dump?
The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
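For illustration (my addition, not part of the answer above), a minimal requests-based sketch of paging through the allpages module with aplimit=max, following apcontinue until the listing is exhausted:

# Sketch: iterate over all article (namespace 0) titles via the allpages module.
import requests

def all_titles(api="https://en.wikipedia.org/w/api.php"):
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apnamespace": 0,
        "apfilterredir": "nonredirects",  # skip redirect pages
        "aplimit": "max",
    }
    session = requests.Session()
    while True:
        data = session.get(api, params=params).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:
            break  # no more pages to fetch
        params["apcontinue"] = data["continue"]["apcontinue"]

# example usage:
# for title in all_titles():
#     print(title)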
Right now, as per the current statistics, the number of articles is around 5.8M.
To get the list of pages I used the AllPages API. However, the number of pages I get is around 14.5M, which is ~3 times what I was expecting. I restricted myself to namespace 0 to get the list. The following is the sample code that I am using:
# get the list of all wikipedia pages (articles) -- English
import sys
from simplemediawiki import MediaWiki

listOfPagesFile = open("wikiListOfArticles_nonredirects.txt", "w")

wiki = MediaWiki('https://en.wikipedia.org/w/api.php')

continueParam = ''
requestObj = {}
requestObj['action'] = 'query'
requestObj['list'] = 'allpages'
requestObj['aplimit'] = 'max'
requestObj['apnamespace'] = '0'

pagelist = wiki.call(requestObj)
pagesInQuery = pagelist['query']['allpages']

for eachPage in pagesInQuery:
    pageId = eachPage['pageid']
    title = eachPage['title'].encode('utf-8')
    writestr = str(pageId) + "; " + title + "\n"
    listOfPagesFile.write(writestr)

numQueries = 1

while len(pagelist['query']['allpages']) > 0:
    requestObj['apcontinue'] = pagelist["continue"]["apcontinue"]
    pagelist = wiki.call(requestObj)
    pagesInQuery = pagelist['query']['allpages']

    for eachPage in pagesInQuery:
        pageId = eachPage['pageid']
        title = eachPage['title'].encode('utf-8')
        writestr = str(pageId) + "; " + title + "\n"
        listOfPagesFile.write(writestr)
        # print writestr

    numQueries += 1
    if numQueries % 100 == 0:
        print "Done with queries -- ", numQueries
        print numQueries

listOfPagesFile.close()
The number of queries fired is around 28,900, which results in approximately 14.5M page names.
I also tried the all-titles link mentioned in the answer above. In that case as well I am getting around 14.5M pages.
I thought that this overestimate of the actual number of pages was because of the redirects, so I added the 'nonredirects' option to the request object:
requestObj['apfilterredir'] = 'nonredirects'
After doing that I get only 112,340 pages, which is far too small compared to 5.8M.
With the above code I was expecting roughly 5.8M pages, but that doesn't seem to be the case.
Is there any other option that I should be trying to get the actual (~5.8M) set of page names?
Here is an asynchronous program that will generate MediaWiki page titles:
async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    url = "{}/w/api.php".format(wiki)
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
        "apfrom": "",
    }

    while True:
        content = await get(http, url, params=params)
        if content is None:
            continue
        content = json.loads(content)

        for page in content["query"]["allpages"]:
            yield page["title"]

        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            return
        else:
            params["apfrom"] = apcontinue

Capture loaded source of audio tag, using Ruby on Rails

I need to save the currently-loaded source file of an audio tag. Sounds simple, but here's the catch: the source gives a random sound file on every request.
The audio tag is created, the source set, and the audio played with JavaScript, as seen here:
function createAudio() {
    var audio = document.createElement('audio');
    audio.setAttribute('id', 'file_audio');
    audio.setAttribute('controls', 'controls');
    audio.setAttribute('autoplay', 'true');
    audio.setAttribute('hidden', 'true');
    audio.appendChild(createSource());
    return audio;
}

function createSource() {
    var source = document.createElement('source');
    var d = new Date();
    source.setAttribute('id', 'file_audio_source');
    source.setAttribute('src', 'file.wav?r=' + d.getTime());
    source.setAttribute('type', 'audio/wav');
    return source;
}

this.switchAudio = function() {
    var d = new Date();
    $svjq("#file_audio").find('audio').remove();
    $svjq("#file_audio").find('source').remove();
    $svjq("#file_audio").find('embed').remove();
    if (Modernizr.audio.wav) {
        document.getElementById("file_audio").appendChild(createAudio());
    } else {
        $svjq("#file_audio").append('<embed id="file_audio_embed" name="file_audio_embed" src="file.wav?r=' + d.getTime() + '" autostart="true" cache="false" type="audio/wav" hidden="true" loop="false" enablejavascript="true">');
    }
};

this.playAgain = function() {
    if (Modernizr.audio.wav) {
        document.getElementById('file_audio').play();
    } else {
        document.getElementById('file_audio_embed').play();
    }
};
I need to be able to save the currently-loaded file in the source. However, if you access the file URL in the browser it returns a different file.
Automated processes such as Watir-WebDriver, Capybara (Capybara-Webkit), and Mechanize also return a random file. For example:
require 'capybara'
session = Capybara::Session.new(:selenium)
session.visit('url')
session.click_link 'play sound'  # on every click you get a new sound
session.click_link 'play again'

# file_audio_source
e = session.find_by_id('file_audio_source')
e[:src]

# save the currently open page and open it
# session.save_and_open_page

# returns a different file
session.visit(e[:src])
# returns a different file
session.execute_script("window.open('" + e[:src] + "')")

require 'mechanize'
agent = Mechanize.new { |agent| agent.ssl_version, agent.verify_mode = 'SSLv3', OpenSSL::SSL::VERIFY_NONE }
filedata = agent.get(e[:src]).content
aFile = File.new("/Users/me/Documents/test/test111.wav", 'wb')
# aFile.syswrite(filedata)
Could the file be embedded into the HTML or cached? And is there a way to get the file and save it locally?
Other options include recording from the sound device or using the mic to record the sound played, though this option is not at all ideal.
opt. 1:
require 'capybara'
session = Capybara::Session.new(:selenium)
session.visit('url')
session.click_link 'Play sound'  # this gets the file into the cache, then use the code below to get it out
opt. 2:
# execute the javascript that loads the file / creates the sound url, without playing the sound
session.execute_script("document.getElementById('file_audio').appendChild(createSource());")
e = session.find_by_id("file_audio_source")
session.visit(e[:src])
Watir and Capybara perform great! :)
The problem now is to make it headless, and it seems that a headless browser doesn't act the same as a non-headless one.
A method to give headless functionality:
def headless_get_file url
  require 'uri'
  res = @session.driver.cookies
  agent = Mechanize.new { |agent| agent.ssl_version, agent.verify_mode = 'SSLv3', OpenSSL::SSL::VERIFY_NONE }
  uri = URI('https://....')
  res.keys.each do |i|
    temp = res[i]
    cookie = Mechanize::Cookie.new(i, temp.value)
    cookie.domain = temp.domain
    cookie.path = temp.path
    agent.cookie_jar.add(uri, cookie)
  end
  filedata = agent.get(url).content
  aFile = File.new("#{dir}/file.wav", 'wb')
  aFile.syswrite(filedata)
end
Could the file be embedded into the HTML or cached?
Yes, it can! See: Is it possible to use data URIs in video and audio tags?
<audio controls="controls" autobuffer="autobuffer" autoplay="autoplay">
<source src="data:audio/wav;base64,UklGRhwMAABXQVZFZm10IBAAAAABAAEAgD4AAIA+AAABAAgAZGF0Ya4LAACAgICAgICAgICAgICAgICAgICAgICAgICAf3hxeH+AfXZ1eHx6dnR5fYGFgoOKi42aloubq6GOjI2Op7ythXJ0eYF5aV1AOFFib32HmZSHhpCalIiYi4SRkZaLfnhxaWptb21qaWBea2BRYmZTVmFgWFNXVVVhaGdbYGhZbXh1gXZ1goeIlot1k6yxtKaOkaWhq7KonKCZoaCjoKWuqqmurK6ztrO7tbTAvru/vb68vbW6vLGqsLOfm5yal5KKhoyBeHt2dXBnbmljVlJWUEBBPDw9Mi4zKRwhIBYaGRQcHBURGB0XFxwhGxocJSstMjg6PTc6PUxVV1lWV2JqaXN0coCHhIyPjpOenqWppK6xu72yxMu9us7Pw83Wy9nY29ve6OPr6uvs6ezu6ejk6erm3uPj3dbT1sjBzdDFuMHAt7m1r7W6qaCupJOTkpWPgHqAd3JrbGlnY1peX1hTUk9PTFRKR0RFQkRBRUVEQkdBPjs9Pzo6NT04Njs+PTxAPzo/Ojk6PEA5PUJAQD04PkRCREZLUk1KT1BRUVdXU1VRV1tZV1xgXltcXF9hXl9eY2VmZmlna3J0b3F3eHyBfX+JgIWJiouTlZCTmpybnqSgnqyrqrO3srK2uL2/u7jAwMLFxsfEv8XLzcrIy83JzcrP0s3M0dTP0drY1dPR1dzc19za19XX2dnU1NjU0dXPzdHQy8rMysfGxMLBvLu3ta+sraeioJ2YlI+MioeFfX55cnJsaWVjXVlbVE5RTktHRUVAPDw3NC8uLyknKSIiJiUdHiEeGx4eHRwZHB8cHiAfHh8eHSEhISMoJyMnKisrLCszNy8yOTg9QEJFRUVITVFOTlJVWltaXmNfX2ZqZ21xb3R3eHqAhoeJkZKTlZmhpJ6kqKeur6yxtLW1trW4t6+us7axrbK2tLa6ury7u7u9u7vCwb+/vr7Ev7y9v8G8vby6vru4uLq+tri8ubi5t7W4uLW5uLKxs7G0tLGwt7Wvs7avr7O0tLW4trS4uLO1trW1trm1tLm0r7Kyr66wramsqaKlp52bmpeWl5KQkImEhIB8fXh3eHJrbW5mYGNcWFhUUE1LRENDQUI9ODcxLy8vMCsqLCgoKCgpKScoKCYoKygpKyssLi0sLi0uMDIwMTIuLzQ0Njg4Njc8ODlBQ0A/RUdGSU5RUVFUV1pdXWFjZGdpbG1vcXJ2eXh6fICAgIWIio2OkJGSlJWanJqbnZ2cn6Kkp6enq62srbCysrO1uLy4uL+/vL7CwMHAvb/Cvbq9vLm5uba2t7Sysq+urqyqqaalpqShoJ+enZuamZqXlZWTkpGSkpCNjpCMioqLioiHhoeGhYSGg4GDhoKDg4GBg4GBgoGBgoOChISChISChIWDg4WEgoSEgYODgYGCgYGAgICAgX99f398fX18e3p6e3t7enp7fHx4e3x6e3x7fHx9fX59fn1+fX19fH19fnx9fn19fX18fHx7fHx6fH18fXx8fHx7fH1+fXx+f319fn19fn1+gH9+f4B/fn+AgICAgH+AgICAgIGAgICAgH9+f4B+f35+fn58e3t8e3p5eXh4d3Z1dHRzcXBvb21sbmxqaWhlZmVjYmFfX2BfXV1cXFxaWVlaWVlYV1hYV1hYWVhZWFlaWllbXFpbXV5fX15fYWJhYmNiYWJhYWJjZGVmZ2hqbG1ub3Fxc3V3dnd6e3t8e3x+f3+AgICAgoGBgoKDhISFh4aHiYqKi4uMjYyOj4+QkZKUlZWXmJmbm52enqCioqSlpqeoqaqrrK2ur7CxsrGys7O0tbW2tba3t7i3uLe4t7a3t7i3tre2tba1tLSzsrKysbCvrq2sq6qop6alo6OioJ+dnJqZmJeWlJKSkI+OjoyLioiIh4WEg4GBgH9+fXt6eXh3d3V0c3JxcG9ubWxsamppaWhnZmVlZGRjYmNiYWBhYGBfYF9fXl5fXl1dXVxdXF1dXF1cXF1cXF1dXV5dXV5fXl9eX19gYGFgYWJhYmFiY2NiY2RjZGNkZWRlZGVmZmVmZmVmZ2dmZ2hnaGhnaGloZ2hpaWhpamlqaWpqa2pra2xtbGxtbm1ubm5vcG9wcXBxcnFycnN0c3N0dXV2d3d4eHh5ent6e3x9fn5/f4CAgIGCg4SEhYaGh4iIiYqLi4uMjY2Oj5CQkZGSk5OUlJWWlpeYl5iZmZqbm5ybnJ2cnZ6en56fn6ChoKChoqGio6KjpKOko6SjpKWkpaSkpKSlpKWkpaSlpKSlpKOkpKOko6KioaKhoaCfoJ+enp2dnJybmpmZmJeXlpWUk5STkZGQj4+OjYyLioqJh4eGhYSEgoKBgIB/fn59fHt7enl5eHd3dnZ1dHRzc3JycXBxcG9vbm5tbWxrbGxraWppaWhpaGdnZ2dmZ2ZlZmVmZWRlZGVkY2RjZGNkZGRkZGRkZGRkZGRjZGRkY2RjZGNkZWRlZGVmZWZmZ2ZnZ2doaWhpaWpra2xsbW5tbm9ub29wcXFycnNzdHV1dXZ2d3d4eXl6enp7fHx9fX5+f4CAgIGAgYGCgoOEhISFhoWGhoeIh4iJiImKiYqLiouLjI2MjI2OjY6Pj46PkI+QkZCRkJGQkZGSkZKRkpGSkZGRkZKRkpKRkpGSkZKRkpGSkZKRkpGSkZCRkZCRkI+Qj5CPkI+Pjo+OjY6Njo2MjYyLjIuMi4qLioqJiomJiImIh4iHh4aHhoaFhoWFhIWEg4SDg4KDgoKBgoGAgYCBgICAgICAf4CAf39+f35/fn1+fX59fHx9fH18e3x7fHt6e3p7ent6e3p5enl6enl6eXp5eXl4eXh5eHl4eXh5eHl4eXh5eHh3eHh4d3h4d3h3d3h4d3l4eHd4d3h3eHd4d3h3eHh4eXh5eHl4eHl4eXh5enl6eXp5enl6eXp5ent6ent6e3x7fHx9fH18fX19fn1+fX5/fn9+f4B/gH+Af4CAgICAgIGAgYCBgoGCgYKCgoKDgoOEg4OEg4SFhIWEhYSFhoWGhYaHhoeHhoeGh4iHiIiHiImIiImKiYqJiYqJiouKi4qLiouKi4qLiouKi4qLiouKi4qLi4qLiouKi4qLiomJiomIiYiJiImIh4iIh4iHhoeGhYWGhYaFhIWEg4OEg4KDgoOCgYKBgIGAgICAgH+Af39+f359fn18fX19fHx8e3t6e3p7enl6eXp5enl6enl5eXh5eHh5eHl4eXh5eHl4eHd5eHd3eHl4d3h3eHd4d3h3eHh4d3h4d3h3d3h5eHl4eXh5eHl5eXp5enl6eXp7ent6e3p7e3t7fHt8e3x8fHx9fH1+fX59fn9+f35/gH+AgICAgICAgYGAgYKBgoGCgoKDgoOEg4SEhIWFhIWFhoWGhYaGhoaHhoeGh4aHhoeIh4iHiIeHiIeIh4iHiIeIiIiHiIeIh4iHiIiHiIeIh4iHiIeIh4eIh4eIh4aHh4aH
hoeGh4aHhoWGhYaFhoWFhIWEhYSFhIWEhISDhIOEg4OCg4OCg4KDgYKCgYKCgYCBgIGAgYCBgICAgICAgICAf4B/f4B/gH+Af35/fn9+f35/fn1+fn19fn1+fX59fn19fX19fH18fXx9fH18fXx9fH18fXx8fHt8e3x7fHt8e3x7fHt8e3x7fHt8e3x7fHt8e3x7fHt8e3x8e3x7fHt8e3x7fHx8fXx9fH18fX5+fX59fn9+f35+f35/gH+Af4B/gICAgICAgICAgICAgYCBgIGAgIGAgYGBgoGCgYKBgoGCgYKBgoGCgoKDgoOCg4KDgoOCg4KDgoOCg4KDgoOCg4KDgoOCg4KDgoOCg4KDgoOCg4KDgoOCg4KDgoOCg4KDgoOCg4KCgoGCgYKBgoGCgYKBgoGCgYKBgoGCgYKBgoGCgYKBgoGCgYKBgoGCgYKBgoGBgYCBgIGAgYCBgIGAgYCBgIGAgYCBgIGAgYCBgIGAgYCAgICBgIGAgYCBgIGAgYCBgIGAgYCBgExJU1RCAAAASU5GT0lDUkQMAAAAMjAwOC0wOS0yMQAASUVORwMAAAAgAAABSVNGVBYAAABTb255IFNvdW5kIEZvcmdlIDguMAAA" />
</audio>
Is there a way to get it and save it locally?
Yes, you can!
You just have to find the cache directory :)
http://www.digitalmediaminute.com/article/626/viewing-browser-cache-in-firefox
Then write a little code to go and fetch it. This code goes with opt. 1, not opt. 2:
def getlatestdir(newdirs)
  times = Array.new
  newdirs.each_with_index do |newdir, index|
    times[index] = File::mtime(newdir)
  end
  temp = times[0]
  count = 0
  times.each_with_index do |time, index|
    if temp < time
      temp = time
      count = index
    end
  end
  return newdirs[count]
end

def getCacheDir
  # how to get the path:
  # in irb enter
  #   require 'capybara'
  #   session = Capybara::Session.new(:selenium)
  #   session.visit('https://www.google.co.za')
  # --- then open a new tab and enter about:cache
  # copy the disk cache device cache directory (from /var/... to .../T/)
  path = '/var/folders/9x/51cvmc215xx6zy9vd_64sxwc0000gn/T/'
  dirs = Dir.glob(path + '*/')
  newdirs = Array.new
  dirs.each_with_index do |dir, index|
    if (dir.include? 'webdriver-profile')
      newdirs[newdirs.length] = dir
    end
  end
  the_cache_dir = getlatestdir(newdirs) + 'Cache'
  return the_cache_dir
end

def saveFile
  rifffile = ''
  count = 0
  the_cache_dir = getCacheDir
  files = Dir.glob(the_cache_dir + '/*/*/*')
  files.each_with_index do |file, index|
    bytes = open(file, 'rb') { |io| io.read }
    str = bytes[0].to_s + bytes[1].to_s + bytes[2].to_s + bytes[3].to_s
    if (str == 'RIFF')
      count = index
      rifffile = file
      break
    end
  end
  puts rifffile
  filename = 'test123.wav'
  # read file bytes
  bytes = File.open(rifffile, 'rb') { |io| io.read }
  # write the file to the directory
  f = File.new(filename, 'wb')
  f.syswrite(bytes)
  return filename
end
Granted, the above code isn't the greatest or fastest, but it gets the job done.
Other options include recording from the sound device, or using the mic to record the sound played
That would take too long and too much effort :P
In summary, opt. 1 is OK but not great; opt. 2 is far, far better :)
ajt