How to obtain a list of titles of all Wikipedia articles - mediawiki

I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a MediaWiki-powered wiki: the API and a database dump.
I'd prefer not to download the wiki dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only a list of the article titles, and even if I could, it would take more than 4 million requests, which would probably get me blocked from any further requests anyway.
So my questions are:
Is there a way to obtain only the titles of Wikipedia articles via the API?
Is there a way to combine multiple request/queries into one? Or do I actually have to download a Wikipedia dump?

The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
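As a rough sketch of consuming that dump in Python: the snippet below streams a gzipped, one-title-per-line file without unpacking it to disk. It assumes the file starts with a page_title header row and that titles use underscores instead of spaces; iter_titles is an illustrative name, not part of any Wikimedia tooling.

```python
import gzip
import io


def iter_titles(source, header="page_title"):
    """Yield titles from a gzipped dump with one title per line.

    `source` may be a path or a binary file object. The header name is an
    assumption about the dump layout; pass header=None to keep every line.
    """
    with gzip.open(source, mode="rt", encoding="utf-8") as lines:
        for index, line in enumerate(lines):
            title = line.rstrip("\n")
            if index == 0 and title == header:
                continue  # skip the column header, if present
            if title:
                yield title.replace("_", " ")


# Tiny in-memory stand-in for the real dump file.
sample = gzip.compress(
    "page_title\nAlan_Turing\nPython_(programming_language)\n".encode("utf-8")
)
titles = list(iter_titles(io.BytesIO(sample)))
print(titles)
```

For the real file, pass the path of your downloaded copy instead of the in-memory sample. Because the function is a generator, memory use stays flat regardless of how many titles the dump contains.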

Right now, according to the current statistics, the number of articles is around 5.8M.
To get the list of pages I used the AllPages API. However, the number of pages I get is around 14.5M, which is about 3 times what I was expecting. I restricted myself to namespace 0 to get the list. The following is the sample code that I am using:
# get the list of all wikipedia pages (articles) -- English
import sys
from simplemediawiki import MediaWiki
listOfPagesFile = open("wikiListOfArticles_nonredirects.txt", "w")
wiki = MediaWiki('https://en.wikipedia.org/w/api.php')
continueParam = ''
requestObj = {}
requestObj['action'] = 'query'
requestObj['list'] = 'allpages'
requestObj['aplimit'] = 'max'
requestObj['apnamespace'] = '0'
pagelist = wiki.call(requestObj)
pagesInQuery = pagelist['query']['allpages']
for eachPage in pagesInQuery:
    pageId = eachPage['pageid']
    title = eachPage['title'].encode('utf-8')
    writestr = str(pageId) + "; " + title + "\n"
    listOfPagesFile.write(writestr)
numQueries = 1
while len(pagelist['query']['allpages']) > 0:
    requestObj['apcontinue'] = pagelist["continue"]["apcontinue"]
    pagelist = wiki.call(requestObj)
    pagesInQuery = pagelist['query']['allpages']
    for eachPage in pagesInQuery:
        pageId = eachPage['pageid']
        title = eachPage['title'].encode('utf-8')
        writestr = str(pageId) + "; " + title + "\n"
        listOfPagesFile.write(writestr)
        # print writestr
    numQueries += 1
    if numQueries % 100 == 0:
        print "Done with queries -- ", numQueries
        print numQueries
listOfPagesFile.close()
The number of queries fired is around 28900, which results in approx. 14.5M names of the pages.
I also tried the all-titles link mentioned in the above answer. In that case as well I am getting around 14.5M pages.
I thought that this overestimate to the actual number of pages is because of the redirects, and did add the 'nonredirects' option to the request object:
requestObj['apfilterredir'] = 'nonredirects'
After doing that I get only 112,340 pages, which is far too few compared to 5.8M.
With the above code I was expecting roughly 5.8M pages, but that doesn't seem to be the case.
Is there any other option that I should be trying to get the actual (~5.8M) set of page names?

Here is an asynchronous generator that yields MediaWiki page titles:
async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    # `log` and `get` are helpers that are not shown in this snippet; `json` must be imported
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    # strip the trailing slash so the URL does not contain "//w/api.php"
    url = "{}/w/api.php".format(wiki.rstrip("/"))
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
        "apfrom": "",
    }
    while True:
        content = await get(http, url, params=params)
        if content is None:
            # request failed, retry with the same parameters
            continue
        content = json.loads(content)
        for page in content["query"]["allpages"]:
            yield page["title"]
        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            # no continuation token: this was the last batch
            return
        else:
            params["apfrom"] = apcontinue
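For comparison, here is a synchronous sketch of the same continuation logic. It merges the whole continue object returned by the API back into the next request, and it takes the HTTP call as a parameter so the loop can be tried without network access. iter_allpages, fetch and fake_fetch are illustrative names, and the fake replies only mimic the shape of real API responses.

```python
def iter_allpages(fetch, namespace=0, redirects="nonredirects"):
    """Yield page titles, following the API's continuation tokens.

    `fetch` is any callable that takes a dict of query parameters and
    returns the decoded JSON reply as a dict.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apnamespace": str(namespace),
        "aplimit": "max",
        "apfilterredir": redirects,
    }
    cont = {}
    while True:
        reply = fetch({**params, **cont})
        for page in reply["query"]["allpages"]:
            yield page["title"]
        cont = reply.get("continue")
        if not cont:
            return  # no continuation block means this was the last batch


def fake_fetch(request):
    """Stand-in for an HTTP call: serves two batches, then stops."""
    batches = {None: (["A", "B"], "C"), "C": (["C", "D"], None)}
    titles, next_start = batches[request.get("apcontinue")]
    reply = {"query": {"allpages": [{"title": t} for t in titles]}}
    if next_start is not None:
        reply["continue"] = {"apcontinue": next_start, "continue": "-||"}
    return reply


all_titles = list(iter_allpages(fake_fetch))
print(all_titles)
```

With a real HTTP client, fetch would be a small wrapper that sends the parameters to /w/api.php and returns the parsed JSON. Keeping it injectable also makes it easy to add retries or rate limiting in one place.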

Related

Difficulties with web scraping

I have just come across an article called The 500 Greatest Songs of All Time and thought "oh that's cool, I bet they also made a Spotify/Apple Music list that I can follow". Well... they didn't.
So in a nutshell, I wonder if it's possible to 1) scrape the website to extract the songs and 2) then do some kind of bulk upload to Spotify to create the list.
Songs' titles and artists are structured like this on the website:
[Website screenshot] I have already tried to scrape the site with the importxml() formula in Google Sheets, but with no success.
I understand the scraping part is easier than the other and, as I am new to programming, I would be happy to achieve even part of this goal. I am sure this task can be done easily in Python.
I feel like explaining everything would go beyond the scope here, so I tried to comment the code well enough.
1. Scrape the songs
I used python3 and selenium, their website doesn't block that.
Be sure to adjust your chromedriver path, and the output path of the .txt file at the bottom if necessary. Once it's done and you have your .txt file you can close it.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
s = Service(r'/Users/main/Desktop/chromedriver')
driver = webdriver.Chrome(service=s)
# just setting some vars, I used Xpath because I know that
top_500 = 'https://www.rollingstone.com/music/music-lists/best-songs-of-all-time-1224767/'
cookie_button_xpath = "// button [@id = 'onetrust-accept-btn-handler']"
div_containing_links_xpath = "// div [@id = 'pmc-gallery-list-nav-bar-render'] // child :: a"
song_names_xpath = "// article [@class = 'c-gallery-vertical-album'] / child :: h2"
links = []
songs = []
driver.get(top_500)
# accept cookies, give time to load
time.sleep(3)
cookie_btn = driver.find_element(By.XPATH, cookie_button_xpath)
cookie_btn.click()
time.sleep(1)
# extracting all the links since there are only 50 songs per page
links_to_next_pages = driver.find_elements(By.XPATH, div_containing_links_xpath)
for element in links_to_next_pages:
    l = element.get_attribute('href')
    links.append(l)
# extracting the songs, then going to next page and so on until we hit 500
counter = 1 # we're starting with 1 here since links[0] is the current page we are already on
while True:
    list = driver.find_elements(By.XPATH, song_names_xpath)
    for element in list:
        s = element.text
        songs.append(s)
    if len(songs) == 500:
        break
    driver.get(links[counter])
    counter += 1
    time.sleep(2)
# verify that there are no duplicates, if there were, something would be off
if len(songs) != len( set(songs) ):
    print('you f***** up')
else:
    print('seems fine')
with open('/Users/main/Desktop/output_songs.txt', 'w') as file:
    file.writelines(line + '\n' for line in songs)
2. Prepare Spotify
Go to the Spotify Developer Dashboard and create an account (use your Spotify account).
Then create an app, call it whatever you want.
On your app click settings and whitelist http://localhost:8888/callback
On your app click "users and access" and add your Spotify account
Leave the tab open, we'll come back to it
3. Prepare Your Environment
You need Node.js, so make sure it is installed on your machine
Download this from Spotify's GitHub
Unzip it, cd into the folder and run npm install
Go into the authorization_code folder and open app.js in an editor
Find var scope and append ' playlist-modify-public' to the string, so that your app can access your Spotify playlists
Now go back to the app in your Spotify Developer Dashboard. Copy the Client ID and the Client Secret into var client_id and var client_secret respectively (in the app.js file). var redirect_uri will be http://localhost:8888/callback. Don't forget to save your changes.
4. Run the Spotify side of things
cd into the authorization_code folder and run app.js with node app.js (this is basically a server running on your PC)
Now if that works leave it running and go to http://localhost:8888, authorise your Spotify account there
There, copy the full token, including the overflow; use inspect element to get it
Adjust the user_id and auth variables as well as the path to output_songs.txt (at with open) in the following Python script and run it. Songs that are not found will be printed out at the end; search for them with Google. They are usually on Spotify as well, but Google seems to have the better search algorithm (surprised Pikachu face).
import requests
import re
import json
# this is NOT your display name, it's your user name!!
user_id = 'YOUR_USERNAME'
# paste your auth token from spotify; it can time out, then you have to get a new one, so don't panic if you get a bunch of responses in the 400s after some time
auth = {"Authorization": "Bearer YOUR_AUTH_KEY_FROM_LOCALHOST"}
playlist = []
err_log = []
base_url = 'https://api.spotify.com/v1'
search_method = '/search'
with open('/Users/main/Desktop/output_songs.txt', 'r') as file:
    songs = file.readlines()
# this queries spotify, does some magic and then appends the track's spotify uri to an array
def query_song_uris():
    for n, entry in enumerate(songs):
        x = re.findall(r"'([^']*)'", entry)
        title_len = len(entry) - len(x[0]) - 4
        title = x[0]
        artist = entry[:title_len]
        payload = {
            'q': (entry),
            'track:': (title),
            'artist:': (artist),
            'type': 'track',
            'limit': 1
        }
        url = base_url + search_method
        try:
            r = requests.get(url, params=payload, headers=auth)
            print('\nquerying spotify; ', r)
            c = r.content.decode('UTF-8')
            dic = json.loads(c)
            track_uri = dic["tracks"]["items"][0]["uri"]
            playlist.append(track_uri)
            print(track_uri)
        except:
            err = f'\nNr. {(len(songs)-n)}: ' + f'{entry}'
            err_log.append(err)
    playlist.reverse()
query_song_uris()
# creates a playlist and returns playlist id
def create_playlist():
    payload = {
        "name": "Rolling Stone: Top 500 (All Time)",
        "description": "music for old men xD with occasional hip hop appearences. just kidding"
    }
    url = base_url + f'/users/{user_id}/playlists'
    r = requests.post(url, headers=auth, json=payload)
    c = r.content.decode('UTF-8')
    dic = json.loads(c)
    print(f'\n\ncreating playlist #{dic["id"]}; ', r)
    return dic["id"]
# adds the collected track uris to the new playlist, at most 100 per request
def add_to_playlist():
    playlist_id = create_playlist()
    while True:
        if len(playlist) > 100:
            p = playlist[:100]
        else:
            p = playlist
        payload = {"uris": (p)}
        url = base_url + f'/playlists/{playlist_id}/tracks'
        r = requests.post(url, headers=auth, json=payload)
        print(f'\nadding {len(p)} songs to playlist; ', r)
        del playlist[ : len(p) ]
        if len(playlist) == 0:
            break
add_to_playlist()
print('\n\ncheck your spotify :)')
print("\n\n\nthese tracks didn't make it, check manually:\n")
for line in err_log:
    print(line)
print('\n\n')
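The batching in add_to_playlist() can also be written as a small generator. Spotify accepts at most 100 URIs per add-items request, which is why the playlist is sent in slices. This is only a sketch: batches is a hypothetical helper name and the URIs below are placeholders, not real tracks.

```python
def batches(items, size=100):
    """Yield consecutive slices of `items`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder URIs standing in for the results of the search step.
uris = [f"spotify:track:{n}" for n in range(250)]
sizes = [len(chunk) for chunk in batches(uris)]
print(sizes)
```

Unlike the while loop above, this leaves the original list untouched and sends nothing at all when the list is empty.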
Done
If you don't want to run the code yourself, here's the playlist:
https://open.spotify.com/playlist/5fdLKYNFlA4XSvhEl36KXS
If you have trouble, everything from step 2 on is also described in the Web API quick start or, more generally, in the Web API docs.
Regarding Apple Music
So Apple seems very closed up (surprise haha). What I found, though, is that you can query the iTunes Store. The response also contains a direct link to the song(s) on Apple Music.
You might be able to go from there.
Get ISRC code from iTunes Search API (Apple music)
PS: undeniably regex is witchcraft, but y'all here got my back

.text is scrambled with numbers and special keys in BeautifulSoup

Hello, I am currently using Python 3, BeautifulSoup 4, and requests to scrape some information from supremenewyork.com UK. I have implemented a proxy (that I know works) in the script. The only problem is that this website does not like programs scraping this information automatically, so they have decided to scramble the text, which I think makes it unusable as text.
My question: is there a way to get the text without using .text, and/or is there a way to get the script to read the text so that when it sees a special character like # it skips over it, or when it sees & it skips until it sees ;?
Because that is basically how this website scrambles the text. Here is an example; the text shown when you inspect the element is:
supremetshirt
Which is supposed to say "supreme t-shirt" and so on (you get the idea: they don't use letters to scramble, only numbers and special keys).
The scrambled text is highlighted in a box automatically when you inspect the element using a VPN on the UK Supreme website, and is different from the normal text (which isn't highlighted at all). Whenever I run my script without the proxy code against my local supremenewyork.com, it works fine (but only because the text is not scrambled on my local website, and I want to pull this info from the UK website). Any ideas? Here is my code:
import requests
from bs4 import BeautifulSoup
categorys = ['jackets', 'shirts', 'tops_sweaters', 'sweatshirts', 'pants', 'shorts', 't-shirts', 'hats', 'bags', 'accessories', 'shoes', 'skate']
catNumb = 0
#use new proxy every so often for testing (will add something that pulls proxys and usses them for you.
UK_Proxy1 = '51.143.153.167:80'
proxies = {
    'http': 'http://' + UK_Proxy1 + '',
    'https': 'https://' + UK_Proxy1 + '',
}
for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl, proxies=proxies).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"'+ catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        #name = item_soup.find('h1', itemprop='name')
        style = item_soup.find('p', itemprop='model').text
        #style = item_soup.find('p', itemprop='model')
        print (alt +(' --- ')+ name +(' --- ')+ style)
        #print(alt)
        #print(str(name))
        #print (str(style))
When I run this script I get this error:
name = item_soup.find('h1', itemprop='name').text
AttributeError: 'NoneType' object has no attribute 'text'
So I uncommented the lines that are commented out above and commented out their near-identical counterparts, and I got some kind of str error, so I tried print(str(name)). I am able to print the alt fine (with every script, the alt is not scrambled), but when it comes to printing the name and style, all it prints is None under every alt.
I have been working on fixing this for days and have come up with no solutions. Can anyone help me solve this?
I solved my own problem using this solution:
thetable = soup5.find('div', class_='turbolink_scroller')
items = thetable.find_all('div', class_='inner-article')
for item in items:
    alt = item.find('img')['alt']
    name = item.h1.a.text
    color = item.p.a.text
    print(alt,' --- ', name, ' --- ',color)
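On the original question of skipping everything from & to ;: if you are working on the raw markup rather than on text that BeautifulSoup has already decoded, a regular expression can drop character references. This is only a sketch; strip_entities is a made-up helper and the sample string is invented, not taken from the Supreme site.

```python
import re

# Matches named (&shy;), decimal (&#8203;) and hex (&#x200b;) references.
ENTITY = re.compile(r"&#?[0-9A-Za-z]+;")


def strip_entities(raw):
    """Remove every &...; character reference from a raw HTML string."""
    return ENTITY.sub("", raw)


sample = "s&#8203;u&#x200b;preme t&shy;shirt"
cleaned = strip_entities(sample)
print(cleaned)
```

Note that this also removes meaningful references such as &amp;, so it is only suitable when the references are pure noise, as described in the question.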

Where to store SQL commands for execution

We face code quality issues because of inline mysql queries. Having self-written mysql queries really clutters the code and also increases code base etc.
Our code is cluttered with stuff like
/* beautify ignore:start */
/* jshint ignore:start */
var sql = "SELECT *"
+" ,DATE_ADD(sc.created_at,INTERVAL 14 DAY) AS duedate"
+" ,distance_mail(?,?,lat,lon) as distance,count(pks.skill_id) c1"
+" ,count(ps.profile_id) c2"
+" FROM TABLE sc"
+" JOIN "
+" PACKAGE_V psc on sc.id = psc.s_id "
+" JOIN "
+" PACKAGE_SKILL pks on pks.package_id = psc.package_id "
+" LEFT JOIN PROFILE_SKILL ps on ps.skill_id = pks.skill_id and ps.profile_id = ?"
+" WHERE sc.type in "
+" ('a',"
+" 'b',"
+" 'c' ,"
+" 'd',"
+" 'e',"
+" 'f',"
+" 'g',"
+" 'h')"
+" AND sc.status = 'open'"
+" AND sc.crowd_type = ?"
+" AND sc.created_at < DATE_SUB(NOW(),INTERVAL 10 MINUTE) "
+" AND sc.created_at > DATE_SUB(NOW(),INTERVAL 14 DAY)"
+" AND distance_mail(?, ?,lat,lon) < 500"
+" GROUP BY sc.id"
+" HAVING c1 = c2 "
+" ORDER BY distance;";
/* jshint ignore:end */
/* beautify ignore:end */
I had to blur the code a little bit.
As you can see, having this repeatedly in your code is just unreadable. We also cannot move to ES6 at the moment, which would at least prettify the string a little thanks to multi-line strings.
The question now is: is there a way to store those SQL statements in one place? As additional information, we use node (~0.12) and express to expose an API, accessing a MySQL db.
I already thought about using JSON, which would result in an even bigger mess. Plus it may not even be possible, since JSON does not allow multi-line strings and is strict about which characters may appear unescaped.
Then I came up with the idea of storing the SQL in a file and loading it at startup of the node app. At the moment this is my best shot at getting the SQL queries into ONE place and offering them to the rest of the node modules.
The question here is: use ONE file? One file per query? One file per database table?
Any help is appreciated, I can not be the first on the planet solving this so maybe someone has a working, nice solution!
PS: I tried using libs like squel but that does not really help, since our queries are complex as you can see. It is mainly about getting OUR queries into a "query central".
I prefer putting every bigger query in its own file. This way you get syntax highlighting and the queries are easy to load on server start. To structure this, I usually have one folder for all queries and, inside that, one folder for each model.
# queries/mymodel/select.mymodel.sql
SELECT * FROM mymodel;
// in mymodel.js
const fs = require('fs');
const queries = {
  select: fs.readFileSync(__dirname + '/queries/mymodel/select.mymodel.sql', 'utf8')
};
I suggest you store your queries in .sql files away from your js code. This will separate the concerns and make both code & queries much more readable. You should have different directories with nested structure based on your business.
eg:
queries
├── global.sql
├── products
│   └── select.sql
└── users
    └── select.sql
Now, you just need to require all these files at application startup. You can either do it manually or use some logic. The code below will read all the files (sync) and produce an object with the same hierarchy as the folder above
var glob = require('glob')
var _ = require('lodash')
var fs = require('fs')
// directory containing all queries (in nested folders)
var queriesDirectory = 'queries'
// get all sql files in dir and sub dirs
var files = glob.sync(queriesDirectory + '/**/*.sql', {})
// create object to store all queries
var queries = {}
_.each(files, function(file){
  // 1. read file text
  var queryText = fs.readFileSync(__dirname + '/' + file, 'utf8')
  // 2. store into object
  // create regex for directory name
  var directoryNameReg = new RegExp("^" + queriesDirectory + "/")
  // get the property path to set in the final object, eg: model.queryName
  var queryPath = file
    // remove directory name
    .replace(directoryNameReg,'')
    // remove extension
    .replace(/\.sql/,'')
    // replace '/' with '.'
    .replace(/\//g, '.')
  // use lodash to set the nested properties
  _.set(queries, queryPath, queryText)
})
// final object with all queries according to nested folder structure
console.log(queries)
Log output:
{
  global: '-- global query if needed\n',
  products: {
    select: 'select * from products\n'
  },
  users: {
    select: 'select * from users\n'
  }
}
So you can access any query like this: queries.users.select
Put your query into a database procedure and call the procedure in the code when it is needed.
create procedure sp_query()
select * from table1;
There are a few things you want to do. First, you want to store multi-line without ES6. You can take advantage of toString of a function.
var getComment = function(fx) {
        var str = fx.toString();
        return str.substring(str.indexOf('/*') + 2, str.indexOf('*/'));
    },
    queryA = function() {
        /*
            select blah
              from tableA
             where whatever = condition
        */
    };
console.log(getComment(queryA));
You can now create a module and store lots of these functions. For example:
//Name it something like salesQry.js under the root directory of your node project.
var getComment = function(fx) {
        var str = fx.toString();
        return str.substring(str.indexOf('/*') + 2, str.indexOf('*/'));
    },
    query = {};
query.template = getComment(function() { /*Put query here*/ });
query.b = getComment(function() {
    /*
    SELECT *
     ,DATE_ADD(sc.created_at,INTERVAL 14 DAY) AS duedate
     ,distance_mail(?,?,lat,lon) as distance,count(pks.skill_id) c1
     ,count(ps.profile_id) c2
      FROM TABLE sc
      JOIN PACKAGE_V psc on sc.id = psc.s_id
      JOIN PACKAGE_SKILL pks on pks.package_id = psc.package_id
      LEFT JOIN PROFILE_SKILL ps on ps.skill_id = pks.skill_id AND ps.profile_id = ?
     WHERE sc.type in ('a','b','c','d','e','f','g','h')
       AND sc.status = 'open'
       AND sc.crowd_type = ?
       AND sc.created_at < DATE_SUB(NOW(),INTERVAL 10 MINUTE)
       AND sc.created_at > DATE_SUB(NOW(),INTERVAL 14 DAY)
       AND distance_mail(?, ?,lat,lon) < 500
     GROUP BY sc.id
    HAVING c1 = c2
    ORDER BY distance;
    */
});
//Debug
console.log(query.template);
console.log(query.b);
//module.exports.query = query //Uncomment this.
You can require the necessary packages and build your logic right in this module or build a generic wrapper module for better OO design.
//Name it something like SQL.js. in the root directory of your node project.
var mysql = require('mysql'),
    connection = mysql.createConnection({
        host: 'localhost',
        user: 'me',
        password: 'secret',
        database: 'my_db'
    });
module.exports.load = function(moduleName) {
    // the query module exports its statements under the `query` key
    var SQL = require(moduleName).query;
    return {
        query: function(statement, param, callback) {
            connection.connect();
            connection.query(SQL[statement], param, function(err, results) {
                // note: end() closes this connection for good; use a pool
                // if you need to run more than one query
                connection.end();
                callback(err, results);
            });
        }
    };
};
To use it, you do something like:
var Sql = require('./SQL.js').load('./salesQry.js');
Sql.query('b', param, function (err, results) {
    ...
});
I come from a different platform, so I'm not sure if this is what you are looking for. Like your application, we had many template queries and we didn't like having them hard-coded in the application.
We created a table in MySQL that stores Template_Name (unique) and Template_SQL.
We then wrote a small function within our application that returns the SQL template, something like this:
SQL = fn_get_template_sql(Template_name);
We then process the SQL something like this (pseudocode):
if SQL is not empty
    SQL = replace all parameters // escape your parameters as MySQL strings
    execute the SQL
Or you could read the SQL, create a connection, and add the parameters in whatever way is safest for you.
This allows you to edit the template query wherever and whenever you like. You can create an audit table for the template table that captures all previous changes, so you can revert to a previous template if needed. You can also extend the table to capture who last edited the SQL and when.
From a performance point of view, this works on the fly, and you don't have to read any files or restart the server to add new templates.
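Since this approach is not tied to a platform, here is a minimal sketch of it in Python, with the standard-library sqlite3 module standing in for MySQL. The table and column names (sql_template, template_name, template_sql), the users table and both function names are assumptions made for illustration; parameters are bound by the driver rather than spliced into the string.

```python
import sqlite3


def get_template_sql(conn, name):
    """Return the stored SQL text for `name`, or None if it does not exist."""
    row = conn.execute(
        "SELECT template_sql FROM sql_template WHERE template_name = ?",
        (name,),
    ).fetchone()
    return row[0] if row else None


def run_template(conn, name, params=()):
    """Look up a template by name and execute it with bound parameters."""
    sql = get_template_sql(conn, name)
    if not sql:
        raise KeyError("unknown SQL template: " + name)
    return conn.execute(sql, params).fetchall()


# In-memory database with one template and some sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sql_template ("
    "template_name TEXT PRIMARY KEY, template_sql TEXT NOT NULL)"
)
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
conn.execute(
    "INSERT INTO sql_template VALUES (?, ?)",
    ("user_by_id", "SELECT name FROM users WHERE id = ?"),
)

rows = run_template(conn, "user_by_id", (2,))
print(rows)
```

Letting the driver bind the parameters avoids the manual escaping step in the pseudocode, at the cost of requiring the stored templates to use the driver's placeholder syntax.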
You could create a completely new npm module let's assume the custom-queries module and put all your complex queries in there.
Then you can categorize all your queries by resource and by action. For example, the dir structure can be:
/index.js -> it will bootstrap all the resources
/queries
/queries/sc (random name)
/queries/psc (random name)
/queries/complex (random name)
The following query can live under the /queries/complex directory in its own file and the file will have a descriptive name (let's assume retrieveDistance)
// You can define some placeholders within this var because possibly you would like to be a bit configurable and reuseable in different parts of your code.
/* jshint ignore:start */
var sql = "SELECT *"
+" ,DATE_ADD(sc.created_at,INTERVAL 14 DAY) AS duedate"
+" ,distance_mail(?,?,lat,lon) as distance,count(pks.skill_id) c1"
+" ,count(ps.profile_id) c2"
+" FROM TABLE sc"
+" JOIN "
+" PACKAGE_V psc on sc.id = psc.s_id "
+" JOIN "
+" PACKAGE_SKILL pks on pks.package_id = psc.package_id "
+" LEFT JOIN PROFILE_SKILL ps on ps.skill_id = pks.skill_id and ps.profile_id = ?"
+" WHERE sc.type in "
+" ('a',"
+" 'b',"
+" 'c' ,"
+" 'd',"
+" 'e',"
+" 'f',"
+" 'g',"
+" 'h')"
+" AND sc.status = 'open'"
+" AND sc.crowd_type = ?"
+" AND sc.created_at < DATE_SUB(NOW(),INTERVAL 10 MINUTE) "
+" AND sc.created_at > DATE_SUB(NOW(),INTERVAL 14 DAY)"
+" AND distance_mail(?, ?,lat,lon) < 500"
+" GROUP BY sc.id"
+" HAVING c1 = c2 "
+" ORDER BY distance;";
/* jshint ignore:end */
module.exports = sql;
The top level index.js will export an object with all the complex queries. An example can be:
var sc = require('./queries/sc');
var psc = require('./queries/psc');
var complex = require('./queries/complex');
// Quite important because you want to ensure that no one will touch the queries outside of
// the scope of this module. Be careful, because the Object.freeze is freezing only the top
// level elements of the object and it is not recursively freezing the nested objects.
var queries = Object.freeze({
    sc: sc,
    psc: psc,
    complex: complex
});
module.exports = queries;
Finally, on your main code you can use the module like that:
var cq = require('custom-queries');
var retrieveDistanceQuery = cq.complex.retrieveDistance;
// @todo: replace the placeholders if they exist
By doing something like this you move all the noise of the string concatenation to a single, predictable place, where you can easily find all your complex queries.
This is no doubt a million-dollar question, and I think the right solution always depends on the case.
Here are my thoughts. I hope they help:
One simple trick (which, in fact, I have read is surprisingly more efficient than joining strings with "+") is to use an array of strings, one per row, and join them.
It is still a mess but, at least for me, a bit clearer (especially when using, as I do, "\n" as the separator instead of spaces, to make the resulting strings more readable when printed out for debugging).
Example:
var sql = [
    "select foo.bar",
    "from baz",
    "join foo on (",
    "  foo.bazId = baz.id",
    ")", // I always leave the last comma to avoid errors on possible query grow.
].join("\n"); // or .join(" ") if you prefer.
As a hint, I use that syntax in my own SQL "building" library. It may not work for very complex queries but, if the provided parameters can vary, it is very helpful for avoiding (also suboptimal) "coalesce" messes by fully removing unneeded query parts. It is also on GitHub (and the code isn't too complex), so you can extend it if you find it useful.
If you prefer separate files:
As for single versus multiple files, multiple files are less efficient to read (more file open/close overhead and harder OS-level caching). But if you load all of them once at startup, the difference is hardly noticeable.
So the only drawback (for me) is that it is too hard to get a "global glance" of your query collection. Even if you have a huge number of queries, I think it is better to mix both approaches. That is: group related queries in the same file, so you have a single file per module, submodel, or whatever criterion you choose.
Of course, a single file would be relatively "huge" and difficult to handle "at first". But I make heavy use of vim's marker-based folding (foldmethod=marker), which is very helpful for handling such files.
Of course, if you don't (yet) use vim (truly??), you won't have that option, but surely there is an alternative in your editor. If not, you can always use syntax folding and something like "function (my_tag) {" as markers.
For example:
---(Query 1)---------------------/*{{{*/
select foo from bar;
---------------------------------/*}}}*/
---(Query 2)---------------------/*{{{*/
select foo.baz
from foo
join bar using (foobar)
---------------------------------/*}}}*/
...when folded, I see it as:
+-- 3 lines: ---(Query 1)------------------------------------------------
+-- 5 lines: ---(Query 2)------------------------------------------------
With properly chosen labels this is much handier to manage and, from the parsing point of view, it is not difficult to parse the whole file by splitting queries at those separator rows and using the labels as keys to index the queries.
Dirty example:
#!/usr/bin/env node
"use strict";
var Fs = require("fs");
var src = Fs.readFileSync("./test.sql");
var queries = {};
var label = false;
String(src).split("\n").map(function(row){
    var m = row.match(/^-+\((.*?)\)-+[/*{]*$/);
    if (m) return queries[label = m[1].replace(" ", "_").toLowerCase()] = "";
    if(row.match(/^-+[/*}]*$/)) return label = false;
    if (label) queries[label] += row+"\n";
});
console.log(queries);
// { query_1: 'select foo from bar;\n',
// query_2: 'select foo.baz \nfrom foo\njoin bar using (foobar)\n' }
console.log(queries["query_1"]);
// select foo from bar;
console.log(queries["query_2"]);
// select foo.baz
// from foo
// join bar using (foobar)
Finally (an idea): if you go to that much effort, it wouldn't be a bad idea to add a boolean flag next to each query label indicating whether the query is intended to be used frequently or only occasionally. You can then use that information to prepare those statements at application startup, or only when they are going to be used more than once.
Can you create a view with that query?
Then select from the view.
I don't see any parameters in the query, so I suppose creating a view is possible.
Create stored procedures for all queries, and replace the var sql = "SELECT..." with a procedure call such as var sql = "CALL usp_get_packages".
This is best for performance and does not break any dependencies in the application. Depending on the number of queries it may be a huge task, but in every respect (maintainability, performance, dependencies, etc.) it is the best solution.
I'm late to the party, but if you want to store related queries in a single file, YAML is a good fit because it handles arbitrary whitespace better than pretty much any other data serialization format, and it has some other nice features like comments:
someQuery: |-
  SELECT *
   ,DATE_ADD(sc.created_at,INTERVAL 14 DAY) AS duedate
   ,distance_mail(?,?,lat,lon) as distance,count(pks.skill_id) c1
   ,count(ps.profile_id) c2
  FROM TABLE sc
  -- ...
# Here's a comment explaining the following query
someOtherQuery: |-
  SELECT 1;
This way, using a module like js-yaml you can easily load all of the queries into an object at startup and access each by a sensible name:
const fs = require('fs');
const jsyaml = require('js-yaml');
export default jsyaml.load(fs.readFileSync('queries.yml'));
Here's a snippet of it in action (using a template string instead of a file):
const yml =
`someQuery: |-
  SELECT *
  FROM TABLE sc;
someOtherQuery: |-
  SELECT 1;`;
const queries = jsyaml.load(yml);
console.dir(queries);
console.log(queries.someQuery);
<script src="https://unpkg.com/js-yaml@3.8.1/dist/js-yaml.min.js"></script>
Another approach with separate files by using ES6 string templates.
Of course, this doesn't answer the original question because it requires ES6, but there is already an accepted answer which I'm not intending to replace. I simply thought that it is interesting from the point of view of the discussion about query storage and management alternatives.
// myQuery.sql.js
"use strict";
var p = module.parent;
var someVar = p ? '$1' : ':someVar'; // Comments if needed...
var someOtherVar = p ? '$2' : ':someOtherVar';
module.exports = `
--##sql##
select foo from bar
where x = ${someVar} and y = ${someOtherVar}
--##/sql##
`;
module.parent || console.log(module.exports);
// (or simply "p || console.log(module.exports);")
NOTE: This is the original (basic) approach. I later evolved it, adding some interesting improvements (BONUS, BONUS 2 and FINAL EDIT sections). See the bottom of this post for a full-featured snippet.
The advantages of this approach are:
It is very readable, even with the small JavaScript overhead.
It can also be properly syntax highlighted (at least in Vim), in both the JavaScript and SQL sections.
Parameters are placed as readable variable names instead of silly "$1, $2", etc., and are explicitly declared at the top of the file, so it's simple to check in which order they must be provided.
It can be required as myQuery = require("path/to/myQuery.sql.js"), obtaining a valid query string with $1, $2, etc. positional parameters in the specified order.
But it can also be executed directly with node path/to/myQuery.sql.js, obtaining valid SQL to be executed in a SQL interpreter.
This way you avoid the mess of copying the query back and forth and replacing the parameter specification (or values) each time you move between query testing environments and application code: simply use the same file.
Note: I used PostgreSQL syntax for variable names. But with other databases, if different, it's pretty simple to adapt.
More than that: with a few more tweaks (see BONUS section), you can turn it into a viable console testing tool and:
Generate already-parameterized SQL by executing something like node myQueryFile.sql.js parameter1 parameter2 [...].
...or directly execute it by piping to your database console. Ex: node myQueryFile.sql.js some_parameter | psql -U myUser -h db_host db_name.
Even more: you can also tweak the query so that it behaves slightly differently when executed from the console (see BONUS 2 section), avoiding wasted space on large but meaningless data while keeping that data when the query is read by the application that needs it.
And, of course, you can pipe it again to less -S to avoid line wrapping and easily explore the data by scrolling both horizontally and vertically.
Example:
(
echo "\set someVar 3"
echo "\set someOtherVar 'foo'"
node path/to/myQuery.sql.js
) | psql dbName
NOTES:
'##sql##' and '##/sql##' (or similar) labels are fully optional,
but very useful for proper syntax highlighting, at least in Vim.
This extra plumbing is no longer necessary (see BONUS section).
In fact, I don't actually type the (...) | psql ... code above directly into the console; I simply write (in a Vim buffer):
echo "\set someVar 3"
echo "\set someOtherVar 'foo'"
node path/to/myQuery.sql.js
...as many times as test conditions I want to test and execute them by visually selecting desired block and typing :!bash | psql ...
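The dual-mode idea described above (positional placeholders when the file is required, psql variables when it is run directly) can be sketched as a plain function. This is only an illustration: renderQuery and its asModule flag are hypothetical names that stand in for the module.parent check used in the answer.

```javascript
// Hypothetical helper: renders the same SQL text in two flavors.
// `asModule` plays the role of the `module.parent` check in the answer above.
function renderQuery(asModule) {
  // Positional placeholders for the application, psql variables for the console.
  const someVar = asModule ? '$1' : ':someVar';
  const someOtherVar = asModule ? '$2' : ':someOtherVar';
  return `
select foo from bar
where x = ${someVar} and y = ${someOtherVar}
`.trim();
}

const forApp = renderQuery(true);   // what require() would hand to the application
const forPsql = renderQuery(false); // what running the file with node would print

console.log(forApp);
console.log(forPsql);
```

Keeping the placeholder choice in one place means the SQL text itself is written only once, which is the point of the approach.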
BONUS: (edit)
I ended up using this approach in many projects with just one simple modification, which consists of replacing the last line(s):
module.parent || console.log(module.exports);
// (or simply "p || console.log(module.exports);")
...with:
p || console.log(
`
\\set someVar '''${process.argv[2]}'''
\\set someOtherVar '''${process.argv[3]}'''
`
+ module.exports
);
This way I can generate already-parameterized queries from the command line just by passing parameters as normal positional arguments. Example:
myUser@myHost:~$ node myQuery.sql.js foo bar
\set someVar '''foo'''
\set someOtherVar '''bar'''
--##sql##
select foo from bar
where x = :someVar and y = :someOtherVar
--##/sql##
...and, better than that, I can pipe it to the postgres (or any other database) console just like this:
myUser@myHost:~$ node myQuery.sql.js foo bar | psql -h dbHost -U dbUser dbName
foo
------
100
200
300
(3 rows)
This approach makes it much easier to test multiple values, because you can simply use the command line history to recall previous commands and edit whatever you want.
BONUS 2:
Two more tricks:
1. Sometimes we need to retrieve columns with binary and/or large data that are difficult to read in the console and that, in fact, we probably don't even need to see while testing the query.
In these cases we can take advantage of the p variable to alter the output of the query and shorten the column, format it more properly, or simply remove it from the projection.
Examples:
Format: ${p ? "jsonb_column" : "jsonb_pretty(jsonb_column)"},
Shorten: ${p ? "long_text" : "substring(long_text)"},
Remove: ${p ? "binary_data," : ""} (notice that, in this case, I moved the comma inside the expression so that it can be omitted in the console version).
2. Not really a trick, just a reminder: we all know that to deal with large output in the console, we only need to pipe it to the less command.
But I, at least, often forget that when the output is table-aligned and too wide to fit in the terminal, there is the -S modifier, which tells less not to wrap lines and lets us scroll the text horizontally to explore the data.
Here is the full version of the original snippet with these changes applied:
// myQuery.sql.js
"use strict";
var p = module.parent;
var someVar = p ? '$1' : ':someVar'; // Comments if needed...
var someOtherVar = p ? '$2' : ':someOtherVar';
module.exports = `
--##sql##
select
foo
, bar
, ${p ? "baz" : "jsonb_pretty(baz)"}
${p ? ", long_hash" : ""}
from bar
where x = ${someVar} and y = ${someOtherVar}
--##/sql##
`;
p || console.log(
`
\\set someVar '''${process.argv[2]}'''
\\set someOtherVar '''${process.argv[3]}'''
`
+ module.exports
);
FINAL EDIT:
I kept evolving this concept until it became too broad to handle as a strictly manual approach.
Finally, taking advantage of the great ES6+ tagged templates, I implemented a much simpler, library-driven approach.
So, in case anyone is interested, here it is: SQLTT
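For readers unfamiliar with tagged templates, here is a minimal sketch of the general technique. It is not SQLTT's API: the sql tag function and the { text, values } result shape are assumptions made up for this illustration. The tag replaces each interpolated value with a $1, $2, ... placeholder and collects the values in order.

```javascript
// Hypothetical tag function (not SQLTT's API): replaces every interpolated
// value with a positional placeholder and keeps the values in order.
function sql(strings, ...values) {
  const text = strings.reduce(
    // Append the literal chunk, then a placeholder if a value follows it.
    (acc, chunk, i) => acc + chunk + (i < values.length ? `$${i + 1}` : ''),
    ''
  );
  return { text, values };
}

const x = 3;
const y = 'foo';
const query = sql`select foo from bar where x = ${x} and y = ${y}`;

console.log(query.text);   // the SQL with placeholders
console.log(query.values); // the values, in placeholder order
```

Because the values never get concatenated into the SQL text, the query can be handed to a driver that accepts positional parameters without manual quoting.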
Put the query into a database stored procedure and call that procedure from your code. @paval has already answered along these lines;
you may also refer here.
create procedure sp_query()
select * from table1;

Putting hyperlinks into an HTML table in R

I am a biologist trying to do computer science for research, so I may be a bit naïve. I would like to make a table containing information from a data frame, with a hyperlink in one of the columns. I imagine this needs to be an HTML document (?). I found this post describing how to put a hyperlink into a data frame and write it as an HTML file using googleVis. I would like to use this approach (it is the only one I know and it seems to work well), except that I would like to replace the actual URL with a description. The real motivation is that I would like to include many of these hyperlinks, and the links have long addresses which are difficult to read.
To be verbose, I essentially want to do what I did here where we read 'here' but 'here' points to
http:// stackoverflow.com/questions/8030208/exporting-table-in-r-to-html-with-hyperlinks
From your previous question, you can have another vector which contains the titles of the URLs:
url=c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles=c('NY Times', 'CNN', 'Weather')
foo <- transform(foo, url = paste('<a href = ', shQuote(url), '>', urlTitles, '</a>'))
x = gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)
Building on Jack's answer but consolidating from different threads:
library(googleVis)
library(R2HTML)
url <- c('http://nytimes.com', 'http://cnn.com', 'http://www.weather.gov')
urlTitles <- c('NY Times', 'CNN', 'Weather')
foo <- data.frame(a=c(1,2,3), b=c(4,5,6), url=url)
foo <- transform(foo, url = paste('<a href = ', shQuote(url), '>', urlTitles, '</a>'))
x <- gvisTable(foo, options = list(allowHTML = TRUE))
plot(x)

page number variable: html,django

I want to do paging, but I only need to know the current page number, so that I can call the web service function, send it this parameter, and receive the corresponding data. So my only question is: how can I find out the current page number? I'm writing my project in Django and I create the page with XSL. If I know the page number, I think I can write this in urls.py:
url(r'^ask/(\d+)/$',
'ask',
name='ask'),
and call the function in views.py like:
ask(request, pageNo)
But I don't know where to put the pageNo variable in the HTML page (so that, for example, with pageNo=2 I can do pageNo+1 or pageNo-1 to build a URL like 127.0.0.1/ask/3/ or 127.0.0.1/ask/1/). To make my question clearer: how can I do this when we don't have any variables in HTML?
Sorry for my crazy question, I'm new to creating websites and also to Django.
I'm creating my HTML page with XSLT, so I send the whole HTML page (to show.html, which contains only {{str}}):
def ask(request):
    service = GetConfigLocator().getGetConfigHttpSoap11Endpoint()
    myRequest = GetConfigMethodRequest()
    myXml = service.GetConfigMethod(myRequest)
    myXmlstr = myXml._return
    styledoc = libxml2.parseFile("ask.xsl")
    style = libxslt.parseStylesheetDoc(styledoc)
    doc = libxml2.parseDoc(myXmlstr)
    result = style.applyStylesheet(doc, None)
    out = style.saveResultToString(result)
    ok = mark_safe(out)
    style.freeStylesheet()
    doc.freeDoc()
    result.freeDoc()
    return render_to_response("show.html", {
        'str': ok,
    }, context_instance=RequestContext(request))
I'm not working with a database; I just receive an XML file and parse it, so I don't have contact_list = Contacts.objects.all(). Can I still use this approach? Should I leave the first parameter in paginator = Paginator(contact_list, 25) blank?
If you use the standard Django paginator, it sends you to a URL like http://example.com/?page=N, where N is the page number.
So,
# urls.py
url('^ask/$', 'ask', name='viewName'),
You can get page number in views:
# views.py
def ask(request):
    page = request.GET.get('page', 1)