Writing items from function to separate text files?

I'm running some web scraping, and I now have a list of 911 links saved as follows (I've included 5 to demonstrate how they're stored):
every_link = ['http://www.millercenter.org/president/obama/speeches/speech-4427', 'http://www.millercenter.org/president/obama/speeches/speech-4425', 'http://www.millercenter.org/president/obama/speeches/speech-4424', 'http://www.millercenter.org/president/obama/speeches/speech-4423', 'http://www.millercenter.org/president/obama/speeches/speech-4453']
These URLs link to presidential speeches over time. I want to store each individual speech (so, 911 unique speeches) in a different text file, or be able to group them by president. I'm trying to apply the following function to these links:
def processURL(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    item_str_processed = punctuation.sub('', item_str)
    item_str_processed_final = item_str_processed.replace('—', ' ')

for l in every_link:
    processURL(l)
So, I want to save the words from all the processed speeches to unique text files. That might look like the following, with obama_44xx representing individual text files:
obama_4427 = "blah blah blah"
obama_4425 = "blah blah blah"
obama_4424 = "blah blah blah"
...
I'm trying the following:
for l in every_link:
    processURL(l)
    obama.write(processURL(l))
But that's not working...
Is there another way I should go about this?

Okay, so you have a couple of issues. First of all, your processURL function doesn't actually return anything, so when you try to write the return value of the function, it's going to be None. Maybe try something like this:
def processURL(link):
    open_url = urllib2.urlopen(link).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    item_str_processed = punctuation.sub('', item_str)
    item_str_processed_final = item_str_processed.replace('—', ' ')
    splitlink = link.split("/")
    president = splitlink[4]
    speech_num = splitlink[-1].split("-")[1]
    filename = "{0}_{1}".format(president, speech_num)
    return filename, item_str_processed_final  # returning a tuple

for link in every_link:
    filename, content = processURL(link)  # yay tuple unpacking
    with open(filename, 'w') as f:
        f.write(content)
This will write each file to a filename that looks like president_number. So for example, it will write Obama's speech with id number 4427 to a file called obama_4427. Lemme know if that works!
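Since the question also mentions wanting to group by president: a small variation on the loop above (my own suggestion, not part of the original answer) would write each speech into a per-president directory instead, since the president's name is already the first part of the filename:

import os

for link in every_link:
    filename, content = processURL(link)
    president = filename.split("_")[0]  # e.g. 'obama'
    if not os.path.isdir(president):
        os.makedirs(president)
    with open(os.path.join(president, filename), 'w') as f:
        f.write(content)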

You have to call the processURL function and have it return the text you want written. After that, you simply add the write-to-disk code within the loop. Something like this:
def processURL(l):
    open_url = urllib2.urlopen(l).read()
    item_soup = BeautifulSoup(open_url)
    item_div = item_soup.find('div', {'id': 'transcript'}, {'class': 'displaytext'})
    item_str = item_div.text.lower()
    #item_str_processed = punctuation.sub('', item_str)
    #item_str_processed_final = item_str_processed.replace('—', ' ')
    return item_str

for l in every_link:
    speech_text = processURL(l).encode('utf-8').decode('ascii', 'ignore')
    speech_num = l.split("-")[1]
    with open("obama_" + speech_num + ".txt", 'w') as f:
        f.write(speech_text)
The .encode('utf-8').decode('ascii', 'ignore') is purely for dealing with non-ascii characters in the text. Ideally you would handle them in a different way, but that depends on your needs (see Python: Convert Unicode to ASCII without errors).
Btw, the 2nd link in your list is a 404. You should make sure your script can handle that.
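For instance, a minimal sketch of skipping dead links (my addition, not part of the answer above), catching the HTTPError that urllib2.urlopen raises on a 404:

import urllib2

for l in every_link:
    try:
        speech_text = processURL(l).encode('utf-8').decode('ascii', 'ignore')
    except urllib2.HTTPError:
        print("Skipping dead link: " + l)
        continue
    speech_num = l.split("-")[1]
    with open("obama_" + speech_num + ".txt", 'w') as f:
        f.write(speech_text)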

Related

Trying to define a function that creates lists from files and uses random.choices to choose an element from the weighted lists

I'm trying to define a function that will create lists from multiple text files and print a random element from one of the weighted lists. I've managed to get the function to work with random.choice for a single list.
def test_rollitems():
    my_commons = open('common.txt')
    all_common_lines = my_commons.readlines()
    common = []
    for i in all_common_lines:
        common.append(i)
    y = random.choice(common)
    print(y)
When I tried adding a second list to the function it wouldn't work and my program just closes when the function is called.
def Improved_rollitem():
    # create the lists from the files
    my_commons = open('common.txt')
    all_common_lines = my_commons.readlines()
    common = []
    for i in all_common_lines:
        common.append(i)
    my_uncommons = open('uncommon.txt')
    all_uncommon_lines = my_uncommons.readlines()
    uncommon = []
    for i in all_uncommon_lines:
        uncommon.apend(i)
    y = random.choices([common, uncommon], [80, 20])
    print(y)
Can anyone offer any insight into what I'm doing wrong or missing?
Nevermind. I figured this out on my own! I was having issues with Geany, so I installed PyCharm and was able to work through the issue. Correct code is:
def Improved_rollitem():
    # create the lists from the files
    my_commons = open('common.txt')
    all_common_lines = my_commons.readlines()
    common = []
    for i in all_common_lines:
        common.append(i)
    my_uncommons = open('uncommon.txt')
    all_uncommon_lines = my_uncommons.readlines()
    uncommon = []
    for i in all_uncommon_lines:
        uncommon.append(i)
    y = random.choices([common, uncommon], [.8, .20])
    if y == [common]:
        for i in [common]:
            print(random.choice(i))
    if y == [uncommon]:
        for i in [uncommon]:
            print(random.choice(i))
If there's a better way to do something like this, it would certainly be cool to know though.
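Since you asked: one tidier variant (a sketch of my own, same files and the same 80/20 weighting) lets random.choices pick which list to draw from and random.choice pick the element, with context managers closing the files:

import random

def improved_rollitem():
    with open('common.txt') as f:
        common = f.readlines()
    with open('uncommon.txt') as f:
        uncommon = f.readlines()
    # random.choices returns a one-element list by default (k=1); [0] unwraps it
    chosen = random.choices([common, uncommon], weights=[80, 20])[0]
    print(random.choice(chosen))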

I'm having trouble with sending a form using POST to retrieve data in R

I'm having trouble collecting doctors from https://www.uchealth.org/providers/. I've found out it uses a POST method, but with httr I can't seem to create the form. Here's what I have:
url = 'https://www.uchealth.org/providers/'
formA = list(title = 'Search', onClick = 'swapForms();', doctor-area-of-care-top = 'Cancer Care')
formB = list(Search = 'swapForms();', doctor-area-of-care = 'Cancer Care')
get = POST(url, body = formB, encode = 'form')
I'm fairly certain formB is the correct one. However, I can't test it, since I get an error when trying to make the list. I believe it is because you can't use "-" characters in names, although I could be wrong on that. Could somebody help, please?
I am unable to comment properly, but try this to create the list. The code below worked for me:
library(httr)
url = 'https://www.uchealth.org/providers/'
formB = list(Search = 'swapForms();', `doctor-area-of-care` = 'Cancer Care')
get = POST(url, body = formB, encode = 'form')
When you are creating names with spaces or other special characters, you have to wrap them in backticks, as in the code above.

How do I append text to the start, middle, and end of each line from one file to another using python3?

I am writing a script to change the formatting of the text from one file, and create a new text file with the changes in formatting.
I have been able to remove unwanted characters, but haven't found a way to append text to the beginning of every line in the file.
Content from the original file looks like:
DMA 123 USA 12345
What I need it to look like after appending data to the start, middle, and end of the string:
<option label="DMA 123 USA" value="123"></option>
I have almost 100 lines that vary somewhat, but follow the above formatting. I am trying to automate this, as it will be a frequent task to adjust the original file to the new format for web publishing.
I have been searching and haven't found any way to do it yet. Here is my current code:
path = 'file.txt'
tvfile = open(path, 'r')
days = tvfile.read()
new_path = 'tvs.txt'
new_days = open(new_path, 'w')
replace_me = ['-', '(', ')', ',', '"']
for item in replace_me:
    days = days.replace(item, '')
days = days.strip()
new_days.write(days)
print(days)
tvfile.close()
new_days.close()
Nit: you need to prepend, not append. That said, try something along these lines:
buffer = ""
for item in replace_me:
line = "<option label=\""
line = line + days.replace(item,'').strip()
line = line + "\"></option>"
buffer = buffer + line
new_days.write(buffer)

How to obtain a list of titles of all Wikipedia articles

I'd like to obtain a list of all the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia powered wiki. One would be the API and the other one would be a database dump.
I'd prefer not to download the wiki dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only a list of the article titles, and even if I could, it would need more than 4 million requests, which would probably get me blocked anyway.
So my questions are:
Is there a way to obtain only the titles of Wikipedia articles via the API?
Is there a way to combine multiple requests/queries into one? Or do I actually have to download a Wikipedia dump?
The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
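If you go the dump route, a minimal sketch of fetching and reading it might look like this (the exact filename and URL are an assumption based on the usual dumps.wikimedia.org layout; check the latest dump listing for the real name):

import gzip
import urllib.request

# Assumed location -- verify against https://dumps.wikimedia.org/enwiki/latest/
url = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-all-titles-in-ns0.gz"
urllib.request.urlretrieve(url, "all-titles-in-ns0.gz")

with gzip.open("all-titles-in-ns0.gz", "rt", encoding="utf-8") as f:
    titles = [line.strip() for line in f]
print(len(titles))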
Right now, as per the current statistics, the number of articles is around 5.8M.
To get the list of pages I used the AllPages API. However, the number of pages I get is around 14.5M, which is about three times what I was expecting. I restricted myself to namespace 0 to get the list. Here is the sample code that I am using:
# get the list of all wikipedia pages (articles) -- English
import sys
from simplemediawiki import MediaWiki

listOfPagesFile = open("wikiListOfArticles_nonredirects.txt", "w")

wiki = MediaWiki('https://en.wikipedia.org/w/api.php')

continueParam = ''
requestObj = {}
requestObj['action'] = 'query'
requestObj['list'] = 'allpages'
requestObj['aplimit'] = 'max'
requestObj['apnamespace'] = '0'

pagelist = wiki.call(requestObj)
pagesInQuery = pagelist['query']['allpages']
for eachPage in pagesInQuery:
    pageId = eachPage['pageid']
    title = eachPage['title'].encode('utf-8')
    writestr = str(pageId) + "; " + title + "\n"
    listOfPagesFile.write(writestr)

numQueries = 1
while len(pagelist['query']['allpages']) > 0:
    requestObj['apcontinue'] = pagelist["continue"]["apcontinue"]
    pagelist = wiki.call(requestObj)
    pagesInQuery = pagelist['query']['allpages']
    for eachPage in pagesInQuery:
        pageId = eachPage['pageid']
        title = eachPage['title'].encode('utf-8')
        writestr = str(pageId) + "; " + title + "\n"
        listOfPagesFile.write(writestr)
        # print writestr
    numQueries += 1
    if numQueries % 100 == 0:
        print "Done with queries -- ", numQueries

print numQueries
listOfPagesFile.close()
The number of queries fired is around 28,900, which results in approximately 14.5M page names.
I also tried the all-titles link mentioned in the above answer. In that case as well I am getting around 14.5M pages.
I thought this overestimate relative to the actual number of pages was because of redirects, so I added the 'nonredirects' option to the request object:
requestObj['apfilterredir'] = 'nonredirects'
After doing that I get only 112,340 pages, which is far too small compared to 5.8M. With the above code I was expecting roughly 5.8M pages, but that doesn't seem to be the case.
Is there any other option I should try to get the actual (~5.8M) set of page names?
Here is an asynchronous program that will generate MediaWiki page titles:
async def wikimedia_titles(http, wiki="https://en.wikipedia.org/"):
    log.debug('Started generating asynchronously wiki titles at {}', wiki)
    # XXX: https://www.mediawiki.org/wiki/API:Allpages#Python
    url = "{}/w/api.php".format(wiki)
    params = {
        "action": "query",
        "format": "json",
        "list": "allpages",
        "apfilterredir": "nonredirects",
        "apfrom": "",
    }

    while True:
        content = await get(http, url, params=params)
        if content is None:
            continue
        content = json.loads(content)

        for page in content["query"]["allpages"]:
            yield page["title"]

        try:
            apcontinue = content['continue']['apcontinue']
        except KeyError:
            return
        else:
            params["apfrom"] = apcontinue

Page number variable: HTML, Django

I want to do paging, but I only need to know the current page number, so I can call the web service function with this parameter and receive the corresponding data. So all I want to know is: how can I keep track of the current page number? I'm writing my project in Django and I create the page with XSL. If I know the page number, I think I can write this in urls.py:
url(r'^ask/(\d+)/$',
    'ask',
    name='ask'),
and call the function in views.py like:
ask(request, pageNo)
But I don't know where to put the pageNo variable in the HTML page (so, for example, with pageNo=2, I can do pageNo+1 or pageNo-1 to build URLs like 127.0.0.1/ask/3/ or 127.0.0.1/ask/2/). To make my question clearer: how can I do this when there are no variables in the HTML?
Sorry for the confused question; I'm new to building websites and to Django.
I'm creating my HTML page with XSLT, so I send the whole HTML page (to show.html, which contains only {{str}}):
def ask(request):
    service = GetConfigLocator().getGetConfigHttpSoap11Endpoint()
    myRequest = GetConfigMethodRequest()
    myXml = service.GetConfigMethod(myRequest)
    myXmlstr = myXml._return
    styledoc = libxml2.parseFile("ask.xsl")
    style = libxslt.parseStylesheetDoc(styledoc)
    doc = libxml2.parseDoc(myXmlstr)
    result = style.applyStylesheet(doc, None)
    out = style.saveResultToString(result)
    ok = mark_safe(out)
    style.freeStylesheet()
    doc.freeDoc()
    result.freeDoc()
    return render_to_response("show.html", {
        'str': ok,
    }, context_instance=RequestContext(request))
I'm not working with a database; I just receive an XML file and parse it, so I don't have contact_list = Contacts.objects.all(). Can I still use this approach? Should I leave the first parameter in paginator = Paginator(contact_list, 25) blank?
If you use the standard Django paginator, it sends you to a URL like http://example.com/?page=N, where N is your page number.
So,
# urls.py
url('^ask/$', 'ask', name='viewName'),
You can get page number in views:
# views.py
def ask(request):
    page = request.GET.get('page', 1)
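Since the asker isn't reading from a database: Django's Paginator works on any sequence, including a plain Python list, so the items parsed out of the XML can be paginated directly. A rough sketch (parse_items_from_xml is a hypothetical stand-in for the XML parsing already in the view):

from django.core.paginator import Paginator, PageNotAnInteger, EmptyPage

def ask(request):
    items = parse_items_from_xml()  # hypothetical: returns a plain list
    paginator = Paginator(items, 25)
    page = request.GET.get('page', 1)
    try:
        page_obj = paginator.page(page)
    except PageNotAnInteger:
        page_obj = paginator.page(1)
    except EmptyPage:
        page_obj = paginator.page(paginator.num_pages)
    # page_obj.number, page_obj.has_next(), page_obj.has_previous() give you
    # the pageNo arithmetic from the question without hand-building URLs
    return render_to_response("show.html", {'items': page_obj},
                              context_instance=RequestContext(request))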