Crawling and Scraping iTunes App Store

Crawling and Scraping iTunes App Store - language-agnostic

I noticed that iTunes preview allows you to crawl and scrape pages via the http:// protocol. However, many of the links are trying to be opened in iTunes rather than the browser. For example, when you go to the iBooks page, it immediately tries opening a url with an itms:// protocol.
Are there any other methods of crawling the App Store or is this the only way?
Can the itms:// protocol links themselves be crawled somehow?

I would have a decent look at the iTunes Search API and the iTunes Enterprise Partner API
Search API -
http://www.apple.com/itunes/affiliates/resources/blog/introduction---search-api.html
Enterprise Partner API -
http://www.apple.com/itunes/affiliates/resources/documentation/itunes-enterprise-partner-feed.html
You might get most/all of the information you need in a nice JSON file format.
If you can't get the information you need with the API, I would be interested what it is :)

As phillipp mentioned, the iTunes search API is an easy way to retrieve data about your App Store listings in JSON format.
Simply query for this with your app id (you can find the app id by viewing the web listing for your app at itunes.apple.com), ex:
http://itunes.apple.com/lookup?id=INSERT_YOUR_APP_ID_HERE
then, parse the resulting JSON to your heart's content.

The only difference between http:// links and itms:// links is that you need to set your User-Agent to an iTunes user-agent, and depending on the version you may also have to include a verification code based on some not-so-secret algorithm.
For example this is the code for iTunes 9:
# Some magic. Generates a seed we use for X-Apple-Validation. Adapted from LWP::UserAgent::iTMS_Client.
function comp_seed($url, $user_agent) {
$random = sprintf( "%04X%04X", rand(0,0x10000), rand(0,0x10000) );
$static = base64_decode("ROkjAaKid4EUF5kGtTNn3Q==");
$url_end = ( preg_match("|.*/.*/.*(/.+)$|",$url,$matches)) ? $matches[1] : '?';
$digest = md5(join("",array($url_end, $user_agent, $static, $random)) );
return $random . '-' . strtoupper($digest);
}
However if you are only scraping, iTunes preview should work for your purposes, the link you gave us to the iBooks page had more than enough information to scrape.

We tried scraping ourselves too about a year ago and it just became too much of a headache. Philipp's comment is a good one as the enterprise feed from apple (need to apply for it with a legitimate use) does have a good amount of useful info that you might be after in scraping.
There are a few companies that offer data as a service too - abto and AppMonsta are two I heard of when I was looking. I can't seem to find abto anymore but http://appmonsta.com seems to be. The search API looks ok (never experimented) but limited.
Good luck!

Related

What is the reason why the OneNote APIs won't return all the pages in a notebook?

I am reading around here and I am seeing multiple messages about the /pages endpoint that is not working a expected
It seems that the OneNote APIs (MS Graph or Office365) are not returning all the pages that the user can see. In particular recent pages are not shown as available.
This message is for those of you who work for Microsoft and who keep an eye on this forum. Please if you have any explanation or workaround for this we would like to hear about it.
If this is work in progress we would also like to know when the APIs can be considered stable and reliable enough to consider them OK for production use
Update:
Permissions or scopes
scopes=[
"Notes.Read",
"Notes.Read.All",
"Notes.ReadWrite",
]
This is for a device authorization flow, the device is acting as a Microsoft Online account. The app is registered to Azure as personal app but the enterprise one does the same
The authorization process is described here
What type of app/authentication flow should I select to read my cloud OneNote content using a Python script and a personal Microsoft account?
After that I am using this endpoint to get the notebooks
https://graph.microsoft.com/v1.0/users/user-id/onenote/notebooks
from the returned json I pick the endpoint for the notebook I want to read and I access the endpoint the link stored in notebook['sectionsUrl']. This call returns a sections json
From this I pick the section I want and I access the link stored in section['pagesUrl']
Each call returns the expected info excepting the last one, when I get an arbitrary low number of pages in the section I want to explore. There is nothing wrong with the format of the info, it is just incomplete or not up to date
Not sure if this is related but when I try to access the pages in a section from MS Graph Explored I am seeing the same behavior (not all the pages are reported). This is a shared notebook and I am using the owner account for all the above so it should not be a permission problem
from msal import PublicClientApplication
import requests
endpoint= "https://graph.microsoft.com/v1.0/me/onenote"
authority = "https://login.microsoftonline.com/consumers"
app=PublicClientApplication(client_id=client_id, authority=authority)
flow = app.initiate_device_flow(scopes=scopes)
# there is an interactive part here that I automated using selenium, you
# are supposed to ouse a link to enter a code and then autorize the
# device; code not shown
result = app.acquire_token_by_device_flow(flow)
token= result['access_token']
headers={'Authorization': 'Bearer ' + token}
endpoint= "https://graph.microsoft.com/v1.0/users/c5af8759-4785-4abf-9434-xxxxxxxxx/onenote/notebooks"
notebooks = requests.get(endpoint,headers=headers).json()
for notebook in notebooks['value']:
print(notebook['displayName'])
print(notebook['sectionsUrl'])
print(notebook['sectionGroupsUrl'])
# I pick a certain notebook
section=[section for section in sections if section['displayName']=="Test"][0]
endpoint=notebook['sectionsUrl']
pages=requests.get(endpoint,headers=headers).json()
for page in pages['value']:
print(page['title'])
Update2
If I use this endpoint
https://graph.microsoft.com/v1.0/users/user-id/onenote/sections/section-id/pages
I would expect to get the complete list of pages for that section.
That is not working
After reading again and again the docs I my understanding is that the approach is to
call https://graph.microsoft.com/v1.0/users/user-id/onenote/pages$fiter or search etc etc
I this correct?
Also I vaguely remember there is a way to search for a section and have it expanded so that the search returs the children too.
Am I close to understanding this?
Thank you
MM

get data patterns in HTML

my goal is to write some lines of R code which allow me to make web scraping from
www.skyscanner.it/trasporti/voli/mila/fran/180201?adults=1&children=0&adultsv2=1&childrenv2=&infants=0&cabinclass=economy&rtn=0&preferdirects=false&outboundaltsenabled=false&inboundaltsenabled=false&ref=home#results
getting: airline, departure and arrival airportS, departure and arrival timeS, price.
I decided to use the Rcrawler package (here how it works) but, having no experience of HTML, i've no idea of how to set the ExtractXpathPat option to get data.
Rcrawler(Website = "https://www.skyscanner.it/trasporti/voli/mila/fran/180201?adults=1&children=0&adultsv2=1&childrenv2=&infants=0&cabinclass=economy&rtn=0&preferdirects=false&outboundaltsenabled=false&inboundaltsenabled=false&ref=day-view#results",
no_cores = 4, no_conn = 4, ExtractXpathPat = c("?????"))
What should i do? How can i learn how to set path?
Thanks!

Be careful according to the policy of the domain is not allowed to extract through web scraping the information. However to get the css code or the xpath you can use "Selector Gadget" or the inspect button in your browser.
To make sure web scraping is allowed you must visit the robots.txt of the domain. In your case: http://www.skyscanner.com/robots.txt. You can also use robotstxt package.

How to restrict fields returned by stackexchange api, and turn off paging?

I'd like to have a list of just the current titles for all questions in one of the smaller (less than 10,000 questions) stackexchange site. I tried the interactive utility here: https://api.stackexchange.com/docs/questions and it both reports the result as a json at the bottom, and produces the requesting url at the top. For example:
https://api.stackexchange.com/2.2/questions?order=desc&sort=activity&tagged=apples&site=cooking
returns this JSON in my browser:
{"items":[{"tags":["apples","crumble"],"owner":{ ...
...
...],"has_more":true,"quota_max":300,"quota_remaining":252}
What is quota? It was 10,000 on one search on one site, but suddenly it's only 300 here.
I won't be doing this very often, what I'd like is the quickest way to edit that (or similar of course) url so I can get a list of all of the titles on a small site. I don't understand how to use paging, and I don't need any of the other fields. I don't care if I get them, but I'm thinking if I exclude them I can have more at once.
If I need to script it, python (2.7) is my preferred (only) language.

quota_max is the number of requests your application is allowed per day. 300 is the default for an unregistered application. This used to be mentioned directly on the page describing throttles, but seems to have been removed. Here is historical information describing the default.
To increase this to 10,000, you need to register an application and then authenticate by passing an access token in your script.
To get all titles on a site, you can use a Python library to help:
StackAPI. The answer below will use this library. DISCLAIMER: I wrote this library
Py-StackExchange
SEAPI
StackPy
Assuming you have registered your application and authenticated we can proceed.
First, install StackAPI (documentation):
pip install stackapi
This code will then grab the 10,000 most recent questions (max_pages * page_size) for the site hardwarerecs. Each page costs you one API hit, so the more items per page, the few API calls.
from stackapi import StackAPI
SITE = StackAPI('hardwarerecs')
SITE.page_size = 100
SITE.max_pages = 100
# Filter to only get question title and link
filter = '!BHMIbze0EQ*ved8LyoO6rNjkuLgHPR'
questions = SITE.fetch('questions', filter=filter)
In the questions variable is a dictionary that looks very similar to the API output, except that the library did all the paging for you. Your data is in questions['data'] and, in this case, contains a list of dictionaries that look like this:
[
...
{u'link': u'http://hardwarerecs.stackexchange.com/questions/29/sound-board-to-replace-a-gl2200-in-a-house-of-worship-foh-setting',
u'title': u'Sound board to replace a GL2200 in a house-of-worship FOH setting?'},
{ u'link': u'http://hardwarerecs.stackexchange.com/questions/31/passive-gps-tracker-logger',
u'title': u'Passive GPS tracker/logger'}
...
]
This result set is limited to only the title and the link because of the filter we applied. You can find the appropriate filter by adjusting what fields you want in the web UI and copying the filter field.
The hardwarerecs parameter that is passed when creating the SITE parameter is the first part of the site's domain URL. Alternatively, you can find it by looking at the api_site_parameter for your site when looking at the /sites end point.

Logging service allowing simple <img/> interface

I'm looking to do some dead-simple logging from a web app (client-side) to some remote service/endpoint. Sure, I could roll my own, but for the purpose of this task, let's assume I want an existing service like Logentries/Splunk/Logstash so that my viewers can still log debugging info if my backend goes down.
Most logging services offer an API where I can import some <script/> onto my page and then use an API like LE.log('string', data); [Logentries example]. However that pulls in a JS dependency and uses cross-domain XHR for probably well-founded reasons (like URI length limitations).
My question is if anyone can point me to a service that will let me send simple query params to a "pixel" endpoint (similar to how Google Analytics does it). Something like:
<script>
new Image().src = 'http://something.io/pixel/log/<API_TOKEN>?some_data=1234';
</script>
-- or, in pure HTML --
<img src="http://something.io/pixel/log/<API_TOKEN>?some_data=1234" style="display:none" />
I'd assume some of the big names in the logging-as-a-service space would have something like this but I've not found anything (or it's too specific to turn up any search results).
This would not be for analytics so much as error logging, debugging, etc. Fire-and-forget sort of stuff.
Any advice appreciated.

It's possible to do this with Logentries, they offer a pixel tracker.
They require that data is sent in a base64 encoding, but that's quite simple in Javascript.
From their documentation:
var encoded = encodeURIComponent(btoa("Log message"));
This data can then be used in a pixel tracker like this:
<img src="https://js.logentries.com/v1/logs/{API-TOKEN}?e={ENCODED_DATA}/">

Scraping data after filling out form?

I'm doing a little project for my class and I'm just a beginner, so please forgive me if I mix up some of my terminology.
Basically, I'm creating an interactive journey planner for my city's public transit system. Unfortunately, they haven't made all the data I need publicly available. So instead of putting all my time into gathering the data for personal use, I've opted to do some screen scraping - letting their servers calculate the journey info from a START and STOP variable and then displaying the selected info on my page.
So is it possible to fill out a form's fields remotely, and then scrape the data on the page that subsequently loads? And if so, what would be the quickest, most convenient way? This happens to be a case where the data can't be manipulated via the URL, so it has to access the data by filling out the form first.
The website in question:
http://jp.translink.com.au/travel-information/journey-planner

Here is what you can do:
1.) Send a POST Request to the journey-planner with some data like that (be aware that CORS might jump in, then you could use cURL via PHP or whatsoever):
Start:Wickham Tce, Spring Hill
End:Upper Edward St, Spring Hill
SearchDate:10/05/2013 12:00:00 AM
TimeSearchMode:LeaveAfter
SearchHour:7
SearchMinute:40
TimeMeridiem:AM
TransportModes:Bus
TransportModes:Train
TransportModes:Ferry
MaximumWalkingDistance:1500
WalkingSpeed:Normal
ServiceTypes:Regular
ServiceTypes:Express
ServiceTypes:NightLink
FareTypes:Standard
FareTypes:Prepaid
FareTypes:Free
2.) You will get a new response location. This seems to be a REST link. Important for you is the id at the end. You will have to call that page and parse the HTML and look for a div with the HTML-id option-summaries, where you will find more information within the divs travel-option-1 to travel-option-n. You have to look at it carefully in order to find out which information is stored whee and how you will be able to use it.
In order to find such things you should learn how to use Firebug or Chrome's development tools.
This is one way to solve your problem. Probably not the best but still better than "screen-scraping" anything. But it will ask you for a lot of skills and effort. Furthermore if the data provider is going to change just a bit your solution will not work anymore. Additionally they might prevent your access by CORS or anything else (blocking your IP etc.)

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008