I'm using R and I would like to get JSON information from a URL. I have around 5,000 user agents to send to this API (http://www.useragentstring.com/pages/api.php).
I use this code to build the URL and concatenate the user agent:
url_1<-paste(" \"http://www.useragentstring.com/?uas=",uaelenchi[11,1],"&getJSON=all\"",sep = '');
json_data2<-fromJSON(readLines(cat(url_1)))
But I receive this error:
Error in readLines(cat(url_1)) : 'con' is not a connection
Any suggestions would be really appreciated! Thanks
cat() prints its argument and returns NULL, which is why readLines() complains that 'con' is not a connection. I use rjson::fromJSON(file = your_url) instead. If you make a reproducible example, I can check whether it works in your case.
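A minimal sketch of that approach, with one of your user agents as a stand-in for uaelenchi[11,1] (the URLencode() step is my addition, since raw user-agent strings contain spaces and slashes that break a query string):
library(rjson)

uas <- "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36"
# percent-encode the user agent so it survives as a query-string value
url_1 <- paste0("http://www.useragentstring.com/?uas=",
                URLencode(uas, reserved = TRUE),
                "&getJSON=all")
json_data2 <- fromJSON(file = url_1)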
library(httr)
library(jsonlite)
library(purrr)
uas <- c("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0",
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:17.0) Gecko/20100101 Firefox/17.0",
"Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.6 Safari/537.11",
"Mozilla/5.0 (X11; OpenBSD amd64; rv:28.0) Gecko/20100101 Firefox/28.0",
"Mozilla/5.0 (Windows NT 5.1; rv:31.0) Gecko/20100101 Firefox/31.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.6 Safari/537.11",
"Mozilla/5.0 (X11; OpenBSD amd64; rv:28.0) Gecko/20100101 Firefox/28.0",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:14.0) Gecko/20120405 Firefox/14.0a1",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:14.0) Gecko/20120405 Firefox/14.0a1",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
parse_uas <- function(uas) {
  # look up a single user-agent string against the API
  res <- GET("http://www.useragentstring.com/", query = list(uas = uas, getJSON = "all"))
  stop_for_status(res)
  # the response text is piped into fromJSON, so it must not be passed in again
  content(res, as = "text", encoding = "UTF-8") %>%
    fromJSON(flatten = TRUE) %>%
    as.data.frame(stringsAsFactors = FALSE)
}
map_df(uas, parse_uas)
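With around 5,000 strings to look up, it may also be worth throttling the calls; a minimal sketch (the half-second pause is an arbitrary choice, not a documented limit of this API):
parse_uas_politely <- function(uas) {
  Sys.sleep(0.5)  # arbitrary pause between requests; tune as needed
  parse_uas(uas)
}
result <- map_df(uas, parse_uas_politely)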
To save API calls, you should add a caching layer to the parse_uas() function, which can be done pretty easily with the memoise package:
library(memoise)
.parse_uas <- function(uas) {
  res <- GET("http://www.useragentstring.com/", query = list(uas = uas, getJSON = "all"))
  stop_for_status(res)
  content(res, as = "text", encoding = "UTF-8") %>%
    fromJSON(flatten = TRUE) %>%
    as.data.frame(stringsAsFactors = FALSE)
}
parse_uas <- memoise(.parse_uas)
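A quick way to confirm the cache is working (illustrative; the exact timings will differ):
system.time(parse_uas(uas[1]))  # first call hits the API
system.time(parse_uas(uas[1]))  # repeated call is answered from the cache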
Also, if you're on Linux, you can try this package (it doesn't compile well on macOS, and not at all on Windows, IIRC), which will do all the processing locally.
import com.alibaba.fastjson2.JSONArray
JSONArray.parseArray(str).toString()
I use fastjson2 to parse this JSON string and then call toString() on the result, but I encounter this error:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1024
at com.alibaba.fastjson2.JSONWriterUTF16JDK8.writeString(JSONWriterUTF16JDK8.java:183)
at com.alibaba.fastjson2.writer.ObjectWriterImplMap.write(ObjectWriterImplMap.java:428)
at com.alibaba.fastjson2.writer.ObjectWriterImplMap.write(ObjectWriterImplMap.java:457)
at com.alibaba.fastjson2.writer.ObjectWriterImplList.write(ObjectWriterImplList.java:278)
at com.alibaba.fastjson2.JSONArray.toString(JSONArray.java:871)
Similar strings work fine; I really can't figure out which special character is causing this.
My str is:
[{"response_info":{"header":"Content-Length: 388\r\nContent-Type: application/octet-stream\r\nUser-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36\r\nHost: 180.102.211.212\r\n","body":"\u0000\u0000\u0000\u0003seq\u0000\u0000\u0000\u000241\u0000\u0000\u0000\u0003ver\u0000\u0000\u0000\u00011\u0000\u0000\u0000\tweixinnum\u0000\u0000\u0000\n1429629729\u0000\u0000\u0000\u0007authkey\u0000\u0000\u0000D0B\u0002\u0001\u0001\u0004;09\u0002\u0001\u0002\u0002\u0001\u0001\u0002\u0004U6k!\u0002\u0003\u000fBA\u0002\u0004\u0015zXu\u0002\u0004\ufffd\ufffdf\ufffd\u0002\u0003\u000fU\ufffd\u0002\u0003\u0006\u0000\u0000\u0002\u0004U6k!\u0002\u0004d=\u001eS\u0002\u0004\ufffd\ufffd7\u0019\u0004\u0000\u0000\u0000\u0000\u0006rsaver\u0000\u0000\u0000\u00011\u0000\u0000\u0000\brsavalue\u0000\u0000\u0000\ufffd\ufffd\ufffd\ufffd\ufffd\u0006\u001d\ufffd_;\ufffdi\ufffdT.\ufffd\ufffd\"CK\ufffd/\u00169\u0018\u0015bI\ufffd\ufffd`<n\ufffd\ufffd\ufffdw\ufffd\ufffd\ufffd!\ufffd\u001a\u0003\ufffdHh\ufffdP%i$\ufffd$\ufffd\u0005\ufffd<\ufffd8\ufffd\ufffd\ufffd\ufffd\n\ufffd$\u0016A-O5\ufffd`\r\ufffd\ufffdc\ufffd\ufffd\u001b\ufffd\ufffd\r3\ufffd\ufffd`\ufffd)\ufffd\ufffdV\ufffdf \ufffd`\t\ufffd%\u0010\ufffd\ufffd\ufffdJ\ufffd\u001aCu\u0010\u000b\ufffd\u0001X\ufffd\ufffd\u01b7\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd.\u0000\u0000\u0000\u0007filemd5\u0000\u0000\u0000 0d65f9a4beb26b55874965490344abef\u0000\u0000\u0000\bfiletype\u0000\u0000\u0000\u00015\u0000\u0000\u0000\u0006touser\u0000\u0000\u0000\u00101688854880368629"}}]
The fastjson2 version is 2.0.10.
Why didn't I get all the HTML code when using Requests or Selenium to scrape the e-commerce website?
So, here's my code:
html = "https://www.tokopedia.com/p/fashion-anak-bayi/pakaian-anak-laki-laki/baju-tidur-anak-laki-laki?page=1&wholesale=true&goldmerchant=true&fcity=174,175,176,177,178,179"
header = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 RuxitSynthetic/1.0 v6366394992 t38550 ath9b965f92 altpub',
"Upgrade-Insecure-Requests": "1",
"DNT": "1",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate"}
web = requests.get(html,headers=header, data=data)
pageweb = BeautifulSoup(web.content, 'html.parser')
#Get name and Location
store_loc = pageweb.find_all('span',{'class':'css-1kr22w3'})
But the result doesn't include all of those span elements. If I use Selenium instead, the result is still the same: the class I'm looking for doesn't appear everywhere it should.
I'm using this code to satisfy my curiosity about scraping and to get the data.
Try this:
import requests
from bs4 import BeautifulSoup
url = "https://www.tokopedia.com/p/fashion-anak-bayi/pakaian-anak-laki-laki/baju-tidur-anak-laki-laki?page=1&wholesale=true&goldmerchant=true&fcity=174,175,176,177,178,179"
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 RuxitSynthetic/1.0 v6366394992 t38550 ath9b965f92 altpub',
    "Upgrade-Insecure-Requests": "1",
    "DNT": "1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate"
}
page = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')
store_loc = page.find_all('span', {'class': 'css-1kr22w3'})
print(len(store_loc))
for tag in store_loc:
    print(tag.text)
Output (20 items):
Jakarta Utara
Chloe Clozette
Jakarta Utara
Chloe Clozette
Jakarta Barat
AICOMFY Collection
...
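For what it's worth, the main changes from the question's snippet are dropping the data=data argument (data is never defined in the snippet, so the original code would raise a NameError before the request is even sent) and parsing .text instead of .content (either works with BeautifulSoup); the headers are otherwise identical.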
I'm trying to log in to PowerSchool to scrape my grades. Whenever I run the code, it gives me the login page's HTML instead of the secured page's HTML.
Question 1: How do I get the values of the three fields labeled 'this changes' in the code below, and submit them with the POST?
Question 2: Do I need to add anything to the code for my password, which gets hashed on each POST?
https://ps.lphs.net/public/home.html <--- Link to the login page.
[Picture of the form data in Chrome]
import requests
payload = {
    'pstoken': 'this changes',
    'contextData': 'this changes',
    'dbpw': 'this changes',
    'translator_username': '',
    'translator_password': '',
    'translator_ldappassword': '',
    'serviceName': ' PS Parent Portal',
    'serviceTicket': '',
    'pcasServerUrl': ' /',
    'credentialType': 'User Id and Password Credential',
    'account': '200276',
    'pw': 'my password',
    'translatorpw': ''
}
head = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3180.0 Safari/537.36'}

with requests.Session() as s:
    p = s.post('https://ps.lphs.net/public/', data=payload, headers=head)
    r = s.get('https://ps.lphs.net/guardian/home.html')
    print(r.text)
EDIT 1:
s.headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3180.0 Safari/537.36'}
p = s.get('https://ps.lphs.net/guardian/home.html')
print(p.text)
r = s.post('https://ps.lphs.net/guardian/home.html', data=payload,
           headers={'Content-Type': 'application/x-www-form-urlencoded',
                    'Referer': 'https://ps.lphs.net/public/home.html'})
print(r.text)
Give this a shot. It should fetch you a valid response:
import requests
payload = {
    'pstoken': 'this changes',
    'contextData': 'this changes',
    'dbpw': 'this changes',
    'translator_username': '',
    'translator_password': '',
    'translator_ldappassword': '',
    'serviceName': ' PS Parent Portal',
    'serviceTicket': '',
    'pcasServerUrl': ' /',
    'credentialType': 'User Id and Password Credential',
    'account': '200276',
    'pw': 'my password',
    'translatorpw': ''
}

with requests.Session() as s:
    s.headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3180.0 Safari/537.36'}
    r = s.post('https://ps.lphs.net/guardian/home.html', data=payload,
               headers={'Content-Type': 'application/x-www-form-urlencoded',
                        'Referer': 'https://ps.lphs.net/public/home.html'})
    print(r.text)
Btw, change the parameters in the payload (if needed) to get logged in.
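On Question 1: pstoken and contextData are typically hidden input fields on the login page itself, so a common pattern is to GET that page first and copy every hidden field into the payload. The following is a sketch under that assumption, not a verified PowerSchool recipe; in particular, dbpw looks like a hash computed by the page's JavaScript, so it may additionally need to be reproduced in Python:
import requests
from bs4 import BeautifulSoup

LOGIN_URL = 'https://ps.lphs.net/public/home.html'

with requests.Session() as s:
    s.headers['User-agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3180.0 Safari/537.36'
    # collect the per-request hidden inputs (pstoken, contextData, ...) from the form
    soup = BeautifulSoup(s.get(LOGIN_URL).text, 'html.parser')
    hidden = {tag['name']: tag.get('value', '')
              for tag in soup.find_all('input', type='hidden')
              if tag.has_attr('name')}
    payload = {**hidden, 'account': '200276', 'pw': 'my password'}
    r = s.post('https://ps.lphs.net/guardian/home.html', data=payload)
    print(r.text)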
Can someone explain to me why session2 gives me the following error?
library("rvest")
uastring = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session = html_session("https://www.linkedin.com/job/", user_agent(uastring))
session2 = html_session("https://www.linkedin.com/job/")
Error in http_statuses[[as.character(status)]] : subscript out of bounds
I have this example from https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/
How can I check which value of uastring I have to pass to html_session (for different sites)? I'm not asking about this specific site (I put it here because it comes from the tutorial).
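There is no per-site registry of "required" user agents; the usual approach is to copy the string a real browser sends (visible in your browser's developer tools). To check what a given uastring actually puts on the wire, one option, my choice of test service rather than anything from the tutorial, is an echo endpoint such as httpbin.org (html_session() uses httr under the hood, so httr::GET() behaves the same way):
library(httr)

uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
# httpbin.org/user-agent echoes back the User-Agent header it received
resp <- GET("https://httpbin.org/user-agent", user_agent(uastring))
cat(content(resp, as = "text"))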
I am trying to get the page from www.dotabuff.com.
library(RCurl)
url <- "http://www.dotabuff.com/heroes/abaddon/matchups"
webpage <- getURL(url,verbose = TRUE)
The result is a page from dotabuff complaining about too many requests. I was expecting an HTML page with a table, like the one viewable in a web browser. I have tried http, https, getURLContent, etc.
I think this has something to do with the kind of request getURL sends, or maybe something tricky about that website.
Add a header to the request...
library(RCurl)
url <- "http://www.dotabuff.com/heroes/abaddon/matchups"
options(RCurlOptions = list(
  verbose = TRUE,
  useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13"
))
webpage <- getURL(url,verbose = TRUE)
* Trying 23.235.40.64...
* Connected to www.dotabuff.com (23.235.40.64) port 80 (#0)
> GET /heroes/abaddon/matchups HTTP/1.1
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13
Host: www.dotabuff.com
Accept: */*
< HTTP/1.1 200 OK
...
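As a side note, the same user agent can also be supplied per call instead of via global options; a small variant that should behave identically:
library(RCurl)

url <- "http://www.dotabuff.com/heroes/abaddon/matchups"
# pass the libcurl useragent option directly instead of setting it globally
webpage <- getURL(url,
                  useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13")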