I'm trying to find email addresses in HTML using a regex, but I have problems with some websites.
The main problem is that the regex call hangs the process and leaves the CPU overloaded.
import re
from urllib.request import urlopen, Request
email_regex = re.compile('([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})', re.IGNORECASE)
request = Request('http://www.serviciositvyecla.com')
request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
html = str(urlopen(request, timeout=5).read().decode("utf-8", "strict"))
email_regex.findall(html) ## here is where regex takes a long time
I have no problems with other websites, for example:
request = Request('https://www.velezmalaga.es/')
If someone knows how to solve this problem, or knows how to time out the regex function, I would appreciate it.
I use Windows.
I initially tried fiddling with your approach, but then I ditched it and resorted to BeautifulSoup. It worked.
Try this:
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}
pages = ['http://www.serviciositvyecla.com', 'https://www.velezmalaga.es/']
emails_found = set()
for page in pages:
    html = requests.get(page, headers=headers).content
    soup = BeautifulSoup(html, "html.parser").select('a[href^=mailto]')
    for item in soup:
        try:
            emails_found.add(item['href'].split(":")[-1].strip())
        except ValueError:
            print("No email :(")
print('\n'.join(email for email in emails_found))
Output:
info#serviciositvyecla.com
oac#velezmalaga.es
EDIT:
One reason your approach doesn't work is, well, the regex itself. The other one is the size (I suspect) of the HTML returned.
See this:
import re
import requests
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36",
}
html = requests.get('https://www.velezmalaga.es/', headers=headers).text
op_regx = '([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})'
simplified_regex = '[\w\.-]+#[\w\.-]+\.\w+'
print(f"OP's regex results: {re.findall(op_regx, html)}")
print(f"Simplified regex results: {re.findall(simplified_regex, html)}")
This prints:
OP's regex results: []
Simplified regex results: ['oac#velezmalaga.es', 'oac#velezmalaga.es']
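A note on the comparison above: the question compiled its pattern with re.IGNORECASE, while the snippet runs it without that flag, and the character classes only list uppercase letters. The minimal sketch below shows the effect of the flag on its own. The pattern is copied as written in this thread; the sample string is made up for illustration, built around the address from the output above.

```python
import re

# Pattern copied verbatim from the question in this thread.
OP_PATTERN = r'([A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4})'

# Hypothetical sample text containing the address shown in the output above.
sample = 'contact: oac#velezmalaga.es'

# Without the flag, [A-Z] cannot match the lowercase letters in the address.
case_sensitive = re.findall(OP_PATTERN, sample)

# With the flag (as in the question), the same pattern matches.
case_insensitive = re.findall(OP_PATTERN, sample, re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```

So the empty result for the original pattern here comes from the missing flag rather than from the pattern's structure.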
Finally, I found a solution that avoids exhausting the RAM during a regex search. In my case, getting an empty result even though there is an email address on the page is acceptable, as long as the process is not blocked by a lack of memory.
The HTML of the scraped page contained 5.5 million characters. 5.1 million of them carried no useful information, since they belong to a hidden div full of unintelligible characters.
I added a guard similar to:
if len(html) < 1000000: do whatever
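The size guard could be wrapped in a small helper along these lines. This is only a sketch: the function name, the constant names and the choice to return an empty list are assumptions, and the pattern is the simplified one from the earlier answer, copied as written in this thread.

```python
import re

# Simplified pattern from the answer above, as written in this thread.
SIMPLIFIED_PATTERN = re.compile(r'[\w.-]+#[\w.-]+\.\w+')

# Threshold taken from the guard above; tune it for your own pages.
MAX_HTML_CHARS = 1_000_000


def find_emails(html, pattern=SIMPLIFIED_PATTERN, max_chars=MAX_HTML_CHARS):
    """Return the unique matches in html, or [] if the page is too large."""
    if len(html) >= max_chars:
        # Skip oversized pages instead of risking a very slow search.
        return []
    return sorted(set(pattern.findall(html)))


print(find_emails('<a href="mailto:oac#velezmalaga.es">oac#velezmalaga.es</a>'))
```

The trade-off is the one described above: an oversized page yields no addresses at all, but the process keeps running.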
I'm an absolute beginner with GET/POST requests and MicroPython.
I'm programming my ESP8266 Wemos D1 mini as an HTTP server with MicroPython. My project consists of using a website to control the RGB values of a neopixel matrix hooked up to the D1 (all the code is on my GitHub here: https://github.com/julien123123/NeoLamp-Micro).
Basically, the website contains three sliders: one for Red, one for Green and one for Blue. JavaScript code reads the value of each slider and sends it to the MicroPython code using the POST method, as follows:
getColors = function() {
    var rgb = new Array(slider1.value, slider2.value, slider3.value);
    return rgb;
};
postColors = function(rgb) {
    var xmlhttp = new XMLHttpRequest();
    var npxJSON = '{"R":' + rgb[0] + ', "G":' + rgb[1] + ', "B":' + rgb[2] + '}';
    xmlhttp.open('POST', 'http://' + window.location.hostname + '/npx', true);
    xmlhttp.setRequestHeader('Content-type', 'application/json');
    xmlhttp.send(npxJSON);
};
To receive the request in MicroPython, here's my code:
conn, addr = s.accept()
request = conn.recv(1024)
request = str(request)
print(request)
The request prints as follows:
b'POST /npx HTTP/1.1\r\nHost: 192.xxx.xxx.xxx\r\nConnection: keep-alive\r\nContent-Length: 27\r\nOrigin: http://192.168.0.110\r\nUser-Agent: Mozilla/5.0 (X11; CrOS x86_64 10323.46.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.107 Safari/537.36\r\nContent-type: application/json\r\nAccept: */*\r\nReferer: http://192.xxx.xxx.xxx/\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: fr,en;q=0.9,fr-CA;q=0.8\r\n\r\n{"R":114, "G":120, "B":236}'
The only important bit for me is at the end : {"R":114, "G":120, "B":236}. I want to use those values to change the color values of my neopixel object.
My question to you is: how do I process the request so that I keep only the dictionary containing the RGB values at the end?
Thanks in advance (I'm almost there!)
This is really about generic Python data types. The request is a bytes object, as indicated by the b prefix in b'POST /npx HTTP/1.1...\r\n{"R":114, "G":120, "B":236}'. Calling str() on it only gives you its printable representation, b'...' wrapper included, so use decode() to convert it to a string instead:
import json
request = b'POST /npx HTTP/1.1\r\nHost: 192.xxx.xxx.xxx\r\nConnection: keep-alive\r\nContent-Length: 27\r\nOrigin: http://192.168.0.110\r\nUser-Agent: Mozilla/5.0 (X11; CrOS x86_64 10323.46.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.107 Safari/537.36\r\nContent-type: application/json\r\nAccept: */*\r\nReferer: http://192.xxx.xxx.xxx/\r\nAccept-Encoding: gzip, deflate\r\nAccept-Language: fr,en;q=0.9,fr-CA;q=0.8\r\n\r\n{"R":114, "G":120, "B":236}'
data = request.decode() # convert to str
rgb = data.split('\r\n')[-1:]  # split the str and discard the HTTP headers
for color in rgb:
    print(color, type(color))
    d = json.loads(color)
    print(d, type(d))
color is a str containing the JSON text, and d is the Python dict you can use for further manipulation:
{"R":114, "G":120, "B":236} <class 'str'>
{'R': 114, 'G': 120, 'B': 236} <class 'dict'>
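Splitting on every '\r\n' works here because the body is a single line. A slightly more defensive variant is to split once on the blank line that separates the headers from the body. The sketch below assumes the whole request arrived in one recv() call; the helper name extract_json_body and the shortened sample request are made up for illustration. It is written for standard Python, and on MicroPython the json module may be exposed as ujson depending on the firmware.

```python
import json


def extract_json_body(raw):
    """Return the JSON body of a raw HTTP request (bytes) as a dict."""
    # Headers and body are separated by one blank line (CRLF CRLF).
    parts = raw.split(b"\r\n\r\n", 1)
    if len(parts) != 2:
        raise ValueError("no blank line found after the HTTP headers")
    return json.loads(parts[1].decode())


# Shortened, hypothetical version of the request printed in the question.
request = (b'POST /npx HTTP/1.1\r\n'
           b'Content-Length: 27\r\n'
           b'Content-type: application/json\r\n'
           b'\r\n'
           b'{"R":114, "G":120, "B":236}')

colors = extract_json_body(request)
print(colors["R"], colors["G"], colors["B"])
```

The values in colors are already integers, so they can be passed on to the neopixel code without further conversion.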
For learning purposes, I'm trying to reproduce Instagram's internal API with Ruby and Faraday. However, the response body I get when making a POST is somehow encoded instead of being JSON:
What the response's body should look like:
{
"status": "ok",
"media": {
"page_info": {
"start_cursor": "1447303180937779444_4460593680",
"has_next_page": true,
"end_cursor": "1447303180937779444",
"has_previous_page": true
},
...
What I get:
#=> \x1F\x8B\b\x00#\x15\x9EX\x02\xFF...
Question:
Any idea (i) why I'm getting a response body like that, and (ii) how I can convert it to JSON?
Flow:
When you hit https://www.instagram.com/explore/locations/127963847/madrid-spain/ in your browser Instagram makes two requests (among others):
GET: https://www.instagram.com/explore/locations/127963847/madrid-spain/
POST: https://www.instagram.com/query/
I used Postman to intercept the requests and just copied the headers and parameters for the second (/query/) request. This is my implementation (I get status 200):
class IcTest
  require 'open-uri'
  require "net/http"
  require "uri"

  def self.faraday
    conn = Faraday.new(:url => 'https://www.instagram.com') do |faraday|
      faraday.request :url_encoded # form-encode POST params
      faraday.response :logger # log requests to STDOUT
      faraday.adapter Faraday.default_adapter # make requests with Net::HTTP
    end

    res = conn.post do |req|
      req.url '/query/'
      req.headers['Origin'] = 'https://www.instagram.com'
      req.headers['X-Instagram-AJAX'] = '1'
      req.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
      req.headers['Content-Type'] = 'application/x-www-form-urlencoded'
      # req.headers['Accept'] = '*/*'
      req.headers['X-Requested-With'] = 'XMLHttpRequest'
      req.headers['X-CSRFToken'] = 'SrxvROytxQHAesy1XcgcM2PWrEHHuQnD'
      req.headers['Referer'] = 'https://www.instagram.com/explore/locations/127963847/madrid-spain/'
      req.headers['Accept-Encoding'] = 'gzip, deflate, br'
      req.headers['Accept-Language'] = 'es,en;q=0.8,gl;q=0.6,pt;q=0.4,pl;q=0.2'
      req.headers['Cookie'] = 'mid=SJt50gAEAAE6KZ50GByVoStJKLUH; sessionid=IGSC514a2e9015f548b09176228f83ad5fe716f32e7143f6fe710c19a71c08b9828b%3Apc2KPxgwvokLyZhfZHcO1Qzfb2mpykG8%3A%7B%22_token%22%3A%2233263701%3Ai7HSIbxIMLj70AoUrCRjd0o1g7egHg79%3Acde5fe679ed6d86011d70b7291901998b8aae7d0aaaccdf02a2c5abeeaeb5908%22%2C%22asns%22%3A%7B%2283.34.38.249%22%3A3352%2C%22time%22%3A1486584547%7D%2C%22last_refreshed%22%3A1436584547.2838287%2C%22_platform%22%3A4%2C%22_token_ver%22%3A2%2C%22_auth_user_backend%22%3A%22accounts.backends.CaseInsensitiveModelBackend%22%2C%22_auth_user_id%22%3A33233701%2C%22_auth_user_hash%22%3A%22%22%7D; ds_user_id=31263701; csrftoken=sxvROytxQHAesy1XcgcM2PWrEHHuQnD; s_network=""; ig_vw=1440; ig_pr=2;'
      req.body = { :q => "ig_location(127963847) { media.after('', 60) { count, nodes { caption, code, comments { count }, comments_disabled, date, dimensions { height, width }, display_src, id, is_video, likes { count }, owner { id }, thumbnail_src, video_views }, page_info} }",
                   :ref => "locations::show",
                   :query_id => "" }
    end
  end
end
Thanks.
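Josh's comment solved it! :-)
The response body was gzip-compressed: it starts with \x1F\x8B, the gzip magic number, and the request asks for compression through its Accept-Encoding header.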
Solution here.
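The decoding step can be sketched as follows. This is an illustration in Python rather than Ruby, to match the other examples in this document, and the helper name decode_json_body is made up. It checks for the two-byte gzip signature (\x1F\x8B) seen at the start of the body in the question, decompresses if it is present, and then parses the JSON.

```python
import gzip
import json

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_json_body(body):
    """Decompress body if it is gzip data, then parse it as JSON."""
    if body[:2] == GZIP_MAGIC:
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


# Simulate a compressed response body like the one in the question.
compressed = gzip.compress(json.dumps({"status": "ok"}).encode("utf-8"))
print(decode_json_body(compressed))
```

Another option is to stop sending the Accept-Encoding header copied from the browser, so that the server has no reason to compress the response in the first place.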
I am starting development with Erlang and need to make a REST HTTP call to a server, where I send JSON and receive a JSON confirmation.
Here is the code:
Method = put,
URL = "http://api.teste.com:8080/v1/user_auth",
Header = [],
Type = "application/json",
Json = <<"{ \"data\" : { \"test-one\" : \"123\", \"test-two\" : \"return test 2\" } }">>,
HTTPOptions = [],
Options = [],
application:start(ssl),
application:start(inets),
httpc:request(Method, {URL, Header, Type, Json}, HTTPOptions, Options).
When I run this code, I get the following error:
=ERROR REPORT==== 5-Dec-2015::14:21:01 ===
Error in process <0.161.0> on node 'middleware#127.0.0.1' with exit value:
{[{reason,undef},
{mfa,{user_account_handler,handle_post,2}},
{stacktrace,[{httpc,request,
[put,
{"http://api.teste.com:8080/v1/user_auth",[],
"application/json",
<<"{ \"data\" : { \"test-one\" : \"123\", \"test-two\" : \"return test 2\" } }">>},
[],[]],
[]},
{cowboy_rest,call,3,[{file,"src/cowboy_rest.erl"},{line,972}]},
{cowboy_rest,process_content_type,3,
[{file,"src/cowboy_rest.erl"},{line,773}]},
{cowboy_protocol,execute,4,
[{file,"src/cowboy_protocol.erl"},
{line,442}]}]},
{req,[{socket,#Port<0.479>},
{transport,ranch_tcp},
{connection,keepalive},
{pid,<0.161.0>},
{method,<<"POST">>},
{version,'HTTP/1.1'},
{peer,{{127,0,0,1},49895}},
{host,<<"localhost">>},
{host_info,undefined},
{port,8080},
{path,<<"/v1/create_user_account">>},
{path_info,undefined},
{qs,<<>>},
{qs_vals,undefined},
{bindings,[]},
{headers,[{<<"host">>,<<"localhost:8080">>},
{<<"connection">>,<<"keep-alive">>},
{<<"content-length">>,<<"58">>},
{<<"cache-control">>,<<"no-cache">>},
{<<"origin">>,
<<"chrome-extension://fhbjgbiflinjbdggehcddcbncdddomop">>},
{<<"content-type">>,<<"application/json">>},
{<<"user-agent">>,
<<"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.73 Safari/537.36">>},
{<<"postman-token">>,
<<"2dc302f2-7c93-b9f9-2143-cff41bfeb45a">>},
{<<"accept">>,<<"*/*">>},
{<<"accept-encoding">>,<<"gzip, deflate">>},
{<<"accept-language">>,
<<"en-US,en;q=0.8,es;q=0.6,pt;q=0.4">>}]},
{p_headers,[{<<"content-type">>,{<<"application">>,<<"json">>,[]}},
{<<"if-modified-since">>,undefined},
{<<"if-none-match">>,undefined},
{<<"if-unmodified-since">>,undefined},
{<<"if-match">>,undefined},
{<<"accept">>,[{{<<"*">>,<<"*">>,[]},1000,[]}]},
{<<"connection">>,[<<"keep-alive">>]}]},
{cookies,undefined},
{meta,[{media_type,{<<"application">>,<<"json">>,[]}},
{charset,undefined}]},
{body_state,waiting},
{buffer,<<"{\n \"username\":\"igor#gmail.com\"\n , \"password\":\"123\"\n}">>},
{multipart,undefined},
{resp_compress,false},
{resp_state,waiting},
{resp_headers,[{<<"content-type">>,
[<<"application">>,<<"/">>,<<"json">>,<<>>]}]},
{resp_body,<<>>},
{onresponse,undefined}]},
{state,undefined}],
[{cowboy_rest,process_content_type,3,
[{file,"src/cowboy_rest.erl"},{line,773}]},
{cowboy_protocol,execute,4,[{file,"src/cowboy_protocol.erl"},{line,442}]}]}
I have a MongoDB collection "mongocollection". Each document in the collection consists of two fields: first a string "cid", which serves as the id, and second a JSON array.
Example:
{
    "_id" : "domain.com",
    "className" : "UserAgents",
    "userAgents" : [
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-CA; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5",
        "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.6) Gecko/20100628 Ubuntu/10.04 (lucid) Firefox/3.6.6 (.NET CLR 3.5.30729)",
        "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.2) Gecko/20090803 Firefox/3.5.2 Slackware",
        "Mozilla/5.0 (X11; U; Linux x86_64) Gecko/2008072820 Firefox/3.0.1"
    ]
}
From the mongo command line, I can get the contents of a given document within a collection with:
db.CorruptUserAgents.find({"_id":"domain.com"}).pretty();
How do I get the number of elements in a given array of a given document? For example:
SOMETHING.count();
100
Is it possible to do this via the command line?
I know I can get the document, iterate over the array and count the elements, but I want to do this from the command line.
Since you are looking up by _id, you can use .findOne() instead; it returns a single document rather than a cursor.
So you can simply do something like this:
var document = db.CorruptUserAgents.findOne({"_id":"domain.com"});
var count = document.userAgents.length;
You can also get it directly in a single expression:
db.CorruptUserAgents.findOne({"_id":"domain.com"}).userAgents.length;
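Both answers fetch the whole document and count on the client. If the array is large, the count can instead be computed on the server with the $size aggregation operator. The sketch below only builds the pipeline as plain Python data, so it runs without a database; the helper name array_size_pipeline and the output field name count are assumptions made for illustration.

```python
def array_size_pipeline(doc_id, array_field):
    """Build an aggregation pipeline that returns the length of one array."""
    return [
        # Select the single document by its _id.
        {"$match": {"_id": doc_id}},
        # Replace the document with just the array length.
        {"$project": {"_id": 0, "count": {"$size": "$" + array_field}}},
    ]


pipeline = array_size_pipeline("domain.com", "userAgents")
print(pipeline)
```

The same list of stages can be passed to db.CorruptUserAgents.aggregate(...) in the mongo shell, or to the aggregate() method of a pymongo collection. Note that $size raises an error if the field is missing or is not an array.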