Extracting "http" strings from 100s of .htm files - extract

I know a bit of PHP, a bit of Python, and I'm pretty savvy at finding tools, but I can't find a tool or methodology that will parse the .htm files and return all strings containing "http".
I KNOW there's a quick fix. Anyone?

You can try this in Python:
def grepFileForLines(fileName, keepLinesWith=""):
    # collect every line of fileName that contains keepLinesWith
    matches = []
    with open(fileName, 'r') as fileObj:
        for line in fileObj:
            if keepLinesWith in line:
                matches.append(line)
    return matches
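To run it over hundreds of files at once, you could combine it with glob — a minimal sketch, assuming the .htm files sit in one directory:
import glob

all_matches = []
for htm_file in glob.glob("*.htm"):
    all_matches.extend(grepFileForLines(htm_file, "http"))

for match in all_matches:
    print(match.rstrip())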

Related

MicroPython: bytearray in a JSON file

I'm using MicroPython in the newest version, with a DS18B20 temperature sensor. A sensor address is, for example, b'(b\xe5V\xb5\x01<:' — that is the string representation of a bytearray. If I use this to save the address in a JSON file, I run into some problems:
If I store "b'(b\xe5V\xb5\x01<:'" directly, then after reading the JSON file back there are no single backslashes left, and I get b'(bxe5Vxb5x01<:' inside Python.
If I escape the backslashes ("b'(b\\xe5V\\xb5\\x01<:'"), I get double backslashes in Python: b'(b\\xe5V\\xb5\\x01<:'.
How do I get a single backslash?
Thank you
You can't save bytes in JSON with MicroPython. As far as JSON is concerned, that's just some string. Even if you got it to give you what you think you want (i.e. single backslashes), it still wouldn't be bytes. So you are faced with making some form of conversion, no matter what.
One idea is that you could convert it to an int, and then convert it back when you open it. Below is a simple example. Of course you don't have to have a class and staticmethods to do this; it just seemed like a good way to wrap it all into one, without even needing an instance of it hanging around. You can put the entire class in some other file, import it where needed, and just call its methods as you need them.
import math, ujson, utime

class JSON(object):

    @staticmethod
    def convert(data: dict, convert_keys=None) -> dict:
        # bytes/bytearray values become ints for saving; ints become bytes again on loading
        if isinstance(convert_keys, (tuple, list)):
            for key in convert_keys:
                if isinstance(data[key], (bytes, bytearray)):
                    data[key] = int.from_bytes(data[key], 'big')
                elif isinstance(data[key], int):
                    # note: leading zero bytes (b'\x00...') do not survive this round trip
                    data[key] = data[key].to_bytes(1 if not data[key] else int(math.log(data[key], 256)) + 1, 'big')
        return data

    @staticmethod
    def save(filename: str, data: dict, convert_keys=None) -> None:
        # dump doesn't seem to like working directly with open
        with open(filename, 'w') as doc:
            ujson.dump(JSON.convert(data, convert_keys), doc)

    @staticmethod
    def open(filename: str, convert_keys=None) -> dict:
        return JSON.convert(ujson.load(open(filename, 'r')), convert_keys)

# example with both styles of bytes for the sake of being thorough
json_data = dict(address=bytearray(b'\xFF\xEE\xDD\xCC'), data=b'\x00\x01\x02\x03', date=utime.mktime(utime.localtime()))
keys = ['address', 'data']  # list of keys to convert to int/bytes
JSON.save('test.json', json_data, keys)
json_data = JSON.open('test.json', keys)
print(json_data)  # {'date': 1621035727, 'data': b'\x01\x02\x03', 'address': b'\xff\xee\xdd\xcc'}
You may also want to note that with this method you never actually touch any JSON: you put in a dict, you get out a dict, and all the JSON is managed "behind the scenes". Regardless, I would say using struct would be a better option. You asked about JSON though, so my answer is about JSON.
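An alternative sketch, not from the original answer: since MicroPython ships ubinascii on most ports, you could instead store the address as a hex string, which round-trips through JSON without any backslash trouble (the file name and key below are just examples):
import ujson, ubinascii

address = b'(b\xe5V\xb5\x01<:'

# bytes -> hex text, safe to embed in JSON
with open('sensor.json', 'w') as f:
    ujson.dump({'address': ubinascii.hexlify(address).decode()}, f)

# hex text -> bytes again
with open('sensor.json') as f:
    restored = ubinascii.unhexlify(ujson.load(f)['address'])

print(restored == address)  # True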

Attempting to parse a JSON file with Python

So I've been beating my head against a wall for days and diving down the Google/SO rabbit hole in search of answers. I've been debating how to phrase this question, as the API that I am pulling from may or may not contain some sensitive information that gets uncomfortably close to HIPAA territory for my liking. For that reason I will not be providing the direct link/auth for my code. That being said, I will be providing a made-up JSON sample to help with the explaining.
import requests
import json
import urllib3
r = requests.get('https://madeup.url.com/api/vi/information here', auth=('123456789', '1111111111222222222223333333333444444455555555'))
payload = {'query': 'firstName'}
response = requests.get(r, params=payload)
json_response = response.json()
print(json.dumps(json_response))
The JSON file that I'm trying to parse looks in part like this:
"{\"id\": 123456789, \"firstName\": \"NAME\", \"lastName\": \"NAME\", \"phone\": \"NUMBER\", \"email\": \"EMAIL#gmail.com\", \"date\": \"December 16, 2021\", \"time\": \"9:50am\", \"endTime\": \"10:00am\",.....
When I run the code I get a "urllib3.exceptions.LocationParseError: Failed to parse: <Response [200]>" traceback, and I cannot for the life of me figure out what is going on. urllib3 is installed and up to date according to the console.
Any help would be much appreciated. TIA
That is not a JSON file. It is a string containing escaped characters. It needs to be unescaped before parsing can work.
You're passing r to the second requests.get(), but r is the Response object already returned by the first requests.get(). Shouldn't you be passing params=payload in that first call and getting the response from there, in one single request?
import requests
import json
import urllib3
payload = {'query': 'firstName'}
response = requests.get('{YOUR_URL}', auth=('{USER}', '{PASS}'), params=payload)
json_response = response.json()
print(json.dumps(json_response))
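Once that request succeeds, you can sanity-check the status and pull fields straight out of the decoded dict — a small sketch, assuming the body matches the made-up sample above (firstName and lastName come from that sample):
response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
record = response.json()     # the JSON body decoded into a dict
print(record['firstName'], record['lastName'])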
Well, now I'm even more confused. I'm trying to teach myself Python and clearly struggling. To get the "JSON" I posted, I used the following code:
r = requests.get('URL', auth=('user', 'pass'))
Data = r.json()
packages_str = json.dumps(Data[0])  # dumps() already returns a JSON string...
with open('Data.json', 'w') as f:
    json.dump(packages_str, f)      # ...so dumping that string again encodes it twice
So basically I'm even more lost now...
Okay, update: good news! Kinda... my code now reads as follows:
import requests
import json
import urllib3

payload = {
    'query1': 'firstName',
    'query2': 'lastName'
}
response = requests.get("url", auth=("user", "pass"), params=payload)
Data = response.json()
packages_str = json.dumps(Data, ensure_ascii=False, indent=2)
with open('Data.json', 'w') as f:
    json.dump(packages_str, f)  # writes the whole response again as one escaped JSON string
    f.write(packages_str)       # writes the nicely formatted text below it
And when I then open the JSON file, the first line is the entire API response as one string, but below that is a properly formatted JSON file. Unfortunately it's the entire API response and not a parsed result with just the information that I need...
Continuing down the Google/YouTube/SO rabbit hole; I will update at a later date if I find a workaround.
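For what it's worth, the stray first line comes from dumping the already-dumped string a second time. A minimal sketch of writing the decoded dict once and picking out fields (field names taken from the made-up sample above):
Data = response.json()

# write the decoded dict exactly once
with open('Data.json', 'w') as f:
    json.dump(Data, f, ensure_ascii=False, indent=2)

# keep only the fields you actually need
subset = {k: Data[k] for k in ('firstName', 'lastName', 'email') if k in Data}
print(subset)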

How to write r.headers from different URLs into one JSON file?

I would like to crawl several URLs using the requests library in Python, scrutinizing the GET requests as well as the response headers. However, when crawling different URLs I don't know in advance all the key:value pairs that come back, so writing the data to a valid CSV file is not really possible, in my view. Therefore I want to write the data into a JSON file.
The problem is similar to the following thread from 2014, but not the same:
Get a header with Python and convert in JSON (requests - urllib2 - json)
import requests, json

urls = ['http://www.example.com/', 'http://github.com']
with open('test.json', 'w') as f:
    for url in urls:
        r = requests.get(url)
        rh = r.headers
        f.write(json.dumps(dict(rh), sort_keys=True, separators=(',', ':'), indent=4))
I expect a JSON file with the headers for each URL. I do get a JSON file with those data, but my IDE (PyCharm) shows an error stating that the JSON standard allows only one top-level value. I have read the documentation (https://docs.python.org/3/library/json.html#repeated-names-within-an-object) but did not get it. Any hint would be appreciated.
EDIT: The only thing missing in the outcome is another comma. But where do I enter it, and what command do I need for this?
You need to collect the header dicts in a list and then do a single json dump to a file at the end. This will work:
import requests, json

urls = ['http://www.example.com/', 'http://github.com']
headers = []
for url in urls:
    r = requests.get(url)
    header_dict = dict(r.headers)
    header_dict['source_url'] = url  # remember which URL the headers came from
    headers.append(header_dict)

with open('test.json', 'w', encoding='utf-8') as f:
    json.dump(headers, f, sort_keys=True, separators=(',', ':'), indent=4)
You can still write it to a CSV:
import pandas as pd
df = pd.DataFrame(headers)
df.to_csv('test.csv')
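If you'd rather append one record per URL as you crawl instead of holding everything in memory, JSON Lines is a common alternative: each line is its own JSON document, which also sidesteps the "only one top-level value" complaint because the file as a whole is not meant to be a single document. A minimal sketch (headers.jsonl is just an example name):
import requests, json

urls = ['http://www.example.com/', 'http://github.com']
with open('headers.jsonl', 'w', encoding='utf-8') as f:
    for url in urls:
        record = dict(requests.get(url).headers)
        record['source_url'] = url
        f.write(json.dumps(record, sort_keys=True) + '\n')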

Parse Twitter JSON Content in Python 3

I searched all the similar questions and still couldn't resolve the issue below.
Here's my json file content:
https://objectsoftconsultants.s3.amazonaws.com/tweets.json
Code to get a particular element is as below:
import json

testsite_array = []
with open('tweets.json') as json_file:
    testsite_array = json_file.readlines()

for text in testsite_array:
    json_text = json.dumps(text)
    resp = json.loads(json_text)
    print(resp["created_at"])
I keep getting the error below:
print(resp["created_at"])
TypeError: string indices must be integers
Thanks much for your time and help, well in advance.
I have to guess what you're trying to do and can only hope that this will help you:
with open('tweets.json') as f:
    tweets = json.load(f)

print(tweets['created_at'])
It doesn't make sense to read a json file with readlines, because it is unlikely that each line of the file represents a complete json document.
Also I don't get why you're dumping the string only to load it again immediately.
Update:
Try this to parse your file line by line:
with open('tweets.json') as f:
    lines = f.readlines()

for line in lines:
    try:
        tweet = json.loads(line)
        print(tweet['created_at'])
    except json.decoder.JSONDecodeError:
        print('Error')
I want to point out however, that I do not recommend this approach. A file should contain only one json document. If the file does not contain a valid json document, the source for the file should be fixed.
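If you don't know ahead of time whether a given file holds one JSON document or one document per line, a hedged way to cover both cases is to try a whole-file load first and fall back to line-by-line parsing:
import json

def load_tweets(path):
    # returns a list of dicts from either a single JSON document or a JSON-lines file
    with open(path) as f:
        text = f.read()
    try:
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # fall back: treat each non-empty line as its own document
        return [json.loads(line) for line in text.splitlines() if line.strip()]

for tweet in load_tweets('tweets.json'):
    print(tweet.get('created_at'))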

Solve issue with nested keys in JSON

I am trying to adapt some Python code from an awesome guide for dark-web scanning/graph creation.
I have thousands of JSON files created with Onionscan, and this code should wrap everything into a Gephi graph. Unfortunately the code is old: the JSON files are now formatted differently, and it no longer works.
code (partial):
import glob
import json
import networkx
import shodan

file_list = glob.glob("C:\\test\\*.json")
graph = networkx.DiGraph()

for json_file in file_list:
    with open(json_file, "rb") as fd:
        scan_result = json.load(fd)
    edges = []
    if scan_result['linkedOnions'] is not None:
        edges.extend(scan_result['linkedOnions'])
In fact, at this point I get a "KeyError", because linkedOnions is nested one level down, like this:
"identifierReport": {
"privateKeyDetected": false,
"foundApacheModStatus": false,
"serverVersion": "",
"relatedOnionServices": null,
"relatedOnionDomains": null,
"linkedOnions": [many urls here]
could you please help me fix the code above?
I would be VERY grateful :)
Lorenzo
This is the correct way to read the nested JSON:
if scan_result['identifierReport']['linkedOnions'] is not None:
    edges.extend(scan_result['identifierReport']['linkedOnions'])
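If some scan files lack the key entirely, bracket access will still raise a KeyError; a .get() chain with a default avoids that — a small sketch, not from the original answer:
report = scan_result.get('identifierReport') or {}
linked = report.get('linkedOnions')
if linked:
    edges.extend(linked)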
Try this; it will work if your JSON file is in the correct format:
try:
    scan_result = json.load(fd)
    edges = []
    if scan_result['identifierReport']['linkedOnions'] is not None:
        edges.extend(scan_result['identifierReport']['linkedOnions'])
except Exception as e:
    # print your message or log it
    print(e)