Extracting the payload of a single Common Crawl WARC - html

I can query all occurrences of a certain base URL within a given Common Crawl index, saving them all to a file and getting a specific article (test_article_num) using the code below. However, I have not come across a way to extract the raw HTML for that article from the specific crawl data ('filename' in the output), even though I know the offset and length of the data I want. I feel like there should be a way to do this in Python similar to this, maybe using requests and warcio (perhaps something akin to this), but I'm not sure. Any help is greatly appreciated.
EDIT:
I found exactly what I needed here.
import requests
import pathlib
import json
news_website_base = 'hobbsnews.com'
URL = "https://index.commoncrawl.org/CC-MAIN-2022-05-index?url="+news_website_base+"/*&output=json"
website_output = requests.get(URL)
pathlib.Path('data.json').write_bytes(website_output.content)
news_articles = []
test_article_num=300
for line in open('data.json', 'r'):
    news_articles.append(json.loads(line))
print(news_articles[test_article_num])
news_URL=news_articles[test_article_num]['url']
news_warc_file=news_articles[test_article_num]['filename']
news_offset=news_articles[test_article_num]['offset']
news_length=news_articles[test_article_num]['length']
Code output:
{'urlkey': 'com,hobbsnews)/2020/03/22/no-new-positive-covid-19-tests-in-lea-in-last-24-hours/{{%20data.link', 'timestamp': '20220122015439', 'url': 'https://www.hobbsnews.com/2020/03/22/no-new-positive-covid-19-tests-in-lea-in-last-24-hours/%7B%7B%20data.link', 'mime': 'text/html', 'mime-detected': 'text/html', 'status': '404', 'digest': 'GY2UDG4G3V3S5TXDL3H7HE6VCSRBD3XR', 'length': '40062', 'offset': '21016412', 'filename': 'crawl-data/CC-MAIN-2022-05/segments/1642320303729.69/crawldiagnostics/CC-MAIN-20220122012907-20220122042907-00614.warc.gz'}
https://www.hobbsnews.com/2020/03/22/no-new-positive-covid-19-tests-in-lea-in-last-24-hours/%7B%7B%20data.link
crawl-data/CC-MAIN-2022-05/segments/1642320300343.4/crawldiagnostics/CC-MAIN-20220117061125-20220117091125-00631.warc.gz
21016412
40062

With the WARC URL and the WARC record offset and length, it's simply:
download the range from offset until offset+length-1
pass the downloaded bytes to a WARC parser
Using curl and warcio CLI:
curl -s -r250975924-$((250975924+6922-1)) \
https://data.commoncrawl.org/crawl-data/CC-MAIN-2021-10/segments/1614178365186.46/warc/CC-MAIN-20210303012222-20210303042222-00595.warc.gz \
>warc_temp.warc.gz
warcio extract --payload warc_temp.warc.gz 0
Or with Python requests and warcio (cf. here):
import io
import requests
import warcio
warc_filename = 'crawl-data/CC-MAIN-2021-10/segments/1614178365186.46/warc/CC-MAIN-20210303012222-20210303042222-00595.warc.gz'
warc_record_offset = 250975924
warc_record_length = 6922
response = requests.get(
    f'https://data.commoncrawl.org/{warc_filename}',
    headers={'Range': f'bytes={warc_record_offset}-{warc_record_offset + warc_record_length - 1}'})
with io.BytesIO(response.content) as stream:
    for record in warcio.ArchiveIterator(stream):
        html = record.content_stream().read()
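Applied to the record from the question, assuming the index-lookup code above has already been run (so news_warc_file, news_offset and news_length are set), a minimal sketch looks like this; note the index returns offset and length as strings, so they have to be cast to int:
import io
import requests
import warcio
offset, length = int(news_offset), int(news_length)
# Fetch only this record's bytes from the WARC file via an HTTP Range request
response = requests.get(
    f'https://data.commoncrawl.org/{news_warc_file}',
    headers={'Range': f'bytes={offset}-{offset + length - 1}'})
# Parse the gzipped WARC record and read its (HTML) payload
with io.BytesIO(response.content) as stream:
    for record in warcio.ArchiveIterator(stream):
        html = record.content_stream().read()
        print(html[:200])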

Related

Pandas parallel URL downloads with pd.read_html

I know I can download a csv file from a web page by doing:
import pandas as pd
import numpy as np
from io import StringIO
URL = "http://www.something.com"
data = pd.read_html(URL)[0].to_csv(index=False, header=True)
file = pd.read_csv(StringIO(data), sep=',')
Now I would like to do the above for more URLs at the same time, like when you open different tabs in your browser. In other words, a way to parallelize this when you have different URLs, instead of looping through or doing it one at a time. So, I thought of having a series of URLs inside a dataframe, and then create a new column which contains the strings 'data', one for each URL.
list_URL = ["http://www.something.com", "http://www.something2.com",
"http://www.something3.com"]
df = pd.DataFrame(list_URL, columns =['URL'])
df['data'] = pd.read_html(df['URL'])[0].to_csv(index=False, header=True)
But it gives me the error: cannot parse from 'Series'
Is there a better syntax, or does this mean I cannot do this in parallel for more than one URL?
You could try it like this:
import pandas as pd
URLS = [
    "https://en.wikipedia.org/wiki/Periodic_table#Presentation_forms",
    "https://en.wikipedia.org/wiki/Planet#Planetary_attributes",
]
df = pd.DataFrame(URLS, columns=["URL"])
df["data"] = df["URL"].map(
    lambda x: pd.read_html(x)[0].to_csv(index=False, header=True)
)
print(df)
# Output
URL data
0 https://en.wikipedia.org/wiki/Periodic_t... 0\r\nPart of a series on the\r\nPeriodic...
1 https://en.wikipedia.org/wiki/Planet#Pla... 0\r\n"The eight known planets of the Sol...
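Note that map above still fetches the pages one after another. If you want the downloads to actually overlap (like open browser tabs), one possible sketch uses a thread pool, which suits this I/O-bound work; it reuses the same two Wikipedia URLs:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

URLS = [
    "https://en.wikipedia.org/wiki/Periodic_table#Presentation_forms",
    "https://en.wikipedia.org/wiki/Planet#Planetary_attributes",
]

def first_table_as_csv(url):
    # read_html fetches the page and parses every <table>; keep the first one as CSV text
    return pd.read_html(url)[0].to_csv(index=False, header=True)

# Threads overlap the network waits; executor.map keeps the results in URL order
with ThreadPoolExecutor(max_workers=4) as executor:
    data = list(executor.map(first_table_as_csv, URLS))

df = pd.DataFrame({"URL": URLS, "data": data})
print(df)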

How can I extract information quickly from 130,000+ JSON files located in S3?

I have an S3 bucket with over 130k JSON files, from which I need to calculate numbers based on the data in the files (for example, counting the gender of speakers). I am currently using the S3 paginator and json.loads to read each file and extract information from it, but it takes a very long time to process such a large number of files (2-3 files per second). How can I speed up the process? Please provide working code examples if possible. Thank you.
here is some of my code:
import json

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
result = paginator.paginate(Bucket='bucket-name', StartAfter='')
for page in result:
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            s3 = boto3.resource('s3')
            content_object = s3.Bucket('bucket-name').Object(str(keyString))
            file_content = content_object.get()['Body'].read().decode('utf-8')
            json_content = json.loads(file_content)
            x = json_content['dict-name']
In order to use the code below, I'm assuming you understand pandas (if not, you may want to get to know it). Also, it's not clear whether your 2-3 files per second is just the read or includes part of the number crunching; nonetheless, multiprocessing will speed this up dramatically. The gist is to read all the files in (as dataframes), concatenate them, then do your analysis.
To be useful for me, I run this on spot instances that have lots of vCPUs and memory. I've found the instances that are network optimized (like c5n - look for the n) and the inf1 (for machine learning) are much faster at reading/writing than T or M instance types, as examples.
My use case is reading 2000 'directories' with roughly 1200 files in each and analyzing them. The multithreading is orders of magnitude faster than single threading.
File 1: your main script
# create script.py file
import os
from multiprocessing import Pool
from itertools import repeat
import pandas as pd
import json
from utils_file_handling import *
ufh = file_utilities() #instantiate the class functions - see below (second file)
bucket = 'your-bucket'
prefix = 'your-prefix/here/'  # if you don't have a prefix, pass '' (empty string) or the function will fail
#define multiprocessing function - get to know this to use multiple processors to read files simultaneously
def get_dflist_multiprocess(keys_list, num_proc=4):
    with Pool(num_proc) as pool:
        # starmap passes (bucket, key) pairs to reader_json; the third argument (15) is the chunksize
        df_list = pool.starmap(ufh.reader_json, zip(repeat(bucket), keys_list), 15)
        pool.close()
        pool.join()
    return df_list

# create your master keys list upfront; you can loop through all or slice the list to test
keys_list = ufh.get_keys_from_prefix(bucket, prefix)
# keys_list = keys_list[0:2000]  # as an example
num_proc = os.cpu_count()  # tells you how many processors your machine has; the function above defaults to 4 unless given
df_list = get_dflist_multiprocess(keys_list, num_proc=num_proc)  # collect dataframes for each file
df_new = pd.concat(df_list, sort=False)
df_new = df_new.reset_index(drop=True)
# do your analysis on the dataframe
File 2: class functions
#utils_file_handling.py
# create this in a separate file; name as you wish but change the import in the script.py file
import boto3
import json
import pandas as pd
#define client and resource
s3sr = boto3.resource('s3')
s3sc = boto3.client('s3')
class file_utilities:
    """file handling function"""

    def get_keys_from_prefix(self, bucket, prefix):
        '''gets list of keys and dates for given bucket and prefix'''
        keys_list = []
        paginator = s3sr.meta.client.get_paginator('list_objects_v2')
        # use Delimiter to limit search to that level of hierarchy
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
            keys = [content['Key'] for content in page.get('Contents')]
            print('keys in page: ', len(keys))
            keys_list.extend(keys)
        return keys_list

    def read_json_file_from_s3(self, bucket, key):
        """read json file"""
        bucket_obj = boto3.resource('s3').Bucket(bucket)
        obj = boto3.client('s3').get_object(Bucket=bucket, Key=key)
        data = obj['Body'].read().decode('utf-8')
        return data

    # you may need to tweak this for your ['dict-name'] example; I think I have it correct
    def reader_json(self, bucket, key):
        '''returns dataframe'''
        return pd.DataFrame(json.loads(self.read_json_file_from_s3(bucket, key))['dict-name'])
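Once both files are in place and df_new has been concatenated by the main script, the actual counting is short. A minimal sketch for the "gender of speakers" example, assuming each JSON record carries a hypothetical 'gender' field (the real schema isn't shown in the question):
# Sketch only: 'gender' is a hypothetical column name; adjust it to your JSON schema
gender_counts = df_new['gender'].value_counts()
print(gender_counts)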

How do I make a cURL request to the Zomato API?

I just began exploring APIs. This is my code so far. For the Locu API this works, but Zomato uses a header in the curl request which I don't know how to replicate. Could someone guide me or show me how?
import json
import urllib2
Key = 'Zomato_key'
url = 'https://developers.zomato.com/api/v2.1/categories'
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)
print data
By looking at the Zomato API docs, it seems that the parameter user-key has to be set in the header.
The following works:
import json
import urllib2
Key = '<YOUR_ZOMATO_API_KEY>'
url = "https://developers.zomato.com/api/v2.1/categories"
request = urllib2.Request(url, headers={"user-key" : Key})
json_obj = urllib2.urlopen(request)
data = json.load(json_obj)
print data
If you want a more elegant way to query APIs, have a look at the requests module (you can install it using pip install requests).
I suggest you the following:
import json
import requests
Key = '<YOUR_ZOMATO_API_KEY>'
url = "https://developers.zomato.com/api/v2.1/categories"
if __name__ == '__main__':
    r = requests.get(url, headers={'user-key': Key})
    if r.ok:
        data = r.json()
        print data
NB: I suggest you remove your Key from StackOverflow if you care about keeping it to yourself.
This didn't work for me; can you suggest some other method? The code takes a long time to run and then returns a traceback error from inside the built-in request method, but the curl command is working:
curl -X GET --header "Accept: application/json" --header "user-key: c5062d18e16b9bb9d857391bb32bb52f" "https://developers.zomato.com/api/v2.1/categories"
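If the requests version hangs while the curl command works, one thing worth trying is to mirror the curl headers exactly and set a timeout so the call fails fast instead of blocking; a sketch using requests with the same headers as the curl command above:
import requests

url = "https://developers.zomato.com/api/v2.1/categories"
headers = {
    "Accept": "application/json",  # same headers as the working curl command
    "user-key": "c5062d18e16b9bb9d857391bb32bb52f",
}

# timeout raises requests.exceptions.Timeout instead of hanging indefinitely
r = requests.get(url, headers=headers, timeout=10)
r.raise_for_status()
print(r.json())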

Import Kaggle CSV from download URL to pandas DataFrame

I've been trying different methods to import the SpaceX missions csv file on Kaggle directly into a pandas DataFrame, without any success.
I'd need to send requests to log in. This is what I have so far:
import requests
import pandas as pd
from io import StringIO
# Link to the Kaggle data set & name of zip file
login_url = 'http://www.kaggle.com/account/login?ReturnUrl=/spacex/spacex-missions/downloads/database.csv'
# Kaggle Username and Password
kaggle_info = {'UserName': "user", 'Password': "pwd"}
# Login to Kaggle and retrieve the data.
r = requests.post(login_url, data=kaggle_info, stream=True)
df = pd.read_csv(StringIO(r.text))
r is returning the html content of the page.
df = pd.read_csv(url) gives a CParser error:
CParserError: Error tokenizing data. C error: Expected 1 fields in line 13, saw 6
I've searched for a solution, but so far nothing I've tried worked.
You are creating a stream and passing it directly to pandas. I think you need to pass a file-like object to pandas. Take a look at this answer for a possible solution (using post and not get in the request, though).
Also, I think the login URL with the redirect that you use is not working as it is. I know I suggested that here, but I ended up not using it because the post request call did not handle the redirect (I suspect).
The code I ended up using in my project was this:
from os import path

import requests

import config  # project module holding kaggle_username, kaggle_password and raw_data_dir

def from_kaggle(data_sets, competition):
    """Fetches data from Kaggle

    Parameters
    ----------
    data_sets : (array)
        list of dataset filenames on kaggle. (e.g. train.csv.zip)
    competition : (string)
        name of kaggle competition as it appears in url
        (e.g. 'rossmann-store-sales')
    """
    kaggle_dataset_url = "https://www.kaggle.com/c/{}/download/".format(competition)
    KAGGLE_INFO = {'UserName': config.kaggle_username,
                   'Password': config.kaggle_password}
    for data_set in data_sets:
        data_url = path.join(kaggle_dataset_url, data_set)
        data_output = path.join(config.raw_data_dir, data_set)
        # Attempts to download the CSV file. Gets rejected because we are not logged in.
        r = requests.get(data_url)
        # Login to Kaggle and retrieve the data.
        r = requests.post(r.url, data=KAGGLE_INFO, stream=True)
        # Writes the data to a local file one chunk at a time.
        with open(data_output, 'wb') as f:
            # Reads 512KB at a time into memory
            for chunk in r.iter_content(chunk_size=(512 * 1024)):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
Example use:
sets = ['train.csv.zip',
'test.csv.zip',
'store.csv.zip',
'sample_submission.csv.zip',]
from_kaggle(sets, 'rossmann-store-sales')
You might need to unzip the files.
import zipfile

def _unzip_folder(destination):
    """Unzip without regard to the folder structure.

    Parameters
    ----------
    destination : (str)
        Local path and filename where the downloaded file is stored.
    """
    with zipfile.ZipFile(destination, "r") as z:
        z.extractall(config.raw_data_dir)
So I never really loaded it directly into the DataFrame, but rather stored it to disk first. But you could modify it to use a temp directory and just delete the files after you read them.
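If you would rather avoid the intermediate file entirely, a possible variation (a sketch, assuming the download is a single zipped CSV and the login/redirect dance above succeeds) keeps the response in memory and lets pandas read the CSV straight out of the zip:
import io
import zipfile

import pandas as pd
import requests

def kaggle_csv_to_dataframe(data_url, kaggle_info):
    """Download a zipped CSV from Kaggle and return it as a DataFrame without touching disk."""
    # First GET gets redirected to the login page because we are not authenticated yet
    r = requests.get(data_url)
    # Re-issue the request as a POST with credentials, as in from_kaggle() above
    r = requests.post(r.url, data=kaggle_info)
    # Wrap the downloaded bytes so zipfile can treat them as a file object
    with zipfile.ZipFile(io.BytesIO(r.content)) as z:
        # Assume the archive holds a single CSV and read the first member
        with z.open(z.namelist()[0]) as f:
            return pd.read_csv(f)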

How do I grab info from this JSON file?

I'm trying to grab some numbers from this JSON file, but I don't know how to do it correctly. This is the JSON file I am trying to gather information from:
http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2016-17&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=
I've been trying to get this code to work, but I can't figure it out:
import json
from pprint import pprint
with open('data.json') as data_file:
    data = json.load(data_file)

data["rowSet"]["1610612737"]["Atlanta Hawks"]
I'm trying to get the statistics from each team.
The following Python script should do it.
#!/usr/bin/env python
import json
with open('leaguedashteamstats.json') as data_file:
    data = json.load(data_file)
# extract headers names
headers = data['resultSets'][0]['headers']
# extract raw json rows
raw_rows = data['resultSets'][0]['rowSet']
team_stats = []
for row in raw_rows:
    print row[1]  # prints team name
    # mixes header names and values and prints them out
    for (header, value) in zip(headers, row):
        print header, value
    print '\n'
Both data and code can be seen here:
https://gist.github.com/cevaris/24d0b7d97677667aedb14059a6959da1#file-1-team-stats-output
Disclaimer: this code doesn't contain any validation, but it should lead you in the right direction:
import json
with open('data.json') as data_file:
    data = json.load(data_file)

for rs in data.get('resultSets'):
    for r_ in [r for r in rs.get('rowSet') if r[1] == 'Atlanta Hawks']:
        print(r_)
You basically need to determine specific keys that you are going to loop through, or obtain.
This should hopefully get you to where you need to be.
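As a follow-up, a small sketch that turns the same structure into a lookup table keyed by team name (it assumes the layout used in both answers above: resultSets[0]['headers'] and resultSets[0]['rowSet'], with the team name in column 1):
import json

with open('data.json') as data_file:
    data = json.load(data_file)

headers = data['resultSets'][0]['headers']
rows = data['resultSets'][0]['rowSet']

# Map each team name to a dict of {stat name: value}
team_stats = {row[1]: dict(zip(headers, row)) for row in rows}
print(team_stats['Atlanta Hawks'])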