This is my first time writing any sort of code. I have been following along with an interactive tutorial and I seem to be stuck at the very first step: importing a JSON file containing football competition data. It seems fairly straightforward, but error message after error message has started to drive me insane.
I am trying to load the data into Python so I can follow along with the tutorial (link below). I believe I have saved my files and data the same way as in the tutorial, but when I change the file directory and run the code below (starting with import json), I get a few different error messages. If someone could advise on what I'm doing wrong it would be greatly appreciated. My goal is to load the data I downloaded from GitHub and open the competitions JSON file.
I am also happy to provide any information required to help answer this question.
YouTube video: https://youtu.be/GTtu0t03FMO
Error messages:
FileNotFoundError: [Errno 2] No such file or directory: 'Statsbomb/data/competitions.json'
JSONDecodeError: Expecting value
#Load in Statsbomb competition and match data
#This is a library for loading json files.
import json
#Load the competition file
#Got this by searching 'how do I open json in Python'
with open('Statsbomb/data/competitions.json') as f:
    competitions = json.load(f)
#Women's World Cup 2019 has competition ID 72
competition_id=72
#Load the list of matches for this competition
with open('Statsbomb/data/matches/'+str(competition_id)+'/30.json') as f:
    matches = json.load(f)
#Look inside matches
matches[0]
matches[0]['home_team']
matches[0]['home_team']['home_team_name']
matches[0]['away_team']['away_team_name']
#Print all match results
for match in matches:
    home_team_name=match['home_team']['home_team_name']
    away_team_name=match['away_team']['away_team_name']
    home_score=match['home_score']
    away_score=match['away_score']
    describe_text = 'The match between ' + home_team_name + ' and ' + away_team_name
    result_text = ' finished ' + str(home_score) + ' : ' + str(away_score)
    print(describe_text + result_text)
#Now let's find a match we are interested in
home_team_required ="England"
away_team_required ="Sweden"
#Find ID for the match
for match in matches:
    home_team_name=match['home_team']['home_team_name']
    away_team_name=match['away_team']['away_team_name']
    if (home_team_name==home_team_required) and (away_team_name==away_team_required):
        match_id_required = match['match_id']

print(home_team_required + ' vs ' + away_team_required + ' has id:' + str(match_id_required))
#Exercise:
#1, Edit the code above to print out the result list for the Men's World Cup
#2, Edit the code above to find the ID for England vs. Sweden
#3, Write new code to write out a list of just Sweden's results in the tournament.
with open('Statsbomb/data/matches/'+str(competition_id)+'/30.json') as f:
    matches = json.load(f)
#A try block needs an except clause to be valid Python
try:
    with open('Statsbomb/data/matches/'+str(competition_id)+'/3.json') as f:
        matches = json.load(f)
except FileNotFoundError as e:
    print(e)
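A quick sanity check narrows this kind of FileNotFoundError down (a minimal sketch; the relative path is the one from the code above, and the absolute path at the end is only a placeholder to adjust):

import os

#open() resolves relative paths against the current working directory,
#not against the folder the script or notebook lives in
print(os.getcwd())

#Check whether the relative path used above is visible from there
print(os.path.exists('Statsbomb/data/competitions.json'))

#If this prints False, point open() at the folder the data was actually
#downloaded to, e.g. (placeholder path, adjust to your machine):
#with open(r'C:/Users/me/Statsbomb/data/competitions.json') as f:
#    competitions = json.load(f)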
Related
I want to load multiple CSV files matching certain names into a dataframe. Currently I am looping through the whole folder, building a list of filenames, loading those CSVs into a list of dataframes, and then concatenating them.
The approach I want to use (if possible) is to bypass all that code and read all the files in a one-liner kind of approach.
I know this can be done easily for a single level of subfolders, but my subfolder structure is as follows:
Root Folder
|
Subfolder1
|
Subfolder 2
|
X01.csv
Y01.csv
Z01.csv
|
Subfolder3
|
Subfolder4
|
X01.csv
Y01.csv
|
Subfolder5
|
X01.csv
Y01.csv
I want to read all "X01.csv" files while reading from the Root Folder.
Is there a way I can read all the required files with code something like the below?
filepath = "rootpath" + "/**/X*.csv"
df = spark.read.format("com.databricks.spark.csv").option("recursiveFileLookup","true").option("header","true").load(filepath)
This code works fine for a single level of subfolders. Is there any equivalent of this for multi-level folders? I thought the "recursiveFileLookup" option would look across all levels of subfolders, but apparently that is not how it works.
Currently I am getting a
Path not found ... filepath
exception.
Any help please.
Have you tried using the glob.glob function?
You can use it to search for files that match certain criteria inside a root path, and pass the list of files it finds to the spark.read.csv function.
For example, I've recreated the folder structure from your example inside a Google Colab environment.
To get a list of all CSV files matching the criteria you've specified, you can use the following code:
import glob
rootpath = './Root Folder/'
# The following line of code looks through all files
# inside the rootpath recursively, trying to match the
# pattern specified. In this case, it tries to find any
# CSV file that starts with the letters X, Y, or Z,
# and ends with 2 numbers (ranging from 0 to 9).
glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True)
# Returns:
# ['./Root Folder/Subfolder5/Y01.csv',
# './Root Folder/Subfolder5/X01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Y01.csv',
# './Root Folder/Subfolder1/Subfolder 2/Z01.csv',
# './Root Folder/Subfolder1/Subfolder 2/X01.csv']
Now you can combine this with spark.read.csv's capability of reading a list of files to get the answer you're looking for:
import glob
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
rootpath = './Root Folder/'
spark.read.csv(glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True), inferSchema=True, header=True)
Note
You can specify more general patterns like:
glob.glob(rootpath + "**/*.csv", recursive=True)
To return a list of all csv files inside any subdirectory of rootpath.
Additionally, to consider only the CSV files directly inside rootpath (no subdirectories), you could use something like:
glob.glob(rootpath + "*.csv")
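As an aside, if a plain pandas DataFrame would also do the job (the question mentions concatenating dataframes today), the same glob list can be read and concatenated directly; a minimal sketch:

import glob
import pandas as pd

rootpath = './Root Folder/'
# Read every matching CSV and stack the rows into one DataFrame
df = pd.concat(
    (pd.read_csv(p) for p in glob.glob(rootpath + "**/[XYZ][0-9][0-9].csv", recursive=True)),
    ignore_index=True,
)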
Edit
Based on your comments to this answer, does something like this work on Databricks?
from notebookutils import mssparkutils as ms
from py4j.protocol import Py4JJavaError

# databricks has a module called dbutils.fs.ls
# that works similarly to mssparkutils.fs, based on
# the following page of its documentation:
# https://docs.databricks.com/dev-tools/databricks-utils.html#ls-command-dbutilsfsls
def scan_dir(
    initial_path: str,
    search_str: str,
    account_name: str = '',
):
    """Scan a directory and subdirectories for a string.

    Parameters
    ----------
    initial_path : str
        The path to start the search. Accepts either a valid container name,
        or the entire connection string.
    search_str : str
        The string to search.
    account_name : str, optional
        The name of the account to access the container folders.
        This value is only used when `initial_path` doesn't
        conform with the format: "abfss://<initial_path>#<account_name>.dfs.core.windows.net/"

    Raises
    ------
    FileNotFoundError
        If the `initial_path` informed doesn't exist.
    ValueError
        If `initial_path` is not a string.
    """
    if not isinstance(initial_path, str):
        raise ValueError(
            f'`initial_path` needs to be of type string, not {type(initial_path)}'
        )
    elif not initial_path.startswith('abfss'):
        initial_path = f'abfss://{initial_path}#{account_name}.dfs.core.windows.net/'
    try:
        fdirs = ms.fs.ls(initial_path)
    except Py4JJavaError as exc:
        raise FileNotFoundError(
            f'The path you informed "{initial_path}" doesn\'t exist'
        ) from exc
    found = []
    for path in fdirs:
        p = path.path
        if path.isDir:
            # Pass account_name along so the recursive call keeps working
            found = [*found, *scan_dir(p, search_str, account_name)]
        if search_str.lower() in path.name.lower():
            # print(p.split('.net')[-1])
            found = [*found, p.replace(path.name, "")]
    return list(set(found))
Example:
# Change .parquet to .csv
spark.read.parquet(*scan_dir("abfss://CONTAINER_NAME#ACCOUNTNAME.dfs.core.windows.net/ROOT/FOLDER/", ".parquet"))
The method above worked on Azure Synapse.
I am currently doing a tweet search using the Twitter API. However, using the ID of the last tweet from a previous search (as max_id) is not working for me.
Here is my code:
# Assumes: import tweepy, jsonpickle, time, and datetime as dt,
# plus an authenticated tweepy `api` object created beforehand.
searchQuery = '#BLM'  # this is what we're searching for
searchQuery = searchQuery + "-filter:retweets"
Geocode = "39.8, -95.583068847656, 2500km"
maxTweets = 1000000  # Some arbitrary large number
tweetsPerQry = 100  # this is the max the API permits
fName = 'tweetsBLM.json'  # We'll store the tweets in a json file.
sinceId = None
#max_id = -1  # initial search
max_id = 1278836959926980609  # the last id of previous search
tweetCount = 0
print("Downloading max {0} tweets".format(maxTweets))
with open(fName, 'w') as f:
    while tweetCount < maxTweets:
        try:
            if (max_id <= 0):
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
                                            count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
                                            count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if (not sinceId):
                    new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
                                            count=tweetsPerQry,
                                            max_id=str(max_id - 1))
                else:
                    new_tweets = api.search(q=searchQuery, lang="en", geocode=Geocode,
                                            count=tweetsPerQry,
                                            max_id=str(max_id - 1),
                                            since_id=sinceId)
            if not new_tweets:
                print("No more tweets found")
                break
            for tweet in new_tweets:
                f.write(jsonpickle.encode(tweet._json, unpicklable=False) +
                        '\n')
            tweetCount += len(new_tweets)
            print("Downloaded {0} tweets".format(tweetCount))
            max_id = new_tweets[-1].id
        except tweepy.TweepError as e:
            # Just exit if any error
            print("some error : " + str(e))
            print('exception raised, waiting 15 minutes')
            print('(until:', dt.datetime.now() + dt.timedelta(minutes=15), ')')
            time.sleep(15*60)
            break
print("Downloaded {0} tweets, Saved to {1}".format(tweetCount, fName))
This code works perfectly fine. I initially ran it and got about 40,000 tweets. Then I took the ID of the last tweet of the previous/initial search to go back in time. However, I was disappointed to see that there were no tweets anymore. I cannot believe that for a second; I must be going wrong somewhere, because #BLM has been very active in the last 2-3 months.
Any help is very welcome. Thank you.
I may have found the answer. Using the standard Twitter search API, it is not possible to get tweets older than about 7 days. Using max_id to get around this is not possible either.
The only way is to stream and wait for more than 7 days.
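For reference, a minimal streaming sketch with tweepy 3.x (assuming the same authenticated api object as in the question; this collects new tweets going forward rather than searching backwards):

import tweepy
import jsonpickle

class SaveListener(tweepy.StreamListener):
    # Append each incoming tweet to a JSON-lines file
    def __init__(self, fname):
        super().__init__()
        self.fname = fname

    def on_status(self, status):
        with open(self.fname, 'a') as f:
            f.write(jsonpickle.encode(status._json, unpicklable=False) + '\n')

    def on_error(self, status_code):
        # Returning False disconnects the stream (e.g. on a 420 rate limit)
        return False

# api is the authenticated tweepy.API object from the question
stream = tweepy.Stream(auth=api.auth, listener=SaveListener('tweetsBLM_stream.json'))
stream.filter(track=['#BLM'])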
Finally, there is also this package that looks for older tweets:
https://pypi.org/project/GetOldTweets3/ (it is an extension of Jefferson Henrique's original work).
I am writing a very simple python script to READ a CSV (no problem) and to write to another CSV (issue):
System info:
Windows 10
Powershell
Python 3.6.5 :: Anaconda, Inc.
Sample Data: Office Events
The purpose is to filter events based on criteria and write them to another CSV.
For example:
I would like to read from this CSV and write out the events where Registrations (column 4) is greater than 0 (i.e., remove rows where registrations = 0).
# SCRIPT TO FILTER EVENTS TO BE PROCESSED
import os
import time
import shutil
import os.path
import fnmatch
import csv
import glob
import pandas
# Location of file containing ALL events
path = r'allEvents.csv'
# Writes to writer
writer = csv.writer(open(r'RegisteredEvents' + time.strftime("%m_%d_%Y-%I_%M_%S") + '.csv', "wb"))
writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
#writer.writerow([r'Event Name', r'Start Date', r'End Date', r'Registrations', r'Total Revenue', r'ID', r'Status'])
#writer.writerow([b'Event Name', b'Start Date', b'End Date', b'Registrations', b'Total Revenue', b'ID', b'Status'])
def checkRegistrations(file):
    reader = csv.reader(file)
    data = list(reader)
    for row in data:
        #if row[3] > str(0):
        if row[3] > int(0):
            writer.writerow(([data]))
The error I continue to get is:
writer.writerow(["Event Name", "Start Date", "End Date", "Registrations", "Total Revenue", "ID", "Status"])
TypeError: a bytes-like object is required, not 'str'
I have tried the various commented-out statements, for example:
"" vs r"" vs r'' vs b''
if row[3] > int(0) vs if row[3] > str(0)
Every time I execute my script, it creates the file, so the first csv.writer line works (it creates and opens the file). The second line (writing the headers) is where the error appears.
Perhaps I am getting mixed up with syntax due to Python versions, or perhaps I am misusing the csv library, or (more than likely) I have endless amounts to learn about data types, IO, and conversion. Someone please help!
I am aware of the excess of imported libraries; the script came from another basic script that moved files from one location to another based on filename and output a row count for each file being moved.
With that being said, I may be unaware of any missing/needed libraries.
Please let me know if you have any questions, concerns or clarifications
Thanks in advance!
It looks like you are calling:
writer = csv.writer(open('file.csv', 'wb'))
The 'wb' argument is the file mode. The 'b' means that you are opening the file that you are writing to in binary mode. You are then trying to write a string which isn't what it is expecting.
Try getting rid of the 'b' in the 'wb'.
writer = csv.writer(open('file.csv', 'w'))
Let me know if that works for you.
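For completeness, here is a minimal sketch of the corrected write path in Python 3 (the column layout follows the question; skipping the input header row and casting the registration count with int() are assumptions about the input file):

import csv
import time

outname = 'RegisteredEvents' + time.strftime("%m_%d_%Y-%I_%M_%S") + '.csv'

with open('allEvents.csv', newline='') as infile, \
     open(outname, 'w', newline='') as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(["Event Name", "Start Date", "End Date", "Registrations",
                     "Total Revenue", "ID", "Status"])
    next(reader)  # skip the header row of the input file (assumed present)
    for row in reader:
        # csv gives back strings, so convert before comparing numerically
        if int(row[3]) > 0:
            writer.writerow(row)

Note the newline='' argument: the csv module recommends it in Python 3 so extra blank lines don't appear on Windows.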
I'm currently trying to construct a database of chemicals used in a university department, and their hazard classes. I then wish to output to a csv file. One step is to pull all the synonyms for the various chemicals from standard PDFs, such as this for gamma hexalactone:
sample PDF
At the moment, the code I'm using to extract the text just loses the Greek characters that I need to transfer. It looks like this:
pdfReader = PyPDF2.PdfFileReader(inpathf)
txtObj = ''
for pageNum in range(0, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    txtObj += str(pageObj.extractText())
inpathf.close()
outputf.write(txtObj)
outputf.close()
return txtObj
Parameters are extracted from ~2000 PDFs and stored in a dictionary before being transferred to a csv file:
def Outfile_csv(outfile, dict1, length):
    outputfile = open((outfile) + '.csv', 'w', newline='')
    output_list = []
    outputWriter = csv.writer(outputfile)
    outputWriter.writerow(['PDF file', 'Name', 'Synonyms', 'CAS No.', 'H statements',
                           'TWA limits /ppm', 'STEL limits /ppm'])
    for r in range(0, length):
        output_list = []
        for s in range(0, 7):
            if s == 0 or s == 3:
                output_list.append(str((dict1[s][r])).encode('utf-8'))
            else:
                output_list.append(str(dict1[s][r]))
        outputWriter.writerow(output_list)
    outputfile.close()
I also can't write out to the CSV in cases where there are Greek characters; those data are simply not placed in the CSV file. Many thanks for any help; a day playing with codecs and the contents of Stack Exchange has not helped yet. I'm using Python 3.4 and Windows 8.
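A hedged sketch of one fix to try: open the output file with an explicit UTF-8 encoding and write plain strings, instead of calling .encode('utf-8') on individual fields (encoding a field makes csv write the bytes repr, e.g. b'...'). The column names and dict layout are taken from the code above; the helper name and 'utf-8-sig' are only suggestions (the BOM helps Excel on Windows detect the encoding):

import csv

def outfile_csv_utf8(outfile, dict1, length):
    # 'utf-8-sig' writes a byte-order mark so Excel recognises the encoding
    with open(outfile + '.csv', 'w', newline='', encoding='utf-8-sig') as outputfile:
        outputWriter = csv.writer(outputfile)
        outputWriter.writerow(['PDF file', 'Name', 'Synonyms', 'CAS No.', 'H statements',
                               'TWA limits /ppm', 'STEL limits /ppm'])
        for r in range(0, length):
            # Write every field as a plain str; the file-level encoding
            # handles Greek (and other non-ASCII) characters
            outputWriter.writerow([str(dict1[s][r]) for s in range(0, 7)])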
OK, I am trying to create a function which will read a list of IDs from an external JSON file, which it is doing. It's even putting the data into the database on load of the program. My issue is this: I can't seem to match the list IDs in a comparison. Here is my current code:
def check(account):
    global ID_account
    import json, httplib
    if not hasattr(BigWorld, 'iddata'):
        UID_DB = account['databaseID']
        UID = ID_account
        try:
            conn = httplib.HTTPConnection('URL')
            conn.request('GET', '/ids.json')
            conn.sock.settimeout(2)
            resp = conn.getresponse()
            qresp = resp.read()
            BigWorld.iddata = json.loads(qresp)
            LOG_NOTE('[ABRO] Request of URL data successful.')
            conn.close()
        except:
            LOG_NOTE('[ABRO] Http request to URL problem. Loading local data.')
        if UID_DB is not None:
            list = BigWorld.iddata["ids"]
            #print (len(list) - 1)
            for n in range(0, (len(list) - 1)):
                #print UID_DB
                #print list[n]
                if UID_DB == list[n]:
                    #print '[ABRO] userid located:'
                    #print UID_DB
                    UID = UID_DB
        else:
            LOG_NOTE('[ABRO] userid not set.')
        if 'databaseID' in account and account['databaseID'] != UID:
            print '[ABRO] Account not active in database, game closing...... '
            BigWorld.quit()
Now my JSON file looks like this:
{
    "ids": [
        "1001583757",
        "500687699",
        "000000000"
    ]
}
Now when I run this with all the commented-out prints enabled, it seems to execute perfectly fine until it tries to do the match inside the for loop. Even when the prints show UID_DB and list[n] having the same values, it does not set my variable. It doesn't post any errors; it's simply acting as if there was no match. Am I possibly missing a loop break? Here is the Python log, starting with the print of the length of the list:
INFO: 2
INFO: 1001583757
INFO: 1001583757
INFO: 1001583757
INFO: 500687699
INFO: [ABRO] Account not active, game closing......
As you can see from the log, it never prints the 'userid located' message, so it is not matching them; it just continues with the loop and uses the default ID I defined above the function. Anyone with an idea would definitely help me out, as I've been poking and prodding this thing for 3 days now.
The answer to this was found by #VikasNehaOjha: it was simply missing a conversion to matching types before the comparison. I did this by adding in
list[n] = int(list[n])
That resolved my issue, and the comparisons finally matched.
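For anyone landing here later, a minimal sketch of the fixed comparison (assuming, as in my case, that databaseID is an integer while the JSON stores the ids as strings; note that the original range(0, len(list) - 1) also skipped the last entry of the list):

ids = BigWorld.iddata["ids"]
for n in range(len(ids)):
    #Cast the JSON string to int so the types match before comparing
    if UID_DB == int(ids[n]):
        UID = UID_DB
        break  #stop once the account ID has been found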