How to turn CSV file data into lists WITHOUT 'import csv'

I'm trying to read the data from the CSV file into the empty lists I've defined at the top. How can I do this without 'import csv'?
L = []
F = []
G = []
A = []
class client():
    fh = open('fit_clinic_20.csv', 'r')
    for line in fh:
        data = fh.readlines()
        L, F, G, A = fh.split(',')

I would try:
L = []
F = []
G = []
A = []
fh = open('fit_clinic_20.csv', 'r')
# first: you read all lines
data = fh.readlines()
for line in data:
    # you split every line into values (stripping the trailing newline first)
    L_value, F_value, G_value, A_value = line.strip().split(',')
    # you append the values to the lists
    L.append(L_value)
    F.append(F_value)
    G.append(G_value)
    A.append(A_value)
There are certainly more compact ways to do this, but I think this version is easy to understand.
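For example, a more compact variant (just a sketch, assuming every row has exactly four comma-separated fields and no quoted commas, which is the case the csv module normally handles for you) could transpose the rows with zip:
with open('fit_clinic_20.csv', 'r') as fh:
    rows = [line.strip().split(',') for line in fh if line.strip()]
# transpose the list of rows into four column lists
L, F, G, A = (list(col) for col in zip(*rows))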

Related

Loading multiple csvs with mixed dtypes in tensorflow for training

I have 100s of CSVs, with headers, in a directory. I am trying to create a feedforward NN using TensorFlow for regression.
What's the best way to import these CSVs and train a model on them with tf?
Could you also check whether I am doing the preprocessing right?
Note: my features have mixed datatypes (int, float, string); my target is a float.
I cannot concatenate the CSVs and import them with pandas: the data is >50 GB, so it cannot be loaded in memory and has to be read iteratively from disk.
Directory Path:
./data/train/ -> 100s of csvs
./data/test -> 100s of csvs
./data/valid -> 100s of csvs
Methodology:
Create Generator
Use Dataset API to load the data
Preprocess the Data (embedding, one-hot,etc)
Train/fit the model
But in the generator I was only able to specify output formats where the inputs/outputs are homogeneous dtypes.
Code:
def data_generator(file_list, batch_size=2):
    i = 0
    while True:
        if i * batch_size >= len(file_list):  # This loop is used to run the generator indefinitely.
            i = 0
            np.random.shuffle(file_list)
        else:
            file_chunk = file_list[i * batch_size:(i + 1) * batch_size]
            data = []
            labels = []
            for file in file_chunk:
                temp = pd.read_csv(open(file, 'r'))  # Change this line to read any other type of file
                label = temp.pop('ACTUAL_BOXES')     # pop the target column
                data.append(temp.values)             # Convert column data to matrix-like data with one channel
                labels.append(label)
            data = np.asarray(data)
            labels = np.asarray(labels)
            yield data, labels  # Here data will be mixed-datatype arrays & labels will be a float dtype array
            i = i + 1

# getting the list of files inside the directories
train_file_list = np.sort(glob.glob('././data/train/*.csv'))
test_file_list = np.sort(glob.glob('././data/test/*.csv'))
val_file_list = np.sort(glob.glob('././data/val/*.csv'))
train_dataset = tf.data.Dataset.from_generator(data_generator, args=[train_file_list, 2],  # batch_size = 2
                                               output_types=(tf.float32, tf.float32))  # This is where I am stuck
# my sample data and labels will look like this:
# data = ['a', 'b', 1, 2, 3.14, 2]  # Mixed dtypes
# labels = [1.0]                    # float

val_dataset = tf.data.Dataset.from_generator(data_generator, args=[val_file_list, 2],  # batch_size = 2
                                             output_types=(tf.float32, tf.float32))  # This is where I am stuck
# Preprocessing part:
def encode_inputs(EMBEDDING_FEATURES, INDICATOR_FEATURES):
    '''Function for encoding the features'''
    encoded_features = []
    for feature_name in EMBEDDING_FEATURES:
        # Getting the unique vocab list
        vocabulary = np.array(list(flatten(vocab_list[feature_name])))
        # categorical columns using the lists created above:
        cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
            feature_name, vocabulary)
        embedding_dims = int(math.sqrt(len(vocabulary)))
        # create an embedding from the categorical column:
        cat_emb = tf.feature_column.embedding_column(cat_col, 8)  # ,dimension=embedding_dims
        # add the embedding to the list of feature columns
        encoded_features.append(cat_emb)
    for feature_name in INDICATOR_FEATURES:
        # Getting the unique vocab list
        vocabulary = list(flatten(vocab_list[feature_name]))
        # indicator columns using the lists created above:
        ind_col = tf.feature_column.categorical_column_with_vocabulary_list(
            feature_name, vocabulary)
        # create a one-hot indicator column from the categorical column:
        cat_one_hot = tf.feature_column.indicator_column(ind_col)
        # add the one-hot column to the list of feature columns
        encoded_features.append(cat_one_hot)
    # create the input layer for the model
    feature_layer = tf.keras.layers.DenseFeatures(encoded_features)
    return feature_layer

# Opening the JSON file that contains the vocab list for string columns
f = open('./vocab_list.json')  # File that contains the unique values of each feature
vocab_list = json.load(f)
features_layer = encode_inputs(EMBEDDING_FEATURES, INDICATOR_FEATURES)
# Model part
model = tf.keras.models.Sequential([
    features_layer,
    tf.keras.layers.Dense(30, activation='relu'),
    tf.keras.layers.Dense(1)
])
m_loss = tf.keras.losses.mean_squared_error
m_optimizer = tf.keras.optimizers.SGD(lr=1e-3)
batch_size = 32
model.compile(loss=m_loss, optimizer=m_optimizer, metrics=['accuracy'])
model.fit(train_dataset, epochs=10, validation_data=val_dataset)
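One way around the homogeneous-dtype limitation (a rough sketch only, not tested on this data: the column names 'cat_col' and 'num_col' are made up, and it assumes TF 2.4+ for output_signature) is to have the generator yield a dict of per-column arrays, so string and numeric columns keep their own dtypes and can feed the feature columns by name:
import glob
import numpy as np
import pandas as pd
import tensorflow as tf

def dict_data_generator(file_list, batch_size=2):
    # Yields (features, labels) where features is a dict of per-column arrays.
    i = 0
    while True:
        if i * batch_size >= len(file_list):
            i = 0
            np.random.shuffle(file_list)
        else:
            chunk = file_list[i * batch_size:(i + 1) * batch_size]
            df = pd.concat(pd.read_csv(f) for f in chunk)
            labels = df.pop('ACTUAL_BOXES').astype('float32').values
            features = {
                'cat_col': df['cat_col'].astype(str).values,        # hypothetical string feature
                'num_col': df['num_col'].astype('float32').values,  # hypothetical numeric feature
            }
            yield features, labels
            i += 1

output_signature = (
    {'cat_col': tf.TensorSpec(shape=(None,), dtype=tf.string),
     'num_col': tf.TensorSpec(shape=(None,), dtype=tf.float32)},
    tf.TensorSpec(shape=(None,), dtype=tf.float32),
)

train_files = np.sort(glob.glob('././data/train/*.csv'))
train_dataset = tf.data.Dataset.from_generator(
    lambda: dict_data_generator(train_files, batch_size=2),
    output_signature=output_signature)
A dict-shaped dataset like this can then be fed to a model whose input layer is a DenseFeatures layer, because feature columns look features up by key.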

How do I apply NLP to the search engine I’m building using MySQL as data storage

I’m working on a search engine project for my country. I have the country’s list of domains to crawl, and I have built a bot in Python that currently crawls some of the sites. When a crawl succeeds, the crawler commits the crawled content to a MySQL database, so as I speak I have data on a remote MySQL server that people can search.
Now I want to implement NLP in the search, so that when a user enters a keyword in the search box, relevant results from the MySQL database are shown based on that keyword. I’m using Python 3.8 and NLTK for this project. I haven’t done anything with NLP before; this is my first time, though I have read about it. I also want to ask whether a MySQL database is the right option for the search engine. If not, why not, and what should I use instead? I’m currently using MySQL because I’m much more familiar with it and enjoy using it for data storage. I’ve been struggling with this since last December. What I really need is the right NLP approach for selecting relevant results from the MySQL database. I know NLP is difficult to implement, but I would appreciate it if you could at least try to help out.
What I have done so far is copied from this Kaggle notebook: https://www.kaggle.com/amitkumarjaiswal/nlp-search-engine/notebook, but I still haven’t been able to make it work for my own project.
Here’s the code:
import pandas as pd
import numpy as np
import string
import random
import nltk
import os
import re
#import nltk.corpus
import csv
#nltk.download('all')
#print(os.listdir(nltk.data.find("corpora")))
#pip install --upgrade nltk
from nltk.corpus import brown
from nltk.corpus import reuters
from nltk.tokenize import word_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer
#load 10k reuters news documents
len(reuters.fileids())
#view text from one document
reuters.raw(fileids=['test/14826'])[0:201]
exclude = set(string.punctuation)
alldocslist = []
for index, i in enumerate(reuters.fileids()):
    text = reuters.raw(fileids=[i])
    text = ''.join(ch for ch in text if ch not in exclude)
    alldocslist.append(text)
print(alldocslist[1])
#tokenize words in all DOCS
plot_data = [[]] * len(alldocslist)
for doc in alldocslist:
    text = doc
    tokentext = word_tokenize(text)
    plot_data[index].append(tokentext)
print(plot_data[0][1])
# Navigation: first index gives all documents, second index gives specific document, third index gives words of that doc
plot_data[0][1][0:10]
#make all words lower case for all docs
for x in range(len(reuters.fileids())):
    lowers = [word.lower() for word in plot_data[0][x]]
    plot_data[0][x] = lowers
plot_data[0][1][0:10]
# remove stop words from all docs
stop_words = set(stopwords.words('english'))
for x in range(len(reuters.fileids())):
    filtered_sentence = [w for w in plot_data[0][x] if not w in stop_words]
    plot_data[0][x] = filtered_sentence
plot_data[0][1][0:10]
#stem words EXAMPLE (could try others/lemmers )
snowball_stemmer = SnowballStemmer("english")
stemmed_sentence = [snowball_stemmer.stem(w) for w in filtered_sentence]
stemmed_sentence[0:10]
porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer("english")
stemmed_sentence = [ porter_stemmer.stem(w) for w in filtered_sentence]
stemmed_sentence[0:10]
# Create inverse index which gives document number for each document and where word appears
#first we need to create a list of all words
l = plot_data[0]
flatten = [item for sublist in l for item in sublist]
words = flatten
wordsunique = set(words)
wordsunique = list(wordsunique)
import math
from textblob import TextBlob as tb
def tf(word, doc):
    return doc.count(word) / len(doc)

def n_containing(word, doclist):
    return sum(1 for doc in doclist if word in doc)

def idf(word, doclist):
    return math.log(len(doclist) / (0.01 + n_containing(word, doclist)))

def tfidf(word, doc, doclist):
    return (tf(word, doc) * idf(word, doclist))
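# Quick sanity check of the helpers above (illustrative toy documents only):
# tf('oil', ['crude', 'oil', 'price'])               -> 1/3
# n_containing('oil', [['crude', 'oil'], ['gold']])  -> 1
# idf('oil', [['crude', 'oil'], ['gold']])           -> log(2 / (0.01 + 1)) ≈ 0.68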
# THIS ONE-TIME INDEXING IS THE MOST PROCESSOR-INTENSIVE STEP AND WILL TAKE TIME TO RUN (BUT ONLY NEEDS TO BE RUN ONCE)
plottest = plot_data[0][0:1000]
worddic = {}
for doc in plottest:
    for word in wordsunique:
        if word in doc:
            word = str(word)
            index = plottest.index(doc)
            positions = list(np.where(np.array(plottest[index]) == word)[0])
            idfs = tfidf(word, doc, plottest)
            try:
                worddic[word].append([index, positions, idfs])
            except:
                worddic[word] = []
                worddic[word].append([index, positions, idfs])
# the index creates a dict with each word as a KEY and a list of doc indexes, word positions, and tf-idf scores as VALUES
worddic['china']
# pickle (save) the dictionary to avoid re-calculating
np.save('worddic_1000.npy', worddic)
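# To load the saved index back later, allow_pickle is needed because it stores a dict:
# worddic = np.load('worddic_1000.npy', allow_pickle=True).item()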
# create word search which takes multiple words and finds documents that contain them, along with metrics for ranking:
## (1) Number of occurrences of the search words
## (2) TF-IDF score for the search words
## (3) Percentage of search terms
## (4) Word ordering score
## (5) Exact match bonus
from collections import Counter
def search(searchsentence):
    try:
        # split the sentence into individual words
        searchsentence = searchsentence.lower()
        try:
            words = searchsentence.split(' ')
        except:
            words = list(words)
        enddic = {}
        idfdic = {}
        closedic = {}
        # remove words if not in worddic
        realwords = []
        for word in words:
            if word in list(worddic.keys()):
                realwords.append(word)
        words = realwords
        numwords = len(words)
        # make metric of number of occurrences of all words in each doc & largest total IDF
        for word in words:
            for indpos in worddic[word]:
                index = indpos[0]
                amount = len(indpos[1])
                idfscore = indpos[2]
                enddic[index] = amount
                idfdic[index] = idfscore
        fullcount_order = sorted(enddic.items(), key=lambda x: x[1], reverse=True)
        fullidf_order = sorted(idfdic.items(), key=lambda x: x[1], reverse=True)
        # make metric of what percentage of words appear in each doc
        combo = []
        alloptions = {k: worddic.get(k, None) for k in words}
        for worddex in list(alloptions.values()):
            for indexpos in worddex:
                for indexz in indexpos:
                    combo.append(indexz)
        comboindex = combo[::3]
        combocount = Counter(comboindex)
        for key in combocount:
            combocount[key] = combocount[key] / numwords
        combocount_order = sorted(combocount.items(), key=lambda x: x[1], reverse=True)
        # make metric for whether words appear in the same order as in the search
        if len(words) > 1:
            x = []
            y = []
            for record in [worddic[z] for z in words]:
                for index in record:
                    x.append(index[0])
            for i in x:
                if x.count(i) > 1:
                    y.append(i)
            y = list(set(y))
            closedic = {}
            for wordbig in [worddic[x] for x in words]:
                for record in wordbig:
                    if record[0] in y:
                        index = record[0]
                        positions = record[1]
                        try:
                            closedic[index].append(positions)
                        except:
                            closedic[index] = []
                            closedic[index].append(positions)
            x = 0
            fdic = {}
            for index in y:
                csum = []
                for seqlist in closedic[index]:
                    while x > 0:
                        secondlist = seqlist
                        x = 0
                        sol = [1 for i in firstlist if i + 1 in secondlist]
                        csum.append(sol)
                        fsum = [item for sublist in csum for item in sublist]
                        fsum = sum(fsum)
                        fdic[index] = fsum
                        fdic_order = sorted(fdic.items(), key=lambda x: x[1], reverse=True)
                    while x == 0:
                        firstlist = seqlist
                        x = x + 1
        else:
            fdic_order = 0
        # also, the one above should be given a big boost if ALL words are found together
        # could make another metric for words that are not next to each other but still close
        return (searchsentence, words, fullcount_order, combocount_order, fullidf_order, fdic_order)
    except:
        return ("")
search('indonesia crude palm oil')[1]
# 0 return will give back the search term, the rest will give back metrics (see above)
search('indonesia crude palm oil')[1][1:10]
# save metrics to dataframe for use in ranking and machine learning
result1 = search('china daily says what')
result2 = search('indonesia crude palm oil')
result3 = search('price of nickel')
result4 = search('north yemen sugar')
result5 = search('nippon steel')
result6 = search('China')
result7 = search('Gold')
result8 = search('trade')
df = pd.DataFrame([result1,result2,result3,result4,result5,result6,result7,result8])
df.columns = ['search term', 'actual_words_searched','num_occur','percentage_of_terms','td-idf','word_order']
df
# look to see if the top documents seem to make sense
alldocslist[1]
# create a simple (non-machine learning) rank and return function
def rank(term):
    results = search(term)
    # get metrics
    num_score = results[2]
    per_score = results[3]
    tfscore = results[4]
    order_score = results[5]
    final_candidates = []
    # rule1: if high word order score & 100% percentage terms, then put at the top position
    try:
        first_candidates = []
        for candidates in order_score:
            if candidates[1] > 1:
                first_candidates.append(candidates[0])
        second_candidates = []
        for match_candidates in per_score:
            if match_candidates[1] == 1:
                second_candidates.append(match_candidates[0])
            if match_candidates[1] == 1 and match_candidates[0] in first_candidates:
                final_candidates.append(match_candidates[0])
        # rule2: next add other word order scores which are greater than 1
        t3_order = first_candidates[0:3]
        for each in t3_order:
            if each not in final_candidates:
                final_candidates.insert(len(final_candidates), each)
        # rule3: next add top tf-idf results
        final_candidates.insert(len(final_candidates), tfscore[0][0])
        final_candidates.insert(len(final_candidates), tfscore[1][0])
        # rule4: next add other high percentage scores
        t3_per = second_candidates[0:3]
        for each in t3_per:
            if each not in final_candidates:
                final_candidates.insert(len(final_candidates), each)
        # rule5: next add any other top results for the metrics
        othertops = [num_score[0][0], per_score[0][0], tfscore[0][0], order_score[0][0]]
        for top in othertops:
            if top not in final_candidates:
                final_candidates.insert(len(final_candidates), top)
    # unless a single term was searched, in which case just return
    except:
        othertops = [num_score[0][0], num_score[1][0], num_score[2][0], per_score[0][0], tfscore[0][0]]
        for top in othertops:
            if top not in final_candidates:
                final_candidates.insert(len(final_candidates), top)
    for index, results in enumerate(final_candidates):
        if index < 5:
            print("RESULT", index + 1, ":", alldocslist[results][0:100], "...")
# example of output
rank('indonesia palm oil')
# example of output
rank('china')
# Create pseudo-truth set using first 5 words
# Because I don't have a truth set I will generate a pseudo one by pulling terms from the documents - this is far from perfect
# as it may not approximate people's actual queries well, but it will serve well enough to build the ML architecture
df_truth = pd.DataFrame()
for doc in plottest:
    first_five = doc[0:5]
    test_sentence = ' '.join(first_five)
    result = search(test_sentence)
    df_temp = pd.DataFrame([result])
    df_truth = pd.concat([df_truth, df_temp])
df_truth['truth'] = range(0, len(plottest))
df_truth1 = pd.DataFrame()
seqlen = 3
for doc in plottest:
    try:
        start = random.randint(0, (len(doc) - seqlen))
        random_seq = doc[start:start + seqlen]
        test_sentence = ' '.join(random_seq)
    except:
        test_sentence = doc[0]
    result = search(test_sentence)
    df_temp = pd.DataFrame([result])
    df_truth1 = pd.concat([df_truth1, df_temp])
df_truth1['truth'] = range(0, len(plottest))
# create another pseudo-truth set using a different random 4-word sequence from each doc
df_truth2 = pd.DataFrame()
seqlen = 4
for doc in plottest:
    try:
        start = random.randint(0, (len(doc) - seqlen))
        random_seq = doc[start:start + seqlen]
        test_sentence = ' '.join(random_seq)
    except:
        test_sentence = doc[0]
    result = search(test_sentence)
    df_temp = pd.DataFrame([result])
    df_truth2 = pd.concat([df_truth2, df_temp])
df_truth2['truth'] = range(0, len(plottest))
# create another pseudo-truth set using a different random 2-word sequence from each doc
df_truth3 = pd.DataFrame()
seqlen = 2
for doc in plottest:
    try:
        start = random.randint(0, (len(doc) - seqlen))
        random_seq = doc[start:start + seqlen]
        test_sentence = ' '.join(random_seq)
    except:
        test_sentence = doc[0]
    result = search(test_sentence)
    df_temp = pd.DataFrame([result])
    df_truth3 = pd.concat([df_truth3, df_temp])
df_truth3['truth'] = range(0, len(plottest))
# combine the truth sets and save to disk
truth_set = pd.concat([df_truth,df_truth1,df_truth2,df_truth3])
truth_set.columns = ['search term', 'actual_words_searched','num_occur','percentage_of_terms','td-idf','word_order','truth']
truth_set.to_csv("truth_set_final.csv")
truth_set[0:10]
truth_set
test_set = truth_set[0:3]
test_set
# convert to long format for ML
# WARNING AGAIN THIS IS A SLOW PROCESS DUE TO RAM ILOC - COULD BE OPTIMISED FOR FASTER PERFORMANCE
# BUG: when min(maxnum, len(truth_set)) <- is an int, not a list, because of very short variable length
# row is row
# column is variable
# i is the result
final_set = pd.DataFrame()
test_set = truth_set[1:100]
maxnum = 5
for row in range(0, len(test_set.index)):
    test_set = truth_set[1:100]
    for col in range(2, 6):
        for i in range(0, min(maxnum, len(truth_set.iloc[row][col]))):
            x = pd.DataFrame([truth_set.iloc[row][col][i]])
            x['truth'] = truth_set.iloc[row]['truth']
            x.columns = [(str(truth_set.columns[col]), "index", i), (str(truth_set.columns[col]), "score", i), 'truth']
            test_set = test_set.merge(x, on='truth')
    final_set = pd.concat([final_set, test_set])
final_set.head()
final_set.to_csv("ML_set_100.csv")
final_set2 = final_set.drop(['actual_words_searched','num_occur','percentage_of_terms','search term','td-idf','word_order'], 1)
final_set2.to_csv("ML_set_100_3.csv")
final_set2.head()
final_set3 = final_set2
final_set3[0:10]
Obviously, the code above isn't returning searched keywords from the MySQL database. I hope you understand what I mean. Thank you very much!
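One possible direction for the MySQL side (only a rough sketch: the table name pages, the columns url/title/content, and the connection details below are hypothetical, and it assumes a FULLTEXT index has been added on the text columns) is to normalise the user's query with NLTK and let MySQL's full-text search do the retrieval and ranking:
import mysql.connector
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def search_mysql(user_query):
    # Keep only meaningful words; stemming (as in the notebook) only helps
    # if the stored text was stemmed the same way, so it is skipped here.
    stop_words = set(stopwords.words('english'))
    tokens = [w.lower() for w in word_tokenize(user_query)
              if w.isalnum() and w.lower() not in stop_words]
    keywords = ' '.join(tokens)

    conn = mysql.connector.connect(host='localhost', user='user',
                                   password='password', database='search_db')
    cur = conn.cursor()
    # Requires: ALTER TABLE pages ADD FULLTEXT(title, content);
    cur.execute(
        "SELECT url, title, "
        "MATCH(title, content) AGAINST (%s IN NATURAL LANGUAGE MODE) AS score "
        "FROM pages "
        "WHERE MATCH(title, content) AGAINST (%s IN NATURAL LANGUAGE MODE) "
        "ORDER BY score DESC LIMIT 20",
        (keywords, keywords))
    results = cur.fetchall()
    conn.close()
    return results
The tf-idf index built in the notebook could still be used on top of this, for example to re-rank the rows MySQL returns.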

Read csv with headers into a data structure in Octave

I want to read a data file with one line of headers into a data structure ds that has the CSV headers as fieldnames(ds).
So, for the sample file below, I would be able to do, e.g., ds.x or ds.('x').
But using
ds = importdata(ds_fname, delimiterIn = ',', headerlinesIn = 1);
I get instead
> fieldnames(ds)
ans =
{
[1,1] = data
[2,1] = textdata
[3,1] = colheaders
}
> ds.colheaders
ans =
{
[1,1] = x
[1,2] = y1
[1,3] = y2
}
> ds.data
ans =
<mydata>
How can I read a CSV directly into a data structure?
I managed to read the CSV into a cell array and then create the data structure from it, but I want to avoid intermediate steps.
Sample data file
x,y1,y2
0.1,1.0,1.0e-1
4.,21.0,1.0e1
6,-1,1.0e+1
pkg load io
C = csv2cell('data.csv');
S = cell2struct( C(2:end,:).', C(1,:) );

BeautifulSoup4 & Python - multiple pages into DataFrame

I have some code which collects the description, price, and old price (if on sale) from online retailers over multiple pages. I'm looking to export this into a DataFrame and have had a go, but I run into the following error:
ValueError: Shape of passed values is (1, 3210), indices imply (3, 3210).
from bs4 import BeautifulSoup
import requests
import time
import pandas as pd
# Start Timer
then = time.time()
# Headers
headers = {"User-Agent": "Mozilla/5.0"}
# Set HTTPCode = 200 and Counter = 1
Code = 200
i = 1
scraped_data = []
while Code == 200:
    # Put the url together
    url = "https://www.asos.com/women/jumpers-cardigans/cat/?cid=2637&page="
    url = url + str(i)
    # Request URL
    r = requests.get(url, allow_redirects=False, headers=headers)  # No redirects to allow infinite page count
    data = r.text
    Code = r.status_code
    # Soup
    soup = BeautifulSoup(data, 'lxml')
    # For loop each product then scroll through title price, old price and description
    divs = soup.find_all('article', attrs={'class': '_2qG85dG'})  # want to cycle through each of these
    for div in divs:
        # Get Description
        Description = div.find('div', attrs={'class': '_3J74XsK'})
        Description = Description.text.strip()
        scraped_data.append(Description)
        # Fetch TitlePrice
        NewPrice = div.find('span', attrs={'data-auto-id': 'productTilePrice'})
        NewPrice = NewPrice.text.strip("£")
        scraped_data.append(NewPrice)
        # Fetch OldPrice
        try:
            OldPrice = div.find('span', attrs={'data-auto-id': 'productTileSaleAmount'})
            OldPrice = OldPrice.text.strip("£")
            scraped_data.append(OldPrice)
        except AttributeError:
            OldPrice = ""
            scraped_data.append(OldPrice)
    print('page', i, 'scraped')
    # Print Array
    # array = {"Description": str(Description), "CurrentPrice": str(NewPrice), "Old Price": str(OldPrice)}
    # print(array)
    i = i + 1
else:
    i = i - 2
now = time.time()
pd.DataFrame(scraped_data, columns=["A", "B", "C"])
print('Parse complete with', i, 'pages' + ' in', now - then, 'seconds')
Right now your data is appended to the list based on an algorithm that I can describe like this:
Load the web page
Append to list value A
Append to list value B
Append to list value C
What this creates for each run through the dataset is:
[A1, B1, C1, A2, B2, C2]
There exists only one column of data, which is what pandas is telling you. To construct the DataFrame properly, you either need to swap it into a format where each row entry is a tuple of three values, like:
[
(A1, B1, C1),
(A2, B2, C2)
]
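For instance (a quick sketch, assuming the flat scraped_data list above always holds complete description/price/old-price triples), you could chunk it into rows like this:
rows = [tuple(scraped_data[i:i + 3]) for i in range(0, len(scraped_data), 3)]
df = pd.DataFrame(rows, columns=["Description", "CurrentPrice", "Old Price"])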
Or, my preferred way because it's far more robust to coding errors and inconsistent data lengths: create each row as a dictionary of columns. Thus,
rowdict_list = []
for row in data_source:
    a = extract_a()
    b = extract_b()
    c = extract_c()
    rowdict_list.append({'column_a': a, 'column_b': b, 'column_c': c})
And the DataFrame is constructed easily, without having to explicitly specify columns in the constructor, with df = pd.DataFrame(rowdict_list).
You can create a DataFrame using the array dictionary.
You would want to set the values of the array dict to empty lists; that way, you can append the values from the webpage to the correct list. Also, move the array variable outside of the while loop.
array = {"Description": [], "CurrentPrice": [], "Old Price": []}
scraped_data = []
while Code == 200:
    ...
On the line where you were previously defining the array variable, you would then want to append the description, price, and old price values like so.
array['Description'].append(str(Description))
array['CurrentPrice'].append(str(NewPrice))
array['Old Price'].append(str(OldPrice))
Then you can create a DataFrame using the array variable:
pd.DataFrame(array)
So the final solution would look something like this:
array = {"Description": [], "CurrentPrice": [], "Old Price": []}
scraped_data = []
while Code == 200:
    ...
    # For loop
    for div in divs:
        # Get Description
        Description = div.find('h3', attrs={'class': 'product__title'})
        Description = Description.text.strip()
        # Fetch TitlePrice
        try:
            NewPrice = div.find('div', attrs={'class': 'price product__price--current'})
            NewPrice = NewPrice.text.strip()
        except AttributeError:
            NewPrice = div.find('p', attrs={'class': 'price price--reduced'})
            NewPrice = NewPrice.text.strip()
        # Fetch OldPrice
        try:
            OldPrice = div.find('p', attrs={'class': 'price price--previous'})
            OldPrice = OldPrice.text.strip()
        except AttributeError:
            OldPrice = ""
        array['Description'].append(str(Description))
        array['CurrentPrice'].append(str(NewPrice))
        array['Old Price'].append(str(OldPrice))
    # Print Array
    print(array)
    df = pd.DataFrame(array)
    i = i + 1
else:
    i = i - 2
now = time.time()
print('Parse complete with', i, 'pages' + ' in', now - then, 'seconds')
Finally, make sure you've imported pandas at the top of the module:
import pandas as pd

How to iterate through a python dictionary from user input?

Background:
I have a JSON dictionary file as follows:
dictionary = {"Qui": "クイ", "Quiana": "キアナ", "Quick": "クイック", "Quickley": "クイックリー", "Quico": "キコ", "Quiej-Alvarez": "クエイ アルバレス", "Quigg": "クイッグ", "Quigley": "クイグリー", "Quijano": "クイジャーノ", "Quik": "クイック", "Quilici": "クイリチ", "Quill": "クィル"}
Then I let the user enter as many keys as they want through input, and finally return a formatted string combining the corresponding values.
Question:
My code so far gets the job done in a very clunky/incomplete manner. Any advice on how to clean up the code and achieve my goal?
Current code:
import json
import sys, math
import codecs
#Part1
search_term,search_term2 = input("Enter a Name: ").split()
dictionary = {}
keys = dictionary.keys()
values = dictionary.values()
with open('translation.json', 'r', encoding='utf-8-sig') as f:
    term_data = json.load(f)
    if search_term.casefold() in term_data:
        word = search_term.title()
    elif search_term.title() in term_data:
        word = search_term.title()
    output1 = "{}".format(term_data[search_term])
#Part 2
with open('translation.json', 'r', encoding='utf-8-sig') as f:
    term_data2 = json.load(f)
    if search_term2.casefold() in term_data2:
        word2 = search_term2.title()
    elif search_term2.title() in term_data2:
        word2 = search_term2.title()
    #else:
    #    print("Name not found in dictionary.")
    output2 = "{}".format(term_data2[search_term2])
print("{}・{}".format(output1, output2))
Your current code only lets the user enter 2 keys, which doesn't meet your original requirement, so I've expanded it as follows and made it simpler at the same time:
test.py:
import json
import codecs
with open('translation.json', 'r', encoding='utf-8-sig') as f:
    term_data = json.load(f)
search_terms = input("Enter a name: ").split()
l = [term_data[i] for i in search_terms if i.casefold() in term_data or i.title() in term_data]
print('.'.join(l))
First, we only need to open the JSON file once; I/O operations are expensive, so we should avoid doing them again and again.
Second, we needn't repeat the term matching as you did with Part 1 and Part 2. We can do it in a loop; here I use a list comprehension.
Finally, to explain a little:
split all user input into a list: search_terms
loop over the input terms with for i in search_terms
if the candidate term i's casefold() or title() is in the dictionary term_data, its value from the dict is put into the new list l; if not, nothing is done.
at last, the separator . is used to join all the needed elements of the list.
Output:
~$ python3 test.py
Enter a name: Qui Quill Quiana
クイ.クィル.キアナ
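If you also want the behaviour of the commented-out else branch (reporting names that aren't found) and the '・' separator from your original print, a small variation along the same lines (just a sketch) could be:
import json

with open('translation.json', 'r', encoding='utf-8-sig') as f:
    term_data = json.load(f)

results = []
for name in input("Enter a name: ").split():
    key = name if name in term_data else name.title()
    if key in term_data:
        results.append(term_data[key])
    else:
        print("{} not found in dictionary.".format(name))
print('・'.join(results))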