How to develop a Plagiarism detector? - projects

I am planning to make a Plagiarism Detector as my Computer Science Engineering final year project,for which I would like to take your suggestions on how to go about it.
I would appreciate if you could suggest which all fields in CS I need to focus on and also the language which would be the most appropriate to implement in.

The language is nearly irrelevant. Another questions exists that discusses this a bit more. Basically, the method suggested there is to use Google. Extract parts of the target-text, and search for them on Google.

I am making a plagiarism checker using Python as a hobby project.
The following steps are to be followed:
Tokenize the document.
Remove all the stop words using NLTK library.
Use GenSim library and find the most relevant words, line by line. This can be done by creating the LDA or LSA of the document.
Use Google Search API to search for those words.
Note:
you might have chosen to use the Google API and search the whole document at once. This will work when you are working with smaller amount of data. However when building plagiarism checker for sites and webscraped data, we will need to apply NLTK algorithms.
The Google search API will result in the top articles which have the same words which were resulted in the LDA or LSA from GenSim library functions of Python.
Hope it helped.

Here is a simple code to match the similarity percentage between two file
import numpy as np
def levenshtein(seq1, seq2):
size_x = len(seq1) + 1
size_y = len(seq2) + 1
matrix = np.zeros ((size_x, size_y))
for x in range(size_x):
matrix [x, 0] = x
for y in range(size_y):
matrix [0, y] = y
for x in range(1, size_x):
for y in range(1, size_y):
if seq1[x-1] == seq2[y-1]:
matrix [x,y] = min(
matrix[x-1, y] + 1,
matrix[x-1, y-1],
matrix[x, y-1] + 1
)
else:
matrix [x,y] = min(
matrix[x-1,y] + 1,
matrix[x-1,y-1] + 1,
matrix[x,y-1] + 1
)
#print (matrix)
return (matrix[size_x - 1, size_y - 1])
with open('original.txt', 'r') as file:
data = file.read().replace('\n', '')
str1=data.replace(' ', '')
with open('target.txt', 'r') as file:
data = file.read().replace('\n', '')
str2=data.replace(' ', '')
if(len(str1)>len(str2)):
length=len(str1)
else:
length=len(str2)
print(100-round((levenshtein(str1,str2)/length)*100,2),'% Similarity')
Create two files "original.txt" and "target.txt" in same directory with content.

you better try python,cause its easy to develop a program using this..i'm also doing a project on plagiarism detector..i suggest u to tokenize the string first..actually it is complicated but this is the way if u r trying to develop for source code,else if u r developing plagiarism detector for text file use cosine similarity method,LCS method or simply considering position..

Related

Having issue with max_norm parameter of torch.nn.Embedding

I use torch.nn.Embedding to embed my model’s categorical input features, however, I face problems when I set the max_norm parameter to not None.
There is a note on the pytorch docs page that explains how to use max_norm parameter through the following example:
n, d, m = 3, 5, 7
embedding = nn.Embedding(n, d, max_norm=True)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor(\[1, 2\])
a = embedding.weight.clone() # W.t() # weight must be cloned for this to be differentiable
b = embedding(idx) # W.t() # modifies weight in-place
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()
I can’t easily understand this example from the docs. What is the purpose of having both ‘a’ and ‘b’ and why ‘out’ is defined as, out = (a.unsqueeze(0) + b.unsqueeze(1))?
Do we need to first clone the entire embedding tensor as in ‘a’, and then finding the embeddings for our desired indices as in ‘b’? Then how do ‘a’ and ‘b’ need to be added?
In my code, I don’t have W explicitly, I am assuming that W is representative of the weights applied by the torch.nn.Linear layers. So, I just need to prepare the input (which includes the embeddings for categorical features) that goes into my network.
I greatly appreciate any instructions on this, as understanding this example would help me adapt my code accordingly.
Because W in the line computing a requires gradients, we must save embedding.weight to compute those gradients in the backward pass. However, in the line computing b, executing embedding(idx) will scale embedding.weight by max_norm - in place. So, without cloning it in line a, embedding.weight will be modified when line b is executed - changing what was saved for the backward pass to update W. Hence the requirement to clone embedding.weight - to save it before it gets scaled in line b.
If you don't use embedding.weight outside of the normal forward pass, you don't need to worry about all this.
If you get an error, post it (and your code).

Why doesn't Detectron2 not detect any other instances after transfer learning?

I have followed the tutorial from this website https://colab.research.google.com/drive/16jcaJoc6bCFAQ96jDe2HwtXj7BMD_-m5#scrollTo=U5LhISJqWXgM.
The question is: there were images that contained 'person' and many other instances but why weren't they detected and segmented?
from detectron2.utils.visualizer import ColorMode
dataset_dicts = get_balloon_dicts("balloon/val")
for d in random.sample(dataset_dicts, 3):
im = cv2.imread(<CUSTOM_IMAGE_CONTAINING_PERSON_DETECTED_WHEN_CUSTOM_WASNT_USED>) <--changed
outputs = predictor(im)
v = Visualizer(im[:, :, ::-1],
metadata=balloon_metadata,
scale=0.8,
instance_mode=ColorMode.IMAGE_BW # remove the colors of unsegmented pixels
)
v = v.draw_instance_predictions(outputs["instances"].to("cpu"))
cv2_imshow(v.get_image()[:, :, ::-1])
After downloading a custom photo containing people I ran inference on the image as instructured under Run a pre-trained detectron2 model using the model they used (model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")) but when I use the trained model of the balloons it does not detect the people in the image (the threshold was not the problem since I used 0.5 in both cases). Why is it so and how would I be able to make it show all the instances? Help would be greatly appreciated :D

Faster way to do multiple embeddings in PyTorch?

I'm working on a torch-based library for building autoencoders with tabular datasets.
One big feature is learning embeddings for categorical features.
In practice, however, training many embedding layers simultaneously is creating some slowdowns. I am using for-loops to do this and running the for-loop on each iteration is (I think) what's causing the slowdowns.
When building the model, I associate embedding layers with each categorical feature in the user's dataset:
for ft in self.categorical_fts:
feature = self.categorical_fts[ft]
n_cats = len(feature['cats']) + 1
embed_dim = compute_embedding_size(n_cats)
embed_layer = torch.nn.Embedding(n_cats, embed_dim)
feature['embedding'] = embed_layer
Then, with a call to .forward():
embeddings = []
for i, ft in enumerate(self.categorical_fts):
feature = self.categorical_fts[ft]
emb = feature['embedding'](codes[i])
embeddings.append(emb)
#num and bin are numeric and binary features
x = torch.cat(num + bin + embeddings, dim=1)
Then x goes into dense layers.
This gets the job done but running this for loop during each forward pass really slows down training, especially when a dataset has tens or hundreds of categorical columns.
Does anybody know of a way of vectorizing something like this? Thanks!
UPDATE:
For more clarity, I made this sketch of how I'm feeding categorical features into the network. You can see that each categorical column has its own embedding matrix, while numeric features are concatenated directly to their output before being passed into the feed-forward network.
Can we do this without iterating through each embedding matrix?
just use simple indexing [, though i'm not sure whether it is fast enough
Here is a simplified version for all feature have same vocab_size and embedding dim, but it should apply to cases of heterogeneous category features
xdim = 240
embed_dim = 8
vocab_size = 64
embedding_table = torch.randn(size=(xdim, vocab_size, embed_dim))
batch_size = 32
x = torch.randint(vocab_size, size=(batch_size, xdim))
out = embedding_table[torch.arange(xdim), x]
out.shape # (bz, xdim, embed_dim)
# unit test
i = np.random.randint(batch_size)
j = np.random.randint(xdim)
x_index = x[i][j]
w = embedding_table[j]
torch.allclose(w[x_index], out[i, j])

Riding the wave Numerical schemes for hyperbolic PDEs, lorena barba lessons, assistance needed

I am a beginner python user who is trying to get a feel for computer science, I've been learning how to use it by studying concepts/subjects I'm already familiar with, such as Computation Fluid Mechanics & Finite Element Analysis. I got my degree in mechanical engineering, so not much CS background.
I'm studying a series by Lorena Barba on jupyter notebook viewer, Practical Numerical Methods, and i'm looking for some help, hopefully someone familiar with the subjects of CFD & FEA in general.
if you click on the link below and go to the following output line, you'll find what i have below. Really confused on this block of code operated within the function that is defined.
Anyway. If there is anyone out there, with any suggestions on how to tackle learning python, HELP
In[9]
rho_hist = [rho0.copy()]
rho = rho0.copy() **# im confused by the role of this variable here**
for n in range(nt):
# Compute the flux.
F = flux(rho, *args)
# Advance in time using Lax-Friedrichs scheme.
rho[1:-1] = (0.5 * (rho[:-2] + rho[2:]) -
dt / (2.0 * dx) * (F[2:] - F[:-2]))
# Set the value at the first location.
rho[0] = bc_values[0]
# Set the value at the last location.
rho[-1] = bc_values[1]
# Record the time-step solution.
rho_hist.append(rho.copy())
return rho_hist
http://nbviewer.jupyter.org/github/numerical-mooc/numerical-mooc/blob/master/lessons/03_wave/03_02_convectionSchemes.ipynb
The intent of the first two lines is to preserve rho0 and provide copies of it for the history (copy so that later changes in rho0 do not reflect back here) and as the initial value for the "working" variable rho that is used and modified during the computation.
The background is that python list and array variables are always references to the object in question. By assigning the variable you produce a copy of the reference, the address of the object, but not the object itself. Both variables refer to the same memory area. Thus not using .copy() will change rho0.
a = [1,2,3]
b = a
b[2] = 5
print a
#>>> [1, 2, 5]
Composite objects that themselves contain structured data objects will need a deepcopy to copy the data on all levels.
Numpy array values changed without being aksed?
how to pass a list as value and not as reference?

Solving equation - Overflow error

Basically I just want to solve k. Note that the equation equals to 1.12
import math
from sympy import *
a = 1.45
b = 4.1
c = 14.0
al = math.log(a, 2)
bl = math.log(b, 2)
cl = math.log(c, 2)
k = symbols('k')
print solve(Eq(1/k**al + 1/k**bl + 1/k**cl, 1.12), k)
This raises OverflowError: Python int too large to convert to C long
Solution using other libraries welcomed too.
Since you are using numerical values, I am assuming that you are looking for a numerical solution. In that case, you should not use solve, because it tries to find a symbolic solution. The issue here is that it converts these floating point exponents into rational exponents, which have very large numerators and denominators, and it then at some point tries to make polynomials of degree corresponding to those large numbers, which is where it fails.
To solve numerically, you can use nsolve.
>>> print nsolve(Eq(1/k**al + 1/k**bl + 1/k**cl, 1.12), 2)
1.82427203413783
It's better to use numeric libraries like SciPy if you are interested in numeric solutions, though. You can use lambdify to convert your SymPy expressions into functions more suited for libraries like SciPy that use NumPy arrays.
It is a known issue.
You may try
solve(Eq(1/k**al + 1/k**bl + 1/k**cl, 1.12), k, rational=False)