Python - Why isn't this specific text being found by findall regex?

Python - Why isn't this specific text being found by findall regex? - html

EDIT: PLEASE DO NOT DOWNVOTE WITHOUT COMMENTING ON WHY YOU ARE DOWNVOTING. I AM TRYING MY BEST TO WRITE THIS PROPERLY!
I am trying to print all of the URL links of watches on a website. I have all of them printing fine except one, even though that one has the exact same regex conditions as the others. Can someone explain why this isn't printing please? Have I messed up some syntax somewhere? The following code should be able to be pasted into a Python editor (i.e. IDLE) and run.
## Import required modules
from urllib import urlopen
from re import findall
import re
## Provide URL
dennisov_url = 'https://denissov.ru/en/'
## Open and read URL as string named 'dennisov_html'
dennisov_html = urlopen(dennisov_url).read()
## Find all of the links when each watch is clicked (those with the designated
## preceeding text 'window.open', then any character that occurs zero or more
## times, then the text '/en/'. Remove matches with the word "History" and
## any " symbols in the URL.
watch_link_urls = findall('window.open.*(/en/[^history][^"]*/)', dennisov_html)
## For every URL, convert it into a string on a new line and add the domain
for link in watch_link_urls:
link = 'https://denissov.ru' + link
## Print out the full URLs
print link
## This code should show the link https://denissov.ru/en/speedster/ yet
## it isn't showing. It has the exact preceeding text as the other links
## that are printing and is in the same div container. If you inspect the
## website then search 'en/barracuda_mechanical/ and then 'en/speedster/'
## you will see that the speedster link is only a few lines below barracuda
## mechanical and there is nothing different about the two's preceeding
## text, so speedster should be printing

You can try this code with this pattern:
from urllib2 import urlopen
import re
url = 'https://denissov.ru/en/'
data = urlopen(url).read()
sub_urls = re.findall('window.open\(\'(/.*?)\'', data)
# take everything without deleting dublicates
# final_urls = [k for k in b if '/history' not in k and k is not '']
# Or: remove duplicates
set(k for k in b if '/history' not in k)
for k in final_urls:
link = 'https://denissov.ru' + k
print link
Will output something like this:
https://denissov.ru/eng/denissovdesign/index.html
https://denissov.ru/en/barracuda_limited/
https://denissov.ru/en/barracuda_chronograph/
https://denissov.ru/en/barracuda_mechanical/
https://denissov.ru/en/speedster/
https://denissov.ru/en/free_rider/
https://denissov.ru/en/nau_automatic/
https://denissov.ru/en/lady_flower/
https://denissov.ru/en/enigma/
https://denissov.ru/en/number_one/

If you want a regex to get all URLs that don't contain the word history and start with en/ then you should use a tempered greedy solution, like this:
en\/(?:(?!history).)*?\/
(?:(?!history).)*? is a tempered dot which will match any character which doesn't have history as a lookahead.
(?!history) is a negative lookahead to ensure that.
The ?: has been added to indicate that the group is a non-capturing one.
The *? indicates a non-greedy match so that it will match only upto the first /
Regex101 Demo
Change the python code like this:
watch_link_urls = findall('window.open.*(/en\/(?:(?!history).)*?\/)', dennisov_html)
Output:
https://denissov.ru/en/barracuda_limited/
https://denissov.ru/en/barracuda_chronograph/
https://denissov.ru/en/barracuda_mechanical/
https://denissov.ru/en/speedster/
https://denissov.ru/en/free_rider/
https://denissov.ru/en/nau_automatic/
https://denissov.ru/en/lady_flower/
https://denissov.ru/en/enigma/
https://denissov.ru/en/number_one/
Read more about tempered greedy here.

Related

NLTK: Is there a term for this procedure?

I was reading some stuff about NLTK and I read something of a procedure that turns the word such as "you're" into two tokens "you" and "are". I can't remember the source. Is there a term for this or something?

pip install contractions
# import library
import contractions
# contracted text
text = '''I'll be there within 5 min. Shouldn't you be there too?
I'd love to see u there my dear. It's awesome to meet new friends.
We've been waiting for this day for so long.'''
# creating an empty list
expanded_words = []
for word in text.split():
# using contractions.fix to expand the shortened words
expanded_words.append(contractions.fix(word))
expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print('Expanded_text: ' + expanded_text)
the source

Is it possible to give text format hints in google vision api?

I'm trying to detect handwritten dates isolated in images.
In the cloud vision api, is there a way to give hints about type?
example: the only text present will be dd/mm/yy, d,m and y being digits
The only thing I found is language hints in the documentation.
Sometimes I get results that include letters like O instead of 0.

There is not a way to give hints about type but you can filter the output using client libraries. I downloaded detect.py and requirements.txt from here and modified detect.py (in def detect_text, after line 283):
response = client.text_detection(image=image)
texts = response.text_annotations
#Import regular expressions
import re
print('Date:')
dateStr=texts[0].description
# Test case for letters replacement
#dateStr="Z3 OZ/l7"
#print(dateStr)
dateStr=dateStr.replace("O","0")
dateStr=dateStr.replace("Z","2")
dateStr=dateStr.replace("l","1")
dateList=re.split(' |;|,|/|\n',dateStr)
dd=dateList[0]
mm=dateList[1]
yy=dateList[2]
date=dd+'/'+mm+'/'+yy
print(date)
#for text in texts:
#print('\n"{}"'.format(text.description))
#print('Hello you!')
#vertices = (['({},{})'.format(vertex.x, vertex.y)
# for vertex in text.bounding_poly.vertices])
#print('bounds: {}'.format(','.join(vertices)))
# [END migration_text_detection]
# [END def_detect_text]
Then I launched detect.py inside the virtual environment using this command line:
python detect_dates.py text qAkiq.png
And I got this:
23/02/17
There are few letters that can be mistaken for numbers, so using str.replace(“letter”,”number”) should solve the wrong identifications. I added the most common cases for this example.

Extracting text from plain HTML and write to new file

I'm extracting a certain part of a HTML document (to be fair: basis for this is an iXBRL document which means I do have a lot of written formatting code inside) and write my output, the original file without the extracted part, to a .txt file. My aim is to measure the difference in document size (how much KB of the original document refers to the extracted part). As far as I know there shouldn't be any difference in HTML to text format, so my difference should be reliable although I am comparing two different document formats. My code so far is:
import glob
import os
import contextlib
import re
#contextlib.contextmanager
def stdout2file(fname):
import sys
f = open(fname, 'w')
sys.stdout = f
yield
sys.stdout = sys.__stdout__
f.close()
def extractor():
os.chdir(r"F:\Test")
with stdout2file("FileShortened.txt"):
for file in glob.iglob('*.html', recursive=True):
with open(file) as f:
contents = f.read()
extract = re.compile(r'(This is the beginning of).*?Until the End', re.I | re.S)
cut = extract.sub('', contents)
print(file.split(os.path.sep)[-1], end="| ")
print(cut, end="\n")
extractor()
Note: I am NOT using BS4 or lxml because I am not only interested in HTML text but actually in ALL lines between my start and end-RegEx incl. all formatting code lines.
My code is working without problems, however as I have a lot of files my FileShortened.txt document is quickly going to be massive in size. My problem is not with the file or the extraction, but with redirecting my output to various txt-file. For now, I am getting everything into one file, what I would need is some kind of a "for each file searched, create new txt-file with the same name as the original document" condition (arcpy module?!)?
Somehting like:
File1.html --> File1Short.txt
File2.html --> File2Short.txt
...
Is there an easy way (without changing my code too much) to invert my code in the sense of printing the "RegEx Match" to a new .txt file instead of "everything except my RegEx match"?
Any help appreciated!

Ok, I figured it out.
Final Code is:
import glob
import os
import re
from os import path
def extractor():
os.chdir(r"F:\Test") # the directory containing my html
for file in glob.glob("*.html"): # iterates over all files in the directory ending in .html
with open(file) as f, open((file.rsplit(".", 1)[0]) + ".txt", "w") as out:
contents = f.read()
extract = re.compile(r'Start.*?End', re.I | re.S)
cut = extract.sub('', contents)
out.write(cut)
out.close()
extractor()

Why does readHTMLTable cannot successfully read premier league tables for May month?

The official Premier league website provides data with various statistics for league's teams over seasons (e.g. this one). I used the function readHTMLTable from XML R package to retrieve those tables. However, I noticed that the function can not read tables for May months while for others it works well. Here is an example:
april2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=APRIL&timelineView=date&toDate=1177887600000&tableView=CURRENT_STANDINGS"
april.df <- readHTMLTable(april2007.url, which = 1)
april.df[complete.cases(april.df),] ## correct table
march2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=APRIL&timelineView=date&toDate=1398639600000&tableView=CURRENT_STANDINGS"
march.df <- readHTMLTable(march2014.url, which = 1)
march.df[complete.cases(march.df), ] ## correct table
may2007.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2006-2007&month=MAY&timelineView=date&toDate=1179010800000&tableView=CURRENT_STANDINGS"
may.df1 <- readHTMLTable(may2007.url, which = 1)
may.df1 ## Just data for the first team
may2014.url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2013-2014&month=MAY&timelineView=date&toDate=1399762800000&tableView=CURRENT_STANDINGS"
may.df2 <- readHTMLTable(may2014.url, which =1)
may.df2 ## Just data for the first team
As you can see, the function can not retrieve data for May month.
Please, can someone explain why this happens and how it can be fixed?
EDIT AFTER #zyurnaidi answer:
Below is the code that can do the job without manual editing.
url <- "http://www.premierleague.com/en-gb/matchday/league-table.html?season=2009-2010&month=MAY&timelineView=date&toDate=1273359600000&tableView=CURRENT_STANDINGS" ## data for the 09-05-2010.
con <- file (url)
raw <- readLines (con)
close (con)
pattern <- '<span class=" cupchampions-league= competitiontooltip= qualifiedforuefachampionsleague=' ## it seems that this part of the webpage source code mess the things up
raw <- gsub (pattern = pattern, replacement = '""', x = raw)
df <- readHTMLTable (doc = raw, which = 1)
df[complete.cases(df), ] ## correct table

OK. There are few hints for me to find the problem here:
1. The issues happen consistently on May. This is the last month of each season. It means that there should be something unique in this particular case.
2. Direct parsing (htmlParse, from both link and downloaded file) produces a truncated file. The table and html file are just suddenly closed after the first team in the table is reported.
The parsed data always differs from the original right after this point:
<span class=" cupchampions-league=
After downloading and carefully checking the html file itself, I found that there are (uncoded?) character issues there. My guess, this is caused by the cute little trophy icons seen after the team names.
Anyway, to solve this issue, you need to take out these error characters. Instead of editing the downloaded html files, my suggestion is:
1. View page source the EPL url for May's league table
2. Copy all and paste to the text editor, save as an html file
3. You can now use either htmlParse or readHTMLTable
There might be better way to automate this, but hope it can help.

Insert file (foo.txt) into open file (bar.txt) at caret position

What would be the best method, please, to insert file (foo.txt) into open file (bar.txt) at caret position?
It would be nice to have an open-file dialog to choose anything to be inserted.
The word processing equivalent would be "insert file" here.
Here is a substitute for foo.sublime-snippet, which can be linked to form files elsewhere:
import sublime, sublime_plugin
class InsertFileCommand(sublime_plugin.TextCommand):
def run(self, edit):
v = self.view
template = open('foo.txt').read()
print template
v.run_command("insert_snippet", {"contents": template})

From within a text command you can access the current view. You can get the cursor positions using self.view.sel(). I don't know how to do gui stuff in python, but you can do file selection using the quick panel (similar to FuzzyFileNav).

Here is my unofficial modification of https://github.com/mneuhaus/SublimeFileTemplates which permits me to insert-a-file-here using the quick panel. It works on an OSX operating system (running Mountain Lion).
The only disadvantage I see so far is the inability to translate a double-slash \\ in the form file correctly -- it gets inserted instead as just a single-slash \. In my LaTex form files, the double-slash \\ represents a line ending, or a new line if preceded by a ~. The workaround is to insert an extra slash at each occurrence in the actual form file (i.e., put three slashes, with the understanding that only two slashes will be inserted when running the plugin). The form files need to be LF endings and I'm using UTF-8 encoding -- CR endings are not translated properly. With a slight modification, it is also possible to have multiple form file directories and/or file types.
import sublime, sublime_plugin
import os
class InsertFileCommand(sublime_plugin.WindowCommand):
def run(self):
self.find_templates()
self.window.show_quick_panel(self.templates, self.template_selected)
def find_templates(self):
self.templates = []
self.template_paths = []
for root, dirnames, filenames in os.walk('/path_to_forms_directory'):
for filename in filenames:
if filename.endswith(".tex"): # extension of form files
self.template_paths.append(os.path.join(root, filename))
self.templates.append(os.path.basename(root) + ": " + os.path.splitext(filename)[0])
def template_selected(self, selected_index):
if selected_index != -1:
self.template_path = self.template_paths[selected_index]
print "\n" * 25
print "----------------------------------------------------------------------------------------\n"
print ("Inserting File: " + self.template_path + "\n")
print "----------------------------------------------------------------------------------------\n"
template = open(self.template_path).read()
print template
view = self.window.run_command("insert_snippet", {'contents': template})
sublime.status_message("Inserted File: %s" % self.template_path)

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Python - Why isn't this specific text being found by findall regex? - html

Related

NLTK: Is there a term for this procedure?

Is it possible to give text format hints in google vision api?

Extracting text from plain HTML and write to new file

Why does readHTMLTable cannot successfully read premier league tables for May month?

Insert file (foo.txt) into open file (bar.txt) at caret position

Categories

Resources