substring in python 3 given line number and offset - html

I'm trying to parse an HTML page with the HTMLParser library in Python 3.
The method HTMLParser.getpos() returns the line number and the offset of the last tag parsed.
For example, I know that the string I want starts at line 10, offset 5, and ends at line 30, offset 10. How can I get the substring between those two positions?
Thanks.
html = 'this holds the entire html code'
MyParser.feed(html) #now the parser works its magic
start = (10,5) #this is returned from HTMLParser.getpos(), 10 is the line number and 5 is the offset of that line
end = (30,10) #same here
#I want to do something like this (I know this is invalid python code)
substring = html.substring(start,end) #return the html code as a string from line 10 offset 5 to line 30 offset 10
Better explanation:
I'm trying to get a substring from a string.
I understand that in Python 3 this is called a slice: string[a:b]
So if I wanted the substring 'jonny' from the string 'Hello jonny smith',
I would do this: substring = 'Hello jonny smith'[6:11]
The problem is that HTMLParser.getpos() returns a tuple (line number, offset within that line), so I can't do: substring = multi_line_string[line number:offset]

Assuming you are interested in HTML parsing try lxml --> http://docs.python-guide.org/en/latest/scenarios/scrape/

Indeed HTMLParser records the line and position in the line.
Here is the way to record the position in the stream:
    def updatepos(self, i, j):
        self.rawpos = i
        return super().updatepos(i, j)
Now, assuming the string you are looking for stands inside mytag:
    def handle_starttag(self, tag, attrs):
        if tag == 'mytag':
            self.mytag_pos = self.rawdata.index('>', self.rawpos) + 1
    def handle_endtag(self, tag):
        if tag == 'mytag':
            self.mytag_data = self.rawdata[self.mytag_pos:self.rawpos]
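If you already have the (line, offset) tuples from getpos(), another option is to convert each one into an absolute index and slice normally. The following is only a sketch: the helper names pos_to_index and substring_between are made up for illustration, and it assumes that line numbers start at 1, offsets start at 0, and lines are separated by "\n", which is how HTMLParser counts positions.

```python
def pos_to_index(text, pos):
    """Convert a (line, offset) tuple into an absolute index into text.

    Assumes 1-based line numbers, 0-based offsets and "\n" line breaks.
    """
    lineno, offset = pos
    index = 0
    # Skip past one newline for every full line before the target line.
    for _ in range(lineno - 1):
        index = text.index("\n", index) + 1
    return index + offset


def substring_between(text, start, end):
    """Return the slice of text between two (line, offset) positions."""
    return text[pos_to_index(text, start):pos_to_index(text, end)]


html = "line one\nhello jonny smith\nlast line"
print(substring_between(html, (2, 6), (2, 11)))  # jonny
```

The same call works across lines, for example substring_between(html, (1, 5), (3, 4)).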


Use of function / return

I was given the following task:
Take a list of integers and return the sum of these numbers, counting only the odd ones.
Example input: [1,5,3,2]
Output: 9
I did the code below and it worked perfectly.
numbers = [1,5,3,2]
print(numbers)
add_up_the_odds = []
for number in numbers:
    if number % 2 == 1:
        add_up_the_odds.append(number)
print(add_up_the_odds)
print(sum(add_up_the_odds))
Then I tried to re-code it using function definition / return:
def add_up_the_odds(numbers):
    odds = []
    for number in range(1,len(numbers)):
        if number % 2 == 1:
            odds.append(number)
    return odds
numbers = [1,5,3,2]
print (sum(odds))
But I couldn't get it to work. Can anybody help with that?
Note: I'm going to assume Python 3.x
It looks like you're defining your function, but never calling it.
When the interpreter finishes going through your function definition, the function is now there for you to use - but it never actually executes until you tell it to.
Between the last two lines in your code, you need to call add_up_the_odds() on your numbers array, and assign the result to the odds variable.
i.e. odds = add_up_the_odds(numbers)
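Putting that together, a corrected version might look like the sketch below. Besides calling the function, it also loops over the values in numbers directly. The question's range(1, len(numbers)) loops over the indices 1, 2, 3 rather than the list elements, so that line needs to change as well.

```python
def add_up_the_odds(numbers):
    """Return a list containing only the odd values from numbers."""
    odds = []
    for number in numbers:  # iterate over the values, not the indices
        if number % 2 == 1:
            odds.append(number)
    return odds


numbers = [1, 5, 3, 2]
odds = add_up_the_odds(numbers)  # the function only runs when it is called
print(sum(odds))  # 9
```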

scraping - find the last 5 score of each match - in html

I would like your help getting the last 5 scores of each match. I can't work out how to do it, so please help.
from selenium import webdriver
import pandas as pd
from pandas import ExcelWriter
from openpyxl.workbook import Workbook
import time as t
import xlsxwriter
pd.set_option('display.max_rows', 5, 'display.max_columns', None, 'display.width', None)
browser = webdriver.Firefox()
browser.get('https://www.mismarcadores.com/futbol/espana/laliga/resultados/')
print("Current Page Title is : %s" %browser.title)
aux_ids = browser.find_elements_by_css_selector('.event__match.event__match--static.event__match--oneLine')
ids=[]
i = 0
for aux in aux_ids:
    if i < 1:
        ids.append( aux.get_attribute('id') )
        i+=1
data=[]
for idt in ids:
    id_clean = idt.split('_')[-1]
    browser.execute_script("window.open('');")
    browser.switch_to.window(browser.window_handles[1])
    browser.get(f'https://www.mismarcadores.com/partido/{id_clean}/#h2h;overall')
    t.sleep(5)
    p_ids = browser.find_elements_by_css_selector('h2h-wrapper')
    #here the code of the last 5 score of each match
I believe you can use your Firefox browser, but I have not tested with it. I use Chrome, so if you want to use chromedriver, check the version of your browser, download the matching driver, and add it to your system path. The one drawback of this approach is that it opens a browser window and keeps it open until the page is loaded (because we have to wait for the JavaScript to generate the match data). If you need anything else, let me know. Good luck!
https://chromedriver.chromium.org/downloads
Known issues: sometimes it throws an index-out-of-range error when retrieving the match data. I am looking into this; it seems that the XPath on each link sometimes changes slightly.
from selenium import webdriver
from lxml import html
from lxml.html import HtmlElement
def test():
    # Here we specify the urls for testing purposes
    urls = ['https://www.mismarcadores.com/partido/noIPZ3Lj/#h2h;overall'
            ]
    # a loop to go over all the urls
    for url in urls:
        # We will print the string and format it with the url we are currently checking. Also we will print the
        # result of the function get_last_5(url) where url is the current url in the for loop.
        print("Scores after this match {u}".format(u=url), get_last_5(url))
def get_last_5(url):
    print("processing {u}, please wait...".format(u=url))
    # here we get an instance of the webdriver
    browser = webdriver.Chrome()
    # now we pass the url we want to get
    browser.get(url)
    # in this variable, we will "store" the html data as a string. We get it from here because we need to wait for
    # the page to load and execute its javascript code in order to generate the matches data.
    innerHTML = browser.execute_script("return document.body.innerHTML")
    # Now we will assign this to a variable of type HtmlElement
    tree: HtmlElement = html.fromstring(innerHTML)
    # the following variables: first_team, second_team, match_date and rows are obtained via the xpath() method. To get
    # the xpath, open the chrome browser and load one of the urls to check the DOM. Now if you wish to check the xpath
    # of each of these variables (elements in the case of html), right click on the element->click inspect->the inspect
    # panel will appear->the clicked element will appear selected on the inspect panel->right click on it->Copy->Copy
    # Xpath. first_team, second_team and match_date are obtained from the "title" section. Rows are obtained from the
    # table of last matches in the tbody content
    # When using xpath it will return a list of HtmlElement because it will try to find all the elements that match our
    # xpath, so that is why we use [0] (to get the first element of the list). This will give us access to a
    # HtmlElement object so now we can access its text attribute.
    first_team = tree.xpath('//*[@id="flashscore"]/div[1]/div[1]/div[2]/div/div/a')[0].text
    print((type(first_team)))
    second_team = tree.xpath('//*[@id="flashscore"]/div[1]/div[3]/div[2]/div/div/a')[0].text
    # [0:8] is used to slice the string because in the title it also contains the time of the match, i.e. (10.08.2020
    # 13:00). To use it for comparing each row we need only (10.08.20), so we get from position 0, 8 characters ([0:8])
    match_date = tree.xpath('//*[@id="utime"]')[0].text[0:8]
    # when getting the first element with [0], we get an HtmlElement object (which is the "table" that has all matches
    # data). so we want to get all the children of it, which are all the "rows (elements)" inside it. getchildren()
    # will also return a list of objects of type HtmlElement. In this case we are also slicing the list with [:-1]
    # because the last element inside the "table" is the button "Mostrar mas partidos", so we want to take that out.
    rows = tree.xpath('//*[@id="tab-h2h-overall"]/div[1]/table/tbody')[0].getchildren()[:-1]
    # we quit the browser since we do not need it anymore. we could do it after assigning innerHTML, but no harm
    # doing it here unless you wish to close it before doing all this assignment of variables.
    browser.quit()
    # this match_position variable will be the position of the match we currently have in the title.
    match_position = None
    # Now we will iterate over the rows and find the match. range(len(rows)) is just to get the count of rows to know
    # when to stop iterating.
    for i in range(len(rows)):
        # now we use the is_match function with the following parameters: first_team, second_team, match_date and the
        # current row which is rows[i]. if the function returns true we found the match position and we assign (i+1)
        # to the match_position variable. i+1 because we iterate from 0.
        if is_match(first_team, second_team, match_date, rows[i]):
            match_position = i + 1
            # now we stop the for loop, no need to go further when we find it.
            break
    # Since we only want the following 5 matches' scores, we need to check if we have 5 rows beneath our match. If
    # adding 5 to the match position is less than the number of rows then we can do it, if not we will only get the
    # rows beneath it (maybe 0, 1, 2, 3 or 4 rows)
    if (match_position + 5) < len(rows):
        # Again we are slicing the list, in this case 2 times: [match_position:] (take out all the rows before the
        # match position), then from the new list obtained from that we do [:5] which is start from the 0 position
        # and stop at 5 [start:stop]. we use rows=rows because slicing gives you a new list, so the result of
        # rows[match_position:][:5] has to be assigned to a variable. I am using the same variable but you can assign
        # it to a new one if you wish.
        rows = rows[match_position:][:5]
    else:
        # since we do not have enough rows, just get the rows beneath our position.
        rows = rows[match_position:len(rows)]
    # Now to get the list of scores we could use a list comprehension, but I will write it out as a for loop.
    # Before that, you need to know that each row (<tr> element in html) has 6 td elements inside it, and number 5 is
    # the score of the match. then inside each "score element" we have a span element and then a strong element,
    # something like
    # <tr>
    # <td></td>
    # <td></td>
    # <td></td>
    # <td></td>
    # <td><span><strong>1:2</strong></span></td>.
    # <td></td>
    # </tr>
    # Now, that being said, since each row is an HtmlElement object, we can go in a for loop as follows:
    scores = []
    for row in rows:
        data = row.getchildren()[4].getchildren()[0].text_content()
        # not the best way but we will get all the text content of the element, in this case the span element.
        # if the string has more than 5 characters, i.e. it is not just "1 : 2", then we treat it as something like
        # "1 : 2(0 : 1)". So in this case we want to slice it from the 2nd character from right to left and get 5
        # characters from that position.
        # using a ternary expression here, if the length of the string is equal to 5 then this is our score,
        # if not then we have to slice it and get the last part, from -6 up to (but not including) -1, which drops
        # the final ')'.
        score = data if len(data) == 5 else data[-6:-1]
        scores.append(score)
    print("finished processing {u}.".format(u=url))
    # now we return the scores
    return scores
def is_match(t1, t2, match_date, row):
    # from each row we want to compare t1, t2, match_date (these are obtained from the title) with the row's team1,
    # team2 and date. Each row has 6 elements inside it. Please read all the code in get_last_5 before reading this
    # explanation. so for this row, date is in position 0, team1 in 2, team2 in 3.
    # <td><span>10.03.20</span></td>
    date = row.getchildren()[0].getchildren()[0].text
    # <td><span>TeamName</span></td> (when the team lost) or
    # <td><span><strong>TeamName</strong></span></td> (when the team won)
    team1element = row.getchildren()[2].getchildren()[0]  # this is the span element
    # using a ternary expression (value_if_true if condition else value_if_false)
    # https://book.pythontips.com/en/latest/ternary_operators.html
    # if the span element has children (len(getchildren()) > 0) then the team name is
    # team1element.getchildren()[0].text, which is the text of the strong element; if not, just get the text from
    # the span element.
    mt1 = team1element.getchildren()[0].text if len(team1element.getchildren()) > 0 else team1element.text
    # repeat the same as team 1
    team2element = row.getchildren()[3].getchildren()[0]
    mt2 = team2element.getchildren()[0].text if len(team2element.getchildren()) > 0 else team2element.text
    # basically we could compare only the date, but just to be sure we compare the names also. So, if the dates and
    # the names are the same this is our match row.
    if match_date == date and t1 == mt1 and t2 == mt2:
        # we found it so return true
        return True
    # if not the same then return false
    return False
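One thing to watch in get_last_5: if no row matches, match_position stays None and the expression match_position + 5 raises a TypeError. A possible guard is sketched below. The helper name following_rows is hypothetical and not part of the answer's code. The sketch also relies on the fact that a Python slice simply stops at the end of the list, so no separate branch is needed when fewer than 5 rows remain.

```python
def following_rows(rows, match_position, count=5):
    """Return up to count rows that come after the matched row.

    match_position is the index just past the matched row, or None
    when no row matched (hypothetical helper for illustration).
    """
    if match_position is None:
        return []  # nothing matched, so there is nothing to return
    # Slicing past the end of a list is safe and just returns what exists.
    return rows[match_position:match_position + count]


rows = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7"]
print(following_rows(rows, 2))     # ['r2', 'r3', 'r4', 'r5', 'r6']
print(following_rows(rows, 6))     # ['r6', 'r7']
print(following_rows(rows, None))  # []
```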

Calling out the Sum of a Data I Made

I am working with a text file and need to get the sum of the last column of data (index 4) that I have computed. I have done everything I need for the last column and have used total += square to add the value in row one to the value in row two, and so on until I reach the 100th row of my text file. Now I need to take the sum I have after row 100 and store it in a variable. How can I do that?
fullPath = open("localzscoretest.txt", "r") #Where the current table is located
import math
def globalchiSquare(fullPath):
    for line in fullPath:
        line = line.strip() #Strip it
        lines = line.split(',') #split it
        rows = lines[1:] #keeping the numbers
        rows = map(float, rows) #getting my numbers in the .txt ready for the equation
        square = (rows[4]**2) #squared the z score column
        total += square
        print total
globalchiSquare(fullPath)
change
square = (rows[4]**2) #squared the z score column
to be
square += (rows[4]**2) #squared the z score column
Call readlines() on the file object inside globalchiSquare in order to iterate over its lines.
In the function do
def globalchiSquare(fullPath):
    for line in fullPath.readlines():
        . . .
You should also keep your variable names clear. A name like lines suggests that there are several of them, and the same goes for rows.
Simplify it and return the sum.
def globalchiSquare(fullPath):
    total = 0
    for line in fullPath.readlines(): # readlines() method
        line = line.strip() # cut off ends
        line = line.split(',') # create list
        row = line[1:] # create row from line
        row = map(float,row) # convert to floats
        square = row[4]**2 # find square
        print 'square',square
        total += square
        print 'total',total
    return total
my_var = globalchiSquare(fullPath)
print my_var # should give total
EDIT: The return statement allows you to store the value of total.
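The code above is Python 2 (print statements, and map returning a list). For reference, here is a rough Python 3 sketch of the same idea. The name global_chi_square and the sample data are made up for illustration, and io.StringIO stands in for the real file so the example runs on its own.

```python
import io


def global_chi_square(lines):
    """Sum the squares of the value at index 4 of each row's numbers."""
    total = 0.0  # must be initialised before using +=
    for line in lines:
        fields = line.strip().split(",")
        # Drop the first field (a label) and convert the rest to floats.
        values = [float(x) for x in fields[1:]]
        total += values[4] ** 2
    return total


# Made-up sample data standing in for localzscoretest.txt.
sample = io.StringIO("a,1,2,3,4,2.0\nb,1,2,3,4,3.0\n")
my_var = global_chi_square(sample)
print(my_var)  # 13.0
```

With a real file, passing the object returned by open() works the same way, since a file object can be iterated line by line.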

Whats wrong with this Code while appending a list with a function?

def listc(favn):
    num = 0
    while num < favn :
        num += 1
        return num
list = []
i = int(raw_input("Input your favourite number : > "))
for num in range(0,i):
    list.append(listc(i))
print list
The elements of the list are all the same. With small changes to the code, it sometimes prints [None] in the list as well.
I want to generate a list containing 1 to i.
There are two issues with your code.
First, the while loop does not run favn times, because the return statement is inside the while loop. It runs only once, and every time it returns 1.
Also, you should change
for num in range(0,i):
    list.append(listc(i))
to
for num in range(0,i):
    list.append(listc(num))
You will get the output you wanted.
If you want to generate a list from 1 to i, you can simply do list = range(1, i + 1).
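As a sketch of both fixes together, written for Python 3 (so input() would replace raw_input(), and range() needs list() around it): the return is moved out of the while loop, and the loop passes each number to listc. The fixed value of i below is a stand-in for the user's input, and the loop runs over range(1, i + 1) so that the result starts at 1 as the question asks.

```python
def listc(favn):
    """Count up from 0 to favn and return the final value."""
    num = 0
    while num < favn:
        num += 1
    return num  # outside the loop, so the loop can run to completion


i = 5  # stand-in for int(input("Input your favourite number : > "))
numbers = [listc(n) for n in range(1, i + 1)]
print(numbers)                 # [1, 2, 3, 4, 5]
print(list(range(1, i + 1)))   # [1, 2, 3, 4, 5]
```

The name numbers is used instead of list so that the built-in list type is not shadowed.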

Use RegExp and split to read file text flash

I have a text file with more than 1,000,000 lines. Some lines begin with the character C and others with M.
Example:
C9203007870000000000000006339912610971240095400111200469300000 16122011AMI 00000100010000315 080
C9203007870000000000000006339912610971240095400111200469300000 09122011B 590001000100000270016092100
M920300787000000000000000633991261097124009540011120046930000031122011JVJF004 10 N
M920300787000000000000000633991261097124009540011120046930000009122011DEQP003 10 N
M920300787000000000000000633991261097124009540011120046930000012122011ACQK001 10Z N
C9203007870000000000000006339912610971240095400111200469300000 24122011AMI 00000100010000315 080
C9203007870000000000000006339912610971240095400111200469300000 24122011AMI 00000100010000315 080
I want to put in my array only the lines that begin with the character M.
How can I use this pattern in my split: var pattern:RegExp = /^M/;
var mFileReference:FileReference;
var mArray:Array = new Array();
function onFileLoaded(event:Event):void
{
    mFileReference = event.target as FileReference;
    data = mFileReference["data"];
    mArray = (data.toString()).split("\n");
}
I don't want to go through a 'for' loop because it takes a lot of time and resources.
I want to add /^M/ to my split. Is that possible?
for each (var s:String in mArray)
{
    if (pattern.test(s)) {
        values.push(s);
    }
}
Thanks everybody.
Try this regular expression:
/^M.*/gm
This should match all lines that begin with M and nothing else.
It uses the g flag to match all cases of the expression in the string, and it uses m for multiline mode, so ^ and $ will match the beginning/end of lines instead of the beginning/end of the string.
You can get your array like this:
mArray = data.toString().match(/^M.*/gm);
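To see what the multiline flag does, here is the same pattern tried in Python rather than ActionScript, purely as an illustration. The sample data is made up, and re.MULTILINE plays the role of the m flag, while re.findall returns every match much like the g flag does.

```python
import re

# Made-up sample standing in for the file contents.
data = "C920 first\nM920 second\nM920 third\nC920 fourth"

# With re.MULTILINE, ^ matches at the start of every line, not just
# at the start of the whole string. "." does not match "\n", so each
# match stops at the end of its line.
m_lines = re.findall(r"^M.*", data, re.MULTILINE)
print(m_lines)  # ['M920 second', 'M920 third']
```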