How to parse HTML tbody data into Python in tabular format - html

I am new to python and I am trying to parse this data into tabular format in Python. I have considered examples but unable to get desired result.
Can someone please help me on this
<tbody>
<tr><td>Kupon in %</td><td>36,520</td></tr>
<tr><td>Erstes Kupondatum</td><td>03.07.2017</td></tr>
<tr><td>Letztes Kupondatum</td><td>03.04.2022</td></tr>
<tr><td>Zahlweise Kupon</td><td>Zinszahlung normal</td></tr>
<tr><td>Spezialkupon Typ</td><td>Zinssatz variabel</td></tr>
Need this data in this way :
Kupon in % 36,520
Erstes Kupondatum 03.07.2017
Letztes Kupondatum 03.04.2022

You can do that in two ways 1. Using list comprehension and 2. using for loop
both produce the same result its on you to choose.
from bs4 import BeautifulSoup
html = """<tbody>
<tr><td>Kupon in %</td><td>36,520</td></tr>
<tr><td>Erstes Kupondatum</td><td>03.07.2017</td></tr>
<tr><td>Letztes Kupondatum</td><td>03.04.2022</td></tr>
<tr><td>Zahlweise Kupon</td><td>Zinszahlung normal</td></tr>
<tr><td>Spezialkupon Typ</td><td>Zinssatz variabel</td></tr>"""
#1
soup = BeautifulSoup(html,'lxml')
print(' '.join([td.text for td in soup.find_all('td')]))
# 2
tags = []
tr = soup.find_all('td')
for td in tr:
tags.append(td.text)
print(' '.join(tags))
Output: Kupon in % 36,520 Erstes Kupondatum 03.07.2017 Letztes Kupondatum 03.04.2022 Zahlweise Kupon Zinszahlung normal Spezialkupon Typ Zinssatz variabel

Related

Most efficient way to read concatenated json in PySpark?

I'm getting one json file where each line in the json is a json itself of 1000 objects, like this:
{"id":"test1", "results": [{"property1": "sample1"},{"property2": "sample2"}]}
{"id":"test2", "results": [{"property1": "sample3"},{"property2": "sample4"}]}
If I read it as a json using spark.read.json(filepath), I'm getting:
+-----+--------------------+
| id| results|
+-----+--------------------+
|test1|[{sample1, null},...|
+-----+--------------------+
(Which is only the first json in the concatenated json)
While I'm trying to get:
+-----+---------+---------+
|id |property1|property2|
+-----+---------+---------+
|test1|sample1 |sample1 |
|test2|sample3 |sample4 |
+-----+---------+---------+
I end up by reading the json as text, and iterate over each row to treat it as json and union each dataframe:
df = (spark.read.text(data[self.files]))
dataCollect = df.collect()
i = 0
for row in dataCollect:
df_row = flatten_json(spark.read.json(spark.sparkContext.parallelize(row)))
if i == 0:
df_all = df_row
else:
df_all = df_row.unionByName(df_all, allowMissingColumns = True)
i = i + 1
flatten_json is a helper that helps me to automatically flatten the json.
I guess there is a better approach, any help would be much appreciate
Your JSON file is called JSON Lines or JSONL which is a supported file format that Pyspark can handle natively. So, use the regular spark.read.json to read it and perform the additional transformations to match with what you want.
df = spark.read.json('yourfile.json or json/directory')
# Explode the array into structs. This will generate lots of nulls.
df = (df.select('id', F.explode('results').alias('results'))
.select('id', 'results.*'))
# Group them and aggregate to remove the nulls.
df = (df.groupby('id')
.agg(*[F.first(x, ignorenulls=True).alias(x) for x in df.columns if x != 'id']))
I think this works fine for 1000 lines JSONL, however, if you are curious about alternative solution that doesn't involve generating/removing nulls, please check here: By using PySpark how to parse nested json. In some situations, the alternative solution which doesn't do explode could be more performant.

How to extract value from html via BeautifulSoup

I have parsed my string via BeautifulSoup.
from bs4 import BeautifulSoup
import requests
import re
def otoMoto(link):
URL = link
page = requests.get(URL).content
bs = BeautifulSoup(page, 'html.parser')
for offer in bs.find_all('div', class_= "offer-item__content ds-details-container"):
# print(offer)
# print("znacznik")
linkOtoMoto = offer.find('a', class_="offer-title__link").get('href')
# title = offer.find("a")
titleOtoMoto = offer.find('a', class_="offer-title__link").get('title')
rokProdukcji = offer.find('li', class_="ds-param").get_text().strip()
rokPrzebPojemPali = offer.find_all('li',class_="ds-param")
print(linkOtoMoto+" "+titleOtoMoto+" "+rokProdukcji)
print(rokPrzebPojemPali)
break
URL = "https://www.otomoto.pl/osobowe/bmw/seria-3/od-2016/?search%5Bfilter_float_price%3Afrom%5D=50000&search%5Bfilter_float_price%3Ato%5D=65000&search%5Bfilter_float_year%3Ato%5D=2016&search%5Bfilter_float_mileage%3Ato%5D=100000&search%5Bfilter_enum_financial_option%5D=1&search%5Border%5D=filter_float_price%3Adesc&search%5Bbrand_program_id%5D%5B0%5D=&search%5Bcountry%5D="
otoMoto(URL)
Result:
https://www.otomoto.pl/oferta/bmw-seria-3-x-drive-nowe-opony-ID6Dr4JE.html#d51bf88c70 BMW Seria 3 2016
[<li class="ds-param" data-code="year">
<span>2016 </span>
</li>, <li class="ds-param" data-code="mileage">
<span>50 000 km</span>
</li>, <li class="ds-param" data-code="engine_capacity">
<span>1 998 cm3</span>
</li>, <li class="ds-param" data-code="fuel_type">
<span>Benzyna</span>
</li>]
So I can extract single strings, but if I see this same class
class="ds-param"
I can't assigne, for example, production date to variable. Please let me know if you have any ideas :).
Have a nice day !
from the docs:
Some attributes, like the data-* attributes in HTML 5, have names that can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a dictionary and passing the dictionary into find_all() as the attrs argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
so you could do something like
data_soup.find_all(attrs={"data-code": "year" })[0]. get_text()

How to parse a string to key value pair using regex?

What is the best way to parse the string into key value pair using regex?
Sample input:
application="fre" category="MessagingEvent" messagingEventType="MessageReceived"
Expected output:
application "fre"
Category "MessagingEvent"
messagingEventType "MessageReceived"
We already tried the following regex and its working.
application=(?<application>(...)*) *category=(?<Category>\S*) *messagingEventType=(?<messagingEventType>\S*)
But we want a generic regex which will parse the sample input to the expected output as key value pair?
Any idea or solution will be helpful.
input = 'application="fre" category="MessagingEvent" messagingEventType="MessageReceived"'
puts input.
scan(/(\w+)="([^"]+)"/). # scan for KV-pairs
map{ |k, v| %Q|#{k.ljust(30,' ')}"#{v}"| }. # adjust as you requested
join($/) # join with platform-dependent line delimiters
#⇒ application "fre"
# category "MessagingEvent"
# messagingEventType "MessageReceived"
Instead of using regex, it can be done by spliting and storing the string in hash like below:
input = 'application="fre" category="MessagingEvent" messagingEventType="MessageReceived"'
res = {}
input.split.each { |str| a,b = str.split('='); res[a] = b}
puts res
==> {"application"=>"\"fre\"", "category"=>"\"MessagingEvent\"", "messagingEventType"=>"\"MessageReceived\""}

Removing data from a json file on bases of their value

I had produced a script to parse some blast files from different samples. As I wanted to know the genes that all the samples had it commum I created a list, and a dictionary to count them. I have also produced a json file from the dictionary. Now I want to removed those genes whose counts are less than 100, as this is the number of samples, either from the dictionary or from the json file but I don't know how to.
This is part of the code:
###to produce a dictionary with the genes, and their repetitions
for extracted_gene in matches:
if extracted_gene in matches_counts:
matches_counts[extracted_gene]+=1
else:
matches_counts[extracted_gene]=1
print matches_counts #check point
#if matches_counts[extracted_gene]==100:
#print extracted_gene
#to convert a dictionary into a txt file and format it with json
with open('my_gene_extraction_trial.txt', 'w') as file:
json.dump(matches_counts,file, sort_keys=True, indent=2, separators=(',',':'))
print 'Parsing has finished'
I had tried different ways to do so:
a) ignoring the else statement but then it will give me an empty dict
b)trying to print only the ones whose values is 100, but it does not get printed
c) I read the documentation about json but I only can see how to delete elements by objects but not by values.
Can I anyone help me with this issue, please? This is getting me mad!
This is what it should look like:
# matches (list) and matches_counts (dict) already defined
for extracted_gene in matches:
if extracted_gene in matches_counts:
matches_counts[extracted_gene] += 1
else: matches_counts[extracted_gene] = 1
print matches_counts #check point
# Create a copy of the dict of matches to remove items from
counts_100 = matches_counts.copy()
for extracted_gene in matches_counts:
if matches_counts[extracted_gene] < 100:
del counts_100[extracted_gene]
print counts_100
Let me know if you still get errors.

R: Printing random forest model to html

I'm working on a Rmd document that I would like to compile to html using knitr package via the HTML export mechanism available in RStudio. The problem can be reproduced with the code below:
Example
# Set up
rm(list = ls())
data(airquality)
attach(airquality)
packs <- c("randomForest", "knitr", "xtable", "xtable", "stargazer")
lapply(packs, require, character.only=T, quietly = TRUE, warn.conflicts = FALSE)
# Model
airquality <- na.roughfix(airquality)
dummy <- randomForest(Ozone ~., data = airquality)
# Problem
kable(dummy)
xtable(dummy)
stargazer(dummy)
The issue is further illustrated by the output below:
Output
> # Problem
> kable(dummy)
Error in as.data.frame.default(x) :
cannot coerce class "c("randomForest.formula", "randomForest")" to a data.frame
> xtable(dummy)
Error in UseMethod("xtable") :
no applicable method for 'xtable' applied to an object of class "c('randomForest.formula', 'randomForest')"
> stargazer(dummy)
% Error: Unrecognized object type.
Is it possible to force the randomForest output into a nice html table that would be presentable in a markdown document?