Trying to scrape text from pages where data are loaded from external URL

I am using this code to collect the links to all past minutes issued by the central bank of Brazil
import requests
import textwrap
from bs4 import BeautifulSoup
url = "https://www.bcb.gov.br/api/servico/sitebcb/atascopom-conteudo/ultimas?quantidade=1000&filtro="
data = requests.get(url).json()
links = []
for i in range(178):
    temp_link = "https://www.bcb.gov.br/"+data['conteudo'][i]['LinkPagina']
    links.append(temp_link)
print(links)
The code does generate all the links as needed. Unfortunately, when I loop over the links and try to copy the main text in the body of the respective pages, I get empty results. Based on a previous related question, I believe the issue is that the data in the respective pages are loaded from external URLs. Unfortunately I do not know how to overcome this problem in the context of my loop.
Any help is appreciated.

import requests
from bs4 import BeautifulSoup
def main(url):
    with requests.Session() as req:
        params = {
            "quantidade": 1000,
            "filtro": ""
        }
        r = req.get(url, params=params)
        # Keep only the last path segment of each link; it identifies the page.
        items = [x['LinkPagina'].rsplit('/', 1)[-1]
                 for x in r.json()['conteudo']]
        for x in items:
            npr = {
                "filtro": "IdentificadorUrl eq '{}'".format(x)
            }
            # The page body is served by this second endpoint as an HTML fragment.
            r = req.get(
                'https://www.bcb.gov.br/api/servico/sitebcb/atascopom/principal', params=npr)
            soup = BeautifulSoup(
                r.json()['conteudo'][0]['OutrasInformacoes'], 'lxml')
            print(soup.select_one('.lista1').text)
            exit()  # <-- Remove it.
main('https://www.bcb.gov.br/api/servico/sitebcb/atascopom-conteudo/ultimas')
Output:
1.
A
inflação medida pela variação do Índice Nacional de Preços ao Consumidor Amplo
(IPCA) atingiu 0,78% em maio, 0,17 ponto percentual (p.p.) acima da registrada
no mês anterior. Dessa forma, a inflação acumulada em doze meses registrou 9,32%
em maio (8,47% em maio de 2015), com os preços livres aumentando 8,82% (6,82%
em maio de 2015), e os administrados, 10,90% (14,09% em maio de 2015).
Especificamente sobre preços livres, os de itens comercializáveis aumentaram 9,55%
em doze meses até maio (5,71% em maio de 2015), e os de não comercializáveis,
8,19% (7,79% em maio de 2015). Note-se, ainda, que os preços no segmento de
alimentação e bebidas variaram 12,72% em doze meses até maio (8,80% em maio de
2015), e os dos serviços, 7,51% (8,23% em maio de 2015). Em síntese, as
informações disponíveis refletem, em parte, a dinâmica de maior persistência
dos preços no segmento de serviços – mas que já mostram alguma desaceleração –,os processos de realinhamento de preços relativos e choques temporários de
oferta no segmento de alimentação e bebidas.
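
The answer's code does two things for each item: it takes the last path segment of LinkPagina, and it turns the HTML fragment returned in OutrasInformacoes into text. Below is a minimal offline sketch of those two steps, with no network access. The helper names (slug_from_link, build_filter, html_to_text) and the sample link are hypothetical, and the standard-library html.parser is swapped in for BeautifulSoup so the sketch runs without extra packages; it extracts all text rather than selecting '.lista1'.

```python
from html.parser import HTMLParser


def slug_from_link(link_pagina: str) -> str:
    """Return the last path segment of a LinkPagina value."""
    return link_pagina.rstrip("/").rsplit("/", 1)[-1]


def build_filter(slug: str) -> dict:
    """Build the query parameters used for the per-page request."""
    return {"filtro": "IdentificadorUrl eq '{}'".format(slug)}


class _TextExtractor(HTMLParser):
    """Collect every text node seen while parsing an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(fragment: str) -> str:
    """Strip tags from an HTML fragment and normalise whitespace."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.chunks).split())


# Hypothetical sample values, shaped like the API fields used above.
sample_link = "/publicacoes/atascopom/16062021"
slug = slug_from_link(sample_link)
print(build_filter(slug))
print(html_to_text("<div class='lista1'><p>Hello <b>world</b></p></div>"))
```

Keeping these steps as small pure functions makes it possible to check them without hitting the site; the two HTTP calls from the answer can then be wrapped around them.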

Related

How to write single CSV file using pyspark in Databricks

Good morning all!
Yesterday I was looking for a function that could write a single CSV file with PySpark in Azure Databricks, but I did not find one. So I built my own function and wanted to share it with the community and, if possible, start a thread collecting different solutions to the same problem.
In short, the function does the following:
Saves the dataframe into a new temporary directory (created inside the path you pass in, with a name beginning with 'temp_'), writing a single partition there using "coalesce(1)"
Renames the resulting file as you want and moves it to the desired path
Deletes the temporary directory
That's all! You have your single CSV file.
def escribe_fichero_unico(dataframe, path, file_name, file_format = 'csv'):
    """
    Description: (1) Creates a temporary folder to hold the partitions that Spark generates by default
    when saving files, (2) writes everything as a single partition, (3) moves that
    file to the parent directory, and (4) deletes the temporary folder.
    Parameters:
    dataframe: the dataframe you want to save as a single file
    file_name: name of the output file, as a string
    file_format: 'csv' or 'parquet', as a string
    path: path where you want to save the file, as a string (it is concatenated as-is, so end it with '/')
    """
    import os
    # 1) Save the dataframe into a temporary folder that holds the partition files
    path_temp = path + 'temp_' + file_name + '_trash'
    if file_format == 'csv':
        dataframe.coalesce(1).write.format('csv').mode('overwrite') \
            .options(header="true", schema="true", delimiter=";") \
            .save(path_temp)
    elif file_format == 'parquet':
        dataframe.coalesce(1).write.format("parquet").mode("overwrite") \
            .save(path_temp)
    # 2) Locate the single part file that coalesce(1) produced
    file_part = [file.path for file in dbutils.fs.ls(path_temp) if os.path.basename(file.path).startswith('part')][0]
    # 3) Move that file to the parent directory under the requested name
    dbutils.fs.mv(file_part, path + file_name + '.' + file_format)
    # 4) Delete the temporary folder
    dbutils.fs.rm(path_temp, True)
I hope this works for you as well :)
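
For readers who want to try the "find the part file, move it, delete the folder" steps outside Databricks, here is a minimal sketch on the local filesystem. It assumes a directory that already looks like what Spark leaves behind after a coalesce(1) write; the function name promote_single_part and the file names are hypothetical, and os/shutil stand in for dbutils.fs.

```python
import os
import shutil
import tempfile


def promote_single_part(temp_dir: str, target_path: str) -> str:
    """Move the only 'part*' file in temp_dir to target_path, then delete temp_dir."""
    parts = [name for name in os.listdir(temp_dir) if name.startswith("part")]
    if len(parts) != 1:
        raise ValueError("expected exactly one part file, found {}".format(len(parts)))
    shutil.move(os.path.join(temp_dir, parts[0]), target_path)
    shutil.rmtree(temp_dir)  # removes leftover marker files as well
    return target_path


# Simulate the output folder of a single-partition write (hypothetical names).
base = tempfile.mkdtemp()
temp_dir = os.path.join(base, "temp_report_trash")
os.makedirs(temp_dir)
with open(os.path.join(temp_dir, "part-00000.csv"), "w") as fh:
    fh.write("a;b\n1;2\n")
open(os.path.join(temp_dir, "_SUCCESS"), "w").close()

result = promote_single_part(temp_dir, os.path.join(base, "report.csv"))
print(result)
```

Unlike the original function, this sketch raises an error when it does not find exactly one part file instead of silently taking the first match.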

How to find json data?

chitown88 helped me find the JSON on this website: https://www.iwc.com/fr/fr/watch-collections/pilot-watches/iw329303-big-pilots-watch-43.html
It seems that you need to replace html with .productinfo.FR.json
Source: How to scrape specific information on a website
I would like to produce the same output for this page: https://www.omegawatches.com/fr-fr/watch-omega-constellation-quartz-27-mm-12315276005001
But I cannot manage to scrape that information because the page is dynamic and I cannot find the JSON data, even after searching for hours.
Do you have any solution for scraping the same output as in the source question?

There isn't anything special about this page. Just grab the right information with beautifulsoup:
import requests
from bs4 import BeautifulSoup
url = "https://www.omegawatches.com/fr-fr/watch-omega-constellation-quartz-27-mm-12315276005001"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
title = soup.title.text.split("|")[0].strip()
description = soup.select_one(".description-content").get_text(strip=True)
price = soup.select_one(".price").text
info = "\n".join(
    tag.get_text(strip=True)
    for tag in soup.select(
        ".product-info-details-right .li, .product-info-details-right li"
    )
)
print(title)
print()
print(description)
print("-" * 80)
print(price)
print("-" * 80)
print(info)
Prints:
Constellation Quartz 27 mm - 123.15.27.60.05.001
L'esthétique particulièrement remarquable et intemporelle de la collection OMEGA Constellation se caractérise par l'originalité du cadran et la présence des fameuses griffes.Ce modèle brossé se distingue par un cadran en nacre blanche protégé par un verre saphir résistant aux rayures. La lunette sertie de diamants est montée sur un boîtier de 27 mm en acier inoxydable sur un bracelet également en acier inoxydable.Cette montre est animée par le calibre OMEGA 1376, un mouvement de précision à quartz.
--------------------------------------------------------------------------------
5 000,00 €
--------------------------------------------------------------------------------
Diamants
Entre‑corne :18 mm
Bracelet :acier
Boîtier :acier
Diamètre du boîtier :27 mm
Couleur du cadran :blanc
Verre :verre saphir bombé résistant aux rayures, traité antireflet à l’intérieur
Étanchéité :10 bars (100 mètres / 330 pieds)
Type de mouvement :quartz
Calibre :OMEGA 1376
Mouvement de précision à quartz, finition rhodiée.
Durée de vie de la pile :48 mois
Type :quartz
Swiss Made
GARANTIE 5 ANS
Paiement sécurisé
Livraison et retour offerts

How to fill variables with the content of a Json

I'm starting to work with PowerShell and my first task is to fix a problem with a script that already runs in the production environment.
This script is called via Webhook and it receives a parameter that the webhook passes to it.
I need to run this script inside PowerShell ISE to be able to debug it but I don't know how to fill the variables that are normally filled when it is called by Webhook.
Here is the beginning of the code where the variables are filled in, can someone give me a tip on how to fill the variable "WebHookData"..?
Thanks in advance.
I've tried to do this but it didn't work...
Sorry for putting images instead of the code, but for some reason I can't post the code.
This is the JSON that I use..
{"source":"la-draft-clipboard","value":[{"tokenKey":"8EAD3F03-E08F-4D58-8B1A-2AB8BD2F25DB","type":"literal","tokenExpression":"{"},{"tokenKey":"A7596123-17DF-49A9-AC18-1196A4CD457E","type":"new_line","tokenExpression":"\n"},{"tokenKey":"36DF511D-C1A9-4BC8-B2E9-37BCA058FB78","type":"literal","tokenExpression":" \"AutomationAccountName\": \"proj-00016-automation-account\","},{"tokenKey":"918137AE-EC61-4B77-A5F2-B527E2D4E3C9","type":"new_line","tokenExpression":"\n"},{"tokenKey":"DCC2D1C1-14F0-4869-A44C-08F8AB35B0B3","type":"literal","tokenExpression":" \"BeginPeakTime\": \"7:00\","},{"tokenKey":"61F7441B-0688-4AD2-A1A5-086C4F7F6D1E","type":"new_line","tokenExpression":"\n"},{"tokenKey":"2F3DD3CA-BD83-46EF-9529-C890C2E31CAF","type":"literal","tokenExpression":" \"ConnectionAssetName\": \"AzureRunAsConnection\","},{"tokenKey":"C6DD6FD0-E99A-48A8-96AA-3974D66FD9BD","type":"new_line","tokenExpression":"\n"},{"tokenKey":"A4E7A469-D08A-4C5A-8C6B-06E58996A0EC","type":"literal","tokenExpression":" \"EndPeakTime\": \"17:00\","},{"tokenKey":"E67547BC-98BB-4749-A84E-A36B761EE504","type":"new_line","tokenExpression":"\n"},{"tokenKey":"727D64BD-906C-4DA3-84C5-44F3054B2DEB","type":"literal","tokenExpression":" \"HostPoolName\": \"VDI-POOL-001\","},{"tokenKey":"92AFEBB8-4307-42C2-8BD0-C55ACC848940","type":"new_line","tokenExpression":"\n"},{"tokenKey":"F37993F9-1471-4E58-B43F-9BB08C4D4A03","type":"literal","tokenExpression":" \"LimitSecondsToForceLogOffUser\": 0,"},{"tokenKey":"8B2517D1-046E-43EF-BF75-B1EC5F31B83D","type":"new_line","tokenExpression":"\n"},{"tokenKey":"7464316E-6A8D-4F82-B269-95FF76A69014","type":"literal","tokenExpression":" \"LogOffMessageBody\": \"Salve seus trabalhos! Em aproximadamente 15 minutos, este terminal virtual será desligado automaticamente devido às políticas de otimização de custos da companhia. 
Caso seja necessário continuar suas atividades, um novo terminal poderá ser acessado após este período.\","},{"tokenKey":"7328955E-0025-4AA1-A0AE-CDAFA4238927","type":"new_line","tokenExpression":"\n"},{"tokenKey":"384AF3CF-CA86-4820-A5E1-230C09909662","type":"literal","tokenExpression":" \"LogOffMessageTitle\": \"ATENÇÃO!!!\","},{"tokenKey":"5E2EBD78-8599-487F-8DC5-CF9699595DDD","type":"new_line","tokenExpression":"\n"},{"tokenKey":"B7E409AF-A5AE-4622-A45E-5982FD15B03E","type":"literal","tokenExpression":" \"MaintenanceTagName\": \"NO_TAG\","},{"tokenKey":"3F9BF963-790D-45B1-9F04-D71A2B7C84DC","type":"new_line","tokenExpression":"\n"},{"tokenKey":"B6E94E37-69C0-4BF8-AE69-CD7B4EA9CB83","type":"literal","tokenExpression":" \"MinimumNumberOfRDSH\": 20,"},{"tokenKey":"00A1D37B-F82B-42F6-B792-75B39EBD6A83","type":"new_line","tokenExpression":"\n"},{"tokenKey":"F41B0C75-4541-4772-BF30-2D4F6DF045C6","type":"literal","tokenExpression":" \"ResourceGroupName\": \"proj-00016-wvd-rg\","},{"tokenKey":"FE6FC329-DC12-4782-83CE-F48BDC6B74B5","type":"new_line","tokenExpression":"\n"},{"tokenKey":"785500F8-3D71-4D91-AADA-D6ABF1EFD66B","type":"literal","tokenExpression":" \"ResourceGroupNameAutomation\": \"proj-00016-automation-rg\","},{"tokenKey":"BD3331BF-3BF9-4B9E-B9B8-C03E448B2D85","type":"new_line","tokenExpression":"\n"},{"tokenKey":"25586050-62A0-4CAF-81FD-C5770DF20B63","type":"literal","tokenExpression":" \"RunbookLogoffShutdown\": \"ARMLogoffAndShutdown\","},{"tokenKey":"C4B9E432-C41D-4374-9531-F2AEFDD51267","type":"new_line","tokenExpression":"\n"},{"tokenKey":"0155B6AB-7CAB-4C4E-BB1F-A643D9B0575B","type":"literal","tokenExpression":" \"SessionThresholdPerCPU\": 0.75,"},{"tokenKey":"3EAA1C7E-0119-40B9-9AF8-85D10E0FA3FD","type":"new_line","tokenExpression":"\n"},{"tokenKey":"2D904698-1386-47D7-9513-7CEE702BA0D3","type":"literal","tokenExpression":" \"TimeDifference\": 
\"-3:00\""},{"tokenKey":"40D497B6-AAED-4334-81C7-10B8C6745DE0","type":"new_line","tokenExpression":"\n"},{"tokenKey":"18EB90AF-25D4-4956-8A85-41BA555C6A95","type":"literal","tokenExpression":"}"}]}

Based on the JSON you've posted and the parts of the code we can see in the screenshot, give the following mock object a try:
$mockWebhookPayload = [pscustomobject]@{
    WebhookName   = 'NameOfWebhookGoesHere'
    RequestHeader = @{ 'Content-Type' = 'application/json' }
    RequestBody   = @'
{
  "AutomationAccountName": "proj-00016-automation-account",
  "BeginPeakTime": "7:00",
  "ConnectionAssetName": "AzureRunAsConnection",
  "EndPeakTime": "17:00",
  "HostPoolName": "VDI-POOL-001",
  "LimitSecondsToForceLogOffUser": 0,
  "LogOffMessageBody": "Salve seus trabalhos! Em aproximadamente 15 minutos, este terminal virtual será desligado automaticamente devido às políticas de otimização de custos da companhia. Caso seja necessário continuar suas atividades, um novo terminal poderá ser acessado após este período.",
  "LogOffMessageTitle": "ATENÇÃO!!!",
  "MaintenanceTagName": "NO_TAG",
  "MinimumNumberOfRDSH": 20,
  "ResourceGroupName": "proj-00016-wvd-rg",
  "ResourceGroupNameAutomation": "proj-00016-automation-rg",
  "RunbookLogoffShutdown": "ARMLogoffAndShutdown",
  "SessionThresholdPerCPU": 0.75,
  "TimeDifference": "-3:00"
}
'@
}
& .\path\to\webhook-script.ps1 -WebHookData $mockWebhookPayload
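
The JSON in the question is a list of clipboard tokens, and the request body used in the mock object looks like what you get by concatenating every tokenExpression in order. A minimal Python sketch of that reassembly follows; the function name rebuild_body is hypothetical, and the sample is a shortened stand-in in the same shape as the posted payload, with placeholder tokenKey values.

```python
import json


def rebuild_body(clipboard: dict) -> str:
    """Concatenate the tokenExpression of every token, in order."""
    return "".join(token["tokenExpression"] for token in clipboard["value"])


# Shortened, hypothetical sample shaped like the posted payload.
sample = {
    "source": "la-draft-clipboard",
    "value": [
        {"tokenKey": "k1", "type": "literal", "tokenExpression": "{"},
        {"tokenKey": "k2", "type": "new_line", "tokenExpression": "\n"},
        {"tokenKey": "k3", "type": "literal",
         "tokenExpression": "  \"HostPoolName\": \"VDI-POOL-001\","},
        {"tokenKey": "k4", "type": "new_line", "tokenExpression": "\n"},
        {"tokenKey": "k5", "type": "literal",
         "tokenExpression": "  \"MinimumNumberOfRDSH\": 20"},
        {"tokenKey": "k6", "type": "new_line", "tokenExpression": "\n"},
        {"tokenKey": "k7", "type": "literal", "tokenExpression": "}"},
    ],
}

body = rebuild_body(sample)      # the plain JSON text a webhook would send
settings = json.loads(body)      # parse it to confirm it is valid JSON
print(settings["HostPoolName"])
```

Parsing the rebuilt text is a quick check that the string placed in RequestBody is valid JSON before the script under test tries to convert it.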

Loading data with JSON.parse: Unexpected token /

I'm pretty new to JSON and AJAX, so I was following a tutorial on YouTube: https://www.youtube.com/watch?v=rJesac0_Ftw&t=1029s.
I have followed the steps exactly as in the video, but I get the following error:
VM34:1 Uncaught SyntaxError: Unexpected token / in JSON at position 0
at JSON.parse (<anonymous>)
at XMLHttpRequest.theRequest.onload (loader.js:5)
My JSON script:
[
  {
    "name":"一",
    "sound": {
      "kunyomi": ["ひと．つ"],
      "onyomi": ["イチ"]
    },
    "description":"Representaba la unidad, el absoluto. Cuando funciona como componente, este carácter adquiere el significado de suelo o de techo según su posición: si se encuentra encima de otro componente, toma el significado de techo; si está debajo, de suelo. Todas las formas antiguas de los números están asociadas a fuerzas del universo y a la mitología. Los números pares son el ying y los impares son el yang.",
    "examples":["-月[いちがつ] - Enero", "-日[ついたち] - Día uno", "-回[いっかい] - Dos veces", "-階[いっかい] - Primer Piso"]
  },
  {
    "name":"ニ",
    "sound": {
      "kunyomi": ["ふた．つ"],
      "onyomi": ["ニ、ジ"]
    },
    "description":"Representa el cielo 一 y la tierra 一, el ying y el yang. Al igual que en el caso de los numerales romanos, el kanji de dos es una simple duplicación del trazo horizontal que significa uno.",
    "examples":["二月[にがつ] - Febrero", "二日[ふつか] - Día dos", "二回[にかい] - Dos veces"]
  }
]
And I call that JSON with the following code:
var ourRequest = new XMLHttpRequest();
ourRequest.open('GET', 'http://127.0.0.1/japones_flat/kanjis_n5.json');
ourRequest.onload = function() {
    "use strict";
    var response = JSON.parse(ourRequest.responseText);
    console.log(response[0]);
};
ourRequest.send();
I did a little research and it looks like the problem lies in the JSON.parse call, which complains about the "/" token. I then noticed that Dreamweaver had left a default comment in my .json file, so I deleted it (because it started with "/"), but I keep getting this annoying error. Can you help me?
Thanks in advance!
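
JSON does not allow comments, so a file that begins with one fails at the very first character. That is what "position 0" in the error reports: the text handed to the parser still started with "/". A minimal sketch of the same behaviour using Python's json module instead of the browser's JSON.parse; the comment text and the helper name try_parse are hypothetical.

```python
import json

# Hypothetical file contents: one with a leading comment, one without.
with_comment = '// comment left by the editor\n[{"name": "一"}]'
without_comment = '[{"name": "一"}]'


def try_parse(text):
    """Return (parsed_value, None) on success or (None, error_position) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as err:
        return None, err.pos


print(try_parse(with_comment))
print(try_parse(without_comment))
```

If the error persists after the comment is removed from the file, it may be worth logging the raw response text to see what the request actually received.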

Need help on an ER Diagram for a “Wireless Internet Service Provider”

I have this diagram.
Diagram (1/2)
Diagram (2/2)
Just in case: Equipo (Device), Servicio (Service), Categoria (Category), Persona (Person), Comprobante (Voucher, Receipt)
I need to be able to do the following:
"Cuando un cliente informa de un equipo al cual se le suministrará alguno de los servicios es necesario llevar registro de su fecha de alta, como también su fecha de baja si la hubiere." (Whenever a client reports a device that is going to be given one of the services, its activation date must be recorded, along with its deactivation date if there is one.)
The first thing I thought of was simply adding attributes to the service table, for example (act_date timestamp and dea_date timestamp). Would that be okay? Or should I create another table, like the one called "Categoria" (Category), just to store those dates? Any ideas?
Thank you.
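
To make the second option concrete, here is a minimal sketch using Python's sqlite3 module. It assumes that the dates describe the link between a device and a service rather than the service itself, and that a device may be activated more than once. The equipo_servicio table, the column types, and the sample rows are all hypothetical, since the diagrams are not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE equipo   (id INTEGER PRIMARY KEY, nombre TEXT NOT NULL);
CREATE TABLE servicio (id INTEGER PRIMARY KEY, nombre TEXT NOT NULL);
-- One row per period during which a device receives a service.
CREATE TABLE equipo_servicio (
    equipo_id   INTEGER NOT NULL REFERENCES equipo(id),
    servicio_id INTEGER NOT NULL REFERENCES servicio(id),
    act_date    TEXT NOT NULL,  -- activation date
    dea_date    TEXT,           -- deactivation date, NULL while active
    PRIMARY KEY (equipo_id, servicio_id, act_date)
);
""")

conn.execute("INSERT INTO equipo VALUES (1, 'router-01')")
conn.execute("INSERT INTO servicio VALUES (1, 'internet-10mb')")

# First activation, later closed by setting the deactivation date.
conn.execute("INSERT INTO equipo_servicio VALUES (1, 1, '2021-01-15', NULL)")
conn.execute(
    "UPDATE equipo_servicio SET dea_date = '2021-06-30' "
    "WHERE equipo_id = 1 AND servicio_id = 1 AND dea_date IS NULL"
)
# Re-activation is a new row, so the earlier period is kept as history.
conn.execute("INSERT INTO equipo_servicio VALUES (1, 1, '2021-08-01', NULL)")

active = conn.execute(
    "SELECT act_date FROM equipo_servicio WHERE dea_date IS NULL"
).fetchall()
history = conn.execute("SELECT COUNT(*) FROM equipo_servicio").fetchone()[0]
print(active, history)
```

Putting the two dates directly on the service table would only work if each service row belongs to exactly one device and is never reactivated; whether that holds depends on the rest of the model.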
