Parsing a JSON-like structure in Spark

I have a file with data structured like this
{'analytics_category_id': 'Default', 'item_sub_type': '', 'deleted_at': '', 'product_category_id': 'Default', 'unit_price': '0.000', 'id': 'myntramprdii201907174a72fb2475d84103844083d1348acb9e', 'is_valid': True, 'measurement_uom_id': '', 'description': '', 'invoice_type': 'receivable_debit_note', 'linked_core_invoice_item_id': '', 'ref_3': '741423', 'ref_2': '6001220139357318', 'ref_1': '2022-07-04', 'tax_rate': '0.000', 'reference_id': '', 'ref_4': '', 'product_id': 'Default', 'total_amount': '0.000', 'tax_auth_party_id': '', 'item_type': 'Product', 'invoice_item_attributes': '', 'core_invoice_id': 'myntramprdi20190717a1e925911345463393bc4ac1b124dbe5', 'tax_auth_geo_id': '', 'quantity': 1}
{'analytics_category_id': 'Default', 'item_sub_type': '', 'deleted_at': '', 'product_category_id': 'Default', 'unit_price': '511.000', 'id': 'myntramprdii20190717c749a96d2e7144aea7fc5125287717f7', 'is_valid': True, 'measurement_uom_id': '', 'description': '', 'invoice_type': 'receivable_debit_note', 'linked_core_invoice_item_id': '', 'ref_3': '741424', 'ref_2': '6001220152640260', 'ref_1': '2022-07-07', 'tax_rate': '0.000', 'reference_id': '', 'ref_4': '', 'product_id': 'Default', 'total_amount': '511.000', 'tax_auth_party_id': '', 'item_type': 'Product', 'invoice_item_attributes': '', 'core_invoice_id': 'myntramprdi20190717a1e925911345463393bc4ac1b124dbe5', 'tax_auth_geo_id': '', 'quantity': 1}
I am trying to parse this in Spark using Scala and create a DataFrame from it, but I am not able to do so because of the structure. I thought about replacing the ' with ", but my text can also contain those characters. What I need is a key-value pair view of the data.
So far I have tried:
read.option("multiline", "true").json("s3://******/*********/prod_flattener/y=2019/m=07/d=17/type=flattened_core_invoices_items/invoice_items_2019_07_17_23_53_19.txt")
I did get some success reading this as a multiline text:
read.option("multiline", "true").textFile("s3://********/*********/prod_flattener/y=2019/m=07/d=17/type=flattened_core_invoices_items/invoice_items_2019_07_17_23_53_19.txt")
+--------------------+
|               value|
+--------------------+
|{'analytics_categ...|
|{'analytics_categ...|
+--------------------+
How do I read the keys as columns now?

Your issue is linked to the True value used as a boolean in your entries: this is not valid JSON, which requires lowercase true or false for boolean values.
If your dataset is not very large, the easiest way is to load it as text, fix this issue, write the fixed data, then reopen it as JSON.
import spark.implicits._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val initial = spark.read.text("s3://******/*********/prod_flattener/y=2019/m=07/d=17/type=flattened_core_invoices_items/invoice_items_2019_07_17_23_53_19.txt")
val fixed = initial
  .select(regexp_replace('value, "\\bTrue\\b", "true") as "value")
  .select(regexp_replace('value, "\\bFalse\\b", "false") as "value")
fixed.write.mode("overwrite").text("/tmp/fixed_items")
val json_df = spark.read.json("/tmp/fixed_items")
json_df: org.apache.spark.sql.DataFrame = [analytics_category_id: string, core_invoice_id: string ... 23 more fields]
If you don't want to write a temporary dataset, you can directly use from_json to parse the fixed text value, but you'll need to manually define your schema in Spark beforehand and do some column renaming after parsing:
val jsonSchema = StructType.fromDDL("`analytics_category_id` STRING,`core_invoice_id` STRING,`deleted_at` STRING,`description` STRING,`id` STRING,`invoice_item_attributes` STRING,`invoice_type` STRING,`is_valid` BOOLEAN,`item_sub_type` STRING,`item_type` STRING,`linked_core_invoice_item_id` STRING,`measurement_uom_id` STRING,`product_category_id` STRING,`product_id` STRING,`quantity` BIGINT,`ref_1` STRING,`ref_2` STRING,`ref_3` STRING,`ref_4` STRING,`reference_id` STRING,`tax_auth_geo_id` STRING,`tax_auth_party_id` STRING,`tax_rate` STRING,`total_amount` STRING,`unit_price` STRING")
val jsonParsingOptions: Map[String,String] = Map()
val json_df = fixed
  .select(from_json('value, jsonSchema, jsonParsingOptions) as "j")
  .select(jsonSchema.map(f => 'j.getItem(f.name).as(f.name)): _*)
json_df: org.apache.spark.sql.DataFrame = [analytics_category_id: string, core_invoice_id: string ... 23 more fields]
As an aside, from the snippet you posted you don't seem to require the multiline option, but if you actually do, you'll need to add it to the jsonParsingOptions map.
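For readers working in Python, the same two-step fix can be sketched in PySpark; this is a minimal, hedged equivalent of the Scala code above, assuming the same input layout (the paths below are placeholders, not the original S3 locations):
# Minimal PySpark sketch of the same fix; paths are placeholders
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession.builder.getOrCreate()
initial = spark.read.text("/path/to/invoice_items.txt")
# Replace Python-style booleans with valid JSON booleans
fixed = (initial
         .withColumn("value", regexp_replace("value", r"\bTrue\b", "true"))
         .withColumn("value", regexp_replace("value", r"\bFalse\b", "false")))
# Write the fixed text out and read it back as JSON, as in the Scala version
fixed.write.mode("overwrite").text("/tmp/fixed_items")
json_df = spark.read.json("/tmp/fixed_items")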

Related

Filter one dictionary out of list of dictionaries and extract values of certain keys

Objective:
To filter a list of dictionaries and extract specific values from the matching dictionary using Python 3.x
Code:
output = data.decode("utf-8")
input = json.loads(output)
for itm in input['items']:
    print(itm)
The above code outputs:
{'name': 'ufg', 'id': '0126ffc8-a1b1-423e-b7fe-56d4e93a80d6', 'created_at': '2022-06-16T04:37:32.958Z'}
{'name': 'xyz', 'id': '194ac74b-54ac-45c6-b4d3-c3ae3ebc1d27', 'created_at': '2022-06-26T10:32:50.307Z'}
{'name': 'defg', 'id': '3744bdaa-4e74-46f6-bccb-1dc2eca2d2c1', 'created_at': '2022-06-26T10:55:21.273Z'}
{'name': 'abcd', 'id': '41541893-f916-426b-b135-c7500759b0b3', 'created_at': '2022-06-24T08:39:39.806Z'}
Now I need to filter the output; for example, I want only the dictionary with 'name' equal to 'abcd'.
expected filtered output:
{'name': 'abcd', 'id': '41541893-f916-426b-b135-c7500759b0b3', 'created_at': '2022-06-24T08:39:39.806Z'}
Now I need to extract only 'name' and 'id' for this 'abcd' entry into Python variables to use in the next part of the program.
Please suggest.
For your situation, converting each item into a dictionary and then filtering would probably be best.
for itm in input['items']:
    itm = dict(itm)
    if itm["name"] == "abcd":
        print(itm["name"], itm["id"])
However, if itm = dict(itm) doesn't work for your situation, you can use json.loads(itm) first and then itm = dict(itm).
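If you also want the matching values in variables for the next part of the program, a generator expression with next() is a compact option; a minimal sketch, assuming input is the parsed JSON from above:
# Grab the first item whose name is "abcd" (None if there is no match)
match = next((itm for itm in input['items'] if itm['name'] == 'abcd'), None)
if match is not None:
    name, item_id = match['name'], match['id']  # variables for later use
    print(name, item_id)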

Scrape a table data from a paginated webpage where the url does not change but the table data changes

website: nafdac.gov.ng/our-services/registered-products
The code below runs but takes 7 hours to render 200 pages out of 5802. I'd appreciate it if anybody can help me find out how to scrape this website faster.
# pip install webdriver-manager --user
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as ec
import pandas as pd
import time

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.nafdac.gov.ng/our-services/registered-products/')

container2 = []
wait_time_out = 20
ignored_exceptions = (NoSuchElementException, StaleElementReferenceException,)

for _ in range(0, 5802 + 1):
    rows = WebDriverWait(driver, wait_time_out, ignored_exceptions=ignored_exceptions).until(
        ec.presence_of_all_elements_located((By.XPATH, '//*[@id="table_1"]/tbody/tr')))
    for row in rows:
        time.sleep(10)
        container2.append([table_data.text for table_data in row.find_elements(By.TAG_NAME, 'td')])
    WebDriverWait(driver, wait_time_out, ignored_exceptions=ignored_exceptions).until(
        ec.presence_of_element_located((By.XPATH, '//*[@id="table_1_next"]'))).click()
    time.sleep(10)
The data is retrieved through an AJAX call, so just get the data directly from the source. This got me all the data in about 50 seconds:
import requests
import pandas as pd
from bs4 import BeautifulSoup
# Get the wdtNonce token from the page's first hidden input
url = 'https://www.nafdac.gov.ng/our-services/registered-products/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
wdtNonce = soup.find_all('input', {'type':'hidden'})[0]['value']
# URL to ajax
url = 'https://www.nafdac.gov.ng/wp-admin/admin-ajax.php?action=get_wdtable&table_id=1'
# Headers for the request
headers = {'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Mobile Safari/537.36'}
# form data needed to query in the post
payload = {
    'draw': '1',
    'columns[0][data]': '0',
    'columns[0][name]': 'ID',
    'columns[0][searchable]': 'true',
    'columns[0][orderable]': 'true',
    'columns[0][search][value]': '',
    'columns[0][search][regex]': 'false',
    'columns[1][data]': '1',
    'columns[1][name]': 'product_group',
    'columns[1][searchable]': 'true',
    'columns[1][orderable]': 'true',
    'columns[1][search][value]': '',
    'columns[1][search][regex]': 'false',
    'columns[2][data]': '2',
    'columns[2][name]': 'product_name',
    'columns[2][searchable]': 'true',
    'columns[2][orderable]': 'true',
    'columns[2][search][value]': '',
    'columns[2][search][regex]': 'false',
    'columns[3][data]': '3',
    'columns[3][name]': 'presentation',
    'columns[3][searchable]': 'true',
    'columns[3][orderable]': 'true',
    'columns[3][search][value]': '',
    'columns[3][search][regex]': 'false',
    'columns[4][data]': '4',
    'columns[4][name]': 'active_ingredent',
    'columns[4][searchable]': 'true',
    'columns[4][orderable]': 'true',
    'columns[4][search][value]': '',
    'columns[4][search][regex]': 'false',
    'columns[5][data]': '5',
    'columns[5][name]': 'applicant_name',
    'columns[5][searchable]': 'true',
    'columns[5][orderable]': 'true',
    'columns[5][search][value]': '',
    'columns[5][search][regex]': 'false',
    'columns[6][data]': '6',
    'columns[6][name]': 'country',
    'columns[6][searchable]': 'true',
    'columns[6][orderable]': 'true',
    'columns[6][search][value]': '',
    'columns[6][search][regex]': 'false',
    'columns[7][data]': '7',
    'columns[7][name]': 'manufacturer',
    'columns[7][searchable]': 'true',
    'columns[7][orderable]': 'true',
    'columns[7][search][value]': '',
    'columns[7][search][regex]': 'false',
    'columns[8][data]': '8',
    'columns[8][name]': 'date_approved',
    'columns[8][searchable]': 'true',
    'columns[8][orderable]': 'true',
    'columns[8][search][value]': '',
    'columns[8][search][regex]': 'false',
    'columns[9][data]': '9',
    'columns[9][name]': 'expiry_date',
    'columns[9][searchable]': 'true',
    'columns[9][orderable]': 'true',
    'columns[9][search][value]': '',
    'columns[9][search][regex]': 'false',
    'columns[10][data]': '10',
    'columns[10][name]': 'registration_number',
    'columns[10][searchable]': 'true',
    'columns[10][orderable]': 'true',
    'columns[10][search][value]': '',
    'columns[10][search][regex]': 'false',
    'order[0][column]': '0',
    'order[0][dir]': 'asc',
    'start': '0',
    'length': '10000',
    'search[value]': '',
    'search[regex]': 'false',
    'wdtNonce': wdtNonce}
# Iterate through the form data above to pull out the column names
cols = []
for k, v in payload.items():
    if 'name' in k:
        cols.append(v)

# Initialize a list of rows
rows = []
start = 0
while True:
    # Update the start value of the form data to move through the "pages"
    payload.update({'start': str(start)})
    # Return the JSON from the AJAX POST
    jsonData = requests.post(url, headers=headers, data=payload).json()
    # Add the list of data into the list of "rows"
    rows += jsonData['data']
    print('Gathered rows: %d - %d' % (start + 1, start + len(jsonData['data'])))
    # If the data is less than 10000 items, we know we are on the last "page",
    # so we break the loop
    if len(jsonData['data']) < 10000:
        print('Done!')
        break
    # Update the start variable so that the next iteration requests the next "page"
    start = len(rows)

# Create the table from the final list of rows
df = pd.DataFrame(rows, columns=cols)
Output:
print(df)
ID product_group ... expiry_date registration_number
0 1 ... 30/10/2022 03-0740
1 3 ANIMAL FEED ... 30/07/2023 A9-0735
2 4 ANIMAL FEED ... 30/07/2023 A9-0744
3 5 ANIMAL FEED ... 27/06/2023 A9-0721
4 6 ANIMAL FEED ... 27/06/2023 A9-0722
... ... ... ... ...
58011 58.013 ... Apr-65
58012 58.014 ... A4-2582
58013 58.015 ... A4-0851
58014 58.016 ... A4-6613
58015 58.017 ... A4-3601
[58016 rows x 11 columns]
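If you want to keep the result, persisting the DataFrame is a one-liner; the filename below is just an example, not from the original answer:
# Save the scraped table for later use; the filename is an arbitrary example
df.to_csv('nafdac_registered_products.csv', index=False)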

I am trying to import some databases in phpMyAdmin and am getting the following errors

Warning in ./libraries/plugin_interface.lib.php#551
count(): Parameter must be an array or an object that implements Countable
Backtrace
./libraries/display_import.lib.php#371: PMA_pluginGetOptions(
    string 'Import',
    array,
)
./libraries/display_import.lib.php#456: PMA_getHtmlForImportOptionsFormat(array)
./libraries/display_import.lib.php#691: PMA_getHtmlForImport(
    string '5ddbf3eff33f7',
    string 'server',
    string '',
    string '',
    integer 2097152,
    array,
    NULL,
    NULL,
    string '',
)
./server_import.php#34: PMA_getImportDisplay(
    string 'server',
    string '',
    string '',
    integer 2097152,
)
I have tried the following solutions with no success: phpmyadmin - count(): Parameter must be an array or an object that implements Countable

How to INSERT HTML-formatted text into MySQL?

I am creating a database and inserting data. Our backend engineer said he needs a column to save whole articles in HTML format. But when I am inserting the data it gives me an error, and when I checked exactly where the error comes from, it looks like that part has some quote or punctuation issues, and the same line occurs multiple times. I used the str() function to convert the formatted HTML text (type() shows the datatype is bs4.element.Tag) to a string, but the problem still exists.
My database description is:
('id', 'mediumint(9)', 'NO', 'PRI', None, 'auto_increment')
('weburl', 'varchar(200)', 'YES', '', None, '')
('picurl', 'varchar(200)', 'YES', '', None, '')
('headline', 'varchar(200)', 'YES', '', None, '')
('abstract', 'varchar(200)', 'YES', '', None, '')
('body', 'longtext', 'YES', '', None, '')
('formed', 'longtext', 'YES', '', None, '')
('term', 'varchar(50)', 'YES', '', None, '')
And the function I used to collect full text is:
import urllib.request
from bs4 import BeautifulSoup

def GetBody(url, plain=False):
    # Fetch the html file
    response = urllib.request.urlopen(url)
    html_doc = response.read()
    # Parse the html file
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Find the article body
    body = soup.find("section", {"name": "articleBody"})
    if not plain:
        return body
    else:
        text = ""
        for p_tag in body.find_all('p'):
            text = ' '.join([text, p_tag.text])
        return text
And I import the data by this function:
def InsertDatabase(section):
    s = TopStoriesSearch(section)
    count1 = 0
    formed = []
    while count1 < len(s):
        # tr = GetBody(s[count1]['url'])
        # formed.append(str(tr))
        # count1 = count1 + 1
        # (I use the commented lines above to convert the HTML to a string, or the code below)
        formed.append(GetBody(s[count1]['url']))
        count1 = count1 + 1
and this is my insert function:
for each in overall:  # I save everything in this list named overall
    cur.execute('insert into topstories(formed) values("%s")' % (each["formed"]))
Any tips to solve the problem?
The syntax of the execute() function is as follows (from the MySQL Connector/Python documentation):
cursor.execute(operation, params=None, multi=False)
Therefore, you can provide the values to be used in the query as a separate argument to the execute() function. In that case, the driver will quote and escape the values automatically, eliminating your problem:
import mysql.connector

cnx = mysql.connector.connect(...)
cur = cnx.cursor()
...
for each in overall:
    # If 'each' is a dictionary containing 'formed' as a key,
    # i.e. each = {..., 'formed': ..., ...}, you can do as follows
    cur.execute('INSERT INTO topstories(formed) VALUES (%s)', (each['formed'],))
    # You can also use the dictionary directly if you use a named placeholder in the query
    cur.execute('INSERT INTO topstories(formed) VALUES (%(formed)s)', each)
...
cnx.commit()
cnx.close()
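If overall is large, the same parameterized insert can be batched with executemany(); a minimal sketch under the same assumptions as above (each element of overall is a dict with a 'formed' key):
# Batch variant of the parameterized insert; one round trip instead of many
cur.executemany('INSERT INTO topstories(formed) VALUES (%s)',
                [(each['formed'],) for each in overall])
cnx.commit()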

In my fetched JSON data, how can I separate out the balance?

So, I have been testing the block.io API, and so far I have this:
knee = block_io.get_address_balance(labels='shibe1')
s1 = json.dumps(knee)
d2 = json.loads(s1)
print (d2)
It returns this batch of text:
{'status': 'success', 'data': {'network': 'DOGE', 'available_balance': '0.0', 'pending_received_balance': '0.0', 'balances': [{'user_id': 1, 'label': 'shibe1', 'address': 'A9Bda9UMBcb1183PtsBxnbj5QgP6jwkCFG', 'available_balance': '0.00000000', 'pending_received_balance': '0.00000000'}]}}
How would I grab only the available_balance part and print it out, instead of all of the JSON data?
EDIT: Please help! Can't find a solution.
Try using some regex.
import re

data = "{'status': 'success', 'data': {'network': 'DOGE', 'available_balance': '0.129', 'pending_received_balance': '0.0', 'balances': [{'user_id': 1, 'label': 'shibe1', 'address': 'A9Bda9UMBcb1183PtsBxnbj5QgP6jwkCFG', 'available_balance': '0.00000000', 'pending_received_balance': '0.00000000'}]}}"
pattern = re.compile("(?<=available_balance': ').*?(?=')")
matches = pattern.finditer(data)
for match in matches:
    print(match.group())
Breakdown:
import re imports the regex library built into Python.
data = "{...}" is the string containing the data to match. You can replace this with your JSON data (as a string).
pattern = re.compile("(?<=available_balance': ').*?(?=')") compiles the regex for finding the available balance values.
Regex breakdown:
(?<=...) is a lookbehind: the match must be immediately preceded by available_balance': '.
.*? lazily matches any characters between the two anchors.
(?=') is a lookahead: the match must be immediately followed by a closing single quote.
pattern.finditer(data) matches the regex against data.
for match in matches: print(match.group()) prints the matches from the regex.
If you run this code, you will get the following results:
0.129
0.00000000
If you want the code to work with your variables from above, note that d2 is a dictionary, so convert it to a string before matching:
import re
pattern = re.compile("(?<=available_balance': ').*?(?=')")
matches = pattern.finditer(str(d2))  # finditer needs a string; str() gives the single-quoted repr the pattern expects
for match in matches:
    print(match.group())
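That said, since d2 is already a parsed dictionary, plain key access is a simpler alternative to the regex; a minimal sketch based on the structure shown above:
# d2 is the dict produced by json.loads above; no regex needed
available = d2['data']['available_balance']  # top-level balance, e.g. '0.0'
per_address = [b['available_balance'] for b in d2['data']['balances']]
print(available, per_address)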