I have data in the JSON format below:
{"A": {"Show only": ["Buy"], "Apple MacBook Product Line": ["MacBook Pro", "MacBook Air", "MacBook (Original)"], "Color": ["Space Gray", "Silver", "Gold", "Rose Gold", "Gray", "White", "Champagne Gold"], "Seller": ["Back Market", "Best Buy", "eBay"]}, "B": {"Show only": ["Buy"], "Material": ["Human Hair", "Synthetic"], "Features": ["Lace Front"], "Seller": ["arabellahair", "Ergode", "PartyBell.com"]}}
I tried the following to convert it to CSV:
import pandas as pd
import csv
import glob

for file in glob.glob('file.json'):
    dfjson = pd.read_json(file, encoding='utf-8', lines=False, dtype=str)
    dfjson.to_csv("output.csv", index=True)
But the expected output is
A Show only Buy
A Apple MacBook Product Line_001 MacBook Pro
A Apple MacBook Product Line_002 MacBook Air
A Apple MacBook Product Line_003 MacBook (Original)
A Color_001 Space Gray
A Color_002 Silver
A Color_003 Gold
A Color_004 Rose Gold
A Color_005 Gray
A Color_006 White
A Color_007 Champagne Gold
A Seller_001 Back Market
A Seller_002 Best Buy
A Seller_003 eBay
B Show only Buy
B Material_001 Human Hair
B Material_002 Synthetic
B Features Lace Front
B Seller_001 arabellahair
B Seller_002 Ergode
B Seller_003 PartyBell.com
What changes can I make to get this output?
Your JSON is not structured the way pandas needs for that output. Here I replaced it with a flat, column-oriented structure:
import pandas as pd

json_file = {"col1": ["A", "A", "A"],
             "col2": ["Show only", "Apple MacBook Product", "Apple MacBook Product"],
             "col3": ["Buy", "MacBook Pro", "MacBook (Original)"]}
df = pd.DataFrame(json_file)
df
col1 col2 col3
0 A Show only Buy
1 A Apple MacBook Product MacBook Pro
2 A Apple MacBook Product MacBook (Original)
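If the nested JSON has to be kept as it is, the rows in the expected output can also be built by walking the dictionary directly. The following is a minimal sketch, not a definitive implementation: the function name flatten_facets and the shortened sample are made up for illustration, and the rule that single-value lists keep the bare key while longer lists get a _001, _002, ... suffix is inferred from the expected output in the question.

```python
import csv
import io


def flatten_facets(data):
    """Turn {group: {key: [values]}} into (group, key, value) rows."""
    rows = []
    for group, facets in data.items():
        for key, values in facets.items():
            if len(values) == 1:
                # Single-value lists keep the bare key, as in the expected output.
                rows.append((group, key, values[0]))
            else:
                # Multi-value lists get a zero-padded _001, _002, ... suffix.
                for index, value in enumerate(values, start=1):
                    rows.append((group, f"{key}_{index:03d}", value))
    return rows


# Shortened, illustrative version of the JSON from the question.
sample = {
    "A": {"Show only": ["Buy"], "Color": ["Space Gray", "Silver"]},
    "B": {"Features": ["Lace Front"], "Material": ["Human Hair", "Synthetic"]},
}
rows = flatten_facets(sample)

# Write the rows as CSV; an in-memory buffer stands in for output.csv here.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

To write a real file, open it with open("output.csv", "w", newline="") and pass the file object to csv.writer in place of the buffer.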
Related
I am new to data scraping. I am reading data from a file that has one JSON object per line:
{"name": "Soul Sweet \u2018Taters (Step-by-Step!)", "ingredients": "4 whole Medium Sweet Potatoes\n1 cup Sugar\n1 cup Milk\n2 whole Eggs\n1 teaspoon Vanilla Extract\n1 teaspoon Salt\n1 cup Brown Sugar\n1 cup Pecans\n1/2 cup Flour\n3/4 sticks Butter", "cookTime": "PT30M","prepTime": "PT45M"}
{"name": "Cranberry-Pomegranate Sauce", "ingredients": "1 bag (about 12 To 16 Oz) Fresh Cranberries\n16 ounces, fluid Pomegranate Juice\n3/4 cups Sugar, More Or Less To Taste", "cookTime": "PT15M","prepTime": "PT2M"}
{"name": "Whiskey Maple Cream Sauce", "ingredients": "1-1/2 cup Heavy Cream\n5 Tablespoons Pure Maple Syrup\n3 Tablespoons Light Corn Syrup\n1 Tablespoon Whiskey (can Add More If Desired)","cookTime": "PT15M","prepTime": "PT5M"}
I am looking for help with the following:
Filter the rows that contain the text "sugar" in the ingredients column.
Convert the ISO prep time and ISO cook time to a human-readable format, and add a new column with the total time (prep time + cook time) for the filtered rows. The ISO times can be in minutes and also in hours.
Expected output:
name                                      total_cook_time
Soul Sweet \u2018Taters (Step-by-Step!)   75M
Cranberry-Pomegranate Sauce               17M
My sample code
import glob
import json
from datetime import datetime

currentdate = datetime.today().strftime('%Y/%m/%d')
absolutepath = 'project/sniper/' + currentdate + '/*.json'
new_data = []
totaltime = {}
# The path is a glob pattern, so expand it and read every matching file line by line.
data = []
for path in glob.glob(absolutepath):
    with open(path, 'r') as f:
        data.extend(json.loads(line) for line in f if line.strip())
for d in data:
    if 'sugar' in d.get('ingredients', '').lower():  # case-insensitive match
        new_data.append(d.get('name'))
        total_time = int(d['prepTime'].replace('PT', '').replace('M', '')) + int(d['cookTime'].replace('PT', '').replace('M', ''))
        totaltime[d['name']] = total_time
This is one way to do it, as long as all durations use the same minutes-only format (for example PT45M). Otherwise, more logic or a regex would be friendlier. It assumes you are working with a list of recipes called 'data'.
The new_data list holds the ingredients of the recipes that contain Sugar, and the totaltime dictionary maps each recipe name to its total time in minutes.
new_data = []
totaltime = {}
for d in data:
    if 'Sugar' in d.get('ingredients'):  # case-sensitive match
        new_data.append(d.get('ingredients'))
        # Strip the 'PT' prefix and 'M' suffix, then add prep and cook minutes.
        total_time = int(d['prepTime'].replace('PT', '').replace('M', '')) + int(d['cookTime'].replace('PT', '').replace('M', ''))
        totaltime[d['name']] = total_time
totaltime
{'Soul Sweet ‘Taters (Step-by-Step!)': 75, 'Cranberry-Pomegranate Sauce': 17}
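Since the question mentions that durations can also contain hours, a regex-based variant may be more robust than chained replace calls. This is only a sketch under stated assumptions: it handles the PT#H#M subset of ISO 8601 durations (no days or seconds), the names iso_to_minutes and total_times are made up for illustration, and the "Slow Roast (hypothetical)" recipe is invented here to exercise the hours case.

```python
import re

# Matches durations such as PT45M, PT2H or PT1H30M (hours and minutes only).
DURATION_RE = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?")


def iso_to_minutes(duration):
    """Convert a PT#H#M duration string to a number of minutes."""
    match = DURATION_RE.fullmatch(duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    hours, minutes = (int(part) if part else 0 for part in match.groups())
    return hours * 60 + minutes


def total_times(recipes, keyword="sugar"):
    """Map recipe name -> prep + cook minutes for recipes mentioning keyword."""
    totals = {}
    for recipe in recipes:
        if keyword in recipe.get("ingredients", "").lower():
            totals[recipe["name"]] = (
                iso_to_minutes(recipe["prepTime"]) + iso_to_minutes(recipe["cookTime"])
            )
    return totals


recipes = [
    {"name": "Cranberry-Pomegranate Sauce", "ingredients": "3/4 cups Sugar",
     "cookTime": "PT15M", "prepTime": "PT2M"},
    # Invented record, only here to show a duration with hours.
    {"name": "Slow Roast (hypothetical)", "ingredients": "1 cup Brown Sugar",
     "cookTime": "PT1H30M", "prepTime": "PT20M"},
    {"name": "Whiskey Maple Cream Sauce", "ingredients": "5 Tablespoons Pure Maple Syrup",
     "cookTime": "PT15M", "prepTime": "PT5M"},
]
totals = total_times(recipes)
for name, minutes in totals.items():
    print(f"{name}: {minutes}M")
```

The keyword match is lowercased on purpose, so "Sugar" and "Brown Sugar" both count, which differs from the case-sensitive check above.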
I have a csv file that has the following format:
company_id  year  sales  buys  location
3           2020  230    112   europe
3           2019  234    231   europe
2           2020  443    351   usa
2           2019  224    256   usa
When I import it into Elasticsearch, I end up with one entry for each line.
However, I would like to import it in the format below:
[
{"company_id" : 3,
"location" : "europe",
"2020" : {"sales" : 230, "buys" : 112},
"2019" : {"sales" : 234, "buys" : 231}
},
{"company_id" : 2,
"location" : "usa",
"2020" : {"sales" : 443, "buys" : 351},
"2019" : {"sales" : 224, "buys" : 256}
}
]
Is there a way to write the ingest pipeline (processor) in order to achieve this?
Thanks in advance for your precious answers.
At the ingest pipeline level you'll only be able to handle one document (i.e. one row) at a time, so in order to aggregate the way you want, you need to do it at the Logstash level using the aggregate filter.
If your rows are correctly sorted by location, you can use the following example from the official documentation.
One word of caution, though: if you add year as a field, your mapping will keep growing as years go by and you potentially risk mapping explosion.
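If running Logstash is not an option, another route is to aggregate the rows on the client before indexing. This is a different technique from the aggregate filter mentioned above, shown only as a rough sketch: it assumes a comma-delimited file with a header row, the function name aggregate_rows is made up, and the resulting documents would still have to be sent to Elasticsearch separately (for instance through the bulk API).

```python
import csv
import io

# In-memory stand-in for the CSV file from the question (comma-delimited is an assumption).
SAMPLE_CSV = """company_id,year,sales,buys,location
3,2020,230,112,europe
3,2019,234,231,europe
2,2020,443,351,usa
2,2019,224,256,usa
"""


def aggregate_rows(csv_file):
    """Merge all rows of a company into one document keyed by year."""
    companies = {}
    for row in csv.DictReader(csv_file):
        company_id = int(row["company_id"])
        # Create the document on first sight of a company, then reuse it.
        doc = companies.setdefault(
            company_id, {"company_id": company_id, "location": row["location"]}
        )
        doc[row["year"]] = {"sales": int(row["sales"]), "buys": int(row["buys"])}
    return list(companies.values())


documents = aggregate_rows(io.StringIO(SAMPLE_CSV))
print(documents)
```

Unlike the aggregate filter, this does not depend on the rows being sorted, but the caution about one field per year applies to the resulting documents just the same.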
I am using a web scraper to pull information on sneakers from a website. The JSON data that comes back is structured like so:
[
{
"web-scraper-order": "1554084909-97",
"web-scraper-start-url": "https://www.goat.com/sneakers",
"productlink": "$200AIR JORDAN 6 RETRO 'INFRARED' 2019",
"productlink-href": "https://www.goat.com/sneakers/air-jordan-6-retro-black-infrared-384664-060",
"name": "Air Jordan 6 Retro 'Infrared' 2019",
"price": "Buy New - $200",
"description": "The 2019 edition of the Air Jordan 6 Retro ‘Infrared’ is true to the original colorway, which Michael Jordan wore when he captured his first NBA title. Dressed primarily in black nubuck with a reflective 3M layer underneath, the mid-top features Infrared accents on the midsole, heel tab and lace lock. Nike Air branding adorns the heel and sockliner, an OG detail last seen on the 2000 retro.",
"releasedate": "2019-02-16",
"colorway": "Black/Infrared 23-Black",
"brand": "Air Jordan",
"designer": "Tinker Hatfield",
"technology": "Air",
"maincolor": "Black",
"silhouette": "Air Jordan 6",
"nickname": "Infrared",
"category": "lifestyle",
"image-src": "https://image.goat.com/crop/1250/attachments/product_template_additional_pictures/images/018/675/318/original/464372_01.jpg.jpeg"
},
{
"web-scraper-order": "1554084922-147",
"web-scraper-start-url": "https://www.goat.com/sneakers",
"productlink": "$190YEEZY BOOST 350 V2 'CREAM WHITE / TRIPLE WHITE'",
"productlink-href": "https://www.goat.com/sneakers/yeezy-boost-350-v2-cream-white-cp9366",
"name": "Yeezy Boost 350 V2 'Cream White / Triple White'",
"price": "Buy New - $220",
"description": "First released on April 29, 2017, the Yeezy Boost 350 V2 ‘Cream White’ combines a cream Primeknit upper with tonal cream SPLY 350 branding, and a translucent white midsole housing full-length Boost. Released again in October 2018, this retro helped fulfill Kanye West’s oft-repeated ‘YEEZYs for everyone’ Twitter mantra, as adidas organized the biggest drop in Yeezy history by promising pre-sale to anyone who signed up on the website. Similar to the first release, the ‘Triple White’ 2018 model features a Primeknit upper, a Boost midsole and custom adidas and Yeezy co-branding on the insole.",
"releasedate": "2017-04-29",
"colorway": "Cream White/Cream White/Core White",
"brand": "adidas",
"designer": "Kanye West",
"technology": "Boost",
"maincolor": "White",
"silhouette": "Yeezy Boost 350",
"nickname": "Cream White / Triple White",
"category": "lifestyle",
"image-src": "https://image.goat.com/crop/1250/attachments/product_template_additional_pictures/images/014/822/695/original/116662_03.jpg.jpeg"
},
However, I want to change it so that the top-level node is sneakers, the next level down is a specific sneaker brand (Jordan, Nike, Adidas), and below that is the list of sneakers that belong to that brand. So my JSON structure would look something like this:
Sneakers {
Adidas :{
[shoe1,
shoe2,
....
] },
Jordan: {
[shoe1,
shoe2,
....
]
}
}
I am not sure what tool I could use to do that. Any help would be greatly appreciated. All I have at the moment is the JSON file and it is not in the structure that I want it to be in.
One way of doing this would be to populate a dict whose keys are brand names and their values are lists of sneaker records. Assuming that data is your original list, here's the code:
sneakers_by_brand = {}
for record in data:
    brand = record.get("brand")
    if brand in sneakers_by_brand:
        sneakers_by_brand[brand].append(record)
    else:
        sneakers_by_brand[brand] = [record]
print(sneakers_by_brand)
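To get the top-level "Sneakers" node the question asks for, the grouped dict can be wrapped once more and serialized with the json module. A small sketch of that idea, assuming the records are already loaded as a list of dicts; the function name group_by_brand, the "Unknown" fallback key, and the trimmed two-record sample are illustrative choices, not part of the original data.

```python
import json
from collections import defaultdict


def group_by_brand(records):
    """Return {"Sneakers": {brand: [records...]}} for a flat list of records."""
    by_brand = defaultdict(list)
    for record in records:
        # Records without a brand land under "Unknown" (an assumption of this sketch).
        by_brand[record.get("brand", "Unknown")].append(record)
    return {"Sneakers": dict(by_brand)}


# Trimmed version of the two records from the question.
sample = [
    {"name": "Air Jordan 6 Retro 'Infrared' 2019", "brand": "Air Jordan"},
    {"name": "Yeezy Boost 350 V2 'Cream White / Triple White'", "brand": "adidas"},
]
nested = group_by_brand(sample)
print(json.dumps(nested, indent=2))
```

Note that the brand keys come straight from the data ("Air Jordan", "adidas"), so any renaming to "Jordan" or "Adidas" would need an extra mapping step.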
I am currently working with Amazon user metadata of roughly 9 GB. The data is in .json format and I want to read it in R. Because of the volume, I want to split the file into smaller chunks (either by size or by number of elements). How can I do this in R? The format of the file is as follows:
{
"asin": "0000031852",
"title": "Girls Ballet Tutu Zebra Hot Pink",
"price": 3.17,
"imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg",
"related":
{
"also_bought": ["B00JHONN1S", "B002BZX8Z6", "B00D2K1M3O", "0000031909", "B00613WDTQ", "B00D0WDS9A", "B00D0GCI8S", "0000031895", "B003AVKOP2", "B003AVEU6G", "B003IEDM9Q", "B002R0FA24", "B00D23MC6W", "B00D2K0PA0", "B00538F5OK", "B00CEV86I6", "B002R0FABA", "B00D10CLVW", "B003AVNY6I", "B002GZGI4E", "B001T9NUFS", "B002R0F7FE", "B00E1YRI4C", "B008UBQZKU", "B00D103F8U", "B007R2RM8W"],
"also_viewed": ["B002BZX8Z6", "B00JHONN1S", "B008F0SU0Y", "B00D23MC6W", "B00AFDOPDA", "B00E1YRI4C", "B002GZGI4E", "B003AVKOP2", "B00D9C1WBM", "B00CEV8366", "B00CEUX0D8", "B0079ME3KU", "B00CEUWY8K", "B004FOEEHC", "0000031895", "B00BC4GY9Y", "B003XRKA7A", "B00K18LKX2", "B00EM7KAG6", "B00AMQ17JA", "B00D9C32NI", "B002C3Y6WG", "B00JLL4L5Y", "B003AVNY6I", "B008UBQZKU", "B00D0WDS9A", "B00613WDTQ", "B00538F5OK", "B005C4Y4F6", "B004LHZ1NY", "B00CPHX76U", "B00CEUWUZC", "B00IJVASUE", "B00GOR07RE", "B00J2GTM0W", "B00JHNSNSM", "B003IEDM9Q", "B00CYBU84G", "B008VV8NSQ", "B00CYBULSO", "B00I2UHSZA", "B005F50FXC", "B007LCQI3S", "B00DP68AVW", "B009RXWNSI", "B003AVEU6G", "B00HSOJB9M", "B00EHAGZNA", "B0046W9T8C", "B00E79VW6Q", "B00D10CLVW", "B00B0AVO54", "B00E95LC8Q", "B00GOR92SO", "B007ZN5Y56", "B00AL2569W", "B00B608000", "B008F0SMUC", "B00BFXLZ8M"],
"bought_together": ["B002BZX8Z6"]
},
"salesRank": {"Toys & Games": 211836},
"brand": "Coxlures",
"categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}
We scrape the website www.theft-alerts.com. Now we get all the text.
import json
import urllib2  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

connection = urllib2.urlopen('http://www.theft-alerts.com')
soup = BeautifulSoup(connection.read().replace("<br>", "\n"), "html.parser")
theftalerts = []
for sp in soup.select("table div.itemspacingmodified"):
    for wd in sp.select("div.itemindentmodified"):
        text = wd.text
        if not text.startswith("Images :"):
            print(text)
with open("theft-alerts.json", 'w') as outFile:
    json.dump(theftalerts, outFile, indent=2)
Output:
STOLEN : A LARGE TAYLORS OF LOUGHBOROUGH BELL
Stolen from Bromyard on 7 August 2014
Item : The bell has a diameter of 37 1/2" is approx 3' tall weighs just shy of half a ton and was made by Taylor's of Loughborough in 1902. It is stamped with the numbers 232 and 11.
The bell had come from Co-operative Wholesale Society's Crumpsall Biscuit Works in Manchester.
Any info to : PC 2361. Tel 0300 333 3000
Messages : Send a message
Crime Ref : 22EJ / 50213D-14
No of items stolen : 1
Location : UK > Hereford & Worcs
Category : Shop, Pub, Church, Telephone Boxes & Bygones
ID : 84377
User : 1 ; Antique/Reclamation/Salvage Trade ; (Administrator)
Date Created : 11 Aug 2014 15:27:57
Date Modified : 11 Aug 2014 15:37:21;
How can we categorize the text for the JSON file? The JSON file is currently empty.
Output JSON:
[]
You can define a list and append every dictionary object that you create to it, for example:
import json

theftalerts = []

# Build a fresh dict for each alert; reusing one dict would make every list entry identical.
atheftobject = {}
atheftobject['location'] = 'UK > Hereford & Worcs'
atheftobject['category'] = 'Shop, Pub, Church, Telephone Boxes & Bygones'
theftalerts.append(atheftobject)

atheftobject = {}
atheftobject['location'] = 'UK'
atheftobject['category'] = 'Shop'
theftalerts.append(atheftobject)

with open("theft-alerts.json", 'w') as outFile:
    json.dump(theftalerts, outFile, indent=2)
After running this, theft-alerts.json will contain the following JSON (key order may differ on older Python versions):
[
  {
    "location": "UK > Hereford & Worcs",
    "category": "Shop, Pub, Church, Telephone Boxes & Bygones"
  },
  {
    "location": "UK",
    "category": "Shop"
  }
]
You can play with this to generate your own JSON object.
Check out the json module.
Your JSON output remains empty because your loop doesn't append to the list.
Here's how I would extract the category name:
theftalerts = []
for sp in soup.select("table div.itemspacingmodified"):
    item_text = "\n".join(
        [wd.text for wd in sp.select("div.itemindentmodified")
         if not wd.text.startswith("Images :")])
    category = sp.find(
        'span', {'class': 'itemsmall'}).text.split('\n')[1][11:]
    theftalerts.append({'text': item_text, 'category': category})
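If more fields than the category are wanted, the "Label : value" lines in the printed output could be split into dictionary keys. The sketch below works on plain text only and assumes every labelled line uses " : " (space, colon, space) as the separator, as in the sample output; the function name parse_alert and the "notes" key for unlabelled lines are hypothetical choices.

```python
def parse_alert(text):
    """Split 'Label : value' lines into a dict; keep unlabelled lines as notes."""
    record = {}
    notes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Only the first ' : ' separates label and value, so times like 15:27:57 survive.
        key, separator, value = line.partition(" : ")
        if separator:
            record[key.strip().lower()] = value.strip()
        else:
            notes.append(line)
    if notes:
        record["notes"] = notes
    return record


# A few lines taken from the output shown in the question.
sample_text = """STOLEN : A LARGE TAYLORS OF LOUGHBOROUGH BELL
Stolen from Bromyard on 7 August 2014
Location : UK > Hereford & Worcs
Category : Shop, Pub, Church, Telephone Boxes & Bygones
ID : 84377
Date Created : 11 Aug 2014 15:27:57
"""
alert = parse_alert(sample_text)
print(alert)
```

Each parsed dict can then be appended to theftalerts inside the loop, in place of the single text field.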