I want to parse a .json column through Power BI. I have imported the data directly from the server and have a .json column in the data along with other columns. Is there a way to parse this json column?
Example:
Key IDNumber Module JsonResult
012 200 Dine {"CategoryType":"dining","City":"mumbai"',"Location":"all"}
97 303 Fly {"JourneyType":"Return","Origin":"Mumbai (BOM)","Destination":"Chennai (MAA)","DepartureDate":"20-Oct-2016","ReturnDate":"21-Oct-2016","FlyAdult":"1","FlyChildren":"0","FlyInfant":"0","PromoCode":""}
276 6303 Stay {"Destination":"Clarion Chennai","CheckInDate":"14-Oct-2016","CheckOutDate":"15-Oct-2016","Rooms":"1","NoOfPax":"2","NoOfAdult":"2","NoOfChildren":"0"}
I wish to retain the other columns and also get the simplified parsed columns.
There is an easier way to do it, in the Query Editor on the column you want to read as a json:
Right click on the column
Select Transform>JSON
then the column becomes a Record that you can split in every property of the json using the button on the top right corner.
Use Json.Document function like this
let
...
your_table=imported_the_data_directly_from_the_server,
json=Table.AddColumn(your_table, "NewColName", each Json.Document([JsonResult]))
in
json
And then expand record to table using Table.ExpandRecordColumn
Or by clicking this button
Use Json.Document() function to convert string to Json data.
let
Source = Json.Document(Json.Document(Web.Contents("http://localhost:18091/pools/default/buckets/Aggregation/docs/AvgSumAssuredByProduct"))[json]),
#"Converted to Table" = Record.ToTable(Source),
#"Filtered Rows" = Table.SelectRows(#"Converted to Table", each not Text.Contains([Name], "type_")),
#"Renamed Columns" = Table.RenameColumns(#"Filtered Rows",{{"Name", "AvgSumAssuredByProduct"}}),
#"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"Value", type number}})
in
#"Changed Type"
import json
from urllib import urlopen
import string
from UserList import *
l=[]
j=[]
d_base=urlopen('https://api.thingspeak.com/channels/193888/fields/1.json?results=1')
data = json.load(d_base)
for k in data['feeds']:
name = k['entry_id']
value = k['field1']
l.append(name)
j.append(value)
print l[0]
print j[0]
**this python code may useful for you **
**270
1035
**
Related
I read a .csv file to create a data frame and I want to write the data to a kafka topic. The code is the following
df = spark.read.format("csv").option("header", "true").load(f'{file_location}')
kafka_df = df.selectExpr("to_json(struct(*)) AS value").selectExpr("CAST(value AS STRING)")
kafka_df.show(truncate=False)
And the data frame looks like this:
value
"{""id"":""d215e9f1-4d0c-42da-8f65-1f4ae72077b3"",""latitude"":""-63.571457254062715"",""longitude"":""-155.7055842710919""}"
"{""id"":""ca3d75b3-86e3-438f-b74f-c690e875ba52"",""latitude"":""-53.36506636464281"",""longitude"":""30.069167069917597""}"
"{""id"":""29e66862-9248-4af7-9126-6880ceb3b45f"",""latitude"":""-23.767505281795835"",""longitude"":""174.593140405442""}"
"{""id"":""451a7e21-6d5e-42c3-85a8-13c740a058a9"",""latitude"":""13.02054867061598"",""longitude"":""20.328402498420786""}"
"{""id"":""09d6c11d-7aae-4d17-8cd8-183157794893"",""latitude"":""-81.48976715040848"",""longitude"":""1.1995769642056189""}"
"{""id"":""393e8760-ef40-482a-a039-d263af3379ba"",""latitude"":""-71.73949722379649"",""longitude"":""112.59922770487054""}"
"{""id"":""d6db8fcf-ee83-41cf-9ec2-5c2909c18534"",""latitude"":""-4.034680969008576"",""longitude"":""60.59645511854336""}"
After I wrote it to Kafka I want to read it and transform the binary data from column "value" back to json string but the result is that the value contains only the id, not the whole string. Any ideea why?
from pyspark.sql import functions as F
df = consume_from_event_hub(topic, bootstrap_servers, config, consumer_group)
string_df = df.select(F.col("value").cast("string"))
string_df.display()
value
794541bc-30e6-4c16-9cd0-3c5c8995a3a4
20ea5b50-0baa-47e3-b921-f9a3ac8873e2
598d2fc1-c919-4498-9226-dd5749d92fc5
86cd5b2b-1c57-466a-a3c8-721811ab6959
807de968-c070-4b8b-86f6-00a865474c35
e708789c-e877-44b8-9504-86fd9a20ef91
9133a888-2e8d-4a5a-87ce-4a53e63b67fc
cd5e3e0d-8b02-45ee-8634-7e056d49bf3b
the CSV the format is this
id,latitude,longitude
bd6d98e1-d1da-4f41-94ba-8dbd8c8fce42,-86.06318155350924,-108.14300138138589
c39e84c6-8d7b-4cc5-b925-68a5ea406d52,74.20752175171859,-129.9453606091319
011e5fb8-6ab7-4ee9-97bb-acafc2c71e15,19.302250885973592,-103.2154291337162
You need to remove selectExpr("CAST(value AS STRING)") since to_json already returns a string column
from pyspark.sql.functions import col, to_json, struct
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(f'{file_location}')
kafka_df = df.select(to_json(struct(col("*"))).alias("value"))
kafka_df.show(truncate=False)
I'm not sure what's wrong with the consumer. That should have worked unless consume_from_event_hub does something specifically to extract the ID column
I am currently trying to import a big csv file (50GB+) without any headers into a pyarrow table with the overall target to export this file into the Parquet format and further to process it in a Pandas or Dask DataFrame. How can i specify the column names and column dtypes within pyarrow for the csv file?
I already thought about to append the header to the csv file. This enforces a complete rewrite of the file which looks like a unnecssary overhead. As far as I know, pyarrow provides schemas to define the dtypes for specific columns, but the docs are missing a concrete example for doing so while transforming a csv file to an arrow table.
Imagine that this csv file just has for an easy example the two columns "A" and "B".
My current code looks like this:
import numpy as np
import pandas as pd
import pyarrow as pa
df_with_header = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
print(df_with_header)
df_with_header.to_csv("data.csv", header=False, index=False)
df_without_header = pd.read_csv('data.csv', header=None)
print(df_without_header)
opts = pa.csv.ConvertOptions(column_types={'A': 'int8',
'B': 'int8'})
table = pa.csv.read_csv(input_file = "data.csv", convert_options = opts)
print(table)
If I print out the final table, its not going to change the names of the columns.
pyarrow.Table
1: int64
3: int64
How can I now change the loaded column names and dtypes? Is there maybe also a possibility to for example pass in a dict containing the names and their dtypes?
You can specify type overrides for columns:
fp = io.BytesIO(b'one,two,three\n1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
fp,
convert_options=csv.ConvertOptions(
column_types={
'one': pa.int8(),
'two': pa.int8(),
'three': pa.int8(),
}
))
But in your case you don't have a header, and as far as I can tell this use case is not supported in arrow:
fp = io.BytesIO(b'1,2,3\n4,5,6')
fp.seek(0)
table = csv.read_csv(
fp,
parse_options=csv.ParseOptions(header_rows=0)
)
This raises:
pyarrow.lib.ArrowInvalid: header_rows == 0 needs explicit column names
The code is here: https://github.com/apache/arrow/blob/3cf8f355e1268dd8761b99719ab09cc20d372185/cpp/src/arrow/csv/reader.cc#L138
This is similar to this question apache arrow - reading csv file
There should be fix for it in the next version: https://github.com/apache/arrow/pull/4898
I'm a new spark user currently playing around with Spark and some big data and I have a question related to Spark SQL or more formally the SchemaRDD. I'm reading a JSON file containing data about some weather forecasts and I'm not really interested in all of the fields that I have ... I only want 10 fields out of 50+ fields returned for each record. Is there a way (similar to filter) that I can use to specify the names of some fields that I want remove from spark.
Just a small descriptive example. Consider I have the Schema "Person" with 3 fields "Name", "Age", and "Gender" and I'm not interested in the "Age" field and wold like to remove it. Can I use spark some how to do that. ? Thanks
If you are using Spark 1.2, you can do the following (using Scala)...
If you already know what fields you want to use, you can construct the schema for these fields and apply this schema to the JSON dataset. Spark SQL will return a SchemaRDD. Then, you can register it and query it as a table. Here is a snippet...
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
// The schema is encoded in a string
val schemaString = "name gender"
// Import Spark SQL data types.
import org.apache.spark.sql._
// Generate the schema based on the string of schema
val schema =
StructType(
schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
// Create the SchemaRDD for your JSON file "people" (every line of this file is a JSON object).
val peopleSchemaRDD = sqlContext.jsonFile("people.txt", schema)
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Only values of name and gender fields will be in the results.
val results = sqlContext.sql("SELECT * FROM people")
When you look at the schema of peopleSchemaRDD (peopleSchemaRDD.printSchema()), you will only see name and gender field.
Or, if you want to explore the dataset and determine what fields you want after you see all fields, you can ask Spark SQL to infer the schema for you. Then, you can register the SchemaRDD as a table and use projection to remove unneeded fields. Here is a snippet...
// Spark SQL will infer the schema of the given JSON file.
val peopleSchemaRDD = sqlContext.jsonFile("people.txt")
// Check the schema of peopleSchemaRDD
peopleSchemaRDD.printSchema()
// Register peopleSchemaRDD as a table called "people"
peopleSchemaRDD.registerTempTable("people")
// Project name and gender field.
sqlContext.sql("SELECT name, gender FROM people")
You can specify what fields you would like to have in the schemaRDD. Below is an example. Create a case class, with only the fields that you need. Read the data into an rdd, then specify the only the fileds that you need(in the same order as you have specified the schema in the case class).
Sample Data: People.txt
foo,25,M
bar,24,F
Code:
case class Person(name: String, gender: String)
val people = sc.textFile("People.txt").map(_.split(",")).map(p => Person(p(0), p(2)))
people.registerTempTable("people")
Table structure like -
db.define_table('parent',
Field('name'),format='%(name)s')
db.define_table('children',
Field('name'),
Field('mother','reference parent'),
Field('father','reference parent'))
db.children.mother.requires = IS_IN_DB(db, db.parent.id,'%(name)s')
db.children.father.requires = IS_IN_DB(db, db.parent.id,'%(name)s')
Controller :
grid = SQLFORM.grid(db.children, orderby=[db.children.id],
csv=True,
fields=[db.children.id, db.children.name, db.children.mother, db.children.father])
return dict(grid=grid)
Here grid shows proper values i.e names of the mother and father from the parent table.
But when I try to export it via csv link - resulted excelsheet shows ids and not the names of mother and father.
Please help!
The CSV download just gives you the raw database values without first applying each field's represent attribute. If you want the "represented" values of each field, you have two options. First, you can choose the TSV (tab-separated-values) download instead of CSV. Second, you can define a custom export class:
import cStringIO
class CSVExporter(object):
file_ext = "csv"
content_type = "text/csv"
def __init__(self, rows):
self.rows = rows
def export(self):
if self.rows:
s = cStringIO.StringIO()
self.rows.export_to_csv_file(s, represent=True)
return s.getvalue()
else:
return ''
grid = SQLFORM.grid(db.mytable, exportclasses=dict(csv=(CSVExporter, 'CSV')))
The exportclasses argument is a dictionary of custom download types that can be used to override existing types or add new ones. Each item is a tuple including the exporter class and the label to be used for the download link in the UI.
We should probably add this as an option.
BACKGROUND:
I am having issues trying to search through some CSV files.
I've gone through the python documentation: http://docs.python.org/2/library/csv.html
about the csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds) object of the csv module.
My understanding is that the csv.DictReader assumes the first line/row of the file are the fieldnames, however, my csv dictionary file simply starts with "key","value" and goes on for atleast 500,000 lines.
My program will ask the user for the title (thus the key) they are looking for, and present the value (which is the 2nd column) to the screen using the print function. My problem is how to use the csv.dictreader to search for a specific key, and print its value.
Sample Data:
Below is an example of the csv file and its contents...
"Mamer","285713:13"
"Champhol","461034:2"
"Station Palais","972811:0"
So if i want to find "Station Palais" (input), my output will be 972811:0. I am able to manipulate the string and create the overall program, I just need help with the csv.dictreader.I appreciate any assistance.
EDITED PART:
import csv
def main():
with open('anchor_summary2.csv', 'rb') as file_data:
list_of_stuff = []
reader = csv.DictReader(file_data, ("title", "value"))
for i in reader:
list_of_stuff.append(i)
print list_of_stuff
main()
The documentation you linked to provides half the answer:
class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)
[...] maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames.
It would seem that if the fieldnames parameter is passed, the given file will not have its first record interpreted as headers (the parameter will be used instead).
# file_data is the text of the file, not the filename
reader = csv.DictReader(file_data, ("title", "value"))
for i in reader:
list_of_stuff.append(i)
which will (apparently; I've been having trouble with it) produce the following data structure:
[{"title": "Mamer", "value": "285713:13"},
{"title": "Champhol", "value": "461034:2"},
{"title": "Station Palais", "value": "972811:0"}]
which may need to be further massaged into a title-to-value mapping by something like this:
data = {}
for i in list_of_stuff:
data[i["title"]] = i["value"]
Now just use the keys and values of data to complete your task.
And here it is as a dictionary comprehension:
data = {row["title"]: row["value"] for row in csv.DictReader(file_data, ("title", "value"))}
The currently accepted answer is fine, but there's a slightly more direct way of getting at the data. The dict() constructor in Python can take any iterable.
In addition, your code might have issues on Python 3, because Python 3's csv module expects the file to be opened in text mode, not binary mode. You can make your code compatible with 2 and 3 by using io.open instead of open.
import csv
import io
with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
data = dict(csv.reader(f))
print(data['Champhol'])
As a warning, if your csv file has two rows with the same value in the first column, the later value will overwrite the earlier value. (This is also true of the other posted solution.)
If your program really is only supposed to print the result, there's really no reason to build a keyed dictionary.
import csv
import io
# Python 2/3 compat
try:
input = raw_input
except NameError:
pass
def main():
# Case-insensitive & leading/trailing whitespace insensitive
user_city = input('Enter a city: ').strip().lower()
with io.open('anchor_summary2.csv', 'r', newline='', encoding='utf-8') as f:
for city, value in csv.reader(f):
if user_city == city.lower():
print(value)
break
else:
print("City not found.")
if __name __ == '__main__':
main()
The advantage of this technique is that the csv isn't loaded into memory and the data is only iterated over once. I also added a little code the calls lower on both the keys to make the match case-insensitive. Another advantage is if the city the user requests is near the top of the file, it returns almost immediately and stops looking through the file.
With all that said, if searching performance is your primary consideration, you should consider storing the data in a database.