I want to analyse some spectral data. I have ~6,500 CSV files, each containing data in the format shown in the attached images. How can I transpose all of the CSV files so that I can then combine them in Power Query? Thank you!
This will read in all .csv files in a directory, then transpose and combine them for you.
Use Home > Advanced Editor to paste the code into Power Query, and edit the second line with the appropriate directory path.
Based on Alexis Olson's recent answer: Reading the first n rows of a csv without parsing the whole file in Power Query.
let
    // List all files in the folder and keep only the .csv files
    Source = Folder.Files("C:\directory\subdirectory\"),
    #"Filtered Rows" = Table.SelectRows(Source, each ([Extension] = ".csv")),
    #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows", {"Content", "Name"}),
    // Parse each file's contents as CSV and transpose it
    #"Invert" = Table.TransformColumns(#"Removed Other Columns", {{"Content", each Table.Transpose(Csv.Document(_))}}),
    // The widest transposed table determines how many columns to expand
    MaxColumns = List.Max(List.Transform(#"Invert"[Content], each Table.ColumnCount(_))),
    #"Expanded Content" = Table.ExpandTableColumn(#"Invert", "Content", List.Transform({1..MaxColumns}, each "Column" & Number.ToText(_))),
    #"Promoted Headers" = Table.PromoteHeaders(#"Expanded Content", [PromoteAllScalars=true]),
    // Drop the repeated header rows contributed by every file after the first
    #"Filtered Rows1" = Table.SelectRows(#"Promoted Headers", each ([Date] <> "Date"))
in
    #"Filtered Rows1"
Related
I'm trying to import a JSON file with the following format into Excel:
[
[1,2,3,4],
[5,6,7,8]
]
I want to get a spreadsheet with 2 rows and 4 columns, where each row contains the contents of the inner array as separate column values, e.g.
Column A   Column B   Column C   Column D
1          2          3          4
5          6          7          8
Although this would seem to be an easy problem to solve, I can't seem to find the right Power Query syntax or locate an existing answer that covers this scenario. I can easily import the file as a single column with 8 values, but I can't seem to split the inner arrays into separate columns.
Assuming the JSON looks like
[
[1,2,3,4],
[5,6,7,8]
]
then this code in Power Query
let
    Source = Json.Document(File.Contents("C:\temp\j.json")),
    // Turn the outer list into a single-column table of inner lists
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    // Render each inner list as comma-separated text
    #"Added Custom" = Table.AddColumn(#"Converted to Table", "Custom", each Text.Combine(List.Transform([Column1], each Text.From(_)), ",")),
    // Build column names dynamically by counting the maximum number of commas in any row
    ColumnTitles = List.Transform({1 .. List.Max(Table.AddColumn(#"Added Custom", "count", each List.Count(Text.PositionOfAny([Custom], {","}, Occurrence.All)) + 1)[count])}, each "Column." & Text.From(_)),
    #"Split Column by Delimiter" = Table.SplitColumn(#"Added Custom", "Custom", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), ColumnTitles),
    #"Removed Columns" = Table.RemoveColumns(#"Split Column by Delimiter", {"Column1"})
in
    #"Removed Columns"
generates the desired result: a table with two rows and four columns, as shown above.
It converts the JSON to a list of lists, converts that into a table of lists, renders each list as comma-separated text, and then splits that text into columns, after dynamically creating column names for the new columns by counting the maximum number of commas.
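Alternatively, since Json.Document already returns the outer array as a list of lists, Table.FromRows can build the table in one step. This is a minimal sketch, assuming the same file path and the four column names from the question; it also assumes every inner list has the same length:

let
    Source = Json.Document(File.Contents("C:\temp\j.json")),
    // Each inner list becomes one table row
    Result = Table.FromRows(Source, {"Column A", "Column B", "Column C", "Column D"})
in
    Result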
I have a table containing other tables in its values. These other tables can be formatted either as CSV or JSON.
Can you please tell me how I can import this data into individual tables in Power BI? I've tried with the Power Query GUI, but so far unsuccessfully; perhaps the code needs to be written in the Advanced Editor.
I can't just parse this data outside Power BI because of company guidelines prohibiting the use of scripts, so everything must be done within Power BI, though Power Query code is allowed.
CSV:
"id,normdist,poissondist,binomial\r\n1,0.00013383,0.033689735,0.009765625\r\n2,0.004431848,0.084224337,0.043945313\r\n3,0.053990967,0.140373896,0.1171875\r\n4,0.241970725,0.17546737,0.205078125\r\n5,0.39894228,0.17546737,0.24609375\r\n6,0.241970725,0.146222808,0.205078125\r\n7,0.053990967,0.104444863,0.1171875\r\n8,0.004431848,0.065278039,0.043945313\r\n9,0.00013383,0.036265577,0.009765625\r\n10,1.49E-06,0.018132789,0.000976563\r\n"
JSON (by row):
[{"id":1,"normdist":0.0001,"poissondist":0.0337,"binomial":0.0098},{"id":2,"normdist":0.0044,"poissondist":0.0842,"binomial":0.0439},{"id":3,"normdist":0.054,"poissondist":0.1404,"binomial":0.1172},{"id":4,"normdist":0.242,"poissondist":0.1755,"binomial":0.2051},{"id":5,"normdist":0.3989,"poissondist":0.1755,"binomial":0.2461},{"id":6,"normdist":0.242,"poissondist":0.1462,"binomial":0.2051},{"id":7,"normdist":0.054,"poissondist":0.1044,"binomial":0.1172},{"id":8,"normdist":0.0044,"poissondist":0.0653,"binomial":0.0439},{"id":9,"normdist":0.0001,"poissondist":0.0363,"binomial":0.0098},{"id":10,"normdist":1.49e-06,"poissondist":0.0181,"binomial":0.001}]
Let's say that the data is the CSV version, but stored as just a string in a database, so that it appears as a single text column in the Query Editor.
In order to expand this into a table, we need to split it into rows and columns. The Home tab has a Split Column tool; we'll use the By Delimiter option from its dropdown. That is, we use "\r\n" as the delimiter to split the cell into multiple rows.
Now our column contains one line of the original CSV text per row.
Remove any blank rows and use the Split Column tool again. This time, you can leave the defaults, since it will automatically guess that you want to split by comma into columns.
If you promote the headers and clean up the column types, the final result should be the fully expanded, correctly typed table.
Full M Query for this example that you can paste into the Advanced Editor:
let
    Source = Table.FromRows(Json.Document(Binary.Decompress(Binary.FromText("7ZFJbsMwDEXvkrUacBKHA/QUcRYpshHQ2EXc+6OS4mzkC3Th3Qf5+ThdLqdyT/PyfNzL+pt+lrKuy9z1V5mXR7l9T9NzmmZMcAYAZHZuklk9jHMPh2lWyi8n9ZAIo4s37UIkzNa0cEhm5Je1kzJHQGhLowAbe2jTaOi2MaUGSDAMjFpLtCxqHUmQwRzf3VuWw6P29MEoCsFvoo5EUaol4HukjVPW5URceZzSx801kzlw7DeP8ZxKmrPZ/pwICc8Snx/QRgZ05Ard6rtzQ56u6Xjm8czjmf/wmdc/", BinaryEncoding.Base64), Compression.Deflate)), let _t = ((type nullable text) meta [Serialized.Text = true]) in type table [Column1 = _t]),
    #"Split by \r\n into Rows" = Table.ExpandListColumn(Table.TransformColumns(Source, {{"Column1", Splitter.SplitTextByDelimiter("\r\n", QuoteStyle.Csv), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Column1"),
    #"Filtered Blank Rows" = Table.SelectRows(#"Split by \r\n into Rows", each [Column1] <> null and [Column1] <> ""),
    #"Split into Columns" = Table.SplitColumn(#"Filtered Blank Rows", "Column1", Splitter.SplitTextByDelimiter(",", QuoteStyle.Csv), {"Column1.1", "Column1.2", "Column1.3", "Column1.4"}),
    #"Promoted Headers" = Table.PromoteHeaders(#"Split into Columns", [PromoteAllScalars=true]),
    #"Filtered Repeated Headers" = Table.SelectRows(#"Promoted Headers", each ([id] <> "id")),
    #"Changed Type" = Table.TransformColumnTypes(#"Filtered Repeated Headers",{{"id", Int64.Type}, {"normdist", type number}, {"poissondist", type number}, {"binomial", type number}})
in
    #"Changed Type"
I am new to Python and trying to extract certain data from rows of a csv file into individual .txt files (to create a corpus for NLP). So far I have the following:
import csv

# Read the whole CSV into a list of rows
with open(r"file.csv", "r", encoding="utf-8") as f:
    data = list(csv.reader(f))

t = data[1][91]        # the text to extract (column index 91)
fn = str(data[1][90])  # the file name to write to (column index 90)

with open("%s.txt" % fn, "w+") as g:
    g.write(t)
This does what I want for the first row; however, I am not sure how to get the program to loop up to row 1047. Note: the [1] signifies row 1; the [91] and [90] indices should remain fixed.
Thanks in advance!
I need to convert a bunch (23) of CSV files (source: S3) into Parquet format. All of the input CSV files contain headers. When I generated code for this using Glue, the output contained the other 22 header rows as separate data rows, which means only the first header was skipped. I need help ignoring all of the headers while doing this transformation.
Since I'm using the from_catalog function for my input, I don't have any format_options with which to ignore the header rows.
Also, can I set an option on the Glue table indicating that a header is present in the files? Will that automatically skip the header rows when my job runs?
Part of my current approach is below. I'm new to Glue; this code was actually auto-generated by Glue.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
datasink1 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://my-bucket-name/full/s3/path-parquet"}, format = "parquet", transformation_ctx = "datasink1")
I faced the exact same issue while working on an ETL job that used AWS Glue.
The documentation for from_catalog says:
additional_options – A collection of optional name-value pairs. The possible options include those listed in Connection Types and Options for ETL in AWS Glue except for endpointUrl, streamName, bootstrap.servers, security.protocol, topicName, classification, and delimiter.
I tried the snippet below, and several permutations of it, with from_catalog, but nothing worked for me.
additional_options = {"format": "csv", "format_options": '{"withHeader": "True"}'},
One way to go about fixing this is by using from_options instead of from_catalog and pointing directly to the S3 bucket or folder. This is what it should look like:
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://bucket_name/folder_name"],
        "recurse": True,
        "groupFiles": "inPartition",
    },
    format="csv",
    format_options={
        "withHeader": True,
    },
    transformation_ctx="datasource0",
)
But if you can't do this for any reason and want to stick with from_catalog, using a filter worked for me.
Assuming that one of your column headers is name, this is what the snippet can look like:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "my_datalake", table_name = "my-csv-files", transformation_ctx = "datasource0")
filtered_df = Filter.apply(frame = datasource0, f = lambda x: x["name"] != "name")
I'm not very sure how Spark's DataFrames or Glue's DynamicFrames deal with CSV headers, or why data read from the catalog had headers in the rows as well as in the schema, but this seemed to solve my issue by removing the header values from the rows.
I want to parse a JSON column in Power BI. I have imported the data directly from the server, and the data contains a JSON column along with other columns. Is there a way to parse this JSON column?
Example:
Key  IDNumber  Module  JsonResult
012  200       Dine    {"CategoryType":"dining","City":"mumbai","Location":"all"}
97   303       Fly     {"JourneyType":"Return","Origin":"Mumbai (BOM)","Destination":"Chennai (MAA)","DepartureDate":"20-Oct-2016","ReturnDate":"21-Oct-2016","FlyAdult":"1","FlyChildren":"0","FlyInfant":"0","PromoCode":""}
276  6303      Stay    {"Destination":"Clarion Chennai","CheckInDate":"14-Oct-2016","CheckOutDate":"15-Oct-2016","Rooms":"1","NoOfPax":"2","NoOfAdult":"2","NoOfChildren":"0"}
I wish to retain the other columns and also get the simplified parsed columns.
There is an easier way to do it in the Query Editor, on the column you want to read as JSON:
Right-click on the column.
Select Transform > JSON.
The column then becomes a Record, which you can split into every property of the JSON using the expand button in the top-right corner of the column header.
Use the Json.Document function, like this:
let
    ...
    your_table = imported_the_data_directly_from_the_server,
    // Parse the JSON text in each row into a record
    json = Table.AddColumn(your_table, "NewColName", each Json.Document([JsonResult]))
in
    json
And then expand the record column into a table using Table.ExpandRecordColumn.
Or by clicking the expand button at the top of the new column.
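For reference, a minimal sketch of the expand step, assuming the field names from the Dine row above; list every field you need across all rows, and note that fields missing from a given row's record come through as null:

expanded = Table.ExpandRecordColumn(json, "NewColName",
    {"CategoryType", "City", "Location"},
    {"CategoryType", "City", "Location"})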
Use the Json.Document() function to convert a string into JSON data.
let
    Source = Json.Document(Json.Document(Web.Contents("http://localhost:18091/pools/default/buckets/Aggregation/docs/AvgSumAssuredByProduct"))[json]),
    #"Converted to Table" = Record.ToTable(Source),
    #"Filtered Rows" = Table.SelectRows(#"Converted to Table", each not Text.Contains([Name], "type_")),
    #"Renamed Columns" = Table.RenameColumns(#"Filtered Rows", {{"Name", "AvgSumAssuredByProduct"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns", {{"Value", type number}})
in
    #"Changed Type"
import json
from urllib.request import urlopen

names = []
values = []

# Fetch the most recent entry from the ThingSpeak channel
response = urlopen('https://api.thingspeak.com/channels/193888/fields/1.json?results=1')
data = json.load(response)

for feed in data['feeds']:
    names.append(feed['entry_id'])
    values.append(feed['field1'])

print(names[0])
print(values[0])
This Python code may be useful for you. Example output:
270
1035