Power Query does not recognize tab as a delimiter in .txt files in the code - multiple-columns

This is my first post here, so I apologize in advance if the question has already been answered somewhere or if I am doing something wrong. To summarize the problem:
I am doing some spectroscopy measurements and the data from the software I am using is saved in hundreds of .txt files. All files have the same content: first column refers to the wavelength, the second column is the intensity. Columns are separated from one another with a tab. The idea is to insert all of these .txt files in Power Query, rearrange the columns so there is only one column with the wavelength (since it is always the same for all measurements), and the remaining columns would be intensities (second column) of all inserted files.
Therefore, the desired output should look like this:
Wavelength (1st file), intensity (1st file), intensity (2nd file), intensity (3rd file),..., intensity (last file).
I found this brilliant solution, but the issue is that it works flawlessly only if the columns are separated by a comma. I tried changing the code so it recognizes the tab, but nothing I tried worked. I only found out about Power Query yesterday, so I am a total beginner here. Here is the code:
let
Source = Folder.Files("C:\Users\xxxxx\Desktop\new"),
// Standard UI; step renamed
FilteredTxt = Table.SelectRows(Source, each [Extension] = ".txt"),
// Standard UI; step renamed
RemovedColumns = Table.RemoveColumns(FilteredTxt,{"Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}),
// UI add custom column "FileContents" with formula Csv.Document([Content]); step renamed
AddedFileContents = Table.AddColumn(RemovedColumns, "FileContents", each Csv.Document([Content])),
// Standard UI; step renamed
RemovedBinaryContent = Table.RemoveColumns(AddedFileContents,{"Content"}),
// In the next 3 steps, temporary names for the new columns are created ("Column2", "Column3", etcetera)
// Standard UI: add custom Index column, start at 2, increment 1
#"Added Index" = Table.AddIndexColumn(RemovedBinaryContent, "Index", 2, 1),
// Standard UI: select Index column, Transform tab, Format, Add Prefix: "Column"
#"Added Prefix" = Table.TransformColumns(#"Added Index", {{"Index", each "Column" & Text.From(_, "en-US"), type text}}), //type text
// Standard UI:
#"Renamed Columns" = Table.RenameColumns(#"Added Prefix",{{"Index", "ColumnName"}}),
// Now we have the names for the new columns
// Advanced Editor: create a list with records with FileContents (tables) and ColumnNames (text) (1 list item (or record) per txt file in the folder)
// From this list, the resulting table will be built in the next step.
ListOfRecords = Table.ToRecords(#"Renamed Columns"),
// Advanced Editor: use List.Accumulate to build the table with all columns,
// starting with Column1 of the first file (Table.FromList(ListOfRecords{0}[FileContents][Column1], each {_}),)
// adding Column2 of each file for all items in ListOfRecords.
BuildTable = List.Accumulate(ListOfRecords,
Table.FromList(ListOfRecords{0}[FileContents][Column1], each{_}),
(TableSoFar,NewColumn) =>
Table.ExpandTableColumn(Table.NestedJoin(TableSoFar, "Column1", NewColumn[FileContents], "Column1", "Dummy", JoinKind.LeftOuter), "Dummy", {"Column2"}, {NewColumn[ColumnName]})),
#"Sorted Rows" = Table.Sort(BuildTable,{{"Column1", Order.Ascending}})
in
#"Sorted Rows"
//each {_}
//Splitter.SplitTextByWhitespace
This is the output I get when I run the code:
and if I change the first five rows of the .txt files so there is a comma between the columns, I get this:
The desired output (first five rows)
I was trying to change the each{_} in the Table.FromList line towards the end with the Splitter function, but it was not working.
I would be very grateful if you could take a look at the code, and suggest what needs to be changed in order for it to work.
Cheers!

Modify your code to insert the #"Added Prefix2" step as below, and point the #"Renamed Columns" step at it:
#"Added Prefix" = Table.TransformColumns(#"Added Index", {{"Index", each "Column" & Text.From(_, "en-US"), type text}}), //type text
#"Added Prefix2" = Table.TransformColumns(#"Added Prefix" , {{"FileContents", each Table.SplitColumn(_, "Column1", Splitter.SplitTextByEachDelimiter({"#(tab)"}, QuoteStyle.Csv, false), {"Column1", "Column2"})}}),
// Standard UI:
#"Renamed Columns" = Table.RenameColumns(#"Added Prefix2",{{"Index", "ColumnName"}}),
I prefer this version when I do something similar. It is more compact and preserves the file names of the source files:
let Source = Folder.Files("C:\directory\subdirectory"),
#"Filtered Rows" = Table.SelectRows(Source, each ([Extension] = ".txt")),
#"Added Custom1" = Table.AddColumn(#"Filtered Rows", "Custom", each Csv.Document(File.Contents([Folder Path]&"\"&[Name]),[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None])),
#"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom1", "Custom", {"Column1"}, {"Column1"}),
#"Split Column by Delimiter" = Table.SplitColumn(#"Expanded Custom", "Column1", Splitter.SplitTextByEachDelimiter({"#(tab)"}, QuoteStyle.Csv, false), {"Column1", "Column2"}),
#"Removed Other Columns1" = Table.SelectColumns(#"Split Column by Delimiter",{"Name", "Column1", "Column2"}),
#"Pivoted Column" = Table.Pivot(#"Removed Other Columns1", List.Distinct(#"Removed Other Columns1"[Name]), "Name", "Column2")
in #"Pivoted Column"
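For readers who want to check the expected result outside Power Query, here is a minimal Python sketch of the same reshaping. It is not part of either answer above. The function name `combine_spectra` and the in-memory `files` mapping are hypothetical, and the sketch assumes every file has two tab-separated columns with the same wavelengths, as described in the question. On the Power Query side, `Csv.Document` also accepts a `Delimiter` option (the second query above sets it to a comma), so passing `Delimiter="#(tab)"` there may be a simpler fix than splitting afterwards.

```python
import csv
import io


def combine_spectra(files):
    """Combine two-column, tab-separated texts into one table.

    files: mapping of file name -> file text (hypothetical in-memory input).
    Returns a list of rows: a header, then one row per wavelength of the
    first file, followed by the intensity found in every file.
    """
    columns = {}
    for name, text in files.items():
        rows = csv.reader(io.StringIO(text), delimiter="\t")
        # Map wavelength -> intensity for this file, skipping malformed rows.
        columns[name] = {row[0]: row[1] for row in rows if len(row) >= 2}

    # Wavelengths are taken from the first file (like a left outer join).
    first = next(iter(columns.values()))
    header = ["Wavelength"] + list(columns)
    body = [[w] + [col.get(w, "") for col in columns.values()] for w in first]
    return [header] + body


sample = {
    "a.txt": "400\t1.0\n401\t2.0\n",
    "b.txt": "400\t3.0\n401\t4.0\n",
}
table = combine_spectra(sample)
```

With the sample input, `table` holds the header `Wavelength, a.txt, b.txt` followed by one row per wavelength, which matches the layout asked for in the question.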

Related

Split sentences by Case change where two words are "stuck" together

I am attempting to clean up the following data which has been extracted from HTML.
Some sentences haven't quite split correctly with the Capitalised word at the start of one sentence "stuck" to the preceding word.
The image below illustrates what I am trying to achieve:
So in essence if there is a sentence like: The boy plays with the ballThe Girl plays with the Console in a row. This would split to:
The boy plays with the ball
The Girl plays with the Console
M code so far with the actual data (it must be run in Power BI, as it uses the Html.Table function, which is not available in Excel):
let
Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://echa.europa.eu/registration-dossier/-/registered-dossier/14184/7/1"))}),
#"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Contains([Column1], "General Population - Hazard via oral route") then [Column1] else null),
#"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] <> null)),
#"Kept Last Rows" = Table.LastN(#"Filtered Rows", 1),
#"Removed Other Columns" = Table.SelectColumns(#"Kept Last Rows",{"Custom"}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("</dd><dt>", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
#"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter", "Text", each Html.Table([Custom], {{"Custom",":root"}})),
#"Expanded Text" = Table.ExpandTableColumn(#"Added Custom1", "Text", {"Custom"}, {"Custom.1"})
in
#"Expanded Text"
Image still looks incorrect (informationOverall is not split) but if you want to split by character transition, you can do so from the ribbon.
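To make the idea of splitting on a character transition concrete, here is a small Python sketch using a regular expression. It illustrates the technique only and is not the Power Query ribbon command itself. The function name `split_on_case_change` is hypothetical, and the pattern assumes the only unwanted joins are a lowercase letter immediately followed by an uppercase one.

```python
import re


def split_on_case_change(text):
    """Split wherever a lowercase letter is directly followed by an uppercase one."""
    # Zero-width match: lookbehind for a-z, lookahead for A-Z (Python 3.7+).
    return re.split(r"(?<=[a-z])(?=[A-Z])", text)


parts = split_on_case_change(
    "The boy plays with the ballThe Girl plays with the Console"
)
stuck = split_on_case_change("informationOverall")
```

Words that start with a capital after a space, such as "Girl" and "Console", are left alone because the character before them is a space rather than a lowercase letter.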

Splitting CSV column data into new CSV file using VBScript

I have a CSV file where 2 columns contain several different text values e.g.
Column 1: Reptiles, Health, Hygiene
Column 2: Purity
I need to use VBscript to split these columns into a new CSV file without changing the current file, expected output in new CSV file shown below:
Column 1 Column 2
Reptiles Reptiles
Health Health
Hygiene Hygiene
Purity Purity
Unfortunately(?) it must be done with VB Script and nothing else.
Here is an example of how the data looks (the data repeats consistently, with some extra entries, through the same columns in file 1):
And here is an example of how it needs to look. It needs to repeat down until every unique entry from Column 1 and Column 2 in the original file has been written as a single entry to Column 1 in the new file and copied to Column 2 in the same new file, e.g.:
Examples in text format as requested:
Original file:
Column 1,Column 2
"Reptiles, Health, Hygiene",Purity
New File:
Column 1,Column 2
Reptiles,Reptiles
Health,Health
Hygiene,Hygiene
Purity,Purity
I think this is a simple matter of using the FileSystemObject with Split function.
Assuming each input line is just one set of data you can remove the double quotes and process from there
Try this VB script out (edited to process header line separately):
Const Overwrite = True
Set ObjFso = CreateObject("Scripting.FileSystemObject")
Set ObjOutFile = ObjFso.CreateTextFile("My New File Path", Overwrite)
Set ObjInFile = ObjFso.OpenTextFile("My Old File Path")
' Skip processing first header line and just write it out as is
strLine = ObjInFile.ReadLine
ObjOutFile.WriteLine strLine
Do Until ObjInFile.AtEndOfStream
    ' Remove all double quotes to treat this as one set of data
    strLine = Replace(ObjInFile.ReadLine, """", "")
    varData = Split(strLine, ",")
    ' Write out each element twice into its own line, trimming the space left after each comma
    For i = 0 To UBound(varData)
        ObjOutFile.WriteLine Trim(varData(i)) & "," & Trim(varData(i))
    Next ' VBScript does not accept a counter variable after Next
Loop
ObjInFile.Close
ObjOutFile.Close

Excel Web Powerquery: Excel merges data strings in cells --> How do I delimit the data?

I am using Excel 2016 and would like to download Odds from Oddschecker.com via the Web Powerquery function into an Excel Spreadsheet.
More specifically, I am trying to download the data from this Website:
https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history
The problem I have is that some odds on this Website are being merged without space between them into single cells:
Is there any way in Powerquery to delimit the data strings/odds so that they are not being merged?
Thank you very much in advance for any kind of help.
Another approach in the code below using recursive function fnSearchTR (embedded in the query) to drill down the HTML document until the name "TR" is found (or after 100 iterations just to prevent endless iterating). I noticed that this is the place where the required data is located, at least today.
Remark: I also adjusted the second step in the code to select the "Document".
This is a more dynamic solution as it doesn't matter where in the document structure the "TR" is located; otherwise if the document structure is adjusted, then it is still possible that other "TR"'s are found first, but so far it works.
Otherwise also "TR"'s are found with other content, but these will be filtered out as errors or null values after the data type of the first column is adjusted to date.
This query also uses the function "ExpandTables" from my previous answer (I corrected the typo and added a "x", otherwise no changes in the function).
let
Source = Web.Page(Web.Contents("https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history")),
Data0 = Table.SelectRows(Source, each [Caption] = "Document"){0}[Data],
ChildrenWithTable = Table.SelectRows(Data0, each [Children] is table),
fnSearchTR = (newChildren as table, counter as number) as table =>
let
Combined = Table.Buffer(Table.Combine(newChildren[Children])),
ChildrensChildrenWithTable = Table.AddColumn(newChildren, "ChildrensChildren", each Table.SelectRows([Children], each [Children] is table)),
ChildrensChildrenCombined = Table.Combine(ChildrensChildrenWithTable[ChildrensChildren]),
CombinedAll = if ChildrensChildrenCombined[Name]{0} = "TR"
then ChildrensChildrenCombined
else if Table.RowCount(ChildrensChildrenCombined) = 0 or counter = 100
then Combined
else @fnSearchTR(ChildrensChildrenCombined, counter + 1)
in
CombinedAll,
CombinedAll = if Table.RowCount(ChildrenWithTable) = 0 then Data0 else fnSearchTR(ChildrenWithTable, 0),
#"Filtered Rows" = Table.SelectRows(CombinedAll, each ([Name] = "TR")),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Children"}),
#"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns", "ExpandTables", each ExpandTables([Children])),
#"Removed Columns" = Table.RemoveColumns(#"Invoked Custom Function",{"Children"}),
#"Expanded ExpandTables" = Table.ExpandTableColumn(#"Removed Columns", "ExpandTables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded ExpandTables",{{"Column1", type date}}),
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Changed Type", {"Column1"}),
#"Filtered Rows1" = Table.SelectRows(#"Removed Errors", each ([Column1] <> null))
in
#"Filtered Rows1"
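The level-by-level drill-down that fnSearchTR performs can be sketched in Python as follows. This is a simplified illustration, not a translation of the query. The nested-dictionary tree, the function name `search_level` and its parameters are all hypothetical stand-ins for the table that Web.Page returns.

```python
def search_level(nodes, target="TR", max_depth=100, depth=0):
    """Descend one level at a time until a level starts with a `target` node.

    nodes: list of dicts with a "name" and a "children" list (hypothetical shape).
    Returns the first level whose first node is named `target`, or the last
    level reached when nothing is found or the depth limit is hit.
    """
    children = [child for node in nodes for child in node.get("children", [])]
    if children and children[0]["name"] == target:
        return children
    if not children or depth == max_depth:
        # Nothing left to descend into, or guard against endless recursion.
        return children
    return search_level(children, target, max_depth, depth + 1)


tree = [
    {"name": "HTML", "children": [
        {"name": "BODY", "children": [
            {"name": "TABLE", "children": [
                {"name": "TR", "children": []},
                {"name": "TR", "children": []},
            ]},
        ]},
    ]},
]
found = search_level(tree)
names = [node["name"] for node in found]
```

As in the query, the depth limit only guards against runaway recursion. Rows found under the target name still need to be filtered afterwards.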
Although I can't test it, since this site is blacklisted in the Russian Internet segment, I suppose there are <cr>s or <lf>s in the cells, and they aren't transformed to new lines.
What you need is to run Text.Replace against all cells with data to replace these characters.
But then you'll probably need these values as separate rows, and this is a far more complex task. :)
Inspired by Gil Raviv's http://datachant.com/2017/03/30/web-scraping-power-bi-excel-power-query/
Edit April 11, 2017: this solution is highly dependent on the structure of the website, or in other words: yesterday it worked fine, but today it doesn't, unfortunately.
The following query with its associated function works for me:
let
Source = Web.Page(Web.Contents("https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history")),
Data0 = Source{1}[Data],
Children = Data0{0}[Children],
Children1 = Children{1}[Children],
Children2 = Children1{4}[Children],
Children3 = Children2{0}[Children],
Children4 = Children3{0}[Children],
Children5 = Children4{0}[Children],
Children6 = Children5{3}[Children],
Children7 = Children6{0}[Children],
Children8 = Children7{1}[Children],
Children9 = Children8{3}[Children],
Children10 = Children9{0}[Children],
Children11 = Children10{2}[Children],
Children12 = Children11{2}[Children],
Children13 = Children12{0}[Children],
Children14 = Children13{1}[Children],
#"Removed Other Columns" = Table.SelectColumns(Children14,{"Children"}),
#"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns", "ExpandTables", each ExpandTables([Children])),
#"Expanded ExpandTables" = Table.ExpandTableColumn(#"Invoked Custom Function", "ExpandTables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded ExpandTables",{"Children"}),
#"Removed Blank Rows" = Table.SelectRows(#"Removed Columns", each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null}))),
#"Parsed Date" = Table.TransformColumns(#"Removed Blank Rows",{{"Column1", each Date.From(DateTimeZone.From(_)), type date}})
in
#"Parsed Date"
Function ExpandTables (edit: #"Added Custom" line adjusted by adding Table.SelectRows)
(ChildTable as table) =>
let
#"Removed Other Columns1" = Table.SelectColumns(ChildTable,{"Children"}),
#"Added Custom" = Table.AddColumn(#"Removed Other Columns1", "Custom", each try if [Children] is null then null else if [Children][Text]{0} <> null then [Children][Text]{0} else Lines.ToText(List.Transform(Table.SelectRows([Children], each [Children] <> null)[Children], each _[Text]{0})) otherwise null),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Children"}),
#"Transposed Table" = Table.Transpose(#"Removed Columns")
in
#"Transposed Table"
The problem is the HTML for one of the combined cells is:
<td><div class="oo">11/4</div><div class="oi">13/5</div><div class="oo">11/4</div></td>
As far as I know, div layout rules don't imply a newline, so Power Query doesn't insert one. We don't run a full layout engine, so we don't know that the column width means each div should be on its own line.
(If anybody knows more about HTML layout semantics, let me know and I can suggest a fix to my team.)
You can text-replace the HTML like this to inject your own delimiter ; in between the div elements
let
WebPageWithReplace = (url as text, old as text, new as text) =>
let
Source = Web.Contents(url),
TextReplace = Text.ToBinary(Text.Replace(Text.FromBinary(Source), old, new)),
Page = Web.Page(TextReplace)
in
Page,
Invoked = WebPageWithReplace(
"https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history",
"</div><div",
"</div>;<div"),
Data = Invoked{1}[Data]
in
Data
And that way Web.Page will still find and parse the HTML table.
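The effect of injecting a delimiter before parsing can be reproduced in Python with the standard library. This is a sketch of the idea only and is independent of how Web.Page works internally. The `CellText` class and `cell_texts` helper are hypothetical names, and the sample cell is the one quoted above.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the concatenated text of every <td> cell."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self.cells[-1] += data


def cell_texts(html):
    parser = CellText()
    parser.feed(html)
    parser.close()
    return parser.cells


cell = ('<td><div class="oo">11/4</div><div class="oi">13/5</div>'
        '<div class="oo">11/4</div></td>')
merged = cell_texts(cell)
delimited = cell_texts(cell.replace("</div><div", "</div>;<div"))
```

Without the replacement the three odds run together in one string. With it they arrive separated by `;` and can be split into columns afterwards.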

Horizontal append in for loop?

I have a for loop iterating over a folder of one column csv's using glob, it makes some adjustments and then appends the results to a list and saves to a new csv, it resembles:
data = []
infiles = glob.glob("*.csv")
for file in infiles:
    df = pd.io.parsers.read_csv(file)
    # (assorted adjustments)
    data.append(df)
fullpanel = pd.concat(panel)
fullpanel.to_csv('data.csv')
The problem is that makes one long column, I need each column (of differing lengths) added next to each other.
I think you can add parameter axis=1 to concat for columns added next to each other. Also you can change pd.io.parsers.read_csv to pd.read_csv and panel to data in concat.
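import glob
import pandas as pd

data = []
infiles = glob.glob("*.csv")
for file in infiles:
    df = pd.read_csv(file)
    # (assorted adjustments)
    data.append(df)
# axis=1 places the frames next to each other instead of stacking them
fullpanel = pd.concat(data, axis=1)
fullpanel.to_csv('data.csv')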
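To show what placing columns of differing lengths next to each other looks like, here is a standard-library sketch that pads the shorter columns with blanks. It illustrates the resulting layout only and does not use pandas. The function name `side_by_side` and the sample columns are hypothetical.

```python
from itertools import zip_longest


def side_by_side(columns):
    """Turn a list of columns (lists of differing lengths) into padded rows."""
    # zip_longest keeps going until the longest column is exhausted,
    # filling the gaps left by shorter columns with an empty string.
    return [list(row) for row in zip_longest(*columns, fillvalue="")]


cols = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
rows = side_by_side(cols)
```

With a default integer index, `pd.concat(data, axis=1)` lines rows up by position in a similar way, except that the gaps hold NaN rather than empty strings.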

replicating a row with changing a field in Python

I have a large csv file with two columns like this:
Id and vehicle
and I would like to replicate the rows where the vehicle is "truck", but put "car" in the copy instead.
I have this code, but there is an error
which says
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
what does it mean? where I am wrong?
infilename = r'external carriers.csv'
outfilename = r'outputCSV.csv'
with open(infilename, 'rb') as fp_in, open(outfilename, 'wb') as fp_out:
    reader = csv.reader(fp_in, delimiter=",")
    writer = csv.writer(fp_out, delimiter=",")
    for row in reader:
        if len(row) == 2:
            if row == "truck":
                writer.writerow = "car"
The error means you have opened the files in binary mode ('rb' and 'wb'). In Python 3 the csv module needs files opened in text mode, so use 'rt' and 'wt':
with open(infilename, 'rt') as fp_in, open(outfilename, 'wt') as fp_out:
Also, to check the vehicle type you need to look at the second field, row[1], which holds the vehicle name, and then write a new row with the replaced value to the output file; comparing the whole row to "truck" is never true, and assigning to writer.writerow writes nothing. Note that len on a list is O(1) in Python, so the length check is not a performance problem. It is only unnecessary if every row is known to have exactly two fields, in which case you can unpack the two fields directly in the for loop:
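import csv

infilename = r'external carriers.csv'
outfilename = r'outputCSV.csv'
# csv needs text mode in Python 3; newline='' lets the csv module handle line endings
with open(infilename, 'rt', newline='') as fp_in, open(outfilename, 'wt', newline='') as fp_out:
    reader = csv.reader(fp_in, delimiter=",")
    writer = csv.writer(fp_out, delimiter=",")
    # Assumes every row has exactly two fields: id and vehicle
    for row1, row2 in reader:
        if row2 == "truck":
            writer.writerow([row1, 'car'])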
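If "replicate" means keeping every original row and adding a "car" copy after each "truck" row, a variant could look like the sketch below. That reading of the question is an assumption. The function name `replicate_trucks_as_cars` is hypothetical, and in-memory text is used instead of files so the example is self-contained.

```python
import csv
import io


def replicate_trucks_as_cars(text):
    """Return CSV text with an extra 'car' row after every 'truck' row."""
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(row)  # keep the original row
        if len(row) == 2 and row[1] == "truck":
            writer.writerow([row[0], "car"])  # add the replicated row
    return out.getvalue()


sample = "Id,vehicle\n1,truck\n2,bus\n"
result = replicate_trucks_as_cars(sample)
```

The length check here guards against blank or malformed lines rather than serving any performance purpose.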