Excel Web Powerquery: Excel merges data strings in cells --> How do I delimit the data? - html

I am using Excel 2016 and would like to download odds from Oddschecker.com into an Excel spreadsheet via Power Query's web query function.
More specifically, I am trying to download the data from this Website:
https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history
The problem is that some odds on this website are merged into a single cell without any space between them:
Is there any way in Power Query to delimit the odds so that they are not merged?
Thank you very much in advance for any kind of help.

Another approach is shown in the code below. It uses a recursive function, fnSearchTR (embedded in the query), to drill down through the HTML document until an element named "TR" is found, or until 100 iterations have passed, to prevent endless recursion. I noticed that this is where the required data is located, at least today.
Remark: I also adjusted the second step in the code to select the "Document" row.
This is a more dynamic solution, since it doesn't matter how deep in the document structure the "TR" is located. If the document structure changes, it is still possible that other "TR" elements are found first, but so far it works.
"TR" elements with other content are also found, but these are filtered out as errors or null values after the data type of the first column is changed to date.
This query also uses the function "ExpandTables" from my previous answer (I corrected the typo by adding the "x"; otherwise the function is unchanged).
let
Source = Web.Page(Web.Contents("https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history")),
Data0 = Table.SelectRows(Source, each [Caption] = "Document"){0}[Data],
ChildrenWithTable = Table.SelectRows(Data0, each [Children] is table),
fnSearchTR = (newChildren as table, counter as number) as table =>
let
Combined = Table.Buffer(Table.Combine(newChildren[Children])),
ChildrensChildrenWithTable = Table.AddColumn(newChildren, "ChildrensChildren", each Table.SelectRows([Children], each [Children] is table)),
ChildrensChildrenCombined = Table.Combine(ChildrensChildrenWithTable[ChildrensChildren]),
CombinedAll = if ChildrensChildrenCombined[Name]{0} = "TR"
then ChildrensChildrenCombined
else if Table.RowCount(ChildrensChildrenCombined) = 0 or counter = 100
then Combined
else @fnSearchTR(ChildrensChildrenCombined, counter + 1)
in
CombinedAll,
CombinedAll = if Table.RowCount(ChildrenWithTable) = 0 then Data0 else fnSearchTR(ChildrenWithTable, 0),
#"Filtered Rows" = Table.SelectRows(CombinedAll, each ([Name] = "TR")),
#"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Children"}),
#"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns", "ExpandTables", each ExpandTables([Children])),
#"Removed Columns" = Table.RemoveColumns(#"Invoked Custom Function",{"Children"}),
#"Expanded ExpandTables" = Table.ExpandTableColumn(#"Removed Columns", "ExpandTables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}),
#"Changed Type" = Table.TransformColumnTypes(#"Expanded ExpandTables",{{"Column1", type date}}),
#"Removed Errors" = Table.RemoveRowsWithErrors(#"Changed Type", {"Column1"}),
#"Filtered Rows1" = Table.SelectRows(#"Removed Errors", each ([Column1] <> null))
in
#"Filtered Rows1"

Although I can't test it, since this site is blacklisted in the Russian Internet segment, I suppose there are <cr>s or <lf>s there that aren't being converted to new lines.
What you need is to run Text.Replace against all data cells to replace these characters.
But then you'll probably need these values as separate rows, and that is a far more complex task. :)

Inspired by Gil Raviv's http://datachant.com/2017/03/30/web-scraping-power-bi-excel-power-query/
Edit April 11, 2017: this solution is highly dependent on the structure of the website, or in other words: yesterday it worked fine, but today it doesn't, unfortunately.
The following query with its associated function works for me:
let
Source = Web.Page(Web.Contents("https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history")),
Data0 = Source{1}[Data],
Children = Data0{0}[Children],
Children1 = Children{1}[Children],
Children2 = Children1{4}[Children],
Children3 = Children2{0}[Children],
Children4 = Children3{0}[Children],
Children5 = Children4{0}[Children],
Children6 = Children5{3}[Children],
Children7 = Children6{0}[Children],
Children8 = Children7{1}[Children],
Children9 = Children8{3}[Children],
Children10 = Children9{0}[Children],
Children11 = Children10{2}[Children],
Children12 = Children11{2}[Children],
Children13 = Children12{0}[Children],
Children14 = Children13{1}[Children],
#"Removed Other Columns" = Table.SelectColumns(Children14,{"Children"}),
#"Invoked Custom Function" = Table.AddColumn(#"Removed Other Columns", "ExpandTables", each ExpandTables([Children])),
#"Expanded ExpandTables" = Table.ExpandTableColumn(#"Invoked Custom Function", "ExpandTables", {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}, {"Column1", "Column2", "Column3", "Column4", "Column5", "Column6", "Column7", "Column8", "Column9", "Column10", "Column11", "Column12", "Column13", "Column14", "Column15", "Column16", "Column17", "Column18", "Column19", "Column20", "Column21", "Column22", "Column23", "Column24", "Column25", "Column26", "Column27", "Column28", "Column29"}),
#"Removed Columns" = Table.RemoveColumns(#"Expanded ExpandTables",{"Children"}),
#"Removed Blank Rows" = Table.SelectRows(#"Removed Columns", each not List.IsEmpty(List.RemoveMatchingItems(Record.FieldValues(_), {"", null}))),
#"Parsed Date" = Table.TransformColumns(#"Removed Blank Rows",{{"Column1", each Date.From(DateTimeZone.From(_)), type date}})
in
#"Parsed Date"
Function ExpandTables (edit: #"Added Custom" line adjusted by adding Table.SelectRows)
(ChildTable as table) =>
let
#"Removed Other Columns1" = Table.SelectColumns(ChildTable,{"Children"}),
#"Added Custom" = Table.AddColumn(#"Removed Other Columns1", "Custom", each try if [Children] is null then null else if [Children][Text]{0} <> null then [Children][Text]{0} else Lines.ToText(List.Transform(Table.SelectRows([Children], each [Children] <> null)[Children], each _[Text]{0})) otherwise null),
#"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Children"}),
#"Transposed Table" = Table.Transpose(#"Removed Columns")
in
#"Transposed Table"

The problem is the HTML for one of the combined cells is:
<td><div class="oo">11/4</div><div class="oi">13/5</div><div class="oo">11/4</div></td>
As far as I know, div layout rules don't imply a newline, so Power Query doesn't insert one. We don't run a full layout engine, so we don't know that the column width means each div should be on its own line.
(If anybody knows more about HTML layout semantics, let me know and I can suggest a fix to my team.)
You can text-replace the HTML like this to inject your own delimiter, ";", between the div elements:
let
WebPageWithReplace = (url as text, old as text, new as text) =>
let
Source = Web.Contents(url),
TextReplace = Text.ToBinary(Text.Replace(Text.FromBinary(Source), old, new)),
Page = Web.Page(TextReplace)
in
Page,
Invoked = WebPageWithReplace(
"https://www.oddschecker.com/politics/european-politics/french-election/next-president/bet-history/marine-le-pen/today#all-history",
"</div><div",
"</div>;<div"),
Data = Invoked{1}[Data]
in
Data
And that way Web.Page will still find and parse the HTML table.
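The effect of the replacement can be checked outside Power Query. The following is a minimal Python sketch, not part of the query above: it applies the same text replacement to the sample cell and then strips the tags with the standard-library HTMLParser. The class CellText and the helper cell_text are names made up for this illustration.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collects the text content of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called once for every run of text between tags
        self.parts.append(data)


def cell_text(html):
    """Return the concatenated text of an HTML fragment (hypothetical helper)."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# The sample cell quoted in the answer above
cell = '<td><div class="oo">11/4</div><div class="oi">13/5</div><div class="oo">11/4</div></td>'

# Without the replacement the three odds run together
merged = cell_text(cell)

# With the delimiter injected between adjacent divs they can be split again
delimited = cell_text(cell.replace("</div><div", "</div>;<div"))
odds = delimited.split(";")
```

Because the replacement happens on the raw text before any parsing, the table structure itself is untouched, which is why the same trick works in front of Web.Page.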

Related

Are there any ways to remove duplicates in a specific row while maintaining other rows?

I'm trying to remove duplicates in a specific column (the promotion-ids.1 column) while the other columns remain unchanged.
Is there any way to do this?
Is this what you want? It nulls out the contents of a cell if the row above it has the same contents in that particular column (sort of the inverse of Fill Down).
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
shiftedList = {null} & List.RemoveLastN(Table.Column(Source,"promotion-ids.1"),1),
custom1 = Table.ToColumns(Source) & {shiftedList},
custom2 = Table.FromColumns(custom1,Table.ColumnNames(Source) & {"Prev"}),
x=Table.ReplaceValue(custom2, each [#"promotion-ids.1"], each if [Prev]=[#"promotion-ids.1"] then null else[#"promotion-ids.1"] ,Replacer.ReplaceValue,{"promotion-ids.1"}),
#"Removed Columns" = Table.RemoveColumns(x,{"Prev"})
in #"Removed Columns"
That's a terrible way to store the data, but it might be what you are looking for.
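For readers who want to see the logic in isolation, here is a small Python sketch of the same "compare with the row above" idea. It is only an illustration under the assumption that the column is available as a plain list; blank_repeats and the sample values are made up.

```python
def blank_repeats(values):
    """Return a copy of values where an item equal to the one directly above it becomes None."""
    result = []
    previous = object()  # sentinel that never equals real data, so the first item is kept
    for value in values:
        result.append(None if value == previous else value)
        previous = value  # always compare against the original value, not the blanked one
    return result


# Made-up sample standing in for the promotion-ids.1 column
promotion_ids = ["A", "A", "B", "B", "B", "A"]
cleaned = blank_repeats(promotion_ids)
```

As in the M query, only consecutive repeats are blanked; a value that reappears later, after a different value, is kept.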

Split sentences by Case change where two words are "stuck" together

I am attempting to clean up the following data which has been extracted from HTML.
Some sentences haven't quite split correctly with the Capitalised word at the start of one sentence "stuck" to the preceding word.
The image below illustrates what I am trying to achieve:
So in essence, if a row contains a sentence like "The boy plays with the ballThe Girl plays with the Console", it would be split into:
The boy plays with the ball
The Girl plays with the Console
M code so far with the actual data (it must be run in Power BI, as it uses the Html.Table function, which is not available in Excel):
let
Source = Table.FromColumns({Lines.FromBinary(Web.Contents("https://echa.europa.eu/registration-dossier/-/registered-dossier/14184/7/1"))}),
#"Added Custom" = Table.AddColumn(Source, "Custom", each if Text.Contains([Column1], "General Population - Hazard via oral route") then [Column1] else null),
#"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Custom] <> null)),
#"Kept Last Rows" = Table.LastN(#"Filtered Rows", 1),
#"Removed Other Columns" = Table.SelectColumns(#"Kept Last Rows",{"Custom"}),
#"Split Column by Delimiter" = Table.ExpandListColumn(Table.TransformColumns(#"Removed Other Columns", {{"Custom", Splitter.SplitTextByDelimiter("</dd><dt>", QuoteStyle.None), let itemType = (type nullable text) meta [Serialized.Text = true] in type {itemType}}}), "Custom"),
#"Added Custom1" = Table.AddColumn(#"Split Column by Delimiter", "Text", each Html.Table([Custom], {{"Custom",":root"}})),
#"Expanded Text" = Table.ExpandTableColumn(#"Added Custom1", "Text", {"Custom"}, {"Custom.1"})
in
#"Expanded Text"
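The image still looks incorrect (informationOverall is not split), but if you want to split by character transition, you can do so from the ribbon: Split Column > By Lowercase to Uppercase, which generates a step based on Splitter.SplitTextByCharacterTransition.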
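The rule itself, splitting wherever a lowercase letter is directly followed by an uppercase letter, can be sketched in Python with a zero-width regular expression. This is an illustration of the rule only, not of what Power Query does internally; split_on_case_change is a made-up name, and splitting on an empty match needs Python 3.7 or later.

```python
import re


def split_on_case_change(text):
    """Split text at every position where a lowercase letter is directly followed by an uppercase one."""
    # (?<=[a-z]) looks behind for a lowercase letter, (?=[A-Z]) looks ahead for an uppercase one.
    # Both are zero-width, so no characters are consumed by the split.
    return re.split(r"(?<=[a-z])(?=[A-Z])", text)


sentences = split_on_case_change("The boy plays with the ballThe Girl plays with the Console")
words = split_on_case_change("informationOverall")
```

Words such as "Girl" and "Console" do not trigger a split because the character before the capital is a space, not a lowercase letter.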

Power Query does not recognize tab as a delimiter in .txt files in the code

This is my first post here, so I apologize in advance if the question has already been answered somewhere or if I am doing something wrong. To summarize the problem:
I am doing some spectroscopy measurements and the data from the software I am using is saved in hundreds of .txt files. All files have the same content: first column refers to the wavelength, the second column is the intensity. Columns are separated from one another with a tab. The idea is to insert all of these .txt files in Power Query, rearrange the columns so there is only one column with the wavelength (since it is always the same for all measurements), and the remaining columns would be intensities (second column) of all inserted files.
Therefore, the desired output should look like this:
Wavelength (1st file), intensity (1st file), intensity (2nd file), intensity (3rd file),..., intensity (last file).
I found this brilliant solution, but the issue is that it only works flawlessly if the columns are separated by a comma. I tried changing the code so that it recognizes the tab, but nothing I tried worked. I only found out about Power Query yesterday, so I am a total beginner here. Here is the code:
let
Source = Folder.Files("C:\Users\xxxxx\Desktop\new"),
// Standard UI; step renamed
FilteredTxt = Table.SelectRows(Source, each [Extension] = ".txt"),
// Standard UI; step renamed
RemovedColumns = Table.RemoveColumns(FilteredTxt,{"Name", "Extension", "Date accessed", "Date modified", "Date created", "Attributes", "Folder Path"}),
// UI add custom column "FileContents" with formula Csv.Document([Content]); step renamed
AddedFileContents = Table.AddColumn(RemovedColumns, "FileContents", each Csv.Document([Content])),
// Standard UI; step renamed
RemovedBinaryContent = Table.RemoveColumns(AddedFileContents,{"Content"}),
// In the next 3 steps, temporary names for the new columns are created ("Column2", "Column3", etcetera)
// Standard UI: add custom Index column, start at 2, increment 1
#"Added Index" = Table.AddIndexColumn(RemovedBinaryContent, "Index", 2, 1),
// Standard UI: select Index column, Transform tab, Format, Add Prefix: "Column"
#"Added Prefix" = Table.TransformColumns(#"Added Index", {{"Index", each "Column" & Text.From(_, "en-US"), type text}}), //type text
// Standard UI:
#"Renamed Columns" = Table.RenameColumns(#"Added Prefix",{{"Index", "ColumnName"}}),
// Now we have the names for the new columns
// Advanced Editor: create a list with records with FileContents (tables) and ColumnNames (text) (1 list item (or record) per txt file in the folder)
// From this list, the resulting table will be build in the next step.
ListOfRecords = Table.ToRecords(#"Renamed Columns"),
// Advanced Editor: use List.Accumulate to build the table with all columns,
// starting with Column1 of the first file (Table.FromList(ListOfRecords{0}[FileContents][Column1], each {_}),)
// adding Column2 of each file for all items in ListOfRecords.
BuildTable = List.Accumulate(ListOfRecords,
Table.FromList(ListOfRecords{0}[FileContents][Column1], each{_}),
(TableSoFar,NewColumn) =>
Table.ExpandTableColumn(Table.NestedJoin(TableSoFar, "Column1", NewColumn[FileContents], "Column1", "Dummy", JoinKind.LeftOuter), "Dummy", {"Column2"}, {NewColumn[ColumnName]})),
#"Sorted Rows" = Table.Sort(BuildTable,{{"Column1", Order.Ascending}})
in
#"Sorted Rows"
//each {_}
//Splitter.SplitTextByWhitespace
This is the output I get when I run the code:
and if I change the first five of rows of .txt files so there is a comma between the columns, I get this:
The desired output (first five rows)
I was trying to change the each{_} in the Table.FromList line towards the end with the Splitter function, but it was not working.
I would be very grateful if you could take a look at the code, and suggest what needs to be changed in order for it to work.
Cheers!
Modify your code by inserting the #"Added Prefix2" step as shown below, and point the following step at it:
#"Added Prefix" = Table.TransformColumns(#"Added Index", {{"Index", each "Column" & Text.From(_, "en-US"), type text}}), //type text
#"Added Prefix2" = Table.TransformColumns(#"Added Prefix" , {{"FileContents", each Table.SplitColumn(_, "Column1", Splitter.SplitTextByEachDelimiter({"#(tab)"}, QuoteStyle.Csv, false), {"Column1", "Column2"})}}),
// Standard UI:
#"Renamed Columns" = Table.RenameColumns(#"Added Prefix2",{{"Index", "ColumnName"}}),
I prefer this version when I do something similar. It is more compact and preserves the file names of the source files:
let Source = Folder.Files("C:\directory\subdirectory"),
#"Filtered Rows" = Table.SelectRows(Source, each ([Extension] = ".txt")),
#"Added Custom1" = Table.AddColumn(#"Filtered Rows", "Custom", each Csv.Document(File.Contents([Folder Path]&"\"&[Name]),[Delimiter=",", Encoding=1252, QuoteStyle=QuoteStyle.None])),
#"Expanded Custom" = Table.ExpandTableColumn(#"Added Custom1", "Custom", {"Column1"}, {"Column1"}),
#"Split Column by Delimiter" = Table.SplitColumn(#"Expanded Custom", "Column1", Splitter.SplitTextByEachDelimiter({"#(tab)"}, QuoteStyle.Csv, false), {"Column1", "Column2"}),
#"Removed Other Columns1" = Table.SelectColumns(#"Split Column by Delimiter",{"Name", "Column1", "Column2"}),
#"Pivoted Column" = Table.Pivot(#"Removed Other Columns1", List.Distinct(#"Removed Other Columns1"[Name]), "Name", "Column2")
in #"Pivoted Column"
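The reshaping that both answers perform (one wavelength column, then one intensity column per file) can be sketched in Python as a cross-check. This is a sketch under stated assumptions, not a replacement for the queries above: the files are represented as in-memory strings, and read_spectrum, combine and the sample file names are made up.

```python
import csv
import io


def read_spectrum(text):
    """Parse tab-separated 'wavelength<TAB>intensity' lines into a dict keyed by wavelength."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    # Rows without exactly two fields (blank lines, stray headers) are skipped
    return {row[0]: row[1] for row in reader if len(row) == 2}


def combine(files):
    """files maps a file name to its text; returns a header row plus one row per wavelength."""
    spectra = {name: read_spectrum(text) for name, text in files.items()}
    names = list(spectra)
    wavelengths = list(spectra[names[0]])  # wavelengths are taken from the first file
    header = ["Wavelength"] + names
    rows = [[w] + [spectra[n].get(w) for n in names] for w in wavelengths]
    return [header] + rows


# Made-up sample data standing in for two measurement files
files = {
    "a.txt": "400\t1.0\n401\t1.5\n",
    "b.txt": "400\t2.0\n401\t2.5\n",
}
table = combine(files)
```

The important detail is the same as in the M code: the delimiter has to be stated explicitly as a tab, because the default comma delimiter leaves each line as a single unsplit field.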

replicating a row with changing a field in Python

I have a large csv file with two columns like this:
Id and vehicle
I would like to replicate the rows where the vehicle is "truck", but with "car" in place of "truck".
I have the code below, but it raises this error:
_csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)
What does it mean? Where am I going wrong?
import csv

infilename = r'external carriers.csv'
outfilename = r'outputCSV.csv'
with open(infilename, 'rb') as fp_in, open(outfilename, 'wb') as fp_out:
    reader = csv.reader(fp_in, delimiter=",")
    writer = csv.writer(fp_out, delimiter=",")
    for row in reader:
        if len(row) == 2:
            if row == "truck":
                writer.writerow = "car"
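The error means that you opened the files in binary mode ('rb' and 'wb'). In Python 3 the csv module works on strings, so the files must be opened in text mode ('rt' and 'wt'). Passing newline='' is also recommended for csv files, so that the module handles line endings itself:
with open(infilename, 'rt', newline='') as fp_in, open(outfilename, 'wt', newline='') as fp_out:
Also, to check the vehicle type you need to look at row[1], which holds the vehicle name, rather than comparing the whole row to a string. Then write a new row to the output file by calling writer.writerow(...) instead of assigning to it. The len check is only needed if some rows may not have exactly two fields; the loop below unpacks each row into two names and therefore assumes that every row has exactly two fields.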
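import csv

infilename = r'external carriers.csv'
outfilename = r'outputCSV.csv'
# newline='' lets the csv module handle line endings itself
with open(infilename, 'rt', newline='') as fp_in, open(outfilename, 'wt', newline='') as fp_out:
    reader = csv.reader(fp_in, delimiter=",")
    writer = csv.writer(fp_out, delimiter=",")
    # Unpacking assumes every row has exactly two fields: id and vehicle
    for row1, row2 in reader:
        if row2 == "truck":
            writer.writerow([row1, 'car'])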
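The code above writes only the converted rows. If "replicate" in the question means that every original row should be kept and each truck row should get an additional "car" copy, a variant could look like the sketch below. That reading of the question is an assumption, and replicate_trucks is a made-up helper; in-memory StringIO objects stand in for the real files so that the sketch is self-contained.

```python
import csv
import io


def replicate_trucks(fp_in, fp_out):
    """Copy every row to fp_out and, for each truck row, also write a copy with 'car'."""
    reader = csv.reader(fp_in, delimiter=",")
    writer = csv.writer(fp_out, delimiter=",")
    for row in reader:
        writer.writerow(row)  # keep the original row
        # Guard against short or long rows instead of unpacking blindly
        if len(row) == 2 and row[1] == "truck":
            writer.writerow([row[0], "car"])


# In-memory stand-ins for the input and output files (made-up sample data)
source = io.StringIO("1,truck\n2,bike\n", newline="")
target = io.StringIO(newline="")
replicate_trucks(source, target)
output_lines = target.getvalue().splitlines()
```

With real files, the same function can be called with the two handles from the with statement shown above.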

Creating a value in a control on a form

I need to create a value in a text box control upon triggering a certain event to allow me to then relink my forms to a different master/child link scheme. This value is to be used subsequently to create an if statement. For some strange reason, the value is generated and formatted correctly but regardless of what is in the text box, the If statement does not recognise this value and knows it only as blank. I tried numbers, letters but everything is the same.
In my example below, after the text box control 'txtDeviation' is updated to the value 1, for some strange reason the If statement does not recognise it as the value 1.
Private Sub cmdSkillsTracking_Click()
Form_frmValueChain01!frmValueChain02.SetFocus
Form_frmValueChain01.Pagina370.Visible = False
Form_frmValueChain01.Pagina371.Visible = True
If txtDeviation01 < 1 Then
Form_frmValueChain01.Form.frmValueChain07.LinkMasterFields = "txtMicroProcess01e"
Form_frmValueChain01.Form.frmValueChain07.LinkChildFields = "ID"
Form_frmValueChain01.Form.frmValueChain17.LinkMasterFields = "txtSubProcessID"
Form_frmValueChain01.Form.frmValueChain17.LinkChildFields = "IDskillsmatrix"
Form_frmValueChain01.Form.frmValueChain16.LinkMasterFields = "txtSubProcessID"
Form_frmValueChain01.Form.frmValueChain16.LinkChildFields = "ID"
Else
Form_frmValueChain01.Form.frmValueChain07.LinkMasterFields = "txtMicroProcess01f"
Form_frmValueChain01.Form.frmValueChain07.LinkChildFields = "ID"
Form_frmValueChain01.Form.frmValueChain14.LinkMasterFields = "txtMicroProcess01f"
Form_frmValueChain01.Form.frmValueChain14.LinkChildFields = "subprocessID"
Form_frmValueChain01.Form.frmValueChain10c.LinkMasterFields = "txtMicroProcess01f"
Form_frmValueChain01.Form.frmValueChain10c.LinkChildFields = "ID"
Form_frmValueChain01.Form.frmValueChain101.LinkMasterFields = "txtMicroProcess01f"
Form_frmValueChain01.Form.frmValueChain101.LinkChildFields = "ID"
Form_frmValueChain01.Form.frmValueChain07.LinkMasterFields = "txtMicroProcess01e"
Form_frmValueChain01.Form.frmValueChain07.LinkChildFields = "ID"
Form_frmValueChain01.Form.frmValueChain17.LinkMasterFields = "txtSubProcessID"
Form_frmValueChain01.Form.frmValueChain17.LinkChildFields = "IDskillsmatrix"
Form_frmValueChain01.Form.frmValueChain16.LinkMasterFields = "txtSubProcessID"
Form_frmValueChain01.Form.frmValueChain16.LinkChildFields = "ID"
End If
Two things I see here:
First, since you are using a less-than operator, you seem to want to treat this text box value as numeric. If so, you will need to convert the text value of the text box to a number.
Next, you need to prefix the reference to the text box with "Me."
Your If statement should look like this:
If Val(Me.txtDeviation01) < 1 Then
...