W jednym z projektów musiałem pobrać zawartość strony internetowej do arkusza. Niestety nie mogłem użyć wbudowanych narzędzi Excela z racji skomplikowania strony.
Poniżej prezentuję uproszczoną zawartość HTML strony. Moim zadaniem było pobranie każdego paragrafu <p> klasy <div class=”SomeClassNameChild”>, ale klasa <div class=”NextClassChildName”> która jest dzieckiem <div class=”SomeClassNameChild”> musiała znajdować się drugiej kolumnie arkusza.
Inne klasy miały być pominięte.
Przykładowy HTML:
<html> <head></head> <body> <div class="SomeClassNameParent"> <div class="SomeClassNameChild" id="child-1"> <!-- This class is duplicated many times --> <span class="icon-close"></span> <h5><span class="number">1</span> Some title</h5> <img src="SomePath data-src="SomePath" class="lazy" alt=""> <p>Some text here #1</p> <p>Some text here #2</p> <p> </p> <p>Some text here #3</p> <div class="NextClassChildName"> <h6><span class="icon"></span>Text</h6> <p>Some text here #4</p> </div> <div class="NextClassChildName2"> <button class="prev child-1" style="display:none">< SomeButton</button> <button class="next child-1">SomeButton</button> </div> </div> </body> </html>
Kod w VBA wygląda następująco:
Private Sub DowloadloadWebContent() ' https://stackoverflow.com/questions/41558714/pull-value-from-website-html-div-class-using-excel-vba ' https://stackoverflow.com/questions/14108948/excel-vba-check-if-entry-is-empty-or-not-space ' https://stackoverflow.com/questions/41851152/vba-web-scrape-getelementsbyclassname ' http://automatetheweb.net/vba-getelementsbytagname-method/ ' https://stackoverflow.com/questions/46036355/extracting-div-class-information-from-website-with-vba ' Needed references (Go to Tools -> References): ' Microsoft HTML Object Library Dim htmlEle1 As IHTMLElement Dim HTMLDoc As HTMLDocument Dim URL, NameOfClass As String Dim SheetRowCounter, HtmlWebRowCounter, HtmlLineNumber As Long Dim oElement As Object URL = "Test_URL_Path" Set HTMLDoc = New HTMLDocument ' Connect to WebPage With CreateObject("WINHTTP.WinHTTPRequest.5.1") .Open "GET", URL, False .send HTMLDoc.body.innerHTML = .responseText End With ' Paste the HTML content to active sheet With ActiveWorkbook.ActiveSheet ' Looping only <div> sections For Each htmlEle1 In HTMLDoc.getElementsByTagName("div") ' Get the name of selected class NameOfClass = htmlEle1.className Select Case NameOfClass Case "SomeClassNameChild" ' Catch SomeClassNameChild only SheetRowCounter = SheetRowCounter + 1 ' Get SomeClassNameChild class index .Cells(SheetRowCounter, 1) = htmlEle1.getAttribute("id") ' Get title only <h5> of SomeClassNameChild HtmlWebRowCounter = -1 For Each oElement In htmlEle1.getElementsByTagName("h5") .Cells(SheetRowCounter, 2) = .Cells(SheetRowCounter, 2) & _ " " & htmlEle1.getElementsByTagName("h5")(HtmlWebRowCounter + 1).innerText HtmlWebRowCounter = HtmlWebRowCounter + 1 Next ' Get all <p> of SomeClassNameChild HtmlWebRowCounter = -1 For HtmlLineNumber = 0 To UBound(Split(htmlEle1.innerHTML, vbCrLf)) ' Eliminate class NextClassChildName from loop If Split(htmlEle1.innerHTML, vbCrLf)(HtmlLineNumber) = "<DIV class=NextClassChildName>" Then Exit For Else If Left(Split(htmlEle1.innerHTML, vbCrLf)(HtmlLineNumber), 3) = "<P>" Then .Cells(SheetRowCounter, 3) = .Cells(SheetRowCounter, 3) & _ " " & htmlEle1.getElementsByTagName("p")(HtmlWebRowCounter + 1).innerText HtmlWebRowCounter = HtmlWebRowCounter + 1 End If End If Next ' in this web page some of descriptions are not placed in <p> but in <div> directly or in <p class="name"> ' for that reason here is a try to catch this line descriptions and paste in to proper cell HtmlWebRowCounter = -1 If .Cells(SheetRowCounter, 3) = "" Or Len(Trim(.Cells(SheetRowCounter, 2))) = 0 Then For HtmlLineNumber = 0 To UBound(Split(htmlEle1.innerHTML, vbCrLf)) If Split(htmlEle1.innerHTML, vbCrLf)(HtmlLineNumber) = "<DIV class=NextClassChildName>" Then Exit For Else If Left(Split(htmlEle1.innerHTML, vbCrLf)(HtmlLineNumber), 5) = "<DIV>" Then .Cells(SheetRowCounter, 3) = .Cells(SheetRowCounter, 3) & _ " " & htmlEle1.getElementsByTagName("div")(HtmlWebRowCounter + 1).innerText HtmlWebRowCounter = HtmlWebRowCounter + 1 End If End If Next End If ' Get <p> of NextClassChildName class. <h6> tag should not be considered Case "NextClassChildName2" HtmlWebRowCounter = -1 For Each oElement In htmlEle1.getElementsByTagName("p") .Cells(SheetRowCounter, 4) = .Cells(SheetRowCounter, 4) & _ " " & htmlEle1.getElementsByTagName("p")(HtmlWebRowCounter + 1).innerText HtmlWebRowCounter = HtmlWebRowCounter + 1 Next ' Class <div class="NextClassChildName2"> is not considered (no case) in loop. No data is taken from there End Select Next End With End Sub
Ciekawymi elementami są:
Wyodrębnienie pojedyńczych linii HTML z całości stringa:
Split(htmlEle1.innerHTML, vbCrLf)(HtmlLineNumber)
Obliczenie ilości wierszy w HTML stringu:
UBound(Split(htmlEle1.innerHTML, vbCrLf))
Sprawdzenie czy komórka jest pusta pomimo pobrania wartości <p> </p>, gdzie o ile dobrze pamiętam samo LEN zwracało liczbę 2:
Len(Trim(.Cells(SheetRowCounter, 2))) = 0
Warto tez zwrócić uwagę, że do pobrania zawartości używam dwóch właściwości:
innerText: do pobrania samego tekstu np. z tagu <p>,
innerHTML: do poruszania się po HTMLu (pętla) i odszukania interesujących elementów.