How to extract text segments from a paragraph when separated by icon image in WebDriver

Question

I am writing a Selenium test for our web page. One label field contains one or more text segments separated by a right-caret icon, and I am trying to extract the individual text segments from the label into a list.

This is what the HTML looks like in the DOM. In this case there are 3 individual text segments: "MainSchedule", "Container1", and "Container1.2".

<p class="MuiTypography-root MuiTypography-body1" style="word-break: break-all;">
    "MainSchedule"
    <svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="caret-right" class="svg-inline--fa fa-caret-right fa-w-6 sm-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 192 512" style="margin: 0px 5px;">
        <path fill="currentColor" d="M0 384.662V127.338c0-17.818 21.543-26.741 34.142-14.142l128.662 128.662c7.81 7.81 7.81 20.474 0 28.284L34.142 398.804C21.543 411.404 0 402.48 0 384.662z"/>
    </svg>
    "Container1"
    <svg aria-hidden="true" focusable="false" data-prefix="fas" data-icon="caret-right" class="svg-inline--fa fa-caret-right fa-w-6 sm-icon" role="img" xmlns="http://www.w3.org/2000/svg" viewBox="0 0 192 512" style="margin: 0px 5px;">
        <path fill="currentColor" d="M0 384.662V127.338c0-17.818 21.543-26.741 34.142-14.142l128.662 128.662c7.81 7.81 7.81 20.474 0 28.284L34.142 398.804C21.543 411.404 0 402.48 0 384.662z"/>
    </svg>
    "Container1.2"
</p>

I can easily get the paragraph object with

WebElement label = driver.findElement(By.cssSelector("p.MuiTypography-root"));

but when I call getText() on label, it returns all 3 text segments in one string, with no breaks to show where the icons are.

Using the Chrome developer tools, I can inspect the element's properties. On "p.MuiTypography-root", I see that the "firstChild" text content is the first text segment, "MainSchedule". I have tried

label.findElement(By.xpath("first-child"))

but it just throws an error. From that "firstChild" I can step through each "nextSibling" in the Chrome tools and find the nodes that hold the individual text segments, but I have not figured out how to read them in code.

I am writing my tests in Java.

Answer 1:

You can't do this directly with Selenium's finders: you need text fragments (text nodes), and the Selenium finders only return web elements.

However, there are XPath selectors that return the specific text fragments you need. The basic approach is an XPath selector like this:

//p[contains(@class, 'MuiTypography-root')]/text()[position() = 1]

This returns the first fragment of text inside the <p> element, which is the following (after trimming the surrounding whitespace):

"MainSchedule"

To use this selector, we replace the hard-coded "1" with a loop index: we first determine how many text fragments there can be, and then loop over each position.
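
Before building that loop, the selector can be tried in isolation, without a browser. The following is a minimal sketch that uses only the JDK's XML classes; the class name SelectorDemo, the method fragmentAt, and the trimmed sample markup are hypothetical stand-ins, not part of the original page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SelectorDemo {
    // Trimmed, well-formed stand-in for the <p> element in the question (assumption).
    static final String SAMPLE =
        "<p class=\"MuiTypography-root MuiTypography-body1\">\n"
        + "    MainSchedule\n"
        + "    <svg class=\"sm-icon\"><path d=\"M0 0\"/></svg>\n"
        + "    Container1\n"
        + "    <svg class=\"sm-icon\"><path d=\"M0 0\"/></svg>\n"
        + "    Container1.2\n"
        + "</p>";

    // Returns the trimmed text node at the given 1-based position inside the <p>.
    static String fragmentAt(String xml, int position) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xPath = XPathFactory.newInstance().newXPath();
            String expression = String.format(
                "//p[contains(@class, 'MuiTypography-root')]/text()[position() = %d]",
                position);
            // Evaluating to a String yields the text of the selected node.
            return xPath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate the XPath expression", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(fragmentAt(SAMPLE, 1));
        System.out.println(fragmentAt(SAMPLE, 3));
    }
}
```

Changing the position argument selects a different text node, which is the idea the loop in the full example relies on.
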

We use the XPath classes and parsers provided in Java as follows:

import java.io.IOException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

...

// assuming Firefox (I guess you are using Chrome):
System.setProperty("webdriver.gecko.driver", "your/path/here/geckodriver.exe");
WebDriver driver = new FirefoxDriver();
String uri = "your URL in here";
driver.navigate().to(uri);

// Here is where we use the Java parser and xpath classes:
DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
Document doc = docBuilder.parse(uri);
XPath xPath = XPathFactory.newInstance().newXPath();

// count how many <svg> tags there are.
String svgCounter = "count(//p[contains(@class, 'MuiTypography-root')]/svg)";
String count = xPath.compile(svgCounter).evaluate(doc);
// There can be up to this many pieces of text we need to extract:
int max = Integer.parseInt(count) + 1;

String expressionOne = "//p[contains(@class, 'MuiTypography-root')]/text()[position() = %s]";

for (int i = 1; i <= max; i++) {
    String result = xPath.compile(String.format(expressionOne, i)).evaluate(doc).trim();
    if (!result.isBlank()) {
        System.out.println(result);
    }
}

driver.quit();

The loop above prints the following:

"MainSchedule"
"Container1"
"Container1.2"

Points to note:

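(1) This approach assumes that you have a well-formed HTML document which the XML parser can read. Note that the parser fetches the URL itself rather than reading from the browser session, so content that only appears after JavaScript runs in the browser will not be visible to it. The parsing happens at this step:

Document doc = docBuilder.parse(uri);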

(2) The above code assumes there is one <p> element with an unspecified number of child <svg> tags. If you have multiple such <p> elements in your page, then you will need to adjust the above code accordingly, to process each <p> element one-by-one.

(3) If you don't have a well-formed HTML document, the above approach may fail. In that case there is a hackier approach, but it is not really recommended: it uses a regular expression to split up a string of HTML, which is almost never a good idea. Often, this will be brittle and fail in surprising ways.

The hack goes like this:

String expressionTwo = "//p[contains(@class, 'MuiTypography-root')]";
WebElement element1 = driver.findElement(By.xpath(expressionTwo));
String html = element1.getAttribute("innerHTML").replace('\n', ' ');
String[] items = html.split("<svg .*?</svg>");
for (String item : items) {
    System.out.println(item.trim());
}

In this case, we use the innerHTML attribute to get the string we need to manipulate.
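
The splitting step can be exercised on its own, without a browser, by feeding it a string shaped like the innerHTML value. This is only a sketch: the class name InnerHtmlSplitter, the method splitOnSvg, and the sample string are hypothetical, and the pattern assumes every icon is serialized as an <svg ...>...</svg> pair with at least one attribute.

```java
import java.util.ArrayList;
import java.util.List;

class InnerHtmlSplitter {
    // Splits an innerHTML string on <svg ...>...</svg> blocks and keeps the text in between.
    static List<String> splitOnSvg(String innerHtml) {
        // Flatten newlines first, because '.' in the pattern does not match them by default.
        String flattened = innerHtml.replace('\n', ' ');
        List<String> fragments = new ArrayList<>();
        for (String item : flattened.split("<svg .*?</svg>")) {
            String text = item.trim();
            if (!text.isEmpty()) {
                fragments.add(text);
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for element1.getAttribute("innerHTML").
        String sample = "\n    MainSchedule\n"
            + "    <svg class=\"sm-icon\">\n        <path d=\"M0 0\"></path>\n    </svg>\n"
            + "    Container1\n"
            + "    <svg class=\"sm-icon\">\n        <path d=\"M0 0\"></path>\n    </svg>\n"
            + "    Container1.2\n";
        for (String fragment : splitOnSvg(sample)) {
            System.out.println(fragment);
        }
    }
}
```

Collecting the pieces into a list and dropping blank entries keeps the result usable in assertions even when the markup has leading or trailing whitespace.
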

Source: https://stackoverflow.com/questions/61830165/how-to-extract-text-segments-from-a-paragraph-when-separated-by-icon-image-in-we

Tags

java

selenium-webdriver