How to extract the text in the textarea frame of the DeepL page?

Question

From https://www.deepl.com/translator#en/fr/Hello%2C%20how%20are%20you%20today%3F

The page shows the translation in the target text box, but the translated text "Bonjour, comment allez-vous aujourd'hui?" doesn't appear anywhere in the page's source, and the textarea's markup looks like this:

<textarea class="lmt__textarea lmt__target_textarea lmt__textarea_base_style" 
data-gramm_editor="false" tabindex="110" dl-test="translator-target-input" 
lang="fr-FR" style="height: 300px;"></textarea>

No matter how I read the text or source with BeautifulSoup, I can't extract the translation from that textarea.

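import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.deepl.com/translator#en/fr/Hello%2C%20how%20are%20you%20today%3F')
# Naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning.
bsoup = BeautifulSoup(response.content.decode('utf8'), 'html.parser')

# The translation does not show up in the result.
bsoup.find_all('textarea')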

How can I extract the translation from the page at https://www.deepl.com/translator?
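
One reason the requests call above cannot return the translation: everything after the # in a URL is a fragment, which the client keeps to itself and does not send in the HTTP request. The sketch below uses only the standard library to show what the server is actually asked for. The helper name split_fragment is hypothetical, and the "source/target/text" layout of the fragment is an assumption read off the example URL.

```python
from urllib.parse import urldefrag, unquote


def split_fragment(url):
    """Split a translator URL into the part sent to the server and the
    source language, target language and text carried in the fragment.

    Hypothetical helper; assumes the fragment looks like 'en/fr/<quoted text>'.
    """
    # urldefrag returns the URL without its fragment, plus the fragment itself.
    server_url, fragment = urldefrag(url)
    # Split at most twice so that slashes inside the text stay in the text.
    source_lang, target_lang, quoted_text = fragment.split('/', 2)
    return server_url, source_lang, target_lang, unquote(quoted_text)


url = 'https://www.deepl.com/translator#en/fr/Hello%2C%20how%20are%20you%20today%3F'
print(split_fragment(url))
# ('https://www.deepl.com/translator', 'en', 'fr', 'Hello, how are you today?')
```

Only the first element of that tuple reaches the server, so the HTML that comes back is the same whatever text is in the fragment.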

Answer 1:

The translation is the result of a call to an external API, which the page makes using JSON-RPC:

POST https://www2.deepl.com/jsonrpc

with parameters such as the text to translate and the target language.

An example in Python using python-requests:

import requests
import time

url = "https://www2.deepl.com/jsonrpc"
text = "Hello, how are you today?"

r = requests.post(
    url,
    json = {
        "jsonrpc":"2.0",
        "method": "LMT_handle_jobs",
        "params": {
            "jobs":[{
                "kind":"default",
                "raw_en_sentence": text,
                "raw_en_context_before":[],
                "raw_en_context_after":[],
                "preferred_num_beams":4,
                "quality":"fast"
            }],
            "lang":{
                "user_preferred_langs":["FR","EN"],
                "source_lang_user_selected":"auto",
                "target_lang":"FR"
            },
            "priority":-1,
            "commonJobParams":{},
            "timestamp": int(round(time.time() * 1000))
        },
        "id": 40890008
    }
)

print(r.json())
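
The request body above can be wrapped in a small helper so that the text and target language become parameters. This is only a sketch: build_payload is a hypothetical name, the field names and values are copied from the example above, and putting the target language first in user_preferred_langs is an assumption based on that one example. Nothing here guarantees that the endpoint keeps accepting this shape.

```python
import time


def build_payload(text, target_lang="FR", request_id=40890008):
    """Return the JSON-RPC body from the example above (hypothetical helper)."""
    return {
        "jsonrpc": "2.0",
        "method": "LMT_handle_jobs",
        "params": {
            "jobs": [{
                "kind": "default",
                "raw_en_sentence": text,
                "raw_en_context_before": [],
                "raw_en_context_after": [],
                "preferred_num_beams": 4,
                "quality": "fast",
            }],
            "lang": {
                # Assumption: mirrors ["FR", "EN"] from the original example.
                "user_preferred_langs": [target_lang, "EN"],
                "source_lang_user_selected": "auto",
                "target_lang": target_lang,
            },
            "priority": -1,
            "commonJobParams": {},
            # Current time in milliseconds, as in the original example.
            "timestamp": int(round(time.time() * 1000)),
        },
        "id": request_id,
    }


payload = build_payload("Hello, how are you today?")
```

The result can be passed to requests.post(url, json=payload) exactly as in the example above.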

Answer 2:

To extract the text from a textarea field, use .get_attribute('value').

Here I also have Selenium wait for the element, using WebDriverWait with the visibility_of_element_located condition.

But an element being available (as in this case) doesn't guarantee that its text already exists, so add a loop that runs until text != ''

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# You may need to pass the browser driver's executable path here.
driver = webdriver.Chrome()
driver.get('https://www.deepl.com/translator#en/fr/Hello%2C%20how%20are%20you%20today%3F')

# Wait until the target textarea is visible, then keep checking until it has text.
while True:
    element = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.lmt__side_container--target textarea')))
    if element.get_attribute('value') != '':
        time.sleep(1)  # short pause, then read the value again
        text_target = element.get_attribute('value')
        break

print(text_target)
driver.quit()

Hope this helps.
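
The loop above never gives up if the textarea stays empty. As a sketch of the same "poll until text != ''" idea with a time limit, here is a small helper that does not depend on Selenium. wait_for_text is a hypothetical name, and the lambda at the bottom is only a stand-in for element.get_attribute('value').

```python
import time


def wait_for_text(read_value, timeout=20.0, interval=0.5):
    """Call read_value() until it returns a non-empty value or time runs out.

    Hypothetical helper: read_value is any zero-argument callable, for
    example lambda: element.get_attribute('value').
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value:  # None and '' both count as "not there yet"
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("no text appeared within %.1f seconds" % timeout)
        time.sleep(interval)


# Stand-in for a textarea that is empty on the first two reads.
fake_reads = iter(["", "", "Bonjour"])
result = wait_for_text(lambda: next(fake_reads), timeout=5, interval=0.01)
print(result)  # Bonjour
```

With Selenium, the call would be wait_for_text(lambda: element.get_attribute('value')) once the element has been located.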

Answer 3:

To extract the text "Bonjour, comment allez-vous aujourd'hui ?", use WebDriverWait to wait for visibility_of_element_located() and then call get_attribute("value"). You can use either of the following locator strategies:

Using CSS_SELECTOR and get_attribute("value"):

driver.get('https://www.deepl.com/translator#en/fr/Hello%2C%20how%20are%20you%20today%3F')
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "textarea.lmt__textarea.lmt__target_textarea.lmt__textarea_base_style"))).get_attribute("value"))

Using XPATH and get_attribute("value"):

driver.get('https://www.deepl.com/translator#en/fr/Hello%2C%20how%20are%20you%20today%3F')
print(WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, "//textarea[@class='lmt__textarea lmt__target_textarea lmt__textarea_base_style']"))).get_attribute("value"))

Console Output:

Bonjour, comment allez-vous aujourd'hui ?

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Answer 4:

An alternative using pyperclip and a different locator (the button that copies the translated text):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pyperclip

driver = webdriver.Chrome()
driver.get('https://www.deepl.com/translator#en/fr/Hello%2C%20how%20are%20you%20today%3F')
# Click the copy button once it is clickable.
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "div.lmt__target_toolbar__copy > button"))).click()
# Read the copied text back from the clipboard.
data = pyperclip.paste()

Source: https://stackoverflow.com/questions/62218673/how-to-extract-the-text-in-the-textarea-frame-of-the-deepl-page

Tags

python

html

selenium

beautifulsoup

css-selectors