How to only keep certain sentences of strings in pandas Dataframe

Question

In my pandas dataframe, I have 100 news articles under the article column. Each news article is a string. I want to only keep the first 3 sentences of each string, but I don't know how. (I noticed each sentence is separated by \n).

Please suggest possible solutions.

The dataframe looks like this:

print("Reading data from csv file")

print(read)

Reading data from csv file
    Unnamed: 0                                            article
0            0  \nChina’s ambassador to the US wants American ...
1            1  \nMissouri has become the first state to file ...
2            2  \nThe US is slamming the Communist Chinese gov...
3            3  \nSecretary of State Mike Pompeo on Thursday r...
4            4  \nThe US — along with Russia, China and India ...
..         ...                                                ...
95          95  \nChina has reported no new deaths from the co...
96          96  \nThe World Health Organization on Tuesday fin...
97          97  \nAfter two months of being shut down due to t...
98          98  \nSome coronavirus patients may suffer neurolo...
99          99  \nChina may be past the worst of the COVID-19 ...

[100 rows x 2 columns]

Answer 1:

Assuming your strings are of the format:

"\nA\nB\nC\nD\nE\nF\n"

You can reduce them to just the first three lines with:

x = "\nA\nB\nC\nD\nE\nF\n"
x = "\n".join(x.split("\n", maxsplit=4)[1:4])

This takes the string, splits it into a list of lines, and joins the first three lines back together with \n. So, in the above example, x becomes:

'A\nB\nC'

In Pandas you could apply this to a column with:

df['article'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[1:4]))
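
As a rough, self-contained sketch of the apply call above: the two sample articles below are made up for illustration, and the result is assigned back to the column, since apply returns a new Series rather than changing df in place.

```python
import pandas as pd

# Made-up sample data in the assumed "\nline\nline\n..." format.
df = pd.DataFrame({
    "article": [
        "\nA1\nB1\nC1\nD1\nE1\n",
        "\nA2\nB2\nC2\nD2\n",
    ]
})

# Keep list entries 1-3 of each article; entry 0 is the empty string
# that comes before the leading "\n".
df["article"] = df["article"].apply(
    lambda x: "\n".join(x.split("\n", maxsplit=4)[1:4])
)

print(df["article"].tolist())  # ['A1\nB1\nC1', 'A2\nB2\nC2']
```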

One small note: if an article has fewer than three lines, the result keeps a stray \n at the end. You could either strip it away by adding a strip() at the end of the lambda expression:
df['article'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[1:4]).strip())
or ensure that every article ends with exactly one \n:
df['article'].apply(lambda x: "\n".join(x.split("\n", maxsplit=4)[1:4]).strip() + '\n')
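
If some articles may be shorter than three lines, or may not start with \n at all, one alternative is to drop empty lines before slicing. The helper below is only a sketch: first_n_lines is a hypothetical name (not part of pandas), and it assumes blank lines never count as sentences.

```python
def first_n_lines(text, n=3):
    """Return the first n non-empty lines of text, joined with newlines."""
    # Drop empty pieces (from a leading, trailing, or doubled "\n")
    # before slicing, so short articles get no stray "\n".
    lines = [line for line in text.split("\n") if line.strip()]
    return "\n".join(lines[:n])

print(repr(first_n_lines("\nA\nB\nC\nD\nE\nF\n")))  # 'A\nB\nC'
print(repr(first_n_lines("\nA\nB\n")))              # 'A\nB'
```

It could then be passed to apply in place of the lambda, for example df['article'].apply(first_n_lines).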

Since you asked about the mechanics of x = "\n".join(x.split("\n", maxsplit=4)[1:4]), here is what happens step by step:

For each string, say x = "\nA\nB\nC\nD\nE\nF\n"

It is split into a list, using the "\n" as the dividing point. So:
x.split("\n", maxsplit=4) yields a list that contains:
['', 'A', 'B', 'C', 'D\nE\nF\n']. The initial empty entry is there because the string starts with \n. I have used maxsplit=4 because we are going to discard everything after the 3rd line, so there is no point splitting the rest.

Now we want to join 'A', 'B', 'C' back into a string. They are at indexes 1, 2 and 3 in the list, so we use a slice of [1:4] (the end index is NOT included in the slice). So:
x.split("\n", maxsplit=4)[1:4] contains just:
['A', 'B', 'C']

Finally they can be joined back together with
"\n".join(x.split("\n", maxsplit=4)[1:4]) which gives us:
'A\nB\nC' which is the first three lines, separated with \n
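
The three steps above can be run one at a time to inspect each intermediate value. The variable names parts, first_three and result are introduced here only for illustration.

```python
x = "\nA\nB\nC\nD\nE\nF\n"

# Step 1: split on "\n" at most 4 times, so the tail stays in one piece.
parts = x.split("\n", maxsplit=4)
print(parts)          # ['', 'A', 'B', 'C', 'D\nE\nF\n']

# Step 2: take indexes 1, 2 and 3 (the end index 4 is excluded).
first_three = parts[1:4]
print(first_three)    # ['A', 'B', 'C']

# Step 3: join the three lines back together with "\n".
result = "\n".join(first_three)
print(repr(result))   # 'A\nB\nC'
```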

Source: https://stackoverflow.com/questions/61916583/how-to-only-keep-certain-sentences-of-strings-in-pandas-dataframe

Tags

python

pandas

split