How to Compare Strings Case-Insensitively in Spark RDD?

Submitted by 生来就可爱ヽ(ⅴ<●) on 2020-05-15 05:08:11

Question


I have following Dataset

drug_name,num_prescriber,total_cost
AMBIEN,2,300
BENZTROPINE MESYLATE,1,1500
CHLORPROMAZINE,2,3000

I want to find the number of A's and B's in the above dataset, including the header. I am using the following code to count the A's and B's.

from pyspark import SparkContext
from pyspark.sql import SparkSession

logFile = 'Sample.txt'
spark = SparkSession.builder.appName('GD App').getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()
print('{0} {1}'.format(numAs,numBs))
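For reference, `spark.read.text` loads each line of the file as one row, and `contains` is case-sensitive. A plain-Python sketch of what that filter computes, with the line values copied from the sample dataset above:

```python
# Each line of Sample.txt becomes one row; contains() is case-sensitive,
# so only lowercase 'a'/'b' characters match.
lines = [
    'drug_name,num_prescriber,total_cost',
    'AMBIEN,2,300',
    'BENZTROPINE MESYLATE,1,1500',
    'CHLORPROMAZINE,2,3000',
]

# Only the lowercase header line contains 'a' or 'b'.
num_as = sum(1 for line in lines if 'a' in line)
num_bs = sum(1 for line in lines if 'b' in line)
print('{0} {1}'.format(num_as, num_bs))  # 1 1
```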

It returns the output 1 1. I want the comparison to be case-insensitive. I tried the following, but it raises the error 'Column' object is not callable:

numAs = logData.filter((logData.value).tolower().contains('a')).count()
numBs = logData.filter((logData.value).tolower().contains('b')).count()

Please help me out.


Answer 1:


To convert to lower case, use the lower() function from pyspark.sql.functions. So you could try:

import pyspark.sql.functions as F

logData = spark.createDataFrame(
    [
     (0,'aB'),
     (1,'AaA'),
     (2,'bA'),
     (3,'bB')
    ],
    ('id', "value")
)
numAs = logData.filter(F.lower(logData.value).contains('a')).count()
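A plain-Python mirror of what this lower-cased filter computes (sample values copied from the createDataFrame call above, so no Spark session is needed to follow along):

```python
# F.lower(...).contains('a') lower-cases each value before the substring
# test, so matching becomes case-insensitive.
rows = ['aB', 'AaA', 'bA', 'bB']

num_as = sum(1 for v in rows if 'a' in v.lower())
num_bs = sum(1 for v in rows if 'b' in v.lower())
print(num_as, num_bs)  # 3 3
```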

You mention 'I am using the following code to find out num of A's and number of B's.' Note that if you want to count the actual occurrences of a character rather than the number of rows that contain it, you could do something like:

def count_char_in_col(col: str, char: str):
    return F.length(F.regexp_replace(F.lower(F.col(col)), "[^" + char + "]", ""))

logData.select(count_char_in_col('value','a')).groupBy().sum().collect()[0][0]

which in the above example will return 5.
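To see where the 5 comes from, here is the same lower-case/strip/length logic written with Python's re module, applied to the sample values from the frame above:

```python
import re

# Mirror of count_char_in_col: lower-case the value, strip every character
# except the target with a negated character class, and take the length.
rows = ['aB', 'AaA', 'bA', 'bB']

def count_char(value: str, char: str) -> int:
    return len(re.sub('[^' + char + ']', '', value.lower()))

# 'ab' has one 'a', 'aaa' has three, 'ba' has one, 'bb' has none.
total_a = sum(count_char(v, 'a') for v in rows)
print(total_a)  # 5
```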

Hope this helps!



Source: https://stackoverflow.com/questions/51607061/how-to-compare-strings-without-case-sensitive-in-spark-rdd
