问题
I have two dataframes and both contains sql table.
This is my first Dataframe
Original_Input Cleansed_Input Core_Input Type_input
TECHNOLOGIES S.A TECHNOLOGIES SA
A & J INDUSTRIES, LLC A J INDUSTRIES LLC
A&S DENTAL SERVICES AS DENTAL SERVICES
A.M.G Médicale Inc AMG Mdicale Inc
AAREN SCIENTIFIC AAREN SCIENTIFIC
My second dataframe is :
Name_Extension Company_Type Priority
co llc Company LLC 2
Pvt ltd Private Limited 8
Corp Corporation 4
CO Ltd Company Limited 3
inc Incorporated 5
CO Company 1
I removed, punctuations, ASCII and digits and have put this data in the cleansed_input
column in df1
.
That cleansed_input
column in df1
needs to be checked with the Name_Extension
column of df2
. If the value from cleansed_input
has any value from Name_Extension
at the end then that should be split and put in type_input column
of df1
and not just like that but abbreviated.
For example, if CO
is present in cleansed_column
then that should be abbreviated as Company
and put in the type_input column
and the remaining text should be in core_type
column of df1
. Also there is priority given, am not sure if thats needed.
Expected output:
Original_Input Cleansed_Input Core_Input Type_input
TECHNOLOGIES S.A TECHNOLOGIES SA TECHNOLOGIES SA
A & J INDUSTRIES, LLC A J INDUSTRIES LLC A J INDUSTRIES LLC
A&S DENTAL SERVICES AS DENTAL SERVICES
A.M.G Médicale Inc AMG Mdicale Inc AMG Mdicale Incorporated
AAREN SCIENTIFIC AAREN SCIENTIFIC
I tried many methods like with isin, mask, contains, etc but am not sure what to put in where.
I got an error that said "Series are mutable, they cannot be hashed"
. I was unsure of why I got that error when I was trying things with dataframe.
I am not having that code and am using jupiter notebook and sql server and isin doesn't seem to work in jupiter.
Same way there is another split to be done. The original_input column to be split as parent_compnay name and alias name.
Here is my code:
import pyodbc
import pandas as pd
import string
from string import digits
import sqlalchemy
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import String
from io import StringIO
from itertools import chain
import re
#Connecting SQL with Python
server = '172.16.15.9'
database = 'Database Demo'
username = '**'
password = '******'
engine = create_engine('mssql+pyodbc://**:******@'+server+'/'+database+'?
driver=SQL+server')
#Reading SQL table and grouping by columns
data=pd.read_sql('select * from [dbo].[TempCompanyName]',engine)
#df1=pd.read_sql('Select * from company_Extension',engine)
#print(df1)
#gp = df.groupby(["CustomerName", "Quantity"]).size()
#print(gp)
#1.Removing ASCII characters
data['Cleansed_Input'] = data['Original_Input'].apply(lambda x:''.join([''
if ord(i) < 32 or ord(i) > 126 else i for i in x]))
#2.Removing punctuations
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda
x:''.join([x.translate(str.maketrans('', '', string.punctuation))]))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i
in x if i not in string.punctuation]))
#3.Removing numbers in a table.
data['Cleansed_Input']= data['Cleansed_Input'].apply(lambda
x:x.translate(str.maketrans('', '', string.digits)))
#df['Cleansed_Input'] = df['Cleansed_Input'].apply(lambda x:''.join([i for i
in x if i not in string.digits]))
#4.Removing trialing and leading spaces
data['Cleansed_Input']=df['Cleansed_Input'].apply(lambda x: x.strip())
df=pd.DataFrame(data)
#data1=pd.DataFrame(df1)
df2 = pd.DataFrame({
"Name_Extension": ["llc",
"Pvt ltd",
"Corp",
"CO Ltd",
"inc",
"CO",
"SA"],
"Company_Type": ["Company LLC",
"Private Limited",
"Corporation",
"Company Limited",
"Incorporated",
"Company",
"Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})
data.to_sql('TempCompanyName', con=engine, if_exists='replace',index= False)
回答1:
IIUC, we can use some basic regex :
first we remove any trailing and leading white space and split by white space, this returns a list of lists which we can break out using chain.from_iterable
then we use some regex with pandas methods str.findall
and str.contains
to match your inputs.
from itertools import chain
ext = df2['Name_Extension'].str.strip().str.split('\s+')
ext = list(chain.from_iterable(i for i in ext))
df['Type_Input'] = df['Cleansed_Input'].str.findall('|'.join(ext),flags=re.IGNORECASE).str[0]
s = df['Cleansed_Input'].str.replace('|'.join(ext),'',regex=True,case=False).str.strip()
df.loc[df['Type_Input'].isnull()==False,'Core_Input'] = s
print(df)
Original_Input Cleansed_Input type_input core_input
0 TECHNOLOGIES S.A TECHNOLOGIES SA NaN NaN
1 A & J INDUSTRIES, LLC A J INDUSTRIES LLC LLC A J INDUSTRIES
2 A&S DENTAL SERVICES AS DENTAL SERVICES NaN NaN
3 A.M.G Médicale Inc AMG Mdicale Inc Inc AMG Mdicale
4 AAREN SCIENTIFIC AAREN SCIENTIFIC NaN NaN
回答2:
Assuming you have read in the dataframes as df1
and df2
, the first step is to create 2 lists - one for the Name_Extension
(keys) and one for Company_Type
(values) as shown below:
keys = list(df2['Name_Extension'])
keys = [key.strip().lower() for key in keys]
print (keys)
>>> ['co llc', 'pvt ltd', 'corp', 'co ltd', 'inc', 'co']
values = list(df2['Company_Type'])
values = [value.strip().lower() for value in values]
print (values)
>>> ['company llc', 'private limited', 'corporation', 'company limited', 'incorporated', 'company']
Next step will be to map each value in the Cleansed_Input
to a Core_Input
and Type_Input
. We can use pandas apply method on the Cleansed_Input
column
To get the Core_input
:
def get_core_input(data):
# preprocess
data = str(data).strip().lower()
# check if the data end with any of the keys
for key in keys:
if data.endswith(key):
return data.split(key)[0].strip() # split the data and return the part without the key
return None
df1['Core_Input'] = df1['Cleansed_Input'].apply(get_core_input)
print (df1)
>>>
Original_Input Cleansed_Input Core_Input Type_input
0 TECHNOLOGIES S.A TECHNOLOGIES SA None NaN
1 A & J INDUSTRIES, LLC A J INDUSTRIES LLC None NaN
2 A&S DENTAL SERVICES AS DENTAL SERVICES None NaN
3 A.M.G Médicale Inc AMG Mdicale Inc amg mdicale NaN
4 AAREN SCIENTIFIC AAREN SCIENTIFIC None NaN
To get the Type_input
:
def get_type_input(data):
# preprocess
data = str(data).strip().lower()
# check if the data end with any of the keys
for idx in range(len(keys)):
if data.endswith(keys[idx]):
return values[idx].strip() # return the value of the corresponding matched key
return None
df1['Type_input'] = df1['Cleansed_Input'].apply(get_type_input)
print (df1)
>>>
Original_Input Cleansed_Input Core_Input Type_input
0 TECHNOLOGIES S.A TECHNOLOGIES SA None None
1 A & J INDUSTRIES, LLC A J INDUSTRIES LLC None None
2 A&S DENTAL SERVICES AS DENTAL SERVICES None None
3 A.M.G Médicale Inc AMG Mdicale Inc amg mdicale incorporated
4 AAREN SCIENTIFIC AAREN SCIENTIFIC None None
This is a pretty easy to follow solution, but is not the most efficient way to solve the problem, I am sure.. Hopefully it solves your use case.
回答3:
Here is a possible solution you can implement:
df = pd.DataFrame({
"Original_Input": ["TECHNOLOGIES S.A",
"A & J INDUSTRIES, LLC",
"A&S DENTAL SERVICES",
"A.M.G Médicale Inc",
"AAREN SCIENTIFIC"],
"Cleansed_Input": ["TECHNOLOGIES SA",
"A J INDUSTRIES LLC",
"AS DENTAL SERVICES",
"AMG Mdicale Inc",
"AAREN SCIENTIFIC"]
})
df_2 = pd.DataFrame({
"Name_Extension": ["llc",
"Pvt ltd",
"Corp",
"CO Ltd",
"inc",
"CO",
"SA"],
"Company_Type": ["Company LLC",
"Private Limited",
"Corporation",
"Company Limited",
"Incorporated",
"Company",
"Anonymous Company"],
"Priority": [2, 8, 4, 3, 5, 1, 9]
})
# Preprocessing text
df["lower_input"] = df["Cleansed_Input"].str.lower()
df_2["lower_extension"] = df_2["Name_Extension"].str.lower()
# Getting the lowest priority matching the end of the string
extensions_list = [ (priority, extension.lower_extension.values[0])
for priority, extension in df_2.groupby("Priority") ]
df["extension_priority"] = df["lower_input"] \
.apply(lambda p: next(( priority
for priority, extension in extensions_list
if p.endswith(extension)), None))
# Merging both dataframes based on priority. This step can be ignored if you only need
# one column from the df_2. In that case, just give the column you require instead of
# `priority` in the previous step.
df = df.merge(df_2, "left", left_on="extension_priority", right_on="Priority")
# Removing the matched extensions from the `Cleansed_Input` string
df["aux"] = df["lower_extension"].apply(lambda p: -len(p) if isinstance(p, str) else 0)
df["Core_Input"] = df.apply(
lambda p: p["Cleansed_Input"]
if p["aux"] == 0
else p["Cleansed_Input"][:p["aux"]].strip(),
axis=1
)
# Selecting required columns
df[[ "Original_Input", "Core_Input", "Company_Type", "Name_Extension" ]]
I assumed that the "Priority" column would have unique values. However, if this isn't the case, just sort the priorities and create an index based on that order like this:
df_2.sort_values("Priority").assign(index = range(df_2.shape[0]))
Also, next time give the data example in a format that allows anyone to load easily. It was cumbersome to handle the format you sent.
EDIT: Not related with the question, but it might be of some help. You can simplify the steps from 1 to 4 with the following:
data['Cleansed_Input'] = data["Original_Input"] \
.str.replace("[^\w ]+", "") \ # removes non-alpha characters
.str.replace(" +", " ") \ # removes duplicated spaces
.str.strip() # removes spaces before or after the string
EDIT 2: SQL version of the solution (I'm using PostgreSQL, but I used standard SQL operators, so the differences shouldn't be that huge).
SELECT t.Original_Name,
t.Cleansed_Input,
t.Name_Extension,
t.Company_Type,
t.Priority
FROM (
SELECT df.Original_Name,
df.Cleansed_Input,
df_2.Name_Extension,
df_2.Company_Type,
df_2.Priority,
ROW_NUMBER() OVER (PARTITION BY df.Original_Name ORDER BY df_2.Priority) AS rn
FROM (VALUES ('TECHNOLOGIES S.A', 'TECHNOLOGIES SA'), ('A & J INDUSTRIES, LLC', 'A J INDUSTRIES LLC'),
('A&S DENTAL SERVICES', 'AS DENTAL SERVICES'), ('A.M.G Médicale Inc', 'AMG Mdicale Inc'),
('AAREN SCIENTIFIC', 'AAREN SCIENTIFIC')) df(Original_Name, Cleansed_Input)
LEFT JOIN (VALUES ('llc', 'Company LLC', '2'), ('Pvt ltd', 'Private Limited', '8'), ('Corp', 'Corporation', '4'),
('CO Ltd', 'Company Limited', '3'), ('inc', 'Incorporated', '5'), ('CO', 'Company', '1'),
('SA', 'Anonymous Company', '9')) df_2(Name_Extension, Company_Type, Priority)
ON lower(df.Cleansed_Input) like ( '%' || lower(df_2.Name_Extension) )
) t
WHERE rn = 1
来源:https://stackoverflow.com/questions/59680268/split-and-replace-in-one-dataframe-based-on-a-condition-with-another-dataframe-i