Question:
My goal is to extract the substring between a set of parentheses, but only if it starts with a digit. Several of the strings will have multiple sets of parentheses, but only one will contain a string that starts with a digit. Currently, it is extracting everything between the first parenthesis and the last one, rather than seeing two separate sets of them.
As far as only using the parentheses with a substring that starts with a digit, I am lost as to how to even approach this.
Any help is appreciated.
import pandas as pd
cols = ['a', 'b']
data = [
['xyz - (4 inch), (four inch)', 'abc'],
['def', 'ghi'],
['xyz - ( 5.5 inch), (five inch)', 'abc'],
]
df = pd.DataFrame(data=data, columns=cols)
df['c'] = df['a'].str.extract(r"\((.*)\)")
Desired output:
                                a    b       c
0     xyz - (4 inch), (four inch)  abc  4 inch
1                             def  ghi     NaN
2  xyz - ( 5.5 inch), (five inch)  abc     NaN
Current output:
                                a    b                       c
0     xyz - (4 inch), (four inch)  abc     4 inch), (four inch
1                             def  ghi                     NaN
2  xyz - ( 5.5 inch), (five inch)  abc   5.5 inch), (five inch
Answer:
The following pattern should do the job: \((\d[^.)]+)\)
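What it does:
- Matches the character '('.
- Starts capturing: one digit, followed by one or more characters that are neither ')' nor '.'.
- Ends capturing.
- Matches the character ')'.
Because the captured characters can never include ')', the match has to stop at the first closing parenthesis, so each set of parentheses is treated separately.
You can see a detailed explanation on regex101.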
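To see why the pattern from the question overshoots while this one does not, here is a small sketch using Python's built-in re module on the same three strings. The helper name first_group is made up for illustration. The greedy `.*` runs to the last ')' in the string, whereas a character class that excludes ')' cannot go past the first one.

```python
import re

samples = [
    "xyz - (4 inch), (four inch)",
    "def",
    "xyz - ( 5.5 inch), (five inch)",
]

greedy = re.compile(r"\((.*)\)")        # pattern from the question
answer = re.compile(r"\((\d[^.)]+)\)")  # pattern from this answer


def first_group(pattern, text):
    """Hypothetical helper: first capture group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


greedy_results = [first_group(greedy, s) for s in samples]
answer_results = [first_group(answer, s) for s in samples]

print(greedy_results)
print(answer_results)
```

In the third string the text after '(' is a space rather than a digit, so the answer's pattern finds no match there.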
Final code:
import pandas as pd
cols = ['a', 'b']
data = [
['xyz - (4 inch), (four inch)', 'abc'],
['def', 'ghi'],
['xyz - ( 5.5 inch), (five inch)', 'abc'],
]
df = pd.DataFrame(data=data, columns=cols)
df['c'] = df['a'].str.extract(r"\((\d[^.)]+)\)")
print(df)
Output generated:
                                a    b       c
0     xyz - (4 inch), (four inch)  abc  4 inch
1                             def  ghi     NaN
2  xyz - ( 5.5 inch), (five inch)  abc     NaN
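One caveat: because the character class also excludes '.' and is followed by +, the pattern skips groups such as (5.5 inch) or (4), even though both start with a digit. If those should be captured as well, one possible variant is \((\d[^)]*)\). This is an assumption about what is wanted, not something the question states, and the name extract below is likewise only illustrative. A minimal comparison with the re module:

```python
import re

strict = re.compile(r"\((\d[^.)]+)\)")  # pattern from the answer
relaxed = re.compile(r"\((\d[^)]*)\)")  # assumed variant: allows '.' and a lone digit


def extract(pattern, text):
    """Hypothetical helper: first capture group of the first match, or None."""
    match = pattern.search(text)
    return match.group(1) if match else None


cases = ["(5.5 inch), (five inch)", "(4), (four)", "( 5.5 inch)"]

strict_out = [extract(strict, s) for s in cases]
relaxed_out = [extract(relaxed, s) for s in cases]

print(strict_out)
print(relaxed_out)
```

Both patterns still reject "( 5.5 inch)", since the first character inside the parentheses is a space. The same pattern string can be passed to df['a'].str.extract in place of the original.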