Extracting Numbers Before Month Names in a Pandas Column Using Regular Expressions

Extracting Numbers Before Month Names in a Pandas Column

===========================================================

In this article, we’ll explore how to use regular expressions to extract numbers occurring before month names in a pandas column. We’ll dive into the details of regular expression syntax and demonstrate a step-by-step approach to achieve this task.

Background on Regular Expressions

Regular expressions (regex) are a powerful tool for matching patterns in strings. They consist of special characters, character classes, and quantifiers that help us define complex patterns. In the context of text processing, regex is often used for data cleaning, filtering, and extraction.

In this article, we’ll focus on using regex to extract numbers before month names. We’ll start by explaining the basics of regex syntax and then dive into the specific pattern required for this task.

Basic Regex Syntax

Before diving into the specifics of extracting numbers before month names, let’s review some basic regex syntax:

  • . matches any single character (except a newline)
  • \d matches any digit
  • \w matches any word character (alphanumeric plus underscore)
  • \s matches any whitespace character
  • ^ matches the start of a string
  • $ matches the end of a string

These characters form the foundation of regex patterns. We’ll use them to build more complex patterns that match our desired input.

Regex Patterns for Month Names

To extract numbers before month names, we need to define a pattern that matches:

  1. A number (either one or two digits) followed by whitespace and then a month name.
  2. The month name itself should not be extracted; only the preceding number.

Here’s the regex pattern that achieves this:

\b(\d{1,2})\s(?:January|February|March|April|May|June|July|August|September|October|November|December)
    ^

Let’s break down this pattern:

  • \b matches a word boundary (the start of a word or the end of a word).
  • (\d{1,2}) captures one or two digits in group 1. This is the number we want to extract.
  • \s matches whitespace characters.
  • (?:January|February|March|April|May|June|July|August|September|October|November|December) defines a non-capturing group that contains the month names. The ? quantifier means “zero or more” occurrences of the preceding element, ensuring that only one match is made.
  • ^ matches the start of a string.

Using Regex with Pandas

Now that we have our regex pattern, let’s see how to use it with pandas in Python:

import pandas as pd

# Sample DataFrame with "col1" containing strings like '133    h missed intake office visit on 28 June 1994 a...'
df = pd.DataFrame({
    'col1': ['133    h missed intake office visit on 28 June 1994 a...',
             '136    11 February 1985 CPT Code: 90801 - Psychiatric...',
             '150                  12 March 1980 SOS-10 Total Score:\n',
             '151                      22 June 1990 Medical History:\n',
             '165    .On 18 August 1975 patient presented to BH ED/...',
             '181                18 August 1995 Primary Care Doctor:\n',
             '182    eby 13 June 1974 it appears amitriptyline had ...',
             '188    12 March 2004 CPT Code: 90801 - Psychiatric Di....',
             '228    s 20 yo M carries dx of BPAD, presents for psy...',
             '229    t Allergies Sulfa (Sulfonamide Antibiotics) - ...',
             '230    B/R Walnut Ridge. Raised with sister and parent...',
             '231    50 yo DWF with a history of alcohol use disord...',
             '232    )HTN, hypercholesterolemia, DM, sleep apnea,, ...']
})

# Extract numbers before month names
df['col2'] = df['col1'].str.extract(r'\b(\d{1,2})\s(?:January|February|March|April|May|June|July|August|September|October|November|December)'
                                   , expand=True)

print(df)

Conclusion

In this article, we explored how to extract numbers occurring before month names in a pandas column using regular expressions. We covered the basics of regex syntax and defined a pattern that matches our desired input. Finally, we demonstrated how to use this pattern with pandas in Python.

By following these steps and applying regex to your data cleaning tasks, you’ll be able to efficiently extract relevant information from your datasets.


Last modified on 2024-03-02