Extracting Articles from RTF or TXT Files Using Regular Expressions in R

Extracting Different Articles from a Single Text File
===========================================================

In this post, we’ll explore how to extract different articles from a single text file using regular expressions in R.

Introduction

The problem statement is as follows: given an RTF or TXT file containing newspaper articles, extract the date, title, and body of each article. In the source file each title is bold and underlined and is followed by a word count, the date, source metadata, and a body of several paragraphs; the simplified samples below flatten each article onto a single line. We’ll use regular expressions to pull the three fields apart.

Sample Data

Here’s some sample data to help us understand the problem better:

*   RTF File (`bild_afd_all.rtf`):

    title1 12 words 12-12-2004 BILD ZBILD BIBU 2 295 German Copyright first bla bla bla bla bla bla bla bla~

*   TXT File (`bild_afd_all.txt`):

    ```txt
    title1  12  words  12-12-2004 BILD ZBILD BIBU 2 295 German Copyright first bla bla bla bla bla bla ~
    title2  10  words  10-12-2004 BILD ZBILD BIBU 2 1235 German Copyright first da da da da da da da~
    title3  12  words  10-12-2004 BILD ZBILD BIBU 2 1235 German Copyright first info info info info info~
    ```
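To try the steps below without the files at hand, the three TXT sample lines can be entered directly as a character vector. The name `sample_txt` is only illustrative and does not appear in the original code.

```r
# The three sample lines from bild_afd_all.txt, one article per element
sample_txt <- c(
  "title1  12  words  12-12-2004 BILD ZBILD BIBU 2 295 German Copyright first bla bla bla bla bla bla ~",
  "title2  10  words  10-12-2004 BILD ZBILD BIBU 2 1235 German Copyright first da da da da da da da~",
  "title3  12  words  10-12-2004 BILD ZBILD BIBU 2 1235 German Copyright first info info info info info~"
)

# Number of articles in the sample
length(sample_txt)
```

Any of the later patterns can then be tested against `sample_txt` in place of `txt`.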

Solution Overview

Our solution involves the following steps:

1.  Read the text file into R and split it into lines.
2.  Remove blank lines from the text.
3.  Find the lines that contain a word count marker (for example `12 words`); each one marks an article.
4.  Extract the dates, titles, and bodies of the articles using regular expressions.

Step-by-Step Solution

Step 1: Load Required Libraries

First, we load the required packages. Loading tidyverse already attaches readr and stringr, so the first two calls are redundant but harmless:

library(readr)
library(stringr)
library(tidyverse)

Step 2: Read Text File

Next, we read the text file into R using read_file from the readr package:

htmlText <- read_file("bild_afd_all.rtf")

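read_file returns the whole file as a single string, so we split it into a character vector with one element per line. For an RTF file this string still contains the raw RTF markup, because read_file does not interpret it:

txt <- str_split(htmlText, "\r?\n")[[1]]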
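The difference between one long string and a vector of lines matters for every later step, because the patterns are applied per element. A minimal sketch using base R's `readLines`, which returns one element per line directly; the temporary file only stands in for `bild_afd_all.txt`, and its contents are shortened sample lines:

```r
# Write two shortened sample lines and a blank line to a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("title1  12  words  12-12-2004 BILD",
             "",
             "title2  10  words  10-12-2004 BILD"), tmp)

# readLines returns a character vector with one element per line
lines <- readLines(tmp)
length(lines)
# -> 3 (the blank line is still included)

# Collapsing the lines into one string leaves a single element,
# so line-based indexing has nothing left to work with
one_string <- paste(lines, collapse = " ")
length(one_string)
# -> 1

unlink(tmp)
```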

Step 3: Remove Blank Lines

Then, we remove blank lines from the text, including lines that contain only whitespace:

txt <- txt[str_trim(txt) != ""]
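Whether a line counts as blank depends on how it is compared. The short sketch below, on a made-up vector, shows that a plain comparison with `""` keeps lines consisting only of spaces, while trimming first removes them:

```r
library(stringr)

# Made-up input: two header fragments, one empty line, one whitespace-only line
txt <- c("title1  12  words", "", "   ", "title2  10  words")

# A plain comparison keeps the whitespace-only line
kept_plain <- txt[txt != ""]
length(kept_plain)
# -> 3

# Trimming before the comparison drops it as well
kept_trimmed <- txt[str_trim(txt) != ""]
length(kept_trimmed)
# -> 2
```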

Step 4: Find Word Count Indexes

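Next, we find the lines that contain a word count marker such as 12 words, using str_detect and which. Each of these lines marks one article:

idx_word <- which(str_detect(txt, "[0-9]+ +words\\b"))

Word counts above 999 may be written with a thousands separator (1,234 words). The pattern \\d?,?\\d+ covers that form, so we use it for the digits instead. The " +" tolerates the double spaces in the TXT sample, and \\b takes the place of a $ anchor because the word count sits in the middle of each sample line. If your export puts the word count at the end of its own line, words$ works as well.

Also note that a raw RTF file carries markup for bold and underlined text, so the regular expression may not match every article there.

idx_word <- which(str_detect(txt, "\\d?,?\\d+ +words\\b"))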
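The effect of the anchor and of the spacing is easiest to see by running both variants on a few test strings. This is a small sketch with made-up inputs; only the first string follows the sample layout:

```r
library(stringr)

headers <- c(
  "title1  12  words  12-12-2004 BILD",  # sample layout, double spaces
  "1,234 words",                          # word count alone on its line
  "a few words about nothing"             # no digits before "words"
)

# Single space and $ anchor: "words" must end the line
anchored <- str_detect(headers, "\\d?,?\\d+ words$")
anchored
# -> FALSE TRUE FALSE

# One or more spaces and a word boundary instead of the anchor
unanchored <- str_detect(headers, "\\d?,?\\d+ +words\\b")
unanchored
# -> TRUE TRUE FALSE

# Extracting the count shows the thousands separator being captured
count <- str_extract(headers[2], "\\d?,?\\d+(?= +words\\b)")
count
# -> "1,234"
```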

Step 5: Extract Dates

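After finding the article lines, we extract one date per article. The sample writes dates as DD-MM-YYYY, so the pattern is \\d{1,2}-\\d{1,2}-\\d{4}:

date <- str_extract(txt[idx_word], "\\d{1,2}-\\d{1,2}-\\d{4}")

If your export spells dates out (12 December 2004), use \\d{1,2} [A-Z][a-z]+ \\d{4} instead. The pattern \\d?,?\\d+ from Step 4 is not suitable here, because it matches any number on the line, including the word count.

Also note that str_extract returns only the first match in each line, so a line that mentions further dates in its body still yields the header date. str_extract_all(...)[[1]] would instead return every match from the first line only.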
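A short sketch of the first-match behaviour, followed by a conversion to the `Date` class so the values sort and filter chronologically. The second input line is made up to include an extra date in the text; the `%d-%m-%Y` format assumes day-month-year order, as the sample suggests:

```r
library(stringr)

header <- c(
  "title1  12  words  12-12-2004 BILD ZBILD",
  "title2  10  words  10-12-2004 BILD, follow-up to 09-12-2004"  # made-up extra date
)

# First DD-MM-YYYY match per line
date_chr <- str_extract(header, "\\d{1,2}-\\d{1,2}-\\d{4}")
date_chr
# -> "12-12-2004" "10-12-2004"

# Convert the strings to Date objects (day-month-year order assumed)
date_parsed <- as.Date(date_chr, format = "%d-%m-%Y")
format(date_parsed)
# -> "2004-12-12" "2004-12-10"
```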

Step 6: Extract Titles

Then, we extract the titles. On each article line the title is everything that precedes the word count, so we match lazily from the start of the line up to a lookahead for the word count pattern:

header <- txt[idx_word]
title <- str_trim(str_extract(header, "^.*?(?=\\s+\\d?,?\\d+ +words\\b)"))

However, a raw RTF file wraps bold text in control words such as \b, so titles read from it will not come out perfectly clean.
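One rough way to reduce this problem is to strip RTF control words before applying the patterns. The sketch below uses base R only and a made-up RTF fragment; it handles plain control words and braces but not escaped characters, font tables, or embedded objects, so a dedicated RTF parser (for example the striprtf package) is likely the safer route for real files:

```r
# Made-up RTF fragment: bold, underlined title followed by the word count
rtf_line <- "{\\b\\ul title1}{\\b0\\ulnone  12  words}"

# A control word is a backslash, letters, an optional number, and an optional space
plain <- gsub("\\\\[a-zA-Z]+-?[0-9]* ?", "", rtf_line)

# Remove the group braces that remain
plain <- gsub("[{}]", "", plain)
plain
# -> "title1 12  words"
```

The result has the same shape as a TXT sample line, so the title and word count patterns can be applied to it.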

Step 7: Extract Bodies

Finally, we extract the bodies. We take the body to be everything that follows the Copyright marker on each article line, which assumes that this marker closes the header fields as it does in the sample:

body <- str_remove(txt[idx_word], "^.*?\\bCopyright\\s+")
article <- str_trim(body)

However, a raw RTF file marks underlined text with control words such as \ul, so this will again not work perfectly there.
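As an alternative to three separate extractions, all fields can be captured in one pass with `str_match` and capture groups. This is a sketch on shortened sample lines, and it assumes, as above, that `Copyright` closes the header; the names `pattern`, `parts`, and `articles` are illustrative:

```r
library(stringr)

header <- c(
  "title1  12  words  12-12-2004 BILD ZBILD BIBU 2 295 German Copyright first bla bla ~",
  "title2  10  words  10-12-2004 BILD ZBILD BIBU 2 1235 German Copyright first da da~"
)

# Groups: 1 = title, 2 = word count, 3 = date, 4 = body after "Copyright"
pattern <- "^(.*?)\\s+(\\d?,?\\d+) +words\\s+(\\d{1,2}-\\d{1,2}-\\d{4}).*?\\bCopyright\\s+(.*)$"
parts <- str_match(header, pattern)

# Column 1 of the result holds the full match; the capture groups follow
articles <- data.frame(
  title = parts[, 2],
  words = parts[, 3],
  date  = as.Date(parts[, 4], format = "%d-%m-%Y"),
  body  = parts[, 5],
  stringsAsFactors = FALSE
)

articles$title
# -> "title1" "title2"
```

A line that does not fit the pattern yields a row of `NA` values, which makes malformed articles easy to spot.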

Putting it all Together

Now we’ll put everything together into a function that takes the input file name as an argument and returns the extracted data:

extract_articles <- function(file_name) {
  # Load libraries
  library(readr)
  library(stringr)
  library(tidyverse)

  # Read the whole file from the files/ directory as a single string
  htmlText <- read_file(paste0("files/", file_name))

  # Split into a character vector with one element per line
  txt <- str_split(htmlText, "\r?\n")[[1]]

  # Remove blank and whitespace-only lines
  txt <- txt[str_trim(txt) != ""]

  # Find the lines that contain a word count marker, one per article
  idx_word <- which(str_detect(txt, "\\d?,?\\d+ +words\\b"))
  header <- txt[idx_word]

  # Extract the first DD-MM-YYYY date on each article line
  date <- str_extract(header, "\\d{1,2}-\\d{1,2}-\\d{4}")

  # Extract titles: everything before the word count
  title <- str_trim(str_extract(header, "^.*?(?=\\s+\\d?,?\\d+ +words\\b)"))

  # Extract bodies: everything after "Copyright" (assumed to close the header)
  article <- str_trim(str_remove(header, "^.*?\\bCopyright\\s+"))

  # Create data frame with extracted data, one row per article
  df <- tibble(
    Title = title,
    Date = as.Date(date, format = "%d-%m-%Y"),
    Article = article
  )

  return(df)
}

Example Usage

Finally, we can use the extract_articles function to extract articles from our sample RTF and TXT files. The function reads from the files/ directory, so both files must be placed there first:

# Extract articles from RTF file
df_rtf <- extract_articles("bild_afd_all.rtf")

# Print extracted articles from RTF file
print(df_rtf)

# Extract articles from TXT file
df_txt <- extract_articles("bild_afd_all.txt")

# Print extracted articles from TXT file
print(df_txt)

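Each call returns a tibble with the columns Title, Date, and Article, which print then displays. For the RTF file, expect leftover markup in the fields unless the file is converted to plain text first.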

Last modified on 2024-05-02