Extracting Different Articles from a Single Text File
===========================================================
In this post, we’ll explore how to extract different articles from a single text file using regular expressions in R.
Introduction
The problem statement is as follows: given an RTF or TXT file containing newspaper articles, extract the date, title, and body of each article. Each article starts with its title, which is bold and underlined in the RTF version, and the body follows as several paragraphs underneath. We will use regular expressions to locate each part.
Sample Data
Here’s some sample data to help us understand the problem better:
* RTF File (`bild_afd_all.rtf`):
title1 12 words 12-12-2004 BILD ZBILD BIBU 2 295 German Copyright first bla bla bla bla bla bla bla bla~
* TXT File (`bild_afd_all.txt`):
```txt
title1 12 words 12-12-2004 BILD ZBILD BIBU 2 295 German Copyright first bla bla bla bla bla bla ~
title2 10 words 10-12-2004 BILD ZBILD BIBU 2 1235 German Copyright first da da da da da da da~
title3 12 words 10-12-2004 BILD ZBILD BIBU 2 1235 German Copyright first info info info info info~
```
Solution Overview
Our solution involves the following steps:
- Read the text file into R.
- Remove blank lines from the text.
- Find the indexes of the word-count lines (such as `12 words`), which follow each article title.
- Extract the dates, titles, and bodies of the articles using regular expressions.
Step-by-Step Solution
Step 1: Load Required Libraries
First, we need to load the required libraries in R:
library(readr)
library(stringr)
library(tidyverse)
Step 2: Read Text File
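Next, we read the text file into R using `read_file` from the readr package:
htmlText <- read_file("bild_afd_all.rtf")
`read_file` returns the whole file as a single string. R does not interpret RTF natively, so for the RTF file this string still contains the raw formatting codes. To work line by line, we split the string into a character vector with one element per line:
txt <- str_split(htmlText, "\\r?\\n")[[1]]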
Step 3: Remove Blank Lines
Then, we remove blank lines from the text:
txt = txt[txt != ""]
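The read-and-clean stage can be tried on an inline string without touching any file. The sketch below uses base R only, and the sample string is hypothetical: it assumes one field per line, which the flattened sample above does not show explicitly.

```r
# Hypothetical raw text standing in for the file contents.
# The one-field-per-line layout is an assumption.
raw <- "title1\n12 words\n12-12-2004\n\nBILD\n   \nCopyright\nfirst bla bla~\n"

# Split the single string into one element per line (also handles \r\n).
lines <- strsplit(raw, "\r?\n")[[1]]

# Drop lines that are empty or contain only whitespace.
lines <- lines[trimws(lines) != ""]

print(lines)
```

Filtering on `trimws(lines)` rather than on `lines` also removes lines that hold only spaces, which a plain comparison with `""` would keep.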
Step 4: Find Word Count Indexes
Next, we find the indexes of the word-count lines, which follow each article title in the sample, using `str_detect` and `which`:
idx_word = which(str_detect(txt, "[0-9]+ +words$"))
This pattern finds the right lines, but for a count with a thousands separator such as `1,235 words` it only covers `235 words`. To cover the whole count, we use `\\d?\\,?\\d+ words$` instead:
idx_word = which(str_detect(txt, "\\d?\\,?\\d+ words$"))
Also note that the RTF file wraps bold and underlined text in formatting codes, so the pattern may miss lines there unless those codes are stripped first.
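To see how the two word-count patterns differ, the following sketch runs both on a few hypothetical lines using base R (`grep`, `regexpr`, `regmatches`) in place of stringr. Only the shape `<count> words` is taken from the post; the lines themselves are made up for illustration.

```r
# Hypothetical lines; only the "<count> words" shape comes from the post.
lines <- c("title1", "12 words", "title2", "1,235 words",
           "many words", "12 words later")

plain_pattern <- "[0-9]+ +words$"
comma_pattern <- "\\d?,?\\d+ words$"

# Both patterns flag the same lines, because a match may start mid-line.
idx_plain <- grep(plain_pattern, lines, perl = TRUE)
idx_comma <- grep(comma_pattern, lines, perl = TRUE)

# The difference shows up in how much of the line each pattern covers.
count_line <- "1,235 words"
plain_match <- regmatches(count_line, regexpr(plain_pattern, count_line, perl = TRUE))
comma_match <- regmatches(count_line, regexpr(comma_pattern, count_line, perl = TRUE))

print(idx_comma)    # indexes of the word-count lines
print(plain_match)  # part covered by the plain pattern
print(comma_match)  # part covered by the comma-aware pattern
```

For pure detection either pattern is enough; the comma-aware one matters if the count itself is extracted later.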
Step 5: Extract Dates
After finding the word-count indexes, we extract the article dates with `str_extract`. In the sample, the date (for example `12-12-2004`) follows the word count, so we look at the line after each word-count line:
date <- str_extract(txt[idx_word + 1], "\\d{1,2}-\\d{1,2}-\\d{4}")
If your export writes dates as `12 December 2004` instead, use the pattern `\\d{1,2} [A-Z][a-z]+ \\d{4}`.
Also note that `str_extract` returns only the first match per element, so a line that contains several dates still yields a single date.
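The sample dates have the form `12-12-2004`, so a hyphenated pattern is what matches them. Here is a minimal base R sketch that picks the first such date from the line after each word count and converts it to a `Date`. The input lines are hypothetical, and reading the string as day-month-year is an assumption; swap the format string if your source uses month first.

```r
# Hypothetical lines, one field per line (an assumption about the layout).
lines <- c("title1", "12 words", "12-12-2004", "BILD",
           "title2", "10 words", "10-12-2004", "BILD")

# Word-count lines, then the line that follows each of them.
idx_word <- grep("\\d?,?\\d+ words$", lines, perl = TRUE)
candidates <- lines[idx_word + 1]

# First hyphenated date per candidate line; NA where nothing matches.
date_pattern <- "\\d{1,2}-\\d{1,2}-\\d{4}"
m <- regexpr(date_pattern, candidates, perl = TRUE)
hit <- as.vector(m) != -1
date_chr <- rep(NA_character_, length(candidates))
date_chr[hit] <- regmatches(candidates, m)

# Assumption: day-month-year order.
dates <- as.Date(date_chr, format = "%d-%m-%Y")
print(dates)
```

Keeping `NA` for lines without a date preserves one entry per article, so the dates stay aligned with the titles.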
Step 6: Extract Titles
Then, we extract the titles. In the sample, each title sits on the line directly above its word count, so we shift the word-count indexes back by one:
title <- txt[idx_word - 1]
This assumes single-line titles. In the RTF file, the codes that mark bold text surround the title, so they must be stripped before this works cleanly.
Step 7: Extract Bodies
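Finally, we extract the bodies. In the sample, each body starts after the `Copyright` line and runs up to the line before the next title, so we slice the text between those positions and join the paragraphs:
idx_copy <- which(str_detect(txt, "^Copyright"))
body_end <- c(idx_word[-1] - 2, length(txt))
article <- mapply(function(s, e) str_c(txt[s:e], collapse = " "), idx_copy + 1, body_end)
This assumes exactly one line starting with `Copyright` per article. In the RTF file, the codes that mark underlined text can break that assumption unless they are stripped first.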
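Slicing relative to the word-count indexes is easier to follow on a small vector. The sketch below is one possible way to do it in base R, under two assumptions that are not confirmed by the post: every field sits on its own line, and each article has exactly one line starting with `Copyright` right before its body. The sample lines are hypothetical.

```r
# Hypothetical lines for two articles, one field per line.
lines <- c(
  "title1", "12 words", "12-12-2004", "BILD", "German", "Copyright",
  "first bla bla", "second bla bla~",
  "title2", "10 words", "10-12-2004", "BILD", "German", "Copyright",
  "first da da~"
)

idx_word <- grep("\\d?,?\\d+ words$", lines, perl = TRUE)  # word-count lines
idx_copy <- grep("^Copyright", lines)                      # last metadata line

# Title: the line directly above each word count.
title <- lines[idx_word - 1]

# Body: from the line after "Copyright" to the line before the next title
# (two lines above the next word count), or to the end for the last article.
body_start <- idx_copy + 1
body_end <- c(idx_word[-1] - 2, length(lines))
body <- mapply(function(s, e) paste(lines[s:e], collapse = " "),
               body_start, body_end)

print(title)
print(body)
```

Joining the body lines with a space keeps multi-paragraph articles in a single string per article, which fits one row per article in a data frame.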
Putting it all Together
Now we’ll put everything together into a function that takes the input file name as an argument and returns the extracted data:
extract_articles <- function(file_name) {
  # Load libraries
  library(readr)
  library(stringr)
  library(tidyverse)
  # Read the whole file as one string, then split it into lines
  htmlText <- read_file(paste0("files/", file_name))
  txt <- str_split(htmlText, "\\r?\\n")[[1]]
  # Remove blank lines from the text
  txt <- txt[str_trim(txt) != ""]
  # Find the word-count lines ("12 words"), which follow each title
  idx_word <- which(str_detect(txt, "\\d?\\,?\\d+ words$"))
  # Dates: first 12-12-2004 style match on the line after each word count
  date <- str_extract(txt[idx_word + 1], "\\d{1,2}-\\d{1,2}-\\d{4}")
  # Titles: the line directly above each word count (assumes one-line titles)
  title <- txt[idx_word - 1]
  # Bodies: from the line after "Copyright" up to the line before the next title
  idx_copy <- which(str_detect(txt, "^Copyright"))
  body_end <- c(idx_word[-1] - 2, length(txt))
  article <- mapply(function(s, e) str_c(txt[s:e], collapse = " "),
                    idx_copy + 1, body_end)
  # Create data frame with extracted data
  df <- tibble(
    Title = title,
    Date = date,
    Article = article
  )
  return(df)
}
Example Usage
Finally, we can use the `extract_articles` function to extract articles from our sample RTF and TXT files. The function looks for them in the `files/` directory:
# Extract articles from RTF file
df_rtf <- extract_articles("bild_afd_all.rtf")
# Print extracted articles from RTF file
print(df_rtf)
# Extract articles from TXT file
df_txt <- extract_articles("bild_afd_all.txt")
# Print extracted articles from TXT file
print(df_txt)
This will print the extracted data for both the RTF and TXT files.
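If the original files are not at hand, the approach can still be tried end to end on a temporary file. The sketch below is a base R stand-in, not the post's `extract_articles` function: `extract_sketch` is a hypothetical name, the sample lines are made up, and the layout (one field per line, one `Copyright` line per article) is an assumption.

```r
# Hypothetical sample; one field per line is an assumption about the layout.
sample_lines <- c(
  "title1", "12 words", "12-12-2004", "BILD", "German", "Copyright",
  "first bla bla bla~", "",
  "title2", "1,235 words", "10-12-2004", "BILD", "German", "Copyright",
  "first da da da~"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Base R stand-in for the pipeline (hypothetical helper).
extract_sketch <- function(path) {
  txt <- readLines(path, warn = FALSE)
  txt <- txt[trimws(txt) != ""]                         # drop blank lines
  idx_word <- grep("\\d?,?\\d+ words$", txt, perl = TRUE)
  idx_copy <- grep("^Copyright", txt)
  # Guard: the slicing below needs one Copyright line per article.
  stopifnot(length(idx_word) == length(idx_copy))
  body_end <- c(idx_word[-1] - 2, length(txt))
  data.frame(
    Title = txt[idx_word - 1],
    Date = txt[idx_word + 1],
    Article = mapply(function(s, e) paste(txt[s:e], collapse = " "),
                     idx_copy + 1, body_end),
    stringsAsFactors = FALSE
  )
}

articles <- extract_sketch(path)
print(articles)
unlink(path)
```

The length check fails early when the file does not follow the assumed layout, which is preferable to returning misaligned titles and bodies.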
Last modified on 2024-05-02