Split a plain text file into toots — split_to_toots

This function takes a plain text file (such as a Quarto blog post), splits it into toots, cleans them, and returns an object with a data frame and some intermediate products.

Usage

split_to_toots(
  x,
  fragmentsToSkip = getOption("quartodon_fragmentsToSkip", 1),
  tootSeparator = getOption("quartodon_tootSeparator", "^-----\\s*$"),
  preprocess = getOption("quartodon_preprocess", list(c("^#.*", ""), c("`", ""))),
  imgRegex = getOption("quartodon_imgRegex",
    "^!\\[([^\\]]*)\\]\\(([^\\)]*)\\)\\{?([^}]*)\\}?$"),
  imgAltRegex = getOption("quartodon_imgAltRegex", "fig-alt=\"([^\"]*)\""),
  urlRegex = getOption("quartodon_urlRegex",
    "(?!\\!)\\[([^\\]]*)\\]\\(([^\\)]*)\\)\\{?([^}]*)\\}?"),
  cleanWhitespace = getOption("quartodon_urlRegex", TRUE)
)
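Because every default in the usage above is read with getOption(), a default can presumably be changed for the whole session with options() instead of being passed on every call. The following is a minimal sketch of that lookup in base R; the alternative separator pattern is a hypothetical value chosen for illustration, and the sketch does not call the package itself.

```r
# Hypothetical alternative separator: a line of five equals signs
old <- options(quartodon_tootSeparator = "^=====\\s*$")

# The same lookup that the default argument in the usage performs
sep <- getOption("quartodon_tootSeparator", "^-----\\s*$")
print(sep)

# Restore the previous state; the fallback value applies again
options(old)
sepDefault <- getOption("quartodon_tootSeparator", "^-----\\s*$")
print(sepDefault)
```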

Arguments

x: The plain text file as a character vector.
fragmentsToSkip: The number of fragments to skip at the start of the text file (Quarto post, R Markdown file, etc.). By default, the first fragment (i.e. the lines preceding the first line that matches tootSeparator, which by default is a line of five dashes, -----) is skipped.
tootSeparator: A regular expression that identifies the lines separating toots. It is matched against every line (i.e. every element of the character vector).
preprocess: A list of 2-element vectors specifying the preprocessing to perform on each extracted toot. The two elements of each vector are passed as the first two arguments to gsub(), with the toot as the third argument and perl = TRUE.
imgRegex: The regular expression used to find images. It should have one capturing group that extracts the path to the image.
imgAltRegex: The regular expression used to find the images' alt text; it should have one capturing group that extracts the alt text.
urlRegex: The regular expression used to find hyperlinks. It should have one capturing group that extracts the title (not the URL).
cleanWhitespace: Whether to clean white space. If TRUE, all newline characters (\n) are stripped from the beginning and end of each toot, and all sequences of more than two newline characters are replaced with exactly two newline characters.
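The splitting, preprocessing, and white space cleaning described by these arguments can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the input vector is hypothetical, and applying gsub() line by line before joining the lines is an assumption of this sketch.

```r
# Hypothetical input: a YAML header fragment followed by two toots
x <- c("---", "title: Example", "---", "",
       "-----", "# Heading", "First `toot`.", "", "", "",
       "Still the first toot.",
       "-----", "Second toot.", "")

# Default values copied from the usage section
tootSeparator   <- "^-----\\s*$"
fragmentsToSkip <- 1
preprocess      <- list(c("^#.*", ""), c("`", ""))

# Every separator line starts a new fragment; cumsum() numbers them from 0
isSeparator   <- grepl(tootSeparator, x, perl = TRUE)
fragmentIndex <- cumsum(isSeparator)

# Drop the separator lines and group the remaining lines by fragment
fragments <- split(x[!isSeparator], fragmentIndex[!isSeparator])

# Skip the leading fragment(s), here the YAML header
fragments <- fragments[seq_along(fragments) > fragmentsToSkip]

# Apply each (pattern, replacement) pair with gsub(..., perl = TRUE)
# (assumption: line by line, before the lines are joined)
for (pair in preprocess) {
  fragments <- lapply(fragments, function(lines) {
    gsub(pair[1], pair[2], lines, perl = TRUE)
  })
}

# Join the lines of each fragment into one toot
toots <- unname(vapply(fragments, paste, character(1), collapse = "\n"))

# cleanWhitespace = TRUE: trim newlines at both ends, then collapse runs
# of more than two newlines to exactly two
toots <- gsub("^\n+|\n+$", "", toots, perl = TRUE)
toots <- gsub("\n{3,}", "\n\n", toots, perl = TRUE)

print(toots)
```

With this input, the header fragment is dropped, the heading line and the backticks are removed, and the run of blank lines inside the first toot is reduced to a single blank line.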

Value

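An object with a data frame and some intermediate products. The data frame is stored in the df element, and its toots column holds the text of each toot (as used in the example below).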

Examples

### Get example post directory
examplePostDir <-
  system.file("example-post",
              package = "quartodon");

### Get an example text (see the intro vignette)
exampleText <-
  readLines(
    file.path(examplePostDir, "quartodon.Rmd"),
    encoding = "UTF-8"
  );

### Extract the toots
extractedToots <- split_to_toots(
  exampleText
);

### Look at the text of the first extracted toot:
cat(extractedToots$df$toots[1]);
#> This thread explains the {quartodon} R 📦 (see https://quartodon.opens.science).
#> 
#> The #rstats quartodon 📦 allows you to post a Mastodon thread from a plain text file (e.g., a blog post from a Quarto, {blogdown}, or {distill} website, another Quarto or R Markdown file, or just a plain text file).
#> 
#> This effectively allows you to post blog posts to Mastodon in a thread of toots 📑➡️🪄➡️🐘🐘🐘
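The default imgRegex, imgAltRegex, and urlRegex patterns from the usage section can also be tried on their own with base R. The sketch below uses two hypothetical sample lines and shows what each capturing group of the defaults matches. Note that the default image and hyperlink patterns contain three capturing groups each; which of them the package reads internally is not confirmed here.

```r
# Default patterns copied from the usage section
imgRegex    <- "^!\\[([^\\]]*)\\]\\(([^\\)]*)\\)\\{?([^}]*)\\}?$"
imgAltRegex <- "fig-alt=\"([^\"]*)\""
urlRegex    <- "(?!\\!)\\[([^\\]]*)\\]\\(([^\\)]*)\\)\\{?([^}]*)\\}?"

# Hypothetical sample lines
imgLine  <- "![A caption](img/plot.png){fig-alt=\"A scatter plot\"}"
linkLine <- "[quartodon](https://quartodon.opens.science)"

# regexec() returns the full match followed by each capturing group
imgMatch <- regmatches(imgLine, regexec(imgRegex, imgLine, perl = TRUE))[[1]]
altMatch <- regmatches(imgLine, regexec(imgAltRegex, imgLine, perl = TRUE))[[1]]
urlMatch <- regmatches(linkLine, regexec(urlRegex, linkLine, perl = TRUE))[[1]]

imgPath   <- imgMatch[3]  # second group: the image path
imgAlt    <- altMatch[2]  # first group: the alt text
linkTitle <- urlMatch[2]  # first group: the link title
linkUrl   <- urlMatch[3]  # second group: the link target

print(c(imgPath, imgAlt, linkTitle, linkUrl))
```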