How It Is Made: Master Research Thesis

@published 2020/02/10, 12:00 {.published}

Uff, after six months of writing, reviewing, deleting, yelling and finally giving up, I finished my Master research thesis. You can check it out here.

The thesis is about intellectual property, commons and cultural and philosophical production. It was for the Master's degree in Philosophy at the National Autonomous University of Mexico (UNAM). The research was written in Spanish and consists of almost 27K words and ~100 pages.

From the beginning I decided not to write it with a word processor such as LibreOffice or Microsoft Office. I made that decision because:

Office software was designed for office work, not for research purposes.
Managing the bibliography or reviewing the writing can get very, very messy.
I needed several outputs, which would require heavy cleanup if I wrote the research in +++ODT+++ or +++DOCX+++ format.
I wanted to see how far I could go by using just Markdown, a terminal and +++FOSS+++.

In general the thesis is actually an automated repository where you can see everything---including the whole bibliography, the site and the writing history. The methodology is based on automated and multiformat standardized publishing, or as I like to call it: branched publishing.

This isn't the place to discuss the method, but these are some general ideas:

We have some inputs which are our working files.
We need several outputs which would be our ready-to-ship files.
We want automation so we only focus on writing and editing, instead of wasting our time on formatting or having nightmares with layout design.

In order to achieve that it's necessary to avoid any kind of +++WYSIWYG+++ and Desktop Publishing approaches. Instead, branched publishing employs +++WYSIWYM+++ and typesetting systems.

So let's start!

Inputs

I have two main input files: the content of the research and the bibliography. I used Markdown for the content. I decided to use BibLaTeX for the bibliography.

Markdown

Why Markdown? Because it is:

easy to read, write and edit
easy to process
a lightweight format
a plain and open format

Markdown was intended for blog writing. So “vanilla” Markdown isn't enough for research or scholarly writing. And I'm not a fan of Pandoc's Markdown.

Don't get me wrong, Pandoc is the Swiss Army knife of document conversion, its name suits it perfectly. But for the kind of publishing I do, Pandoc is part of the automation process, not a tool for inputs or outputs. When I use Pandoc as a middleman for some formats, it saves me a lot of time.

For input and output formats I think Pandoc is a great general purpose tool, but not enough for a fussy publisher like this dog. Plus, I love scripting so I prefer to spend my time on that instead of configuring Pandoc's outputs---it makes me learn more. So in this publishing process Pandoc is used when I haven't solved something or I'm too lazy to do it, +++LOL+++.

Unlike word processing formats such as +++ODT+++ or +++DOCX+++, +++MD+++ is very easy to customize. You don't need to install plugins, rather you just generate more syntax!

So Pecas Markdown was the base format for the content. The additional syntax was for citing the bibliography by its id.

BibLaTeX

Formatting the bibliography is one of the main headaches for many researchers. People need a lot of time to learn how to quote and cite. And no matter how much experience they have, the references or the bibliography usually have typos.

I know it from experience. Most of our clients have a huge mess in their bibliography. But 99.99% of the time it's because they do it manually… So I decided to avoid that hell.

There are several alternatives for bibliography formatting and the most common one is BibLaTeX, the successor of BibTeX. With this type of format you can arrange your bibliography in an object-like notation. Here is a sample of an entry:

@book{proudhon1862a,
  author    = {Proudhon, Pierre J.},
  date      = {1862},
  file      = {:recursos/proudhon1862a.pdf:PDF},
  keywords  = {prio2,read},
  publisher = {Office de publicité},
  title     = {Les Majorats littéraires},
  url       = {http://alturl.com/fiubs},
}

At the beginning of the entry you indicate its type and id. Each entry has an array of key-value pairs. Depending on the type of reference, there are some mandatory keys. If you need more, you can just put them in. This could be very hard to edit directly because +++PDF+++ compilation doesn't tolerate syntax errors. For comfort, you can use a +++GUI+++ like JabRef. With this software you can easily generate, edit or delete bibliographic entries as if they were rows in a spreadsheet.
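To make the "type, id and key-value pairs" structure concrete, here is a minimal Python sketch that splits one entry into those three parts. It is only an illustration: it assumes one field per line with the value wrapped in braces, as in the sample above, and it is not how JabRef or Biber actually parse +++BIB+++ files. The name `parse_entry` is made up for this example.

```python
import re

# The sample entry from above, shortened to four fields.
ENTRY = """@book{proudhon1862a,
  author    = {Proudhon, Pierre J.},
  date      = {1862},
  publisher = {Office de publicité},
  title     = {Les Majorats littéraires},
}"""

# "@type{id," at the start of the entry.
HEADER = re.compile(r"@(\w+)\{([^,\s]+),")
# "key = {value}," with one field per line (an assumption of this sketch).
FIELD = re.compile(r"^\s*(\w+)\s*=\s*\{(.*)\},?\s*$", re.MULTILINE)


def parse_entry(text):
    """Split one entry into its type, its id and a dict of fields."""
    header = HEADER.search(text)
    if header is None:
        raise ValueError("no entry header found")
    entry_type, entry_id = header.groups()
    fields = dict(FIELD.findall(text))
    return entry_type, entry_id, fields


entry_type, entry_id, fields = parse_entry(ENTRY)
print(entry_type, entry_id, fields["title"])
```

A real parser also has to handle nested braces, quoted values and fields that span several lines, which is one more reason to leave the job to JabRef or Biber.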

So I have two types of input formats: +++BIB+++ for bibliography and +++MD+++ for content. I make cross-references by generating some additional syntax that invokes bibliographic entries by their id. It sounds complicated, but for writing purposes it's just something like this:

@textcite[someone2020a] states… Now I am paraphrasing someone so I would cite her at the end @parencite[someone2020a].

When bibliography is processed I get something like this:

Someone (2020) states… Now I am paraphrasing someone so I would cite her at the end (Someone, 2020).

This syntax is based on LaTeX textual and parenthetical citation styles for BibLaTeX. The at sign (@) is the character I use at the beginning of any additional syntax for Pecas Markdown. For processing purposes I could use any other kind of syntax. But for writing and editing tasks I found the at sign to be very accessible and easy to find.
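The substitution itself can be sketched in a few lines of Python. This is a guess at the general idea, not the code the thesis uses: the `ENTRIES` dictionary is a hypothetical stand-in for data read from the +++BIB+++ file, and the author-year output format is hard-coded.

```python
import re

# Hypothetical lookup table: id -> (surname, year).
# A real run would build this from the BIB file.
ENTRIES = {"someone2020a": ("Someone", "2020")}

# Matches @textcite[id] and @parencite[id].
CITE = re.compile(r"@(textcite|parencite)\[([^\]]+)\]")


def expand_citations(text, entries):
    """Replace citation calls with author-year text."""
    def render(match):
        kind, key = match.group(1), match.group(2)
        if key not in entries:
            return match.group(0)  # leave unknown ids untouched
        surname, year = entries[key]
        if kind == "textcite":
            return f"{surname} ({year})"
        return f"({surname}, {year})"
    return CITE.sub(render, text)


line = "@textcite[someone2020a] states… so I cite her at the end @parencite[someone2020a]."
print(expand_citations(line, ENTRIES))
```

Leaving unknown ids untouched keeps a mistyped id visible in the output instead of silently dropping the citation.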

The example is very simple, so it may not make clear the point of doing this. By using ids:

I don't have to worry if the bibliographic entries change.
I don't have to learn any citation style.
I don't have to write the bibliography section, it is done automatically!
I always get the correct structure.

In a later section I explain how this is possible. The main idea is that with some scripts these two inputs become one, a Markdown file with added bibliography, ready for automation processes.

Outputs

I hate +++PDF+++ as the only research output, because most of the time I do a general reading on screen and, if I want a detailed reading, with notes and shit, I prefer to print it. It isn't comfortable to read a +++PDF+++ on screen and most of the time printed +++HTML+++ or ebooks are aesthetically unpleasant. That's why I decided to deliver different formats, so readers can pick the one they like the most.

Seeing how publishing is becoming more and more centralized, the deployment of +++MOBI+++ formats for Kindle readers is unfortunately advisable---by the way, +++FUCK+++ Amazon, they steal from writers and publishers, use Amazon only if the text isn't available from another source. I don't like proprietary software such as Kindlegen, but it is the only legal way to deploy +++MOBI+++ files. I hope that little by little Kindle readers at least start to hack their devices. Right now Amazon is the shit, but remember: if you don't have it, you don't own it. See what happened with books in Microsoft Store…

And the cherry on the cake was a request from my tutor. He wanted an editable file he could use easily. A long time ago Microsoft monopolized ewriting, so the easiest solution is to provide a +++DOCX+++ file. I would rather use +++ODT+++ format but I have seen how some people don't know how to open it. My tutor isn't part of that group, but for the outputs it's good to think not only about what we need but about what we could need. People barely read research; if it isn't accessible in a format they know, they won't read it.

So, the outputs are the following:

+++EPUB+++ as standard ebook format.
+++MOBI+++ for Kindle readers.
+++PDF+++ for printing.
+++DOCX+++ as editable file.

Ebooks

I don't use Pandoc for ebooks, instead I use a publishing tool we are developing: Pecas. “Pecas” means “freckles”, but in this context it's in honor of a pinto dog from my childhood.

Pecas allows me to deploy +++EPUB+++ and +++MOBI+++ formats from +++MD+++ plus document statistics, file validations and easy metadata handling. Each Pecas project can be heavily customized since it allows Ruby, Python or shell scripts. The main objective behind this is the ability to remake ebooks from recipes. Therefore, the outputs are disposable in order to save space and because you don't need them all the time and you shouldn't edit final formats!

Pecas is rolling release software under the General Public License, so it's an open, free and libre program. For a couple of months Pecas has been unmaintained because this year we are gonna start all over again, with cleaner code, easier installation and a bunch of new features---I hope, we need your support.

PDF

For +++PDF+++ output I rely on LaTeX and LuaLaTeX. Why? Just because it is what I'm used to. I don't have any particular argument against other frameworks or engines inside the TeX family. It's a world I still have to dig into more.

Why don't I use desktop publishing instead, like InDesign or Scribus? Outside its own workflow, desktop publishing is hard to automate and maintain. This approach is great if you just want a +++PDF+++ output or if you want to work with a +++GUI+++. For file longevity and automated and multiformat standardized publishing it simply isn't the best option.

Why don't I just export a +++PDF+++ from the +++DOCX+++ file? I work in publishing, I still have some respect for my eyes…

Anyways, for this output I use Pandoc as a middleman. I could manage the conversion from +++MD+++ to +++TEX+++ format with scripts, but I was lazy. So, Pandoc converts +++MD+++ to +++TEX+++ and LuaLaTeX compiles it into a +++PDF+++. I don't call both programs explicitly, instead I wrote a script to automate this job. In a later section I explain this.

DOCX

This output doesn't have anything special. I didn't customize its styles. I just call Pandoc via another script. Remember, this file is for editing so its layout doesn't really matter.

Writing

Besides the publishing method used in this research, I want to comment on some particularities of how the technical setup influenced the writing.

Text Editor

I never use word processors, so writing this thesis wasn't an exception. Instead, I prefer to use text editors. Among them I have a particular taste for the most minimalist ones like Vim or Gedit.

Vim is a terminal text editor. I use it on a regular basis---sorry Emacs folks. I write almost anything, including this thesis, with Vim because of its minimalist interface. No fucking buttons, no distractions, just me and the black-screen terminal.

Gedit is a +++GUI+++ text editor and I use it mainly for RegEx or searches. In this project I used it for quick looks at the bibliography. I like JabRef as a bibliography manager, but for getting the ids I just need access to the raw +++BIB+++ file. Gedit was a good companion for that particular job, because of its lack of “buttonware”---this annoying tendency to put buttons everywhere.

Citations

I wanted the research to be as accessible as possible. I didn't want to use a complicated citation style. That's why I only used parenthetical and textual cites.

This could be an issue for most scholars. But when I see typos in their complex citations and quotations I can't have any empathy. If you are gonna add complexity to your work, the least you can do is to do it right. And, let's be honest, most scholars add complexity because they wanna look fancy---i.e. they conform with formation rules for research texts in order to be part of a community or gain some objectivity.

Block Quotations

You are not gonna see any block quotes in the research. This wasn't only because of accessibility---some people can't distinguish this type of quotes---but also because of the way the bibliography was handled.

One of the main purposes of block quotations is to give a first-hand, extended account of what someone else said. But sometimes they are used as text filler. In the usual way of doing research in Philosophy, the output tends to be a “final” paper. That text is the research plus the bibliography. This format doesn't allow embedding any other files, like papers, websites, books or databases. If you want to provide some literal information, quotes and block quotes are the way to do it.

Because this thesis is actually an automated repository, it contains all references used for the research. It has bibliography, but also each quoted work for backup and educational purposes. Why would I use block quotes if you can easily check the files? Even better, you could use some search function or read the whole data for validation purposes.

Moreover, the university doesn't allow submissions of long texts. I agree with that, I think we have other technical capabilities that allow us to be more synthetic. By putting aside block quotes, I had more space for the actual research.

Take it or leave it, research as a repository and not as a file gives us more possibilities for accessibility, portability and openness.

Footnotes

Oh, the footnotes! Such a beautiful technique for displaying side text. It works great, it allows metawriting and so on. But it works as expected only if the output you are thinking of is, firstly, a file and, secondly, a text with fixed layout. In other kinds of output, footnotes can be a nightmare.

I have the conviction that most footnotes can go inside the text. This is because of three personal experiences. During my undergraduate and graduate studies, as Philosophy students we had to read a lot of fucking critical editions, which tend to have their “critical” notes as footnotes. For these types of text I get it, people don't like to mix their words with someone else's, even less if it's between a philosophical authority and a contemporary philosopher---notice that it's a taste not a mandate. But this is a shitty Master research thesis, not a critical edition.

I used to hate footnotes, now I just dislike them. Part of my job is to review, extract and fix other people's footnotes. I can bet you that half of the time footnotes aren't properly displayed or they are missing. Commonly this is not a software error. Sometimes it is because people do them manually. But I won't blame publishers or designers for their mistakes. The way things are developing in publishing, most of the time the issue is the lack of time. We are being pushed to publish books as quickly as we can and one of the side effects is the loss of quality. Bibliography, footnotes and block quotes are the easiest way to find out how much care a text has received.

I do blame some authors for this mess. I repeat, it is just a personal experience, but in my work I have seen that most authors put footnotes in the following situations:

They want to add shit but not to rewrite shit.
They aren't good at writing or they are in a hurry, so footnotes are the way to go.
They think that by adding footnotes, block quotes or references they would “earn” objectivity.

I think my thesis needs more rewriting, I could have said things in a more comprehensive way, but I was done---writing philosophy is not my thing, I prefer to talk about it or program (!) it. That means I took my time on the review process---ask my tutor about that, +++LMFAO+++. It would have been easier for me to just add footnotes, but it would have been harder for you to read that shit. Besides that, footnotes take more space than rewriting.

So, out of respect for the reader and in keeping with the text length allowed by my university, I decided not to use footnotes.

Programming

As you can see, I had to write some scripts and use third party software in order to have a thesis as an automated repository. It sounds difficult or like nonsense, but doesn't Philosophy have that kind of reputation too? >:)

MD Tools

The first challenges I had were:

I needed to know exactly how many pages I had written.
I wanted an easier way to beautify +++MD+++ format.
I had to do some quality checks in my writing.

Thus, I decided to develop some programs for these jobs: texte, texti and textu, respectively.

These programs are actually Ruby scripts that I put in my /usr/local/bin directory. You can do the same, but I wouldn't recommend it. Right now in Programando +++LIBRE+++ros we are refactoring all that shit so they can be shipped as a Ruby gem. So I recommend you wait.

With texte I can know the number of lines, characters, characters with spaces, words and three different page counts: one page for every 1,800 characters with spaces, one for every 250 words and an average of both---you can set other lengths for page sizes.
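The arithmetic behind those counts is simple enough to sketch. What follows is an illustrative Python version, not texte itself (which is a Ruby script); the function name `text_stats` and the exact counting rules, such as ignoring newlines in "characters with spaces", are assumptions.

```python
def text_stats(text, chars_per_page=1800, words_per_page=250):
    """Count lines, characters and words, and estimate pages three ways."""
    lines = text.splitlines()
    words = text.split()
    # Characters with spaces: everything except the line breaks.
    chars_with_spaces = sum(len(line) for line in lines)
    # Characters: everything except whitespace.
    chars = sum(len(word) for word in words)
    pages_by_chars = chars_with_spaces / chars_per_page
    pages_by_words = len(words) / words_per_page
    return {
        "lines": len(lines),
        "characters": chars,
        "characters_with_spaces": chars_with_spaces,
        "words": len(words),
        "pages_by_characters": pages_by_chars,
        "pages_by_words": pages_by_words,
        "pages_average": (pages_by_chars + pages_by_words) / 2,
    }


for name, value in text_stats("one two three\nfour five").items():
    print(name, value)
```

Both page lengths are keyword arguments, so another convention only needs different values for `chars_per_page` and `words_per_page`.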

The +++MD+++ beautifier is texti. For the moment it only works well on paragraphs. It was enough for me, my issue was with the disparate length of lines---yeah, I don't use line wrap.
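One way to even out line lengths in paragraphs is to rejoin each paragraph and wrap it again at a fixed width. The Python sketch below does that with the standard `textwrap` module. Whether texti works this way is an assumption; the function name and the default width of 80 are made up for the example.

```python
import re
import textwrap


def rewrap_paragraphs(text, width=80):
    """Re-wrap every blank-line-separated paragraph to the same width."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    wrapped = [
        textwrap.fill(
            " ".join(paragraph.split()),  # collapse the old line breaks
            width=width,
            break_long_words=False,       # keep URLs and long ids whole
            break_on_hyphens=False,
        )
        for paragraph in paragraphs
    ]
    return "\n\n".join(wrapped) + "\n"


print(rewrap_paragraphs("a short\nline and a much longer line here\n\nnext paragraph\n", width=20))
```

Like the tool described above, this only behaves well on plain paragraphs: it would also reflow headings, lists and code blocks, which a real beautifier has to skip.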

I also tried to avoid some typical mistakes while using quotation marks or brackets: sometimes we forget to close them. So textu is for this quality check.
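A check like this is usually done with a stack: push every opening mark, pop it when its closing mark arrives, and report whatever is left over. The Python sketch below shows that technique under my own assumptions. It is not textu, and the set of pairs in `PAIRS` is only an example.

```python
# Opening mark -> closing mark. Straight quotes (") are left out because
# the same character opens and closes, so a stack can't tell them apart.
PAIRS = {"(": ")", "[": "]", "{": "}", "“": "”", "«": "»"}


def unbalanced_marks(text):
    """Return (line_number, mark) for every mark without a partner."""
    openers = {close: open_ for open_, close in PAIRS.items()}
    stack = []      # opening marks still waiting to be closed
    problems = []   # closing marks that matched nothing
    for lineno, line in enumerate(text.splitlines(), start=1):
        for char in line:
            if char in PAIRS:
                stack.append((lineno, char))
            elif char in openers:
                if stack and stack[-1][1] == openers[char]:
                    stack.pop()
                else:
                    problems.append((lineno, char))
    return problems + stack


sample = "An (unclosed bracket\nand a “closed” quote"
print(unbalanced_marks(sample))
```

Reporting the line number along with the mark lets the writer jump straight to the spot in the editor.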

These three programs were very helpful for my writing, that is why we decided to continue their development as a Ruby gem. For our work and personal projects, +++MD+++ is our main format, so we have the obligation to provide tools in order to help writers and publishers that are also using Markdown.

Baby Biber

If you are into TeX family, you probably know Biber, the bibliography processing program. With Biber we can compile bibliographic entries of BibLaTeX in +++PDF+++ outputs and do checks or clean ups.

I started to have issues with the references because our publishing method implies the deployment of outputs in separate processes from the same inputs, in this case +++MD+++ and +++BIB+++ formats. With Biber I was able to add the bibliographic entries but only for +++PDF+++.

The solution I came to was the addition of references in +++MD+++ before any other process. By doing this I merged the inputs in one +++MD+++ file. This new file is used for the deployment of all the outputs.

This solution implies the use of Biber as a clean up tool and the development of a program that processes bibliographic entries of BibLaTeX inside Markdown files. Baby Biber is this program. I wanted to honor Biber and make clear that this program is in its baby steps.

What does Baby Biber do?

It creates a new +++MD+++ file with references and bibliography.
It adds references if the original +++MD+++ file calls @textcite or @parencite with a correct BibLaTeX id.
It adds bibliography at the end of the documents according to the called references.
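The last step, building a bibliography only from the references actually called, can be sketched as follows. This is a Python illustration of the idea, not Baby Biber's code: the entries are given as plain dictionaries, the `someone2020a` entry is a made-up example, and the output line format is an assumption (Baby Biber reads its format from a +++YAML+++ configuration file).

```python
import re

CITE = re.compile(r"@(?:textcite|parencite)\[([^\]]+)\]")

# Entries as plain dictionaries; a real run would read the BIB file.
ENTRIES = {
    "proudhon1862a": {
        "author": "Proudhon, Pierre J.",
        "date": "1862",
        "title": "Les Majorats littéraires",
        "publisher": "Office de publicité",
    },
    # Hypothetical entry that is never cited in the sample text below.
    "someone2020a": {
        "author": "Someone",
        "date": "2020",
        "title": "A Title",
        "publisher": "A Publisher",
    },
}


def cited_ids(markdown):
    """Return the cited ids in order of first appearance, without repeats."""
    return list(dict.fromkeys(CITE.findall(markdown)))


def bibliography_section(markdown, entries):
    """Build a bibliography that lists only the entries the text cites."""
    lines = []
    for key in cited_ids(markdown):
        entry = entries.get(key)
        if entry is None:
            continue  # unknown id: nothing to list
        lines.append("{author} ({date}). {title}. {publisher}.".format(**entry))
    return "\n\n".join(sorted(lines))


text = "@textcite[proudhon1862a] argues… as said before @parencite[proudhon1862a]."
print(bibliography_section(text, ENTRIES))
```

Because the section is derived from the citation calls, an entry that stops being cited disappears from the bibliography on the next run.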

One headache with references and bibliography styles is how to customize them. With Pandoc you can use pandoc-citeproc, which allows you to select any style written in Citation Style Language (+++CSL+++). These styles are in +++XML+++ and it is a serious thing: you should apply these standards. You can check different +++CSL+++ citation styles in its official repo.

Baby Biber still doesn't support +++CSL+++! Instead, it uses +++YAML+++ format for its configuration. This is because of two issues:

I didn't take the time to read how to implement +++CSL+++ citation styles.
My University allows me to use any kind of citation style as long as it is uniform and displays the information in a clear manner.

So, yeah, I have a huge debt here. And maybe it will stay like that. The new version of Pecas will implement and improve the work done by Baby Biber---I hope.

PDF exporter

The last script I wrote is for the automation of +++PDF+++ compilation with LuaLaTeX and Biber (optionally).

I don't like the default layouts of Pandoc and I could have read the docs in order to change that behavior, but I decided to experiment a bit. The new version of Pecas will implement +++PDF+++ outputs, so I wanted to play a little more with the formatting, as I did with Baby Biber. Besides, I needed a quick program for +++PDF+++ outputs, because sometimes we publish fanzines.

So, export-pdf is this experiment. It uses Pandoc to convert +++MD+++ to +++TEX+++ files. Then it does some clean up and injects the template. Finally it compiles the +++PDF+++ with LuaLaTeX and Biber---if you want to add the bibliographic entries by this means. It also exports a +++PDF+++ booklet with pdfbook2, but I don't deploy it in this repo because the +++PDF+++ is letter size, too big for a booklet.
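The order of the external calls can be sketched like this in Python. It is a rough outline under stated assumptions, not export-pdf itself: the function names are made up, the clean up, template injection and pdfbook2 steps are left out, and rerunning LuaLaTeX after Biber is the usual LaTeX practice rather than something stated above.

```python
import subprocess
from pathlib import Path


def pdf_commands(md_file, use_biber=False):
    """Return the commands, in order, for the MD -> TEX -> PDF pipeline."""
    md = Path(md_file)
    tex = md.with_suffix(".tex")
    commands = [
        # Pandoc picks the output format from the .tex extension.
        ["pandoc", str(md), "-o", str(tex)],
        # The clean up and template injection would happen here (not shown).
        ["lualatex", "-interaction=nonstopmode", str(tex)],
    ]
    if use_biber:
        commands += [
            ["biber", tex.stem],  # Biber takes the job name, no extension
            ["lualatex", "-interaction=nonstopmode", str(tex)],
        ]
    return commands


def run_all(commands):
    """Run each command and stop at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


for command in pdf_commands("tesis_with-bib.md", use_biber=True):
    print(" ".join(command))
```

Keeping the command list separate from `run_all` makes it possible to inspect the pipeline without having Pandoc or LuaLaTeX installed.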

I have a huge debt here that I won't pay. It is cool to have a program for +++PDF+++ outputs that I understand, but I still want to experiment with ConTeXt.

I think ConTeXt could be a lot of help because I can use +++XML+++ files for +++PDF+++ outputs. I defend Markdown as input format for writers and publishers, but for automation +++XML+++ format is way better. For the new version of Pecas I have been thinking about the possibility of using +++XML+++ for any kind of standard output like +++EPUB+++, +++PDF+++ or +++JATS+++. I have problems with +++TEX+++ format because it creates an additional format just for one output, why would I allow it if +++XML+++ can provide me with at least three outputs?

Third parties

I already mentioned the third party software I used for this repo:

Vim as main text editor.
Gedit as side text editor.
JabRef as bibliography manager.
Pandoc as document converter.
LuaLaTeX as +++PDF+++ engine.
Biber as bibliography cleaner.

The tools I developed and this software are all +++FOSS+++, so you can use them if you want without paying or asking for permission---and without warranty xD

Deployment

There is a fundamental design issue in this research as automated repository: I should have put all the scripts in one place. At the beginning of the research I thought it would be easier to place each script side by side with its inputs. Over time I realized it wasn't a good idea.

The good thing is that there is one script that works as a wrapper. You don't really have to know any of that. You just write the research in Markdown, fill the BibLaTeX bibliography and call that script whenever you want, or whenever your server is configured to.

This is a simplified listing showing the places of each script, inputs and outputs inside the repo:

.
├─ [01] bibliografia
│   ├─ [02] bibliografia.bib
│   ├─ [03] bibliografia.html
│   ├─ [04] clean.sh
│   ├─ [05] config.yaml
│   └─ [06] recursos
├─ [07] index.html
└─ [08] tesis
    ├─ [09] docx
    │   ├─ [10] generate
    │   └─ [11] tesis.docx
    ├─ [12] ebooks
    │   ├─ [13] generate
    │   └─ [14] out
    │       ├─ [15] generate.sh
    │       ├─ [16] meta-data.yaml
    │       ├─ [17] tesis.epub
    │       └─ [18] tesis.mobi
    ├─ [19] generate-all
    ├─ [20] md
    │   ├─ [21] add-bib
    │   ├─ [22] tesis.md
    │   └─ [23] tesis_with-bib.md
    └─ [24] pdf
        ├─ [25] generate
        └─ [26] tesis.pdf

Bibliography pathway

Even in a simplified view you can see how this repo is a fucking mess. The bibliography [01] and the thesis [08] are the main directories in this repo. As a sibling you have the website [07].

The bibliography directory isn't part of the automation process. I worked on the +++BIB+++ file [02] at different times from my writing. I exported it to +++HTML+++ [03] every time I used JabRef. This +++HTML+++ is for queries from the browser. There is also a simple script [04] to clean the bibliography with Biber and the configuration file [05] for Baby Biber. Are you a data hoarder? There is a special directory [06] for you with all the works used for this research ;)

Engine on

The thesis directory [08] is where everything moves smoothly when you call generate-all [19], the wrapper that turns on the engine!

The wrapper does the following steps:

It adds the bibliography [21] to the original +++MD+++ file [22], leaving a new file [23] to act as input.
It compiles [25] the +++PDF+++ output [26].
It generates [13] the +++EPUB+++ [17] and +++MOBI+++ [18] according to their metadata [16] and Pecas config file [15].
It exports [10] the +++MD+++ to +++DOCX+++ [11].
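The four steps above amount to running four scripts in a fixed order and stopping if one fails. A minimal Python sketch of such a wrapper could look like this. The relative script paths come from the listing above; everything else (the function name, the `runner` parameter, running each script from its own directory) is an assumption and not the actual generate-all.

```python
import subprocess
from pathlib import Path

# Script paths relative to the thesis directory, in the order of the steps.
STEPS = [
    "md/add-bib",       # merge bibliography into tesis_with-bib.md
    "pdf/generate",     # compile the PDF
    "ebooks/generate",  # build EPUB and MOBI with Pecas
    "docx/generate",    # export the DOCX
]


def run_steps(root, steps=STEPS, runner=subprocess.run):
    """Run every step in order; check=True aborts on the first failure."""
    root = Path(root).resolve()
    for step in steps:
        script = root / step
        # Each script runs from its own directory, next to its inputs.
        runner([str(script)], cwd=script.parent, check=True)


if __name__ == "__main__":
    # Only attempt a real run from inside a checkout that has the scripts.
    if Path("tesis/md/add-bib").exists():
        run_steps("tesis")
```

The order matters: every output is built from tesis_with-bib.md, so the bibliography step has to come first.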

And that's it. The process of developing a thesis as an automated repository allows me to just worry about three things:

Write the research.
Manage the bibliography.
Deploy all outputs automatically.

The legal stuff

That's how it works, but we still have to talk about how the thesis can legally be used…

This research was paid for by all Mexicans through their taxes. The National Council of Science and Technology (Conacyt, by its acronym in Spanish) gave me a scholarship to study the Master in Philosophy at the UNAM---yeah, American and British folks, most of the time we get paid here for our graduate studies.

This scholarship is a problematic privilege. So the least I can do in return is to liberate everything that was paid for by my homies and give free workshops and advice. I repeat: it is the least we can do. I disagree with using this privilege to have a nice or party life and then drop out. In this country, with all the crises we are having, this scholarship is to improve your community and not only yourself.

In general, I have the conviction that if you are a researcher or a graduate student and you already get paid---it doesn't matter whether it's a salary or a scholarship, it doesn't matter whether you are in a public or private university, it doesn't matter whether you get the money from public or private administrations---you have a commitment to your community, to our species and to our planet. If you wanna talk about free labor and exploitation---which does happen---please look at the bottom. In this shitty world you are on the upper levels of this nonsense pyramid.

As a researcher, scientist, philosopher, theorist, artist and so on, you should help other people. You can still feed your ego and believe you are the shit or the next +++AAA+++ thinker, philosopher or artist. The two things don't overlap---but it's still annoying.

That is why this research has a copyfarleft license for its content and a copyleft license for its code. Actually, it's the same licensing scheme of this blog.

With the Open and Free Publishing License (+++LEAL+++, by its acronym in Spanish that also means “loyal”) you are free to use, copy, reedit, modify, share or sell any of this content under the following conditions:

Anything produced with this content must be under some type of +++LEAL+++.
All files---editable or final formats---must be publicly accessible.
The content usage cannot imply defamation, exploitation or surveillance.

Copyfarleft is the way---but not the solution---that fits our context and our possibilities of freedom. Don't come here with your liberal and individualistic notion of freedom---like the dudes from Weblate that kicked this blog out because its content license “is not free,” even though they say the code, but not the content, should use a “free” license, like the fucking +++GPL+++ this blog has for its code. This type of liberal freedom doesn't work in a place where no State or corporation can guarantee us a minimum set of individual freedoms, as happens in Asia, Africa and the other America---Latin America and the America that isn't portrayed in the “American Dream” ads.

Last thoughts

Just as a thesis works on a hypothesis, the technical and legal pathway of this research worked with the possibility of having a thesis as an automated repository, instead of a thesis as a file. In the end, the possibility became a fact, but in a limited way.

I think that the idea of a thesis as an automated repo is doable and that it could be a better way to deploy research than uploading a single file. But this implementation has a lot of leaks that make it unsuitable for scaling.

Further work is necessary to be able to ship this as a standard practice. This technique could also be applied for automation and uniformity among publications, like papers in a journal or a book collection. The required labor isn't too much, and maybe it is something I would work on for the PhD. But right now this is all I can offer!

Thanks to @hacklib for pushing me to write this post and, again, thanks to my +++S.O.+++ for persuading me to study the Master degree and for reviewing this post. Thanks to my tutor, Programando +++LIBRE+++ros and Gaby for their academic support. I can't forget to say thanks to Colima Hacklab, Rancho Electrónico and Miau Telegram Group for their technical support. And also thanks to all the people and organizations I mention in the acknowledgment section of the research!
