#+OPTIONS: ':nil *:t -:t ::t <:t H:3 \n:nil ^:t arch:headline
#+OPTIONS: author:t broken-links:nil c:nil creator:nil
#+OPTIONS: d:(not "LOGBOOK") date:t e:t email:nil f:t inline:t num:t
#+OPTIONS: p:nil pri:nil prop:nil stat:t tags:t tasks:t tex:t
#+OPTIONS: timestamp:t title:t toc:t todo:t |:t
#+TITLE: Finding one's way with tags and desktop search applications
#+AUTHOR: Christophe Pouzat
#+EMAIL: christophe.pouzat@parisdescartes.fr
#+LANGUAGE: en
#+SELECT_TAGS: export
#+EXCLUDE_TAGS: noexport
#+CREATOR: Emacs 26.1 (Org mode 9.1.9)
#+STARTUP: indent

* Table of contents                                                     :TOC:
- [[#leibniz-quote][Leibniz quote]]
- [[#searching-with-a-text-editor][Searching with a text editor]]
- [[#search-with-a-hand-made-index-in-a-notebook][Search with a hand-made index in a notebook]]
- [[#search-with-a-materialized-index][Search with a "materialized" index]]
- [[#towards-the-sophisticated-tools-of-computers][Towards the "sophisticated" tools of computers]]
- [[#desktop-search-engines][Desktop search engines]]
- [[#why-labels-tags?][Why labels/tags?]]
- [[#metadata][Metadata]]
  - [[#image-files][Image files]]
  - [[#pdf-files][=PDF= files]]
  - [[#audios-files][Audios files]]

* Leibniz quote
I found the introductory quote on the [[http://www.backwordsindexing.com/index.html]] website.
Leibniz was a librarian for a fair part of his life, which partly explains his concern for classification, indexing, etc.

* Searching with a text editor

The corresponding slide reminds viewers of something they already know, and which people switching from paper to digital note-taking perceive as a huge improvement.

Unix/Linux users also know [[https://en.wikipedia.org/wiki/Grep][grep]], a command-line utility that searches plain-text data for lines matching a [[https://en.wikipedia.org/wiki/Regular_expression][regular expression]] in one or several files; we will come back to it.

* Search with a hand-made index in a notebook  

Again, this is something we already discussed (in sequence 2).

* Search with a "materialized" index

A reminder.

* Towards the "sophisticated" tools of computers
- The techniques we have just seen or reviewed (search with a text editor, index of a notebook) only work for a single "document" and/or for a single type of document.
- The computer tools at our disposal allow us to go further in the indexing of digital files.
- It is possible to add labels or keywords to text files as well as to image files (=jpg=, =png=) or "mixed" files (=pdf=) thanks to the metadata they contain.
- Desktop search engines make it possible to index all the text files in a given directory tree, as well as the metadata of the other files.

* Desktop search engines

Desktop search engines such as:
- [[http://docfetcher.sourceforge.net/en/index.html][DocFetcher]] (Linux, macOS, Windows);
- [[https://wiki.gnome.org/Projects/Tracker][Tracker]] (Linux);
- [[https://www.lesbonscomptes.com/recoll/][Recoll]] (Linux, macOS, Windows);
- [[https://en.wikipedia.org/wiki/Spotlight_(software)][Spotlight]] (macOS);
allow us to search the /content/ of text files, emails and files generated by word processors (/i.e./ files that essentially contain text but are stored in formats such as =doc=, =docx= or =odt= that are not plain-text formats), as well as the content of =pdf= files when they are not /images/ of text. They can also search the [[https://en.wikipedia.org/wiki/Portable_Document_Format#Metadata][metadata]] of =pdf= and other files.

Desktop search engines use [[https://fr.wikipedia.org/wiki/Indexation_automatique_de_documents][indexing]] techniques that significantly reduce search times compared to the search functions built into operating systems by default. Unlike the latter, they often also support [[https://fr.wikipedia.org/wiki/M%C3%A9tadonn%C3%A9e][metadata]] and are able to [[https://fr.wikipedia.org/wiki/Analyse_syntaxique][parse]] the files.

As an example of such "built-in default search functions", Unix/Linux systems provide the program [[https://fr.wikipedia.org/wiki/Grep][grep]], with which we can search for occurrences of the word "Placcius" in the "module1/ressources" directory of our repository [[https://gitlab.inria.fr/learninglab/mooc-rr/mooc-rr-ressources][mooc-rr-ressources]] (after cloning it):

#+NAME: grep-Placcius-module1/ressources
#+BEGIN_SRC sh :results output :exports both
grep -r Placcius
#+END_SRC 

#+RESULTS: grep-Placcius-module1/ressources
#+begin_example
sequence1.org:- [[#note-cabinets-from-placcius-and-leibniz][Note cabinets from Placcius and Leibniz]]
sequence1.org:* Note cabinets from Placcius and Leibniz
sequence2_fr.org:Nous revenons sur le « bout de papier » ou la fiche comme support de note. L'inconvénient est que le bout de papier ou la fiche se perdent facilement et ne servent à rien s'ils ne sont pas *classés* en plus d'être rangés. Problème résolu par l'armoire de Placcius. D'une certaine façon, sa conception fait qu'on accède à son contenu par l'index.
sequence2.org:We see (again) Placcius' and Leibniz's closet since it displays both the benefits and the shortcomings of media that hold *a single note*.
sequence2.org:These problems are solved by Placcius' cabinet, the content of which is fundamentally accessed through the index.
sequence5_fr.org:- les notes manuscrites sur fiches sont généralement stockées dans un meuble dont la structure matérialise un index — comme l'armoire de Placcius et Leibniz — ;
sequence5_fr.org:: PITCHME.md:Remarquez l'avantage des « bouts de papiers classés » de Placcius et Leibniz sur le _codex_ de Galilée : les premiers peuvent être facilement réordonnées.
sequence1_fr.org:- [[#armoires-à-notes-de-placcius-et-leibniz][Armoires à notes de Placcius et Leibniz]]
sequence1_fr.org:* Armoires à notes de Placcius et Leibniz
#sequence5_fr.org#:- les notes manuscrites sur fiches sont généralement stockées dans un meuble dont la structure matérialise un index — comme l'armoire de Placcius et Leibniz — ;
#sequence5_fr.org#:module1/ressources/sequence5_fr.org:: PITCHME.md:Remarquez l'avantage des « bouts de papiers classés » de Placcius et Leibniz sur le _codex_ de Galilée : les premiers peuvent être facilement réordonnées.
#sequence5_fr.org#:module1/slides/misc/Notes_module1.org:: PITCHME.md:Remarquez l'avantage des « bouts de papiers classés » de Placcius et Leibniz sur le _codex_ de Galilée : les premiers peuvent être facilement réordonnées.
#sequence5_fr.org#:module1/slides/misc/PITCHME.md:Remarquez l'avantage des « bouts de papiers classés » de Placcius et Leibniz sur le _codex_ de Galilée : les premiers peuvent être facilement réordonnées.
#+end_example
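
To give a feel for why an index speeds things up, here is a minimal sketch of a toy inverted index built with standard Unix tools. It is not how =DocFetcher= or =Recoll= work internally; the three small files, the index file =index.tsv= and the helper names =build_index= and =lookup= are all made up for the illustration. The files are scanned once to build a sorted table of (word, file) pairs; later queries only read that table instead of scanning every file again, as =grep= does.

#+BEGIN_SRC sh :exports code
#!/bin/sh
# Toy inverted index; all names below are made up for the illustration.
demo=$(mktemp -d)
printf 'Placcius built a note cabinet\n' > "$demo/a.txt"
printf 'Leibniz owned a similar cabinet\n' > "$demo/b.txt"
printf 'Locke preferred a notebook\n' > "$demo/c.txt"

# Indexing phase: scan every file once and record "word<TAB>file" pairs.
build_index() {
    # $1: directory to index, $2: index file to write
    for f in "$1"/*.txt; do
        tr -cs '[:alnum:]' '\n' < "$f" |
            tr '[:upper:]' '[:lower:]' |
            awk -v file="$(basename "$f")" 'NF { print $0 "\t" file }'
    done | sort -u > "$2"
}

# Query phase: look the word up in the index only, not in the files.
lookup() {
    # $1: index file, $2: word (case-insensitive)
    word=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    awk -F '\t' -v w="$word" '$1 == w { print $2 }' "$1"
}

build_index "$demo" "$demo/index.tsv"
lookup "$demo/index.tsv" Cabinet
#+END_SRC

The last call lists =a.txt= and =b.txt=, the two files that contain the word "cabinet". Real engines add much more (parsing of file formats, stemming, metadata, incremental updates), but the split between an indexing phase and a query phase is the same.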


* Why labels/tags?

A query based on a single word often returns a very large number of hits, even though most desktop search engines allow you to filter them. An effective way to limit their number is to include /labels/ in our documents, i.e. labelled anchor points that are easily indexed by the desktop search engine and whose text does not match any ordinary word or phrase. This is a simplified version of the work of the /indexer/, the person responsible for building a book index. To keep the label meaningful, simply frame a word with a pair of punctuation marks such as ":", ";" or "?". A label such as ":code:" is easy to memorize and is a perfect equivalent of the keyword "code" used in the example notebook in the second sequence of this module to illustrate Locke's method.

We still have one technical detail to resolve for notes taken in a text format such as =Markdown=: we do not want our labels to appear in the =html=, =pdf= or =docx= outputs of our notes. For lightweight markup languages that have no built-in labels (=Markdown= has none, while =org= does), one way to do this is to put the labels in comments. In =Markdown=, everything framed by =<!--= and =-->= is treated as a comment and is not displayed in the =html= or =pdf= output of the notes. This allows us to use:

#+BEGIN_EXAMPLE
<!-- ;code; -->
#+END_EXAMPLE       

in our notes at a location we would like to find when we are looking for material on programming. 
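
As a minimal sketch of how such a label can then be found, the commands below create two hypothetical =Markdown= notes (the file names, their content and the helper name =find_label= are made up for the illustration) and look for the label =;code;= with =grep=. A desktop search engine would answer the same query from its index; =grep= is used here only because it is available on any Unix/Linux system.

#+BEGIN_SRC sh :exports code
#!/bin/sh
# Hypothetical notes, created in a temporary directory for the illustration.
notes=$(mktemp -d)
cat > "$notes/2019-03-01.md" <<'EOF'
# Meeting notes
<!-- ;code; -->
Remember to vectorize the inner loop.
EOF
cat > "$notes/2019-03-02.md" <<'EOF'
# Reading notes
The word code appears here, but without the label.
EOF

# -r: search recursively, -l: print only the names of matching files,
# -F: treat the label as a fixed string, not as a regular expression.
find_label() {
    # $1: label, $2: directory
    grep -rlF -- "$1" "$2"
}

find_label ';code;' "$notes"
#+END_SRC

Only the first note is listed: the second one contains the ordinary word "code" but not the label, which is exactly the filtering effect we are after.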

* Metadata
** Image files
We now know how to add labels to a text file, but we often also have to work with files containing images or photos, such as [[https://fr.wikipedia.org/wiki/JPEG][JPEG]] files (the format produced by most digital cameras), [[https://en.wikipedia.org/wiki/Graphics_Interchange_Format][GIF]] or [[https://en.wikipedia.org/wiki/Portable_Network_Graphics][PNG]]. The question then arises: can we add labels to our image files so that our desktop search engines index them? The answer is yes, thanks to the [[https://fr.wikipedia.org/wiki/M%C3%A9tadonn%C3%A9e][metadata]] that these files contain. Metadata, in this case, is data stored in the file but not shown by the rendering software (at least not by default). We have all met this metadata: it holds the date, GPS location, exposure time, etc. of our digital photos. In =JPEG= files, it is stored according to the [[https://fr.wikipedia.org/wiki/Exchangeable_image_file_format][exchangeable image file format]] (=EXIF=).

Most image and photo manipulation software gives access to the metadata and allows modifying it. The example illustrated in the course uses a very simple command-line solution, [[http://owl.phy.queensu.ca/~phil/exiftool/][ExifTool]], which allows you to view and modify metadata. Other software such as [[http://www.exiv2.org/index.html][exiv2]] or [[https://imagemagick.org/script/index.php][ImageMagick]] can do the same (to name only free software available on Linux, Windows and macOS). Some of the elements of the =EXIF= format are strings, i.e. text, that we are free to use as we wish; we can therefore use them to hold our labels. We illustrate in the course how to do it with =ExifTool=, but we could also have done it with ImageMagick's program [[https://www.imagemagick.org/script/command-line-options.php#comment][mogrify]].

All the desktop search engines we mentioned "look" at the metadata of =JPEG= files during the indexing phase and thus allow us to use the labels we have inserted.
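
A minimal sketch of the =ExifTool= calls involved is given below. These notes do not say which element is used in the course, so storing the label in the =EXIF= element =UserComment= is an assumption, as are the file name =photo.jpg= and the helper names. The block only defines the helpers and does nothing else unless =exiftool= is installed and the file exists.

#+BEGIN_SRC sh :exports code
#!/bin/sh
# Assumption: the label is stored in the EXIF element UserComment.
# photo.jpg is a hypothetical file name; use one of your own images.
photo="photo.jpg"

# Write the label; exiftool keeps a backup named "<file>_original".
set_image_label() {
    # $1: label, $2: image file
    exiftool -UserComment="$1" "$2"
}

# Print only the value of the element (-s3 drops its name).
get_image_label() {
    # $1: image file
    exiftool -s3 -UserComment "$1"
}

if command -v exiftool >/dev/null 2>&1 && [ -f "$photo" ]; then
    set_image_label ';code;' "$photo"
    get_image_label "$photo"
else
    echo "exiftool or $photo not found, nothing done" >&2
fi
#+END_SRC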

=EXIF= is not the only metadata format; a more recent one is the [[https://fr.wikipedia.org/wiki/Extensible_Metadata_Platform][Extensible Metadata Platform]] (=XMP=), available for a larger number of file formats. =DocFetcher= does not currently read it in =JPEG= files, which is why we have highlighted the =EXIF= format, but this should change quite quickly; other engines like =Tracker= and =Recoll= do read it.

** =PDF= files

In addition to image files, we very frequently work with "composite" files containing text, images, and more: [[https://fr.wikipedia.org/wiki/Portable_Document_Format][PDF]] files. These files also contain metadata, and it was for them that Adobe initially introduced the =XMP= format that we just discussed. This metadata can be read and modified, in particular the element =Keywords=, which can contain arbitrarily long character strings and is perfect for hosting our labels. The program =ExifTool= allows you to modify the metadata of =PDF= files. The desktop search engines we mentioned above all read the metadata of =PDF= files during the indexing phase.
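
As a sketch of what this can look like on the command line, the block below sets and reads back the =Keywords= element with =ExifTool=. The file name =article.pdf=, the two labels and the helper names are hypothetical; the block does nothing beyond defining the helpers unless =exiftool= is installed and the file exists.

#+BEGIN_SRC sh :exports code
#!/bin/sh
# article.pdf is a hypothetical file name; use one of your own documents.
pdf="article.pdf"

# Set the Keywords element to our labels (a backup "<file>_original" is kept).
set_pdf_labels() {
    # $1: labels as one string, $2: PDF file
    exiftool -Keywords="$1" "$2"
}

# Print the current value of the Keywords element.
get_pdf_labels() {
    # $1: PDF file
    exiftool -s3 -Keywords "$1"
}

if command -v exiftool >/dev/null 2>&1 && [ -f "$pdf" ]; then
    set_pdf_labels ';code; ;leibniz;' "$pdf"
    get_pdf_labels "$pdf"
else
    echo "exiftool or $pdf not found, nothing done" >&2
fi
#+END_SRC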

** Audios files  
Audio formats like [[https://en.wikipedia.org/wiki/MP3][mp3]] and [[https://en.wikipedia.org/wiki/Ogg][ogg]] also contain metadata in which the song titles, artist names, etc. are stored; we can set this metadata ourselves, and desktop search engines read it during indexing.