<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.0">Jekyll</generator><link href="https://mirkobronzi.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://mirkobronzi.github.io/" rel="alternate" type="text/html" /><updated>2021-08-27T13:19:06+00:00</updated><id>https://mirkobronzi.github.io/feed.xml</id><title type="html">Mirko Bronzi</title><subtitle>Natural Language Processing expert and Machine/Deep Learning enthusiast, fascinated by Software Engineering.</subtitle><author><name>Mirko Bronzi</name></author><entry><title type="html">Deep Learning Project Template (Cookiecutter)</title><link href="https://mirkobronzi.github.io/dl-project-template/" rel="alternate" type="text/html" title="Deep Learning Project Template (Cookiecutter)" /><published>2021-07-29T00:00:00+00:00</published><updated>2021-07-29T00:00:00+00:00</updated><id>https://mirkobronzi.github.io/dl-project-template</id><content type="html" xml:base="https://mirkobronzi.github.io/dl-project-template/">&lt;h2 id=&quot;tldr&quot;&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;Starting a new Deep Learning (DL) project usually means deciding between a flexible but
less scalable approach in the long term (e.g., &lt;a href=&quot;https://colab.research.google.com/&quot;&gt;Google Colab&lt;/a&gt;),
or a more organized one that requires a significant amount of time and work to set up properly
(in particular when we plan to use several tools that should interact with one another).&lt;/p&gt;

&lt;p&gt;For our new DL projects, we created a &lt;a href=&quot;https://github.com/mila-iqia/cookiecutter-pyml&quot;&gt;project template&lt;/a&gt;
(Cookiecutter) that can be instantiated in minutes, providing code that runs
out of the box.
The instantiated project contains tools for Deep Learning (&lt;a href=&quot;https://pytorch.org/&quot;&gt;PyTorch&lt;/a&gt;,
&lt;a href=&quot;https://www.pytorchlightning.ai/&quot;&gt;PyTorch Lightning&lt;/a&gt;, &lt;a href=&quot;https://www.tensorflow.org/&quot;&gt;TensorFlow&lt;/a&gt;,
&lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/keras&quot;&gt;Keras&lt;/a&gt;), tools that help run experiments
(&lt;a href=&quot;https://mlflow.org/&quot;&gt;MLflow&lt;/a&gt;, &lt;a href=&quot;https://github.com/Epistimio/orion&quot;&gt;Orion&lt;/a&gt;),
and tools that help with code best practices (&lt;a href=&quot;https://docs.pytest.org/en/6.2.x/&quot;&gt;pytest&lt;/a&gt;,
Continuous Integration, documentation management with &lt;a href=&quot;https://www.sphinx-doc.org/en/master/#&quot;&gt;Sphinx&lt;/a&gt;).&lt;/p&gt;

&lt;h2 id=&quot;what-is-it&quot;&gt;What is it&lt;/h2&gt;

&lt;p&gt;A project template (sometimes called a Cookiecutter) is a skeleton that gets instantiated
(usually offering various options to customize the project setup) to bootstrap the project code/environment.
This saves a lot of time (in particular if the team starts new projects often).&lt;/p&gt;

&lt;p&gt;Our &lt;a href=&quot;https://github.com/mila-iqia/cookiecutter-pyml&quot;&gt;project template&lt;/a&gt; prepares a setup that can be
used for Deep Learning projects. In particular, it provides all the files/folders
that take care of the boilerplate code needed to run experiments.
A README file is also provided with the final instructions to complete the setup, and with
instructions on how to train models, run tests, etc.&lt;/p&gt;

&lt;p&gt;The goal is to provide all the tools to take care of the research part, as well as
the tools that help the developer follow best practices (to keep the project
code manageable over time).&lt;/p&gt;

&lt;h3 id=&quot;deep-learning-setup&quot;&gt;Deep Learning setup&lt;/h3&gt;

&lt;p&gt;The code is based on either &lt;a href=&quot;https://www.pytorchlightning.ai/&quot;&gt;PyTorch Lightning&lt;/a&gt; or
&lt;a href=&quot;https://www.tensorflow.org/api_docs/python/tf/keras&quot;&gt;Keras&lt;/a&gt; (this is a choice left to the developer).
Logging is performed using MLflow, and hyper-parameter search can be performed using
&lt;a href=&quot;https://github.com/Epistimio/orion&quot;&gt;Orion&lt;/a&gt;.
The code is already set up to use those libraries.&lt;/p&gt;

&lt;h3 id=&quot;code-best-practices&quot;&gt;Code best practices&lt;/h3&gt;

&lt;p&gt;A good strategy to help implement best practices is to use automatic checks for the code and the documentation,
and to write tests that verify the code logic. It is a good habit to run all those checks
every time the code is pushed to the server. This can be done by a human, but it is better done
through a continuous integration (CI) process.
To help with this, the instantiated project supports CI tools such as &lt;a href=&quot;https://github.com/features/actions&quot;&gt;GitHub Actions&lt;/a&gt;,
Azure and &lt;a href=&quot;https://travis-ci.com/&quot;&gt;Travis&lt;/a&gt;. In fact,
the related configuration files are already provided and ready to run &lt;a href=&quot;https://flake8.pycqa.org/en/latest/&quot;&gt;flake8&lt;/a&gt;
to check the code format, &lt;a href=&quot;https://www.sphinx-doc.org/en/master/#&quot;&gt;Sphinx&lt;/a&gt; to check that the documentation builds,
and &lt;a href=&quot;https://docs.pytest.org/en/6.2.x/&quot;&gt;pytest&lt;/a&gt; to run the tests.&lt;/p&gt;

&lt;h2 id=&quot;how-to-use&quot;&gt;How to use&lt;/h2&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/mila-iqia/cookiecutter-pyml&quot;&gt;project template&lt;/a&gt; is instantiated with a simple command:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;pip install -U cookiecutter
cookiecutter https://github.com/mila-iqia/cookiecutter-pyml.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This command will ask some questions and instantiate the project skeleton.
In particular, it will ask whether the DL backbone should be PyTorch (with PyTorch Lightning) or TensorFlow (with Keras).&lt;/p&gt;

&lt;p&gt;After the instantiation is done, the developer will find a working project that can be run right away
(the README file - in the project itself - contains the last steps needed to complete the setup,
such as installing the dependencies).
The project runs out of the box because the code includes some synthetic data, a data loader, and a model, all based on a toy task
(i.e., given a sequence of numbers, compute their sum).&lt;/p&gt;
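&lt;p&gt;To make the toy task concrete, here is a minimal sketch (in plain Python, with hypothetical names - this is not the template’s actual code) of how such synthetic data could be generated:&lt;/p&gt;

```python
import random

def make_toy_example(seq_len=5, max_value=9):
    """Build one toy example: a sequence of numbers and their sum as target.

    Illustrative only: the function name and signature are assumptions,
    not the template's actual API.
    """
    sequence = [random.randint(0, max_value) for _ in range(seq_len)]
    target = sum(sequence)
    return sequence, target

# A small synthetic dataset for the "sum the sequence" toy task.
dataset = [make_toy_example() for _ in range(100)]
```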

&lt;p&gt;This is a tree view of an initialized repository:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;.
├── LICENSE
├── README.md
├── config
│   └── hooks
│       └── pre-commit
├── docs
│   ├── ...
├── examples
│   ├── data
│   │   ├── ...
│   ├── local
│   │   ├── config.yaml
│   │   └── run.sh
│   ├── local_orion
│   │   ├── config.yaml
│   │   ├── orion_config.yaml
│   │   └── run.sh
│   ├── slurm
│   │   ├── ...
│   └── slurm_orion
│       ├── ...
├── setup.py
├── tests
│   └── test_hp_utils.py
└── wonderful_project
    ├── __init__.py
    ├── data
    │   ├── __init__.py
    │   └── data_loader.py
    ├── main.py
    ├── models
    │   ├── __init__.py
    │   ├── model_loader.py
    │   ├── my_model.py
    │   └── optim.py
    ├── train.py
    └── utils
        ├── __init__.py
        ├── file_utils.py
        ├── hp_utils.py
        ├── logging_utils.py
        └── reproducibility_utils.py
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Of course, it’s the developer’s job to change those elements to address their task of interest.&lt;/p&gt;

&lt;p&gt;To run the model provided for the toy task, it is enough to run:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd examples/local
sh run.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Or, if on a cluster that supports &lt;a href=&quot;https://slurm.schedmd.com/sbatch.html&quot;&gt;Slurm&lt;/a&gt;, the command is:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;cd examples/slurm
sh run.sh
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This will train the model (saved under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;output&lt;/code&gt;), print the logs on the screen (or to a file, when using Slurm),
and generate the MLflow plots (under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mlruns&lt;/code&gt;).
More details and instructions are available in the README file contained in the instantiated project itself - enjoy!&lt;/p&gt;</content><author><name>Mirko Bronzi</name></author><category term="coding" /><category term="machine_learning" /><category term="deep_learning" /><summary type="html">TL;DR Starting a new Deep Learning (DL) project usually means deciding between a flexible but less scalable approach in the long term (e.g., Google Colab), or a more organized one that requires a significant amount of time and work to set up properly (in particular when we plan to use several tools that should interact with one another).</summary></entry><entry><title type="html">Data Structures and Performances: Lists</title><link href="https://mirkobronzi.github.io/data-structure-and-performance-list/" rel="alternate" type="text/html" title="Data Structures and Performances: Lists" /><published>2011-07-18T00:00:00+00:00</published><updated>2011-07-18T00:00:00+00:00</updated><id>https://mirkobronzi.github.io/data-structure-and-performance:list</id><content type="html" xml:base="https://mirkobronzi.github.io/data-structure-and-performance-list/">&lt;p&gt;The goal of this post is to see how a contiguous-memory structure (arrays) compares to a pointer-based
one (linked lists).&lt;/p&gt;

&lt;h1 id=&quot;recap&quot;&gt;Recap&lt;/h1&gt;

&lt;p&gt;Let me start by recapping the difference between an array-based list (also called a dynamic array) and
a linked list.
The first one is based on the array concept, i.e., it stores elements contiguously in memory, preserving a given order;
the second one uses pointers to keep track of the order of the elements (see
&lt;a href=&quot;https://en.wikipedia.org/wiki/Array_data_structure#Efficiency_comparison_with_other_data_structures&quot;&gt;here&lt;/a&gt;
for a quick comparison between linked lists and dynamic arrays).&lt;/p&gt;

&lt;p&gt;These different ways of implementing a list provide different performance:
roughly speaking, an array list requires fewer resources to read elements and more resources to write
them (adding or removing), and vice versa for a linked list.&lt;/p&gt;
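&lt;p&gt;This trade-off can be reproduced outside Java as well. As a minimal sketch (in Python, just to keep the example short - the post’s actual experiments use Java’s ArrayList and LinkedList), a list plays the role of the array list, while collections.deque plays the role of the linked structure:&lt;/p&gt;

```python
from collections import deque
from timeit import timeit

N = 50_000  # number of insertions at the front

def fill_front_list():
    xs = []
    for i in range(N):
        xs.insert(0, i)   # array-backed: shifts every element, O(n) per insert

def fill_front_deque():
    xs = deque()
    for i in range(N):
        xs.appendleft(i)  # linked structure: O(1) per insert

list_time = timeit(fill_front_list, number=1)
deque_time = timeit(fill_front_deque, number=1)
print(f"list: {list_time:.3f}s, deque: {deque_time:.3f}s")
```

&lt;p&gt;On typical runs the deque version is orders of magnitude faster, matching the claim that writes are cheaper for pointer-based structures (at least at the ends of the list).&lt;/p&gt;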

&lt;h1 id=&quot;readwrite-performances&quot;&gt;Read/Write performances&lt;/h1&gt;
&lt;p&gt;In this post we are going to validate this assertion by experimentally comparing the performance of
these data structures. We are going to use the Java language, with the ArrayList and LinkedList classes.&lt;/p&gt;

&lt;p&gt;We start by comparing a simple read operation (getting an element from the middle of a list) and a
simple write operation (adding an element to the end of the list). The results are as follows:&lt;/p&gt;

&lt;p&gt;(x: size of list, y: time in ms)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2011-07-18/get_median_element.png&quot; alt=&quot;Cost of getting the element in the middle&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2011-07-18/add_element_in_the_end.png&quot; alt=&quot;Cost of adding an element at the end of the list&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The “read” operation (getting the median element) is much faster with an array list:
getting an element from an array requires constant time. On the other hand, we would
expect the linked list to be faster when adding elements. As the second graph shows, this is actually
not the case: adding an element at the end of the list is a simple append operation that requires
no shifting of the other elements, even though the array list occasionally has to copy itself
to make room for more elements. This copy happens only a few
times: if the array doubles its capacity every time, it has to do it only log(n)
times (we start from capacity 2, then 4, then 8… up to n).
The linked list, instead, has to deal with pointers, which is always an added “burden” (in terms of space
and time).&lt;/p&gt;
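&lt;p&gt;The log(n) claim is easy to verify with a short counting sketch (assuming a doubling growth policy; real implementations may differ - Java’s ArrayList, for instance, grows by a factor of 1.5):&lt;/p&gt;

```python
def doublings_needed(n, initial_capacity=2):
    """Count how many times a capacity-doubling array must grow to hold n elements."""
    capacity, doublings = initial_capacity, 0
    while capacity < n:
        capacity *= 2
        doublings += 1
    return doublings

# Holding one million elements takes only 19 doublings: 2 -> 4 -> ... -> 1,048,576.
print(doublings_needed(1_000_000))
```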

&lt;p&gt;In order to validate our assumption about linked lists (i.e., that linked lists perform better when
dealing with “write” operations), we can try to recreate a scenario where the array list performs
worse due to its “less flexible” structure. We can do this by designing an experiment where we
add elements in the middle of the list (which requires the array list to shift all the following
elements to the right, while the linked list can simply change a pointer).&lt;/p&gt;

&lt;p&gt;(x: size of list, y: time in ms)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2011-07-18/add_element_in_the_middle.png&quot; alt=&quot;Cost of adding an element in the middle of the list&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Again, the linked list performs worse. This is because, before inserting the element, we have to find
the median element, and as we’ve seen, this is expensive for a linked structure. So, we can try
removing this problem by adding the element at the beginning of the list; this way the shifting
problem for the array list remains, while the linked list can access the element
(the first one) in constant time.&lt;/p&gt;

&lt;p&gt;(x: size of list, y: time in ms)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2011-07-18/add_element_at_the_beginning.png&quot; alt=&quot;Cost of adding an element at the beginning of the list&quot; /&gt;&lt;/p&gt;

&lt;p&gt;In fact, now the linked list performs much better.&lt;/p&gt;

&lt;p&gt;Summarizing: it is true that a linked list performs better when dealing with “write” operations, but
it is also true that accessing elements far from the start of the list
comes with a high cost, because we must follow the chain of pointers. It is important to highlight
that this is not true for the last elements: the linked list implementation is (usually)
“smart” enough to start the search from the nearest endpoint; as we can see from
the following graph, the worst case for linked list element access is in the middle of the list.&lt;/p&gt;

&lt;p&gt;(x: element index; y: time in ms)&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/assets/images/2011-07-18/element_access_cost.png&quot; alt=&quot;Cost of accessing an element&quot; /&gt;&lt;/p&gt;
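&lt;p&gt;The same “nearest endpoint” behaviour can be sketched with Python’s collections.deque (a doubly linked structure, used here only as a stand-in for Java’s LinkedList): accessing the first element is constant time, while accessing the middle forces a walk through roughly half of the links.&lt;/p&gt;

```python
from collections import deque
from timeit import timeit

d = deque(range(200_000))
middle = len(d) // 2

# Index 0 is a nearest endpoint: constant-time access.
end_time = timeit(lambda: d[0], number=500)
# The middle is as far as possible from both endpoints: ~n/2 links to follow.
middle_time = timeit(lambda: d[middle], number=500)
print(f"end: {end_time:.4f}s, middle: {middle_time:.4f}s")
```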

&lt;p&gt;We could consider linked lists “not so useful” because they perform almost always worse than an
array list. As a matter of fact, the only scenario where they perform better is when dealing with
elements at the beginning or at the end of the list. However, this scenario is very common: in
fact, when working with a stack, we are only interested in getting elements from the first position,
while with a queue we are only interested in the first and last positions. Because of this, and because
of the importance of stacks and queues, linked lists are a valuable data structure.&lt;/p&gt;</content><author><name>Mirko Bronzi</name></author><category term="algorithms" /><category term="performances" /><category term="data_structures" /><category term="java" /><summary type="html">The goal of this post is to see how a contiguous-memory structure (arrays) compares to a pointer-based one (linked lists).</summary></entry></feed>