How To Build Python Packages
By nearly any metric, Python has become one of the most important programming languages in existence. It is the de facto standard for data processing and for implementing machine learning and artificial intelligence solutions. It is used to build complex websites and applications, and Django and Flask (its most prominent web application frameworks) have never been more popular. Python is the most frequently taught first language at universities, and it has become an important part of the embedded device landscape, systems testing, and scripting.
A major reason (perhaps the reason) for Python's success is its rich ecosystem of modules and libraries. Modules are reusable libraries that can be pulled into a program through the `import` keyword. They can contain functions, data structures, global variables, and classes. Without libraries like NumPy, Pandas, scikit-learn, and Django, Python would just be another nicely designed niche language instead of the behemoth it has become.
Because Python modules are just folders and files located within the Python path, there are many strategies you can use to include additional modules in a project. You might incorporate them inside the structure of your program, download an archive to the `site-packages` folder, or modify your path at startup to locate the new code.
While there are many options, the most robust way to distribute Python software is to tap into its ecosystem of "distribution" utilities and create a "software package." Software packages are modules that include additional information allowing them to be versioned and published. When you take the additional step of creating a package from your module, it becomes possible to publish your code to software repositories such as the Python Package Index (PyPI) and to leverage tools such as `pip` to handle dependencies and testing.
In this article, we'll look at how you go about creating a package. We'll start at the foundation: ensuring your API has a deliberate design, including documentation, ensuring that there is a clear license, and following solid patterns in the design of your repository. We'll then examine the core piece of a package (the `setup.py` file) and show how to provide the essential metadata and options so people can install your code with nothing more than `pip install packagename`.
Preparing the Way
Good packages begin with clean and understandable code. To that end, the first steps in the packaging journey involve making decisions about how people will interact with your code (designing your API), providing information about how it works by creating documentation, specifying how it can be used and redistributed by including a license, and giving the whole thing a structure so that you can create the packaging scripts.
Step 1: Design Your API
The very first step of publishing a package is determining how it should be used and the interface it will expose. This involves:
- ensuring that all your folders have been converted to modules through the use of an `__init__.py` file
- separating your code into public and private methods
- adopting code naming conventions so that users will understand what specific constructs are
- actively controlling where code lives and managing how components should be imported
Convert Folders to Modules
An `__init__.py` file allows Python to treat your folders as modules. When you package your library, all of the folders need to be modules which can be imported from a global path. The file can be empty, but a good practice is to use it to conveniently collect the public API of your application in a single location that can be imported by your users.
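To make this concrete, the sketch below builds a tiny package on disk and shows how `__init__.py` can collect the public API so users import from one place. All names (`mypkg`, `_impl`, `greet`) are hypothetical, invented for the illustration.

```python
# Sketch: how __init__.py re-exports a package's public API.
# All names (mypkg, _impl, greet) are hypothetical.
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
pkg = root / 'mypkg'
pkg.mkdir()

# An implementation module with a "private" helper and a public function
(pkg / '_impl.py').write_text(
    "def _format(name):\n"
    "    return name.title()\n"
    "\n"
    "def greet(name):\n"
    "    return 'Hello, ' + _format(name)\n"
)

# __init__.py gathers the public API into one importable location
(pkg / '__init__.py').write_text("from mypkg._impl import greet\n")

sys.path.insert(0, str(root))
import mypkg

print(mypkg.greet('ada'))   # Hello, Ada
```

With this arrangement, users write `from mypkg import greet` without needing to know which internal module the function lives in.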
Separate Your Code Into Public and Private Methods
In most libraries, there is code intended to be consumed by your users (public methods) and code that provides support and internal state management (private methods). Some programming languages, such as Java and C++, provide keywords and other structures that protect private methods and prevent people from using them. Python doesn't support explicit protection of methods through a `private` or `protected` keyword, but there are conventions you can follow to clue people in that a method is part of your implementation code.
Code Naming Conventions
Using an underscore prefix in a method name, such as `def _internal()`, is one way to signal to others that the method is intended to be private. Another, when working with classes, is to use double underscores, as in `__methodname__`.
Methods with double underscores play an important role in Python in addition to "merely" indicating that something is supposed to be private. Python's object model provides a large number of such methods that can be overridden to customize certain object behaviors, such as how a class is sorted, added to other types of objects, hashed, and more. The Python Data Model documentation provides additional detail.
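As an illustration, the hypothetical `Money` class below overrides a few of these data-model methods to customize how instances print, add, and sort:

```python
# Sketch: overriding "dunder" methods from the Python data model.
# Money is a hypothetical example class.
class Money:
    def __init__(self, amount):
        self.amount = amount

    # Called for the "+" operator
    def __add__(self, other):
        return Money(self.amount + other.amount)

    # Called for "<" comparisons, which also powers sorted()
    def __lt__(self, other):
        return self.amount < other.amount

    # Controls the printable representation
    def __repr__(self):
        return f'Money({self.amount})'

total = Money(5) + Money(7)
print(total)                          # Money(12)
print(sorted([Money(3), Money(1)]))   # [Money(1), Money(3)]
```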
Controlling Imports and Code Location
A second thing you can do to control how your code is meant to be consumed is to deliberately create a public-facing set of methods that are easy to access. There are two common techniques used here.
- Add an `__init__.py` file to the module and place all top-level (public) methods in that file.
- Include a special variable called `__all__` to explicitly define which functions should be included if a user uses a wildcard import such as `from module import *`.
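The sketch below shows `__all__` in action; a hypothetical `shapes` module is written to a temporary directory, and a wildcard import pulls in only the listed names:

```python
# Sketch: __all__ restricts what "from module import *" exposes.
# The shapes module and its functions are hypothetical.
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
(root / 'shapes.py').write_text(
    "__all__ = ['area']\n"
    "\n"
    "def area(w, h):\n"
    "    return w * h\n"
    "\n"
    "def perimeter(w, h):\n"
    "    return 2 * (w + h)\n"
)
sys.path.insert(0, str(root))

# Perform a wildcard import into a fresh namespace
ns = {}
exec('from shapes import *', ns)

print('area' in ns)        # True
print('perimeter' in ns)   # False: not listed in __all__
```

Note that `perimeter` is still reachable via `import shapes; shapes.perimeter(...)`; `__all__` only shapes the wildcard import.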
Step 2: Create Documentation
As packages are often intended to be used by others, it is important to provide documentation about how they work. Generally, there are three types of reference that should be included with a package:
- method and class documentation in the form of docstrings
- architectural documentation describing how a package is organized with details about how the components are used in concert
- tutorial documentation which shows the package in action
Docstrings
Each function and class in the library should include a "docstring." Docstrings are multi-line string literals, beginning and ending with three quotation marks (`'''`), that directly follow a class, module, function, or method definition. They are attached to code constructs as the `__doc__` property and can be accessed using the `help()` function. They are used to communicate what a piece of code is supposed to do and how it might be used.
The purpose of docstrings is a little different from that of comments. Comments generally describe how a piece of functionality has been implemented and act as signposts for developers who might need to modify or extend the code. Docstrings are externally facing reference material describing the purpose of the code, its properties, and its inputs.
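A quick sketch of how a docstring becomes reference material (the function and its behavior are hypothetical):

```python
# Sketch: docstrings attach to a function as __doc__ and are
# what help() displays. describe_route is a hypothetical function.
def describe_route(start, dest):
    '''Summarize the route between two locations.

    @input start (str): Starting location
    @input dest (str): Ending location
    '''
    return f'{start} -> {dest}'

# The docstring is available programmatically...
print(describe_route.__doc__.splitlines()[0])
# ...and help(describe_route) renders the same text interactively.
```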
As you prepare for packaging, it's a good idea to review the docstrings for your code and ensure that they provide enough detail for other developers to understand what a method or class is used for. In general, a good docstring contains:
- a short summary of what the code does
- a list of the input parameters, their types, and behaviors
- summary of the return value
The listing below shows an example of how docstrings might be used to provide information about a class, its properties, and methods.
```python
class Vehicle(object):
    '''
    Vehicle is a base class for cars and trucks and provides
    methods common to both.

    @property make (str): Manufacturer of the vehicle
    @property model (str): Model of the vehicle
    '''

    def __init__(self, mpg, hybrid=False, *args, **kwargs):
        '''
        Constructor method

        @input mpg (float): Fuel efficiency of the vehicle in miles per gallon
        @input hybrid (bool, default=False): Indicates whether the vehicle
            is a hybrid
        '''

    def travel_time(self, start, dest):
        '''
        Calculate the travel time from the start location to the destination

        @input start (GPSPoint): Starting location
        @input dest (GPSPoint): Ending location
        '''
```
If you are consistent about including docstrings in your code, it is possible to leverage a number of tools such as Sphinx, PyDoc, or PyDoc-Markdown to generate reference pages for your code. Reference documentation is important if you want your package to be used by other developers. If you wish, steps or commands can be built into your package scripts so they become part of your distribution process (though this is beyond the scope of this article).
Architectural Documentation, Examples, and Tutorials
While your docstrings provide information about specific classes, their properties, and methods, they often lack a high-level perspective on how your library should be used. For that, you need to provide some degree of architectural overview, annotated examples, and tutorial code. There are many ways that this additional information might be provided (such as through an external website), but a good option is to incorporate it alongside your code in the package in the form of a `docs` subfolder or repository.
The organization of your documentation will depend greatly on your package and its complexity. Generally, you should include a `README` file that describes the package from a high level, information about how the code is installed, details about the build process (if you use extension modules or link to compiled code), and a quick-start guide showing how people can get up and running.
A pattern that many projects use is to keep their documentation, with all of the associated media, in one repository and the code in a second. This can work well because it keeps the code repository focused and small (which helps to speed up transfers and CI/CD automation), while also allowing documentation contributors to work on user-facing pieces of the documentation without requiring the code.
For packaging purposes, the two can be linked in a "packaging" repository that can automate the process of creating HTML or PDF reference manuals at the same time that build artifacts are created. If using a source forge such as GitLab or GitHub, this pattern is often explicitly encouraged by providing a "wiki" repository that lives alongside the code repository in a "Project."
Step 3: Ensure that the Code is Licensed
Before you distribute your code, you need to ensure that it contains a license. The license will describe things like copyright, usage, and distribution terms. If publishing to PyPI, it is generally a good idea to use a common license like the Apache 2.0 License that provides clarity about how your code may be used and spells out what rights you reserve and grant to your users.
If you are publishing the code privately, the license should include information about copyright, use limitations, and other intellectual property restrictions. GitHub (and also GitLab) provides a set of templates for popular licenses that can be added to a project with a single click. Code without an explicit license is considered copyrighted and falls under the purview of "all rights reserved."
Step 4: Architect Your Package
The last step in the design of your package is to determine the layout of your code and the packaging scripts that will exist alongside it, the role that version control will play, and how it will be distributed. While there are several patterns that you can use, the two most common are the monolithic and submodule-linked packaging repositories.
- Monolithic Repository. All of the code -- functional components, testing, and packaging -- is placed inside a single repository. This way of structuring the code is convenient because there is never any question about where anything is located, but it comes at the cost of making it a little more complicated to efficiently embed the code within larger projects or to utilize individual pieces (like the documentation or tests) on their own.
- Submodule-linked Repository. Separate documentation, testing, and code repositories are tied together by a "packaging" repository that links them through Git submodules. In this pattern, the packaging repository forms a shell around the functional and testing repositories and (mostly) provides the scripts for building, installing, and publishing. As compared to the monolithic repository, using submodules allows you to split your code by function and provides more flexibility in how you incorporate it into larger projects. The flexibility comes at the cost of additional complexity, requiring an additional step to ensure all code is available when installing or upgrading a package.
Because both approaches offer benefits and drawbacks, the "right" choice often depends on how your package will be consumed. In the next section of this article, we'll look at example scripts for both approaches.
In order to build and install correctly, your repository needs a specific structure, with the license, `README`, and other files located in the root directory and tests, documentation, and code in subfolders. The listing below shows a sample packaging repository layout with source code, documentation, tests, and examples.
```
packagename
|-- README.md
|-- LICENSE
|-- packagename
|   |-- __init__.py
|   |-- module1.py
|   |-- module2.py
|-- docs
|   |-- home.md
|   |-- install.md
|   |-- module1.md
|   |-- module2.md
|-- tests
|   |-- __init__.py
|   |-- test1.py
|   |-- test2.py
|-- examples
    |-- __init__.py
    |-- example1.py
```
It is a common convention for the root-level folder and the source code folder to use the name of the package. In the example above, this is `packagename`. If your repository includes top-level testing or example directories, those need to be structured as modules for your package scripts to execute commands that use them.
Creating and Publishing a Package
After organizing, documenting, licensing, and structuring, you're ready to move to the actual packaging of your software. This involves two steps:
- creating the files required by Python to build and deploy
- running the commands that will package and publish your code
Step 1: Create Packaging Components
While packaging code can become very complex, with hooks into compiler toolchains and testing automation, there is only one required component: the `setup.py` file.
setup.py
`setup.py` is the heart of a Python package and dictates everything the installer needs to know when deploying (or otherwise modifying) a package. It provides two things:
- metadata about the package: the name, description, author, contact information, license, and dependencies
- entrypoint commands such as `install`, `develop`, and `sdist`, plus any customizations that you might wish to make to the default deployment logic
Originally introduced by the `distutils` library, created by Greg Ward and released in 1998, `setup.py` has been a part of Python since nearly the beginning. `distutils` used it as the gateway to automate the build, configuration, and installation of software; those capabilities were then extended by the `setuptools` package, which has become the de facto standard for creating Python packages.
Example 1: Monolithic Project Without Dependencies
The code listing below shows an example of a simple `setup.py` file created using `setuptools`.
```python
# Example 1: simple setup.py file for a package called packagename

# Import "setup" method from setuptools
from setuptools import setup

# setup is the gateway to the package build process.
# The only required components for a package are
# the name, author and contact, and description.
setup(
    name='packagename',
    version='0.1.0',
    author='John Doe',
    author_email='john.doe@example.com',
    description='Example package for a library called packagename',
    license='GPL',
    url='http://code.example.com/example-package',
    packages=['packagename'])
```
For monolithic projects without external dependencies, the example above is fully functional. The structure of the repository should be similar to that shown in the Preparation Step 4 listing above.
The code imports the `setup` method from `setuptools` and initializes a default set of commands. This example provides an `install` command for adding the code to the local `site-packages` and an `sdist` command that can generate a source archive for submission to PyPI (a "wheel" can additionally be built with `bdist_wheel`).
To adapt the example for your project, you will need to update the name of your package, the contact information of the author/maintainer, and the version. Important: Each time you release a new build of your package, you need to update your version number.
Example 2: Monolithic Project With Dependencies
More complex libraries often have additional components and dependencies, however. The example file below shows a `setup.py` script which includes NumPy and Pandas as external dependencies required by the library at the time of installation.
```python
import pathlib
from setuptools import setup, find_packages

HERE = pathlib.Path(__file__).parent

# It can be convenient to put the metadata for a package at the top
# of a file
VERSION = '0.1.0'
PACKAGE_NAME = 'packagename'
AUTHOR = 'John Doe'
AUTHOR_EMAIL = 'john.doe@example.com'
URL = 'https://code.example.com/example-package2'
LICENSE = 'Apache License 2.0'
DESCRIPTION = 'Example package for a library called packagename'

# PyPI supports a "long description," which is basically a README
# file that is published alongside your package metadata.
# The code here shows how to dynamically load the text of
# your package README and provide it for the long description.
# The example in this article uses Markdown for the readme and
# documentation; other supported formats include
# reStructuredText (RST).
LONG_DESCRIPTION = (HERE / "README.md").read_text()
LONG_DESC_TYPE = "text/markdown"

# Dependencies for the package
INSTALL_REQUIRES = [
    'numpy',
    'pandas'
]

# Initialize setup with metadata and package dependencies.
# find_packages is used to traverse the structure of the
# project and dynamically generate the package list
setup(name=PACKAGE_NAME,
      version=VERSION,
      description=DESCRIPTION,
      long_description=LONG_DESCRIPTION,
      long_description_content_type=LONG_DESC_TYPE,
      author=AUTHOR,
      license=LICENSE,
      author_email=AUTHOR_EMAIL,
      url=URL,
      install_requires=INSTALL_REQUIRES,
      packages=find_packages(exclude=['docs', 'tests', 'examples']))
```
Building on the first example, we've introduced several additional elements:
- Because parts of the packaging code, such as the version and description, might be updated frequently, it is convenient to put them at the top of the file as global variables so they are easy to find and modify.
- PyPI supports a "long description," which is basically a `README` description that will appear alongside other package metadata like the website and version. This example includes code to load a `README.md` file located at the same level in the project as `setup.py`.
- A dependency list that will ensure that both `numpy` and `pandas` are also installed.
- Automated discovery of module/package names from within the package repository using `find_packages`. For simple projects, it's straightforward to manually add the module names to the `packages` argument of `setup`. For large projects, however, it can become a burden to keep the package list updated. `find_packages` traverses the structure of the package repository and generates this list automatically. To exclude modules we might not want included -- such as docs, examples, and tests -- it is possible to specify a list of patterns using the `exclude` option.
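To see what `find_packages` produces, the sketch below builds a throwaway repository layout in a temporary directory (the directory names are hypothetical) and asks `find_packages` to scan it:

```python
# Sketch: find_packages discovers any directory containing an
# __init__.py file, and exclude patterns filter the results.
import tempfile
from pathlib import Path
from setuptools import find_packages

root = Path(tempfile.mkdtemp())
for name in ['packagename', 'packagename/sub', 'tests']:
    d = root / name
    d.mkdir(parents=True, exist_ok=True)
    (d / '__init__.py').write_text('')

found = sorted(find_packages(where=str(root), exclude=['tests']))
print(found)   # ['packagename', 'packagename.sub']
```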
Example 3: Packaging Repository With Submodules
For packaging repositories that include submodules, you need to add a customization to the install and build process: submodule checkout. If using Git, this means running `git submodule update --init --recursive`. The example `setup.py` in the listing below provides customized entrypoints for three commands so that the submodule update happens during `install`, `develop`, and `sdist`.
This example is taken from the Sonador client packaging repository. Sonador is an open source cloud visualization, ETL, and research platform for working with medical images.
```python
import os, pathlib

from setuptools.command.develop import develop
from setuptools.command.install import install
from setuptools.command.sdist import sdist
from setuptools import setup, find_packages

from subprocess import check_call

# Package metadata
HERE = pathlib.Path(__file__).parent
PACKAGE_NAME = 'sonador'
VERSION = '0.1.1'
AUTHOR = 'Rob Oakes'
AUTHOR_EMAIL = 'rob.oakes@oak-tree.us'
URL = 'https://code.oak-tree.tech/oak-tree/medical-imaging/packages/sonador.git'
LICENSE = 'Apache License 2.0'
DESCRIPTION = 'Sonador open source cloud platform for medical imaging visualization and research'
LONG_DESCRIPTION = (HERE/"README.md").read_text()
LONG_DESC_TYPE = 'text/markdown'

INSTALL_REQUIRES = [
    'client',
    'tabulate',
    'pydicom'
]


def gitcmd_update_submodules():
    ''' Check if the package is being deployed as a git repository. If so,
        recursively update all dependencies.

        @returns True if the package is a git repository and the modules
            were updated. False otherwise.
    '''
    if os.path.exists(os.path.join(HERE, '.git')):
        check_call(['git', 'submodule', 'update', '--init', '--recursive'])
        return True

    return False


class gitcmd_develop(develop):
    ''' Specialized packaging class that runs
        git submodule update --init --recursive
        as part of the update/install procedure.
    '''
    def run(self):
        gitcmd_update_submodules()
        develop.run(self)


class gitcmd_install(install):
    ''' Specialized packaging class that runs
        git submodule update --init --recursive
        as part of the update/install procedure.
    '''
    def run(self):
        gitcmd_update_submodules()
        install.run(self)


class gitcmd_sdist(sdist):
    ''' Specialized packaging class that runs
        git submodule update --init --recursive
        as part of the update/install procedure.
    '''
    def run(self):
        gitcmd_update_submodules()
        sdist.run(self)


setup(
    cmdclass={
        'develop': gitcmd_develop,
        'install': gitcmd_install,
        'sdist': gitcmd_sdist,
    },
    name=PACKAGE_NAME,
    version=VERSION,
    description=DESCRIPTION,
    long_description=LONG_DESCRIPTION,
    long_description_content_type=LONG_DESC_TYPE,
    author=AUTHOR,
    license=LICENSE,
    author_email=AUTHOR_EMAIL,
    url=URL,
    install_requires=INSTALL_REQUIRES,
    packages=find_packages())
```
Entrypoint commands in `setuptools` are organized as classes. To create customized versions, you first need to create a subclass and then override the `run` method so that it incorporates your custom logic. Here, we incorporate a subprocess call that executes `git submodule update`. To prevent the command from running during the install of an already packaged version of the code, we check to see if the `.git` version control directory is present.

We let `setuptools.setup` know about the customized versions of the entrypoints by passing the new classes in a dictionary keyed to the command name as the `cmdclass` option.
Step 2: Install or Build/Deploy
With the `setup.py` file in place, you have all of the pieces needed for your package. You can install the package through `setup.py install`:
python setup.py install
Or create source and wheel distributions that can be published, through `setup.py sdist bdist_wheel`:
python setup.py sdist bdist_wheel
Publishing Your Package
When building "package archive files" meant for distribution through PyPI, the `sdist` and `bdist_wheel` commands will generate several new directories (`dist`, `build`, and `packagename.egg-info`) and a set of build artifacts. The important pieces (the actual "package") will be found in the `dist` folder and include:
- a `tar.gz` file that contains all of the installation files for the package
- a `wheel` file
These two files are what need to be sent to PyPI (or similar).
It is a good idea to include the `dist`, `build`, and `packagename.egg-info` folders in the `.gitignore` of your repository so that they aren't accidentally committed to version control.
Conclusion
Python's module/library/package ecosystem and the tools that power it (`pip`, `setuptools`, and others) are among the major reasons behind its success. As we've seen in this article, creating packages is a straightforward process split into two phases: prepare and package.
To prepare:
- Organize your code into a cohesive API.
- Provide documentation so that people can navigate your code.
- Provide a license so people know how to use and redistribute.
- Structure the repository for your packaging scripts.
To package:
- Create a `setup.py` file that provides the metadata, dependencies, and commands needed for your code.
- Deploy and win!