Yes: a dictionary is a data type. No: a dictionary is not a way to implement abstract data types; doing so is lazy programming and is asking for trouble later on.
What do I mean by this? In Python and other similar dynamic languages, dictionaries are mappings of keys to values with no typing restrictions: a dictionary is heterogeneous, and a single dictionary can contain elements of different types both as its keys and as its values. To make things worse, the syntax of the language makes it incredibly easy to create and populate dictionaries (unlike, say, in C++). Combine these two facts and the temptation to abuse a dictionary to implement a structured data type is high.
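For illustration only (the keys and values below are made up), this is what the temptation looks like in practice: a dictionary standing in for what should really be a record type.

# Nothing enforces which keys must exist nor what types their
# values have; every consumer of this dictionary has to guess.
process = {
    'pid': 1234,
    'cpu': 0.5,               # Fraction of a CPU? Number of cores?
    'ram': 64 * 1024 * 1024,  # Bytes... or kilobytes?
}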
Let’s look at a fictitious function to check if a given process is within its current resource limits:
def process_within_limits(pid, limits, usages):
    """Checks if a process is within its limits.

    Args:
        pid: int. The process identifier.
        limits: dict(int, dict(str, int)). Mapping of process
            identifiers to the limits for the corresponding
            process. The limits of a process are a mapping of
            resource names to the numerical limit. The valid
            names currently are 'cpu' and 'ram'.
        usages: dict(int, dict(str, int)). Same as limits but
            for the current instantaneous measurements of the
            process resource consumption.

    Returns:
        bool. True if the process is within its limits.
    """
    if pid not in limits:
        raise ValueError('Missing process limit')
    limit = limits[pid]
    if pid not in usages:
        raise ValueError('Missing process usage')
    usage = usages[pid]

    assert 'cpu' in usage and 'cpu' in limit
    cpu_in_quota = usage['cpu'] <= limit['cpu']

    assert 'ram' in usage and 'ram' in limit
    ram_in_quota = usage['ram'] <= limit['ram']

    return cpu_in_quota and ram_in_quota
This code nests two dictionaries in the limits and usages arguments, and it uses each nesting level in a different semantic manner. At the first level we have a mapping of process identifiers to either the process's limits or its current usage values; in other words, a perfectly valid use case for a dictionary. At the second level, however, we have a fixed collection of resource names mapped to values; needless to say, this is bad practice (with very few exceptions).
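To see how fragile this is, consider a hypothetical call site for the function above (the process identifier and the numbers are made up for illustration):

limits = {1234: {'cpu': 1.0, 'ram': 512 * 1024 * 1024}}
usages = {1234: {'cpu': 0.3, 'mem': 64 * 1024 * 1024}}  # Typo: 'mem' vs. 'ram'.

# Nothing flags the typo where it was written; it only surfaces as an
# AssertionError once process_within_limits stumbles over the missing key.
try:
    process_within_limits(1234, limits, usages)
except AssertionError:
    print('Bad key detected only at call time')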
The way to improve this code is to define an actual data type so that the various attributes are properly represented by member fields. In Python, we can use the standard collections.namedtuple class to simplify this:
import collections
# Resource limit or usage values for a process.
#
# Attributes:
# cpu: float. Resource value for the CPU usage.
# ram: int. Resource value for the RAM usage, in bytes.
Resources = collections.namedtuple('Resources', 'cpu ram')
def process_within_limits(pid, limits, usages):
    """Checks if a process is within its limits.

    Args:
        pid: int. The process identifier.
        limits: dict(int, Resources). Mapping of process
            identifiers to the limits for the corresponding
            process.
        usages: dict(int, Resources). Mapping of process
            identifiers to the current instantaneous resource
            usage values.

    Returns:
        bool. True if the process is within its limits.
    """
    if pid not in limits:
        raise ValueError('Missing process limit')
    limit = limits[pid]
    if pid not in usages:
        raise ValueError('Missing process usage')
    usage = usages[pid]

    cpu_in_quota = usage.cpu <= limit.cpu
    ram_in_quota = usage.ram <= limit.ram

    return cpu_in_quota and ram_in_quota
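For comparison, the hypothetical call site from before now builds Resources values, and a misspelled field name blows up right where it is written instead of deep inside the function:

limits = {1234: Resources(cpu=1.0, ram=512 * 1024 * 1024)}
usages = {1234: Resources(cpu=0.3, ram=64 * 1024 * 1024)}
print(process_within_limits(1234, limits, usages))  # Prints: True

# Resources(cpu=0.3, mem=64) would raise a TypeError about the unexpected
# keyword 'mem' at construction time, not at some later point of use.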
This is already quite an improvement. There are two things that I want to highlight here:
- The explanation of the function arguments no longer has to describe what the contents of the inner resources dictionary should be. Doing so is now unnecessary because the Resources type carries its own documentation.
- The member fields are accessed as attributes instead of dynamically via a map lookup. This allows the accesses to be validated at build time (not in the case of Python, of course) and also the enforcement of types and/or data invariants, if any; the sketch after this list hints at what that can look like.
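The example here sticks with collections.namedtuple, but as an aside, the standard typing.NamedTuple variant shows what that second point can buy you when paired with a static checker such as mypy. This is just a sketch of the idea, not part of the running example:

from typing import NamedTuple

class Resources(NamedTuple):
    """Resource limit or usage values for a process."""
    cpu: float  # Resource value for the CPU usage.
    ram: int    # Resource value for the RAM usage, in bytes.

ok = Resources(cpu=0.5, ram=64 * 1024 * 1024)
# A static checker would reject Resources(cpu='a lot', ram=1) before the
# program ever runs; a plain dictionary offers no such hook.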
There is one more twist to all this. By having extracted the dictionary into a data type, we can now clearly see that some functionality belongs in the data type itself for encapsulation purposes: invariant checking, operator overloads, auxiliary methods… In this specific example, just imagine if you ever wanted to add a new resource dimension to the Resources class: you wouldn't want to have to hunt down all callers to ensure they know about the new field! So we move the necessary functionality into the type:
import collections


class Resources(collections.namedtuple('Resources', 'cpu ram')):
    """Resource limit or usage values for a process.

    Attributes:
        cpu: float. Resource value for the CPU usage.
        ram: int. Resource value for the RAM usage, in bytes.
    """

    def __le__(self, other):
        """Returns True if self fits within the quota described by other."""
        return self.cpu <= other.cpu and self.ram <= other.ram
def process_within_limits(pid, limits, usages):
    """Checks if a process is within its limits.

    Args:
        pid: int. The process identifier.
        limits: dict(int, Resources). Mapping of process
            identifiers to the limits for the corresponding
            process.
        usages: dict(int, Resources). Mapping of process
            identifiers to the current instantaneous resource
            usage values.

    Returns:
        bool. True if the process is within its limits.
    """
    if pid not in limits:
        raise ValueError('Missing process limit')
    limit = limits[pid]
    if pid not in usages:
        raise ValueError('Missing process usage')
    usage = usages[pid]

    return usage <= limit
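To make the payoff concrete, here is a hypothetical sketch of adding a third resource dimension. The disk field and its default of zero are made up for illustration (and the defaults argument needs Python 3.7 or newer), but note that process_within_limits does not change at all because it only relies on the <= operator:

import collections


class Resources(collections.namedtuple('Resources', 'cpu ram disk',
                                       defaults=(0,))):
    """Resource limit or usage values for a process.

    Attributes:
        cpu: float. Resource value for the CPU usage.
        ram: int. Resource value for the RAM usage, in bytes.
        disk: int. Resource value for the disk usage, in bytes.
    """

    def __le__(self, other):
        """Returns True if self fits within the quota described by other."""
        return (self.cpu <= other.cpu and self.ram <= other.ram
                and self.disk <= other.disk)

Construction sites that only care about CPU and RAM keep working as they are thanks to the default, and the ones that do care about disk opt in by passing the new field.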
Let me conclude by saying that this is all inspired by actual production code I’ve had to deal with… so the example and its simplicity are not that contrived.