Future

CodingBlocks

97. Data Structures – (some) Trees

We ring in 2019 with a discussion of various trees as Allen questions when should you abstract while Michael and Joe introduce us to the Groot Tree.

Are you reading these notes via your podcast player? You can find the full show notes for this episode and join the conversation at https://www.codingblocks.net/episode97.

Sponsors

  • O’Reilly Software Architecture Conference – Microservices, domain-driven design, and more. The O’Reilly Software Architecture Conference covers the skills and tools every software architect needs. Use the code BLOCKS during registration to get 20% off of most passes.
  • Discover.bot – a digital space for bot developers and enthusiasts of all skill levels to learn from one another, share stories, and move the conversation forward. New to bots? Get started with their Beginner’s Guide.
  • Clubhouse – The first project management platform for software development that brings everyone on every team together to build better products. Sign up for two free months of Clubhouse by visiting clubhouse.io/codingblocks.

Survey Says

Anonymous Vote
Sign in with Wordpress
How good were the holidays to you?
  • Awesome. Got everything I hoped for.
  • I got things I didn't even ask for.
  • It was good. I got everything I purchased.
  • Glad to be back at work.
  • I didn't get squat.

News

  • Thank you to everyone that left us a review:
    • iTunes: SoundShop, TimGraf, bothzoli
    • Stitcher: KSprings, Trickermand, Delden, Flo_
  • Is it really always better to code to an interface?

I am Groot

Why do trees matter?

  • Trees are one of the most important data structures, because there are certain types of problems that are really difficult to solve without them and they are really efficient for certain tasks.
  • They are also some of the most difficult for programmers to deal with, because you often need recursive algorithms to work with them.
  • Recursive solutions are neat, but can easily lead to stack overflows if you are not careful (especially if your language doesn’t support tail recursion).
  • Many traditional persistence layers are not setup to easily store tree data.
  • Why programmers use trees:
    • Model “real world” hierarchical data.
    • Memory/Processor efficient.
    • “Good enough” solutions.

Common uses of Trees

  • Manipulate hierarchical data (persistence, middleware, UI):
    • Directory structure on your computer.
    • Product categories on an eCommerce site.
    • Interview questions.
    • Data files (XML, HTML, JSON).
  • Search.
  • Manipulate sorted lists of data.
  • As a workflow for compositing digital images for visual effects.
  • Router algorithms.

There are lots of Trees

  • The basic structure of a tree is very simple, in most cases you can imagine a wrapper node object that has a value (your data) and a set of children nodes.
  • There are other ways to save trees, like in an array but you’ll generally see those used in special cases like heaps or when you are serializing data.
  • … that’s it! If you’re looking at a file system, then the value would contain the current path and other metaData (like permissions) about that path, the children of that node would contain other nodes
  • Despite the simple example, Wikipedia lists 115 types of trees against 7 categories! (and 20 types of arrays!)
    • The many types of tree categories:
      • Binary Trees
      • B-Trees
      • Heaps
      • Trees (Value Trees)
      • Multiway trees
      • Space-Partitioning Trees
      • Application-Specific Trees
  • How can you have so many types for such a simple concept as value + children?
    • Constraints on the data structure (binary vs n-array).
    • Stored meta data (red black).
    • Rules you use for interacting with the trees (heaps).
  • These variations have big downstream effects.

Common terminology

  • Although the data structure itself is simple, there is a lot of data that you can extrapolate from about the entire tree.
  • Algorithms use this data in addition to the data stored in the nodes.
  • The basics:
    • Root is the topmost node of the tree.
    • Edge is the link between two nodes.
    • Child is a node that has a parent node.
    • Parent is a node that has an edge to a child node.
    • Leaf is a node that does not have a child node in the tree.
    • Height is the length of the longest path to a leaf.
    • Depth is the length of the path to its root.
  • For example, if you are dealing with a really large tree, then you may opt to return a “good enough” solution that doesn’t exceed a certain depth.
  • If you’re performing a lot of searches on a tree, then it may make sense for you to optimize the tree as you insert and delete nodes so that your searches can check a consistently small set of nodes.
  • The point is, even if the data structure is small – there are big time implications, and the recursive algorithms you write can be … mind bendy.

Binary Tree

What is it?

  • Unlike arrays, lists, queues, stacks, etc., trees are not linear data structures, rather they are hierarchical.
  • A tree whose elements have at most two child elements, usually referred to as left and right.

How does it work?

  • Every tree has a root node – that’s the top most node in the tree – it has no parent.
  • Nodes or elements that have no children are called leaf nodes.
  • Nodes are referred to as either a parent or a child.
  • Types of binary trees:
    • Full Binary Tree – every node has 0 or 2 children.
    • Complete Binary Tree – every level is completely filled with nodes except the last level could have every node filled except one right node.
      • Binary heap is an example of this.
    • Perfect Binary Tree – all internal nodes have two children, and all leaf nodes are at the same level.
    • Balanced Binary Tree – height of the tree is O(log n) where n is the number of nodes.
      • Great for performance because they have O(log n) time for search, inserts and deletes.
    • Degenerate or Pathological Tree – each node has exactly one child (until you get to the leaf node).
      • Basically no different than a linked list.

Pros

  • When ordered, can provide a sped up search over linked lists, but still slower than an ordered array.
  • Can be performant on insert/deletes into the tree – faster than inserting into the middle of an array, but slower than a linked list.
  • No limit on the number of nodes that can be added – unlike an array – this is because each node contains pointers to its child and parent nodes.

Cons

  • Extra memory, not as efficient as arrays

When to use?

  • You need to do comparisons and you’re not sure of how many you need, or there will be many inserts/deletes.

B-Trees

What is it?

  • It’s a self-balancing search tree.
    • AVL Tree – (Adelson-Velsky and Landis)
  • Primary idea behind a b-tree is reducing the number of times the disk is accessed – keep as much in memory as possible.
  • Most operations for search, insert, delete, max, min – require O(h) disk accesses, where h is the height of the tree.
    • This is achieved by creating a “fat tree” – this means that the tree is short but wide, and this is accomplished by putting as many keys as possible in a single b-tree node.
      • Typically the b-tree node is the same size as the block size on the disk.
  • All leaf nodes are at the same level.
  • Minimum degree “t” – t depends on the block size of the disk.
  • EVERY node must contain t-1 keys.
    • Root may contain at minimum 1 key.
  • All nodes, including root, may have at most 2t – 1 keys.
  • Number of children of a node are equal to the number of keys in that node + 1.
  • All the keys are sorted in order within the node.
  • B-trees grow and shrink from the root node.
    • Unlike a binary search tree that grows and shrinks vertically (downward).
  • Time complexity to search, insert and delete are O(log n).

How does it work?

  • Search – similar to a binary search tree:
    • If you’re searching for a key, you recurse down the nodes.
    • If the non-leaf node has the key, then the node itself is returned.
    • If the non-leaf node doesn’t have the key, then the child node that is the child of the first key with a greater value in the current node is traversed.
    • If you make it all the way to a leaf node, and the key isn’t found, NULL is returned.
  • Insert:
    • Always starts at a leaf node.
    • Do a binary search to find the node where the insert should happen – if there is room, then just insert the new key in that node and make sure the keys stay ordered.
    • If the node is full …
      • Evenly split the node into two nodes.
      • Choose a median from the elements being split – anything lower than the median goes to the new left child node, and anything greater goes to the new right node.
      • The median goes into the parent node, and then splitting can keep going up the tree.

Pros

  • Better, more consistent search times than binary search because of the balancing.
  • There are self-balancing binary trees (like red-black trees) but are optimized for batch inserts.

Cons

  • If you’ve met the max degree, then you need to re-balance the tree which can make inserts slower.

When to use?

  • B-tree is well suited for storage systems that read and write relatively large blocks of data, such as disks.
  • Commonly used in databases and file systems – because the aim is to reduce the number of disk reads necessary.

Resources We Like

Tip of the Week

  • Use PlantUML to create plain text UML diagrams.
    • There’s also a VS Code extension for it.
    • Thanks bothzoli!
  • Walkthrough: Debugging a Parallel Application in Visual Studio (Visual Studio Docs)
  • In SQL Server Management Studio, when executing multiple SELECT statements, the record count in the bottom right hand corner shows the aggregate count for all queries. Click on a specific result set to see the individual count for a particular query.
  • When using OS X/macOS, remember that a lot of functionality might be hidden. Use the OPTION key to see additional functionality.

Episode source