Sunday 30 March 2014

Sorting and Efficiency

Over the past few weeks, two of our primary focuses have been sorting and efficiency. In this slog post, I will concentrate on how these two concepts are deeply connected. I hope that this post can serve as a primer for fellow students in the course. I will briefly explain sorting and efficiency. I will then explain some of the difficulties that I ran into, and what I did to get over them.

The definition of 'sorting' does not differ from its conventional meaning in the case of computer programming. It is essentially a way of taking a list of unsorted numbers and arranging them in a designated fashion, e.g. smallest-to-greatest or greatest-to-smallest. However, there is one key difference between sorting items via programming and sorting items intuitively, as we do with, say, a deck of cards: sorting items via a programming language requires an explicit algorithm. This algorithm provides the logic by which Python can systematically sort numbers. When we sort cards intuitively, we don't necessarily follow a fixed logic. We might see a small card and stick it in the front. We might see a large card and stick it in the back. We might deviate from these two patterns until we finally reach the desired outcome, a sorted list. However, programming languages do not have the benefit of being able to correct themselves. In this way, the programmer has to provide a perfectly logical algorithm for sorting items. There are multiple ways of doing this, each with pros and cons. In this post, I will primarily discuss three sorting algorithms: bubble sort, selection sort, and insertion sort.

Bubble sort employs a simple and transparent algorithm. It goes through the list, checking whether the item at each index is greater than the item succeeding it. If it is, it swaps the two items. Each pass bubbles the largest unsorted item toward the end of the list, and the passes continue until a pass completes without any swaps, at which point the list is sorted. The diagram below illustrates the method of sorting.

Taken from http://mytutorialspace.com/bubble-sort/
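In Python, the passes described above might be sketched like this (a minimal version of my own, not the course's official implementation):

```python
def bubble_sort(lst: list) -> None:
    """Sort lst in place, smallest-to-greatest, using bubble sort."""
    for i in range(len(lst) - 1):
        # After each pass, the largest unsorted item has bubbled to the end,
        # so each pass can stop one index earlier than the last.
        for j in range(len(lst) - 1 - i):
            if lst[j] > lst[j + 1]:
                lst[j], lst[j + 1] = lst[j + 1], lst[j]
```

Note that this version always runs len(lst) - 1 passes; an easy refinement is to stop early once a pass makes no swaps.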

Next, I will introduce selection sort. The first pass of this sorting algorithm takes the smallest item in the list and places it at index 0, i.e. at the start of the list. It dichotomizes the list by creating a sorted section and an unsorted section. After the first pass, the sorted section consists only of the item at index 0; this is the only item that we know for certain is sorted. The next pass will take the smallest item in the unsorted section and move it to index 0 of the unsorted section, i.e. one index after the end of the sorted section. In this way, the sorted section will consist of two sorted numbers after the next pass. The size of the sorted section will increase by one, and the size of the unsorted section will decrease by one. This process continues until the length of the unsorted section equals one. At that point, the single item left in the unsorted section is the largest number. Hence, the list is sorted. The diagram below clearly illustrates this concept.
Red: sorted, Blue: unsorted
Taken from http://mytutorialspace.com/write-program-demonstrate-selection-sort-array/

Also, note that selection sort can work in the opposite direction: it can move the largest items to the end of the list instead of moving the smallest items to the front of the list.
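The forward version described above might be sketched as follows (again, my own minimal version):

```python
def selection_sort(lst: list) -> None:
    """Sort lst in place by repeatedly moving the smallest item in the
    unsorted section to the end of the sorted section."""
    for i in range(len(lst) - 1):
        # Indexes 0..i-1 are the sorted section; find the smallest
        # item in the unsorted section, lst[i:].
        smallest = i
        for j in range(i + 1, len(lst)):
            if lst[j] < lst[smallest]:
                smallest = j
        # Swap it into place, growing the sorted section by one.
        lst[i], lst[smallest] = lst[smallest], lst[i]
```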

Next, I will explain insertion sort. This algorithm starts by comparing the items at the first two indexes, 0 and 1. It will swap them if the item at index 0 is larger than the item at index 1. Otherwise, it will do nothing. After this occurs, indexes 0 and 1 can be considered a sorted region. The rest, for all we know, are unsorted. The algorithm then moves forward one index at a time and checks whether the item at that index is less than any of the items in the sorted region. If it is, the algorithm will insert that item just before the smallest item in the sorted region that is greater than it. Otherwise, the algorithm will proceed to the next index. This continues until the end of the list. The diagram below shows this concept.

Red: sorted, Blue: unsorted
Taken from http://www.stoimen.com/blog/2012/02/13/computer-algorithms-insertion-sort/
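Here is one way to sketch insertion sort in Python (my own version, for illustration):

```python
def insertion_sort(lst: list) -> None:
    """Sort lst in place by inserting each item into its place
    within the sorted region at the front of the list."""
    for i in range(1, len(lst)):
        item = lst[i]
        j = i
        # Shift items in the sorted region that are greater than item
        # one position to the right, opening a gap for item.
        while j > 0 and lst[j - 1] > item:
            lst[j] = lst[j - 1]
            j -= 1
        lst[j] = item
```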


Now you may be wondering: why do we need all of these algorithms? They all do the same thing, after all. The answer is efficiency. Each of these algorithms has a different time complexity; that is, each requires a different amount of time to produce the desired result. This is a concept that I struggled with a lot at first. It was not always clear to me which sorting algorithm should have what time complexity, and I did not intuitively understand how to apply big O notation to an algorithm.

To overcome the hurdle of determining big O notation, I found it useful to break each algorithm down by looking at the code of the actual function. By looking at the code, I was able to look for three things in particular: division (repeated halving of the problem), recursion, and nested for loops. These three things are worth paying attention to because they are often revelatory of the time complexity of the algorithm. If we see the problem size repeatedly halved without recursion over every item, as in binary search, the number of passes grows logarithmically with n, the number of items in the list. Hence, we end up with a time complexity of O(log(n)). If we see halving combined with recursion that still touches every item, as in merge sort, there are about log(n) levels of recursion and each level does work proportional to n. Hence, we end up with O(n log(n)). If we see a nested for loop, we know that for each of the n items, we are going to have to iterate over the items again. Hence we end up with roughly n^2 comparisons, and a time complexity of O(n^2). Of course, there are also O(n) and O(1). These, however, are fairly intuitive: O(n) means the algorithm touches every item in the list once, where n represents the number of items. O(1) means the algorithm's running time does not depend on n at all; it is invariable, so we represent it using a constant value, i.e. 1.
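To make these growth rates concrete, here are two toy functions of my own (not from lecture): the first makes about log2(n) passes, and the second about n^2.

```python
def count_halvings(n: int) -> int:
    """Return how many times n can be halved before reaching 1 -- O(log n)."""
    count = 0
    while n > 1:
        n = n // 2
        count += 1
    return count


def count_pairs(n: int) -> int:
    """Return how many iterations a doubly nested loop performs -- O(n^2)."""
    count = 0
    for i in range(n):
        for j in range(n):
            count += 1
    return count
```

For n = 1024, count_halvings makes only 10 passes while count_pairs makes 1048576, and the gap only widens as n grows.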

I hope my approach to the issue of big O notation can help some other people in the class as we prepare for the final. For further exploration of sorting and efficiency, check out Zoe's post!

Saturday 29 March 2014

A2 Part II Overview

In this post, I will discuss the functions of A2 part II. I will note the difficulties that I ran into, and how I overcame them.

is_regex(s)

The hardest function in this assignment, by far, is is_regex. Writing code that can recognize valid regexes was fairly simple, because the criteria were laid out for us in the assignment's handout. The difficulty came with writing all of the necessary code for recognizing invalid regexes, and returning False instead of raising an error. We know that an empty string, '', cannot be a valid regex. We also know that a regex must have an equal number of opening, (, and closing, ), parentheses. These were the first conditions that I check for. This way, we can easily detect two kinds of invalid regexes and return False without forcing Python to run through the rest of the code. This saves lots of time and memory.

If the regex has a length of one and is in '012e', it is valid. Otherwise, I check whether the regex, if valid, is nested. In order to do this, I find the central separator, '.' or '|'. This is more difficult than it initially appears. At first, I did not account for the possibility of multiple parenthesized regexes within one regex tree. In that implementation, I incorrectly searched for the first index of '.' or '|'. Of course, in the case of a nested regex tree, s.index('.') or s.index('|') will only lead to the first instance of that separator, and not the central separator that we are looking for.

The key to accounting for nested regexes is to count parentheses. If s is valid, the central separator will always occur at the index where, scanning from left to right, we have seen exactly one more opening parenthesis than closing parentheses. Take ((1.0).(2.e)) for instance. The root node occurs at index 6, at which point there have been two opening parentheses and one closing parenthesis. The only case in which this will not work is where one or more asterisks follow the first nested regex, e.g. ((1.0)****.(2.e)). According to the process described above, the central node should again be at index 6. However, in this case index 6 refers to an asterisk. In fact, indexes 6-9 all refer to asterisks; the root node, '.', is at index 10. What should we do in this case? A later if condition checks whether the current index refers to an asterisk, '*'. If it does, we advance to the index of the closest separator, if there is one. If there is no separator, but there are parentheses, this must be an invalid regex: parenthesized regex trees necessarily have a root node of '.' or '|'.
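The parenthesis-counting idea can be sketched as a helper like this (find_root_index is a name I made up, not part of the assignment; this sketch also skips over asterisks automatically, because it only stops at a '.' or '|' at depth one):

```python
def find_root_index(s: str) -> int:
    """Return the index of the central '.' or '|' separator in a
    parenthesized regex string s, or -1 if there is none.

    Assumes s looks like '(...)' with its root separator at depth 1.
    """
    depth = 0
    for i, c in enumerate(s):
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif c in '.|' and depth == 1:
            # Exactly one more '(' than ')' so far: this is the root.
            return i
    return -1
```

Running it on the two examples above, find_root_index('((1.0).(2.e))') returns 6 and find_root_index('((1.0)****.(2.e))') returns 10.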

The next step is to simply use recursion to perform the operations listed above on the left and right side of the parenthesized regex tree, i.e. from the start of the structure to the index of the root node for the left side, and from the index of the root node + 1 to the end of the regex.

all_regex_permutations(s)

The successful implementation of all_regex_permutations depends upon a correct implementation of is_regex. For this function, it is useful to have a helper function that generates all permutations of s, whether they are valid regexes or not. Then, in the main function, write a line of code that returns the set produced by the helper function, less all of the invalid regexes. This is where is_regex comes in handy: it can filter out the invalid items in the set returned by the helper function.
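One way to write such a helper (a sketch of my own; the recursive idea is to pull out each character in turn and permute the rest):

```python
def permutations(s: str) -> set:
    """Return the set of all permutations of s, valid regexes or not."""
    if len(s) <= 1:
        return {s}
    result = set()
    for i in range(len(s)):
        # Fix s[i] as the first character and permute the remaining ones.
        for rest in permutations(s[:i] + s[i + 1:]):
            result.add(s[i] + rest)
    return result
```

The main function can then be a one-liner along the lines of `return {p for p in permutations(s) if is_regex(p)}`.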

On a side note, if your implementation of is_regex is not perfect, using all_regex_permutations can be a great way to debug and find out what is wrong with your code. As programmers, it is sometimes difficult to imagine all possible combinations of characters that could break our code. All_regex_permutations will test is_regex with every possible combination of the characters included in the variable s. If something is wrong with your implementation of is_regex, the error raised while running all_regex_permutations will let you know which line is problematic.

regex_match(r, s)

This function asks us to use a concept that we haven't touched before: ternary strings. Understanding ternary strings requires a close reading of the assignment handout, in order to know how each string should interact with each regex tree structure, e.g. DotTree, BarTree, etc... You will have to use recursion for every structure other than Leaf. In order to test your code, I recommend trying some of the examples shared here on Piazza.

build_regex_tree(regex)

The implementation of build_regex_tree is very similar to that of is_regex. One key difference is that build_regex_tree assumes that the variable regex is valid. Hence, we don't have to account for invalid regexes, which makes our implementation much easier. The second key difference is that we have to deal with the regex tree subclasses defined in regextree.py. Otherwise, the implementation is the same as is_regex. We use recursion to access and output the left and right sides of regex if it is nested. Otherwise, we return a leaf for the single character in '012e'. If there are asterisks, we create asterisk trees containing the other appropriate regex tree subclasses.
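A rough sketch of this structure, using a single generic RegexTree class in place of the subclasses defined in regextree.py (so the class here is a placeholder of my own, not the assignment's actual classes):

```python
class RegexTree:
    """A generic regex tree node: a symbol and a list of child trees."""
    def __init__(self, symbol: str, children: list) -> None:
        self.symbol, self.children = symbol, children


def build_regex_tree(regex: str) -> RegexTree:
    """Return the tree for a regex string, assumed to be valid."""
    if len(regex) == 1 and regex in '012e':
        return RegexTree(regex, [])          # leaf case
    if regex.endswith('*'):
        # Star tree wrapping whatever precedes the asterisk.
        return RegexTree('*', [build_regex_tree(regex[:-1])])
    # Parenthesized case: find the root '.' or '|' by counting parentheses.
    depth = 0
    for i, c in enumerate(regex):
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif c in '.|' and depth == 1:
            return RegexTree(c, [build_regex_tree(regex[1:i]),
                                 build_regex_tree(regex[i + 1:-1])])
```

In the real assignment, the three return statements would construct Leaf, StarTree, and DotTree/BarTree objects respectively.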


Anyway, I hope this guide helps you in understanding the assignment, even if it is already over; I'm sure some of the concepts employed in this assignment will appear, in some form, on the final exam. Happy studying!

Yanming Mai also created a post about A2 and some of its more challenging parts, such as efficiency (although this wasn't necessary) and keeping good style with the large amount of code.


Binary Search Trees and Linked Lists

Binary search trees (BSTs) and linked lists (LLs) are both recursive data structures. In this post, I will explain their structure and utility.

Before defining BSTs, I would like to quickly review binary trees (BTs). At their most basic level, BTs are defined by a three-node structure: a root node, a left node, and a right node-- each of which can equal None. In a deeper recursive structure, each left and right node, stemming from the root, will have its own left and right nodes. A BST is structurally similar to a BT, but with two conditions:


  1. All items to the left of the root node must be smaller in value than the root node.
  2. All items to the right of the root node must be greater in value than the root node.

Please see the diagram below for an example of a BT (a) vs. a BST (b).

BT vs BST
 Taken from http://msdn.microsoft.com/en-us/library/ms379572(v=vs.80).aspx
The diagram above shows a BT (a) and a BST (b), each containing similar information, only organized differently. In (b), all of the items to the left of the root node are less than 7. All the items to the right of the root node are greater than 7. BSTs have one obvious advantage over BTs: efficiency. For instance, if we want to search for a value in a BST, we can determine whether the value is less than or greater than the root node, and then traverse only the appropriate side. In this way, we can ignore one half of the tree at every step. Because BTs are unorganized, searching for an item may require a traversal of the entire tree.
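This search can be sketched as follows, using a bare-bones node class of my own (the course's actual tree classes may differ):

```python
class BSTNode:
    """A BST node: a value and optional left and right subtrees."""
    def __init__(self, value, left=None, right=None) -> None:
        self.value, self.left, self.right = value, left, right


def bst_contains(node: BSTNode, value) -> bool:
    """Return whether value occurs in the BST rooted at node.

    At every step, one entire subtree is ignored, so the running time
    is proportional to the height of the tree, not its size.
    """
    if node is None:
        return False
    if value == node.value:
        return True
    elif value < node.value:
        return bst_contains(node.left, value)   # ignore the right half
    else:
        return bst_contains(node.right, value)  # ignore the left half
```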

I would now like to switch my focus to linked lists (LLs). At its most basic level, an LL is structurally defined by a root/head node and a reference/tail node. A reference node can either be empty or contain another LL-- this is where recursion comes in handy. See the diagram below for an example of a linked list:

A Linked List Data Structure,
Taken from https://wiki.cs.auckland.ac.nz/compsci105ss/index.php/Linked_Lists

The diagram above would be represented, in Python, as:

LinkedList(2, LinkedList(4, LinkedList(6, LinkedList(8, None))))

We can tell immediately that parsing through this structure will require recursion because of its nested structure. This raises the question: why use linked lists? Aren't regular lists simpler? They don't require recursion, after all. I immediately asked these two questions after reading about LLs. The benefit of LLs is realized when mutating them. Take the regular list [1, 2, 3, 4] for instance. In order to remove 2 from this list, we have to shift every item after it, i.e. 3 and 4, down by one index. In this example, this might not seem so bad, given the short length of our list. However, if we were dealing with a list of millions of items, it would require a significant amount of time. Now, if we express the previous example as a linked list, LinkedList(1, LinkedList(2, LinkedList(3, LinkedList(4, None)))), we can remove items with far less work. In order to remove 2, Python just has to reassign the reference node of 1 to 3. It does not have to shift each remaining item. The diagram below illustrates this concept.

Deleting an item in a LL
Taken from http://www.cs.iit.edu/~cs561/cs331/linked_lists/DeletionMiddle.html
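Here is one way the LinkedList structure and this removal might be sketched in Python (a simplified class of my own, not the course's implementation):

```python
class LinkedList:
    """A node with a head value and a reference to the rest of the list."""
    def __init__(self, head, rest=None) -> None:
        self.head, self.rest = head, rest


def remove(lnk: LinkedList, value) -> LinkedList:
    """Return lnk with the first node containing value removed.

    Only one reference is reassigned; no other items move.
    """
    if lnk is None:
        return None
    if lnk.head == value:
        # Splice this node out by returning the rest of the list.
        return lnk.rest
    lnk.rest = remove(lnk.rest, value)
    return lnk
```

Calling remove on the list from the diagram simply points node 1's reference at node 3, exactly as described above.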

In this way, the advantage of both BSTs and LLs is their efficiency relative to their counterparts, BTs and lists respectively. BSTs allow us to reject an entire side of a binary tree structure when searching for a value. LLs don't necessitate a shifting of all their constituent items in the same way that lists do; LLs only require a reassignment of a reference node.

For more information on LLs and BSTs, I recommend checking out GR's post. He or she provides some great metaphors for binary search trees. Ilikechicken also provided a post with some useful graphical representations of LinkedLists.

Tuesday 11 March 2014

Should Computer Science be Referred to as a Science?

In the academy, we have a tendency to label disciplines of study as either an 'art' or a 'science.' This sort of categorization is beneficial in some ways; for instance, by labelling a group of courses as 'sciences,' we understand that their premises are in some way linked through a similar grounding in scientific rationality. The 'arts,' on the other hand, are often characterized by a blend of empirical support and speculation. By dividing courses into two groups, we can choose a path of study based on our fundamental interests: objectively provable phenomena via the sciences, or creative expression and thought experiments via the arts. However, it is important to note that these categories are not always mutually exclusive. On these grounds, I think this categorical dichotomy actually imposes a limit on our understanding of each field within these umbrella categories, e.g. physics, computer science, philosophy, etc., and thereby inhibits our intellectual development. The popular categorization of computer science as a 'science' reveals the intellectually limiting nature of the division of the academy into faculties of 'arts' and 'sciences,' because computer science is grounded in both a scientific rationality, characteristic of the sciences, and a desire for creative expression, partly characteristic of the arts. The artistic side, however, is often forgotten.

We can view the expression of ideas as a basic premise of object-oriented programming. When we write a function, we hope to take a parameter, interact with it in some meaningful way, and then express that interaction via an output or by changing the value of a variable. This allows us to express an idea to the user of our function. In this way, computer science is reminiscent of a language course in the arts. In learning a second language, we hope to take an idea and articulate, or output, it in a way that other people can understand. Expression is one of the fundamental purposes of language. This common goal of language courses in the humanities and computer science courses in the sciences sends a rift through the idea of the arts and sciences being distinct fields. The constructed separation of these two fields obscures the otherwise clear connection between them, and inhibits us from applying knowledge from one to the other interchangeably.

Furthermore, we often think of programming as a means to a strictly utilitarian end. In other words, we tend to value programming for its ability to produce something that we can use productively. For instance, we might write a function that can produce a numerical value based on a wide range of sources, so that we can use it in some productive way. We do this in the most efficient and calculated way possible. While human language can serve this purpose as well, via the dissemination of commands, it can also serve other purposes. For instance, we can write poems or prose for sheer literary appreciation. If we step outside of the dominant use of programming language, i.e. for utilitarian purposes, can we appreciate it in the same way we do other languages, such as French or English? Can we not use it to represent something that carries value outside of a paradigm of utilitarianism? I think the case of video games is a salient example of how computer science can be used to produce something that is valuable without being strictly utilitarian. In 2012, independent software development studio Thatgamecompany produced Journey, a game centred around a nameless protagonist on a mysterious quest.




Of course, object-oriented programming language serves as a backbone for Journey, and therefore provides a strict and calculated mathematical rationality for how everything in the game's digital environment works, e.g. movement, lighting, geometrical shapes. However, these myriad mathematical attributes coalesce to form a digital world defined by its enigma and enjoyability. Journey's value is undeniable, but not for utilitarian reasons. Journey, contrary to what is often prescribed in the academy, reaffirms the possibility of computer science being used as a means to an artistic end, and not a strictly scientific one. I think that if the academy expressed computer science as both a scientific and artistic tool, students would have a broader understanding of how computer science can be used, and perhaps be inclined to use it more creatively.

Thus, the case of computer science forces us to deconstruct our preconceptions of 'science' and 'art,' and ask: does the departmentalization of the academy limit our ability to recognize the possible applications of computer science?

On a side note, I would also like to take a moment to recognize a fellow slogger. ACC's post on A1 provides a great overview of the difficulties of the assignment. If anyone struggled with A1, I strongly encourage you to look at ACC's post in order to brush up before the final.

Sunday 23 February 2014

Recursion

Over the past few weeks, we have spent a considerable amount of time studying recursion. Recursion allows programmers to handle unpredictably nested ADTs, such as lists, tuples, or dictionaries. Any of these three types of objects are capable of having nested values, which a for loop, at least from what we have learned so far, is incapable of collecting data from in an efficient manner. Recursive data structures always have at least two definitions, of which one is circular. For example, we can define a nested string list as a list whose items are: a) nested string lists or b) strings. In this case, a is the circular definition because it defines a nested string list in terms of nested string lists. Keeping this in mind, if we wanted to, say, count the number of strings in a nested string list, a recursive function would refer to these two definitions until it has read every nested string list within the original list. If an item meets definition b, the function adds 1 to an accumulator variable. If an item meets definition a, the function enters the nested string list, performs the recursive function, and adds its output to the accumulator variable at the level above. In the end, the accumulator variable at the top level of the function, i.e. outside all of the nested string lists of whatever depth, will equal the sum of the outputs of every recursive call.
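The string-counting example above might look like this in code (my own sketch):

```python
def num_strings(nested: list) -> int:
    """Return the number of strings in a nested string list, where each
    item is either a string (definition b) or another nested string
    list (definition a)."""
    total = 0
    for item in nested:
        if isinstance(item, str):
            total += 1                    # definition b: count the string
        else:
            total += num_strings(item)    # definition a: recurse deeper
    return total
```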

So far in lecture, we have mainly covered recursion over nested lists. For the sake of practice, this week I wrote a function that counts the number of nested dictionaries within a dictionary. It accounts for the possibility of dictionaries being nested within dictionaries and/or lists. If you're interested, or have any suggestions for how I can improve my code, please let me know!

def num_dictionaries(o: object) -> int:
    """Return the number of dictionaries nested inside o, where o may be
    a dictionary, a list, or any other object.
    """

    total_dicts = 0

    if isinstance(o, dict):
        for value in o.values():
            if isinstance(value, dict):
                # Count the nested dictionary itself, plus anything inside it.
                total_dicts += num_dictionaries(value) + 1
            elif isinstance(value, list):
                total_dicts += num_dictionaries(value)
    elif isinstance(o, list):
        for item in o:
            if isinstance(item, dict):
                total_dicts += num_dictionaries(item) + 1
            elif isinstance(item, list):
                # Recursing on the sublist handles any depth of nesting.
                total_dicts += num_dictionaries(item)

    return total_dicts

I'd also like to take this opportunity to thank http://lambda--x.blogspot.ca/ for the words of support in his/her post: http://lambda--x.blogspot.ca/2014/02/meta-recursion.html! I have also enjoyed reading your slog. I appreciate the amount of work that you put into it. Your posts, such as the one above, have exposed me to ideas that we don't encounter in-class, and have made me value CSC148H's slogging community.

Sunday 16 February 2014

Binary Trees and Intuition

This week I was particularly interested in the use of binary trees for the purpose of organizing information. I think there is still a tendency, even amongst people born in the information age, to look at computer programming as an enigmatic and inaccessible activity. However, this week's study of binary trees reminded me that the computer is a human invention that, as such, is often fundamentally grounded in common human experience, not esoteric computational jargon. A binary tree's use of roots, branches, and leaves is clearly reflective of the natural world, and therefore requires little thinking to at least conceptually understand. In light of this, I think binary trees are a great example of the approachability of computer science. Perhaps, this and similarly intuitive concepts can be used to convince people with computer-phobia that programming is not as enigmatic as many people think.

I'd also like to give a shout out to CaroneCSC148 for his/her post on trees. It provides information on tree traversal: inorder, preorder, and postorder. Great job!

Sunday 9 February 2014

Binary Code

As students of computer science, our work tends to strive for utilitarian value; we want our code to have a clear and useful purpose. In doing this, however, we often obscure or overlook what is going on behind the scenes in a programming language, and focus solely on how we can use it to a productive end. In doing so, I think we necessarily lose sight of one of the simple wonders of computer science: the application of binary code.

In this week's lab, we learned about ways in which binary code can represent numerical values other than 1 and 0. For instance, "10" represents "2," "11" represents "3," and "100" represents "4." In learning this, I was amazed, again, by the fact that all information, however complex, can be represented via two characters: "1" and "0". We can teach computers to complete myriad tasks with these two simple values. But how, then, did we first teach computers to handle binary code? If there was no linguistic foundation, i.e. the kind that binary code provides for all other programming languages, how did we teach computers to learn how to interpret simple code? How did we make something, i.e. binary code, out of nothing?
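As a quick aside (this snippet is mine, not from the lab), Python's built-in int and bin functions make these conversions easy to check:

```python
# int with base 2 converts a binary string to an integer...
assert int('10', 2) == 2
assert int('11', 2) == 3
assert int('100', 2) == 4

# ...and bin converts an integer back to a binary string.
assert bin(2) == '0b10'
assert bin(4) == '0b100'
```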

I'd also like to take a moment to recognize Denise Jiang's post on recursion. She provides a great analogy for a recursive function without a base case: two parallel mirrors.