I began working at the RTIS (Real Time Intelligent System) Lab at UNLV this week. For the first few days I was responsible for annotating videos of Olympic diving, which will later be used to train a machine learning algorithm to rate athletic performance. This ties in with work I had done previously at the lab while in high school; most interns annotate a few of these videos during their internship to ensure there are adequate training examples for the algorithm. I was also introduced to my specific project, a facial recognition system for access monitoring, and was given a research paper on the project along with access to all previous work done on it. The goal of the project is to develop a complete hardware-software system that can be used for access monitoring in a secured area using facial recognition technology (FRT). The paper states that there are currently two implementations of the system - a lightweight and a full-scale version - which run on a Raspberry Pi and a desktop computer respectively. The lightweight version is ideal for real-world deployment since it is relatively low cost and flexible enough to be set up in multiple locations. However, the current facial recognition algorithm, Eigenfaces, is somewhat outdated and inefficient, and the Raspberry Pi lacks the processing power to perform complex machine learning tasks. Intel recently released the Movidius Neural Compute Stick, which is specifically designed for deploying Convolutional Neural Networks in low-power applications through its low-power VPU architecture.
At the beginning of this week I continued my research on facial recognition systems and decided to test an implementation of the OpenFace algorithm. First, face detection is performed on an image using a Histogram of Oriented Gradients (HOG) face detector. The detected face is then aligned in order to normalize faces that are slightly turned in one direction or another. The aligned image is passed through a Convolutional Neural Network (CNN), which produces a 128-dimensional embedding that lies on a unit hypersphere. These embeddings have the useful property that similarity between faces can be measured by the Euclidean distance between two embeddings, a result of the triplet loss used to train the convolutional model. This is the algorithm I expect to use in my implementation of the system on the Raspberry Pi, since it is the algorithm currently used on the full-scale system. I ultimately managed to get a Python and Torch implementation of OpenFace working. At the end of each week my mentor organizes a lab meeting in which one of the other students working in the lab presents on a topic or paper related to their work.
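Since the embeddings lie on a unit hypersphere, comparing two faces reduces to a Euclidean distance check. A minimal sketch of that comparison (the threshold value here is a hypothetical placeholder; in practice it would be tuned on validation data):

```python
import numpy as np

def embedding_distance(a, b):
    """Euclidean distance between two 128-dimensional face embeddings."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

def same_person(a, b, threshold=0.99):
    # threshold is a placeholder; the right cutoff is found empirically
    return embedding_distance(a, b) < threshold
```

Because of the triplet loss, a small distance means the two faces likely belong to the same person, which is what makes this simple check work at all.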
After doing more research over the weekend and at the beginning of this week, I discovered that I would be unable to use the current implementation of the OpenFace algorithm with the Movidius Neural Compute Stick. However, support for TensorFlow - an open source machine learning framework - had recently been added to the Neural Compute Stick. This was important because Google's FaceNet algorithm, which is similar to OpenFace, has both Caffe and TensorFlow implementations, and a version of FaceNet is included as an example for the Movidius Neural Compute Stick. I decided to use this FaceNet implementation instead of the OpenFace implementation I had previously settled on. I wrote a Python program that takes a single pre-aligned image as input and outputs a 128-dimensional embedding unique to that image. This is the basic functionality I need for developing the full system. Ideally, each frame from a video capture would go through face detection, and recognition would then be performed on the detected faces, producing a real-time visualization of the system for testing purposes.
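Before an aligned image is fed to the network, FaceNet-style pipelines typically prewhiten it (per-image mean subtraction and variance normalization). A sketch of that preprocessing step, assuming the image arrives as a NumPy array; the exact normalization used by the example code may differ:

```python
import numpy as np

def prewhiten(img):
    """Normalize an aligned face image before passing it to the network."""
    img = np.asarray(img, dtype=np.float64)
    # guard against near-flat images with a lower bound on the divisor
    std_adj = max(img.std(), 1.0 / np.sqrt(img.size))
    return (img - img.mean()) / std_adj
```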
Once my program was working in Python using the TensorFlow library, I needed to utilize the Movidius Neural Compute Stick (NCS). The first step was to convert my model into one that could be loaded onto the Neural Compute Stick. I was then able to replace the code that ran the TensorFlow graph with code that passes the image through a graph loaded on the NCS. Next, I began working on the full system. I will use OpenCV - a library of programming functions mainly aimed at real-time computer vision - to run video capture from an entrance and an exit camera. Each frame will be run through a HOG-based face detector combined with a linear classifier, an image pyramid, and a sliding-window detection scheme. This type of face detector is available in the dlib library, which contains a wide range of machine learning algorithms and tools. After acquiring the faces in the image, each face will be aligned, again using dlib, and then passed through the FaceNet CNN to generate output embeddings. The faces can then be recognized using a nearest neighbors algorithm, since the Euclidean distance between embeddings is directly related to the similarity between faces. The results will be displayed in an output window, with rectangles around detected faces and names alongside the rectangles to denote their identities. Upon user request, an entry to or exit from a room will be simulated and then output to a web-based access log. The ultimate goal is to replace the user request with a soft trigger that detects when a user is "entering" or "exiting" a room and performs the facial recognition at that point. After recognition is performed, the information is stored in a meaningful way and sent to the webpage access logs.
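The recognition step at the end of this pipeline can be sketched as a simple nearest-neighbor lookup against a gallery of known embeddings. The names and the distance threshold below are hypothetical, not the project's actual values:

```python
import numpy as np

def recognize(embedding, gallery, threshold=1.0):
    """Return the closest known identity, or None if nobody is close enough.

    gallery maps names to previously computed reference embeddings.
    """
    best_name, best_dist = None, float("inf")
    for name, ref in gallery.items():
        dist = float(np.linalg.norm(np.asarray(embedding) - np.asarray(ref)))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None
```

A linear scan like this is fine for a small gallery; a larger deployment would swap in a proper nearest-neighbor index.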
I developed a system with the basic functionality listed in last week's report, up to the point where the information is sent to an access log. A websocket is a technology that allows for two-way communication between a user's browser and a server. After some research, I discovered a Python library called Flask, which would allow me to host a server, and a related library called flask-socketio, which would allow the use of websocket communication on that server. After some testing, I managed to implement a version of the program that can effectively send and display access information on the webpage. For now, I store all access information in arrays in my Python code. When a new value is added to these arrays, or the webpage is reloaded, I send the values through the server and they are received and displayed on the webpage in real time. The ultimate goal for this phase is an access log that is fully synchronized with any number of facial recognition systems set up in a given facility. This will require changing the way these values are saved so that they can be accessed globally by the other recognition systems.
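Flask and flask-socketio handle the transport itself; the part worth sketching is the in-memory log and the payload that gets emitted to the page. The field names here are assumptions for illustration, not the project's actual schema:

```python
import json
import time

access_log = []  # for now, access records simply live in an in-memory list

def record_access(name, action, now=None):
    """Append an access event and return the JSON payload that would be
    sent over the websocket (via socketio.emit in the real program)."""
    entry = {
        "name": name,
        "action": action,  # "entry" or "exit"
        "time": now if now is not None else time.strftime("%Y-%m-%d %H:%M:%S"),
    }
    access_log.append(entry)
    return json.dumps(entry)
```

On a page reload, the server would replay the whole `access_log` list so the browser view stays in sync.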
The ultimate structure of the program requires a multithreaded approach in which I simultaneously host a server for websocket communication and perform facial detection and recognition on video input. Python has a threading library, which I decided to use for this task. Generally speaking, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler. Multiple threads may exist and execute concurrently within one process, sharing resources such as memory, while separate processes do not share these resources. Multithreaded applications have many advantages, such as faster execution, lower resource consumption, better system utilization, and simplified sharing and communication. While the current prototype of the facial recognition system allows for the use of two cameras, I decided that for testing purposes I would first perfect the program on one camera representing the entrance to a room and simply add the exit camera later. My first thread was responsible for hosting the Flask server, while the second gathered video input, performed detection and recognition, and sent the output through a websocket. The sharing of process memory between threads proved very useful, since it allowed the websocket variable to be accessed from both threads. The performance of this implementation was surprisingly good, as I had, rather naively, expected at least some performance drop when performing multiple tasks at once. The web access logs updated in real time and persisted when closed and reopened, so I decided it was time to deploy the application on the Raspberry Pi.
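A stripped-down version of the two-thread layout, with a shared queue standing in for the websocket object that both threads touch (the worker bodies are placeholders for the real server and camera loops):

```python
import queue
import threading

events = queue.Queue()  # shared freely between threads within one process

def camera_worker():
    # stand-in for: capture frame -> detect -> recognize -> send via websocket
    for frame_id in range(3):
        events.put(("entrance", frame_id))

def server_worker():
    # stand-in for: socketio.run(app), which blocks while serving clients
    pass

threads = [threading.Thread(target=camera_worker),
           threading.Thread(target=server_worker)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```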
I moved development over to the Raspberry Pi, a small single-board computer that allows for lightweight and mobile deployment of computing applications. After a somewhat lengthy setup process, which involved backing up the current Raspberry Pi SD card in order to update the operating system to Raspbian Stretch, I was able to test my program. While the facial recognition itself ran smoothly, various other parts of the application, including the dlib HOG face detection and the nearest neighbors search, performed disappointingly. I also noticed mediocre results from the facial recognition output, which sometimes confused different users. I decided that my most immediate concern was the slow runtime of the face detection, since the architecture of the program didn't require the nearest neighbors search to be performed as often, and I attributed the poor recognition accuracy to a poorly trained model. I eventually decided that once initial detections had been made, I could perform subsequent detections on the smaller areas in which faces had previously been seen. This greatly reduced the size of the image passed into the dlib face detector and improved runtime by a factor of 4. Another approach would be to use motion detection in conjunction with face detection. Motion detection could also provide the much needed soft-trigger mechanism to indicate when an entry into or exit from a room is occurring.
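The speedup came from detecting only inside a padded region around the last known face position. A sketch of that region calculation (the margin value is a guess at a reasonable padding, not the exact figure I used):

```python
def search_region(box, frame_w, frame_h, margin=0.5):
    """Expand a previous detection box (x, y, w, h) by `margin` on each
    side, clamped to the frame, so the next detection scans a much
    smaller image than the full frame."""
    x, y, w, h = box
    dx, dy = int(w * margin), int(h * margin)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1, y1 = min(frame_w, x + w + dx), min(frame_h, y + h + dy)
    return x0, y0, x1 - x0, y1 - y0
```

If the face leaves the padded region, the detector misses it and the program falls back to a full-frame scan.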
After further testing, my approach proved ineffective when dealing with a larger number of faces. I decided that a more reasonable approach would be to implement face detection using Haar cascades, which I had read would give much better performance. A Haar feature-based cascade classifier uses a machine learning approach in which a cascade function is trained on many positive and negative images and subsequently used to detect objects in other images. The OpenCV library provides the framework needed to train a cascade classifier; it also ships with a pretrained frontal-face cascade classifier that I could use directly. One big consideration was the effectiveness of the cascade classifier: while it successfully detects frontal faces, any rotation of the face along any axis renders the detector useless. This is a general challenge in implementing the facial recognition system, and it will require further testing to determine how often detections are missed even though a person is present. I eventually determined that although the cascade classifier might not be as effective as the dlib face detector, it would reduce unwanted errors and provide better pre-aligned images for the facial recognition network. The cascade classifier greatly improved performance: detections could be performed every other frame while still producing video output at around 25-30 frames per second. In short, while extremely accurate and persistent face detection can be achieved with a HOG detector combined with some other form of tracking, that persistence makes it harder to compare two faces that could be in diametrically opposed positions. A somewhat less robust but more consistent detector actually improves recognition accuracy, since all the detected faces are in more similar positions, which reduces error.
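The every-other-frame cadence can be sketched with the detector factored out; in the real program `detect` would wrap OpenCV's pretrained classifier (a `cv2.CascadeClassifier` loaded with the bundled `haarcascade_frontalface_default.xml`):

```python
def track_faces(frames, detect, every=2):
    """Run the (relatively expensive) cascade detector only on every
    `every`-th frame and reuse the previous boxes in between."""
    boxes, history = [], []
    for i, frame in enumerate(frames):
        if i % every == 0:
            boxes = detect(frame)  # e.g. cascade.detectMultiScale(gray)
        history.append(list(boxes))
    return history
```

Halving the detection rate is what keeps the output video at 25-30 frames per second on the Pi.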
I had also learned last week that a big problem with my design was the use of Python's threading module. The Global Interpreter Lock, or GIL, prevents multiple threads from executing Python bytecode at once, and thus prevents a program from taking full advantage of multithreading. I eventually decided to use the Python multiprocessing module instead, which allows multiple processes to run at once. While threads run within the same process and share resources, the separate processes produced by the multiprocessing module do not. The multithreading approach had allowed me to easily access the websocket variable from a different thread, but this is much more difficult with multiprocessing, where a websocket object cannot be shared between two processes. It is, however, possible to communicate between two processes through a Pipe, which returns an input end and an output end for sending and receiving certain types of objects and values. I ultimately decided to run my program using four processes, matching the number of cores on the Raspberry Pi. Two processes read video, each from a separate camera, and perform face detection but not recognition. Once a face has been detected and the user request to record an entrance/exit has been submitted, a cropped version of the face is stored locally and a reference to it is sent through the output end of a pipe to the receiving end, where a third process determines who appears in the cropped images along with various other pieces of information. This process then sends the information through the websocket to be displayed on the webpage, while simultaneously storing it locally as .json files. These .json files are used to retrieve the information whenever the webpage is reloaded. The fourth process simply hosts the Flask server.
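A minimal sketch of the Pipe handoff between a detection process and the recognizer process (the message shapes here are illustrative, not the actual protocol):

```python
import multiprocessing as mp

def recognizer(conn):
    """Receiving end: identify cropped faces sent by a detection process."""
    while True:
        msg = conn.recv()
        if msg is None:  # sentinel: shut down
            break
        camera, face_id = msg
        # real program: FaceNet embedding + nearest-neighbor lookup here
        conn.send((camera, face_id, "identified"))

def run_demo():
    parent, child = mp.Pipe()
    proc = mp.Process(target=recognizer, args=(child,))
    proc.start()
    parent.send(("entrance", 1))  # a detection process hands off a face
    result = parent.recv()
    parent.send(None)
    proc.join()
    return result

if __name__ == "__main__":
    run_demo()
```

Each end of the pipe can both send and receive, which is why a single Pipe suffices for the request/response exchange above.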
Ultimately, this implementation of the program runs extremely well and is modularized in a way that allows easy future expansion of certain aspects of the program. For example, if we would like to automatically record faces that are larger than a certain size (meaning the person is almost certainly entering), it would be easy to modify the detection processes to do so without affecting other parts of the program.
I made a few minor modifications to my code and spent some time adding comments and a readme explaining how my program works and how to set up the Raspberry Pi to run it. I provided these instructions so that the program would be easy to understand and so that future students working on this project could continue where I left off. While the current prototype works well for testing, it is still far from being able to run independently. First, it is necessary to implement an "auto capture" feature that determines when a user is performing an entrance/exit and automatically sends this information to the access logs. This will require a lot of real-world testing to learn how to properly position the camera and to determine what constitutes an entry or exit. It is also necessary to implement a central database for operating multiple units, which would synchronize any number of facial recognition units placed around a building. I also prepared a presentation to give at the end of the week, similar to the lab presentations given by the other students. I discussed my basic role in expanding an existing prototype system, as well as the newer FaceNet recognition I used in place of the outdated Eigenfaces recognition and the performance benefits of upgrading to such a model.