Student-Faculty Programs Office
Summer 2025 Announcements of Opportunity



Project:  Accelerating Large Scale Pattern Recognition Methods - SURF@Newcastle in 2025
Disciplines:  Computation and Neural Systems, Mathematics, CS, Applied Math, Physics
Mentor:  Pablo Moscato, Professor, (EAS), pablo.moscato@newcastle.edu.au, Phone: +61 2 424216209
Mentor URL:  https://www.newcastle.edu.au/profile/pablo-moscato
Background:  NOTE: This project is being offered by a Caltech alum and is open only to Caltech students. The project will be conducted at the University of Newcastle in Newcastle, Australia.

Searching for patterns that help to discriminate between samples of different classes is an essential topic in machine learning and data mining. From the perspective of explainable AI, these patterns also have to be informative, giving domain experts new insights into what makes the samples different.

However, at universities and colleges around the world, the emphasis has been on identifying “small patterns”. This follows the well-known “Occam's razor” argument: if a small explanation that uses only a few features or variables is sufficient to make the classification, then it is unnecessary to consider solutions that involve a larger number of features.

In practice, however, some of these small patterns may not be interpretable for the final user. For instance, people who understand the problem domain may be unable to “connect the dots” when given only the few indicative features deemed of interest.

Over the past two decades we have been consolidating an approach based on a generalization of the k-Feature Set problem called the (alpha,beta)-k-Feature Set problem. Nearly 20 papers have been published about it, covering a wide range of applications. We suggest looking at the paper in [1] to understand the problem in the context of elicitation of molecular toxicity in proteins. That paper contains links to most of our published papers on the topic. The approach has also been used for the analysis of US presidential election data, and the topic is also described in [2].
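
For readers new to the problem, the following Python sketch may help fix ideas. It is not the team's code. It assumes the formulation commonly given in the literature: a set of k selected features is feasible if every pair of samples from different classes differs on at least alpha of the selected features, and every pair of samples from the same class agrees on at least beta of them. The function name, the toy data, and the plain-list data layout are all hypothetical choices made for illustration; see [1] for the authoritative definition.

```python
from itertools import combinations


def is_alpha_beta_feature_set(samples, labels, features, alpha=1, beta=0):
    """Check the assumed (alpha, beta) conditions for a chosen feature subset.

    samples  : list of equal-length lists of 0/1 feature values
    labels   : list of class labels, one per sample
    features : indices of the k selected features
    """
    for i, j in combinations(range(len(samples)), 2):
        # Number of selected features on which samples i and j differ.
        differing = sum(samples[i][f] != samples[j][f] for f in features)
        if labels[i] != labels[j]:
            # Different classes: need at least alpha distinguishing features.
            if differing < alpha:
                return False
        else:
            # Same class: need at least beta features with equal values.
            if len(features) - differing < beta:
                return False
    return True


# Toy instance (made up): 4 samples, 4 binary features, 2 classes.
samples = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]
labels = [0, 0, 1, 1]

print(is_alpha_beta_feature_set(samples, labels, [0], alpha=1, beta=0))     # True
print(is_alpha_beta_feature_set(samples, labels, [0, 2], alpha=2, beta=2))  # True
print(is_alpha_beta_feature_set(samples, labels, [1], alpha=1, beta=0))     # False
```

The check visits every pair of samples, so its cost grows quadratically with the number of samples. That is one reason why instances with hundreds of thousands of samples call for more careful, possibly GPU-based, implementations.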

These problems are normally solved using integer programming methods and, heuristically, using memetic algorithms. Two former PhD students of our team also worked on an approach that is supported by mathematical programming solvers [3].

This project stems from a collaboration with a Caltech SURF student in 2024 who developed a memetic algorithm for another problem, the L-Pattern Identification problem (which is also described in [2]). Her code, in Python, provides a stepping stone for the development of a variant of the L-Pattern Identification problem which, in essence, is related to the (alpha,beta)-k-Feature Set problem. The student's talk, “Assessing the Scalability and Interpretability of the L-Pattern Identification Problem”, was selected as one of the six finalists of the Perpall SURF Speaking Competition, to take place at the Hameetman Auditorium in Cahill on Tuesday, January 21, 2025, at 3:20 PM. You may wish to attend the talk if you are interested in knowing more and in meeting the previous SURF student.

The purpose of this project, then, is to extend the scalability of the (alpha,beta)-k-Feature Set problem codes and to develop a variant of the L-Pattern Identification problem that can address instances having nearly one million features and hundreds of thousands of samples. Achieving such scalability will probably require some understanding of Python and GPU computing.

[This project may have more than one individual, so working collaborators are invited to apply as a team.]
Description:  The student will continue the ongoing development of open source codes for memetic algorithms for machine learning problems, mainly in classification.

The method will be tested on a number of datasets of interest that are available for experimentation. A comparison with other machine learning approaches is expected; the deliverables may therefore help the team to continue the collaboration after SURF and to take part in ongoing international competitions dedicated to this area, such as those sponsored by Kaggle and other international groups.

We expect that candidates could continue developing this research area after returning to Caltech, if they are interested in an ongoing collaboration with the mentors (as has happened in the past).

The internship may provide the time needed to communicate effectively what the core problems are and to find a first solution, which may result in at least one journal publication.
References:  1) The (α, β)-k Boolean Signatures of Molecular Toxicity: Microcystin as a Case Study, Pablo Moscato, Sabrina Jaeger-Honz, Mohammad Nazmul Haque, Falk Schreiber, https://www.biorxiv.org/content/10.1101/2024.12.29.630644v1.abstract
2) Marketing Meets Data Science: Bridging the Gap, by P. Moscato and N.J. de Vries, in Business and Consumer Analytics: New Ideas, Springer, 2019, https://link.springer.com/chapter/10.1007/978-3-030-06222-4_1
3) Heuristic Solutions for the (alpha,beta)-k Feature Set Problem, by Leila M. Naeni and Amir Salehipour, in Heuristics for Optimization and Learning, Springer, 2020, https://link.springer.com/chapter/10.1007/978-3-030-58930-1_9
Student Requirements:  High-level programming skills, interest in scientific computing/machine learning/artificial intelligence. Experience in HPC and GPU computing.
Programs:  This AO can be done under the following programs:

  Program    Available To
       SURF    Caltech students only 




 

Problems with or questions about submitting an AO?  Call the Student-Faculty Programs Office at (626) 395-2885.
 