SYSTEM FOR GENERATING FUNCTIONALITY REPRESENTATION, INDEXING, SEARCHING, COMPONENTIZING, AND ANALYZING OF SOURCE CODE IN CODEBASES AND METHOD THEREOF
First Claim
1. A computer-implemented method for organizing, functionality indexing and constructing a source code search engine, the method comprising:
a. crawling a set of data entities in a repository system, each of the data entities representing one or more of a source code units and/or subsets of the source code units;
b. parsing said set of data entities into abstract syntax trees (ASTs) architecture;
c. modeling said set of data entities into a code graph (CG) architecture such that each one or more of a source code units and/or subsets of the source code units are set as vertices and connections between said each one or more of a source code units and/or subsets of the source code units are set as edges;
d. establishing type ontology (TO) architecture of said set of data entities by processing said set of data and assigning meta-data tags to each one or more of a source code units and/or subsets of the source code units, said tags representing classification attributes;
e. generating semantic ID based on linguistic, structural and contextual analyses of said set of data entities, said semantic ID corresponding to source code functionality of said one or more of a source code units and/or subsets of the source code units, said linguistic analysis employing linguistic clues, said structural linguistic analysis employing structural clues, and said contextual analysis employing contextual clues; and
f. organizing and storing said set of data entities in functionality representation index (FRI) architecture.
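Steps b and c above can be illustrated with a minimal sketch. It assumes Python source as input, uses the standard-library `ast` module for parsing, and treats top-level functions as vertices and direct calls between them as edges. The claim does not fix a language, a unit granularity, or an edge type, so `build_code_graph` and the sample `EXAMPLE` source are hypothetical choices made only for illustration.

```python
import ast

def build_code_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the sibling functions it calls."""
    tree = ast.parse(source)  # step b: source text -> abstract syntax tree
    # Assumption: the "source code units" are top-level function definitions.
    functions = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph: dict[str, set[str]] = {name: set() for name in functions}  # vertices
    for name, node in functions.items():
        for child in ast.walk(node):
            # Assumption: an edge is a direct call to another known unit.
            if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                if child.func.id in functions:
                    graph[name].add(child.func.id)
    return graph

EXAMPLE = '''
def load(path):
    return open(path).read()

def tokenize(text):
    return text.split()

def count_words(path):
    return len(tokenize(load(path)))
'''

code_graph = build_code_graph(EXAMPLE)
```

Here `count_words` gets edges to `load` and `tokenize`, while the call to the built-in `open` is ignored because it is not a unit in the crawled set.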
Abstract
Systems and methods are disclosed for generating functionality representations of, indexing, searching, componentizing, and analyzing source code units in one or more code repositories. The systems and methods include one or more of: crawling a set of data entities in a repository system; parsing said set of data entities into abstract syntax trees (ASTs) architecture; modeling said set of data entities into a code graph (CG) architecture; establishing type ontology (TO) architecture of said set of data entities; organizing and storing said set of data entities in functionality representation index (FRI) architecture; componentizing one or more projects in the repositories into code components; and making the components discoverable by functionality and analyzable for performance, usage volume, etc.
20 Claims
4. A computer-implemented method for automatically extracting and characterizing code components from a software project, comprising steps of:
a. obtaining one or more software projects stored in a non-transitory computer readable medium;
b. scraping said projects;
c. detecting one or more programming languages of each of said projects;
d. detecting one or more environments of each said project;
e. parsing each file in each said project to obtain an abstract syntax tree of each said file;
f. identifying potential components in each said abstract syntax tree, said potential components comprising single nodes and/or collectivized nodes in said abstract syntax tree;
g. parsing said project across said files of each said project to obtain a project dependency graph of said project, said files being additional potential components of said project;
h. analyzing usage patterns of said potential components in said ASTs and said project dependency graph;
i. associating each said potential component with metadata comprising said usage patterns of said component;
j. generating a functionality representation for each potential component by analyzing said usage patterns, said dependencies and linguistic, contextual, and structural information of potential component;
k. appending said functionality representation each said potential component to said metadata of said potential component;
l. classifying a subset of the potential components as code components, using a statistical model trained on componentized projects; said statistical model employing said metadata as features;
m. creating a component dependency graph, each node of said component dependency graph associated with a said code component;
n. matching the code components with test files of the project, if any, and appending matched test files to said metadata of matched code components;
o. matching the code components with asset files of the project, if any and appending matched asset files to said metadata of matched code components;
p. assigning a unique name to each said code component;
q. classifying each said code component as an external or internal component;
r. analyzing file dependencies of each said code component to identify a main file, if any, of said code component;
s. creating edges between component nodes in the component dependency graph, said edges associated with said dependencies and with metadata regarding types of connections between said code components of said component nodes.
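Steps m, p, and s of claim 4 can be pictured as a small data structure: nodes are code components, names are made unique on insertion, and each edge carries metadata about the connection type. The sketch below is only one possible reading. The `Component` and `ComponentGraph` classes, the numeric-suffix naming scheme, and the `"import"` edge type are all hypothetical and are not taken from the claim.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    files: list[str]
    metadata: dict = field(default_factory=dict)

@dataclass
class ComponentGraph:
    nodes: dict[str, Component] = field(default_factory=dict)
    edges: list[tuple[str, str, dict]] = field(default_factory=list)

    def add_component(self, component: Component) -> str:
        """Step m: add a node. Step p: make its name unique (assumed scheme)."""
        base, suffix = component.name, 1
        while component.name in self.nodes:
            suffix += 1
            component.name = f"{base}-{suffix}"
        self.nodes[component.name] = component
        return component.name

    def add_dependency(self, src: str, dst: str, kind: str) -> None:
        """Step s: add an edge with metadata describing the connection type."""
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both components must be in the graph")
        self.edges.append((src, dst, {"type": kind}))

    def dependencies_of(self, name: str) -> list[str]:
        return [dst for src, dst, _ in self.edges if src == name]

graph = ComponentGraph()
button = graph.add_component(Component("button", ["src/button.js"]))
button_2 = graph.add_component(Component("button", ["lib/button.js"]))
form = graph.add_component(Component("form", ["src/form.js"]))
graph.add_dependency(form, button, "import")
```

The second component named `button` is stored as `button-2`, and `graph.dependencies_of("form")` lists only `button`.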
14. A computer-implemented method for providing discoverability and analytics for software components in a codebase, comprising steps of:
a. obtaining one or more component dependency graphs of code components stored in a code repository, said code components each associated with metadata comprising a functionality representation of said component 705;
b. indexing said functionality representations;
c. indexing language models;
d. indexing dictionaries, lexicons, and ontologies;
e. grouping code components from the component dependency graphs into functional clusters;
f. storing said functional clusters in a functionality representation index (FRI);
g. mapping a natural language query to one or more functional identifiers (FRIDs) in said FRI and producing a set of ranked results;
h. analyzing said code components; and
i. displaying results of analyzing code components and insights and analysis regarding entire projects and codebase repositories with more than one project.
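Step g of claim 14, mapping a natural language query to FRIDs and ranking the results, might look roughly like the sketch below. The claim does not say how the mapping or ranking is done. This sketch swaps in a plain token-overlap (Jaccard) score as a stand-in, and the `FRI` contents, the `frid:` identifiers, and the `search` function are invented purely for illustration.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Hypothetical index: functional identifier -> short functionality description.
FRI = {
    "frid:date-format": "format a date as a string",
    "frid:http-request": "send an http request and return the response",
    "frid:string-trim": "trim whitespace from a string",
}

def search(query: str, index: dict[str, str], limit: int = 3) -> list[str]:
    """Return FRIDs ranked by Jaccard similarity between query and description."""
    query_tokens = tokenize(query)
    scored = []
    for frid, description in index.items():
        description_tokens = tokenize(description)
        union = query_tokens | description_tokens
        score = len(query_tokens & description_tokens) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, frid))
    # Highest score first; ties broken alphabetically by FRID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [frid for _, frid in scored[:limit]]

results = search("format date string", FRI)
```

With this toy index, `results` ranks `frid:date-format` ahead of `frid:string-trim` and omits `frid:http-request`, which shares no tokens with the query. A real system would presumably draw on the indexed language models, lexicons, and ontologies from steps c and d instead of raw token overlap.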
Specification