Java search engines

Mini Search Engine – Just the basics, using Neo4j, Crawler4j, Graphstream and Encog

Since I just wanted to run this as a little exercise, I decided to go for an in-memory implementation rather than running Neo4j as a service on my machine. In hindsight this was probably a mistake, as the server tools and web interface would have helped me visualise my data graph more quickly early on.

As you can only have one writable instance of the in-memory implementation, I wrote a small double-checked-locking singleton factory to create and clear the database.

package net.briandupreez.pci.chapter4;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.kernel.impl.util.FileUtils;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CreateDBFactory {

    private static GraphDatabaseService graphDb = null;
    public static final String RESOURCES_CRAWL_DB = "resources/crawl/db";

    public static GraphDatabaseService createInMemoryDB() {
        if (null == graphDb) {
            synchronized (GraphDatabaseService.class) {
                if (null == graphDb) {
                    final Map<String, String> config = new HashMap<>();
                    config.put("neostore.nodestore.db.mapped_memory", "50M");
                    config.put("string_block_size", "60");
                    config.put("array_block_size", "300");
                    graphDb = new GraphDatabaseFactory()
                            .newEmbeddedDatabaseBuilder(RESOURCES_CRAWL_DB)
                            .setConfig(config)
                            .newGraphDatabase();
                    registerShutdownHook(graphDb);
                }
            }
        }
        return graphDb;
    }

    private static void registerShutdownHook(final GraphDatabaseService graphDb) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                graphDb.shutdown();
            }
        });
    }

    public static void clearDb() {
        try {
            if (graphDb != null) {
                graphDb.shutdown();
                graphDb = null;
            }
            FileUtils.deleteRecursively(new File(RESOURCES_CRAWL_DB));
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
    }
}

Then, using Crawler4j, I created a graph of all the URLs starting with my blog, their relationships to other URLs, and all the words (and the indexes of those words) that each URL contains.

package net.briandupreez.pci.chapter4;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Neo4JWebCrawler extends WebCrawler {

    private final GraphDatabaseService graphDb;

    /**
     * Constructor.
     */
    public Neo4JWebCrawler() {
        this.graphDb = CreateDBFactory.createInMemoryDB();
    }

    @Override
    public boolean shouldVisit(final WebURL url) {
        final String href = url.getURL().toLowerCase();
        return !NodeConstants.FILTERS.matcher(href).matches();
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(final Page page) {
        final String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
        final Index<Node> nodeIndex = graphDb.index().forNodes(NodeConstants.PAGE_INDEX);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            //String html = htmlParseData.getHtml();
            List<WebURL> links = htmlParseData.getOutgoingUrls();

            Transaction tx = graphDb.beginTx();
            try {
                final Node pageNode = graphDb.createNode();
                pageNode.setProperty(NodeConstants.URL, url);
                nodeIndex.add(pageNode, NodeConstants.URL, url);

                //get all the words
                final List<String> words = cleanAndSplitString(text);
                int index = 0;
                for (final String word : words) {
                    final Node wordNode = graphDb.createNode();
                    wordNode.setProperty(NodeConstants.WORD, word);
                    wordNode.setProperty(NodeConstants.INDEX, index++);
                    final Relationship relationship = pageNode.createRelationshipTo(wordNode, RelationshipTypes.CONTAINS);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                }

                for (final WebURL webURL : links) {
                    System.out.println("Linking to " + webURL);
                    final Node linkNode = graphDb.createNode();
                    linkNode.setProperty(NodeConstants.URL, webURL.getURL());
                    final Relationship relationship = pageNode.createRelationshipTo(linkNode, RelationshipTypes.LINK_TO);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                    relationship.setProperty(NodeConstants.DESTINATION, webURL.getURL());
                }
                tx.success();
            } finally {
                tx.finish();
            }
        }
    }

    private static List<String> cleanAndSplitString(final String input) {
        if (input != null) {
            // the character classes were lost in extraction; punctuation/digits are assumed here
            final String[] dic = input.toLowerCase().replaceAll("\\p{Punct}", "").replaceAll("\\p{Digit}", "").split("\\s+");
            return Arrays.asList(dic);
        }
        return new ArrayList<>();
    }
}

After the data was collected, I could query it and perform the functions of a search engine. For this I decided to use Java futures, as it was another thing I had only read about and not yet implemented. In my day-to-day working environment we use WebLogic / CommonJ work managers within the application server to perform the same task.

final ExecutorService executorService = Executors.newFixedThreadPool(4);
final String[] searchTerms = { /* search terms elided in the original */ };
List<Callable<TaskResponse>> tasks = new ArrayList<>();
tasks.add(new WordFrequencyTask(searchTerms));
tasks.add(new DocumentLocationTask(searchTerms));
tasks.add(new PageRankTask(searchTerms));
tasks.add(new NeuralNetworkTask(searchTerms));
final List<Future<TaskResponse>> results = executorService.invokeAll(tasks);

I then created a task for each of the following: word frequency, document location, PageRank, and a neural network (with fake input / training data), each ranking the returned pages against the search criteria. All the code is in my public GitHub blog repo.
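To give a feel for what one of those tasks looks like, here is a minimal sketch of a word-frequency style Callable written in the same shape as the PageRankTask shown further down. It assumes the same SearchTask base class (providing graphDb, searchTerms and formatArray) and the TaskResponse holder; the Cypher query and the counting logic here are assumptions for illustration, not the repo's actual code.

package net.briandupreez.pci.chapter4.tasks;

import net.briandupreez.pci.chapter4.NormalizationFunctions;
import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical sketch; only the surrounding names (SearchTask, TaskResponse,
// NormalizationFunctions) come from the code shown in this post.
public class WordFrequencyTaskSketch extends SearchTask implements Callable<TaskResponse> {

    public WordFrequencyTaskSketch(final String... terms) {
        super(terms);
    }

    @Override
    protected ExecutionResult executeQuery(final String... words) {
        final ExecutionEngine engine = new ExecutionEngine(graphDb);
        // Count how many matching word nodes each page CONTAINS (assumed query).
        final String query = "START page=node(*) MATCH (page)-[:CONTAINS]->word "
                + "WHERE word.word in [" + formatArray(words) + "] "
                + "RETURN page.url AS url, count(word) AS freq";
        return engine.execute(query);
    }

    public TaskResponse call() {
        final ExecutionResult result = executeQuery(searchTerms);

        final Map<String, Double> frequencies = new HashMap<>();
        for (final Map<String, Object> row : result) {
            frequencies.put(row.get("url").toString(), ((Number) row.get("freq")).doubleValue());
        }

        final TaskResponse response = new TaskResponse();
        response.taskClazz = this.getClass();
        // Normalise so the scores from the different tasks are comparable.
        response.resultMap = NormalizationFunctions.normalizeMap(frequencies, true);
        return response;
    }
}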

Читайте также:  Php позиция последнего вхождения символа

Disclaimer: the neural network task either didn't have enough data to be effective, or I implemented the data normalisation incorrectly, so it is currently not very useful. I'll return to it once I have completed the journey through the whole PCI book.
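The NormalizationFunctions.normalizeMap call that the tasks rely on (see the PageRankTask below) is not shown in this post; a minimal sketch of what such a min-max normalisation could look like follows. The signature and the meaning of the boolean flag are assumptions, not the repo's code.

import java.util.HashMap;
import java.util.Map;

public final class NormalizationFunctionsSketch {

    /**
     * Scale all values into [0, 1]. If invert is true, the smallest raw value
     * ends up with the highest score. Hypothetical reconstruction; the real
     * NormalizationFunctions class may behave differently.
     */
    public static Map<String, Double> normalizeMap(final Map<String, Double> input, final boolean invert) {
        double min = Double.MAX_VALUE;
        double max = -Double.MAX_VALUE;
        for (final double value : input.values()) {
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
        final double range = (max - min) == 0 ? 1.0 : (max - min);

        final Map<String, Double> normalized = new HashMap<>();
        for (final Map.Entry<String, Double> entry : input.entrySet()) {
            final double scaled = (entry.getValue() - min) / range;
            normalized.put(entry.getKey(), invert ? 1.0 - scaled : scaled);
        }
        return normalized;
    }
}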

The one task worth sharing was the PageRank one. I quickly read some of the theory behind it, decided I am not that clever, and went searching for a library that had it implemented. I discovered GraphStream, a wonderful open-source project that does a whole lot more than just PageRank; check out their video.

From there it was simple to implement the PageRank task for this exercise.

package net.briandupreez.pci.chapter4.tasks;

import net.briandupreez.pci.chapter4.NodeConstants;
import net.briandupreez.pci.chapter4.NormalizationFunctions;
import org.graphstream.algorithm.PageRank;
import org.graphstream.graph.Graph;
import org.graphstream.graph.implementations.SingleGraph;
import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.Callable;

public class PageRankTask extends SearchTask implements Callable<TaskResponse> {

    public PageRankTask(final String... terms) {
        super(terms);
    }

    @Override
    protected ExecutionResult executeQuery(final String... words) {
        final ExecutionEngine engine = new ExecutionEngine(graphDb);
        final StringBuilder bob = new StringBuilder("START page=node(*) MATCH (page)-[:CONTAINS]->words ");
        bob.append(", (page)-[:LINK_TO]->related ");
        bob.append("WHERE words.word in [");
        bob.append(formatArray(words));
        bob.append("] ");
        bob.append("RETURN DISTINCT page, related");
        return engine.execute(bob.toString());
    }

    public TaskResponse call() {
        final ExecutionResult result = executeQuery(searchTerms);
        final Map<String, Double> returnMap = convertToUrlTotalWords(result);

        final TaskResponse response = new TaskResponse();
        response.taskClazz = this.getClass();
        response.resultMap = NormalizationFunctions.normalizeMap(returnMap, true);
        return response;
    }

    private Map<String, Double> convertToUrlTotalWords(final ExecutionResult result) {
        final Map<String, Double> uniqueUrls = new HashMap<>();
        final Graph g = new SingleGraph("rank", false, true);
        final Iterator<Node> pageIterator = result.columnAs("related");

        while (pageIterator.hasNext()) {
            final Node node = pageIterator.next();
            final Iterator<Relationship> relationshipIterator = node.getRelationships().iterator();
            while (relationshipIterator.hasNext()) {
                final Relationship relationship = relationshipIterator.next();
                final String source = relationship.getProperty(NodeConstants.SOURCE).toString();
                uniqueUrls.put(source, 0.0);
                final String destination = relationship.getProperty(NodeConstants.DESTINATION).toString();
                g.addEdge(String.valueOf(node.getId()), source, destination, true);
            }
        }
        computeAndSetPageRankScores(uniqueUrls, g);
        return uniqueUrls;
    }

    /**
     * Compute score
     *
     * @param uniqueUrls urls
     * @param graph      the graph of all links
     */
    private void computeAndSetPageRankScores(final Map<String, Double> uniqueUrls, final Graph graph) {
        final PageRank pr = new PageRank();
        pr.init(graph);
        pr.compute();
        for (final Map.Entry<String, Double> entry : uniqueUrls.entrySet()) {
            final double score = 100 * pr.getRank(graph.getNode(entry.getKey()));
            entry.setValue(score);
        }
    }
}

In between all of this I found a great implementation of sorting a map by its values on Stack Overflow.

package net.briandupreez.pci.chapter4;

import java.util.*;

public class MapUtil {

    /**
     * Sort a map based on values.
     * The values must be Comparable.
     *
     * @param map       the map to be sorted
     * @param ascending in ascending order, or descending if false
     * @param <K>       key generic
     * @param <V>       value generic
     * @return sorted list
     */
    public static <K, V extends Comparable<? super V>> List<Map.Entry<K, V>> entriesSortedByValues(
            final Map<K, V> map, final boolean ascending) {
        final List<Map.Entry<K, V>> sortedEntries = new ArrayList<>(map.entrySet());
        Collections.sort(sortedEntries, new Comparator<Map.Entry<K, V>>() {
            @Override
            public int compare(final Map.Entry<K, V> e1, final Map.Entry<K, V> e2) {
                if (ascending) {
                    return e1.getValue().compareTo(e2.getValue());
                } else {
                    return e2.getValue().compareTo(e1.getValue());
                }
            }
        });
        return sortedEntries;
    }
}

The Maven dependencies used to implement all of this:

<dependencies>
    <dependency>
        <groupId>com.google.guava</groupId>
        <artifactId>guava</artifactId>
        <version>14.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.encog</groupId>
        <artifactId>encog-core</artifactId>
        <version>3.2.0-SNAPSHOT</version>
    </dependency>
    <dependency>
        <groupId>edu.uci.ics</groupId>
        <artifactId>crawler4j</artifactId>
        <version>3.5</version>
        <type>jar</type>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>org.neo4j</groupId>
        <artifactId>neo4j</artifactId>
        <version>1.9</version>
    </dependency>
    <dependency>
        <groupId>org.graphstream</groupId>
        <artifactId>gs-algo</artifactId>
        <version>1.1.2</version>
    </dependency>
</dependencies>

Now on to chapter 5 of PCI… Optimisation.


adrianbrink/Java-Search-Engine

A simple HTML search engine implemented in Java.


A simple search engine implemented in Java. It lets the user specify an input file of parsed HTML and search for specific URLs. It also allows users to crawl websites up to a specific depth and then search for specific words, and it supports simple boolean operations.

AND for an and-search on two words; OR for an or-search on two words.
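As an illustration of what these boolean operations amount to, here is a small sketch that combines two sets of matching pages with AND/OR; the repository's actual classes and method names are not shown on this page, so everything below is an assumption.

import java.util.HashSet;
import java.util.Set;

public final class BooleanSearchSketch {

    // AND: pages that contain both words (set intersection).
    static Set<String> and(final Set<String> pagesWithFirst, final Set<String> pagesWithSecond) {
        final Set<String> result = new HashSet<>(pagesWithFirst);
        result.retainAll(pagesWithSecond);
        return result;
    }

    // OR: pages that contain either word (set union).
    static Set<String> or(final Set<String> pagesWithFirst, final Set<String> pagesWithSecond) {
        final Set<String> result = new HashSet<>(pagesWithFirst);
        result.addAll(pagesWithSecond);
        return result;
    }
}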

If no results are found, it suggests likely matches using the Levenshtein algorithm.
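For reference, a minimal sketch of the Levenshtein (edit distance) computation such a suggestion feature typically relies on; this is the textbook dynamic-programming version, not the repository's own code.

public final class LevenshteinSketch {

    /** Classic dynamic-programming edit distance between two strings. */
    static int distance(final String a, final String b) {
        final int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's prefix
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's prefix
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                final int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(
                        d[i - 1][j] + 1,          // deletion
                        d[i][j - 1] + 1),         // insertion
                        d[i - 1][j - 1] + cost);  // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}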



DemonZhdb/SearchEngine


SearchEngine: a local search engine

A project to build a local search engine that provides full-text search over the sites listed in a configuration file. The system consists of several controllers, services, and a repository connected to a PostgreSQL database.
The start page of the system shows statistics ("DASHBOARD") about the indexed sites and pages, as well as the lemmas (the dictionary base form of a word) found on those pages.


The system can perform both full indexing of all pages of the sites in the list and adding or re-indexing individually specified pages of those sites.


The search query can be a single word or a whole phrase. You can also choose where to search: on a specific site or across all sites.


The search returns a list of the most relevant pages containing the words from the query.

Technology stack

Java Core, Spring Boot, JPA, Hibernate, JDBC, Security, PostgreSQL, REST API, JSOUP, Maven, Git, Swagger.
It also uses the RussianMorphology lemmatization library and a stemmer (for finding the base form of a word).

To successfully download the GitHub-hosted dependencies and add them to the project, you need to configure Maven in the settings.xml file.

For the system to work, the framework information must be added to pom.xml:

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.6.4</version>
</parent>

as well as a link to the repository for downloading the lemmatizer dependencies:

<repositories>
    <repository>
        <id>github</id>
        <name>GitHub Apache Maven Packages - Russian Morphology</name>
        <url>https://maven.pkg.github.com/skillbox-java/russianmorphology</url>
    </repository>
</repositories>

You also need to declare the following Apache Maven dependencies:

<dependencies>
    <!-- groupIds and versions were not shown in the original snippet -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-security</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-thymeleaf</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
    </dependency>
</dependencies>

JSOUP must be included for page parsing to work.
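As an assumed illustration of the kind of parsing jsoup is used for here, a minimal fetch of a page and its outgoing links (the URL is a placeholder, not one from the project):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        // Fetch a page and pull out its title and outgoing links.
        Document doc = Jsoup.connect("https://example.com/").get();
        System.out.println(doc.title());
        Elements links = doc.select("a[href]");
        links.forEach(link -> System.out.println(link.attr("abs:href")));
    }
}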

To convert words into lemmas, you need the morph, morphology, dictionary-reader, english and russian dependencies from org.apache.lucene.morphology. You also need to create the settings.xml file (or edit it if it already exists; on Windows it is located in the C:/Users//.m2 directory) and specify a token there for fetching data from the public repository. Add the following lines to the file:

<settings>
    <servers>
        <server>
            <id>github</id>
            <configuration>
                <httpHeaders>
                    <property>
                        <name>Authorization</name>
                        <value>Bearer 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</value>
                    </property>
                </httpHeaders>
            </configuration>
        </server>
    </servers>
</settings>

The search engine's start page is available at http://localhost:8080/
On startup the system immediately asks for the login/password specified in the configuration file src/resources/application.yml:

security:
  user:
    name: user
    password: user
    roles: user

To let the SearchEngine system interact with third-party applications, the project includes OpenAPI documentation based on Swagger (OpenAPI 3.0): a specification describing all the methods for creating, using, visualising and testing the REST web services. The OpenAPI documentation page is available at http://localhost:8080/swagger-ui.html
To set up OpenAPI documentation, add the dependency from the springdoc repository:

<dependency>
    <groupId>org.springdoc</groupId>
    <artifactId>springdoc-openapi-ui</artifactId>
    <version>1.6.11</version>
</dependency>
