Gentoo Forums
Gentoo Forums
Gentoo Forums
Quick Search: in
Segentation fault with Eigen library and GMP rationals
View unanswered posts
View posts from last 24 hours

 
Reply to topic    Gentoo Forums Forum Index Other Things Gentoo
View previous topic :: View next topic  
Author Message
amandini
n00b
n00b


Joined: 01 Jul 2020
Posts: 28

PostPosted: Mon Nov 18, 2024 6:25 am    Post subject: Segentation fault with Eigen library and GMP rationals Reply with quote

Hello,

I am doing some work using the Eigen C++ template library, with large sparse matrices, say 200,000 x 200,000, over the mpq_class provided by the GNU Multiple Precision (GMP) library.

The first problem is that code which compiles without any warnings on the godbolt compiler, a small example of which is given at https://godbolt.org/z/j66jdK7n6, produces the following warnings when compiled on my gentoo system

Code:

#g++ -std=c++20 -O2 -march=native -lgmp -lgmpxx -Iboost -Wall -Wpedantic -Wextra example.cpp  -o example.o
In file included from /usr/include/eigen3/Eigen/SparseCore:40,
                 from example.cpp:6:
/usr/include/eigen3/Eigen/src/SparseCore/AmbiVector.h: In instantiation of ‘void Eigen::internal::AmbiVector<_Scalar, _StorageIndex>::reallocateSparse() [with _Scalar = __gmp_expr<__mpq_struct [1], __mpq_struct [1]>; _StorageIndex = int]’:
/usr/include/eigen3/Eigen/src/SparseCore/AmbiVector.h:238:11:   required from ‘_Scalar& Eigen::internal::AmbiVector<_Scalar, _StorageIndex>::coeffRef(Eigen::Index) [with _Scalar = __gmp_expr<__mpq_struct [1], __mpq_struct [1]>; _StorageIndex = int; Eigen::Index = long int]’
/usr/include/eigen3/Eigen/src/SparseCore/TriangularSolver.h:233:28:   required from ‘static void Eigen::internal::sparse_solve_triangular_sparse_selector<Lhs, Rhs, Mode, UpLo, 0>::run(const Lhs&, Rhs&) [with Lhs = const Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]> >; Rhs = Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]> >; int Mode = 6; int UpLo = 2]’
/usr/include/eigen3/Eigen/src/SparseCore/TriangularSolver.h:306:93:   required from ‘void Eigen::TriangularViewImpl<MatrixType, Mode, Eigen::Sparse>::solveInPlace(Eigen::SparseMatrixBase<OtherDerived>&) const [with OtherDerived = Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]> >; MatrixType = const Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]> >; unsigned int Mode = 6]’
example.cpp:80:61:   required from here
/usr/include/eigen3/Eigen/src/SparseCore/AmbiVector.h:97:18: warning: ‘void* memcpy(void*, const void*, size_t)’ writing to an object of type ‘Eigen::internal::AmbiVector<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int>::Scalar’ {aka ‘class __gmp_expr<__mpq_struct [1], __mpq_struct [1]>’} with no trivial copy-assignment; use copy-assignment or copy-initialization instead [-Wclass-memaccess]
   97 |       std::memcpy(newBuffer,  m_buffer,  copyElements * sizeof(ListEl));
      |       ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from example.cpp:4:
/usr/include/gmpxx.h:1763:7: note: ‘Eigen::internal::AmbiVector<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int>::Scalar’ {aka ‘class __gmp_expr<__mpq_struct [1], __mpq_struct [1]>’} declared here
 1763 | class __gmp_expr<mpq_t, mpq_t>
      |       ^~~~~~~~~~~~~~~~~~~~~~~~


Eigen developers informed me that they could not reproduce these warnings on their machines. Nor are the warnings reproducible on the compiler explorer https://godbolt.org/z/j66jdK7n6.

Second and more pressing issue is that the process of constructing large sparse matrices (175,175 x 175,175) over mpq_class in Eigen causes the following segmentation fault:
Code:
Thread 1 "main.o" received signal SIGSEGV, Segmentation fault.
0x00007ffff79d525e in free () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff79d525e in free () from /lib64/libc.so.6
#1  0x00007ffff7edee80 in __gmpq_clear () from /usr/lib64/libgmp.so.10
#2  0x000055555555ff4d in __gmp_expr<__mpq_struct [1], __mpq_struct [1]>::~__gmp_expr (this=0x7fffffffd510, __in_chrg=<optimized out>)
    at /usr/include/gmpxx.h:1855
#3  Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 1, int>::insertBackUncompressed (col=<optimized out>, row=<optimized out>, this=0x7fffffffd4c0) at /usr/include/eigen3/Eigen/src/SparseCore/SparseMatrix.h:904
#4  Eigen::internal::set_from_triplets<__gnu_cxx::__normal_iterator<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int> const*, std::vector<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int>, std::allocator<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int> > > >, Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int>, Eigen::internal::scalar_sum_op<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, __gmp_expr<__mpq_struct [1], __mpq_struct [1]> > > (begin=..., end=..., mat=..., dup_func=...) at /usr/include/eigen3/Eigen/src/SparseCore/SparseMatrix.h:1056
#5  0x000055555556cf22 in Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int>::setFromTriplets<__gnu_cxx::__normal_iterator<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int> const*, std::vector<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int>, std::allocator<Eigen::Triplet<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, int> > > > > (end=..., begin=..., this=0x7fffffffdb78) at /usr/include/eigen3/Eigen/src/Core/functors/BinaryFunctors.h:36
#6  p_basis<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int> >::basis_builder (this=this@entry=0x7fffffffdb38, src=..., seminormal_form=..., p=p@entry=3) at /home/user_name/path_to_file/p_basis.h:165
#7  0x0000555555570c69 in p_basis<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int > >::p_basis (this=0x7fffffffdb38, src=..., seminormal_form=..., p=3) at /home/user_name/path_to_file/p_basis.h:57
#8  0x00005555555718c5 in basis_data<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int> >::basis_data (p=3, src=..., this=0x7fffffffdaf8) at /home/user_name/path_to_file/basis_data.h:61
#9  mform<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, Eigen::SparseMatrix<__gmp_expr<__mpq_struct [1], __mpq_struct [1]>, 0, int> >::mform (this=0x7fffffffdaf0, src=..., path="/home/user_name/output.txt", p=3) at /home/user_name/path_to_file/mform.h:36
#10 0x0000555555559995 in main () at /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/basic_string.tcc:242
(gdb)


The code that produced the segmentation fault above, was
Code:
typedef Eigen::Triplet<mpq_class> qtriplet;
std::vector<qtriplet> triplets{};

and line 165 of the file p_basis.h referred to in the gdb backtrace aboveis
Code:
m_basis.setFromTriplets(std::cbegin(triplets), std::cend(triplets));

Would anyone at the Gentoo Science Project (Eigen) or the Gentoo Toolchain Project (GMP) be willing or able to help look into these issues?

Any other suggestions that may help get to the bottom of this issue would be gratefully received.

[edited for typos]


Last edited by amandini on Mon Nov 18, 2024 8:25 am; edited 1 time in total
Back to top
View user's profile Send private message
sam_
Developer
Developer


Joined: 14 Aug 2020
Posts: 1972

PostPosted: Mon Nov 18, 2024 6:37 am    Post subject: Reply with quote

I see the warning locally as well but not on godbolt. Not sure why yet.

What does -march=native expand to on your system? You can check by running resolve-march-native from app-misc/resolve-march-native. I do see it without but it may be useful to know nonetheless.

Could you give me a testcase which segfaults? Can look more then.
Back to top
View user's profile Send private message
amandini
n00b
n00b


Joined: 01 Jul 2020
Posts: 28

PostPosted: Mon Nov 18, 2024 7:05 am    Post subject: Reply with quote

sam_ wrote:
What does -march=native expand to on your system? You can check by running resolve-march-native from app-misc/resolve-march-native.

I see the warning locally as well but not on godbolt. Not sure why yet.

Could you give me a testcase which segfaults?


Thank you for your reply. I run these calculations on two workstations, on one -march=native resolves as
Code:
-march=tigerlake -mabm -mno-kl -mno-sgx -mno-widekl -mshstk --param=l1-cache-line-size=64 --param=l1-cache-size=48 --param=l2-cache-size=24576

and the other it resolves as
Code:
march=znver2 --param=l1-cache-line-size=64 --param=l1-cache-size=32 --param=l2-cache-size=512

The compiler warnings occur identically on both the znver2 and the tigerlake machine.

The example on godbolt was only contrived to illustrate the compiler warnings. As it is, it took many lines of code and over 500GB of ram on the AMD machine to produce the segmentation fault, so it may not be practical to share this specific example.

I have not tried to reproduce the segmentation fault on the intel machine which has less memory.

However, it may be possible to come up with a contrived and simpler example that reproduces the segmentation fault. Unfortunately may take a bit of time to get such a segmentation fault example working - my apologies in advance.

Should I go ahead and try to make such an example?


Last edited by amandini on Tue Nov 19, 2024 12:33 am; edited 1 time in total
Back to top
View user's profile Send private message
bstaletic
Guru
Guru


Joined: 05 Apr 2014
Posts: 374

PostPosted: Mon Nov 18, 2024 8:02 am    Post subject: Reply with quote

I'm not familiar with GMP, but that operator< looks very wrong - sometimes it tells you which of the two inputs is smaller and sometimes it does the opposite.

Code:

x   y | result 
---------------
0   0 | false
0   y | true 
x   0 | false
x  2x | false
2x  x | true 


The last two return true if x is greater than y.
The first three return true if y is greater than x.
Back to top
View user's profile Send private message
GDH-gentoo
Veteran
Veteran


Joined: 20 Jul 2019
Posts: 1709
Location: South America

PostPosted: Mon Nov 18, 2024 8:05 pm    Post subject: Reply with quote

I don't know why the Compiler Explorer doesn't output the warning, but it makes sense. The code in Eigen::internal::AmbiVector<>'s reallocateSparse() member function uses std::memcpy() to copy the value of an object whose type is the type denoted by the class template's _Scalar parameter. I believe that you can have undefined behaviour when using objects of nontrivially copyable types made like that... and GMP's mpq_class is definitely not trivially copyable, it's a class that has nontrivial default and copy constructors, a nontrivial copy assignment operator, and a nontrivial destructor.

I don't know in general if Eigen documents its expectations about the types that are passed as template arguments like this, examples seem to show only arithmetic types like double or at most trivially copyable classes like std::complex<float>. The segfault shown in the backtrace happens in the destructor of some mpq_class object, hard to say if it was properly constructed.
_________________
NeddySeagoon wrote:
I'm not a witch, I'm a retired electronics engineer :)
Ionen wrote:
As a packager I just don't want things to get messier with weird build systems and multiple toolchains requirements though :)
Back to top
View user's profile Send private message
amandini
n00b
n00b


Joined: 01 Jul 2020
Posts: 28

PostPosted: Mon Nov 18, 2024 10:41 pm    Post subject: Reply with quote

GDH-gentoo wrote:
I don't know why the Compiler Explorer doesn't output the warning, but it makes sense. The code in Eigen::internal::AmbiVector<>'s reallocateSparse() member function uses std::memcpy() to copy the value of an object whose type is the type denoted by the class template's _Scalar parameter. I believe that you can have undefined behaviour when using objects of nontrivially copyable types made like that... and GMP's mpq_class is definitely not trivially copyable, it's a class that has nontrivial default and copy constructors, a nontrivial copy assignment operator, and a nontrivial destructor.

I don't know in general if Eigen documents its expectations about the types that are passed as template arguments like this, examples seem to show only arithmetic types like double or at most trivially copyable classes like std::complex<float>. The segfault shown in the backtrace happens in the destructor of some mpq_class object, hard to say if it was properly constructed.


Thank you for your reply.

Apropos of the compiler warnings, an Eigen developer just posted the Discord channel that 'eigen should use the copy/move ctor instead of memcpy if the type isn't trivially copyable' and provided the following link:
https://gitlab.com/libeigen/eigen/-/blob/master/Eigen/src/SparseCore/AmbiVector.h?ref_type=heads#L91
It therefore seems that warnings may be an Eigen issue.

The segmentation fault remains a mystery - I had previously computed examples of up to dimensions 125,125 x 125,125 without encountering the problem. It has only given the segmentation fault when I moved to the next example which happens to have dimension 175,175 x 175,175 (irrelevant numerological observation: the dimensions in these two examples happen to be divisible by 1001 = 7 * 11 * 13).

Stripping out the extraneous parts of the code, I'll construct the matrix using the same mp_class elements stored in a CSV file the and see if the segmentation fault can be replicated that way.
Back to top
View user's profile Send private message
amandini
n00b
n00b


Joined: 01 Jul 2020
Posts: 28

PostPosted: Mon Nov 18, 2024 11:07 pm    Post subject: Reply with quote

bstaletic wrote:
I'm not familiar with GMP, but that operator< looks very wrong - sometimes it tells you which of the two inputs is smaller and sometimes it does the opposite.

Code:

x   y | result 
---------------
0   0 | false
0   y | true 
x   0 | false
x  2x | false
2x  x | true 


The last two return true if x is greater than y.
The first three return true if y is greater than x.


Thank you for your reply.

The code containing the operator came from Eigen https://eigen.tuxfamily.org/dox/TopicCustomizing_CustomScalar.html
Back to top
View user's profile Send private message
GDH-gentoo
Veteran
Veteran


Joined: 20 Jul 2019
Posts: 1709
Location: South America

PostPosted: Tue Nov 19, 2024 9:56 pm    Post subject: Reply with quote

amandini wrote:
The code that produced the segmentation fault above, was
Code:
typedef Eigen::Triplet<mpq_class> qtriplet;
std::vector<qtriplet> triplets{};

and line 165 of the file p_basis.h referred to in the gdb backtrace aboveis
Code:
m_basis.setFromTriplets(std::cbegin(triplets), std::cend(triplets));
amandini wrote:
It has only given the segmentation fault when I moved to the next example which happens to have dimension 175,175 x 175,175 [...].

So, m_basis here has type Eigen::SparseMatrix<mpq_class, 0, int>, right? Do you mean that when you get the segfault, m_basis.rows() and m_basis.cols() are both 175175? How many elements does triplets actually have when m_basis.setFromTriplets() is called and segfaults?
_________________
NeddySeagoon wrote:
I'm not a witch, I'm a retired electronics engineer :)
Ionen wrote:
As a packager I just don't want things to get messier with weird build systems and multiple toolchains requirements though :)
Back to top
View user's profile Send private message
amandini
n00b
n00b


Joined: 01 Jul 2020
Posts: 28

PostPosted: Wed Nov 20, 2024 7:25 am    Post subject: Reply with quote

GDH-gentoo wrote:
amandini wrote:
The code that produced the segmentation fault above, was
Code:
typedef Eigen::Triplet<mpq_class> qtriplet;
std::vector<qtriplet> triplets{};

and line 165 of the file p_basis.h referred to in the gdb backtrace aboveis
Code:
m_basis.setFromTriplets(std::cbegin(triplets), std::cend(triplets));
amandini wrote:
It has only given the segmentation fault when I moved to the next example which happens to have dimension 175,175 x 175,175 [...].

So, m_basis here has type Eigen::SparseMatrix<mpq_class, 0, int>, right? Do you mean that when you get the segfault, m_basis.rows() and m_basis.cols() are both 175175? How many elements does triplets actually have when m_basis.setFromTriplets() is called and segfaults?


Thank you for your reply.

Yes, the dimensions and type you have quoted are all correct.

There are 3,010,744,671 elements in triplets when I initialise the matrix at the line
Code:
m_basis.setFromTriplets(std::cbegin(triplets), std::cend(triplets));
Back to top
View user's profile Send private message
Hu
Administrator
Administrator


Joined: 06 Mar 2007
Posts: 22694

PostPosted: Wed Nov 20, 2024 12:55 pm    Post subject: Reply with quote

The warning about memcpy is because classes with non-trivial copy constructors often (but not always) have invariants that the copy constructor will maintain, but that a straight memcpy will not. A simple example of that would be std::vector: the copy constructor is nontrivial because the copy gets its own distinct storage, with its own copy of the contained objects. Using memcpy to copy a vector would result in two vector instances that point to the same storage, and both will try to free that storage when destroyed. That free is undefined behavior and may (or may not) cause a crash due to heap corruption.

I would argue that until you can patch Eigen to build cleanly with -Werror=class-memaccess, it is not worth your time investigating the crash further. Eliminating the undefined behavior of copying the non-trivial class is worthwhile, and may (or may not) fix your crash as a side effect.
Back to top
View user's profile Send private message
GDH-gentoo
Veteran
Veteran


Joined: 20 Jul 2019
Posts: 1709
Location: South America

PostPosted: Wed Nov 20, 2024 12:55 pm    Post subject: Reply with quote

amandini wrote:
There are 3,010,744,671 elements in triplets when I initialise the matrix at the line
Code:
m_basis.setFromTriplets(std::cbegin(triplets), std::cend(triplets));

Since what happens looks like a memory management problem, I'm not sure that the values stored in mpq_class objects matters, so the first thing I would do to try to reproduce the segmentation fault is write a small program that creates a 175,175 x 175,175 SparseMatrix<mpq_class, 0, int> object, a std::vector<Triplet<mpq_class>> object with 3,010,744,671 very simple to compute elements —but without repeating (row, column) pairs, since that executes an extra code path—, and calls setFromTriplets(). I won't have access to my Gentoo computer for several hours, but I might post one if you don't beat me to it in the meantime.
_________________
NeddySeagoon wrote:
I'm not a witch, I'm a retired electronics engineer :)
Ionen wrote:
As a packager I just don't want things to get messier with weird build systems and multiple toolchains requirements though :)
Back to top
View user's profile Send private message
GDH-gentoo
Veteran
Veteran


Joined: 20 Jul 2019
Posts: 1709
Location: South America

PostPosted: Wed Nov 20, 2024 11:53 pm    Post subject: Reply with quote

GDH-gentoo wrote:
[...] the first thing I would do to try to reproduce the segmentation fault is write a small program that creates a 175,175 x 175,175 SparseMatrix<mpq_class, 0, int> object, a std::vector<Triplet<mpq_class>> object with 3,010,744,671 very simple to compute elements —but without repeating (row, column) pairs, since that executes an extra code path—, and calls setFromTriplets()

Here it is, based on the example in section "Filling a sparse matrix" (semitested because of memory constraints and not wanting to install Eigen) :

Code:
#include <Eigen/SparseCore>
#include <gmpxx.h>
#include <iostream>
#include <vector>

namespace Eigen {

template<> struct NumTraits<mpq_class>: GenericNumTraits<mpq_class> {
        typedef mpq_class Real;
        typedef mpq_class NonInteger;
        typedef mpq_class Nested;

        static inline Real epsilon() { return 0; }
        static inline Real dummy_precision() { return 0; }
        static inline int digits10() { return 0; }

        enum {
                IsInteger = 0,
                IsSigned = 1,
                IsComplex = 0,
                RequireInitialization = 1,
                ReadCost = 6,
                AddCost = 150,
                MulCost = 100
        };
};
}

int main() {
        using namespace Eigen;
        using qtriplet = Triplet<mpq_class>;

        std::vector<qtriplet> t;
        t.reserve(3010744671UL);

        const Index nrows = 175175UL;
        unsigned long k = 0;
        for (Index i = 0; i < nrows; i++)
                for (Index j = 0; j < 17187UL; j++) t.push_back(qtriplet(i, j, mpq_class(k++)));
        for (Index i = 0; i < 11946UL; i++) t.push_back(qtriplet(i, 17188UL, mpq_class(k++)));
        std::cout << "Created vector of " << k << " qtriplets\nTrying setFromTriplets()\n";

        SparseMatrix<mpq_class> m(nrows, nrows);
        m.setFromTriplets(t.cbegin(), t.cend());
        std::cout << "I didn't explode\n";

        return 0;
}

Does this segfault?
_________________
NeddySeagoon wrote:
I'm not a witch, I'm a retired electronics engineer :)
Ionen wrote:
As a packager I just don't want things to get messier with weird build systems and multiple toolchains requirements though :)
Back to top
View user's profile Send private message
amandini
n00b
n00b


Joined: 01 Jul 2020
Posts: 28

PostPosted: Thu Nov 21, 2024 9:56 am    Post subject: Reply with quote

Hu wrote:
The warning about memcpy is because classes with non-trivial copy constructors often (but not always) have invariants that the copy constructor will maintain, but that a straight memcpy will not. A simple example of that would be std::vector: the copy constructor is nontrivial because the copy gets its own distinct storage, with its own copy of the contained objects. Using memcpy to copy a vector would result in two vector instances that point to the same storage, and both will try to free that storage when destroyed. That free is undefined behavior and may (or may not) cause a crash due to heap corruption.

I would argue that until you can patch Eigen to build cleanly with -Werror=class-memaccess, it is not worth your time investigating the crash further. Eliminating the undefined behavior of copying the non-trivial class is worthwhile, and may (or may not) fix your crash as a side effect.


Thank you for the replies.

I agree that the non-trivial copy constructor issue in Eigen needs to be solved as it is possibly the root cause of the fault. The issue has now been flagged by the developers: https://gitlab.com/libeigen/eigen/-/issues/2873.

In the meantime, thinking about the question that GDH-gentoo asked about the number of non-zero elements in the matrix, I realised that because there is a equivalence relation on the entries that is used to set many of them to zero, it would be more efficient to delete these entries before rather than after the matrix is initialised.

With this change to the program, the matrix for the same example contains 1,478,894 elements and the segmentation fault no longer occurs for this particular example.

I should have done the calculation this way in the first place.
Back to top
View user's profile Send private message
amandini
n00b
n00b


Joined: 01 Jul 2020
Posts: 28

PostPosted: Thu Nov 21, 2024 10:00 am    Post subject: Reply with quote

GDH-gentoo wrote:

Here it is, based on the example in section "Filling a sparse matrix" (semitested because of memory constraints and not wanting to install Eigen) :

Does this segfault?


Thank you for writing this code.

Being a bit swamped with other things right now, I won't be able to do any more work on this problem until Saturday.

Will be sure to let you know the result then.
Back to top
View user's profile Send private message
Display posts from previous:   
Reply to topic    Gentoo Forums Forum Index Other Things Gentoo All times are GMT
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum